It has been a long time since I wrote this post, and in between a wonderfull set of tools for coverage calculation and visualization has been published: deepTools does metagene analysis, and much, much more in a better than I could possibly do with my näive scripts.
Try it. It is worth it.
#######
There a few analysis tasks for which one would think a solution, or an easy to follow example, is readily available in the web (it has been a long time since the Web became the Internet). One such example is how to do coverage analysis of ChIP-seq, or any other data for that matter. That is, how are the reads distributed around a region of interest, commonly the TSS?
HTSeq, the python library for NGS data analysis offers that, but I prefer to do my plots in R and ggplot. I also find bioconductor nice but some of the objects and the packages literature hard to understand for someone looking for imminently practical information. So I decided to ask around in my institute and create my own script. I like to do scripts that can be easily modified and also that generate plots that can be used with little or now change for presentations.
Create the bed file with TSSs:
Firstly, how to create the BED file with the TSSs using a combination of bioconductor and bedtools - it is quite fast and it very-well documented. This only needs to be ran once and there is a myriad of ways to do it. Of course one could also use any bed file with features of interest.
Counting the number of reads per nucleotide
Because I run over several bam files and I like to keep the file names tidy I’ve wrapped the bedtools function in a shell function so that the several analysis can ran in parallel to save time. Because this step takes a of time, it sends me an email at the end. This way I can do something else without being constantly looking at it. Hint: running it in a screen session also saves a lot of headaches. If you are doing for a single bam file, or are not versed in shell scripting all you need is to run this:
But in all likelihood you will have several bam files:
At the end of this we will have a file that contains something like:
Col1-6 are the original entries of the bed file containing the TSSs, followed by the position of the base in the feature (remember this is a bp-resolution coverage) and the number of reads that mapped to the base.
Now we are ready calculate some statistics on own many reads map to each base surrounding the TSSs and plot these. In the next snippet I’ll calculate the sum of reads per base but calculating the average requires a minor change to the script. Since several of the operations take some time, I use the package doMC save time but this should work without it.
Al there is lacking now is putting all of this in a nice wrapper, which could be easily done, but I preferred to show the different stages of the process separated because it is easier to explain. I hope it helps.