Calculating and plotting mapped reads - a simple R/shell script
As this is my first post I’ll start with something very simple I did recently. The problem is simple: “how many reads did does my deep-seq experiment have and how many were mapped to the genome?”. I am now dealing with several RNA-seq and ChIP-seq experiments and having a pipeline to get this info fast for several files is important to me.
There are several suggestion out there on how to it, but I wanted something that could be used routinely for several files per experiment, and that could output the results in a visual manner (and also a table).
Using SAMtools as suggested here (thanks!) I’ve started with a simple bash script that would output the mapped reads into a text file and possible do the maths:
Unfortunately arithmetic operations are not something advisable to do in shell - in my particular case because I am using a remote server without a necessary package. That minor setback led to think about doing in R: It can send commands to the shell; it is certainly built for mathematical/statistical operations and it has ggplot!
So this how the code looks like this:
To execute all you have to is to go to the folder that contains the .bam files and excute from the shell:
This is how the plot looks like:
This of course very simple, which also means very easy to change for ones specific needs, but serves the purpose of having quick visual information on how data mapped and perhaps present in group meetings. That said, never underestimate the power of visual information and one of the many beauties of ggplot is that the plots can be customized to exhaustion (I’ll update this plot soonish to include the % mapping: UPDATED).
This simple plot contains a lot of useful information: (i) How many reads we collected; (ii) how many were mapped (the total numbers are visible in the Y-axis); (iii) visual information on the proportion of mapped reads; and just in case, (iv) also the actual proportion of mapped and unmapped reads. And this for 12 mapped files!