Moving to the Dark Side

Leaving the Pipette for a Keyboard.

Function to find maximal coverage in multiple bigwigs II

[This is an updated version of this post with improved functions and a reproducible example]

I really like the package Gviz to prepare figures for presentations and publications (I have used in B plus some tidying up in inskape). It is a fantastic visualization package, but the time and effort that it takes to get the figures just right is a little too much for my daily. An example of this is when plotting coverage tracks; by default axis of each panel are scaled independently which makes visualization tricky.

Let’s fist load the packages. derfinder contains some example datasets (chosen almost randomly). You can find the vignette here.

library('Gviz')
library('rtracklayer')
library('derfinderData')
files <- dir(system.file('extdata', 'A1C', package = 'derfinderData'),  full.names = TRUE)
names(files) <- gsub('\\.bw', '', dir(system.file('extdata', 'AMY', package = 'derfinderData')))
head(files)
##                                                                                 HSB113 
## "/home/adomingu/R/x86_64-pc-linux-gnu-library/3.3/derfinderData/extdata/A1C/HSB103.bw" 
##                                                                                 HSB123 
## "/home/adomingu/R/x86_64-pc-linux-gnu-library/3.3/derfinderData/extdata/A1C/HSB114.bw" 
##                                                                                 HSB126 
## "/home/adomingu/R/x86_64-pc-linux-gnu-library/3.3/derfinderData/extdata/A1C/HSB123.bw" 
##                                                                                 HSB130 
## "/home/adomingu/R/x86_64-pc-linux-gnu-library/3.3/derfinderData/extdata/A1C/HSB126.bw" 
##                                                                                 HSB135 
## "/home/adomingu/R/x86_64-pc-linux-gnu-library/3.3/derfinderData/extdata/A1C/HSB130.bw" 
##                                                                                 HSB136 
## "/home/adomingu/R/x86_64-pc-linux-gnu-library/3.3/derfinderData/extdata/A1C/HSB135.bw"
bw_file1 <- files[1]
bw_file2 <- files[3]
bw1 <- import(files[1])
bw2 <- import(files[2])

Let’s create some genomic regions to look at:

gr1 <- GRanges(
  seqnames=c("chr21", "chr21"),
  ranges=IRanges(c(48055507, 17059283), c(48085155, 17070283)),
  strand=c("+", "-")
  )
gr1
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames               ranges strand
##          <Rle>            <IRanges>  <Rle>
##   [1]    chr21 [48055507, 48085155]      +
##   [2]    chr21 [17059283, 17070283]      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

One of these is the gene PRMT2, and the other is some random region without an annotated gene. We will see it’s usefulness later.

And does the plot look like using the defaults? Let’s have a look:

gen = "hg19"
gr1a <- as.data.frame(gr1)
start=gr1a[1,2]
end=gr1a[1,3]

gTrack <- BiomartGeneRegionTrack(
  genome = gen,
  chromosome = "chr21",
  start = start,
  end = end,
  name = "ENSEMBL")

dTrack1 <- DataTrack(
  bw_file1,
  type = "l",
  name="track1")
dTrack2 <- DataTrack(
  bw_file2,
  type = "l",
  name="track2")

# call the display function plotTracks
track.list=list(dTrack1, dTrack2, gTrack)
plotTracks(track.list,from=start,to=end,chromsome="chr21")

plot of chunk plotDefault

Two things become clear, though not immediately:

  • the coverage is lower in track2, but since these tracks do not share the same scale a simple glance would convey the wrong message.
  • Due to the internals of Gviz, the y-axis of both tracks have different front size. This is because they are scaled to the y-max. This could be fixed, but makes the figure a bit clunky for presentations.

The way I am solving both these issues is to determine the maximal coverage for both tracks, and then adjust the y-lim of both to this value. Initially I was doing this manually, using a Kent tool in the terminal, and using the value in R. This is of course time consuming and not adequate if one starts using Gviz on a regular basis, mixing and matching tracks in the plots.

Because I had a project that required me to plot the coverage over an entire chomosome rather than a gene, my first programatic solution was to calculate the top coverage for a given chromosome. The basic function maxCovBw will take an imported BigWig and do this. The maxCovFiles function is a wrapper that will do the same but for a list of files.

## calculate the max values
 maxCovBw <- function(gr, myChr) {
    max_cov <- max(gr[seqnames(gr) %in% myChr,]$score)
    return(max_cov)
 }

 maxCovFiles <- function(bws, chrs){
    for(i in seq_along(chrs)){
       myChr = chrs[i]
       print(myChr)
       max_coverage <- c()
       max_cov <- round(
          max(
             sapply(bws, maxCovBw, myChr=myChr)
          )
       , 0)
       max_coverage[myChr] <- max_cov 
    }
    return(data.frame(max_coverage))
 }

max_cov <- maxCovFiles(list(bw1, bw2), seqnames(seqinfo(bw1)))
## [1] "chr21"

We can see that there is only one chromosome in these files:

seqnames(seqinfo(bw1))
## [1] "chr21"
seqnames(seqinfo(bw2))
## [1] "chr21"

which we can now use to call the function:

max_cov <- maxCovFiles(list(bw1, bw2), "chr21")
## [1] "chr21"
max_cov
##       max_coverage
## chr21         2250

As I mentioned, this is useful for those situations when I am plotting a full chromosome, but in most situations what is being plotted is a narrow region, for instance, at a gene. Also a chromosome could be represented as a region as c(0, seqlengths(seqinfo(bw1)). So let’s modify the function:

## calculate the max values
maxCovBw <- function(bw, gr) {
  ovlp <- subsetByOverlaps(bw, gr)
  if (length(ovlp) > 0) {
    print('not empty')
    max_cov <- max(ovlp$score)
  } else {
    print('WARNING: The selected genomic region has no coverage value in the BigWig')
    print('WARNING: Coverage value is arbitrary set to Zero.')
    max_cov <- 0 
  }
  print(max_cov)
  return(max_cov)
}

maxCovFiles <- function(bws, gr){
  # bws <- lapply(bws, rtracklayer:::import)
  max_cov <- c()
  for(i in 1:length(gr)){
    my_feat = gr[i, ]
    max_cov[i] <- round(
      max(
        sapply(bws, maxCovBw, gr=my_feat)
      )
      , 2)
  }
  values(gr) <- max_cov 
  return(gr)
}


gr2 <- maxCovFiles(list(bw1, bw2), gr1)
## [1] "not empty"
## [1] 19.84
## [1] "not empty"
## [1] 10.06
## [1] "WARNING: The selected genomic region has no coverage value in the BigWig"
## [1] "WARNING: Coverage value is arbitrary set to Zero."
## [1] 0
## [1] "WARNING: The selected genomic region has no coverage value in the BigWig"
## [1] "WARNING: Coverage value is arbitrary set to Zero."
## [1] 0
gr2
## GRanges object with 2 ranges and 1 metadata column:
##       seqnames               ranges strand |         X
##          <Rle>            <IRanges>  <Rle> | <numeric>
##   [1]    chr21 [48055507, 48085155]      + |     19.84
##   [2]    chr21 [17059283, 17070283]      - |         0
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

As you can see we now have the top value (which is pretty low) for each of our regions of interest. The coordinates selected contain an edge case for which there is no overlap. I decided to set it to zero to avoid breaking things. How does this look like in a plot then?

gr2 <- as.data.frame(gr2)
start=gr2[1,2]
end=gr2[1,3]

gTrack <- BiomartGeneRegionTrack(
  genome = gen,
  chromosome = "chr21",
  start = start,
  end = end,
  name = "ENSEMBL")

max_cov=gr2[1,6]

dTrack1 <- DataTrack(
  bw1,
  type = "l",
  ylim=c(0, max_cov),
  name="track1")
dTrack2 <- DataTrack(
  bw2,
  type = "l",
  ylim=c(0, max_cov),
  name="track2")

# call the display function plotTracks
track.list=list(dTrack1, dTrack2, gTrack)
plotTracks(track.list,from=start,to=end,chromsome="chr21")

plot of chunk plotYlim

This simple change, ylim=c(0, max_cov), makes massive visual difference. A quick look at the plot and it will immediately be clear that track1 has higher coverage, and the visuals are now more consistent. This is something that could go into a manuscript draft without too many changes.

Of course both functions could do with some more input testing and cleaning up, but they are working and a good starting point for a quick plot. One thing that is missing and would be useful is the ability use bam/sam as an input as BigWigs would become limiting in certain situations. Something to improve once the need arises.