Question

coverage over specific GenomicRanges

0

Entering edit mode

Jake ▴ 90

@jake-7236

Last seen 2.6 years ago

United States

Hi,

I have RNA-seq data in a bam file and I have two GRangesLists: one with exons by gene for a subset of genes and one with introns by gene for a subset of genes. I'm trying to calculate the basepair coverage over exons per gene and introns per gene. For any given gene, I'd like to be able to plot basepair coverage in terms of relative exon position (e.g. start of first exon is 1 and if there are 2 100 bp exons, the last number of the x-axis will be 200). I played around a bit with the coverage() function and views based on my two GRangesLists, but it didn't seem like the best way to do this or I was doing something wrong. I was just wondering if anyone has suggestions for how to do this?

Thanks

genomicranges • 4.8k views

ADD COMMENT • link updated 9.8 years ago by Hervé Pagès 16k • written 9.8 years ago by Jake ▴ 90

score 1 · Answer 1 · 2015-07-08

1

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 17 hours ago

Seattle, WA, United States

Hi,

To compute this relative coverage you could do something like this. Let's say cvg is the full reference coverage e.g. it was obtained with:

library(GenomicAlignments)
cvg <- coverage(readGAlignments("path/to/bam/file"))

and grl is a GRangesList object containing exons (or introns) grouped by transcript or gene.

Then the following code:

rangeCoverage <- function(gr, cvg)
{
    gr <- reduce(gr)
    unlist(revElements(cvg[gr], strand(gr) == "-"), use.names=FALSE)
}
RleList(lapply(grl, rangeCoverage, cvg))

will return an RleList object (like cvg) parallel to grl i.e. each list element in it will represent the coverage of the exons (or introns) in the corresponding list element in grl.

Note that this code won't be very efficient if you have thousands of genes or transcripts in grl but should be fast enough if you have only a few hundreds.

Cheers,

H.

ADD COMMENT • link 9.8 years ago Hervé Pagès 16k

1

Entering edit mode

A different way to calculate coverage is

library(GenomicAlignments)
cvg <- coverage(BamFile("path/to/bam/file.bam"))

which will attempt to manage memory (readGAlignments() and add to coverage in chunks) rather than having to read the alignments into memory all at once. The actual code implementing coverage,BamFile-method is available with selectMethod("coverage", "BamFile").

ADD REPLY • link 9.8 years ago Martin Morgan 25k

0

Entering edit mode

Thanks. I think that this will do exactly what I want. I played with a few other methods, but it seems like letting this run for awhile for many genes is still better than some other methods. I just had a couple clarifying questions. I haven't encountered the revElements function before, my understanding from looking it up is revElements(x, i) reverses the elements x when i is true, correct? I see that you're doing this so that for genes on the "-" strand, coverage is still read 5' -> 3'. However, my understanding is that right now it reverses each element, but when you unlist it to get the coverage over the gene it is still showing the last exon first and the first exon last so the coverage profile is actually getting messed up. Is this correct or am I misunderstanding something?

ADD REPLY • link 9.8 years ago Jake ▴ 90

0

Entering edit mode

Correct. The revElements() step was an attempt at keeping the coverage oriented from 5' to 3'. But you're right that it is messing up the final result. The following version of rangeCoverage() will hopefully do a better job:

rangeCoverage <- function(gr, cvg)
{
    gr <- reduce(gr)
    ans <- unlist(cvg[gr], use.names=FALSE)
    if (runValue(strand(gr))[[1L]] == "-")
        ans <- rev(ans)
    ans
}

This assumes that all the exons (or introns) in a given gene or transcript are on the same chromosome and strand. For genes or transcripts with trans-splicing events, the above code won't work properly.

H.

ADD REPLY • link 9.7 years ago Hervé Pagès 16k

0

Entering edit mode

I have been reading into this more, would it not be better to just calculate coverage over a specific region by passing that region to the ScanBamParam in coverage()?

coverage(BamFile("path/to/bam/file.bam"), param=ScanBamParam(which=exons))

ADD REPLY • link 9.7 years ago Jake ▴ 90

0

Entering edit mode

The problem is that coverage() is a little imperfect. Here's some set-up

library(Rsamtools)
library(GenomicAlignments)
bfl <- BamFile(system.file(package="Rsamtools", "extdata", "ex1.bam"))
param1 <- ScanBamParam(which=GRanges("seq1", IRanges(10, width=1)))
param2 <- ScanBamParam(which=GRanges("seq1", IRanges(c(10, 10), width=1)))

You can see that with param1 it reports coverage of reads that overlap the region, and it reports the coverage of the full read rather than just the region requested.

> coverage(bfl, param=param1)[["seq1"]]
integer-Rle of length 1575 with 10 runs
  Lengths:    2    2    1    3   28    1    2    2    2 1532
  Values :    1    2    3    4    5    4    3    2    1    0

And here you can see with param2 the ranges are interpreted independently, so overlapping ranges result in double-counting reads.

> coverage(bfl, param=param2)[["seq1"]]
integer-Rle of length 1575 with 10 runs
  Lengths:    2    2    1    3   28    1    2    2    2 1532
  Values :    2    4    6    8   10    8    6    4    2    0

If your use case is to query for a single transcript, say, or if for some reason you are 100% sure that reads do not span ranges, then you could ask for coverage of the range() of the transcript, and then take views on that; this would be much more efficient than reading in the entire file, but usually we actually want coverage across many ranges.

I'll mention Rsamtools::pileup() (very nicely implemented by Nathaniel Hayden) as another alternative, e.g.,

> pparam <- PileupParam(distinguish_nucleotides=FALSE,
+                       distinguish_strands=FALSE)
> pileup(bfl, scanBamParam=param1, pileupParam=pparam)
  seqnames pos count which_label
1     seq1  10     5  seq1:10-10

This suffers from the same problem of double-counting on overlapping ranges, but the ranges could be reduce()ed before calling pileup.

ADD REPLY • link 9.7 years ago Martin Morgan 25k

score 0 · Answer 2 · 2015-07-08

I would use ggbio or Gviz. The Gviz users guide is super comprehensive, so you can just read that to get an idea. ggbio, not so much.

For ggbio, you want to make a 'tracks' object. Let's say you are interested in Entrez Gene ID 620393 (Fam150a), because who isn't?

library(TxDb.Mmusculus.UCSC.mm10.knownGene)

gr <- GRanges("chr1", IRanges(6359000, 6395000))

bfls <- dir(".", "bam$")

t1 <- autoplot(TxDb.Mmusculus.UCSC.mm10.knownGene, which = gr)

t2 <- lapply(bfls, autoplot, which = gr)

names(t2) <- bfls

t2 <- tracks(t2)

t1 <- tracks(t1)

c(t1,t2)

You can futz around with things to get the names of the files printed horizontally, etc, but you get the general drift, no?