Hi,
I have RNA-seq data in a bam file and I have two GRangesLists: one with exons by gene for a subset of genes and one with introns by gene for a subset of genes. I'm trying to calculate the basepair coverage over exons per gene and introns per gene. For any given gene, I'd like to be able to plot basepair coverage in terms of relative exon position (e.g. start of first exon is 1 and if there are 2 100 bp exons, the last number of the x-axis will be 200). I played around a bit with the coverage() function and views based on my two GRangesLists, but it didn't seem like the best way to do this or I was doing something wrong. I was just wondering if anyone has suggestions for how to do this?
Thanks
A different way to calculate coverage is
which will attempt to manage memory (readGAlignments() and add to coverage in chunks) rather than having to read the alignments into memory all at once. The actual code implementing
coverage,BamFile-method
is available withselectMethod("coverage", "BamFile")
.Thanks. I think that this will do exactly what I want. I played with a few other methods, but it seems like letting this run for awhile for many genes is still better than some other methods. I just had a couple clarifying questions. I haven't encountered the revElements function before, my understanding from looking it up is revElements(x, i) reverses the elements x when i is true, correct? I see that you're doing this so that for genes on the "-" strand, coverage is still read 5' -> 3'. However, my understanding is that right now it reverses each element, but when you unlist it to get the coverage over the gene it is still showing the last exon first and the first exon last so the coverage profile is actually getting messed up. Is this correct or am I misunderstanding something?
Correct. The
revElements()
step was an attempt at keeping the coverage oriented from 5' to 3'. But you're right that it is messing up the final result. The following version ofrangeCoverage()
will hopefully do a better job:This assumes that all the exons (or introns) in a given gene or transcript are on the same chromosome and strand. For genes or transcripts with trans-splicing events, the above code won't work properly.
H.
I have been reading into this more, would it not be better to just calculate coverage over a specific region by passing that region to the ScanBamParam in coverage()?
The problem is that coverage() is a little imperfect. Here's some set-up
You can see that with
param1
it reports coverage of reads that overlap the region, and it reports the coverage of the full read rather than just the region requested.And here you can see with
param2
the ranges are interpreted independently, so overlapping ranges result in double-counting reads.If your use case is to query for a single transcript, say, or if for some reason you are 100% sure that reads do not span ranges, then you could ask for coverage of the
range()
of the transcript, and then take views on that; this would be much more efficient than reading in the entire file, but usually we actually want coverage across many ranges.I'll mention
Rsamtools::pileup()
(very nicely implemented by Nathaniel Hayden) as another alternative, e.g.,This suffers from the same problem of double-counting on overlapping ranges, but the ranges could be
reduce()
ed before calling pileup.