I tried to use coverageByTranscript function from GenomicAlignments package to calculate the RNA-seq coverage along transcripts. This function requires the exons be sorted in ascending rank (i.e. the first Exxon in the transcript should be listed in the GRanges object first, and so forth, see reference below
https://www.rdocumentation.org/packages/GenomicFeatures/versions/1.24.4/topics/coverageByTranscript).
My understanding is that if a gene is in the reference strand, exons should be sorted in ascending order genomic coordinate. If on the minus strand, the first exon in the transcript, but with highest genomic coordinate should be listed first. The documentation above said that "If transcripts was obtained with exonsBy, then the exons are guaranteed to be ordered by ascending rank. See ?exonsBy for more information". This statement appears to be incorrect. I have a gif file named genes.gtf. I used the following command to build the Granges list for all genes as follows:
txdb <- makeTxDbFromGFF(gtf_file)
all_genes <- exonsBy(txdb, by="gene")
However the returned GRanges list does not order exons in the ascending rank for minus-strand genes. For example:
>all_genes[[3]]
GRanges object with 16 ranges and 2 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr10 52559169-52566640 - | 126742 <NA>
[2] chr10 52569654-52569802 - | 126743 <NA>
[3] chr10 52570800-52570936 - | 126744 <NA>
[4] chr10 52573617-52573798 - | 126745 <NA>
This does cause problems for coverageByTranscript in calculating the coverage scores. It's messed up. Is my understanding about exonsBy function incorrect or the documentation of coverageByTranscript is outdate or incorrect? Thanks.
Thanks so much! I indeed ignored the fine details under exonsBy. Is there a quick way to sort exon in ascending rank if they are on minus strand? It takes extremely long time if I do the following:
However if I try endoapply, its faster than above loop, but not very fast. Is there any easy more efficient way to do this? Thanks again!
I'm confused as to why you need to do this. You cannot use the GRangesList object returned by
exonsBy(txdb, by="gene")
as input forcoverageByTranscript()
. That's because the exons in this object are grouped by gene and not by transcript, which means that exons for all the transcripts in a gene are mixed up together. There is no re-ordering that will fix that. Instead you must use a GRangesList object where the exons are grouped by transcript and ordered by their rank in the transcript, which is exactly whatexonsBy(txdb, by="tx")
will give you.H.
I am not to distinguish between different transcripts as it's just not possible to deconvolve them based on reads count. The reads count from different splicing forms will be just piled up, and repeatedly counted on the shared axons if we do the coverage for transcripts. Instead I am listing all exons from the same gene, and then use pcoverageByTranscript to calculate the coverage on every exons of the same gene. This does work out as I need so far.
If you only want the coverage for each exon, no need to order anything. Also it no longer matters how the exons are grouped (by gene or by transcript or by any other criteria):
cvg_by_ex
is an RleList object parallel toall_exons
i.e. with one Rle vector per exon inall_exons
. The names oncvg_by_ex
are the exon ids. Subsetting the object to show only exons with a non-zero coverage:You can map some summarization of this back to the corresponding gene with something like:
IMPORTANT NOTE: The relist operation above is very fast and works only because
ex_mean_cvg
is parallel tounlist(ex_by_gene)
!ex_mean_cvg_by_gene
is a NumericList object with one list element per gene. Each list element is a named numeric vector. The names on each numeric vector are the exon ids for that gene. The values are the mean coverage for the exons in that gene:Hope this helps,
H.
That makes sense. I will give it a try. Thanks again for great help!
But there is one potential issue: if a read partially overlaps with an exons, that will be counted in the coverage score? I need to judge whether a read or a pair of reads are compatible with the features (exons for the same gene) using something like IntersectionStrict.
Of course reads that partially overlap with an exon will be counted in the coverage score. The "1. A SIMPLE ARTIFICIAL EXAMPLE WITH ONLY ONE TRANSCRIPT" section in the examples of the
?coverageByTranscript
man page emphasizes this behavior ofcoverageByTranscript()
by walking you thru an example that shows exactly that. In particular it says:If that's not what you want and you'd prefer something that behaves like IntersectionStrict instead, then you can do something like this:
Keep reads that fall within the exons of a single gene. We'll call them "good reads":
Group the "good reads" by exon. Note that a "good read" is allowed to fall within 2 or more exons as long as these exons belong to the same gene:
reads_by_exon
is a GAlignmentsList object that contains reads grouped by exon. More precisely it is parallel toall_exons
i.e. it has one list element per exon inall_exons
. Each list element is a GAlignments object containing the reads that fall within that exon.Proceed as previously except that now you need to use
pcoverageByTranscript()
instead ofcoverageByTranscript()
:etc...
At this point we are pretty far from the topic of your original post ("exonsBy does not order exons by ascending rank, is it a bug or outdated documentation?") so I'd strongly suggest that you start a new post if you decide to follow up on this.
H.
Thanks. I had another post earlier, in which you basically have helped me out in the same way(very grateful). It worked. I want to follow the IntersectionStrict mode as HTSeq requires to calculate the coverage curve. if one read align to both genes (i.e. genes may overlap), that read should not be counted to avoid confusion. I assume IntersectionStrict does this job. I just found when I treat the ordered all exons from the same gene as one transcript to use pcoverageByTranscript to calculate the coverage per gene (all exons within the same gene), I have to have exons sorted in appropriate order. I overlooked the fine print of exonsBy in the first place (this post). If I reverse the order or exons for minus strand genes, everything works well now. Also I compared with the HTSeq output in terms of read pair count, they are very consistent. Thanks again. ~~~~~~~ https://support.bioconductor.org/p/123463/