Hi Manuel,
The coverageByTranscript
function has a man page with extensive examples. Did you check it? Only the parts of a read that overlap with a transcript participate to the coverage for this transcript. I just added a simple example to the man page to illustrate this (this is in GenomicFeatures 1.27.15). Repeating it here:
First, let's get a transcript
library(GenomicFeatures)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
dm3_transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)
my_transcript <- dm3_transcripts["FBtr0300689"]
my_transcript
# GRangesList object of length 1:
# $FBtr0300689
# GRanges object with 2 ranges and 3 metadata columns:
# seqnames ranges strand | exon_id exon_rank
# <Rle> <IRanges> <Rle> | <integer> <integer>
# [1] chr2L [7529, 8116] + | 1 1
# [2] chr2L [8193, 9484] + | 3 2
Then let's create 3 artificial aligned reads
We represent them as a GRanges object of length 3 that contains the genomic positions of the 3 reads. Note that these reads are simple alignments i.e. each of them can be represented with a single range. This would not be the case if they were junction reads.
my_reads <- GRanges(c("chr2L:7531-7630", "chr2L:8101-8200", "chr2L:8141-8240"))
Compute the coverage on the reference genome
coverage(my_reads)
# RleList of length 1
# $chr2L
# integer-Rle of length 8240 with 6 runs
# Lengths: 7530 100 470 40 60 40
# Values : 0 1 0 1 2 1
As you can see, all the genomic positions in the 3 ranges participate to the coverage. This can be confirmed by comparing:
> sum(coverage(my_reads))
chr2L
300
with:
> sum(width(my_reads))
[1] 300
They are the same.
Compute the coverage on the transcript
The 1st read is fully contained within the 1st exon:
coverageByTranscript(my_reads[1], my_transcript)
# RleList of length 1
# $FBtr0300689
# integer-Rle of length 1880 with 3 runs
# Lengths: 2 100 1778
# Values : 0 1 0
Note that the length of the Rle (1880) is the length of the transcript.
The 2nd and 3rd reads overlap the 2 exons and the intron. Only the parts that overlap the exons participate to coverage:
coverageByTranscript(my_reads[2], my_transcript)
# RleList of length 1
# $FBtr0300689
# integer-Rle of length 1880 with 3 runs
# Lengths: 572 24 1284
# Values : 0 1 0
Hope this helps,
H.
Thanks. It is definitely part of the answer. I have checked the explanations & the help pages. I will give an example for a decoded coverage vector (paired-end reads, star-mapping, mouse-genome, single cell, start of a 3'UTR):
What is the explanation of 'peaks' (fluctuation of coverage every 2 - 5 base pairs) in this coverage vectors? Two explanations: (I) The reads were mapped to another part of the gene (intron, exon etc.) and only a small part of the read (sometimes 3 positions) are contributing to the coverage (because the quality of the read is locally low?) (II) There was a problem with my decoding or workflow (I don not think so.)
Best regards
Manuel
Not sure that I can be of much help. But don't we see the same kind of pattern here with the dm3 transcript that receives the highest coverage from the untreated1_chr4 data set from the pasillaBamSubset package?
Coverage on this transcript:
The coverage profile for this transcript can be plotted with:
I don't have access to yours so I can't really compare but maybe you can.
H.
What I did was something like this (and now it looks similar to me coverage vectors):
The story goes as follows: My supervisor asked me where this wiggly shapes of coverage vectors are coming from?
I see that with 'runmean' the coverage vectors will be smoothed.
Do you know someone who could explain this effect (of wiggly coverage vectors)?
I'm not a statistician but my understanding is that the distribution of reads along the reference genome has a noise component to it. There are various causes for this noise, I'm not an expert, but one of them is a per-position bias due to the nucleotide content. Could the wiggly aspect of the coverage profile just reflect the intrinsically noisy nature of HTS experiments?
H.
BTW a more efficient/accurate/meaningful way to obtain the coverage by transcript would be to align the reads to the transcriptome instead of the reference genome. Then you can just use
coverage()
instead ofcoverageByTranscript()
on the resulting BAM file. I've never tried it but that's what I would do if I wanted reliable transcript coverage. It has the advantage of counting only the reads that are compatible with the transcript (i.e. reads that can possibly originate from that transcript).Note that it's still possible to count transcript-compatible reads when working with reads aligned to the reference genome. For example it can be done with
pcoverageByTranscript()
(the man page shows how to do this). But aligning the reads to the transcriptome first is more straightforward and reliable. It doesn't even require a splice-aware aligner (and if you use a splice-aware aligner, you should turn that feature off) so the mapping process is simpler and more reliable.H.