Peaks in coverageByTranscript {GenomicFeatures}
2
0
Entering edit mode
@manuelgoepferich-11102
Last seen 5.3 years ago

Hi,

for my project it would be very important to know how the function

coverageByTranscript {GenomicFeatures}

really works. For many coverage plots (coverage of a 3'UTR) I get no step functions but rather spikes.

Is the whole read considered as coverage or just the end or start (How many base pairs)?

Shall I include an example?

P.S.: I am using paired-end reads.

GenomicFeatures • 1.5k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 7 days ago
Seattle, WA, United States

Hi Manuel,

The coverageByTranscript function has a man page with extensive examples. Did you check it? Only the parts of a read that overlap with a transcript participate to the coverage for this transcript. I just added a simple example to the man page to illustrate this (this is in GenomicFeatures 1.27.15). Repeating it here:

First, let's get a transcript

library(GenomicFeatures)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
dm3_transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)
my_transcript <- dm3_transcripts["FBtr0300689"]
my_transcript
# GRangesList object of length 1:
# $FBtr0300689 
# GRanges object with 2 ranges and 3 metadata columns:
#       seqnames       ranges strand |   exon_id exon_rank
#          <Rle>    <IRanges>  <Rle> | <integer> <integer>
#   [1]    chr2L [7529, 8116]      + |         1         1
#   [2]    chr2L [8193, 9484]      + |         3         2

Then let's create 3 artificial aligned reads

We represent them as a GRanges object of length 3 that contains the genomic positions of the 3 reads. Note that these reads are simple alignments i.e. each of them can be represented with a single range. This would not be the case if they were junction reads.

my_reads <- GRanges(c("chr2L:7531-7630", "chr2L:8101-8200", "chr2L:8141-8240"))

Compute the coverage on the reference genome

coverage(my_reads)
# RleList of length 1
# $chr2L
# integer-Rle of length 8240 with 6 runs
#   Lengths: 7530  100  470   40   60   40
#   Values :    0    1    0    1    2    1

As you can see, all the genomic positions in the 3 ranges participate to the coverage. This can be confirmed by comparing:

> sum(coverage(my_reads))
chr2L 
  300

with:

> sum(width(my_reads))
[1] 300

They are the same.

Compute the coverage on the transcript

The 1st read is fully contained within the 1st exon:

coverageByTranscript(my_reads[1], my_transcript)
# RleList of length 1
# $FBtr0300689
# integer-Rle of length 1880 with 3 runs
#   Lengths:    2  100 1778
#   Values :    0    1    0

Note that the length of the Rle (1880) is the length of the transcript.

The 2nd and 3rd reads overlap the 2 exons and the intron. Only the parts that overlap the exons participate to coverage:

coverageByTranscript(my_reads[2], my_transcript)
# RleList of length 1
# $FBtr0300689
# integer-Rle of length 1880 with 3 runs
#   Lengths:  572   24 1284
#   Values :    0    1    0

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode

Thanks. It is definitely part of the answer. I have checked the explanations & the help pages. I will give an example for a decoded coverage vector (paired-end reads, star-mapping, mouse-genome, single cell, start of a 3'UTR):

62 62 62 59 53 53 53 53 52 50 50 50 50 45 45 45 45 45 42 42 42 43 49 49 50 50 49 49 49 48 48 48 50 50 50 46 46 ... 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  0  0  0  0

What is the explanation of 'peaks' (fluctuation of coverage every 2 - 5 base pairs) in this coverage vectors? Two explanations: (I) The reads were mapped to another part of the gene (intron, exon etc.) and only a small part of the read (sometimes 3 positions) are contributing to the coverage (because the quality of the read is locally low?) (II) There was a problem with my decoding or workflow (I don not think so.)

Best regards

Manuel

ADD REPLY
1
Entering edit mode

Not sure that I can be of much help. But don't we see the same kind of pattern here with the dm3 transcript that receives the highest coverage from the untreated1_chr4 data set from the pasillaBamSubset package?

library(pasillaBamSubset)
library(GenomicAlignments)
reads <- readGAlignments(untreated1_chr4())
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
dm3_transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)
tx_cvg <- coverageByTranscript(reads, dm3_transcripts, ignore.strand=TRUE)
which.max(sum(tx_cvg))
# FBtr0308296 
#       23775 

Coverage on this transcript:

tx_cvg$FBtr0308296
# integer-Rle of length 1349 with 1030 runs
#   Lengths:    1    1    2    1    1    1 ... 25    3    4    2  115
#   Values :  105  110  111  116  117  118 ...  4    3    2    1    0

The coverage profile for this transcript can be plotted with:

plot(as.vector(runmean(tx_cvg$FBtr0308296, 50)))

I don't have access to yours so I can't really compare but maybe you can.

H.

ADD REPLY
0
Entering edit mode

What I did was something like this (and now it looks similar to me coverage vectors):

plot(decode(tx_cvg$FBtr0308296))

The story goes as follows: My supervisor asked me where this wiggly shapes of coverage vectors are coming from?

I see that with 'runmean' the coverage vectors will be smoothed.

Do you know someone who could explain this effect (of wiggly coverage vectors)?

 

 

ADD REPLY
0
Entering edit mode

I'm not a statistician but my understanding is that the distribution of reads along the reference genome has a noise component to it. There are various causes for this noise, I'm not an expert, but one of them is a per-position bias due to the nucleotide content. Could the wiggly aspect of the coverage profile just reflect the intrinsically noisy nature of HTS experiments?

H.

ADD REPLY
0
Entering edit mode

BTW a more efficient/accurate/meaningful way to obtain the coverage by transcript would be to align the reads to the transcriptome instead of the reference genome. Then you can just use coverage() instead of coverageByTranscript() on the resulting BAM file. I've never tried it but that's what I would do if I wanted reliable transcript coverage. It has the advantage of counting only the reads that are compatible with the transcript (i.e. reads that can possibly originate from that transcript).

Note that it's still possible to count transcript-compatible reads when working with reads aligned to the reference genome. For example it can be done with pcoverageByTranscript() (the man page shows how to do this). But aligning the reads to the transcriptome first is more straightforward and reliable. It doesn't even require a splice-aware aligner (and if you use a splice-aware aligner, you should turn that feature off) so the mapping process is simpler and more reliable.  

H.

ADD REPLY
0
Entering edit mode
@james-w-macdonald-5106
Last seen 3 days ago
United States

This function is hidden in the namespace. You could hypothetically do

library(GenomicFeatures)

getAnywhere(coverageByTranscript)

To see the function body. An arguably better way to proceed would be to go to the GitHub mirror for this package and look at the source files. Or you could check out from subversion, using 'readonly' for the username and password. But do note that the repositories are going to be changing soon, so the subversion repo and the GitHub mirror will be going away. I suppose an alternative that will remain is to get the package source from the bottom of the page here and untar and unzip.

ADD COMMENT

Login before adding your answer.

Traffic: 544 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6