Hi,
Some of you may have answers for this.
It seems that duplicate reads are very common in mRNA-seq data.
Duplicate reads are those that map to exactly the same chromosome
location and the same strand (possibly as a result of PCR
amplification). I would like to know what the general practice is for
dealing with them. I suspect some of them may contribute to the large
overdispersion in the final count data.
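
To make the definition above concrete, here is a minimal sketch that counts duplicates among aligned reads. The tuple layout (chromosome, start, strand) and the function name `count_duplicates` are assumptions made for illustration; they are not taken from any particular aligner or package.

```python
from collections import Counter

def count_duplicates(reads):
    """reads: iterable of (chrom, start, strand) tuples, one per aligned read.

    A read is treated as a duplicate if another read with the same
    chromosome, start position and strand has already been seen.
    Returns (total, distinct, duplicate_fraction).
    """
    piles = Counter(reads)          # number of reads per (chrom, start, strand)
    total = sum(piles.values())
    distinct = len(piles)
    dup_fraction = (total - distinct) / total if total else 0.0
    return total, distinct, dup_fraction

# Hypothetical toy input, not real data.
reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),   # same position and strand: duplicate
    ("chr1", 100, "-"),   # same position, other strand: not a duplicate
    ("chr1", 250, "+"),
    ("chr2", 100, "+"),
]
total, distinct, dup_fraction = count_duplicates(reads)
print(total, distinct, dup_fraction)
```

In this toy input, one of the five reads is a duplicate under that definition.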
Thanks,
Jason
Hi Jason
> It seems that duplicate reads are very common in mRNA-seq data.
> Duplicate reads are those that map to exactly the same chromosome
> location and the same strand (possibly as a result of PCR
> amplification). I would like to know what the general practice is for
> dealing with them. I suspect some of them may contribute to the large
> overdispersion in the final count data.
I know it is sometimes recommended to remove them, but I'd advise
against this.
One of the advantages of RNA-Seq over expression microarrays is the
large gain in dynamic range. On arrays, lowly expressed genes drown in
background fluorescence and highly expressed genes saturate the
hybridisation, giving you a dynamic range of typically little more
than 25 dB (i.e., ratios of at most about 1:300).
In RNA-Seq, very weak genes give rise to fewer than 10 counts while the
strongest genes may give well above 100,000 counts, i.e., the usable
dynamic range now easily exceeds 45 dB or even 50 dB (1:100,000).
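
The decibel figures above follow from the usual conversion for a ratio, 10 * log10(ratio). A small sketch to check the arithmetic (the helper name `ratio_to_db` is just for illustration):

```python
import math

def ratio_to_db(ratio):
    """Convert a ratio of counts (or intensities) to decibels."""
    return 10 * math.log10(ratio)

print(round(ratio_to_db(300), 1))       # arrays: ratios up to about 1:300
print(round(ratio_to_db(100_000), 1))   # RNA-Seq: ratios of 1:100,000
```

A ratio of 1:300 comes out just under 25 dB, and 1:100,000 is 50 dB, consistent with the numbers quoted in the text.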
Now, imagine you counted several reads mapping to the same position at
most once. Then, a transcript of, say, 1 kb can accumulate at most
1,000 counts, even if it were one of those strongly expressed ones with
a 5-figure raw count. Hence, you would dramatically squash your dynamic
range and lose all hope of linearity (i.e., you can no longer expect
the count rate to be at least roughly proportional to the
concentration).
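
The saturation argument above can be sketched numerically. The following assumes, as an idealization that is not part of the original post, that read start positions fall uniformly at random on 1,000 possible positions of a single strand; the function name `expected_dedup_count` is hypothetical. Under that assumption the expected number of distinct start positions after N reads is L * (1 - (1 - 1/L)^N).

```python
def expected_dedup_count(n_reads, n_positions=1000):
    """Expected number of distinct start positions when n_reads reads
    start uniformly at random among n_positions positions (one strand).
    This is the count you would be left with after removing duplicates."""
    return n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)

# Raw counts spanning four orders of magnitude.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(expected_dedup_count(n)))
```

For low raw counts the deduplicated count tracks the raw count closely, but it flattens out as it approaches the number of available positions, so a tenfold difference between two strongly expressed transcripts becomes almost invisible.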
Of course, if there are PCR artifacts, they destroy the linearity as
well. So, if you have an exon to which only very few reads map except
for one specific position that shows a pile of hundreds of reads, all
with precisely the same coordinates, then that is reason for concern. I
have seen such "towers" only rarely in RNA-Seq. (Actually, I haven't
seen them at all recently, but I think they were a common concern two
years ago. I wonder where they went. Did they maybe improve the PCR
steps of the library preparation protocols?)
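
One way to look for the "towers" described above is to compare each pile of identical start positions with the typical pile height in the same exon. This is only a sketch: the function name `find_towers` and both thresholds (`min_pile`, `fold`) are arbitrary illustrative choices, not values recommended in the post.

```python
from collections import Counter
from statistics import median

def find_towers(starts, min_pile=100, fold=20):
    """starts: read start positions within one exon (same strand).

    Flags positions whose pile of identical starts holds at least
    `min_pile` reads and is at least `fold` times the median pile
    height over all occupied positions in the exon.
    """
    piles = Counter(starts)
    if not piles:
        return []
    typical = median(piles.values())
    return sorted(pos for pos, n in piles.items()
                  if n >= min_pile and n >= fold * typical)

# Hypothetical exon: sparse coverage plus one pile of 300 identical reads.
exon_starts = list(range(1000, 1040, 2)) + [1017] * 300
print(find_towers(exon_starts))

# A uniformly and strongly covered exon has tall piles everywhere,
# so nothing stands out relative to the median.
strong_exon = [pos for pos in range(50) for _ in range(200)]
print(find_towers(strong_exon))
```

Comparing against the median is what separates the two cases: many duplicates in a strongly expressed exon are expected, whereas a single pile in an otherwise sparsely covered exon is suspicious.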
Simon
Thanks, Simon, for the insightful comments.
I think you are right on this. In an empirical comparison I just did
between the RNA-Seq and quantitative PCR data, the unfiltered RNA-Seq
data (duplicates kept) seem to give better concordance with the PCR
data (based on fold change).
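
A comparison of this kind could be sketched as follows, assuming concordance is measured as the Pearson correlation of per-gene log2 fold changes (the post does not say which measure was used). All numbers below are made up and constructed so that one strongly expressed gene saturates after deduplication; they say nothing about real data.

```python
import math

def log2_fold_changes(condition_a, condition_b):
    """Per-gene log2(a / b); assumes all values are strictly positive."""
    return [math.log2(a / b) for a, b in zip(condition_a, condition_b)]

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up values for four genes, for illustration only.
qpcr_log2fc = [3.0, 1.0, -1.0, 0.0]
counts_all_a, counts_all_b = [8000, 400, 50, 120], [1000, 200, 100, 120]
counts_dedup_a, counts_dedup_b = [900, 380, 50, 118], [700, 195, 98, 117]

r_all = pearson(log2_fold_changes(counts_all_a, counts_all_b), qpcr_log2fc)
r_dedup = pearson(log2_fold_changes(counts_dedup_a, counts_dedup_b), qpcr_log2fc)
print(round(r_all, 3), round(r_dedup, 3))
```

In this constructed example the first gene loses most of its fold change once duplicates are removed, which lowers the correlation with the qPCR values.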
Thanks again,
Jason
On Sat, Feb 12, 2011 at 12:39 PM, Simon Anders <anders at embl.de> wrote:
> I know it is sometimes recommended to remove them, but I'd advise
> against this.
Hi Dr. Anders and Dr. Jason,
May I ask what frequency of duplicates you have seen in your data?
I have had ~0.6 duplicates in my final aligned and filtered dataset
(filtered for unique matches and number of mismatches). So far I have
run the analysis without them.
Thanks,
Fernando
-----Original Message-----
From: bioconductor-bounces@r-project.org
[mailto:bioconductor-bounces@r-project.org] On Behalf Of Simon Anders
Sent: Saturday, February 12, 2011 11:39 AM
To: Jason Lu
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] duplicate reads in mRNA-Seq
I know it is sometimes recommended to remove them, but I'd advise
against this.