duplicate reads in mRNA-Seq


jason0701 ▴ 190

@jason0701-3921


Hi,

Some of you may have answers for this. Duplicate reads seem to be very common in mRNA-Seq data. By duplicate reads I mean reads that map to exactly the same chromosome location and to the same strand (possibly as a result of PCR amplification). What is the general practice for dealing with them? I suspect some of them may contribute to the large overdispersion in the final count data.

Thanks,
Jason

• 1.7k views

ADD COMMENT • link updated 13.8 years ago by Simon Anders ★ 3.8k • written 13.8 years ago by jason0701 ▴ 190


Simon Anders ★ 3.8k

@simon-anders-3855


Zentrum für Molekularbiologie, Universi…

Hi Jason

> It seems that the duplicate reads are very common in mRNA-seq data.
> Duplicate reads are those being mapped to exact the same chromosome
> location and on the same strand (maybe from PCR amplification). I
> would like to know what are the general practice to deal with it? I
> suspect some of those may contribute to the large overdispersion in
> the final count data.

I know it is sometimes recommended to remove them, but I'd advise against this.

One of the advantages of RNA-Seq over expression microarrays is the large gain in dynamic range. On arrays, lowly expressed genes drown in background fluorescence and highly expressed genes saturate the hybridisation, giving you a dynamic range of typically little more than 25 dB (i.e., ratios of at most about 1:300).

In RNA-Seq, very weak genes give rise to fewer than 10 counts while the strongest genes may give well above 100,000 counts, i.e., the usable dynamic range now easily exceeds 45 dB or even 50 dB (1:100,000).

Now, imagine you counted several reads mapping to the same position at most once. Then a transcript of, say, 1 kb could accumulate at most 1,000 counts, even if it were one of those strongly expressed ones with a five-figure raw count. Hence, you would dramatically squash your dynamic range and lose all hope of linearity (i.e., you could no longer expect the count rate to be at least roughly proportional to the concentration).

Of course, if there are PCR artifacts, they destroy the linearity as well. So, if you have an exon to which only very few reads map, except for one specific position that shows a pile of hundreds of reads, all with precisely the same coordinates, then there is reason for concern. I have seen such "towers" only rarely in RNA-Seq. (Actually, I haven't seen them at all recently, but I think they were a common concern two years ago. I wonder where they went. Did they maybe improve the PCR steps of the library preparation protocols?)

Simon
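The dynamic-range argument above can be checked with a little arithmetic. The following is a minimal Python sketch, not part of the original post: the function names `ratio_to_db` and `expected_dedup_count` are made up for illustration, and the model (reads starting uniformly at random along a 1,000-position transcript, one strand only) is a simplifying assumption.

```python
import math


def ratio_to_db(ratio):
    """Express a ratio in decibels, using 10 * log10(ratio)."""
    return 10.0 * math.log10(ratio)


def expected_dedup_count(n_reads, n_positions):
    """Expected number of distinct start positions that are hit when
    n_reads reads start uniformly at random on n_positions positions.

    Under this (assumed) uniform model, this is the count that remains
    after collapsing reads with identical coordinates to a single read.
    """
    p_position_missed = (1.0 - 1.0 / n_positions) ** n_reads
    return n_positions * (1.0 - p_position_missed)


# Ratios quoted in the post, converted to decibels.
print("1:300     ->", round(ratio_to_db(300), 1), "dB")
print("1:100,000 ->", round(ratio_to_db(100_000), 1), "dB")

# Raw count versus expected count after collapsing duplicates
# for a hypothetical 1 kb transcript.
for raw in (10, 1_000, 10_000, 100_000):
    collapsed = expected_dedup_count(raw, 1_000)
    print(f"raw = {raw:>7}  collapsed = {collapsed:7.1f}")
```

Under these assumptions the collapsed count tracks the raw count only while the raw count is small, and then flattens out as it approaches the number of available positions, which is the loss of linearity described above.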

ADD COMMENT • link 13.8 years ago Simon Anders ★ 3.8k


Thanks, Simon, for the insightful comments. I think you are right on this. In an empirical comparison I just did between the RNA-Seq and quantitative PCR data, the unfiltered data seem to give better concordance with the PCR data (based on fold change).

Thanks again,
Jason

On Sat, Feb 12, 2011 at 12:39 PM, Simon Anders <anders at embl.de> wrote:
> I know it is sometimes recommended to remove them but I'd advise against this.

ADD REPLY • link 13.8 years ago jason0701 ▴ 190


Hi Dr. Anders and Dr. Jason,

May I ask what frequency of duplicates you have seen in your data? I have ~0.6 duplicates in my final aligned and filtered (unique match and number of mismatches) dataset. So far I have run the analysis without them.

Thanks,
Fernando

-----Original Message-----
From: bioconductor-bounces@r-project.org On Behalf Of Simon Anders
Sent: Saturday, February 12, 2011 11:39 AM
To: Jason Lu
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] duplicate reads in mRNA-Seq

> I know it is sometimes recommended to remove them but I'd advise against this.
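To put a number on the question about duplicate frequency, one could tally reads by the definition used in this thread (same chromosome, same start position, same strand). The following Python sketch is illustrative only: the helper names `duplicate_fraction` and `find_towers`, the tuple representation of reads, and the thresholds `min_pile` and `min_fraction` are all assumptions, not something taken from the posts or from any particular tool.

```python
from collections import Counter


def duplicate_fraction(reads):
    """Fraction of reads that repeat the (chrom, start, strand) of another read.

    `reads` is an iterable of (chrom, start, strand) tuples. The first read
    at each coordinate is counted as unique, every further one as a duplicate.
    """
    counts = Counter(reads)
    n_total = sum(counts.values())
    if n_total == 0:
        return 0.0
    return 1.0 - len(counts) / n_total


def find_towers(exon_reads, min_pile=100, min_fraction=0.5):
    """Return coordinates within one exon where a single position holds a
    pile of at least `min_pile` reads and at least `min_fraction` of all
    reads on that exon. Both thresholds are arbitrary illustrative values.
    """
    counts = Counter(exon_reads)
    n_total = sum(counts.values())
    return [
        key
        for key, pile in counts.items()
        if pile >= min_pile and pile / n_total >= min_fraction
    ]


# Toy data: three reads at one coordinate, one elsewhere, one on the other strand.
reads = [("chr1", 100, "+")] * 3 + [("chr1", 250, "+"), ("chr1", 100, "-")]
print(duplicate_fraction(reads))

# Toy exon: 300 reads piled on one position, 20 reads spread over other positions.
exon = [("chr2", 5000, "+")] * 300 + [("chr2", 5000 + i, "+") for i in range(1, 21)]
print(find_towers(exon))
```

A high duplicate fraction by itself does not distinguish PCR artifacts from strongly expressed genes, so a per-exon check for isolated piles, along the lines of `find_towers`, is closer to the "tower" pattern described earlier in the thread.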

ADD REPLY • link 13.8 years ago Biase, Fernando ▴ 150
