RNASeq, differential expression between group, and large variance within groups
@gordon-smyth
Last seen 7 hours ago
WEHI, Melbourne, Australia
Dear Simon and Laurent,

I can't agree with Simon's statement that edgeR does no better than DESeq at downweighting tags with extreme variances, or that this has to do with the number of replicates. While extreme cases like the example that Laurent mentions may need special intervention, edgeR was specifically designed to downweight highly variable tags, and this is just as effective with few replicates as with many.

Let's simulate a dataset with Laurent's tag as the first one:

    library(edgeR)
    # 9999 null tags: Poisson counts with mean 50 in six libraries
    y <- matrix(rpois(9999*6,lambda=50),9999,6)
    # prepend Laurent's tag as row 1
    y <- rbind(c(0,0,0,92207,0,0),y)
    rownames(y) <- 1:10000
    d <- DGEList(counts=y,group=factor(c(1,1,1,2,2,2)))
    # tagwise dispersions, with prior.n set below its default
    d2 <- estimateTagwiseDisp(d,prior.n=1)
    # exact test using the tagwise rather than the common dispersion
    et <- exactTest(d2,common.disp=FALSE)
    topTags(et)

This analysis finds no tag to be differentially expressed, just as you would want if you view the large count for tag1 to be an outlier.

(Here I have chosen prior.n to be lower than the default. The default value prior.n=10 does result in tag1 being identified as differentially expressed. It is hard to give universal guidelines for how best to choose prior.n.)

Best wishes
Gordon

------ ORIGINAL MESSAGE --------
[Bioc-sig-seq] RNASeq, differential expression between group, and large variance within groups
Simon Anders anders at embl.de
Mon Feb 21 20:34:00 CET 2011

Dear Laurent

On 02/21/2011 03:36 PM, Laurent Gautier wrote:
> We are looking at tag-based RNASeq data, and after running popular
> packages for finding differential expression (edgeR, and DEGseq) we were
> looking at the actual counts for the significant ones.
>
> We are observing a somewhat extreme variance within each group for those
> (say one sample with high count for gene X while others have zero
> count).
>
> For example, gene X flagged as differentially expressed has the
> following counts (adjusted p-value with DGESeq is 9.401479e-10):
> 0 grp_A
> 0 grp_A
> 0 grp_A
> 92207 grp_B
> 0 grp_B
> 0 grp_B
>
> The underlying binomial is obviously not like the almost-Gaussian
> assumed in microarrays/t-test-like approaches, but this kind of outcome
> is somehow intriguing me. Do people here have experience to share
> regarding how well such genes hold through the qPCR verification step ?

I have seen such genes as well in my data sets, and I am in fact worried that DESeq does not do too great a job handling them.

[...]

In most data sets these are only very few genes, but still, it is not a fully satisfactory state of affairs. I recently tested how edgeR deals with the issue and found that it does not do a much better job in handling such genes unless you have a large number of replicates.

[...]

Cheers
Simon

+---
| Dr. Simon Anders, Dipl.-Phys.
| European Molecular Biology Laboratory (EMBL), Heidelberg
| office phone +49-6221-387-8632
| preferred (permanent) e-mail: sanders at fs.tum.de

---------------------------------------------
Professor Gordon K Smyth,
NHMRC Senior Research Fellow,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852,
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth
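To see why the tag in question counts as "highly variable", it may help to put a rough number on its within-group spread. The sketch below is an illustration only, in base R: it uses a simple method-of-moments estimate of the negative binomial dispersion (variance = mean + dispersion * mean^2), which is an assumption made here for clarity and is not the estimator edgeR uses.

```r
# Laurent's tag: three libraries per group
counts <- c(0, 0, 0, 92207, 0, 0)
group  <- factor(c(1, 1, 1, 2, 2, 2))

# Per-group mean and sample variance
mu <- tapply(counts, group, mean)
v  <- tapply(counts, group, var)

# Method-of-moments dispersion from var = mu + phi * mu^2.
# Illustration only: this is NOT edgeR's estimator.
# Undefined when the group mean is zero, so return NA there.
phi <- ifelse(mu > 0, (v - mu) / mu^2, NA)

print(data.frame(group = levels(group),
                 mean = as.numeric(mu),
                 variance = as.numeric(v),
                 dispersion = as.numeric(phi)))
```

For group 2 the moment estimate is just under 3 (exactly 3 - 1/mean for a group of three with a single non-zero count), whereas Poisson counts like the simulated background tags have a dispersion near zero. A tag whose own dispersion estimate is that large carries little evidence for a group difference unless it is squeezed heavily towards the common value, which is consistent with the role prior.n plays in the post above.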
RNASeq qPCR edgeR DESeq • 1.6k views
@gordon-smyth
Last seen 7 hours ago
WEHI, Melbourne, Australia
My previous post relates to a discussion on the Bioc-sig-seq mailing list. I redirected it from Bioc-sig-seq to the main Bioconductor list by mistake rather than intention (sorry), but the topic of RNA-Seq analysis and outliers might be of interest to the main list anyway.

Gordon
Naomi Altman ★ 6.0k
@naomi-altman-380
Last seen 3.8 years ago
United States
I know we are dealing with lots of data, but still, in "handling" a case like the one below, I would want to know more about the sample that produced the huge outlying count. I would prefer that features with this type of behavior be flagged, rather than merged with the rest of the data to be declared significant (or not). This unusual sample could be affecting the entire analysis - not just the one feature that is bizarre - so I want it brought to my attention.

--naomi

At 10:50 PM 3/1/2011, Gordon K Smyth wrote:
>[...]
>On 02/21/2011 03:36 PM, Laurent Gautier wrote:
>[...]
>>For example, gene X flagged as differentially expressed has the
>>following counts (adjusted p-value with DGESeq is 9.401479e-10):
>>0 grp_A
>>0 grp_A
>>0 grp_A
>>92207 grp_B
>>0 grp_B
>>0 grp_B
>[...]
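Naomi's suggestion of flagging such features, rather than letting them pass silently through the test, could be sketched along the following lines in base R. The function name `flag_dominated` and the thresholds `max_share` and `min_total` are hypothetical choices made for this illustration; they are not part of edgeR or DESeq, and sensible cutoffs would depend on the data.

```r
# Hypothetical helper: flag features whose total count comes almost
# entirely from one sample. Thresholds are illustrative assumptions.
flag_dominated <- function(counts, max_share = 0.9, min_total = 100) {
  counts <- as.matrix(counts)
  total  <- rowSums(counts)
  top    <- apply(counts, 1, max)
  # Share of the feature's total contributed by its largest sample
  share  <- ifelse(total > 0, top / total, 0)
  data.frame(
    feature    = seq_len(nrow(counts)),
    total      = total,
    top_sample = max.col(counts, ties.method = "first"),
    share      = share,
    flagged    = total >= min_total & share >= max_share
  )
}

# Small simulated example: Poisson background plus the tag from the thread
set.seed(1)
y <- matrix(rpois(99 * 6, lambda = 50), 99, 6)
y <- rbind(c(0, 0, 0, 92207, 0, 0), y)

res <- flag_dominated(y)
print(res[res$flagged, ])

# Which samples are responsible for the flagged features?
print(table(res$top_sample[res$flagged]))
```

Tabulating the responsible sample addresses the second half of the concern: if many flagged features point to the same library, that sample probably deserves a closer look before any differential expression analysis is interpreted.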
Hi Naomi,

Agreed. My post here was to dispel the notion that edgeR wasn't capable of downweighting these tags, if that is what the investigator wants to do. (A couple of investigators, both bioinformaticians rather than biologists, have suggested to me that this is the "correct" course of action.)

I've argued that we need to know how these unusual features arise before deciding what should be done:

https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2011-March/001874.html

Another possibility that I should have mentioned in that post is that the unusual count might correspond to an outlier individual, and it might be of scientific interest to know that.

Cheers
Gordon