edgeR - estimateGLMCommonDisp - warnings

edgeR - estimateGLMCommonDisp - warnings - huge logFC

0

Entering edit mode

Filippis, Ioannis ▴ 30

@filippis-ioannis-4720

Last seen 10.7 years ago

Hi, I am using edgeR for a 2x2 factorial design (Strain*Treatment) without any replicates and the estimateGLMCommonDisp and glmFit functions. When I run estimateGLMCommonDisp, I get warnings 1: In optimize(f = fun, interval = interval^0.25, y = y, ... : NA/Inf replaced by maximum positive value and when I run glmFit and then glmLRT, I get huge fold change values for some genes. However, if I do a pairwise exactTest for the samples examined for the above contrast, the fold change for that genes is high but normal. I would really appreciate any feedback on the cause of warnings and huge logFC. Many thanks for your help. Best, Ioannis [[alternative HTML version deleted]]

edgeR edgeR • 1.8k views

ADD COMMENT • link updated 13.8 years ago by Robert Castelo ★ 3.4k • written 13.8 years ago by Filippis, Ioannis ▴ 30

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

Dear Ioannis, If the counts are zeros for some libraries for some genes, then it should be no surprise that some of the logFC might be very large. The raw fold changes are infinite. The real problem though is that running estimateGLMCommonDisp() without replicates is meaningless, since the dispersion is not actually estimable without replicates. The function will probably just return a dispersion of zero in this case. If you must analyse RNA-Seq data without replicates, you could estimate the dispersion very roughly by treating all the libraries as if they were replicates, by d2 <- estimateCommonDisp(d), or d2 <- estimateGLMCommonDisp(d) and then proceed using this conservative dispersion estimate. Best wishes Gordon > Date: Fri, 8 Jul 2011 08:18:56 +0000 > From: "Filippis, Ioannis" <i.filippis at="" imperial.ac.uk=""> > To: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> > Subject: [BioC] edgeR - estimateGLMCommonDisp - warnings - huge logFC > Content-Type: text/plain > > Hi, > > I am using edgeR for a 2x2 factorial design (Strain*Treatment) without > any replicates and the estimateGLMCommonDisp and glmFit functions. > > When I run estimateGLMCommonDisp, I get warnings > 1: In optimize(f = fun, interval = interval^0.25, y = y, ... : > NA/Inf replaced by maximum positive value > and when I run glmFit and then glmLRT, I get huge fold change values for some genes. > > However, if I do a pairwise exactTest for the samples examined for the above contrast, the fold change for that genes is high but normal. > > I would really appreciate any feedback on the cause of warnings and huge logFC. > > Many thanks for your help. > > Best, > Ioannis ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 13.8 years ago Gordon Smyth 52k

0

Entering edit mode

Dear Gordon, thank you very much for the reply. The problem in using as common dispersion the dispersion among the samples, is that this dispersion is too high leading to 0 DE genes. And I expect to have such high diversion, it is not some artifact of the data. I also wonder whether it is statistically wrong to add 1 to the counts matrix before any edgeR analysis. This would lead to "meaningfull" logFC values and logical volcano plots. Many thanks for your help. Best regards, Ioannis ________________________________________ From: Gordon K Smyth [smyth@wehi.EDU.AU] Sent: 08 July 2011 12:16 To: Filippis, Ioannis Cc: Bioconductor mailing list Subject: edgeR - estimateGLMCommonDisp - warnings - huge logFC Dear Ioannis, If the counts are zeros for some libraries for some genes, then it should be no surprise that some of the logFC might be very large. The raw fold changes are infinite. The real problem though is that running estimateGLMCommonDisp() without replicates is meaningless, since the dispersion is not actually estimable without replicates. The function will probably just return a dispersion of zero in this case. If you must analyse RNA-Seq data without replicates, you could estimate the dispersion very roughly by treating all the libraries as if they were replicates, by d2 <- estimateCommonDisp(d), or d2 <- estimateGLMCommonDisp(d) and then proceed using this conservative dispersion estimate. Best wishes Gordon > Date: Fri, 8 Jul 2011 08:18:56 +0000 > From: "Filippis, Ioannis" <i.filippis at="" imperial.ac.uk=""> > To: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> > Subject: [BioC] edgeR - estimateGLMCommonDisp - warnings - huge logFC > Content-Type: text/plain > > Hi, > > I am using edgeR for a 2x2 factorial design (Strain*Treatment) without > any replicates and the estimateGLMCommonDisp and glmFit functions. > > When I run estimateGLMCommonDisp, I get warnings > 1: In optimize(f = fun, interval = interval^0.25, y = y, ... : > NA/Inf replaced by maximum positive value > and when I run glmFit and then glmLRT, I get huge fold change values for some genes. > > However, if I do a pairwise exactTest for the samples examined for the above contrast, the fold change for that genes is high but normal. > > I would really appreciate any feedback on the cause of warnings and huge logFC. > > Many thanks for your help. > > Best, > Ioannis ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}

ADD REPLY • link 13.8 years ago Filippis, Ioannis ▴ 30

0

Entering edit mode

Dear Ioannis, It is asking quite a lot of a statistical procedure, to perform an analysis of statistical significance for your data, in the absence of any replication. The short answer is that you are probably asking too much! In the absence of replicates your choices are: 1. Be satisfied with an MDS plot and by looking at fold changes. Do not attempt a significance analysis. 2. Estimate the dispersion by removing one or both of your factors from the model. You could remove the factor with smallest fold changes for example. This will only be successful if the number DE genes is relatively small. 3. You could try estimateGLMCommonDisp(method="deviance",robust=TRUE,subset=NULL) This is our current best attempt at an automatic method to estimate dispersion without replicates. It will only give good results when the counts are not too small and the number of DE genes is not too large. 4. You could identify a number of transcripts that should not be DE, and estimate the dispersion from them: d0 <- estimateCommonDisp(d[housekeeping,]) 5. Simply pick a reasonable dispersion value (dispersion=0.3 say), based on your experience with similar data, and use that. Although subjective, this is much better and more defensible than assuming Poission variation. As far as fold-changes are concerned, I am currently working with a PhD student on the problem of shrinking fold changes to be more reliable, but the simple approach of adding a constant to your counts is not statistically meaningful in the context of an edgeR analysis. Adding a constant may be a useful thing to do, and I often recommend: y <- t(log2( t(counts+0.5) / (lib.size+0.5)) for descriptive purposes, but the resulting log-abundances should be treating as roughly normal (eg analysed in limma) rather than analysed in edgeR. Obviously, you can't creat a volcano plot unless you have a significance analysis as well. Best wishes Gordno --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth On Fri, 8 Jul 2011, Filippis, Ioannis wrote: Dear Gordon, thank you very much for the reply. The problem in using as common dispersion the dispersion among the samples, is that this dispersion is too high leading to 0 DE genes. And I expect to have such high diversion, it is not some artifact of the data. I also wonder whether it is statistically wrong to add 1 to the counts matrix before any edgeR analysis. This would lead to "meaningfull" logFC values and logical volcano plots. Many thanks for your help. Best regards, Ioannis ________________________________________ From: Gordon K Smyth [smyth@wehi.EDU.AU] Sent: 08 July 2011 12:16 To: Filippis, Ioannis Cc: Bioconductor mailing list Subject: edgeR - estimateGLMCommonDisp - warnings - huge logFC Dear Ioannis, If the counts are zeros for some libraries for some genes, then it should be no surprise that some of the logFC might be very large. The raw fold changes are infinite. The real problem though is that running estimateGLMCommonDisp() without replicates is meaningless, since the dispersion is not actually estimable without replicates. The function will probably just return a dispersion of zero in this case. If you must analyse RNA-Seq data without replicates, you could estimate the dispersion very roughly by treating all the libraries as if they were replicates, by d2 <- estimateCommonDisp(d), or d2 <- estimateGLMCommonDisp(d) and then proceed using this conservative dispersion estimate. Best wishes Gordon > Date: Fri, 8 Jul 2011 08:18:56 +0000 > From: "Filippis, Ioannis" <i.filippis at="" imperial.ac.uk=""> > To: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> > Subject: [BioC] edgeR - estimateGLMCommonDisp - warnings - huge logFC > Content-Type: text/plain > > Hi, > > I am using edgeR for a 2x2 factorial design (Strain*Treatment) without > any replicates and the estimateGLMCommonDisp and glmFit functions. > > When I run estimateGLMCommonDisp, I get warnings > 1: In optimize(f = fun, interval = interval^0.25, y = y, ... : > NA/Inf replaced by maximum positive value > and when I run glmFit and then glmLRT, I get huge fold change values for some genes. > > However, if I do a pairwise exactTest for the samples examined for the > above contrast, the fold change for that genes is high but normal. > > I would really appreciate any feedback on the cause of warnings and huge logFC. > > Many thanks for your help. > > Best, > Ioannis ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD REPLY • link 13.8 years ago Gordon Smyth 52k

0

Entering edit mode

Robert Castelo ★ 3.4k

@rcastelo

Last seen 18 days ago

Barcelona/Universitat Pompeu Fabra

hi, this very recent correspondence published at Nature Biotech may help you, and possibly help others interested in this thread, to frame some of the more general remarks made by Gordon Smyth in his last message: Hansen KD, Wu Z, Irizarry RA & Leek JT Sequencing technology does not eliminate biological variability Nature Biotechnology 29, 572?573 (2011) doi:10.1038/nbt.1910 Published online 11 July 2011 http://www.nature.com/nbt/journal/v29/n7/full/nbt.1910.html cheers, robert. On Fri, 2011-07-08 at 08:18 +0000, Filippis, Ioannis wrote: > Hi, > > I am using edgeR for a 2x2 factorial design (Strain*Treatment) without any replicates and the estimateGLMCommonDisp and glmFit functions. > > When I run estimateGLMCommonDisp, I get warnings > 1: In optimize(f = fun, interval = interval^0.25, y = y, ... : > NA/Inf replaced by maximum positive value > and when I run glmFit and then glmLRT, I get huge fold change values for some genes. > > However, if I do a pairwise exactTest for the samples examined for the above contrast, the fold change for that genes is high but normal. > > I would really appreciate any feedback on the cause of warnings and huge logFC. > > Many thanks for your help. > > Best, > Ioannis > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 13.8 years ago Robert Castelo ★ 3.4k

Login before adding your answer.