Question

Limma: correct calculation of B statistics (log odds)

1

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 7 hours ago

WEHI, Melbourne, Australia

Dear Jose, There are a number of important points that need to be made here: 1. The moderated t-statistics and p-values returned by limma do not use any assumption about the proportion of DE genes. These statistics rank the genes in the same order as the B-statistic, so there is actually no need to use the B-statistic and therefore no need to rely on the prior assumed proportion. 2. The B-statistic is the log-posterior odds of differential expression. Naturally the posterior probability has to depend on the prior probability. However the ranking of the genes given by the B-statistic doesn't depend on the prior probability. 3. It is easy in principle to estimate the proportion of differentially expressed genes from a statistical model, but such an estimate is all too likely to be practically bogus. Please see Section 6.4 of Smyth (2004) where there is a careful discussion of how the prior proportion could be estimated but why this is not done automatically in limma. Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3, No. 1, Article 3. http://www.statsci.org/smyth/pubs/ebayes.pdf 4. In my opinion, the difficulty of getting a biologically meaningful estimate of the proportion of truly DE genes is shared by all statistical models, not just the one that limma is based on. 5. If you have a prior belief based on sound *biological* grounds that the overall proportion of DE genes should be substantially greater or less than 0.01, then you are permitted and encouraged to set the proportion for yourself. 6. If you really can't resist estimating the proportion of DE genes, the best method that I know of it that proposed by Ferkinstad et al (2005). This is implemented in limma in the convest() function. In limma you can use p0 <- convest(fit$p.value[,"coefofinterest"]) to estime the proportion of non-DE genes for your contrast of interest. Ferkingstad, E., Langaas, M., and Lindqvist, B. (2005). Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society Series B, 67, 555-572. http://www.math.ntnu.no/~mettela/SFG/research.imf 7. Finally, note that the number of truly DE genes is likely to be quite a bit greater than the number of statistically significant DE genes. This is because there may be many genes which are DE but with such small fold changes that you have no realistic chance of detecting them. This fact has two consequences. Firstly it means that you can't estimate the proportion simply by looking at the number of significant genes. Secondly it means that the proportion of truly DE may actually be of no real biological interest even if you knew it. This is because it includes, perhaps even is mostly made up of, genes of no practical interest because the fold changes too small to be important. Best wishes Gordon >Date: Wed, 19 Apr 2006 13:05:59 +0100 >From: J.delasHeras at ed.ac.uk >Subject: [BioC] Limma: correct calculation of B statistics (log odds) >To: bioconductor at stat.math.ethz.ch > > >I have been using B values to rank genes in order of more likely to >less likely (differentially expressed) in LimmaGUI. > >I am now using Limma, I noticed the default value for the parameter >"proportion" (on the function eBayes) is set at 0.01 (expected 1% >differentially expressed genes). I didn't pay much attention to this >parameter before, because in LimmaGUI you cannot specify it. > >However, now that I use "straight" Limma more I was playing with the >proportion parameter and it affects the B stats a lot. Therefore I come >to the question of what's the best way to estimate this parameter. > >My first guess is to use the P values (FDR, calculated by BH) to decide >a cut off, usually 0.05. Then see how many genes are differentially >expressed according to that rule. And use this observed proportion of >differentially expressed genes as my proportion parameter. > >Is this the correct way to do it? > >Jose > >-- >Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk >The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 6513374 >Institute for Cell & Molecular Biology Fax: +44 (0)131 6507360 >Swann Building, Mayfield Road >University of Edinburgh >Edinburgh EH9 3JR >UK

Microarray Genetics limma limmaGUI Microarray Genetics limma limmaGUI • 4.1k views

ADD COMMENT • link updated 19.0 years ago by J.delasHeras@ed.ac.uk ★ 1.9k • written 19.0 years ago by Gordon Smyth 52k

score 0 · Answer 1 · 2006-04-21

Dear Ben, Please see also my longer reply to Jose in a separate email. The t-statistics, p-values and gene rankings provided by limma do not depend on the assumed proportion. In fact part of the motivation for developing the moderated t-statistics was to obtain a statistic with the same power as the posterior odds without needing this difficult-to-estimate quantity. While the B-statistic does depend on the prior assumed proportion, this is dependence is very straightforward, well understand and explicit. The prior log-odds simply adds a constant to all the genewise B-statistics. It doesn't change the ordering. I agree with your desire to avoid dependence on unjustified assumptions. My approach in limma has been to minimise assumptions where possible but otherwise to make the assumptions very explicit. What I personally feel uneasy about are statistical methods which propose to estimate quantities about which the data contains very little information. The dependence on assumptions may be hard to see. It seems to me that the proportion of DE genes is just such a quantity, because its estimation must be highly sensitive to model assumptions in small microarray experiments. I could easily provide an automatic estimate of this quantity as part of the eBayes() computations in limma, but I deliberately chose not to do this. Expanding a little further on this topic, it seems to me that a biologically meaningful treatment of the proportion of truly DE genes would require a more careful definition of the concept of differential expression than has so far appeared in the literature. It seems to me that mathematicians and biologists have different things in mind when they think of this quantity. Mathematicians are including many genes with very small fold changes which the biologists would do not consider of interest. A biologically meaningful treatment would have to specify how large a fold change needs to be in order to be considered material. I suspect that biologists are going to be surprised by how sensitive the estimated proportion is to this threshold. Best wishes Gordon >[BioC] Limma: correct calculation of B statistics (log odds) >Wittner, Ben, Ph.D. Wittner.Ben at mgh.harvard.edu >Thu Apr 20 19:40:10 CEST 2006 > >Jose, > >I'm very glad you asked this question. One of the things that has made me wary >of using limma is that the proportion of differentially expressed >genes is often >one of the primary things I'm trying to discover from the data, so I >feel uneasy >making an assumption as to what that proportion is. In your email >below, you say >that the output of limma is sensitive to the assumption, which, of >course, makes >me feel even more uneasy about it. >I've not noticed any responses on the BioC list. Has anyone commented on this >issue to you? > >-Ben

score 0 · Answer 2 · 2006-04-21

Dear Gordon, many thanks for (as usual) a very helpful and informative answer. > 2. The B-statistic is the log-posterior odds of differential > expression. Naturally the posterior probability has to depend on the > prior probability. However the ranking of the genes given by the > B-statistic doesn't depend on the prior probability. But you can't use the B statistic to decide a suitable cut off for your experiment unless the proportion of DE genes has been estimated. Since the adjusted P Values rank the genes in the same order as B... why use B at all? Volcano plots (M vs B) are useful no matter how B is estimated, but they would be more meaningful if we could say something like "for B>3 we can be pretty confident these genes are truly DE" > 5. If you have a prior belief based on sound *biological* grounds > that the overall proportion of DE genes should be substantially > greater or less than 0.01, then you are permitted and encouraged to > set the proportion for yourself. I don't, really. I may believe that more genes are DE in some types of experiments than others, but I can't tell what the proportion should be, based on prior knowledge of the biology. Sometimes I expect "many" changes, and others "not so many", and the observations fit... but how many is many? I don't know. That's why I thought about maybe using FDR, but it seems that the issue is more complex than that (ah, it always is...) > 6. If you really can't resist estimating the proportion of DE genes, > the best method that I know of it that proposed by Ferkinstad et al > (2005). This is implemented in limma in the convest() function. In > limma you can use > > p0 <- convest(fit$p.value[,"coefofinterest"]) > > to estime the proportion of non-DE genes for your contrast of interest. > > Ferkingstad, E., Langaas, M., and Lindqvist, B. (2005). Estimating > the proportion of true null hypotheses, with application > to DNA microarray data. Journal of the Royal Statistical Society > Series B, 67, 555-572. > http://www.math.ntnu.no/~mettela/SFG/research.imf thank you very much for that, Gordon. > 7. Finally, note that the number of truly DE genes is likely to be > quite a bit greater than the number of statistically significant DE > genes. This is because there may be many genes which are DE but with > such small fold changes that you have no realistic chance of > detecting them. This fact has two consequences. Firstly it means that > you can't estimate the proportion simply by looking at the number of > significant genes. Secondly it means that the proportion of truly DE > may actually be of no real biological interest even if you knew it. > This is because it includes, perhaps even is mostly made up of, genes > of no practical interest because the fold changes too small to be important. This is very true... So, in practical terms, it's probably best to stick to P values when I need to make cut offs, and use B if I think a volcano plot ilustrates better the point I want to make about the DE genes in a particular experiment, but without giving too much importance to the actual value (or use the convest function to estimate the proportion of DE genes, but bearing in mind that -as you point out above- the true proportion is likely to be larger, just beyond our limits of detection)... does that sound about right? Jose -- Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 6513374 Institute for Cell & Molecular Biology Fax: +44 (0)131 6507360 Swann Building, Mayfield Road University of Edinburgh Edinburgh EH9 3JR UK