A question about Limma


Gordon Smyth 52k

@gordon-smyth

Last seen 7 hours ago

WEHI, Melbourne, Australia

> Date: Sun, 2 Jan 2005 14:05:15 -0800 (PST)
> From: "Fangxin Hong" <fhong@salk.edu>
> Subject: [BioC] A question about Limma
> To: bioconductor@stat.math.ethz.ch
> Message-ID: <1867.66.75.240.64.1104703515.squirrel@66.75.240.64>
>
> Hi Bioconductor users;
> I have a general question about the limma model.
> In the limma package, usually one linear model applies to all genes, and the error
> variances from all genes are modified simultaneously. What if some
> factor, for example one main effect, is only significant for some genes,
> and we want to identify genes based on the significance of another main
> effect (of interest)? What is the best way to do it? Currently I just
> leave this factor in the model, which is applied to all genes,

That's what I do, leave all terms in the models for all the genes. I don't see a strong case for doing a separate model selection process for every gene.

> but this might under-estimate the total number of genes on which the effect of
> interest is significant.

Why do you think so? The only disadvantage of keeping a non-significant term in the model is a reduction in residual degrees of freedom, with some consequential loss of power, but this disadvantage is mitigated by the empirical Bayes moderation process.
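As an illustration of why the lost degree of freedom matters less after moderation, here is a minimal sketch of the usual shrinkage formula behind limma's moderated t-statistic: the gene-wise residual variance is averaged with a prior variance, weighted by their degrees of freedom, and the test uses the sum of the two. The function name and all numbers are hypothetical; in limma the prior variance and prior degrees of freedom are estimated from all genes, whereas here they are simply assumed.

```python
def moderated_variance(s2, df_resid, s2_prior, df_prior):
    """Shrink a gene-wise residual variance towards a prior variance.

    Returns the posterior variance and the total degrees of freedom
    (prior df + residual df) available to the moderated t-statistic.
    """
    df_total = df_prior + df_resid
    s2_post = (df_prior * s2_prior + df_resid * s2) / df_total
    return s2_post, df_total


# Hypothetical prior values (assumed here, not estimated from data).
S2_PRIOR, DF_PRIOR = 0.04, 4.0

# The same gene fitted with and without one extra nuisance term:
# the extra term costs one residual degree of freedom (3 -> 2).
with_term = moderated_variance(0.10, 2, S2_PRIOR, DF_PRIOR)
without_term = moderated_variance(0.10, 3, S2_PRIOR, DF_PRIOR)

print("with nuisance term:   ", with_term)
print("without nuisance term:", without_term)
```

In this made-up case the residual degrees of freedom drop by a third (3 to 2), but the total degrees of freedom used for testing only drop from 7 to 6, which is the sense in which moderation softens the loss.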

Perhaps someday someone will work out a model selection theory for massively parallel regression situations like microarray experiments, but there isn't such a theory now. It seems safer to me to have the same model for every gene, keeping all the 'a priori' important predictors in the model.

Gordon

> I am sorry if this question has been asked/answered here before; I
> couldn't find it by searching the archive. Any comment, suggestion or
> experience is appreciated.
>
> Fangxin
> --
> Fangxin Hong, Ph.D.
> Plant Biology Laboratory
> The Salk Institute
> 10010 N. Torrey Pines Rd.
> La Jolla, CA 92037
> E-mail: fhong@salk.edu

PROcess Microarray limma • 1.7k views

ADD COMMENT • link 20.3 years ago • updated 23 months ago Gordon Smyth 52k


Fangxin Hong ▴ 810

@fangxin-hong-912

Last seen 10.6 years ago

Hi Gordon;

Thanks for the reply. That makes me feel confident about my thoughts.

>> but this might under-estimate the total number of genes on which the effect of
>> interest is significant.
>
> Why do you think so? The only disadvantage of keeping a non-significant
> term in the model is a reduction in residual degrees of freedom, with some
> consequential loss of power, but this disadvantage is mitigated by the
> empirical Bayes moderation process.

In a model selection framework, deleting one non-significant effect from the model might make another effect become significant (for example, P < 0.05). However, since the empirical Bayes moderation process is able to modify the error variance, that should be fine.

> Perhaps someday someone will work out a model selection theory for
> massively parallel regression situations like microarray experiments, but
> there isn't such a theory now. It seems safer to me to have the same model
> for every gene, keeping all the 'a priori' important predictors in the model.

I agree.

Fangxin

ADD COMMENT • link 20.3 years ago Fangxin Hong ▴ 810


Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 4.0 years ago

United States

Reducing the model based on removing nonsignificant effects is called "pre-test estimation". It is known to increase the false-positive rate, even in the classical setting. In the microarray setting, there is no compelling reason to use pre-test estimators that differ from gene to gene.

--Naomi Altman

Naomi S. Altman 814-865-3791 (voice)
Associate Professor
Bioinformatics Consulting Center
Dept. of Statistics 814-863-7114 (fax)
Penn State University 814-865-1348 (Statistics)
University Park, PA 16802-2111
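The inflation of the false-positive rate under pre-testing can be seen in a small classical simulation. What follows is only a toy sketch under stated assumptions, not anything from limma: a single regression with two correlated ±1 factors, a known error SD (so z-tests can be used), a truly null effect of interest and a modest but real nuisance effect. All names and numbers are made up for illustration. Note that the correlation between the factors is what drives the bias here; in a perfectly balanced, orthogonal design this particular mechanism would vanish.

```python
import math
import random

# Fixed design: two centred +/-1 factors with correlation 0.6 (n = 20).
X1 = [1.0] * 10 + [-1.0] * 10
X2 = [1.0] * 8 + [-1.0] * 2 + [1.0] * 2 + [-1.0] * 8

S11 = sum(a * a for a in X1)
S22 = sum(b * b for b in X2)
S12 = sum(a * b for a, b in zip(X1, X2))
DET = S11 * S22 - S12 * S12

SIGMA = 1.0        # error SD, treated as known so that z-tests apply
BETA1 = 0.0        # effect of interest: truly null
BETA2 = 0.5        # nuisance effect: real but modest
Z_CRIT = 1.959964  # two-sided 5% critical value of the standard normal


def simulate(n_sim=5000, seed=1):
    """Return (false-positive rate of full model, rate of pre-test procedure)."""
    rng = random.Random(seed)
    reject_full = reject_pretest = 0
    for _ in range(n_sim):
        y = [BETA1 * a + BETA2 * b + rng.gauss(0.0, SIGMA)
             for a, b in zip(X1, X2)]
        s1y = sum(a * v for a, v in zip(X1, y))
        s2y = sum(b * v for b, v in zip(X2, y))
        # Full model y ~ x1 + x2 (the toy model has no intercept).
        b1 = (S22 * s1y - S12 * s2y) / DET
        b2 = (S11 * s2y - S12 * s1y) / DET
        z1_full = b1 / (SIGMA * math.sqrt(S22 / DET))
        z2_full = b2 / (SIGMA * math.sqrt(S11 / DET))
        sig_full = abs(z1_full) > Z_CRIT
        reject_full += sig_full
        # Pre-test: keep x2 only if it is significant, otherwise refit y ~ x1.
        if abs(z2_full) > Z_CRIT:
            reject_pretest += sig_full
        else:
            z1_reduced = (s1y / S11) / (SIGMA / math.sqrt(S11))
            reject_pretest += abs(z1_reduced) > Z_CRIT
    return reject_full / n_sim, reject_pretest / n_sim


full_rate, pretest_rate = simulate()
print(f"always keep the nuisance term: {full_rate:.3f}")
print(f"drop it when non-significant:  {pretest_rate:.3f}")
```

The always-keep strategy should stay near the nominal 5%, while the drop-when-non-significant strategy rejects the true null for the effect of interest far more often, because the omitted nuisance effect leaks into the estimate of the correlated factor.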

ADD COMMENT • link 20.3 years ago Naomi Altman ★ 6.0k


I agree. In my reply to Fangxin I should have added that I would remove a non-essential effect like a dye-effect if it appeared non-significant, but I'd remove it for all the genes.

Gordon

ADD REPLY • link 20.3 years ago Gordon Smyth 52k


Is it possible that the dye effect still tests as significant for some genes, let's say 40% of genes? Do we remove or keep this effect for all genes?

I met this problem: the factor origin (like different laboratories) was significant for > 50% of genes, and what I did was keep it in the model for all genes. However, I can't figure out a nice explanation of this, like why the dye effect is only significant for 40% of genes, and what this tells us about the effect.

Thanks.
Fangxin

ADD REPLY • link 20.3 years ago Fangxin Hong ▴ 810


40% sounds to me like a lot of genes. I keep it in. Not even the strongest effect will be significant for every gene. And non-significance doesn't mean the effect is zero.

Whether you keep a nuisance effect in also depends on the size of the experiment. With many arrays, definitely keep it in. With very few arrays the cost of estimating a nuisance parameter is relatively greater. Where's the cutoff? Don't know. Only experience will tell. I am currently thinking the cutoff for a dye-effect with two-color replicated dye-swap data is around 4 arrays, depending obviously on the technology.
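To put rough numbers on the cost of a nuisance parameter, the sketch below counts residual degrees of freedom for a replicated dye-swap experiment analysed on log-ratios, assuming one coefficient for the treatment effect and, optionally, one more for the dye effect. It then compares how noisy the gene-wise residual variance is in each case, using the standard result that for normal errors the relative SD of a variance estimate with d degrees of freedom is sqrt(2/d). The function names are hypothetical, and this looks only at the raw gene-wise variances, before any empirical Bayes moderation.

```python
import math


def residual_df(n_arrays, n_coefficients):
    """Residual degrees of freedom of a per-gene linear model."""
    return n_arrays - n_coefficients


def relative_sd_of_variance(df):
    """Relative SD of a residual variance estimate on `df` degrees of freedom.

    For normal errors s^2 ~ sigma^2 * chi2(df) / df, so sd(s^2) / sigma^2
    equals sqrt(2 / df).
    """
    return math.sqrt(2.0 / df)


rows = []
for n in (3, 4, 6, 8, 12):
    df_without = residual_df(n, 1)  # treatment effect only
    df_with = residual_df(n, 2)     # treatment effect + dye effect
    noise_ratio = (relative_sd_of_variance(df_with)
                   / relative_sd_of_variance(df_without))
    rows.append((n, df_without, df_with, noise_ratio))

print("arrays  df(no dye)  df(dye)  variance-noise ratio")
for n, df_without, df_with, noise_ratio in rows:
    print(f"{n:6d}  {df_without:10d}  {df_with:7d}  {noise_ratio:20.2f}")
```

With a handful of arrays the extra coefficient takes a large share of the available degrees of freedom, while with a dozen arrays the penalty is only a few percent, which is consistent with the advice to keep the term in larger experiments.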

Gordon

ADD REPLY • link 20.3 years ago • updated 23 months ago Gordon Smyth 52k


The dye effect is likely to be significant for only some genes (after normalization) because the dye is a chemical reagent that binds to the cDNA or RNA. The binding properties of the 2 dyes differ, and the chemical compositions of all cDNA fragments are different. Sorry I cannot give more exact chemistry info, but that is the reason stripped of the technicalities.

The reason I put "after normalization" is that scanner settings and how the labeling was performed can affect the mean detection over the entire array, but the mean effect (over all genes) is removed during normalization. So, what we are calling the "dye-effect" is called the "dye by gene interaction" in some papers such as Churchill and Kerr, 2001.

--Naomi
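The distinction between the array-wide mean effect and the gene-specific dye effect can be sketched with a single dye-swap pair. The example below is a toy under stated assumptions: five genes with made-up treatment effects and dye biases, an arbitrary shift added to each array, and plain global median centring standing in for normalization (real two-color pipelines often use intensity-dependent methods such as loess instead). All names and values are hypothetical.

```python
from statistics import median

# Hypothetical true values for five genes (log2 scale).
treatment = [0.0, 0.0, 0.0, 1.0, -1.0]  # treatment log-ratio per gene
dye_bias = [0.0, 0.5, -0.5, 0.0, 0.0]   # gene-specific dye effect
array_shift = (0.3, -0.2)               # array-wide shifts (scanner, labelling)

# Array 1 uses one dye orientation; array 2 swaps the dyes, so the treatment
# sign flips while the gene-specific dye bias keeps its sign.
m1 = [t + d + array_shift[0] for t, d in zip(treatment, dye_bias)]
m2 = [-t + d + array_shift[1] for t, d in zip(treatment, dye_bias)]


def median_centre(log_ratios):
    """Global median centring: removes only the array-wide shift."""
    centre = median(log_ratios)
    return [m - centre for m in log_ratios]


m1c, m2c = median_centre(m1), median_centre(m2)

# Half the difference of the pair estimates the treatment effect;
# half the sum estimates what normalization left behind: the dye-by-gene term.
est_treatment = [(a - b) / 2 for a, b in zip(m1c, m2c)]
est_dye = [(a + b) / 2 for a, b in zip(m1c, m2c)]

for g, (t, d) in enumerate(zip(est_treatment, est_dye), start=1):
    print(f"gene {g}: treatment {t:+.2f}, dye-by-gene {d:+.2f}")
```

Centring removes the two array-wide shifts, yet genes 2 and 3 still show a non-zero dye term, which is why such an effect can remain significant for a subset of genes after normalization.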

ADD REPLY • link 20.3 years ago Naomi Altman ★ 6.0k


> The dye effect is likely to be significant for only some genes (after
> normalization) because the dye is a chemical reagent that binds to the
> cDNA or RNA.
> The binding properties of the 2 dyes differ, and the chemical compositions
> of all cDNA fragments are different. Sorry I cannot give more exact
> chemistry info but that is the reason stripped of the technicalities.

That makes sense.

> The reason I put "after normalization" is that scanner settings and how
> the labeling was performed can affect the mean detection over the entire
> array, but the mean effect (over all genes) is removed during normalization.

Will normalization completely remove the mean effect of dye? I haven't studied spotted cDNA arrays much, but for Affy arrays, normalization can't completely remove effects such as lab. I normalized data generated at two labs (similar experimental settings) together, but the limma model finds that the mean effect, which is lab here, is still significant for >50 genes, with the lab*treatment interaction included. It is reasonable for the dye effect to be a "dye-by-gene interaction", but it seems the lab effect should be a mean effect for all genes. Why can't normalization remove it completely?

Thanks
Fangxin

ADD REPLY • link 20.3 years ago Fangxin Hong ▴ 810
