combining GSA and lmFit


Gordon Smyth 52k


WEHI, Melbourne, Australia

Dear Dick,

Anything in GSA which works with the SAM statistic should also work fine with limma moderated t-statistics.

However, there are several issues that come to my mind which affect both statistics. Firstly, both SAM and limma statistics depend on the whole ensemble of genes, i.e., they are not merely computed genewise. This is unlike the floored mean statistics assumed in the GSA theory paper. This has clear computational implications, but could also give rise to some theoretical issues.

Secondly, it's not too clear to me whether it makes sense to compute regularized or moderated statistics after the standardization steps that GSA does.

Thirdly, GSA computes p-values from permutation, and permutation does not perform well for linear models.

These are simply my thoughts, which you asked for. You may have ways around all these issues.

Best wishes
Gordon

> Date: Sun, 26 Apr 2009 20:43:28 -0700 (PDT)
> From: Dick Beyer <dbeyer at u.washington.edu>
> Subject: [BioC] combining GSA and lmFit
> To: Bioconductor <bioconductor at stat.math.ethz.ch>
>
> Hi All,
>
> I have extended the GSA code (http://www-stat.stanford.edu/~tibs/GSA/)
> to include lmFit() from the limma package so as to have linear model
> capabilities with GSA. Basically, I'm using the modified t-statistic
> values from lmFit just like the SAM-like t-statistic values are used in
> the GSA code.
>
> I was wondering if anyone had any thoughts on whether this was, in
> principle, an OK thing to be doing. I am worrying about whether there
> is an underlying issue I'm not aware of in using the moderated
> t-statistic values from limma as opposed to the SAM t-statistic values
> that use the s0 term in the denominator.
>
> My tests on some microarray data I have show that in a qqplot of
> t-statistic values from the two methods, they are in pretty close
> agreement except for large values of the t-values.
>
> If anyone knows of reasons not to be doing this or could point me to
> places with possible explanations, I'd be very grateful.
>
> Cheers,
> Dick
>
> Richard P. Beyer, Ph.D.        University of Washington
> Tel.: (206) 616 7378           Env. & Occ. Health Sci., Box 354695
> Fax: (206) 685 4696            4225 Roosevelt Way NE, # 100
>                                Seattle, WA 98105-6099
> http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html
> http://staff.washington.edu/~dbeyer
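The point that both statistics depend on the whole ensemble of genes can be illustrated with a small sketch. It computes, for one gene in a two-group comparison, a SAM-style statistic (fudge constant `s0` added to the standard error) and a limma-style moderated t (genewise variance shrunk towards a prior variance). The function names and the inputs `s0`, `prior_df` and `prior_var` are illustrative assumptions, not calls from the GSA, samr or limma packages. In the real methods these quantities are estimated from all genes, which is why neither statistic is purely genewise; here they are simply passed in as numbers.

```python
import math
import statistics


def pooled_variance(x, y):
    """Pooled within-group variance for two groups."""
    n1, n2 = len(x), len(y)
    ss = (n1 - 1) * statistics.variance(x) + (n2 - 1) * statistics.variance(y)
    return ss / (n1 + n2 - 2)


def sam_like_d(x, y, s0):
    """SAM-style statistic: difference / (standard error + s0).

    s0 is an assumed, externally supplied fudge constant; SAM itself
    chooses it from the distribution of standard errors across genes.
    """
    diff = statistics.fmean(x) - statistics.fmean(y)
    se = math.sqrt(pooled_variance(x, y) * (1 / len(x) + 1 / len(y)))
    return diff / (se + s0)


def moderated_t(x, y, prior_df, prior_var):
    """limma-style moderated t with an assumed prior (prior_df, prior_var).

    limma estimates the prior from all genes by empirical Bayes; here it
    is a hypothetical input.
    """
    df = len(x) + len(y) - 2
    s2 = pooled_variance(x, y)
    # Posterior variance: weighted average of prior and genewise variance.
    s2_post = (prior_df * prior_var + df * s2) / (prior_df + df)
    diff = statistics.fmean(x) - statistics.fmean(y)
    return diff / math.sqrt(s2_post * (1 / len(x) + 1 / len(y)))


# Hypothetical expression values for one gene in two groups.
group1 = [5.1, 4.8, 5.6]
group2 = [4.0, 4.3, 3.9]

ordinary_t = moderated_t(group1, group2, prior_df=0.0, prior_var=1.0)
d_stat = sam_like_d(group1, group2, s0=0.2)
mod_t = moderated_t(group1, group2, prior_df=4.0, prior_var=0.5)
print(ordinary_t, d_stat, mod_t)
```

With `s0 = 0` and `prior_df = 0` both functions reduce to the ordinary two-sample t-statistic. Any other choice changes the statistic for this gene according to values that, in practice, come from every other gene on the array.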

Microarray limma • 1.8k views

ADD COMMENT • link updated 15.9 years ago by Dick Beyer ★ 1.4k • written 15.9 years ago by Gordon Smyth 52k


Dick Beyer ★ 1.4k


Hi Gordon,

Thanks for sharing your views on this topic.

I was wondering, when you say "Thirdly, GSA computes p-values from permutation, and permutation does not perform well for linear models," what is it about the permutation approach that does not perform very well? Is it due to the null hypothesis being equality of distributions rather than assuming you are testing equality of means?

I see that the Bioconductor package GSEAlm, which uses linear models with GSEA and uses sample permutations, might have similar problems to combining GSA and lmFit. I guess I was thinking a GSA/lmFit combination would be OK because of people using GSEAlm. But if the underlying null assumption is equality of distributions rather than a weaker null of equality of means, then that would be important to keep in mind when interpreting the resulting p-values.

I wonder if there would be a useful way to test the issues you raise about the effect on gene set analysis when using limma or SAM statistics which depend on ensembles of genes, as well as the effect of the moderated statistics of limma or SAM on the GSA standardization method. I'm not sure I understand the GSA steps well enough to know if those are designed to take care of problems you might otherwise use a SAM-type statistic to deal with.

Lots to think about. Thanks very much for your comments.

Cheers,
Dick

ADD COMMENT • link 15.9 years ago Dick Beyer ★ 1.4k


Dear Dick,

On Tue, 28 Apr 2009, Dick Beyer wrote:

> Hi Gordon,
>
> Thanks for sharing your views on this topic.
>
> I was wondering, when you say "Thirdly, GSA computes p-values from
> permutation, and permutation does not perform well for linear models,"
> what is it about the permutation approach that does not perform very
> well? Is it due to the null hypothesis being equality of distributions
> rather than assuming you are testing equality of means?

Well, you could put it like that. Suppose you have a one-way layout. The trouble is that the groups other than the two you want to compare interfere rather than help with the permutations. Have a look at the research literature: there's very little on permutation outside the two-group problem.

> I see that the Bioconductor package GSEAlm which uses linear models with
> GSEA and uses sample permutations might have similar problems as
> combining GSA and lmFit. I guess I was thinking a GSA/lmFit combination
> would be OK because of people using GSEAlm. But if the underlying null
> assumption is equality of distributions rather than a weaker null of
> equality of means, then that would be important to keep in mind when
> interpreting the resulting p-values.

My understanding is that GSEAlm implements proposals from a nice paper published in Bioinformatics and is concerned with issues different to your GSA/lm combination, but the authors may have more to say.

> I wonder if there would be a useful way to test the issues you raise
> about the effect on gene set analysis when using limma or SAM statistics
> which depend on ensembles of genes, as well as the effect of the
> moderated statistics of limma or SAM on the GSA standardization method.
> I'm not sure I understand the GSA steps enough to know if those are
> designed to take care of problems you might otherwise use a SAM-type
> statistic to deal with.

If it was easy to do GSA for linear models, I expect that Efron and Tibshirani would have done it. I'm not sure it is wise to cobble something ad hoc together if you don't understand the theory very well.

I didn't reply in order to point you to my work, but my group has taken an approach to GSEA for linear models which we're very happy with, which is implemented in the roast() and romer() functions of the limma package.

Best wishes
Gordon

> Lots to think about. Thanks very much for your comments.
>
> Cheers,
> Dick
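One possible reading of the one-way layout remark can be sketched with toy numbers. Suppose groups A and B have the same mean and a third group C is far away. Under the hypothesis "A equals B" alone, the C samples are not exchangeable with the others, so shuffling all labels builds a reference distribution for a different, global null. The data, the function names and the choice of the raw mean difference as the statistic are all assumptions made for this illustration; nothing here is taken from GSA, GSEAlm or limma.

```python
import random
import statistics

# Hypothetical toy data for one gene: A and B share the same mean,
# while group C sits far away from both.
values = [-1.2, -0.4, 0.1, 0.3, 0.8, 1.1,     # group A
          -0.9, -0.5, 0.0, 0.2, 0.6, 1.3,     # group B
          9.1, 9.6, 10.0, 10.2, 10.5, 11.0]   # group C
labels = ["A"] * 6 + ["B"] * 6 + ["C"] * 6


def mean_diff(values, labels):
    """Mean of the samples labelled A minus mean of those labelled B."""
    a = [v for v, g in zip(values, labels) if g == "A"]
    b = [v for v, g in zip(values, labels) if g == "B"]
    return statistics.fmean(a) - statistics.fmean(b)


def permuted_diffs(values, labels, movable, n_perm, rng):
    """Shuffle labels only among samples whose label is in `movable`."""
    idx = [i for i, g in enumerate(labels) if g in movable]
    diffs = []
    for _ in range(n_perm):
        shuffled = [labels[i] for i in idx]
        rng.shuffle(shuffled)
        permuted = list(labels)
        for i, g in zip(idx, shuffled):
            permuted[i] = g
        diffs.append(mean_diff(values, permuted))
    return diffs


rng = random.Random(2009)
# Shuffle labels across all three groups (global null: A = B = C).
full = permuted_diffs(values, labels, {"A", "B", "C"}, 2000, rng)
# Shuffle labels only within A and B (null of interest: A = B).
restricted = permuted_diffs(values, labels, {"A", "B"}, 2000, rng)

sd_full = statistics.stdev(full)
sd_restricted = statistics.stdev(restricted)
print(f"spread of permuted A-B differences, all labels shuffled: {sd_full:.2f}")
print(f"spread of permuted A-B differences, A/B labels only:     {sd_restricted:.2f}")
```

In this toy case the reference distribution from shuffling all labels is much wider than the one from shuffling only A and B, because values from C leak into the permuted A and B groups. Restricting the shuffle fixes that here, but it leaves far fewer distinct permutations when groups are small, and no equally simple restriction is available once the design includes covariates or more complicated contrasts.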

ADD REPLY • link 15.9 years ago Gordon Smyth 52k


Hi Gordon,

Thanks for your thoughtful comments. I think you have steered me in the right direction. I see, after a more careful reading of your methods in limma and those in GSEAlm and GSA, that not knowing how to do permutations correctly when there are more than two groups is a problem.

I'm glad for all the help in keeping me from doing something dumb.

Cheers,
Dick

ADD REPLY • link 15.9 years ago Dick Beyer ★ 1.4k


Hi,

A very slightly contrary point of view:

Gordon K Smyth wrote:

> Dear Dick,
>
> On Tue, 28 Apr 2009, Dick Beyer wrote:
>
>> Hi Gordon,
>>
>> Thanks for sharing your views on this topic.
>>
>> I was wondering, when you say "Thirdly, GSA computes p-values from
>> permutation, and permutation does not perform well for linear models,"
>> what is it about the permutation approach that does not perform very
>> well? Is it due to the null hypothesis being equality of
>> distributions rather than assuming you are testing equality of means?
>
> Well, you could put it like that. Suppose you have a one-way layout.
> The trouble is that the groups other than the two you want to compare
> interfere rather than help with the permutations. Have a look at the
> research literature -- there's very little on permutation outside the
> two group problem.

AFAIK there has been a great deal of work, and some of the fundamental problems in dealing with complex experiments have been addressed in recent years. Pesarin's "Multivariate Permutation Tests" is one reasonable book, and the coin package from CRAN offers another view. There are a few other tools around.

>> I see that the Bioconductor package GSEAlm which uses linear models
>> with GSEA and uses sample permutations might have similar problems as
>> combining GSA and lmFit. I guess I was thinking a GSA/lmFit
>> combination would be OK because of people using GSEAlm. But if the
>> underlying null assumption is equality of distributions rather than a
>> weaker null of equality of means, then that would be important to keep
>> in mind when interpreting the resulting p-values.
>
> My understanding is that GSEAlm implements proposals from a nice paper
> published in Bioinformatics and is concerned with issues different to
> your GSA/lm combination, but the authors may have more to say.

Not sure I am following the necessary distinction here. Yes, in general permutation tests model equality of distributions, and that is what is presently going on in GSEAlm. However, one can take much stronger views using either of the references posted above.

best wishes
Robert

--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org

ADD REPLY • link 15.9 years ago rgentleman ★ 5.5k
