"romer"ing and "roast"ing around gene sets


Robert M. Flight ▴ 280

@robert-m-flight-4158

Last seen 4 months ago

United States

Hi All,

I am having trouble with the distinction between the functions "roast" and "romer" in the limma package. From the publication describing "roast" (http://dx.doi.org/10.1093/bioinformatics/btq401), it seems that it tests a particular gene set for differential expression, whereas "romer" tests a battery of sets to find those that are differentially expressed compared to the rest?

I am really having trouble discerning the true difference between these two, and how they compare to GSEA. I always thought that the primary purpose of GSEA was to determine those gene sets that are significantly associated with a phenotypic comparison, i.e. those gene sets showing differential expression.

If anyone can help me clear this up, that would be great, because as of now I am thoroughly confused. To me, if I have a dataset, and I want to know which gene sets (from say MSigDB) are differentially expressed, then it sounds like I would use "roast". But the way it is described in the publication (and the help in limma), this isn't what I would do; rather, I should use "romer" and see if any of the sets show differential expression compared to the rest in the database.

Color me confused,

-Robert

Robert M. Flight, Ph.D.
Bioinformatics and Biomedical Computing Laboratory
University of Louisville
Louisville, KY

PH 502-852-0467
EM robert.flight at louisville.edu
EM rflight79 at gmail.com

Williams and Holland's Law: If enough data is collected, anything may be proven by statistical methods.

limma • 3.3k views

ADD COMMENT • link updated 14.5 years ago by Gordon Smyth 52k • written 14.5 years ago by Robert M. Flight ▴ 280


Di Wu ▴ 190

@di-wu-1837

Last seen 10.3 years ago

Hi Robert,

Thank you for asking. I am sorry if the two methods caused confusion.

Your understanding of the "roast" paper and of "romer" in your first paragraph is right.

The null hypothesis in "roast" concerns a focused gene set, as demonstrated in the real data example of the "roast" paper. The "mroast" function in limma can test multiple gene sets at the same time. Essentially, "mroast" does the same as running a series of "roast" calls followed by a multiple testing adjustment of the p-values, but it is computationally more efficient than a series of "roast" calls.

The null hypothesis in roast is equivalent to the self-contained hypothesis discussed by Goeman and Bühlmann (2007), Bioinformatics. 2007 Apr 15;23(8):980-7, http://www.ncbi.nlm.nih.gov/pubmed/17303618

From my point of view, GSEA tests a combined hypothesis, as it uses ranks (a test score for a competitive test, relative to other genes) together with sample label permutations (which generate the null distribution under the self-contained hypothesis, "significantly associated with a phenotypic comparison").

Roast can handle small sample sizes as well as larger sample sizes. Sample size is a concern for the self-contained hypothesis (focused set).

In your case, I think romer is more suitable. Let me know if you have more questions regarding these two functions in limma.

I am still using my Monash email address on the mailing list, though I am at WEHI now.

Cheers,
Di
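As an illustration of the self-contained test of one focused set described above, here is a minimal sketch using roast(). The simulated matrix `y`, the two-group `design`, and the index vector `set1` are made up for this example, and the argument names follow the roast() help page in recent limma releases, so they may differ in older versions.

```r
library(limma)

set.seed(1)
# Simulated data (illustrative only): 100 genes, 3 control + 3 treated arrays
y <- matrix(rnorm(100 * 6), nrow = 100, ncol = 6)
design <- cbind(Intercept = 1, Group = c(0, 0, 0, 1, 1, 1))

# Hypothetical focused set: the first 10 genes, shifted up in the treated group
set1 <- 1:10
y[set1, 4:6] <- y[set1, 4:6] + 2

# Self-contained test: is this one set differentially expressed at all?
res <- roast(y, index = set1, design = design, contrast = 2, nrot = 999)

# Rotation p-values for the Down, Up and Mixed alternatives
res$p.value
```

Nothing outside `set1` enters the test statistic here, which is what makes the hypothesis self-contained rather than competitive.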

ADD COMMENT • link 14.5 years ago Di Wu ▴ 190


Hi Di,

Thank you for the clarification. I think that helps me to better know which one of the two to use.

Cheers,

-Robert

ADD REPLY • link 14.5 years ago Robert M. Flight ▴ 280


Gordon Smyth 52k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

Dear Robert,

I'm just adding briefly to Di's comments.

> I am having trouble with the distinction between the functions "roast" and "romer" in the limma package. From the publication describing "roast" (http://dx.doi.org/10.1093/bioinformatics/btq401), it seems that it tests a particular gene set for differential expression, whereas "romer" tests a battery of sets to find those that are differentially expressed compared to the rest?

Yes.

> I am really having trouble discerning the true difference between these two, and how they compare to GSEA. I always thought that the primary purpose of GSEA was to determine those gene sets that are significantly associated with a phenotypic comparison, i.e. those gene sets showing differential expression.

This is an understandable assumption, which isn't quite true! GSEA actually tries to pick out the sets that stand out as more strongly differentially expressed (DE) than others. So, if all the sets were DE to exactly the same degree, then GSEA wouldn't find anything significant, because no set would stand out from the others. This is actually a biologically well-motivated approach when you are testing large numbers of sets.

If you want to test every set in the MSigDB, then testing one by one with roast() would probably be just too slow anyway. romer() is more efficient when the number of sets is very large.

Beware that romer(), like GSEA, tends to give pretty modest p-values. The ranking of the sets may be more useful than the absolute p-values.

Best wishes
Gordon
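A minimal sketch of the competitive, GSEA-like approach with romer(), ranking the sets rather than reading the p-values in isolation. The simulated data and the three gene sets (`setA`, `setB`, `setC`) are hypothetical, and the calls assume the romer() and topRomer() interfaces documented in recent limma releases.

```r
library(limma)

set.seed(2)
# Simulated data (illustrative only): 200 genes, 3 control + 3 treated arrays
y <- matrix(rnorm(200 * 6), nrow = 200, ncol = 6)
design <- cbind(Intercept = 1, Group = c(0, 0, 0, 1, 1, 1))

# Hypothetical battery of gene sets, given as a named list of row indices
sets <- list(setA = 1:20, setB = 21:40, setC = 41:60)
y[sets$setA, 4:6] <- y[sets$setA, 4:6] + 2   # only setA is shifted up

# Competitive test: does a set stand out relative to the other genes?
rom <- romer(y, index = sets, design = design, contrast = 2, nrot = 999)

# Order the sets by evidence of up-regulation; the ordering is often
# more informative than the absolute p-values
topRomer(rom, n = 3, alternative = "up")
```

With real data the list of indices would typically be built from a gene set collection, for example with ids2indices(), rather than typed in by hand.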

ADD COMMENT • link 14.5 years ago Gordon Smyth 52k


Dear Gordon,

Reading your email, I think there is something I am not following completely. You say, regarding the GSEA-like approach in "romer":

> This is actually a biologically well-motivated approach when you are testing large numbers of sets.
>
> If you want to test every set in the MSigDB, then testing one by one with roast() would probably be just too slow anyway. romer() is more efficient when the number of sets is very large.

What I found very attractive about roast is that the differential expression test is done for groups of genes so, in addition to possible increases in power, interpretation is simplified (e.g., if we use all the GO categories, we deal only with ~1500 entities). Even if the examples in your Bioinformatics paper involve just a few sets, I was thinking about systematically using roast on, say, all GO categories, or all the 690 canonical pathways.

Moreover, if we want to use the "focused gene testing", even if roast takes longer, I do not see how the greater efficiency of romer would make it an alternative procedure: they are answering different questions, right?

But now, I am starting to think that systematically testing all 1500 GO categories might be a bad idea.

Best,

R.

P.S. The help for roast says that y must be a numeric matrix. But I think it works fine with ExpressionSet objects directly, too.

ADD REPLY • link 14.5 years ago Ramon Diaz ★ 1.1k


Dear Ramon,

I agree. Using roast() on a database of gene sets is fine as long as you allow for multiple testing in an appropriate way. We provide the mroast() function to try to make this easier. My lab recently had occasion to use mroast() with all canonical pathways and we found that it took only a few minutes on an oldish PC with nrotations=9999.

You're right, romer() and roast() are answering different questions. As long as you're aware of this, then you're on firm ground. The reason why we suggest romer() for really large scale testing is simply that roast() can give so many statistically significant results as to be harder to interpret, especially if you use set.statistic="msq". This might not be a problem for you.

At this stage, roast() is the more mature software product. While we've used romer() for a study published in Blood, we haven't yet published the methodology in its own right, and it will probably be refined a bit more before we do so.

Thanks for the P.S. about the documentation. I've updated it now.

Best regards
Gordon
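A minimal sketch of running mroast() over several sets with 9999 rotations, once with the default set statistic and once with set.statistic="msq". The data and gene sets are simulated and hypothetical. In the limma versions I am aware of, the number of rotations is set through the `nrot` argument, and the result is a data frame whose FDR column holds the adjusted p-values; treat both as assumptions to check against your installed help page.

```r
library(limma)

set.seed(3)
# Simulated data (illustrative only): 200 genes, 3 control + 3 treated arrays
y <- matrix(rnorm(200 * 6), nrow = 200, ncol = 6)
design <- cbind(Intercept = 1, Group = c(0, 0, 0, 1, 1, 1))

# Hypothetical gene sets as a named list of row indices
sets <- list(setA = 1:20, setB = 21:40, setC = 41:60)
y[sets$setA, 4:6] <- y[sets$setA, 4:6] + 2   # only setA is shifted up

# One roast test per set, with multiple testing adjustment across sets
res_mean <- mroast(y, index = sets, design = design, contrast = 2,
                   nrot = 9999, set.statistic = "mean")
res_msq  <- mroast(y, index = sets, design = design, contrast = 2,
                   nrot = 9999, set.statistic = "msq")

# Compare the two set statistics side by side
res_mean[, c("NGenes", "Direction", "PValue", "FDR")]
res_msq[,  c("NGenes", "Direction", "PValue", "FDR")]
```

Comparing the two tables on your own data is one way to judge whether "msq" flags more sets than is comfortable to interpret.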

ADD REPLY • link 14.4 years ago Gordon Smyth 52k


Dear Gordon,

Thanks a lot for your answer. It clarifies all the issues. And, of course, thanks for such a nice piece of software.

Best,

R.

ADD REPLY • link 14.4 years ago Ramon Diaz ★ 1.1k
