GOStat and multiple testing


Arne.Muller@aventis.com ▴ 620


Hello,

I was wondering whether one needs to correct the p-values from the hypergeometric test in GOstats for multiple testing, since one performs many tests (one for each GO category found in the gene list). I'm not sure that correction for multiple testing makes sense, since the GO terms are highly dependent (terms lie on the same branch, and one gene is annotated to several terms).

Robert Gentleman mentions in the GOstats documentation that the multiple testing issue is not solved yet. I assume GOHyperG does not perform any kind of multiple testing correction; is this right?

I'd be happy to receive comments on this and to hear about your experience.

Kind regards,
Arne
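For readers who want to see the test under discussion in concrete form, here is a minimal Python sketch of a one-sided hypergeometric (over-representation) p-value for a single GO term. It is not GOstats code: the function name, argument names, and numbers are made up for illustration, and it assumes every gene in the universe is equally likely to be selected.

```python
from math import comb


def hypergeom_upper_tail(universe: int, annotated: int, selected: int, hits: int) -> float:
    """P(X >= hits) when `selected` genes are drawn without replacement
    from `universe` genes, of which `annotated` carry the GO term."""
    total = comb(universe, selected)
    upper = min(annotated, selected)
    # Sum the hypergeometric probability mass from `hits` up to the
    # largest overlap that is possible.
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 10,000 genes on the array, 100 annotated to the term,
# 200 genes selected as interesting, 8 of those carrying the term.
p = hypergeom_upper_tail(universe=10_000, annotated=100, selected=200, hits=8)
print(f"one-sided p-value: {p:.3g}")
```

Running this once per GO term gives one p-value per term, which is where the multiple testing question comes from.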


Stephen Henderson ★ 1.0k


Perhaps correcting for multiple testing is not so important in this case, as the GO terminology itself is such a subjective entity. GOstats can then be viewed as a useful tool and its scores as a useful index, but an attempt to attain exactness is flawed at a level prior to enumeration. Producing corrected p-values may only encourage over-interpretation.


Sean Davis 21k


On Aug 4, 2004, at 7:06 AM, <arne.muller@aventis.com> wrote:

> I assume GOHyperG does not perform any kind of multiple testing correction, is this right?

It doesn't. I use these results as rough guides to the data, but not as something of "statistical significance". In other words, I think of it as a means to understand the data rather than to prove something about it. Also, making Rgraphviz plots of "significant" categories based on some arbitrary cutoff can give you a sense of the "clustering" of your findings in the GO DAG. This is a visual way of taking into account the highly dependent nature of GO.


rgentleman ★ 5.5k


On Wed, Aug 04, 2004 at 01:06:30PM +0200, Arne.Muller@aventis.com wrote:

> I assume GOHyperG does not perform any kind of multiple testing correction, is this right?

Hi,

it does not, and I am unaware of any general solution to the problem of adjusting p-values here. The structure of GO is such that there are issues due to lack of independence. There are some other problems, but I have not had time to write up my ideas yet.

I have to say that I am also not so convinced that this is the best way to do things (classifying genes as interesting or not, and then doing the hypergeometric test), although I have yet to come up with a better way. I agree with those who have suggested that this is best used as a rough guide to interesting categories (other projects seem to have different opinions, and I think some do use some sort of p-value correction).

Robert
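As an illustration of what "some sort of p-value correction" could look like, here is a small Python sketch of two generic adjustments, Bonferroni and Benjamini-Hochberg. This is not taken from GOstats or from any tool named in this thread; the function names and the p-values are made up. Bonferroni holds under any dependence but is conservative, while the Benjamini-Hochberg guarantee relies on independence or a form of positive dependence, so neither settles the GO dependence problem raised above.

```python
def bonferroni(pvalues):
    """Family-wise error control: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up adjusted p-values (false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.001, 0.008, 0.039, 0.041, 0.27]  # made-up per-term p-values
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```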


Correcting p-values for multiple hypothesis testing in GO analysis is a hard problem conceptually. I'm not aware of any general solution.

In a recently published set of Perl modules for GO term analysis,

http://bioinformatics.oupjournals.org/cgi/content/abstract/bth456v1

we support False Discovery Rate calculations (based on permutations of results) as a substitute. It's probably not perfect, but according to our simulations it's better than either uncorrected p-values or a simple correction (e.g., Bonferroni).

Our software uses a hypergeometric test on a list of selected genes. Another approach would be to calculate a p-value (e.g., by Cox regression) for all genes on a microarray, and test the significance of each GO term using Fisher meta-analysis. (I'm sure I've seen a reference to that approach, but can't recall it now.)

--
Jeremy Gollub, Ph.D.
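The Fisher meta-analysis idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the Perl modules mentioned above: the function name is hypothetical, every p-value must be greater than zero, and Fisher's method treats the per-gene p-values within a term as independent, which co-regulated genes may well violate.

```python
from math import exp, log


def fisher_combined_pvalue(pvalues):
    """Fisher's method: X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom when the k p-values are independent and uniform."""
    k = len(pvalues)
    half_x = -sum(log(p) for p in pvalues)  # this is X / 2
    # Chi-square survival function for an even number of degrees of freedom (2k):
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return exp(-half_x) * total


# Made-up per-gene p-values for the genes annotated to one GO term.
print(fisher_combined_pvalue([0.04, 0.20, 0.01, 0.65]))
```

One combined p-value per GO term still leaves many dependent tests, so the correction question does not go away.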


A.J. Rossini ▴ 810


It (FDR by bootstrapping) doesn't solve the basic problem with lack of independence, which makes it useful but wrong, or just wrong, depending on how pragmatic you want to be.


That's certainly true. We decided to be pragmatic.

If I've understood the problem correctly, there are two major problems with determining the significance of a GO annotation. First is the lack of independence in the DAG (directed acyclic graph) structure. Bootstrapping won't fix that. Second, though, is the problem that, for a small group of test genes at least, any GO term that comes up at all will appear ridiculously significant when using a hypergeometric test. What we found is that FDR calculations seem to deal with this second issue better than a FWER correction.

--
Jeremy Gollub, Ph.D.
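A quick calculation makes the small-list effect concrete. The Python sketch below uses made-up sizes (a 10,000-gene universe and a 10-gene test list) and a hypothetical function name; it computes the chance that a random list hits a term at least once, which is the hypergeometric p-value a single hit would receive.

```python
from math import comb


def p_at_least_one(universe: int, annotated: int, selected: int) -> float:
    """P(X >= 1): probability that a random list of `selected` genes contains
    at least one of the `annotated` genes, drawn from `universe` genes."""
    return 1 - comb(universe - annotated, selected) / comb(universe, selected)


# For small, specific terms a single hit in a short gene list already
# yields a small p-value, roughly selected * annotated / universe.
for annotated in (2, 5, 20):
    print(annotated, p_at_least_one(universe=10_000, annotated=annotated, selected=10))
```

With thousands of small terms in GO, some will receive such a hit by chance alone.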


A.J. Rossini ▴ 810


Correct. And your work continues to confirm the second issue in general, which is nice. But it's the first that is particularly nasty to create a reasonable solution for. I'd really like to see one!

best,
-tony


peter robinson ▴ 300


Dear List,

in light of the recent discussion on multiple testing in GO analysis, a project we recently finished may be of interest to some members of the list:

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=15254011&itool=iconabstr

We took the pragmatic approach of carrying out the analysis 1000 times with randomized data to estimate the statistical significance of our results, which concern a genome-wide analysis of 5' CpG island genes.

For the purposes of microarray analysis, however, my 2c is that there is no substitute for a biologically trained eyeball-o-metric analysis, since enrichment of relatively specific terms with smaller numbers of genes (which may not be statistically significant after multiple-testing correction) certainly may suggest biological meaning.

-Peter
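The randomization idea can be sketched generically in Python. This is not the procedure from the linked paper, whose details are not given here; it is one simple variant, with a hypothetical function name and toy data, that redraws random gene lists of the same size and asks how often they overlap a term at least as much as the real list does.

```python
import random


def empirical_pvalue(universe, term_genes, selected, n_random=1000, seed=0):
    """Fraction of random gene lists (same size as `selected`) whose overlap
    with `term_genes` is at least as large as the observed overlap."""
    rng = random.Random(seed)
    term = set(term_genes)
    observed = len(term & set(selected))
    universe = list(universe)
    as_extreme = 0
    for _ in range(n_random):
        draw = rng.sample(universe, len(selected))
        if len(term.intersection(draw)) >= observed:
            as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (as_extreme + 1) / (n_random + 1)


# Toy data: 500 genes, a term with 20 of them, and a 20-gene list hitting 8.
genes = [f"g{i}" for i in range(500)]
term_genes = genes[:20]
selected = genes[:8] + genes[100:112]
print(empirical_pvalue(genes, term_genes, selected))
```

With 1000 randomizations the smallest reportable value is about 0.001, so the resolution of the estimate is set by the number of random draws.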

ADD COMMENT • link 20.7 years ago peter robinson ▴ 300


Nicholas Lewin-Koh ▴ 430

@nicholas-lewin-koh-63


Hi,

I have been thinking about the problem a little differently, and I am working on a write-up.

The first approach is, rather than testing each node independently, to fit the whole GO tree as a logistic regression, with a {0,1} response for each gene (i.e. expressed or not) and the whole GO tree as the predictors. Obviously that alone is not the solution, but if we treat this as a spatial problem (the DAG being the spatial join matrix) we can create a set of difference penalties on the coefficients that are dictated by the DAG structure. Then we can do inference on the betas to see which terms are influential. One would fit a separate model for each category: Cellular Component, Biological Process and Molecular Function.

The second approach, which I haven't developed as far, is to think of each gene as starting at the root and to look at its survival along paths in the DAG, so that the aggregate is some sort of branching process. I don't know yet if this model is useful.

Also inherent in the analysis is a potentially huge bias due to unannotated genes. I was thinking of approaching this with a kind of mark-recapture approach on the terms, like the stochastic abundance models used in ecology to predict the number of species in a community. We can come up with a bias correction if we have a term abundance distribution for the GO classes.

The logistic model is something I am actively working on; the rest are half-baked thoughts I have been diddling with and haven't had time to chase too far.

Nicholas
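To make the first idea concrete, here is a rough sketch of a logistic regression whose coefficients are tied together along the DAG edges. It is not Nicholas's actual model: the squared-difference form of the penalty, the plain gradient-descent fit, the function names and the toy data are all assumptions made for illustration, and the inference step on the betas is left out.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def edge_penalty(beta, edges):
    # Sum of squared coefficient differences over parent/child pairs.
    return sum((beta[a] - beta[b]) ** 2 for a, b in edges)

def fit_dag_penalized_logistic(X, y, edges, lam=1.0, lr=0.1, n_iter=5000):
    """Gradient descent on mean logistic loss + lam * edge_penalty.

    X[i][j] is 1 if gene i is annotated to term j, y[i] is the 0/1 response,
    and edges lists (parent, child) term indices (hypothetical layout).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = 0.0
    for _ in range(n_iter):
        grad = [0.0] * p
        grad0 = 0.0
        # Gradient of the mean negative log-likelihood.
        for xi, yi in zip(X, y):
            eta = intercept + sum(b * x for b, x in zip(beta, xi))
            resid = sigmoid(eta) - yi
            grad0 += resid / n
            for j in range(p):
                grad[j] += resid * xi[j] / n
        # Gradient of the DAG difference penalty.
        for a, b in edges:
            d = beta[a] - beta[b]
            grad[a] += 2.0 * lam * d
            grad[b] -= 2.0 * lam * d
        intercept -= lr * grad0
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return intercept, beta

# Toy data (made up): term 0 is the parent of terms 1 and 2, and
# annotations are propagated to the parent.
edges = [(0, 1), (0, 2)]
X = [[1, 1, 0], [1, 1, 0], [1, 1, 0],
     [1, 0, 1], [1, 0, 1], [1, 0, 1],
     [1, 0, 0], [1, 0, 0],
     [0, 0, 0], [0, 0, 0], [0, 0, 0]]
y = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1]

_, beta_free = fit_dag_penalized_logistic(X, y, edges, lam=0.0)
_, beta_pen = fit_dag_penalized_logistic(X, y, edges, lam=1.0)
print("unpenalized betas:", beta_free)
print("penalized betas:  ", beta_pen)
```

With `lam=0.0` each term gets its own freely estimated coefficient; raising `lam` pulls the coefficients of neighbouring terms towards each other, which is one simple way of letting the DAG structure account for the dependence between terms.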

ADD COMMENT • link 20.7 years ago Nicholas Lewin-Koh ▴ 430


A.J. Rossini ▴ 810

@aj-rossini-209


You might want to look at the social network literature: something like a p* model for adjusting the variance, and possibly looking at the adjoint graph (flipping nodes and edges), might produce reasonably adjusted coefficients.

best,
-tony

-- Anthony Rossini, Research Associate Professor, Biomedical and Health Informatics, University of Washington; Biostatistics, SCHARP/HVTN, Fred Hutchinson Cancer Research Center
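As a small illustration of the "flipping nodes/edges" idea: if "adjoint graph" is read as the line graph (an assumption, since the post does not define it), every parent/child edge of the GO DAG becomes a node, and two such nodes are linked when the original edges share a term. The sketch below builds the undirected version; the function name and the term labels are hypothetical.

```python
from itertools import combinations

def line_graph(edges):
    """Return (nodes, links) of the undirected line graph of `edges`.

    Each original (parent, child) edge becomes a node; two nodes are linked
    when the corresponding original edges share an endpoint.
    """
    nodes = sorted(set(edges))
    links = set()
    for e, f in combinations(nodes, 2):
        if set(e) & set(f):
            links.add((e, f))
    return nodes, links

# Toy DAG (made up): term C has two parents, A and B.
dag_edges = [("root", "A"), ("root", "B"), ("A", "C"), ("B", "C")]
lg_nodes, lg_links = line_graph(dag_edges)
for link in sorted(lg_links):
    print(link)
```

Coefficients attached to the nodes of this derived graph describe parent/child relations rather than single terms, which may be closer to what the adjustment needs to act on.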

ADD COMMENT • link 20.7 years ago A.J. Rossini ▴ 810
