Testing GO categories with Fisher's exact test

Nicholas Lewin-Koh ▴ 430

Hi all,

I have a few questions about testing for over-representation of terms in a cluster. Let's consider a simple case: a set of chips from an experiment, say treated and untreated, with 10,000 genes on the chip and 1,000 differentially expressed. Of the 10,000, 7,000 can be annotated and 6,000 have a GO function assigned to them at a suitable level. Say for this example there are 30 GO classes that appear. I then conduct Fisher's exact test 30 times, once for each GO category, to detect differential representation of terms in the expressed set, and correct for multiple testing.

My question is on the validity of this procedure. Just from experience, many genes will have multiple functions assigned to them, so the genes falling into GO classes are not independent. Also, there is the large set of unannotated genes, so we are in effect ignoring the influence of all the unannotated genes on the outcome. Do people have any thoughts or opinions on these approaches? It is appearing all over the place in bioinformatics tools like FATIGO, EASE, DAVID etc. I find that the formal testing approach makes me very uncomfortable, especially as the biologists I work with tend to over-interpret the results. I am very interested to see the discussion on this topic.

Nicholas
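For readers who want to see the procedure described above in concrete form, here is a minimal sketch in Python (standard library only; in R the per-term test would be fisher.test). The one-sided Fisher p-value is computed as a hypergeometric upper tail, and the Benjamini-Hochberg adjustment is just one possible choice of multiple-testing correction, since the post does not say which one was used. The term names, the per-term counts, and the choice of all 10,000 chip genes as the universe are illustrative assumptions, not data from the thread.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k): chance of seeing k or more term-annotated genes among n
    selected genes, when K of the N genes in the universe carry the term.
    This equals the one-sided (over-representation) Fisher exact p-value."""
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return favourable / total


def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: 10,000 genes on the chip, 1,000 differentially expressed.
N, n = 10_000, 1_000
# term -> (genes on the chip with the term, of which differentially expressed)
terms = {"apoptosis": (200, 40), "transport": (500, 52), "metabolism": (1500, 150)}

raw = {t: hypergeom_upper_tail(k, N, K, n) for t, (K, k) in terms.items()}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))
for t in terms:
    print(f"{t:12s} raw p = {raw[t]:.3g}  BH-adjusted p = {adjusted[t]:.3g}")
```

Note that the adjustment only controls for the number of tests; it does not by itself address the dependence between GO classes that the question raises.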


ADD COMMENT • link updated 20.8 years ago by James W. MacDonald 67k • written 20.8 years ago by Nicholas Lewin-Koh ▴ 430

Arne.Muller@aventis.com ▴ 620

Hello,

Here are some thoughts on the GO testing ... I don't know how the GO module in BioC works, but I guess you're looking at just one GO class, aren't you? (Note that one should always treat the three categories independently, although they're clearly not independent ...)

I think the Fisher test is OK to test for independence of the variables, e.g. apoptosis in the gene set versus apoptosis on the whole chip. The fact that only 50 to 60% of a chip can be annotated via a GO class should not matter. You're assuming that these 50% of the genes are representative of the entire population of genes. This kind of assumption is common in many areas of science, I guess. Anyway, even if *all* unannotated genes fall into completely new (yet unknown) categories, this doesn't really matter: the fact that, say, "apoptosis" is strongly over-represented in your dataset would not be influenced. However, if *disproportionately* many unannotated genes are in fact involved in apoptosis, then your assumption is wrong and you've got a problem. So I don't worry too much about the fact that only a portion of my genes are annotated with GO; otherwise I couldn't do any kind of analysis :-(

I was wondering about the multiple annotation issue, too. One gene is often annotated several times in a GO class, but I think this is not a problem, and it makes sense biologically. Especially for molecular_function you can have a range of functions for one protein, at least one per domain. The biochemical functions of the domains may be independent. Especially when annotating proteins via InterPro you may end up with a per-domain GO annotation for your protein. For cellular compartment this may be different. Biological_process is more tricky, I think. You might expect your protein to be involved in just one biological process. However, this is not true. Say Glycogen Synthase Kinase 3 beta is a protein involved in a signalling pathway controlling cell proliferation and glycogen synthesis. Basically these are two independent pathways (actually I think this is not really reflected in GO biol. proc. for this protein :-( ).

Following the above example, one could test the number of annotations rather than genes. If a gene is annotated with 2 independent GO terms within e.g. biol. proc. (i.e. terms that do not belong to the same branch), then you'd count the 2 functions twice! I guess you're doing this implicitly anyway: one gene can increase the "counter" (the frequency of observation) for several terms. This is fine as long as you do this for both the population (the chip) and the sample (your data set).

I'd like to extend the GO testing issue a bit further. Most people look at just over-represented terms (over-represented in the dataset with respect to what's expected by chance when sampling from the chip). What about under-representation? The biological interpretation of over-representation is easy (this biological process is strongly affected ...), but how do you interpret a significant under-representation? Following the above, it's also important to consider the correct tail of the distribution, i.e. to set the alternative of the fisher.test to "two.sided", "less" or "greater" ...

kind regards,

Arne

--
Arne Muller, Ph.D.
Aventis Pharma, Drug Safety Evaluation
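The point about choosing the tail can be illustrated with a short sketch. The Python below (standard library only) computes the three p-values that correspond to the "greater", "less" and "two.sided" alternatives from the same hypergeometric distribution. The two-sided rule used here, summing the probabilities of all tables no more likely than the observed one, is a common convention and is assumed rather than taken from the thread; the counts are made up to show an under-represented term.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k) for k term-annotated genes among n selected, K of N annotated."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def fisher_tails(k, N, K, n):
    """One-sided and two-sided exact p-values for an observed overlap k."""
    lo, hi = max(0, n - (N - K)), min(K, n)  # feasible range of the overlap
    pmf = {i: hypergeom_pmf(i, N, K, n) for i in range(lo, hi + 1)}
    p_obs = pmf[k]
    return {
        "greater": sum(p for i, p in pmf.items() if i >= k),   # over-representation
        "less": sum(p for i, p in pmf.items() if i <= k),      # under-representation
        # Two-sided: all outcomes at most as probable as the observed one
        # (small relative tolerance guards against floating-point ties).
        "two.sided": sum(p for p in pmf.values() if p <= p_obs * (1 + 1e-7)),
    }


# Hypothetical example: 1,000 genes, 100 carry the term, 100 selected,
# only 3 of the selected genes carry the term (10 expected by chance).
tails = fisher_tails(k=3, N=1000, K=100, n=100)
for name, p in tails.items():
    print(f"{name:10s} p = {p:.4f}")
```

With these counts the "less" alternative is the informative one, and testing only for over-representation would miss the depletion entirely.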

ADD COMMENT • link 20.8 years ago Arne.Muller@aventis.com ▴ 620

michael watson IAH-C ★ 3.4k

Surely, at the point where you are seeing "lots" of e.g. apoptosis genes in your cluster, drop the statistics and start the biology?

Remember, the ultimate proof of any statistical sense is that it makes biological sense and is biologically validated. Do we really need to know if an annotation is significant??

ADD COMMENT • link 20.8 years ago michael watson IAH-C ★ 3.4k

Arne.Muller@aventis.com ▴ 620

> Surely, at the point where you are seeing "lots" of e.g. apoptosis genes in your cluster,
> drop the statistics and start the biology?
>
> Remember, the ultimate proof of any statistical sense is
> that it makes biological sense and is biologically validated.
> Do we really need to know if an annotation is significant??

Hm, I think it's a good start to know what is significant ... On the other hand, I have to agree with you: there are often borderline GO terms in my datasets that are just not significant but fit well into my hypothesis. Especially when annotating a dataset via GO, one looks for a "biological theme", and so it may be sensible to forget about everything that is populated with fewer than, say, 5 genes or so ...

kind regards,

Arne

ADD COMMENT • link 20.8 years ago Arne.Muller@aventis.com ▴ 620

Ramon Diaz ★ 1.1k

Dear Nicholas,

On Tuesday 24 February 2004 09:33, Nicholas Lewin-Koh wrote:
> Hi all,
> I have a few questions about testing for over-representation of terms in
> a cluster. Let's consider a simple case, a set of chips from an experiment, say
> treated and untreated, with 10,000 genes on the chip and 1000 differentially
> expressed. Of the 10000, 7000 can be annotated and 6000 have a GO function
> assigned to them at a suitable level. Say for this example there are 30 GO
> classes that appear. I then conduct Fisher's exact test 30 times on each GO
> category to detect differential representation of terms in the expressed
> set and correct for multiple testing.

I think I understand your setup. Just to double check, let me rephrase it: for every one of the 30 GO terms, you set up a 2x2 contingency table (genes with/without the GO term by genes in class A vs. genes in class B) and carry out a Fisher's exact test, so you do 30 tests.

However, I am not sure what you mean by "7000 can be annotated and 6000 have a GO function assigned to them at a suitable level". Does this mean that, if a gene has no GO annotation, you will not introduce it into the above 2x2 tables? It could be in the table (so the sum of entries in each of the 2x2 tables is 10000); it just goes to the "absent" cells.

> My question is on the validity of this procedure. Just from experience
> many genes will have multiple functions assigned to them so the genes
> falling into GO classes are not independent.

Yes, sure, though I'd rather reword it as saying that the presence/absence of a GO term X (e.g., metabolism) is not independent of the presence/absence of GO term Y (e.g., transport).

However, I don't see this as an inherent problem. Suppose you measure arm length, body mass, and height of a bunch of men and women, and carry out three t-tests. Of course, the three variables are correlated. Now, you might have used Hotelling's T² test for testing the null hypothesis that the multivariate means (in the space defined by the three traits) of the sexes do not differ. But that is a different biological question from asking "do they differ in any one of the three traits", which is what you would be asking if you ran 3 t-tests. [Some of these issues are discussed very nicely by W. Krzanowski in "Principles of multivariate analysis", pp. 235-251 of the 1988 edition, and in the categorical variable case by Fienberg, "The analysis of cross-classified categorical data, 2nd ed", pp. 20-21.]

From the above point of view, I think that many of the examples in Westfall & Young ("Resampling-based multiple testing") could also be reframed in a multivariate way. But they are not. The reason, I think, is that in most of these cases (e.g., FatiGO, Westfall & Young, etc.) the biologists are interested in fishing in a sea of univariate hypotheses. I think that most of the questions that biologists are asking in these cases are univariate.

A multivariate alternative would be to use a log-linear model of a 31-way contingency table: we have 10000 genes that we cross-classify according to group membership (differentially expressed or not) and each of the K = 30 GO terms (with two values for each term: present or absent). So we have a multidimensional table with 2 x 2^30 cells, i.e. more than two billion cells for 10000 genes. This won't work.

> Also, there is the large set of un-annotated genes so we are in effect
> ignoring the influence of all the unannotated genes on the outcome.

This relates to the more general problem of the quality of GO annotations, with two related problems:

a) absence of annotation does not necessarily mean absence of that GO function, but maybe just that that particular aspect has never been studied for that gene;

b) presence of an annotation does not mean that the gene really has that function, since there are mistakes in the annotation; in fact, GO has a bunch of levels for "quality of annotations" (see http://www.geneontology.org/GO.evidence.html ).

It is my understanding that most tools, right now, just ignore these issues. I am not sure how serious the consequences are, but so far at least our experience seems to be that results make sense (e.g., see our examples in

http://bioinfo.cnio.es/docus/papers/techreports.html#FatiGO-NNSP

and

http://bioinfo.cnio.es/docus/papers/techreports.html#camda-02).

Of course, this is no excuse. A possible way forward would be to explicitly model what presence and absence of annotation mean, probably making use of the information contained in the "quality of annotations", within a Bayesian framework. M. Bhattacharjee and I have been working on it (but, because of my delays, this is becoming a never-ending project).

> opinions on these approaches? It is
> appearing all over the place in bioinformatics tools like FATIGO, EASE,
> DAVID etc. I find that

Yes, several people have had similar ideas. And I think there are a few other similar tools around.

> the formal testing approach makes me very uncomfortable, especially as
> the biologists I work with tend to over interpret the results.

I don't see your last point: how does the formal testing lead to overinterpretation?

Best,

Ramón

--
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900

http://bioinfo.cnio.es/~rdiaz
PGP KeyID: 0xE89B3462
(http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc)
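The question of whether unannotated genes enter the 2x2 table can be made concrete with a small sketch. The Python below (standard library only) builds the table from sets of gene identifiers, once with the whole chip as the universe, so that unannotated genes land in the "without term" row, and once restricted to genes that have any GO annotation. The function name, the gene numbering, and the split of 600 annotated genes among the 1,000 differentially expressed ones are hypothetical choices for illustration only.

```python
def two_by_two(universe, selected, with_term):
    """Cross-classify genes in `universe` by term membership and selection.

    Returns [[a, b], [c, d]] with rows = with / without the GO term and
    columns = selected (e.g. differentially expressed) / not selected.
    """
    selected = selected & universe
    with_term = with_term & universe
    a = len(selected & with_term)
    b = len(with_term - selected)
    c = len(selected - with_term)
    d = len(universe) - a - b - c
    return [[a, b], [c, d]]


# Hypothetical chip: genes 0..9999, of which genes 0..999 are differentially expressed.
chip = set(range(10_000))
diff_expr = set(range(1_000))
# Assume 6,000 genes carry some GO annotation: 600 of them differentially expressed.
annotated = set(range(600)) | set(range(1_000, 6_400))
# Assume 200 genes carry the term of interest, 40 of them differentially expressed.
term = set(range(40)) | set(range(1_000, 1_160))

whole_chip = two_by_two(chip, diff_expr, term)
annotated_only = two_by_two(annotated, diff_expr, term)
print("universe = whole chip:     ", whole_chip)
print("universe = annotated genes:", annotated_only)
```

The top row is identical in both tables; only the "without term" row changes, which is exactly where the assumption about what a missing annotation means enters the test.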

ADD COMMENT • link 20.8 years ago Ramon Diaz ★ 1.1k


Hello Nicholas!

This is in continuation of Ramon's comments and suggestions to you. I have done some work connected to functional enrichment assessment. In a Bayesian framework it is actually possible to address some of the questions you have raised, e.g. multiple hypothesis testing, lacking or missing annotation, and the quality of the available annotation.

In a Bayesian framework, inference is drawn using the joint distribution of all the attributes, which are in this case GO annotations and expression measurements for a gene. If there is dependence either within functionalities or genes, it is taken care of by this joint distribution (of course only to the extent the model permits).

I have done some simple modelling accounting for annotation error too. The models can take into account both missing annotation and possibly erroneous annotation. But so far I have worked with very simplistic models, and the real problem requires much more in-depth analysis of the annotation information, where we (i.e. Ramon and I) have made some progress, but nothing satisfactory yet.

Another problem is the successful implementation of such huge Bayesian models. Computation can be, and often is, quite difficult. This is one of the reasons why classical univariate hypothesis testing techniques are so popular, although obviously they are only approximate methods. Some of us at Helsinki are trying to address the computation question too. But I guess it will still be some time before we are ready to handle the variety of unstructured dependence like that seen in microarray data or functional data.

Wishes,
Madhu

ADD REPLY • link 20.8 years ago Madhuchhanda Bhattacharjee ▴ 10

Charles Berry ▴ 290

On Tue, 24 Feb 2004, Nicholas Lewin-Koh wrote:

> Hi all,
> I have a few questions about testing for over-representation of terms in
> a cluster. Let's consider a simple case, a set of chips from an experiment, say
> treated and untreated, with 10,000 genes on the chip and 1000 differentially
> expressed. Of the 10000, 7000 can be annotated and 6000 have a GO function
> assigned to them at a suitable level. Say for this example there are 30 GO
> classes that appear. I then conduct Fisher's exact test 30 times on each GO
> category to detect differential representation of terms in the expressed
> set and correct for multiple testing.
>
> My question is on the validity of this procedure.

It depends on what hypotheses you wish to test.

The uniform distribution of the p-value under the null hypothesis depends on ***all*** the assumptions of the test holding. The trouble is that you probably do not want to test whether the genes on your microarray are independent, since you already know that they are not:

> Just from experience many genes will have multiple functions assigned to
> them so the genes falling into GO classes are not independent.
> Also, there is the large set of un-annotated genes so we are in effect
> ignoring the influence of all the unannotated genes on the outcome.
> Do people have any thoughts or opinions on these approaches? It is
> appearing all over the place in bioinformatics tools like FATIGO, EASE,
> DAVID etc.

SAM and similar permutation-based approaches can be implemented for this setup to get p-values (or FDRs) that do not depend on independence of genes/transcripts.

In several data sets I have seen, the results given by permutation (of sample identities, using the hypergeometric p-value as the test statistic) are several orders of magnitude more conservative than the original 'p-value', even without correcting for multiple comparisons.

I recall someone from the MAPPfinder group remarking at a conference last July that MAPPfinder 2.0 would implement permutation methods. But I cannot find this release yet using Google.

Another approach to permutation testing of expression vs. ontology is outlined in:

Mootha VK et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3):267-73, 2003.

> I find that the formal testing approach makes me very uncomfortable, especially as
> the biologists I work with tend to over interpret the results.

Testing a better focussed hypothesis should increase your comfort level. :-)

> I am very interested to see the discussion on this topic.
>
> Nicholas

Charles C. Berry                        (858) 534-2098
Dept of Family/Preventive Medicine      E mailto:cberry@tajo.ucsd.edu
UC San Diego                            http://hacuna.ucsd.edu/members/ccb.html
La Jolla, San Diego 92093-0717
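The permutation idea described above, shuffling sample identities and using the hypergeometric p-value as the test statistic, might be sketched roughly as follows. This is not SAM and not any published tool: it is a simplified Python illustration (standard library only) that ranks genes by the absolute difference in group means, takes the top genes as "selected", and works on simulated data in which the genes of one category are correlated with each other but have no true treatment effect. All function names, sizes, and simulation settings are assumptions made for the example.

```python
import random
from math import comb


def upper_tail(k, N, K, n):
    """Hypergeometric P(X >= k): the nominal over-representation p-value."""
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)) / comb(N, n)


def enrichment_p(expr, labels, gene_set, n_top):
    """Nominal p-value for gene_set among the n_top genes with the largest
    absolute difference in mean expression between the two label groups."""
    idx1 = [j for j, lab in enumerate(labels) if lab == 1]
    idx0 = [j for j, lab in enumerate(labels) if lab == 0]

    def score(row):
        return abs(sum(row[j] for j in idx1) / len(idx1) - sum(row[j] for j in idx0) / len(idx0))

    ranked = sorted(range(len(expr)), key=lambda g: score(expr[g]), reverse=True)
    k = len(set(ranked[:n_top]) & gene_set)
    return upper_tail(k, len(expr), len(gene_set), n_top)


def permutation_p(expr, labels, gene_set, n_top, n_perm, rng):
    """Permute sample labels; the nominal p-value itself is the statistic."""
    observed = enrichment_p(expr, labels, gene_set, n_top)
    perm, hits = list(labels), 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if enrichment_p(expr, perm, gene_set, n_top) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Simulated data (hypothetical): 200 genes, 6 treated + 6 control samples,
# no real treatment effect, but the 20 genes of one category share a
# common per-sample factor and are therefore strongly correlated.
rng = random.Random(1)
n_genes, n_samples = 200, 12
labels = [1] * 6 + [0] * 6
gene_set = set(range(20))
shared = [rng.gauss(0, 1) for _ in range(n_samples)]
expr = [
    [0.9 * shared[j] + 0.3 * rng.gauss(0, 1) for j in range(n_samples)]
    if g in gene_set
    else [rng.gauss(0, 1) for _ in range(n_samples)]
    for g in range(n_genes)
]

nominal, perm_p = permutation_p(expr, labels, gene_set, n_top=20, n_perm=200, rng=rng)
print(f"nominal hypergeometric p = {nominal:.3g}, permutation p = {perm_p:.3g}")
```

Because the correlated genes tend to enter or leave the selected list together, the nominal p-value can look extreme by chance alone; the permutation p-value is calibrated against label shuffles of the same data and so reflects that dependence.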

ADD COMMENT • link 20.8 years ago Charles Berry ▴ 290

michael watson IAH-C ★ 3.4k

Forgive my naivety, but could one not use a chi-squared test here? We have an observed amount of genes in each category, and could calculate an expected amount from the size of the cluster and the distribution of all genes throughout GO categories...?

ADD COMMENT • link 20.8 years ago michael watson IAH-C ★ 3.4k


Dear Michael,

On Wednesday 25 February 2004 10:07, michael watson (IAH-C) wrote:
> Forgive my naivety, but could one not use a chi-squared test here?
> We have an observed amount of genes in each category, and could calculate
> an expected from the size of the cluster and the distribution of all genes
> throughout GO categories...

I am not sure I understand your question, so I'll answer two different questions and hope one of them is the right one:

1. Chi-square vs. Fisher's exact test: Yes, sure. That is, in fact, what some tools do. Others use Fisher's exact test for contingency tables directly, because some of the usual assumptions of the chi-square test may not hold (e.g., many of the cells have expected counts < 5; actual recommendations vary; Agresti [Categorical Data Analysis], Conover [Practical Nonparametric Statistics], etc., provide more details on the "rules of thumb" for when not to trust the chi-square approximation). Fisher's exact test is the classical small-sample test for independence in contingency tables.

2. Can you do a single test with all GO terms at the same time? I guess you could, but as I said in the other email, the problem is the number of cells, even if the number of GO terms is modest: you have a bunch of genes (e.g., 10000) and each is cross-classified according to presence/absence of each of the K GO terms. (So "We have an observed amount of genes in each category" is not actually an accurate description of the sampling scheme here.) That gives you a K-way contingency table; even for modest K the number of cells blows up (e.g., for K = 20 you have 2^20 cells). And then you need to be careful with the loglinear model in terms of which hypothesis you want to test and which ones you are actually testing.

Best,

R.

-- 
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900
http://bioinfo.cnio.es/~rdiaz
PGP KeyID: 0xE89B3462 (http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc)
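
As an illustration of the first point, the one-sided Fisher's exact test for a single GO term reduces to a hypergeometric tail probability. The following is a minimal sketch in standard-library Python; the function name and the counts in the GO example are hypothetical and chosen only for illustration, and none of this is taken from the tools mentioned in the thread.

```python
from math import comb


def hypergeom_upper_tail(k, n_drawn, n_marked, n_total):
    """Return P(X >= k) for a hypergeometric X.

    n_total  : genes in the reference set (e.g. annotated genes on the chip)
    n_marked : genes in the reference set that carry the GO term
    n_drawn  : genes in the selected (e.g. differentially expressed) set
    k        : selected genes that carry the GO term
    This is the one-sided (over-representation) Fisher's exact p-value.
    """
    upper = min(n_drawn, n_marked)
    favourable = sum(
        comb(n_marked, i) * comb(n_total - n_marked, n_drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_drawn)


# Sanity check on the classic tea-tasting design:
# 8 cups, 4 of each kind, all 4 identified correctly gives exactly 1/70.
p_tea = hypergeom_upper_tail(4, 4, 4, 8)

# Hypothetical GO term: 6000 annotated genes, 600 selected,
# 40 genes carry the term and 10 of those are in the selected set.
p_term = hypergeom_upper_tail(10, 600, 40, 6000)
print(f"tea: {p_tea:.4f}  GO term: {p_term:.4g}")
```

Exact integer arithmetic is used until the final division, so no approximation enters apart from the last floating-point rounding. A two-sided version would need an extra convention for which tables count as "at least as extreme", which is left out here.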

ADD REPLY • link 20.8 years ago Ramon Diaz ★ 1.1k


Hi,

I am crafting a longer reply which I will send later. But in relation to the chi-squared test: yes, you could do a chi-squared test, but as Ramon points out, the reason to use Fisher's exact test is precisely that it performs better for small samples. Though I have always found it very unsatisfying that Fisher's tea lady needed 4 out of 4 correct for the result to be significant :) .

In regard to point 2: in theory, if you fit the complete model, you would have 2^20 cells for a K-way table with K = 20. However, in this problem you could collapse a lot of the dimensionality, as many of the cells would contain 0's, and with 0 information inference is hard. Also, one could make some fairly loose assumptions on the correlation structure and probably reduce the dimensionality further.

My 2c

Nicholas
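
The sparsity argument can be checked with a quick count: with N genes, at most N of the 2^K cells can be non-empty. A minimal Python sketch follows, using randomly generated (hypothetical) annotations in which each gene carries between one and three of K = 20 terms; real GO annotations will of course look different.

```python
import random
from collections import Counter

K = 20            # number of GO terms (hypothetical)
N_GENES = 10_000  # genes on the chip, as in the thread's example

rng = random.Random(1)

# Hypothetical annotations: each gene gets 1 to 3 distinct terms.
annotations = [
    frozenset(rng.sample(range(K), rng.randint(1, 3)))
    for _ in range(N_GENES)
]

# Each distinct presence/absence pattern is one cell of the K-way table.
cells = Counter(
    tuple(int(term in terms) for term in range(K))
    for terms in annotations
)

print(f"cells in the full table: {2 ** K}")
print(f"non-empty cells:         {len(cells)}")
```

Storing only the observed patterns, as the `Counter` does here, is one simple way to work with such a table without ever materialising the empty cells.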

ADD REPLY • link 20.8 years ago Nicholas Lewin-Koh ▴ 430


Arne.Muller@aventis.com ▴ 620

@arnemulleraventiscom-466

Last seen 10.2 years ago

Hello,

The chi-square test needs an expected count of at least 5 genes in each cell; if this is true for your study, you can use it. However, the chi-square test is only a large-sample approximation, whereas the Fisher test is exact, so you may want to use the Fisher test directly. The chi-square test is computationally a lot more efficient than the Fisher test - but "these days" that's not an argument anymore ;-) .

regards,

Arne
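
To make the "expected count of 5" rule concrete, here is a small sketch in standard-library Python that computes the expected counts and the Pearson chi-square statistic for a 2x2 table. The table values and function names are hypothetical; the sketch uses no continuity correction and assumes that no row or column total is zero.

```python
from math import erfc, sqrt


def expected_counts(table):
    """Expected cell counts of a 2x2 table under independence."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    return [[r * col / n for col in col_totals] for r in row_totals]


def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) and its
    p-value on 1 degree of freedom for a 2x2 table."""
    expected = expected_counts(table)
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    # Chi-square survival function for 1 df: P(Z^2 > stat) = erfc(sqrt(stat / 2))
    return stat, erfc(sqrt(stat / 2))


# Hypothetical table: rows = selected / not selected,
# columns = has GO term / lacks GO term.
table = [[10, 590], [30, 5370]]
expected = expected_counts(table)
stat, p_value = chi_square_2x2(table)

# Rule of thumb from the thread: every expected count should be >= 5.
rule_of_thumb_ok = min(min(row) for row in expected) >= 5
print(expected[0][0], round(stat, 3), p_value, rule_of_thumb_ok)
```

In this hypothetical table the smallest expected count is 4, so the rule of thumb fails and the chi-square p-value should be treated with caution.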

ADD COMMENT • link 20.8 years ago Arne.Muller@aventis.com ▴ 620


James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 11 hours ago

United States

I should add to this thread that there is existing software that will do resampling to assess global significance of the p-values obtained from this sort of analysis.

http://dot.ped.med.umich.edu:2000/pub/sig_terms/index.htm

Best,

Jim

James W. MacDonald
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623

>>> <cberry@tajo.ucsd.edu> 02/24/04 02:23PM >>>

On Tue, 24 Feb 2004, Nicholas Lewin-Koh wrote:

> Hi all,
> I have a few questions about testing for over representation of terms in
> a cluster.
> let's consider a simple case, a set of chips from an experiment say
> treated and untreated with 10,000
> genes on the chip and 1000 differentially expressed. Of the 10000, 7000
> can be annotated and 6000 have
> a GO function assigned to them at a suitable level. Say for this example
> there are 30 GO classes that appear.
> I then conduct Fisher's exact test 30 times on each GO category to detect
> differential representation of terms in the expressed
> set and correct for multiple testing.
>
> My question is on the validity of this procedure.

It depends on what hypotheses you wish to test. The uniform distribution of the p value under the null hypothesis depends on ***all*** the assumptions of the test obtaining.

The trouble is that you probably do not want to test whether the genes on your microarray are independent, since you already know that they are not:

> Just from experience
> many genes will
> have multiple functions assigned to them so the genes falling into GO
> classes are not independent.
> Also, there is the large set of un-annotated genes so we are in effect
> ignoring the influence of
> all the unannotated genes on the outcome. Do people have any thoughts or
> opinions on these approaches? It is
> appearing all over the place in bioinformatics tools like FATIGO, EASE,
> DAVID etc.

SAM and similar permutation-based approaches can be implemented for this setup to get p-values (or FDRs) that do not depend on independence of genes/transcripts.

The results given by permutation (of sample identities using the hypergeometric p-value as the test statistic) are several orders of magnitude more conservative than using the original 'p-value' even without correcting for multiple comparisons in several data sets I have seen.

I recall someone from the MAPPfinder group remarking at a conference last July that MAPPfinder 2.0 would implement permutation methods. But I cannot find this release yet using google.

Another approach to permutation testing of expression vs ontology is outlined in:

Mootha VK et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, 34(3):267-73, 2003.

> I find that
> the formal testing approach makes me very uncomfortable, especially as
> the biologists I work with tend to over-interpret the results.

Testing a better focussed hypothesis should increase your comfort level. :-)

> I am very interested to see the discussion on this topic.
>
> Nicholas

Charles C. Berry
(858) 534-2098
Dept of Family/Preventive Medicine
UC San Diego
La Jolla, San Diego 92093-0717
E mailto:cberry@tajo.ucsd.edu
http://hacuna.ucsd.edu/members/ccb.html

_______________________________________________
Bioconductor mailing list
Bioconductor@stat.math.ethz.ch
https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor
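
The permutation idea described above (permuting sample identities and using the hypergeometric p-value as the test statistic) can be sketched roughly as follows. This is only an illustration in standard-library Python under several assumptions that are not from the thread: the toy data are simulated, genes are "selected" by taking a fixed number with the largest absolute difference in group means, and all function names are hypothetical.

```python
import random
from math import comb


def hypergeom_upper_tail(k, n_drawn, n_marked, n_total):
    """One-sided hypergeometric p-value P(X >= k)."""
    upper = min(n_drawn, n_marked)
    favourable = sum(
        comb(n_marked, i) * comb(n_total - n_marked, n_drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_drawn)


def select_genes(expr, labels, n_select):
    """Pick the n_select genes with the largest absolute difference in
    group means (a simple stand-in for a real differential-expression test)."""
    def score(values):
        treated = [v for v, lab in zip(values, labels) if lab == 1]
        control = [v for v, lab in zip(values, labels) if lab == 0]
        return abs(sum(treated) / len(treated) - sum(control) / len(control))

    ranked = sorted(range(len(expr)), key=lambda g: score(expr[g]), reverse=True)
    return set(ranked[:n_select])


def term_pvalue(expr, labels, term_genes, n_select):
    """Hypergeometric p-value for one GO term given the sample labels."""
    selected = select_genes(expr, labels, n_select)
    overlap = len(selected & term_genes)
    return hypergeom_upper_tail(overlap, n_select, len(term_genes), len(expr))


def permutation_pvalue(expr, labels, term_genes, n_select, n_perm, rng):
    """Compare the observed hypergeometric p-value with its distribution
    under random relabelling of the samples."""
    observed = term_pvalue(expr, labels, term_genes, n_select)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute sample identities, keep genes intact
        if term_pvalue(expr, shuffled, term_genes, n_select) <= observed:
            at_least_as_extreme += 1
    return observed, (at_least_as_extreme + 1) / (n_perm + 1)


# Simulated toy data: 200 genes, 4 control and 4 treated samples, no true
# treatment effect; the 20 genes of the term share a common noise component,
# so they are correlated with each other.
rng = random.Random(0)
n_genes, labels = 200, [0, 0, 0, 0, 1, 1, 1, 1]
expr = [[rng.gauss(0, 1) for _ in labels] for _ in range(n_genes)]
term_genes = set(range(20))
shared = [rng.gauss(0, 1) for _ in labels]
for g in term_genes:
    expr[g] = [v + s for v, s in zip(expr[g], shared)]

observed, perm_p = permutation_pvalue(expr, labels, term_genes, 20, 200, rng)
print(f"hypergeometric p = {observed:.4g}, permutation p = {perm_p:.4g}")
```

Because whole samples are relabelled, the correlation among genes is preserved in every permuted data set, which is why this kind of p-value does not rely on independence of genes. With only 4 versus 4 samples there are just 70 distinct labellings, so the permutation p-value is necessarily coarse.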

ADD COMMENT • link 20.8 years ago James W. MacDonald 67k


Another tool is http://fatigo.bioinfo.cnio.es

Best,

R.

On Wednesday 25 February 2004 14:50, James MacDonald wrote:
> I should add to this thread that there is existing software that will do
> resampling to assess global significance of the p-values obtained from
> this sort of analysis.
>
> http://dot.ped.med.umich.edu:2000/pub/sig_terms/index.htm

-- 
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900
http://bioinfo.cnio.es/~rdiaz
PGP KeyID: 0xE89B3462 (http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc)

ADD REPLY • link 20.8 years ago Ramon Diaz ★ 1.1k
