Dear Nicholas,
On Tuesday 24 February 2004 09:33, Nicholas Lewin-Koh wrote:
> Hi all,
> I have a few questions about testing for over-representation of terms in
> a cluster.
> Let's consider a simple case: a set of chips from an experiment, say
> treated and untreated, with 10,000 genes on the chip and 1000
> differentially expressed. Of the 10,000, 7000 can be annotated and 6000
> have a GO function assigned to them at a suitable level. Say for this
> example there are 30 GO classes that appear.
> I then conduct Fisher's exact test 30 times, once on each GO category, to
> detect differential representation of terms in the expressed
> set, and correct for multiple testing.
I think I understand your setup. Just to double check, let me rephrase it:
- for every one of the 30 GO terms, you set up a 2x2 contingency table
  (genes with/without the GO term by genes in class A vs. genes in class B)
  and carry out a Fisher's exact test, so you do 30 tests.
However, I am not sure what you mean by "7000 can be annotated and 6000
have a GO function assigned to them at a suitable level". Does this mean
that, if a gene has no GO annotation, you will not include it in the above
2x2 tables? It could be in the table (so that the entries in each of the
2x2 tables sum to 10000); it just goes to the "absent" cells.
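To make the setup above concrete, here is a minimal sketch in Python of one test per GO term, with all 10000 genes kept in every 2x2 table. It assumes a one-sided (over-representation) version of Fisher's exact test and a Holm correction; the original post specifies neither, and the function names and per-term counts are made up for illustration.

```python
from math import comb

def fisher_over_representation(n_total, n_selected, n_with_term, n_selected_with_term):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Probability of seeing at least n_selected_with_term annotated genes
    among the n_selected genes, given n_with_term annotated genes among
    n_total. Genes without the annotation stay in the table: they count
    in n_total and fall into the "term absent" cells.
    """
    upper = min(n_selected, n_with_term)
    tail = 0
    for k in range(n_selected_with_term, upper + 1):
        tail += comb(n_with_term, k) * comb(n_total - n_with_term, n_selected - k)
    return tail / comb(n_total, n_selected)

def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted

# Hypothetical counts: term -> (genes with the term on the chip,
#                               genes with the term among the 1000 selected).
N_GENES, N_SELECTED = 10_000, 1_000
term_counts = {"term A": (300, 60), "term B": (500, 50), "term C": (120, 25)}

raw = [fisher_over_representation(N_GENES, N_SELECTED, on_chip, in_selected)
       for on_chip, in_selected in term_counts.values()]
for term, p, p_adj in zip(term_counts, raw, holm_adjust(raw)):
    print(f"{term}: raw p = {p:.3g}, Holm-adjusted p = {p_adj:.3g}")
```

Dropping the unannotated genes instead would only change `N_GENES` and `N_SELECTED`, which is exactly the difference the question above is asking about.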
>
> My question is on the validity of this procedure. Just from experience,
> many genes will have multiple functions assigned to them, so the genes
> falling into GO classes are not independent.
Yes, sure, though I'd rather reword it as saying that the presence/absence
of a GO term X (e.g., metabolism) is not independent of the presence/absence
of GO term Y (e.g., transport).
However, I don't see this as an inherent problem. Suppose you measure arm
length, body mass, and height of a bunch of men and women, and carry out
three t-tests. Of course, the three variables are correlated.
Now, you might have used Hotelling's T^2 test for testing the null
hypothesis that the multivariate means (in the space defined by the three
traits) of the sexes do not differ. But that is a different biological
question from asking "do they differ in any one of the three traits", which
is what you would be asking if you ran 3 t-tests. [Some of these issues are
discussed very nicely by W. Krzanowski in "Principles of multivariate
analysis", pp. 235-251 of the 1988 edition, and in the categorical variable
case by Fienberg, "The analysis of cross-classified categorical data,
2nd ed.", pp. 20-21].
From the above point of view, I think that many of the examples in Westfall
& Young ("Resampling-based multiple testing") could also be reframed in a
multivariate way. But they are not. The reason, I think, is that in most of
these cases (e.g., FatiGO, Westfall & Young, etc.) the biologists are
interested in fishing in a sea of univariate hypotheses: most of the
questions that biologists are asking in these cases are univariate.
A multivariate alternative would be to use a log-linear model of a 31-way
contingency table: we have 10000 genes that we cross-classify according to
group membership (differentially expressed or not) and each of the K = 30 GO
terms (with two values for each term: present or absent). So we have a
multidimensional table of 2 x 2^30 = 2^31 cells (over two billion) to be
filled with only 10000 genes, so almost all cells are empty. This won't work.
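The size of that table can be checked with a few lines of arithmetic. This is only a sketch using the numbers from the example (10000 genes, 30 GO terms); the variable names are mine.

```python
n_genes = 10_000
n_terms = 30

# One binary factor for group membership, times one binary factor per GO term.
n_cells = 2 * 2 ** n_terms

# Even if every gene fell into a different cell, this is the largest
# fraction of cells that could be non-empty.
max_occupied_fraction = n_genes / n_cells

print(n_cells)
print(f"{max_occupied_fraction:.2e}")
```

With fewer than one cell in a hundred thousand occupied at best, a log-linear model of the full table has essentially no data to estimate its higher-order terms from.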
> Also, there is the large set of un-annotated genes, so we are in effect
> ignoring the influence of all the unannotated genes on the outcome.
This relates to the more general problem of the quality of GO annotations,
with two related problems:
a) absence of an annotation does not necessarily mean absence of that GO
function, but maybe just that that particular aspect has never been studied
for that gene;
b) presence of an annotation does not mean that the gene really has that
function, since there are mistakes in the annotation; in fact, GO has a bunch
of levels for "quality of annotations" (see
http://www.geneontology.org/GO.evidence.html).
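As a rough illustration of how the "quality of annotations" levels could enter an analysis, the sketch below keeps only annotations whose GO evidence code is in an accepted set before any 2x2 table is built. This is a crude filter of my own choosing, not a model of what presence or absence of annotation means; the records, gene names, and the choice of accepted codes are hypothetical.

```python
# Hypothetical annotation records: (gene, GO term identifier, evidence code).
annotations = [
    ("gene1", "GO:0008152", "IDA"),  # inferred from direct assay
    ("gene2", "GO:0008152", "IEA"),  # inferred from electronic annotation
    ("gene3", "GO:0006810", "TAS"),  # traceable author statement
    ("gene4", "GO:0006810", "ND"),   # no biological data available
]

# Assumption: treat only experimentally supported codes as reliable.
EXPERIMENTAL_CODES = {"IDA", "IMP", "IGI", "IPI", "IEP"}

def keep_by_evidence(records, accepted_codes):
    """Return the records whose evidence code is in accepted_codes."""
    return [record for record in records if record[2] in accepted_codes]

reliable = keep_by_evidence(annotations, EXPERIMENTAL_CODES)
print(reliable)
```

Rerunning the per-term tests with and without such a filter would give a first impression of how sensitive the results are to problem b); it does nothing about problem a).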
It is my understanding that most tools, right now, just ignore these
issues. I am not sure how serious the consequences are, but so far at least
our experience seems to be that results make sense (e.g., see our examples in
http://bioinfo.cnio.es/docus/papers/techreports.html#FatiGO-NNSP
and
http://bioinfo.cnio.es/docus/papers/techreports.html#camda-02).
Of course, this is no excuse. A possible way forward would be to explicitly
model what presence and absence of annotation mean, probably making use of
the information contained in the "quality of annotations", within a Bayesian
framework. M. Battacharjee and I have been working on it (but, because of my
delays, this is becoming a never-ending project).
> opinions on these approaches? It is
> appearing all over the place in bioinformatics tools like FATIGO, EASE,
> DAVID, etc. I find that
Yes, several people have had similar ideas. And I think there are a few
other similar tools around.
> the formal testing approach makes me very uncomfortable, especially as
> the biologists I work with tend to over-interpret the results.
I don't see your last point: how the formal testing leads to
over-interpretation.
Best,
Ramón
--
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900
http://bioinfo.cnio.es/~rdiaz
PGP KeyID: 0xE89B3462
(http://bioinfo.cnio.es/~rdiaz/0xE89B3462.asc)