Hi Tony, James and All,
This mail is to inform you that this issue is now cleared. As per
what I have been conveyed by Robert Gentleman through James the
explanation is as follows:
The conditioning has to happen on both the GO terms for the selected
genes as well as for the GO terms from the universe. Hence one can get
different sizes depending on the input gene IDs. If you re-run the
example (provided by James) given below setting conditional=FALSE, you
will see that the size values always stay the same.
Thanks to all of you for looking into this and helping me understand
the concept and its implications.
James code:
##############################################
set.seed(12345)
set1 <- unique(getEG(sample(ls(hgu95av2GO), 100), "hgu95av2"))
set.seed(54321)
set2 <- unique(getEG(sample(ls(hgu95av2GO), 100), "hgu95av2"))
univ <- unique(getEG(ls(hgu95av2GO), "hgu95av2"))
p <- new("GOHyperGParams", geneIds=set1, universeGeneIds=univ,
annotation="hgu95av2", conditional=TRUE, pvalueCutoff=1,
ontology="BP")
hyp <- hyperGTest(p)
first <- summary(hyp, categorySize=10)
p <- new("GOHyperGParams", geneIds=set2, universeGeneIds=univ,
annotation="hgu95av2", conditional=TRUE, pvalueCutoff=1,
ontology="BP")
hyp <- hyperGTest(p)
second <- summary(hyp, categorySize=10)
ind <- first[,1] %in% second[,1]
data.frame(first[ind,c(1,6)], second[first[ind,1],6])
sum(first[ind,6] != second[first[ind,1],6])
###############################################
Best Regards,
Chanchal
===============================
Chanchal Kumar, Ph.D. Candidate
Dept. of Proteomics and Signal Transduction
Max Planck Institute of Biochemistry
Am Klopferspitz 18
82152 D-Martinsried (near Munich)
Germany
e-mail: chanchal at biochem.mpg.de
Phone: (Office) +49 (0) 89 8578 2296
Fax:(Office) +49 (0) 89 8578 2219
http://www.biochem.mpg.de/mann/
===============================
________________________________________
From: tc.fhcrc@gmail.com [mailto:tc.fhcrc@gmail.com] On Behalf Of Tony
Chiang
Sent: Tuesday, March 04, 2008 10:57 AM
To: Chanchal Kumar
Cc: James W. MacDonald; bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] Inconsistency in GOstats results after GO
enrichment
Hi Chanchal,
I have not had a chance yet to look over your code. I will do so in
the next couple of days and get you a definitive answer if I can
figure it out. From the gut though, I suspect that I if remember
Seth's implementation correctly, he looks at the test set and universe
set with respect to the annotation package. If there is anything in
the test or universe has yet to be annotated in the package, these
genes are removed before the analysis takes place. I don't really know
if *this* is the issue with which you are concerned, but like I said,
I will have a look at get back to you asap.
Cheers,
--tc
On Sat, Mar 1, 2008 at 2:53 PM, Chanchal Kumar <chanchal at="" biochem.mpg.de=""> wrote:
Hi James, Tony and All,
? Thanks for your inputs and now I have assembled a small dataset
which will help you reproduce the inconsistency in GO enrichment
output using GOstats package.
To summarize, my concern is when I compare any two "test" datasets
against the common "universe" set then I get different "Category Size"
output of "universe" for common GO terms across these two "test"
datasets. The .txt data files (test1, test2, universe) are attached
with the mail and the script is below. The ids are EntrezID for Human
and may be redundant in each file. But then I am removing the
redundancy before calculating enrichment. Also I am using a p-value
cutoff of 1 just to get everything.
After analysis I find discrepancies in these GO terms (there are
more):
GO:0016070 ,GO:0051179 ,GO:0006996, GO:0000074 , GO:0031325,
GO:0007399, GO:0006464, GO:0030097, GO:0007586, GO:0006810,
GO:0007243, GO:0048878
I look forward to your inputs. In rest of the mail sessionInfo() is
followed by script.
> ?sessionInfo()
R version 2.6.2 (2008-02-08)
i386-pc-mingw32
locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
States.1252;LC_MONETARY=English_United
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
attached base packages:
[1] splines ? tools ? ? stats ? ? graphics ?grDevices utils ? ?
datasets ?methods ? base
other attached packages:
?[1] hgu95av2.db_2.0.2 ? GOstats_2.4.0 ? ? ? Category_2.4.0 ? ?
?genefilter_1.16.0 ? survival_2.34 ? ? ? RBGL_1.14.0
?[7] annotate_1.16.1 ? ? xtable_1.5-2 ? ? ? ?GO.db_2.0.2 ? ? ? ?
AnnotationDbi_1.0.6 RSQLite_0.6-7 ? ? ? DBI_0.2-4
[13] Biobase_1.16.3 ? ? ?graph_1.16.1
loaded via a namespace (and not attached):
[1] cluster_1.11.9
####################SCRIPT STARTS######################
######################################################
# Load packages
library("GOstats")
library("hgu95av2.db")
# read the universe set
universe<-read.delim(file="universe.txt",sep="\t",header=T, colClasses
= "character")
universe<-as.vector(universe[,1])
mode(universe)<-"character"
########## Analysing Test 1 ##################
# read the test set 1
test<-read.delim(file="test1.txt",sep="\t",header=T, colClasses =
"character")
test<-as.vector(test[,1])
mode(test)<-"character"
hgCutoff <- 1
#testing for GO enrichment using conditional test
params <- new("GOHyperGParams", geneIds = unique(test),
universeGeneIds = unique(universe), annotation = "hgu95av2",
ontology = "BP", pvalueCutoff = hgCutoff, conditional = TRUE,
testDirection = "over")
hgOver <- hyperGTest(params)
# Choose only those categories which have at least 3 members in the
universe set
d<-summary(hgOver,categorySize =3)
write.table(d,"Test1_GOBP.txt",row.names=F,col.names=T,sep="\t",quote=
F)
########## Analysing Test 2 ##################
# read the test set 2
test<-read.delim(file="test2.txt",sep="\t",header=T, colClasses =
"character")
test<-as.vector(test[,1])
mode(test)<-"character"
hgCutoff <- 1
#testing for GO enrichment using conditional test
params <- new("GOHyperGParams", geneIds = unique(test),
universeGeneIds = unique(universe), annotation = "hgu95av2",
ontology = "BP", pvalueCutoff = hgCutoff, conditional = TRUE,
testDirection = "over")
hgOver <- hyperGTest(params)
# Choose only those categories which have at least 3 members in the
universe set
d<-summary(hgOver,categorySize =3)
write.table(d,"Test2_GOBP.txt",row.names=F,col.names=T,sep="\t",quote=
F)
##########################SCRIPT ENDS##########################
###############################################################
Best Regards,
Chanchal
===============================
Chanchal Kumar, Ph.D. Candidate
Dept. of Proteomics and Signal Transduction
Max Planck Institute of Biochemistry
Am Klopferspitz 18
82152 D-Martinsried (near Munich)
Germany
e-mail: chanchal at biochem.mpg.de
Phone: (Office) +49 (0) 89 8578 2296
Fax:(Office) +49 (0) 89 8578 2219
http://www.biochem.mpg.de/mann/
===============================
________________________________________
From: tc.fhcrc@gmail.com [mailto:tc.fhcrc@gmail.com] On Behalf Of Tony
Chiang
Sent: Friday, February 29, 2008 3:01 PM
To: James W. MacDonald
Cc: Chanchal Kumar; bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] Inconsistency in GOstats results after GO
enrichment
Hi Chanchal,
?
In addition to the small example, please provide the list with your
output of sessionInfo(). This way, when we try to reproduce your
example, we will try to use the same version of software packages that
you are using.
?
Cheers,
--Tony
?
On 2/29/08, James W. MacDonald <jmacdon at="" med.umich.edu=""> wrote:
Hi Chanchal,
Unless you can give us a small reproducible example that shows what
you
are talking about, there is really nothing anybody can do about this.
Best,
Jim
Chanchal Kumar wrote:
> Dear All,
>
>
>
>????I have been using the "GOstats" package for the GO enrichment on
my
> dataset. I have two cluster of genes(test1, test2) and a reference
> set(universe set). After doing enrichment on the two sets for GO BP
> w.r.t to the universe set I get a list of enriched terms. In both
cases
> I get "Cell Cycle" (GO:0007049) as enriched. Till there it's fine.
But
> strangely I see that the Category Size ( "Size" column of the output
)
> gives me different numbers for Cell cycle annotated genes in my
universe
> set( in test1 77, in test2 366). Now my concern is that given I use
the
> same universe set how I can get different numbers of Cell cycle
> annotated universe genes? This is just one example and I see these
> inconsistencies for many common terms.
>
>
>
> Any insights on this problem will be very helpful.
>
>
>
> Best Regards,
>
> Chanchal
>
> ===============================
>
> Chanchal Kumar, Ph.D. Candidate
>
> Dept. of Proteomics and Signal Transduction
> Max Planck Institute of Biochemistry
> Am Klopferspitz 18
> 82152 D-Martinsried (near Munich)
> Germany
> e-mail: chanchal at biochem.mpg.de
> Phone: (Office) +49 (0) 89 8578 2296
>
> Fax:(Office) +49 (0) 89 8578 2219
>
http://www.biochem.mpg.de/mann/
> ===============================
>
>
>
>
>?????? [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
>
https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor