Assess inter-study consistency


Scott Ochsner ▴ 300

@scott-ochsner-599

Last seen 10.3 years ago

Dear BioC,

I would like to use simple correlation to assess the consistency between seven independent expression array datasets. All datasets are on the same platform, hgu133a.

In the materials and methods section of http://cancerres.aacrjournals.org/cgi/content/full/67/21/10296#top they state, "To assess for consistency between the three studies, Pearson correlation was computed pair-wise between the mean values of common genes. The three studies showed significant positive pair-wise correlation."

I'm having trouble following their statement. I don't have to worry about common genes, as all seven of the studies I'm looking at are on the same platform.

I thought of doing something like the following:

#eset is your standard ExpressionSet object
#treatment is a vector describing which group each array belongs to. There are two groups, cont. and drug.

> avg<-function(eset,treatment){
+ tmp<-aggregate(t(exprs(eset)),by=list(treatment),mean)
+ rownames(tmp)<-tmp[,1]
+ t(tmp[,-1])
+ }
> groupAverage<-avg(eset,treatment)
> dim(groupAverage)
[1] 22277 14

> cor(groupAverage)
          c.d3529   c.d3834   c.d4006   c.d4025   c.d6800   c.d8540   c.d9936
c.d3529 1.0000000 0.9659532 0.7933771 0.7498652 0.8957816 0.8874096 0.9041292
c.d3834 0.9659532 1.0000000 0.8071949 etc....
          e.d3529   e.d3834   e.d4006   e.d4025   e.d6800   e.d8540
c.d3529 0.9917589 0.9535454 0.7964003 0.7577108 0.8889499 0.8904473

Questions:

1. Since I'm expecting most of the probe sets on these arrays not to change, shouldn't I expect high correlation even between the cont. and drug groups? In other words, how informative is computing cor across all of the probe sets?

2. How might I assess the significance of these correlations?
> sessionInfo()
R version 2.7.0 (2008-04-22)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] splines tools stats graphics grDevices utils datasets methods base

other attached packages:
 [1] affycoretools_1.12.0 annaffy_1.12.1 KEGG.db_2.2.0 gcrma_2.12.1 matchprobes_1.12.0 biomaRt_1.14.0
 [7] RCurl_0.9-3 GOstats_2.6.0 Category_2.6.0 RBGL_1.16.0 GO.db_2.2.0 graph_1.18.1
[13] limma_2.14.2 affy_1.18.1 preprocessCore_1.2.0 affyio_1.8.0 MLInterfaces_1.14.1 annotate_1.18.0
[19] xtable_1.5-2 AnnotationDbi_1.2.1 RSQLite_0.6-8 DBI_0.2-4 rda_1.0 rpart_3.1-41
[25] genefilter_1.20.0 survival_2.34-1 MASS_7.2-41 Biobase_2.0.1

loaded via a namespace (and not attached):
[1] class_7.2-41 cluster_1.11.10 XML_1.95-2

Scott A. Ochsner, Ph.D.
NURSA Bioinformatics
Molecular and Cellular Biology
Baylor College of Medicine
Houston, TX. 77030
phone: 713-798-6227

GO hgu133a probe • 888 views

updated 16.3 years ago by Thomas Hampton ▴ 750 • written 16.3 years ago by Scott Ochsner ▴ 300


Thomas Hampton ▴ 750

@thomas-hampton-2820

Last seen 10.3 years ago

This is one of my favorite topics.

cor.test returns a p value, so you could consider that. But as you are intimating, most genes need to be expressed at the same level regardless of what you do to the system, or the cells will die. So this doesn't really answer your question of concordance, because you want to report on the genes that did something, not the ones that just sat there.

There are many ways you could score this. My favorite is to select the top N most regulated genes in each experiment. If you pick a small number, then you will be focusing on the part of your experiment you are likely to report as a result.

The basic idea of a very simple statistic is: how likely is a particular gene to make it into the top 1% all 7 times? And the answer is, not very likely under the null hypothesis. If you model this as picking genes at random out of an urn, the probability would be .01^7. With a tiny number of genes picked, you could use dbinom, or get fancier and use dhyper.

All this assumes independence, which is kind of strange on an array where you have multiple probes for the same gene, so all this is just a starting point...

Cheers
T

On Sep 4, 2008, at 1:40 PM, Ochsner, Scott A wrote:
> Dear BioC,
>
> I would like to use simple correlation to assess the consistency
> between a seven independent expression array datasets. All
> datasets are on the same platform, hgu133a.
> [...]
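For anyone who wants to try the top-N idea, here is a minimal sketch on simulated data. It is not taken from the thread: the matrix lfc (one drug-minus-control log fold change per probe set per study), the planted set of 50 shared probe sets, and all variable names are hypothetical stand-ins, and the null model is the simple independent urn described above, so treat the numbers as a rough guide only.

```r
# Simulated stand-in for real data (all names here are hypothetical):
# one log fold change (drug minus control) per probe set per study.
set.seed(1)
n_probes  <- 22277   # number of probe sets reported in the question
n_studies <- 7
top_frac  <- 0.01    # "top 1%" as in the answer

# Pure noise in every study ...
lfc <- matrix(rnorm(n_probes * n_studies), nrow = n_probes,
              dimnames = list(NULL, paste0("study", seq_len(n_studies))))
# ... plus a planted shared effect in the first 50 probe sets.
shared <- 1:50
lfc[shared, ] <- lfc[shared, ] + 4

# Flag the top N most regulated probe sets (largest |lfc|) within each study.
n_top  <- ceiling(top_frac * n_probes)
in_top <- apply(abs(lfc), 2,
                function(x) rank(-x, ties.method = "first") <= n_top)

# For each probe set, count how many of the 7 top lists it appears in.
hits <- rowSums(in_top)

# Urn-model null: a given probe set lands in the top 1% of all 7 studies.
p_all7 <- dbinom(n_studies, size = n_studies, prob = top_frac)  # .01^7

# Chance of appearing in at least k lists, using the realised list fraction.
p_top <- n_top / n_probes
p_at_least <- function(k) {
  pbinom(k - 1, size = n_studies, prob = p_top, lower.tail = FALSE)
}

# Expected number of probe sets in >= 3 lists by chance vs. observed.
expected_3 <- n_probes * p_at_least(3)
observed_3 <- sum(hits >= 3)

# Hypergeometric version for one pair of studies: probability of an
# overlap at least this large between two random lists of size n_top.
overlap_12 <- sum(in_top[, 1] & in_top[, 2])
p_overlap  <- phyper(overlap_12 - 1, m = n_top, n = n_probes - n_top,
                     k = n_top, lower.tail = FALSE)

# cor.test on the fold changes, for comparison with the top-N view.
ct <- cor.test(lfc[, 1], lfc[, 2])
```

With real data you would replace lfc with your own per-study fold changes (or moderated statistics). Comparing observed_3 with expected_3 gives a quick feel for whether the overlap exceeds what the urn model predicts; the independence caveat about multiple probe sets per gene still applies.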

16.3 years ago Thomas Hampton ▴ 750
