Problem with p-values calculated by eBayes (corrected format)
@chen-zhuoxun-3131
Hi Bioconductors,

I am really sorry about sending this email again. I didn't know that the table in my email would be lost and reformatted; I have corrected the format now. Thank you for your patience.

I have a very strange problem with the statistics for my microarray data, and I would like to ask for your help. I am running a microarray experiment with 16 groups, 3 samples per group. On my genechip, every probe is spotted 2 times. Comparing two groups (let's say A and B), I came across a gene that is highly significant (p-value = 0.001669417) when I run the following code:

------------------------------------------------------------------
corfit <- duplicateCorrelation(Gvsn, design = design, ndups = 2, spacing = 1)
fit <- lmFit(Gvsn, design = design, ndups = 2, spacing = 1,
             correlation = corfit$consensus)
contrast.matrix <- makeContrasts(A - B, levels = design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit3 <- eBayes(fit2)
------------------------------------------------------------------

Then I looked at the raw data, copied and pasted it into Excel, and did a simple t-test:

           A          B
1          6.938162   7.093199
2          7.012382   8.056120
3          7.000305   6.999078
Avg        6.983616   7.382799
contrast              0.399182

p-value:
one-tailed, unequal variance t-test = 0.179333
one-tailed, equal variance t-test   = 0.151844

The p-value is not even close to 0.05. I then looked at the contrast in fit3$coefficients; it is 0.399182, which indicates that the data input to the code above is correct.

I don't understand why there is such a huge difference in p-value between those two methods. Could somebody please help me with it?

Thanks,
Zhuoxun Chen

sessionInfo():
------------------------------------------------------------------
R version 2.8.0 (2008-10-20)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
 [1] grid      splines   tools     stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] gplots_2.6.0          gdata_2.4.2           gtools_2.5.0          org.Hs.eg.db_2.2.6    GSEABase_1.4.0
 [6] PGSEA_1.10.0          Ruuid_1.20.0          Rgraphviz_1.20.2      XML_1.94-0.1          bioDist_1.14.0
[11] GOstats_2.8.0         Category_2.8.0        genefilter_1.22.0     survival_2.34-1       RBGL_1.18.0
[16] annotate_1.20.0       xtable_1.5-4          graph_1.20.0          eArrayCanary.db_1.0.0 annaffy_1.14.0
[21] KEGG.db_2.2.5         GO.db_2.2.5           RSQLite_0.7-1         DBI_0.2-4             AnnotationDbi_1.4.0
[26] statmod_1.3.6         RODBC_1.2-3           RColorBrewer_1.0-2    vsn_3.8.0             affy_1.20.0
[31] Biobase_2.2.0         lattice_0.17-15       limma_2.16.3

loaded via a namespace (and not attached):
[1] affyio_1.10.1        cluster_1.11.11      preprocessCore_1.4.0
------------------------------------------------------------------
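[For reference, a minimal sketch reproducing the Excel t-test directly in R, using the six values copied from the table above. It is not the poster's original code; the fit3 slots named in the comments are the standard components that eBayes() adds to a limma fit. Note that t.test() reports two-sided p-values, roughly twice the one-tailed Excel figures quoted above.]

------------------------------------------------------------------
A <- c(6.938162, 7.012382, 7.000305)
B <- c(7.093199, 8.056120, 6.999078)

t.test(A, B, var.equal = TRUE)   # pooled-variance (equal variance) t-test
t.test(A, B)                     # Welch t-test, unequal variances

## For the limma side, the ingredients of the moderated statistic can be
## inspected on the fitted object from the code above:
##   fit3$s2.prior, fit3$df.prior  -- variance information borrowed across genes
##   topTable(fit3, coef = 1)      -- moderated t-statistics and p-values
------------------------------------------------------------------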
@kasper-daniel-hansen-2979
On Jan 9, 2009, at 9:21, Chen, Zhuoxun wrote:

> Then, I looked at the raw data; copy and paste onto Excel and did a
> simple t-test
>
>        A          B
> 1      6.938162   7.093199
> 2      7.012382   8.056120
> 3      7.000305   6.999078

This is 1 contrast with 3 samples in each group. But where is the data from the second probe? And what are the values in corfit?

> I don't understand why it has such a huge difference on p-value
> between those two methods. Could somebody please help me with it?

You are both allowing for correlation (which may or may not be sensible; that is hard to know unless you post more details) and doing an empirical Bayes correction. So you are pretty far from doing a standard t-test, and I see no big problem in method "A" giving a different answer from method "B" when the two methods are somewhat different. Explaining in detail what the difference is, is way beyond the scope of an email. A super short answer is that you combine information from having multiple spots measuring the same transcript, and that you borrow information about the gene-level variance from looking at the behaviour of all genes. If you want more details, I suggest you read up on mixed models as well as empirical Bayes correction. A good starting point would be Gordon's SAGMB article, cited in limma.

Kasper
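[To see the two ingredients Kasper mentions in isolation, here is a minimal, self-contained sketch on simulated data; all object names (M, n.genes, etc.) are hypothetical and only the pipeline mirrors the original post. It runs the same duplicateCorrelation/lmFit/eBayes steps on duplicated spots and contrasts the result with a per-spot ordinary t-test. duplicateCorrelation() needs the statmod package, which is already in the poster's sessionInfo.]

------------------------------------------------------------------
library(limma)

set.seed(1)
n.genes <- 200
groups  <- factor(rep(c("A", "B"), each = 3))
design  <- model.matrix(~ 0 + groups)
colnames(design) <- levels(groups)

## Every gene is printed twice on adjacent rows (ndups = 2, spacing = 1),
## mirroring the array layout described in the original post.
M <- matrix(rnorm(2 * n.genes * 6), nrow = 2 * n.genes, ncol = 6)

corfit <- duplicateCorrelation(M, design, ndups = 2, spacing = 1)
fit    <- lmFit(M, design, ndups = 2, spacing = 1,
                correlation = corfit$consensus)
fit2   <- contrasts.fit(fit, makeContrasts(A - B, levels = design))
fit3   <- eBayes(fit2)

## Moderated results: one row per gene, duplicate spots combined,
## variances shrunk across all genes.
head(topTable(fit3, coef = 1, sort.by = "none"))

## Ordinary t-test using only the first spot of gene 1, for comparison.
t.test(M[1, groups == "A"], M[1, groups == "B"], var.equal = TRUE)
------------------------------------------------------------------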
Hi Zhuoxun Chen!

As Kasper indicated, there may be a combination of reasons for the difference you observe, but one of them is in fact quite easy to explain. One of the main differences between the limma version of the t-test and the standard t-test is the standard error of the difference (SED), which is used as the denominator of the t-statistic. Limma shrinks that SED towards the average SED across all genes, i.e. for genes with high variances the limma SED will be smaller than the one used by the traditional t-test, and the t-statistic will thus be bigger when using limma. It seems that the one big value in group B results in a high SED when using the standard t-test (so it gives a non-significant result), but limma shrinks it to a smaller number, which makes the result more significant.

As several studies have shown, this is a good strategy in general, but obviously there will be cases where a standard t-test might lead to a better decision. If you go through the list of all genes, you will probably also find examples where the traditional t-test gives you a spuriously significant result but limma doesn't (as Kasper already wrote: different methods will give different results).

As said before, the shrinkage of variances/SEDs might not be the only reason for the observed difference, but I assume it is a contributing factor.

Best Wishes
Claus
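[A hedged sketch of the shrinkage described above, assuming the fit2 object produced by the contrasts.fit() call in the original post; the names eb, s2.gene, s2.shrunk, etc. are only for illustration. eBayes() replaces each gene's residual variance with a posterior value pulled towards a common prior and adds prior degrees of freedom; the moderated t-statistic can be rebuilt from those components.]

------------------------------------------------------------------
library(limma)

eb <- eBayes(fit2)             # fit2 = contrasts.fit(fit, contrast.matrix)

s2.gene   <- fit2$sigma^2      # per-gene residual variance (ordinary)
s2.shrunk <- eb$s2.post        # variance after shrinkage towards eb$s2.prior

## Moderated t and p-value for the first (only) contrast, rebuilt by hand.
## (limma additionally caps the total df at the pooled residual df; that
## detail is ignored in this sketch.)
t.mod  <- eb$coefficients[, 1] / (eb$stdev.unscaled[, 1] * sqrt(s2.shrunk))
df.tot <- eb$df.residual + eb$df.prior
p.mod  <- 2 * pt(-abs(t.mod), df = df.tot)

## These should agree with eb$t[, 1] and eb$p.value[, 1]. A gene whose
## s2.gene is much larger than s2.shrunk is exactly the situation
## described above, where limma's p-value is smaller than the ordinary one.
head(cbind(s2.gene, s2.shrunk, t.mod, p.mod))
------------------------------------------------------------------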