(stupid) question about wilcoxon test and finding interesting genes
@dipl-ing-johannes-rainer-846
Last seen 10.3 years ago
Hi,

I must apologize for my question, but I'm not really good at statistics...

We have run Affymetrix GeneChips with samples from patients before and after treatment. Until now I searched for genes that are influenced by the treatment using M values, but I also wanted to apply a statistical test to get some proof that the genes I found are significant.

So I applied a paired Wilcoxon test to the expression values (one test per gene). My sample size is 13 (13 chips with samples before treatment and 13 afterwards). I subtracted the values after treatment from those before treatment:

    p.vals <- apply((untreated-treated),MARGIN=1,wilcox.test)

Here untreated is a matrix with 13 columns and 54000 rows (genes), and treated has the same layout. According to the p-values I got, nearly every gene is significant, even if the gene is not regulated.

So my question: do I have to correct the p-values, or was I totally wrong in assuming that I could get significant (and regulated) genes this way?

Thanks
@james-w-macdonald-5106
Last seen 11 hours ago
United States
You have to correct the p-values to account for the fact that you have done 54,000 simultaneous tests. See e.g. ?p.adjust

Jim

--
James W. MacDonald
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
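As a rough illustration of what ?p.adjust offers, here is a minimal sketch. The p-values are simulated stand-ins (an assumption, not data from this thread), and the choice of the "BH" and "bonferroni" methods is only one reasonable option among those p.adjust supports.

```r
# Hypothetical raw p-values standing in for the per-gene Wilcoxon results
set.seed(1)
p.vals <- runif(1000)

# Benjamini-Hochberg adjustment (controls the false discovery rate)
p.bh <- p.adjust(p.vals, method = "BH")

# Bonferroni adjustment (stricter, controls the family-wise error rate)
p.bonf <- p.adjust(p.vals, method = "bonferroni")

# Number of genes below 0.05 before and after each adjustment
n.raw  <- sum(p.vals < 0.05)
n.bh   <- sum(p.bh < 0.05)
n.bonf <- sum(p.bonf < 0.05)
print(c(raw = n.raw, BH = n.bh, bonferroni = n.bonf))
```

Adjusted values are never smaller than the raw ones, so the list of genes called significant can only shrink after adjustment.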
Naomi Altman ★ 6.0k
@naomi-altman-380
Last seen 3.7 years ago
United States
At 09:44 AM 2/11/2005, Dipl.-Ing. Johannes Rainer wrote:
> p.vals <- apply((untreated-treated),MARGIN=1,wilcox.test)

If I understand what you did, you should have only 1 column of p-values - 1 per gene. So, I think your apply command did not work as you expected (although I think it should have).

My understanding is that you have 2 arrays per patient and took the 13 M values. Applying a Wilcoxon test to each row should test that the median difference is 0.

Try doing the test on a couple of rows and then compare with the output you obtained.

After you get 1 p-value per gene, you should apply a multiple comparisons adjustment. FDR is popular and can be computed using the "qvalue" library in Bioconductor.

--Naomi

Naomi S. Altman
Associate Professor
Bioinformatics Consulting Center
Dept. of Statistics
Penn State University
University Park, PA 16802-2111
814-865-3791 (voice)
814-863-7114 (fax)
814-865-1348 (Statistics)
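The spot check suggested above could look roughly like the following sketch. The matrices are simulated placeholders with the same 13-column layout as in the question (only 200 rows, to keep it quick), and the anonymous function that pulls out `$p.value` is an assumption about how one p-value per gene would be extracted, since wilcox.test returns a whole test object rather than a single number.

```r
# Simulated stand-ins for the real log2 expression matrices (hypothetical data)
set.seed(42)
n.genes <- 200
n.pat   <- 13
untreated <- matrix(rnorm(n.genes * n.pat, mean = 8), nrow = n.genes)
treated   <- matrix(rnorm(n.genes * n.pat, mean = 8), nrow = n.genes)

# Per-patient differences (M values), one row per gene
M <- untreated - treated

# One-sample signed-rank test on each row, keeping only the p-value
p.vals <- apply(M, MARGIN = 1, function(d) wilcox.test(d)$p.value)

# Spot check: the explicitly paired test on row 1 should give the same p-value
p.row1 <- wilcox.test(untreated[1, ], treated[1, ], paired = TRUE)$p.value
print(c(from.apply = p.vals[1], paired = p.row1))
```

If the two numbers agree for a few rows, the apply call is doing what was intended and the resulting vector can be passed on to a multiple comparisons adjustment.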
Yes, you are right, I applied a Wilcoxon test to the M values (which in this case is the same as the paired Wilcoxon test on the log2 expression values). I got a vector of p-values, one p-value for each gene. The p-values were a little surprising to me, because I found genes significant although they were not that different between the sample and the control group. About 6000 genes have a p-value less than 0.05, so this might be OK (I was a little too quick in saying that every gene is significantly different :) ).

So the next step is to correct the p-values... I thought correcting p-values is only necessary when I do multiple testing? Sorry for my question, but I am more used to programming and working with databases than to doing statistics...

Thanks to all for your answers, you help me very much!
When there is no differential expression (and if the genes were independent) then the p-values should be uniformly distributed. So, if you test at level alpha and you have N genes, you SHOULD find about alpha*N genes that have significant results (and all are false positives).

The FDR correction does 2 things simultaneously - it estimates the percentage of genes that differentially express (using departures of the p-values from the uniform distribution) - and then estimates the false discovery rate for any observed p-value.

I guess we have to ask what "necessary" and "multiple testing" mean. There are 2 kinds of error - false detects and false non-detects. We do not do this type of correction if we worry more about false non-detects. If false detects are a bigger problem, then the FDR estimate allows us to estimate when we have an acceptable rate.

If you are really testing only a few genes on your arrays, I would not use FDR. If you are really testing all the genes, then I think you have a "highly multiple" testing situation.

I don't really like the term "adjusted p-value" for FDR estimates. They are not probabilities, they are estimated error rates. But that issue was discussed a few weeks ago on this list.

--Naomi
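The alpha*N point can be checked with a small simulation. This is only a sketch under stated assumptions: the data are pure noise (no gene is truly regulated), 2000 genes are used instead of 54000 to keep it fast, and the Benjamini-Hochberg method from p.adjust is swapped in for the qvalue package so the example runs with base R alone.

```r
# Null simulation: 13 paired differences per gene, none truly regulated
set.seed(123)
n.genes <- 2000
n.pat   <- 13
alpha   <- 0.05
M.null  <- matrix(rnorm(n.genes * n.pat), nrow = n.genes)

# One signed-rank p-value per gene
p.null <- apply(M.null, 1, function(d) wilcox.test(d)$p.value)

# Roughly alpha * N genes fall below alpha by chance alone; the exact
# signed-rank p-values are discrete, so the match is only approximate
n.sig    <- sum(p.null < alpha)
expected <- alpha * n.genes
print(c(observed = n.sig, expected = expected))

# After a Benjamini-Hochberg adjustment the count can only go down
n.sig.bh <- sum(p.adjust(p.null, method = "BH") < alpha)
print(n.sig.bh)
```

Scaled up to 54000 genes, the same reasoning gives about 0.05 * 54000 = 2700 genes below 0.05 even if nothing is regulated, which is the baseline against which the roughly 6000 observed genes should be judged.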