P values on Log or Non-Log Values
Park, Richard ▴ 220
@park-richard-227
Last seen 10.3 years ago
Hi Everyone,

I am currently using mt.teststat to calculate p-values between various samples. Does anyone know whether it is OK to compute p-values on logged or non-logged values? In the past, using MAS processing, I always calculated p-values on the raw values; however, I have recently switched to processing CEL files through RMA, and the expression data it produces are on the log base 2 scale.

My lab has noticed that the log transformation makes little visible difference for high p-values (above 0.1), but spreads them all over the place in the low (significant!) range. Running a t-test on logged values greatly enhances the significance (up to 100-fold, compared to running it on the straight values) when the significance derives from tight distributions, but has little or no effect when the significance derives from more distant means.

Does anyone have an idea which method is correct?

thanks,
Richard Park
Computational Data Analyzer
Joslin Diabetes Center
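For concreteness, a minimal R sketch of the comparison described above, using simulated right-skewed intensities; the group sizes, means, and object names are all illustrative assumptions, not taken from the original post:

    ## Simulated intensities for one probe set: two groups of 5 arrays,
    ## log-normal (right-skewed), differing in mean on the log scale.
    set.seed(1)
    groupA <- 2^rnorm(5, mean = 8, sd = 0.3)   # raw-scale intensities
    groupB <- 2^rnorm(5, mean = 9, sd = 0.3)

    ## t-test on the raw values vs. the same test on log2 values
    t.test(groupA, groupB)$p.value               # raw scale (MAS-style)
    t.test(log2(groupA), log2(groupB))$p.value   # log2 scale (as produced by RMA)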
@james-w-macdonald-5106
Last seen 3 hours ago
United States
From a theoretical standpoint it is more correct to do t-tests on logged data, because one of the assumptions of the t-test is that the underlying data are normally distributed. Microarray expression values are almost always strongly right-skewed, and logging makes the distribution much more symmetrical.

It is doubtful that the logged data are truly normally distributed, but the t-test is fairly robust to violations of the normality assumption as long as the data are relatively symmetrical. You can also permute your data to estimate the null distribution if you want to remove the reliance on normality. However, in my opinion it is still better to use symmetrical (logged) data when permuting.

HTH,

Jim

James W. MacDonald
UMCCC Microarray Core Facility
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
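As a concrete illustration of the permutation approach described above, a minimal sketch for a single probe's log2 values (the data are simulated and the sample sizes are assumptions; for a full expression matrix, the same relabelling idea underlies mt.teststat in the multtest package that the original poster is using):

    ## Permutation null distribution for a two-sample t-statistic on log2 data.
    set.seed(2)
    x <- rnorm(5, mean = 8)        # log2 values, group A (simulated)
    y <- rnorm(5, mean = 9)        # log2 values, group B (simulated)
    obs <- t.test(x, y)$statistic  # observed t-statistic

    vals <- c(x, y)
    perm <- replicate(10000, {
        idx <- sample(length(vals), length(x))    # random relabelling
        t.test(vals[idx], vals[-idx])$statistic
    })

    ## Two-sided permutation p-value
    mean(abs(perm) >= abs(obs))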
At 01:36 AM 6/05/2003, James MacDonald wrote:

> From a theoretical standpoint it is more correct to do t-tests on logged
> data because one of the assumptions of the t-test is that the underlying
> data are normally distributed. Microarray expression values are almost
> always strongly right-skewed, and logging causes the distribution to
> become much more symmetrical.
>
> It is doubtful that the logged data are normally distributed, but the
> t-test is fairly robust to violations of the normality assumption as long
> as the data are relatively symmetrical.

Don't forget that results on the robustness of the t-test to non-normality assume that (i) there is a reasonable number of observations, at least 15 say, and (ii) the p-values that need to be accurate are those around 0.05 rather than around 1e-5. Neither of these assumptions is true in the microarray context!

But the main point here is, as Jim says, it has to be a whole lot better on the log scale, because the log-intensities are more symmetrically distributed.

Cheers
Gordon
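Gordon's calibration point can be checked by simulation. A sketch under the null hypothesis of no true difference, with all parameters chosen purely for illustration:

    ## Type I error of the t-test at a small sample size, raw vs. log scale.
    ## Both groups are drawn from the same log-normal distribution (null true).
    set.seed(3)
    pvals <- replicate(10000, {
        a <- 2^rnorm(4, mean = 8, sd = 0.5)   # skewed raw intensities
        b <- 2^rnorm(4, mean = 8, sd = 0.5)
        c(raw = t.test(a, b)$p.value,
          log = t.test(log2(a), log2(b))$p.value)
    })

    ## Fraction of p-values below each cutoff; these should be close to the
    ## nominal levels if the test is well calibrated at that cutoff.
    rowMeans(pvals < 0.05)
    rowMeans(pvals < 1e-3)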
Hi,

> At 01:36 AM 6/05/2003, James MacDonald wrote:
> > From a theoretical standpoint it is more correct to do t-tests on logged
> > data because one of the assumptions of the t-test is that the underlying
> > data are normally distributed. ...
> But the main point here is, as Jim says, it has to be a whole lot better on
> the log scale because the log-intensities are more symmetrically distributed.

Blythe Durbin has done some studies on the effect of transformations on the distribution of microarray data [1], comparing the raw scale, the log scale, and a "generalized log", i.e. a function of the form

    f(x) = log(x + sqrt(x^2 + c^2)) - log(2)

that behaves like the log for x >> c and like a linear function for x ~ 0. While the log is good for high intensities, for small x it can lead to strongly fluctuating values and even create skewness, so the generalized log is in many cases a good interpolation. Another nice property of the generalized log is that, for a suitable choice of c, it can stabilize the variance, i.e. make the standard deviation of the data approximately independent of their mean.

[1] http://handel.cipic.ucdavis.edu/~dmrocke/biolikelihood.pdf Chapter 3.

Best regards
Wolfgang
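A small R sketch of the generalized log Wolfgang describes; the value of c here is an arbitrary assumption for illustration (in practice it would be estimated from the data, e.g. by the methods in the reference above):

    ## Generalized log ("glog"): linear near 0, log-like for x >> c.
    glog <- function(x, c = 50) log(x + sqrt(x^2 + c^2)) - log(2)

    x <- c(1, 10, 50, 500, 5000)
    cbind(x, log = log(x), glog = glog(x))

    ## For large x, glog(x) approaches log(x); near zero it stays finite
    ## and roughly linear, unlike log(x), which diverges to -Inf.
    glog(0)    # finite: log(c) - log(2)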