Quantile normalization vs. data distributions

Stan Smiley ▴ 80

@stan-smiley-567

Last seen 10.2 years ago

Greetings,

I have been trying to find a quantitative measure to tell when the data distributions between chips are 'seriously' different enough from each other to violate the assumptions behind quantile normalization. I've been through the archives and seen some discussion of this matter, but didn't come away with a quantitative measure I could apply to my data sets to assure me that it would be OK to use quantile normalization.

"Quantile normalization uses a single standard for all chips, however it assumes that no serious change in distribution occurs"

Could someone please point me in the right direction on this?

Thanks.

Stan Smiley
stan.smiley@genetics.utah.edu

Normalization • 1.7k views

ADD COMMENT • link updated 20.7 years ago by Naomi Altman ★ 6.0k • written 20.7 years ago by Stan Smiley ▴ 80

Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 3.6 years ago

United States

This is a very good question that I have also been puzzling over. It seems useless to try tests of equality of distribution such as Kolmogorov-Smirnov: with such a huge sample size you would almost certainly get a significant result.

Currently, I am using the following graphical method:

1. I compute a kernel density estimate of the combined data of all probes on all the arrays.
2. I compute a kernel density estimate of the data for each array.
3. I plot both smooths on the same plot, and decide if they are the same.

Looking at what I wrote above, I think it would be better in steps 1 and 2 to background correct and center each array before combining. It might also be better to reduce the data to standardized scores before combining, unless you think that the overall scaling is due to your "treatment effect".

It seems like half of what I do is ad hoc, so I always welcome any criticisms or suggestions.

--Naomi Altman

At 06:07 PM 3/11/2004, Stan Smiley wrote:
>I have been trying to find a quantitative measure to tell when the data distributions between chips are 'seriously' different enough from each other to violate the assumptions behind quantile normalization.

Naomi S. Altman, Associate Professor
Bioinformatics Consulting Center, Dept. of Statistics
Penn State University, University Park, PA 16802-2111
814-865-3791 (voice), 814-863-7114 (fax), 814-865-1348 (Statistics)
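
For anyone who wants to try the graphical check described in steps 1 to 3, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the three "chips" are synthetic, the bandwidth uses a normal-reference rule of thumb, and instead of drawing the plot it prints the largest vertical gap between each array's density and the pooled density. None of these choices come from the thread.

```python
import math
import random
import statistics


def gaussian_kde(sample, grid, bandwidth=None):
    """Gaussian kernel density estimate of `sample`, evaluated at each grid point."""
    n = len(sample)
    if bandwidth is None:
        # Normal-reference rule of thumb; an assumed default, not from the thread.
        bandwidth = 1.06 * statistics.stdev(sample) * n ** -0.2
    scale = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        scale * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
        for x in grid
    ]


def center(values):
    """Subtract the array mean, as suggested before pooling."""
    mean = statistics.mean(values)
    return [v - mean for v in values]


# Synthetic log-intensities (hypothetical data): chip_C has a wider spread.
rng = random.Random(0)
arrays = {
    "chip_A": [rng.gauss(8.0, 1.0) for _ in range(1000)],
    "chip_B": [rng.gauss(8.3, 1.0) for _ in range(1000)],
    "chip_C": [rng.gauss(8.0, 3.0) for _ in range(1000)],
}

centered = {name: center(values) for name, values in arrays.items()}
pooled = [v for values in centered.values() for v in values]

STEP = 0.2
grid = [-12.0 + STEP * i for i in range(121)]

# Step 1: density of the combined data. Step 2: density of each array.
pooled_density = gaussian_kde(pooled, grid)
max_gap = {}
for name, values in centered.items():
    density = gaussian_kde(values, grid)
    # Step 3 (numeric stand-in for the plot): largest gap to the pooled curve.
    max_gap[name] = max(abs(a - b) for a, b in zip(density, pooled_density))
    print(name, round(max_gap[name], 3))
```

In practice you would plot `density` and `pooled_density` against `grid` and judge by eye; the printed gap is just a rough way to see which array stands out. How large a gap counts as "serious" is still a judgment call.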

ADD COMMENT • link 20.7 years ago Naomi Altman ★ 6.0k

At least to me this is a question of what assumptions I need to make to carry out a normalization (not necessarily restricted to quantile normalization). In particular: can I expect at least one of the following to be true for my data set?

a) Only a few genes (relative to the total number on the array) are changing.
b) About the same number of genes are increasing in expression as are decreasing in expression between any two treatments.

If neither holds, then you may have problems with any normalization.

Naomi has suggested a reasonable approach if you want to explore the data more directly. Provided there is not some sort of confounding variable, big differences between treatment groups in this sort of plot might indicate that you do not want to normalize across all chips. In that case you might consider normalizing within treatment group. My guess would be that you'd usually find within-group differences (in terms of densities) larger than between-group differences.

Thanks,

Ben

On Mon, 2004-03-15 at 07:04, Naomi Altman wrote:
> This is a very good question that I have also been puzzling over.
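
To make the difference between normalizing across all chips and normalizing within treatment group concrete, here is a small sketch in plain Python. It is not the Bioconductor implementation: the function names, the tiny four-probe arrays and the group labels are all hypothetical, and ties are broken by position, which is a simplification.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length arrays.

    Each value is replaced by the mean, across arrays, of the values that
    share its rank. Ties are broken by position (a simplification).
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the i-th smallest value over all arrays.
    reference = [sum(col[i] for col in sorted_cols) / len(columns) for i in range(n)]
    result = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        result.append(out)
    return result


def normalize_within_groups(arrays, groups):
    """Quantile-normalize each treatment group separately.

    arrays: dict of array name -> values; groups: dict of array name -> label.
    """
    out = {}
    for label in sorted(set(groups.values())):
        names = [name for name in arrays if groups[name] == label]
        normalized = quantile_normalize([arrays[name] for name in names])
        out.update(zip(names, normalized))
    return out


# Hypothetical toy data: the treated chips are shifted upward as a whole.
arrays = {
    "ctl_1": [5.0, 2.0, 3.0, 4.0],
    "ctl_2": [4.0, 1.0, 4.5, 2.0],
    "trt_1": [9.0, 6.0, 7.0, 8.0],
    "trt_2": [8.0, 7.0, 10.0, 6.5],
}
groups = {"ctl_1": "control", "ctl_2": "control",
          "trt_1": "treated", "trt_2": "treated"}

across_all = dict(zip(arrays, quantile_normalize(list(arrays.values()))))
within = normalize_within_groups(arrays, groups)

for name in arrays:
    print(name, "across:", sorted(across_all[name]), "within:", sorted(within[name]))
```

Across all chips, every array ends up with the same set of values, so a global shift between groups is removed whether it was technical or biological. Within groups, only chips sharing a label are forced to a common distribution, so the shift survives. Which behaviour you want depends on whether assumptions (a) or (b) are plausible for your experiment.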

ADD REPLY • link 20.7 years ago Ben Bolstad ★ 1.1k

Paul Boutros ▴ 340

@paul-boutros-371

Last seen 10.2 years ago

We've been testing something similar. We:

a) center each array around 0 and scale to 1 SD
b) compute kernel densities for each array
c) perform all pairwise comparisons between arrays, using the area under both curves (their common area) as a similarity metric
d) manually verify the most extreme outliers (e.g. the pairs of arrays with the smallest common area)

This seems to work okay for us. As you say, any direct distributional test with large arrays always finds significant differences in our hands.

Paul

Date: Mon, 15 Mar 2004 10:04:57 -0500
From: Naomi Altman <naomi@stat.psu.edu>
Subject: Re: [BioC] Quantile normalization vs. data distributions

> This is a very good question that I have also been puzzling over.
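
Steps (a) to (d) can be sketched in plain Python as follows. This is one reading of the procedure, not Paul's actual code: the synthetic chips, the shared bandwidth and the helper names are assumptions made for illustration, and the "common area" is computed as the integral of the pointwise minimum of two density curves.

```python
import itertools
import math
import random
import statistics


def standardize(values):
    """Step (a): center on 0 and scale to unit standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def gaussian_kde(sample, grid, bandwidth):
    """Step (b): Gaussian kernel density estimate evaluated on `grid`."""
    scale = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        scale * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
        for x in grid
    ]


def overlap_area(density_1, density_2, step):
    """Step (c): area common to both curves (near 1 = alike, near 0 = disjoint)."""
    return step * sum(min(a, b) for a, b in zip(density_1, density_2))


# Hypothetical data: chip_C is bimodal, so standardizing cannot hide its shape.
rng = random.Random(1)
N = 400
raw = {
    "chip_A": [rng.gauss(8.0, 1.0) for _ in range(N)],
    "chip_B": [rng.gauss(9.0, 1.5) for _ in range(N)],
    "chip_C": [rng.gauss(6.0, 0.5) if i % 2 else rng.gauss(10.0, 0.5)
               for i in range(N)],
}

STEP = 0.1
grid = [-5.0 + STEP * i for i in range(101)]
bandwidth = 1.06 * N ** -0.2  # rule of thumb for unit-SD data; an assumption

densities = {name: gaussian_kde(standardize(values), grid, bandwidth)
             for name, values in raw.items()}

overlaps = {
    (a, b): overlap_area(densities[a], densities[b], STEP)
    for a, b in itertools.combinations(sorted(densities), 2)
}

# Step (d): list pairs from smallest to largest common area for manual review.
ranked = sorted(overlaps.items(), key=lambda item: item[1])
for pair, area in ranked:
    print(pair, round(area, 3))
```

Because chip_A and chip_B differ only in location and scale, they look alike after step (a), while both pairs involving chip_C should come out with a clearly smaller common area. The metric only ranks pairs; deciding which ones are real outliers remains the manual step (d).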

ADD COMMENT • link 20.7 years ago Paul Boutros ▴ 340

Arne.Muller@aventis.com ▴ 620

@arnemulleraventiscom-466

Last seen 10.2 years ago

Hello,

I've two questions regarding the suggestions from Naomi.

1. I've had a look at some density plots (*after* rma bgcorrect + quantile normalisation across all chips of my experiment). The tails of the plots look very similar, whereas in the high-density region some plots differ in shape or value. When/how would you consider two distributions to be equal?

2. As a non-statistician I'm a bit confused that a statistical test will nearly always find a significant difference between distributions when the samples are large (I remember someone mentioned this to me, without explanation, about 2 years ago in a posting to the R-list). Is there a way to "normalize" the test results (e.g. the p-values) by the size of the sample?

I guess such a significant difference as reported by a test is a *real* difference (otherwise all statistical tests would be worthless ...). Can one assume that, even if the two distributions are statistically different, one can treat them as equal judged by visual investigation of a density plot or histogram?

What is a large sample? If a test finds a difference between two distributions, how do I know it's not just because of the sample size? Is there something like a "maximum sample size test" (similar to determining the power of a test)?

Thanks again for your comments,

kind regards,

Arne

--
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com

> -----Original Message-----
> From: bioconductor-bounces@stat.math.ethz.ch On Behalf Of Naomi Altman
> Sent: 15 March 2004 16:05
> To: Stan Smiley; Bioconductor Mailing list
> Subject: Re: [BioC] Quantile normalization vs. data distributions
>
> This is a very good question that I have also been puzzling over.

ADD COMMENT • link 20.7 years ago Arne.Muller@aventis.com ▴ 620

Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 3.6 years ago

United States

The problem with p-values is that they measure the "surprise factor", not the size of the effect. Suppose that you are testing a cholesterol-busting drug, and it really has the effect of lowering mean cholesterol (over your population) by 0.001. Does anyone care? (Cholesterol values generally range from about 100-400.) But if your sample size is big enough, you have power to detect infinitesimally small differences.

For the purpose of normalization, we probably want the probe distributions to be similar. If they are already identical, we do not need to normalize. So, with a sufficiently large sample, all we will learn is that the probe distributions are not identical, but not how far apart they are.

--Naomi

At 10:58 AM 3/16/2004, Arne.Muller@aventis.com wrote:
>Is there a way to "normalize" the test results (e.g. the p-values) by the size of the sample?
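
The cholesterol example can be put into numbers with a short sketch. The drop of 0.001 is from the post; the standard deviation of 50, the two-sample z-test with equal group sizes, and the simplification that the observed difference equals the true effect exactly are all assumptions made for illustration.

```python
import math


def two_sample_z(delta, sigma, n_per_group):
    """z statistic and two-sided p-value for a mean difference `delta`
    between two groups of size `n_per_group` with common SD `sigma`."""
    standard_error = sigma * math.sqrt(2.0 / n_per_group)
    z = delta / standard_error
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


delta = 0.001   # true drop in mean cholesterol, from the post
sigma = 50.0    # assumed population SD (hypothetical)

results = {}
for n in (10**4, 10**8, 10**10, 10**12):
    z, p = two_sample_z(delta, sigma, n)
    results[n] = (z, p)
    print(f"n per group = {n:.0e}  effect = {delta}  z = {z:.3f}  p = {p:.3g}")
```

The effect size stays at 0.001 in every row; only the p-value changes, shrinking toward zero as n grows. That is why the p-value cannot tell you how far apart two probe distributions are, and why a measure of distance between the distributions is more useful here than a test.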

ADD COMMENT • link 20.7 years ago Naomi Altman ★ 6.0k
