edgeR normalization factors

王喆 ▴ 60

@-4142

Last seen 10.6 years ago

Hello,

I have a question about using TMM normalization factors. I want to modify the count for each gene after normalization. Should I just divide the count of each gene by the normalization factor for its library? Then I could use the normalized data for DE analysis and other further analysis (e.g. clustering).

Thanks a lot,
Zhe

Normalization • 3.4k views

ADD COMMENT • link updated 14.8 years ago by Naomi Altman ★ 6.0k • written 14.8 years ago by 王喆 ▴ 60

Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 4.0 years ago

United States

Multiply.

And yes, you should use the normalized data for DE and clustering. Otherwise, highly expressed genes in your sample will depress the expression of other genes relative to the size of the library, inducing spurious "differential" expression. I have been simulating data to try to understand this better.

--Naomi

ADD COMMENT • link 14.8 years ago Naomi Altman ★ 6.0k

(Travelling, so this is a rather quick response.)

I disagree with Naomi.

First, for a differential expression analysis, we prefer to use the counts as is and use the normalization factors as offsets in the statistical modeling. So these normalization factors actually DO NOT change the data (this is unlike microarray data normalization).

Second, for clustering, visualization etc., you may want to calculate a normalized expression value. Take the normalization factors that you calculate using calcNormFactors(), multiply each by its library size (see Section 6 of the manual), and DIVIDE your raw counts by this number for each library. Maybe also multiply by 10M so you have counts per 10M?

I think what Naomi is talking about (highly expressed genes depressing the expression of other genes) is covered in our paper: http://genomebiology.com/2010/11/3/R25

Cheers,
Mark
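A minimal sketch of the arithmetic described in this post, written in plain Python rather than with edgeR itself. The function names, the toy counts, and the two normalization factors are made up for illustration; in practice the factors would come from calcNormFactors().

```python
def effective_library_sizes(lib_sizes, norm_factors):
    """Library size times normalization factor, one value per library."""
    return [n * f for n, f in zip(lib_sizes, norm_factors)]


def counts_per(counts, lib_sizes, norm_factors, per=1e7):
    """Scale a genes x libraries count table to counts per `per` reads.

    Each raw count is DIVIDED by its library's effective size and then
    multiplied by `per` (10M here, as suggested in the post).
    """
    eff = effective_library_sizes(lib_sizes, norm_factors)
    return [[per * c / e for c, e in zip(row, eff)] for row in counts]


# Hypothetical toy data: 3 genes (rows) x 2 libraries (columns).
counts = [[10, 40], [200, 300], [0, 5]]
lib_sizes = [1_000_000, 2_000_000]   # total reads per library
norm_factors = [0.8, 1.25]           # stand-ins for calcNormFactors() output

normalized = counts_per(counts, lib_sizes, norm_factors)
```

These scaled values are meant for clustering and visualization only; the raw `counts` table is left untouched for the differential expression step.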

ADD REPLY • link 14.8 years ago Mark Robinson ★ 1.1k

Zhe,

for clustering and similar endeavours, transforming the data to a "logarithm-like" variance-stabilised scale is useful. See e.g. chapter 7, "Sample clustering", of the vignette of the DESeq package.

For differential expression, I agree with Mark that you want to use the counts as is, and use the normalization factors as parameters in the statistical modeling.

Wolfgang
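To illustrate what "counts as is, normalization factors as model parameters" can look like, here is a deliberately simplified sketch. It assumes a plain Poisson model for a single gene, which is an assumption made for brevity (edgeR itself works with a negative binomial model), and all names and numbers are hypothetical. The log of the effective library size enters as an offset, and the raw counts are never rescaled.

```python
import math


def log_offsets(lib_sizes, norm_factors):
    """Log effective library size for each library; used as a model offset."""
    return [math.log(n * f) for n, f in zip(lib_sizes, norm_factors)]


def poisson_rate_mle(counts, offsets):
    """MLE of lambda when counts[i] ~ Poisson(lambda * exp(offsets[i])).

    The raw counts are summed as they are; the offsets only rescale the
    exposure each library contributes.
    """
    return sum(counts) / sum(math.exp(o) for o in offsets)


# Hypothetical toy data for ONE gene: two libraries per condition.
offsets_a = log_offsets([1_000_000, 2_000_000], [0.8, 1.25])
offsets_b = log_offsets([1_000_000, 2_000_000], [0.8, 1.25])
rate_a = poisson_rate_mle([80, 250], offsets_a)
rate_b = poisson_rate_mle([160, 500], offsets_b)
log2_fold_change = math.log2(rate_b / rate_a)
```

The fold change is computed from rates per effective read, so a library with a larger normalization factor simply counts as more exposure instead of having its counts altered.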

ADD REPLY • link 14.8 years ago Wolfgang Huber ★ 13k

Thanks Mark, and have a good trip.

Zhe

ADD REPLY • link 14.8 years ago 王喆 ▴ 60

Hi

For visualization, the normalized values should do the job. For clustering, however, you may still run into problems, because count data, normalized or not, is heteroskedastic. If you feed such data to a typical distance function such as R's 'dist', the result will depend almost entirely on the most strongly expressed genes, as they have the largest variance. Hence, you should perform a variance-stabilizing transformation (VST) on the data before handing it to dist (or to any other statistical function that is designed for homoskedastic data).

Our 'DESeq' package (another tool for the same use case as edgeR, using a different way to estimate variance) has such a function ('getVarianceStabilizedData'), but it assumes that you use DESeq's variance estimation scheme. The vignette explains how to use it, e.g. for clustering.

If you prefer to stick to edgeR: to my knowledge, it does not have this functionality, but you could add it yourself with a one-liner as follows. edgeR's variance-mean relation is

variance = mean + common_dispersion * mean^2

and from such a function, the VST is obtained by integrating variance^(-1/2) w.r.t. mean. According to Wolfram Alpha, this gives

transformed_data = 2 * asinh( sqrt( common_dispersion * normalized_count ) ) / sqrt( common_dispersion )

but you may want to double-check this.

Simon
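One way to double-check the formula is numerical: the derivative of the transformation with respect to the mean should equal variance^(-1/2). The sketch below does this in plain Python; the function names and the dispersion value of 0.1 are hypothetical choices for illustration, not anything taken from edgeR.

```python
import math


def nb_variance(mean, dispersion):
    """Variance-mean relation quoted in the post."""
    return mean + dispersion * mean ** 2


def vst(normalized_count, dispersion):
    """The proposed transformation: 2 * asinh(sqrt(phi * x)) / sqrt(phi)."""
    return (2.0 * math.asinh(math.sqrt(dispersion * normalized_count))
            / math.sqrt(dispersion))


def numeric_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


phi = 0.1  # hypothetical common dispersion
checks = []
for mu in (5.0, 50.0, 500.0):
    slope = numeric_derivative(lambda m: vst(m, phi), mu)
    target = nb_variance(mu, phi) ** -0.5
    checks.append((mu, slope, target))
```

If the formula is right, `slope` and `target` agree at every mean in `checks` up to finite-difference error, which is what stabilizing the variance requires to first order.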

ADD REPLY • link 14.8 years ago Simon Anders ★ 3.8k

Of course Mark is right.

--Naomi

ADD COMMENT • link 14.8 years ago Naomi Altman ★ 6.0k

Of course Mark is correct for DE analysis. What I should have said is that the normalized library size should be used for DE. And this is certainly covered in the paper.

For clustering, I think you probably will need to change the data, but it depends on what you are clustering and the distance measure.

--Naomi

ADD COMMENT • link 14.8 years ago Naomi Altman ★ 6.0k

Thank you for your suggestions.

Zhe

ADD REPLY • link 14.8 years ago 王喆 ▴ 60
