Normalization

0

Entering edit mode

Vittoria Roncalli ▴ 40

@vittoria-roncalli-5633

Last seen 10.7 years ago

Hi, I am a edgeR user and I am a little bit confused on the normalization topic. I am using EdgeR to get different expressed genes within 3 conditions (RnaSeq) with 3 replicates each. I am following the user guide step: -update counts file (from mapping against reference transcriptome) - filter the low counts reads (1cpm) - reassess library size - estimate common dispersion Mi first question is related to the normalization. Why, after I import my file, next to the library size there is then column with norm.factors? $samples group lib.size norm.factors X48h_C_r1.sam CONTROL 10898526 1 X48h_C_r2.sam CONTROL 7176817 1 X48h_C_r3.sam CONTROL 9511875 1 X48h_LD_r1.sam LD 11350347 1 X48h_LD_r2.sam LD 14836541 1 X48h_LD_r3.sam LD 12635344 1 X48h_HD_r1.sam HD 11840963 1 X48h_HD_r2.sam HD 17335549 1 X48h_HD_r3.sam HD 10274526 1 Is the normalization automated? What is the difference with the "calNormFactors?" Moreover, if I do not run the calNormFactors, what is into the pseudo.counts output? I am very confused about those points. Thanks in advance for your help. Looking forward to hearing from you. Vittoria -- Vittoria Roncalli Graduate Research Assistant Center Békésy Laboratory of Neurobiology Pacific Biosciences Research Center University of Hawaii at Manoa 1993 East-West Road Honolulu, HI 96822 USA Tel: 808-4695693 [[alternative HTML version deleted]]

Normalization edgeR Normalization edgeR • 1.7k views

ADD COMMENT • link updated 12.2 years ago by Ryan C. Thompson ★ 7.9k • written 12.2 years ago by Vittoria Roncalli ▴ 40

0

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 6 months ago

Icahn School of Medicine at Mount Sinai…

To answer your first question, when you first create a DGEList object, all the normalization factors are initially set to 1 by default. This is equivalent to no normalization. Once you use calcNormFactors, the normalization factors will be set appropriately. I'm not sure about the second question. Could you provide an example of how you are obtaining pseudocounts with edgeR? On Wed 27 Feb 2013 05:12:27 PM PST, Vittoria Roncalli wrote: > Hi, I am a edgeR user and I am a little bit confused on the normalization > topic. > I am using EdgeR to get different expressed genes within 3 conditions > (RnaSeq) with 3 replicates each. > I am following the user guide step: > > -update counts file (from mapping against reference transcriptome) > - filter the low counts reads (1cpm) > - reassess library size > - estimate common dispersion > > Mi first question is related to the normalization. Why, after I import my > file, next to the library size there is then column with norm.factors? > > $samples > > group lib.size norm.factors > > X48h_C_r1.sam CONTROL 10898526 1 > > X48h_C_r2.sam CONTROL 7176817 1 > > X48h_C_r3.sam CONTROL 9511875 1 > > X48h_LD_r1.sam LD 11350347 1 > > X48h_LD_r2.sam LD 14836541 1 > > X48h_LD_r3.sam LD 12635344 1 > > X48h_HD_r1.sam HD 11840963 1 > > X48h_HD_r2.sam HD 17335549 1 > > X48h_HD_r3.sam HD 10274526 1 > > > > Is the normalization automated? What is the difference with the > "calNormFactors?" > > Moreover, if I do not run the calNormFactors, what is into the > pseudo.counts output? > > > I am very confused about those points. > > > Thanks in advance for your help. > > > Looking forward to hearing from you. > > > Vittoria > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.2 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 6 months ago

Icahn School of Medicine at Mount Sinai…

Hi Vittoria, Please use "Reply All" so that your reply also goes to the mailing list. The normalization factors are used to adjust the library sizes (I forget the details, I believe they are given in the User's Guide), and then the pseudo counts are obtained by normalizing the counts to the adjusted library sizes. Since you have not used any normalization factors (i.e. all norm factors = 1), the pseudo counts will simply be some constant factor of counts-per-million, if I'm not mistaken. If you want absolutely no normalization, you would have to set both the normalization factors and library sizes to 1, I think. In any case, the pseudo counts are only for descriptive purposes. The statistical testing in edgeR happens using the raw integer counts. On 02/27/2013 10:12 PM, Vittoria Roncalli wrote: > Hi Ryan, > > thanks for your reply. > I obtain pesudo.counts with the following commands > > " > > > raw.data <- read.table("counts 2.txt",sep="\t",header=T) > > > d <- raw.data[, 2:10] > > > d[is.na <http: is.na="">(d)] <- 0 > > > rownames(d) <- raw.data[, 1] > > > group <- c("CONTROL","CONTROL","CONTROL","LD","LD","LD","HD","HD","HD") > > > d <- DGEList(counts = d, group = group) > > Calculating library sizes from column totals. > > > keep <- rowSums (cpm(d)>1) >=3 > > > d <- d[keep,] > > > dim(d) > > [1] 28755 9 > > > d <- DGEList(counts = d, group = group) > > Calculating library sizes from column totals. > > > d <- estimateCommonDisp(d) > > > After the common dispersion, I get in the DGE list > > $counts > > $samples > > $commondispersion > > $pseudo.counts > > $logCPM > > $pseudo.lib.size > > > > Then I write a table for the pseudo.counts and I will continue with > those for the DGE. > > Considering that I did non normalize the libraries, what are the > different counts in the pseudo.counts output? > > > Thanks so much > > > Vittoria > On Wed, Feb 27, 2013 at 7:20 PM, Ryan C. Thompson > <rct@thompsonclan.org <mailto:rct@thompsonclan.org="">> wrote: > > To answer your first question, when you first create a DGEList > object, all the normalization factors are initially set to 1 by > default. This is equivalent to no normalization. Once you use > calcNormFactors, the normalization factors will be set appropriately. > > I'm not sure about the second question. Could you provide an > example of how you are obtaining pseudocounts with edgeR? > > > On Wed 27 Feb 2013 05:12:27 PM PST, Vittoria Roncalli wrote: > > Hi, I am a edgeR user and I am a little bit confused on the > normalization > topic. > I am using EdgeR to get different expressed genes within 3 > conditions > (RnaSeq) with 3 replicates each. > I am following the user guide step: > > -update counts file (from mapping against reference transcriptome) > - filter the low counts reads (1cpm) > - reassess library size > - estimate common dispersion > > Mi first question is related to the normalization. Why, after > I import my > file, next to the library size there is then column with > norm.factors? > > $samples > > group lib.size norm.factors > > X48h_C_r1.sam CONTROL 10898526 1 > > X48h_C_r2.sam CONTROL 7176817 1 > > X48h_C_r3.sam CONTROL 9511875 1 > > X48h_LD_r1.sam LD 11350347 1 > > X48h_LD_r2.sam LD 14836541 1 > > X48h_LD_r3.sam LD 12635344 1 > > X48h_HD_r1.sam HD 11840963 1 > > X48h_HD_r2.sam HD 17335549 1 > > X48h_HD_r3.sam HD 10274526 1 > > > > Is the normalization automated? What is the difference with the > "calNormFactors?" > > Moreover, if I do not run the calNormFactors, what is into the > pseudo.counts output? > > > I am very confused about those points. > > > Thanks in advance for your help. > > > Looking forward to hearing from you. > > > Vittoria > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org <mailto:bioconductor@r-project.org> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > -- > > Vittoria Roncalli > > Graduate Research Assistant > Center Békésy Laboratory of Neurobiology > Pacific Biosciences Research Center > University of Hawaii at Manoa > 1993 East-West Road > Honolulu, HI 96822 USA > > Tel: 808-4695693 > [[alternative HTML version deleted]]

ADD COMMENT • link 12.2 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Hi Ryan, Thanks again for your explanation, you saved my day! Considering your expertise, I would ask you another question. I run on the raw data counts a simple one way anova (I have 3 treatments with 3 reps each) and I found out that there is no significant difference between them. Then, with EdgeR I was able, to extract a list of DGE fro each pairwise comparison. Is this because the ANOVA is calculated on the overall library (total # genes) while the DGE comes from a t-test for each individual gene? I found this explanation on Bullard et al 2010, but I am not sure if I have misunderstood something. Does it make sense to you? Have a good day,and thanks again for your help. Vittoria On Wed, Feb 27, 2013 at 9:48 PM, Ryan C. Thompson <rct@thompsonclan.org>wrote: > Hi Vittoria, > > Please use "Reply All" so that your reply also goes to the mailing list. > > The normalization factors are used to adjust the library sizes (I forget > the details, I believe they are given in the User's Guide), and then the > pseudo counts are obtained by normalizing the counts to the adjusted > library sizes. Since you have not used any normalization factors (i.e. all > norm factors = 1), the pseudo counts will simply be some constant factor of > counts-per-million, if I'm not mistaken. If you want absolutely no > normalization, you would have to set both the normalization factors and > library sizes to 1, I think. > > In any case, the pseudo counts are only for descriptive purposes. The > statistical testing in edgeR happens using the raw integer counts. > > > On 02/27/2013 10:12 PM, Vittoria Roncalli wrote: > > Hi Ryan, > > thanks for your reply. > I obtain pesudo.counts with the following commands > > " > > > raw.data <- read.table("counts 2.txt",sep="\t",header=T) > > > d <- raw.data[, 2:10] > > > d[is.na(d)] <- 0 > > > rownames(d) <- raw.data[, 1] > > > group <- c("CONTROL","CONTROL","CONTROL","LD","LD","LD","HD","HD","HD") > > > d <- DGEList(counts = d, group = group) > Calculating library sizes from column totals. > > > keep <- rowSums (cpm(d)>1) >=3 > > > d <- d[keep,] > > > dim(d) > > [1] 28755 9 > > > d <- DGEList(counts = d, group = group) > > Calculating library sizes from column totals. > > > d <- estimateCommonDisp(d) > > After the common dispersion, I get in the DGE list > > $counts > > $samples > > $commondispersion > > $pseudo.counts > > $logCPM > > $pseudo.lib.size > > > Then I write a table for the pseudo.counts and I will continue with > those for the DGE. > > Considering that I did non normalize the libraries, what are the > different counts in the pseudo.counts output? > > > Thanks so much > > > Vittoria > On Wed, Feb 27, 2013 at 7:20 PM, Ryan C. Thompson <rct@thompsonclan.org>wrote: > >> To answer your first question, when you first create a DGEList object, >> all the normalization factors are initially set to 1 by default. This is >> equivalent to no normalization. Once you use calcNormFactors, the >> normalization factors will be set appropriately. >> >> I'm not sure about the second question. Could you provide an example of >> how you are obtaining pseudocounts with edgeR? >> >> >> On Wed 27 Feb 2013 05:12:27 PM PST, Vittoria Roncalli wrote: >> >>> Hi, I am a edgeR user and I am a little bit confused on the >>> normalization >>> topic. >>> I am using EdgeR to get different expressed genes within 3 conditions >>> (RnaSeq) with 3 replicates each. >>> I am following the user guide step: >>> >>> -update counts file (from mapping against reference transcriptome) >>> - filter the low counts reads (1cpm) >>> - reassess library size >>> - estimate common dispersion >>> >>> Mi first question is related to the normalization. Why, after I import my >>> file, next to the library size there is then column with norm.factors? >>> >>> $samples >>> >>> group lib.size norm.factors >>> >>> X48h_C_r1.sam CONTROL 10898526 1 >>> >>> X48h_C_r2.sam CONTROL 7176817 1 >>> >>> X48h_C_r3.sam CONTROL 9511875 1 >>> >>> X48h_LD_r1.sam LD 11350347 1 >>> >>> X48h_LD_r2.sam LD 14836541 1 >>> >>> X48h_LD_r3.sam LD 12635344 1 >>> >>> X48h_HD_r1.sam HD 11840963 1 >>> >>> X48h_HD_r2.sam HD 17335549 1 >>> >>> X48h_HD_r3.sam HD 10274526 1 >>> >>> >>> >>> Is the normalization automated? What is the difference with the >>> "calNormFactors?" >>> >>> Moreover, if I do not run the calNormFactors, what is into the >>> pseudo.counts output? >>> >>> >>> I am very confused about those points. >>> >>> >>> Thanks in advance for your help. >>> >>> >>> Looking forward to hearing from you. >>> >>> >>> Vittoria >>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> > > > -- > > Vittoria Roncalli > > Graduate Research Assistant > Center Békésy Laboratory of Neurobiology > Pacific Biosciences Research Center > University of Hawaii at Manoa > 1993 East-West Road > Honolulu, HI 96822 USA > > Tel: 808-4695693 > > > -- Vittoria Roncalli Graduate Research Assistant Center Békésy Laboratory of Neurobiology Pacific Biosciences Research Center University of Hawaii at Manoa 1993 East-West Road Honolulu, HI 96822 USA Tel: 808-4695693 [[alternative HTML version deleted]]

ADD REPLY • link 12.2 years ago Vittoria Roncalli ▴ 40

0

Entering edit mode

Hi Vittoria, It would be best if you could show code examples of what gave you an empty list and what gave you a list of differentially expressed genes and what code didn't. Whether you you are doing a pairwise comparison or a multi-way "ANOVA-style" comparison, edgeR is actually performing the same test. In general, if all three pairwise comparisons are yielding significant hits, I would expect some significant hits in the three-way comparison as well. -Ryan On Thu 28 Feb 2013 11:26:17 AM PST, Vittoria Roncalli wrote: > Hi Ryan, > > Thanks again for your explanation, you saved my day! > Considering your expertise, I would ask you another question. > I run on the raw data counts a simple one way anova (I have 3 > treatments with 3 reps each) and I found out that there is no > significant difference between them. Then, with EdgeR I was able, to > extract a list of DGE fro each pairwise comparison. Is this because > the ANOVA is calculated on the overall library (total # genes) while > the DGE comes from a t-test for each individual gene? I found this > explanation on Bullard et al 2010, but I am not sure if I have > misunderstood something. > > Does it make sense to you? > > Have a good day,and thanks again for your help. > > Vittoria > > On Wed, Feb 27, 2013 at 9:48 PM, Ryan C. Thompson > <rct at="" thompsonclan.org="" <mailto:rct="" at="" thompsonclan.org="">> wrote: > > Hi Vittoria, > > Please use "Reply All" so that your reply also goes to the mailing > list. > > The normalization factors are used to adjust the library sizes (I > forget the details, I believe they are given in the User's Guide), > and then the pseudo counts are obtained by normalizing the counts > to the adjusted library sizes. Since you have not used any > normalization factors (i.e. all norm factors = 1), the pseudo > counts will simply be some constant factor of counts-per- million, > if I'm not mistaken. If you want absolutely no normalization, you > would have to set both the normalization factors and library sizes > to 1, I think. > > In any case, the pseudo counts are only for descriptive purposes. > The statistical testing in edgeR happens using the raw integer counts. > > > On 02/27/2013 10:12 PM, Vittoria Roncalli wrote: >> Hi Ryan, >> >> thanks for your reply. >> I obtain pesudo.counts with the following commands >> >> " >> >> > raw.data <- read.table("counts 2.txt",sep="\t",header=T) >> >> > d <- raw.data[, 2:10] >> >> > d[is.na <http: is.na="">(d)] <- 0 >> >> > rownames(d) <- raw.data[, 1] >> >> > group <- c("CONTROL","CONTROL","CONTROL","LD","LD","LD","HD","HD","HD") >> >> > d <- DGEList(counts = d, group = group) >> >> Calculating library sizes from column totals. >> >> > keep <- rowSums (cpm(d)>1) >=3 >> >> > d <- d[keep,] >> >> > dim(d) >> >> [1] 28755 9 >> >> > d <- DGEList(counts = d, group = group) >> >> Calculating library sizes from column totals. >> >> > d <- estimateCommonDisp(d) >> >> >> After the common dispersion, I get in the DGE list >> >> $counts >> >> $samples >> >> $commondispersion >> >> $pseudo.counts >> >> $logCPM >> >> $pseudo.lib.size >> >> >> >> Then I write a table for the pseudo.counts and I will continue >> with those for the DGE. >> >> Considering that I did non normalize the libraries, what are the >> different counts in the pseudo.counts output? >> >> >> Thanks so much >> >> >> Vittoria >> On Wed, Feb 27, 2013 at 7:20 PM, Ryan C. Thompson >> <rct at="" thompsonclan.org="" <mailto:rct="" at="" thompsonclan.org="">> wrote: >> >> To answer your first question, when you first create a >> DGEList object, all the normalization factors are initially >> set to 1 by default. This is equivalent to no normalization. >> Once you use calcNormFactors, the normalization factors will >> be set appropriately. >> >> I'm not sure about the second question. Could you provide an >> example of how you are obtaining pseudocounts with edgeR? >> >> >> On Wed 27 Feb 2013 05:12:27 PM PST, Vittoria Roncalli wrote: >> >> Hi, I am a edgeR user and I am a little bit confused on >> the normalization >> topic. >> I am using EdgeR to get different expressed genes within >> 3 conditions >> (RnaSeq) with 3 replicates each. >> I am following the user guide step: >> >> -update counts file (from mapping against reference >> transcriptome) >> - filter the low counts reads (1cpm) >> - reassess library size >> - estimate common dispersion >> >> Mi first question is related to the normalization. Why, >> after I import my >> file, next to the library size there is then column with >> norm.factors? >> >> $samples >> >> group lib.size norm.factors >> >> X48h_C_r1.sam CONTROL 10898526 1 >> >> X48h_C_r2.sam CONTROL 7176817 1 >> >> X48h_C_r3.sam CONTROL 9511875 1 >> >> X48h_LD_r1.sam LD 11350347 1 >> >> X48h_LD_r2.sam LD 14836541 1 >> >> X48h_LD_r3.sam LD 12635344 1 >> >> X48h_HD_r1.sam HD 11840963 1 >> >> X48h_HD_r2.sam HD 17335549 1 >> >> X48h_HD_r3.sam HD 10274526 1 >> >> >> >> Is the normalization automated? What is the difference >> with the >> "calNormFactors?" >> >> Moreover, if I do not run the calNormFactors, what is >> into the >> pseudo.counts output? >> >> >> I am very confused about those points. >> >> >> Thanks in advance for your help. >> >> >> Looking forward to hearing from you. >> >> >> Vittoria >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> <mailto:bioconductor at="" r-project.org=""> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> >> >> -- >> >> Vittoria Roncalli >> >> Graduate Research Assistant >> Center B?k?sy Laboratory of Neurobiology >> Pacific Biosciences Research Center >> University of Hawaii at Manoa >> 1993 East-West Road >> Honolulu, HI 96822 USA >> >> Tel: 808-4695693 <tel:808-4695693> >> > > > > > -- > > Vittoria Roncalli > > Graduate Research Assistant > Center B?k?sy Laboratory of Neurobiology > Pacific Biosciences Research Center > University of Hawaii at Manoa > 1993 East-West Road > Honolulu, HI 96822 USA > > Tel: 808-4695693 >

ADD REPLY • link 12.2 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Oh, I just realized you are using the non-GLM-based mode of operation for edgeR. I am much more familiar with the GLM workflow, and I believe that the GLM-based workflow is now preferred over the exactTest-based one. In fact, I'm not even sure how to do an ANOVA-style comparision of 3 or more groups using exactTest. In any case, the best way to describe what you are tyring to do is to is to show the code you are using. The answers could depend on what options you are using, how you are calculating dispersions, and many other small factors. Also please tell us which versions of R, and edgeR you are using. On Thu 28 Feb 2013 11:38:04 AM PST, Ryan C. Thompson wrote: > Hi Vittoria, > > It would be best if you could show code examples of what gave you an > empty list and what gave you a list of differentially expressed genes > and what code didn't. Whether you you are doing a pairwise comparison > or a multi-way "ANOVA-style" comparison, edgeR is actually performing > the same test. In general, if all three pairwise comparisons are > yielding significant hits, I would expect some significant hits in the > three-way comparison as well. > > -Ryan > > On Thu 28 Feb 2013 11:26:17 AM PST, Vittoria Roncalli wrote: >> Hi Ryan, >> >> Thanks again for your explanation, you saved my day! >> Considering your expertise, I would ask you another question. >> I run on the raw data counts a simple one way anova (I have 3 >> treatments with 3 reps each) and I found out that there is no >> significant difference between them. Then, with EdgeR I was able, to >> extract a list of DGE fro each pairwise comparison. Is this because >> the ANOVA is calculated on the overall library (total # genes) while >> the DGE comes from a t-test for each individual gene? I found this >> explanation on Bullard et al 2010, but I am not sure if I have >> misunderstood something. >> >> Does it make sense to you? >> >> Have a good day,and thanks again for your help. >> >> Vittoria >> >> On Wed, Feb 27, 2013 at 9:48 PM, Ryan C. Thompson >> <rct at="" thompsonclan.org="" <mailto:rct="" at="" thompsonclan.org="">> wrote: >> >> Hi Vittoria, >> >> Please use "Reply All" so that your reply also goes to the mailing >> list. >> >> The normalization factors are used to adjust the library sizes (I >> forget the details, I believe they are given in the User's Guide), >> and then the pseudo counts are obtained by normalizing the counts >> to the adjusted library sizes. Since you have not used any >> normalization factors (i.e. all norm factors = 1), the pseudo >> counts will simply be some constant factor of counts-per- million, >> if I'm not mistaken. If you want absolutely no normalization, you >> would have to set both the normalization factors and library sizes >> to 1, I think. >> >> In any case, the pseudo counts are only for descriptive purposes. >> The statistical testing in edgeR happens using the raw integer >> counts. >> >> >> On 02/27/2013 10:12 PM, Vittoria Roncalli wrote: >>> Hi Ryan, >>> >>> thanks for your reply. >>> I obtain pesudo.counts with the following commands >>> >>> " >>> >>> > raw.data <- read.table("counts 2.txt",sep="\t",header=T) >>> >>> > d <- raw.data[, 2:10] >>> >>> > d[is.na <http: is.na="">(d)] <- 0 >>> >>> > rownames(d) <- raw.data[, 1] >>> >>> > group <- >>> c("CONTROL","CONTROL","CONTROL","LD","LD","LD","HD","HD","HD") >>> >>> > d <- DGEList(counts = d, group = group) >>> >>> Calculating library sizes from column totals. >>> >>> > keep <- rowSums (cpm(d)>1) >=3 >>> >>> > d <- d[keep,] >>> >>> > dim(d) >>> >>> [1] 28755 9 >>> >>> > d <- DGEList(counts = d, group = group) >>> >>> Calculating library sizes from column totals. >>> >>> > d <- estimateCommonDisp(d) >>> >>> >>> After the common dispersion, I get in the DGE list >>> >>> $counts >>> >>> $samples >>> >>> $commondispersion >>> >>> $pseudo.counts >>> >>> $logCPM >>> >>> $pseudo.lib.size >>> >>> >>> >>> Then I write a table for the pseudo.counts and I will continue >>> with those for the DGE. >>> >>> Considering that I did non normalize the libraries, what are the >>> different counts in the pseudo.counts output? >>> >>> >>> Thanks so much >>> >>> >>> Vittoria >>> On Wed, Feb 27, 2013 at 7:20 PM, Ryan C. Thompson >>> <rct at="" thompsonclan.org="" <mailto:rct="" at="" thompsonclan.org="">> wrote: >>> >>> To answer your first question, when you first create a >>> DGEList object, all the normalization factors are initially >>> set to 1 by default. This is equivalent to no normalization. >>> Once you use calcNormFactors, the normalization factors will >>> be set appropriately. >>> >>> I'm not sure about the second question. Could you provide an >>> example of how you are obtaining pseudocounts with edgeR? >>> >>> >>> On Wed 27 Feb 2013 05:12:27 PM PST, Vittoria Roncalli wrote: >>> >>> Hi, I am a edgeR user and I am a little bit confused on >>> the normalization >>> topic. >>> I am using EdgeR to get different expressed genes within >>> 3 conditions >>> (RnaSeq) with 3 replicates each. >>> I am following the user guide step: >>> >>> -update counts file (from mapping against reference >>> transcriptome) >>> - filter the low counts reads (1cpm) >>> - reassess library size >>> - estimate common dispersion >>> >>> Mi first question is related to the normalization. Why, >>> after I import my >>> file, next to the library size there is then column with >>> norm.factors? >>> >>> $samples >>> >>> group lib.size norm.factors >>> >>> X48h_C_r1.sam CONTROL 10898526 1 >>> >>> X48h_C_r2.sam CONTROL 7176817 1 >>> >>> X48h_C_r3.sam CONTROL 9511875 1 >>> >>> X48h_LD_r1.sam LD 11350347 1 >>> >>> X48h_LD_r2.sam LD 14836541 1 >>> >>> X48h_LD_r3.sam LD 12635344 1 >>> >>> X48h_HD_r1.sam HD 11840963 1 >>> >>> X48h_HD_r2.sam HD 17335549 1 >>> >>> X48h_HD_r3.sam HD 10274526 1 >>> >>> >>> >>> Is the normalization automated? What is the difference >>> with the >>> "calNormFactors?" >>> >>> Moreover, if I do not run the calNormFactors, what is >>> into the >>> pseudo.counts output? >>> >>> >>> I am very confused about those points. >>> >>> >>> Thanks in advance for your help. >>> >>> >>> Looking forward to hearing from you. >>> >>> >>> Vittoria >>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> <mailto:bioconductor at="" r-project.org=""> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >>> >>> >>> -- >>> >>> Vittoria Roncalli >>> >>> Graduate Research Assistant >>> Center B?k?sy Laboratory of Neurobiology >>> Pacific Biosciences Research Center >>> University of Hawaii at Manoa >>> 1993 East-West Road >>> Honolulu, HI 96822 USA >>> >>> Tel: 808-4695693 <tel:808-4695693> >>> >> >> >> >> >> -- >> >> Vittoria Roncalli >> >> Graduate Research Assistant >> Center B?k?sy Laboratory of Neurobiology >> Pacific Biosciences Research Center >> University of Hawaii at Manoa >> 1993 East-West Road >> Honolulu, HI 96822 USA >> >> Tel: 808-4695693 >>

ADD REPLY • link 12.2 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 6 months ago

Icahn School of Medicine at Mount Sinai…

If you want to do an ANOVA-style test for differences across all three groups, I highly recommend that you look into learning how to use edgeR in its GLM mode. The edgeR User's Guide has all the information you need for this. If you are doing ANOVA using another software, then it is not doing the same tests as edgeR, so you should expect different results. edgeR is designed more specifically to operate on the kind of data produced by RNA-seq. It makes certain assumptions based on the fact that you are dealing with discrete count data rather than a continuous measure of expression. A full discussion of the differences is too long to include here. The edgeR User's Guide has some information, and the edgeR publications have a more complete discussion. -Ryan On Thu 28 Feb 2013 12:04:26 PM PST, Vittoria Roncalli wrote: > Hi Ryan, > > I am using R version 2.15.1 on Mac. > > Actually I have run the ANOVA using another software, because I am not > familiar with the GLM mode. > The codes I used to get the exact test and the DGE are the following > > > "# Estimate common dispersion > > d <- estimateCommonDisp(d) > > > pseudo_counts<- data.frame(rownames(d$pseudo.alt),d$pseudo.alt) > > names(pseudo_counts) = c("A1","A2","A3","B1","B2","B3","C1","C2","C3") > > write.table(pseudo_counts,"pseudo_counts.txt",sep="\t",row=F,quote=F) > > > RPD2_vs_UPD2_de.com<- exactTest(d, pair=c("A","B")) > > > # Check to see if any NA values are present in results > > RPD2_vs_UPD2_de.com<- exactTest(d, pair=c("A","B")) > > dim(subsetRPD2_vs_UPD2_de.com$table,is.na > <http: is.na="">RPD2_vs_UPD2_de.com$table$p.value))) > > > # Top Tags > > RPD2_vs_UPD2_results<- topTagsRPD2_vs_UPD2_de.com, n=26788) > > > RPD2_vs_UPD2_results.output<- > merge(RPD2_vs_UPD2_results$table,d$pseudo.counts,by.x=0,by.y=0) > > names(RPD2_vs_UPD2_results.output) <- > c("ID","logConc","logFC","p.val","adj.P.val","H1L","H2L","H8L","H9L" ,"C4L","C5L","C8L","C9L") > > > sum(p.adjustRPD2_vs_UPD2_de.com$table$PValue, method="BH") < 0.05) > > > #1038 > > > top.com <http: top.com=""><- topTagsRPD2_vs_UPD2_de.com, n=1038) > > > sumtop.com <http: top.com="">$table$logFC> 0) > > # 638 > > > sumtop.com <http: top.com="">$table$logFC< 0) > > # 400 " > > > Any ideas on how I can get the list of DGE genes even if I can't find > significance with ANOVA? > > > > > > > On Thu, Feb 28, 2013 at 9:44 AM, Ryan C. Thompson > <rct at="" thompsonclan.org="" <mailto:rct="" at="" thompsonclan.org="">> wrote: > > Oh, I just realized you are using the non-GLM-based mode of > operation for edgeR. I am much more familiar with the GLM > workflow, and I believe that the GLM-based workflow is now > preferred over the exactTest-based one. In fact, I'm not even sure > how to do an ANOVA-style comparision of 3 or more groups using > exactTest. > > In any case, the best way to describe what you are tyring to do is > to is to show the code you are using. The answers could depend on > what options you are using, how you are calculating dispersions, > and many other small factors. Also please tell us which versions > of R, and edgeR you are using. > > > On Thu 28 Feb 2013 11:38:04 AM PST, Ryan C. Thompson wrote: > > Hi Vittoria, > > It would be best if you could show code examples of what gave > you an > empty list and what gave you a list of differentially > expressed genes > and what code didn't. Whether you you are doing a pairwise > comparison > or a multi-way "ANOVA-style" comparison, edgeR is actually > performing > the same test. In general, if all three pairwise comparisons are > yielding significant hits, I would expect some significant > hits in the > three-way comparison as well. > > -Ryan > > On Thu 28 Feb 2013 11:26:17 AM PST, Vittoria Roncalli wrote: > > Hi Ryan, > > Thanks again for your explanation, you saved my day! > Considering your expertise, I would ask you another question. > I run on the raw data counts a simple one way anova (I have 3 > treatments with 3 reps each) and I found out that there is no > significant difference between them. Then, with EdgeR I > was able, to > extract a list of DGE fro each pairwise comparison. Is > this because > the ANOVA is calculated on the overall library (total # > genes) while > the DGE comes from a t-test for each individual gene? I > found this > explanation on Bullard et al 2010, but I am not sure if I have > misunderstood something. > > Does it make sense to you? > > Have a good day,and thanks again for your help. > > Vittoria > > On Wed, Feb 27, 2013 at 9:48 PM, Ryan C. Thompson > <rct at="" thompsonclan.org="" <mailto:rct="" at="" thompsonclan.org=""> > <mailto:rct at="" thompsonclan.org=""> <mailto:rct at="" thompsonclan.org="">>> wrote: > > Hi Vittoria, > > Please use "Reply All" so that your reply also goes to > the mailing > list. > > The normalization factors are used to adjust the > library sizes (I > forget the details, I believe they are given in the > User's Guide), > and then the pseudo counts are obtained by normalizing > the counts > to the adjusted library sizes. Since you have not used any > normalization factors (i.e. all norm factors = 1), the > pseudo > counts will simply be some constant factor of > counts-per-million, > if I'm not mistaken. If you want absolutely no > normalization, you > would have to set both the normalization factors and > library sizes > to 1, I think. > > In any case, the pseudo counts are only for > descriptive purposes. > The statistical testing in edgeR happens using the raw > integer > counts. > > > On 02/27/2013 10:12 PM, Vittoria Roncalli wrote: > > Hi Ryan, > > thanks for your reply. > I obtain pesudo.counts with the following commands > > " > > > raw.data <- read.table("counts > 2.txt",sep="\t",header=T) > > > d <- raw.data[, 2:10] > > > d[is.na <http: is.na=""> <http: is.na="">(d)] <- 0 > > > rownames(d) <- raw.data[, 1] > > > group <- > c("CONTROL","CONTROL","__CONTROL","LD","LD","LD","HD","__HD","HD") > > > d <- DGEList(counts = d, group = group) > > Calculating library sizes from column totals. > > > keep <- rowSums (cpm(d)>1) >=3 > > > d <- d[keep,] > > > dim(d) > > [1] 28755 9 > > > d <- DGEList(counts = d, group = group) > > Calculating library sizes from column totals. > > > d <- estimateCommonDisp(d) > > > After the common dispersion, I get in the DGE list > > $counts > > $samples > > $commondispersion > > $pseudo.counts > > $logCPM > > $pseudo.lib.size > > > > Then I write a table for the pseudo.counts and I > will continue > with those for the DGE. > > Considering that I did non normalize the > libraries, what are the > different counts in the pseudo.counts output? > > > Thanks so much > > > Vittoria > On Wed, Feb 27, 2013 at 7:20 PM, Ryan C. Thompson > <rct at="" thompsonclan.org=""> <mailto:rct at="" thompsonclan.org=""> > <mailto:rct at="" thompsonclan.org=""> <mailto:rct at="" thompsonclan.org="">>> wrote: > > To answer your first question, when you first > create a > DGEList object, all the normalization factors > are initially > set to 1 by default. This is equivalent to no > normalization. > Once you use calcNormFactors, the > normalization factors will > be set appropriately. > > I'm not sure about the second question. Could > you provide an > example of how you are obtaining pseudocounts > with edgeR? > > > On Wed 27 Feb 2013 05:12:27 PM PST, Vittoria > Roncalli wrote: > > Hi, I am a edgeR user and I am a little > bit confused on > the normalization > topic. > I am using EdgeR to get different > expressed genes within > 3 conditions > (RnaSeq) with 3 replicates each. > I am following the user guide step: > > -update counts file (from mapping against > reference > transcriptome) > - filter the low counts reads (1cpm) > - reassess library size > - estimate common dispersion > > Mi first question is related to the > normalization. Why, > after I import my > file, next to the library size there is > then column with > norm.factors? > > $samples > > group lib.size norm.factors > > X48h_C_r1.sam CONTROL 10898526 1 > > X48h_C_r2.sam CONTROL 7176817 1 > > X48h_C_r3.sam CONTROL 9511875 1 > > X48h_LD_r1.sam LD 11350347 1 > > X48h_LD_r2.sam LD 14836541 1 > > X48h_LD_r3.sam LD 12635344 1 > > X48h_HD_r1.sam HD 11840963 1 > > X48h_HD_r2.sam HD 17335549 1 > > X48h_HD_r3.sam HD 10274526 1 > > > > Is the normalization automated? What is > the difference > with the > "calNormFactors?" > > Moreover, if I do not run the > calNormFactors, what is > into the > pseudo.counts output? > > > I am very confused about those points. > > > Thanks in advance for your help. > > > Looking forward to hearing from you. > > > Vittoria > > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > > -- > > Vittoria Roncalli > > Graduate Research Assistant > Center B?k?sy Laboratory of Neurobiology > Pacific Biosciences Research Center > University of Hawaii at Manoa > 1993 East-West Road > Honolulu, HI 96822 USA > > Tel: 808-4695693 <tel:808-4695693> > <tel:808-4695693 <tel:808-4695693="">> > > > > > > -- > > Vittoria Roncalli > > Graduate Research Assistant > Center B?k?sy Laboratory of Neurobiology > Pacific Biosciences Research Center > University of Hawaii at Manoa > 1993 East-West Road > Honolulu, HI 96822 USA > > Tel: 808-4695693 <tel:808-4695693> > > > > > -- > > Vittoria Roncalli > > Graduate Research Assistant > Center B?k?sy Laboratory of Neurobiology > Pacific Biosciences Research Center > University of Hawaii at Manoa > 1993 East-West Road > Honolulu, HI 96822 USA > > Tel: 808-4695693 >

ADD COMMENT • link 12.2 years ago Ryan C. Thompson ★ 7.9k

Login before adding your answer.