Question

questions on edgeR package

1

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 33 minutes ago

WEHI, Melbourne, Australia

Dear Ming, I have said this in a separate thread, but thought I should summarize it for the original thread. After some investigation, it is clear that edgeR's problem with the TCGA RSEM "raw counts" is that they are not integers. edgeR is a likelihood- based package and it evaluates the negative binomial likelihood, which is always exactly zero at non-integer values. Hence edgeR cannot estimate the dispersion sensibly if any of the input counts are fractional. If you round the RSEM counts to integers, then edgeR will run ok. I have made some changes to the edgeR GLM code on Bioc-devel so that it does this rounding automatically. (The edgeR classic code was doing this anyway.) Note that this work-around is only sensible for input values that are estimates of counts. Best wishes Gordon On Wed, 29 Jan 2014, Gordon K Smyth wrote: > Dear Ming, > > Something is seriously wrong -- you shouldn't get these warnings, you > shouldn't get such a large dispersion estimate and, if you do, you shouldn't > get such small p-values. > > I suspect that the culprit is RSEM. edgeR is designed to accept raw read > counts rather than estimated abundance levels as might be output from RSEM or > Cufflinks. > > There are a number of tools that produce genewise read counts from RNA-seq > data. My lab routinely uses subread and featureCounts: > > http://www.ncbi.nlm.nih.gov/pubmed/24227677 > > which are available from the Bioconductor package Rsubread. > > Best wishes > Gordon > > --------------------------------------------- > Professor Gordon K Smyth, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > http://www.statsci.org/smyth > > On Tue, 28 Jan 2014, Ming Yi wrote: > >> >> Hi, >> >> I have a few questions on edgeR package. I used it a while ago. Recently, I >> started to use it again for an ongoing project and realized that it has >> been updated a lot with many new feaures. >> >> The data I am working on has 29 pairs of tumor and matched normal tissue >> samples from the same patients that have been run through RNAseq, the data >> has been processed by the popular RSEM method to be raw counts for each >> sample (tumor or macthed normal tissue of each patients). So I have a >> matrix of raw counts for 58 columns of data for all genes. we are looking >> for differential expressed genes between tumors and normals. Here are some >> main commands I used, eliminating some basic commands for simplicity: >> >>> show(head(targets)); >> SampleName Subject Type >> 1 T4626_01A 38_4626 Tumor >> 2 N4626_11A 38_4626 Normal >> 3 T2668_01A 44_2668 Tumor >> 4 N2668_11A 44_2668 Normal >> ...... >>> Patient<-factor(targets$Subject); >>> Type<-factor(targets$Type); >> ...... >> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]); >>> show(dim(y)); >> [1] 20531 58 >> >>> #filter low expressed genes >>> y<-y[rowSums(y$counts)>=50,] >>> #Recompute the library sizes: >>> y$samples$lib.size <- colSums(y$counts) >>> #Use Entrez Gene IDs as row names: >>> rownames(y$counts) <- rownames(y$genes) <- y$genes$entrezID >>> ##Apply TMM normalization >>> y <- calcNormFactors(y,method=c("TMM")); >>> show(y$samples); >> group lib.size norm.factors >> T4626_01A 1 97435132 0.9834587 >> N4626_11A 1 62583672 0.9154837 >> T2668_01A 1 77034746 0.9687860 >> N2668_11A 1 58631311 0.8644352 >> ...... >>> design<-model.matrix(~Patient+Type); >>> rownames(design) <- colnames(y); >>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE); >> Disp = 3.99994 , BCV = 2 >> There were 50 or more warnings (use warnings() to see the first 50) >>> show(warnings()); >> Warning messages: >> 1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) : >> non-integer x = 4.520000 >> 2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) : >> non-integer x = 5.380000 >> .... >>> y <- estimateGLMTrendedDisp(y, design); >> Loading required package: splines >> There were 50 or more warnings (use warnings() to see the first 50) >>> y <- estimateGLMTagwiseDisp(y, design) >> There were 50 or more warnings (use warnings() to see the first 50) >>> fit <- glmFit(y, design); >>> lrt <- glmLRT(fit); >>> show(topTags(lrt)); >> Coefficient: TypeTumor >> symbols entrezID gene_id logFC logCPM LR PValue >> FDR >> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 >> 0 >> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 >> 0 >> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 >> 0 >> ....... >>> Rslt<-topTags(lrt, n=nrow(y),adjust.method="BH", sort.by="PValue"); >>> cpmRslt<-cpm(y) >>> cpmRslt<-cpmRslt[rownames(Rslt),]; >>> show(cpmRslt[1:10,1:6]); >> T4626_01A N4626_11A T2668_01A N2668_11A T6145_01A >> 6532 2.0558647 2.140002e+02 0.02679881 1.568574e+01 0.2147611 >> 1301 1.5236358 1.047224e-01 214.77904585 1.973049e-02 132.6917056 >> 5047 13.4413898 1.221761e-01 7.62426085 3.946099e-02 60.7774032 >> ..... >>> show(Rslt[1:10,]); >> Coefficient: TypeTumor >> symbols entrezID gene_id logFC logCPM LR PValue >> FDR >> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 >> 0 >> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 >> 0 >> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 >> 0 >> ..... >>> show(summary(de <- decideTestsDGE(lrt))); >> [,1] >> -1 6179 >> 0 4832 >> 1 7204 >> >> My questions are below: >> >> 1. I used the command: y<-y[rowSums(y$counts)>=50,] to filter lower >> expressed using 50 as cutoff here,kind of arbitrary. is there any better >> way to assess the cutoff value? can any one suggest a good cutoff based on >> his or her experience in such setting? >> >> 2. the command: y <- estimateGLMCommonDisp(y, design, verbose=TRUE) gives >> Disp = 3.99994 , BCV = 2, which is quite large, any issue on this? also why >> we got warning here (see warning message as above inside my commands), any >> issue on the warnings? >> >> 3. The top gene lists shown in both cpmRslt and Rslt are quite consistent, >> for example, gene 6532 always have tumor samples consistently much lower in >> cpm counts compared to that of matched normal samples, which is consistent >> with the logFC result in Rslt. So from the top genes, the result looks very >> promising. However, when I look into the full table of Rslt, there are so >> many genes appear to be differential expressed (see result of summary(de <- >> decideTestsDGE(lrt))). There are more than 13k genes (total 20531 genes in >> the dataset) in this table with FDR <0.05, which make me a bit nervous >> considering the issues as I asked in last two questions. Is this what was >> usually seen in tumor vs normal comparison. This is a paired test of >> course, which may increase the power and may see more differntial genes. >> what if I try pooled tumors vs pooled normals disregard of matched tumor vs >> normal sample pairs? which one shall be better for RNA-seq data, paired >> test vs pooled test? >> >> 4. For pooled test, I may use exactTest(), also based on the guide, there >> is no estimateGLMTrendedDisp needed like in calling glmFit/glmLRT in glm >> model. anything like that worthy of trying? Any advice? >> >> 5. I did MDS plot, which looks nice in that the tumor and normal are >> separated very well in one side vs other. In the plotSmear plot, I did see >> some red spots inside the black spots zones in the center zone (supposed to >> be lower logFC), why they were called as significant? May this explain why >> we get so many DEGs >13k of them? Or large variations across samples? >> >> Thanks so much in advance for your help! >> >> Mike >> NIH >> Maryland, USA. >> >>> show(sessionInfo()); >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> attached base packages: >> [1] splines stats graphics grDevices utils datasets methods >> [8] base >> other attached packages: >> [1] edgeR_3.4.2 limma_3.16.8 >> >> >> > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

RNASeq edgeR Rsubread RNASeq edgeR Rsubread • 1.8k views

ADD COMMENT • link updated 10.8 years ago by Yi, Ming NIH/NCI [C] ▴ 100 • written 10.8 years ago by Gordon Smyth 52k

score 0 · Answer 1 · 2014-02-06

0

Entering edit mode

Yi, Ming NIH/NCI [C] ▴ 100

@yi-ming-nihnci-c-4571

Last seen 9.4 years ago

United States

Dear Dr. Smyth: Thanks so much for your help and advice. Indeed, I did try to use edger on rounded RSEM processed raw counts, result is quite different, here are the comparisons FYI: Not rounded: > y <- estimateGLMCommonDisp(y, design, verbose=TRUE); Disp = 3.99994 , BCV = 2 After rounded the RSEM raw counts to integers: > y <- estimateGLMCommonDisp(y, design, verbose=TRUE); Disp = 0.32356 , BCV = 0.5688 Not rounded: > show(summary(de <- decideTestsDGE(lrt))); [,1] -1 6179 0 4832 1 7204 After rounded the RSEM raw counts to integers: > show(summary(de <- decideTestsDGE(lrt))); [,1] -1 4948 0 7778 1 5490 So observation is: after rounded, the dispersion is much smaller, and DEGs also has about 3k less genes, which appears to be promising. Then compared with limma-voom result of the same data: show(summary(de <- decideTests(fit))); ... PatientTCGA_91_6836 TypeTumor -1 13 4917 0 18186 7276 1 16 6022 The number of EDGs seems rather close to that when using edgeR on rounded RSEM raw counts. But what I am trying to ask your opinion here: in these situations, you would like more on the limma-voom result or the edgeR result on rounded RSEM raw counts? Since I used limma a lot on array data, but not much except this one on RNA-seq data. Thanks again for all your help and advice! Best, Ming -----Original Message----- From: Gordon K Smyth [mailto:smyth@wehi.edu.au] Sent: Thursday, February 06, 2014 12:40 AM To: Ming Yi Cc: Bioconductor mailing list Subject: Re: [BioC] questions on edgeR package Dear Ming, I have said this in a separate thread, but thought I should summarize it for the original thread. After some investigation, it is clear that edgeR's problem with the TCGA RSEM "raw counts" is that they are not integers. edgeR is a likelihood- based package and it evaluates the negative binomial likelihood, which is always exactly zero at non-integer values. Hence edgeR cannot estimate the dispersion sensibly if any of the input counts are fractional. If you round the RSEM counts to integers, then edgeR will run ok. I have made some changes to the edgeR GLM code on Bioc-devel so that it does this rounding automatically. (The edgeR classic code was doing this anyway.) Note that this work-around is only sensible for input values that are estimates of counts. Best wishes Gordon On Wed, 29 Jan 2014, Gordon K Smyth wrote: > Dear Ming, > > Something is seriously wrong -- you shouldn't get these warnings, you > shouldn't get such a large dispersion estimate and, if you do, you shouldn't > get such small p-values. > > I suspect that the culprit is RSEM. edgeR is designed to accept raw read > counts rather than estimated abundance levels as might be output from RSEM or > Cufflinks. > > There are a number of tools that produce genewise read counts from RNA-seq > data. My lab routinely uses subread and featureCounts: > > http://www.ncbi.nlm.nih.gov/pubmed/24227677 > > which are available from the Bioconductor package Rsubread. > > Best wishes > Gordon > > --------------------------------------------- > Professor Gordon K Smyth, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > http://www.statsci.org/smyth > > On Tue, 28 Jan 2014, Ming Yi wrote: > >> >> Hi, >> >> I have a few questions on edgeR package. I used it a while ago. Recently, I >> started to use it again for an ongoing project and realized that it has >> been updated a lot with many new feaures. >> >> The data I am working on has 29 pairs of tumor and matched normal tissue >> samples from the same patients that have been run through RNAseq, the data >> has been processed by the popular RSEM method to be raw counts for each >> sample (tumor or macthed normal tissue of each patients). So I have a >> matrix of raw counts for 58 columns of data for all genes. we are looking >> for differential expressed genes between tumors and normals. Here are some >> main commands I used, eliminating some basic commands for simplicity: >> >>> show(head(targets)); >> SampleName Subject Type >> 1 T4626_01A 38_4626 Tumor >> 2 N4626_11A 38_4626 Normal >> 3 T2668_01A 44_2668 Tumor >> 4 N2668_11A 44_2668 Normal >> ...... >>> Patient<-factor(targets$Subject); >>> Type<-factor(targets$Type); >> ...... >> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]); >>> show(dim(y)); >> [1] 20531 58 >> >>> #filter low expressed genes >>> y<-y[rowSums(y$counts)>=50,] >>> #Recompute the library sizes: >>> y$samples$lib.size <- colSums(y$counts) >>> #Use Entrez Gene IDs as row names: >>> rownames(y$counts) <- rownames(y$genes) <- y$genes$entrezID >>> ##Apply TMM normalization >>> y <- calcNormFactors(y,method=c("TMM")); >>> show(y$samples); >> group lib.size norm.factors >> T4626_01A 1 97435132 0.9834587 >> N4626_11A 1 62583672 0.9154837 >> T2668_01A 1 77034746 0.9687860 >> N2668_11A 1 58631311 0.8644352 >> ...... >>> design<-model.matrix(~Patient+Type); >>> rownames(design) <- colnames(y); >>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE); >> Disp = 3.99994 , BCV = 2 >> There were 50 or more warnings (use warnings() to see the first 50) >>> show(warnings()); >> Warning messages: >> 1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) : >> non-integer x = 4.520000 >> 2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) : >> non-integer x = 5.380000 >> .... >>> y <- estimateGLMTrendedDisp(y, design); >> Loading required package: splines >> There were 50 or more warnings (use warnings() to see the first 50) >>> y <- estimateGLMTagwiseDisp(y, design) >> There were 50 or more warnings (use warnings() to see the first 50) >>> fit <- glmFit(y, design); >>> lrt <- glmLRT(fit); >>> show(topTags(lrt)); >> Coefficient: TypeTumor >> symbols entrezID gene_id logFC logCPM LR PValue >> FDR >> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 >> 0 >> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 >> 0 >> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 >> 0 >> ....... >>> Rslt<-topTags(lrt, n=nrow(y),adjust.method="BH", sort.by="PValue"); >>> cpmRslt<-cpm(y) >>> cpmRslt<-cpmRslt[rownames(Rslt),]; >>> show(cpmRslt[1:10,1:6]); >> T4626_01A N4626_11A T2668_01A N2668_11A T6145_01A >> 6532 2.0558647 2.140002e+02 0.02679881 1.568574e+01 0.2147611 >> 1301 1.5236358 1.047224e-01 214.77904585 1.973049e-02 132.6917056 >> 5047 13.4413898 1.221761e-01 7.62426085 3.946099e-02 60.7774032 >> ..... >>> show(Rslt[1:10,]); >> Coefficient: TypeTumor >> symbols entrezID gene_id logFC logCPM LR PValue >> FDR >> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 >> 0 >> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 >> 0 >> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 >> 0 >> ..... >>> show(summary(de <- decideTestsDGE(lrt))); >> [,1] >> -1 6179 >> 0 4832 >> 1 7204 >> >> My questions are below: >> >> 1. I used the command: y<-y[rowSums(y$counts)>=50,] to filter lower >> expressed using 50 as cutoff here,kind of arbitrary. is there any better >> way to assess the cutoff value? can any one suggest a good cutoff based on >> his or her experience in such setting? >> >> 2. the command: y <- estimateGLMCommonDisp(y, design, verbose=TRUE) gives >> Disp = 3.99994 , BCV = 2, which is quite large, any issue on this? also why >> we got warning here (see warning message as above inside my commands), any >> issue on the warnings? >> >> 3. The top gene lists shown in both cpmRslt and Rslt are quite consistent, >> for example, gene 6532 always have tumor samples consistently much lower in >> cpm counts compared to that of matched normal samples, which is consistent >> with the logFC result in Rslt. So from the top genes, the result looks very >> promising. However, when I look into the full table of Rslt, there are so >> many genes appear to be differential expressed (see result of summary(de <- >> decideTestsDGE(lrt))). There are more than 13k genes (total 20531 genes in >> the dataset) in this table with FDR <0.05, which make me a bit nervous >> considering the issues as I asked in last two questions. Is this what was >> usually seen in tumor vs normal comparison. This is a paired test of >> course, which may increase the power and may see more differntial genes. >> what if I try pooled tumors vs pooled normals disregard of matched tumor vs >> normal sample pairs? which one shall be better for RNA-seq data, paired >> test vs pooled test? >> >> 4. For pooled test, I may use exactTest(), also based on the guide, there >> is no estimateGLMTrendedDisp needed like in calling glmFit/glmLRT in glm >> model. anything like that worthy of trying? Any advice? >> >> 5. I did MDS plot, which looks nice in that the tumor and normal are >> separated very well in one side vs other. In the plotSmear plot, I did see >> some red spots inside the black spots zones in the center zone (supposed to >> be lower logFC), why they were called as significant? May this explain why >> we get so many DEGs >13k of them? Or large variations across samples? >> >> Thanks so much in advance for your help! >> >> Mike >> NIH >> Maryland, USA. >> >>> show(sessionInfo()); >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> attached base packages: >> [1] splines stats graphics grDevices utils datasets methods >> [8] base >> other attached packages: >> [1] edgeR_3.4.2 limma_3.16.8 >> >> >> > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:9}}

ADD COMMENT • link 10.8 years ago Yi, Ming NIH/NCI [C] ▴ 100

0

Entering edit mode

Ming, Given RSEM counts, I would personally use the limma-voom for preference because it doesn't require you to round the counts. But you could use either in your case. I said the same to you yesterday: https://www.stat.math.ethz.ch/pipermail/bioconductor/2014-February/057 538.html Gordon On Thu, 6 Feb 2014, Yi, Ming (NIH/NCI) [C] wrote: > > Dear Dr. Smyth: > > Thanks so much for your help and advice. Indeed, I did try to use edger > on rounded RSEM processed raw counts, result is quite different, here > are the comparisons FYI: > > Not rounded: >> y <- estimateGLMCommonDisp(y, design, verbose=TRUE); > Disp = 3.99994 , BCV = 2 > After rounded the RSEM raw counts to integers: >> y <- estimateGLMCommonDisp(y, design, verbose=TRUE); > Disp = 0.32356 , BCV = 0.5688 > > > Not rounded: >> show(summary(de <- decideTestsDGE(lrt))); > [,1] > -1 6179 > 0 4832 > 1 7204 > After rounded the RSEM raw counts to integers: >> show(summary(de <- decideTestsDGE(lrt))); > [,1] > -1 4948 > 0 7778 > 1 5490 > > So observation is: after rounded, the dispersion is much smaller, and > DEGs also has about 3k less genes, which appears to be promising. > > Then compared with limma-voom result of the same data: > show(summary(de <- decideTests(fit))); > ... > PatientTCGA_91_6836 TypeTumor > -1 13 4917 > 0 18186 7276 > 1 16 6022 > > > The number of EDGs seems rather close to that when using edgeR on > rounded RSEM raw counts. But what I am trying to ask your opinion here: > in these situations, you would like more on the limma-voom result or the > edgeR result on rounded RSEM raw counts? Since I used limma a lot on > array data, but not much except this one on RNA-seq data. > > Thanks again for all your help and advice! > > Best, > > Ming > > > > > > -----Original Message----- > From: Gordon K Smyth [mailto:smyth at wehi.edu.au] > Sent: Thursday, February 06, 2014 12:40 AM > To: Ming Yi > Cc: Bioconductor mailing list > Subject: Re: [BioC] questions on edgeR package > > Dear Ming, > > I have said this in a separate thread, but thought I should summarize it > for the original thread. > > After some investigation, it is clear that edgeR's problem with the TCGA > RSEM "raw counts" is that they are not integers. edgeR is a likelihood- > based package and it evaluates the negative binomial likelihood, which is > always exactly zero at non-integer values. Hence edgeR cannot estimate > the dispersion sensibly if any of the input counts are fractional. > > If you round the RSEM counts to integers, then edgeR will run ok. > > I have made some changes to the edgeR GLM code on Bioc-devel so that it > does this rounding automatically. (The edgeR classic code was doing this > anyway.) Note that this work-around is only sensible for input values > that are estimates of counts. > > Best wishes > Gordon > > On Wed, 29 Jan 2014, Gordon K Smyth wrote: > >> Dear Ming, >> >> Something is seriously wrong -- you shouldn't get these warnings, you >> shouldn't get such a large dispersion estimate and, if you do, you shouldn't >> get such small p-values. >> >> I suspect that the culprit is RSEM. edgeR is designed to accept raw read >> counts rather than estimated abundance levels as might be output from RSEM or >> Cufflinks. >> >> There are a number of tools that produce genewise read counts from RNA-seq >> data. My lab routinely uses subread and featureCounts: >> >> http://www.ncbi.nlm.nih.gov/pubmed/24227677 >> >> which are available from the Bioconductor package Rsubread. >> >> Best wishes >> Gordon >> >> --------------------------------------------- >> Professor Gordon K Smyth, >> Bioinformatics Division, >> Walter and Eliza Hall Institute of Medical Research, >> 1G Royal Parade, Parkville, Vic 3052, Australia. >> http://www.statsci.org/smyth >> >> On Tue, 28 Jan 2014, Ming Yi wrote: >> >>> >>> Hi, >>> >>> I have a few questions on edgeR package. I used it a while ago. Recently, I >>> started to use it again for an ongoing project and realized that it has >>> been updated a lot with many new feaures. >>> >>> The data I am working on has 29 pairs of tumor and matched normal tissue >>> samples from the same patients that have been run through RNAseq, the data >>> has been processed by the popular RSEM method to be raw counts for each >>> sample (tumor or macthed normal tissue of each patients). So I have a >>> matrix of raw counts for 58 columns of data for all genes. we are looking >>> for differential expressed genes between tumors and normals. Here are some >>> main commands I used, eliminating some basic commands for simplicity: >>> >>>> show(head(targets)); >>> SampleName Subject Type >>> 1 T4626_01A 38_4626 Tumor >>> 2 N4626_11A 38_4626 Normal >>> 3 T2668_01A 44_2668 Tumor >>> 4 N2668_11A 44_2668 Normal >>> ...... >>>> Patient<-factor(targets$Subject); >>>> Type<-factor(targets$Type); >>> ...... >>> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]); >>>> show(dim(y)); >>> [1] 20531 58 >>> >>>> #filter low expressed genes >>>> y<-y[rowSums(y$counts)>=50,] >>>> #Recompute the library sizes: >>>> y$samples$lib.size <- colSums(y$counts) >>>> #Use Entrez Gene IDs as row names: >>>> rownames(y$counts) <- rownames(y$genes) <- y$genes$entrezID >>>> ##Apply TMM normalization >>>> y <- calcNormFactors(y,method=c("TMM")); >>>> show(y$samples); >>> group lib.size norm.factors >>> T4626_01A 1 97435132 0.9834587 >>> N4626_11A 1 62583672 0.9154837 >>> T2668_01A 1 77034746 0.9687860 >>> N2668_11A 1 58631311 0.8644352 >>> ...... >>>> design<-model.matrix(~Patient+Type); >>>> rownames(design) <- colnames(y); >>>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE); >>> Disp = 3.99994 , BCV = 2 >>> There were 50 or more warnings (use warnings() to see the first 50) >>>> show(warnings()); >>> Warning messages: >>> 1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) : >>> non-integer x = 4.520000 >>> 2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) : >>> non-integer x = 5.380000 >>> .... >>>> y <- estimateGLMTrendedDisp(y, design); >>> Loading required package: splines >>> There were 50 or more warnings (use warnings() to see the first 50) >>>> y <- estimateGLMTagwiseDisp(y, design) >>> There were 50 or more warnings (use warnings() to see the first 50) >>>> fit <- glmFit(y, design); >>>> lrt <- glmLRT(fit); >>>> show(topTags(lrt)); >>> Coefficient: TypeTumor >>> symbols entrezID gene_id logFC logCPM LR PValue >>> FDR >>> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 >>> 0 >>> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 >>> 0 >>> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 >>> 0 >>> ....... >>>> Rslt<-topTags(lrt, n=nrow(y),adjust.method="BH", sort.by="PValue"); >>>> cpmRslt<-cpm(y) >>>> cpmRslt<-cpmRslt[rownames(Rslt),]; >>>> show(cpmRslt[1:10,1:6]); >>> T4626_01A N4626_11A T2668_01A N2668_11A T6145_01A >>> 6532 2.0558647 2.140002e+02 0.02679881 1.568574e+01 0.2147611 >>> 1301 1.5236358 1.047224e-01 214.77904585 1.973049e-02 132.6917056 >>> 5047 13.4413898 1.221761e-01 7.62426085 3.946099e-02 60.7774032 >>> ..... >>>> show(Rslt[1:10,]); >>> Coefficient: TypeTumor >>> symbols entrezID gene_id logFC logCPM LR PValue >>> FDR >>> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 >>> 0 >>> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 >>> 0 >>> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 >>> 0 >>> ..... >>>> show(summary(de <- decideTestsDGE(lrt))); >>> [,1] >>> -1 6179 >>> 0 4832 >>> 1 7204 >>> >>> My questions are below: >>> >>> 1. I used the command: y<-y[rowSums(y$counts)>=50,] to filter lower >>> expressed using 50 as cutoff here,kind of arbitrary. is there any better >>> way to assess the cutoff value? can any one suggest a good cutoff based on >>> his or her experience in such setting? >>> >>> 2. the command: y <- estimateGLMCommonDisp(y, design, verbose=TRUE) gives >>> Disp = 3.99994 , BCV = 2, which is quite large, any issue on this? also why >>> we got warning here (see warning message as above inside my commands), any >>> issue on the warnings? >>> >>> 3. The top gene lists shown in both cpmRslt and Rslt are quite consistent, >>> for example, gene 6532 always have tumor samples consistently much lower in >>> cpm counts compared to that of matched normal samples, which is consistent >>> with the logFC result in Rslt. So from the top genes, the result looks very >>> promising. However, when I look into the full table of Rslt, there are so >>> many genes appear to be differential expressed (see result of summary(de <- >>> decideTestsDGE(lrt))). There are more than 13k genes (total 20531 genes in >>> the dataset) in this table with FDR <0.05, which make me a bit nervous >>> considering the issues as I asked in last two questions. Is this what was >>> usually seen in tumor vs normal comparison. This is a paired test of >>> course, which may increase the power and may see more differntial genes. >>> what if I try pooled tumors vs pooled normals disregard of matched tumor vs >>> normal sample pairs? which one shall be better for RNA-seq data, paired >>> test vs pooled test? >>> >>> 4. For pooled test, I may use exactTest(), also based on the guide, there >>> is no estimateGLMTrendedDisp needed like in calling glmFit/glmLRT in glm >>> model. anything like that worthy of trying? Any advice? >>> >>> 5. I did MDS plot, which looks nice in that the tumor and normal are >>> separated very well in one side vs other. In the plotSmear plot, I did see >>> some red spots inside the black spots zones in the center zone (supposed to >>> be lower logFC), why they were called as significant? May this explain why >>> we get so many DEGs >13k of them? Or large variations across samples? >>> >>> Thanks so much in advance for your help! >>> >>> Mike >>> NIH >>> Maryland, USA. >>> >>>> show(sessionInfo()); >>> R version 3.0.1 (2013-05-16) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> attached base packages: >>> [1] splines stats graphics grDevices utils datasets methods >>> [8] base >>> other attached packages: >>> [1] edgeR_3.4.2 limma_3.16.8 ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD REPLY • link 10.8 years ago Gordon Smyth 52k

0

Entering edit mode

Thx for your advice, Ming > Date: Fri, 7 Feb 2014 09:15:36 +1100 > From: smyth@wehi.EDU.AU > To: yiming@mail.nih.gov > CC: yi02@hotmail.com; bioconductor@r-project.org > Subject: RE: [BioC] questions on edgeR package > > Ming, > > Given RSEM counts, I would personally use the limma-voom for preference > because it doesn't require you to round the counts. But you could use > either in your case. I said the same to you yesterday: > > https://www.stat.math.ethz.ch/pipermail/bioconductor/2014-February/0 57538.html > > Gordon > > On Thu, 6 Feb 2014, Yi, Ming (NIH/NCI) [C] wrote: > > > > > Dear Dr. Smyth: > > > > Thanks so much for your help and advice. Indeed, I did try to use edger > > on rounded RSEM processed raw counts, result is quite different, here > > are the comparisons FYI: > > > > Not rounded: > >> y <- estimateGLMCommonDisp(y, design, verbose=TRUE); > > Disp = 3.99994 , BCV = 2 > > After rounded the RSEM raw counts to integers: > >> y <- estimateGLMCommonDisp(y, design, verbose=TRUE); > > Disp = 0.32356 , BCV = 0.5688 > > > > > > Not rounded: > >> show(summary(de <- decideTestsDGE(lrt))); > > [,1] > > -1 6179 > > 0 4832 > > 1 7204 > > After rounded the RSEM raw counts to integers: > >> show(summary(de <- decideTestsDGE(lrt))); > > [,1] > > -1 4948 > > 0 7778 > > 1 5490 > > > > So observation is: after rounded, the dispersion is much smaller, and > > DEGs also has about 3k less genes, which appears to be promising. > > > > Then compared with limma-voom result of the same data: > > show(summary(de <- decideTests(fit))); > > ... > > PatientTCGA_91_6836 TypeTumor > > -1 13 4917 > > 0 18186 7276 > > 1 16 6022 > > > > > > The number of EDGs seems rather close to that when using edgeR on > > rounded RSEM raw counts. But what I am trying to ask your opinion here: > > in these situations, you would like more on the limma-voom result or the > > edgeR result on rounded RSEM raw counts? Since I used limma a lot on > > array data, but not much except this one on RNA-seq data. > > > > Thanks again for all your help and advice! > > > > Best, > > > > Ming > > > > > > > > > > > > -----Original Message----- > > From: Gordon K Smyth [mailto:smyth@wehi.edu.au] > > Sent: Thursday, February 06, 2014 12:40 AM > > To: Ming Yi > > Cc: Bioconductor mailing list > > Subject: Re: [BioC] questions on edgeR package > > > > Dear Ming, > > > > I have said this in a separate thread, but thought I should summarize it > > for the original thread. > > > > After some investigation, it is clear that edgeR's problem with the TCGA > > RSEM "raw counts" is that they are not integers. edgeR is a likelihood- > > based package and it evaluates the negative binomial likelihood, which is > > always exactly zero at non-integer values. Hence edgeR cannot estimate > > the dispersion sensibly if any of the input counts are fractional. > > > > If you round the RSEM counts to integers, then edgeR will run ok. > > > > I have made some changes to the edgeR GLM code on Bioc-devel so that it > > does this rounding automatically. (The edgeR classic code was doing this > > anyway.) Note that this work-around is only sensible for input values > > that are estimates of counts. > > > > Best wishes > > Gordon > > > > On Wed, 29 Jan 2014, Gordon K Smyth wrote: > > > >> Dear Ming, > >> > >> Something is seriously wrong -- you shouldn't get these warnings, you > >> shouldn't get such a large dispersion estimate and, if you do, you shouldn't > >> get such small p-values. > >> > >> I suspect that the culprit is RSEM. edgeR is designed to accept raw read > >> counts rather than estimated abundance levels as might be output from RSEM or > >> Cufflinks. > >> > >> There are a number of tools that produce genewise read counts from RNA-seq > >> data. My lab routinely uses subread and featureCounts: > >> > >> http://www.ncbi.nlm.nih.gov/pubmed/24227677 > >> > >> which are available from the Bioconductor package Rsubread. > >> > >> Best wishes > >> Gordon > >> > >> --------------------------------------------- > >> Professor Gordon K Smyth, > >> Bioinformatics Division, > >> Walter and Eliza Hall Institute of Medical Research, > >> 1G Royal Parade, Parkville, Vic 3052, Australia. > >> http://www.statsci.org/smyth > >> > >> On Tue, 28 Jan 2014, Ming Yi wrote: > >> > >>> > >>> Hi, > >>> > >>> I have a few questions on edgeR package. I used it a while ago. Recently, I > >>> started to use it again for an ongoing project and realized that it has > >>> been updated a lot with many new feaures. > >>> > >>> The data I am working on has 29 pairs of tumor and matched normal tissue > >>> samples from the same patients that have been run through RNAseq, the data > >>> has been processed by the popular RSEM method to be raw counts for each > >>> sample (tumor or macthed normal tissue of each patients). So I have a > >>> matrix of raw counts for 58 columns of data for all genes. we are looking > >>> for differential expressed genes between tumors and normals. Here are some > >>> main commands I used, eliminating some basic commands for simplicity: > >>> > >>>> show(head(targets)); > >>> SampleName Subject Type > >>> 1 T4626_01A 38_4626 Tumor > >>> 2 N4626_11A 38_4626 Normal > >>> 3 T2668_01A 44_2668 Tumor > >>> 4 N2668_11A 44_2668 Normal > >>> ...... > >>>> Patient<-factor(targets$Subject); > >>>> Type<-factor(targets$Type); > >>> ...... > >>> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]); > >>>> show(dim(y)); > >>> [1] 20531 58 > >>> > >>>> #filter low expressed genes > >>>> y<-y[rowSums(y$counts)>=50,] > >>>> #Recompute the library sizes: > >>>> y$samples$lib.size <- colSums(y$counts) > >>>> #Use Entrez Gene IDs as row names: > >>>> rownames(y$counts) <- rownames(y$genes) <- y$genes$entrezID > >>>> ##Apply TMM normalization > >>>> y <- calcNormFactors(y,method=c("TMM")); > >>>> show(y$samples); > >>> group lib.size norm.factors > >>> T4626_01A 1 97435132 0.9834587 > >>> N4626_11A 1 62583672 0.9154837 > >>> T2668_01A 1 77034746 0.9687860 > >>> N2668_11A 1 58631311 0.8644352 > >>> ...... > >>>> design<-model.matrix(~Patient+Type); > >>>> rownames(design) <- colnames(y); > >>>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE); > >>> Disp = 3.99994 , BCV = 2 > >>> There were 50 or more warnings (use warnings() to see the first 50) > >>>> show(warnings()); > >>> Warning messages: > >>> 1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) : > >>> non-integer x = 4.520000 > >>> 2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) : > >>> non-integer x = 5.380000 > >>> .... > >>>> y <- estimateGLMTrendedDisp(y, design); > >>> Loading required package: splines > >>> There were 50 or more warnings (use warnings() to see the first 50) > >>>> y <- estimateGLMTagwiseDisp(y, design) > >>> There were 50 or more warnings (use warnings() to see the first 50) > >>>> fit <- glmFit(y, design); > >>>> lrt <- glmLRT(fit); > >>>> show(topTags(lrt)); > >>> Coefficient: TypeTumor > >>> symbols entrezID gene_id logFC logCPM LR PValue > >>> FDR > >>> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 > >>> 0 > >>> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 > >>> 0 > >>> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 > >>> 0 > >>> ....... > >>>> Rslt<-topTags(lrt, n=nrow(y),adjust.method="BH", sort.by="PValue"); > >>>> cpmRslt<-cpm(y) > >>>> cpmRslt<-cpmRslt[rownames(Rslt),]; > >>>> show(cpmRslt[1:10,1:6]); > >>> T4626_01A N4626_11A T2668_01A N2668_11A T6145_01A > >>> 6532 2.0558647 2.140002e+02 0.02679881 1.568574e+01 0.2147611 > >>> 1301 1.5236358 1.047224e-01 214.77904585 1.973049e-02 132.6917056 > >>> 5047 13.4413898 1.221761e-01 7.62426085 3.946099e-02 60.7774032 > >>> ..... > >>>> show(Rslt[1:10,]); > >>> Coefficient: TypeTumor > >>> symbols entrezID gene_id logFC logCPM LR PValue > >>> FDR > >>> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 > >>> 0 > >>> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 > >>> 0 > >>> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 > >>> 0 > >>> ..... > >>>> show(summary(de <- decideTestsDGE(lrt))); > >>> [,1] > >>> -1 6179 > >>> 0 4832 > >>> 1 7204 > >>> > >>> My questions are below: > >>> > >>> 1. I used the command: y<-y[rowSums(y$counts)>=50,] to filter lower > >>> expressed using 50 as cutoff here,kind of arbitrary. is there any better > >>> way to assess the cutoff value? can any one suggest a good cutoff based on > >>> his or her experience in such setting? > >>> > >>> 2. the command: y <- estimateGLMCommonDisp(y, design, verbose=TRUE) gives > >>> Disp = 3.99994 , BCV = 2, which is quite large, any issue on this? also why > >>> we got warning here (see warning message as above inside my commands), any > >>> issue on the warnings? > >>> > >>> 3. The top gene lists shown in both cpmRslt and Rslt are quite consistent, > >>> for example, gene 6532 always have tumor samples consistently much lower in > >>> cpm counts compared to that of matched normal samples, which is consistent > >>> with the logFC result in Rslt. So from the top genes, the result looks very > >>> promising. However, when I look into the full table of Rslt, there are so > >>> many genes appear to be differential expressed (see result of summary(de <- > >>> decideTestsDGE(lrt))). There are more than 13k genes (total 20531 genes in > >>> the dataset) in this table with FDR <0.05, which make me a bit nervous > >>> considering the issues as I asked in last two questions. Is this what was > >>> usually seen in tumor vs normal comparison. This is a paired test of > >>> course, which may increase the power and may see more differntial genes. > >>> what if I try pooled tumors vs pooled normals disregard of matched tumor vs > >>> normal sample pairs? which one shall be better for RNA-seq data, paired > >>> test vs pooled test? > >>> > >>> 4. For pooled test, I may use exactTest(), also based on the guide, there > >>> is no estimateGLMTrendedDisp needed like in calling glmFit/glmLRT in glm > >>> model. anything like that worthy of trying? Any advice? > >>> > >>> 5. I did MDS plot, which looks nice in that the tumor and normal are > >>> separated very well in one side vs other. In the plotSmear plot, I did see > >>> some red spots inside the black spots zones in the center zone (supposed to > >>> be lower logFC), why they were called as significant? May this explain why > >>> we get so many DEGs >13k of them? Or large variations across samples? > >>> > >>> Thanks so much in advance for your help! > >>> > >>> Mike > >>> NIH > >>> Maryland, USA. > >>> > >>>> show(sessionInfo()); > >>> R version 3.0.1 (2013-05-16) > >>> Platform: x86_64-unknown-linux-gnu (64-bit) > >>> locale: > >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > >>> [7] LC_PAPER=C LC_NAME=C > >>> [9] LC_ADDRESS=C LC_TELEPHONE=C > >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > >>> attached base packages: > >>> [1] splines stats graphics grDevices utils datasets methods > >>> [8] base > >>> other attached packages: > >>> [1] edgeR_3.4.2 limma_3.16.8 > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:9}}

ADD REPLY • link 10.8 years ago Ming ▴ 380