"True" normalization factors in edgeR

0

Entering edit mode

Zhan Tianyu ▴ 40

@zhan-tianyu-6632

Last seen 10.6 years ago

Hi all, I have a question concerning the normalization factors in edgeR. I think that if we know which genes are none-DE genes in advance, we could calculate "true" normalization factors based on those information. Given "true normalization factors", edgeR could find more right DE genes, or even find all the right DE genes when the simulation is simple. The procedures I used in edgeR are: library(edgeR) dge <- DGEList(counts = counts, group = group ) norm <- calcNormFactors(dge,method="TMM") d <- estimateGLMCommonDisp(norm, design = design) d <- estimateGLMTrendedDisp(d,design=design) d <- estimateGLMTagwiseDisp(d, design = design, prior.df = 10) f <- glmFit(d, design = design) lr <- glmLRT(f, coef=2) pval = lr$table$PValue padj = p.adjust(pval, "BH") cbind(pval = pval, padj = padj) I used the order of p-value or adjust p-value to label DE genes (for example, the first 300 genes in the order of p value from low to hight, in a scenario with 30% DE in 1000 genes). In a simple simulation with 30% asymmetry DE in 1000 total genes, the default edgeR would find 80 wrong DE genes in 300 DE genes. The simulation method is in the attachment, and I used 10 genes as a demonstration. Then I tried two methods to calculate "true" normalization factors". The first one is using calnormFactors() function, but using counts from all none-DE genes. The second one is taking log of all the none-DE counts, and then calculate the median of log(counts)[,i]-log(counts)[,1], where i is the index of each individual (each row in the count matrix). Then I used norm<-norm/exp(mean(log(norm))) to make sure that all factors multiple to one. After that, I replaced the "norm" in third line of above codes with the "true normalization factors". However, both methods would find 110 wrong DE genes, which is higher than the false discovery rates of default edgeR method (80 wrong DE genes). I am wondering what is a better procedure of finding the "true" normalization factors in edgeR, given which genes are none-DE? Thank you! Best regards, Tianyu Zhan

Normalization edgeR Normalization edgeR • 2.5k views

ADD COMMENT • link 10.7 years ago Zhan Tianyu ▴ 40

0

Entering edit mode

Mark Robinson ▴ 880

@mark-robinson-4908

Last seen 6.4 years ago

Hi Tianyu, Sorry for the slow response (was away). I have read your attachment, but your question and simulation is still a little unclear to me. First, some useful diagnostics for me to see whether this simulation is in any way reasonable to a real RNA-seq dataset are an M-A plot -- plotSmear(), perhaps highlighting the true DE genes you have put in -- or a dispersion-mean plot -- plotBCV(). Also, there is no code in there to show exactly what you did when you say 'I replaced the "norm" in third line of above codes with the "true normalization factors"'; what did you exactly do? Maybe send a completely reproducible example ? Give your sessionInfo() too. > I am wondering what is a better procedure > of finding the "true" normalization factors in edgeR, given which genes are > none-DE? The other concern is that, in practice, you never know what is non-DE. So, how is this useful? Cheers, Mark ---------- Prof. Dr. Mark Robinson Statistical Bioinformatics, Institute of Molecular Life Sciences University of Zurich http://ow.ly/riRea On 04.08.2014, at 22:44, Zhan Tianyu <sewen67 at="" gmail.com=""> wrote: > Hi all, > > I have a question concerning the normalization factors in edgeR. I > think that if we know which genes are none-DE genes in advance, we could > calculate "true" normalization factors based on those information. Given > "true normalization factors", edgeR could find more right DE genes, or even > find all the right DE genes when the simulation is simple. > The procedures I used in edgeR are: > library(edgeR) > dge <- DGEList(counts = counts, group = group ) > norm <- calcNormFactors(dge,method="TMM") > d <- estimateGLMCommonDisp(norm, design = design) > d <- estimateGLMTrendedDisp(d,design=design) > d <- estimateGLMTagwiseDisp(d, design = design, prior.df = 10) > f <- glmFit(d, design = design) > lr <- glmLRT(f, coef=2) > pval = lr$table$PValue > padj = p.adjust(pval, "BH") > cbind(pval = pval, padj = padj) > I used the order of p-value or adjust p-value to label DE genes > (for example, the first 300 genes in the order of p value from low to > hight, in a scenario with 30% DE in 1000 genes). In a simple simulation > with 30% asymmetry DE in 1000 total genes, the default edgeR would find 80 > wrong DE genes in 300 DE genes. The simulation method is in the attachment, > and I used 10 genes as a demonstration. > Then I tried two methods to calculate "true" normalization > factors". The first one is using calnormFactors() function, but using > counts from all none-DE genes. The second one is taking log of all the > none-DE counts, and then calculate the median of > log(counts)[,i]-log(counts)[,1], where i is the index of each individual > (each row in the count matrix). Then I used norm<-norm/exp(mean(log(norm))) > to make sure that all factors multiple to one. > After that, I replaced the "norm" in third line of above codes > with the "true normalization factors". However, both methods would find 110 > wrong DE genes, which is higher than the false discovery rates of default > edgeR method (80 wrong DE genes). I am wondering what is a better procedure > of finding the "true" normalization factors in edgeR, given which genes are > none-DE? > Thank you! > > Best regards, > Tianyu Zhan > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 10.7 years ago Mark Robinson ▴ 880

0

Entering edit mode

Zhan Tianyu ▴ 40

@zhan-tianyu-6632

Last seen 10.6 years ago

Hi Mark: Hello, Sorry for the unclear demonstration in the previous email. My question is how to find "true" normalization factors in edgeR given non-DE genes? I think that if edgeR knows non-DE genes, it would have lower false positive rates (find fewer wrong DE genes). By studying this, I want to have a better understanding of how edgeR calculates normalization factors. Here is my sessionInfo(): > sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-apple-darwin13.1.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] grid splines parallel stats graphics grDevices utils datasets methods base other attached packages: [1] gdata_2.13.3 tm_0.6 NLP_0.1-3 gridExtra_0.9.1 tweeDEseqCountData_1.2.0 [6] Biobase_2.24.0 BiocGenerics_0.10.0 edgeR_3.6.7 limma_3.20.8 loaded via a namespace (and not attached): [1] gtools_3.4.1 slam_0.1-32 tools_3.1.0 Here is a reproducible example: 1. Run the source code written by Xiaobei and you: Pipeline: http://130.60.190.4/robinson_lab/edgeR_robust/pipeline.pdf Source: http://130.60.190.4/robinson_lab/edgeR_robust/robust_simulation.R 2. Simulate NB data: library(tweeDEseqCountData) data(pickrell) pickrell<-as.matrix(exprs(pickrell.eset)) set.seed(3) nSamp<-8 grp <- as.factor(rep(0:1, each = nSamp/2)) ## set up group: 4 vs 4 data_2 <- NBsim(foldDiff=3,dataset =pickrell, nTags = 1000, group = grp, verbose = TRUE, add.outlier = TRUE, outlierMech = "S", pOutlier = 0.1, drop.extreme.dispersion = 0.1,pDiff=0.3,pUp=0.9) 3. Plot the true DE genes I have put in: d <- DGEList(counts=data_2$counts, group=grp, lib.size=colSums(data_2$counts)) rownames(d$counts) <- paste("ids",1:nrow(d$counts),sep="") d <- estimateCommonDisp(d) true.tags<-rownames(data_2$counts[data_2$indDE,]) plotSmear(d, de.tags=true.tags) 4. Introduce two methods of calculating "true" normalization factors with non-DE genes. The first method works on log of non-DE counts. It first calculates the difference of log non-DE counts between each individual (column) and first individual (column). Then I take exponential of the median of each column to get "true" normalization factors. The second method is using calcNormFactors() function, but with non-DE genes. calNorm<-function(counts,group,indNonDE){ counts<-counts+1 ## Avoid zero counts issue ## First method: Calculate true normalization factors based on median of log counts difference, and stored them in "norm_edgeR_median" ## variable. norm<-rep(NA,8) baseline<-counts[indNonDE,1] for (i in 1:length(norm)) { norm_temp<-data_2$counts[indNonDE,i] norm[i]<-median(log(norm_temp)-log(baseline)) } norm_edgeR_median<-exp(norm) ## Forced all factors multiple to 1 norm_edgeR_median<-norm_edgeR_median/exp(mean(log(norm_edgeR_median))) ## Second method: Calculate true normalization factors for edgeR with none-DE genes, using calcNormFactors() function, ## and stored them in "norm_edgeR_calc" variable. counts_edgeR<-counts[indNonDE,] de <- DGEList(counts = counts_edgeR, group = group ) de <- calcNormFactors(de,method="TMM") norm_edgeR_calc<-de$samples$norm.factors list(norm_edgeR_median=norm_edgeR_median,norm_edgeR_calc=norm_edgeR_ca lc) } library(gridExtra) norm_factors<-calNorm(data_2$counts,grp,data_2$indNonDE) edgeR_median.pfun <- function(counts, group, design = NULL, mc.cores = 4, prior.df=10) { ## edgeR standard pipeline ## library(edgeR) d <- DGEList(counts = counts, group = group ) d <- calcNormFactors(d,method="TMM") ### Use median of log difference of none-DE genes as normalization factors d$samples$norm.factors<-norm_factors$norm_edgeR_median d <- estimateGLMCommonDisp(d, design = design) d <- estimateGLMTrendedDisp(d,design=design) d <- estimateGLMTagwiseDisp(d, design = design, prior.df = prior.df) f <- glmFit(d, design = design) lr <- glmLRT(f, coef=2) pval = lr$table$PValue padj = p.adjust(pval, "BH") cbind(pval = pval, padj = padj) } edgeR_calc.pfun <- function(counts, group, design = NULL, mc.cores = 4, prior.df=10) { ## edgeR standard pipeline ## library(edgeR) d <- DGEList(counts = counts, group = group ) d <- calcNormFactors(d,method="TMM") ### Use calcNormFactors() with none-DE genes to calculate normalization factors. d$samples$norm.factors<-norm_factors$norm_edgeR_calc d <- estimateGLMCommonDisp(d, design = design) d <- estimateGLMTrendedDisp(d,design=design) d <- estimateGLMTagwiseDisp(d, design = design, prior.df = prior.df) f <- glmFit(d, design = design) lr <- glmLRT(f, coef=2) pval = lr$table$PValue padj = p.adjust(pval, "BH") cbind(pval = pval, padj = padj) } 5. Generate false positive plot: pvals_2 <- pval(data_2, method =c("edgeR","edgeR_median","edgeR_calc"), count.type = "counts",mc.cores = 4) fdPlot(pvals_2, cex = 0.6,xlim=c(0,700)) I think the result is unexpected because edgeR_median and edgeR_calc have higher false positive rate than default edgeR. The reason would be that I am not using a proper way to calculate "true" normalization factors with non-DE genes? Thanks for your time! Best regards, Tianyu Zhan 2014-08-10 7:24 GMT-04:00 Mark Robinson <mark.robinson@imls.uzh.ch>: > And, please keep the discussion on the list. Others can benefit. > > Also, re-read the Posting guide: > http://www.bioconductor.org/help/mailing-list/posting-guide/ > > M. > > > > > On 10.08.2014, at 13:08, Mark Robinson <mark.robinson@imls.uzh.ch> wrote: > > > Hi Tianyu, > > > > Good questions get good answers. > > > > If you want my help, you should make it easier for me: > > > > 1. Answer my original questions. In particular, look at your data to see > what normalization factors you've introduced. That may give you some > indication of if you've done something wrong. > > > > 2. Instead of sending me a bunch of files, tell me exactly what I should > look at. > > > > Mark > > > > > > ---------- > > Prof. Dr. Mark Robinson > > Statistical Bioinformatics, Institute of Molecular Life Sciences > > University of Zurich > > http://ow.ly/riRea > > > > > > > > > > > > > > > > On 09.08.2014, at 20:28, Zhan Tianyu <sewen67@gmail.com> wrote: > > > >> Hi Professor Robinson: > >> Hello, > >> Thanks for your response. It is true that in reality we never know > what is non-DE genes. However, I want to have a better understanding of > edgeR and its performance. I thank that if edgeR knows non-DE genes in > advance, it would have lower false positive rates (find more right DE > genes). > >> The attachments contains R scripts for data simulation and false > positive plots. You could first run the source code of "R_robust.R", > "calNorm.R", and "simulation_normal.R". "R_robust.R" is written by you and > Xiaobei, and I also added three functions of edgeR: edgdR_median.pfun, > edgeR_calc.pfun, and edgeR_ratio.pfun. All the three methods take advantage > of known non-DE genes. The details of how I calculate "true" normalization > factors are in the file "calNorm.R" file. > >> Then I tested the default edgeR, and three functions with known > non-DE genes in two simulations: normal simulation ( Nsim() function in the > file "simulation_normal.R"), and NB simulation ( NBsim() in the file > "R_robust"). However, the false positive plots show that default edgeR > performed the best (it has the lowest false positive rates). I think the > reason is that I am not using the right way to calculate true normalization > factors for edgeR, assuming that I know non-DE genes. I am wondering what > is a proper way to find the true normalization factors edgeR with non-DE > genes? Thanks for your time! > >> > >> Best regards, > >> Tianyu Zhan > >> > >> > >> 2014-08-09 10:34 GMT-04:00 Mark Robinson <mark.robinson@imls.uzh.ch>: > >> Hi Tianyu, > >> > >> Sorry for the slow response (was away). > >> > >> I have read your attachment, but your question and simulation is still > a little unclear to me. First, some useful diagnostics for me to see > whether this simulation is in any way reasonable to a real RNA-seq dataset > are an M-A plot -- plotSmear(), perhaps highlighting the true DE genes you > have put in -- or a dispersion-mean plot -- plotBCV(). Also, there is no > code in there to show exactly what you did when you say 'I replaced the > "norm" in third line of above codes with the "true normalization factors"'; > what did you exactly do? Maybe send a completely reproducible example ? > Give your sessionInfo() too. > >> > >>> I am wondering what is a better procedure > >>> of finding the "true" normalization factors in edgeR, given which > genes are > >>> none-DE? > >> > >> The other concern is that, in practice, you never know what is non-DE. > So, how is this useful? > >> > >> Cheers, Mark > >> > >> > >> ---------- > >> Prof. Dr. Mark Robinson > >> Statistical Bioinformatics, Institute of Molecular Life Sciences > >> University of Zurich > >> http://ow.ly/riRea > >> > >> > >> > >> > >> > >> > >> > >> On 04.08.2014, at 22:44, Zhan Tianyu <sewen67@gmail.com> wrote: > >> > >>> Hi all, > >>> > >>> I have a question concerning the normalization factors in edgeR. I > >>> think that if we know which genes are none-DE genes in advance, we > could > >>> calculate "true" normalization factors based on those information. > Given > >>> "true normalization factors", edgeR could find more right DE genes, or > even > >>> find all the right DE genes when the simulation is simple. > >>> The procedures I used in edgeR are: > >>> library(edgeR) > >>> dge <- DGEList(counts = counts, group = group ) > >>> norm <- calcNormFactors(dge,method="TMM") > >>> d <- estimateGLMCommonDisp(norm, design = design) > >>> d <- estimateGLMTrendedDisp(d,design=design) > >>> d <- estimateGLMTagwiseDisp(d, design = design, prior.df = 10) > >>> f <- glmFit(d, design = design) > >>> lr <- glmLRT(f, coef=2) > >>> pval = lr$table$PValue > >>> padj = p.adjust(pval, "BH") > >>> cbind(pval = pval, padj = padj) > >>> I used the order of p-value or adjust p-value to label DE genes > >>> (for example, the first 300 genes in the order of p value from low to > >>> hight, in a scenario with 30% DE in 1000 genes). In a simple simulation > >>> with 30% asymmetry DE in 1000 total genes, the default edgeR would > find 80 > >>> wrong DE genes in 300 DE genes. The simulation method is in the > attachment, > >>> and I used 10 genes as a demonstration. > >>> Then I tried two methods to calculate "true" normalization > >>> factors". The first one is using calnormFactors() function, but using > >>> counts from all none-DE genes. The second one is taking log of all the > >>> none-DE counts, and then calculate the median of > >>> log(counts)[,i]-log(counts)[,1], where i is the index of each > individual > >>> (each row in the count matrix). Then I used > norm<-norm/exp(mean(log(norm))) > >>> to make sure that all factors multiple to one. > >>> After that, I replaced the "norm" in third line of above codes > >>> with the "true normalization factors". However, both methods would > find 110 > >>> wrong DE genes, which is higher than the false discovery rates of > default > >>> edgeR method (80 wrong DE genes). I am wondering what is a better > procedure > >>> of finding the "true" normalization factors in edgeR, given which > genes are > >>> none-DE? > >>> Thank you! > >>> > >>> Best regards, > >>> Tianyu Zhan > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor@r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > >> > >> <r scripts="" tianyu="" zhan.zip=""> > > > > [[alternative HTML version deleted]]

ADD COMMENT • link 10.7 years ago Zhan Tianyu ▴ 40

0

Entering edit mode

Hi Tianyu, That's much clearer, thanks. In calNorm(), I changed: counts_edgeR<-counts[indNonDE,] de <- DGEList(counts = counts_edgeR, group = group ) de <- calcNormFactors(de,method="TMM") to: de <- DGEList(counts = counts, group = group ) de1 <- calcNormFactors(de[indNonDE,],method="TMM") de$samples$norm.factors <- de1$samples$norm.factors And, the results improve greatly. I'll leave it as an exercise to you to fix the other method you came up with. Basically, edgeR multiplies the two factors together ($lib.size and $norm.factors from the $samples element) to make the library size. You were actually changing them both (since you created a DGEList object with new library sizes), when I think what you want is to just change the relative factors. See also the following Section in the edgeR user's guide: (http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/ins t/doc/edgeRUsersGuide.pdf) 2.6.3 RNA composition [..] We call the product of the original library size and the scaling factors the effective library size. The effective library size replaces the original library size in all downsteam analyses. BTW, I haven't followed your calculations, but you may also be interested in the following manuscript, which proposes a similar thing: http://www.biomedcentral.com/1471-2105/14/219/abstract Hope that helps, Mark ---------- Prof. Dr. Mark Robinson Statistical Bioinformatics, Institute of Molecular Life Sciences University of Zurich http://ow.ly/riRea On 10.08.2014, at 18:38, Zhan Tianyu <sewen67 at="" gmail.com=""> wrote: > Hi Mark: > Hello, > Sorry for the unclear demonstration in the previous email. My question is how to find "true" normalization factors in edgeR given non-DE genes? I think that if edgeR knows non-DE genes, it would have lower false positive rates (find fewer wrong DE genes). By studying this, I want to have a better understanding of how edgeR calculates normalization factors. > > Here is my sessionInfo(): > > > sessionInfo() > R version 3.1.0 (2014-04-10) > Platform: x86_64-apple-darwin13.1.0 (64-bit) > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > attached base packages: > [1] grid splines parallel stats graphics grDevices utils datasets methods base > other attached packages: > [1] gdata_2.13.3 tm_0.6 NLP_0.1-3 gridExtra_0.9.1 tweeDEseqCountData_1.2.0 > [6] Biobase_2.24.0 BiocGenerics_0.10.0 edgeR_3.6.7 limma_3.20.8 > loaded via a namespace (and not attached): > [1] gtools_3.4.1 slam_0.1-32 tools_3.1.0 > > Here is a reproducible example: > > 1. Run the source code written by Xiaobei and you: > Pipeline: http://130.60.190.4/robinson_lab/edgeR_robust/pipeline.pdf > Source: http://130.60.190.4/robinson_lab/edgeR_robust/robust_simulation.R > > 2. Simulate NB data: > > library(tweeDEseqCountData) > data(pickrell) > pickrell<-as.matrix(exprs(pickrell.eset)) > set.seed(3) > nSamp<-8 > grp <- as.factor(rep(0:1, each = nSamp/2)) ## set up group: 4 vs 4 > data_2 <- NBsim(foldDiff=3,dataset =pickrell, nTags = 1000, group = grp, verbose = TRUE, add.outlier = TRUE, > outlierMech = "S", pOutlier = 0.1, drop.extreme.dispersion = 0.1,pDiff=0.3,pUp=0.9) > > 3. Plot the true DE genes I have put in: > > d <- DGEList(counts=data_2$counts, group=grp, lib.size=colSums(data_2$counts)) > rownames(d$counts) <- paste("ids",1:nrow(d$counts),sep="") > d <- estimateCommonDisp(d) > true.tags<-rownames(data_2$counts[data_2$indDE,]) > plotSmear(d, de.tags=true.tags) > > 4. Introduce two methods of calculating "true" normalization factors with non-DE genes. The first method works on log of non-DE counts. It first calculates the difference of log non-DE counts between each individual (column) and first individual (column). Then I take exponential of the median of each column to get "true" normalization factors. The second method is using calcNormFactors() function, but with non-DE genes. > > calNorm<-function(counts,group,indNonDE){ > counts<-counts+1 ## Avoid zero counts issue > ## First method: Calculate true normalization factors based on median of log counts difference, and stored them in "norm_edgeR_median" > ## variable. > norm<-rep(NA,8) > baseline<-counts[indNonDE,1] > for (i in 1:length(norm)) > { > norm_temp<-data_2$counts[indNonDE,i] > norm[i]<-median(log(norm_temp)-log(baseline)) > } > norm_edgeR_median<-exp(norm) > ## Forced all factors multiple to 1 > norm_edgeR_median<-norm_edgeR_median/exp(mean(log(norm_edgeR_median))) > > ## Second method: Calculate true normalization factors for edgeR with none-DE genes, using calcNormFactors() function, > ## and stored them in "norm_edgeR_calc" variable. > counts_edgeR<-counts[indNonDE,] > de <- DGEList(counts = counts_edgeR, group = group ) > de <- calcNormFactors(de,method="TMM") > norm_edgeR_calc<-de$samples$norm.factors > list(norm_edgeR_median=norm_edgeR_median,norm_edgeR_calc =norm_edgeR_calc) > } > > library(gridExtra) > norm_factors<-calNorm(data_2$counts,grp,data_2$indNonDE) > > edgeR_median.pfun <- > function(counts, group, design = NULL, mc.cores = 4, prior.df=10) > { > ## edgeR standard pipeline ## > library(edgeR) > d <- DGEList(counts = counts, group = group ) > d <- calcNormFactors(d,method="TMM") > ### Use median of log difference of none-DE genes as normalization factors > d$samples$norm.factors<-norm_factors$norm_edgeR_median > d <- estimateGLMCommonDisp(d, design = design) > d <- estimateGLMTrendedDisp(d,design=design) > d <- estimateGLMTagwiseDisp(d, design = design, prior.df = prior.df) > f <- glmFit(d, design = design) > lr <- glmLRT(f, coef=2) > pval = lr$table$PValue > padj = p.adjust(pval, "BH") > cbind(pval = pval, padj = padj) > } > > edgeR_calc.pfun <- > function(counts, group, design = NULL, mc.cores = 4, prior.df=10) > { > ## edgeR standard pipeline ## > library(edgeR) > d <- DGEList(counts = counts, group = group ) > d <- calcNormFactors(d,method="TMM") > ### Use calcNormFactors() with none-DE genes to calculate normalization factors. > d$samples$norm.factors<-norm_factors$norm_edgeR_calc > d <- estimateGLMCommonDisp(d, design = design) > d <- estimateGLMTrendedDisp(d,design=design) > d <- estimateGLMTagwiseDisp(d, design = design, prior.df = prior.df) > f <- glmFit(d, design = design) > lr <- glmLRT(f, coef=2) > pval = lr$table$PValue > padj = p.adjust(pval, "BH") > cbind(pval = pval, padj = padj) > } > > 5. Generate false positive plot: > > pvals_2 <- pval(data_2, method =c("edgeR","edgeR_median","edgeR_calc"), count.type = "counts",mc.cores = 4) > fdPlot(pvals_2, cex = 0.6,xlim=c(0,700)) > > I think the result is unexpected because edgeR_median and edgeR_calc have higher false positive rate than default edgeR. The reason would be that I am not using a proper way to calculate "true" normalization factors with non-DE genes? Thanks for your time! > > Best regards, > Tianyu Zhan > > > 2014-08-10 7:24 GMT-04:00 Mark Robinson <mark.robinson at="" imls.uzh.ch="">: > And, please keep the discussion on the list. Others can benefit. > > Also, re-read the Posting guide: > http://www.bioconductor.org/help/mailing-list/posting-guide/ > > M. > > > > > On 10.08.2014, at 13:08, Mark Robinson <mark.robinson at="" imls.uzh.ch=""> wrote: > > > Hi Tianyu, > > > > Good questions get good answers. > > > > If you want my help, you should make it easier for me: > > > > 1. Answer my original questions. In particular, look at your data to see what normalization factors you've introduced. That may give you some indication of if you've done something wrong. > > > > 2. Instead of sending me a bunch of files, tell me exactly what I should look at. > > > > Mark > > > > > > ---------- > > Prof. Dr. Mark Robinson > > Statistical Bioinformatics, Institute of Molecular Life Sciences > > University of Zurich > > http://ow.ly/riRea > > > > > > > > > > > > > > > > On 09.08.2014, at 20:28, Zhan Tianyu <sewen67 at="" gmail.com=""> wrote: > > > >> Hi Professor Robinson: > >> Hello, > >> Thanks for your response. It is true that in reality we never know what is non-DE genes. However, I want to have a better understanding of edgeR and its performance. I thank that if edgeR knows non-DE genes in advance, it would have lower false positive rates (find more right DE genes). > >> The attachments contains R scripts for data simulation and false positive plots. You could first run the source code of "R_robust.R", "calNorm.R", and "simulation_normal.R". "R_robust.R" is written by you and Xiaobei, and I also added three functions of edgeR: edgdR_median.pfun, edgeR_calc.pfun, and edgeR_ratio.pfun. All the three methods take advantage of known non-DE genes. The details of how I calculate "true" normalization factors are in the file "calNorm.R" file. > >> Then I tested the default edgeR, and three functions with known non-DE genes in two simulations: normal simulation ( Nsim() function in the file "simulation_normal.R"), and NB simulation ( NBsim() in the file "R_robust"). However, the false positive plots show that default edgeR performed the best (it has the lowest false positive rates). I think the reason is that I am not using the right way to calculate true normalization factors for edgeR, assuming that I know non-DE genes. I am wondering what is a proper way to find the true normalization factors edgeR with non-DE genes? Thanks for your time! > >> > >> Best regards, > >> Tianyu Zhan > >> > >> > >> 2014-08-09 10:34 GMT-04:00 Mark Robinson <mark.robinson at="" imls.uzh.ch="">: > >> Hi Tianyu, > >> > >> Sorry for the slow response (was away). > >> > >> I have read your attachment, but your question and simulation is still a little unclear to me. First, some useful diagnostics for me to see whether this simulation is in any way reasonable to a real RNA- seq dataset are an M-A plot -- plotSmear(), perhaps highlighting the true DE genes you have put in -- or a dispersion-mean plot -- plotBCV(). Also, there is no code in there to show exactly what you did when you say 'I replaced the "norm" in third line of above codes with the "true normalization factors"'; what did you exactly do? Maybe send a completely reproducible example ? Give your sessionInfo() too. > >> > >>> I am wondering what is a better procedure > >>> of finding the "true" normalization factors in edgeR, given which genes are > >>> none-DE? > >> > >> The other concern is that, in practice, you never know what is non-DE. So, how is this useful? > >> > >> Cheers, Mark > >> > >> > >> ---------- > >> Prof. Dr. Mark Robinson > >> Statistical Bioinformatics, Institute of Molecular Life Sciences > >> University of Zurich > >> http://ow.ly/riRea > >> > >> > >> > >> > >> > >> > >> > >> On 04.08.2014, at 22:44, Zhan Tianyu <sewen67 at="" gmail.com=""> wrote: > >> > >>> Hi all, > >>> > >>> I have a question concerning the normalization factors in edgeR. I > >>> think that if we know which genes are none-DE genes in advance, we could > >>> calculate "true" normalization factors based on those information. Given > >>> "true normalization factors", edgeR could find more right DE genes, or even > >>> find all the right DE genes when the simulation is simple. > >>> The procedures I used in edgeR are: > >>> library(edgeR) > >>> dge <- DGEList(counts = counts, group = group ) > >>> norm <- calcNormFactors(dge,method="TMM") > >>> d <- estimateGLMCommonDisp(norm, design = design) > >>> d <- estimateGLMTrendedDisp(d,design=design) > >>> d <- estimateGLMTagwiseDisp(d, design = design, prior.df = 10) > >>> f <- glmFit(d, design = design) > >>> lr <- glmLRT(f, coef=2) > >>> pval = lr$table$PValue > >>> padj = p.adjust(pval, "BH") > >>> cbind(pval = pval, padj = padj) > >>> I used the order of p-value or adjust p-value to label DE genes > >>> (for example, the first 300 genes in the order of p value from low to > >>> hight, in a scenario with 30% DE in 1000 genes). In a simple simulation > >>> with 30% asymmetry DE in 1000 total genes, the default edgeR would find 80 > >>> wrong DE genes in 300 DE genes. The simulation method is in the attachment, > >>> and I used 10 genes as a demonstration. > >>> Then I tried two methods to calculate "true" normalization > >>> factors". The first one is using calnormFactors() function, but using > >>> counts from all none-DE genes. The second one is taking log of all the > >>> none-DE counts, and then calculate the median of > >>> log(counts)[,i]-log(counts)[,1], where i is the index of each individual > >>> (each row in the count matrix). Then I used norm<-norm/exp(mean(log(norm))) > >>> to make sure that all factors multiple to one. > >>> After that, I replaced the "norm" in third line of above codes > >>> with the "true normalization factors". However, both methods would find 110 > >>> wrong DE genes, which is higher than the false discovery rates of default > >>> edgeR method (80 wrong DE genes). I am wondering what is a better procedure > >>> of finding the "true" normalization factors in edgeR, given which genes are > >>> none-DE? > >>> Thank you! > >>> > >>> Best regards, > >>> Tianyu Zhan > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > >> > >> <r scripts="" tianyu="" zhan.zip=""> > > > >

ADD REPLY • link 10.7 years ago Mark Robinson ▴ 880

0

Entering edit mode

Mark Robinson ▴ 880

@mark-robinson-4908

Last seen 6.4 years ago

Hi Tianyu, Good questions get good answers. If you want my help, you should make it easier for me: 1. Answer my original questions. In particular, look at your data to see what normalization factors you've introduced. That may give you some indication of if you've done something wrong. 2. Instead of sending me a bunch of files, tell me exactly what I should look at. Mark ---------- Prof. Dr. Mark Robinson Statistical Bioinformatics, Institute of Molecular Life Sciences University of Zurich http://ow.ly/riRea On 09.08.2014, at 20:28, Zhan Tianyu <sewen67 at="" gmail.com=""> wrote: > Hi Professor Robinson: > Hello, > Thanks for your response. It is true that in reality we never know what is non-DE genes. However, I want to have a better understanding of edgeR and its performance. I thank that if edgeR knows non-DE genes in advance, it would have lower false positive rates (find more right DE genes). > The attachments contains R scripts for data simulation and false positive plots. You could first run the source code of "R_robust.R", "calNorm.R", and "simulation_normal.R". "R_robust.R" is written by you and Xiaobei, and I also added three functions of edgeR: edgdR_median.pfun, edgeR_calc.pfun, and edgeR_ratio.pfun. All the three methods take advantage of known non-DE genes. The details of how I calculate "true" normalization factors are in the file "calNorm.R" file. > Then I tested the default edgeR, and three functions with known non-DE genes in two simulations: normal simulation ( Nsim() function in the file "simulation_normal.R"), and NB simulation ( NBsim() in the file "R_robust"). However, the false positive plots show that default edgeR performed the best (it has the lowest false positive rates). I think the reason is that I am not using the right way to calculate true normalization factors for edgeR, assuming that I know non-DE genes. I am wondering what is a proper way to find the true normalization factors edgeR with non-DE genes? Thanks for your time! > > Best regards, > Tianyu Zhan > > > 2014-08-09 10:34 GMT-04:00 Mark Robinson <mark.robinson at="" imls.uzh.ch="">: > Hi Tianyu, > > Sorry for the slow response (was away). > > I have read your attachment, but your question and simulation is still a little unclear to me. First, some useful diagnostics for me to see whether this simulation is in any way reasonable to a real RNA- seq dataset are an M-A plot -- plotSmear(), perhaps highlighting the true DE genes you have put in -- or a dispersion-mean plot -- plotBCV(). Also, there is no code in there to show exactly what you did when you say 'I replaced the "norm" in third line of above codes with the "true normalization factors"'; what did you exactly do? Maybe send a completely reproducible example ? Give your sessionInfo() too. > > > I am wondering what is a better procedure > > of finding the "true" normalization factors in edgeR, given which genes are > > none-DE? > > The other concern is that, in practice, you never know what is non- DE. So, how is this useful? > > Cheers, Mark > > > ---------- > Prof. Dr. Mark Robinson > Statistical Bioinformatics, Institute of Molecular Life Sciences > University of Zurich > http://ow.ly/riRea > > > > > > > > On 04.08.2014, at 22:44, Zhan Tianyu <sewen67 at="" gmail.com=""> wrote: > > > Hi all, > > > > I have a question concerning the normalization factors in edgeR. I > > think that if we know which genes are none-DE genes in advance, we could > > calculate "true" normalization factors based on those information. Given > > "true normalization factors", edgeR could find more right DE genes, or even > > find all the right DE genes when the simulation is simple. > > The procedures I used in edgeR are: > > library(edgeR) > > dge <- DGEList(counts = counts, group = group ) > > norm <- calcNormFactors(dge,method="TMM") > > d <- estimateGLMCommonDisp(norm, design = design) > > d <- estimateGLMTrendedDisp(d,design=design) > > d <- estimateGLMTagwiseDisp(d, design = design, prior.df = 10) > > f <- glmFit(d, design = design) > > lr <- glmLRT(f, coef=2) > > pval = lr$table$PValue > > padj = p.adjust(pval, "BH") > > cbind(pval = pval, padj = padj) > > I used the order of p-value or adjust p-value to label DE genes > > (for example, the first 300 genes in the order of p value from low to > > hight, in a scenario with 30% DE in 1000 genes). In a simple simulation > > with 30% asymmetry DE in 1000 total genes, the default edgeR would find 80 > > wrong DE genes in 300 DE genes. The simulation method is in the attachment, > > and I used 10 genes as a demonstration. > > Then I tried two methods to calculate "true" normalization > > factors". The first one is using calnormFactors() function, but using > > counts from all none-DE genes. The second one is taking log of all the > > none-DE counts, and then calculate the median of > > log(counts)[,i]-log(counts)[,1], where i is the index of each individual > > (each row in the count matrix). Then I used norm<-norm/exp(mean(log(norm))) > > to make sure that all factors multiple to one. > > After that, I replaced the "norm" in third line of above codes > > with the "true normalization factors". However, both methods would find 110 > > wrong DE genes, which is higher than the false discovery rates of default > > edgeR method (80 wrong DE genes). I am wondering what is a better procedure > > of finding the "true" normalization factors in edgeR, given which genes are > > none-DE? > > Thank you! > > > > Best regards, > > Tianyu Zhan > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > <r scripts="" tianyu="" zhan.zip="">

ADD COMMENT • link 10.7 years ago Mark Robinson ▴ 880

Login before adding your answer.