Question

Large logCPM values when using offsets in edgeR for normalization

0

Entering edit mode

Ying W ▴ 90

@ying-w-4341

Last seen 9.6 years ago

United States

Dear list, I am interested in adjusting for GC bias in my experiment. I have my data in an "SeqExpressionSet" object (using EDASeq library) called eset and I then run the following commands: dataOffset = withinLaneNormalization(eset,"gc", which="full",offset=TRUE) dataOffset2 = betweenLaneNormalization(dataOffset, which="full",offset=TRUE) y = DGEList(exprs(dataOffset), group = pData(eset)$conditions) y = calcNormFactors(y,method="TMM") y$design = model.matrix(~y$samples$group) y = estimateGLMCommonDisp(y, y$design, offset = -offst(dataOffset2)) y = estimateGLMTagwiseDisp(y, y$design, offset = -offst(dataOffset2)) #no trended lrt_offset <- glmLRT(glmFit(y,y$design, offset = -offst(dataOffset2)), coef=2) Where group is something like c(1,1,2,2). However, I get very high logCPM values > head(lrt_offset$table) logFC logCPM LR PValue 1 -2.087455 24.90139 18.59469 1.616704e-05 2 -2.182279 24.71384 21.99644 2.731570e-06 3 -1.962187 25.17329 14.93835 1.110818e-04 4 -1.808961 24.67918 13.09471 2.961312e-04 5 -2.189829 26.08874 22.87277 1.730868e-06 6 -1.816757 25.37329 15.85988 6.820944e-05 > summary(lrt_offset$table$logCPM) Min. 1st Qu. Median Mean 3rd Qu. Max. 22.27 24.76 25.32 25.39 25.95 30.58 Whereas before the glm command, the logCPM values look more like what I am expecting > summary(y$logCPM) Min. 1st Qu. Median Mean 3rd Qu. Max. 2.032 4.068 4.716 4.778 5.369 10.260 Plotting y$logCPM andlrt_offset$table$logCPM however does not result in a straight line (but it has high correlation). I've noticed in the manual for both EDASeq and cqn (two packages for GC bias correction) the logCPM values are also much higher (look under topTags output). Could someone shed some insight into why these values are much higher (statistically)? Computationally, I know that if offsets are not provided, default offsets of log(lib.size) is used in the glmFit() function. The code on the svn seems to have been modified quite recently so maybe this will be resolved in the next update. On a related note, I want to adjust for GC bias (with lane normalization) and also for sequencing depth (between lane normalization) but I want to do the latter in a simple way (and not using the betweenLaneNormalization function above), could I do something like this? myOffset = -offst(dataOffset) + log(lib.size) > head(offst(dataOffset), n=3) s1 s2 y1 y2 1 0.1870279 0.1033789 0.1605521 0.09555714 2 -0.3311078 -0.2439545 -0.1688206 -0.23481517 3 0.4720900 0.2818505 0.6052292 0.63641814 >log(lib.size) [1] 14.94330 14.68804 13.10804 13.59071 Where dataOffset was generated above by doing full quantile normalization using the withinLaneNormalization function above. I was thinking about this because edgeR automatically generates an offset when computing glmFit() to adjust for different library sizes (using log(lib.size) as offset). Could I do something similar but also adjust for GC biases and have normal looking logCPM values? Lastly, what is the best way to get the 'adjusted' count values out of edgeR? Is it by using the fitted values from glmFit() or using cpm() on the non-fitted values ? would the latter adjust for offset? Thanks, Ying Wu > sessionInfo() R version 2.15.0 (2012-03-30) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8 [5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_3.0.8 limma_3.14.4 EDASeq_1.4.0 [4] R.oo_1.13.0 R.methodsS3_1.4.2 aroma.light_1.28.0 [7] matrixStats_0.6.2 ShortRead_1.16.4 latticeExtra_0.6-24 [10] RColorBrewer_1.0-5 Rsamtools_1.10.2 lattice_0.20-10 [13] Biostrings_2.26.3 GenomicRanges_1.10.7 IRanges_1.16.6 [16] Biobase_2.18.0 BiocGenerics_0.4.0 knitr_1.1 loaded via a namespace (and not attached): [1] annotate_1.36.0 AnnotationDbi_1.20.6 bitops_1.0-5 [4] DBI_0.2-5 DESeq_1.10.1 digest_0.6.3 [7] evaluate_0.4.3 formatR_0.7 genefilter_1.40.0 [10] geneplotter_1.36.0 grid_2.15.0 hwriter_1.3 [13] markdown_0.5.4 parallel_2.15.0 RSQLite_0.11.2 [16] splines_2.15.0 stats4_2.15.0 stringr_0.6.2 [19] survival_2.37-4 tools_2.15.0 XML_3.95-0.2 [22] xtable_1.7-1 zlibbioc_1.4.0

Sequencing edgeR cqn EDASeq Sequencing edgeR cqn EDASeq • 2.8k views

ADD COMMENT • link updated 12.0 years ago by Gordon Smyth 52k • written 12.0 years ago by Ying W ▴ 90

score 0 · Answer 1 · 2013-03-23

Dear Ying Wu, As you have observed, the offset for glmFit() defaults to log(lib.size). If a value for offset is user-supplied, then it takes precedence over lib.size, and all calculations are done as if lib.size was equal to exp(offset). Hence this affects directly the calculation of logCPM. If you want to use an offset matrix with glmFit(), but you want the logCPM values to reflect the actual library sizes, then you should reset the mean value for offset to be the same as the mean value for log(lib.size): m <- mean(log(y$samples$lib.size)) offset <- offset - mean(offset) + m fit <- glmFit(y, design, offset=offset) etc. Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. http://www.statsci.org/smyth >> From: Ying Wu <daiyingw at="" usc.edu=""> >> Subject: Large logCPM values when using offsets in edgeR for normalization >> Date: 21. M?rz 2013 19:17:50 MEZ >> Mon Mar 18 18:31:12 CET 2013 >>> Dear list, >>> I am interested in adjusting for GC bias in my experiment. I have my >>> data in an "SeqExpressionSet" object (using EDASeq library) called eset >>> and I then run the following commands: >>> >>> dataOffset = withinLaneNormalization(eset,"gc", which="full",offset=TRUE) >>> dataOffset2 = betweenLaneNormalization(dataOffset, >>> which="full",offset=TRUE) >>> >>> y = DGEList(exprs(dataOffset), group = pData(eset)$conditions) >>> y = calcNormFactors(y,method="TMM") >>> y$design = model.matrix(~y$samples$group) >>> y = estimateGLMCommonDisp(y, y$design, offset = -offst(dataOffset2)) >>> y = estimateGLMTagwiseDisp(y, y$design, offset = -offst(dataOffset2)) >>> #no trended >>> lrt_offset <- glmLRT(glmFit(y,y$design, offset = -offst(dataOffset2)), >>> coef=2) >>> >>> Where group is something like c(1,1,2,2). However, I get very high >>> logCPM values >>> >>>> head(lrt_offset$table) >>> logFC logCPM LR PValue >>> 1 -2.087455 24.90139 18.59469 1.616704e-05 >>> 2 -2.182279 24.71384 21.99644 2.731570e-06 >>> 3 -1.962187 25.17329 14.93835 1.110818e-04 >>> 4 -1.808961 24.67918 13.09471 2.961312e-04 >>> 5 -2.189829 26.08874 22.87277 1.730868e-06 >>> 6 -1.816757 25.37329 15.85988 6.820944e-05 >>>> summary(lrt_offset$table$logCPM) >>> Min. 1st Qu. Median Mean 3rd Qu. Max. >>> 22.27 24.76 25.32 25.39 25.95 30.58 >>> >>> Whereas before the glm command, the logCPM values look more like what I >>> am expecting >>> >>>> summary(y$logCPM) >>> Min. 1st Qu. Median Mean 3rd Qu. Max. >>> 2.032 4.068 4.716 4.778 5.369 10.260 >>> >>> Plotting y$logCPM and lrt_offset$table$logCPM however does not result in >>> a straight line (but it has high correlation). I've noticed in the >>> manual for both EDASeq and cqn (two packages for GC bias correction) the >>> logCPM values are also much higher (look under topTags output). >>> >>> Could someone shed some insight into why these values are much higher >>> (statistically)? Computationally, I know that if offsets are not >>> provided, default offsets of log(lib.size) is used in the glmFit() >>> function. The code on the svn seems to have been modified quite recently >>> so maybe this will be resolved in the next update. >>> >>> On a related note, I want to adjust for GC bias (with lane >>> normalization) and also for sequencing depth (between lane >>> normalization) but I want to do the latter in a simple way (and not >>> using the betweenLaneNormalization function above), could I do something >>> like this? >>> >>> myOffset = -offst(dataOffset) + log(lib.size) >>> >>>> head(offst(dataOffset), n=3) >>> s1 s2 y1 y2 >>> 1 0.1870279 0.1033789 0.1605521 0.09555714 >>> 2 -0.3311078 -0.2439545 -0.1688206 -0.23481517 >>> 3 0.4720900 0.2818505 0.6052292 0.63641814 >>>> log(lib.size) >>> [1] 14.94330 14.68804 13.10804 13.59071 >>> >>> Where dataOffset was generated above by doing full quantile >>> normalization using the withinLaneNormalization function above. I was >>> thinking about this because edgeR automatically generates an offset when >>> computing glmFit() to adjust for different library sizes (using >>> log(lib.size) as offset). Could I do something similar but also adjust >>> for GC biases and have normal looking logCPM values? >>> >>> Lastly, what is the best way to get the 'adjusted' count values out of >>> edgeR? Is it by using the fitted values from glmFit() or using cpm() on >>> the non-fitted values ? would the latter adjust for offset? >>> >>> Thanks, >>> Ying Wu >>> >>>> sessionInfo() >>> R version 2.15.0 (2012-03-30) >>> Platform: x86_64-pc-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8 >>> [5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] edgeR_3.0.8 limma_3.14.4 EDASeq_1.4.0 >>> [4] R.oo_1.13.0 R.methodsS3_1.4.2 aroma.light_1.28.0 >>> [7] matrixStats_0.6.2 ShortRead_1.16.4 latticeExtra_0.6-24 >>> [10] RColorBrewer_1.0-5 Rsamtools_1.10.2 lattice_0.20-10 >>> [13] Biostrings_2.26.3 GenomicRanges_1.10.7 IRanges_1.16.6 >>> [16] Biobase_2.18.0 BiocGenerics_0.4.0 knitr_1.1 >>> >>> loaded via a namespace (and not attached): >>> [1] annotate_1.36.0 AnnotationDbi_1.20.6 bitops_1.0-5 >>> [4] DBI_0.2-5 DESeq_1.10.1 digest_0.6.3 >>> [7] evaluate_0.4.3 formatR_0.7 genefilter_1.40.0 >>> [10] geneplotter_1.36.0 grid_2.15.0 hwriter_1.3 >>> [13] markdown_0.5.4 parallel_2.15.0 RSQLite_0.11.2 >>> [16] splines_2.15.0 stats4_2.15.0 stringr_0.6.2 >>> [19] survival_2.37-4 tools_2.15.0 XML_3.95-0.2 >>> [22] xtable_1.7-1 zlibbioc_1.4.0 >>> > > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:5}}