Dear edgeR user community,
I have RNA-seq data downloaded from TCGA from 7 patients, both lung tumor and normal samples. At first I used edgeR with the GLM approach to account for patient differences, but the logFC values seemed too small. Min and max are -1.03 and 0.84. Then I tried the classic approach (with the same data, same filtering, etc.), and I had logFC values larger by a factor of ~10. Min and max are -10.55 and 9.6. DE genes are pretty consistent between the two approaches, common genes are 91% and 82% of all DE genes, respectively.
Also, with the GLM approach it was calculated that BCV = 0.4879. Is that too high? I think it is close enough to 40%, but I'm curious about other opinions.
Classic approach:
library(edgeR) rawdata <- read.delim("matrix.txt", stringsAsFactors=FALSE, check.names=FALSE) targets <- readTargets("targets.csv") rownames(targets) <- targets$ID targets <- targets[c("LUAD_09_11", "LUAD_16_11", "LUAD_19_11", "LUAD_28_11", "LUAD_36_11", "LUAD_48_11", "LUAD_57_11", "LUAD_09_01", "LUAD_16_01", "LUAD_19_01", "LUAD_28_01", "LUAD_36_01", "LUAD_48_01", "LUAD_57_01"),] y <- DGEList(counts=round(rawdata[,c("LUAD_09_11", "LUAD_16_11", "LUAD_19_11", "LUAD_28_11", "LUAD_36_11", "LUAD_48_11", "LUAD_57_11", "LUAD_09_01", "LUAD_16_01", "LUAD_19_01", "LUAD_28_01", "LUAD_36_01", "LUAD_48_01", "LUAD_57_01")]), group=targets$Group, genes=rawdata[,1]) keep <- rowSums(cpm(y)>0.5) >= 7 nrow(y) y <- y[keep,] nrow(y) y <- calcNormFactors(y) y <- estimateCommonDisp(y) y <- estimateTagwiseDisp(y) et <- exactTest(y) de <- decideTestsDGE(et) summary(de)
GLM approach:
library(edgeR) rawdata <- read.delim("matrix.txt", stringsAsFactors=FALSE, check.names=FALSE) targets <- readTargets("targets.csv") rownames(targets) <- targets$ID targets <- targets[c("LUAD_09_11", "LUAD_16_11", "LUAD_19_11", "LUAD_28_11", "LUAD_36_11", "LUAD_48_11", "LUAD_57_11", "LUAD_09_01", "LUAD_16_01", "LUAD_19_01", "LUAD_28_01", "LUAD_36_01", "LUAD_48_01", "LUAD_57_01"),] y <- DGEList(counts=round(rawdata[,c("LUAD_09_11", "LUAD_16_11", "LUAD_19_11", "LUAD_28_11", "LUAD_36_11", "LUAD_48_11", "LUAD_57_11", "LUAD_09_01", "LUAD_16_01", "LUAD_19_01", "LUAD_28_01", "LUAD_36_01", "LUAD_48_01", "LUAD_57_01")]), group=targets$Group, genes=rawdata[,1]) keep <- rowSums(cpm(y)>0.5) >= 7 nrow(y) y <- y[keep,] nrow(y) y$samples$lib.size <- colSums(y$counts) rownames(y$counts) <- rownames(y$genes) y <- calcNormFactors(y) design <- model.matrix(~targets$Patient+targets$Group) rownames(design) <- colnames(y) y <- estimateGLMCommonDisp(y, design, verbose=TRUE) y <- estimateGLMTrendedDisp(y, design) y <- estimateGLMTagwiseDisp(y, design) fit <- glmFit(y, design) lrt <- glmLRT(fit) o <- order(lrt$table$PValue) summary(de <- decideTestsDGE(lrt))
Session info:
> sessionInfo() R version 3.1.3 (2015-03-09) Platform: x86_64-redhat-linux-gnu (64-bit) Running under: CentOS release 6.6 (Final) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] splines stats graphics grDevices utils datasets methods [8] base other attached packages: [1] edgeR_3.4.2 limma_3.18.13 loaded via a namespace (and not attached): [1] tools_3.1.3
Do you have any explanation why is this possible? Which approach should I use? Any thoughts?
Thank you,
Zsuzsa
It's hard to say anything about your analysis, given that you haven't shown the code that you've used. Please edit your answer to include the relevant lines of code you used to get to the figures above.
Thanks for the notice, I have included codes in my question.