Question

Remove batch effects from RNA-seq data using edgeR and sva/ComBat

0

Entering edit mode

David O'Brien ▴ 10

@david-obrien-5970

Last seen 10.2 years ago

I'm trying to analyze an RNA-seq experiment where the PCA plot shows better clustering by the day the experiment was done rather than treatment type. Using edgeR to determine differentially expressed genes resulted in less than 5 genes with an FDR under 5%. Creating a GLM model to remove batch effects for day of experiment as stated in the edgeR manual resulted in 42 genes with an FDR less than 5%. An improvement, but still not good. So I tried using ComBat and the result was 986 genes with an FDR under 5%. Looking at the GO enrichment, the differentially expressed genes seem to make sense, but since ComBat was developed for microarrays, I'm concerned that there may be some caveats with this approach that I'm missing. Looking at the top genes below, the log2 fold change is really low and generally this just seems too good to be true. So my question is: Are there any reasons why using ComBat with RNA-seq data is not legit? And if so, can you see any problems with the approach below? mean_control mean_treatment logFC pval padj Gene5727 51.224797 45.371919 -0.1750427 3.474361e-08 0.0003224554 Gene3059 8.998311 5.740828 -0.6483954 1.056473e-07 0.0003268376 Gene11899 35.044302 39.027842 0.1553238 7.398559e-08 0.0003268376 Gene11724 2.556712 3.684178 0.5270535 1.959058e-07 0.0003636404 Gene12218 30.852989 23.702209 -0.3803888 1.908726e-07 0.0003636404 Gene4952 26.122068 30.466346 0.2219474 3.346424e-07 0.0005176360 My code is below. I've attached a file, dge.Rdata, that contains the counts info that is output from readDGE, so you can have the initial counts info. require(edgeR) require(sva) source('code/annotate_edgeR.R') files = data.frame(files=c('counts.control0', 'counts.control1', 'counts.control2', 'counts.control3', 'counts.treatment0', 'counts.treatment1', 'counts.treatment2', 'counts.treatment3'), group=c('control', 'control', 'control', 'control', 'treatment', 'treatment', 'treatment', 'treatment'), day=rep(0:3,2) ) labels <- paste0(files$group, files$day) dge <- readDGE(files=files, path='data/HTSeq/', labels=labels) rownames(dge$counts) <- paste0('Gene', 1:nrow(dge$counts)) #Change gene names to anonymize data ################################ # save(dge, file='objs/dge.Rdata') # SEE ATTACHED FILE # ############################### ## filter out the no_feature etc. rows dge <- dge[1:(nrow(dge)-5), ] ## This mitochondrial rRNA gene takes up a massive portion of my libraries dge <- dge[!rownames(dge)%in%'Gene13515', ] ## filter out lowly expressed genes keep <- rowSums(cpm(dge) > 1) >= 3 ## gene has at least 3 columns where cpm is > 1 dge <- dge[keep, ] ## Recompute library sizes dge$samples$lib.size <- colSums(dge$counts) ## Normalize for lib size dge <- calcNormFactors(dge) ## ComBat mod <- model.matrix(~as.factor(group), data=dge$sample) mod0 <- model.matrix(~1, data=dge$sample) batch <- dge$sample$day combat <- ComBat(dat=cpm(dge), batch=batch, mod=mod) pval_combat = f.pvalue(combat, mod, mod0) padj_combat = p.adjust(pval_combat, method="BH") mean_control <- rowMeans(combat[, 1:4]) mean_treatment <- rowMeans(combat[, 5:8]) logFC <- log2(mean_treatment/mean_control) res <- data.frame(mean_control, mean_treatment, logFC, pval=pval_combat, padj=padj_combat) res <- res[order(res$padj), ] R version 3.0.1 (2013-05-16) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] biomaRt_2.16.0 sva_3.6.0 mgcv_1.7-23 corpcor_1.6.6 edgeR_3.2.3 limma_3.16.4 loaded via a namespace (and not attached): [1] grid_3.0.1 lattice_0.20-15 Matrix_1.0-12 nlme_3.1-109 RCurl_1.95-4.1 tools_3.0.1 [7] XML_3.96-1.1

GO edgeR GO edgeR • 7.2k views

ADD COMMENT • link updated 11.5 years ago by Gordon Smyth 52k • written 11.5 years ago by David O'Brien ▴ 10

score 0 · Answer 1 · 2013-06-06

Hi David, I think you're right that this is too good to be true. First, cpm values are grossly non-normal. If you wish to use a method designed for microarrays, you need to convert to the log-scale: data = cpm(dge, log=TRUE, prior.count=2) See Section 2.10 of the edgeR User's Guide. Second, as Evan has said in a separate email, your sample sizes are too small for ComBat to be reliable. You have only 8 samples total, and only two samples in each batch. ComBat isn't designed to work in this range. So adjusting for batch in the glm model is the way to go. Best wishes Gordon PS. We didn't get your attached file, presumably because it is over the size limit for the mailing list. > Date: Tue, 4 Jun 2013 15:40:20 -0400 > From: "David O'Brien" <dobrie10 at="" jhmi.edu=""> > To: bioconductor at r-project.org > Subject: [BioC] Remove batch effects from RNA-seq data using edgeR and > sva/ComBat > > I'm trying to analyze an RNA-seq experiment where the PCA plot shows better > clustering by the day the experiment was done rather than treatment type. > Using edgeR to determine differentially expressed genes resulted in less > than 5 genes with an FDR under 5%. Creating a GLM model to remove batch > effects for day of experiment as stated in the edgeR manual resulted in 42 > genes with an FDR less than 5%. An improvement, but still not good. So I > tried using ComBat and the result was 986 genes with an FDR under 5%. > Looking at the GO enrichment, the differentially expressed genes seem to > make sense, but since ComBat was developed for microarrays, I'm concerned > that there may be some caveats with this approach that I'm missing. Looking > at the top genes below, the log2 fold change is really low and generally > this just seems too good to be true. So my question is: Are there any > reasons why using ComBat with RNA-seq data is not legit? And if so, can you > see any problems with the approach below? > > mean_control mean_treatment logFC pval padj > Gene5727 51.224797 45.371919 -0.1750427 3.474361e-08 0.0003224554 > Gene3059 8.998311 5.740828 -0.6483954 1.056473e-07 0.0003268376 > Gene11899 35.044302 39.027842 0.1553238 7.398559e-08 0.0003268376 > Gene11724 2.556712 3.684178 0.5270535 1.959058e-07 0.0003636404 > Gene12218 30.852989 23.702209 -0.3803888 1.908726e-07 0.0003636404 > > Gene4952 26.122068 30.466346 0.2219474 3.346424e-07 0.0005176360 > > > My code is below. I've attached a file, dge.Rdata, that contains the > counts info that is output from readDGE, so you can have the initial > counts info. > > > require(edgeR) > require(sva) > source('code/annotate_edgeR.R') > files = data.frame(files=c('counts.control0', 'counts.control1', > 'counts.control2', 'counts.control3', 'counts.treatment0', > 'counts.treatment1', 'counts.treatment2', 'counts.treatment3'), > group=c('control', 'control', 'control', 'control', > 'treatment', 'treatment', 'treatment', 'treatment'), > day=rep(0:3,2) > ) > labels <- paste0(files$group, files$day) > dge <- readDGE(files=files, path='data/HTSeq/', labels=labels) > rownames(dge$counts) <- paste0('Gene', 1:nrow(dge$counts)) #Change gene > names to anonymize data > ################################ > # save(dge, file='objs/dge.Rdata') > # SEE ATTACHED FILE # > ############################### > > ## filter out the no_feature etc. rows > dge <- dge[1:(nrow(dge)-5), ] > ## This mitochondrial rRNA gene takes up a massive portion of my libraries > dge <- dge[!rownames(dge)%in%'Gene13515', ] > ## filter out lowly expressed genes > keep <- rowSums(cpm(dge) > 1) >= 3 ## gene has at least 3 columns where cpm > is > 1 > dge <- dge[keep, ] > ## Recompute library sizes > dge$samples$lib.size <- colSums(dge$counts) > ## Normalize for lib size > dge <- calcNormFactors(dge) > > ## ComBat > mod <- model.matrix(~as.factor(group), data=dge$sample) > mod0 <- model.matrix(~1, data=dge$sample) > batch <- dge$sample$day > > combat <- ComBat(dat=cpm(dge), batch=batch, mod=mod) > pval_combat = f.pvalue(combat, mod, mod0) > padj_combat = p.adjust(pval_combat, method="BH") > mean_control <- rowMeans(combat[, 1:4]) > mean_treatment <- rowMeans(combat[, 5:8]) > logFC <- log2(mean_treatment/mean_control) > > res <- data.frame(mean_control, mean_treatment, logFC, pval=pval_combat, > padj=padj_combat) > res <- res[order(res$padj), ] > > R version 3.0.1 (2013-05-16) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > LC_TIME=en_US.UTF-8 > [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 > LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 > LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] biomaRt_2.16.0 sva_3.6.0 mgcv_1.7-23 corpcor_1.6.6 > edgeR_3.2.3 limma_3.16.4 > > loaded via a namespace (and not attached): > [1] grid_3.0.1 lattice_0.20-15 Matrix_1.0-12 nlme_3.1-109 > RCurl_1.95-4.1 tools_3.0.1 > [7] XML_3.96-1.1 > > ------------------------------ > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}