Hello! My first question has to do with applying functions from edgeR and limma to identify genes that are expressed at levels that are consistent within individuals over time but differential between individuals. We have replicate A and B samples for individuals from two cohorts (100 and 500 series below), and we're identifying differentially expressed genes separately in each cohort because there are batch effects between them:
Dems ID Individual Bleed Batch 101A-0 101 A 1 101B-0 101 B 1 102A-0 102 A 1 102B-0 102 B 1 103A-0 103 A 1 103B-0 103 B 1 107A-0 107 A 1 107B-0 107 B 1 108A-0 108 A 1 108B-0 108 B 1 109A-0 109 A 1 109B-0 109 B 1 110A-0 110 A 1 110B-0 110 B 1 111A-0 111 A 1 111B-0 111 B 1 112A-0 112 A 1 112B-0 112 B 1 502A-0 502 A 2 502B-0 502 B 2 503A-0 503 A 2 503B-0 503 B 2 504A-0 504 A 2 504B-0 504 B 2 505A-0 505 A 2 505B-0 505 B 2 506A-0 506 A 2 506B-0 506 B 2 507A-0 507 A 2 507B-0 507 B 2 508A-0 508 A 2 508B-0 508 B 2 509A-0 509 A 2 509B-0 509 B 2 510A-0 510 A 2 510B-0 510 B 2 511A-0 511 A 2 511B-0 511 B 2 512A-0 512 A 2 512B-0 512 B 2
I begin by creating a DGEList with data for all 40 samples, filtering out genes expressed at low levels, and then normalizing the data using TMM and voom:
dge <- DGEList(data) keep <- rowSums(cpm(dge)>=1) >=6 fil_dge <- dge[keep, , keep.lib.size = FALSE] fil_dge <- calcNormFactors(fil_dge) Individual <- factor(Dems$Individual) indDesign <- model.matrix(~0+Individual) rownames(indDesign) <- Dems$ID v1 <- voom(fil_dge,indDesign)
I then filter out genes with absolute value mean fold changes >1.5 between A and B replicates using the log2 CPM values generated by voom:
n_genes <- nrow(fil_dge) n_individuals <- (nrow(Dems))/2 abs_log_FCs <- matrix(0,n_genes,n_individuals) row.names(abs_log_FCs) <- rownames(fil_dge) for (i in 1:n_individuals){ abs_log_FCs[,i] <- abs((v1$E[,2*i])-(v1$E[,(2*i)-1])) } ltFCthresh <- which((rowMeans(abs_log_FCs))<(log(1.5,2))) ltFCthresh <- row.names(abs_log_FCs)[ltFCthresh] fil_v1 <- v1[ltFCthresh,]
Finally, I use fil_v1 to perform differential gene expression analysis between each individual one-by-one (e.g. 101 vs. 102, 101 vs. 103, etc.) separately for each cohort. I determine which genes are differentially expressed by converting F.p.values from limma's eBayes function into FDRs for each cohort, and overlap differentially expressed genes from each cohort to find the genes that are differentially expressed in each case. Are these acceptable applications of edgeR and limma? For example, is it okay to subset the EList from voom normalization like this before performing differential gene expression, or should I do voom normalization again after filtering out genes that are expressed variably between A and B replicates?
Second, I'd like to iteratively arrange the 20 individuals into random groups of 2 and perform differential gene expression analysis between the groups using the gene set from the above analysis each time. Because I'm combining the cohorts in this analysis and there are batch effects between them I'd like to apply sva. From what I've read the best way to do that would be to specify which groups are being contrasted (group variable below), perform sva for each contrast, derive the surrogate variables and include them in a voom normalization, and subset the previously defined gene set (gene_set variable below) from the resulting EList for differential expression analysis as follows:
mod0 <- model.matrix(~1, data=Dems) mod1 <- model.matrix(~groups, data=Dems) sva.obj <- sva(v1$E, mod1, mod0) mod2 <- model.matrix(~groups+sva.obj$sv, data=Dems) v2 <- voom(fil_dge, mod2) fil_v2 <- v2[gene_set,]
Is there a better way to account for the batch effect prior to differential expression analysis (I realize I could block it but I believe I'd lose power that way) or is that the most effective way to apply sva and voom normalization given what I'd like to accomplish? Insight is much appreciated, thanks! All the best!
Adam
Thanks for your response Dr. Smyth, it's very helpful! My second question actually has to do with looking at potential subgroups within the 20 individuals using genes that are differentially expressed among them, so I would still like to iteratively organize them into groups of 2 and do differential gene expression between the groups each time. To accomplish that would you subset v to genes that are differentially expressed among individuals, block the batch effect and perform each contrast of interest or use sva as I outlined above? Thanks again, all the best!
Adam
Great thanks again for your help!
Adam