Dear all!
I am analysing RNA sequencing data by using edgeR package, but I am not sure about how should I make the design matrix. The counts table are obtained by using Rsubread package.
Firstly, I will describe what my samples look like. I have two stable doxycycline (Dox) inducible cell clones expressing either shCtrl(shC) or shTargetgene (shT). They were treated with Dox or without Dox. By adding Dox, shCtrl cells should not be affected, while shTargetgene cells will have knockdown of the target gene. My samples are in four different groups: C (control), C plus Dox, T and T plus Dox and I have prepared triplicates. Therefore, in total 12 samples: Group <- c(rep("C",3),rep("C+Dox",3),rep("T",3),rep("T+Dox",3)). Actually, all the three groups : "C", "C+Dox" and "T" are all considered to be controls for group "T+Dox".
My idea is that I want to get a list of genes that are differentially expressed between W plus and W groups but not between C plus and C group.
Another thing need to be mentioned is that: since I have collected the cells and prepared RNA in different days, the batch effect exits. Therefore, I used batch = as.factor(c(1,2,3,1,2,3,4,5,6,4,5,6)) in my case.
How could I make a correct design? Should I make a contrast between T and T+Dox group and get top tags, subset the first list of genes with FDR < 0.05 and abs(logFC)>1. Then make another contrast between C and C+Dox and also subset the second list of genes with FDR < 0.05 and abs(logFC)>1. After that I substract the genes in the second list from the first gene lists.
Is that good to do the analysis in this way or is there any design can make things easier? Can you be nice and help me with it?
Bellow is the pipeline I want to use for the RNA seq analysis. This is a simplified pipeline. I have not added contrasts in the pipeline. I also attached sessionInfo().
#read csv file containing rawcounts counts <- read.csv (file="rawcounts.csv",sep = ",") colnames( counts ) <- paste(c("C_1","C_2","C_3","C plus Dox_1","C plus Dox_2","C plus Dox_3","T_1","T_2","T_3","T plus Dox_1","T plus Dox_2","T plus Dox_3"),sep=",") #building the edgeR Object Group <- c(rep("C",3),rep("C+Dox",3),rep("T",3),rep("T+Dox",3)) cds <- DGEList(counts, group = Group) names( cds ) dim(cds) #Filtering keep <- rowSums(cpm(cds) >0.5) >= 2 cdsfilter <- cds[keep,,keep.lib.sizes= FALSE] cdsfilter$samples table(keep) cdsNorm <- calcNormFactors(cdsfilter) cdsNorm$samples #Batch effect removal batch = as.factor(c(1,2,3,1,2,3,4,5,6,4,5,6)) logFC <- predFC(cdsNorm,design,prior.count = 1, dispersion = 0.05) cor(logFC[,1:6]) logCPM <- cpm(cdsNorm,log =TRUE, prior.count =5) logCPM <- removeBatchEffect(logCPM,batch =batch) # MDS plot plotMDS(logCPM,lables= cdsNorm$Group, ylim = c(-1,1), col=c(1,1,1,2,2,2,3,3,3,4,4,4),main = "MDS plot") legend("bottomleft", legend = c("shC-Dox","shC+Dox","shT-Dox","shT+Dox"),col= 1:4, pch=16,cex=1) #The design matrix before fit GLMs design <- model.matrix(~Group+batch) rownames(design) <- colnames(cdsNorm) design # estimate the dispersion cdsGLM <- estimateGLMCommonDisp(cdsNorm,design) cdsGLM$common.dispersion cdsTag <- estimateGLMTagwiseDisp(cdsGLM,design) plotBCV(cdsTag,main="BCV plot") plotMeanVar(cdsTag,show.raw =TRUE,show.tagwise= TRUE, show.binned = TRUE) #Differentiatl expression(GLM model for multiple factor) fit<-glmQLFit(cdsTag,design,robust =TRUE) head(fit$coefficients) plotQLDisp(fit) lrt <- glmLRT(fit, coef =2) topTags (lrt) res.sig <- topTags (lrt)
> seesionInfo() R version 3.3.1 (2016-06-21) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X 10.11.2 (El Capitan) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] biomaRt_2.28.0 org.Mm.eg.db_3.3.0 AnnotationDbi_1.34.4 [4] edgeR_3.14.0 DESeq2_1.12.4 BiocInstaller_1.22.3 [7]limma_3.28.20 GenomicAlignments_1.8.4 SummarizedExperiment_1.2.3 Biobase_2.32.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.6 RColorBrewer_1.1-2 plyr_1.8.4 bitops_1.0-6 [5] tools_3.3.1 zlibbioc_1.18.0 statmod_1.4.25 rpart_4.1-10 [9] RSQLite_1.0.0 annotate_1.50.0 gtable_0.2.0 lattice_0.20-33 [13] Matrix_1.2-6 DBI_0.5 gridExtra_2.2.1 genefilter_1.54.2 [17] cluster_2.0.4 locfit_1.5-9.1 grid_3.3.1 nnet_7.3-12 [21] data.table_1.9.6 XML_3.98-1.4 survival_2.39-5 BiocParallel_1.6.6 [25] foreign_0.8-66 latticeExtra_0.6-28 Formula_1.2-1 org.Hs.eg.db_3.3.0 [29] GO.db_3.3.0 geneplotter_1.50.0 ggplot2_2.1.0 Hmisc_3.17-4 [33] scales_0.4.0 splines_3.3.1 xtable_1.8-2 colorspace_1.2-6 [37] acepack_1.3-3.3 RCurl_1.95-4.8 munsell_0.4.3 chron_2.3-47
Thanks Aaron for your great help!
I have taken your suggestions and rewrite the R code, however, There is a problem with the new design :
design <- model.matrix(~0+Group+batch)
It shows: Error in glmFit.default(y, design, offset = log(lib.size), dispersion = dispersion, : Design matrix not of full rank. The following coefficients not estimable:
batch6
I suspected the problem is due to the batch set, therefore, I changed the
batch <- as.factor(c(1,2,3,1,2,3,1,2,3,1,2,3))
Then the problem was solved. But I do not understand why. If I want to keep the batch set as before, how should I make the right design then?
Bellow is my modified R code:
Oops - you have to remove one of the
batch4-6
coefficients to get to full rank, otherwise they are linearly dependent with theGroupT*
coefficients. I've modified my response above (arbitrarily removingbatch4
).Thank you very much Aaron!