the raw code:
########################################################
> plant = c(rep("MK",5),rep("NR",30),rep("NS",15))
> fungi = c(rep("32",2),rep("26",18),rep("32",30))
> time = c(rep("usp",5),rep("18hpi",3),rep("24hpi",3),rep("48hpi",3),rep("96hpi",3),rep("168hpi",3),rep("18hpi",3),rep("24hpi",3),rep("48hpi",3),rep("96hpi",3),rep("168hpi",3),rep("18hpi",3),rep("24hpi",3),rep("48hpi",3),rep("96hpi",3),rep("168hpi",3))
> batch = c(c("b","c"),rep(c("a","b","c"),16))
> batch = factor(batch)
> grouping <- factor(paste(plant, fungi, time, sep="."))
> table(grouping)
grouping
MK.26.usp MK.32.usp NR.26.168hpi NR.26.18hpi NR.26.24hpi NR.26.48hpi NR.26.96hpi NR.32.168hpi NR.32.18hpi NR.32.24hpi NR.32.48hpi NR.32.96hpi NS.32.168hpi NS.32.18hpi NS.32.24hpi NS.32.48hpi NS.32.96hpi
3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
> counts <- read.csv("All_Fragments_16_3_30.csv")
> head(counts[,1:5],3)
GeneID MK.32.usp.b MK.32.usp.c MK.26.usp.a MK.26.usp.b
1 PSTCY32_21349 570 517 820 992
2 Contig410 5 5 1 0
3 Contig413 223 328 220 181
> d <- DGEList(counts = counts[,2:51], genes = counts[,1], group = grouping)
> keep <- rowSums(cpm(d)>1)>=2
> d <- d[keep,]
> d <- DGEList(counts= d$counts, genes = d$genes, group = grouping)
> d = calcNormFactors(d)
> design = model.matrix(~ condition, samples)
> design <- model.matrix(~ 0 + grouping + batch)
> d2 = estimateGLMCommonDisp(d,design)
> d2 = estimateGLMTrendedDisp(d2, design)
> d2 = estimateGLMTagwiseDisp(d2, design)
> f = glmFit(d2, design)
> con <- makeContrasts( ( groupingNR.26.18hpi - groupingMK.26.usp ) -
( groupingNR.32.18hpi - groupingMK.32.usp ) ,
( groupingNR.26.18hpi - groupingMK.26.usp ) -
( groupingNS.32.18hpi - groupingMK.32.usp ) ,
( groupingNR.26.24hpi - groupingMK.26.usp ) -
( groupingNR.32.24hpi - groupingMK.32.usp ) ,
( groupingNR.26.24hpi - groupingMK.26.usp ) -
( groupingNS.32.24hpi - groupingMK.32.usp ) ,
( groupingNR.26.48hpi - groupingMK.26.usp ) -
( groupingNR.32.48hpi - groupingMK.32.usp ) ,
( groupingNR.26.48hpi - groupingMK.26.usp ) -
( groupingNS.32.48hpi - groupingMK.32.usp ) ,
( groupingNR.26.96hpi - groupingMK.26.usp ) -
( groupingNR.32.96hpi - groupingMK.32.usp ) ,
( groupingNR.26.96hpi - groupingMK.26.usp ) -
( groupingNS.32.96hpi - groupingMK.32.usp ) ,
( groupingNR.26.168hpi - groupingMK.26.usp ) -
( groupingNR.32.168hpi - groupingMK.32.usp ) ,
( groupingNR.26.168hpi - groupingMK.26.usp ) -
( groupingNS.32.168hpi - groupingMK.32.usp ) ,levels=design)
> lrt <- glmLRT(f,contrast=con)
> tt <- topTags(lrt, n=Inf)$table
########################################################
In my data, one biological repeat (CYR32_1) in one group is obvious deviation from the other two biological repeats(CYR32_2 and CYR32_3) in the plotMDS with or without the method BCV.
So I just removed this sample, while when the design (= model.matrix(~ 0 + grouping + batch)) was used, it seemed among all of the data only the batch2 and batch3 were used.
design
groupingNILR.B.0h groupingNILR.A.0h groupingNILR.B.18h groupingNILR.B.24h groupingNILR.S.18h groupingNILR.S.24h groupingNILS.B.18h groupingNILS.B.24h batchb batchc
the levels of con:groupingNILR.B.0h groupingNILR.A.0h groupingNILR.B.18h groupingNILR.B.24h groupingNILR.S.18h groupingNILR.S.24h groupingNILS.B.18h groupingNILS.B.24h batchb batchc
After the statistical analysis with glmFit and glmLTR, there are 11994 significantly differentially expressed genes with the FDR<0.05.
I tried to omit the batch effect to use the design3 (= model.matrix(~ 0 + grouping)). The levels of con are groupingMK.26.usp, groupingMK.32.usp, groupingNR.26.18h, groupingNR.26.24h, groupingNR.32.18h, groupingNR.32.24h, groupingNS.32.18h, groupingNS.32.24h.
After the statistical analysis with glmFit and glmLTR, there are only 10398 significantly differentially expressed genes (DEG) with the FDR<0.05. compare with the result of “design”, 8722 was identified by both “design” and “design3”. In addition, the FDR and P-value of the same gene is different.
> design3 (= model.matrix(~ 0 + grouping))
So my questions are:
1. When one of the biological repeat should be removed, how to design the model?
2. In my data, which result seem to be more receivable, the “design” or “design3”, or some others?
3. Does the batch effects have a great effect of the DEG.
4. During the estimation of dispersion, there are three approaches:
d = estimateGLMCommonDisp(d,design)
d = estimateGLMTrendedDisp(d,design)
d = estimateGLMTagwiseDisp(d,design)
After each method was used, the last logFC of each gene will have a minor alteration while the FDR and p-value will have a great change, in some paper only the first and the third method were used, so I’m not so clear whether all of these three approaches should be used together, for in the edgeR users guide just described “The tagwise dispersion approach is strongly recommended in multi-factor experiment cases.”
There's no NILS or NILR in the code you've shown above, only MK, NR or NS.