I am using edgeR on two datasets, one of which is a subset of the other. I have 3 different diets (D1, D2, D3) plus a control (ctr), and 3 time points (T1, T2, T3). The full dataset looks like this:
table(mapping_file4$Time, mapping_file4$Diet)
     ctr D1 D2 D3
  T1  12 12 12 12
  T2  12 12 12 12
  T3  12 12 12 12
mapping_file4$group<-paste(mapping_file4$Diet, mapping_file4$Time, sep = ".")
y <- DGEList(counts=rawCounts_total4, genes=row.names(rawCounts_total4))
y <- calcNormFactors(y)
design <- model.matrix(~ 0 + group, data=mapping_file4)
rownames(design) <- colnames(y)
y <- estimateDisp(y, design, robust=TRUE)
fit <- glmQLFit(y, design, robust=TRUE)
qlt <- glmQLFTest(fit, contrast=makeContrasts(groupmc1.T1 -groupctr.T1, levels=design))
topgenes <- topTags(qlt, n=Inf)
table(topgenes$table$FDR<0.05)
FALSE  TRUE
69363    26
But when I use the second dataset, which is a subset of the one above, it looks like this:
table(mapping_file4$Time, mapping_file4$Diet)
     ctr D1
  T1  12 12
  T2  12 12
  T3  12 12
qlt <- glmQLFTest(fit, contrast=makeContrasts(groupmc1.T1 -groupctr.T1, levels=design))
topgenes <- topTags(qlt, n=Inf)
table(topgenes$table$FDR<0.05)
FALSE  TRUE
69378    11
I used the same commands for the subset as for the full dataset. My question is: why do I get a different number of differentially expressed genes for the same comparison (D1 versus control at T1)? The 11 genes are a subset of the 26 genes.
Many thanks!
Thank you for the explanation, but I still don't understand how removing the other samples (in this case the D2 and D3 diet samples) reduces the power of the D1 versus control comparison. In both datasets, I am using a contrast to compare D1 and control at time point 1.
Lastly, what would you recommend in this case? If I want to compare D1 with control at time point 1, should I include all the samples (including D2 and D3) or subset the data?
I'm having a hard time finding an edgeR-specific answer to illustrate how more samples help with statistical power, but take a look at this post by Michael Love where he explains how having more samples helps you to better estimate your gene-level dispersions in DESeq2. In particular, pay attention to this:
... and note that there is an entry in the DESeq2 FAQ that can help explain more.
Although edgeR and DESeq2 have their differences, you can imagine a "similar-enough" thing is happening in the edgeR world. There is always the primary literature you can read through if you really want to know the gory details, but perhaps this may be enough to guide your intuition.
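One way to see part of this concretely is to count the residual degrees of freedom (samples minus fitted group means) that each design leaves for estimating dispersion. The sketch below uses only base R and hypothetical sample sheets that mirror the two layouts in the question (12 replicates per Diet x Time cell); `make_samples` and `resid_df` are illustrative helper names, not edgeR functions, and this is a rough illustration rather than a description of everything edgeR does internally.

```r
# Hypothetical sample sheets mirroring the two designs in the question;
# these stand in for the real mapping files, which are not available here.
make_samples <- function(diets) {
  s <- expand.grid(rep = 1:12, Time = c("T1", "T2", "T3"), Diet = diets)
  s$group <- factor(paste(s$Diet, s$Time, sep = "."))
  s
}

full_samples <- make_samples(c("ctr", "D1", "D2", "D3"))  # 144 samples, 12 groups
sub_samples  <- make_samples(c("ctr", "D1"))              #  72 samples,  6 groups

# Residual df = number of samples - number of design columns
# (each column of a ~ 0 + group design is one group mean).
resid_df <- function(s) {
  design <- model.matrix(~ 0 + group, data = s)
  nrow(design) - ncol(design)
}

resid_df(full_samples)  # 132
resid_df(sub_samples)   # 66
```

The contrast itself still compares only the D1.T1 and ctr.T1 samples in both cases, but the variability used to judge that difference is estimated from twice as many residual degrees of freedom in the full design, so the estimate is more precise and the test can detect more genes.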
As for my recommendation in this case, since the samples are "similar enough" (conducted in the same experiment, with the same paradigm, same cell type, etc.), I'd keep all of the samples together for the analysis, then just pull out the stats for the contrast of interest.