I have a RNAseq experiment with 5 time points and 2 cell type (2*5=10 conditions) and 3 replicates each. The experiment can be further divided to two sub-experiment -- cell type generation (3 time points * 2 cell types) and cell type re-stimulation (2 time points * 2 cell types) -- since we are not interested in comparing cell type generation samples to cell type re-stimulation samples.
cell1 12hr
cell1 24hr
cell1 48hr
cell2 12hr
cell2 24hr
cell2 48hr
cell1 rest0hr
cell1 restim4hr
cell2 rest0hr
cell2 restim4hr
I performed TMM normalization using all 30 samples. However, when it comes to exploratory visualization, it is hard to tell obvious condition-wise clustering on all 30 samples using PCA plot. allPCAplot
Thus, I did PCA independently on sub-experiment (cell type generation, cell type re-stimulation). Cluster becomes a little bit obvious but still hard to see condition-wise clustering. Here is the example of cell type generation experiment PCA. cell type generation PCA plot
Finally, I just did PCA on samples which I want to do pairwise comparison. (eg. cell type comparison at each time points: cell1vscell2 at 12hr, cell1vscell2 at 24hr, ..... And eg. time point comparison within cell type: cell1 restim4hr vs. 0hr, .....). condition clustering become obvious and replicate batch is the major confoundering factor here. Here is the example of cell1vscell2 at 24hr. cell1vscell2 at 24hr PCA
# prepare design
groups <- factor(gsub("_rep[0-9]","",colnames(y$counts)))
groups <- relevel(groups, ref="cell1_12hr")
batch <- factor(str_extract(colnames(y$counts),"rep[0-9]"))
batch <- relevel(batch, ref="rep1")
design <- model.matrix(~batch+groups, data=y$samples)
So for the differential analysis on edgeR, I did similar thing like my exploratory analysis. Using all 30 samples calculate dispersion estimateDisp
, fit data glmQLFit
and perform test glmQLFTest
to get DE in pairwise (eg. cell2vscell1 at 24hr). However, I got no differential genes.
Then, I tried to subset samples based sub-experiment (eg. cell type generation sub-experiment include triplicates from cell112hr, cell212hr, cell124hr, cell224hr, cell148hr, cell248hr), I used those 18 samples for differential genes in pairwise (eg. cell2vscell1 at 24hr). Still, I got no differential genes.
Finally, when I subset samples only for pairwise comparison groups (eg. cell2vscell1 at 24hr includes triplicates from cell124hr and cell224hr), then use those subset samples to recalculate dispersion estimateDisp
, fit data glmQLFit
and perform test glmQLFTest
. Using smaller number of samples, I got smaller dispersion which is only relevant to the samples I want to compare. I am able to get hundreds differential genes.
So my question is : is it valid to subset dataset for differential analysis? Which method (using all samples, subset samples based on experiment, subset only pairwise comparison group samples) generates more reliable results?
Hi Aaron, Thanks for your very helpful response!
Yes, I do assume that batch effect is the same for every group. Based on the sample preparation information I got, the libraries for the whole experiment were collected and made at three different time/batches, which correspond to the replicate number indicated in the sample name.
Also, I do have low coverage samples. However, since I have only three replicates for each group, I don't know how I can do DE analysis on 1~2 replicates after removing those low coverage samples (~ just using average fold change to compare groups I guess).
I did voomWithQualityWeights as you suggested, on both batch correct design (~batch+groups) and no batch design (~groups). I have seen great variations on samples weights (sample weights plot). I am not sure how low you would consider very low weights in this case.
I am new to this "sample weight" concept. If I correctly understood the post you pointed me to, the
sample.weights
fromvoomWithQualityWeights
is to measure the sample variance, which can determine variance heterogeneity across the groups. If I choose to use voom for the rest analysis, should I usev$sample.weights
as the weights inlmFit(..., weights=..)
for fitting?The experimental design does not make the assumption any more (or less) reasonable. Well, it mitigates the technical component of the batch effect, but biological batch effects are less predictable in terms of additivity.
You'll have to be more specific. 2-fold lower? 10-fold lower? 100-fold lower? If it's more than 5-fold lower, I usually assume that something went wrong beyond inaccuracy in the sample pooling.
The link doesn't work.
Just give the
voom
output tolmFit
. It will detect the weights automatically.