Question

subset samples for RNAseq differential analysis

0

Entering edit mode

sckinta • 0

@sckinta-13776

Last seen 6.1 years ago

I have a RNAseq experiment with 5 time points and 2 cell type (2*5=10 conditions) and 3 replicates each. The experiment can be further divided to two sub-experiment -- cell type generation (3 time points * 2 cell types) and cell type re-stimulation (2 time points * 2 cell types) -- since we are not interested in comparing cell type generation samples to cell type re-stimulation samples.

cell1   12hr
cell1   24hr
cell1   48hr
cell2   12hr
cell2   24hr
cell2   48hr
cell1   rest0hr
cell1   restim4hr
cell2   rest0hr
cell2   restim4hr

I performed TMM normalization using all 30 samples. However, when it comes to exploratory visualization, it is hard to tell obvious condition-wise clustering on all 30 samples using PCA plot. allPCAplot

Thus, I did PCA independently on sub-experiment (cell type generation, cell type re-stimulation). Cluster becomes a little bit obvious but still hard to see condition-wise clustering. Here is the example of cell type generation experiment PCA. cell type generation PCA plot

Finally, I just did PCA on samples which I want to do pairwise comparison. (eg. cell type comparison at each time points: cell1vscell2 at 12hr, cell1vscell2 at 24hr, ..... And eg. time point comparison within cell type: cell1 restim4hr vs. 0hr, .....). condition clustering become obvious and replicate batch is the major confoundering factor here. Here is the example of cell1vscell2 at 24hr. cell1vscell2 at 24hr PCA

# prepare design
groups <- factor(gsub("_rep[0-9]","",colnames(y$counts)))
groups <- relevel(groups, ref="cell1_12hr")

batch <- factor(str_extract(colnames(y$counts),"rep[0-9]"))
batch <- relevel(batch, ref="rep1")

design <- model.matrix(~batch+groups, data=y$samples)

So for the differential analysis on edgeR, I did similar thing like my exploratory analysis. Using all 30 samples calculate dispersion estimateDisp, fit data glmQLFit and perform test glmQLFTest to get DE in pairwise (eg. cell2vscell1 at 24hr). However, I got no differential genes.

Then, I tried to subset samples based sub-experiment (eg. cell type generation sub-experiment include triplicates from cell112hr, cell212hr, cell124hr, cell224hr, cell148hr, cell248hr), I used those 18 samples for differential genes in pairwise (eg. cell2vscell1 at 24hr). Still, I got no differential genes.

Finally, when I subset samples only for pairwise comparison groups (eg. cell2vscell1 at 24hr includes triplicates from cell124hr and cell224hr), then use those subset samples to recalculate dispersion estimateDisp, fit data glmQLFit and perform test glmQLFTest. Using smaller number of samples, I got smaller dispersion which is only relevant to the samples I want to compare. I am able to get hundreds differential genes.

So my question is : is it valid to subset dataset for differential analysis? Which method (using all samples, subset samples based on experiment, subset only pairwise comparison group samples) generates more reliable results?

edger rnasq limma • 1.8k views

ADD COMMENT • link updated 6.1 years ago by Aaron Lun ★ 28k • written 6.1 years ago by sckinta • 0

score 3 · Answer 1 · 2019-03-17

Let's get this out of the way first: more DE genes != correct. From your post, it sounds like you just kept on fiddling with your analysis until you got some DE genes. While it is certainly healthy to explore your data, do not be prejudiced into thinking that the best analysis parameters are those that give you the most DE genes. Indeed, the subsetted analysis has less information to work with and is more dependent on the distributional assumptions of the empirical Bayes shrinkage - it may well be that the analysis on the full data set is actually shielding you from violations of those assumptions, and that the hundreds of DE genes in the subsetted analysis are all false positives!

Given what you already know, my suggestion would be to dig further into your data set to identify the problematic samples or groups, rather than just pretending that you never generated them in the first place. There are a bunch of possible explanations:

Your ~batch+groups design assumes that the batch effect is the same for every group. If this is not true, you will have inflated dispersions as the model will not fit correctly.
One or more groups have low-quality samples - these should have been apparent from the start, e.g., based on RINs or sequencing coverage, and should have been removed beforehand.
The dispersions differ across groups - see related comments here.

I would check this out by running voomWithQualityWeights and seeing if any groups or samples have very low weights. Once you know this, you can make an educated decision about what to do with your analysis. I would be inclined to use voom for the rest of the DE analysis, but if you find a clearly problematic sample or group, you could consider removing it and continuing on with edgeR. However, if you can't find the cause of the difference, I would find it difficult to determine whether "hundreds of DE genes" was any more correct than "no DE genes".