subset samples for RNAseq differential analysis
1
0
Entering edit mode
sckinta • 0
@sckinta-13776
Last seen 5.8 years ago

I have a RNAseq experiment with 5 time points and 2 cell type (2*5=10 conditions) and 3 replicates each. The experiment can be further divided to two sub-experiment -- cell type generation (3 time points * 2 cell types) and cell type re-stimulation (2 time points * 2 cell types) -- since we are not interested in comparing cell type generation samples to cell type re-stimulation samples.

cell1   12hr
cell1   24hr
cell1   48hr
cell2   12hr
cell2   24hr
cell2   48hr
cell1   rest0hr
cell1   restim4hr
cell2   rest0hr
cell2   restim4hr

I performed TMM normalization using all 30 samples. However, when it comes to exploratory visualization, it is hard to tell obvious condition-wise clustering on all 30 samples using PCA plot. allPCAplot

Thus, I did PCA independently on sub-experiment (cell type generation, cell type re-stimulation). Cluster becomes a little bit obvious but still hard to see condition-wise clustering. Here is the example of cell type generation experiment PCA. cell type generation PCA plot

Finally, I just did PCA on samples which I want to do pairwise comparison. (eg. cell type comparison at each time points: cell1vscell2 at 12hr, cell1vscell2 at 24hr, ..... And eg. time point comparison within cell type: cell1 restim4hr vs. 0hr, .....). condition clustering become obvious and replicate batch is the major confoundering factor here. Here is the example of cell1vscell2 at 24hr. cell1vscell2 at 24hr PCA

# prepare design
groups <- factor(gsub("_rep[0-9]","",colnames(y$counts)))
groups <- relevel(groups, ref="cell1_12hr")

batch <- factor(str_extract(colnames(y$counts),"rep[0-9]"))
batch <- relevel(batch, ref="rep1")

design <- model.matrix(~batch+groups, data=y$samples)

So for the differential analysis on edgeR, I did similar thing like my exploratory analysis. Using all 30 samples calculate dispersion estimateDisp, fit data glmQLFit and perform test glmQLFTest to get DE in pairwise (eg. cell2vscell1 at 24hr). However, I got no differential genes.

Then, I tried to subset samples based sub-experiment (eg. cell type generation sub-experiment include triplicates from cell112hr, cell212hr, cell124hr, cell224hr, cell148hr, cell248hr), I used those 18 samples for differential genes in pairwise (eg. cell2vscell1 at 24hr). Still, I got no differential genes.

Finally, when I subset samples only for pairwise comparison groups (eg. cell2vscell1 at 24hr includes triplicates from cell124hr and cell224hr), then use those subset samples to recalculate dispersion estimateDisp, fit data glmQLFit and perform test glmQLFTest. Using smaller number of samples, I got smaller dispersion which is only relevant to the samples I want to compare. I am able to get hundreds differential genes.

So my question is : is it valid to subset dataset for differential analysis? Which method (using all samples, subset samples based on experiment, subset only pairwise comparison group samples) generates more reliable results?

edger rnasq limma • 1.6k views
ADD COMMENT
3
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 16 hours ago
The city by the bay

Let's get this out of the way first: more DE genes != correct. From your post, it sounds like you just kept on fiddling with your analysis until you got some DE genes. While it is certainly healthy to explore your data, do not be prejudiced into thinking that the best analysis parameters are those that give you the most DE genes. Indeed, the subsetted analysis has less information to work with and is more dependent on the distributional assumptions of the empirical Bayes shrinkage - it may well be that the analysis on the full data set is actually shielding you from violations of those assumptions, and that the hundreds of DE genes in the subsetted analysis are all false positives!

Given what you already know, my suggestion would be to dig further into your data set to identify the problematic samples or groups, rather than just pretending that you never generated them in the first place. There are a bunch of possible explanations:

  • Your ~batch+groups design assumes that the batch effect is the same for every group. If this is not true, you will have inflated dispersions as the model will not fit correctly.
  • One or more groups have low-quality samples - these should have been apparent from the start, e.g., based on RINs or sequencing coverage, and should have been removed beforehand.
  • The dispersions differ across groups - see related comments here.

I would check this out by running voomWithQualityWeights and seeing if any groups or samples have very low weights. Once you know this, you can make an educated decision about what to do with your analysis. I would be inclined to use voom for the rest of the DE analysis, but if you find a clearly problematic sample or group, you could consider removing it and continuing on with edgeR. However, if you can't find the cause of the difference, I would find it difficult to determine whether "hundreds of DE genes" was any more correct than "no DE genes".

ADD COMMENT
0
Entering edit mode

Hi Aaron, Thanks for your very helpful response!

Yes, I do assume that batch effect is the same for every group. Based on the sample preparation information I got, the libraries for the whole experiment were collected and made at three different time/batches, which correspond to the replicate number indicated in the sample name.

Also, I do have low coverage samples. However, since I have only three replicates for each group, I don't know how I can do DE analysis on 1~2 replicates after removing those low coverage samples (~ just using average fold change to compare groups I guess).

I did voomWithQualityWeights as you suggested, on both batch correct design (~batch+groups) and no batch design (~groups). I have seen great variations on samples weights (sample weights plot). I am not sure how low you would consider very low weights in this case.

I am new to this "sample weight" concept. If I correctly understood the post you pointed me to, the sample.weights from voomWithQualityWeights is to measure the sample variance, which can determine variance heterogeneity across the groups. If I choose to use voom for the rest analysis, should I use v$sample.weights as the weights in lmFit(..., weights=..) for fitting?

ADD REPLY
0
Entering edit mode

Yes, I do assume that batch effect is the same for every group. Based on the sample preparation information I got, the libraries for the whole experiment were collected and made at three different time/batches, which correspond to the replicate number indicated in the sample name.

The experimental design does not make the assumption any more (or less) reasonable. Well, it mitigates the technical component of the batch effect, but biological batch effects are less predictable in terms of additivity.

Also, I do have low coverage samples.

You'll have to be more specific. 2-fold lower? 10-fold lower? 100-fold lower? If it's more than 5-fold lower, I usually assume that something went wrong beyond inaccuracy in the sample pooling.

I am not sure how low you would consider very low weights in this case.

The link doesn't work.

If I choose to use voom for the rest analysis, should I use v$sample.weights as the weights in lmFit(..., weights=..) for fitting?

Just give the voom output to lmFit. It will detect the weights automatically.

ADD REPLY

Login before adding your answer.

Traffic: 575 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6