I'm experimenting with Limma to perform differential expression of Pseudo-bulk'd single cell gene expression data. Workflow roughly as follows:
- Assign clusters / inferred cell types from a case / control experiment with 6 biological replicates
- Sum counts by cluster + condition + donor
- Paired Design Matrix (Primary Term is a concatenation of Cluster_Condition)
- Remove any samples with particularly high scaling factors
- Filter by Expression
- Fit model + topTable
I've tried two different approaches here, and got some quite different results. I've got an idea of what the right thing to do is, but any comments from the community are very much welcome to understand what's going on under the hood.
Approach 1 - Use all cluster / inferred cell types as one big experiment, derive a contrast matrix to test case / control for each cluster
Approach 2 - Create a new model for each cluster
Approach 2 seems the right thing to do given the results I see with approach 1. Approach 1 has in some tests given me differentially expressed genes where for both conditions being tested, the expression of those genes is zero. I suspect this is likely an artefact of having highly heterogeneous samples, in that there will be some with zero gene expression, some with very high, and that's between sample variability rather than between condition (case / control).
My only hesitation is that Approach 1 in some cases appears (artificially or not), more powerful in some core comparisons where I have a good idea of the biology at play. These same comparisons in Approach 2 barely reach statistical significance. I'd like to understand more what's going on here and if there's any way I could improve Approach 1.
Thanks
You might find the DE analysis section of the OSCA book helpful. In particular, Aaron’s reasoning for fitting separate models - “We do not use all labels for GLM fitting as the strong DE between labels makes it difficult to compute a sensible average abundance to model the mean-dispersion trend. Moreover, label-specific batch effects would not be easily handled with a single additive term in the design matrix for the batch.”