I have scRNA-seq data generated by Smart-seq2 which includes one plate of wild-type cells and a separate plate of knock-out cells. I would like to combine the data from these two plates in order to investigate any effects caused by the experimental intervention. I understand that plate and genotype are confounded in this design, but can fastMNN still be used to integrate such data? My approach would be to analyse the plates separately then merge them with fastMNN and re-do the dimensionality reduction and clustering so I can perform some comparative analyses. Are there any caveats or limitations to this approach that I should keep in mind?
I fail to understand why this is OK in scRNA-seq. In bulk RNA-seq this would be a big no-no: if the plates/batches are confounded with treatment, you would more or less have to throw away the data. Why is it still OK in scRNA-seq? I must say that I see this "confounded design" frequently in scRNA-seq, where control and treated samples are processed on different plates/batches/days.
Aaron, which chapter of the book are you referring to?
The updated link is https://bioconductor.org/books/release/OSCA.multisample/differential-abundance.html#sacrificing-differences.
Thanks for the link. Surely MNN or any other scRNA-seq integration method would "work", but I still don't understand why it is OK to proceed with downstream analysis. Any ComBat or limma batch correction on the data would remove the treatment effect, so in this confounded case we cannot do any batch correction for downstream analysis. In fact, your book, like most other scRNA-seq integration methods, does not recommend using the corrected expression matrix.
But suppose in this case we use the uncorrected data. It's confounded, so how can we "trust" the results? Are we sure the DE results are due to biological rather than technical effects? Or is the technical batch effect in scRNA-seq much smaller than in bulk RNA-seq?
Or is confounding of plates/batches perhaps less relevant for DA (differential abundance) than for DE (differential expression)?
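To make the confounding concern concrete, here is a minimal sketch in plain Python (not taken from the book or from any package). It builds a toy design matrix with an intercept, a genotype column and a plate column, and checks its rank. The cell lists, the `design_matrix` helper and the "crossed" layout are hypothetical and only for illustration. When every WT cell is on plate 1 and every KO cell is on plate 2, the genotype and plate columns are identical, so a linear model cannot estimate the two effects separately.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(cells):
    """Columns: intercept, genotype (1 = KO), plate (1 = plate 2)."""
    return [[1, int(genotype == "KO"), int(plate == 2)] for genotype, plate in cells]


# Confounded layout, as in the question: all WT on plate 1, all KO on plate 2.
confounded = [("WT", 1), ("WT", 1), ("KO", 2), ("KO", 2)]
# Hypothetical crossed layout: both genotypes present on both plates.
crossed = [("WT", 1), ("KO", 1), ("WT", 2), ("KO", 2)]

confounded_rank = matrix_rank(design_matrix(confounded))
crossed_rank = matrix_rank(design_matrix(crossed))
print(confounded_rank, crossed_rank)  # prints: 2 3
```

A rank of 2 with three columns means the genotype and plate coefficients are not separately estimable. Only the crossed layout gives full rank, and this holds for bulk and single-cell data alike.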
I would like to share my own understanding: in single-cell RNA sequencing (scRNA-seq), batch correction is solely intended to identify the same cell clusters across different batches, and the corrected gene expression values are generally not used for differential analysis.
For differential analysis and other statistical tests, the observations are typically required to be independent of each other. However, the batch correction process in scRNA-seq differs from that in bulk RNA-seq, where the batch is handled by linear modeling or estimated directly as a term in the expression model. In scRNA-seq, batch correction can break the independence of the expression values, so that each cell's values become correlated with those of neighbouring cells (particularly for methods that directly adjust gene expression values). As a result, the corrected gene expression values should not be used for differential analysis or other downstream statistical tests.
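The loss of independence can be illustrated with a toy calculation. This sketch uses a simple moving average as a stand-in for any correction that adjusts a cell using its neighbours. It is not the algorithm used by fastMNN or any other real method, and the helper names are hypothetical. Starting from independent values with unit variance, it computes the exact covariance between two adjacent cells before and after smoothing.

```python
from fractions import Fraction


def smoothing_weights(n_cells, index, half_window=1):
    """Weights expressing one smoothed value as the mean of its neighbours."""
    lo = max(0, index - half_window)
    hi = min(n_cells, index + half_window + 1)
    weights = [Fraction(0)] * n_cells
    for j in range(lo, hi):
        weights[j] = Fraction(1, hi - lo)
    return weights


def covariance(w_a, w_b, variance=1):
    """Cov(sum_i w_a[i]*x_i, sum_i w_b[i]*x_i) for independent x_i of equal variance."""
    return variance * sum(a * b for a, b in zip(w_a, w_b))


n_cells = 6

# Raw values: each cell depends only on itself (unit weight vectors).
raw_2 = [Fraction(int(i == 2)) for i in range(n_cells)]
raw_3 = [Fraction(int(i == 3)) for i in range(n_cells)]
cov_raw = covariance(raw_2, raw_3)

# Smoothed values: cells 2 and 3 now share two of their three inputs.
cov_smoothed = covariance(smoothing_weights(n_cells, 2), smoothing_weights(n_cells, 3))

print(cov_raw, cov_smoothed)  # prints: 0 2/9
```

The raw values are uncorrelated, whereas the smoothed values for adjacent cells have covariance 2/9. A test that assumes independent observations would therefore understate the uncertainty if applied to the smoothed values.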
Regarding batch effects, we can categorize them into two types: technical batches and biological batches. For technical batches, since the FindConservedMarkers() function conducts differential analysis separately for each sample, the process does not impact differential analysis as long as the biological effects are not overshadowed by the technical batches. For biological batches, however, if the biological effects are altered, it can lead to anomalies in differential analysis. Such anomalies cannot be eliminated by batch correction in bulk RNA-seq either; they can only be mitigated through techniques like Bayesian shrinkage of fold changes (FC) and the calculation of p-values.
Additionally, certain downstream analyses, such as pseudotime analysis in Monocle2, incorporate batch correction methods that do not compromise independence. Nevertheless, the prerequisite for accurate analysis is that the correct cell clusters have been obtained through prior batch correction.