Question

SingleR with multiple single cell references

maltethodberg ▴ 180

I'm trying to annotate cell types in my single cell dataset (so) using 3 different single cell references (ref1, ref2, ref3). I've manually harmonized the cell type labels in each of the 3 references, so they use the exact same labels with each label appearing in at least 2 references.

Following the SingleR book, I would like to predict cell types both by combining results from each reference (https://bioconductor.org/books/release/SingleRBook/using-multiple-references.html#combining-inferences-from-individual-references) and by sharing information between references during marker detection (https://bioconductor.org/books/release/SingleRBook/using-multiple-references.html#combining-inferences-from-individual-references).

I can use each reference separately like so:

l <- SingleR::SingleR(test=GetAssayData(so, assay="RNA", layer="counts"), # so is a Seurat object
                      ref = ref1, # each reference is log normalized
                      labels = ref1$Celltype,
                      de.method = "wilcox",
                      de.n = 50,
                      aggr.ref = TRUE,
                      fine.tune=TRUE,
                      BPPARAM = BiocParallel::MulticoreParam(60))

However, running 2 or 3 references together doesn't produce an error, but takes much, much longer than the combined run time of the individual references:

l <- SingleR::SingleR(test=GetAssayData(so, assay="RNA", layer="counts"),
                      ref = list(Ref1=ref1, Ref2=ref2, Ref3=ref3),
                      labels = list(Ref1=ref1$Celltype, Ref2=ref2$Celltype, Ref3=ref3$Celltype),
                      de.method = "wilcox",
                      de.n = 50,
                      aggr.ref = TRUE,
                      fine.tune=TRUE,
                      BPPARAM = BiocParallel::MulticoreParam(60))

It runs for so long I haven't been able to determine if it will eventually produce an error. Is it necessary to piece together the individual steps manually (e.g. aggregateReference, trainSingleR, classifySingleR, combineRecomputedResults)?

For the second approach (sharing information during marker detection), the SingleR book points to the getClassicMarkers function. IIRC this isn't appropriate for single cell data. Is there an equivalent approach for multiple single cell references using scran (pairwiseWilcox)?

Is it possible to use SingleR with multiple single cell references?

score 2 · Accepted Answer · 2025-02-16

Is it necessary to piece together the individual steps manually (e.g. aggregateReference, trainSingleR, classifySingleR, combineRecomputedResults)?

Your combined SingleR() call looks fine to me. I don't specifically know why it takes so long, though combineRecomputedResults() is quite computationally intensive.

I assume you're using the latest release version in BioC 3.20, where combineRecomputedResults() now has a fine.tune=TRUE default option that performs fine-tuning to improve the accuracy of the combined calls. This effectively mirrors the fine-tuning behavior in a single-reference context but is, of course, even more expensive than it was before.

Currently, there's no option in SingleR() to turn off combined fine-tuning without also turning off single-reference fine-tuning. I doubt it's the cause of the slow performance, as the fine-tuning should increase runtime by 3-fold at most if you have 3 references. Nonetheless, I just added a fine.tune.combined= option to the BioC-devel version (also on GitHub) that you can disable to see if it gets any better.
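For illustration, a minimal sketch of how that option might be used. The wrapper name is hypothetical, and it assumes the installed SingleR may or may not be recent enough to have fine.tune.combined=, so it checks the function's formals first rather than assuming the argument exists:

```r
library(SingleR)

# Hypothetical wrapper: keep per-reference fine-tuning on, but switch off the
# combined fine-tuning step when the installed SingleR supports it.
# `refs` and `labels` are lists of equal length, as in the question.
singler_no_combined_tuning <- function(test, refs, labels, ...) {
    args <- list(test = test, ref = refs, labels = labels, fine.tune = TRUE, ...)
    if ("fine.tune.combined" %in% names(formals(SingleR))) {
        args$fine.tune.combined <- FALSE
    } else {
        warning("this SingleR version has no fine.tune.combined=; using defaults")
    }
    do.call(SingleR, args)
}

# Usage with the objects from the question:
# l <- singler_no_combined_tuning(
#     GetAssayData(so, assay = "RNA", layer = "counts"),
#     refs = list(Ref1 = ref1, Ref2 = ref2, Ref3 = ref3),
#     labels = list(Ref1 = ref1$Celltype, Ref2 = ref2$Celltype, Ref3 = ref3$Celltype),
#     de.method = "wilcox", de.n = 50, aggr.ref = TRUE)
```

If the run time barely changes with the combined fine-tuning disabled, the bottleneck is presumably elsewhere.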

As usual, it's worth confirming that you don't have any other bottlenecks in your system (e.g., going into swap). I'd make sure that it works on some smaller datasets first.
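One way to check this is to run the individual steps on a random subset of cells and time each stage separately. The following is only a sketch: the helper name and the subset size are made up, and it assumes a SingleR version (2.8 or later, I believe) where trainSingleR() accepts test.genes=.

```r
library(SingleR)

# Hypothetical helper: run the multi-reference workflow stage by stage on a
# random subset of test cells and return the elapsed seconds per stage.
# `refs` and `labels` are lists of equal length; extra arguments in `...`
# (e.g., de.method=, de.n=, aggr.ref=) are passed on to trainSingleR().
time_stages <- function(test, refs, labels, n.cells = 1000, ...) {
    keep <- sample(ncol(test), min(n.cells, ncol(test)))
    sub <- test[, keep, drop = FALSE]

    # Marker detection and index building, once per reference.
    t.train <- system.time(
        trained <- Map(function(r, l) {
            trainSingleR(r, labels = l, test.genes = rownames(sub), ...)
        }, refs, labels)
    )
    # Per-reference classification with fine-tuning.
    t.classify <- system.time(
        results <- lapply(trained, function(tr) {
            classifySingleR(sub, tr, fine.tune = TRUE)
        })
    )
    # Recomputation of scores across references.
    t.combine <- system.time(
        combined <- combineRecomputedResults(results, test = sub, trained = trained)
    )

    list(
        seconds = c(train = t.train[["elapsed"]],
                    classify = t.classify[["elapsed"]],
                    combine = t.combine[["elapsed"]]),
        combined = combined
    )
}

# Usage with the objects from the question:
# out <- time_stages(GetAssayData(so, assay = "RNA", layer = "counts"),
#     refs = list(Ref1 = ref1, Ref2 = ref2, Ref3 = ref3),
#     labels = list(ref1$Celltype, ref2$Celltype, ref3$Celltype),
#     de.method = "wilcox", de.n = 50, aggr.ref = TRUE)
# out$seconds
```

Increasing n.cells over a few runs should show which stage scales badly, and whether memory use climbs towards swap along the way.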

Is there an equivalent approach for multiple single cell references using scran (pairwiseWilcox)?

You can cbind all your references together and then use block= in various scran functions. This assumes that the labels are consistently named across references, which is usually the most tedious and painful part of the process.

To be honest, if you have consistent labels across references, there's no need to treat them as separate references; just cbind them together, set block= in the de.args=, and avoid the need to call combineRecomputedResults() at all.
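A rough sketch of what that could look like. The helper names are hypothetical, and it assumes each reference is a SummarizedExperiment-like object with a "logcounts" assay and the harmonized labels in a Celltype column. It stacks the assay matrices directly rather than calling cbind() on the objects, since the latter can fail when the row or column metadata differ between references.

```r
library(SummarizedExperiment)
library(SingleR)

# Hypothetical helper: stack references on their shared genes and record the
# reference of origin for each cell, to be used as the blocking factor.
# `refs` is a named list of SummarizedExperiment-like objects.
combine_refs <- function(refs, label.col = "Celltype") {
    common <- Reduce(intersect, lapply(refs, rownames))
    mats <- lapply(refs, function(r) assay(r, "logcounts")[common, , drop = FALSE])
    labels <- lapply(refs, function(r) as.character(colData(r)[[label.col]]))
    src <- rep(names(refs), vapply(refs, ncol, integer(1)))

    m <- do.call(cbind, unname(mats))
    # Replace cell names to avoid clashes between references.
    colnames(m) <- paste(src, seq_len(ncol(m)), sep = "_")

    SummarizedExperiment(
        assays = list(logcounts = m),
        colData = DataFrame(Celltype = unlist(labels, use.names = FALSE), Source = src)
    )
}

# Single-reference call on the stacked object; de.args= forwards block= to the
# DE function so that comparisons are made within each original reference.
annotate_blocked <- function(test, combined, ...) {
    SingleR(test = test, ref = combined, labels = combined$Celltype,
        de.method = "wilcox", de.n = 50,
        de.args = list(block = combined$Source),
        aggr.ref = TRUE, fine.tune = TRUE, ...)
}

# Usage with the objects from the question:
# combined <- combine_refs(list(Ref1 = ref1, Ref2 = ref2, Ref3 = ref3))
# l <- annotate_blocked(GetAssayData(so, assay = "RNA", layer = "counts"), combined,
#     BPPARAM = BiocParallel::MulticoreParam(60))
```

The same Source factor could be passed as block= to scran::pairwiseWilcox() directly if custom marker lists are wanted. Keep in mind that blocked comparisons are only made within a reference, so a pair of labels that never appears together in the same reference has no direct comparison.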