I'm using a fairly standard scRNA-seq data analysis workflow, inspired by the OSCA pipeline (i.e., using scuttle
, scater
, scran
among others).
As the dataset is fairly big (>100'000 cells), I'm using BiocParalell
to speed up computation.
Now, the problem I encounter is that every time I rerun my code (typically as part of knitting an .Rmd file), I get slightly different results, more specifically:
- The UMAP projection looks different
- The number of clusters and the cell-to-cluster assignments are different.
This is despite setting the seed at the very beginning (bp
is what I pass to all the functions that have a BPPARAM
argument):
set.seed(16)
bp <- BiocParallel::MulticoreParam(workers = 8, RNGseed = 16)
The differences between runs are not major, but, as an example, they prevent me to reliably map cluster numbers to cell types after running SingleR
(because, say, cluster 7 might indicate completely unrelated cells in different runs).
I'm guessing this might be due to some stochastic component of the UMAP and Louvain clustering algorithms, though I would have thought setting the seed was enough. Interestingly, I can't quite reproduce this on a small, toy dataset, possibly because the algorithms converge more easily and/or in less time.
How can I ensure reproducibility of dimensionality reduction and clustering in SingleCellExperiment
workflows using BiocParallel
?
Thank you.
The lack of reproducibility has been true in the past, but I believe under the (just released) BiocParallel 1.28.0 setting
RNGseed =
will make the results reproducible, including across workers and 'back-ends'. (Unless the author of a package has subverted this, perhaps as a legitimate attempt to 'correct' the misbehavior of previous BiocParallel). A new vignette describes random number behavior in detail.Thank you both Martin Morgan and Aaron Lun! I can confirm upgrading to Bioconductor 3.14 and BiocParallel 1.28.0 now gives me reproducible results across runs. Hooray!