Question

Ensuring reproducibility in BiocParallel SingleCellExperiment workflows

1

enricoferrero ▴ 680

@enricoferrero-6037

Last seen 3.4 years ago

Switzerland

I'm using a fairly standard scRNA-seq data analysis workflow, inspired by the OSCA pipeline (i.e., using scuttle, scater, and scran, among others). As the dataset is fairly big (>100'000 cells), I'm using BiocParallel to speed up computation.

Now, the problem I encounter is that every time I rerun my code (typically as part of knitting an .Rmd file), I get slightly different results, more specifically:

The UMAP projection looks different.
The number of clusters and the cell-to-cluster assignments are different.

This is despite setting the seed at the very beginning (bp is what I pass to all the functions that have a BPPARAM argument):

set.seed(16)
bp <- BiocParallel::MulticoreParam(workers = 8, RNGseed = 16)

The differences between runs are not major, but, as an example, they prevent me from reliably mapping cluster numbers to cell types after running SingleR (because, say, cluster 7 might indicate completely unrelated cells in different runs).

I'm guessing this might be due to some stochastic component of the UMAP and Louvain clustering algorithms, though I would have thought setting the seed was enough. Interestingly, I can't quite reproduce this on a small, toy dataset, possibly because the algorithms converge more easily and/or in less time.

How can I ensure reproducibility of dimensionality reduction and clustering in SingleCellExperiment workflows using BiocParallel? Thank you.

scater SingleCellExperiment BiocParallel scuttle scran • 2.2k views

ADD COMMENT • link 3.5 years ago enricoferrero ▴ 680

score 1 · Answer 1 · 2021-10-25

1

Aaron Lun ★ 28k

@alun

Last seen 2 hours ago

The city by the bay

I don't think that setting RNGseed inside MulticoreParam does what anyone really expects. See the discussion in https://github.com/Bioconductor/BiocParallel/pull/140.

More generally, it would help if you could identify the offending function. I would guess that it is the PCA if it affects both the UMAP and the clustering.
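One way to narrow down the offending function is to run each step twice under the same seed and compare the two results. The helper below is only a sketch: `is_reproducible` is a hypothetical name, not part of any package, and it assumes the step's output can be compared with `all.equal()`.

```r
# Hypothetical helper (not part of any package): run a step twice under the
# same seed and report whether the two results agree.
is_reproducible <- function(step, seed = 16) {
  set.seed(seed)
  first <- step()
  set.seed(seed)
  second <- step()
  isTRUE(all.equal(first, second))
}

# Sanity check on a step that depends only on the global seed.
is_reproducible(function() rnorm(10))
```

Each stage of the workflow can then be wrapped in turn, for example `is_reproducible(function() reducedDim(scater::runPCA(sce, BPPARAM = bp), "PCA"))`, where `sce` and `bp` stand for your own SingleCellExperiment and BiocParallel parameter objects. The first stage that returns FALSE is the likely source of the run-to-run differences.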

ADD COMMENT • link 3.5 years ago Aaron Lun ★ 28k

1

The lack of reproducibility was real in the past, but I believe that under the (just released) BiocParallel 1.28.0, setting RNGseed = will make the results reproducible, including across workers and 'back-ends' (unless the author of a package has subverted this, perhaps as a legitimate attempt to 'correct' the misbehavior of previous BiocParallel versions). A new vignette describes random number behavior in detail.
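A minimal sketch of what this looks like in practice, assuming BiocParallel 1.28.0 or later is installed: the same RNGseed is passed to two different back-ends and the random draws are compared. The seed value and the choice of `SerialParam` and `SnowParam` are just for illustration.

```r
library(BiocParallel)

# This behavior is only expected from BiocParallel 1.28.0 onwards.
stopifnot(packageVersion("BiocParallel") >= "1.28.0")

# Same RNGseed, two different back-ends.
serial <- bplapply(1:3, runif, BPPARAM = SerialParam(RNGseed = 16))
snow   <- bplapply(1:3, runif, BPPARAM = SnowParam(workers = 2, RNGseed = 16))

# Compare the random numbers drawn under each back-end.
identical(serial, snow)
```

If the comparison comes back FALSE, checking the installed BiocParallel version is a reasonable first step before looking at individual packages in the workflow.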

ADD REPLY • link 3.5 years ago Martin Morgan 25k

0

Thank you both Martin Morgan and Aaron Lun! I can confirm upgrading to Bioconductor 3.14 and BiocParallel 1.28.0 now gives me reproducible results across runs. Hooray!

ADD REPLY • link 3.5 years ago enricoferrero ▴ 680