Reproducibility issue with scran together with lapply and BiocParallel
1
1
Entering edit mode
ATpoint ★ 4.6k
@atpoint-13662
Last seen 23 hours ago
Germany

Hi,

I am running (to me) identical code for some clustering but despite fixed seeds it is inconsistent when using either lapply or bplapply:

Any idea what is happening?

suppressMessages({

  library(BiocParallel)
  library(bluster)
  library(scran)
  library(scuttle)

})

# Mock up some data
set.seed(1)
sce <- logNormCounts(mockSCE())

set.seed(1)
sce <- scater::runPCA(sce)

blusparam <- bluster::SNNGraphParam(k=20, cluster.fun="louvain")

# Get clusters
get_clusters <- function(){

  set.seed(1)
  d <- scran::clusterCells(sce, use.dimred="PCA", BLUSPARAM=blusparam)
  nlevels(factor(d))

}

# lapply 4 clusters, bplapply 5 clusters
la <- lapply(1:10, function(x) get_clusters())
unique(unlist(la))
#> [1] 4

bp <- bplapply(1:10, function(x) get_clusters())
unique(unlist(bp))
#> [1] 5

sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] scuttle_1.12.0              SingleCellExperiment_1.24.0
#>  [3] SummarizedExperiment_1.32.0 Biobase_2.62.0             
#>  [5] GenomicRanges_1.54.1        GenomeInfoDb_1.38.1        
#>  [7] IRanges_2.36.0              S4Vectors_0.40.2           
#>  [9] BiocGenerics_0.48.1         MatrixGenerics_1.14.0      
#> [11] matrixStats_1.1.0           bluster_1.12.0             
#> [13] BiocParallel_1.36.0        
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.0          viridisLite_0.4.2        
#>  [3] dplyr_1.1.4               vipor_0.4.5              
#>  [5] viridis_0.6.4             bitops_1.0-7             
#>  [7] fastmap_1.1.1             RCurl_1.98-1.13          
#>  [9] reprex_2.0.2              digest_0.6.33            
#> [11] rsvd_1.0.5                lifecycle_1.0.4          
#> [13] cluster_2.1.4             statmod_1.5.0            
#> [15] magrittr_2.0.3            compiler_4.3.2           
#> [17] rlang_1.1.2               tools_4.3.2              
#> [19] igraph_1.5.1              utf8_1.2.4               
#> [21] yaml_2.3.7                knitr_1.45               
#> [23] S4Arrays_1.2.0            dqrng_0.3.2              
#> [25] DelayedArray_0.28.0       abind_1.4-5              
#> [27] withr_2.5.2               grid_4.3.2               
#> [29] fansi_1.0.5               beachmat_2.18.0          
#> [31] colorspace_2.1-0          edgeR_4.0.2              
#> [33] ggplot2_3.4.4             scales_1.3.0             
#> [35] cli_3.6.1                 rmarkdown_2.25           
#> [37] crayon_1.5.2              generics_0.1.3           
#> [39] metapod_1.10.0            rstudioapi_0.15.0        
#> [41] DelayedMatrixStats_1.24.0 ggbeeswarm_0.7.2         
#> [43] zlibbioc_1.48.0           parallel_4.3.2           
#> [45] XVector_0.42.0            vctrs_0.6.5              
#> [47] Matrix_1.6-4              BiocSingular_1.18.0      
#> [49] BiocNeighbors_1.20.0      ggrepel_0.9.4            
#> [51] irlba_2.3.5.1             beeswarm_0.4.0           
#> [53] scater_1.30.1             locfit_1.5-9.8           
#> [55] limma_3.58.1              glue_1.6.2               
#> [57] codetools_0.2-19          gtable_0.3.4             
#> [59] ScaledMatrix_1.10.0       munsell_0.5.0            
#> [61] tibble_3.2.1              pillar_1.9.0             
#> [63] htmltools_0.5.7           GenomeInfoDbData_1.2.11  
#> [65] R6_2.5.1                  sparseMatrixStats_1.14.0 
#> [67] evaluate_0.23             lattice_0.22-5           
#> [69] scran_1.30.0              Rcpp_1.0.11              
#> [71] gridExtra_2.3             SparseArray_1.2.2        
#> [73] xfun_0.41                 fs_1.6.3                 
#> [75] pkgconfig_2.0.3
Created on 2024-04-24 with reprex v2.0.2
scran BiocParallel bluster • 423 views
ADD COMMENT
0
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 12 hours ago
The city by the bay

IIRC BiocParallel switches to a different RNG when inside its own functions like bplapply - specifically L'Ecuyer-CMRG, as opposed to the usual Mersenne Twister. This is necessary for a few reasons - it ensures that Monte Carlo simulations distributed across threads actually use different random streams of random numbers, and it guarantees the same result regardless of the parallelization scheme (i.e., choice of BPPARAM). I suspect you will eliminate the discrepancy in your example if you replace lapply() with bplapply(..., BPPARAM=SerialParam()).

bluster isn't much involved in this phenomenon, other than the fact that igraph::cluster_louvain and related methods have a random component. I considered adding a default random seed in the SNNGraphParam() arguments to hide this stochasticity from the user, but that would be misleading as well as inconsistent with other algorithms like KmeansParam().

ADD COMMENT
0
Entering edit mode

Thanks Aaron for the response. RNGseed (for me) has no effect. I actually stumbled over this in one of my analysis where I have a constant bpparam definition that always has a fixed seed to exactly avoid these sorts of things. If you run this below the issue persists:

la <- lapply(1:10, function(x) get_clusters())
unique(unlist(la))
>4

bp <- bplapply(1:10, BPPARAM=SerialParam(RNGseed = 1), function(x) get_clusters())
unique(unlist(bp))
>5
ADD REPLY

Login before adding your answer.

Traffic: 862 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6