Question

scuttle::perFeatureQCMetrics runs very slow for big scRNA-seq data

1

Entering edit mode

Adam ▴ 10

@f2e87e45

Last seen 6 weeks ago

United States

Hi,

I was processing a scRNA-seq data which has about 1.3M cells with around 53000 genes. When I tried the function perFeatureQCMetrics() to get some gene metrics for filtering genes, it ran over 24 hours and still not finished even with parallelization. I preferred to use all cells with bioconductor packages and not to apply Sketch-based analysis in Seurat. Is there a way to speed it up? Thanks a lot!

Best, JG


# Here is the code I used .

bp.params <- MulticoreParam(workers = 26)
per.feat <- perFeatureQCMetrics(sce, BPPARAM = bp.params)



sessionInfo( )
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] Azimuth_0.5.0               shinyBS_0.61.1
 [3] patchwork_1.3.0             lubridate_1.9.3
 [5] forcats_1.0.0               stringr_1.5.1
 [7] dplyr_1.1.4                 purrr_1.0.2
 [9] readr_2.1.5                 tidyr_1.3.1
[11] tibble_3.2.1                tidyverse_2.0.0
[13] AnnotationHub_3.12.0        BiocFileCache_2.10.2
[15] dbplyr_2.5.0                PCAtools_2.16.0
[17] ggrepel_0.9.6               scran_1.32.0
[19] BiocParallel_1.38.0         DropletUtils_1.24.0
[21] scater_1.32.1               ggplot2_3.5.1
[23] scuttle_1.14.0              SingleCellExperiment_1.26.0
[25] SummarizedExperiment_1.34.0 Biobase_2.64.0
[27] GenomicRanges_1.56.2        GenomeInfoDb_1.40.1
[29] IRanges_2.38.1              S4Vectors_0.42.1
[31] BiocGenerics_0.50.0         MatrixGenerics_1.16.0
[33] matrixStats_1.4.1

loaded via a namespace (and not attached):
  [1] fs_1.6.4                          spatstat.sparse_3.1-0
  [3] ProtGenerics_1.34.0               bitops_1.0-9
  [5] DirichletMultinomial_1.44.0       TFBSTools_1.42.0
  [7] httr_1.4.7                        RColorBrewer_1.1-3
  [9] sctransform_0.4.1                 tools_4.4.1
 [11] utf8_1.2.4                        R6_2.5.1
 [13] DT_0.33                           HDF5Array_1.32.1
 [15] uwot_0.2.2                        lazyeval_0.2.2
 [17] rhdf5filters_1.16.0               withr_3.0.1
 [19] sp_2.1-4                          prettyunits_1.2.0
 [21] gridExtra_2.3                     progressr_0.14.0
 [23] cli_3.6.3                         spatstat.explore_3.3-2
 [25] fastDummies_1.7.4                 EnsDb.Hsapiens.v86_2.99.0
 [27] shinyjs_2.1.0                     Seurat_5.1.0
 [29] spatstat.data_3.1-2               ggridges_0.5.6
 [31] pbapply_1.7-2                     Rsamtools_2.18.0
 [33] R.utils_2.12.3                    parallelly_1.38.0
 [35] BSgenome_1.70.2                   limma_3.60.6
 [37] RSQLite_2.3.7                     generics_0.1.3
 [39] BiocIO_1.12.0                     gtools_3.9.5
 [41] spatstat.random_3.3-2             ica_1.0-3
 [43] googlesheets4_1.1.1               GO.db_3.18.0
 [45] Matrix_1.7-1                      ggbeeswarm_0.7.2
 [47] fansi_1.0.6                       abind_1.4-8
 [49] R.methodsS3_1.8.2                 lifecycle_1.0.4
 [51] yaml_2.3.10                       edgeR_4.2.2
 [53] rhdf5_2.48.0                      SparseArray_1.4.8
 [55] Rtsne_0.17                        grid_4.4.1
 [57] blob_1.2.4                        promises_1.3.0
 [59] dqrng_0.4.1                       shinydashboard_0.7.2
 [61] pwalign_1.0.0                     crayon_1.5.3
 [63] miniUI_0.1.1.1                    lattice_0.22-6
 [65] beachmat_2.20.0                   cowplot_1.1.3
 [67] annotate_1.82.0                   GenomicFeatures_1.54.4
 [69] KEGGREST_1.44.1                   pillar_1.9.0
 [71] metapod_1.12.0                    rjson_0.2.23
 [73] future.apply_1.11.2               codetools_0.2-20
 [75] fastmatch_1.1-4                   leiden_0.4.3.1
 [77] glue_1.8.0                        spatstat.univar_3.0-0
 [79] data.table_1.16.2                 remotes_2.5.0
 [81] vctrs_0.6.5                       png_0.1-8
 [83] spam_2.11-0                       cellranger_1.1.0
 [85] poweRlaw_0.80.0                   gtable_0.3.5
 [87] cachem_1.1.0                      Signac_1.14.0
 [89] S4Arrays_1.4.1                    mime_0.12
 [91] pracma_2.4.4                      survival_3.7-0
 [93] gargle_1.5.2                      RcppRoll_0.3.1
 [95] statmod_1.5.0                     bluster_1.14.0
 [97] fitdistrplus_1.2-1                ROCR_1.0-11
 [99] nlme_3.1-165                      bit64_4.5.2
[101] progress_1.2.3                    filelock_1.0.3
[103] RcppAnnoy_0.0.22                  irlba_2.3.5.1
[105] vipor_0.4.7                       KernSmooth_2.23-24
[107] seqLogo_1.68.0                    SeuratDisk_0.0.0.9021
[109] colorspace_2.1-1                  DBI_1.2.3
[111] tidyselect_1.2.1                  processx_3.8.4
[113] bit_4.5.0                         compiler_4.4.1
[115] curl_5.2.3                        BiocNeighbors_1.22.0
[117] hdf5r_1.3.11                      xml2_1.3.6
[119] desc_1.4.3                        DelayedArray_0.30.1
[121] plotly_4.10.4                     rtracklayer_1.62.0
[123] caTools_1.18.3                    scales_1.3.0
[125] lmtest_0.9-40                     callr_3.7.6
[127] rappdirs_0.3.3                    goftest_1.2-3
[129] digest_0.6.37                     presto_1.0.0
[131] spatstat.utils_3.1-0              XVector_0.44.0
[133] htmltools_0.5.8.1                 pkgconfig_2.0.3
[135] sparseMatrixStats_1.16.0          fastmap_1.2.0
[137] ensembldb_2.26.0                  rlang_1.1.4
[139] htmlwidgets_1.6.4                 UCSC.utils_1.0.0
[141] shiny_1.9.1                       DelayedMatrixStats_1.26.0
[143] farver_2.1.2                      zoo_1.8-12
[145] jsonlite_1.8.9                    R.oo_1.26.0
[147] BiocSingular_1.20.0               RCurl_1.98-1.16
[149] magrittr_2.0.3                    GenomeInfoDbData_1.2.12
[151] dotCall64_1.2                     Rhdf5lib_1.26.0
[153] munsell_0.5.1                     Rcpp_1.0.13
[155] viridis_0.6.5                     reticulate_1.39.0
[157] stringi_1.8.4                     zlibbioc_1.50.0
[159] MASS_7.3-61                       plyr_1.8.9
[161] pkgbuild_1.4.4                    parallel_4.4.1
[163] listenv_0.9.1                     CNEr_1.38.0
[165] deldir_2.0-4                      Biostrings_2.72.1
[167] splines_4.4.1                     tensor_1.5
[169] hms_1.1.3                         locfit_1.5-9.10
[171] BSgenome.Hsapiens.UCSC.hg38_1.4.5 ps_1.7.6
[173] igraph_2.1.1                      spatstat.geom_3.3-3
[175] RcppHNSW_0.6.0                    reshape2_1.4.4
[177] biomaRt_2.58.2                    ScaledMatrix_1.12.0
[179] TFMPvalue_0.0.9                   BiocVersion_3.19.1
[181] XML_3.99-0.17                     SeuratObject_5.0.2
[183] BiocManager_1.30.23               JASPAR2020_0.99.10
[185] tzdb_0.4.0                        httpuv_1.6.15
[187] polyclip_1.10-7                   RANN_2.6.2
[189] SeuratData_0.2.2.9001             future_1.34.0
[191] scattermore_1.2                   rsvd_1.0.5
[193] xtable_1.8-4                      restfulr_0.0.15
[195] AnnotationFilter_1.26.0           RSpectra_0.16-2
[197] later_1.3.2                       googledrive_2.1.1
[199] viridisLite_0.4.2                 memoise_2.0.1
[201] beeswarm_0.4.0                    AnnotationDbi_1.66.0
[203] GenomicAlignments_1.38.2          cluster_2.1.6
[205] timechange_0.3.0                  globals_0.16.3

scuttle • 732 views

ADD COMMENT • link updated 10 weeks ago by Aaron Lun ★ 28k • written 11 weeks ago by Adam ▴ 10

score 0 · Answer 1 · 2024-12-03

You don't mention how your matrix data is being represented, but if it's a file-backed matrix (e.g., HDF5Matrix or one of its variants, based on the fact that you have HDF5Array loaded), then it's probably going to be slow. You might be bottlenecked by the filesystem I/O or some other factor (see discussion here) that is not amenable to parallelization. In fact, if you have a HDF5-backed sparse matrix in a CSC format, perFeatureQCMetrics() will have a very suboptimal access pattern as it distributes a block of rows to each worker; this means that each worker will need to read in the entire matrix to get the rows of interest.

An easy solution - if you have enough memory - is to coerce the matrix into something like an SVT_SparseMatrix with as(assay(mat), "SVT_SparseMatrix") and then call perFeatureQCMetrics() on that. For 1.3M cells and an integer matrix, I'd guess you need about 25 GB to hold the matrix. Once loaded, this should eliminate the filesystem bottleneck in all subsequent operations via scuttle functions. If you've already got an in-memory matrix and you're still seeing a 24+ hour runtime, then I'm less sure of the reason. Maybe you're running out of memory with all the workers - R's process-based parallelization is a notorious RAM glutton - and going into swap. I dunno.

Usually, for questions of efficiency, I'd be promoting the use of scrapper, which is a reimplementation of a lot of scuttle and scran functionality in C++. Unfortunately, scrapper doesn't yet have an equivalent to perFeatureQCMetrics(), but you could trick it into thinking cells are genes and then applying the RNA QC metrics:

# Mocking up a matrix; this would be equivalent to the
# matrix you create after loading the file-backed matrix into memory.
mat <- Matrix::rsparsematrix(50000, 10000, density = 0.01) 

# Wrap it in a DelayedArray to avoid making a new copy upon transposition
library(DelayedArray)
pretend.genes.are.cells <- t(DelayedArray(mat))

library(scrapper)
qc <- computeRnaQcMetrics(pretend.genes.are.cells, subsets=list())
qc$sum # sum of counts for each gene
qc$detected # number of detected cells for each gene

If you're particularly knowledgeable, you could even ask beachmat to handle the in-memory loading for you, which would probably cut the memory usage down to <15 GB.

score 0 · Answer 2 · 2024-12-03

0

Entering edit mode

Adam ▴ 10

@f2e87e45

Last seen 6 weeks ago

United States

Hi Aaron,

Thanks a lot for your help! As you mentioned, I checked my data and it is "DelayedMatrix". I don't know the detailed difference between 'DelayedMatrix' and 'SVT_SparseMatrix'. Do you think this format may make the computation slow, or may be the matrix size, or something else? And I will try 'scrapper' as you suggested, and I think 'transpose' matrix could be a good idea. Thank you so much! If you have more ideas, could you please share them? Thanks!

Best, JG

ADD COMMENT • link 11 weeks ago Adam ▴ 10

0

Entering edit mode

You can examine the structure of a DelayedMatrix by calling the showtree() function. For example:

library(scRNAseq)
X <- fetchDataset("grun-bone_marrow-2016", version="2023-12-14")
DelayedArray::showtree(assay(X))
## 23536x1915 double, sparse: DelayedMatrix object
##    23536x1915 double, sparse: Set dimnames
##       23536x1915 double, sparse: ReloadedArraySeed object
##          23536x1915 double, sparse: [seed] CSC_H5SparseMatrixSeed object

You can see that, at the core of the object, the data is being stored in a HDF5 file. It only shows up as a DelayedMatrix because of all the extra delayed operations we added on top, e.g., delayed dimnames. Regardless of the underlying structure of the DelayedMatrix, it can be loaded into memory with as(assay(X), "SVT_SparseMatrix"), after which all functions should be a lot faster. scrapper functions will be the fastest but scran and scuttle should also benefit.

For the sake of completeness, here is how you can get beachmat to handle the loading instead of using as(). This approach only works with scrapper functions, though.

library(beachmat.hdf5)
initialized.pretend.genes.are.cells <- beachmat::initializeCpp(t(assay(X)), memorize=TRUE)

library(scrapper)
qc <- computeRnaQcMetrics(initialized.pretend.genes.are.cells, subsets=list())
qc$sum # sum of counts for each gene
qc$detected # number of detected cells for each gene

ADD REPLY • link 11 weeks ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron,

You are right. I checked the structure and it is "TENxMatrix" (HDF5-based sparse matrix ). I tried to convert it to SparseArray, but looks it will take longer than I expected. Later I will test how long it will take. I think the transpose matrix may be a easy solution by now. Thank you so much for your help!

Best,JG

ADD REPLY • link 11 weeks ago Adam ▴ 10

1

Entering edit mode

IIRC loading ~million cells into memory took about 3-5 minutes, but that was with a rather slow filesystem (HDD on a HPC), so YMMV depending on the specs of your computing environment.

ADD REPLY • link 10 weeks ago Aaron Lun ★ 28k