scuttle::perFeatureQCMetrics runs very slow for big scRNA-seq data
2
0
Entering edit mode
Adam • 0
@f2e87e45
Last seen 1 hour ago
United States

Hi,

I was processing a scRNA-seq data which has about 1.3M cells with around 53000 genes. When I tried the function perFeatureQCMetrics() to get some gene metrics for filtering genes, it ran over 24 hours and still not finished even with parallelization. I preferred to use all cells with bioconductor packages and not to apply Sketch-based analysis in Seurat. Is there a way to speed it up? Thanks a lot!

Best, JG


# Here is the code I used .

bp.params <- MulticoreParam(workers = 26)
per.feat <- perFeatureQCMetrics(sce, BPPARAM = bp.params)



sessionInfo( )
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] Azimuth_0.5.0               shinyBS_0.61.1
 [3] patchwork_1.3.0             lubridate_1.9.3
 [5] forcats_1.0.0               stringr_1.5.1
 [7] dplyr_1.1.4                 purrr_1.0.2
 [9] readr_2.1.5                 tidyr_1.3.1
[11] tibble_3.2.1                tidyverse_2.0.0
[13] AnnotationHub_3.12.0        BiocFileCache_2.10.2
[15] dbplyr_2.5.0                PCAtools_2.16.0
[17] ggrepel_0.9.6               scran_1.32.0
[19] BiocParallel_1.38.0         DropletUtils_1.24.0
[21] scater_1.32.1               ggplot2_3.5.1
[23] scuttle_1.14.0              SingleCellExperiment_1.26.0
[25] SummarizedExperiment_1.34.0 Biobase_2.64.0
[27] GenomicRanges_1.56.2        GenomeInfoDb_1.40.1
[29] IRanges_2.38.1              S4Vectors_0.42.1
[31] BiocGenerics_0.50.0         MatrixGenerics_1.16.0
[33] matrixStats_1.4.1

loaded via a namespace (and not attached):
  [1] fs_1.6.4                          spatstat.sparse_3.1-0
  [3] ProtGenerics_1.34.0               bitops_1.0-9
  [5] DirichletMultinomial_1.44.0       TFBSTools_1.42.0
  [7] httr_1.4.7                        RColorBrewer_1.1-3
  [9] sctransform_0.4.1                 tools_4.4.1
 [11] utf8_1.2.4                        R6_2.5.1
 [13] DT_0.33                           HDF5Array_1.32.1
 [15] uwot_0.2.2                        lazyeval_0.2.2
 [17] rhdf5filters_1.16.0               withr_3.0.1
 [19] sp_2.1-4                          prettyunits_1.2.0
 [21] gridExtra_2.3                     progressr_0.14.0
 [23] cli_3.6.3                         spatstat.explore_3.3-2
 [25] fastDummies_1.7.4                 EnsDb.Hsapiens.v86_2.99.0
 [27] shinyjs_2.1.0                     Seurat_5.1.0
 [29] spatstat.data_3.1-2               ggridges_0.5.6
 [31] pbapply_1.7-2                     Rsamtools_2.18.0
 [33] R.utils_2.12.3                    parallelly_1.38.0
 [35] BSgenome_1.70.2                   limma_3.60.6
 [37] RSQLite_2.3.7                     generics_0.1.3
 [39] BiocIO_1.12.0                     gtools_3.9.5
 [41] spatstat.random_3.3-2             ica_1.0-3
 [43] googlesheets4_1.1.1               GO.db_3.18.0
 [45] Matrix_1.7-1                      ggbeeswarm_0.7.2
 [47] fansi_1.0.6                       abind_1.4-8
 [49] R.methodsS3_1.8.2                 lifecycle_1.0.4
 [51] yaml_2.3.10                       edgeR_4.2.2
 [53] rhdf5_2.48.0                      SparseArray_1.4.8
 [55] Rtsne_0.17                        grid_4.4.1
 [57] blob_1.2.4                        promises_1.3.0
 [59] dqrng_0.4.1                       shinydashboard_0.7.2
 [61] pwalign_1.0.0                     crayon_1.5.3
 [63] miniUI_0.1.1.1                    lattice_0.22-6
 [65] beachmat_2.20.0                   cowplot_1.1.3
 [67] annotate_1.82.0                   GenomicFeatures_1.54.4
 [69] KEGGREST_1.44.1                   pillar_1.9.0
 [71] metapod_1.12.0                    rjson_0.2.23
 [73] future.apply_1.11.2               codetools_0.2-20
 [75] fastmatch_1.1-4                   leiden_0.4.3.1
 [77] glue_1.8.0                        spatstat.univar_3.0-0
 [79] data.table_1.16.2                 remotes_2.5.0
 [81] vctrs_0.6.5                       png_0.1-8
 [83] spam_2.11-0                       cellranger_1.1.0
 [85] poweRlaw_0.80.0                   gtable_0.3.5
 [87] cachem_1.1.0                      Signac_1.14.0
 [89] S4Arrays_1.4.1                    mime_0.12
 [91] pracma_2.4.4                      survival_3.7-0
 [93] gargle_1.5.2                      RcppRoll_0.3.1
 [95] statmod_1.5.0                     bluster_1.14.0
 [97] fitdistrplus_1.2-1                ROCR_1.0-11
 [99] nlme_3.1-165                      bit64_4.5.2
[101] progress_1.2.3                    filelock_1.0.3
[103] RcppAnnoy_0.0.22                  irlba_2.3.5.1
[105] vipor_0.4.7                       KernSmooth_2.23-24
[107] seqLogo_1.68.0                    SeuratDisk_0.0.0.9021
[109] colorspace_2.1-1                  DBI_1.2.3
[111] tidyselect_1.2.1                  processx_3.8.4
[113] bit_4.5.0                         compiler_4.4.1
[115] curl_5.2.3                        BiocNeighbors_1.22.0
[117] hdf5r_1.3.11                      xml2_1.3.6
[119] desc_1.4.3                        DelayedArray_0.30.1
[121] plotly_4.10.4                     rtracklayer_1.62.0
[123] caTools_1.18.3                    scales_1.3.0
[125] lmtest_0.9-40                     callr_3.7.6
[127] rappdirs_0.3.3                    goftest_1.2-3
[129] digest_0.6.37                     presto_1.0.0
[131] spatstat.utils_3.1-0              XVector_0.44.0
[133] htmltools_0.5.8.1                 pkgconfig_2.0.3
[135] sparseMatrixStats_1.16.0          fastmap_1.2.0
[137] ensembldb_2.26.0                  rlang_1.1.4
[139] htmlwidgets_1.6.4                 UCSC.utils_1.0.0
[141] shiny_1.9.1                       DelayedMatrixStats_1.26.0
[143] farver_2.1.2                      zoo_1.8-12
[145] jsonlite_1.8.9                    R.oo_1.26.0
[147] BiocSingular_1.20.0               RCurl_1.98-1.16
[149] magrittr_2.0.3                    GenomeInfoDbData_1.2.12
[151] dotCall64_1.2                     Rhdf5lib_1.26.0
[153] munsell_0.5.1                     Rcpp_1.0.13
[155] viridis_0.6.5                     reticulate_1.39.0
[157] stringi_1.8.4                     zlibbioc_1.50.0
[159] MASS_7.3-61                       plyr_1.8.9
[161] pkgbuild_1.4.4                    parallel_4.4.1
[163] listenv_0.9.1                     CNEr_1.38.0
[165] deldir_2.0-4                      Biostrings_2.72.1
[167] splines_4.4.1                     tensor_1.5
[169] hms_1.1.3                         locfit_1.5-9.10
[171] BSgenome.Hsapiens.UCSC.hg38_1.4.5 ps_1.7.6
[173] igraph_2.1.1                      spatstat.geom_3.3-3
[175] RcppHNSW_0.6.0                    reshape2_1.4.4
[177] biomaRt_2.58.2                    ScaledMatrix_1.12.0
[179] TFMPvalue_0.0.9                   BiocVersion_3.19.1
[181] XML_3.99-0.17                     SeuratObject_5.0.2
[183] BiocManager_1.30.23               JASPAR2020_0.99.10
[185] tzdb_0.4.0                        httpuv_1.6.15
[187] polyclip_1.10-7                   RANN_2.6.2
[189] SeuratData_0.2.2.9001             future_1.34.0
[191] scattermore_1.2                   rsvd_1.0.5
[193] xtable_1.8-4                      restfulr_0.0.15
[195] AnnotationFilter_1.26.0           RSpectra_0.16-2
[197] later_1.3.2                       googledrive_2.1.1
[199] viridisLite_0.4.2                 memoise_2.0.1
[201] beeswarm_0.4.0                    AnnotationDbi_1.66.0
[203] GenomicAlignments_1.38.2          cluster_2.1.6
[205] timechange_0.3.0                  globals_0.16.3
scuttle • 78 views
ADD COMMENT
0
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 13 hours ago
The city by the bay

You don't mention how your matrix data is being represented, but if it's a file-backed matrix (e.g., HDF5Matrix or one of its variants, based on the fact that you have HDF5Array loaded), then it's probably going to be slow. You might be bottlenecked by the filesystem I/O or some other factor (see discussion here) that is not amenable to parallelization. In fact, if you have a HDF5-backed sparse matrix in a CSC format, perFeatureQCMetrics() will have a very suboptimal access pattern as it distributes a block of rows to each worker; this means that each worker will need to read in the entire matrix to get the rows of interest.

An easy solution - if you have enough memory - is to coerce the matrix into something like an SVT_SparseMatrix with as(assay(mat), "SVT_SparseMatrix") and then call perFeatureQCMetrics() on that. For 1.3M cells and an integer matrix, I'd guess you need about 25 GB to hold the matrix. Once loaded, this should eliminate the filesystem bottleneck in all subsequent operations via scuttle functions. If you've already got an in-memory matrix and you're still seeing a 24+ hour runtime, then I'm less sure of the reason. Maybe you're running out of memory with all the workers - R's process-based parallelization is a notorious RAM glutton - and going into swap. I dunno.

Usually, for questions of efficiency, I'd be promoting the use of scrapper, which is a reimplementation of a lot of scuttle and scran functionality in C++. Unfortunately, scrapper doesn't yet have an equivalent to perFeatureQCMetrics(), but you could trick it into thinking cells are genes and then applying the RNA QC metrics:

# Mocking up a matrix; this would be equivalent to the
# matrix you create after loading the file-backed matrix into memory.
mat <- Matrix::rsparsematrix(50000, 10000, density = 0.01) 

# Wrap it in a DelayedArray to avoid making a new copy upon transposition
library(DelayedArray)
pretend.genes.are.cells <- t(DelayedArray(mat))

library(scrapper)
qc <- computeRnaQcMetrics(pretend.genes.are.cells, subsets=list())
qc$sum # sum of counts for each gene
qc$detected # number of detected cells for each gene

If you're particularly knowledgeable, you could even ask beachmat to handle the in-memory loading for you, which would probably cut the memory usage down to <15 GB.

ADD COMMENT
0
Entering edit mode
Adam • 0
@f2e87e45
Last seen 1 hour ago
United States

Hi Aaron,

Thanks a lot for your help! As you mentioned, I checked my data and it is "DelayedMatrix". I don't know the detailed difference between 'DelayedMatrix' and 'SVT_SparseMatrix'. Do you think this format may make the computation slow, or may be the matrix size, or something else? And I will try 'scrapper' as you suggested, and I think 'transpose' matrix could be a good idea. Thank you so much! If you have more ideas, could you please share them? Thanks!

Best, JG

ADD COMMENT

Login before adding your answer.

Traffic: 837 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6