scuttle::aggregateAcrossCells() crashes R on large SingleCellExperiment
2
0
Entering edit mode
enricoferrero ▴ 660
@enricoferrero-6037
Last seen 3.0 years ago
Switzerland

Hello,

I'm working on a large SingleCellExperiment object (25982 genes x 143019 cells) and would like to perform a differential expression analysis across conditions using pseudobulking.

However, when I call scuttle::aggregateAcrossCells(), R works through it for a long time and then crashes. I tried sampling the columns to 1/10th of the size (14302 cells) and it works (after some time), so I think the only problem is the size of the SingleCellExperiment object.

What can I do to get this to work? Are there any workarounds I can use? The function doesn't seem to support parallelisation with BiocParallel as far as I can see.

Thank you! Enrico

library(tidyverse)
library(scater)
library(scran)
sce <- read_rds("../dat/sce.rds")

sce
class: SingleCellExperiment 
dim: 25982 143019 
metadata(4): cell_or_tissue_type_colors condition_colors replicate_group_colors replicategroup_colors
assays(2): counts logcounts
rownames(25982): Xkr4 Gm1992 ... CAAA01147332.1 AC149090.1
rowData names(5): ensembl_id gene_symbol n_cells is_mito is_ribo
colnames: NULL
colData names(88): sample organism ... label celltype
reducedDimNames(2): PCA UMAP
altExpNames(0):

# this crashes
pb <- aggregateAcrossCells(sce, ids = colData(sce)[, c("mouse", "label")])

# this works
sce <- sce[, sample(ncol(sce), ncol(sce)/10)]
pb <- aggregateAcrossCells(sce, ids = colData(sce)[, c("mouse", "label")])

# session info
sessionInfo( )
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/prog/OpenBLAS/0.2.20-GCC-6.4.0-2.28/lib/libopenblas_haswellp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scran_1.18.3                scater_1.18.3               SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
 [5] Biobase_2.50.0              GenomicRanges_1.42.0        GenomeInfoDb_1.26.2         IRanges_2.24.1             
 [9] S4Vectors_0.28.1            BiocGenerics_0.36.0         MatrixGenerics_1.2.0        matrixStats_0.57.0         
[13] forcats_0.5.0               stringr_1.4.0               dplyr_1.0.2                 purrr_0.3.4                
[17] readr_1.4.0                 tidyr_1.1.2                 tibble_3.0.4                ggplot2_3.3.3              
[21] tidyverse_1.3.0            

loaded via a namespace (and not attached):
 [1] ggbeeswarm_0.6.0          colorspace_2.0-0          ggsignif_0.6.0            ellipsis_0.3.1           
 [5] rio_0.5.16                scuttle_1.0.4             bluster_1.0.0             XVector_0.30.0           
 [9] BiocNeighbors_1.8.2       fs_1.5.0                  rstudioapi_0.13           ggpubr_0.4.0             
[13] ggrepel_0.9.0             fansi_0.4.1               lubridate_1.7.9.2         xml2_1.3.2               
[17] sparseMatrixStats_1.2.0   jsonlite_1.7.2            broom_0.7.3               dbplyr_2.0.0             
[21] compiler_4.0.3            httr_1.4.2                dqrng_0.2.1               backports_1.2.1          
[25] assertthat_0.2.1          Matrix_1.3-0              limma_3.46.0              cli_2.2.0                
[29] BiocSingular_1.6.0        tools_4.0.3               rsvd_1.0.3                igraph_1.2.6             
[33] gtable_0.3.0              glue_1.4.2                GenomeInfoDbData_1.2.4    Rcpp_1.0.5               
[37] carData_3.0-4             cellranger_1.1.0          vctrs_0.3.6               writexl_1.3.1            
[41] DelayedMatrixStats_1.12.1 openxlsx_4.2.3            beachmat_2.6.4            rvest_0.3.6              
[45] lifecycle_0.2.0           irlba_2.3.3               statmod_1.4.35            rstatix_0.6.0            
[49] edgeR_3.32.0              zlibbioc_1.36.0           scales_1.1.1              hms_0.5.3                
[53] curl_4.3                  gridExtra_2.3             stringi_1.5.3             zip_2.1.1                
[57] BiocParallel_1.24.1       rlang_0.4.10              pkgconfig_2.0.3           bitops_1.0-6             
[61] lattice_0.20-41           tidyselect_1.1.0          magrittr_2.0.1            R6_2.5.0                 
[65] generics_0.1.0            DelayedArray_0.16.0       DBI_1.1.0                 pillar_1.4.7             
[69] haven_2.3.1               foreign_0.8-81            withr_2.3.0               abind_1.4-5              
[73] RCurl_1.98-1.2            modelr_0.1.8              crayon_1.3.4              car_3.0-10               
[77] viridis_0.5.1             locfit_1.5-9.4            grid_4.0.3                readxl_1.3.1             
[81] data.table_1.13.6         reprex_0.3.0              munsell_0.5.0             beeswarm_0.2.3           
[85] viridisLite_0.3.0         vipor_0.4.5
scran SingleCellExperiment scuttle scater • 2.8k views
ADD COMMENT
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 9 hours ago
The city by the bay

Hm. This works pretty well for me:

library(scuttle)
library(scRNAseq)
sce <- ZilionisLungData()
## class: SingleCellExperiment
## dim: 41861 173954
## metadata(0):
## assays(1): counts
## rownames(41861): 5S_rRNA 5_8S_rRNA ... snosnR66 uc_338
## rowData names(0):
## colnames(173954): bcIIOD bcHTNA ... bcELDH bcFGGM
## colData names(16): Library Barcode ... used_in_NSCLC_non_immune
##   used_in_blood
## reducedDimNames(5): SPRING_NSCLC_all_cells
##   SPRING_NSCLC_and_blood_immune SPRING_NSCLC_immune
##   SPRING_NSCLC_non_immune SPRING_blood
## altExpNames(0):
groups <- sample(100, ncol(sce), replace=TRUE)
out <- aggregateAcrossCells(sce, groups)
## class: SingleCellExperiment
## dim: 41861 100
## metadata(0):
## assays(1): counts
## rownames(41861): 5S_rRNA 5_8S_rRNA ... snosnR66 uc_338
## rowData names(0):
## colnames(100): 1 2 ... 99 100
## colData names(18): Library Barcode ... ids ncells
## reducedDimNames(5): SPRING_NSCLC_all_cells
##   SPRING_NSCLC_and_blood_immune SPRING_NSCLC_immune
##   SPRING_NSCLC_non_immune SPRING_blood
## altExpNames(0):

Last step takes a few seconds on my laptop (16 GB RAM, not that all of it is used here).You also just pass a BPPARAM via ..., but for a few seconds, it isn't really worth it.

ADD COMMENT
2
Entering edit mode

Thanks Aaron. For whatever reason, using BPPARAM did the trick! I suspect this is related to the problem described by Steve, with parallelisation splitting the object into smaller chunks.

ADD REPLY
0
Entering edit mode

There should be no loss of sparsity at any stage inside aggregateAcrossCells; if this is happening, it is a bug. If you can make a minimum example with public data (e.g., ZilionisLungData), it would be helpful.

ADD REPLY
1
Entering edit mode
@steve-lianoglou-2771
Last seen 20 months ago
United States

I ran into this problem recently as well: my call to aggregateAcrossCells() was eventually crashing with an error message (that I unfortunately didn't copy verbatim) that lead me to believe that there is some matrix densification going on during the aggregation, and the temp counts matrix was larger (ie. ncol() * nrow()) than .Machine$integer.max.

I was able to get around it by splitting my SingleCellExperiment (xsce) into two halves, bulking it up that way, then combining the results when done, like so:

top <- xsce[1:16000, ]
top.agg <- aggregateAcrossCells(
  top,
  ids = colData(top)[, c("sample_id", "cluster_all")]) 

bottom <- xsce[!rownames(xsce) %in% rownames(top),]
bottom.agg <- aggregateAcrossCells(
  bottom,
  ids = colData(bottom)[, c("sample_id", "cluster_all")]) 

pbulk <- rbind(top.agg, bottom.agg)

In my case, the number of cells in xsce were small enough such that 16000 * ncol(xsce) was less than .Machine$integer.max. If that's not true for you, you can try splitting your SingleCellExperiment into smaller chunks and combine each of them after.

Writing a small utility function that accepts a SingleCellExperiment and a "chunk size" (row count, in my approach) should be pretty straightforward ... back calculating a default chunk size is also doable ...

ADD COMMENT
0
Entering edit mode

Thank Steve! I think the problem was indeed similar. I solved it simply using BPPARAM in the end, but I suspect this is because parallelisation splits the object into smaller chunks.

ADD REPLY

Login before adding your answer.

Traffic: 943 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6