Hello all,
I am trying to get cpm
counts from this GEO accession [1] following this method. However I get a segfault ("memory not mapped") killing my entire R session.
Edit: My issue is not getting this to run at all, the issue is that I get a segfault with these precise inputs.
Edit2: As stated in this reply I originally ran into this issue on a compute server with about 1TB of RAM in terms of memory leeway. Following the related comment I also ran the example code on my personal laptop on WSL2 (30GB of RAM allocated + 100GB of swap), to the same result. Output appended at [2].
count_mat
dimensions: nrow = 29'733, ncol = 100'064
Reproducible example (Edit: code used to generate the RDS provided below):
library("edgeR")
library("Matrix")
dl_url <- "https://bimsbstatic.mdc-berlin.de/akalin/jfreige/wu_raw_crash_objs.RDS"
dl_dest <- "wu_raw_crash_objs.RDS"
if (!file.exists(dl_dest)) {
download.file(dl_url, dl_dest, timeout = 60^2)
}
crash_objs <- readRDS(dl_dest)
count_mat <- crash_objs$count_mat
norm_factors <- crash_objs$norm_factors
print(sessionInfo())
# This will generate a segmentation fault.
edgeR::cpm(count_mat, norm_factors * colSums(count_mat))
Output:
Loading required package: limma
trying URL 'https://bimsbstatic.mdc-berlin.de/akalin/jfreige/wu_raw_crash_objs.RDS' Content type 'application/octet-stream' length 412819144 bytes (393.7 MB)
==================================================
downloaded 393.7 MB
R version 4.3.1 (2023-06-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS
Matrix products: default
BLAS/LAPACK: /gnu/store/d94sxy6axgzd3xr6bnp49l4v1nqzf825-openblas-0.3.20/lib/libopenblasp-r0.3.20.so; LAPACK version 3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Berlin
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_3.42.4 limma_3.56.2
loaded via a namespace (and not attached):
[1] compiler_4.3.1 Rcpp_1.0.11 grid_4.3.1 locfit_1.5-9.8 lattice_0.21-9
*** caught segfault ***
address 0x7fc0514ff320, cause 'memory not mapped'
Traceback:
1: cpm.default(count_mat, norm_factors * colSums(count_mat))
2: edgeR::cpm(count_mat, norm_factors * colSums(count_mat))
An irrecoverable exception occurred. R is aborting now ...
Segmentation fault (core dumped)
Entire example script for posterity, the file at https://bimsbstatic.mdc-berlin.de/akalin/jfreige/wu_raw_crash_objs.RDS
may not stay there long after this is resolved:
library("edgeR")
options("timeout" = 60 * 60)
download_raw <- T
if (download_raw) {
library("Matrix")
files_needed <- c(
cells = "count_matrix_barcodes.tsv",
genes = "count_matrix_genes.tsv",
matrix = "count_matrix_sparse.mtx"
)
dl_url <- "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE176nnn/GSE176078/suppl/GSE176078_Wu_etal_2021_BRCA_scRNASeq.tar.gz"
dl_dest <- "wu_raw.tar"
unpacked_dir_name <- "Wu_etal_2021_BRCA_scRNASeq/"
needed_paths <- paste0(unpacked_dir_name, "/", files_needed)
names(needed_paths) <- names(files_needed)
if (!all(lapply(needed_paths, file.exists))) {
download.file(url = dl_url, destfile = dl_dest)
untar(dl_dest)
}
matrix <- Matrix::readMM(needed_paths["matrix"])
matrix@Dimnames <- list(
readLines(needed_paths["genes"]),
readLines(needed_paths["cells"])
)
count_mat <- as.matrix(matrix)
norm_factors <- edgeR::calcNormFactors(count_mat)
save_crash_objs <- FALSE
if (save_crash_objs) {
saveRDS(
list(
count_mat = count_mat,
norm_factors = norm_factors
),
"wu_raw_crash_objs.RDS"
)
}
} else {
dl_url <- "https://bimsbstatic.mdc-berlin.de/akalin/jfreige/wu_raw_crash_objs.RDS"
dl_dest <- "wu_raw_crash_objs.RDS"
if (!file.exists(dl_dest)) {
download.file(url = dl_url, destfile = dl_dest)
}
crash_objs <- readRDS(dl_dest)
count_mat <- crash_objs$count_mat
norm_factors <- crash_objs$norm_factors
}
print(sessionInfo())
# This will generate a segmentation fault.
edgeR::cpm(count_mat, norm_factors * colSums(count_mat))
[1] Wu SZ, Al-Eryani G, Roden DL, Junankar S et al. A single-cell and spatially resolved atlas of human breast cancers. Nat Genet 2021 Sep;53(9):1334-1347. PMID: 34493872
[2] WSL2 Run output:
Loading required package: limma
trying URL 'https://bimsbstatic.mdc-berlin.de/akalin/jfreige/wu_raw_crash_objs.RDS'
Content type 'application/octet-stream' length 412819144 bytes (393.7 MB)
==================================================
downloaded 393.7 MB
R version 4.3.1 (2023-06-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS/LAPACK: /gnu/store/z9aavic2jyvp6jyfxnl9ka3i8vfk5phc-openblas-0.3.20/lib/libopenblasp-r0.3.20.so; LAPACK version 3.9.0
locale:
[1] C
time zone: Europe/Berlin
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Matrix_1.6-1.1 edgeR_3.42.4 limma_3.56.2
loaded via a namespace (and not attached):
[1] compiler_4.3.1 Rcpp_1.0.11 grid_4.3.1 locfit_1.5-9.8 lattice_0.21-9
*** caught segfault ***
address 0x7fcba94ff300, cause 'memory not mapped'
Traceback:
1: cpm.default(count_mat, norm_factors * colSums(count_mat))
2: edgeR::cpm(count_mat, norm_factors * colSums(count_mat))
An irrecoverable exception occurred. R is aborting now ...
That must have been quite annoying to tease out, thank you very much for taking the time! Indeed running
cpm
on an numeric matrix with more than 46'340 rows/columns generated a segfault for me as well, while doing the same with an integer matrix did not.Thank you also very much for compiling this list of options for avoiding this issue!
https://github.com/RcppCore/Rcpp/issues/1280 if you want to follow the progress