Hi Robert Castelo ,
I face an error when trying to run gsva function from the GSVA package using a big sparse matrix (class dgCMatrix). The matrix has dimensions of around 30k x 120k. I've noticed that using dgCMatrix objects as input is still in an experimental stage but would like to know if there's some solution under way to this.
The code starts running normally but after a while stops:
Estimating GSVA scores for 50 gene sets.
Estimating ECDFs with Gaussian kernels
Error in as.vector(.Call(Csparse_to_vector, x), mode) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 102
Here is some of the traceback:
17: as.vector(.Call(Csparse_to_vector, x), mode)
16: as.vector(x, mode)
15: as.vector(x, mode)
14: as.vector(x)
13: as.double(t(expr[, sample.idxs, drop = FALSE]))
12: as.double(t(expr[, sample.idxs, drop = FALSE]))
11: compute.gene.density(expr, sample.idxs, rnaseq, kernel)
10: compute.geneset.es(expr, gset.idx.list, 1:n.samples, rnaseq = rnaseq,
abs.ranking = abs.ranking, parallel.sz = parallel.sz, mx.diff = mx.diff,
tau = tau, kernel = kernel, verbose = verbose, BPPARAM = BPPARAM)
9: .gsva(expr, mapped.gset.idx.list, method, kcdf, rnaseq, abs.ranking,
parallel.sz, mx.diff, tau, kernel, ssgsea.norm, verbose,
BPPARAM)
8: .local(expr, gset.idx.list, ...)
7: gsva(logtpm_matrix, gene_sets, kcdf = "Gaussian", min.sz = 5,
max.sz = 500, parallel.sz = 1, verbose = TRUE)
sessionInfo( )
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS
Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] RColorBrewer_1.1-2 GSEABase_1.54.0 graph_1.70.0
[4] annotate_1.70.0 XML_3.99-0.6 AnnotationDbi_1.54.1
[7] IRanges_2.26.0 S4Vectors_0.30.0 Biobase_2.52.0
[10] BiocGenerics_0.38.0 GSVA_1.40.1 patchwork_1.1.1
[13] SeuratObject_4.0.2 Seurat_4.0.3 dplyr_1.0.6
Hi, Thank you for the information! I'll try ssGSEA in the meantime.
If I may quickly inquire on this. I tried this:
With
data
being aSingleCellExperiment
with a sparse assay matrixX
andgene_sets
being aGeneSetCollection
. Unfortunately however, it converts the sparse matrix to dense:EDIT: with
gsva(ssgseaParam(assay(data, "X"), gene_sets), ...)
it uses the sparse matrix, however fails at the Normalization step:(Also, do you know why 99% (59552) of my genes are considered "constant"? There is quite some variability in the data (it contains a lot of diverse samples though)).
Hi Moritz,
could you please let me know which version of GSVA this is, ideally the output of
sessionInfo()
? Did you install it usingBiocManager::install("GSVA")
or any other source or method?Did your first attempt produce useful results despite the two warnings? Depending on your data. the sparse->dense coercion will use more memory but should otherwise give correct results and while ssGSEA checks for constant genes it does not drop them.
Conversation moved to https://github.com/rcastelo/GSVA/issues/113
The dataset I am using at the moment is just a test dataset. The main dataset is much bigger and it's impossible to work with it in the dense fashion.
May I kindly ask for more specifics on how the constant genes are determined (as they are not 100% constant)? I had a quick search but could not find information on it in the ssGSEA publication.
If they are determinted via a heuristic (e.g. little variance across samples), but not deleted, I wouldn't mind actually.
Briefly, GSVA always tries to identify constant genes and warn about them.
Genes found to be constant will be removed
In particular, they are not removed when the method is
ssGSEA
.Earlier versions of GSVA tried to identify constant genes by checking if their standard deviation == 0 but because of the vagaries of floating point arithmetics this caused issues that are probably best summarized in #87
Hence, in the current release version 1.50.0, genes are considered constant if their SD is less than 1e-10. As you have seen, this arbitrary threshold may result in false positives and worse, it will fail to identify genes with a single non-zero value in a sparse matrix.
Therefore, in the current devel version (that will become 1.52.0 with the release of BioC 3.19 in April), genes will be considered constant, if their
max(.) == min(.)
, over either all values or, in the case of sparse matrices, their non-zero values, as implemented inGSVA:::.filterGenes()
.This approach has been chosen because it (hopefully)
matrixStats
andsparseMatrixStats
.I'd of course be grateful for better ideas that meet these criteria. ;-)
This sounds great and helped a lot to pinpoint my issue: I was loading a sparse matrix from h5ad and it failed loading the X matrix (so all values were 0).