Question

Sparse matrices in Bioconductor objects for single-cell analyses

sandmann.t ▴ 70

Dear Bioconductors,

more and more publications include large single-cell RNA-seq datasets. For example, Keren-Shaul et al. made count matrices with 34016 genes x 37248 samples (= cells) available on NCBI GEO. I am interested in using Bioconductor to analyze such data and was happy to find the single-cell analysis Bioconductor workflow and the excellent scater package.

The single-cell data are very sparse, with up to 90% zero counts, and can be stored very efficiently in sparse matrices, e.g. using the Matrix package. Yet it seems that while scater's newSCESet function accepts a sparse matrix, it coerces it into a regular (dense) matrix right away, which requires much more memory (and storage space).
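To put rough numbers on that, here is a back-of-the-envelope sketch of the arithmetic (written in Python purely as a calculator, not measured on the real data). It assumes 8-byte double-precision values for both representations, and a compressed sparse column layout with 4-byte row indices and column pointers, which is my understanding of how the Matrix package's column-compressed class stores data. The 90% zero fraction is taken at face value, and the helper names are made up for illustration.

```python
# Back-of-the-envelope memory estimate. All storage sizes are assumptions,
# not measurements of actual R objects.
N_GENES = 34016        # rows, from the dataset dimensions quoted above
N_CELLS = 37248        # columns (= cells)
ZERO_FRACTION = 0.90   # "up to 90%" zero counts, taken at face value

BYTES_PER_VALUE = 8    # assumed: double-precision values
BYTES_PER_INDEX = 4    # assumed: 32-bit row indices and column pointers


def dense_bytes(n_rows, n_cols):
    """Bytes needed to store every entry, zeros included."""
    return n_rows * n_cols * BYTES_PER_VALUE


def csc_bytes(n_rows, n_cols, zero_fraction):
    """Bytes for a compressed sparse column layout:
    one value and one row index per non-zero, plus n_cols + 1 column pointers."""
    nnz = round(n_rows * n_cols * (1.0 - zero_fraction))
    values = nnz * BYTES_PER_VALUE
    row_indices = nnz * BYTES_PER_INDEX
    col_pointers = (n_cols + 1) * BYTES_PER_INDEX
    return values + row_indices + col_pointers


dense = dense_bytes(N_GENES, N_CELLS)
sparse = csc_bytes(N_GENES, N_CELLS, ZERO_FRACTION)

print(f"dense:  {dense / 1e9:.2f} GB")
print(f"sparse: {sparse / 1e9:.2f} GB")
print(f"ratio:  {dense / sparse:.1f}x")
```

Under these assumptions the dense matrix needs roughly 10 GB and the sparse one about 1.5 GB. The saving shrinks if the dense matrix is stored as 4-byte integers, and grows if the data are sparser than 90%.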

I can simply find a machine with lots of RAM or apply abundance filters before creating an SCESet, but I am curious: are there ways to use Bioconductor's infrastructure and take advantage of the sparse nature of the data?

And in case the scater authors are listening: do you have any plans to make use of sparse matrices?

Any recommendations are appreciated.

Thanks,

Thomas

scater simpleSingleCell workflows • 3.9k views

updated 7.8 years ago by Aaron Lun ★ 28k • written 7.8 years ago by sandmann.t ▴ 70

score 3 · Accepted Answer · 2017-06-20

Well, this question might as well have my name written on it.

Yes, we are planning to adapt the SCESet object to accept sparse matrices; see https://github.com/drisso/SingleCellExperiment for a class based on SummarizedExperiment (which happily takes sparse matrices, as well as disk-based representations such as instances of the HDF5Array class).

Most of the R code should then work interchangeably with dense, sparse and file-backed matrices. The C++ code will take some more effort, but we have written an API (https://github.com/LTLA/beachmat) that allows package code to be written in a manner that is agnostic to the exact representation of the matrix.

All of these goodies should hopefully come out in the next release, sometime in September.