Dear Bioconductors,
More and more publications include large single-cell RNA-seq datasets. For example, Keren-Shaul et al. made count matrices with 34016 genes x 37248 samples (= cells) available on NCBI GEO. I am interested in using Bioconductor to analyze such data and was happy to find the single-cell analysis Bioconductor workflow and the excellent scater package.
The single-cell data are very sparse, with up to 90% zero counts, and can be stored very efficiently in sparse matrices, e.g. using the Matrix package. Yet it seems that while scater's newSCESet function accepts a sparse matrix, it coerces it into a regular matrix right away, which requires much more memory (and storage space).
I can simply find a machine with lots of RAM or apply abundance filters before creating an SCESet, but I am curious: are there ways to use Bioconductor's infrastructure and take advantage of the sparse nature of the data?
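To make the memory difference concrete, here is a minimal sketch on simulated data (not the Keren-Shaul matrix). The dimensions, the 10% density and the "detected in at least 10 cells" filter are arbitrary assumptions chosen for illustration. It only uses the Matrix package and does not touch scater at all.

```r
library(Matrix)

set.seed(1)
n_genes <- 2000   # assumed toy dimensions, far smaller than the real dataset
n_cells <- 1000

# Simulate a count-like matrix in which roughly 90% of the entries are zero.
# rsparsematrix() returns a dgCMatrix (compressed sparse column format).
sparse <- rsparsematrix(n_genes, n_cells, density = 0.1,
                        rand.x = function(n) rpois(n, lambda = 5) + 1)

# Coercing to a regular matrix stores every zero explicitly.
dense <- as.matrix(sparse)

# Compare the in-memory footprint of the two representations.
print(object.size(sparse), units = "MB")
print(object.size(dense), units = "MB")

# An abundance filter can be applied without densifying:
# keep genes detected (count > 0) in at least 10 cells (hypothetical cutoff).
keep <- Matrix::rowSums(sparse > 0) >= 10
filtered <- sparse[keep, , drop = FALSE]
dim(filtered)
```

The filtered object is still a sparse matrix, so the coercion to a regular matrix could at least be postponed until after filtering.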
And in case the scater authors are listening: do you have any plans to make use of sparse matrices?
Any recommendations are appreciated.
Thanks,
Thomas
That's great! Thanks a lot for sharing your progress, plans and especially the pointer to the SingleCellExperiment package. I am not surprised that you have even more awesome solutions in the works :-)