I am running computeSumFactors on a rather large expression matrix (68K × 20K). Instead of running quickCluster on such a large matrix, I am using the k-means clustering result from a previous analysis (PCA of the top 1000 variable genes). However, it returns the following warning upon finishing: "number of cells in each cluster should be at least twice that of the largest 'sizes'". I was wondering whether there will be any adverse effect on the size factors when the cells are not evenly distributed across the clusters.
No, it's fine. We threw in warnings when we were developing the method, but later on, we found that it didn't really matter (as long as you have enough cells, usually at least 100, in each cluster, which ensures you get precise estimates). As of the next release, this particular warning will be removed; the function will also be more tolerant of clusters containing fewer cells than the values in sizes, producing warnings rather than errors.
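A quick way to check the rule of thumb above is to tabulate the cluster sizes before calling computeSumFactors. The sketch below is plain Python rather than R and is not part of scran; the function name `small_clusters`, the example labels, and the default threshold of 100 cells (taken from the rule of thumb above) are illustrative assumptions.

```python
from collections import Counter

def small_clusters(labels, min_cells=100):
    """Return {cluster: size} for every cluster with fewer than min_cells cells."""
    counts = Counter(labels)
    return {cluster: n for cluster, n in counts.items() if n < min_cells}

# Hypothetical k-means labels: three clusters of very different sizes.
labels = ["a"] * 500 + ["b"] * 150 + ["c"] * 40

flagged = small_clusters(labels)
print(flagged)  # clusters that may give imprecise estimates
```

Clusters flagged this way could be merged with their nearest neighbour before normalization, though whether that is worthwhile depends on the data.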
Note that clustering prior to computeSumFactors should be done in a way that is insensitive to the size factors. Otherwise, in extreme cases, you cluster cells that have similar library sizes, rather than those with similar expression profiles. We suggest doing something like computing ranks (e.g., with quickCluster and get.ranks=TRUE) and running a clustering algorithm on that instead. You can use k-means, or you can try quickCluster with method="igraph" (parallelizable via BiocParallel). This uses a community-based detection algorithm for clustering, which avoids constructing the distance matrix for large numbers of cells.
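To illustrate why ranks help here, the sketch below shows that two cells with the same expression profile but different library sizes get identical within-cell ranks. It is plain Python, not scran's implementation; `rank_profile` and the example counts are hypothetical.

```python
def rank_profile(counts):
    """Rank one cell's counts across genes (1-based); tied counts share their mean rank."""
    order = sorted(range(len(counts)), key=lambda i: counts[i])
    ranks = [0.0] * len(counts)
    i = 0
    while i < len(order):
        # Find the end of the run of tied values starting at position i.
        j = i
        while j + 1 < len(order) and counts[order[j + 1]] == counts[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

# Same profile, but the second cell was sequenced three times as deeply.
cell_small = [0, 2, 5, 1, 2]
cell_large = [3 * x for x in cell_small]

same = rank_profile(cell_small) == rank_profile(cell_large)
print(same)
```

Because scaling a cell's counts by a positive constant preserves their ordering and ties, a clustering algorithm run on the ranks cannot separate these two cells by library size alone.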
Also, you misspelt the tag, which is why I didn't see this post until now.
Thank you very much. I think I have an older version of scran (scran_1.2.2), because it does not have the method and get.ranks options when I try to use quickCluster. It may sound silly, but I wonder if there is any way to work around this. If not, I will probably have to upgrade both R and Bioconductor.
Yes, upgrading R and Bioconductor would be wise. The single-cell field moves quickly so you really want to get the latest versions of all packages. I personally switch to new versions of R as soon as they are available.
Thanks, I have installed the latest version. I am wondering how to use BiocParallel for quickCluster with the igraph option. Also, is it normal to consume 60 to 80 GB of memory for a matrix of this size?
Ah, the parallelization is only supported in the devel version at the moment; got my wires crossed. Currently we're transitioning to the SingleCellExperiment class, which is pretty hectic; so until the next release (next month, I think?) the current version of scran will not receive any new features.
As for the memory consumption, that is somewhat unusual, though not impossible, as the current version of scran (due to the limitations of ExpressionSet objects) represents all data inefficiently as dense matrices. The next version will provide proper support for sparse and file-backed matrices, so this should cease to be a major issue.
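A back-of-the-envelope calculation gives a feel for the dense versus sparse difference. The sketch below is plain Python and makes several assumptions: 8-byte double-precision values, 4-byte integer indices in a compressed sparse column layout, the 68K × 20K dimensions from the question (which dimension is the column is assumed here), and a purely hypothetical 5% of entries being nonzero.

```python
BYTES_PER_DOUBLE = 8  # one double-precision value
BYTES_PER_INDEX = 4   # one 32-bit integer index

def dense_bytes(n_rows, n_cols):
    """Memory for a dense matrix of doubles."""
    return n_rows * n_cols * BYTES_PER_DOUBLE

def sparse_csc_bytes(n_rows, n_cols, nonzero_fraction):
    """Memory for a compressed sparse column matrix: value plus row index
    per nonzero entry, and one pointer per column (plus one)."""
    nnz = round(n_rows * n_cols * nonzero_fraction)
    return nnz * (BYTES_PER_DOUBLE + BYTES_PER_INDEX) + (n_cols + 1) * BYTES_PER_INDEX

dense_gb = dense_bytes(68_000, 20_000) / 1e9
sparse_gb = sparse_csc_bytes(68_000, 20_000, 0.05) / 1e9  # 5% nonzero is an assumption
print(f"dense: {dense_gb:.2f} GB, sparse: {sparse_gb:.2f} GB")
```

Under these assumptions a single dense copy is already on the order of 10 GB, so an analysis that holds several intermediate copies at once could plausibly reach tens of gigabytes.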