Question

scran: question on minimum cluster size for computeSumFactors()

aatsmith • 0

@aatsmith-10597

Dear Bioconductor support,

We are currently considering using the Bioconductor package scran to analyse some publicly available single-cell RNA-seq data (quantified at the transcript level with Kallisto, aggregated to gene level with tximport). After QC, this dataset has fewer than 90 cells in each of 3 different known sub-populations. I was wondering if this would be enough to pass to scran's computeSumFactors(), or even if it would be enough to run quickCluster() first to get clusters without using a priori information. Either way, for future reference, what would be the general guidelines for lower bounds on cell counts [per cluster] for use with these scran functions?

Thank you in advance for your help!

Best regards,

-- Alex

scran single cell normalization clustering • 3.4k views

score 2 · Answer 1 · 2016-04-29

That's an interesting question. The minimum number of cells required for deconvolution probably depends on the quality of each cell - if you don't have a lot of zeroes in each cell, or if the number of zeroes is not variable across cells, then I would expect that you don't need to pool as many cells to get accurate normalization. In fact, if you have high-quality libraries, then bulk-based methods that operate on each cell separately will actually do okay. For very-low-count data (e.g., inDrop, Zeisel et al.'s brain data), the deconvolution approach works with as few as 100 cells per cluster, but I've noticed that the precision of the estimates starts to deteriorate as the number of cells decreases. This makes sense, because the method works by sharing information across cells, and there's less information to share when there are fewer cells.
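A rough way to gauge where a dataset sits on this spectrum is to look at the proportion of zero counts in each cell and how much it varies across cells. The sketch below uses base R only; the `counts` matrix is simulated purely for illustration (it is not from the dataset discussed here) and should be replaced with your own gene-by-cell count matrix.

```r
# Toy count matrix (genes x cells), simulated for illustration only;
# substitute your own gene-level counts here.
set.seed(1)
counts <- matrix(rnbinom(2000 * 90, mu = 2, size = 0.5), nrow = 2000, ncol = 90)

# Proportion of genes with a zero count in each cell.
zero.frac <- colMeans(counts == 0)

summary(zero.frac)  # overall level of zeroes per cell
sd(zero.frac)       # how variable that level is across cells

# Visual check of the distribution across cells.
hist(zero.frac, xlab = "Proportion of zero counts per cell", main = "")
```

A high and highly variable proportion of zeroes would suggest the data behave more like the low-count case, where larger pools are needed.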

If your data is like the low-count data mentioned above, then 90 cells will probably be borderline (and definitely not enough if they're split into three subpopulations, such that you'd only have 30 cells on average in each subpopulation). In such cases, I would try running computeSumFactors without any clustering and hope that you don't have large numbers of DE genes between your subpopulations. On the other hand, if your data has higher coverage, then you might be able to get away with fewer cells (e.g., set sizes=c(5, 10, 15, 20), assuming your subpopulations are around 30 cells each; you'll have to turn down min.size in quickCluster as well, or define your clusters manually). It's worth a shot, at least - it's not like it would do worse than standard normalization methods, so you might as well give it a go.
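The three setups described above might look roughly like the following. This is only a sketch: it assumes a recent scran release in which computeSumFactors() takes a SingleCellExperiment and stores the result in sizeFactors() (older releases worked on other classes), and the counts, the group labels, the pool sizes for the unclustered run and min.size = 20 are all made up for illustration.

```r
library(scran)                 # assumed: a recent Bioconductor release
library(SingleCellExperiment)

set.seed(100)
# Simulated data, for illustration only: 2000 genes x 90 cells,
# three groups of 30 cells with some group-specific genes.
ngenes <- 2000
groups <- factor(rep(c("A", "B", "C"), each = 30))
mu <- matrix(20, nrow = ngenes, ncol = length(groups))
mu[1:100, groups == "B"] <- 60
mu[101:200, groups == "C"] <- 60
counts <- matrix(rnbinom(length(mu), mu = mu, size = 2), nrow = ngenes)
sce <- SingleCellExperiment(assays = list(counts = counts))

# Option 1: no clustering, pooling across all 90 cells.
sf.all <- sizeFactors(computeSumFactors(sce, sizes = c(20, 30, 40)))

# Option 2: manually defined clusters (known subpopulations), smaller pools.
sf.known <- sizeFactors(
    computeSumFactors(sce, sizes = c(5, 10, 15, 20), clusters = groups))

# Option 3: empirical clusters, with min.size turned down from its default.
clust <- quickCluster(sce, min.size = 20)
sf.quick <- sizeFactors(
    computeSumFactors(sce, sizes = c(5, 10, 15, 20), clusters = clust))
```

Comparing `sf.all`, `sf.known` and `sf.quick` against each other gives a feel for how sensitive the estimates are to the clustering choice.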

In any case, I always plot the deconvolution size factors against the library sizes, just as a sanity check. For low count data, these two methods are the most similar (relative to DESeq and TMM normalization, which give distorted estimates) so it's generally a good idea to check that their estimates are roughly correlated. Of course, some scatter in this plot is expected, with differences between the normalization strategies due to DE between cells (against which library size normalization is not robust).
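A minimal version of that sanity check is sketched below. It assumes a recent scran release that operates on a SingleCellExperiment; the simulated counts, the per-cell scaling used to create different library sizes, and the pool sizes are illustrative assumptions rather than recommendations.

```r
library(scran)                 # assumed: a recent Bioconductor release
library(SingleCellExperiment)

set.seed(42)
# Simulated data, for illustration only: 1000 genes x 60 cells,
# with a cell-specific scaling so that library sizes differ.
cell.scale <- runif(60, 0.5, 2)
mu <- outer(rep(20, 1000), cell.scale)
counts <- matrix(rnbinom(length(mu), mu = mu, size = 2), nrow = 1000)
sce <- SingleCellExperiment(assays = list(counts = counts))

# Deconvolution size factors, without clustering.
sce <- computeSumFactors(sce, sizes = c(10, 20, 30))
deconv.sf <- sizeFactors(sce)

# Library size for each cell.
lib.sizes <- colSums(counts)

# The two should be roughly correlated; some scatter is expected.
plot(lib.sizes / 1e3, deconv.sf, log = "xy",
     xlab = "Library size (thousands)", ylab = "Deconvolution size factor")
cor(log(lib.sizes), log(deconv.sf))
```

In real data, cells that fall well away from the main trend are worth a closer look, as they may reflect DE between subpopulations rather than a problem with the estimates.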

Also, I don't know how nicely the deconvolution method plays with Kallisto-derived counts. In theory, it shouldn't matter if the values are interpretable as counts, but I haven't checked it out in practice.

score 0 · Answer 2 · 2016-05-01

Hi Aaron,

Thank you for your prompt reply & your insight!

I have between 80 & 96 cells in each sub-population, for a total of fewer than 280, so that sounds borderline as you said, but worth a try - especially since it can hardly be worse than the traditional normalisation methods. I've been playing around with the DESeq2 one so far, and if the number of DEGs I get with that is anything to go by, I have quite a few DEGs between sub-populations (up to 20% of tested genes), so I am hoping your deconvolution approach is applicable here!

I will thus test out different setups and sizes, trying with quickCluster() or with the a priori-defined subpopulations, & have a look at the normalisation "QC" plots (prevalence of 0's, deconv. size factors vs library sizes) you suggest, and I'll let you know what I see!

Thanks again & best regards,

-- Alex