Hi,
I am trying to check whether down-sampling reads will affect multi-dataset batch integration. I am using the scran, scater and batchelor tools in my pipeline.
1) First, I have processed each dataset to get normalized values based on clusters.
2) Then I am using multiBatchNorm to rescale the normalized values across datasets.
3) Finally, I am performing fastMNN-based batch correction and using the corrected dimensions to plot the UMAP distribution. Based on a quick overview, I am getting the expected integration of cells from biological replicate batches. I haven't performed annotation yet, so I can't say for certain.
To perform marker analysis, I was going to use both findMarkers- and edgeR-based approaches. As edgeR requires raw counts, I am worried that counts from deeply sequenced time points will affect the analysis. I also wanted to see how well the results of the data integration analysis are reproduced after down-sampling.
I thought downsampleBatches would be an appropriate strategy for what I was planning to do. To do that, I first extracted the counts of each dataset and got a downsampled count matrix (hoping this is in proportion to the lowest-depth sample). Next, I tried calculating new size factors for these new count matrices, again following cluster-based size factor estimation. Each dataset gave a warning about negative size factor estimates in computeSumFactors, for which I found an explanation on the function's help page. However, for a few datasets, even quickCluster gives a negative size factor error and then fails to run.
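To make the downsampling step concrete, here is a minimal sketch in plain Python rather than R. It assumes binomial thinning towards the mean depth of the shallowest batch, which is a simplification and not necessarily the exact algorithm used by downsampleBatches. All names (`thin`, `downsample_to_shallowest`, the toy batches) are hypothetical.

```python
import random

def mean_depth(cells):
    """Mean total count per cell; `cells` is a list of per-cell count lists."""
    return sum(sum(cell) for cell in cells) / len(cells)

def zero_fraction(cells):
    """Fraction of entries in the count matrix that are exactly zero."""
    total = sum(len(cell) for cell in cells)
    zeros = sum(c == 0 for cell in cells for c in cell)
    return zeros / total

def thin(cells, prop, rng):
    """Binomial thinning: keep each read independently with probability `prop`."""
    return [[sum(rng.random() < prop for _ in range(c)) for c in cell]
            for cell in cells]

def downsample_to_shallowest(batches, seed=0):
    """Thin every batch so its mean depth roughly matches the shallowest batch."""
    rng = random.Random(seed)
    depth = {name: mean_depth(cells) for name, cells in batches.items()}
    target = min(depth.values())
    return {name: thin(cells, target / depth[name], rng)
            for name, cells in batches.items()}

# Toy data (assumption): 50 cells x 20 genes per batch, one batch ~10x deeper.
data_rng = random.Random(1)
batches = {
    "shallow": [[data_rng.randint(0, 2) for _ in range(20)] for _ in range(50)],
    "deep": [[data_rng.randint(0, 20) for _ in range(20)] for _ in range(50)],
}
down = downsample_to_shallowest(batches)

for name in batches:
    print(name,
          "depth", round(mean_depth(batches[name]), 1), "->",
          round(mean_depth(down[name]), 1),
          "| zeros", round(zero_fraction(batches[name]), 2), "->",
          round(zero_fraction(down[name]), 2))
```

In this toy example the deep batch ends up with many more zero counts after thinning. Sparser count matrices of this kind may be related to the negative size factor warnings described above, although that is not verified here.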
So my question is: is it necessary to check the effect of down-sampling? The depth of my individual datasets ranges from ~30k reads per cell to 350k reads per cell. Most of them are around 60-70k, but the outliers on the high end are 170k, 240k and 350k reads per sample. If I should, what would be a better strategy? I also checked downsampleReads, but it requires an HDF5 file, which my STARsolo run doesn't produce.
Could I take the cluster annotation from the full dataset analysis and use the downsampled data for edgeR-based differential gene expression analysis? That way I won't have to normalize the downsampled data; however, I don't know if the analysis would stay similar. Hence, I just wanted some opinions on my strategy and on alternative approaches that could help me with downsampling, if I should do that for my dataset.
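A minimal sketch of the bookkeeping this would involve, again in plain Python rather than R: cluster labels from the full-depth analysis are matched to the downsampled counts by cell barcode, and counts are summed per (sample, cluster) pair to form pseudo-bulk profiles. The barcodes, labels and the `pseudobulk` helper are hypothetical, and the summed matrix would still need to be passed to edgeR separately.

```python
def pseudobulk(counts, sample_of, cluster_of):
    """Sum per-cell count vectors within each (sample, cluster) pair.

    counts:     dict barcode -> list of gene counts (e.g. downsampled counts)
    sample_of:  dict barcode -> sample label
    cluster_of: dict barcode -> cluster label (e.g. from the full-depth analysis)
    """
    sums = {}
    for barcode, vec in counts.items():
        key = (sample_of[barcode], cluster_of[barcode])
        if key not in sums:
            sums[key] = [0] * len(vec)
        sums[key] = [a + b for a, b in zip(sums[key], vec)]
    return sums

# Toy example (assumption): four cells, three genes, two samples, two clusters.
counts = {"AAAC": [3, 0, 1], "AAAG": [1, 2, 0], "CCAT": [0, 5, 2], "GGTA": [4, 1, 1]}
sample_of = {"AAAC": "rep1", "AAAG": "rep1", "CCAT": "rep1", "GGTA": "rep2"}
cluster_of = {"AAAC": "c1", "AAAG": "c1", "CCAT": "c2", "GGTA": "c1"}

profiles = pseudobulk(counts, sample_of, cluster_of)
for key in sorted(profiles):
    print(key, profiles[key])
```

Because the labels are looked up by barcode, the downsampled cells never need to be clustered or normalized on their own for this step.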
Thanks, Piyush
Thanks Aaron. I feel confident of the overall data integration with fastMNN as I see similar cell types across stages getting clustered together based on some marker expression. I just wanted an expert opinion on downsampling.
I also have another question regarding differential gene expression. In our meetings, other bioinformaticians have suggested comparing similar numbers of cells across clusters for DGE. So if the smallest cluster has 100 cells, the other clusters should be reduced to this number of cells, to explore the stability of DGE markers. One way would be to bootstrap cells from the bigger clusters and combine the results (taking the average of the LFC and p-value?). I didn't find their opinion incorrect, as in my experience with bulk RNA-seq DGE it is better to have a balanced comparison. Do you think that this is a good approach for marker analysis in single-cell data? I haven't seen any tutorials on that, so I was wondering whether a difference in cell numbers is a big issue for marker analysis.
We explored this idea to some extent in the Biostatistics paper. I'll assume you are following a pseudo-bulk strategy, given that edgeR wouldn't be able to handle sample-level variability for per-cell counts.
tl;dr Don't worry about differences in cell number.
The main effect of such differences is that a sum of counts from more cells is more precise. In theory, this is not ideal because it means that different observations for the same gene would have different dispersions, whereas edgeR assumes that all observations have the same dispersion. In practice, this doesn't matter (much) for a variety of reasons:
There are, of course, cases where you don't have enough cells so the three points above do not apply. However, in such cases, I would say that downsampling the number of cells involves throwing away so much information that the solution is almost as bad as the problem.
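The claim that sums over more cells are more precise can be illustrated with a small simulation, sketched here in plain Python. It assumes independent Poisson counts per cell with a shared mean and no sample-to-sample biological variability, which is a deliberate simplification. The cell numbers, mean and helper names are hypothetical.

```python
import math
import random
import statistics

def poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def pseudobulk_cv(n_cells, lam, n_reps, rng):
    """Coefficient of variation of the summed count across simulated replicates."""
    sums = [sum(poisson(rng, lam) for _ in range(n_cells)) for _ in range(n_reps)]
    return statistics.stdev(sums) / statistics.mean(sums)

rng = random.Random(42)
cv_small = pseudobulk_cv(n_cells=100, lam=2.0, n_reps=200, rng=rng)
cv_large = pseudobulk_cv(n_cells=1000, lam=2.0, n_reps=200, rng=rng)

print("CV of the sum, 100 cells: ", round(cv_small, 4))
print("CV of the sum, 1000 cells:", round(cv_large, 4))
```

Under these assumptions the technical variability of the sum shrinks roughly as one over the square root of the number of cells, so a tenfold difference in cell number changes the coefficient of variation by about a factor of three. Variability between biological replicates is not modelled here and would not shrink in the same way.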