But how about concatenating the batches, calling scran::quickCluster with block specifying the batch, and finally calling calculateSumFactors on the concatenated dataset with the clusters from quickCluster?
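Something like this, perhaps (a minimal sketch; sce1, sce2 and the batch labels are placeholder names, and I'm assuming both SingleCellExperiment objects contain counts for the same genes):

```r
library(scran)

# Concatenate the batches into a single SingleCellExperiment.
combined <- cbind(sce1, sce2)
batch <- rep(c("batch1", "batch2"), c(ncol(sce1), ncol(sce2)))

# Cluster within each batch so that no cluster spans two batches.
clusters <- quickCluster(combined, block = batch)

# Pool-based size factors computed across the concatenated dataset.
sizeFactors(combined) <- calculateSumFactors(combined, clusters = clusters)
```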
Sure, you can do that if the batches are all from similar technologies. multiBatchNorm() was initially built to deal with the painful situation of merging Smart-seq2 and UMI-based data. For most applications, I suspect multiBatchNorm() is not really necessary, but it doesn't seem to hurt, so I just run it so that my merging pipelines will work regardless of what crazy datasets I throw in.
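For what it's worth, the call itself is simple; a rough sketch, assuming two SingleCellExperiment objects sceA and sceB (placeholder names) holding counts for the same genes:

```r
library(batchelor)

# Rescales the size factors so that coverage is comparable across batches,
# then recomputes log-normalized expression values within each object.
out <- multiBatchNorm(sceA, sceB)
sceA.norm <- out[[1]]
sceB.norm <- out[[2]]
```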
Mind you, I wouldn't be so sure that the assumption for calculateSumFactors() is weaker. It's true that we only require a non-DE majority between pairs of clusters, but if there's a strong batch effect, each batch will form separate clusters. This means you'll eventually be working with pairs of clusters from different batches, so the DEGs due to the batch effect will add onto any DEGs from the cluster-to-cluster differences. In contrast, multiBatchNorm() only cares about DEGs between the averages of the batches; so, if the cell type composition doesn't change between batches, then we only have to worry about the batch-induced DEGs.
In terms of the bigger picture, though, I don't think it matters all that much; these details are relatively minor compared to the heavy distortions to the data introduced by MNN correction and its contemporaries.
Is there a reason for that?
I must admit that I don't really remember all that much. If I had to say, we probably used a higher filter for multiBatchNorm() because we were potentially dealing with read count data + UMI count data, and I erred on the side of having a higher threshold to match the higher counts in the former. (At the same magnitude, read counts are noisier than UMI counts, hence the need for adjustment when filtering for informative genes.)
Using multiBatchNorm with min.mean = 1 seems indeed to give me better clustering results (after batchelor::fastMNN correction) than using min.mean = 0.1.
I don't really have any idea of why this might be, so... ¯\_(ツ)_/¯
If you're curious, you can probably calculate the scaling applied to the size factors for each batch. As in, take the sizeFactors() before running multiBatchNorm(), and then use them to divide the size factors in the output objects. The ratio will be constant for all cells in each batch, but different across batches; I would be interested to know whether you see some notable differences for min.mean=1 versus min.mean=0.1.
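A rough sketch of what I mean, assuming a single combined SingleCellExperiment sce that already has size factors (e.g., from calculateSumFactors()) plus a per-cell batch vector (placeholder names):

```r
library(batchelor)

before <- sizeFactors(sce)  # size factors prior to multiBatchNorm()
rescaled <- multiBatchNorm(sce, batch = batch, min.mean = 1)

# Per-cell ratio of rescaled to original size factors; this should be
# constant within each batch, so summarize with one value per batch.
scaling <- sizeFactors(rescaled) / before
tapply(scaling, batch, median)
```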
Thank you, Aaron.
Indeed, but the composition does change between my batches. Also, I proposed using quickCluster with block equal to the batch, so the clusters will be batch-specific by design.

Here you go:

min.mean = 1: 1.562508 5.834937 1.708959 2.902615 1.426514 4.478574 1.000000 3.274636 1.144713 3.347561
min.mean = 0.1: 1.630960 4.565926 1.676077 2.556120 1.340033 3.366067 1.000000 2.709936 1.203517 2.916712

After lots of experimentation, I got the best clustering results after downsampling the batches using DropletUtils::downsampleBatches and applying the concatenation strategy for normalisation.

Yes, downsampling is the theoretically safest approach in terms of equalizing coverage without distorting the mean-variance relationship. I'm reluctant to recommend it as the default because it throws away so much information, but I am not too surprised that it does well. I guess you've already read the relevant section of the OSCA book, but here's another one that may be of interest.
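For anyone reading along, a rough sketch of that downsampling-plus-concatenation approach (combined and batch are placeholder names, and I'm assuming downsampleBatches() returns a single downsampled matrix when given one count matrix plus a batch vector):

```r
library(DropletUtils)
library(scran)
library(scuttle)

# Downsample the counts so that average coverage is equalized across batches.
counts(combined) <- downsampleBatches(counts(combined), batch = batch)

# Then the concatenation strategy: batch-wise clustering followed by
# pooled size factor estimation on the combined dataset.
clusters <- quickCluster(combined, block = batch)
sizeFactors(combined) <- calculateSumFactors(combined, clusters = clusters)
combined <- logNormCounts(combined)
```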