Hello, I want to use SC3 on data sets from multiple batches. I use the fastMNN() function from the Scran/Scater packages for batch correction, but it does not affect logcounts. Instead, it creates a reduced dimension, "MNN", that holds the corrected data and is also used in the clustering step. How can I use SC3 with these values? Can I create a new SingleCellExperiment from the MNN matrix and run SC3 on that matrix? The MNN matrix includes negative values, so I know I should not set the gene_filter parameter to TRUE.
Thank you in advance.
Thank you for the answer. As you said, negative values affect the results, as expected. To test this, I ran the sc3_estimate_k function on both the data set itself and the reduced dimension (PCA with the first 50 PCs in that case); the estimated k was 27 for the full data and 5 for the reduced dimension. That is probably not a good way to do it. Since data sets from different batches are really common, what is the best way to use SC3 on these kinds of data sets? I did look for a method that corrects all logcounts but could not find one.
There are lots of batch correction methods at the moment, though not all of them correct the expression matrix. For those that don't, you could use other clustering methods, such as Louvain clustering on a kNN graph (the default in the scanpy package). Here we cover some of the batch correction methods: R - https://github.com/cellgeni/notebooks/blob/master/files/notebooks/10X-batch-correction-harmony-mnn-cca-other.Rmd Python - https://github.com/cellgeni/notebooks/blob/master/files/notebooks/10X-batch-correction-bbknn-scanorama.ipynb
Thank you for the answer. I am actually planning to use MNN correction, as it is more suitable for my situation and for the further analyses I have in mind. MNN can create a corrected expression matrix, but that matrix also has negative values (due to cosine normalization, I believe). I took the risk and ran SC3 on this corrected matrix, but I get NAs in the clustering results.
I'll chip in here and mention that a batch correction method will only be able to preserve zeroes if it is aware that the data are derived from counts. This is not the case for the vast majority of methods, which operate on transformed expression values where the count-based nature of the data is lost. And for good reason; the theory for count-based models is difficult. (See batchelor::rescaleBatches() for a limited exception.) Indeed, there is no philosophical reason that log-expression values should be non-negative. The fact that they often are is simply a matter of practical convenience to avoid loss of sparsity.

Now, I can't remember exactly what special stuff SC3 does, but if you just want to do no-frills k-means clustering, you can apply kmeans() on the low-dimensional MNN corrected values. Any feature selection should have been done before MNN correction anyway.
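
For what it's worth, the no-frills k-means suggestion could look roughly like the sketch below. It uses a simulated matrix as a stand-in for the MNN-corrected low-dimensional values (cells in rows), so the data, the number of dimensions, and the choice of k = 2 are all assumptions for illustration, not recommendations. The point is only that base R's kmeans() is happy with negative values.

```r
# Simulated stand-in for the low-dimensional MNN-corrected values
# (cells in rows, dimensions in columns). This is NOT real MNN output.
set.seed(42)
n_per_group <- 100
n_dims <- 10
mnn <- rbind(
  matrix(rnorm(n_per_group * n_dims, mean = -0.5, sd = 0.1), ncol = n_dims),
  matrix(rnorm(n_per_group * n_dims, mean =  0.5, sd = 0.1), ncol = n_dims)
)

# The matrix contains negative entries, as corrected values often do.
# k-means only relies on Euclidean distances, so this is not a problem.
stopifnot(any(mnn < 0))

# k is an assumption here; choose it for your own data.
k <- 2

# Several random starts make the result less sensitive to initialisation.
km <- kmeans(mnn, centers = k, nstart = 25)

# One cluster label per cell.
clusters <- factor(km$cluster)
table(clusters)
```

With a real SingleCellExperiment, the matrix would presumably come from something like reducedDim(sce, "MNN") (using the "MNN" name mentioned earlier in this thread), and the labels could be stored back with sce$cluster <- clusters.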