Can we say they are low quality clusters?
You could say that; I prefer the term "poorly separated". However, "poorly separated" does not mean "useless". For me, clustering is just a way of breaking up the data set into parts that can be comprehended. A useful clustering procedure will deliver parts (i.e., clusters) that relate to biological concepts like cell types or status or whatever, which gives us something concrete to think about when we're trying to understand the data. If you treat clustering as a tool in this manner, then even poorly separated clusters are useful if they break up a big blob of cells into easily digestible chunks.
You might think that poorly separated clusters are less likely to be "real". But clusters are inherently empirically defined, so there's not much meaning to discussing whether they are real or not. The more pertinent question is whether two poorly-separated clusters are derived from the same underlying biological aspect or whether they correspond to different aspects, e.g., different cell types. Answering this question requires some strong assumptions about how these aspects manifest in the data, like "cells are normally distributed around an average expression profile for each cell type" (e.g., k-means). I do not find this line of investigation to be particularly interesting, and would rather spend my efforts on achieving a useful clustering that yields some hypotheses for experimental validation.
You might also think that poorly separated clusters are less stable in the sense that, under slightly different circumstances, they will merge with neighbouring clusters. This is a valid concern, but more from a perspective of logistics - it's annoying to have to re-find a poorly separated cluster if it keeps on merging with its neighbours every time you change a parameter in the upstream steps of your analysis. However, the cluster hasn't "disappeared" - the cells that make it up are still there, it's just the way you're summarizing the data that has changed.
So, sure, I would be more inclined to work on well-separated clusters, but only because it's easier. There is usually some important biology that occurs in poorly separated clusters, so it would be silly to dismiss them out of hand.
Can we use the logcounts of data from different conditions (such as healthy and disease) for differential expression analysis before we use MNN?
Yes. See for example here for a pseudo-bulk analysis. The workflows here demonstrate how to do this for each cell type after clustering on the MNN-corrected values. (Note that only the clustering is done on the corrected values, the DE analysis is done on the counts!)
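To make the pseudo-bulk idea concrete, here is a minimal sketch in plain Python. It is not the code from the linked workflows; the function name `pseudo_bulk` and the toy data are hypothetical. It only shows the aggregation step: raw counts (not logcounts, and not MNN-corrected values) are summed over all cells that share a sample and a cluster label, and the DE analysis would then be run on these summed profiles.

```python
from collections import defaultdict


def pseudo_bulk(cell_counts, samples, clusters):
    """Sum raw counts over all cells sharing a (sample, cluster) label.

    cell_counts: one dict per cell, mapping gene name -> raw count.
    samples, clusters: one label per cell, in the same order.
    Returns a dict mapping (sample, cluster) -> {gene: summed count}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for counts, sample, cluster in zip(cell_counts, samples, clusters):
        for gene, n in counts.items():
            totals[(sample, cluster)][gene] += n
    return {key: dict(genes) for key, genes in totals.items()}


# Toy data (hypothetical): four cells from two samples.
cell_counts = [
    {"GeneA": 3, "GeneB": 0},
    {"GeneA": 1, "GeneB": 2},
    {"GeneA": 0, "GeneB": 5},
    {"GeneA": 4, "GeneB": 1},
]
samples = ["healthy1", "healthy1", "disease1", "disease1"]
clusters = ["c1", "c1", "c1", "c2"]

profiles = pseudo_bulk(cell_counts, samples, clusters)
```

The cluster labels here would come from clustering on the corrected values, while the counts being summed are the original, uncorrected ones.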
Thank you Aaron. So basically, poorly separated clusters are not useless and can still hold valuable information, and among 5-6 clusters, choosing the one with the highest modularity means using the "best separated" one. And thank you for the information about DE analysis. Since the MNN correction does not affect the logcounts, we can do DE analysis on data from different batches without using batch correction. But I am wondering: the counts must have a batch effect, so do we just ignore it?
As a last question about scater: the findMarkers function generates possible markers for each cluster, but what is the actual meaning of the "Top" column in the results?
Thank you for all your answers.
I'm not entirely sure what you mean by this, so I'll just say these things:
See the section "Consolidating p-values into a ranking" in ?combineMarkers (which is called by findMarkers). Basically, if you take the set of genes with Top=1, this is the same as taking the top DE gene from each pairwise comparison between your cluster of interest and every other cluster. The idea is to use a set of genes to define the cluster, rather than relying on a single marker gene that is DE between the current cluster and all other clusters. The latter scenario may not even exist, see my CD4/CD8 comments.
Thank you Aaron,
"Since the MNN correction does not affect the logcounts, we can do DE analysis on data from different batches without using batch correction. But I am wondering: the counts must have a batch effect, so do we just ignore it?"
Actually, I am using some external methods such as ROTS for differential expression analysis, and I use the logcounts. My question is: this method does not have any parameter for batch. Is using these kinds of methods wrong or acceptable?
If you must use a method that cannot consider batch terms, you should perform a meta-analysis instead. That is, perform differential expression between cell types or clusters within each batch (where, by definition, there is no batch effect), and combine the p-values across batches. However, if your comparisons are confounded with batch, then you're in trouble. If this is the case, performing the DE analysis on the corrected values will not help as your experimental design is fundamentally flawed.
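As a rough illustration of the "combine the p-values across batches" step, here is a minimal sketch in plain Python. It assumes Fisher's method as the combining rule, which is one common choice and not necessarily the one intended above. The function name `fisher_combine` and the per-batch p-values are hypothetical. In practice, the per-batch p-values would come from running the DE method (e.g., ROTS) separately within each batch.

```python
import math


def fisher_combine(pvalues):
    """Combine independent p-values for one gene using Fisher's method.

    The statistic -2 * sum(log p) follows a chi-square distribution with
    2k degrees of freedom under the null, where k is the number of p-values.
    For even degrees of freedom the upper tail has a closed form:
        exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    k = len(pvalues)
    # Floor tiny p-values to avoid log(0).
    half_stat = -sum(math.log(max(p, 1e-300)) for p in pvalues)
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


# Hypothetical within-batch p-values for the same comparison in 3 batches.
per_batch_pvalues = {
    "GeneA": [0.01, 0.03, 0.02],
    "GeneB": [0.40, 0.75, 0.20],
}
combined = {gene: fisher_combine(ps) for gene, ps in per_batch_pvalues.items()}
```

Note that Fisher's method ignores the direction of the effect, so a gene that goes up in one batch and down in another can still receive a small combined p-value. It is worth checking that the log-fold changes agree in sign across batches before interpreting the result.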