There are a number of ways to do this, but in all cases, you should be computing doublet scores within each batch. It is obviously impossible to get a doublet consisting of cells from different batches! My favored approach is to:
- Compute doublet scores within each batch, but do not remove the high-scoring cells.
- Do the batch correction with all cells.
- Mark clusters as doublets if they contain many cells with high doublet scores.
This is motivated by the fact that not all doublets will be assigned high doublet scores. (This is simply a consequence of the assumptions that are necessary to get doubletCells to work, see comments here.) By leaving in the doublets, we can use "guilt by association" to identify the cluster of doublet cells. If we removed all cells with high doublet scores beforehand, we would not be able to detect these troublesome clusters, as all of the remaining doublets would have low scores.
From a workflow perspective, doublets are of such low frequency that leaving them in will probably not do much harm. In addition, they are fairly well behaved as sequencing libraries go (e.g., high library sizes, lots of detected genes) and their expression profiles are, by definition, within the range of observed expression profiles in the population (e.g., you won't get different HVGs during feature selection). This is unlike, say, low-quality libraries that could really interfere with your normalization, feature selection, PCA, etc.
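As a rough illustration of the "guilt by association" step described above, here is a minimal sketch in plain Python (not the R workflow itself). It assumes you already have one doublet score per cell, computed within each batch and concatenated in the same cell order as the cluster labels from the corrected data. The function name, the `score_cutoff` and the `min_fraction` parameter are hypothetical choices for illustration, not recommended values.

```python
from collections import defaultdict

def flag_doublet_clusters(scores, clusters, score_cutoff, min_fraction=0.5):
    """Return the cluster labels in which at least `min_fraction` of the
    cells have a doublet score above `score_cutoff` (both are assumptions)."""
    n_cells = defaultdict(int)
    n_high = defaultdict(int)
    for score, cluster in zip(scores, clusters):
        n_cells[cluster] += 1
        if score > score_cutoff:
            n_high[cluster] += 1
    return {c for c in n_cells if n_high[c] / n_cells[c] >= min_fraction}

# Toy data: per-batch scores, concatenated to match the cluster labels.
scores = [0.1, 0.2, 0.1, 3.0, 0.3, 2.5, 0.2, 2.8]
clusters = ["A", "A", "A", "B", "B", "B", "A", "B"]

flagged = flag_doublet_clusters(scores, clusters, score_cutoff=1.0)
print(flagged)
```

In this toy example, the cell in cluster B with a score of 0.3 would have survived a per-cell filter, but it is caught because the cluster as a whole is flagged.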
Thank you Aaron. Your approach looks really useful and makes sense. I was thinking the same but just wanted to be sure. After computing doublet scores for each dataset, I will merge the scores into a single vector (same length as the number of cells), assign it to the corrected SingleCellExperiment object, and use t-SNE to examine the clusters.
For a single dataset, is there a recommended threshold for doublet scores, or is using NMADs an option as usual?
Thank you in advance.
I have recently been through this for a set of many 10X samples (what I ended up doing is shown here).
In essence, I first calculated the scores and called doublets within samples, then performed another round of calling across all samples to identify where I had missed calls in individual samples. Or, in more depth:
This is shown with figures in the HTML file at the link above. There are some things I would change in retrospect (e.g. using NMADs, as you say). I note that the difficulty of clustering and identifying doublets will depend a lot on how different the cell types actually are in your data (e.g. I suspect adult tissue would be easier than my embryonic samples). I would also recommend visualising your scores and clusters on e.g. t-SNE all the way through to make sure nothing crazy is happening.
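For what an NMAD-based call within a single sample might look like, here is a small sketch in plain Python. It is only an illustration of the general idea (median plus a number of scaled median absolute deviations), not what was actually run in the linked analysis. The function name and the default of 3 NMADs are assumptions, and the 1.4826 scaling is the usual normal-consistency constant (the same default as R's mad).

```python
import statistics

def nmad_outliers(scores, nmads=3.0):
    """Flag scores more than `nmads` scaled MADs above the median.
    Intended to be applied to one sample at a time."""
    med = statistics.median(scores)
    mad = statistics.median([abs(s - med) for s in scores])
    threshold = med + nmads * 1.4826 * mad
    return [s > threshold for s in scores]

# Toy doublet scores for one sample; the last cell is an obvious outlier.
sample_scores = [0.1, 0.2, 0.15, 0.12, 0.18, 0.22, 0.16, 5.0]
calls = nmad_outliers(sample_scores)
print(calls)
```

Doublet scores tend to be right-skewed, so you may prefer to apply this to log-transformed scores; that is a judgment call and is not shown here.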
I hope this is useful!