Should doubletCells() be used before or after batch correction?
@hamza_karakurt-17704
Turkey

Hello, I am doing an scRNA-seq analysis and want to use the doubletCells() function to identify possible doublets. My data come from 4 different batches, and I use fastMNN() for batch correction. Which approach would be better in this situation: running doubletCells() on each data set before batch correction, removing cells with high scores as doublets, and then doing the batch correction; or running doubletCells() after fastMNN() on the counts of all data sets combined? (For the second option, I think I would have to use computeSumFactors() on the counts of all data sets.)

Thank you in advance.

scater scran scRNA-Seq doubletCells
Aaron Lun ★ 28k
@alun
The city by the bay

There are a number of ways to do this, but in all cases you should compute doublet scores within each batch. It is obviously impossible to get a doublet consisting of cells from different batches! My favored approach is to:

  1. Compute doublet scores within each batch, but do not remove any cells at this point.
  2. Do the batch correction with all cells.
  3. Mark clusters as doublets if they contain many cells with high doublet scores.
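
A minimal base R sketch of the bookkeeping behind step 1, assuming a genes-by-cells count matrix `counts` and a per-cell `batch` factor. `score_one_batch` is a hypothetical stand-in for the real scoring call (for example doubletCells() run on one batch's counts); here it only returns library sizes so that the example runs without any Bioconductor packages. The point is that each batch is scored on its own and the scores are written back in the original cell order, with no cells dropped.

```r
# Toy data: 5 genes x 8 cells from two batches (values are made up).
set.seed(1)
counts <- matrix(rpois(40, lambda = 5), nrow = 5,
                 dimnames = list(paste0("gene", 1:5), paste0("cell", 1:8)))
batch <- factor(c("A", "B", "A", "A", "B", "B", "A", "B"))

# HYPOTHETICAL stand-in for the per-batch scoring step. In a real analysis
# this would call the doublet scoring function on that batch's counts.
score_one_batch <- function(mat) colSums(mat)

# Score each batch separately, then write the results back by cell position
# so the final vector lines up with the columns of 'counts'.
doublet_score <- numeric(ncol(counts))
for (b in levels(batch)) {
  idx <- which(batch == b)
  doublet_score[idx] <- score_one_batch(counts[, idx, drop = FALSE])
}
names(doublet_score) <- colnames(counts)
```

The resulting vector has one entry per cell, so it can be attached to the batch-corrected object as per-cell metadata before clustering.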

This is motivated by the fact that not all doublets will be assigned high doublet scores. (This is simply a consequence of the assumptions that are necessary to get doubletCells to work, see comments here.) By leaving in the doublets, we can use "guilt by association" to identify the cluster of doublet cells. If we removed all cells with high doublet scores beforehand, we would not be able to detect these troublesome clusters as all of the remaining doublets would have low scores.
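
The "guilt by association" idea can be sketched in base R as follows. The scores and cluster labels are made-up toy values, and both cutoffs (top 20% of scores counts as "high", a cluster is flagged when at least half of its cells are high) are illustrative assumptions, not recommendations from this thread.

```r
# Toy per-cell doublet scores and cluster labels (made-up values).
# In practice the scores come from per-batch scoring and the clusters
# from the batch-corrected data.
doublet_score <- c(0.1, 0.2, 0.1, 0.3, 5.0, 0.4, 6.0, 0.2, 0.1, 0.3)
cluster <- factor(c(1, 1, 1, 2, 3, 3, 3, 2, 2, 1))

# ASSUMPTION: a score is "high" if it is in the top 20% of all cells.
high <- doublet_score > quantile(doublet_score, 0.8)

# Per-cluster summaries: median score and fraction of high-scoring cells.
median_score <- tapply(doublet_score, cluster, median)
frac_high <- tapply(high, cluster, mean)

# ASSUMPTION: flag clusters where at least half of the cells score highly.
doublet_clusters <- names(frac_high)[frac_high >= 0.5]

# Every cell in a flagged cluster is marked, including low-scoring members.
is_doublet <- cluster %in% doublet_clusters
```

In this toy example, the sixth cell has a low score of 0.4 but is still marked because it sits in a cluster dominated by high-scoring cells. That is the case that would be missed if high-scoring cells had been removed before clustering.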

From a workflow perspective, doublets are of such low frequency that leaving them in will probably not do much harm. In addition, they are fairly well behaved as sequencing libraries go (e.g., high library sizes, lots of detected genes) and their expression profiles are, by definition, within the range of observed expression profiles in the population (e.g., you won't get different HVGs during feature selection). This is unlike, say, low-quality libraries that could really interfere with your normalization, feature selection, PCA, etc.


Thank you Aaron, your approach looks really useful and makes sense. I was thinking the same but just wanted to be sure. After computing doublet scores for each data set, I will merge the scores into a single vector (same length as the number of cells), assign it to the corrected SingleCellExperiment object, and use t-SNE to examine the clusters.

For a single data set, is there a recommended threshold for doublet scores, or is an NMADS-based cutoff an option as usual?
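
For reference, this is what an NMADS-style cutoff on doublet scores could look like in base R. It is only a sketch of the idea being asked about: the toy scores are made up, and the log transform, the pseudo-count of 1 and `nmads = 3` are illustrative assumptions rather than thresholds endorsed in this thread.

```r
# Toy doublet scores for one data set (made-up values).
doublet_score <- c(0.5, 0.8, 1.0, 1.2, 0.9, 1.1, 0.7, 1.0, 25.0, 40.0)

# ASSUMPTION: scores have a long right tail, so work on a log scale
# with a pseudo-count of 1.
log_score <- log1p(doublet_score)
nmads <- 3

# Upper cutoff = median + nmads * MAD.
# Base R's mad() already applies the 1.4826 scaling constant by default.
cutoff <- median(log_score) + nmads * mad(log_score)
high_score <- log_score > cutoff
```

Only the upper tail is tested here, since low doublet scores are not of interest.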

Thank you in advance.


I have recently been through this for a set of many 10X samples (what I ended up doing is shown here).

In essence, I first calculated the scores and called doublets within samples, then performed another round of calling across all samples to identify where I had missed calls in individual samples. Or, in more depth:

  1. Get scores separately within each sample
  2. Calculate clusters within each sample (I had to really cluster finely to properly separate the doublet clusters, by the way)
  3. Call doublet clusters in each sample (e.g. by identifying outlying clusters with high median doublet score). Label all cells in the doublet clusters as doublets.
  4. Batch correct all samples together
  5. Cluster within the all-sample corrected data
  6. Identify all-sample clusters that contain a disproportionately high number of cells that were called as doublets in their own samples; label all cells in these clusters as doublets. This is the across-sample sweep step.
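
The across-sample sweep in step 6 can be sketched in base R as below. The inputs are made-up toy values, and two things are assumptions for illustration: "disproportionately high" is taken to mean at least twice the overall doublet fraction, and cells already called within their own sample are assumed to keep that call.

```r
# Toy inputs (made-up values): within-sample doublet calls from step 3,
# and cluster labels from the all-sample corrected data from step 5.
called_in_sample <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE,
                      FALSE, FALSE, FALSE, TRUE)
all_sample_cluster <- factor(c("a", "a", "b", "a", "b", "b", "c", "c",
                               "c", "a", "b", "c"))

# Fraction of previously called doublets in each all-sample cluster,
# compared with the overall fraction across every cell.
frac_called <- tapply(called_in_sample, all_sample_cluster, mean)
overall <- mean(called_in_sample)

# ASSUMPTION: "disproportionately high" = at least twice the overall rate.
sweep_clusters <- names(frac_called)[frac_called >= 2 * overall]

# ASSUMPTION: earlier within-sample calls are kept; the sweep only adds
# cells that sit in a swept cluster but were missed in their own sample.
in_swept <- all_sample_cluster %in% sweep_clusters
final_doublet <- called_in_sample | in_swept
newly_called <- in_swept & !called_in_sample
```

Here cluster "b" is swept, and the one cell in it that was not called within its own sample is picked up by `newly_called`.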

This is shown with figures in the HTML file linked above. There are some things I would change in retrospect (e.g. using NMADS, as you say). Note that the difficulty of clustering and identifying doublets will depend a lot on how different the cell types actually are in your data (e.g. I suspect adult tissue would be easier than my embryonic samples). I would also recommend visualising your scores and clusters (e.g. on a t-SNE plot) at every stage to make sure nothing crazy is happening.

I hope this is useful!

