I am following the Bioconductor simpleSingleCell workflow for droplet-based data and have a question regarding pre-processing. I have 10x scRNA-seq data from multiple samples. These were prepared in different wells on the same Chromium chip and run in the same lane of a single flow cell on a HiSeq 4000. I ultimately want to perform comparisons between the samples, but I'm not sure at what stage of pre-processing the samples should be combined. In particular, the RNA content and activity of cells may differ markedly between samples, so I assume the empty droplet detection step should be performed independently for each sample? Given that cells from different samples are physically separated on the Chromium chip, I also assume doublet detection should be performed independently for each sample?
My proposed workflow would be the following:
1. Remove barcode swapping (All samples)
2. Remove empty droplets (Per sample)
3. Calculate QC metrics (Per sample)
4. Remove low quality cells (Per sample)
5. Assign cell cycle phases (Per sample)
6. Remove zero count genes (Per sample, may cause problems later)
7. Normalization for cell-specific biases (Per sample)
8. Modelling the mean-variance trend (Per sample)
9. Dimensionality reduction (Per sample)
10. Clustering (Per sample)
11. Remove doublets detected by clusters / by simulation (Per sample)
12. Combine raw count matrices from all remaining cells across samples
13. Go back to the normalization step (7) and process all samples together
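
To make my concern about the zero count gene step concrete, here is a minimal, language-agnostic sketch in plain Python (not the Bioconductor API). It assumes each sample is held as a list of cell barcodes plus a gene-to-counts mapping; the names `drop_zero_count_genes` and `combine_samples` and the toy data are hypothetical and only for illustration. If zero count genes are removed per sample, the samples end up with different gene sets, so combining the raw counts later has to happen on the union of genes with zeros filled in. Barcodes are also prefixed with the sample name, since the same barcode sequence can occur in more than one sample.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: gene name -> raw counts, one entry per cell.
Counts = Dict[str, List[int]]
Sample = Tuple[List[str], Counts]  # (cell barcodes, counts)


def drop_zero_count_genes(counts: Counts) -> Counts:
    """Per-sample step: remove genes with no counts in any cell of this sample."""
    return {gene: values for gene, values in counts.items() if sum(values) > 0}


def combine_samples(samples: Dict[str, Sample]) -> Tuple[List[str], Counts]:
    """Combine raw counts across samples on the union of genes.

    Genes that were dropped from (or never present in) a sample are
    filled with zeros for every cell of that sample.
    """
    genes = sorted({g for _, counts in samples.values() for g in counts})
    barcodes: List[str] = []
    combined: Counts = {g: [] for g in genes}
    for name, (cells, counts) in samples.items():
        # Prefix barcodes so identical sequences from different samples stay distinct.
        barcodes.extend(f"{name}_{bc}" for bc in cells)
        for g in genes:
            combined[g].extend(counts.get(g, [0] * len(cells)))
    return barcodes, combined


# Toy data (made up): two samples that express different genes.
raw = {
    "A": (["AAAC", "AAAG"], {"Gene1": [3, 0], "Gene2": [0, 0], "Gene3": [1, 2]}),
    "B": (["AAAC"], {"Gene1": [0], "Gene2": [5], "Gene3": [4]}),
}

# Per-sample filtering leaves A with Gene1/Gene3 and B with Gene2/Gene3.
filtered = {name: (cells, drop_zero_count_genes(c)) for name, (cells, c) in raw.items()}

barcodes, combined = combine_samples(filtered)
print(barcodes)
print(combined)
```

In this sketch the per-sample gene filter becomes harmless only because the combine step restores the missing genes as zeros; taking the intersection of genes instead would silently discard genes that are specific to one sample.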
Does this seem reasonable, or am I over-complicating the pre-processing steps?
Thank you, Aaron, for confirming that empty droplet and doublet detection should be performed separately for each sample. However, I'm not sure why you suggested looking at the batch correction workflow. The samples were all prepared on the same chip (albeit in different channels) and sequenced on the same lane of a flow cell. I assumed batch correction was only appropriate when, for example, the samples are sequenced on different dates or prepared by different labs. Additionally, two of the samples contain distinct cell types, so would MNN-based correction still work? Apologies if I have misunderstood anything in your reply.