(Apologies, this got rather long ...)
I've encountered a situation where DropletUtils::emptyDrops()
was calling 'too many' non-empty droplets and would appreciate some advice.
Initially, I noticed this affecting 1/60 captures in a large scRNA-seq experiment I'm analysing.
But upon investigating the issue further I'm wondering whether I should also revise how non-empty droplets are being identified in the other 59/60 captures (and more generally in other similar-ish experiments).
I found similarities to a previously reported issue ('emptyDrops behaviour when excluding mitochondrial/ribosomal genes') whereby the performance of emptyDrops()
, and in particular the identification of the knee point in the barcode rank plot, was improved by excluding mitochondrial and ribosomal protein genes (this is now documented in the 'Non-empty droplets versus cells' section of the man page for emptyDrops()
).
In my case, the knee point is being properly identified, but there are still too many non-empty droplets.
Another point of difference to the earlier report is that the genes driving the differences between the retained low-count barcodes and discarded barcodes in this capture are T-cell receptor (TCR) genes, which I suspect may suggest the need for assay-specific or sample-specific gene filtering.
Some details of my experiment:
- Human PBMCs profiled with 10x Genomics 5' chemistry (v2), including gene expression (GEX), a large panel (BioLegend's TotalSeq™-C Human Universal Cocktail) of antibody derived tags (ADTs), and VDJ sequencing of T-cell receptors (TCRs) and B-cell receptors (BCRs).
- 12 batches (named 3-14 for reasons), with each batch containing 5 captures, and targeting ~20,000 cells/capture.
- Within a batch, each capture is roughly a technical replicate because the cells come from the same population (a tube containing a mixture of PBMCs from different donors labelled with HTOs) which are then distributed across the 5 captures.
Initial problematic capture
The problematic capture is capture 1 of batch 13 (C120_batch13_1
).
The number of non-empty droplets in this capture is 3x larger than is reasonable/expected (~65,000 rather than ~20,000 non-empty droplets from 1 lane (capture) of 10x 5' scRNA-seq data).
The other captures within this batch have similar and expected numbers of non-empty droplets.
After excluding mitochondrial, ribosomal protein, TCR, and BCR genes (hereafter referred to as gene-filtering), emptyDrops()
returns a number of non-empty droplets for C120_batch13_1
that is more like what I expect (~20,000).
Interestingly, Cell Ranger (v7) reports ~20,000 cells for this capture, so presumably they do some gene-filtering before applying their implementation of the Empty Drops algorithm.
Capture | emptyDrops() without gene-filtering |
emptyDrops() with gene-filtering |
Cell Ranger (v7) |
---|---|---|---|
C120_batch13_1 | 65,561 | 20,288 | 20,209 |
C120_batch13_2 | 22,627 | 22,694 | 22,602 |
C120_batch13_3 | 22,020 | 22,169 | 22,102 |
C120_batch13_4 | 22,297 | 22,504 | 22,425 |
C120_batch13_5 | 24,739 | 24,786 | 24,656 |
Following the advice in OSCA Advanced 'Testing for empty droplets' and 'emptyDrops behaviour when excluding mitochondrial/ribosomal genes', I made:
- Barcode rank plots.
- A histogram of P-values for low-total barcodes.
- MA plot comparing the retained low-count barcodes to the discarded barcodes.
Barcode rank plots
In these plots, the top row is without gene filtering and the bottom row is after filtering out mitochondrial, ribosomal protein, TCR, and BCR genes. Nothing seems amiss in the barcode plots, with the knee being well-identified in all 5 captures. The same is true without and with gene-filtering.
Histogram of P-values
In these plots, the top row is without gene-filtering and the bottom row is with gene-filtering.
You can clearly see the P-value distribution is not uniform for C120_batch13_1
without gene-filtering but looks reasonable-ish for the other 4 captures.
I was a little surprised that the P-value histograms are, if anything, 'worse' after gene filtering.
However, given the non-empty droplet calls look more 'reasonable' after gene filtering, I'm leaning towards overlooking this anomaly(?), although it would be good to understand why this is and if I should be more concerned by it.
MA plot
In these plots,
e0
=emptyDrops(m, test.ambient = TRUE)
e1
=emptyDrops(m[keep, ], test.embient = TRUE)
wherekeep
is a logical vector that isFALSE
for mitochondrial, ribosomal protein, TCR, and BCR genes andTRUE
otherwise.
You can see in C120_batch13_1
that there's a lot of genes with large logFCs but that the TCR genes stand out.
I'm a bit miffed why this has only affected 1/5 captures in this batch and I assume that this capture is likely to be more affected by ambient RNA which I'll need to keep any eye on in downstream analysis.
In the subtitle of each plot I've reported the number of droplets exclusively identified in e0
(|e0_only|
) or exclusively identified in e1
(|e1_only|
) as non-empty.
Extension to other captures in this experiment
I then decided to re-run emptyDrops()
on each of the 60 captures both without and with gene filtering to see if any of the other batches might be affected (I had not previously looked very closely at the emptyDrops()
diagnostics because I'd obtained seemingly sensible estimates of the number of non-empty droplets).
I've uploaded PDFs of the barcode rank plots, P-value histograms, and MA plots.
A few things stand out:
- Barcode plots
- The barcode rank plots all look okay, with the knee point well-identified even when there are small 'kinks' in the plot (e.g.,
batch5
,batch10
). - The plots look much the same whether or not gene filtering was applied.
- The barcode rank plots all look okay, with the knee point well-identified even when there are small 'kinks' in the plot (e.g.,
- P-value histograms
- The distributions are generally quite similar within a batch.
- A non-uniform distribution, with a peak near zero, occurs more frequently than I'd expected to see (
batch3
,batch4
,batch5
,batch8
,batch9
,batch10
,batch13
batch14
) - Occasionally there are distributions with a peak near one for an almost U-shaped distribution (
batch6
) - Filtering genes does not substantially change the P-value histograms. If anything, it sometimes makes it 'worse'.
- Occasionally there is within-batch variation (exemplified by
batch10_1
being less uniform than other captures in that batch but also e.g.,batch7_5
is perhaps less uniform than otherbatch7
captures)
- MA plots
- For most captures the number of droplets exclusively identified without or with gene filtering is relatively small (<5%, being a few hundred out of ~20,000 droplets). Consequently, the pseudobulk profiles of the
e0_only
droplets, shown in the MA plots, aren't based on very large library sizes and the MA plots are quite 'sparse'. - The MA plots are generally quite similar within a batch.
- The set of genes that are more highly expressed in the
e0_only
droplets than in theambient
droplets, if they exist, are almost always readily identifiable asribo
,tcr
, orbcr
enriched:ribo
:batch3
,batch9
,tcr
:batch13
(specificallybatch13_1
)bcr
:batch3
,batch5
,batch6
,batch10
batch13_1
stands out as having a lot ofother
(i.e. non-[mito
|ribo
|tcr
|bcr
]) genes that are highly expressed in thee0_only
droplets over theambient
droplets.- The
mito
genes aren't consistent across batches, but are perhaps more often highly expressed in theambient
droplets than in thee0_only
droplets.
- For most captures the number of droplets exclusively identified without or with gene filtering is relatively small (<5%, being a few hundred out of ~20,000 droplets). Consequently, the pseudobulk profiles of the
Conclusions and remaining questions
I'm satisfied I've found the root cause of the 'too many cells' issue for C120_batch13_1
and have a remedy by filtering out mitochondrial, ribosomal protein, TCR, and BCR genes prior to running emptyDrops()
.
However, I'd appreciate any thoughts on these questions that are still dogging me:
- Should I apply this gene-filtering prior to running
emptyDrops()
for (1) all captures in this batch or (2) even all batches in this experiment or (3) even more broadly (e.g., all 5' datasets or all datasets that include large numbers of T cells or B cells)? - Could filtering out BCR (resp. TCR) genes from
emptyDrops()
lead to B cells (resp. T cells) being marked as empty droplets? E.g., from my woeful knowledge of immunology I _think_ there are certain types of B cells where most of the gene expression is from the immunoglobulins (BCR genes). - How concerned should I be by the non-uniform P-value distributions, particularly because it's seemingly sometimes exacerbated by the gene-filtering?
I've had some further and ongoing discussion with 10x tech support. I've copied the initial reply below and will add any useful updates.
In short, for GEX+VDJ libraries Cell Ranger does additional filtering of barcodes based on the VDJ data (i.e. it's more than just their implementation of the EmptyDrops algorithm applied to the GEX data). I'm still not sure I've quite got my head around it, however, because the information in https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/using/multi#why seems mostly to be about refining non-empty droplet calls for T and B cells. How, then, does Cell Ranger (specifically cellranger multi) estimate/identify non-empty droplets that don't contain T or B cells?
Anyway, as I pointed out in my reply, it would really help if the source code of Cell Ranger was easily browsable and publicly available, because AFAICT the source code is only available up to v3.0.2 (https://github.com/10XGenomics/cellranger).
Knowing what I do now about how Cell Ranger is applying additional and assay-specific steps for non-empty droplet detection in GEX+VDJ+ADT samples, I think I will opt to use Cell Ranger's non-empty droplet detection for these samples.
10x tech support's reply