I'm using SingleR to annotate cell types in my single cell data but I recently discover a potential problem. It seems there is no minimum treshold for the correlations used for the transfer - meaning predictions will (almost) always be made even if the data does make sense.
To illustrate this point lets use at the example from the vignette
# BiocManager::install("SingleR")
library(SingleR)
### Load refrence data
hpca.se <- HumanPrimaryCellAtlasData()
hpca.se
# BiocManager::install("scRNAseq")
library(scRNAseq)
hESCs <- LaMannoBrainData('human-es')
hESCs <- hESCs[,1:100]
# BiocManager::install("scater")
library(scater)
hESCs <- logNormCounts(hESCs)
pred.hesc <- SingleR(test = hESCs, ref = hpca.se, labels = hpca.se$label.main)
sum(!is.na(pred.hesc$labels))
100
table(pred.hesc$labels)
Astrocyte Neuroepithelial_cell Neurons
14 81 5
That works very nicely - we get 100 predictions from 5 cell types. (Although it can be debated whether these labels are good for embryonic stem cells).
Now lets scramble the data and see what happens.
### Scamble data
logCounts <- assay(hESCs, 'logcounts')
resampleData <- matrix(
data = sample(unlist( logCounts )),
ncol = ncol( logCounts ),
nrow = nrow( logCounts )
)
rownames(resampleData) <- rownames( logCounts )
colnames(resampleData) <- colnames( logCounts )
### Verify no resemblance to original data
round( summary(sapply(1:ncol(logCounts), function(x) {
cor(logCounts[,x],resampleData[,x])
})), 2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.02 0.00 0.00 0.00 0.01 0.02
### Rerun SingleR
pred.rand <- SingleR(test = resampleData, ref = hpca.se, labels = hpca.se$label.main)
sum(!is.na(pred.rand$labels))
100
length(unique(pred.rand$labels))
29
So this time - with meaningless data we still get a hundred predictions (of very different cell types).
Is there a way to impose a minimum treshold on the correlations, possibly by comparing to scrambled data, to ensure cell types are only predicted when there is sufficient evidence?
Cheers Kristoffer
Hi Aaron
Thanks for taking time to answer. Good point about it only being a problem if your cells do not exist in the reference.
Could you elaborate on what you mean by "scrambling will just generate a strawman"?
Cheers Kristoffer
https://en.wikipedia.org/wiki/Straw_man
In this context, I'm basically saying that the null distribution of correlations that you get by scrambling the gene expression values is so divorced from reality that it doesn't really provide any meaningful baseline for comparison. For example, different cell types will almost always have a positive correlation in their whole-transcriptome expression profiles, simply due to the fact that everyone expresses ribosomal proteins and histones and actin and that pesky MALAT1.
The story is probably a little murkier when you select on marker genes, as SingleR does. Then the question is, "do your selected markers overlap with any markers for cell types that are in your data but not in the reference?" If not, then you're in luck; shuffling would allow you to generate a reasonably realistic null distribution. (Here, the null hypothesis is that your cells do not come from your reference, but instead belong to an unknown cell type.) But if they do overlap, then that's too bad, because the true null distribution will always contain some positive correlation to whichever reference cell type is closest to the unknown cell type, and this would not be captured by shuffling.
The likelihood that you get an overlap - and that shuffling will fail - increases as your reference becomes more comprehensive and you have a more expansive list of markers. This is actually kind of convenient as your need for any shuffling at all decreases when the reference becomes more likely to contain your cell type of interest. But I wouldn't be able to confidently say that shuffling would work for more focused references, either; lots of genes have multiple roles in different cells, so I would well imagine overlaps in places you would not expect a priori.
Incidentally, the approach used by
pruneScores
is to define the null hypothesis as "the cell does come from a cell type in the reference". The aim is to reject the null when a cell has an unusually low correlation. The trick is to define "unusually low" - read the documentation for more information, but basically, this is done by looking at the distribution across cells and cutting off the low outliers.Thanks for the elaborate answer. Really appreciate it! :-)