Question

Impose minimum restrictions on correlations in SingleR

0

Entering edit mode

k.vitting.seerup ▴ 120

@kvittingseerup-7956

Last seen 20 months ago

European Union

I'm using SingleR to annotate cell types in my single cell data but I recently discover a potential problem. It seems there is no minimum treshold for the correlations used for the transfer - meaning predictions will (almost) always be made even if the data does make sense.

To illustrate this point lets use at the example from the vignette

# BiocManager::install("SingleR")
library(SingleR)

### Load refrence data
hpca.se <- HumanPrimaryCellAtlasData()
hpca.se

# BiocManager::install("scRNAseq")
library(scRNAseq)
hESCs <- LaMannoBrainData('human-es')
hESCs <- hESCs[,1:100]

# BiocManager::install("scater")
library(scater)
hESCs <- logNormCounts(hESCs)

pred.hesc <- SingleR(test = hESCs, ref = hpca.se, labels = hpca.se$label.main)

sum(!is.na(pred.hesc$labels))
100
table(pred.hesc$labels)
Astrocyte Neuroepithelial_cell              Neurons 
       14                   81                    5

That works very nicely - we get 100 predictions from 5 cell types. (Although it can be debated whether these labels are good for embryonic stem cells).

Now lets scramble the data and see what happens.

### Scamble data
logCounts <- assay(hESCs, 'logcounts')
resampleData <- matrix(
    data = sample(unlist( logCounts )),
    ncol = ncol(          logCounts ),
    nrow = nrow(          logCounts )
)
rownames(resampleData) <- rownames( logCounts )
colnames(resampleData) <- colnames( logCounts )

### Verify no resemblance to original data
round( summary(sapply(1:ncol(logCounts), function(x) {
    cor(logCounts[,x],resampleData[,x])
})), 2)
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.02    0.00    0.00    0.00    0.01    0.02

### Rerun SingleR
pred.rand <- SingleR(test = resampleData, ref = hpca.se, labels = hpca.se$label.main)
sum(!is.na(pred.rand$labels))
100
length(unique(pred.rand$labels))
29

So this time - with meaningless data we still get a hundred predictions (of very different cell types).

Is there a way to impose a minimum treshold on the correlations, possibly by comparing to scrambled data, to ensure cell types are only predicted when there is sufficient evidence?

Cheers Kristoffer

single cell R SingleR • 1.8k views

ADD COMMENT • link updated 5.3 years ago by Aaron Lun ★ 28k • written 5.3 years ago by k.vitting.seerup ▴ 120

score 1 · Answer 1 · 2020-01-25

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 8 hours ago

The city by the bay

This is not a new problem. At least for default implementations of methods like k-NN or RFs or SVMs, no matter what you put in, you'll always get some assignments out. Sure, you could stick a minimum threshold onto the assignment score, but then how do you pick a threshold value in an intelligible manner? Scrambling will just generate a strawman, I've don't recall seeing real data actually hit a zero or negative correlation between different cell types; probably because many genes are constitutively expressed in all cell types, so you'll always get some level of positive correlation.

We have put some effort into identifying improbable assignments - see ?pruneScores for more details - but this makes a few assumptions of its own. We do not have a clean solution for the general problem of what happens when your cells do not exist in the reference. Well, that's not entirely true; the solution is to get a more comprehensive reference.

If you think you can do better, the correlations are reported in the output, so go for it.

ADD COMMENT • link 5.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron

Thanks for taking time to answer. Good point about it only being a problem if your cells do not exist in the reference.

Could you elaborate on what you mean by "scrambling will just generate a strawman"?

Cheers Kristoffer

ADD REPLY • link 5.3 years ago k.vitting.seerup ▴ 120

1

Entering edit mode

https://en.wikipedia.org/wiki/Straw_man

In this context, I'm basically saying that the null distribution of correlations that you get by scrambling the gene expression values is so divorced from reality that it doesn't really provide any meaningful baseline for comparison. For example, different cell types will almost always have a positive correlation in their whole-transcriptome expression profiles, simply due to the fact that everyone expresses ribosomal proteins and histones and actin and that pesky MALAT1.

The story is probably a little murkier when you select on marker genes, as SingleR does. Then the question is, "do your selected markers overlap with any markers for cell types that are in your data but not in the reference?" If not, then you're in luck; shuffling would allow you to generate a reasonably realistic null distribution. (Here, the null hypothesis is that your cells do not come from your reference, but instead belong to an unknown cell type.) But if they do overlap, then that's too bad, because the true null distribution will always contain some positive correlation to whichever reference cell type is closest to the unknown cell type, and this would not be captured by shuffling.

The likelihood that you get an overlap - and that shuffling will fail - increases as your reference becomes more comprehensive and you have a more expansive list of markers. This is actually kind of convenient as your need for any shuffling at all decreases when the reference becomes more likely to contain your cell type of interest. But I wouldn't be able to confidently say that shuffling would work for more focused references, either; lots of genes have multiple roles in different cells, so I would well imagine overlaps in places you would not expect a priori.

Incidentally, the approach used by pruneScores is to define the null hypothesis as "the cell does come from a cell type in the reference". The aim is to reject the null when a cell has an unusually low correlation. The trick is to define "unusually low" - read the documentation for more information, but basically, this is done by looking at the distribution across cells and cutting off the low outliers.