Question

Discarding ambient RNA using estimateAmbience

0

lirongrossmann ▴ 50

@lirongrossmann-23954

Last seen 3.8 years ago

Hi,

I have a single-nucleus RNA-seq sample from a tumor. When I look at a violin plot, all the clusters express key tumor genes, which leads me to think that ambient RNA is contaminating most of the cells. I am using the following code:

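library(Seurat)        # provides Read10X (assuming the Seurat reader was intended)
library(DropletUtils)  # provides estimateAmbience

my.counts <- Read10X("path.to.unfiltered.raw.counts")

ambience <- estimateAmbience(my.counts, good.turing = TRUE)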

I have the following question:

After sorting the ambience vector to find the genes with the highest proportions of ambient RNA in the low-count barcodes, how do I continue from here? Do I choose a fixed number of genes (say, the top 10) and subtract their counts from the raw count matrix for downstream analysis?

Thanks!

emptyDrop ambient singlecell estimateAmbience • 2.5k views

ADD COMMENT • link updated 4.2 years ago by Aaron Lun ★ 28k • written 4.2 years ago by lirongrossmann ▴ 50

score 1 · Answer 1 · 2020-09-27

1

Aaron Lun ★ 28k

@alun

Last seen 14 hours ago

The city by the bay

There are at least a few approaches:

1. Don't worry about it. Seriously, if you're just doing clustering and looking at marker genes between clusters in the same sample, then a bit of background contamination isn't a big deal. Any contaminating gene probably won't be DE if the contamination is more-or-less even across cell types (assuming similar total RNA content in each cluster), so it won't show up in your marker lists for clusters that aren't actually overexpressing it.

2. Identify and remove the affected genes from your results, e.g., DE tables. This may be necessary when dealing with multi-sample comparisons where the ambient contamination might differ between conditions. In such cases, it is possible for the contamination to drive false DE between conditions, so removing the affected genes is important.

3. Try to remove the ambient contamination from the expression matrix at the start of your analysis. This can be done by packages like SoupX, but I would say that this is the most challenging approach. IIRC, you need to have prior knowledge of at least two markers that are highly abundant in the ambient solution but can never be expressed together. This is used to identify the proportion of ambient contamination in each cell, which is unlikely to be very precise given the level of per-cell noise. Then you have to perform the actual process of subtracting counts, which can be difficult due to the mean-variance relationship of count data.

You could use clustering or nearest-neighbors information to overcome some of the stability and variance problems in 3, but if you've already got clusters that suitably summarize your data, it seems pointless to go back to edit the expression matrix. (Well, aside from improving the aesthetics of your plots.) The bigger problem is that I never have two genes that I am absolutely sure are never co-expressed. Even identifying one gene that should not be expressed in a single subpopulation of the data is hard enough... unless you're working with very well-established cell types, but if so, you shouldn't need to bother removing contamination at all.
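
For approach 2, a rough sketch of the bookkeeping in base R might look like the following. Everything here is made up for illustration: the gene names, the ambient proportions (standing in for the output of estimateAmbience), the DE table, and especially the "top n.top genes by ambient proportion" rule, which is only an assumed heuristic and not a recommended threshold.

```r
# Toy ambient profile: proportion of ambient transcripts per gene.
# In a real analysis this would come from estimateAmbience().
ambience <- c(GeneA = 0.30, GeneB = 0.02, GeneC = 0.25,
              GeneD = 0.01, GeneE = 0.40, GeneF = 0.02)

# Toy DE table between two conditions (made-up numbers).
de <- data.frame(
    logFC = c(2.1, -1.5, 0.8, 3.0, -2.2, 1.1),
    FDR   = c(0.001, 0.01, 0.2, 0.0001, 0.03, 0.04),
    row.names = names(ambience)
)

# Assumption: treat the 'n.top' genes with the highest ambient
# proportions as suspect. The cutoff is arbitrary.
n.top <- 3
suspect <- names(sort(ambience, decreasing = TRUE))[seq_len(n.top)]

# Annotate the DE table rather than silently dropping rows.
de$ambient.prop <- ambience[rownames(de)]
de$suspect <- rownames(de) %in% suspect

# Significant genes that are not flagged as likely ambient-driven.
kept <- de[de$FDR <= 0.05 & !de$suspect, ]
print(kept)
```

Annotating the table first and filtering afterwards keeps the flagged genes visible, so you can still check by eye whether a "suspect" gene is genuinely DE.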

ADD COMMENT • link 4.2 years ago Aaron Lun ★ 28k

0

Thank you so much, Aaron!

ADD REPLY • link 4.2 years ago lirongrossmann ▴ 50

0

I also anticipate that the presence of ambient RNA would affect prediction of cell types using reference-based methods like SingleR. Would you agree? Is there a way to overcome that and still use SingleR (maybe by removing ambient RNA from the expression matrix)?

ADD REPLY • link 4.2 years ago lirongrossmann ▴ 50

1

In theory, I would say yes, that is possible. In practice, I would be surprised if contamination was strong enough to override a cell type's identity. At least for the broad cell types, a cell's actual markers should still have many more transcript molecules than those from ambient contamination. The more subtle assignments may get confused, but you will have to try it and see.

I should mention that the ambient subtraction process is not a free lunch either. This has a bundle of its own assumptions, the most obvious being that an appropriate clustering exists, but also other distributional assumptions related to how counts are (re)distributed across cells in the same cluster. So if you're preemptively worrying about contamination affecting subtle cell type classification, you should also worry about the effects of violating the assumptions of ambient removal.

That said, I figured that removing ambient noise would make for some nicer plots, even if I did not plan to use it for anything substantive. So I added the removeAmbience function to DropletUtils; you can get it in version 1.9.12. The devel version of the OSCA book will also have a section demonstrating its use, which should show up on Tuesday.
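
As a minimal sketch of how a call might look, here is a simulated example. The data are made up, all cells are placed in a single placeholder group, and the demonstration only runs if a version of DropletUtils that contains removeAmbience is installed. With real data you would presumably pass the count matrix for your called cells as y, the estimateAmbience output as ambient (with genes in the same order as the rows of y), and your cluster labels as groups.

```r
set.seed(100)

# Made-up data: 1000 genes by 100 cells. Only the first 100 genes are
# genuinely expressed; every gene also receives some ambient signal.
ngenes <- 1000
ambient <- runif(ngenes, 0, 0.1)            # stand-in ambient profile
cells <- c(runif(100) * 10, integer(900))   # true per-gene expression
y <- matrix(rpois(ngenes * 100, cells + ambient), nrow = ngenes)

# Placeholder grouping: pretend that all cells belong to one cluster.
clusters <- rep(1, ncol(y))

if (requireNamespace("DropletUtils", quietly = TRUE)) {
    removed <- DropletUtils::removeAmbience(y, ambient, groups = clusters)

    # Compare the genuinely expressed genes against the ambient-only
    # genes after removal.
    print(summary(rowMeans(removed[1:100, ])))
    print(summary(rowMeans(removed[101:1000, ])))
}
```

I would only feed the result into visualization; clustering and DE are probably better done on the original counts, for the reasons given above.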

ADD REPLY • link 4.2 years ago Aaron Lun ★ 28k

0

Thank you so so much!

ADD REPLY • link 4.1 years ago lirongrossmann ▴ 50

0

Hi, I have tried to install the new version of the package using:

devtools::install_github("MarioniLab/DropletUtils")

and got the following error:

ERROR: compilation failed for package ‘DropletUtils’
* removing ‘/tmp/Rtmp4VjwFR/Rinst19fd655306c7/DropletUtils’

I also tried to clone it from github and got the same error.

ADD REPLY • link 4.1 years ago lirongrossmann ▴ 50