There are at least a few approaches:
1. Don't worry about it. Seriously, if you're just doing clustering and looking at marker genes between clusters in the same sample, then a bit of background contamination isn't a big deal. Any contaminating gene probably won't be DE if the contamination is more-or-less even across cell types (assuming similar total RNA content in each cluster), so it won't show up in your marker lists for clusters that aren't actually overexpressing it.
2. Identify and remove the affected genes from your results, e.g., DE tables. This may be necessary when dealing with multi-sample comparisons where the ambient contamination might differ between conditions. In such cases, it is possible for the ambient contamination to drive false DE between conditions, so removing the affected genes is important.
3. Try to remove the ambient contamination from the expression matrix at the start of your analysis. This can be done by packages like SoupX, but I would say that this is the most challenging approach. IIRC, you need to have prior knowledge of at least two markers that are highly abundant in the ambient solution but can never be expressed together. This is used to identify the proportion of ambient contamination in each cell, which is unlikely to be very precise given the level of per-cell noise. Then you have to perform the actual process of subtracting counts, which can be difficult due to the mean-variance relationship of count data.
You could use clustering or nearest-neighbors information to overcome some of the stability and variance problems in 3, but if you've already got clusters that suitably summarize your data, it seems pointless to go back to edit the expression matrix. (Well, aside from improving the aesthetics of your plots.) The bigger problem is that I never have two genes that I am absolutely sure are never co-expressed. Even identifying one gene that should not be expressed in a single subpopulation of the data is hard enough... unless you're working with very well-established cell types, but if so, you shouldn't need to bother removing contamination at all.
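To make the logic of approach 3 concrete, here is a minimal Python sketch of the cluster-level idea: pick a gene assumed to be truly absent from a cluster, use it to estimate the fraction of that cluster's counts that are ambient, then subtract the expected ambient counts. This is not SoupX's actual algorithm; the function names, gene choices and counts are all made up for illustration, and the whole thing rests on the (hard to justify) assumption that the chosen gene really is not expressed in that cluster.

```python
def ambient_fraction(cluster_counts, ambient_counts, absent_gene):
    """Estimate the fraction of a cluster's counts that come from the ambient pool.

    ASSUMPTION: `absent_gene` is not truly expressed in this cluster, so every
    count observed for it is contamination.
    """
    cluster_total = sum(cluster_counts.values())
    ambient_total = sum(ambient_counts.values())
    if cluster_total == 0 or ambient_total == 0:
        raise ValueError("cluster and ambient profiles must both be non-empty")
    ambient_prop = ambient_counts[absent_gene] / ambient_total
    if ambient_prop == 0:
        raise ValueError("the chosen gene must be present in the ambient profile")
    observed_prop = cluster_counts[absent_gene] / cluster_total
    # Cap at 1: a cluster cannot be more than 100% ambient.
    return min(1.0, observed_prop / ambient_prop)


def subtract_ambient(cluster_counts, ambient_counts, fraction):
    """Subtract the expected ambient contribution from each gene, floored at zero."""
    cluster_total = sum(cluster_counts.values())
    ambient_total = sum(ambient_counts.values())
    cleaned = {}
    for gene, count in cluster_counts.items():
        expected = fraction * cluster_total * ambient_counts.get(gene, 0) / ambient_total
        cleaned[gene] = max(0.0, count - expected)
    return cleaned


# Toy, made-up numbers: summed counts for a T cell cluster and for empty droplets.
ambient = {"HBB": 500, "CD3E": 100, "MS4A1": 400}
t_cells = {"HBB": 50, "CD3E": 910, "MS4A1": 40}

# ASSUMPTION: MS4A1 (a B cell marker) is never expressed in T cells.
fraction = ambient_fraction(t_cells, ambient, "MS4A1")
cleaned = subtract_ambient(t_cells, ambient, fraction)
print(fraction, cleaned)
```

Doing this on summed cluster counts rather than per cell is what keeps the estimate stable; per-cell estimates of the same ratio would be far noisier. The simple floor-at-zero subtraction also ignores the mean-variance relationship mentioned above, which is one reason real tools do something more careful.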
Thank you so much, Aaron!
I also anticipate that the presence of ambient RNA would affect prediction of cell types using reference dataset methods, like SingleR. Would you agree? Is there a way to overcome that and still use SingleR? (Maybe by removing ambient RNA from the expression matrix?)
In theory, I would say yes, that is possible. In practice, I would be surprised if contamination was strong enough to override a cell type's identity. At least for the broad cell types, a cell's actual markers should still have many more transcript molecules than those from ambient contamination. Perhaps the subtle assignments may get confused but you will have to try it and see.
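A toy illustration of that intuition, as a hedged Python sketch: SingleR's scores are based on Spearman correlations against reference profiles, so the code below ranks a contaminated cell against two references and picks the best match. It is not SingleR itself; the reference profiles, the cell and the 10% ambient load are invented numbers, and the helper names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Gene order: CD3E, CD3D, IL7R, MS4A1, CD79A, HBB (all values are made up).
references = {
    "T cell": [90, 80, 70, 1, 2, 0],
    "B cell": [2, 1, 3, 95, 85, 0],
}
true_t_cell = [360, 315, 225, 0, 0, 0]     # 900 genuine T cell counts
ambient_counts = [10, 10, 10, 20, 20, 30]  # 100 ambient counts, i.e. 10% of the total
query = [t + a for t, a in zip(true_t_cell, ambient_counts)]

scores = {label: spearman(query, profile) for label, profile in references.items()}
best_label = max(scores, key=scores.get)
print(best_label, scores)
```

In this toy case the T cell reference still wins comfortably, because the contamination only reshuffles the ranks of genes near the bottom. With closely related references that differ in just a few low-abundance genes, the same reshuffling could plausibly tip the assignment, which is the "subtle assignments" caveat above.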
I should mention that the ambient subtraction process is not a free lunch either. This has a bundle of its own assumptions, the most obvious being that an appropriate clustering exists, but also other distributional assumptions related to how counts are (re)distributed across cells in the same cluster. So if you're preemptively worrying about contamination affecting subtle cell type classification, you should also worry about the effects of violating the assumptions of ambient removal.
That said, I figured that removing ambient noise would make for some nicer plots, even if I did not plan to use it for anything substantive. So I added the removeAmbience() function to DropletUtils; you can get it in version 1.9.12. The devel version of the OSCA book will also have a section demonstrating its use, which should show up on Tuesday.
Thank you so so much!
Hi, I have tried to install the new version of the package using:
and got the following error:
I also tried to clone it from github and got the same error.