Hello!
I'm working with an interesting single cell RNA-seq dataset, which because of the collection method is suffering from elevated "contamination" because of misassigned reads. Basically because of how the sample is collected, if two cells are nearby one another, reads from Cell_1 can be mis-assigned to Cell_2, and vice-versa.
This is causing quite a lot of "false positives" in our differential expression analysis. example: let's assume gene_A is highly expressed in cell-type_1, but never expressed in cell-type_2. Let's also assume that cell-type_1 varies in its abundance between samples from 2 treatment conditions. Because of contaminating reads, gene_A might "incorrectly" appear as differentially expressed in cell-type_2.
However, we have an observation specific indirect measure of contamination (per-gene estimate for each cell), and I am wondering if there's any advice on how this could be best used in an existing method to help correct for the "contamination effect" that I'm seeing, or if that's ill-advised.
Basically, as I can generate a genes x sample matrix of counts, and a genes x sample matrix of estimated contamination, I am wondering if it would be appropriate to include this contamination metric as an offset matrix in edgeR or glmGamPoi or DESeq2 (in addition to a library.size measure). Similar in concept to observation level GC corrections done in EDAseq (as I understand it). I'm trying not to reinvent the wheel for its own sake, but if I need to go a new route that's fine as well.
I've tested this a little bit by just running negative binomial regressions at the individual gene level including this indirect contamination measure as a term in the model, or including it as an offset (representative code below).
model <- glm.nb(counts_gene_A ~ group + log(gene_A_contamination) + offset(log(library.size)), data)
model_offset <- glm.nb(counts_gene_A ~ group + offset(log(gene_A_contamination)) + offset(log(library.size)), data)
Doing this I see very little change in aic values, though model
usually has a slightly lower aic than model_offset
. Group coefficients aren't much affected either (again, usually).
Alternately I could try to find a way to incorporate this contamination value as a modeled term, but it doesn't seem easily done in existing differential expression models. I've seen in the new edgeR release that there's specific mention of incorporating log transformed gene expression values into the design matrix to look at how gene_A might correspond to changed expression of gene_B, but here I would need to model the contamination scores for each gene separately. That said if there's an easy way to do this that I'm just overlooking that'd be amazing.
Thanks very much for taking the time to read this!
Do you mean an offset matrix in the GLM? DESeq2 has
normalizationFactors
for this, and it can be combined with sequencing depth control usingnormMatrix
argument ofestimateSizeFactors
.Oh, I never realized. That's cool. Thanks for correcting me :)
Not a problem :)
Thanks Constantin! I appreciate the insight.
Running a separate fit for each gene would be fine, and I'm trying out a version of the analysis that just does that, however it would be nice if I could take advantage of your tool considering the optimizations you made to the process.
The remaining issue on using this contamination estimate as an offset seems to be the question of the error matrix accuracy. I'm certainly not confident that it's as theoretically grounded as setting the coefficient to 1 for changes in the size-factor, but I'm also not totally sure how to go about trying to empirically quantify that. We don't have a way of directly measuring how many counts are present because of contamination.
I've tried a procedure that's similar to the one done in EDAseq, quantile normalization based on contaminating expression, but the contamination estimate also roughly seems to have a negative binomial distribution, and the resulting normalized counts (and offsets) are somewhat reasonable, but I'm just not confident that this method is applicable when the underlying distribution is so different from what the original tool was using (GC content vs contamination estimate).