Advice on appropriateness of an offset in glmGamPoi/edgeR
2
0
Entering edit mode
Alec • 0
@43b7c63e
Last seen 1 day ago
United States

Hello!

I'm working with an interesting single cell RNA-seq dataset, which because of the collection method is suffering from elevated "contamination" because of misassigned reads. Basically because of how the sample is collected, if two cells are nearby one another, reads from Cell_1 can be mis-assigned to Cell_2, and vice-versa.

This is causing quite a lot of "false positives" in our differential expression analysis. example: let's assume gene_A is highly expressed in cell-type_1, but never expressed in cell-type_2. Let's also assume that cell-type_1 varies in its abundance between samples from 2 treatment conditions. Because of contaminating reads, gene_A might "incorrectly" appear as differentially expressed in cell-type_2.

However, we have an observation specific indirect measure of contamination (per-gene estimate for each cell), and I am wondering if there's any advice on how this could be best used in an existing method to help correct for the "contamination effect" that I'm seeing, or if that's ill-advised.

Basically, as I can generate a genes x sample matrix of counts, and a genes x sample matrix of estimated contamination, I am wondering if it would be appropriate to include this contamination metric as an offset matrix in edgeR or glmGamPoi or DESeq2 (in addition to a library.size measure). Similar in concept to observation level GC corrections done in EDAseq (as I understand it). I'm trying not to reinvent the wheel for its own sake, but if I need to go a new route that's fine as well.

I've tested this a little bit by just running negative binomial regressions at the individual gene level including this indirect contamination measure as a term in the model, or including it as an offset (representative code below).

model <- glm.nb(counts_gene_A ~ group + log(gene_A_contamination) + offset(log(library.size)), data)
model_offset <- glm.nb(counts_gene_A ~ group + offset(log(gene_A_contamination)) + offset(log(library.size)), data)

Doing this I see very little change in aic values, though model usually has a slightly lower aic than model_offset. Group coefficients aren't much affected either (again, usually).

Alternately I could try to find a way to incorporate this contamination value as a modeled term, but it doesn't seem easily done in existing differential expression models. I've seen in the new edgeR release that there's specific mention of incorporating log transformed gene expression values into the design matrix to look at how gene_A might correspond to changed expression of gene_B, but here I would need to model the contamination scores for each gene separately. That said if there's an easy way to do this that I'm just overlooking that'd be amazing.

Thanks very much for taking the time to read this!

DESeq2 glmGamPoi offset edgeR scRNAseq • 224 views
ADD COMMENT
1
Entering edit mode
@constantin-ahlmann-eltze-20493
Last seen 2 days ago
Germany/Heidelberg/EMBL

Hi Alec,

The problem you describe sounds related to problems that people encounter working with spatial data, where suboptimal segmentations misassign reads to the wrong cells. Some time ago, I saw a paper by Kieran Campbell's lab, which proposed a probabilistic model to denoise the initial count matrix (Lee et al., bioRxiv 2024), which might be relevant here.

we have an observation specific indirect measure of contamination (per-gene estimate for each cell), and I am wondering if there's any advice on how this could be best used in an existing method to help correct for the "contamination effect" that I'm seeing, or if that's ill-advised.

To me your idea to use a genes x samples contamination error matrix sounds good! Speaking for glmGamPoi, you will need to include the combined library size and gene_contamination matrix as an offset parameter. As far as I know, edgeR also supports a full matrix with offset values and DESeq2 does not does as well through the normalizationFactor function.

I've tested this a little bit by just running negative binomial regressions at the individual gene level including this indirect contamination measure [...] Doing this I see very little change in aic values, though model usually has a slightly lower aic than model_offset. Group coefficients aren't much affected either (again, usually).

Whether you need to fit a coefficient for your contaminations or can include them as an offset (i.e., fix the coefficient to 1), depends on how accurate they are. For comparison, if a cell is twice as big and thus the size factor is twice as large, we expect there also to be twice as many counts for each gene and can thus treat the size factor as an offset.

Alternately I could try to find a way to incorporate this contamination value as a modeled term, but it doesn't seem easily done in existing differential expression models. I've seen in the new edgeR release that there's specific mention of incorporating log transformed gene expression values into the design matrix to look at how gene_A might correspond to changed expression of gene_B, but here I would need to model the contamination scores for each gene separately. That said if there's an easy way to do this that I'm just overlooking that'd be amazing.

Again speaking for glmGamPoi, you cannot include a gene-specific covariate in your design matrix. You would have to run a separate fit for each gene. On the other hand this is not actually as bad, as it may sound. Internally glmGamPoi is running a separate model fit for each gene anyways (glmGamPoi Github).

Best, Constantin

0
Entering edit mode

Do you mean an offset matrix in the GLM? DESeq2 has normalizationFactors for this, and it can be combined with sequencing depth control using normMatrix argument of estimateSizeFactors.

ADD REPLY
0
Entering edit mode

Oh, I never realized. That's cool. Thanks for correcting me :)

0
Entering edit mode

Not a problem :)

ADD REPLY
0
Entering edit mode

Thanks Constantin! I appreciate the insight.

Running a separate fit for each gene would be fine, and I'm trying out a version of the analysis that just does that, however it would be nice if I could take advantage of your tool considering the optimizations you made to the process.

The remaining issue on using this contamination estimate as an offset seems to be the question of the error matrix accuracy. I'm certainly not confident that it's as theoretically grounded as setting the coefficient to 1 for changes in the size-factor, but I'm also not totally sure how to go about trying to empirically quantify that. We don't have a way of directly measuring how many counts are present because of contamination.

I've tried a procedure that's similar to the one done in EDAseq, quantile normalization based on contaminating expression, but the contamination estimate also roughly seems to have a negative binomial distribution, and the resulting normalized counts (and offsets) are somewhat reasonable, but I'm just not confident that this method is applicable when the underlying distribution is so different from what the original tool was using (GC content vs contamination estimate).

ADD REPLY
0
Entering edit mode
@gordon-smyth
Last seen 14 minutes ago
WEHI, Melbourne, Australia

edgeR allows observation specific offsets (as already pointed out), although I am not confident that would be a reliable solution to your problem.

edgeR also allows observation-specific weights as part of the quasi-likelihood pipeline, which might be relevant to downweight observations with contamination.

edgeR doesn't allow gene-specific predictors in the design matrix.

A different pipeline would be to use limma::voomaLmFit where predictor is set to your indirect measure. That doesn't correct for the contamination but it does do observation-specific modelling of accuracy with downweighting of more contaminated observations.

ADD COMMENT

Login before adding your answer.

Traffic: 519 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6