Question

Advice on appropriateness of an offset in glmGamPoi/edgeR

Alec • 0

Hello!

I'm working with an interesting single-cell RNA-seq dataset that, because of the collection method, suffers from elevated "contamination" from misassigned reads. If two cells are near one another, reads from Cell_1 can be misassigned to Cell_2, and vice versa.

This is causing quite a lot of "false positives" in our differential expression analysis. For example, assume gene_A is highly expressed in cell-type_1 but never expressed in cell-type_2, and that cell-type_1 varies in abundance between samples from 2 treatment conditions. Because of contaminating reads, gene_A might "incorrectly" appear as differentially expressed in cell-type_2.

However, we have an observation-specific indirect measure of contamination (a per-gene estimate for each cell), and I am wondering if there's any advice on how best to use it in an existing method to help correct for the "contamination effect" that I'm seeing, or if that's ill-advised.

Since I can generate a genes x samples matrix of counts and a genes x samples matrix of estimated contamination, I am wondering whether it would be appropriate to include this contamination metric as an offset matrix in edgeR, glmGamPoi, or DESeq2 (in addition to a library-size measure). This is similar in concept to the observation-level GC corrections done in EDASeq (as I understand it). I'm trying not to reinvent the wheel for its own sake, but if I need to go a new route, that's fine as well.

I've tested this a little by running negative binomial regressions at the individual gene level, including this indirect contamination measure either as a term in the model or as an offset (representative code below).

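```r
library(MASS)  # provides glm.nb()
# `model` estimates a coefficient for log(contamination); `model_offset` fixes that coefficient at 1
model <- glm.nb(counts_gene_A ~ group + log(gene_A_contamination) + offset(log(library.size)), data = data)
model_offset <- glm.nb(counts_gene_A ~ group + offset(log(gene_A_contamination)) + offset(log(library.size)), data = data)
```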

Doing this, I see very little change in AIC values, though model usually has a slightly lower AIC than model_offset. The group coefficients usually aren't much affected either.

Alternatively, I could try to find a way to incorporate this contamination value as a modeled term, but that doesn't seem easy to do in existing differential expression models. I've seen that the new edgeR release specifically mentions incorporating log-transformed gene expression values into the design matrix to look at how gene_A might correspond to changed expression of gene_B, but here I would need to model the contamination scores for each gene separately. That said, if there's an easy way to do this that I'm just overlooking, that would be amazing.

Thanks very much for taking the time to read this!

Tags: DESeq2 glmGamPoi offset edgeR scRNAseq

Updated 22 hours ago by Gordon Smyth • written 3 days ago by Alec

score 1 · Answer 1 · 2024-06-27

Hi Alec,

The problem you describe sounds related to problems that people encounter when working with spatial data, where suboptimal segmentations assign reads to the wrong cells. Some time ago, I saw a paper from Kieran Campbell's lab that proposed a probabilistic model to denoise the initial count matrix (Lee et al., bioRxiv 2024); it might be relevant here.

> we have an observation specific indirect measure of contamination (per-gene estimate for each cell), and I am wondering if there's any advice on how this could be best used in an existing method to help correct for the "contamination effect" that I'm seeing, or if that's ill-advised.

To me, your idea of using a genes x samples contamination matrix sounds good! Speaking for glmGamPoi, you would combine the library size and the gene_contamination matrix into a single offset matrix (on the log scale) and pass it as the offset argument. As far as I know, edgeR also supports a full matrix of offset values, and DESeq2 does as well through the normalizationFactors function.
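As a rough sketch of what building such a combined offset could look like: the data below are simulated, the object names (`counts`, `contamination`, `offset_mat`) are made up for illustration, and treating the contamination estimate as a strictly positive multiplicative factor is an assumption, not something established in this thread. The package calls only run if the packages are installed.

```r
set.seed(1)
n_genes <- 50
n_cells <- 20

# Simulated stand-ins for the real data (hypothetical values)
counts <- matrix(rnbinom(n_genes * n_cells, mu = 20, size = 5), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("cell", seq_len(n_cells))))
contamination <- matrix(runif(n_genes * n_cells, 0.5, 2), nrow = n_genes,
                        dimnames = dimnames(counts))  # assumed multiplicative, > 0
group <- factor(rep(c("ctrl", "trt"), each = n_cells / 2))

# Log-scale offset: log(library size) per cell plus log(contamination) per entry
lib_size <- colSums(counts)
offset_mat <- sweep(log(contamination), 2, log(lib_size), "+")

if (requireNamespace("glmGamPoi", quietly = TRUE)) {
  # size_factors = FALSE because the library size is already inside the offset
  fit_gp <- glmGamPoi::glm_gp(counts, design = ~ group,
                              col_data = data.frame(group = group),
                              offset = offset_mat, size_factors = FALSE)
}

if (requireNamespace("edgeR", quietly = TRUE)) {
  y <- edgeR::DGEList(counts = counts, group = group)
  y <- edgeR::scaleOffset(y, offset_mat)  # stores a genes x cells offset matrix
}

if (requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(counts, data.frame(group = group), ~ group)
  # DESeq2 expects factors on the natural scale; centre each row at geometric mean 1
  norm_factors <- exp(offset_mat - rowMeans(offset_mat))
  DESeq2::normalizationFactors(dds) <- norm_factors
}
```

Note that the offset lives on the log scale for glmGamPoi and edgeR, while DESeq2's normalization factors are on the natural scale.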

> I've tested this a little bit by just running negative binomial regressions at the individual gene level including this indirect contamination measure [...] Doing this I see very little change in aic values, though model usually has a slightly lower aic than model_offset. Group coefficients aren't much affected either (again, usually).

Whether you need to fit a coefficient for your contamination estimates or can include them as an offset (i.e., fix the coefficient to 1) depends on how accurate they are. For comparison, if a cell is twice as big and its size factor is therefore twice as large, we expect twice as many counts for each gene and can thus treat the size factor as an offset.
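One informal way to probe this for a single gene is to fit the contamination term with a free coefficient and see whether its confidence interval is compatible with 1. The sketch below uses simulated data in which the true coefficient is 1 by construction; all variable names and values are hypothetical, and a Wald interval is just one simple choice.

```r
library(MASS)  # glm.nb()
set.seed(42)

n <- 200
group <- factor(rep(c("ctrl", "trt"), each = n / 2))
library_size <- round(runif(n, 5e3, 2e4))
contamination <- runif(n, 0.5, 2)  # hypothetical per-cell estimate for one gene

# Simulate counts that scale proportionally with contamination (coefficient = 1),
# with no true group effect
mu <- 1e-3 * library_size * contamination
dat <- data.frame(counts = rnbinom(n, mu = mu, size = 10),
                  group, library_size, contamination)

# Free coefficient on log(contamination); library size stays an offset
fit_free <- glm.nb(counts ~ group + log(contamination) + offset(log(library_size)),
                   data = dat)

est <- coef(summary(fit_free))["log(contamination)", ]
wald_ci <- est[["Estimate"]] + c(-1, 1) * qnorm(0.975) * est[["Std. Error"]]
print(round(c(estimate = est[["Estimate"]], lower = wald_ci[1], upper = wald_ci[2]), 3))
```

If the interval sits well away from 1 for many genes, that would suggest the estimates are not on a proportional scale and an offset may be too rigid.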

> Alternately I could try to find a way to incorporate this contamination value as a modeled term, but it doesn't seem easily done in existing differential expression models. I've seen in the new edgeR release that there's specific mention of incorporating log transformed gene expression values into the design matrix to look at how gene_A might correspond to changed expression of gene_B, but here I would need to model the contamination scores for each gene separately. That said if there's an easy way to do this that I'm just overlooking that'd be amazing.

Again speaking for glmGamPoi, you cannot include a gene-specific covariate in your design matrix; you would have to run a separate fit for each gene. On the other hand, this is not as bad as it may sound: internally, glmGamPoi runs a separate model fit for each gene anyway (see the glmGamPoi GitHub repository).
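A minimal sketch of such a per-gene loop, using MASS::glm.nb (as in the question) rather than glmGamPoi, and simulated data with hypothetical names. Each gene gets its own row of the contamination matrix as a covariate. Unlike the dedicated packages, this does not share dispersion information across genes.

```r
library(MASS)  # glm.nb()
set.seed(7)

n_genes <- 5
n_cells <- 120
group <- factor(rep(c("ctrl", "trt"), each = n_cells / 2))
library_size <- round(runif(n_cells, 5e3, 2e4))

# Hypothetical genes x cells contamination estimates and matching counts
contamination <- matrix(runif(n_genes * n_cells, 0.5, 2), nrow = n_genes)
mu <- 1e-3 * sweep(contamination, 2, library_size, "*")
counts <- matrix(rnbinom(n_genes * n_cells, mu = mu, size = 10), nrow = n_genes)
rownames(counts) <- paste0("gene", seq_len(n_genes))

# Fit one negative binomial GLM per gene with a gene-specific covariate
fit_one_gene <- function(g) {
  dat <- data.frame(y = counts[g, ],
                    group = group,
                    log_contam = log(contamination[g, ]),
                    log_lib = log(library_size))
  fit <- glm.nb(y ~ group + log_contam + offset(log_lib), data = dat)
  coef(summary(fit))["grouptrt", c("Estimate", "Std. Error", "Pr(>|z|)")]
}

results <- t(sapply(seq_len(n_genes), fit_one_gene))
rownames(results) <- rownames(counts)
print(results)
```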

Best, Constantin

score 0 · Answer 2 · 2024-06-28

edgeR allows observation-specific offsets (as already pointed out), although I am not confident that this would be a reliable solution to your problem.

edgeR also allows observation-specific weights as part of the quasi-likelihood pipeline, which might be relevant to downweight observations with contamination.

edgeR doesn't allow gene-specific predictors in the design matrix.

A different pipeline would be to use limma::voomaLmFit with the predictor argument set to your indirect measure. That doesn't correct for the contamination, but it does do observation-specific modelling of accuracy, downweighting the more contaminated observations.
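A rough sketch of what that call could look like, on simulated data with hypothetical names. It assumes log-expression values are supplied as a matrix (here a simple log2-CPM computed by hand) and that the installed limma version provides voomaLmFit with a predictor argument; the limma part is skipped otherwise.

```r
set.seed(3)
n_genes <- 200
n_cells <- 12

# Simulated stand-ins: counts and a same-sized contamination matrix (hypothetical)
counts <- matrix(rnbinom(n_genes * n_cells, mu = 50, size = 5), nrow = n_genes)
contamination <- matrix(runif(n_genes * n_cells, 0, 0.5), nrow = n_genes)
group <- factor(rep(c("ctrl", "trt"), each = n_cells / 2))
design <- model.matrix(~ group)

# Simple log2 counts-per-million with a small prior count
lib_size <- colSums(counts)
log_cpm <- log2(sweep(counts + 0.5, 2, lib_size + 1, "/") * 1e6)

has_vooma_predictor <- requireNamespace("limma", quietly = TRUE) &&
  exists("voomaLmFit", envir = asNamespace("limma")) &&
  "predictor" %in% names(formals(limma::voomaLmFit))

if (has_vooma_predictor) {
  # predictor must have the same dimensions as the expression matrix
  fit <- limma::voomaLmFit(log_cpm, design = design, predictor = contamination)
  fit <- limma::eBayes(fit)
  top <- limma::topTable(fit, coef = "grouptrt")
}
```

Here the contamination matrix only informs the precision weights; the fitted group effects are not adjusted for it.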