Question

Incorporating Salmon offset matrix into Wilcoxon DEG analysis

0

Entering edit mode

Andre • 0

@804722cd

Last seen 3 months ago

Portugal

I´m working with this dataset(GSE234297) that provides a salmon counts file and a salmon offsets file(GSE234297_gene_raw_counts.txt.gz GSE234297_gene_offset_matrix.txt.gz) . The paper authors use them like this to do DEG:~

 # create DGEList
y <- DGEList(counts = counts, samples = samples, genes = genes, group = samples$Disease_status)

# add salmon offsets
y <- scaleOffset(y, offset = as.matrix(salmon_offset)) 

# filter low count genes 
keep <- filterByExpr(y, group = y$samples$group)
table(keep) 
y <- y[keep, , keep.lib.sizes=FALSE] 

# set design matrix
design <- model.matrix(~ Sex + GC_content + group, data = y$samples)
design

# estimate dispersion parameters using edgeR robust method
y <- estimateGLMRobustDisp(y, design)

But my question is if I want to use the salmon offsets for Wilcoxon (which can't use offsets directly), could I simply do:

Copy <- DGEList(counts)
y$offset <- offsets  
cpms <- edgeR::cpm(y, offset = y$offset, log = FALSE)

To get properly normalized values that account for the Salmon bias corrections?

edgeR • 1.2k views

ADD COMMENT • link 4 months ago • updated 3 months ago Andre • 0

score 1 · Answer 1 · 2024-12-31

1

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

To compute CPM values from edgeR:

y <- DGEList(counts = counts, samples = samples, genes = genes, group = samples$Disease_status)
y <- scaleOffset(y, offset = salmon_offset) 
cpms <- cpm(y)

Needless to say, using edgeR v4 quasi-likelihood would be much faster and better than Wilcoxon tests. I would be skipping the estimateGLMRobustDisp call shown in your code above and instead using:

fit <- glmQLFit(y, design, robust=TRUE)
ftest <- glmQLFTest(fit)
topTags(ftest)

BTW, please ask questions just once. I answered essentially the same question from you 3 hours ago, added as a comment to an old answer.

ADD COMMENT • link 4 months ago Gordon Smyth 52k

0

Entering edit mode

Thank you. What is the difference between setting robust=TRUE in glmQLFit and estimateGLMRobustDisp? The paper i mentioned in the post used both:

y <- estimateGLMRobustDisp(y, design) 

fit <- glmQLFit(y, design, robust=TRUE)

ADD REPLY • link 4 months ago Andre • 0

0

Entering edit mode

It is not possible to use both. estimateGLMRobustDisp() estimates negative binomial dispersions for each gene, whereas glmQLFit() estimates quasi-dispersions. estimateGLMRobustDisp() is for the older edgeR glm pipeline based on likelihood ratio tests. If you call the newer pipeline with glmQLFit(), then the gene-specific negative binomial dispersions will be discarded. See Chen et al (2024) for the differences between the pipelines.

How do you know that the paper used robust=TRUE? I see no mention of that in the published paper associated with the GSE234297 dataset.

Chen Y, Chen L, Lun ATL, Baldoni PL, Smyth GK (2024). edgeR 4.0: powerful differential analysis of sequencing data with expanded functionality and improved support for small counts and larger datasets. bioRxiv 2024.01.21.576131. doi:10.1101/2024.01.21.576131

Edit later

I have now remembered that estimateGLMRobustDisp() produces weights as well as NB dispersion, and the weights will definitely affect any quasi-likelihood analysis done downstream. I can't recommend combining estimateGLMRobustDisp with glmQLFit. The edgeR authors didn't intend that the two pipelines be combined in this way.

ADD REPLY • link 3 months ago Gordon Smyth 52k

0

Entering edit mode

They provide the code they used for DGE analysis in a gitlab link that can be found in the paper. In there you can find:

# estimate dispersion parameters using edgeR robust method
y <- estimateGLMRobustDisp(y, design) 

# visualise negative binomial dispersions 
plotBCV(y)

# fit a quasi-likelihood negative binomial GLM model to count data
fit <- glmQLFit(y, design, robust=TRUE)

ADD REPLY • link 3 months ago Andre • 0

0

Entering edit mode

Here, estimateGLMRobustDisp() was used just to get the diagnostic BCV plot by plotBCV(y). The robust NB dispersions were not used for the DE analysis.

In edgeR v3, used by the paper, glmQLFit() would have taken the trended NB dispersions from the estimateGLMRobustDisp() output, which would be same trended dispersions as from the standard dispersion estimate command estimateDisp(). In edgeR v4, the NB dispersions are being ignored entirely by glmQLFit() so you can simply skip estimateDisp() and friends, resulting in a considerable saving of computation time, because the NB dispersion steps is computationally expensive.

For a dataset with 100-200 samples, the whole edgeR v4 analysis should take well under a minute.

Edit later

I remember now that estimateGLMRobustDisp() produces weights, which will be used by the glmQLFit analysis. We don't recommend that sequence though.

ADD REPLY • link 3 months ago Gordon Smyth 52k

0

Entering edit mode

The estimateGLMRobustDisp() function seems to have an effect on glmQLFit(), even in edgeR 4 (4.4.1). If I use estimateGLMRobustDisp(), like this:

DGE_list_object <- estimateGLMRobustDisp(DGE_list_object, design) 

  fit <- glmQLFit(DGE_list_object, design, robust=robust)

qlf <- glmQLFTest(fit)

print(summary(decideTests(qlf)))

I get 45 differentially expressed genes. However, if I omit estimateGLMRobustDisp():

fit <- glmQLFit(DGE_list_object, design, robust=robust)

qlf <- glmQLFTest(fit)

print(summary(decideTests(qlf)))

Also, there's quite a big difference between the results obtained for this dataset with edgeR 3 using the same code as in the paper (245 differentially expressed genes) and using the new QL pipeline in edgeR 4 (0 differentially expressed genes).

ADD REPLY • link 3 months ago Andre • 0

0

Entering edit mode

Sorry, I had forgotten estimateGLMDisp() produces weights as well as NB dispersions. The weights from estimatGLMRobustDisp() were intended to be used by adjustedProfileLik() to estimate the NB dispersions. However, if you pass the DGE object to glmQLFit(), then glmQLFit() will interpret the weights a generalized linear model weights, with possibly unintended results. The edgeR authors did not intend that estimateGLMRobustDisp() be combined with glmQLFit().

BTW, I am a bit confused by the code in your comment. You have given two code sequences, but they both look to be exactly the same.

ADD REPLY • link 3 months ago Gordon Smyth 52k

0

Entering edit mode

I edited the previous post. I meant to remove the estimateGLMRobustDisp line in the second snippet :

 DGE_list_object <- estimateGLMRobustDisp(DGE_list_object, design)

ADD REPLY • link 3 months ago Andre • 0