Mistake in Cook's distance calculation in DESeq2 or am I missing something?
1
0
Entering edit mode
phil.xie • 0
@370763f1
Last seen 20 months ago
United Kingdom

Cook's distance was implemented in DESeq2 for outlier detection. Specifically, the calculation is defined here.

calculateCooksDistance <- function(object, H, modelMatrix) {
  p <- ncol(modelMatrix)
  dispersions <- robustMethodOfMomentsDisp(object, modelMatrix)
  V <- assays(object)[["mu"]] + dispersions * assays(object)[["mu"]]^2
  PearsonResSq <- (counts(object) - assays(object)[["mu"]])^2 / V
  cooks <- PearsonResSq / p  * H / (1 - H)^2
  cooks
}

If I understood correctly, Cook's distance can be calculated using Pearson residuals as such (modified from here):

PearsonResSq / (dispersions * p) * H / (1 - H)^2

which is different from the DESeq2 implementation.

It further complicates things when the dispersion parameter in DESeq2 is defined in robustMethodOfMomentsDisp here as

alpha <- ( v - m ) / m^2

which if I understood correctly, returns the inverse of dispersion according to this source.

This in turns make the calculation of the variance correct, because normally it is expressed as such:

V <- assays(object)[["mu"]] + 1 / dispersions * assays(object)[["mu"]]^2

Therefore, I think the correct formula for calculating cooks should be:

cooks <- PearsonResSq * dispersions / p  * H / (1 - H)^2

However, I'm not the expert in statistics, and all my knowledge came from forum posts. I would love to see what others think. Thanks in advance!

DESeq2 • 1.2k views
ADD COMMENT
2
Entering edit mode
@mikelove
Last seen 1 hour ago
United States

The stackexchange post defines dispersion in a different way.

The Pearson residuals in the DESeq2 code are already weighted for variance V.

returns the inverse of dispersion

Also note that there are two definitions of dispersion, one that scales with variance and one that is the inverse.

ADD COMMENT
0
Entering edit mode

Hi Mike, thanks so much for replying! I'm actually learning from the DESeq2 codes because I want to implement Cook's distance calculation for a beta-binomial regression model. The more I read the more confused I am, because it seems like there are different ways of calculating Cook's D. Yet, we expect Cook's D to follow an F distribution in any case.

The Pearson residuals in the DESeq2 code are already weighted for variance V.

You are right that Pearson residuals, by definition, is calculated as residuals divided by standard deviation (ie square root of variance).

However, would you say the source I cited is correct for the calculation of Cook's D using Pearson residuals? Because it seems like in that example, Pearson residuals was further divided by dispersion to get the Cook's D. Whereas in the implementation of DESeq2, it is not done.

ADD REPLY
1
Entering edit mode

That statsexchange post mentions to fix overdispersion to 1 for GLM (we say the same in our 2014 paper):

"τ is an overdispersion parameter (in the negative binomial GLM, τ is set to 1)"

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8

ADD REPLY
0
Entering edit mode

I see! Thanks for your prompt reply, I should have read your paper in more detail. Now I understand more about what's going on.

Just to confirm, am I correct to say that because the overdispersion has been corrected for already by the dispersion parameter, therefore the overdispersion is set to 1? I suppose the same can be applied to other regression models where overdispersion has been accounted for?

ADD REPLY
0
Entering edit mode

Here is the definition of Cook's for GLM in R source code:

https://github.com/wch/r-source/blob/trunk/src/library/stats/R/lm.influence.R#L261

And see here that dispersion is set to 1:

https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/summary.glm

This value dispersion is also set to 1 when using glm.nb.

So our formula accords with Cook's for Poisson and Binomial GLM in R and in glm.nb in MASS.

ADD REPLY
0
Entering edit mode

I was curious so I tested glm.nb. Putting it here for the record.

set.seed(123)
n <- 100
dat <- data.frame(
  batch = c(rep('A', n), rep('B', n)),
  count = c(rnbinom(n = n, size = 5, prob = 0.3), rnbinom(n = n, size = 15, prob = 0.5))
)

mdl <- MASS::glm.nb(count ~ batch, data = dat)
summary(mdl)$dispersion

Indeed, the dispersion value is set to 1 in glm.nb. Therefore I fully agree with you that DESeq2's formula is the same as in MASS.

Thank you very much again for answering my questions!

ADD REPLY

Login before adding your answer.

Traffic: 1633 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6