Cohen's d in edgeR
mali salmon ▴ 370
@mali-salmon-4532
Last seen 6.3 years ago
Israel

Hello list

I would like to calculate a standardized effect size in edgeR. I found this post for DESeq2: https://www.biostars.org/p/140976/

Is the formula below the right calculation also for edgeR?

fit$table$logFC / sqrt(1/fit$table$logCPM + fit$dispersion)

Thanks

Mali

edger effect size cohen d • 1.3k views
Aaron Lun ★ 28k
@alun
Last seen 1 hour ago
The city by the bay

The formula you describe above is presumably based on a first-order Taylor approximation. For an NB-distributed random variable $X$, the first-order approximation of the variance of the log-counts is:

$$ \mbox{var}[\log(X + c)] \approx E(X + c)^{-2} \mbox{var}(X) $$

... where $c$ is the prior/pseudo-count that needs to be added to handle zeroes. This expands to:

$$ \mbox{var}[\log(X + c)] \approx \frac{E(X) + \varphi E(X)^2}{(E(X) + c)^2} $$

... for some NB dispersion $\varphi$, which collapses to the form of your expression, $1/E(X) + \varphi$, when $c=0$. This approximation is a bit dodgy but not too bad provided your means are not low relative to the dispersions:

# Fails quite badly here:
disp <- 1
mu <- 1
y <- rnbinom(1000, mu=mu, size=1/disp)
var(log(y+1))                 # empirical variance of the log-counts (c = 1)
(mu + mu^2 * disp)/(mu+1)^2   # first-order approximation
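
To see both regimes side by side, here is a minimal sketch that repeats the simulation above at a low mean and at a high mean. The helper name `compare_logvar`, the seed, the sample size and the second parameter set (`mu = 100`, `disp = 0.1`) are my own choices for illustration, not part of the original answer.

```r
# Compare the empirical variance of log(y + pseudo) against the first-order
# approximation (mu + disp * mu^2) / (mu + pseudo)^2 for simulated NB counts.
compare_logvar <- function(mu, disp, pseudo = 1, n = 1e5) {
    y <- rnbinom(n, mu = mu, size = 1 / disp)
    empirical <- var(log(y + pseudo))
    approx <- (mu + disp * mu^2) / (mu + pseudo)^2
    c(empirical = empirical, approx = approx,
      rel_error = abs(approx - empirical) / empirical)
}

set.seed(42)  # arbitrary seed, only for reproducibility
low  <- compare_logvar(mu = 1,   disp = 1)    # mean low relative to dispersion
high <- compare_logvar(mu = 100, disp = 0.1)  # mean high relative to dispersion
print(rbind(low, high))
```

The `rel_error` column should be much larger in the `low` row than in the `high` row, which is the sense in which the approximation is only usable when the mean is not low relative to the dispersion.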

The real problem is that the variance differs for each observation, depending on the library size and the gene's average expression. Even if the library sizes are all the same, the variance will differ between groups for each DE gene. If you have two groups, do you use the variance of the group with lower expression? With higher expression? The variance at the average count across all samples (which is sensitive to technical aspects of the experiment such as the number of replicates in each group)? It's not entirely clear what the variance should be here; it's not like a linear model where the variance is the same for all samples.
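
As a rough illustration of this point, the sketch below evaluates the $c=0$ approximation for a single gene across six samples. The library sizes, group labels, CPM values and dispersion are all hypothetical numbers chosen for the example, and `approx_logvar` is just a convenience wrapper around the formula above.

```r
# First-order approximation to var[log(X + pseudo)] for an NB count X.
approx_logvar <- function(mu, disp, pseudo = 0) {
    (mu + disp * mu^2) / (mu + pseudo)^2
}

# Hypothetical design: two groups of three samples with unequal library sizes.
lib_size <- c(5e6, 10e6, 20e6, 5e6, 10e6, 20e6)
group <- rep(c("A", "B"), each = 3)

# Hypothetical DE gene: 1 count per million in group A, 4 per million in group B.
cpm <- c(A = 1, B = 4)
disp <- 0.1

mu <- unname(cpm[group]) * lib_size / 1e6  # expected count in each sample
logvar <- approx_logvar(mu, disp)          # equals 1/mu + disp when pseudo = 0
print(data.frame(group, lib_size, mu, logvar))
```

Every sample ends up with its own approximate variance, and the two groups only agree where their expected counts happen to coincide, so there is no single denominator to plug into a Cohen's d.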

Given these issues, I wouldn't be confident that you could obtain an effect size estimate that is easily comparable across experiments or genes. For example, a decrease in sequencing depth will increase the variance of the log-counts and decrease the apparent effect size, even if the biological system is the same. The ranking of genes within an experiment will also depend on the overall depth, e.g., a low-abundance gene with a low dispersion may have a larger effect size than a high-abundance gene with a larger dispersion at high coverage, but a lower effect size at low coverage where Poisson noise dominates. Your specific formula is also incorrect in that it divides by the log-CPM, whereas you need the expected count (i.e., without log-transformation).
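
The depth dependence of the ranking can be sketched with two made-up genes. Everything here is assumed for illustration: the helper `std_effect`, the fold changes (taken on the natural-log scale to match the variance formula), the dispersions and the expected counts. It also evaluates the variance at a single expected count per gene, which glosses over the between-group ambiguity discussed above.

```r
# Standardized effect size using the c = 0 approximation, evaluated at a
# single expected count per gene (a simplification).
std_effect <- function(log_fc, mu, disp) {
    log_fc / sqrt(1 / mu + disp)
}

# Two hypothetical genes with the same natural-log fold change.
log_fc <- c(low_abundance = 1, high_abundance = 1)
disp <- c(low_abundance = 0.01, high_abundance = 0.1)

mu_deep <- c(low_abundance = 20, high_abundance = 2000)  # expected counts at high coverage
mu_shallow <- mu_deep / 100                              # same genes at 100-fold lower depth

d_deep <- std_effect(log_fc, mu_deep, disp)
d_shallow <- std_effect(log_fc, mu_shallow, disp)
print(rbind(d_deep, d_shallow))
```

With these numbers the low-abundance gene has the larger value at high coverage and the smaller one at low coverage, because the $1/E(X)$ Poisson term swamps its low dispersion once the counts get small.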

Perhaps there is a better way to do what you want instead of trying to compute Cohen's d here.


OK, I see, thank you so much for the detailed explanation

Mali
