Question

Overriding of shrinkage in DESeq2

1

Entering edit mode

Nik Tuzov ▴ 90

@nik-tuzov-8783

Last seen 14 months ago

United States

Hello:

Could you please answer some questions about Figure 1 in Love et al (2014), “Moderated estimation of fold change and dispersion”. I am interested in how and why the shrinkage is overridden for the features that exhibit extraordinarily high dispersion.

1) The paper says that such features “do not obey our modeling assumptions”. At the same time the, a similar reasoning is not applied to the features with extraordinarily low dispersion. Why?

2) Suppose the dispersion parameter indeed follows the lognormal prior assumption and no overriding is required. Using the “2 residual standard deviations” rule will result in about 5% of features having overestimated p-values. Don't you think it's a problem, especially in a plausible situation when only a small proportion of features is differentially expressed?

Regards,

Nik

deseq2 • 2.2k views

ADD COMMENT • link updated 8.9 years ago by Michael Love 43k • written 8.9 years ago by Nik Tuzov ▴ 90

score 5 · Answer 1 · 2016-04-12

5

Entering edit mode

Michael Love 43k

@mikelove

Last seen 5 days ago

United States

hi Nik,

These are very good questions. For simplicity, I will be speaking of within-group estimates: the estimates of dispersion tend toward zero occur essentially whenever the variance is less than the mean, even with a positive true dispersion for the distribution. This happens often when the counts are low. For example:

varmean <- function(x) var(x)/mean(x)
sum(replicate(100, varmean(rnbinom(n=3,mu=10,size=10))) < 1)

So the low values are often not a violation of the modeling assumptions, but just occur due to random sampling.

Raising low dispersion values is a conservative choice, as is not lowering very high values. The real statistical challenge with RNA-seq differential expression is typically small replicate size (n=3-5), and so doing conservative things regarding shrinkage (which is strongest when sample sizes are small and which disappears as sample size grows) makes sense.

Regarding (2), a few points: yes, if the log dispersion values do follow a Normal then we are not shrinking for ~5%. But again we think this is the best choice. The MLE values are another valid estimator for dispersion, and here we are just making a conservative choice when sample size is very small and shrinkage would be strong.

Regarding differential expression (DE), note that dispersion and DE status are not linked. These genes with high dispersion are not the DE genes, they are the genes with extremely high within-group variance.

Of course, there could be DE genes in this set of genes with high dispersion (as there can be DE genes at all levels of dispersion), but these are the hardest genes to include in an FDR-bounded set, because the noise level is so high. The point of a statistical method is to maximize sensitivity while controlling the FDR, which inevitably means not calling such genes DE, so as to control the FDR. Here we are not guaranteeing that we won't call these genes DE, but we are not moderating dispersion and so the DE evidence needs to be strong as the gene-level (without the information sharing across genes).

ADD COMMENT • link 8.9 years ago Michael Love 43k

0

Entering edit mode

Thank you for replying. Maybe you could consider some kind of a three component model for the dispersion parameter.

ADD REPLY • link 8.9 years ago Nik Tuzov ▴ 90

1

Entering edit mode

So, the DESeq2 methods are basically fixed. There are many benchmarking papers showing we have very good sensitivity while controlling specificity or precision, and we don't want our method to be a moving target. You can imagine trying to have more sophisticated approaches to all the estimation tasks in gene-level DE, but then you have to guarantee that these will work well not just on simulation but on a variety of shapes and sizes of real datasets, and won't do something strange or unexpected. I like the current DESeq2 methods because our performance is good according to independent groups doing benchmarking and because I know that many groups have tested them on all kind of data and they give reasonable answers (or else I would hear about the problems on the support site).

ADD REPLY • link 8.9 years ago Michael Love 43k

1

Entering edit mode

Hi Nik,

As a side note, there is a paper on mixture modelling with the dispersions and associated software:

Bonafede et. al. - Modeling overdispersion heterogeneity in differential expression analysis using mixtures - 2015

https://cran.r-project.org/web/packages/mixtNB/

http://arxiv.org/abs/1410.8093v2

www.dx.doi.org/10.1111/biom.12458

which might be of interest to you.

Bernd

ADD REPLY • link 8.9 years ago Bernd Klaus ▴ 610

0

Entering edit mode

Good point Bernd. Actually that happens to capture the discussion here, because they show with simulation that DESeq2 has a FPR rate a little less than the target alpha, so too conservative of inference.

ADD REPLY • link 8.9 years ago Michael Love 43k

0

Entering edit mode

That's queer. I came across this paper

http://www.lifesci.dundee.ac.uk/sites/www.lifesci.dundee.ac.uk/files/schurch%20et%20al%202016.pdf

where in Fig 4 one can see that DESeq2 does not control FPR. Since the shrinkage of variance parameters is conservative, it's probably due to the shrinkage of regression coefficients. The latter is absent in DESeq which does not have the FPR problem, according to the paper.

ADD REPLY • link 8.8 years ago Nik Tuzov ▴ 90

1

Entering edit mode

There was an error in the original evaluations in which DESeq2 filtered genes were assigned a p-value of 0 (note: by the authors scripts, not by DESeq2).

The authors have issued a correction, see the latest:

http://rnajournal.cshlp.org/content/early/2016/03/28/rna.053959.115

And the corresponding authors blog post:

https://geoffbarton.wordpress.com/2016/04/21/how-many-biological-replicates-are-needed-in-an-rna-seq-experiment-and-which-differential-expression-tool-should-you-use/

ADD REPLY • link 8.8 years ago Michael Love 43k

1

Entering edit mode

Hi Nik,

As Mike says, there was an error in the original early-access publication of our paper that lead to incorrect FPR results for DESeq2. Thanks for Mike, the bug was quickly spotted and corrected and the final publication version (which he has already linked to) correctly shows the DESeq2 performs well across all our benchmark tests.

Could I ask, where did you find the link to this old copy of the paper? It's certainly there (at the time of posting anyway, we're taking this old incorrect version down now), but I can't find any links to it from out homepage, arXiv, RNA, or a google search and we'd like correct the linking page to point at the final version of the paper.

ADD REPLY • link 8.8 years ago nschurch ▴ 20

0

Entering edit mode

Thank you for replying. I got the link because I am subscribed to Google Scholar alerts.

ADD REPLY • link 8.8 years ago Nik Tuzov ▴ 90