Question

EdgeR: Prior degrees of freedom effect on tag wise dispersion

1

Entering edit mode

Mr.RB ▴ 20

@mrrb-15633

Last seen 6.0 years ago

Background

I'm currently trying to understand how RNA-seq analysis should be performed. I have read all manuals (+ the corresponding papers), however as I'm mostly concerned with bio-informatics regarding data cleaning and integration I find it hard to understand the statistics behind e.g. edgeR (If there are any recommendation (papers e.g.) to better understand these feel free to recommend some).

Why this question?

As the prior degrees of freedom is a constant, and thus not automatically determined based on the data I was interested in its effects on the dispersion estimate. I read here that the prior is commonly calculated as:

50/(#samples−#groups)

When I do this for my data: 50/(4-2) = 25.

As this is substantially higher than the default (=10) I decided to see the effects of the prior df on my data, for which I used the following code:

library(edgeR)

# loading data
groups <- as.factor(c("group1","group1","group2","group2"))
dg.list <- DGEList(counts=counts, group = groups)

# normalization
dg.list <- calcNormFactors(dg.list)

# dispersion estimates
dg.list <- estimateGLMCommonDisp(dg.list)
dg.list <- estimateGLMTrendedDisp(dg.list)

# effect of prior in tag wise dispersion estimates
design <- model.matrix(~0+group, data=dg.list$samples)
colnames(design) <- levels(dg.list$samples$group)

par(mfrow = c(3,2))
for (prior in c(10,15,20,25,30,35)) {
  dg.list <- estimateGLMTagwiseDisp(dg.list, design, prior.df = prior)
  fit <- glmFit(dg.list ,design)
  gof(fit, plot = T, main = paste0('prior: ',prior))
}

The produced plot

The question

As can be seen in this plot the genes in the top right are under the line with the default (=10) but above the line with a prior df of 25. Of course I could just pick the prior df best fitting the line but I assume this would not be the best practice. Thus my question is: is there a way to estimate the 'best' prior df or should I just stick with the defaults?

R edgeR rnaseq rna-seq • 2.0k views

ADD COMMENT • link updated 7.0 years ago by Aaron Lun ★ 28k • written 7.0 years ago by Mr.RB ▴ 20

score 2 · Accepted Answer · 2018-04-22

2

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 2 hours ago

The city by the bay

There is no longer any need to set the prior d.f. manually. estimateDisp will automatically estimate a suitable prior d.f. for you, and has been doing so for several years. We are also recommending the QL framework:

A: edgeR glmLRT with test="F"?

https://f1000research.com/articles/5-1438/v2

... in which the prior d.f. for empirical Bayes shrinkage of the QL dispersions is also automatically estimated. There has also been work performed in making the prior d.f. estimation more robust to outliers, see:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5373812/

ADD COMMENT • link 7.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron, thankyou for your reply! I did not know about the estimateDisp wrapper for the common, trend and tagwise dispersion estimates.

ADD REPLY • link 7.0 years ago Mr.RB ▴ 20

0

Entering edit mode

estimateDisp() and the associated publications are introduced in the first chapter of the edgeR User's Guide. estimateDisp() is the first piece of code introduced in Section 1.4 "Quick Start".

ADD REPLY • link 7.0 years ago Gordon Smyth 52k

0

Entering edit mode

Do you have any recommendation to papers/websites/videos to understand why you should calculate these dispersions, and how these are used in the GLM fitting?

ADD REPLY • link 7.0 years ago Mr.RB ▴ 20

0

Entering edit mode

Section 1.2 of the edgeR user's guide contains brief summaries of the relevant papers.

ADD REPLY • link 7.0 years ago Aaron Lun ★ 28k