Background
I'm currently trying to understand how RNA-seq analysis should be performed. I have read all manuals (+ the corresponding papers), however as I'm mostly concerned with bio-informatics regarding data cleaning and integration I find it hard to understand the statistics behind e.g. edgeR (If there are any recommendation (papers e.g.) to better understand these feel free to recommend some).
Why this question?
As the prior degrees of freedom is a constant, and thus not automatically determined based on the data I was interested in its effects on the dispersion estimate. I read here that the prior is commonly calculated as:
50/(#samples−#groups)
When I do this for my data: 50/(4-2) = 25.
As this is substantially higher than the default (=10) I decided to see the effects of the prior df on my data, for which I used the following code:
library(edgeR) # loading data groups <- as.factor(c("group1","group1","group2","group2")) dg.list <- DGEList(counts=counts, group = groups) # normalization dg.list <- calcNormFactors(dg.list) # dispersion estimates dg.list <- estimateGLMCommonDisp(dg.list) dg.list <- estimateGLMTrendedDisp(dg.list) # effect of prior in tag wise dispersion estimates design <- model.matrix(~0+group, data=dg.list$samples) colnames(design) <- levels(dg.list$samples$group) par(mfrow = c(3,2)) for (prior in c(10,15,20,25,30,35)) { dg.list <- estimateGLMTagwiseDisp(dg.list, design, prior.df = prior) fit <- glmFit(dg.list ,design) gof(fit, plot = T, main = paste0('prior: ',prior)) }
The question
As can be seen in this plot the genes in the top right are under the line with the default (=10) but above the line with a prior df of 25. Of course I could just pick the prior df best fitting the line but I assume this would not be the best practice. Thus my question is: is there a way to estimate the 'best' prior df or should I just stick with the defaults?
Dear Aaron, thankyou for your reply! I did not know about the estimateDisp wrapper for the common, trend and tagwise dispersion estimates.
estimateDisp() and the associated publications are introduced in the first chapter of the edgeR User's Guide. estimateDisp() is the first piece of code introduced in Section 1.4 "Quick Start".
Do you have any recommendation to papers/websites/videos to understand why you should calculate these dispersions, and how these are used in the GLM fitting?
Section 1.2 of the edgeR user's guide contains brief summaries of the relevant papers.