I originally asked this question on Biostars and a responder suggested I try here. To avoid any duplication of effort, I will copy or link any answer. If you have a better suggestion, please let me know.
Below is a copy-pasted version of my question:
McCarthy, D.J., Chen, Y., and Smyth, G.K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res 40, 4288–4297.
In Figure 2 of this paper, the authors show that estimating dispersion on a per-gene basis is more compatible with their data:
(I realise I'm probably not supposed to copy the figure here, but I've acknowledged the source)
I think understand broadly what is being demonstrated here (please correct me if I'm mistaken): When we estimate dispersions, that is an implicit model of the ratio of the mean to the standard deviation of each gene. Here, the authors are showing, with QQ plots, that the per-gene model describes the observed ratio better than a common dispersion value. Each dot in the plot corresponds to a gene.
I'd like to generate this figure for my own data, but I don't understand how to compute the two vectors required as inputs to qqplot(). I'm guessing that one might be the log likelihood after fitting the GLM?
Thanks for any light you can shed (code also gratefully appreciated, but no obligation)
You can make these plots very simply in edgeR using the function gof() provided for this purpose. Here is the code to make the common plot:
If you want to program all the gory details yourself then (this is a variation on the code in Davis' answer):
Note that this plot was used to demonstrate a basic property of RNA-seq data. It is not intended to check assumptions for individual data analyses. For practical data analyses, you should always use gene-specific dispersions. Or rather, estimateDisp() will automatically find for you an optimal compromise between common, trended and tagwise dispersion estimates.
This is a tremendously helpful set of comments, many thanks Gordon. I take your caveat on board - the primary motivation for repeating the gof() analysis is to show my colleagues the effect of tagwise estimation. The secondary motivation is to check my own understanding, which has been vastly improved by this discussion.
I had forgotten about the gof() function! :)
Thank you! That's extremely helpful. So we're actually testing how well our deviances fit the chi squared distribution (via a normal transform). Am I thinking along the right lines when I say that this is like testing the residuals for correlation in a (general) linear model? In the sense that we are testing the appropriateness of our modelling assumptions by looking at the distribution of the errors.
To check I have understood: I think I can find the blue points in the plot using the following
?
You've forgotten to apply Holm multiple-testing adjustment to the p-values (as mentioned in the figure caption). See the code in my comment to Davis' answer.
The plot is analogous to the classic qq plot of normal residuals. The idea is identify outlier residuals rather than to check for correlation. You can check out for example the Wikipedia article on Q-Q Plot.
But note: these plots were used for a specific purpose in our published paper. We do not use or recommend these plots as part of routine data analyses. The plots do not have any simple interpretation in terms of goodness of fit in an individual data analysis, except possibly to confirm that common or trended dispersion is inadequate. The plot is a bit vacuous when applied to tagwise dispersions because the tagwise dispersions are chosen to make the deviances a reasonable fit. But really you should be using tagwise dispersion all the time.