Dear DESeq2 Experts,
I am new to RNA-seq data and have started learning DESeq2.
I have used the following example to run DESeq2
Then I extracted the p-values and generated a QQ plot of observed vs. expected p-values, as described in:
http://genome.sph.umich.edu/wiki/Code_Sample:_Generating_QQ_Plots_in_R
Enclosed is the plot:
My question is: why are the observed p-values inflated? Is this expected in the analysis of RNA-seq data?
Thanks for any comments/suggestions.
Nikmehr
Hi Michael,
I understand my first plot was based on example data; however, I see the same pattern with my actual data. Below is the graph using my data:
I wonder what could be the reason for this inflated pattern of observed p-values.
Thanks for any comments/suggestions.
Nikmehr
I don't follow. Are you saying that all the genes are null?
Why do you expect all the genes to be null? In many RNA-seq datasets, the null hypothesis (log fold change = 0) clearly does not hold for many genes.
This motivated our focus on log fold change threshold tests and accurate estimation of log fold change in the DESeq2 paper.
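To illustrate the point about non-null genes, here is a minimal simulation sketch in plain Python (not DESeq2 code). It assumes each gene yields a normally distributed test statistic, and the fraction of differentially expressed genes (30%) and the effect size (3 standard errors) are made-up values chosen only for illustration. When every gene is null, about 5% of p-values fall below 0.05 and the QQ plot follows the diagonal; once a sizable fraction of genes is non-null, the observed p-values pull away from the diagonal across much of the plot.

```python
import math
import random
from statistics import NormalDist


def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate_pvalues(n_genes, frac_de, effect, seed=0):
    """Simulate p-values for a mix of null and non-null genes.

    frac_de and effect are hypothetical settings, not estimates from real data.
    """
    rng = random.Random(seed)
    pvals = []
    for _ in range(n_genes):
        if rng.random() < frac_de:
            # Non-null gene: statistic centred at +effect or -effect.
            mean = effect if rng.random() < 0.5 else -effect
        else:
            # Null gene: statistic centred at zero.
            mean = 0.0
        pvals.append(two_sided_p(rng.gauss(mean, 1.0)))
    return pvals


def frac_below(pvals, alpha=0.05):
    """Fraction of p-values below alpha."""
    return sum(p < alpha for p in pvals) / len(pvals)


def qq_points(pvals):
    """(expected, observed) pairs on the -log10 scale, as used in a QQ plot."""
    n = len(pvals)
    return [
        (-math.log10((i + 0.5) / n), -math.log10(max(p, 1e-300)))
        for i, p in enumerate(sorted(pvals))
    ]


all_null = simulate_pvalues(10000, frac_de=0.0, effect=3.0)
mixed = simulate_pvalues(10000, frac_de=0.3, effect=3.0)
print("fraction p < 0.05, all genes null:", frac_below(all_null))
print("fraction p < 0.05, 30% non-null:  ", frac_below(mixed))
```

Plotting `qq_points(mixed)` against the X = Y line would show the same kind of departure as in the plots above, without any miscalibration of the test itself.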
I expect that in an experiment most of the genes are null and only a small subset of genes (dots) deviates significantly from the solid line (X = Y),
or a plot like this:
https://www.nature.com/ng/journal/v48/n9/fig_tab/ng.3620_SF2.html
I am trying to identify that small subset of genes that represent the genuine associations, but at the moment I have many significant hits.
In RNA-seq, a well-designed experiment will yield many differentially expressed genes. Beyond those, there is a tail of genes whose log fold change is likely not exactly 0 but which have only a small effect. For this, we recommend using the lfcThreshold argument of results(). We discuss this in depth in the DESeq2 paper, but you can just take a look at the vignette for example usage:
https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#tests-of-log2-fold-change-above-or-below-a-threshold
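The idea behind a threshold test can be sketched in a few lines of plain Python. This is a simplified illustration under a normal approximation, not DESeq2's implementation; the function name `wald_p` and the example log fold changes and standard errors are hypothetical. Instead of testing whether the log fold change differs from 0, the statistic measures how far the estimate lies beyond a chosen threshold, so genes with small but precisely estimated effects stop being significant.

```python
from statistics import NormalDist


def wald_p(lfc, se, threshold=0.0):
    """Approximate two-sided p-value for H0: |LFC| <= threshold.

    With threshold=0 this reduces to the usual Wald test of LFC = 0.
    Simplified sketch; not the code DESeq2 runs internally.
    """
    # Distance of the estimate beyond the threshold, in standard errors.
    z = max(abs(lfc) - threshold, 0.0) / se
    return min(1.0, 2.0 * (1.0 - NormalDist().cdf(z)))


# Hypothetical genes: (name, estimated log2 fold change, standard error).
genes = [
    ("small_effect", 0.3, 0.05),
    ("large_effect", 2.5, 0.30),
]

for name, lfc, se in genes:
    p_zero = wald_p(lfc, se)                  # null: LFC = 0
    p_thresh = wald_p(lfc, se, threshold=1.0)  # null: |LFC| <= 1
    print(f"{name}: p(LFC=0)={p_zero:.3g}  p(|LFC|<=1)={p_thresh:.3g}")
```

Here the gene with a small effect is highly significant against a null of 0 but not against a threshold of 1, while the gene with a large effect remains significant under both tests.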
The link you posted above shows a GWAS experiment; these are often underpowered for all but the largest effects. Most genomic loci explain, if anything, minuscule fractions of the variance in the trait. In RNA-seq we typically have much, much larger effect sizes (in terms of population SD, if you like) than in GWAS.
Thank you very much for the information, I found it very helpful!