P Value of 0?
3
0
Entering edit mode
dsperley • 0
@dsperley-7315
Last seen 7.3 years ago
United States

Hi, 

I am using edgeR to analyze some RNA-Seq data. I have 3 groups: untreated 24h control, infected at 24h, and infected at 48h, each with 2 biological replicates.  Looking at the P Values and FDR for  the 48h-Untreated comparison shows that top DE genes have P Values and FDR of 0, not 0.000000e+00 which suggests that the P values and FDR are very small, and the R has rounded down to 0.

Can I interpret a P Value of 0 for a particular gene to mean that there is zero probability that the result could be observed due to chance alone? or are P values of 0 impossible? 

Condition<-c("Bb_24h","Bb_24h","U_24h","U_24h","Bb_48h","Bb_48h")

RNA_data = DGEList(counts=Count_Table[,c(2:7)], group=Condition,genes = Count_Table$Gene_ID)

# filter low expressing genes
#Keeping Genes with greater than 1 count per million (cpm) in at least 2 samples
# 1 cpm corresponds to a read count of aproximately 22

keep<-rowSums(cpm(RNA_data)>1)>=2
RNA_data_filtered<-RNA_data[keep,,keep.lib.sizes=F]
dim(RNA_data_filtered)

# normalization
RNA_data_filtered<- calcNormFactors(RNA_data_filtered)

##estimating dispersion
RNA_data_filtered<-estimateCommonDisp(RNA_data_filtered,verbose=T)
#Disp = 0.00053 , BCV = 0.0231 
RNA_data_filtered<-estimateTagwiseDisp(RNA_data_filtered)
plotBCV(RNA_data_filtered)

et_24<-exactTest(RNA_data_filtered,pair=c("U_24h","Bb_24h"))
topTags(et_24)
Comparison of groups:  Bb_24h-U_24h 
        genes      logFC    logCPM        PValue           FDR
14699    MMP1  1.8846197  7.262686  0.000000e+00  0.000000e+00
7122     GFAP -1.4426427 10.339403  0.000000e+00  0.000000e+00
721    ANKRD1 -1.3150204  8.340054  0.000000e+00  0.000000e+00
19713     SCD -1.3983463 10.545028 2.163614e-314 7.556422e-311
3841  COL11A1 -1.2789164  8.063025 7.528131e-291 2.103360e-287
4268     CTGF -1.0540266 10.327154 7.999797e-270 1.862619e-266

et_48<-exactTest(RNA_data_filtered,pair=c("U_24h","Bb_48h"))
topTags(et_48)
Comparison of groups:  Bb_48h-U_24h 
         genes     logFC   logCPM PValue FDR
19948 SERPINA3  4.699776 4.873163      0   0
14699     MMP1 -4.204516 7.262686      0   0
5619     ENPP2  4.101805 4.409902      0   0
12354      LUM  4.069323 5.353129      0   0

sessionInfo()

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.8.5  limma_3.22.4

loaded via a namespace (and not attached):
[1] tools_3.1.2

 

differential gene expression edger • 26k views
ADD COMMENT
2
Entering edit mode
@ryan-c-thompson-5618
Last seen 3 months ago
Icahn School of Medicine at Mount Sinai…

These p-values don't indicate that the p-value is literally zero. They are simply less than the smallest representable positive double-precision floating point value. Note that .Machine$double.xmin is 2.225074e-308. Floating point values can get a little bit smaller than this using Subnormal numbers, but they bottom out around 1e-314. Anything smaller than that gets rounded down to zero (underflow). The probability density function used to calculate edgeR p-values is continuous and nonzero all the way out to infinity, so a p-value of exactly zero is not possible.

Note that your common dispersion value is 0.00053. This is a very small dispersion, meaning there is almost no detectable biological variability within each condition in your data. EdgeR's significance measures are based on this dispersion estimate. Because edgeR doesn't see any biological variation in your data, it is reporting highly significant p-values. Do you have technical replicates instead of biological replicates?

Also, I notice you skipped the trended dispersion step. Was this intentional? Try using the estimateDisp function, which combines all the dispersion estimation steps into one.

ADD COMMENT
0
Entering edit mode
dsperley • 0
@dsperley-7315
Last seen 7.3 years ago
United States

Thank-you for your comments,

I thought that the P Values were so small that they were being rounded down to 0, but I was expecting to see the value represented as 0.0000e+00 instead of just 0.

The small dispersion is puzzling to me, since I just completed a small RNA-seq analysis of the miRNA extracted from the same samples in  this analysis and the dispersion was 0.02372  with a  BCV of 0.154.  

This experiment had both biological and technical replicates-2 technical replicates per biological replicate. We aligned the technical replicates together using bowtie2, then used HTSeq to create counts files:

bowtie2 -x genome -U techrep1.fastq,techrep2.fastq -S collapsed_techrep.sam 2> collapsed_techrep_stderr.txt

htseq-count -q -s no collapsed_techrep.sam genes.gtf > collapsed_techrep.txt

With such low biological variability, I'm not sure how to proceed. Should I make my cut off for differentially expressed genes more stringent? I'm currently considering any genes with a FDR of less that 0.05 and a logFC greater than 2 or less than -2 to be differentially expressed. Maybe I should decrease the FDR cutoff to 0.01?

 

ADD COMMENT
0
Entering edit mode

There is no difference between 0 and 0.0000e+00. They are two ways of displaying the exact same number:

> 0 == 0.0000e+00
[1] TRUE
> identical(0, 0.0000e+00)
[1] TRUE

With regard to the biological variability, I find it hard to believe that you genuinely have such a small BCV in biological samples. The BCV that you report for the miRNA dataset is much more typical. You should inspect the distribution of genewise dispersions by running estimateDisp with prior.df=0 and inspecting the output of plotBCV to see if your mean-dispersion distribution looks ok.

ADD REPLY
0
Entering edit mode

I've attached the BCV plot  generated from running estimateDisp with a prior.df=0. I don't have that much experience interpreting BCV plots, but the lack of a trend in the BCV with respect to logCPM suggests that the mean-dispersion distribution is ok. Since the low biological variation is resulting in very small P values, would selecting genes based on logFC be an appropriate approach?

ADD REPLY
1
Entering edit mode

The dispersion distribution shown by plotBCV is fine, however it still indicates that you have little or no biological variability between your replicates.

Selecting genes by logFC would be a very poor strategy, because it would preferentially select genes with very small counts for which the evidence for DE is relatively poor. The fact that some of the p-values are small doesn't justify abandoning the idea of statistical significance entirely.

As I wrote yesterday, I suggest you try glmFit and treatDGE. This incorporates a logFC cutoff within a statistical significance framework. This will reduce the number of DE genes that you have, prioritizing those with larger fold changes. This approach will effectively require genes with smaller counts to pass a higher logFC cutoff than those with larger counts, with is exactly what you need.

ADD REPLY
0
Entering edit mode

Thank-you for the comments,  I will try glmFit and treatDGE as you suggest.

 

ADD REPLY
0
Entering edit mode

We do not recommend using both FDR and logFC cutoffs simultaneously, because the logFC cutoff can cause the true FDR to go above the nominal level. If you want to be more stringent and give more weight to larger fold changes, use glmFit() and treatDGE() instead of exactTest().

ADD REPLY
0
Entering edit mode
@matthew-mccormack-2021
Last seen 11 months ago
United States

Hi Gordon,

   So, to be clear, you would not recommend cutting off at, say FDR 0.05, and after that cutting off at log(2) of, say, >1 and < -1 ?

Could you possibly elaborate in a few sentences the reasoning behind your above comment ?

ADD COMMENT
0
Entering edit mode

The treatDGE function tests the null hypothesis that the absolute value of the log fold change is less than a threshold, rather than equal to zero. See the help text for treatDGE and treat for more explanation.

ADD REPLY
0
Entering edit mode

 Thanks, Ryan.  So, how is this different from just doing more manual log(2) cut off  in Excel or a Perl script ?

ADD REPLY
0
Entering edit mode

I would recommend you read the paper for TREAT. The introduction gives a good overview of the rationale, and the figures demonstrate that it is a significant improvement in practice. Here's the link: http://www.ncbi.nlm.nih.gov/pubmed/19176553

ADD REPLY
0
Entering edit mode

Thank for link, Ryan. I have started reading the paper and it is very helpful.

ADD REPLY

Login before adding your answer.

Traffic: 656 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6