Hi,
I am using edgeR to analyze some RNA-Seq data. I have 3 groups, each with 2 biological replicates: untreated control at 24h, infected at 24h, and infected at 48h. Looking at the p-values and FDRs for the 48h-vs-untreated comparison, the top DE genes have p-values and FDRs displayed as 0 rather than something like 0.000000e+00, which suggests the values are very small and R has rounded them down to 0.
Can I interpret a p-value of 0 for a particular gene to mean that there is zero probability the result could be observed by chance alone, or are p-values of exactly 0 impossible?
Condition <- c("Bb_24h", "Bb_24h", "U_24h", "U_24h", "Bb_48h", "Bb_48h")
RNA_data <- DGEList(counts=Count_Table[, 2:7], group=Condition, genes=Count_Table$Gene_ID)

# Filter low-expressing genes:
# keep genes with more than 1 count per million (cpm) in at least 2 samples
# (1 cpm corresponds to a read count of approximately 22)
keep <- rowSums(cpm(RNA_data) > 1) >= 2
RNA_data_filtered <- RNA_data[keep, , keep.lib.sizes=FALSE]
dim(RNA_data_filtered)

# Normalization
RNA_data_filtered <- calcNormFactors(RNA_data_filtered)

# Estimating dispersion
RNA_data_filtered <- estimateCommonDisp(RNA_data_filtered, verbose=TRUE)
# Disp = 0.00053, BCV = 0.0231
RNA_data_filtered <- estimateTagwiseDisp(RNA_data_filtered)
plotBCV(RNA_data_filtered)

et_24 <- exactTest(RNA_data_filtered, pair=c("U_24h", "Bb_24h"))
topTags(et_24)

Comparison of groups:  Bb_24h-U_24h
        genes      logFC    logCPM        PValue           FDR
14699    MMP1  1.8846197  7.262686  0.000000e+00  0.000000e+00
7122     GFAP -1.4426427 10.339403  0.000000e+00  0.000000e+00
721    ANKRD1 -1.3150204  8.340054  0.000000e+00  0.000000e+00
19713     SCD -1.3983463 10.545028 2.163614e-314 7.556422e-311
3841  COL11A1 -1.2789164  8.063025 7.528131e-291 2.103360e-287
4268     CTGF -1.0540266 10.327154 7.999797e-270 1.862619e-266

et_48 <- exactTest(RNA_data_filtered, pair=c("U_24h", "Bb_48h"))
topTags(et_48)

Comparison of groups:  Bb_48h-U_24h
         genes     logFC   logCPM PValue FDR
19948 SERPINA3  4.699776 4.873163      0   0
14699     MMP1 -4.204516 7.262686      0   0
5619     ENPP2  4.101805 4.409902      0   0
12354      LUM  4.069323 5.353129      0   0

sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] splines  stats  graphics  grDevices  utils  datasets  methods  base

other attached packages:
[1] edgeR_3.8.5 limma_3.22.4

loaded via a namespace (and not attached):
[1] tools_3.1.2
There is no difference between 0 and 0.000000e+00; they are two ways of displaying the exact same number. A reported p-value of 0 means the true p-value has fallen below the smallest positive value a double-precision number can hold (roughly 5e-324) and has underflowed to exactly zero; it does not mean the probability is literally zero.
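You can check this at the R prompt (the specific values below are just illustrations of the representable range of a double):

> identical(0, 0.000000e+00)   # the same number, displayed two ways
[1] TRUE
> 2.163614e-314                # tiny, but still representable
[1] 2.163614e-314
> 1e-324                       # below the representable range: underflows to 0
[1] 0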
With regard to the biological variability, I find it hard to believe that you genuinely have such a small BCV in biological samples. The BCV that you report for the miRNA dataset is much more typical. You should inspect the distribution of genewise dispersions by running estimateDisp with prior.df=0 and then checking the output of plotBCV to see whether your mean-dispersion distribution looks OK.
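For example, a sketch reusing the filtered DGEList from the code above (the object name is arbitrary):

# genewise dispersions with no shrinkage toward the common/trended value
RNA_disp0 <- estimateDisp(RNA_data_filtered, prior.df=0)
plotBCV(RNA_disp0)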
I've attached the BCV plot generated by running estimateDisp with prior.df=0. I don't have much experience interpreting BCV plots, but the lack of a trend in the BCV with respect to logCPM suggests that the mean-dispersion distribution is OK. Since the low biological variation is producing very small p-values, would selecting genes based on logFC be an appropriate approach?
The dispersion distribution shown by plotBCV is fine; however, it still indicates that you have little or no biological variability between your replicates.
Selecting genes by logFC would be a very poor strategy, because it would preferentially select genes with very small counts, for which the logFC estimates are noisy and the evidence for DE is relatively weak. The fact that some of the p-values are very small doesn't justify abandoning the idea of statistical significance entirely.
As I wrote yesterday, I suggest you try glmFit and treatDGE. This incorporates a logFC cutoff within a statistical significance framework, and it will reduce the number of DE genes you get, prioritizing those with larger fold changes. The approach effectively requires genes with smaller counts to pass a higher logFC cutoff than those with larger counts, which is exactly what you need.
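A minimal sketch of that workflow, reusing the objects defined above; the contrast and the log2(1.5) threshold are illustrative choices, not recommendations, and makeContrasts() comes from limma (already attached here). Note that in later edgeR releases treatDGE() was renamed glmTreat():

design <- model.matrix(~0 + Condition)
colnames(design) <- c("Bb_24h", "Bb_48h", "U_24h")   # alphabetical factor order
RNA_data_filtered <- estimateDisp(RNA_data_filtered, design)
fit <- glmFit(RNA_data_filtered, design)
# test whether |logFC| significantly exceeds log2(1.5) for infected 48h vs untreated
con <- makeContrasts(Bb_48h - U_24h, levels=design)
res <- treatDGE(fit, contrast=con, lfc=log2(1.5))
topTags(res)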
Thank you for the comments; I will try glmFit and treatDGE as you suggest.
We do not recommend using both FDR and logFC cutoffs simultaneously, because the logFC cutoff can cause the true FDR to go above the nominal level. If you want to be more stringent and give more weight to larger fold changes, use glmFit() and treatDGE() instead of exactTest().