Question

Why edgeR ouput some p values larger than 1

0

Entering edit mode

Yuan Tian ▴ 60

@yuan-tian-5034

Last seen 10.7 years ago

Hello, I was using edgeR to do differential expression on a RNA-seq dataset, but when I check the p.value (not adjusted) distribution of the output, I found ~700 genes got a p value larger than 1... Just wondering if anyone else got this problem before? And why I got such a weird a result. Here's my codes, d <- DGEList(counts=countsTable, group=conds, lib.size = lib.sizes) #I used the raw read counts as instructed in the user guide d <- calcNormFactors(d, method="TMM") d <- estimateCommonDisp(d) et.commonDis <- exactTest(d) hist(et.commonDis$table$p.value, breaks=100, col="skyblue", border="slateblue", main="pval distribution_edgeR") May I have your ideas? Thanks. Yuan

edgeR edgeR • 1.5k views

ADD COMMENT • link updated 13.3 years ago by Gordon Smyth 52k • written 13.3 years ago by Yuan Tian ▴ 60

score 0 · Answer 1 · 2012-01-07

Dear Yuan, exactTest() does not produce p-values larger than one as far as I am aware. For example I just ran: > library(edgeR) > example(exactTest) > summary(de$table$p.value) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.1295 0.4351 0.6770 0.6201 0.8535 1.0000 > max(de$table$p.value) [1] 1 If you think there are p-values>1, please give an example. You can't judge from a histogram, because the right bar of the plot may appear to extend past 1 even when the values themselves don't do that. Best wishes Gordon > Date: Thu, 5 Jan 2012 14:24:21 -0800 > From: Yuan Tian <ytianidyll at="" ucla.edu=""> > To: <bioconductor at="" stat.math.ethz.ch=""> > Subject: [BioC] Why edgeR ouput some p values larger than 1 > > Hello, > > I was using edgeR to do differential expression on a RNA-seq dataset, > but when I check the p.value (not adjusted) distribution of the output, > I found ~700 genes got a p value larger than 1... Just wondering if > anyone else got this problem before? And why I got such a weird a > result. Here's my codes, > > d <- DGEList(counts=countsTable, group=conds, lib.size = lib.sizes) #I used the raw read counts as instructed in the user guide > d <- calcNormFactors(d, method="TMM") > d <- estimateCommonDisp(d) > et.commonDis <- exactTest(d) > > hist(et.commonDis$table$p.value, breaks=100, col="skyblue", > border="slateblue", main="pval distribution_edgeR") > > May I have your ideas? Thanks. > > Yuan ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

score 0 · Answer 2 · 2012-01-10

Hi Yuan, OK, I see. Thanks for alerting me to this and sending through your example data. The short answer is that all the p-values that are greater than 1 should be equal to 1. I have committed through a fix to the edgeR package today to make sure that this happens. The problems arises because the default exact test method is to double the smaller of the two tail probabilities. If both of the tail probabilities are greater than 0.5, then the final p-value ends up being > 1. The issue arises most commonly when the counts are so small that significant differential expression is impossible anyway. For example, suppose there are two libraries in group A and one library in group B, and the counts for a particular gene are: 0 0 1 for the three libraries. Suppose the overall library sizes are all equal. The total count in group A is 0 and the total count in B in 1. Given the fact that there is one count in total, the probability that it occurs in group A is 2/3 and the probability it occurs in group B is 1/3, under the null hypothesis. We actually observe the A count to be zero. Prob(A count <= 0) = 2/3 Prob(A count >= 0) = 1 Hence both tail probabilities are > 1/2. Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. http://www.wehi.edu.au http://www.statsci.org/smyth On Sun, 8 Jan 2012, Yuan Tian wrote: > Dear Gordon, > > In my dataset, there are 1509 genes that got a p.value larger than1. I > extract out the read counts of those genes and make them into a new > dataset named as "dat" (also attached as "test.csv"). I ran the > following codes, and I still got 1034 genes with pvalue larger than 1... > >> conds > [1] 1 2 1 1 1 1 1 2 2 1 1 >> d <- DGEList(counts=dat, group=conds) >> d <- calcNormFactors(d, method="TMM") >> d <- estimateCommonDisp(d) >> d<-estimateTagwiseDisp(d) >> de=exactTest(d) > Comparison of groups: 2 - 1 > >> summary(de$table$p.value) > Min. 1st Qu. Median Mean 3rd Qu. Max. > 0.0000001 0.9372000 1.0000000 0.9548000 1.1750000 1.4550000 > > Most of problematic genes are actually the genes that have 0 read count > in some samples, but would this be a problem for the DE test? > > Best, > > Yuan ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}