Hello,
I was using edgeR to do differential expression on a RNA-seq dataset,
but when I check the p.value (not adjusted) distribution of the
output, I found ~700 genes got a p value larger than 1... Just
wondering if anyone else got this problem before? And why I got such a
weird a result. Here's my codes,
d <- DGEList(counts=countsTable, group=conds, lib.size = lib.sizes) #I
used the raw read counts as instructed in the user guide
d <- calcNormFactors(d, method="TMM")
d <- estimateCommonDisp(d)
et.commonDis <- exactTest(d)
hist(et.commonDis$table$p.value, breaks=100, col="skyblue",
border="slateblue", main="pval distribution_edgeR")
May I have your ideas? Thanks.
Yuan
Dear Yuan,
exactTest() does not produce p-values larger than one as far as I am
aware. For example I just ran:
> library(edgeR)
> example(exactTest)
> summary(de$table$p.value)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1295 0.4351 0.6770 0.6201 0.8535 1.0000
> max(de$table$p.value)
[1] 1
If you think there are p-values>1, please give an example. You can't
judge from a histogram, because the right bar of the plot may appear
to
extend past 1 even when the values themselves don't do that.
Best wishes
Gordon
> Date: Thu, 5 Jan 2012 14:24:21 -0800
> From: Yuan Tian <ytianidyll at="" ucla.edu="">
> To: <bioconductor at="" stat.math.ethz.ch="">
> Subject: [BioC] Why edgeR ouput some p values larger than 1
>
> Hello,
>
> I was using edgeR to do differential expression on a RNA-seq
dataset,
> but when I check the p.value (not adjusted) distribution of the
output,
> I found ~700 genes got a p value larger than 1... Just wondering if
> anyone else got this problem before? And why I got such a weird a
> result. Here's my codes,
>
> d <- DGEList(counts=countsTable, group=conds, lib.size = lib.sizes)
#I used the raw read counts as instructed in the user guide
> d <- calcNormFactors(d, method="TMM")
> d <- estimateCommonDisp(d)
> et.commonDis <- exactTest(d)
>
> hist(et.commonDis$table$p.value, breaks=100, col="skyblue",
> border="slateblue", main="pval distribution_edgeR")
>
> May I have your ideas? Thanks.
>
> Yuan
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}
Hi Yuan,
OK, I see. Thanks for alerting me to this and sending through your
example data.
The short answer is that all the p-values that are greater than 1
should
be equal to 1. I have committed through a fix to the edgeR package
today
to make sure that this happens.
The problems arises because the default exact test method is to double
the
smaller of the two tail probabilities. If both of the tail
probabilities
are greater than 0.5, then the final p-value ends up being > 1. The
issue
arises most commonly when the counts are so small that significant
differential expression is impossible anyway.
For example, suppose there are two libraries in group A and one
library in
group B, and the counts for a particular gene are: 0 0 1 for the three
libraries. Suppose the overall library sizes are all equal. The
total
count in group A is 0 and the total count in B in 1. Given the fact
that
there is one count in total, the probability that it occurs in group A
is
2/3 and the probability it occurs in group B is 1/3, under the null
hypothesis. We actually observe the A count to be zero.
Prob(A count <= 0) = 2/3
Prob(A count >= 0) = 1
Hence both tail probabilities are > 1/2.
Best wishes
Gordon
---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.wehi.edu.auhttp://www.statsci.org/smyth
On Sun, 8 Jan 2012, Yuan Tian wrote:
> Dear Gordon,
>
> In my dataset, there are 1509 genes that got a p.value larger than1.
I
> extract out the read counts of those genes and make them into a new
> dataset named as "dat" (also attached as "test.csv"). I ran the
> following codes, and I still got 1034 genes with pvalue larger than
1...
>
>> conds
> [1] 1 2 1 1 1 1 1 2 2 1 1
>> d <- DGEList(counts=dat, group=conds)
>> d <- calcNormFactors(d, method="TMM")
>> d <- estimateCommonDisp(d)
>> d<-estimateTagwiseDisp(d)
>> de=exactTest(d)
> Comparison of groups: 2 - 1
>
>> summary(de$table$p.value)
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 0.0000001 0.9372000 1.0000000 0.9548000 1.1750000 1.4550000
>
> Most of problematic genes are actually the genes that have 0 read
count
> in some samples, but would this be a problem for the DE test?
>
> Best,
>
> Yuan
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}