Hello,
I've found either a bug in estimateDisp() or a mistake in the
documentation about when it uses the classic qCML vs. the GLM CR
dispersion estimates. The help page for estimateDisp() says
"If there is no design matrix, it calculates the quantile conditional
likelihood for each gene (tag) and then maximize it. The method is
same as in the function estimateCommonDisp and estimateTagwiseDisp".
However, when you call estimateDisp() without specifying a design
matrix but with a DGEList object with more than one level in
samples$group, it internally calculates a design matrix and gives you
common/trended/tagwise dispersion estimates. Actually, even when
samples$group has only one level, estimateDisp still internally
creates a unit vector design matrix and outputs
common/trended/tagwise dispersion estimates. So is it a bug in the
code or a mistake in the documentation and estimateDisp() can't be
used to get the same results as estimateCommonDisp() and
estimateTagwiseDisp()? (example below)
It was my understanding that for a single factor experiment (2 or more
groups), the classic qCML was preferred over the GLM dispersion.
However, if you have more than 2 groups and you want to calculate a
oneway ANOVA or specialized contrasts, it seems you have to use the
GLM approach. Is the GLM approach therefore becoming the best method
for all experimental designs, even those with only a single factor?
Thanks,
Jenny
#example - data as indicated in edgeRUsersGuide() section 4.3
> library(edgeR)
Loading required package: limma
> shell.exec(system.file("doc","edgeRUsersGuide.pdf",package="edgeR"))
>
> targets <- readTargets()
> targets
Lane Treatment Label
Con1 1 Control Con1
Con2 2 Control Con2
Con3 2 Control Con3
Con4 4 Control Con4
DHT1 5 DHT DHT1
DHT2 6 DHT DHT2
DHT3 8 DHT DHT3
> x <- read.delim("pnas_expression.txt", row.names=1,
stringsAsFactors=FALSE)
> y <- DGEList(counts=x[,1:7], group=targets$Treatment,
genes=data.frame(Length=x[,8]))
> colnames(y) <- targets$Label
> keep <- rowSums(cpm(y)>1) >= 3
> y <- y[keep,]
> y$samples$lib.size <- colSums(y$counts)
> y <- calcNormFactors(y)
> y$samples
group lib.size norm.factors
Con1 Control 976847 1.0296636
Con2 Control 1154746 1.0372521
Con3 Control 1439393 1.0362662
Con4 Control 1482652 1.0378383
DHT1 DHT 1820628 0.9537095
DHT2 DHT 1831553 0.9525624
DHT3 DHT 680798 0.9583181
>
> y2 <- y
>
> y <- estimateCommonDisp(y, verbose=TRUE)
Disp = 0.02002 , BCV = 0.1415
> y <- estimateTagwiseDisp(y)
> y2 <- estimateDisp(y2)
Loading required package: splines
>
> names(y)
[1] "counts" "samples" "genes"
[4] "common.dispersion" "pseudo.counts" "pseudo.lib.size"
[7] "AveLogCPM" "prior.n" "tagwise.dispersion"
> names(y2)
[1] "counts" "samples" "genes"
[4] "common.dispersion" "trended.dispersion" "trend.method"
[7] "AveLogCPM" "span" "tagwise.dispersion"
[10] "prior.df" "prior.n"
>
> all.equal(y$tagwise.dispersion, y2$tagwise.dispersion)
[1] "Mean relative difference: 0.1156227"
>
> et <- exactTest(y)
> summary(decideTestsDGE(et))
[,1]
-1 2096
0 12059
1 2339
>
> et2 <- exactTest(y2)
> summary(decideTestsDGE(et2))
[,1]
-1 2078
0 12084
1 2332
>
>
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] splines stats graphics grDevices utils datasets
methods
[8] base
other attached packages:
[1] locfit_1.5-9.1 edgeR_3.6.6 limma_3.20.8
[4] BiocInstaller_1.14.2
loaded via a namespace (and not attached):
[1] grid_3.1.1 lattice_0.20-29 tools_3.1.1
Jenny Drnevich, Ph.D.
Functional Genomics Bioinformatics Specialist
High Performance Biological Computing Program
and The Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign
NOTE NEW OFFICE LOCATION
2112 IGB
1206 W. Gregory Dr.
Urbana, IL 61801
USA
ph: 217-300-6543
fax: 217-265-5066
e-mail: drnevich@illinois.edu<mailto:drnevich@illinois.edu>
[[alternative HTML version deleted]]
Hi Gordon,
Thanks for the clarification. Yes, the documentation should be reworded - I did understand from it that the GLM dispersion estimates were slightly different in this function because the help page says it is "similar to" the trio of GLM dispersion functions, but the qCML method says it "is the same as". Therefore, I thought it was a deliberate distinction between the two.
Jenny
Sorry to post on such an old question, but I am facing the same situation. I am trying to use the classic method, and I get different results when using estimateDisp vs estimateCommonDisp + estimateTagwiseDisp.
Am I correct to assume that estimateDisp is the best option?
Thanks
Mau
estimateDisp() estimates the prior.df whereas estimateTagwiseDisp() just puts in a preset value (prior.df=10), so they won't give the same results.
We recommend estimateDisp() is all our documentation, see for example the edgeR workflow:
https://bioconductor.org/packages/release/workflows/vignettes/RnaSeqGeneEdgeRQL/inst/doc/edgeRQL.html