DESeq2 estimateDispersionGeneEst taking infinite time
1
0
Entering edit mode
@gregorylstone-12225
Last seen 6.0 years ago

I have encountered the same problem posted a while back here: https://www.biostars.org/p/193916/

I am using DESeq2 to run a multifactor paired analysis to determine the change in gene expression between men and women due to a treatment on 70 males and 47 females. Because I have a differing number of samples for each condition, I'm using the following steps to create a custom design matrix: C: DESeq2 paired and multi factor comparison

Everything is fine until estimateDispersionsGeneEst(dds, modelMatrix=mm1) gets called. Previously, when I was using the dds constructor for HTSeq count data, even though I was using feeatureCount output, the command would run for a while then exit citing glibc.

Now I am using the dds constructor for a count matrix, and the estimateDispersionsGeneEst(dds, modelMatrix=mm1) call is taking forever to run. I'm confused as to what I have done wrong, and would appreciate any help.

 

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] IHW_1.2.0                  DESeq2_1.14.1
[3] SummarizedExperiment_1.4.0 Biobase_2.34.0
[5] GenomicRanges_1.26.2       GenomeInfoDb_1.10.2
[7] IRanges_2.8.1              S4Vectors_0.12.1
[9] BiocGenerics_0.20.0

loaded via a namespace (and not attached):
 [1] genefilter_1.56.0    locfit_1.5-9.1       slam_0.1-40
 [4] splines_3.3.1        lattice_0.20-33      colorspace_1.3-2
 [7] htmltools_0.3.5      base64enc_0.1-3      survival_2.40-1
[10] XML_3.98-1.5         foreign_0.8-66       DBI_0.5-1
[13] BiocParallel_1.8.1   RColorBrewer_1.1-2   plyr_1.8.4
[16] stringr_1.1.0        zlibbioc_1.20.0      munsell_0.4.3
[19] gtable_0.2.0         htmlwidgets_0.8      memoise_1.0.0
[22] latticeExtra_0.6-28  knitr_1.15.1         geneplotter_1.52.0
[25] fdrtool_1.2.15       AnnotationDbi_1.36.2 htmlTable_1.9
[28] Rcpp_0.12.9          acepack_1.4.1        xtable_1.8-2
[31] backports_1.0.5      scales_0.4.1         checkmate_1.8.2
[34] Hmisc_4.0-2          annotate_1.52.1      XVector_0.14.0
[37] lpsymphony_1.2.0     gridExtra_2.2.1      ggplot2_2.2.1
[40] digest_0.6.12        stringi_1.1.2        grid_3.3.1
[43] tools_3.3.1          bitops_1.0-6         magrittr_1.5
[46] lazyeval_0.2.0       RCurl_1.95-4.8       tibble_1.2
[49] RSQLite_1.1-2        Formula_1.2-1        cluster_2.0.4
[52] Matrix_1.2-6         data.table_1.10.4    assertthat_0.1
[55] rpart_4.1-10         nnet_7.3-12

deseq2 paired samples • 1.8k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 23 hours ago
United States

hi,

Can you subset to a smaller number of rows to get a sense of how long it is taking? Like 100 genes with decently high counts (e.g. average counts across samples ~100 should do for this test). You do have a very large vector of coefficients to fit, so it makes sense that it takes some time. Ways to go forward would be to make sure you're only running this on genes with moderate to high counts (do strict pre-filtering), and then use parallelization. Another option would be to use a faster method for the size of your model matrix, such as the linear model in limma-voom.

ADD COMMENT
0
Entering edit mode

Thank you for the advice. I really like DESeq2's count normalization method, and find it very easy to use, so I'll proceed by applying a strict pre-filter and see what happens from there. Thanks again!

ADD REPLY
0
Entering edit mode

You could require a normalized count of 10 or more in some number x of samples. 

E.g:

dds <- estimateSizeFactors(dds)

norm.cts <- counts(dds, normalized=TRUE)

keep <- rowSums(norm.cts >= 10) >= x

table(keep)

dds <- dds[keep,]

But I would check first to see if this is running in a reasonable amount of time for small subset of genes.

ADD REPLY
0
Entering edit mode

After several hours of running (without subsetting or prefiltering) it exited with the output: "glibc detected". The first several lines of the output is:

*** glibc detected *** /opt/R-3.3.1/lib64/R/bin/exec/R: double free or corruption (!prev): 0x0000000037643ca0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75f3e)[0x2b46f4815f3e]
/lib64/libc.so.6(+0x78dd0)[0x2b46f4818dd0]
/home/gls14/R/library/DESeq2/libs/DESeq2.so(_ZN4arma3MatIdED2Ev+0x29)[0x2b470a8121c9]
/home/gls14/R/library/DESeq2/libs/DESeq2.so(_ZN4arma6auxlib10det_lapackIdEET_RKNS_3MatIS2_EEb+0x14a)[0x2b470a8145da]
/home/gls14/R/library/DESeq2/libs/DESeq2.so(_ZN4arma6auxlib3detIdNS_3MatIdEEEET_RKNS_4BaseIS4_T0_EE+0x2da)[0x2b470a8149ba]
/home/gls14/R/library/DESeq2/libs/DESeq2.so(_Z13log_posteriordN4Rcpp9MatrixRowILi14EEES1_N4arma3MatIdEEddb+0x1a4)[0x2b470a80a214]
/home/gls14/R/library/DESeq2/libs/DESeq2.so(_Z7fitDispP7SEXPRECS0_S0_S0_S0_S0_S0_S0_S0_S0_S0_+0x732)[0x2b470a80c602]
/home/gls14/R/library/DESeq2/libs/DESeq2.so(DESeq2_fitDisp+0xb3)[0x2b470a821f83]
/opt/R-3.3.1/lib64/R/lib/libR.so(+0xdcc1c)[0x2b46f3dbec1c]

I'm at a loss of how to explain this. Have you ever encountered this before?

ADD REPLY
0
Entering edit mode
I haven't. You could email me the filtered dds and I can take a look. May take me a few days to get to it though.
ADD REPLY
0
Entering edit mode

Sounds good, I'll send it to you. Thank you for being so generous with your time

ADD REPLY
0
Entering edit mode

I don't have any problem running this on my end on a small subset.

Just to say again (for others reading this thread) your design matrix is really large (234 rows x 119 coefficients), so it makes sense this takes a while to fit with a GLM that has to iteratively reach the solution for each gene.

I tried running estimateDispersionGeneEst for the first 200 genes, and this takes 2.5 minutes. This is the longest step. The MAP step takes 1 minute, and Wald testing takes 1 minute. Extrapolating 4.5 minutes for 200 genes to 15,000 genes, and using 10 cores, this should take about 30 minutes (plus multicore overhead).

However, like I've said, for such a large design matrix I would instead use a linear model, such as limma-voom.

dds <- estimateSizeFactors(dds)
norm.cts <- counts(dds, normalized=TRUE)
rsnz <- rowSums(norm.cts >= 10)
table(rsnz >= 10) # here, 10 bc you have >200 samples in 4 groups
keep <- rsnz >= 10 

dds <- dds[keep,]
dds <- estimateSizeFactors(dds)

mm1 <- model.matrix(~ sex + sex:nested + sex:condition, colData(dds))
all.zero <- apply(mm1, 2, function(x) all(x == 0))
table(all.zero)
mm1 <- mm1[,-which(all.zero)]

system.time({
  dds2 <- estimateDispersionsGeneEst(dds[1:200,], modelMatrix=mm1)
  })

system.time({
  dds2 <- estimateDispersionsFit(dds2)
  })

system.time({
  dds2 <- estimateDispersionsMAP(dds2, modelMatrix=mm1)
  })

system.time({
  dds2 <- nbinomWaldTest(dds2, modelMatrix=mm1)
  })


> sessionInfo()
 R version 3.3.2 (2016-10-31)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Ubuntu 16.10

 locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C
 [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] parallel  stats4    stats     graphics  grDevices datasets  utils     methods   base

 other attached packages:
  [1] DESeq2_1.14.1              SummarizedExperiment_1.4.0 Biobase_2.34.0
  [4] GenomicRanges_1.26.1       GenomeInfoDb_1.10.1        IRanges_2.8.1
  [7] S4Vectors_0.12.0           BiocGenerics_0.20.0        rmarkdown_1.2
 [10] magrittr_1.5               knitr_1.15.1               devtools_1.12.0
 [13] BiocInstaller_1.24.0

 loaded via a namespace (and not attached):
  [1] genefilter_1.56.0    locfit_1.5-9.1       splines_3.3.2        lattice_0.20-34
  [5] colorspace_1.3-1     htmltools_0.3.5      survival_2.40-1      XML_3.98-1.5
  [9] foreign_0.8-67       withr_1.0.2          DBI_0.5-1            BiocParallel_1.8.1
 [13] RColorBrewer_1.1-2   plyr_1.8.4           stringr_1.1.0        zlibbioc_1.20.0
 [17] munsell_0.4.3        gtable_0.2.0         memoise_1.0.0        evaluate_0.10
 [21] latticeExtra_0.6-28  geneplotter_1.52.0   AnnotationDbi_1.36.0 htmlTable_1.7
 [25] Rcpp_0.12.8          acepack_1.4.1        xtable_1.8-2         scales_0.4.1
 [29] backports_1.0.4      Hmisc_4.0-0          annotate_1.52.0      XVector_0.14.0
 [33] gridExtra_2.2.1      ggplot2_2.2.0        digest_0.6.10        stringi_1.1.2
 [37] grid_3.3.2           rprojroot_1.1        tools_3.3.2          bitops_1.0-6
 [41] lazyeval_0.2.0       RCurl_1.95-4.8       tibble_1.2           RSQLite_1.1
 [45] Formula_1.2-1        cluster_2.0.5        Matrix_1.2-7.1       data.table_1.9.8
 [49] assertthat_0.1       rpart_4.1-10         nnet_7.3-12          compiler_3.3.2
 >
ADD REPLY
0
Entering edit mode

Great, thank you so much for your help. I'll be sure to give limma-voom a try. Thanks again

ADD REPLY
0
Entering edit mode

Even copying your code above and taking a subset of genes, R terminates and outputs the following:

*** glibc detected *** /opt/R-3.3.1/lib64/R/bin/exec/R: double free or corruption (!prev): 0x00000000140ddcb0 ***
======= Backtrace: =========

 

Can I assume at this point that the problem is with my machine and not how I'm implementing the software?

 

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] DESeq2_1.14.1              SummarizedExperiment_1.4.0
[3] Biobase_2.34.0             GenomicRanges_1.26.2
[5] GenomeInfoDb_1.10.2        IRanges_2.8.1
[7] S4Vectors_0.12.1           BiocGenerics_0.20.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9          RColorBrewer_1.1-2   plyr_1.8.4
 [4] XVector_0.14.0       bitops_1.0-6         base64enc_0.1-3
 [7] tools_3.3.1          zlibbioc_1.20.0      digest_0.6.12
[10] rpart_4.1-10         memoise_1.0.0        RSQLite_1.1-2
[13] annotate_1.52.1      tibble_1.2           gtable_0.2.0
[16] htmlTable_1.9        checkmate_1.8.2      lattice_0.20-33
[19] Matrix_1.2-6         DBI_0.5-1            gridExtra_2.2.1
[22] genefilter_1.56.0    stringr_1.1.0        knitr_1.15.1
[25] cluster_2.0.4        htmlwidgets_0.8      locfit_1.5-9.1
[28] grid_3.3.1           nnet_7.3-12          data.table_1.10.4
[31] AnnotationDbi_1.36.2 XML_3.98-1.5         survival_2.40-1
[34] BiocParallel_1.8.1   foreign_0.8-66       latticeExtra_0.6-28
[37] Formula_1.2-1        geneplotter_1.52.0   magrittr_1.5
[40] ggplot2_2.2.1        htmltools_0.3.5      Hmisc_4.0-2
[43] scales_0.4.1         backports_1.0.5      splines_3.3.1
[46] assertthat_0.1       xtable_1.8-2         colorspace_1.3-2
[49] stringi_1.1.2        acepack_1.4.1        RCurl_1.95-4.8
[52] lazyeval_0.2.0       munsell_0.4.3

ADD REPLY
0
Entering edit mode
I believe so
ADD REPLY

Login before adding your answer.

Traffic: 645 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6