Hi,
I'd like to revisit some earlier posts regarding running DEXSeq
on a large number of samples, similar to past posts like this one. My eventual goal is to run DEXSeq
on a subset of TCGA tumors; my full dataset comprises ~300 samples x ~700k junctions, and I (perhaps expectedly) have run into issues where estimateDispersions(dxd)
will run forever, even with BPPARAM=SnowParam(4)
specified. The post linked above refers to some improvements as part of the DEXSeqAlt
repository, however since its creation 6 years ago, DESeq2::estimateDispersions()
can now utilize glmGamPoi
to optimize this step to scale more efficiently. I'm wondering if it is possible for DEXSeq
to be updated to adopt glmGamPoi
and improve the speed of this step / scale better to large datasets, or if I could simply run bits of the DESeq2 + glmGamPoi
workflow as described here on my dxd
object.
I realize the large dimensions of these data may not be feasible no matter what. Therefore for testing purposes, I've created a smaller-scale example dataset available here where I've randomly selected 10 tumors from each test condition (10 in group "low", 10 in group "high", 20 total samples), and have removed entire genes whose total count was in the bottom 50% (ie restrict to only junctions within genes whose expression is in the top 50%). This cuts down the dimensions of my DEXSeq
object to 558591 x 40
. A (hopefully) more realistic goal may be to test many of these random subsets as a permutation, but I'm not sure whether pre-selecting genes/junctions to the top 50% of expressers is statistically appropriate given the model.
The original source of my data is from here, and I've gathered the data in a SummarizedExperiment
object then converted to a DEXSeqDataSet
object with DEXSeqDataSetFromSE(se, design=formula( ~ sample + exon + Group:exon))
. The resulting object seems to be consistent with what you describe in the vignette where the first 20 columns represent counts from the given exon (this
) and the next 20 columns represent the sum of the other exons belonging to the same gene (others
). Happy to share my code for any of these wrangling steps if it would be helpful.
Assuming I've set everything up correctly, you should be able to reproduce and test the speed with something like the following code. For me, the below completed successfully in ~1.5 hrs using 4 cores on our compute cluster.
Of course if I've missed something obvious or there are already documented workarounds somewhere, please let me know!
Thanks so much!
library(DEXSeq)
library(BiocParallel)
# Load random sampling
dxd <- readRDS(file="DEXSeq_Bioc_rand1.rds")
# Peek at coldata
colData(dxd)
DataFrame with 40 rows and 5 columns
sample SampleID Group sample.1 exon
<factor> <character> <factor> <factor> <factor>
1 1 TCGA-E2-A1IH-01A low TCGA.E2.A1IH.01A this
2 2 TCGA-C8-A1HO-01A low TCGA.C8.A1HO.01A this
3 3 TCGA-A2-A1FW-01A low TCGA.A2.A1FW.01A this
4 4 TCGA-C8-A12U-01A low TCGA.C8.A12U.01A this
5 5 TCGA-C8-A273-01A low TCGA.C8.A273.01A this
... ... ... ... ... ...
36 16 TCGA-BH-A0DH-01A high TCGA.BH.A0DH.01A others
37 17 TCGA-AO-A12G-01A high TCGA.AO.A12G.01A others
38 18 TCGA-D8-A27T-01A high TCGA.D8.A27T.01A others
39 19 TCGA-A8-A094-01A high TCGA.A8.A094.01A others
40 20 TCGA-D8-A1XF-01A high TCGA.D8.A1XF.01A others
# Peek at counts
counts(dxd)[1:5,1:5]
[,1] [,2] [,3] [,4] [,5]
ENSG00000000003.10:E001 0 0 0 0 0
ENSG00000000003.10:E002 0 0 0 0 0
ENSG00000000003.10:E003 259 43 109 6 153
ENSG00000000003.10:E004 0 0 0 0 0
ENSG00000000003.10:E005 0 0 0 0 0
# Now proceed with DEXSeq
dxd <- estimateSizeFactors(dxd)
options(bphost="localhost")
dxd <- estimateDispersions(dxd, BPPARAM=SnowParam(4, log=TRUE) )
dxd <- testForDEU(dxd, BPPARAM=SnowParam(4, log=TRUE) )
dxd <- estimateExonFoldChanges(dxd, fitExpToVar="Group", BPPARAM=SnowParam(4, log=TRUE), denominator="low")
dxr1 <- DEXSeqResults(dxd)
head(as.data.frame(dxr1[!is.na(dxr1$padj) & dxr1$padj < 0.05,]))
groupID featureID exonBaseMean dispersion stat pvalue
ENSG00000003402.15:E102 ENSG00000003402.15 E102 5.682785 0.01433858 17.75352 2.514524e-05
ENSG00000006071.7:E054 ENSG00000006071.7 E054 14.508353 0.01849662 17.59588 2.731798e-05
ENSG00000006625.13:E027 ENSG00000006625.13 E027 3.387866 0.23556958 17.21820 3.332288e-05
ENSG00000006744.14:E006 ENSG00000006744.14 E006 1.906249 0.49098505 17.18358 3.393576e-05
ENSG00000010244.12:E059 ENSG00000010244.12 E059 10.860515 0.09847943 23.51376 1.240234e-06
ENSG00000011465.12:E046 ENSG00000011465.12 E046 10.115533 0.00991492 17.57031 2.768780e-05
padj DAB2IP_high DAB2IP_low log2fold_DAB2IP_high_DAB2IP_low genomicData
ENSG00000003402.15:E102 0.041529898 0.9379022 6.466887e-01 1.1593566
ENSG00000006071.7:E054 0.043587315 1.1932887 8.332773e-01 1.3294337
ENSG00000006625.13:E027 0.045175810 0.8723202 3.162357e-01 2.5906126
ENSG00000006744.14:E006 0.045488976 0.6990015 6.938573e-06 17.9337945
ENSG00000010244.12:E059 0.007795511 1.2766317 7.781390e-01 1.8406042
ENSG00000011465.12:E046 0.043786361 1.1024762 8.659708e-01 0.8781098
countData.1 countData.2 countData.3 countData.4 countData.5 countData.6 countData.7
ENSG00000003402.15:E102 5 3 5 2 8 0 3
ENSG00000006071.7:E054 0 4 1 6 6 2 6
ENSG00000006625.13:E027 0 1 0 4 0 0 1
ENSG00000006744.14:E006 0 0 0 0 0 0 0
ENSG00000010244.12:E059 3 3 3 7 13 11 1
ENSG00000011465.12:E046 0 3 11 3 13 2 5
countData.8 countData.9 countData.10 countData.11 countData.12 countData.13
ENSG00000003402.15:E102 1 5 1 7 11 13
ENSG00000006071.7:E054 0 4 1 5 4 48
ENSG00000006625.13:E027 4 2 1 4 5 6
ENSG00000006744.14:E006 0 0 0 0 6 2
ENSG00000010244.12:E059 3 10 1 7 12 47
ENSG00000011465.12:E046 5 11 9 11 11 33
countData.14 countData.15 countData.16 countData.17 countData.18 countData.19
ENSG00000003402.15:E102 3 12 5 5 12 14
ENSG00000006071.7:E054 95 74 5 13 67 2
ENSG00000006625.13:E027 1 11 4 11 7 2
ENSG00000006744.14:E006 0 8 0 9 10 0
ENSG00000010244.12:E059 12 21 10 28 23 11
ENSG00000011465.12:E046 1 0 22 2 4 23
countData.20 transcripts
ENSG00000003402.15:E102 8 ENSG0000....
ENSG00000006071.7:E054 22 ENSG0000....
ENSG00000006625.13:E027 7 ENSG0000....
ENSG00000006744.14:E006 8 ENSG0000....
ENSG00000010244.12:E059 11 ENSG0000....
ENSG00000011465.12:E046 25 ENSG0000....
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.5 (Ootpa)
Matrix products: default
BLAS/LAPACK: /nas/longleaf/rhel8/apps/r/4.1.0/lib/libopenblas_haswellp-r0.3.5.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DEXSeq_1.40.0 RColorBrewer_1.1-3 AnnotationDbi_1.56.2
[4] DESeq2_1.34.0 SummarizedExperiment_1.24.0 GenomicRanges_1.46.1
[7] GenomeInfoDb_1.30.1 IRanges_2.28.0 S4Vectors_0.32.4
[10] MatrixGenerics_1.6.0 matrixStats_0.62.0 Biobase_2.54.0
[13] BiocGenerics_0.40.0 BiocParallel_1.28.3 forcats_0.5.1
[16] stringr_1.4.0 dplyr_1.0.9 purrr_0.3.4
[19] readr_2.1.2 tidyr_1.2.0 tibble_3.1.8
[22] ggplot2_3.3.6 tidyverse_1.3.1 rhdf5_2.38.0
loaded via a namespace (and not attached):
[1] bitops_1.0-7 fs_1.5.2 lubridate_1.8.0 bit64_4.0.5
[5] filelock_1.0.2 progress_1.2.2 httr_1.4.2 tools_4.1.0
[9] backports_1.4.1 utf8_1.2.2 R6_2.5.1 DBI_1.1.2
[13] colorspace_2.0-3 rhdf5filters_1.6.0 withr_2.5.0 prettyunits_1.1.1
[17] tidyselect_1.1.2 curl_4.3.2 bit_4.0.4 compiler_4.1.0
[21] cli_3.3.0 rvest_1.0.2 xml2_1.3.3 DelayedArray_0.20.0
[25] scales_1.2.0 genefilter_1.76.0 rappdirs_0.3.3 Rsamtools_2.10.0
[29] digest_0.6.29 XVector_0.34.0 pkgconfig_2.0.3 dbplyr_2.1.1
[33] fastmap_1.1.0 rlang_1.0.4 readxl_1.3.1 rstudioapi_0.13
[37] RSQLite_2.2.10 generics_0.1.2 hwriter_1.3.2 jsonlite_1.8.0
[41] RCurl_1.98-1.6 magrittr_2.0.2 GenomeInfoDbData_1.2.7 Matrix_1.4-0
[45] Rcpp_1.0.8.3 munsell_0.5.0 Rhdf5lib_1.16.0 fansi_1.0.3
[49] lifecycle_1.0.1 stringi_1.7.6 zlibbioc_1.40.0 BiocFileCache_2.2.1
[53] grid_4.1.0 blob_1.2.2 parallel_4.1.0 crayon_1.5.1
[57] lattice_0.20-45 Biostrings_2.62.0 haven_2.4.3 splines_4.1.0
[61] annotate_1.72.0 hms_1.1.1 KEGGREST_1.34.0 locfit_1.5-9.5
[65] pillar_1.7.0 biomaRt_2.50.3 geneplotter_1.72.0 reprex_2.0.1
[69] XML_3.99-0.9 glue_1.6.2 modelr_0.1.8 vctrs_0.4.1
[73] png_0.1-7 tzdb_0.2.0 cellranger_1.1.0 gtable_0.3.0
[77] assertthat_0.2.1 cachem_1.0.6 xtable_1.8-4 broom_1.0.0
[81] survival_3.2-13 memoise_2.0.1 statmod_1.4.36 ellipsis_0.3.2
Thanks for posting and providing the reproducible example Jeremy.