On my box within R I attempted to create a TxDb
object from a GFF
file with
human_txdb <- makeTxDbFromGFF(file="human.gff",format="gff3")
and this crashed the R process. (Presumably an out-of-memory error.) Here's my environment:
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 pillar_1.4.2
[3] compiler_3.6.1 GenomeInfoDb_1.20.0
[5] XVector_0.24.0 GenomicFeatures_1.36.4
[7] prettyunits_1.0.2 bitops_1.0-6
[9] tools_3.6.1 zlibbioc_1.30.0
[11] progress_1.2.2 zeallot_0.1.0
[13] biomaRt_2.40.4 digest_0.6.20
[15] bit_1.1-14 lattice_0.20-38
[17] RSQLite_2.1.2 memoise_1.1.0
[19] tibble_2.1.3 pkgconfig_2.0.2
[21] rlang_0.4.0 Matrix_1.2-17
[23] DelayedArray_0.10.0 DBI_1.0.0
[25] parallel_3.6.1 GenomeInfoDbData_1.2.1
[27] rtracklayer_1.44.4 httr_1.4.1
[29] stringr_1.4.0 Biostrings_2.52.0
[31] S4Vectors_0.22.1 vctrs_0.2.0
[33] IRanges_2.18.2 hms_0.5.1
[35] grid_3.6.1 stats4_3.6.1
[37] bit64_0.9-7 Biobase_2.44.0
[39] R6_2.4.0 AnnotationDbi_1.46.1
[41] BiocParallel_1.18.1 XML_3.98-1.20
[43] blob_1.2.0 magrittr_1.5
[45] matrixStats_0.55.0 GenomicAlignments_1.20.1
[47] Rsamtools_2.0.0 backports_1.1.4
[49] BiocGenerics_0.30.0 GenomicRanges_1.36.1
[51] SummarizedExperiment_1.14.1 assertthat_0.2.1
[53] stringi_1.4.3 RCurl_1.95-4.12
[55] crayon_1.3.4
The box has 8 GB of memory and 8 cores. human.gff
is 1.1 GB.
I then tried the same thing on an HPC cluster whose nodes have 16 cores and 64 GB of memory. This succeeded and within R I saved the result as an R object:
saveRDS(human_txdb,file = "human_txdb.rds")
I then scp'd human_txdb.rds
to my box and within R executed
human_txdb <- readRDS(file="human_txdb.rds")
This succeeded. However
human_txdb
failed with
TxDb object:
Error: external pointer is not valid
As an experiment, I went back to the HPC cluster and changed the R script :
human_txdb <- makeTxDbFromGFF(file="human.gff",format="gff3")
saveRDS(human_txdb,file = "human_txdb.rds")
human_txdb <- readRDS(file = "human_txdb.rds")
human_txdb
This created a more explicit error:
Error in result_create(conn@ptr, statement) :
external pointer is not valid
Calls: <Anonymous> ... .local -> new -> initialize -> initialize -> result_create
Execution halted
Thus the problem seems not to be related to the scp
. Here's the R environment on the HPC cluster node:
R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Scientific Linux release 6.10 (Carbon)
Matrix products: default
BLAS/LAPACK: /util/opt/anaconda/deployed-conda-envs/packages/bioconductor/envs/bioconductor-3.8/lib/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] systemPipeR_1.16.0 ShortRead_1.40.0
[3] GenomicAlignments_1.18.0 SummarizedExperiment_1.12.0
[5] DelayedArray_0.8.0 matrixStats_0.54.0
[7] BiocParallel_1.16.2 Rsamtools_1.34.0
[9] Biostrings_2.50.1 XVector_0.22.0
[11] GenomicFeatures_1.34.1 AnnotationDbi_1.44.0
[13] Biobase_2.42.0 GenomicRanges_1.34.0
[15] GenomeInfoDb_1.18.1 IRanges_2.16.0
[17] S4Vectors_0.20.1 BiocGenerics_0.28.0
loaded via a namespace (and not attached):
[1] httr_1.4.0 edgeR_3.24.1 bit64_0.9-7
[4] splines_3.5.1 assertthat_0.2.0 latticeExtra_0.6-28
[7] RBGL_1.58.1 blob_1.1.1 GenomeInfoDbData_1.2.0
[10] Category_2.48.0 progress_1.2.2 backports_1.1.3
[13] pillar_1.4.2 RSQLite_2.1.1 lattice_0.20-38
[16] glue_1.3.0 limma_3.38.3 digest_0.6.18
[19] checkmate_1.8.5 RColorBrewer_1.1-2 colorspace_1.4-0
[22] Matrix_1.2-15 plyr_1.8.4 GSEABase_1.44.0
[25] XML_3.98-1.16 pkgconfig_2.0.2 pheatmap_1.0.12
[28] biomaRt_2.38.0 genefilter_1.64.0 zlibbioc_1.28.0
[31] GO.db_3.7.0 purrr_0.3.2 xtable_1.8-3
[34] scales_1.0.0 brew_1.0-6 tibble_2.0.1
[37] annotate_1.60.0 ggplot2_3.1.0 lazyeval_0.2.1
[40] survival_2.43-3 magrittr_1.5 crayon_1.3.4
[43] memoise_1.1.0 hwriter_1.3.2 GOstats_2.48.0
[46] graph_1.60.0 data.table_1.12.0 tools_3.5.1
[49] prettyunits_1.0.2 hms_0.4.2 BBmisc_1.11
[52] stringr_1.4.0 sendmailR_1.2-1 munsell_0.5.0
[55] locfit_1.5-9.1 compiler_3.5.1 rlang_0.3.4
[58] grid_3.5.1 RCurl_1.95-4.11 rjson_0.2.20
[61] AnnotationForge_1.24.0 base64enc_0.1-3 bitops_1.0-6
[64] gtable_0.2.0 DBI_1.0.0 R6_2.4.0
[67] dplyr_0.8.0.1 rtracklayer_1.42.1 bit_1.1-14
[70] Rgraphviz_2.26.0 stringi_1.2.4 BatchJobs_1.7
[73] Rcpp_1.0.0 tidyselect_0.2.5
Any debugging hints will be gratefully received.
I wanted to present an alternative based on the TxDb that I used since I was storing other information as well. for my example I was storing the genome as DNA string set and the GFF as well the TxDb for tracks information.
genome = redDNAStringSet("genome.fna")
gff = readGFF("genome.gff")
txdb = makeTxDbFromGFF("genome.gff", format="gff3")
rds_data = list()
save everything in a list - as lists will put out contents of the TxDb not use a pointer
rds_data[[1]] = list(genome, gff, as.list(txdb))
saveRDS(rds_data, "fileName.RDS")
read back in the RDS you already saved
dat = readRDS("fileName.RDS")
newTxDb = do.call(makeTxDb, dat[[1]][[3]])
This can be found in the methods section of the TxDB class functions. This way you can save multiple types at once, but is a little more complicated than just saveDb / loadDb