deseq2 DESeqDataSet size when saved
eric.blanc • 0
@ericblanc-11613
Hi,

I am trying to generate small files to include in my package for regression tests. One of them is a small DESeqDataSet object (dds_small below, the first 50 features of a complete analysis stored in the object dds). However, when I save the small object, its size remains very large:

> dds <- readRDS("2018-02-12_all_tissues/dds.rds")
> object.size(dds)
12579016 bytes
> dds_small <- dds[1:50,]
> object.size(dds_small)
111056 bytes
> length(serialize(dds_small, NULL))
[1] 45625706

The serialized size of the small object is larger than the in-memory size reported for the original object! The design slot seems to be what uses so much space, as there is an environment attached to it:

> dds_small@design
~(Tissue/Age)/Genotype
<environment: 0x3e64708>
> object.size(dds_small@design)
1344 bytes
> length(serialize(dds_small@design, NULL))
[1] 45353218

This environment probably stores a bunch of packages that were in use when the original object was created, because the sessionInfo (below) reports many loaded packages, although I just did a readRDS command in a fresh R session.

As I am not familiar with environments or with DESeqDataSet internals, my question is: what should I do to keep my subset object small?

Thanks for your help,

Eric

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /home/eblanc/R/R-3.5.1/lib/libRblas.so
LAPACK: /home/eblanc/R/R-3.5.1/lib/libRlapack.so

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Biobase_2.40.0              bit64_0.9-7                
 [3] splines_3.5.1               Formula_1.2-3              
 [5] assertthat_0.2.0            stats4_3.5.1               
 [7] latticeExtra_0.6-28         blob_1.1.1                 
 [9] GenomeInfoDbData_1.1.0      pillar_1.3.0               
[11] RSQLite_2.1.1               backports_1.1.2            
[13] lattice_0.20-35             glue_1.3.0                 
[15] digest_0.6.17               GenomicRanges_1.32.7       
[17] RColorBrewer_1.1-2          XVector_0.20.0             
[19] checkmate_1.8.5             colorspace_1.3-2           
[21] htmltools_0.3.6             Matrix_1.2-14              
[23] plyr_1.8.4                  DESeq2_1.20.0              
[25] XML_3.98-1.16               pkgconfig_2.0.2            
[27] rseqCP_0.1.0                genefilter_1.62.0          
[29] zlibbioc_1.26.0             purrr_0.2.5                
[31] xtable_1.8-3                scales_1.0.0               
[33] BiocParallel_1.14.2         htmlTable_1.12             
[35] tibble_1.4.2                annotate_1.58.0            
[37] IRanges_2.14.12             ggplot2_3.0.0              
[39] SummarizedExperiment_1.10.1 nnet_7.3-12                
[41] BiocGenerics_0.26.0         lazyeval_0.2.1             
[43] survival_2.42-3             magrittr_1.5               
[45] crayon_1.3.4                memoise_1.1.0              
[47] foreign_0.8-70              tools_3.5.1                
[49] data.table_1.11.6           matrixStats_0.54.0         
[51] stringr_1.3.1               S4Vectors_0.18.3           
[53] locfit_1.5-9.1              munsell_0.5.0              
[55] cluster_2.0.7-1             DelayedArray_0.6.6         
[57] AnnotationDbi_1.42.1        bindrcpp_0.2.2             
[59] compiler_3.5.1              GenomeInfoDb_1.16.0        
[61] rlang_0.2.2                 grid_3.5.1                 
[63] RCurl_1.95-4.11             rstudioapi_0.7             
[65] htmlwidgets_1.2             bitops_1.0-6               
[67] base64enc_0.1-3             gtable_0.2.0               
[69] DBI_1.0.0                   R6_2.2.2                   
[71] gridExtra_2.3               knitr_1.20                 
[73] dplyr_0.7.6                 bit_1.1-14                 
[75] bindr_0.1.1                 Hmisc_4.1-1                
[77] stringi_1.2.4               parallel_3.5.1             
[79] Rcpp_0.12.18                geneplotter_1.58.0         
[81] rpart_4.1-13                acepack_1.4.1              
[83] tidyselect_0.2.4           

 

@mikelove

There is a thread on here somewhere, but I'll just repeat the options. The limitation comes from R's formula() function, and part of it is unavoidable. You can't call formula() inside of a function and attach the result to an object, because the formula keeps the environment it was created in, along with everything in that environment. This would happen whether you use DESeqDataSet() or save your own object, e.g. obj <- list(data, formula), inside of a function.

Let me note for other (users) reading that this issue doesn’t affect normal usage, construction or saving of DESeqDataSets, only when developers call it inside of their own defined functions.
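For readers who want to see the effect in isolation, here is a minimal base-R sketch. It is not DESeq2 code: the function make_design() and the object big are made up for illustration, and the sketch only assumes that a formula created inside a function records that function's environment.

```r
# Hypothetical helper: creates a formula inside a function that also
# holds a large local object unrelated to the formula.
make_design <- function() {
  big <- numeric(1e6)   # about 8 MB of doubles living in the function's environment
  ~ condition           # the formula records this function's environment
}

f <- make_design()

# The formula's environment is the function's frame, not the global environment,
# and the large local object is still reachable through it.
identical(environment(f), globalenv())
exists("big", envir = environment(f), inherits = FALSE)

# object.size() does not count the attached environment...
in_memory <- object.size(f)
# ...but serialize() (and therefore saveRDS()) writes the environment's contents.
on_disk <- length(serialize(f, NULL))

print(in_memory)
print(on_disk)
```

This is the same mismatch as in the question: object.size() reports a small object, while the serialized form includes everything the formula's environment holds.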

The solutions are:

1) Since version 1.18, you can just provide a matrix to design. This should solve your issue entirely.

2) You could avoid calling formula() or DESeqDataSet() within your function, and instead have the user call it from the global environment. This is what we do in DESeq2, which keeps the dds objects that users create from being bloated in size. The problem, again, arises only when you call formula() within a function, attach the result to an object, and then save it. This happens because of R's formula() function and can't easily be avoided.

3) You can delete everything from the environment inside the function using rm(), except the object itself, before return(). There is still some duplication because you can't delete the object itself, so the first two options are preferred.

4) You can also try doing what we do in makeExampleDESeqDataSet(), which is to force formula's environment to the global environment: https://github.com/mikelove/DESeq2/blob/600c6c20fca6c2d54148bea17ac31c424ac69336/R/core.R#L427-L431
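Options 1, 3 and 4 can be sketched in base R as follows. This is an illustration under assumptions, not DESeq2's actual code: the make_design_*() functions, the object big and the toy coldata are hypothetical, and no DESeq2 function is called. For option 1, the matrix built with model.matrix() stands in for what would be supplied as the design.

```r
# Baseline: the formula drags the function's environment (and `big`) along.
make_design_bloated <- function() {
  big <- numeric(1e6)
  ~ condition
}

# Option 4: point the formula's environment at the global environment.
make_design_global <- function() {
  big <- numeric(1e6)
  design <- ~ condition
  environment(design) <- globalenv()
  design
}

# Option 3: remove everything that is not needed before returning.
# The formula itself stays in the environment, so a little duplication remains.
make_design_rm <- function() {
  big <- numeric(1e6)
  design <- ~ condition
  rm(big)
  design
}

# Option 1: use a design matrix instead of a formula (toy sample table).
coldata <- data.frame(condition = factor(c("A", "A", "B", "B")))
design_matrix <- model.matrix(~ condition, data = coldata)

# Compare serialized sizes in bytes.
sizes <- c(
  bloated = length(serialize(make_design_bloated(), NULL)),
  global  = length(serialize(make_design_global(), NULL)),
  rm      = length(serialize(make_design_rm(), NULL)),
  matrix  = length(serialize(design_matrix, NULL))
)
print(sizes)
```

Only the baseline should serialize to several megabytes; the other three stay small because the large local object is no longer reachable from what gets saved.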


Thanks Michael, and sorry I wasn't able to find the relevant thread...

 


I have a hard time finding old threads myself! And (1) has been available for the last two versions, because I got tired of dealing with formula() and its greedy behavior.
