Large disk images of DESeqDataSet / SummarizedExperiment objects when using the save function
2
2
Entering edit mode
akaever ▴ 30
@akaever-7380
Last seen 8.0 years ago

Hi,

I have encountered unexpected large disk images when exporting DESeqDataSet (subclass of SummarizedExperiment) objects using the save function. The size of the exported file is much larger than expected, which is a problem for large experiments. In the following example, a test object is created and exported. For comparison, also the actual count values (a matrix of integers) are extracted and saved.

In order to test compression, all count values are set to zero and the corresponding object and count matrix are exported once again.

The DESeqDataSet disk image (about 10MB) is much larger than the expected size (about 4MB) and the counts seem not to be compressed by means of the save function. In comparison, the exported count matrix is highly compressed (1.3 MB and 0.03 MB for zero count values). See the following code

 

library(DESeq2)
# DESeq2_1.6.3

# create example data set
dds <- makeExampleDESeqDataSet(n=10000, m=100)
# size should be about 4 MB (10000 x 100 x 4 Bytes (integers))
print(object.size(dds@assays$data@listData$counts))
# 4000200 bytes

# extract counts for comparison
countMat <- counts(dds)
# set counts to zero to test compression
zeroDds <- dds; counts(zeroDds)[] <- 0L
zeroCountMat <- counts(zeroDds)

# save to disk
save(dds, file="dds.RData")
save(countMat, file="countMat.RData")
save(zeroCountMat, file="zeroCountMat.RData")
save(zeroDds, file="zeroDds.RData")

# report file size
for(f in c("dds.RData", "zeroDds.RData", "countMat.RData", "zeroCountMat.RData"))
  cat(f, ": ", file.info(f)$size / 2^20, " MB\n", sep="")
# dds.RData: 10.47201 MB
# zeroDds.RData: 10.50408 MB
# countMat.RData: 1.269134 MB
# zeroCountMat.RData: 0.02977085 MB

 

Maybe, there is somewhere a conversion to double?

Thanks for answers

SummarizedExperiment DESeqDataSet GenomicRanges DESeq2 save • 3.6k views
ADD COMMENT
0
Entering edit mode

The large size comes from the 'design' slot

owd = setwd(tempdir())
sort(sapply(slotNames(dds), function(nm) {
    x = slot(dds, nm); save(x, file=nm); file.size(nm)
}))
setwd(owd)

## dispersionFunction           exptData            colData 
##                79                168                316 
##           rowData             assays             design 
##            206618            1316594           10771848

which I think captures the global environment(!).

ADD REPLY
2
Entering edit mode
@mikelove
Last seen 1 day ago
United States

Hi Alexander,

Thanks for the note.

I believe this is only an issue for the makeExampleDESeqDataSet function and not for the "real data" constructors DESeqDataSet*. So if you try this in the global environment with a counts matrix, or SummarizedExperiment or HTSeq input I don't see the same problem.

Within the makeExampleDESeqDataSet function, I had tried to use the following code to have it not store extra copies of all the objects with the design:

designFormula <- as.formula("~ condition",env=new.env())

But I should go back and make sure I've removed the objects, as this is apparently not working.

 

ADD COMMENT
0
Entering edit mode

I think I was mislead when I was testing the above code by the following:

> dds <- makeExampleDESeqDataSet(n=10000,m=100)
> ls(environment(design(dds)))
character(0)

I'll go over the makeExampleDESeqDataSet code and test by saving.

ADD REPLY
0
Entering edit mode

I believe I have solved this problem for makeExampleDESeqDataSet in version 1.7.26, by using:

design <- as.formula("~ condition", env=.GlobalEnv)

Now the design of an example dataset does not include all the other objects from within makeExampleDESeqDataSet(), nor does it include objects from the global environment when saved:

 big.vector <- seq_len(1e7)
 dds <- makeExampleDESeqDataSet(n=10000, m=100)
 sizes <- sapply(slotNames(dds), function(nm) { x = slot(dds, nm); save(x, file=nm); file.info(nm)$size })
 sort(sizes)
 dispersionFunction             design           exptData            colData
                 79                115                168                316
            rowData             assays
             206654            1312688
 sum(sizes)
 [1] 1520020
save(dds, file="object"); file.info("object")$size
 [1] 1519846

 

 

ADD REPLY
0
Entering edit mode

Thanks for all the replies! It works :-), but I do not fully understand why objects in the .GlobalEnv are not automatically saved in this case. I would have tried

design <- as.formula("~ condition", env=emptyenv())

to avoid saving additional objects.

ADD REPLY
1
Entering edit mode

This would give an error later on, when trying to use the formula to build a matrix:

> model.matrix(as.formula("~x",env=emptyenv()), data=data.frame(x=c(0,0,1,1)))
 Error in eval(expr, envir, enclos) : could not find function "list"
 Calls: model.matrix ... model.matrix.default -> model.frame -> model.frame.default -> eval -> eval

 

ADD REPLY
0
Entering edit mode
@wolfgang-huber-3550
Last seen 4 months ago
EMBL European Molecular Biology Laborat…

For the record, with Alexander's code from above and DESeq2_1.7.2 6, I now get:

dds.RData: 1.457016 MB
zeroDds.RData: 0.2411022 MB
countMat.RData: 1.270781 MB
zeroCountMat.RData: 0.02954865 MB

Can someone explain why in his initial post (with the previous DESeq2), zeroDds.RData was larger than dds.RData?

 

 

 

 

 

ADD COMMENT
0
Entering edit mode

I think dds assays do not have dimnames, whereas zeroDds do

> str(dds@assays@.xData$data@listData$counts)
 int [1:10000, 1:100] 44 313 34 3 1 3 34 8 11 14 ...
> str(zeroDds@assays@.xData$data@listData$counts)
 int [1:10000, 1:100] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:10000] "gene1" "gene2" "gene3" "gene4" ...
  ..$ : NULL

I think this is because of a (unintentional?) omission in the assays<-,SummarizedExperiment,SimpleList-method

> showMethods("assays<-", includeD=T)
Function: assays<- (package GenomicRanges)
x="SummarizedExperiment", value="list"
function (x, ..., withDimnames = TRUE, value) 
{
    value <- lapply(x, "dimnames<-", NULL)
    .SummarizedExperiment.assays.replace(x, ..., value = SimpleList(value))
}

x="SummarizedExperiment", value="SimpleList"
function (x, ..., withDimnames = TRUE, value) 
{
    x <- clone(x, ..., assays = value)
    msg <- .valid.SummarizedExperiment(x)
    if (!is.null(msg)) 
        stop(msg)
    x
}

I think the names should be stripped (SummarizedExperiment manages these internally).

ADD REPLY

Login before adding your answer.

Traffic: 608 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6