Load saved RangedSummarizedExperiment as Deseq2 dataset
3
0
Entering edit mode
@jakob-petereit-13577
Last seen 7.3 years ago
UWA Perth

Hi,

 

I am currently analyzing some RNAseq data. I worked my way through the RNAseq gene level workflow (https://www.bioconductor.org/help/workflows/rnaseqGene/

And most things work well, and I will get it done in some time. The only really annoying thing is, that I cannot save and load my RangedSummarizedExperiment.  Here is the problematic code with my description:

se <- summarizeOverlaps(features=ebg, reads=bamfiles,
                        mode="Union",
                        singleEnd=F,
                        ignore.strand=TRUE,
                        fragments=T)
colData(se) <- DataFrame(sampleTable)
colData(se)

se
class: RangedSummarizedExperiment 
dim: 32833 70 
metadata(0):
assays(1): counts
rownames(32833): AT1G01010 AT1G01020 ... ATMG09960 ATMG09980
rowData names(0):
colnames(70): 1_1 1_2 ... 36_1 36_2
colData names(7): ID rep ... ds.DNA.Concentration..nM. SampleName

This works, summarizing overlaps takes about 1-2 hours if I run it locally on 4 cores and 16gb RAM. I can load this SummarizedExperiment into Deseq2.

DESeqDataSet(se, ~ Genotype + SA.treatment + Genotype:SA.treatment, ignoreRank = T)
class: DESeqDataSet 
dim: 32833 70 
metadata(1): version
assays(1): counts
rownames(32833): AT1G01010 AT1G01020 ... ATMG09960 ATMG09980
rowData names(0):
colnames(70): 1_1 1_2 ... 36_1 36_2
colData names(7): ID rep ... ds.DNA.Concentration..nM. SampleName

However, I would really like to save this SummarizedExperiment and load it later with Deseq2. I found that the only package supporting this is SummarizedExperiment with the dependency HDF5Array. I can save and load SummarizedExperiments with that and the SummarizedExperiment looks identical:

saveHDF5SummarizedExperiment(se, dir='data/se', replace=T,level=NULL, verbose=T)
Start writing assay 1/1 to 'data/se/assays.h5':
Finished writing assay 1/1 to 'data/se/assays.h5'.
se_load <- loadHDF5SummarizedExperiment('data/se/')
se_load
class: RangedSummarizedExperiment 
dim: 32833 70 
metadata(0):
assays(1): counts
rownames(32833): AT1G01010 AT1G01020 ... ATMG09960 ATMG09980
rowData names(0):
colnames(70): 1_1 1_2 ... 36_1 36_2
colData names(7): ID rep ... ds.DNA.Concentration..nM. SampleName

But the error shows, once I wanna load the loaded SummarizedExperiemnt into Deseq2:

DESeqDataSet(se_load, ~ Genotype + SA.treatment + Genotype:SA.treatment, ignoreRank = T)
converting counts to integer mode
Error in all_dims[, 1L] : incorrect number of dimensions

I read quite a bit on the Error, but I cannot find a real reference to someone dealing with it. I assume its something in the writeHDF5Array arguments, which somehow interfere with DESeq2, but that is just over my current skill level...

Any suggestions ? 

sessionInfo()
R version 3.4.1 Patched (2017-07-07 r72904)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pheatmap_1.0.8             bindrcpp_0.2               HDF5Array_1.4.8           
 [4] rhdf5_2.20.0               readr_1.1.1                DESeq2_1.16.1             
 [7] dplyr_0.7.2                BiocParallel_1.10.1        GenomicAlignments_1.12.1  
[10] GenomicFeatures_1.28.4     AnnotationDbi_1.38.1       Rsamtools_1.28.0          
[13] Biostrings_2.44.1          XVector_0.16.0             airway_0.110.0            
[16] SummarizedExperiment_1.6.3 DelayedArray_0.2.7         matrixStats_0.52.2        
[19] Biobase_2.36.2             GenomicRanges_1.28.4       GenomeInfoDb_1.12.2       
[22] IRanges_2.10.2             S4Vectors_0.14.3           BiocGenerics_0.22.0       

loaded via a namespace (and not attached):
 [1] bit64_0.9-7             splines_3.4.1           Formula_1.2-2          
 [4] assertthat_0.2.0        latticeExtra_0.6-28     blob_1.1.0             
 [7] GenomeInfoDbData_0.99.0 RSQLite_2.0             backports_1.1.0        
[10] lattice_0.20-35         glue_1.1.1              digest_0.6.12          
[13] RColorBrewer_1.1-2      checkmate_1.8.3         colorspace_1.3-2       
[16] htmltools_0.3.6         Matrix_1.2-10           plyr_1.8.4             
[19] XML_3.98-1.9            pkgconfig_2.0.1         biomaRt_2.32.1         
[22] genefilter_1.58.1       zlibbioc_1.22.0         xtable_1.8-2           
[25] snow_0.4-2              scales_0.4.1            annotate_1.54.0        
[28] tibble_1.3.3            htmlTable_1.9           ggplot2_2.2.1          
[31] nnet_7.3-12             lazyeval_0.2.0          survival_2.41-3        
[34] magrittr_1.5            memoise_1.1.0           foreign_0.8-69         
[37] data.table_1.10.4       tools_3.4.1             hms_0.3                
[40] stringr_1.2.0           locfit_1.5-9.1          munsell_0.4.3          
[43] cluster_2.0.6           compiler_3.4.1          rlang_0.1.1            
[46] grid_3.4.1              RCurl_1.95-4.8          htmlwidgets_0.9        
[49] bitops_1.0-6            base64enc_0.1-3         gtable_0.2.0           
[52] DBI_0.7                 R6_2.2.2                gridExtra_2.2.1        
[55] knitr_1.16              rtracklayer_1.36.4      bit_1.1-12             
[58] bindr_0.1               Hmisc_4.0-3             stringi_1.1.5          
[61] Rcpp_0.12.12            geneplotter_1.54.0      rpart_4.1-11           
[64] acepack_1.4.1   
deseq2 hdf5 rangedsummarizedexperiment rnaseq • 3.1k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 12 hours ago
United States

Unrelated but you shouldn't use ignoreRank=TRUE, this is just for other packages not for end users. You will just miss out of useful error messages and get more confusing error messages later. Why don't you save the se with save() and load() instead? Does this work?

ADD COMMENT
0
Entering edit mode
@jakob-petereit-13577
Last seen 7.3 years ago
UWA Perth

It somehow works with save.image, to mirror my work space and then load it later on again.

It would be still good to investigate the error or find a better way to save and load the SummarizedExperiment. 

I plan to source the summary of the overlaps out to our in house unix-cluster, as summarizeOverlaps is quite resource intense and a working save/load function will be necessary to get this to work (dunno if i can export the work space from the server R to my local R).

 

A little bit off the topic, summarizing the Overlaps from BAM files seems to be the bottleneck of this workflow. If you use STAR to map and align the fastq files to the genome, it already creates raw counts for each file. It would be probably a lot quicker if we just merge the raw-counts for each file and add some meta data, but I don't know how complicated it would be to feed that to DESeq.

Cheers for you quick answer anyway :)

ADD COMMENT
0
Entering edit mode

Can you not do this using the standard R approaches, e.g.,

example(RangedSummarizedExperiment)  # generate 'rse', an example
fl = tempfile()
saveRDS(rse, fl)

and then later

rse1 = readRDS(fl)
ADD REPLY
0
Entering edit mode

re: merging counts for each file: you can either do this manually in R and use DESeqDataSetFromMatrix, or you can use DESeqDataSetFromHTSeqCount, which will read in count files in the htseq format (first column is gene name, second column is count, no header). See ?DESeqDataSet

ADD REPLY
0
Entering edit mode
@jakob-petereit-13577
Last seen 7.3 years ago
UWA Perth

sweet, saveRDS works indeed very nicely here.

I'll try the DESeqDataSetFromMatrix and DESeqDataSetFromHTSeqCount later on

 

Cheers for the help <3

 

 

ADD COMMENT

Login before adding your answer.

Traffic: 649 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6