rhdf5 and factors
@moritz-e-beber-5727
Last seen 9.3 years ago
European Union
Dear all,

I sent a message to Bernd Fischer, the maintainer of rhdf5, directly but got no response from him. My qualm lies with the writing and re-reading of factor vectors using rhdf5. In the current release they are simply written as integers, and upon reading the HDF5 files the levels are obviously forgotten.

Of course, I could convert the factors to character vectors before writing, but I wanted to ask whether there is a plan to implement better factor support or whether it's feasible to contribute code to facilitate such support.

TIA,
Moritz
rhdf5 convert
Bernd Fischer
@bernd-fischer-5348
Last seen 8.0 years ago
Germany / Heidelberg / DKFZ
Dear Moritz!

An easy solution for you would be to write the factor values (the integers) and the levels separately:

    h5write(as.integer(obj), file=file, name="objCODES")
    h5write(levels(obj), file=file, name="objLEVELS")

Best,
Bernd

--
Bernd Fischer
EMBL Heidelberg
Meyerhofstraße 1
69117 Heidelberg
Tel: +49 [0] 6221 387-8131
E-Mail: bernd.fischer@embl.de
Homepage: http://www-huber.embl.de/users/befische/

On 23.01.2013, at 16:05, Moritz Emanuel Beber <moritz.beber@gmail.com> wrote:
> [...]
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
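For completeness, here is a minimal sketch of the full round trip for this workaround, including the read-back step that the reply leaves implicit. It assumes the factor contains no NA values; the example data and the variable names `codes`, `lv` and `restored` are illustrative only.

```r
library(rhdf5)

obj <- factor(c("M", "F", "M", "F"))
file <- tempfile(fileext = ".h5")
h5createFile(file)

# Write the integer codes and the level labels as two separate datasets.
h5write(as.integer(obj), file = file, name = "objCODES")
h5write(levels(obj), file = file, name = "objLEVELS")

# Read both back; h5read() returns 1-d arrays, so coerce to plain vectors.
codes <- as.integer(h5read(file, "objCODES"))
lv <- as.character(h5read(file, "objLEVELS"))

# Rebuild the factor by indexing the labels with the codes.
restored <- factor(lv[codes], levels = lv)
```

Passing `levels = lv` explicitly keeps the original level order, including levels that happen not to occur in the data.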
On 01/27/2013 09:42 AM, Bernd Fischer wrote:
> Dear Moritz!
>
> An easy solution for you would be to separately write the factor values (the integers)
> and the levels:
>
>     h5write(as.integer(obj), file=file, name="objCODES")
>     h5write(levels(obj), file=file, name="objLEVELS")

I was thinking this would work

    f = factor(c("M", "F"))
    h5createFile(fl <- tempfile())
    res = h5write(f, fl, write.attributes=TRUE, name="f")

but the last line fails ('no applicable method for 'h5writeDataset' applied to an object of class "factor"'), so I then tried

    res = h5write(unclass(f), fl, write.attributes=TRUE, name="f")

which doesn't fail but doesn't seem to work?

    > dput(h5read(fl, "f", read.attributes=TRUE))
    structure(c(2L, 1L), .Dim = 2L)
    > dput(unclass(f))
    structure(c(2L, 1L), .Label = c("F", "M"))

I initially went down this line thinking that, since a factor (like many other R entities) is just a basic type plus attributes, it would be easy to support serializing a broad range of R data types (read/write.attributes=TRUE would be a better default if the objective were to provide a transparent way to use hdf5 as a storage back-end, which I think would be cool). But maybe there's no intention, getting back to the original poster's question, to support this kind of high-level functionality in this package? Or maybe there's scope for an elegant (because one just has to recurse through an R object to save it) additional package that extends rhdf5?

Martin

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
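The "basic type plus attributes" view can be checked in base R alone, without rhdf5. The sketch below strips a factor down to a bare integer vector and then restores it from its saved attributes; the variable names are illustrative.

```r
f <- factor(c("M", "F"))

codes <- unclass(f)      # integer codes that still carry a "levels" attribute
attrs <- attributes(f)   # a list holding the "levels" and "class" attributes

# Strip everything down to a bare integer vector, as a storage layer might.
bare <- as.vector(codes)

# Reattach the saved attributes to recover the original factor.
rebuilt <- bare
attributes(rebuilt) <- attrs
```

Any storage format that preserves both `bare` and `attrs` can therefore reproduce the factor exactly, which is why losing the attributes on write loses the levels.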
Dear Martin,

thank you for digging into this. I agree that it should not be hard to use (r)hdf5 as a storage backend for any R object by recursively saving the object in terms of its simple type components. My question is what the use case for that would be (given that we already have 'save' and 'load' in the base package)?

The use cases that we have been thinking about here, so far, involve:
(i) efficient r/w access to subarrays (hyperslabs), in particular for very large arrays (which don't reside in memory)
(ii) inter-language exchange of data

For both of these, it is clearly useful to deal with basic array types, and perhaps less so with more complex (non-sequential or R-idiosyncratic) objects.

Or are you thinking of
(iii) creating a full-fledged alternative to base::save, base::load?

Best wishes
Wolfgang

On Jan 27, 2013, at 9:40 PM, Martin Morgan <mtmorgan@fhcrc.org> wrote:
> [...]
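To illustrate use case (i), the following sketch reads a sub-block of a matrix through the `index` argument of `h5read()`. The matrix here is deliberately small and the dataset name "m" is arbitrary; the same call pattern is what one would use on an array too large to hold in memory.

```r
library(rhdf5)

file <- tempfile(fileext = ".h5")
h5createFile(file)

# A small stand-in for a large on-disk array.
m <- matrix(1:200, nrow = 20, ncol = 10)
h5write(m, file, "m")

# Request rows 5-8 and columns 2-3 only, one index vector per dimension.
block <- h5read(file, "m", index = list(5:8, 2:3))
```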
On 01/31/2013 05:06 AM, Wolfgang Huber wrote:
> [...]
> Or are you thinking of
> (iii) creating a full-fledged alternative to base::save, base::load?

Along the lines of (iii), to get the benefits of (i) and (ii). For instance, 'transparently' reading in a slice of a GRanges object (or more simply a factor or data.frame), instead of needing to write code to marshal the individual components.

I agree that truly complicated structures might be too difficult to use outside R, but a factor, data.frame, etc. would be parse-able. Some of the 'sugar' in Rcpp shows this.

Thanks for listening,

Martin
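As a rough sketch of what "transparently reading a slice of a factor" could look like on top of the two-dataset layout suggested earlier in the thread: `read_factor_slice` is a hypothetical helper, not part of rhdf5, and it assumes the factor was stored as "<name>CODES" and "<name>LEVELS" and contains no NA values.

```r
library(rhdf5)

# Hypothetical helper: read elements `i` of a factor stored as two datasets,
# "<name>CODES" (integer codes) and "<name>LEVELS" (level labels).
read_factor_slice <- function(file, name, i) {
  codes <- as.integer(h5read(file, paste0(name, "CODES"), index = list(i)))
  lv <- as.character(h5read(file, paste0(name, "LEVELS")))
  factor(lv[codes], levels = lv)
}

file <- tempfile(fileext = ".h5")
h5createFile(file)
obj <- factor(c("a", "b", "c", "a", "c", "b"))
h5write(as.integer(obj), file, "objCODES")
h5write(levels(obj), file, "objLEVELS")

# Only elements 2 to 4 of the codes are requested; the levels are read in full.
part <- read_factor_slice(file, "obj", 2:4)
```

The caller gets back an ordinary factor and never handles the two components directly, which is the marshalling step the message above would like to avoid writing by hand.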
Dear Martin and Wolfgang!

I agree with Martin that (iii) would be useful as an easy way to establish (i) and (ii), but I don't think rhdf5 will replace save/load, because it is less efficient.

I tried to write attributes along with each object, but in the end I set the default for 'write.attributes' to FALSE, because HDF5 attributes are very limited:

1.) HDF5 attributes are limited to a certain maximum size and dimensionality (I think a length of around 65000 and a maximum of 5 dimensions).
2.) HDF5 attributes are limited to simple data types, but R often uses more complex data types in attributes; e.g. dimnames is a list, which cannot be stored as an HDF5 attribute.

Thus, a standard way of storing attributes is so far missing. My plans are the following: all R attributes should be stored as separate HDF5 objects. These objects are then linked to the main object (HDF5 references should provide the functionality). An HDF5 attribute with the R class name and package name is stored together with the main object. This should make it possible to automatically create an object of the respective class.

Any other suggestions are welcome,

Bernd

On 31.01.2013, at 14:42, Martin Morgan <mtmorgan@fhcrc.org> wrote:
> [...]
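The plan above could be approximated today with a simplified layout. Note that this sketch is not the design described in the message: instead of HDF5 references and an HDF5 attribute for the class name, it stores the data and each R attribute (including "class") as plain datasets inside one group. `write_with_attrs` and `read_with_attrs` are hypothetical names, only atomic attributes are handled (a list such as dimnames is skipped), and the object is assumed to carry at least one atomic attribute.

```r
library(rhdf5)

# Hypothetical helper: store the bare data in "<name>/data" and every atomic
# R attribute as its own dataset under "<name>/attr/".
write_with_attrs <- function(x, file, name) {
  h5createGroup(file, name)
  h5createGroup(file, paste0(name, "/attr"))
  at <- attributes(x)
  h5write(as.vector(unclass(x)), file, paste0(name, "/data"))
  for (a in names(at)) {
    if (is.atomic(at[[a]])) {   # lists such as dimnames are skipped
      h5write(at[[a]], file, paste0(name, "/attr/", a))
    }
  }
}

# Hypothetical counterpart: read the data, then reattach the stored attributes.
read_with_attrs <- function(file, name) {
  data <- as.vector(h5read(file, paste0(name, "/data")))
  at <- h5read(file, paste0(name, "/attr"))   # a group is read as a named list
  cls <- at[["class"]]
  at[["class"]] <- NULL
  for (a in names(at)) attr(data, a) <- as.vector(at[[a]])
  if (!is.null(cls)) class(data) <- as.vector(cls)   # set the class last
  data
}

file <- tempfile(fileext = ".h5")
h5createFile(file)
f <- factor(c("M", "F", "M"))
write_with_attrs(f, file, "f")
g <- read_with_attrs(file, "f")
```

Storing attributes as datasets sidesteps the size and type limits on HDF5 attributes mentioned above, at the cost of a layout that other tools would need to be told about.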