Curious file size issues
Daniel Brewer ★ 1.9k
@daniel-brewer-1791
Last seen 10.2 years ago
Hello,

The GTF file from Ensembl for the human genome, Homo_sapiens.NCBI36.52.gtf, is 194M and is a tab-delimited text file. I import it into R and process it so that there are two objects: genomeRanges and genomeBlocks. genomeRanges is a list of IRanges objects, each of which covers a particular chromosome and strand. genomeBlocks is a list of data frames with the associated annotation for each of the transcripts.

When I save this to file (save(genomeBlocks,genomeRanges,file="Hsgenome.Rdata")) it comes out as 859M. How is this possible? Especially as the Rdata file is a binary format.

> object.size(genomeBlocks)
[1] 2939935864
> object.size(genomeRanges)
[1] 8769208

Anyone got any ideas what is going on?

Thanks,
Dan

--
Daniel Brewer, Ph.D.
Institute of Cancer Research, Molecular Carcinogenesis
Email: daniel.brewer at icr.ac.uk
Annotation Cancer PROcess • 1.0k views
@vincent-j-carey-jr-4
Last seen 10 weeks ago
United States
On Thu, Mar 12, 2009 at 6:28 AM, Daniel Brewer wrote:

> > object.size(genomeBlocks)
> [1] 2939935864

You don't tell us exactly how you made this, but the above shows that genomeBlocks is consuming about 2.9 GB. Serializing to 859M seems reasonable. If you want more information, send more details.
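To make the comparison above concrete, here is a minimal sketch of how the in-memory estimate from object.size() can be set against the size of the saved file. The data frame `toy` and its columns are invented stand-ins, not the poster's objects, and the exact numbers will vary with the R version.

```r
# Hypothetical stand-in data; not the poster's actual objects.
set.seed(1)
toy <- data.frame(
  id    = seq_len(1e5),
  label = sample(c("exon", "CDS", "start_codon"), 1e5, replace = TRUE),
  stringsAsFactors = FALSE
)

# Estimated number of bytes the object occupies in memory.
in_memory <- as.numeric(object.size(toy))

# Save the same object with and without compression.
raw_file <- tempfile(fileext = ".RData")
gz_file  <- tempfile(fileext = ".RData")
save(toy, file = raw_file, compress = FALSE)
save(toy, file = gz_file,  compress = TRUE)

sizes <- c(in_memory    = in_memory,
           uncompressed = file.size(raw_file),
           compressed   = file.size(gz_file))
print(sizes)

# Remove the temporary files.
unlink(c(raw_file, gz_file))
```

The three figures measure different things, so they need not agree: object.size() estimates memory use, while the file sizes depend on the serialization format and on whether compression is applied.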
@adaikalavan-ramasamy-2749
Last seen 10.2 years ago
I am not an expert in R data representations. However, my experience suggests that if an object is stored incorrectly as a matrix instead of a data.frame, then the object size may be bloated. Also, if it is a data.frame, check that each column is stored with the correct type, for example with str(obj) or sapply(obj, class). Storing numeric columns as factors or characters is a typical cause.

Also use the compress=TRUE option in save().

Regards, Adai
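The column check suggested above could look something like the following sketch. The data frame `annot` and its column names are made up for illustration; the point is only to show one way of listing column classes, estimating per-column memory, and converting factor columns back to character.

```r
# Hypothetical annotation table; column names are illustrative only.
annot <- data.frame(
  start      = c(100L, 250L, 900L),
  end        = c(200L, 400L, 1200L),
  transcript = c("ENST_A", "ENST_B", "ENST_C"),
  stringsAsFactors = TRUE   # forces the text column to become a factor
)

# One class per column: a quick way to spot unintended factors.
col_classes <- sapply(annot, class)
print(col_classes)

# Estimated memory per column, in bytes.
col_bytes <- sapply(annot, function(x) as.numeric(object.size(x)))
print(col_bytes)

# Convert any factor columns back to plain character vectors.
is_fac <- sapply(annot, is.factor)
annot[is_fac] <- lapply(annot[is_fac], as.character)
fixed_classes <- sapply(annot, class)
print(fixed_classes)
```

When reading a file with read.table() or read.delim(), passing stringsAsFactors = FALSE avoids the conversion in the first place.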
That's great, thank you so much. There was a particular variable that had long strings that was being treated as a factor, which caused the problems. It is now down to 13M without compression. That's more like it.

Thanks,
Dan
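One plausible mechanism for this kind of bloat, offered as an assumption since the thread does not show how genomeBlocks was built: when a data frame with a factor column is cut into many small pieces, each piece keeps the full set of factor levels. The sketch below uses invented data (2000 unique 200-character strings split into 200 blocks) to illustrate the effect.

```r
# Hypothetical data: 2000 rows, each with a unique 200-character string.
n <- 2000
long_ids <- paste0(sprintf("%06d", seq_len(n)), strrep("x", 194))

as_factor <- data.frame(pos = seq_len(n), id = factor(long_ids))
as_char   <- data.frame(pos = seq_len(n), id = long_ids,
                        stringsAsFactors = FALSE)

# Split into 200 blocks of 10 rows, as one might split by transcript.
idx <- split(seq_len(n), rep(seq_len(200), each = 10))
blocks_factor <- lapply(idx, function(i) as_factor[i, ])
blocks_char   <- lapply(idx, function(i) as_char[i, ])

# Row subsetting does not drop unused levels, so every block
# still carries all 2000 of them.
levels_per_block <- nlevels(blocks_factor[[1]]$id)

# Dropping unused levels is an alternative to converting to character.
blocks_dropped <- lapply(blocks_factor, droplevels)

size_factor  <- as.numeric(object.size(blocks_factor))
size_char    <- as.numeric(object.size(blocks_char))
size_dropped <- as.numeric(object.size(blocks_dropped))
print(c(factor = size_factor, character = size_char, dropped = size_dropped))
```

Either storing the column as character or calling droplevels() on each block keeps the per-block overhead proportional to the rows actually present.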