Hello,
The GTF file from Ensembl for the human genome,
Homo_sapiens.NCBI36.52.gtf, is 194M and is a tab-delimited text file. I
import it into R and process it so that there are two objects:
genomeRanges & genomeBlocks. genomeRanges is a list of IRanges objects,
each of which is a particular chromosome and strand. genomeBlocks is a
list of data frames with the associated annotation for each of the
transcripts.
When I save this to file
(save(genomeBlocks,genomeRanges,file="Hsgenome.Rdata")) it comes out as
859M. How is this possible? Especially as the Rdata file is a binary
format.
> object.size(genomeBlocks)
[1] 2939935864
> object.size(genomeRanges)
[1] 8769208
Anyway, has anyone got any ideas what is going on?
Thanks
Dan
--
**************************************************************
Daniel Brewer, Ph.D.
Institute of Cancer Research
Molecular Carcinogenesis
Email: daniel.brewer at icr.ac.uk
**************************************************************
The Institute of Cancer Research: Royal Cancer Hospital, a charitable
Company Limited by Guarantee, Registered in England under Company No.
534147 with its Registered Office at 123 Old Brompton Road, London SW7
3RP.
On Thu, Mar 12, 2009 at 6:28 AM, Daniel Brewer <daniel.brewer@icr.ac.uk> wrote:
> > object.size(genomeBlocks)
> [1] 2939935864
>
You don't tell us exactly how you made this, but the above shows that
genomeBlocks is consuming about 2.9 GB in memory (object.size reports
bytes). Serializing that to 859M seems reasonable. If you want more
information, send more details.
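To make the comparison concrete, here is a minimal sketch of how the
in-memory estimate and the on-disk size can be put side by side. The
data frame x and the temporary files are made up for illustration; only
object.size(), save() and file.info() are standard R, and the numbers
will differ from the real genomeBlocks object.

```r
# Hypothetical example data: a repetitive table, loosely GTF-like.
x <- data.frame(
  id = seq_len(1e5),
  label = rep(c("exon", "CDS", "start_codon", "stop_codon"),
              length.out = 1e5),
  stringsAsFactors = FALSE
)

f_raw <- tempfile(fileext = ".RData")
f_gz  <- tempfile(fileext = ".RData")

# Write the same object with and without compression.
save(x, file = f_raw, compress = FALSE)
save(x, file = f_gz, compress = TRUE)

mem_bytes <- as.numeric(object.size(x))   # estimated size in memory
raw_bytes <- file.info(f_raw)$size        # serialized, uncompressed
gz_bytes  <- file.info(f_gz)$size         # serialized, compressed

cat("in memory:        ", mem_bytes, "bytes\n")
cat("saved, no compress:", raw_bytes, "bytes\n")
cat("saved, compressed: ", gz_bytes, "bytes\n")

unlink(c(f_raw, f_gz))
```

The three figures measure different things, so a saved file that is a
fraction of the object.size() value is not surprising in itself.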
I am not an expert in R data representations. However, my experience
suggests that if an object is stored incorrectly as a matrix instead of
a data.frame, then the object size may be bloated. Also, if it is a
data.frame, check that each column is stored with the right type, e.g.
via str(obj). Typical problems are numeric columns stored as factors or
characters.
Also use the compress=TRUE option in save().
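As a rough sketch of the column check suggested above, str() and
sapply(obj, class) show how each column is stored. The list blocks and
its column names below are invented for illustration and are not the
poster's actual genomeBlocks.

```r
# Hypothetical miniature of a list of annotation data frames in which
# every column, including the numeric "start", ended up as a factor.
blocks <- list(
  chr1_plus = data.frame(start = c("100", "250"),
                         feature = c("exon", "CDS"),
                         stringsAsFactors = TRUE),
  chr1_minus = data.frame(start = c("900", "40"),
                          feature = c("exon", "exon"),
                          stringsAsFactors = TRUE)
)

# Inspect one block, then tabulate the class of every column in every block.
str(blocks[[1]])
col_classes <- sapply(blocks, function(df) sapply(df, class))
print(col_classes)

# Convert via as.character() first; as.integer() applied directly to a
# factor would return the internal level codes instead of the values.
fix_block <- function(df) {
  df$start <- as.integer(as.character(df$start))
  df$feature <- as.character(df$feature)
  df
}
fixed <- lapply(blocks, fix_block)
str(fixed[[1]])
```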
Regards, Adai
That's great, thank you so much. There was a particular variable that
had long strings that was being treated as a factor, which caused the
problems. It is now down to 13M without compression. That's more
like it.
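One plausible mechanism, assuming the list was built by splitting one
large data frame into per-transcript pieces (the thread does not say
exactly how it was built): subsetting a factor keeps all of its levels,
so every small data frame carries the full set of long strings. The
sketch below uses invented data and names (ann, blocks_fac, blocks_chr)
purely to illustrate the effect.

```r
# Hypothetical stand-in for the annotation table: 200 transcripts with
# 5 rows each and one column of long, unique strings.
n_tx <- 200
ann <- data.frame(
  transcript = rep(sprintf("TX%04d", seq_len(n_tx)), each = 5),
  attributes = sprintf(
    "gene_id \"G%04d\"; exon_number \"%d\"; some long free-text annotation",
    rep(seq_len(n_tx), each = 5), rep(1:5, times = n_tx)
  ),
  stringsAsFactors = FALSE
)

# Variant 1: the long-string column is a factor.
ann_fac <- ann
ann_fac$attributes <- factor(ann_fac$attributes)
blocks_fac <- split(ann_fac, ann_fac$transcript)

# Variant 2: the same column kept as plain character.
blocks_chr <- split(ann, ann$transcript)

# Each 5-row block in blocks_fac still lists every level of the factor.
levels_per_block <- nlevels(blocks_fac[[1]]$attributes)

size_fac <- as.numeric(object.size(blocks_fac))
size_chr <- as.numeric(object.size(blocks_chr))
cat("levels per block: ", levels_per_block, "\n")
cat("factor version:   ", size_fac, "bytes\n")
cat("character version:", size_chr, "bytes\n")

# Dropping unused levels in each block is another way to shrink it.
blocks_dropped <- lapply(blocks_fac, droplevels)
size_dropped <- as.numeric(object.size(blocks_dropped))
cat("levels dropped:   ", size_dropped, "bytes\n")
```

Keeping such a column as character, or dropping unused levels after the
split, avoids repeating the full level set in every block.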
Thanks
Dan