I've recently started to use biomaRt seriously. In the past I just did
a few tens of searches and everything worked fine. Now I have several
datasets of several thousand IDs each.
I imagined that sending a single search with 3000 IDs might not be a
good idea. I tried, and it broke after a while... and got no results.
So I divided the IDs into blocks of 200 and proceeded to send my
queries that way, 200 IDs at a time, saving results as I go along.
This worked very well for my first set of 953 IDs. When processing my
second dataset of 1545 IDs, the connection broke after 1200.
I obtained this error:
"Error in value[[3L]](cond) :
Request to BioMart web service failed. Verify if you are still
connected to the internet. Alternatively the BioMart web service is
temporarily down."
I am connected to the internet, and I see no evidence of BioMart being
down...
Could this somehow be related to the size of my queries? I tried to
find out what size is OK to send in one block, but I didn't find
anything definite, only that sending one ID at a time in a loop is not
a good idea.
Any help greatly appreciated.
Thanks!
Jose
PS: sessionInfo()
R version 2.10.0 (2009-10-26)
i386-pc-mingw32
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.2.0
loaded via a namespace (and not attached):
[1] RCurl_1.2-1 XML_2.6-0
--
Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology        Fax:   +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK
*********************************************
NEW EMAIL from July'09: nach.mcnach at gmail.com
*********************************************
Hi Jose,
J.delasHeras at ed.ac.uk wrote:
>
> I've recently started to use biomaRt seriously. In the past I just
> did a few tens of searches and everything worked fine. Now I have
> several datasets of several thousand IDs each.
>
> I imagine that sending a single search with 3000 IDs might not be a
> good idea. I tried, and it broke after a while... and got no results.
A query of 3000 IDs is no problem for biomaRt - you should be able to
do a much larger query than that without any trouble.
It would be helpful if you tried your query again and, if it fails,
send the results of a traceback().
>
> So I divided the IDs into blocks of 200 and proceeded to send my
> queries that way, 200 IDs at a time, saving results as I go along.
This is a bad idea for two reasons. First, as you found, you can get
transient connection problems that will break your loop. Second,
repeatedly querying online database resources in a tight loop is
commonly considered abuse of the resource, and can get your IP banned
from further queries.
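For illustration only (this sketch is not from the original thread):
if a large ID list is split into chunks anyway, each getBM() call can
be wrapped in a retry with a pause, so that a transient failure does
not kill the whole loop. The helper name, filter, number of tries and
pause length below are all arbitrary choices.

library(biomaRt)

getBMwithRetry <- function(ids, mart, attrs, tries = 3, pause = 30) {
  for (i in seq_len(tries)) {
    res <- tryCatch(
      getBM(attributes = attrs, filters = "entrezgene",
            values = ids, mart = mart),
      error = function(e) NULL)   # swallow the error and retry
    if (!is.null(res)) return(res)
    Sys.sleep(pause)              # back off before the next attempt
  }
  stop("query still failing after ", tries, " attempts")
}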
Best,
Jim
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
Quoting "James W. MacDonald" <jmacdon at="" med.umich.edu="">:
> Hi Jose,
>
> J.delasHeras at ed.ac.uk wrote:
>>
>> I've recently started to use biomaRt seriously. In the past I just
>> did a few tens of searches and everything worked fine. Now I have
>> several datasets of several thousand IDs each.
>>
>> I imagine that sending a single search with 3000 IDs might not be a
>> good idea. I tried, and it broke after a while... and got no
>> results.
>
> A query of 3000 IDs is no problem for biomaRt - you should be able
> to do a much larger query than that without any trouble.
>
> It would be helpful if you tried your query again and, if it fails,
> send the results of a traceback().
Hi James,
thanks for the reply.
After what you said, I tried my 1545 IDs again in one single query,
rather than in blocks of 200. I got a different error (after a good
30-40 min), which now suggests a memory issue:
"Error in gsub("\n", "", postRes) :
Calloc could not allocate (841769536 of 1) memory"
which surprised me because, as far as I can tell, I have plenty of
memory available...
I do expect the results to be a large data frame, as I'm retrieving a
number of different attributes, so each original ID ends up producing
a good number of rows (which I would process later).
For completeness, in case it matters, my query is (a self-contained
sketch follows the attribute list below):
> BMresults <- getBM(attributes=dataset.attributes,
+                    filters="entrezgene",
+                    values=geneids, mart=ensembl)
where
geneids (values) contains 1545 Entrez Gene IDs (human)
dataset: human ("hsapiens_gene_ensembl")
mart: ensembl
attributes:
"entrezgene",
"ensembl_gene_id",
"go_cellular_component_id",
"go_biological_process_id",
"go_molecular_function_id",
"go_cellular_component__dm_name_1006",
"name_1006",
"go_molecular_function__dm_name_1006",
"goslim_goa_accession",
"goslim_goa_description"
Similar queries on the mouse and rat datasets (1200 and 950 IDs
respectively) worked OK.
In this case traceback() only shows that it was removing end-of-line
characters from some object:
> traceback()
2: gsub("\n", "", postRes)
1: getBM(attributes = dataset.attributes, filters = "entrezgene",
       values = geneids, mart = ensembl)
If I'm running out of memory (running Windows XP, 32-bit, 4 Gb RAM...
but I suspect R may not be able to use the 3 Gb I try to make
available with the memory.size() function), then I suppose dividing
the task into two queries (or three) might help... just not dozens of
them. Any other suggestion?
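If it helps, on Windows builds of R the memory ceiling can be
inspected and raised with memory.limit() (the size below is only an
example; memory.size() reports usage rather than setting the limit):

memory.limit()              # current ceiling in Mb (Windows only)
memory.limit(size = 3000)   # try to raise it to ~3 Gb; it can only go up
memory.size(max = TRUE)     # maximum memory obtained so far this session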
Jose
Dear Javier
Try these:
1. Set
   options(error=recover)
   and then use the 'post mortem' debugger to see why postRes (a
   character string) is so large. Let us know what you find!
2. Rather than splitting up the query genes, you could split up the
   attributes and only ask for a few at a time, and/or see which one
   causes the large size of the result (a sketch of this idea follows
   below).
3. Send us a reproducible example (i.e. one that others can reproduce
   by copy-pasting from your email).
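To make suggestion 2 concrete, here is a rough sketch (the attribute
groupings and object names are purely illustrative, and geneids is
assumed to hold the Entrez IDs from the query above):

library(biomaRt)

ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

## small groups of attributes, each anchored on the filter key so the
## partial results can still be related back to the query IDs
attr.groups <- list(
  go_ids   = c("entrezgene", "go_biological_process_id",
               "go_molecular_function_id", "go_cellular_component_id"),
  go_names = c("entrezgene", "name_1006"),
  goslim   = c("entrezgene", "goslim_goa_accession",
               "goslim_goa_description"))

## one getBM() call per attribute group
results <- lapply(attr.groups, function(attrs)
  getBM(attributes = attrs, filters = "entrezgene",
        values = geneids, mart = ensembl))

sapply(results, nrow)   # which attribute group blows up the result size?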
Best wishes
Wolfgang
--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber/contact
Quoting Wolfgang Huber <whuber at embl.de>:
>
> Dear Javier
>
> Try these:
>
> 1. Set
>    options(error=recover)
>    and then use the 'post mortem' debugger to see why postRes (a
>    character string) is so large. Let us know what you find!
>
> 2. Rather than splitting up the query genes, you could split up the
>    attributes and only ask for a few at a time, and/or see which one
>    causes the large size of the result
>
> 3. Send us a reproducible example (i.e. one that others can
>    reproduce by copy-pasting from your email).
>
> Best wishes
> Wolfgang
"My name is not Javier!!!"
(you had to be in Spain in the 80s to get the joke... nevermind, it
was a silly pop song ;-)
Thank you for the suggestions. I managed to finish what I was doing
(breaking the query into chunks of 200ids at a time) but I have some
more searches coming and will definitely use a different approach, and
try the options(error=recover) method to investigate if I have
problems.
My query, as you suggest above, would be better performed by using
less attributes, rather than splitting the ids. I just didn't have
enough experience in this. When using multiple attributes, the
resulting data frame may contain quite a few more rows of data, if
there are multiple values for some of teh attributes... and this
happens a lot when looking at gene ontologies.
I may have started with a 1545 id vector, but ended up with a data
frame containing nearly 4 million rows! (assembled from 8 individual
queries of ~200 ids at a time) I will definitely not do it again this
way!
Much better to pick less attributes and then process the data, and
then I'll probably be able to process all IDs at once.
Thank you for your help, Wolfgang and Jim.
Jose
Hola José
sorry for the name confusion. The way that BioMart presents
many-to-one relationships (producing one single big table with all
queried attributes, and possibly lots of repetitions in some columns)
can be very space-inefficient. This is the price that the system's
design pays for its simplicity.
Anyway, I don't think it should return table rows that are completely
identical - if you (or someone else here) come across such an
instance, then please report it on this list!
Best wishes
Wolfgang
PS Do you know the way to San ... :)
Quoting Wolfgang Huber <whuber at embl.de>:
> Hola José
>
> sorry for the name confusion. The way that BioMart presents
> many-to-one relationships (producing one single big table with all
> queried attributes, and possibly lots of repetitions in some
> columns) can be very space-inefficient. This is the price that the
> system's design pays for its simplicity.
>
> Anyway, I don't think it should return table rows that are
> completely identical - if you (or someone else here) come across
> such an instance, then please report it on this list!
>
> Best wishes
> Wolfgang
Hi Wolfgang,
no worries (about the name).
Yes, the results table is not the most space-efficient, but it IS
simple. It's just a matter of knowing the shape the results will take
(and now I know), and one can easily write the code accordingly. I
didn't come across entirely repeated rows; there was always at least
one difference. I think it works just the way it's supposed to.
I like to process this type of data by merging unique multiple hits
(GO IDs, for instance) into one cell, maybe separated by a pipe
character "|". The resulting table is a lot smaller and can still be
easily searched.
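As a rough sketch of that collapsing step (the column names are
illustrative, and BMresults is assumed to be the data frame returned
by the getBM() query above):

## collapse the many rows per gene into one row, joining the distinct
## values in each remaining column with "|"
key  <- "entrezgene"
rest <- setdiff(names(BMresults), key)
collapsed <- aggregate(BMresults[rest],
                       by  = list(entrezgene = BMresults[[key]]),
                       FUN = function(x) paste(unique(x), collapse = "|"))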
> PS Do you know the way to San ... :)
(sitting at the piano)
no, but if you hum it... ;-)
Jose