I've recently started to use biomaRt seriously. In the past I just did
a few tens of searches and everything worked fine. Now I have several
datasets of several thousand IDs each.
I imagined that sending a single search with 3000 IDs might not be a
good idea. I tried, and it broke after a while... and got no results.
So I divided the IDs into blocks of 200 and proceeded to send my
queries that way, 200 IDs at a time, saving results as I go along.
This worked very well for my first set of 953 IDs. When processing my
second dataset of 1545 IDs, the connection broke after 1200.
I obtained this error:
"Error in value[[3L]](cond) :
Request to BioMart web service failed. Verify if you are still
connected to the internet. Alternatively the BioMart web service is
temporarily down."
I am connected to the internet, and I see no evidence of BioMart being
down...
Could this somehow be related to the size of my queries? I tried to
find out what size is OK to send in one block, but I didn't find
anything definite, only that sending one ID at a time in a loop is not
a good idea.
Any help greatly appreciated.
Thanks!
Jose
PS: sessionInfo()
R version 2.10.0 (2009-10-26)
i386-pc-mingw32
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.2.0
loaded via a namespace (and not attached):
[1] RCurl_1.2-1 XML_2.6-0
--
Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology        Fax:   +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK
*********************************************
NEW EMAIL from July'09: nach.mcnach at gmail.com
*********************************************
Hi Jose,
J.delasHeras at ed.ac.uk wrote:
>
> I've recently started to use biomaRt seriously. In the past I just
> did a few tens of searches and everything worked fine. Now I have
> several datasets of several thousand IDs each.
>
> I imagine that sending a single search with 3000 IDs might not be a
> good idea. I tried, and it broke after a while... and got no results.
A query of 3000 IDs is no problem for biomaRt - you should be able to
do a much larger query than that without any trouble.
It would be helpful if you tried your query again and, if it fails,
send the results of a traceback().
>
> So I divided the IDs into blocks of 200 and proceeded to send my
> queries that way, 200 IDs at a time, saving results as I go along.
This is a bad idea for two reasons. First, as you found, you can get
transient connection problems that will break your loop. Second,
repeatedly querying online database resources in a tight loop is
commonly considered abuse of the resource, and can get your IP banned
from further queries.
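For illustration only (this sketch is not from the original thread):
if a large ID list is split into chunks anyway, each getBM() call can
be wrapped in a retry with a pause, so that a transient failure does
not kill the whole loop. The helper name, filter, number of tries and
pause length below are all arbitrary choices.

library(biomaRt)

getBMwithRetry <- function(ids, mart, attrs, tries = 3, pause = 30) {
  for (i in seq_len(tries)) {
    res <- tryCatch(
      getBM(attributes = attrs, filters = "entrezgene",
            values = ids, mart = mart),
      error = function(e) NULL)   # swallow the error and retry
    if (!is.null(res)) return(res)
    Sys.sleep(pause)              # back off before the next attempt
  }
  stop("query still failing after ", tries, " attempts")
}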
Best,
Jim
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
Quoting "James W. MacDonald" <jmacdon at="" med.umich.edu="">:
> Hi Jose,
>
> J.delasHeras at ed.ac.uk wrote:
>>
>> I've recently started to use biomaRt seriously. In the past I just
>> did a few tens of searches and everything worked fine. Now I have
>> several datasets of several thousand IDs each.
>>
>> I imagine that sending a single search with 3000 IDs might not be a
>> good idea. I tried, and it broke after a while... and got no
>> results.
>
> A query of 3000 IDs is no problem for biomaRt - you should be able
> to do a much larger query than that without any trouble.
>
> It would be helpful if you tried your query again and, if it fails,
> send the results of a traceback().
Hi James,
thanks for the reply.
After what you said, I tried my 1545 IDs again in one single query,
rather than in blocks of 200. I got a different error (after a good
30-40 min), which now suggests a memory issue:
"Error in gsub("\n", "", postRes) :
Calloc could not allocate (841769536 of 1) memory"
which surprised me because, as far as I can tell, I have plenty of
memory available...
I do expect the results to be a large data frame, as I'm retrieving a
number of different attributes, so each original ID ends up producing
a good number of rows (which I would process later).
For completeness, in case it matters, my query is (a self-contained
sketch follows the attribute list below):
> BMresults <- getBM(attributes=dataset.attributes,
+                    filters="entrezgene",
+                    values=geneids, mart=ensembl)
where
geneids (values) contains 1545 Entrez Gene IDs (human)
dataset: human ("hsapiens_gene_ensembl")
mart: ensembl
attributes:
"entrezgene",
"ensembl_gene_id",
"go_cellular_component_id",
"go_biological_process_id",
"go_molecular_function_id",
"go_cellular_component__dm_name_1006",
"name_1006",
"go_molecular_function__dm_name_1006",
"goslim_goa_accession",
"goslim_goa_description"
Similar queries on the mouse and rat datasets (1200 and 950 IDs
respectively) worked OK.
In this case traceback() only shows that it was removing end-of-line
characters from some object:
> traceback()
2: gsub("\n", "", postRes)
1: getBM(attributes = dataset.attributes, filters = "entrezgene",
       values = geneids, mart = ensembl)
If I'm running out of memory (running Windows XP, 32-bit, 4 Gb RAM...
but I suspect R may not be able to use the 3 Gb I try to make
available with the memory.size() function), then I suppose dividing
the task into two queries (or three) might help... just not dozens of
them. Any other suggestion?
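If it helps, on Windows builds of R the memory ceiling can be
inspected and raised with memory.limit() (the size below is only an
example; memory.size() reports usage rather than setting the limit):

memory.limit()              # current ceiling in Mb (Windows only)
memory.limit(size = 3000)   # try to raise it to ~3 Gb; it can only go up
memory.size(max = TRUE)     # maximum memory obtained so far this session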
Jose
Dear Javier
Try these:
1. Set
   options(error=recover)
   and then use the 'post mortem' debugger to see why postRes (a
   character string) is so large. Let us know what you find!
2. Rather than splitting up the query genes, you could split up the
   attributes and only ask for a few at a time, and/or see which one
   causes the large size of the result (a sketch of this idea follows
   below).
3. Send us a reproducible example (i.e. one that others can reproduce
   by copy-pasting from your email).
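To make suggestion 2 concrete, here is a rough sketch (the attribute
groupings and object names are purely illustrative, and geneids is
assumed to hold the Entrez IDs from the query above):

library(biomaRt)

ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

## small groups of attributes, each anchored on the filter key so the
## partial results can still be related back to the query IDs
attr.groups <- list(
  go_ids   = c("entrezgene", "go_biological_process_id",
               "go_molecular_function_id", "go_cellular_component_id"),
  go_names = c("entrezgene", "name_1006"),
  goslim   = c("entrezgene", "goslim_goa_accession",
               "goslim_goa_description"))

## one getBM() call per attribute group
results <- lapply(attr.groups, function(attrs)
  getBM(attributes = attrs, filters = "entrezgene",
        values = geneids, mart = ensembl))

sapply(results, nrow)   # which attribute group blows up the result size?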
Best wishes
Wolfgang
--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber/contact
Quoting Wolfgang Huber <whuber at embl.de>:
>
> Dear Javier
>
> Try these:
>
> 1. Set
>    options(error=recover)
>    and then use the 'post mortem' debugger to see why postRes (a
>    character string) is so large. Let us know what you find!
>
> 2. Rather than splitting up the query genes, you could split up the
>    attributes and only ask for a few at a time, and/or see which one
>    causes the large size of the result
>
> 3. Send us a reproducible example (i.e. one that others can
>    reproduce by copy-pasting from your email).
>
> Best wishes
> Wolfgang
"My name is not Javier!!!"
(you had to be in Spain in the 80s to get the joke... nevermind, it
was a silly pop song ;-)
Thank you for the suggestions. I managed to finish what I was doing
(breaking the query into chunks of 200ids at a time) but I have some
more searches coming and will definitely use a different approach, and
try the options(error=recover) method to investigate if I have
problems.
My query, as you suggest above, would be better performed by using
less attributes, rather than splitting the ids. I just didn't have
enough experience in this. When using multiple attributes, the
resulting data frame may contain quite a few more rows of data, if
there are multiple values for some of teh attributes... and this
happens a lot when looking at gene ontologies.
I may have started with a 1545 id vector, but ended up with a data
frame containing nearly 4 million rows! (assembled from 8 individual
queries of ~200 ids at a time) I will definitely not do it again this
way!
Much better to pick less attributes and then process the data, and
then I'll probably be able to process all IDs at once.
Thank you for your help, Wolfgang and Jim.
Jose
Hola José
sorry for the name confusion. The way that BioMart presents
many-to-one relationships (producing one single big table with all
queried attributes, and possibly lots of repetitions in some columns)
can be very space-inefficient. This is the price that the system's
design pays for its simplicity.
Anyway, I don't think it should return table rows that are completely
identical - if you (or someone else here) come across such an
instance, then please report it on this list!
Best wishes
Wolfgang
PS Do you know the way to San ... :)
Quoting Wolfgang Huber <whuber at embl.de>:
> Hola José
>
> sorry for the name confusion. The way that BioMart presents
> many-to-one relationships (producing one single big table with all
> queried attributes, and possibly lots of repetitions in some
> columns) can be very space-inefficient. This is the price that the
> system's design pays for its simplicity.
>
> Anyway, I don't think it should return table rows that are
> completely identical - if you (or someone else here) come across
> such an instance, then please report it on this list!
>
> Best wishes
> Wolfgang
Hi Wolfgang,
no worries (about the name).
Yes, the results table is not the most space-efficient, but it IS
simple. It's just a matter of knowing the shape the results will take
(and now I know), and one can easily write the code accordingly. I
didn't come across entirely repeated rows; there was always at least
one difference. I think it works just the way it's supposed to.
I like to process this type of data by merging unique multiple hits
(GO IDs, for instance) into one cell, maybe separated by a pipe
character "|". The resulting table is a lot smaller and can still be
easily searched.
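As a rough sketch of that collapsing step (the column names are
illustrative, and BMresults is assumed to be the data frame returned
by the getBM() query above):

## collapse the many rows per gene into one row, joining the distinct
## values in each remaining column with "|"
key  <- "entrezgene"
rest <- setdiff(names(BMresults), key)
collapsed <- aggregate(BMresults[rest],
                       by  = list(entrezgene = BMresults[[key]]),
                       FUN = function(x) paste(unique(x), collapse = "|"))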
> PS Do you know the way to San ... :)
(sitting at the piano)
no, but if you hum it... ;-)
Jose