queryGEO fails on GDS files (GEO Datasets)
Peter ▴ 170
@peter-1556
Last seen 10.2 years ago
This follows on from a question from Saurin D. Jani on the list a year ago:

https://stat.ethz.ch/pipermail/bioconductor/2005-January/007405.html

A working example:

library(AnnBuilder)
geo <- GEO()
queryGEO(geo,"GSM107")

This downloads and parses:

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM107&targ=self&form=text&view=data

This fails for GEO Datasets (GDS files) like GDS813 (Saurin's example) because the URL isn't accepted: NCBI returns an HTML page which redirects you to:

http://www.ncbi.nlm.nih.gov/projects/geo/gds/gds_browse.cgi?gds=813

This page in turn can be used (by a human; it is a little trickier in code) to download the actual GDS file, but only in compressed form:

ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft_gz/GDS813.soft.gz

What this means is that at the moment queryGEO doesn't support GDS files. Even if it did, they are generally large and only available in compressed format, which complicates things further.

Would it make more sense to provide two separate functions?

Firstly, one to download the file (dealing with all possible URLs) and, if need be, decompress it.

Secondly, one to parse a GEO file from the provided handle/filename/URL.

This makes sense for other large GEO files like the GPL annotation files, as well as the GEO datasets (GDS files). It seems wasteful and slow to download them fresh each time.

Peter
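The download/parse split proposed above could look roughly like the following base R sketch. The function names (gds_soft_url, fetch_gds, read_soft_lines) and the `download` argument are hypothetical, introduced only for illustration, and the FTP path simply follows the pattern quoted in the post, which may no longer match NCBI's current layout.

```r
# Build the FTP URL for a GDS SOFT file, following the pattern quoted in the post.
# Assumption: the server layout is unchanged from the one shown above.
gds_soft_url <- function(acc) {
  stopifnot(grepl("^GDS[0-9]+$", acc))
  sprintf("ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft_gz/%s.soft.gz", acc)
}

# Step 1: download the compressed file only if no local copy exists yet.
# `download` is a hypothetical hook so the downloader can be swapped out.
fetch_gds <- function(acc, destdir = tempdir(), download = utils::download.file) {
  dest <- file.path(destdir, paste0(acc, ".soft.gz"))
  if (!file.exists(dest)) {
    download(gds_soft_url(acc), dest, mode = "wb")
  }
  dest
}

# Step 2: read a SOFT file from a local path. gzfile() decompresses on the fly
# and can also read uncompressed files.
read_soft_lines <- function(path) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  readLines(con)
}
```

A second call to fetch_gds("GDS813") with the same destdir returns the cached path without touching the network, which addresses the "download them fresh each time" concern. Turning the lines into a structured object is left out here.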
@sean-davis-490
Last seen 12 weeks ago
United States
Peter,

I have recently uploaded a new package to Bioconductor called GEOquery. It is available as a development package (http://www.bioconductor.org/packages/bioc/1.8/html/GEOquery.html), but it doesn't depend on much, so it should work with recent R and Bioconductor releases. It is capable of downloading and parsing GDS, GSM, GPL, and GSE. (GSE download and parsing seems to be broken on Windows, at least for some GSEs; working on that.) After installing, you could do:

> library(GEOquery)
# the following takes about a minute or so....
> gds813 <- getGEO('GDS813')

And then to convert to an exprSet, simply do:

> eset <- GDS2eSet(GDS,do.log2=TRUE)
> eset
Expression Set (exprSet) with
        22690 genes
        20 samples
        phenoData object with 4 variables and 38 cases
        varLabels
                : sample
                : disease.state
                : tissue
                : description

Sean
On 1/4/06 10:50 AM, "Sean Davis" <sdavis2 at mail.nih.gov> wrote:

> Peter,
>
> I have recently uploaded a new package to bioconductor called GEOquery. It
> is available as a development package
> (http://www.bioconductor.org/packages/bioc/1.8/html/GEOquery.html), but it
> doesn't depend on much, so should work with recent R and bioconductor
> releases. It is capable of downloading and parsing GDS, GSM, GPL, and GSE.
> (GSE download and parsing seems to be broken on windows, at least for some
> GSEs--working on that). After installing, you could do:
>
>> library(GEOquery)
> # the following takes about a minute or so....
>> gds813 <- getGEO('GDS813')
>
> And then to convert to an exprSet, simply do:
>
>> eset <- GDS2eSet(GDS,do.log2=TRUE)

Made a typo in the line above:

eset <- GDS2eSet(gds813,do.log2=TRUE)

will make an exprSet including the sample information from the GDS that was downloaded and parsed using getGEO above.

>> eset
> Expression Set (exprSet) with
> 22690 genes
> 20 samples
> phenoData object with 4 variables and 38 cases
> varLabels
> : sample
> : disease.state
> : tissue
> : description
>
> Sean
>
> On 1/4/06 10:27 AM, "Peter" <bioconductor-mailinglist at maubp.freeserve.co.uk> wrote:
>
>> Would it make more sense to provide two separate functions:
>>
>> Firstly, to download the file (dealing with all possible URLs) and if
>> need be decompress it.

See the function "getGEOfile" in the GEOquery package.

>> Secondly, to parse a GEO file from the provided handle/filename/url
>>
>> This makes sense for other large GEO files like the GPL annotation
>> files, as well as the GEO datasets (GDS files). It seems wasteful and
>> slow to download them fresh each time.

The getGEO function also includes a filename argument. The file given by the filename will be parsed as a GEO file; .gz files are handled appropriately as long as the file extension '.gz' is present.

Sean
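Combining the two functions mentioned above, a download-once, parse-locally workflow might be sketched as follows. It assumes GEOquery is installed and that getGEOfile returns the local path of the file it stored (as recent GEOquery versions document). The wrapper load_geo and its `fetch` and `parse` arguments are hypothetical hooks, added only so the flow can be exercised without network access.

```r
# Sketch: fetch a GEO file once, then parse the local copy.
# `load_geo`, `fetch` and `parse` are illustrative names, not part of GEOquery.
load_geo <- function(acc, destdir = tempdir(),
                     fetch = function(acc, destdir) {
                       GEOquery::getGEOfile(acc, destdir = destdir)
                     },
                     parse = function(path) {
                       GEOquery::getGEO(filename = path)
                     }) {
  path <- fetch(acc, destdir)  # download step; expected to return the local path
  parse(path)                  # parse step; a '.gz' extension signals compression
}

# Usage (needs network access and GEOquery, so left commented out):
# gds813 <- load_geo("GDS813", destdir = "geo_cache")
# eset   <- GEOquery::GDS2eSet(gds813, do.log2 = TRUE)
```

Keeping the two steps separate means a large GDS or GPL file already stored in destdir can be handed straight to getGEO(filename = ...) in later sessions.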
Sean Davis wrote:

> Peter,
>
> I have recently uploaded a new package to bioconductor called GEOquery.

Very recently! I was scouring the net for something like this before Christmas, but to no avail.

I'm particularly interested in your function GDS2eSet for turning a GDS file into a standard Bioconductor exprSet object. I managed to cobble together something similar using a combination of BioPython's GEO parser and rpy (R from Python) as an experiment over Christmas...

I'm going to have a closer look at if/how you expose the subsample information in the exprSet phenoData.

> It is available as a development package
> (http://www.bioconductor.org/packages/bioc/1.8/html/GEOquery.html), but it
> doesn't depend on much, so should work with recent R and bioconductor
> releases. It is capable of downloading and parsing GDS, GSM, GPL, and GSE.
> (GSE download and parsing seems to be broken on windows, at least for some
> GSEs--working on that). After installing, you could do:
>
>>library(GEOquery)

I see there is a source bundle online (file GEOquery_1.5.1.tar.gz) but as yet no Windows binary. Is this something you could provide? Alternatively, how would I compile this myself? I do have MSVC 6.0.

In the meantime, I'll try and have a go at home on my Linux box...

Thanks Sean,

Peter