GEOquery: how to extract experimental data? (confused)
@jdelasherasedacuk-1189
Last seen 9.3 years ago
United Kingdom
I have been until now downloading GEO data directly to my computer and using basic R functions to load tables and process them. It works, but I figured I would probably save time if I learn to use the GEOquery package, which looks promising.

However, I'm failing tremendously at my first attempt. I can get a lot of good information, except the actual experiment data... and it seems to be there, but can't get to it!

Example. I'm trying to get GSE19044, which contains 42 samples and uses the Illumina WG6 platform, which is great as I'm familiar with it.

so I do:

    library(GEOquery)
    u = getGEO('GSE19044')
    show(u)

    > show(u)
    $GSE19044_series_matrix.txt.gz
    ExpressionSet (storageMode: lockedEnvironment)
    assayData: 45281 features, 42 samples
      element names: exprs
    protocolData: none
    phenoData
      sampleNames: GSM471318, GSM471319, ..., GSM471359 (42 total)
      varLabels and varMetadata description:
        title: NA
        geo_accession: NA
        ...: ...
        data_row_count: NA
        (39 total)
    featureData
      featureNames: ILMN_1212602, ILMN_1212603, ..., ILMN_3163582 (45281 total)
      fvarLabels and fvarMetadata description:
        ID: NA
        Species: NA
        ...: ...
        SPOT_ID: NA
        (31 total)
      additional fvarMetadata: Column, Description
    experimentData: use 'experimentData(object)'
    Annotation: GPL6887

It looks good. It looks like what I want is the 'assayData'. But I can't get to it.

'u' is a list, containing one element...

    > class(u)
    [1] "list"
    > length(u)
    [1] 1
    > class(u[[1]])
    [1] "ExpressionSet"
    attr(,"package")
    [1] "Biobase"

ok, so I rename that, and look at its structure:

    eset<-u[[1]]
    str(eset)

    > str(eset)
    Formal class 'ExpressionSet' [package "Biobase"] with 7 slots
      ..@ assayData :<environment: 0x0645ec5c>
      ..@ phenoData :Formal class 'AnnotatedDataFrame' [package "Biobase"]
      [...] (omitted for brevity)

I can extract the sample names, the basic annotation/probe identity etc easily:

    eset@phenoData@data     #samples
    eset@featureData@data   #annotation

but how do I get into 'assayData'?
from the 'show(u)' it looks like it contains what I am after: 45281 features, 42 samples ... but it's class 'environment' and that's throwing me off.

I was looking into the GEOquery user guide, but I'm still none the wiser.

How do I get in there?

thanks for any help.

Jose

--
Dr. Jose I. de las Heras
The Wellcome Trust Centre for Cell Biology
Institute for Cell & Molecular Biology
University of Edinburgh
PROcess GEOquery • 10k views
@sean-davis-490
Last seen 3 months ago
United States
On Tue, Aug 16, 2011 at 7:20 AM, j.delasheras at ed.ac.uk wrote:
> but how do I get into 'assayData'?
> from the 'show(u)' it looks like it contains what I am after: 45281
> features, 42 samples ... but it's class 'environment' and that's throwing me
> off.
>
> I was looking into the GEOquery user guide, but I'm still none the wiser.

Hi, Jose.

Sorry this was confusing for you. Your eset object above is an ExpressionSet and is one of the standard classes for storing gene expression data in Bioconductor; GEOquery uses this class where possible to store GEO data so as to facilitate downstream processing with other Bioconductor packages. Typically, you can get the expression data from an ExpressionSet by doing:

    assayDataElement(eset,'exprs')

or the simpler shorthand:

    exprs(eset)

Similarly, to get the sample variables, you can do:

    pData(eset)

To get more help on ExpressionSet, you can do help("ExpressionSet-class") and read the related Biobase vignette.

I hope that clears things up.

Sean
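For anyone who wants to try these accessors without downloading anything from GEO, here is a minimal sketch. It assumes the Biobase package is installed and uses its bundled example object sample.ExpressionSet (500 features, 26 samples) as a stand-in for the eset from GSE19044; the variable names are only illustrative.

```r
library(Biobase)

# Bundled example ExpressionSet, used here in place of the GEO download.
data(sample.ExpressionSet)
eset <- sample.ExpressionSet

# Expression values: a numeric matrix, features in rows, samples in columns.
expr_mat <- exprs(eset)

# The longer form names the assayData element explicitly.
same_mat <- assayDataElement(eset, "exprs")

# Sample annotation: a data.frame with one row per sample.
samples <- pData(eset)

# Identifiers for rows and columns of the expression matrix.
probe_ids  <- featureNames(eset)
sample_ids <- sampleNames(eset)

print(dim(expr_mat))
print(head(samples))
```

The rows of pData(eset) line up with the columns of exprs(eset), so the two can be used together directly, for example to pick out the samples belonging to one group.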
Quoting Sean Davis (sdavis2 at mail.nih.gov) on Tue, 16 Aug 2011 07:36:41 -0400:
> Typically, you can get the
> expression data from an ExpressionSet by doing:
>
> assayDataElement(eset,'exprs')
>
> or the simpler shorthand:
>
> exprs(eset)
>
> Similarly, to get the sample variables, you can do:
>
> pData(eset)

just like that! ha!

thank you very much for that! I never used the ExpressionSet class before and I assumed that I could simply just access its contents directly by brute force, indicating the right slot/component... I'll check the info on ExpressionSet for its characteristics.

thank you!

Jose
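To illustrate the difference between the "brute force" slot access and the accessor methods, here is a small sketch. It assumes Biobase is installed and again uses the bundled sample.ExpressionSet rather than the GEO series. Both routes return the same sample table, but the assayData slot itself is only an environment, which is why going through exprs() is the easier path.

```r
library(Biobase)

data(sample.ExpressionSet)
eset <- sample.ExpressionSet

# Direct slot access, as in the original question.
by_slot <- eset@phenoData@data

# The accessor returns the same data.frame without relying on internals.
by_accessor <- pData(eset)
print(identical(by_slot, by_accessor))

# The assayData slot is an environment, not a matrix.
print(class(eset@assayData))

# Subsetting the ExpressionSet keeps expression values and sample
# annotation in step: 10 features and 5 samples in both.
sub <- eset[1:10, 1:5]
print(dim(exprs(sub)))
print(nrow(pData(sub)))
```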
@vincent-j-carey-jr-4
Last seen 10 weeks ago
United States
You have noted that getGEO returns a list, and the first element of the list you got is an ExpressionSet instance.

You are asking about access to the assayData. This concerns ExpressionSet mechanics, not GEOquery.

    > library(Biobase)

    Welcome to Bioconductor

      Vignettes contain introductory material. To view, type
      'browseVignettes()'. To cite Bioconductor, see
      'citation("Biobase")' and for packages 'citation("pkgname")'.

    > data(sample.ExpressionSet)
    > assayData(sample.ExpressionSet)
    <environment: 0x100b1eb40>
    > ls(.Last.value)
    [1] "exprs"    "se.exprs"
    > dim(assayData(sample.ExpressionSet)$exprs)
    [1] 500  26
    > dim(assayData(sample.ExpressionSet)$se.exprs)
    [1] 500  26
    > dim(exprs(sample.ExpressionSet))
    [1] 500  26

assayData is not intended for direct handling by end users. The exprs() method will retrieve the matrix of expression values.
Quoting Vincent Carey (stvjc at channing.harvard.edu) on Tue, 16 Aug 2011 07:38:19 -0400:
> assayData is not intended for direct handling by end users. the exprs()
> method will retrieve the matrix of expression values.

Hi Vincent,

thanks for that, I'll use the methods for the ExpressionSet class, but it was interesting to see how to look at the contents of the assayData the way you did. I'm used to typing 'ls()', without arguments, to check the contents of my current workspace, and didn't think of checking what arguments I could specify.

Jose
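The ls() trick works on any R environment, not just assayData. Below is a minimal base-R sketch of the idea; the environment e and the toy matrices are made up for illustration and only mimic how an ExpressionSet with storageMode "lockedEnvironment" holds its matrices.

```r
# A toy environment holding two matrices, similar in spirit to assayData.
e <- new.env()
assign("exprs",
       matrix(1:6, nrow = 3,
              dimnames = list(paste0("probe", 1:3), c("s1", "s2"))),
       envir = e)
assign("se.exprs", matrix(0.1, nrow = 3, ncol = 2), envir = e)

# ls() with an environment argument lists the objects stored in it.
print(ls(e))

# Contents can be pulled out with $ or get().
print(dim(e$exprs))
print(dim(get("se.exprs", envir = e)))

# Locking the environment and its bindings prevents modification.
lockEnvironment(e, bindings = TRUE)
res <- tryCatch({
  assign("exprs", NULL, envir = e)
  "modified"
}, error = function(err) "locked")
print(res)
```

Reading from a locked environment still works; only changes are refused, which fits the advice to treat assayData as read-only and use exprs() instead.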
@dai-hongying-4801
Last seen 10.2 years ago
Hi, Jose and Sean

I tried to run exactly the same code as you provided to extract the GSE19044 data, but it failed with a connection issue. Here is my code and system information:

    R> source("http://www.bioconductor.org/biocLite.R")   # It connected to the site successfully
    BioC_mirror = http://bioconductor.org
    Change using chooseBioCmirror().
    Warning messages:
    1: In safeSource() : Redefining 'biocinstall'
    2: In safeSource() : Redefining 'biocinstallPkgGroups'
    3: In safeSource() : Redefining 'biocinstallRepos'
    R> library(Biobase)    # no error message
    R> library(GEOquery)   # no error message
    R> u = getGEO('GSE19044')
    Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
      couldn't connect to host
    R> sessionInfo()
    R version 2.13.1 (2011-07-08)
    Platform: i386-pc-mingw32/i386 (32-bit)

    locale:
    [1] LC_COLLATE=English_United States.1252
    [2] LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252
    [4] LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] tools     stats     graphics  grDevices utils     datasets  methods
    [8] base

    other attached packages:
    [1] ff_2.2-3        bit_1.1-7       GEOquery_2.19.2 Biobase_2.12.2

    loaded via a namespace (and not attached):
    [1] RCurl_1.6-7.1 XML_3.4-2.2

Daisy
Hi Daisy,

Your R code worked fine for me. Not sure what the issue is. One work-around would be to

1. manually download the Series Matrix file (ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SeriesMatrix/GSE19044/GSE19044_series_matrix.txt.gz) into your working directory
2. slightly modify your R code:

    > library(GEOquery)
    > u = getGEO(filename="GSE19044_series_matrix.txt.gz")
    > head(exprs(u))

--Johannes

-----Original Message-----
From: Dai, Hongying [mailto:hdai@cmh.edu]
Sent: Wednesday, August 17, 2011 2:51 PM
To: 'bioconductor at r-project.org'
Subject: Re: [BioC] GEOquery: how to extract experimental data? (confused)
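The work-around of loading a manually downloaded series matrix can be wrapped up roughly as follows. This is only a sketch: read_series_matrix is a hypothetical helper, not a GEOquery function, and it assumes GEOquery and Biobase are installed. The getGPL = FALSE argument is assumed to be available (it exists in recent GEOquery releases, possibly not in older ones such as 2.19.2) and is meant to avoid the extra platform download, which would also need a working connection.

```r
# Hypothetical helper (not part of GEOquery): load a locally saved
# series matrix file and return its expression matrix.
read_series_matrix <- function(path) {
  if (!file.exists(path)) {
    stop("Series matrix file not found: ", path)
  }
  # getGEO(filename = ...) parses a local file instead of contacting GEO.
  # getGPL = FALSE is an assumption about the installed GEOquery version;
  # drop it if your version reports an unused argument.
  eset <- GEOquery::getGEO(filename = path, getGPL = FALSE)
  # Defensive: unwrap if a list of ExpressionSets comes back.
  if (is.list(eset)) eset <- eset[[1]]
  Biobase::exprs(eset)
}

local_file <- "GSE19044_series_matrix.txt.gz"
if (file.exists(local_file)) {
  expr_mat <- read_series_matrix(local_file)
  print(dim(expr_mat))
} else {
  message("Download ", local_file, " into ", getwd(), " first.")
}
```

Checking for the file before calling getGEO gives a clearer message than a parsing error if the download step was skipped.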
