Adding annotations to GSE datasets


Marcelo Pereira ▴ 70

@marcelo-pereira-6541

Last seen 9.0 years ago

Quick question: I am trying to import some GEO datasets and am having some issues with the annotations.

I can download the GSE dataset using:

    gset <- getGEO("GSE11024", GSEMatrix=TRUE, AnnotGPL=TRUE)

However, it returns an ExpressionSet with the following format:

               X1  X10  X100  X1000  ...
    GSM278765
    GSM278766
    GSM278767
    GSM278768
    GSM278769
    ...

This is pretty much what I need, but I still need to translate X1, X10, X100, X1000, etc. to the actual names of the genes.

Any suggestions?

Thanks,
Marcelo


ADD COMMENT • link updated 10.9 years ago by Sean Davis 21k • written 10.9 years ago by Marcelo Pereira ▴ 70


Sean Davis 21k

@sean-davis-490

Last seen 7 weeks ago

United States

Hi, Marcelo.

On Wed, May 7, 2014 at 8:01 PM, Marcelo Pereira wrote:

> However, it will return me a ExpressionSet with the following format:
>
>            X1  X10  X100  X1000  ...
> GSM278765
> GSM278766
> ...

This is not what GEOquery returns, so it seems you have done some manipulation (it looks like you transposed the expression matrix).

> This is pretty much what I need, but I still need to translate (X1, X10, X100, X1000, etc...) to the actual names of the genes.

    library(GEOquery)
    gset <- getGEO("GSE11024", GSEMatrix=TRUE, AnnotGPL=TRUE)[[1]]
    head(fData(gset))

The gene symbols are in the "Gene" column:

    genesymbols = fData(gset)$Gene

Sean
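For readers who want the symbols attached to the expression values directly, here is a minimal sketch. It does not download anything: `eset` is a small made-up stand-in for the object returned by `getGEO(...)[[1]]`, built with Biobase, and the names `expr`, `features`, `mat` and `symbols` are only illustrative. It assumes the annotation table is indexed by the same feature IDs as the expression matrix, which should hold for a valid ExpressionSet.

```r
library(Biobase)

# Toy stand-in for the ExpressionSet from getGEO(...)[[1]].
# Values and IDs are copied from the thread; the object itself is made up.
expr <- matrix(
  c(5.459950, 6.728919, 6.861095,
    5.548725, 6.329578, 7.005730),
  nrow = 3,
  dimnames = list(c("1", "10", "100"), c("GSM278765", "GSM278766"))
)
features <- data.frame(
  ID   = c("1", "10", "100"),
  Gene = c("A1BG", "NAT2", "ADA"),
  row.names = c("1", "10", "100"),
  stringsAsFactors = FALSE
)
eset <- ExpressionSet(assayData = expr,
                      featureData = AnnotatedDataFrame(features))

# Look up the symbol for each row by feature ID instead of relying on
# the two tables happening to be in the same order.
mat <- exprs(eset)
symbols <- as.character(fData(eset)[rownames(mat), "Gene"])
rownames(mat) <- symbols

# Rows can now be selected by gene symbol.
mat["ADA", ]
```

On real data some features may have an empty or repeated symbol, so it is probably safer to keep the original feature IDs around as well rather than overwrite them.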

ADD COMMENT • link 10.9 years ago Sean Davis 21k


Hi Sean,

Thanks for your answer! That is great already. I can see the gene names now:

    > library(GEOquery)
    > gset <- getGEO("GSE11024", GSEMatrix=TRUE, AnnotGPL=TRUE)
    > head(fData(gset[[1]]))$Gene
    [1] A1BG NAT2 ADA CDH2 AKT3 MED6
    17098 Levels: A1BG ABCB6 ABCC5 ABCC9 ABCF2 ABI1 ACOT8 ACTR2 ACTR3 ADA ADAM8 AKT3 ... ZNF254

But the data frame only contains these columns:

    > names(fData(gset[[1]]))
    [1] "ID" "Gene" "UniGene" "Description" "Ensembl* Chr" "Start (bp)"
    [7] "End (bp)" "Strand" "ORF" "SPOT_ID"

Where is the expression information for each gene?

Thanks,
Marcelo

ADD REPLY • link 10.9 years ago Marcelo Pereira ▴ 70


On Thu, May 8, 2014 at 6:58 AM, Marcelo Pereira wrote:

> Where is the expression information for each gene?

    exprs(gset[[1]])

Each element of gset (here gset[[1]]) is an ExpressionSet, so you should read a bit about ExpressionSets in the Biobase vignette as well as the help page.

Sean
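To make the relationship between the pieces concrete, the sketch below builds a tiny ExpressionSet by hand and shows how `exprs()`, `fData()` and `pData()` line up. The object is made up for illustration (the sample descriptions in particular are invented), and names such as `eset`, `keep` and `small` are assumptions of this example, not anything GEOquery creates.

```r
library(Biobase)

# Expression values copied from the thread; everything else is illustrative.
expr <- matrix(
  c(5.459950, 6.728919, 6.861095,
    5.548725, 6.329578, 7.005730),
  nrow = 3,
  dimnames = list(c("1", "10", "100"), c("GSM278765", "GSM278766"))
)
features <- data.frame(Gene = c("A1BG", "NAT2", "ADA"),
                       row.names = rownames(expr),
                       stringsAsFactors = FALSE)
samples <- data.frame(title = c("sample A", "sample B"),  # invented descriptions
                      row.names = colnames(expr),
                      stringsAsFactors = FALSE)
eset <- ExpressionSet(assayData = expr,
                      phenoData = AnnotatedDataFrame(samples),
                      featureData = AnnotatedDataFrame(features))

dim(exprs(eset))    # features x samples
nrow(fData(eset))   # one annotation row per row of exprs()
nrow(pData(eset))   # one description row per column of exprs()

# Subsetting the ExpressionSet itself keeps all three tables in sync.
keep <- fData(eset)$Gene %in% c("A1BG", "ADA")
small <- eset[keep, ]
featureNames(small)
```

Subsetting the whole object, rather than the expression matrix alone, avoids the annotation and the values drifting out of step.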

ADD REPLY • link 10.9 years ago Sean Davis 21k


Hello Sean,

Thanks for your replies.

I used to download all the CEL files, and then load, normalize and generate the ExpressionSet output. All manually, and it was working fine!

Then I found out about doing it automatically using the GEOquery library, and this is what has been taking up my hours lately.

The output of exprs(gset[[1]]) is the point where I got stuck after a few minutes of using GEOquery, because I have the expression values but not the gene names:

           GSM278765  GSM278766  GSM278767  ...
    1       5.459950   5.548725   5.477436  ...
    10      6.728919   6.329578   6.570104  ...
    100     6.861095   7.005730   7.235361  ...
    1000    9.660035   9.189507   9.740223  ...
    10000   5.644313   5.898675   5.475838  ...
    10001   7.838040   7.564335   8.397569  ...

After that, I tried to manipulate the output in order to translate 1, 10, 100, 1000 to the actual names of the genes. My last resort was to ask here on the forum.

It is looking good already. I only need an extra column with the names of the genes.

Thanks,
Marcelo

ADD REPLY • link 10.9 years ago Marcelo Pereira ▴ 70


Sean Davis 21k

@sean-davis-490

Last seen 7 weeks ago

United States

On Thu, May 8, 2014 at 8:21 AM, Marcelo Pereira wrote:

> That is all because I am interested in the expression values for some pairs of genes.
>
> If I had something like this:
>
>           GSM278765  GSM278766  GSM278767  ...
>     A1BG   5.459950   5.548725   5.477436  ...
>     NAT2   6.728919   6.329578   6.570104  ...
>     ADA    6.861095   7.005730   7.235361  ...
>     CDH2   9.660035   9.189507   9.740223  ...
>     ...    5.644313   5.898675   5.475838  ...
>     ...    7.838040   7.564335   8.397569  ...
>
> Then I could extract lines for the genes of interest (for example, 'A1BG' and 'ADA'), and then plot scatterplots, compute correlation coefficients, etc...
>
> The name of the genes for each line is the only detail that is not present in my dataset.
>
> What am I missing here?

Something like this might work:

    plot(exprs(gset[[1]])[fData(gset[[1]])$Gene=='A1BG',])

Sean
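Extending the one-liner above to the pair-of-genes use case might look like the sketch below. It uses base R only: `expr` and `genes` are stand-ins for `exprs(gset[[1]])` and `fData(gset[[1]])$Gene`, filled with the values quoted in the thread, and `row_for` is a hypothetical helper introduced just for this example.

```r
# Stand-ins for exprs(gset[[1]]) and fData(gset[[1]])$Gene.
expr <- matrix(
  c(5.459950, 6.728919, 6.861095, 9.660035,
    5.548725, 6.329578, 7.005730, 9.189507,
    5.477436, 6.570104, 7.235361, 9.740223),
  nrow = 4,
  dimnames = list(c("1", "10", "100", "1000"),
                  c("GSM278765", "GSM278766", "GSM278767"))
)
genes <- c("A1BG", "NAT2", "ADA", "CDH2")

# which() ignores NA comparisons; [1] keeps a single row if a symbol repeats.
row_for <- function(symbol) which(genes == symbol)[1]

a1bg <- expr[row_for("A1BG"), ]
ada  <- expr[row_for("ADA"), ]

# Pearson correlation across samples, then a scatterplot of the pair.
r <- cor(a1bg, ada)
plot(a1bg, ada, xlab = "A1BG", ylab = "ADA",
     main = sprintf("r = %.2f", r))
```

Taking only the first matching row is a simplification; if a symbol maps to several features on the real platform, averaging them or picking one deliberately may be more appropriate.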

ADD COMMENT • link 10.9 years ago Sean Davis 21k


Thanks Sean,

That is exactly what I was looking for!

Cheers,
Marcelo

ADD REPLY • link 10.9 years ago Marcelo Pereira ▴ 70


One last question:

          GSM278765  GSM278766  GSM278767  ...
    A1BG   5.459950   5.548725   5.477436  ...
    NAT2   6.728919   6.329578   6.570104  ...
    ADA    6.861095   7.005730   7.235361  ...
    CDH2   9.660035   9.189507   9.740223  ...
    ...    5.644313   5.898675   5.475838  ...
    ...    7.838040   7.564335   8.397569  ...

Each CEL file has a description telling which kind of tissue the sample is related to.

Is there a direct way of translating the column names from (GSM278765, GSM278766, ...) to the description of the tissue (CC_KIDNEY_1, CC_KIDNEY_2, CC_KIDNEY_3, ...)?

          CC_KIDNEY_1  CC_KIDNEY_2  CC_KIDNEY_3  ...
    A1BG     5.459950     5.548725     5.477436  ...
    NAT2     6.728919     6.329578     6.570104  ...
    ADA      6.861095     7.005730     7.235361  ...
    CDH2     9.660035     9.189507     9.740223  ...
    ...      5.644313     5.898675     5.475838  ...
    ...      7.838040     7.564335     8.397569  ...

Thanks,
Marcelo

ADD REPLY • link 10.9 years ago Marcelo Pereira ▴ 70


On Thu, May 8, 2014 at 11:22 AM, Marcelo Pereira wrote:

> Each CEL file has a description, telling which kind of tissue that sample is related to.
>
> Is there a direct way of translating the column names from (GSM278765, GSM278766, ...) to the description of the tissue (CC_KIDNEY_1, CC_KIDNEY_2, CC_KIDNEY_3, ...) ?

You'll need to do a little work using sub(), but this information is typically in one of the columns of:

    pData(gset[[1]])

This blog post by Rafa Irizarry might be helpful to understand how an ExpressionSet works:

http://simplystatistics.org/2014/02/03/the-three-tables-for-genomics-collaborations/

Sean
>>> >>> >> > >>> >>> >> > Thanks, >>> >>> >> > Marcelo >>> >>> >> > >>> >>> >> > [[alternative HTML version deleted]] >>> >>> >> > >>> >>> >> > _______________________________________________ >>> >>> >> > Bioconductor mailing list >>> >>> >> > Bioconductor at r-project.org >>> >>> >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>> >>> >> > Search the archives: >>> >>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> >>> >>> > >>> >>> > [[alternative HTML version deleted]] >>> >>> > >>> >>> > _______________________________________________ >>> >>> > Bioconductor mailing list >>> >>> > Bioconductor at r-project.org >>> >>> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>> >>> > Search the archives: >>> >>> > http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >>> >> >>> > >> >> >
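The sub() suggestion above can be sketched in base R without downloading anything. This is only an illustration under stated assumptions: `expr` and `pheno` are small hand-made stand-ins for `exprs(gset[[1]])` and `pData(gset[[1]])`, the numbers are the ones quoted in the thread, and the `"sample: "` prefix on the titles is hypothetical, added only so that sub() has something to strip. Which pData column actually holds the tissue label, and what needs stripping from it, has to be checked on the real object.

```r
# Stand-in for exprs(gset[[1]]): probes in rows, GSM samples in columns.
expr <- matrix(
  c(5.459950, 5.548725, 5.477436,
    6.728919, 6.329578, 6.570104,
    6.861095, 7.005730, 7.235361),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "10", "100"),
                  c("GSM278765", "GSM278766", "GSM278767"))
)

# Stand-in for pData(gset[[1]]): one row per sample, row names are GSM ids.
# The "sample: " prefix is a hypothetical decoration, not taken from GEO.
pheno <- data.frame(
  title = c("sample: CC_KIDNEY_1", "sample: CC_KIDNEY_2", "sample: CC_KIDNEY_3"),
  row.names = c("GSM278765", "GSM278766", "GSM278767"),
  stringsAsFactors = FALSE
)

# Look the titles up by GSM id (not by position), then clean them with sub().
labels <- sub("^sample: ", "", pheno[colnames(expr), "title"])

# Column names should stay unique, otherwise indexing by name gets ambiguous.
stopifnot(!anyDuplicated(labels))
colnames(expr) <- labels

print(expr)
```

Looking the titles up by `colnames(expr)` rather than relying on row order is a deliberate choice here: it keeps the labels attached to the right samples even if the two tables were ever sorted differently.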

ADD REPLY • link 10.9 years ago Sean Davis 21k

To OP:

How about something like this?

library(GEOquery)
renal <- getGEO('GSE11024')[[1]]

## this is usually where you find the sample names
sampleNames(renal) <- renal$title

## 599 unannotated probesets, plus 1 dupe each for SKIP and PRG2
featureNames(renal) <- make.unique(fData(renal)$Gene, sep='.extra')

## more conveniently labeled now?
set.seed(1234)
exprs(renal)[ sample(1:nrow(renal), 5), sample(1:ncol(renal), 5) ]
##          P2_KIDNEY_12 CC_KIDNEY_1 NO_KIDNEY_2 WM_KIDNEY_79 P1_KIDNEY_4
## NHEDC2           5.84        6.11        6.07         6.21        6.04
## KIAA1370         8.51        8.98        8.29         8.97        9.04
## PRKCA            7.60        6.89        7.53         7.04        7.54
## PROP1            7.52        6.69        7.26         6.60        7.29
## HIST1H4L         5.25        5.15        5.17         5.14        5.13

There are times when you might want to fiddle around with the returned data
from GEOquery, but IMHO this isn't one of them. Re-read Sean's emails until
you agree ;-)

--t

ps. for completeness, sessionInfo()

R version 3.1.0 (2014-04-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] splines   grDevices datasets  parallel  stats     graphics  utils
[8] methods   base

other attached packages:
 [1] mnormt_1.4-7          MASS_7.3-33           dma_1.2-0
 [4] survival_2.37-7       GEOquery_2.31.0       Biobase_2.25.0
 [7] BiocInstaller_1.15.3  rtracklayer_1.25.5    GenomicRanges_1.17.12
[10] GenomeInfoDb_1.1.3    IRanges_1.99.13       S4Vectors_0.0.6
[13] BiocGenerics_0.11.2   bigrquery_0.1         gtools_3.4.0
[16] dplyr_0.1.3

loaded via a namespace (and not attached):
 [1] assertthat_0.1          BatchJobs_1.2           BBmisc_1.6
 [4] BiocParallel_0.7.0      Biostrings_2.33.6       bitops_1.0-6
 [7] brew_1.0-6              BSgenome_1.33.2         codetools_0.2-8
[10] DBI_0.2-7               devtools_1.5            digest_0.6.4
[13] evaluate_0.5.5          fail_1.2                foreach_1.4.2
[16] formatR_0.10            GenomicAlignments_1.1.9 httr_0.3
[19] iterators_1.0.7         jsonlite_0.9.7          knitr_1.5
[22] memoise_0.2.1           plyr_1.8.1              Rcpp_0.11.1
[25] RCurl_1.95-4.1          Rsamtools_1.17.10       RSQLite_0.11.4
[28] sendmailR_1.1-2         stats4_3.1.0            stringr_0.6.2
[31] tcltk_3.1.0             tools_3.1.0             whisker_0.3-2
[34] XML_3.98-1.1            XVector_0.5.6           zlibbioc_1.11.1

Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
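For readers who want to see what the make.unique() step does before running it on the real data, here is a small base-R sketch. It rests on assumptions: `expr` and `feat` are made-up stand-ins for `exprs(renal)` and `fData(renal)`, the probe ids `p1` to `p4` and all numeric values are invented, and the duplicated PRG2 row only mimics the duplicate mentioned in the post. The `as.character()` call is an addition for the case where the `Gene` column comes back as a factor (the output quoted earlier in the thread shows "Levels"), since make.unique() expects a character vector.

```r
# Invented expression values: 4 probes x 3 samples.
expr <- matrix(
  as.numeric(1:12), nrow = 4,
  dimnames = list(c("p1", "p2", "p3", "p4"),
                  c("P2_KIDNEY_12", "CC_KIDNEY_1", "NO_KIDNEY_2"))
)

# Stand-in for fData(): one row per probe, with a duplicated symbol.
feat <- data.frame(
  ID   = c("p1", "p2", "p3", "p4"),
  Gene = factor(c("A1BG", "PRG2", "PRG2", "ADA"))
)

# The annotation rows must line up with the expression rows before relabeling.
stopifnot(identical(as.character(feat$ID), rownames(expr)))

# Coerce to character first, then make repeated symbols unique.
symbols <- make.unique(as.character(feat$Gene), sep = ".extra")
rownames(expr) <- symbols

# Rows can now be pulled out by gene symbol.
print(expr[c("A1BG", "ADA"), ])
```

With this toy input the second PRG2 row is renamed `PRG2.extra1`, so no two rows share a name and name-based indexing stays unambiguous.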

ADD REPLY • link 10.9 years ago Tim Triche ★ 4.2k

Sean Davis 21k

Hi, Marcelo.

Please keep things on the list so everyone learns from your questions.

http://www.bioconductor.org/packages/release/bioc/html/Biobase.html

Sean

On Thu, May 8, 2014 at 11:23 AM, Marcelo Pereira <marcelops at gmail.com> wrote:
> Also, where can I find the documentation for the ExpressionSet object from
> the BioConductor library?
>
> Thanks again,
> Marcelo
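While reading the Biobase documentation linked above, it may help to keep in mind how the pieces discussed in this thread relate: the expression matrix has one row per feature and one column per sample, and the feature annotation has one row per feature in the same order. The sketch below imitates that layout in base R to show the gene-pair comparison Marcelo asked about. It is an assumption-laden illustration, not GEOquery output: `expr` and `feat` are hand-built stand-ins for `exprs(gset[[1]])` and `fData(gset[[1]])`, filled with the first rows quoted in the thread.

```r
# Stand-in for exprs(gset[[1]]): values as quoted in the thread.
expr <- matrix(
  c(5.459950, 5.548725, 5.477436,
    6.728919, 6.329578, 6.570104,
    6.861095, 7.005730, 7.235361,
    9.660035, 9.189507, 9.740223),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("1", "10", "100", "1000"),
                  c("GSM278765", "GSM278766", "GSM278767"))
)

# Stand-in for fData(gset[[1]]): same row order as the expression matrix.
feat <- data.frame(
  ID   = c("1", "10", "100", "1000"),
  Gene = c("A1BG", "NAT2", "ADA", "CDH2"),
  stringsAsFactors = FALSE
)
stopifnot(identical(feat$ID, rownames(expr)))

# Select each gene's row through the annotation table, as in Sean's plot() call.
a1bg <- expr[feat$Gene == "A1BG", ]
ada  <- expr[feat$Gene == "ADA", ]

# Pearson correlation across samples, plus a scatterplot of the pair.
r <- cor(a1bg, ada)
plot(a1bg, ada, xlab = "A1BG expression", ylab = "ADA expression")
print(r)
```

Three samples are far too few for a meaningful correlation; the point is only the indexing pattern, which carries over to the full matrix unchanged.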

ADD COMMENT • link 10.9 years ago Sean Davis 21k
