On Fri, Nov 22, 2013 at 4:53 PM, Rohan [guest]
<guest@bioconductor.org> wrote:
>
> I would like to get the data for all the genes in the form of Gene
> Symbols/Gene IDs mapped to GPL/GSE/GSM/GDS metadata.
> I have used the GEOmetadb package to get this metadata; however, I am
> not able to find a way to extract all this metadata mapped to genes.
>
> Is there any way the GEOquery Bioconductor package can be used for this?
>

Good question. It has a long-winded answer.

The GEO platform (GPL) is the only GEO entity that stores any information
about gene identity. Other entities (GSM, GSE, GDS) are linked to GPL rows
by an ID column. So, to get information about the genes represented by an
experiment, we need to look at GPL records. GPL records come in two
flavors: the submitter-supplied flavor and the so-called "Annotation" GPL
that has been curated by NCBI GEO. You'll need to focus on the Annotation
GPLs, since those are the ones that all have a standard "Gene ID" column.
The "Annotation" GPLs are only generated for data sets that have been
curated by NCBI GEO, namely the GDS records. So, we need to get the
distinct GPL records associated with GDS, and these will be the entire set
of "Annotation" GPLs. Using GEOmetadb (assuming you have already made a
connection, etc.):

annotgpl = dbGetQuery(con,"select distinct GPL from gds")
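
For completeness, the connection step assumed above could look roughly like the sketch below. It assumes the GEOmetadb, DBI and RSQLite packages are installed; open_geometadb is a hypothetical helper name of my own, not a function from any of those packages.

```r
# Sketch only: open a connection to a local copy of the GEOmetadb SQLite file.
# Assumes the GEOmetadb, DBI and RSQLite packages are installed.
# `open_geometadb` is a hypothetical helper name, not part of any package.
open_geometadb <- function() {
  sqlfile <- "GEOmetadb.sqlite"
  # The file is large, so download it only if no local copy exists yet.
  if (!file.exists(sqlfile)) {
    GEOmetadb::getSQLiteFile()
  }
  DBI::dbConnect(RSQLite::SQLite(), sqlfile)
}

# Usage (needs network access the first time it is run):
# con <- open_geometadb()
# annotgpl <- DBI::dbGetQuery(con, "select distinct GPL from gds")
# DBI::dbDisconnect(con)   # when you are finished with the database
```

The download happens in the current working directory, so it is worth running this from wherever you want the SQLite file to live.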

Now, annotgpl contains the accession numbers (GPL IDs) for all the
Annotation GPLs. You can use these GPL IDs to relate each GPL to GSM, GDS,
and GSE records.
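
One way to do that lookup is sketched below. It assumes that con is an open connection to the GEOmetadb SQLite file and that the gds, gse_gpl and gsm tables each have a gpl column (check the schema of your copy of the database). records_for_gpl is a hypothetical helper name.

```r
# Sketch only: list the GDS, GSE and GSM accessions attached to one platform.
# Assumes `con` is an open DBI connection to the GEOmetadb SQLite file and
# that the tables gds, gse_gpl and gsm each carry a `gpl` column.
# `records_for_gpl` is a hypothetical helper name.
records_for_gpl <- function(con, gpl_id) {
  # Run one parameterised query and return its first column as a vector.
  one <- function(sql) {
    DBI::dbGetQuery(con, sql, params = list(gpl_id))[[1]]
  }
  list(
    gds = one("select gds from gds where gpl = ?"),
    gse = one("select gse from gse_gpl where gpl = ?"),
    gsm = one("select gsm from gsm where gpl = ?")
  )
}

# Usage:
# recs <- records_for_gpl(con, annotgpl[1, 1])
# str(recs)
```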

How do you get the information about what genes are on each GPL, though?
You'll need to use GEOquery for that.

gpl = getGEO(annotgpl[1,1],AnnotGPL=TRUE)

gpl is now a GPL object and we can use the Table method to get a data frame
and grab the Gene ID (which is an Entrez Gene ID):

geneids = Table(gpl)[,'Gene ID']

Now, you have the Entrez Gene IDs for all features on the platform, and you
can associate those with all the GSM, GDS, and GSE records attached to the
GPL. If you loop over all the GPLs in the annotgpl data frame, you'll have
the information you want, I think.
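
A rough sketch of that loop follows. Both function names are hypothetical helpers of my own. I am also assuming that a feature mapping to several genes lists them in one "Gene ID" cell separated by "///"; check that against a real Annotation GPL before relying on it.

```r
# Sketch only: build a long table of (gpl, gene_id) pairs.
# Both function names below are hypothetical helpers, not GEOquery functions.

# Pull the distinct Entrez Gene IDs out of one platform table.
gene_ids_from_table <- function(tab, gpl_id) {
  ids <- as.character(tab[["Gene ID"]])
  # Assumption: multiple genes per feature are separated by "///".
  ids <- unlist(strsplit(ids, "///", fixed = TRUE))
  ids <- unique(trimws(ids))
  ids <- ids[!is.na(ids) & nzchar(ids)]
  data.frame(gpl = rep(gpl_id, length(ids)), gene_id = ids,
             stringsAsFactors = FALSE)
}

# Download each Annotation GPL in turn and stack the results.
genes_for_gpls <- function(gpl_ids) {
  per_gpl <- lapply(gpl_ids, function(id) {
    # Skip platforms that fail to download or parse instead of stopping.
    gpl <- tryCatch(GEOquery::getGEO(id, AnnotGPL = TRUE),
                    error = function(e) NULL)
    if (is.null(gpl)) return(NULL)
    gene_ids_from_table(GEOquery::Table(gpl), id)
  })
  do.call(rbind, Filter(Negate(is.null), per_gpl))
}

# Usage (downloads one file per platform, so try a few GPLs first):
# gene_map <- genes_for_gpls(head(annotgpl[, 1], 3))
```

The resulting data frame can then be joined on its gpl column to whatever GSM, GDS, and GSE records you pull from GEOmetadb.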

Unfortunately, this is not a complete answer because it does not include
the submitter-supplied GPLs that do not have any Annotation GPL available
(since NCBI GEO does not curate everything). The submitter-supplied GPLs do
not have a standard vocabulary for what is included in the columns of the
GPL, so there is not an easy way to automate processing as above.

Hope that helps.
Sean