I've recently started trying to process data from the publicly available GEO datasets.
Unfortunately, my R skills are not quite up to par yet.
I've been working with one set in particular, GSE3933.
The code simply extracts the expression values for each sample and then matches them with the appropriate gene symbol/annotation.
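library(GEOquery)  # getGEO(), dataTable()
library(Biobase)   # annotation(), exprs()
# Download the series; this returns a list of ExpressionSets, one per platform
gse3933 <- getGEO("GSE3933", GSEMatrix = TRUE)
gpl <- annotation(gse3933[[3]])
platf <- getGEO(gpl, AnnotGPL = TRUE)
# Annotation table of the platform as a data frame
anot <- data.frame(attr(dataTable(platf), "table"))
# Expression matrix: rows are probes, columns are samples
Gmicro <- exprs(gse3933[[3]])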
My problem arises in two parts:
Part 1
I've accessed the user-submitted annotation, GPL3289, which has ten variables. However, none of them is a gene symbol. My guess is that the GB_LIST column can be converted to gene symbols or gene IDs, but I have not been able to figure out how to do this.
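A minimal sketch of how Part 1 could be approached, under some assumptions: GB_LIST entries appear to be comma-separated GenBank accessions, so each probe can first be expanded to one row per accession. The toy data frame below is made up for illustration (the accession "XX000001" is fake), and the column names ID and GB_LIST are assumed to match the GPL3289 table. The lookup itself is only shown as a comment: it assumes the Bioconductor package org.Hs.eg.db and its ACCNUM key type, and it is not run here.

```r
# Toy stand-in for the platform annotation table (illustrative values only).
anot_toy <- data.frame(
  ID = c("550", "551", "552"),
  GB_LIST = c("AI251574,AI792844,AI733932", "", "XX000001"),
  stringsAsFactors = FALSE
)

# Split each comma-separated GB_LIST entry into individual accessions.
# An empty entry yields zero accessions, so that probe drops out below.
acc_list <- strsplit(anot_toy$GB_LIST, ",", fixed = TRUE)

# Long table: one row per (probe ID, accession) pair.
probe2acc <- data.frame(
  ID = rep(anot_toy$ID, lengths(acc_list)),
  ACCNUM = trimws(unlist(acc_list)),
  stringsAsFactors = FALSE
)
print(probe2acc)

# Assumed follow-up (requires Bioconductor, not run here):
# library(org.Hs.eg.db)
# sym <- AnnotationDbi::mapIds(org.Hs.eg.db,
#                              keys = unique(probe2acc$ACCNUM),
#                              column = "SYMBOL", keytype = "ACCNUM",
#                              multiVals = "first")
```

Accessions that are not tied to any gene would be expected to come back as NA from such a lookup, so the result should be checked for missing values before it is merged back onto the probes.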
Part 2
I then want to match the gene symbols/probes with the appropriate expression values from the matrix 'Gmicro' in the code above. Is this simply a matter of matching the IDs from the matrix generated by exprs to those in the table generated by dataTable?
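The matching described in Part 2 can be sketched with base R's match(). This is only an illustration on made-up data: the objects Gmicro_toy and anot_toy and the sample names are hypothetical stand-ins, and it assumes the row names of the expression matrix are the same probe IDs that appear in the ID column of the annotation table.

```r
# Toy expression matrix: rows are probe IDs, columns are samples (made-up values).
Gmicro_toy <- matrix(
  c(1.2, 0.4, -0.3, 0.8, 2.1, -1.0),
  nrow = 3,
  dimnames = list(c("552", "550", "551"), c("GSM_a", "GSM_b"))
)

# Toy annotation table holding the same probe IDs in a different order.
anot_toy <- data.frame(
  ID = c("550", "551", "552"),
  GB_LIST = c("acc1,acc2", "", "acc3"),
  stringsAsFactors = FALSE
)

# For each row of the matrix, find the position of its ID in the annotation.
idx <- match(rownames(Gmicro_toy), as.character(anot_toy$ID))

# Probes missing from the annotation would show up as NA here.
stopifnot(!anyNA(idx))

# Reorder the annotation to line up with the matrix, then bind the two.
annotated <- cbind(anot_toy[idx, , drop = FALSE], Gmicro_toy)
print(annotated)
```

Matching on the ID values rather than relying on row order keeps the result correct even if the two tables are sorted differently.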
Thanks for the answer! Very helpful. Just a couple of follow-up questions:
From the code you've provided, you've matched the GB_LIST to the known gene names. I'm still a bit confused about the GB_LIST. If these are known GenBank accession numbers, shouldn't they easily match? In other words, why do certain GB_LIST entries not 'exist' in the database?
My other question, as before, is simply:
Does the ID column in featureData correspond to the one found in assayData? So, for example, does 550 match 550? (It seems obvious, but I want to be sure.)
Thank you so much again for the answer and your time.
Just because a sequence is found in GenBank doesn't imply that there will be a gene. As I mentioned, these are ESTs, which IIRC can be roughly translated as 'A sequence some person found in a tissue or cell culture this one time so we put it in the database.'
Or more formally, from NCBI:
"The Nucleotide, Genome Survey Sequence (GSS), and Expressed Sequence Tag (EST) database all contain nucleic acid sequences. The data in GSS and EST are from two large bulk sequence divisions of GenBank. GSS and EST data are typically uncharacterized, short genomic (GSS) or cDNA (EST) sequences."
As an example, one of the first GenBank IDs is AI251574. If you do a Google search for it, this should come up as one of the top hits (it's first for me):
http://www.ncbi.nlm.nih.gov/nucest/AI251574,AI792844,AI733932
And if you follow the links, you will see that this is a clone that was pulled out of an ovarian cancer cell line by a CGAP researcher way back in 1998. It has a GenBank accession number and a GI number, but that's it. You can take the sequence and BLAT it at UCSC and find out that it doesn't really align anywhere.