If you really want a data.frame, then you can use the Homo.sapiens package:
> z <- select(Homo.sapiens, keys(Homo.sapiens, "ENTREZID"), c("CDSID","CDSNAME","CDSCHROM","CDSSTRAND","CDSSTART","CDSEND","SYMBOL"), "ENTREZID")
'select()' returned 1:many mapping between keys and columns
> head(z)
ENTREZID SYMBOL CDSID CDSNAME CDSCHROM CDSSTRAND CDSSTART CDSEND
1 1 A1BG 206062 <NA> chr19 - 58864770 58864803
2 1 A1BG 206061 <NA> chr19 - 58864658 58864693
3 1 A1BG 206060 <NA> chr19 - 58864294 58864563
4 1 A1BG 206059 <NA> chr19 - 58863649 58863921
5 1 A1BG 206058 <NA> chr19 - 58862757 58863053
6 1 A1BG 206057 <NA> chr19 - 58861736 58862017
Alternatively, you could use a GRanges object, which is in many respects easier to deal with:
> zz <- cdsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
> zz <- unlist(zz)
> zz
GRanges object with 237874 ranges and 2 metadata columns:
seqnames ranges strand | cds_id cds_name
<Rle> <IRanges> <Rle> | <integer> <character>
1 chr19 [58858388, 58858395] - | 206055 <NA>
1 chr19 [58858719, 58859006] - | 206056 <NA>
1 chr19 [58861736, 58862017] - | 206057 <NA>
1 chr19 [58862757, 58863053] - | 206058 <NA>
1 chr19 [58863649, 58863921] - | 206059 <NA>
... ... ... ... ... ... ...
9994 chr6 [90575672, 90578801] + | 76227 <NA>
9994 chr6 [90581008, 90581107] + | 76228 <NA>
9994 chr6 [90581008, 90581109] + | 76229 <NA>
9994 chr6 [90583522, 90583528] + | 76230 <NA>
9997 chr22 [50962040, 50962840] - | 218795 <NA>
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
## add in symbols
> mcols(zz)$symbol <- mapIds(org.Hs.eg.db, names(zz), "SYMBOL", "ENTREZID", multiVals="CharacterList")
> zz
GRanges object with 237874 ranges and 3 metadata columns:
seqnames ranges strand | cds_id cds_name
<Rle> <IRanges> <Rle> | <integer> <character>
1 chr19 [58858388, 58858395] - | 206055 <NA>
1 chr19 [58858719, 58859006] - | 206056 <NA>
1 chr19 [58861736, 58862017] - | 206057 <NA>
1 chr19 [58862757, 58863053] - | 206058 <NA>
1 chr19 [58863649, 58863921] - | 206059 <NA>
... ... ... ... ... ... ...
9994 chr6 [90575672, 90578801] + | 76227 <NA>
9994 chr6 [90581008, 90581107] + | 76228 <NA>
9994 chr6 [90581008, 90581109] + | 76229 <NA>
9994 chr6 [90583522, 90583528] + | 76230 <NA>
9997 chr22 [50962040, 50962840] - | 218795 <NA>
symbol
<CharacterList>
1 A1BG
1 A1BG
1 A1BG
1 A1BG
1 A1BG
... ...
9994 CASP8AP2
9994 CASP8AP2
9994 CASP8AP2
9994 CASP8AP2
9997 SCO2
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
This is obviously not for hg18. You can put the TxDb.Hsapiens.UCSC.hg18.knownGene object into the Homo.sapiens object like this:
> TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg18.knownGene
Now getting the GODb Object directly
Now getting the OrgDb Object directly
Now loading the TxDb Object from local disc
> Homo.sapiens
OrganismDb Object:
# Includes GODb Object: GO.db
# With data about: Gene Ontology
# Includes OrgDb Object: org.Hs.eg.db
# Gene data about: Homo sapiens
# Taxonomy Id: 9606
# Includes TxDb Object: TxDb.Hsapiens.UCSC.hg18.knownGene
# Transcriptome data about: Homo sapiens
# Based on genome: hg18
# The OrgDb gene id ENTREZID is mapped to the TxDb gene id GENEID .
And then proceed as above. The (possible) downside of doing that has to do with the difference between a genome build and the other annotations. Genome builds are semi-static things that are released on a semi-regular basis (e.g., hg18 was released in 2006, which in 'annotation years' is literally eons ago).
All the other annotation databases are updated on a weekly basis, and are only made semi-static by the fact that we release them semi-yearly. So TxDb.Hsapiens.UCSC.hg18.knownGene is based on what we knew about the human genome a decade ago, whereas org.Hs.eg.db is based on data from a couple of months ago. I don't know if UCSC tries to harmonize the Gene IDs so they match up with the current annotations (that wouldn't make sense to me; if you want old data, you should get old data), so you run the risk of having old Gene IDs that have since been retired and won't match up with whatever the current Gene ID is.
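If you want a rough idea of how big that mismatch is, you can compare the two sets of IDs directly. This is just a sketch: unmatched_ids() is a hypothetical helper I made up for illustration (it is not in any package), and the IDs in the toy example are invented.

```r
# Hypothetical helper (not part of any package): Gene IDs present in the old
# gene model but absent from the current annotation.
unmatched_ids <- function(txdb_ids, orgdb_ids) {
  setdiff(as.character(txdb_ids), as.character(orgdb_ids))
}

# Toy illustration with made-up IDs, just to show the behaviour.
old_ids     <- c("1", "2", "999999991")
current_ids <- c("1", "2", "3")
stale <- unmatched_ids(old_ids, current_ids)
print(stale)

# With the real packages loaded, the same check would look like:
# stale <- unmatched_ids(keys(TxDb.Hsapiens.UCSC.hg18.knownGene, "GENEID"),
#                        keys(org.Hs.eg.db, "ENTREZID"))
# length(stale)
```

Anything that shows up in the result is a Gene ID that will come back as NA when you ask org.Hs.eg.db for a symbol.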
What version of things are you using? That is incorrect behavior for mapIds(). It should just return A1BG for Gene ID 1, as there are no other Symbols for that gene. You may need to update.

But do note that this is answerable by you. I gave you explicit code, and you have a particular question about one of the arguments. Rather than spending the time to ask the question and wait for me to answer it, why didn't you look at the help page for mapIds()? You do yourself a disservice by asking me instead of figuring it out yourself - the only way to learn R is by learning how to figure stuff out for yourself. So in that spirit, RTFM! ;-D
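As a starting point, and assuming the argument in question is multiVals (the only non-obvious one in the mapIds() call above), here is a minimal sketch comparing two of its documented settings for a single key. The requireNamespace() guard is only there so the snippet does nothing if org.Hs.eg.db isn't installed.

```r
# Guarded so the snippet is a no-op if org.Hs.eg.db is not installed.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  suppressPackageStartupMessages(library(org.Hs.eg.db))

  # Default behaviour: one value per key, returned as a named character vector.
  first_only <- mapIds(org.Hs.eg.db, keys = "1", column = "SYMBOL",
                       keytype = "ENTREZID", multiVals = "first")

  # Keep every match per key, returned as a CharacterList.
  all_matches <- mapIds(org.Hs.eg.db, keys = "1", column = "SYMBOL",
                        keytype = "ENTREZID", multiVals = "CharacterList")

  print(first_only)
  print(all_matches)
}
```

The full list of accepted values is on the help page, which you can open with ?mapIds.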