hi,
an alternative is to use the GSEABase sorcery. Let's say you downloaded the GMT file for the MSigDB C2 gene set collection using HGNC gene symbols from:
http://software.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/7.0/c2.all.v7.0.symbols.gmt
The Gene Matrix Transposed (GMT) file format can be imported into R using the getGmt()
function from the GSEABase package, as follows:
library(GSEABase)
c2sym <- getGmt("c2.all.v7.0.symbols.gmt", geneIdType=SymbolIdentifier())
GeneSetCollection
names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., GARGALOVIC_RESPONSE_TO_OXIDIZED_PHOSPHOLIPIDS_LIGHTGREEN_DN (5501 total)
unique identifiers: ACSS2, GCK, ..., RNU1-1 (20690 total)
types in collection:
geneIdType: SymbolIdentifier (1 total)
collectionType: NullCollection (1 total)
Note that i'm calling getGmt()
with the argument geneIdType=SymbolIdentifier()
because i know i downloaded HGNC gene symbols and this metadata is included in the returned object called c2sym
of class GeneSetCollection
. Now we can map the identifiers in this GeneSetCollection
object from HGNC gene symbols to Ensembl identifiers using the GSEABase function mapIdentifiers()
, as follows:
c2ensembl <- mapIdentifiers(c2sym, ENSEMBLIdentifier("org.Hs.eg.db"))
c2ensembl
GeneSetCollection
names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., GARGALOVIC_RESPONSE_TO_OXIDIZED_PHOSPHOLIPIDS_LIGHTGREEN_DN (5501 total)
unique identifiers: ENSG00000131069, ENSG00000106633, ..., ENSG00000206652 (21956 total)
types in collection:
geneIdType: ENSEMBLIdentifier (1 total)
collectionType: NullCollection (1 total)
where i'm using the function ENSEMBLIdentifier("org.Hs.eg.db")
as second argument to tell the mapIdentifiers()
function that I want the map to Ensembl identifiers and that the mapping information can be found in the organism-level gene-centric package org.Hs.eg.db.
finally, some Bioconductor software packages, such as GSVA, may take directly as input such a GeneSetCollection
object, but for other packages as limma, where the camera()
algorithm is implemented, you need those gene sets as a list
object. You can extract a list
object from a GeneSetCollection
object using the geneIds()
getter function:
c2ensembllist <- geneIds(c2ensembl)
length(c2ensembllist)
[1] 5501
head(lapply(c2ensembllist, head))
$KEGG_GLYCOLYSIS_GLUCONEOGENESIS
[1] "ENSG00000131069" "ENSG00000106633" "ENSG00000170950" "ENSG00000102144"
[5] "ENSG00000168291" "ENSG00000131828"
$KEGG_CITRATE_CYCLE_TCA_CYCLE
[1] "ENSG00000101365" "ENSG00000119689" "ENSG00000100889" "ENSG00000285241"
[5] "ENSG00000062485" "ENSG00000168291"
$KEGG_PENTOSE_PHOSPHATE_PATHWAY
[1] "ENSG00000197713" "ENSG00000153574" "ENSG00000169299" "ENSG00000101911"
[5] "ENSG00000130957" "ENSG00000152556"
$KEGG_PENTOSE_AND_GLUCURONATE_INTERCONVERSIONS
[1] "ENSG00000242515" "ENSG00000242366" "ENSG00000197713" "ENSG00000244122"
[5] "ENSG00000167165" "ENSG00000135226"
$KEGG_FRUCTOSE_AND_MANNOSE_METABOLISM
[1] "ENSG00000178802" "ENSG00000140650" "ENSG00000100417" "ENSG00000130957"
[5] "ENSG00000152556" "ENSG00000112699"
$KEGG_GALACTOSE_METABOLISM
[1] "ENSG00000106633" "ENSG00000108479" "ENSG00000170266" "ENSG00000117308"
[5] "ENSG00000086062" "ENSG00000169299"
note that this infrastructure already takes care of duplicated entries
any(sapply(lapply(c2ensembllist, duplicated), any))
[1] FALSE
and because you're converting from gene sets to gene sets, it's not so much of a concern whether your mapping is not 1-to-1, while you don't need to do anything with your molecular data.
cheers,
robert.
This seems inadvisable to me (and you as well, apparently), but, the broad says
It's not clear from this wiki what they are really doing, and getting the NCBI Gene ID from a non-NCBI source seems suboptimal, but what do I know?
But you could probably get the GENCODE release 31 data and get the mappings from there?