Problem converting msigdb Human symbols to Zebrafish in gmt format
1
0
Entering edit mode
Georg • 0
@a6df0d80
Last seen 6 months ago
Canada

Hello,

I am new to using R and bioconductor, so I apologize for any errors. What I am trying to do is use the msigdb database to retrieve the KEGG gmt data for a GSEA analysis and convert the human genes to zebrafish genes. I followed a previous posts on how to retrieve the data online and in their case they converted the human KEGG gene symbol identifiers to human ENSEMBL codes. I tried to replace the "org.Hs.eg.db" with the Zebrafish version "org.Dr.eg.db". However, now under unique identifiers, the total is 0. I assume it is due to attempting to jump straight from human gene symbols to zebrafish ENSEMBL codes. Does someone have a suggestion on how to convert the human genes to zebrafish genes/ENSEMBL codes in the gmt format that can be used by the GSEA software?

Thank you.

> setwd("C:/Users/georg/Desktop")
> library(org.Dr.eg.db)

> keggsym <- getGmt("c2.cp.kegg.v7.1.symbols.gmt",  geneIdType=SymbolIdentifier())
> keggsym
GeneSetCollection
  names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., KEGG_VIRAL_MYOCARDITIS (186 total)
  unique identifiers: ACSS2, GCK, ..., EIF4G1 (5242 total)
  types in collection:
    geneIdType: SymbolIdentifier (1 total)
    collectionType: NullCollection (1 total)

> keggens <- mapIdentifiers(keggsym, ENSEMBLIdentifier("org.Dr.eg.db"))
> keggens
GeneSetCollection
  names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., KEGG_VIRAL_MYOCARDITIS (186 total)
  unique identifiers:  (0 total)
  types in collection:
    geneIdType: ENSEMBLIdentifier (1 total)
    collectionType: NullCollection (1 total)
msigdb gmt GSEA • 644 views
ADD COMMENT
4
Entering edit mode
@james-w-macdonald-5106
Last seen 51 minutes ago
United States

The main goal here is to map human gene symbols to something else, which you can use the Orthology.eg.db package to do. You could also use biomaRt, but I'll stick with the former.

> library(GSEABase)
> library(org.Hs.eg.db)
> library(org.Dr.eg.db)
> library(Orthology.eg.db)

## first read in the gmt in a useful format
## this is cribbed from getGmt
> z <- strsplit(readLines("c2.cp.kegg.v7.1.symbols.gmt"), "\t")
## make it a bit easier
> z <- lapply(z, function(x) list(x[1], x[2], x[-(1:2)]))
## Now just get all the symbols
> symb <- unique(do.call(c, sapply(z, "[[", 3)))
## we need NCBI Ids
human <- mapIds(org.Hs.eg.db, symb, "ENTREZID", "ALIAS")
## now map to Danio
> mapped <- select(Orthology.eg.db, human, "Danio.rerio","Homo.sapiens")
## that will take a while
## now add Danio Ensembl and SYMBOL
## there are lots of non-matches here
> mapped$ENSEMBL <- do.call(c, lapply(mapIds(org.Dr.eg.db, as.character(mapped$Danio.rerio), "ENSEMBL","ENTREZID"), 
                                      function(x) if(is.null(x)) return(NA) else return(x)))
> mapped$SYMBOL <- do.call(c, lapply(mapIds(org.Dr.eg.db, as.character(mapped$Danio.rerio), "SYMBOL","ENTREZID"), 
                                      function(x) if(is.null(x)) return(NA) else return(x))
> head(mapped)
      Homo.sapiens Danio.rerio            ENSEMBL SYMBOL
ACSS2        55902          NA               <NA>   <NA>
GCK           2645      751668 ENSDARG00000068006    gck
PGK2          5232          NA               <NA>   <NA>
PGK1          5230      406696 ENSDARG00000054191   pgk1
PDHB          5162      406428 ENSDARG00000021346   pdhb
PDHA1         5160          NA               <NA>   <NA>
## now make mapping vectors
> mapper1 <- setNames(mapped$ENSEMBL, row.names(mapped))
> mapper2 <- setNames(mapped$SYMBOL, row.names(mapped))
## now use these to convert the gene IDs
## I do both symbol and Ensembl, you choose
> zz <- lapply(z, function(x) {tmp <- mapper1[x[[3]]]; x[[3]] <- tmp[!is.na(tmp)]; return(x)})
> zzz <- lapply(z, function(x) {tmp <- mapper2[x[[3]]]; x[[3]] <- tmp[!is.na(tmp)]; return(x)})
## again, cribbing from getGmt
> template <- GeneSet(geneIdType = NullIdentifier(), collectionType = NullCollection())
> zze <- GeneSetCollection(lapply(zz, function(x) initialize(template, setName = x[[1]], shortDescription = x[[2]], geneIds = unique(x[[3]]))))
## or
> zzg <- GeneSetCollection(lapply(zzz, function(x) initialize(template, setName = x[[1]], shortDescription = x[[2]], geneIds = unique(x[[3]]))))
> zze
GeneSetCollection
  names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., KEGG_VIRAL_MYOCARDITIS (186 total)
  unique identifiers: ENSDARG00000068006, ENSDARG00000054191, ..., ENSDARG00000006200 (2710 total)
  types in collection:
    geneIdType: NullIdentifier (1 total)
    collectionType: NullCollection (1 total)
0
Entering edit mode

Thank you for the detailed response. The code worked right away. I appreciate the quick response.

ADD REPLY

Login before adding your answer.

Traffic: 813 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6