Question

Problem converting msigdb Human symbols to Zebrafish in gmt format

0

Entering edit mode

Georg • 0

@a6df0d80

Last seen 11 months ago

Canada

Hello,

I am new to using R and bioconductor, so I apologize for any errors. What I am trying to do is use the msigdb database to retrieve the KEGG gmt data for a GSEA analysis and convert the human genes to zebrafish genes. I followed a previous posts on how to retrieve the data online and in their case they converted the human KEGG gene symbol identifiers to human ENSEMBL codes. I tried to replace the "org.Hs.eg.db" with the Zebrafish version "org.Dr.eg.db". However, now under unique identifiers, the total is 0. I assume it is due to attempting to jump straight from human gene symbols to zebrafish ENSEMBL codes. Does someone have a suggestion on how to convert the human genes to zebrafish genes/ENSEMBL codes in the gmt format that can be used by the GSEA software?

Thank you.

> setwd("C:/Users/georg/Desktop")
> library(org.Dr.eg.db)

> keggsym <- getGmt("c2.cp.kegg.v7.1.symbols.gmt",  geneIdType=SymbolIdentifier())
> keggsym
GeneSetCollection
  names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., KEGG_VIRAL_MYOCARDITIS (186 total)
  unique identifiers: ACSS2, GCK, ..., EIF4G1 (5242 total)
  types in collection:
    geneIdType: SymbolIdentifier (1 total)
    collectionType: NullCollection (1 total)

> keggens <- mapIdentifiers(keggsym, ENSEMBLIdentifier("org.Dr.eg.db"))
> keggens
GeneSetCollection
  names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., KEGG_VIRAL_MYOCARDITIS (186 total)
  unique identifiers:  (0 total)
  types in collection:
    geneIdType: ENSEMBLIdentifier (1 total)
    collectionType: NullCollection (1 total)

msigdb gmt GSEA • 919 views

ADD COMMENT • link written 13 months ago by Georg • 0

score 4 · Accepted Answer · 2024-03-18

The main goal here is to map human gene symbols to something else, which you can use the Orthology.eg.db package to do. You could also use biomaRt, but I'll stick with the former.

> library(GSEABase)
> library(org.Hs.eg.db)
> library(org.Dr.eg.db)
> library(Orthology.eg.db)

## first read in the gmt in a useful format
## this is cribbed from getGmt
> z <- strsplit(readLines("c2.cp.kegg.v7.1.symbols.gmt"), "\t")
## make it a bit easier
> z <- lapply(z, function(x) list(x[1], x[2], x[-(1:2)]))
## Now just get all the symbols
> symb <- unique(do.call(c, sapply(z, "[[", 3)))
## we need NCBI Ids
human <- mapIds(org.Hs.eg.db, symb, "ENTREZID", "ALIAS")
## now map to Danio
> mapped <- select(Orthology.eg.db, human, "Danio.rerio","Homo.sapiens")
## that will take a while
## now add Danio Ensembl and SYMBOL
## there are lots of non-matches here
> mapped$ENSEMBL <- do.call(c, lapply(mapIds(org.Dr.eg.db, as.character(mapped$Danio.rerio), "ENSEMBL","ENTREZID"), 
                                      function(x) if(is.null(x)) return(NA) else return(x)))
> mapped$SYMBOL <- do.call(c, lapply(mapIds(org.Dr.eg.db, as.character(mapped$Danio.rerio), "SYMBOL","ENTREZID"), 
                                      function(x) if(is.null(x)) return(NA) else return(x))
> head(mapped)
      Homo.sapiens Danio.rerio            ENSEMBL SYMBOL
ACSS2        55902          NA               <NA>   <NA>
GCK           2645      751668 ENSDARG00000068006    gck
PGK2          5232          NA               <NA>   <NA>
PGK1          5230      406696 ENSDARG00000054191   pgk1
PDHB          5162      406428 ENSDARG00000021346   pdhb
PDHA1         5160          NA               <NA>   <NA>
## now make mapping vectors
> mapper1 <- setNames(mapped$ENSEMBL, row.names(mapped))
> mapper2 <- setNames(mapped$SYMBOL, row.names(mapped))
## now use these to convert the gene IDs
## I do both symbol and Ensembl, you choose
> zz <- lapply(z, function(x) {tmp <- mapper1[x[[3]]]; x[[3]] <- tmp[!is.na(tmp)]; return(x)})
> zzz <- lapply(z, function(x) {tmp <- mapper2[x[[3]]]; x[[3]] <- tmp[!is.na(tmp)]; return(x)})
## again, cribbing from getGmt
> template <- GeneSet(geneIdType = NullIdentifier(), collectionType = NullCollection())
> zze <- GeneSetCollection(lapply(zz, function(x) initialize(template, setName = x[[1]], shortDescription = x[[2]], geneIds = unique(x[[3]]))))
## or
> zzg <- GeneSetCollection(lapply(zzz, function(x) initialize(template, setName = x[[1]], shortDescription = x[[2]], geneIds = unique(x[[3]]))))
> zze
GeneSetCollection
  names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., KEGG_VIRAL_MYOCARDITIS (186 total)
  unique identifiers: ENSDARG00000068006, ENSDARG00000054191, ..., ENSDARG00000006200 (2710 total)
  types in collection:
    geneIdType: NullIdentifier (1 total)
    collectionType: NullCollection (1 total)