Question

Incomplete gene IDs in KEGG pathway testing

0

Entering edit mode

willmacnair • 0

@willmacnair-7054

Last seen 9.0 years ago

Switzerland

I am trying to do pathway enrichment analysis for a set of genes from an Illumina chip, and I'm having problems with using hyperGTest (from the GOstats package).
My approach is:

define gene_list and gene_universe in the same format, in this case using the Illumina gene identifiers (e.g. ILMN_1775522) from the annotation package illuminaHumanv3.db;
construct KEGGHyperGParams object using these;
do hyperGTest(KEGGHyperGParams).

When I try this using the Illumina gene identifiers, I get the following error message:

    Error in getUniverseHelper(probes, p@datPkg, universeGeneIds(p)) : 
      After filtering, there are no valid IDs that can be used as the Gene universe.
      Check input values to confirm they are the same type as the central ID used by your annotation package.
      For chip packages, this will still mean the central GENE identifier used by the package (NOT the probe IDs).

This confused me, because as far as I can tell (not being a proper bioinformatician), the ILMN_1775522-style codes are gene IDs and not probe IDs, and are exactly the gene identifiers used in the annotation package. I got them via illuminaHumanv3ACCNUM, which gives a list with ILMN_etc as names of the objects in the list, and NM_etc style gene IDs as entries in the list. I also tried using the NM_etc entries as my gene list / universe, and got the same error message (not done in code below).
So question (1) is: Am I using KEGGHyperGParams / hyperGTest wrong? I find its documentation pretty light, and I couldn't get any further information on what 'central GENE identifier' should mean.

To get round this, I converted my gene list / universe to entrez geneIDs. hyperGTest worked fine with these gene IDs, but in the illuminaHumanv3.db package, only half (60%) of the genes are annotated with entrez IDs. I've manually checked a handful of the non-annotated ones (e.g. on ReMOAT), and the entrez IDs do exist. Next question is therefore:
(2) Am I doing this wrong or is this (heavily) incomplete annotation a known issue? If so are there standard ways to get round it? I haven't done much GO/similar work before, so perhaps I'm not using the most up-to-date package - if so please point me in the right direction.

Finally (3) It looks like the way I'm doing KEGG tests is not up-to-date - are there any ways round this using my current setup, or do I need to use a different enrichment testing tool as well as e.g. KEGGREST?

Thanks in advance for any help, and let me know if anything isn't clear.
Will

# Sample code:
library('illuminaHumanv3.db')
library('org.Hs.eg.db')
library('GOstats')
library('data.table')
library('reshape2')

pathway_test <- function(gene_list, gene_universe) {
    hgCutoff = 1
    pathway_test_params = new(
        'KEGGHyperGParams',
        geneIds         = gene_list,
        universeGeneIds = gene_universe,
        annotation         = 'illuminaHumanv3.db',
        pvalueCutoff     = hgCutoff,
        testDirection     = 'over'
        )
    hgOver             = hyperGTest(pathway_test_params)
    results_table     = data.table(summary(hgOver))
}

convert_to_entrez <- function(gene_list, entrez_lookup) {
    output = entrez_lookup[ILMN %in% gene_list, entrez]
    output = as.integer(unlist(output))
    return(output)
}

entrez_lookup                 = data.table(
    ILMN     = ls(illuminaHumanv3ACCNUM),
    refseq     = as.list(illuminaHumanv3ACCNUM), 
    entrez     = as.list(illuminaHumanv3ENTREZID)
    )
print(entrez_lookup)
#                ILMN                           refseq entrez
#     1: ILMN_1343048                               NA     NA
#     2: ILMN_1343049                               NA     NA
#     3: ILMN_1343050                               NA     NA
#     4: ILMN_1343052                               NA     NA
#     5: ILMN_1343059                               NA     NA
#    ---                                                     
# 49572: ILMN_2415911              NM_006375,NM_182314  10495
# 49573: ILMN_2415926                        NM_032361  84321
# 49574: ILMN_2415949 NM_001173512,NM_199177,NM_138777  92399
# 49575: ILMN_2415979                     NM_001080484  85452
# 49576: ILMN_2416019           NM_001033113,NM_198585 377841

genes_without_entrez     = c('ILMN_1343048', 'ILMN_1343049', 'ILMN_1343050', 'ILMN_1343052', 'ILMN_1343059')
genes_with_entrez         = c('ILMN_2415911', 'ILMN_2415926', 'ILMN_2415949', 'ILMN_2415979', 'ILMN_2416019')
gene_list                 = c(genes_without_entrez, genes_with_entrez)

gene_universe             = ls(illuminaHumanv3ACCNUM)

gene_list_entrez         = convert_to_entrez(gene_list, entrez_lookup)
gene_universe_entrez     = convert_to_entrez(gene_universe, entrez_lookup)

output_ILMN     = pathway_test(gene_list, gene_universe)
 # Error in getUniverseHelper(probes, p@datPkg, universeGeneIds(p)) : 
 #  After filtering, there are no valid IDs that can be used as the Gene universe.
 #  Check input values to confirm they are the same type as the central ID used by your annotation package.
 #  For chip packages, this will still mean the central GENE identifier used by the package (NOT the probe IDs).
output_entrez     = pathway_test(gene_list_entrez, gene_universe_entrez)
# KEGG.db contains mappings based on older data because the original resource was removed from the the public domain before the
#   most recent update was produced. This package should now be considered deprecated and future versions of Bioconductor may not
#   have it available.  Users who want more current data are encouraged to look at the KEGGREST or reactome.db packages
print(output_entrez)
#    KEGGID     Pvalue OddsRatio   ExpCount Count Size                  Term
# 1:  00240 0.03315071  59.44681 0.03342716     1   95 Pyrimidine metabolism
# 2:  03040 0.04350327  44.82258 0.04398311     1  125           Spliceosome
# 3:  03013 0.04900184  39.58571 0.04961295     1  141         RNA transport
# 4:  00230 0.05516880  34.96203 0.05594652     1  159     Purine metabolism

sessionInfo()
# R version 3.1.2 (2014-10-31)
# Platform: x86_64-w64-mingw32/x64 (64-bit)

# locale:
# [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252    LC_MONETARY=German_Switzerland.1252
# [4] LC_NUMERIC=C                        LC_TIME=German_Switzerland.1252    

# attached base packages:
# [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

# other attached packages:
#  [1] KEGG.db_3.0.0             reshape2_1.4              data.table_1.9.4          GOstats_2.32.0           
#  [5] graph_1.44.0              Category_2.32.0           GO.db_3.0.0               Matrix_1.1-4             
#  [9] illuminaHumanv3.db_1.24.0 org.Hs.eg.db_3.0.0        RSQLite_1.0.0             DBI_0.3.1                
# [13] AnnotationDbi_1.28.1      GenomeInfoDb_1.2.2        IRanges_2.0.0             S4Vectors_0.4.0          
# [17] Biobase_2.26.0            BiocGenerics_0.12.0      

# loaded via a namespace (and not attached):
#  [1] annotate_1.44.0       AnnotationForge_1.8.1 chron_2.3-45          genefilter_1.48.1     grid_3.1.2           
#  [6] GSEABase_1.28.0       lattice_0.20-29       plyr_1.8.1            RBGL_1.42.0           Rcpp_0.11.3          
# [11] splines_3.1.2         stringr_0.6.2         survival_2.37-7       tools_3.1.2           XML_3.98-1.1         
# [16] xtable_1.7-4

gostats kegg • 2.1k views

ADD COMMENT • link updated 10.4 years ago by James W. MacDonald 68k • written 10.4 years ago by willmacnair • 0

score 0 · Answer 1 · 2014-11-17

You are correct that the help page is a bit coy about what the IDs should be. The vast majority of annotations for Bioconductor are based on Entrez Gene IDs (now rebranded as Gene IDs). So the input has to be the Entrez Gene IDs, as you have found.

But the way you are getting the Gene IDs is not how you should be going about it. You are using (very) old ways of getting information, and it is not serving you well. The correct way to get anything from any of the annotation packages these days is via the select() function.

You show all your functions, but not how you got your set of significant genes, so I will give an example using randomly selected data.

> universeGeneIds <- select(illuminaHumanv3.db, keys(illuminaHumanv3.db), "ENTREZID")
Warning message:
In .generateExtraRows(tab, keys, jointype) :
  'select' resulted in 1:many mapping between keys and return rows

## You will almost surely see the same thing, as many probes map to multiple gene IDs. How you resolve this is up to you. I just do the simplest possible thing.

> universeGeneIds <- universeGeneIds[!duplicated(universeGeneIds[,2]),2]
> geneIds <- universeGeneIds[sample(1:length(universeGeneIds), 250)]

## You will not do this, but instead will run select() again, using the set of significant genes you are testing.

> p <- new("KEGGHyperGParams", geneIds=  geneIds, universeGeneIds = universeGeneIds, annotation = "illuminaHumanv3.db", testDirection = "over")
> hyp <- hyperGTest(p)
> head(summary(hyp))

KEGG.db contains mappings based on older data because the original
  resource was removed from the the public domain before the most
  recent update was produced. This package should now be considered
  deprecated and future versions of Bioconductor may not have it
  available.  Users who want more current data are encouraged to look
  at the KEGGREST or reactome.db packages

  KEGGID      Pvalue OddsRatio ExpCount Count Size      Term
1  04145 0.006785841  3.894275 1.703906     6  149 Phagosome