I am trying to do pathway enrichment analysis for a set of genes from an Illumina chip, and I'm having problems with using hyperGTest (from the GOstats package).
My approach is:
- define gene_list and gene_universe in the same format, in this case using the Illumina gene identifiers (e.g. ILMN_1775522) from the annotation package illuminaHumanv3.db;
- construct KEGGHyperGParams object using these;
- do hyperGTest(KEGGHyperGParams).
When I try this using the Illumina gene identifiers, I get the following error message:
Error in getUniverseHelper(probes, p@datPkg, universeGeneIds(p)) : After filtering, there are no valid IDs that can be used as the Gene universe. Check input values to confirm they are the same type as the central ID used by your annotation package. For chip packages, this will still mean the central GENE identifier used by the package (NOT the probe IDs).
This confused me, because as far as I can tell (not being a proper bioinformatician), the ILMN_1775522-style codes are gene IDs and not probe IDs, and are exactly the gene identifiers used in the annotation package. I got them via illuminaHumanv3ACCNUM, which gives a list with ILMN_etc as names of the objects in the list, and NM_etc style gene IDs as entries in the list. I also tried using the NM_etc entries as my gene list / universe, and got the same error message (not done in code below).
So question (1) is: Am I using KEGGHyperGParams / hyperGTest wrong? I find its documentation pretty light, and I couldn't get any further information on what 'central GENE identifier' should mean.
To get round this, I converted my gene list / universe to entrez geneIDs. hyperGTest worked fine with these gene IDs, but in the illuminaHumanv3.db package, only half (60%) of the genes are annotated with entrez IDs. I've manually checked a handful of the non-annotated ones (e.g. on ReMOAT), and the entrez IDs do exist. Next question is therefore:
(2) Am I doing this wrong or is this (heavily) incomplete annotation a known issue? If so are there standard ways to get round it? I haven't done much GO/similar work before, so perhaps I'm not using the most up-to-date package - if so please point me in the right direction.
Finally (3) It looks like the way I'm doing KEGG tests is not up-to-date - are there any ways round this using my current setup, or do I need to use a different enrichment testing tool as well as e.g. KEGGREST?
Thanks in advance for any help, and let me know if anything isn't clear.
Will
# Sample code:
library('illuminaHumanv3.db')
library('org.Hs.eg.db')
library('GOstats')
library('data.table')
library('reshape2')
pathway_test <- function(gene_list, gene_universe) {
hgCutoff = 1
pathway_test_params = new(
'KEGGHyperGParams',
geneIds = gene_list,
universeGeneIds = gene_universe,
annotation = 'illuminaHumanv3.db',
pvalueCutoff = hgCutoff,
testDirection = 'over'
)
hgOver = hyperGTest(pathway_test_params)
results_table = data.table(summary(hgOver))
}
convert_to_entrez <- function(gene_list, entrez_lookup) {
output = entrez_lookup[ILMN %in% gene_list, entrez]
output = as.integer(unlist(output))
return(output)
}
entrez_lookup = data.table(
ILMN = ls(illuminaHumanv3ACCNUM),
refseq = as.list(illuminaHumanv3ACCNUM),
entrez = as.list(illuminaHumanv3ENTREZID)
)
print(entrez_lookup)
# ILMN refseq entrez
# 1: ILMN_1343048 NA NA
# 2: ILMN_1343049 NA NA
# 3: ILMN_1343050 NA NA
# 4: ILMN_1343052 NA NA
# 5: ILMN_1343059 NA NA
# ---
# 49572: ILMN_2415911 NM_006375,NM_182314 10495
# 49573: ILMN_2415926 NM_032361 84321
# 49574: ILMN_2415949 NM_001173512,NM_199177,NM_138777 92399
# 49575: ILMN_2415979 NM_001080484 85452
# 49576: ILMN_2416019 NM_001033113,NM_198585 377841
genes_without_entrez = c('ILMN_1343048', 'ILMN_1343049', 'ILMN_1343050', 'ILMN_1343052', 'ILMN_1343059')
genes_with_entrez = c('ILMN_2415911', 'ILMN_2415926', 'ILMN_2415949', 'ILMN_2415979', 'ILMN_2416019')
gene_list = c(genes_without_entrez, genes_with_entrez)
gene_universe = ls(illuminaHumanv3ACCNUM)
gene_list_entrez = convert_to_entrez(gene_list, entrez_lookup)
gene_universe_entrez = convert_to_entrez(gene_universe, entrez_lookup)
output_ILMN = pathway_test(gene_list, gene_universe)
# Error in getUniverseHelper(probes, p@datPkg, universeGeneIds(p)) :
# After filtering, there are no valid IDs that can be used as the Gene universe.
# Check input values to confirm they are the same type as the central ID used by your annotation package.
# For chip packages, this will still mean the central GENE identifier used by the package (NOT the probe IDs).
output_entrez = pathway_test(gene_list_entrez, gene_universe_entrez)
# KEGG.db contains mappings based on older data because the original resource was removed from the the public domain before the
# most recent update was produced. This package should now be considered deprecated and future versions of Bioconductor may not
# have it available. Users who want more current data are encouraged to look at the KEGGREST or reactome.db packages
print(output_entrez)
# KEGGID Pvalue OddsRatio ExpCount Count Size Term
# 1: 00240 0.03315071 59.44681 0.03342716 1 95 Pyrimidine metabolism
# 2: 03040 0.04350327 44.82258 0.04398311 1 125 Spliceosome
# 3: 03013 0.04900184 39.58571 0.04961295 1 141 RNA transport
# 4: 00230 0.05516880 34.96203 0.05594652 1 159 Purine metabolism
sessionInfo()
# R version 3.1.2 (2014-10-31)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# locale:
# [1] LC_COLLATE=German_Switzerland.1252 LC_CTYPE=German_Switzerland.1252 LC_MONETARY=German_Switzerland.1252
# [4] LC_NUMERIC=C LC_TIME=German_Switzerland.1252
# attached base packages:
# [1] parallel stats4 stats graphics grDevices utils datasets methods base
# other attached packages:
# [1] KEGG.db_3.0.0 reshape2_1.4 data.table_1.9.4 GOstats_2.32.0
# [5] graph_1.44.0 Category_2.32.0 GO.db_3.0.0 Matrix_1.1-4
# [9] illuminaHumanv3.db_1.24.0 org.Hs.eg.db_3.0.0 RSQLite_1.0.0 DBI_0.3.1
# [13] AnnotationDbi_1.28.1 GenomeInfoDb_1.2.2 IRanges_2.0.0 S4Vectors_0.4.0
# [17] Biobase_2.26.0 BiocGenerics_0.12.0
# loaded via a namespace (and not attached):
# [1] annotate_1.44.0 AnnotationForge_1.8.1 chron_2.3-45 genefilter_1.48.1 grid_3.1.2
# [6] GSEABase_1.28.0 lattice_0.20-29 plyr_1.8.1 RBGL_1.42.0 Rcpp_0.11.3
# [11] splines_3.1.2 stringr_0.6.2 survival_2.37-7 tools_3.1.2 XML_3.98-1.1
# [16] xtable_1.7-4
Hi James
Thanks for your response. I now have a more up-to-date way of obtaining Entrez Gene IDs, but this doesn't solve the problem of incomplete annotation in illuminaHumanv3.db (i.e. even using select, the genes I chose as having NAs still have NAs). Do you have any suggestions for solving this?
Thanks again, Will
PS Sorry for such a slow response, somehow I didn't have email notifications set up, and have only now come back to this problem.