Hello! I'd like to do a pathway enrichment analysis on metagenomic data. I already have the KO IDs for my genes. Does gage support bacterial genomes? Also, I saw that you need to specify the organism name, but I'm not interested in a specific bacterial species. Is it possible to use it with KO IDs from different species?
Yes, both gage and pathview work with KO. You can generate the pathway gene set data using the kegg.gsets function (with species="ko"). Then you can follow the examples in the quick start and basic analysis sections of the gage tutorial.
In your code above, path.ids and path.ids.l are the lists of up- and down-regulated pathways selected based on q.val < 0.1. The full analysis result tables are stored in fc.kegg.p$greater and fc.kegg.p$less. To take a look:
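As a rough sketch of that selection step: the fc.kegg.p object below is a small hand-made stand-in with the same shape as gage output (pathway names and all numbers are invented for illustration). With real results you would skip the first assignment and keep only the last few lines.

```r
# Toy stand-in for gage output: a list with $greater and $less matrices.
# Pathway names and all values are made up for illustration only.
cols <- c("p.geomean", "stat.mean", "p.val", "q.val", "set.size")
fc.kegg.p <- list(
  greater = matrix(
    c(0.001,  3.2, 0.001, 0.020, 40,
      0.998, -2.9, 0.998, 0.998, 25,
         NA,  NaN,    NA,    NA,  0),
    ncol = 5, byrow = TRUE,
    dimnames = list(c("pathway A", "pathway B", "pathway C"), cols)
  ),
  less = matrix(
    c(0.002, -2.9, 0.002, 0.030, 25,
      0.999,  3.2, 0.999, 0.999, 40,
         NA,  NaN,    NA,    NA,  0),
    ncol = 5, byrow = TRUE,
    dimnames = list(c("pathway B", "pathway A", "pathway C"), cols)
  )
)

# Inspect the top of each table.
head(fc.kegg.p$greater)
head(fc.kegg.p$less)

# Keep pathways with q.val < 0.1, dropping rows whose q.val is NA.
sel   <- !is.na(fc.kegg.p$greater[, "q.val"]) & fc.kegg.p$greater[, "q.val"] < 0.1
sel.l <- !is.na(fc.kegg.p$less[, "q.val"]) & fc.kegg.p$less[, "q.val"] < 0.1
path.ids   <- rownames(fc.kegg.p$greater)[sel]
path.ids.l <- rownames(fc.kegg.p$less)[sel.l]
```

The !is.na() guard matters here: rows that gage could not test have NA in q.val, and without the guard they would show up as NA entries in path.ids.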
My fc.kegg.p table is all NAs... Moreover, I have all pathways from the human genome (I think, since the IDs start with hsa). In the code above, where should I specify the organism? And what do I need to use, since I have a mixed bacterial population?
When you call the gage function in your code above, you specify gsets = kegg.gs, which is the gene set data. However, you used the human gene set data by following the tutorial example exactly. As I mentioned above, you should create your own gene set data using the kegg.gsets function (with species="ko"). Please follow the suggested links/docs above.
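A minimal sketch of what that could look like, assuming gage is installed and KEGG is reachable (kegg.gsets downloads the data). The fold.changes vector is a simulated placeholder, not real data; in practice it would be your own log2 fold changes named by KO ID. Note that gage wants the kg.sets element, not the whole list that kegg.gsets returns.

```r
library(gage)

# Download KEGG Orthology (KO) pathway gene sets; needs network access.
ko.data <- kegg.gsets(species = "ko")

# kegg.gsets returns a list (kg.sets, sigmet.idx, sig.idx, met.idx, dise.idx).
# The gene sets themselves live in $kg.sets.
ko.sets <- ko.data$kg.sets

# Placeholder input: simulated log2 fold changes for every KO ID in the gene sets.
# Replace this with your own named numeric vector (names must be KO IDs).
set.seed(1)
all.ko <- unique(unlist(ko.sets))
fold.changes <- setNames(rnorm(length(all.ko)), all.ko)

# Fold changes are already computed, so ref and samp are left as NULL.
fc.kegg.p <- gage(fold.changes, gsets = ko.sets, ref = NULL, samp = NULL)

head(fc.kegg.p$greater)
```

If you only want a subset of pathways, the index vectors in the returned list (for example ko.data$sigmet.idx) can be used to subset ko.data$kg.sets before calling gage.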
fc.kegg.p$greater
           p.geomean stat.mean p.val q.val set.size exp1
kg.sets           NA       NaN    NA    NA        0   NA
sigmet.idx        NA       NaN    NA    NA        0   NA
sig.idx           NA       NaN    NA    NA        0   NA
met.idx           NA       NaN    NA    NA        0   NA
dise.idx          NA       NaN    NA    NA        0   NA
I'm sorry to bother you again, but I still have a table of NAs, except for set.size, where there are some values.
                                        p.geomean stat.mean p.val q.val
ko00970 Aminoacyl-tRNA biosynthesis            NA       NaN    NA    NA
ko02010 ABC transporters                       NA       NaN    NA    NA
ko02020 Two-component system                   NA       NaN    NA    NA
ko02030 Bacterial chemotaxis                   NA       NaN    NA    NA
ko02040 Flagellar assembly                     NA       NaN    NA    NA
ko02060 Phosphotransferase system (PTS)        NA       NaN    NA    NA
Okay, you only have 238 KO genes. This seems to be a preselected list of significant genes, since all of them are heavily up- or down-regulated based on their log2 fold change values.
Gene set analysis like GAGE requires the full list of genes (proteins, molecules, etc.), usually thousands of them, instead of a preselected short list. Otherwise, there is no background to compare against in the statistical test. Therefore, you should include the data for all KO genes in the output (from DESeq or other analysis), as described in the gage tutorials.
BTW, you saw NA or NaN in your output because each pathway gets no genes, or only a few, mapped to it, as the list is very short with only 238 genes.
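One way to check this before running gage is to count how many of your input IDs fall into each gene set. Below is a small base R sketch with toy data: gsets stands in for kegg.gsets(species="ko")$kg.sets and fc for your named log2 fold change vector. The set names, KO IDs and values are all invented for illustration.

```r
# Toy gene sets with hypothetical KO IDs, standing in for the real KO gene sets.
gsets <- list(
  "set A" = c("K00001", "K00002", "K00003", "K00004"),
  "set B" = c("K00003", "K00005", "K00006"),
  "set C" = c("K00007", "K00008")
)

# Toy log2 fold changes named by KO ID: a short, preselected list.
fc <- c(K00001 = 4.1, K00005 = -3.7, K00099 = 5.0)

# Number of input IDs that land in each gene set.
mapped <- sapply(gsets, function(ids) sum(names(fc) %in% ids))
print(mapped)

# Input IDs that do not map to any gene set.
unmapped <- setdiff(names(fc), unlist(gsets))
print(unmapped)

# gage only tests sets whose mapped size is within set.size (default c(10, 500)),
# so count how many sets would be large enough to test.
n.testable <- sum(mapped >= 10)
print(n.testable)
```

With a short preselected list, most or all sets end up below the lower set.size limit, which is consistent with the NA rows shown above. With the full KO table as input, many more sets should clear that limit.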