Is there a way to find out which KEGG IDs mapped to a pathway? Specifically, in the output of generateResultsTable(method="hypergeom"...) function there are 2 columns called "CompoundHits" and "CompoundsInPathway". Can I find out to which KEGG IDs the numbers in these columns correspond?
There is no "nice" built-in way to solve your query, but it is indeed possible to get around it.
I have attached a script - hope it helps.
# example on how to extract individual metabolite hits in hypergeom
library(igraph)
library(FELLA)
library(plyr)
library(dplyr)
data("FELLA.sample")
data("input.sample")
myAnalysis <- enrich(
compounds = input.sample,
method = "hypergeom",
data = FELLA.sample)
df.hypergeom <- generateResultsTable(
method = "hypergeom",
object = myAnalysis,
data = FELLA.sample,
threshold = 1)
# there is currently no nice function for this
# you can obtain the internal metabolite-pathway matrix
# remember these are not the original KEGG pathways but the
# pathways that can be reached from the metabolites in the
# hierarchical, upwards-directed network
mat.hypergeom <- FELLA:::getMatrix(FELLA.sample, "hypergeom")
list.cpd <- intersect(FELLA::getInput(myAnalysis), rownames(mat.hypergeom))
# all the metabolites (path) and those within the list (hits)
list.path <- apply(mat.hypergeom, 2, function(x) names(which(x)))
list.hits <- lapply(list.path, function(x) intersect(x, list.cpd))
lapply(list.hits, head)
# check that the numbers make sense - compare to those in tab.hypergeom
df.hits <- plyr::ldply(names(list.path), function(path) {
data.frameKEGG.id = path,
CompoundHits = length(list.hits[[path]]),
CompoundsInPathway = length(list.path[[path]]),
stringsAsFactors = FALSE)
})
df.hits
# all the rows should be repeated
# otherwise rise exception
stopifnot(nrow(dplyr::anti_join(df.hypergeom, df.hits)) == 0)
Thank you for sharing this code. Is it possible to do something similar with the diffusion matrix as opposed to the hypergeometric matrix? The structure of the diffusion matrix appears to be different to the hypergeometric matrix and I cannot get the above code to work. I did convert the diffusion matrix to class 'lgCMatrix' (same class as the hypergeometric matrix) but I cannot work out how to map compounds to individual pathways. I appreciate that the generateResultsTable function lists which compounds results in enrichment of pathways but it would be useful to know which compounds actually map to each individual KEGG pathway and/or module.
Thank you for your help.
Dear liampvmartin1992, please excuse the late reply. I did not notice the comment.
If by "doing someting similar" you mean...
Finding out what pathways each compound belongs to, originally (right before building the network): you should store the result of KEGGREST::keggLink("cpd", "pathway") when building the network. You can do it a posteriori, but there might be small changes with each KEGG release.
Finding out what pathways can be directly reached from each compound, right after building the network (e.g. compound C does not belong to pathway P originally, but is related to enzyme E which belongs to P, so P can be reached from C via C-E-P): the answer is the one to the original question above.
Finding out the contribution of each compound to each pathway: then you can grab the diffusion matrix with the kernel values and work on those directly. See code below.
# example on how to extract individual metabolite contributions in diffusion
# to know which metabolites were in which pathways see 03_hypergeom_getpathways
# https://support.bioconductor.org/p/119264/
library(igraph)
library(FELLA)
library(plyr)
library(dplyr)
data("FELLA.sample")
data("input.sample")
# rebuild diffusion matrix (since matrix is not
# bundled in the default sample object)
g <- FELLA::getGraph(FELLA.sample)
db.dir <- paste0(tempdir(), "/fellasampledb")
FELLA::buildDataFromGraph(
g, databaseDir = db.dir, internalDir = FALSE, niter = 10)
# reload db with diffusion matrix
FELLA.samplemat <- FELLA::loadKEGGdata(
db.dir, internalDir = FALSE, loadMatrix = "diffusion")
# toy analysis
myAnalysis <- enrich(
compounds = input.sample,
method = "diffusion",
data = FELLA.samplemat)
df.diffusion <- generateResultsTable(
method = "diffusion",
object = myAnalysis,
data = FELLA.samplemat)
# get diffusion matrix with kernel values
mat.diffusion <- FELLA:::getMatrix(FELLA.samplemat, "diffusion")
# input cpds
list.cpdin <- intersect(FELLA::getInput(myAnalysis), rownames(mat.diffusion))
# all cpds
list.cpdall <- FELLA::getCom(FELLA.samplemat, "compound")
# pathway hits
list.pathhits <- subset(df.diffusion, Entry.type == "pathway")$KEGG.id
# all pathways
list.pathall <- FELLA::getCom(FELLA.samplemat, "pathway")
# columns: input metabolites
# rows: pathways
# values: contribution of compound to pathway score
mat.contr <- mat.diffusion[list.pathall, list.cpdin]
# The sum of each compound contributions to all pathway is 1
colSums(mat.contr)
# hsa00640 always gets more contribution than hsa00010 for the input compounds
rownames(mat.contr)[apply(mat.contr, 2, which.max)] %>% table
# in general, hsa00640 gets larger contributions than
# hsa00010 from the compounds in the graph
# (24 compounds with larger contrib from hsa00010,
# versus 301 compounds from hsa00640)
list.pathall[apply(mat.diffusion[list.pathall, list.cpdall], 2, which.max)] %>% table
Edit: provided answer instead of comment
This a wrapper function of generateResultsTable
Compound IDs and names matched in pathway are included in the table. Thanks to sergi.picart!
Example