I have created a datatable of over expressed and under expressed kegg pathways, and I'm trying to find a way to add an enrichment score. I used the gage() function to get the p value and path ids for the kegg pathways, but I'm struggling to find a way to add an enrichment score. I see that the gsekegg() function provides enrichment scores, but I'm struggling to get it to match my kegg pathways. Would anyone know how?
link for data: https://github.com/Bioinformatics-Research-Network/skill-assessments/blob/main/RNA-Seq%20Analysis/EwS.rds
#BiocManager::install("DESeq2")
library(DESeq2)
#BiocManager::install("EnsDb.Hsapiens.v86")
library(EnsDb.Hsapiens.v86)
#BiocManager::install("EnhancedVolcano")
library(EnhancedVolcano)
#BiocManager::install("pheatmap")
library(pheatmap)
#BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db)
#BiocManager::install("KEGGREST")
library(KEGGREST)
#BiocManager::install("pathview")
library(pathview)
#BiocManager::install("clusterProfiler")
library(clusterProfiler)
#BiocManager::install("gage")
library(gage)
#BiocManager::install("gageData")
library(gageData)
The DESeq2 Object
# class(rse): RangedSummarizedExperiment
rse <- readRDS("EwS.rds")
#remove the version number from the geneID
rownames(rse) <- gsub(rownames(rse),
pattern = "\\..+", replacement = "")
#make the dds
dds <- DESeqDataSet(rse, design = ~condition)
#set the factor level
dds$condition <- relevel(dds$condition, ref = "shCTR")
#creat DESeq 2 object
dds <- DESeq(dds)
#decrease the size of the DESeq object to make DESeq2 functions faster
dds <- estimateSizeFactors(dds)
idx <- rowSums( counts(dds, normalized=TRUE) >= 10 ) >= 3
dds <- dds[idx,]
#remove rows with low gene counts
keep <- as.data.frame(counts(dds)) %>%
rowSums(counts(dds, normalized=TRUE)) >= 10
dds <- dds[keep,]
#extract the results
res <- results(dds,
contrast = c("condition",
"shCTR",
"shEF1"),
alpha = 0.05)
#shrink data
resNorm <- lfcShrink(dds = dds,
res = res,
type = "normal",
coef = 2)
#make a dataframe of the results to view them
resdf <- as.data.frame(resNorm)
#extract gene symbol from EnsDb.Hsapiens.v86
ens2sym <- AnnotationDbi::select(EnsDb.Hsapiens.v86,
keys = keys(EnsDb.Hsapiens.v86),
columns = c("SYMBOL"))
#join ens2sym to resdf by shared column (GENEID)
resdfsym <- resdf %>%
rownames_to_column() %>%
mutate(GENEID = gsub(rowname, pattern = "\\..+", replacement = "")) %>%
inner_join(y = ens2sym, by = "GENEID") %>%
dplyr::select(-rowname) %>%
mutate(padj = case_when(padj == 0 ~ .Machine$double.xmin,
TRUE ~ padj)) #replacing 0s with minimum R value
Here is the KEGG pathway.
#reorder these results by p-value and call summary() on the results object
res = res[order(res$pvalue),]
summary(res)
res$symbol = mapIds(org.Hs.eg.db,
keys=row.names(res),
column="SYMBOL",
keytype="ENSEMBL",
multiVals="first")
res$entrez = mapIds(org.Hs.eg.db,
keys=row.names(res),
column="ENTREZID",
keytype="ENSEMBL",
multiVals="first")
res$name = mapIds(org.Hs.eg.db,
keys=row.names(res),
column="GENENAME",
keytype="ENSEMBL",
multiVals="first")
#setup the KEGG data-sets we need.
data(kegg.sets.hs)
data(sigmet.idx.hs)
#remove unnecessary pathway defintions
kegg.sets.hs = kegg.sets.hs[sigmet.idx.hs]
#creating vector of FCs with Entrez IDs as the names for the gage() function
foldchanges = res$log2FoldChange
names(foldchanges) = res$entrez
#Get the results for the pathway analysis
keggres = gage(foldchanges, gsets=kegg.sets.hs, same.dir=TRUE)
I am trying to get the enrichment scores for the filtered pathways in the dataframe KT.
#join the over-expressed and under-expressed pathways
keggtable <- c(rownames(keggres$greater),rev(rownames(keggres$less)))
#edit the dataframe
keggtable <- as.data.frame(keggtable)
keggtable <- keggtable %>%
separate(keggtable,
sep = 9,
into=c("pathid","pathway")) %>%
mutate(pathid = substring(pathid, 4)) %>%
select(pathway,pathid) %>%
distinct()
#format p value for downregulated pathways
ls <- as.data.frame(keggres$less) %>%
rownames_to_column() %>%
rename(pathway = rowname) %>%
mutate(pathway = substring(pathway, 10)) %>% map_df(rev) %>%
distinct()
#format p value for upregulated pathways
grtr <- as.data.frame(keggres$greater) %>%
rownames_to_column() %>%
rename(pathway = rowname) %>%
mutate(pathway = substring(pathway, 10)) %>%
distinct()
#join dataframes
grtrls <- grtr %>%
full_join(ls)
KT <- grtrls %>%
left_join(keggtable, by="pathway") %>%
select(pathway, pathid, p.val) %>%
distinct()
#present the results in a data table
datatable(KT,
options=list(pageLength=5),
caption = htmltools::tags$caption(
style = 'caption-side: bottom; text-align: center;',
'Table 2: ',
width = 3,
htmltools::em('Top Over-Expressed & Under-Expressed KEGG Pathways.')
)
)
When I try to use gseKEGG() with a dataframe of the results as an input, it only produces 11 enrichment scores (the datatable named out). None of the enrichment scores found match the over/under-regulated pathways I have under KT. How can I get the datatable enrichkegg to match the datatable KT?
# we want the log2 fold change
original_gene_list <- resdf$log2FoldChange
# name the vector
names(original_gene_list) <- rownames(resdf)
# omit any NA values
original_gene_list <- na.omit(original_gene_list)
# sort the list in decreasing order (required for clusterProfiler)
original_gene_list = sort(original_gene_list, decreasing = TRUE)
# Convert gene IDs for gseKEGG function
#lose some genes here because not all IDs will be converted
ids <- bitr(geneID = names(original_gene_list),
fromType = "ENSEMBL",
toType = "ENTREZID",
OrgDb = org.Hs.eg.db)
# remove duplicate IDS (here I use "ENSEMBL", but it should be whatever was selected as keyType)
dedup_ids = ids[!duplicated(ids[c("ENSEMBL")]),]
df2 = resdf[rownames(resdf) %in% dedup_ids$ENSEMBL,]
df2$Y = dedup_ids$ENTREZID
# Create a vector of the gene universe
kegg_gene_list <- df2$log2FoldChange
# Name vector with ENTREZ ids
names(kegg_gene_list) <- df2$Y
# omit any NA values
kegg_gene_list<-na.omit(kegg_gene_list)
# sort the list in decreasing order (required for clusterProfiler)
kegg_gene_list = sort(kegg_gene_list, decreasing = TRUE)
kegg_organism = "hsa"
set.seed(1234)
kk2 <- gseKEGG(geneList = kegg_gene_list,
organism = kegg_organism,
nPerm = 10000,
minGSSize = 3,
maxGSSize = 800,
pvalueCutoff = 0.05,
pAdjustMethod = "none",
keyType = "ncbi-geneid")
enrichkegg <- kk2@result
datatable(enrichkegg)
Not an answer to your question, but as far as I know the enrichment score for a
gage
analysis is reported in the columnstat.mean
. See the help pages of thegage
function.Also: why would you want to mix results from 2 different gene set analysis methods? Thus: gage vs gsea? In my opinion this does not make much sense, so you may want to elaborate on this.
Lastly: you provide a lot of code, but going through all of these takes quite some time/effort and it will discourage people such as me to have a close look at it. I would advise you to only show the code, (+ reproducible in- and output) that is directly relevant to your question.
Thanks for the advise! I can't see anywhere on the gage() pdf from bioconductor about stat.mean representing the enrichment score. I'd appreciate any links on that if you have it.
I only used gseKEGG because that was the only function I found that explicitly gave the enrichment score. I can match the stat.mean column to my datatable instead if that represents the enrichment score.
I definitely understand that this big block of code will deter lots of people, and I usually try to limit the amount I add to a question. I just didn't know how to cut the code while keeping the problem reproducible.
Regarding
stat.mean
:Did you have a look at the help page for the function
gage
? To do so, in R, type?gage
.