how to add an enrichment score to a kegg pathway enrichment analysis?
0
0
Entering edit mode
m-bihie ▴ 10
@4f002f4c
Last seen 20 months ago
Canada

I have created a datatable of over expressed and under expressed kegg pathways, and I'm trying to find a way to add an enrichment score. I used the gage() function to get the p value and path ids for the kegg pathways, but I'm struggling to find a way to add an enrichment score. I see that the gsekegg() function provides enrichment scores, but I'm struggling to get it to match my kegg pathways. Would anyone know how?

link for data: https://github.com/Bioinformatics-Research-Network/skill-assessments/blob/main/RNA-Seq%20Analysis/EwS.rds

#BiocManager::install("DESeq2")
library(DESeq2)
#BiocManager::install("EnsDb.Hsapiens.v86")
library(EnsDb.Hsapiens.v86)
#BiocManager::install("EnhancedVolcano")
library(EnhancedVolcano)
#BiocManager::install("pheatmap")
library(pheatmap)
#BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db)
#BiocManager::install("KEGGREST")
library(KEGGREST)
#BiocManager::install("pathview")
library(pathview)
#BiocManager::install("clusterProfiler")
library(clusterProfiler)
#BiocManager::install("gage")
library(gage)
#BiocManager::install("gageData")
library(gageData)

The DESeq2 Object

#   class(rse): RangedSummarizedExperiment
rse <- readRDS("EwS.rds")

#remove the version number from the geneID
rownames(rse) <- gsub(rownames(rse),
                      pattern = "\\..+", replacement = "")

#make the dds
dds <- DESeqDataSet(rse, design = ~condition)

#set the factor level
dds$condition <- relevel(dds$condition, ref = "shCTR")

#creat DESeq 2 object
dds <- DESeq(dds)

#decrease the size of the DESeq object to make DESeq2 functions faster
dds <- estimateSizeFactors(dds)
idx <- rowSums( counts(dds, normalized=TRUE) >= 10 ) >= 3
dds <- dds[idx,]

#remove rows with low gene counts
keep <- as.data.frame(counts(dds)) %>%
  rowSums(counts(dds, normalized=TRUE)) >= 10
dds <- dds[keep,]


#extract the results
res <- results(dds,
               contrast = c("condition", 
                            "shCTR",      
                            "shEF1"),  
               alpha = 0.05)

#shrink data
resNorm <- lfcShrink(dds = dds,
                     res = res,
                     type = "normal",
                     coef = 2)

#make a dataframe of the results to view them
resdf <- as.data.frame(resNorm)

#extract gene symbol from EnsDb.Hsapiens.v86
ens2sym <- AnnotationDbi::select(EnsDb.Hsapiens.v86,
                                 keys = keys(EnsDb.Hsapiens.v86), 
                                 columns = c("SYMBOL"))

#join ens2sym to resdf by shared column (GENEID)
resdfsym <- resdf %>%
  rownames_to_column() %>%
  mutate(GENEID = gsub(rowname, pattern = "\\..+", replacement = "")) %>% 
  inner_join(y = ens2sym, by = "GENEID") %>% 
  dplyr::select(-rowname) %>%
  mutate(padj = case_when(padj == 0 ~ .Machine$double.xmin,
                          TRUE ~ padj)) #replacing 0s with minimum R value

Here is the KEGG pathway.

#reorder these results by p-value and call summary() on the results object
res = res[order(res$pvalue),]
summary(res)

res$symbol = mapIds(org.Hs.eg.db,
                    keys=row.names(res),
                    column="SYMBOL",
                    keytype="ENSEMBL",
                    multiVals="first")

res$entrez = mapIds(org.Hs.eg.db,
                    keys=row.names(res),
                    column="ENTREZID",
                    keytype="ENSEMBL",
                    multiVals="first")

res$name =   mapIds(org.Hs.eg.db,
                    keys=row.names(res),
                    column="GENENAME",
                    keytype="ENSEMBL",
                    multiVals="first")

#setup the KEGG data-sets we need.
data(kegg.sets.hs)
data(sigmet.idx.hs)

#remove unnecessary pathway defintions
kegg.sets.hs = kegg.sets.hs[sigmet.idx.hs]

#creating vector of FCs with Entrez IDs as the names for the gage() function
foldchanges = res$log2FoldChange
names(foldchanges) = res$entrez

#Get the results for the pathway analysis
keggres = gage(foldchanges, gsets=kegg.sets.hs, same.dir=TRUE)

I am trying to get the enrichment scores for the filtered pathways in the dataframe KT.

#join the over-expressed and under-expressed pathways
keggtable <- c(rownames(keggres$greater),rev(rownames(keggres$less)))

#edit the dataframe
keggtable <- as.data.frame(keggtable) 
keggtable <- keggtable %>%
  separate(keggtable,
           sep = 9,
           into=c("pathid","pathway")) %>% 
  mutate(pathid = substring(pathid, 4)) %>%
  select(pathway,pathid) %>% 
  distinct()

#format p value for downregulated pathways
ls <- as.data.frame(keggres$less) %>%
  rownames_to_column() %>%
  rename(pathway = rowname) %>%
  mutate(pathway = substring(pathway, 10)) %>% map_df(rev) %>% 
  distinct()

#format p value for upregulated pathways
grtr <- as.data.frame(keggres$greater) %>%
  rownames_to_column() %>%
  rename(pathway = rowname) %>%
  mutate(pathway = substring(pathway, 10)) %>% 
  distinct()

#join dataframes
grtrls <- grtr %>%
  full_join(ls)

KT <- grtrls %>%
  left_join(keggtable, by="pathway") %>%
  select(pathway, pathid, p.val) %>%
  distinct()

#present the results in a data table
datatable(KT, 
          options=list(pageLength=5),
          caption = htmltools::tags$caption(
          style = 'caption-side: bottom; text-align: center;',
                  'Table 2: ',
                  width = 3,
          htmltools::em('Top Over-Expressed & Under-Expressed KEGG Pathways.')
                                           )
         )

When I try to use gseKEGG() with a dataframe of the results as an input, it only produces 11 enrichment scores (the datatable named out). None of the enrichment scores found match the over/under-regulated pathways I have under KT. How can I get the datatable enrichkegg to match the datatable KT?

# we want the log2 fold change 
original_gene_list <- resdf$log2FoldChange

# name the vector
names(original_gene_list) <- rownames(resdf)

# omit any NA values 
original_gene_list <- na.omit(original_gene_list)

# sort the list in decreasing order (required for clusterProfiler)
original_gene_list = sort(original_gene_list, decreasing = TRUE)

# Convert gene IDs for gseKEGG function
#lose some genes here because not all IDs will be converted
ids <- bitr(geneID = names(original_gene_list), 
            fromType = "ENSEMBL", 
            toType = "ENTREZID", 
            OrgDb = org.Hs.eg.db)

# remove duplicate IDS (here I use "ENSEMBL", but it should be whatever was selected as keyType)
dedup_ids = ids[!duplicated(ids[c("ENSEMBL")]),]

df2 = resdf[rownames(resdf) %in% dedup_ids$ENSEMBL,]

df2$Y = dedup_ids$ENTREZID

# Create a vector of the gene universe
kegg_gene_list <- df2$log2FoldChange

# Name vector with ENTREZ ids
names(kegg_gene_list) <- df2$Y

# omit any NA values 
kegg_gene_list<-na.omit(kegg_gene_list)

# sort the list in decreasing order (required for clusterProfiler)
kegg_gene_list = sort(kegg_gene_list, decreasing = TRUE)

kegg_organism = "hsa"
set.seed(1234)
kk2 <- gseKEGG(geneList     = kegg_gene_list,
               organism     = kegg_organism,
               nPerm        = 10000,
               minGSSize    = 3,
               maxGSSize    = 800,
               pvalueCutoff = 0.05,
               pAdjustMethod = "none",
               keyType       = "ncbi-geneid")

enrichkegg <- kk2@result

datatable(enrichkegg)
KEGG clusterProfiler Bioconductor • 1.7k views
ADD COMMENT
1
Entering edit mode

Not an answer to your question, but as far as I know the enrichment score for a gage analysis is reported in the column stat.mean. See the help pages of the gage function.

Also: why would you want to mix results from 2 different gene set analysis methods? Thus: gage vs gsea? In my opinion this does not make much sense, so you may want to elaborate on this.

Lastly: you provide a lot of code, but going through all of these takes quite some time/effort and it will discourage people such as me to have a close look at it. I would advise you to only show the code, (+ reproducible in- and output) that is directly relevant to your question.

ADD REPLY
0
Entering edit mode

Thanks for the advise! I can't see anywhere on the gage() pdf from bioconductor about stat.mean representing the enrichment score. I'd appreciate any links on that if you have it.

I only used gseKEGG because that was the only function I found that explicitly gave the enrichment score. I can match the stat.mean column to my datatable instead if that represents the enrichment score.

I definitely understand that this big block of code will deter lots of people, and I usually try to limit the amount I add to a question. I just didn't know how to cut the code while keeping the problem reproducible.

ADD REPLY
0
Entering edit mode

Regarding stat.mean:

Did you have a look at the help page for the function gage? To do so, in R, type ?gage.

stat.mean: mean of the individual statistics from multiple single array based gene set tests. Normally, its **absoluate value measures the magnitude of gene-set level changes**, and its sign indicates direction of the changes. When saaTest=gs.KSTest, stat.mean is always positive.
ADD REPLY

Login before adding your answer.

Traffic: 876 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6