Hello everybody,
I would like to calculate and plot the enrichment score of a selection from a ranked list, as many readers are familar with the gene set enrichment plots produced by the Broad Institute's GSEA implementation.
While the basic principle is clear to me, I failed to understand how the magnitude is determined, by which the running sum statistic is increased or decreased. The paper and the FAQ only say "the magnitude of the increment depends on the correlation of the gene with the phenotype" and a otherwise very helpful third-party site also doesn't explain how it is calculated.
Please find a working example code to illustrate my current approach below in case you would like to try it out for yourself. However to help me an explanation or formula to calculate the enrichment score would already suffice, since my current result doesn't look similar to the enrichment score of GSEA plots at all. Thanks a lot for your help.
Matthias
testvalues <- rweibull(1e4,shape=1,scale=2) + rnorm(1e4) testdata <- data.frame( rank = rank(testvalues,ties.method = "random"), value = testvalues, random_set = sample( c(TRUE,FALSE),size = 1e4,replace = TRUE,prob = c(0.02,0.98) ), enriched_set = ( # place random TRUES, but only in the upper 50% range rank(testvalues,ties.method = "random") > 5e3 & sample( c(TRUE,FALSE),size = 1e4,replace = TRUE,prob = c(0.05,0.95) ) ) ) # order dataset and reverse ranks (so biggest value is ranked first) testdata <- testdata[order(testdata[,"rank"]),] testdata[,"rank"] <- rev(testdata[,"rank"]) testdata <- rev(testdata) # fruitless attempt to calculate someting like an enrichment score testdata[,c("ES_random_set","ES_enriched_set")] <- apply(testdata[,c("random_set","enriched_set")],2,function(x) { enscore <- as.numeric(x) # normalize magnitutde for selection size enscore[enscore == 0] <- (-1 / (length(x) / sum(x) - 1)) return(round(cumsum(enscore),3)*-1) }) ### plot the wannabe enrichment score plot_es <- function(plotdata,set,es){ library("ggplot2") library("gridExtra") curve <- ggplot(plotdata,aes_string(x="rank", y="value")) + geom_line() + geom_point(data=plotdata[plotdata[,set],],color="red",size=5,shape=124) esplot <- ggplot(plotdata,aes_string(x="rank", y=es)) + geom_line() return(grid.arrange(curve,esplot,nrow=2,heights=c(3,1))) } plot_es(testdata,"random_set","ES_random_set") plot_es(testdata,"enriched_set","ES_enriched_set")
Hello Gordon,
Thanks a lot for your answer, this indeed looks tremendously helpful. It seems I was trying to reinvent the wheel here.
I will probably be able to use the barcodeplot right away for my project or can at least dissect the function. Anyway this should clearly solve my problems.
Best regards
Matthias