Hi,
I am facing some issues while processing my RNAseq outputs through the TopGO package to determine which GOterms are enriched in the significantly modulated gene sets extracted from DESeq2. When running topGO, I am getting significant GO terms that are associated with 0 of my significant genes, using a p-value threshold of p.value.elim.ks <= 0.01. Is this related to the type of statistical test used ? Please see below the function I used:
Where GO.type = either MF / BP / CC (one of the 3 GO categories) counts.matrix.id is the output data.frame of DESeq2 (using results()) with rownames corresponding to Ensembl Gene IDs. anSig is the subset of the counts.matrix.id corresponding to the significantly modulated genes. backG is the background of genes to be used to compute enrichment
RunTopGO <- function(GO.type,
counts.matrix.id,
anSig,
backG){
geneIDs = rownames(x = counts.matrix.id)
inUniverse = geneIDs %in% c(anSig$ENSEMBL, backG)
inSelection = geneIDs %in% anSig$ENSEMBL
alg <- factor(x = as.integer(inSelection[inUniverse] ) )
names(x = alg) <- geneIDs[inUniverse]
tgd <- new("topGOdata",
ontology = GO.type,
allGenes = alg,
nodeSize = 5,
annot = annFUN.org,
mapping ="org.Mm.eg.db",
ID = "ensembl" )
## run tests
resultTopGO.elim.fisher <- runTest(tgd,
algorithm = "elim",
statistic = "fisher" )
resultTopGO.elim.ks <- runTest(tgd,
algorithm = "elim",
statistic = "ks" )
resultTopGO.classic.fisher <- runTest(tgd,
algorithm = "classic",
statistic = "fisher" )
resultTopGO.classic.ks <- runTest(tgd,
algorithm = "classic",
statistic = "ks" )
## look at results
tab <- GenTable(tgd, Fisher.elim = resultTopGO.elim.fisher,
Fisher.classic = resultTopGO.classic.fisher,
ks.elim = resultTopGO.elim.ks,
ks.classic = resultTopGO.classic.ks,
orderBy = "ks.elim" ,
topNodes = 200)
results.list <- list(resultTopGO.elim.fisher = resultTopGO.elim.fisher,
resultTopGO.elim.ks = resultTopGO.elim.ks,
resultTopGO.classic.fisher = resultTopGO.classic.fisher,
resultTopGO.classic.ks = resultTopGO.classic.ks)
data.list <- mapply(GO.data = results.list,
col.name = names(results.list),
FUN = function(GO.data,
col.name) {
d <- data.frame(topGO::score(GO.data))
colnames(d) <- gsub(pattern = "resultTopGO.",
replacement = "p.value.",
x = col.name)
d$GO.ID <- rownames(x = d)
d
},
SIMPLIFY = FALSE,
USE.NAMES = TRUE)
data.GO <- Reduce(function(x, y) merge(x = x,
y = y,
all=TRUE,
by = "GO.ID"),
data.list)
data.GO <- merge(x = data.GO,
y = tab,
by = "GO.ID",
all.y = T)
df1 <- data.GO[data.GO$p.value.elim.ks < 0.01, ]
df1$Term.unique <- paste(df1$Term,
df1$GO.ID,
sep = "_")
##
plotdf1 <- ggplot(data = df1,
aes(y= -log10(p.value.elim.ks),
x = Term.unique,
fill = -log10(p.value.elim.ks))) +
geom_bar(stat = "identity",
color = "black", size = 0.1) +
coord_flip() +
scale_x_discrete(limits = df1$Term.unique[order(df1$p.value.elim.ks)]) +
theme(aspect.ratio = 1,
panel.background = element_rect(fill = "lightgrey")) +
labs(x = "") +
scale_fill_gradientn(colours = viridis::magma(5)[-1])
return(list(tgd = tgd,
GO.type.df = df1,
plot = plotdf1,
tab = tab))
}
Hereafter is an example of the type of result I get :
GO.ID Term Annotated Significant Expected Rank in Fisher.classic Fisher.elim Fisher.classic ks.elim ks.classic
1 GO:0003724 RNA helicase activity 17 0 1.99 510 1.000 1.0000 8.5e-09 8.5e-09
2 GO:0008017 microtubule binding 47 1 5.51 506 0.997 0.9973 1.1e-07 1.1e-07
3 GO:0020037 heme binding 14 1 1.64 421 0.826 0.8259 1.6e-05 1.6e-05
4 GO:0003743 translation initiation factor activity 8 0 0.94 511 1.000 1.0000 8.0e-05 8.0e-05
5 GO:0016705 oxidoreductase activity, acting on paire... 24 0 2.81 512 1.000 1.0000 0.00012 0.00012
6 GO:0003735 structural constituent of ribosome 36 0 4.22 513 1.000 1.0000 0.00016 0.00016
You can see from the first GO.ID, which is highly significant from the ks elim computation (8.5e-09), that there are 0 genes falling in the "Significant" column... Thanks in advance for the inputs. Best,
Al.
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] patchwork_1.0.0 RColorBrewer_1.1-2 gplots_3.0.3 geneplotter_1.64.0
[5] annotate_1.64.0 XML_3.99-0.3 lattice_0.20-40 LSD_4.0-0
[9] Rgraphviz_2.30.0 EDASeq_2.20.0 ShortRead_1.44.3 GenomicAlignments_1.22.1
[13] Rsamtools_2.2.3 Biostrings_2.54.0 XVector_0.26.0 fdrtool_1.2.15
[17] topGO_2.38.1 SparseM_1.78 GO.db_3.10.0 graph_1.64.0
[21] org.Mm.eg.db_3.10.0 AnnotationDbi_1.48.0 biomaRt_2.42.1 genefilter_1.68.0
[25] plyr_1.8.6 stringr_1.4.0 dplyr_0.8.5 ggrepel_0.8.2
[29] ggplot2_3.3.0 edgeR_3.28.1 limma_3.42.2 DESeq2_1.26.0
[33] SummarizedExperiment_1.16.1 DelayedArray_0.12.2 BiocParallel_1.20.1 matrixStats_0.56.0
[37] Biobase_2.46.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.1 IRanges_2.20.2
[41] S4Vectors_0.24.3 BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] colorspace_1.4-1 hwriter_1.3.2 ellipsis_0.3.0 htmlTable_1.13.3 base64enc_0.1-3
[6] rstudioapi_0.11 farver_2.0.3 bit64_0.9-7 fansi_0.4.1 splines_3.6.3
[11] R.methodsS3_1.8.0 DESeq_1.38.0 knitr_1.28 Formula_1.2-3 cluster_2.1.0
[16] dbplyr_1.4.2 png_0.1-7 R.oo_1.23.0 compiler_3.6.3 httr_1.4.1
[21] backports_1.1.5 assertthat_0.2.1 Matrix_1.2-18 cli_2.0.2 acepack_1.4.1
[26] htmltools_0.4.0 prettyunits_1.1.1 tools_3.6.3 gtable_0.3.0 glue_1.3.2
[31] GenomeInfoDbData_1.2.2 rappdirs_0.3.1 Rcpp_1.0.4 vctrs_0.2.4 gdata_2.18.0
[36] rtracklayer_1.46.0 xfun_0.12 lifecycle_0.2.0 gtools_3.8.2 zlibbioc_1.32.0
[41] scales_1.1.0 aroma.light_3.16.0 hms_0.5.3 curl_4.3 memoise_1.1.0
[46] gridExtra_2.3 rpart_4.1-15 latticeExtra_0.6-29 stringi_1.4.6 RSQLite_2.2.0
[51] highr_0.8 checkmate_2.0.0 caTools_1.18.0 GenomicFeatures_1.38.2 rlang_0.4.5
[56] pkgconfig_2.0.3 bitops_1.0-6 evaluate_0.14 purrr_0.3.3 labeling_0.3
[61] htmlwidgets_1.5.1 bit_1.1-15.2 tidyselect_1.0.0 magrittr_1.5 R6_2.4.1
[66] Hmisc_4.4-0 DBI_1.1.0 pillar_1.4.3 foreign_0.8-76 withr_2.1.2
[71] survival_3.1-11 RCurl_1.98-1.1 nnet_7.3-13 tibble_3.0.0 crayon_1.3.4
[76] KernSmooth_2.23-16 BiocFileCache_1.10.2 rmarkdown_2.1 viridis_0.5.1 jpeg_0.1-8.1
[81] progress_1.2.2 locfit_1.5-9.4 data.table_1.12.8 blob_1.2.1 digest_0.6.25
[86] xtable_1.8-4 R.utils_2.9.2 openssl_1.4.1 munsell_0.5.0 viridisLite_0.3.0
[91] askpass_1.1
Term
,Annotated
andSignificant
are 3 different columns. If I got it correctly,Term
is the GOterm name,Annotated
corresponds to the number of genes associated to this given GOterm, andSignificant
to the number of genes from my DESeq2 input that are within thisAnnotated
gene-set.So indeed yes, it makes no sense that there are no significant genes associated to the significantly enriched GOterms, which is why I am opening this thread...
The KS test is supposed to be testing for enrichment, but evidently not. Even the example in the vignette shows that (the first row in Table 2 shows a non-significant Fisher test and significant KS test for under-representation).
The author chocks that up to the KS test being less conservative, but my interpretation would be different?