I want to use the groupGO function from cluster profiler, but do not exactly understand the level
argument. The help page (hence ?groupGO)
says: "Specific GO level." When I play around with it (giving it different numbers), I still don't seem to understand what and how this argument works.
> ggo_all <- groupGO(gene = All_genes, keytype = "ENSEMBL", OrgDb = org.Hs.eg.db, ont = "BP", level = 20, readable = F) > dim(as.data.frame(ggo_all)) [1] 2 5 > ggo_all <- groupGO(gene = All_genes, keytype = "ENSEMBL", OrgDb = org.Hs.eg.db, ont = "BP", level = 2, readable = F) > dim(as.data.frame(ggo_all)) [1] 33 5 > ggo_all <- groupGO(gene = All_genes, keytype = "ENSEMBL", OrgDb = org.Hs.eg.db, ont = "BP", level = 3, readable = F) > dim(as.data.frame(ggo_all)) [1] 585 5 > ggo_all <- groupGO(gene = All_genes, keytype = "ENSEMBL", OrgDb = org.Hs.eg.db, ont = "BP", level = 10, readable = F) > dim(as.data.frame(ggo_all)) [1] 13782 5 > ggo_all <- groupGO(gene = All_genes, keytype = "ENSEMBL", OrgDb = org.Hs.eg.db, ont = "BP", level = 15, readable = F) > dim(as.data.frame(ggo_all)) [1] 1046 5
What I would like to do is, to get all GO terms for my genes. But I don't understand what the level
argument should be set on to achieve this.
> sessionInfo() R version 3.4.3 (2017-11-30) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 Matrix products: default locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] grid parallel stats4 stats graphics grDevices utils [8] datasets methods base other attached packages: [1] Rgraphviz_2.22.0 topGO_2.30.1 SparseM_1.77 [4] GO.db_3.5.0 graph_1.56.0 BiocInstaller_1.28.0 [7] org.Hs.eg.db_3.5.0 AnnotationDbi_1.40.0 IRanges_2.12.0 [10] S4Vectors_0.16.0 Biobase_2.38.0 BiocGenerics_0.24.0 [13] clusterProfiler_3.6.0 DOSE_3.4.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.15 pillar_1.2.1 compiler_3.4.3 [4] plyr_1.8.4 tools_3.4.3 digest_0.6.15 [7] bit_1.1-12 lattice_0.20-35 RSQLite_2.0 [10] memoise_1.1.0 tibble_1.4.2 gtable_0.2.0 [13] pkgconfig_2.0.1 rlang_0.2.0 fastmatch_1.1-0 [16] igraph_1.1.2 DBI_0.8 rvcheck_0.0.9 [19] fgsea_1.4.1 gridExtra_2.3 stringr_1.3.0 [22] bit64_0.9-7 glue_1.2.0 qvalue_2.10.0 [25] data.table_1.10.4-3 BiocParallel_1.12.0 GOSemSim_2.4.1 [28] purrr_0.2.4 tidyr_0.8.0 DO.db_2.9 [31] ggplot2_2.2.1 reshape2_1.4.3 blob_1.1.0 [34] magrittr_1.5 matrixStats_0.53.1 splines_3.4.3 [37] scales_0.5.0 colorspace_1.3-2 labeling_0.3 [40] stringi_1.1.6 lazyeval_0.2.1 munsell_0.4.3
Thanks for the explanation and example. So how do I get all GO terms for all levels? If you look at my examples, you'll see that with
level = 15
(and20)
the number of terms seem to decrease compared withlevel = 10
. Why does this decrease?Have you read the vignette? It seems pretty straightforward. The goal for this support site is to help those who are truly stuck. While R and Bioconductor are free to download and install, they are not truly free. You will have to be willing to invest the necessary time and effort to figure out how things work and what you should be doing.
One thing this support site is not intended to be is a conventional support service like what you get with paid software. The vast majority of the helpers on this site are giving their time and experience to you for free, and if it seems like you are taking advantage and not doing your own homework, you run the very real risk of being considered 'one of those people' who end up being routinely ignored.
Thanks for your comment James, and yes I have read the vignette but still can't figure out how level should be set to get all the terms, or how it is possible that the number of terms decreases with very high level.
I don't completely understand your irritation, because I don't understand why I give the impression that I take advantage of this support site, or that I don't do my homework. But if I gave you that impression I am sorry for the inconvenience.
I'm not irritated. I'm trying to explain something to you. Evidently not well.
So let me try to explain. I know nothing about clusterProfiler. But if I go to the vignette and click on the link for GO Analysis, I see this:
In clusterProfiler,
groupGO
is designed for gene classification based on GO distribution at a specific level.And immediately following that, in the next paragraph, I see
GO over-representation test
Over-representation test3 were implemented in clusterProfiler. For calculation details and explanation of paramters, please refer to the vignette of DOSE.
So without knowing anything about this package, and taking like two minutes of my time, I now know that the function
groupGO
is for doing a hypergeometric test at a specific level, and thatenrichGO
is for doing a hypergeometric using all levels.The expectation is that you could have done that yourself, rather than having somebody read the section and parse it for you. So to reiterate my earlier point, if people start to think that you are unwilling to read and understand the documentation provided to you, particularly when it is laid out in clear and understandable text, they may themselves become unwilling to offer help when you ask for it. That is not intended to imply some irritation with you on my part, but to tell you how things work, and to offer some constructive criticism.
Thanks again, I understand what you mean. Sorry I gave you that impression. I was actually not asking about a simple hypergeometric test, I understand how to do that with this package. I thought I was clear about that in my question, but probably I was not. I was interested in getting all GO terms, for all my genes (which is a list of all ensemble genes). I would like to get this gene 2 GO term data frame for further analysis (see how many genes overlap per term). Thanks for your time, I think I will stick to GOseq, where I do understand the vignette better.
So you buried your actual question in a bunch of other stuff. If you want to know all the GO IDs for each ensembl gene ID, you should just use biomaRt. Trying to use a package for some side effect is not really the way to go.
From the vignette here:
In clusterProfiler, groupGO is designed for gene classification based on GO distribution at a specific level.
Over-representation test were implemented in clusterProfiler. For calculation details and explanation of parameters, please refer to the vignette of DOSE. enrichGO test the whole GO corpus, ...
The decrease of GO terms when increasing the levels is (AFAIK) due the fact that not all trees have an equal 'depth'. In other words, some trees stop at e.g. level 15, others already at level 10.
Moreover, the nature of the GO classification also makes things complex. As James indicated, the GO classification is a directed acyclic graph. GO terms do not occupy strict fixed levels in the hierarchy. Because GO is structured as a graph, terms would appear at different 'levels' if different paths were followed through the graph. This is especially true if one mixes the different relations used to connect terms. In other words, a problem is that GO hierarchy is set-up in a way where you can take multiple paths to reach the same term. Let's say a gene has the term, neuron differentiation. You can reach that term through neuron development or cell differentiation. Depending on the path, the level (depth) could be different. (Modified from this post here at Biostars)
Thanks for your comment. So due to the complexity of the GO graph structure, it is probably not possible to find a level where all GO terms are given. I didn't realize it, thanks for your time.