Hello everybody, sorry if this post is wrong for some reason, it's my first time posting here.
I have a question regarding the association between GO and ORFs provided by the org.Sc.sgd.db package. In particular, the GO2ALLORFs mapping. In the package manual, it is stated:
"org.Sc.sgdGO2ALLORFS is an R object that provides mappings between a given GO identifier and all ORF identifiers annotated at that GO term or one of its children in the GO ontology. ... For GO2ALL style mappings, the intention is to return all the genes that are the same kind of term as the parent term (based on the idea that they are more specific descriptions of the general term). However because of this intent, not all relationship types will be counted as offspring for this mapping. Only "is a" and "has a" style relationships indicate that the genes from the child terms would be the same kind of thing."
Based on this description, I would expect a mapping towards genes which describe subcategories of the BP, not genes with regulation activity on the BP. For instance querying for ORFs associated with the BP "sterol biosynthetic process" should also return all ORFs that are associated to "ergosterol biosynthetic process", since it is a sterol biosynthetic process. Conversely, it should not return ORFs with "regulates" connection to the BP. Yet, GO2ALLORFS returns ORFs that are annotated as "regulates" or even "negatively regulates" with respect to the queried biological process GO.
As an example, GO:0005978 glycogen biosynthetic process does not, on the SGD website, have any listed children. However, the mapping GO2ALLORFs also returns ORFs whose sole BP annotation on the SGD website is "involved in negative regulation of glycogen biosynthetic process".
> library(AnnotationDbi)
> library(org.Sc.sgd.db)
> library(GO.db)
> library(dplyr)
> setwd("/home/davide/Dottorato/R useful commands/Reprex org.Sc glycerol/")
> set.seed(1)
> term <- "GO:0005978"
> AnnotationDbi::select(GO.db,keys=term,keytype = "GOID",columns=c("TERM","ONTOLOGY"))
'select()' returned 1:1 mapping between keys and columns
GOID TERM ONTOLOGY
1 GO:0005978 glycogen biosynthetic process BP
> # with select ()
> allorfs_sel <- AnnotationDbi::select(
+ x = org.Sc.sgd.db,
+ keys=term,
+ keytype = "GOALL",
+ column=c("ORF")
+ ) %>% dplyr::distinct(ORF) # remove duplicate ORF rows
'select()' returned 1:many mapping between keys and columns
> allorfs_sel$COMMON <- AnnotationDbi::mapIds(
+ x = org.Sc.sgd.db,
+ keys=allorfs_sel$ORF,
+ keytype = "ORF",
+ column="COMMON"
+ )
'select()' returned 1:many mapping between keys and columns
> allorfs_sel <- dplyr::arrange(allorfs_sel,COMMON)
> allorfs_sel
ORF COMMON
1 YOR178C GAC1
2 YPR184W GDB1
3 YER054C GIP2
4 YEL011W GLC3
5 YMR311C GLC8
6 YKR058W GLG1
7 YJL137C GLG2
8 YFR015C GSY1
9 YLR258W GSY2
10 YFR017C IGD1
11 YGL134W PCL10
12 YER059W PCL6
13 YIL050W PCL7
14 YPL219W PCL8
15 YKL127W PGM1
16 YMR105C PGM2
17 YPL031C PHO85
18 YLR273C PIG1
19 YIL045W PIG2
20 YAL017W PSK1
21 YOL045W PSK2
22 YER129W SAK1
23 YKL035W UGP1
> # as Bimap (sgdGO2ALLORFS)
> allorfs_bi <- as.list(org.Sc.sgdGO2ALLORFS)[[term]]
> allorfs_bi <- as.data.frame(allorfs_bi)
> colnames(allorfs_bi) <- c("ORF")
> allorfs_bi <- allorfs_bi %>% dplyr::distinct(ORF) # remove duplicate ORF rows
> allorfs_bi$COMMON <- AnnotationDbi::mapIds(
+ x = org.Sc.sgd.db,
+ keys=allorfs_bi$ORF,
+ keytype = "ORF",
+ column="COMMON"
+ )
'select()' returned 1:many mapping between keys and columns
> allorfs_bi <- dplyr::arrange(allorfs_bi,COMMON)
> allorfs_bi
ORF COMMON
1 YOR178C GAC1
2 YPR184W GDB1
3 YER054C GIP2
4 YEL011W GLC3
5 YMR311C GLC8
6 YKR058W GLG1
7 YJL137C GLG2
8 YFR015C GSY1
9 YLR258W GSY2
10 YFR017C IGD1
11 YGL134W PCL10
12 YER059W PCL6
13 YIL050W PCL7
14 YPL219W PCL8
15 YKL127W PGM1
16 YMR105C PGM2
17 YPL031C PHO85
18 YLR273C PIG1
19 YIL045W PIG2
20 YAL017W PSK1
21 YOL045W PSK2
22 YER129W SAK1
23 YKL035W UGP1
> all(allorfs_bi$ORF == allorfs_sel$ORF) # TRUE: just checking that select() and Bimap methods are consistent
[1] TRUE
> # making a dataframe of ORFs from the tables downloaded from [the SGD website][1]
> sgdpath1 <- "GO:0005978/glycogen_biosynthetic_process_annotations_MC.txt" # <-- table of manually curated annotations
> sgdpath2 <- "GO:0005978/glycogen_biosynthetic_process_annotations_COMP.txt" # <-- table of computational annotations
> SGD1 <- read.table(sgdpath1, sep="\t", comment.char = c("!"), header = TRUE)
> SGD2 <- read.table(sgdpath2, sep="\t", comment.char = c("!"), header = TRUE)
> SGD <- rbind(SGD1,SGD2) %>% dplyr::distinct(Gene.Complex,.keep_all = TRUE) # make one dataframe with non-redundant ORFs
> SGD
Gene.Complex Systematic.Name.Complex.Accession Qualifier Gene.Ontology.Term.ID Gene.Ontology.Term
1 GLC3 YEL011W involved in GO:0005978 glycogen biosynthetic process
2 GSY1 YFR015C involved in GO:0005978 glycogen biosynthetic process
3 GLG2 YJL137C involved in GO:0005978 glycogen biosynthetic process
4 UGP1 YKL035W involved in GO:0005978 glycogen biosynthetic process
5 PGM1 YKL127W involved in GO:0005978 glycogen biosynthetic process
6 GLG1 YKR058W involved in GO:0005978 glycogen biosynthetic process
7 GSY2 YLR258W involved in GO:0005978 glycogen biosynthetic process
8 PGM2 YMR105C involved in GO:0005978 glycogen biosynthetic process
9 GLC8 YMR311C involved in GO:0005978 glycogen biosynthetic process
10 IGD1 YFR017C involved in GO:0005978 glycogen biosynthetic process
11 PIG1 YLR273C involved in GO:0005978 glycogen biosynthetic process
12 GAC1 YOR178C involved in GO:0005978 glycogen biosynthetic process
13 GDB1 YPR184W involved in GO:0005978 glycogen biosynthetic process
Aspect Annotation.Extension Evidence Method Source Assigned.On
1 biological process NA IMP manually curated SGD 2013-08-07
2 biological process NA IMP manually curated SGD 2013-08-07
3 biological process NA IGI with GLG1 manually curated SGD 2019-05-14
4 biological process NA IMP manually curated SGD 2013-08-07
5 biological process NA IGI with UGP1 manually curated SGD 2013-08-07
6 biological process NA IGI with GLG2 manually curated SGD 2019-05-14
7 biological process NA IMP manually curated SGD 2013-08-07
8 biological process NA IGI with PGM1 manually curated SGD 2013-08-07
9 biological process NA TAS manually curated SGD 2013-08-07
10 biological process NA IEA with KW-0320 computational UniProt 2025-03-03
11 biological process NA IEA with KW-0320 computational UniProt 2025-03-03
12 biological process NA IEA with KW-0320 computational UniProt 2025-03-03
13 biological process NA IEA with IPR006421 computational InterPro 2025-03-03
> # check concordancy between package annotations and SGD annotations
> all(SGD$Systematic.Name.Complex.Accession %in% allorfs_sel$ORF) # --> TRUE: as expected, ORFs reported by SGD are all reported by the package
[1] TRUE
> all(allorfs_sel$ORF %in% SGD$Systematic.Name.Complex.Accession) # --> FALSE: ORFs extracted with "GOALL" are NOT all in the SGD tables
[1] FALSE
> # retrieve ORFs not found in SGD tables
> surplus <- allorfs_sel$ORF[!(allorfs_sel$ORF %in% SGD$Systematic.Name.Complex.Accession)]
> surplus <- as.data.frame(surplus)
> colnames(surplus) <- c("ORF")
> surplus$COMMON <- AnnotationDbi::mapIds(
+ x = org.Sc.sgd.db,
+ keys=surplus$ORF,
+ keytype = "ORF",
+ column="COMMON"
+ )
'select()' returned 1:many mapping between keys and columns
> surplus
ORF COMMON
1 YER054C GIP2
2 YGL134W PCL10
3 YER059W PCL6
4 YIL050W PCL7
5 YPL219W PCL8
6 YPL031C PHO85
7 YIL045W PIG2
8 YAL017W PSK1
9 YOL045W PSK2
10 YER129W SAK1
These ORFs on yeastgenome.com are all annotated as either "involved in glycogen metabolic process" or "involved in negative regulation of glycogen biosynthetic process".
Therefore my questions are:
1) Is this intended behavior, have I misinterpreted the manual?
2) Wouldn't it make this mapping unsuitable for GSEA? Because genes which "negatively regulate" the pathway are expressed in an opposite way to the pathway genes which actually perform the synthesis. How could you tell whether "glycogen biosynthesis" is being activated if you pool together the expression of biosynthesis genes and repression genes?
Thank you
Davide
> sessionInfo()
R version 4.4.3 (2025-02-28)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=it_IT.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Rome
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] reprex_2.1.1 GO.db_3.20.0 dplyr_1.1.4 org.Sc.sgd.db_3.20.0 AnnotationDbi_1.68.0
[6] IRanges_2.40.1 S4Vectors_0.44.0 Biobase_2.66.0 BiocGenerics_0.52.0
loaded via a namespace (and not attached):
[1] generics_0.1.3 RSQLite_2.3.9 digest_0.6.37 magrittr_2.0.3 evaluate_1.0.3
[6] fastmap_1.2.0 blob_1.2.4 jsonlite_1.9.0 processx_3.8.6 GenomeInfoDb_1.42.3
[11] DBI_1.2.3 ps_1.9.0 httr_1.4.7 UCSC.utils_1.2.0 Biostrings_2.74.1
[16] cli_3.6.4 rlang_1.1.5 crayon_1.5.3 XVector_0.46.0 bit64_4.6.0-1
[21] yaml_2.3.10 withr_3.0.2 cachem_1.1.0 tools_4.4.3 memoise_2.0.1
[26] GenomeInfoDbData_1.2.13 vctrs_0.6.5 R6_2.6.1 png_0.1-8 lifecycle_1.0.4
[31] zlibbioc_1.52.0 KEGGREST_1.46.0 fs_1.6.5 bit_4.5.0.1 callr_3.7.6
[36] pkgconfig_2.0.3 clipr_0.8.0 pillar_1.10.1 glue_1.8.0 xfun_0.51
[41] tibble_3.2.1 tidyselect_1.2.1 rstudioapi_0.17.1 knitr_1.49 htmltools_0.5.8.1
[46] rmarkdown_2.29 compiler_4.4.3