Question

Clarification on GOALL / GO2ALLORFs mapping

0

Entering edit mode

DavideSant • 0

@2ab2ab24

Last seen 14 hours ago

Italy

Hello everybody, sorry if this post is wrong for some reason, it's my first time posting here.

I have a question regarding the association between GO and ORFs provided by the org.Sc.sgd.db package. In particular, the GO2ALLORFs mapping. In the package manual, it is stated:

"org.Sc.sgdGO2ALLORFS is an R object that provides mappings between a given GO identifier and all ORF identifiers annotated at that GO term or one of its children in the GO ontology. ... For GO2ALL style mappings, the intention is to return all the genes that are the same kind of term as the parent term (based on the idea that they are more specific descriptions of the general term). However because of this intent, not all relationship types will be counted as offspring for this mapping. Only "is a" and "has a" style relationships indicate that the genes from the child terms would be the same kind of thing."

Based on this description, I would expect a mapping towards genes which describe subcategories of the BP, not genes with regulation activity on the BP. For instance querying for ORFs associated with the BP "sterol biosynthetic process" should also return all ORFs that are associated to "ergosterol biosynthetic process", since it is a sterol biosynthetic process. Conversely, it should not return ORFs with "regulates" connection to the BP. Yet, GO2ALLORFS returns ORFs that are annotated as "regulates" or even "negatively regulates" with respect to the queried biological process GO.

As an example, GO:0005978 glycogen biosynthetic process does not, on the SGD website, have any listed children. However, the mapping GO2ALLORFs also returns ORFs whose sole BP annotation on the SGD website is "involved in negative regulation of glycogen biosynthetic process".

> library(AnnotationDbi)
> library(org.Sc.sgd.db)
> library(GO.db)
> library(dplyr)

> setwd("/home/davide/Dottorato/R useful commands/Reprex org.Sc glycerol/")
> set.seed(1)


> term <- "GO:0005978"
> AnnotationDbi::select(GO.db,keys=term,keytype = "GOID",columns=c("TERM","ONTOLOGY"))
'select()' returned 1:1 mapping between keys and columns
        GOID                          TERM ONTOLOGY
1 GO:0005978 glycogen biosynthetic process       BP

> # with select ()
> allorfs_sel <- AnnotationDbi::select(
+   x = org.Sc.sgd.db,
+   keys=term,
+   keytype = "GOALL",
+   column=c("ORF")
+ ) %>% dplyr::distinct(ORF) # remove duplicate ORF rows
'select()' returned 1:many mapping between keys and columns
> allorfs_sel$COMMON <- AnnotationDbi::mapIds(
+   x = org.Sc.sgd.db,
+   keys=allorfs_sel$ORF,
+   keytype = "ORF",
+   column="COMMON"
+ )
'select()' returned 1:many mapping between keys and columns
> allorfs_sel <- dplyr::arrange(allorfs_sel,COMMON)
> allorfs_sel
       ORF COMMON
1  YOR178C   GAC1
2  YPR184W   GDB1
3  YER054C   GIP2
4  YEL011W   GLC3
5  YMR311C   GLC8
6  YKR058W   GLG1
7  YJL137C   GLG2
8  YFR015C   GSY1
9  YLR258W   GSY2
10 YFR017C   IGD1
11 YGL134W  PCL10
12 YER059W   PCL6
13 YIL050W   PCL7
14 YPL219W   PCL8
15 YKL127W   PGM1
16 YMR105C   PGM2
17 YPL031C  PHO85
18 YLR273C   PIG1
19 YIL045W   PIG2
20 YAL017W   PSK1
21 YOL045W   PSK2
22 YER129W   SAK1
23 YKL035W   UGP1

> # as Bimap (sgdGO2ALLORFS)
> allorfs_bi <- as.list(org.Sc.sgdGO2ALLORFS)[[term]]
> allorfs_bi <- as.data.frame(allorfs_bi)
> colnames(allorfs_bi) <- c("ORF")
> allorfs_bi <- allorfs_bi %>% dplyr::distinct(ORF) # remove duplicate ORF rows
> allorfs_bi$COMMON <- AnnotationDbi::mapIds(
+   x = org.Sc.sgd.db,
+   keys=allorfs_bi$ORF,
+   keytype = "ORF",
+   column="COMMON"
+ )
'select()' returned 1:many mapping between keys and columns
> allorfs_bi <- dplyr::arrange(allorfs_bi,COMMON)
> allorfs_bi
       ORF COMMON
1  YOR178C   GAC1
2  YPR184W   GDB1
3  YER054C   GIP2
4  YEL011W   GLC3
5  YMR311C   GLC8
6  YKR058W   GLG1
7  YJL137C   GLG2
8  YFR015C   GSY1
9  YLR258W   GSY2
10 YFR017C   IGD1
11 YGL134W  PCL10
12 YER059W   PCL6
13 YIL050W   PCL7
14 YPL219W   PCL8
15 YKL127W   PGM1
16 YMR105C   PGM2
17 YPL031C  PHO85
18 YLR273C   PIG1
19 YIL045W   PIG2
20 YAL017W   PSK1
21 YOL045W   PSK2
22 YER129W   SAK1
23 YKL035W   UGP1

> all(allorfs_bi$ORF == allorfs_sel$ORF) # TRUE: just checking that select() and Bimap methods are consistent
[1] TRUE

> # making a dataframe of ORFs from the tables downloaded from [the SGD website][1]
> sgdpath1 <- "GO:0005978/glycogen_biosynthetic_process_annotations_MC.txt" # <-- table of manually curated annotations
> sgdpath2 <- "GO:0005978/glycogen_biosynthetic_process_annotations_COMP.txt" # <-- table of computational annotations
> SGD1 <- read.table(sgdpath1, sep="\t", comment.char = c("!"), header = TRUE)
> SGD2 <- read.table(sgdpath2, sep="\t", comment.char = c("!"), header = TRUE)
> SGD <- rbind(SGD1,SGD2) %>% dplyr::distinct(Gene.Complex,.keep_all = TRUE) # make one dataframe with non-redundant ORFs
> SGD
   Gene.Complex Systematic.Name.Complex.Accession   Qualifier Gene.Ontology.Term.ID            Gene.Ontology.Term
1          GLC3                           YEL011W involved in            GO:0005978 glycogen biosynthetic process
2          GSY1                           YFR015C involved in            GO:0005978 glycogen biosynthetic process
3          GLG2                           YJL137C involved in            GO:0005978 glycogen biosynthetic process
4          UGP1                           YKL035W involved in            GO:0005978 glycogen biosynthetic process
5          PGM1                           YKL127W involved in            GO:0005978 glycogen biosynthetic process
6          GLG1                           YKR058W involved in            GO:0005978 glycogen biosynthetic process
7          GSY2                           YLR258W involved in            GO:0005978 glycogen biosynthetic process
8          PGM2                           YMR105C involved in            GO:0005978 glycogen biosynthetic process
9          GLC8                           YMR311C involved in            GO:0005978 glycogen biosynthetic process
10         IGD1                           YFR017C involved in            GO:0005978 glycogen biosynthetic process
11         PIG1                           YLR273C involved in            GO:0005978 glycogen biosynthetic process
12         GAC1                           YOR178C involved in            GO:0005978 glycogen biosynthetic process
13         GDB1                           YPR184W involved in            GO:0005978 glycogen biosynthetic process
               Aspect Annotation.Extension           Evidence           Method   Source Assigned.On
1  biological process                   NA                IMP manually curated      SGD  2013-08-07
2  biological process                   NA                IMP manually curated      SGD  2013-08-07
3  biological process                   NA      IGI with GLG1 manually curated      SGD  2019-05-14
4  biological process                   NA                IMP manually curated      SGD  2013-08-07
5  biological process                   NA      IGI with UGP1 manually curated      SGD  2013-08-07
6  biological process                   NA      IGI with GLG2 manually curated      SGD  2019-05-14
7  biological process                   NA                IMP manually curated      SGD  2013-08-07
8  biological process                   NA      IGI with PGM1 manually curated      SGD  2013-08-07
9  biological process                   NA                TAS manually curated      SGD  2013-08-07
10 biological process                   NA   IEA with KW-0320    computational  UniProt  2025-03-03
11 biological process                   NA   IEA with KW-0320    computational  UniProt  2025-03-03
12 biological process                   NA   IEA with KW-0320    computational  UniProt  2025-03-03
13 biological process                   NA IEA with IPR006421    computational InterPro  2025-03-03

> # check concordancy between package annotations and SGD annotations
> all(SGD$Systematic.Name.Complex.Accession %in% allorfs_sel$ORF) # --> TRUE: as expected, ORFs reported by SGD are all reported by the package 
[1] TRUE
> all(allorfs_sel$ORF %in% SGD$Systematic.Name.Complex.Accession) # --> FALSE: ORFs extracted with "GOALL" are NOT all in the SGD tables
[1] FALSE

> # retrieve ORFs not found in SGD tables
> surplus <- allorfs_sel$ORF[!(allorfs_sel$ORF %in% SGD$Systematic.Name.Complex.Accession)]
> surplus <- as.data.frame(surplus)
> colnames(surplus) <- c("ORF")
> surplus$COMMON <- AnnotationDbi::mapIds(
+   x = org.Sc.sgd.db,
+   keys=surplus$ORF,
+   keytype = "ORF",
+   column="COMMON"
+ )
'select()' returned 1:many mapping between keys and columns
> surplus
       ORF COMMON
1  YER054C   GIP2
2  YGL134W  PCL10
3  YER059W   PCL6
4  YIL050W   PCL7
5  YPL219W   PCL8
6  YPL031C  PHO85
7  YIL045W   PIG2
8  YAL017W   PSK1
9  YOL045W   PSK2
10 YER129W   SAK1

These ORFs on yeastgenome.com are all annotated as either "involved in glycogen metabolic process" or "involved in negative regulation of glycogen biosynthetic process".

Therefore my questions are:

1) Is this intended behavior, have I misinterpreted the manual?

2) Wouldn't it make this mapping unsuitable for GSEA? Because genes which "negatively regulate" the pathway are expressed in an opposite way to the pathway genes which actually perform the synthesis. How could you tell whether "glycogen biosynthesis" is being activated if you pool together the expression of biosynthesis genes and repression genes?

Thank you

Davide

> sessionInfo()
R version 4.4.3 (2025-02-28)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=it_IT.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Rome
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reprex_2.1.1         GO.db_3.20.0         dplyr_1.1.4          org.Sc.sgd.db_3.20.0 AnnotationDbi_1.68.0
[6] IRanges_2.40.1       S4Vectors_0.44.0     Biobase_2.66.0       BiocGenerics_0.52.0 

loaded via a namespace (and not attached):
 [1] generics_0.1.3          RSQLite_2.3.9           digest_0.6.37           magrittr_2.0.3          evaluate_1.0.3         
 [6] fastmap_1.2.0           blob_1.2.4              jsonlite_1.9.0          processx_3.8.6          GenomeInfoDb_1.42.3    
[11] DBI_1.2.3               ps_1.9.0                httr_1.4.7              UCSC.utils_1.2.0        Biostrings_2.74.1      
[16] cli_3.6.4               rlang_1.1.5             crayon_1.5.3            XVector_0.46.0          bit64_4.6.0-1          
[21] yaml_2.3.10             withr_3.0.2             cachem_1.1.0            tools_4.4.3             memoise_2.0.1          
[26] GenomeInfoDbData_1.2.13 vctrs_0.6.5             R6_2.6.1                png_0.1-8               lifecycle_1.0.4        
[31] zlibbioc_1.52.0         KEGGREST_1.46.0         fs_1.6.5                bit_4.5.0.1             callr_3.7.6            
[36] pkgconfig_2.0.3         clipr_0.8.0             pillar_1.10.1           glue_1.8.0              xfun_0.51              
[41] tibble_3.2.1            tidyselect_1.2.1        rstudioapi_0.17.1       knitr_1.49              htmltools_0.5.8.1      
[46] rmarkdown_2.29          compiler_4.4.3

org.Sc.sgd.db • 83 views

ADD COMMENT • link updated 8 hours ago by James W. MacDonald 68k • written 1 day ago by DavideSant • 0

score 0 · Answer 1 · 2025-03-14

It appears that 'negative regulation of glycogen biosynthesis' has an 'is a' relationship with 'glycogen biosynthetic process' https://www.yeastgenome.org/go/GO:0045719

I wouldn't necessarily expect that any analysis using GO terms would incorporate directionality however. While you may be able to point to some terms like this, where one could argue that a negative regulation child term shouldn't be included in its parent, the hypergeometric test doesn't account for that scenario. GSEA doesn't either. The idea for both tests is to identify genes in a term that appear to be over-represented in some way (higher proportion in the significant genes for the hypergeometric, and higher in a ranked list of genes for GSEA).

The SPIA package does take directionality into account, but only in the context of KEGG pathways, where a given gene directly down-regulates a part of the pathway. It also ignores the sort of directionality you point out in this particular GO term.