How does the maintenance of org.Sc.sgd.db work? Is there a timeline for updating? I believe that the current s.serevisiae S288C_R64 annotation set is 3.1.
http://sgd-archive.yeastgenome.org/sequence/S288C_reference/genome_releases/
but the org.Sc.sgd.db is 2.1.
Somewhat related, I have been trying to test out a couple of the functional analysis suites -- EnrichmentBrowser and GeneTonic among them -- and have run into the same error with using org.Sc.sgd.db. The same error occurs in pcaExplorer::pca2go
.
I believe each has topGO as a dependency, and I get the same error in topGO. So, my suspicion is that topGO somewhere requires SYMBOL
, which is not a column of the s.cer org package:
> columns(org.Sc.sgd.db)
[1] "ALIAS" "COMMON" "DESCRIPTION" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
[10] "EVIDENCEALL" "GENENAME" "GO" "GOALL" "INTERPRO" "ONTOLOGY" "ONTOLOGYALL" "ORF" "PATH"
[19] "PFAM" "PMID" "REFSEQ" "SGD" "SMART" "UNIPROT"
I haven't found a method in the topGO arguments to correctly re-direct whatever it is that SYMBOL
is used for, and I have found that SYMBOL
is actually hardcoded into some of these packages built on top of topGO
(pcaExplorer
, for example).
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] pcaExplorer_2.20.1 topGO_2.46.0 SparseM_1.81 GO.db_3.14.0
[5] graph_1.72.0 here_1.0.1 org.Sc.sgd.db_3.14.0 AnnotationDbi_1.56.2
[9] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
[13] readr_2.1.2 tidyr_1.2.0 tibble_3.1.6 ggplot2_3.3.5
[17] tidyverse_1.3.1 DESeq2_1.34.0 SummarizedExperiment_1.24.0 Biobase_2.54.0
[21] MatrixGenerics_1.6.0 matrixStats_0.61.0 GenomicRanges_1.46.1 GenomeInfoDb_1.30.1
[25] IRanges_2.28.0 S4Vectors_0.32.3 BiocGenerics_0.40.0
loaded via a namespace (and not attached):
[1] GOstats_2.60.0 readxl_1.3.1 backports_1.4.1 BiocFileCache_2.2.1 NMF_0.23.0
[6] plyr_1.8.6 igraph_1.2.11 lazyeval_0.2.2 GSEABase_1.56.0 shinydashboard_0.7.2
[11] splines_4.1.2 BiocParallel_1.28.3 crosstalk_1.2.0 gridBase_0.4-7 digest_0.6.29
[16] foreach_1.5.2 htmltools_0.5.2 viridis_0.6.2 fansi_1.0.2 magrittr_2.0.2
[21] memoise_2.0.1 cluster_2.1.2 doParallel_1.0.17 limma_3.50.0 tzdb_0.2.0
[26] Biostrings_2.62.0 annotate_1.72.0 modelr_0.1.8 prettyunits_1.1.1 colorspace_2.0-2
[31] blob_1.2.2 rvest_1.0.2 rappdirs_0.3.3 ggrepel_0.9.1 haven_2.4.3
[36] xfun_0.29 crayon_1.4.2 RCurl_1.98-1.5 jsonlite_1.7.3 genefilter_1.76.0
[41] survival_3.2-13 iterators_1.0.14 glue_1.6.1 registry_0.5-1 gtable_0.3.0
[46] zlibbioc_1.40.0 XVector_0.34.0 webshot_0.5.2 DelayedArray_0.20.0 Rgraphviz_2.38.0
[51] scales_1.1.1 pheatmap_1.0.12 DBI_1.1.2 rngtools_1.5.2 Rcpp_1.0.8
[56] viridisLite_0.4.0 xtable_1.8-4 progress_1.2.2 bit_4.0.4 DT_0.20
[61] AnnotationForge_1.36.0 htmlwidgets_1.5.4 httr_1.4.2 threejs_0.3.3 shinyAce_0.4.1
[66] RColorBrewer_1.1-2 ellipsis_0.3.2 pkgconfig_2.0.3 XML_3.99-0.8 dbplyr_2.1.1
[71] locfit_1.5-9.4 utf8_1.2.2 tidyselect_1.1.1 rlang_1.0.1 reshape2_1.4.4
[76] later_1.3.0 munsell_0.5.0 cellranger_1.1.0 tools_4.1.2 cachem_1.0.6
[81] cli_3.1.1 generics_0.1.2 RSQLite_2.2.9 broom_0.7.12 shinyBS_0.61
[86] evaluate_0.14 fastmap_1.1.0 heatmaply_1.3.0 knitr_1.37 bit64_4.0.5
[91] fs_1.5.2 dendextend_1.15.2 KEGGREST_1.34.0 RBGL_1.70.0 mime_0.12
[96] xml2_1.3.3 biomaRt_2.50.3 compiler_4.1.2 rstudioapi_0.13 plotly_4.10.0
[101] filelock_1.0.2 curl_4.3.2 png_0.1-7 reprex_2.0.1 geneplotter_1.72.0
[106] stringi_1.7.6 lattice_0.20-45 Matrix_1.4-0 vctrs_0.3.8 pillar_1.7.0
[111] lifecycle_1.0.1 data.table_1.14.2 bitops_1.0-7 seriation_1.3.1 httpuv_1.6.5
[116] R6_2.5.1 TSP_1.1-11 promises_1.2.0.1 gridExtra_2.3 codetools_0.2-18
[121] assertthat_0.2.1 Category_2.60.0 pkgmaker_0.32.2 rprojroot_2.0.2 withr_2.4.3
[126] GenomeInfoDbData_1.2.7 parallel_4.1.2 hms_1.1.1 grid_4.1.2 rmarkdown_2.11
[131] shiny_1.7.1 lubridate_1.8.0 base64enc_0.1-3
As James correctly pointed out, it can be that the column names across different species are just not the same, but this is due to the different sources one needs to pick all the information from. Having the whole wealth of information in an annotation package is quite a convenience, and I guess that a change in the naming scheme for the different identifiers might break some existing code?
The get_annotation_orgdb function is nothing but a wrapper to mapIds - you mentioned the erroneous behavior in this issue. For that, you can just replace the call with the explicit lines James suggested you in his reply.
From my
pcaExplorer
side, I'll try to have it flagged somewhere. It is not topGO requesting a SYMBOL column, but rather "our task" to provide the right kind of identifiers, and tell the functions to handle them as expected.Thank you very much. I really appreciate the help.
the link to SGD above shows that the S288C_R64 annotation set 3.1 were released 2021-04, which I take to mean that they should be in the bioconductor release from October?
In any case, I am more or less happy to know that they do get updated, in some way shape or fashion, with each release, if there are updates to be had.
No, the link you provided shows the genome releases. As I already explained, that is not the same thing as an 'annotation' release.
Understood, though I think my point stands. The
org.Sc.sgd.db
package currently has data from 2019, which is genome version 2.1.This is where I am misunderstanding -- if the org package tracks with the bioconductor updates, then it should have genome version 3.1 and the corresponding annotations, since those were released by SGD in April of 2021, no?
Here is a better way of asking this: it seems like what you're saying is that there are no new annotations in the 3-1 genome build annotation set, and so no reason to update the org package. Is that what I am to understand?
I don't think that is the case. Here is an example of a
ID
in 3-1 that is not in 2-1 (admittedly, I could maybe find a more exciting example. there are a number):The
OrgDb
packages are meant to only provide annotation data, not positional data. There are positional data in these packages, but that's because we originally only had theOrgDb
packages but have since introduced theTxDb
packages that are meant to provide positional data. It is difficult to change these things; some of the existing codebase still references positional data using theOrgDb
packages, and to remove the positional data requires someone (lately me) to go through all the code used to build the packages, as well as the code meant to interact with the packages and remove that code without negatively affecting all the other code that is still supposed to work. But since the positional data are now in theTxDb
packages, it's not updated in theOrgDb
.Changing things like that is a non-trivial task. It took me 80 hours for both the Fall 2020 and Spring 2021 releases to make similarly trivial appearing changes. So, while expunging the positional data from the
OrgDb
packages is a goal, it's not clear when (or if) that will happen.Which brings me to another point. We generate these annotation packages by first downloading data from various sources (for yeast, mostly sgd-archive.yeastgenome.org), parsing, and then loading the data into SQLite databases. For any of this to work, several things have to be true. First, the location of the data has to remain relatively constant. We use bash scripts to get the data, and they are semi-sophisticated, but if things change dramatically at the FTP site, then the scripts stop working. Second, the format of the data has to remain constant. The scripts used to parse the data make assumptions, based on the previous versions of the files. If things change the assumptions are no longer valid. Third, the data has to be updated. In Fall 2020 I spent 80 hours updating the code to parse the GO data because the GO people stopped providing the data in SQL dump format, and only provided updated data in their OWL format.
For the yeast
OrgDb
, we primarily use the SGD_features.tab file, which has this line:Where RDN58-1 is in the position we parse as the 'ORF'. So this happens:
However
The SGD_features.tab file apparently hasn't been updated since 2019, and in the interim it appears they have changed the ORF to YNCL0014C. It may be that the SGD_features.tab is no longer being updated and we may need to use a different file for these data.
I can appreciate how complicated this is -- very frustrating, in fact, and its either all of our faults or none of our faults. It seems like a dumb system to keep so many different annotation sets for model organisms, in particular. We should play more nicely with one another and just agree on a single source, set of unique IDs, and update schedule. We could choose a neutral location for it -- Switzerland, or Turkmenistan, which also has declared its political neutrality, come to mind.
But, more to the point: I do appreciate how much time this has to take to create and maintain. Like everyone else on here I suspect, I very much like bioconductor software, particularly the 'core' software. I am a believer. I'd also like to be able to convince others, especially in the lab where I work, that they should be fully embracing these resources. But, it becomes difficult to make that argument with issues like these. That isn't a criticism at all -- it really is meant as a thank you for those 80 hours spent on this annotation set (and all the hours maintaining other packages here), and an expressed concern that without more regular updates/maintenance (now specifically back to
org.Sc.sgd.db
. I have no complaints about the update schedule generally) and maybe stronger object/class definitions/templates/whatever, there is a risk that these resources aren't as widely utilized as they should be.Separating the name mapping objects from the coordinates seems like a great idea. For what it is worth, that gets a big +1 from me. I think a required set of keytypes for all of the 'name mapping' objects is worthwhile, so that any package built on one of the Org packages, for example, can expect that a key SYMBOL exists.
I have been finding issues reports regarding this
SYMBOL
issue for yeast all over the place -- seems like this has been going on for some time. Here is another example, and I have noticed that this exact question has been asked a number of times on this board also (if I had noticed that before, I probably wouldn't have posted, so apologies for the duplication):https://github.com/YuLab-SMU/ReactomePA/issues/12#issuecomment-470726547