Hello,
I ran into a puzzling situation with AnnotationHub when trying to retrieve updated annotations for rat. NCBI released a new gene model set for rat at the end of July (beginning of August by the time it propagated through their FTP server), which we used for a recent RNA-Seq experiment. Bioconductor's org.Rn.eg.db package was created back in March/April 2016, so it is missing ~700 new genes. I tried using AnnotationHub to get updated annotations. However, even though snapshotDate() is 2016-08-15, which should be just after the updated annotations were released, the OrgDb retrieved for rat has an older EGSOURCEDATE (2015-Aug11) than org.Rn.eg.db does (2016-Mar14). I checked mouse and it has the same problem. Why are the OrgDbs in AnnotationHub not current?
Thanks,
Jenny
> library(AnnotationHub)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit

> library(org.Rn.eg.db)
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'.

Attaching package: ‘Biobase’

The following object is masked from ‘package:AnnotationHub’:

    cache

Loading required package: IRanges
Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    colMeans, colSums, expand.grid, rowMeans, rowSums

> library(org.Mm.eg.db)
>
> ah = AnnotationHub()
snapshotDate(): 2016-08-15
>
> #See what they have for Rattus norvegicus, from NCBI and OrgDB
>
> query(ah, c("OrgDB", "NCBI", "Rattus norvegicus"))
AnnotationHub with 1 record
# snapshotDate(): 2016-08-15
# names(): AH49585
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Rattus norvegicus
# $rdataclass: OrgDb
# $title: org.Rn.eg.db.sqlite
# $description: NCBI gene ID based annotations about Rattus norvegicus
# $taxonomyid: 10116
# $genome: NCBI genomes
# $sourcetype: NCBI/ensembl
# $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.ensembl.org/pub/current_fasta
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: NCBI, Gene, Annotation
# retrieve record with 'object[["AH49585"]]'
>
> ah[["AH49585"]]
loading from cache ‘C:/Users/drnevich/Documents/AppData/.AnnotationHub/56315’
OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: RAT_DB
| ORGANISM: Rattus norvegicus
| SPECIES: Rat
| EGSOURCEDATE: 2015-Aug11
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 10116
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
| GOSOURCEDATE: 20150808
| GOEGSOURCEDATE: 2015-Aug11
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Rattus norvegicus)
| GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/rn6
| GPSOURCEDATE: 2014-Aug1
| ENSOURCEDATE: 2015-Jul16
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Thu Aug 20 15:37:19 2015

Please see: help('select') for usage information
>
> #compare EGSOURCEDATE with org.Rn.eg.db:
>
> org.Rn.eg.db
OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: RAT_DB
| ORGANISM: Rattus norvegicus
| SPECIES: Rat
| EGSOURCEDATE: 2016-Mar14
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 10116
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
| GOSOURCEDATE: 20160305
| GOEGSOURCEDATE: 2016-Mar14
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Rattus norvegicus)
| GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/rn6
| GPSOURCEDATE: 2014-Aug1
| ENSOURCEDATE: 2016-Mar9
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Wed Mar 23 15:52:15 2016

Please see: help('select') for usage information
>
> #Try mouse:
>
> query(ah, c("OrgDB", "NCBI", "Mus musculus"))
AnnotationHub with 1 record
# snapshotDate(): 2016-08-15
# names(): AH49583
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Mus musculus
# $rdataclass: OrgDb
# $title: org.Mm.eg.db.sqlite
# $description: NCBI gene ID based annotations about Mus musculus
# $taxonomyid: 10090
# $genome: NCBI genomes
# $sourcetype: NCBI/ensembl
# $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.ensembl.org/pub/current_fasta
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: NCBI, Gene, Annotation
# retrieve record with 'object[["AH49583"]]'
>
> ah[["AH49583"]]
loading from cache ‘C:/Users/drnevich/Documents/AppData/.AnnotationHub/56313’
OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: MOUSE_DB
| ORGANISM: Mus musculus
| SPECIES: Mouse
| EGSOURCEDATE: 2015-Aug11
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 10090
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
| GOSOURCEDATE: 20150808
| GOEGSOURCEDATE: 2015-Aug11
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Mus musculus)
| GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10
| GPSOURCEDATE: 2012-Mar8
| ENSOURCEDATE: 2015-Jul16
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Thu Aug 20 15:49:03 2015

Please see: help('select') for usage information
>
> #compare EGSOURCEDATE with org.Mm.eg.db:
>
> org.Mm.eg.db
OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: MOUSE_DB
| ORGANISM: Mus musculus
| SPECIES: Mouse
| EGSOURCEDATE: 2016-Mar14
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 10090
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
| GOSOURCEDATE: 20160305
| GOEGSOURCEDATE: 2016-Mar14
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Mus musculus)
| GPSOURCEURL: ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10
| GPSOURCEDATE: 2012-Mar8
| ENSOURCEDATE: 2016-Mar9
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Wed Mar 23 15:59:16 2016

Please see: help('select') for usage information
>
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] org.Mm.eg.db_3.3.0   org.Rn.eg.db_3.3.0   AnnotationDbi_1.34.4 IRanges_2.6.1
[5] S4Vectors_0.10.3     Biobase_2.32.0       AnnotationHub_2.4.2  BiocGenerics_0.18.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7                  digest_0.6.10
 [3] mime_0.5                     R6_2.1.3
 [5] xtable_1.8-2                 DBI_0.5
 [7] RSQLite_1.0.0                BiocInstaller_1.22.3
 [9] httr_1.2.1                   curl_1.2
[11] tools_3.3.1                  shiny_0.13.2
[13] httpuv_1.3.3                 htmltools_0.3.5
[15] interactiveDisplayBase_1.10.3
The standard organism OrgDbs in our repo
http://www.bioconductor.org/packages/release/BiocViews.html#___OrgDb
are built from data downloaded from multiple sources (UCSC, NCBI, Ensembl, etc.). The other, non-standard organism OrgDbs in AnnotationHub are made with makeOrgPackageFromNCBI(), which downloads from
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
ftp://ftp.geneontology.org/pub/go/godata
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping
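
If neither the hub record nor the installed package is recent enough, one option may be to compare the source dates programmatically and then build an OrgDb from the current NCBI files yourself. The following is only a rough sketch, not the procedure used to build the hub records. The helper names parse_source_date(), egsource_date() and build_rat_orgdb() are made up for illustration, the version/maintainer/author values are placeholders, and the makeOrgPackageFromNCBI() arguments are as I understand the AnnotationForge documentation, so please check them against your installed version. The build downloads several large files and can take a long time.

```r
# Parse dates written in the "2016-Mar14" style used by EGSOURCEDATE.
# Assumption: an English (or C) locale, so "Mar", "Aug", ... are recognised.
parse_source_date <- function(x) {
  as.Date(x, format = "%Y-%b%d")
}

# Hypothetical helper: pull EGSOURCEDATE out of an OrgDb's metadata table.
# Needs AnnotationDbi attached (loading any org.*.eg.db package does this);
# metadata() returns a data.frame with 'name' and 'value' columns.
egsource_date <- function(orgdb) {
  md <- metadata(orgdb)
  parse_source_date(md$value[md$name == "EGSOURCEDATE"])
}

# The two dates reported in the session above
hub_date <- parse_source_date("2015-Aug11")  # AH49585 from AnnotationHub
pkg_date <- parse_source_date("2016-Mar14")  # org.Rn.eg.db 3.3.0
pkg_is_newer <- pkg_date > hub_date

# Hypothetical wrapper: build a rat OrgDb package from the current NCBI files.
# Not called here because it downloads a lot of data.
build_rat_orgdb <- function(out_dir = tempdir()) {
  AnnotationForge::makeOrgPackageFromNCBI(
    version    = "0.0.1",                        # placeholder
    maintainer = "Your Name <you@example.org>",  # placeholder
    author     = "Your Name",                    # placeholder
    outputDir  = out_dir,
    tax_id     = "10116",                        # Rattus norvegicus
    genus      = "Rattus",
    species    = "norvegicus"
  )
}
```

With org.Rn.eg.db loaded, egsource_date(org.Rn.eg.db) and egsource_date(ah[["AH49585"]]) should give the same two dates without copying them by hand.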
Valerie