Question

Why is there no UniProt information for some organisms in AnnotationHub?

wssdandan2009 • 0


Hi Marc and Others,

I am trying to use the wonderful 'AnnotationHub' package to retrieve some information, but I ran into a tricky problem: there is no UniProt information for some organisms in AnnotationHub, as shown below:

For Homo sapiens:

> library(AnnotationHub)
> hub <- AnnotationHub()
snapshotDate(): 2016-08-15

> query(hub, c("OrgDb","Homo sapiens"))
AnnotationHub with 1 record
# snapshotDate(): 2016-08-15
# names(): AH49582
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Homo sapiens
# $rdataclass: OrgDb
# $title: org.Hs.eg.db.sqlite
# $description: NCBI gene ID based annotations about Homo sapiens
# $taxonomyid: 9606
# $genome: NCBI genomes
# $sourcetype: NCBI/ensembl
# $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.ensembl.org/pub/current_fasta
# $sourcelastmodifieddate: NA
# $sourcesize: NA
# $tags: NCBI, Gene, Annotation
# retrieve record with 'object[["AH49582"]]'
> human<-hub[["AH49582"]]
loading from cache :/Users/RCPA/Documents/AppData/.AnnotationHub/56312?

> keytypes(human)
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID" "ENZYME"
[8] "EVIDENCE" "EVIDENCEALL" "GENENAME" "GO" "GOALL" "IPI" "MAP"
[15] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID" "PROSITE"
[22] "REFSEQ" "SYMBOL" "UCSCKG" "UNIGENE" "UNIPROT"

For Solanum lycopersicum:

> query(hub, c("OrgDb","Solanum lycopersicum"))
AnnotationHub with 2 records
# snapshotDate(): 2016-08-15
# $dataprovider: NCBI, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Solanum lycopersicum
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description, tags, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH13359"]]'

title
AH13359 | org.Solanum_lycopersicum.eg.sqlite
AH48047 | org.Solanum_lycopersicum.eg.sqlite

> tomato<-hub[["AH48047"]]
loading from cache :/Users/RCPA/Documents/AppData/.AnnotationHub/54353?

> keytypes(tomato)
[1] "ACCNUM" "ALIAS" "ENTREZID" "EVIDENCE" "EVIDENCEALL" "GENENAME" "GID" "GO"
[9] "GOALL" "ONTOLOGY" "ONTOLOGYALL" "PMID" "REFSEQ" "SYMBOL" "UNIGENE"

As you can see, there is unexpectedly no "UNIPROT" keytype for tomato! I think "UNIPROT" is one of the most basic pieces of information for any organism, so it should be included.

Could you suggest a workaround, or provide a way to add the "UNIPROT" information to this OrgDb?

My sessionInfo():

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] GenomeInfoDb_1.8.3 clusterProfiler_3.0.4 DOSE_2.10.7 org.Hs.eg.db_3.3.0 sqldf_0.4-10
[6] RSQLite_1.0.0 DBI_0.4-1 gsubfn_0.6-6 proto_0.3-10 AnnotationDbi_1.34.4
[11] IRanges_2.6.1 S4Vectors_0.10.2 Biobase_2.32.0 AnnotationHub_2.4.2 BiocGenerics_0.18.0

loaded via a namespace (and not attached):
[1] qvalue_2.4.2 shinyjs_0.6 reshape2_1.4.1
[4] lattice_0.20-33 splines_3.3.1 tcltk_3.3.1
[7] colorspace_1.2-6 miniUI_0.1.1 htmltools_0.3.5
[10] chron_2.3-47 interactiveDisplayBase_1.10.3 XML_3.98-1.4
[13] topGO_2.24.0 matrixStats_0.50.2 plyr_1.8.4
[16] stringr_1.0.0 munsell_0.4.3 GOSemSim_1.30.3
[19] gtable_0.2.0 SparseM_1.7 httpuv_1.3.3
[22] BiocInstaller_1.22.3 curl_1.1 GSEABase_1.34.0
[25] Rcpp_0.12.6 xtable_1.8-2 scales_0.4.0
[28] DO.db_2.9 graph_1.50.0 annotate_1.50.0
[31] mime_0.5 ggplot2_2.1.0 digest_0.6.10
[34] stringi_1.1.1 shiny_0.13.2 grid_3.3.1
[37] tools_3.3.1 magrittr_1.5 tibble_1.1
[40] GO.db_3.3.0 tidyr_0.5.1 rsconnect_0.4.3
[43] assertthat_0.1 httr_1.2.1 R6_2.1.2
[46] igraph_1.0.1

Thanks a lot for helping ^_^

Regards,

Shisheng

Updated 4.3 years ago by shepherl • written 8.6 years ago by wssdandan2009

score 1 · Answer 1 · 2016-09-21

Hi Shisheng,

The human OrgDb was made with a different set of scripts than the two tomato OrgDbs. The human OrgDb is one of the 'standard' organisms we host in our repo:

http://www.bioconductor.org/packages/release/BiocViews.html#___OrgDb

Raw data for the standard OrgDb packages are pulled from many different sources and are the most comprehensive. As a convenience, we also provide OrgDbs for 'non-standard' organisms in AnnotationHub made with AnnotationForge::makeOrgPackageFromNCBI(); these are less comprehensive and pull data primarily from NCBI.
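To see exactly how the two OrgDbs differ, you can compare their keytypes. This is only a sketch in base R: the two vectors are copied by hand from the keytypes() output in the question (the names human_keytypes and tomato_keytypes are made up for illustration); with the real objects you would call keytypes(human) and keytypes(tomato) instead.

```r
# Keytypes copied by hand from the keytypes() output in the question.
# With live objects, use keytypes(human) and keytypes(tomato) instead.
human_keytypes <- c(
  "ACCNUM", "ALIAS", "ENSEMBL", "ENSEMBLPROT", "ENSEMBLTRANS", "ENTREZID",
  "ENZYME", "EVIDENCE", "EVIDENCEALL", "GENENAME", "GO", "GOALL", "IPI",
  "MAP", "OMIM", "ONTOLOGY", "ONTOLOGYALL", "PATH", "PFAM", "PMID",
  "PROSITE", "REFSEQ", "SYMBOL", "UCSCKG", "UNIGENE", "UNIPROT"
)
tomato_keytypes <- c(
  "ACCNUM", "ALIAS", "ENTREZID", "EVIDENCE", "EVIDENCEALL", "GENENAME",
  "GID", "GO", "GOALL", "ONTOLOGY", "ONTOLOGYALL", "PMID", "REFSEQ",
  "SYMBOL", "UNIGENE"
)

# Keytypes present in the human OrgDb but absent from the tomato OrgDb.
missing_in_tomato <- setdiff(human_keytypes, tomato_keytypes)

# Keytypes present only in the tomato OrgDb.
only_in_tomato <- setdiff(tomato_keytypes, human_keytypes)

print(missing_in_tomato)
print(only_in_tomato)
```

Going by the output in the question, "UNIPROT" is one of several keytypes that the tomato OrgDb lacks, not the only one.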

makeOrgPackageFromNCBI() does download a file from UniProt, but it is only used to create the altGO data table. I'm not sure why the UniProt identifiers weren't included and exposed as a keytype. We may consider adding these in the future, but it won't happen before the release.

In the meantime, you can use the UniProt.ws package.

> library(AnnotationHub)
> hub <- AnnotationHub()
> tomato <- query(hub, c("OrgDb","Solanum lycopersicum"))

> mcols(tomato)[, c("sourcetype", "rdatadateadded")]
DataFrame with 2 rows and 2 columns
           sourcetype rdatadateadded
          <character>    <character>
AH13359 NCBI/blast2GO     2014-07-09
AH48047  NCBI/UniProt     2015-07-27

We'll use the more current NCBI/UniProt resource. First get the gene ids from the OrgDb you're working with:

> entrezid <- keys(tomato[[2]])

Create a UniProt.ws object (with the tax id if you have it):

> library(UniProt.ws)
> lookupUniprotSpeciesFromTaxId(4081)
[1] "Solanum lycopersicum"
> up <- UniProt.ws(taxId=4081)

Decide which columns you want back:

> head(columns(up))
[1] "3D"                  "AARHUS/GHENT-2DPAGE" "AGD"
[4] "ALLERGOME"           "ARACHNOSERVER"       "BIOCYC"

Call select():

> res <- select(up, keys=entrezid, keytype="ENTREZ_GENE", columns="UNIPROTKB")
Getting mapping data for 543501 ... and ACC
'select()' returned 1:many mapping between keys and columns
> dim(res)
[1] 30934     2
> head(res)
  ENTREZ_GENE UNIPROTKB
1      543501    O48645
2      543502    O82119
3      543502    Q8LRN7
4      543502    Q8LRN8
5      543506    K4B9Y9
6      543506    Q9ZWP2
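Because select() returns a 1:many mapping, you may want one entry per gene before merging the result with other tables. The following is a minimal base R sketch, not part of UniProt.ws: it re-enters the six rows from head(res) by hand so that it runs without a network connection, and it assumes both columns are character vectors, which may not match the real result exactly. The names by_gene and collapsed are made up for illustration.

```r
# First six rows of the select() result, re-entered by hand.
# Assumption: both columns are character vectors.
res <- data.frame(
  ENTREZ_GENE = c("543501", "543502", "543502", "543502", "543506", "543506"),
  UNIPROTKB   = c("O48645", "O82119", "Q8LRN7", "Q8LRN8", "K4B9Y9", "Q9ZWP2"),
  stringsAsFactors = FALSE
)

# Option 1: a named list holding every UniProt accession for each gene.
by_gene <- split(res$UNIPROTKB, res$ENTREZ_GENE)

# Option 2: one row per gene, with accessions joined by ";".
collapsed <- aggregate(UNIPROTKB ~ ENTREZ_GENE, data = res,
                       FUN = paste, collapse = ";")

print(by_gene)
print(collapsed)
```

The list form is convenient for lookups such as by_gene[["543502"]]; the collapsed data frame is easier to merge back onto a table keyed by Entrez gene ID.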

Valerie