Question

Building the EnsDb for human Ensembl version 105 with ensembldb's 'fetchTablesFromEnsembl': I'm on day two, can it be sped up?

Matthew Thornton ▴ 380

@matthew-thornton-5564

USA, Los Angeles, USC

Hello!

I've installed all of the prerequisites for using fetchTablesFromEnsembl with ensembldb to download the data for human Ensembl version 105. However, the download is taking forever, and the connection dropped once, so I had to start over. I had a similar problem with SRAtools, which was solved by using Aspera Connect. Is there any way to speed this process up, or to resume the download if the connection drops?

If I just can't get it, which is looking likely, is it better to use a GFF or a GTF file? And what would be the easiest way to add the Entrezgene IDs?

These EnsDb libraries are used in the Signac vignettes, and as more people begin to use Signac with their multiome + ATAC data, I would expect this issue to come up more often.

Any advice or help is greatly appreciated. Thank you!

> sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] ensembldb_2.18.3        AnnotationFilter_1.18.0 GenomicFeatures_1.46.5 
 [4] AnnotationDbi_1.56.2    Biobase_2.54.0          GenomicRanges_1.46.1   
 [7] GenomeInfoDb_1.30.1     IRanges_2.28.0          S4Vectors_0.32.3       
[10] BiocGenerics_0.40.0    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8                  lattice_0.20-45            
 [3] prettyunits_1.1.1           png_0.1-7                  
 [5] Rsamtools_2.10.0            Biostrings_2.62.0          
 [7] assertthat_0.2.1            digest_0.6.29              
 [9] utf8_1.2.2                  BiocFileCache_2.2.1        
[11] R6_2.5.1                    RSQLite_2.2.10             
[13] httr_1.4.2                  pillar_1.7.0               
[15] zlibbioc_1.40.0             rlang_1.0.1                
[17] progress_1.2.2              lazyeval_0.2.2             
[19] curl_4.3.2                  blob_1.2.2                 
[21] Matrix_1.4-0                BiocParallel_1.28.3        
[23] stringr_1.4.0               ProtGenerics_1.26.0        
[25] RCurl_1.98-1.6              bit_4.0.4                  
[27] biomaRt_2.50.3              DelayedArray_0.20.0        
[29] compiler_4.1.2              rtracklayer_1.54.0         
[31] pkgconfig_2.0.3             SummarizedExperiment_1.24.0
[33] tidyselect_1.1.2            KEGGREST_1.34.0            
[35] tibble_3.1.6                GenomeInfoDbData_1.2.7     
[37] matrixStats_0.61.0          XML_3.99-0.9               
[39] fansi_1.0.2                 crayon_1.5.0               
[41] dplyr_1.0.8                 dbplyr_2.1.1               
[43] GenomicAlignments_1.30.0    bitops_1.0-7               
[45] rappdirs_0.3.3              grid_4.1.2                 
[47] lifecycle_1.0.1             DBI_1.1.2                  
[49] magrittr_2.0.2              cli_3.2.0                  
[51] stringi_1.7.6               cachem_1.0.6               
[53] XVector_0.34.0              xml2_1.3.3                 
[55] ellipsis_0.3.2              filelock_1.0.2             
[57] generics_0.1.2              vctrs_0.3.8                
[59] rjson_0.2.21                restfulr_0.0.13            
[61] tools_4.1.2                 bit64_4.0.5                
[63] glue_1.6.2                  purrr_0.3.4                
[65] MatrixGenerics_1.6.0        hms_1.1.1                  
[67] parallel_4.1.2              fastmap_1.1.0              
[69] yaml_2.3.5                  memoise_2.0.1              
[71] BiocIO_1.4.0

score 2 · Accepted Answer · 2022-03-03

You shouldn't need to build that EnsDb, as it already exists in AnnotationHub.

> library(AnnotationHub)

> hub <- AnnotationHub()
> query(hub, c("ensdb","homo sapiens","105"))
AnnotationHub with 1 record
# snapshotDate(): 2021-10-20
# names(): AH98047
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2021-10-20
# $title: Ensembl 105 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("105", "Annotation", "AnnotationHubSoftware", "Coverage",
#   "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
#   "Transcript") 
# retrieve record with 'object[["AH98047"]]' 
> ensdb <- hub[["AH98047"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache
require("ensembldb")
> ensdb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.7
|Creation time: Sat Dec 18 14:48:15 2021
|ensembl_version: 105
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
| No. of genes: 69329.
| No. of transcripts: 268255.
|Protein data available.
>
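
On the Entrezgene ID part of the question: the following is a minimal sketch, not a definitive recipe. It assumes the `AH98047` record shown above is still available from AnnotationHub and that the `entrezid` column is populated in this build (checking `listColumns(ensdb)` first is a sensible precaution). The variable name `gene_tbl` is only illustrative.

```r
library(AnnotationHub)
library(ensembldb)

# Load the prebuilt Ensembl 105 EnsDb from the hub; after the first
# download the file is served from the local AnnotationHub cache.
hub <- AnnotationHub()
ensdb <- hub[["AH98047"]]

# Assumption: "entrezid" is among the columns this EnsDb provides.
stopifnot("entrezid" %in% listColumns(ensdb))

# Gene-level table pairing Ensembl gene IDs with gene names and
# Entrezgene IDs. A gene may map to no Entrez ID or to several.
gene_tbl <- genes(ensdb,
                  columns = c("gene_id", "gene_name", "entrezid"),
                  return.type = "data.frame")
head(gene_tbl)
```

Because the hub caches the downloaded SQLite file locally, rerunning `hub[["AH98047"]]` in a later session should load it from disk rather than fetch it again.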