Using ensembldb's 'fetchTablesFromEnsembl' to make the EnsDb for human Ensembl version 105: I'm on day two, can it be sped up?
@matthew-thornton-5564
Last seen 11 weeks ago
USA, Los Angeles, USC

Hello!

I've installed all of the prerequisites for using fetchTablesFromEnsembl with ensembldb to download the data for human Ensembl version 105. However, it is taking forever, and the connection dropped once, so I had to start over. I had a similar problem with SRAtools, which was solved by using Aspera Connect. Is there any way to speed this process up, or to resume the download if the connection drops?

If I just can't get it, which is looking likely, is it better to use a GFF or a GTF file? And what would be the easiest way to add the Entrez Gene IDs?

These EnsDb packages are used in the Signac vignettes, and as people begin to use Signac with their multiome and ATAC data, I would expect this issue to come up more often.

Any advice or help is greatly appreciated. Thank you!

> sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] ensembldb_2.18.3        AnnotationFilter_1.18.0 GenomicFeatures_1.46.5 
 [4] AnnotationDbi_1.56.2    Biobase_2.54.0          GenomicRanges_1.46.1   
 [7] GenomeInfoDb_1.30.1     IRanges_2.28.0          S4Vectors_0.32.3       
[10] BiocGenerics_0.40.0    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8                  lattice_0.20-45            
 [3] prettyunits_1.1.1           png_0.1-7                  
 [5] Rsamtools_2.10.0            Biostrings_2.62.0          
 [7] assertthat_0.2.1            digest_0.6.29              
 [9] utf8_1.2.2                  BiocFileCache_2.2.1        
[11] R6_2.5.1                    RSQLite_2.2.10             
[13] httr_1.4.2                  pillar_1.7.0               
[15] zlibbioc_1.40.0             rlang_1.0.1                
[17] progress_1.2.2              lazyeval_0.2.2             
[19] curl_4.3.2                  blob_1.2.2                 
[21] Matrix_1.4-0                BiocParallel_1.28.3        
[23] stringr_1.4.0               ProtGenerics_1.26.0        
[25] RCurl_1.98-1.6              bit_4.0.4                  
[27] biomaRt_2.50.3              DelayedArray_0.20.0        
[29] compiler_4.1.2              rtracklayer_1.54.0         
[31] pkgconfig_2.0.3             SummarizedExperiment_1.24.0
[33] tidyselect_1.1.2            KEGGREST_1.34.0            
[35] tibble_3.1.6                GenomeInfoDbData_1.2.7     
[37] matrixStats_0.61.0          XML_3.99-0.9               
[39] fansi_1.0.2                 crayon_1.5.0               
[41] dplyr_1.0.8                 dbplyr_2.1.1               
[43] GenomicAlignments_1.30.0    bitops_1.0-7               
[45] rappdirs_0.3.3              grid_4.1.2                 
[47] lifecycle_1.0.1             DBI_1.1.2                  
[49] magrittr_2.0.2              cli_3.2.0                  
[51] stringi_1.7.6               cachem_1.0.6               
[53] XVector_0.34.0              xml2_1.3.3                 
[55] ellipsis_0.3.2              filelock_1.0.2             
[57] generics_0.1.2              vctrs_0.3.8                
[59] rjson_0.2.21                restfulr_0.0.13            
[61] tools_4.1.2                 bit64_4.0.5                
[63] glue_1.6.2                  purrr_0.3.4                
[65] MatrixGenerics_1.6.0        hms_1.1.1                  
[67] parallel_4.1.2              fastmap_1.1.0              
[69] yaml_2.3.5                  memoise_2.0.1              
[71] BiocIO_1.4.0               
@james-w-macdonald-5106
Last seen 7 hours ago
United States

You shouldn't be building that EnsDb, as it already exists.

> library(AnnotationHub)

> hub <- AnnotationHub()
> query(hub, c("ensdb","homo sapiens","105"))
AnnotationHub with 1 record
# snapshotDate(): 2021-10-20
# names(): AH98047
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2021-10-20
# $title: Ensembl 105 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("105", "Annotation", "AnnotationHubSoftware", "Coverage",
#   "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
#   "Transcript") 
# retrieve record with 'object[["AH98047"]]' 
> ensdb <- hub[["AH98047"]]
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache
require("ensembldb")
> ensdb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.7
|Creation time: Sat Dec 18 14:48:15 2021
|ensembl_version: 105
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
| No. of genes: 69329.
| No. of transcripts: 268255.
|Protein data available.
>
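
On the Entrez Gene ID part of the question: an EnsDb retrieved from AnnotationHub already carries the Ensembl-to-Entrez mapping, so nothing has to be added by hand. The following is a minimal sketch, assuming the same AH98047 record shown above. The object names `gns` and `map` are only illustrative, and depending on the ensembldb version the `entrezid` column may come back as a list, since one Ensembl gene can map to several Entrez Gene IDs or to none.

```r
library(AnnotationHub)
library(ensembldb)

# Retrieve the prebuilt Ensembl 105 EnsDb (served from the local cache after
# the first download)
hub <- AnnotationHub()
ensdb <- hub[["AH98047"]]

# Genes as a GRanges object, with the Entrez Gene IDs as a metadata column
gns <- genes(ensdb, columns = c("gene_id", "gene_name", "entrezid"))

# Alternatively, a flat data.frame through the AnnotationDbi interface;
# the namespace prefix avoids a clash with dplyr::select if dplyr is attached
map <- AnnotationDbi::select(
    ensdb,
    keys = keys(ensdb, keytype = "GENEID"),
    keytype = "GENEID",
    columns = c("GENEID", "GENENAME", "ENTREZID")
)
head(map)
```

By contrast, an EnsDb built from a GTF or GFF file (for example with `ensDbFromGtf`) would, as far as I know, lack the Entrez Gene IDs, because those files do not contain them. That is one more reason to prefer the AnnotationHub record when it exists.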

Oh great! Thank you so much!
