forgeBSgenomeDataPkgFromNCBI: Error in find_NCBI_assembly_ftp_dir unable to find FTP dir
Geetha • 0

Hi All,

I am trying to create a BSgenome object for the horse genome, but I am getting the following error. Has anyone faced a similar issue? Any help would be appreciated. Thanks!

> forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCA_002863925.1",
+                              organism="Equus caballus")
Error in find_NCBI_assembly_ftp_dir(assembly_accession, assembly_name = assembly_name) : 
  unable to find FTP dir for assembly GCA_002863925.1 in

> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/;  LAPACK version 3.8.0

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] MethylSeekR_1.40.0   mhsmm_0.4.21         mvtnorm_1.3-2        tidyr_1.3.1          dplyr_1.1.4          readr_2.1.5         
 [7] BSgenomeForge_1.0.1  BSgenome_1.68.0      rtracklayer_1.60.1   Biostrings_2.68.1    XVector_0.40.0       GenomicRanges_1.52.1
[13] GenomeInfoDb_1.36.4  IRanges_2.34.1       S4Vectors_0.38.2     BiocGenerics_0.46.0 

loaded via a namespace (and not attached):
 [1] utf8_1.2.3                  generics_0.1.3              bitops_1.0-9                lattice_0.21-8             
 [5] hms_1.1.3                   magrittr_2.0.3              grid_4.3.0                  Matrix_1.5-4               
 [9] restfulr_0.0.15             purrr_1.0.1                 fansi_1.0.4                 XML_3.99-0.17              
[13] codetools_0.2-19            abind_1.4-8                 cli_3.6.1                   rlang_1.1.1                
[17] crayon_1.5.2                Biobase_2.60.0              DelayedArray_0.26.7         yaml_2.3.7                 
[21] S4Arrays_1.0.6              tools_4.3.0                 tzdb_0.4.0                  BiocParallel_1.34.2        
[25] GenomeInfoDbData_1.2.10     Rsamtools_2.16.0            SummarizedExperiment_1.30.2 vctrs_0.6.5                
[29] R6_2.5.1                    BiocIO_1.10.0               matrixStats_1.4.1           lifecycle_1.0.3            
[33] zlibbioc_1.46.0             pkgconfig_2.0.3             pillar_1.9.0                glue_1.6.2                 
[37] tibble_3.2.1                GenomicAlignments_1.36.0    tidyselect_1.2.1            rstudioapi_0.14            
[41] MatrixGenerics_1.12.3       rjson_0.2.23                compiler_4.3.0              RCurl_1.98-1.16
Last seen 6 days ago
Seattle, WA, United States


Have you tried the RefSeq accession GCF_002863925.1 instead? It is registered in GenomeInfoDb (as reported by GenomeInfoDb::registered_NCBI_assemblies("equus")), so you don't need to specify the organism or circ_seqs arguments:

forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCF_002863925.1", pkg_maintainer="Jane Doe <>")
# trying URL ''
# Content type 'application/x-gzip' length 802074530 bytes (764.9 MB)
# ==================================================
# downloaded 764.9 MB
# Creating package in ./BSgenome.Ecaballus.NCBI.EquCab3.0
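
If it helps, here is a minimal sketch for checking whether an accession is registered before forging. The helper name is_registered_assembly is hypothetical (it is not part of GenomeInfoDb), and the sketch assumes the data frame returned by registered_NCBI_assemblies() has an assembly_accession column:

```r
library(GenomeInfoDb)

# Hypothetical helper, not part of GenomeInfoDb: returns TRUE if the
# accession appears among the assemblies registered for the organism.
# Assumes the returned data frame has an 'assembly_accession' column.
is_registered_assembly <- function(accession, organism) {
  registered <- registered_NCBI_assemblies(organism)
  accession %in% registered$assembly_accession
}

# RefSeq and GenBank accessions for the horse assembly discussed here
is_registered_assembly("GCF_002863925.1", "equus")
is_registered_assembly("GCA_002863925.1", "equus")
```

For an accession that is not registered, you would supply organism and circ_seqs yourself, as in the GenBank call in this answer.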

Using the GenBank accession also works for me, but note that the GenBank assembly seems to be missing the MT chromosome:

forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCA_002863925.1", organism="Equus caballus", pkg_maintainer="Jane Doe <>", circ_seqs=character(0))
# trying URL ''
# Content type 'application/x-gzip' length 802051120 bytes (764.9 MB)
# ==================================================
# downloaded 764.9 MB
# Creating package in ./BSgenome.Ecaballus.NCBI.EquCab3.0

Alternatively, note that GCF_002863925.1 is the same as the UCSC equCab3 genome, so you could also use the following:

forgeBSgenomeDataPkgFromUCSC("equCab3", "Equus caballus", pkg_maintainer="Jane Doe <>")
# trying URL ''
# Content type 'unknown' length 653547323 bytes (623.3 MB)
# ==================================================
# downloaded 623.3 MB
# Creating package in ./BSgenome.Ecaballus.UCSC.equCab3

UCSC uses a different chromosome naming scheme, but you can switch back and forth between UCSC and NCBI names with the seqlevelsStyle() getter/setter on the BSgenome object.
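
To illustrate the seqlevelsStyle() getter/setter mentioned above, here is a small sketch on a toy GRanges object, so it runs without the forged package. The toy ranges are made up for illustration. The commented lines at the end show how the same pattern would presumably look on the BSgenome object, assuming the forged package has been built and installed:

```r
library(GenomeInfoDb)
library(GenomicRanges)

# Toy ranges with UCSC-style sequence names (illustrative data only)
gr <- GRanges(seqnames = c("chr1", "chr2", "chrX"),
              ranges = IRanges(start = c(1, 1, 1), width = 100))

seqlevelsStyle(gr)            # getter: reports the current naming style
seqlevelsStyle(gr) <- "NCBI"  # setter: switch to NCBI-style names
seqlevels(gr)                 # inspect the renamed sequence levels

# Assumption: the forged package is installed. Then, on the BSgenome object:
# library(BSgenome.Ecaballus.UCSC.equCab3)
# genome <- BSgenome.Ecaballus.UCSC.equCab3
# seqlevelsStyle(genome) <- "NCBI"
```

On an object whose genome is set (such as the forged BSgenome), the setter may need to look up the name mapping online, so network access is probably required there.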

Please open an issue on GitHub if you're still having problems with this.



