forgeBSgenomeDataPkgFromNCBI: Error in find_NCBI_assembly_ftp_dir unable to find FTP dir
Geetha • 0
@1a6a0c61
Last seen 4 weeks ago
Germany

Hi All,

I am trying to create a BSgenome data package for the horse genome, but I am getting the following error. Has anyone faced a similar issue? Any help would be appreciated. Thanks!


> forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCA_002863925.1",
+                              organism="Equus caballus")
Error in find_NCBI_assembly_ftp_dir(assembly_accession, assembly_name = assembly_name) : 
  unable to find FTP dir for assembly GCA_002863925.1 in https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/863/925/

> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so;  LAPACK version 3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] MethylSeekR_1.40.0   mhsmm_0.4.21         mvtnorm_1.3-2        tidyr_1.3.1          dplyr_1.1.4          readr_2.1.5         
 [7] BSgenomeForge_1.0.1  BSgenome_1.68.0      rtracklayer_1.60.1   Biostrings_2.68.1    XVector_0.40.0       GenomicRanges_1.52.1
[13] GenomeInfoDb_1.36.4  IRanges_2.34.1       S4Vectors_0.38.2     BiocGenerics_0.46.0 

loaded via a namespace (and not attached):
 [1] utf8_1.2.3                  generics_0.1.3              bitops_1.0-9                lattice_0.21-8             
 [5] hms_1.1.3                   magrittr_2.0.3              grid_4.3.0                  Matrix_1.5-4               
 [9] restfulr_0.0.15             purrr_1.0.1                 fansi_1.0.4                 XML_3.99-0.17              
[13] codetools_0.2-19            abind_1.4-8                 cli_3.6.1                   rlang_1.1.1                
[17] crayon_1.5.2                Biobase_2.60.0              DelayedArray_0.26.7         yaml_2.3.7                 
[21] S4Arrays_1.0.6              tools_4.3.0                 tzdb_0.4.0                  BiocParallel_1.34.2        
[25] GenomeInfoDbData_1.2.10     Rsamtools_2.16.0            SummarizedExperiment_1.30.2 vctrs_0.6.5                
[29] R6_2.5.1                    BiocIO_1.10.0               matrixStats_1.4.1           lifecycle_1.0.3            
[33] zlibbioc_1.46.0             pkgconfig_2.0.3             pillar_1.9.0                glue_1.6.2                 
[37] tibble_3.2.1                GenomicAlignments_1.36.0    tidyselect_1.2.1            rstudioapi_0.14            
[41] MatrixGenerics_1.12.3       rjson_0.2.23                compiler_4.3.0              RCurl_1.98-1.16
BSgenomeForge non-modelorganism unabletofindFTPdir • 238 views
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States

Hi,

Have you tried with RefSeq accession GCF_002863925.1 instead? Since this one is registered in GenomeInfoDb (as reported by GenomeInfoDb::registered_NCBI_assemblies("equus")), you don't need to specify the organism or circ_seqs argument:

forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCF_002863925.1", pkg_maintainer="Jane Doe <janedoe@gmail.com>")
# trying URL 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/863/925/GCF_002863925.1_EquCab3.0/GCF_002863925.1_EquCab3.0_genomic.fna.gz'
# Content type 'application/x-gzip' length 802074530 bytes (764.9 MB)
# ==================================================
# downloaded 764.9 MB
#
# Creating package in ./BSgenome.Ecaballus.NCBI.EquCab3.0
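To check in advance whether an accession is registered, the lookup mentioned above can be run directly. This is a minimal sketch, assuming GenomeInfoDb is installed and that the returned data frame has an assembly_accession column, as it does in recent releases:

```r
library(GenomeInfoDb)

# List the NCBI assemblies registered in GenomeInfoDb for organisms
# whose name matches "equus"
equus <- registered_NCBI_assemblies("equus")
print(equus)

# TRUE if the accession is registered, in which case the organism and
# circ_seqs arguments can be left out when forging
is_registered <- "GCF_002863925.1" %in% equus$assembly_accession
print(is_registered)
```

If the accession is not listed, organism and circ_seqs have to be supplied explicitly, as in the GenBank call.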

Using the GenBank accession also works for me, so the error you got may not be reproducible. Note, however, that the GenBank assembly seems to be missing the MT chromosome:

forgeBSgenomeDataPkgFromNCBI(assembly_accession="GCA_002863925.1", organism="Equus caballus", pkg_maintainer="Jane Doe <janedoe@gmail.com>", circ_seqs=character(0))
# trying URL 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/863/925/GCA_002863925.1_EquCab3.0/GCA_002863925.1_EquCab3.0_genomic.fna.gz'
# Content type 'application/x-gzip' length 802051120 bytes (764.9 MB)
# ==================================================
# downloaded 764.9 MB
#
# Creating package in ./BSgenome.Ecaballus.NCBI.EquCab3.0
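If the "unable to find FTP dir" error persists on your side, it may help to check whether the directory named in the error message is reachable from your machine. The sketch below rebuilds that URL from the accession and looks for the accession in the directory listing. Both helpers are hypothetical and not part of BSgenomeForge; the URL layout is inferred from the error message above.

```r
# Hypothetical helper (not part of BSgenomeForge): build the parent FTP
# directory URL for an accession such as "GCA_002863925.1"
ncbi_assembly_parent_url <- function(accession) {
  stopifnot(grepl("^GC[AF]_[0-9]{9}\\.[0-9]+$", accession))
  prefix <- substr(accession, 1, 3)    # "GCA" or "GCF"
  digits <- substr(accession, 5, 13)   # nine-digit identifier
  # Split the nine digits into three groups of three
  parts <- substring(digits, c(1, 4, 7), c(3, 6, 9))
  paste0("https://ftp.ncbi.nlm.nih.gov/genomes/all/",
         prefix, "/", paste(parts, collapse = "/"), "/")
}

# Hypothetical helper: TRUE if the directory listing mentions the accession,
# FALSE if it does not or if the listing cannot be fetched
ftp_dir_lists_accession <- function(accession) {
  url <- ncbi_assembly_parent_url(accession)
  listing <- tryCatch(readLines(url, warn = FALSE),
                      error = function(e) character(0))
  any(grepl(accession, listing, fixed = TRUE))
}

url <- ncbi_assembly_parent_url("GCA_002863925.1")
print(url)
```

Calling ftp_dir_lists_accession("GCA_002863925.1") needs network access. A FALSE result would point to a connectivity or proxy problem rather than to a problem with the accession itself.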

Otherwise, note that GCF_002863925.1 is the same as the UCSC equCab3 genome, so you could also use the following:

forgeBSgenomeDataPkgFromUCSC("equCab3", "Equus caballus", pkg_maintainer="Jane Doe <janedoe@gmail.com>")
# trying URL 'https://hgdownload.soe.ucsc.edu/goldenPath/equCab3/bigZips/equCab3.2bit'
# Content type 'unknown' length 653547323 bytes (623.3 MB)
# ==================================================
# downloaded 623.3 MB
#
# Creating package in ./BSgenome.Ecaballus.UCSC.equCab3

Note that UCSC uses a different chromosome naming scheme, but you can switch back and forth between UCSC and NCBI names by using the seqlevelsStyle() getter/setter on the BSgenome object.
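Switching styles could look like the sketch below. It assumes the package forged above has been built and installed (for example with R CMD build and R CMD INSTALL), so the block is skipped if the package is not available. The setter may need internet access to look up the name mapping.

```r
pkg <- "BSgenome.Ecaballus.UCSC.equCab3"

# Run only if the forged package has actually been installed
if (requireNamespace(pkg, quietly = TRUE)) {
  library(BSgenome)
  genome <- getBSgenome(pkg)

  # Getter: report the current naming style and the first few names
  print(seqlevelsStyle(genome))
  print(head(seqlevels(genome)))

  # Setter: switch to NCBI names
  seqlevelsStyle(genome) <- "NCBI"
  print(head(seqlevels(genome)))

  # Switch back to UCSC names
  seqlevelsStyle(genome) <- "UCSC"
}
```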

Please open an issue on GitHub if you're still having problems with this.

H.
