Hi, I am trying to retrieve gene sequences from AnnotationHub using ensemblb's getGenomeTwoBitFile()
for many non-human species and am getting similar errors stating that there is no genome assembly fasta file available for these organisms. I tried with multiple versions of the AnnotationHub EnsDb object for each species and continue to get the same type of error. My code works fine for the latest human genome version 109 and is able to fetch version 95 of the Mus musculus genome (but it doesn't find anything more updated than that for mouse). Is there anything I can do to change this or am I out of luck? Ideally I would like to retrieve selected gene sequences from all species in the Ensembl database so a universal solution would be much appreciated if possible. I know I can do this using biomaRt but I need an alternative with different caching behavior, which is why I started looking into the ensembldb method. Thanks in advance for your assistance!
Edit: it's actually only retrieving v95 of the human twobit file as well, not v109 as I originally stated.
> ah <- AnnotationHub::AnnotationHub(cache = ".cache/annotationhub")
> xtah <- ah[["AH83359"]]
downloading 1 resources
retrieving 1 resource
|=============================================================================================================================================| 100%
loading from cache
> xtgenome <- ensembldb::getGenomeTwoBitFile(xtah)
snapshotDate(): 2022-10-31
Error in .local(x, ...) :
No genome assembly fasta file available for organism: Xenopus tropicalis, data provider: Ensembl and genome version: Xenopus_tropicalis_v9.1!
> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: AlmaLinux 9.2 (Turquoise Kodkod)
Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.21.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomartr_1.0.4 rtracklayer_1.58.0 ensembldb_2.22.0 AnnotationFilter_1.22.0 GenomicFeatures_1.50.4 GenomicRanges_1.50.2
[7] GenomeInfoDb_1.34.9 AnnotationDbi_1.60.2 IRanges_2.32.0 S4Vectors_0.36.2 Biobase_2.58.0 BiocGenerics_0.44.0
[13] readr_2.1.4 biomaRt_2.54.1
loaded via a namespace (and not attached):
[1] ProtGenerics_1.30.0 bitops_1.0-7 matrixStats_1.0.0 bit64_4.0.5
[5] filelock_1.0.2 progress_1.2.2 httr_1.4.7 tools_4.2.2
[9] utf8_1.2.3 R6_2.5.1 DBI_1.1.3 lazyeval_0.2.2
[13] withr_2.5.0 tidyselect_1.2.0 prettyunits_1.1.1 bit_4.0.5
[17] curl_5.0.2 compiler_4.2.2 cli_3.6.1 xml2_1.3.5
[21] DelayedArray_0.24.0 rappdirs_0.3.3 stringr_1.5.0 digest_0.6.33
[25] Rsamtools_2.14.0 XVector_0.38.0 pkgconfig_2.0.3 htmltools_0.5.6
[29] MatrixGenerics_1.10.0 BSgenome_1.66.3 dbplyr_2.3.3 fastmap_1.1.1
[33] rlang_1.1.1 RSQLite_2.3.1 shiny_1.7.5 BiocIO_1.8.0
[37] generics_0.1.3 jsonlite_1.8.7 BiocParallel_1.32.6 vroom_1.6.3
[41] dplyr_1.1.3 RCurl_1.98-1.12 magrittr_2.0.3 GenomeInfoDbData_1.2.9
[45] Matrix_1.5-1 Rcpp_1.0.11 fansi_1.0.4 lifecycle_1.0.3
[49] stringi_1.7.12 yaml_2.3.7 SummarizedExperiment_1.28.0 zlibbioc_1.44.0
[53] BiocFileCache_2.6.1 AnnotationHub_3.6.0 grid_4.2.2 blob_1.2.4
[57] parallel_4.2.2 promises_1.2.1 crayon_1.5.2 lattice_0.20-45
[61] Biostrings_2.66.0 hms_1.1.3 KEGGREST_1.38.0 pillar_1.9.0
[65] rjson_0.2.21 codetools_0.2-18 XML_3.99-0.14 glue_1.6.2
[69] BiocVersion_3.16.0 downloader_0.4 data.table_1.14.8 renv_1.0.2
[73] BiocManager_1.30.22 png_0.1-8 vctrs_0.6.3 tzdb_0.4.0
[77] httpuv_1.6.11 purrr_1.0.2 tidyr_1.3.0 cachem_1.0.8
[81] mime_0.12 xtable_1.8-4 restfulr_0.0.15 later_1.3.1
[85] tibble_3.2.1 GenomicAlignments_1.34.1 memoise_2.0.1 ellipsis_0.3.2
[89] interactiveDisplayBase_1.36.0
Oh wait. You want the entire genome, not the cDNA, right? In that case you probably have to go to their ftp site, download all the chromosomes and then convert to a TwoBit file
I didn't know I could do that, thank you! However, I am looking for the gene (exon + intron) sequences, not the cDNA. So I am trying to retrieve the primary assembly genomic DNA twobit files. Looking at the records in your
z
object, it looks like it only has the toplevel DNA objects, which might work for my purposes but are going to be substantially larger. I think that's prohibitive for my use-case (needing a lot of species) so if you have any suggestions for retrieving the primary assembly DNA I would greatly appreciate it!Edit: I think I also need the AnnotationHub annotation object from which I am pulling exon coordinates to match the AnnotationHub object from which I pull the gene sequences, so that the genome versions are the same. Using ensembldb to get the twobit files allows me to do that, if I am understanding it correctly.
Ensembl says this:
Which seems to be a different thing than what you think of as a primary assembly? But if you just mean the standard chromosomes, then for X. tropicalis, the toplevel are the same as primary assembly.
Ah, thank you! This was very helpful - "If the primary assembly file is not present, that indicates that there are no haplotype/patch regions, and the 'toplevel' file is equivalent."
There is a primary assembly for X. tropicalis, but it's not available on
AnnotationHub
. It's not what I think you think it is though.That's just a chromosome, not the gene sequences (not a thing, btw. There are transcript sequences, but a gene is a simplification of all the transcripts and doesn't really exist).
Yes, I know the gene sequences don't really exist per se. I'm having to build them by taking the start position of the first exon and the end position of the last exon and then extracting that range from the chromosomal sequences, or by using
ensembldb::genes()
to get a GRanges filter for the specific genes I'm interested in and then applying that to the twobit object withBSgenome::getSeq()
. It's for a very specific use-case that just requires knowledge of each exon's position within the overall gene, not knowledge of how the exons relate to each other in terms of transcripts. I can get the exon position info using the AnnotationHub object andensembldb::exons()
and write that to a gff3 file, but I need the corresponding gene sequences in fasta format as well. So it's making my life a bit harder given that there's no easy method of fetching those sequences from a cached/downloaded genome file.This is code that's working as intended, but it doesn't function for many of the non-human genomes and actually only gives v95 of the twobit file, not v109 which is the AnnotationHub version returned by AH109606. Hence my original question about the versions returned and the availability of non-human genomes.
It returns the following:
If you have any suggestions for easier ways to do this, I'm all ears! Thanks again :)
The easy way would involve other people having already converted the existing FASTA files into 2bit and uploading to the AnnotationHub. But there are only a few species for which that has been done, so the easy way may not exist.
If this is something you have to do repeatedly for the same species, then it might make sense to download the FASTA files from Ensembl and convert to 2bit format yourself. Or even if you have to do it once, maybe it's worth the effort? There is the
BSgenomeForge
package, which will automatically download FASTA files from NCBI or UCSC and makeBSgenome
packages. That might not be directly relevant for you, although maybe it is? There are any number of differences between Ensembl and NCBI, as far as transcripts and genes go, but I don't know if there are any real differences between the genomes.But anyway, at the very least
BSgenomeForge
has functions to download and convert FASTA files that you could copy and edit to get from Ensembl, and then maybe it's just a matter of firing the code off on a Friday and hoping it's done by Monday.I think your original suggestion of querying AnnotationHub directly for the twobit files is probably the way to go - it looks like they exist for all the species in the main Ensembl database. Do you happen to know what the difference between the records marked
dna_rm
anddna_sm
is? i.e.AH107025 | Xenopus_tropicalis.UCB_Xtro_10.0.dna_rm.toplevel.2bit
vs.AH107026 | Xenopus_tropicalis.UCB_Xtro_10.0.dna_sm.toplevel.2bit
? Thanks again!FILE NAMES
The files are consistently named following this pattern: <species>.<assembly>.<sequence type>.<id type>.<id>.fa.gz
<species>: The systematic name of the species.
<assembly>: The assembly build name.
<sequence type>: