Hello,
I'm trying to analyse the results from a RNASeq experiment of Bacillus subtilis subspecies py79. Initially, the analysis was done by someone else, but on the wrong subspecies, I am now trying to redo it on the right subspecies. The problem occurs when I run the lines ensDbFromGff("my_gff_file.gff3")
. Initially the gff file for the task was fetched from https://bacteria.ensembl.org/species.html?search=Bacillus+subtilis, however, this database does not contain the file for the subspecies we require. It can instead be retrieved from here https://www.ncbi.nlm.nih.gov/nuccore/CP006881.1?report=genbank&to=4033459. The problems occur since the files I get from NCBI causes various errors.
Initially, I go to that NCBI page, and the "Send to > File > GFF3" to download the file (called "sequence.gff3"). I then run:
library("ensembldb")
bsbSq1 <- ensDbFromGff("sequence.gff3")
which yields the error message:
Error in .checkExtractVersions(gff, organism, genomeVersion, version) :
The file name does not match the expected naming scheme of Ensembl files (<organism>.<genome version>.<Ensembl version>) hence I cannot extract any information from it! Parameters 'organism', 'genomeVersion' and 'version' are thus required!
To continue I then rename the file to "Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.38.chromosome.Chromosome.gff3" (this is the name from the file from ensemb, which work when given as input)
bsbSq1 <- ensDbFromGff("Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.38.chromosome.Chromosome.gff3")
which yields the error message:
Error in ensDbFromGff("Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.38.chromosome.Chromosome.gff3") :
File Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.38.chromosome.Chromosome.gff3 does not seem to be a correct GFF file! The ##gff-version line is missing!
turns out the beginning of the said file is
##sequence-region CP006881.1 1 4033459
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1415167
CP006881.1 Genbank region 1 4033459 . + . ID=CP006881.1:1..4033459;Dbxref=taxon:1415167;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=PY79
so I try to change it to what again worked in the previous file:
##gff-version 3
#!genome-build European Nucleotide Archive ASM904v1
#!genome-version ASM904v1
#!genome-date 2015-02
#!genome-build-accession GCA_000009045.1
#!genebuild-last-updated 2015-02
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=14151671415167
CP006881.1 Genbank region 1 4033459 . + . ID=CP006881.1:1..4033459;Dbxref=taxon:1415167;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=PY79
and run
bsbSq1 <- ensDbFromGff("Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.38.chromosome.Chromosome.gff3")
which yields the error message:
Importing GFF ... OK
Error in ensDbFromGff("Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.39.chromosome.Chromosome.gff3") :
Required columns/fields gene_id;transcript_id;exon_id;rank;biotype not present in the GFF file!
turns out that in the file from ensemble there are pieces of information like "gene_id=BSU00010;", which the NCBI file lack. So even if my rewriting of the filename and header might not have been entirely correct, I will end up with this error anyways.
So my question; Is there some way of successfully doing what I want to do, but using the genomic files from NCBI instead of ensembl (since my required strain does not exist at ensemble)?
Finally, the output of
sessionInfo( )
is
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=Swedish_Sweden.1252 LC_CTYPE=Swedish_Sweden.1252 LC_MONETARY=Swedish_Sweden.1252
[4] LC_NUMERIC=C LC_TIME=Swedish_Sweden.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] ensembldb_2.14.0 AnnotationFilter_1.14.0 GenomicFeatures_1.42.3 AnnotationDbi_1.52.0 Biobase_2.50.0
[6] GenomicRanges_1.42.0 GenomeInfoDb_1.26.4 IRanges_2.24.1 S4Vectors_0.28.1 BiocGenerics_0.36.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 lattice_0.20-41 prettyunits_1.1.1 Rsamtools_2.6.0
[5] Biostrings_2.58.0 assertthat_0.2.1 utf8_1.2.1 BiocFileCache_1.14.0
[9] R6_2.5.0 RSQLite_2.2.5 httr_1.4.2 pillar_1.5.1
[13] zlibbioc_1.36.0 rlang_0.4.10 progress_1.2.2 lazyeval_0.2.2
[17] curl_4.3 blob_1.2.1 Matrix_1.3-2 BiocParallel_1.24.1
[21] stringr_1.4.0 ProtGenerics_1.22.0 RCurl_1.98-1.3 bit_4.0.4
[25] biomaRt_2.46.3 DelayedArray_0.16.3 compiler_4.0.5 rtracklayer_1.49.5
[29] pkgconfig_2.0.3 askpass_1.1 openssl_1.4.3 tidyselect_1.1.0
[33] SummarizedExperiment_1.20.0 tibble_3.1.0 GenomeInfoDbData_1.2.4 matrixStats_0.58.0
[37] XML_3.99-0.6 fansi_0.4.2 crayon_1.4.1 dplyr_1.0.5
[41] dbplyr_2.1.0 GenomicAlignments_1.26.0 bitops_1.0-6 rappdirs_0.3.3
[45] grid_4.0.5 lifecycle_1.0.0 DBI_1.1.1 magrittr_2.0.1
[49] stringi_1.5.3 cachem_1.0.4 XVector_0.30.0 xml2_1.3.2
[53] ellipsis_0.3.1 generics_0.1.0 vctrs_0.3.7 tools_4.0.5
[57] bit64_4.0.5 glue_1.4.2 purrr_0.3.4 hms_1.0.0
[61] MatrixGenerics_1.2.1 fastmap_1.1.0 BiocManager_1.30.12 memoise_2.0.0
This worked, thanks a lot for the help!