Question

Unable to use ensembldb's "ensDbFromGff()" function on .gff file from NCBI

0

Entering edit mode

Torkel • 0

@85d013fb

Last seen 3.2 years ago

Malmo

Hello,

I'm trying to analyse the results from a RNASeq experiment of Bacillus subtilis subspecies py79. Initially, the analysis was done by someone else, but on the wrong subspecies, I am now trying to redo it on the right subspecies. The problem occurs when I run the lines ensDbFromGff("my_gff_file.gff3"). Initially the gff file for the task was fetched from https://bacteria.ensembl.org/species.html?search=Bacillus+subtilis, however, this database does not contain the file for the subspecies we require. It can instead be retrieved from here https://www.ncbi.nlm.nih.gov/nuccore/CP006881.1?report=genbank&to=4033459. The problems occur since the files I get from NCBI causes various errors.

Initially, I go to that NCBI page, and the "Send to > File > GFF3" to download the file (called "sequence.gff3"). I then run:

library("ensembldb")
bsbSq1 <- ensDbFromGff("sequence.gff3")

which yields the error message:

Error in .checkExtractVersions(gff, organism, genomeVersion, version) : 
  The file name does not match the expected naming scheme of Ensembl files (<organism>.<genome version>.<Ensembl version>) hence I cannot extract any information from it! Parameters 'organism', 'genomeVersion' and 'version' are thus required!

To continue I then rename the file to "Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.38.chromosome.Chromosome.gff3" (this is the name from the file from ensemb, which work when given as input)

bsbSq1 <- ensDbFromGff("Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.38.chromosome.Chromosome.gff3")

which yields the error message:

Error in ensDbFromGff("Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.38.chromosome.Chromosome.gff3") : 
  File Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.38.chromosome.Chromosome.gff3 does not seem to be a correct GFF file! The ##gff-version line is missing!

turns out the beginning of the said file is

##sequence-region CP006881.1 1 4033459
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1415167
CP006881.1  Genbank region  1   4033459 .   +   .   ID=CP006881.1:1..4033459;Dbxref=taxon:1415167;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=PY79

so I try to change it to what again worked in the previous file:

##gff-version 3
#!genome-build European Nucleotide Archive ASM904v1
#!genome-version ASM904v1
#!genome-date 2015-02
#!genome-build-accession GCA_000009045.1
#!genebuild-last-updated 2015-02
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=14151671415167
CP006881.1  Genbank region  1   4033459 .   +   .   ID=CP006881.1:1..4033459;Dbxref=taxon:1415167;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=PY79

and run

bsbSq1 <- ensDbFromGff("Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.38.chromosome.Chromosome.gff3")

which yields the error message:

Importing GFF ... OK
Error in ensDbFromGff("Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.39.chromosome.Chromosome.gff3") : 
  Required columns/fields gene_id;transcript_id;exon_id;rank;biotype not present in the GFF file!

turns out that in the file from ensemble there are pieces of information like "gene_id=BSU00010;", which the NCBI file lack. So even if my rewriting of the filename and header might not have been entirely correct, I will end up with this error anyways.

So my question; Is there some way of successfully doing what I want to do, but using the genomic files from NCBI instead of ensembl (since my required strain does not exist at ensemble)?

Finally, the output of

sessionInfo( )

is

R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=Swedish_Sweden.1252  LC_CTYPE=Swedish_Sweden.1252    LC_MONETARY=Swedish_Sweden.1252
[4] LC_NUMERIC=C                    LC_TIME=Swedish_Sweden.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ensembldb_2.14.0        AnnotationFilter_1.14.0 GenomicFeatures_1.42.3  AnnotationDbi_1.52.0    Biobase_2.50.0         
 [6] GenomicRanges_1.42.0    GenomeInfoDb_1.26.4     IRanges_2.24.1          S4Vectors_0.28.1        BiocGenerics_0.36.0    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6                  lattice_0.20-41             prettyunits_1.1.1           Rsamtools_2.6.0            
 [5] Biostrings_2.58.0           assertthat_0.2.1            utf8_1.2.1                  BiocFileCache_1.14.0       
 [9] R6_2.5.0                    RSQLite_2.2.5               httr_1.4.2                  pillar_1.5.1               
[13] zlibbioc_1.36.0             rlang_0.4.10                progress_1.2.2              lazyeval_0.2.2             
[17] curl_4.3                    blob_1.2.1                  Matrix_1.3-2                BiocParallel_1.24.1        
[21] stringr_1.4.0               ProtGenerics_1.22.0         RCurl_1.98-1.3              bit_4.0.4                  
[25] biomaRt_2.46.3              DelayedArray_0.16.3         compiler_4.0.5              rtracklayer_1.49.5         
[29] pkgconfig_2.0.3             askpass_1.1                 openssl_1.4.3               tidyselect_1.1.0           
[33] SummarizedExperiment_1.20.0 tibble_3.1.0                GenomeInfoDbData_1.2.4      matrixStats_0.58.0         
[37] XML_3.99-0.6                fansi_0.4.2                 crayon_1.4.1                dplyr_1.0.5                
[41] dbplyr_2.1.0                GenomicAlignments_1.26.0    bitops_1.0-6                rappdirs_0.3.3             
[45] grid_4.0.5                  lifecycle_1.0.0             DBI_1.1.1                   magrittr_2.0.1             
[49] stringi_1.5.3               cachem_1.0.4                XVector_0.30.0              xml2_1.3.2                 
[53] ellipsis_0.3.1              generics_0.1.0              vctrs_0.3.7                 tools_4.0.5                
[57] bit64_4.0.5                 glue_1.4.2                  purrr_0.3.4                 hms_1.0.0                  
[61] MatrixGenerics_1.2.1        fastmap_1.1.0               BiocManager_1.30.12         memoise_2.0.0

ensembldb RNASeq • 1.2k views

ADD COMMENT • link 3.2 years ago Torkel • 0

score 1 · Answer 1 · 2021-04-06

You are trying to pretend that your GFF is from Ensembl when it's from NCBI. There's no profit in doing that. Instead, accept the fact that it's from NCBI and proceed accordingly.

> library(GenomicFeatures)
> z <- makeTxDbFromGFF("sequence.gff3")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
> genes(z)
GRanges object with 4256 ranges and 1 metadata column:
               seqnames          ranges strand |     gene_id
                  <Rle>       <IRanges>  <Rle> | <character>
  U712_00005 CP006881.1        410-1750      + |  U712_00005
  U712_00010 CP006881.1       1939-3075      + |  U712_00010
  U712_00015 CP006881.1       3206-3421      + |  U712_00015
  U712_00020 CP006881.1       3437-4549      + |  U712_00020
  U712_00025 CP006881.1       4567-4812      + |  U712_00025
         ...        ...             ...    ... .         ...
  U712_21390 CP006881.1 3363743-3363818      - |  U712_21390
  U712_21395 CP006881.1 3972639-3972711      - |  U712_21395
  U712_21400 CP006881.1 3972750-3972823      - |  U712_21400
  U712_21405 CP006881.1 3972905-3972976      - |  U712_21405
  U712_21410 CP006881.1 3972989-3973061      - |  U712_21410
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> transcripts(z)
GRanges object with 4256 ranges and 2 metadata columns:
           seqnames          ranges strand |     tx_id     tx_name
              <Rle>       <IRanges>  <Rle> | <integer> <character>
     [1] CP006881.1        410-1750      + |         1  U712_00005
     [2] CP006881.1       1939-3075      + |         2  U712_00010
     [3] CP006881.1       3206-3421      + |         3  U712_00015
     [4] CP006881.1       3437-4549      + |         4  U712_00020
     [5] CP006881.1       4567-4812      + |         5  U712_00025
     ...        ...             ...    ... .       ...         ...
  [4252] CP006881.1 4027456-4029342      - |      4252  U712_20810
  [4253] CP006881.1 4029363-4030742      - |      4253  U712_20815
  [4254] CP006881.1 4031053-4031679      - |      4254  U712_20820
  [4255] CP006881.1 4031676-4032458      - |      4255  U712_20825
  [4256] CP006881.1 4032606-4032956      - |      4256  U712_20830
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths