I'm trying to build 2 custom BLAST databases from the DNA sequences and protein sequences of Pseudomonas genomes from pseudomonas.com
I've downloaded the genomes in genbank format, and now I want to parse them to extract the DNA or protein sequences that I'll use for database generation.
The problem is, readGenbank from genbankr is throwing an error when I try to load the lower quality draft genomes. See the below code for a reprex. The error I'm getting is:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘seqinfo’ for signature ‘"NULL"’
parseGenbank works, but shows that the problem seems to be in parsing from the FEATURES line to the ORIGIN line since the DNA sequence is being picked up as part of the FEATURES, leaving ORIGIN empty, which I assume is what causes the above error, since no sequence gets parsed when 'seqinfo' gets run it finds a NULL value.
I tried looking in the code of readGenbank, which led me to the parseGenbank function, which I think is where the actual problem lies. For some reason it is unable to parse the draft genome genbank files I'm interested in. Is this a bug or a problem with the formatting of the genbank files? Does anyone know how I can get around it?
Thanks!
Reprex: either download the whole genome from pseudomonas.com or use the short genbank formatted data pasted here as a minimal genbank file able to reproduce the error.
library(genbankr)
download.file("http://www.pseudomonas.com/downloads/pseudomonas/pgd_r_19_1/Pseudomonas_aeruginosa_00041437_6471/Pseudomonas_aeruginosa_00041437_6471.gbk.gz","test.gbk.gz")
gb <- readGenbank("test.gbk.gz")
#error will happen here
#for better diagnostics
readGenbank("mintest.gbk")
parseGenbank("mintest.gbk")
copy and paste the following into a file and call it 'mintest.gbk'
LOCUS NZ_NSRH01000041.1 805 bp DNA UNK 03-OCT-2019
DEFINITION Pseudomonas aeruginosa strain 001-6 IPC988_41.1, whole genome
shotgun sequence.
ACCESSION NZ_NSRH01000041
VERSION NZ_NSRH01000041.1
DBLINK Assembly: GCF_003835295.1
BioSample: SAMN07424126
BioProject: PRJNA224116
KEYWORDS Pseudomonas Genome DB Version: 19.1.
SOURCE Pseudomonas aeruginosa
ORGANISM Pseudomonas aeruginosa
.
FEATURES Location/Qualifiers
pseudo 1..805
/gene="filamentous hemagglutinin (pseudogene)"
/locus_tag="IPC988_RS29780"
/db_xref="Pseudomonas Genome DB: PGD126368844"
ORIGIN
1 gcctggacag ccgggccggc aacctcgacc tgcagagcgg cagcctcgac aacggcgccg
61 gcggcgtgct caacagcgcc wagggttggc tgaagctggt caccgggctg ttcgacaaca
121 gcgccggcgt cacccaggcg cagtcgctgg agattcgcgc cgggcaaggc gtgcgcaacc
181 agcagggcca cctctcggcg ctgggcggcg acaaccgcat cgtcaccgcc gacttcgaca
241 accagggcgg cggcctctac gccagcggcc tgctcagcct cgacggccag cgcttcctca
301 accagggcgc ggcggcgggc cagggcggca aggtcggcgc cgggcgcatc gacttcagcc
361 tggccggcgc gctggccaac cgcttcggcc agttggaaag cgagagcgag ctgcacctgc
421 gcgccgccgc gatcgacaac agcggcggca gcctgcgcgc cctcggccgc agcggcagca
481 cgcggctggt cgctggcgac ctgaacaacg cctacggcgt gctggaaagc gccaaccagg
541 acctcgacct gcaactgggc agcctggcca acgccggcgg gcgcatcctc cacactggca
601 acggcacctt cggcctggat tccgggcagg tgatccgcgc cggcggcgaa ctgaccacca
661 atggcctgct ggacatccgc gccagcgaat ggaccaacag cagcgtgctg caagccggac
721 gcctgaacct ggacatcggc accttccgcc agacggccga gggcaagctg ctggcggtgc
781 agtccttcac tggccgcggc ggcga
//
if you don't copy and paste the example into a textfile and instead run this, you get slightly different parseGenbank output but readGenbank throws the same error
string <- 'LOCUS NZ_NSRH01000041.1 805 bp DNA UNK 03-OCT-2019
DEFINITION Pseudomonas aeruginosa strain 001-6 IPC988_41.1, whole genome
shotgun sequence.
ACCESSION NZ_NSRH01000041
VERSION NZ_NSRH01000041.1
DBLINK Assembly: GCF_003835295.1
BioSample: SAMN07424126
BioProject: PRJNA224116
KEYWORDS Pseudomonas Genome DB Version: 19.1.
SOURCE Pseudomonas aeruginosa
ORGANISM Pseudomonas aeruginosa
.
FEATURES Location//Qualifiers
pseudo 1..805
//gene=/"filamentous hemagglutinin (pseudogene)/"
//locus_tag=/"IPC988_RS29780/"
//db_xref=/"Pseudomonas Genome DB: PGD126368844/"
ORIGIN
1 gcctggacag ccgggccggc aacctcgacc tgcagagcgg cagcctcgac aacggcgccg
61 gcggcgtgct caacagcgcc wagggttggc tgaagctggt caccgggctg ttcgacaaca
121 gcgccggcgt cacccaggcg cagtcgctgg agattcgcgc cgggcaaggc gtgcgcaacc
181 agcagggcca cctctcggcg ctgggcggcg acaaccgcat cgtcaccgcc gacttcgaca
241 accagggcgg cggcctctac gccagcggcc tgctcagcct cgacggccag cgcttcctca
301 accagggcgc ggcggcgggc cagggcggca aggtcggcgc cgggcgcatc gacttcagcc
361 tggccggcgc gctggccaac cgcttcggcc agttggaaag cgagagcgag ctgcacctgc
421 gcgccgccgc gatcgacaac agcggcggca gcctgcgcgc cctcggccgc agcggcagca
481 cgcggctggt cgctggcgac ctgaacaacg cctacggcgt gctggaaagc gccaaccagg
541 acctcgacct gcaactgggc agcctggcca acgccggcgg gcgcatcctc cacactggca
601 acggcacctt cggcctggat tccgggcagg tgatccgcgc cggcggcgaa ctgaccacca
661 atggcctgct ggacatccgc gccagcgaat ggaccaacag cagcgtgctg caagccggac
721 gcctgaacct ggacatcggc accttccgcc agacggccga gggcaagctg ctggcggtgc
781 agtccttcac tggccgcggc ggcga
//'
parseGenbank(text=string)
readGenbank(text=string)
This issue is using genbankr_1.12.0