Question

readGenbank can't handle multiple contigs?

0

Entering edit mode

sean.c.m.booth • 0

@seancmbooth-23506

Last seen 5.0 years ago

I'm trying to build 2 custom BLAST databases from the DNA sequences and protein sequences of Pseudomonas genomes from pseudomonas.com

I've downloaded the genomes in genbank format, and now I want to parse them to extract the DNA or protein sequences that I'll use for database generation.

The problem is, readGenbank from genbankr is throwing an error when I try to load the lower quality draft genomes. See the below code for a reprex. The error I'm getting is:

Error in (function (classes, fdef, mtable)  : 
unable to find an inherited method for function ‘seqinfo’ for signature ‘"NULL"’

parseGenbank works, but shows that the problem seems to be in parsing from the FEATURES line to the ORIGIN line since the DNA sequence is being picked up as part of the FEATURES, leaving ORIGIN empty, which I assume is what causes the above error, since no sequence gets parsed when 'seqinfo' gets run it finds a NULL value.

I tried looking in the code of readGenbank, which led me to the parseGenbank function, which I think is where the actual problem lies. For some reason it is unable to parse the draft genome genbank files I'm interested in. Is this a bug or a problem with the formatting of the genbank files? Does anyone know how I can get around it?

Thanks!

Reprex: either download the whole genome from pseudomonas.com or use the short genbank formatted data pasted here as a minimal genbank file able to reproduce the error.

library(genbankr)
download.file("http://www.pseudomonas.com/downloads/pseudomonas/pgd_r_19_1/Pseudomonas_aeruginosa_00041437_6471/Pseudomonas_aeruginosa_00041437_6471.gbk.gz","test.gbk.gz")
gb <- readGenbank("test.gbk.gz")
#error will happen here
#for better diagnostics
readGenbank("mintest.gbk")
parseGenbank("mintest.gbk")

copy and paste the following into a file and call it 'mintest.gbk'

LOCUS       NZ_NSRH01000041.1        805 bp    DNA              UNK 03-OCT-2019
DEFINITION  Pseudomonas aeruginosa strain 001-6 IPC988_41.1, whole genome
            shotgun sequence.
ACCESSION   NZ_NSRH01000041
VERSION     NZ_NSRH01000041.1
DBLINK      Assembly: GCF_003835295.1
            BioSample: SAMN07424126
            BioProject: PRJNA224116
KEYWORDS    Pseudomonas Genome DB Version: 19.1.
SOURCE      Pseudomonas aeruginosa
  ORGANISM  Pseudomonas aeruginosa
            .
FEATURES             Location/Qualifiers
     pseudo          1..805
                     /gene="filamentous hemagglutinin (pseudogene)"
                     /locus_tag="IPC988_RS29780"
                     /db_xref="Pseudomonas Genome DB: PGD126368844"
ORIGIN
        1 gcctggacag ccgggccggc aacctcgacc tgcagagcgg cagcctcgac aacggcgccg
       61 gcggcgtgct caacagcgcc wagggttggc tgaagctggt caccgggctg ttcgacaaca
      121 gcgccggcgt cacccaggcg cagtcgctgg agattcgcgc cgggcaaggc gtgcgcaacc
      181 agcagggcca cctctcggcg ctgggcggcg acaaccgcat cgtcaccgcc gacttcgaca
      241 accagggcgg cggcctctac gccagcggcc tgctcagcct cgacggccag cgcttcctca
      301 accagggcgc ggcggcgggc cagggcggca aggtcggcgc cgggcgcatc gacttcagcc
      361 tggccggcgc gctggccaac cgcttcggcc agttggaaag cgagagcgag ctgcacctgc
      421 gcgccgccgc gatcgacaac agcggcggca gcctgcgcgc cctcggccgc agcggcagca
      481 cgcggctggt cgctggcgac ctgaacaacg cctacggcgt gctggaaagc gccaaccagg
      541 acctcgacct gcaactgggc agcctggcca acgccggcgg gcgcatcctc cacactggca
      601 acggcacctt cggcctggat tccgggcagg tgatccgcgc cggcggcgaa ctgaccacca
      661 atggcctgct ggacatccgc gccagcgaat ggaccaacag cagcgtgctg caagccggac
      721 gcctgaacct ggacatcggc accttccgcc agacggccga gggcaagctg ctggcggtgc
      781 agtccttcac tggccgcggc ggcga
//

if you don't copy and paste the example into a textfile and instead run this, you get slightly different parseGenbank output but readGenbank throws the same error

string <- 'LOCUS       NZ_NSRH01000041.1        805 bp    DNA              UNK 03-OCT-2019
DEFINITION  Pseudomonas aeruginosa strain 001-6 IPC988_41.1, whole genome
            shotgun sequence.
ACCESSION   NZ_NSRH01000041
VERSION     NZ_NSRH01000041.1
DBLINK      Assembly: GCF_003835295.1
            BioSample: SAMN07424126
            BioProject: PRJNA224116
KEYWORDS    Pseudomonas Genome DB Version: 19.1.
SOURCE      Pseudomonas aeruginosa
  ORGANISM  Pseudomonas aeruginosa
            .
FEATURES             Location//Qualifiers
     pseudo          1..805
                     //gene=/"filamentous hemagglutinin (pseudogene)/"
                     //locus_tag=/"IPC988_RS29780/"
                     //db_xref=/"Pseudomonas Genome DB: PGD126368844/"
ORIGIN
        1 gcctggacag ccgggccggc aacctcgacc tgcagagcgg cagcctcgac aacggcgccg
       61 gcggcgtgct caacagcgcc wagggttggc tgaagctggt caccgggctg ttcgacaaca
      121 gcgccggcgt cacccaggcg cagtcgctgg agattcgcgc cgggcaaggc gtgcgcaacc
      181 agcagggcca cctctcggcg ctgggcggcg acaaccgcat cgtcaccgcc gacttcgaca
      241 accagggcgg cggcctctac gccagcggcc tgctcagcct cgacggccag cgcttcctca
      301 accagggcgc ggcggcgggc cagggcggca aggtcggcgc cgggcgcatc gacttcagcc
      361 tggccggcgc gctggccaac cgcttcggcc agttggaaag cgagagcgag ctgcacctgc
      421 gcgccgccgc gatcgacaac agcggcggca gcctgcgcgc cctcggccgc agcggcagca
      481 cgcggctggt cgctggcgac ctgaacaacg cctacggcgt gctggaaagc gccaaccagg
      541 acctcgacct gcaactgggc agcctggcca acgccggcgg gcgcatcctc cacactggca
      601 acggcacctt cggcctggat tccgggcagg tgatccgcgc cggcggcgaa ctgaccacca
      661 atggcctgct ggacatccgc gccagcgaat ggaccaacag cagcgtgctg caagccggac
      721 gcctgaacct ggacatcggc accttccgcc agacggccga gggcaagctg ctggcggtgc
      781 agtccttcac tggccgcggc ggcga
//'
parseGenbank(text=string)
readGenbank(text=string)

This issue is using genbankr_1.12.0

genbankr parsing • 865 views

ADD COMMENT • link 5.0 years ago sean.c.m.booth • 0