makeTranscriptDbFromGFF fails on NCBI Bacteria genomes
3
0
Entering edit mode
Thomas Girke ★ 1.7k
@thomas-girke-993
Last seen 8 months ago
United States
It seems to me that makeTranscriptDbFromGFF does not yet work on the bacteria GFFs from NCBI (perhaps others too): ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ ## For instance, the following library(GenomicFeatures) download.file("ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycoplasma_arth ritidis_158L3_1_uid58005/NC_011025.gff", destfile="NC_011025.gff") txdb <- makeTranscriptDbFromGFF(file="NC_011025.gff", format="gff3", dataSource="NCBI", species="Some bact") ## returns this error: extracting transcript information Error in .prepareGFF3TXS(data) : No Transcript information present in gff file I guess this is because in bacteria GFF we don't have explicit transcript annotations. There are hacks around this problem, but it would be nice if this could be supported in the future right out of the box. I apologize if I missed an existing solution for this. Best, Thomas > sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] parallel stats graphics utils datasets grDevices methods base other attached packages: [1] GenomicFeatures_1.12.1 AnnotationDbi_1.22.0 Biobase_2.20.0 rtracklayer_1.20.1 GenomicRanges_1.12.0 IRanges_1.18.0 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] BSgenome_1.28.0 Biostrings_2.28.0 DBI_0.2-5 RCurl_1.95-4.1 RSQLite_0.11.2 Rsamtools_1.12.0 XML_3.96-1.1 biomaRt_2.16.0 bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0
• 1.7k views
ADD COMMENT
0
Entering edit mode
Thomas Girke ★ 1.7k
@thomas-girke-993
Last seen 8 months ago
United States
The perl -> genbank -> gff -> R -> txdb approach seems to work in R-2.15.1 but not in R-3.0.0 anymore. The latter returns txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact") extracting transcript information Error in .prepareGFF3TXS(data) : Unexpected transcript duplicates Given that there are over 2500 bacteria genomes annotated this way at NCBI, a more generic solution to this problem would be very useful, I think. Thomas On Fri, Jun 07, 2013 at 07:56:32PM +0000, Cook, Malcolm wrote: > FYI, bioperl includes bp_genbank2gff3.pl > > which when run as > > > bp_genbank2gff3.pl NC_011025.gbk > > produces NC_011025.gbk.gff (attached) > > which loaded without error with transcript: > > > txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact") > extracting transcript information > Extracting gene IDs > extracting transcript information > Processing splicing information for gff3 file. > Deducing exon rank from relative coordinates provided > Prepare the 'metadata' data frame ... metadata: OK > Now generating chrominfo from available sequence names. No chromosome length information is available. > Warning messages: > 1: In .deduceExonRankings(exs, format = "gff") : > Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName > 2: In matchCircularity(chroms, circ_seqs) : > None of the strings in your circ_seqs argument match your seqnames. > > txdb > TranscriptDb object: > | Db type: TranscriptDb > | Supporting package: GenomicFeatures > | Data source: NCBI > | Genus and Species: Some bact > | miRBase build ID: NA > | transcript_nrow: 631 > | exon_nrow: 631 > | cds_nrow: 631 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2013-06-07 14:52:50 -0500 (Fri, 07 Jun 2013) > | GenomicFeatures version at creation time: 1.10.2 > | RSQLite version at creation time: 0.11.2 > | DBSCHEMAVERSION: 1.0 > > > > > -----Original Message----- > From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of Cook, Malcolm > Sent: Friday, June 07, 2013 2:32 PM > To: 'Thomas Girke'; 'bioconductor at r-project.org' > Cc: 'Brandon Gallaher' > Subject: Re: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes > > Taking a quick look at the GFF in question ... I don't see any mRNA features.... they appear to be implicit .... which is not well formed GFF3 (c.f. http://www.sequenceontology.org/gff3.shtml) > > That is your first problem. > > In particular, the first gene in a file by itself is rejected. Adding the mRNA and exon lines as below, and the first gene is now accepted by makeTranscriptDbFromGFF. > > ##gff-version 3 > #!gff-spec-version 1.20 > #!processor NCBI annotwriter > ##sequence-region NC_011025.1 1 820453 > ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=243272 > NC_011025.1 RefSeq region 1 820453 . + . ID=id0;Dbxref=taxon:243272;Is_circular=true;gbkey=Src;genome=chromosom e;mol_type=genomic DNA;strain=158L3-1 > NC_011025.1 RefSeq gene 107 1471 . + . ID=gene0;Name=dnaA;Dbxref=GeneID:6418131;gbkey=Gene;gene=dnaA;locus_ta g=MARTH_orf001 > NC_011025.1 RefSeq CDS 107 1471 . + 0 ID=cds0;Name=YP_001999673.1;Parent=gene0;Note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;Dbxref=Genbank:YP_001999673.1,Gene ID:6418131;gbkey=CDS;product=chromosomal replication initiation protein;protein_id=YP_001999673.1;transl_table=4 > NC_011025.1 RefSeq mRNA 107 1471 . + 0 ID=mRNA0;Parent=gene0 > NC_011025.1 RefSeq exon 107 1471 . + 0 ID=exon0;Parent=mRNA0 > > > The question is what to do. > > Not sure. > > Any other help? > > Good luck, > > ~Malcolm > > > -----Original Message----- > From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of Thomas Girke > Sent: Friday, June 07, 2013 12:52 PM > To: bioconductor at r-project.org > Cc: Brandon Gallaher > Subject: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes > > It seems to me that makeTranscriptDbFromGFF does not yet work on the > bacteria GFFs from NCBI (perhaps others too): > ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ > > ## For instance, the following > library(GenomicFeatures) > download.file("ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycoplasma_ar thritidis_158L3_1_uid58005/NC_011025.gff", destfile="NC_011025.gff") > txdb <- makeTranscriptDbFromGFF(file="NC_011025.gff", format="gff3", dataSource="NCBI", species="Some bact") > > ## returns this error: > extracting transcript information > Error in .prepareGFF3TXS(data) : > No Transcript information present in gff file > > I guess this is because in bacteria GFF we don't have explicit > transcript annotations. There are hacks around this problem, but it > would be nice if this could be supported in the future right out of the > box. I apologize if I missed an existing solution for this. > > Best, > > Thomas > > > sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] C > > attached base packages: > [1] parallel stats graphics utils datasets grDevices methods base > > other attached packages: > [1] GenomicFeatures_1.12.1 AnnotationDbi_1.22.0 Biobase_2.20.0 rtracklayer_1.20.1 GenomicRanges_1.12.0 IRanges_1.18.0 BiocGenerics_0.6.0 > > loaded via a namespace (and not attached): > [1] BSgenome_1.28.0 Biostrings_2.28.0 DBI_0.2-5 RCurl_1.95-4.1 RSQLite_0.11.2 Rsamtools_1.12.0 XML_3.96-1.1 biomaRt_2.16.0 bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
Hi Thomas, Malcolm, This GFF3 file doesn't pass validation using the online validator at: http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online but for a minor problem that doesn't have anything to do with what is being discussed here (still, it's sad that an organization like NCBI distributes invalid GFF3 files :-/ ). Anyway you're right Thomas that it would probably make sense to be able to import those files with makeTranscriptDbFromGFF(). Don't know how hard that will be though. Marc being on vacation right now, this won't be addressed until 2 weeks from now. I don't have an easy workaround to offer in the mean time, sorry... but maybe someone has? Cheers, H. On 06/07/2013 04:33 PM, Thomas Girke wrote: > The perl -> genbank -> gff -> R -> txdb approach seems to work in > R-2.15.1 but not in R-3.0.0 anymore. The latter returns > > txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact") > extracting transcript information > Error in .prepareGFF3TXS(data) : Unexpected transcript duplicates > > Given that there are over 2500 bacteria genomes annotated this way at NCBI, > a more generic solution to this problem would be very useful, I think. > > Thomas > > > On Fri, Jun 07, 2013 at 07:56:32PM +0000, Cook, Malcolm wrote: >> FYI, bioperl includes bp_genbank2gff3.pl >> >> which when run as >> >>> bp_genbank2gff3.pl NC_011025.gbk >> >> produces NC_011025.gbk.gff (attached) >> >> which loaded without error with transcript: >> >>> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact") >> extracting transcript information >> Extracting gene IDs >> extracting transcript information >> Processing splicing information for gff3 file. >> Deducing exon rank from relative coordinates provided >> Prepare the 'metadata' data frame ... metadata: OK >> Now generating chrominfo from available sequence names. No chromosome length information is available. >> Warning messages: >> 1: In .deduceExonRankings(exs, format = "gff") : >> Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName >> 2: In matchCircularity(chroms, circ_seqs) : >> None of the strings in your circ_seqs argument match your seqnames. >>> txdb >> TranscriptDb object: >> | Db type: TranscriptDb >> | Supporting package: GenomicFeatures >> | Data source: NCBI >> | Genus and Species: Some bact >> | miRBase build ID: NA >> | transcript_nrow: 631 >> | exon_nrow: 631 >> | cds_nrow: 631 >> | Db created by: GenomicFeatures package from Bioconductor >> | Creation time: 2013-06-07 14:52:50 -0500 (Fri, 07 Jun 2013) >> | GenomicFeatures version at creation time: 1.10.2 >> | RSQLite version at creation time: 0.11.2 >> | DBSCHEMAVERSION: 1.0 >> >> >> >> >> -----Original Message----- >> From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of Cook, Malcolm >> Sent: Friday, June 07, 2013 2:32 PM >> To: 'Thomas Girke'; 'bioconductor at r-project.org' >> Cc: 'Brandon Gallaher' >> Subject: Re: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes >> >> Taking a quick look at the GFF in question ... I don't see any mRNA features.... they appear to be implicit .... which is not well formed GFF3 (c.f. http://www.sequenceontology.org/gff3.shtml) >> >> That is your first problem. >> >> In particular, the first gene in a file by itself is rejected. Adding the mRNA and exon lines as below, and the first gene is now accepted by makeTranscriptDbFromGFF. >> >> ##gff-version 3 >> #!gff-spec-version 1.20 >> #!processor NCBI annotwriter >> ##sequence-region NC_011025.1 1 820453 >> ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=243272 >> NC_011025.1 RefSeq region 1 820453 . + . ID=id0;Dbxref=taxon:243272;Is_circular=true;gbkey=Src;genome=chromosom e;mol_type=genomic DNA;strain=158L3-1 >> NC_011025.1 RefSeq gene 107 1471 . + . ID=gene0;Name=dnaA;Dbxref=GeneID:6418131;gbkey=Gene;gene=dnaA;locus_ta g=MARTH_orf001 >> NC_011025.1 RefSeq CDS 107 1471 . + 0 ID=cds0;Name=YP_001999673.1;Parent=gene0;Note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;Dbxref=Genbank:YP_001999673.1,Gene ID:6418131;gbkey=CDS;product=chromosomal replication initiation protein;protein_id=YP_001999673.1;transl_table=4 >> NC_011025.1 RefSeq mRNA 107 1471 . + 0 ID=mRNA0;Parent=gene0 >> NC_011025.1 RefSeq exon 107 1471 . + 0 ID=exon0;Parent=mRNA0 >> >> >> The question is what to do. >> >> Not sure. >> >> Any other help? >> >> Good luck, >> >> ~Malcolm >> >> >> -----Original Message----- >> From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of Thomas Girke >> Sent: Friday, June 07, 2013 12:52 PM >> To: bioconductor at r-project.org >> Cc: Brandon Gallaher >> Subject: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes >> >> It seems to me that makeTranscriptDbFromGFF does not yet work on the >> bacteria GFFs from NCBI (perhaps others too): >> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ >> >> ## For instance, the following >> library(GenomicFeatures) >> download.file("ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycoplasma_a rthritidis_158L3_1_uid58005/NC_011025.gff", destfile="NC_011025.gff") >> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gff", format="gff3", dataSource="NCBI", species="Some bact") >> >> ## returns this error: >> extracting transcript information >> Error in .prepareGFF3TXS(data) : >> No Transcript information present in gff file >> >> I guess this is because in bacteria GFF we don't have explicit >> transcript annotations. There are hacks around this problem, but it >> would be nice if this could be supported in the future right out of the >> box. I apologize if I missed an existing solution for this. >> >> Best, >> >> Thomas >> >>> sessionInfo() >> R version 3.0.0 (2013-04-03) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] C >> >> attached base packages: >> [1] parallel stats graphics utils datasets grDevices methods base >> >> other attached packages: >> [1] GenomicFeatures_1.12.1 AnnotationDbi_1.22.0 Biobase_2.20.0 rtracklayer_1.20.1 GenomicRanges_1.12.0 IRanges_1.18.0 BiocGenerics_0.6.0 >> >> loaded via a namespace (and not attached): >> [1] BSgenome_1.28.0 Biostrings_2.28.0 DBI_0.2-5 RCurl_1.95-4.1 RSQLite_0.11.2 Rsamtools_1.12.0 XML_3.96-1.1 biomaRt_2.16.0 bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY
0
Entering edit mode
Thomas Girke ★ 1.7k
@thomas-girke-993
Last seen 8 months ago
United States
No worries about the timing. I just wanted to point out that it would be worthwhile to address this some time in the future since it applies to a huge number of sequenced genomes. I totally understand the challenge Marc is facing supporting all these inconsistent data formats. Thomas On Fri, Jun 07, 2013 at 11:55:34PM +0000, Hervé Pagès wrote: > Hi Thomas, Malcolm, > > This GFF3 file doesn't pass validation using the online validator at: > > http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online > > but for a minor problem that doesn't have anything to do with what is > being discussed here (still, it's sad that an organization like NCBI > distributes invalid GFF3 files :-/ ). > > Anyway you're right Thomas that it would probably make sense to > be able to import those files with makeTranscriptDbFromGFF(). > Don't know how hard that will be though. Marc being on vacation right > now, this won't be addressed until 2 weeks from now. > > I don't have an easy workaround to offer in the mean time, sorry... > but maybe someone has? > > Cheers, > H. > > > On 06/07/2013 04:33 PM, Thomas Girke wrote: > > The perl -> genbank -> gff -> R -> txdb approach seems to work in > > R-2.15.1 but not in R-3.0.0 anymore. The latter returns > > > > txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact") > > extracting transcript information > > Error in .prepareGFF3TXS(data) : Unexpected transcript duplicates > > > > Given that there are over 2500 bacteria genomes annotated this way at NCBI, > > a more generic solution to this problem would be very useful, I think. > > > > Thomas > > > > > > On Fri, Jun 07, 2013 at 07:56:32PM +0000, Cook, Malcolm wrote: > >> FYI, bioperl includes bp_genbank2gff3.pl > >> > >> which when run as > >> > >>> bp_genbank2gff3.pl NC_011025.gbk > >> > >> produces NC_011025.gbk.gff (attached) > >> > >> which loaded without error with transcript: > >> > >>> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact") > >> extracting transcript information > >> Extracting gene IDs > >> extracting transcript information > >> Processing splicing information for gff3 file. > >> Deducing exon rank from relative coordinates provided > >> Prepare the 'metadata' data frame ... metadata: OK > >> Now generating chrominfo from available sequence names. No chromosome length information is available. > >> Warning messages: > >> 1: In .deduceExonRankings(exs, format = "gff") : > >> Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName > >> 2: In matchCircularity(chroms, circ_seqs) : > >> None of the strings in your circ_seqs argument match your seqnames. > >>> txdb > >> TranscriptDb object: > >> | Db type: TranscriptDb > >> | Supporting package: GenomicFeatures > >> | Data source: NCBI > >> | Genus and Species: Some bact > >> | miRBase build ID: NA > >> | transcript_nrow: 631 > >> | exon_nrow: 631 > >> | cds_nrow: 631 > >> | Db created by: GenomicFeatures package from Bioconductor > >> | Creation time: 2013-06-07 14:52:50 -0500 (Fri, 07 Jun 2013) > >> | GenomicFeatures version at creation time: 1.10.2 > >> | RSQLite version at creation time: 0.11.2 > >> | DBSCHEMAVERSION: 1.0 > >> > >> > >> > >> > >> -----Original Message----- > >> From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of Cook, Malcolm > >> Sent: Friday, June 07, 2013 2:32 PM > >> To: 'Thomas Girke'; 'bioconductor at r-project.org' > >> Cc: 'Brandon Gallaher' > >> Subject: Re: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes > >> > >> Taking a quick look at the GFF in question ... I don't see any mRNA features.... they appear to be implicit .... which is not well formed GFF3 (c.f. http://www.sequenceontology.org/gff3.shtml) > >> > >> That is your first problem. > >> > >> In particular, the first gene in a file by itself is rejected. Adding the mRNA and exon lines as below, and the first gene is now accepted by makeTranscriptDbFromGFF. > >> > >> ##gff-version 3 > >> #!gff-spec-version 1.20 > >> #!processor NCBI annotwriter > >> ##sequence-region NC_011025.1 1 820453 > >> ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=243272 > >> NC_011025.1 RefSeq region 1 820453 . + . ID=id0;Dbxref=taxon:243272;Is_circular=true;gbkey=Src;genome=c hromosome;mol_type=genomic DNA;strain=158L3-1 > >> NC_011025.1 RefSeq gene 107 1471 . + . ID=gene0;Name=dnaA;Dbxref=GeneID:6418131;gbkey=Gene;gene=dnaA; locus_tag=MARTH_orf001 > >> NC_011025.1 RefSeq CDS 107 1471 . + 0 ID=cds0;Name=YP_001999673.1;Parent=gene0;Note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;Dbxref=Genbank:YP_00 1999673.1,GeneID:6418131;gbkey=CDS;product=chromosomal replication initiation protein;protein_id=YP_001999673.1;transl_table=4 > >> NC_011025.1 RefSeq mRNA 107 1471 . + 0 ID=mRNA0;Parent=gene0 > >> NC_011025.1 RefSeq exon 107 1471 . + 0 ID=exon0;Parent=mRNA0 > >> > >> > >> The question is what to do. > >> > >> Not sure. > >> > >> Any other help? > >> > >> Good luck, > >> > >> ~Malcolm > >> > >> > >> -----Original Message----- > >> From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of Thomas Girke > >> Sent: Friday, June 07, 2013 12:52 PM > >> To: bioconductor at r-project.org > >> Cc: Brandon Gallaher > >> Subject: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes > >> > >> It seems to me that makeTranscriptDbFromGFF does not yet work on the > >> bacteria GFFs from NCBI (perhaps others too): > >> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ > >> > >> ## For instance, the following > >> library(GenomicFeatures) > >> download.file("ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycoplasma _arthritidis_158L3_1_uid58005/NC_011025.gff", destfile="NC_011025.gff") > >> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gff", format="gff3", dataSource="NCBI", species="Some bact") > >> > >> ## returns this error: > >> extracting transcript information > >> Error in .prepareGFF3TXS(data) : > >> No Transcript information present in gff file > >> > >> I guess this is because in bacteria GFF we don't have explicit > >> transcript annotations. There are hacks around this problem, but it > >> would be nice if this could be supported in the future right out of the > >> box. I apologize if I missed an existing solution for this. > >> > >> Best, > >> > >> Thomas > >> > >>> sessionInfo() > >> R version 3.0.0 (2013-04-03) > >> Platform: x86_64-unknown-linux-gnu (64-bit) > >> > >> locale: > >> [1] C > >> > >> attached base packages: > >> [1] parallel stats graphics utils datasets grDevices methods base > >> > >> other attached packages: > >> [1] GenomicFeatures_1.12.1 AnnotationDbi_1.22.0 Biobase_2.20.0 rtracklayer_1.20.1 GenomicRanges_1.12.0 IRanges_1.18.0 BiocGenerics_0.6.0 > >> > >> loaded via a namespace (and not attached): > >> [1] BSgenome_1.28.0 Biostrings_2.28.0 DBI_0.2-5 RCurl_1.95-4.1 RSQLite_0.11.2 Rsamtools_1.12.0 XML_3.96-1.1 biomaRt_2.16.0 bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
Hi Thomas, I will look into finding some way to accommodate these strange NCBI files. Marc On 06/07/2013 06:16 PM, Thomas Girke wrote: > No worries about the timing. I just wanted to point out that it would be > worthwhile to address this some time in the future since it applies to a > huge number of sequenced genomes. I totally understand the challenge Marc is > facing supporting all these inconsistent data formats. > > Thomas > > On Fri, Jun 07, 2013 at 11:55:34PM +0000, Hervé Pagès wrote: >> Hi Thomas, Malcolm, >> >> This GFF3 file doesn't pass validation using the online validator at: >> >> http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online >> >> but for a minor problem that doesn't have anything to do with what is >> being discussed here (still, it's sad that an organization like NCBI >> distributes invalid GFF3 files :-/ ). >> >> Anyway you're right Thomas that it would probably make sense to >> be able to import those files with makeTranscriptDbFromGFF(). >> Don't know how hard that will be though. Marc being on vacation right >> now, this won't be addressed until 2 weeks from now. >> >> I don't have an easy workaround to offer in the mean time, sorry... >> but maybe someone has? >> >> Cheers, >> H. >> >> >> On 06/07/2013 04:33 PM, Thomas Girke wrote: >>> The perl -> genbank -> gff -> R -> txdb approach seems to work in >>> R-2.15.1 but not in R-3.0.0 anymore. The latter returns >>> >>> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact") >>> extracting transcript information >>> Error in .prepareGFF3TXS(data) : Unexpected transcript duplicates >>> >>> Given that there are over 2500 bacteria genomes annotated this way at NCBI, >>> a more generic solution to this problem would be very useful, I think. >>> >>> Thomas >>> >>> >>> On Fri, Jun 07, 2013 at 07:56:32PM +0000, Cook, Malcolm wrote: >>>> FYI, bioperl includes bp_genbank2gff3.pl >>>> >>>> which when run as >>>> >>>>> bp_genbank2gff3.pl NC_011025.gbk >>>> produces NC_011025.gbk.gff (attached) >>>> >>>> which loaded without error with transcript: >>>> >>>>> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact") >>>> extracting transcript information >>>> Extracting gene IDs >>>> extracting transcript information >>>> Processing splicing information for gff3 file. >>>> Deducing exon rank from relative coordinates provided >>>> Prepare the 'metadata' data frame ... metadata: OK >>>> Now generating chrominfo from available sequence names. No chromosome length information is available. >>>> Warning messages: >>>> 1: In .deduceExonRankings(exs, format = "gff") : >>>> Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName >>>> 2: In matchCircularity(chroms, circ_seqs) : >>>> None of the strings in your circ_seqs argument match your seqnames. >>>>> txdb >>>> TranscriptDb object: >>>> | Db type: TranscriptDb >>>> | Supporting package: GenomicFeatures >>>> | Data source: NCBI >>>> | Genus and Species: Some bact >>>> | miRBase build ID: NA >>>> | transcript_nrow: 631 >>>> | exon_nrow: 631 >>>> | cds_nrow: 631 >>>> | Db created by: GenomicFeatures package from Bioconductor >>>> | Creation time: 2013-06-07 14:52:50 -0500 (Fri, 07 Jun 2013) >>>> | GenomicFeatures version at creation time: 1.10.2 >>>> | RSQLite version at creation time: 0.11.2 >>>> | DBSCHEMAVERSION: 1.0 >>>> >>>> >>>> >>>> >>>> -----Original Message----- >>>> From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of Cook, Malcolm >>>> Sent: Friday, June 07, 2013 2:32 PM >>>> To: 'Thomas Girke'; 'bioconductor at r-project.org' >>>> Cc: 'Brandon Gallaher' >>>> Subject: Re: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes >>>> >>>> Taking a quick look at the GFF in question ... I don't see any mRNA features.... they appear to be implicit .... which is not well formed GFF3 (c.f. http://www.sequenceontology.org/gff3.shtml) >>>> >>>> That is your first problem. >>>> >>>> In particular, the first gene in a file by itself is rejected. Adding the mRNA and exon lines as below, and the first gene is now accepted by makeTranscriptDbFromGFF. >>>> >>>> ##gff-version 3 >>>> #!gff-spec-version 1.20 >>>> #!processor NCBI annotwriter >>>> ##sequence-region NC_011025.1 1 820453 >>>> ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=243272 >>>> NC_011025.1 RefSeq region 1 820453 . + . ID=id0;Dbxref=taxon:243272;Is_circular=true;gbkey=Src;genome=c hromosome;mol_type=genomic DNA;strain=158L3-1 >>>> NC_011025.1 RefSeq gene 107 1471 . + . ID=gene0;Name=dnaA;Dbxref=GeneID:6418131;gbkey=Gene;gene=dnaA; locus_tag=MARTH_orf001 >>>> NC_011025.1 RefSeq CDS 107 1471 . + 0 ID=cds0;Name=YP_001999673.1;Parent=gene0;Note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;Dbxref=Genbank:YP_00 1999673.1,GeneID:6418131;gbkey=CDS;product=chromosomal replication initiation protein;protein_id=YP_001999673.1;transl_table=4 >>>> NC_011025.1 RefSeq mRNA 107 1471 . + 0 ID=mRNA0;Parent=gene0 >>>> NC_011025.1 RefSeq exon 107 1471 . + 0 ID=exon0;Parent=mRNA0 >>>> >>>> >>>> The question is what to do. >>>> >>>> Not sure. >>>> >>>> Any other help? >>>> >>>> Good luck, >>>> >>>> ~Malcolm >>>> >>>> >>>> -----Original Message----- >>>> From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of Thomas Girke >>>> Sent: Friday, June 07, 2013 12:52 PM >>>> To: bioconductor at r-project.org >>>> Cc: Brandon Gallaher >>>> Subject: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes >>>> >>>> It seems to me that makeTranscriptDbFromGFF does not yet work on the >>>> bacteria GFFs from NCBI (perhaps others too): >>>> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ >>>> >>>> ## For instance, the following >>>> library(GenomicFeatures) >>>> download.file("ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycoplasma _arthritidis_158L3_1_uid58005/NC_011025.gff", destfile="NC_011025.gff") >>>> txdb <- makeTranscriptDbFromGFF(file="NC_011025.gff", format="gff3", dataSource="NCBI", species="Some bact") >>>> >>>> ## returns this error: >>>> extracting transcript information >>>> Error in .prepareGFF3TXS(data) : >>>> No Transcript information present in gff file >>>> >>>> I guess this is because in bacteria GFF we don't have explicit >>>> transcript annotations. There are hacks around this problem, but it >>>> would be nice if this could be supported in the future right out of the >>>> box. I apologize if I missed an existing solution for this. >>>> >>>> Best, >>>> >>>> Thomas >>>> >>>>> sessionInfo() >>>> R version 3.0.0 (2013-04-03) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] C >>>> >>>> attached base packages: >>>> [1] parallel stats graphics utils datasets grDevices methods base >>>> >>>> other attached packages: >>>> [1] GenomicFeatures_1.12.1 AnnotationDbi_1.22.0 Biobase_2.20.0 rtracklayer_1.20.1 GenomicRanges_1.12.0 IRanges_1.18.0 BiocGenerics_0.6.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] BSgenome_1.28.0 Biostrings_2.28.0 DBI_0.2-5 RCurl_1.95-4.1 RSQLite_0.11.2 Rsamtools_1.12.0 XML_3.96-1.1 biomaRt_2.16.0 bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Malcolm Cook ★ 1.6k
@malcolm-cook-6293
Last seen 4 months ago
United States
Taking a quick look at the GFF in question ... I don't see any mRNA features.... they appear to be implicit .... which is not well formed GFF3 (c.f. http://www.sequenceontology.org/gff3.shtml) That is your first problem. In particular, the first gene in a file by itself is rejected. Adding the mRNA and exon lines as below, and the first gene is now accepted by makeTranscriptDbFromGFF. ##gff-version 3 #!gff-spec-version 1.20 #!processor NCBI annotwriter ##sequence-region NC_011025.1 1 820453 ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=243272 NC_011025.1 RefSeq region 1 820453 . + . ID=id0;Dbxref=taxon:243272;Is_circular=true;gbkey=Src;genome=chromosom e;mol_type=genomic DNA;strain=158L3-1 NC_011025.1 RefSeq gene 107 1471 . + . ID=gene0;Name=dnaA;Dbxref=GeneID:6418131;gbkey=Gene;gene=dnaA;locus_ta g=MARTH_orf001 NC_011025.1 RefSeq CDS 107 1471 . + 0 ID=cds0;Name=YP_001999673.1;Parent=gene0;Note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;Dbxref=Genbank:YP_001999673.1,Gene ID:6418131;gbkey=CDS;product=chromosomal replication initiation protein;protein_id=YP_001999673.1;transl_table=4 NC_011025.1 RefSeq mRNA 107 1471 . + 0 ID=mRNA0;Parent=gene0 NC_011025.1 RefSeq exon 107 1471 . + 0 ID=exon0;Parent=mRNA0 The question is what to do. Not sure. Any other help? Good luck, ~Malcolm -----Original Message----- From: bioconductor-bounces@r-project.org [mailto:bioconductor- bounces@r-project.org] On Behalf Of Thomas Girke Sent: Friday, June 07, 2013 12:52 PM To: bioconductor at r-project.org Cc: Brandon Gallaher Subject: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes It seems to me that makeTranscriptDbFromGFF does not yet work on the bacteria GFFs from NCBI (perhaps others too): ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ ## For instance, the following library(GenomicFeatures) download.file("ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycoplasma_arth ritidis_158L3_1_uid58005/NC_011025.gff", destfile="NC_011025.gff") txdb <- makeTranscriptDbFromGFF(file="NC_011025.gff", format="gff3", dataSource="NCBI", species="Some bact") ## returns this error: extracting transcript information Error in .prepareGFF3TXS(data) : No Transcript information present in gff file I guess this is because in bacteria GFF we don't have explicit transcript annotations. There are hacks around this problem, but it would be nice if this could be supported in the future right out of the box. I apologize if I missed an existing solution for this. Best, Thomas > sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] parallel stats graphics utils datasets grDevices methods base other attached packages: [1] GenomicFeatures_1.12.1 AnnotationDbi_1.22.0 Biobase_2.20.0 rtracklayer_1.20.1 GenomicRanges_1.12.0 IRanges_1.18.0 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] BSgenome_1.28.0 Biostrings_2.28.0 DBI_0.2-5 RCurl_1.95-4.1 RSQLite_0.11.2 Rsamtools_1.12.0 XML_3.96-1.1 biomaRt_2.16.0 bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
FYI, bioperl includes bp_genbank2gff3.pl which when run as > bp_genbank2gff3.pl NC_011025.gbk produces NC_011025.gbk.gff (attached) which loaded without error with transcript: > txdb <- makeTranscriptDbFromGFF(file="NC_011025.gbk.gff", format="gff3", dataSource="NCBI", species="Some bact") extracting transcript information Extracting gene IDs extracting transcript information Processing splicing information for gff3 file. Deducing exon rank from relative coordinates provided Prepare the 'metadata' data frame ... metadata: OK Now generating chrominfo from available sequence names. No chromosome length information is available. Warning messages: 1: In .deduceExonRankings(exs, format = "gff") : Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName 2: In matchCircularity(chroms, circ_seqs) : None of the strings in your circ_seqs argument match your seqnames. > txdb TranscriptDb object: | Db type: TranscriptDb | Supporting package: GenomicFeatures | Data source: NCBI | Genus and Species: Some bact | miRBase build ID: NA | transcript_nrow: 631 | exon_nrow: 631 | cds_nrow: 631 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2013-06-07 14:52:50 -0500 (Fri, 07 Jun 2013) | GenomicFeatures version at creation time: 1.10.2 | RSQLite version at creation time: 0.11.2 | DBSCHEMAVERSION: 1.0 -----Original Message----- From: bioconductor-bounces@r-project.org [mailto:bioconductor- bounces@r-project.org] On Behalf Of Cook, Malcolm Sent: Friday, June 07, 2013 2:32 PM To: 'Thomas Girke'; 'bioconductor at r-project.org' Cc: 'Brandon Gallaher' Subject: Re: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes Taking a quick look at the GFF in question ... I don't see any mRNA features.... they appear to be implicit .... which is not well formed GFF3 (c.f. http://www.sequenceontology.org/gff3.shtml) That is your first problem. In particular, the first gene in a file by itself is rejected. Adding the mRNA and exon lines as below, and the first gene is now accepted by makeTranscriptDbFromGFF. ##gff-version 3 #!gff-spec-version 1.20 #!processor NCBI annotwriter ##sequence-region NC_011025.1 1 820453 ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=243272 NC_011025.1 RefSeq region 1 820453 . + . ID=id0;Dbxref=taxon:243272;Is_circular=true;gbkey=Src;genome=chromosom e;mol_type=genomic DNA;strain=158L3-1 NC_011025.1 RefSeq gene 107 1471 . + . ID=gene0;Name=dnaA;Dbxref=GeneID:6418131;gbkey=Gene;gene=dnaA;locus_ta g=MARTH_orf001 NC_011025.1 RefSeq CDS 107 1471 . + 0 ID=cds0;Name=YP_001999673.1;Parent=gene0;Note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;Dbxref=Genbank:YP_001999673.1,Gene ID:6418131;gbkey=CDS;product=chromosomal replication initiation protein;protein_id=YP_001999673.1;transl_table=4 NC_011025.1 RefSeq mRNA 107 1471 . + 0 ID=mRNA0;Parent=gene0 NC_011025.1 RefSeq exon 107 1471 . + 0 ID=exon0;Parent=mRNA0 The question is what to do. Not sure. Any other help? Good luck, ~Malcolm -----Original Message----- From: bioconductor-bounces@r-project.org [mailto:bioconductor- bounces@r-project.org] On Behalf Of Thomas Girke Sent: Friday, June 07, 2013 12:52 PM To: bioconductor at r-project.org Cc: Brandon Gallaher Subject: [BioC] makeTranscriptDbFromGFF fails on NCBI Bacteria genomes It seems to me that makeTranscriptDbFromGFF does not yet work on the bacteria GFFs from NCBI (perhaps others too): ftp://ftp.ncbi.nih.gov/genomes/Bacteria/ ## For instance, the following library(GenomicFeatures) download.file("ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Mycoplasma_arth ritidis_158L3_1_uid58005/NC_011025.gff", destfile="NC_011025.gff") txdb <- makeTranscriptDbFromGFF(file="NC_011025.gff", format="gff3", dataSource="NCBI", species="Some bact") ## returns this error: extracting transcript information Error in .prepareGFF3TXS(data) : No Transcript information present in gff file I guess this is because in bacteria GFF we don't have explicit transcript annotations. There are hacks around this problem, but it would be nice if this could be supported in the future right out of the box. I apologize if I missed an existing solution for this. Best, Thomas > sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] parallel stats graphics utils datasets grDevices methods base other attached packages: [1] GenomicFeatures_1.12.1 AnnotationDbi_1.22.0 Biobase_2.20.0 rtracklayer_1.20.1 GenomicRanges_1.12.0 IRanges_1.18.0 BiocGenerics_0.6.0 loaded via a namespace (and not attached): [1] BSgenome_1.28.0 Biostrings_2.28.0 DBI_0.2-5 RCurl_1.95-4.1 RSQLite_0.11.2 Rsamtools_1.12.0 XML_3.96-1.1 biomaRt_2.16.0 bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY

Login before adding your answer.

Traffic: 476 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6