I would like to build a transcript annotation database (TxDb) for Arabidopsis from the recent Araport11 genome release. I am using the GFF3 file (Araport11_GFF3_genes_transposons.201606.gff, 22 June 2016), available from here.
However, after importing the file with rtracklayer, the call to makeTxDbFromGRanges() fails with an error:
Error in makeTxDbFromGRanges(araport) :
  some exons are linked to transcripts not found in the file
The error message is crystal clear, and I realize it originates from an apparent mistake in the GFF3 file (which has to be corrected by the people at the Arabidopsis Biological Resource Center). Still, I wonder whether it would be possible to have the offending exons and transcripts identified and returned. This would make troubleshooting much easier.
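As a stopgap, the offending links could perhaps be located by hand. The sketch below uses base R only; `link_to_parents` and the toy vectors are hypothetical names made up for illustration and are not part of GenomicFeatures or rtracklayer. It assumes each feature carries an `ID` and a list-like `Parent` attribute, as in the imported object, and it only checks whether each exon's parent ID exists and what type that parent has. It does not reproduce the internal checks of makeTxDbFromGRanges().

```r
# Hypothetical helper: for every feature of type `child_type`, look up the
# feature named in its Parent attribute and report that parent's type.
# A parent_type of NA means the parent ID does not occur in the file at all.
link_to_parents <- function(type, id, parent, child_type = "exon") {
  type   <- as.character(type)
  id     <- as.character(id)
  parent <- as.list(parent)  # plain list, or a CharacterList coerced to one

  # One row per (feature, parent) pair; features without a parent drop out.
  idx <- rep(seq_along(parent), lengths(parent))
  links <- data.frame(
    row    = idx,
    type   = type[idx],
    id     = id[idx],
    parent = as.character(unlist(parent, use.names = FALSE)),
    stringsAsFactors = FALSE
  )

  links <- links[links$type %in% child_type, , drop = FALSE]
  # incomparables = NA keeps an NA parent from matching an NA ID.
  links$parent_type <- type[match(links$parent, id, incomparables = NA)]
  links
}

# Toy data (made up): the second exon points at a transcript "G1.2"
# that is not defined anywhere.
toy_type   <- c("gene", "mRNA", "exon", "exon")
toy_id     <- c("G1", "G1.1", "G1:exon:1", "G1:exon:2")
toy_parent <- list(character(0), "G1", "G1.1", "G1.2")

links   <- link_to_parents(toy_type, toy_id, toy_parent)
orphans <- links[is.na(links$parent_type), , drop = FALSE]
print(orphans)
```

On the imported object this would presumably be called as `link_to_parents(araport$type, araport$ID, araport$Parent)`, although I have not verified it against the full file. `table(links$parent_type, useNA = "ifany")` should then show which feature types the exons hang from, with NA marking parents that are missing altogether.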
Thanks,
Guido
> library("rtracklayer")
> library("GenomicFeatures")
>
>
> araport <- import.gff3("Araport11_GFF3_genes_transposons.201606.gff", format="gff3")
>
> araport
GRanges object with 789890 ranges and 21 metadata columns:
           seqnames           ranges strand |    source           type     score     phase
              <Rle>        <IRanges>  <Rle> |  <factor>       <factor> <numeric> <integer>
       [1]     Chr1     [3631, 5899]      + | Araport11           gene      <NA>      <NA>
       [2]     Chr1     [3631, 5899]      + | Araport11           mRNA      <NA>      <NA>
       [3]     Chr1     [3631, 3759]      + | Araport11 five_prime_UTR      <NA>      <NA>
       [4]     Chr1     [3631, 3913]      + | Araport11           exon      <NA>      <NA>
       [5]     Chr1     [3760, 3913]      + | Araport11            CDS      <NA>         0
       ...      ...              ...    ... .       ...            ...       ...       ...
  [789886]     ChrM [366086, 366700]      - | Araport11           gene      <NA>      <NA>
  [789887]     ChrM [366086, 366700]      - | Araport11           mRNA      <NA>      <NA>
  [789888]     ChrM [366086, 366700]      - | Araport11            CDS      <NA>         0
  [789889]     ChrM [366086, 366700]      - | Araport11           exon      <NA>      <NA>
  [789890]     ChrM [366086, 366700]      - | Araport11        protein      <NA>      <NA>
                                   ID                     Name                            Note      symbol
                          <character>              <character>                 <CharacterList> <character>
       [1]                  AT1G01010                AT1G01010 NAC domain containing protein 1      NAC001
       [2]                AT1G01010.1              AT1G01010.1 NAC domain containing protein 1      NAC001
       [3] AT1G01010:five_prime_UTR:1  NAC001:five_prime_UTR:1                                        <NA>
       [4]           AT1G01010:exon:1            NAC001:exon:1                                        <NA>
       [5]            AT1G01010:CDS:1             NAC001:CDS:1                                        <NA>
       ...                        ...                      ...                             ...         ...
  [789886]                  ATMG01410                ATMG01410          open reading frame 204      ORF204
  [789887]                ATMG01410.1              ATMG01410.1          open reading frame 204      ORF204
  [789888]            ATMG01410:CDS:1             ORF204:CDS:1                                        <NA>
  [789889]           ATMG01410:exon:1            ORF204:exon:1                                        <NA>
  [789890]        ATMG01410.1-Protein              ATMG01410.1                                        <NA>
                                             Alias                       full_name
                                   <CharacterList>                     <character>
       [1] ANAC001,NAC domain containing protein 1 NAC domain containing protein 1
       [2] ANAC001,NAC domain containing protein 1 NAC domain containing protein 1
       [3]                                                                    <NA>
       [4]                                                                    <NA>
       [5]                                                                    <NA>
       ...                                     ...                             ...
  [789886]                                                  open reading frame 204
  [789887]                                                  open reading frame 204
  [789888]                                                                    <NA>
  [789889]                                                                    <NA>
  [789890]                                                                    <NA>
                                                  Dbxref     locus_type          Parent  conf_class
                                         <CharacterList>    <character> <CharacterList> <character>
       [1] PMID:11118137,PMID:12820902,PMID:15029955,... protein_coding                        <NA>
       [2]     PMID:11118137,gene:2200934,UniProt:Q0WV96           <NA>       AT1G01010           2
       [3]                                                         <NA>     AT1G01010.1        <NA>
       [4]                                                         <NA>     AT1G01010.1        <NA>
       [5]                                                         <NA>     AT1G01010.1        <NA>
       ...                                           ...            ...             ...         ...
  [789886]                               locus:504954624 protein_coding                        <NA>
  [789887]                               gene:1009022691           <NA>       ATMG01410           1
  [789888]                                                         <NA>     ATMG01410.1        <NA>
  [789889]                                                         <NA>     ATMG01410.1        <NA>
  [789890]                                                         <NA>                        <NA>
           conf_rating Derives_from curator_summary description       index nochangenat-description
           <character>  <character>     <character> <character> <character>             <character>
       [1]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
       [2]        ****         <NA>            <NA>        <NA>        <NA>                    <NA>
       [3]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
       [4]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
       [5]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
       ...         ...          ...             ...         ...         ...                     ...
  [789886]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
  [789887]       *****         <NA>            <NA>        <NA>           1                    <NA>
  [789888]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
  [789889]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
  [789890]        <NA>  ATMG01410.1            <NA>        <NA>        <NA>                    <NA>
  -------
  seqinfo: 7 sequences from an unspecified genome; no seqlengths
>
> txdb <- makeTxDbFromGRanges(araport)
Error in makeTxDbFromGRanges(araport) :
  some exons are linked to transcripts not found in the file
>
> sessionInfo()
R version 3.3.1 Patched (2016-06-28 r70853)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GenomicFeatures_1.24.3 AnnotationDbi_1.34.3   Biobase_2.32.0         rtracklayer_1.32.1
[5] GenomicRanges_1.24.2   GenomeInfoDb_1.8.2     IRanges_2.6.1          S4Vectors_0.10.1
[9] BiocGenerics_0.18.0

loaded via a namespace (and not attached):
 [1] XML_3.98-1.4               Rsamtools_1.24.0           Biostrings_2.40.2
 [4] bitops_1.0-6               GenomicAlignments_1.8.3    DBI_0.4-1
 [7] RSQLite_1.0.0              zlibbioc_1.18.0            XVector_0.12.0
[10] BiocParallel_1.6.2         tools_3.3.1                biomaRt_2.28.0
[13] RCurl_1.95-4.8             SummarizedExperiment_1.2.3
>
Thanks Herve, working nicely now!