troubleshooting makeTxDbFromGRanges: some exons are linked to transcripts not found in the file
1
0
Entering edit mode
Guido Hooiveld ★ 4.1k
@guido-hooiveld-2020
Last seen 8 days ago
Wageningen University, Wageningen, the …

I would like to make a transcript-based annotation file (TxDb) for Arabidopsis, based on the recent Araport11 genome release. I am using the gff3 file (Araport11_GFF3_genes_transposons.201606.gff, 22 June 2016), available from here.

However, this fails because of an error:

Error in makeTxDbFromGRanges(araport):

some exons are linked to transcripts not found in the file.

While the error message is crystal clear, and I realize the error originates from an apparent mistake in the gff3 file (which has to be corrected by the people at the Arabidopsis Biological Resource Center), I wondered whether it somehow would be possible to have these exons and transcripts identified and returned. This would better enable troubleshooting.

Thanks,

Guido

 

> library("rtracklayer")
> library("GenomicFeatures")
>
>
> araport <- import.gff3("Araport11_GFF3_genes_transposons.201606.gff", format="gff3")
>
> araport
GRanges object with 789890 ranges and 21 metadata columns:
           seqnames           ranges strand |    source           type     score     phase
              <Rle>        <IRanges>  <Rle> |  <factor>       <factor> <numeric> <integer>
       [1]     Chr1     [3631, 5899]      + | Araport11           gene      <NA>      <NA>
       [2]     Chr1     [3631, 5899]      + | Araport11           mRNA      <NA>      <NA>
       [3]     Chr1     [3631, 3759]      + | Araport11 five_prime_UTR      <NA>      <NA>
       [4]     Chr1     [3631, 3913]      + | Araport11           exon      <NA>      <NA>
       [5]     Chr1     [3760, 3913]      + | Araport11            CDS      <NA>         0
       ...      ...              ...    ... .       ...            ...       ...       ...
  [789886]     ChrM [366086, 366700]      - | Araport11           gene      <NA>      <NA>
  [789887]     ChrM [366086, 366700]      - | Araport11           mRNA      <NA>      <NA>
  [789888]     ChrM [366086, 366700]      - | Araport11            CDS      <NA>         0
  [789889]     ChrM [366086, 366700]      - | Araport11           exon      <NA>      <NA>
  [789890]     ChrM [366086, 366700]      - | Araport11        protein      <NA>      <NA>
                                   ID                    Name                            Note      symbol
                          <character>             <character>                 <CharacterList> <character>
       [1]                  AT1G01010               AT1G01010 NAC domain containing protein 1      NAC001
       [2]                AT1G01010.1             AT1G01010.1 NAC domain containing protein 1      NAC001
       [3] AT1G01010:five_prime_UTR:1 NAC001:five_prime_UTR:1                                        <NA>
       [4]           AT1G01010:exon:1           NAC001:exon:1                                        <NA>
       [5]            AT1G01010:CDS:1            NAC001:CDS:1                                        <NA>
       ...                        ...                     ...                             ...         ...
  [789886]                  ATMG01410               ATMG01410          open reading frame 204      ORF204
  [789887]                ATMG01410.1             ATMG01410.1          open reading frame 204      ORF204
  [789888]            ATMG01410:CDS:1            ORF204:CDS:1                                        <NA>
  [789889]           ATMG01410:exon:1           ORF204:exon:1                                        <NA>
  [789890]        ATMG01410.1-Protein             ATMG01410.1                                        <NA>
                                             Alias                       full_name
                                   <CharacterList>                     <character>
       [1] ANAC001,NAC domain containing protein 1 NAC domain containing protein 1
       [2] ANAC001,NAC domain containing protein 1 NAC domain containing protein 1
       [3]                                                                    <NA>
       [4]                                                                    <NA>
       [5]                                                                    <NA>
       ...                                     ...                             ...
  [789886]                                                  open reading frame 204
  [789887]                                                  open reading frame 204
  [789888]                                                                    <NA>
  [789889]                                                                    <NA>
  [789890]                                                                    <NA>
                                                                                                                                                                                                                                                                                                                                                                                                          Dbxref     locus_type          Parent  conf_class
                                         <CharacterList>    <character> <CharacterList> <character>
       [1] PMID:11118137,PMID:12820902,PMID:15029955,... protein_coding                        <NA>
       [2]     PMID:11118137,gene:2200934,UniProt:Q0WV96           <NA>       AT1G01010           2
       [3]                                                         <NA>     AT1G01010.1        <NA>
       [4]                                                         <NA>     AT1G01010.1        <NA>
       [5]                                                         <NA>     AT1G01010.1        <NA>
       ...                                           ...            ...             ...         ...
  [789886]                               locus:504954624 protein_coding                        <NA>
  [789887]                               gene:1009022691           <NA>       ATMG01410           1
  [789888]                                                         <NA>     ATMG01410.1        <NA>
  [789889]                                                         <NA>     ATMG01410.1        <NA>
  [789890]                                                         <NA>                        <NA>
           conf_rating Derives_from curator_summary description       index nochangenat-description
           <character>  <character>     <character> <character> <character>             <character>
       [1]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
       [2]        ****         <NA>            <NA>        <NA>        <NA>                    <NA>
       [3]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
       [4]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
       [5]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
       ...         ...          ...             ...         ...         ...                     ...
  [789886]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
  [789887]       *****         <NA>            <NA>        <NA>           1                    <NA>
  [789888]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
  [789889]        <NA>         <NA>            <NA>        <NA>        <NA>                    <NA>
  [789890]        <NA>  ATMG01410.1            <NA>        <NA>        <NA>                    <NA>
  -------
  seqinfo: 7 sequences from an unspecified genome; no seqlengths
>
> txdb <- makeTxDbFromGRanges(araport)
Error in makeTxDbFromGRanges(araport) :
  some exons are linked to transcripts not found in the file
>
> sessionInfo()
R version 3.3.1 Patched (2016-06-28 r70853)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicFeatures_1.24.3 AnnotationDbi_1.34.3   Biobase_2.32.0         rtracklayer_1.32.1    
[5] GenomicRanges_1.24.2   GenomeInfoDb_1.8.2     IRanges_2.6.1          S4Vectors_0.10.1      
[9] BiocGenerics_0.18.0   

loaded via a namespace (and not attached):
 [1] XML_3.98-1.4               Rsamtools_1.24.0           Biostrings_2.40.2         
 [4] bitops_1.0-6               GenomicAlignments_1.8.3    DBI_0.4-1                 
 [7] RSQLite_1.0.0              zlibbioc_1.18.0            XVector_0.12.0            
[10] BiocParallel_1.6.2         tools_3.3.1                biomaRt_2.28.0            
[13] RCurl_1.95-4.8             SummarizedExperiment_1.2.3
>
genomicfeatures txdb makeTxDbFromGRanges • 2.8k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 10 hours ago
Seattle, WA, United States

Hi Guido,

You make a good point that error reporting could be improved in makeTxDbFromGRanges(). In that case the error was caused by makeTxDbFromGRanges() not recognizing features of type lnc_RNA, antisense_lncRNA, transcript_region, and pseudogenic_tRNA, as transcripts. I fixed this in GenomicFeatures 1.24.4 (release) and 1.25.15 (devel).

With GenomicFeatures 1.24.4:

library(rtracklayer)
library(GenomicFeatures)
araport <- import("~/Downloads/Araport11_GFF3_genes_transposons.201606.gff")
txdb <- makeTxDbFromGRanges(araport)
txdb
# TxDb object:
## Db type: TxDb
## Supporting package: GenomicFeatures
## Genome: NA
## transcript_nrow: 60204
## exon_nrow: 203352
## cds_nrow: 221781
## Db created by: GenomicFeatures package from Bioconductor
## Creation time: 2016-07-07 10:31:12 -0700 (Thu, 07 Jul 2016)
## GenomicFeatures version at creation time: 1.24.4
## RSQLite version at creation time: 1.0.0
## DBSCHEMAVERSION: 1.1

Both versions should become available via biocLite() in the next couple of days.

Cheers,

H.

 

ADD COMMENT
0
Entering edit mode

Thanks Herve, working nicely now!

ADD REPLY

Login before adding your answer.

Traffic: 420 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6