Wrong tx_type when using GenomicFeatures::makeTxDbFromGFF with a GTF file
The GenomicFeatures package (> v1.20) provides a "tx_type" column in the transcript table of TxDb objects.
I want to read a GTF file that includes the transcript_biotype attribute. As an example, I downloaded and unzipped a GTF file from Ensembl (Homo_sapiens.GRCh38.82.gtf).
Here is an extract:
1       havana  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";

However, I don't get a tx_type such as mRNA or snoRNA. Instead, the tx_type column is filled with the word "transcript".

My example:

> txdb <- GenomicFeatures::makeTxDbFromGFF("~/data/Homo_sapiens.GRCh38.82.gtf",format="gtf")
> tx <- GenomicFeatures::transcripts(txdb,columns=c("tx_name","tx_type"))
> head(tx)
GRanges object with 6 ranges and 2 metadata columns:
      seqnames         ranges strand |         tx_name     tx_type
         <Rle>      <IRanges>  <Rle> |     <character> <character>
  [1]        1 [11869, 14409]      + | ENST00000456328  transcript
  [2]        1 [12010, 13670]      + | ENST00000450305  transcript
  [3]        1 [29554, 31097]      + | ENST00000473358  transcript
  [4]        1 [30267, 31109]      + | ENST00000469289  transcript
  [5]        1 [30366, 30503]      + | ENST00000607096  transcript
  [6]        1 [52473, 53312]      + | ENST00000606857  transcript
  seqinfo: 59 sequences (1 circular) from an unspecified genome; no seqlengths


Looking at the code:

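rtracklayer::import is used to read the GTF, but only the columns "type", "gene_id", "transcript_id" and "exon_id" are returned. Here "type" refers to the third column of the GTF, i.e. the feature type (gene, transcript, exon, ...). Maybe I am wrong, but this column never contains transcript biotype information; in the Ensembl file that information is stored in the transcript_biotype attribute of the ninth column.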
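
To illustrate the difference between the third column and the attribute, here is a minimal sketch that imports a shortened version of the line shown above with rtracklayer::import. The temporary file and the variable names (gtf_line, gtf_file, gr, gr_sub) are made up for this example, and the colnames argument is only my assumption about how the column restriction is applied; I have not confirmed that this is exactly what makeTxDbFromGFF does internally.

```r
# Sketch only: the temporary file and all variable names are illustrative.
library(rtracklayer)

# One tab-separated GTF record, shortened from the Ensembl extract above.
gtf_line <- paste(
  "1", "havana", "transcript", "11869", "14409", ".", "+", ".",
  paste0(
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; ',
    'transcript_biotype "processed_transcript";'
  ),
  sep = "\t"
)
gtf_file <- tempfile(fileext = ".gtf")
writeLines(gtf_line, gtf_file)

# Full import: every attribute of the ninth column becomes a metadata column.
gr <- import(gtf_file, format = "gtf")
as.character(gr$type)    # feature type taken from the third GTF column
gr$transcript_biotype    # biotype taken from the attribute column

# Restricted import (assumed to mirror the column selection described above):
# the transcript_biotype attribute is not kept.
gr_sub <- import(gtf_file, format = "gtf",
                 colnames = c("type", "gene_id", "transcript_id"))
names(mcols(gr_sub))
```

If this is right, the biotype is available in the file but is dropped before the TxDb is built, which would explain why tx_type only ever shows the feature type.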


My questions:
1) Is there something wrong with the way I make TxDbs from GTF files, or have I misunderstood tx_type?

2) Why are only predefined tx_types accepted?



Thanks, Karolin


R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.32.0       XVector_0.10.0             GenomicRanges_1.22.1       BiocGenerics_0.16.1       
 [5] zlibbioc_1.16.0            GenomicAlignments_1.6.1    IRanges_2.4.4              BiocParallel_1.4.0        
 [9] GenomeInfoDb_1.6.1         tools_3.2.2                SummarizedExperiment_1.0.1 parallel_3.2.2            
[13] Biobase_2.30.0             DBI_0.3.1                  lambda.r_1.1.7             futile.logger_1.4.1       
[17] rtracklayer_1.30.1         S4Vectors_0.8.3            futile.options_1.0.0       bitops_1.0-6              
[21] RCurl_1.95-4.7             biomaRt_2.26.1             RSQLite_1.0.0              GenomicFeatures_1.22.5    
[25] Biostrings_2.38.2          Rsamtools_1.22.0           stats4_3.2.2               XML_3.98-1.3


