Get wrong tx_type when using GenomicFeatures::makeTxDbFromGTF
3
0
Entering edit mode
@karolin-wiedemann-9303
Last seen 9.1 years ago
Germany

The package GenomicFeatures (>v1.20) provides the "tx_type" column in the transcript table of TranscriptDBs.
I want to read a GTF file, that includes the transcript_biotype. As example, I downloaded and unziped an GTF from Ensembl: ftp://ftp.ensembl.org/pub/release-82/gtf/homo_sapiens/Homo_sapiens.GRCh38.82.gtf.gz .
Here an extract:

1       havana  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";

However, I don't get the a tx_type like mRNA, snoRNA,... .  Instead the tx_type column is filled with the word "transcript".

My example:

> txdb <- GenomicFeatures::makeTxDbFromGFF("~/data/Homo_sapiens.GRCh38.82.gtf",format="gtf")
> tx <- GenomicFeatures::transcripts(txdb,column=c("tx_name","tx_type"))
> head(tx)
GRanges object with 6 ranges and 2 metadata columns:
      seqnames         ranges strand |         tx_name     tx_type
         <Rle>      <IRanges>  <Rle> |     <character> <character>
  [1]        1 [11869, 14409]      + | ENST00000456328  transcript
  [2]        1 [12010, 13670]      + | ENST00000450305  transcript
  [3]        1 [29554, 31097]      + | ENST00000473358  transcript
  [4]        1 [30267, 31109]      + | ENST00000469289  transcript
  [5]        1 [30366, 30503]      + | ENST00000607096  transcript
  [6]        1 [52473, 53312]      + | ENST00000606857  transcript
  -------
  seqinfo: 59 sequences (1 circular) from an unspecified genome; no seqlengths

 

Looking at the code:

rtracklayer::import is used to read the GTF, while only the columns "type","gene_id","transcript_id" and "exon_id" are returned. Thereby "type" describes the 3.column in the GTF. Maybe I am wrong, but this column never includes transcript_type information.

 

My questions:
1) Is there something wrong in the way I make TxDbs from GTF or did I understand the tx_type incorrectly?

2) Why are only a predefined tx_types excapted ?

 

 

Thanks, Karolin

__________

R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.32.0       XVector_0.10.0             GenomicRanges_1.22.1       BiocGenerics_0.16.1       
 [5] zlibbioc_1.16.0            GenomicAlignments_1.6.1    IRanges_2.4.4              BiocParallel_1.4.0        
 [9] GenomeInfoDb_1.6.1         tools_3.2.2                SummarizedExperiment_1.0.1 parallel_3.2.2            
[13] Biobase_2.30.0             DBI_0.3.1                  lambda.r_1.1.7             futile.logger_1.4.1       
[17] rtracklayer_1.30.1         S4Vectors_0.8.3            futile.options_1.0.0       bitops_1.0-6              
[21] RCurl_1.95-4.7             biomaRt_2.26.1             RSQLite_1.0.0              GenomicFeatures_1.22.5    
[25] Biostrings_2.38.2          Rsamtools_1.22.0           stats4_3.2.2               XML_3.98-1.3

genomicfeatures maketxdbfromgff tx_type gtf • 2.1k views
ADD COMMENT
0
Entering edit mode
Johannes Rainer ★ 2.1k
@johannes-rainer-6987
Last seen 10 weeks ago
Italy

It may not answer your question directly... but for Ensembl based annotations you might also consider to use the ensembldb package instead. That way you can create EnsDb objects/databases/packages that have essentially the same functionality than the TxDb object, just that they are tailored to Ensembl based annotations (you'll have an explicit column tx_biotype and gene_biotype available).

Basically, you could use the ensDbFromGtf function if you have the GTF locally. Alternatively, as explained in the vignette, you could use the AnnotationHub package to fetch a GRanges object representing the GTF data from Ensembl and build the database from that (unfortunateyl, Ensembl 82 is not yet available in AnnotationHub).

Let me know if you run into problems or something in unclear.

cheers, jo

PS: the working example:

> library(ensembldb)
> gtff <- "~/data/Homo_sapiens.GRCh38.82.gtf"
> DB <- ensDbFromGtf(gtff, verbose=TRUE)
importing gtf file...done
processing metadata...OK
processing genes...OK
processing transcripts...OK
processing exons...OK
processing chromosomes...fetch seqlenghts from ensembl, dataset  hsapiens_gene_ensembl  version  82 ...OK
generating index...OK
Verifying validity of the information in the database:
Checking transcripts...OK
Checking exons...OK

> edb <- EnsDb(DB)
> transcripts(edb)
GRanges object with 198634 ranges and 5 metadata columns:
                  seqnames               ranges strand   |           tx_id
                     <Rle>            <IRanges>  <Rle>   |     <character>
  ENST00000456328        1       [11869, 14409]      +   | ENST00000456328
  ENST00000450305        1       [12010, 13670]      +   | ENST00000450305
  ENST00000488147        1       [14404, 29570]      -   | ENST00000488147
  ENST00000619216        1       [17369, 17436]      -   | ENST00000619216
  ENST00000473358        1       [29554, 31097]      +   | ENST00000473358
              ...      ...                  ...    ... ...             ...
  ENST00000420810        Y [26549425, 26549743]      +   | ENST00000420810
  ENST00000456738        Y [26586642, 26591601]      -   | ENST00000456738
  ENST00000435945        Y [26594851, 26634652]      -   | ENST00000435945
  ENST00000435741        Y [26626520, 26627159]      -   | ENST00000435741
  ENST00000431853        Y [56855244, 56855488]      +   | ENST00000431853
                                          tx_biotype tx_cds_seq_start
                                         <character>        <numeric>
  ENST00000456328               processed_transcript             <NA>
  ENST00000450305 transcribed_unprocessed_pseudogene             <NA>
  ENST00000488147             unprocessed_pseudogene             <NA>
  ENST00000619216                              miRNA             <NA>
  ENST00000473358                            lincRNA             <NA>
              ...                                ...              ...
  ENST00000420810               processed_pseudogene             <NA>
  ENST00000456738             unprocessed_pseudogene             <NA>
  ENST00000435945             unprocessed_pseudogene             <NA>
  ENST00000435741               processed_pseudogene             <NA>
  ENST00000431853               processed_pseudogene             <NA>
                  tx_cds_seq_end         gene_id
                       <numeric>     <character>
  ENST00000456328           <NA> ENSG00000223972
  ENST00000450305           <NA> ENSG00000223972
  ENST00000488147           <NA> ENSG00000227232
  ENST00000619216           <NA> ENSG00000278267
  ENST00000473358           <NA> ENSG00000243485
              ...            ...             ...
  ENST00000420810           <NA> ENSG00000224240
  ENST00000456738           <NA> ENSG00000227629
  ENST00000435945           <NA> ENSG00000237917
  ENST00000435741           <NA> ENSG00000231514
  ENST00000431853           <NA> ENSG00000235857
  -------
  seqinfo: 59 sequences from GRCh38 genome

 

 

ADD COMMENT
0
Entering edit mode
@karolin-wiedemann-9303
Last seen 9.1 years ago
Germany

Quite nice! Theoretically, it could help to solve some more problems I have. Unfortunately, I get an error:

> db <- ensembldb::ensDbFromGtf("~/data/Homo_sapiens.GRCh38.82.gtf",verbose=TRUE)
importing gtf file...done
Error in ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism,  :
  could not find function "seqlengths"

I guess,the function tries to fetch the data through biomart or something like this. Right?
I am connected to the WWW. Nevertheless, I need a solution without additional data.
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=de_DE.UTF-8   
 [6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.2                  AnnotationDbi_1.32.0         magrittr_1.5                 AnnotationHub_2.2.2          XVector_0.10.0              
 [6] GenomicRanges_1.22.1         BiocGenerics_0.16.1          zlibbioc_1.16.0              GenomicAlignments_1.6.1      IRanges_2.4.4               
[11] BiocParallel_1.4.0           xtable_1.8-0                 R6_2.1.1                     stringr_1.0.0                httr_1.0.0                  
[16] ensembldb_1.2.0              GenomeInfoDb_1.6.1           tools_3.2.2                  SummarizedExperiment_1.0.1   parallel_3.2.2              
[21] Biobase_2.30.0               DBI_0.3.1                    lambda.r_1.1.7               futile.logger_1.4.1          htmltools_0.2.6             
[26] digest_0.6.8                 interactiveDisplayBase_1.8.0 shiny_0.12.2                 rtracklayer_1.30.1           S4Vectors_0.8.3             
[31] futile.options_1.0.0         bitops_1.0-6                 RCurl_1.95-4.7               biomaRt_2.26.1               mime_0.4                    
[36] RSQLite_1.0.0                stringi_1.0-1                BiocInstaller_1.20.1         GenomicFeatures_1.22.5       Biostrings_2.38.2           
[41] Rsamtools_1.22.0             stats4_3.2.2                 XML_3.98-1.3                 httpuv_1.3.3 
              

ADD COMMENT
0
Entering edit mode

Did you load the libraries before?

I mean library(ensembldb) and eventually also library(GenomicFeatures) (although this should come along the ensembldb package). I don't see these two packages being attached in your sessionInfo.

In fact, the function does also load the sequence lengths (length of the chromosomes in nt) from Ensembl by a call to the Ensembl ftp server, so, yes, to get all information you need internet connection. If you don't have that you will still get an EnsDb database, but without the seqinfo.

cheers, jo

ADD REPLY
0
Entering edit mode

Oh, it's a shame! That is the reason. Thanks a lot !!

 

ADD REPLY
0
Entering edit mode

Johannes -- I think the could not find function "seqlengths" error means that your package is missing an import(GenomeInfoDb) or importFrom(seqlengths, GenomeInfoDb) in it's NAMESPACE. This is needed both for the (valid) case illustrated in post and for when a package extends your package via import-ing functionality.

ADD REPLY
0
Entering edit mode

Thanks Martin. I indeed forgot to import this method. Is now fixed in release and devel version and should be populated tomorrow (I guess).

ADD REPLY
0
Entering edit mode
@herve-pages-1542
Last seen 1 hour ago
Seattle, WA, United States

Hi,

makeTxDbFromGFF() gets the tx_type information from the 3rd column (type column) of the GFF/GTF file, which for a GTF file always happens to be set to transcript for transcripts (unlike for a GFF file, where it can be set to various things). I agree this is not very informative and we should probably address this.

FWIW note that you can also fetch a TxDb object from Ensembl with makeTxDbFromBiomart() (also from the GenomicFeatures package). This will populate the tx_type column with Ensembl transcript_biotype attribute.

Cheers,

H.

ADD COMMENT

Login before adding your answer.

Traffic: 631 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6