Question

package to manipulate GTF files

0

Entering edit mode

pcantalupo ▴ 10

@pcantalupo-8617

Last seen 9 months ago

United States

Does anybody know of R libraries that extracts information from GTF files? For instance, I have DESeq results with ensembl transcript IDs. I want to annotate each transcript row with the Transcript information from the GTF file. Column 9 of the GTF file is quite complicated by the many fields deliminted by ;. I was wondering if there are tools out there to manipulate GTF files.

gtf • 7.9k views

ADD COMMENT • link updated 3.6 years ago by Konstantinos Yeles ▴ 80 • written 3.6 years ago by pcantalupo ▴ 10

0

Entering edit mode

Konstantinos Yeles ▴ 80

@konstantinos-yeles-8961

Last seen 15 months ago

Italy

Hi pcantalupo, if you are familiar with tidyverse, you can work with plyranges
It helped me a lot in handling Genomic ranges from different sources.
You can use plyranges::filter in order to keep only the rows of your interest, or try

dplyr::separate(col=yourcolumn,  into=c("tag1", "tag2",etc), sep=";")

You can read more about plyranges here

ADD COMMENT • link 3.6 years ago Konstantinos Yeles ▴ 80

score 2 · Accepted Answer · 2021-07-14

2

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 18 hours ago

United States

The rtracklayer package can import a GTF file, and then you can use GenomicFeatures to make a TxDb object from the GTF, which is probably where you want to go. But if this is a common model organism that is probably not necessary. There are existing TxDb and if you are using Ensembl based genome, EnsDb packages on the AnnotationHub that you could use. But you are being rather mysterious about your organism, so it's hard to say for sure.

ADD COMMENT • link 3.6 years ago James W. MacDonald 68k

0

Entering edit mode

Oh, sorry I'm using the latest Gencode v38 GTF file for Human found here: http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz. It has specific attributes that I don't find in the Ensembl GTFs (version 104). For instance, Ensembl GTF does not have information for the canonical transcript but the Gencode v38 version does.

Thank you for the information!

ADD REPLY • link 3.6 years ago pcantalupo ▴ 10

1

Entering edit mode

You could do something like

> library(rtracklayer)
> z <- import("http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz")
trying URL 'http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz'
Content type 'application/octet-stream' length 46556621 bytes (44.4 MB)
downloaded 44.4 MB

> z
GRanges object with 3150424 ranges and 21 metadata columns:
            seqnames      ranges strand |   source       type     score
               <Rle>   <IRanges>  <Rle> | <factor>   <factor> <numeric>
        [1]     chr1 11869-14409      + |   HAVANA gene              NA
        [2]     chr1 11869-14409      + |   HAVANA transcript        NA
        [3]     chr1 11869-12227      + |   HAVANA exon              NA
        [4]     chr1 12613-12721      + |   HAVANA exon              NA
        [5]     chr1 13221-14409      + |   HAVANA exon              NA
        ...      ...         ...    ... .      ...        ...       ...
  [3150420]     chrM 15888-15953      + |  ENSEMBL transcript        NA
  [3150421]     chrM 15888-15953      + |  ENSEMBL exon              NA
  [3150422]     chrM 15956-16023      - |  ENSEMBL gene              NA
  [3150423]     chrM 15956-16023      - |  ENSEMBL transcript        NA
  [3150424]     chrM 15956-16023      - |  ENSEMBL exon              NA
                phase           gene_id              gene_type   gene_name
            <integer>       <character>            <character> <character>
        [1]      <NA> ENSG00000223972.5 transcribed_unproces..     DDX11L1
        [2]      <NA> ENSG00000223972.5 transcribed_unproces..     DDX11L1
        [3]      <NA> ENSG00000223972.5 transcribed_unproces..     DDX11L1
        [4]      <NA> ENSG00000223972.5 transcribed_unproces..     DDX11L1
        [5]      <NA> ENSG00000223972.5 transcribed_unproces..     DDX11L1
        ...       ...               ...                    ...         ...
  [3150420]      <NA> ENSG00000210195.2                Mt_tRNA       MT-TT
  [3150421]      <NA> ENSG00000210195.2                Mt_tRNA       MT-TT
  [3150422]      <NA> ENSG00000210196.2                Mt_tRNA       MT-TP
  [3150423]      <NA> ENSG00000210196.2                Mt_tRNA       MT-TP
  [3150424]      <NA> ENSG00000210196.2                Mt_tRNA       MT-TP
                  level     hgnc_id          havana_gene     transcript_id
            <character> <character>          <character>       <character>
        [1]           2  HGNC:37102 OTTHUMG00000000961.2              <NA>
        [2]           2  HGNC:37102 OTTHUMG00000000961.2 ENST00000456328.2
        [3]           2  HGNC:37102 OTTHUMG00000000961.2 ENST00000456328.2
        [4]           2  HGNC:37102 OTTHUMG00000000961.2 ENST00000456328.2
        [5]           2  HGNC:37102 OTTHUMG00000000961.2 ENST00000456328.2
        ...         ...         ...                  ...               ...
  [3150420]           3   HGNC:7499                 <NA> ENST00000387460.2
  [3150421]           3   HGNC:7499                 <NA> ENST00000387460.2
  [3150422]           3   HGNC:7494                 <NA>              <NA>
  [3150423]           3   HGNC:7494                 <NA> ENST00000387461.2
  [3150424]           3   HGNC:7494                 <NA> ENST00000387461.2
                 transcript_type transcript_name transcript_support_level
                     <character>     <character>              <character>
        [1]                 <NA>            <NA>                     <NA>
        [2] processed_transcript     DDX11L1-202                        1
        [3] processed_transcript     DDX11L1-202                        1
        [4] processed_transcript     DDX11L1-202                        1
        [5] processed_transcript     DDX11L1-202                        1
        ...                  ...             ...                      ...
  [3150420]              Mt_tRNA       MT-TT-201                       NA
  [3150421]              Mt_tRNA       MT-TT-201                       NA
  [3150422]                 <NA>            <NA>                     <NA>
  [3150423]              Mt_tRNA       MT-TP-201                       NA
  [3150424]              Mt_tRNA       MT-TP-201                       NA
                          tag    havana_transcript exon_number
                  <character>          <character> <character>
        [1]              <NA>                 <NA>        <NA>
        [2]             basic OTTHUMT00000362751.1        <NA>
        [3]             basic OTTHUMT00000362751.1           1
        [4]             basic OTTHUMT00000362751.1           2
        [5]             basic OTTHUMT00000362751.1           3
        ...               ...                  ...         ...
  [3150420] Ensembl_canonical                 <NA>        <NA>
  [3150421] Ensembl_canonical                 <NA>           1
  [3150422]              <NA>                 <NA>        <NA>
  [3150423] Ensembl_canonical                 <NA>        <NA>
  [3150424] Ensembl_canonical                 <NA>           1
                      exon_id         ont  protein_id      ccdsid
                  <character> <character> <character> <character>
        [1]              <NA>        <NA>        <NA>        <NA>
        [2]              <NA>        <NA>        <NA>        <NA>
        [3] ENSE00002234944.1        <NA>        <NA>        <NA>
        [4] ENSE00003582793.1        <NA>        <NA>        <NA>
        [5] ENSE00002312635.1        <NA>        <NA>        <NA>
        ...               ...         ...         ...         ...
  [3150420]              <NA>        <NA>        <NA>        <NA>
  [3150421] ENSE00001544475.2        <NA>        <NA>        <NA>
  [3150422]              <NA>        <NA>        <NA>        <NA>
  [3150423]              <NA>        <NA>        <NA>        <NA>
  [3150424] ENSE00001544473.2        <NA>        <NA>        <NA>
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths

> zz <- z[mcols(z)$tag %in% "Ensembl_canonical"]

> library(GenomicFeatures)

> tx <- makeTxDbFromGRanges(zz)
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
> genes(tx)
GRanges object with 34752 ranges and 1 metadata column:
                     seqnames              ranges strand |            gene_id
                        <Rle>           <IRanges>  <Rle> |        <character>
  ENSG00000002079.14     chr7   99238829-99306809      + | ENSG00000002079.14
  ENSG00000005075.15     chr7 102473938-102478906      - | ENSG00000005075.15
  ENSG00000006576.16     chr7   77798792-77957503      + | ENSG00000006576.16
  ENSG00000015592.16     chr8   27239340-27258404      - | ENSG00000015592.16
   ENSG00000018607.7     chr2 132309309-132318736      + |  ENSG00000018607.7
                 ...      ...                 ...    ... .                ...
   ENSG00000288718.1     chr3   46558540-46562495      + |  ENSG00000288718.1
   ENSG00000288719.1    chr22   42183827-42207600      - |  ENSG00000288719.1
   ENSG00000288720.1     chr3   45795970-45828952      + |  ENSG00000288720.1
   ENSG00000288723.1     chr1 241722926-241848128      - |  ENSG00000288723.1
   ENSG00000288724.1     chr3   46284775-46293795      - |  ENSG00000288724.1
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths
> transcripts(tx)
GRanges object with 34752 ranges and 2 metadata columns:
          seqnames      ranges strand |     tx_id           tx_name
             <Rle>   <IRanges>  <Rle> | <integer>       <character>
      [1]     chr1 12010-13670      + |         1 ENST00000450305.2
      [2]     chr1 29554-31097      + |         2 ENST00000473358.1
      [3]     chr1 30366-30503      + |         3 ENST00000607096.1
      [4]     chr1 52473-53312      + |         4 ENST00000606857.1
      [5]     chr1 62949-63887      + |         5 ENST00000492842.2
      ...      ...         ...    ... .       ...               ...
  [34748]     chrM   5761-5826      - |     34748 ENST00000387405.1
  [34749]     chrM   5826-5891      - |     34749 ENST00000387409.1
  [34750]     chrM   7446-7514      - |     34750 ENST00000387416.2
  [34751]     chrM 14674-14742      - |     34751 ENST00000387459.1
  [34752]     chrM 15956-16023      - |     34752 ENST00000387461.2
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths
>

And then all those transcripts would be the canonical ones?

ADD REPLY • link 3.6 years ago James W. MacDonald 68k

0

Entering edit mode

This is great thank you...it would have taken me hours to figure it out. However, it does not give me the info I need. For example the Gencode GTF file may have multiple tag attributes for each row:

For example I'm interested in the gene TMEM14C. Here is an example row in the GTF file:

chr6    HAVANA  transcript      10723070        10731127        .       +       .       gene_id "ENSG00000111843.14"; transcript_id "ENST00000229563.6"; gene_type "protein_coding"; gene_name "TMEM14C"; transcript_type "protein_coding"; transcript_name "TMEM14C-201"; level 2; protein_id "ENSP00000229563.5"; transcript_support_level "1"; hgnc_id "HGNC:20952"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS4514.1"; havana_gene "OTTHUMG00000014242.2"; havana_transcript "OTTHUMT00000039829.2";

When I select for TMEM14C using the z object from import and check the tag values, I get this:

> tmem14c = z[z$gene_name == "TMEM14C"]
> tmem14c$tag
 [1] NA                  "CCDS"              "CCDS"              "CCDS"             
 [5] "CCDS"              "CCDS"              "CCDS"              "CCDS"             
 [9] "CCDS"              "CCDS"              "CCDS"              "CCDS"             
[13] "CCDS"              "CCDS"              "CCDS"              "CCDS"             
[17] "CCDS"              "CCDS"              "CCDS"              "CCDS"             
[21] "CCDS"              "CCDS"              "CCDS"              "CCDS"             
[25] "CCDS"              "CCDS"              "CCDS"              "CCDS"             
[29] "CCDS"              "CCDS"              "CCDS"              "CCDS"             
[33] "CCDS"              "CCDS"              "CCDS"              NA                 
[37] NA                  NA                  NA                  NA                 
[41] NA                  NA                  NA                  NA                 
[45] NA                  NA                  "alternative_5_UTR" "alternative_5_UTR"
[49] "alternative_5_UTR" "alternative_5_UTR" "alternative_5_UTR"

However, I expect this when extracting all the tag attributes for TMEM14C on the command line:

$ cut -f 9 gencode.v38.annotation.gtf | grep TMEM14C | tr ';' '\n' | grep tag | sort | uniq -c
      5  tag "alternative_5_UTR"
     34  tag "appris_principal_1"
     34  tag "basic"
     34  tag "CCDS"
     17  tag "Ensembl_canonical"
     17  tag "MANE_Select"

Any ideas for a workaround? Thank you again,

ADD REPLY • link 3.6 years ago pcantalupo ▴ 10

0

Entering edit mode

Could you just grep the gtf for lines containing "Ensembl_canonical"?

ADD REPLY • link 3.6 years ago swbarnes2 ★ 1.4k

0

Entering edit mode

Here you go...

$ cut -f 9 gencode.v38.annotation.gtf | grep TMEM14C | grep 'Ensembl_canonical' | tr ';' '\n' | grep tag | sort | uniq -c
     17  tag "appris_principal_1"
     17  tag "basic"
     17  tag "CCDS"
     17  tag "Ensembl_canonical"
     17  tag "MANE_Select"

ADD REPLY • link 3.6 years ago pcantalupo ▴ 10

0

Entering edit mode

It might not be in the gtf, but you can get canonical transcript info from BioMart. It might be easier to query biomart than to parse the gtf.

ADD REPLY • link 3.6 years ago swbarnes2 ★ 1.4k

0

Entering edit mode

Correct, I already tried that, thank you. But I'm interested in all the attributes in the GTF file. I was hoping for a simple way in Bioconductor to provide that in a neat little dataframe so I could join it to various differentially expressed transcript tables from DESeq2. However, it does not handle lines with multiple tag attributes and only parses the last one. I submitted this as an Issue to rtracklayer.