Question

extracting gene names, gene id and transcript id

0


Bogdan ▴ 670

@bogdan-2367

Last seen 17 months ago

Palo Alto, CA, USA

Dear all,

Given a GTF file (for example, gencode.v28.basic.annotation.gtf), what is the simplest way to extract a table with the following information:

-- gene_name

-- gene_id

-- transcript_id

Many thanks!

bogdan

gtf • 11k views

ADD COMMENT • link updated 6.6 years ago by lee.s ▴ 70 • written 6.6 years ago by Bogdan ▴ 670

1


lee.s ▴ 70

@lees-15179

Last seen 5.4 years ago

Another option with plyranges

library(plyranges)
gr <- read_gff("your_file.gtf") %>% select(gene_id, gene_name, transcript_id)

ADD COMMENT • link 6.6 years ago lee.s ▴ 70

4


I don't see a read_gtf in plyranges, in either release or devel?

Anyway, this is just a two-liner using basic rtracklayer/GenomicRanges functions.

> library(rtracklayer)
> z <- import("ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.basic.annotation.gtf.gz")
> mcols(z)[,c("gene_id","gene_name","transcript_id")]
DataFrame with 1684537 rows and 3 columns
                  gene_id   gene_name     transcript_id
              <character> <character>       <character>
1       ENSG00000223972.5     DDX11L1                NA
2       ENSG00000223972.5     DDX11L1 ENST00000456328.2
3       ENSG00000223972.5     DDX11L1 ENST00000456328.2
4       ENSG00000223972.5     DDX11L1 ENST00000456328.2
5       ENSG00000223972.5     DDX11L1 ENST00000456328.2
...                   ...         ...               ...
1684533 ENSG00000210195.2       MT-TT ENST00000387460.2
1684534 ENSG00000210195.2       MT-TT ENST00000387460.2
1684535 ENSG00000210196.2       MT-TP                NA
1684536 ENSG00000210196.2       MT-TP ENST00000387461.2
1684537 ENSG00000210196.2       MT-TP ENST00000387461.2
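
The table above has one row per GTF feature, so each gene/transcript pair repeats for every exon and CDS, and gene-level rows carry NA in transcript_id. A minimal sketch of collapsing it to one row per transcript, assuming the metadata has first been converted with as.data.frame(mcols(z)). The helper name tx_table and the toy data frame are hypothetical and only there for illustration:

```r
# Hypothetical helper: collapse per-feature annotation rows to one row per transcript.
# `ann` is assumed to be a data.frame with gene_id, gene_name and transcript_id
# columns, e.g. as.data.frame(mcols(z)) from an rtracklayer import.
tx_table <- function(ann) {
  cols <- c("gene_name", "gene_id", "transcript_id")
  tab <- ann[!is.na(ann$transcript_id), cols]  # drop gene-level rows (no transcript)
  tab <- unique(tab)                           # keep one row per gene/transcript pair
  rownames(tab) <- NULL
  tab
}

# Toy input mimicking the first rows of the output above
ann <- data.frame(
  gene_id = rep("ENSG00000223972.5", 4),
  gene_name = rep("DDX11L1", 4),
  transcript_id = c(NA, rep("ENST00000456328.2", 3)),
  stringsAsFactors = FALSE
)
tx_table(ann)
```

The result is a plain data.frame, so it can be written out with write.table() if a file is needed.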

ADD REPLY • link 6.6 years ago James W. MacDonald 68k

0


Yes, you're right, thanks! The backend of the readers uses import, so read_gff() should still work. I should update plyranges to explicitly include GTF.

ADD REPLY • link 6.6 years ago lee.s ▴ 70

score 4 · Accepted Answer · 2018-08-28

4


jaro.slamecka ▴ 140

@jaroslamecka-7419

Last seen 9 weeks ago

Mitchell Cancer Institute, Mobile AL, U…

If you can use an Ensembl GTF, one easy and fast way is to use the refGenome package

library(refGenome)
gtf = ensemblGenome()
read.gtf(gtf, filename="Homo_sapiens.GRCh38.93.gtf")
genes = gtf@ev$gtf[ ,c("gene_name","gene_id","transcript_id")]

ADD COMMENT • link 6.6 years ago jaro.slamecka ▴ 140

0


Thank you, Jaro. I wish it worked. On my Ubuntu system, using a GTF file from the STAR aligner website, it fails with:

"terminate called after throwing an instance of 'std::length_error'

what():  basic_string::_S_create

Aborted (core dumped)"

ADD REPLY • link 6.6 years ago Bogdan ▴ 670

0


It needs to be either Ensembl or UCSC (you'd use it with gtf=ucscGenome()), that's the limitation. What exactly is the GTF file from the STAR website you describe? Can you post a link to it?
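
For files from other sources, the three fields can also be read directly from the attribute column (column 9) of the GTF with base R. A rough sketch, assuming the usual `key "value";` attribute syntax and that transcript-level records are labelled "transcript" in column 3. The helper names get_attr and gtf_to_table are hypothetical, and the inline records are abbreviated GENCODE-style lines used only as a demonstration:

```r
# Hypothetical helper: extract the value of one attribute key from GTF column 9.
get_attr <- function(attr, key) {
  # Anchor the key at the start of the string or right after a ';'
  pattern <- paste0('(^|; *)', key, ' "([^"]*)"')
  m <- regmatches(attr, regexec(pattern, attr))
  vapply(m, function(x) if (length(x) == 3) x[3] else NA_character_, character(1))
}

# Hypothetical helper: build a gene_name / gene_id / transcript_id table
# from the raw lines of a GTF file, using transcript records only.
gtf_to_table <- function(lines) {
  lines <- lines[!grepl("^#", lines)]              # skip header comment lines
  fields <- strsplit(lines, "\t", fixed = TRUE)
  feature <- vapply(fields, `[`, character(1), 3)  # column 3: feature type
  attr <- vapply(fields, `[`, character(1), 9)     # column 9: attributes
  keep <- feature == "transcript"
  data.frame(
    gene_name = get_attr(attr[keep], "gene_name"),
    gene_id = get_attr(attr[keep], "gene_id"),
    transcript_id = get_attr(attr[keep], "transcript_id"),
    stringsAsFactors = FALSE
  )
}

# Abbreviated example records (attributes trimmed to the fields of interest)
lines <- c(
  "##description: toy example",
  paste("chr1", "HAVANA", "gene", "11869", "14409", ".", "+", ".",
        'gene_id "ENSG00000223972.5"; gene_name "DDX11L1";', sep = "\t"),
  paste("chr1", "HAVANA", "transcript", "11869", "14409", ".", "+", ".",
        'gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_name "DDX11L1";',
        sep = "\t")
)
gtf_to_table(lines)
```

For a real file, the same function could be called as gtf_to_table(readLines("your_file.gtf")). Reading a full annotation this way is slower and more memory-hungry than a dedicated parser, so it is best treated as a fallback.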

ADD REPLY • link 6.6 years ago jaro.slamecka ▴ 140

0


Thank you, Jaro. The links are:

http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/ENSEMBL/homo_sapiens/ENSEMBL.homo_sapiens.release-83/

the file is : Homo_sapiens.GRCh38.83.gtf

http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/GENCODE/GRCh38_Gencode26/

the file is : gencode.v26.primary_assembly.annotation.gtf

ADD REPLY • link 6.6 years ago Bogdan ▴ 670

0


During the last analysis, where I mentioned the errors, the GTF files that I used were from GENCODE: