Question

Gene Size using GenomicFeatures

0

Entering edit mode

ehimelbl • 0

@ehimelbl-14023

Last seen 7.2 years ago

I want to calculate the total length of genes (including introns) from the start codon to the stop codon. I have loaded a GFF file into R using the GenomicFeatures package. Using genes(txdb) I get the output below that shows the positions of the first nucleotide of the start codon and the last nucleotide of the stop codon. Is there a function in GenomicFeatures that will give the gene size/length using this information? Or is there a way to get the absolute value of subtracting the last from the first nucleotide?

> genes(txdb)
GRanges object with 3 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
CG32006-RA Chr4 [1703212, 1705370] + | CG32006-RA
CG33978-RE Chr4 [1640671, 1653615] + | CG33978-RE
mav-RA Chr4 [1708086, 1710381] - | mav-RA
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

genomicfeatures gene size gene length • 5.0k views

ADD COMMENT • link updated 7.4 years ago by Hervé Pagès 16k • written 7.4 years ago by ehimelbl • 0

score 1 · Answer 1 · 2017-09-24

1

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.2 years ago

United States

No, there is no function in GenomicFeatures for that. You have a GRanges object, defined by the GenomicRanges package. There is a GRanges method on the generic function width() that will return what you want. You can check out ?GRanges for more on what you can do with a GRanges.

ADD COMMENT • link 7.4 years ago Michael Lawrence ★ 11k

score 1 · Answer 2 · 2017-09-25

Hi,

Note that because a gene can have more than 1 transcript, there is no one definitive way to look at the length of a gene. A more sensible way to look at this though might be to consider the length of the longest transcript:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
tx_by_gene <- transcriptsBy(txdb, by="gene")
gene_lens <- max(width(tx_by_gene))

gene_lens is a named integer vector where the names are the gene ids and the values the lengths of the longest transcript for each gene:

head(gene_lens)
#        1        10       100      1000     10000 100008586 
#    14383      9969     32214    226516    355050     15729

Also please note that there are a couple long-standing issues with genes() that get in the way here. One of them is that by default genes() discards genes that have transcripts on more than 1 chromosome. Another one is that some organizations like UCSC can link transcripts that are very far apart from each other to the same gene. This makes genes() return a range that is artificially big and has no biological meaning for these genes. The transcriptsBy()-based solution above avoids these problems.

Cheers,

H.

score 0 · Answer 3 · 2017-09-24

0

Entering edit mode

ehimelbl • 0

@ehimelbl-14023

Last seen 7.2 years ago

Thank you! This worked:

gr <- genes(txdb)
genelengths <- width (gr)

ADD COMMENT • link 7.4 years ago ehimelbl • 0