Question

getting the longest transcript by gene from Refseq

1

Entering edit mode

dalloliogm ▴ 50

@dalloliogm-7141

Last seen 3.8 years ago

European Union

The Refseq Gene table contains more than one transcript per gene:

> library(dplyr)
> library(AnnotationHub)
> ahub = AnnotationHub()
> refgene = ahub[['AH5040']]
> table(table(mcols(refgene)['name']))
    1     2     3     4     5     6     7     8     9    10    12    13    19
45178   568   102    27    74    52   171   129    13    19     1     1     5

The vast majority of genes has only one transcript, but 1,162 genes have more.

In order to simplify my analysis I would like to select only one transcript per gene, the longest. However I am not familiar with bioconductor methods and I am not sure how to select it. For the moment I am using dplyr to sort transcripts by length and get the longest:

> refgene.ranks = refgene %>%
    DataFrame(., start=start(.), end=end(.), chrom=seqnames(.), transcript.length=width(.) ) %>% 
    subset(., select=c(name, chrom, start, end, transcript.length)) %>%
    data.frame %>% 
    group_by(name) %>% 
    arrange(-transcript.length) %>% 
    mutate(rank=row_number())

The code is complicated because I am more familiar with dplyr, so I need to convert the data to a data frame in order to sort.

The refgene.ranks variable contains transcripts ranked by their length. For example NM_001080137 has 9 transcripts:

> refgene.ranks %>% subset(name == 'NM_001080137')
Source: local data frame [9 x 3]
Groups: name [1]

          name  chrom     start       end transcript.length  rank
         (chr) (fctr)     (int)     (int)             (int) (int)
1 NM_001080137   chrX 120011345 120066151             54807     1
2 NM_001080137   chrX 120067695 120071012              3318     2
3 NM_001080137   chrX 120111461 120114778              3318     3
4 NM_001080137   chrX 120096881 120100198              3318     4
5 NM_001080137   chrX 120101741 120105058              3318     5
6 NM_001080137   chrX 120106601 120109918              3318     6
7 NM_001080137   chrX 120072556 120075873              3318     7
8 NM_001080137   chrX 120077416 120080733              3318     8
9 NM_001080137   chrX 120082277 120085594              3318     9

At this point I can get a GRanges from the longest transcripts per gene:

> longest.transcripts.gr = refgene.ranks %>% 
    subset(rank==1) %>% 
    makeGRangesFromDataFrame(., keep.extra.columns=T)

Then I should be able to subset the original refgene GRanges by the coordinates of the longest transcripts. However this doesn't work very well, because after that I still get more than one transcript for some gene:

> refgene.longest = refgene %>% subsetByOverlaps(refgene.ranks.gr, type='equal')
> refgene.longest %>% mcols %>% subset(., select=name) %>% data.frame %>% count(name) %>% arrange(-n)
Source: local data frame [46,308 x 2]

           name     n
          (chr) (int)
1  NM_001130404     4
2  NM_001130405     4
3  NM_001130406     4
4  NM_001130407     4
5  NM_001242326     4
6  NM_001242327     4
7  NM_001242328     4
8  NM_001242329     4
9  NM_001242330     4
10 NM_001242331     4

What am I doing wrong in this intersection? Is there a simpler way to select the longest transcript by gene?

refseq annotationhub • 4.0k views

ADD COMMENT • link updated 9.3 years ago by Martin Morgan 25k • written 9.3 years ago by dalloliogm ▴ 50

score 3 · Accepted Answer · 2015-09-29

You can split the data structure ('UCSCData', which is a little idiosyncractic to rtracklayer) into a list, where each element belongs to a particular gene

rd = splitAsList(refgene, refgene$name)

Then find the longest (maybe this is the difference -- here ties go to the first transcript?)

maxwd = which.max(width(rd))
refgene.longest = rd[splitAsList(unname(maxwd), names(maxwd))]

Confirming that there is only one per gene, showing the first, and unlisting to retrieve something with the original geometry:

table(elementLengths(refgene.longest))
refgene.longest[[1]]
unlist(refgene.longest)