Entering edit mode
olavur
▴
10
@olavur-16765
Last seen 5.9 years ago
I'm trying to obtain exon intervals for a particular gene, but am running into two problems. First, the TxDb
query returns a GRanges object with 12 ranges, but I know the gene has 10 exons. Second, two of the exons overlap (as I will show later).
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
# UCSC/Entrez ID of SLC22A5.
gene_id <- '6584'
# Get exons in SLC22A5.
exons_by_gene <- exonsBy(txdb, by='gene')
exons <- exons_by_gene[[gene_id]]
The result:
> as.data.frame(exons_gene)
seqnames start end width strand exon_id exon_name
1 chr5 131705401 131706057 657 + 77618 <NA>
2 chr5 131713856 131713927 72 + 77619 <NA>
3 chr5 131714070 131714173 104 + 77620 <NA>
4 chr5 131719839 131719993 155 + 77621 <NA>
5 chr5 131721020 131721191 172 + 77622 <NA>
6 chr5 131722717 131722843 127 + 77623 <NA>
7 chr5 131724613 131724713 101 + 77624 <NA>
8 chr5 131726382 131726596 215 + 77625 <NA>
9 chr5 131726468 131726596 129 + 77626 <NA>
10 chr5 131728125 131728307 183 + 77627 <NA>
11 chr5 131729368 131729503 136 + 77628 <NA>
12 chr5 131729877 131731306 1430 + 77629 <NA>
Exons with IDs 77625 and 77626 overlap:
> ov <- findOverlapPairs(exons_gene, exons_gene)
> first(ov)$exon_id
[1] 77618 77619 77620 77621 77622 77623 77624 77625 77625 77626 77626 77627 77628 77629
> second(ov)$exon_id
[1] 77618 77619 77620 77621 77622 77623 77624 77625 77626 77625 77626 77627 77628 77629
Also, why do you think that two overlapping exons should only be counted as one? If they have different lengths, they are different, no?
Well, I thought I had reliable sources that the gene has 10 exons, but as you showed that depends on which transcript you're looking at. What I needed to do for my analysis to make sense is to pick a transcript, then there are no overlapping exons. Thanks for the help.
So my solution is:
An alternative to that is to generate a fake transcript that consists of the full extent of all known exons. The obvious tradeoff being that in your method you are saying 'this is the important transcript' and in my method you are saying 'these are all the genomic regions that can be part of any transcript'. But that all depends on what your analysis might be. Anyway: