Question

Difference in TxDb.Hsapiens.UCSC.hg19.knownGene and UCSC knownGene

adnan.niazi • 0

@adnanniazi-8272

Last seen 4.8 years ago

Sweden

Hi,

I wanted a list of all exons in the human genome (hg19) along with their coordinates. For this purpose, I downloaded knownGene.txt.gz from UCSC, extracted the exon coordinates for all the transcripts, and removed the duplicates. This left ~3 million unique exons with coordinates. Besides this, I also used the TxDb.Hsapiens.UCSC.hg19.knownGene package to extract a list of exons using the command:

exon_list <- as.data.frame(exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx"))

which listed ~7 million unique exon coordinates, most of which are not found in the list generated from UCSC. The following is a chunk from the file I prepared:

chr   start   end   strand
chr1   11874   12227   +
chr1   12613   12721   +
chr1   13221   14409   +
chr1   12595   12721   +
chr1   13403   14409   +

I am confused about which is the reliable way to obtain all exons, and why there is a difference between the two sources. Thanks in advance.

regards,

ad

annotate exons txdb.hsapiens.ucsc.hg19.knowngene • 2.3k views

ADD COMMENT • link 9.8 years ago adnan.niazi • 0

You will have to give more details about how you got your data from UCSC. In addition:

> ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx")

> sum(elementNROWS(ex))  # elementLengths() in older Bioconductor releases
[1] 742493

So there are not even 1M non-unique exons! In other words, there are about 742K exons when every exon is counted once for each transcript in knownGene that contains it. Many of these are duplicates, as two transcripts of the same gene are likely to share one or more identical exons (although obviously not all of them).
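To see how much of that total comes from exons shared between transcripts, one option is to flatten the per-transcript list and collapse identical ranges. This is only a sketch: the variable names are mine, and it assumes the package is installed and that two exons count as "the same" when chromosome, start, end and strand all match.

```r
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# One GRanges per transcript, holding that transcript's exons
ex_by_tx <- exonsBy(txdb, by = "tx")

# Flatten to a single GRanges: one row per (transcript, exon) pair
all_ex <- unlist(ex_by_tx, use.names = FALSE)

# Drop the metadata columns, then keep one copy of each distinct range
uniq_ex <- unique(granges(all_ex))

length(all_ex)   # exons counted once per transcript
length(uniq_ex)  # distinct exon ranges
```

The second number cannot exceed the first, and the gap between them is the amount of exon sharing between transcripts.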

But it is easier to just get the (canonical, if I am not mistaken) exons using exons():

> ex <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene)
> length(ex)
[1] 289969

which is an order of magnitude fewer exons than you say you got from the direct download.
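For comparison, here is a rough base R sketch of how unique exons could be pulled out of knownGene.txt-style rows. The two rows below are toy data with made-up transcript names, built so that the result matches the five lines shown in the question; they are not copied from the real file. One assumption worth checking in your own processing: UCSC tables store exon starts 0-based, while the TxDb reports 1-based starts, so the starts need + 1 before the two sources can be matched.

```r
# Toy stand-in for two knownGene.txt rows (names "txA"/"txB" are hypothetical)
kg <- data.frame(
  name       = c("txA", "txB"),
  chrom      = c("chr1", "chr1"),
  strand     = c("+", "+"),
  exonStarts = c("11873,12612,13220,", "11873,12594,13402,"),
  exonEnds   = c("12227,12721,14409,", "12227,12721,14409,"),
  stringsAsFactors = FALSE
)

# Split the comma-separated fields into one integer vector per transcript
split_coords <- function(x) lapply(strsplit(x, ",", fixed = TRUE), as.integer)
starts  <- split_coords(kg$exonStarts)
ends    <- split_coords(kg$exonEnds)
n_exons <- lengths(starts)

# One row per (transcript, exon) pair
per_tx <- data.frame(
  chr    = rep(kg$chrom, n_exons),
  start  = unlist(starts) + 1L,  # 0-based UCSC start -> 1-based start
  end    = unlist(ends),         # UCSC end is already the 1-based inclusive end
  strand = rep(kg$strand, n_exons),
  stringsAsFactors = FALSE
)

# Collapse exons that are identical across transcripts
unique_exons <- unique(per_tx)

nrow(per_tx)        # 6 exons counted per transcript
nrow(unique_exons)  # 5 distinct exons
```

If the unique count from the real file still comes out in the millions, the first things I would check are whether each (transcript, exon) pair really ends up on a single row and whether the deduplication key includes only chr, start, end and strand.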

ADD REPLY • link 9.8 years ago James W. MacDonald 68k