Hi,
I wanted a list of all exons in the human genome (hg19) along with their coordinates. For this purpose, I downloaded knownGene.txt.gz from UCSC, extracted the exon coordinates for all the transcripts, and removed the duplicates. This gave ~3 million unique exons. Besides this, I also used the TxDb.Hsapiens.UCSC.hg19.knownGene package to extract a list of exons using the command:
exon_list <- as.data.frame(exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx"))
which listed ~7 million unique exon coordinates, most of which are not found in the list generated from UCSC. The following is a chunk from the file I prepared:
chr start end strand
chr1 11874 12227 +
chr1 12613 12721 +
chr1 13221 14409 +
chr1 12595 12721 +
chr1 13403 14409 +
I am confused about which is the reliable way to obtain all exons, and why there is a difference between the two sources. Thanks in advance.
regards,
ad
You will have to give more details about how you got your data from UCSC. In addition:
So there are not even 1M non-unique exons! In other words, there are about 742K exons in total across all the transcripts in knownGene. But there are likely to be multiple identical exons listed here, as two transcripts of the same gene are likely to have one or more exons that are identical (although obviously not all of them are).
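For what it's worth, a minimal sketch of how such counts could be checked is below. It assumes the TxDb.Hsapiens.UCSC.hg19.knownGene package is installed; the variable names are only illustrative, and I am not quoting any output since the exact numbers depend on the package version.

```r
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# GRangesList with one element per transcript, each holding that transcript's exons
ex_by_tx <- exonsBy(txdb, by = "tx")

# Flatten to a single GRanges; an exon shared by several transcripts
# appears once per transcript here
all_exons <- unlist(ex_by_tx, use.names = FALSE)
length(all_exons)

# Keep only chromosome, start, end and strand, then drop repeated ranges
uniq_exons <- unique(granges(all_exons))
length(uniq_exons)
```

Comparing the two lengths shows how much of the total is just the same exon being reused by different transcripts.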
But it is easier to just get the distinct exons (each listed once, if I am not mistaken) using exons():
which is an order of magnitude fewer exons than you say you got from the direct download.
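To line the two sources up, something like the rough sketch below might help. It rests on a few assumptions: that knownGene.txt uses the usual UCSC table layout with comma-separated exonStarts and exonEnds columns, that the starts in that file are 0-based while the TxDb reports 1-based starts, and that the toy rows (built from the coordinates in the question, with made-up names tx1 and tx2) stand in for the real file. parse_known_gene is a hypothetical helper, not a function from any package.

```r
# Hypothetical helper: expand knownGene-style rows into one row per exon.
# Only the chrom, strand, exonStarts and exonEnds columns are used.
parse_known_gene <- function(kg) {
  # exonStarts / exonEnds are comma-separated lists with a trailing comma
  starts <- strsplit(as.character(kg$exonStarts), ",", fixed = TRUE)
  ends   <- strsplit(as.character(kg$exonEnds),   ",", fixed = TRUE)
  n <- lengths(starts)
  data.frame(
    chr    = rep(as.character(kg$chrom), n),
    start  = as.integer(unlist(starts)) + 1L,  # assumed 0-based start -> 1-based
    end    = as.integer(unlist(ends)),
    strand = rep(as.character(kg$strand), n),
    stringsAsFactors = FALSE
  )
}

# Toy stand-in for the downloaded table (tx1 / tx2 are made-up names)
kg <- data.frame(
  name       = c("tx1", "tx2"),
  chrom      = "chr1",
  strand     = "+",
  exonStarts = c("11873,12612,13220,", "11873,12594,13402,"),
  exonEnds   = c("12227,12721,14409,", "12227,12721,14409,"),
  stringsAsFactors = FALSE
)
exons_df <- parse_known_gene(kg)  # one row per exon per transcript
uniq_df  <- unique(exons_df)      # shared exons collapsed to one row
nrow(uniq_df)

# TxDb side: runs only if the annotation package is installed
if (requireNamespace("TxDb.Hsapiens.UCSC.hg19.knownGene", quietly = TRUE)) {
  library(TxDb.Hsapiens.UCSC.hg19.knownGene)
  ex <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene)
  txdb_df <- as.data.frame(ex)[, c("seqnames", "start", "end", "strand")]
  nrow(txdb_df)
}
```

If the start coordinates were compared without the +1 shift, almost nothing would match between the two tables, which may be part of what you are seeing. That is a guess on my part until you post how the UCSC file was processed.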