Question

TxDb.Hsapiens.UCSC.hg19.knownGene Exons that are not part of any gene


Aliaksei Holik ▴ 350


Dear Bioconductors,

This is a bit of a curiosity question. I have been working with the TxDb.Hsapiens.UCSC.hg19.knownGene package and noticed that there are some exons that do not seem to be part of any gene.

> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> # get all the genes
> genic.regions <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
> # get all the exons
> exonic.regions <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene)
> # Find the overlaps between the genes and exons
> findOverlaps(genic.regions, exonic.regions)
Hits object with 270213 hits and 0 metadata columns:
           queryHits subjectHits
           <integer>   <integer>
       [1]         1      250809
       [2]         1      250810
       [3]         1      250811
       [4]         1      250812
       [5]         1      250813
       ...       ...         ...
  [270209]     23056      266961
  [270210]     23056      266962
  [270211]     23056      266963
  [270212]     23056      266964
  [270213]     23056      266965
  -------
  queryLength: 23056
  subjectLength: 289969

As you can see, there are nearly 290,000 exons, but only about 270,000 overlaps with genes are reported. I can see it very clearly if I plot the genes and exons overlapping a fragment of a chromosome. There are a few exons (marked by the green triangle) that do not appear to be part of any gene. So my question is: what might they be, and how should I deal with them if, for instance, I am trying to get the coordinates of the intronic or intergenic regions?
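One caveat on reading these numbers: a Hits object counts gene/exon pairs, so an exon that overlaps two genes appears twice, and the number of hits is not the same as the number of distinct exons inside genes. A minimal sketch on toy ranges shows how the two counts can be separated. The coordinates and the names toy.genes and toy.exons are made up for illustration; with the real annotation they would be replaced by genic.regions and exonic.regions.

```r
library(GenomicRanges)

# Hypothetical toy data standing in for genes(txdb) and exons(txdb).
# Two overlapping genes and three exons, the last one outside any gene.
toy.genes <- GRanges("chr1", IRanges(start = c(100, 150), end = c(300, 400)),
                     strand = "+")
toy.exons <- GRanges("chr1", IRanges(start = c(160, 350, 900), end = c(200, 380, 950)),
                     strand = "+")

hits <- findOverlaps(toy.genes, toy.exons)

# Number of gene/exon pairs (what the Hits summary line reports).
n.pairs <- length(hits)

# Number of distinct exons that overlap at least one gene.
n.exons.in.genes <- length(unique(subjectHits(hits)))

# Number of exons that overlap no gene at all.
n.orphan.exons <- sum(!overlapsAny(toy.exons, toy.genes))
```

Here the first exon sits inside both genes, so it contributes two pairs but only one distinct exon.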

Discrepancy between Genes and exons in TxDb.Hsapiens.UCSC.hg19.knownGene

TxDb.Hsapiens.UCSC.hg19.knownGene genomicfeatures annotation • 2.4k views


I don't think your images are showing, if you have any.

ADD REPLY • link 9.2 years ago Aaron Lun ★ 28k


Thanks, fixed it.

ADD REPLY • link 9.2 years ago Aliaksei Holik ▴ 350

score 2 · Accepted Answer · 2016-01-25

There are undoubtedly many reasons why the exons and genes don't all line up. One likely reason is the question of what counts as a gene. If you look at the names of the GRanges object you get from genes(TxDb), they are all Entrez Gene IDs, which is in one sense the list of all the 'genes'.

But there are any number of 'genes' that don't (yet) have Entrez Gene IDs. There are lots of lincRNA, piRNA, and probably even miRNA sequences that are not in the Gene database. For example, taking the naive approach, we can check this directly.

> ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
> exns <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene)

> exns[!exns %over% unlist(ex),]
GRanges object with 19332 ranges and 1 metadata column:
                seqnames           ranges strand   |   exon_id
                   <Rle>        <IRanges>  <Rle>   | <integer>
      [1]           chr1 [321084, 321115]      +   |         8
      [2]           chr1 [321146, 321207]      +   |         9
      [3]           chr1 [420206, 420296]      +   |        20
      [4]           chr1 [420992, 421258]      +   |        21
      [5]           chr1 [421396, 421839]      +   |        22
      ...            ...              ...    ... ...       ...
  [19328] chrUn_gl000241   [35706, 35859]      -   |    289965
  [19329] chrUn_gl000241   [36711, 36875]      -   |    289966
  [19330] chrUn_gl000243   [11501, 11530]      +   |    289967
  [19331] chrUn_gl000243   [13608, 13637]      +   |    289968
  [19332] chrUn_gl000247   [ 5787,  5816]      -   |    289969
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

So, as you say, there are about 20K exons without a corresponding gene. The first two are piRNAs, and the next three are clone images. With a sample size of 5 out of 20K, I would venture to guess that it's probably a combination of all sorts of things that have been reported by someone somewhere but have not yet become 'real' enough to make it into the Gene database.
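As for the intronic and intergenic coordinates, one possible way to handle these exons is to treat anything covered by either a gene or an exon as annotated, so that orphan exons are not reported as intergenic. The following is only a sketch under stated assumptions: the toy coordinates, the 1000 bp chromosome length, and the names toy.genes, toy.exons and chrom are all hypothetical, strand is ignored throughout, and introns are taken to be gene-body positions not covered by any exon. Whether orphan exons should count as annotated depends on the analysis.

```r
library(GenomicRanges)

# Hypothetical toy data; with the real annotation these would be
# genes(txdb) and exons(txdb). The third exon belongs to no gene.
toy.genes <- GRanges("chr1", IRanges(start = c(100, 600), end = c(300, 800)),
                     strand = c("+", "-"))
toy.exons <- GRanges("chr1",
                     IRanges(start = c(100, 250, 400, 600, 750),
                             end   = c(150, 300, 450, 650, 800)),
                     strand = c("+", "+", "+", "-", "-"))

# Assumed chromosome extent (made-up length of 1000 bp).
chrom <- GRanges("chr1", IRanges(start = 1, end = 1000))

# granges() drops metadata columns so the two objects can be combined.
annotated <- reduce(c(granges(toy.genes), granges(toy.exons)),
                    ignore.strand = TRUE)

# Intergenic: everything on the chromosome not covered by a gene or an exon.
intergenic <- setdiff(chrom, annotated, ignore.strand = TRUE)

# Intronic: positions inside a gene body that no exon covers.
intronic <- setdiff(reduce(toy.genes, ignore.strand = TRUE),
                    reduce(toy.exons, ignore.strand = TRUE),
                    ignore.strand = TRUE)
```

With this choice the orphan exon at 400-450 splits the space between the two genes into two intergenic intervals instead of being absorbed into one.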