Question

POU5F1(OCT4) /ENTREZID: 5460 is not in the TxDb.Hsapiens.UCSC.hg19.knownGene

1

Entering edit mode

tangming2005 ▴ 200

@tangming2005-6754

Last seen 6 months ago

United States

Hi,

I was trying to get the coordinates of some genes, and was surprised that POU5F1 gene is not in the txdb.

Thank you for looking into this.

Tommy

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene
hg19.genes<- genes(txdb)

library(AnnotationDbi)
library("org.Hs.eg.db")

## note that dplyr and AnnotationDbi both have a function called select
## use dplyr::select when use dplyr

gene_symbol<- AnnotationDbi::select(org.Hs.eg.db, keys=hg19.genes$gene_id, 
                                    columns="SYMBOL", keytype="ENTREZID")

"POU5F1" %in% gene_symbol$SYMBOL

#FALSE

AnnotationDbi::select(org.Hs.eg.db, keys="POU5F1", 
                                    columns= "ENTREZID", keytype= "SYMBOL")

#5460

hg19.genes["5460",]

#Error: subscript contains invalid names


> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dplyr_0.7.0                             AnnotationHub_2.6.0                     org.Hs.eg.db_3.4.0                     
 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.26.0                  AnnotationDbi_1.36.0                   
 [7] Biobase_2.34.0                          GenomicRanges_1.26.1                    GenomeInfoDb_1.10.1                    
[10] IRanges_2.8.1                           S4Vectors_0.12.1                        BiocGenerics_0.20.0                    
[13] here_0.0-6                             

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8                   BiocInstaller_1.24.0          XVector_0.14.0                bitops_1.0-6                 
 [5] tools_3.3.1                   zlibbioc_1.20.0               biomaRt_2.30.0                digest_0.6.10                
 [9] tibble_1.3.3                  RSQLite_1.0.0                 lattice_0.20-34               rlang_0.1.1                  
[13] Matrix_1.2-7.1                shiny_0.14.1                  DBI_0.5-1                     curl_2.6                     
[17] rtracklayer_1.34.1            httr_1.2.1                    knitr_1.14                    Biostrings_2.42.1            
[21] rprojroot_1.2                 grid_3.3.1                    glue_1.1.1                    R6_2.2.0                     
[25] XML_3.98-1.5                  readxl_0.1.1                  BiocParallel_1.8.0            magrittr_1.5                 
[29] Rsamtools_1.26.1              backports_1.0.5               htmltools_0.3.5               GenomicAlignments_1.10.0     
[33] assertthat_0.1                SummarizedExperiment_1.4.0    xtable_1.8-2                  mime_0.5                     
[37] interactiveDisplayBase_1.12.0 httpuv_1.3.3                  RCurl_1.95-4.8

txdb.hsapiens.ucsc.hg19.knowngene • 2.2k views

ADD COMMENT • link updated 7.6 years ago by jma1991 ▴ 70 • written 7.6 years ago by tangming2005 ▴ 200

score 1 · Accepted Answer · 2017-09-25

1

Entering edit mode

jma1991 ▴ 70

@jma1991-11856

Last seen 2.1 years ago

Cumbernauld

It's called POU5F1B instead of POU5F1:

library("TxDb.Hsapiens.UCSC.hg19.knownGene")

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
hg19.genes <- genes(txdb)

library("AnnotationDbi")
library("org.Hs.eg.db")

gene_symbol <- AnnotationDbi::select(org.Hs.eg.db, keys=hg19.genes$gene_id, columns="SYMBOL", keytype="ENTREZID")

"POU5F1B" %in% gene_symbol$SYMBOL

ADD COMMENT • link 7.6 years ago jma1991 ▴ 70

0

Entering edit mode

Thanks! I have a list of genes which I can not find in the txdb.

DAXX, HLA-A, SHOX and TRIM27.

where can I find the alternative names?

Tommy

ADD REPLY • link 7.6 years ago tangming2005 ▴ 200

1

Entering edit mode

> select(org.Hs.eg.db, c("DAXX","HLA-A","SHOX","TRIM27"), "ENTREZID", "ALIAS")
'select()' returned 1:1 mapping between keys and columns
   ALIAS ENTREZID
1   DAXX     1616
2  HLA-A     3105
3   SHOX     6473
4 TRIM27     5987

ADD REPLY • link 7.6 years ago James W. MacDonald 68k

0

Entering edit mode

thanks. looks like I can not find DAXX, HLA-A, SHOX and TRIM27 in TxDb.Hsapiens.UCSC.hg19.knownGene even using the other names. but POU5F1B is not shown either

AnnotationDbi::select(org.Hs.eg.db, c("POU5F1", "DAXX", "HLA-A","SHOX","TRIM27"), keytype = "SYMBOL", columns = "ALIAS")

'select()' returned 1:many mapping between keys and columns
   SYMBOL  ALIAS
1  POU5F1   OCT3
2  POU5F1   OCT4
3  POU5F1  OTF-3
4  POU5F1   OTF3
5  POU5F1   OTF4
6  POU5F1  Oct-3
7  POU5F1  Oct-4
8  POU5F1 POU5F1
9    DAXX  BING2
10   DAXX   DAP6
11   DAXX   EAP1
12   DAXX   DAXX
13  HLA-A   HLAA
14  HLA-A  HLA-A
15   SHOX   GCFX
16   SHOX   PHOG
17   SHOX  SHOXY
18   SHOX     SS
19   SHOX   SHOX
20 TRIM27    RFP
21 TRIM27  RNF76
22 TRIM27 TRIM27

using the ENTREZID.

ADD REPLY • link 7.6 years ago tangming2005 ▴ 200

0

Entering edit mode

What you show and what you say are sort of orthogonal things. Do note that using select on the org.Hs.eg.db package isn't the same thing as doing (like, anything) with the TxDb.Hsapiens.UCSC.hg19.knownGene package. So I am unsure what the code you show is supposed to illustrate, exactly.

But this brings up the question of "what is a gene, and how should one define it". Currently, we use a relatively simplistic definition of 'the region between the start of the transcripts from a gene to the end of the transcripts for that gene'. Which is sort of problematic for some genes. For example, some of the lincRNA genes can be found in two places on a given chromosome, so you end up with the (unlikely) gene region that encompasses some huge proportion of the chromosome.

This is also a problematic definition for - as it so happens - all of the genes you are talking about. So let's see:

> txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
> gns <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
> library(org.Hs.eg.db)

> toget <- mapIds(org.Hs.eg.db, c("POU5F1", "DAXX", "HLA-A","SHOX","TRIM27"), "ENTREZID", "ALIAS")
'select()' returned 1:1 mapping between keys and columns
> toget
POU5F1   DAXX  HLA-A   SHOX TRIM27
"5460" "1616" "3105" "6473" "5987"

So there are the Entrez Gene IDs for those genes. Now let's subset the GRanges of our genes:

> gns[toget]
Error: subscript contains invalid names
> toget %in% names(gns)
[1] FALSE FALSE FALSE FALSE FALSE

How can this be? Well, let's look at the transcripts:

> toget %in% names(txs)
[1] TRUE TRUE TRUE TRUE TRUE

So we have transcripts for all those genes, but no genes? That seems...wrong? This next bit of code is a bit complex, primarily to clean it up to more easily show what is happening:

> lapply(txs[toget], function(x) as.character(unique(seqnames(x))))
$`5460`
[1] "chr6"           "chr6_cox_hap2"  "chr6_dbb_hap3"  "chr6_mann_hap4"
[5] "chr6_mcf_hap5"  "chr6_qbl_hap6"  "chr6_ssto_hap7"

$`1616`
[1] "chr6"          "chr6_cox_hap2" "chr6_dbb_hap3" "chr6_mcf_hap5"
[5] "chr6_qbl_hap6"

$`3105`
[1] "chr6"           "chr6_apd_hap1"  "chr6_cox_hap2"  "chr6_dbb_hap3"
[5] "chr6_mann_hap4" "chr6_mcf_hap5"  "chr6_qbl_hap6"  "chr6_ssto_hap7"

$`6473`
[1] "chrX" "chrY"

$`5987`
[1] "chr6"           "chr6_apd_hap1"  "chr6_cox_hap2"  "chr6_mann_hap4"
[5] "chr6_mcf_hap5"  "chr6_qbl_hap6"  "chr6_ssto_hap7"

Every one of those genes is found on more than one chromosome (or unplaced scaffold), in which case our definition of a 'gene' fails, because you can't define a region between the start and stop of the transcripts if those transcripts are on different chromosomes!

ADD REPLY • link 7.6 years ago James W. MacDonald 68k

0

Entering edit mode

Hi James, thanks very much for this detailed answer. now I remember a related question you answered me some time ago C: TxDb.Hsapiens.UCSC.hg19.knownGene misannotation of a gene

again, your help is much appreciated. learned a lot!

Tommy

ADD REPLY • link 7.6 years ago tangming2005 ▴ 200

0

Entering edit mode

Actually I'm a bit confused now.. Neither the gene symbol or alias returns the canonical OCT4/POU5F1 gene:

library("org.Hs.eg.db")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")

genes <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
symbols <- mapIds(org.Hs.eg.db, keys = genes$gene_id, "SYMBOL", keytype = "ENTREZID")
alias <- mapIds(org.Hs.eg.db, keys = genes$gene_id, "ALIAS", keytype = "ENTREZID")

grep("POU", symbols, value = TRUE)
grep("POU", alias, value = TRUE)

grep("OCT", symbols, value = TRUE)
grep("OCT", alias, value = TRUE)

ADD REPLY • link 7.6 years ago jma1991 ▴ 70

0

Entering edit mode

Why are you confused? I just explained in some detail why you won't get that gene doing what you have done.

ADD REPLY • link 7.6 years ago James W. MacDonald 68k

0

Entering edit mode

Sorry, so are you saying OCT4/POU5F1 has transcripts on different chromosomes?

ADD REPLY • link 7.6 years ago jma1991 ▴ 70

0

Entering edit mode

No, that's not what I said. Here are the locations for the gene you care about:

$`5460`
[1] "chr6"           "chr6_cox_hap2"  "chr6_dbb_hap3"  "chr6_mann_hap4"
[5] "chr6_mcf_hap5"  "chr6_qbl_hap6"  "chr6_ssto_hap7"

And all of those things are 1) Chromosome 6, and 2) a bunch of unplaced or haplotype scaffolds. If you don't know what that means, you should listen to what Dan has to say.

ADD REPLY • link 7.6 years ago James W. MacDonald 68k