POU5F1(OCT4) /ENTREZID: 5460 is not in the TxDb.Hsapiens.UCSC.hg19.knownGene
1
1
Entering edit mode
tangming2005 ▴ 200
@tangming2005-6754
Last seen 3 months ago
United States

Hi,

I was trying to get the coordinates of some genes, and was surprised that POU5F1 gene is not in the txdb.

Thank you for looking into this.

Tommy

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene
hg19.genes<- genes(txdb)

library(AnnotationDbi)
library("org.Hs.eg.db")

## note that dplyr and AnnotationDbi both have a function called select
## use dplyr::select when use dplyr

gene_symbol<- AnnotationDbi::select(org.Hs.eg.db, keys=hg19.genes$gene_id, 
                                    columns="SYMBOL", keytype="ENTREZID")

"POU5F1" %in% gene_symbol$SYMBOL

#FALSE

AnnotationDbi::select(org.Hs.eg.db, keys="POU5F1", 
                                    columns= "ENTREZID", keytype= "SYMBOL")

#5460

hg19.genes["5460",]

#Error: subscript contains invalid names


> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dplyr_0.7.0                             AnnotationHub_2.6.0                     org.Hs.eg.db_3.4.0                     
 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.26.0                  AnnotationDbi_1.36.0                   
 [7] Biobase_2.34.0                          GenomicRanges_1.26.1                    GenomeInfoDb_1.10.1                    
[10] IRanges_2.8.1                           S4Vectors_0.12.1                        BiocGenerics_0.20.0                    
[13] here_0.0-6                             

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8                   BiocInstaller_1.24.0          XVector_0.14.0                bitops_1.0-6                 
 [5] tools_3.3.1                   zlibbioc_1.20.0               biomaRt_2.30.0                digest_0.6.10                
 [9] tibble_1.3.3                  RSQLite_1.0.0                 lattice_0.20-34               rlang_0.1.1                  
[13] Matrix_1.2-7.1                shiny_0.14.1                  DBI_0.5-1                     curl_2.6                     
[17] rtracklayer_1.34.1            httr_1.2.1                    knitr_1.14                    Biostrings_2.42.1            
[21] rprojroot_1.2                 grid_3.3.1                    glue_1.1.1                    R6_2.2.0                     
[25] XML_3.98-1.5                  readxl_0.1.1                  BiocParallel_1.8.0            magrittr_1.5                 
[29] Rsamtools_1.26.1              backports_1.0.5               htmltools_0.3.5               GenomicAlignments_1.10.0     
[33] assertthat_0.1                SummarizedExperiment_1.4.0    xtable_1.8-2                  mime_0.5                     
[37] interactiveDisplayBase_1.12.0 httpuv_1.3.3                  RCurl_1.95-4.8          
txdb.hsapiens.ucsc.hg19.knowngene • 2.1k views
ADD COMMENT
1
Entering edit mode
jma1991 ▴ 70
@jma1991-11856
Last seen 21 months ago
Cumbernauld

It's called POU5F1B instead of POU5F1:

library("TxDb.Hsapiens.UCSC.hg19.knownGene")

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
hg19.genes <- genes(txdb)

library("AnnotationDbi")
library("org.Hs.eg.db")

gene_symbol <- AnnotationDbi::select(org.Hs.eg.db, keys=hg19.genes$gene_id, columns="SYMBOL", keytype="ENTREZID")

"POU5F1B" %in% gene_symbol$SYMBOL
ADD COMMENT
0
Entering edit mode

Thanks! I have a list of genes which I can not find in the txdb.

DAXX, HLA-A, SHOX and TRIM27.

where can I find the alternative names?

Tommy

 

ADD REPLY
1
Entering edit mode
> select(org.Hs.eg.db, c("DAXX","HLA-A","SHOX","TRIM27"), "ENTREZID", "ALIAS")
'select()' returned 1:1 mapping between keys and columns
   ALIAS ENTREZID
1   DAXX     1616
2  HLA-A     3105
3   SHOX     6473
4 TRIM27     5987
ADD REPLY
0
Entering edit mode

thanks. looks like I can not find DAXX, HLA-A, SHOX and TRIM27 in TxDb.Hsapiens.UCSC.hg19.knownGene even using the other names. but POU5F1B is not shown either

AnnotationDbi::select(org.Hs.eg.db, c("POU5F1", "DAXX", "HLA-A","SHOX","TRIM27"), keytype = "SYMBOL", columns = "ALIAS")
'select()' returned 1:many mapping between keys and columns
   SYMBOL  ALIAS
1  POU5F1   OCT3
2  POU5F1   OCT4
3  POU5F1  OTF-3
4  POU5F1   OTF3
5  POU5F1   OTF4
6  POU5F1  Oct-3
7  POU5F1  Oct-4
8  POU5F1 POU5F1
9    DAXX  BING2
10   DAXX   DAP6
11   DAXX   EAP1
12   DAXX   DAXX
13  HLA-A   HLAA
14  HLA-A  HLA-A
15   SHOX   GCFX
16   SHOX   PHOG
17   SHOX  SHOXY
18   SHOX     SS
19   SHOX   SHOX
20 TRIM27    RFP
21 TRIM27  RNF76
22 TRIM27 TRIM27

using the ENTREZID.

ADD REPLY
0
Entering edit mode

What you show and what you say are sort of orthogonal things. Do note that using select on the org.Hs.eg.db package isn't the same thing as doing (like, anything) with the TxDb.Hsapiens.UCSC.hg19.knownGene package. So I am unsure what the code you show is supposed to illustrate, exactly.

But this brings up the question of "what is a gene, and how should one define it". Currently, we use a relatively simplistic definition of 'the region between the start of the transcripts from a gene to the end of the transcripts for that gene'. Which is sort of problematic for some genes. For example, some of the lincRNA genes can be found in two places on a given chromosome, so you end up with the (unlikely) gene region that encompasses some huge proportion of the chromosome.

This is also a problematic definition for - as it so happens - all of the genes you are talking about. So let's see:

> txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
> gns <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
> library(org.Hs.eg.db)

> toget <- mapIds(org.Hs.eg.db, c("POU5F1", "DAXX", "HLA-A","SHOX","TRIM27"), "ENTREZID", "ALIAS")
'select()' returned 1:1 mapping between keys and columns
> toget
POU5F1   DAXX  HLA-A   SHOX TRIM27
"5460" "1616" "3105" "6473" "5987"

So there are the Entrez Gene IDs for those genes. Now let's subset the GRanges of our genes:

> gns[toget]
Error: subscript contains invalid names
> toget %in% names(gns)
[1] FALSE FALSE FALSE FALSE FALSE

How can this be? Well, let's look at the transcripts:

> toget %in% names(txs)
[1] TRUE TRUE TRUE TRUE TRUE

So we have transcripts for all those genes, but no genes? That seems...wrong? This next bit of code is a bit complex, primarily to clean it up to more easily show what is happening:

> lapply(txs[toget], function(x) as.character(unique(seqnames(x))))
$`5460`
[1] "chr6"           "chr6_cox_hap2"  "chr6_dbb_hap3"  "chr6_mann_hap4"
[5] "chr6_mcf_hap5"  "chr6_qbl_hap6"  "chr6_ssto_hap7"

$`1616`
[1] "chr6"          "chr6_cox_hap2" "chr6_dbb_hap3" "chr6_mcf_hap5"
[5] "chr6_qbl_hap6"

$`3105`
[1] "chr6"           "chr6_apd_hap1"  "chr6_cox_hap2"  "chr6_dbb_hap3"
[5] "chr6_mann_hap4" "chr6_mcf_hap5"  "chr6_qbl_hap6"  "chr6_ssto_hap7"

$`6473`
[1] "chrX" "chrY"

$`5987`
[1] "chr6"           "chr6_apd_hap1"  "chr6_cox_hap2"  "chr6_mann_hap4"
[5] "chr6_mcf_hap5"  "chr6_qbl_hap6"  "chr6_ssto_hap7"


Every one of those genes is found on more than one chromosome (or unplaced scaffold), in which case our definition of a 'gene' fails, because you can't define a region between the start and stop of the transcripts if those transcripts are on different chromosomes!

ADD REPLY
0
Entering edit mode

Hi James, thanks very much for this detailed answer. now I remember a related question you answered me some time ago C: TxDb.Hsapiens.UCSC.hg19.knownGene misannotation of a gene

again, your help is much appreciated. learned a lot!

Tommy

ADD REPLY
0
Entering edit mode

Actually I'm a bit confused now.. Neither the gene symbol or alias returns the canonical OCT4/POU5F1 gene:

library("org.Hs.eg.db")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")

genes <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
symbols <- mapIds(org.Hs.eg.db, keys = genes$gene_id, "SYMBOL", keytype = "ENTREZID")
alias <- mapIds(org.Hs.eg.db, keys = genes$gene_id, "ALIAS", keytype = "ENTREZID")

grep("POU", symbols, value = TRUE)
grep("POU", alias, value = TRUE)

grep("OCT", symbols, value = TRUE)
grep("OCT", alias, value = TRUE)

 

ADD REPLY
0
Entering edit mode

Why are you confused? I just explained in some detail why you won't get that gene doing what you have done.


 

ADD REPLY
0
Entering edit mode

Sorry, so are you saying OCT4/POU5F1 has transcripts on different chromosomes?

ADD REPLY
0
Entering edit mode

No, that's not what I said. Here are the locations for the gene you care about:

$`5460`
[1] "chr6"           "chr6_cox_hap2"  "chr6_dbb_hap3"  "chr6_mann_hap4"
[5] "chr6_mcf_hap5"  "chr6_qbl_hap6"  "chr6_ssto_hap7"

And all of those things are 1) Chromosome 6, and 2) a bunch of unplaced or haplotype scaffolds. If you don't know what that means, you should listen to what Dan has to say.

ADD REPLY

Login before adding your answer.

Traffic: 630 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6