Entering edit mode
Hi,
I was trying to get the coordinates of some genes, and was surprised that POU5F1 gene is not in the txdb.
Thank you for looking into this.
Tommy
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene hg19.genes<- genes(txdb) library(AnnotationDbi) library("org.Hs.eg.db") ## note that dplyr and AnnotationDbi both have a function called select ## use dplyr::select when use dplyr gene_symbol<- AnnotationDbi::select(org.Hs.eg.db, keys=hg19.genes$gene_id, columns="SYMBOL", keytype="ENTREZID") "POU5F1" %in% gene_symbol$SYMBOL #FALSE AnnotationDbi::select(org.Hs.eg.db, keys="POU5F1", columns= "ENTREZID", keytype= "SYMBOL") #5460 hg19.genes["5460",] #Error: subscript contains invalid names > sessionInfo() R version 3.3.1 (2016-06-21) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X 10.11.5 (El Capitan) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] dplyr_0.7.0 AnnotationHub_2.6.0 org.Hs.eg.db_3.4.0 [4] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.26.0 AnnotationDbi_1.36.0 [7] Biobase_2.34.0 GenomicRanges_1.26.1 GenomeInfoDb_1.10.1 [10] IRanges_2.8.1 S4Vectors_0.12.1 BiocGenerics_0.20.0 [13] here_0.0-6 loaded via a namespace (and not attached): [1] Rcpp_0.12.8 BiocInstaller_1.24.0 XVector_0.14.0 bitops_1.0-6 [5] tools_3.3.1 zlibbioc_1.20.0 biomaRt_2.30.0 digest_0.6.10 [9] tibble_1.3.3 RSQLite_1.0.0 lattice_0.20-34 rlang_0.1.1 [13] Matrix_1.2-7.1 shiny_0.14.1 DBI_0.5-1 curl_2.6 [17] rtracklayer_1.34.1 httr_1.2.1 knitr_1.14 Biostrings_2.42.1 [21] rprojroot_1.2 grid_3.3.1 glue_1.1.1 R6_2.2.0 [25] XML_3.98-1.5 readxl_0.1.1 BiocParallel_1.8.0 magrittr_1.5 [29] Rsamtools_1.26.1 backports_1.0.5 htmltools_0.3.5 GenomicAlignments_1.10.0 [33] assertthat_0.1 SummarizedExperiment_1.4.0 xtable_1.8-2 mime_0.5 [37] interactiveDisplayBase_1.12.0 httpuv_1.3.3 RCurl_1.95-4.8
Thanks! I have a list of genes which I can not find in the txdb.
DAXX, HLA-A, SHOX and TRIM27.
where can I find the alternative names?
Tommy
thanks. looks like I can not find DAXX, HLA-A, SHOX and TRIM27 in TxDb.Hsapiens.UCSC.hg19.knownGene even using the other names. but POU5F1B is not shown either
AnnotationDbi::select(org.Hs.eg.db, c("POU5F1", "DAXX", "HLA-A","SHOX","TRIM27"), keytype = "SYMBOL", columns = "ALIAS")
using the ENTREZID.
What you show and what you say are sort of orthogonal things. Do note that using select on the org.Hs.eg.db package isn't the same thing as doing (like, anything) with the TxDb.Hsapiens.UCSC.hg19.knownGene package. So I am unsure what the code you show is supposed to illustrate, exactly.
But this brings up the question of "what is a gene, and how should one define it". Currently, we use a relatively simplistic definition of 'the region between the start of the transcripts from a gene to the end of the transcripts for that gene'. Which is sort of problematic for some genes. For example, some of the lincRNA genes can be found in two places on a given chromosome, so you end up with the (unlikely) gene region that encompasses some huge proportion of the chromosome.
This is also a problematic definition for - as it so happens - all of the genes you are talking about. So let's see:
So there are the Entrez Gene IDs for those genes. Now let's subset the GRanges of our genes:
How can this be? Well, let's look at the transcripts:
So we have transcripts for all those genes, but no genes? That seems...wrong? This next bit of code is a bit complex, primarily to clean it up to more easily show what is happening:
Every one of those genes is found on more than one chromosome (or unplaced scaffold), in which case our definition of a 'gene' fails, because you can't define a region between the start and stop of the transcripts if those transcripts are on different chromosomes!
Hi James, thanks very much for this detailed answer. now I remember a related question you answered me some time ago C: TxDb.Hsapiens.UCSC.hg19.knownGene misannotation of a gene
again, your help is much appreciated. learned a lot!
Tommy
Actually I'm a bit confused now.. Neither the gene symbol or alias returns the canonical OCT4/POU5F1 gene:
library("org.Hs.eg.db") library("TxDb.Hsapiens.UCSC.hg19.knownGene") genes <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene) symbols <- mapIds(org.Hs.eg.db, keys = genes$gene_id, "SYMBOL", keytype = "ENTREZID") alias <- mapIds(org.Hs.eg.db, keys = genes$gene_id, "ALIAS", keytype = "ENTREZID") grep("POU", symbols, value = TRUE) grep("POU", alias, value = TRUE) grep("OCT", symbols, value = TRUE) grep("OCT", alias, value = TRUE)
Why are you confused? I just explained in some detail why you won't get that gene doing what you have done.
Sorry, so are you saying OCT4/POU5F1 has transcripts on different chromosomes?
No, that's not what I said. Here are the locations for the gene you care about:
And all of those things are 1) Chromosome 6, and 2) a bunch of unplaced or haplotype scaffolds. If you don't know what that means, you should listen to what Dan has to say.