Hi,
I think I found a bug, but I am not sure whether I am right.
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    UCSC.hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
    hg19.genes <- genes(UCSC.hg19)
    library("org.Hs.eg.db")
    gene_symbol <- AnnotationDbi::select(org.Hs.eg.db, keys=hg19.genes$gene_id, columns="SYMBOL", keytype="ENTREZID")
    all.equal(hg19.genes$gene_id, gene_symbol$ENTREZID)
    hg19.genes$gene_id <- gene_symbol$SYMBOL
    hg19.genes[9349]

    GRanges object with 1 range and 1 metadata column:
             seqnames               ranges strand |     gene_id
                <Rle>            <IRanges>  <Rle> | <character>
      286297     chr9 [42844370, 67032072]      - |   LOC286297
      -------
      seqinfo: 93 sequences (1 circular) from hg19 genome
I then checked "LOC286297" in IGV, where the end coordinate is 42859085 rather than the 67032072 reported above.
I found this gene because I wanted to count how many genes are on each chromosome arm, and noticed that this one appears to span both the p and q arms.
Thanks,
Tommy
That makes sense. It's less confusing than GENCODE Genes, which sometimes uses the same gene symbol but different ENSG identifiers for largely overlapping genes in a certain genomic region. I asked about it and one of the team replied that it would be fixed by GENCODE 26.
Thanks. I did not know that a gene can have two transcripts that far apart.
There are actually quite a few genes that are on both the X and Y chromosomes, and they disappear when you do
because the smallest start and the largest end value tend to be on different chromosomes. By definition you can't have them in a GRanges object as a single range (i.e., the GRanges paradigm doesn't include regions that span two chromosomes). So the simple notion that a 'gene' is simply the region encompassed by all transcripts for that gene sort of breaks down for some sex-linked genes, as well as for some of the non-coding transcripts.
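To make the point above concrete, here is a minimal base R sketch of the "smallest start to largest end" idea. It is not how the Bioconductor packages are implemented; the table `tx`, the gene names, the coordinates, and the helper `gene_span()` are all made up for illustration.

```r
# Toy transcript table: hypothetical genes and made-up coordinates.
tx <- data.frame(
  gene  = c("geneA", "geneA", "geneB", "geneB"),
  chrom = c("chr9", "chr9", "chrX", "chrY"),
  start = c(100, 5000, 200, 150),
  end   = c(300, 5200, 400, 350),
  stringsAsFactors = FALSE
)

# Collapse each gene to one range (smallest start, largest end), but only
# when all of its transcripts sit on a single chromosome. A gene with
# transcripts on two chromosomes has no single range, so it is dropped.
gene_span <- function(tx) {
  pieces <- lapply(split(tx, tx$gene), function(g) {
    if (length(unique(g$chrom)) != 1) return(NULL)
    data.frame(gene = g$gene[1], chrom = g$chrom[1],
               start = min(g$start), end = max(g$end),
               stringsAsFactors = FALSE)
  })
  pieces <- Filter(Negate(is.null), pieces)  # discard the dropped genes
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

spans <- gene_span(tx)
print(spans)
```

In this toy data, geneA comes back as one range from 100 to 5200, far longer than either of its transcripts, while geneB (chrX and chrY) disappears from the result. On the real TxDb, I believe `genes()` has a `single.strand.genes.only` argument that, when set to `FALSE`, returns a GRangesList that keeps such genes, but check `?genes` for your installed version.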
Thanks James, these are the details one needs to pay attention to.
Is there an easy way to get the number of genes on each chromosome arm? This is my solution, but the number of genes on the sex chromosomes may be off.