Cenny,
You wonder about why the gene PLCXD1 can be found as transcripts but not genes in UCSC.
My understanding is that you picked a very special gene. This gene has two locations: one in chrX, and one in chrY.
http://www.genecards.org/cgi-bin/carddisp.pl?gene=PLCXD1&keywords=PLCXD1
When you use genes() function with default parameter in the GenomicFeatures package, the returned gene locations are all unique.
However, the transcript_ids are unique for the different transcripts of this gene. So you have no problem to use transcripts(txdb) to get the transcripts.
If you read the help of genes function, there is a parameter to deal with multiple location genes:
single.strand.genes.only
|
TRUE or FALSE. If TRUE (the default), then genes that have exons located on both strands of the same chromosome or on two different chromosomes are dropped. In that case, the genes are returned in a GRanges object. Otherwise, all genes are returned in aGRangesList object with the columns specified thru the columns argument set as top level metadata columns. (Please keep in mind that the top level metadata columns of a GRangesList object are not displayed by the show method.)
|
The following code are very educational:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
cols <- c("tx_id", "tx_chrom", "tx_strand",
"exon_id", "exon_chrom", "exon_strand")
single_strand_genes <- genes(txdb, columns=cols)
all_genes <- genes(txdb, columns=cols, single.strand.genes.only=FALSE)
all_genes # a GRangesList object
multiple_strand_genes <- all_genes[elementNROWS(all_genes) >= 2]
multiple_strand_genes[["55344"]]
55344 is the gene id of PLCXD1.
You can run it yourself.
As to the gene_bio_type, here is the explanation:
http://www.gencodegenes.org/gencode_biotypes.html
Thank you for using ChIPpeakAnno, and thanks for the suggestion. Let us know if you have more questions.
Thanks,
Posted for Jun Yu
Hi Julie and Jianhong,
Thank you very much. The above code is very useful. I will use Ensembl for the annotation.
Quick question, what is
seqlevelsStyle(genes) <- "UCSC"
for ?For those who was wondering like I did, although the genome assembly is the same hg19, you may find some genes have different locations. That is because of the way Ensembl and UCSC locate the genes. See more here.
Ensembl annotates many more genes than UCSC. Here is an interesting paper comparing them. One other thing to note: you need to add 1 to the start coordinates from UCSC to get the equivalent Ensembl coordinate (one-based vs. zero-based coordinate system).
The description of the different gene biotypes is here.
Thank you, Julie!
Hi Julie and Jianhong,
Is there an easy way to return only 1 gene for each peak region? Sometimes, there are genes that have exactly the same nearest distance that both of them got returned. Maybe return the longest gene? or collapse the multiple nearest genes together, so that each row still refers to one peak region? For example this one:
Thanks!
Here is my code:
Hope this will help you.
Thank you so much, Jianhong!! That is very efficient and clever. Not to mention very helpful.
I still need to get used to GRanges and other genomics R objects. Your codes help a lot!