Dear all,
I am writing to ask please for a piece of help regarding Biomart . I am using the piece of R code below, that was inspired by the package StructuralVariantAnnotation to annotate the Structural Variants from DELLY.
https://github.com/PapenfussLab/gridss/blob/master/example/somatic-fusion-gene-candidates.R;
However, the coordinates on chr21 are not annotated properly, in the sense that : shall I input the following coordinates for a breakpoint "chr21:10813930-10813931", it gives me the gene annotations such as "SMIM11B,U2AF1L5,LOC102724652,CRYAA,U2AF1,CBS"
I can post the full code in R, if you wish. Would you please let me know --- has this been noted before ? how shall I fix it ?
thank you very much !
-- bogdan
##################### the piece of R code that I am using is the following : ##############################################################
gns <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene) hits <- as.data.frame(findOverlaps(gr, gns, ignore.strand=TRUE)) hits$SYMBOL <- biomaRt::select(org.Hs.eg.db, gns[hits$subjectHits]$gene_id, "SYMBOL")$SYMBOL hits$gene_strand <- as.character(strand(gns[hits$subjectHits])) hits <- hits %>% group_by(queryHits) %>% summarise(SYMBOL=paste(SYMBOL, collapse=","), gene_strand=paste0(gene_strand, collapse=""))
Can you explain why this is surprising? It seems like 6 (Entrez) genes overlap the break point
and that these map to 6 gene SYMBOLs ?
Hi Martin, thank you for your comments and precious help. It has been surprising for me because the coordinates of the BREAKPOINT (i.e. chr21:10,813,930-10,813,931) are 3 MB away from SMIM11B (chr21:7,748,899-7,777,853), at least according to UCSC genome browser (the link to the gene area is below). I would aim to know only the genes that are intersecting the breakpoint, and not necessarily the genes that are 3Mb away. Any suggestions are welcome. Thanks you.,
Likely the explanation is that the org.* and TxDb.* packages are static and report the information at time of publication from particular sources, whereas UCSC is dynamic and presents the most recent updates of possibly different sources. The information in the annotation objects are
There was presumably some something wrong in the source file used at the time of generation, as those 6 genes each span approximately 30Mb, which seems unlikely.
Dear Martin, thank you for your suggestion. To me ... hmmm ... it is still a bit surprising: for example, shall I input the coordinate ("chr21:21004327-21004328") that overlap NCAM2 gene, the result (after running your code) is the following (below). It contains NCAM2 gene, but also those additional genes that are 2-3 MB away.
The code is doing the correct thing, it's finding entries in TxDb.Hsapiens.UCSC.hg38.knownGene that overlap with your query. However, the 6 genes that keep getting returned have one coordinate incorrect in their annotation and so appear to span a huge stretch of the chromosome and overlap a lot of things.
Thank you Mike. Yes, I share the same opinions. For now, I will just ignore those genes, until a new release of "org.Hs.eg.db" becomes available. Thank you.
I don't think this is really a biomaRt question. You're not actually querying a BioMart instance here, the annotation is found in
TxDb.Hsapiens.UCSC.hg38.knownGene
and the map between symbols is inorg.Hs.eg.db
which are both an AnnotationDb objects. You're usingbiomaRt::select()
but in the background this is really dispatching the version in AnnotationDbi, so you might get a better response if you add that tag to the question.