I have a long list of about 700 genes. Some of them here:
ABCF1 ABHD16A ACYP1 ADAMTSL4-AS1 AGER AGPAT1 AIF1 AKAP17A ALPI AMY1A ANKRD20A1 ANKRD20A3 ANKRD42 ANKRD60 APOM ARHGAP27 ARL17A ARPP19 ASMT ASMTL ATAT1 ATF6B ATP6V1G2-DDX39B AUNIP AZIN1-AS1 B3GALT4 BAG6 BECN2 BLACE BMS1P18 BPY2 BRD2 BTNL10 BTNL2 C10orf11 C10orf126 C11orf72 C12orf43 C12orf66 C16orf47 C17orf100 C19orf68 C1orf137 C1orf167 C1QTNF3-AMACR C2 C20orf62 C22orf34 C3orf56 C4A
From the following snippet, I could see that at least IL9R and VAMP7 genes are not included.
library(Homo.sapiens) library(GenomicRanges) g = genes(Homo.sapiens, column = "SYMBOL") s = mcols(g)[["SYMBOL"]] > grep("VAM", s , value = T) 10791 6843 6844 8673 8674 9341 "VAMP5" "VAMP1" "VAMP2" "VAMP8" "VAMP4" "VAMP3" > grep("IL9R", s , value = T) named character(0)
> sessionInfo() R version 3.4.4 (2018-03-15) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.4 LTS Matrix products: default BLAS: /usr/lib/openblas-base/libblas.so.3 LAPACK: /usr/lib/libopenblasp-r0.2.18.so locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 [6] LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] Homo.sapiens_1.3.1 TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 org.Hs.eg.db_3.5.0 [4] GO.db_3.5.0 OrganismDbi_1.20.0 GenomicFeatures_1.30.3 [7] GenomicRanges_1.30.3 GenomeInfoDb_1.14.0 AnnotationDbi_1.40.0 [10] IRanges_2.12.0 S4Vectors_0.16.0 Biobase_2.38.0 [13] BiocGenerics_0.24.0 GenABEL_1.8-0 GenABEL.data_1.0.0 [16] MASS_7.3-50 loaded via a namespace (and not attached): [1] SummarizedExperiment_1.8.1 progress_1.2.0 lattice_0.20-35 rtracklayer_1.38.3 blob_1.1.1 [6] XML_3.98-1.11 RBGL_1.54.0 rlang_0.2.1 DBI_1.0.0 BiocParallel_1.12.0 [11] bit64_0.9-7 matrixStats_0.53.1 GenomeInfoDbData_1.0.0 stringr_1.3.1 zlibbioc_1.24.0 [16] Biostrings_2.46.0 memoise_1.1.0 biomaRt_2.34.2 BiocInstaller_1.28.0 Rcpp_0.12.17 [21] DelayedArray_0.4.1 graph_1.56.0 XVector_0.18.0 bit_1.1-14 Rsamtools_1.30.0 [26] RMySQL_0.10.15 hms_0.4.2 digest_0.6.15 stringi_1.2.3 grid_3.4.4 [31] tools_3.4.4 bitops_1.0-6 magrittr_1.5 RCurl_1.95-4.10 RSQLite_2.1.1 [36] crayon_1.3.4 pkgconfig_2.0.1 Matrix_1.2-14 prettyunits_1.0.2 assertthat_0.2.0 [41] httr_1.3.1 rstudioapi_0.7 R6_2.2.2 GenomicAlignments_1.14.2 compiler_3.4.4
Update with additional information.
I am working with this file S30409818_Covered.bed from Agilent's Clinical Research Exome V2. According to their support team, the BED file was construct from several sources: RefSeq, Ensembl, CCDS, Genecode, VEGA, dbSNP, CytoBand. The "Name" column of the file is quite messy (see first 2000 lines here: https://pastebin.com/6B7GdE3D).
Expected final result:
What I want is to clean up the names keep only one gene name per row.
In the final results, If an interval can be mapped to RefSeq with a HGNC symbol, use that symbol for name, and retrieve strand information for it. Else if it can be mapped to Ensembl, use Ensembl's Name. Or else take anything else that can be mapped to.
If an interval is a part of multiple genes, make multiple rows in the BED file with corresponding names. For example, below two intervals are repeated twice because they belong to 4 different genes.
chr1 13115348 13115613 Gene1 0 -
chr1 13115348 13115613 Gene2 0 -
chr1 13123587 13123707 Gene3 0 -
chr1 13123587 13123707 Gene4 0 +
If an interval is an intronic one, leave as is can copy over the final result.
I already did mapping to TxDb.UCSC.hg19.knownGene, but still there are many invervals that can't be mapped. Therefore I take an intermediate solution to copy those "NoHits" intervals over and clean up their names. The final results looks like the following. But I still want to be able to map all the non-intronic reasons and give them proper names and strands. My current R code is here: . It is best if NoHit
in mapBedtoGene
function in the gist is empty (i.e. all interval are found).
chr1 13115348 13115613 PRAMEF5 0 - chr1 13117509 13117772 PRAMEF5 0 - chr1 13123587 13123707 ENST00000512015 0 . chr1 13125298 13125455 ENST00000512015 0 . chr1 13140477 13140687 ENST00000415919 0 . chr1 13141134 13141791 ENST00000415919 0 . chr1 13144412 13145014 ENST00000415919 0 . chr1 13145171 13145389 - 0 . chr1 13172307 13172427 ENST00000458709 0 . chr1 13182988 13183108 HNRNPCL2 0 - chr1 13183113 13183236 HNRNPCL2 0 - chr1 13183403 13183567 HNRNPCL2 0 - chr1 13183590 13183768 HNRNPCL2 0 - chr1 13183793 13183943 HNRNPCL2 0 - chr1 13183997 13184220 HNRNPCL2 0 - chr1 13216298 13216516 ENST00000423177 0 . chr1 13216673 13217275 ENST00000423177 0 . chr1 13219028 13219665 ENST00000423177 0 . chr1 13328215 13328346 . 0 - chr1 13328812 13329471 . 0 - chr1 13330329 13330986 . 0 - chr1 13331433 13331643 . 0 - chr1 13352112 13352306 LOC391003 0 . chr1 13353831 13353981 LOC391003 0 . chr1 13359797 13360060 PRAMEF5 0 + chr1 13361957 13362222 PRAMEF5 0 +
Thank you for this James. Using your approach, I could reduce number of missing genes to 400 now. Of these 400, I check some and found out that they are partially overlapped with hg19.knownGene. It seems that they use ensembl gene coordinate, with HGNC symbol. How do I use (sort of) TxDb ensGene instead? I would like to have a "lookup table" as well: where HGNC Symbol is available, use that, or else use Ensembl ID.
I guess I don't really understand what you mean about overlapping with knownGene. The gene locations are based on the knownGene table, so that seems to be a bit of a tautology?
Anyway, it would probably be better if you said exactly what you are trying to get, and then perhaps we can make suggestions.
Hi James. I made update to the question. Please take a look from "Update with additional information. " downward.