I am trying to obtain genomic locations corresponding to predicted miRNA binding sites on transcripts, in order to construct a bedfile for visualizing these binding sites in the UCSC genome browser. The location data I have for miRNA binding sites is the nucleotide position in the mRNA obtained from miRNATIP (http://mirnatip.pa.icar.cnr.it:3838/mirnatip/). I have been trying to use the mapfromtranscripts function from GenomicFeatures but it seems to include the intron sequence when mapping so that the output is a genomic location often corresponding to a position in the first intron and not a coding region as expected.
I followed the vignette (available at this site https://www.rdocumentation.org/packages/GenomicFeatures/versions/1.24.4/topics/mapToTranscripts) for mapping 3'UTR positions from a transcript to the genome and thought that this would be an appropriate approach but this mapfromtranscripts output in this vignette also returns genomic locations that are in intronic regions (usually the fist intron) and not the 3'UTR as expected. I have pasted my complete code and the output from running this vignette below. I'm not sure why this isn't working for me.
Any help with the best approach to map miRNA binding sites to the genome, or where I am going wrong would be appreciated. Thankyou.
## -----------------------------------------------------------------------
## C. Map 3'UTR variants to genome coordinates
## -----------------------------------------------------------------------
## A study by Skeeles et. al (PLoS ONE 8(3): e58609. doi:
## 10.1371/journal.pone.0058609) investigated the impact of 3'UTR variants
## on the expression of cancer susceptibility genes.
> ## 8 candidate miRNA genes on chromosome 12 were used to test for
> ## differential luciferase expre
ssion in mice. In Table 2 of the manuscript
> ## variant locations are given as nucleotide position within the gene.
> geneNames <- c("Bcap29", "Dgkb", "Etv1", "Hbp1", "Hbp1", "Ifrd1",
+ "Ifrd1", "Pik3cg", "Pik3cg", "Tspan13", "Twistnb")
> starts <- c(1409, 3170, 3132, 2437, 2626, 3239, 3261, 4947, 4979, 958, 1489)
> snps <- GRanges(geneNames, IRanges(starts, width=1))
> ## To map these transcript-space coordinates to the genome we need gene ranges
> ## in genome space.
> library(org.Mm.eg.db)
> geneid <- select(org.Mm.eg.db, unique(geneNames), "ENTREZID", "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
> geneid
SYMBOL ENTREZID
1 Bcap29 12033
2 Dgkb 217480
3 Etv1 14009
4 Hbp1 73389
5 Ifrd1 15982
6 Pik3cg 30955
7 Tspan13 66109
8 Twistnb 28071
> ## Extract the gene regions:
> library(TxDb.Mmusculus.UCSC.mm10.knownGene)
> txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
> genes <- genes(txdb)[geneid$ENTREZID]
> ## A prerequesite of the mapping from transcript space to genome space
> ## is that seqnames in 'x' match names in 'transcripts'. Rename
> ## 'genes' with the appropriate gene symbol.
> names(genes) <- geneid$SYMBOL
> ## The xHits and transcriptsHits metadta columns indicate which ranges in
> ## 'snps' and 'genes' were involved in the mapping.
> mapFromTranscripts(snps, genes)
GRanges object with 11 ranges and 2 metadata columns:
seqnames ranges strand | xHits transcriptsHits
<Rle> <IRanges> <Rle> | <integer> <integer>
[1] chr12 31633250 - | 1 1
[2] chr12 37883343 + | 2 2
[3] chr12 38783389 + | 3 3
[4] chr12 31948099 - | 4 4
[5] chr12 31947910 - | 5 4
[6] chr12 40219951 - | 6 5
[7] chr12 40219929 - | 7 5
[8] chr12 32203703 - | 8 6
[9] chr12 32203671 - | 9 6
[10] chr12 36041521 - | 10 7
[11] chr12 33431112 + | 11 8
-------
seqinfo: 66 sequences from an unspecified genome; no seqlengths
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0 org.Mm.eg.db_3.6.0
[3] TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2 TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[5] BiocInstaller_1.30.0 GenomicFeatures_1.32.2
[7] AnnotationDbi_1.42.1 Biobase_2.40.0
[9] GenomicRanges_1.32.6 GenomeInfoDb_1.16.0
[11] IRanges_2.14.11 S4Vectors_0.18.3
[13] BiocGenerics_0.26.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.18 compiler_3.5.1 XVector_0.20.0
[4] prettyunits_1.0.2 bitops_1.0-6 tools_3.5.1
[7] progress_1.2.0 zlibbioc_1.26.0 biomaRt_2.36.1
[10] digest_0.6.16 bit_1.1-14 lattice_0.20-35
[13] RSQLite_2.1.1 memoise_1.1.0 pkgconfig_2.0.2
[16] rlang_0.2.2 Matrix_1.2-14 DelayedArray_0.6.5
[19] DBI_1.0.0 GenomeInfoDbData_1.1.0 rtracklayer_1.40.6
[22] stringr_1.3.1 httr_1.3.1 Biostrings_2.48.0
[25] hms_0.4.2 grid_3.5.1 bit64_0.9-7
[28] R6_2.2.2 XML_3.98-1.16 BiocParallel_1.14.2
[31] blob_1.1.1 magrittr_1.5 matrixStats_0.54.0
[34] Rsamtools_1.32.3 GenomicAlignments_1.16.0 SummarizedExperiment_1.10.1
[37] assertthat_0.2.0 stringi_1.2.4 RCurl_1.95-4.11
[40] crayon_1.3.4
>
I have since found a solution to the mapping problem using the package "ensembldb" following the tutorial at the link below.
https://bioconductor.org/packages/devel/bioc/vignettes/ensembldb/inst/doc/coordinate-mapping.html#6_mapping_transcript_coordinates_to_genomic_coordinates