Hi,
Is it possible to convert all genomic coordinates in a gene annotation file (gff3) to cDNA coordinates with the mapCoords function in GenomicRanges package in Bioconductor? I have a list of miRNA target sites with cDNA/transcript-based coordinates, and I would like to check if these sites overlap UTR/CDS regions in the transcript. Based on the GenomicRanges manual, I tired to follow the example usage codes for mapCoords to do this (please see below), but the results are not what I wanted (all exon positions based on cDNA coordinate) and I cannot figure out the correct way. Can anyone advise? Thanks in advance!
I am using the TAIR10 Arabidopsis gene-exon annotation file from Phytozome v10. Here is what I did:
>txdb <- makeTranscriptDbFromGFF(file="Athaliana_167_TAIR10.gene_exons.gff3",
format="gff3")
>transcripts<-exonsBy(txdb,by="tx",use.names=TRUE)
>transcripts
GRangesList object of length 35386:
$AT1G01010.1.TAIR10
GRanges object with 6 ranges and 3 metadata columns:
seqnames ranges strand | exon_id exon_name exon_rank
<Rle> <IRanges> <Rle> | <integer> <character> <integer>
[1] Chr1 [3631, 3913] + | 1 AT1G01010.1.TAIR10.exon.1 1
[2] Chr1 [3996, 4276] + | 2 AT1G01010.1.TAIR10.exon.2 2
[3] Chr1 [4486, 4605] + | 3 AT1G01010.1.TAIR10.exon.3 3
[4] Chr1 [4706, 5095] + | 4 AT1G01010.1.TAIR10.exon.4 4
[5] Chr1 [5174, 5326] + | 5 AT1G01010.1.TAIR10.exon.5 5
[6] Chr1 [5439, 5899] + | 6 AT1G01010.1.TAIR10.exon.6 6
...
<35385 more elements>
-------
seqinfo: 7 sequences (1 circular) from an unspecified genome; no seqlengths
>GR <- transcripts(txdb) #extract transcript coordinates
>map<-mapCoords(GR,transcripts,elt.hits=TRUE, elt.loc=TRUE)
> map
GRanges object with 6483 ranges and 4 metadata columns:
seqnames ranges strand | queryHits subjectHits eltLoc eltHits
<Rle> <IRanges> <Rle> | <integer> <integer> <IRanges> <integer>
[1] Chr1 [ 1, 111] + | 4 4 [ 1, 111] 47
[2] Chr1 [ 1, 117] + | 7 7 [ 1, 117] 56
[3] Chr1 [ 1, 1176] + | 10 10 [ 1, 1176] 68
[4] Chr1 [ 1, 1941] + | 15 15 [ 1, 1941] 85
[5] Chr1 [29, 1941] + | 16 15 [29, 1941] 85
... ... ... ... ... ... ... ... ...
[6479] ChrM [1, 876] - | 35381 35381 [1, 876] 207458
[6480] ChrM [1, 384] - | 35383 35383 [1, 384] 207462
[6481] ChrM [1, 1584] - | 35384 35384 [1, 1584] 207463
[6482] ChrM [1, 336] - | 35385 35385 [1, 336] 207464
[6483] ChrM [1, 615] - | 35386 35386 [1, 615] 207465
-------
seqinfo: 7 sequences from an unspecified genome; no seqlengths
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.34.0 XVector_0.6.0 BiocInstaller_1.16.0 GenomicFeatures_1.18.1 AnnotationDbi_1.28.0
[6] Biobase_2.26.0 GenomicRanges_1.18.1 GenomeInfoDb_1.2.2 IRanges_2.0.0 S4Vectors_0.4.0
[11] BiocGenerics_0.12.0
loaded via a namespace (and not attached):
[1] base64enc_0.1-2 BatchJobs_1.4 BBmisc_1.7 BiocParallel_1.0.0 biomaRt_2.22.0
[6] bitops_1.0-6 brew_1.0-6 checkmate_1.5.0 codetools_0.2-9 colorspace_1.2-4
[11] DBI_0.3.1 digest_0.6.4 fail_1.2 foreach_1.4.2 GenomicAlignments_1.2.0
[16] ggplot2_1.0.0 grid_3.1.1 gtable_0.1.2 iterators_1.0.7 MASS_7.3-35
[21] munsell_0.4.2 plyr_1.8.1 proto_0.3-10 Rcpp_0.11.3 RCurl_1.95-4.3
[26] reshape2_1.4 Rsamtools_1.18.1 RSQLite_1.0.0 rtracklayer_1.26.1 scales_0.2.4
[31] sendmailR_1.2-1 stringr_0.6.2 tools_3.1.1 XML_3.98-1.1 zlibbioc_1.12.0
I may have misunderstood what you want. In your example 'x' is transcripts and 'to' is the TAIR exons by transcript.
To map gene ranges to the TAIR exons 'x' should be a GRanges of genes and 'to' the TAIR exons. If instead you want to map transcripts to TAIR gene ranges then 'x' should be the transcripts ('GR' object above) and 'to' should be a GRangesList of genes.
Hi Valerie, thanks for your response. I checked the manual again and I think understand what you mean now--thanks again! However I see the local coordinates also includes the length for introns -- since the second exon start position does not continue after the first exon (i.e 284), but incorporates the intron length. I need the transcript local coordinates to be continuous without accounting for introns. I am also losing the metadata for transcript name and feature attribute. How can I achieve those things? here is what I tried:
>exon_ranges<-exons(txdb) # a GRanges object
>allgenes<-genes(txdb)# a GRanges object
>allgenes<-GRangesList(allgenes)
> mapCoords(exon_ranges,mygenes,elt.hits=TRUE, elt.loc=TRUE)
GRanges object with 16 ranges and 4 metadata columns:
seqnames ranges strand | queryHits subjectHits eltLoc eltHits
<Rle> <IRanges> <Rle> | <integer> <integer> <IRanges> <integer>
[1] Chr1 [ 1, 283] + | 1 1 [ 1, 283] 1
[2] Chr1 [ 366, 646] + | 2 1 [ 366, 646] 1
[3] Chr1 [ 856, 975] + | 3 1 [ 856, 975] 1
[4] Chr1 [1076, 1465] + | 4 1 [1076, 1465] 1
[5] Chr1 [1544, 1696] + | 5 1 [1544, 1696] 1
... ... ... ... ... ... ... ... ...
[12] Chr1 [3172, 3245] - | 12 1 [903, 976] 2
[13] Chr1 [3020, 3065] - | 13 1 [751, 796] 2
[14] Chr1 [2682, 2771] - | 14 1 [413, 502] 2
[15] Chr1 [2543, 2590] - | 15 1 [274, 321] 2
[16] Chr1 [2270, 2436] - | 16 1 [ 1, 167] 2
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
update: I solved the intron interval issue (mapCoords(exon_ranges,transcripts)-- now the issue is only adding metadata to the results?
Hi,
merge() isn't supported on GRanges objects. Conceptually GRanges and GAlignments are vector-like objects of 1 dimension so only c() works on them. rbind(), cbind() and merge() typically only work on 2-D objects.
You can merge() on a DataFrame which is what the GRanges metadata is held in. Extract the metadata with mcols(GRanges) then you can merge, add columns with '$' or remove with '['.
Valerie
Thanks so much for your help and patience--I got the metadata now!
Hi Valerie, I have another question--- I want to use pmapCoords instead of mapCoords to get one-to-one results for my comparison (exon_ranges,exonsBy(txdb,"tx"), however I can't make it to work even though my GRanges has the same number of ranges as those stored in the GRangesList. Could you suggest how can I get this?
>pmapCoords(exon_ranges,transcripts)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘pmapCoords’ for signature ‘"GRanges", "GRangesList"’
This has been added to GenomicRanges 1.19.5. Should be available via biocLite() Friday around noon PST or immediately in svn.
Valerie
A method for GRanges, GRangesList hasn't been implemented yet. Currently we only have
I'll put this on my TODO but can't get to it until next week. I'll post back here when it's done.
Valerie
Thank you, that would be great!