Hi Everybody,
I am using the mapFromTranscripts function to have genomic coordinate of transcript elements, which is absolutely great !
However, I am having some trouble with one specific transcript (same coordinates and ID of an annotated transcript in Gencode23):
library("GenomicFeatures") tx_pos<-GRanges(seqnames="ENST00000417659.1",ranges=IRanges(start=278,end=286),strand="+") tx_pos GRanges object with 1 range and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] ENST00000417659.1 [278, 286] + ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths
As you see, my position of interest starts at 278 and ends at 286, on the plus strand (of the transcript).
Here I define the exons of this transcript (which map on the minus strand):
exon1<-GRanges(seqnames=1,ranges=IRanges(start=764723,end=764925),strand="-",exon_rank=1) exon2<-GRanges(seqnames=1,ranges=IRanges(start=759032,end=759123),strand="-",exon_rank=2) exons<-GRangesList(c(exon1,exon2)) names(exons)<-seqnames(tx_pos) exons GRangesList object of length 1: $ENST00000417659.1 GRanges object with 2 ranges and 1 metadata column: seqnames ranges strand | exon_rank <Rle> <IRanges> <Rle> | <numeric> [1] 1 [764723, 764925] - | 1 [2] 1 [759032, 759123] - | 2 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths
Now I would like to have the genomic coordinates of my tx_pos element:
As expected, when we take the strand into account with ignore.strand=F in the mapFromTranscripts function, we do not see any result (gene on the minus strand and element on plus strand of the transcript).
mapFromTranscripts(x = tx_pos,transcripts = exons,ignore.strand=F) GRanges object with 0 ranges and 2 metadata columns: seqnames ranges strand | xHits transcriptsHits <Rle> <IRanges> <Rle> | <integer> <integer> ------- seqinfo: 2 sequences from an unspecified genome; no seqlengths
When we put ignore.strand=T now we do have an output:
mapFromTranscripts(x = tx_pos,transcripts = exons,ignore.strand=T) GRanges object with 1 range and 2 metadata columns: seqnames ranges strand | xHits transcriptsHits <Rle> <IRanges> <Rle> | <integer> <integer> [1] 1 [764908, 764916] * | 1 1 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths
but unfortunately is not what I expected, as this position actually does not map to any of my exons.
My transcript element lies in the second exon (as the first one is 203 nt long), thus what I expect as starting genomic position is 759123-(278-203-1) = 759049
If I put the strand of my tx_pos element to be "*" and ignore.strand=F I the function outputs the correct coordinates.
strand(tx_pos)<-"*" mapFromTranscripts(x = tx_pos,transcripts = exons,ignore.strand=F) GRanges object with 1 range and 2 metadata columns: seqnames ranges strand | xHits transcriptsHits <Rle> <IRanges> <Rle> | <integer> <integer> [1] 1 [759041, 759049] - | 1 1 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths
while the ignore.strand=T still gives me the wrong coordinates:
mapFromTranscripts(x = tx_pos,transcripts = exons,ignore.strand=T) GRanges object with 1 range and 2 metadata columns: seqnames ranges strand | xHits transcriptsHits <Rle> <IRanges> <Rle> | <integer> <integer> [1] 1 [764908, 764916] * | 1 1 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths
Am I doing something very wrong which I can't see ? In the help page it was not clear to me how we should set the strand of the transcript element (so I guess it's set to "*").
Should we always put the strand coordinates on the transcript to "*" and ignore.strand=F to have the correct genomic coordinates of a transcript element?
If that's the case I hope other confused researchers can look at this and clear their minds a bit : )
Thank you so much for the great work, it is really making my life so much easier !
sessionInfo() R version 3.2.2 (2015-08-14) Platform: x86_64-redhat-linux-gnu (64-bit) Running under: CentOS release 6.5 (Final) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicFeatures_1.22.13 AnnotationDbi_1.32.3 Biobase_2.30.0 GenomicRanges_1.22.4 [5] GenomeInfoDb_1.6.3 IRanges_2.4.8 S4Vectors_0.8.11 BiocGenerics_0.16.1 loaded via a namespace (and not attached): [1] XVector_0.10.0 zlibbioc_1.16.0 GenomicAlignments_1.6.3 [4] BiocParallel_1.4.3 tools_3.2.2 SummarizedExperiment_1.0.2 [7] DBI_0.3.1 lambda.r_1.1.7 futile.logger_1.4.1 [10] rtracklayer_1.30.4 futile.options_1.0.0 bitops_1.0-6 [13] RCurl_1.95-4.8 biomaRt_2.26.1 RSQLite_1.0.0 [16] Biostrings_2.38.4 Rsamtools_1.22.0 XML_3.98-1.4
Thanks for the report. I'll have a look.
Valerie