Hey, in my package ORFik I give the possibility to find ORFs from circular genomes, but getting the sequences from those ranges are not possible as GRanges directly.
Here is the problem and I would like some input into the best way to solve it. I have a way to fix it as I show, but not sure it is the best way of doing it. Just copy the range, instead of downloading ORFik if you want to replicate this:
library(ORFik) # If you want to replicate, just copy the GRanges object, instead of downloading ORFik
# If you want to replicate with ORFik use github devel version: devtools::install_github("ORFik/master")
library(Biostrings)
require(BSgenome.Hsapiens.UCSC.hg38)
mtDNA <- DNAStringSet(BSgenome.Hsapiens.UCSC.hg38$chrM)
names(mtDNA) <- "chrM"
# Find all orf, allowing ORFs to wrap around from end to beginning (circular genome)
orfs <- findORFsFasta(mtDNA,
startCodon = "ATG",
stopCodon = stopDefinition(2), #2 The Vertebrate Mitochondrial Code
is.circular = TRUE)
isCircular(orfs)
chrM
TRUE
circular <- which(start(orfs) < 0 | end(orfs) > nchar(mtDNA)) # 1 ORF wraps around
seqs <- getSeq(mtDNA, orfs[-circular]) # This works, as there are no wrapping ranges here
seqlengths(mtDNA)
chrM
16569
circular_orf <- orfs[circular] # the 1 wrapping ORF
circular_orf
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chrM 16567-16578 +
-------
seqinfo: 1 sequence (1 circular) from an unspecified genome; no seqlengths
overstep <- end(circular_orf) - seqlengths(mtDNA) # It steps 9 outside chromosome end
seqs <- getSeq(mtDNA, circular_orf) # Gives error, getSeq does not support circular, does not help to set seqlength
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'value' in selecting a method for function 'unsplit': some ranges in 'at' are off-limits with respect to their corresponding sequence in 'x'
# Now make it work
overstep <- end(circular_orf) - seqlengths(mtDNA)
new_range <- IRanges(c(start(circular_orf), 1), c(seqlengths(mtDNA), overstep))
circular_orf_that_works <- GRangesList(GRanges("chrM", new_range, "+"))
circular_orf_that_works
GRangesList object of length 1:
[[1]]
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chrM 16567-16569 +
[2] chrM 1-9 +
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
extractTranscriptSeqs(mtDNA, circular_orf_that_works)
DNAStringSet object of length 1:
width seq
[1] 12 ATGGATCACAGG
So, any ideas ? Else I thought I could make a getSeq function with an argument circular.wrapping = TRUE, that just wraps it to GRangesList for the user if set to true. Seqlengths/isCircular for all circular chromosomes must of course be set for this to work.
Hm, figured out getSeq already works for circular wrapping ranges if your DNA sequences is in BSGenome format, but not as DNAStringSet, which is strange..
So the question then is, why does not DNAStringSet support to update the isCircular flag ?