Does there exists a getSeq like function for ranges that wrap around end to start ?
2
0
Entering edit mode
@hauken_heyken-13992
Last seen 2.1 years ago
Bergen

Hey, in my package ORFik I give the possibility to find ORFs from circular genomes, but getting the sequences from those ranges are not possible as GRanges directly.

Here is the problem and I would like some input into the best way to solve it. I have a way to fix it as I show, but not sure it is the best way of doing it. Just copy the range, instead of downloading ORFik if you want to replicate this:

library(ORFik) # If you want to replicate, just copy the GRanges object, instead of downloading ORFik
# If you want to replicate with ORFik use github devel version: devtools::install_github("ORFik/master")
library(Biostrings)
require(BSgenome.Hsapiens.UCSC.hg38)

mtDNA <- DNAStringSet(BSgenome.Hsapiens.UCSC.hg38$chrM)
names(mtDNA) <- "chrM"
# Find all orf, allowing ORFs to wrap around from end to beginning (circular genome)
orfs <- findORFsFasta(mtDNA,
                       startCodon = "ATG",
                       stopCodon = stopDefinition(2), #2 The Vertebrate Mitochondrial Code
                       is.circular = TRUE)
isCircular(orfs)
  chrM 
  TRUE 

circular <- which(start(orfs) < 0 | end(orfs) > nchar(mtDNA)) # 1 ORF wraps around 
seqs <- getSeq(mtDNA, orfs[-circular]) # This works, as there are no wrapping ranges here
seqlengths(mtDNA)
 chrM 
16569 
circular_orf <- orfs[circular] # the 1 wrapping ORF
circular_orf
GRanges object with 1 range and 0 metadata columns:
      seqnames      ranges strand
         <Rle>   <IRanges>  <Rle>
  [1]     chrM 16567-16578      +
  -------
  seqinfo: 1 sequence (1 circular) from an unspecified genome; no seqlengths
overstep <- end(circular_orf) - seqlengths(mtDNA) # It steps 9 outside chromosome end
seqs <- getSeq(mtDNA, circular_orf) # Gives error, getSeq does not support circular, does not help to set seqlength
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'value' in selecting a method for function 'unsplit': some ranges in 'at' are off-limits with respect to their corresponding sequence in 'x'

# Now make it work
overstep <- end(circular_orf) - seqlengths(mtDNA)
new_range <- IRanges(c(start(circular_orf), 1), c(seqlengths(mtDNA), overstep))
circular_orf_that_works <- GRangesList(GRanges("chrM", new_range, "+"))
circular_orf_that_works
GRangesList object of length 1:
[[1]]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames      ranges strand
         <Rle>   <IRanges>  <Rle>
  [1]     chrM 16567-16569      +
  [2]     chrM         1-9      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

extractTranscriptSeqs(mtDNA, circular_orf_that_works)
DNAStringSet object of length 1:
    width seq
[1]    12 ATGGATCACAGG

So, any ideas ? Else I thought I could make a getSeq function with an argument circular.wrapping = TRUE, that just wraps it to GRangesList for the user if set to true. Seqlengths/isCircular for all circular chromosomes must of course be set for this to work.

circulargenome ORFik getSeq • 2.0k views
ADD COMMENT
0
Entering edit mode

Hm, figured out getSeq already works for circular wrapping ranges if your DNA sequences is in BSGenome format, but not as DNAStringSet, which is strange..

 library(GenomicRanges)
 library(BSgenome.Hsapiens.UCSC.hg19) # Use 38 if you have that instead
 gr <- GRanges("chrMT:16567-16578:+")
 isCircular(gr) <- TRUE
 seqlengths(gr) <- 16569
 # Load sequence
 fa.loaded <- DNAStringSet(BSgenome.Hsapiens.UCSC.hg19$chrMT)
 names(fa.loaded) <- "chrMT"

 # Unloaded sequence
 fa.notloaded <- BSgenome.Hsapiens.UCSC.hg19

 getSeq(fa.notloaded, gr) # Works
 getSeq(fa.loaded, gr) # Does not work

 isCircular(fa.loaded) <- TRUE
 isCircular(fa.loaded) # Does not update to TRUE for some reason

 getSeq(fa.loaded, gr) # Still does not work

So the question then is, why does not DNAStringSet support to update the isCircular flag ?

ADD REPLY
1
Entering edit mode
@herve-pages-1542
Last seen 1 hour ago
Seattle, WA, United States

Hi,

DNAStringSet objects predate the introduction of GRanges and Seqinfo objects, so they didn't have any seqinfo capabilities originally (i.e. they didn't support seqinfo(), seqlengths(), isCircular() etc...). At some point, someone decided to add the seqinfo capabilities to these objects but the seqinfo<- and seqinfo methods they provide are inconsistent:

> selectMethod("seqinfo<-", "DNAStringSet")
Method Definition:

function (x, new2old = NULL, pruning.mode = c("error", "coarse", 
    "fine", "tidy"), value) 
{
    .local <- function (x, value) 
    {
        metadata(x)$seqinfo <- value
        x
    }
    .local(x, value)
}
<bytecode: 0x5557c4a00928>
<environment: namespace:GenomicRanges>

Signatures:
        x             
target  "DNAStringSet"
defined "List"        

> selectMethod("seqinfo", "DNAStringSet")
Method Definition:

function (x) 
{
    x_names <- names(x)
    if (is.null(x_names)) 
        x_names <- as.character(seq(length(x)))
    Seqinfo(x_names, width(x))
}
<bytecode: 0x5557d7e80150>
<environment: namespace:rtracklayer>

Signatures:
        x             
target  "DNAStringSet"
defined "DNAStringSet"

As you can see, the setter stores the Seqinfo object in the metadata slot of the DNAStringSet object, but the getter ignores that and instead recreates the Seqinfo object on-the-fly each time it's called.

So after you do:

isCircular(fa.loaded) <- TRUE

the internal Seqinfo object gets modified:

> metadata(fa.loaded)
$seqinfo
Seqinfo object with 1 sequence (1 circular) from an unspecified genome:
  seqnames seqlengths isCircular genome
  chrMT         16569       TRUE   <NA>

but the seqinfo() getter ignores that.

Another problem is that the seqinfo() getter is defined in rtracklayer but the setter is defined in GenomicRanges. None of these places are good places for these methods. DNAStringSet objects are implemented in Biostrings and, unfortunately, after I load Biostrings, none of these methods are available. If they can't be defined in Biostrings (Biostrings would need to depend on GenomeInfoDb for that), at the very least they should be defined in the same package, maybe BSgenome where the getSeq() method for DNAStringSet objects is also defined.

Finally, I'm not sure that the getSeq() method for DNAStringSet objects is circularity aware. So even if you were able to effectively change the circularity flag of your DNAStringSet object, I don't think that getSeq() would be aware of it.

Would be great if you could open an issue here to report these problems.

Thanks!

ADD COMMENT
0
Entering edit mode
@hauken_heyken-13992
Last seen 2.1 years ago
Bergen

Great, we have found the solution, so I will accept the answer and continue discussion on the github issues section.

ADD COMMENT

Login before adding your answer.

Traffic: 837 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6