Reorder exons on negative strand in GRanges list
1
0
Entering edit mode
Jake ▴ 90
@jake-7236
Last seen 2.3 years ago
United States

I am trying to make a fasta file of reduced exon sequences (collapse all isoforms of a gene into one transcript). I'm running into a problem where exons on the negative strand are listed in ascending order based on chromosome coordinates, but this results in the exon sequences being written in the wrong order. I've tried several methods to re-order the negative strand elements, but they were all running very slow. What would be the best way to do this?

Thanks

library('BSgenome.Mmusculus.UCSC.mm10')
library('GenomicFeatures')
genome <- BSgenome.Mmusculus.UCSC.mm10
gencode <- makeTxDbFromGFF('~/Downloads/gencode.vM8.basic.annotation.gtf')
exonAll <- exonsBy(gencode, by='gene')
exonAll2 <- reduce(exonAll)
seqs <- extractTranscriptSeqs(genome, exonAll2)
writeXStringSet(seqs, '~/Desktop/basic.fa')
genomicranges • 2.1k views
ADD COMMENT
2
Entering edit mode
@michael-lawrence-3846
Last seen 3.0 years ago
United States

A case could be made that extractTranscriptSeqs() should take care of this, but for now you could reverse the elements of exonAll2 when the gene is on the negative strand, e.g.:

exonAll2 <- revElements(exonAll2, any(strand(exonAll2) == "-"))

 

ADD COMMENT
0
Entering edit mode

I'm confused. Why should extractTranscriptSeqs() take care of this? The reduce() call removes the exon_rank information. The resulting GRangesList after the reduce() call does even not contain actual transcript sequences for many genes.

ADD REPLY
0
Entering edit mode

It's true that this operation probably does not make sense biologically, but my point about extractTranscriptSeqs() was that it could sort exons based on their strand, or at least throw an error if there is inconsistency. It's easy to get into that situation if not using a TxDb.

ADD REPLY
0
Entering edit mode
Ah, got it. Thanks.
ADD REPLY
0
Entering edit mode

It appears the case was never made.  revElements certainly works as suggested, but it is dreadfully slow. Is there any other new BioC juju that might make this operation sing (and, would it help that in my case ALL all my exonAll2 have exactly two elements)?

 

ADD REPLY
0
Entering edit mode

I'm surprised that revElements() on a GRangesList is slow. It's implemented pretty efficiently. Do you have an example?

ADD REPLY
0
Entering edit mode

Here's timing produced by the five statements in the following function, to give you a sense of relative speed on my computer:

   user  system elapsed 
 12.227   0.871  13.100 
   user  system elapsed 
 41.688   3.735  45.423 
   user  system elapsed 
636.549  29.291 665.827 
   user  system elapsed 
  5.991   0.060   6.052 
   user  system elapsed 
164.827  62.318 227.457 

 

writeIntronFlankingSeqs<-function(genomeName='BSgenome.Hsapiens.UCSC.hg38'
          ,TxDbName='TxDb.Hsapiens.UCSC.hg38.knownGene'
          ,flankSize=25
          ,con=paste(sep='.','sjdb',flankSize,TxDbName,'fa')
           ) {
    ## PURPOSE: Write a fasta file holding <flankSize> bases flanking
    ## each unique intron known to <TxDbName>, as extracted from the
    ## BSgenome named <genomeName>.
    library(rtracklayer)
    library(GenomicFeatures)
    library(BSgenome)
    genome<-getBSgenome(genomeName)
    require(TxDbName,char=T)
    TxDb<-eval(parse(text=TxDbName))

    print(system.timeintron.gr<-unique(unlist(intronsByTranscript(TxDb)))))

    print(system.time(
      flanking.grl<-
        ## Kudos: C: Element-wise setdiff when psetdiff fails for GenomicRanges
        setdiff(as(trimintron.gr+flankSize),'GRangesList')
               ,asintron.gr,'GRangesList'))))

    stopifnot(all(2==lengths(flanking.grl)))

    print(system.time(
      flanking.grl<-
        ## Required for correct 5-3' order in subsequent call to
        ## extractTranscriptSeqs.  Kudos:
        ## [[Reorder exons on negative strand in GRanges list][Reorder exons on
        ## negative strand in GRanges list]] - along with appeal for
        ## faster approach (THIS IS SLOW).
        revElements(flanking.grl,any(strand(flanking.grl)=='-')) 
    ))

    print(system.time(
      names(flanking.grl)<-
        ## Creates names of the form:
        ## chr:start1-end1:strand^chr:start2-end2:strand
        do.call(function(...) paste(sep='^',...)
               ,as.data.frame(matrix(as.character(unlist(flanking.grl))
                                    ,ncol=2
                                    ,byrow=T)))
    ))

    print(system.time(
      writeXStringSet(extractTranscriptSeqs(genome,flanking.grl)
                     ,con)
    ))

  }

Thanks for taking a squiz Michael.

ADD REPLY
0
Entering edit mode

I will look into the performance issue but I'm wondering whether it wouldn't be better to do this using flank(), which would get the upstream/downstream right based on the strand.

ADD REPLY
0
Entering edit mode

Turns out the issue is with the any(strand(flanking.grl)=='-')). There was a missing fast path for any() on a CompressedRleList. Try IRanges 2.15.19.

ADD REPLY
0
Entering edit mode

Mike - thanks for offering to take a look, and even more so for the "spot on" comment about using flank... you're absolutely right, computing flanking.grl as below is ~100x faster

 

        up.gr<-trim(flank(intron.gr,flankSize,T))
        down.gr<-trim(flank(intron.gr,flankSize,F))
        flanking.grl<-split(c(up.gr,down.gr),seq_along(up.gr))
        names(flanking.grl)<- paste(sep='^',as.character(up.gr),as.character(down.gr))

Any more idiomatic approaches to building that GRangesList are most welcome... I expect there is some unlist/relist juju I've never fully appreciated.

ADD REPLY

Login before adding your answer.

Traffic: 539 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6