Question

Reorder exons on negative strand in GRanges list

0

Entering edit mode

Jake ▴ 90

@jake-7236

Last seen 2.7 years ago

United States

I am trying to make a fasta file of reduced exon sequences (collapse all isoforms of a gene into one transcript). I'm running into a problem where exons on the negative strand are listed in ascending order based on chromosome coordinates, but this results in the exon sequences being written in the wrong order. I've tried several methods to re-order the negative strand elements, but they were all running very slow. What would be the best way to do this?

Thanks

library('BSgenome.Mmusculus.UCSC.mm10')
library('GenomicFeatures')
genome <- BSgenome.Mmusculus.UCSC.mm10
gencode <- makeTxDbFromGFF('~/Downloads/gencode.vM8.basic.annotation.gtf')
exonAll <- exonsBy(gencode, by='gene')
exonAll2 <- reduce(exonAll)
seqs <- extractTranscriptSeqs(genome, exonAll2)
writeXStringSet(seqs, '~/Desktop/basic.fa')

genomicranges • 2.2k views

ADD COMMENT • link updated 8.9 years ago by Michael Lawrence ★ 11k • written 8.9 years ago by Jake ▴ 90

score 2 · Accepted Answer · 2016-05-11

2

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.4 years ago

United States

A case could be made that extractTranscriptSeqs() should take care of this, but for now you could reverse the elements of exonAll2 when the gene is on the negative strand, e.g.:

exonAll2 <- revElements(exonAll2, any(strand(exonAll2) == "-"))

ADD COMMENT • link 8.9 years ago Michael Lawrence ★ 11k

0

Entering edit mode

I'm confused. Why should extractTranscriptSeqs() take care of this? The reduce() call removes the exon_rank information. The resulting GRangesList after the reduce() call does even not contain actual transcript sequences for many genes.

ADD REPLY • link 8.9 years ago Michael Love 43k

0

Entering edit mode

It's true that this operation probably does not make sense biologically, but my point about extractTranscriptSeqs() was that it could sort exons based on their strand, or at least throw an error if there is inconsistency. It's easy to get into that situation if not using a TxDb.

ADD REPLY • link 8.9 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Ah, got it. Thanks.

ADD REPLY • link 8.9 years ago Michael Love 43k

0

Entering edit mode

It appears the case was never made. revElements certainly works as suggested, but it is dreadfully slow. Is there any other new BioC juju that might make this operation sing (and, would it help that in my case ALL all my exonAll2 have exactly two elements)?

ADD REPLY • link 6.5 years ago Malcolm Cook ★ 1.6k

0

Entering edit mode

I'm surprised that revElements() on a GRangesList is slow. It's implemented pretty efficiently. Do you have an example?

ADD REPLY • link 6.5 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Here's timing produced by the five statements in the following function, to give you a sense of relative speed on my computer:

user system elapsed
12.227 0.871 13.100
user system elapsed
41.688 3.735 45.423
user system elapsed
636.549 29.291 665.827
user system elapsed
5.991 0.060 6.052
user system elapsed
164.827 62.318 227.457

writeIntronFlankingSeqs<-function(genomeName='BSgenome.Hsapiens.UCSC.hg38'
          ,TxDbName='TxDb.Hsapiens.UCSC.hg38.knownGene'
          ,flankSize=25
          ,con=paste(sep='.','sjdb',flankSize,TxDbName,'fa')
           ) {
    ## PURPOSE: Write a fasta file holding <flankSize> bases flanking
    ## each unique intron known to <TxDbName>, as extracted from the
    ## BSgenome named <genomeName>.
    library(rtracklayer)
    library(GenomicFeatures)
    library(BSgenome)
    genome<-getBSgenome(genomeName)
    require(TxDbName,char=T)
    TxDb<-eval(parse(text=TxDbName))

    print(system.timeintron.gr<-unique(unlist(intronsByTranscript(TxDb)))))

    print(system.time(
      flanking.grl<-
        ## Kudos: C: Element-wise setdiff when psetdiff fails for GenomicRanges
        setdiff(as(trimintron.gr+flankSize),'GRangesList')
               ,asintron.gr,'GRangesList'))))

    stopifnot(all(2==lengths(flanking.grl)))

    print(system.time(
      flanking.grl<-
        ## Required for correct 5-3' order in subsequent call to
        ## extractTranscriptSeqs.  Kudos:
        ## [[Reorder exons on negative strand in GRanges list][Reorder exons on
        ## negative strand in GRanges list]] - along with appeal for
        ## faster approach (THIS IS SLOW).
        revElements(flanking.grl,any(strand(flanking.grl)=='-')) 
    ))

    print(system.time(
      names(flanking.grl)<-
        ## Creates names of the form:
        ## chr:start1-end1:strand^chr:start2-end2:strand
        do.call(function(...) paste(sep='^',...)
               ,as.data.frame(matrix(as.character(unlist(flanking.grl))
                                    ,ncol=2
                                    ,byrow=T)))
    ))

    print(system.time(
      writeXStringSet(extractTranscriptSeqs(genome,flanking.grl)
                     ,con)
    ))

  }

Thanks for taking a squiz Michael.

ADD REPLY • link 6.5 years ago Malcolm Cook ★ 1.6k

0

Entering edit mode

I will look into the performance issue but I'm wondering whether it wouldn't be better to do this using flank(), which would get the upstream/downstream right based on the strand.

ADD REPLY • link 6.5 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Turns out the issue is with the any(strand(flanking.grl)=='-')). There was a missing fast path for any() on a CompressedRleList. Try IRanges 2.15.19.

ADD REPLY • link 6.5 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Mike - thanks for offering to take a look, and even more so for the "spot on" comment about using flank... you're absolutely right, computing flanking.grl as below is ~100x faster

up.gr<-trim(flank(intron.gr,flankSize,T))
down.gr<-trim(flank(intron.gr,flankSize,F))
flanking.grl<-split(c(up.gr,down.gr),seq_along(up.gr))
names(flanking.grl)<- paste(sep='^',as.character(up.gr),as.character(down.gr))

Any more idiomatic approaches to building that GRangesList are most welcome... I expect there is some unlist/relist juju I've never fully appreciated.

ADD REPLY • link 6.5 years ago Malcolm Cook ★ 1.6k