I am trying to make a fasta file of reduced exon sequences (collapse all isoforms of a gene into one transcript). I'm running into a problem where exons on the negative strand are listed in ascending order based on chromosome coordinates, but this results in the exon sequences being written in the wrong order. I've tried several methods to re-order the negative strand elements, but they were all running very slow. What would be the best way to do this?
Thanks
library('BSgenome.Mmusculus.UCSC.mm10') library('GenomicFeatures') genome <- BSgenome.Mmusculus.UCSC.mm10 gencode <- makeTxDbFromGFF('~/Downloads/gencode.vM8.basic.annotation.gtf') exonAll <- exonsBy(gencode, by='gene') exonAll2 <- reduce(exonAll) seqs <- extractTranscriptSeqs(genome, exonAll2) writeXStringSet(seqs, '~/Desktop/basic.fa')
I'm confused. Why should extractTranscriptSeqs() take care of this? The reduce() call removes the exon_rank information. The resulting GRangesList after the reduce() call does even not contain actual transcript sequences for many genes.
It's true that this operation probably does not make sense biologically, but my point about
extractTranscriptSeqs()
was that it could sort exons based on their strand, or at least throw an error if there is inconsistency. It's easy to get into that situation if not using a TxDb.It appears the case was never made. revElements certainly works as suggested, but it is dreadfully slow. Is there any other new BioC juju that might make this operation sing (and, would it help that in my case ALL all my exonAll2 have exactly two elements)?
I'm surprised that
revElements()
on a GRangesList is slow. It's implemented pretty efficiently. Do you have an example?Here's timing produced by the five statements in the following function, to give you a sense of relative speed on my computer:
user system elapsed
12.227 0.871 13.100
user system elapsed
41.688 3.735 45.423
user system elapsed
636.549 29.291 665.827
user system elapsed
5.991 0.060 6.052
user system elapsed
164.827 62.318 227.457
Thanks for taking a squiz Michael.
I will look into the performance issue but I'm wondering whether it wouldn't be better to do this using
flank()
, which would get the upstream/downstream right based on the strand.Turns out the issue is with the
any(strand(flanking.grl)=='-'))
. There was a missing fast path forany()
on a CompressedRleList. Try IRanges 2.15.19.Mike - thanks for offering to take a look, and even more so for the "spot on" comment about using flank... you're absolutely right, computing flanking.grl as below is ~100x faster
up.gr<-trim(flank(intron.gr,flankSize,T))
down.gr<-trim(flank(intron.gr,flankSize,F))
flanking.grl<-split(c(up.gr,down.gr),seq_along(up.gr))
names(flanking.grl)<- paste(sep='^',as.character(up.gr),as.character(down.gr))
Any more idiomatic approaches to building that GRangesList are most welcome... I expect there is some unlist/relist juju I've never fully appreciated.