GenomicFeatures::makeTranscriptDbFromGTF?

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 20 months ago

United States

Hi, I thought it would be handy to make a GenomicFeatures::TranscriptDb from a gtf file and was somehow surprised to see that I couldn't find such a function in GenomicFeatures. I'm happy to whip up such a function, but wanted to check in to make sure I'm not missing something like (1) you can already do it; or (2) it's not as straightforward as I think; (3) maybe it's there and I'm not looking hard enough. Right now I just want to build it on the gff that the knew versions of tophat build when you pass in a gtf/gff file of known transcripts (the --G/--GTF cmd line arg), but I think it'd be handy overall. I don't think gff/gtf's have any column for exon ordering, though, in which case I'd just assume the exons are ordered linearly (bye bye trans-splicing)). Good idea? Bad idea? Already done? -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

Cancer GenomicFeatures Cancer GenomicFeatures • 1.3k views

ADD COMMENT • link updated 12.7 years ago by Michael Lawrence ★ 11k • written 12.7 years ago by Steve Lianoglou ★ 13k

0

Entering edit mode

Tim Triche ★ 4.2k

@tim-triche-3561

Last seen 4.2 years ago

United States

Along these lines, I took Kasper Hansen's df2GR() function and tarted it up a bit, wrote a GR2df() function, and added some hooks to store the entire works in a trivial (4-table plus views) database schema. This stores descriptions of what each track/range means, where it came from, and when it was added, not unlike biomaRt. (I had trouble finding "pickling" functions for tracks and ranges so I rolled my own) It's not perfect (because I don't yet understand how to "bolt on" the appropriate BSgenome so that out-of-memory sequence access is performed the way I want it to) but it works well enough that I have started migrating other types of data (segmented copynumber, gene-, splice-, exon-wise RNAseq, and BS-seq) into freeze-dried GRanges. If there is a better container schema in GenomicFeatures::TranscriptDb, I'll switch to it today... And since it's fresh in my mind, how does one automatically attach the appropriate BSgenome to a GenomicRanges? It seems like it should be trivial, and the documentation hints at it, but I haven't succeeded yet in doing this automatically. I store the assembly from which a given GRanges object's features were aligned, so if I can figure out the syntax, it's done. thanks all, --t On Thu, Feb 9, 2012 at 7:55 AM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Hi, > > I thought it would be handy to make a GenomicFeatures::TranscriptDb > from a gtf file and was somehow surprised to see that I couldn't find > such a function in GenomicFeatures. > > I'm happy to whip up such a function, but wanted to check in to make > sure I'm not missing something like (1) you can already do it; or (2) > it's not as straightforward as I think; (3) maybe it's there and I'm > not looking hard enough. > > Right now I just want to build it on the gff that the knew versions of > tophat build when you pass in a gtf/gff file of known transcripts (the > --G/--GTF cmd line arg), but I think it'd be handy overall. > > I don't think gff/gtf's have any column for exon ordering, though, in > which case I'd just assume the exons are ordered linearly (bye bye > trans-splicing)). > > Good idea? Bad idea? Already done? > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD COMMENT • link 12.7 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Hi Tim, A quick answer to one of your questions, specifically: > And since it's fresh in my mind, how does one automatically attach the > appropriate BSgenome to a GenomicRanges? ?It seems like it should be > trivial, and the documentation hints at it, but I haven't succeeded yet in > doing this automatically. I have a generic method defined somewhere in one of my utility called `getBsGenome` which loads the appropriate genome of choice. It's not bullet proof, but is handy. The genome is identified by its annotation/accession/build/release, whatever you call it, where I mostly go by UCSC conveintions, ie. 'hg19', 'mm9', etc. I use it like this bsg <- getBsGenome('hg19') And it will load the BSgenome.Hsapiens.UCSC.hg19 library and return the `Hsapiens` object (that is also, now, in your workspace). I define methods for different classes that I have that basically end up calling down to the base function I pasted below. Assuming you are on bioc2.9, a potential GenomicRanges impl of the method would look like so (untested): setMethod("getBsGenome", c(x="GenomicRanges"), function(x, ...) { g <- unique(genome(x)) if (!is.character(g) || length(g) != 1L || nchar(g) < 1) { stop("Expected a single, valid genome identifier in seqinfo slot") } getBsGenome(g, ...) }) You would have to add more cases in the switch statement of the base function when you want to expand your reportoire of organisms: setMethod("getBsGenome", c(x="character"), function(x, organism=NULL, anno.source='UCSC', ...) { lib.name <- 'BSgenome.:organism:.:anno.source:.:genome:' if (is.null(organism)) { organism <- switch(substring(x, 1, 2), hg='Hsapiens', mm='Mmusculus', sa='Scerevisiae', dm='Dmelanogaster', rn='Rnorvegicus', ce='Celegans', stop("Unknown genome", x, sep=" ")) } lib.name <- gsub(':organism:', organism, lib.name) lib.name <- gsub(':anno.source:', anno.source, lib.name) lib.name <- gsub(':genome:', x, lib.name) suppressPackageStartupMessages({ found <- requirelib.name, character.only=TRUE) }) if (!found) { stoplib.name, " package required.") } get(organism, pos=paste('package', lib.name, sep=":")) }) HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 12.7 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

I'm running bioc-devel (2.10) and R-devel r58079 from SVN, if that makes any difference... Thanks much, will play with these and see what happens :-) On Thu, Feb 9, 2012 at 10:17 AM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Hi Tim, > > A quick answer to one of your questions, specifically: > > > And since it's fresh in my mind, how does one automatically attach the > > appropriate BSgenome to a GenomicRanges? It seems like it should be > > trivial, and the documentation hints at it, but I haven't succeeded yet > in > > doing this automatically. > > I have a generic method defined somewhere in one of my utility called > `getBsGenome` which loads the appropriate genome of choice. It's not > bullet proof, but is handy. > > The genome is identified by its annotation/accession/build/release, > whatever you call it, where I mostly go by UCSC conveintions, ie. > 'hg19', 'mm9', etc. > > I use it like this > > bsg <- getBsGenome('hg19') > > And it will load the BSgenome.Hsapiens.UCSC.hg19 library and return > the `Hsapiens` object (that is also, now, in your workspace). > > I define methods for different classes that I have that basically end > up calling down to the base function I pasted below. > > Assuming you are on bioc2.9, a potential GenomicRanges impl of the > method would look like so (untested): > > setMethod("getBsGenome", c(x="GenomicRanges"), function(x, ...) { > g <- unique(genome(x)) > if (!is.character(g) || length(g) != 1L || nchar(g) < 1) { > stop("Expected a single, valid genome identifier in seqinfo slot") > } > getBsGenome(g, ...) > }) > > You would have to add more cases in the switch statement of the base > function when you want to expand your reportoire of organisms: > > setMethod("getBsGenome", c(x="character"), > function(x, organism=NULL, anno.source='UCSC', ...) { > lib.name <- 'BSgenome.:organism:.:anno.source:.:genome:' > if (is.null(organism)) { > organism <- switch(substring(x, 1, 2), > hg='Hsapiens', > mm='Mmusculus', > sa='Scerevisiae', > dm='Dmelanogaster', > rn='Rnorvegicus', > ce='Celegans', > stop("Unknown genome", x, sep=" ")) > } > > lib.name <- gsub(':organism:', organism, lib.name) > lib.name <- gsub(':anno.source:', anno.source, lib.name) > lib.name <- gsub(':genome:', x, lib.name) > > suppressPackageStartupMessages({ > found <- requirelib.name, character.only=TRUE) > }) > > if (!found) { > stoplib.name, " package required.") > } > > get(organism, pos=paste('package', lib.name, sep=":")) > }) > > HTH, > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD REPLY • link 12.7 years ago Tim Triche ★ 4.2k

0

Entering edit mode

On Thu, Feb 9, 2012 at 1:55 PM, Tim Triche, Jr. <tim.triche at="" gmail.com=""> wrote: > I'm running bioc-devel (2.10) and R-devel r58079 from SVN, if that makes any > difference... I should have said "running at least bioc2.9" .. I think it's only in 2.9 that GenomicRanges objects got the seqinfo slots and, therefore, the `genome` function ... my release-history-trivia is a bit weak, though :-) FWIW, I'm typically on *-devel too ... (^^ not worth much ...) -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 12.7 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.9 years ago

United States

Hi Steve, I had thought someone would want something like this, and it would be nice to able to parse a GTF file into something more useful than a GRanges, as rtracklayer supports now. So I'd definitely encourage you to go ahead, while recommending that it be based on the import.gtf function in rtracklayer. We really need to have a centralized track I/O package. I'm not sure why, but everyone seems intent on writing the same parser over and over again. Duplicated effort frustrates me. It doesn't have to be rtracklayer, but somewhere in the infrastructure we need reliable I/O that handles all of the corner cases. I've embarked on a time consuming effort of implementing a comprehensive test suite for rtracklayer, which should make it at least a base for such a package. Thanks a lot for volunteering this contribution, Michael On Thu, Feb 9, 2012 at 7:55 AM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Hi, > > I thought it would be handy to make a GenomicFeatures::TranscriptDb > from a gtf file and was somehow surprised to see that I couldn't find > such a function in GenomicFeatures. > > I'm happy to whip up such a function, but wanted to check in to make > sure I'm not missing something like (1) you can already do it; or (2) > it's not as straightforward as I think; (3) maybe it's there and I'm > not looking hard enough. > > Right now I just want to build it on the gff that the knew versions of > tophat build when you pass in a gtf/gff file of known transcripts (the > --G/--GTF cmd line arg), but I think it'd be handy overall. > > I don't think gff/gtf's have any column for exon ordering, though, in > which case I'd just assume the exons are ordered linearly (bye bye > trans-splicing)). > > Good idea? Bad idea? Already done? > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 12.7 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hey Michael, I'll see if I can take a crack at it this w/e. Also I'm in complete agreement w.r.t i/o stuff and your thoroughness w/ rtracklayer is very much appreciated. If we do have a centralized track i/o package that is outside of rtracklayer, I nominate we name it bIO ... or something ;-) ... yes, I can hear the crickets ... -steve On Thu, Feb 9, 2012 at 4:48 PM, Michael Lawrence <lawrence.michael at="" gene.com=""> wrote: > Hi Steve, > > I had thought someone would want something like this, and it would be nice > to able to parse a GTF file into something more useful than a GRanges, as > rtracklayer supports now. > > So I'd definitely encourage you to go ahead, while recommending that it be > based on the import.gtf function in rtracklayer. > > We really need to have a centralized track I/O package. I'm not sure why, > but everyone seems intent on writing the same parser over and over again. > Duplicated effort frustrates me. It doesn't have to be rtracklayer, but > somewhere in the infrastructure we need reliable I/O that handles all of the > corner cases. I've embarked on a time consuming effort of implementing a > comprehensive test suite for rtracklayer, which should make it at least a > base for such a package. > > Thanks a lot for volunteering this contribution, > Michael > > On Thu, Feb 9, 2012 at 7:55 AM, Steve Lianoglou > <mailinglist.honeypot at="" gmail.com=""> wrote: >> >> Hi, >> >> I thought it would be handy to make a GenomicFeatures::TranscriptDb >> from a gtf file and was somehow surprised to see that I couldn't find >> such a function in GenomicFeatures. >> >> I'm happy to whip up such a function, but wanted to check in to make >> sure I'm not missing something like (1) you can already do it; or (2) >> it's not as straightforward as I think; (3) maybe it's there and I'm >> not looking hard enough. >> >> Right now I just want to build it on the gff that the knew versions of >> tophat build when you pass in a gtf/gff file of known transcripts (the >> --G/--GTF cmd line arg), but I think it'd be handy overall. >> >> I don't think gff/gtf's have any column for exon ordering, though, in >> which case I'd just assume the exons are ordered linearly (bye bye >> trans-splicing)). >> >> Good idea? Bad idea? Already done? >> >> -steve >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> ?| Memorial Sloan-Kettering Cancer Center >> ?| Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 12.7 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Followup to both of youse: 1) I've been using import() a fur piece lately and have found things in it that don't make sense to me -- e.g. Brad Bernstein's recent chromatin marks paper -- some of the segmented BED files load, some don't. Clearly I need to read the spec better, or something, but a bit of fussing with 'cat' and 'cut' "solved" the problem and let me load them into the DB. 2) Steve -- your bits worked like a charm -- I ended up adding generics for getSeq(x='GRanges') and getSnpBases(x='GRanges') since rsidsToGRanges is so handy. This is particularly handy since I can now scan enhancers for SNP variant potential impact and scan TSSs and putative cryptic TSS or SS regions for motifs easily. Using installed.genomes() and strsplit, the first thing the function does is to see if the appropriate genome is installed, and if so, load it. If that fails, it calls available.genomes() to see if the genome is available, and if so, tells the user to install it. If neither option works it complains that the genome is unknown and throws an error. For the time being I'm putting this and several other ball-of-mud type hacks into a small package, but eventually it might be nice for the generics to go into base? thanks all On Thu, Feb 9, 2012 at 2:16 PM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Hey Michael, > > I'll see if I can take a crack at it this w/e. > > Also I'm in complete agreement w.r.t i/o stuff and your thoroughness > w/ rtracklayer is very much appreciated. > > If we do have a centralized track i/o package that is outside of > rtracklayer, I nominate we name it bIO ... or something ;-) > > ... yes, I can hear the crickets ... > > -steve > > On Thu, Feb 9, 2012 at 4:48 PM, Michael Lawrence > <lawrence.michael@gene.com> wrote: > > Hi Steve, > > > > I had thought someone would want something like this, and it would be > nice > > to able to parse a GTF file into something more useful than a GRanges, as > > rtracklayer supports now. > > > > So I'd definitely encourage you to go ahead, while recommending that it > be > > based on the import.gtf function in rtracklayer. > > > > We really need to have a centralized track I/O package. I'm not sure why, > > but everyone seems intent on writing the same parser over and over again. > > Duplicated effort frustrates me. It doesn't have to be rtracklayer, but > > somewhere in the infrastructure we need reliable I/O that handles all of > the > > corner cases. I've embarked on a time consuming effort of implementing a > > comprehensive test suite for rtracklayer, which should make it at least a > > base for such a package. > > > > Thanks a lot for volunteering this contribution, > > Michael > > > > On Thu, Feb 9, 2012 at 7:55 AM, Steve Lianoglou > > <mailinglist.honeypot@gmail.com> wrote: > >> > >> Hi, > >> > >> I thought it would be handy to make a GenomicFeatures::TranscriptDb > >> from a gtf file and was somehow surprised to see that I couldn't find > >> such a function in GenomicFeatures. > >> > >> I'm happy to whip up such a function, but wanted to check in to make > >> sure I'm not missing something like (1) you can already do it; or (2) > >> it's not as straightforward as I think; (3) maybe it's there and I'm > >> not looking hard enough. > >> > >> Right now I just want to build it on the gff that the knew versions of > >> tophat build when you pass in a gtf/gff file of known transcripts (the > >> --G/--GTF cmd line arg), but I think it'd be handy overall. > >> > >> I don't think gff/gtf's have any column for exon ordering, though, in > >> which case I'd just assume the exons are ordered linearly (bye bye > >> trans-splicing)). > >> > >> Good idea? Bad idea? Already done? > >> > >> -steve > >> > >> -- > >> Steve Lianoglou > >> Graduate Student: Computational Systems Biology > >> | Memorial Sloan-Kettering Cancer Center > >> | Weill Medical College of Cornell University > >> Contact Info: http://cbio.mskcc.org/~lianos/contact > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor@r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > > > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD REPLY • link 12.7 years ago Tim Triche ★ 4.2k

Login before adding your answer.