Create transcriptDb using gff3 files? - library GenomicFeatures and rtracklayer
1
0
Entering edit mode
@sang-chul-choi-5066
Last seen 10.2 years ago
Hi, I am wondering if I could create a TranscriptDb object (library GenomicFeatures) using a gff3 file. I could read a gff3 file using import.gff3, but I could not find a way to create TranscriptDb object from the object from import.gff3. Two arguments for makeTranscriptDb are required: transcripts, splicings. It does not seem to be easy to parse this information from the object form import.gff3. I will appreciate any help. Thank you, SangChul library(rtracklayer) import.gff3 library("GenomicFeatures") makeTranscriptDb
TranscriptDb TranscriptDb • 2.1k views
ADD COMMENT
0
Entering edit mode
@steve-lianoglou-2771
Last seen 21 months ago
United States
Hi, On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi <schoi at="" cornell.edu=""> wrote: > Hi, > > I am wondering if I could create a TranscriptDb object (library GenomicFeatures) using a gff3 file. ?I could read a gff3 file using import.gff3, but I could not find a way to create TranscriptDb object from the object from import.gff3. > > Two arguments for makeTranscriptDb are required: transcripts, splicings. It does not seem to be easy to parse this information from the object form import.gff3. ?I will appreciate any help. As far as I know, this functionality isn't there yet ... I once (early feb, 2012) suggested I might take a crack at making this happen but haven't actually found the time to do it ... I'm not sure anyone in bioc-core land (hi, Marc) has found the time to do it either, so I think you're out of luck. Sorry for that. But the good news is that I bet a patch that does this would be welcome ;-) -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD COMMENT
0
Entering edit mode
Marc was working on this during the course in Feb. Not sure what happened to it. He said it was simple. Maybe just waiting for the release to pass. Michael On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Hi, > > On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi <schoi@cornell.edu> wrote: > > Hi, > > > > I am wondering if I could create a TranscriptDb object (library > GenomicFeatures) using a gff3 file. I could read a gff3 file using > import.gff3, but I could not find a way to create TranscriptDb object from > the object from import.gff3. > > > > Two arguments for makeTranscriptDb are required: transcripts, splicings. > It does not seem to be easy to parse this information from the object form > import.gff3. I will appreciate any help. > > As far as I know, this functionality isn't there yet ... > > I once (early feb, 2012) suggested I might take a crack at making this > happen but haven't actually found the time to do it ... I'm not sure > anyone in bioc-core land (hi, Marc) has found the time to do it > either, so I think you're out of luck. > > Sorry for that. But the good news is that I bet a patch that does this > would be welcome ;-) > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
I was looking at this during the course, and this is on my TODO list for the next release cycle. I think it is long overdue and I don't think that the community is going to get it done in spite of all the enthusiasm. There has not been time to do it before now but I am hoping that will now change. It should be simple enough in principle, but it might not be exactly trivial as I have discovered (on closer inspection) that the gff specification is not as concrete as one would like it to be. Also there have been several different versions. Some things that can help speed me along: 1) which version is most important? gff3? Or one of the other versions? It is likely that with the older versions we may not be able to extract as much meaningful information. 2) where is the best place to find some typical gff3 files for examples? This should not be difficult, but when I was looking before I was finding that people were surprisingly stingy about sharing these. Marc On 04/03/2012 03:57 PM, Michael Lawrence wrote: > Marc was working on this during the course in Feb. Not sure what happened > to it. He said it was simple. Maybe just waiting for the release to pass. > > Michael > > On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou< > mailinglist.honeypot at gmail.com> wrote: > >> Hi, >> >> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi<schoi at="" cornell.edu=""> wrote: >>> Hi, >>> >>> I am wondering if I could create a TranscriptDb object (library >> GenomicFeatures) using a gff3 file. I could read a gff3 file using >> import.gff3, but I could not find a way to create TranscriptDb object from >> the object from import.gff3. >>> Two arguments for makeTranscriptDb are required: transcripts, splicings. >> It does not seem to be easy to parse this information from the object form >> import.gff3. I will appreciate any help. >> >> As far as I know, this functionality isn't there yet ... >> >> I once (early feb, 2012) suggested I might take a crack at making this >> happen but haven't actually found the time to do it ... I'm not sure >> anyone in bioc-core land (hi, Marc) has found the time to do it >> either, so I think you're out of luck. >> >> Sorry for that. But the good news is that I bet a patch that does this >> would be welcome ;-) >> >> -steve >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
On Wed, Apr 4, 2012 at 5:44 PM, Marc Carlson <mcarlson@fhcrc.org> wrote: > I was looking at this during the course, and this is on my TODO list for > the next release cycle. I think it is long overdue and I don't think that > the community is going to get it done in spite of all the enthusiasm. > There has not been time to do it before now but I am hoping that will now > change. It should be simple enough in principle, but it might not be > exactly trivial as I have discovered (on closer inspection) that the gff > specification is not as concrete as one would like it to be. Also there > have been several different versions. > > Some things that can help speed me along: > > 1) which version is most important? gff3? Or one of the other versions? > It is likely that with the older versions we may not be able to extract as > much meaningful information. > > There are really only two ways to encode a real gene model in GFF. One is using GFF3 (with the conventions defined in the spec), and the other is to use GTF, which is layered on top of GFF2 (Sanger's version of the format). GFF3 is cleaner than GTF in my opinion. > 2) where is the best place to find some typical gff3 files for examples? > This should not be difficult, but when I was looking before I was finding > that people were surprisingly stingy about sharing these. > > I would recommend simply dumping out some of the TranscriptDb objects using rtracklayer. I have had no trouble feeding those files to other tools and having them be parsed, processed and displayed correctly. If you have any questions about the spec, just ask. I've read it over many times. > > Marc > > > > > On 04/03/2012 03:57 PM, Michael Lawrence wrote: > >> Marc was working on this during the course in Feb. Not sure what happened >> to it. He said it was simple. Maybe just waiting for the release to pass. >> >> Michael >> >> On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou< >> mailinglist.honeypot@gmail.com**> wrote: >> >> Hi, >>> >>> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi<schoi@cornell.edu> >>> wrote: >>> >>>> Hi, >>>> >>>> I am wondering if I could create a TranscriptDb object (library >>>> >>> GenomicFeatures) using a gff3 file. I could read a gff3 file using >>> import.gff3, but I could not find a way to create TranscriptDb object >>> from >>> the object from import.gff3. >>> >>>> Two arguments for makeTranscriptDb are required: transcripts, splicings. >>>> >>> It does not seem to be easy to parse this information from the object >>> form >>> import.gff3. I will appreciate any help. >>> >>> As far as I know, this functionality isn't there yet ... >>> >>> I once (early feb, 2012) suggested I might take a crack at making this >>> happen but haven't actually found the time to do it ... I'm not sure >>> anyone in bioc-core land (hi, Marc) has found the time to do it >>> either, so I think you're out of luck. >>> >>> Sorry for that. But the good news is that I bet a patch that does this >>> would be welcome ;-) >>> >>> -steve >>> >>> -- >>> Steve Lianoglou >>> Graduate Student: Computational Systems Biology >>> | Memorial Sloan-Kettering Cancer Center >>> | Weill Medical College of Cornell University >>> Contact Info: http://cbio.mskcc.org/~lianos/**contact<http: cbio.="" mskcc.org="" %7elianos="" contact=""> >>> >>> ______________________________**_________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: >>> http://news.gmane.org/gmane.**science.biology.informatics.**conduc tor<http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >>> >>> [[alternative HTML version deleted]] >> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
this may also be useful to some folks: http://splicegrapher.sourceforge.net/tutorial.html#gff3-gene-model- files I guess I was under the impression that GTF-to-TxDb worked already, had not tried it though: (from Aaron Statham at WEHI, I believe) On Wed, Apr 4, 2012 at 9:25 PM, Michael Lawrence <lawrence.michael@gene.com>wrote: > On Wed, Apr 4, 2012 at 5:44 PM, Marc Carlson <mcarlson@fhcrc.org> wrote: > > > I was looking at this during the course, and this is on my TODO list for > > the next release cycle. I think it is long overdue and I don't think > that > > the community is going to get it done in spite of all the enthusiasm. > > There has not been time to do it before now but I am hoping that will > now > > change. It should be simple enough in principle, but it might not be > > exactly trivial as I have discovered (on closer inspection) that the gff > > specification is not as concrete as one would like it to be. Also there > > have been several different versions. > > > > Some things that can help speed me along: > > > > 1) which version is most important? gff3? Or one of the other versions? > > It is likely that with the older versions we may not be able to extract > as > > much meaningful information. > > > > > > There are really only two ways to encode a real gene model in GFF. One is > using GFF3 (with the conventions defined in the spec), and the other is to > use GTF, which is layered on top of GFF2 (Sanger's version of the format). > GFF3 is cleaner than GTF in my opinion. > > > > 2) where is the best place to find some typical gff3 files for examples? > > This should not be difficult, but when I was looking before I was > finding > > that people were surprisingly stingy about sharing these. > > > > > I would recommend simply dumping out some of the TranscriptDb objects using > rtracklayer. I have had no trouble feeding those files to other tools and > having them be parsed, processed and displayed correctly. > > If you have any questions about the spec, just ask. I've read it over many > times. > > > > > > Marc > > > > > > > > > > On 04/03/2012 03:57 PM, Michael Lawrence wrote: > > > >> Marc was working on this during the course in Feb. Not sure what > happened > >> to it. He said it was simple. Maybe just waiting for the release to > pass. > >> > >> Michael > >> > >> On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou< > >> mailinglist.honeypot@gmail.com**> wrote: > >> > >> Hi, > >>> > >>> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi<schoi@cornell.edu> > >>> wrote: > >>> > >>>> Hi, > >>>> > >>>> I am wondering if I could create a TranscriptDb object (library > >>>> > >>> GenomicFeatures) using a gff3 file. I could read a gff3 file using > >>> import.gff3, but I could not find a way to create TranscriptDb object > >>> from > >>> the object from import.gff3. > >>> > >>>> Two arguments for makeTranscriptDb are required: transcripts, > splicings. > >>>> > >>> It does not seem to be easy to parse this information from the object > >>> form > >>> import.gff3. I will appreciate any help. > >>> > >>> As far as I know, this functionality isn't there yet ... > >>> > >>> I once (early feb, 2012) suggested I might take a crack at making this > >>> happen but haven't actually found the time to do it ... I'm not sure > >>> anyone in bioc-core land (hi, Marc) has found the time to do it > >>> either, so I think you're out of luck. > >>> > >>> Sorry for that. But the good news is that I bet a patch that does this > >>> would be welcome ;-) > >>> > >>> -steve > >>> > >>> -- > >>> Steve Lianoglou > >>> Graduate Student: Computational Systems Biology > >>> | Memorial Sloan-Kettering Cancer Center > >>> | Weill Medical College of Cornell University > >>> Contact Info: http://cbio.mskcc.org/~lianos/**contact< > http://cbio.mskcc.org/%7Elianos/contact> > >>> > >>> ______________________________**_________________ > >>> Bioconductor mailing list > >>> Bioconductor@r-project.org > >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor< > https://stat.ethz.ch/mailman/listinfo/bioconductor> > >>> Search the archives: > >>> http://news.gmane.org/gmane.**science.biology.informatics.**conductor< > http://news.gmane.org/gmane.science.biology.informatics.conductor> > >>> > >>> [[alternative HTML version deleted]] > >> > >> > >> ______________________________**_________________ > >> Bioconductor mailing list > >> Bioconductor@r-project.org > >> https://stat.ethz.ch/mailman/**listinfo/bioconductor< > https://stat.ethz.ch/mailman/listinfo/bioconductor> > >> Search the archives: http://news.gmane.org/gmane.** > >> science.biology.informatics.**conductor< > http://news.gmane.org/gmane.science.biology.informatics.conductor> > >> > > > > ______________________________**_________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/**listinfo/bioconductor< > https://stat.ethz.ch/mailman/listinfo/bioconductor> > > Search the archives: http://news.gmane.org/gmane.** > > science.biology.informatics.**conductor< > http://news.gmane.org/gmane.science.biology.informatics.conductor> > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
I think that gff3 is the most advanced version of gff. See the following website http://www.sequenceontology.org/gff3.shtml It is not crystal clear, though. Feature mRNA seems to be transcripts, and exons and CDSs are parts of mRNAs. Features mRNA seem to be children of feature gene. I've found out that the gff3 file that I have been parsing does not seem to be typical because it does not have alternative spliced mRNA (the genome that I have been working on is pretty in draft stage). This makes it easy. I've also found out that not all exons have corresonding CDSs. This makes it hard to create TranscriptDb object. Thank you, SangChul ________________________________________ From: bioconductor-bounces@r-project.org [bioconductor- bounces@r-project.org] on behalf of Marc Carlson [mcarlson@fhcrc.org] Sent: Wednesday, April 04, 2012 8:44 PM To: bioconductor at r-project.org Subject: Re: [BioC] Create transcriptDb using gff3 files? - library GenomicFeatures and rtracklayer I was looking at this during the course, and this is on my TODO list for the next release cycle. I think it is long overdue and I don't think that the community is going to get it done in spite of all the enthusiasm. There has not been time to do it before now but I am hoping that will now change. It should be simple enough in principle, but it might not be exactly trivial as I have discovered (on closer inspection) that the gff specification is not as concrete as one would like it to be. Also there have been several different versions. Some things that can help speed me along: 1) which version is most important? gff3? Or one of the other versions? It is likely that with the older versions we may not be able to extract as much meaningful information. 2) where is the best place to find some typical gff3 files for examples? This should not be difficult, but when I was looking before I was finding that people were surprisingly stingy about sharing these. Marc On 04/03/2012 03:57 PM, Michael Lawrence wrote: > Marc was working on this during the course in Feb. Not sure what happened > to it. He said it was simple. Maybe just waiting for the release to pass. > > Michael > > On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou< > mailinglist.honeypot at gmail.com> wrote: > >> Hi, >> >> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi<schoi at="" cornell.edu=""> wrote: >>> Hi, >>> >>> I am wondering if I could create a TranscriptDb object (library >> GenomicFeatures) using a gff3 file. I could read a gff3 file using >> import.gff3, but I could not find a way to create TranscriptDb object from >> the object from import.gff3. >>> Two arguments for makeTranscriptDb are required: transcripts, splicings. >> It does not seem to be easy to parse this information from the object form >> import.gff3. I will appreciate any help. >> >> As far as I know, this functionality isn't there yet ... >> >> I once (early feb, 2012) suggested I might take a crack at making this >> happen but haven't actually found the time to do it ... I'm not sure >> anyone in bioc-core land (hi, Marc) has found the time to do it >> either, so I think you're out of luck. >> >> Sorry for that. But the good news is that I bet a patch that does this >> would be welcome ;-) >> >> -steve >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Supporting both Ensemble's GTF and GFF3 would be ideal. Ensembl GTF would open up many genomes, including those in: ftp://ftp.ensembl.org/pub/release-66/gtf/ ftp://ftp.ensemblgenomes.org/pub/metazoa/release-13/gtf/ ftp://ftp.ensemblgenomes.org/pub/fungi/release-13/gtf/ ftp://ftp.ensemblgenomes.org/pub/protists/release-13/gtf/ ftp://ftp.ensemblgenomes.org/pub/plants/release-13/gtf/ Supporting Ensembl GTF would make it easy to distribute/archive the elements of a transcriptome analysis alongside a project/analysis in a generally useful format (i.e. IGV and other tools can work with it more or less directly) Related note, I have learned that the BioMarts produced for EnsemblGenome's are NOT ARCHIVED, whereas it seems that historic GTF IS available. Upshot: you'd best not depend upon being able to reproduce today's TranscriptDbFromBiomart tomorrow. re: "typical gff3 files"... Flybase makes gff3 extracts and if my understanding is correct, have been diligent in "getting it right" Also, NCBI historically has tried to provide GFFx extracts, with oodles of caveats. But, but, Last month they announced progress on improving their GFF3 offerings: http://bio.perl.org/pipermail/bioperl-l/2012-March/036387.html Example: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ YMMV. I too once hoped to find makeTranscriptDbFromGFF3 capability so as to allow easy tracking the head of Flybase's offerings, i.e. ftp://ftp.fl ybase.net/genomes/Drosophila_melanogaster/dmel_r5.44_FB2012_02/gff/ - alas I too have not followed up. ~Malcolm > -----Original Message----- > From: bioconductor-bounces at r-project.org [mailto:bioconductor- > bounces at r-project.org] On Behalf Of Marc Carlson > Sent: Wednesday, April 04, 2012 7:44 PM > To: bioconductor at r-project.org > Subject: Re: [BioC] Create transcriptDb using gff3 files? - library > GenomicFeatures and rtracklayer > > I was looking at this during the course, and this is on my TODO list for > the next release cycle. I think it is long overdue and I don't think > that the community is going to get it done in spite of all the > enthusiasm. There has not been time to do it before now but I am hoping > that will now change. It should be simple enough in principle, but it > might not be exactly trivial as I have discovered (on closer inspection) > that the gff specification is not as concrete as one would like it to > be. Also there have been several different versions. > > Some things that can help speed me along: > > 1) which version is most important? gff3? Or one of the other > versions? It is likely that with the older versions we may not be able > to extract as much meaningful information. > > 2) where is the best place to find some typical gff3 files for > examples? This should not be difficult, but when I was looking before I > was finding that people were surprisingly stingy about sharing these. > > > Marc > > > > On 04/03/2012 03:57 PM, Michael Lawrence wrote: > > Marc was working on this during the course in Feb. Not sure what > happened > > to it. He said it was simple. Maybe just waiting for the release to pass. > > > > Michael > > > > On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou< > > mailinglist.honeypot at gmail.com> wrote: > > > >> Hi, > >> > >> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi<schoi at="" cornell.edu=""> > wrote: > >>> Hi, > >>> > >>> I am wondering if I could create a TranscriptDb object (library > >> GenomicFeatures) using a gff3 file. I could read a gff3 file using > >> import.gff3, but I could not find a way to create TranscriptDb object from > >> the object from import.gff3. > >>> Two arguments for makeTranscriptDb are required: transcripts, splicings. > >> It does not seem to be easy to parse this information from the object > form > >> import.gff3. I will appreciate any help. > >> > >> As far as I know, this functionality isn't there yet ... > >> > >> I once (early feb, 2012) suggested I might take a crack at making this > >> happen but haven't actually found the time to do it ... I'm not sure > >> anyone in bioc-core land (hi, Marc) has found the time to do it > >> either, so I think you're out of luck. > >> > >> Sorry for that. But the good news is that I bet a patch that does this > >> would be welcome ;-) > >> > >> -steve > >> > >> -- > >> Steve Lianoglou > >> Graduate Student: Computational Systems Biology > >> | Memorial Sloan-Kettering Cancer Center > >> | Weill Medical College of Cornell University > >> Contact Info: http://cbio.mskcc.org/~lianos/contact > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Hi all, Sorry I haven't read the whole thread, still I have a few comments that might be off the main topic then. On 5 Apr 2012, at 17:01, Cook, Malcolm wrote: > Supporting both Ensemble's GTF and GFF3 would be ideal. > > Ensembl GTF would open up many genomes, including those in: > ftp://ftp.ensembl.org/pub/release-66/gtf/ > ftp://ftp.ensemblgenomes.org/pub/metazoa/release-13/gtf/ > ftp://ftp.ensemblgenomes.org/pub/fungi/release-13/gtf/ > ftp://ftp.ensemblgenomes.org/pub/protists/release-13/gtf/ > ftp://ftp.ensemblgenomes.org/pub/plants/release-13/gtf/ > > > Supporting Ensembl GTF would make it easy to distribute/archive the elements of a transcriptome analysis alongside a project/analysis in a generally useful format (i.e. IGV and other tools can work with it more or less directly) In my package easyRNASeq, I already load Ensembl GTF files and convert them into GRanges / RangedData object. It's pretty straightforward. I guess that adapting the code to create a transcriptDb should be do- able. > > Related note, I have learned that the BioMarts produced for EnsemblGenome's are NOT ARCHIVED, whereas it seems that historic GTF IS available. Upshot: you'd best not depend upon being able to reproduce today's TranscriptDbFromBiomart tomorrow. I don't know where you learned that and how you meant it exactly, but using biomaRt, you can still access Ensembl version as old as of march 2009: see http://mar2009.archive.ensembl.org/index.html. It's not straightforward to figure it out, but on the main Ensembl webpage, you can get the full list by clicking the "view in archive site" link at the bottom left of the papge. It redirects to this URL: http://www.ensembl.org/Help/ArchiveList. Then, to use biomaRt on a given archive, you need to change the host argument of useMart to the URL of the corresponding Ensembl archive as in: useMart("ENSEMBL_MART_ENSEMBL",host="mar2009.archive.ensembl.org"). I recon that the biomaRT archive arguments does not work for that. I need to post something about this on the mailing list. > > re: "typical gff3 files"... > Flybase makes gff3 extracts and if my understanding is correct, have been diligent in "getting it right" I believe so too. Again, in easyRNASeq, I do parse Flybase gff3 files and convert them to GRanges/RangedData object, but all the merit goes to the readGff3 function from the genomeIntervals package. Reading a gff3 file with this function is extremely quick as is accessing the gffAttributes (performed at the C layer) . Cheers, Nico > > Also, NCBI historically has tried to provide GFFx extracts, with oodles of caveats. > But, but, Last month they announced progress on improving their GFF3 offerings: http://bio.perl.org/pipermail/bioperl-l/2012-March/036387.html > Example: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ > YMMV. > > I too once hoped to find makeTranscriptDbFromGFF3 capability so as to allow easy tracking the head of Flybase's offerings, i.e. ftp://ftp .flybase.net/genomes/Drosophila_melanogaster/dmel_r5.44_FB2012_02/gff/ - alas I too have not followed up. > > ~Malcolm > > >> -----Original Message----- >> From: bioconductor-bounces at r-project.org [mailto:bioconductor- >> bounces at r-project.org] On Behalf Of Marc Carlson >> Sent: Wednesday, April 04, 2012 7:44 PM >> To: bioconductor at r-project.org >> Subject: Re: [BioC] Create transcriptDb using gff3 files? - library >> GenomicFeatures and rtracklayer >> >> I was looking at this during the course, and this is on my TODO list for >> the next release cycle. I think it is long overdue and I don't think >> that the community is going to get it done in spite of all the >> enthusiasm. There has not been time to do it before now but I am hoping >> that will now change. It should be simple enough in principle, but it >> might not be exactly trivial as I have discovered (on closer inspection) >> that the gff specification is not as concrete as one would like it to >> be. Also there have been several different versions. >> >> Some things that can help speed me along: >> >> 1) which version is most important? gff3? Or one of the other >> versions? It is likely that with the older versions we may not be able >> to extract as much meaningful information. >> >> 2) where is the best place to find some typical gff3 files for >> examples? This should not be difficult, but when I was looking before I >> was finding that people were surprisingly stingy about sharing these. >> >> >> Marc >> >> >> >> On 04/03/2012 03:57 PM, Michael Lawrence wrote: >>> Marc was working on this during the course in Feb. Not sure what >> happened >>> to it. He said it was simple. Maybe just waiting for the release to pass. >>> >>> Michael >>> >>> On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou< >>> mailinglist.honeypot at gmail.com> wrote: >>> >>>> Hi, >>>> >>>> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi<schoi at="" cornell.edu=""> >> wrote: >>>>> Hi, >>>>> >>>>> I am wondering if I could create a TranscriptDb object (library >>>> GenomicFeatures) using a gff3 file. I could read a gff3 file using >>>> import.gff3, but I could not find a way to create TranscriptDb object from >>>> the object from import.gff3. >>>>> Two arguments for makeTranscriptDb are required: transcripts, splicings. >>>> It does not seem to be easy to parse this information from the object >> form >>>> import.gff3. I will appreciate any help. >>>> >>>> As far as I know, this functionality isn't there yet ... >>>> >>>> I once (early feb, 2012) suggested I might take a crack at making this >>>> happen but haven't actually found the time to do it ... I'm not sure >>>> anyone in bioc-core land (hi, Marc) has found the time to do it >>>> either, so I think you're out of luck. >>>> >>>> Sorry for that. But the good news is that I bet a patch that does this >>>> would be welcome ;-) >>>> >>>> -steve >>>> >>>> -- >>>> Steve Lianoglou >>>> Graduate Student: Computational Systems Biology >>>> | Memorial Sloan-Kettering Cancer Center >>>> | Weill Medical College of Cornell University >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
the Gist that I posted earlier has some minor (trivial) buglets in it, but I should have a fixed version that processes Cufflinks GTF files in a moment here. Obviously it's easier to use subread and subjunc since they just fire out a .bed file with the supported splice junctions (I have not yet checked to see if the BED forms a splicegraph or has enough information to do so, but I suspect that if it doesn't, it can be fixed fairly easily to do so). The featureCounts function in Rsubread is amazingly fast, and works for any arbitrary collection of features (e.g. edges), so having TranscriptDb objects to compare is nice. On Thu, Apr 5, 2012 at 8:21 AM, Nicolas Delhomme <delhomme@embl.de> wrote: > Hi all, > > Sorry I haven't read the whole thread, still I have a few comments that > might be off the main topic then. > > On 5 Apr 2012, at 17:01, Cook, Malcolm wrote: > > > Supporting both Ensemble's GTF and GFF3 would be ideal. > > > > Ensembl GTF would open up many genomes, including those in: > > ftp://ftp.ensembl.org/pub/release-66/gtf/ > > ftp://ftp.ensemblgenomes.org/pub/metazoa/release-13/gtf/ > > ftp://ftp.ensemblgenomes.org/pub/fungi/release-13/gtf/ > > ftp://ftp.ensemblgenomes.org/pub/protists/release-13/gtf/ > > ftp://ftp.ensemblgenomes.org/pub/plants/release-13/gtf/ > > > > > > Supporting Ensembl GTF would make it easy to distribute/archive the > elements of a transcriptome analysis alongside a project/analysis in a > generally useful format (i.e. IGV and other tools can work with it more or > less directly) > > In my package easyRNASeq, I already load Ensembl GTF files and convert > them into GRanges / RangedData object. It's pretty straightforward. I guess > that adapting the code to create a transcriptDb should be do-able. > > > > > Related note, I have learned that the BioMarts produced for > EnsemblGenome's are NOT ARCHIVED, whereas it seems that historic GTF IS > available. Upshot: you'd best not depend upon being able to reproduce > today's TranscriptDbFromBiomart tomorrow. > > I don't know where you learned that and how you meant it exactly, but > using biomaRt, you can still access Ensembl version as old as of march > 2009: see http://mar2009.archive.ensembl.org/index.html. It's not > straightforward to figure it out, but on the main Ensembl webpage, you can > get the full list by clicking the "view in archive site" link at the bottom > left of the papge. It redirects to this URL: > http://www.ensembl.org/Help/ArchiveList. > Then, to use biomaRt on a given archive, you need to change the host > argument of useMart to the URL of the corresponding Ensembl archive as in: > useMart("ENSEMBL_MART_ENSEMBL",host="mar2009.archive.ensembl.org"). I > recon that the biomaRT archive arguments does not work for that. I need to > post something about this on the mailing list. > > > > > re: "typical gff3 files"... > > Flybase makes gff3 extracts and if my understanding is correct, have > been diligent in "getting it right" > > I believe so too. Again, in easyRNASeq, I do parse Flybase gff3 files and > convert them to GRanges/RangedData object, but all the merit goes to the > readGff3 function from the genomeIntervals package. Reading a gff3 file > with this function is extremely quick as is accessing the gffAttributes > (performed at the C layer) . > > Cheers, > > Nico > > > > > Also, NCBI historically has tried to provide GFFx extracts, with oodles > of caveats. > > But, but, Last month they announced progress on improving their GFF3 > offerings: http://bio.perl.org/pipermail/bioperl-l/2012-March/036387.html > > Example: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ > > YMMV. > > > > I too once hoped to find makeTranscriptDbFromGFF3 capability so as to > allow easy tracking the head of Flybase's offerings, i.e. > ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.44_FB2 012_02/gff/- alas I too have not followed up. > > > > ~Malcolm > > > > > >> -----Original Message----- > >> From: bioconductor-bounces@r-project.org [mailto:bioconductor- > >> bounces@r-project.org] On Behalf Of Marc Carlson > >> Sent: Wednesday, April 04, 2012 7:44 PM > >> To: bioconductor@r-project.org > >> Subject: Re: [BioC] Create transcriptDb using gff3 files? - library > >> GenomicFeatures and rtracklayer > >> > >> I was looking at this during the course, and this is on my TODO list for > >> the next release cycle. I think it is long overdue and I don't think > >> that the community is going to get it done in spite of all the > >> enthusiasm. There has not been time to do it before now but I am hoping > >> that will now change. It should be simple enough in principle, but it > >> might not be exactly trivial as I have discovered (on closer inspection) > >> that the gff specification is not as concrete as one would like it to > >> be. Also there have been several different versions. > >> > >> Some things that can help speed me along: > >> > >> 1) which version is most important? gff3? Or one of the other > >> versions? It is likely that with the older versions we may not be able > >> to extract as much meaningful information. > >> > >> 2) where is the best place to find some typical gff3 files for > >> examples? This should not be difficult, but when I was looking before I > >> was finding that people were surprisingly stingy about sharing these. > >> > >> > >> Marc > >> > >> > >> > >> On 04/03/2012 03:57 PM, Michael Lawrence wrote: > >>> Marc was working on this during the course in Feb. Not sure what > >> happened > >>> to it. He said it was simple. Maybe just waiting for the release to > pass. > >>> > >>> Michael > >>> > >>> On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou< > >>> mailinglist.honeypot@gmail.com> wrote: > >>> > >>>> Hi, > >>>> > >>>> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi<schoi@cornell.edu> > >> wrote: > >>>>> Hi, > >>>>> > >>>>> I am wondering if I could create a TranscriptDb object (library > >>>> GenomicFeatures) using a gff3 file. I could read a gff3 file using > >>>> import.gff3, but I could not find a way to create TranscriptDb object > from > >>>> the object from import.gff3. > >>>>> Two arguments for makeTranscriptDb are required: transcripts, > splicings. > >>>> It does not seem to be easy to parse this information from the object > >> form > >>>> import.gff3. I will appreciate any help. > >>>> > >>>> As far as I know, this functionality isn't there yet ... > >>>> > >>>> I once (early feb, 2012) suggested I might take a crack at making this > >>>> happen but haven't actually found the time to do it ... I'm not sure > >>>> anyone in bioc-core land (hi, Marc) has found the time to do it > >>>> either, so I think you're out of luck. > >>>> > >>>> Sorry for that. But the good news is that I bet a patch that does this > >>>> would be welcome ;-) > >>>> > >>>> -steve > >>>> > >>>> -- > >>>> Steve Lianoglou > >>>> Graduate Student: Computational Systems Biology > >>>> | Memorial Sloan-Kettering Cancer Center > >>>> | Weill Medical College of Cornell University > >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact > >>>> > >>>> _______________________________________________ > >>>> Bioconductor mailing list > >>>> Bioconductor@r-project.org > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> Search the archives: > >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>> > >>> [[alternative HTML version deleted]] > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor@r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor@r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
This is almost in the bag (cuff2db.R). I have one last thing to figure out: how to extract exon rank from Cufflinks output. Error in .normargSplicings(splicings, unique_tx_ids) : 'splicings$exon_rank' must be an integer vector with no NAs Ideas? Here's the code. cuff2db <- function(gtf.file, out.file = NULL, verbose = TRUE) { require(rtracklayer) require(GenomicRanges) require(GenomicFeatures) requiredAttribs <- c("gene_id", "transcript_id", "exon_number") if (verbose) message("Importing ", gtf.file) tmp <- import.gff(gtf.file, asRangedData=FALSE) #dispose of unspliced unstranded transcripts tmp <- tmp[ which(strand(tmp) %in% c('+','-')) ] #parse the per exon attributes if (verbose) message("Parsing gene/transcript/exon ids") tmpx <- strsplit(gsub(";", "", gsub("\"", "", values(tmp)$group)), split=" ") tmpx <- lapply(tmpx, function(x) { ans <- x[1:(length(x)/2)*2] names(ans) <- x[1:(length(x)/2)*2-1] ans }) attribs <- unique(unlist(lapply(tmpx, names))) if (!all(requiredAttribs %in% attribs)) stop("Not all required attributes are in this GFF file") names(attribs) <- attribs tmpx2 <- do.call(data.frame, c(lapply(attribs,function(x) sapply(tmpx,"[",x)), stringsAsFactors=FALSE)) values(tmp) <- tmpx2 if (verbose) message("Creating tables") #make transcripts table tmpT <- split(tmp, values(tmp)$transcript_id) transcripts <- data.frame( tx_id=1:length(tmpT), tx_name=names(tmpT), tx_chrom=as.character(seqnames(unlist(tmpT))[start(tmpT@partitioni ng)]), tx_strand=as.character(strand(unlist(tmpT))[start(tmpT@partitionin g)]), tx_start=sapply(start(ranges(tmpT)), min), tx_end=sapply(end(ranges(tmpT)), max), stringsAsFactors=FALSE) #make splicings table splicings <- data.frame( tx_id=rep(1:length(tmpT), elementLengths(tmpT)), exon_rank=as.integer(values(unlist(tmpT))$exon_number), exon_chrom=as.character(seqnames(unlist(tmpT))), exon_strand=as.character(strand(unlist(tmpT))), exon_start=start(unlist(tmpT)), exon_end=end(unlist(tmpT)), stringsAsFactors=FALSE) #make genes table gene_txs <- tapply(values(tmp)$transcript_id, values(tmp)$gene_id, unique) genes <- data.frame( tx_name=unlist(gene_txs), gene_id=rep(names(gene_txs), sapply(gene_txs, length)), stringsAsFactors=FALSE) #create the db if (verbose) message("Creating TranscriptDb") tmpdb <- makeTranscriptDb(transcripts, splicings, genes=genes) if (verbose) message("Use saveFeatures() to save the database to a file") return(tmpdb) } On Thu, Apr 5, 2012 at 9:34 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > the Gist that I posted earlier has some minor (trivial) buglets in it, but > I should have a fixed version that processes Cufflinks GTF files in a > moment here. > > Obviously it's easier to use subread and subjunc since they just fire out > a .bed file with the supported splice junctions (I have not yet checked to > see if the BED forms a splicegraph or has enough information to do so, but > I suspect that if it doesn't, it can be fixed fairly easily to do so). The > featureCounts function in Rsubread is amazingly fast, and works for any > arbitrary collection of features (e.g. edges), so having TranscriptDb > objects to compare is nice. > > > On Thu, Apr 5, 2012 at 8:21 AM, Nicolas Delhomme <delhomme@embl.de> wrote: > >> Hi all, >> >> Sorry I haven't read the whole thread, still I have a few comments that >> might be off the main topic then. >> >> On 5 Apr 2012, at 17:01, Cook, Malcolm wrote: >> >> > Supporting both Ensemble's GTF and GFF3 would be ideal. >> > >> > Ensembl GTF would open up many genomes, including those in: >> > ftp://ftp.ensembl.org/pub/release-66/gtf/ >> > ftp://ftp.ensemblgenomes.org/pub/metazoa/release-13/gtf/ >> > ftp://ftp.ensemblgenomes.org/pub/fungi/release-13/gtf/ >> > ftp://ftp.ensemblgenomes.org/pub/protists/release-13/gtf/ >> > ftp://ftp.ensemblgenomes.org/pub/plants/release-13/gtf/ >> > >> > >> > Supporting Ensembl GTF would make it easy to distribute/archive the >> elements of a transcriptome analysis alongside a project/analysis in a >> generally useful format (i.e. IGV and other tools can work with it more or >> less directly) >> >> In my package easyRNASeq, I already load Ensembl GTF files and convert >> them into GRanges / RangedData object. It's pretty straightforward. I guess >> that adapting the code to create a transcriptDb should be do-able. >> >> > >> > Related note, I have learned that the BioMarts produced for >> EnsemblGenome's are NOT ARCHIVED, whereas it seems that historic GTF IS >> available. Upshot: you'd best not depend upon being able to reproduce >> today's TranscriptDbFromBiomart tomorrow. >> >> I don't know where you learned that and how you meant it exactly, but >> using biomaRt, you can still access Ensembl version as old as of march >> 2009: see http://mar2009.archive.ensembl.org/index.html. It's not >> straightforward to figure it out, but on the main Ensembl webpage, you can >> get the full list by clicking the "view in archive site" link at the bottom >> left of the papge. It redirects to this URL: >> http://www.ensembl.org/Help/ArchiveList. >> Then, to use biomaRt on a given archive, you need to change the host >> argument of useMart to the URL of the corresponding Ensembl archive as in: >> useMart("ENSEMBL_MART_ENSEMBL",host="mar2009.archive.ensembl.org"). I >> recon that the biomaRT archive arguments does not work for that. I need to >> post something about this on the mailing list. >> >> > >> > re: "typical gff3 files"... >> > Flybase makes gff3 extracts and if my understanding is correct, have >> been diligent in "getting it right" >> >> I believe so too. Again, in easyRNASeq, I do parse Flybase gff3 files and >> convert them to GRanges/RangedData object, but all the merit goes to the >> readGff3 function from the genomeIntervals package. Reading a gff3 file >> with this function is extremely quick as is accessing the gffAttributes >> (performed at the C layer) . >> >> Cheers, >> >> Nico >> >> > >> > Also, NCBI historically has tried to provide GFFx extracts, with oodles >> of caveats. >> > But, but, Last month they announced progress on improving their GFF3 >> offerings: >> http://bio.perl.org/pipermail/bioperl-l/2012-March/036387.html >> > Example: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ >> > YMMV. >> > >> > I too once hoped to find makeTranscriptDbFromGFF3 capability so as to >> allow easy tracking the head of Flybase's offerings, i.e. >> ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.44_FB 2012_02/gff/- alas I too have not followed up. >> > >> > ~Malcolm >> > >> > >> >> -----Original Message----- >> >> From: bioconductor-bounces@r-project.org [mailto:bioconductor- >> >> bounces@r-project.org] On Behalf Of Marc Carlson >> >> Sent: Wednesday, April 04, 2012 7:44 PM >> >> To: bioconductor@r-project.org >> >> Subject: Re: [BioC] Create transcriptDb using gff3 files? - library >> >> GenomicFeatures and rtracklayer >> >> >> >> I was looking at this during the course, and this is on my TODO list >> for >> >> the next release cycle. I think it is long overdue and I don't think >> >> that the community is going to get it done in spite of all the >> >> enthusiasm. There has not been time to do it before now but I am >> hoping >> >> that will now change. It should be simple enough in principle, but it >> >> might not be exactly trivial as I have discovered (on closer >> inspection) >> >> that the gff specification is not as concrete as one would like it to >> >> be. Also there have been several different versions. >> >> >> >> Some things that can help speed me along: >> >> >> >> 1) which version is most important? gff3? Or one of the other >> >> versions? It is likely that with the older versions we may not be able >> >> to extract as much meaningful information. >> >> >> >> 2) where is the best place to find some typical gff3 files for >> >> examples? This should not be difficult, but when I was looking before >> I >> >> was finding that people were surprisingly stingy about sharing these. >> >> >> >> >> >> Marc >> >> >> >> >> >> >> >> On 04/03/2012 03:57 PM, Michael Lawrence wrote: >> >>> Marc was working on this during the course in Feb. Not sure what >> >> happened >> >>> to it. He said it was simple. Maybe just waiting for the release to >> pass. >> >>> >> >>> Michael >> >>> >> >>> On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou< >> >>> mailinglist.honeypot@gmail.com> wrote: >> >>> >> >>>> Hi, >> >>>> >> >>>> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi<schoi@cornell.edu> >> >> wrote: >> >>>>> Hi, >> >>>>> >> >>>>> I am wondering if I could create a TranscriptDb object (library >> >>>> GenomicFeatures) using a gff3 file. I could read a gff3 file using >> >>>> import.gff3, but I could not find a way to create TranscriptDb >> object from >> >>>> the object from import.gff3. >> >>>>> Two arguments for makeTranscriptDb are required: transcripts, >> splicings. >> >>>> It does not seem to be easy to parse this information from the object >> >> form >> >>>> import.gff3. I will appreciate any help. >> >>>> >> >>>> As far as I know, this functionality isn't there yet ... >> >>>> >> >>>> I once (early feb, 2012) suggested I might take a crack at making >> this >> >>>> happen but haven't actually found the time to do it ... I'm not sure >> >>>> anyone in bioc-core land (hi, Marc) has found the time to do it >> >>>> either, so I think you're out of luck. >> >>>> >> >>>> Sorry for that. But the good news is that I bet a patch that does >> this >> >>>> would be welcome ;-) >> >>>> >> >>>> -steve >> >>>> >> >>>> -- >> >>>> Steve Lianoglou >> >>>> Graduate Student: Computational Systems Biology >> >>>> | Memorial Sloan-Kettering Cancer Center >> >>>> | Weill Medical College of Cornell University >> >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact >> >>>> >> >>>> _______________________________________________ >> >>>> Bioconductor mailing list >> >>>> Bioconductor@r-project.org >> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >>>> Search the archives: >> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >>>> >> >>> [[alternative HTML version deleted]] >> >>> >> >>> _______________________________________________ >> >>> Bioconductor mailing list >> >>> Bioconductor@r-project.org >> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >>> Search the archives: >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> >> _______________________________________________ >> >> Bioconductor mailing list >> >> Bioconductor@r-project.org >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> Search the archives: >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor@r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Hey guys, The rtracklayer package parses GFF3 and GTF and delivers them as either GRanges or RangedData. I am surprised to see how many parsers people are implementing for these file types. Stuff like file parsing needs to be pushed down into the core infrastructure. If for some reason rtracklayer is falling short, say in performance, we should work together to address that. The code below does not seem to be using rtracklayer correctly. If you have a GFF3 file, then passing version = "3" or simply calling export.gff3 will do all of that nasty attribute parsing that consumes so much of that function. It does try to auto-detect the format, but maybe that was not working at the time this was written, or maybe the file is missing the version directive. This is probably the fault of the rtracklayer documentation. I have tried to improve it though over the past release cycle, so I would encourage taking another look at it. Thanks, Michael On Thu, Apr 5, 2012 at 10:13 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > This is almost in the bag (cuff2db.R). I have one last thing to figure > out: how to extract exon rank from Cufflinks output. > > Error in .normargSplicings(splicings, unique_tx_ids) : > 'splicings$exon_rank' must be an integer vector with no NAs > > Ideas? Here's the code. > > cuff2db <- function(gtf.file, out.file = NULL, verbose = TRUE) { > > require(rtracklayer) > require(GenomicRanges) > require(GenomicFeatures) > > requiredAttribs <- c("gene_id", "transcript_id", "exon_number") > > if (verbose) message("Importing ", gtf.file) > tmp <- import.gff(gtf.file, asRangedData=FALSE) > > #dispose of unspliced unstranded transcripts > tmp <- tmp[ which(strand(tmp) %in% c('+','-')) ] > > #parse the per exon attributes > if (verbose) message("Parsing gene/transcript/exon ids") > tmpx <- strsplit(gsub(";", "", gsub("\"", "", values(tmp)$group)), > split=" ") > tmpx <- lapply(tmpx, function(x) { > ans <- x[1:(length(x)/2)*2] > names(ans) <- x[1:(length(x)/2)*2-1] > ans > }) > > attribs <- unique(unlist(lapply(tmpx, names))) > if (!all(requiredAttribs %in% attribs)) > stop("Not all required attributes are in this GFF file") > names(attribs) <- attribs > tmpx2 <- do.call(data.frame, c(lapply(attribs,function(x) > sapply(tmpx,"[",x)), > stringsAsFactors=FALSE)) > values(tmp) <- tmpx2 > > if (verbose) message("Creating tables") > > #make transcripts table > tmpT <- split(tmp, values(tmp)$transcript_id) > transcripts <- data.frame( > tx_id=1:length(tmpT), > tx_name=names(tmpT), > tx_chrom=as.character(seqnames(unlist(tmpT))[start(tmpT@partitioning > )]), > tx_strand=as.character(strand(unlist(tmpT))[start(tmpT@partitioni ng)]), > tx_start=sapply(start(ranges(tmpT)), min), > tx_end=sapply(end(ranges(tmpT)), max), > stringsAsFactors=FALSE) > > #make splicings table > splicings <- data.frame( > tx_id=rep(1:length(tmpT), elementLengths(tmpT)), > exon_rank=as.integer(values(unlist(tmpT))$exon_number), > exon_chrom=as.character(seqnames(unlist(tmpT))), > exon_strand=as.character(strand(unlist(tmpT))), > exon_start=start(unlist(tmpT)), > exon_end=end(unlist(tmpT)), > stringsAsFactors=FALSE) > > #make genes table > gene_txs <- tapply(values(tmp)$transcript_id, values(tmp)$gene_id, unique) > genes <- data.frame( > tx_name=unlist(gene_txs), > gene_id=rep(names(gene_txs), sapply(gene_txs, length)), > stringsAsFactors=FALSE) > > #create the db > if (verbose) message("Creating TranscriptDb") > tmpdb <- makeTranscriptDb(transcripts, splicings, genes=genes) > if (verbose) message("Use saveFeatures() to save the database to a file") > return(tmpdb) > > } > > > > > > > On Thu, Apr 5, 2012 at 9:34 AM, Tim Triche, Jr. <tim.triche@gmail.com> >wrote: > > > the Gist that I posted earlier has some minor (trivial) buglets in it, > but > > I should have a fixed version that processes Cufflinks GTF files in a > > moment here. > > > > Obviously it's easier to use subread and subjunc since they just fire out > > a .bed file with the supported splice junctions (I have not yet checked > to > > see if the BED forms a splicegraph or has enough information to do so, > but > > I suspect that if it doesn't, it can be fixed fairly easily to do so). > The > > featureCounts function in Rsubread is amazingly fast, and works for any > > arbitrary collection of features (e.g. edges), so having TranscriptDb > > objects to compare is nice. > > > > > > On Thu, Apr 5, 2012 at 8:21 AM, Nicolas Delhomme <delhomme@embl.de> > wrote: > > > >> Hi all, > >> > >> Sorry I haven't read the whole thread, still I have a few comments that > >> might be off the main topic then. > >> > >> On 5 Apr 2012, at 17:01, Cook, Malcolm wrote: > >> > >> > Supporting both Ensemble's GTF and GFF3 would be ideal. > >> > > >> > Ensembl GTF would open up many genomes, including those in: > >> > ftp://ftp.ensembl.org/pub/release-66/gtf/ > >> > ftp://ftp.ensemblgenomes.org/pub/metazoa/release-13/gtf/ > >> > ftp://ftp.ensemblgenomes.org/pub/fungi/release-13/gtf/ > >> > ftp://ftp.ensemblgenomes.org/pub/protists/release-13/gtf/ > >> > ftp://ftp.ensemblgenomes.org/pub/plants/release-13/gtf/ > >> > > >> > > >> > Supporting Ensembl GTF would make it easy to distribute/archive the > >> elements of a transcriptome analysis alongside a project/analysis in a > >> generally useful format (i.e. IGV and other tools can work with it more > or > >> less directly) > >> > >> In my package easyRNASeq, I already load Ensembl GTF files and convert > >> them into GRanges / RangedData object. It's pretty straightforward. I > guess > >> that adapting the code to create a transcriptDb should be do- able. > >> > >> > > >> > Related note, I have learned that the BioMarts produced for > >> EnsemblGenome's are NOT ARCHIVED, whereas it seems that historic GTF IS > >> available. Upshot: you'd best not depend upon being able to reproduce > >> today's TranscriptDbFromBiomart tomorrow. > >> > >> I don't know where you learned that and how you meant it exactly, but > >> using biomaRt, you can still access Ensembl version as old as of march > >> 2009: see http://mar2009.archive.ensembl.org/index.html. It's not > >> straightforward to figure it out, but on the main Ensembl webpage, you > can > >> get the full list by clicking the "view in archive site" link at the > bottom > >> left of the papge. It redirects to this URL: > >> http://www.ensembl.org/Help/ArchiveList. > >> Then, to use biomaRt on a given archive, you need to change the host > >> argument of useMart to the URL of the corresponding Ensembl archive as > in: > >> useMart("ENSEMBL_MART_ENSEMBL",host="mar2009.archive.ensembl.org"). I > >> recon that the biomaRT archive arguments does not work for that. I need > to > >> post something about this on the mailing list. > >> > >> > > >> > re: "typical gff3 files"... > >> > Flybase makes gff3 extracts and if my understanding is correct, have > >> been diligent in "getting it right" > >> > >> I believe so too. Again, in easyRNASeq, I do parse Flybase gff3 files > and > >> convert them to GRanges/RangedData object, but all the merit goes to the > >> readGff3 function from the genomeIntervals package. Reading a gff3 file > >> with this function is extremely quick as is accessing the gffAttributes > >> (performed at the C layer) . > >> > >> Cheers, > >> > >> Nico > >> > >> > > >> > Also, NCBI historically has tried to provide GFFx extracts, with > oodles > >> of caveats. > >> > But, but, Last month they announced progress on improving their GFF3 > >> offerings: > >> http://bio.perl.org/pipermail/bioperl-l/2012-March/036387.html > >> > Example: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ > >> > YMMV. > >> > > >> > I too once hoped to find makeTranscriptDbFromGFF3 capability so as to > >> allow easy tracking the head of Flybase's offerings, i.e. > >> > ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.44_FB2 012_02/gff/-alas I too have not followed up. > >> > > >> > ~Malcolm > >> > > >> > > >> >> -----Original Message----- > >> >> From: bioconductor-bounces@r-project.org [mailto:bioconductor- > >> >> bounces@r-project.org] On Behalf Of Marc Carlson > >> >> Sent: Wednesday, April 04, 2012 7:44 PM > >> >> To: bioconductor@r-project.org > >> >> Subject: Re: [BioC] Create transcriptDb using gff3 files? - library > >> >> GenomicFeatures and rtracklayer > >> >> > >> >> I was looking at this during the course, and this is on my TODO list > >> for > >> >> the next release cycle. I think it is long overdue and I don't think > >> >> that the community is going to get it done in spite of all the > >> >> enthusiasm. There has not been time to do it before now but I am > >> hoping > >> >> that will now change. It should be simple enough in principle, but > it > >> >> might not be exactly trivial as I have discovered (on closer > >> inspection) > >> >> that the gff specification is not as concrete as one would like it to > >> >> be. Also there have been several different versions. > >> >> > >> >> Some things that can help speed me along: > >> >> > >> >> 1) which version is most important? gff3? Or one of the other > >> >> versions? It is likely that with the older versions we may not be > able > >> >> to extract as much meaningful information. > >> >> > >> >> 2) where is the best place to find some typical gff3 files for > >> >> examples? This should not be difficult, but when I was looking > before > >> I > >> >> was finding that people were surprisingly stingy about sharing these. > >> >> > >> >> > >> >> Marc > >> >> > >> >> > >> >> > >> >> On 04/03/2012 03:57 PM, Michael Lawrence wrote: > >> >>> Marc was working on this during the course in Feb. Not sure what > >> >> happened > >> >>> to it. He said it was simple. Maybe just waiting for the release to > >> pass. > >> >>> > >> >>> Michael > >> >>> > >> >>> On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou< > >> >>> mailinglist.honeypot@gmail.com> wrote: > >> >>> > >> >>>> Hi, > >> >>>> > >> >>>> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi<schoi@cornell.edu> > >> >> wrote: > >> >>>>> Hi, > >> >>>>> > >> >>>>> I am wondering if I could create a TranscriptDb object (library > >> >>>> GenomicFeatures) using a gff3 file. I could read a gff3 file using > >> >>>> import.gff3, but I could not find a way to create TranscriptDb > >> object from > >> >>>> the object from import.gff3. > >> >>>>> Two arguments for makeTranscriptDb are required: transcripts, > >> splicings. > >> >>>> It does not seem to be easy to parse this information from the > object > >> >> form > >> >>>> import.gff3. I will appreciate any help. > >> >>>> > >> >>>> As far as I know, this functionality isn't there yet ... > >> >>>> > >> >>>> I once (early feb, 2012) suggested I might take a crack at making > >> this > >> >>>> happen but haven't actually found the time to do it ... I'm not > sure > >> >>>> anyone in bioc-core land (hi, Marc) has found the time to do it > >> >>>> either, so I think you're out of luck. > >> >>>> > >> >>>> Sorry for that. But the good news is that I bet a patch that does > >> this > >> >>>> would be welcome ;-) > >> >>>> > >> >>>> -steve > >> >>>> > >> >>>> -- > >> >>>> Steve Lianoglou > >> >>>> Graduate Student: Computational Systems Biology > >> >>>> | Memorial Sloan-Kettering Cancer Center > >> >>>> | Weill Medical College of Cornell University > >> >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact > >> >>>> > >> >>>> _______________________________________________ > >> >>>> Bioconductor mailing list > >> >>>> Bioconductor@r-project.org > >> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> >>>> Search the archives: > >> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> >>>> > >> >>> [[alternative HTML version deleted]] > >> >>> > >> >>> _______________________________________________ > >> >>> Bioconductor mailing list > >> >>> Bioconductor@r-project.org > >> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> >>> Search the archives: > >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> >> > >> >> _______________________________________________ > >> >> Bioconductor mailing list > >> >> Bioconductor@r-project.org > >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> >> Search the archives: > >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > >> > _______________________________________________ > >> > Bioconductor mailing list > >> > Bioconductor@r-project.org > >> > https://stat.ethz.ch/mailman/listinfo/bioconductor > >> > Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor@r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > > > > > > > -- > > *A model is a lie that helps you see the truth.* > > * > > * > > Howard Skipper< > http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf> > > > > > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper< > http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Good grief, I didn't even notice that I was using import.gff instead of import('foo.gtf'), which works fine. Although the stranding still needs cleanup for it to be turned into a TranscriptDb. On Thu, Apr 5, 2012 at 10:29 AM, Michael Lawrence <lawrence.michael@gene.com> wrote: > Hey guys, > > The rtracklayer package parses GFF3 and GTF and delivers them as either > GRanges or RangedData. I am surprised to see how many parsers people are > implementing for these file types. Stuff like file parsing needs to be > pushed down into the core infrastructure. If for some reason rtracklayer is > falling short, say in performance, we should work together to address that. > > The code below does not seem to be using rtracklayer correctly. If you > have a GFF3 file, then passing version = "3" or simply calling export.gff3 > will do all of that nasty attribute parsing that consumes so much of that > function. It does try to auto-detect the format, but maybe that was not > working at the time this was written, or maybe the file is missing the > version directive. > > This is probably the fault of the rtracklayer documentation. I have tried > to improve it though over the past release cycle, so I would encourage > taking another look at it. > > Thanks, > Michael > > On Thu, Apr 5, 2012 at 10:13 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > >> This is almost in the bag (cuff2db.R). I have one last thing to figure >> out: how to extract exon rank from Cufflinks output. >> >> Error in .normargSplicings(splicings, unique_tx_ids) : >> 'splicings$exon_rank' must be an integer vector with no NAs >> >> Ideas? Here's the code. >> >> cuff2db <- function(gtf.file, out.file = NULL, verbose = TRUE) { >> >> require(rtracklayer) >> require(GenomicRanges) >> require(GenomicFeatures) >> >> requiredAttribs <- c("gene_id", "transcript_id", "exon_number") >> >> if (verbose) message("Importing ", gtf.file) >> tmp <- import.gff(gtf.file, asRangedData=FALSE) >> >> #dispose of unspliced unstranded transcripts >> tmp <- tmp[ which(strand(tmp) %in% c('+','-')) ] >> >> #parse the per exon attributes >> if (verbose) message("Parsing gene/transcript/exon ids") >> tmpx <- strsplit(gsub(";", "", gsub("\"", "", values(tmp)$group)), >> split=" ") >> tmpx <- lapply(tmpx, function(x) { >> ans <- x[1:(length(x)/2)*2] >> names(ans) <- x[1:(length(x)/2)*2-1] >> ans >> }) >> >> attribs <- unique(unlist(lapply(tmpx, names))) >> if (!all(requiredAttribs %in% attribs)) >> stop("Not all required attributes are in this GFF file") >> names(attribs) <- attribs >> tmpx2 <- do.call(data.frame, c(lapply(attribs,function(x) >> sapply(tmpx,"[",x)), >> stringsAsFactors=FALSE)) >> values(tmp) <- tmpx2 >> >> if (verbose) message("Creating tables") >> >> #make transcripts table >> tmpT <- split(tmp, values(tmp)$transcript_id) >> transcripts <- data.frame( >> tx_id=1:length(tmpT), >> tx_name=names(tmpT), >> tx_chrom=as.character(seqnames(unlist(tmpT))[start(tmpT@partitioning >> )]), >> tx_strand=as.character(strand(unlist(tmpT))[start(tmpT@partitioning >> )]), >> tx_start=sapply(start(ranges(tmpT)), min), >> tx_end=sapply(end(ranges(tmpT)), max), >> stringsAsFactors=FALSE) >> >> #make splicings table >> splicings <- data.frame( >> tx_id=rep(1:length(tmpT), elementLengths(tmpT)), >> exon_rank=as.integer(values(unlist(tmpT))$exon_number), >> exon_chrom=as.character(seqnames(unlist(tmpT))), >> exon_strand=as.character(strand(unlist(tmpT))), >> exon_start=start(unlist(tmpT)), >> exon_end=end(unlist(tmpT)), >> stringsAsFactors=FALSE) >> >> #make genes table >> gene_txs <- tapply(values(tmp)$transcript_id, values(tmp)$gene_id, >> unique) >> genes <- data.frame( >> tx_name=unlist(gene_txs), >> gene_id=rep(names(gene_txs), sapply(gene_txs, length)), >> stringsAsFactors=FALSE) >> >> #create the db >> if (verbose) message("Creating TranscriptDb") >> tmpdb <- makeTranscriptDb(transcripts, splicings, genes=genes) >> if (verbose) message("Use saveFeatures() to save the database to a file") >> return(tmpdb) >> >> } >> >> >> >> >> >> >> On Thu, Apr 5, 2012 at 9:34 AM, Tim Triche, Jr. <tim.triche@gmail.com>> >wrote: >> >> > the Gist that I posted earlier has some minor (trivial) buglets in it, >> but >> > I should have a fixed version that processes Cufflinks GTF files in a >> > moment here. >> > >> > Obviously it's easier to use subread and subjunc since they just fire >> out >> > a .bed file with the supported splice junctions (I have not yet checked >> to >> > see if the BED forms a splicegraph or has enough information to do so, >> but >> > I suspect that if it doesn't, it can be fixed fairly easily to do so). >> The >> > featureCounts function in Rsubread is amazingly fast, and works for any >> > arbitrary collection of features (e.g. edges), so having TranscriptDb >> > objects to compare is nice. >> > >> > >> > On Thu, Apr 5, 2012 at 8:21 AM, Nicolas Delhomme <delhomme@embl.de> >> wrote: >> > >> >> Hi all, >> >> >> >> Sorry I haven't read the whole thread, still I have a few comments that >> >> might be off the main topic then. >> >> >> >> On 5 Apr 2012, at 17:01, Cook, Malcolm wrote: >> >> >> >> > Supporting both Ensemble's GTF and GFF3 would be ideal. >> >> > >> >> > Ensembl GTF would open up many genomes, including those in: >> >> > ftp://ftp.ensembl.org/pub/release-66/gtf/ >> >> > ftp://ftp.ensemblgenomes.org/pub/metazoa/release-13/gtf/ >> >> > ftp://ftp.ensemblgenomes.org/pub/fungi/release-13/gtf/ >> >> > ftp://ftp.ensemblgenomes.org/pub/protists/release-13/gtf/ >> >> > ftp://ftp.ensemblgenomes.org/pub/plants/release-13/gtf/ >> >> > >> >> > >> >> > Supporting Ensembl GTF would make it easy to distribute/archive the >> >> elements of a transcriptome analysis alongside a project/analysis in a >> >> generally useful format (i.e. IGV and other tools can work with it >> more or >> >> less directly) >> >> >> >> In my package easyRNASeq, I already load Ensembl GTF files and convert >> >> them into GRanges / RangedData object. It's pretty straightforward. I >> guess >> >> that adapting the code to create a transcriptDb should be do- able. >> >> >> >> > >> >> > Related note, I have learned that the BioMarts produced for >> >> EnsemblGenome's are NOT ARCHIVED, whereas it seems that historic GTF IS >> >> available. Upshot: you'd best not depend upon being able to reproduce >> >> today's TranscriptDbFromBiomart tomorrow. >> >> >> >> I don't know where you learned that and how you meant it exactly, but >> >> using biomaRt, you can still access Ensembl version as old as of march >> >> 2009: see http://mar2009.archive.ensembl.org/index.html. It's not >> >> straightforward to figure it out, but on the main Ensembl webpage, you >> can >> >> get the full list by clicking the "view in archive site" link at the >> bottom >> >> left of the papge. It redirects to this URL: >> >> http://www.ensembl.org/Help/ArchiveList. >> >> Then, to use biomaRt on a given archive, you need to change the host >> >> argument of useMart to the URL of the corresponding Ensembl archive as >> in: >> >> useMart("ENSEMBL_MART_ENSEMBL",host="mar2009.archive.ensembl.org"). I >> >> recon that the biomaRT archive arguments does not work for that. I >> need to >> >> post something about this on the mailing list. >> >> >> >> > >> >> > re: "typical gff3 files"... >> >> > Flybase makes gff3 extracts and if my understanding is correct, have >> >> been diligent in "getting it right" >> >> >> >> I believe so too. Again, in easyRNASeq, I do parse Flybase gff3 files >> and >> >> convert them to GRanges/RangedData object, but all the merit goes to >> the >> >> readGff3 function from the genomeIntervals package. Reading a gff3 file >> >> with this function is extremely quick as is accessing the gffAttributes >> >> (performed at the C layer) . >> >> >> >> Cheers, >> >> >> >> Nico >> >> >> >> > >> >> > Also, NCBI historically has tried to provide GFFx extracts, with >> oodles >> >> of caveats. >> >> > But, but, Last month they announced progress on improving their GFF3 >> >> offerings: >> >> http://bio.perl.org/pipermail/bioperl-l/2012-March/036387.html >> >> > Example: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ >> >> > YMMV. >> >> > >> >> > I too once hoped to find makeTranscriptDbFromGFF3 capability so as to >> >> allow easy tracking the head of Flybase's offerings, i.e. >> >> >> ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.44_FB 2012_02/gff/-alas I too have not followed up. >> >> >> > >> >> > ~Malcolm >> >> > >> >> > >> >> >> -----Original Message----- >> >> >> From: bioconductor-bounces@r-project.org [mailto:bioconductor- >> >> >> bounces@r-project.org] On Behalf Of Marc Carlson >> >> >> Sent: Wednesday, April 04, 2012 7:44 PM >> >> >> To: bioconductor@r-project.org >> >> >> Subject: Re: [BioC] Create transcriptDb using gff3 files? - library >> >> >> GenomicFeatures and rtracklayer >> >> >> >> >> >> I was looking at this during the course, and this is on my TODO list >> >> for >> >> >> the next release cycle. I think it is long overdue and I don't >> think >> >> >> that the community is going to get it done in spite of all the >> >> >> enthusiasm. There has not been time to do it before now but I am >> >> hoping >> >> >> that will now change. It should be simple enough in principle, but >> it >> >> >> might not be exactly trivial as I have discovered (on closer >> >> inspection) >> >> >> that the gff specification is not as concrete as one would like it >> to >> >> >> be. Also there have been several different versions. >> >> >> >> >> >> Some things that can help speed me along: >> >> >> >> >> >> 1) which version is most important? gff3? Or one of the other >> >> >> versions? It is likely that with the older versions we may not be >> able >> >> >> to extract as much meaningful information. >> >> >> >> >> >> 2) where is the best place to find some typical gff3 files for >> >> >> examples? This should not be difficult, but when I was looking >> before >> >> I >> >> >> was finding that people were surprisingly stingy about sharing >> these. >> >> >> >> >> >> >> >> >> Marc >> >> >> >> >> >> >> >> >> >> >> >> On 04/03/2012 03:57 PM, Michael Lawrence wrote: >> >> >>> Marc was working on this during the course in Feb. Not sure what >> >> >> happened >> >> >>> to it. He said it was simple. Maybe just waiting for the release to >> >> pass. >> >> >>> >> >> >>> Michael >> >> >>> >> >> >>> On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou< >> >> >>> mailinglist.honeypot@gmail.com> wrote: >> >> >>> >> >> >>>> Hi, >> >> >>>> >> >> >>>> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi<schoi@cornell.edu> >> >> >> wrote: >> >> >>>>> Hi, >> >> >>>>> >> >> >>>>> I am wondering if I could create a TranscriptDb object (library >> >> >>>> GenomicFeatures) using a gff3 file. I could read a gff3 file >> using >> >> >>>> import.gff3, but I could not find a way to create TranscriptDb >> >> object from >> >> >>>> the object from import.gff3. >> >> >>>>> Two arguments for makeTranscriptDb are required: transcripts, >> >> splicings. >> >> >>>> It does not seem to be easy to parse this information from the >> object >> >> >> form >> >> >>>> import.gff3. I will appreciate any help. >> >> >>>> >> >> >>>> As far as I know, this functionality isn't there yet ... >> >> >>>> >> >> >>>> I once (early feb, 2012) suggested I might take a crack at making >> >> this >> >> >>>> happen but haven't actually found the time to do it ... I'm not >> sure >> >> >>>> anyone in bioc-core land (hi, Marc) has found the time to do it >> >> >>>> either, so I think you're out of luck. >> >> >>>> >> >> >>>> Sorry for that. But the good news is that I bet a patch that does >> >> this >> >> >>>> would be welcome ;-) >> >> >>>> >> >> >>>> -steve >> >> >>>> >> >> >>>> -- >> >> >>>> Steve Lianoglou >> >> >>>> Graduate Student: Computational Systems Biology >> >> >>>> | Memorial Sloan-Kettering Cancer Center >> >> >>>> | Weill Medical College of Cornell University >> >> >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact >> >> >>>> >> >> >>>> _______________________________________________ >> >> >>>> Bioconductor mailing list >> >> >>>> Bioconductor@r-project.org >> >> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> >>>> Search the archives: >> >> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >>>> >> >> >>> [[alternative HTML version deleted]] >> >> >>> >> >> >>> _______________________________________________ >> >> >>> Bioconductor mailing list >> >> >>> Bioconductor@r-project.org >> >> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> >>> Search the archives: >> >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> >> >> >> _______________________________________________ >> >> >> Bioconductor mailing list >> >> >> Bioconductor@r-project.org >> >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> >> Search the archives: >> >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > >> >> > _______________________________________________ >> >> > Bioconductor mailing list >> >> > Bioconductor@r-project.org >> >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> > Search the archives: >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> >> _______________________________________________ >> >> Bioconductor mailing list >> >> Bioconductor@r-project.org >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> Search the archives: >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> > >> > >> > >> > -- >> > *A model is a lie that helps you see the truth.* >> > * >> > * >> > Howard Skipper< >> http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf> >> > >> > >> >> >> -- >> *A model is a lie that helps you see the truth.* >> * >> * >> Howard Skipper< >> http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf> >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Alright, let's try that again. Having heeded your advice, it seems to work now. cuff2db <- function(gtf.file, out.file = NULL, verbose = TRUE) { require(rtracklayer) require(GenomicRanges) require(GenomicFeatures) requiredAttribs <- c("gene_id", "transcript_id", "exon_number") if (verbose) message("Importing ", gtf.file) tmp <- import(gtf.file, asRangedData=FALSE) #dispose of unspliced unstranded transcripts tmp <- tmp[ which(strand(tmp) %in% c('+','-')) ] # fix the gene IDs values(tmp)$gene_id <- gsub('CUFF.', '', values(tmp)$gene_id) # fix the exon IDs values(tmp)$transcript_id <- gsub('CUFF.', '', values(tmp)$transcript_id) # split the object into transcript and exon pieces by.type = split(tmp, values(tmp)$type) browser() #make transcripts table tmpT <- split(by.type$transcript, values(by.type$transcript)$transcript_id) if(verbose) message('Attempting to create the transcripts data.frame') transcripts <- data.frame( tx_id=1:length(tmpT), tx_name=names(tmpT), tx_chrom=as.character(seqnames(unlist(tmpT))[start(tmpT@partitioni ng)]), tx_strand=as.character(strand(unlist(tmpT))[start(tmpT@partitionin g)]), tx_start=sapply(start(ranges(tmpT)), min), tx_end=sapply(end(ranges(tmpT)), max), stringsAsFactors=FALSE ) #make splicings table tmpS <- split(by.type$exon, values(by.type$exon)$transcript_id) if(verbose) message('Attempting to create the splicings data.frame') splicings <- data.frame( tx_id=rep(1:length(tmpS), elementLengths(tmpS)), exon_rank=as.integer(values(unlist(tmpS))$exon_number), exon_chrom=as.character(seqnames(unlist(tmpS))), exon_strand=as.character(strand(unlist(tmpS))), exon_start=start(unlist(tmpS)), exon_end=end(unlist(tmpS)), stringsAsFactors=FALSE ) #make genes table if(verbose) message('Attempting to create the genes data.frame') gene_txs <- tapply(values(tmp)$transcript_id, values(tmp)$gene_id, unique) genes <- data.frame( tx_name=unlist(gene_txs), gene_id=rep(names(gene_txs), sapply(gene_txs, length)), stringsAsFactors=FALSE) #create the db if (verbose) message("Creating TranscriptDb") tmpdb <- makeTranscriptDb(transcripts, splicings, genes=genes) if (verbose) message("Use saveFeatures() to save the database to a file") return(tmpdb) } So, there's a GTF-to-TxDb function for Cufflinks output. On Thu, Apr 5, 2012 at 10:33 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > Good grief, I didn't even notice that I was using import.gff instead of > import('foo.gtf'), which works fine. Although the stranding still needs > cleanup for it to be turned into a TranscriptDb. > > > > On Thu, Apr 5, 2012 at 10:29 AM, Michael Lawrence < > lawrence.michael@gene.com> wrote: > >> Hey guys, >> >> The rtracklayer package parses GFF3 and GTF and delivers them as either >> GRanges or RangedData. I am surprised to see how many parsers people are >> implementing for these file types. Stuff like file parsing needs to be >> pushed down into the core infrastructure. If for some reason rtracklayer is >> falling short, say in performance, we should work together to address that. >> >> The code below does not seem to be using rtracklayer correctly. If you >> have a GFF3 file, then passing version = "3" or simply calling export.gff3 >> will do all of that nasty attribute parsing that consumes so much of that >> function. It does try to auto-detect the format, but maybe that was not >> working at the time this was written, or maybe the file is missing the >> version directive. >> >> This is probably the fault of the rtracklayer documentation. I have tried >> to improve it though over the past release cycle, so I would encourage >> taking another look at it. >> >> Thanks, >> Michael >> >> On Thu, Apr 5, 2012 at 10:13 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: >> >>> This is almost in the bag (cuff2db.R). I have one last thing to figure >>> out: how to extract exon rank from Cufflinks output. >>> >>> Error in .normargSplicings(splicings, unique_tx_ids) : >>> 'splicings$exon_rank' must be an integer vector with no NAs >>> >>> Ideas? Here's the code. >>> >>> cuff2db <- function(gtf.file, out.file = NULL, verbose = TRUE) { >>> >>> require(rtracklayer) >>> require(GenomicRanges) >>> require(GenomicFeatures) >>> >>> requiredAttribs <- c("gene_id", "transcript_id", "exon_number") >>> >>> if (verbose) message("Importing ", gtf.file) >>> tmp <- import.gff(gtf.file, asRangedData=FALSE) >>> >>> #dispose of unspliced unstranded transcripts >>> tmp <- tmp[ which(strand(tmp) %in% c('+','-')) ] >>> >>> #parse the per exon attributes >>> if (verbose) message("Parsing gene/transcript/exon ids") >>> tmpx <- strsplit(gsub(";", "", gsub("\"", "", values(tmp)$group)), >>> split=" ") >>> tmpx <- lapply(tmpx, function(x) { >>> ans <- x[1:(length(x)/2)*2] >>> names(ans) <- x[1:(length(x)/2)*2-1] >>> ans >>> }) >>> >>> attribs <- unique(unlist(lapply(tmpx, names))) >>> if (!all(requiredAttribs %in% attribs)) >>> stop("Not all required attributes are in this GFF file") >>> names(attribs) <- attribs >>> tmpx2 <- do.call(data.frame, c(lapply(attribs,function(x) >>> sapply(tmpx,"[",x)), >>> stringsAsFactors=FALSE)) >>> values(tmp) <- tmpx2 >>> >>> if (verbose) message("Creating tables") >>> >>> #make transcripts table >>> tmpT <- split(tmp, values(tmp)$transcript_id) >>> transcripts <- data.frame( >>> tx_id=1:length(tmpT), >>> tx_name=names(tmpT), >>> tx_chrom=as.character(seqnames(unlist(tmpT))[start(tmpT@partitioning >>> )]), >>> tx_strand=as.character(strand(unlist(tmpT))[start(tmpT@partitioning >>> )]), >>> tx_start=sapply(start(ranges(tmpT)), min), >>> tx_end=sapply(end(ranges(tmpT)), max), >>> stringsAsFactors=FALSE) >>> >>> #make splicings table >>> splicings <- data.frame( >>> tx_id=rep(1:length(tmpT), elementLengths(tmpT)), >>> exon_rank=as.integer(values(unlist(tmpT))$exon_number), >>> exon_chrom=as.character(seqnames(unlist(tmpT))), >>> exon_strand=as.character(strand(unlist(tmpT))), >>> exon_start=start(unlist(tmpT)), >>> exon_end=end(unlist(tmpT)), >>> stringsAsFactors=FALSE) >>> >>> #make genes table >>> gene_txs <- tapply(values(tmp)$transcript_id, values(tmp)$gene_id, >>> unique) >>> genes <- data.frame( >>> tx_name=unlist(gene_txs), >>> gene_id=rep(names(gene_txs), sapply(gene_txs, length)), >>> stringsAsFactors=FALSE) >>> >>> #create the db >>> if (verbose) message("Creating TranscriptDb") >>> tmpdb <- makeTranscriptDb(transcripts, splicings, genes=genes) >>> if (verbose) message("Use saveFeatures() to save the database to a >>> file") >>> return(tmpdb) >>> >>> } >>> >>> >>> >>> >>> >>> >>> On Thu, Apr 5, 2012 at 9:34 AM, Tim Triche, Jr. <tim.triche@gmail.com>>> >wrote: >>> >>> > the Gist that I posted earlier has some minor (trivial) buglets in it, >>> but >>> > I should have a fixed version that processes Cufflinks GTF files in a >>> > moment here. >>> > >>> > Obviously it's easier to use subread and subjunc since they just fire >>> out >>> > a .bed file with the supported splice junctions (I have not yet >>> checked to >>> > see if the BED forms a splicegraph or has enough information to do so, >>> but >>> > I suspect that if it doesn't, it can be fixed fairly easily to do so). >>> The >>> > featureCounts function in Rsubread is amazingly fast, and works for any >>> > arbitrary collection of features (e.g. edges), so having TranscriptDb >>> > objects to compare is nice. >>> > >>> > >>> > On Thu, Apr 5, 2012 at 8:21 AM, Nicolas Delhomme <delhomme@embl.de> >>> wrote: >>> > >>> >> Hi all, >>> >> >>> >> Sorry I haven't read the whole thread, still I have a few comments >>> that >>> >> might be off the main topic then. >>> >> >>> >> On 5 Apr 2012, at 17:01, Cook, Malcolm wrote: >>> >> >>> >> > Supporting both Ensemble's GTF and GFF3 would be ideal. >>> >> > >>> >> > Ensembl GTF would open up many genomes, including those in: >>> >> > ftp://ftp.ensembl.org/pub/release-66/gtf/ >>> >> > ftp://ftp.ensemblgenomes.org/pub/metazoa/release-13/gtf/ >>> >> > ftp://ftp.ensemblgenomes.org/pub/fungi/release-13/gtf/ >>> >> > ftp://ftp.ensemblgenomes.org/pub/protists/release-13/gtf/ >>> >> > ftp://ftp.ensemblgenomes.org/pub/plants/release-13/gtf/ >>> >> > >>> >> > >>> >> > Supporting Ensembl GTF would make it easy to distribute/archive the >>> >> elements of a transcriptome analysis alongside a project/analysis in a >>> >> generally useful format (i.e. IGV and other tools can work with it >>> more or >>> >> less directly) >>> >> >>> >> In my package easyRNASeq, I already load Ensembl GTF files and convert >>> >> them into GRanges / RangedData object. It's pretty straightforward. I >>> guess >>> >> that adapting the code to create a transcriptDb should be do- able. >>> >> >>> >> > >>> >> > Related note, I have learned that the BioMarts produced for >>> >> EnsemblGenome's are NOT ARCHIVED, whereas it seems that historic GTF >>> IS >>> >> available. Upshot: you'd best not depend upon being able to reproduce >>> >> today's TranscriptDbFromBiomart tomorrow. >>> >> >>> >> I don't know where you learned that and how you meant it exactly, but >>> >> using biomaRt, you can still access Ensembl version as old as of march >>> >> 2009: see http://mar2009.archive.ensembl.org/index.html. It's not >>> >> straightforward to figure it out, but on the main Ensembl webpage, >>> you can >>> >> get the full list by clicking the "view in archive site" link at the >>> bottom >>> >> left of the papge. It redirects to this URL: >>> >> http://www.ensembl.org/Help/ArchiveList. >>> >> Then, to use biomaRt on a given archive, you need to change the host >>> >> argument of useMart to the URL of the corresponding Ensembl archive >>> as in: >>> >> useMart("ENSEMBL_MART_ENSEMBL",host="mar2009.archive.ensembl.org"). I >>> >> recon that the biomaRT archive arguments does not work for that. I >>> need to >>> >> post something about this on the mailing list. >>> >> >>> >> > >>> >> > re: "typical gff3 files"... >>> >> > Flybase makes gff3 extracts and if my understanding is correct, have >>> >> been diligent in "getting it right" >>> >> >>> >> I believe so too. Again, in easyRNASeq, I do parse Flybase gff3 files >>> and >>> >> convert them to GRanges/RangedData object, but all the merit goes to >>> the >>> >> readGff3 function from the genomeIntervals package. Reading a gff3 >>> file >>> >> with this function is extremely quick as is accessing the >>> gffAttributes >>> >> (performed at the C layer) . >>> >> >>> >> Cheers, >>> >> >>> >> Nico >>> >> >>> >> > >>> >> > Also, NCBI historically has tried to provide GFFx extracts, with >>> oodles >>> >> of caveats. >>> >> > But, but, Last month they announced progress on improving their GFF3 >>> >> offerings: >>> >> http://bio.perl.org/pipermail/bioperl-l/2012-March/036387.html >>> >> > Example: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ >>> >> > YMMV. >>> >> > >>> >> > I too once hoped to find makeTranscriptDbFromGFF3 capability so as >>> to >>> >> allow easy tracking the head of Flybase's offerings, i.e. >>> >> >>> ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.44_F B2012_02/gff/-alas I too have not followed up. >>> >>> >> > >>> >> > ~Malcolm >>> >> > >>> >> > >>> >> >> -----Original Message----- >>> >> >> From: bioconductor-bounces@r-project.org [mailto:bioconductor- >>> >> >> bounces@r-project.org] On Behalf Of Marc Carlson >>> >> >> Sent: Wednesday, April 04, 2012 7:44 PM >>> >> >> To: bioconductor@r-project.org >>> >> >> Subject: Re: [BioC] Create transcriptDb using gff3 files? - library >>> >> >> GenomicFeatures and rtracklayer >>> >> >> >>> >> >> I was looking at this during the course, and this is on my TODO >>> list >>> >> for >>> >> >> the next release cycle. I think it is long overdue and I don't >>> think >>> >> >> that the community is going to get it done in spite of all the >>> >> >> enthusiasm. There has not been time to do it before now but I am >>> >> hoping >>> >> >> that will now change. It should be simple enough in principle, >>> but it >>> >> >> might not be exactly trivial as I have discovered (on closer >>> >> inspection) >>> >> >> that the gff specification is not as concrete as one would like it >>> to >>> >> >> be. Also there have been several different versions. >>> >> >> >>> >> >> Some things that can help speed me along: >>> >> >> >>> >> >> 1) which version is most important? gff3? Or one of the other >>> >> >> versions? It is likely that with the older versions we may not be >>> able >>> >> >> to extract as much meaningful information. >>> >> >> >>> >> >> 2) where is the best place to find some typical gff3 files for >>> >> >> examples? This should not be difficult, but when I was looking >>> before >>> >> I >>> >> >> was finding that people were surprisingly stingy about sharing >>> these. >>> >> >> >>> >> >> >>> >> >> Marc >>> >> >> >>> >> >> >>> >> >> >>> >> >> On 04/03/2012 03:57 PM, Michael Lawrence wrote: >>> >> >>> Marc was working on this during the course in Feb. Not sure what >>> >> >> happened >>> >> >>> to it. He said it was simple. Maybe just waiting for the release >>> to >>> >> pass. >>> >> >>> >>> >> >>> Michael >>> >> >>> >>> >> >>> On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou< >>> >> >>> mailinglist.honeypot@gmail.com> wrote: >>> >> >>> >>> >> >>>> Hi, >>> >> >>>> >>> >> >>>> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi<schoi@cornell.edu>>> > >>> >> >> wrote: >>> >> >>>>> Hi, >>> >> >>>>> >>> >> >>>>> I am wondering if I could create a TranscriptDb object (library >>> >> >>>> GenomicFeatures) using a gff3 file. I could read a gff3 file >>> using >>> >> >>>> import.gff3, but I could not find a way to create TranscriptDb >>> >> object from >>> >> >>>> the object from import.gff3. >>> >> >>>>> Two arguments for makeTranscriptDb are required: transcripts, >>> >> splicings. >>> >> >>>> It does not seem to be easy to parse this information from the >>> object >>> >> >> form >>> >> >>>> import.gff3. I will appreciate any help. >>> >> >>>> >>> >> >>>> As far as I know, this functionality isn't there yet ... >>> >> >>>> >>> >> >>>> I once (early feb, 2012) suggested I might take a crack at making >>> >> this >>> >> >>>> happen but haven't actually found the time to do it ... I'm not >>> sure >>> >> >>>> anyone in bioc-core land (hi, Marc) has found the time to do it >>> >> >>>> either, so I think you're out of luck. >>> >> >>>> >>> >> >>>> Sorry for that. But the good news is that I bet a patch that does >>> >> this >>> >> >>>> would be welcome ;-) >>> >> >>>> >>> >> >>>> -steve >>> >> >>>> >>> >> >>>> -- >>> >> >>>> Steve Lianoglou >>> >> >>>> Graduate Student: Computational Systems Biology >>> >> >>>> | Memorial Sloan-Kettering Cancer Center >>> >> >>>> | Weill Medical College of Cornell University >>> >> >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>> >> >>>> >>> >> >>>> _______________________________________________ >>> >> >>>> Bioconductor mailing list >>> >> >>>> Bioconductor@r-project.org >>> >> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> >> >>>> Search the archives: >>> >> >>>> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >>>> >>> >> >>> [[alternative HTML version deleted]] >>> >> >>> >>> >> >>> _______________________________________________ >>> >> >>> Bioconductor mailing list >>> >> >>> Bioconductor@r-project.org >>> >> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> >> >>> Search the archives: >>> >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> >>> >> >> _______________________________________________ >>> >> >> Bioconductor mailing list >>> >> >> Bioconductor@r-project.org >>> >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> >> >> Search the archives: >>> >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> > >>> >> > _______________________________________________ >>> >> > Bioconductor mailing list >>> >> > Bioconductor@r-project.org >>> >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>> >> > Search the archives: >>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >>> >> _______________________________________________ >>> >> Bioconductor mailing list >>> >> Bioconductor@r-project.org >>> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> >> Search the archives: >>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >>> > >>> > >>> > >>> > -- >>> > *A model is a lie that helps you see the truth.* >>> > * >>> > * >>> > Howard Skipper< >>> http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf> >>> > >>> > >>> >>> >>> -- >>> *A model is a lie that helps you see the truth.* >>> * >>> * >>> Howard Skipper< >>> http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf> >>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
nb. The output DB has no CDS, so beware of that: TranscriptDb object: | Db type: TranscriptDb | Supporting package: GenomicFeatures | transcript_nrow: 23524 | exon_nrow: 122247 | cds_nrow: 0 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2012-04-05 11:52:33 -0700 (Thu, 05 Apr 2012) | GenomicFeatures version at creation time: 1.8.0 | RSQLite version at creation time: 0.11.1 | DBSCHEMAVERSION: 1.0 Otherwise, it seems to work OK, and I assume the same logic would result in a GFF3toTranscriptDb object. I'll find out with SpliceGrapher's output fairly soon. On Thu, Apr 5, 2012 at 11:45 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > Alright, let's try that again. Having heeded your advice, it seems to > work now. > > cuff2db <- function(gtf.file, out.file = NULL, verbose = TRUE) { > > require(rtracklayer) > require(GenomicRanges) > require(GenomicFeatures) > > requiredAttribs <- c("gene_id", "transcript_id", "exon_number") > > if (verbose) message("Importing ", gtf.file) > tmp <- import(gtf.file, asRangedData=FALSE) > > #dispose of unspliced unstranded transcripts > tmp <- tmp[ which(strand(tmp) %in% c('+','-')) ] > > # fix the gene IDs > values(tmp)$gene_id <- gsub('CUFF.', '', values(tmp)$gene_id) > > # fix the exon IDs > values(tmp)$transcript_id <- gsub('CUFF.', '', values(tmp)$transcript_id) > > # split the object into transcript and exon pieces > by.type = split(tmp, values(tmp)$type) > browser() > > #make transcripts table > tmpT <- split(by.type$transcript, > values(by.type$transcript)$transcript_id) > if(verbose) message('Attempting to create the transcripts data.frame') > transcripts <- data.frame( > tx_id=1:length(tmpT), > tx_name=names(tmpT), > tx_chrom=as.character(seqnames(unlist(tmpT))[start(tmpT@partitioning > )]), > tx_strand=as.character(strand(unlist(tmpT))[start(tmpT@partitioning > )]), > tx_start=sapply(start(ranges(tmpT)), min), > tx_end=sapply(end(ranges(tmpT)), max), > stringsAsFactors=FALSE > ) > > #make splicings table > tmpS <- split(by.type$exon, values(by.type$exon)$transcript_id) > if(verbose) message('Attempting to create the splicings data.frame') > splicings <- data.frame( > tx_id=rep(1:length(tmpS), elementLengths(tmpS)), > exon_rank=as.integer(values(unlist(tmpS))$exon_number), > exon_chrom=as.character(seqnames(unlist(tmpS))), > exon_strand=as.character(strand(unlist(tmpS))), > exon_start=start(unlist(tmpS)), > exon_end=end(unlist(tmpS)), > stringsAsFactors=FALSE > ) > > #make genes table > if(verbose) message('Attempting to create the genes data.frame') > gene_txs <- tapply(values(tmp)$transcript_id, values(tmp)$gene_id, > unique) > genes <- data.frame( > tx_name=unlist(gene_txs), > gene_id=rep(names(gene_txs), sapply(gene_txs, length)), > stringsAsFactors=FALSE) > > #create the db > if (verbose) message("Creating TranscriptDb") > tmpdb <- makeTranscriptDb(transcripts, splicings, genes=genes) > if (verbose) message("Use saveFeatures() to save the database to a file") > return(tmpdb) > > } > > So, there's a GTF-to-TxDb function for Cufflinks output. > > > > On Thu, Apr 5, 2012 at 10:33 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > >> Good grief, I didn't even notice that I was using import.gff instead of >> import('foo.gtf'), which works fine. Although the stranding still needs >> cleanup for it to be turned into a TranscriptDb. >> >> >> >> On Thu, Apr 5, 2012 at 10:29 AM, Michael Lawrence < >> lawrence.michael@gene.com> wrote: >> >>> Hey guys, >>> >>> The rtracklayer package parses GFF3 and GTF and delivers them as either >>> GRanges or RangedData. I am surprised to see how many parsers people are >>> implementing for these file types. Stuff like file parsing needs to be >>> pushed down into the core infrastructure. If for some reason rtracklayer is >>> falling short, say in performance, we should work together to address that. >>> >>> The code below does not seem to be using rtracklayer correctly. If you >>> have a GFF3 file, then passing version = "3" or simply calling export.gff3 >>> will do all of that nasty attribute parsing that consumes so much of that >>> function. It does try to auto-detect the format, but maybe that was not >>> working at the time this was written, or maybe the file is missing the >>> version directive. >>> >>> This is probably the fault of the rtracklayer documentation. I have >>> tried to improve it though over the past release cycle, so I would >>> encourage taking another look at it. >>> >>> Thanks, >>> Michael >>> >>> On Thu, Apr 5, 2012 at 10:13 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: >>> >>>> This is almost in the bag (cuff2db.R). I have one last thing to figure >>>> out: how to extract exon rank from Cufflinks output. >>>> >>>> Error in .normargSplicings(splicings, unique_tx_ids) : >>>> 'splicings$exon_rank' must be an integer vector with no NAs >>>> >>>> Ideas? Here's the code. >>>> >>>> cuff2db <- function(gtf.file, out.file = NULL, verbose = TRUE) { >>>> >>>> require(rtracklayer) >>>> require(GenomicRanges) >>>> require(GenomicFeatures) >>>> >>>> requiredAttribs <- c("gene_id", "transcript_id", "exon_number") >>>> >>>> if (verbose) message("Importing ", gtf.file) >>>> tmp <- import.gff(gtf.file, asRangedData=FALSE) >>>> >>>> #dispose of unspliced unstranded transcripts >>>> tmp <- tmp[ which(strand(tmp) %in% c('+','-')) ] >>>> >>>> #parse the per exon attributes >>>> if (verbose) message("Parsing gene/transcript/exon ids") >>>> tmpx <- strsplit(gsub(";", "", gsub("\"", "", values(tmp)$group)), >>>> split=" ") >>>> tmpx <- lapply(tmpx, function(x) { >>>> ans <- x[1:(length(x)/2)*2] >>>> names(ans) <- x[1:(length(x)/2)*2-1] >>>> ans >>>> }) >>>> >>>> attribs <- unique(unlist(lapply(tmpx, names))) >>>> if (!all(requiredAttribs %in% attribs)) >>>> stop("Not all required attributes are in this GFF file") >>>> names(attribs) <- attribs >>>> tmpx2 <- do.call(data.frame, c(lapply(attribs,function(x) >>>> sapply(tmpx,"[",x)), >>>> stringsAsFactors=FALSE)) >>>> values(tmp) <- tmpx2 >>>> >>>> if (verbose) message("Creating tables") >>>> >>>> #make transcripts table >>>> tmpT <- split(tmp, values(tmp)$transcript_id) >>>> transcripts <- data.frame( >>>> tx_id=1:length(tmpT), >>>> tx_name=names(tmpT), >>>> tx_chrom=as.character(seqnames(unlist(tmpT))[start(tmpT@partitioning >>>> )]), >>>> tx_strand=as.character(strand(unlist(tmpT))[start(tmpT@partitioning >>>> )]), >>>> tx_start=sapply(start(ranges(tmpT)), min), >>>> tx_end=sapply(end(ranges(tmpT)), max), >>>> stringsAsFactors=FALSE) >>>> >>>> #make splicings table >>>> splicings <- data.frame( >>>> tx_id=rep(1:length(tmpT), elementLengths(tmpT)), >>>> exon_rank=as.integer(values(unlist(tmpT))$exon_number), >>>> exon_chrom=as.character(seqnames(unlist(tmpT))), >>>> exon_strand=as.character(strand(unlist(tmpT))), >>>> exon_start=start(unlist(tmpT)), >>>> exon_end=end(unlist(tmpT)), >>>> stringsAsFactors=FALSE) >>>> >>>> #make genes table >>>> gene_txs <- tapply(values(tmp)$transcript_id, values(tmp)$gene_id, >>>> unique) >>>> genes <- data.frame( >>>> tx_name=unlist(gene_txs), >>>> gene_id=rep(names(gene_txs), sapply(gene_txs, length)), >>>> stringsAsFactors=FALSE) >>>> >>>> #create the db >>>> if (verbose) message("Creating TranscriptDb") >>>> tmpdb <- makeTranscriptDb(transcripts, splicings, genes=genes) >>>> if (verbose) message("Use saveFeatures() to save the database to a >>>> file") >>>> return(tmpdb) >>>> >>>> } >>>> >>>> >>>> >>>> >>>> >>>> >>>> On Thu, Apr 5, 2012 at 9:34 AM, Tim Triche, Jr. <tim.triche@gmail.com>>>> >wrote: >>>> >>>> > the Gist that I posted earlier has some minor (trivial) buglets in >>>> it, but >>>> > I should have a fixed version that processes Cufflinks GTF files in a >>>> > moment here. >>>> > >>>> > Obviously it's easier to use subread and subjunc since they just fire >>>> out >>>> > a .bed file with the supported splice junctions (I have not yet >>>> checked to >>>> > see if the BED forms a splicegraph or has enough information to do >>>> so, but >>>> > I suspect that if it doesn't, it can be fixed fairly easily to do >>>> so). The >>>> > featureCounts function in Rsubread is amazingly fast, and works for >>>> any >>>> > arbitrary collection of features (e.g. edges), so having TranscriptDb >>>> > objects to compare is nice. >>>> > >>>> > >>>> > On Thu, Apr 5, 2012 at 8:21 AM, Nicolas Delhomme <delhomme@embl.de> >>>> wrote: >>>> > >>>> >> Hi all, >>>> >> >>>> >> Sorry I haven't read the whole thread, still I have a few comments >>>> that >>>> >> might be off the main topic then. >>>> >> >>>> >> On 5 Apr 2012, at 17:01, Cook, Malcolm wrote: >>>> >> >>>> >> > Supporting both Ensemble's GTF and GFF3 would be ideal. >>>> >> > >>>> >> > Ensembl GTF would open up many genomes, including those in: >>>> >> > ftp://ftp.ensembl.org/pub/release-66/gtf/ >>>> >> > ftp://ftp.ensemblgenomes.org/pub/metazoa/release-13/gtf/ >>>> >> > ftp://ftp.ensemblgenomes.org/pub/fungi/release-13/gtf/ >>>> >> > ftp://ftp.ensemblgenomes.org/pub/protists/release-13/gtf/ >>>> >> > ftp://ftp.ensemblgenomes.org/pub/plants/release-13/gtf/ >>>> >> > >>>> >> > >>>> >> > Supporting Ensembl GTF would make it easy to distribute/archive the >>>> >> elements of a transcriptome analysis alongside a project/analysis in >>>> a >>>> >> generally useful format (i.e. IGV and other tools can work with it >>>> more or >>>> >> less directly) >>>> >> >>>> >> In my package easyRNASeq, I already load Ensembl GTF files and >>>> convert >>>> >> them into GRanges / RangedData object. It's pretty straightforward. >>>> I guess >>>> >> that adapting the code to create a transcriptDb should be do- able. >>>> >> >>>> >> > >>>> >> > Related note, I have learned that the BioMarts produced for >>>> >> EnsemblGenome's are NOT ARCHIVED, whereas it seems that historic GTF >>>> IS >>>> >> available. Upshot: you'd best not depend upon being able to >>>> reproduce >>>> >> today's TranscriptDbFromBiomart tomorrow. >>>> >> >>>> >> I don't know where you learned that and how you meant it exactly, but >>>> >> using biomaRt, you can still access Ensembl version as old as of >>>> march >>>> >> 2009: see http://mar2009.archive.ensembl.org/index.html. It's not >>>> >> straightforward to figure it out, but on the main Ensembl webpage, >>>> you can >>>> >> get the full list by clicking the "view in archive site" link at the >>>> bottom >>>> >> left of the papge. It redirects to this URL: >>>> >> http://www.ensembl.org/Help/ArchiveList. >>>> >> Then, to use biomaRt on a given archive, you need to change the host >>>> >> argument of useMart to the URL of the corresponding Ensembl archive >>>> as in: >>>> >> useMart("ENSEMBL_MART_ENSEMBL",host="mar2009.archive.ensembl.org"). >>>> I >>>> >> recon that the biomaRT archive arguments does not work for that. I >>>> need to >>>> >> post something about this on the mailing list. >>>> >> >>>> >> > >>>> >> > re: "typical gff3 files"... >>>> >> > Flybase makes gff3 extracts and if my understanding is correct, >>>> have >>>> >> been diligent in "getting it right" >>>> >> >>>> >> I believe so too. Again, in easyRNASeq, I do parse Flybase gff3 >>>> files and >>>> >> convert them to GRanges/RangedData object, but all the merit goes to >>>> the >>>> >> readGff3 function from the genomeIntervals package. Reading a gff3 >>>> file >>>> >> with this function is extremely quick as is accessing the >>>> gffAttributes >>>> >> (performed at the C layer) . >>>> >> >>>> >> Cheers, >>>> >> >>>> >> Nico >>>> >> >>>> >> > >>>> >> > Also, NCBI historically has tried to provide GFFx extracts, with >>>> oodles >>>> >> of caveats. >>>> >> > But, but, Last month they announced progress on improving their >>>> GFF3 >>>> >> offerings: >>>> >> http://bio.perl.org/pipermail/bioperl-l/2012-March/036387.html >>>> >> > Example: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ >>>> >> > YMMV. >>>> >> > >>>> >> > I too once hoped to find makeTranscriptDbFromGFF3 capability so as >>>> to >>>> >> allow easy tracking the head of Flybase's offerings, i.e. >>>> >> >>>> ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.44_ FB2012_02/gff/-alas I too have not followed up. >>>> >>>> >> > >>>> >> > ~Malcolm >>>> >> > >>>> >> > >>>> >> >> -----Original Message----- >>>> >> >> From: bioconductor-bounces@r-project.org [mailto:bioconductor- >>>> >> >> bounces@r-project.org] On Behalf Of Marc Carlson >>>> >> >> Sent: Wednesday, April 04, 2012 7:44 PM >>>> >> >> To: bioconductor@r-project.org >>>> >> >> Subject: Re: [BioC] Create transcriptDb using gff3 files? - >>>> library >>>> >> >> GenomicFeatures and rtracklayer >>>> >> >> >>>> >> >> I was looking at this during the course, and this is on my TODO >>>> list >>>> >> for >>>> >> >> the next release cycle. I think it is long overdue and I don't >>>> think >>>> >> >> that the community is going to get it done in spite of all the >>>> >> >> enthusiasm. There has not been time to do it before now but I am >>>> >> hoping >>>> >> >> that will now change. It should be simple enough in principle, >>>> but it >>>> >> >> might not be exactly trivial as I have discovered (on closer >>>> >> inspection) >>>> >> >> that the gff specification is not as concrete as one would like >>>> it to >>>> >> >> be. Also there have been several different versions. >>>> >> >> >>>> >> >> Some things that can help speed me along: >>>> >> >> >>>> >> >> 1) which version is most important? gff3? Or one of the other >>>> >> >> versions? It is likely that with the older versions we may not >>>> be able >>>> >> >> to extract as much meaningful information. >>>> >> >> >>>> >> >> 2) where is the best place to find some typical gff3 files for >>>> >> >> examples? This should not be difficult, but when I was looking >>>> before >>>> >> I >>>> >> >> was finding that people were surprisingly stingy about sharing >>>> these. >>>> >> >> >>>> >> >> >>>> >> >> Marc >>>> >> >> >>>> >> >> >>>> >> >> >>>> >> >> On 04/03/2012 03:57 PM, Michael Lawrence wrote: >>>> >> >>> Marc was working on this during the course in Feb. Not sure what >>>> >> >> happened >>>> >> >>> to it. He said it was simple. Maybe just waiting for the release >>>> to >>>> >> pass. >>>> >> >>> >>>> >> >>> Michael >>>> >> >>> >>>> >> >>> On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou< >>>> >> >>> mailinglist.honeypot@gmail.com> wrote: >>>> >> >>> >>>> >> >>>> Hi, >>>> >> >>>> >>>> >> >>>> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi< >>>> schoi@cornell.edu> >>>> >> >> wrote: >>>> >> >>>>> Hi, >>>> >> >>>>> >>>> >> >>>>> I am wondering if I could create a TranscriptDb object (library >>>> >> >>>> GenomicFeatures) using a gff3 file. I could read a gff3 file >>>> using >>>> >> >>>> import.gff3, but I could not find a way to create TranscriptDb >>>> >> object from >>>> >> >>>> the object from import.gff3. >>>> >> >>>>> Two arguments for makeTranscriptDb are required: transcripts, >>>> >> splicings. >>>> >> >>>> It does not seem to be easy to parse this information from the >>>> object >>>> >> >> form >>>> >> >>>> import.gff3. I will appreciate any help. >>>> >> >>>> >>>> >> >>>> As far as I know, this functionality isn't there yet ... >>>> >> >>>> >>>> >> >>>> I once (early feb, 2012) suggested I might take a crack at >>>> making >>>> >> this >>>> >> >>>> happen but haven't actually found the time to do it ... I'm not >>>> sure >>>> >> >>>> anyone in bioc-core land (hi, Marc) has found the time to do it >>>> >> >>>> either, so I think you're out of luck. >>>> >> >>>> >>>> >> >>>> Sorry for that. But the good news is that I bet a patch that >>>> does >>>> >> this >>>> >> >>>> would be welcome ;-) >>>> >> >>>> >>>> >> >>>> -steve >>>> >> >>>> >>>> >> >>>> -- >>>> >> >>>> Steve Lianoglou >>>> >> >>>> Graduate Student: Computational Systems Biology >>>> >> >>>> | Memorial Sloan-Kettering Cancer Center >>>> >> >>>> | Weill Medical College of Cornell University >>>> >> >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>>> >> >>>> >>>> >> >>>> _______________________________________________ >>>> >> >>>> Bioconductor mailing list >>>> >> >>>> Bioconductor@r-project.org >>>> >> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> >> >>>> Search the archives: >>>> >> >>>> >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >> >>>> >>>> >> >>> [[alternative HTML version deleted]] >>>> >> >>> >>>> >> >>> _______________________________________________ >>>> >> >>> Bioconductor mailing list >>>> >> >>> Bioconductor@r-project.org >>>> >> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> >> >>> Search the archives: >>>> >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >> >> >>>> >> >> _______________________________________________ >>>> >> >> Bioconductor mailing list >>>> >> >> Bioconductor@r-project.org >>>> >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> >> >> Search the archives: >>>> >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >> > >>>> >> > _______________________________________________ >>>> >> > Bioconductor mailing list >>>> >> > Bioconductor@r-project.org >>>> >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> >> > Search the archives: >>>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >> >>>> >> _______________________________________________ >>>> >> Bioconductor mailing list >>>> >> Bioconductor@r-project.org >>>> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> >> Search the archives: >>>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >> >>>> > >>>> > >>>> > >>>> > -- >>>> > *A model is a lie that helps you see the truth.* >>>> > * >>>> > * >>>> > Howard Skipper< >>>> http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf> >>>> > >>>> > >>>> >>>> >>>> -- >>>> *A model is a lie that helps you see the truth.* >>>> * >>>> * >>>> Howard Skipper< >>>> http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf> >>>> >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor@r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> >> >> >> -- >> *A model is a lie that helps you see the truth.* >> * >> * >> Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >> >> > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Also, any transcripts where Cufflinks couldn't figure out the strand, get thrown out. On Thu, Apr 5, 2012 at 11:48 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > nb. The output DB has no CDS, so beware of that: > > TranscriptDb object: > | Db type: TranscriptDb > | Supporting package: GenomicFeatures > | transcript_nrow: 23524 > | exon_nrow: 122247 > | cds_nrow: 0 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2012-04-05 11:52:33 -0700 (Thu, 05 Apr 2012) > | GenomicFeatures version at creation time: 1.8.0 > | RSQLite version at creation time: 0.11.1 > | DBSCHEMAVERSION: 1.0 > > Otherwise, it seems to work OK, and I assume the same logic would result > in a GFF3toTranscriptDb object. > I'll find out with SpliceGrapher's output fairly soon. > > > > On Thu, Apr 5, 2012 at 11:45 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: > >> Alright, let's try that again. Having heeded your advice, it seems to >> work now. >> >> cuff2db <- function(gtf.file, out.file = NULL, verbose = TRUE) { >> >> require(rtracklayer) >> require(GenomicRanges) >> require(GenomicFeatures) >> >> requiredAttribs <- c("gene_id", "transcript_id", "exon_number") >> >> if (verbose) message("Importing ", gtf.file) >> tmp <- import(gtf.file, asRangedData=FALSE) >> >> #dispose of unspliced unstranded transcripts >> tmp <- tmp[ which(strand(tmp) %in% c('+','-')) ] >> >> # fix the gene IDs >> values(tmp)$gene_id <- gsub('CUFF.', '', values(tmp)$gene_id) >> >> # fix the exon IDs >> values(tmp)$transcript_id <- gsub('CUFF.', '', >> values(tmp)$transcript_id) >> >> # split the object into transcript and exon pieces >> by.type = split(tmp, values(tmp)$type) >> browser() >> >> #make transcripts table >> tmpT <- split(by.type$transcript, >> values(by.type$transcript)$transcript_id) >> if(verbose) message('Attempting to create the transcripts data.frame') >> transcripts <- data.frame( >> tx_id=1:length(tmpT), >> tx_name=names(tmpT), >> tx_chrom=as.character(seqnames(unlist(tmpT))[start(tmpT@partitioning >> )]), >> tx_strand=as.character(strand(unlist(tmpT))[start(tmpT@partitioning >> )]), >> tx_start=sapply(start(ranges(tmpT)), min), >> tx_end=sapply(end(ranges(tmpT)), max), >> stringsAsFactors=FALSE >> ) >> >> #make splicings table >> tmpS <- split(by.type$exon, values(by.type$exon)$transcript_id) >> if(verbose) message('Attempting to create the splicings data.frame') >> splicings <- data.frame( >> tx_id=rep(1:length(tmpS), elementLengths(tmpS)), >> exon_rank=as.integer(values(unlist(tmpS))$exon_number), >> exon_chrom=as.character(seqnames(unlist(tmpS))), >> exon_strand=as.character(strand(unlist(tmpS))), >> exon_start=start(unlist(tmpS)), >> exon_end=end(unlist(tmpS)), >> stringsAsFactors=FALSE >> ) >> >> #make genes table >> if(verbose) message('Attempting to create the genes data.frame') >> gene_txs <- tapply(values(tmp)$transcript_id, values(tmp)$gene_id, >> unique) >> genes <- data.frame( >> tx_name=unlist(gene_txs), >> gene_id=rep(names(gene_txs), sapply(gene_txs, length)), >> stringsAsFactors=FALSE) >> >> #create the db >> if (verbose) message("Creating TranscriptDb") >> tmpdb <- makeTranscriptDb(transcripts, splicings, genes=genes) >> if (verbose) message("Use saveFeatures() to save the database to a >> file") >> return(tmpdb) >> >> } >> >> So, there's a GTF-to-TxDb function for Cufflinks output. >> >> >> >> On Thu, Apr 5, 2012 at 10:33 AM, Tim Triche, Jr. <tim.triche@gmail.com>wrote: >> >>> Good grief, I didn't even notice that I was using import.gff instead of >>> import('foo.gtf'), which works fine. Although the stranding still needs >>> cleanup for it to be turned into a TranscriptDb. >>> >>> >>> >>> On Thu, Apr 5, 2012 at 10:29 AM, Michael Lawrence < >>> lawrence.michael@gene.com> wrote: >>> >>>> Hey guys, >>>> >>>> The rtracklayer package parses GFF3 and GTF and delivers them as either >>>> GRanges or RangedData. I am surprised to see how many parsers people are >>>> implementing for these file types. Stuff like file parsing needs to be >>>> pushed down into the core infrastructure. If for some reason rtracklayer is >>>> falling short, say in performance, we should work together to address that. >>>> >>>> The code below does not seem to be using rtracklayer correctly. If you >>>> have a GFF3 file, then passing version = "3" or simply calling export.gff3 >>>> will do all of that nasty attribute parsing that consumes so much of that >>>> function. It does try to auto-detect the format, but maybe that was not >>>> working at the time this was written, or maybe the file is missing the >>>> version directive. >>>> >>>> This is probably the fault of the rtracklayer documentation. I have >>>> tried to improve it though over the past release cycle, so I would >>>> encourage taking another look at it. >>>> >>>> Thanks, >>>> Michael >>>> >>>> On Thu, Apr 5, 2012 at 10:13 AM, Tim Triche, Jr. <tim.triche@gmail.com>>>> > wrote: >>>> >>>>> This is almost in the bag (cuff2db.R). I have one last thing to >>>>> figure >>>>> out: how to extract exon rank from Cufflinks output. >>>>> >>>>> Error in .normargSplicings(splicings, unique_tx_ids) : >>>>> 'splicings$exon_rank' must be an integer vector with no NAs >>>>> >>>>> Ideas? Here's the code. >>>>> >>>>> cuff2db <- function(gtf.file, out.file = NULL, verbose = TRUE) { >>>>> >>>>> require(rtracklayer) >>>>> require(GenomicRanges) >>>>> require(GenomicFeatures) >>>>> >>>>> requiredAttribs <- c("gene_id", "transcript_id", "exon_number") >>>>> >>>>> if (verbose) message("Importing ", gtf.file) >>>>> tmp <- import.gff(gtf.file, asRangedData=FALSE) >>>>> >>>>> #dispose of unspliced unstranded transcripts >>>>> tmp <- tmp[ which(strand(tmp) %in% c('+','-')) ] >>>>> >>>>> #parse the per exon attributes >>>>> if (verbose) message("Parsing gene/transcript/exon ids") >>>>> tmpx <- strsplit(gsub(";", "", gsub("\"", "", values(tmp)$group)), >>>>> split=" ") >>>>> tmpx <- lapply(tmpx, function(x) { >>>>> ans <- x[1:(length(x)/2)*2] >>>>> names(ans) <- x[1:(length(x)/2)*2-1] >>>>> ans >>>>> }) >>>>> >>>>> attribs <- unique(unlist(lapply(tmpx, names))) >>>>> if (!all(requiredAttribs %in% attribs)) >>>>> stop("Not all required attributes are in this GFF file") >>>>> names(attribs) <- attribs >>>>> tmpx2 <- do.call(data.frame, c(lapply(attribs,function(x) >>>>> sapply(tmpx,"[",x)), >>>>> stringsAsFactors=FALSE)) >>>>> values(tmp) <- tmpx2 >>>>> >>>>> if (verbose) message("Creating tables") >>>>> >>>>> #make transcripts table >>>>> tmpT <- split(tmp, values(tmp)$transcript_id) >>>>> transcripts <- data.frame( >>>>> tx_id=1:length(tmpT), >>>>> tx_name=names(tmpT), >>>>> tx_chrom=as.character(seqnames(unlist(tmpT))[start(tmpT@partitioning >>>>> )]), >>>>> tx_strand=as.character(strand(unlist(tmpT))[start(tmpT@partitioning >>>>> )]), >>>>> tx_start=sapply(start(ranges(tmpT)), min), >>>>> tx_end=sapply(end(ranges(tmpT)), max), >>>>> stringsAsFactors=FALSE) >>>>> >>>>> #make splicings table >>>>> splicings <- data.frame( >>>>> tx_id=rep(1:length(tmpT), elementLengths(tmpT)), >>>>> exon_rank=as.integer(values(unlist(tmpT))$exon_number), >>>>> exon_chrom=as.character(seqnames(unlist(tmpT))), >>>>> exon_strand=as.character(strand(unlist(tmpT))), >>>>> exon_start=start(unlist(tmpT)), >>>>> exon_end=end(unlist(tmpT)), >>>>> stringsAsFactors=FALSE) >>>>> >>>>> #make genes table >>>>> gene_txs <- tapply(values(tmp)$transcript_id, values(tmp)$gene_id, >>>>> unique) >>>>> genes <- data.frame( >>>>> tx_name=unlist(gene_txs), >>>>> gene_id=rep(names(gene_txs), sapply(gene_txs, length)), >>>>> stringsAsFactors=FALSE) >>>>> >>>>> #create the db >>>>> if (verbose) message("Creating TranscriptDb") >>>>> tmpdb <- makeTranscriptDb(transcripts, splicings, genes=genes) >>>>> if (verbose) message("Use saveFeatures() to save the database to a >>>>> file") >>>>> return(tmpdb) >>>>> >>>>> } >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Thu, Apr 5, 2012 at 9:34 AM, Tim Triche, Jr. <tim.triche@gmail.com>>>>> >wrote: >>>>> >>>>> > the Gist that I posted earlier has some minor (trivial) buglets in >>>>> it, but >>>>> > I should have a fixed version that processes Cufflinks GTF files in a >>>>> > moment here. >>>>> > >>>>> > Obviously it's easier to use subread and subjunc since they just >>>>> fire out >>>>> > a .bed file with the supported splice junctions (I have not yet >>>>> checked to >>>>> > see if the BED forms a splicegraph or has enough information to do >>>>> so, but >>>>> > I suspect that if it doesn't, it can be fixed fairly easily to do >>>>> so). The >>>>> > featureCounts function in Rsubread is amazingly fast, and works for >>>>> any >>>>> > arbitrary collection of features (e.g. edges), so having TranscriptDb >>>>> > objects to compare is nice. >>>>> > >>>>> > >>>>> > On Thu, Apr 5, 2012 at 8:21 AM, Nicolas Delhomme <delhomme@embl.de> >>>>> wrote: >>>>> > >>>>> >> Hi all, >>>>> >> >>>>> >> Sorry I haven't read the whole thread, still I have a few comments >>>>> that >>>>> >> might be off the main topic then. >>>>> >> >>>>> >> On 5 Apr 2012, at 17:01, Cook, Malcolm wrote: >>>>> >> >>>>> >> > Supporting both Ensemble's GTF and GFF3 would be ideal. >>>>> >> > >>>>> >> > Ensembl GTF would open up many genomes, including those in: >>>>> >> > ftp://ftp.ensembl.org/pub/release-66/gtf/ >>>>> >> > ftp://ftp.ensemblgenomes.org/pub/metazoa/release-13/gtf/ >>>>> >> > ftp://ftp.ensemblgenomes.org/pub/fungi/release-13/gtf/ >>>>> >> > ftp://ftp.ensemblgenomes.org/pub/protists/release-13/gtf/ >>>>> >> > ftp://ftp.ensemblgenomes.org/pub/plants/release-13/gtf/ >>>>> >> > >>>>> >> > >>>>> >> > Supporting Ensembl GTF would make it easy to distribute/archive >>>>> the >>>>> >> elements of a transcriptome analysis alongside a project/analysis >>>>> in a >>>>> >> generally useful format (i.e. IGV and other tools can work with it >>>>> more or >>>>> >> less directly) >>>>> >> >>>>> >> In my package easyRNASeq, I already load Ensembl GTF files and >>>>> convert >>>>> >> them into GRanges / RangedData object. It's pretty straightforward. >>>>> I guess >>>>> >> that adapting the code to create a transcriptDb should be do- able. >>>>> >> >>>>> >> > >>>>> >> > Related note, I have learned that the BioMarts produced for >>>>> >> EnsemblGenome's are NOT ARCHIVED, whereas it seems that historic >>>>> GTF IS >>>>> >> available. Upshot: you'd best not depend upon being able to >>>>> reproduce >>>>> >> today's TranscriptDbFromBiomart tomorrow. >>>>> >> >>>>> >> I don't know where you learned that and how you meant it exactly, >>>>> but >>>>> >> using biomaRt, you can still access Ensembl version as old as of >>>>> march >>>>> >> 2009: see http://mar2009.archive.ensembl.org/index.html. It's not >>>>> >> straightforward to figure it out, but on the main Ensembl webpage, >>>>> you can >>>>> >> get the full list by clicking the "view in archive site" link at >>>>> the bottom >>>>> >> left of the papge. It redirects to this URL: >>>>> >> http://www.ensembl.org/Help/ArchiveList. >>>>> >> Then, to use biomaRt on a given archive, you need to change the host >>>>> >> argument of useMart to the URL of the corresponding Ensembl archive >>>>> as in: >>>>> >> useMart("ENSEMBL_MART_ENSEMBL",host="mar2009.archive.ensembl.org"). >>>>> I >>>>> >> recon that the biomaRT archive arguments does not work for that. I >>>>> need to >>>>> >> post something about this on the mailing list. >>>>> >> >>>>> >> > >>>>> >> > re: "typical gff3 files"... >>>>> >> > Flybase makes gff3 extracts and if my understanding is correct, >>>>> have >>>>> >> been diligent in "getting it right" >>>>> >> >>>>> >> I believe so too. Again, in easyRNASeq, I do parse Flybase gff3 >>>>> files and >>>>> >> convert them to GRanges/RangedData object, but all the merit goes >>>>> to the >>>>> >> readGff3 function from the genomeIntervals package. Reading a gff3 >>>>> file >>>>> >> with this function is extremely quick as is accessing the >>>>> gffAttributes >>>>> >> (performed at the C layer) . >>>>> >> >>>>> >> Cheers, >>>>> >> >>>>> >> Nico >>>>> >> >>>>> >> > >>>>> >> > Also, NCBI historically has tried to provide GFFx extracts, with >>>>> oodles >>>>> >> of caveats. >>>>> >> > But, but, Last month they announced progress on improving their >>>>> GFF3 >>>>> >> offerings: >>>>> >> http://bio.perl.org/pipermail/bioperl-l/2012-March/036387.html >>>>> >> > Example: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ >>>>> >> > YMMV. >>>>> >> > >>>>> >> > I too once hoped to find makeTranscriptDbFromGFF3 capability so >>>>> as to >>>>> >> allow easy tracking the head of Flybase's offerings, i.e. >>>>> >> >>>>> ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.44 _FB2012_02/gff/-alas I too have not followed up. >>>>> >>>>> >> > >>>>> >> > ~Malcolm >>>>> >> > >>>>> >> > >>>>> >> >> -----Original Message----- >>>>> >> >> From: bioconductor-bounces@r-project.org [mailto:bioconductor- >>>>> >> >> bounces@r-project.org] On Behalf Of Marc Carlson >>>>> >> >> Sent: Wednesday, April 04, 2012 7:44 PM >>>>> >> >> To: bioconductor@r-project.org >>>>> >> >> Subject: Re: [BioC] Create transcriptDb using gff3 files? - >>>>> library >>>>> >> >> GenomicFeatures and rtracklayer >>>>> >> >> >>>>> >> >> I was looking at this during the course, and this is on my TODO >>>>> list >>>>> >> for >>>>> >> >> the next release cycle. I think it is long overdue and I don't >>>>> think >>>>> >> >> that the community is going to get it done in spite of all the >>>>> >> >> enthusiasm. There has not been time to do it before now but I am >>>>> >> hoping >>>>> >> >> that will now change. It should be simple enough in principle, >>>>> but it >>>>> >> >> might not be exactly trivial as I have discovered (on closer >>>>> >> inspection) >>>>> >> >> that the gff specification is not as concrete as one would like >>>>> it to >>>>> >> >> be. Also there have been several different versions. >>>>> >> >> >>>>> >> >> Some things that can help speed me along: >>>>> >> >> >>>>> >> >> 1) which version is most important? gff3? Or one of the other >>>>> >> >> versions? It is likely that with the older versions we may not >>>>> be able >>>>> >> >> to extract as much meaningful information. >>>>> >> >> >>>>> >> >> 2) where is the best place to find some typical gff3 files for >>>>> >> >> examples? This should not be difficult, but when I was looking >>>>> before >>>>> >> I >>>>> >> >> was finding that people were surprisingly stingy about sharing >>>>> these. >>>>> >> >> >>>>> >> >> >>>>> >> >> Marc >>>>> >> >> >>>>> >> >> >>>>> >> >> >>>>> >> >> On 04/03/2012 03:57 PM, Michael Lawrence wrote: >>>>> >> >>> Marc was working on this during the course in Feb. Not sure what >>>>> >> >> happened >>>>> >> >>> to it. He said it was simple. Maybe just waiting for the >>>>> release to >>>>> >> pass. >>>>> >> >>> >>>>> >> >>> Michael >>>>> >> >>> >>>>> >> >>> On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou< >>>>> >> >>> mailinglist.honeypot@gmail.com> wrote: >>>>> >> >>> >>>>> >> >>>> Hi, >>>>> >> >>>> >>>>> >> >>>> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi< >>>>> schoi@cornell.edu> >>>>> >> >> wrote: >>>>> >> >>>>> Hi, >>>>> >> >>>>> >>>>> >> >>>>> I am wondering if I could create a TranscriptDb object >>>>> (library >>>>> >> >>>> GenomicFeatures) using a gff3 file. I could read a gff3 file >>>>> using >>>>> >> >>>> import.gff3, but I could not find a way to create TranscriptDb >>>>> >> object from >>>>> >> >>>> the object from import.gff3. >>>>> >> >>>>> Two arguments for makeTranscriptDb are required: transcripts, >>>>> >> splicings. >>>>> >> >>>> It does not seem to be easy to parse this information from the >>>>> object >>>>> >> >> form >>>>> >> >>>> import.gff3. I will appreciate any help. >>>>> >> >>>> >>>>> >> >>>> As far as I know, this functionality isn't there yet ... >>>>> >> >>>> >>>>> >> >>>> I once (early feb, 2012) suggested I might take a crack at >>>>> making >>>>> >> this >>>>> >> >>>> happen but haven't actually found the time to do it ... I'm >>>>> not sure >>>>> >> >>>> anyone in bioc-core land (hi, Marc) has found the time to do it >>>>> >> >>>> either, so I think you're out of luck. >>>>> >> >>>> >>>>> >> >>>> Sorry for that. But the good news is that I bet a patch that >>>>> does >>>>> >> this >>>>> >> >>>> would be welcome ;-) >>>>> >> >>>> >>>>> >> >>>> -steve >>>>> >> >>>> >>>>> >> >>>> -- >>>>> >> >>>> Steve Lianoglou >>>>> >> >>>> Graduate Student: Computational Systems Biology >>>>> >> >>>> | Memorial Sloan-Kettering Cancer Center >>>>> >> >>>> | Weill Medical College of Cornell University >>>>> >> >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>>>> >> >>>> >>>>> >> >>>> _______________________________________________ >>>>> >> >>>> Bioconductor mailing list >>>>> >> >>>> Bioconductor@r-project.org >>>>> >> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> >> >>>> Search the archives: >>>>> >> >>>> >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >> >>>> >>>>> >> >>> [[alternative HTML version deleted]] >>>>> >> >>> >>>>> >> >>> _______________________________________________ >>>>> >> >>> Bioconductor mailing list >>>>> >> >>> Bioconductor@r-project.org >>>>> >> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> >> >>> Search the archives: >>>>> >> >> >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >> >> >>>>> >> >> _______________________________________________ >>>>> >> >> Bioconductor mailing list >>>>> >> >> Bioconductor@r-project.org >>>>> >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> >> >> Search the archives: >>>>> >> >> >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >> > >>>>> >> > _______________________________________________ >>>>> >> > Bioconductor mailing list >>>>> >> > Bioconductor@r-project.org >>>>> >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> >> > Search the archives: >>>>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >> >>>>> >> _______________________________________________ >>>>> >> Bioconductor mailing list >>>>> >> Bioconductor@r-project.org >>>>> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> >> Search the archives: >>>>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >> >>>>> > >>>>> > >>>>> > >>>>> > -- >>>>> > *A model is a lie that helps you see the truth.* >>>>> > * >>>>> > * >>>>> > Howard Skipper< >>>>> http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf> >>>>> > >>>>> > >>>>> >>>>> >>>>> -- >>>>> *A model is a lie that helps you see the truth.* >>>>> * >>>>> * >>>>> Howard Skipper< >>>>> http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf> >>>>> >>>>> >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor@r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>> >>>> >>> >>> >>> -- >>> *A model is a lie that helps you see the truth.* >>> * >>> * >>> Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >>> >>> >> >> >> -- >> *A model is a lie that helps you see the truth.* >> * >> * >> Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >> >> > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
> Hi all, > > Sorry I haven't read the whole thread, still I have a few comments that might > be off the main topic then. > > On 5 Apr 2012, at 17:01, Cook, Malcolm wrote: > > > Supporting both Ensemble's GTF and GFF3 would be ideal. > > > > Ensembl GTF would open up many genomes, including those in: > > ftp://ftp.ensembl.org/pub/release-66/gtf/ > > ftp://ftp.ensemblgenomes.org/pub/metazoa/release-13/gtf/ > > ftp://ftp.ensemblgenomes.org/pub/fungi/release-13/gtf/ > > ftp://ftp.ensemblgenomes.org/pub/protists/release-13/gtf/ > > ftp://ftp.ensemblgenomes.org/pub/plants/release-13/gtf/ > > > > > > Supporting Ensembl GTF would make it easy to distribute/archive the > elements of a transcriptome analysis alongside a project/analysis in a > generally useful format (i.e. IGV and other tools can work with it more or less > directly) > > In my package easyRNASeq, I already load Ensembl GTF files and convert > them into GRanges / RangedData object. It's pretty straightforward. I guess > that adapting the code to create a transcriptDb should be do-able. > > > > > Related note, I have learned that the BioMarts produced for > EnsemblGenome's are NOT ARCHIVED, whereas it seems that historic GTF IS > available. Upshot: you'd best not depend upon being able to reproduce > today's TranscriptDbFromBiomart tomorrow. > > I don't know where you learned that and how you meant it exactly, but using > biomaRt, you can still access Ensembl version as old as of march 2009: see > http://mar2009.archive.ensembl.org/index.html. I learned it via an email exchange with Ensembl Genomes support Hello Malcolm, No, I am afraid that for Ensembl Genomes we don't make older versions available through an Archive! site, like we do for Ensembl. -- With kind regards, Bert Overduin, Ph.D. (Ensembl Helpdesk) I realize this refers to the Ensembl Genomes web site, not the BioMart per se, however I'm pretty sure it extends. Note, EnsemblGenomes sites do NOT have the same archive policy as the main Ensembl site. I would like to be able to more clearly refer to this distinction via an on-line policy document, or some such, and would welcome a reference if there is one to be had..... >It's not straightforward to > figure it out, but on the main Ensembl webpage, you can get the full list by > clicking the "view in archive site" link at the bottom left of the papge. It > redirects to this URL: http://www.ensembl.org/Help/ArchiveList. > Then, to use biomaRt on a given archive, you need to change the host > argument of useMart to the URL of the corresponding Ensembl archive as in: > useMart("ENSEMBL_MART_ENSEMBL",host="mar2009.archive.ensembl.org" > ). I recon that the biomaRT archive arguments does not work for that. I need > to post something about this on the mailing list. > > > > > re: "typical gff3 files"... > > Flybase makes gff3 extracts and if my understanding is correct, have been > diligent in "getting it right" > > I believe so too. Again, in easyRNASeq, I do parse Flybase gff3 files and > convert them to GRanges/RangedData object, but all the merit goes to the > readGff3 function from the genomeIntervals package. Reading a gff3 file > with this function is extremely quick as is accessing the gffAttributes > (performed at the C layer) . > > Cheers, > > Nico > > > > > Also, NCBI historically has tried to provide GFFx extracts, with oodles of > caveats. > > But, but, Last month they announced progress on improving their GFF3 > offerings: http://bio.perl.org/pipermail/bioperl-l/2012-March/036387.html > > Example: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ > > YMMV. > > > > I too once hoped to find makeTranscriptDbFromGFF3 capability so as to > allow easy tracking the head of Flybase's offerings, i.e. > ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.44_FB201 > 2_02/gff/ - alas I too have not followed up. > > > > ~Malcolm > > > > > >> -----Original Message----- > >> From: bioconductor-bounces at r-project.org [mailto:bioconductor- > >> bounces at r-project.org] On Behalf Of Marc Carlson > >> Sent: Wednesday, April 04, 2012 7:44 PM > >> To: bioconductor at r-project.org > >> Subject: Re: [BioC] Create transcriptDb using gff3 files? - library > >> GenomicFeatures and rtracklayer > >> > >> I was looking at this during the course, and this is on my TODO list for > >> the next release cycle. I think it is long overdue and I don't think > >> that the community is going to get it done in spite of all the > >> enthusiasm. There has not been time to do it before now but I am hoping > >> that will now change. It should be simple enough in principle, but it > >> might not be exactly trivial as I have discovered (on closer inspection) > >> that the gff specification is not as concrete as one would like it to > >> be. Also there have been several different versions. > >> > >> Some things that can help speed me along: > >> > >> 1) which version is most important? gff3? Or one of the other > >> versions? It is likely that with the older versions we may not be able > >> to extract as much meaningful information. > >> > >> 2) where is the best place to find some typical gff3 files for > >> examples? This should not be difficult, but when I was looking before I > >> was finding that people were surprisingly stingy about sharing these. > >> > >> > >> Marc > >> > >> > >> > >> On 04/03/2012 03:57 PM, Michael Lawrence wrote: > >>> Marc was working on this during the course in Feb. Not sure what > >> happened > >>> to it. He said it was simple. Maybe just waiting for the release to pass. > >>> > >>> Michael > >>> > >>> On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou< > >>> mailinglist.honeypot at gmail.com> wrote: > >>> > >>>> Hi, > >>>> > >>>> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi<schoi at="" cornell.edu=""> > >> wrote: > >>>>> Hi, > >>>>> > >>>>> I am wondering if I could create a TranscriptDb object (library > >>>> GenomicFeatures) using a gff3 file. I could read a gff3 file using > >>>> import.gff3, but I could not find a way to create TranscriptDb object > from > >>>> the object from import.gff3. > >>>>> Two arguments for makeTranscriptDb are required: transcripts, > splicings. > >>>> It does not seem to be easy to parse this information from the object > >> form > >>>> import.gff3. I will appreciate any help. > >>>> > >>>> As far as I know, this functionality isn't there yet ... > >>>> > >>>> I once (early feb, 2012) suggested I might take a crack at making this > >>>> happen but haven't actually found the time to do it ... I'm not sure > >>>> anyone in bioc-core land (hi, Marc) has found the time to do it > >>>> either, so I think you're out of luck. > >>>> > >>>> Sorry for that. But the good news is that I bet a patch that does this > >>>> would be welcome ;-) > >>>> > >>>> -steve > >>>> > >>>> -- > >>>> Steve Lianoglou > >>>> Graduate Student: Computational Systems Biology > >>>> | Memorial Sloan-Kettering Cancer Center > >>>> | Weill Medical College of Cornell University > >>>> Contact Info: http://cbio.mskcc.org/~lianos/contact > >>>> > >>>> _______________________________________________ > >>>> Bioconductor mailing list > >>>> Bioconductor at r-project.org > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> Search the archives: > >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>> > >>> [[alternative HTML version deleted]] > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Hi Malcom, Thanks for the clarification, Nico On 5 Apr 2012, at 17:41, Cook, Malcolm wrote: >> Hi all, >> >> Sorry I haven't read the whole thread, still I have a few comments that might >> be off the main topic then. >> >> On 5 Apr 2012, at 17:01, Cook, Malcolm wrote: >> >>> Supporting both Ensemble's GTF and GFF3 would be ideal. >>> >>> Ensembl GTF would open up many genomes, including those in: >>> ftp://ftp.ensembl.org/pub/release-66/gtf/ >>> ftp://ftp.ensemblgenomes.org/pub/metazoa/release-13/gtf/ >>> ftp://ftp.ensemblgenomes.org/pub/fungi/release-13/gtf/ >>> ftp://ftp.ensemblgenomes.org/pub/protists/release-13/gtf/ >>> ftp://ftp.ensemblgenomes.org/pub/plants/release-13/gtf/ >>> >>> >>> Supporting Ensembl GTF would make it easy to distribute/archive the >> elements of a transcriptome analysis alongside a project/analysis in a >> generally useful format (i.e. IGV and other tools can work with it more or less >> directly) >> >> In my package easyRNASeq, I already load Ensembl GTF files and convert >> them into GRanges / RangedData object. It's pretty straightforward. I guess >> that adapting the code to create a transcriptDb should be do-able. >> >>> >>> Related note, I have learned that the BioMarts produced for >> EnsemblGenome's are NOT ARCHIVED, whereas it seems that historic GTF IS >> available. Upshot: you'd best not depend upon being able to reproduce >> today's TranscriptDbFromBiomart tomorrow. >> >> I don't know where you learned that and how you meant it exactly, but using >> biomaRt, you can still access Ensembl version as old as of march 2009: see >> http://mar2009.archive.ensembl.org/index.html. > > I learned it via an email exchange with Ensembl Genomes support > > Hello Malcolm, > No, I am afraid that for Ensembl Genomes we don't make older versions available through an Archive! site, like we do for Ensembl. > -- > With kind regards, > Bert Overduin, Ph.D. > (Ensembl Helpdesk) > > I realize this refers to the Ensembl Genomes web site, not the BioMart per se, however I'm pretty sure it extends. > > Note, EnsemblGenomes sites do NOT have the same archive policy as the main Ensembl site. > > I would like to be able to more clearly refer to this distinction via an on-line policy document, or some such, and would welcome a reference if there is one to be had..... > >> It's not straightforward to >> figure it out, but on the main Ensembl webpage, you can get the full list by >> clicking the "view in archive site" link at the bottom left of the papge. It >> redirects to this URL: http://www.ensembl.org/Help/ArchiveList. >> Then, to use biomaRt on a given archive, you need to change the host >> argument of useMart to the URL of the corresponding Ensembl archive as in: >> useMart("ENSEMBL_MART_ENSEMBL",host="mar2009.archive.ensembl.org" >> ). I recon that the biomaRT archive arguments does not work for that. I need >> to post something about this on the mailing list. > > > >> >>> >>> re: "typical gff3 files"... >>> Flybase makes gff3 extracts and if my understanding is correct, have been >> diligent in "getting it right" >> >> I believe so too. Again, in easyRNASeq, I do parse Flybase gff3 files and >> convert them to GRanges/RangedData object, but all the merit goes to the >> readGff3 function from the genomeIntervals package. Reading a gff3 file >> with this function is extremely quick as is accessing the gffAttributes >> (performed at the C layer) . >> >> Cheers, >> >> Nico >> >>> >>> Also, NCBI historically has tried to provide GFFx extracts, with oodles of >> caveats. >>> But, but, Last month they announced progress on improving their GFF3 >> offerings: http://bio.perl.org/pipermail/bioperl-l/2012-March/036387.html >>> Example: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ >>> YMMV. >>> >>> I too once hoped to find makeTranscriptDbFromGFF3 capability so as to >> allow easy tracking the head of Flybase's offerings, i.e. >> ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.44_FB201 >> 2_02/gff/ - alas I too have not followed up. >>> >>> ~Malcolm >>> >>> >>>> -----Original Message----- >>>> From: bioconductor-bounces at r-project.org [mailto:bioconductor- >>>> bounces at r-project.org] On Behalf Of Marc Carlson >>>> Sent: Wednesday, April 04, 2012 7:44 PM >>>> To: bioconductor at r-project.org >>>> Subject: Re: [BioC] Create transcriptDb using gff3 files? - library >>>> GenomicFeatures and rtracklayer >>>> >>>> I was looking at this during the course, and this is on my TODO list for >>>> the next release cycle. I think it is long overdue and I don't think >>>> that the community is going to get it done in spite of all the >>>> enthusiasm. There has not been time to do it before now but I am hoping >>>> that will now change. It should be simple enough in principle, but it >>>> might not be exactly trivial as I have discovered (on closer inspection) >>>> that the gff specification is not as concrete as one would like it to >>>> be. Also there have been several different versions. >>>> >>>> Some things that can help speed me along: >>>> >>>> 1) which version is most important? gff3? Or one of the other >>>> versions? It is likely that with the older versions we may not be able >>>> to extract as much meaningful information. >>>> >>>> 2) where is the best place to find some typical gff3 files for >>>> examples? This should not be difficult, but when I was looking before I >>>> was finding that people were surprisingly stingy about sharing these. >>>> >>>> >>>> Marc >>>> >>>> >>>> >>>> On 04/03/2012 03:57 PM, Michael Lawrence wrote: >>>>> Marc was working on this during the course in Feb. Not sure what >>>> happened >>>>> to it. He said it was simple. Maybe just waiting for the release to pass. >>>>> >>>>> Michael >>>>> >>>>> On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou< >>>>> mailinglist.honeypot at gmail.com> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi<schoi at="" cornell.edu=""> >>>> wrote: >>>>>>> Hi, >>>>>>> >>>>>>> I am wondering if I could create a TranscriptDb object (library >>>>>> GenomicFeatures) using a gff3 file. I could read a gff3 file using >>>>>> import.gff3, but I could not find a way to create TranscriptDb object >> from >>>>>> the object from import.gff3. >>>>>>> Two arguments for makeTranscriptDb are required: transcripts, >> splicings. >>>>>> It does not seem to be easy to parse this information from the object >>>> form >>>>>> import.gff3. I will appreciate any help. >>>>>> >>>>>> As far as I know, this functionality isn't there yet ... >>>>>> >>>>>> I once (early feb, 2012) suggested I might take a crack at making this >>>>>> happen but haven't actually found the time to do it ... I'm not sure >>>>>> anyone in bioc-core land (hi, Marc) has found the time to do it >>>>>> either, so I think you're out of luck. >>>>>> >>>>>> Sorry for that. But the good news is that I bet a patch that does this >>>>>> would be welcome ;-) >>>>>> >>>>>> -steve >>>>>> >>>>>> -- >>>>>> Steve Lianoglou >>>>>> Graduate Student: Computational Systems Biology >>>>>> | Memorial Sloan-Kettering Cancer Center >>>>>> | Weill Medical College of Cornell University >>>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY
0
Entering edit mode
Speaking of typical gff files, I am wondering if gff files in bacterial genomes available at ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria for example, ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Streptococcus_mutans_UA159 _uid57947/NC_004350.gff could be typical gff in bacterial genomes. Although eukaryotes are more complex than prokaryotes, I just wish that those bacterial genomes gff could be a typical gff. Out of curiosity, are there Bioconductor R packages of extracting bacterial genome annotation (e.g., genes and positions) to create TranscriptDb object? I think that creating AnnotationDb for a not- well-known bacterial species seems to be overkill. Thank you, SangChul On Apr 4, 2012, at 8:44 PM, Marc Carlson wrote: > I was looking at this during the course, and this is on my TODO list for the next release cycle. I think it is long overdue and I don't think that the community is going to get it done in spite of all the enthusiasm. There has not been time to do it before now but I am hoping that will now change. It should be simple enough in principle, but it might not be exactly trivial as I have discovered (on closer inspection) that the gff specification is not as concrete as one would like it to be. Also there have been several different versions. > > Some things that can help speed me along: > > 1) which version is most important? gff3? Or one of the other versions? It is likely that with the older versions we may not be able to extract as much meaningful information. > > 2) where is the best place to find some typical gff3 files for examples? This should not be difficult, but when I was looking before I was finding that people were surprisingly stingy about sharing these. > > > Marc > > > > On 04/03/2012 03:57 PM, Michael Lawrence wrote: >> Marc was working on this during the course in Feb. Not sure what happened >> to it. He said it was simple. Maybe just waiting for the release to pass. >> >> Michael >> >> On Tue, Apr 3, 2012 at 3:40 PM, Steve Lianoglou< >> mailinglist.honeypot at gmail.com> wrote: >> >>> Hi, >>> >>> On Tue, Apr 3, 2012 at 4:41 PM, Sang Chul Choi<schoi at="" cornell.edu=""> wrote: >>>> Hi, >>>> >>>> I am wondering if I could create a TranscriptDb object (library >>> GenomicFeatures) using a gff3 file. I could read a gff3 file using >>> import.gff3, but I could not find a way to create TranscriptDb object from >>> the object from import.gff3. >>>> Two arguments for makeTranscriptDb are required: transcripts, splicings. >>> It does not seem to be easy to parse this information from the object form >>> import.gff3. I will appreciate any help. >>> >>> As far as I know, this functionality isn't there yet ... >>> >>> I once (early feb, 2012) suggested I might take a crack at making this >>> happen but haven't actually found the time to do it ... I'm not sure >>> anyone in bioc-core land (hi, Marc) has found the time to do it >>> either, so I think you're out of luck. >>> >>> Sorry for that. But the good news is that I bet a patch that does this >>> would be welcome ;-) >>> >>> -steve >>> >>> -- >>> Steve Lianoglou >>> Graduate Student: Computational Systems Biology >>> | Memorial Sloan-Kettering Cancer Center >>> | Weill Medical College of Cornell University >>> Contact Info: http://cbio.mskcc.org/~lianos/contact >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY

Login before adding your answer.

Traffic: 474 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6