makeTranscriptDbFromGFF for Unstranded Transcripts

0

Entering edit mode

Dario Strbenac ★ 1.6k

@dario-strbenac-5916

Last seen 23 hours ago

Australia

Hello, I have assembled transcripts based on unstranded RNA-seq data. It's not the ideal experimental design, of course. Is it not possible to make a TranscriptDb object from this ? makeTranscriptDbFromGFF("transcripts.gtf", format = "gtf", exonRankAttributeName = "exon_number") extracting transcript information Estimating transcript ranges. Extracting gene IDs Processing splicing information for gtf file. Prepare the 'metadata' data frame ... metadata: OK Now generating chrominfo from available sequence names. No chromosome length information is available. Error in .normargTranscripts(transcripts) : values in 'transcripts$tx_strand' must be "+" or "-" R version 3.0.0 (2013-04-03) Platform: x86_64-w64-mingw32/x64 (64-bit) GenomicFeatures_1.12.1 -------------------------------------- Dario Strbenac PhD Student University of Sydney Camperdown NSW 2050 Australia

TranscriptDb TranscriptDb • 2.3k views

ADD COMMENT • link updated 12.0 years ago by Hervé Pagès 16k • written 12.0 years ago by Dario Strbenac ★ 1.6k

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 day ago

Seattle, WA, United States

Hi Dario, On 05/07/2013 10:00 PM, Dario Strbenac wrote: > Hello, > > I have assembled transcripts based on unstranded RNA-seq data. It's not the ideal experimental design, of course. Is it not possible to make a TranscriptDb object from this ? Right now, it's not possible. Note that in the TranscriptDb object returned by makeTranscriptDbFromGFF(), we really want to see stranded transcripts, exons, and CDSs. Because: (1) the purpose of a TranscriptDb is to describe a gene model, and the strand information (especially the exon strand) is an *essential* part of the gene model, and (2) there are too many tools/too much code around that expect this. The same could probably be said of GTF and GFF files, and I wonder if storing a set of unstranded mRNAs, exons, CDSs in those files is considered valid. Anyway, if we wanted makeTranscriptDbFromGFF() to support such GTF and GFF files, we would need to automatically replace all missing strands by a + or a -. This would need to be done in a blunt way e.g. all missing mRNA/exon/CDS strands would be replaced with a + with a warning. Trying to make makeTranscriptDbFromGFF() smarter by looking whether the exons within a given mRNA have their start/end going up or down when the exon rank is going up would be too problematic. > > makeTranscriptDbFromGFF("transcripts.gtf", format = "gtf", exonRankAttributeName = "exon_number") Ok, so you've managed to store the exon rank in your file. But that means that you must have *implicitly* chosen a strand for your exons right? Because, within a given mRNA, when the exon rank is going up, start/end should normally go up on the + strand and down on the - strand. Why not just set the strand for the mRNAs / exons / CDSs in your GTF file? There is too much going on here so, at this point, I'm not convinced makeTranscriptDbFromGFF() should decide the strand for you. Cheers, H. > extracting transcript information > Estimating transcript ranges. > Extracting gene IDs > Processing splicing information for gtf file. > Prepare the 'metadata' data frame ... metadata: OK > Now generating chrominfo from available sequence names. No chromosome length information is available. > Error in .normargTranscripts(transcripts) : > values in 'transcripts$tx_strand' must be "+" or "-" > > R version 3.0.0 (2013-04-03) > Platform: x86_64-w64-mingw32/x64 (64-bit) > GenomicFeatures_1.12.1 > > -------------------------------------- > Dario Strbenac > PhD Student > University of Sydney > Camperdown NSW 2050 > Australia > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 12.0 years ago Hervé Pagès 16k

0

Entering edit mode

> The same could probably be said of GTF and GFF files, and I wonder if > storing a set of unstranded mRNAs, exons, CDSs in those files is > considered valid. >From the specification, it is. strand - Valid entries include '+', '-', or '.' (for don't know/don't care). > Anyway, if we wanted makeTranscriptDbFromGFF() to support such GTF and > GFF files, we would need to automatically replace all missing strands > by a + or a -. It is better if it retains the error result, so there is no ambiguity. Adding a sentence about this to the help file would be useful, since users will expect that it reads in all valid GTF and GFF files. > > makeTranscriptDbFromGFF("transcripts.gtf", format = "gtf", exonRankAttributeName = "exon_number") > Ok, so you've managed to store the exon rank in your file. But that > means that you must have *implicitly* chosen a strand for your exons > right? Cufflinks can infer the strand of the transcript for multi-exon transcripts by looking for the canonical GT-AG splice site in reads mapping across an intron, but not for single exon genes. So, it outputs a strand for some genes and not for others.

ADD REPLY • link 12.0 years ago Dario Strbenac ★ 1.6k

0

Entering edit mode

Dario, On 05/08/2013 06:00 PM, Dario Strbenac wrote: >> The same could probably be said of GTF and GFF files, and I wonder if >> storing a set of unstranded mRNAs, exons, CDSs in those files is >> considered valid. > > From the specification, it is. > > strand - Valid entries include '+', '-', or '.' (for don't know/don't care). Sure. But it's more complicated than that. For example, according to the GFF3 spec here http://www.sequenceontology.org/gff3.shtml the phase (column 8) is required for CDS features, and the phase can only be interpreted correctly if the strand is known. So it's like the strand is implicitly required for CDS features, and consequently also for the exons in a protein-coding mRNA. > >> Anyway, if we wanted makeTranscriptDbFromGFF() to support such GTF and >> GFF files, we would need to automatically replace all missing strands >> by a + or a -. > > It is better if it retains the error result, so there is no ambiguity. Adding a sentence about this to the help file would be useful, since users will expect that it reads in all valid GTF and GFF files. The requirement that the strand must be "+" or "-" in a TranscriptDb is documented in the man page for makeTranscriptDb(), which makeTranscriptDbFromGFF() is based on. Could certainly be made more visible. > >> >> makeTranscriptDbFromGFF("transcripts.gtf", format = "gtf", exonRankAttributeName = "exon_number") > >> Ok, so you've managed to store the exon rank in your file. But that >> means that you must have *implicitly* chosen a strand for your exons >> right? > > Cufflinks can infer the strand of the transcript for multi-exon transcripts by looking for the canonical GT-AG splice site in reads mapping across an intron, but not for single exon genes. So, it outputs a strand for some genes and not for others. I see. So if I understand correctly, it produces a GTF file where multi-exon transcripts are stranded, but not single-exon transcripts. So we could in theory support that in makeTranscriptDbFromGFF(), and return a TranscriptDb object where some single-exon transcripts and their exon are not stranded. However that would be a *big* change. Not only in makeTranscriptDbFromGFF() but in the schema of the TranscriptDb. It would almost certainly break some of the existing code that assumes that the features in a TranscriptDb are stranded. And changing that code to gracefully handle that would introduce extra complexity. It's doable but is it worth it? Not a decision to be taken lightly. What do other people think? H. -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 12.0 years ago Hervé Pagès 16k

0

Entering edit mode

It is not a frequent use case in RNA-seq research, so it would be fine to leave it.

ADD REPLY • link 12.0 years ago Dario Strbenac ★ 1.6k

Login before adding your answer.