rtracklayer proposal for ISSUE: import.gff3 asRangedData=FALSE fails when strand is '.'
Malcolm Cook ★ 1.6k
@malcolm-cook-6293
United States
Hi, rtracklayerers,

import.gff3 with asRangedData=TRUE passes a period through to the strand of imported RangedData; however, calling it with asRangedData=FALSE errors:

> gff.str<-"2L\tFlyBase\tgene\t7529\t9484\t0\t.\t0\tID=FBgn0031208;Name=CG11023"
> import.gff3(textConnection(gff.str),asRangedData=TRUE)
RangedData with 1 row and 7 value columns across 1 space
     space       ranges |     type   source    phase   strand          ID        Name     score
  <factor>    <IRanges> | <factor> <factor> <factor> <factor> <character> <character> <numeric>
1       2L [7529, 9484] |     gene  FlyBase        0       NA FBgn0031208     CG11023         0
> import.gff3(textConnection(gff.str),asRangedData=FALSE)
Error in strand(runValue(strand)) : strand values must be in '+' '-' '*'

The GFF3 spec allows '.' (and '?') to appear as value of strand:

  Column 7: "strand"
  The strand of the feature. + for positive strand (relative to the landmark), - for minus strand, and . for features that are not stranded. In addition, ? can be used for features whose strandedness is relevant, but unknown.

Arguably, import.gff{,2,3} should provide some control over interpretation of '.' and '?' appearing in the strand column, allowing it to comport with strand and GRanges.

I propose the following as an intended backwards-compatible fix.

New argument to import.gff{,2,3}:

strandMap: control for mapping out-of-band values (FALSE, TRUE, a string, a list), understood as follows:
  FALSE: the default - do not map out-of-band values to '*'
  TRUE: map all out-of-band values to '*'
  any length-1 character vector: map out-of-band values to it (presumably it will be one of '*', '-', '+')
  a list: look up how to map out-of-band values in the list by name.

If it is agreed that this is the best resolution, and the rtracklayer gods wish it, I will take this as my first opportunity to contribute and will follow up accordingly....

Else?

Cheers,

Malcolm
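The strandMap semantics proposed above could be sketched roughly as follows in base R. This is only an illustration, not rtracklayer code: the helper name mapStrand is hypothetical, the character-vector case is assumed to mean a single string, and list entries are assumed to be keyed by the out-of-band value (for example "." or "?").

```r
# Hypothetical helper (not part of rtracklayer): normalise a GFF strand
# column according to the proposed strandMap semantics.
mapStrand <- function(strand, strandMap = FALSE) {
  strand <- as.character(strand)
  valid <- c("+", "-", "*")
  # out-of-band values are anything other than '+', '-', '*' (e.g. "." or "?")
  oob <- !is.na(strand) & !(strand %in% valid)

  if (isTRUE(strandMap)) {
    # TRUE: map every out-of-band value to '*'
    strand[oob] <- "*"
  } else if (is.character(strandMap) && length(strandMap) == 1L) {
    # a single string: map every out-of-band value to it
    stopifnot(strandMap %in% valid)
    strand[oob] <- strandMap
  } else if (is.list(strandMap)) {
    # a list: look up each out-of-band value by name;
    # values without an entry are left untouched (an assumption)
    hit <- oob & strand %in% names(strandMap)
    strand[hit] <- as.character(unlist(strandMap[strand[hit]], use.names = FALSE))
  } else if (!identical(strandMap, FALSE)) {
    stop("strandMap must be FALSE, TRUE, a single string, or a named list")
  }
  # FALSE (the default): values are returned unchanged
  strand
}

s <- c("+", ".", "?", "-")
mapStrand(s)                     # "+" "." "?" "-"
mapStrand(s, TRUE)               # "+" "*" "*" "-"
mapStrand(s, list("." = "*"))    # "+" "*" "?" "-"
```

With the default FALSE the input passes through unchanged, which is what would keep such an argument backwards compatible.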
@herve-pages-1542
Seattle, WA, United States
Hi Malcolm,

On 04/18/2012 09:04 AM, Cook, Malcolm wrote:
> Hi, rtracklayerers,
>
> import.gff3 with asRangedData=TRUE passes a period through to the strand of imported RangedData, however, calling it with asRangedData=FALSE errors:
>
>> gff.str<-"2L\tFlyBase\tgene\t7529\t9484\t0\t.\t0\tID=FBgn0031208;Name=CG11023"
>> import.gff3(textConnection(gff.str),asRangedData=TRUE)
> RangedData with 1 row and 7 value columns across 1 space
>      space       ranges |     type   source    phase   strand          ID        Name     score
>   <factor>    <IRanges> | <factor> <factor> <factor> <factor> <character> <character> <numeric>
> 1       2L [7529, 9484] |     gene  FlyBase        0       NA FBgn0031208     CG11023         0

IMO * should be used instead of NA to be more consistent with how the strand is handled in the rest of the infrastructure.

>> import.gff3(textConnection(gff.str),asRangedData=FALSE)
> Error in strand(runValue(strand)) : strand values must be in '+' '-' '*'

It looks like this problem is fixed in rtracklayer 1.16.1:

> import.gff3(textConnection(gff.str),asRangedData=FALSE)
GRanges with 1 range and 6 elementMetadata cols:
      seqnames       ranges strand |   source     type     score     phase
         <Rle>    <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>
  [1]       2L [7529, 9484]      * |  FlyBase     gene         1         0
               ID        Name
      <character> <character>
  [1] FBgn0031208     CG11023
  ---
  seqlengths:
   2L
   NA
Warning message:
In newGRanges("GRanges", seqnames = seqnames, ranges = ranges, strand = strand, :
  missing values in strand converted to "*"

> The GFF3 spec allows '.' (and '?') to appear as value of strand:
>
> Column 7: "strand"
> The strand of the feature. + for positive strand (relative to the
> landmark), - for minus strand, and . for features that are not
> stranded. In addition, ? can be used for features whose strandedness
> is relevant, but unknown.
>
> Arguably, import.gff{,2,3} should provide some control over interpretation of '.' and '?' appearing in the strand column, allowing it to comport with strand and GRanges

In the early days of the strand() constructor, we've also tried to make the distinction between *'s and NA's in the strand column, with more or less the same subtle differences than GFF3 makes between . and ? But then we abandoned that.

See:

https://stat.ethz.ch/pipermail/bioconductor/2012-January/043067.html

It's not written in stone though so if people have a use case where they need to be able to distinguish between (a) "range/feature is on both strands" and (b) "strand is unknown or irrelevant", then we could revisit that decision.

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
Thanks Herve!

A+

~Malcolm

> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Wednesday, April 18, 2012 12:19 PM
> To: Cook, Malcolm
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] rtracklayer proposal for ISSUE: import.gff3 asRangedData=FALSE fails when strand is '.'
On Wed, Apr 18, 2012 at 10:19 AM, Hervé Pagès <hpages@fhcrc.org> wrote:

> Hi Malcolm,
>
> On 04/18/2012 09:04 AM, Cook, Malcolm wrote:
>
>> Hi, rtracklayerers,
>>
>> import.gff3 with asRangedData=TRUE passes a period through to the strand
>> of imported RangedData, however, calling it with asRangedData=FALSE errors:
>>
>>> gff.str<-"2L\tFlyBase\tgene\t7529\t9484\t0\t.\t0\tID=FBgn0031208;Name=CG11023"
>>> import.gff3(textConnection(gff.str),asRangedData=TRUE)
>> RangedData with 1 row and 7 value columns across 1 space
>>      space       ranges |     type   source    phase   strand          ID        Name     score
>>   <factor>    <IRanges> | <factor> <factor> <factor> <factor> <character> <character> <numeric>
>> 1       2L [7529, 9484] |     gene  FlyBase        0       NA FBgn0031208     CG11023         0
>
> IMO * should be used instead of NA to be more consistent with how the
> strand is handled in the rest of the infrastructure.

Thanks for pointing this out. That behavior predates GenomicRanges. rtracklayer 1.17.5 will return "*" for RangedData, and correspondingly no longer emits that warning for GRanges (asRangedData=FALSE).

>>> import.gff3(textConnection(gff.str),asRangedData=FALSE)
>> Error in strand(runValue(strand)) : strand values must be in '+' '-' '*'
>
> It looks like this problem is fixed in rtracklayer 1.16.1:

Yes, there were a lot of fixes committed just prior to 1.16.x.