Question

a possible bug of trimLRPatterns

0

Entering edit mode

wang peter ★ 2.0k

@wang-peter-4647

Last seen 10.6 years ago

reads <- readFastq(fastqfile); seqs <- sread(reads); max.mismatchs <- mismatch_rate*1:nchar(DNAString(PCR2rc)) trimmedCoords <- trimLRPatterns(Rpattern = PCR2rc, subject = seqs, max.Rmismatch= max.mismatchs, with.Rindels=T,ranges=T) > end(trimmedCoords)[1:20] [1] 22 18 20 33 14 22 22 20 22 22 22 15 20 37 19 13 20 22 0 34 > start(trimmedCoords)[1:20] [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 there is a "0" in the end of trimmedCoords so i cannot get the trimmed sequences trimmed3End <- narrow(reads, start=end(trimmedCoords), end=width(reads)) R version 2.14.1 (2011-12-22) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.12.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 [4] Rsamtools_1.6.3 lattice_0.20-0 Biostrings_2.22.0 [7] GenomicRanges_1.6.7 IRanges_1.12.6 loaded via a namespace (and not attached): [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 grid_2.14.1 [5] hwriter_1.3 RCurl_1.91-1 rtracklayer_1.14.4 tools_2.14.1 [9] XML_3.9-4 zlibbioc_1.0.0 -- shan gao Room 231(Dr.Fei lab) Boyce Thompson Institute Cornell University Tower Road, Ithaca, NY 14853-1801 Office phone: 1-607-254-1267(day) Official email:sg839 at cornell.edu Facebook:http://www.facebook.com/profile.php?id=100001986532253

• 1.1k views

ADD COMMENT • link updated 13.1 years ago by Martin Morgan 25k • written 13.1 years ago by wang peter ★ 2.0k

score 0 · Answer 1 · 2012-03-08

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 12 weeks ago

United States

On 03/08/2012 04:40 PM, wang peter wrote: > reads<- readFastq(fastqfile); > seqs<- sread(reads); > max.mismatchs<- mismatch_rate*1:nchar(DNAString(PCR2rc)) > trimmedCoords<- trimLRPatterns(Rpattern = PCR2rc, subject = seqs, > max.Rmismatch= max.mismatchs, with.Rindels=T,ranges=T) > >> end(trimmedCoords)[1:20] > [1] 22 18 20 33 14 22 22 20 22 22 22 15 20 37 19 13 20 22 0 34 >> start(trimmedCoords)[1:20] > [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 > > there is a "0" in the end of trimmedCoords so i cannot get the trimmed sequences The sequence has been trimmed entirely > as.character(Views(DNAString("AA"), IRanges(1, 1))) [1] "A" > as.character(Views(DNAString("AA"), IRanges(1, 0))) [1] "" Martin > > trimmed3End<- narrow(reads, start=end(trimmedCoords), end=width(reads)) > > R version 2.14.1 (2011-12-22) > Platform: x86_64-redhat-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] ShortRead_1.12.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 > [4] Rsamtools_1.6.3 lattice_0.20-0 Biostrings_2.22.0 > [7] GenomicRanges_1.6.7 IRanges_1.12.6 > > loaded via a namespace (and not attached): > [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 grid_2.14.1 > [5] hwriter_1.3 RCurl_1.91-1 rtracklayer_1.14.4 tools_2.14.1 > [9] XML_3.9-4 zlibbioc_1.0.0 > -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793

ADD COMMENT • link 13.1 years ago Martin Morgan 25k

0

Entering edit mode

A reply to a similar post also described removal of zero width reads; "If you want to remove zero-width records before writing, then subset as xx[width(xx) != 0]" So in your case something like; nonZero <- (width(trimmedCoords)!=0) narrow(reads[nonZero], start=start(trimmedCoords)[nonZero], end=end(trimmedCoords)[nonZero]) Marcus On Sat, Mar 10, 2012 at 12:10 AM, Martin Morgan <mtmorgan@fhcrc.org> wrote: > On 03/08/2012 04:40 PM, wang peter wrote: > >> reads<- readFastq(fastqfile); >> seqs<- sread(reads); >> max.mismatchs<- mismatch_rate*1:nchar(**DNAString(PCR2rc)) >> trimmedCoords<- trimLRPatterns(Rpattern = PCR2rc, subject = seqs, >> max.Rmismatch= max.mismatchs, with.Rindels=T,ranges=T) >> >> end(trimmedCoords)[1:20] >>> >> [1] 22 18 20 33 14 22 22 20 22 22 22 15 20 37 19 13 20 22 0 34 >> >>> start(trimmedCoords)[1:20] >>> >> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 >> >> there is a "0" in the end of trimmedCoords so i cannot get the trimmed >> sequences >> > > The sequence has been trimmed entirely > > > as.character(Views(DNAString("**AA"), IRanges(1, 1))) > [1] "A" > > as.character(Views(DNAString("**AA"), IRanges(1, 0))) > [1] "" > > Martin > > > >> trimmed3End<- narrow(reads, start=end(trimmedCoords), end=width(reads)) >> >> R version 2.14.1 (2011-12-22) >> Platform: x86_64-redhat-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] ShortRead_1.12.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 >> [4] Rsamtools_1.6.3 lattice_0.20-0 Biostrings_2.22.0 >> [7] GenomicRanges_1.6.7 IRanges_1.12.6 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 grid_2.14.1 >> [5] hwriter_1.3 RCurl_1.91-1 rtracklayer_1.14.4 tools_2.14.1 >> [9] XML_3.9-4 zlibbioc_1.0.0 >> >> > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 > > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago Marcus Davy ▴ 390

0

Entering edit mode

An srFilter could be useful for filtering reads based on width using the subset operation, the advantage is that you can quickly retrieve statistics from the filter to put in a log file. The output is in the style of 'occurrenceFilter' which uses 'ShortRead:::.occurrenceName' to construct input arguments in the stats summary, other filters don't necessarily include input arguments depending on when they were written. widthFilter <- function (w, .name = .widthName(w)) { ShortRead:::.check_type_and_length(w, "numeric", 1L) if(w < 1) .throw(SRError("UserArgumentMismatch", "'w' must be >= 1'")) srFilter(function(x) { width(x)>=w }, name = .name) } .widthName <- function (w) { sprintf("%s w=%s", "widthFilter", w) } ## Example sp <- SolexaPath(system.file('extdata', package='ShortRead')) rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") ## Create some variable width reads, note: escaping the backslash rfq.ex <- trimEnds(rfq, "\\") ## e.g. Filter reads at least 20 nucleotide cycles wFilter <- widthFilter(20L) rfq.ex[wFilter(rfq.ex)] ## Output for a log file stats(wFilter(rfq.ex)) stats(!wFilter(rfq.ex)) Name Input Passing Op 1 widthFilter w=20 256 251 <na> 2 !(widthFilter w=20) 256 5 ! Marcus On Sat, Mar 10, 2012 at 11:00 AM, Marcus Davy <mdavy86@gmail.com> wrote: > A reply to a similar post also described removal of zero width reads; > > "If you want to remove zero-width records before writing, then subset as > > xx[width(xx) != 0]" > > So in your case something like; > > nonZero <- (width(trimmedCoords)!=0) > narrow(reads[nonZero], start=start(trimmedCoords)[nonZero], > end=end(trimmedCoords)[nonZero]) > > Marcus > > > On Sat, Mar 10, 2012 at 12:10 AM, Martin Morgan <mtmorgan@fhcrc.org>wrote: > >> On 03/08/2012 04:40 PM, wang peter wrote: >> >>> reads<- readFastq(fastqfile); >>> seqs<- sread(reads); >>> max.mismatchs<- mismatch_rate*1:nchar(**DNAString(PCR2rc)) >>> trimmedCoords<- trimLRPatterns(Rpattern = PCR2rc, subject = seqs, >>> max.Rmismatch= max.mismatchs, with.Rindels=T,ranges=T) >>> >>> end(trimmedCoords)[1:20] >>>> >>> [1] 22 18 20 33 14 22 22 20 22 22 22 15 20 37 19 13 20 22 0 34 >>> >>>> start(trimmedCoords)[1:20] >>>> >>> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 >>> >>> there is a "0" in the end of trimmedCoords so i cannot get the trimmed >>> sequences >>> >> >> The sequence has been trimmed entirely >> >> > as.character(Views(DNAString("**AA"), IRanges(1, 1))) >> [1] "A" >> > as.character(Views(DNAString("**AA"), IRanges(1, 0))) >> [1] "" >> >> Martin >> >> >> >>> trimmed3End<- narrow(reads, start=end(trimmedCoords), end=width(reads)) >>> >>> R version 2.14.1 (2011-12-22) >>> Platform: x86_64-redhat-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] ShortRead_1.12.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 >>> [4] Rsamtools_1.6.3 lattice_0.20-0 Biostrings_2.22.0 >>> [7] GenomicRanges_1.6.7 IRanges_1.12.6 >>> >>> loaded via a namespace (and not attached): >>> [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 grid_2.14.1 >>> [5] hwriter_1.3 RCurl_1.91-1 rtracklayer_1.14.4 >>> tools_2.14.1 >>> [9] XML_3.9-4 zlibbioc_1.0.0 >>> >>> >> >> -- >> Computational Biology >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >> >> Location: M1-B861 >> Telephone: 206 667-2793 >> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago Marcus Davy ▴ 390

0

Entering edit mode

Slight bug in width checking, .throw is in the ShortRead package. if(w < 1) ShortRead:::.throw(SRError("UserArgumentMismatch", "'w' must be >= 1'")) ## w=0 throws an error widthFilter(0L) Error: UserArgumentMismatch 'w' must be >= 1' wFilter <- widthFilter(1L) ## Negate the operator to check for 0 length reads zeroLength <- (!wFilter(rfq.ex)) rfq.ex[zeroLength] class: ShortReadQ length: 4 reads; width: 0 cycles Marcus On Sun, Mar 18, 2012 at 11:21 AM, Marcus Davy <mdavy86@gmail.com> wrote: > An srFilter could be useful for filtering reads based on width using the > subset operation, the advantage is that you can quickly retrieve statistics > from the filter to put in a log file. > > The output is in the style of 'occurrenceFilter' which uses > 'ShortRead:::.occurrenceName' to construct input arguments in the stats > summary, other filters don't necessarily include input arguments depending > on when they were written. > > > > > widthFilter <- > > > function (w, .name = .widthName(w)) > > > { > > > ShortRead:::.check_type_and_length(w, "numeric", 1L) > > > if(w < 1) > > > .throw(SRError("UserArgumentMismatch", "'w' must be >= 1'")) > > > > > > srFilter(function(x) { > > > width(x)>=w > > > }, name = .name) > > > } > > > > > > .widthName <- function (w) > > > { > > > sprintf("%s w=%s", "widthFilter", w) > > > } > > > > ## Example > > > sp <- SolexaPath(system.file('extdata', package='ShortRead')) > > > rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") > > > > ## Create some variable width reads, note: escaping the backslash > > > rfq.ex <- trimEnds(rfq, "\\") > > ## e.g. Filter reads at least 20 nucleotide cycles > > > wFilter <- widthFilter(20L) > > > rfq.ex[wFilter(rfq.ex)] > > > > ## Output for a log file > > > stats(wFilter(rfq.ex)) > > > stats(!wFilter(rfq.ex)) > > > > Name Input Passing Op > 1 widthFilter w=20 256 251 <na> > 2 !(widthFilter w=20) 256 5 ! > > > Marcus > > > On Sat, Mar 10, 2012 at 11:00 AM, Marcus Davy <mdavy86@gmail.com> wrote: > >> A reply to a similar post also described removal of zero width reads; >> >> "If you want to remove zero-width records before writing, then subset as >> >> xx[width(xx) != 0]" >> >> So in your case something like; >> >> nonZero <- (width(trimmedCoords)!=0) >> narrow(reads[nonZero], start=start(trimmedCoords)[nonZero], >> end=end(trimmedCoords)[nonZero]) >> >> Marcus >> >> >> On Sat, Mar 10, 2012 at 12:10 AM, Martin Morgan <mtmorgan@fhcrc.org>wrote: >> >>> On 03/08/2012 04:40 PM, wang peter wrote: >>> >>>> reads<- readFastq(fastqfile); >>>> seqs<- sread(reads); >>>> max.mismatchs<- mismatch_rate*1:nchar(**DNAString(PCR2rc)) >>>> trimmedCoords<- trimLRPatterns(Rpattern = PCR2rc, subject = seqs, >>>> max.Rmismatch= max.mismatchs, with.Rindels=T,ranges=T) >>>> >>>> end(trimmedCoords)[1:20] >>>>> >>>> [1] 22 18 20 33 14 22 22 20 22 22 22 15 20 37 19 13 20 22 0 34 >>>> >>>>> start(trimmedCoords)[1:20] >>>>> >>>> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 >>>> >>>> there is a "0" in the end of trimmedCoords so i cannot get the trimmed >>>> sequences >>>> >>> >>> The sequence has been trimmed entirely >>> >>> > as.character(Views(DNAString("**AA"), IRanges(1, 1))) >>> [1] "A" >>> > as.character(Views(DNAString("**AA"), IRanges(1, 0))) >>> [1] "" >>> >>> Martin >>> >>> >>> >>>> trimmed3End<- narrow(reads, start=end(trimmedCoords), end=width(reads)) >>>> >>>> R version 2.14.1 (2011-12-22) >>>> Platform: x86_64-redhat-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>> [7] LC_PAPER=C LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices utils datasets methods base >>>> >>>> other attached packages: >>>> [1] ShortRead_1.12.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 >>>> [4] Rsamtools_1.6.3 lattice_0.20-0 Biostrings_2.22.0 >>>> [7] GenomicRanges_1.6.7 IRanges_1.12.6 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] Biobase_2.14.0 bitops_1.0-4.1 BSgenome_1.22.0 >>>> grid_2.14.1 >>>> [5] hwriter_1.3 RCurl_1.91-1 rtracklayer_1.14.4 >>>> tools_2.14.1 >>>> [9] XML_3.9-4 zlibbioc_1.0.0 >>>> >>>> >>> >>> -- >>> Computational Biology >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >>> >>> Location: M1-B861 >>> Telephone: 206 667-2793 >>> >>> >>> ______________________________**_________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: http://news.gmane.org/gmane.** >>> science.biology.informatics.**conductor<http: news.gmane.org="" gman="" e.science.biology.informatics.conductor=""> >>> >> >> > [[alternative HTML version deleted]]

ADD REPLY • link 13.1 years ago Marcus Davy ▴ 390