Non ambiguous DNA runs

0

Entering edit mode

Marcus Davy ▴ 390

@marcus-davy-5153

Last seen 7.1 years ago

Hi, I am wanting to find the longest run of non ambiguous DNA sequence (A, C, G, T only) from an example DNAString, e.g. sequences from sanger reads. Is there a simple Biostrings function to do this that I am missing? An approach I thought of was to use mask() to identify the ambiguous base positions, so they (and their converse) can be visualized as a XStringViews object, however in extracting the ambiguous base positions consensusMatrix() wouldn't work with the baseOnly option if the input was a DNAString. ## Ambiguous bases names(IUPAC_CODE_MAP)[-(1:4)] subject <- "YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA" ## Find ambiguous base positions ## pattern matching outside biostrings gregexpr("[^A|C|G|T]", subject, perl=TRUE)[[1]] ## A DNAString class fails with baseOnly option try(which(consensusMatrix(DNAString(subject), baseOnly=TRUE)["other",]==1)) ## Find ambiguous bases, forced to use DNASetingSet(subject) ambGaps <- which(consensusMatrix(DNAStringSet(subject), baseOnly=TRUE)["other",]==1) masked <- mask(subject, start=ambGaps, end=ambGaps) masked Views on a 40-letter BString subject subject: YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA views: start end width [1] 3 22 20 [CTTGTAAAAACTTACACGAA] [2] 24 36 13 [ATCGGCAGAGAAG] [3] 39 40 2 [CA] ## Return longest masked sequence ind <- which.max(width(masked)) masked[[ind]] 20-letter "BString" instance seq: CTTGTAAAAACTTACACGAA thanks, Marcus > sessionInfo() R version 2.15.0 (2012-03-30) Platform: i386-apple-darwin9.8.0/i386 (32-bit) locale: [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.24.1 IRanges_1.14.2 BiocGenerics_0.2.0 BiocInstaller_1.4.3 loaded via a namespace (and not attached): [1] stats4_2.15.0 tools_2.15.0 [[alternative HTML version deleted]]

Biostrings Biostrings • 1.4k views

ADD COMMENT • link updated 13.0 years ago by Harris A. Jaffee ▴ 590 • written 13.0 years ago by Marcus Davy ▴ 390

0

Entering edit mode

Harris A. Jaffee ▴ 590

@harris-a-jaffee-3972

Last seen 10.5 years ago

United States

This will get you most of the way. > subject <- "YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA" > s = DNAString(subject) > F = letterFrequencyInSlidingView(s, letters="ACGT", view.width=1) > Rle(F) 'integer' Rle of length 40 with 6 runs Lengths: 2 20 1 13 2 2 Values : 0 1 0 1 0 1 On Apr 13, 2012, at 7:17 PM, Marcus Davy wrote: > Hi, > I am wanting to find the longest run of non ambiguous DNA sequence (A, C, > G, T only) from an example DNAString, e.g. sequences from sanger reads. > > Is there a simple Biostrings function to do this that I am missing? > > An approach I thought of was to use mask() to identify the ambiguous base > positions, so they (and their converse) can be visualized as a > XStringViews object, however in extracting the ambiguous base positions > consensusMatrix() wouldn't work with the baseOnly option if the input was a > DNAString. > > > ## Ambiguous bases > > names(IUPAC_CODE_MAP)[-(1:4)] > > > > subject <- "YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA" > > > > ## Find ambiguous base positions > > ## pattern matching outside biostrings > > gregexpr("[^A|C|G|T]", subject, perl=TRUE)[[1]] > > > > ## A DNAString class fails with baseOnly option > > try(which(consensusMatrix(DNAString(subject), baseOnly=TRUE)["other",]==1)) > > > > ## Find ambiguous bases, forced to use DNASetingSet(subject) > > ambGaps <- which(consensusMatrix(DNAStringSet(subject), > baseOnly=TRUE)["other",]==1) > > > > masked <- mask(subject, start=ambGaps, end=ambGaps) > > masked > > > Views on a 40-letter BString subject > > subject: YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA > > views: > > start end width > > [1] 3 22 20 [CTTGTAAAAACTTACACGAA] > > [2] 24 36 13 [ATCGGCAGAGAAG] > > [3] 39 40 2 [CA] > > > ## Return longest masked sequence > > ind <- which.max(width(masked)) > > masked[[ind]] > > 20-letter "BString" instance > > seq: CTTGTAAAAACTTACACGAA > > > thanks, > > Marcus > > >> sessionInfo() > > R version 2.15.0 (2012-03-30) > > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > > locale: > > [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8 > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > other attached packages: > > [1] Biostrings_2.24.1 IRanges_1.14.2 BiocGenerics_0.2.0 > BiocInstaller_1.4.3 > > > loaded via a namespace (and not attached): > > [1] stats4_2.15.0 tools_2.15.0 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 13.0 years ago Harris A. Jaffee ▴ 590

0

Entering edit mode

thanks, so some alternatives using 'letterFrequencyInSlidingView() to generate the list of base positions that are ambiguous; subject <- "YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA" s <- DNAString(subject) nucPos <- letterFrequencyInSlidingView(s, letters="ACGT", view.width=1) ## Finding ambiguous base positions ambGaps <- which(nucPos==0) ambGaps [1] 1 2 23 37 38 ## Alternative using Rle class RlePos <- Rle(nucPos) ambGaps <- which(RlePos==0) mask(subject, start=ambGaps, end=ambGaps) Views on a 40-letter BString subject subject: YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA views: start end width [1] 3 22 20 [CTTGTAAAAACTTACACGAA] [2] 24 36 13 [ATCGGCAGAGAAG] [3] 39 40 2 [CA] Marcus On Sat, Apr 14, 2012 at 3:28 PM, Harris A. Jaffee <hj@jhu.edu> wrote: > This will get you most of the way. > > > subject <- "YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA" > > s = DNAString(subject) > > F = letterFrequencyInSlidingView(s, letters="ACGT", view.width=1) > > > Rle(F) > 'integer' Rle of length 40 with 6 runs > Lengths: 2 20 1 13 2 2 > Values : 0 1 0 1 0 1 > > On Apr 13, 2012, at 7:17 PM, Marcus Davy wrote: > > > Hi, > > I am wanting to find the longest run of non ambiguous DNA sequence (A, C, > > G, T only) from an example DNAString, e.g. sequences from sanger reads. > > > > Is there a simple Biostrings function to do this that I am missing? > > > > An approach I thought of was to use mask() to identify the ambiguous base > > positions, so they (and their converse) can be visualized as a > > XStringViews object, however in extracting the ambiguous base positions > > consensusMatrix() wouldn't work with the baseOnly option if the input > was a > > DNAString. > > > > > > ## Ambiguous bases > > > > names(IUPAC_CODE_MAP)[-(1:4)] > > > > > > > > subject <- "YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA" > > > > > > > > ## Find ambiguous base positions > > > > ## pattern matching outside biostrings > > > > gregexpr("[^A|C|G|T]", subject, perl=TRUE)[[1]] > > > > > > > > ## A DNAString class fails with baseOnly option > > > > try(which(consensusMatrix(DNAString(subject), > baseOnly=TRUE)["other",]==1)) > > > > > > > > ## Find ambiguous bases, forced to use DNASetingSet(subject) > > > > ambGaps <- which(consensusMatrix(DNAStringSet(subject), > > baseOnly=TRUE)["other",]==1) > > > > > > > > masked <- mask(subject, start=ambGaps, end=ambGaps) > > > > masked > > > > > > Views on a 40-letter BString subject > > > > subject: YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA > > > > views: > > > > start end width > > > > [1] 3 22 20 [CTTGTAAAAACTTACACGAA] > > > > [2] 24 36 13 [ATCGGCAGAGAAG] > > > > [3] 39 40 2 [CA] > > > > > > ## Return longest masked sequence > > > > ind <- which.max(width(masked)) > > > > masked[[ind]] > > > > 20-letter "BString" instance > > > > seq: CTTGTAAAAACTTACACGAA > > > > > > thanks, > > > > Marcus > > > > > >> sessionInfo() > > > > R version 2.15.0 (2012-03-30) > > > > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > > > > > locale: > > > > [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8 > > > > > > attached base packages: > > > > [1] stats graphics grDevices utils datasets methods base > > > > > > other attached packages: > > > > [1] Biostrings_2.24.1 IRanges_1.14.2 BiocGenerics_0.2.0 > > BiocInstaller_1.4.3 > > > > > > loaded via a namespace (and not attached): > > > > [1] stats4_2.15.0 tools_2.15.0 > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > [[alternative HTML version deleted]]

ADD REPLY • link 13.0 years ago Marcus Davy ▴ 390

Login before adding your answer.