Non ambiguous DNA runs
1
0
Entering edit mode
Marcus Davy ▴ 390
@marcus-davy-5153
Last seen 6.7 years ago
Hi, I am wanting to find the longest run of non ambiguous DNA sequence (A, C, G, T only) from an example DNAString, e.g. sequences from sanger reads. Is there a simple Biostrings function to do this that I am missing? An approach I thought of was to use mask() to identify the ambiguous base positions, so they (and their converse) can be visualized as a XStringViews object, however in extracting the ambiguous base positions consensusMatrix() wouldn't work with the baseOnly option if the input was a DNAString. ## Ambiguous bases names(IUPAC_CODE_MAP)[-(1:4)] subject <- "YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA" ## Find ambiguous base positions ## pattern matching outside biostrings gregexpr("[^A|C|G|T]", subject, perl=TRUE)[[1]] ## A DNAString class fails with baseOnly option try(which(consensusMatrix(DNAString(subject), baseOnly=TRUE)["other",]==1)) ## Find ambiguous bases, forced to use DNASetingSet(subject) ambGaps <- which(consensusMatrix(DNAStringSet(subject), baseOnly=TRUE)["other",]==1) masked <- mask(subject, start=ambGaps, end=ambGaps) masked Views on a 40-letter BString subject subject: YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA views: start end width [1] 3 22 20 [CTTGTAAAAACTTACACGAA] [2] 24 36 13 [ATCGGCAGAGAAG] [3] 39 40 2 [CA] ## Return longest masked sequence ind <- which.max(width(masked)) masked[[ind]] 20-letter "BString" instance seq: CTTGTAAAAACTTACACGAA thanks, Marcus > sessionInfo() R version 2.15.0 (2012-03-30) Platform: i386-apple-darwin9.8.0/i386 (32-bit) locale: [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.24.1 IRanges_1.14.2 BiocGenerics_0.2.0 BiocInstaller_1.4.3 loaded via a namespace (and not attached): [1] stats4_2.15.0 tools_2.15.0 [[alternative HTML version deleted]]
Biostrings Biostrings • 1.3k views
ADD COMMENT
0
Entering edit mode
@harris-a-jaffee-3972
Last seen 10.1 years ago
United States
This will get you most of the way. > subject <- "YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA" > s = DNAString(subject) > F = letterFrequencyInSlidingView(s, letters="ACGT", view.width=1) > Rle(F) 'integer' Rle of length 40 with 6 runs Lengths: 2 20 1 13 2 2 Values : 0 1 0 1 0 1 On Apr 13, 2012, at 7:17 PM, Marcus Davy wrote: > Hi, > I am wanting to find the longest run of non ambiguous DNA sequence (A, C, > G, T only) from an example DNAString, e.g. sequences from sanger reads. > > Is there a simple Biostrings function to do this that I am missing? > > An approach I thought of was to use mask() to identify the ambiguous base > positions, so they (and their converse) can be visualized as a > XStringViews object, however in extracting the ambiguous base positions > consensusMatrix() wouldn't work with the baseOnly option if the input was a > DNAString. > > > ## Ambiguous bases > > names(IUPAC_CODE_MAP)[-(1:4)] > > > > subject <- "YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA" > > > > ## Find ambiguous base positions > > ## pattern matching outside biostrings > > gregexpr("[^A|C|G|T]", subject, perl=TRUE)[[1]] > > > > ## A DNAString class fails with baseOnly option > > try(which(consensusMatrix(DNAString(subject), baseOnly=TRUE)["other",]==1)) > > > > ## Find ambiguous bases, forced to use DNASetingSet(subject) > > ambGaps <- which(consensusMatrix(DNAStringSet(subject), > baseOnly=TRUE)["other",]==1) > > > > masked <- mask(subject, start=ambGaps, end=ambGaps) > > masked > > > Views on a 40-letter BString subject > > subject: YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA > > views: > > start end width > > [1] 3 22 20 [CTTGTAAAAACTTACACGAA] > > [2] 24 36 13 [ATCGGCAGAGAAG] > > [3] 39 40 2 [CA] > > > ## Return longest masked sequence > > ind <- which.max(width(masked)) > > masked[[ind]] > > 20-letter "BString" instance > > seq: CTTGTAAAAACTTACACGAA > > > thanks, > > Marcus > > >> sessionInfo() > > R version 2.15.0 (2012-03-30) > > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > > locale: > > [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8 > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > other attached packages: > > [1] Biostrings_2.24.1 IRanges_1.14.2 BiocGenerics_0.2.0 > BiocInstaller_1.4.3 > > > loaded via a namespace (and not attached): > > [1] stats4_2.15.0 tools_2.15.0 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
thanks, so some alternatives using 'letterFrequencyInSlidingView() to generate the list of base positions that are ambiguous; subject <- "YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA" s <- DNAString(subject) nucPos <- letterFrequencyInSlidingView(s, letters="ACGT", view.width=1) ## Finding ambiguous base positions ambGaps <- which(nucPos==0) ambGaps [1] 1 2 23 37 38 ## Alternative using Rle class RlePos <- Rle(nucPos) ambGaps <- which(RlePos==0) mask(subject, start=ambGaps, end=ambGaps) Views on a 40-letter BString subject subject: YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA views: start end width [1] 3 22 20 [CTTGTAAAAACTTACACGAA] [2] 24 36 13 [ATCGGCAGAGAAG] [3] 39 40 2 [CA] Marcus On Sat, Apr 14, 2012 at 3:28 PM, Harris A. Jaffee <hj@jhu.edu> wrote: > This will get you most of the way. > > > subject <- "YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA" > > s = DNAString(subject) > > F = letterFrequencyInSlidingView(s, letters="ACGT", view.width=1) > > > Rle(F) > 'integer' Rle of length 40 with 6 runs > Lengths: 2 20 1 13 2 2 > Values : 0 1 0 1 0 1 > > On Apr 13, 2012, at 7:17 PM, Marcus Davy wrote: > > > Hi, > > I am wanting to find the longest run of non ambiguous DNA sequence (A, C, > > G, T only) from an example DNAString, e.g. sequences from sanger reads. > > > > Is there a simple Biostrings function to do this that I am missing? > > > > An approach I thought of was to use mask() to identify the ambiguous base > > positions, so they (and their converse) can be visualized as a > > XStringViews object, however in extracting the ambiguous base positions > > consensusMatrix() wouldn't work with the baseOnly option if the input > was a > > DNAString. > > > > > > ## Ambiguous bases > > > > names(IUPAC_CODE_MAP)[-(1:4)] > > > > > > > > subject <- "YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA" > > > > > > > > ## Find ambiguous base positions > > > > ## pattern matching outside biostrings > > > > gregexpr("[^A|C|G|T]", subject, perl=TRUE)[[1]] > > > > > > > > ## A DNAString class fails with baseOnly option > > > > try(which(consensusMatrix(DNAString(subject), > baseOnly=TRUE)["other",]==1)) > > > > > > > > ## Find ambiguous bases, forced to use DNASetingSet(subject) > > > > ambGaps <- which(consensusMatrix(DNAStringSet(subject), > > baseOnly=TRUE)["other",]==1) > > > > > > > > masked <- mask(subject, start=ambGaps, end=ambGaps) > > > > masked > > > > > > Views on a 40-letter BString subject > > > > subject: YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA > > > > views: > > > > start end width > > > > [1] 3 22 20 [CTTGTAAAAACTTACACGAA] > > > > [2] 24 36 13 [ATCGGCAGAGAAG] > > > > [3] 39 40 2 [CA] > > > > > > ## Return longest masked sequence > > > > ind <- which.max(width(masked)) > > > > masked[[ind]] > > > > 20-letter "BString" instance > > > > seq: CTTGTAAAAACTTACACGAA > > > > > > thanks, > > > > Marcus > > > > > >> sessionInfo() > > > > R version 2.15.0 (2012-03-30) > > > > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > > > > > locale: > > > > [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8 > > > > > > attached base packages: > > > > [1] stats graphics grDevices utils datasets methods base > > > > > > other attached packages: > > > > [1] Biostrings_2.24.1 IRanges_1.14.2 BiocGenerics_0.2.0 > > BiocInstaller_1.4.3 > > > > > > loaded via a namespace (and not attached): > > > > [1] stats4_2.15.0 tools_2.15.0 > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 982 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6