Entering edit mode
Marcus Davy
▴
390
@marcus-davy-5153
Last seen 6.7 years ago
Hi,
I am wanting to find the longest run of non ambiguous DNA sequence (A,
C,
G, T only) from an example DNAString, e.g. sequences from sanger
reads.
Is there a simple Biostrings function to do this that I am missing?
An approach I thought of was to use mask() to identify the ambiguous
base
positions, so they (and their converse) can be visualized as a
XStringViews object, however in extracting the ambiguous base
positions
consensusMatrix() wouldn't work with the baseOnly option if the input
was a
DNAString.
## Ambiguous bases
names(IUPAC_CODE_MAP)[-(1:4)]
subject <- "YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA"
## Find ambiguous base positions
## pattern matching outside biostrings
gregexpr("[^A|C|G|T]", subject, perl=TRUE)[[1]]
## A DNAString class fails with baseOnly option
try(which(consensusMatrix(DNAString(subject),
baseOnly=TRUE)["other",]==1))
## Find ambiguous bases, forced to use DNASetingSet(subject)
ambGaps <- which(consensusMatrix(DNAStringSet(subject),
baseOnly=TRUE)["other",]==1)
masked <- mask(subject, start=ambGaps, end=ambGaps)
masked
Views on a 40-letter BString subject
subject: YYCTTGTAAAAACTTACACGAAHATCGGCAGAGAAGNBCA
views:
start end width
[1] 3 22 20 [CTTGTAAAAACTTACACGAA]
[2] 24 36 13 [ATCGGCAGAGAAG]
[3] 39 40 2 [CA]
## Return longest masked sequence
ind <- which.max(width(masked))
masked[[ind]]
20-letter "BString" instance
seq: CTTGTAAAAACTTACACGAA
thanks,
Marcus
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)
locale:
[1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.24.1 IRanges_1.14.2 BiocGenerics_0.2.0
BiocInstaller_1.4.3
loaded via a namespace (and not attached):
[1] stats4_2.15.0 tools_2.15.0
[[alternative HTML version deleted]]