I'm not sure if this is the best way to go about it, but I'd probably do something like the following:
compareStrings <- function(dna1, dna2){
(as.raw(dna1) & as.raw(dna2)) > 0
}
compareStringsNoGaps <- function(dna1, dna2){
((as.raw(dna1) & as.raw(dna2)) & as.raw(15)) > 0
}
x <- DNAString("ATGCATGCATGC")
y <- DNAString("ATGGAYYCANNA")
compareStrings(x, y)
# [1] TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
which(compareStrings(x,y))
# [1] 1 2 3 5 6 8 9 10 11
## your example
x <- DNAString("ACTAGCTG")
y <- DNAString("ACYAGCBA")
compareStrings(x,y)
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## have to AND with 15 to prevent matching to gaps, if that's important
compareStrings(DNAString("ATGC.+-"),DNAString("NNNN.+-"))
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
compareStringsNoGaps(DNAString("ATGC.+-"),DNAString("NNNN.+-"))
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE
That does require that you have enough space for a raw
vector as long as the DNAString
, but since you're looking for an output that's logical(length(x))
, that shouldn't really be a problem.
The first function returns TRUE if the two characters match (obeying ambiguities). The second returns TRUE if the two characters match AND neither character is a gap character (+-.
).
Edit: just wanted to give more detail on why this works.
DNAString
objects store their values internally as bits, with A=0b0001, C=0b0010, G=0b0100, T=0b1000
. Ambiguity codes are created by bitwise-OR
s on the values (so M = (A OR C) = 0b0011
). You can see all the values for these with xscodes(DNAString())
(though they're expressed in decimal).
Calling as.raw
on a DNAString
object will convert it to its raw bit representation. If you then AND
the strings together, only bits in common will remain. For example, if one position has M
and the other A
, then M AND A = 0b0011 & 0b0001 = 0b0001 = A
. That means that if they have any bits in common, they'll return a non-zero value, and otherwise it'll be zero.
There's a slight hiccup in that other ambiguities are stored in the higher bits. In addition to the first four bits, the fifth bit denotes -
, the sixth +
, and the seventh .
. If it's important to return FALSE
when both strings are the same value and in -+.
, then you have to remove the bits above the fourth bit, which is done with a bitwise AND
with 15 = 0b1111
.