Biostrings regex matching
Aditya ▴ 160
@aditya-7667
Germany

How to do Biostrings regex matching?

chr1 <- BSgenome.Mmusculus.UCSC.mm10::Mmusculus$chr1

Biostrings::countPattern('AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA', chr1)
    [1] 363

Biostrings::countPattern('A{44}', chr1, fixed = FALSE)
    Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW),  :
      key 123 (char '{') not in lookup table
    Error in normargPattern(pattern, subject) :
      could not turn 'pattern' into a DNAString instance
@herve-pages-1542
Seattle, WA, United States

Hi Aditya,

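matchPattern() and family in Biostrings (including countPattern()) don't support regex syntax. Note that fixed=FALSE does not turn on regex matching; it only controls how IUPAC ambiguity letters are interpreted, which is why 'A{44}' fails when Biostrings tries to turn it into a DNAString. For regex matching you would have to use grep():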

library(Biostrings)
subject <- DNAStringSet(c("TTATATT", "CCCAACCCAAACCCAAAAAAT"))
grep("A{3}", subject)
# [1] 2

or regexpr() or gregexpr(), depending on what you are after:

regexpr("A{3}", subject)
# [1] -1  9
# attr(,"match.length")
# [1] -1  3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE

gregexpr("A{3}", subject)
# [[1]]
# [1] -1
# attr(,"match.length")
# [1] -1
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
#
# [[2]]
# [1]  9 15 18
# attr(,"match.length")
# [1] 3 3 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
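
If what you are after is a count or the start positions for a single sequence, the gregexpr() result can be unpacked roughly as in the sketch below. It works on a plain character string (what as.character() would give you for one sequence of the set), and count_regex() is a hypothetical helper name, not something from Biostrings or base R. The last call uses a Perl-style lookahead, which is one way to make a regex report overlapping matches as well.

```r
# Plain character copy of the second sequence, e.g. as.character(subject[[2]])
seq_chr <- "CCCAACCCAAACCCAAAAAAT"

# Hypothetical helper: number of regex matches in a single string
count_regex <- function(pattern, x, ...) {
  m <- gregexpr(pattern, x, ...)[[1]]
  if (m[1] == -1L) 0L else length(m)
}

# Non-overlapping matches, same positions as in the gregexpr() output above
n_plain <- count_regex("A{3}", seq_chr)

# Zero-width lookahead: every position where "AAA" starts, overlaps included
overlap_starts <- as.integer(gregexpr("(?=A{3})", seq_chr, perl = TRUE)[[1]])
n_overlap <- length(overlap_starts)
```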

However, grep() and family won't be as efficient as matchPattern() and family on a DNAStringSet or DNAString object. This was actually the original motivation for coming up with the matchPattern() family of string matching functions in Biostrings.

FWIW note that this family supports some limited form of fuzzy matching via the use of IUPAC ambiguity letters in the pattern and/or subject. It also supports a small number of mismatches and indels via the max.mismatch, min.mismatch, and with.indels arguments. See ?matchPattern for the details.
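
As a rough illustration of the ambiguity-letter matching, the sketch below uses a made-up pattern, "CCNAA", that is not from the question. With the default fixed=TRUE the N in the pattern can only match a literal N in the subject; with fixed=FALSE it is treated as "any nucleotide".

```r
library(Biostrings)

subject <- DNAString("CCCAACCCAAACCCAAAAAAT")

# Default (fixed = TRUE): N is matched literally, and the subject contains no N
n_literal <- countPattern("CCNAA", subject)

# fixed = FALSE: N is interpreted as an IUPAC ambiguity letter (any of A, C, G, T)
hits <- matchPattern("CCNAA", subject, fixed = FALSE)
hit_starts <- start(hits)
```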

Finally, note that the grep("A{n}", subject) use case can easily be handled without using regex at all, by building the repeated pattern with strrep(). For example:

matchPattern(strrep("A", 3), subject[[2]])
#   Views on a 21-letter DNAString subject
# subject: CCCAACCCAAACCCAAAAAAT
# views:
#     start end width
# [1]     9  11     3 [AAA]
# [2]    15  17     3 [AAA]
# [3]    16  18     3 [AAA]
# [4]    17  19     3 [AAA]
# [5]    18  20     3 [AAA]

Not only will this be much more efficient than using grep() and family on long DNA sequences but, as you can see, unlike the gregexpr() call above, which only reported the non-overlapping matches (9, 15, 18), it also returns all the matches, including the overlapping ones. This was another original motivation for coming up with the matchPattern() family of string matching functions. And of course, you can still combine this with the use of fuzzy matching if you need that. For example, allowing 1 nucleotide insertion or deletion:

matchPattern(strrep("A", 3), subject[[1]], max.mismatch=1, with.indels=TRUE)
#   Views on a 7-letter DNAString subject
# subject: TTATATT
# views:
#     start end width
# [1]     3   5     3 [ATA]

matchPattern(strrep("A", 5), "TTAAAATT", max.mismatch=1, with.indels=TRUE)
#   Views on a 8-letter BString subject
# subject: TTAAAATT
# views:
#     start end width
# [1]     3   6     4 [AAAA]

matchPattern(strrep("A", 6), "TTAATAATT", max.mismatch=2, with.indels=TRUE)
#   Views on a 9-letter BString subject
# subject: TTAATAATT
# views:
#     start end width
# [1]     3   7     5 [AATAA]
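
Applied to the original question, the regex-free approach would look something like the sketch below. The small subject is the one used above; the chr1 part is guarded because it assumes the BSgenome.Mmusculus.UCSC.mm10 package from the question is installed, and its result is not shown here.

```r
library(Biostrings)

subject <- DNAString("CCCAACCCAAACCCAAAAAAT")

# countPattern() counts the same (overlapping) matches that matchPattern() returns
n_hits <- countPattern(strrep("A", 3), subject)

# Same idea for the 'A{44}' pattern from the question, only if the genome is available
if (requireNamespace("BSgenome.Mmusculus.UCSC.mm10", quietly = TRUE)) {
  chr1 <- BSgenome.Mmusculus.UCSC.mm10::Mmusculus$chr1
  print(countPattern(strrep("A", 44), chr1))
}
```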

Hope this helps,

H.

Thank you Herve :-).
