Question

Tabulating instances of letter combinations in strings.

marc.carlson • 0


I have been getting a lot of use out of Biostrings lately. My collaborator has a series of regulatory elements that have been randomly joined together, put into T cells, and then sequenced after the cells were put under selective pressure. He basically wants to know which combinations of elements (which sequences) are most popular in his cells. The TL;DR is that I was able to use your Biostrings package to quickly learn what order his specific sequences had been seen in. I then gave each of these elements a one-letter code to simplify the representation, so now I have a bunch of strings that look like this:

ABDFOYT QEWNILL UDFNHOA

Etc. (and so on for many thousands of strings)

What we want to do now is to ask: which combinations of elements are most common? Well, letterFrequency() is great for the first layer of that!

But the next thing we want to know is: what combinations of letters are most common? In other words: how can I tabulate how often I see "AB", or even "ABD", etc.?

I tried using "AB" as a string for letterFrequency(). But that assumes that I actually mean "A|B" (A OR B), when what I really want is "A followed by B" OR possibly "B followed by A" (in my case those two things would be equivalent). Can letterFrequency() be repurposed to do anything like that?

Biostrings alphabetFrequency letterFrequency • 784 views

updated 5.2 years ago by Hervé Pagès • written 5.2 years ago by marc.carlson

score 1 · Answer 1 · 2019-10-24

Hi Marc,

One possibility is to do:

library(Biostrings)

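extractWords <- function(x, width=2L)
{
  ## Valid start positions of a word of the given width in each string
  ## (none if the string is shorter than 'width'):
  starts <- as(IRanges(1L, pmax(lengths(x) - width + 1L, 0L)), "IntegerList")
  ## One range of length 'width' per start, grouped by string:
  at <- relist(IRanges(unlist(starts, use.names=FALSE), width=width), starts)
  ## Extract the words (overlapping windows) from each string:
  extractAt(x, at)
}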

Then:

x <- BStringSet(c("ABDFOYT", "QEWNILL", "UDFNHOA", "ABABAAA"))

table(unlist(extractWords(x)))
# AA AB BA BD DF EW FN FO HO IL LL NH NI OA OY QE UD WN YT
#  2  3  2  1  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1

table(unlist(extractWords(x, width=3)))
# AAA ABA ABD BAA BAB BDF DFN DFO EWN FNH FOY HOA ILL NHO NIL OYT QEW UDF WNI 
#   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1

This tabulates every 2-letter and 3-letter word present in x, though, so it could be expensive if x is "big" (e.g. has hundreds of thousands of strings).
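Since you mention that "A followed by B" and "B followed by A" would be equivalent in your case, here is a minimal sketch of one way to merge such words before tabulating. It is plain base R on an ordinary character vector rather than anything provided by Biostrings, and the helpers words_of() and canonical() are hypothetical names made up for this illustration. The idea is simply to sort the letters inside each word so that "BA" and "AB" end up under the same key:

```r
## Same example strings as above, as a plain character vector (assumption:
## the data fits comfortably in memory as ordinary R strings).
strings <- c("ABDFOYT", "QEWNILL", "UDFNHOA", "ABABAAA")

## Hypothetical helper: all overlapping words of a given width in one string.
words_of <- function(s, width = 2L) {
  n <- nchar(s) - width + 1L
  if (n < 1L) return(character(0))
  starts <- seq_len(n)
  substring(s, starts, starts + width - 1L)
}

## Hypothetical helper: sort the letters inside each word so that "BA" and
## "AB" map to the same key ("AB").
canonical <- function(w) {
  vapply(strsplit(w, "", fixed = TRUE),
         function(ch) paste(sort(ch), collapse = ""),
         character(1))
}

words <- unlist(lapply(strings, words_of, width = 2L))

## Order-insensitive counts, most common first.
unordered_counts <- table(canonical(words))
top <- sort(unordered_counts, decreasing = TRUE)
head(top)
```

With these four strings, "AB" and "BA" are pooled into a single "AB" entry with a count of 5. The same canonical() step could presumably be applied to as.character(unlist(extractWords(x))) if you would rather keep the extraction in Biostrings.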

If you want to tabulate only a predefined set of words, a solution is to use vcountPDict(). Note that in this case the words don't need to have the same length:

words <- c("AA", "AB", "AAB")

## Get the counts in a matrix with 1 row per word and 1 col per string in 'x':
count_mat <- vcountPDict(BStringSet(words), x)
rownames(count_mat) <- words
count_mat
#     [,1] [,2] [,3] [,4]
# AA     0    0    0    2
# AB     1    0    0    2
# AAB    0    0    0    0

You can use the collapse argument to summarize the counts in a vector parallel to words:

setNames(vcountPDict(BStringSet(words), x, collapse=1), words)
#  AA  AB AAB 
#   2   3   0

Note that the latter is equivalent to rowSums(count_mat) but will be more efficient if x is "big".
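If you want to sanity-check these counts independently of Biostrings, a rough base R sketch follows. It does not use vcountPDict() at all; it swaps in a zero-width lookahead regular expression so that overlapping occurrences are counted (e.g. "AA" occurs twice in "ABABAAA"). The helper count_word() is a hypothetical name for this illustration, and it assumes the words contain only plain letters (no regex metacharacters). It is meant as a cross-check on small inputs, not as a fast replacement:

```r
strings <- c("ABDFOYT", "QEWNILL", "UDFNHOA", "ABABAAA")
words <- c("AA", "AB", "AAB")

## Hypothetical helper: number of (possibly overlapping) occurrences of one
## literal word in each string. The lookahead "(?=word)" matches at every
## position where the word starts without consuming any characters.
count_word <- function(word, strings) {
  hits <- gregexpr(paste0("(?=", word, ")"), strings, perl = TRUE)
  ## gregexpr() reports -1 when there is no match, so count positive hits only.
  vapply(hits, function(h) sum(h > 0L), integer(1))
}

## One row per word, one column per string.
count_mat <- t(vapply(words, count_word, integer(length(strings)),
                      strings = strings))

## Totals per word across all strings.
total <- rowSums(count_mat)
count_mat
total
```

On the four example strings this should agree with the matrix and the collapsed vector shown above.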

H.