calculate frequency of non-overlapping matches for a motif in a XStringSet
1
0
Entering edit mode
@stefanie-tauber-3978
Last seen 2.7 years ago
Germany

Hi,

 

I would like to count the number of disjoint matches for a given pattern and character vector where matches are sought. So, basically what gregexpr is doing. However I would like to do this for a set of patterns as well as a set of characters (such as XStringSet).

Right now I just apply gregexpr multiple times. Is there a more clever way to do this? I know about vcountPDict. However, I would like to count number of disjoint occurrences rather than total number of occurrences.

Any comment on this?

Toy example

library(Biostrings)

data(yeastSEQCHR1)
yeast1 <- DNAString(yeastSEQCHR1)

x = Views(yeast1, start = sample(length(yeast1),20), width=20)

# returns number of disjoint matches for a given pattern
gregexpr("AAA",x)

# vectorized way to get the number of matches for each motif in the dictionary
vcountPDict(DNAStringSet(c("AAA","TTT")), x)

 

Best,

Stefanie

 

Biostrings patternmatch biostrings matchpattern • 1.5k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 2 hours ago
Seattle, WA, United States

Hi Stephanie,

If you were able to use vmatchPDict() (instead of vcountPDict()) then you could get the ranges of the matches (instead of the counts only), then merge the ranges that overlap, and then finally count them. Unfortunately vmatchPDict() is not available yet. However you might still be able to use a loop and get reasonable performance:

If you don't have too many patterns (i.e. < 10000), you can loop over them and call vmatchPattern() in the loop:

my_patterns <- DNAStringSet(c("AAA","TTT"))
m1 <- sapply(my_patterns, function(p) elementLengths(as(vmatchPattern(p, as(x, "DNAStringSet")), "NormalIRangesList")))
m1
#       [,1] [,2]
#  [1,]    0    1
#  [2,]    1    0
#  [3,]    1    1
#  [4,]    1    0
#  [5,]    0    0
#  [6,]    0    0
#  [7,]    0    2
#  [8,]    3    0
#  [9,]    1    0
# [10,]    0    1
# [11,]    0    0
# [12,]    0    0
# [13,]    0    0
# [14,]    1    0
# [15,]    0    1
# [16,]    0    0
# [17,]    0    0
# [18,]    0    0
# [19,]    1    0
# [20,]    0    1

The coercion to NormalIRangesList is what performs the merging of overlapping ranges.

If you have many patterns but a small number of subjects, it might be more efficient to loop over the subjects and call matchPDict() in the loop:

m2 <- sapply(as(x, "DNAStringSet"), function(s) elementLengths(as(matchPDict(PDict(my_patterns), s), "NormalIRangesList")))
m2
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
# [1,]    0    1    1    1    0    0    0    3    1     0     0     0     0     1
# [2,]    1    0    1    0    0    0    2    0    0     1     0     0     0     0
#      [,15] [,16] [,17] [,18] [,19] [,20]
# [1,]     0     0     0     0     1     0
# [2,]     1     0     0     0     0     1

Note that m2 is the transpose of m1:

identical(m2, t(m1))
# [1] TRUE

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode

Hi Herve,

thanks for the comment. This definitely helps. Since I am interested in non-overlapping hits, I would calculate - after the conversion to 'NormalIRangesList' the number of non-overlapping hits by means of the width of the matches and the width of the pattern.
In my case, I have many ( about 800,000 ) subjects and few patterns (~ 100). Is there any possibility to speed things up?
The conversion to 'NormalIRangesList' takes several minutes for each single pattern ...
Thanks a lot!

ADD REPLY
0
Entering edit mode

Hi Stefanie,

Coercion from MIndex to NormalIRangesList was indeed very inefficient. I fixed this in Biostrings 2.38.4. Now it's much faster. Biostrings 2.38.4 should become available via biocLite() in about 36 hours.

Let me know if you still run into problems with this.

Cheers,

H.

ADD REPLY

Login before adding your answer.

Traffic: 766 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6