A problem about trimLRPatterns

0

Entering edit mode

wang peter ★ 2.0k

@wang-peter-4647

Last seen 10.2 years ago

i have found a problem if we allow edit distance as 1, with indel=1 subject aaaaaaATCGACTG pattern TCGACTG and i found some reads were trimmed ATCGACTG not TCGACTG that is ok for RNA-seq but not ok for microRNA, if we want to keep the whole microRNA sequences -- shan gao Room 231(Dr.Fei lab) Boyce Thompson Institute Cornell University Tower Road, Ithaca, NY 14853-1801 Office phone: 1-607-254-1267(day) Official email:sg839 at cornell.edu Facebook:http://www.facebook.com/profile.php?id=100001986532253

microRNA microRNA • 1.5k views

ADD COMMENT • link updated 12.8 years ago by Harris A. Jaffee ▴ 590 • written 12.8 years ago by wang peter ★ 2.0k

0

Entering edit mode

Harris A. Jaffee ▴ 590

@harris-a-jaffee-3972

Last seen 10.1 years ago

United States

I'm not sure what you are doing. Let's change your "A" to "C", so we don't confuse it with the lower case a's. Then we have this example, which looks ok to me since the "C" is not trimmed. pattern = "TCGACTG" subject = "aaaaaaCTCGACTG" trimLRPatterns(subject=subject, Rpattern=pattern, with.Rindels=TRUE, max.Rmismatch=1) [1] "aaaaaaC" What happens to you? Please be explicit. On Mar 1, 2012, at 4:05 PM, wang peter wrote: > i have found a problem > if we allow edit distance as 1, with indel=1 > subject aaaaaaATCGACTG > pattern TCGACTG > and i found some reads were trimmed ATCGACTG > not TCGACTG > > that is ok for RNA-seq > but not ok for microRNA, if we want to keep the whole microRNA sequences > > -- > shan gao > Room 231(Dr.Fei lab) > Boyce Thompson Institute > Cornell University > Tower Road, Ithaca, NY 14853-1801 > Office phone: 1-607-254-1267(day) > Official email:sg839 at cornell.edu > Facebook:http://www.facebook.com/profile.php?id=100001986532253 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.8 years ago Harris A. Jaffee ▴ 590

0

Entering edit mode

Harris A. Jaffee ▴ 590

@harris-a-jaffee-3972

Last seen 10.1 years ago

United States

I'm forwarding back to the list so that others can check my answer. What is going on is something to this effect. subject = "TATAGTAGATATTGGAATAGTACTGTAGGCACCATCAATAGATCGGAA" pattern = "CTGTAGGCACCATCAATAGATCGGAAGAGCGGTTCAGAAGGAATGCCGAG" sapply(15:nchar(subject), function(j) { s = substr(subject, j, nchar(subject)) p = substr(pattern, 1, nchar(subject)-j+1) neditEndingAtending.at=nchar(s), pattern = p, subject = s, with.indels=TRUE) }) [1] 8 7 6 5 4 3 2 1 0 2 4 6 8 10 11 11 10 9 8 8 9 8 7 7 6 [26] 5 5 4 3 2 4 3 2 1 (I started my sapply at j=15 to reduce the noise for j less than 15. You can start at j=1 if you like. Those edit values will be larger than 8 and thus too large given your limits.) When you say "should be", you are focusing on the smallest number (0) in this vector, which I admit reflects the smallest edit distance between the strings s and p, as s and p vary with j, with j going from 15 to nchar(subject). But trimLRPatterns is designed to perform the longest trim possible given your mismatch limits, even if that is not the one with minimal edit distance. I think you will find that edit distance 8 (the first element in the vector, corresponding to j=15) is within your mismatch limits. This is why the function trims from position 15 on, leaving the 14-character result, "TATAGTAGATATTG". Does that explain it? On Mar 1, 2012, at 5:48 PM, wang peter wrote: > can you try this sample > the adapter is PCR2rc CTGTAGGCACCATCAATAGATCGGAAGAGCGGTTCAGAAGGAATGCCGAG > > the function is > max.mismatchs <- 0.25*1:nchar(DNAString(PCR2rc)) > trimmedCoords <- trimLRPatterns(Rpattern = PCR2rc, subject = > sread(highQuaReads), max.Rmismatch= max.mismatchs, > with.Rindels=T,ranges=T) > > highQuaReads is one read: > > @HWI-ST132:506:D0CNUABXX:3:1101:1576:1834 1:N:0:TTCACA > TATAGTAGATATTGGAATAGTACTGTAGGCACCATCAATAGATCGGAA > + > :BDDFDHHGHHJJJIIJIIIIEIGIJJJBGHIFEEH9EGIBHEHG6:8 > > > and the results is > > @HWI-ST132:506:D0CNUABXX:3:1101:1576:1834 1:N:0:TTCACA > TATAGTAGATATTG > + > :BDDFDHHGHHJJJ > > > u see > it should be > TATAGTAGATATTGGAATAGTA > > i have many samples like this > > -- > shan gao > Room 231(Dr.Fei lab) > Boyce Thompson Institute > Cornell University > Tower Road, Ithaca, NY 14853-1801 > Office phone: 1-607-254-1267(day) > Official email:sg839 at cornell.edu > Facebook:http://www.facebook.com/profile.php?id=100001986532253

ADD COMMENT • link 12.8 years ago Harris A. Jaffee ▴ 590

0

Entering edit mode

dear Harris thank you for your perfect example: but i still have 3 small questions: 1. when j=15, s="GAATAGTACTGTAGGCACCATCAATAGATCGGAA" and p = "CTGTAGGCACCATCAATAGATCGGAAGAGCGGTT" and the edit distance between s and p is 16, not 8 > subject = "TATAGTAGATATTGGAATAGTACTGTAGGCACCATCAATAGATCGGAA" > pattern = "CTGTAGGCACCATCAATAGATCGGAAGAGCGGTTCAGAAGGAATGCCGAG" > > sapply(15:nchar(subject), function(j) { > ? ? ? ?s = substr(subject, j, nchar(subject)) > ? ? ? ?p = substr(pattern, 1, nchar(subject)-j+1) > ? ? ? ?neditEndingAtending.at=nchar(s), pattern = p, subject = s, with.indels=TRUE) > }) > > ?[1] ?8 ?7 ?6 ?5 ?4 ?3 ?2 ?1 ?0 ?2 ?4 ?6 ?8 10 11 11 10 ?9 ?8 ?8 ?9 ?8 ?7 ?7 ?6 > [26] ?5 ?5 ?4 ?3 ?2 ?4 ?3 ?2 ?1 2. if the trimLRPatterns try to trim the longest substring in the scope of mismatch number, it will remove some bp which are not noise? right? 3. so the trimLRPatterns algorithm is based on the edit distance, right? i think it uses dynamic programming to calculate the Levenshtein distance. but it seems much faster than my program which also uses dynamic programming thank you very much shan

ADD REPLY • link 12.8 years ago wang peter ★ 2.0k

0

Entering edit mode

On Mar 2, 2012, at 9:48 AM, wang peter wrote: > dear Harris > thank you for your perfect example: but i still have 3 small questions: > 1. when j=15, s="GAATAGTACTGTAGGCACCATCAATAGATCGGAA" > and p = "CTGTAGGCACCATCAATAGATCGGAAGAGCGGTT" > and the edit distance between s and p is 16, not 8 Actually, not so perfect. My code was not quite right, and that may be the source of your question 1, although I really can't be sure what you are asking. Let me try again now. subject = "TATAGTAGATATTGGAATAGTACTGTAGGCACCATCAATAGATCGGAA" pattern = "CTGTAGGCACCATCAATAGATCGGAAGAGCGGTTCAGAAGGAATGCCGAG" Notice that the pattern has to start 2 positions before the subject, in order to end at the same place. So my loop should really start at j=-1 to reflect everything that trimLRPatterns tests. Moreover, the subject CAN NOT be shortened in the loop; only the pattern can be (and must be) shortened: N = sapply(-1:nchar(subject), function(j) { p = substr(pattern, 1, nchar(subject)-j+1) neditEndingAtending.at=nchar(subject), pattern=p, subject=subject, with.indels=TRUE) }) The subject can not be shortened, to "s" as I was doing before, because an insert may be desirable in matching p, and so the subject string to which it may match, fuzzily, may be longer than s. Now the result is: > N [1] 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 [26] 1 2 3 4 5 6 7 8 9 8 8 9 8 7 7 6 5 5 4 3 2 3 3 2 1 To get us located explicitly on the subject, set names(N) = -1:nchar(subject) Then > as.integer(N["15"]) [1] 8 > as.integer(N["23"]) [1] 0 So, where is the your edit distance of 16? See more details below. >> subject = "TATAGTAGATATTGGAATAGTACTGTAGGCACCATCAATAGATCGGAA" >> pattern = "CTGTAGGCACCATCAATAGATCGGAAGAGCGGTTCAGAAGGAATGCCGAG" >> >> sapply(15:nchar(subject), function(j) { >> s = substr(subject, j, nchar(subject)) >> p = substr(pattern, 1, nchar(subject)-j+1) >> neditEndingAtending.at=nchar(s), pattern = p, subject = s, with.indels=TRUE) >> }) >> >> [1] 8 7 6 5 4 3 2 1 0 2 4 6 8 10 11 11 10 9 8 8 9 8 7 7 6 >> [26] 5 5 4 3 2 4 3 2 1 > > 2. if the trimLRPatterns try to trim the longest substring in the > scope of mismatch number, > it will remove some bp which are not noise? right? Your mismatch limits: M = as.integer(rev(0.25*1:nchar(pattern))) > M [1] 12 12 12 11 11 11 11 10 10 10 10 9 9 9 9 8 8 8 8 7 7 7 7 6 6 [26] 6 6 5 5 5 5 4 4 4 4 3 3 3 3 2 2 2 2 1 1 1 1 0 0 0 We are looking for the first time, in terms of increasing j, that the edit distance N is <= your mismatch limit M. This happens at j=15 (at the 17th element of N), and the edit distance at this point is 8: > match(TRUE, N<=M) [1] 17 > N[17] 15 8 This is confirmed by the following call, which is how trimLRPatterns does it: > which.isMatchingEndingAtending.at=nchar(subject), pattern=pattern, subject=subject, max.mismatch=M, with.indels=TRUE, auto.reduce.pattern=TRUE) [1] 17 > 3. so the trimLRPatterns algorithm is based on the edit distance, right? i think > it uses dynamic programming to calculate the Levenshtein distance. > but it seems much faster than my program which also uses dynamic programming For a large subject, i.e. consisting of very many (short) reads, the "auto.reduce.pattern=TRUE" feature in which.isMatchingEndingAt() allows trimLRPatterns to make a single call to C to process all the reads. For each read, the looping in C (analogous to all the R code I've written above) can, and does, stop at the first acceptable match, while of course this stopping point will vary from read to read, so can't be known in advance. This might account for some of the timing you are seeing. I cannot speak for why Herve's distance calculation, which is used by the looping I've just sketched out, is fast, or faster than your method. For details see Biostrings/src/lowlevel_matching.c, in this case the function _nedit_for_Proffset(). > thank you very much > shan

ADD REPLY • link 12.8 years ago Harris A. Jaffee ▴ 590

Login before adding your answer.