sorry to disturb you again
but i am still feeling confused
see this problem
subject = "GGTAACTTTTCTGACACCTCCTGCTTAAAACCCCAAAGGTCAGAAGGATCGTGAGGCCC
CGCTTTCACGGTCTGTATTCGTACTGAAAATCAAGATCAAG"
Rpattern =
"AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG"
max.mismatchs <- 0.2*1:nchar(DNAString(Rpattern))
trimLRPatterns(Rpattern = Rpattern, subject = subject,
max.Rmismatch=max.mismatchs, with.Rindels=TRUE)
[1] "GGTAACTTTTCTGACACCTCCTGCTTAAAACCCCAAAGGTCAGAAGGATCGTGAGGCCCCGCTTT
CACGGTCTGTATTCGTACTGAAAAT"
CAAGATC AAG
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
overlap length is 10 bp, so the allowed distance is 10*0.2=2
so it should trim the "AGATCAAG", not include the "CA"
I am confused why?
--
shan gao
Room 231(Dr.Fei lab)
Boyce Thompson Institute
Cornell University
Tower Road, Ithaca, NY 14853-1801
Office phone: 1-607-254-1267(day)
Official email:sg839 at cornell.edu
Facebook:http://www.facebook.com/profile.php?id=100001986532253
You are perfectly correct, but it is not terrible that we trimmed a
little extra. On the other
hand, there is a problem lurking here that, in opposite cases, we
would not quite trim enough.
To simplify the example, we are only concerned with the last 10
letters of your subject and the
first 10 letters of Rpattern:
> substr(subject, 91, 100)
[1] "CAAGATCAAG"
> substr(Rpattern, 1, 10)
[1] "AGATCGGAAG"
Then note that
> neditEndingAt("AGATCGGAAG", "CAAGATCAAG", ending.at=10,
with.indels=TRUE)
[1] 2
And your result:
> trimLRPatterns(Rpattern = "AGATCGGAAG", subject = "CAAGATCAAG",
max.Rmismatch=2, with.Rindels=TRUE)
[1] ""
But you ask why, after deleting GG in the Rpattern, we don't trim only
the last 8 letters of the subject
rather than all 10. This is a valid question.
In general, with max.Rmismatch=2 and with.Rindels=TRUE, the matching
suffix of the subject ending.at=10)
could be as short as 8 letters or as long as 12. The lower (C) level
matching code does not actually
report the matching portion of the subject -- at least, it didn't long
ago. Thus, trimLRPatterns takes a
lazy out and trims the subject suffix of the same length as the
matching Rpattern prefix. Up to the
length of the pattern prefix, this has the advantage of always
removing any possible adapter even if more
is removed than necessary, as you describe here. If the match used 2
insertions in the pattern, rather
than deletions, so was 12 letters long, then we aren't removing all
possible adapter, and this will hurt
or prevent later alignment. One might say that we should trim 12 in
all cases, even yours, and I have
proposed that.
Thanks, and let us know if I didn't get your point.
On Nov 29, 2012, at 11:17 PM, Wang Peter wrote:
> sorry to disturb you again
>
> but i am still feeling confused
> see this problem
>
> subject = "GGTAACTTTTCTGACACCTCCTGCTTAAAACCCCAAAGGTCAGAAGGATCGTGAGGC
CCCGCTTTCACGGTCTGTATTCGTACTGAAAATCAAGATCAAG"
>
> Rpattern =
"AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG"
>
> max.mismatchs <- 0.2*1:nchar(DNAString(Rpattern))
>
> trimLRPatterns(Rpattern = Rpattern, subject = subject,
> max.Rmismatch=max.mismatchs, with.Rindels=TRUE)
> [1] "GGTAACTTTTCTGACACCTCCTGCTTAAAACCCCAAAGGTCAGAAGGATCGTGAGGCCCCGCT
TTCACGGTCTGTATTCGTACTGAAAAT"
>
> CAAGATC AAG
> AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
>
> overlap length is 10 bp, so the allowed distance is 10*0.2=2
>
> so it should trim the "AGATCAAG", not include the "CA"
>
> I am confused why?
>
>
>
> --
> shan gao
> Room 231(Dr.Fei lab)
> Boyce Thompson Institute
> Cornell University
> Tower Road, Ithaca, NY 14853-1801
> Office phone: 1-607-254-1267(day)
> Official email:sg839 at cornell.edu
> Facebook:http://www.facebook.com/profile.php?id=100001986532253
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor