pairwiseAlignment generates different outcomes for the same inp...
1
0
Entering edit mode
@alogmail2aolcom-4540
Last seen 10.0 years ago
Hi Martin, Thanks for the clarification. Does the variant of pattern="DNAString", subject="DNAString" look more realistic to align DNA sequences? One more question. What pairwise alignment tool would you recommend if I need to align a list#1 of 50-mer probes with a list#2 of more longer contigs (hundreds of bases): I take each 50-mer probe from the list#1 and align it to each contig from the list#2 Finally, for each 50-mer probe I need to find the best match in some terms e.g. in terms of alignment scores. I have tried pairwiseAlignment() but it works slowly and looks unfeasible for the list# 1 of 16798 probes vs. the list#2 of 14338 contigs, i.e. when output matrix should be 16798 by 14338. I have looked at the mailing list and found suggestions to apply matchPattern/vmatchPattern but they provide the fact of a match for a substring within a string rather than alignment score to compare alignments. Thanks Best regards, Alex In a message dated 3/14/2011 9:04:04 A.M. Pacific Daylight Time, mtmorgan@fhcrc.org writes: On 03/13/2011 06:58 PM, Alogmail2@aol.com wrote: > Dear Mailing List, > > Why does pairwiseAlignment() generate different outcomes for the same input > sequences defined differently in terms of classes (see showMethods below): > > for > > pattern="character", subject="character" Hi Alex -- here 'character' could be anything -- 'the quick brown fox', whereas > > vs. > > pattern="DNAString", subject="DNAString" here pattern and subject are drawn from a restricted alphabet and some symbols have particular meaning (e.g., 'N' does not mean the nucleotide 'N'). Martin > > ? > > It generates the same outcomes for the case of > > > pattern="character", subject="character" > > vs. > > pattern="character", subject="DNAString" > > > It looks like a bug. > > Thanks > > Alex > > > > > > > #showMethods("pairwiseAlignment") > #Function: pairwiseAlignment (package Biostrings) > #pattern="character", subject="character" > #pattern="character", subject="DNAString" > # (inherited from: pattern="character", subject="XString") > #pattern="DNAString", subject="DNAString" > >> pattern.1 > 50-letter "DNAString" instance > seq: CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC >> pattern.2 > [1] "CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC" >> subject.1 > 543-letter "DNAString" instance > seq: > AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCTTC AGATT...CAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGCCAGTA GTTACG > CGTCGT >> subject.2 > [1] > "AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCTT CAGATTCGCCCTTCGTCAGCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCGGCCTACGCTGCTTC CTATTAC > AGCAGCGCGATGCAGGCCTACAATAGTCAATCGACGTCGGCCTACATGCCAAGCAGTGGATTCTATAATG GCGCAT > CTTCGCAGACGCCCTACGGAGTCCTGGCGCCCTCCACTTACACAACGATGGGCGTTCCCAGTACAAGAGG TTTAGG > CCAACAATGTAAAAATGGACAATCATTAGCACAAACGCCTCCGTATTTGAGCTCGTACGGGTCGGCATTC GGTGGT > GTCACAGCCAGCAGTTCGCCTTCGGGTCCACCCGCCTACGCGTCCGCTTATGGATCGGCATACAATAGCG CCACCG > CCGCCCAATCGTTCACCAACAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGG CCAGGC > CAGTAGTTACGCGTCGT" >> > >> > pairwiseAlignment(pattern=pattern.1,subject=subject.1,type="global- local") > Global-Local PairwiseAlignedFixedSubject (1 of 1) > pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC > subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC > score: -91.50367 >> > pairwiseAlignment(pattern=pattern.2,subject=subject.2,type="global- local") > Global-Local PairwiseAlignedFixedSubject (1 of 1) > pattern: [1] CTGC--CATGGCAAAGCTC--GCTGCC-TCAGAGGCCGCCAC- AATGGTTGCGCAC > subject: [80] CTTCGTCA--GCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCG-- GCCTAC > score: 95.69296 >> > pairwiseAlignment(pattern=pattern.2,subject=subject.1,type="global- local") > Global-Local PairwiseAlignedFixedSubject (1 of 1) > pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC > subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC > score: -91.50367 > >> sessionInfo() > R version 2.12.2 (2011-02-25) > Platform: i386-pc-mingw32/i386 (32-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United > States.1252 LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C LC_TIME=English_United > States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] altcdfenvs_2.12.0 hypergraph_1.22.0 graph_1.28.0 > makecdfenv_1.28.0 affy_1.28.0 Biobase_2.10.0 GeneR_2.20.0 seqinr_3.0-1 > [9] Biostrings_2.18.4 IRanges_1.8.9 limma_3.6.9 > > loaded via a namespace (and not attached): > [1] affyio_1.18.0 preprocessCore_1.12.0 tools_2.12.2 > > > > > ############################### > > Nutritional Sciences and Toxicology, > 119 Morgan Hall > UC.Berkeley,CA 94720 > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793 [[alternative HTML version deleted]]
Alignment Cancer probe Alignment Cancer probe • 980 views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 2 days ago
Seattle, WA, United States
On 03/14/2011 12:15 PM, Alogmail2 at aol.com wrote: > Hi Martin, > > Thanks for the clarification. Does the variant of pattern="DNAString", > subject="DNAString" > > look more realistic to align DNA sequences? The method for DNAString pattern and subject is definitely the one you should use to align DNA sequences. > > One more question. > > What pairwise alignment tool would you recommend if I need to align a > list#1 of 50-mer probes with a list#2 of more longer contigs (hundreds of > bases): > > I take each 50-mer probe from the list#1 and align it to each contig from > the list#2 > > Finally, for each 50-mer probe I need to find the best match in some terms > e.g. in terms of alignment scores. > > I have tried pairwiseAlignment() but it works slowly and looks unfeasible > for the list# 1 of 16798 probes vs. the list#2 of 14338 contigs, i.e. when > output matrix should be 16798 by 14338. Taking advantage of the fact that pairwiseAlignment() is vectorized with respect to the pattern: library(drosophila2probe) probes <- DNAStringSet(drosophila2probe) # 25-mer probes only library(BSgenome.Dmelanogaster.UCSC.dm3) contig1 <- subseq(unmasked(Dmelanogaster$chr2L), end=1000) Then: > system.time(pa1 <- pairwiseAlignment(probes[1:16798], contig1, type="global-local")) user system elapsed 34.386 0.172 36.380 Which translates to 6 days for 14338 contigs. Slow but not unfeasible. > > I have looked at the mailing list and found suggestions to apply > matchPattern/vmatchPattern but they provide the fact of a match for a substring > within a string rather than alignment score to compare alignments. Right, they don't provide a score. But a workaround is to proceed in 2 passes: first use vmatchPattern with max.mismatch set to the max nb of indels you are willing to accept, and then compute your own score for example by computing the "edit distance" between the pattern and their matches (you can try to use neditStartingAt() or stringDist() for this). Cheers, H. > > Thanks > > Best regards, > > Alex > > > > In a message dated 3/14/2011 9:04:04 A.M. Pacific Daylight Time, > mtmorgan at fhcrc.org writes: > > On 03/13/2011 06:58 PM, Alogmail2 at aol.com wrote: >> Dear Mailing List, >> >> Why does pairwiseAlignment() generate different outcomes for the same > input >> sequences defined differently in terms of classes (see showMethods > below): >> >> for >> >> pattern="character", subject="character" > > Hi Alex -- > > here 'character' could be anything -- 'the quick brown fox', whereas > >> >> vs. >> >> pattern="DNAString", subject="DNAString" > > here pattern and subject are drawn from a restricted alphabet and some > symbols have particular meaning (e.g., 'N' does not mean the nucleotide > 'N'). > > Martin > >> >> ? >> >> It generates the same outcomes for the case of >> >> >> pattern="character", subject="character" >> >> vs. >> >> pattern="character", subject="DNAString" >> >> >> It looks like a bug. >> >> Thanks >> >> Alex >> >> >> >> >> >> >> #showMethods("pairwiseAlignment") >> #Function: pairwiseAlignment (package Biostrings) >> #pattern="character", subject="character" >> #pattern="character", subject="DNAString" >> # (inherited from: pattern="character", subject="XString") >> #pattern="DNAString", subject="DNAString" >> >>> pattern.1 >> 50-letter "DNAString" instance >> seq: CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC >>> pattern.2 >> [1] "CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC" >>> subject.1 >> 543-letter "DNAString" instance >> seq: >> > AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCT TCAGATT...CAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGCCAG TAGTTACG >> CGTCGT >>> subject.2 >> [1] >> > "AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTC TTCAGATTCGCCCTTCGTCAGCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCGGCCTACGCTGCT TCCTATTAC >> > AGCAGCGCGATGCAGGCCTACAATAGTCAATCGACGTCGGCCTACATGCCAAGCAGTGGATTCTATAA TGGCGCAT >> > CTTCGCAGACGCCCTACGGAGTCCTGGCGCCCTCCACTTACACAACGATGGGCGTTCCCAGTACAAGA GGTTTAGG >> > CCAACAATGTAAAAATGGACAATCATTAGCACAAACGCCTCCGTATTTGAGCTCGTACGGGTCGGCAT TCGGTGGT >> > GTCACAGCCAGCAGTTCGCCTTCGGGTCCACCCGCCTACGCGTCCGCTTATGGATCGGCATACAATAG CGCCACCG >> > CCGCCCAATCGTTCACCAACAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTAC GGCCAGGC >> CAGTAGTTACGCGTCGT" >>> >> >>> >> > pairwiseAlignment(pattern=pattern.1,subject=subject.1,type="global- local") >> Global-Local PairwiseAlignedFixedSubject (1 of 1) >> pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC >> subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC >> score: -91.50367 >>> >> > pairwiseAlignment(pattern=pattern.2,subject=subject.2,type="global- local") >> Global-Local PairwiseAlignedFixedSubject (1 of 1) >> pattern: [1] CTGC--CATGGCAAAGCTC--GCTGCC-TCAGAGGCCGCCAC- AATGGTTGCGCAC >> subject: [80] CTTCGTCA--GCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCG --GCCTAC >> score: 95.69296 >>> >> > pairwiseAlignment(pattern=pattern.2,subject=subject.1,type="global- local") >> Global-Local PairwiseAlignedFixedSubject (1 of 1) >> pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC >> subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC >> score: -91.50367 >> >>> sessionInfo() >> R version 2.12.2 (2011-02-25) >> Platform: i386-pc-mingw32/i386 (32-bit) >> >> locale: >> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United >> States.1252 LC_MONETARY=English_United States.1252 >> [4] LC_NUMERIC=C LC_TIME=English_United >> States.1252 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] altcdfenvs_2.12.0 hypergraph_1.22.0 graph_1.28.0 >> makecdfenv_1.28.0 affy_1.28.0 Biobase_2.10.0 GeneR_2.20.0 > seqinr_3.0-1 >> [9] Biostrings_2.18.4 IRanges_1.8.9 limma_3.6.9 >> >> loaded via a namespace (and not attached): >> [1] affyio_1.18.0 preprocessCore_1.12.0 tools_2.12.2 >> >> >> >> >> ############################### >> >> Nutritional Sciences and Toxicology, >> 119 Morgan Hall >> UC.Berkeley,CA 94720 >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT

Login before adding your answer.

Traffic: 796 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6