pairwiseAlignment generates different outcomes for the same input sequences
2
0
Entering edit mode
@alogmail2aolcom-4540
Last seen 10.0 years ago
Dear Mailing List, Why does pairwiseAlignment() generate different outcomes for the same input sequences defined differently in terms of classes (see showMethods below): for pattern="character", subject="character" vs. pattern="DNAString", subject="DNAString" ? It generates the same outcomes for the case of pattern="character", subject="character" vs. pattern="character", subject="DNAString" It looks like a bug. Thanks Alex #showMethods("pairwiseAlignment") #Function: pairwiseAlignment (package Biostrings) #pattern="character", subject="character" #pattern="character", subject="DNAString" # (inherited from: pattern="character", subject="XString") #pattern="DNAString", subject="DNAString" > pattern.1 50-letter "DNAString" instance seq: CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC > pattern.2 [1] "CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC" > subject.1 543-letter "DNAString" instance seq: AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCTTC AGATT...CAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGCCAGTA GTTACG CGTCGT > subject.2 [1] "AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCTT CAGATTCGCCCTTCGTCAGCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCGGCCTACGCTGCTTC CTATTAC AGCAGCGCGATGCAGGCCTACAATAGTCAATCGACGTCGGCCTACATGCCAAGCAGTGGATTCTATAATG GCGCAT CTTCGCAGACGCCCTACGGAGTCCTGGCGCCCTCCACTTACACAACGATGGGCGTTCCCAGTACAAGAGG TTTAGG CCAACAATGTAAAAATGGACAATCATTAGCACAAACGCCTCCGTATTTGAGCTCGTACGGGTCGGCATTC GGTGGT GTCACAGCCAGCAGTTCGCCTTCGGGTCCACCCGCCTACGCGTCCGCTTATGGATCGGCATACAATAGCG CCACCG CCGCCCAATCGTTCACCAACAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGG CCAGGC CAGTAGTTACGCGTCGT" > > pairwiseAlignment(pattern=pattern.1,subject=subject.1,type="global- local") Global-Local PairwiseAlignedFixedSubject (1 of 1) pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC score: -91.50367 > pairwiseAlignment(pattern=pattern.2,subject=subject.2,type="global- local") Global-Local PairwiseAlignedFixedSubject (1 of 1) pattern: [1] CTGC--CATGGCAAAGCTC--GCTGCC-TCAGAGGCCGCCAC- AATGGTTGCGCAC subject: [80] CTTCGTCA--GCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCG-- GCCTAC score: 95.69296 > pairwiseAlignment(pattern=pattern.2,subject=subject.1,type="global- local") Global-Local PairwiseAlignedFixedSubject (1 of 1) pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC score: -91.50367 > sessionInfo() R version 2.12.2 (2011-02-25) Platform: i386-pc-mingw32/i386 (32-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] altcdfenvs_2.12.0 hypergraph_1.22.0 graph_1.28.0 makecdfenv_1.28.0 affy_1.28.0 Biobase_2.10.0 GeneR_2.20.0 seqinr_3.0-1 [9] Biostrings_2.18.4 IRanges_1.8.9 limma_3.6.9 loaded via a namespace (and not attached): [1] affyio_1.18.0 preprocessCore_1.12.0 tools_2.12.2 ############################### Nutritional Sciences and Toxicology, 119 Morgan Hall UC.Berkeley,CA 94720 [[alternative HTML version deleted]]
• 1.1k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 4 months ago
United States
On 03/13/2011 06:58 PM, Alogmail2 at aol.com wrote: > Dear Mailing List, > > Why does pairwiseAlignment() generate different outcomes for the same input > sequences defined differently in terms of classes (see showMethods below): > > for > > pattern="character", subject="character" Hi Alex -- here 'character' could be anything -- 'the quick brown fox', whereas > > vs. > > pattern="DNAString", subject="DNAString" here pattern and subject are drawn from a restricted alphabet and some symbols have particular meaning (e.g., 'N' does not mean the nucleotide 'N'). Martin > > ? > > It generates the same outcomes for the case of > > > pattern="character", subject="character" > > vs. > > pattern="character", subject="DNAString" > > > It looks like a bug. > > Thanks > > Alex > > > > > > > #showMethods("pairwiseAlignment") > #Function: pairwiseAlignment (package Biostrings) > #pattern="character", subject="character" > #pattern="character", subject="DNAString" > # (inherited from: pattern="character", subject="XString") > #pattern="DNAString", subject="DNAString" > >> pattern.1 > 50-letter "DNAString" instance > seq: CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC >> pattern.2 > [1] "CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC" >> subject.1 > 543-letter "DNAString" instance > seq: > AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCT TCAGATT...CAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGCCAG TAGTTACG > CGTCGT >> subject.2 > [1] > "AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTC TTCAGATTCGCCCTTCGTCAGCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCGGCCTACGCTGCT TCCTATTAC > AGCAGCGCGATGCAGGCCTACAATAGTCAATCGACGTCGGCCTACATGCCAAGCAGTGGATTCTATAA TGGCGCAT > CTTCGCAGACGCCCTACGGAGTCCTGGCGCCCTCCACTTACACAACGATGGGCGTTCCCAGTACAAGA GGTTTAGG > CCAACAATGTAAAAATGGACAATCATTAGCACAAACGCCTCCGTATTTGAGCTCGTACGGGTCGGCAT TCGGTGGT > GTCACAGCCAGCAGTTCGCCTTCGGGTCCACCCGCCTACGCGTCCGCTTATGGATCGGCATACAATAG CGCCACCG > CCGCCCAATCGTTCACCAACAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTAC GGCCAGGC > CAGTAGTTACGCGTCGT" >> > >> > pairwiseAlignment(pattern=pattern.1,subject=subject.1,type="global- local") > Global-Local PairwiseAlignedFixedSubject (1 of 1) > pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC > subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC > score: -91.50367 >> > pairwiseAlignment(pattern=pattern.2,subject=subject.2,type="global- local") > Global-Local PairwiseAlignedFixedSubject (1 of 1) > pattern: [1] CTGC--CATGGCAAAGCTC--GCTGCC-TCAGAGGCCGCCAC- AATGGTTGCGCAC > subject: [80] CTTCGTCA--GCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCG-- GCCTAC > score: 95.69296 >> > pairwiseAlignment(pattern=pattern.2,subject=subject.1,type="global- local") > Global-Local PairwiseAlignedFixedSubject (1 of 1) > pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC > subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC > score: -91.50367 > >> sessionInfo() > R version 2.12.2 (2011-02-25) > Platform: i386-pc-mingw32/i386 (32-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United > States.1252 LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C LC_TIME=English_United > States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] altcdfenvs_2.12.0 hypergraph_1.22.0 graph_1.28.0 > makecdfenv_1.28.0 affy_1.28.0 Biobase_2.10.0 GeneR_2.20.0 seqinr_3.0-1 > [9] Biostrings_2.18.4 IRanges_1.8.9 limma_3.6.9 > > loaded via a namespace (and not attached): > [1] affyio_1.18.0 preprocessCore_1.12.0 tools_2.12.2 > > > > > ############################### > > Nutritional Sciences and Toxicology, > 119 Morgan Hall > UC.Berkeley,CA 94720 > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 2 days ago
Seattle, WA, United States
Hi Alex, Yes, I find it surprising too that you don't get the same thing. With this simpler example: > pairwiseAlignment("AAAA", "CCCAAGAATTT", type="global-local") Global-Local PairwiseAlignedFixedSubject (1 of 1) pattern: [1] AA-AA subject: [4] AAGAA score: 17.92695 > pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT"), type="global-local") Global-Local PairwiseAlignedFixedSubject (1 of 1) pattern: [1] AAAA subject: [4] AAGA score: 0.0459826 Note that I would *definititely* expect to get the same thing when a 'substitutionMatrix' is provided. And this seems to be actually the case (as long as all the letters in 'pattern' and 'subject' are represented in the substitution matrix): > mat <- nucleotideSubstitutionMatrix(match=1, mismatch=-3, baseOnly=TRUE) > mat A C G T A 1 -3 -3 -3 C -3 1 -3 -3 G -3 -3 1 -3 T -3 -3 -3 1 > pairwiseAlignment("AAAA", "CCCAAGAATTT", type="global-local", substitutionMatrix=mat) Global-Local PairwiseAlignedFixedSubject (1 of 1) pattern: [1] AAAA subject: [5] AGAA score: 0 > pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT"), type="global-local", substitutionMatrix=mat) Global-Local PairwiseAlignedFixedSubject (1 of 1) pattern: [1] AAAA subject: [5] AGAA score: 0 However, by default (i.e. when no 'substitutionMatrix' is provided), pairwiseAlignment() performs *quality-based* alignments. According to the man page, this means that, internally, a substitution matrix is generated (and used) based on the qualities of the letters in the pattern and the subject (a constant default quality is assigned to all the letters). It's unclear to me why the code generating this substitution matrix would return different matrices depending on whether 'pattern' and 'subject' are ordinary character vectors or DNAString objects. It seems that the matrix generated for DNAString objects makes it harder to see gaps in the pattern. Anyway, I need to have a closer look at this before I can comment more. Cheers, H. On 03/13/2011 06:58 PM, Alogmail2 at aol.com wrote: > Dear Mailing List, > > Why does pairwiseAlignment() generate different outcomes for the same input > sequences defined differently in terms of classes (see showMethods below): > > for > > pattern="character", subject="character" > > vs. > > pattern="DNAString", subject="DNAString" > > ? > > It generates the same outcomes for the case of > > > pattern="character", subject="character" > > vs. > > pattern="character", subject="DNAString" > > > It looks like a bug. > > Thanks > > Alex > > > > > > > #showMethods("pairwiseAlignment") > #Function: pairwiseAlignment (package Biostrings) > #pattern="character", subject="character" > #pattern="character", subject="DNAString" > # (inherited from: pattern="character", subject="XString") > #pattern="DNAString", subject="DNAString" > >> pattern.1 > 50-letter "DNAString" instance > seq: CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC >> pattern.2 > [1] "CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC" >> subject.1 > 543-letter "DNAString" instance > seq: > AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCT TCAGATT...CAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGCCAG TAGTTACG > CGTCGT >> subject.2 > [1] > "AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTC TTCAGATTCGCCCTTCGTCAGCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCGGCCTACGCTGCT TCCTATTAC > AGCAGCGCGATGCAGGCCTACAATAGTCAATCGACGTCGGCCTACATGCCAAGCAGTGGATTCTATAA TGGCGCAT > CTTCGCAGACGCCCTACGGAGTCCTGGCGCCCTCCACTTACACAACGATGGGCGTTCCCAGTACAAGA GGTTTAGG > CCAACAATGTAAAAATGGACAATCATTAGCACAAACGCCTCCGTATTTGAGCTCGTACGGGTCGGCAT TCGGTGGT > GTCACAGCCAGCAGTTCGCCTTCGGGTCCACCCGCCTACGCGTCCGCTTATGGATCGGCATACAATAG CGCCACCG > CCGCCCAATCGTTCACCAACAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTAC GGCCAGGC > CAGTAGTTACGCGTCGT" >> > >> > pairwiseAlignment(pattern=pattern.1,subject=subject.1,type="global- local") > Global-Local PairwiseAlignedFixedSubject (1 of 1) > pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC > subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC > score: -91.50367 >> > pairwiseAlignment(pattern=pattern.2,subject=subject.2,type="global- local") > Global-Local PairwiseAlignedFixedSubject (1 of 1) > pattern: [1] CTGC--CATGGCAAAGCTC--GCTGCC-TCAGAGGCCGCCAC- AATGGTTGCGCAC > subject: [80] CTTCGTCA--GCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCG-- GCCTAC > score: 95.69296 >> > pairwiseAlignment(pattern=pattern.2,subject=subject.1,type="global- local") > Global-Local PairwiseAlignedFixedSubject (1 of 1) > pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC > subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC > score: -91.50367 > >> sessionInfo() > R version 2.12.2 (2011-02-25) > Platform: i386-pc-mingw32/i386 (32-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United > States.1252 LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C LC_TIME=English_United > States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] altcdfenvs_2.12.0 hypergraph_1.22.0 graph_1.28.0 > makecdfenv_1.28.0 affy_1.28.0 Biobase_2.10.0 GeneR_2.20.0 seqinr_3.0-1 > [9] Biostrings_2.18.4 IRanges_1.8.9 limma_3.6.9 > > loaded via a namespace (and not attached): > [1] affyio_1.18.0 preprocessCore_1.12.0 tools_2.12.2 > > > > > ############################### > > Nutritional Sciences and Toxicology, > 119 Morgan Hall > UC.Berkeley,CA 94720 > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
Hi Alex, For *quality-based* alignments, pairwiseAlignment() uses the length of the underlying alphabet to compute the substitution matrix. So I did the following change in Biostrings 2.19.17 (devel): when the input is stored in character vectors, BString or BStringSet objects (i.e. not DNAString or AAString), pairwiseAlignment() will now consider that the length of the underlying alphabet is the number of distinct letters in the input. Not 256 anymore. The reason for this change: Why 256? Why would we want the result of an alignment to depend on the internal representation of characters? If characters were represented on 16 bits (unicode) instead of 8 bits, then this would change how "AAAA" aligns to "CCCAAGAATTT". Doesn't sound like a good feature. So when the input contains 4 distinct letters (like in your situation), then having the input in DNAString objects or just ordinary character vectors should always give the same result: > pairwiseAlignment("AAAA", "CCCAAGAATTT") Global PairwiseAlignedFixedSubject (1 of 1) pattern: [1] AAAA subject: [4] AAGA score: -47.95402 > pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT")) Global PairwiseAlignedFixedSubject (1 of 1) pattern: [1] AAAA subject: [4] AAGA score: -47.95402 And if the input contains a number of distinct letters close to 4 (i.e. 3 or 5), results should be close too: > pairwiseAlignment("xxxx", "CCCxxGxxCCC") Global PairwiseAlignedFixedSubject (1 of 1) pattern: [1] xxxx subject: [4] xxGx score: -49.02994 Biostrings 2.19.17 should become available thru biocLite() in the next 24 hours or so. Cheers, H. ----- Original Message ----- From: "Hervé Pagès" <hpages@fhcrc.org> To: Alogmail2 at aol.com Cc: bioconductor at r-project.org Sent: Monday, March 14, 2011 2:57:37 PM Subject: Re: [BioC] pairwiseAlignment generates different outcomes for the same input sequences Hi Alex, Yes, I find it surprising too that you don't get the same thing. With this simpler example: > pairwiseAlignment("AAAA", "CCCAAGAATTT", type="global-local") Global-Local PairwiseAlignedFixedSubject (1 of 1) pattern: [1] AA-AA subject: [4] AAGAA score: 17.92695 > pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT"), type="global-local") Global-Local PairwiseAlignedFixedSubject (1 of 1) pattern: [1] AAAA subject: [4] AAGA score: 0.0459826 Note that I would *definititely* expect to get the same thing when a 'substitutionMatrix' is provided. And this seems to be actually the case (as long as all the letters in 'pattern' and 'subject' are represented in the substitution matrix): > mat <- nucleotideSubstitutionMatrix(match=1, mismatch=-3, baseOnly=TRUE) > mat A C G T A 1 -3 -3 -3 C -3 1 -3 -3 G -3 -3 1 -3 T -3 -3 -3 1 > pairwiseAlignment("AAAA", "CCCAAGAATTT", type="global-local", substitutionMatrix=mat) Global-Local PairwiseAlignedFixedSubject (1 of 1) pattern: [1] AAAA subject: [5] AGAA score: 0 > pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT"), type="global-local", substitutionMatrix=mat) Global-Local PairwiseAlignedFixedSubject (1 of 1) pattern: [1] AAAA subject: [5] AGAA score: 0 However, by default (i.e. when no 'substitutionMatrix' is provided), pairwiseAlignment() performs *quality-based* alignments. According to the man page, this means that, internally, a substitution matrix is generated (and used) based on the qualities of the letters in the pattern and the subject (a constant default quality is assigned to all the letters). It's unclear to me why the code generating this substitution matrix would return different matrices depending on whether 'pattern' and 'subject' are ordinary character vectors or DNAString objects. It seems that the matrix generated for DNAString objects makes it harder to see gaps in the pattern. Anyway, I need to have a closer look at this before I can comment more. Cheers, H. On 03/13/2011 06:58 PM, Alogmail2 at aol.com wrote: > Dear Mailing List, > > Why does pairwiseAlignment() generate different outcomes for the same input > sequences defined differently in terms of classes (see showMethods below): > > for > > pattern="character", subject="character" > > vs. > > pattern="DNAString", subject="DNAString" > > ? > > It generates the same outcomes for the case of > > > pattern="character", subject="character" > > vs. > > pattern="character", subject="DNAString" > > > It looks like a bug. > > Thanks > > Alex > > > > > > > #showMethods("pairwiseAlignment") > #Function: pairwiseAlignment (package Biostrings) > #pattern="character", subject="character" > #pattern="character", subject="DNAString" > # (inherited from: pattern="character", subject="XString") > #pattern="DNAString", subject="DNAString" > >> pattern.1 > 50-letter "DNAString" instance > seq: CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC >> pattern.2 > [1] "CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC" >> subject.1 > 543-letter "DNAString" instance > seq: > AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCT TCAGATT...CAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGCCAG TAGTTACG > CGTCGT >> subject.2 > [1] > "AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTC TTCAGATTCGCCCTTCGTCAGCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCGGCCTACGCTGCT TCCTATTAC > AGCAGCGCGATGCAGGCCTACAATAGTCAATCGACGTCGGCCTACATGCCAAGCAGTGGATTCTATAA TGGCGCAT > CTTCGCAGACGCCCTACGGAGTCCTGGCGCCCTCCACTTACACAACGATGGGCGTTCCCAGTACAAGA GGTTTAGG > CCAACAATGTAAAAATGGACAATCATTAGCACAAACGCCTCCGTATTTGAGCTCGTACGGGTCGGCAT TCGGTGGT > GTCACAGCCAGCAGTTCGCCTTCGGGTCCACCCGCCTACGCGTCCGCTTATGGATCGGCATACAATAG CGCCACCG > CCGCCCAATCGTTCACCAACAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTAC GGCCAGGC > CAGTAGTTACGCGTCGT" >> > >> > pairwiseAlignment(pattern=pattern.1,subject=subject.1,type="global- local") > Global-Local PairwiseAlignedFixedSubject (1 of 1) > pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC > subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC > score: -91.50367 >> > pairwiseAlignment(pattern=pattern.2,subject=subject.2,type="global- local") > Global-Local PairwiseAlignedFixedSubject (1 of 1) > pattern: [1] CTGC--CATGGCAAAGCTC--GCTGCC-TCAGAGGCCGCCAC- AATGGTTGCGCAC > subject: [80] CTTCGTCA--GCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCG-- GCCTAC > score: 95.69296 >> > pairwiseAlignment(pattern=pattern.2,subject=subject.1,type="global- local") > Global-Local PairwiseAlignedFixedSubject (1 of 1) > pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC > subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC > score: -91.50367 > >> sessionInfo() > R version 2.12.2 (2011-02-25) > Platform: i386-pc-mingw32/i386 (32-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United > States.1252 LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C LC_TIME=English_United > States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] altcdfenvs_2.12.0 hypergraph_1.22.0 graph_1.28.0 > makecdfenv_1.28.0 affy_1.28.0 Biobase_2.10.0 GeneR_2.20.0 seqinr_3.0-1 > [9] Biostrings_2.18.4 IRanges_1.8.9 limma_3.6.9 > > loaded via a namespace (and not attached): > [1] affyio_1.18.0 preprocessCore_1.12.0 tools_2.12.2 > > > > > ############################### > > Nutritional Sciences and Toxicology, > 119 Morgan Hall > UC.Berkeley,CA 94720 > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Dear Herve, It looks pretty reasonable. Thanks for your great job. Best, Alex Alex Loguinov, Ph.D. Nutritional Sciences and Toxicology 119 Morgan Hall UC Berkeley, CA 94720 -----Original Message----- From: Pages, Herve <hpages@fhcrc.org> To: Alogmail2 <alogmail2@aol.com> Cc: bioconductor <bioconductor@r-project.org>; Alogmail2 <alogmail2@aol.com> Sent: Tue, Mar 15, 2011 12:08 am Subject: Re: [BioC] pairwiseAlignment generates different outcomes for the same input sequences Hi Alex, For *quality-based* alignments, pairwiseAlignment() uses the length of the underlying alphabet to compute the substitution matrix. So I did the following change in Biostrings 2.19.17 (devel): when the input is stored in character vectors, BString or BStringSet objects (i.e. not DNAString or AAString), pairwiseAlignment() will now consider that the length of the underlying alphabet is the number of distinct letters in the input. Not 256 anymore. The reason for this change: Why 256? Why would we want the result of an alignment to depend on the internal representation of characters? If characters were represented on 16 bits (unicode) instead of 8 bits, then this would change how "AAAA" aligns to "CCCAAGAATTT". Doesn't sound like a good feature. So when the input contains 4 distinct letters (like in your situation), then having the input in DNAString objects or just ordinary character vectors should always give the same result: > pairwiseAlignment("AAAA", "CCCAAGAATTT") Global PairwiseAlignedFixedSubject (1 of 1) pattern: [1] AAAA subject: [4] AAGA score: -47.95402 > pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT")) Global PairwiseAlignedFixedSubject (1 of 1) pattern: [1] AAAA subject: [4] AAGA score: -47.95402 And if the input contains a number of distinct letters close to 4 (i.e. 3 or 5), results should be close too: > pairwiseAlignment("xxxx", "CCCxxGxxCCC") Global PairwiseAlignedFixedSubject (1 of 1) pattern: [1] xxxx subject: [4] xxGx score: -49.02994 Biostrings 2.19.17 should become available thru biocLite() in the next 24 hours or so. Cheers, H. ----- Original Message ----- From: "Hervé Pagès" <hpages@fhcrc.org> To: Alogmail2@aol.com Cc: bioconductor@r-project.org Sent: Monday, March 14, 2011 2:57:37 PM Subject: Re: [BioC] pairwiseAlignment generates different outcomes for the same input sequences Hi Alex, Yes, I find it surprising too that you don't get the same thing. With this simpler example: > pairwiseAlignment("AAAA", "CCCAAGAATTT", type="global-local") Global-Local PairwiseAlignedFixedSubject (1 of 1) pattern: [1] AA-AA subject: [4] AAGAA score: 17.92695 > pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT"), type="global-local") Global-Local PairwiseAlignedFixedSubject (1 of 1) pattern: [1] AAAA subject: [4] AAGA score: 0.0459826 Note that I would *definititely* expect to get the same thing when a 'substitutionMatrix' is provided. And this seems to be actually the case (as long as all the letters in 'pattern' and 'subject' are represented in the substitution matrix): > mat <- nucleotideSubstitutionMatrix(match=1, mismatch=-3, baseOnly=TRUE) > mat A C G T A 1 -3 -3 -3 C -3 1 -3 -3 G -3 -3 1 -3 T -3 -3 -3 1 > pairwiseAlignment("AAAA", "CCCAAGAATTT", type="global-local", substitutionMatrix=mat) Global-Local PairwiseAlignedFixedSubject (1 of 1) pattern: [1] AAAA subject: [5] AGAA score: 0 > pairwiseAlignment(DNAString("AAAA"), DNAString("CCCAAGAATTT"), type="global-local", substitutionMatrix=mat) Global-Local PairwiseAlignedFixedSubject (1 of 1) pattern: [1] AAAA subject: [5] AGAA score: 0 However, by default (i.e. when no 'substitutionMatrix' is provided), pairwiseAlignment() performs *quality-based* alignments. According to the man page, this means that, internally, a substitution matrix is generated (and used) based on the qualities of the letters in the pattern and the subject (a constant default quality is assigned to all the letters). It's unclear to me why the code generating this substitution matrix would return different matrices depending on whether 'pattern' and 'subject' are ordinary character vectors or DNAString objects. It seems that the matrix generated for DNAString objects makes it harder to see gaps in the pattern. Anyway, I need to have a closer look at this before I can comment more. Cheers, H. On 03/13/2011 06:58 PM, Alogmail2@aol.com wrote: > Dear Mailing List, > > Why does pairwiseAlignment() generate different outcomes for the same input > sequences defined differently in terms of classes (see showMethods below): > > for > > pattern="character", subject="character" > > vs. > > pattern="DNAString", subject="DNAString" > > ? > > It generates the same outcomes for the case of > > > pattern="character", subject="character" > > vs. > > pattern="character", subject="DNAString" > > > It looks like a bug. > > Thanks > > Alex > > > > > > > #showMethods("pairwiseAlignment") > #Function: pairwiseAlignment (package Biostrings) > #pattern="character", subject="character" > #pattern="character", subject="DNAString" > # (inherited from: pattern="character", subject="XString") > #pattern="DNAString", subject="DNAString" > >> pattern.1 > 50-letter "DNAString" instance > seq: CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC >> pattern.2 > [1] "CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC" >> subject.1 > 543-letter "DNAString" instance > seq: > AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTCT TCAGATT...CAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTACGGCCAGGCCAG TAGTTACG > CGTCGT >> subject.2 > [1] > "AAAAAAAAAAAAAAAAAAAAATGAAATCCGAACTTCTTGGAGCCTCGTCTGAAGGCCATCTCGGCTC TTCAGATTCGCCCTTCGTCAGCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCGGCCTACGCTGCT TCCTATTAC > AGCAGCGCGATGCAGGCCTACAATAGTCAATCGACGTCGGCCTACATGCCAAGCAGTGGATTCTATAA TGGCGCAT > CTTCGCAGACGCCCTACGGAGTCCTGGCGCCCTCCACTTACACAACGATGGGCGTTCCCAGTACAAGA GGTTTAGG > CCAACAATGTAAAAATGGACAATCATTAGCACAAACGCCTCCGTATTTGAGCTCGTACGGGTCGGCAT TCGGTGGT > GTCACAGCCAGCAGTTCGCCTTCGGGTCCACCCGCCTACGCGTCCGCTTATGGATCGGCATACAATAG CGCCACCG > CCGCCCAATCGTTCACCAACAGTCAACAAGTTCCAACGGGACTAGATTACGGGGCGTATACGCCGTAC GGCCAGGC > CAGTAGTTACGCGTCGT" >> > >> > pairwiseAlignment(pattern=pattern.1,subject=subject.1,type="global- local") > Global-Local PairwiseAlignedFixedSubject (1 of 1) > pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC > subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC > score: -91.50367 >> > pairwiseAlignment(pattern=pattern.2,subject=subject.2,type="global- local") > Global-Local PairwiseAlignedFixedSubject (1 of 1) > pattern: [1] CTGC--CATGGCAAAGCTC--GCTGCC-TCAGAGGCCGCCAC- AATGGTTGCGCAC > subject: [80] CTTCGTCA--GCGACGCTCTGGCTGCCGTCACCGGTGACTACCAATCG-- GCCTAC > score: 95.69296 >> > pairwiseAlignment(pattern=pattern.2,subject=subject.1,type="global- local") > Global-Local PairwiseAlignedFixedSubject (1 of 1) > pattern: [1] CTGCCATGGCAAAGCTCGCTGCCTCAGAGGCCGCCACAATGGTTGCGCAC > subject: [429] TCGGCATACAATAGC--GCCACC------GCCGCC-CAATCGTT---CAC > score: -91.50367 > >> sessionInfo() > R version 2.12.2 (2011-02-25) > Platform: i386-pc-mingw32/i386 (32-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United > States.1252 LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C LC_TIME=English_United > States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] altcdfenvs_2.12.0 hypergraph_1.22.0 graph_1.28.0 > makecdfenv_1.28.0 affy_1.28.0 Biobase_2.10.0 GeneR_2.20.0 seqinr_3.0-1 > [9] Biostrings_2.18.4 IRanges_1.8.9 limma_3.6.9 > > loaded via a namespace (and not attached): > [1] affyio_1.18.0 preprocessCore_1.12.0 tools_2.12.2 > > > > > ############################### > > Nutritional Sciences and Toxicology, > 119 Morgan Hall > UC.Berkeley,CA 94720 > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages@fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 792 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6