issue with Views on PairwiseAlignmentsSingleSubject?
1
0
Entering edit mode
Janet Young ▴ 740
@janet-young-2360
Last seen 5.1 years ago
Fred Hutchinson Cancer Research Center,…
Hi there, I've been using Biostrings' pairwiseAlignment for various things. I just saw the Views feature described in help("PairwiseAlignments- class") and thought I'd give it a try, but it's not behaving as I'd expect in cases where my subject sequence is one sequence selected from a DNAStringSet (rather than starting with a DNAString). I think I found a bug (?) although I could also be misunderstanding what Views is supposed to do. I updated to the new Bioc devel and new R today, but had the same issue before updating. I think the code chunk below will explain - does it make sense? thanks very much, Janet library(Biostrings) ## example seqs (taken from ?pairwiseAlignment) s1 <- DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG") s2 <- DNAString("GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAAT TTTCATC") ### make some sequence sets s1a <- DNAStringSet(c(as.character(s1),as.character(s1))) s2a <- DNAStringSet(c(as.character(s2),as.character(s2))) ### align myAln <- pairwiseAlignment ( s1a[1], s2a[1]) ### I think the next line of code should give me a Views on s1a[1], but instead it gives me Views on all the seqs of s1a concatenated together Views(myAln) ##### and, an aside, it would seem intuitive to me that the code below should work to make a stringset of two sequences, but instead it concetenates them to one sequence s1b <- DNAStringSet(c(s1,s1)) ######### sessionInfo() R Under development (unstable) (2012-10-03 r60868) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Biostrings_2.27.1 IRanges_1.17.0 BiocGenerics_0.5.0 loaded via a namespace (and not attached): [1] parallel_2.16.0 stats4_2.16.0
• 866 views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 7 hours ago
Seattle, WA, United States
Hi Janet, Yes, this is clearly a bug. Thanks for reporting it. Just to make the problem even more apparent: subject1 <- paste(rep("N", 60), collapse="") subject2 <- "GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC" subject <- DNAStringSet(c(subject1, subject2)) pattern <- DNAStringSet(c("ATATAAAAATTAATT", "CTTCCATTTCG")) myAln <- pairwiseAlignment(pattern, subject[2]) Then: > myAln Global PairwiseAlignmentsSingleSubject (1 of 2) pattern: [1] ATATAAAAAT-TAATT subject: [39] ATATAAAAATATAATT score: -180.2737 > Views(myAln) Views on a 120-letter DNAString subject subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...AAATATATAAATATATAAAAATATAATTTTCAT C views: start end width [1] 39 54 16 [NNNNNNNNNNNNNNNN] [2] 1 19 19 [NNNNNNNNNNNNNNNNNNN] A fix is coming (in Biostrings 2.26.1 and 2.27.2). With this fix: > Views(myAln) Views on a 60-letter DNAString subject subject: GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC views: start end width [1] 39 54 16 [ATATAAAAATATAATT] [2] 1 19 19 [GTTTCACTACTTCCTTTCG] And looking closely at 'myAln' with writePairwiseAlignments() will confirm those views: > writePairwiseAlignments(myAln) ######################################## # Program: Biostrings (version 2.26.0), a Bioconductor package # Rundate: Fri Oct 5 16:15:47 2012 ######################################## #======================================= # # Aligned_sequences: 2 # 1: P1 # 2: S1 # Matrix: NA # Gap_penalty: 14.0 # Extend_penalty: 4.0 # # Length: 60 # Identity: 15/60 (25.0%) # Similarity: NA/60 (NA%) # Gaps: 45/60 (75.0%) # Score: -180.2737 # # #======================================= P1 1 --------------------------------------ATATAAAAAT-T 11 |||||||||| | S1 1 GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATAT 50 P1 12 AATT------ 15 |||| S1 51 AATTTTCATC 60 #======================================= # # Aligned_sequences: 2 # 1: P2 # 2: S1 # Matrix: NA # Gap_penalty: 14.0 # Extend_penalty: 4.0 # # Length: 60 # Identity: 9/60 (15.0%) # Similarity: NA/60 (NA%) # Gaps: 49/60 (81.7%) # Score: -209.9628 # # #======================================= P2 1 CTTCCA-------- TTTCG------------------------------- 11 || || ||||| S1 1 GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATAT 50 P2 12 ---------- 11 S1 51 AATTTTCATC 60 #--------------------------------------- #--------------------------------------- More below... On 10/05/2012 03:31 PM, Janet Young wrote: > Hi there, > > I've been using Biostrings' pairwiseAlignment for various things. I just saw the Views feature described in help("PairwiseAlignments- class") and thought I'd give it a try, but it's not behaving as I'd expect in cases where my subject sequence is one sequence selected from a DNAStringSet (rather than starting with a DNAString). > > I think I found a bug (?) although I could also be misunderstanding what Views is supposed to do. I updated to the new Bioc devel and new R today, but had the same issue before updating. > > I think the code chunk below will explain - does it make sense? > > thanks very much, > > Janet > > > > library(Biostrings) > > ## example seqs (taken from ?pairwiseAlignment) > s1 <- DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG") > s2 <- DNAString("GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATA ATTTTCATC") > > ### make some sequence sets > s1a <- DNAStringSet(c(as.character(s1),as.character(s1))) > s2a <- DNAStringSet(c(as.character(s2),as.character(s2))) > > > ### align > myAln <- pairwiseAlignment ( s1a[1], s2a[1]) > > > ### I think the next line of code should give me a Views on s1a[1], but instead it gives me Views on all the seqs of s1a concatenated together > Views(myAln) > > ##### and, an aside, it would seem intuitive to me that the code below should work to make a stringset of two sequences, but instead it concetenates them to one sequence > s1b <- DNAStringSet(c(s1,s1)) Combining (with c()) 2 vectors of n1 and n2 elements, respectively, is expected to return a vector of n1 + n2 elements (where the elements of the 2nd vector have been placed after the elements of the 1st vector), and of the same class as the original vectors. In a DNAString object, the elements are the letters and the length of the object is the number of letters: > x <- DNAString("AAAAACTTA") > x 9-letter "DNAString" instance seq: AAAAACTTA > length(x) [1] 9 > x[6:7] 2-letter "DNAString" instance seq: CT So doing c() on DNAString objects does: > c(x, DNAString("NN")) 11-letter "DNAString" instance seq: AAAAACTTANN In a DNAStringSet object, the elements are the sequences (represented as DNAString objects) and the length of the object is the number of sequences: > dna <- DNAStringSet("AAAAACTTA") > dna A DNAStringSet instance of length 1 width seq [1] 9 AAAAACTTA > length(dna) [1] 1 So doing c() on DNAStringSet objects does: > dna2 <- c(dna, dna) > dna2 A DNAStringSet instance of length 2 width seq [1] 9 AAAAACTTA [2] 9 AAAAACTTA > length(dna2) [1] 2 > dna2[[1]] 9-letter "DNAString" instance seq: AAAAACTTA > dna2[[2]] 9-letter "DNAString" instance seq: AAAAACTTA > dna2[[3]] Error in checkAndTranslateDbleBracketSubscript(x, i) : subscript out of bounds So if you had made 's1' a DNAStringSet instead of a DNAString, you would have gotten what you wanted: s1 <- DNAStringSet("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG") s1b <- c(s1,s1) For those reasons a DNAStringSet object should be seen as the analog of an ordinary character vector. But not DNAString. There is actually no good analogy for DNAString in R ordinary objects. The closest to it is maybe a raw vector: a DNAString object can also be seen as a vector of bytes. HTH, H. > > ######### > sessionInfo() > > R Under development (unstable) (2012-10-03 r60868) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Biostrings_2.27.1 IRanges_1.17.0 BiocGenerics_0.5.0 > > loaded via a namespace (and not attached): > [1] parallel_2.16.0 stats4_2.16.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT

Login before adding your answer.

Traffic: 753 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6