Fwd: BioStrings PairwiseAlignment deletion() insertion() function
2
0
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States
Hi Marcin, On 09/27/2013 07:48 AM, Marcin Imielinski wrote: > Sorry, forgot to cc you. Thanks in advance! > > ---------- Forwarded message ---------- > From: *Marcin Imielinski* <marcin at="" broad.mit.edu=""> <mailto:marcin at="" broad.mit.edu="">> > Date: Fri, Sep 27, 2013 at 10:40 AM > Subject: BioStrings PairwiseAlignment deletion() insertion() function > To: bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > > > Hi - > > I'm confused about the output of deletion(pa) and insertion(pa) > functions for pa = pairwiseAlignment(). My understanding is that they > should output IRanges corresponding to gaps in the pattern (deletion()) > and gaps in the subject (insertion()) in terms of alignment coordinates. > > However, it appears that the outputted ranges can overlap. For example, > the alignment (below) of a 101 letter pattern and 404 letter subject. > The deletion ranges overlap. What sequence positions do these ranges > refer to (pattern, subject, or alignment)? > > Is this is a bug or am I misinterpreting this output? It's a questionable design choice but the ranges describing the deletions are reported with respect to the "original" (aka unaligned) pattern, not to the aligned pattern. For example the 1st range you see in 'deletion(pa)' means there is a deletion of 3 letters (when going from subject to pattern) and that this deletion starts at position 4 in the original pattern. With such conventions, the *ranges* describing the deletions can overlap but the deletions don't, which I must admit is very confusing. As you noticed, those ranges can be shifted to make them refer to the aligned pattern: > offset <- c(0L, cumsum(head(width(deletion(pa)[[1]]), n=-1))) > shift(deletion(pa)[[1]], offset) IRanges of length 9 start end width [1] 4 6 3 [2] 21 39 19 [3] 49 76 28 [4] 84 128 45 [5] 135 172 38 [6] 183 186 4 [7] 195 204 10 [8] 212 215 4 [9] 245 284 40 Hope this helps, H. > > Thanks > Marcin > > ############################################################## > > > str[1] > [1] > "TCCTTGCACATTGATAAGTTCACATCTGAAATTTGCATGACATAAACATACAGTTGAGAAGGAGAGA ACGTATGCCCTATGGTAAATATTGACATTTTAAA" > > str[2] > [1] > "CTGGGCTTTCGATGAAATAGTTCATTTATCTGTGGGTAGATATTACTTACTGGTTGAGTTAAACTGG GTTAAACATCAATTCTATTTCCATTTTTCATTTTTATAAATAGGTACTGAGAATCTTTGTTCATATAAAT AGATGGATAGGATTAGCCACTTCTTTGAATTTCTTTTTCAAGTTTCATGCCAAGATTCACATCATAACAC ATGTAACTGCATGTCTGGATGGAGAACAGATGTACCTATGCAGCGGCAGGGACATCAACACTCTCACTGA TGAATTGGCCGAGGAATGAGGAATAGCACAAATCAGCTACGGAACATTGACAAACTGGGAGCTAAACTTT GCTTCATGCCTGTGAGGCAGTATTTTGATGAGCGGTGGATGCCCAGTGCTTCCTTGT" > > > > pa = pairwiseAlignment(str[1], str[2]) > > deletion(pa) > IRangesList of length 1 > [[1]] > IRanges of length 9 > start end width > [1] 4 6 3 > [2] 18 36 19 > [3] 27 54 28 > [4] 34 78 45 > [5] 40 77 38 > [6] 50 53 4 > [7] 58 67 10 > [8] 65 68 4 > [9] 94 133 40 > > > insertion(pa) > IRangesList of length 1 > [[1]] > IRanges of length 1 > start end width > [1] 234 235 2 > > > nchar(pa) > [1] 292 > > > as.character(aligned(pattern(pa))) > [1] > "TCC---TTGCACATTGATAA-------------------GTTCACATC ----------------------------TGAAATT ---------------------------------------------TGCATG --------------------------------------ACATAAACAT----ACAGTTGA---------- GAAGGAG----AGAACGTATGCCCTATGGTAAATATTGAC ----------------------------------------ATTTTAAA" > > as.character(aligned(subject(pa))) > [1] > "TCCATTTTTCATTTTTATAAATAGGTACTGAGAATCTTTGTTCATATAAATAGATGGATAGGATTAG CCACTTCTTTGAATTTCTTTTTCAAGTTTCATGCCAAGATTCACATCATAACACATGTAACTGCATGTCT GGATGGAGAACAGATGTACCTATGCAGCGGCAGGGACATCAACACTCTCACTGATGAATTGGCCGAGGAA TGAGGAATAGCACAAATCAGCTACGG-- AACATTGACAAACTGGGAGCTAAACTTTGCTTCATGCCTGTGAGGCAGTATTTTGAT" > > ### here is what I would expect deletion(pa) to output ... notice that > it resembles > ### the above deletion(pa) output with a shift corresponding to > cumsum(width) > ### is this a bug? > > > as(which(strsplit(as.character(aligned(subject(pa))), '')[[1]] == > "-"), 'IRanges') > IRanges of length 9 > start end width > [1] 4 6 3 > [2] 21 39 19 > [3] 49 76 28 > [4] 84 128 45 > [5] 135 172 38 > [6] 183 186 4 > [7] 195 204 10 > [8] 212 215 4 > [9] 245 284 40 > > > > sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] Biostrings_2.28.0 IRanges_1.18.4 BiocGenerics_0.6.0 > [4] BiocInstaller_1.10.3 multicore_0.1-7 > > loaded via a namespace (and not attached): > [1] stats4_3.0.0 tools_3.0.0 > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
Alignment Cancer Biostrings IRanges Alignment Cancer Biostrings IRanges • 1.3k views
ADD COMMENT
0
Entering edit mode
@marcin-imielinski-5383
Last seen 10.3 years ago
Thanks, but I'm still a bit confused. str1 (pattern) has width 101 and str2 (subject) has width 404. If deletion() specifies pattern gaps as coordinates on str1 then why is the 9th deletion range 94-133? I suppose that the "end" coordinate of these indel ranges are irrelevant - you are just specifying a gap of width 40 at pattern position 94, and the 133 "endpoint" is just an artifact of using IRanges to specify these gaps. It would seem to be more intuitive to specify subject / pattern gaps as ranges on the alignment - just my two cents. With that said, I'd like to thank you and the rest of the FHCRC team for these fantastic packages (Biostrings, IRanges, GenomicRanges, etc)! On Fri, Sep 27, 2013 at 8:22 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Marcin, > > > On 09/27/2013 07:48 AM, Marcin Imielinski wrote: > >> Sorry, forgot to cc you. Thanks in advance! >> >> ---------- Forwarded message ---------- >> From: *Marcin Imielinski* <marcin@broad.mit.edu>> <mailto:marcin@broad.mit.edu>> >> Date: Fri, Sep 27, 2013 at 10:40 AM >> Subject: BioStrings PairwiseAlignment deletion() insertion() function >> To: bioconductor@r-project.org <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> > >> >> >> Hi - >> >> I'm confused about the output of deletion(pa) and insertion(pa) >> functions for pa = pairwiseAlignment(). My understanding is that they >> should output IRanges corresponding to gaps in the pattern (deletion()) >> and gaps in the subject (insertion()) in terms of alignment coordinates. >> >> However, it appears that the outputted ranges can overlap. For example, >> the alignment (below) of a 101 letter pattern and 404 letter subject. >> The deletion ranges overlap. What sequence positions do these ranges >> refer to (pattern, subject, or alignment)? >> >> Is this is a bug or am I misinterpreting this output? >> > > It's a questionable design choice but the ranges describing the > deletions are reported with respect to the "original" (aka unaligned) > pattern, not to the aligned pattern. For example the 1st range you > see in 'deletion(pa)' means there is a deletion of 3 letters (when > going from subject to pattern) and that this deletion starts at position > 4 in the original pattern. With such conventions, the *ranges* > describing the deletions can overlap but the deletions don't, which > I must admit is very confusing. > > As you noticed, those ranges can be shifted to make them refer to the > aligned pattern: > > > offset <- c(0L, cumsum(head(width(deletion(pa)**[[1]]), n=-1))) > > shift(deletion(pa)[[1]], offset) > > IRanges of length 9 > start end width > [1] 4 6 3 > [2] 21 39 19 > [3] 49 76 28 > [4] 84 128 45 > [5] 135 172 38 > [6] 183 186 4 > [7] 195 204 10 > [8] 212 215 4 > [9] 245 284 40 > > Hope this helps, > H. > > > >> Thanks >> Marcin >> >> ##############################**##############################**## >> >> > str[1] >> [1] >> "**TCCTTGCACATTGATAAGTTCACATCTGAA**ATTTGCATGACATAAACATACAGTTGAGAA** >> GGAGAGAACGTATGCCCTATGGTAAATATT**GACATTTTAAA" >> > str[2] >> [1] >> "**CTGGGCTTTCGATGAAATAGTTCATTTATC**TGTGGGTAGATATTACTTACTGGTTGAGTT** >> AAACTGGGTTAAACATCAATTCTATTTCCA**TTTTTCATTTTTATAAATAGGTACTGAGAA** >> TCTTTGTTCATATAAATAGATGGATAGGAT**TAGCCACTTCTTTGAATTTCTTTTTCAAGT** >> TTCATGCCAAGATTCACATCATAACACATG**TAACTGCATGTCTGGATGGAGAACAGATGT** >> ACCTATGCAGCGGCAGGGACATCAACACTC**TCACTGATGAATTGGCCGAGGAATGAGGAA** >> TAGCACAAATCAGCTACGGAACATTGACAA**ACTGGGAGCTAAACTTTGCTTCATGCCTGT** >> GAGGCAGTATTTTGATGAGCGGTGGATGCC**CAGTGCTTCCTTGT" >> > >> > pa = pairwiseAlignment(str[1], str[2]) >> > deletion(pa) >> IRangesList of length 1 >> [[1]] >> IRanges of length 9 >> start end width >> [1] 4 6 3 >> [2] 18 36 19 >> [3] 27 54 28 >> [4] 34 78 45 >> [5] 40 77 38 >> [6] 50 53 4 >> [7] 58 67 10 >> [8] 65 68 4 >> [9] 94 133 40 >> >> > insertion(pa) >> IRangesList of length 1 >> [[1]] >> IRanges of length 1 >> start end width >> [1] 234 235 2 >> >> > nchar(pa) >> [1] 292 >> >> > as.character(aligned(pattern(**pa))) >> [1] >> "TCC---TTGCACATTGATAA---------**----------GTTCACATC-----------** >> -----------------TGAAATT------**------------------------------** >> ---------TGCATG---------------**-----------------------** >> ACATAAACAT----ACAGTTGA--------**--GAAGGAG----** >> AGAACGTATGCCCTATGGTAAATATTGAC-**------------------------------** >> ---------ATTTTAAA" >> > as.character(aligned(subject(**pa))) >> [1] >> "**TCCATTTTTCATTTTTATAAATAGGTACTG**AGAATCTTTGTTCATATAAATAGATGGATA** >> GGATTAGCCACTTCTTTGAATTTCTTTTTC**AAGTTTCATGCCAAGATTCACATCATAACA** >> CATGTAACTGCATGTCTGGATGGAGAACAG**ATGTACCTATGCAGCGGCAGGGACATCAAC** >> ACTCTCACTGATGAATTGGCCGAGGAATGA**GGAATAGCACAAATCAGCTACGG--** >> AACATTGACAAACTGGGAGCTAAACTTTGC**TTCATGCCTGTGAGGCAGTATTTTGAT" >> >> ### here is what I would expect deletion(pa) to output ... notice that >> it resembles >> ### the above deletion(pa) output with a shift corresponding to >> cumsum(width) >> ### is this a bug? >> >> > as(which(strsplit(as.**character(aligned(subject(pa))**), '')[[1]] == >> "-"), 'IRanges') >> IRanges of length 9 >> start end width >> [1] 4 6 3 >> [2] 21 39 19 >> [3] 49 76 28 >> [4] 84 128 45 >> [5] 135 172 38 >> [6] 183 186 4 >> [7] 195 204 10 >> [8] 212 215 4 >> [9] 245 284 40 >> >> >> > sessionInfo() >> R version 3.0.0 (2013-04-03) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] Biostrings_2.28.0 IRanges_1.18.4 BiocGenerics_0.6.0 >> [4] BiocInstaller_1.10.3 multicore_0.1-7 >> >> loaded via a namespace (and not attached): >> [1] stats4_3.0.0 tools_3.0.0 >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States
On 09/28/2013 02:35 PM, Marcin Imielinski wrote: > Thanks, but I'm still a bit confused. str1 (pattern) has width 101 and > str2 (subject) has width 404. If deletion() specifies pattern gaps as > coordinates on str1 then why is the 9th deletion range 94-133? > > I suppose that the "end" coordinate of these indel ranges are irrelevant > - you are just specifying a gap of width 40 at pattern position 94, and > the 133 "endpoint" is just an artifact of using IRanges to specify these > gaps. Exactly. > > It would seem to be more intuitive to specify subject / pattern gaps as > ranges on the alignment - just my two cents. I agree but these were decisions taken a long time ago and unfortunately we're kind of stuck with them. On way we could try to improve things is by adding an extra argument, e.g. 'on.alignment' (that would need to be FALSE by default) to deletion() and insertion(), to let the user request the ranges to be reported with respect to the alignments. Another way to go, which has my preference, would be to use CIGARs to represent the indels in pairwise alignments. That's what SAMtools does and we already have efficient/well-tested CIGAR utilities in GenomicRanges (see ?`cigar-utils`, they've been refactored in BioC devel) for converting cigars to ranges along the unaligned patterns (query space), or the unaligned subject (reference space), or the alignments ("pairwise space"). Those utilities are used in several places internally e.g. when converting a GAlignments object into a GRanges or GRangesList object or in sequenceLayer(). There is an oportunity to reuse them for PairwiseAlignments object too and to make those objects look more like GAlignments objects by providing the CIGARs of the alignments (via a cigar() getter like for Alignments objects). That would also make PairwiseAlignments objects much more compact in memory. > > With that said, I'd like to thank you and the rest of the FHCRC team for > these fantastic packages (Biostrings, IRanges, GenomicRanges, etc)! You're welcome. Glad you like them and thanks for your feedback! H. > > > On Fri, Sep 27, 2013 at 8:22 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote: > > Hi Marcin, > > > On 09/27/2013 07:48 AM, Marcin Imielinski wrote: > > Sorry, forgot to cc you. Thanks in advance! > > ---------- Forwarded message ---------- > From: *Marcin Imielinski* <marcin at="" broad.mit.edu=""> <mailto:marcin at="" broad.mit.edu=""> > <mailto:marcin at="" broad.mit.edu="" <mailto:marcin="" at="" broad.mit.edu="">>> > Date: Fri, Sep 27, 2013 at 10:40 AM > Subject: BioStrings PairwiseAlignment deletion() insertion() > function > To: bioconductor at r-project.org > <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-__project.org=""> <mailto:bioconductor at="" r-project.org="">> > > > Hi - > > I'm confused about the output of deletion(pa) and insertion(pa) > functions for pa = pairwiseAlignment(). My understanding is > that they > should output IRanges corresponding to gaps in the pattern > (deletion()) > and gaps in the subject (insertion()) in terms of alignment > coordinates. > > However, it appears that the outputted ranges can overlap. For > example, > the alignment (below) of a 101 letter pattern and 404 letter > subject. > The deletion ranges overlap. What sequence positions do these > ranges > refer to (pattern, subject, or alignment)? > > Is this is a bug or am I misinterpreting this output? > > > It's a questionable design choice but the ranges describing the > deletions are reported with respect to the "original" (aka unaligned) > pattern, not to the aligned pattern. For example the 1st range you > see in 'deletion(pa)' means there is a deletion of 3 letters (when > going from subject to pattern) and that this deletion starts at position > 4 in the original pattern. With such conventions, the *ranges* > describing the deletions can overlap but the deletions don't, which > I must admit is very confusing. > > As you noticed, those ranges can be shifted to make them refer to the > aligned pattern: > > > offset <- c(0L, cumsum(head(width(deletion(pa)__[[1]]), n=-1))) > > shift(deletion(pa)[[1]], offset) > > IRanges of length 9 > start end width > [1] 4 6 3 > [2] 21 39 19 > [3] 49 76 28 > [4] 84 128 45 > [5] 135 172 38 > [6] 183 186 4 > [7] 195 204 10 > [8] 212 215 4 > [9] 245 284 40 > > Hope this helps, > H. > > > > Thanks > Marcin > > ##############################__##############################__## > > > str[1] > [1] > "__TCCTTGCACATTGATAAGTTCACATCTGAA__ATTTGCATGACATAAACATACAGTT GAGAA__GGAGAGAACGTATGCCCTATGGTAAATATT__GACATTTTAAA" > > str[2] > [1] > "__CTGGGCTTTCGATGAAATAGTTCATTTATC__TGTGGGTAGATATTACTTACTGGTT GAGTT__AAACTGGGTTAAACATCAATTCTATTTCCA__TTTTTCATTTTTATAAATAGGTACTGAGAA_ _TCTTTGTTCATATAAATAGATGGATAGGAT__TAGCCACTTCTTTGAATTTCTTTTTCAAGT__TTCAT GCCAAGATTCACATCATAACACATG__TAACTGCATGTCTGGATGGAGAACAGATGT__ACCTATGCAGC GGCAGGGACATCAACACTC__TCACTGATGAATTGGCCGAGGAATGAGGAA__TAGCACAAATCAGCTAC GGAACATTGACAA__ACTGGGAGCTAAACTTTGCTTCATGCCTGT__GAGGCAGTATTTTGATGAGCGGT GGATGCC__CAGTGCTTCCTTGT" > > > > pa = pairwiseAlignment(str[1], str[2]) > > deletion(pa) > IRangesList of length 1 > [[1]] > IRanges of length 9 > start end width > [1] 4 6 3 > [2] 18 36 19 > [3] 27 54 28 > [4] 34 78 45 > [5] 40 77 38 > [6] 50 53 4 > [7] 58 67 10 > [8] 65 68 4 > [9] 94 133 40 > > > insertion(pa) > IRangesList of length 1 > [[1]] > IRanges of length 1 > start end width > [1] 234 235 2 > > > nchar(pa) > [1] 292 > > > as.character(aligned(pattern(__pa))) > [1] > "TCC---TTGCACATTGATAA---------__----------GTTCACATC -----------__-----------------TGAAATT------__ ------------------------------__---------TGCATG---------------__ -----------------------__ACATAAACAT----ACAGTTGA--------__--GAAGGAG ----__AGAACGTATGCCCTATGGTAAATATTGAC-__------------------------------__ ---------ATTTTAAA" > > as.character(aligned(subject(__pa))) > [1] > "__TCCATTTTTCATTTTTATAAATAGGTACTG__AGAATCTTTGTTCATATAAATAGAT GGATA__GGATTAGCCACTTCTTTGAATTTCTTTTTC__AAGTTTCATGCCAAGATTCACATCATAACA_ _CATGTAACTGCATGTCTGGATGGAGAACAG__ATGTACCTATGCAGCGGCAGGGACATCAAC__ACTCT CACTGATGAATTGGCCGAGGAATGA__GGAATAGCACAAATCAGCTACGG-- __AACATTGACAAACTGGGAGCTAAACTTTGC__TTCATGCCTGTGAGGCAGTATTTTGAT" > > ### here is what I would expect deletion(pa) to output ... > notice that > it resembles > ### the above deletion(pa) output with a shift corresponding to > cumsum(width) > ### is this a bug? > > > as(which(strsplit(as.__character(aligned(subject(pa))__), > '')[[1]] == > "-"), 'IRanges') > IRanges of length 9 > start end width > [1] 4 6 3 > [2] 21 39 19 > [3] 49 76 28 > [4] 84 128 45 > [5] 135 172 38 > [6] 183 186 4 > [7] 195 204 10 > [8] 212 215 4 > [9] 245 284 40 > > > > sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets > methods > [8] base > > other attached packages: > [1] Biostrings_2.28.0 IRanges_1.18.4 BiocGenerics_0.6.0 > [4] BiocInstaller_1.10.3 multicore_0.1-7 > > loaded via a namespace (and not attached): > [1] stats4_3.0.0 tools_3.0.0 > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT

Login before adding your answer.

Traffic: 959 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6