Hi, when doing 100 000+ pairwise alignments with pairwiseAlignment()
, I noticed that some (only 3) aligned sequences retrieved from the function were shorter than the input sequences.
Here is one offending sequence pair:
> seqs1 GL349802 "GTCTGTACAGTCTGAACTGTACCATAGAAAATATATAGGTACTACTGTTACTAGTTACTCCAACTGTTGTAGGGTTTGTTGTATAAGTCATCGTCATGACTCATGACGAAGTATTAATTATGCCATGATGGTCAGATGTTGCACGGTATTCTTCAACTGTTATTTGTGCGCCATCTTGATCTTATGTTCTTAAATAACGCCTTAGGTACTGGCTATACACCTACTGTCTTAGCGTAAATCGTGTTGACAATCACACATAACATTTTTATCATTGAACATTATTATTTCTCTTGAGTTGTTAATTTATTGTTAAACCCAACACGGACGTTTCATCGTTAATTATCAATAACTGTTTTGAGTGACTATTTCGATTTTCGTTTACTGTACGCGTGTAACTATACGTTAACTATACGTTTTGTGGCGGACAGCAGACCACGATGGAAAACGAGAGAACGTACGTAGTACCTAGCTTGTTGCAAACAGCGGCACCGAACGCTTGCACCGAGATCGAGCGTGGCAGGTTCGACTTTCTGGAGGTACCGGACGTCGATATGCGACATTGTTGCGAGTCGCTGTTAGAAATATTAAAAATCGAGAAAAACCGAAACGAGGGCAACGAAGGACAGACCATGTCCATCAGGTTGCAGTTGACCAGCAAGGACAAGAGGTTTCTAAACATATGCGAGGAGGCAGCGTCGCATGGGCACATAAATTGCCTGAAACTCGCTCACGAGATTGGCGTCCCTTGGAACGATAAGGCGTGCAAACGGGCTGCGAGTTCGGGGCATCTGGAGTGTCTACGATACGCACGGGAAAACGGGTGCCCGTGGGACGAGAATACGTGCAGCAATGCCGCATTAAAGGGTCACATTGACTGCCTAATATACGCACGGGAAAACGGGTGCCCTTGGAATGAGAATACTTGCGGCCATGCCGCATTGAACGGTCACATGGACTGCCTGATATACGCACGGGAAAACGGGTGCCCTTGGGACCATAATACATGTACCGGTGCCGCAATGAAGGGTCGCATAGACTGCCTGATATACGCACGGGAAAACGGGTGCCCTTGGAACCTTAATACATGTAACGGTGCCGCAATGGAGGGTCACAAGGACTGCCTGGTTTATGCACGGGAAAACGGGTGCCCGTGGAACGAGTTAACGTGCAGCTTTGCCGCATTTTCCGGTCACACTGACTGCCTGATATACGCACGGGAAAACGGGTGCCCGTGGGACAAGTTTACGTGCAACGAAGCCGCATTAAAGGGTCACATTGACTGCCTGATATACGCACGGGAAAACGGGTGCCCGTGGGACACGGATACATGCAACTTTGCAGCGAAATACGGACACAAGGACTGCCTAGTTTACGCACGGGAAAACGGTTGCCCTTGGAATGAGAAAACTTGCAGCTATGCCGCATTGAACGGTCACATGGACTGCTTGATATACGCACGGGAAAACGGGTGCCCGTGGGACGCGAATACGTGCACCTTTGCAGCGAAGAACGGACACAAGGACTGCCTAGTTTACGCACGGGAATACGGGTGCCCTTGGAACGACTGGACATGCTTAAGTGCCGCAACAAACGGTCACATGGACTGCCTGGTATACGCCCGGGAAAACGGGTGCCCGTGGAATGAGAAAACTTGCAGCTATGCCGCTTTGAATGGTCACATGGCCTGCCTGATGTACTCACGGGAAAACGGGTGCCCGTGGGACGCGGATACGTGCAACTTTGCCGCGAAGAACGGCCACAAGGACTGCCTAGTGTACGCACGGGAAAACGGGTGCCCTTGGAATGAGAAAACTTGCAGGAATGCCGCATTGAACGGTGACATAGACTGCCTGATATACGCACGGTTAAACGGGTGCCCGTGGGACGAGAAAACGTGCAGCTTTGCCGCATTTTCCGGTCACATTGACTGCCTGATATACGCACGGGAAAACGGGTGTCCGTGGGACGCGGATACGTGCAACTTTTCCGC" > seqs2 GL350841 "GTCTGTACAGTCTGAACTGTACCATAGAAAATATATAGGTACTACTGTTACTAGTTACTCCAACTGTTGTAGGGTTTGTTGTATAAGTCGTCGTCATGACTCATGACGAAATATTAATTATGCCATGATGGTCAGATCTTGCACGGTTAGCTTAGGTTATTCTTCAACTGTTATTCGTGCGCCATCTTGATCTTATGTTCTTAAAAACGCCTTAGGTACTGGCTATACACCTACTGTCTTAGCGTAAATCGTGTTGACAATCACACATAACATTTTTATCATTGAACATTATTATTTCTCTTGAGTTATTAATTTATTGTTAAAGTCAACACGGACGTTTCATCGTTAATTATCAATAACTGTTTTGAGTGACTATTTCGATTTTCGTTTACTGTACGCGTGTAAGCACTCAACTATACGTTTTGTGGCGGACAGCAGACCACGATGGCAAACGAGAGAACGTACGTAGTACCTAGCTTGTTGCAAACGGCAGCACCGAACGCTTGCACCGAGATCGAGCGTGGCAGGTTCGACTTTCTGGAGGTACCGGACGTCGATATGCGACATTGTTGCGAGTCGCTGTTAGAAATATTAAAAATCGAGAAAAACCGAAACGAGGGCAACGAAGGACAGACCATGTTCATCAGGTTGCAGTTGACCAGCAAGGACAAGAGGTTTCTAAACATATGCGAGGAGGCAGCGTCGCACGGGCACATAAATTGCCTGAAACTCGCTCACGAGATTGGCGTCCCTTGGAACGATAAGGCGTGCAAACGGGCTGCGAGTTCGGGGCATCTGGAGTGTCTACGATACGCACGGGAAAACGGGTGCCCGTGGAATGAGAAAACTTGCGGCCATGCCGCATTGAACGGTCACATGGACTGCCTGATATACGCACGGGAAAACGGGTGCCCTTGGAACTTTGATACATGCGCCAATGCCGCAATGGAGGGTCACAAGGACTGCCTGGTTTATGCACGGGAAAACGGGTGCCCGTGGAATGAGTATACGTGCACCTTTGCCGCATTTTCCGGTCACATTGACTGCCTGATATACGCACGGGAAAACGGGTGCCCGTGGGGCAAGTATACGTGCAGCGAAGCCGCATCAAAGGGTCACATTGACTGCCTGATATACGCACGGGAAAACGGGTGCCCGTGGGACGCAGATACGTGCAACTTTGCAGCGAAATACGGACACAAGGACTGCCTAGTTTACGCACGGGAAAACGGTTGCCCTTGGAATGAGAAAACTTGCAGCTATGCCGCATTGAACGGTCACATGGACTGCTTGATATACGCACGGGAAAACGGGTGCCCGTGGGACGCGAATACGTGCACCTTTGCAGCAAAGAACGGACACAAGGACTGCCTAGTTTACGCACGGGAATACGGGTGCCCTTGGAACGACTGGACATGCTTAAGTGCCGCAACAAACGGTCACATGGACTGCCTGGTATACGCCCGGGAAAACGGGTGCCCGTGGAATGAGAAAACTTGCAGCTATGCCGCTTTGAATGGTCACATGGACTGCCTGATGTACTCACGGGAAAACGGGTGCCCGTGGGACGCGGATACGTGCAACTTTGCCGCGAAGAACGGCCACAAGGACTGCCTAGTTTACGCACGGGAAAACGGGTGCCCTTGGAATGAGAAAACTTGCAGGAATGCCGCATTGAACGGTGACATAGACTGCCTGATATACGCACGGTTAAACGGGTGCCCGTGGGACGAGAAAACGTGCAGCTTTGCCGCATTTTCCGGTCACATTGACTGCCTGATATACGCACGGGAAAACGGGTGTCCGTGGGACGCGGATACGTGCAACTTTTCCGCGAAGAACGGCCACAAGGACTGCCTAGTTTACGCACGGGAAAACGGGTGCCCTTGGAATGAGAAAACTTGCAGCTATGCCGCATTGAACGGTCACATGGACTGCCTGATATTTGCACGGGAAAAAGGGTGCCCGTGGAACGAGAATACGTGCAGCTTTGCCGC"
I called aln = pairwiseAlignment(seqs1,seqs2)
and put the results into a string with: char = as.character(pattern(aln)).
I then removed the deletions with char = gsub("-","",char)
Here is what I get for "char":
> char [1] "GTCTGTACAGTCTGAACTGTACCATAGAAAATATATAGGTACTACTGTTACTAGTTACTCCAACTGTTGTAGGGTTTGTTGTATAAGTCGTCGTCATGACTCATGACGAAATATTAATTATGCCATGATGGTCAGATCTTGCACGGTTAGCTTAGGTTATTCTTCAACTGTTATTCGTGCGCCATCTTGATCTTATGTTCTTAAAAACGCCTTAGGTACTGGCTATACACCTACTGTCTTAGCGTAAATCGTGTTGACAATCACACATAACATTTTTATCATTGAACATTATTATTTCTCTTGAGTTATTAATTTATTGTTAAAGTCAACACGGACGTTTCATCGTTAATTATCAATAACTGTTTTGAGTGACTATTTCGATTTTCGTTTACTGTACGCGTGTAAGCACTCAACTATACGTTTTGTGGCGGACAGCAGACCACGATGGCAAACGAGAGAACGTACGTAGTACCTAGCTTGTTGCAAACGGCAGCACCGAACGCTTGCACCGAGATCGAGCGTGGCAGGTTCGACTTTCTGGAGGTACCGGACGTCGATATGCGACATTGTTGCGAGTCGCTGTTAGAAATATTAAAAATCGAGAAAAACCGAAACGAGGGCAACGAAGGACAGACCATGTTCATCAGGTTGCAGTTGACCAGCAAGGACAAGAGGTTTCTAAACATATGCGAGGAGGCAGCGTCGCACGGGCACATAAATTGCCTGAAACTCGCTCACGAGATTGGCGTCCCTTGGAACGATAAGGCGTGCAAACGGGCTGCGAGTTCGGGGCATCTGGAGTGTCTACGATACGCACGGGAAAACGGGTGCCCGTGGAATGAGAAAACTTGCGGCCATGCCGCATTGAACGGTCACATGGACTGCCTGATATACGCACGGGAAAACGGGTGCCCTTGGAACTTTGATACATGCGCCAATGCCGCAATGGAGGGTCACAAGGACTGCCTGGTTTATGCACGGGAAAACGGGTGCCCGTGGAATGAGTATACGTGCACCTTTGCCGCATTTTCCGGTCACATTGACTGCCTGATATACGCACGGGAAAACGGGTGCCCGTGGGGCAAGTATACGTGCAGCGAAGCCGCATCAAAGGGTCACATTGACTGCCTGATATACGCACGGGAAAACGGGTGCCCGTGGGACGCAGATACGTGCAACTTTGCAGCGAAATACGGACACAAGGACTGCCTAGTTTACGCACGGGAAAACGGTTGCCCTTGGAATGAGAAAACTTGCAGCTATGCCGCATTGAACGGTCACATGGACTGCTTGATATACGCACGGGAAAACGGGTGCCCGTGGGACGCGAATACGTGCACCTTTGCAGCAAAGAACGGACACAAGGACTGCCTAGTTTACGCACGGGAATACGGGTGCCCTTGGAACGACTGGACATGCTTAAGTGCCGCAACAAACGGTCACATGGACTGCCTGGTATACGCCCGGGAAAACGGGTGCCCGTGGAATGAGAAAACTTGCAGCTATGCCGCTTTGAATGGTCACATGGACTGCCTGATGTACTCACGGGAAAACGGGTGCCCGTGGGACGCGGATACGTGCAACTTTGCCGCGAAGAACGGCCACAAGGACTGCCTAGTTTACGCACGGGAAAACGGGTGCCCTTGGAATGAGAAAACTTGCAGGAATGCCGCATTGAACGGTGACATAGACTGCCTGATATACGCACGGTTAAACGGGTGCCCGTGGGACGAGAAAACGTGCAGCTTTGCCGCATTTTCCGGTCACATTGACTGCCTGATATACGCACGGGAAAACGGGTGTCCGTGGGACGCGGATACGTGCAACTTTTCCGC"
Which omits the last 162 bases of seqs2. It appears that these bases correspond to a repeated pattern.
Thanks for looking into it.
Jean
Edit : the issue appears also with the "local" alignment type. And interestingly, seqs1 is not truncated. It seems that seqs2 was truncated before the alignment.
Hi Jean,
To extract the aligned sequences from a PairwiseAlignments object of length > 1 you can use the following code (which basically calls
Biostrings:::.makePostalignedSeqs()
in a loop):Then:
Unfortunately this is pretty slow (might take a couple of minutes on each of your batches of 1000 aligned pairs).
I agree this is not ideal and we will need to work on something better to let the user retrieve the whole aligned sequences.
H.
Thanks. For now I think I'll use other tools to align my sequences. An alternative solution would be to extract the size of the leading gaps. It doesn't seem that the
indel()
method does it.I need the size of the leading gap or else the alignment is useless of me. That's because I'm doing stuff at specific positions of the aligned sequences (which are determined before the alignment), and this won't work if the aligned sequences are truncated at the left.
Jean
You can get the size of the leading gap with:
For example:
This is very fast.
H.
Great! Thanks.