pairwisealignment how to normalize score?
1
0
Entering edit mode
marie • 0
@marie-22260
Last seen 5.1 years ago

Hi - I would like to normalize the score returned from the following so that scores are between 0 and 1

pairwiseAlignment(pattern = "TINA JOY ANDERSON", subject = "TINA ANDERSON", type = "global", gapOpening = 1, gapExtension = 1, scoreOnly = TRUE)

is this possible?

Some understanding of the formula used to get the score would also be great... I had thought this would implement something like the Affine Gap distance...

Thanks in advance. Marie

biostrings • 876 views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 12 hours ago
Seattle, WA, United States

Hi,

Unfortunately the default scoring scheme is quality-based and somewhat opaque so probably best avoid it. The good news is that you can take full control of the scoring scheme by supplying your own substitution matrix:

library(Biostrings)
my_alphabet <- c(LETTERS, letters, " ")
my_substmat <- diag(rep(1L, length(my_alphabet))) - 1L
dimnames(my_substmat) <- list(my_alphabet, my_alphabet)

With our scoring scheme, the best possible score is 0 and this independently of the lengths of the 2 strings:

pairwiseAlignment("TINA", "TINA", type="global",
                  substitutionMatrix=my_substmat, gapOpening=5, gapExtension=1)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: TINA
# subject: TINA
# score: 0 

pairwiseAlignment("TINA ANDERSON", "TINA ANDERSON", type="global",
                  substitutionMatrix=my_substmat, gapOpening=5, gapExtension=1)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: TINA ANDERSON
# subject: TINA ANDERSON
# score: 0 

Any difference between the 2 strings (i.e. any substitution or gap) takes the score to the negative side:

## -1 for each substitution (controlled via our subsitution matrix)
pairwiseAlignment("TINA ANDERSON", "TYNA ANDERSOM", type="global",
                  substitutionMatrix=my_substmat, gapOpening=5, gapExtension=1)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: TINA ANDERSON
# subject: TYNA ANDERSOM
# score: -2 

## -5 for a gap opening + size_of_gap * (-1) (controlled via the gapOpening
## and gapExtension arguments):
pairwiseAlignment("TINA JOY ANDERSON", "TINA ANDERSON", type="global",
                  substitutionMatrix=my_substmat, gapOpening=5, gapExtension=1)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: TINA JOY ANDERSON
# subject: TINA ----ANDERSON
# score: -9 

## With 2 substitutions and a gap of size 4:
pairwiseAlignment("TINA JOY ANDERSON", "TYNA ANDERSOM", type="global",
                  substitutionMatrix=my_substmat, gapOpening=5, gapExtension=1)
# Global PairwiseAlignmentsSingleSubject (1 of 1)
# pattern: TINA JOY ANDERSON
# subject: TYNA ----ANDERSOM
# score: -11 

Note that with this scoring scheme, the worst possible score depends on the length of the 2 strings. If we want to be able to normalize the score, we need to know the worst possible score, which can be tricky... unless we ask pairwiseAlignment() to do that for us:

fill_with_letter <- function(x, letter)
{
    stopifnot(is.character(letter), length(letter) == 1L, nchar(letter) == 1L)
    x2 <- strrep(letter, nchar(x))
    if (!is.character(x))
        x2 <- as(x2, class(x))
    x2
}

## Assuming that all substitutions have the same cost.
worst_possible_score <- function(pattern, subject, substitutionMatrix,
                                 gapOpening=10, gapExtension=4)
{
    # Change the content of pattern and subject to make them 100% different.
    pattern2 <- fill_with_letter(pattern, rownames(substitutionMatrix)[[1L]])
    subject2 <- fill_with_letter(subject, rownames(substitutionMatrix)[[2L]])
    pairwiseAlignment(pattern2, subject2, type="global",
                      substitutionMatrix=substitutionMatrix,
                      gapOpening=gapOpening, gapExtension=gapExtension,
                      scoreOnly=TRUE)
}

Then we'll be able to normalize the score by applying the following generic transformation to it:

normalize_score <- function(score, best_possible_score, worst_possible_score)
{
    (score - worst_possible_score) / (best_possible_score - worst_possible_score)
}

Let's wrap all this in one function that takes as input the object returned by pairwiseAlignment():

## Assuming we are doing global alignments and that the best possible
## score is 0!
normalizedScore <- function(pa, substitutionMatrix)
{
    stopifnot(is(pa, "PairwiseAlignments"), type(pa) == "global")
    pattern <- unaligned(pattern(pa))
    subject <- unaligned(subject(pa))
    wps <- worst_possible_score(pattern, subject, substitutionMatrix,
                                pa@gapOpening, pa@gapExtension)
    normalize_score(score(pa), 0L, wps)
}

Then:

pa1 <- pairwiseAlignment("TINA", "TINA", type="global",
                         substitutionMatrix=my_substmat,
                         gapOpening=5, gapExtension=1)
normalizedScore(pa1, my_substmat)
# [1] 1
pa2 <- pairwiseAlignment("TINA JOY ANDERSON", "TYNA ANDERSOM", type="global",
                         substitutionMatrix=my_substmat,
                         gapOpening=5, gapExtension=1)
normalizedScore(pa2, my_substmat)
# [1] 0.5
pa3 <- pairwiseAlignment("ABCDEF", "abcd", type="global",
                         substitutionMatrix=my_substmat,
                         gapOpening=5, gapExtension=1)
normalizedScore(pa3, my_substmat)
# [1] 0

Not really as straightforward as it should be :-/

H.

ADD COMMENT

Login before adding your answer.

Traffic: 872 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6