Formal Definition of Sequence Alignment

Consider an alphabet $Σ$ and two sequences $s$ and $t$ on $Σ$ . Let $\overset{ˉ}{Σ} = Σ \cup {⋄}$ where $⋄$ is some symbol not in $Σ$ , it’s called the gap symbol. $M : \overset{ˉ}{Σ} \times \overset{ˉ}{Σ} \to Z$ . We have a gap penalties functions $g_{s}, g_{t} : N \to Z$ . The functions are monotonic and are $0$ at $0$ . The four boundary gap penalty coefficient $b_{s}, b_{t}, e_{s}, e_{t} \in {0, 1}$ . We want to find the alignment score between the two sequences.

A gap is the maximal substring consist of only $⋄$ ’s. The gap sequence of a string is a sequence of lengths of each gap. Define $s_{⋄}$ be the gap sequence of string $s$ .

$G (x, g, y, a) = xg (a_{1}) + \sum_{i = 2}^{n - 1} g (a_{i}) + y g (a_{n})$ , where $a$ is a sequence of length $n$ .

$A (u, v) = i = 1 \sum ∣ u ∣ M (u_{i}, v_{i}) + G (b_{s}, g_{s}, e_{s}, u_{⋄}) + G (b_{t}, g_{t}, e_{t}, v_{⋄})$

Define $S$ and $T$ be the set of all strings that can be formed by inserting $⋄$ into $s$ and $t$ respectively.

The alignment score is defined as $max {A (u, v) : ∣ u ∣ = ∣ v ∣, u \in S, v \in T}$

Now, once one write an algorithm for this problem, it can be used for many sequence alignment problems on Rosalind.

Hamming Distance: $M (a, a) = - 1$ , and $0$ otherwise. All other function are $- \infty$ .
Finding a Shared Spliced Motif, longest common subsequence. $M (a, a) = 1$ for $a \in Σ$ , $0$ otherwise. Everything else are $0$ .
Edit distance: Levenshtein distance, $M (a, a) = - 1$ for $a \in \overset{ˉ}{Σ}$ , $0$ otherwise. All other are $- \infty$ .
Global alignment: $b_{s} = e_{s} = g_{s}$ , $b_{t} = e_{t} = g_{t}$ .
Overlap alignment: $b_{s} = e_{t} = 0$ , $b_{t} = g_{t}$ , $e_{s} = g_{s}$ .
Fitting alignment: $b_{s} = e_{s} = 0$ , $b_{t} = e_{t} = g_{t}$ .
Semiglobal alignment: $b_{s} = e_{s} = b_{t} = e_{t} = 0$ .
Substring Matching: $b_{s} = e_{s} = 0$ , $g_{s}, g_{t}, b_{t}, e_{t} = - \infty$ .

Posted by Chao Xu on 2013-07-10.

Tags: alignment.