 # Formal Definition of Sequence Alignment

Consider an alphabet $\Sigma$ and two sequences $s$ and $t$ on $\Sigma$. Let $\bar{\Sigma} = \Sigma\cup \{\diamond\}$ where $\diamond$ is some symbol not in $\Sigma$, it’s called the gap symbol. $M:\bar{\Sigma}\times \bar{\Sigma}\to \Z$. We have a gap penalties functions $g_s,g_t:\N\to \Z$. The functions are monotonic and are $0$ at $0$. The four boundary gap penalty coefficient $b_s,b_t,e_s,e_t\in \{0,1\}$. We want to find the alignment score between the two sequences.

A gap is the maximal substring consist of only $\diamond$’s. The gap sequence of a string is a sequence of lengths of each gap. Define $s_\diamond$ be the gap sequence of string $s$.

$G(x,g,y,a) = xg(a_1) + \sum_{i=2}^{n-1} g(a_i) + yg(a_n)$, where $a$ is a sequence of length $n$.

$A(u,v) = \sum_{i=1}^{|u|} M(u_i,v_i) + G(b_s,g_s,e_s,u_\diamond) + G(b_t,g_t,e_t,v_\diamond)$

Define $S$ and $T$ be the set of all strings that can be formed by inserting $\diamond$ into $s$ and $t$ respectively.

The alignment score is defined as $\max \{A(u,v) : |u|=|v|, u\in S, v\in T\}$

Now, once one write an algorithm for this problem, it can be used for many sequence alignment problems on Rosalind.

• Hamming Distance: $M(a,a)=-1$, and $0$ otherwise. All other function are $-\infty$.
• Finding a Shared Spliced Motif, longest common subsequence. $M(a,a)=1$ for $a\in \Sigma$, $0$ otherwise. Everything else are $0$.
• Edit distance: Levenshtein distance, $M(a,a)=-1$ for $a\in \bar{\Sigma}$, $0$ otherwise. All other are $-\infty$.
• Global alignment: $b_s=e_s=g_s$, $b_t=e_t=g_t$.
• Overlap alignment: $b_s=e_t=0$, $b_t=g_t$, $e_s=g_s$.
• Fitting alignment: $b_s=e_s=0$, $b_t=e_t=g_t$.
• Semiglobal alignment: $b_s=e_s=b_t=e_t=0$.
• Substring Matching: $b_s=e_s=0$, $g_s,g_t,b_t,e_t=-\infty$.
Posted by Chao Xu on .
Tags: alignment.