Consider an alphabet Σ and two sequences s and t on Σ. Let Σˉ=Σ∪{⋄} where ⋄ is some symbol not in Σ, it’s called the gap symbol. M:Σˉ×Σˉ→Z. We have a gap penalties functions gs,gt:N→Z. The functions are monotonic and are 0 at 0. The four boundary gap penalty coefficient bs,bt,es,et∈{0,1}. We want to find the alignment score between the two sequences.
A gap is the maximal substring consist of only ⋄’s. The gap sequence of a string is a sequence of lengths of each gap. Define s⋄ be the gap sequence of string s.
G(x,g,y,a)=xg(a1)+∑i=2n−1g(ai)+yg(an), where a is a sequence of length n.
A(u,v)=i=1∑∣u∣M(ui,vi)+G(bs,gs,es,u⋄)+G(bt,gt,et,v⋄)
Define S and T be the set of all strings that can be formed by inserting ⋄ into s and t respectively.
The alignment score is defined as max{A(u,v):∣u∣=∣v∣,u∈S,v∈T}
Now, once one write an algorithm for this problem, it can be used for many sequence alignment problems on Rosalind.
- Hamming Distance: M(a,a)=−1, and 0 otherwise. All other function are −∞.
- Finding a Shared Spliced Motif, longest common subsequence. M(a,a)=1 for a∈Σ, 0 otherwise. Everything else are 0.
- Edit distance: Levenshtein distance, M(a,a)=−1 for a∈Σˉ, 0 otherwise. All other are −∞.
- Global alignment: bs=es=gs, bt=et=gt.
- Overlap alignment: bs=et=0, bt=gt, es=gs.
- Fitting alignment: bs=es=0, bt=et=gt.
- Semiglobal alignment: bs=es=bt=et=0.
- Substring Matching: bs=es=0, gs,gt,bt,et=−∞.