The Art Gallery Guardian

The KMP algorithm in Haskell


Almost all the string algorithms I read are doing index manipulations somewhere. However meddling with indices are never a smart move in pedagogical settings. For example, see the code for the KMP algorithm in C. It's short, to the point, and elegant in it's own way. Except it's hard to see the meaning behind all the operations.

As an example, this article demonstrates how to write the KMP string matching algorithm without all those indices.

The KMP string matching algorithm solves the following problem.

Problem1

Given a string patpat of length mm, return if it exist in texttext of length nn in O(n)O(n) time.

Half of the KMP algorithm implementations are actually the MP algorithm. Twan van Laarhoven's implementation, the earlier version of the KMP package and even Wikipedia's page. Although both KMP and MP runs in O(n)O(n) time, KMP uses at most O(logm)O(\log m) time to advance to match the next element in the list when MP could take O(m)O(m) comparisons. More concretely, KMP could output the sequence m0,,mn1m_0,\ldots,m_{n-1} in O(n)O(n) time, where mi=1m_i=1 iff patpat is a suffix of text[0..i]text[0..i], and the time between output is O(logm)O(\log m).
This added benefit comes at a cost. In the MP algorithm, the failure table has only one 1-1. The failure table for KMP, there is no 1-1, and everything just goes to 00 instead. If one tries to not use any indices, and want to separates the searching and the table building, one would need a special element to treat the 00 positions.

Comparing to the KMP package, the implementation here doesn't use any array.

1 The algorithm

We build the a automaton AA using a input string S=a0,,am1S = a_0,\ldots,a_{m-1}. The automaton consist of states nil,s0,,sm1nil,s_0,\ldots,s_{m-1}.

Because of laziness, it is built incrementally. The total running time is O(n)O(n), even if mm is much larger than nn.

δ\delta is the transition function, δ(si,ai)=si+1\delta(s_i,a_i)=s_{i+1}, and δ(si,x)=fi\delta(s_i,x)=f_i for xaix\neq a_i, where fif_i is the failure state associated with state sis_i. We call aia_i the value of sis_i.

One would also want to know if an state is accepting state or not. This prompt us to use the following declaration for the automaton.

data Automaton a = Node {value   :: a,
                         success :: Automaton a,
                         failure :: Automaton a,
                         accept  :: Bool
                         } | 
                   Null {success :: Automaton a,
                         accept :: Bool}

isNull (Null _ _) = True
isNull _ = False

Assume we have built a automaton, then doing a matching is quite easy. Just simulate the automaton. It's important to notice the next and stay are saying if we want to read the next character or stay at the same character. The code below is a generalized version of matching. It basically does a fold while we match. isInfixOf' is an example of how to use matchFold to test if a list is a infix in another list.

matchFold :: Eq a => Automaton a -> [a] -> ([a]->b->b) -> ([a]->b->b) -> b -> b
matchFold _ [] _ _ identity = identity
matchFold state text nomat mat identity = match' state text
  where match' _ [] = identity
        match' a (x:xs)
          | not (isNull a) && value a /= x = stay
          | not (accept a)                 = nomat (x:xs) next
          | otherwise                      = mat   (x:xs) next
          where next = match' (success a) xs
                stay = match' (failure a) (x:xs)

isInfixOf' :: Eq a => [a] -> [a] -> Bool
isInfixOf' pattern text
 | null pattern = True
 | otherwise    = or $ matchFold (buildAutomaton pattern) text (const (False:)) (const (True:)) []

It is obvious how we can build the entire automaton except for the failure transition.

Another way to reason about it. We impose an order on a set of strings SS by measuring the length of the string, so this order is a linear order when the set of strings have different length. Let Prefixes(s)Prefixes(s) to be the set of all prefixes of ss, Suffixes(s)Suffixes(s) to be the set of all suffixes of ss. Border(s)=Prefixes(s)Suffixes(s)Border(sa)={xxBorder(s),xa∉Border(sa)}b(s)=max(Border(s)\{s})b(sa)=max(Border(sa))\displaystyle \begin{aligned} Border(s) &= Prefixes(s)\cap Suffixes(s)\\ Border'(sa) &= \{x|x\in Border(s),xa\not\in Border(sa)\}\\ b(s) &= \max(Border(s)\backslash \{s\})\\ b'(sa) &= \max(Border'(sa)) \end{aligned}

Assume we try to compute b(x)b(x) for a string xx, we first build the function nextnext, such that nextnext iterates through Prefix(x)Prefix(x) by length, and lastlast is a function that returns the last element in the string. We have the following relation for sasa a prefix of xx, where ss is a string and aa is in the alphabet.

b(sa)={next(b(s))last(next(b(s)))=anext(b(b(s)a))otherwise\displaystyle b(sa) = \begin{cases} next(b(s)) & last(next(b(s))) = a\\ next(b(b(s)a)) & otherwise \end{cases}

b(sa)={b(next(b(s)))last(b(next(b(s))))=ab(b(next(b(s))a))otherwise\displaystyle b'(sa) = \begin{cases} b'(next(b(s))) & last(b'(next(b(s)))) = a\\ b'(b'(next(b(s))a)) & otherwise \end{cases} In fact, one can see the failure function is precisely bb'. What's not clear is how can one compute this function without keep track of the entire history of bb.

The essential function is build ys s. It builds the automaton by consume the remaining string, and s records essential information to compute the transition of the failure function.

  1. values=valuetvalue s = value t, then b(t)=b(s)b(t) = b(s), and we would know b(next(t))=next(s)b'(next(t)) = next(s).
  2. valuesvaluetvalue s \neq value t, then b(t)=sb(t) = s, and we would compute b(next(t))b'(next(t)) by searching back through failure edges.

Note we have also computed b(next(t))b'(next(t)), which allow us to compute the next node next(t)next(t).

So in build ys s, s will precisely store the state b(t)b'(t).

buildAutomaton :: Eq a => [a] -> Automaton a
buildAutomaton (x:xs) = automaton
  where automaton = Node x (build xs automaton) (Null automaton False) (null xs)
        build [] s = s
        build (x:xs) s
         | x == value s = success s `nextNode` failure s
         | otherwise    = newS      `nextNode` s
         where nextNode a b = Node x (build xs a) b (null xs)
               newS         = success $ until (\s->isNull s || x == value s) failure s
Posted by Chao Xu on .
Tags: .