The Art Gallery Guardian

The KMP algorithm in Haskell


Almost all the string algorithms I have read do index manipulation somewhere. However, meddling with indices is never a smart move in a pedagogical setting. For example, see the code for the KMP algorithm in C. It’s short, to the point, and elegant in its own way, except it’s hard to see the meaning behind all the operations.

As an example, this article demonstrates how to write the KMP string matching algorithm without all those indices.

The KMP string matching algorithm solves the following problem.

Problem 1

Given a string $p$ of length $m$, return whether it occurs in a string $t$ of length $n$, in $O(n+m)$ time.

Half of the KMP algorithm implementations out there are actually the MP algorithm: Twan van Laarhoven’s implementation, an earlier version of the KMP package, and even Wikipedia’s page. Although both KMP and MP run in $O(n+m)$ time, KMP uses at most $O(\log m)$ comparisons to advance to the next element in the list, while MP could take $O(m)$ comparisons. More concretely, KMP could output the sequence $b_1, \dots, b_n$ in $O(n+m)$ total time, where $b_i = 1$ iff $p$ is a suffix of $t_1 \dots t_i$, and the time between consecutive outputs is $O(\log m)$.
This added benefit comes at a cost. In the MP algorithm, the failure table contains only a single $-1$ entry, at the very first position. In the failure table for KMP there is no such guarantee, and many positions can go to $-1$ instead. If one tries to avoid indices and wants to separate the searching from the table building, one needs a special element to handle the $-1$ positions.
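To see the difference on a concrete pattern (a worked example of my own, not from the original post), here are the two index-based tables for the pattern "aaab"; the rest of the post avoids such indices entirely.

-- A hedged illustration (not from the original post): the index-based
-- failure tables for the pattern "aaab", worked out by hand.
--
--   position:        0   1   2   3
--   pattern:         a   a   a   b
--   MP  failure:    -1   0   1   2    -- exactly one -1, at position 0
--   KMP failure:    -1  -1  -1   2    -- -1 can occur at many positions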

Compared to the KMP package, the implementation here doesn’t use any arrays.

1 The algorithm

We build an automaton from the input string $p = p_1 \dots p_m$. The automaton consists of the states $S_1, \dots, S_m$, plus a special null state that plays the role of the $-1$ entries.

Because of laziness, it is built incrementally. The total running time is $O(n)$, even if $m$ is much larger than $n$.

$\delta$ is the transition function: $\delta(S_i, p_i) = S_{i+1}$, and $\delta(S_i, x) = \delta(F(S_i), x)$ for $x \neq p_i$, where $F(S_i)$ is the failure state associated with state $S_i$. On the null state, the automaton consumes the current character and moves to $S_1$. We call $p_i$ the value of $S_i$.
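For concreteness (a worked example of my own, not from the original post), here is this data for the pattern "abab"; S1 through S4 are informal labels for the four states.

-- A hedged illustration (not from the original post): the states for the
-- pattern "abab".  S1..S4 are informal labels; Null is the null state.
--
--   state:      S1     S2     S3     S4
--   value:      'a'    'b'    'a'    'b'
--   F(state):   Null   S1     Null   S1
--   accepting:  no     no     no     yes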

One would also want to know whether a state is an accepting state or not. This prompts us to use the following declaration for the automaton.

data Automaton a = Node {value   :: a,
                         success :: Automaton a,
                         failure :: Automaton a,
                         accept  :: Bool
                         } | 
                   Null {success :: Automaton a,
                         accept :: Bool}

isNull (Null _ _) = True
isNull _ = False

Assume we have built the automaton; then matching is quite easy: just simulate the automaton. It’s important to notice that next and stay express whether we want to read the next character or stay at the same character. The code below is a generalized version of matching; it basically does a fold while we match. isInfixOf' is an example of how to use matchFold to test whether a list is an infix of another list.

matchFold :: Eq a => Automaton a -> [a] -> ([a]->b->b) -> ([a]->b->b) -> b -> b
matchFold _ [] _ _ identity = identity
matchFold state text nomat mat identity = match' state text
  where match' _ [] = identity
        match' a (x:xs)
          | not (isNull a) && value a /= x = stay              -- mismatch: follow the failure edge, keep x
          | not (accept a)                 = nomat (x:xs) next -- consume x; no match ends here
          | otherwise                      = mat   (x:xs) next -- consume x; a match of the whole pattern ends here
          where next = match' (success a) xs
                stay = match' (failure a) (x:xs)

isInfixOf' :: Eq a => [a] -> [a] -> Bool
isInfixOf' pattern text
 | null pattern = True
 | otherwise    = or $ matchFold (buildAutomaton pattern) text (const (False:)) (const (True:)) []
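matchFold is general enough to compute more than one Bool. Below is a sketch of my own (not from the original post) that reuses the same False/True stream to report every 1-based position in the text where a copy of the pattern ends; the name matchEndPositions is hypothetical.

-- A sketch of another use of matchFold (not from the original post).
-- matchEndPositions returns every 1-based position in the text at which
-- a copy of the pattern ends.
matchEndPositions :: Eq a => [a] -> [a] -> [Int]
matchEndPositions pattern text
  | null pattern = []
  | otherwise    = [ i | (i, True) <- zip [1..] bits ]
  where -- one Bool per text character: True iff the pattern ends there
        bits = matchFold (buildAutomaton pattern) text
                         (const (False:)) (const (True:)) []

For example, matchEndPositions "ab" "xabab" evaluates to [3,5].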

It is obvious how we can build the entire automaton except for the failure transition.

Here is another way to reason about it. We impose an order on a set of strings by comparing lengths, so this order is a linear order whenever the strings in the set have pairwise distinct lengths. Let $\mathrm{Pre}(w)$ be the set of all prefixes of $w$, and $\mathrm{Suf}(w)$ the set of all suffixes of $w$.

Assume we try to compute the failure states for the prefixes of $p$. Define $b(q)$ to be the longest element of $(\mathrm{Pre}(p) \cap \mathrm{Suf}(q)) \setminus \{q\}$, i.e. the longest proper suffix of $q$ that is also a prefix of $p$; this set is linearly ordered by length, since all its elements have distinct lengths. For a prefix $qa$ of $p$, where $q$ is a nonempty string and $a$ is a character of the alphabet, we have the relation

$$b(qa) = \begin{cases} b(q)\,a & \text{if } b(q)\,a \in \mathrm{Pre}(p),\\ b(b(q)\,a) & \text{otherwise,}\end{cases}$$

together with the base case $b(p_1) = \epsilon$.
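To make the recurrence concrete (a worked example of my own, not from the original post), take $p = \texttt{abab}$:

$$b(\texttt{a}) = \epsilon, \qquad b(\texttt{ab}) = b(b(\texttt{a})\,\texttt{b}) = b(\texttt{b}) = \epsilon, \qquad b(\texttt{aba}) = b(\texttt{ab})\,\texttt{a} = \texttt{a}, \qquad b(\texttt{abab}) = b(\texttt{aba})\,\texttt{b} = \texttt{ab}.$$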

In fact, one can see the failure function is essentially $b$: the failure state of the node built for the prefix $qa$ is determined by $b(qa)$. What’s not clear is how one can compute this function without keeping track of the entire history of $b$.

The essential function is build ys s. It builds the automaton by consuming the remaining string ys, while s records the essential information needed to compute the failure transitions. Suppose the node currently being built carries the character x; there are two cases.

  1. If x equals value s, then the new node’s failure is failure s, and we would know that the next s is simply success s.
  2. If x differs from value s, then the new node’s failure is s itself, and we would compute the next s by searching back through failure edges until we reach a node whose value is x (or the null node), and taking its success.

Note that in both cases we have also computed the new s, which allows us to build the next node.

So in build ys s, if $q$ is the prefix of $p$ already consumed (that is, $p$ with ys removed from its end), then s precisely stores the state reached after matching $b(q)$.

buildAutomaton :: Eq a => [a] -> Automaton a
buildAutomaton (x:xs) = automaton            -- the pattern is assumed non-empty
  where -- the start node; its failure edge is the null state
        automaton = Node x (build xs automaton) (Null automaton False) (null xs)
        build [] s = s
        build (x:xs) s
         | x == value s = success s `nextNode` failure s  -- case 1: failure edge is failure s, next s is success s
         | otherwise    = newS      `nextNode` s          -- case 2: failure edge is s itself, next s is newS
         where -- nextNode a b makes the node for x with failure edge b,
               -- passing a as the s for the rest of the construction
               nextNode a b = Node x (build xs a) b (null xs)
               -- search back through failure edges for a node whose value is x
               newS         = success $ until (\s -> isNull s || x == value s) failure s
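As a final sanity check (my own examples, not part of the original post), here is a small GHCi-style session. The last line illustrates the laziness claim from earlier: the pattern may even be infinite, since only the part of the automaton the text actually reaches is ever built.

-- A hedged usage sketch (not from the original post).
--
-- >>> isInfixOf' "abab" "abaababab"
-- True
-- >>> isInfixOf' "abc" "ababab"
-- False
-- >>> matchEndPositions "ab" "xabab"   -- the helper sketched earlier
-- [3,5]
-- >>> isInfixOf' (repeat 'a') "aaab"   -- infinite pattern, finite text
-- False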
Posted by Chao Xu.
Tags: Algorithm.