Recognize Uniquely Decodable Codes

A set of strings $C$ is called a uniquely decodable code, if $C^{*}$ has a unique factorization over $C$ . Namely, for each element $s$ in $C^{*}$ , there exist a unique finite sequence ${s_{i}}_{i = 1}^{m}$ of elements in $C$ , such that $s_{1} \dots s_{m} = s$ . We call the strings in $C$ a code string.

Problem1Uniquely Decodable Code Recognition

Let $C$ be a finite set of strings, decide if $C$ is a uniquely decodable code.

In general, the infinite version of the problem is undecidable. It is however decidable if $C$ is a regular language.

To solve this problem, we shall describe a formulation of Sardinas–Patterson algorithm, which combines the description of [1] and [2].

In the entire article, we assume the alphabet size is fixed.

For variable sized alphabet, let $σ$ be the number of distinct alphabet appeared in $C$ . There is an extra factor of $σ$ or $lo g σ$ depending on if there exist a comparator for the alphabet.

Define $S (C) = {v ∣ xv = c, c \in C}$ , the set of suffixes of $C$ . $I (C) = {v ∣ c^{'} v = c, c^{'} \neq = c, c^{'}, c \in C}$ , we call those initial suffix.

Lemma2

$C$ is uniquely decodable if and only if there is no $v \in I (C)$ , such that $v s \in C^{*}$ and $s \in C^{*}$ .

Proof

If there exist such $v$ so $v s, s \in C^{*}$ . $c^{'} v = c$ for some $c, c^{'} \in C$ . Consider the string $cs = c^{'} (v s)$ . It has at least two factorization, one start with $c$ , the other start with $c^{'}$ , and $c^{'} \neq = c$ .

Let no such $v$ exists. Consider some $u \in C^{*}$ . It can be written as $cs$ and $c^{'} s^{'}$ for $c, c^{'} \in C$ , $s, s^{'} \in C^{*}$ . Assume there is more than one factorization, then $c^{'} \neq = c$ and wlog $c^{'} v = c$ . We arrive $v s = s^{'} \in C^{*}$ , a contradiction.

Consider a $G = (V, A)$ a directed graph. $V = S (C)$ . There is an arc from $a$ to $b$ iff $ab = c$ or $c b = a$ for some $c \in C$ .

Theorem3

$C$ is uniquely decodable code iff there is no path from a vertex in $I (C)$ to $ϵ$ .

Proof

For any arc $uv$ , if $uv = c \in C$ , then label the arc $uv$ by $u$ . Otherwise, label the arc with $c$ where $c v = u$ . It’s then easy to see concatenate the labels on any walk from $v$ to $ϵ$ spells a string of the form $v s \in C^{*}$ for $s \in C^{*}$ .

By induction, one can show that for a vertex $v$ , there exist string $v s \in C^{*}$ , where $s \in C^{*}$ , if and only if there is a path from $v$ to $ε$ .

In order to compute the arcs on the graph, we need to answer two questions.

Is $u$ a prefix of $c$ ? For all $u \in S (C)$ and $c \in C$ .
Is $c$ a prefix of $u$ ? For all $u \in S (C)$ and $c \in C$ .

Let $k = ∣ C ∣$ , $n = \sum_{c \in C} ∣ c ∣$ . It is known that we can build a generalized suffix tree for $C$ in linear time.

For the first question, “Is $u$ a prefix of $c$ ?”, consider we constructed a suffix tree $T$ for $C$ . Transversing the suffix tree $T (C)$ with a code string $c \in C$ . Assume we have read a prefix $u$ of $c$ , we can check if $u \in S (C)$ in constant time during the transversal. If $u \in S (C)$ , then $c = uv$ where $v \in S (C)$ . It will add an arc $uv$ . We use $O (∣ c ∣)$ time to find all the arcs can be formed by answering the “Is $u$ a prefix of $c$ ”. In total, we can answer question 1 in $O (n)$ time.

For the second question, we can also use the same suffix tree. Transverse the suffix tree with a code string $c$ . Find all the leaves in the subtree when the string $c$ ends. Those correspond to all strings $u \in S (C)$ such that $c$ is a prefix. Namely $u = c v$ where $v \in S (C)$ . The algorithm add an arc $uv$ . Note there might be many $u$ with this property, worst case $O (n)$ . This step might run in $O (nk)$ time.

It’s not clear we can construct arcs between the vertices in constant time. The main difficulty lies in two suffix of two different code word might correspond to the same vertex in the graph. If we number the code words and name the suffixes by their length, then we can construct a $O (n)$ size table that maps each pair $(i, j)$ , which represents the suffix of length $j$ in the $i$ th string, to a vertex that represent the string. Simply build a trie for the reverse of the strings, and do some bookkeeping. This map allow us to add an edge in constant time.

The graph $G$ has at most $O (nk)$ arcs. We add a new vertex that has an arc to each initial vertices and apply a DFS from the new vertex. If it reaches the $ϵ$ vertex, we return true, else we return false.

Theorem4

There exist an algorithm to test if $C$ is a uniquely decodable code in $O (nk)$ time, where $k = ∣ C ∣$ and $n = \sum_{c \in C} ∣ c ∣$ .

I have an implementation in Haskell here. Note it doesn’t run in exactly $O (nk)$ time because of the Map takes $O (lo g n)$ time. It can, however, be easily modified to run in $O (nk)$ time.

References

[1]

M. Crochemore, W. Rytter, Jewels of stringology, World Scientific, 2002.

[2]

M. Rodeh, A fast test for unique decipherability based on suffix trees (corresp.), Information Theory, IEEE Transactions On. 28 (1982) 648–651 10.1109/TIT.1982.1056535.

Posted by Chao Xu on 2014-04-27.

Tags: Algorithm.