Recognize Uniquely Decodable Codes
A set of strings is called a uniquely decodable code, if has a unique factorization over . Namely, for each element in , there exist a unique finite sequence of elements in , such that . We call the strings in a code string.
Let be a finite set of strings, decide if is a uniquely decodable code.
In general, the infinite version of the problem is undecidable. It is however decidable if is a regular language.
In the entire article, we assume the alphabet size is fixed.
For variable sized alphabet, let be the number of distinct alphabet appeared in . There is an extra factor of or depending on if there exist a comparator for the alphabet.
Define , the set of suffixes of . , we call those initial suffix.
is uniquely decodable if and only if there is no , such that and .
If there exist such so . for some . Consider the string . It has at least two factorization, one start with , the other start with , and .
Let no such exists. Consider some . It can be written as and for , . Assume there is more than one factorization, then and wlog . We arrive , a contradiction.
Consider a a directed graph. . There is an arc from to iff or for some .
is uniquely decodable code iff there is no path from a vertex in to .
For any arc , if , then label the arc by . Otherwise, label the arc with where . It's then easy to see concatenate the labels on any walk from to spells a string of the form for .
By induction, one can show that for a vertex , there exist string , where , if and only if there is a path from to .
In order to compute the arcs on the graph, we need to answer two questions.
- Is a prefix of ? For all and .
- Is a prefix of ? For all and .
Let , . It is known that we can build a generalized suffix tree for in linear time.
For the first question, "Is a prefix of ?", consider we constructed a suffix tree for . Transversing the suffix tree with a code string . Assume we have read a prefix of , we can check if in constant time during the transversal. If , then where . It will add an arc . We use time to find all the arcs can be formed by answering the "Is a prefix of ". In total, we can answer question 1 in time.
For the second question, we can also use the same suffix tree. Transverse the suffix tree with a code string . Find all the leaves in the subtree when the string ends. Those correspond to all strings such that is a prefix. Namely where . The algorithm add an arc . Note there might be many with this property, worst case . This step might run in time.
It's not clear we can construct arcs between the vertices in constant time. The main difficulty lies in two suffix of two different code word might correspond to the same vertex in the graph. If we number the code words and name the suffixes by their length, then we can construct a size table that maps each pair , which represents the suffix of length in the th string, to a vertex that represent the string. Simply build a trie for the reverse of the strings, and do some bookkeeping. This map allow us to add an edge in constant time.
The graph has at most arcs. We add a new vertex that has an arc to each initial vertices and apply a DFS from the new vertex. If it reaches the vertex, we return true, else we return false.
There exist an algorithm to test if is a uniquely decodable code in time, where and .
I have an implementation in Haskell here. Note it doesn't run in exactly time because of the
Map takes time. It can, however, be easily modified to run in time.
 M. Crochemore, W. Rytter, Jewels of stringology, World Scientific, 2002.
 M. Rodeh, A fast test for unique decipherability based on suffix trees (corresp.), Information Theory, IEEE Transactions on. 28 (1982) 648–651 10.1109/TIT.1982.1056535.