The Art Gallery Guardian

Recognize Uniquely Decodable Codes


A set of strings CC is called a uniquely decodable code, if CC^* has a unique factorization over CC. Namely, for each element ss in CC^*, there exist a unique finite sequence {si}i=1m\{s_i\}_{i=1}^m of elements in CC, such that s1sm=ss_1\ldots s_m = s. We call the strings in CC a code string.

Problem1Uniquely Decodable Code Recognition

Let CC be a finite set of strings, decide if CC is a uniquely decodable code.

In general, the infinite version of the problem is undecidable. It is however decidable if CC is a regular language.

To solve this problem, we shall describe a formulation of Sardinas–Patterson algorithm, which combines the description of [1] and [2].

In the entire article, we assume the alphabet size is fixed.

For variable sized alphabet, let σ\sigma be the number of distinct alphabet appeared in CC. There is an extra factor of σ\sigma or logσ\log \sigma depending on if there exist a comparator for the alphabet.

Define S(C)={vxv=c,cC}S(C) = \set{v| xv = c, c\in C}, the set of suffixes of CC. I(C)={vcv=c,cc,c,cC}I(C) = \set{ v| c'v = c, c'\neq c, c',c\in C}, we call those initial suffix.

Lemma2

CC is uniquely decodable if and only if there is no vI(C)v\in I(C), such that vsCvs\in C^* and sCs\in C^*.

Proof

If there exist such vv so vs,sCvs,s\in C^*. cv=cc'v=c for some c,cCc,c'\in C. Consider the string cs=c(vs)cs=c'(vs). It has at least two factorization, one start with cc, the other start with cc', and ccc'\neq c.

Let no such vv exists. Consider some uCu\in C^*. It can be written as cscs and csc's' for c,cCc,c'\in C, s,sCs,s'\in C^*. Assume there is more than one factorization, then ccc'\neq c and wlog cv=cc'v=c. We arrive vs=sCvs=s'\in C^*, a contradiction.

Consider a G=(V,A)G=(V,A) a directed graph. V=S(C)V=S(C). There is an arc from aa to bb iff ab=cab=c or cb=acb=a for some cCc\in C.

Theorem3

CC is uniquely decodable code iff there is no path from a vertex in I(C)I(C) to ϵ\epsilon.

Proof

For any arc uvuv, if uv=cCuv=c\in C, then label the arc uvuv by uu. Otherwise, label the arc with cc where cv=ucv = u. It’s then easy to see concatenate the labels on any walk from vv to ϵ\epsilon spells a string of the form vsCvs \in C^* for sCs \in C^*.

By induction, one can show that for a vertex vv, there exist string vsCvs\in C^*, where sCs\in C^*, if and only if there is a path from vv to ε\e.

In order to compute the arcs on the graph, we need to answer two questions.

  1. Is uu a prefix of cc? For all uS(C)u\in S(C) and cCc\in C.
  2. Is cc a prefix of uu? For all uS(C)u\in S(C) and cCc\in C.

Let k=Ck=|C|, n=cCcn=\sum_{c\in C}|c|. It is known that we can build a generalized suffix tree for CC in linear time.

For the first question, “Is uu a prefix of cc?”, consider we constructed a suffix tree TT for CC. Transversing the suffix tree T(C)T(C) with a code string cCc\in C. Assume we have read a prefix uu of cc, we can check if uS(C)u\in S(C) in constant time during the transversal. If uS(C)u\in S(C), then c=uvc=uv where vS(C)v\in S(C). It will add an arc uvuv. We use O(c)O(|c|) time to find all the arcs can be formed by answering the “Is uu a prefix of cc”. In total, we can answer question 1 in O(n)O(n) time.

For the second question, we can also use the same suffix tree. Transverse the suffix tree with a code string cc. Find all the leaves in the subtree when the string cc ends. Those correspond to all strings uS(C)u\in S(C) such that cc is a prefix. Namely u=cvu=cv where vS(C)v\in S(C). The algorithm add an arc uvuv. Note there might be many uu with this property, worst case O(n)O(n). This step might run in O(nk)O(nk) time.

It’s not clear we can construct arcs between the vertices in constant time. The main difficulty lies in two suffix of two different code word might correspond to the same vertex in the graph. If we number the code words and name the suffixes by their length, then we can construct a O(n)O(n) size table that maps each pair (i,j)(i,j), which represents the suffix of length jj in the iith string, to a vertex that represent the string. Simply build a trie for the reverse of the strings, and do some bookkeeping. This map allow us to add an edge in constant time.

The graph GG has at most O(nk)O(nk) arcs. We add a new vertex that has an arc to each initial vertices and apply a DFS from the new vertex. If it reaches the ϵ\epsilon vertex, we return true, else we return false.

Theorem4

There exist an algorithm to test if CC is a uniquely decodable code in O(nk)O(nk) time, where k=Ck=|C| and n=cCcn=\sum_{c\in C} |c|.

I have an implementation in Haskell here. Note it doesn’t run in exactly O(nk)O(nk) time because of the Map takes O(logn)O(\log n) time. It can, however, be easily modified to run in O(nk)O(nk) time.

References

[1] M. Crochemore, W. Rytter, Jewels of stringology, World Scientific, 2002.

[2] M. Rodeh, A fast test for unique decipherability based on suffix trees (corresp.), Information Theory, IEEE Transactions on. 28 (1982) 648–651 10.1109/TIT.1982.1056535.

Posted by Chao Xu on .
Tags: Algorithm.