The Art Gallery Guardian

Strings with hamming distance exactly 11


Lin Yang asked me about the complexity for the following problem, which is the day 2 part 2 of the advent of code 2018. It is an elegant programming exercise, and also a clever algorithmic exercise. The problem can be summarized below.

Problem1

Given a set WW of nn length mm strings. Decide if there are two of them that differs at exactly one position.

In other words, we want to find two strings in WW with hamming distance 11.

The naive algorithm would have running time O(n2m)O(n^2m). The complexity of the problem have gathered a lot of attention a while ago, for example a post in dev.to, and on reddit. Some of them had a running time of O(nm2)O(nm^2) instead. Some require hashing to get the expected running time of O(mn)O(mn). Here we are interested in an algorithm with worst case O(mn)O(mn) time.

1 An O(mn)O(mn) time algorithm

First, we define some equivalent classes on the strings in WW.

  1. xiyx\equiv^i y, if x[1..i1]=y[1..i1]x[1..i-1]=y[1..i-1]. Namely, if the first i1i-1 elements of xx and yy match.
  2. xiyx\equiv_i y if x[i+1..m]=y[i+1..m]x[i+1..m]=y[i+1..m]. Namely, if the last mi+1m-i+1 elements of xx and yy match.

The algorithm uses the following idea. For each ii, decide if there are any strings xx and yy such that differs in precisely position ii.

Lemma2

For distinct xx and yy, they differ only in position ii if and only if xiyx\equiv^i y and xiyx\equiv_i y.

Let Pi\mathcal{P}_i and Si\mathcal{S}_i be the collection of equivalent classes of i\equiv^i and i\equiv_i, respectively. We show a result related to the meet of partitions.

Lemma3

Let A\mathcal{A} and B\mathcal{B} be partitions of [n][n]. There is an O(n)O(n) time algorithm to test find the sets in {ABAA,BB}\set{ A\cap B | A\in \mathcal{A}, B\in \mathcal{B}}.

Proof

Let A={A1,,Ak}\mathcal{A}=\set{A_1,\ldots,A_k} and B={B1,,B}\mathcal{B} = \set{B_1,\ldots,B_\ell}. We define Ii=(a,b)I_i = (a,b) such that iAai\in A_a and iBbi\in B_b. Then we know ii is in AaBbA_a\cap B_b if Ii=(a,b)I_i=(a,b). Hence we are interested in find the largest set of elements such SS such that for i,jSi,j\in S, Ii=IjI_i=I_j. The simplified problem can be solved in O(n)O(n) time. Indeed, the pair is just a base nn number with 22 digits. We can apply radix sort with running time O(n)O(n) and group by the result.

Note one can also directly use a partition refinement data structure to get the same result.

As a corollary, consider A=Pi\mathcal{A}=\mathcal{P}_i and B=Si\mathcal{B}=\mathcal{S}_i, then we obtain the following lemma.

Lemma4

Given the collections Pi\mathcal{P}_i and Si\mathcal{S}_i, there is an O(n)O(n) time algorithm to test if there are two strings x,yWx,y\in W that differs in precisely position ii.

Lemma5

Finding P1,,Pm\mathcal{P}_1,\ldots,\mathcal{P}_m and S1,,Sm\mathcal{S}_1,\ldots,\mathcal{S}_m can be done in O(mn)O(mn) time.

Proof

To find the equivalent classes, build two tries for the strings. Trie TPT_\mathcal{P} for strings in WW and trie TST_\mathcal{S} for the reversal of strings in WW. Building the tries takes O(mn)O(mn) time. Inspect the nodes at depth i1i-1 in TPT_P and nodes at depth mi+1m-i+1 in TST_S to recover Pi\mathcal{P}_i and Si\mathcal{S}_i in O(n)O(n) time.

Theorem6

There is an O(mn)O(mn) time algorithm that solves Problem 1.

Proof

Finding the sequence of equivalent classes takes O(mn)O(mn) time by Lemma 4. For each ii, checking if there exists x,yWx,y\in W differs in precisely position ii takes O(n)O(n) time by Lemma 3. Since ii ranges from 11 to mm, we obtain the final running time is O(mn)O(mn).

2 Remarks

Ruosong Wang communicated another O(mn)O(mn) solution. It is much easier to describe. Let \diamond be a symbol not in the alphabet. Build a generalized suffix tree over the set of strings S={xxxW}S'=\set{x\diamond x| x\in W}. Traverse the suffix tree, up to level mm, and output true if a path that contains \diamond was traversed, and can lead to more than 22 leaves. Indeed, this means the substring xyx\diamond y appears at least twice. Hence there are at least two strings of the form yaxyax and ybxybx in WW. This definitely hits the optimal running time, but implementing a generalized suffix tree is fairly hard.

We do assume the alphabet size is constant. If the alphabet size is σ\sigma and ordered, then there is an extra factor of logσ\log \sigma in building the tries. The the final running time will be O(mnlogσ)O(mn\log \sigma).

Problem 1 also reduces to finding the closest pair of elements by hamming metric [1]. It does not get us the desired running time though.

3 An implementation in Haskell

The implementation is mostly faithful to the presentation in the article. We did not implement counting sort nor radix sort.

References

[1] K. Min, M.-Y. Kao, H. Zhu, The closest pair problem under the hamming metric, in: H.Q. Ngo (Ed.), Computing and Combinatorics, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009: pp. 205–214.

Posted by Chao Xu on .
Tags: algorithms, strings, tries.