Strings with hamming distance exactly
Lin Yang asked me about the complexity for the following problem, which is the day 2 part 2 of the advent of code 2018. It is an elegant programming exercise, and also a clever algorithmic exercise. The problem can be summarized below.
Given a set of length strings. Decide if there are two of them that differs at exactly one position.
In other words, we want to find two strings in with hamming distance .
The naive algorithm would have running time . The complexity of the problem have gathered a lot of attention a while ago, for example a post in dev.to, and on reddit. Some of them had a running time of instead. Some require hashing to get the expected running time of . Here we are interested in an algorithm with worst case time.
1 An time algorithm
First, we define some equivalent classes on the strings in .
- , if . Namely, if the first elements of and match.
- if . Namely, if the last elements of and match.
The algorithm uses the following idea. For each , decide if there are any strings and such that differs in precisely position .
For distinct and , they differ only in position if and only if and .
Let and be the collection of equivalent classes of and , respectively. We show a result related to the meet of partitions.
Let and be partitions of . There is an time algorithm to test find the sets in .
Let and . We define such that and . Then we know is in if . Hence we are interested in find the largest set of elements such such that for , . The simplified problem can be solved in time. Indeed, the pair is just a base number with digits. We can apply radix sort with running time and group by the result.
Note one can also directly use a partition refinement data structure to get the same result.
As a corollary, consider and , then we obtain the following lemma.
Given the collections and , there is an time algorithm to test if there are two strings that differs in precisely position .
Finding and can be done in time.
To find the equivalent classes, build two tries for the strings. Trie for strings in and trie for the reversal of strings in . Building the tries takes time. Inspect the nodes at depth in and nodes at depth in to recover and in time.
There is an time algorithm that solves Problem 1.
Finding the sequence of equivalent classes takes time by Lemma 4. For each , checking if there exists differs in precisely position takes time by Lemma 3. Since ranges from to , we obtain the final running time is .
Ruosong Wang communicated another solution. It is much easier to describe. Let be a symbol not in the alphabet. Build a generalized suffix tree over the set of strings . Traverse the suffix tree, up to level , and output
true if a path that contains was traversed, and can lead to more than leaves. Indeed, this means the substring appears at least twice. Hence there are at least two strings of the form and in . This definitely hits the optimal running time, but implementing a generalized suffix tree is fairly hard.
We do assume the alphabet size is constant. If the alphabet size is and ordered, then there is an extra factor of in building the tries. The the final running time will be .
3 An implementation in Haskell
The implementation is mostly faithful to the presentation in the article. We did not implement counting sort nor radix sort.
 K. Min, M.-Y. Kao, H. Zhu, The closest pair problem under the hamming metric, in: H.Q. Ngo (Ed.), Computing and Combinatorics, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009: pp. 205–214.