 # Strings with hamming distance exactly

Lin Yang asked me about the complexity for the following problem, which is the day 2 part 2 of the advent of code 2018. It is an elegant programming exercise, and also a clever algorithmic exercise. The problem can be summarized below.

Problem1

Given a set of length strings. Decide if there are two of them that differs at exactly one position.

In other words, we want to find two strings in with hamming distance .

The naive algorithm would have running time . The complexity of the problem have gathered a lot of attention a while ago, for example a post in dev.to, and on reddit. Some of them had a running time of instead. Some require hashing to get the expected running time of . Here we are interested in an algorithm with worst case time.

# 1 An time algorithm

First, we define some equivalent classes on the strings in .

1. , if . Namely, if the first elements of and match.
2. if . Namely, if the last elements of and match.

The algorithm uses the following idea. For each , decide if there are any strings and such that differs in precisely position .

Lemma2

For distinct and , they differ only in position if and only if and .

Let and be the collection of equivalent classes of and , respectively. We show a result related to the meet of partitions.

Lemma3

Let and be partitions of . There is an time algorithm to test find the sets in .

Proof

Let and . We define such that and . Then we know is in if . Hence we are interested in find the largest set of elements such such that for , . The simplified problem can be solved in time. Indeed, the pair is just a base number with digits. We can apply radix sort with running time and group by the result.

Note one can also directly use a partition refinement data structure to get the same result.

As a corollary, consider and , then we obtain the following lemma.

Lemma4

Given the collections and , there is an time algorithm to test if there are two strings that differs in precisely position .

Lemma5

Finding and can be done in time.

Proof

To find the equivalent classes, build two tries for the strings. Trie for strings in and trie for the reversal of strings in . Building the tries takes time. Inspect the nodes at depth in and nodes at depth in to recover and in time.

Theorem6

There is an time algorithm that solves Problem 1.

Proof

Finding the sequence of equivalent classes takes time by Lemma 4. For each , checking if there exists differs in precisely position takes time by Lemma 3. Since ranges from to , we obtain the final running time is .

# 2 Remarks

Ruosong Wang communicated another solution. It is much easier to describe. Let be a symbol not in the alphabet. Build a generalized suffix tree over the set of strings . Traverse the suffix tree, up to level , and output true if a path that contains was traversed, and can lead to more than leaves. Indeed, this means the substring appears at least twice. Hence there are at least two strings of the form and in . This definitely hits the optimal running time, but implementing a generalized suffix tree is fairly hard.

We do assume the alphabet size is constant. If the alphabet size is and ordered, then there is an extra factor of in building the tries. The the final running time will be .

Problem 1 also reduces to finding the closest pair of elements by hamming metric . It does not get us the desired running time though.

# 3 An implementation in Haskell

The implementation is mostly faithful to the presentation in the article. We did not implement counting sort nor radix sort.

# References


K. Min, M.-Y. Kao, H. Zhu, The closest pair problem under the hamming metric, in: H.Q. Ngo (Ed.), Computing and Combinatorics, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009: pp. 205–214.
Posted by Chao Xu on .
Tags: algorithms, strings, tries.