Strings with hamming distance exactly $1$

Lin Yang asked me about the complexity for the following problem, which is the day 2 part 2 of the advent of code 2018. It is an elegant programming exercise, and also a clever algorithmic exercise. The problem can be summarized below.

Problem1

Given a set $W$ of $n$ length $m$ strings. Decide if there are two of them that differs at exactly one position.

In other words, we want to find two strings in $W$ with hamming distance $1$.

The naive algorithm would have running time $O(n^2m)$. The complexity of the problem have gathered a lot of attention a while ago, for example a post in dev.to, and on reddit. Some of them had a running time of $O(nm^2)$ instead. Some require hashing to get the expected running time of $O(mn)$. Here we are interested in an algorithm with worst case $O(mn)$ time.

1 An $O(mn)$ time algorithm

First, we define some equivalent classes on the strings in $W$.

1. $x\equiv^i y$, if $x[1..i-1]=y[1..i-1]$. Namely, if the first $i-1$ elements of $x$ and $y$ match.
2. $x\equiv_i y$ if $x[i+1..m]=y[i+1..m]$. Namely, if the last $m-i+1$ elements of $x$ and $y$ match.

The algorithm uses the following idea. For each $i$, decide if there are any strings $x$ and $y$ such that differs in precisely position $i$.

Lemma2

For distinct $x$ and $y$, they differ only in position $i$ if and only if $x\equiv^i y$ and $x\equiv_i y$.

Let $\mathcal{P}_i$ and $\mathcal{S}_i$ be the collection of equivalent classes of $\equiv^i$ and $\equiv_i$, respectively. We show a result related to the meet of partitions.

Lemma3

Let $\mathcal{A}$ and $\mathcal{B}$ be partitions of $[n]$. There is an $O(n)$ time algorithm to test find the sets in $\set{ A\cap B | A\in \mathcal{A}, B\in \mathcal{B}}$.

Proof

Let $\mathcal{A}=\set{A_1,\ldots,A_k}$ and $\mathcal{B} = \set{B_1,\ldots,B_\ell}$. We define $I_i = (a,b)$ such that $i\in A_a$ and $i\in B_b$. Then we know $i$ is in $A_a\cap B_b$ if $I_i=(a,b)$. Hence we are interested in find the largest set of elements such $S$ such that for $i,j\in S$, $I_i=I_j$. The simplified problem can be solved in $O(n)$ time. Indeed, the pair is just a base $n$ number with $2$ digits. We can apply radix sort with running time $O(n)$ and group by the result.

Note one can also directly use a partition refinement data structure to get the same result.

As a corollary, consider $\mathcal{A}=\mathcal{P}_i$ and $\mathcal{B}=\mathcal{S}_i$, then we obtain the following lemma.

Lemma4

Given the collections $\mathcal{P}_i$ and $\mathcal{S}_i$, there is an $O(n)$ time algorithm to test if there are two strings $x,y\in W$ that differs in precisely position $i$.

Lemma5

Finding $\mathcal{P}_1,\ldots,\mathcal{P}_m$ and $\mathcal{S}_1,\ldots,\mathcal{S}_m$ can be done in $O(mn)$ time.

Proof

To find the equivalent classes, build two tries for the strings. Trie $T_\mathcal{P}$ for strings in $W$ and trie $T_\mathcal{S}$ for the reversal of strings in $W$. Building the tries takes $O(mn)$ time. Inspect the nodes at depth $i-1$ in $T_P$ and nodes at depth $m-i+1$ in $T_S$ to recover $\mathcal{P}_i$ and $\mathcal{S}_i$ in $O(n)$ time.

Theorem6

There is an $O(mn)$ time algorithm that solves Problem 1.

Proof

Finding the sequence of equivalent classes takes $O(mn)$ time by Lemma 4. For each $i$, checking if there exists $x,y\in W$ differs in precisely position $i$ takes $O(n)$ time by Lemma 3. Since $i$ ranges from $1$ to $m$, we obtain the final running time is $O(mn)$.

2 Remarks

Ruosong Wang communicated another $O(mn)$ solution. It is much easier to describe. Let $\diamond$ be a symbol not in the alphabet. Build a generalized suffix tree over the set of strings $S'=\set{x\diamond x| x\in W}$. Traverse the suffix tree, up to level $m$, and output true if a path that contains $\diamond$ was traversed, and can lead to more than $2$ leaves. Indeed, this means the substring $x\diamond y$ appears at least twice. Hence there are at least two strings of the form $yax$ and $ybx$ in $W$. This definitely hits the optimal running time, but implementing a generalized suffix tree is fairly hard.

We do assume the alphabet size is constant. If the alphabet size is $\sigma$ and ordered, then there is an extra factor of $\log \sigma$ in building the tries. The the final running time will be $O(mn\log \sigma)$.

Problem 1 also reduces to finding the closest pair of elements by hamming metric [1]. It does not get us the desired running time though.

3 An implementation in Haskell

The implementation is mostly faithful to the presentation in the article. We did not implement counting sort nor radix sort.

References

[1] K. Min, M.-Y. Kao, H. Zhu, The closest pair problem under the hamming metric, in: H.Q. Ngo (Ed.), Computing and Combinatorics, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009: pp. 205–214.

Posted by Chao Xu on .
Tags: algorithms, strings, tries.