Word break with cost

Problem1

Given a set of strings $W$ , a cost function $c : W \to R$ and a string $s$ . Find elements $w_{1}, \dots, w_{k} \in W$ such that $s = w_{1} \dots w_{k}$ , and $\sum_{i = 1}^{k} c (w_{i})$ is minimized.

This problem is a generalization of the word break problem on leetcode. Many algorithms you see online assumes that each string in $W$ has constant length, checking the hash table takes $O (1)$ time, and obtains an $O (n^{2})$ time algorithm. It is not as easy if the strings can have arbitrary length. Here we show an algorithm that considers the strings in $W$ have arbitrary length.

Assume the input $s$ has length $n$ . Consider the following graph $G = (V, E)$ , where $V = {0, \dots, n}$ , and there is an edge from $i$ and $j$ , if $s [i + 1.. j] = w \in W$ , the label of the edge $(i, j)$ is the string $w$ , and the cost is $c (w)$ . Let $L$ be total length of the input, namely $_{wW} |w| + |s|. Let $z$ be number of substrings in $s$ matches some element in $W$ . The graph has $z$ edges. Note $z = O (n L)$ . Indeed, the sum of the length of the labels of all outgoing edges cannot be more than $L$ , and the length of each label is different. Hence each vertex can have at most $O (L)$ outgoing edges. The graph is a DAG, so we can find the shortest path from $0$ to $n$ in linear time with respect to the number of edges. This shows if we can compute the graph in $O (z + L)$ time, then we solve the problem in $O (z + L)$ time.

We can build an Aho–Corasick automaton for $W$ in $O (L)$ time. It can be used to find all substrings of $s$ that matches something in $W$ by traversing the automaton once. The running time is the total number of substrings matched, which is $O (z)$ . Hence building the graph takes $O (z + L)$ time. $z$ is clearly no more than $nm$ , where $m = ∣ W ∣$ . Also, it is also clear $z = O (n L)$ . Indeed, there can be at most $O (L)$ edges start from $i$ , since each edge has a label of different length, and sum of those length labels is no larger than $L$ .

If we only want to know if there exists a solution, then there is a $\tilde{O} (n L^{1/3} + L)$ time algorithm [1]. The algorithm is close to optimal assuming the algorithm is combinatorial and the alphabet can be arbitrarily large. A recent progress showed the version with minimum cost is also solvable in $\tilde{O} (n L^{1/3} + L)$ time [2].

References

[1]

K. Bringmann, A. Grønlund, K.G. Larsen, A dichotomy for regular expression membership testing, in: 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), 2017: pp. 307–318 10.1109/FOCS.2017.36.

[2]

T.M. Chan, Q. He, More on Change-Making and Related Problems, in: F. Grandoni, G. Herman, P. Sanders (Eds.), 28th Annual European Symposium on Algorithms (ESA 2020), Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2020: pp. 29:1–29:14 10.4230/LIPIcs.ESA.2020.29.

Posted by Chao Xu on 2019-09-19.

Tags: Algorithm.