The Art Gallery Guardian

Word break with cost


Problem1

Given a set of strings WW, a cost function c:WRc:W\to \R and a string ss. Find elements w1,,wkWw_1,\ldots,w_k\in W such that s=w1wks=w_1\ldots w_k, and i=1kc(wi)\sum_{i=1}^k c(w_i) is minimized.

This problem is a generalization of the word break problem on leetcode. Many algorithms you see online assumes that string in WW has constant length, checking the hash table takes O(1)O(1) time, and obtain an O(n2)O(n^2) time algorithm. It is not as easy. Here we show an algorithm that considers the strings in WW have arbitrary length.

Consider the following graph G=(V,E)G=(V,E), where V={0,,n}V=\set{0,\ldots,n}, and there is an edge from ii and jj, if s[i+1..j]=wWs[i+1..j]=w\in W, and the label of the edge (i,j)(i,j) is the string ww, and the cost is c(w)c(w). Let zz be number of substrings in ss matches some element in WW. The graph has zz edges. Note z=O(nL)z=O(n\sqrt{L}). Indeed, the sum of the length of the labels of all outgoing edges cannot be more than LL, and the length of each label is different. Hence each vertex can have at most O(L)O(\sqrt{L}) outgoing edges. The graph is a DAG, so we can find the shortest path from 00 to nn in linear time with respect to the number of edges. This shows if we can compute the graph in O(z+L)O(z+L) time, then we solve the problem in O(z+L)O(z+L) time.

We can build the Aho–Corasick automaton for WW in O(L)O(L) time. It can be used to find all substrings of ss that matches something in WW by traversing the automaton once. The running time is the total number of substrings matched, which is O(z)O(z). Hence building the graph takes O(z+L)O(z+L) time. zz is clearly no more than nmnm, where m=Wm=|W|. Also, it is also clear z=O(nL)z=O(n\sqrt{L}). Indeed, there can be at most O(L)O(\sqrt{L}) edges start from ii, since each edge has a label of different length, and sum of those length labels is no larger than LL.

If we only want to know if there exists a solution, then there is a O~(nL1/3+L)\tilde{O}(nL^{1/3}+L) time algorithm [1]. The algorithm is close to optimal assuming the algorithm is combinatorial and the alphabet can be arbitrarily large.

Can we obtain similar running time for the word break with cost problem? There are evidence against it. If the alphabet is unary, this problem is equivalent to the unbounded knapsack problem, which likely does not have an algorithm with running time O((n+n)2ε)O((n+n)^{2-\e}) for any ε>0\e>0 [2] and mm can be as large as Ω(L)\Omega(\sqrt{L}). Of course, this does not mean there might not be a O(nL1/3+L)O(nL^{1/3}+L) time algorithm, since the reduction involved in the paper might not hold when we require m=Ω(L)m=\Omega(\sqrt{L}).

References

[1] K. Bringmann, A. Grønlund, K.G. Larsen, A dichotomy for regular expression membership testing, in: 2017 Ieee 58th Annual Symposium on Foundations of Computer Science (Focs), 2017: pp. 307–318 10.1109/FOCS.2017.36.

[2] M. Cygan, M. Mucha, K. Węgrzycki, M. W, On problems equivalent to (min,+)-convolution, ACM Trans. Algorithms. 15 (2019) 14:1–14:25 10.1145/3293465.

Posted by Chao Xu on .
Tags: Algorithm.