The Art Gallery Guardian

Bounded regression on data streams


1 Bounded Regression on Data Streams

Hsien-Chih sent me this problem. Similar problem has been asked on Quora. He noticed it might be solved in near linear time using min-cost circulation. Here we show a generalization.

Problem1Bounded Regression on Data Stream

Given

  1. (a1,,an)Rn(a_1,\ldots,a_n)\in \R^n,
  2. (w1,,wn)R+n(w_1,\ldots,w_n)\in \R^n_+,
  3. (l1,,ln1)(u1,,un1)Rn(l_1,\ldots,l_{n-1})\leq (u_1,\ldots,u_{n-1}) \in \R^n.

Output (x1,,xn)Rn(x_1,\ldots,x_n)\in \R^n such that lixi+1xiuil_i \leq x_{i+1}-x_i\leq u_i for all 1i<n1\leq i<n, and minimize i=1nwiaixi\sum_{i=1}^n w_i |a_i-x_i|.

2 Reduce the problem to min-cost circulation

It’s natural to model this problem as variations of min-cost circulation problem on a graph.

The graph G=(V,E)G=(V,E) with vertices V={s,v0,,vn}V=\{s,v_0,\ldots,v_n\}.

Edges:

  1. Edge vivi+1v_iv_{i+1} for all 0i<n0\leq i <n.
  2. Edge svisv_i for all 0in0\leq i\leq n.

Edge Capacity:

  1. svisv_i has lower bound lil_i, upper bound uiu_i for all 1in11\leq i\leq n-1.
  2. All other edges are uncapacited. Namely lower bound and upper bound are -\infty and \infty respectively.

Edge Costs: vi1viv_{i-1}v_i has cost function ci(x)=wiaixc_i(x)=w_i |a_i-x|. Cost function on other edges are 00.

A function ff is called a circulation if eδ+(v)f(e)eδ(v)f(e)=0\sum_{e\in \delta^+(v)} f(e)-\sum_{e\in \delta^-(v)} f(e)=0 for all vertex vv. It is feasible if f(e)f(e) is within the capacity. It is min-cost if ece(f(e))\sum_{e} c_e(f(e)) is minimized.

Solving the min-cost circulation problem would give us the desired xix_i by setting xi=f(vi1vi)x_i=f(v_{i-1}v_i).

3 min-cost circulation on series-parallel graphs

The constructed graph is a two terminal series-parallel graph. There is a simple procedure to solve min-cost flow problem on series-parallel graphs. Consider a series connection of two edges, each with cost function ff and gg. We can replace it with an edge with cost function f+gf + g. If it is a parallel connection, then we can replace it with one edge and a cost function f  gf~\square~g, where \square is the infimal convolution: (f  g)(x)=infyf(xy)+g(y)(f~\square~g)(x)= \inf_y f(x-y) + g(y).

Once we have a good data structure to represent the costs, we can reduce the graph to one single edge easily, and find the minimum cost circulation. In particular, if the cost are continuous, convex and piecewise linear in a interval and \infty everywhere else, and the total number of breakpoints is nn, then Booth and Tarjan has an algorithm that runs in O(nlogn)O(n\log n) time [1].

Because all edge has a cost function with at most 11 breakpoint. The bounded regression problem can be solved in O(nlogn)O(n\log n) time.

4 Isotonic regression

We can try to minimize i=1nwi(aixi)2\sqrt{\sum_{i=1}^n w_i (a_i-x_i)^2} instead (L2L_2 error). It is a generalization of the lipschitz isotonic regression problem [2] when li=0l_i=0 and ui=uu_i=u for some constant uu. We can also ask to minimize the LL_\infty error.

If the upper bounds are \infty and all lower bounds are 00, then the problem is called the isotonic regression problem. I have solved a interesting problem using isotonic regression.

We can express all the problems as min-cost circulation problem on a appropriate graph. If the min-cost circulation algorithm on those graphs have the same running time as current best algorithm, it would imply something more general is acting in the background.

Here is what we know.

  1. L1L_1 error: This post shows it can be solved in O(nlogn)O(n\log n) time using the min-cost circulation formulation. It matches the running time of specialized algorithms.
  2. L2L_2 error: It can be solved in O(n)O(n) time, but doesn’t come from the quadratic cost min-cost circulation formulation.
  3. LL_\infty error: It can be solved in O(n)O(n) time. However, it doesn’t come from the minimax circulation problem. (In the minimax circulation, the cost is the largest edge cost incurred by the circulation).

This prompt the following two natural problems:

  1. Can min-cost circulation with quadratic cost on series parallel graph have O(n)O(n) time solution? This is in fact possible when all edges have no capacity[3]. But with capacity, even for a edge with a lower bound of 00 and 00 cost, we don’t know.

  2. What about minimax circulation? We can’t find any study of minimax circulation on series-parallel graphs.

References

[1] H. Booth, R.E. Tarjan, Finding the minimum-cost maximum flow in a series-parallel network, Journal of Algorithms. 15 (1993) 416–446 10.1006/jagm.1993.1048.

[2] P.K. Agarwal, J.M. Phillips, B. Sadri, Lipschitz unimodal and isotonic regression on paths and trees, in: A. LopezOrtiz (Ed.), LATIN 2010: THEORETICAL Informatics, 2010: pp. 384–396 10.1007/978-3-642-12200-2_34.

[3] R. Zohar, D. Geiger, Estimation of flows in flow networks, European Journal of Operational Research. 176 (2007) 691–706 10.1016/j.ejor.2005.08.009.

Posted by Chao Xu on .
Tags: flow, circulation, algorithms.