# Bounded regression on data streams

# 1 Bounded Regression on Data Streams

Hsien-Chih sent me this problem. A similar problem has been asked on Quora. He noticed it might be solvable in near-linear time using min-cost circulation. Here we show a generalization.

Given

- $(a_1,\ldots,a_n)\in \R^n$,
- $(w_1,\ldots,w_n)\in \R^n_+$,
- $(l_1,\ldots,l_{n-1})\leq (u_1,\ldots,u_{n-1}) \in \R^{n-1}$ (componentwise).

Output $(x_1,\ldots,x_n)\in \R^n$ that minimizes $\sum_{i=1}^n w_i |a_i-x_i|$ subject to $l_i \leq x_{i+1}-x_i\leq u_i$ for all $1\leq i<n$.
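To make the objective and the constraints concrete, here is a minimal Python sketch. The function names (`is_feasible`, `l1_cost`, `brute_force_integer`) and the search window `lo..hi` are illustrative assumptions, not part of the problem statement. The brute force only searches integer-valued $x_i$ inside that window, so it is a sanity check for tiny inputs rather than a solver.

```python
def is_feasible(x, l, u, tol=1e-9):
    """Check l[i] <= x[i+1] - x[i] <= u[i] for every consecutive pair (0-indexed)."""
    return all(l[i] - tol <= x[i + 1] - x[i] <= u[i] + tol
               for i in range(len(x) - 1))


def l1_cost(x, a, w):
    """Weighted L1 error: sum of w_i * |a_i - x_i|."""
    return sum(wi * abs(ai - xi) for wi, ai, xi in zip(w, a, x))


def brute_force_integer(a, w, l, u, lo, hi):
    """Best integer-valued feasible x with lo <= x_i <= hi, by dynamic programming.

    Returns (cost, x), or None if no such x exists. The window [lo, hi] is an
    assumption supplied by the caller; the true optimum may lie outside it.
    """
    values = range(lo, hi + 1)
    # best[v] = (cost, prefix) for the cheapest feasible prefix ending at value v
    best = {v: (w[0] * abs(a[0] - v), [v]) for v in values}
    for i in range(1, len(a)):
        nxt = {}
        for v in values:
            cands = [best[p] for p in values
                     if p in best and l[i - 1] <= v - p <= u[i - 1]]
            if cands:
                cost, prefix = min(cands, key=lambda t: t[0])
                nxt[v] = (cost + w[i] * abs(a[i] - v), prefix + [v])
        best = nxt
    if not best:
        return None
    return min(best.values(), key=lambda t: t[0])
```

For example, with `a = [0, 5, 1]`, unit weights, and every step constrained to `[0, 1]`, the search returns cost `4` with `x = [0, 1, 1]`.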

# 2 Reduce the problem to min-cost circulation

It’s natural to model this problem as a variation of the min-cost circulation problem on a graph.

The graph $G=(V,E)$ has vertices $V=\{s,v_0,\ldots,v_n\}$.

Edges:

- Edge $v_iv_{i+1}$ for all $0\leq i <n$.
- Edge $sv_i$ for all $0\leq i\leq n$.

Edge Capacity:

- $sv_i$ has lower bound $l_i$ and upper bound $u_i$ for all $1\leq i\leq n-1$.
- All other edges are uncapacitated. Namely, the lower bound and upper bound are $-\infty$ and $\infty$ respectively.

Edge Costs: $v_{i-1}v_i$ has cost function $c_i(x)=w_i |a_i-x|$. The cost function on every other edge is $0$.

A function $f$ is called a circulation if $\sum_{e\in \delta^+(v)} f(e)-\sum_{e\in \delta^-(v)} f(e)=0$ for every vertex $v$. It is feasible if $f(e)$ is within the capacity of every edge $e$. It is min-cost if $\sum_{e} c_e(f(e))$ is minimized.

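Solving the min-cost circulation problem gives us the desired $x_i$ by setting $x_i=f(v_{i-1}v_i)$. Indeed, conservation at $v_i$ forces $f(sv_i)=f(v_iv_{i+1})-f(v_{i-1}v_i)=x_{i+1}-x_i$ for $1\leq i\leq n-1$, so the capacity on $sv_i$ is exactly the constraint $l_i\leq x_{i+1}-x_i\leq u_i$.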
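The correspondence between a vector $x$ and a circulation can be checked mechanically. The sketch below assumes the edges are oriented as written ($s\to v_i$ and $v_{i-1}\to v_i$); the helper names `circulation_from_x` and `net_flow` are hypothetical.

```python
from collections import defaultdict


def circulation_from_x(x):
    """Build the flow f determined by x = (x_1, ..., x_n) on the graph above.

    Edges are keyed as (tail, head). Assumed orientation: s -> v_i and
    v_{i-1} -> v_i.
    """
    n = len(x)
    f = {}
    for i in range(1, n + 1):
        f[(f"v{i-1}", f"v{i}")] = x[i - 1]       # f(v_{i-1} v_i) = x_i
    f[("s", "v0")] = x[0]                        # feeds x_1 into the path
    for i in range(1, n):
        f[("s", f"v{i}")] = x[i] - x[i - 1]      # f(s v_i) = x_{i+1} - x_i
    f[("s", f"v{n}")] = -x[n - 1]                # drains x_n at the end
    return f


def net_flow(f):
    """Inflow minus outflow at every vertex; all zeros for a circulation."""
    net = defaultdict(float)
    for (tail, head), value in f.items():
        net[tail] -= value
        net[head] += value
    return dict(net)
```

Every vertex should report a net flow of zero, and the values on the edges $sv_1,\ldots,sv_{n-1}$ are the consecutive differences that the capacities constrain.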

# 3 Min-cost circulation on series-parallel graphs

The constructed graph is a two-terminal series-parallel graph. There is a simple procedure to solve the min-cost flow problem on series-parallel graphs. Consider a series connection of two edges with cost functions $f$ and $g$ respectively. Both edges carry the same flow, so we can replace them with a single edge with cost function $f + g$. If it is a parallel connection, the flow splits between the two edges, so we can replace them with one edge with cost function $f~\square~g$, where $\square$ is the infimal convolution: $(f~\square~g)(x)= \inf_y f(x-y) + g(y)$.
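For convex piecewise linear costs on bounded intervals, the infimal convolution has a simple form: the domains add and the linear pieces are merged in order of slope. The sketch below illustrates only that fact. The representation `(x0, y0, segments)` and the names `pl_eval` and `inf_conv` are assumptions made for illustration; this is not the data structure of Booth and Tarjan.

```python
import math
from heapq import merge

# A convex piecewise linear function on a bounded interval is stored as
# (x0, y0, segments): it starts at x0 with value y0, and segments is a list
# of (length, slope) pairs with non-decreasing slopes. Outside the interval
# the value is +infinity.


def pl_eval(fn, x):
    """Evaluate the function at x."""
    x0, y0, segments = fn
    if x < x0:
        return math.inf
    pos, y = x0, y0
    for length, slope in segments:
        if x <= pos + length:
            return y + slope * (x - pos)
        pos, y = pos + length, y + slope * length
    return y if x == pos else math.inf


def inf_conv(f, g):
    """Infimal convolution (parallel composition) of two such functions."""
    (fx, fy, fsegs), (gx, gy, gsegs) = f, g
    # Merging by slope keeps the result convex; cheaper pieces are used first.
    segments = list(merge(fsegs, gsegs, key=lambda seg: seg[1]))
    return (fx + gx, fy + gy, segments)
```

For instance, with $f(x)=|x|$ and $g(x)=2|x|$, both restricted to $[-1,1]$, `inf_conv` yields a function on $[-2,2]$ with slopes $-2,-1,1,2$. A series composition would instead add the two functions pointwise on the intersection of their domains.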

Once we have a good data structure to represent the costs, we can easily reduce the graph to a single edge and find the minimum cost circulation. In particular, if the costs are continuous, convex and piecewise linear on an interval and $\infty$ everywhere else, and the total number of breakpoints is $n$, then Booth and Tarjan have an algorithm that runs in $O(n\log n)$ time [1].

Because every edge has a cost function with at most $1$ breakpoint, the bounded regression problem can be solved in $O(n\log n)$ time.

# 4 Isotonic regression

We can try to minimize $\sqrt{\sum_{i=1}^n w_i (a_i-x_i)^2}$ instead ($L_2$ error). It is a generalization of the Lipschitz isotonic regression problem [2], which is the case $l_i=0$ and $u_i=u$ for some constant $u$. We can also ask to minimize the $L_\infty$ error.

If the upper bounds are $\infty$ and all lower bounds are $0$, then the problem is called the isotonic regression problem. I have solved an interesting problem using isotonic regression.
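For the $L_2$ isotonic case ($l_i=0$, $u_i=\infty$), one classical linear-time method is pool adjacent violators. The post does not say which algorithm it has in mind, so the sketch below is only an illustration; the name `pav` is hypothetical and the code assumes strictly positive weights.

```python
def pav(a, w):
    """Weighted L2 isotonic regression by pooling adjacent violators.

    Returns a non-decreasing x minimising sum of w_i * (a_i - x_i)^2.
    Assumes every weight is strictly positive.
    """
    blocks = []  # each block is [weighted sum, total weight, number of entries]
    for ai, wi in zip(a, w):
        blocks.append([ai * wi, wi, 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, t, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += t
            blocks[-1][2] += c
    x = []
    for s, t, c in blocks:
        x.extend([s / t] * c)  # every entry of a block gets the block mean
    return x
```

Each entry is pushed once and each merge removes a block, so the loop does a linear amount of work overall.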

We can express all of these problems as min-cost circulation problems on an appropriate graph. If the min-cost circulation algorithm on those graphs has the same running time as the current best algorithm, it would imply something more general is acting in the background.

Here is what we know.

- $L_1$ error: This post shows it can be solved in $O(n\log n)$ time using the min-cost circulation formulation. It matches the running time of specialized algorithms.
- $L_2$ error: It can be solved in $O(n)$ time, but doesn’t come from the quadratic cost min-cost circulation formulation.
- $L_\infty$ error: It can be solved in $O(n)$ time. However, it doesn’t come from the minimax circulation problem. (In the minimax circulation, the cost is the largest edge cost incurred by the circulation).
- $L_0$ error: It can be solved in $O(n\log n)$ time. This is equivalent to the longest non-decreasing subsequence problem.
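The $L_0$ item can be illustrated directly: the entries left unchanged must form a non-decreasing subsequence, so the minimum number of changed entries is $n$ minus the length of a longest one. The sketch below handles the unweighted case only (an assumption; the weighted version is not covered), and the name `l0_isotonic_error` is hypothetical.

```python
from bisect import bisect_right


def l0_isotonic_error(a):
    """Minimum number of entries to change so that a becomes non-decreasing.

    tails[k] holds the smallest possible last value of a non-decreasing
    subsequence of length k + 1 seen so far; bisect_right allows equal values.
    """
    tails = []
    for value in a:
        k = bisect_right(tails, value)
        if k == len(tails):
            tails.append(value)
        else:
            tails[k] = value
    return len(a) - len(tails)
```

Each element costs one binary search, which gives the $O(n\log n)$ bound.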

This prompts the following two natural problems:

- *Can min-cost circulation with quadratic cost on series-parallel graphs have an $O(n)$ time solution?* This is in fact possible when all edges have no capacity [3]. But with capacity, even for an edge with a lower bound of $0$ and $0$ cost, we don’t know.
- *What about minimax circulation?* We can’t find any study of minimax circulation on series-parallel graphs.

# References

[1] H. Booth, R.E. Tarjan, **Finding the minimum-cost maximum flow in a series-parallel network**, Journal of Algorithms. 15 (1993) 416–446. doi:10.1006/jagm.1993.1048.

[2] P.K. Agarwal, J.M. Phillips, B. Sadri, **Lipschitz unimodal and isotonic regression on paths and trees**, in: A. López-Ortiz (Ed.), LATIN 2010: Theoretical Informatics, 2010: pp. 384–396. doi:10.1007/978-3-642-12200-2_34.

[3] R. Zohar, D. Geiger, **Estimation of flows in flow networks**, European Journal of Operational Research. 176 (2007) 691–706. doi:10.1016/j.ejor.2005.08.009.