The Art Gallery Guardian

$L_1$ linear regression


I read an article on errors in visualization. The example of forcing a relationship by cherry-picking scales is delightful. I recommend reading it.

I am interested in misleading people while being completely honest. The article inspires the following problem. Given two vectors $\bm{x},\bm{y}\in \R^n$, let $\bm{1}$ be the all-ones vector in $\R^n$. We are interested in finding $a,b\in \R$ such that $\|\bm{y}-(a\bm{x}+b\bm{1})\|_p$ is minimized. Here $p$ is $1$, $2$ or $\infty$.

Note the problem is precisely the linear regression problem. In linear regression, we are given a point set $S\subset \R^2$ of size $n$ and we are interested in finding a line $f(x) = ax+b$ that minimizes the error, which for finite $p$ amounts to minimizing

$$\sum_{(x,y)\in S} |y - f(x)|^p,$$

and for $p=\infty$ the maximum residual $\max_{(x,y)\in S} |y - f(x)|$.

For $p=2$, there is an $O(n)$ time algorithm because the optimum has a closed form. For $p=\infty$, the problem can be rewritten as a linear program with $3$ variables and $2n$ constraints. Using Megiddo's result [1], there is an $O(n)$ time algorithm to solve this problem.
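For concreteness, here is a minimal sketch of the closed form for $p=2$ (the function name and the use of NumPy are my own choices):

```python
import numpy as np

def l2_regression(x, y):
    """Closed-form least squares fit of y ~ a*x + b in O(n) time."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x.mean(), y.mean()
    # slope = cov(x, y) / var(x); the fitted line passes through the centroid
    a = ((x - xm) * (y - ym)).sum() / ((x - xm) ** 2).sum()
    b = ym - a * xm
    return a, b
```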

It is harder to pin down the worst-case complexity when $p=1$. This case is called least absolute deviations. Statisticians just don't care about worst-case running time the way CS people do.

There are a few methods I found. One is to write it as a linear program with $n+2$ variables and $2n$ constraints and solve it using the simplex method. The linear program is as follows.

$$\begin{aligned} & \min_{a,b,t_1,\ldots,t_n} & & \sum_{i=1}^n t_i & \\ & \text{s.t.} & & t_i \geq (ax_i+b)-y_i & \forall 1 \leq i \leq n \\ & & & t_i \geq y_i-(ax_i+b) & \forall 1 \leq i \leq n \end{aligned}$$

(The two constraints together force $t_i \geq |ax_i+b-y_i|$, with equality at an optimum.)
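If one only wants a working solver rather than worst-case guarantees, the LP above can be handed to an off-the-shelf solver. A minimal sketch with SciPy (the variable ordering and helper name are my own):

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression_lp(x, y):
    """Least absolute deviations via the LP above; variables are [a, b, t_1..t_n]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    c = np.concatenate([[0.0, 0.0], np.ones(n)])  # objective: sum of t_i
    # t_i >= (a*x_i + b) - y_i   rewritten as   a*x_i + b - t_i <= y_i
    upper = np.hstack([x[:, None], np.ones((n, 1)), -np.eye(n)])
    # t_i >= y_i - (a*x_i + b)   rewritten as   -a*x_i - b - t_i <= -y_i
    lower = np.hstack([-x[:, None], -np.ones((n, 1)), -np.eye(n)])
    res = linprog(c,
                  A_ub=np.vstack([upper, lower]),
                  b_ub=np.concatenate([y, -y]),
                  bounds=[(None, None)] * 2 + [(0, None)] * n)  # a, b are free
    return res.x[0], res.x[1]
```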

There are a bunch of other algorithms that specialize the simplex algorithm to this particular problem, and there are also some iterative methods. Unfortunately, the running times of those algorithms depend on the actual numbers in the input. I want a running time that only depends on $n$.

There exists an optimal solution that passes through two points of $S$. The naive algorithm is to try all $O(n^2)$ such lines. For each line, the algorithm can compute the error in $O(n)$ time, so the naive algorithm runs in $O(n^3)$ time. There is a smarter algorithm: the optimal line through a given point can actually be found in $O(n)$ time. Indeed, consider the lines passing through the point $(x,y)$, and consider changing the slope of the line while maintaining that it contains $(x,y)$. One can see a minimum is reached at some line through a second point. Indeed, reorder the points so that $\frac{y_i-y}{x_i-x}\leq \frac{y_{i+1}-y}{x_{i+1}-x}$ (namely, increasing slope), and let $k$ be the smallest integer such that $\sum_{i=1}^k |x_i-x|\geq \sum_{i=k+1}^n |x_i-x|$. The line determined by $(x,y)$ and $(x_k,y_k)$ is the desired line. It can be computed in linear time by finding a weighted median. Hence one can show the overall running time is $O(n^2)$. This is the idea of [2]. As far as I know, this seems to be the state of the art in terms of worst-case complexity.
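Here is a sketch of the weighted-median step (my own code; it sorts, giving $O(n\log n)$, and a linear-time weighted-median selection would give the claimed $O(n)$):

```python
import numpy as np

def best_line_through(x0, y0, xs, ys):
    """Best L1 line constrained to pass through (x0, y0), via a weighted median.

    Assumes every entry of xs differs from x0 (vertical slopes need care).
    """
    slopes = (ys - y0) / (xs - x0)
    weights = np.abs(xs - x0)
    order = np.argsort(slopes)
    csum = np.cumsum(weights[order])
    # smallest k whose prefix weight reaches half of the total weight
    k = np.searchsorted(csum, csum[-1] / 2.0)
    a = slopes[order[k]]
    return a, y0 - a * x0
```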

After I discussed the problem with Qizheng He, he suggested the following approach. Consider the function $g_p(s)$ for $p\in S$, defined as the error of the line of slope $s$ that contains $p$. The function is convex in $s$, therefore we can do a ternary search to find the minimum. There are only $n-1$ possible slopes, hence the ternary search takes $O(\log n)$ queries, where each query asks for the error of the line that goes through $p$ and some other point.
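A sketch of the discrete ternary search, with the error oracle written naively in $O(n)$ (the helper names are mine; the point of the next paragraphs is to replace the oracle with a sublinear data structure):

```python
import numpy as np

def l1_error(a, b, xs, ys):
    """Error of the line y = a*x + b, computed naively in O(n)."""
    return np.abs(ys - (a * xs + b)).sum()

def best_line_ternary(x0, y0, xs, ys):
    """Ternary search over the sorted candidate slopes through (x0, y0)."""
    slopes = np.sort((ys - y0) / (xs - x0))
    err = lambda i: l1_error(slopes[i], y0 - slopes[i] * x0, xs, ys)
    lo, hi = 0, len(slopes) - 1
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if err(m1) < err(m2):
            hi = m2 - 1  # convexity: some minimizer lies in [lo, m2 - 1]
        else:
            lo = m1 + 1  # convexity: some minimizer lies in [m1 + 1, hi]
    i = min(range(lo, hi + 1), key=err)
    a = slopes[i]
    return a, y0 - a * x0
```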

Given a line $f(x)=ax+b$, can one compute the error quickly? It is possible to decompose it into a few halfspace range counting queries (allowing weights). In the halfspace counting query problem, we are given $n$ points with weights, which we preprocess into a data structure. Each query to the data structure is a halfspace, and the output is the sum of the weights of all points in the halfspace. In $2$D, there exists a data structure with preprocessing time $\tilde{O}(n^{4/3})$ and query time $\tilde{O}(n^{1/3})$ [3]. Let $S^+$ be the set of points above $f$, and $S^-$ be the set of points below $f$. The error is precisely the following.

$$\sum_{(x,y)\in S^+} (y - ax-b) + \sum_{(x,y)\in S^-} (ax+b - y)$$

Let's consider the second sum: $\sum_{(x,y)\in S^-} (ax+b - y) = a\sum_{(x,y)\in S^-}x + |S^-|\,b -\sum_{(x,y)\in S^-}y$. Each of the $3$ terms can be computed with one weighted halfspace counting query over the points below $f$ (with weights $x$, $1$ and $y$, respectively). Handling the first sum symmetrically, this shows the error can be computed with $6$ halfspace counting queries.
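A sketch of the decomposition, with the halfspace queries stubbed out as naive weighted sums (a real implementation would replace `halfspace_sum` with the data structure of [3]; the names are mine):

```python
import numpy as np

def halfspace_sum(xs, ys, w, a, b, below):
    """Stub for one weighted halfspace counting query: the sum of the
    weights w over the points below (or above) the line y = a*x + b.
    A naive O(n) stand-in for the O~(n^{1/3})-query structure of [3]."""
    mask = ys < a * xs + b if below else ys >= a * xs + b
    return w[mask].sum()

def l1_error_by_queries(xs, ys, a, b):
    """Error of y = a*x + b assembled from 6 halfspace queries.
    Points exactly on the line contribute 0 either way."""
    ones = np.ones_like(xs)
    sx_m, n_m, sy_m = (halfspace_sum(xs, ys, w, a, b, True) for w in (xs, ones, ys))
    sx_p, n_p, sy_p = (halfspace_sum(xs, ys, w, a, b, False) for w in (xs, ones, ys))
    return (sy_p - a * sx_p - b * n_p) + (a * sx_m + b * n_m - sy_m)
```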

How can one do the ternary search? We need to be able to pick the point that gives the $i$th largest slope with $p$. That is, we need a data structure that can return the $i$th point in the radial ordering of the points in $S$ around $p$. This is equivalent to halfspace range counting up to polylog factors.

Thus, after building the data structure in $\tilde{O}(n^{4/3})$ time, we run $n$ ternary searches over $n-1$ slopes each, where each decision step takes $\tilde{O}(n^{1/3})$ time. This totals $n \cdot O(\log n) \cdot \tilde{O}(n^{1/3}) = \tilde{O}(n^{4/3})$, so the final running time is $\tilde{O}(n^{4/3})$.

Qizheng mentioned the problem to Timothy Chan, who gave us some references. There is an easy $O(n\log^2 n)$ time algorithm using simple parametric search [4]. Consider the following linear program. Let $k$ be a constant. We are given $a_1,\ldots,a_k$, $b_1,\ldots,b_n$, $k$-dimensional vectors $\beta_1,\ldots,\beta_m$ and reals $\alpha_1,\ldots,\alpha_m$. Let $J_1,\ldots,J_n$ be a partition of $[m]$.

$$\begin{aligned} & \min_{w_1,\ldots,w_k,x_1,\ldots,x_n} & & \sum_{i=1}^k a_iw_i + \sum_{i=1}^n b_ix_i & \\ & \text{s.t.} & & x_i \geq \left(\sum_{d=1}^k \beta_{j,d} w_d\right) - \alpha_j & \forall 1 \leq i \leq n,\ j\in J_i \end{aligned}$$

Zemel showed such a linear program can be solved in $O(m)$ time for constant $k$ [5]. The idea is an algorithm similar to Megiddo's linear-time constant-dimension LP algorithm [1]. For the $L_1$ linear regression problem with $n$ data points, the linear program we derived is a special case of the above linear program with $k=2$ and $m=O(n)$. In fact, Zemel used the same linear program to show constant-dimension $L_1$ regression can be solved in linear time.
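To spell out the instantiation (this is my reading of the reduction; I write the data points as $(u_i,v_i)$ to avoid clashing with the LP variables $x_i$): take $k=2$ with $(w_1,w_2)=(a,b)$, set $a_1=a_2=0$ and $b_i=1$ for all $i$, and let each $J_i$ consist of the two indices $j, j'$ with

$$\beta_j = (u_i, 1),\quad \alpha_j = v_i \qquad\text{and}\qquad \beta_{j'} = (-u_i, -1),\quad \alpha_{j'} = -v_i.$$

Then the two constraints in $J_i$ say $x_i \geq \pm(au_i + b - v_i)$, i.e. $x_i = |au_i + b - v_i|$ at an optimum, the objective $\sum_i x_i$ is the $L_1$ error, and $m = 2n = O(n)$.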

1 Open problem

One can also define another metric, the lexicographic minimum. This idea was already present in fairness-related linear regression [6]. Sort the values of $|y - f(x)|$ for $(x,y)\in S$ to obtain $a_1,\ldots,a_n$, where $a_1\geq a_2 \geq \ldots \geq a_n$. We are interested in finding an $f$ that minimizes the sequence $a_1,\ldots,a_n$ lexicographically. Can this problem be solved in $O(n)$ time?

References

[1] N. Megiddo, Linear programming in linear time when the dimension is fixed, J. ACM. 31 (1984) 114–127 10.1145/2422.322418.

[2] P. Bloomfield, W. Steiger, Least absolute deviations curve-fitting, SIAM Journal on Scientific and Statistical Computing. 1 (1980) 290–301 10.1137/0901019.

[3] J. Matoušek, Range searching with efficient hierarchical cuttings, Discrete & Computational Geometry. 10 (1993) 157–182 10.1007/BF02573972.

[4] N. Megiddo, A. Tamir, Finding Least-Distances Lines, SIAM Journal on Algebraic Discrete Methods. 4 (1983) 207–211 10.1137/0604021.

[5] E. Zemel, An O(n) algorithm for the linear multiple choice knapsack problem and related problems, Information Processing Letters. 18 (1984) 123–128 10.1016/0020-0190(84)90014-0.

[6] M. Köppen, K. Yoshida, K. Ohnishi, Evolving fair linear regression for the representation of human-drawn regression lines, in: 2014 International Conference on Intelligent Networking and Collaborative Systems, 2014: pp. 296–303 10.1109/INCoS.2014.89.

Posted by Chao Xu on 2019-03-28.
Tags: combinatorial optimization.