The Art Gallery Guardian

$L_1$ linear regression


I read an article on errors in visualization. The example of forcing a relationship by cherry-picking scales is delightful. I recommend reading it.

I am interested in misleading people while being completely honest. The article inspires the following problem. Given two vectors $\bm{x},\bm{y}\in \R^n$, let $\bm{1}$ be the all-ones vector in $\R^n$. We are interested in finding $a,b\in \R$ such that $\|\bm{y}-(a\bm{x}+b\bm{1})\|_p$ is minimized. Here $p$ is $1$, $2$ or $\infty$.

Note the problem is precisely the linear regression problem. In linear regression, we are given a point set $S\subset \R^2$ of size $n$ and we are interested in finding a line $f(x) = ax+b$ that minimizes the error, which for finite $p$ amounts to minimizing

$$\sum_{(x,y)\in S} |y - f(x)|^p,$$

and for $p=\infty$ the maximum residual $\max_{(x,y)\in S} |y - f(x)|$.

For $p=2$, there is an $O(n)$ time algorithm because the optimum has a closed form. For $p=\infty$, the problem can be rewritten as a linear program with $3$ variables and $2n$ constraints. Using Megiddo's result [1], there is an $O(n)$ time algorithm to solve this problem.
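For concreteness, here is a minimal sketch of the closed form for $p=2$ (the function name and the use of NumPy are my own choices):

```python
import numpy as np

def l2_regression(x, y):
    """Closed-form least squares fit of y ~ a*x + b in O(n) time."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x.mean(), y.mean()
    # slope = cov(x, y) / var(x); the fitted line passes through the centroid
    a = ((x - xm) * (y - ym)).sum() / ((x - xm) ** 2).sum()
    b = ym - a * xm
    return a, b
```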

It is harder to pin down the worst-case complexity when $p=1$. This case is called least absolute deviations. Statisticians just don't care about worst-case running time the way CS people do.

There are a few methods I found. One is to write it as a linear program with $n+2$ variables and $2n$ constraints and solve it using the simplex method. The linear program is as follows.

$$\begin{aligned} & \min_{a,b,t_1,\ldots,t_n} & & \sum_{i=1}^n t_i & \\ & \text{s.t.} & & t_i \geq (ax_i+b)-y_i & \forall 1 \leq i \leq n \\ & & & t_i \geq y_i-(ax_i+b) & \forall 1 \leq i \leq n \end{aligned}$$

(The two constraints together force $t_i \geq |ax_i+b-y_i|$, with equality at an optimum.)
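If one only wants a working solver rather than worst-case guarantees, the LP above can be handed to an off-the-shelf solver. A minimal sketch with SciPy (the variable ordering and helper name are my own):

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression_lp(x, y):
    """Least absolute deviations via the LP above; variables are [a, b, t_1..t_n]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    c = np.concatenate([[0.0, 0.0], np.ones(n)])  # objective: sum of t_i
    # t_i >= (a*x_i + b) - y_i   rewritten as   a*x_i + b - t_i <= y_i
    upper = np.hstack([x[:, None], np.ones((n, 1)), -np.eye(n)])
    # t_i >= y_i - (a*x_i + b)   rewritten as   -a*x_i - b - t_i <= -y_i
    lower = np.hstack([-x[:, None], -np.ones((n, 1)), -np.eye(n)])
    res = linprog(c,
                  A_ub=np.vstack([upper, lower]),
                  b_ub=np.concatenate([y, -y]),
                  bounds=[(None, None)] * 2 + [(0, None)] * n)  # a, b are free
    return res.x[0], res.x[1]
```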

There are a bunch of other algorithms that specialize the simplex algorithm to this particular problem, and there are also some iterative methods. Unfortunately, the running times of those algorithms depend on the actual numbers in the input. I want a running time that only depends on $n$.

There exists an optimal solution that passes through two points of $S$. The naive algorithm is to try all $O(n^2)$ such lines. For each line, the algorithm can compute the error in $O(n)$ time, so the naive algorithm runs in $O(n^3)$ time. There is a smarter algorithm: the optimal line through a given point can actually be found in $O(n)$ time. Indeed, consider the lines passing through the point $(x,y)$, and consider changing the slope of the line while maintaining that it contains $(x,y)$. One can see a minimum is reached at some line through a second point. Indeed, reorder the points so that $\frac{y_i-y}{x_i-x}\leq \frac{y_{i+1}-y}{x_{i+1}-x}$ (namely, increasing slope), and let $k$ be the smallest integer such that $\sum_{i=1}^k |x_i-x|\geq \sum_{i=k+1}^n |x_i-x|$. The line determined by $(x,y)$ and $(x_k,y_k)$ is the desired line. It can be computed in linear time by finding a weighted median. Hence one can show the overall running time is $O(n^2)$. This is the idea of [2]. As far as I know, this seems to be the state of the art in terms of worst-case complexity.
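Here is a sketch of the weighted-median step (my own code; it sorts, giving $O(n\log n)$, and a linear-time weighted-median selection would give the claimed $O(n)$):

```python
import numpy as np

def best_line_through(x0, y0, xs, ys):
    """Best L1 line constrained to pass through (x0, y0), via a weighted median.

    Assumes every entry of xs differs from x0 (vertical slopes need care).
    """
    slopes = (ys - y0) / (xs - x0)
    weights = np.abs(xs - x0)
    order = np.argsort(slopes)
    csum = np.cumsum(weights[order])
    # smallest k whose prefix weight reaches half of the total weight
    k = np.searchsorted(csum, csum[-1] / 2.0)
    a = slopes[order[k]]
    return a, y0 - a * x0
```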

After I discussed the problem with Qizheng He, he suggested the following approach. Consider the function $g_p(s)$ for $p\in S$, defined as the error of the line of slope $s$ that contains $p$. The function is convex in $s$, therefore we can do a ternary search to find the minimum. There are only $n-1$ possible slopes, hence the ternary search takes $O(\log n)$ queries, where each query asks for the error of the line that goes through $p$ and some other point.
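A sketch of the discrete ternary search, with the error oracle written naively in $O(n)$ (the helper names are mine; the point of the next paragraphs is to replace the oracle with a sublinear data structure):

```python
import numpy as np

def l1_error(a, b, xs, ys):
    """Error of the line y = a*x + b, computed naively in O(n)."""
    return np.abs(ys - (a * xs + b)).sum()

def best_line_ternary(x0, y0, xs, ys):
    """Ternary search over the sorted candidate slopes through (x0, y0)."""
    slopes = np.sort((ys - y0) / (xs - x0))
    err = lambda i: l1_error(slopes[i], y0 - slopes[i] * x0, xs, ys)
    lo, hi = 0, len(slopes) - 1
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if err(m1) < err(m2):
            hi = m2 - 1  # convexity: some minimizer lies in [lo, m2 - 1]
        else:
            lo = m1 + 1  # convexity: some minimizer lies in [m1 + 1, hi]
    i = min(range(lo, hi + 1), key=err)
    a = slopes[i]
    return a, y0 - a * x0
```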

Given a line $f(x)=ax+b$, can one compute the error quickly? It is possible to decompose it into a few halfspace range counting queries (allowing weights). In the halfspace counting query problem, we are given $n$ points with weights, which we preprocess into a data structure. Each query to the data structure is a halfspace, and the output is the sum of the weights of all points in the halfspace. In $2$D, there exists a data structure with preprocessing time $\tilde{O}(n^{4/3})$ and query time $\tilde{O}(n^{1/3})$ [3]. Let $S^+$ be the set of points above $f$, and $S^-$ be the set of points below $f$. The error is precisely the following.

$$\sum_{(x,y)\in S^+} (y - ax-b) + \sum_{(x,y)\in S^-} (ax+b - y)$$

Let's consider the second sum: $\sum_{(x,y)\in S^-} (ax+b - y) = a\sum_{(x,y)\in S^-}x + |S^-|\,b -\sum_{(x,y)\in S^-}y$. Each of the $3$ terms can be computed with one weighted halfspace counting query over the points below $f$ (with weights $x$, $1$ and $y$, respectively). Handling the first sum symmetrically, this shows the error can be computed with $6$ halfspace counting queries.
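A sketch of the decomposition, with the halfspace queries stubbed out as naive weighted sums (a real implementation would replace `halfspace_sum` with the data structure of [3]; the names are mine):

```python
import numpy as np

def halfspace_sum(xs, ys, w, a, b, below):
    """Stub for one weighted halfspace counting query: the sum of the
    weights w over the points below (or above) the line y = a*x + b.
    A naive O(n) stand-in for the O~(n^{1/3})-query structure of [3]."""
    mask = ys < a * xs + b if below else ys >= a * xs + b
    return w[mask].sum()

def l1_error_by_queries(xs, ys, a, b):
    """Error of y = a*x + b assembled from 6 halfspace queries.
    Points exactly on the line contribute 0 either way."""
    ones = np.ones_like(xs)
    sx_m, n_m, sy_m = (halfspace_sum(xs, ys, w, a, b, True) for w in (xs, ones, ys))
    sx_p, n_p, sy_p = (halfspace_sum(xs, ys, w, a, b, False) for w in (xs, ones, ys))
    return (sy_p - a * sx_p - b * n_p) + (a * sx_m + b * n_m - sy_m)
```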

How can one do the ternary search? We need to be able to pick the point that gives the $i$th largest slope with $p$. That is, we need a data structure that can return the $i$th point in the radial ordering of the points in $S$ around $p$. This is equivalent to halfspace range counting up to polylog factors.

Thus, after building the data structure in $\tilde{O}(n^{4/3})$ time, we run $n$ ternary searches over $n-1$ slopes each, where each decision step takes $\tilde{O}(n^{1/3})$ time. This totals $n \cdot O(\log n) \cdot \tilde{O}(n^{1/3}) = \tilde{O}(n^{4/3})$, so the final running time is $\tilde{O}(n^{4/3})$.

Qizheng mentioned the problem to Timothy Chan, who gave us some references. There is an easy $O(n\log^2 n)$ time algorithm using simple parametric search [4]. Consider the following linear program. Let $k$ be a constant. We are given $a_1,\ldots,a_k$, $b_1,\ldots,b_n$, $k$-dimensional vectors $\beta_1,\ldots,\beta_m$ and reals $\alpha_1,\ldots,\alpha_m$. Let $J_1,\ldots,J_n$ be a partition of $[m]$.

$$\begin{aligned} & \min_{w_1,\ldots,w_k,x_1,\ldots,x_n} & & \sum_{i=1}^k a_iw_i + \sum_{i=1}^n b_ix_i & \\ & \text{s.t.} & & x_i \geq \left(\sum_{d=1}^k \beta_{j,d} w_d\right) - \alpha_j & \forall 1 \leq i \leq n,\ j\in J_i \end{aligned}$$

Zemel showed such a linear program can be solved in $O(m)$ time for constant $k$ [5]. The idea is an algorithm similar to Megiddo's linear-time constant-dimension LP algorithm [1]. For the $L_1$ linear regression problem with $n$ data points, the linear program we derived is a special case of the above linear program with $k=2$ and $m=O(n)$. In fact, Zemel used the same linear program to show constant-dimension $L_1$ regression can be solved in linear time.
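To spell out the instantiation (this is my reading of the reduction; I write the data points as $(u_i,v_i)$ to avoid clashing with the LP variables $x_i$): take $k=2$ with $(w_1,w_2)=(a,b)$, set $a_1=a_2=0$ and $b_i=1$ for all $i$, and let each $J_i$ consist of the two indices $j, j'$ with

$$\beta_j = (u_i, 1),\quad \alpha_j = v_i \qquad\text{and}\qquad \beta_{j'} = (-u_i, -1),\quad \alpha_{j'} = -v_i.$$

Then the two constraints in $J_i$ say $x_i \geq \pm(au_i + b - v_i)$, i.e. $x_i = |au_i + b - v_i|$ at an optimum, the objective $\sum_i x_i$ is the $L_1$ error, and $m = 2n = O(n)$.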

1 Open problem

One can also define another metric, the lexicographic minimum. This idea was already present in fairness-related linear regression [6]. Sort the values of $|y - f(x)|$ for $(x,y)\in S$ to obtain $a_1,\ldots,a_n$, where $a_1\geq a_2 \geq \ldots \geq a_n$. We are interested in finding an $f$ that minimizes the sequence $a_1,\ldots,a_n$ lexicographically. Can this problem be solved in $O(n)$ time?

References

[1] N. Megiddo, Linear programming in linear time when the dimension is fixed, J. ACM. 31 (1984) 114–127 10.1145/2422.322418.

[2] P. Bloomfield, W. Steiger, Least absolute deviations curve-fitting, SIAM Journal on Scientific and Statistical Computing. 1 (1980) 290–301 10.1137/0901019.

[3] J. Matoušek, Range searching with efficient hierarchical cuttings, Discrete & Computational Geometry. 10 (1993) 157–182 10.1007/BF02573972.

[4] N. Megiddo, A. Tamir, Finding Least-Distances Lines, SIAM Journal on Algebraic Discrete Methods. 4 (1983) 207–211 10.1137/0604021.

[5] E. Zemel, An O(n) algorithm for the linear multiple choice knapsack problem and related problems, Information Processing Letters. 18 (1984) 123–128 10.1016/0020-0190(84)90014-0.

[6] M. Köppen, K. Yoshida, K. Ohnishi, Evolving fair linear regression for the representation of human-drawn regression lines, in: 2014 International Conference on Intelligent Networking and Collaborative Systems, 2014: pp. 296–303 10.1109/INCoS.2014.89.

Posted by Chao Xu on 2019-03-28.
Tags: combinatorial optimization.