The Art Gallery Guardian

Linear regression


I read an article on errors in visualization. The example of forcing a relationship by cherry-picking scales is delightful. I recommend reading it.

I am interested in misleading people while being completely honest. The article inspires the following problem. Given two vectors $x, y \in \mathbb{R}^n$, let $\mathbb{1}$ be the all-ones vector in $\mathbb{R}^n$. We are interested in finding $\alpha, \beta \in \mathbb{R}$ such that $\|y - (\alpha x + \beta \mathbb{1})\|_p$ is minimized. Here $p$ is either $1$ or $\infty$.

Note the problem is precisely the linear regression problem. In the linear regression problem, we are given a point set $P = \{(x_1, y_1), \dots, (x_n, y_n)\}$ of size $n$, and we are interested in finding a line $y = \alpha x + \beta$ that minimizes the error, defined as

$$\left( \sum_{i=1}^n |y_i - (\alpha x_i + \beta)|^p \right)^{1/p}.$$

For $p = 2$, there is an $O(n)$ time algorithm because there is a closed-form formula. For $p = \infty$, the problem can be rewritten as a linear program with $3$ variables and $2n$ constraints. Using Megiddo's result [1], there is an $O(n)$ time algorithm to solve this problem.
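As a sanity check, here is a minimal Python sketch of the closed formula for $p = 2$ (my own illustration, not from the post's references); it runs in $O(n)$ time.

```python
import numpy as np

def l2_regression(x, y):
    """Closed-form least-squares fit of y ~ alpha * x + beta, in O(n) time."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x.mean(), y.mean()
    alpha = np.dot(x - xm, y - ym) / np.dot(x - xm, x - xm)
    beta = ym - alpha * xm
    return alpha, beta
```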

It is hard to find the worst-case complexity for $p = 1$ in the literature. This case is called least absolute deviations. Statisticians just don't care about worst-case running time the way CS people do.

There are a few methods I found. One is to write it as a linear program on $n + 2$ variables and $2n$ constraints and solve it using the simplex method. The linear program is as follows.

$$\begin{aligned} \min \quad & \sum_{i=1}^n e_i\\ \text{s.t.} \quad & e_i \ge y_i - (\alpha x_i + \beta) && \text{for all } i,\\ & e_i \ge -(y_i - (\alpha x_i + \beta)) && \text{for all } i. \end{aligned}$$
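A minimal sketch of this LP with `scipy.optimize.linprog` (the solver choice is my assumption; the post does not prescribe one). Variables are ordered $(\alpha, \beta, e_1, \dots, e_n)$.

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression_lp(x, y):
    """Least absolute deviations via the LP above.

    Variables: (alpha, beta, e_1, ..., e_n); minimize sum of e_i.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    c = np.r_[0.0, 0.0, np.ones(n)]           # objective: sum_i e_i
    E = np.eye(n)
    # e_i >= y_i - (alpha x_i + beta)  <=>  -alpha x_i - beta - e_i <= -y_i
    A1 = np.c_[-x, -np.ones(n), -E]
    # e_i >= (alpha x_i + beta) - y_i  <=>   alpha x_i + beta - e_i <=  y_i
    A2 = np.c_[x, np.ones(n), -E]
    bounds = [(None, None)] * 2 + [(0, None)] * n   # alpha, beta free
    res = linprog(c, A_ub=np.vstack([A1, A2]), b_ub=np.r_[-y, y], bounds=bounds)
    return res.x[0], res.x[1]                 # alpha, beta
```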

There are a bunch of other algorithms that specialize the simplex algorithm to this particular problem. There are also some iterative methods. Unfortunately, the running times of those algorithms depend on the actual numbers in the input. I want a running time that depends only on $n$.

There exists an optimal solution whose line passes through two points of $P$. The naive algorithm is to try all $\binom{n}{2}$ possible lines. For each line, the algorithm can compute the error in $O(n)$ time, so the naive algorithm's running time is $O(n^3)$. There is a smarter algorithm: the optimal line that contains a given point $p = (x_p, y_p) \in P$ can actually be found in $O(n)$ time. Indeed, consider a line that passes through $p$, and change its slope $a$ while maintaining that it still contains $p$. The error is $\sum_i w_i |s_i - a|$, where $s_i = (y_i - y_p)/(x_i - x_p)$ is the slope determined by $p$ and the $i$th point and $w_i = |x_i - x_p|$; this is piecewise linear and convex in $a$, so a minimum is reached at a line through $p$ and some other point. Assume we reorder the points so that $s_1 \le \dots \le s_n$ (namely, increasing slope). Let $k$ be the smallest integer such that $\sum_{i=1}^{k} w_i \ge \frac{1}{2} \sum_{i=1}^{n} w_i$. The line determined by $p$ and the $k$th point is the desired line. This is a weighted median, which can be computed in linear time without sorting. Hence one can show the overall running time is $O(n^2)$. This is the idea of [2]. That is all I could find through an hour of searching.
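A sketch of the $O(n^2)$ algorithm under the setup above (my own illustration). The weighted median here is sorting-based for brevity, costing $O(n \log n)$ per point; a linear-time selection would give the claimed bound.

```python
import numpy as np

def best_line_through(px, py, x, y):
    """Best L1 line through (px, py): weighted median of slopes.

    Sorting-based sketch; a linear-time weighted median (via selection)
    makes this O(n) per anchor point, as in the post.
    """
    mask = x != px                            # skip vertical slopes
    w = np.abs(x[mask] - px)                  # weights w_i = |x_i - x_p|
    s = (y[mask] - py) / (x[mask] - px)       # slopes s_i
    order = np.argsort(s)
    cum = np.cumsum(w[order])
    k = np.searchsorted(cum, cum[-1] / 2.0)   # smallest k with prefix >= half
    a = s[order][k]
    return a, py - a * px                     # slope, intercept

def l1_regression_quadratic(x, y):
    """Try every point as the anchor; keep the line with the least error."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    best = None
    for px, py in zip(x, y):
        a, b = best_line_through(px, py, x, y)
        err = np.abs(y - (a * x + b)).sum()
        if best is None or err < best[0]:
            best = (err, a, b)
    return best[1], best[2]
```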

After discussing with Qizheng He, he suggested the following approach. Consider the function $f_p$ for $p \in P$, defined so that $f_p(a)$ is the error of the line of slope $a$ that contains $p$. The function is bitonic (it is in fact convex), therefore we can do a ternary search to find the minimum. There are only $n - 1$ possible slopes, hence the ternary search takes $O(\log n)$ queries, where each query asks for the error of the line that goes through $p$ and some other point.
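A sketch of the ternary search over the candidate slopes (my own illustration). The error query is evaluated naively in $O(n)$ here; the post replaces it with $O(1)$ halfspace counting queries, described next.

```python
import numpy as np

def ternary_search_through(px, py, x, y):
    """Ternary search over sorted candidate slopes of lines through (px, py)."""
    mask = x != px
    s = np.sort((y[mask] - py) / (x[mask] - px))   # candidate slopes

    def err(a):
        # O(n) stand-in for the halfspace-counting error query
        return np.abs(y - (a * x + (py - a * px))).sum()

    lo, hi = 0, len(s) - 1
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if err(s[m1]) < err(s[m2]):
            hi = m2 - 1   # by convexity, a minimizer lies left of m2
        else:
            lo = m1 + 1   # by convexity, a minimizer lies right of m1
    a = min(s[lo:hi + 1], key=err)
    return a, py - a * px
```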

Given a line $\ell$, can one compute the error quickly? It is possible to decompose it into a few halfspace range counting queries (allowing weights). In the halfspace counting query problem, we are given $n$ points with weights, which we preprocess into a data structure. Each query to the data structure is a halfspace, and the output is the sum of the weights of all points in the halfspace. In 2D, there exists a data structure with $O(n \log n)$ preprocessing time and $O(\sqrt{n})$ query time [3]. Let $A$ be the set of points above $\ell : y = \alpha x + \beta$, and $B$ be the set of points below $\ell$. The error is precisely the following.

$$\sum_{(x_i, y_i) \in A} \left( y_i - \alpha x_i - \beta \right) + \sum_{(x_i, y_i) \in B} \left( \alpha x_i + \beta - y_i \right)$$

Let's consider the second sum, $\sum_{(x_i, y_i) \in B} (\alpha x_i + \beta - y_i) = \alpha \sum_{(x_i, y_i) \in B} x_i + \beta |B| - \sum_{(x_i, y_i) \in B} y_i$. The terms $\sum_{(x_i, y_i) \in B} x_i$, $\sum_{(x_i, y_i) \in B} y_i$ and $|B|$ can each be obtained with a halfspace counting query over the points below $\ell$, with weights $x_i$, $y_i$ and $1$ respectively. Handling the first sum symmetrically, this shows the error can be computed with $O(1)$ halfspace counting queries.
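The decomposition is easy to check numerically. The sketch below computes the aggregates by brute force where the real algorithm would issue six weighted halfspace queries.

```python
import numpy as np

def l1_error_by_aggregates(alpha, beta, x, y):
    """Error of y = alpha*x + beta from six aggregate sums.

    Each sum corresponds to one weighted halfspace counting query
    (weights x_i, y_i, and 1; above and below the line).
    """
    below = y < alpha * x + beta              # B: points strictly below
    above = ~below                            # A: points on or above (on-line terms are 0)
    err_above = y[above].sum() - alpha * x[above].sum() - beta * above.sum()
    err_below = alpha * x[below].sum() + beta * below.sum() - y[below].sum()
    return err_above + err_below
```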

How can one do the ternary search? This requires us to be able to pick the point that gives the $k$th largest slope with $p$. We need a data structure that can return the $k$th point in the radial ordering of the points of $P$ around $p$. This is equivalent to halfspace range counting up to polylog factors.

Thus, after building the data structure in $O(n \log n)$ time, the total running time is $n$ ternary searches over $n$ elements each, where each decision takes $\tilde{O}(\sqrt{n})$ time. Therefore the final running time is $\tilde{O}(n^{3/2})$.

Qizheng mentioned the problem to Timothy Chan, who gave us some references. There is an easy solution that obtains a near-linear time algorithm using simple parametric search [4]. Consider the following linear program. Let $d$ be a constant. We are given $c \in \mathbb{R}^d$, $d$-dimensional vectors $a_1, \dots, a_n$ and reals $b_1, \dots, b_n$. Sets $S_1, \dots, S_m$ form a partition of $\{1, \dots, n\}$.

$$\min_{x \in \mathbb{R}^d} \; c^\top x + \sum_{j=1}^m \max_{i \in S_j} \left( a_i^\top x + b_i \right)$$

Zemel showed such a linear program can be solved in $O(n)$ time for constant $d$ [5]. The idea is an algorithm similar to Megiddo's linear time constant dimension LP algorithm [1]. For the linear regression problem in the plane with $n$ data points, the linear program we derived is a special case of the above with $d = 2$ and $|S_j| = 2$ for all $j$: each point contributes a pair of linear functions $\pm(\alpha x_i + \beta - y_i)$, whose maximum is the absolute residual. In fact, Zemel used the same linear program to show constant dimension $\ell_1$ regression can be solved in linear time.
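To make the reduction concrete, here is a sketch of the encoding (my own construction under the program form stated above; the specific pairing of rows into groups is an illustrative choice).

```python
import numpy as np

def to_zemel_form(x, y):
    """Encode L1 regression as min over z=(alpha, beta) of sum_j max_{i in S_j}(a_i.z + b_i).

    Point j yields the pair S_j = {j, j+n} carrying +/-(alpha*x_j + beta - y_j);
    the max of the pair is the j-th absolute residual.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    ones = np.ones_like(x)
    a = np.vstack([np.c_[x, ones], np.c_[-x, -ones]])   # rows a_i
    b = np.r_[-y, y]                                    # offsets b_i
    groups = [(j, j + len(x)) for j in range(len(x))]   # the partition S_j
    return a, b, groups

def zemel_objective(z, a, b, groups):
    """Piecewise-linear objective; equals the L1 error of z = (alpha, beta)."""
    vals = a @ np.asarray(z, float) + b
    return sum(max(vals[i], vals[k]) for i, k in groups)
```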

1 Open problem

One can also define another metric, the lexicographic minimum. Such an idea was already present in fairness-related linear regression [6]. Sort the values of $|y_i - (\alpha x_i + \beta)|$ for $1 \le i \le n$, obtaining $e_1, \dots, e_n$ with $e_1 \ge e_2 \ge \dots \ge e_n$. We are interested in finding $(\alpha, \beta)$ that minimizes the sequence $(e_1, \dots, e_n)$ lexicographically. Can this problem be solved in near-linear time?
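For concreteness, candidate lines are compared by their sorted error sequences; a minimal sketch of the comparison key (my illustration):

```python
import numpy as np

def lex_key(alpha, beta, x, y):
    """Descending sorted error sequence; compare keys lexicographically."""
    return tuple(sorted(np.abs(y - (alpha * x + beta)), reverse=True))

# lex_key(a1, b1, x, y) < lex_key(a2, b2, x, y) means line 1 is lex-better.
```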

References

[1]
N. Megiddo, Linear programming in linear time when the dimension is fixed, J. ACM 31 (1984) 114–127. doi:10.1145/2422.322418.
[2]
P. Bloomfield, W. Steiger, Least absolute deviations curve-fitting, SIAM Journal on Scientific and Statistical Computing 1 (1980) 290–301. doi:10.1137/0901019.
[3]
J. Matoušek, Range searching with efficient hierarchical cuttings, Discrete & Computational Geometry 10 (1993) 157–182. doi:10.1007/BF02573972.
[4]
N. Megiddo, A. Tamir, Finding least-distances lines, SIAM Journal on Algebraic Discrete Methods 4 (1983) 207–211. doi:10.1137/0604021.
[5]
E. Zemel, An O(n) algorithm for the linear multiple choice knapsack problem and related problems, Information Processing Letters 18 (1984) 123–128. doi:10.1016/0020-0190(84)90014-0.
[6]
M. Köppen, K. Yoshida, K. Ohnishi, Evolving fair linear regression for the representation of human-drawn regression lines, in: 2014 International Conference on Intelligent Networking and Collaborative Systems, 2014, pp. 296–303. doi:10.1109/INCoS.2014.89.
Posted by Chao Xu.
Tags: combinatorial optimization.