Least Squares
The scientific custom of taking multiple observations of the same quantity and then selecting a single estimate that best represents it has its origin in the early part of the sixteenth century. At the beginning of the nineteenth century, two of the foremost mathematicians of the time, the German C.F. Gauss (1777--1855) and the Frenchman A.M. Legendre (1752--1833), proposed, independently of each other, an optimality criterion, the method of least squares, which plays the same role in fitting a model to data that the average plays in combining repeated observations of a single quantity.
The context for Legendre's proposal of least squares was geodesy. At that time, France's scientists had decided to adopt a new measurement system whose unit of length, the meter, would equal 1/40,000,000 of the circumference of the earth. This necessitated an accurate determination of that circumference, which in turn depended on the exact shape of the earth.
The credibility of the method of least squares was greatly enhanced by the Ceres incident. On January 1, 1801, the Italian astronomer Giuseppe Piazzi sighted a heavenly body that he strongly suspected to be a new planet. He announced his discovery and named it Ceres. Unfortunately, six weeks later, before enough observations had been taken to determine its orbit accurately and so confirm that it was indeed a planet, Ceres disappeared behind the sun and was not expected to reemerge for nearly a year. Interest in this possibly new planet was widespread, and the young Gauss proposed that an area of the sky be searched that was quite different from those suggested by other astronomers; he turned out to be right! The successful prediction made him a celebrity, and his calculation rested on two mathematical methods now associated with his name: row echelon reduction and the method of least squares.
In science and business, we often need to predict the relationship between two given variables. In many cases, we begin by performing an appropriate experiment or statistical analysis to obtain the necessary data. However, even if a simple law governs the behavior of the variables, this law may not be easy to find because of errors introduced in measuring or sampling. In practice, therefore, we are often content with a functional equation that provides a close approximation. There are three general techniques for finding functions that are closest to a given set of data:
- linear regression using linear polynomials (matching straight lines);
- general linear regression (polynomials, etc.), and
- transformations to linear regression (for matching exponential or logarithmic functions).
To judge how well a function y = f(x) fits a set of data points \( \{ ( x_i , y_i ) \}_{i=1}^n , \) we measure the size of the error vector \( {\bf e} = \left( f(x_1 ) - y_1 , \ldots , f(x_n ) - y_n \right) \) in one of the following norms:
- maximum error: \( \quad \| e \|_{\infty} = \max_{1\le i \le n} \{ | f(x_i ) - y_i | \} ; \)
- average error: \( \quad \| e \|_1 = \frac{1}{n} \,\sum_{i=1}^n | f(x_i ) - y_i | ; \)
- root-mean-square error (or \( L_2 \)-error): \( \quad \| e \|_2 = \left( \frac{1}{n} \,\sum_{i=1}^n | f(x_i ) - y_i |^2 \right)^{1/2} . \)
Example. Compare the maximum error, average error, and rms error for the linear approximation \( y = f(x) = 3.5 -1.3\,x \) to the data points (-1,5), (0,4), (1,2), (2,1), (3,0), (4,-2), (5,-3), (6,-4).
The errors are summarized in the following table:
| \( x_i \) | \( y_i \) | \( f(x_i ) = 3.5 -1.3\,x_i \) | \( |e_i| \) | \( e_i^2 \) |
|---|---|---|---|---|
| -1 | 5 | 4.8 | 0.2 | 0.04 |
| 0 | 4 | 3.5 | 0.5 | 0.25 |
| 1 | 2 | 2.2 | 0.2 | 0.04 |
| 2 | 1 | 0.9 | 0.1 | 0.01 |
| 3 | 0 | -0.4 | 0.4 | 0.16 |
| 4 | -2 | -1.7 | 0.3 | 0.09 |
| 5 | -3 | -3.0 | 0.0 | 0.00 |
| 6 | -4 | -4.3 | 0.3 | 0.09 |
Hence the maximum error is \( \| e \|_{\infty} = 0.5 , \) the average error is \( \| e \|_1 = 2.0/8 = 0.25 , \) and the root-mean-square error is \( \| e \|_2 = \left( 0.68/8 \right)^{1/2} \approx 0.2915 . \)
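The three error measures in this example can be reproduced with a short computation; the sketch below assumes NumPy and uses only the data and trial line from the table above.

```python
import numpy as np

# data points and the trial line y = 3.5 - 1.3 x from the example above
x = np.array([-1, 0, 1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([ 5, 4, 2, 1, 0, -2, -3, -4], dtype=float)
f = 3.5 - 1.3 * x                      # predicted values f(x_i)

e = f - y                              # residuals e_i
max_err = np.max(np.abs(e))            # ||e||_inf  = 0.5
avg_err = np.mean(np.abs(e))           # ||e||_1    = 0.25
rms_err = np.sqrt(np.mean(e**2))       # ||e||_2    ~ 0.2915

print(max_err, avg_err, rms_err)
```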
Let \( \{ ( x_i , y_i) \}_{i=1}^n \) be a set of n points, where the abscissas xi are distinct. The least-squares line \( y = m\,x+b \) is the line that minimizes the root-mean-square error \( \| e \|_2 . \)
The quantity \( \| e \|_2 \) is a minimum if and only if the quantity \( \sum_{i=1}^n \left( m\,x_i +b - y_i \right)^2 \) is a minimum. The latter is visualized geometrically as minimizing the sum of the squares of the vertical distances from the points to the line.
Example. Suppose we are given a collection of data points (t, y): (0, 5), (1, 3), (2, 7), and we try to pass a straight line \( y = \alpha + \beta\,t \) through them. In our example, no line goes through all three points, so the resulting system of equations for α and β is not solvable.
In statistical modeling, one often wishes to estimate certain parameters xj based on observations that are contaminated by noise. For instance, as part of the college admission process, one may wish to predict the college grade point average (GPA) of a freshman applicant, which we denote by b, based on the applicant's high school GPA, denoted by a1, and two Scholastic Aptitude Test scores, one verbal (a2) and one quantitative (a3). It is customary to use a linear model in this case: \( b = a_1 x_1 + a_2 x_2 + a_3 x_3 . \) ■
Now we formulate the least squares problem. Suppose that we have a set of vectors \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} \) (usually, but not necessarily, linearly independent) and a given vector b. We seek coefficients \( x_1 , x_2 , \ldots , x_n \) that produce a minimal error \( \left\| {\bf b} - x_1 {\bf a}_1 - x_2 {\bf a}_2 - \cdots - x_n {\bf a}_n \right\|_2 . \)
If the column vectors \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} \) are linearly independent (which will be assumed), the matrix \( {\bf A} = \left[ {\bf a}_1 \ {\bf a}_2 \ \cdots \ {\bf a}_n \right] \) has full column rank. The vector \( {\bf A}\,\hat{\bf x} = \hat{x}_1 {\bf a}_1 + \hat{x}_2 {\bf a}_2 + \cdots + \hat{x}_n {\bf a}_n \) that is closest to a given vector b is its orthogonal projection onto the subspace spanned by the a's. The hat over \( \hat{\bf x} \) indicates the best choice (in the sense that it minimizes \( \| {\bf A} {\bf x} - {\bf b} \|_2 \) ), the one that gives the closest vector in the column space. Therefore, solving the least squares problem is equivalent to finding the orthogonal projection matrix P onto the column space such that \( {\bf P} {\bf b} = {\bf A}\,\hat{\bf x} . \)
The vector \( \hat{\bf x} \) is a solution to the least squares problem when the error vector \( {\bf e} = {\bf b} - {\bf A} \hat{\bf x} \) is perpendicular to the subspace. Therefore, this error vector makes a right angle with each of the vectors \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} . \) That gives the n equations we need to find \( \hat{\bf x} : \)
\( {\bf a}_i^{\mathrm T} \left( {\bf b} - {\bf A}\,\hat{\bf x} \right) = 0 , \quad i = 1, 2, \ldots , n , \)
or, in matrix form, the normal equation \( {\bf A}^{\mathrm T} {\bf A}\,\hat{\bf x} = {\bf A}^{\mathrm T} {\bf b} . \)
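The orthogonality conditions can be checked numerically; the sketch below, using a small randomly generated system chosen purely for illustration, forms the normal equations and confirms that the resulting error vector is perpendicular to every column of A.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))        # illustrative 6x3 matrix (full column rank)
b = rng.standard_normal(6)             # illustrative right-hand side

# normal equations:  A^T A x = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# the error vector e = b - A x_hat is orthogonal to every column of A
e = b - A @ x_hat
print(A.T @ e)                         # ~ zero vector (up to round-off)
```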
Now we derive the normal equation using another approach: calculus. To find the best approximation, we look for the x where the gradient of \( \| {\bf A} {\bf x} - {\bf b} \|_2^2 = \left( {\bf A} {\bf x} - {\bf b} \right)^{\mathrm T} \left( {\bf A} {\bf x} - {\bf b} \right) \) vanishes. So we want
\( \nabla_{\bf x} \| {\bf A} {\bf x} - {\bf b} \|_2^2 = 2\,{\bf A}^{\mathrm T} {\bf A}\,{\bf x} - 2\,{\bf A}^{\mathrm T} {\bf b} = {\bf 0} , \)
which leads again to the normal equation \( {\bf A}^{\mathrm T} {\bf A}\,\hat{\bf x} = {\bf A}^{\mathrm T} {\bf b} . \)
Why is \( \hat{\bf x} = \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} \, {\bf b} \) the minimizer of \( \| {\bf A} {\bf x} - {\bf b} \|_2^2 ? \) Because A has full column rank, the Hessian \( 2\,{\bf A}^{\mathrm T} {\bf A} \) is positive definite, which means that the function is strictly convex and any critical point is a global minimum. ■
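As a sanity check on this argument, the following sketch (again with a randomly generated system used only for illustration) confirms that \( {\bf A}^{\mathrm T} {\bf A} \) has positive eigenvalues and that the normal-equation solution agrees with NumPy's least squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))        # illustrative 6x3 matrix (full column rank)
b = rng.standard_normal(6)

H = A.T @ A                            # half of the Hessian of ||Ax - b||^2
print(np.linalg.eigvalsh(H))           # all positive => strictly convex objective

x_normal = np.linalg.solve(H, A.T @ b)           # via the normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # NumPy's least squares solver
print(np.allclose(x_normal, x_lstsq))            # True
```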
Theorem: Suppose that \( \{ ( x_i , y_i) \}_{i=1}^n \) are n points, where the abscissas xi are distinct. The coefficients of the least squares line \( y = m\,x + b \) are the solution of the normal equations
\( m \sum_{i=1}^n x_i^2 + b \sum_{i=1}^n x_i = \sum_{i=1}^n x_i y_i , \qquad m \sum_{i=1}^n x_i + b\, n = \sum_{i=1}^n y_i . \)
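A direct translation of these normal equations into NumPy (a minimal sketch, applied to the data of the earlier example) is shown below; the fitted coefficients come out close to, but not identical with, the trial line \( y = 3.5 - 1.3\,x \) used there.

```python
import numpy as np

# data from the example above
x = np.array([-1, 0, 1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([ 5, 4, 2, 1, 0, -2, -3, -4], dtype=float)
n = len(x)

# normal equations for the line y = m x + b:
#   m*sum(x^2) + b*sum(x) = sum(x*y)
#   m*sum(x)   + b*n      = sum(y)
M   = np.array([[np.sum(x**2), np.sum(x)],
                [np.sum(x),    n        ]])
rhs = np.array([np.sum(x * y), np.sum(y)])

m, b = np.linalg.solve(M, rhs)
print(m, b)    # for this data set: m ~ -1.3214, b ~ 3.6786
```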
A matrix has full row rank when its rows are linearly independent and full column rank when its columns are linearly independent. For a square matrix these two concepts are equivalent, and we say the matrix is full rank if all rows and columns are linearly independent. A square matrix is full rank if and only if its determinant is nonzero.
For a non-square matrix with m rows and n columns, it will always be the case that either the rows or the columns (whichever are larger in number) are linearly dependent. Hence, when we say that a non-square matrix is full rank, we mean that its row and column rank are as high as possible, given the shape of the matrix. So if there are more rows than columns (m > n), the matrix is full rank exactly when it has full column rank.
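As a quick numerical illustration (a minimal sketch assuming NumPy, not something the text itself requires), the rank of the 3×2 matrix with columns (1, 1, 1) and (0, 1, 2) that appears in the example below can be checked as follows.

```python
import numpy as np

A = np.array([[1, 0],
              [1, 1],
              [1, 2]], dtype=float)    # 3 rows, 2 columns (m > n)

print(np.linalg.matrix_rank(A))        # 2  => full column rank, hence full rank
```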
Suppose that \( {\bf A}\,{\bf x} = {\bf b} \) is an inconsistent linear system of m algebraic equations in n unknowns. We suspect that the inconsistency is caused either by too many equations or by errors in the entries of the matrix A or the input vector b. Usually, there are more equations than unknowns (m is greater than n). The column space constitutes a small part of m-dimensional space, and b lies outside the column space. Since no exact solution is possible, we seek a vector \( \hat{\bf x} \) that comes as close as possible.
Since no straight line goes through those points, we look for the line \( y = \alpha + \beta\,t \) that is as close to the given points as possible. Substituting the coordinates into our linear equation, we get the system of equations
\( \alpha + \beta \cdot 0 = 5 , \quad \alpha + \beta \cdot 1 = 3 , \quad \alpha + \beta \cdot 2 = 7 , \)
that is, \( {\bf A}\,{\bf x} = {\bf b} \) with \( {\bf A} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix} , \quad {\bf x} = \begin{bmatrix} \alpha \\ \beta \end{bmatrix} , \quad {\bf b} = \begin{bmatrix} 5 \\ 3 \\ 7 \end{bmatrix} . \)
From the geometric point of view, every Ax lies in the plane spanned by the columns (1, 1, 1) and (0, 1, 2). In that plane, we look for the point closest to b = [5, 3, 7]. The nearest point is the projection \( {\bf A}\,\hat{\bf x} = \mbox{proj}_{\text{column space}} {\bf b} . \) The error vector is \( {\bf e} = {\bf b} - {\bf A}\,\hat{\bf x} . \)
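For this small system, the least squares solution, the projection, and the error vector can be computed in a few lines; the sketch below assumes NumPy and only reproduces the numbers of this example.

```python
import numpy as np

A = np.array([[1, 0],
              [1, 1],
              [1, 2]], dtype=float)    # columns (1,1,1) and (0,1,2)
b = np.array([5, 3, 7], dtype=float)

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)   # [alpha, beta] = [4, 1]
p = A @ x_hat                                   # projection of b:  [4, 5, 6]
e = b - p                                       # error vector:     [1, -2, 1]
print(x_hat, p, e)
print(A.T @ e)                                  # ~ [0, 0]: e is orthogonal to the columns
```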
From the above theorem, it follows that if \( V = \mathbb{R}^m \) and U is the column space of the matrix A, then the best approximation to b from the column space is its projection. But every vector in the column space of A is expressible in the form Ax for some vector x, so there is at least one vector \( \hat{\bf x} \) for which \( {\bf A}\,\hat{\bf x} = \mbox{proj}_{\text{column space}} {\bf b} . \) Each such vector is a least squares solution of Ax = b. Although there may be more than one least squares solution of Ax = b, each such solution \( \hat{\bf x} \) has the same error vector \( {\bf b} - {\bf A}\,\hat{\bf x} . \)
If the kernel of A is nontrivial, there can be many vectors \( \hat{\bf x} \) that satisfy the equation \( {\bf A}\,\hat{\bf x} = \mbox{proj}_{U} {\bf b} , \) where U is the column space of the matrix A. Fortunately, all of them produce the same projection. If the matrix A has full column rank, then the square symmetric matrix \( {\bf A}^{\mathrm T}\,{\bf A} \) is invertible, and upon application of its inverse we find the unique least squares solution
\( \hat{\bf x} = \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} {\bf b} . \)
When A does not have full column rank, the least squares solution of minimal norm is given by \( \hat{\bf x} = {\bf A}^{\dagger} {\bf b} , \) where \( {\bf A}^{\dagger} \) is the Moore--Penrose pseudoinverse of A, the unique matrix satisfying the following four conditions (they are verified numerically in the sketch after the list):
- \( {\bf A}\,{\bf A}^{\dagger} \,{\bf A} = {\bf A} \qquad \) (matrix \( {\bf A}\,{\bf A}^{\dagger} \) maps all column vectors of A to themselves)
- \( {\bf A}^{\dagger} {\bf A} \,{\bf A}^{\dagger} = {\bf A}^{\dagger} \qquad \)
- \( \left( {\bf A}\,{\bf A}^{\dagger} \right)^{\ast} = {\bf A}\,{\bf A}^{\dagger} \qquad \) (matrix \( {\bf A}\,{\bf A}^{\dagger} \) is self-adjoint)
- \( \left( {\bf A}^{\dagger} {\bf A} \right)^{\ast} = {\bf A}^{\dagger} {\bf A} \qquad \) (matrix \( {\bf A}^{\dagger} {\bf A} \) is self-adjoint)
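These four conditions can be checked numerically for a concrete matrix; the following minimal sketch uses NumPy's pinv (an assumption of this illustration) applied to the 3×2 matrix from the example above.

```python
import numpy as np

A = np.array([[1, 0],
              [1, 1],
              [1, 2]], dtype=float)
A_dag = np.linalg.pinv(A)              # Moore-Penrose pseudoinverse of A

print(np.allclose(A @ A_dag @ A, A))           # A A† A  = A
print(np.allclose(A_dag @ A @ A_dag, A_dag))   # A† A A† = A†
print(np.allclose((A @ A_dag).T, A @ A_dag))   # A A†  is symmetric (self-adjoint)
print(np.allclose((A_dag @ A).T, A_dag @ A))   # A† A  is symmetric (self-adjoint)
```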
Portraits (left to right): E. H. Moore; Roger Penrose, 2011; Erik Ivar Fredholm.
The term pseudoinverse was independently described by E. H. Moore (1862--1932) in 1920, Arne Bjerhammar (1917--2011) in 1951, and Roger Penrose (born in 1931) in 1955. Earlier, Erik Ivar Fredholm (1866--1927) had introduced the concept of a pseudoinverse of integral operators in 1903.
The American mathematician Eliakim Hastings Moore studied mathematics at Yale University. He received his Ph.D. in 1885 with a thesis, supervised by Hubert Anson Newton, on some work of William Kingdon Clifford and Arthur Cayley. Newton encouraged Moore to study in Germany, and thus he spent an academic year at the University of Berlin, attending lectures by Kronecker and Weierstrass. On his return to the United States, Moore taught at Yale and at Northwestern University. When the University of Chicago opened its doors in 1892, Moore was the first head of its mathematics department, a position he retained until his death.
Arne Bjerhammar was a Swedish geodesist. He was a professor at the Royal Institute of Technology in Stockholm. He was born in Båstad, Scania, in the south of Sweden. He developed a method for determining the geoid from gravimetric data, as well as a system for electro-optical measuring of distances. Seven years after introducing the pseudoinverse, fascinated by M.S. Molodensky's new approach to solving the basic problems of physical geodesy, he presented his original idea of analytical downward continuation of the gravity anomaly to an internal sphere.
Sir Roger Penrose is an English mathematical physicist, mathematician and philosopher of science. Penrose told a Russian audience that his grandmother had left St. Petersburg in the late 1880s. His uncle was artist Roland Penrose, whose son with photographer Lee Miller is Antony Penrose. Penrose is the brother of physicist Oliver Penrose and of chess Grandmaster Jonathan Penrose. Penrose attended University College School and University College, London, where he graduated with a first class degree in mathematics. In 1955, while still a student, Penrose reintroduced the generalized matrix inverse. Penrose finished his PhD at Cambridge in 1958, with a thesis on "tensor methods in algebraic geometry" under algebraist and geometer John A. Todd. He devised and popularized the Penrose triangle in the 1950s, describing it as "impossibility in its purest form" and exchanged material with the artist M. C. Escher, whose earlier depictions of impossible objects partly inspired it.
The Swedish mathematician Erik Ivar Fredholm is remembered for his work on integral equations and operator theory, which foreshadowed the theory of Hilbert spaces. He obtained his PhD at Uppsala University in 1898, under the supervision of Gösta Mittag-Leffler. He was a docent at Stockholm University from 1898 to 1906 and a professor from 1906 until his death.