Least Squares

The scientific custom of taking multiple observations of the same quantity and then selecting a single estimate that best represents it has its origin in the early part of the 16th century. At the beginning of the 19th century, two of the foremost mathematicians of the time, the German C.F. Gauss (1777--1855) and the Frenchman A.M. Legendre (1752--1833), independently proposed an optimality criterion for such an estimate; in the simplest case of repeated measurements of a single quantity, the estimate it selects is just the ordinary average.

The context for Legendre's proposal of least squares was geodesy. At that time, France's scientists had decided to adopt a new measurement system and proposed that the unit of length, to be called the meter, should equal 1/40,000,000 of the circumference of the earth. This necessitated an accurate determination of that circumference, which in turn depended on the exact shape of the earth.

The credibility of the method of least squares was greatly enhanced by the Ceres incident. On January 1, 1801, the Italian astronomer Giuseppe Piazzi sighted a heavenly body that he strongly suspected to be a new planet. He announced his discovery and named it Ceres. Unfortunately, six weeks later, before enough observations had been taken to determine its orbit accurately enough to ascertain that it was indeed a planet, Ceres disappeared behind the sun and was not expected to reemerge for nearly a year. Interest in the possible new planet was widespread, and the young Gauss proposed that an area of the sky be searched that was quite different from those suggested by other astronomers; he turned out to be right! The rediscovery made him a celebrity, and his computation rested on two mathematical methods now associated with his name: row echelon reduction (Gaussian elimination) and the method of least squares.

In science and business, we often need to determine the relationship between two given variables. In many cases, we begin by performing an appropriate experiment or statistical analysis to obtain the necessary data. However, even if a simple law governs the behavior of the variables, this law may not be easy to find because of errors introduced in measuring or sampling. In practice, therefore, we are often content with a functional equation that provides a close approximation. There are several general techniques for finding functions that come closest to given data; in this section we focus on least squares.

Historically, besides curve fitting, the least squares technique has proved very useful in the statistical modeling of noisy data and in geodetic modeling. We discuss three standard ways to solve the least squares problem: the normal equations, the QR factorization, and the singular value decomposition.

Consider a typical application of least squares in curve fitting. Suppose that we are given a set of points \( (x_i , y_i ) \) for \( i = 1, 2, \ldots , n, \) for which we may not be able (or may not want) to find a function that passes through all of them; rather, we want a function f(x) of a particular form that passes as closely as possible to the points. The differences
\[ e_i = f(x_i ) - y_i \quad\mbox{for}\quad i =1,2,\ldots , n, \]
are called the errors, deviations, or residuals. Several norms of the residuals can be used to measure how far the curve \( y = f(x) \) lies from the data, among them
\[ \| e \|_{\infty} = \max_{1\le i \le n} |e_i| \ \ \mbox{(maximum error)}, \qquad \| e \|_{1} = \frac{1}{n} \sum_{i=1}^n |e_i| \ \ \mbox{(average error)}, \qquad \| e \|_{2} = \left( \frac{1}{n} \sum_{i=1}^n |e_i|^2 \right)^{1/2} \ \ \mbox{(root-mean-square error)}. \]

Example. Compare the maximum error, average error, and rms error for the linear approximation \( y = f(x) = 3.5 -1.3\,x \) to the data points (-1,5), (0,4), (1,2), (2,1), (3,0), (4,-2), (5,-3), (6,-4).

The errors are summarized in the following table:

\[ \begin{array}{c|c|c|c|c} x_i & y_i & f(x_i ) = 3.5 -1.3\,x_i & |e_i| & e_i^2 \\ \hline -1 & 5 & 4.8 & 0.2 & 0.04 \\ 0 & 4 & 3.5 & 0.5 & 0.25 \\ 1 & 2 & 2.2 & 0.2 & 0.04 \\ 2 & 1 & 0.9 & 0.1 & 0.01 \\ 3 & 0 & -0.4 & 0.4 & 0.16 \\ 4 & -2 & -1.7 & 0.3 & 0.09 \\ 5 & -3 & -3.0 & 0.0 & 0.00 \\ 6 & -4 & -4.3 & 0.3 & 0.09 \end{array} \]
Comparing the three measures, the maximum error \( \| e \|_{\infty} = 0.5 \) is the largest, since
\[ \| e \|_1 = \frac{1}{8}\,\sum_{i=1}^8 |e_i | = 0.25, \qquad \| e \|_2 = \frac{1}{2\sqrt{2}}\left( \sum_{i=1}^8 |e_i |^2 \right)^{1/2} \approx 0.291548 . \qquad ■ \]
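As a quick numerical check (a NumPy sketch, not part of the original exposition; the array names are ours), the three error measures for this data set can be computed directly:

```python
import numpy as np

# Data points from the example and the fitted line f(x) = 3.5 - 1.3 x
x = np.array([-1, 0, 1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([5, 4, 2, 1, 0, -2, -3, -4], dtype=float)
e = (3.5 - 1.3 * x) - y                  # residuals e_i = f(x_i) - y_i

n = len(e)
max_err = np.max(np.abs(e))              # ||e||_inf = 0.5
avg_err = np.sum(np.abs(e)) / n          # ||e||_1   = 0.25
rms_err = np.sqrt(np.sum(e**2) / n)      # ||e||_2   ~ 0.291548

print(max_err, avg_err, rms_err)
```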

Let \( \{ ( x_i , y_i) \}_{i=1}^n \) be a set of n points, where the abscissas \( x_i \) are distinct. The least-squares line \( y = m\,x+b \) is the line that minimizes the root-mean-square error \( \| e \|_2 . \)

The quantity \( \| e \|_2 \) is a minimum if and only if the quantity \( \sum_{i=1}^n \left( m\,x_i +b - y_i \right)^2 \) is a minimum. The latter is visualized geometrically as minimizing the sum of the squares of the vertical distances from the points to the line.

Example. Suppose we’re given a collection of data points (x,y):

\[ (1,1) , \quad (2,2), \quad (3,2) \]
and we want to find the closest line \( y = mx+b \) to that collection. If the line went through all three points, we’d have:
\[ \begin{split} m+b &=1 , \\ 2m+b &=2, \\ 3m+b &= 2, \end{split} \]
which is equivalent to the vector equation:
\[ \begin{bmatrix} 1&1 \\ 1&2 \\ 1&3 \end{bmatrix} \begin{bmatrix} b \\ m \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} \qquad\mbox{or, in vector form:} \qquad {\bf A} \, {\bf x} = {\bf b} . \]

In our example no line passes through all three points, so this system has no exact solution; the best we can do is a least squares solution, illustrated numerically below.
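The following NumPy sketch (illustrative only; the variable names are ours) minimizes \( \| {\bf A}{\bf x} - {\bf b} \|_2 \) for this system and returns the line \( y = \tfrac{1}{2}\,x + \tfrac{2}{3} . \)

```python
import numpy as np

# The inconsistent system from the example: A [b, m]^T = [1, 2, 2]^T
A = np.array([[1., 1.],
              [1., 2.],
              [1., 3.]])
rhs = np.array([1., 2., 2.])

# A least squares solution minimizes ||A x - rhs||_2
x_hat, residual, rank, sing_vals = np.linalg.lstsq(A, rhs, rcond=None)
b_hat, m_hat = x_hat
print(b_hat, m_hat)   # b = 2/3 ~ 0.6667, m = 1/2: the line y = 0.5 x + 0.6667
```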

In statistical modeling, one often wishes to estimate certain parameters \( x_j \) based on some observations, where the observations are contaminated by noise. For instance, one may wish to predict the college grade point average (GPA) of freshman applicants, which we denote by b, based on their high school GPA, denoted by \( a_1 , \) and two Scholastic Aptitude Test scores, one verbal (\( a_2 \)) and one quantitative (\( a_3 \)), as part of the college admission process. It is customary to use a linear model in this case: \( b = a_1 x_1 + a_2 x_2 + a_3 x_3 . \)

Now we formulate the least squares problem. Suppose that we have a set of vectors \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} \) (usually, but not necessarily, linearly independent) and a given vector b. We seek coefficients \( x_1 , x_2 , \ldots , x_n \) that produce a minimal error

\[ \left\| {\bf b} - \sum_{i=1}^n x_i \,{\bf a}_i \right\| \]
with respect to the Euclidean inner product on \( \mathbb{R}^m . \) As long as we are interested only in linear combinations of the finite set \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} ,\) we can transform the problem into one involving finite columns of numbers. Define a matrix A whose columns are the vectors \( {\bf a}_i , \) and a vector x whose entries are the unknown coefficients \( x_i . \) Then A has dimensions m by n, where m is the length of each vector \( {\bf a}_i . \) The problem is then to choose x that minimizes \( \| {\bf A} {\bf x} - {\bf b} \|_2 , \) since \( {\bf A} {\bf x} = {\bf b} \) is in general an inconsistent linear system of m equations in n unknowns. We call such a vector, if it exists, a least squares solution \( \hat{\bf x} \) of \( {\bf A} {\bf x} = {\bf b} , \) and call \( {\bf b} - {\bf A} \hat{\bf x} \) the least squares error vector. Statisticians call this procedure linear regression.
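As an illustration of this reformulation, the sketch below (NumPy; the vectors a1, a2, and b are made-up data) stacks the given vectors as the columns of A and minimizes \( \| {\bf A}{\bf x} - {\bf b} \|_2 :\)

```python
import numpy as np

# Stack the given vectors a_1, ..., a_n (here in R^4, so m = 4, n = 2) as columns of A
a1 = np.array([1., 1., 1., 1.])
a2 = np.array([0., 1., 2., 3.])
b  = np.array([1., 0., 2., 1.])
A  = np.column_stack([a1, a2])

# Coefficients x minimizing || b - sum_i x_i a_i ||_2 = || b - A x ||_2
x_hat = np.linalg.lstsq(A, b, rcond=None)[0]
print(x_hat, np.linalg.norm(b - A @ x_hat))   # coefficients and least squares error
```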

If the column vectors \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} \) are linearly independent (which will be assumed), the matrix A has full column rank. The vector \( {\bf A}\,\hat{\bf x} = \hat{x}_1 {\bf a}_1 + \hat{x}_2 {\bf a}_2 + \cdots + \hat{x}_n {\bf a}_n \) that is closest to a given vector b is its orthogonal projection onto the subspace spanned by the a's. The hat over \( \hat{\bf x} \) indicates the best choice (in the sense that it minimizes \( \| {\bf A} {\bf x} - {\bf b} \|_2 \) ), the one that gives the closest vector in the column space. Therefore, solving the least squares problem is equivalent to finding the orthogonal projection matrix P onto the column space such that \( {\bf P} {\bf b} = {\bf A}\,\hat{\bf x} . \)

The vector \( \hat{\bf x} \) is a solution to the least squares problem when the error vector \( {\bf e} = {\bf b} - {\bf A} \hat{\bf x} \) is perpendicular to the subspace. Therefore, this error vector makes a right angle with each of the vectors \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} . \) That gives the n equations we need to find \( \hat{\bf x} : \)

\[ \begin{array}{c} {\bf a}^{\mathrm T}_1 \left( {\bf b} - {\bf A} \hat{\bf x} \right) =0 \\ \vdots \quad \vdots \\ {\bf a}^{\mathrm T}_n \left( {\bf b} - {\bf A} \hat{\bf x} \right) =0 \end{array} \qquad \mbox{or} \qquad \begin{bmatrix} --- \, {\bf a}^{\mathrm T}_1 \, --- \\ \vdots \\ --- \, {\bf a}^{\mathrm T}_n \, --- \end{bmatrix} \begin{bmatrix} \quad \\ {\bf b} - {\bf A} \hat{\bf x} \\ \quad \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix} . \]
The matrix in the right equation is \( {\bf A}^{\mathrm T} . \) The n equations are exactly
\[ {\bf A}^{\mathrm T} {\bf A} \, \hat{\bf x} = {\bf A}^{\mathrm T} \, {\bf b} . \]
This equation is called the normal equation.
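As a sketch of how the normal equation is used in practice, here the small system from the previous example is solved by forming \( {\bf A}^{\mathrm T}{\bf A} \) and \( {\bf A}^{\mathrm T}{\bf b} \) explicitly (NumPy, illustrative only; for ill-conditioned matrices the QR or SVD approaches mentioned earlier are preferred):

```python
import numpy as np

A = np.array([[1., 1.],
              [1., 2.],
              [1., 3.]])
b = np.array([1., 2., 2.])

# Normal equation: (A^T A) x_hat = A^T b
AtA = A.T @ A
Atb = A.T @ b
x_hat = np.linalg.solve(AtA, Atb)

# Same answer as a general-purpose least squares solver
assert np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0])
print(x_hat)          # [2/3, 1/2]
```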

Now we derive the normal equation using another approach: calculus. To find the best approximation, we look for the x where the gradient of \( \| {\bf A} {\bf x} - {\bf b} \|_2^2 = \left( {\bf A} {\bf x} - {\bf b} \right)^{\mathrm T} \left( {\bf A} {\bf x} - {\bf b} \right) \) vanishes. So we want

\begin{align*} 0 &= \lim_{\epsilon \to 0} \frac{\left( {\bf A} ({\bf x} + \epsilon ) - {\bf b} \right)^{\mathrm T} \left( {\bf A} ({\bf x} + \epsilon ) - {\bf b} \right) - \left( {\bf A} {\bf x} - {\bf b} \right)^{\mathrm T} \left( {\bf A} {\bf x} - {\bf b} \right)}{\| \epsilon \|_2} \\ &= \lim_{\epsilon \to 0} \frac{2\epsilon^{\mathrm T} \left( {\bf A}^{\mathrm T} {\bf A} {\bf x} - {\bf A}^{\mathrm T} {\bf b} \right) + \epsilon^{\mathrm T} {\bf A}^{\mathrm T} {\bf A} \epsilon}{\| \epsilon \|_2} . \end{align*}
The second term satisfies
\[ \frac{\left\vert \epsilon^{\mathrm T} {\bf A}^{\mathrm T} {\bf A} \epsilon \right\vert}{\| \epsilon \|_2} \le \frac{\| {\bf A} \|_2^2 \| \epsilon \|_2^2}{\| \epsilon \|_2} = \| {\bf A} \|_2^2 \| \epsilon \|_2 \, \to \, 0 , \]
so it approaches 0 as ε goes to 0. For the limit to vanish along every direction of ε, the factor \( {\bf A}^{\mathrm T} {\bf A} {\bf x} - {\bf A}^{\mathrm T} {\bf b} \) in the first term must be zero, that is, \( {\bf A}^{\mathrm T} {\bf A} {\bf x} = {\bf A}^{\mathrm T} {\bf b} . \) This is a system of n linear equations in n unknowns, the normal equation.

Why is \( \hat{\bf x} = \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} \, {\bf b} \) the minimizer of \( \| {\bf A} {\bf x} - {\bf b} \|_2^2 ? \) Note that the Hessian \( 2\,{\bf A}^{\mathrm T} {\bf A} \) is positive definite when A has full column rank, which means that the objective function is strictly convex and any critical point is a global minimum. ■
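The derivation can be sanity-checked numerically; the sketch below (NumPy, with randomly generated A, b, and x as assumed test data) compares the analytic gradient \( 2\left({\bf A}^{\mathrm T}{\bf A}{\bf x} - {\bf A}^{\mathrm T}{\bf b}\right) \) with finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)
x = rng.standard_normal(3)

f = lambda v: np.sum((A @ v - b)**2)      # ||A v - b||_2^2
grad = 2 * (A.T @ A @ x - A.T @ b)        # gradient from the derivation

# Central finite differences agree with the analytic gradient
h = 1e-6
fd = np.array([(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(3)])
assert np.allclose(fd, grad, atol=1e-5)
```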

Theorem: Suppose that \( \{ ( x_i , y_i) \}_{i=1}^n \) are n points, where the abscissas xi are distinct. The coefficients of the least squares line

\[ y= m\,x +b \]
are the solution to the following linear system, known as the normal equations:
\begin{align*} \left( \sum_{i=1}^n x_i^2 \right) m + b \left( \sum_{i=1}^n x_i \right) &= \sum_{i=1}^n x_i y_i , \\ \left( \sum_{i=1}^n x_i \right) m + n\,b &= \sum_{i=1}^n y_i . \end{align*}
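A minimal sketch of these normal equations applied to the eight data points of the first example (NumPy, illustrative only) gives the true least squares line, whose coefficients are close to the approximation \( y = 3.5 - 1.3\,x \) used there:

```python
import numpy as np

# Data from the earlier table; solve the 2x2 normal equations for y = m x + b
x = np.array([-1, 0, 1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([5, 4, 2, 1, 0, -2, -3, -4], dtype=float)
n = len(x)

M = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    n        ]])
rhs = np.array([np.sum(x * y), np.sum(y)])
m, b = np.linalg.solve(M, rhs)
print(m, b)    # m ~ -1.3214, b ~ 3.6786, close to -1.3 and 3.5 used earlier
```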

A matrix has full row rank when its rows are linearly independent, and full column rank when its columns are linearly independent. For a square matrix these two concepts are equivalent, and we say the matrix is full rank if all rows and columns are linearly independent. A square matrix is full rank if and only if its determinant is nonzero.

For a non-square matrix with m rows and n columns, the larger family (rows or columns, whichever is greater in number) is necessarily linearly dependent. Hence when we say that a non-square matrix is full rank, we mean that the row and column rank are as high as possible given the shape of the matrix. So if there are more rows than columns (m > n), the matrix is full rank if and only if it has full column rank.
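For instance, assuming NumPy is available, the rank of a tall matrix can be checked as follows (an illustrative sketch; matrix_rank computes the rank numerically via the SVD):

```python
import numpy as np

A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])           # 3 x 2, more rows than columns
print(np.linalg.matrix_rank(A))    # 2: full column rank, hence "full rank"

B = np.array([[1., 2.],
              [2., 4.],
              [3., 6.]])           # second column is twice the first
print(np.linalg.matrix_rank(B))    # 1: not full column rank, so not full rank
```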

Suppose that \( {\bf A}\,{\bf x} = {\bf b} \) is an inconsistent linear system of m algebraic equations in n unknowns. The inconsistency is typically caused either by too many equations or by errors in the entries of the matrix A or the input vector b. Usually there are more equations than unknowns (m is greater than n). The n-dimensional column space then constitutes a small part of m-dimensional space, and b lies outside the column space. Since no exact solution is possible, we seek a vector \( \hat{\bf x} \) that comes as close as possible.

Least Squares Problem: Given a linear system \( {\bf A}\,{\bf x} = {\bf b} \) of m equations in n unknowns, find a vector x that minimizes \( \| {\bf b} - {\bf A}\,{\bf x} \| \) with respect to the Euclidean inner product on \( \mathbb{R}^m . \) Such a vector, if it exists, is called a least squares solution of \( {\bf A}\,{\bf x} = {\bf b} \) and is usually denoted by \( \hat{\bf x} . \) The difference \( {\bf e} = {\bf b} - {\bf A}\,\hat{\bf x} \) is referred to as the least squares error vector, and we call \( \| {\bf b} - {\bf A}\,\hat{\bf x} \| \) the least squares error.

Example: Suppose we are given three points (0,5), (1,3), and (2,7). Find the closest line to the given three points.

Since no straight line goes through all three points, we look for the line \( y = \alpha + \beta\,t \) that is as close to the given points as possible. Substituting the coordinates into the equation of the line, we get the system of equations:

\[ \begin{split} 5&= \alpha , \\ 3&= \alpha + \beta , \\ 7 &= \alpha + 2\,\beta ; \end{split} \qquad\Longrightarrow \qquad \begin{bmatrix} 1&0 \\ 1&1 \\ 1&2 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} 5 \\ 3 \\ 7 \end{bmatrix} . \]
This \( 3 \times 2 \) system has no solution: b = [5, 3, 7] is not a combination of the columns (1, 1, 1) and (0, 1, 2).

From a geometric point of view, every Ax lies in the plane spanned by the columns (1, 1, 1) and (0, 1, 2). In that plane, we look for the point closest to b = [5, 3, 7]. The nearest point is the projection \( {\bf A}\,\hat{\bf x} = \mbox{proj}_{\normalsize{ column\ space}} {\bf b} . \) The error vector is

\[ {\bf e} = \begin{bmatrix} 5- \alpha \\ 3-\alpha - \beta \\ 7-\alpha -2\,\beta \end{bmatrix} . \]
The error to minimize becomes
\[ \| {\bf e} \|^2 = \left( 5- \alpha \right)^2 + \left( 3-\alpha - \beta \right)^2 + \left( 7-\alpha -2\,\beta \right)^2 . \]
This error reaches its minimum when \( {\bf A}\,\hat{\bf x} \) is the projection of b onto the column space, which is achieved by multiplying both sides by the transpose \( {\bf A}^{\mathrm T} \) (for a real matrix, the adjoint is simply the transpose):
\[ {\bf A}^{\mathrm T} {\bf A}\, \hat{\bf x} = {\bf A}^{\mathrm T} {\bf b} \qquad \Longleftrightarrow \qquad \begin{bmatrix} 3&3 \\ 3&5 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} 15 \\ 17 \end{bmatrix} . \]
Since the determinant of the matrix \( {\bf A}^{\mathrm T} {\bf A} \) is 6, this linear system has the unique solution
\[ \alpha = 4, \qquad \beta = 1 , \]
so the least squares line is \( y = 4 + t , \) with the least possible error \( \| {\bf e} \| = \| (1,-2,1) \| = \sqrt{6} \approx 2.44949 . \qquad ■ \)
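The computation can be verified with a short NumPy sketch (illustrative only), which reproduces \( \alpha = 4, \ \beta = 1 \) and the error \( \sqrt{6} :\)

```python
import numpy as np

A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])
b = np.array([5., 3., 7.])

x_hat = np.linalg.solve(A.T @ A, A.T @ b)   # normal equations
e = b - A @ x_hat                           # least squares error vector
print(x_hat)                 # [4., 1.]  ->  the line y = 4 + t
print(np.linalg.norm(e))     # sqrt(6) ~ 2.44949
```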

 

Theorem (Best Approximation Theorem): If U is a finite-dimensional subspace of an inner product space V, and if b is a vector in V, then its projection on U is the best approximation to b from U in the sense that
\[ \| {\bf b} - \mbox{proj}_U {\bf b} \| < \| {\bf b} - {\bf u} \| \]
for every vector u in U that is different from \( \mbox{proj}_U {\bf b} . \)
For every vector u in U, we can write
\[ {\bf b} - {\bf u} = \left( {\bf b} - \mbox{proj}_U {\bf b} \right) + \left( \mbox{proj}_U {\bf b} - {\bf u} \right) . \]
However, \( \mbox{proj}_U {\bf b} - {\bf u} , \) being a difference of vectors in U, is itself in U; and since \( {\bf b} - \mbox{proj}_U {\bf b} \) is orthogonal to U, the two terms on the right side are orthogonal. Thus, it follows from the Pythagorean theorem that
\[ \| {\bf b} - {\bf u} \|^2 = \| \left( {\bf b} - \mbox{proj}_U {\bf b} \right) \|^2 + \| \left( \mbox{proj}_U {\bf b} - {\bf u} \right) \|^2 . \]
If \( {\bf u} \ne \mbox{proj}_U {\bf b} , \) the second term on the right is positive, and hence
\[ \| {\bf b} - \mbox{proj}_U {\bf b} \|^2 < \| {\bf b} - {\bf u} \|^2 . \]
Since norms are nonnegative, we can take square roots to obtain \( \| {\bf b} - \mbox{proj}_U {\bf b} \| < \| {\bf b} - {\bf u} \| . \qquad ■ \)
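The theorem can be illustrated numerically; in the sketch below (NumPy, with randomly generated data as an assumption), the projection of b onto the column space of a matrix is compared with another vector of that subspace, confirming both the strict inequality and the Pythagorean identity used in the proof:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 2))          # columns span a 2-dimensional subspace U
b = rng.standard_normal(5)

# Orthogonal projection of b onto U = col(A)
P = A @ np.linalg.inv(A.T @ A) @ A.T
proj_b = P @ b

# Any other vector u in U is strictly farther from b (with probability one, u != proj_b)
u = A @ rng.standard_normal(2)
assert np.linalg.norm(b - proj_b) < np.linalg.norm(b - u)

# Pythagorean identity used in the proof
lhs = np.linalg.norm(b - u)**2
rhs = np.linalg.norm(b - proj_b)**2 + np.linalg.norm(proj_b - u)**2
assert np.isclose(lhs, rhs)
```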

From the above theorem, it follows that if \( V = \mathbb{R}^m \) and U is the column space of the matrix A, then the best approximation to b from the column space is its projection. But every vector in the column space of A is expressible in the form Ax for some vector x, so there is at least one vector \( \hat{\bf x} \) for which \( {\bf A}\,\hat{\bf x} = \mbox{proj}_{\normalsize column\ space} {\bf b} . \) Each such vector is a least squares solution of Ax = b. Although there may be more than one least squares solution of Ax = b, each such solution \( \hat{\bf x} \) produces the same error vector \( {\bf b} - {\bf A}\,\hat{\bf x} . \)

If the kernel of A is nontrivial, there can be many vectors \( \hat{\bf x} \) that satisfy the equation \( {\bf A}\,\hat{\bf x} = \mbox{proj}_{U} {\bf b} , \) where U is the column space of the matrix A; fortunately, they all produce the same projection \( {\bf A}\,\hat{\bf x} . \) If the matrix A has full column rank, then the square symmetric matrix \( {\bf A}^{\mathrm T}\,{\bf A} \) is invertible. Upon application of its inverse, we find the unique least squares solution:

\[ \hat{\bf x} = \left( {\bf A}^{\mathrm T}\,{\bf A} \right)^{-1} {\bf A}^{\mathrm T} {\bf b} \qquad\mbox{or} \qquad \hat{\bf x} = \left( {\bf A}^{\ast}\,{\bf A} \right)^{-1} {\bf A}^{\ast} {\bf b} \]
if the matrix A has complex entries.
Let A be a rectangular m × n matrix. A pseudoinverse of A is a rectangular n × m matrix, denoted by \( {\bf A}^{\dagger} ,\) that satisfies the following four criteria, known as the Moore--Penrose conditions.
  1. \( {\bf A}\,{\bf A}^{\dagger} \,{\bf A} = {\bf A} \qquad \) (matrix \( {\bf A}\,{\bf A}^{\dagger} \) maps all column vectors of A to themselves)
  2. \( {\bf A}^{\dagger} {\bf A} \,{\bf A}^{\dagger} = {\bf A}^{\dagger} \qquad \)
  3. \( \left( {\bf A}\,{\bf A}^{\dagger} \right)^{\ast} = {\bf A}\,{\bf A}^{\dagger} \qquad \) (matrix \( {\bf A}\,{\bf A}^{\dagger} \) is self-adjoint)
  4. \( \left( {\bf A}^{\dagger} {\bf A} \right)^{\ast} = {\bf A}^{\dagger} {\bf A} \qquad \) (matrix \( {\bf A}^{\dagger} {\bf A} \) is self-adjoint)
Here \( {\bf A}^{\ast} = \overline{{\bf A}^{\mathrm T}} \) is the adjoint of the matrix A. The pseudoinverse \( {\bf A}^{\dagger} \) exists for any matrix A, but when A has full column rank, \( {\bf A}^{\dagger} \) is given by the simple algebraic formula
\[ {\bf A}^{\dagger} = \left( {\bf A}^{\ast}\,{\bf A} \right)^{-1} {\bf A}^{\ast} . \]
This particular pseudoinverse constitutes a left inverse, since, in this case, \( {\bf A}^{\dagger} {\bf A} = {\bf I} , \) the n × n identity matrix.

When A has linearly independent rows (so that the matrix \( {\bf A}\,{\bf A}^{\ast} \) is invertible), \( {\bf A}^{\dagger} \) can be computed as:
\[ {\bf A}^{\dagger} = {\bf A}^{\ast} \left( {\bf A}\,{\bf A}^{\ast} \right)^{-1} . \]
This is a right inverse, because \( {\bf A} \,{\bf A}^{\dagger} = {\bf I} , \) the m × m identity matrix.
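Both formulas can be checked against a general-purpose pseudoinverse routine; the following NumPy sketch (illustrative only) uses the \( 3 \times 2 \) matrix from the earlier example as a full-column-rank case and its transpose as a full-row-rank case:

```python
import numpy as np

A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])                      # full column rank, m > n

left_inv = np.linalg.inv(A.T @ A) @ A.T       # A^dagger = (A* A)^{-1} A*
assert np.allclose(left_inv, np.linalg.pinv(A))
assert np.allclose(left_inv @ A, np.eye(2))   # left inverse: A^dagger A = I_n

B = A.T                                       # full row rank, m < n
right_inv = B.T @ np.linalg.inv(B @ B.T)      # A^dagger = A* (A A*)^{-1}
assert np.allclose(B @ right_inv, np.eye(2))  # right inverse: A A^dagger = I_m
```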
     
Photographs: E. H. Moore; Roger Penrose (2011); Erik Ivar Fredholm.

The pseudoinverse was described independently by E. H. Moore (1862--1932) in 1920, Arne Bjerhammar (1917--2011) in 1951, and Roger Penrose (born 1931) in 1955. Earlier, Erik Ivar Fredholm (1866--1927) had introduced the concept of a pseudoinverse for integral operators in 1903.

The American mathematician Eliakim Hastings Moore studied mathematics at Yale University. He received his Ph.D. in 1885 with a thesis, supervised by Hubert Anson Newton, on some work of William Kingdon Clifford and Arthur Cayley. Newton encouraged Moore to study in Germany, and thus he spent an academic year at the University of Berlin, attending lectures by Kronecker and Weierstrass. On his return to the United States, Moore taught at Yale and at Northwestern University. When the University of Chicago opened its doors in 1892, Moore became the first head of its mathematics department, a position he retained until his death.

Arne Bjerhammar was a Swedish geodesist and a professor at the Royal Institute of Technology in Stockholm. He was born in Båstad, Scania, in the south of Sweden. He developed a method for determining the geoid from gravimetric data, as well as a system for electro-optical distance measurement. Seven years after introducing the pseudoinverse, fascinated by M.S. Molodensky's new approach to solving the basic problems of physical geodesy, he presented his original idea of analytical downward continuation of the gravity anomaly to an internal sphere.

Sir Roger Penrose is an English mathematical physicist, mathematician and philosopher of science. Penrose told a Russian audience that his grandmother had left St. Petersburg in the late 1880s. His uncle was artist Roland Penrose, whose son with photographer Lee Miller is Antony Penrose. Penrose is the brother of physicist Oliver Penrose and of chess Grandmaster Jonathan Penrose. Penrose attended University College School and University College, London, where he graduated with a first class degree in mathematics. In 1955, while still a student, Penrose reintroduced the generalized matrix inverse. Penrose finished his PhD at Cambridge in 1958, with a thesis on "tensor methods in algebraic geometry" under algebraist and geometer John A. Todd. He devised and popularized the Penrose triangle in the 1950s, describing it as "impossibility in its purest form" and exchanged material with the artist M. C. Escher, whose earlier depictions of impossible objects partly inspired it.

The Swedish mathematician Erik Ivar Fredholm is remembered for his work on integral equations and operator theory, which foreshadowed the theory of Hilbert spaces. He obtained his PhD at Uppsala University in 1898, under the supervision of Gösta Mittag-Leffler. He was a docent at Stockholm University from 1898 to 1906 and a professor from 1906 until his death.

Example: In \( \mathbb{R}^n , \) the vectors \( e_1 =[1,0,0,\ldots , 0] , \quad e_2 =[0,1,0,\ldots , 0], \quad \ldots , \quad e_n =[0,0,\ldots , 0,1] \) form a basis for n-dimensional real space, called the standard basis. Its dimension is n. ■

 

Example: Let us consider the set of all real \( m \times n \) matrices, and let \( {\bf M}_{i,j} \) denote the matrix whose only nonzero entry is a 1 in the i-th row and j-th column. Then the set \( \{ {\bf M}_{i,j} \ : \ 1 \le i \le m , \ 1 \le j \le n \} \) is a basis for the vector space of all real \( m \times n \) matrices. Its dimension is mn. ■

 

Example: The set of monomials \( \left\{ 1, x, x^2 , \ldots , x^n \right\} \) forms a basis for the vector space of all polynomials of degree at most n. Its dimension is n+1. ■

 

Example: The infinite set of monomials \( \left\{ 1, x, x^2 , \ldots , x^n , \ldots \right\} \) forms a basis for the vector space of all polynomials. ■