Linear Algebra

This page supports the main stream of the web site by providing and recalling some information regarding linear algebra. We demonstrate the capabilities of MuPad for this topic.

Least Squares Approximation

The scientific custom of taking multiple observations of the same quantity and then selecting a single estimate that best represents it has its origin in the early part of the 16th century. At the beginning of the 19th century, two of the foremost mathematicians of the time, the German C.F. Gauss (1777--1855) and the Frenchman A.M. Legendre (1752--1833), proposed, independently of each other, the optimality criterion now known as the method of least squares.

The context for Legendre's proposal of least squares was that of geodesy. At that time, France's scientists had decided to adopt a new measurement system and proposed a unit of length, the meter, which would equal 1/40,000,000 of the circumference of the earth. This necessitated an accurate determination of the said circumference, which, in turn, depended on the exact shape of the earth.

The credibility of the method of least squares was greatly enhanced by the Ceres incident. On January 1, 1801, the Italian astronomer Giuseppe Piazzi sighted a heavenly body that he strongly suspected to be a new planet. He announced his discovery and named it Ceres. Unfortunately, six weeks later, before enough observations had been taken to determine its orbit accurately and so ascertain that it was indeed a planet, Ceres disappeared behind the sun and was not expected to reemerge for nearly a year. Interest in this possibly new planet was widespread, and the young Gauss proposed that an area of the sky be searched that was quite different from those suggested by other astronomers; he turned out to be right! He became a celebrity upon this discovery, which relied on two of his mathematical methods: row echelon reduction and the method of least squares.

In science and business, we often need to predict the relationship between two given variables. In many cases, we begin by performing appropriate experiments or statistical analysis to obtain the necessary data. However, even if a simple law governs the behavior of the variables, this law may not be easy to find because of errors introduced in measuring or sampling. In practice, therefore, we are often content with a functional equation that provides a close approximation. Three general techniques are known for finding functions that are closest to a given curve:

  • linear regression using linear polynomials (matching straight lines);
  • general linear regression (polynomials, etc.), and
  • transformations to linear regression (for matching exponential or logarithmic functions).

Historically, besides curve fitting, the least squares technique has proved to be very useful in statistical modeling of noisy data and in geodetic modeling. We discuss three standard ways to solve the least squares problem: the normal equations, the QR factorization, and the singular value decomposition.

Consider a typical application of least squares in curve fitting. Suppose that we are given a set of points \( (x_i , y_i ) \) for \( i = 0, 1, 2, \ldots , n \) for which we may not be able (or may not want) to find a function that passes through all of them; instead, we want a function of a particular form that passes as closely as possible to the points. For example, suppose we are given a collection of data points (x,y):
\[ (1,1) , \quad (2,2), \quad (3,2) \]
and we want to find the closest line \( y = mx+b \) to that collection. If the line went through all three points, we’d have:
\[ \begin{split} m+b &=1 , \\ 2m+b &=2, \\ 3m+b &= 2, \end{split} \]
which is equivalent to the vector equation:
\[ \begin{bmatrix} 1&1 \\ 1&2 \\ 1&3 \end{bmatrix} \begin{bmatrix} b \\ m \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} \qquad\mbox{or, in vector form:} \qquad {\bf A} \, {\bf x} = {\bf b} . \]

In our example the line does not go through all three points, so this equation is not solvable.
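For readers who want to experiment, here is a minimal NumPy sketch (NumPy is used here purely for illustration; the page itself demonstrates MuPad/Sage) that computes the least squares line for these three points:

import numpy as np

# Design matrix and right-hand side for the data points (1,1), (2,2), (3,2)
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Least squares estimates of the intercept and slope, minimizing ||A x - b||_2
x, residual, rank, sing_vals = np.linalg.lstsq(A, b, rcond=None)
print(x)   # approximately [0.6667, 0.5], i.e. the line y = 0.5 x + 2/3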

In statistical modeling, one often wishes to estimate certain parameters xj based on some observations, where the observations are contaminated by noise. For instance, one may wish to predict the college grade point average (GPA) of freshman applicants, which we denote by b, based on their high school GPA, denoted by a1, and two Scholastic Aptitude Test scores, one verbal (a2) and one quantitative (a3), as part of the college admission process. It is customary to use a linear model in this case: \( b = a_1 x_1 + a_2 x_2 + a_3 x_3 . \)

Now we formulate the least squares problem. Suppose that we have a set (usually linearly independent, but not necessarily) of vectors \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} \) and a given vector b. We seek coefficients \( x_1 , x_2 , \ldots , x_n \) that produce a minimal error

\[ \left\| {\bf b} - \sum_{i=1}^n x_i \,{\bf a}_i \right\| . \]
with respect to the Euclidean inner product. As long as we are interested only in linear combinations of a finite set \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} ,\) it is possible to transform the problem into one involving finite columns of numbers. In this case, define a matrix A with columns given by the linearly independent column vectors \( {\bf a}_i , \) and a vector x whose entries are the unknown coefficients xi. Then matrix A has dimensions m by n, meaning that the \( {\bf a}_i \) are vectors of length m. The problem can then be reformulated as choosing x to minimize \( \| {\bf A} {\bf x} - {\bf b} \|_2 , \) because \( {\bf A} {\bf x} = {\bf b} \) is, in general, an inconsistent linear system of m equations in n unknowns. We call such a minimizer, if it exists, a least squares solution \( \hat{\bf x} \) of \( {\bf A} {\bf x} = {\bf b} , \) and call \( {\bf b} - {\bf A} \hat{\bf x} \) the least squares error vector. Statisticians call this procedure linear regression.

If the column vectors \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} \) are linearly independent (which will be assumed), matrix A has full column rank. The vector \( {\bf A}\,\hat{\bf x} = \hat{x}_1 {\bf a}_1 + \hat{x}_2 {\bf a}_2 + \cdots + \hat{x}_n {\bf a}_n \) that is closest to a given vector b is its orthogonal projection onto the subspace spanned by the a's. The hat over \( \hat{\bf x} \) indicates the best choice (in the sense that it minimizes \( \| {\bf A} {\bf x} - {\bf b} \|_2 \) ), the one that gives the closest vector in the column space. Therefore, solving the least squares problem is equivalent to finding the orthogonal projection matrix P onto the column space such that \( {\bf P} {\bf b} = {\bf A}\,\hat{\bf x} . \)

The vector \( \hat{\bf x} \) is a solution to the least squares problem when the error vector \( {\bf e} = {\bf b} - {\bf A} \hat{\bf x} \) is perpendicular to the subspace. Therefore, this error vector makes a right angle with each of the vectors \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} . \) That gives the n equations we need to find \( \hat{\bf x} : \)

\[ \begin{array}{c} {\bf a}^{\mathrm T}_1 \left( {\bf b} - {\bf A} \hat{\bf x} \right) =0 \\ \vdots \quad \vdots \\ {\bf a}^{\mathrm T}_n \left( {\bf b} - {\bf A} \hat{\bf x} \right) =0 \end{array} \qquad \mbox{or} \qquad \begin{bmatrix} --- \, {\bf a}^{\mathrm T}_1 \, --- \\ \vdots \\ --- \, {\bf a}^{\mathrm T}_n \, --- \end{bmatrix} \begin{bmatrix} \quad \\ {\bf b} - {\bf A} \hat{\bf x} \\ \quad \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix} . \]
The matrix in the right equation is \( {\bf A}^{\mathrm T} . \) The n equations are exactly
\[ {\bf A}^{\mathrm T} {\bf A} \, \hat{\bf x} = {\bf A}^{\mathrm T} \, {\bf b} . \]
This equation is called the normal equation.
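As a quick numerical illustration (not part of the original text; it reuses the three-point data set from the sketch above), one can form and solve the normal equations directly and check that the residual is orthogonal to the columns of A:

import numpy as np

A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Form and solve the normal equations A^T A x = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
print(x_hat)                   # [2/3, 1/2]
print(A.T @ (b - A @ x_hat))   # ~[0, 0]: the error vector is orthogonal to the columns of A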

Now we derive the normal equation using another approach: calculus. To find the best approximation, we look for the x where the gradient of \( \| {\bf A} {\bf x} - {\bf b} \|_2^2 = \left( {\bf A} {\bf x} - {\bf b} \right)^{\mathrm T} \left( {\bf A} {\bf x} - {\bf b} \right) \) vanishes. So we want

\begin{align*} 0 &= \lim_{\epsilon \to 0} \frac{\left( {\bf A} ({\bf x} + \epsilon ) - {\bf b} \right)^{\mathrm T} \left( {\bf A} ({\bf x} + \epsilon ) - {\bf b} \right) - \left( {\bf A} {\bf x} - {\bf b} \right)^{\mathrm T} \left( {\bf A} {\bf x} - {\bf b} \right)}{\| \epsilon \|_2} \\ &= \lim_{\epsilon \to 0} \frac{2\epsilon^{\mathrm T} \left( {\bf A}^{\mathrm T} {\bf A} {\bf x} - {\bf A}^{\mathrm T} {\bf b} \right) + \epsilon^{\mathrm T} {\bf A}^{\mathrm T} {\bf A} \epsilon}{\| \epsilon \|_2} . \end{align*}
The second term
\[ \frac{\left\vert \epsilon^{\mathrm T} {\bf A}^{\mathrm T} {\bf A} \epsilon \right\vert}{\| \epsilon \|_2} \le \frac{\| {\bf A} \|_2^2 \| \epsilon \|_2^2}{\| \epsilon \|_2} = \| {\bf A} \|_2^2 \| \epsilon \|_2 \, \to \, 0 \]
approaches 0 as ε goes to 0, so the factor \( {\bf A}^{\mathrm T} {\bf A} {\bf x} - {\bf A}^{\mathrm T} {\bf b} \) in the first term must also be zero, or \( {\bf A}^{\mathrm T} {\bf A} {\bf x} = {\bf A}^{\mathrm T} {\bf b} . \) This is a system of n linear equations in n unknowns, the normal equation.

Why is \( \hat{\bf x} = \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} \, {\bf b} \) the minimizer of \( \| {\bf A} {\bf x} - {\bf b} \|_2^2 ? \) Note that the Hessian of the objective, \( 2{\bf A}^{\mathrm T} {\bf A} , \) is positive definite when A has full column rank, which means that the function is strictly convex and any critical point is a global minimum. ■

Theorem: If W is a finite-dimensional subspace of an inner product vector space V, and if b is a vector in V, then \( \mbox{proj}_W ({\bf b}) \) is the best approximation to b from W in the sense that

\[ \| {\bf b} - \mbox{proj}_W {\bf b} \| < \| {\bf b} - {\bf w} \| \]
for every vector w in W that is different from \( \mbox{proj}_W ({\bf b}) . \)

To solve the normal equation, we need one more definition: for an arbitrary \( m \times n \) matrix A, its pseudoinverse is defined as an \( n \times m \) matrix, denoted by \( {\bf A}^{\dagger} , \) that satisfies all four criteria:

  • \( {\bf A}\,{\bf A}^{\dagger} \,{\bf A} = {\bf A}; \)
  • \( {\bf A}^{\dagger} {\bf A} \,{\bf A}^{\dagger} = {\bf A}^{\dagger} ; \)
  • \( \left( {\bf A} \,{\bf A}^{\dagger} \right)^{\ast} = {\bf A} \,{\bf A}^{\dagger} \) (so \( {\bf P} = {\bf A} \,{\bf A}^{\dagger} \) is a self-adjoint m-by-m matrix);
  • \( \left( {\bf A}^{\dagger} {\bf A} \right)^{\ast} = {\bf A}^{\dagger} {\bf A} \) (so \( {\bf Q} = {\bf A}^{\dagger}{\bf A} \) is a self-adjoint n-by-n matrix). ■

A pseudoinverse exists for an arbitrary matrix A, but when the latter has full rank, \( {\bf A}^{\dagger} \) is given by a simple algebraic formula. In particular, when A has linearly independent columns (and thus the matrix \( {\bf A}^{\ast} {\bf A} \) is invertible), \( {\bf A}^{\dagger} \) can be computed as:

\[ {\bf A}^{\dagger} = \left( {\bf A}^{\ast} {\bf A} \right)^{-1} {\bf A}^{\ast} . \]
This pseudoinverse is actually a left inverse of A because, in this case, \( {\bf A}^{\dagger} {\bf A} = {\bf I}_n . \) When A has linearly independent rows (so the matrix \( {\bf A}\,{\bf A}^{\ast} \) is invertible), \( {\bf A}^{\dagger} \) can be computed as:
\[ {\bf A}^{\dagger} = {\bf A}^{\ast} \left( {\bf A}\, {\bf A}^{\ast} \right)^{-1} . \]
This is a right inverse, as \( {\bf A} {\bf A}^{\dagger} = {\bf I}_m . \)
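The following NumPy sketch (our own illustration with a made-up full-column-rank matrix, not taken from the text) checks the left-inverse formula against the built-in pseudoinverse:

import numpy as np

# A tall matrix with linearly independent columns
A = np.array([[1.0, 1.0],
              [2.0, 0.0],
              [0.0, 1.0]])

A_dag = np.linalg.inv(A.T @ A) @ A.T          # left-inverse formula (A* A)^{-1} A*
print(np.allclose(A_dag, np.linalg.pinv(A)))  # True: agrees with the Moore-Penrose pseudoinverse
print(np.allclose(A_dag @ A, np.eye(2)))      # True: A† A = I_n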

When the singular value decomposition of an m-by-n matrix A is known, \( {\bf A} = {\bf U}\,{\bf \Sigma}\, {\bf V}^{\ast} \) with a unitary (or orthogonal) \( m \times m \) matrix U, an \( m \times n \) 'diagonal' matrix \( {\bf \Sigma} , \) and a unitary \( n \times n \) matrix \( {\bf V} , \) we can construct its pseudoinverse explicitly: \( {\bf A}^{\dagger} = {\bf V} {\bf \Sigma}^{\dagger} {\bf U}^{\ast} , \) where

\[ {\bf \Sigma}^{\dagger} = \mbox{diag} \left( \frac{1}{\sigma_1} , \cdots , \frac{1}{\sigma_r} , 0, \cdots , 0\right) \in \mathbb{R}^{n\times m} . \]
Here \( \sigma_1 , \ldots , \sigma_r \) are the nonzero singular values of the matrix A, which has rank r. A reduced form of the pseudoinverse is also available when the unitary matrices U and V are truncated to keep only their first r columns:
\[ {\bf A}^{\dagger} = {\bf V}_{n \times r} {\bf \Sigma}^{\dagger}_r {\bf U}^{\ast}_{r \times m} , \quad {\bf \Sigma}^{\dagger}_r = {\bf \Sigma}^{-1}_r = \mbox{diag} \left( \frac{1}{\sigma_1} , \cdots , \frac{1}{\sigma_r}\right) \in \mathbb{R}^{r\times r} . \]
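As an illustration (our own sketch, borrowing the singular matrix from a later example on this page; the tolerance for the rank decision is an arbitrary choice), the pseudoinverse can be assembled directly from the SVD:

import numpy as np

A = np.array([[3.0, 2.0, 1.0],
              [-1.0, 1.0, 4.0],
              [1.0, 4.0, 9.0]])              # a rank-2 (singular) matrix

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10 * s[0]))            # numerical rank decision
A_dag = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T   # reduced form V_r Sigma_r^{-1} U_r^*

# Verify two of the Moore-Penrose conditions
print(np.allclose(A @ A_dag @ A, A), np.allclose(A_dag @ A @ A_dag, A_dag))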

Theorem: Let A be an \( m \times n \) matrix and \( {\bf A}^{\dagger} \) be its pseudoinverse n-by-m matrix.

  • \( {\bf P} = {\bf A}\,{\bf A}^{\dagger} \) and \( {\bf Q} = {\bf A}^{\dagger} {\bf A} \) are orthogonal projection operators (so they are idempotent \( {\bf P}^2 = {\bf P} , \quad {\bf Q}^2 = {\bf Q} \) and self-adjoint \( {\bf P} = {\bf P}^{\ast} , \quad {\bf Q} = {\bf Q}^{\ast} \) );
  • P is the orthogonal projector onto the range of A (which equals the orthogonal complement of the kernel of \( {\bf A}^{\ast} \) );
  • Q is the orthogonal projector onto the range of \( {\bf A}^{\ast} \) (which equals the orthogonal complement of the kernel of A);
  • \( \left( {\bf I} - {\bf P} \right) \) is the orthogonal projector onto the kernel of \( {\bf A}^{\ast} ; \)
  • \( \left( {\bf I} - {\bf Q} \right) \) is the orthogonal projector onto the kernel of A;
  • ker ( \( {\bf A}^{\dagger} \) ) = ker ( \( {\bf A}^{\ast} \) );
  • Range ( \( {\bf A}^{\dagger} \) ) = Range ( \( {\bf A}^{\ast} \) ). ■

The symmetric matrix \( {\bf A}^{\ast} {\bf A} \) is n by n. It is invertible if the a's (column vectors) are linearly independent. For a full column rank m-by-n real matrix A, the solution of the least squares problem becomes \( \hat{\bf x} = \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} \, {\bf b} . \) The m-by-m projection matrix onto the subspace spanned by the columns of A (the range of the m-by-n matrix A) is
\[ {\bf P} = {\bf A} \,\left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} = {\bf A} \,{\bf A}^{\dagger} . \]
Then
\[ {\bf A} \hat{\bf x} = {\bf A} \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} \, {\bf b} . \]
So to solve the least squares problem \( {\bf A} {\bf x} = {\bf b} , \) which is inconsistent, one finds the nearest consistent problem \( {\bf A} \hat{\bf x} = {\bf P}\,{\bf b} . \)
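Here is a small NumPy check (again with the three-point toy data used earlier, which is our own running example) that P is an orthogonal projector and that \( {\bf A}\hat{\bf x} = {\bf P}{\bf b} : \)

import numpy as np

A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

P = A @ np.linalg.inv(A.T @ A) @ A.T                 # projector onto the column space of A
print(np.allclose(P @ P, P), np.allclose(P, P.T))    # idempotent and symmetric

x_hat = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(A @ x_hat, P @ b))                 # A x_hat = P b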

Theorem: The general solution of the linear least squares problem \( {\bf A} {\bf x} = {\bf b} \) is

\[ \hat{\bf x} = {\bf A}^{\dagger} {\bf b} + \left( {\bf I} - {\bf A}^{\dagger} {\bf A} \right) {\bf w} , \qquad {\bf w} \ \mbox{ arbitrary}. \]

Theorem: If the m-by-n matrix A has full column rank (so the rank of A is n), then the solution of the linear least squares problem \( {\bf A} {\bf x} = {\bf b} \) is unique:

\[ \hat{\bf x} = {\bf A}^{\dagger} {\bf b} = {\bf V} {\bf \Sigma}^{\dagger} {\bf U}^{\ast} {\bf b} . \]
In the above formula, one can also use the reduced form of the singular value decomposition for \( {\bf A}^{\dagger} . \)

The general least squares solution presented in the theorem above is also valid for a system of equations Ax = b with \( m \le n , \) i.e., an underdetermined linear system with fewer equations than unknowns. In this case \( {\bf x} = {\bf V}_r {\bf \Sigma}_r^{-1} {\bf U}_r^{\mathrm T} {\bf b} \) solves the problem

\[ \min \| {\bf x} \| \qquad\mbox{subject to } {\bf A} {\bf x} = {\bf b} . \]
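A short NumPy sketch of this minimum-norm property (the small underdetermined system is invented for illustration):

import numpy as np

# Underdetermined system: 2 equations, 3 unknowns, full row rank
A = np.array([[1.0, 2.0, -1.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])

x_min = np.linalg.pinv(A) @ b        # minimum-norm solution of A x = b
print(np.allclose(A @ x_min, b))     # the constraint is satisfied exactly

w = np.array([-3.0, 1.0, -1.0])      # a vector in ker(A), so x_min + w is another solution
print(np.linalg.norm(x_min) < np.linalg.norm(x_min + w))   # True: x_min has the smaller norm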

Algorithm for general solution to the linear least squares problem

  1. Compute the SVD: [ U S V ] = svd(A)
  2. Make a rank decision, i.e., choose r such that \( \sigma_r >0 \) and \( \sigma_{r+1} = \cdots = \sigma_n =0 . \) This decision is necessary because rounding errors will prevent the zero singular values from being exactly zero.
  3. Set V1 = V(:,1:r), V2 = V(: ,r+1:n), Sr = S(1:r,1:r), U1 = U(:,1:r)
  4. Solution with minimal norm is xm = V1*(Sr\U1'*b)
  5. The general solution is x = xm + V2*c with an arbitrary \( c \in \mathbb{R}^{n-r} . \)
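Below is a NumPy transcription of these steps (our own sketch; the default tolerance for the rank decision is an arbitrary choice and should be adapted to the problem at hand):

import numpy as np

def lstsq_general(A, b, tol=1e-10):
    """Minimum-norm least squares solution x_m and a basis V2 of ker(A);
    the general solution is x_m + V2 @ c with an arbitrary vector c."""
    U, s, Vt = np.linalg.svd(A)
    r = int(np.sum(s > tol * s[0]))          # rank decision
    U1, V1, V2 = U[:, :r], Vt[:r].T, Vt[r:].T
    x_m = V1 @ ((U1.T @ b) / s[:r])          # x_m = V1 * (Sr \ U1' * b)
    return x_m, V2

A = np.array([[3.0, 2.0, 1.0], [-1.0, 1.0, 4.0], [1.0, 4.0, 9.0]])
b = np.array([2.0, -2.0, 1.0])
x_m, V2 = lstsq_general(A, b)
print(x_m)          # minimum-norm least squares solution
print(V2.shape)     # (3, 1): the kernel of this A is one-dimensional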

Another approach to solving the least squares problem uses the QR decomposition. Suppose that the given matrix A can be factored as a product \( {\bf A} = {\bf Q} {\bf R} \) of a matrix Q with orthonormal columns and an upper triangular matrix R. Then we can rewrite \( {\bf A} {\bf x} - {\bf b} \) as a difference of two orthogonal vectors:

\[ {\bf A} {\bf x} - {\bf b} = {\bf Q} \left( {\bf R} {\bf x} - {\bf Q}^{\mathrm T} {\bf b} \right) - \left( {\bf I} - {\bf Q} {\bf Q}^{\mathrm T} \right) {\bf b} . \]
By the Pythagorean theorem,
\[ \left\| {\bf A} {\bf x} - {\bf b} \right\|_2^2 = \left\| {\bf Q} \left( {\bf R} {\bf x} - {\bf Q}^{\mathrm T} {\bf b} \right) \right\|_2^2 + \left\| \left( {\bf I} - {\bf Q} {\bf Q}^{\mathrm T} \right) {\bf b} \right\|_2^2 . \]
This sum of squares is minimized when the first term is zero, and we get the solution of least squares problem:
\[ \hat{\bf x} = {\bf R}^{-1} {\bf Q}^{\mathrm T} {\bf b} . \]
The cost of this decomposition and the subsequent least squares solution is about \( 2n^2 m - \frac{2}{3}\,n^3 \) flops, roughly twice the cost of the normal equations when \( m \gg n \) and about the same when m = n.
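For comparison with the normal equations, here is a brief NumPy version of the QR approach (the thin QR factorization is used; the three-point data set is our own running toy example):

import numpy as np

A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

Q, R = np.linalg.qr(A)                  # thin QR: Q is m-by-n with orthonormal columns, R is n-by-n
x_hat = np.linalg.solve(R, Q.T @ b)     # x_hat = R^{-1} Q^T b (solve the triangular system, do not invert R)
print(x_hat)                            # [2/3, 1/2], the same line as before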

Example. Let U be the plane in \( \mathbb{R}^3 \) given by

\[ U = \left\{ \left[ x\ y \ z \right]^{\mathrm T} \ : \ 2x-y-2z=0 \right\} . \]
We are going to find the nearest point (position vector) in U to the point whose position vector is \( {\bf b} = \left[ 1, \ 1, \ 1 \right]^{\mathrm T} . \)

Since U is a plane with normal vector n = [ 2, -1, -2 ], we first find two linearly independent solutions of the equation 2x-y-2z=0:

\[ {\bf a}_1 = \left[ 1, \ 2, \ 0 \right]^{\mathrm T} \quad \mbox{and} \quad {\bf a}_2 = \left[ 1, \ 0, \ 1 \right]^{\mathrm T} . \]
Now we build a 3-by-2 matrix of these column vectors:
\[ {\bf A} = \begin{bmatrix} 1&1 \\ 2&0 \\ 0&1 \end{bmatrix} \qquad \Longrightarrow \qquad {\bf A}^{\mathrm T} = \begin{bmatrix} 1&2&0 \\ 1&0&1 \end{bmatrix} . \]
Then we build an orthogonal projection operator that maps 3-space onto the range of matrix A:
\[ {\bf P} = {\bf A} \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} = \frac{1}{9} \, \begin{bmatrix} 5&2&4 \\ 2&8&-2 \\ 4&-2&5 \end{bmatrix} . \]
This projection matrix does not depend on our choice of the vectors \( {\bf a}_1 \) and \( {\bf a}_2 ; \) indeed, if we choose, for instance, two other vectors [ 2, 2, 1 ] and [ 0, -2, 1 ], the projection matrix will be the same.
One can verify that P is indeed a projection matrix, since P*P - P vanishes:
sage: P*P - P    # returns the 3-by-3 zero matrix
Application of the projection matrix P to vector b yields
\[ \hat{\bf b} = {\bf P} {\bf b} = \frac{1}{9} \, \begin{bmatrix} 5&2&4 \\ 2&8&-2 \\ 4&-2&5 \end{bmatrix} \, \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \frac{1}{9} \, \begin{bmatrix} 11 \\ 8 \\ 7 \end{bmatrix} \quad \Longrightarrow \quad \| \hat{\bf b} \| = \frac{\sqrt{26}}{3} \approx 1.69967 < \sqrt{3} = \| {\bf b} \| . \]
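The whole computation of this example can also be reproduced with the following NumPy sketch (an illustration of ours; the second basis of the plane is the one mentioned above):

import numpy as np

# Two basis vectors of the plane 2x - y - 2z = 0 as columns of A
A = np.array([[1.0, 1.0],
              [2.0, 0.0],
              [0.0, 1.0]])
P = A @ np.linalg.inv(A.T @ A) @ A.T
print(9 * P)       # [[5, 2, 4], [2, 8, -2], [4, -2, 5]]

b = np.array([1.0, 1.0, 1.0])
print(P @ b)       # [11/9, 8/9, 7/9], the point of the plane closest to b

# A different basis of the same plane gives the same projector
B = np.array([[2.0, 0.0],
              [2.0, -2.0],
              [1.0, 1.0]])
print(np.allclose(P, B @ np.linalg.inv(B.T @ B) @ B.T))   # True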

Example. Let us find the least squares solution to the linear system

\[ \begin{split} x_1 - x_2 &= 4 , \\ x_1 -2 x_2 &= 1, \\ -x_1 + 2x_2 &= 3 . \end{split} \]
It will be convenient to express the system in the matrix form \( {\bf A} {\bf x} = {\bf b} , \) where
\[ {\bf A} = \begin{bmatrix} 1&-1 \\ 1&-2 \\-1&2 \end{bmatrix} , \qquad {\bf b} = \begin{bmatrix} 4\\ 1 \\ 3 \end{bmatrix} . \]
It follows that
\begin{align*} {\bf A}^{\mathrm T} {\bf A} &= \begin{bmatrix} 1&1&-1 \\ -1&-2&2 \end{bmatrix} \begin{bmatrix} 1&-1 \\ 1&-2 \\-1&2 \end{bmatrix} = \begin{bmatrix} 3 & -5 \\ -5 & 9 \end{bmatrix} , \\ {\bf A} {\bf A}^{\mathrm T} &= \begin{bmatrix} 2&3&-3 \\ 3&5&-5 \\ -3&-5&5 \end{bmatrix} , \\ {\bf A}^{\mathrm T} {\bf b} &= \begin{bmatrix} 1&1&-1 \\ -1&-2&2 \end{bmatrix} \begin{bmatrix} 4\\ 1 \\ 3 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \end{bmatrix} , \end{align*}
so the normal system \( {\bf A}^{\mathrm T} {\bf A} \, {\bf x} = {\bf A}^{\mathrm T} {\bf b} \) becomes
\[ \begin{bmatrix} 3 & -5 \\ -5 & 9 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \end{bmatrix} . \]
Solving this system yields a unique least squares solution, namely,
\[ x_1 = 9, \quad x_2 = 5. \]
The least squares error vector is
\[ {\bf b} - {\bf A} \hat{\bf x} = \begin{bmatrix} 4 \\ 1 \\ 3 \end{bmatrix} - \begin{bmatrix} 1&-1 \\ 1&-2 \\-1&2 \end{bmatrix} \begin{bmatrix} 9 \\ 5 \end{bmatrix} = \begin{bmatrix} 0 \\ 2 \\ 2 \end{bmatrix} \]
and the least squares error is \( \left\| {\bf b} - {\bf A} \hat{\bf x} \right\| = 2\sqrt{2} \approx 2.82843 . \)

The pseudoinverse is the following 2 by 3 matrix:

\[ {\bf A}^{\dagger} = \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} = \frac{1}{2} \begin{bmatrix} 4&-1&1 \\ 2&-1&1 \end{bmatrix} , \]
and the projectors on the range of A and on the range of \( {\bf A}^{\mathrm T} \) (which is the orthogonal complement of the kernel of A) are
\[ {\bf P} = {\bf A} {\bf A}^{\dagger} = \frac{1}{2} \begin{bmatrix} 2&0&0 \\ 0&1&-1 \\ 0&-1&1 \end{bmatrix} \quad\mbox{and} \quad {\bf Q} = {\bf A}^{\dagger} {\bf A} = \begin{bmatrix} 1&0 \\ 0& 1 \end{bmatrix} , \]
respectively.
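The computations of this example are easy to check numerically; the following NumPy sketch (ours, not part of the text) reproduces the solution, the error vector, the pseudoinverse, and the two projectors:

import numpy as np

A = np.array([[1.0, -1.0], [1.0, -2.0], [-1.0, 2.0]])
b = np.array([4.0, 1.0, 3.0])

x_hat = np.linalg.solve(A.T @ A, A.T @ b)
print(x_hat)               # [9., 5.]
print(b - A @ x_hat)       # least squares error vector [0., 2., 2.]

A_dag = np.linalg.pinv(A)
print(2 * A_dag)           # [[4, -1, 1], [2, -1, 1]]
print(2 * (A @ A_dag))     # [[2, 0, 0], [0, 1, -1], [0, -1, 1]], the projector P
print(A_dag @ A)           # the 2-by-2 identity, the projector Q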

Example. Find the least squares solutions, the least squares error vector, and the least squares error of the linear system

\[ \begin{split} 3x_1 + 2x_2 + x_3 &= 2 , \\ -x_1 + x_2 + 4x_3 &= -2, \\ x_1 + 4x_2 + 9 x_3 &= 1 . \end{split} \]
The matrix form of the system is \( {\bf A} {\bf x} = {\bf b} , \) where
\[ {\bf A} = \begin{bmatrix} 3&2&1 \\ -1&1&4 \\ 1&4&9 \end{bmatrix} \qquad\mbox{and} \qquad {\bf b} = \begin{bmatrix} 2 \\ -2 \\ 1 \end{bmatrix} . \]
Since the matrix A is singular and the rank of the augmented matrix \( \left[ {\bf A} \, | \, {\bf b} \right] \) is 3, the given system of algebraic equations is inconsistent. It follows that
\[ {\bf A}^{\mathrm T} {\bf A} = \begin{bmatrix} 11&9&8 \\ 9&21&42 \\ 8&42&98 \end{bmatrix} \qquad\mbox{and} \qquad {\bf A}^{\mathrm T} {\bf b} = \begin{bmatrix} 9 \\ 6 \\ 3 \end{bmatrix} , \]
so the augmented matrix for the normal system \( {\bf A}^{\mathrm T} {\bf A} {\bf x} = {\bf A}^{\mathrm T} {\bf b} \) is
\[ \left[ \begin{array}{ccc|c} 11&9&8&9 \\ 9&21&42&6 \\ 8&42&98&3 \end{array} \right] . \]
The reduced row echelon form of this matrix is
\[ \left[ \begin{array}{ccc|c} 1&0&- \frac{7}{5}&\frac{9}{10} \\ 0&1&\frac{13}{5}&-\frac{1}{10} \\ 0&0&0&0 \end{array} \right] \]
from which it follows that there are infinitely many least squares solutions, and they are given by the parametric equations
\[ \begin{split} x_1 &= \frac{7}{5}\, t + \frac{9}{10} \\ x_2 &= - \frac{13}{5}\, t - \frac{1}{10} , \\ x_3 &= t . \end{split} \]
As a check, let us verify that all least squares solutions produce the same least squares error vector and the same least squares error. To see this, we first compute
\[ {\bf b} - {\bf A} \hat{\bf x} = \begin{bmatrix} 2 \\ -2 \\ 1 \end{bmatrix} - \begin{bmatrix} 3&2&1 \\ -1&1&4 \\ 1&4&9 \end{bmatrix} \begin{bmatrix} \frac{7}{5}\, t + \frac{9}{10} \\ - \frac{13}{5}\, t - \frac{1}{10} \\ t \end{bmatrix} = \frac{1}{2} \begin{bmatrix} -1 \\ -2 \\ 1 \end{bmatrix} . \]
Since \( {\bf b} - {\bf A} \hat{\bf x} \) does not depend on t, all least squares solutions produce the same error vector and hence the same least squares error, namely, \( \| {\bf b} - {\bf A} \hat{\bf x} \| = \sqrt{\frac{3}{2}} \approx 1.22474 .\)
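A NumPy check of this example (our own sketch): the built-in pseudoinverse picks out one particular member of the family, the minimum-norm solution, and every member gives the same error.

import numpy as np

A = np.array([[3.0, 2.0, 1.0], [-1.0, 1.0, 4.0], [1.0, 4.0, 9.0]])
b = np.array([2.0, -2.0, 1.0])

# Minimum-norm least squares solution; singular values below 1e-10 * sigma_1 are treated as zero
x_min = np.linalg.pinv(A, rcond=1e-10) @ b
print(x_min, np.linalg.norm(b - A @ x_min))

# Every member of the family x(t) gives the same residual norm sqrt(3/2) ~ 1.22474
for t in (0.0, 1.0, -2.5):
    x_t = np.array([1.4 * t + 0.9, -2.6 * t - 0.1, t])
    print(t, np.linalg.norm(b - A @ x_t))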

Fitting Curves

Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, usually subject to constraints. These minimization problems generally lead to non-linear least squares problems. While several algorithms are known for handling such problems (for instance, the Levenberg–Marquardt algorithm (LMA), also known as the damped least-squares (DLS) method, developed by the American statisticians Kenneth Levenberg (1919--1973) and Donald W. Marquardt (1929--1997)), we consider two typical, relatively simple problems.

We start by finding a straight line in the plane, defined by

\[ c + n_1 x + n_2 y =0 , \qquad n_1^2 + n_2^2 =1. \]
The unit vector \( (n_1 , n_2 ) \) is orthogonal to the line. For a given point (x,y), the absolute value of \( r= c + n_1 x + n_2 y \) gives the distance from the line to the point. Therefore, if we want to determine the line for which the sum of squares of the distances to given points is minimal, we have to solve the constrained least squares problem
\[ \left( \begin{array}{ccc} 1& x_1 & y_1 \\ 1 & x_2 & y_2 \\ \vdots & \vdots & \vdots \\ 1 & x_m & y_m \end{array} \right) \begin{bmatrix} c \\ n_1 \\ n_2 \end{bmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \qquad \mbox{subject to } \quad n_1^2 + n_2^2 =1 . \]
Using QR decomposition A = QR we can reduce the linear system to R x = 0 and the problem becomes
\[ \left( \begin{array}{ccc} r_{11} & r_{12} & r_{13} \\ 0& r_{22} & r_{23} \\ 0&0&r_{33} \end{array} \right) \begin{bmatrix} c \\ n_1 \\ n_2 \end{bmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \qquad \mbox{subject to } \quad n_1^2 + n_2^2 =1 . \]
Since the nonlinear constraint involves only the two unknowns \( n_1 \) and \( n_2 , \) we can satisfy the first equation exactly by choosing c once \( n_1 \) and \( n_2 \) are known; this leaves the smaller constrained problem
\[ \left( \begin{array}{cc} r_{22} & r_{23} \\ 0&r_{33} \end{array} \right) \begin{bmatrix} n_1 \\ n_2 \end{bmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \qquad \mbox{subject to } \quad n_1^2 + n_2^2 =1 . \]
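The following Python sketch (with invented sample data and our own helper name fit_line) carries out this reduction with NumPy: a QR factorization, then the smallest right singular vector of the trailing 2-by-2 block, then back-substitution for c.

import numpy as np

def fit_line(x, y):
    """Fit c + n1*x + n2*y = 0 with n1^2 + n2^2 = 1, minimizing the sum of squared distances."""
    A = np.column_stack([np.ones(len(x)), x, y])
    Q, R = np.linalg.qr(A)                 # reduce to the 3-by-3 upper triangular factor R
    _, _, Vt = np.linalg.svd(R[1:, 1:])    # minimize ||R22 n|| over unit vectors n
    n = Vt[-1]                             # right singular vector of the smallest singular value
    c = -(R[0, 1] * n[0] + R[0, 2] * n[1]) / R[0, 0]   # solve the first equation exactly for c
    return c, n

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])
print(fit_line(x, y))   # roughly the line y = x: n proportional to (1, -1)/sqrt(2) up to sign, c near 0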

Another problem is fitting ellipses in the plane, defined by the quadratic form

\[ {\bf x}^{\mathrm T} {\bf A} \,{\bf x} + {\bf b}^{\mathrm T} {\bf x} +c =0 \]
with a positive definite symmetric matrix A. For each measured point xi, substituting it into the ellipse equation yields a linear equation for the unknown coefficients \( {\bf u} = \left[ a_{11} , a_{12} = a_{21}, a_{22} , b_1 , b_2 , c \right] . \) This yields the minimization problem subject to a normalization condition:
\[ \| {\bf B}\,{\bf u} \| = \min \quad \mbox{subject to } \ \| {\bf u} \| =1 , \]
which can be solved using the SVD.
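A corresponding NumPy sketch (the sample data and the helper name fit_conic are ours; no check is made that the fitted conic is actually an ellipse):

import numpy as np

def fit_conic(points):
    """Fit a11*x^2 + 2*a12*x*y + a22*y^2 + b1*x + b2*y + c = 0 with ||u|| = 1."""
    x, y = points[:, 0], points[:, 1]
    B = np.column_stack([x**2, 2 * x * y, y**2, x, y, np.ones_like(x)])
    _, _, Vt = np.linalg.svd(B)       # min ||B u|| subject to ||u|| = 1
    return Vt[-1]                     # u = [a11, a12, a22, b1, b2, c]

# Points sampled, with a little noise, from the ellipse x^2/4 + y^2 = 1
rng = np.random.default_rng(1)
t = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False)
pts = np.column_stack([2.0 * np.cos(t), np.sin(t)]) + rng.normal(scale=0.01, size=(20, 2))
print(fit_conic(pts))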

Total Least Squares

The linear least squares problem A x = b has so far been solved by projecting the vector b onto the range of A. With ``Total Least Squares'' the system of equations is made consistent by changing both A and b. We are looking for a matrix \( {\cal A} \) and a vector \( {\cal b} \) in its range that differ as little as possible from the given data:
\[ \left\| \left[ {\bf A} , {\bf b} \right] - \left[ {\cal A} , {\cal b} \right] \right\|_F = \min \qquad \mbox{subject to } \ {\cal b} \in \mbox{Range}({\cal A} ) . \]
In general, the augmented matrix \( {\bf C} = \left[ {\bf A} , {\bf b} \right] \) has rank n + 1, while matrix A has rank n. The constraint says that the augmented matrix \( {\cal C} = \left[ {\cal A} , {\cal b} \right] \) must have rank n. Upon introducing the error matrix \( \Delta = {\cal C} - {\bf C} , \) the problem is reduced to
\[ \| \Delta \|_F^2 = \min \qquad\mbox{subject to } \ \left( {\bf C} + \Delta \right) {\bf z} =0 , \]
where \( {\bf z} = \left( \begin{array}{c} {\bf x} \\ -1 \end{array} \right) \ne 0. \) Applying the SVD for C, we get
\[ \Delta = - \sigma_{n+1} {\bf u}_{n+1} {\bf v}_{n+1}^{\mathrm T} . \]
If the last component of vn+1 is not zero, the total least squares solution exists:
\[ {\cal A} \hat{\bf x} = {\cal b} , \qquad \hat{\bf x} = - \frac{1}{v_{n+1,n+1}} \,v(1:n,n+1) . \]
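To close, here is a small NumPy sketch of total least squares (a synthetic example with noise in both A and b; the helper name total_least_squares is ours):

import numpy as np

def total_least_squares(A, b):
    """TLS solution of A x ~ b via the SVD of the augmented matrix C = [A, b]."""
    C = np.column_stack([A, b])
    _, _, Vt = np.linalg.svd(C)
    v = Vt[-1]                        # right singular vector of the smallest singular value
    if abs(v[-1]) < 1e-12:
        raise ValueError("no TLS solution: last component of v_{n+1} is zero")
    return -v[:-1] / v[-1]

rng = np.random.default_rng(0)
A_exact = rng.standard_normal((20, 2))
x_true = np.array([1.0, -2.0])
b = A_exact @ x_true + 0.05 * rng.standard_normal(20)
A_noisy = A_exact + 0.05 * rng.standard_normal((20, 2))
print(total_least_squares(A_noisy, b))   # close to [1, -2]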