Linear Systems of Algebraic Equations

This page presents some topics from Linear Algebra needed for constructing solutions to systems of linear algebraic equations, together with some applications. We use matrices and vectors as essential elements in obtaining and expressing the solutions.

Least Squares Approximation

The scientific custom of taking multiple observations of the same quantity and then selecting a single estimate that best represents it has its origin in the early part of the 16th century. At the beginning of the 19th century, two of the foremost mathematicians of the time, the German C.F. Gauss (1777--1855) and the Frenchman A.M. Legendre (1752--1833), proposed, independently of each other, an optimality criterion for choosing such an estimate, now known as the method of least squares.

The context for Legendre's proposal of least squares was that of geodesy. At that time, France's scientists had decided to adopt a new measurement system, in which the unit of length, the meter, would equal 1/40,000,000 of the circumference of the earth. This necessitated an accurate determination of that circumference, which, in turn, depended on the exact shape of the earth.

The credibility of the method of least squares was greatly enhanced by the Ceres incident. On January 1, 1801, the Italian astronomer Giuseppe Piazzi sighted a heavenly body that he strongly suspected to be a new planet. He announced his discovery and named it Ceres. Unfortunately, six weeks later, before enough observations had been taken to determine its orbit accurately and so ascertain that it was indeed a planet, Ceres disappeared behind the sun and was not expected to reemerge for nearly a year. Interest in this possibly new planet was widespread, and the young Gauss proposed that an area of the sky be searched that was quite different from those suggested by other astronomers; he turned out to be right! This success made him a celebrity, and his computation relied on two mathematical methods: row echelon reduction and the method of least squares.

In science and business, we often need to predict the relationship between two given variables. In many cases, we begin by performing appropriate experiments or a statistical analysis to obtain the necessary data. However, even if a simple law governs the behavior of the variables, this law may not be easy to find because of errors introduced in measuring or sampling. In practice, therefore, we are often content with a functional equation that provides a close approximation. There are three general techniques for finding functions that are closest to given data points:

  • linear regression using linear polynomials (matching straight lines);
  • general linear regression (polynomials, etc.), and
  • transformations to linear regression (for matching exponential or logarithmic functions).

Besides curve fitting, the least squares technique has proved to be very useful in statistical modeling of noisy data and in geodetic modeling. We discuss three standard ways to solve the least squares problem: the normal equations, the QR factorization, and the singular value decomposition.

Consider a typical application of least squares in curve fitting. Suppose that we are given a set of points \( (x_i , y_i ) \) for \( i = 1, 2, \ldots , n, \) for which we may not be able (or may not want) to find a function that passes through all points; rather, we want to find a function f(x) of a particular form that passes as closely as possible to the points. The differences
\[ e_i = f(x_i ) - y_i \quad\mbox{for}\quad i =1,2,\ldots , n, \]
are called the errors, deviations, or residuals. There are several norms that can be used with the residuals to measure how far the curve \( y = f(x) \) lies from the data:
  • maximum error: \( \quad \| e \|_{\infty} = \max_{1\le i \le n} \{ | f(x_i ) - y_i | \} ; \)
  • average error: \( \quad \| e \|_1 = \frac{1}{n} \,\sum_{i=1}^n | f(x_i ) - y_i | ; \)
  • root-mean-square error (or L2-error): \( \quad \| e \|_2 = \left( \frac{1}{n} \,\sum_{i=1}^n | f(x_i ) - y_i |^2 \right)^{1/2} . \)

Example. Compare the maximum error, average error, and rms error for the linear approximation \( y = f(x) = 3.5 -1.3\,x \) to the data points (-1,5), (0,4), (1,2), (2,1), (3,0), (4,-2), (5,-3), (6,-4).

The errors are summarized in the following table:

\[ \begin{array}{c|c|c|c|c} x_i & y_i & f(x_i ) = 3.5 -1.3\,x_i & |e_i | & e_i^2 \\ \hline -1 & 5 & 4.8 & 0.2 & 0.04 \\ 0 & 4 & 3.5 & 0.5 & 0.25 \\ 1 & 2 & 2.2 & 0.2 & 0.04 \\ 2 & 1 & 0.9 & 0.1 & 0.01 \\ 3 & 0 & -0.4 & 0.4 & 0.16 \\ 4 & -2 & -1.7 & 0.3 & 0.09 \\ 5 & -3 & -3.0 & 0.0 & 0.00 \\ 6 & -4 & -4.3 & 0.3 & 0.09 \end{array} \]
We can see that the maximum error \( \| e \|_{\infty} = 0.5 \) is the largest of the three error measures, since
\[ \| e \|_1 = \frac{1}{8}\,\sum_{i=1}^8 |e_i | = 0.25, \qquad \| e \|_2 = \frac{1}{2\sqrt{2}}\left( \sum_{i=1}^8 |e_i |^2 \right)^{1/2} \approx 0.291548 . \qquad ■ \]
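These three error measures are easy to reproduce in Mathematica (the system used for the computations later on this page); the following minimal check recomputes them for the data of this example:
data = {{-1, 5}, {0, 4}, {1, 2}, {2, 1}, {3, 0}, {4, -2}, {5, -3}, {6, -4}};
f[x_] := 3.5 - 1.3 x
errors = f[#[[1]]] - #[[2]] & /@ data;        (* residuals e_i = f(x_i) - y_i *)
{Max[Abs[errors]], Mean[Abs[errors]], Sqrt[Mean[errors^2]]}
(* approximately {0.5, 0.25, 0.291548} *)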

Let \( \{ ( x_i , y_i) \}_{i=1}^n \) be a set of n points, where the abscissas \( x_i \) are distinct. The least squares line \( y = m\,x+b \) is the line that minimizes the root-mean-square error \( \| e \|_2 . \)

The quantity \( \| e \|_2 \) will be a minimum if and only if the quantity \( \sum_{i=1}^n \left( m\,x_i +b - y_i \right)^2 \) is a minimum. The latter is visualized geometrically as minimizing the sum of the squares of the vertical distances from the points to the line.

Example. Suppose we’re given a collection of data points (x,y):

\[ (1,1) , \quad (2,2), \quad (3,2) \]
and we want to find the closest line \( y = mx+b \) to that collection. If the line went through all three points, we’d have:
\[ \begin{split} m+b &=1 , \\ 2m+b &=2, \\ 3m+b &= 2, \end{split} \]
which is equivalent to the vector equation:
\[ \begin{bmatrix} 1&1 \\ 1&2 \\ 1&3 \end{bmatrix} \begin{bmatrix} b \\ m \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} \qquad\mbox{or, in vector form:} \qquad {\bf A} \, {\bf x} = {\bf b} . \]

In our example the line does not go through all three points, so this equation is not solvable.

In statistical modeling, one often wishes to estimate certain parameters \( x_j \) based on some observations, where the observations are contaminated by noise. For instance, one wishes to predict the college grade point average (GPA) of freshman applicants, which we denote by b, based on their high school GPA, denoted by \( a_1 , \) and two Scholastic Aptitude Test scores, one verbal (\( a_2 \)) and one quantitative (\( a_3 \)), as part of the college admission process. It is customary to use a linear model in this case: \( b = a_1 x_1 + a_2 x_2 + a_3 x_3 . \)

Now we formulate the least squares problem. Suppose that we have a set of vectors \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} \) (usually, but not necessarily, linearly independent) and a given vector b. We seek coefficients \( x_1 , x_2 , \ldots , x_n \) that produce a minimal error

\[ \left\| {\bf b} - \sum_{i=1}^n x_i \,{\bf a}_i \right\| \]
with respect to the Euclidean inner product on \( \mathbb{R}^m . \) As long as we are interested only in linear combinations of the finite set \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} ,\) it is possible to transform the problem into one involving finite columns of numbers. In this case, define a matrix A whose columns are the vectors \( {\bf a}_i , \) and a vector x whose entries are the unknown coefficients \( x_i . \) Then the matrix A has dimensions m by n, meaning that the \( {\bf a}_i \) are vectors of length m. The problem can then be reformulated as choosing x to minimize \( \| {\bf A} {\bf x} - {\bf b} \|_2 , \) since \( {\bf A} {\bf x} = {\bf b} \) is, in general, an inconsistent linear system of m equations in n unknowns. We call such a minimizing vector \( \hat{\bf x} , \) if it exists, a least squares solution of \( {\bf A} {\bf x} = {\bf b} , \) and call \( {\bf b} - {\bf A} \hat{\bf x} \) the least squares error vector. Statisticians call this procedure linear regression.

If the column vectors \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} \) are linearly independent (which will be assumed), the matrix A has full column rank. The vector \( {\bf A}\,\hat{\bf x} = \hat{x}_1 {\bf a}_1 + \hat{x}_2 {\bf a}_2 + \cdots + \hat{x}_n {\bf a}_n \) that is closest to a given vector b is the orthogonal projection of b onto the subspace spanned by the a's. The hat over \( \hat{\bf x} \) indicates the best choice (in the sense that it minimizes \( \| {\bf A} {\bf x} - {\bf b} \|_2 \) ), the one that gives the closest vector in the column space. Therefore, solving the least squares problem is equivalent to finding the orthogonal projection matrix P onto the column space, for which \( {\bf P} {\bf b} = {\bf A}\,\hat{\bf x} . \)

The vector \( \hat{\bf x} \) is a solution to the least squares problem precisely when the error vector \( {\bf e} = {\bf b} - {\bf A} \hat{\bf x} \) is perpendicular to that subspace. Therefore, this error vector makes a right angle with each of the vectors \( {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n . \) That gives the n equations we need to find \( \hat{\bf x} : \)

\[ \begin{array}{c} {\bf a}^{\mathrm T}_1 \left( {\bf b} - {\bf A} \hat{\bf x} \right) =0 \\ \vdots \quad \vdots \\ {\bf a}^{\mathrm T}_n \left( {\bf b} - {\bf A} \hat{\bf x} \right) =0 \end{array} \qquad \mbox{or} \qquad \begin{bmatrix} --- \, {\bf a}^{\mathrm T}_1 \, --- \\ \vdots \\ --- \, {\bf a}^{\mathrm T}_n \, --- \end{bmatrix} \begin{bmatrix} \quad \\ {\bf b} - {\bf A} \hat{\bf x} \\ \quad \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix} . \]
The matrix in the right equation is \( {\bf A}^{\mathrm T} . \) The n equations are exactly
\[ {\bf A}^{\mathrm T} {\bf A} \, \hat{\bf x} = {\bf A}^{\mathrm T} \, {\bf b} . \]
This equation is called the normal equation.
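For the three-point example above, the normal equation can be formed and solved directly in Mathematica; the lines below are a minimal sketch, and the built-in command LeastSquares performs the same computation:
A = {{1, 1}, {1, 2}, {1, 3}};  b = {1, 2, 2};
LinearSolve[Transpose[A].A, Transpose[A].b]   (* {2/3, 1/2}: intercept b = 2/3, slope m = 1/2 *)
LeastSquares[A, b]                            (* same answer *)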

Now we derive the normal equation using another approach: calculus. To find the best approximation, we look for the x where the gradient of \( \| {\bf A} {\bf x} - {\bf b} \|_2^2 = \left( {\bf A} {\bf x} - {\bf b} \right)^{\mathrm T} \left( {\bf A} {\bf x} - {\bf b} \right) \) vanishes. So we want

\begin{align*} 0 &= \lim_{\epsilon \to 0} \frac{\left( {\bf A} ({\bf x} + \epsilon ) - {\bf b} \right)^{\mathrm T} \left( {\bf A} ({\bf x} + \epsilon ) - {\bf b} \right) - \left( {\bf A} {\bf x} - {\bf b} \right)^{\mathrm T} \left( {\bf A} {\bf x} - {\bf b} \right)}{\| \epsilon \|_2} \\ &= \lim_{\epsilon \to 0} \frac{2\epsilon^{\mathrm T} \left( {\bf A}^{\mathrm T} {\bf A} {\bf x} - {\bf A}^{\mathrm T} {\bf b} \right) + \epsilon^{\mathrm T} {\bf A}^{\mathrm T} {\bf A} \epsilon}{\| \epsilon \|_2} . \end{align*}
The second term
\[ \frac{\left\vert \epsilon^{\mathrm T} {\bf A}^{\mathrm T} {\bf A} \epsilon \right\vert}{\| \epsilon \|_2} \le \frac{\| {\bf A} \|_2^2 \| \epsilon \|_2^2}{\| \epsilon \|_2} = \| {\bf A} \|_2^2 \| \epsilon \|_2 \, \to \, 0 \]
approaches 0 as ε goes to 0, so the factor \( {\bf A}^{\mathrm T} {\bf A} {\bf x} - {\bf A}^{\mathrm T} {\bf b} \) in the first term must also be zero, or \( {\bf A}^{\mathrm T} {\bf A} {\bf x} = {\bf A}^{\mathrm T} {\bf b} . \) This is a system of n linear equations in n unknowns, the normal equation.

Why is \( \hat{\bf x} = \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} \, {\bf b} \) the minimizer of \( \| {\bf A} {\bf x} - {\bf b} \|_2^2 ? \) Note that the Hessian of this quadratic function, \( 2\,{\bf A}^{\mathrm T} {\bf A} , \) is positive definite when A has full column rank, which means that the function is strictly convex and any critical point is its global minimum. ■

Theorem: Suppose that \( \{ ( x_i , y_i) \}_{i=1}^n \) are n points, where the abscissas xi are distinct. The coefficients of the least squares line

\[ y= m\,x +b \]
are the solution to the following linear system, known as the normal equations:
\begin{align*} \left( \sum_{i=1}^n x_i^2 \right) m + b \left( \sum_{i=1}^n x_i \right) &= \sum_{i=1}^n x_i y_i , \\ \left( \sum_{i=1}^n x_i \right) m + n\,b &= \sum_{i=1}^n y_i . \end{align*}
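In Mathematica, these normal equations are assembled and solved automatically by the built-in command Fit. As a sketch, applying it to the eight data points of the first example of this section returns the least squares line directly:
data = {{-1, 5}, {0, 4}, {1, 2}, {2, 1}, {3, 0}, {4, -2}, {5, -3}, {6, -4}};
Fit[data, {1, x}, x]        (* approximately 3.67857 - 1.32143 x *)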

Theorem: If W is a finite-dimensional subspace of an inner product vector space V, and if b is a vector in V, then \( \mbox{proj}_W ({\bf b}) \) is the best approximation to b from W in the sense that

\[ \| {\bf b} - \mbox{proj}_W {\bf b} \| < \| {\bf b} - {\bf w} \| \]
for every vector w in W that is different from \( \mbox{proj}_W ({\bf b}) . \)

To solve the normal equation, we need one more definition: for an arbitrary \( m \times n \) matrix A, its pseudoinverse is defined as an \( n \times m \) matrix, denoted by \( {\bf A}^{\dagger} , \) that satisfies all four criteria:

  • \( {\bf A}\,{\bf A}^{\dagger} \,{\bf A} = {\bf A}; \)
  • \( {\bf A}^{\dagger} {\bf A} \,{\bf A}^{\dagger} = {\bf A}^{\dagger} ; \)
  • \( \left( {\bf A} \,{\bf A}^{\dagger} \right)^{\ast} = {\bf A} \,{\bf A}^{\dagger} \) (so \( {\bf P} = {\bf A} \,{\bf A}^{\dagger} \) is a self-adjoint m-by-m matrix);
  • \( \left( {\bf A}^{\dagger} {\bf A} \right)^{\ast} = {\bf A}^{\dagger} {\bf A} \) (so \( {\bf Q} = {\bf A}^{\dagger}{\bf A} \) is a self-adjoint n-by-n matrix). ■
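Mathematica provides the built-in command PseudoInverse. As a sketch, the following lines check all four Penrose conditions for the 3-by-2 matrix that appears in an example below; since the entries are exact, the comparisons are exact as well:
A = {{1, -1}, {1, -2}, {-1, 2}};
Adag = PseudoInverse[A];
{A.Adag.A == A, Adag.A.Adag == Adag, Transpose[A.Adag] == A.Adag, Transpose[Adag.A] == Adag.A}
(* {True, True, True, True} *)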

A pseudoinverse exists for an arbitrary matrix A, but when the latter has full rank, \( {\bf A}^{\dagger} \) is given by a simple algebraic formula. In particular, when A has linearly independent columns (and thus the matrix \( {\bf A}^{\ast} {\bf A} \) is invertible), \( {\bf A}^{\dagger} \) can be computed as:

\[ {\bf A}^{\dagger} = \left( {\bf A}^{\ast} {\bf A} \right)^{-1} {\bf A}^{\ast} . \]
This pseudoinverse is actually a left inverse of A because, in this case, \( {\bf A}^{\dagger} {\bf A} = {\bf I}_n . \) When A has linearly independent rows (so the matrix \( {\bf A}\,{\bf A}^{\ast} \) is invertible), \( {\bf A}^{\dagger} \) can be computed as:
\[ {\bf A}^{\dagger} = {\bf A}^{\ast} \left( {\bf A} \,{\bf A}^{\ast} \right)^{-1} . \]
This is a right inverse, as \( {\bf A} {\bf A}^{\dagger} = {\bf I}_m . \)

When the singular value decomposition of an m-by-n matrix A is known, \( {\bf A} = {\bf U}\,{\bf \Sigma}\, {\bf V}^{\ast} , \) with a unitary (or orthogonal) \( m \times m \) matrix U, an \( m \times n \) 'diagonal' matrix \( {\bf \Sigma} , \) and another unitary \( n \times n \) matrix V, we can construct its pseudoinverse explicitly: \( {\bf A}^{\dagger} = {\bf V} {\bf \Sigma}^{\dagger} {\bf U}^{\ast} , \) where

\[ {\bf \Sigma}^{\dagger} = \mbox{diag} \left( \frac{1}{\sigma_1} , \cdots , \frac{1}{\sigma_r} , 0, \cdots , 0\right) \in \mathbb{R}^{n\times m} . \]
Here \( \sigma_1 , \ldots , \sigma_r \) are the nonzero singular values of the matrix A, which has rank r. A reduced form of the pseudoinverse is also available when the unitary matrices U and V are truncated to keep only their first r columns:
\[ {\bf A}^{\dagger} = {\bf V}_{n \times r} {\bf \Sigma}^{\dagger}_r {\bf U}^{\ast}_{r \times m} , \quad {\bf \Sigma}^{\dagger}_r = {\bf \Sigma}^{-1}_r = \mbox{diag} \left( \frac{1}{\sigma_1} , \cdots , \frac{1}{\sigma_r}\right) \in \mathbb{R}^{r\times r} . \]
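As a sketch of this construction, the following Mathematica lines build \( {\bf A}^{\dagger} \) from the singular value decomposition of a small rank-one matrix (chosen here only for illustration) and compare the result with the built-in PseudoInverse; the rank r is decided from the computed singular values with a tolerance:
A = N[{{1, 2}, {2, 4}, {3, 6}}];                     (* a rank-one example *)
{u, s, v} = SingularValueDecomposition[A];
r = Count[Diagonal[s], sv_ /; sv > 10^-10];          (* numerical rank decision *)
Adag = v[[All, 1 ;; r]].DiagonalMatrix[1/Diagonal[s][[1 ;; r]]].Transpose[u[[All, 1 ;; r]]];
Max[Abs[Adag - PseudoInverse[A]]]                    (* of the order of machine epsilon *)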

Theorem: Let A be an \( m \times n \) matrix and \( {\bf A}^{\dagger} \) be its pseudoinverse n-by-m matrix.

  • \( {\bf P} = {\bf A}\,{\bf A}^{\dagger} \) and \( {\bf Q} = {\bf A}^{\dagger} {\bf A} \) are orthogonal projection operators (so they are idempotent \( {\bf P}^2 = {\bf P} , \quad {\bf Q}^2 = {\bf Q} \) and self-adjoint \( {\bf P} = {\bf P}^{\ast} , \quad {\bf Q} = {\bf Q}^{\ast} \) );
  • P is the orthogonal projector onto the range of A (which equals the orthogonal complement of the kernel of \( {\bf A}^{\ast} \) );
  • Q is the orthogonal projector onto the range of \( {\bf A}^{\ast} \) (which equals the orthogonal complement of the kernel of A);
  • \( \left( {\bf I} - {\bf P} \right) \) is the orthogonal projector onto the kernel of \( {\bf A}^{\ast} ; \)
  • \( \left( {\bf I} - {\bf Q} \right) \) is the orthogonal projector onto the kernel of A;
  • ker ( \( {\bf A}^{\dagger} \) ) = ker ( \( {\bf A}^{\ast} \) );
  • Range ( \( {\bf A}^{\dagger} \) ) = Range ( \( {\bf A}^{\ast} \) ). ■

The symmetric matrix \( {\bf A}^{\ast} {\bf A} \) is n by n. It is invertible if the a's (the columns of A) are linearly independent. For a full column rank m-by-n real matrix A, the solution of the least squares problem becomes \( \hat{\bf x} = \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} \, {\bf b} . \) The m-by-m projection matrix onto the subspace spanned by the columns of A (the range of the m-by-n matrix A) is
\[ {\bf P} = {\bf A} \,\left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} = {\bf A} \,{\bf A}^{\dagger} . \]
Then
\[ {\bf A} \hat{\bf x} = {\bf A} \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} \, {\bf b} . \]
So to solve the least squares problem \( {\bf A} {\bf x} = {\bf b} , \) which is inconsistent, one solves instead the nearest consistent problem \( {\bf A} \hat{\bf x} = {\bf P}\,{\bf b} . \)

Theorem: The general solution of the linear least squares problem \( {\bf A} {\bf x} = {\bf b} \) is

\[ \hat{\bf x} = {\bf A}^{\dagger} {\bf b} + \left( {\bf I} - {\bf A}^{\dagger} {\bf A} \right) {\bf w} , \qquad {\bf w} \ \mbox{ arbitrary}. \]

Theorem: If the m-by-n matrix A has full column rank (so the rank of A is n), then the solution of the linear least squares problem \( {\bf A} {\bf x} = {\bf b} \) is unique:

\[ \hat{\bf x} = {\bf A}^{\dagger} {\bf b} = {\bf V} {\bf \Sigma}^{\dagger} {\bf U}^{\ast} {\bf b} . \]
In the above formula, one can also use the reduced form of the singular value decomposition for \( {\bf A}^{\dagger} . \)

The general least squares solution presented in the theorem above is also valid for a system of equations Ax = b with \( m \le n , \) i.e., an underdetermined linear system with fewer equations than unknowns. In this case, \( {\bf x} = {\bf V}_r {\bf \Sigma}_r^{-1} {\bf U}_r^{\mathrm T} {\bf b} \) solves the problem

\[ \min \| {\bf x} \| \qquad\mbox{subject to } {\bf A} {\bf x} = {\bf b} . \]

Algorithm for general solution to the linear least squares problem

  1. Compute the SVD: [ U S V ] = svd(A)
  2. Make a rank decision, i.e., choose r such that \( \sigma_r >0 \) and \( \sigma_{r+1} = \cdots = \sigma_n =0 . \) This decision is necessary because rounding errors will prevent the zero singular values from being exactly zero.
  3. Set V1 = V(:,1:r), V2 = V(: ,r+1:n), Sr = S(1:r,1:r), U1 = U(:,1:r)
  4. Solution with minimal norm is xm = V1*(Sr\U1'*b)
  5. The general solution is x = xm + V2*c with an arbitrary \( c \in \mathbb{R}^{n-r} . \)
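A minimal Mathematica rendering of the same algorithm is sketched below; for illustration it is applied to the rank-deficient 3-by-3 system that is solved in an example later on this page, and the tolerance 10^-10 plays the role of the rank decision in step 2:
A = N[{{3, 2, 1}, {-1, 1, 4}, {1, 4, 9}}];  b = N[{2, -2, 1}];
{u, s, v} = SingularValueDecomposition[A];
r = Count[Diagonal[s], sv_ /; sv > 10^-10];        (* rank decision *)
v1 = v[[All, 1 ;; r]];  v2 = v[[All, r + 1 ;;]];
sr = s[[1 ;; r, 1 ;; r]];  u1 = u[[All, 1 ;; r]];
xm = v1.LinearSolve[sr, Transpose[u1].b]           (* solution with minimal norm *)
(* general solution: xm + v2.c for an arbitrary vector c of length n - r *)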

Another approach to the least squares problem uses the QR decomposition. Suppose that the given matrix A can be factored as a product \( {\bf A} = {\bf Q} {\bf R} \) of a matrix Q with orthonormal columns and an upper triangular matrix R (the thin QR factorization). Then we can rewrite \( {\bf A} {\bf x} - {\bf b} \) as a difference of two orthogonal vectors:

\[ {\bf A} {\bf x} - {\bf b} = {\bf Q} \left( {\bf R} {\bf x} - {\bf Q}^{\mathrm T} {\bf b} \right) - \left( {\bf I} - {\bf Q} {\bf Q}^{\mathrm T} \right) {\bf b} . \]
By the Pythagorean theorem,
\[ \left\| {\bf A} {\bf x} - {\bf b} \right\|_2^2 = \left\| {\bf Q} \left( {\bf R} {\bf x} - {\bf Q}^{\mathrm T} {\bf b} \right) \right\|_2^2 + \left\| \left( {\bf I} - {\bf Q} {\bf Q}^{\mathrm T} \right) {\bf b} \right\|_2^2 . \]
This sum of squares is minimized when the first term is zero, and we obtain the solution of the least squares problem:
\[ \hat{\bf x} = {\bf R}^{-1} {\bf Q}^{\mathrm T} {\bf b} . \]
The cost of this decomposition and subsequent least squares solution is \( 2n^2 m - \frac{2}{3}\,n^3 , \) about twice the cost of the normal equations if \( m \ge n \) and about the same if m = n.
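As a sketch, the QR approach looks as follows in Mathematica for the full-rank system solved in an example below. Note that Mathematica's QRDecomposition[A] returns a pair {q, r} with A equal to ConjugateTranspose[q].r, so q already plays the role of \( {\bf Q}^{\mathrm T} : \)
A = N[{{1, -1}, {1, -2}, {-1, 2}}];  b = N[{4, 1, 3}];
{q, r} = QRDecomposition[A];
xhat = LinearSolve[r, q.b]      (* {9., 5.}, matching the example below *)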

Example. Let U be the plane in \( \mathbb{R}^3 \) given by

\[ U = \left\{ \left[ x\ y \ z \right]^{\mathrm T} \ : \ 2x-y-2z=0 \right\} . \]
We are going to find the nearest point (position vector) in U to the point whose position vector is \( {\bf b} = \left[ 1, \ 1, \ 1 \right]^{\mathrm T} . \)

Since U is a plane with normal vector \( {\bf n} = \left[ 2, \ -1, \ -2 \right]^{\mathrm T} , \) we first find two linearly independent solutions of the equation 2x-y-2z=0:

\[ {\bf a}_1 = \left[ 1, \ 2, \ 0 \right]^{\mathrm T} \quad \mbox{and} \quad {\bf a}_2 = \left[ 1, \ 0, \ 1 \right]^{\mathrm T} . \]
Now we build a 3-by-2 matrix of these column vectors:
\[ {\bf A} = \begin{bmatrix} 1&1 \\ 2&0 \\ 0&1 \end{bmatrix} \qquad \Longrightarrow \qquad {\bf A}^{\mathrm T} = \begin{bmatrix} 1&2&0 \\ 1&0&1 \end{bmatrix} . \]
Then we build an orthogonal projection operator that maps 3-space onto the range of matrix A:
\[ {\bf P} = {\bf A} \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} = \frac{1}{9} \, \begin{bmatrix} 5&2&4 \\ 2&8&-2 \\ 4&-2&5 \end{bmatrix} . \]
This projection matrix does not depend on our choice of vectors \( {\bf a}_1 \) and \( {\bf a}_2 ; \) indeed, if we choose, for instance, the two other vectors \( \left[ 2, \ 2, \ 1 \right]^{\mathrm T} \) and \( \left[ 0, \ -2, \ 1 \right]^{\mathrm T} , \) the projection matrix will be the same. One can also check that P is idempotent, \( {\bf P}^2 = {\bf P} \) (see the verification code at the end of this example).
Application of the projection matrix P to vector b yields
\[ \hat{\bf b} = {\bf P} {\bf b} = \frac{1}{9} \, \begin{bmatrix} 5&2&4 \\ 2&8&-2 \\ 4&-2&5 \end{bmatrix} \, \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \frac{1}{9} \, \begin{bmatrix} 11 \\ 8 \\ 7 \end{bmatrix} \quad \Longrightarrow \quad \| \hat{\bf b} \| = \frac{\sqrt{26}}{3} \approx 1.69967 < \sqrt{3} = \| {\bf b} \| . \]
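The following Mathematica lines verify the computation with exact arithmetic: they rebuild P, confirm that it is idempotent, and apply it to b:
a1 = {1, 2, 0};  a2 = {1, 0, 1};
A = Transpose[{a1, a2}];                       (* 3-by-2 matrix with columns a1, a2 *)
P = A.Inverse[Transpose[A].A].Transpose[A];    (* equals 1/9 {{5,2,4},{2,8,-2},{4,-2,5}} *)
P.P == P                                       (* True: P is idempotent *)
P.{1, 1, 1}                                    (* {11/9, 8/9, 7/9} *)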

Example. Let us find the least squares solution to the linear system

\[ \begin{split} x_1 - x_2 &= 4 , \\ x_1 -2 x_2 &= 1, \\ -x_1 + 2x_2 &= 3 . \end{split} \]
It will be convenient to express the system in the matrix form \( {\bf A} {\bf x} = {\bf b} , \) where
\[ {\bf A} = \begin{bmatrix} 1&-1 \\ 1&-2 \\-1&2 \end{bmatrix} , \qquad {\bf b} = \begin{bmatrix} 4\\ 1 \\ 3 \end{bmatrix} . \]
It follows that
\begin{align*} {\bf A}^{\mathrm T} {\bf A} &= \begin{bmatrix} 1&1&-1 \\ -1&-2&2 \end{bmatrix} \begin{bmatrix} 1&-1 \\ 1&-2 \\-1&2 \end{bmatrix} = \begin{bmatrix} 3 & -5 \\ -5 & 9 \end{bmatrix} , \\ {\bf A} {\bf A}^{\mathrm T} &= \begin{bmatrix} 2&3&-3 \\ 3&5&-5 \\ -3&-5&5 \end{bmatrix} , \\ {\bf A}^{\mathrm T} {\bf b} &= \begin{bmatrix} 1&1&-1 \\ -1&-2&2 \end{bmatrix} \begin{bmatrix} 4\\ 1 \\ 3 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \end{bmatrix} , \end{align*}
so the normal system \( {\bf A}^{\mathrm T} {\bf A} \, {\bf x} = {\bf A}^{\mathrm T} {\bf b} \) becomes
\[ \begin{bmatrix} 3 & -5 \\ -5 & 9 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \end{bmatrix} . \]
Solving this system yields a unique least squares solution, namely,
\[ x_1 = 9, \quad x_2 = 5. \]
The least squares error vector is
\[ {\bf b} - {\bf A} \hat{\bf x} = \begin{bmatrix} 4 \\ 1 \\ 3 \end{bmatrix} - \begin{bmatrix} 1&-1 \\ 1&-2 \\-1&2 \end{bmatrix} \begin{bmatrix} 9 \\ 5 \end{bmatrix} = \begin{bmatrix} 0 \\ 2 \\ 2 \end{bmatrix} \]
and the least squares error is \( \left\| {\bf b} - {\bf A} \hat{\bf x} \right\| = 2\sqrt{2} \approx 2.82843 . \)

The pseudoinverse is the following 2 by 3 matrix:

\[ {\bf A}^{\dagger} = \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} = \frac{1}{2} \begin{bmatrix} 4&-1&1 \\ 2&-1&1 \end{bmatrix} , \]
and the projectors on the range of A and on the range of \( {\bf A}^{\mathrm T} \) (which is the orthogonal complement of the kernel of A) are
\[ {\bf P} = {\bf A} {\bf A}^{\dagger} = \frac{1}{2} \begin{bmatrix} 2&0&0 \\ 0&1&-1 \\ 0&-1&1 \end{bmatrix} \quad\mbox{and} \quad {\bf Q} = {\bf A}^{\dagger} {\bf A} = \begin{bmatrix} 1&0 \\ 0& 1 \end{bmatrix} , \]
respectively.
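A short Mathematica check of this example (the exact input keeps all results exact):
A = {{1, -1}, {1, -2}, {-1, 2}};  b = {4, 1, 3};
LeastSquares[A, b]              (* {9, 5} *)
Adag = PseudoInverse[A]         (* equals 1/2 {{4, -1, 1}, {2, -1, 1}} *)
{A.Adag, Adag.A}                (* the projectors P and Q displayed above *)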

Example. Find the least squares solutions, the least squares error vector, and the least squares error of the linear system

\[ \begin{split} 3x_1 + 2x_2 + x_3 &= 2 , \\ -x_1 + x_2 + 4x_3 &= -2, \\ x_1 + 4x_2 + 9 x_3 &= 1 . \end{split} \]
The matrix form of the system is \( {\bf A} {\bf x} = {\bf b} , \) where
\[ {\bf A} = \begin{bmatrix} 3&2&1 \\ -1&1&4 \\ 1&4&9 \end{bmatrix} \qquad\mbox{and} \qquad {\bf b} = \begin{bmatrix} 2 \\ -2 \\ 1 \end{bmatrix} . \]
Since the matrix A is singular and the rank of the augmented matrix \( \left[ {\bf A} \, | \, {\bf b} \right] \) is 3, the given system of algebraic equations is inconsistent. It follows that
\[ {\bf A}^{\mathrm T} {\bf A} = \begin{bmatrix} 11&9&8 \\ 9&21&42 \\ 8&42&98 \end{bmatrix} \qquad\mbox{and} \qquad {\bf A}^{\mathrm T} {\bf b} = \begin{bmatrix} 9 \\ 6 \\ 3 \end{bmatrix} , \]
so the augmented matrix for the normal system \( {\bf A}^{\mathrm T} {\bf A} {\bf x} = {\bf A}^{\mathrm T} {\bf b} \) is
\[ \left[ \begin{array}{ccc|c} 11&9&8&9 \\ 9&21&42&6 \\ 8&42&98&3 \end{array} \right] . \]
The reduced row echelon form of this matrix is
\[ \left[ \begin{array}{ccc|c} 1&0&- \frac{7}{5}&\frac{9}{10} \\ 0&1&\frac{13}{5}&-\frac{1}{10} \\ 0&0&0&0 \end{array} \right] \]
from which it follows that there are infinitely many least squares solutions, given by the parametric equations
\[ \begin{split} x_1 &= \frac{7}{5}\, t + \frac{9}{10} \\ x_2 &= - \frac{13}{5}\, t - \frac{1}{10} , \\ x_3 &= t . \end{split} \]
As a check, let us verify that all least squares solutions produce the same least squares error vector and the same least squares error. To see this, we first compute
\[ {\bf b} - {\bf A} \hat{\bf x} = \begin{bmatrix} 2 \\ -2 \\ 1 \end{bmatrix} - \begin{bmatrix} 3&2&1 \\ -1&1&4 \\ 1&4&9 \end{bmatrix} \begin{bmatrix} \frac{7}{5}\, t + \frac{9}{10} \\ - \frac{13}{5}\, t - \frac{1}{10} \\ t \end{bmatrix} = \frac{1}{2} \begin{bmatrix} -1 \\ -2 \\ 1 \end{bmatrix} . \]
Since \( {\bf b} - {\bf A} \hat{\bf x} \) does not depend on t, all least squares solutions produce the same error vector and hence the same least squares error, \( \| {\bf b} - {\bf A} \hat{\bf x} \| = \sqrt{3/2} \approx 1.22474 .\)
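Because A is singular, there are infinitely many least squares solutions, and the pseudoinverse picks out the one of minimum norm. The following sketch computes it in Mathematica and confirms that the residual agrees with the error vector found above:
A = {{3, 2, 1}, {-1, 1, 4}, {1, 4, 9}};  b = {2, -2, 1};
xmin = PseudoInverse[A].b;      (* minimum-norm least squares solution *)
b - A.xmin                      (* {-1/2, -1, 1/2}: the common error vector *)
Norm[b - A.xmin]                (* Sqrt[3/2] *)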

Example. Consider an environmental problem in which a sewage treatment plant discharges contaminated water into a river. Environmental engineers study the degradation of water quality downstream of a pollution source containing key chemical species whose concentrations exceed the maximum allowed by a regulating agency, such as the U.S. Environmental Protection Agency (EPA). In this example, we present a simple pollution model describing contamination in the river and in aquifers (geological units in which the pores of the soil are filled with water).

To develop a mathematical model, we first locate the origin at the plant and denote by x the distance from the plant. The variables of interest include the contaminant concentration, C (\( \texttt{kg/m}^3 \) ), the water flow rate, Q (\( \texttt{m}^3 /\texttt{day} \) ), the flow velocity, U (\( \texttt{m/day} \) ), and the river cross-sectional area, A (\( \texttt{m}^2 \) ). The boundary condition for this model is the measured (known) concentration \( C_0 \ (\texttt{kg/m}^3 ) \) at \( x=0 .\)

We assume that the system is in a steady state, so the flow is uniform and one-dimensional, and the velocity vectors are parallel to one another and perpendicular to the cross-sectional area. The contaminant discharge rate is assumed to be negligible compared to the river flow rate. Other assumptions: groundwater and tributary flows are small and therefore disregarded; the boundary condition is kept constant over time; longitudinal and transverse dispersion is neglected.

First, we invoke the law of mass conservation and apply it to a control volume in the river, \( A \times \Delta x , \) where Δx is a small length (in meters): the rate of change of chemical mass across the control volume must equal the loss rate due to chemical transformation. The chemical mass flow rate may be expressed as the product of the fluid flow rate and the concentration, QC (\( \texttt{kg/day} \) ), and its change across the control volume is its derivative with respect to distance times Δx, which leads to the conservation equation:

\[ Q \,\frac{{\text d}C}{{\text d} x}\,\Delta x = -I , \]
where I is the solid-mass loss rate due to transformation (\( \texttt{kg/day} \) ). Since Q = AU and the element volume is V = A Δx (\( \texttt{m}^3 \) ), dividing by V turns the above equation into
\[ U \,\frac{{\text d}C}{{\text d} x} = -\frac{I}{V} . \]
A common representation of I/V (the loss rate per unit volume) assumes that the substance decays according to a first-order reaction. Therefore, the substance loss rate is proportional to its concentration, I/V = KC, where K is the decay rate of the substance (\( \texttt{day}^{-1} \) ), assumed constant. This yields
\[ U \,\frac{{\text d}C}{{\text d} x} = - K\,C, \qquad C(0) = C_0 . \]
The above initial value problem is easily solved:
\[ C(x) = C_0 e^{-Kx/U} . \]
Taking natural logarithms on both sides, we obtain the linear equation
\[ \ln C(x) = - \frac{K}{U}\,x + \ln C_0 . \]
This equation may be used to forecast the contaminant concentration at a distance x downstream of the source.

Environmental engineers measured organic pollution, namely the concentration of fecal coliform bacteria, in a stream. A stream with an average flow velocity of 5 m/s was sampled for total coliform count per liter (N/liter) at different distances downstream of a sewage treatment plant. At distances of 10, 20, 30, and 40 km, the coliform concentrations were 58000, 25000, 9000, and 6500 N/liter, respectively. We use the least squares approach to estimate the total coliform decay rate K.

First, we convert the velocity from meters per second to kilometers per day:

\[ U = 5 \,\texttt{m/s} = 5 \times 86.4 \,\texttt{km/day} = 432 \,\texttt{km/day} . \]
Next, we enter the data, taking logarithms of the concentrations:
Cdata = {Log[58000], Log[25000], Log[9000], Log[6500]}
and then enter all other numerical values:
X = {10,20,30,40}
n = Length[X]
U = 432
m = {X, Cdata}
FindFit[Transpose[m], alpha - beta*x, {alpha, beta}, x]
{alpha -> 11.6417, beta -> 0.0758757}
Finally, we estimate the total coliform decay rate to be
K = 0.07587573748466468*432
32.7783
and plot data along with the least squares line:
lp = ListPlot[Transpose[m], PlotStyle -> PointSize[Large]]
line = Plot[11.641735113511862 - 0.07587573748466468*x, {x, 0, 40}, PlotStyle -> Brown]
Show[{line, lp}]

Fitting Curves

Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, usually subject to constraints. Such minimization problems lead to nonlinear least squares problems. While there are algorithms to handle these problems (for instance, the Levenberg–Marquardt algorithm (LMA), also known as the damped least-squares (DLS) method, developed by the American statisticians Kenneth Levenberg (1919--1973) and Donald W. Marquardt (1929--1997)), we consider two typical, relatively simple problems.

We start by fitting a straight line in the plane, defined by

\[ c + n_1 x + n_2 y =0 , \qquad n_1^2 + n_2^2 =1. \]
The unit vector \( (n_1 , n_2 ) \) is orthogonal to the line. For a given point (x,y), the absolute value of \( r= c + n_1 x + n_2 y \) gives the distance from the line to the point. Therefore, if we want to determine the line for which the sum of squares of the distances to given points is minimal, we have to solve the constrained least squares problem
\[ \left( \begin{array}{ccc} 1& x_1 & y_1 \\ 1 & x_2 & y_2 \\ \vdots & \vdots & \vdots \\ 1 & x_m & y_m \end{array} \right) \begin{bmatrix} c \\ n_1 \\ n_2 \end{bmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \qquad \mbox{subject to } \quad n_1^2 + n_2^2 =1 . \]
Using the QR decomposition A = QR, we can reduce the linear system to R x = 0, and the problem becomes
\[ \left( \begin{array}{ccc} r_{11} & r_{12} & r_{13} \\ 0& r_{22} & r_{23} \\ 0&0&r_{33} \end{array} \right) \begin{bmatrix} c \\ n_1 \\ n_2 \end{bmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \qquad \mbox{subject to } \quad n_1^2 + n_2^2 =1 . \]
Since the nonlinear constraint involves only the two unknowns \( n_1 , n_2 , \) it suffices to consider the 2-by-2 subproblem
\[ \left( \begin{array}{cc} r_{22} & r_{23} \\ 0&r_{33} \end{array} \right) \begin{bmatrix} n_1 \\ n_2 \end{bmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \qquad \mbox{subject to } \quad n_1^2 + n_2^2 =1 , \]
whose solution \( (n_1 , n_2 ) \) is the right singular vector of the 2-by-2 block corresponding to its smallest singular value; the first equation, \( r_{11} c + r_{12} n_1 + r_{13} n_2 = 0 , \) then determines c.
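A minimal Mathematica sketch of this procedure, using hypothetical sample points (the data below are made up for illustration):
pts = {{0., 0.1}, {1., 0.9}, {2., 2.2}, {3., 2.8}, {4., 4.1}};
A = Map[{1, #[[1]], #[[2]]} &, pts];                (* rows (1, x_i, y_i) *)
{q, r} = QRDecomposition[A];
{uu, ss, vv} = SingularValueDecomposition[r[[2 ;; 3, 2 ;; 3]]];
{n1, n2} = vv[[All, -1]];                           (* right singular vector for the smallest singular value *)
c = -(r[[1, 2]] n1 + r[[1, 3]] n2)/r[[1, 1]];
{c, n1, n2}                                         (* the fitted line c + n1 x + n2 y = 0 *)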

Another problem is fitting an ellipse in the plane, defined by the quadratic form

\[ {\bf x}^{\mathrm T} {\bf A} \,{\bf x} + {\bf b}^{\mathrm T} {\bf x} +c =0 \]
with a positive definite symmetric matrix A. For each measured point \( {\bf x}_i , \) inserting it into the ellipse equation yields a linear equation in the unknown coefficients \( {\bf u} = \left[ a_{11} , a_{12} = a_{21}, a_{22} , b_1 , b_2 , c \right] . \) This yields the minimization problem subject to a normalization condition:
\[ \| {\bf B}\,{\bf u} \| = \min \quad \mbox{subject to } \ \| {\bf u} \| =1 , \]
which can be solved using the SVD.
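A sketch of the ellipse fit under the normalization \( \| {\bf u} \| =1 , \) again with hypothetical sample points: each point contributes one row of B, and the minimizer is the right singular vector belonging to the smallest singular value:
pts = {{2., 0.}, {0., 1.}, {-2., 0.}, {0., -1.}, {1.7, 0.5}, {-1.7, -0.5}};   (* made-up data *)
B = Map[{#[[1]]^2, 2 #[[1]] #[[2]], #[[2]]^2, #[[1]], #[[2]], 1} &, pts];
{uu, ss, vv} = SingularValueDecomposition[B];
coeffs = vv[[All, -1]]     (* coefficients (a11, a12, a22, b1, b2, c) minimizing ||B u|| with ||u|| = 1 *)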

 

Polynomial Fit


Some situations involve a power function \( f(x) = C\,x^m , \) where m is a known constant. For example, in 1619 the German mathematician, astronomer, and astrologer Johannes Kepler (1571--1630) formulated the third law of planetary motion, \( T = C\,x^{3/2} , \) where x is the distance to the sun measured in millions of kilometers, T is the orbital period measured in days, and C is a positive constant. In these cases, there is only one parameter C to be determined.

Theorem: Suppose that \( \left\{ (x_i , y_i ) \right\}_{i=1}^n \) are n points, where abscissas are distinct. The coefficient C of the least-squares power curve \( y = C\,x^m , \) is given by

\[ C = \left( \sum_{i=1}^n x_i^m y_i \right) \Big/ \left( \sum_{i=1}^n x_i^{2m} \right) . \qquad ■ \]

In many cases linear regression is not suitable for the data. Then one may try to fit the data with a polynomial of degree m:

\[ f(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_m x^m . \]
However, if the data do not exhibit a polynomial nature, the resulting curve may exhibit large oscillations. This phenomenon, called polynomial wiggle, becomes more pronounced with higher-degree polynomials. For this reason we seldom use a polynomial of degree 6 or above unless it is known that the true function we are working with is a polynomial. We show how the method works in the case of a parabola (m = 2).

Theorem: Suppose that \( \left\{ (x_i , y_i ) \right\}_{i=1}^n \) are n points, where abscissas are distinct. The coefficients of the least-squares parabola

\[ f(x) = a\, x^2 + b\,x + c \]
are the solution values of the linear system
\begin{align*} \left( \sum_{i=1}^n x_i^4 \right) a + \left( \sum_{i=1}^n x_i^3 \right) b + \left( \sum_{i=1}^n x_i^2 \right) c &= \sum_{i=1}^n y_i x_i^2 , \\ \left( \sum_{i=1}^n x_i^3 \right) a + \left( \sum_{i=1}^n x_i^2 \right) b + \left( \sum_{i=1}^n x_i \right) c &= \sum_{i=1}^n y_i x_i \\ \left( \sum_{i=1}^n x_i^2 \right) a + \left( \sum_{i=1}^n x_i \right) b + n\,c &= \sum_{i=1}^n y_i . \qquad ■ \end{align*}
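The built-in Mathematica command Fit assembles and solves exactly these normal equations. As a sketch, the lines below fit a parabola to the eight data points from the first example of this section, once with Fit and once with LeastSquares applied to the design matrix:
data = {{-1, 5}, {0, 4}, {1, 2}, {2, 1}, {3, 0}, {4, -2}, {5, -3}, {6, -4}};
Fit[data, {1, x, x^2}, x]                                     (* least squares parabola c + b x + a x^2 *)
{xs, ys} = N[Transpose[data]];
M = Transpose[{xs^2, xs, ConstantArray[1., Length[xs]]}];     (* design matrix with columns x^2, x, 1 *)
LeastSquares[M, ys]                                           (* the coefficients {a, b, c} *)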

Total Least Squares

The linear least squares problem A x = b has so far been solved by projecting the vector b onto the range of A. With ``Total Least Squares'' the system of equations is made consistent by changing both A and b. We look for a matrix \( {\cal A} \) and a vector \( {\cal b} \) in its range that differ as little as possible from the given data
\[ \left\| \left[ {\bf A} , {\bf b} \right] - \left[ {\cal A} , {\cal b} \right] \right\|_F = \min \qquad \mbox{subject to } \ {\cal b} \in \mbox{Range}({\cal A} ) . \]
In general, the augmented matrix \( {\bf C} = \left[ {\bf A} , {\bf b} \right] \) has rank n + 1, while the matrix A has rank n. The constraint says that the augmented matrix \( {\cal C} = \left[ {\cal A} , {\cal b} \right] \) must have rank n. Upon introducing the error matrix \( \Delta = {\bf C} - {\cal C} , \) the problem reduces to
\[ \| \Delta \|_F^2 = \min \qquad\mbox{subject to } \ \left( {\bf C} + \Delta \right) {\bf z} =0 , \]
where \( {\bf z} = \left( \begin{array}{c} {\bf x} \\ -1 \end{array} \right) \ne 0. \) Applying the SVD to C, we get
\[ \Delta = - \sigma_{n+1} {\bf u}_{n+1} {\bf v}_{n+1}^{\mathrm T} . \]
If the last component of \( {\bf v}_{n+1} \) is not zero, the total least squares solution exists:
\[ {\cal A} \hat{\bf x} = {\cal b} , \qquad \hat{\bf x} = - \frac{1}{v_{n+1,n+1}} \,v(1:n,n+1) . \]
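A minimal total least squares sketch in Mathematica, with a hypothetical overdetermined system: the augmented matrix [A, b] is formed, its SVD is computed, and \( \hat{\bf x} \) is read off from the last right singular vector (provided its last component is nonzero):
A = N[{{1, 1}, {1, 2}, {1, 3}, {1, 4}}];  b = N[{1.1, 1.9, 3.2, 3.9}];   (* made-up data *)
c1 = Join[A, Transpose[{b}], 2];                  (* augmented matrix [A, b] *)
{uu, ss, vv} = SingularValueDecomposition[c1];
vlast = vv[[All, -1]];                            (* right singular vector for sigma_{n+1} *)
xtls = -vlast[[1 ;; -2]]/vlast[[-1]]              (* total least squares solution *)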
