Preface


This is a tutorial made solely for the purpose of education, and it was designed for students taking Applied Math 0340. It is primarily for students who have some experience using Mathematica. If you have never used Mathematica before and would like to learn more of the basics of this computer algebra system, we strongly recommend looking at the APMA 0330 tutorial. As a friendly reminder, don't forget to clear variables in use and/or the kernel.

Finally, the commands in this tutorial are all written in bold black font, while Mathematica output is in regular font. This means that you can copy and paste all commands into Mathematica, change the parameters, and run them. You, as the user, are free to use the scripts for your needs in learning how to use the Mathematica program, and have the right to distribute this tutorial and refer to this tutorial as long as this tutorial is accredited appropriately.


Least Squares Approximation


The scientific custom of taking multiple observations of the same quantity and then selecting a single estimate that best represents it has its origin in the early part of the 16th century. At the beginning of the 19th century, two of the foremost mathematicians of the time, the German Johann Carl Friedrich Gauss (1777--1855) and the Frenchman Adrien-Marie Legendre (1752--1833), proposed, independently of each other, an optimality criterion for combining observations: the method of least squares, of which the ordinary average is the simplest special case.

The context for Legendre's proposal of least squares was that of geodesy. At that time, France's scientists had decided to adopt a new measurement system and proposed that the unit of length, the meter, be equal to 1/40,000,000 of the circumference of the earth. This necessitated an accurate determination of the said circumference, which, in turn, depended on the exact shape of the earth.

The credibility of the method of least squares was greatly enhanced by the Ceres incident. On January 1, 1801, the Italian astronomer Giuseppe Piazzi sighted a heavenly body that he strongly suspected to be a new planet. He announced his discovery and named it Ceres. Unfortunately, six weeks later, before enough observations had been taken to determine its orbit accurately enough to ascertain that it was indeed a planet, Ceres disappeared behind the sun and was not expected to reemerge for nearly a year. Interest in this possibly new planet was widespread, and the young Gauss proposed that an area of the sky be searched that was quite different from those suggested by other astronomers; he turned out to be right! He became a celebrity upon this success, and his calculation rested on two mathematical methods: row echelon reduction and the method of least squares.

In science and business, we often need to predict the relationship between two given variables. In many cases, we begin by performing an appropriate experiment or statistical analysis to obtain the necessary data. However, even if a simple law governs the behavior of the variables, this law may not be easy to find because of errors introduced in measuring or sampling. In practice, therefore, we are often content with a functional equation that provides a close approximation. Several general techniques are known for finding functions that are closest to a given set of data.

Historically, besides curve fitting, the least squares technique has proved to be very useful in statistical modeling of noisy data and in geodetic modeling. We discuss three standard ways to solve the least squares problem: the normal equations, the QR factorization, and the singular value decomposition.

Consider a typical application of least squares in curve fitting. Suppose that we are given a set of points \( (x_i , y_i ) \) for \( i = 1, 2, \ldots , n, \) for which we may not be able (or may not want) to find a function that passes through all points; rather, we may want to find a function f(x) of a particular form that passes as closely as possible to the points. The differences
\[ e_i = f(x_i ) - y_i \quad\mbox{for}\quad i =1,2,\ldots n, \]
are called the errors, deviations, or residuals. There are several norms that can be used with the residuals to measure how far the curve \( y = f(x) \) lies from the data:
\[ \| e \|_{\infty} = \max_{1\le i \le n} |e_i| \ \ (\mbox{maximum error}), \qquad \| e \|_{1} = \frac{1}{n}\,\sum_{i=1}^n |e_i| \ \ (\mbox{average error}), \qquad \| e \|_{2} = \left( \frac{1}{n}\,\sum_{i=1}^n |e_i|^2 \right)^{1/2} \ \ (\mbox{root-mean-square error}). \]

Example. Compare the maximum error, average error, and rms error for the linear approximation \( y = f(x) = 3.5 -1.3\,x \) to the data points (-1,5), (0,4), (1,2), (2,1), (3,0), (4,-2), (5,-3), (6,-4).

The errors are summarized in the following table:

\[ \begin{array}{ccccc} x_i & y_i & f(x_i ) = 3.5 -1.3\,x_i & |e_i | & e_i^2 \\ \hline
-1 & 5 & 4.8 & 0.2 & 0.04 \\
0 & 4 & 3.5 & 0.5 & 0.25 \\
1 & 2 & 2.2 & 0.2 & 0.04 \\
2 & 1 & 0.9 & 0.1 & 0.01 \\
3 & 0 & -0.4 & 0.4 & 0.16 \\
4 & -2 & -1.7 & 0.3 & 0.09 \\
5 & -3 & -3.0 & 0 & 0 \\
6 & -4 & -4.3 & 0.3 & 0.09 \end{array} \]
We can see that the maximum error \( \| e \|_{\infty} = 0.5 \) is the largest of the three measures because
\[ \| e \|_1 = \frac{1}{8}\,\sum_{i=1}^8 |e_i | = 0.25, \qquad \| e \|_2 = \frac{1}{2\sqrt{2}}\left( \sum_{i=1}^8 |e_i |^2 \right)^{1/2} \approx 0.291548 . \qquad ■ \]

Linear Regression


Let \( \{ ( x_i , y_i) \}_{i=1}^n \) be a set of n points, where the abscissas xi are distinct. The least-squares line (or linear regression line) \( y = m\,x+b \) is the line that minimizes the root-mean-square error \( \| e \|_2 . \)

The quantity \( \| e \|_2 \) will be a minimum if and only if the quantity \( \sum_{i=1}^n \left( m\,x_i +b - y_i \right)^2 \) is a minimum. The latter is visualized geometrically as minimizing the sum of the squares of the vertical distances from the points to the line.

line = Graphics[Line[{{0, 0}, {2, 2}}]]
line1 = Graphics[Line[{{0.25, 0.25}, {0.25, 0.36}}]]
disk1 = Graphics[{Green, Disk[{0.25, 0.36}, 0.02]}]
line2 = Graphics[Line[{{0.5, 0.5}, {0.5, 0.3}}]]
disk2 = Graphics[{Green, Disk[{0.5, 0.3}, 0.02]}]
line3 = Graphics[Line[{{1.0, 1.0}, {1.0, 1.25}}]]
disk3 = Graphics[{Green, Disk[{1.0, 1.25}, 0.02]}]
line4 = Graphics[Line[{{1.75, 1.75}, {1.75, 1.85}}]]
disk4 = Graphics[{Green, Disk[{1.75, 1.85}, 0.02]}]
line5 = Graphics[Line[{{2.0, 2.0}, {2.0, 1.75}}]]
disk5 = Graphics[{Green, Disk[{2.0, 1.75}, 0.02]}]
text1 = Graphics[Text["(x_i, m x_i +b)", {1.0, 0.9}]]
text2 = Graphics[Text["(x_i, y_i)", {1.0, 1.35}]]

Show[line, line1, disk1, line2, disk2, line3, disk3, line4, disk4,
line5, disk5, text1, text2, Axes -> True, AxesLabel -> {x, y},
Ticks -> {{{0.25, Subscript[x, 1]}, {0.5, Subscript[x, 2]}, {1.0, Subscript[x, i]}, {1.75, Subscript[x, n - 1]},
{2.0, Subscript[x, n]}}, {0, 2}}]
Now we explicitly express the coefficients m and b of the linear regression (least-squares) line:
\[ m= \frac{S_{xy}}{S_{xx}} , \qquad b = y_{\text{ave}} - m\,x_{\text{ave}} , \]
where
\[ y_{\text{ave}} = \frac{1}{n} \, \sum_{i=1}^n y_i , \qquad x_{\text{ave}} = \frac{1}{n} \, \sum_{i=1}^n x_i , \]
and the quantities Sxy and Sxx are expressed through the data and their averages:
\[ S_{xy} = \sum_{i=1}^n x_i y_i - n\,x_{\text{ave}} y_{\text{ave}} , \qquad S_{xx} = \sum_{i=1}^n x_i^2 - n \left( x_{\text{ave}} \right)^2 . \]
We use Mathematica to find all ingredients needed for linear regression:
xave = Sum[x[[i]],{i,1,n}]/n
yave = Sum[y[[i]],{i,1,n}]/n
Sxy=0;
Sxx=0;
For[i = 1, i <= n, i++,
Sxy = Sxy + x[[i]]*y[[i]] - xave*yave;
Sxx = Sxx + (x[[i]]^2) - xave^2]
Then we determine the coefficients in least-squares line:
m = N[Sxy/Sxx]
b = N[yave - m*xave]
Then we print the equation of the fitted line:
Print["y=", b, "+", m, "x"]
Finally we plot the data points as well as the least squares regression line:
Datatable = Table[{x[[i]], y[[i]]}, {i, 1, n}];
points = ListPlot[Datatable, PlotStyle -> PointSize[0.02], DisplayFunction -> Identity];
line = Plot[b + m*xx, {xx, Min[x], Max[x]}, DisplayFunction -> Identity];
Show[points, line, PlotLabel -> "Linear Regression Model with data points", DisplayFunction -> $DisplayFunction]
For the points
n=5;
x = {1, 2, 3, 4, 5}
y = {-1, 2, 9, 6, 0}
we get the picture
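The built-in commands Fit and LinearModelFit carry out the same computation, so they provide a quick cross-check for these data (here xx is used only as a symbolic variable):
data = Transpose[{x, y}];          (* pairs {x_i, y_i} built from the lists above *)
Fit[data, {1, xx}, xx]             (* the least squares line b + m xx *)
lm = LinearModelFit[data, xx, xx]; (* the same fit, with additional diagnostics available *)
Normal[lm]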

Example. Suppose we’re given a collection of data points (x,y):

\[ (1,1) , \quad (2,2), \quad (3,2) \]
and we want to find the closest line \( y = mx+b \) to that collection. If the line went through all three points, we’d have:
\[ \begin{split} m+b &=1 , \\ 2m+b &=2, \\ 3m+b &= 2, \end{split} \]
which is equivalent to the vector equation:
\[ \begin{bmatrix} 1&1 \\ 1&2 \\ 1&3 \end{bmatrix} \begin{bmatrix} b \\ m \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix} \qquad\mbox{or, in vector form:} \qquad {\bf A} \, {\bf x} = {\bf b} . \]
In our example the line does not go through all three points, so this equation is not solvable. ■
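Although the system is inconsistent, its least squares solution can be checked directly in Mathematica; a minimal sketch (the output {2/3, 1/2} lists the intercept b and the slope m in the order of the columns of A):
A = {{1, 1}, {1, 2}, {1, 3}};
b = {1, 2, 2};
LeastSquares[A, b]                            (* {2/3, 1/2}, i.e. the line y = 2/3 + x/2 *)
LinearSolve[Transpose[A].A, Transpose[A].b]   (* the same answer via A^T A x = A^T b, derived below *)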

In statistical modeling, one often wishes to estimate certain parameters xj based on some observations, where the observations are contaminated by noise. For instance, as part of the college admission process, one wishes to predict the college grade point average (GPA) of freshman applicants, which we denote by b, based on their high school GPA, denoted by a1, and two Scholastic Aptitude Test scores, one verbal (a2) and one quantitative (a3). It is customary to use a linear model in this case: \( b = a_1 x_1 + a_2 x_2 + a_3 x_3 . \)

Now we formulate the least squares problem. Suppose that we have a set (usually linearly independent, but not necessarily) of vectors \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} \) and a given vector b. We seek coefficients \( x_1 , x_2 , \ldots , x_n \) that produce a minimal error

\[ \left\| {\bf b} - \sum_{i=1}^n x_i \,{\bf a}_i \right\| \]
with respect to the Euclidean inner product on \( \mathbb{R}^m . \) As long as we are interested only in linear combinations of a finite set \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} ,\) it is possible to transform the problem into one involving finite columns of numbers. In this case, define a matrix A with columns given by the linearly independent column vectors \( {\bf a}_i , \) and a vector x whose entries are the unknown coefficients xi. Then matrix A has dimensions m by n, meaning that \( {\bf a}_i \) are vectors of length m. The problem then can be reformulated as choosing x to minimize \( \| {\bf A} {\bf x} - {\bf b} \|_2 \) because \( {\bf A} {\bf x} = {\bf b} \) is an inconsistent linear system of m equations in n unknowns. We call such a vector \( \hat{\bf x} , \) if it exists, a least squares solution of \( {\bf A} {\bf x} = {\bf b} , \) and call \( {\bf b} - {\bf A} \hat{\bf x} \) the least squares error vector. Statisticians call this procedure linear regression.

If the column vectors \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} \) are linearly independent (which will be assumed), matrix A has full column rank. The vector \( {\bf A}\,\hat{\bf x} = \hat{x}_1 {\bf a}_1 + \hat{x}_2 {\bf a}_2 + \cdots + \hat{x}_n {\bf a}_n \) that is closest to a given vector b is its orthogonal projection onto the subspace spanned by the a's. The hat over \( \hat{\bf x} \) indicates the best choice (in the sense that it minimizes \( \| {\bf A} {\bf x} - {\bf b} \|_2 \) ), the one that gives the closest vector in the column space. Therefore, solving the least squares problem is equivalent to finding the orthogonal projection matrix P onto the column space such that \( {\bf P} {\bf b} = {\bf A}\,\hat{\bf x} . \)

The vector \( \hat{\bf x} \) is a solution to the least squares problem when the error vector \( {\bf e} = {\bf b} - {\bf A} \hat{\bf x} \) is perpendicular to the subspace. Therefore, this error vector makes a right angle with each of the vectors \( \{ {\bf a}_1 , {\bf a}_2 , \ldots , {\bf a}_n \} . \) That gives the n equations we need to find \( \hat{\bf x} : \)

\[ \begin{array}{c} {\bf a}^{\mathrm T}_1 \left( {\bf b} - {\bf A} \hat{\bf x} \right) =0 \\ \vdots \quad \vdots \\ {\bf a}^{\mathrm T}_n \left( {\bf b} - {\bf A} \hat{\bf x} \right) =0 \end{array} \qquad \mbox{or} \qquad \begin{bmatrix} --- \, {\bf a}^{\mathrm T}_1 \, --- \\ \vdots \\ --- \, {\bf a}^{\mathrm T}_n \, --- \end{bmatrix} \begin{bmatrix} \quad \\ {\bf b} - {\bf A} \hat{\bf x} \\ \quad \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix} . \]
The matrix in the right equation is \( {\bf A}^{\mathrm T} . \) The n equations are exactly
\[ {\bf A}^{\mathrm T} {\bf A} \, \hat{\bf x} = {\bf A}^{\mathrm T} \, {\bf b} . \]
This equation is called the normal equation.

Now we derive the normal equation using another approach: calculus. To find the best approximation, we look for the x where the gradient of \( \| {\bf A} {\bf x} - {\bf b} \|_2^2 = \left( {\bf A} {\bf x} - {\bf b} \right)^{\mathrm T} \left( {\bf A} {\bf x} - {\bf b} \right) \) vanishes. So we want

\begin{align*} 0 &= \lim_{\epsilon \to 0} \frac{\left( {\bf A} ({\bf x} + \epsilon ) - {\bf b} \right)^{\mathrm T} \left( {\bf A} ({\bf x} + \epsilon ) - {\bf b} \right) - \left( {\bf A} {\bf x} - {\bf b} \right)^{\mathrm T} \left( {\bf A} {\bf x} - {\bf b} \right)}{\| \epsilon \|_2} \\ &= \lim_{\epsilon \to 0} \frac{2\epsilon^{\mathrm T} \left( {\bf A}^{\mathrm T} {\bf A} {\bf x} - {\bf A}^{\mathrm T} {\bf b} \right) + \epsilon^{\mathrm T} {\bf A}^{\mathrm T} {\bf A} \epsilon}{\| \epsilon \|_2} . \end{align*}
The second term
\[ \frac{\left\vert \epsilon^{\mathrm T} {\bf A}^{\mathrm T} {\bf A} \epsilon \right\vert}{\| \epsilon \|_2} \le \frac{\| {\bf A} \|_2^2 \| \epsilon \|_2^2}{\| \epsilon \|_2} = \| {\bf A} \|_2^2 \| \epsilon \|_2 \, \to \, 0 \]
approaches 0 as ε goes to 0, so the factor \( {\bf A}^{\mathrm T} {\bf A} {\bf x} - {\bf A}^{\mathrm T} {\bf b} \) in the first term must also be zero (otherwise choosing ε along this vector would keep the first term bounded away from zero), or \( {\bf A}^{\mathrm T} {\bf A} {\bf x} = {\bf A}^{\mathrm T} {\bf b} . \) This is a system of n linear equations in n unknowns, the normal equation.

Why is \( \hat{\bf x} = \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} \, {\bf b} \) the minimizer of \( \| {\bf A} {\bf x} - {\bf b} \|_2^2 ? \) Note that the Hessian \( 2\,{\bf A}^{\mathrm T} {\bf A} \) is positive definite (because A has full column rank), which means that the function is strictly convex and any critical point is a global minimum. ■

Theorem: Suppose that \( \{ ( x_i , y_i) \}_{i=1}^n \) are n points, where the abscissas xi are distinct. The coefficients of the least squares line

\[ y= m\,x +b \]
are the solution to the following linear system, known as the normal equations:
\begin{align*} \left( \sum_{i=1}^n x_i^2 \right) m + b \left( \sum_{i=1}^n x_i \right) &= \sum_{i=1}^n x_i y_i , \\ \left( \sum_{i=1}^n x_i \right) m + n\,b &= \sum_{i=1}^n y_i . \end{align*}

Theorem: If W is a finite-dimensional subspace of an inner product vector space V, and if b is a vector in V, then \( \mbox{proj}_W ({\bf b}) \) is the best approximation to b from W in the sense that

\[ \| {\bf b} - \mbox{proj}_W {\bf b} \| < \| {\bf b} - {\bf w} \| \]
for every vector w in W that is different from \( \mbox{proj}_W ({\bf b}) . \)

To solve the normal equation, we need one more definition: for an arbitrary \( m \times n \) matrix A, its pseudoinverse is defined as the \( n \times m \) matrix, denoted by \( {\bf A}^{\dagger} , \) that satisfies all four criteria:
\[ {\bf A}\,{\bf A}^{\dagger} {\bf A} = {\bf A} , \qquad {\bf A}^{\dagger} {\bf A}\,{\bf A}^{\dagger} = {\bf A}^{\dagger} , \qquad \left( {\bf A}\,{\bf A}^{\dagger} \right)^{\ast} = {\bf A}\,{\bf A}^{\dagger} , \qquad \left( {\bf A}^{\dagger} {\bf A} \right)^{\ast} = {\bf A}^{\dagger} {\bf A} . \]

A pseudoinverse exists for an arbitrary matrix A, but when the latter has full rank, \( {\bf A}^{\dagger} \) can be expressed by a simple algebraic formula. In particular, when A has linearly independent columns (and thus the matrix \( {\bf A}^{\ast} {\bf A} \) is invertible), \( {\bf A}^{\dagger} \) can be computed as:

\[ {\bf A}^{\dagger} = \left( {\bf A}^{\ast} {\bf A} \right)^{-1} {\bf A}^{\ast} . \]
This pseudoinverse is actually a left inverse of A because, in this case, \( {\bf A}^{\dagger} {\bf A} = {\bf I}_n . \) When A has linearly independent rows (so the matrix \( {\bf A}\,{\bf A}^{\ast} \) is invertible), \( {\bf A}^{\dagger} \) can be computed as:
\[ {\bf A}^{\dagger} = {\bf A}^{\ast} \left( {\bf A}\,{\bf A}^{\ast} \right)^{-1} . \]
This is a right inverse, as \( {\bf A} {\bf A}^{\dagger} = {\bf I}_m . \)

When the singular value decomposition of an m-by-n matrix A is known to be \( {\bf A} = {\bf U}\,{\bf \Sigma}\, {\bf V}^{\ast} \) with a unitary (or orthogonal) \( m \times m \) matrix U, an \( m \times n \) 'diagonal' matrix \( {\bf \Sigma} , \) and another unitary \( n \times n \) matrix V, we can construct its pseudoinverse explicitly: \( {\bf A}^{\dagger} = {\bf V} {\bf \Sigma}^{\dagger} {\bf U}^{\ast} , \) where

\[ {\bf \Sigma}^{\dagger} = \mbox{diag} \left( \frac{1}{\sigma_1} , \cdots , \frac{1}{\sigma_r} , 0, \cdots , 0\right) \in \mathbb{R}^{n\times m} . \]
Here \( \sigma_1 , \ldots , \sigma_r \) are the nonzero singular values of the matrix A of rank r. The reduced form of the pseudoinverse is also available when the unitary matrices U and V are truncated to keep only their first r columns:
\[ {\bf A}^{\dagger} = {\bf V}_{n \times r} {\bf \Sigma}^{\dagger}_r {\bf U}^{\ast}_{r \times m} , \quad {\bf \Sigma}^{\dagger}_r = {\bf \Sigma}^{-1}_r = \mbox{diag} \left( \frac{1}{\sigma_1} , \cdots , \frac{1}{\sigma_r}\right) \in \mathbb{R}^{r\times r} . \]
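Both the pseudoinverse and the SVD are built into Mathematica, so the formula above is easy to verify numerically; a small sketch with an arbitrarily chosen full-column-rank matrix:
A = N[{{1, -1}, {1, -2}, {-1, 2}}];
pinv = PseudoInverse[A];
{u, s, v} = SingularValueDecomposition[A];               (* A == u.s.ConjugateTranspose[v] *)
Chop[pinv - v.PseudoInverse[s].ConjugateTranspose[u]]    (* should be the zero matrix *)
Chop[pinv - Inverse[Transpose[A].A].Transpose[A]]        (* full column rank: (A^T A)^(-1) A^T *)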

Theorem: Let A be an \( m \times n \) matrix and \( {\bf A}^{\dagger} \) be its pseudoinverse n-by-m matrix.

The symmetric matrix \( {\bf A}^{\ast} {\bf A} \) is n by n. It is invertible if the a's (column vectors) are independent. For a full column rank m-by-n real matrix A, the solution of least squares problem becomes \( \hat{\bf x} = \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} \, {\bf b} . \) The projection m-by-m matrix on the subspace of columns of A (range of m-by-n matrix A) is
\[ {\bf P} = {\bf A} \,\left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} = {\bf A} \,{\bf A}^{\dagger} . \]
Then
\[ {\bf A} \hat{\bf x} = {\bf A} \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} \, {\bf b} . \]
So to solve the least square problem \( {\bf A} {\bf x} = {\bf b} , \) which is inconsistent, one can find the nearest solution \( {\bf A} \hat{\bf x} = {\bf P}\,{\bf b} . \)

Theorem: The general solution of the linear least squares problem \( {\bf A} {\bf x} = {\bf b} \) is

\[ \hat{\bf x} = {\bf A}^{\dagger} {\bf b} + \left( {\bf I} - {\bf A}^{\dagger} {\bf A} \right) {\bf w} , \qquad {\bf w} \ \mbox{ arbitrary}. \]

Theorem: If m-by-n matrix A has full rank (so rank of A is n) then the solution of the linear least squares problem \( {\bf A} {\bf x} = {\bf b} \) is unique:

\[ \hat{\bf x} = {\bf A}^{\dagger} {\bf b} = {\bf V} {\bf \Sigma}^{\dagger} {\bf U}^{\ast} {\bf b} . \]
In the above formula, one can also use the reduced form of the singular value decomposition for \( {\bf A}^{\dagger} . \)

The general least squares solution presented in the theorem above is also valid for a system of equations Ax = b with \( m \le n , \) i.e., an underdetermined linear system with fewer equations than unknowns. In this case, \( {\bf x} = {\bf V}_r {\bf \Sigma}_r^{-1} {\bf U}_r^{\mathrm T} {\bf b} \) solves the problem

\[ \min \| {\bf x} \| \qquad\mbox{subject to } {\bf A} {\bf x} = {\bf b} . \]

Algorithm for general solution to the linear least squares problem

  1. Compute the SVD: [ U S V ] = svd(A)
  2. Make a rank decision, i.e., choose r such that \( \sigma_r >0 \) and \( \sigma_{r+1} = \cdots = \sigma_n =0 . \) This decision is necessary because rounding errors will prevent the zero singular values from being exactly zero.
  3. Set V1 = V(:,1:r), V2 = V(: ,r+1:n), Sr = S(1:r,1:r), U1 = U(:,1:r)
  4. Solution with minimal norm is xm = V1*(Sr\U1'*b)
  5. The general solution is x = xm + V2*c with an arbitrary \( c \in \mathbb{R}^{n-r} . \)
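A direct transcription of this algorithm into Mathematica might look as follows; generalLSQ is a hypothetical helper name, and the tolerance tol used for the rank decision is an assumption:
generalLSQ[A_?MatrixQ, b_?VectorQ, tol_: 10.^-10] :=
 Module[{u, s, v, sv, r, xm},
  {u, s, v} = SingularValueDecomposition[N[A]];
  sv = Diagonal[s];
  r = Count[sv, _?(# > tol &)];                                          (* rank decision *)
  xm = v[[All, 1 ;; r]].((Transpose[u[[All, 1 ;; r]]].b)/sv[[1 ;; r]]);  (* minimal-norm solution *)
  {xm, v[[All, r + 1 ;;]]}       (* second entry: columns spanning the homogeneous solutions *)
 ]
For instance, generalLSQ[{{3, 2, 1}, {-1, 1, 4}, {1, 4, 9}}, {2, -2, 1}] treats the rank-deficient system discussed in an example below.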

Another approach to solving the least squares problem uses the QR decomposition. So suppose that the given matrix A can be decomposed into a product \( {\bf A} = {\bf Q} {\bf R} \) of a matrix Q with orthonormal columns and an upper triangular matrix R. Then we can rewrite \( {\bf A} {\bf x} - {\bf b} \) as a difference of two orthogonal vectors:

\[ {\bf A} {\bf x} - {\bf b} = {\bf Q} \left( {\bf R} {\bf x} - {\bf Q}^{\mathrm T} {\bf b} \right) - \left( {\bf I} - {\bf Q} {\bf Q}^{\mathrm T} \right) {\bf b} . \]
By the Pythagorean theorem,
\[ \left\| {\bf A} {\bf x} - {\bf b} \right\|_2^2 = \left\| {\bf Q} \left( {\bf R} {\bf x} - {\bf Q}^{\mathrm T} {\bf b} \right) \right\|_2^2 + \left\| \left( {\bf I} - {\bf Q} {\bf Q}^{\mathrm T} \right) {\bf b} \right\|_2^2 . \]
This sum of squares is minimized when the first term is zero, and we get the solution of least squares problem:
\[ \hat{\bf x} = {\bf R}^{-1} {\bf Q}^{\mathrm T} {\bf b} . \]
The cost of this decomposition and the subsequent least squares solution is about \( 2n^2 m - \frac{2}{3}\,n^3 \) floating-point operations, roughly twice the cost of the normal equations when \( m \gg n \) and about the same when m = n.
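In Mathematica, QRDecomposition[A] returns {q, r} with A == ConjugateTranspose[q].r (the rows of q are orthonormal), so the formula above becomes LinearSolve[r, q.b]; a quick sketch reusing the three-point example from above:
A = N[{{1, 1}, {1, 2}, {1, 3}}];    (* matrix of the three-point example above *)
b = {1., 2., 2.};
{q, r} = QRDecomposition[A];
xhat = LinearSolve[r, q.b]          (* {0.666667, 0.5}, the same as LeastSquares[A, b] *)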

Example. Let U be the plane in \( \mathbb{R}^3 \) given by

\[ U = \left\{ \left[ x\ y \ z \right]^{\mathrm T} \ : \ 2x-y-2z=0 \right\} . \]
We are going to find the nearest point (position vector) in U to the point whose position vector is \( {\bf b} = \left[ 1, \ 1, \ 1 \right]^{\mathrm T} . \)

Since U is a plane with the given normal vector n = [ 2, -1, -2 ], we first find two linearly independent solutions of the equation 2x-y-2z=0:

\[ {\bf a}_1 = \left[ 1, \ 2, \ 0 \right]^{\mathrm T} \quad \mbox{and} \quad {\bf a}_2 = \left[ 1, \ 0, \ 1 \right]^{\mathrm T} . \]
Now we build a 3-by-2 matrix of these column vectors:
\[ {\bf A} = \begin{bmatrix} 1&1 \\ 2&0 \\ 0&1 \end{bmatrix} \qquad \Longrightarrow \qquad {\bf A}^{\mathrm T} = \begin{bmatrix} 1&2&0 \\ 1&0&1 \end{bmatrix} . \]
Then we build an orthogonal projection operator that maps 3-space onto the range of matrix A:
\[ {\bf P} = {\bf A} \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} = \frac{1}{9} \, \begin{bmatrix} 5&2&4 \\ 2&8&-2 \\ 4&-2&5 \end{bmatrix} . \]
This projection matrix does not depend on our choice of the spanning vectors a1 and a2; indeed, if we choose, for instance, the two other vectors [ 2, 2, 1 ] and [ 0, -2, 1 ], the projection matrix will be the same.
A = {{1, 1}, {2, 0}, {0, 1}};
A.Inverse[Transpose[A].A].Transpose[A]
Application of the projection matrix P to vector b yields
\[ \hat{\bf b} = {\bf P} {\bf b} = \frac{1}{9} \, \begin{bmatrix} 5&2&4 \\ 2&8&-2 \\ 4&-2&5 \end{bmatrix} \, \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \frac{1}{9} \, \begin{bmatrix} 11 \\ 8 \\ 7 \end{bmatrix} \quad \Longrightarrow \quad \| \hat{\bf b} \| = \frac{\sqrt{26}}{3} \approx 1.69967 < \sqrt{3} = \| {\bf b} \| . \]
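A quick numerical confirmation of this example in Mathematica, continuing with the matrix A entered above:
P = A.Inverse[Transpose[A].A].Transpose[A];   (* the projection matrix computed above *)
bhat = P.{1, 1, 1}                            (* {11/9, 8/9, 7/9} *)
{Norm[bhat], Norm[{1, 1, 1}]}                 (* {Sqrt[26]/3, Sqrt[3]} *)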

Example. Let us find the least squares solution to the linear system

\[ \begin{split} x_1 - x_2 &= 4 , \\ x_1 -2 x_2 &= 1, \\ -x_1 + 2x_2 &= 3 . \end{split} \]
It will be convenient to express the system in the matrix form \( {\bf A} {\bf x} = {\bf b} , \) where
\[ {\bf A} = \begin{bmatrix} 1&-1 \\ 1&-2 \\-1&2 \end{bmatrix} , \qquad {\bf b} = \begin{bmatrix} 4\\ 1 \\ 3 \end{bmatrix} . \]
It follows that
\begin{align*} {\bf A}^{\mathrm T} {\bf A} &= \begin{bmatrix} 1&1&-1 \\ -1&-2&2 \end{bmatrix} \begin{bmatrix} 1&-1 \\ 1&-2 \\-1&2 \end{bmatrix} = \begin{bmatrix} 3 & -5 \\ -5 & 9 \end{bmatrix} , \\ {\bf A} {\bf A}^{\mathrm T} &= \begin{bmatrix} 2&3&-3 \\ 3&5&-5 \\ -3&-5&5 \end{bmatrix} , \\ {\bf A}^{\mathrm T} {\bf b} &= \begin{bmatrix} 1&1&-1 \\ -1&-2&2 \end{bmatrix} \begin{bmatrix} 4\\ 1 \\ 3 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \end{bmatrix} , \end{align*}
so the normal system \( {\bf A}^{\mathrm T} {\bf A} \, {\bf x} = {\bf A}^{\mathrm T} {\bf b} \) becomes
\[ \begin{bmatrix} 3 & -5 \\ -5 & 9 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \end{bmatrix} . \]
Solving this system yields a unique least squares solution, namely,
\[ x_1 = 9, \quad x_2 = 5. \]
The least squares error vector is
\[ {\bf b} - {\bf A} \hat{\bf x} = \begin{bmatrix} 4 \\ 1 \\ 3 \end{bmatrix} - \begin{bmatrix} 1&-1 \\ 1&-2 \\-1&2 \end{bmatrix} \begin{bmatrix} 9 \\ 5 \end{bmatrix} = \begin{bmatrix} 0 \\ 2 \\ 2 \end{bmatrix} \]
and the least squares error is \( \left\| {\bf b} - {\bf A} {\bf x} \right\| = 2\sqrt{2} \approx 2.82843 . \)

The pseudoinverse is the following 2 by 3 matrix:

\[ {\bf A}^{\dagger} = \left( {\bf A}^{\mathrm T} {\bf A} \right)^{-1} {\bf A}^{\mathrm T} = \frac{1}{2} \begin{bmatrix} 4&-1&1 \\ 2&-1&1 \end{bmatrix} , \]
and the projectors on the range of A and on the range of \( {\bf A}^{\mathrm T} \) (which is the orthogonal complement of the kernel of A) are
\[ {\bf P} = {\bf A} {\bf A}^{\dagger} = \frac{1}{2} \begin{bmatrix} 2&0&0 \\ 0&1&-1 \\ 0&-1&1 \end{bmatrix} \quad\mbox{and} \quad {\bf Q} = {\bf A}^{\dagger} {\bf A} = \begin{bmatrix} 1&0 \\ 0& 1 \end{bmatrix} , \]
respectively.
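These results are easy to confirm with the built-in commands; a short check:
A = {{1, -1}, {1, -2}, {-1, 2}};  b = {4, 1, 3};
LeastSquares[A, b]       (* {9, 5} *)
PseudoInverse[A]         (* (1/2) {{4, -1, 1}, {2, -1, 1}} *)
b - A.{9, 5}             (* the least squares error vector {0, 2, 2} *)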

Example. Find the least squares solutions, the least squares error vector, and the least squares error of the linear system

\[ \begin{split} 3x_1 + 2x_2 + x_3 &= 2 , \\ -x_1 + x_2 + 4x_3 &= -2, \\ x_1 + 4x_2 + 9 x_3 &= 1 . \end{split} \]
The matrix form of the system is \( {\bf A} {\bf x} = {\bf b} , \) where
\[ {\bf A} = \begin{bmatrix} 3&2&1 \\ -1&1&4 \\ 1&4&9 \end{bmatrix} \qquad\mbox{and} \qquad {\bf b} = \begin{bmatrix} 2 \\ -2 \\ 1 \end{bmatrix} . \]
Since the matrix A is singular and the rank of the augmented matrix \( \left[ {\bf A} \, | \, {\bf b} \right] \) is 3, the given system of algebraic equations is inconsistent. It follows that
\[ {\bf A}^{\mathrm T} {\bf A} = \begin{bmatrix} 11&9&8 \\ 9&21&42 \\ 8&42&98 \end{bmatrix} \qquad\mbox{and} \qquad {\bf A}^{\mathrm T} {\bf b} = \begin{bmatrix} 9 \\ 6 \\ 3 \end{bmatrix} , \]
so the augmented matrix for the normal system \( {\bf A}^{\mathrm T} {\bf A} {\bf x} = {\bf A}^{\mathrm T} {\bf b} \) is
\[ \left[ \begin{array}{ccc|c} 11&9&8&9 \\ 9&21&42&6 \\ 8&42&98&3 \end{array} \right] . \]
The reduced row echelon form of this matrix is
\[ \left[ \begin{array}{ccc|c} 1&0&- \frac{7}{5}&\frac{9}{10} \\ 0&1&\frac{13}{5}&-\frac{1}{10} \\ 0&0&0&0 \end{array} \right] \]
from which it follows that there are infinitely many least squares solutions, and they are given by the parametric equations
\[ \begin{split} x_1 &= \frac{7}{5}\, t + \frac{9}{10} \\ x_2 &= - \frac{13}{5}\, t - \frac{1}{10} , \\ x_3 &= t . \end{split} \]
As a check, let us verify that all least squares solutions produce the same least squares error vector and the same least squares error. To see this, we first compute
\[ {\bf b} - {\bf A} \hat{\bf x} = \begin{bmatrix} 2 \\ -2 \\ 1 \end{bmatrix} - \begin{bmatrix} 3&2&1 \\ -1&1&4 \\ 1&4&9 \end{bmatrix} \begin{bmatrix} \frac{7}{5}\, t + \frac{9}{10} \\ - \frac{13}{5}\, t - \frac{1}{10} \\ t \end{bmatrix} = \frac{1}{2} \begin{bmatrix} -1 \\ -2 \\ 1 \end{bmatrix} . \]
Since \( {\bf b} - {\bf A} \hat{\bf x} \) does not depend on t, all least squares solutions produce the same error vector and hence the same least squares error \( \| {\bf b} - {\bf A} \hat{\bf x} \| = \sqrt{\frac{3}{2}} \approx 1.22474 . \) ■
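For this rank-deficient system, the pseudoinverse selects the least squares solution of minimal norm; a short check in Mathematica:
A = {{3, 2, 1}, {-1, 1, 4}, {1, 4, 9}};  b = {2, -2, 1};
xmin = PseudoInverse[A].b      (* the minimum-norm least squares solution *)
b - A.xmin                     (* the common error vector {-1/2, -1, 1/2} *)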

Example. Consider an environmental problem in which a sewage treatment plant discharges contaminated water into a river. Environmental engineers study the degradation of water due to a pollution source containing key chemical species whose concentrations exceed the maximum allowed by a regulating agency, such as the U.S. Environmental Protection Agency (EPA). In this example, we present a simple pollution model describing contamination in the river and aquifers (geological units in which the pores of the soil are filled with water).

To develop a mathematical model, we first locate the origin at the plant and denote by x the distance from the plant. The variables of interest include the contaminant concentration, C (\( \texttt{kg/m}^3 \) ), the water flow rate, Q (\( \texttt{m}^3 /\texttt{day} \) ), the flow velocity, U (\( \texttt{m/day} \) ), and the river cross-sectional area, A (\( \texttt{m}^2 \) ). The boundary condition for this model is the measured (known) concentration \( C_0 \ (\texttt{kg/m}^3 ) \) at \( x=0 .\)

We make the assumption that the system is in a steady state, so the flow is uniform in one dimension, and the velocity vectors are parallel to one another and perpendicular to the cross-sectional area. The contaminant discharge rate is assumed to be negligible compared to the river flow rate. Other assumptions: groundwater and tributary flows are small and therefore disregarded; the boundary condition is kept constant over time; longitudinal and transverse dispersion is neglected.

First, we invoke the law of mass conservation and apply it to a control volume in the river, \( A \times \Delta x , \) where Δx is a small length (in meters): in steady state, the change of the chemical mass flux across the control volume must balance the loss rate due to chemical transformation. The chemical mass flux may be expressed as the product of the fluid flow rate and the concentration, QC (\( \texttt{kg/day} \) ), and its change across the control volume is its derivative with respect to x times Δx, which leads to the conservation equation:

\[ Q \,\frac{{\text d}C}{{\text d} x}\,\Delta x = -I , \]
where I is the mass loss rate due to transformation (\( \texttt{kg/day} \) ). Since Q = AU and the element volume is V = A Δx (\( \texttt{m}^3 \) ), the above equation becomes
\[ U \,\frac{{\text d}C}{{\text d} x} = -\frac{I}{V} . \]
A common representation of I/V (the loss rate per unit volume) assumes that the substance decays according to a first-order reaction. Therefore, the substance loss rate is proportional to its concentration, I/V = KC, where K is the decay rate of the substance (\( \texttt{day}^{-1} \) ), assumed constant. This yields
\[ U \,\frac{{\text d}C}{{\text d} x} = - K\,C, \qquad C(0) = C_0 . \]
The above initial value problem can be easily solved to obtain its solution:
\[ C(x) = C_0 e^{-Kx/U} . \]
Taking natural logarithms on both sides, we obtain the linear equation
\[ \ln C(x) = - \frac{K}{U}\,x + \ln C_0 . \]
This equation may be used to forecast the contaminant concentration at a distance x downstream of the source.

Environmental engineers measured organic pollution, namely the concentration of fecal coliform bacteria, in a stream. A stream with an average flow velocity of 5 m/s was sampled for total coliform count per liter (N/liter) at different distances downstream of a sewage treatment plant. At distances of 10, 20, 30, and 40 km, the coliform concentrations were 58000, 25000, 9000, and 6500 N/liter, respectively. We use the least squares approach to estimate the total coliform decay rate K.

First, we convert units from meters per second into km per day:

\[ U = 5 \,\texttt{m/s} = 5 \times 86400 \,\texttt{m/day} = 432000 \,\texttt{m/day} = 432 \,\texttt{km/day} . \]
Next, we input data by taking logarithm:
Cdata = {Log[58000], Log[25000], Log[9000], Log[6500]}
and then enter all other numerical values:
X = {10,20,30,40}
n = Length[X]
U = 432
m = {X, Cdata}
FindFit[Transpose[m], alpha - beta*x, {alpha, beta}, x]
{alpha -> 11.6417, beta -> 0.0758757}
Finally, we estimate the total coliform decay rate to be
K=0.07587573748466468*432
32.7783
and plot data along with the least squares line:
lp = ListPlot[Transpose[m], PlotStyle -> PointSize[Large]]
line = Plot[11.641735113511862 - 0.07587573748466468*x, {x, 0, 40}, PlotStyle -> Brown]
Show[{line, lp}]

 

Fitting Curves


Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, usually subject to constraints. These minimization problems lead to non-linear least squares problems. While several algorithms are known for handling such problems (for instance, the Levenberg–Marquardt algorithm (LMA), also known as the damped least-squares (DLS) method, developed by the American statisticians Kenneth Levenberg (1919--1973) and Donald W. Marquardt (1929--1997)), we consider two typical, relatively simple problems.

We start with finding a straight line on the plane that is defined by

\[ c + n_1 x + n_2 y =0 , \qquad n_1^2 + n_2^2 =1. \]
The unit vector \( (n_1 , n_2 ) \) is orthogonal to the line. For a given point (x,y), the absolute value of \( r= c + n_1 x + n_2 y \) gives the distance from the line to the point. Therefore, if we want to determine the line for which the sum of squares of the distances to given points is minimal, we have to solve the constrained least squares problem
\[ \left( \begin{array}{ccc} 1& x_1 & y_1 \\ 1 & x_2 & y_2 \\ \vdots & \vdots & \vdots \\ 1 & x_m & y_m \end{array} \right) \begin{bmatrix} c \\ n_1 \\ n_2 \end{bmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \qquad \mbox{subject to } \quad n_1^2 + n_2^2 =1 . \]
Using QR decomposition A = QR we can reduce the linear system to R x = 0 and the problem becomes
\[ \left( \begin{array}{ccc} r_{11} & r_{12} & r_{13} \\ 0& r_{22} & r_{23} \\ 0&0&r_{33} \end{array} \right) \begin{bmatrix} c \\ n_1 \\ n_2 \end{bmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \qquad \mbox{subject to } \quad n_1^2 + n_2^2 =1 . \]
Since the nonlinear constraint only involves two unknowns, we have
\[ \left( \begin{array}{cc} r_{22} & r_{23} \\ 0&r_{33} \end{array} \right) \begin{bmatrix} n_1 \\ n_2 \end{bmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \qquad \mbox{subject to } \quad n_1^2 + n_2^2 =1 . \]
The minimizing unit vector \( (n_1 , n_2 ) \) is the right singular vector of this 2-by-2 block corresponding to its smallest singular value; the intercept c is then recovered from the first equation, \( r_{11} c + r_{12} n_1 + r_{13} n_2 = 0 , \) as in the sketch below.
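A minimal Mathematica sketch of this procedure; fitLine is a hypothetical helper that takes a list of points {x, y} and returns the coefficients of \( c + n_1 x + n_2 y = 0 : \)
fitLine[pts_] := Module[{A, q, r, u, s, v, n1, n2, c},
  A = N[Map[Prepend[#, 1] &, pts]];              (* rows {1, x_i, y_i} *)
  {q, r} = QRDecomposition[A];
  {u, s, v} = SingularValueDecomposition[r[[2 ;; 3, 2 ;; 3]]];
  {n1, n2} = v[[All, -1]];                       (* right singular vector of the smallest singular value *)
  c = -(r[[1, 2]] n1 + r[[1, 3]] n2)/r[[1, 1]];  (* first row of R x = 0 *)
  {c, n1, n2}
 ]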

Another problem is fitting ellipses in the plane, defined by the quadratic form

\[ {\bf x}^{\mathrm T} {\bf A} \,{\bf x} + {\bf b}^{\mathrm T} {\bf x} +c =0 \]
with a positive definite symmetric matrix A. For each measured point xi, inserting it into the ellipse equation yields a linear equation for the unknown coefficients \( {\bf u} = \left[ a_{11} , a_{12} = a_{21}, a_{22} , b_1 , b_2 , c \right] . \) This yields the minimization problem subject to a normalization condition:
\[ \| {\bf B}\,{\bf u} \| = \min \quad \mbox{subject to } \ \| {\bf u} \| =1 , \]
which can be solved using the SVD.
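A minimal sketch of the SVD step, assuming the parametrization \( a_{11}x^2 + 2a_{12}xy + a_{22}y^2 + b_1 x + b_2 y + c = 0 \) (so each point contributes the row {x^2, 2xy, y^2, x, y, 1}) and at least six data points; fitConic is a hypothetical helper name, and the positive definiteness of A is not enforced here:
fitConic[pts_] := Module[{B, u, s, v},
  B = N[Map[{#[[1]]^2, 2 #[[1]] #[[2]], #[[2]]^2, #[[1]], #[[2]], 1} &, pts]];
  {u, s, v} = SingularValueDecomposition[B];
  v[[All, -1]]     (* unit-norm coefficients {a11, a12, a22, b1, b2, c} *)
 ]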

Three common nonlinear models are illustrated below: the exponential model \( y = a\,e^{bx} , \) the power model \( y = a\,x^b , \) and the saturation growth model \( y = \frac{a\,x}{b+x} . \)

 

Polynomial Fit



Some situations involve a power function \( f(x) = C\,x^m , \) where m is a known constant. For example, in 1601 the German mathematician, astronomer, and astrologer Johannes Kepler (1571--1630) formulated the third law of planetary motion, \( T = C\,x^{3/2} , \) where x is the distance to the sun measured in millions of kilometers, T is the orbital period measured in days, and C is a positive constant. In such cases, there is only one parameter C to be determined.

Theorem: Suppose that \( \left\{ (x_i , y_i ) \right\}_{i=1}^n \) are n points, where abscissas are distinct. The coefficient C of the least-squares power curve \( y = C\,x^m , \) is given by

\[ C = \left( \sum_{i=1}^n x_i^m y_i \right) \Big/ \left( \sum_{i=1}^n x_i^{2m} \right) . \qquad ■ \]
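For data stored in lists x and y, this coefficient is a one-line computation in Mathematica; powerCoefficient below is a hypothetical helper name, and the sample call in the comment uses made-up numbers:
powerCoefficient[x_List, y_List, m_] := Total[x^m*y]/Total[x^(2 m)]
(* e.g. powerCoefficient[{1., 2., 3.}, {1.1, 2.9, 5.3}, 3/2] *)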

In many cases linear regression is not suitable for the data. Then one may try to fit data with a polynomial of degree m:

\[ f(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_m x^m . \]
However, if the data do not exhibit a polynomial nature, the resulting curve may exhibit large oscillations. This phenomenon, called polynomial wiggle, becomes more pronounced with higher-degree polynomials. For this reason we seldom use a polynomial of degree 6 or above unless it is known that the true function we are working with is a polynomial. We show how the method works using the example of a parabola.

Theorem: Suppose that \( \left\{ (x_i , y_i ) \right\}_{i=1}^n \) are n points, where abscissas are distinct. The coefficients of the least-squares parabola

\[ f(x) = a\, x^2 + b\,x + c \]
are the solution values of the linear system
\begin{align*} \left( \sum_{i=1}^n x_i^4 \right) a + \left( \sum_{i=1}^n x_i^3 \right) b + \left( \sum_{i=1}^n x_i^2 \right) c &= \sum_{i=1}^n y_i x_i^2 , \\ \left( \sum_{i=1}^n x_i^3 \right) a + \left( \sum_{i=1}^n x_i^2 \right) b + \left( \sum_{i=1}^n x_i \right) c &= \sum_{i=1}^n y_i x_i \\ \left( \sum_{i=1}^n x_i^2 \right) a + \left( \sum_{i=1}^n x_i \right) b + n\,c &= \sum_{i=1}^n y_i . \qquad ■ \end{align*}
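In practice one rarely forms these sums by hand: Mathematica's Fit with the quadratic basis solves the same normal equations. A small illustration, reusing the data points of the earlier example (t is the fitting variable):
data = {{-1, 5}, {0, 4}, {1, 2}, {2, 1}, {3, 0}, {4, -2}, {5, -3}, {6, -4}};
Fit[data, {1, t, t^2}, t]     (* the least squares parabola *)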

In general, we have an m-th order polynomial

\[ f(x) = a_m\, x^m + a_{m-1}\,x^{m-1} + \cdots + a_1 \, x + a_0 , \qquad m<n, \]
which we use to regress the data in the least squares sense. The residual at each data point is given by
\[ e_i = a_m\, x^m_i + a_{m-1}\,x^{m-1}_i + \cdots + a_1 \, x_i + a_0 - y_i , \qquad i = 1,2, \ldots , n . \]
The sum of the squares of the residuals is given by
\[ Sr = \sum_{i=1}^n e_i^2 . \]
To find the constants of the polynomial regression model, we set the derivatives with respect to each ai equal to zero, that is,
\[ \frac{\partial Sr}{\partial a_i} = 0, \qquad i=0,1,2,\ldots , m . \]
Setting those equations in matrix form gives
\[ \begin{bmatrix} n & \sum x_i & \cdots & \sum x_i^m \\ \sum x_i & \sum x_i^2 & \cdots & \sum x_i^{m+1} \\ \vdots & \vdots & \ddots & \vdots \\ \sum x_i^m & \sum x_i^{m+1} & \cdots & \sum x_i^{2m} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_m \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \\ \vdots \\ \sum x_i^m y_i \end{bmatrix} . \]
The above simultaneous linear equations are solved for the (m+1) constants, \( a_0 , a_1 , \ldots , a_m . \)
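Instead of assembling these sums by hand (as the worksheet below does), one can build the design matrix and call LeastSquares directly; a compact sketch, where polyFit is a hypothetical helper name:
polyFit[x_List, y_List, m_Integer] := Module[{V},
  V = Table[If[j == 0, 1, xi^j], {xi, x}, {j, 0, m}];  (* rows {1, x_i, ..., x_i^m} *)
  LeastSquares[N[V], N[y]]                             (* coefficients {a0, a1, ..., am} *)
 ]
With the worksheet data below and m = 1, polyFit[x, y, 1] should reproduce the coefficients {5.98291, 0.100545} obtained there.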

Example. We use Mathematica to generate a polynomial regression model for the following data set.

n = 11; (* number of equations *)
x = {8, 4, 0, -4, -8, -12, -16, -20, -24, -28, -32};
y = {6.53, 6.22, 6., 5.81, 5.33, 5.07, 4.32, 4.1, 3.53, 3.02, 2.61}
Desired order of polynomial regression model
OrderPoly = 1
Enter the lowest order of polynomial to check for optimum order:
LowOrder = 1
Enter the highest order of polynomial to check for optimum order. NOTE: HighOrder must be less than or equal to n-1.
HighOrder = 6
Create the matrix size according to the order of the polynomial:
M = Array[0, {OrderPoly + 1, OrderPoly + 1}];
Determining each value of the first row of matrix M:
M[[1, 1]] = n;
For[i = 2, i <= OrderPoly + 1, i++, M[[1, i]] = 0;
For[j = 1, j <= n, j++,
M[[1, i]] = N[M[[1, i]] + x[[j]]^(i - 1)]]];
Calculating the remaining values of the coefficient matrix:
For[i = 1, i <= OrderPoly + 1, i++,
For[k = 2, k <= OrderPoly + 1, k++, M[[k, i]] = 0;
For[j = 1, j <= n, j++,
M[[k, i]] = N[M[[k, i]] + x[[j]]^(i + k - 2)]]]]
We now have the coefficient matrix, M:
M // MatrixForm
({ {11, -132.},
{-132., 3344.} })
Create the vector size according to the order of the polynomial model.
b = Array[0, {OrderPoly + 1}];
Fill out the values of the array
b[[1]] = 0;
For[i = 1, i <= n, i++,
b[[1]] = b[[1]] + y[[i]]];
For[i = 2, i <= OrderPoly + 1, i++, b[[i]] = 0;
For[j = 1, j <= n, j++,
b[[i]] = N[b[[i]] + (x[[j]]^(i - 1))*y[[j]]]]]
Now, we have the right hand side vector b:
b // MatrixForm
({ 52.54,
-453.52 })
Now we form the augmented matrix
F = Transpose[Append[Transpose[M], b]];
Selecting the last column gives the coefficients' values. This array contains the coefficients of the model as \( {\bf a} = [ a_0 , a_1 , \ldots , a_m ] . \)
a = RowReduce[F];
a = a[[All, OrderPoly + 2]]
{5.98291, 0.100545}
The polynomial regression model is as follows:
Y = a[[1]];
For[j = 1, j <= OrderPoly, j++,
Y = Y + a[[j + 1]]*X^j]; Print["y= " Y]
y= (5.98291 +0.100545 X)
The plot shows the data points as well as the regression model.
Observed = Table[{x[[i]], y[[i]]}, {i, 1, n}];
predicted = Y;
ttl = Print["Polynomial Regression Model of order ", OrderPoly]
pt = ListPlot[Observed, PlotStyle -> PointSize[0.01], DisplayFunction -> Identity];
line = Plot[Y, {X, Min[x], Max[x]}, DisplayFunction -> Identity];
Show[pt, line, AxesLabel -> {"X",
"y=a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m"},
DisplayFunction -> $DisplayFunction]

 

Optimum Order

We are going to determine the optimum order of the polynomial model by plotting the variance defined as

\[ \sigma^2 = \frac{Sr}{n-(m+1)} = \frac{1}{n-(m+1)} \,\sum_{i=1}^n e_i^2 \]
as a function of m, where n is the number of data points, Sr is the sum of the squares of the residuals, and m is the order of the polynomial. The optimum order is considered to be the one where the value of the variance σ² is minimum or beyond which the variance no longer decreases significantly.

Quite often, the variance may show signs of decreasing and then increasing as a function of the order of the polynomial regression model. Such increases in the variance are normal, as the variance is calculated as the ratio between the sum of the squares of the residuals and the difference between the number of data points and the number of constants of the polynomial model. Both the numerator and the denominator decrease as the order of the polynomial is increased. However, as the order of the polynomial increases, the coefficient matrix in the calculation of the constants of the model becomes more ill-conditioned. This ill-conditioning of the coefficient matrix results in fewer significant digits that can be trusted to be correct in the coefficients of the polynomial model, and hence it can artificially amplify the value of the variance.

Sr = Array[0, {HighOrder}];
For[m = LowOrder, m <= HighOrder, m++,
(* Calculating the coefficient matrix "M" for the given order m *)
M = Array[0, {m + 1, m + 1}];
M[[1, 1]] = n;
For[i = 2, i <= m + 1, i++, M[[1, i]] = 0;
For[j = 1, j <= n, j++, M[[1, i]] = N[M[[1, i]] + x[[j]]^(i - 1)]]];
For[i = 1, i <= m + 1, i++,
For[k = 2, k <= m + 1, k++, M[[k, i]] = 0;
For[j = 1, j <= n, j++,
M[[k, i]] = N[M[[k, i]] + x[[j]]^(i + k - 2)]]]];
(* Calculating the RHS vector for the given order, m *)
q = Array[0, {m + 1}];
q[[1]] = 0;
For[i = 1, i <= n, i++,
q[[1]] = q[[1]] + y[[i]]];
For[i = 2, i <= m + 1, i++, q[[i]] = 0;
For[j = 1, j <= n, j++,
q[[i]] = N[q[[i]] + (x[[j]]^(i - 1))*y[[j]]]]];
(* Calculating the coefficients of the m^th order polynomial model *)
F = Transpose[Append[Transpose[M], q]];
a = RowReduce[F];
a = a[[All, m + 2]];
(* Determining Sr *)
Sr[[m]] = 0;
For[i = 1, i <= n, i++, summ = 0;
For[j = 1, j <= m, j++, summ = summ + a[[j + 1]]*x[[i]]^j];
Sr[[m]] = Sr[[m]] + (y[[i]] - (a[[1]] + summ))^2];]
(* Calculating the variance for the m^th order Polynomial *)
var = Array[0, {HighOrder}];
For[i = LowOrder, i <= HighOrder, i++, var[[i]] = 0;
var[[i]] = N[Sr[[i]]/(n - (i + 1))]];
Print["var =", var];
Then we plot
ListPlot[var, PlotStyle -> PointSize[0.02], Joined -> True,
PlotLabel -> "Optimum Order of Polynomial", AxesLabel -> {"Order of Polynomial, m", "Sr/[n-(m+1)]"}]

 

Fitting Models


We start by choosing the model type and assigning the proper regression model:

modeltype = "power";
If[modeltype == "exponential", f[x_] := a*E^(b*x)];
If[modeltype == "power", f[x_] := a*x^b];
If[modeltype == "growth", f[x_] := (a*x)/(b + x)];
Calculating the sum of the squares of the residuals (here X and Y denote the lists of abscissas and ordinates of the data):
Sr = 0;
For[i = 1, i <= n, i++, Sr = Sr + ((Y[[i]] - f[X[[i]]])^2)];
Differentiating the sum of the squares of the residuals with respect to the constants of the model, a and b, to set up two simultaneous nonlinear equations in two unknowns.
eqn1 = Expand[D[Sr, a] == 0];
eqn2 = Expand[D[Sr, b] == 0];
Solving the two simultaneous nonlinear equations. We are using FindRoot since we are looking for real solutions.
Ainit = 2.1;
Binit = 1.1;
soln = FindRoot[{eqn1, eqn2}, {{a, Ainit}, {b, Binit}}, MaxIterations -> 1000];
a = a /. soln[[1]];
b = b /. soln[[2]];
Print["The constants of the ", modeltype, " model are a = ", a, " and \ b = ", b]
Plotting the observed values and the predicted curve
predicted = f[xx];
observed = Table[{X[[i]], Y[[i]]}, {i, 1, n}];
ttl = Print[modeltype, " regression model, y vs x"];
points = ListPlot[observed, PlotStyle -> PointSize[.02], DisplayFunction -> Identity];
line = Plot[predicted, {xx, Min[x], Max[x]}, DisplayFunction -> Identity];
Show[points, line, DisplayFunction -> $DisplayFunction]

The following manipulations must first be made to linearize the data: for the exponential model, take the logarithm of y; for the power model, take the logarithms of both x and y; for the saturation growth model, take the reciprocals of both x and y.

Once the data is linearized, substitutions are made so as to apply a direct solution approach using least squares regression method for a straight line. In this section, a procedure is written for each model. Once a specific nonlinear model is called for with the modeltype variable, the proper procedure will calculate the coefficients for that model using data linearization.

 

Exponential model

Upon data linearization, the following substitutions are made in the equation \( \ln y = \ln a +b\,x , \)

\[ a_0 = \ln a, \qquad a_1 =b . \]
The data \( z= \ln y \) versus x now takes the form of a linear model:
\[ z = \ln y = a_0 + a_1 \,x . \]
In the exponential model procedure, least squares linear regression method is used to solve for the a0 and a1 coefficients which are then used to determine the original constants of the exponential model, \( a = e^{a_0} \) and b = a1, where \( y = a\, e^{b\,x} . \)
ExponentialModel[x_, y_, n_] :=
Module[{z, sumz, sumx, sumxz, sumxx, a1, a0},
z = Array[0, {1, n}]; sumz = 0; sumx = 0; sumxz = 0; sumxx = 0;
For[i = 1, i <= n, i++, z[[1, i]] = N[Log[y[[i]]]];
sumz = sumz + z[[1, i]];
sumx = sumx + x[[i]];
sumxz = sumxz + x[[i]]*z[[1, i]];
sumxx = sumxx + (x[[i]])^2;];
a1 = N[(n*sumxz - sumx*sumz)/((n*sumxx) - sumx^2)];
a0 = N[sumz/n - a1*sumx/n];
a = E^a0; b = a1; a; b];

 

Power model

Upon data linearization, the following substitutions are made in the equation \( \ln y = \ln a + b\,\ln x , \)

\[ a_0 = \ln a, \qquad a_1 =b . \]
The data \( z= \ln y \) versus \( w = \ln x \) now takes the form of a linear model:
\[ z = \ln y = a_0 + a_1 \,w . \]
In the power model procedure, least squares linear regression method is used to solve for the a0 and a1 coefficients which are then used to determine the original constants of the power model, a and b, where \( y = a\, x^b . \)
PowerModel[x_, y_, n_] := Module[{z, w, sumz, sumw, sumwz, sumww},
z = Array[0, {1, n}]; w = Array[0, {1, n}]; sumz = 0; sumw = 0;
sumwz = 0; sumww = 0;
For[i = 1, i <= n, i++, z[[1, i]] = N[Log[y[[i]]]];
w[[1, i]] = N[Log[x[[i]]]];
sumw = sumw + w[[1, i]];
sumz = sumz + z[[1, i]];
sumwz = sumwz + w[[1, i]]*z[[1, i]];
sumww = sumww + (w[[1, i]])^2;];
a1 = N[(n*sumwz - sumw*sumz)/(n*sumww - sumw^2)];
a0 = N[sumz/n - a1*sumw/n];
a = E^a0; b = a1] ;

 

Saturation Growth model

Upon data linearization, the following substitutions are made in the equation \( \frac{1}{y} = \frac{b}{a} \,\frac{1}{x} + \frac{1}{a} , \)

\[ a_0 = 1/a, \qquad a_1 =b/a . \]
The data \( z=1/y \) versus \( q = 1/x \) now takes the form of a linear model:
\[ z = 1/ y = a_0 + a_1 \,q . \]
In the saturation growth model procedure, least squares linear regression method is used to solve for the a0 and a1 coefficients which are then used to determine the original constants of the growth model, a and b, where \( y = \frac{ax}{b+x} . \)
GrowthModel[x_, y_, n_] :=
Module[{z, q, sumq, sumz, sumqz, sumqq, a1, a0},
z = Array[0, {1, n}]; q = Array[0, {1, n}]; sumq = 0; sumz = 0;
sumqz = 0; sumqq = 0;
For[i = 1, i <= n, i++, z[[1, i]] = N[1/y[[i]]];
q[[1, i]] = N[1/x[[i]]];
sumq = sumq + q[[1, i]];
sumz = sumz + z[[1, i]];
sumqz = sumqz + q[[1, i]]*z[[1, i]];
sumqq = sumqq + (q[[1, i]])^2;];
a1 = N[(n*sumqz - sumq*sumz)/(n*sumqq - sumq^2)];
a0 = N[sumz/n - a1*sumq/n];
a = N[1/a0]; b = N[a1/a0];];

 

Finding the constants of the model

The proper regression model is assigned below, returning the constants of the model that was chosen with the modeltype variable.

Datatable = Table[{x[[i]], y[[i]]}, {i, 1, n}];
title = Print[modeltype, " Regression Model, y vs x"];
points = ListPlot[Datatable, PlotStyle -> PointSize[0.02], DisplayFunction -> Identity];
line = Plot[f[X], {X, Min[x], Max[x]}, DisplayFunction -> Identity];
Show[points, line, DisplayFunction -> $DisplayFunction];

 

Model Comparison

Before evaluating the worksheet, the user must enter initial guesses of the constants of the model a and b for the nonlinear regression without data linearization procedure. For reasonable initial guesses, use the solution from the nonlinear model with data linearization worksheet. For convergence use initial guesses for a and b close to the values of a and b obtained by using data linearization.

n = 5; (* number of points *)
x = {1, 2, 3, 4, 5}
y = {-1, 2, 9, 6, 0}
Enter the model type to be used:
modeltype = "Exponential"
Insert your initial guesses for a and b here. Reasonable initial guesses for a and b can be obtained from data linearization models
Ainit = 0.6; Binit = 0.39
In the procedure below, the data is first linearized so that least squares linear regression method for a linear model can be used. Once the coefficients of the linear model are determined, the constants of the nonlinear regression model a and b can be calculated. Linearizing the data is a useful technique to estimate the parameters of a nonlinear model because it does not require iterative methods to solve for the model constants. Note that data linearization is only done for mathematical convenience.
datalinear[x_, y_, n_] := Module[{},
z = Array[0, {1, n}];
q = Array[0, {1, n}];
sumq = 0; sumz = 0; sumqz = 0; sumqq = 0;
(* Data Linearization Step *)
For[i = 1, i <= n, i++,
If[modeltype == "Exponential", z[[1, i]] = N[Log[y[[i]]]];
q[[1, i]] = x[[i]]];
If[modeltype == "Power", z[[1, i]] = N[Log[y[[i]]]];
q[[1, i]] = N[Log[x[[i]]]]];
If[modeltype == "Growth", z[[1, i]] = N[1/y[[i]]];
q[[1, i]] = 1/x[[i]]];
(* Calculating the constants of the Linear model, z= a0+a1q *)
sumq = sumq + q[[1, i]];
sumz = sumz + z[[1, i]];
sumqz = sumqz + q[[1, i]]*z[[1, i]];
sumqq = sumqq + (q[[1, i]])^2];
a1 = N[(n*sumqz - sumq*sumz)/(n*sumqq - sumq^2)];
a0 = N[sumz/n - a1*sumq/n];
(* Calculating the constants of the original Nonlinear Regression \ model, a and b *)
If[modeltype == "Exponential",
a = E^a0; b = a1];
If[modeltype == "Power",
a = E^a0; b = a1];
If[modeltype == "Growth",
a = 1/a0; b = a1/a0];
a; b]
Calling the data linearization procedure for the input data set and returning the constants of the nonlinear regression model a and b.
datalinear[x, y, n];
If[modeltype == "Exponential", g[x_] := N[a*E^(b*X)]];
If[modeltype == "Power", g[x_] := a*X^b];
If[modeltype == "Growth", g[x_] := (a*X)/(b + X)];
Print["The constants of the ", modeltype, " model found with data \ linearization are a = ", a, " and b = ", b]
In the following procedure, the constants of the nonlinear regression model a and b are found without linearizing the data. This requires the FindRoot command which utilizes numerical techniques to converge to a real solution. The initial guess inputs, Ainit and Binit, are used as the starting values. First, we assign the proper regression model.
If[modeltype == "Exponential", f[x_] = aa*E^(bb*x)];
If[modeltype == "Power", f[x_] = aa*x^bb];
If[modeltype == "Growth", f[x_] = (aa*x)/(bb + x)];
Calculate the sum of the squares of the residuals:
Sr = 0;
For[i = 1, i <= n, i++, Sr = Sr + ((y[[i]] - f[x[[i]]])^2)];
Differentiating the sum of the squares of the residuals with respect to the constants of the model, a and b, to set up two simultaneous nonlinear equations in two unknowns.
eqn1 = Expand[D[Sr, aa] == 0];
eqn2 = Expand[D[Sr, bb] == 0];
Solving the two simultaneous nonlinear equations, we obtain the numerical values for unknown constants. We are using FindRoot since we are looking for real solutions.
soln = FindRoot[{eqn1, eqn2}, {{aa, Ainit}, {bb, Binit}}, MaxIterations -> 1000];
aa = aa /. soln[[1]];
bb = bb /. soln[[2]];
Print["The constants of the ", modeltype, " model without data \ linearization are a = ", aa, " and b = ", bb]
The following table allows you to compare the coefficient values calculated between the two methods.
TableForm[{{a, b}, {aa, bb}}, TableDepth -> 2,
TableHeadings -> {{"with data linearization", "without data linearization"}, {"a", "b"}}]
Plotting the observed values and both predicted curves.
Observed = Table[{x[[i]], y[[i]]}, {i, 1, n}];
Predicted1 = g[xx];
Predicted2 = f[xx];
titl = Print[modeltype,
" Regression model with data linearization and without linearization"];
points = ListPlot[Observed, PlotStyle -> PointSize[.02], DisplayFunction -> Identity];
line = Plot[{Predicted1, Predicted2}, {xx, Min[x], Max[x]},
PlotStyle -> {{RGBColor[1, 0, 0]}, {RGBColor[0, 1, 0]}}, DisplayFunction -> Identity];
Show[points, line, AxesLabel -> {"x", "y"}, DisplayFunction -> $DisplayFunction]
Print["model with data linearization = red"]
Print["model without data linearization = green"]

 

Total Least Squares


The linear least squares problem A x = b has so far been solved by projecting the vector b onto the range of A. With ``Total Least Squares'' the system of equations is made consistent by changing both A and b. We are looking for a matrix \( {\cal A} \) and a vector \( {\cal b} \) in its range that differ as little as possible from the given data with respect to the Frobenius norm:
\[ \left\| \left[ {\bf A} , {\bf b} \right] - \left[ {\cal A} , {\cal b} \right] \right\|_F = \min \qquad \mbox{subject to } \ {\cal b} \in \mbox{Range}({\cal A} ) . \]
In general, the augmented matrix \( {\bf C} = \left[ {\bf A} , {\bf b} \right] \) has rank n + 1, while the matrix A has rank n. The constraint says that the augmented matrix \( {\cal C} = \left[ {\cal A} , {\cal b} \right] \) must have rank n. Upon introducing the error matrix \( \Delta = {\cal C} - {\bf C} , \) the problem is reduced to
\[ \| \Delta \|_F^2 = \min \qquad\mbox{subject to } \ \left( {\bf C} + \Delta \right) {\bf z} =0 , \]
where \( {\bf z} = \left( \begin{array}{c} {\bf x} \\ -1 \end{array} \right) \ne 0. \) Applying the SVD for C, we get
\[ \Delta = - \sigma_{n+1} {\bf u}_{n+1} {\bf v}_{n+1}^{\mathrm T} . \]
If the last component of \( {\bf v}_{n+1} \) is not zero, the total least squares solution exists and is given by
\[ {\cal A} \hat{\bf x} = {\cal b} , \qquad \hat{\bf x} = - \frac{1}{v_{n+1,n+1}} \,v(1:n,n+1) . \]
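A short Mathematica sketch of this recipe; totalLeastSquares is a hypothetical helper name, and it assumes the last component of \( {\bf v}_{n+1} \) is nonzero:
totalLeastSquares[A_?MatrixQ, b_?VectorQ] := Module[{c, u, s, v, vlast},
  c = N[Join[A, Transpose[{b}], 2]];          (* augmented matrix C = [A, b] *)
  {u, s, v} = SingularValueDecomposition[c];
  vlast = v[[All, -1]];                       (* right singular vector for the smallest singular value *)
  -Most[vlast]/Last[vlast]                    (* the total least squares solution *)
 ]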

 

