Motivation

In many practical cases, an affine model $g(x; \color{var(--emphColor)}{w}, \color{var(--emphColor)}{\beta}) = \color{var(--emphColor)}{w_1} x_1 + \cdots + \color{var(--emphColor)}{w_n} x_n + \color{var(--emphColor)}{\beta}$ simply won't do.
For instance, $g$ might represent the output of a (non-linear!) neural network in some machine learning problems.

Non-linear least squares

Consider the optimization problem $$\min_x ||g(x) - b||^2.$$
The variable $x$ belongs to $\mathbb{R}^n$, the function $g: \mathbb{R}^\color{var(--emphColor)}{n} \to \mathbb{R}^m$ is vector-valued, and the constant vector $b \in \mathbb{R}^m$ is given.

Non-linear least squares

Once again, we can write down a system of equations describing the critical points.
We'll compute the entries of the gradient of our objective function $$f(x) = ||g(x) - b||^2$$ one at a time.

Gradient

If $$f(x) = ||g(x) - b||^2,$$ then
$$\frac{\partial f}{\partial x_\color{var(--emphColor)}{j}} = \frac{\partial}{\partial x_\color{var(--emphColor)}{j}}||g(x)-b||^2$$

Gradient

If $$f(x) = ||g(x) - b||^2,$$ then
$$\frac{\partial f}{\partial x_\color{var(--emphColor)}{j}} = \sum^m_{i=1}\frac{\partial}{\partial x_\color{var(--emphColor)}{j}}(g_i(x) - b_i)^2$$

Gradient

If $$f(x) = ||g(x) - b||^2,$$ then
$$\frac{\partial f}{\partial x_\color{var(--emphColor)}{j}} = 2 \sum^m_{i=1}\frac{\partial{g_i}}{\partial{x_\color{var(--emphColor)}{j}}}(g_i(x) - b_i)$$
We've used the chain rule to compute $\frac{\partial}{\partial x_\color{var(--emphColor)}{j}}(g_i(x) - b_i)^2$.

Gradient

Combining results, we find that $$\nabla f(x)^T = \begin{bmatrix} \frac{\partial f}{\partial x_\color{var(--emphColor)}{1}} \\ \vdots \\ \frac{\partial f}{\partial x_\color{var(--emphColor)}{n}} \end{bmatrix}$$

Gradient

Combining results, we find that $$\nabla f(x)^T = 2 \begin{bmatrix} \sum^m_{i=1}\frac{\partial{g_i}}{\partial{x_\color{var(--emphColor)}{1}}}(g_i(x) -b_i)\\ \vdots\\ \sum^m_{i=1}\frac{\partial{g_i}}{\partial{x_\color{var(--emphColor)}{n}}}(g_i(x) -b_i) \end{bmatrix}$$

Gradient

Combining results, we find that $$\nabla f(x)^T = 2J_g(x)^T(g(x) -b),$$ with $J_g$ denoting the Jacobian of the vector-valued function $g$.
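The identity $\nabla f(x)^T = 2J_g(x)^T(g(x) - b)$ is easy to verify numerically. Here is a minimal Python sketch with an illustrative choice of $g: \mathbb{R}^2 \to \mathbb{R}^3$ and $b$ (not from the lecture), comparing the formula against a finite-difference gradient:

```python
import numpy as np

# Illustrative example: g maps R^2 -> R^3, with its Jacobian written out by hand.
def g(x):
    return np.array([x[0]**2, x[0] * x[1], np.sin(x[1])])

def J_g(x):
    # Rows are gradients of the components g_1, g_2, g_3.
    return np.array([[2 * x[0], 0.0],
                     [x[1],     x[0]],
                     [0.0,      np.cos(x[1])]])

b = np.array([1.0, 2.0, 0.5])

def f(x):
    # Objective f(x) = ||g(x) - b||^2
    return np.sum((g(x) - b)**2)

def grad_f(x):
    # The formula derived above: grad f(x) = 2 J_g(x)^T (g(x) - b)
    return 2.0 * J_g(x).T @ (g(x) - b)
```

A central finite difference of `f` agrees with `grad_f` to several digits, which is a quick way to catch Jacobian bugs in practice.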

Residual

Let $r(x) = g(x) - b$ denote the residual.
In terms of the residual, the optimality condition $\nabla f(x) = 0$ becomes $$J_g(x)^T r(x) = 0.$$

Residual

$$J_g(x)^T r(x) = 0.$$
Since $g$ is non-linear in general, this system is non-linear.
We can use Newton's method to solve this system!
However, applying Newton's method here requires differentiating the product $J_g(x)^T r(x)$, which brings in second derivatives of $g$, so we might be better off using a different solution method!

Gauss-Newton

We'll develop an iterative method based on a Taylor expansion!
At the $\color{var(--emphColor)}{k}$th step, we approximate $$g(x^{(\color{var(--emphColor)}{k})} + h) \approx g(x^{(\color{var(--emphColor)}{k})}) + J_g(x^{(\color{var(--emphColor)}{k})}) h.$$
Then we solve the linear least squares problem $$\min_h ||J_g(x^{(\color{var(--emphColor)}{k})}) h - (b - g(x^{(\color{var(--emphColor)}{k})})) ||^2$$

Gauss-Newton

We may solve the linear least squares problem in the variable $h$ by solving the normal equations $$J_g(x^{(\color{var(--emphColor)}{k})})^T J_g(x^{(\color{var(--emphColor)}{k})}) h = -J_g(x^{(\color{var(--emphColor)}{k})})^T r(x^{(\color{var(--emphColor)}{k})}).$$
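A single Gauss-Newton step can be sketched in a few lines of Python. This sketch assumes the Jacobian `J` and residual `r` at the current iterate are already available; it uses `np.linalg.lstsq` on the least squares problem directly, which is numerically safer than explicitly forming $J^TJ$ (the conditioning of $J^TJ$ is the square of that of $J$), but the computed step satisfies the same normal equations:

```python
import numpy as np

def gauss_newton_step(J, r):
    # Solve min_h ||J h + r||^2, i.e. the normal equations (J^T J) h = -J^T r,
    # without forming J^T J explicitly.
    h, *_ = np.linalg.lstsq(J, -r, rcond=None)
    return h
```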

Pseudocode

							\begin{algorithm}
							\caption{Gauss-Newton method}
							\begin{algorithmic}
							\Input{initial guess $x^{(0)}$, tolerance $\epsilon$}
							\FOR{$k \gets 0$ to max\_iter}
							\STATE{$J \gets J_g(x^{(k)})$}
							\STATE{solve $(J^T J) h = -J^T r(x^{(k)})$ for $h$}
							\STATE{$x^{(k+1)} \gets x^{(k)} + h$}
							\IF{$||h|| < \epsilon$}
							\STATE{break}
							\ENDIF
						 \ENDFOR
							\end{algorithmic}
							\end{algorithm}
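The pseudocode above translates directly into Python. The following is a minimal sketch, assuming the caller supplies the model `g`, its Jacobian `J_g`, and the data `b` (the test problem below, an exponential fit, is illustrative and not from the lecture):

```python
import numpy as np

def gauss_newton(g, J_g, b, x0, eps=1e-8, max_iter=50):
    # Gauss-Newton iteration following the pseudocode:
    # repeatedly solve (J^T J) h = -J^T r and update x <- x + h.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        J = J_g(x)
        r = g(x) - b                            # residual r(x) = g(x) - b
        h = np.linalg.solve(J.T @ J, -J.T @ r)  # normal equations
        x = x + h
        if np.linalg.norm(h) < eps:             # stop when the step is small
            break
    return x
```

For a well-conditioned problem with a reasonable initial guess, this converges rapidly; a more careful implementation would solve the step via a QR factorization of $J$ rather than the normal equations.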
						

Pseudocode

							\begin{algorithm}
							\caption{Gauss-Newton method}
							\begin{algorithmic}
							\Input{$\color{var(--emphColor)}{\text{initial guess }x^{(0)}}$, tolerance $\epsilon$}
							\FOR{$k \gets 0$ to max\_iter}
							\STATE{$J \gets J_g(x^{(k)})$}
							\STATE{solve $(J^T J) h = -J^T r(x^{(k)})$ for $h$}
							\STATE{$x^{(k+1)} \gets x^{(k)} + h$}
							\IF{$||h|| < \epsilon$}
							\STATE{break}
							\ENDIF
						 \ENDFOR
							\end{algorithmic}
							\end{algorithm}
						
As with Newton's method, the convergence of the Gauss-Newton method depends on the choice of initial guess.

Pseudocode

							\begin{algorithm}
							\caption{Gauss-Newton method}
							\begin{algorithmic}
							\Input{initial guess $x^{(0)}$, tolerance $\epsilon$}
							\FOR{$k \gets 0$ to max\_iter}
							\STATE{$J \gets J_g(x^{(k)})$}
							\STATE{solve $\color{var(--emphColor)}{(J^T J) h = -J^T r(x^{(k)})}$ for $h$}
							\STATE{$x^{(k+1)} \gets x^{(k)} + h$}
							\IF{$||h|| < \epsilon$}
							\STATE{break}
							\ENDIF
						 \ENDFOR
							\end{algorithmic}
							\end{algorithm}
						
At the heart of our method is a linear system involving the Jacobian of $g$.

Pseudocode

							\begin{algorithm}
							\caption{Gauss-Newton method}
							\begin{algorithmic}
							\Input{initial guess $x^{(0)}$, tolerance $\epsilon$}
							\FOR{$k \gets 0$ to max\_iter}
							\STATE{$J \gets J_g(x^{(k)})$}
							\STATE{solve $(J^T J) h = -J^T r(x^{(k)})$ for $h$}
							\STATE{$x^{(k+1)} \gets x^{(k)} + h$}
							\IF{$\color{var(--emphColor)}{||h|| < \epsilon}$}
							\STATE{break}
							\ENDIF
						 \ENDFOR
							\end{algorithmic}
							\end{algorithm}
						
We stop iterating when the update $h$ is sufficiently small, i.e., when consecutive iterates are close to each other.
The Gauss-Newton method constructs a sequence of approximations to a solution of the non-linear optimality conditions $$J_g(x)^T r(x) = 0,$$ using only first derivatives of $g$.
To solve the optimality conditions directly with Newton's method, we would need second derivatives of $g$ (the Hessians of its components).
(Bacterial population modeling) Consider a bacterial population $N(t)$ which obeys the logistic growth equation $${ dN \over dt} = rN (1 - {N \over \kappa}), \quad N(0) = N_0,$$ where $r$, $\kappa$, and $N_0$ are constants: the growth rate, the carrying capacity, and the initial population, respectively.
(Bacterial population modeling) This means that the population $N(t)$ grows like $$N(t) = {\color{var(--emphColor)}{N_0 \kappa} e^{\color{var(--emphColor)}{r}t} \over \color{var(--emphColor)}{\kappa} + \color{var(--emphColor)}{N_0} (e^{\color{var(--emphColor)}{r}t} - 1)}.$$
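This closed-form solution is easy to code up and sanity-check. A minimal Python sketch (the function name is my own):

```python
import numpy as np

def logistic(t, N0, kappa, r):
    # Closed-form logistic solution: N(t) = N0 * kappa * e^{rt} / (kappa + N0 (e^{rt} - 1))
    E = np.exp(r * t)
    return N0 * kappa * E / (kappa + N0 * (E - 1.0))
```

Two quick checks confirm the formula: $N(0) = N_0$, and $N(t) \to \kappa$ as $t \to \infty$, so the population saturates at the carrying capacity.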
(Bacterial population modeling)
Suppose the lab next door gives us measurements of the bacterial population $b_i$ at times $t_i$, for $i = 1, 2, \ldots, m$.
Our task is to find the parameters $$\mathbf{x} = \begin{bmatrix} \color{var(--emphColor)}{N_0} \\ \color{var(--emphColor)}{\kappa} \\ \color{var(--emphColor)}{r} \end{bmatrix}$$ that best fit the data.
(Bacterial population modeling) In other words, we seek to solve the minimization problem $$\min_{\color{var(--emphColor)}{\mathbf{x}}} \sum_{i=1}^m \bigg({\color{var(--emphColor)}{N_0 \kappa} e^{\color{var(--emphColor)}{r}t_i} \over \color{var(--emphColor)}{\kappa} + \color{var(--emphColor)}{N_0} (e^{\color{var(--emphColor)}{r}t_i} - 1)} - b_i \bigg)^2$$
(Bacterial population modeling) In other words, we seek to solve the minimization problem $$\min_{\color{var(--emphColor)}{\mathbf{x}}} \bigg| \bigg| \begin{bmatrix} {\color{var(--emphColor)}{N_0 \kappa} e^{\color{var(--emphColor)}{r}t_1} \over \color{var(--emphColor)}{\kappa} + \color{var(--emphColor)}{N_0} (e^{\color{var(--emphColor)}{r}t_1} - 1)} \\ \vdots \\ {\color{var(--emphColor)}{N_0 \kappa} e^{\color{var(--emphColor)}{r}t_m} \over \color{var(--emphColor)}{\kappa} + \color{var(--emphColor)}{N_0} (e^{\color{var(--emphColor)}{r}t_m} - 1)} \end{bmatrix} - \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \bigg|\bigg|^2$$
(Bacterial population modeling) In other words, we seek to solve the minimization problem $$\min_{\color{var(--emphColor)}{\mathbf{x}}} \bigg| \bigg| \begin{bmatrix} {\color{var(--emphColor)}{x_1 x_2} e^{\color{var(--emphColor)}{x_3}t_1} \over \color{var(--emphColor)}{x_2} + \color{var(--emphColor)}{x_1} (e^{\color{var(--emphColor)}{x_3}t_1} - 1)} \\ \vdots \\ {\color{var(--emphColor)}{x_1 x_2} e^{\color{var(--emphColor)}{x_3}t_m} \over \color{var(--emphColor)}{x_2} + \color{var(--emphColor)}{x_1} (e^{\color{var(--emphColor)}{x_3}t_m} - 1)} \end{bmatrix} - \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \bigg|\bigg|^2$$
Recall $\mathbf{x} = \begin{bmatrix} N_0 & \kappa & r \end{bmatrix}^T$.
(Bacterial population modeling) In other words, we seek to solve the minimization problem $$\min_{\color{var(--emphColor)}{\mathbf{x}}} \bigg| \bigg| \begin{bmatrix} g_1(\color{var(--emphColor)}{\mathbf{x}}) \\ \vdots \\ g_m(\color{var(--emphColor)}{\mathbf{x}}) \end{bmatrix} - \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \bigg|\bigg|^2$$
(Bacterial population modeling) In other words, we seek to solve the minimization problem $$\min_{\color{var(--emphColor)}{\mathbf{x}}} ||g(\color{var(--emphColor)}{\mathbf{x}}) - b ||^2$$
(Bacterial population modeling) In order to implement the Gauss-Newton scheme, we must compute the Jacobian of $g$, with $(i, j)$-entry given by ${\partial g_i \over \partial x_j}$.
For instance, the second column of the Jacobian is $$(J_g(\mathbf{x}))_{i,\color{var(--emphColor)}{2}} = \begin{bmatrix} {\partial g_1 \over \partial \color{var(--emphColor)}{x_2}} \\ \vdots \\ {\partial g_m \over \partial \color{var(--emphColor)}{x_2}} \end{bmatrix} $$
(Bacterial population modeling) In order to implement the Gauss-Newton scheme, we must compute the Jacobian of $g$, with $(i, j)$-entry given by ${\partial g_i \over \partial x_j}$.
For instance, the second column of the Jacobian is $$(J_g(\mathbf{x}))_{i,\color{var(--emphColor)}{2}} = \begin{bmatrix} {\partial \over \partial \color{var(--emphColor)}{x_2}}\bigg[{x_1\color{var(--emphColor)}{ x_2} e^{x_3 t_1} \over \color{var(--emphColor)}{x_2} + x_1 (e^{x_3 t_1} - 1)}\bigg] \\ \vdots \\ {\partial \over \partial \color{var(--emphColor)}{x_2}}\bigg[{x_1\color{var(--emphColor)}{ x_2} e^{x_3 t_m} \over \color{var(--emphColor)}{x_2} + x_1 (e^{x_3 t_m} - 1)}\bigg] \\ \end{bmatrix}. $$
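Working out this derivative by hand (with $E = e^{x_3 t}$ and $D = x_2 + x_1(E - 1)$, the quotient rule gives ${\partial g_i \over \partial x_2} = x_1^2 E (E-1) / D^2$) and checking it against a finite difference is a good habit before plugging the Jacobian into Gauss-Newton. A Python sketch, with function names of my own choosing:

```python
import numpy as np

def g(x, t):
    # Logistic model g_i(x) = x1 x2 e^{x3 t_i} / (x2 + x1 (e^{x3 t_i} - 1)),
    # evaluated at all times t at once.
    x1, x2, x3 = x
    E = np.exp(x3 * t)
    return x1 * x2 * E / (x2 + x1 * (E - 1.0))

def dg_dx2(x, t):
    # Analytic second column of the Jacobian: with E = e^{x3 t}, D = x2 + x1 (E - 1),
    # the quotient rule gives d g / d x2 = x1^2 E (E - 1) / D^2.
    x1, x2, x3 = x
    E = np.exp(x3 * t)
    D = x2 + x1 * (E - 1.0)
    return x1**2 * E * (E - 1.0) / D**2
```

The remaining columns $\partial g_i / \partial x_1$ and $\partial g_i / \partial x_3$ follow the same pattern.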
Congratulations! You reached the end of this lecture.