Introduction to Optimization

GUIDING QUESTION: How can I locate the extreme values of a function?

Motivation


If $f:\mathbb{R} \to \mathbb{R}$ is continuously differentiable and $f$ has a local extremum at $x=x^*$, then $$ f'(x^*) = 0.$$
We can generalize this insight to the multivariate setting!
(Fermat) If $f:\mathbb{R}^\color{var(--emphColor)}{n} \to \mathbb{R}$ is continuously differentiable and $f$ has a local extremum at $x=x^*$, then $$\color{var(--emphColor)}{\nabla} f(x^*) = 0.$$
Fermat's theorem guarantees that local extrema of a smooth function occur at critical points, where $\nabla f(x) = 0$.

As a root finding problem

Fermat's theorem implies that the candidate extrema of a smooth function $f$ are roots of the (possibly non-linear) system $${\partial{f} \over \partial{x_\color{var(--emphColor)}{i}}}(x) = 0, \quad \color{var(--emphColor)}{i} = 1, \ldots, n.$$
Optimization becomes a root finding problem, and we can apply the techniques we developed earlier!

Newton's method (again)

Our first approach to locating the extrema of $f$ is to apply Newton's method to the function \begin{align} g(x) = \nabla{f}(x)^T = \begin{bmatrix} \frac{\partial{f}}{\partial{x_1}}(x)\\ \vdots\\ \frac{\partial{f}}{\partial{x_n}}(x) \end{bmatrix}. \end{align}
Recall that the gradient is a row vector!

Newton's method (again)

Newton's method prescribes the update $$x^{(k+1)} \gets x^{(k)} + \color{var(--emphColor)}{h},$$ where $\color{var(--emphColor)}{h}$ satisfies $$J_g(x^{(k)})\color{var(--emphColor)}{h} = - g(x^{(k)}).$$

Newton's method (again)

In this case, the Jacobian matrix of $g$ contains the second partials of $f$:
\begin{align} J_g(x^{(k)}) &= \begin{bmatrix} \frac{\partial{g_1}}{\partial x_1} & \frac{\partial{g_1}}{\partial x_2} & \ldots & \frac{\partial{g_1}}{\partial x_n}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial{g_n}}{\partial x_1} & \frac{\partial{g_n}}{\partial x_2} & \ldots & \frac{\partial{g_n}}{\partial x_n} \end{bmatrix} \end{align}

Newton's method (again)

In this case, the Jacobian matrix of $g$ contains the second partials of $f$:
\begin{align} J_g(x^{(k)}) &= \begin{bmatrix} \frac{\partial^\color{var(--emphColor)}{2}{f}}{\partial{x_1}\partial{x_1}} & \frac{\partial^\color{var(--emphColor)}{2}{f}}{\partial{x_2}\partial{x_1}} &\ldots & \frac{\partial^\color{var(--emphColor)}{2}{f}}{\partial{x_n}\partial{x_1}}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^\color{var(--emphColor)}{2}{f}}{\partial{x_1}\partial{x_n}} & \frac{\partial^\color{var(--emphColor)}{2}{f}}{\partial{x_2}\partial{x_n}} & \ldots & \frac{\partial^\color{var(--emphColor)}{2}{f}}{\partial{x_n}\partial{x_n}} \end{bmatrix} \end{align}

Newton's method (again)

In this case, the Jacobian matrix of $g$ contains the second partials of $f$:
\begin{align} J_g(x^{(k)}) &= \color{var(--emphColor)}{H_f}(x^{(k)}) \end{align}
The matrix $\color{var(--emphColor)}{H_f}$ of second partials of $f$ is called the Hessian of $f$.
If $f$ is sufficiently smooth, $$\frac{\partial^2{f}}{\partial{x_\color{var(--emphColor)}{i}}\partial{x_\color{var(--emphColor)}{j}}} = \frac{\partial^2 {f}}{\partial{x_\color{var(--emphColor)}{j}} \partial{x_\color{var(--emphColor)}{i}}},$$ so $H_f(x)$ is symmetric at every $x$.
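We can observe this symmetry numerically. The sketch below approximates the Hessian with central finite differences; the test function and the step size $h$ are illustrative choices, not prescribed by the lecture.

```python
import numpy as np

def hessian_fd(f, x, h=1e-5):
    """Approximate the Hessian of f at x with central finite differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = h
            e_j = np.zeros(n); e_j[j] = h
            # Central-difference estimate of the mixed partial d^2 f / dx_i dx_j.
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * h**2)
    return H

# Illustrative smooth function: f(x) = x1^2 * x2 + sin(x1 * x2).
f = lambda x: x[0]**2 * x[1] + np.sin(x[0] * x[1])
H = hessian_fd(f, np.array([0.5, 1.2]))
# The mixed partials agree, so H is symmetric up to discretization error.
print(np.max(np.abs(H - H.T)))
```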

Pseudocode

							\begin{algorithm}
							\caption{Newton's method (Optimization)}
							\begin{algorithmic}
							\Input{initial guess $x^{(0)}$, tolerance $\epsilon$}
							\FOR{$k \gets 1$ to max\_iter}
							\STATE{solve $H_f(x^{(k)})h = -\nabla f(x^{(k)})^T$ for $h$}
							\STATE{$x^{(k+1)} \gets x^{(k)} + h$}
							\IF{$||h|| < \epsilon$}
							\STATE{break}
							\ENDIF
						 \ENDFOR
							\end{algorithmic}
							\end{algorithm}
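The pseudocode above translates almost line for line into code. The sketch below assumes the gradient and Hessian are supplied as callables; the quadratic test function, with its minimum at $(1, -3)$, is an illustrative choice.

```python
import numpy as np

def newton_opt(grad, hess, x0, eps=1e-10, max_iter=50):
    """Newton's method for optimization: solve H_f(x) h = -grad f(x)^T
    for the step h, update x <- x + h, and stop when ||h|| < eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        h = np.linalg.solve(hess(x), -grad(x))
        x = x + h
        if np.linalg.norm(h) < eps:
            break
    return x

# Illustrative example: f(x) = (x1 - 1)^2 + 2*(x2 + 3)^2, minimum at (1, -3).
grad = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])
x_star = newton_opt(grad, hess, [10.0, 10.0])
print(x_star)   # converges to [1, -3]
```

For a quadratic objective the Hessian is constant, so Newton's method lands on the critical point in a single step.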
						

Pseudocode

							\begin{algorithm}
							\caption{Newton's method (Optimization)}
							\begin{algorithmic}
							\Input{$\color{var(--emphColor)}{\text{initial guess }x^{(0)}}$, tolerance $\epsilon$}
							\FOR{$k \gets 1$ to max\_iter}
							\STATE{solve $H_f(x^{(k)})h = -\nabla f(x^{(k)})^T$ for $h$}
							\STATE{$x^{(k+1)} \gets x^{(k)} + h$}
							\IF{$||h|| < \epsilon$}
							\STATE{break}
							\ENDIF
						 \ENDFOR
							\end{algorithmic}
							\end{algorithm}
						
Recall once again that the convergence of Newton's method is highly dependent on the starting point, so it is not necessarily “easy” to locate the extrema of a smooth function.

Pseudocode

							\begin{algorithm}
							\caption{Newton's method (Optimization)}
							\begin{algorithmic}
							\Input{initial guess $x^{(0)}$, tolerance $\epsilon$}
							\FOR{$k \gets 1$ to max\_iter}
							\STATE{solve $\color{var(--emphColor)}{H_f(x^{(k)})h = -\nabla f(x^{(k)})^T}$ for $h$}
							\STATE{$x^{(k+1)} \gets x^{(k)} + h$}
							\IF{$||h|| < \epsilon$}
							\STATE{break}
							\ENDIF
						 \ENDFOR
							\end{algorithmic}
							\end{algorithm}
						
At the heart of our method is a linear system involving the gradient, which is a row vector, and the Hessian of $f$.

Pseudocode

							\begin{algorithm}
							\caption{Newton's method (Optimization)}
							\begin{algorithmic}
							\Input{initial guess $x^{(0)}$, tolerance $\epsilon$}
							\FOR{$k \gets 1$ to max\_iter}
							\STATE{solve $H_f(x^{(k)})h = -\nabla f(x^{(k)})^T$ for $h$}
							\STATE{$x^{(k+1)} \gets x^{(k)} + h$}
							\IF{$\color{var(--emphColor)}{||h|| < \epsilon}$}
							\STATE{break}
							\ENDIF
						 \ENDFOR
							\end{algorithmic}
							\end{algorithm}
						
We stop iterating when consecutive updates are sufficiently close to each other.
This procedure helps us locate local extrema, but it says nothing about how they compare to one another.
Even when the method converges, it is hard to say whether the function achieves a global extremum at that point.

Min or max?

There is an analogue to the Second Derivative Test that we can use to classify the extrema of a multivariate function.
The test depends on a generalized notion of “sign” of a matrix.
Suppose $f: \mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable. If $\nabla f(x^*) = 0$ and the Hessian $H_f(x^*)$ is symmetric positive definite (SPD), then $f$ has a local minimum at $x= x^*$.
If $H_f(x^*)$ is SPD then $x^T \color{var(--emphColor)}{H_f(x^*)} x > 0$ for every $x \neq 0$.
Compare this to the one-dimensional setting, where the Second Derivative Test guarantees that if $f$ is twice continuously differentiable, $f'(x^*) = 0$, and $\color{var(--emphColor)}{f'' (x^*)} > 0$, then $f$ has a local minimum at $x = x^*$.
A similar statement holds for local maxima when $H_f(x^*)$ is negative definite, so that $$x^T H_f(x^*) x \color{var(--emphColor)}{<} 0$$ for every $x \neq 0$.
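Since the Hessian is symmetric, its definiteness can be read off from its eigenvalues. The helper below is a sketch of this classification; the saddle-point example $f(x) = x_1^2 - x_2^2$ is an illustrative choice.

```python
import numpy as np

def classify_critical_point(H, tol=1e-12):
    """Classify a critical point from the (symmetric) Hessian there:
    all eigenvalues > 0 -> local minimum (H is SPD),
    all eigenvalues < 0 -> local maximum (H is negative definite),
    mixed signs         -> saddle point."""
    w = np.linalg.eigvalsh(H)          # eigenvalues of a symmetric matrix
    if np.all(w > tol):
        return "local minimum"
    if np.all(w < -tol):
        return "local maximum"
    if np.any(w > tol) and np.any(w < -tol):
        return "saddle point"
    return "inconclusive"              # some eigenvalue is (numerically) zero

# f(x) = x1^2 - x2^2 has a critical point at the origin with H = diag(2, -2).
label = classify_critical_point(np.diag([2.0, -2.0]))
print(label)   # saddle point
```

When some eigenvalue is zero the test is inconclusive, just as $f''(x^*) = 0$ is inconclusive in one dimension.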

A different approach

Let's try something else.
What's the quickest way down a mountain?
At each step, look for the steepest downward direction and walk toward it.

Steepest gradient descent

Using the Cauchy-Schwarz inequality, we may show that the steepest direction at a given point $x$ is parallel to $$- \nabla f(x).$$
So the steepest descent method suggests the update $$x^{(k+1)} \gets x^{(k)} - \alpha \nabla f(x^{(k)}),$$ where $\alpha > 0$ is some constant.
In modern literature and codebases, the abbreviation SGD might refer to the Stochastic Gradient Descent method.
This method is very useful for separable problems, and it is “stochastic” because a random sample of objective components is chosen at each step.
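To make the “stochastic” part concrete, here is a minimal SGD sketch on a separable objective $f(x) = \frac{1}{n}\sum_i (x - a_i)^2$, whose minimizer is the mean of the $a_i$. The data, learning rate, and iteration count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Separable objective: f(x) = (1/n) * sum_i (x - a_i)^2, minimized at mean(a).
a = rng.normal(loc=3.0, scale=1.0, size=1000)

x = 0.0
alpha = 0.05                      # fixed learning rate
for k in range(2000):
    i = rng.integers(len(a))      # sample one objective component at random
    g = 2 * (x - a[i])            # gradient of the sampled component only
    x -= alpha * g

print(x, a.mean())                # x fluctuates near the minimizer mean(a)
```

With a fixed learning rate the iterates hover around the minimizer rather than converging to it exactly; decaying the learning rate is the usual remedy.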

Pseudocode

							\begin{algorithm}
							\caption{Steepest descent with exact line search}
							\begin{algorithmic}
							\Input{initial guess $x^{(0)}$, tolerance $\epsilon$}
							\FOR{$k \gets 1$ to max\_iter}
							\STATE{$\alpha \gets \text{arg} \min_{\alpha > 0} f\big(x^{(k)} - \alpha \nabla f(x^{(k)})\big)$}
							\STATE{$x^{(k+1)} \gets x^{(k)} - \alpha \nabla f(x^{(k)})$}
							\IF{$||x^{(k+1)} - x^{(k)}|| < \epsilon$}
							\STATE{break}
							\ENDIF
						 \ENDFOR
							\end{algorithmic}
							\end{algorithm}
						

Pseudocode

							\begin{algorithm}
							\caption{Steepest descent with exact line search}
							\begin{algorithmic}
							\Input{$\color{var(--emphColor)}{\text{initial guess } x^{(0)}, \text{ tolerance } \epsilon}$}
							\FOR{$k \gets 1$ to max\_iter}
							\STATE{$\alpha \gets \text{arg} \min_{\alpha > 0} f\big(x^{(k)} - \alpha \nabla f(x^{(k)})\big)$}
							\STATE{$x^{(k+1)} \gets x^{(k)} - \alpha \nabla f(x^{(k)})$}
							\IF{$||x^{(k+1)} - x^{(k)}|| < \epsilon$}
							\STATE{break}
							\ENDIF
						 \ENDFOR
							\end{algorithmic}
							\end{algorithm}
						
The convergence properties of the steepest descent algorithm also depend on the choice of starting point.

Pseudocode

							\begin{algorithm}
							\caption{Steepest descent with exact line search}
							\begin{algorithmic}
							\Input{initial guess $x^{(0)}$, tolerance $\epsilon$}
							\FOR{$k \gets 1$ to max\_iter}
							\STATE{$\color{var(--emphColor)}{\alpha} \gets \text{arg} \min_{\color{var(--emphColor)}{\alpha} > 0} f\big(x^{(k)} - \color{var(--emphColor)}{\alpha} \nabla f(x^{(k)})\big)$}
							\STATE{$x^{(k+1)} \gets x^{(k)} - \alpha \nabla f(x^{(k)})$}
							\IF{$||x^{(k+1)} - x^{(k)}|| < \epsilon$}
							\STATE{break}
							\ENDIF
						 \ENDFOR
							\end{algorithmic}
							\end{algorithm}
						
Find $\color{var(--emphColor)}{\alpha}$ that minimizes $f$ along the ray based at $x^{(k)}$ pointing in the steepest descent direction.
This moves us to the minimum of $$\Phi(\alpha) = f(x^{(k)} - \alpha \nabla f(x^{(k)})).$$
This is a one-dimensional optimization problem that can be solved using Newton's method, the secant method, or a zeroth order method like golden-section search, which doesn't use information about the derivative of $$f\big(x^{(k)} - \color{var(--emphColor)}{\alpha} \nabla f(x^{(k)})\big).$$
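A golden-section search for this one-dimensional subproblem can be sketched as follows; it assumes $\Phi$ is unimodal on the given bracket, and the test function $\Phi(\alpha) = (\alpha - 2)^2 + 1$ is an illustrative stand-in for the line-search objective.

```python
import math

def golden_section(phi, lo, hi, tol=1e-8):
    """Golden-section search: minimize a unimodal phi on [lo, hi]
    using only function values (no derivatives)."""
    invphi = (math.sqrt(5) - 1) / 2          # 1/(golden ratio) ~ 0.618
    a, b = lo, hi
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):
            b, d = d, c                      # minimizer lies in [a, d]
            c = b - invphi * (b - a)
        else:
            a, c = c, d                      # minimizer lies in [c, b]
            d = a + invphi * (b - a)
    return (a + b) / 2

# Illustrative line-search objective with minimizer alpha = 2.
alpha_star = golden_section(lambda t: (t - 2)**2 + 1, 0.0, 5.0)
print(alpha_star)   # close to 2
```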

Pseudocode

							\begin{algorithm}
							\caption{Steepest descent with exact line search}
							\begin{algorithmic}
							\Input{initial guess $x^{(0)}$, tolerance $\epsilon$}
							\FOR{$k \gets 1$ to max\_iter}
\STATE{$\alpha \gets \text{arg} \min_{\alpha > 0} f\big(x^{(k)} - \alpha \nabla f(x^{(k)})\big)$}
							\STATE{$\color{var(--emphColor)}{x^{(k+1)} \gets x^{(k)} - \alpha \nabla f(x^{(k)})}$}
							\IF{$||x^{(k+1)} - x^{(k)}|| < \epsilon$}
							\STATE{break}
							\ENDIF
						 \ENDFOR
							\end{algorithmic}
							\end{algorithm}
						
Update the approximation by taking a step, of optimal length, in the steepest descent direction.
There are many strategies for determining $\alpha$, which is known as the learning rate in some contexts.
In some implementations of gradient descent, the learning rate is fixed a priori.
Let's find the minimum of $$f(x) = 3x_1^2 + 4x^2_2$$ using gradient descent with exact line search.
Please refer to the supplementary notes for details.
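As a sketch of that computation: since $f(x) = 3x_1^2 + 4x_2^2 = \frac{1}{2}x^T Q x$ with $Q = \operatorname{diag}(6, 8)$, the exact line-search step has the closed form $\alpha = (g^T g)/(g^T Q g)$ where $g = \nabla f(x^{(k)})^T = Qx^{(k)}$. The starting point below is an illustrative choice.

```python
import numpy as np

# f(x) = 3*x1^2 + 4*x2^2 = (1/2) x^T Q x with Q = diag(6, 8);
# the gradient is Q x, and for a quadratic the exact line-search
# step size is alpha = (g^T g) / (g^T Q g).
Q = np.diag([6.0, 8.0])

x = np.array([1.0, 1.0])
for k in range(100):
    g = Q @ x
    if np.linalg.norm(g) < 1e-12:
        break
    alpha = (g @ g) / (g @ Q @ g)     # exact minimizer along -g
    x = x - alpha * g

print(x)    # converges to the minimizer [0, 0]
```

The iterates zigzag toward the origin, with each step exactly minimizing $f$ along the current descent direction.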

Special cases

Much like the specialized methods we developed for solving linear systems with certain coefficient matrices, we can develop specialized optimization methods that exploit the structure of certain objective functions.
In the next module we'll discuss least squares problems, which are common in modeling.
Congratulations! You reached the end of this lecture.