Least squares optimization

GUIDING QUESTION: How can I compute the optimal parameters for my model?

Motivation

Suppose you are trying to fit a model $g$, which depends on some parameters (or weights) $\color{var(--emphColor)}{w_1}, \ldots, \color{var(--emphColor)}{w_n}$.
In practice, many measurements are available, and we seek the weights that best describe the given data.

Motivation

For instance, $$g(x; \color{var(--emphColor)}{w}) = \sum_j \color{var(--emphColor)}{w_j} \phi_j(x)$$ could be a linear combination of some basis functions $\phi_j$, with some unknown coefficients $\color{var(--emphColor)}{w_j}$.
However, $g$ could also be something else, like a neural network!
We'll use a semi-colon to highlight the dependence of our model on the parameters $\color{var(--emphColor)}{w} = \begin{bmatrix} w_1 & \cdots & w_n \end{bmatrix}^T.$

Motivation

Typically the number of data points $(x_i, y_i)$, $i = 1, 2, \ldots, m$ is much larger than the number of parameters, and we can't hope that the system $$g(x_i; \color{var(--emphColor)}{w}) = y_i, \qquad i = 1, 2, \ldots, m$$ has a solution.

Motivation

We'll seek the next best thing, to minimize the sum $$\sum_{i=1}^m (g(x_i; \color{var(--emphColor)}{w}) - y_i)^2,$$ by varying the parameters $\color{var(--emphColor)}{w_1}, \ldots, \color{var(--emphColor)}{w_n}$.
Recall our discussion of interpolation vs approximation.

Least squares

We aim to minimize the sum of squared errors $$\sum_{i=1}^m (g(x_i; \color{var(--emphColor)}{w}) - y_i)^2$$
$$= \bigg| \bigg|\begin{bmatrix} g(x_1; \color{var(--emphColor)}{w}) \\ \vdots \\ g(x_m; \color{var(--emphColor)}{w}) \end{bmatrix} - \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}\bigg| \bigg|^2$$ as a function of the parameters $w_1, \ldots, w_n$.

Linear regression

We begin by considering the case where $$g(\mathbf{x}) = w_1 x_1 + \cdots + w_n x_n = w^T \mathbf{x}$$ is a linear function.
Each data point $(\mathbf{x}^{(i)}, y_i)$ consists of a vector $\mathbf{x}^{(i)} \in \mathbb{R}^n$ and a scalar $y_i$.
Then $$f(\color{var(--emphColor)}{w}) = \bigg| \bigg|\begin{bmatrix} g(\mathbf{x}^{(1)}; \color{var(--emphColor)}{w}) \\ \vdots \\ g(\mathbf{x}^{(m)}; \color{var(--emphColor)}{w}) \end{bmatrix} - \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}\bigg| \bigg|^2 $$

Linear regression

We begin by considering the case where $$g(\mathbf{x}) = \mathbf{x}^T w$$ is a linear function.
Then $$f(\color{var(--emphColor)}{w}) = \bigg| \bigg|\begin{bmatrix} (\mathbf{x}^{(1)})^T \color{var(--emphColor)}{w} \\ \vdots \\ (\mathbf{x}^{(m)})^T \color{var(--emphColor)}{w} \end{bmatrix} - \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}\bigg| \bigg|^2 $$

Linear regression

We begin by considering the case where $$g(\mathbf{x}) = \mathbf{x}^T w$$ is a linear function.
Then $$f(\color{var(--emphColor)}{w}) = \bigg| \bigg|\begin{bmatrix} (\mathbf{x}^{(1)})^T \\ \vdots \\ (\mathbf{x}^{(m)})^T \end{bmatrix} \color{var(--emphColor)}{w} - \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}\bigg| \bigg|^2 $$

Linear regression

We begin by considering the case where $$g(\mathbf{x}) = \mathbf{x}^T w$$ is a linear function.
Then $$f(\color{var(--emphColor)}{w}) = \big| \big|X^T \color{var(--emphColor)}{w} - y\big| \big|^2, $$
with $X = \begin{bmatrix} \mathbf{x}^{(1)} & \cdots & \mathbf{x}^{(m)} \end{bmatrix}$.

Finding the smallest error

Our objective function $f(w)$ is a smooth function, so we'll seek parameters such that $$\nabla f(w) = 0.$$
In this special case, we can find the critical points analytically.

Finding the smallest error

Note that $$f(\color{var(--emphColor)}{w}) = ||X^T \color{var(--emphColor)}{w} - y||^2$$

Finding the smallest error

Note that $$f(\color{var(--emphColor)}{w}) = (X^T \color{var(--emphColor)}{w} - y)^T (X^T \color{var(--emphColor)}{w} - y)$$

Finding the smallest error

Note that $$f( \color{var(--emphColor)}{w}) = \color{var(--emphColor)}{w}^T X X^T \color{var(--emphColor)}{w} - 2 y^T X^T \color{var(--emphColor)}{w} + y^T y$$
We've studied how to compute the gradient of functions of this form.

Finding the smallest error

Since the gradient of a sum is the sum of the gradients, $$\nabla f(\color{var(--emphColor)}{w}) = \nabla_\color{var(--emphColor)}{w}(\color{var(--emphColor)}{w}^T X X^T \color{var(--emphColor)}{w}) - 2 \nabla_\color{var(--emphColor)}{w}(y^T X^T \color{var(--emphColor)}{w}) + \nabla_\color{var(--emphColor)}{w}(y^T y)$$

Finding the smallest error

Since the gradient of a sum is the sum of the gradients, $$\nabla f(\color{var(--emphColor)}{w}) = 2 \color{var(--emphColor)}{w}^T X X^T - 2 y^T X^T$$
For us, the gradient is a row vector!
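As a quick sanity check of the formula $\nabla f(\color{var(--emphColor)}{w}) = 2 \color{var(--emphColor)}{w}^T X X^T - 2 y^T X^T$, we can compare it against central finite differences on randomly generated data (the data here is purely hypothetical, for illustration):

```python
import numpy as np

# Hypothetical random problem: n = 4 parameters, m = 30 measurements.
rng = np.random.default_rng(2)
n, m = 4, 30
X = rng.standard_normal((n, m))   # columns are the data vectors x^(i)
y = rng.standard_normal(m)
w = rng.standard_normal(n)

# Objective f(w) = ||X^T w - y||^2 and its analytic gradient (a row vector,
# stored here as a 1-D array).
f = lambda w: np.sum((X.T @ w - y) ** 2)
grad_analytic = 2 * w @ X @ X.T - 2 * y @ X.T

# Central differences, one component at a time.
h = 1e-6
grad_fd = np.array([(f(w + h * e) - f(w - h * e)) / (2 * h) for e in np.eye(n)])

print(np.max(np.abs(grad_analytic - grad_fd)))  # small discrepancy
```

The two gradients agree up to the truncation and rounding error of the finite-difference scheme.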

Finding the smallest error

Setting $\nabla f(\color{var(--emphColor)}{w}) = 0$ and taking the transpose of both sides, we obtain the normal equations, $$X X^T \color{var(--emphColor)}{w} = X y.$$
This single linear system describes the optimal $\color{var(--emphColor)}{n}$ parameters $\color{var(--emphColor)}{w_1}, \ldots, \color{var(--emphColor)}{w_n}$.
The matrix $X = \begin{bmatrix} \mathbf{x}^{(1)} & \cdots & \mathbf{x}^{(m)} \end{bmatrix}$ is $\color{var(--emphColor)}{n} \times m$, so the coefficient matrix $$X X^T$$ of the normal equations is $\color{var(--emphColor)}{n} \times \color{var(--emphColor)}{n}$.
In practice, the number $m$ of measurements or samples available is typically much larger than the number $\color{var(--emphColor)}{n}$ of parameters, so this is good news!
Although we may use the LU factorization or an iterative method to solve the normal equations, numerical issues may prevent a precise solution in practice.
One of the final practice problems explores more appropriate solution techniques, which involve factorizing the data matrix $X$.
When the matrix $X$ has full rank, the objective function $f$ has a unique minimizer, given by $$w_{\text{LS}} = (XX^T)^{-1} X y.$$
The matrix $(XX^T)^{-1} X$ is the Moore-Penrose pseudoinverse of $X^T$.
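As a sketch (with synthetic, hypothetical data), we can solve the normal equations $X X^T \color{var(--emphColor)}{w} = X y$ directly and compare against NumPy's factorization-based least squares solver, which is the numerically preferable route mentioned above:

```python
import numpy as np

# Hypothetical example: n = 3 parameters, m = 50 noisy measurements.
rng = np.random.default_rng(0)
n, m = 3, 50
X = rng.standard_normal((n, m))                   # columns are the data vectors x^(i)
w_true = np.array([2.0, -1.0, 0.5])
y = X.T @ w_true + 0.01 * rng.standard_normal(m)  # measurements with small noise

# Solve the n x n normal equations  X X^T w = X y.
w_normal = np.linalg.solve(X @ X.T, X @ y)

# Numerically preferable: least squares via a factorization of X^T.
w_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)

print(w_normal, w_lstsq)
```

On this well-conditioned example the two approaches agree closely; the factorization route matters when $X X^T$ is ill-conditioned, since forming the product squares the condition number.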

Simple linear regression

Suppose we want to fit the simple affine model $g(x) = m x + b$, where $x \in \mathbb{R}$.
We are given the data points $(x_i, y_i)$, $i = 1, \ldots, 100$, and we seek parameters $\color{var(--emphColor)}{m}$ and $\color{var(--emphColor)}{b}$ to minimize the sum of squared errors $$\sum_{i=1}^{100} (\color{var(--emphColor)}{m} x_i + \color{var(--emphColor)}{b} - y_i)^2.$$
Let's use the normal equations!

Simple linear regression

We aim to minimize $$\sum_{i=1}^{100} (\color{var(--emphColor)}{m} x_i + \color{var(--emphColor)}{b} - y_i)^2 = \bigg|\bigg| \begin{bmatrix} \color{var(--emphColor)}{m} x_1 + \color{var(--emphColor)}{b} \\ \vdots \\ \color{var(--emphColor)}{m} x_{100} + \color{var(--emphColor)}{b} \end{bmatrix} - \begin{bmatrix} y_1 \\ \vdots \\ y_{100} \end{bmatrix} \bigg| \bigg|^2$$

Simple linear regression

We aim to minimize $$\sum_{i=1}^{100} (\color{var(--emphColor)}{m} x_i + \color{var(--emphColor)}{b} - y_i)^2 = \bigg|\bigg| \begin{bmatrix} x_1 & 1 \\ \vdots & \vdots \\ x_{100} & 1 \end{bmatrix} \begin{bmatrix} \color{var(--emphColor)}{m} \\ \color{var(--emphColor)}{b} \end{bmatrix} - \begin{bmatrix} y_1 \\ \vdots \\ y_{100} \end{bmatrix} \bigg| \bigg|^2$$

Simple linear regression

We aim to minimize $$\sum_{i=1}^{100} (\color{var(--emphColor)}{m} x_i + \color{var(--emphColor)}{b} - y_i)^2 = || X^T \color{var(--emphColor)}{w} - y||^2,$$ with $X = \begin{bmatrix} x_1 & \cdots & x_{100} \\ 1 & \cdots & 1 \end{bmatrix}$ and $\color{var(--emphColor)}{w} = \begin{bmatrix} \color{var(--emphColor)}{m} \\ \color{var(--emphColor)}{b} \end{bmatrix}$.

Simple linear regression

The optimal slope and $y$-intercept satisfy the $\color{var(--emphColor)}{2} \times \color{var(--emphColor)}{2}$ system $$XX^T \color{var(--emphColor)}{w} = X y$$

Simple linear regression

The optimal slope and $y$-intercept satisfy the $\color{var(--emphColor)}{2} \times \color{var(--emphColor)}{2}$ system $$\begin{bmatrix} x_1 & \cdots & x_{100} \\ 1 & \cdots & 1 \end{bmatrix} \begin{bmatrix} x_1 & 1 \\ \vdots & \vdots \\ x_{100} & 1 \end{bmatrix} \color{var(--emphColor)}{w} = \begin{bmatrix} x_1 & \cdots & x_{100} \\ 1 & \cdots & 1 \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_{100} \end{bmatrix}$$

Simple linear regression

The optimal slope and $y$-intercept satisfy the $\color{var(--emphColor)}{2} \times \color{var(--emphColor)}{2}$ system $$\begin{bmatrix} \sum_{i=1}^{100} x_i^{2} & \sum^{100}_{i=1}x_i\\ \sum_{i=1}^{100} x_i & 100 \end{bmatrix} \color{var(--emphColor)}{w} = \begin{bmatrix} \sum^{100}_{i=1}x_i y_i\\ \sum^{100}_{i=1} y_i \end{bmatrix}$$
Notice that the determinant of the coefficient matrix is proportional to the variance \begin{align} \text{var}(x) &= {1 \over 100} \sum_{i = 1}^{100} x_i^2 - \bigg({1\over 100}\sum_{i=1}^{100} x_i\bigg)^2 \\ &= {1\over 100} \sum_{i=1}^{100} (x_i - \bar{x})^2, \end{align} so in this case the optimal parameters are uniquely determined whenever at least two of the $x_i$'s are distinct.
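A minimal sketch of the $2 \times 2$ system above, on synthetic data drawn near a hypothetical line $y = 3x - 2$, assembles the coefficient matrix from the sums and solves for the slope and intercept:

```python
import numpy as np

# Hypothetical data: 100 points near the line y = 3x - 2, with small noise.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 100)
y = 3.0 * x - 2.0 + 0.05 * rng.standard_normal(100)

# Assemble the 2x2 normal equations from the sums of the data.
A = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    100.0    ]])
b = np.array([np.sum(x * y), np.sum(y)])

# Solve for the optimal slope and y-intercept.
m_opt, b_opt = np.linalg.solve(A, b)
print(m_opt, b_opt)  # close to the true slope 3 and intercept -2
```

Since the $x_i$ are not all equal, the determinant of the coefficient matrix is positive and the system has a unique solution.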

Example problems

Please refer to the warm-up exercises on HW6 for simple linear regression problems.