GUIDING QUESTION:
How can I compute the optimal parameters for my model?
Motivation
Suppose you are trying to fit a model $g$, which depends on some parameters (or weights)
$\color{var(--emphColor)}{w_1}, \ldots, \color{var(--emphColor)}{w_n}$.
In practice, many measurements are available, and we seek
the weights that best describe
the given data.
For instance,
$$g(x; \color{var(--emphColor)}{w}) = \sum_j \color{var(--emphColor)}{w_j} \phi_j(x)$$
could be a linear combination of some basis
functions $\phi_j$, with some unknown coefficients $\color{var(--emphColor)}{w_j}$.
However, $g$ could also be something else, like a neural network!
We'll use a semi-colon to highlight the dependence of our
model on the parameters $\color{var(--emphColor)}{w} = \begin{bmatrix} w_1 & \cdots & w_n \end{bmatrix}^T.$
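As a concrete illustration, the sketch below evaluates such a linear combination using a monomial basis $\phi_j(x) = x^j$; the basis choice, the weights, and the function name `g` are assumptions made here for the example, not fixed by the text.

```python
import numpy as np

# Hypothetical basis: phi_j(x) = x**j (any basis functions would do)
def g(x, w):
    """Evaluate g(x; w) = sum_j w_j * phi_j(x) for scalar or array x."""
    return sum(wj * x**j for j, wj in enumerate(w))

w = np.array([1.0, 2.0, 3.0])   # example weights w_1, ..., w_n
print(g(2.0, w))                # 1 + 2*2 + 3*4 = 17.0
```

Because `g` is linear in the weights, fitting it to data is a linear least-squares problem even when the basis functions themselves are nonlinear in $x$.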
Typically, the number of data points $(x_i, y_i)$, $i = 1, 2, \ldots, m$,
is much larger than the number of parameters, and we can't hope
that the system
$$g(x_i; \color{var(--emphColor)}{w}) = y_i, \qquad i = 1, 2, \ldots, m$$
has a solution.
We'll seek the next best thing: minimizing the sum of squared errors
$$f(\color{var(--emphColor)}{w}) = \sum_{i=1}^m (g(x_i; \color{var(--emphColor)}{w}) - y_i)^2$$
by varying the parameters $\color{var(--emphColor)}{w_1}, \ldots, \color{var(--emphColor)}{w_n}$.
Finding the smallest error
In matrix form, the sum of squared errors is
$$f(\color{var(--emphColor)}{w}) = \|X^T \color{var(--emphColor)}{w} - y\|^2
= \color{var(--emphColor)}{w}^T X X^T \color{var(--emphColor)}{w}
- 2 y^T X^T \color{var(--emphColor)}{w}
+ y^T y,$$
where the $i$-th column of $X$ holds the features of the $i$-th sample.
Since the gradient of a sum is the sum of the gradients,
$$\nabla f(\color{var(--emphColor)}{w}) = \nabla_\color{var(--emphColor)}{w}(\color{var(--emphColor)}{w}^T X X^T \color{var(--emphColor)}{w})
- 2 \nabla_\color{var(--emphColor)}{w}(y^T X^T \color{var(--emphColor)}{w})
+ \nabla_\color{var(--emphColor)}{w}(y^T y).$$
Evaluating each term,
$$\nabla f(\color{var(--emphColor)}{w}) = 2 \color{var(--emphColor)}{w}^T X X^T
- 2 y^T X^T$$
For us, the gradient is a row vector!
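To sanity-check this row-vector gradient formula, we can compare it against central finite differences on randomly generated data; the dimensions, the data, and the seed below are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 10
X = rng.standard_normal((n, m))   # data matrix, one sample per column
y = rng.standard_normal(m)
w = rng.standard_normal(n)

def f(w):
    """Sum of squared errors f(w) = ||X^T w - y||^2."""
    return np.sum((X.T @ w - y) ** 2)

# Analytic gradient, using the row-vector convention from the text
grad = 2 * w @ X @ X.T - 2 * y @ X.T

# Central finite-difference approximation of each component
eps = 1e-6
fd = np.array([(f(w + eps * np.eye(n)[i]) - f(w - eps * np.eye(n)[i])) / (2 * eps)
               for i in range(n)])
print(np.max(np.abs(grad - fd)))   # tiny: the two gradients agree
```

Since $f$ is quadratic, the central difference is exact up to rounding error, so the discrepancy should be near machine precision.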
Setting $\nabla f(\color{var(--emphColor)}{w}) = 0$ and taking the
transpose of both sides, we obtain the
normal equations,
$$X X^T \color{var(--emphColor)}{w} = X y.$$
This single linear system describes the optimal $\color{var(--emphColor)}{n}$
parameters $\color{var(--emphColor)}{w_1}, \ldots, \color{var(--emphColor)}{w_n}$.
The matrix $X = \begin{bmatrix} \mathbf{x}^{(1)} & \cdots & \mathbf{x}^{(m)} \end{bmatrix}$
is $\color{var(--emphColor)}{n} \times m$, so the coefficient matrix
$$X X^T$$ of the normal equations is $\color{var(--emphColor)}{n} \times \color{var(--emphColor)}{n}$.
In practice, the number $m$ of measurements or samples available
is typically much larger than the number $\color{var(--emphColor)}{n}$
of parameters, so this is good news!
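A minimal sketch of assembling and solving the $n \times n$ normal equations with NumPy; the sizes, noise level, and true weights below are synthetic assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 200                      # few parameters, many samples
X = rng.standard_normal((n, m))    # columns are the feature vectors x^(i)
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X.T @ w_true + 0.01 * rng.standard_normal(m)   # noisy measurements

# Normal equations: (X X^T) w = X y, an n x n linear system
A = X @ X.T
b = X @ y
w = np.linalg.solve(A, b)
print(w)    # close to w_true, up to the noise
```

Note that the solve involves only an $n \times n$ system, however large the number of samples $m$ grows.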
Although we may use the LU factorization or an iterative method
to solve the normal equations, forming $XX^T$ squares the condition
number of $X$, so numerical issues may prevent a
precise solution in practice.
One of the
final practice problems
explores more appropriate solution techniques, which involve
factorizing the data matrix $X$.
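As a sketch of that idea, `np.linalg.lstsq` (or an explicit QR factorization of $X^T$) solves the least-squares problem without ever forming $X X^T$; the synthetic data below is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 200
X = rng.standard_normal((n, m))
w_true = rng.standard_normal(n)
y = X.T @ w_true + 0.01 * rng.standard_normal(m)

# lstsq works directly on X^T, avoiding the squared condition number of X X^T
w_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)

# Equivalently, via a QR factorization X^T = Q R, reduce to R w = Q^T y
Q, R = np.linalg.qr(X.T)
w_qr = np.linalg.solve(R, Q.T @ y)

print(np.allclose(w_lstsq, w_qr))
```

Both routes give the same least-squares solution when $X$ is full rank; the QR route makes the factorization of the data matrix explicit.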
When the matrix $X$ is
full rank
, the objective function $f$ has a unique minimizer,
given by
$$\color{var(--emphColor)}{w}_{\text{LS}} = (XX^T)^{-1} X y.$$
Suppose we want to fit the simple affine model $g(x) = \color{var(--emphColor)}{m} x + \color{var(--emphColor)}{b}$,
where $x \in \mathbb{R}$ (here $\color{var(--emphColor)}{m}$ denotes the slope, not the number of samples).
We are given the data points $(x_i, y_i)$, $i = 1, \ldots, 100$, and
we seek parameters $\color{var(--emphColor)}{m}$ and $\color{var(--emphColor)}{b}$
to minimize the sum of squared error
$$\sum_{i=1}^{100} (\color{var(--emphColor)}{m} x_i + \color{var(--emphColor)}{b} - y_i)^2.$$
The optimal slope and $y$-intercept satisfy the $\color{var(--emphColor)}{2} \times \color{var(--emphColor)}{2}$ system
$$XX^T \color{var(--emphColor)}{w} = X y$$
Explicitly,
$$\begin{bmatrix}
\sum_{i=1}^{100} x_i^{2} & \sum^{100}_{i=1}x_i\\
\sum_{i=1}^{100} x_i & 100
\end{bmatrix}
\color{var(--emphColor)}{w} = \begin{bmatrix}
\sum^{100}_{i=1}x_i y_i\\
\sum^{100}_{i=1} y_i
\end{bmatrix}$$
Notice that the determinant of the coefficient matrix is proportional
to the variance
\begin{align}
\text{var}(x) &= {1 \over 100} \sum_{i = 1}^{100} x_i^2 - \bigg({1\over 100}\sum_{i=1}^{100} x_i\bigg)^2 \\
&= {1\over 100} \sum_{i=1}^{100} (x_i - \bar{x})^2,
\end{align}
so in this case the optimal parameters are uniquely determined when at
least two $x_i$'s are distinct.
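The $2 \times 2$ system above can be assembled directly from the sums; here is a minimal sketch with made-up data, cross-checked against NumPy's degree-1 polynomial fit.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 100)                           # hypothetical x_i's
y = 2.5 * x + 1.0 + 0.05 * rng.standard_normal(100)   # noisy line

# Assemble the 2x2 normal equations from the sums in the text
A = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    100.0    ]])
b = np.array([np.sum(x * y), np.sum(y)])
slope, intercept = np.linalg.solve(A, b)

# Cross-check against np.polyfit, which solves the same least-squares fit
slope_ref, intercept_ref = np.polyfit(x, y, 1)
print(np.allclose([slope, intercept], [slope_ref, intercept_ref]))
```

With the $x_i$ drawn from a continuous distribution, at least two are distinct almost surely, so the system has a unique solution.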
Example problems
Please refer to the warm-up exercises on
HW6
for simple linear regression problems.