Asymptotic Theory and AIC

In the finance world, a common saying is that “past performance does not guarantee future returns”. In statistics, the analogous phenomenon is the generalization gap: the typical degradation in predictive power when one moves from training (in-sample) data to test (out-of-sample) data. Partly, this is because parameters are optimized directly on the training data, so in-sample measures are necessarily optimistic.

In this note we use classical asymptotic theory from statistics to derive an expression for the generalization gap, and show how the Akaike Information Criterion arises as a special case when the model is correctly specified.

$\newcommand\E{\mathbb E}$ $\newcommand\Var{\mathrm {Var}}$

I learnt this material from Model Selection and Model Averaging, Claeskens and Hjort (2008).

Setup

In the classical setting for machine learning, we have an unknown, true distribution $g$ we are sampling from, which we are trying to approximate using the parametric family $f(\cdot, \theta)$.

One way to do this is to maximize the log-likelihood of our observations - if $Y_1,Y_2,...,Y_n$ are random samples from $g$, then

$$\ell_n(\theta) := \sum_{i=1}^n \log f(Y_i, \theta)$$

is the log-likelihood.

Almost surely, for each individual value of $\theta$, we have the following convergence: $$\frac{1}{n} \ell_n (\theta) \overset{a.s.}\longrightarrow \E_g \log f(Y,\theta).$$

The value $\theta_0$ which maximizes the RHS is called the least-false value (equivalently, it minimizes the Kullback–Leibler divergence from $g$ to $f(\cdot,\theta)$). We do not know what this value is, but our best guess is $$\hat\theta = \argmax_{\theta} \ell_n(\theta).$$
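To make the setup concrete, here is a minimal sketch (the Gamma data and the normal family are purely illustrative choices, not part of the theory): we compute $\hat\theta$ by numerically maximizing $\ell_n$ for a deliberately misspecified model, so $\hat\theta$ targets the least-false value rather than any “true” parameter.

```python
# Minimal sketch: maximum likelihood for a (misspecified) normal model.
# The data come from a Gamma distribution g, while f(., theta) is the normal family,
# so theta_hat converges to the least-false value theta_0, not a "true" parameter.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.5, size=2000)     # samples from the unknown g

def neg_loglik(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)                      # parametrize sigma > 0 via its log
    return -np.sum(stats.norm.logpdf(y, loc=mu, scale=sigma))

res = optimize.minimize(neg_loglik, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # close to the mean and standard deviation of g,
                           # i.e. the normal distribution closest to g
```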

Maximizing log-likelihood vs. empirical risk minimization

Let’s take one step back.

For anyone who has seen a little bit of machine learning, this looks like the usual training objective, up to a small change of notation: if we call the negative log-likelihood $-\log f$ the loss function $L$, and switch out $Y$ for the errors $Y - f_\theta(X)$, then we simply have the so-called empirical risk minimization

$$\min_\theta \sum_{i=1}^n L(Y_i - f_\theta(X_i))$$

A common choice for $L$ is the squared loss $L(t) = t^2$ (the resulting objective is sometimes called the sum of squared errors, or SSE). Often, $L$ is a convex function with a minimum at 0.

This makes a ton of intuitive sense if our goal is to predict $Y$ using a function of the form $f_\theta(X)$ - we want $f_\theta(X)$ to be as close to $Y$ as possible on the training data, and a small total loss quantifies exactly that.

So what’s the explanation for $L = -\log f$? If we think about how the log-likelihood is formed, choosing the loss $L$ amounts to assuming that the errors $\epsilon_i = Y_i - f_\theta(X_i)$ are independent and identically distributed with density proportional to $\exp(-\beta L(\epsilon_i))$ for some scale $\beta > 0$. (For the squared loss, this says precisely that the $\epsilon_i$ are centered normal with a common, unknown variance.)

In other words, we’re mentally imposing a probabilistic model on our data: we pretend that it is generated by $$Y = f_\theta(X) + \epsilon, \quad P(\epsilon) \propto e^{-\beta L(\epsilon)}.$$
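As a concrete sketch of this correspondence (the linear model and the data below are just a toy example): for $f_\theta(X) = X\theta$ with squared loss, minimizing the empirical risk and maximizing a Gaussian log-likelihood (with any fixed error variance) produce the same $\hat\theta$.

```python
# Sketch: minimizing squared loss and maximizing a Gaussian log-likelihood
# give the same estimate of theta in a linear model.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.3, size=200)

def sse(theta):                        # empirical risk with L(t) = t^2
    return np.sum((y - X @ theta) ** 2)

def neg_loglik(theta, sigma=1.0):      # Gaussian errors with an arbitrary fixed sigma
    return -np.sum(stats.norm.logpdf(y - X @ theta, scale=sigma))

theta_sse = optimize.minimize(sse, x0=np.zeros(3)).x
theta_mle = optimize.minimize(neg_loglik, x0=np.zeros(3)).x
print(np.allclose(theta_sse, theta_mle, atol=1e-4))   # True: same minimizer
```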

The other question is: given a probabilistic model, why might we want to pick $\theta$ to maximize the likelihood of the observed data? (The process of figuring out $\theta$ is sometimes called probabilistic inference.)

Again, there’s an intuitive answer. Suppose for some $\theta=\theta_0$, the computed likelihood is incredibly low. If we read the likelihood as an (unnormalized) density over $\theta$, this usually means that the probability that $\theta$ lies within some reasonable neighborhood of $\theta_0$ is also pretty low. That is, we can imagine $$P(\theta: \|\theta - \theta_0\| \le R) \approx P(\theta_0) \cdot \mathrm{Vol}(\theta: \|\theta - \theta_0\| \le R).$$

A slightly better answer involves the Bayesian perspective. Technically, what we have is not exactly a probabilistic model, because one would still need $\theta$ to generate the random variables we care about. We can fix this by defining a probability distribution over $\theta$ (which is called the prior). Then, Bayes’ rule tells us that if we see data $Y$, then $$P(\theta | Y) = \frac{P(Y, \theta)}{P(Y)} = \frac{P(Y | \theta) P (\theta)}{P(Y)}.$$ That is, given $Y$, we would like to know what the most likely $\theta$ is. In this case, one would maximize $P(Y|\theta) P(\theta)$. This reduces back to just the regular log-likelihood if either:

- the prior $P(\theta)$ is flat (uniform), so the factor $P(\theta)$ does not depend on $\theta$; or
- there is a lot of data, so the log-likelihood (which grows with $n$) dominates the fixed $\log P(\theta)$ term.
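As a small sketch of how the prior enters and then washes out (the Gaussian prior and its scales below are just illustrative choices): maximizing $\log P(Y|\theta) + \log P(\theta)$ with a Gaussian prior shrinks the estimate toward the prior mean, and as the prior gets flatter the answer approaches the plain maximum-likelihood estimate.

```python
# Sketch: MAP estimation = log-likelihood + log-prior; a very flat prior recovers the MLE.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(2)
y = rng.normal(loc=3.0, scale=1.0, size=50)

def neg_log_posterior(mu, prior_sd):
    log_lik = np.sum(stats.norm.logpdf(y, loc=mu, scale=1.0))
    log_prior = stats.norm.logpdf(mu, loc=0.0, scale=prior_sd)   # Gaussian prior on mu
    return -(log_lik + log_prior)

for prior_sd in [0.1, 1.0, 100.0]:
    mu_map = optimize.minimize_scalar(lambda m: neg_log_posterior(m, prior_sd)).x
    print(prior_sd, mu_map)   # as prior_sd grows, mu_map approaches the sample mean (the MLE)
```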

Maximum likelihood estimation also enjoys a bunch of nice properties, such as being asymptotically normal. That means that the deviations of $\hat\theta$ from the least-false value $\theta_0$ end up looking like a normal variable (once rescaled by $\sqrt{n}$).

Deviations in the log-likelihood function

Let’s try to understand how $\hat\theta$ deviates from $\theta_0$ asymptotically. To do this, we would like to do a Taylor expansion of $\ell_n(\theta)$ up to second order. Define $u$ and $I$ to denote the first and second derivatives of the log-likelihood: $$u(y,\theta) = \nabla_\theta \log f(y,\theta),$$ $$I(y,\theta) = -\nabla_\theta^2 \log f(y,\theta).$$ $u$ is sometimes called the score, and $I$ is sometimes called the information.

Since $\theta_0$ maximizes $\E_g[\log f(Y, \theta)]$, we must have $\E_g [u(Y, \theta_0)] = 0$. Thus, expanding $\ell_n$ about $\theta_0$ looks like:

$$\ell_n(\theta) \approx \ell_n(\theta_0) + n(\theta- \theta_0)^\top \overline{U_n} - \frac{n}{2}(\theta- \theta_0)^\top \overline{I_n} (\theta- \theta_0)$$ where we define the empirical score and information $$\overline{U_n} = \frac{1}{n} \sum_{i=1}^n u(Y_i,\theta_0),$$ $$\overline{I_n} = \frac{1}{n} \sum_{i=1}^n I(Y_i,\theta_0).$$

The second average converges to the matrix $J := \E_g[I(Y,\theta_0)]$, while the first has a CLT: $$\sqrt{n}\ \overline{U_n} \overset d \Longrightarrow \mathcal N(0,K)$$ where $K:=\Var_g [u(Y,\theta_0)]$.

Solving the optimization (i.e. setting the gradient $\nabla_\theta \ell_n$ to 0), we have $$\hat \theta = \theta_0 + \overline{I_n}^{\,-1} \overline {U_n} = \theta_0 + J^{-1} \overline {U_n} + o_p(1/\sqrt{n}).$$ Therefore, $$\sqrt{n} (\hat\theta - \theta_0) \overset d \Longrightarrow \mathcal N(0,J^{-1}K J^{-1}).$$

In words, $\hat\theta - \theta_0$ behaves as if it were an average of many independent deviations with covariance $J^{-1} K J^{-1}$.
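Here is a minimal simulation sketch of this result (the Gamma data and the deliberately misspecified exponential model are purely illustrative choices): we refit the model on many independent datasets and compare the empirical variance of $\sqrt{n}(\hat\theta - \theta_0)$ to the sandwich $J^{-1}KJ^{-1}$.

```python
# Sketch: checking sqrt(n) (theta_hat - theta_0) ~ N(0, J^{-1} K J^{-1}) for a
# misspecified model: exponential f(y, theta) = theta * exp(-theta * y) fit to Gamma data.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 2000
shape, scale = 2.0, 1.5                  # the true g is Gamma(2, 1.5)
theta_0 = 1.0 / (shape * scale)          # least-false value: 1 / E_g[Y]

theta_hats = []
for _ in range(reps):
    y = rng.gamma(shape, scale, size=n)
    theta_hats.append(1.0 / y.mean())    # the exponential model's MLE in closed form
theta_hats = np.array(theta_hats)

# Population values of J = E_g[1/theta_0^2] and K = Var_g[1/theta_0 - Y] = Var_g[Y]
J = 1.0 / theta_0**2
K = shape * scale**2
sandwich = K / J**2                      # J^{-1} K J^{-1} in one dimension

print(n * theta_hats.var())              # empirical variance of sqrt(n)(theta_hat - theta_0)
print(sandwich)                          # should be close to the line above
```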

Optimism

We typically see lower losses on training data than on test data. The reason why is simple - we optimize directly on the training loss, not on the ideal “test” loss.

But how much higher should the loss be outside of training? Working instead with per-observation log-likelihoods (the negative of the loss, so higher is better), define the training and test values for $\hat\theta$ as: $$\hat Q_n = \frac{1}{n} \ell_n (\hat \theta),$$ $$Q_n = \int g(y) \log f(y, \hat\theta)\, dy.$$

The difference $\hat Q_n - Q_n$ is precisely what we want to compute. This can be interpreted as a function of the parameter $\theta$:

$$(\hat Q_n - Q_n)(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(Y_i, \theta) - \int g(y) \log f(y, \theta)\, dy$$

which can be expanded about $\theta = \theta_0$ up to first order, giving us

$$(\hat Q_n - Q_n)(\theta) = (\hat Q_n - Q_n)(\theta_0) + (\theta - \theta_0)^\top \overline{U_n} + ...\ .$$

Thus, plugging the estimate $\hat\theta$ back in and using $\hat \theta - \theta_0 = J^{-1} \overline {U_n} + o_p(1/\sqrt{n})$, we obtain $$(\hat Q_n - Q_n)(\hat\theta) = (\hat Q_n - Q_n)(\theta_0) + \overline{U_n}^\top J^{-1} \overline{U_n} + ...$$

where $(\hat Q_n - Q_n)(\theta_0)$ is an average of independent, mean-zero terms. Hence, in expectation over the randomness of the dataset $\{Y_i\}$ (and thus the estimator $\hat\theta$), we have $$\E_{\hat\theta \text{ under }g}[\hat Q_n - Q_n] = \E[\overline{U_n}^\top J^{-1} \overline{U_n}] = \frac 1 n \text{tr}(J^{-1}K).$$
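As a rough sanity check of this formula (reusing the same illustrative Gamma-data / exponential-model setup as before), we can estimate $\E[\hat Q_n - Q_n]$ by Monte Carlo and compare it with $\frac{1}{n}\mathrm{tr}(J^{-1}K)$:

```python
# Sketch: Monte Carlo estimate of the optimism E[Q_hat_n - Q_n] vs. tr(J^{-1} K) / n
# for the misspecified exponential model fit to Gamma data.
import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 50000
shape, scale = 2.0, 1.5
EY, VarY = shape * scale, shape * scale**2      # mean and variance of Y under g

gaps = []
for _ in range(reps):
    y = rng.gamma(shape, scale, size=n)
    theta_hat = 1.0 / y.mean()
    Q_hat = np.log(theta_hat) - theta_hat * y.mean()   # training log-likelihood per observation
    Q = np.log(theta_hat) - theta_hat * EY              # its population ("test") counterpart
    gaps.append(Q_hat - Q)

theta_0 = 1.0 / EY
J, K = 1.0 / theta_0**2, VarY
print(np.mean(gaps))     # Monte Carlo estimate of the expected optimism
print(K / J / n)         # asymptotic prediction tr(J^{-1} K) / n; close up to Monte Carlo error
```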

Correct specification

It helps to assume that our model is correctly specified - that is, $g = f(\cdot, \theta_0)$. In this case, note that $$ \begin{align*} \nabla_\theta(\nabla_\theta\log f(Y;\theta)) &= \nabla_\theta\left(\frac{\nabla_\theta f(Y, \theta)}{f(Y, \theta)}\right)\\ &= \frac{\nabla_\theta^2 f(Y, \theta)}{f(Y,\theta)} -\left(\frac{\nabla_\theta f(Y, \theta)}{f(Y, \theta)}\right)\left(\frac{\nabla_\theta f(Y, \theta)}{f(Y, \theta)}\right)^\top\\ &= \frac{\nabla_\theta^2 f(Y, \theta)}{f(Y,\theta)} - u(Y,\theta) u(Y,\theta)^\top \end{align*} $$

Averaging over $Y \sim f(\cdot, \theta)$ itself, we note that $$\E_{f(\cdot, \theta)} \frac{\nabla_\theta^2 f(Y, \theta)}{f(Y,\theta)} = \int \nabla_\theta^2 f(y, \theta)\, dy =\nabla_\theta^2 \int f(y,\theta)\, dy = \nabla_\theta^2\, 1 = 0.$$ So in fact, we have $$\E_{f(\cdot, \theta)}[I(Y,\theta)] = \E_{f(\cdot, \theta)}[u(Y,\theta) u(Y,\theta)^\top],$$ and this common value is known as the Fisher information matrix.
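As a quick numerical sanity check of this identity (using a one-parameter exponential model as an illustrative choice), we can compare Monte Carlo estimates of the two sides under $f(\cdot, \theta)$ itself:

```python
# Sketch: check E_f[I(Y, theta)] = E_f[u(Y, theta) u(Y, theta)^T] for the
# exponential model f(y, theta) = theta * exp(-theta * y).
import numpy as np

rng = np.random.default_rng(5)
theta = 0.7
y = rng.exponential(scale=1.0 / theta, size=1_000_000)   # samples from f(., theta) itself

u = 1.0 / theta - y                     # score: d/dtheta log f = 1/theta - y
info = np.full_like(y, 1.0 / theta**2)  # information: -d^2/dtheta^2 log f = 1/theta^2

print(np.mean(u * u))   # approximately 1/theta^2
print(np.mean(info))    # exactly 1/theta^2
```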

The implication is that if the model is correctly specified, then at the least-false value $\theta_0$ (which is now the true value) we have $J=K$, so the optimism reduces to $\frac{1}{n}\mathrm{tr}(J^{-1}K) = \frac{d}{n}$, where $d$ is the number of parameters - equivalently, the total log-likelihood $\ell_n(\hat\theta)$ is optimistic by about $d$. The so-called Akaike Information Criterion accordingly adds $d$ to the maximized negative log-likelihood when we are selecting across models (conventionally everything is doubled: $\mathrm{AIC} = 2d - 2\ell_n(\hat\theta)$, with smaller values preferred).
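As a small usage sketch (the two candidate normal models and the data are just a toy example, using the conventional scaling $\mathrm{AIC} = 2d - 2\ell_n(\hat\theta)$): we compute AIC for a zero-mean model and a free-mean model and prefer the smaller value.

```python
# Sketch: model selection with AIC = 2d - 2 * (maximized log-likelihood); smaller is better.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
y = rng.normal(loc=0.2, scale=1.0, size=100)   # true mean is small but nonzero

# Model A: N(0, sigma^2), one free parameter (d = 1)
sigma_a = np.sqrt(np.mean(y**2))
loglik_a = np.sum(stats.norm.logpdf(y, loc=0.0, scale=sigma_a))
aic_a = 2 * 1 - 2 * loglik_a

# Model B: N(mu, sigma^2), two free parameters (d = 2)
mu_b, sigma_b = y.mean(), y.std()
loglik_b = np.sum(stats.norm.logpdf(y, loc=mu_b, scale=sigma_b))
aic_b = 2 * 2 - 2 * loglik_b

print(aic_a, aic_b)   # AIC trades off fit against the optimism penalty d
```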

Thoughts

Through this derivation, we witness the power of the asymptotic approach. It seems like under this framework we should be able to derive almost anything we want. However, in practice:

- the approximations only kick in when $n$ is large relative to the number of parameters $d$;
- the Taylor expansions and interchanges of limits rely on regularity conditions that are rarely checked;
- quantities like $J$ and $K$ are unknown and would themselves have to be estimated from data.

But despite there being lots of reasons to be skeptical, this approach gives us a nice clean answer, and is a good base for us to understand generalization gaps better.