The generalization error computation for linear models

$\newcommand \E {\mathbb E}$ $\newcommand \eps {\epsilon}$ $\newcommand \tr {\mathrm{tr}}$ $\newcommand \Cov {\mathrm{Cov}}$

In this article, I outline a computation for linear models that I feel is really essential but was never asked of me during most of my education in machine learning.

Background

My ML education looked a little like this:

deep learning / applied ML -> statistical learning theory -> applied statistics

This meant that for a long time, my understanding of ML was just “fit the model on the training data, and then hope that it works okay on the test data”. The idea was that the model was literally able to “learn from examples”. If it was not able to, it was because either we were using a model class that was too restrictive, or we could not optimize the model fully (e.g. deep neural nets getting caught in local minima), or there just weren’t enough samples.

Of course, given this, something like regularization would make no sense. Sure, the model overfits, but is pulling the coefficients close to 0 really better? It feels kind of like an arbitrary choice to me.

In the statistical learning theory class, I learnt that the loss in machine learning can be decomposed into three pieces (matching the three failure modes we’ve just mentioned): the approximation error (from the model class being too restrictive), the optimization error (from not being able to fit the model fully), and the generalization error (from not having enough samples).

I had always just handwaved the generalization piece away, because in my mind it always vanished once you take the number of datapoints $\to\infty$. But really, there was something interesting here that I just never realized: it is possible to learn the wrong thing from data if you don’t have enough of it.

A practical challenge

Generate the following data:

  1. $\beta\sim N(0, \frac 1 d I_d)$
  2. $x\sim N(0,I_d)$
  3. $y = x^\top \beta + \epsilon$, $\epsilon \sim N(0,\sigma^2)$.

The goal is to recover $\beta$ given $d$ samples of data $(x^{(i)},y^{(i)})$ - that is, with as many samples as dimensions. You could try using e.g. $d=100$.

Here’s some Python code to generate the data:

import numpy as np

d = 100
n = d  # the challenge uses as many samples as dimensions
sigma2 = 1
beta = np.random.randn(d) * np.sqrt(1 / d)  # beta ~ N(0, (1/d) I_d)
x = np.random.randn(n, d)  # each row x^(i) ~ N(0, I_d)
y = x @ beta + np.random.randn(n) * np.sqrt(sigma2)  # y = x^T beta + noise

Analysis of OLS

Let’s say I have a model given by $$y = x^\top \beta + \epsilon$$

and let’s also suppose:

  1. we observe $n$ i.i.d. samples $(x^{(i)},y^{(i)})$ generated from this model, and
  2. each noise term $\epsilon^{(i)} \sim N(0,\sigma^2)$ is drawn independently of $x^{(i)}$.

We can represent the conditions imposed on the dataset in stacked (or matrix) form $$Y = X\beta + \vec \eps$$ The most naive way to fit the linear model is to use ordinary least squares (“OLS”) - we minimize $\|Y-X\beta\|_2^2$. This gives the solution $$\hat \beta = (X^\top X)^{-1} X^\top Y.$$
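To make this concrete, here’s a minimal sketch of the OLS fit on the challenge data, reusing the x, y, beta arrays from the data-generation snippet above (the variable names below are mine, not part of the challenge):

# OLS on the challenge data (reuses x, y, beta from above; with n = d,
# X^T X is square and invertible with probability 1)
beta_hat_ols = np.linalg.solve(x.T @ x, x.T @ y)  # (X^T X)^{-1} X^T Y, without forming the inverse

# squared recovery error; for reference, E[||beta||^2] = 1 is what you get by just predicting 0
print("||beta_hat - beta||^2 =", np.sum((beta_hat_ols - beta) ** 2))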

Interestingly, this solution fares pretty badly on the given challenge, and the goal of the following analysis is to figure out why. In my opinion, this is a very educational exercise, and I regret that none of my earlier courses put me through this.

First, plugging in $Y = X\beta + \vec\eps$ and expanding, we see that $\hat\beta$ deviates from the true $\beta$: $$\hat \beta = \beta + (X^\top X)^{-1} X^\top\vec{\epsilon}.$$

What is my generalization error using the learnt $\hat\beta$? That is, what is the error we should expect from predicting on a fresh datapoint $(x,y)$, i.e. $$\E[(y - x^\top \hat \beta)^2]?$$

Expanding, we get $$\E[(x^\top(\beta - \hat\beta))^2 + 2x^\top(\beta -\hat\beta) \eps + \eps^2] = \underbrace{\E[(x^\top(\beta-\hat\beta))^2]}_{\text{bias}} + \underbrace{\E[\eps^2]}_{\text{variance}}$$ where the cross term vanishes because the fresh noise $\eps$ has mean zero and is independent of $x$ and $\hat\beta$.

So it remains to expand the bias term. Here we condition on the training inputs $X$, so the remaining randomness is the training noise $\vec\eps$ and the fresh test point $x$, which are independent of each other. We use a trick: $$ \begin{align*} \E[(x^\top(\beta-\hat\beta))^2] &= \E[\tr(xx^\top (\beta-\hat\beta)(\beta - \hat\beta)^\top)] \\ &= \tr(\E[xx^\top]\E[(X^\top X)^{-1} X^\top \vec\eps\vec\eps^{\top} X (X^\top X)^{-1}])\\ &=\sigma^2 \tr((X^\top X)^{-1}\E[xx^\top])\\ \end{align*} $$

Let’s use the first-order approximation $X^\top X \approx n \E[xx^\top]$. Then $$\sigma^2 \tr((X^\top X)^{-1}\E[xx^\top]) \approx \frac{\sigma^2}{n}\tr\left((\E[xx^\top])^{-1}\E[xx^\top]\right) = \frac{\sigma^2 d}{n},$$ so the bias term is just $\frac{\sigma^2d}{n}$.
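As a sanity check on this formula, here’s a small self-contained simulation (a sketch - the function name and the particular values of $n$ and $d$ are my own choices) that measures the generalization error of OLS on fresh test data in the comfortable regime $n \gg d$, and compares it with $\sigma^2(1 + d/n)$, i.e. the bias term plus the irreducible noise:

import numpy as np

def ols_generalization_error(n, d, sigma2, n_test=20000, seed=0):
    # generate data in the n >> d regime, where X^T X ~ n E[xx^T] is a good approximation
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(d) * np.sqrt(1 / d)
    x = rng.standard_normal((n, d))
    y = x @ beta + rng.standard_normal(n) * np.sqrt(sigma2)
    beta_hat = np.linalg.solve(x.T @ x, x.T @ y)  # OLS fit
    # estimate E[(y - x^T beta_hat)^2] on fresh data
    x_test = rng.standard_normal((n_test, d))
    y_test = x_test @ beta + rng.standard_normal(n_test) * np.sqrt(sigma2)
    return np.mean((y_test - x_test @ beta_hat) ** 2)

print("measured :", ols_generalization_error(n=2000, d=100, sigma2=1.0))
print("predicted:", 1.0 * (1 + 100 / 2000))  # sigma^2 * (1 + d/n)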

We can make the following conclusions:

  1. the bias term of OLS grows linearly in the dimension $d$ and shrinks like $1/n$ in the number of samples, so for fixed $d$ it does vanish as $n\to\infty$;
  2. when $n$ is comparable to $d$, the approximation $X^\top X \approx n \E[xx^\top]$ is no longer trustworthy - which is exactly the regime of the challenge.

Bayesian perspective

How might we think about a “lower” bound? One idea is to further randomize on $\beta$ (as we did in the challenge). It turns out that in this case, we can in fact get the best possible prediction!

Let’s make a temporary simplification and assume that $x\sim N(0,I_d)$. Then, we saw earlier that the generalization error is exactly $$ \begin{align*} \E[(y - x^\top \hat\beta)^2] &= \E[(x^\top(\beta - \hat\beta))^2] + \E[\eps^2] \\ &= \tr[\E[xx^\top](\beta-\hat\beta)(\beta-\hat\beta)^\top] + \sigma^2 \\ &= \|\beta-\hat\beta\|^2 + \sigma^2 \end{align*} $$

If we minimize the expectation of this across experiments (i.e. over the randomness of the $\beta$’s), the minimizer is exactly $\hat\beta = \E[\beta | X,Y]$. Fortunately, with a single trick we can compute this easily - we will write probabilities in proportional form, so instead of $$p(Z) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac {Z^2} {2\sigma^2}\right)$$ we can drop the normalizing constant and write $$p(Z) \propto \exp\left(-\frac {Z^2} {2\sigma^2}\right).$$

This shorthand is extremely convenient for figuring out what posteriors are. For this case in particular, we have $$ p(Y,X,\beta) = p(Y|X,\beta) p(X) p(\beta) \propto \exp\left(-\frac 1 {2\sigma^2} \|Y-X\beta\|^2 - \frac{d}{2}\|\beta\|^2 \right) p(X) $$ Our question, then, is: fixing $X$ and $Y$, what is the posterior on $\beta$? The answer is that it must be a Gaussian distribution by virtue of its form: regardless of what constant we end up dividing by, $p(\beta|X,Y)$ is the exponential of a quadratic in $\beta$. We can figure out its mean and covariance by completing the square: $$ \begin{align*} \beta | X,Y &\sim N\left(\hat\beta, \sigma^2 \left(X^\top X + \sigma^2 d I\right)^{-1}\right) \\ \hat\beta &= \E[\beta | X,Y] = \left(X^\top X + \sigma^2 d I\right)^{-1} X^\top Y \end{align*} $$

You might recognize that $\hat\beta$ is exactly the ridge regression solution with penalty $\lambda = \sigma^2 d$ (with the convention that ridge minimizes $\|Y-X\beta\|_2^2 + \lambda\|\beta\|_2^2$)! This allows us to wrap up the generalization error computation: $$ \E[\|\beta-\hat\beta\|^2 | X,Y] = \sigma^2\tr\left(X^\top X + \sigma^2 d I_d\right)^{-1} $$ We can do two sanity checks at this point:

  1. when the data overwhelms the prior (the eigenvalues of $X^\top X$ are much larger than $\sigma^2 d$), both the estimator and the error formula reduce to their OLS counterparts;
  2. plugging in the approximation $X^\top X \approx n I_d$ gives an error of roughly $\frac{\sigma^2 d}{n + \sigma^2 d}$, which matches the OLS rate $\frac{\sigma^2 d}{n}$ when $n \gg \sigma^2 d$, and never exceeds $\E[\|\beta\|^2] = 1$, the error of just predicting with $\hat\beta = 0$.
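Before moving on, here’s a minimal sketch of this estimator on the challenge data (again reusing x, y, beta, d, sigma2 from the data-generation snippet; the variable names are mine):

# posterior mean = ridge with penalty sigma^2 * d (reuses x, y, beta, d, sigma2 from above)
A = x.T @ x + sigma2 * d * np.eye(d)
beta_hat_bayes = np.linalg.solve(A, x.T @ y)

# one draw of the recovery error, next to the posterior formula for its expectation
print("||beta - beta_hat||^2             :", np.sum((beta - beta_hat_bayes) ** 2))
print("sigma^2 tr(X^T X + sigma^2 d I)^-1:", sigma2 * np.trace(np.linalg.inv(A)))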

The fully generic version of the computation

With this simple case settled, let’s also do the fully generic version of the computation.

Suppose that $\beta \sim N(0,\Sigma_\beta)$ and $\E[xx^\top] = I_d$. (And let’s assume that $\Sigma_\beta$ is full-rank.) Then the joint density is $$ p(Y,X,\beta) \propto \exp\left(-\frac 1 {2\sigma^2} \|Y-X\beta\|^2 - \frac 1 {2} \beta^\top \Sigma_\beta^{-1} \beta\right)p(X) $$

so therefore the posterior is $$ \beta | X,Y \sim N\left(\hat\beta, \sigma^2\left(X^\top X + \sigma^2 \Sigma_\beta^{-1}\right)^{-1}\right) $$ with $$\hat\beta = \left(X^\top X + \sigma^2 \Sigma_\beta^{-1}\right)^{-1} X^\top Y$$

Finally, the generalization error is $$ \E[(y-x^\top \hat\beta)^2 | X,Y] = \E[(x^\top (\beta-\hat\beta))^2 | X,Y] + \sigma^2 = \sigma^2\tr\left(\left(X^\top X + \sigma^2 \Sigma_\beta^{-1}\right)^{-1}\E[xx^\top]\right) + \sigma^2 $$
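Here’s a minimal sketch of the generic estimator as a standalone function (the name bayes_fit and the assumption that $\Sigma_\beta$ and $\sigma^2$ are known are mine; it also bakes in the $\E[xx^\top] = I_d$ assumption from above):

import numpy as np

def bayes_fit(x, y, Sigma_beta, sigma2):
    # posterior mean of beta and the conditional generalization error, assuming E[xx^T] = I_d
    d = x.shape[1]
    A = x.T @ x + sigma2 * np.linalg.inv(Sigma_beta)  # Sigma_beta assumed full-rank
    beta_hat = np.linalg.solve(A, x.T @ y)
    gen_error = sigma2 * np.trace(np.linalg.inv(A)) + sigma2  # bias term + noise term
    return beta_hat, gen_error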

We can check that in the limit $\Sigma_\beta \to \infty$ (in the eigenvalue sense, so that $\Sigma_\beta^{-1} \to 0$), we recover OLS. A good sanity check is that the way that $\Sigma_\beta$ goes to infinity doesn’t matter.

So what goes wrong when $d=n$?

When $d=n$, we end up inverting $X^\top X$. You might think that there’s nothing wrong with this because $X^\top X$ should be full rank, but there’s a subtle issue - the approximation $$X^\top X \approx n \E[xx^\top]$$ starts to break down (even if we initially assumed that $\E[xx^\top] = I_d$). Let’s suppose that the eigenvalues of $\frac 1 n X^\top X$ are $\lambda_1,\dots,\lambda_d$. This allows us to compute the bias error for both estimators.

Suppose we were doing OLS. Then the bias error is $$ \sigma^2\tr\left(\left(X^\top X\right)^{-1} \right) = \sigma^2 \cdot \frac 1 n \sum_{i=1}^d \frac 1 {\lambda_i} $$

This might look like a dead end, but this is where random matrix theory kicks in. The Marchenko-Pastur law tells us how to evaluate this sum: in the limit of large $n$, we can estimate it (with high probability) as $$ \frac 1 n \sum_{i=1}^d \frac 1 {\lambda_i} \approx \int_0^4 \frac { \nu(\lambda)} {\lambda} d\lambda $$ where $$\nu(\lambda) = \frac 1 {2\pi} \frac{\sqrt{\lambda(4 - \lambda)}}{\lambda}$$ is the Marchenko-Pastur density, supported on $[0,4]$ when $d = n$.

Unfortunately, this integral diverges, so OLS fails spectacularly. (With some care, one can even show that the sum grows like $\Theta(n)$ with high probability.) This is the basis of our practical challenge.

If we knew the distribution of $\beta$, we could do much better with the exact Bayesian computation. Here $\Sigma_\beta = \frac 1 d I_d$, so $\sigma^2\Sigma_\beta^{-1} = \sigma^2 d I_d = \sigma^2 n I_d$, and the expected bias error is $$ \E[\|\beta-\hat\beta\|^2 | X,Y] = \sigma^2\tr\left(X^\top X + \sigma^2 d I_d\right)^{-1} = \sigma^2 \cdot \frac 1 n \sum_{i=1}^d \frac 1 {\lambda_i + \sigma^2} $$ which, for $\sigma^2 = 1$ as in the challenge, is $\frac 1 n \sum_{i=1}^d \frac 1 {\lambda_i + 1}$.

Now the integral is instead $$ \int_0^4 \frac { \nu(\lambda)} {\lambda + 1} d\lambda = \frac 1 {2\pi} \int_0^4 \frac{\sqrt{\lambda(4 - \lambda)}}{\lambda(\lambda + 1)} d\lambda $$ which converges to a constant. (And remember, this is a lower bound on the error, so it is impossible to do better!)
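Here’s a small self-contained numerical check of both sums at $d = n$ (a sketch - scipy is used only to evaluate the Marchenko-Pastur integral, and the particular $n$ is arbitrary):

import numpy as np
from scipy.integrate import quad

n = d = 1000
x = np.random.default_rng(1).standard_normal((n, d))
lam = np.linalg.eigvalsh(x.T @ x / n)  # eigenvalues of (1/n) X^T X

print("(1/n) sum 1/lambda_i      :", np.mean(1 / lam))        # grows with n: the OLS bias blows up
print("(1/n) sum 1/(lambda_i + 1):", np.mean(1 / (lam + 1)))  # stays bounded: the Bayes/ridge bias

nu = lambda t: np.sqrt(t * (4 - t)) / (2 * np.pi * t)  # Marchenko-Pastur density on [0, 4]
print("integral of nu/(lambda + 1):", quad(lambda t: nu(t) / (t + 1), 0, 4)[0])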

Should my data being correlated change how I fit?

I’ll start with the folklore/empirical answer: a common justification is that if your data is correlated, then you should add a ridge penalty so that you don’t invert a badly conditioned matrix as part of the fit.

Surprisingly, from what we’ve gathered in the Bayesian version, the answer seems to be no. It’s clear that the correlation of $x$ affects the size of the bias error, since that term is the squared distance between $\beta$ and $\hat\beta$ measured in the metric induced by the correlation structure of $x$. More surprisingly, though, the optimal estimate itself does not change: all we needed was that $\hat\beta = \E[\beta | X,Y]$, and the posterior mean minimizes the expected error regardless of which PSD matrix we use to measure the difference between $\beta$ and $\hat\beta$. Furthermore, the resulting ridge coefficient only depends on the prior we had on $\beta$ (and on $\sigma^2$), and not on the correlation structure of $X$ at all.

But should it matter?

Well, think about the following special case: we have $d$ exactly identical features. Then, the correlation matrix of $x$ is rank 1, and we evaluate the error in only one direction. In this case, both OLS and ridge essentially give us the same answer as if we were doing a univariate regression, since the features can be bundled together with the coefficient $\beta_1+... +\beta_d$. (In other words, whatever $\hat\beta_1 + \hat\beta_2 + ... + \hat\beta_d$ is should just match the $\hat\beta$ from the univariate regression.)

But the answer is not the problem - the problem is that there’s a degeneracy: as long as the sum of the regressed coefficients matches the univariate regression, the answer is the same. So if instead the features are slightly noisy, it’s possible that we end up “overfitting the noise” and getting a worse generalization error.

Here’s one possible model to explore this discrepancy. Suppose instead that $x$ is a noisy version of the true latent $z$:

$$x_i = z_i + \epsilon_i$$

where each $\epsilon_i$ is drawn iid from $N(0,\sigma_X^2)$, and suppose now that $$y = \beta^\top z + \epsilon', \qquad \epsilon' \sim N(0,\sigma_Y^2)$$ Clearly, we can rewrite this as $$y = \beta^\top x + (-\beta^\top \epsilon + \epsilon')$$ so we can view this as a regression of $y$ on $x$ whose noise, $-\beta^\top\epsilon + \epsilon'$, has variance $\sigma_Y^2 + \|\beta\|^2\sigma_X^2$. For simplicity, we assume that $\sigma_X^2$ is much smaller than the actual variance of $x$, so that $x_i$ is essentially independent of $\epsilon_i$ and we can treat this noise as independent of $x$. Taking a standard normal prior on $\beta$, this allows us to compute $$p(\beta|X,Y) \propto p(\beta)\, p(Y|\beta,X) \propto \exp\left(-\frac 1 {2(\sigma_Y^2 + \|\beta\|^2 \sigma_X^2)} \|Y-X\beta\|^2 - \frac {\|\beta\|^2} 2 \right).$$

This is not a Gaussian any more - but instead of computing the exact posterior mean, we can look at what maximizes the posterior density (a.k.a. “maximum a posteriori” or MAP). Multiplying the negative log-posterior through by $2(\sigma_Y^2 + \|\beta\|^2 \sigma_X^2)$ - a heuristic simplification, since this factor itself depends on $\beta$ - gives

$$\beta_{MAP} \approx \arg\min_\beta \left\{ \|Y-X\beta\|^2 + \|\beta\|^2 (\sigma_Y^2 + \|\beta\|^2 \sigma_X^2) \right\}.$$

This seems scary to analyze, but one way to look at it is that it’s simply the ridge estimator $\beta_\lambda$ for some value of $\lambda$ satisfying the self-consistency equation $$\lambda = \sigma_Y^2 + \|\beta_\lambda\|^2\sigma_X^2.$$
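Here’s a minimal sketch of solving the self-consistency equation by fixed-point iteration (the function name is mine, and it assumes $\sigma_X^2$ and $\sigma_Y^2$ are known; as discussed below, in practice they are not):

import numpy as np

def self_consistent_ridge(x, y, sigma_X2, sigma_Y2, n_iter=50):
    # iterate: fit ridge at the current lambda, then set lambda = sigma_Y^2 + ||beta_lambda||^2 sigma_X^2
    d = x.shape[1]
    lam = sigma_Y2  # start from the penalty you would use with noiseless features
    for _ in range(n_iter):
        beta_lam = np.linalg.solve(x.T @ x + lam * np.eye(d), x.T @ y)  # ridge at the current lambda
        lam = sigma_Y2 + np.sum(beta_lam ** 2) * sigma_X2  # self-consistency update
    return beta_lam, lam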

We can make a qualitative argument here: if $x$ is more correlated, then $\|\beta_\lambda\|$ tends to be larger. But this means that the self-consistency condition is satisfied at a higher value of $\lambda$! So we do end up with a conclusion that tells us to use a higher ridge coefficient for correlated $x$.

Of course, there are still some missing details to put this into practice: there’s no clear way to infer $\sigma_X$ from the data, since that would require us to split out signal and noise within $x$, and we would have to ensure that it is in the correct regime for the independence assumption to hold. But it’s a start that somewhat explains our intuitions.

Conclusion

Looking at generalization error led us to reject OLS for our challenge problem, and we started to (implicitly) ask questions about e.g. what the best ridge penalty $\lambda$ is for correlated data. The broader theme is that generalization error can be used to answer questions about model selection - when you have a few competing classes of models, the way you pick the best amongst them is via various proxies of generalization error. There are theoretical methods (e.g. various information criteria) and empirical methods (e.g. cross-validation) that provide us with an estimated measure of how well a model generalizes.