Validation
Thinking about the true loss function is important#
Most of the **regression** methods we've studied aim to minimize the RSS, while **classification** methods aim to minimize the 0-1 loss.
In classification, we often care about certain kinds of error more than others; i.e. the natural loss function is not the 0-1 loss.
Even if we use a method which minimizes a certain kind of training error, we can tune it to optimize our true loss function.
Example: in the `Default` study, we could find the threshold that brings the false negative rate below an acceptable level.
Validation#
- **Problem:** Choose a supervised method that minimizes the test error.
- In addition, *tune* the parameters of each method; for example:
- $k$ in $k$-nearest neighbors.
- The number of variables to include in forward or backward selection.
- The order of a polynomial in polynomial regression.
Validation set approach#
Use of a validation set is one way to approximate the test error:
- Divide the data into two parts.
- Train each model with one part.
- Compute the error on the remaining validation data.
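A minimal sketch of this approach in R, assuming the `Auto` data from the `ISLR` package (with `mpg` and `horsepower` columns):

```r
# Validation set approach: random half for training, other half for validation.
library(ISLR)
set.seed(1)
n <- nrow(Auto)
train <- sample(n, n / 2)                  # indices of the training half
fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
# validation error: MSE on the held-out half
mean((Auto$mpg - predict(fit, Auto))[-train]^2)
```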
Example: choosing order of polynomial#
Polynomial regression to estimate `mpg` from `horsepower` in the Auto data.
**Problem:** Every split yields a different estimate of the error.
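Continuing the sketch above, a hedged loop over polynomial degrees; rerunning with a different seed (i.e., a different split) shifts the whole error curve:

```r
# Validation MSE for polynomial degrees 1 to 10, using the split from above.
for (d in 1:10) {
  fit <- lm(mpg ~ poly(horsepower, d), data = Auto, subset = train)
  mse <- mean((Auto$mpg - predict(fit, Auto))[-train]^2)
  cat("degree", d, "validation MSE:", round(mse, 2), "\n")
}
```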
Leave one out cross-validation (LOOCV)#
- For every $i=1,\dots,n$:
- train the model on every point except $i$,
- compute the test error on the held out point.
- Average the test errors.
Regression#
Overall error: $$\text{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^n \left(y_i - \color{Red}{\hat y_i^{(-i)}}\right)^2$$
**Notation:** \(\hat y_i^{(-i)}\) is the prediction for the \(i\)th sample when the model is trained without using the \(i\)th sample.
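A minimal sketch computing \(\text{CV}_{(n)}\) in R, assuming the `boot` package and the `Auto` data; `cv.glm` performs LOOCV by default (`K` equal to \(n\)):

```r
# LOOCV for the Auto regression. glm() with the default Gaussian family
# fits the same model as lm(), which is what cv.glm expects.
library(ISLR)
library(boot)
fit <- glm(mpg ~ horsepower, data = Auto)
cv.glm(Auto, fit)$delta[1]                # LOOCV estimate of the test MSE
```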
Classification#
Overall error: $$\text{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\left(y_i \neq \color{Red}{\hat y_i^{(-i)}}\right)$$
Here, \(\hat y_i^{(-i)}\) is the predicted label for the \(i\)th sample when the model is trained without using the \(i\)th sample.
Shortcut for linear regression#
Computing \(\text{CV}_{(n)}\) can be computationally expensive, since it involves fitting the model \(n\) times.
For least squares linear (or polynomial) regression, there is a shortcut that requires only a single fit: $$\text{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat y_i}{1 - h_{ii}}\right)^2.$$
Above, \(\hat y_i\) is the \(i\)th fitted value from the fit on all the data, and \(h_{ii}\) is the leverage statistic.
Approximate versions of this shortcut are sometimes used for logistic regression.
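A minimal sketch of the shortcut in R, assuming the `Auto` data is loaded as above; `hatvalues` returns the leverages of a fitted `lm`:

```r
# One least squares fit plus the leverages reproduces the LOOCV error exactly.
fit <- lm(mpg ~ horsepower, data = Auto)
h <- hatvalues(fit)                        # leverage statistics h_ii
mean((residuals(fit) / (1 - h))^2)         # equals CV_(n) from full LOOCV
```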
\(K\)-fold cross-validation#
Algorithm: \(K\)-fold CV#
- Split the data into $K$ subsets or *folds*.
- For every $i=1,\dots,K$:
- train the model on every fold except the $i$th fold,
- compute the test error on the $i$th fold.
- Average the test errors.
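A minimal 10-fold sketch with `boot::cv.glm`, again assuming the `Auto` data; the estimate varies slightly with the random assignment of observations to folds:

```r
# 10-fold CV estimate of the test MSE for a quadratic fit.
library(boot)
set.seed(2)
fit <- glm(mpg ~ poly(horsepower, 2), data = Auto)
cv.glm(Auto, fit, K = 10)$delta[1]
```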
LOOCV vs. \(K\)-fold cross-validation#
\(K\)-fold CV depends on the chosen split (somewhat).
In \(K\)-fold CV, we train the model on less data than is available. This introduces bias into the estimates of the test error.
In LOOCV, the \(n\) training sets highly resemble each other. This increases the variance of the test error estimate.
\(n\)-fold CV is equivalent to LOOCV.
Choosing an optimal model#
Even if the error estimates are off, choosing the model with the minimum cross validation error often leads to a method with near minimum test error.
In a classification problem, things look similar.
Figure: dashed line, the Bayes boundary; solid lines, logistic regression with polynomial predictors of increasing degree.
The one standard error (1SE) rule of thumb#
**Rule:** instead of the model that minimizes the CV error, choose the simplest model whose CV error is within one standard error of the minimum.
The wrong way to do cross validation#
- *Reading:* Section 7.10.2 of The Elements of Statistical Learning.
- We want to classify 200 individuals according to whether they have cancer or not.
- We use logistic regression onto 1000 measurements of gene expression.
- **Proposed strategy:**
- Using all the data, select the 20 most significant genes using $z$-tests.
- Estimate the test error of logistic regression with these 20 predictors via 10-fold cross validation.
- To see how that works, let's use the following simulated data:
- Each gene expression is standard normal and independent of all others.
- The response (cancer or not) is sampled from a coin flip, with no correlation to any of the "genes".
- Q: What should the misclassification rate be for any classification method using these predictors?
- A: Roughly 50%.
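A hedged R sketch of this simulation; here the screening statistic is each gene's correlation with the response, which ranks the genes the same way the per-gene tests do in this setting:

```r
# The WRONG strategy: screen genes on ALL the data, then cross-validate
# only the final logistic regression fit.
library(boot)
set.seed(1)
n <- 200; p <- 1000
X <- matrix(rnorm(n * p), n, p)            # independent "gene expressions"
y <- rbinom(n, 1, 0.5)                     # coin-flip response
keep <- order(abs(cor(X, y)), decreasing = TRUE)[1:20]   # selected on ALL data
dat <- data.frame(y = y, x = X[, keep])
fit <- glm(y ~ ., data = dat, family = binomial)
cost <- function(y, phat) mean((phat > 0.5) != y)        # misclassification rate
cv.glm(dat, fit, cost = cost, K = 10)$delta[1]           # optimistically low
```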
- We run this simulation, and obtain a CV error rate of 3%!
- Why?
- Since we only have 200 individuals in total, some of the 1000 variables will be correlated with the response purely by chance.
- We ran the variable selection using *all the data*, so the selected variables have some correlation with the response in every fold of the cross-validation.
The right way to do cross validation#
- Divide the data into 10 folds.
- For $i=1,\dots,10$:
- Using every fold except $i$, perform the variable selection and fit the model with the selected variables.
- Compute the error on fold $i$.
- Average the 10 test errors obtained.
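Continuing the simulation sketch above, a hedged version with the screening step redone inside every fold:

```r
# The RIGHT strategy: variable selection happens inside each training fold.
set.seed(1)
folds <- sample(rep(1:10, length.out = n))
errs <- numeric(10)
for (i in 1:10) {
  tr <- folds != i
  keep_i <- order(abs(cor(X[tr, ], y[tr])), decreasing = TRUE)[1:20]
  dat_i <- data.frame(y = y, x = X[, keep_i])
  fit_i <- glm(y ~ ., data = dat_i[tr, ], family = binomial)
  phat  <- predict(fit_i, dat_i[!tr, ], type = "response")
  errs[i] <- mean((phat > 0.5) != y[!tr])
}
mean(errs)                                 # close to 50%, as it should be
```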
In our simulation, this produces an error estimate of close to 50%.
Moral of the story: Every aspect of the learning method that involves using the data — variable selection, for example — must be cross-validated.
Bootstrap#
Another resampling technique often seen in practice.
Cross-validation vs. the Bootstrap#
- Cross-validation: provides estimates of the (test) error.
- The Bootstrap: provides the (standard) error of estimates.
Bootstrap#
- One of the most important techniques in all of Statistics.
Standard errors in linear regression from a sample of size \(n\)#
Recall the output of `summary(M.sales)`: the summary of a fitted linear model reports a Standard Error for each coefficient estimate.
Classical way to compute Standard Errors#
- **Example:** Estimate the variance of a sample $x_1,x_2,\dots,x_n$:
- Unbiased estimate of $\sigma^2$: $$\hat \sigma^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\overline x)^2.$$
- What is the Standard Error of $\hat \sigma^2$?
- Assume that $x_1,\dots,x_n$ are normally distributed with common mean $\mu$ and variance $\sigma^2$.
- Then $(n-1)\hat \sigma^2/\sigma^2$ has a $\chi^2$ distribution with $n-1$ degrees of freedom.
- For large $n$, $\hat{\sigma}^2$ is approximately normally distributed around $\sigma^2$.
- The SD of this *sampling distribution* is the Standard Error.
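As a worked step under the normality assumption above: the variance of a $\chi^2_{n-1}$ variable is $2(n-1)$, so
$$\text{Var}(\hat\sigma^2) = \frac{\sigma^4}{(n-1)^2}\cdot 2(n-1) = \frac{2\sigma^4}{n-1}, \qquad \text{SE}(\hat\sigma^2) = \sigma^2\sqrt{\frac{2}{n-1}}.$$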
Limitations of the classical approach#
- This approach has served statisticians well for many years; however, what happens if:
- The distributional assumption (for example, $x_1,\dots,x_n$ being normal) breaks down?
- The estimator does not have a simple form and its sampling distribution cannot be derived analytically?
Example: Investing in two assets#
Suppose that \(X\) and \(Y\) are the returns of two assets.
These returns are observed every day: \((x_1,y_1),\dots,(x_n,y_n)\).
Example: Investing in two assets#
We have a fixed amount of money to invest and we will invest a fraction \(\alpha\) on \(X\) and a fraction \((1-\alpha)\) on \(Y\).
Therefore, our return will be \(\alpha X + (1-\alpha) Y\).
Our goal will be to minimize the variance of our return as a function of \(\alpha\): $$\text{Var}\big(\alpha X + (1-\alpha)Y\big) = \alpha^2\sigma_X^2 + (1-\alpha)^2\sigma_Y^2 + 2\alpha(1-\alpha)\text{Cov}(X,Y).$$
One can show that the optimal \(\alpha\) is: $$\alpha = \frac{\sigma_Y^2 - \text{Cov}(X,Y)}{\sigma_X^2 + \sigma_Y^2 - 2\,\text{Cov}(X,Y)}.$$
**Proposal:** Use the plug-in estimate: $$\hat \alpha = \frac{\hat \sigma_Y^2 - \widehat{\text{Cov}}(X,Y)}{\hat \sigma_X^2 + \hat \sigma_Y^2 - 2\,\widehat{\text{Cov}}(X,Y)}.$$
Example: Investing in two assets#
Suppose we compute the estimate \(\hat\alpha = 0.6\) using the samples \((x_1,y_1),\dots,(x_n,y_n)\).
How sure can we be of this value? (A somewhat vague question.)
If we had sampled the observations on a different 100 days, would we get a wildly different \(\hat \alpha\)? (A more precise question.)
Resampling the data from the true distribution#
In this thought experiment, we know the actual joint distribution \(P(X,Y)\), so we can resample the \(n\) observations to our hearts’ content.
Computing the standard error of \(\hat \alpha\)#
We will use \(S\) samples to estimate the standard error of \(\hat{\alpha}\).
For each sampling \(s\) of the data, \(1 \leq s \leq S\), we compute a value of the estimate, obtaining \(\hat \alpha^{(1)},\hat \alpha^{(2)},\dots,\hat \alpha^{(S)}\).
The Standard Error of \(\hat \alpha\) is approximated by the standard deviation of these values.
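A hedged R sketch of this thought experiment, with a made-up bivariate normal standing in for the true \(P(X,Y)\):

```r
# If we could sample from the true P(X, Y), we could estimate SE(alpha-hat)
# directly from repeated fresh samples.
library(MASS)
Sigma <- matrix(c(1, 0.5, 0.5, 1.25), 2, 2)      # assumed true covariance
alpha.hat <- replicate(1000, {
  d <- mvrnorm(100, mu = c(0, 0), Sigma = Sigma) # a fresh sample of n = 100
  x <- d[, 1]; y <- d[, 2]
  (var(y) - cov(x, y)) / (var(x) + var(y) - 2 * cov(x, y))
})
sd(alpha.hat)                                    # approximates SE(alpha-hat)
```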
In reality, we only have \(n\) samples#
The best we can do is to resample from the empirical distribution, which places mass \(1/n\) on each observed pair:
$$\hat P(X,Y) = \frac{1}{n}\sum_{i=1}^{n} \delta_{(x_i,y_i)}.$$
A schematic of the Bootstrap#
Comparing Bootstrap sampling to sampling from the true distribution#
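A hedged R sketch of the bootstrap side of this comparison, in the spirit of ISLR's `Portfolio` example; `returns` is a hypothetical data frame of observed \((X,Y)\) pairs:

```r
# The bootstrap replaces draws from the true P(X, Y) with draws
# (with replacement) from the observed data.
library(boot)
alpha.fn <- function(data, index) {
  X <- data$X[index]; Y <- data$Y[index]         # a bootstrap resample
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}
boot(returns, alpha.fn, R = 1000)   # the "std. error" column estimates SE(alpha-hat)
```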