Stats 203
If data are generated from the linear regression model \(Y_i = \alpha + \beta x_i + \epsilon_i\), then:
\[ \hat\beta \sim \text{Normal}\left(\beta, \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}\right) \]
\[ \hat\alpha \sim \text{Normal}\left(\alpha, \frac{\sigma^2}{n} + \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2} \bar x^2 \right) \]
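These sampling distributions can be checked by simulation. The sketch below is only an illustration: the design points and the values of `alpha`, `beta`, and `sigma` are made up, not taken from the lecture. It refits the line on many simulated data sets and compares the empirical variances of the estimates with the two formulas above.

```r
# Simulation sketch: empirical vs. theoretical variance of the estimates.
# All parameter values below are illustrative assumptions.
set.seed(203)
n     <- 30
x     <- seq(1, 10, length.out = n)   # fixed design points
alpha <- 2; beta <- 0.5; sigma <- 1.5

Sxx <- sum((x - mean(x))^2)
var_beta_theory  <- sigma^2 / Sxx
var_alpha_theory <- sigma^2 / n + sigma^2 / Sxx * mean(x)^2

# Refit the least-squares line on 5000 simulated data sets.
# `est` is a 2 x 5000 matrix: row 1 = intercepts, row 2 = slopes.
est <- replicate(5000, {
  y <- alpha + beta * x + rnorm(n, sd = sigma)
  coef(lm(y ~ x))
})

rbind(theory    = c(alpha = var_alpha_theory, beta = var_beta_theory),
      simulated = c(var(est[1, ]), var(est[2, ])))
```

The two rows should agree up to simulation error, and a histogram of `est[2, ]` should look roughly normal around `beta`.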
Under the linear regression model, \[ \begin{align} \hat\beta &\sim \text{Normal}\Bigg(\beta, \underbrace{\frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}}_{\text{Var}[\hat\beta]}\Bigg) & &\Longrightarrow & \frac{\hat\beta - \beta}{\sqrt{\text{Var}[\hat\beta]}} &\sim \text{Normal}(0, 1). \end{align} \]
But this is impractical for inference because \(\text{Var}[\hat\beta] = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}\), and \(\sigma^2\) is unknown.
We can estimate \(\sigma^2\) by \(S^2 = \frac{1}{n-2} \sum_{i=1}^n (Y_i - \hat Y_i)^2\) to obtain \(\widehat{\text{Var}}[\hat\beta] = \frac{S^2}{\sum_{i=1}^n (x_i - \bar x)^2}\):
\[ \frac{\hat\beta - \beta}{\sqrt{\widehat{\text{Var}}[\hat\beta]}} \sim t_{n-2} \]
The square root of the estimated variance is often called the standard error: \(\text{SE}[\hat\beta] = \sqrt{\widehat{\text{Var}}[\hat\beta]}\).
(There is an analogous result for the intercept \(\alpha\).)
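As a sanity check, the standard error of the slope can be computed directly from \(S^2\) and \(\sum_i (x_i - \bar x)^2\) and compared with what `lm` reports. This is a minimal sketch on simulated data; the sample size and coefficients are arbitrary choices.

```r
# Standard error of the slope by hand, on made-up data.
set.seed(1)
x <- runif(25, 0, 10)
y <- 1 + 2 * x + rnorm(25, sd = 3)
n <- length(x)

fit <- lm(y ~ x)

S2  <- sum(residuals(fit)^2) / (n - 2)   # estimate of sigma^2
Sxx <- sum((x - mean(x))^2)
se_beta <- sqrt(S2 / Sxx)                # SE[beta-hat]

c(by_hand = se_beta,
  from_lm = summary(fit)$coefficients["x", "Std. Error"])
```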
We can form a 95% confidence interval for \(\beta\) by observing that, with probability 0.95, \[ \begin{align} \left|\frac{\hat\beta - \beta}{\text{SE}[\hat\beta]}\right| &\leq t_{n-2, 0.975} & \Longleftrightarrow & & \hat\beta - t_{n-2, 0.975} \text{SE}[\hat\beta] \leq \beta \leq \hat\beta + t_{n-2, 0.975} \text{SE}[\hat\beta]. \end{align} \]
We can use this to obtain a 95% confidence interval for the age of the universe.
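The lecture's data are not reproduced here, so the sketch below builds the interval on simulated data (made-up coefficients) and compares it with R's `confint`.

```r
# 95% confidence interval for the slope, by hand and via confint().
set.seed(2)
x <- runif(25, 0, 10)
y <- 1 + 2 * x + rnorm(25, sd = 3)
n <- length(x)

fit <- lm(y ~ x)
beta_hat <- unname(coef(fit)["x"])
se_beta  <- summary(fit)$coefficients["x", "Std. Error"]
t_crit   <- qt(0.975, df = n - 2)        # t_{n-2, 0.975}

ci_by_hand <- c(beta_hat - t_crit * se_beta,
                beta_hat + t_crit * se_beta)
ci_by_hand
confint(fit, "x", level = 0.95)
```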
We can test the hypotheses: \[ \begin{align} H_0&: \beta = \beta_0 \\ H_A&: \beta \neq \beta_0 \end{align} \]
by rejecting \(H_0\) at the 5% significance level when the \(t\)-statistic is too large in absolute value: \(\displaystyle \left|\frac{\hat\beta - \beta_0}{\text{SE}[\hat\beta]}\right| > t_{n-2, 0.975}\).
We can use this to test whether the universe is 14 billion years old.
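A sketch of the test for a general \(\beta_0\), again on simulated data rather than the lecture's data. The value `beta0 <- 2` is an arbitrary hypothesized slope chosen for illustration.

```r
# Two-sided t-test of H0: beta = beta0 at the 5% level.
set.seed(3)
x <- runif(25, 0, 10)
y <- 1 + 2 * x + rnorm(25, sd = 3)
n <- length(x)

fit <- lm(y ~ x)
beta0    <- 2                            # hypothesized slope (illustrative)
beta_hat <- unname(coef(fit)["x"])
se_beta  <- summary(fit)$coefficients["x", "Std. Error"]

t_stat  <- (beta_hat - beta0) / se_beta
t_crit  <- qt(0.975, df = n - 2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)   # two-sided p-value

reject <- abs(t_stat) > t_crit
c(t = t_stat, critical = t_crit, p = p_value)
reject
```

Rejecting when \(|t| > t_{n-2, 0.975}\) is the same decision as rejecting when the two-sided p-value is below 0.05.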
R calculates confidence intervals and hypothesis tests automatically.
Note that R tests \(H_0: \beta = 0\) by default.
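For instance, on simulated data (made-up coefficients), `summary` and `confint` report these quantities directly. The last line confirms that the reported `t value` is the estimate divided by its standard error, which is the statistic for \(H_0: \beta = 0\).

```r
set.seed(4)
x <- runif(25, 0, 10)
y <- 1 + 2 * x + rnorm(25, sd = 3)
fit <- lm(y ~ x)

coefs <- summary(fit)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)
coefs
confint(fit)                         # 95% intervals for intercept and slope

# t value = (estimate - 0) / SE, one per coefficient
coefs[, "Estimate"] / coefs[, "Std. Error"]
```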
All of the above conclusions assume that the linear regression model is correct!
As we saw last time, some of the assumptions are questionable (homoskedasticity, normality, even linearity).
Because of the Central Limit Theorem, inferences about \(\alpha\), \(\beta\), and the mean response are often robust to departures from normality when \(n\) is reasonably large.
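One way to probe this robustness is by simulation. The setup below is hypothetical (skewed exponential noise, \(n = 100\), made-up coefficients) and is not an example from the lecture. It estimates how often the usual 95% interval for \(\beta\) covers the true slope when the noise is not normal.

```r
# Coverage of the t-based interval for the slope under skewed noise.
set.seed(20)
n    <- 100
beta <- 0.5
x    <- seq(0, 10, length.out = n)

covered <- replicate(2000, {
  eps <- rexp(n, rate = 1) - 1           # skewed noise, mean 0, variance 1
  y   <- 1 + beta * x + eps
  ci  <- confint(lm(y ~ x), "x", level = 0.95)
  ci[1] <= beta && beta <= ci[2]
})

mean(covered)   # empirical coverage, to be compared with the nominal 0.95
```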
Karpicke & Blunt (2011):
What happens if we convert “condition” to a quantitative variable (\(1 =\) Retrieval, \(0 =\) Concept) and run linear regression?
What does it mean to test \(H_0: \beta = 0\) in this case?
It is equivalent to testing \(H_0: \mu_0 = \mu_1\) using a (pooled) \(t\)-test.
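The equivalence can be checked numerically. The scores below are simulated, not the Karpicke & Blunt data, and the group means and noise level are arbitrary assumptions. The slope's \(t\)-statistic and p-value from `lm` match those from `t.test` with `var.equal = TRUE`.

```r
# Regression on a 0/1 variable vs. the pooled two-sample t-test.
set.seed(5)
condition <- rep(c(0, 1), each = 20)     # 0 = Concept, 1 = Retrieval
score     <- 0.45 + 0.20 * condition + rnorm(40, sd = 0.15)

fit <- lm(score ~ condition)
reg <- summary(fit)$coefficients["condition", ]

tt <- t.test(score[condition == 1], score[condition == 0],
             var.equal = TRUE)           # pooled-variance t-test

c(regression_t = unname(reg["t value"]),  pooled_t = unname(tt$statistic))
c(regression_p = unname(reg["Pr(>|t|)"]), pooled_p = tt$p.value)
```

Here \(\hat\alpha\) is the sample mean of the Concept group and \(\hat\beta\) is the difference between the two group means.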
The value of the line at \(X = x^*\) is a parameter: \[ \mu_{x^*} = \alpha + \beta x^*. \]
An estimate of \(\mu_{x^*}\) is the prediction from the least-squares line: \[ \hat Y_{x^*} = \hat\alpha + \hat\beta x^*. \]
The variance is \(\displaystyle \text{Var}[\hat Y_{x^*}] = \frac{\sigma^2}{n} + \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2} (x^* - \bar x)^2\).
Therefore, a 95% confidence interval for \(\mu_{x^*}\) is \[ \hat Y_{x^*} \pm t_{n-2, 0.975} \underbrace{\sqrt{\frac{S^2}{n} + \frac{S^2}{\sum_{i=1}^n (x_i - \bar x)^2} (x^* - \bar x)^2}}_{\text{SE}[\hat Y_{x^*}]}. \]

We can predict the waiting time until the next eruption based on the duration of the last eruption.
On average, how long does one have to wait until the next eruption when the last eruption lasted 4.0 minutes?
\[ {\scriptsize \hat Y_{x^*} \pm t_{n-2, 0.975} \sqrt{\frac{S^2}{n} + \frac{S^2}{\sum_{i=1}^n (x_i - \bar x)^2} (x^* - \bar x)^2}.} \]
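A sketch of this calculation, assuming R's built-in `faithful` data set (columns `eruptions` and `waiting`, both in minutes) stands in for the eruption data used in lecture. It evaluates the formula at \(x^* = 4.0\) and compares the result with `predict`.

```r
# 95% confidence interval for the mean waiting time at x* = 4.0 minutes.
fit <- lm(waiting ~ eruptions, data = faithful)
x <- faithful$eruptions
n <- length(x)
x_star <- 4.0

S2  <- sum(residuals(fit)^2) / (n - 2)
Sxx <- sum((x - mean(x))^2)
y_hat  <- sum(coef(fit) * c(1, x_star))                  # alpha-hat + beta-hat * x*
se_fit <- sqrt(S2 / n + S2 / Sxx * (x_star - mean(x))^2)
t_crit <- qt(0.975, df = n - 2)

by_hand <- c(fit = y_hat,
             lwr = y_hat - t_crit * se_fit,
             upr = y_hat + t_crit * se_fit)
from_predict <- predict(fit, newdata = data.frame(eruptions = x_star),
                        interval = "confidence", level = 0.95)
by_hand
from_predict
```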
Let’s use R to calculate the confidence interval at each \(x^*\) and draw a confidence band around the least-squares line.
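One possible way to draw the band, assuming the built-in `faithful` data set is the one intended. The grid of 100 values of \(x^*\) is an arbitrary choice.

```r
# Confidence band for the mean response around the least-squares line.
fit  <- lm(waiting ~ eruptions, data = faithful)
grid <- data.frame(eruptions = seq(min(faithful$eruptions),
                                   max(faithful$eruptions),
                                   length.out = 100))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

plot(waiting ~ eruptions, data = faithful, col = "grey50",
     xlab = "Duration of last eruption (min)",
     ylab = "Waiting time until next eruption (min)")
abline(fit)                                     # least-squares line
lines(grid$eruptions, band[, "lwr"], lty = 2)   # lower edge of band
lines(grid$eruptions, band[, "upr"], lty = 2)   # upper edge of band
```

The band is narrowest near \(\bar x\) and widens toward the edges, because the standard error grows with \((x^* - \bar x)^2\).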
If the last eruption lasted 4.0 minutes, a visitor to Yellowstone is more interested in the actual waiting time until the next eruption than in the average waiting time.
A new observation at \(X = x^*\) is assumed to be generated as
\[ Y = \alpha + \beta x^* + \epsilon. \]
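A prediction interval takes into account the variability from estimating \(\alpha\) and \(\beta\), as well as the variability in the noise \(\epsilon\). Because the new noise \(\epsilon\) is independent of the data used to fit the line, the two variances add: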
\[ {\small \text{Var}[\hat Y_{x^*}] + \text{Var}[\epsilon] = \frac{\sigma^2}{n} + \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2} (x^* - \bar x)^2 + \sigma^2} \]
Therefore, a 95% prediction interval for \(Y_{x^*}\) is \[ {\small \hat Y_{x^*} \pm t_{n-2, 0.975} \sqrt{\frac{S^2}{n} + \frac{S^2}{\sum_{i=1}^n (x_i - \bar x)^2} (x^* - \bar x)^2 + S^2}} \]
How long is it until the next eruption if the last eruption lasted 4.0 minutes?
\[ {\scriptsize \hat Y_{x^*} \pm t_{n-2, 0.975} \sqrt{\frac{S^2}{n} + \frac{S^2}{\sum_{i=1}^n (x_i - \bar x)^2} (x^* - \bar x)^2 + S^2}.} \]
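A sketch of the prediction interval at \(x^* = 4.0\), again assuming the built-in `faithful` data set stands in for the lecture's data. The only change from the confidence interval calculation is the extra \(S^2\) under the square root.

```r
# 95% prediction interval for a new waiting time at x* = 4.0 minutes.
fit <- lm(waiting ~ eruptions, data = faithful)
x <- faithful$eruptions
n <- length(x)
x_star <- 4.0

S2  <- sum(residuals(fit)^2) / (n - 2)
Sxx <- sum((x - mean(x))^2)
y_hat   <- sum(coef(fit) * c(1, x_star))
se_pred <- sqrt(S2 / n + S2 / Sxx * (x_star - mean(x))^2 + S2)  # extra S2 for the noise
t_crit  <- qt(0.975, df = n - 2)

by_hand <- c(fit = y_hat,
             lwr = y_hat - t_crit * se_pred,
             upr = y_hat + t_crit * se_pred)
from_predict <- predict(fit, newdata = data.frame(eruptions = x_star),
                        interval = "prediction", level = 0.95)
by_hand
from_predict
```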
Let’s use R to calculate the prediction interval at each \(x^*\) and draw a prediction band around the least-squares line.
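One possible way to draw the prediction band, with the confidence band overlaid for comparison. As before, the built-in `faithful` data set and the 100-point grid are assumptions made for this sketch.

```r
# Prediction band and confidence band around the least-squares line.
fit  <- lm(waiting ~ eruptions, data = faithful)
grid <- data.frame(eruptions = seq(min(faithful$eruptions),
                                   max(faithful$eruptions),
                                   length.out = 100))
conf_band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
pred_band <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)

plot(waiting ~ eruptions, data = faithful, col = "grey50",
     xlab = "Duration of last eruption (min)",
     ylab = "Waiting time until next eruption (min)")
abline(fit)
matlines(grid$eruptions, conf_band[, c("lwr", "upr")], lty = 2, col = "black")
matlines(grid$eruptions, pred_band[, c("lwr", "upr")], lty = 3, col = "black")
legend("topleft", lty = 1:3,
       legend = c("least-squares line", "95% confidence band", "95% prediction band"))
```

The prediction band should contain most of the individual points, while the confidence band only describes uncertainty about the line itself.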
Both intervals are centered at \(\hat Y_{x^*} = \hat\alpha + \hat\beta x^*\) and have the form \(\hat Y_{x^*} \pm t_{n-2, 0.975} \cdot \text{SE}\).
| | SE formula | Interpretation | Requires Normality? |
|---|---|---|---|
| Confidence interval | \(\sqrt{\frac{S^2}{n} + \frac{S^2}{\sum_{i=1}^n (x_i - \bar x)^2}(x^* - \bar x)^2}\) | Captures the true mean response \(\mu_{x^*} = \alpha + \beta x^*\) | No |
| Prediction interval | \(\sqrt{\frac{S^2}{n} + \frac{S^2}{\sum_{i=1}^n (x_i - \bar x)^2}(x^* - \bar x)^2 + S^2}\) | Captures a new individual observation \(Y = \alpha + \beta x^* + \epsilon\) | Yes |
The prediction interval is wider because it adds \(S^2\) under the square root to account for the noise \(\epsilon\) in the new observation.
In light of what we discussed today, take 5 minutes to try to retrieve as much information from memory as possible.