Inference for Linear Regression

Stats 203

Stanford University

Review

If data are generated from the linear regression model \(Y_i = \alpha + \beta x_i + \epsilon_i\), then:

\[ \hat\beta \sim \text{Normal}\left(\beta, \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}\right) \]

\[ \hat\alpha \sim \text{Normal}\left(\alpha, \frac{\sigma^2}{n} + \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2} \bar x^2 \right) \]
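These two sampling distributions can be checked by simulation. The sketch below is only an illustration: the values of `n`, `alpha`, `beta`, `sigma`, and the design points `x` are arbitrary assumptions, not quantities from the lecture. It repeatedly generates data from the model, refits the line, and compares the empirical variances of the estimates with the formulas above.

```r
# Simulation sketch: all numeric settings below are illustrative assumptions.
set.seed(1)
n     <- 30
alpha <- 2
beta  <- 0.5
sigma <- 1
x     <- seq(0, 10, length.out = n)   # fixed design, reused in every replication
Sxx   <- sum((x - mean(x))^2)

# Simulate one data set from the model and return the least-squares estimates.
fit_once <- function() {
  y <- alpha + beta * x + rnorm(n, mean = 0, sd = sigma)
  coef(lm(y ~ x))
}
est <- replicate(5000, fit_once())    # 2 x 5000 matrix, rows "(Intercept)" and "x"

# Empirical variances next to the theoretical ones from the formulas above.
var_beta_theory  <- sigma^2 / Sxx
var_alpha_theory <- sigma^2 / n + sigma^2 * mean(x)^2 / Sxx
c(empirical = var(est["x", ]), theory = var_beta_theory)
c(empirical = var(est["(Intercept)", ]), theory = var_alpha_theory)
```

With a few thousand replications the empirical and theoretical values should be close, though not identical, because of simulation error.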

Inference for Coefficients

Foundations for Inference

Under the linear regression model, \[ \begin{align} \hat\beta &\sim \text{Normal}\Bigg(\beta, \underbrace{\frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}}_{\text{Var}[\hat\beta]}\Bigg) & &\Longrightarrow & \frac{\hat\beta - \beta}{\sqrt{\text{Var}[\hat\beta]}} &\sim \text{Normal}(0, 1). \end{align} \]

But this is impractical for inference because \(\text{Var}[\hat\beta] = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}\), and \(\sigma^2\) is unknown.

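We can estimate \(\sigma^2\) by \(S^2 = \frac{1}{n-2} \sum_{i=1}^n (Y_i - \hat Y_i)^2\) to obtain \(\widehat{\text{Var}}[\hat\beta] = \frac{S^2}{\sum_{i=1}^n (x_i - \bar x)^2}\). Plugging in this estimate changes the reference distribution from the standard normal to a \(t\) distribution with \(n-2\) degrees of freedom: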

\[ \frac{\hat\beta - \beta}{\sqrt{\widehat{\text{Var}}[\hat\beta]}} \sim t_{n-2} \]

The square root of the estimated variance is called the standard error: \(\text{SE}[\hat\beta] = \sqrt{\widehat{\text{Var}}[\hat\beta]}\).

(There is an analogous result for the intercept \(\alpha\).)
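As a minimal sketch, \(S^2\) and \(\text{SE}[\hat\beta]\) can be computed directly from these formulas and compared with what R reports. The code assumes R's built-in `faithful` data set (eruption duration and waiting time for Old Faithful) as a stand-in for the lecture's data; the variable names are illustrative.

```r
# Assumption: R's built-in `faithful` data stand in for the lecture's data set.
x <- faithful$eruptions   # duration of the last eruption (minutes)
y <- faithful$waiting     # waiting time until the next eruption (minutes)
n <- length(x)

fit <- lm(y ~ x)

# S^2 = residual sum of squares / (n - 2)
S2 <- sum(residuals(fit)^2) / (n - 2)

# SE[beta-hat] = sqrt(S^2 / sum((x_i - xbar)^2))
Sxx     <- sum((x - mean(x))^2)
se_beta <- sqrt(S2 / Sxx)

# The same two quantities as reported by R
c(by_hand = sqrt(S2), from_R = sigma(fit))
c(by_hand = se_beta, from_R = coef(summary(fit))["x", "Std. Error"])
```

The hand-computed values should agree with R's "residual standard error" and the "Std. Error" column of the coefficient table.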

Confidence Intervals (CIs)

We can form a 95% confidence interval for \(\beta\) by observing that, with probability 0.95, \[ \begin{align} \left|\frac{\hat\beta - \beta}{\text{SE}[\hat\beta]}\right| &\leq t_{n-2, 0.975} & \Longleftrightarrow & & \hat\beta - t_{n-2, 0.975} \text{SE}[\hat\beta] \leq \beta \leq \hat\beta + t_{n-2, 0.975} \text{SE}[\hat\beta]. \end{align} \]

We can use this to obtain a 95% confidence interval for the age of the universe.
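The data behind the age-of-the-universe example are not included in these notes, so the sketch below illustrates the interval formula on R's built-in `faithful` data instead (an assumption made only for illustration). It builds the interval for the slope by hand and compares it with `confint()`.

```r
# Assumption: `faithful` is used here only to illustrate the formula.
fit <- lm(waiting ~ eruptions, data = faithful)
n   <- nrow(faithful)

beta_hat <- coef(fit)[["eruptions"]]
se_beta  <- coef(summary(fit))["eruptions", "Std. Error"]
t_crit   <- qt(0.975, df = n - 2)   # t_{n-2, 0.975}

ci_by_hand <- c(lower = beta_hat - t_crit * se_beta,
                upper = beta_hat + t_crit * se_beta)
ci_by_hand
confint(fit, "eruptions", level = 0.95)   # should agree with ci_by_hand
```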

Hypothesis Tests

We can test the hypotheses: \[ \begin{align} H_0&: \beta = \beta_0 \\ H_A&: \beta \neq \beta_0 \end{align} \]

by rejecting \(H_0\) at the 5% level when the \(t\)-statistic is too large in absolute value: \(\displaystyle \left|\frac{\hat\beta - \beta_0}{\text{SE}[\hat\beta]}\right| > t_{n-2, 0.975}\).

We can use this to test whether the universe is 14 billion years old.
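Testing a null value other than zero takes a few lines by hand. The sketch below again uses the built-in `faithful` data as a stand-in, and the null value `beta0 = 10` is a hypothetical choice made only to show the mechanics; it is not a value from the lecture.

```r
fit <- lm(waiting ~ eruptions, data = faithful)
df  <- df.residual(fit)          # n - 2

beta0    <- 10                   # hypothetical null value, for illustration only
beta_hat <- coef(fit)[["eruptions"]]
se_beta  <- coef(summary(fit))["eruptions", "Std. Error"]

t_stat  <- (beta_hat - beta0) / se_beta
reject  <- abs(t_stat) > qt(0.975, df)      # two-sided test at the 5% level
p_value <- 2 * pt(-abs(t_stat), df)         # two-sided p-value

c(t = t_stat, p = p_value)
reject
```

Rejecting when \(|t| > t_{n-2, 0.975}\) is the same decision as rejecting when the two-sided p-value is below 0.05.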

Inference in R

R calculates confidence intervals and hypothesis tests automatically.

Note that R tests \(H_0: \beta = 0\) by default.
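A minimal sketch of the standard calls, assuming the built-in `faithful` data as the example: `summary()` prints the coefficient table and `confint()` returns the confidence intervals. The last lines confirm that the reported `t value` is the estimate divided by its standard error, which is the \(t\)-statistic for a null value of zero.

```r
fit <- lm(waiting ~ eruptions, data = faithful)

summary(fit)            # estimates, standard errors, t-statistics, p-values
confint(fit)            # 95% confidence intervals for alpha and beta

# The reported t value is the estimate divided by its standard error,
# i.e. the t-statistic for H0: coefficient = 0.
tab <- coef(summary(fit))
tab[, "Estimate"] / tab[, "Std. Error"]
tab[, "t value"]
```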

Remark about Assumptions

All of the above conclusions assume that the linear regression model is correct!

As we saw last time, some of the assumptions are questionable (homoskedasticity, normality, even linearity).

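Because of the Central Limit Theorem, inferences about the coefficients and the mean response are often robust to departures from normality when \(n\) is large. Prediction intervals are not, because they depend on the distribution of an individual \(\epsilon\).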

Application to Two-Sample Inference

Retrieval Practice

  • Retrieval practice is a learning strategy in which students actively recall information from memory (rather than reviewing it).
  • The act of struggling to reconstruct knowledge encodes it more durably and strengthens long-term retention.

Karpicke & Blunt (2011):

  • Undergraduates studied a science text under two conditions: retrieval practice (free recall after reading) or control (concept mapping).
  • One week later, students were given a test on the text.

A Look at the Data

Encoding Condition as Quantitative

What happens if we convert “condition” to a quantitative variable (\(1 =\) Retrieval, \(0 =\) Concept) and run linear regression?

Two-Sample Inference

What does it mean to test \(H_0: \beta = 0\) in this case?

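It is equivalent to testing \(H_0: \mu_0 = \mu_1\), where \(\mu_0\) and \(\mu_1\) are the mean test scores under the Concept and Retrieval conditions, using a two-sample \(t\)-test with pooled variance. With this coding, \(\alpha = \mu_0\) and \(\beta = \mu_1 - \mu_0\).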
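The equivalence can be checked numerically. The scores below are simulated purely for illustration (group sizes, means, and standard deviation are assumptions); they are not the Karpicke & Blunt data. The sketch fits the regression on the 0/1 variable and runs `t.test()` with `var.equal = TRUE`.

```r
# Simulated scores for illustration only; these are NOT the Karpicke & Blunt data.
set.seed(203)
score     <- c(rnorm(20, mean = 0.5, sd = 0.2),   # hypothetical "Concept" group
               rnorm(20, mean = 0.7, sd = 0.2))   # hypothetical "Retrieval" group
condition <- rep(c(0, 1), each = 20)              # 0 = Concept, 1 = Retrieval

fit    <- lm(score ~ condition)
pooled <- t.test(score[condition == 1], score[condition == 0], var.equal = TRUE)

# The slope equals the difference in group means ...
c(slope = coef(fit)[["condition"]],
  diff  = mean(score[condition == 1]) - mean(score[condition == 0]))

# ... and the slope's t-statistic and p-value match the pooled t-test.
coef(summary(fit))["condition", c("t value", "Pr(>|t|)")]
c(pooled$statistic, p = pooled$p.value)
```

Dropping `var.equal = TRUE` gives Welch's test, which generally does not match the regression output exactly.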

Inference for Other Parameters

CI for the Mean Response

The value of the line at \(X = x^*\) is a parameter: \[ \mu_{x^*} = \alpha + \beta x^*. \]

An estimate of \(\mu_{x^*}\) is the prediction from the least-squares line: \[ \hat Y_{x^*} = \hat\alpha + \hat\beta x^*. \]

The variance of this estimate is \(\displaystyle \text{Var}[\hat Y_{x^*}] = \frac{\sigma^2}{n} + \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2} (x^* - \bar x)^2\).

Therefore, a 95% confidence interval for \(\mu_{x^*}\) is \[ \hat Y_{x^*} \pm t_{n-2, 0.975} \underbrace{\sqrt{\frac{S^2}{n} + \frac{S^2}{\sum_{i=1}^n (x_i - \bar x)^2} (x^* - \bar x)^2}}_{\text{SE}[\hat Y_{x^*}]}. \]

Example: Old Faithful Geyser

We can predict the waiting time until the next eruption based on the duration of the last eruption.

Example: Old Faithful Geyser

On average, how long does one have to wait until the next eruption when the last eruption lasted 4.0 minutes?

\[ {\scriptsize \hat Y_{x^*} \pm t_{n-2, 0.975} \sqrt{\frac{S^2}{n} + \frac{S^2}{\sum_{i=1}^n (x_i - \bar x)^2} (x^* - \bar x)^2}.} \]
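A sketch of this calculation at \(x^* = 4.0\), assuming R's built-in `faithful` data are the Old Faithful data in question. The interval is computed term by term from the formula and then compared with `predict(..., interval = "confidence")`.

```r
fit    <- lm(waiting ~ eruptions, data = faithful)
x      <- faithful$eruptions
n      <- length(x)
x_star <- 4.0

S2    <- sum(residuals(fit)^2) / (n - 2)
Sxx   <- sum((x - mean(x))^2)
y_hat <- coef(fit)[["(Intercept)"]] + coef(fit)[["eruptions"]] * x_star

# SE of the estimated mean response at x_star
se_mean <- sqrt(S2 / n + S2 / Sxx * (x_star - mean(x))^2)
t_crit  <- qt(0.975, df = n - 2)

ci_by_hand <- c(fit = y_hat,
                lwr = y_hat - t_crit * se_mean,
                upr = y_hat + t_crit * se_mean)
ci_by_hand
predict(fit, newdata = data.frame(eruptions = x_star),
        interval = "confidence", level = 0.95)
```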

Confidence Band for the Mean Response

Let’s use R to calculate the confidence interval at each \(x^*\) and draw a confidence band around the least-squares line.
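One way to draw such a band, sketched with base graphics and the built-in `faithful` data (the grid size, axis labels, and line styles are arbitrary choices): evaluate the interval on a grid of \(x^*\) values and connect the endpoints.

```r
fit  <- lm(waiting ~ eruptions, data = faithful)

# Grid of x* values spanning the observed eruption durations
grid <- data.frame(eruptions = seq(min(faithful$eruptions),
                                   max(faithful$eruptions),
                                   length.out = 100))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

plot(waiting ~ eruptions, data = faithful, col = "grey50",
     xlab = "Duration of last eruption (min)",
     ylab = "Waiting time until next eruption (min)")
lines(grid$eruptions, band[, "fit"], lwd = 2)   # least-squares line
lines(grid$eruptions, band[, "lwr"], lty = 2)   # lower edge of the band
lines(grid$eruptions, band[, "upr"], lty = 2)   # upper edge of the band
```

The band is narrowest near \(\bar x\) and widens toward the ends of the range, as the \((x^* - \bar x)^2\) term in the standard error suggests.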

Prediction Intervals

If the last eruption lasted 4.0 minutes, a visitor to Yellowstone is more interested in the actual waiting time until the next eruption than in the average waiting time.

Prediction Interval

A new observation at \(X = x^*\) is assumed to be generated as

\[ Y = \alpha + \beta x^* + \epsilon. \]

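A prediction interval takes into account the variability from estimating \(\alpha\) and \(\beta\), as well as the variability in the noise \(\epsilon\). Because the new \(\epsilon\) is independent of the data used to fit the line, the variance of the prediction error \(Y - \hat Y_{x^*}\) is the sum of the two: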

\[ {\small \text{Var}[\hat Y_{x^*}] + \text{Var}[\epsilon] = \frac{\sigma^2}{n} + \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2} (x^* - \bar x)^2 + \sigma^2} \]

Therefore, a 95% prediction interval for a new observation \(Y\) at \(X = x^*\) is \[ {\small \hat Y_{x^*} \pm t_{n-2, 0.975} \sqrt{\frac{S^2}{n} + \frac{S^2}{\sum_{i=1}^n (x_i - \bar x)^2} (x^* - \bar x)^2 + S^2}.} \]

Example: Old Faithful Geyser

How long will it be until the next eruption if the last eruption lasted 4.0 minutes?

\[ {\scriptsize \hat Y_{x^*} \pm t_{n-2, 0.975} \sqrt{\frac{S^2}{n} + \frac{S^2}{\sum_{i=1}^n (x_i - \bar x)^2} (x^* - \bar x)^2 + S^2}.} \]
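A sketch of the prediction interval at \(x^* = 4.0\), again assuming the built-in `faithful` data. It starts from the standard error of the mean response that `predict()` returns with `se.fit = TRUE`, adds \(S^2\) under the square root, and compares the result with `interval = "prediction"`.

```r
fit   <- lm(waiting ~ eruptions, data = faithful)
new_x <- data.frame(eruptions = 4.0)

# se.fit is the SE of the mean response; residual.scale is S.
p       <- predict(fit, newdata = new_x, se.fit = TRUE)
se_pred <- sqrt(p$se.fit^2 + p$residual.scale^2)   # adds S^2 under the square root
t_crit  <- qt(0.975, df = p$df)

pi_by_hand <- c(fit = p$fit[[1]],
                lwr = p$fit[[1]] - t_crit * se_pred[[1]],
                upr = p$fit[[1]] + t_crit * se_pred[[1]])
pi_by_hand
predict(fit, newdata = new_x, interval = "prediction", level = 0.95)
```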

Prediction Band for Individual Observations

Let’s use R to calculate the prediction interval at each \(x^*\) and draw a prediction band around the least-squares line.
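A possible sketch with base graphics and the built-in `faithful` data, drawing the prediction band together with the confidence band for comparison (grid size and line styles are arbitrary choices):

```r
fit  <- lm(waiting ~ eruptions, data = faithful)
grid <- data.frame(eruptions = seq(min(faithful$eruptions),
                                   max(faithful$eruptions),
                                   length.out = 100))
conf_band <- predict(fit, newdata = grid, interval = "confidence")
pred_band <- predict(fit, newdata = grid, interval = "prediction")

plot(waiting ~ eruptions, data = faithful, col = "grey50",
     xlab = "Duration of last eruption (min)",
     ylab = "Waiting time until next eruption (min)")
lines(grid$eruptions, pred_band[, "fit"], lwd = 2)
matlines(grid$eruptions, conf_band[, c("lwr", "upr")], lty = 2, col = "black")
matlines(grid$eruptions, pred_band[, c("lwr", "upr")], lty = 3, col = "black")
legend("topleft", bty = "n",
       legend = c("least-squares line", "95% confidence band", "95% prediction band"),
       lty = c(1, 2, 3), lwd = c(2, 1, 1))
```

The prediction band contains the confidence band at every \(x^*\), since its standard error has the extra \(S^2\) term.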

Confidence vs. Prediction Intervals

Both intervals are centered at \(\hat Y_{x^*} = \hat\alpha + \hat\beta x^*\) and have the form \(\hat Y_{x^*} \pm t_{n-2, 0.975} \cdot \text{SE}\).

| | SE formula | Interpretation | Requires Normality? |
|---|---|---|---|
| Confidence interval | \(\sqrt{\frac{S^2}{n} + \frac{S^2}{\sum_{i=1}^n (x_i - \bar x)^2}(x^* - \bar x)^2}\) | Captures the true mean response \(\mu_{x^*} = \alpha + \beta x^*\) | No |
| Prediction interval | \(\sqrt{\frac{S^2}{n} + \frac{S^2}{\sum_{i=1}^n (x_i - \bar x)^2}(x^* - \bar x)^2 + S^2}\) | Captures a new individual observation \(Y = \alpha + \beta x^* + \epsilon\) | Yes |

The prediction interval is always wider because it adds \(S^2\) under the square root to account for the noise \(\epsilon\) in the new observation.

Recap

In light of what we discussed today, take 5 minutes to try to retrieve as much information from memory as possible.