Tests and confidence intervals in regression

STATS 60

Outline

Regression refresher
Tests related to the slope in regression
Confidence interval for the slope in regression
Other examples of confidence intervals

Mother & daughter data

Reminder: regression

Let \(X_i=\texttt{mother}_i\) and \(Y_i=\texttt{daughter}_i\):
The regression line is the line

\[ y = \hat{\beta}_0 + \hat{\beta}_1 * x \]

where \((\hat{\beta}_0, \hat{\beta}_1)\) are computed found by

\[ \text{minimize}_{\beta_0, \beta_1} \sum_{i=1}^n (Y_i - (\beta_0 + \beta_1 X_i))^2 \]

Best fitting line

A hypothesis test

If a line captures relationship between \(\texttt{mother}\) (\(X\)) and (\(Y\)), we might ask: ** Is the slope of the true regression of \(Y\) on \(X\) line (\(>,<,\neq\)) 0?**

What do we mean by true regression line?

Just like our other testing problems, we need a model (what we’ve been calling a box) to make sense of this question.

Regression model

A model for how the \(\texttt{(mother, daughter)}\) pairs are generated:
1. A value for \(\texttt{mother}\) is drawn from \(\texttt{box}_{\texttt{mother}}\) (i.e. a distribution with its own probability histogram)
2. A value \(\epsilon\) for the error is drawn from a \(\texttt{box}_{\epsilon}\)
3. For some true slope and intercept we draw the value for \(\texttt{daughter}\):

\[\texttt{daughter} = \beta_0 + \beta_1 * \texttt{mother} + \epsilon\]

There are two boxes (distributions / probability histograms):
- The box for \(\texttt{mother}\)
- The box for \(\epsilon\)
There are two parameters:
- \(\beta_0\): intercept
- \(\beta_1:\) slope

A note on our regression model

This is not the only model for regression, but it’s (relatively) simple…
The important part of this model is the way \(\texttt{daughter}\) is generated after having fixed \(\texttt{mother}\): the mean (\(\beta_0 + \beta_1 * \texttt{mother}\)) plus an error independent of \(\texttt{mother}\).

Another model

Assume there is some big box (e.g. a population) of pairs of \((\texttt{mother}, \texttt{daughter})\) heights.
We draw a simple random sample from this population (which will be close to sampling with replacement).
For some populations our first model will still be reasonable.
- Since scatterplot is football shaped it is a reasonable assumption for this data.
For other populations our model won’t be appropriate. This will mostly affect our estimate of \(\text{SE}(\hat{\beta}_1)\)…

Parameter estimates

Slope estimate

If our regression model is appropriate, then:
- \(E[\hat{\beta}_1] = \beta_1\)
- \(\text{SE}(\hat{\beta}_1) \approx \sqrt{\frac{1}{n} \frac{\texttt{MSE}(\texttt{box}_{\epsilon})}{\texttt{MSE}(\texttt{box}_{\texttt{mother}})}}\)
- The probability histogram of \(\hat{\beta}_1\) is approximately a normal curve.

Testing slope is 0

\[ Z = \frac{\hat{\beta}_1 - 0}{\text{SE}(\hat{\beta}_1)} \]

To carry out the test, use the usual rules for one and two-sided

Carrying out the test

Most software packages will compute \(\text{SE}(\hat{\beta}_1)\)

Finding the \(Z\) score in the output of the model

Most software packages will also compute this \(Z\) score for you
Most software will refer to Student’s \(T\) distribution instead of standard normal… with \(n\) large this distinction is moot.

Sometimes, testing 0 isn’t of interest

We might want to test the hypothesis \(\beta_1=1\):

\[ Z = \frac{\hat{\beta}_1 - 1}{\text{SE}(\hat{\beta}_1)} \]

Usual rules for one and two-sided

Plausible values for \(\beta_1\)

When there is no particular value of interest for the slope, it is common to summarize results with a confidence interval

95% confidence interval

\[ CI_{0.95} = \hat{\beta}_1 \pm 2 * \text{SE}(\hat{\beta}_1) \]

This is an interval of plausible values for the true parameter \(\beta_1\): captures our estimate of the parameter and its (estimated) variability.

Confidence intervals are random

The interval \(CI_{0.95}\) is random: different samples will give different intervals…
Let’s make some data for which our model holds:

We can compute confidence intervals using confint for a randomly generated dataset

Illustration of confidence intervals

Let’s create 100 data sets from the same model and look at the confidence intervals…

Interpretation of confidence intervals

Assuming our regression model holds, there is true (unknown) slope \(\beta_1\)
For every \(CI_{0.95}\) we can ask whether \(\beta_1\) is in the interval: this is an event, we can compute its probability

\[ P(\beta_1 \in CI_{0.95}) = P(\beta_1 \in \hat{\beta}_1 \pm 2 * \text{SE}(\hat{\beta}_1)) \approx 95\%. \]

Relation to 2 SE rule

Our 2 SE rule says that

\[ P(\hat{\beta}_1 \in \beta_1 \pm 2 * \text{SE}(\hat{\beta}_1)) \approx 95\% \]

Almost the same statement, but here the interval is not random.
Note: we are blurring the lines a little here between true SE and our estimate of SE…

Relationship between confidence intervals and tests

If you know the 95% confidence interval for \(\beta_1\), it is easy to check whether you would reject the hypothesis the slope is 1, or any other value \(V\).
If \(V\) is in \(CI_{0.95}\) then we do not reject the null hypothesis the slope is \(V\) at 5%…
Said differently, the interval \(CI_{0.95}\) are all the values we would fail to reject at level 5%.

Confidence intervals for `mean(box)`

Going back to our first testing problems, we can compute a 95% CI for \(\texttt{mean(box)}\):

\[ CI_{0.95} = \bar{X} \pm 2 * \sqrt{\texttt{MSE(box)}/n} \]

Example: athletic training program

Group A:

40 athletes: data is stored in groupA in R
mean(groupA) = 6.1
sd(groupA) = 1.2

Group B:

50 athletes: data is stored in groupB in R
mean(groupB) = 6.3
sd(groupB) = 1.3

Confidence intervals for `mean(boxA) - mean(boxB)`

General form of 95% confidence interval

We saw several tests with \(Z\) scores

\[ Z = \frac{\texttt{observed test statistic} - E_{\texttt{null}}[\texttt{test statistic}]}{\text{SE}(\texttt{test statistic})} \]

There is often a corresponding 95% confidence region:

\[ \texttt{observed test statistic} \pm 2 * \text{SE}(\texttt{test statistic}). \]