Regression refresher
Tests related to the slope in regression
Confidence interval for the slope in regression
Other examples of confidence intervals
Let \(X_i=\texttt{mother}_i\) and \(Y_i=\texttt{daughter}_i\):
The regression line is the line
\[ y = \hat{\beta}_0 + \hat{\beta}_1 * x \]
where \((\hat{\beta}_0, \hat{\beta}_1)\) are found by
\[ \text{minimize}_{\beta_0, \beta_1} \sum_{i=1}^n (Y_i - (\beta_0 + \beta_1 X_i))^2 \]
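The minimization above has a closed-form solution: \(\hat{\beta}_1 = \text{cov}(X, Y)/\text{var}(X)\) and \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\). A minimal Python sketch on made-up mother/daughter heights (all numbers are illustrative, not from the course data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mother/daughter heights in inches (illustrative values only)
mother = rng.normal(64.0, 2.5, size=200)
daughter = 30.0 + 0.5 * mother + rng.normal(0.0, 1.5, size=200)

# Closed-form least-squares solution to the minimization above:
#   beta1_hat = cov(X, Y) / var(X),  beta0_hat = mean(Y) - beta1_hat * mean(X)
beta1_hat = np.cov(mother, daughter, ddof=1)[0, 1] / np.var(mother, ddof=1)
beta0_hat = daughter.mean() - beta1_hat * mother.mean()

# Same answer via a degree-1 polynomial fit
b1, b0 = np.polyfit(mother, daughter, deg=1)
print(beta1_hat, b1)
```

Both routes give the same \((\hat{\beta}_0, \hat{\beta}_1)\), since `np.polyfit` with `deg=1` solves exactly this least-squares problem.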
A model for how the \(\texttt{(mother, daughter)}\) pairs are generated:
A value for \(\texttt{mother}\) is drawn from \(\texttt{box}_{\texttt{mother}}\) (i.e. a distribution with its own probability histogram)
A value \(\epsilon\) for the error is drawn from \(\texttt{box}_{\epsilon}\)
For some true slope and intercept we draw the value for \(\texttt{daughter}\):
\[\texttt{daughter} = \beta_0 + \beta_1 * \texttt{mother} + \epsilon\]
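The three generative steps above can be sketched directly in Python (the parameter values and the normal "boxes" are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical true parameters (assumed, for illustration)
beta0, beta1 = 30.0, 0.5

# Step 1: draw mother's height from box_mother (here a normal "box")
mother = rng.normal(64.0, 2.5)
# Step 2: draw the error from box_epsilon, independently of mother
eps = rng.normal(0.0, 1.5)
# Step 3: daughter = beta0 + beta1 * mother + eps
daughter = beta0 + beta1 * mother + eps
print(mother, daughter)
```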
There are two boxes (distributions / probability histograms):
The box for \(\texttt{mother}\)
The box for \(\epsilon\)
There are two parameters:
\(\beta_0\): intercept
\(\beta_1\): slope
This is not the only model for regression, but it’s (relatively) simple…
The important part of this model is the way \(\texttt{daughter}\) is generated after having fixed \(\texttt{mother}\): the mean (\(\beta_0 + \beta_1 * \texttt{mother}\)) plus an error independent of \(\texttt{mother}\).
Assume there is some big box (e.g. a population) of pairs of \((\texttt{mother}, \texttt{daughter})\) heights.
We draw a simple random sample from this population (which will be close to sampling with replacement).
For some populations our first model will still be reasonable.
For other populations our model won’t be appropriate. This will mostly affect our estimate of \(\text{SE}(\hat{\beta}_1)\)…
If our regression model is appropriate, then:
\(E[\hat{\beta}_1] = \beta_1\)
\(\text{SE}(\hat{\beta}_1) \approx \sqrt{\frac{1}{n} \frac{\texttt{MSE}(\texttt{box}_{\epsilon})}{\texttt{MSE}(\texttt{box}_{\texttt{mother}})}}\)
The probability histogram of \(\hat{\beta}_1\) is approximately a normal curve.
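All three claims can be checked by simulation: generate many datasets from the model, fit the slope each time, and compare the empirical mean and SD of \(\hat{\beta}_1\) with \(\beta_1\) and the SE formula. A Python sketch with assumed parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta0, beta1 = 100, 30.0, 0.5
sd_mother, sd_eps = 2.5, 1.5   # assumed SDs of box_mother and box_epsilon

slopes = []
for _ in range(2000):
    x = rng.normal(64.0, sd_mother, size=n)
    y = beta0 + beta1 * x + rng.normal(0.0, sd_eps, size=n)
    slopes.append(np.polyfit(x, y, 1)[0])   # fitted slope for this dataset

# SE formula: sqrt((1/n) * MSE(box_eps) / MSE(box_mother))
se_formula = np.sqrt((1 / n) * sd_eps**2 / sd_mother**2)
print(np.mean(slopes), np.std(slopes), se_formula)
```

The empirical mean of the slopes lands on \(\beta_1\) and their empirical SD matches the SE formula; a histogram of `slopes` would look like a normal curve.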
To test the null hypothesis \(\beta_1 = 0\) (no slope), use
\[ Z = \frac{\hat{\beta}_1 - 0}{\text{SE}(\hat{\beta}_1)} \]
Most software packages will also compute this \(Z\) score for you
Most software will refer to Student’s \(T\) distribution instead of standard normal… with \(n\) large this distinction is moot.
To test the null hypothesis \(\beta_1 = 1\), use
\[ Z = \frac{\hat{\beta}_1 - 1}{\text{SE}(\hat{\beta}_1)} \]
\[ CI_{0.95} = \hat{\beta}_1 \pm 2 * \text{SE}(\hat{\beta}_1) \]
The interval \(CI_{0.95}\) is random: different samples will give different intervals…
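Given a fitted slope and its SE, both \(Z\) statistics and the interval are one-line computations. A sketch with illustrative numbers (not from the slides' data):

```python
# Illustrative estimate and SE (assumed values, not from any real fit)
beta1_hat, se = 0.54, 0.06

z_null0 = (beta1_hat - 0) / se   # Z for testing beta1 = 0
z_null1 = (beta1_hat - 1) / se   # Z for testing beta1 = 1
ci = (beta1_hat - 2 * se, beta1_hat + 2 * se)   # CI_0.95
print(z_null0, z_null1, ci)      # roughly 9.0, -7.67, (0.42, 0.66)
```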
Let’s make some data for which our model holds:
confint for a randomly generated dataset

Assuming our regression model holds, there is a true (unknown) slope \(\beta_1\)
For every \(CI_{0.95}\) we can ask whether \(\beta_1\) is in the interval: this is an event, we can compute its probability
\[ P(\beta_1 \in CI_{0.95}) = P(\beta_1 \in \hat{\beta}_1 \pm 2 * \text{SE}(\hat{\beta}_1)) \approx 95\%. \]
\[ P(\hat{\beta}_1 \in \beta_1 \pm 2 * \text{SE}(\hat{\beta}_1)) \approx 95\% \]
Almost the same statement, but here the interval is not random.
Note: we are blurring the lines a little here between true SE and our estimate of SE…
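The coverage statement \(P(\beta_1 \in CI_{0.95}) \approx 95\%\) can be checked by simulation, including the blurring just mentioned: use the plug-in SE (residual MSE over MSE of the \(x\)'s) rather than the true SE, and count how often the interval catches \(\beta_1\). A Python sketch with assumed parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta0, beta1 = 100, 30.0, 0.5   # assumed true parameters

covered, reps = 0, 1000
for _ in range(reps):
    x = rng.normal(64.0, 2.5, size=n)
    y = beta0 + beta1 * x + rng.normal(0.0, 1.5, size=n)
    b1, b0 = np.polyfit(x, y, 1)
    resid = y - (b0 + b1 * x)
    # plug-in SE estimate: sqrt((1/n) * MSE(residuals) / MSE(x))
    se = np.sqrt((1 / n) * np.mean(resid**2) / np.var(x))
    covered += (b1 - 2 * se <= beta1 <= b1 + 2 * se)

print(covered / reps)   # should be near 0.95
```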
If you know the 95% confidence interval for \(\beta_1\), it is easy to check whether you would reject the hypothesis that the slope is 1, or any other value \(V\).
If \(V\) is in \(CI_{0.95}\), then we do not reject the null hypothesis that the slope is \(V\) at the 5% level…
Said differently, the interval \(CI_{0.95}\) consists of all the values \(V\) we would fail to reject at the 5% level.
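This duality between the interval and the test can be verified numerically; with \(\pm 2\) SE intervals, \(V\) lies outside \(CI_{0.95}\) exactly when \(|Z| > 2\). A sketch using hypothetical values for the estimate and its SE:

```python
# Hypothetical fitted slope and SE (illustrative values)
beta1_hat, se = 0.54, 0.06
ci_lo, ci_hi = beta1_hat - 2 * se, beta1_hat + 2 * se   # (0.42, 0.66)

def reject(V):
    """Two-sided test of H0: slope = V at (roughly) the 5% level."""
    return abs((beta1_hat - V) / se) > 2

for V in (0.0, 0.5, 1.0):
    in_ci = ci_lo <= V <= ci_hi
    print(V, "in CI:", in_ci, "| reject:", reject(V))
```

For every \(V\), `in_ci` and `reject(V)` are exact opposites: 0 and 1 fall outside the interval and are rejected, while 0.5 is inside and is not.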
mean(box)

\[ CI_{0.95} = \bar{X} \pm 2 * \sqrt{\texttt{MSE}(\texttt{box})/n} \]
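The same \(\pm 2\) SE recipe gives a confidence interval for the mean of a box, with \(\text{SE}(\bar{X}) = \sqrt{\texttt{MSE}(\texttt{box})/n}\). A Python sketch on a hypothetical sample (box mean and SD are assumed):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical sample of n = 50 draws from a box with mean 6.0, SD 1.2
sample = rng.normal(6.0, 1.2, size=50)

xbar = sample.mean()
se = np.sqrt(np.var(sample) / len(sample))   # plug-in sqrt(MSE(box)/n)
ci = (xbar - 2 * se, xbar + 2 * se)
print(ci)
```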
40 athletes: data is stored in groupA in R
mean(groupA) = 6.1
sd(groupA) = 1.2
50 athletes: data is stored in groupB in R
mean(groupB) = 6.3
sd(groupB) = 1.3
mean(boxA) - mean(boxB)

\[ Z = \frac{\texttt{observed test statistic} - E_{\texttt{null}}[\texttt{test statistic}]}{\text{SE}(\texttt{test statistic})} \]
\[ \texttt{observed test statistic} \pm 2 * \text{SE}(\texttt{test statistic}). \]
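With the two-group summaries above, the test statistic is \(\bar{X}_A - \bar{X}_B\), its null expectation is 0, and (for independent groups) \(\text{SE} = \sqrt{sd_A^2/n_A + sd_B^2/n_B}\). A Python sketch using the numbers from the slides:

```python
import math

# Summaries from the slides: two independent groups of athletes
nA, meanA, sdA = 40, 6.1, 1.2
nB, meanB, sdB = 50, 6.3, 1.3

# SE of the difference of two independent sample means
se_diff = math.sqrt(sdA**2 / nA + sdB**2 / nB)

# Z for H0: mean(boxA) = mean(boxB), so E_null[difference] = 0
z = (meanA - meanB) / se_diff          # about -0.76: no evidence of a difference

# 95% CI for mean(boxA) - mean(boxB)
ci = (meanA - meanB - 2 * se_diff, meanA - meanB + 2 * se_diff)
print(z, ci)
```

Consistent with the CI/test duality, 0 lies inside this interval and \(|Z| < 2\), so we do not reject equal means at the 5% level.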