
Stats 203
The linear regression model assumes that the data are generated as \[ \begin{align} Y_i &= \alpha + \beta x_i + \epsilon_i, & i=1, 2, \dots, n \end{align} \]
where \(x_i\) are fixed and \(\epsilon_i \overset{i.i.d.}{\sim} \text{Normal}(0, \sigma^2)\).
Alternative description: equivalently, the \(Y_i\) are independent with \(Y_i \sim \text{Normal}(\alpha + \beta x_i, \sigma^2)\).
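As a concrete illustration, here is a minimal sketch of simulating data from this model in Python; the particular values of \(\alpha\), \(\beta\), \(\sigma\), \(n\), and the \(x_i\) are arbitrary choices made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example values, chosen only for illustration.
n = 50
alpha, beta, sigma = 1.0, 2.0, 0.5
x = np.linspace(0.0, 10.0, n)      # the fixed x_i

# Y_i = alpha + beta * x_i + eps_i with eps_i i.i.d. Normal(0, sigma^2)
eps = rng.normal(0.0, sigma, size=n)
y = alpha + beta * x + eps
```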
To estimate parameters in the linear regression model, we use maximum likelihood (as in any probabilistic model).
\[ {\small L(\alpha, \beta, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{1}{2\sigma^2} (y_1 - (\alpha + \beta x_1))^2} \dots \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{1}{2\sigma^2} (y_n - (\alpha + \beta x_n))^2}} \]
We usually take logs.
\[ \ell(\alpha, \beta, \sigma^2) = -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - (\alpha + \beta x_i))^2. \]
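Written as code, the log-likelihood is a direct transcription of this formula (a sketch, continuing with the NumPy setup above):

```python
def log_likelihood(alpha, beta, sigma2, x, y):
    """Log-likelihood of (alpha, beta, sigma2) under the normal linear regression model."""
    n = len(y)
    resid = y - (alpha + beta * x)
    return -n / 2 * np.log(2 * np.pi * sigma2) - np.sum(resid**2) / (2 * sigma2)
```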
To estimate \(\alpha\) and \(\beta\), we maximize \(\ell\)—which, as a function of \(\alpha\) and \(\beta\), is of the form \[ \text{constant} - \text{constant} \sum_{i=1}^n (y_i - (\alpha + \beta x_i))^2. \]
That is, to maximize the (log-)likelihood, it is equivalent to minimize \[ \sum_{i=1}^n (y_i - (\alpha + \beta x_i))^2. \]
But this is just the SSE criterion from last time!
In other words, the least-squares coefficients are also the MLE.
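A quick numerical sketch of this equivalence, continuing from the snippets above and using the usual closed-form least-squares formulas \(\hat\beta = \sum_i (x_i - \bar x)(y_i - \bar y) / \sum_i (x_i - \bar x)^2\) and \(\hat\alpha = \bar y - \hat\beta \bar x\): perturbing either coefficient can only decrease the log-likelihood.

```python
# Closed-form least-squares coefficients.
xbar, ybar = x.mean(), y.mean()
beta_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
alpha_hat = ybar - beta_hat * xbar

# With sigma^2 fixed, the log-likelihood is constant - constant * SSE,
# so the least-squares coefficients maximize it.
sigma2_hat = np.mean((y - (alpha_hat + beta_hat * x)) ** 2)
best = log_likelihood(alpha_hat, beta_hat, sigma2_hat, x, y)
assert best >= log_likelihood(alpha_hat + 0.01, beta_hat, sigma2_hat, x, y)
assert best >= log_likelihood(alpha_hat, beta_hat + 0.01, sigma2_hat, x, y)
```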
To estimate \(\sigma^2\), we plug in the least-squares estimates \(\hat\alpha\) and \(\hat\beta\) (which do not depend on \(\sigma^2\)), and maximize \[ -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - (\hat\alpha + \hat\beta x_i))^2. \]
Taking derivatives with respect to \(\sigma^2\), we obtain: \[ - \frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (y_i - (\hat\alpha + \hat\beta x_i))^2 = 0, \]
and solving for \(\sigma^2\), we obtain \(\displaystyle \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (y_i - (\hat\alpha + \hat\beta x_i))^2\).
The problem with \(\displaystyle \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (y_i - (\hat\alpha + \hat\beta x_i))^2\) is that it is biased for \(\sigma^2\).
\[ E\left[ \hat\sigma^2 \right] = \frac{n-2}{n} \sigma^2 \]
We can obtain an unbiased estimator by dividing by \(n-2\) instead of \(n\): \[ S^2 = \frac{1}{n-2} \sum_{i=1}^n (y_i - (\hat\alpha + \hat\beta x_i))^2. \]
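A small simulation sketch (reusing the arbitrary example parameters from above) illustrating the bias of the divide-by-\(n\) estimator and the unbiasedness of \(S^2\):

```python
def residual_ss(x, y):
    """Residual sum of squares from the least-squares fit."""
    xbar, ybar = x.mean(), y.mean()
    b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    a = ybar - b * xbar
    return np.sum((y - (a + b * x)) ** 2)

reps = 5000
sse = np.array([
    residual_ss(x, alpha + beta * x + rng.normal(0.0, sigma, size=n))
    for _ in range(reps)
])
print("mean of SSE/n:    ", (sse / n).mean())        # approx (n-2)/n * sigma^2 (biased low)
print("mean of SSE/(n-2):", (sse / (n - 2)).mean())  # approx sigma^2 (unbiased)
print("targets:          ", (n - 2) / n * sigma**2, sigma**2)
```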
Recall that \(\hat\beta = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\).
The easiest way to calculate the expectation is to first write \[ Y_i - \bar Y = (\alpha + \beta x_i + \epsilon_i) - \frac{1}{n} \sum_{j=1}^n (\alpha + \beta x_j + \epsilon_j) = \textcolor{red}{\beta (x_i - \bar x) + (\epsilon_i - \bar\epsilon)}, \]
and substitute into the expression above: \[ \hat\beta = \dfrac{\sum_{i=1}^n (x_i - \bar{x})\left[\textcolor{red}{\beta (x_i - \bar{x}) + (\epsilon_i - \bar \epsilon)}\right]}{\sum_{i=1}^n (x_i - \bar{x})^2} = \beta + \dfrac{\sum_{i=1}^n (x_i - \bar{x})(\epsilon_i - \bar{\epsilon})}{\sum_{i=1}^n (x_i - \bar{x})^2}. \]
Taking expectations (using \(E[\epsilon_i] = E[\bar\epsilon] = 0\)):
\[ E[\hat\beta] = \beta + \frac{\sum_{i=1}^n (x_i - \bar{x})\, E[\epsilon_i - \bar\epsilon]}{\sum_{i=1}^n (x_i - \bar{x})^2} = \beta. \qquad \checkmark \]
The easiest way to calculate the variance is to use \(\sum (x_i - \bar x) = 0\) to write
\[ \hat\beta = \dfrac{\displaystyle\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{\displaystyle\sum_{i=1}^n (x_i - \bar{x})^2} = \dfrac{\displaystyle \sum_{i=1}^n (x_i - \bar{x})Y_i}{\displaystyle \sum_{i=1}^n (x_i - \bar{x})^2} - \underbrace{\dfrac{\displaystyle \sum_{i=1}^n (x_i - \bar{x})\bar Y}{\displaystyle \sum_{i=1}^n (x_i - \bar{x})^2}}_0. \]
Taking variances (recalling that \(\text{Var}[cW] = c^2 \text{Var}[W]\) and \(\text{Var}[V + W] = \text{Var}[V] + \text{Var}[W]\) when \(V\) and \(W\) are independent):
\[ \text{Var}[\hat\beta] = \dfrac{\displaystyle \sum_{i=1}^n (x_i - \bar x)^2 \text{Var}[Y_i]}{\displaystyle \left( \sum_{i=1}^n (x_i - \bar x)^2 \right)^2} = \frac{\sigma^2}{\displaystyle\sum_{i=1}^n (x_i - \bar x)^2}. \]
From the previous slide, we see that \[ \hat\beta = \sum_{i=1}^n \underbrace{\frac{(x_i - \bar x)}{\sum_{j=1}^n (x_j - \bar x)^2}}_{c_i} Y_i, \]
a linear combination of the independent normal random variables \(Y_1, \dots, Y_n\).
Any linear combination of independent normals is also normal, so \[ \hat\beta \sim \text{Normal}\left(\beta, \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}\right). \]
We can convince ourselves of this by simulation.
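A sketch of such a simulation (again under the arbitrary example parameters above): generate many data sets from the model, refit \(\hat\beta\) each time, and compare the empirical mean and variance to \(\beta\) and \(\sigma^2 / \sum_{i}(x_i - \bar x)^2\).

```python
def fit_beta(x, y):
    """Least-squares slope estimate."""
    xbar, ybar = x.mean(), y.mean()
    return np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)

beta_hats = np.array([
    fit_beta(x, alpha + beta * x + rng.normal(0.0, sigma, size=n))
    for _ in range(5000)
])

theoretical_var = sigma**2 / np.sum((x - x.mean()) ** 2)
print(beta_hats.mean(), beta)                  # empirical mean vs. beta
print(beta_hats.var(ddof=1), theoretical_var)  # empirical vs. theoretical variance
```

A histogram of beta_hats overlaid with the Normal density above would show the same agreement visually.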
The catch with working from a model is that everything we have derived depends on the model being correct.
\[ \begin{align} Y_i &= \alpha + \beta x_i + \epsilon_i, & \epsilon_i &\sim \text{Normal}(0, \sigma^2) & i=1, \dots, n \end{align} \]
What are the assumptions in this expression?
How do we know if these assumptions are satisfied?
We do not observe the error \[ \epsilon_i = Y_i - (\alpha + \beta x_i), \]
but we can estimate it by the residual \[ e_i = Y_i - \underbrace{(\hat\alpha + \hat\beta x_i)}_{\hat Y_i}. \]
We can use the residuals to diagnose violations of the assumptions.
A residual plot is a plot of the residuals versus \(x\).
Ideally, the residuals should look randomly and evenly scattered around the \(y=0\) line.
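As an illustration, a residual plot for the fit above could be drawn as follows (a sketch using matplotlib, reusing the alpha_hat and beta_hat computed earlier):

```python
import matplotlib.pyplot as plt

# Residuals e_i = Y_i - Yhat_i from the least-squares fit.
resid = y - (alpha_hat + beta_hat * x)

plt.scatter(x, resid)
plt.axhline(0.0, color="gray")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```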
Don’t try to eyeball it! A quantile-quantile (Q-Q) plot is more reliable.

If the errors were normal, the residuals would fall approximately along the reference line in the Q-Q plot.
Because they do not in this example, there is evidence that the errors are not normal.
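A sketch of how to draw such a Q-Q plot for the residuals, using SciPy's probplot (which also adds a reference line):

```python
from scipy import stats

# Normal Q-Q plot of the residuals against theoretical normal quantiles.
stats.probplot(resid, dist="norm", plot=plt)
plt.show()
```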
Because we only observe one realization of each residual, it is not possible to test for independence.
Even if it were, the residuals are not independent, because they all depend on the same \(\hat\alpha\) and \(\hat\beta\): for instance, the least-squares fit forces \(\sum_{i=1}^n e_i = 0\) and \(\sum_{i=1}^n x_i e_i = 0\).
One common heuristic is to plot the residuals against their order in the data set (sometimes, the data is ordered in a way that reveals nonindependence).
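A sketch of that heuristic plot, continuing with the residuals computed above:

```python
# Residuals plotted against their order in the data set.
plt.plot(np.arange(1, n + 1), resid, marker="o")
plt.xlabel("observation order")
plt.ylabel("residual")
plt.show()
```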
Next time: Using this model to do inference (tests, intervals).