
Stats 203
All of us got different lines. How do we know which line is best?

Calculate the sum of squared errors between the points and the line.
\[\text{SSE}(\alpha, \beta) = \sum_{i=1}^n (y_i - (\alpha + \beta x_i))^2\]
How do we find the values of \(\alpha\) and \(\beta\) that minimize \(\text{SSE}\)?
To minimize \(\text{SSE}(\alpha, \beta) = \sum_{i=1}^n (y_i - (\alpha + \beta x_i))^2\), take derivatives, set equal to \(0\), and solve for \((\alpha, \beta)\).
\[ \frac{\partial (\text{SSE})}{\partial \alpha} = 0 \]
\[ \frac{\partial (\text{SSE})}{\partial \beta} = 0 \]
\[ \begin{align*} \hat{\alpha} &= \bar y - \hat{\beta} \bar x & \hat\beta &= \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} \end{align*} \]
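These formulas translate directly into R. A minimal sketch, assuming x and y are numeric vectors of equal length (hypothetical names):

```r
# Least-squares coefficients computed straight from the formulas above.
# x and y are hypothetical numeric vectors of equal length.
beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)
```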
The lm function in R calculates these coefficients for you.
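For example, a sketch assuming the supernova data sit in a hypothetical data frame supernovas with columns velocity and distance:

```r
# Fit the least-squares line distance = alpha + beta * velocity.
# supernovas is a hypothetical data frame with columns
# velocity (parsecs/year) and distance (parsecs).
fit <- lm(distance ~ velocity, data = supernovas)
coef(fit)  # intercept (alpha-hat) and slope (beta-hat)
```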
The least-squares line is \[ \hat y = -3931447 + 14853755377 x. \]
How far is a supernova that is moving away at a velocity of \(0.020\) parsecs/year?
Simply plug \(x = 0.020\) into the equation of the line: \[ \hat y = -3931447 + 14853755377 (0.020) = 293143661 \text{ (parsecs)}. \]
R can make predictions for you, but the data must be in a data.frame.
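Continuing the sketch above (fit and the column name velocity are assumptions):

```r
# predict() requires the new data in a data.frame whose column name
# matches the predictor in the model formula.
predict(fit, newdata = data.frame(velocity = 0.020))
```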
The least-squares line, also called the regression line, is \(\hat y = \hat\alpha + \hat\beta x\), where \[ \begin{align*} \hat{\alpha} &= \bar y - \hat{\beta} \bar x & \hat\beta &= \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}. \end{align*} \]
How do we interpret this line?

To understand this formula, let’s look at Galton’s study of heights of parents and offspring, which led him to coin the term “regression”.
What is the relationship between heights of fathers and their children?
Rewriting the slope in terms of the sample correlation \(r\) and the standard deviations \(s_x\) and \(s_y\), Galton's data give \[\hat\beta = r\frac{s_y}{s_x} \approx 0.4.\]
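This identity is easy to check numerically. A sketch, with hypothetical height vectors father and son:

```r
# The least-squares slope equals the sample correlation times the ratio
# of standard deviations. father and son are hypothetical numeric vectors.
cor(father, son) * sd(son) / sd(father)  # beta-hat from the identity
coef(lm(son ~ father))[2]                # the same slope from lm
```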
What would you predict is the height of a son whose father is 70.2 in. tall?
Francis Galton called this phenomenon regression to the mean, which is why the least-squares line is often called the regression line.
Earlier, we saw that a 70.2 in. tall father (1 in. above average) is predicted to have a son who is 69.6 in. tall (0.4 in. above average).
Now, if a son is 69.6 in. tall, what is the predicted height of the father? Not 70.2 in.: the regression of father's height on son's height has slope \(r\,s_x/s_y\), not \(1/\hat\beta\), so this prediction is also pulled toward the mean.
Moral: The regression of \(Y\) on \(X\) is not the same as the regression of \(X\) on \(Y\).
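We can see the moral in R with the same hypothetical vectors as above:

```r
# The slope of son on father is r * sd(son)/sd(father); the slope of
# father on son is r * sd(father)/sd(son). Their product is r^2, so
# one line is not the inverse of the other.
coef(lm(son ~ father))  # regression of son's height on father's
coef(lm(father ~ son))  # regression of father's height on son's
```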
Suppose we have \(n\) observations: \((x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\).
We can write these observations in terms of vectors:
\[ \begin{align} {\bf x} &= (x_1, x_2, \dots, x_n) & {\bf y} &= (y_1, y_2, \dots, y_n) \end{align} \]
The corresponding fitted values are: \[ \underbrace{\begin{pmatrix} \hat y_1 \\ \hat y_2 \\ \vdots \\ \hat y_n \end{pmatrix}}_{\hat{\bf y}} = \begin{pmatrix} \alpha + \beta x_1 \\ \alpha + \beta x_2 \\ \vdots \\ \alpha + \beta x_n \end{pmatrix} = \alpha\, \underbrace{\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}}_{{\bf 1}} + \beta\, \underbrace{\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}}_{{\bf x}}. \]
The vector of fitted values \(\hat {\bf y}\) is a linear combination of \({\bf 1}\) and \({\bf x}\), so it is in their span: \[ \hat {\bf y} \in \text{span}({\bf 1}, {\bf x}). \]
Remember that we choose \(\alpha\) and \(\beta\) to minimize
\[ \begin{align} \text{SSE}(\alpha, \beta) &= \sum_{i=1}^n (y_i - (\alpha + \beta x_i))^2 \end{align} \]
This is equivalent to choosing \(\hat{\bf y}\) to minimize
\[ \begin{align} || {\bf y} - &\hat{\bf y} ||^2 \\ \text{subject to } &\hat {\bf y} \in \text{span}({\bf 1}, {\bf x}) \end{align} \]
The minimizer is \(\hat{\bf y} = P_{\text{span}({\bf 1}, {\bf x})}\,{\bf y}\), the projection of \({\bf y}\) onto the span.
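We can verify this numerically. A sketch, again assuming hypothetical numeric vectors x and y: build the projection matrix onto \(\text{span}({\bf 1}, {\bf x})\) and compare with the fitted values from lm.

```r
# Project y onto span(1, x) using the projection matrix
# P = X (X'X)^{-1} X', where X has columns 1 and x.
X <- cbind(1, x)
P <- X %*% solve(t(X) %*% X) %*% t(X)
y_hat <- as.vector(P %*% y)
all.equal(y_hat, unname(fitted(lm(y ~ x))))  # TRUE
```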

Visualization for \(n = 3\)

We see that \({\bf y} - \hat{\bf y}\) is orthogonal to both \({\bf 1}\) and \({\bf x}\), so
\[ \begin{align} {\bf 1} \cdot ({\bf y} - \hat{\bf y}) &= 0 & {\bf x} \cdot ({\bf y} - \hat{\bf y}) &= 0, \end{align} \]
which are the equations that we solve for \(\alpha\) and \(\beta\) (no derivatives required)!
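In coordinates, these say \(\sum_{i} (y_i - \hat y_i) = 0\) and \(\sum_{i} x_i (y_i - \hat y_i) = 0\), which is easy to check in R (same hypothetical x and y):

```r
# The residual vector y - y-hat is orthogonal to 1 and to x.
e <- resid(lm(y ~ x))
sum(e)      # approximately 0: orthogonal to the all-ones vector
sum(x * e)  # approximately 0: orthogonal to x
```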
Since \(\hat{\bf y} - \bar y\,{\bf 1}\) lies in \(\text{span}({\bf 1}, {\bf x})\) and \({\bf y} - \hat{\bf y}\) is orthogonal to it, by the Pythagorean Theorem we see that
\[ \| {\bf y} - \bar y\,{\bf 1} \|^2 = \| \hat{\bf y} - \bar y\,{\bf 1} \|^2 + \| {\bf y} - \hat{\bf y} \|^2, \]
that is, \(\sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (\hat y_i - \bar y)^2 + \sum_{i=1}^n (y_i - \hat y_i)^2\).
This identity is not easy to prove by algebra!
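It is, however, easy to check numerically (same hypothetical x and y):

```r
# The sum-of-squares decomposition from the Pythagorean Theorem.
y_hat <- fitted(lm(y ~ x))
sum((y - mean(y))^2)                           # left-hand side
sum((y_hat - mean(y))^2) + sum((y - y_hat)^2)  # right-hand side (equal)
```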
Next time: The linear regression model (with assumptions about distributions).