Linear Regression

web.stanford.edu/class/stats202

Sergio Bacallado, Jonathan Taylor

Autumn 2022

Simple linear regression

Model

\[y_i = \beta_0 + \beta_1 x_i +\varepsilon_i\]

\[ \begin{aligned} \text{RSS}(\beta_0,\beta_1) &= \sum_{i=1}^n (y_i -\hat y_i(\beta_0, \beta_1))^2 \\ & = \sum_{i=1}^n (y_i - \beta_0- \beta_1 x_i)^2. \end{aligned} \]

Sample code: advertising data

library(ISLR2)
Advertising = read.csv("https://www.statlearning.com/s/Advertising.csv", header=TRUE, row.names=1)
M.sales = lm(sales ~ TV, data=Advertising)
M.sales
## 
## Call:
## lm(formula = sales ~ TV, data = Advertising)
## 
## Coefficients:
## (Intercept)           TV  
##     7.03259      0.04754

Estimates \(\hat\beta_0\) and \(\hat\beta_1\)

A little calculus shows that the minimizers of the RSS are:

\[ \begin{aligned} \hat \beta_1 & = \frac{\sum_{i=1}^n (x_i-\overline x)(y_i-\overline y)}{\sum_{i=1}^n (x_i-\overline x)^2} \\ \hat \beta_0 & = \overline y- \hat\beta_1\overline x. \end{aligned} \]
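As a sanity check, these closed-form estimates can be computed directly and compared with the coefficients reported by lm() above. A minimal sketch using the Advertising data already loaded:

x = Advertising$TV
y = Advertising$sales
beta1.hat = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope
beta0.hat = mean(y) - beta1.hat * mean(x)                               # intercept
c(beta0.hat, beta1.hat)   # should agree with coef(M.sales)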

Assessing the accuracy of \(\hat \beta_0\) and \(\hat\beta_1\)

How variable is the regression line?

Based on our model, the standard errors of the least-squares estimates are

\[ \begin{aligned} \text{SE}(\hat\beta_0)^2 &= \sigma^2\left[\frac{1}{n}+\frac{\overline x^2}{\sum_{i=1}^n(x_i-\overline x)^2}\right] \\ \text{SE}(\hat\beta_1)^2 &= \frac{\sigma^2}{\sum_{i=1}^n(x_i-\overline x)^2}. \end{aligned}\]

These standard errors give approximate 95% confidence intervals for the coefficients:

\[ \begin{aligned} \hat\beta_0 &\pm 2\cdot\text{SE}(\hat\beta_0) \\ \hat\beta_1 &\pm 2\cdot\text{SE}(\hat\beta_1) \end{aligned} \]
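Both the standard errors and these approximate intervals can be reproduced by hand, plugging the residual standard error in for \(\sigma\), and compared with summary(M.sales) and confint(M.sales) below. A minimal sketch:

x = Advertising$TV;  n = length(x)
sigma.hat = summary(M.sales)$sigma            # residual standard error, estimates sigma
Sxx = sum((x - mean(x))^2)
se.beta0 = sigma.hat * sqrt(1/n + mean(x)^2 / Sxx)
se.beta1 = sigma.hat / sqrt(Sxx)
# approximate 95% intervals: estimate +/- 2 * SE
cbind(coef(M.sales) - 2 * c(se.beta0, se.beta1),
      coef(M.sales) + 2 * c(se.beta0, se.beta1))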

Hypothesis test

To test the null hypothesis \(H_0:\beta_1=0\) against \(H_1:\beta_1\neq 0\), we use the \(t\)-statistic

\[\quad t = \frac{\hat\beta_1 -0}{\text{SE}(\hat\beta_1)}.\]
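For the Advertising fit, this statistic and its two-sided \(p\)-value can be computed by hand and compared with the TV row of the coefficient table below. A minimal sketch:

x = Advertising$TV;  n = length(x)
se.beta1 = summary(M.sales)$sigma / sqrt(sum((x - mean(x))^2))
t.stat = coef(M.sales)["TV"] / se.beta1
c(t.stat, 2 * pt(abs(t.stat), df = n - 2, lower.tail = FALSE))   # t-statistic and p-value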

Sample output: advertising data

summary(M.sales)
## 
## Call:
## lm(formula = sales ~ TV, data = Advertising)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
confint(M.sales)
##                  2.5 %     97.5 %
## (Intercept) 6.12971927 7.93546783
## TV          0.04223072 0.05284256

Interpreting the hypothesis test

Multiple linear regression

Model

\[ \begin{aligned} y_i &= \beta_0 + \beta_1 x_{i1}+\dots+\beta_p x_{ip}+\varepsilon_i \\ Y &= \beta_0 + \beta_1 X_{1}+\dots+\beta_p X_{p}+\varepsilon \\ \end{aligned} \]

\[ \begin{aligned} \text{RSS}(\beta) &= \sum_{i=1}^n (y_i -\hat y_i(\beta))^2 \\ & = \sum_{i=1}^n (y_i - \beta_0- \beta_1 x_{i,1}-\dots-\beta_p x_{i,p})^2 \\ &= \|Y-X\beta\|^2_2 \end{aligned} \]

\[E({Y}) = {X}\beta\]

Multiple linear regression answers several questions

The estimates \(\hat\beta\)

Our goal again is to minimize the RSS: \[ \begin{aligned} \text{RSS}(\beta) &= \sum_{i=1}^n (y_i -\hat y_i(\beta))^2 \\ & = \sum_{i=1}^n (y_i - \beta_0- \beta_1 x_{i,1}-\dots-\beta_p x_{i,p})^2 \\ &= \|Y-X\beta\|^2_2 \end{aligned} \]
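Setting the gradient of the RSS to zero gives the normal equations \(X^TX\hat\beta = X^TY\), so \(\hat\beta = (X^TX)^{-1}X^TY\) whenever \(X^TX\) is invertible. A minimal check against lm(), using the Boston data from ISLR2 that appears below:

X = model.matrix(medv ~ ., data = Boston)       # design matrix, including the intercept
y = Boston$medv
beta.hat = solve(crossprod(X), crossprod(X, y)) # solves X'X beta = X'y
cbind(beta.hat, coef(lm(medv ~ ., data = Boston)))   # the two columns should agree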

Which variables are important?

M.Boston = lm(medv ~ ., data=Boston)
summary(M.Boston)
## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1304  -2.7673  -0.5814   1.9414  26.2526 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.617270   4.936039   8.431 3.79e-16 ***
## crim         -0.121389   0.033000  -3.678 0.000261 ***
## zn            0.046963   0.013879   3.384 0.000772 ***
## indus         0.013468   0.062145   0.217 0.828520    
## chas          2.839993   0.870007   3.264 0.001173 ** 
## nox         -18.758022   3.851355  -4.870 1.50e-06 ***
## rm            3.658119   0.420246   8.705  < 2e-16 ***
## age           0.003611   0.013329   0.271 0.786595    
## dis          -1.490754   0.201623  -7.394 6.17e-13 ***
## rad           0.289405   0.066908   4.325 1.84e-05 ***
## tax          -0.012682   0.003801  -3.337 0.000912 ***
## ptratio      -0.937533   0.132206  -7.091 4.63e-12 ***
## lstat        -0.552019   0.050659 -10.897  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.798 on 493 degrees of freedom
## Multiple R-squared:  0.7343, Adjusted R-squared:  0.7278 
## F-statistic: 113.5 on 12 and 493 DF,  p-value: < 2.2e-16
M0.Boston = lm(medv ~ . -indus - zn, data=Boston)
anova(M0.Boston, M.Boston)
## Analysis of Variance Table
## 
## Model 1: medv ~ (crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + lstat) - indus - zn
## Model 2: medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + lstat
##   Res.Df   RSS Df Sum of Sq      F   Pr(>F)   
## 1    495 11614                                
## 2    493 11349  2    264.07 5.7355 0.003449 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
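The \(F\)-statistic in this table can be reconstructed from the two residual sums of squares as \(F = \frac{(\text{RSS}_0 - \text{RSS})/q}{\text{RSS}/(n-p-1)}\), where \(q\) is the number of dropped predictors. A quick check:

rss0 = sum(resid(M0.Boston)^2)    # RSS of the reduced model (indus and zn dropped)
rss  = sum(resid(M.Boston)^2)     # RSS of the full model
f = ((rss0 - rss) / 2) / (rss / df.residual(M.Boston))
c(f, pf(f, 2, df.residual(M.Boston), lower.tail = FALSE))   # F-statistic and p-value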

How many variables are important?

How good are the predictions?

Advertising = read.csv("https://www.statlearning.com/s/Advertising.csv", header=TRUE, row.names=1)
M.sales = lm(sales ~ TV, data=Advertising)
predict(M.sales, data.frame(TV=c(50,150,250)), interval='confidence', level=0.95)
##         fit       lwr      upr
## 1  9.409426  8.722696 10.09616
## 2 14.163090 13.708423 14.61776
## 3 18.916754 18.206189 19.62732
predict(M.sales, data.frame(TV=c(50,150,250)), interval='prediction', level=0.95)
##         fit       lwr      upr
## 1  9.409426  2.946709 15.87214
## 2 14.163090  7.720898 20.60528
## 3 18.916754 12.451461 25.38205
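The confidence interval describes uncertainty about the average sales at a given TV budget, while the wider prediction interval also accounts for the noise \(\varepsilon\) in a single new observation. A minimal sketch reproducing both intervals at TV = 150 from the usual simple-regression formulas:

x = Advertising$TV;  n = length(x)
sigma.hat = summary(M.sales)$sigma
Sxx = sum((x - mean(x))^2)
x0 = 150
y0.hat = sum(coef(M.sales) * c(1, x0))
t.crit = qt(0.975, df = n - 2)
se.mean = sigma.hat * sqrt(1/n + (x0 - mean(x))^2 / Sxx)       # for the mean response
se.pred = sigma.hat * sqrt(1 + 1/n + (x0 - mean(x))^2 / Sxx)   # for a new observation
rbind(confidence = y0.hat + c(-1, 1) * t.crit * se.mean,
      prediction = y0.hat + c(-1, 1) * t.crit * se.pred)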

Dealing with categorical or qualitative predictors

For each qualitative predictor, e.g. Region:

  1. The model fit and predictions are independent of the choice of the baseline category.

  2. However, hypothesis tests derived from these variables are affected by the choice.

  3. Solution: To check whether region is important, use an \(F\)-test for the hypothesis \(\beta_\text{South}=\beta_\text{West}=0\) by dropping Region from the model. This does not depend on the coding (see the sketch after this list).

  4. Note that there are other ways to encode qualitative predictors that produce the same fit \(\hat f\), but the coefficients have different interpretations.
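A minimal sketch of the \(F\)-test in point 3, assuming the Credit data from the ISLR2 package, whose Region factor has levels East, South, and West (East is the baseline by default):

M.full = lm(Balance ~ Income + Region, data = Credit)   # Region enters as two dummy variables
M.red  = lm(Balance ~ Income, data = Credit)            # Region dropped entirely
anova(M.red, M.full)   # F-test of beta_South = beta_West = 0; does not depend on the coding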

Recap

So far, we have:

  1. Defined multiple linear regression.

  2. Discussed how to test the importance of variables.

  3. Described one approach to choosing a subset of variables.

  4. Explained how to code qualitative variables.

Now, how do we evaluate model fit? Is the linear model any good? What can go wrong?

How good is the fit?

Residuals

Potential issues in linear regression

  1. Interactions between predictors

  2. Non-linear relationships

  3. Correlation of error terms

  4. Non-constant variance of error (heteroskedasticity)

  5. Outliers

  6. High leverage points

  7. Collinearity

Interactions between predictors

\[\mathtt{sales} = \beta_0 + \beta_1\times\mathtt{tv}+ \beta_2\times\mathtt{radio}+\color{Red}{\beta_3\times(\mathtt{tv}\cdot\mathtt{radio})}+\varepsilon\]

summary(lm(sales ~ TV + radio + radio:TV, data=Advertising))
## 
## Call:
## lm(formula = sales ~ TV + radio + radio:TV, data = Advertising)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3366 -0.4028  0.1831  0.5948  1.5246 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
## TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
## radio       2.886e-02  8.905e-03   3.241   0.0014 ** 
## TV:radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9673 
## F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16
M = lm(sales ~ TV + radio + radio:TV, data=Advertising)
predict(M, data.frame(TV=50, radio=40), interval='confidence')
##        fit      lwr      upr
## 1 11.03268 10.73224 11.33311

Non-linearities

A nonlinear fit might be better here.

\[\mathtt{MPG} = \beta_0 + \beta_1\times\mathtt{horsepower} + \beta_2 \times\mathtt{horsepower}^2 + \beta_3 \times\mathtt{horsepower}^3 + \beta_4 \times\mathtt{horsepower}^4 + \dots + \varepsilon \]

\[\mathtt{MPG} = \beta_0 + \beta_1\times h_1(\mathtt{horsepower}) + \beta_2 \times h_2(\mathtt{horsepower}) + \beta_3 \times h_3(\mathtt{horsepower}) + \beta_4 \times h_4(\mathtt{horsepower}) + \dots + \varepsilon \]
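A polynomial model of this kind is still linear in the coefficients, so it can be fit with lm(). A minimal sketch using the Auto data from ISLR2 (already loaded above):

M.poly = lm(mpg ~ poly(horsepower, 4), data = Auto)   # degree-4 orthogonal polynomial basis
summary(M.poly)$r.squared                             # compare with the linear fit lm(mpg ~ horsepower)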

Residuals for Auto data

In 2 or 3 dimensions, this is easy to visualize. What do we do when we have too many predictors?

Correlation of error terms

\[y_i = f(x_i) + \varepsilon_i \quad;\quad \varepsilon_i \sim N(0,\sigma^2) \text{ i.i.d.}\]

When could this happen in real life?

Correlated errors

Simulations of time series with increasing correlations between \(\varepsilon_i\)

Non-constant variance of error (heteroskedasticity)

If the trend in variance is relatively simple, we can transform the response using a logarithm, for example.
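For instance, a minimal sketch refitting the Advertising regression with a log-transformed response (sales is strictly positive, so the logarithm is well defined):

M.log = lm(log(sales) ~ TV, data = Advertising)
plot(fitted(M.log), resid(M.log))   # check whether the residual spread now looks constant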

Outliers

Possible solutions:

High leverage points

\[h_{ii} = \frac{\partial \hat y_i}{\partial y_i} = (\underbrace{ X ( X^T X)^{-1} X^T}_{\text{Hat matrix}})_{i,i} \in [1/n,1].\]
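In R, the diagonal of the hat matrix is returned by hatvalues(); observations with leverage far above the average value \((p+1)/n\) deserve a closer look. A minimal sketch for the Boston fit above:

h = hatvalues(M.Boston)            # leverage h_ii for each observation
mean(h)                            # equals (p + 1)/n
head(sort(h, decreasing = TRUE))   # the highest-leverage observations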

Studentized residuals

  1. The residual \(e_i = y_i - \hat y_i\) is an estimate of the noise \(\varepsilon_i\).

  2. The standard error of \(e_i\) is \(\sigma \sqrt{1-h_{ii}}\).

  3. A studentized residual is \(e_i\) divided by its standard error (with an appropriate estimate of \(\sigma\)).

  4. When the model is correct, it follows a Student-\(t\) distribution with \(n-p-2\) degrees of freedom.
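In R, rstudent() returns studentized residuals; values larger than about 3 in absolute value are commonly flagged as potential outliers. A minimal sketch for the Boston fit:

r = rstudent(M.Boston)
sum(abs(r) > 3)    # number of observations flagged as potential outliers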

Collinearity

\[\mathtt{limit} \approx a\times\mathtt{rating}+b\]

\[VIF(\hat \beta_j) = \frac{1}{1-R^2_{X_j|X_{-j}}},\]
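The VIF of a single predictor can be computed directly from this formula by regressing it on the remaining predictors (the vif() function in the car package computes all of them at once). A minimal sketch for nox in the Boston data:

r2 = summary(lm(nox ~ . - medv, data = Boston))$r.squared   # R^2 of nox on the other predictors
1 / (1 - r2)                                                # VIF of nox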

\(K\)-nearest neighbors

Illustration of KNN regression: \(K=1\) (left) is rougher than \(K=9\) (right)

KNN estimator

\[\hat f(x) = \frac{1}{K} \sum_{i\in N_K(x)} y_i\]
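A minimal hand-rolled sketch of this estimator for a one-dimensional predictor (the knn.predict name and the use of the Advertising data are just for illustration):

knn.predict = function(x0, x, y, K) {
  nbrs = order(abs(x - x0))[1:K]   # indices of the K nearest training points
  mean(y[nbrs])                    # average their responses
}
knn.predict(150, Advertising$TV, Advertising$sales, K = 9)   # KNN estimate of sales at TV = 150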

Comparing linear regression to \(K\)-nearest neighbors

Long story short:

KNN estimates for a simulation from a linear model


Linear models can dominate KNN

Increasing deviations from linearity

When there are more predictors than observations, linear regression dominates KNN.