Regression Diagnostics I: Residuals

Dennis Sun

2024-05-06

Ames Housing Data Set

Here is a data set of house sales in Ames, IA.

Let’s use this data to build models to predict house prices!

Simple Linear Regression 1

How well does square footage predict house price?
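The slides fit this model in R; here is a minimal sketch of the same idea in Python instead, using synthetic data in place of the Ames data (the variable names, sample size, and coefficients below are all made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3000, size=200)                        # hypothetical living areas
price = 50_000 + 120 * sqft + rng.normal(0, 30_000, size=200)  # assumed true relationship

# Least-squares fit of price = b0 + b1 * sqft
X = np.column_stack([np.ones_like(sqft), sqft])
b0, b1 = np.linalg.lstsq(X, price, rcond=None)[0]
print(b0, b1)  # b1 estimates the price increase per additional square foot
```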

Simple Linear Regression 2

How well does number of bedrooms predict house price?

Simple Linear Regression 3

How well does number of bathrooms predict house price?

Multiple Linear Regression

Let’s build a model that incorporates all of the predictors!
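As a sketch of what "incorporating all of the predictors" means computationally, the following Python snippet fits a multiple regression on three synthetic predictors standing in for SqFt, Bed, and Bath (the data and coefficients here are invented, not the Ames estimates).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
sqft = rng.uniform(800, 3000, n)
bed = rng.integers(1, 6, n).astype(float)
bath = rng.integers(1, 4, n).astype(float)
# Assumed data-generating process, chosen arbitrarily for illustration
price = 40_000 + 110 * sqft - 10_000 * bed + 15_000 * bath + rng.normal(0, 25_000, n)

# One design matrix with an intercept column and all three predictors
X = np.column_stack([np.ones(n), sqft, bed, bath])
beta_hat = np.linalg.lstsq(X, price, rcond=None)[0]  # [b0, b_sqft, b_bed, b_bath]
print(beta_hat)
```

Each coefficient is interpreted holding the other predictors constant, which is what the next slide's question turns on.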

Why is the coefficient of Bed negative?

  • For each additional bedroom, the price of a home decreases by $30,054 on average, holding square footage constant. (Economists call this “ceteris paribus.”)
  • If we increase the number of bedrooms without changing the square footage or the number of bathrooms, then each bedroom will be smaller, decreasing the house price.

Inference for Multiple Regression

What can we learn from inferential statistics?

Assumptions of Linear Regression

Inference for linear regression depends on the following assumptions:

  1. The linear model is correct: \[ \mu\{ Y | X_1, ..., X_p \} = \beta_0 + \beta_1 X_{1} + \dots + \beta_p X_{p}.\] In other words, \(\displaystyle Y_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_p X_{pi} + \text{error}_i.\)
  2. The errors are independent \(\text{Normal}(0, \sigma^2)\).

Assumption 2 is actually three assumptions in one:

  1. The errors are normally distributed.
  2. The errors have constant variance. (Statisticians call this “homoskedasticity.”)
  3. The errors are independent.

How do we check these assumptions?

Residuals

  • We fit a linear regression model to the data and obtain \[ \hat\mu\{ Y | X_1, ..., X_p \} = \hat\beta_0 + \hat\beta_1 X_1 + \dots + \hat\beta_p X_p. \]
  • We can obtain the fitted values by plugging in the original data points: \[ \hat Y_i = \hat\beta_0 + \hat\beta_1 X_{i1} + \dots + \hat\beta_p X_{ip}. \]
  • The residuals are the difference between the actual values and the fitted values: \[ \begin{aligned} \text{residual}_i &= Y_i - \hat Y_i \\ &= (\beta_0 + \beta_1 X_{i1} + \dots + \beta_p X_{ip} + \text{error}_i) \\ &\quad - (\hat\beta_0 + \hat\beta_1 X_{i1} + \dots + \hat\beta_p X_{ip}) \\ &\approx \text{error}_i \end{aligned} \]
  • The residuals are an estimate of the errors!
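The fitted-value and residual definitions above can be computed directly. This Python sketch (synthetic data) forms \(\hat Y_i\) and \(\text{residual}_i = Y_i - \hat Y_i\), and illustrates two facts: the residuals track the unobserved errors, and they sum to zero whenever the model includes an intercept.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(0, 10, n)
errors = rng.normal(0, 1, n)     # the unobservable errors
y = 2 + 3 * x + errors

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta_hat            # Y-hat: plug the data back into the fitted model
residuals = y - fitted           # residual_i = Y_i - Y-hat_i

print(np.corrcoef(residuals, errors)[0, 1])  # close to 1: residuals approximate the errors
print(residuals.sum())                       # ~0, a consequence of least squares
```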

Residual Plots

To check assumptions about the errors, we can examine the residuals.

Example: To test the normality of the errors, make a normal Q-Q plot of the residuals.
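For concreteness, here are the numbers behind a normal Q-Q plot: the sorted residuals paired with standard normal quantiles. This is only a sketch (in Python, with simulated residuals); in practice you would draw the plot, e.g. with R's qqnorm.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
residuals = rng.normal(0, 1, 200)   # stand-in for residuals from a fitted model

order = np.sort(residuals)
n = len(order)
# Theoretical N(0,1) quantiles at probabilities (i + 0.5) / n
theo = np.array([NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)])

# If the residuals are normal, the points (theo[i], order[i]) fall near a
# straight line, so their correlation is close to 1.
print(np.corrcoef(theo, order)[0, 1])
```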

Residual Plots

To check assumptions about the errors, we can examine the residuals.


Example: To check correctness of the linear model and constant variance, plot the residuals against \(\hat Y\).

Residual Plots

If you call plot on a model fit using lm, then R will make many of these residual plots for you.

Conclusions

What have we learned from the residual analysis?

  • The linear model appears reasonable.
  • The errors are not normally distributed.
  • The errors do not have constant variance.

To fix these problems, we could consider transformations of \(Y\):

  • log transformations: \(Y' = \ln(Y)\)

but we will not pursue that here.

Other Residual Plots

  • Multiple regression models are fundamentally hard to visualize.
  • Residuals can also help us visualize multiple regression models.

Partial Residual Plots

A partial residual plot helps visualize the functional form of one predictor variable in a multiple regression model.

For example, the partial residuals of predictor \(X_k\) are defined by \[ \begin{aligned} \text{(partial residual for $X_k$)}_i &= Y_i - \hat\beta_0 - \sum_{j\neq k} \hat\beta_j X_{ij} = \text{residual}_i + \hat\beta_k X_{ik} \end{aligned} \]

A partial residual plot plots these partial residuals against \(X_k\).

Partial Residual Models

What if we fit a (simple) linear regression model to the partial residuals?

Where have we seen \(-30054\) before?

  • It is the coefficient of Bed in the multiple regression model!
  • In general, the slope of the linear regression fit to the partial residuals will match the coefficient in the multiple regression.
  • However, the standard error is not the same, so the inferences (\(p\)-value, confidence interval) will not match.
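The matching-slope claim can be checked numerically. The Python sketch below (made-up data, two correlated predictors) fits a multiple regression, forms the partial residuals for one predictor, and refits a simple regression to them; the two slopes agree to machine precision, because the residuals are orthogonal to every column of the design matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
x1 = rng.normal(0, 1, n)
x2 = 0.6 * x1 + rng.normal(0, 1, n)       # correlated with x1, as in real data
y = 1 + 2 * x1 - 3 * x2 + rng.normal(0, 1, n)

# Multiple regression of y on x1 and x2
X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]  # [b0, b1, b2]
resid = y - X @ b

# Partial residuals for x2: residual_i + b2 * x2_i
partial = resid + b[2] * x2

# Simple regression of the partial residuals on x2
Z = np.column_stack([np.ones(n), x2])
slope = np.linalg.lstsq(Z, partial, rcond=None)[0][1]
print(slope, b[2])                        # the two slopes agree
```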

Partial Regression Plots

A partial regression plot (also called an added variable plot) shows the relationship between \(Y\) and \(X_k\), adjusting for the effects of the other predictors. It plots two sets of residuals against each other:

  • (\(y\)-axis) the residuals from the multiple regression of \(Y\) on \(X_1, \dots, X_{k-1}, X_{k+1}, \dots, X_p\)
  • (\(x\)-axis) the residuals from the multiple regression of \(X_k\) on \(X_1, \dots, X_{k-1}, X_{k+1}, \dots, X_p\)

Partial Regression Models

A partial regression plot is harder to interpret than a partial residual plot (because both the \(x\) and \(y\) axes display residuals).

However, if we fit a (simple) linear regression model to the partial regression plot, the slope and inferences match the coefficient and inferences for Bed in the multiple regression exactly.
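This exact match is a standard least-squares fact (often called the Frisch-Waugh-Lovell theorem). The Python sketch below, on made-up data, carries out the partial regression construction for one predictor and confirms that the resulting slope equals the multiple-regression coefficient.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
x1 = rng.normal(0, 1, n)
x2 = 0.5 * x1 + rng.normal(0, 1, n)
y = 1 + 2 * x1 - 3 * x2 + rng.normal(0, 1, n)

# Full multiple regression of y on x1 and x2
X = np.column_stack([np.ones(n), x1, x2])
b_full = np.linalg.lstsq(X, y, rcond=None)[0]

# Residuals after regressing y, and separately x2, on the other predictor (x1)
W = np.column_stack([np.ones(n), x1])
ry = y - W @ np.linalg.lstsq(W, y, rcond=None)[0]
rx = x2 - W @ np.linalg.lstsq(W, x2, rcond=None)[0]

# Simple regression of ry on rx (both residual vectors have mean ~0,
# so no intercept is needed)
slope = (rx @ ry) / (rx @ rx)
print(slope, b_full[2])   # equal, to machine precision
```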