Diagnostics I#

Ames Housing Data Set {.scrollable}#

Here is a data set of house sales in Ames, IA.

df.housing <- read.csv("https://dlsun.github.io/data/housing.csv")
head(df.housing)
A data.frame: 6 × 6
  SaleMonth SaleYear SalePrice Sqft Bed Bath
1         1     2006    250000 2064   4  2.5
2         1     2006    401179 2795   4  3.5
3         1     2006    145000 1056   3  1.0
4         1     2006    181000 1309   3  1.5
5         1     2006    172400 1664   3  2.5
6         1     2006    252000 1964   3  2.5

Let’s use this data to build models to predict house prices!

Simple Linear Regression 1 {.scrollable}#

How well does square footage predict house price?

with(df.housing, plot(Sqft, SalePrice))
simple.model.1 <- lm(SalePrice ~ Sqft, data=df.housing)
simple.model.1
abline(simple.model.1)
Call:
lm(formula = SalePrice ~ Sqft, data = df.housing)

Coefficients:
(Intercept)         Sqft  
    13289.6        111.7  
[Figure: SalePrice vs. Sqft with the fitted regression line]

Simple Linear Regression 2 {.scrollable}#

How well does number of bedrooms predict house price?

with(df.housing, plot(Bed, SalePrice))
simple.model.2 <- lm(SalePrice ~ Bed, data=df.housing)
simple.model.2
abline(simple.model.2)
Call:
lm(formula = SalePrice ~ Bed, data = df.housing)

Coefficients:
(Intercept)          Bed  
     141152        13889  
[Figure: SalePrice vs. Bed with the fitted regression line]

Simple Linear Regression 3 {.scrollable}#

How well does number of bathrooms predict house price?

with(df.housing, plot(Bath, SalePrice))
simple.model.3 <- lm(SalePrice ~ Bath, data=df.housing)
simple.model.3
abline(simple.model.3)
Call:
lm(formula = SalePrice ~ Bath, data = df.housing)

Coefficients:
(Intercept)         Bath  
      54055        72163  
[Figure: SalePrice vs. Bath with the fitted regression line]

Multiple Linear Regression {.scrollable .smaller}#

Let’s build a model that incorporates all of the predictors!

multiple.model <- lm(SalePrice ~ Sqft + Bed + Bath, data=df.housing)
multiple.model
Call:
lm(formula = SalePrice ~ Sqft + Bed + Bath, data = df.housing)

Coefficients:
(Intercept)         Sqft          Bed         Bath  
    51946.8        116.7     -30054.5      22529.8  

Why is the coefficient of Bed negative?

::: {.incremental}

  • For each additional bedroom, the price of a home decreases by $30,054 on average, holding square footage constant. (Economists call this “ceteris paribus.”)

  • If we increase the number of bedrooms without changing the square footage or the number of bathrooms, then each bedroom will be smaller, decreasing the house price.

:::
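
To see "holding square footage constant" in action, here is a minimal sketch (the houses below are hypothetical, invented for illustration) that predicts prices for a fixed Sqft and Bath while varying only Bed:

# Hypothetical 1500-sqft, 2-bath houses that differ only in bedrooms
new.houses <- data.frame(Sqft=1500, Bed=2:5, Bath=2)
predict(multiple.model, newdata=new.houses)  # predictions drop by ~$30,054 per extra bedroom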

Inference for Multiple Regression#

What can we learn from inferential statistics?

summary(multiple.model)
confint(multiple.model)
Call:
lm(formula = SalePrice ~ Sqft + Bed + Bath, data = df.housing)

Residuals:
    Min      1Q  Median      3Q     Max 
-516716  -28082   -1978   24883  330612 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  51946.789   3739.458   13.89   <2e-16 ***
Sqft           116.733      2.875   40.61   <2e-16 ***
Bed         -30054.520   1349.251  -22.27   <2e-16 ***
Bath         22529.794   2117.394   10.64   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 51640 on 2926 degrees of freedom
Multiple R-squared:  0.5825,	Adjusted R-squared:  0.5821 
F-statistic:  1361 on 3 and 2926 DF,  p-value: < 2.2e-16
A matrix: 4 × 2 of type dbl
                  2.5 %      97.5 %
(Intercept)  44614.5535  59279.0238
Sqft           111.0963    122.3699
Bed         -32700.0984 -27408.9422
Bath         18378.0606  26681.5265

Assumptions of Linear Regression {.smaller .scrollable}#

Inference for linear regression depends on the following assumptions:

::: {.incremental}

  1. The linear model is correct: $$ \mu\{ Y | X_1, \dots, X_p \} = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p. $$ In other words, \( Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_p X_{ip} + \text{error}_i. \)

  2. The errors are independent \(\text{Normal}(0, \sigma^2)\) random variables.

:::

Assumption 2 is actually three assumptions in one:

::: {.incremental}

  a. The errors are normally distributed.

  b. The errors have constant variance. (Statisticians call this “homoskedasticity.”)

  c. The errors are independent.

:::

How do we check these assumptions?

Residuals {.smaller}#

::: {.incremental}

  • We fit a linear regression model to the data and obtain $$ \hat\mu\{ Y | X_1, \dots, X_p \} = \hat\beta_0 + \hat\beta_1 X_1 + \dots + \hat\beta_p X_p. $$

  • We can obtain the fitted values by plugging in the original data points: $$ \hat Y_i = \hat\beta_0 + \hat\beta_1 X_{i1} + \dots + \hat\beta_p X_{ip}. $$

  • The residuals are the difference between the actual values and the fitted values: $$ \begin{aligned} \text{residual}_i &= Y_i - \hat Y_i \\ &= (\beta_0 + \beta_1 X_{i1} + \dots + \beta_p X_{ip} + \text{error}_i) \\ &\quad - (\hat\beta_0 + \hat\beta_1 X_{i1} + \dots + \hat\beta_p X_{ip}) \\ &\approx \text{error}_i \end{aligned} $$

  • The residuals are an estimate of the errors! (See the sanity check below.)

:::
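
As a quick sanity check (a sketch, not part of the original analysis), we can recompute the residuals by hand and confirm they match what lm stores:

# Residuals are observed values minus fitted values
res.manual <- df.housing$SalePrice - multiple.model$fitted
all.equal(unname(res.manual), unname(multiple.model$resid))  # TRUE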

Residual Plots {.scrollable}#

To check assumptions about the errors, we can examine the residuals.

Example: To test the normality of the errors, make a normal Q-Q plot of the residuals.

qqnorm(multiple.model$resid)
[Figure: normal Q-Q plot of the residuals]
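
One optional addition (not in the code above): qqline() draws a reference line through the first and third quartiles, which makes departures from normality easier to judge.

qqnorm(multiple.model$resid)
qqline(multiple.model$resid)  # reference line through the quartiles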

Residual Plots {.scrollable}#

To check assumptions about the errors, we can examine the residuals.

Example: To check correctness of the linear model and constant variance, plot the residuals against \(\hat Y\).

plot(multiple.model$fitted, multiple.model$resid)
[Figure: residuals vs. fitted values]
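
An optional embellishment: a dashed horizontal line at zero makes curvature and changing spread easier to spot.

plot(multiple.model$fitted, multiple.model$resid)
abline(h=0, lty=2)  # reference line at zero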

Residual Plots {.scrollable}#

If you call plot on a model fit using lm, then R will make many of these residual plots for you.

par(mfrow=c(2,2))
plot(multiple.model)
../../_images/493d6760e6d6c6258097f0a16acde1fc315a03a55531837f7526d21c5264b8fc.png

Conclusions#

What have we learned from the residual analysis?

::: {.incremental}

  • The linear model appears reasonable.

  • The errors are not normally distributed.

  • The errors do not have constant variance.

:::

To fix these problems, we could consider transformations of \(Y\):

  • log transformations: \(Y' = \ln(Y)\)

We will not pursue transformations further here, but the sketch below shows the mechanics.
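
For reference, a minimal sketch of fitting the log-transformed model and re-checking the diagnostics:

# Refit with log(SalePrice) as the response
log.model <- lm(log(SalePrice) ~ Sqft + Bed + Bath, data=df.housing)
par(mfrow=c(2,2))
plot(log.model)  # re-examine the residual diagnostics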

Other Residual Plots#

  • Multiple regression models are fundamentally hard to visualize.

  • Residuals can also help us visualize multiple regression models.

Partial Residual Plots {.smaller .scrollable}#

A partial residual plot helps visualize the functional form of one predictor variable in a multiple regression model.

For example, the partial residuals of predictor \(X_k\) are defined by $$ \text{(partial residual for } X_k)_i = Y_i - \hat\beta_0 - \sum_{j\neq k} \hat\beta_j X_{ij} = \text{residual}_i + \hat\beta_k X_{ik}. $$

A partial residual plot plots these partial residuals against \(X_k\).

partial.res <- multiple.model$resid + multiple.model$coef["Bed"] * df.housing$Bed
plot(df.housing$Bed, partial.res)
[Figure: partial residuals for Bed vs. Bed]

Partial Residual Models {.smaller .scrollable}#

What if we fit a (simple) linear regression model to the partial residuals?

plot(df.housing$Bed, partial.res)
partial.res.model <- lm(partial.res ~ df.housing$Bed)
abline(partial.res.model)
summary(partial.res.model)
Call:
lm(formula = partial.res ~ df.housing$Bed)

Residuals:
    Min      1Q  Median      3Q     Max 
-516716  -28082   -1978   24883  330612 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -1.394e-09  3.425e+03    0.00        1    
df.housing$Bed -3.005e+04  1.152e+03  -26.08   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 51620 on 2928 degrees of freedom
Multiple R-squared:  0.1885,	Adjusted R-squared:  0.1882 
F-statistic: 680.2 on 1 and 2928 DF,  p-value: < 2.2e-16
[Figure: partial residuals for Bed with the fitted line]

Where have we seen \(-30054\) before?

::: {.incremental}

  • It is the coefficient of Bed in the multiple regression model!

  • In general, the slope of the linear regression fit to the partial residuals will match the coefficient in the multiple regression.

  • However, the standard error is not the same, so the inferences (\(p\)-value, confidence interval) will not match. (See the comparison below.)

:::
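
To make the comparison concrete, here is a quick check (the second row of the partial residual model's coefficient table is its slope):

coef(summary(multiple.model))["Bed", ]  # SE = 1349.251
coef(summary(partial.res.model))[2, ]   # same slope, but SE = 1152; inference differs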

Partial Regression Plots {.smaller .scrollable}#

A partial regression plot (also called an added variable plot) shows the relationship between \(Y\) and \(X_k\), adjusting for the effect of the other predictors. It plots:

  • on the \(y\)-axis: the residuals from the multiple regression of \(Y\) on \(X_1, \dots, X_{k-1}, X_{k+1}, \dots, X_p\);

  • on the \(x\)-axis: the residuals from the multiple regression of \(X_k\) on \(X_1, \dots, X_{k-1}, X_{k+1}, \dots, X_p\).

res.y <- lm(SalePrice ~ Sqft + Bath, data=df.housing)$resid
res.x <- lm(Bed ~ Sqft + Bath, data=df.housing)$resid
plot(res.x, res.y)
[Figure: partial regression (added variable) plot for Bed]

Partial Regression Models {.smaller .scrollable}#

A partial regression plot is harder to interpret than a partial residual plot (because both the \(x\) and \(y\) axes display residuals).

However, if we fit a (simple) linear regression model to the partial regression plot, the slope and inferences match the coefficient and inferences for Bed in the multiple regression exactly.

plot(res.x, res.y)
partial.reg.model <- lm(res.y ~ res.x)
abline(partial.reg.model)
summary(partial.reg.model)
Call:
lm(formula = res.y ~ res.x)

Residuals:
    Min      1Q  Median      3Q     Max 
-516716  -28082   -1978   24883  330612 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.320e-13  9.537e+02    0.00        1    
res.x       -3.005e+04  1.349e+03  -22.28   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 51620 on 2928 degrees of freedom
Multiple R-squared:  0.145,	Adjusted R-squared:  0.1447 
F-statistic: 496.5 on 1 and 2928 DF,  p-value: < 2.2e-16
[Figure: partial regression plot for Bed with the fitted line]
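
As a final check, the coefficient rows from the two fits agree in both the estimate and the standard error (the residual degrees of freedom differ by two, so the printed summaries are otherwise essentially identical):

coef(summary(multiple.model))["Bed", ]       # estimate -30054.52, SE 1349.251
coef(summary(partial.reg.model))["res.x", ]  # same estimate and standard error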