Simple Linear Regression Assumptions#
Outline#
Goodness of fit of regression: analysis of variance.
\(F\)-statistics.
Residuals.
Diagnostic plots.
The figure depicts the statistical model for regression:#
First we start with \(X\), then compute the mean \(\beta_0 + \beta_1 X\), then add error \(\epsilon\), yielding \(Y = \beta_0 + \beta_1 X + \epsilon\).
Geometry of least squares#
Full model#
This is the model
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \qquad \epsilon_i \overset{IID}{\sim} N(0, \sigma^2). \]
Its fitted values are
\[ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i. \]
Reduced model#
This is the model
\[ Y_i = \beta_0 + \epsilon_i, \qquad \epsilon_i \overset{IID}{\sim} N(0, \sigma^2). \]
Its fitted values are
\[ \hat{Y}_i = \bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i. \]
Regression sum of squares#
The closer \(\hat{Y}\) is to the \({1}\) axis, the less “variation” there is along the \(X\) axis.
This closeness can be measured by the length of the vector \(\hat{Y}-\bar{Y} \cdot 1\).
Its length is \(\|\hat{Y} - \bar{Y} \cdot 1\| = \sqrt{SSR}\), where
\[ SSR = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2 \]
is the regression sum of squares.
An important right triangle#
Sides of the triangle: SSR, SSE
Hypotenuse: SST
Degrees of freedom in the right triangle#
Sides of the triangle: SSR has 1 d.f., SSE has n-2 d.f.
Hypotenuse: SST has n-1 d.f.
Mean squares#
Each sum of squares has an extra bit of information associated with it, called its degrees of freedom.
Roughly speaking, the degrees of freedom can be determined by dimension counting.
Dividing a sum of squares by its degrees of freedom gives the corresponding mean square, e.g. \(MSR = SSR / 1\) and \(MSE = SSE / (n-2)\).
Computing degrees of freedom#
The \(SSE\) has \(n-2\) degrees of freedom because it is the squared length of a vector that lies in \(n-2\) dimensions. To see this, note that it is perpendicular to the 2-dimensional plane formed by the \(X\) axis and the \(1\) axis.
The \(SST\) has \(n-1\) degrees of freedom because it is the squared length of a vector that lies in \(n-1\) dimensions. In this case, this vector is perpendicular to the \(1\) axis.
The \(SSR\) has 1 degree of freedom because it is the squared length of a vector that lies in the 2-dimensional plane but is perpendicular to the \(1\) axis.
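Putting these together, the Pythagorean identity from the right triangle, together with its degrees of freedom, can be summarized as (an added reference equation):
\[ \underbrace{\sum_{i=1}^n (Y_i - \bar{Y})^2}_{SST,\ n-1\ \text{d.f.}} \;=\; \underbrace{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}_{SSR,\ 1\ \text{d.f.}} \;+\; \underbrace{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}_{SSE,\ n-2\ \text{d.f.}} \]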
A different visualization#
These sums of squares can be visualized by other means as well. We will illustrate with a synthetic dataset.
X = seq(0, 20, length = 21)           # 21 equally spaced X values between 0 and 20
Y = 0.5 * X + 1 + rnorm(21)           # linear mean 1 + 0.5 X plus standard Gaussian noise
plot(X, Y, pch=23, bg='red', cex=2)
![../../_images/51ec08c1819ddebb16f824d4809dd8fb8699c7415e53728f625c01bf70f70e7e.png](../../_images/51ec08c1819ddebb16f824d4809dd8fb8699c7415e53728f625c01bf70f70e7e.png)
SST: total sum of squares#
plot(X, Y, pch = 23, bg = "red", main='Total sum of squares', cex=2)
abline(h = mean(Y), col = "yellow", lwd = 2)
for (i in 1:21) {
points(X[i], mean(Y), pch = 23, bg = "yellow")
lines(c(X[i], X[i]), c(Y[i], mean(Y)))
}
![../../_images/8ce645f173489d69bdc3f0d11c4053e0b3cb79e9e844b59e8462d236e008d7a1.png](../../_images/8ce645f173489d69bdc3f0d11c4053e0b3cb79e9e844b59e8462d236e008d7a1.png)
Description#
This figure depicts the **total sum of squares, \(SST\)** – the sum of the squared differences between the \(Y\) values and the sample mean of the \(Y\) values.
SSE: error sum of squares#
plot(X, Y, pch = 23, bg = "red", main="Error sum of squares", cex=2)
Y.lm = lm(Y ~ X)
Yhat = predict(Y.lm)
abline(Y.lm, col = "green", lwd = 2)
for (i in 1:21) {
points(X[i], Yhat[i], pch = 23, bg = "green")
lines(c(X[i], X[i]), c(Y[i], Yhat[i]))
}
abline(h = mean(Y), col = "yellow", lwd = 2)
![../../_images/ef75fb4e778e30b2bd3a82dc77a95f93bee2dc17f4d1b705cf59a526f369087a.png](../../_images/ef75fb4e778e30b2bd3a82dc77a95f93bee2dc17f4d1b705cf59a526f369087a.png)
Description#
This figure depicts the **error sum of squares, \(SSE\)** – the sum of the squared differences between the \(Y\) values and the \(\hat{Y}\) values, i.e. the fitted values of the regression model.
SSR: regression sum of squares#
plot(X, Y, pch = 23, bg = "red", main="Regression sum of squares", cex=2)
abline(Y.lm, col = "green", lwd = 2)
abline(h = mean(Y), col = "yellow", lwd = 2)
for (i in 1:21) {
points(X[i], Yhat[i], pch = 23, bg = "green")
points(X[i], mean(Y), pch = 23, bg = "yellow")
lines(c(X[i], X[i]), c(mean(Y), Yhat[i]))
}
![../../_images/28050b30ae760ad88bbd8611da1cc54b259eb6b31f0e407bd80663559dd5fa7d.png](../../_images/28050b30ae760ad88bbd8611da1cc54b259eb6b31f0e407bd80663559dd5fa7d.png)
Description#
This figure depicts the **regression sum of squares, \(SSR\)** – the sum of the squared differences between the \(\hat{Y}\) values and the sample mean of the \(Y\) values.
Definition of \(R^2\)#
As noted above, if the regression model fits very well, then \(SSR\) will be large relative to \(SST\). The \(R^2\) score is just the ratio of these sums of squares:
\[ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}. \]
Let’s verify this on the big_bang data.
url = 'https://raw.githubusercontent.com/StanfordStatistics/stats191-data/main/Sleuth3/big_bang.csv'
big_bang = read.table(url, sep=',', header=TRUE)
big_bang.lm = lm(Distance ~ Velocity, data=big_bang)
Let’s verify our claim \(SST=SSE+SSR\):
SSE = sum(resid(big_bang.lm)^2)
SST = with(big_bang, sum((Distance - mean(Distance))^2))
SSR = with(big_bang, sum((mean(Distance) - predict(big_bang.lm))^2))
c(SST, SSE + SSR) # these values are the same
9.5906625  9.5906625
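As an additional check (not in the original output), the ratio \(SSR/SST\) should agree with the \(R^2\) reported by summary:
SSR / SST                        # R^2 computed from the sums of squares
summary(big_bang.lm)$r.squared   # R^2 reported by lm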
\(R^2\) and correlation#
Finally, \(R=\sqrt{R^2}\) is called the (absolute) correlation coefficient because it is equal to the absolute value of the sample correlation coefficient of \(X\) and \(Y\).
with(big_bang, cor(Velocity, Distance)^2)
\(F\)-statistics#
After a \(t\)-statistic, the next most commonly encountered statistic is a \(\chi^2\) statistic, or its closely related cousin, the \(F\) statistic.
The \(\chi^2_k\) distribution is the distribution of the squared length of a centered normal vector in \(k\) dimensions (the proper definition needs slightly more detail).
Sums of squares are squared lengths!
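As a quick illustration (a simulation sketch, not part of the original notes), the squared length of a vector of \(k\) independent standard normal variables is a \(\chi^2_k\) random variable, whose mean is \(k\):
set.seed(1)                               # for reproducibility (added)
k = 5
Z = matrix(rnorm(10000 * k), ncol = k)    # 10000 standard normal vectors in k dimensions
sq.lengths = rowSums(Z^2)                 # their squared lengths
c(mean(sq.lengths), k)                    # the average squared length should be close to k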
\(F\) statistic for simple linear regression#
Defined as
\[ F = \frac{MSR}{MSE} = \frac{SSR/1}{SSE/(n-2)} = \frac{(SST - SSE)/1}{SSE/(n-2)}. \]
It can be thought of as the difference in sums of squares between the reduced and full models, normalized by our “best estimate” of the variance, \(\hat{\sigma}^2 = SSE/(n-2)\).
\(F\) statistics and \(R^2\)#
The \(R^2\) is also closely related to the \(F\) statistic reported as the goodness of fit in summary of lm.
F = (SSR / 1) / (SSE / big_bang.lm$df.residual)   # MSR / MSE
F
summary(big_bang.lm)$fstatistic
anova(big_bang.lm)
value: 36.2891901954741, numdf: 1, dendf: 22
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Velocity | 1 | 5.970873 | 5.9708734 | 36.28919 | 4.607681e-06 |
| Residuals | 22 | 3.619789 | 0.1645359 | NA | NA |
Simple manipulations yield
\[ F = \frac{SSR/1}{SSE/(n-2)} = (n-2) \cdot \frac{R^2}{1-R^2}, \]
which we can check numerically with \(R^2 \approx 0.6226\) and \(n-2=22\):
22*0.6226 / (1 - 0.6226)
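For an exact version of the same check (an added snippet that uses the stored \(R^2\) rather than the rounded 0.6226):
r2 = summary(big_bang.lm)$r.squared
big_bang.lm$df.residual * r2 / (1 - r2)   # should reproduce the F statistic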
\(F\) test under \(H_0\)#
Under \(H_0:\beta_1=0\),
\[ F = \frac{SSR/1}{SSE/(n-2)} \sim F_{1, n-2} \]
because
\[ SSE \sim \sigma^2 \cdot \chi^2_{n-2}, \qquad SSR \sim \sigma^2 \cdot \chi^2_1, \]
and from our “right triangle”, these vectors are orthogonal, so the two sums of squares are independent.
Note that it is the null hypothesis \(H_0:\beta_1=0\) that implies \(SSR \sim \chi^2_1 \cdot \sigma^2\); the distribution of \(SSE\) does not require \(H_0\).
\(F\)-statistics and mean squares#
An \(F\)-statistic is a ratio of mean squares: it has a numerator, \(N\), and a denominator, \(D\), that are independent.
Let
\[ N \sim \frac{\chi^2_{{\rm num}}}{df_{{\rm num}}}, \qquad D \sim \frac{\chi^2_{{\rm den}}}{df_{{\rm den}}} \]
be independent, and define
\[ F = \frac{N}{D}. \]
We say \(F\) has an \(F\) distribution with parameters \(df_{{\rm num}}, df_{{\rm den}}\) and write \(F \sim F_{df_{{\rm num}}, df_{{\rm den}}}\).
Takeaway#
\(F\) statistics are computed to test some \(H_0\).
When that \(H_0\) is true, the \(F\) statistic has this \(F\) distribution (with appropriate degrees of freedom).
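As a quick illustration (an added check using the big_bang fit from above), the p-value reported for the regression is just the upper-tail probability of this \(F\) distribution:
Fstat = summary(big_bang.lm)$fstatistic
pf(Fstat["value"], Fstat["numdf"], Fstat["dendf"], lower.tail = FALSE)   # matches Pr(>F) in the anova table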
Relation between \(F\) and \(t\) statistics.#
If \(T \sim t_{\nu}\), then
\[ T^2 \sim \frac{N(0,1)^2}{\chi^2_{\nu}/\nu} \sim \frac{\chi^2_1/1}{\chi^2_{\nu}/\nu}. \]
In other words, the square of a \(t\)-statistic is an \(F\)-statistic. Because it is always positive, an \(F\)-statistic has no direction associated with it.
In fact
\[ F = \frac{MSR}{MSE} = \frac{\widehat{\beta}_1^2}{SE(\widehat{\beta}_1)^2}.\]
Verifying \(F\)-statistic calculation#
Let’s check this in our example.
summary(big_bang.lm)$fstatistic
value: 36.2891901954741, numdf: 1, dendf: 22
The \(t\)-statistic for Velocity is the \(t\)-statistic for the parameter \(\beta_1\) under \(H_0:\beta_1=0\). Its value, reported by summary(big_bang.lm), is about 6.024. If we square it, we should get approximately the same value as the \(F\)-statistic.
6.024**2
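For a check free of rounding error (an added snippet), we can square the t value taken directly from the coefficient table:
tval = summary(big_bang.lm)$coefficients["Velocity", "t value"]
tval^2   # should equal the F statistic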
Interpretation of an \(F\)-statistic#
In regression, the numerator is usually a difference in goodness of fit of two (nested) models.
The denominator is \(\hat{\sigma}^2\) – an estimate of \(\sigma^2\).
In our example today: the bigger model is the simple linear regression model, the smaller is the model with constant mean (one sample model).
If the \(F\) is large, it says that the bigger model explains a lot more variability in \(Y\) (relative to \(\sigma^2\)) than the smaller one.
Analysis of variance#
The \(F\)-statistic has the form
\[ F = \frac{(SSE_R - SSE_F)/(df_R - df_F)}{SSE_F/df_F}. \]
Right triangle with full and reduced model: sum of squares#
Sides of the triangle: \(SSE_R-SSE_F\), \(SSE_F\)
Hypotenuse: \(SSE_R\)
Right triangle with full and reduced model: degrees of freedom#
Sides of the triangle: \(df_R-df_F\), \(df_F\)
Hypotenuse: \(df_R\)
The \(F\)-statistic for simple linear regression revisited#
The null hypothesis is
\[ H_0: \beta_1 = 0. \]
The usual level-\(\alpha\) rejection rule is to reject \(H_0\) if \(F_{\text{obs}}\), the observed \(F\) statistic, is greater than \(F_{1,n-2,1-\alpha}\).
In our case, the observed \(F\) was 36.3, \(n-2=22\) and the appropriate 5% threshold is computed below to be 4.30. Therefore, we strongly reject \(H_0\).
qf(0.95, 1, 22)
Case study B: breakdown time for insulating fluid#
url = 'https://raw.githubusercontent.com/StanfordStatistics/stats191-data/main/Sleuth3/fluid_breakdown.csv'
breakdown = read.table(url, sep=',', header=TRUE)
with(breakdown, plot(Voltage, Time))
![../../_images/c9c6fe23a90f73f2c7d6a345b4af139084c4b9e7c435555cba7c3c5d3d117c63.png](../../_images/c9c6fe23a90f73f2c7d6a345b4af139084c4b9e7c435555cba7c3c5d3d117c63.png)
breakdown.lm = lm(Time ~ Voltage, data=breakdown)
summary(breakdown.lm)$coef
summary(breakdown.lm)$fstatistic
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 1886.16946 | 364.48122 | 5.174943 | 1.886969e-06 |
| Voltage | -53.95492 | 10.95264 | -4.926203 | 4.965655e-06 |
value: 24.267474478293, numdf: 1, dendf: 74
A designed experiment to estimate average breakdown time under different voltages.
Another model#
full.lm = lm(Time ~ factor(Voltage), data=breakdown)
summary(full.lm)$coef
summary(full.lm)$fstatistic
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 1303.0033 | 132.3377 | 9.846051 | 8.809748e-15 |
| factor(Voltage)28 | -946.7833 | 167.3954 | -5.655971 | 3.242824e-07 |
| factor(Voltage)30 | -1227.2215 | 149.2970 | -8.220001 | 7.914303e-12 |
| factor(Voltage)32 | -1261.8407 | 144.9686 | -8.704232 | 1.031985e-12 |
| factor(Voltage)34 | -1288.6444 | 142.4026 | -9.049303 | 2.427522e-13 |
| factor(Voltage)36 | -1298.3973 | 144.9686 | -8.956402 | 3.582214e-13 |
| factor(Voltage)38 | -1302.0871 | 155.1797 | -8.390836 | 3.855139e-12 |
value: 16.1227405670227, numdf: 6, dendf: 69
There were only 7 distinct values of Voltage: it can be treated as a category (i.e. a factor).
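A quick way to confirm this (an added check, using the breakdown data frame loaded above):
with(breakdown, length(unique(Voltage)))   # number of distinct voltages
with(breakdown, table(Voltage))            # observations at each voltage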
A different reduced & full model#
anova(breakdown.lm, full.lm)
| | Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|---|
| 1 | 74 | 6557345 | NA | NA | NA | NA |
| 2 | 69 | 3625244 | 5 | 2932102 | 11.16146 | 6.674496e-08 |
Our “right triangle” again (only degrees of freedom this time):#
Sides of the triangle: \(df_R-df_F=5\), \(df_F=69\)
Hypotenuse: \(df_R=74\)
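To connect the anova table to the formula, here is a hand computation of the same \(F\) statistic (an added check, not in the original notes):
SSE.R = sum(resid(breakdown.lm)^2)     # reduced model: Time linear in Voltage
SSE.F = sum(resid(full.lm)^2)          # full model: Voltage treated as a factor
df.R = breakdown.lm$df.residual        # 74
df.F = full.lm$df.residual             # 69
((SSE.R - SSE.F) / (df.R - df.F)) / (SSE.F / df.F)   # should match the F value in the anova table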
Diagnostics for simple linear regression#
What can go wrong?#
Using a linear regression function can be wrong: maybe the regression function should be quadratic.
We assumed independent Gaussian errors with the same variance. This may be incorrect.
The errors may not be normally distributed.
The errors may not be independent.
The errors may not have the same variance.
Detecting problems is more art than science, i.e. we cannot test for all possible problems in a regression model.
Inspecting residuals#
The basic idea of most diagnostic measures is the following:
If the model is correct then residuals \(e_i = Y_i -\widehat{Y}_i, 1 \leq i \leq n\) should look like a sample of (not quite independent) \(N(0, \sigma^2)\) random variables.
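For comparison (an added sketch, not part of the original notes), this is roughly what a residual plot looks like when the model is correct: pure noise scattered around zero with no visible pattern.
set.seed(2)                                   # for reproducibility (added)
x0 = seq(0, 20, length = 50)
e0 = rnorm(50)                                # idealized N(0, 1) "residuals"
plot(x0, e0, pch = 23, bg = "orange", cex = 2, xlab = "X", ylab = "Residual (simulated)")
abline(h = 0, lwd = 2, col = "red", lty = 2)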
A poorly fitting model#
y = anscombe$y2 + rnorm(length(anscombe$y2)) * 0.45
x = anscombe$x2
plot(x, y, pch = 23, bg = "orange", cex = 2, ylab = "Y",
xlab = "X")
simple.lm = lm(y ~ x)
abline(simple.lm, lwd = 2, col = "red", lty = 2)
![../../_images/e6c803662f145ee198d43748d4ec4f5767bee867e34e8a367c6135e30eb3c80a.png](../../_images/e6c803662f145ee198d43748d4ec4f5767bee867e34e8a367c6135e30eb3c80a.png)
Figure: \(Y\) vs. \(X\) and fitted regression line#
Residuals for Anscombe’s data#
plot(x, resid(simple.lm), ylab = "Residual", xlab = "X", pch = 23,
bg = "orange", cex = 2)
abline(h = 0, lwd = 2, col = "red", lty = 2)
![../../_images/9dc9fce0784d833c25855e6046143469ff2d633729d25276fd2fb28afe86c220.png](../../_images/9dc9fce0784d833c25855e6046143469ff2d633729d25276fd2fb28afe86c220.png)
Figure: residuals vs. \(X\)#
Quadratic model#
Let’s add a quadratic term to our model (a multiple linear regression model).
quadratic.lm = lm(y ~ poly(x, 2))   # fit a degree-2 polynomial in x
Xsort = sort(x)
plot(x, y, pch = 23, bg = "orange", cex = 2, ylab = "Y",
xlab = "X")
lines(Xsort, predict(quadratic.lm, list(x = Xsort)), col = "red", lty = 2,
lwd = 2)
![../../_images/972035c71efebd89f856196bc40ea46e4127d384ff063603e04870beb3bef06f.png](../../_images/972035c71efebd89f856196bc40ea46e4127d384ff063603e04870beb3bef06f.png)
Figure: \(Y\) and fitted quadratic model vs. \(X\)#
Inspecting residuals of quadratic model#
The residuals of the quadratic model have no apparent pattern in them, suggesting this is a better fit than the simple linear regression model.
plot(x, resid(quadratic.lm), ylab = "Residual", xlab = "X", pch = 23,
bg = "orange", cex = 2)
abline(h = 0, lwd = 2, col = "red", lty = 2)
![../../_images/e0ff82c2dae9fd87d2d57bac9691663fc65f3c0068f8d4de21fc698703975f66.png](../../_images/e0ff82c2dae9fd87d2d57bac9691663fc65f3c0068f8d4de21fc698703975f66.png)
Figure: residuals of quadratic model vs. \(X\)#
Assessing normality of errors: QQ-plot for linear model#
qqnorm(resid(simple.lm), pch = 23, bg = "orange", cex = 2)
![../../_images/a539ed625407859971e9f6bdaf9150dcff49846e7c59a43300b9088ba88279de.png](../../_images/a539ed625407859971e9f6bdaf9150dcff49846e7c59a43300b9088ba88279de.png)
Figure: quantiles of poorly fitting model’s residuals vs. expected Gaussian quantiles#
QQ-plot for quadratic model#
qqnorm(resid(quadratic.lm), pch = 23, bg = "orange", cex = 2)
![../../_images/c3a26bab72326cdf415c820636c9d06e3841f82204198b0432fe5f15216c444b.png](../../_images/c3a26bab72326cdf415c820636c9d06e3841f82204198b0432fe5f15216c444b.png)
Figure: quantiles of quadratic model’s residuals vs. expected Gaussian quantiles#
The qqnorm plots do not seem vastly different \(\implies\) several diagnostic tools can be useful in assessing a model.
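One small aid (an addition, not in the original code) is to overlay a reference line with qqline, which makes departures from normality easier to judge:
qqnorm(resid(quadratic.lm), pch = 23, bg = "orange", cex = 2)
qqline(resid(quadratic.lm), col = "red", lwd = 2)   # line through the first and third quartiles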
Default plots in R#
par(mfrow=c(2,2)) # this makes 2x2 grid of plots
plot(simple.lm)
![../../_images/0ffed2ebca1b4fada119f520bcca19fc034b6c02a10f1bcc99d8b52697609526.png](../../_images/0ffed2ebca1b4fada119f520bcca19fc034b6c02a10f1bcc99d8b52697609526.png)
Assessing constant variance assumption#
url = 'https://raw.githubusercontent.com/StanfordStatistics/stats191-data/main/HIV.VL.table'
viral.load = read.table(url, header=TRUE)
with(viral.load, plot(GSS, VL, pch=23, bg='orange', cex=2))
viral.lm = lm(VL ~ GSS, data=viral.load)
abline(viral.lm, col='red', lwd=2)
![../../_images/491d58bf766646c0d578f74c6218813901c4a03207a7ee5e922f62290c4f4332.png](../../_images/491d58bf766646c0d578f74c6218813901c4a03207a7ee5e922f62290c4f4332.png)
Removing an outlier#
viral.load$keep = (viral.load$VL < 200000)
with(viral.load, plot(GSS, VL, pch=23, bg='orange', cex=2))
viral.lm.good = lm(VL ~ GSS, subset=keep, data=viral.load)
abline(viral.lm.good, col='green', lwd=2)
![../../_images/c6fe945eca9be367a15413376ae8e6537eb62d890011a8e9d6d60647e939c9c1.png](../../_images/c6fe945eca9be367a15413376ae8e6537eb62d890011a8e9d6d60647e939c9c1.png)
Heteroscedastic errors#
When we plot the residuals against \(GSS\) for this model (even with the outlier removed), we see that the variance clearly depends on \(GSS\). The residuals also do not seem symmetric around 0, so perhaps the Gaussian model is not appropriate.
with(viral.load,
plot(GSS[keep], resid(viral.lm.good), pch=23,
bg='orange', cex=2, xlab='GSS', ylab='Residual'))
abline(h=0, lwd=2, col='red', lty=2)
![../../_images/6911d3d5f082f376d2a149b909aa885e0b3fc51c5869d4dfd00ddc3ad72b0765.png](../../_images/6911d3d5f082f376d2a149b909aa885e0b3fc51c5869d4dfd00ddc3ad72b0765.png)
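A common follow-up (a sketch, not part of the original analysis) is to refit the model with the response on the log scale, which often stabilizes the variance for a strictly positive response such as viral load; this assumes every retained VL value is positive:
viral.lm.log = lm(log(VL) ~ GSS, subset = keep, data = viral.load)   # hypothetical refit on the log scale
plot(viral.load$GSS[viral.load$keep], resid(viral.lm.log),
     pch = 23, bg = 'orange', cex = 2, xlab = 'GSS', ylab = 'Residual (log scale)')
abline(h = 0, lwd = 2, col = 'red', lty = 2)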
Plots from lm#
We can see some of these plots in R:
par(mfrow=c(2,2))
plot(viral.lm.good)
![../../_images/b36fe1e3a9485a2b7ab160cbd7ec9784ce01daa69bb0ab4a5a66a3e0775598c2.png](../../_images/b36fe1e3a9485a2b7ab160cbd7ec9784ce01daa69bb0ab4a5a66a3e0775598c2.png)