# Transformations¶

• We have been working with linear regression models so far in the course.

• Some models are nonlinear, but can be transformed to a linear model.

• We will also see that transformations can sometimes stabilize the variance making constant variance a more reasonable assumption.

• Finally, we will see how to correct for unequal variance using a technique weighted least squares (WLS).

In [1]:
options(repr.plot.width=4, repr.plot.height=4)


## Bacterial colony decay¶

Here is a simple dataset showing the number of bacteria alive in a colony, $N_t$ as a function of time $t$. A simple linear regression model is clearly not a very good fit.

In [2]:
bacteria.table = read.table('http://stats191.stanford.edu/data/bacteria.table', header=T)
plot(bacteria.table$t, bacteria.table$N_t, pch=23, cex=2, bg='orange')
bacteria.lm = lm(N_t ~ t, bacteria.table)
abline(bacteria.lm$coef, lwd=2, col='red')  In [3]: par(mfrow=c(2,2)) plot(bacteria.lm, pch=23, bg='orange')  ## Exponential decay model¶ • Suppose the expected number of cells grows like $$E(n_t) = n_0 e^{\beta_1t}, \qquad t=1, 2, 3, \dots$$ • If we take logs of both sides $$\log E(n_t) = \log n_0 + \beta_1 t.$$ • A reasonable (?) model: $$\log n_t = \log n_0 + \beta_1 t + \varepsilon_t, \qquad \varepsilon_t \overset{IID}{\sim} N(0,\sigma^2).$$ In [4]: bacteria.log.lm <- lm(log(N_t) ~ t, bacteria.table) plot(bacteria.table$t, bacteria.table$N_t, pch=23, cex=2, bg='orange') lines(bacteria.table$t, fitted(bacteria.lm), lwd=2, col='red')
lines(bacteria.table$t, exp(fitted(bacteria.log.lm)), lwd=2, col='green')  In [5]: par(mfrow=c(2,2)) plot(bacteria.log.lm, pch=23, bg='orange')  ### Logarithmic transformation¶ • This model slightly different than original model: $$E(\log n_t) \leq \log E(n_t)$$ but may be approximately true. • If$\varepsilon_t \sim N(0,\sigma^2)$then $$n_t = n_0 \cdot \epsilon_t \cdot e^{\beta_1 t}.$$ •$\epsilon_t=e^{\varepsilon_t}$is called a log-normal random$(0,\sigma^2)$random variable. ## Linearizing regression function¶ We see that an exponential growth or decay model can be made (approximately) linear. Here are a few other models that can be linearized: •$y=\alpha x^{\beta}$, use$\tilde{y}=\log(y), \tilde{x}=\log(x)$; •$y=\alpha e^{\beta x}$, use$\tilde{y}=\log(y)$; •$y=x/(\alpha x - \beta)$, use$\tilde{y}=1/y, \tilde{x}=1/x$. • More in textbook. ### Caveats¶ • Just because expected value linearizes, doesn’t mean that the errors behave correctly. • In some cases, this can be corrected using weighted least squares (more later). • Constant variance, normality assumptions should still be checked. ## Stabilizing variance¶ • Sometimes, a transformation can turn non-constant variance errors to "close to" constant variance. This is another situation in which we might consider a transformation. • Example: by the "delta rule", if $$\text{Var}(Y) = \sigma^2 E(Y)$$ then $$\text{Var}(\sqrt{Y}) \simeq \frac{\sigma^2}{4}.$$ • In practice, we might not know which transformation is best. Box-Cox transformations offer a tool to find a "best" transformation. ## Delta rule¶ The following approximation is ubiquitous in statistics. • Taylor series expansion: $$f(Y) = f(E(Y)) + \dot{f}(E(Y)) (Y - E(Y)) + \dots$$ • Taking expectations of both sides yields: $$\text{Var}(f(Y)) \simeq \dot{f}(E(Y))^2 \cdot \text{Var}(Y)$$ ## Delta rule¶ • So, for our previous example: $$\text{Var}(\sqrt{Y}) \simeq \frac{\text{Var}(Y)}{4 \cdot E(Y)}$$ • Another example $$\text{Var}(\log(Y)) \simeq \frac{\text{Var}(Y)}{E(Y)^2}.$$ ### Caveats¶ • Just because a transformation makes variance constant doesn’t mean regression function is still linear (or even that it was linear)! • The models are approximations, and once a model is selected our standard diagnostics should be used to assess adequacy of fit. • It is possible to have non-constant variance but the variance stabilizing transformation may destroy linearity of the regression function. • Solution: try weighted least squares (WLS). ## Correcting for unequal variance¶ • We will now see an example in which there seems to be strong evidence for variance that changes based on Region. • After observing this, we will create a new model that attempts to correct for this and come up with better estimates. • Correcting for unequal variance, as we describe it here, generally requires a model for how the variance depends on observable quantities. ## Correcting for unequal variance¶  Variable Description$Y$Per capita education expenditure by state$X_1$Per capita income in 1973 by state$X_2$Proportion of population under 18$X_3$Proportion in urban areas Region Which region of the country are the states located in In [6]: education.table = read.table('http://stats191.stanford.edu/data/education1975.table', header=T) education.table$Region = factor(education.table$Region) education.lm = lm(Y ~ X1 + X2 + X3, data=education.table) summary(education.lm)  Call: lm(formula = Y ~ X1 + X2 + X3, data = education.table) Residuals: Min 1Q Median 3Q Max -84.878 -26.878 -3.827 22.246 99.243 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -5.566e+02 1.232e+02 -4.518 4.34e-05 *** X1 7.239e-02 1.160e-02 6.239 1.27e-07 *** X2 1.552e+00 3.147e-01 4.932 1.10e-05 *** X3 -4.269e-03 5.139e-02 -0.083 0.934 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 40.47 on 46 degrees of freedom Multiple R-squared: 0.5913, Adjusted R-squared: 0.5647 F-statistic: 22.19 on 3 and 46 DF, p-value: 4.945e-09  In [7]: par(mfrow=c(2,2)) plot(education.lm)  In [8]: boxplot(rstandard(education.lm) ~ education.table$Region,
col=c('red', 'green', 'blue', 'yellow'))

In [9]:
keep.subset = (education.table$STATE != 'AK') education.noAK.lm = lm(Y ~ X1 + X2 + X3, subset=keep.subset, data=education.table) summary(education.noAK.lm)  Call: lm(formula = Y ~ X1 + X2 + X3, data = education.table, subset = keep.subset) Residuals: Min 1Q Median 3Q Max -81.128 -22.154 -7.542 22.542 80.890 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -277.57731 132.42286 -2.096 0.041724 * X1 0.04829 0.01215 3.976 0.000252 *** X2 0.88693 0.33114 2.678 0.010291 * X3 0.06679 0.04934 1.354 0.182591 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 35.81 on 45 degrees of freedom Multiple R-squared: 0.4967, Adjusted R-squared: 0.4631 F-statistic: 14.8 on 3 and 45 DF, p-value: 7.653e-07  In [10]: par(mfrow=c(2,2)) plot(education.noAK.lm)  In [11]: par(mfrow=c(1,1)) boxplot(rstandard(education.noAK.lm) ~ education.table$Region[keep.subset],
col=c('red', 'green', 'blue', 'yellow'))


## Reweighting observations¶

• If you have a reasonable guess of variance as a function of the predictors, you can use this to reweight the data.

• Hypothetical example $$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0,\sigma^2 X_i^2).$$

• Setting $\tilde{Y}_i = Y_i / X_i$, $\tilde{X}_i = 1 / X_i$, model becomes $$\tilde{Y}_i = \beta_0 \tilde{X}_i + \beta_1 + \epsilon_i, \epsilon_i \sim N(0,\sigma^2).$$

## Weighted Least Squares¶

• Fitting this model is equivalent to minimizing $$\sum_{i=1}^n \frac{1}{X_i^2} \left(Y_i - \beta_0 - \beta_1 X_i\right)^2$$

• Weighted Least Squares $$SSE(\beta, w) = \sum_{i=1}^n w_i \left(Y_i - \beta_0 - \beta_1 X_i\right)^2, \qquad w_i = \frac{1}{X_i^2}.$$

• In general, weights should be like: $$w_i = \frac{1}{\text{Var}(\varepsilon_i)}.$$

• Our education expenditure example assumes $$w_i = W_{\tt Region[i]}$$

## Common weighting schemes¶

• If you have a qualitative variable, then it is easy to estimate weight within groups (our example today).

• "Often" $$\text{Var}(\varepsilon_i) = \text{Var}(Y_i) = V(E(Y_i))$$

• Many non-Gaussian (non-Normal) models behave like this: logistic, Poisson regression.

## Two stage procedure¶

• Suppose we have a hypothesis about the weights, i.e. they are constant within Region, or they are something like $$w_i^{-1} = \text{Var}(\epsilon_i) = \alpha_0 + \alpha_1 X_{i1}^2.$$

• We pre-whiten:

1. Fit model using OLS (Ordinary Least Squares) to get initial estimate $\widehat{\beta}_{OLS}$

2. Use predicted values from this model to estimate $w_i$.

3. Refit model using WLS (Weighted Least Squares).

4. If needed, iterate previous two steps.

In [12]:
educ.weights = 0 * education.table$Y for (region in levels(education.table$Region)) {
subset.region = (education.table$Region[keep.subset] == region) educ.weights[subset.region] <- 1.0 / (sum(resid(education.noAK.lm) [subset.region]^2) / sum(subset.region)) }  In [13]: unique(educ.weights)  1. 0.00068912628894633 2. 0.000410344340187708 3. 0.00400908850190692 4. 0.00105216278017872 Here is our new model. Note that the scale of the estimates is unchanged. Numerically the estimates are similar. What changes most is the Std. Error column. In [14]: education.noAK.weight.lm <- lm(Y ~ X1 + X2 + X3, weights=educ.weights, subset=keep.subset, data=education.table) summary(education.noAK.weight.lm)  Call: lm(formula = Y ~ X1 + X2 + X3, data = education.table, subset = keep.subset, weights = educ.weights) Weighted Residuals: Min 1Q Median 3Q Max -1.69882 -0.71382 -0.07928 0.79298 1.86328 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -3.181e+02 7.833e+01 -4.060 0.000193 *** X1 6.245e-02 7.867e-03 7.938 4.24e-10 *** X2 8.791e-01 2.003e-01 4.388 6.83e-05 *** X3 2.981e-02 3.421e-02 0.871 0.388178 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.984 on 45 degrees of freedom Multiple R-squared: 0.7566, Adjusted R-squared: 0.7404 F-statistic: 46.63 on 3 and 45 DF, p-value: 7.41e-14  In [15]: summary(education.noAK.lm)  Call: lm(formula = Y ~ X1 + X2 + X3, data = education.table, subset = keep.subset) Residuals: Min 1Q Median 3Q Max -81.128 -22.154 -7.542 22.542 80.890 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -277.57731 132.42286 -2.096 0.041724 * X1 0.04829 0.01215 3.976 0.000252 *** X2 0.88693 0.33114 2.678 0.010291 * X3 0.06679 0.04934 1.354 0.182591 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 35.81 on 45 degrees of freedom Multiple R-squared: 0.4967, Adjusted R-squared: 0.4631 F-statistic: 14.8 on 3 and 45 DF, p-value: 7.653e-07  In [16]: par(mfrow=c(2,2)) plot(education.noAK.weight.lm)  Let's look at the boxplot again. It looks better, but perhaps not perfect. In [17]: par(mfrow=c(1,1)) boxplot(resid(education.noAK.weight.lm, type='pearson') ~ education.table$Region[keep.subset],
col=c('red', 'green', 'blue', 'yellow'))


## Unequal variance: effects on inference¶

• So far, we have just mentioned that things may have unequal variance, but not thought about how it affects inference.

• In general, if we ignore unequal variance, our estimates of variance are no longer unbiased. The covariance has the “sandwich form” $$\text{Cov}(\widehat{\beta}_{OLS}) = (X'X)^{-1}(X'W^{-1}X)(X'X)^{-1}.$$ with $W=\text{diag}(1/\sigma^2_i)$.

• If our Std. Error is incorrect, so are our conclusions based on $t$-statistics!

• In this example, correcting for weights seemed to make the $t$-statistics larger. This will not always be the case!

## Efficiency¶

• The efficiency of an unbiased estimator of $\beta$ is 1 / variance.

• Estimators can be compared by their efficiency: the more efficient, the better.

• The other reason to correct for unequal variance (besides so that we get valid inference) is for efficiency.

## Illustrative example¶

• Suppose $$Z_i = \mu + \varepsilon_i, \qquad \varepsilon_i \sim N(0, i^2 \cdot \sigma^2), 1 \leq i \leq n.$$

• Two unbiased estimators of $\mu$: \begin{aligned} \widehat{\mu}_1 &= \frac{1}{n}\sum_{i=1}^n Z_i \\ \widehat{\mu}_2 &= \frac{1}{\sum_{i=1}^n i^{-2}}\sum_{i=1}^n i^{-2}Z_i \\ \widehat{\mu}_3 &= \frac{1}{\sum_{i=1}^n i^{-1}}\sum_{i=1}^n i^{-1}Z_i \\ \end{aligned}

## Illustrative example¶

• The estimator $\widehat{\mu}_2$ will always have lower variance, hence tighter confidence intervals.

• The estimator $\widehat{\mu}_3$ has incorrect weights, but they are "closer" to correct than the naive mean's weights which assume each observation has equal variance.

In [18]:
ntrial = 10000   # how many trials will we be doing?
nsample = 20   # how many points in each trial
sd = c(1:20)   # how does the variance change
mu = 2.0

get.sample <- function() {
return(rnorm(nsample)*sd + mu)
}

unweighted.estimate <- function(cur.sample) {
return(mean(cur.sample))
}

unweighted.estimate <- numeric(ntrial)
weighted.estimate <- numeric(ntrial)
suboptimal.estimate <- numeric(ntrial)


Let's simulate a number of experiments and compare the three estimates.

In [19]:
ntrial = 1000
for (i in 1:ntrial) {
cur.sample = get.sample()
unweighted.estimate[i] = mean(cur.sample)
weighted.estimate[i] = sum(cur.sample/sd^2) / sum(1/sd^2)
suboptimal.estimate[i] = sum(cur.sample/sd) / sum(1/sd)
}

data.frame(mean(unweighted.estimate),
sd(unweighted.estimate))
data.frame(mean(weighted.estimate),
sd(weighted.estimate))
data.frame(mean(suboptimal.estimate),
sd(suboptimal.estimate))

mean.unweighted.estimate.sd.unweighted.estimate.
0.20817391.031516
mean.weighted.estimate.sd.weighted.estimate.
0.20140670.6511842
mean.suboptimal.estimate.sd.suboptimal.estimate.
0.202308 0.7179637
In [20]:
densY = c(density(unweighted.estimate)$y, density(weighted.estimate)$y, density(suboptimal.estimate)$y) densX = c(density(unweighted.estimate)$x, density(weighted.estimate)$x, density(suboptimal.estimate)$x)
options(repr.plot.width=5, repr.plot.height=5)

In [21]:
plot(densX, densY, type='n', main='Comparison of densities of the estimators', cex=0.8)
lines(density(weighted.estimate), col='red', lwd=4)
lines(density(unweighted.estimate), col='blue', lwd=4)
lines(density(suboptimal.estimate), col='purple', lwd=4)
legend(6,0.3, c('optimal', 'suboptimal', 'mean'), col=c('red', 'purple', 'blue'), lwd=rep(4,3), cex=0.8)