# Transformations
- We have been working with *linear* regression models so far in the
course.
- Some models are nonlinear, but can be *transformed* to a linear
model.
- We will also see that transformations can sometimes *stabilize* the variance
making constant variance a more reasonable assumption.
- Finally, we will see how to correct for unequal variance using a technique weighted least squares (WLS).
```R
options(repr.plot.width=4, repr.plot.height=4)
```
## Bacterial colony decay
Here is a simple dataset showing the number of bacteria alive in a colony, $N_t$
as a function of time $t$. A simple linear regression model is clearly not a very
good fit.
```R
bacteria.table = read.table('http://stats191.stanford.edu/data/bacteria.table', header=T)
plot(bacteria.table$t, bacteria.table$N_t, pch=23, cex=2, bg='orange')
bacteria.lm = lm(N_t ~ t, bacteria.table)
abline(bacteria.lm$coef, lwd=2, col='red')
```
```R
par(mfrow=c(2,2))
plot(bacteria.lm, pch=23, bg='orange')
```
## Exponential decay model
- Suppose the expected number of cells grows like
$$E(n_t) = n_0 e^{\beta_1t}, \qquad t=1, 2, 3, \dots$$
- If we take logs of both sides
$$\log E(n_t) = \log n_0 + \beta_1 t.$$
- A reasonable (?) model:
$$\log n_t = \log n_0 + \beta_1 t + \varepsilon_t, \qquad \varepsilon_t \overset{IID}{\sim} N(0,\sigma^2).$$
```R
bacteria.log.lm <- lm(log(N_t) ~ t, bacteria.table)
plot(bacteria.table$t, bacteria.table$N_t, pch=23, cex=2, bg='orange')
lines(bacteria.table$t, fitted(bacteria.lm), lwd=2, col='red')
lines(bacteria.table$t, exp(fitted(bacteria.log.lm)), lwd=2, col='green')
```
```R
par(mfrow=c(2,2))
plot(bacteria.log.lm, pch=23, bg='orange')
```
### Logarithmic transformation
- This model slightly different than original model:
$$E(\log n_t) \leq \log E(n_t)$$ but may be approximately true.
- If $\varepsilon_t \sim N(0,\sigma^2)$ then
$$n_t = n_0 \cdot \epsilon_t \cdot e^{\beta_1 t}.$$
- $\epsilon_t=e^{\varepsilon_t}$ is called a log-normal random
$(0,\sigma^2)$ random variable.
## Linearizing regression function
We see that an exponential growth or decay model can
be made (approximately) linear. Here are a few other models that can be linearized:
- $y=\alpha x^{\beta}$, use $\tilde{y}=\log(y), \tilde{x}=\log(x)$;
- $y=\alpha e^{\beta x}$, use $\tilde{y}=\log(y)$;
- $y=x/(\alpha x - \beta)$, use $\tilde{y}=1/y, \tilde{x}=1/x$.
- More in textbook.
### Caveats
- Just because expected value linearizes, doesn’t mean that the errors
behave correctly.
- In some cases, this can be corrected using weighted least squares
(more later).
- Constant variance, normality assumptions should still be checked.
## Stabilizing variance
- Sometimes, a transformation can turn non-constant variance errors to
"close to" constant variance. This is another situation in which we might
consider a transformation.
- Example: by the "delta rule", if
$$\text{Var}(Y) = \sigma^2 E(Y)$$ then
$$\text{Var}(\sqrt{Y}) \simeq \frac{\sigma^2}{4}.$$
- In practice, we might not know which transformation is best. [Box-Cox transformations](http://en.wikipedia.org/wiki/Power_transform) offer a tool to find a "best" transformation.
## Delta rule
The following approximation is ubiquitous in statistics.
- Taylor series expansion:
$$f(Y) = f(E(Y)) + \dot{f}(E(Y)) (Y - E(Y)) + \dots$$
- Taking expectations of both sides yields:
$$\text{Var}(f(Y)) \simeq \dot{f}(E(Y))^2 \cdot \text{Var}(Y)$$
## Delta rule
- So, for our previous example:
$$\text{Var}(\sqrt{Y}) \simeq \frac{\text{Var}(Y)}{4 \cdot E(Y)}$$
- Another example
$$\text{Var}(\log(Y)) \simeq \frac{\text{Var}(Y)}{E(Y)^2}.$$
### Caveats
- Just because a transformation makes variance constant doesn’t mean
regression function is still linear (or even that it was linear)!
- The models are approximations, and once a model is selected our
standard diagnostics should be used to assess adequacy of fit.
- It is possible to have non-constant variance but the variance
stabilizing transformation may destroy linearity of the regression
function.
- *Solution:* try weighted least squares (WLS).
## Correcting for unequal variance
- We will now see an example in which there seems to be strong evidence
for variance that changes based on `Region`.
- After observing this, we will create a new model that
attempts to *correct* for this and come up with better estimates.
- *Correcting* for unequal variance, as we describe it here, generally
requires a model for how the variance depends on observable quantities.
## Correcting for unequal variance

Variable

Description

$Y$

Per capita education expenditure by state

$X_1$

Per capita income in 1973 by state

$X_2$

Proportion of population under 18

$X_3$

Proportion in urban areas

`Region`

Which region of the country are the states located in

```R
education.table = read.table('http://stats191.stanford.edu/data/education1975.table', header=T)
education.table$Region = factor(education.table$Region)
education.lm = lm(Y ~ X1 + X2 + X3, data=education.table)
summary(education.lm)
```
```R
par(mfrow=c(2,2))
plot(education.lm)
```
```R
boxplot(rstandard(education.lm) ~ education.table$Region,
col=c('red', 'green', 'blue', 'yellow'))
```
```R
keep.subset = (education.table$STATE != 'AK')
education.noAK.lm = lm(Y ~ X1 + X2 + X3, subset=keep.subset,
data=education.table)
summary(education.noAK.lm)
```
```R
par(mfrow=c(2,2))
plot(education.noAK.lm)
```
```R
par(mfrow=c(1,1))
boxplot(rstandard(education.noAK.lm) ~ education.table$Region[keep.subset],
col=c('red', 'green', 'blue', 'yellow'))
```
## Reweighting observations
- If you have a reasonable guess of variance as a function of the
predictors, you can use this to *reweight* the data.
- Hypothetical example
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0,\sigma^2 X_i^2).$$
- Setting $\tilde{Y}_i = Y_i / X_i$, $\tilde{X}_i = 1 / X_i$, model
becomes
$$\tilde{Y}_i = \beta_0 \tilde{X}_i + \beta_1 + \epsilon_i, \epsilon_i \sim N(0,\sigma^2).$$
## Weighted Least Squares
- Fitting this model is equivalent to minimizing
$$\sum_{i=1}^n \frac{1}{X_i^2} \left(Y_i - \beta_0 - \beta_1 X_i\right)^2$$
- Weighted Least Squares
$$SSE(\beta, w) = \sum_{i=1}^n w_i \left(Y_i - \beta_0 - \beta_1 X_i\right)^2, \qquad w_i = \frac{1}{X_i^2}.$$
- In general, weights should be like:
$$w_i = \frac{1}{\text{Var}(\varepsilon_i)}.$$
- Our education expenditure example assumes
$$
w_i = W_{\tt Region[i]}
$$
## Common weighting schemes
- If you have a qualitative variable, then it is easy to estimate
weight within groups (our example today).
- "Often" $$\text{Var}(\varepsilon_i) = \text{Var}(Y_i) = V(E(Y_i))$$
- Many non-Gaussian (non-Normal) models behave like this: logistic, Poisson
regression.
## Two stage procedure
- Suppose we have a hypothesis about the weights, i.e. they are
constant within Region, or they are something like
$$w_i^{-1} = \text{Var}(\epsilon_i) = \alpha_0 + \alpha_1 X_{i1}^2.$$
- We pre-whiten:
1. Fit model using OLS (Ordinary Least Squares) to get initial
estimate $\widehat{\beta}_{OLS}$
2. Use predicted values from this model to estimate $w_i$.
3. Refit model using WLS (Weighted Least Squares).
4. If needed, iterate previous two steps.
```R
educ.weights = 0 * education.table$Y
for (region in levels(education.table$Region)) {
subset.region = (education.table$Region[keep.subset] == region)
educ.weights[subset.region] <- 1.0 / (sum(resid(education.noAK.lm)
[subset.region]^2) /
sum(subset.region))
}
```
```R
unique(educ.weights)
```
Here is our new model. Note that the scale of the estimates is *unchanged*. Numerically
the estimates are similar. What changes most is the `Std. Error` column.
```R
education.noAK.weight.lm <- lm(Y ~ X1 + X2 + X3,
weights=educ.weights,
subset=keep.subset,
data=education.table)
summary(education.noAK.weight.lm)
```
```R
summary(education.noAK.lm)
```
```R
par(mfrow=c(2,2))
plot(education.noAK.weight.lm)
```
Let's look at the boxplot again. It looks better, but perhaps not perfect.
```R
par(mfrow=c(1,1))
boxplot(resid(education.noAK.weight.lm, type='pearson') ~ education.table$Region[keep.subset],
col=c('red', 'green', 'blue', 'yellow'))
```
## Unequal variance: effects on inference
- So far, we have just mentioned that things *may* have unequal
variance, but not thought about how it affects inference.
- In general, if we ignore unequal variance, our estimates of variance
are no longer unbiased. The covariance has the “sandwich form”
$$\text{Cov}(\widehat{\beta}_{OLS}) = (X'X)^{-1}(X'W^{-1}X)(X'X)^{-1}.$$
with $W=\text{diag}(1/\sigma^2_i)$.
- ** If our `Std. Error` is incorrect, so are our conclusions based on $t$-statistics!**
- In this example, correcting for weights seemed to make the $t$-statistics larger. ** This will not always be the case!**
## Efficiency
- The efficiency of an unbiased estimator of $\beta$ is 1 / variance.
- Estimators can be compared by their efficiency: the more efficient,
the better.
- The other reason to correct for unequal variance (besides so that we
get valid inference) is for efficiency.
## Illustrative example
- Suppose
$$Z_i = \mu + \varepsilon_i, \qquad \varepsilon_i \sim N(0, i^2 \cdot \sigma^2), 1 \leq i \leq n.$$
- Two unbiased estimators of $\mu$: $$\begin{aligned}
\widehat{\mu}_1 &= \frac{1}{n}\sum_{i=1}^n Z_i \\
\widehat{\mu}_2 &= \frac{1}{\sum_{i=1}^n i^{-2}}\sum_{i=1}^n i^{-2}Z_i \\
\widehat{\mu}_3 &= \frac{1}{\sum_{i=1}^n i^{-1}}\sum_{i=1}^n i^{-1}Z_i \\
\end{aligned}$$
## Illustrative example
- The estimator $\widehat{\mu}_2$ will always have lower variance,
hence tighter confidence intervals.
- The estimator $\widehat{\mu}_3$ has incorrect weights, but they are "closer" to correct
than the naive mean's weights which assume each observation has equal variance.
```R
ntrial = 10000 # how many trials will we be doing?
nsample = 20 # how many points in each trial
sd = c(1:20) # how does the variance change
mu = 2.0
get.sample <- function() {
return(rnorm(nsample)*sd + mu)
}
unweighted.estimate <- function(cur.sample) {
return(mean(cur.sample))
}
unweighted.estimate <- numeric(ntrial)
weighted.estimate <- numeric(ntrial)
suboptimal.estimate <- numeric(ntrial)
```
Let's simulate a number of experiments and compare the three estimates.
```R
ntrial = 1000
for (i in 1:ntrial) {
cur.sample = get.sample()
unweighted.estimate[i] = mean(cur.sample)
weighted.estimate[i] = sum(cur.sample/sd^2) / sum(1/sd^2)
suboptimal.estimate[i] = sum(cur.sample/sd) / sum(1/sd)
}
data.frame(mean(unweighted.estimate),
sd(unweighted.estimate))
data.frame(mean(weighted.estimate),
sd(weighted.estimate))
data.frame(mean(suboptimal.estimate),
sd(suboptimal.estimate))
```
```R
densY = c(density(unweighted.estimate)$y, density(weighted.estimate)$y, density(suboptimal.estimate)$y)
densX = c(density(unweighted.estimate)$x, density(weighted.estimate)$x, density(suboptimal.estimate)$x)
options(repr.plot.width=5, repr.plot.height=5)
```
```R
plot(densX, densY, type='n', main='Comparison of densities of the estimators', cex=0.8)
lines(density(weighted.estimate), col='red', lwd=4)
lines(density(unweighted.estimate), col='blue', lwd=4)
lines(density(suboptimal.estimate), col='purple', lwd=4)
legend(6,0.3, c('optimal', 'suboptimal', 'mean'), col=c('red', 'purple', 'blue'), lwd=rep(4,3), cex=0.8)
```