Bootstrap#

Download#

Bootstrapping linear regression#

{height=400 fig-align=”center”}

Figure: our multiple linear regression model#

\[Y|X = X\beta + \epsilon\]

  • We’ve talked about checking assumptions.

  • What to do if the assumptions don’t hold?

  • We will use the bootstrap!

Random \(X\): pairs bootstrap#

  • Suppose we think of the pairs \((X_i, Y_i)\) coming from some distribution \(F\) – this is a distribution for both the features and the outcome.

  • In our usual model, \(\beta\) is clearly defined. What is \(\beta\) without this assumption?

Population least squares#

  • For our distribution \(F\), we can define

\[ E_F[\pmb{X}\pmb{X}^T], \qquad E_F[\pmb{X} \cdot \pmb{Y}] \]

where \((\pmb{X}, \pmb{Y}) \sim F\) leading to

\[ \beta(F) = \left(E_F[\pmb{X}\pmb{X}^T]\right)^{-1} E_F[\pmb{X} \cdot \pmb{Y}]. \]

Asymptotics#

  • In fact, our least squares estimator is \(\beta(\hat{F}_n)\) where \(\hat{F}_n\) is the empirical distribution of our sample of \(n\) observations from \(F\).

  • As we take a larger and larger sample,

\[ \beta(\hat{F}_n) \to \beta(F) \]

and

\[ n^{1/2}(\beta(\hat{F}_n) - \beta(F)) \to N(0, \Sigma(F)) \]

for some covariance matrix \(\Sigma=\Sigma(F)\) depending only on \(F\).


  • Recall the variance of OLS estimator (with \(X\) fixed): $\( (X^TX)^{-1} \text{Var}(X^T(Y-X\beta)) (X^TX)^{-1}. \)$

  • With \(X\) random and \(n\) large this is approximately $\( \frac{1}{n} \left(E_F[\pmb{X}\pmb{X}^T] \right)^{-1} \text{Var}_F(\pmb{X} \cdot (\pmb{Y} - \pmb{X} \beta(F))) \left(E_F[\pmb{X}\pmb{X}^T] \right)^{-1}. \)$

Population least squares#

  • In usual model,

\[\text{Var}(X^T(Y-X\beta)) = \sigma^2 X^TX \approx n \cdot E_F[\pmb{X} \pmb{X}^T].\]
  • This is wrong in general!

  • We will use OLS estimate – but correct its variance!

  • Can we get our hands on \(\text{Var}(X^T(Y-X\beta))\) or \(\text{Var}(\hat{\beta})\) without a model?

Nonparametric bootstrap in a nutshell#

Basic algorithm for pairs#

There are many variants of the bootstrap, most using roughly this structure

boot_sample = c()
for (b in 1:B) {
    idx_star = sample(1:n, n, replace=TRUE)
    X_star = X[idx_star,]
    Y_star = Y[idx_star]
    boot_sample = rbind(boot_sample, coef(lm(Y_star ~ X_star)))
}
cov_beta_boot = cov(boot_sample)

Residual bootstrap#

  • If \(X\) is fixed, it doesn’t make sense to sample new \(X\) values for \(X^*\).

  • Residual bootstrap keeps \(X\) fixed, but adds randomly sampled residuals

boot_sample = c()
M = lm(Y ~ X - 1)
beta.hat = coef(M)
X = model.matrix(M)
Y.hat = X @ beta.hat
r.hat = Y - Y.hat
for (b in 1:B) {
    idx_star = sample(1:n, n, replace=TRUE)
    X_star = X 
    Y_star = Y.hat + r.hat[idx_star]
    boot_sample = rbind(boot_sample, coef(lm(Y_star ~ X_star - 1)))
}
cov_beta_boot = cov(boot_sample)

Bootstrap for inference#

  • Estimated covariance cov_beta_boot can be used to estimate \(\text{Var}(a^T\hat{\beta})\) for confidence intervals or general linear hypothesis tests.

  • Software does something slightly different – using percentiles of the bootstrap sample: bootstrap percentile intervals.

Bootstrapping regression#

Reference for more R examples#

library(car)
n = 50
X = rexp(n)
Y = 3 + 2.5 * X + X * (rexp(n) - 1) # our usual model is false here! 
Y.lm = lm(Y ~ X)
Loading required package: carData

pairs.Y.lm = Boot(Y.lm, coef, method='case', R=1000)
confint(Y.lm) 
confint(pairs.Y.lm, type='norm') # using bootstrap SE
confint(pairs.Y.lm, type='perc') # using percentiles
Loading required namespace: boot
A matrix: 2 × 2 of type dbl
2.5 %97.5 %
(Intercept)2.5018693.353018
X2.2357862.978521
A confint.boot: 2 × 2 of type dbl
2.5 %97.5 %
(Intercept)2.6267673.237522
X2.0219543.179938
A confint.boot: 2 × 2 of type dbl
2.5 %97.5 %
(Intercept)2.5657973.178609
X2.1358283.279040

Using the boot package#

  • The Boot function in car is a wrapper around the more general boot function.

library(boot)
D = data.frame(X, Y)
bootstrap_stat = function(D, bootstrap_idx) {
    return(summary(lm(Y ~ X, data=D[bootstrap_idx,]))$coef[,1])
}
boot_results = boot(D, bootstrap_stat, R=500)
confint(boot_results, type='norm') # using bootstrap SE
confint(boot_results, type='perc') # using percentiles
Attaching package: ‘boot’
The following object is masked from ‘package:car’:

    logit
A confint.boot: 2 × 2 of type dbl
2.5 %97.5 %
2.6138703.233349
2.0212053.204503
A confint.boot: 2 × 2 of type dbl
2.5 %97.5 %
2.5562503.170863
2.1101863.275172

How is the coverage?#

  • First we’ll use the standard regression model but errors that aren’t Gaussian.

noise = function(n) { return(rexp(n) - 1) }

simulate_correct = function(n=200, b=0.5) {
    X = rexp(n)
    Y = 3 + b * X + noise(n)
    Y.lm = lm(Y ~ X)

    # parametric interval
    int_param = confint(Y.lm)[2,]
    
    # pairs bootstrap interval   
    pairs.Y.lm = Boot(Y.lm, coef, method='case', R=500)
    pairs_SE = sqrt(cov(pairs.Y.lm$t)[2,2]) # $t is the bootstrap sample
    int_pairs = c(coef(Y.lm)[2] - qnorm(0.975) * pairs_SE,
                  coef(Y.lm)[2] + qnorm(0.975) * pairs_SE)

    result = c((int_param[1] < b) * (int_param[2] > b),
              (int_pairs[1] < b) * (int_pairs[2] > b))
    names(result) = c('parametric', 'bootstrap')
    return(result)
}

Check one instance#

simulate_correct()
parametric
1
bootstrap
1

Check coverage#

nsim = 100
coverage = c()
for (i in 1:nsim) {
    coverage = rbind(coverage, simulate_correct())
}
print(apply(coverage, 2, mean))
parametric  bootstrap 
      0.94       1.00 

Misspecified model#

  • Now we make data for which we might have used WLS but we don’t have a model for the weights!

simulate_incorrect = function(n=200, b=0.5) {
    X = rexp(n)
    # the usual model is 
    # quite off here -- Var(X^TY) is not well
    # approximated by sigma^2 * X^TX...
    Y = 3 + b * X + X * noise(n)
    Y.lm = lm(Y ~ X)

    # parametric interval
    int_param = confint(Y.lm)[2,]
    
    # pairs bootstrap interval
    pairs.Y.lm = Boot(Y.lm, coef, method='case', R=500)
    pairs_SE = sqrt(cov(pairs.Y.lm$t)[2,2]) # $t is the bootstrap sample of coefficients
                                            # we want the 2nd coefficient
    int_pairs = c(coef(Y.lm)[2] - qnorm(0.975) * pairs_SE,
                  coef(Y.lm)[2] + qnorm(0.975) * pairs_SE)

    result = c((int_param[1] < b) * (int_param[2] > b),
              (int_pairs[1] < b) * (int_pairs[2] > b))
    
    names(result) = c('parametric', 'bootstrap')
    return(result)
}

Check one instance#

simulate_incorrect()
parametric
0
bootstrap
0

Check coverage#

nsim = 100
coverage = c()
for (i in 1:nsim) {
    coverage = rbind(coverage, simulate_incorrect())
}

print(apply(coverage, 2, mean))
parametric  bootstrap 
      0.58       0.93