Bagging#

  • Bagging = Bootstrap Aggregating

  • In the Bootstrap, we replicate our dataset by sampling with replacement:

    • Original dataset: x=c(x1, x2, ..., x100)

    • Bootstrap samples: boot1=sample(x, 100, replace=TRUE), …, bootB=sample(x, 100, replace=TRUE)

  • We used these samples to get the standard error of a parameter estimate (see the R sketch below): \(\text{SE}(\hat\beta_1) \approx \sqrt{\frac{1}{B-1}\sum_{b=1}^B \bigl(\hat\beta_1^{(b)} - \bar{\hat\beta}_1\bigr)^2}\), where \(\bar{\hat\beta}_1 = \frac{1}{B}\sum_{b=1}^B \hat\beta_1^{(b)}\).
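
A minimal R sketch of this idea, assuming a numeric vector `x` and using the sample mean as the statistic (the names `x`, `B`, and `boot_est` are just for illustration):

```r
# Minimal sketch: Bootstrap estimate of a standard error.
set.seed(1)
x <- rnorm(100)          # placeholder data; any numeric vector would do
B <- 1000                # number of Bootstrap replicates

# Recompute the statistic on each Bootstrap sample
boot_est <- replicate(B, mean(sample(x, length(x), replace = TRUE)))

sd(boot_est)             # Bootstrap estimate of the standard error
```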


Overview#

  • In bagging we average the predictions of a model fit to many Bootstrap samples.

  • Example: Bagging the Lasso (a code sketch follows this list)

    • Let \(\hat y^{L,b}\) be the prediction of the Lasso applied to the \(b\)th bootstrap sample.

    • Bagging prediction: \(\hat y^\text{bag} = \frac{1}{B} \sum_{b=1}^B \hat y^{L,b}.\)
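
A hedged sketch of this procedure in R, using `glmnet` for the Lasso; the training data `X`, `y`, the prediction points `X_new`, and the choice of `lambda.min` are all assumptions for illustration:

```r
# Sketch of bagging the Lasso with glmnet.
library(glmnet)

set.seed(1)
n    <- nrow(X)
B    <- 100
pred <- matrix(NA, nrow(X_new), B)   # one column of predictions per Bootstrap fit

for (b in 1:B) {
  idx       <- sample(n, n, replace = TRUE)             # Bootstrap sample
  fit       <- cv.glmnet(X[idx, ], y[idx], alpha = 1)   # Lasso, lambda chosen by CV
  pred[, b] <- predict(fit, newx = X_new, s = "lambda.min")
}

y_bag <- rowMeans(pred)              # bagged prediction: average over the B fits
```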


When does Bagging make sense?#

  • When a regression method or a classifier has a tendency to overfit, bagging reduces the variance of the prediction.

  • When \(n\) is large, the empirical distribution is close to the true data-generating distribution.

  • Bootstrap samples are similar to independent realizations of the data. They are actually conditionally independent, given the data.

  • Bagging smooths out an estimator, which can reduce variance (see the note after this list).
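
A standard decomposition, not stated in these notes but useful for making the variance claim precise: if the \(B\) Bootstrap predictions are identically distributed with variance \(\sigma^2\) and pairwise correlation \(\rho\), their average has variance \(\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\). The second term vanishes as \(B\) grows, while the first term explains why it pays to de-correlate the individual fits, which is what Random Forests (introduced below) aim to do.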


Bagging decision trees#

  • Disadvantage: Every time we fit a decision tree to a Bootstrap sample, we get a different tree \(T^b\).

  • \(\to\) Loss of interpretability

  • Variable importance:

    • For each predictor, add up the total amount by which the RSS (or Gini index) decreases every time we use the predictor in \(T^b\).

    • Average this total over the Bootstrap trees \(T^1,\dots,T^B\) (a sketch with the randomForest package follows this list).
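
A minimal sketch of how this is typically computed in R, using the randomForest package; setting `mtry = p` (all predictors available at every split) makes the forest equivalent to bagging, and a data frame `dat` with response `y` is an assumed placeholder:

```r
# Sketch: variable importance from bagged trees via the randomForest package.
library(randomForest)

p   <- ncol(dat) - 1
bag <- randomForest(y ~ ., data = dat, mtry = p, ntree = 500, importance = TRUE)

importance(bag)    # per-predictor decrease in RSS (regression) or Gini index (classification)
varImpPlot(bag)    # importance plot along the lines of Fig 8.9
```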


Variable importance#

Fig 8.9

Out-of-bag (OOB) error#

  • To estimate the test error of a bagging estimate, we could use cross-validation.

  • Each time we draw a Bootstrap sample, we only use ~63% of the observations.

  • Idea: use the rest of the observations as a test set.

  • OOB error (sketched in code after this list):

    • For each sample \(x_i\), find the prediction \(\hat y_{i}^b\) for all bootstrap samples \(b\) which do not contain \(x_i\). There should be around \(0.37B\) of them. Average these predictions to obtain \(\hat y_{i}^\text{oob}\).

    • Compute the error \((y_i-\hat y_{i}^\text{oob})^2\).

    • Average the errors over all observations \(i=1,\dots,n\).
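
A minimal sketch of this computation for bagged regression trees, done by hand; the data frame `dat` with numeric response `y` is an assumed placeholder, and `rpart` is used for the individual trees:

```r
# Sketch: OOB error for bagged regression trees.
library(rpart)

set.seed(1)
B <- 500
n <- nrow(dat)
pred_sum   <- numeric(n)   # running sum of OOB predictions for each observation
pred_count <- numeric(n)   # number of trees for which observation i was out of bag

for (b in 1:B) {
  idx  <- sample(n, n, replace = TRUE)       # Bootstrap sample
  oob  <- setdiff(1:n, idx)                  # the ~37% of observations left out
  tree <- rpart(y ~ ., data = dat[idx, ])    # tree fit to the Bootstrap sample
  pred_sum[oob]   <- pred_sum[oob] + predict(tree, newdata = dat[oob, ])
  pred_count[oob] <- pred_count[oob] + 1
}

y_oob     <- pred_sum / pred_count           # average over the ~0.37 * B relevant trees
oob_error <- mean((dat$y - y_oob)^2)         # OOB estimate of the test MSE
```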


Fig 8.8
  • The test error decreases as we increase \(B\) (dashed line is the error for a plain decision tree).

  • Note: OOB: Bagging and OOB: RandomForest labels are incorrect – should be swapped.


Random Forests#

  • Bagging has a problem: the trees produced by different Bootstrap samples can be very similar (highly correlated), so averaging them reduces variance less than averaging independent trees would.

  • Random Forests:

    • We fit a decision tree to different Bootstrap samples.

    • When growing the tree, we select a random sample of \(m<p\) predictors to consider at each split.

    • This leads to trees that are less correlated across Bootstrap samples.

    • Finally, average the predictions of the individual trees (a code sketch follows this list).
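
A hedged sketch comparing the two in R with the randomForest package; `mtry = p` reproduces bagging, `mtry = floor(sqrt(p))` gives a Random Forest, and the data frame `dat` with numeric response `y` is an assumed placeholder:

```r
# Sketch: bagging vs. a Random Forest, compared via the OOB error.
library(randomForest)

set.seed(1)
p <- ncol(dat) - 1

bag <- randomForest(y ~ ., data = dat, mtry = p,              ntree = 500)  # bagging
rf  <- randomForest(y ~ ., data = dat, mtry = floor(sqrt(p)), ntree = 500)  # random forest

# OOB mean squared error as a function of the number of trees (regression case)
plot(bag$mse, type = "l", xlab = "Number of trees", ylab = "OOB MSE")
lines(rf$mse, col = "red")
```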


Random Forests vs. Bagging#

Fig 8.8
  • Note: OOB: Bagging and OOB: RandomForest labels are incorrect – should be swapped.


Choosing \(m\) for random forests#

Fig 8.10
  • The optimal \(m\) is usually around \(\sqrt p\), but \(m\) can be treated as a tuning parameter (see the sketch below).
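
One way to tune \(m\) is to compare OOB errors over a grid of values, as in this rough sketch (same assumed `dat`/`y` setup as above):

```r
# Sketch: treating m (mtry) as a tuning parameter via the OOB error.
library(randomForest)

set.seed(1)
p      <- ncol(dat) - 1
m_grid <- 1:p

oob_mse <- sapply(m_grid, function(m) {
  fit <- randomForest(y ~ ., data = dat, mtry = m, ntree = 500)
  tail(fit$mse, 1)                       # OOB MSE with all 500 trees
})

m_grid[which.min(oob_mse)]               # value of m with the smallest OOB error
```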