Bagging#

  • Bagging = Bootstrap Aggregating

  • In the Bootstrap, we replicate our dataset by sampling with replacement:

    • Original dataset: x=c(x1, x2, ..., x100)

    • Bootstrap samples: boot1=sample(x, 100, replace=TRUE), …, bootB=sample(x, 100, replace=TRUE)

  • We used these samples to get the standard error of a parameter estimate (see the R sketch below): \(\text{SE}(\hat\beta_1) \approx \sqrt{\frac{1}{B-1}\sum_{b=1}^B \bigl(\hat\beta_1^{(b)} - \bar{\hat\beta}_1\bigr)^2}\), where \(\bar{\hat\beta}_1 = \frac{1}{B}\sum_{b=1}^B \hat\beta_1^{(b)}\).
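
A minimal R sketch of this idea, assuming a numeric vector `x` and using the sample mean as the statistic (the names `x`, `B`, and `boot_est` are just for illustration):

```r
# Minimal sketch: Bootstrap estimate of a standard error.
set.seed(1)
x <- rnorm(100)          # placeholder data; any numeric vector would do
B <- 1000                # number of Bootstrap replicates

# Recompute the statistic on each Bootstrap sample
boot_est <- replicate(B, mean(sample(x, length(x), replace = TRUE)))

sd(boot_est)             # Bootstrap estimate of the standard error
```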


Overview#

  • In bagging we average the predictions of a model fit to many Bootstrap samples.

  • Example: Bagging the Lasso (a code sketch follows this list)

    • Let \(\hat y^{L,b}\) be the prediction of the Lasso applied to the \(b\)th bootstrap sample.

    • Bagging prediction: \(\hat y^\text{bag} = \frac{1}{B} \sum_{b=1}^B \hat y^{L,b}.\)
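
A hedged sketch of this procedure in R, using `glmnet` for the Lasso; the training data `X`, `y`, the prediction points `X_new`, and the choice of `lambda.min` are all assumptions for illustration:

```r
# Sketch of bagging the Lasso with glmnet.
library(glmnet)

set.seed(1)
n    <- nrow(X)
B    <- 100
pred <- matrix(NA, nrow(X_new), B)   # one column of predictions per Bootstrap fit

for (b in 1:B) {
  idx       <- sample(n, n, replace = TRUE)             # Bootstrap sample
  fit       <- cv.glmnet(X[idx, ], y[idx], alpha = 1)   # Lasso, lambda chosen by CV
  pred[, b] <- predict(fit, newx = X_new, s = "lambda.min")
}

y_bag <- rowMeans(pred)              # bagged prediction: average over the B fits
```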


When does Bagging make sense?#

  • When a regression method or a classifier has a tendency to overfit, bagging reduces the variance of the prediction.

  • When \(n\) is large, the empirical distribution is close to the true data-generating distribution.

  • Bootstrap samples are similar to independent realizations of the data. They are actually conditionally independent, given the data.

  • Bagging smooths out an estimator, which can reduce variance (see the note after this list).
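
A standard decomposition, not stated in these notes but useful for making the variance claim precise: if the \(B\) Bootstrap predictions are identically distributed with variance \(\sigma^2\) and pairwise correlation \(\rho\), their average has variance \(\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\). The second term vanishes as \(B\) grows, while the first term explains why it pays to de-correlate the individual fits, which is what Random Forests (introduced below) aim to do.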


Bagging decision trees#

  • Disadvantage: Every time we fit a decision tree to a Bootstrap sample, we get a different tree \(T^b\).

  • \(\to\) Loss of interpretability

  • Variable importance:

    • For each predictor, add up the total amount by which the RSS (or Gini index) decreases every time we use the predictor in \(T^b\).

    • Average this total over the Bootstrap trees \(T^1,\dots,T^B\) (a sketch with the randomForest package follows this list).
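
A minimal sketch of how this is typically computed in R, using the randomForest package; setting `mtry = p` (all predictors available at every split) makes the forest equivalent to bagging, and a data frame `dat` with response `y` is an assumed placeholder:

```r
# Sketch: variable importance from bagged trees via the randomForest package.
library(randomForest)

p   <- ncol(dat) - 1
bag <- randomForest(y ~ ., data = dat, mtry = p, ntree = 500, importance = TRUE)

importance(bag)    # per-predictor decrease in RSS (regression) or Gini index (classification)
varImpPlot(bag)    # importance plot along the lines of Fig 8.9
```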


Variable importance#

Fig 8.9

Out-of-bag (OOB) error#

  • To estimate the test error of a bagging estimate, we could use cross-validation.

  • Each time we draw a Bootstrap sample, we only use ~63% of the observations.

  • Idea: use the rest of the observations as a test set.

  • OOB error (sketched in code after this list):

    • For each sample \(x_i\), find the prediction \(\hat y_{i}^b\) for all bootstrap samples \(b\) which do not contain \(x_i\). There should be around \(0.37B\) of them. Average these predictions to obtain \(\hat y_{i}^\text{oob}\).

    • Compute the error \((y_i-\hat y_{i}^\text{oob})^2\).

    • Average the errors over all observations \(i=1,\dots,n\).
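
A minimal sketch of this computation for bagged regression trees, done by hand; the data frame `dat` with numeric response `y` is an assumed placeholder, and `rpart` is used for the individual trees:

```r
# Sketch: OOB error for bagged regression trees.
library(rpart)

set.seed(1)
B <- 500
n <- nrow(dat)
pred_sum   <- numeric(n)   # running sum of OOB predictions for each observation
pred_count <- numeric(n)   # number of trees for which observation i was out of bag

for (b in 1:B) {
  idx  <- sample(n, n, replace = TRUE)       # Bootstrap sample
  oob  <- setdiff(1:n, idx)                  # the ~37% of observations left out
  tree <- rpart(y ~ ., data = dat[idx, ])    # tree fit to the Bootstrap sample
  pred_sum[oob]   <- pred_sum[oob] + predict(tree, newdata = dat[oob, ])
  pred_count[oob] <- pred_count[oob] + 1
}

y_oob     <- pred_sum / pred_count           # average over the ~0.37 * B relevant trees
oob_error <- mean((dat$y - y_oob)^2)         # OOB estimate of the test MSE
```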


Fig 8.8
  • The test error decreases as we increase \(B\) (dashed line is the error for a plain decision tree).

  • Note: OOB: Bagging and OOB: RandomForest labels are incorrect – should be swapped.


Random Forests#

  • Bagging has a problem: the trees produced by different Bootstrap samples can be very similar (highly correlated), so averaging them reduces variance less than averaging independent trees would.

  • Random Forests:

    • We fit a decision tree to different Bootstrap samples.

    • When growing the tree, we select a random sample of \(m<p\) predictors to consider at each split.

    • This leads to trees that are less correlated across Bootstrap samples.

    • Finally, average the predictions of the individual trees (a code sketch follows this list).
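
A hedged sketch comparing the two in R with the randomForest package; `mtry = p` reproduces bagging, `mtry = floor(sqrt(p))` gives a Random Forest, and the data frame `dat` with numeric response `y` is an assumed placeholder:

```r
# Sketch: bagging vs. a Random Forest, compared via the OOB error.
library(randomForest)

set.seed(1)
p <- ncol(dat) - 1

bag <- randomForest(y ~ ., data = dat, mtry = p,              ntree = 500)  # bagging
rf  <- randomForest(y ~ ., data = dat, mtry = floor(sqrt(p)), ntree = 500)  # random forest

# OOB mean squared error as a function of the number of trees (regression case)
plot(bag$mse, type = "l", xlab = "Number of trees", ylab = "OOB MSE")
lines(rf$mse, col = "red")
```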


Random Forests vs. Bagging#

Fig 8.8
  • Note: OOB: Bagging and OOB: RandomForest labels are incorrect – should be swapped.


Choosing \(m\) for random forests#

Fig 8.10
  • The optimal \(m\) is usually around \(\sqrt p\), but \(m\) can be treated as a tuning parameter (see the sketch below).
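
One way to tune \(m\) is to compare OOB errors over a grid of values, as in this rough sketch (same assumed `dat`/`y` setup as above):

```r
# Sketch: treating m (mtry) as a tuning parameter via the OOB error.
library(randomForest)

set.seed(1)
p      <- ncol(dat) - 1
m_grid <- 1:p

oob_mse <- sapply(m_grid, function(m) {
  fit <- randomForest(y ~ ., data = dat, mtry = m, ntree = 500)
  tail(fit$mse, 1)                       # OOB MSE with all 500 trees
})

m_grid[which.min(oob_mse)]               # value of m with the smallest OOB error
```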