Regression trees

Overview

Find a partition of the space of predictors.
Predict a constant in each set of the partition.
The partition is defined by splitting the range of one predictor at a time.
Note: $\to$ not all partitions are possible.

Example: Predicting a baseball player’s salary

Parametric	Non-parametric

The prediction for a point in $R_i$ is the average of the training points in $R_i$.

How is a regression tree built?

Start with a single region $R_1$, and iterate:
1. Select a region $R_k$, a predictor $X_j$, and a splitting point $s$, such that splitting $R_k$ with the criterion $X_j<s$ produces the largest decrease in RSS: \[\sum_{m=1}^{|T|} \sum_{x_i\in R_m} (y_i-\bar y_{R_m})^2\]
2. Redefine the regions with this additional split.
Terminate when there are, say, 5 observations or fewer in each region.
This grows the tree from the root towards the leaves.

How do we control overfitting?

Idea 1: Find the optimal subtree by cross validation.
Hmm…. there are too many possibilities – harder than best subsets!
Idea 2: Stop growing the tree when the RSS doesn’t drop by more than a threshold with any new cut.
Hmm… in our greedy algorithm, it is possible to find good cuts after bad ones.

Solution: Prune a large tree from the leaves to the root.

Weakest link pruning:
- Starting with $T_0$, substitute a subtree with a leaf to obtain $T_1$, by minimizing: \[\frac{RSS(T_1)-RSS(T_0)}{|T_0|- |T_1|}.\]
- Iterate this pruning to obtain a sequence $T_0,T_1,T_2,\dots,T_m$ where $T_m$ is the null tree.
The sequence of trees can be used to solve the minimization for cost complexity pruning (over subtrees $T \subset T_0$)

\[ \text{minimize}_T \left(\sum_{m=1}^T \sum_{x_i \in R_m}(y_i - \hat{y}_{R_m})^2 + \alpha |T|\right) \]

Choose $\alpha$ by cross validation.

Example: predicting baseball salaries

Optimally tuned tree

Classification trees

They work much like regression trees. Both are examples of decision trees.
We predict the response by majority vote, i.e. pick the most common class in every region.
Instead of trying to minimize the RSS: \[\cancel{\sum_{m=1}^{|T|} \sum_{x_i\in R_m} (y_i-\bar y_{R_m})^2}\] we minimize a classification loss function.

Classification losses

The 0-1 loss or misclassification rate:

\[\sum_{m=1}^{|T|} \sum_{x_i\in R_m} \mathbf{1}(y_i \neq \hat y_{R_m})\]

Let $\hat p_{m,k}$ is the proportion of class $k$ within $R_m$, and $q_m$ is the proportion of samples in $R_m$. The Gini index is:

\[\sum_{m=1}^{|T|} q_m \sum_{k=1}^K \hat p_{mk}(1-\hat p_{mk}),\]

The cross-entropy:

\[- \sum_{m=1}^{|T|} q_m \sum_{k=1}^K \hat p_{mk}\log(\hat p_{mk}).\]

Comments

The Gini index and cross-entropy are better measures of the purity of a region, i.e. they are low when the region is mostly one category.
Motivation for the Gini index: If instead of predicting the most likely class, we predict a random sample from the distribution $(\hat p_{1,m},\hat p_{2,m},\dots,\hat p_{K,m})$, the Gini index is the expected misclassification rate.
It is typical to use the Gini index or cross-entropy for growing the tree, while using the misclassification rate when pruning the tree.

Example: heart dataset

Some details

How do we deal with categorical predictors?

If there are only 2 categories, then the split is obvious. We don’t have to choose the splitting point $s$, as for a numerical variable.
If there are more than 2 categories:
- Order the categories according to the average of the response: $\mathtt{ChestPain:a} > \mathtt{ChestPain:c} > \mathtt{ChestPain:b}$
- Treat as a numerical variable with this ordering, and choose a splitting point $s$.
One can show that this is the optimal way of partitioning.

How do we deal with missing data?

Suppose we can assign every sample to a leaf $R_i$ despite the missing data.
When choosing a new split with variable $X_j$ (growing the tree):
- Only consider the samples which have the variable $X_j$.
- In addition to choosing the best split, choose a second best split using a different variable, and a third best, …
To propagate a sample down the tree, if it is missing a variable to make a decision, try the second best decision, or the third best: surrogate splitting

Some advantages of trees

Very easy to interpret!
Closer to human decision-making.
Easy to visualize graphically (for shallow ones)
They easily handle qualitative predictors and missing data.

Downside: they don’t necessarily fit that well!

Bagging

Bagging = Bootstrap Aggregating
In the Bootstrap, we replicate our dataset by sampling with replacement:
- Original dataset: x=c(x1, x2, ..., x100)
- Bootstrap samples: boot1=sample(x, 100, replace=True), …, bootB=sample(x, 100, replace=True)
We used these samples to get the Standard Error of a parameter estimate: \[\text{SE}(\hat\beta_1) \approx \frac{1}{B}\sum_{b=1}^B \hat\beta_1^{(b)}\]

Overview

In bagging we average the predictions of a model fit to many Bootstrap samples.

Example: Bagging the Lasso

Let $\hat y^{L,b}$ be the prediction of the Lasso applied to the $b$th bootstrap sample.
Bagging prediction:

\[\hat y^\text{bag} = \frac{1}{B} \sum_{b=1}^B \hat y^{L,b}.\]

When does Bagging make sense?

When a regression method or a classifier has a tendency to overfit, bagging reduces the variance of the prediction.
When $n$ is large, the empirical distribution is similar to the true distribution of the samples.
Bootstrap samples are similar to independent realizations of the data. They are actually conditionally independent, given the data.
Bagging smooths out an estimator which can reduce variance.

Bagging decision trees

Disadvantage: Every time we fit a decision tree to a Bootstrap sample, we get a different tree $T^b$.
$\to$ Loss of interpretability
Variable importance:
- For each predictor, add up the total amount by which the RSS (or Gini index) decreases every time we use the predictor in $T^b$.
- Average this total over each Boostrap estimate $T^1,\dots,T^B$.

Out-of-bag (OOB) error

To estimate the test error of a bagging estimate, we could use cross-validation.
Each time we draw a Bootstrap sample, we only use ~63% of the observations.
Idea: use the rest of the observations as a test set.

OOB error

For each sample $x_i$, find the prediction $\hat y_{i}^b$ for all bootstrap samples $b$ which do not contain $x_i$. There should be around $0.37B$ of them. Average these predictions to obtain $\hat y_{i}^\text{oob}$.
Compute the error $(y_i-\hat y_{i}^\text{oob})^2$.
Average the errors over all observations $i=1,\dots,n$.

The test error decreases as we increase $B$ (dashed line is the error for a plain decision tree).

Bagging has a problem

$\to$ The trees produced by different Bootstrap samples can be very similar

Random Forests

We fit a decision tree to different Bootstrap samples.
When growing the tree, we select a random sample of $m<p$ predictors to consider in each step.
This will lead to trees that are less correlated from each sample.
Finally, average the prediction of each tree.

Random Forests vs. Bagging

Note: OOB: Bagging and OOB: RandomForest labels are incorrect – should be swapped.

Choosing $m$ for random forests

Rule of thumb $m$ is usually around $\sqrt p$, but this can be used as a tuning parameter.

Boosting

Another ensemble method (i.e. uses a collection of learners)
Instead of randomizing each learner, each learner fits to the residual (not that unlike backfitting)

Boosting regression trees

Set $\hat f(x) = 0$, and $r_i=y_i$ for $i=1,\dots,n$.
For $b=1,\dots,B$, iterate:
1. Fit a regression tree $\hat f^b$ with $d$ splits to the response $r_1,\dots,r_n$.
2. Update the prediction to: \[\hat f(x) \leftarrow \hat f(x) + \lambda \cdot \hat f^b(x).\]
3. Update the residuals, \[ r_i \leftarrow r_i - \lambda \cdot \hat f^b(x_i).\]
Output the final model: \[\hat f(x) = \sum_{b=1}^B \lambda \cdot \hat f^b(x).\]

Boosting classification trees

Can be done with appropriately defined residual for classification based on offset in log-odds.

Some intuition

Boosting learns slowly
We first use the samples that are easiest to predict, then slowly down weigh these cases, moving on to harder samples.

The parameter $\lambda=0.01$ in each case.
We can tune the model by CV using $\lambda, d, B.$

BART: Bayesian Additive Regression Trees

An ensemble method that uses decision trees as its building blocks.
Recall that bagging and random forests make predictions from an average of regression trees, each of which is built using a random sample of data and/or predictors. Each tree is built separately from the others.
By contrast, boosting uses a weighted sum of trees, each of which is constructed by fitting a tree to the residual of the current fit. Thus, each new tree attempts to capture signal that is not yet accounted for by the current set of trees.

Some Details

Fig 8.1

BART is related to both random forests and boosting: each tree is constructed in a random manner as in bagging and random forests, and each tree tries to capture signal not yet accounted for by the current model, as in boosting.
BART can be applied to regression, classification etc.

BART iterations

In the first iteration of the BART algorithm, all trees are initialized to have a single root node, with $\hat{f}_k^{1}(x) = \frac{1}{nK} \sum_{i=1}^n y_i$, the mean of the response values divided by the total number of trees.
Thus,

\[\hat{f}^1(x) = \sum_{k=1}^K \hat{f}_k^{1}(x) = \frac{1}{n} \sum_{i=1}^n y_i\]

Subsequent iterations: BART updates each of the $K$ trees, one at a time. In the $b$th iteration, to update the $k$th tree, we subtract from each response value the predictions from all but the $k$th tree, in order to obtain a partial residual

\[r_i = r^{b,k}_i = y_i -\sum_{k' <k } \hat{f}_{k'}^{b} (x_i) -\sum_{k' > k } \hat{f}_{k'}^{b-1} (x_i), \; i=1,\ldots,n \]

New trees are chosen by perturbations

Rather than fitting a fresh tree to this partial residual, BART randomly chooses a perturbation to the tree from the previous iteration, $\hat{f}^{b-1}_k$, from a set of possible perturbations, favoring ones that improve the fit to the partial residual.

Examples of possible perturbations to a tree

Fig 8.13

We may change the structure of the tree by adding or pruning branches.
We may change the prediction in each terminal node of the tree.

What does BART Deliver?

The output of BART is a collection of prediction models,

\[\hat f^b(x)= \sum_{k=1}^K \hat f_k^b(x), \mbox{ for $b=1,2,\ldots,B.$}\]

To obtain a single prediction, we simply take the average after some $L$ burn-in iterations, $\hat f(x) = \frac{1}{B-L} \sum_{b=L+1}^B \hat f^b(x)$.
The perturbation-style moves guard against overfitting since they limit how hard we fit the data in each iteration.
We can also compute quantities other than the average: for instance, the percentiles of $f^{L+1}(x), \dots, f^B(x)$ provide a measure of uncertainty of the final prediction.

BART applied to the Heart data

Fig 8.13

$K=200$ trees; the number of iterations is increased to $10,000$. During the initial iterations (in gray), the test and training errors jump around a bit. After this initial burn-in period, the error rates settle down.

BART is a Bayesian Method

It turns out that the BART method can be viewed as a Bayesian approach to fitting an ensemble of trees
The BART algorithm can be viewed as a Markov chain Monte Carlo procedure for fitting the BART model.
We typically choose large values for $B$ and $K$, and a moderate value for $L$: for instance, $K=200$, $B=1000$, and $L=100$ are reasonable choices. BART has been shown to have impressive out-of-box performance — that is, it performs well with minimal tuning.

Tree-based Methods

web.stanford.edu/class/stats202

Regression trees

Overview

Example: Predicting a baseball player’s salary

How is a regression tree built?

How do we control overfitting?

Example: predicting baseball salaries

Optimally tuned tree

Classification trees

Classification losses

Comments

Example: heart dataset

Some details

How do we deal with categorical predictors?

How do we deal with missing data?

Some advantages of trees

Bagging

Overview

Example: Bagging the Lasso

When does Bagging make sense?

Bagging decision trees

Variable importance

Out-of-bag (OOB) error

OOB error

Bagging has a problem

Random Forests

Random Forests vs. Bagging

Choosing \(m\) for random forests

Boosting

Boosting regression trees

Boosting classification trees

Some intuition

BART: Bayesian Additive Regression Trees

Some Details

BART iterations

New trees are chosen by perturbations

Examples of possible perturbations to a tree

What does BART Deliver?

BART applied to the Heart data

BART is a Bayesian Method