Tree-based Methods

web.stanford.edu/class/stats202

Sergio Bacallado, Jonathan Taylor

Autumn 2022

Regression trees

Overview

  1. Find a partition of the space of predictors.
  2. Predict a constant in each set of the partition.
  3. The partition is defined by splitting the range of one predictor at a time.
  4. Note: as a consequence, not all partitions are possible; every region is an axis-aligned rectangle.
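
A minimal sketch of one splitting step, assuming the split is chosen to minimize the total residual sum of squares (RSS) of the two resulting regions. The helper names `rss` and `best_split` and the toy data are made up for illustration; they are not from the course materials.

```python
from typing import List, Tuple

def rss(ys: List[float]) -> float:
    """Residual sum of squares around the mean (the constant prediction)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(X: List[List[float]], y: List[float]) -> Tuple[int, float, float]:
    """Return (predictor index, threshold, total RSS) of the best single split."""
    best = (-1, float("nan"), rss(y))  # baseline: no split at all
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            total = rss(left) + rss(right)
            if total < best[2]:
                best = (j, s, total)
    return best

# Toy data (hypothetical): the response jumps when the first predictor passes 4.5.
X = [[1, 7], [2, 3], [3, 9], [4, 1], [5, 8], [6, 2], [7, 6], [8, 4]]
y = [5.0, 5.2, 4.9, 5.1, 9.0, 9.2, 8.9, 9.1]
j, s, total = best_split(X, y)
```

Applying `best_split` recursively inside each resulting region would grow a full tree; only one predictor is split at a time, which is why the regions are always rectangles.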

Example: Predicting a baseball player’s salary

Parametric Non-parametric
Fig 8.1 Fig 8.2

How is a regression tree built?

How do we control overfitting?

Solution: Prune a large tree from the leaves to the root.

\[ \text{minimize}_T \left(\sum_{m=1}^{|T|} \sum_{x_i \in R_m}(y_i - \hat{y}_{R_m})^2 + \alpha |T|\right) \]
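
A small sketch of how the penalty \(\alpha |T|\) trades fit against tree size, assuming each candidate subtree is summarized by its training RSS and its number of leaves. The three subtrees and their numbers are hypothetical, as are the function names.

```python
from typing import Dict, Tuple

def penalized_cost(rss: float, n_leaves: int, alpha: float) -> float:
    """Training RSS plus alpha times the number of leaves |T|."""
    return rss + alpha * n_leaves

def best_subtree(subtrees: Dict[str, Tuple[float, int]], alpha: float) -> str:
    """Name of the subtree with the smallest penalized cost."""
    return min(subtrees, key=lambda name: penalized_cost(*subtrees[name], alpha))

# Hypothetical nested subtrees: name -> (training RSS, number of leaves)
subtrees = {"root": (100.0, 1), "small": (40.0, 3), "large": (30.0, 8)}

choice_no_penalty = best_subtree(subtrees, alpha=0.0)   # favors the largest tree
choice_moderate = best_subtree(subtrees, alpha=5.0)
choice_heavy = best_subtree(subtrees, alpha=50.0)       # favors the root alone
```

As \(\alpha\) grows, the minimizer moves from the large tree toward the root, so \(\alpha\) acts as the tuning parameter, typically chosen by cross-validation.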

Example: predicting baseball salaries

Optimally tuned tree

Classification trees

Classification losses

\[\sum_{m=1}^{|T|} \sum_{x_i\in R_m} \mathbf{1}(y_i \neq \hat y_{R_m})\]

\[\sum_{m=1}^{|T|} q_m \sum_{k=1}^K \hat p_{mk}(1-\hat p_{mk}),\]

\[- \sum_{m=1}^{|T|} q_m \sum_{k=1}^K \hat p_{mk}\log(\hat p_{mk}).\]
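
The three losses above (misclassification count, Gini index, cross-entropy) can be sketched as follows, assuming \(q_m\) is the fraction of training observations falling in region \(R_m\) and \(\hat p_{mk}\) is the proportion of class \(k\) within that region. The function name and the toy labels are illustrative only.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

def region_losses(regions: List[Sequence[str]]) -> Tuple[int, float, float]:
    """Return (misclassification count, Gini index, cross-entropy) for a partition.

    Each element of `regions` holds the class labels of one non-empty region.
    """
    n = sum(len(labels) for labels in regions)
    misclassified = 0
    gini = 0.0
    entropy = 0.0
    for labels in regions:
        counts = Counter(labels)
        q_m = len(labels) / n                               # weight of the region
        props = [c / len(labels) for c in counts.values()]  # class proportions p_mk
        # The region predicts its majority class; everything else is an error.
        misclassified += len(labels) - max(counts.values())
        gini += q_m * sum(p * (1 - p) for p in props)
        entropy -= q_m * sum(p * math.log(p) for p in props)
    return misclassified, gini, entropy

# Toy partition (hypothetical): one mixed region, one pure region.
regions = [["A", "A", "A", "B"], ["B", "B", "B", "B"]]
miscls, gini, entropy = region_losses(regions)
```

A pure region contributes zero to all three losses; the Gini index and cross-entropy also reward splits that make regions purer even when the majority class, and hence the misclassification count, does not change.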

Comments

Example: heart dataset

Some details

How do we deal with categorical predictors?

How do we deal with missing data?

Some advantages of trees

Downside: they don’t necessarily fit that well!

Bagging

Overview

Example: Bagging the Lasso

When does Bagging make sense?

Bagging decision trees

Variable importance

Out-of-bag (OOB) error

OOB error

Bagging has a problem

Random Forests

Random Forests vs. Bagging

Choosing \(m\) for random forests

Boosting

Boosting regression trees

  1. Set \(\hat f(x) = 0\), and \(r_i=y_i\) for \(i=1,\dots,n\).

  2. For \(b=1,\dots,B\), iterate:

    1. Fit a regression tree \(\hat f^b\) with \(d\) splits to the response \(r_1,\dots,r_n\).
    2. Update the prediction to: \[\hat f(x) \leftarrow \hat f(x) + \lambda \cdot \hat f^b(x).\]
    3. Update the residuals, \[ r_i \leftarrow r_i - \lambda \cdot \hat f^b(x_i).\]
  3. Output the final model: \[\hat f(x) = \sum_{b=1}^B \lambda \cdot \hat f^b(x).\]
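
The steps above can be sketched in a few lines, assuming a single predictor and trees with \(d=1\) split (stumps). The names `fit_stump` and `boost`, the default values of `B` and `lam`, and the toy data are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

def fit_stump(x: List[float], r: List[float]) -> Callable[[float], float]:
    """Fit a one-split regression tree to the residuals r by least squares.

    Assumes x contains at least two distinct values.
    """
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        s = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi < s]
        right = [ri for xi, ri in zip(x, r) if xi >= s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
        if best is None or err < best[0]:
            best = (err, s, ml, mr)
    _, s, ml, mr = best
    return lambda xi: ml if xi < s else mr

def boost(x: List[float], y: List[float], B: int = 200,
          lam: float = 0.1) -> Tuple[Callable[[float], float], List[float]]:
    """Return the boosted predictor and the final residuals."""
    trees = []
    r = list(y)                      # step 1: f = 0, so r_i = y_i
    for _ in range(B):               # step 2
        tree = fit_stump(x, r)       # 2.1: fit a tree to the current residuals
        trees.append(tree)           # 2.2: f <- f + lam * tree (stored implicitly)
        r = [ri - lam * tree(xi) for xi, ri in zip(x, r)]  # 2.3: update residuals
    # Step 3: the final model is the sum of the shrunken trees.
    return (lambda xi: sum(lam * t(xi) for t in trees)), r

# Toy data (hypothetical)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
f, residuals = boost(x, y)
```

Each tree is fit to what the current model has not yet explained, and the shrinkage \(\lambda\) keeps any single tree from contributing too much, so the model learns slowly.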

Boosting classification trees

Some intuition

BART: Bayesian Additive Regression Trees

Some Details

Fig 8.1

BART iterations

\[\hat{f}^1(x) = \sum_{k=1}^K \hat{f}_k^{1}(x) = \frac{1}{n} \sum_{i=1}^n y_i\]

\[r_i = r^{b,k}_i = y_i -\sum_{k' <k } \hat{f}_{k'}^{b} (x_i) -\sum_{k' > k } \hat{f}_{k'}^{b-1} (x_i), \; i=1,\ldots,n \]
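
A sketch of the partial residual computation above, using 0-based tree indices. It assumes the fitted values of every tree are stored per observation; `partial_residuals` and the toy response are hypothetical names and data.

```python
from typing import List

def partial_residuals(y: List[float], current: List[List[float]],
                      previous: List[List[float]], k: int) -> List[float]:
    """Partial residuals used when updating tree k (0-based) at iteration b.

    current[kp][i]:  fit of tree kp at x_i in iteration b (already updated, kp < k)
    previous[kp][i]: fit of tree kp at x_i in iteration b - 1 (not yet updated, kp > k)
    """
    K = len(previous)
    residuals = []
    for i, yi in enumerate(y):
        updated = sum(current[kp][i] for kp in range(k))          # trees k' < k
        pending = sum(previous[kp][i] for kp in range(k + 1, K))  # trees k' > k
        residuals.append(yi - updated - pending)
    return residuals

# Toy setup (hypothetical): K = 3 trees, each initialized to mean(y) / K
y = [2.0, 4.0, 6.0]
K = 3
mean_y = sum(y) / len(y)
init = [[mean_y / K] * len(y) for _ in range(K)]

# Residuals for the first tree of the second iteration: no tree is updated yet,
# so the other K - 1 trees all come from the initialization.
r = partial_residuals(y, init, init, k=0)
```

Tree \(k\) itself is left out of both sums, so the residual is what remains of the response after subtracting the current fits of all the other trees.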

New trees are chosen by perturbations

Examples of possible perturbations to a tree

Fig 8.13

What does BART Deliver?

\[\hat f^b(x)= \sum_{k=1}^K \hat f_k^b(x), \mbox{ for $b=1,2,\ldots,B.$}\]

BART applied to the Heart data

Fig 8.13

\(K=200\) trees; the number of iterations is increased to \(10{,}000\). During the initial iterations (in gray), the test and training errors jump around a bit. After this initial burn-in period, the error rates settle down.
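
One common way to summarize the iterations at a given \(x\) is to discard the burn-in draws and average the rest. The sketch below assumes exactly that; the function name, the burn-in length, and the numbers are hypothetical.

```python
from typing import List

def average_after_burn_in(predictions: List[float], burn_in: int) -> float:
    """Average the per-iteration predictions at one x, skipping the burn-in draws."""
    kept = predictions[burn_in:]
    if not kept:
        raise ValueError("burn_in must be smaller than the number of iterations")
    return sum(kept) / len(kept)

# Hypothetical predictions at a single x over six iterations; the first two
# are still far from where the later iterations settle.
draws = [10.0, 4.0, 2.0, 1.9, 2.1, 2.0]
estimate = average_after_burn_in(draws, burn_in=2)
```

Quantities other than the mean, such as percentiles of the retained draws, can be computed from the same list to give a measure of uncertainty.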

BART is a Bayesian Method