Start with a single region \(R_1\), and iterate:
Terminate when there are, say, 5 observations or fewer in each region.
This grows the tree from the root towards the leaves.
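The step inside the iteration is not spelled out above. A common choice, assumed here, is to pick the split that most reduces the residual sum of squares. A minimal sketch for a single predictor, where `best_split` is a hypothetical helper and the data are simulated:

```r
# Hypothetical helper: best split point s for one predictor x,
# chosen to minimize the total RSS of the two resulting regions.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  # Candidate thresholds: midpoints between consecutive distinct values
  cands <- (head(xs, -1) + tail(xs, -1)) / 2
  rss <- sapply(cands, function(s) {
    left  <- y[x < s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(s = cands[which.min(rss)], rss = min(rss))
}

# Toy data (an assumption for illustration): a step function with a jump at 0.5
set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.5, 1, 3) + rnorm(100, sd = 0.1)
fit <- best_split(x, y)
fit$s  # should fall close to 0.5
```

Growing a full tree would repeat this search over every predictor and every current region, then recurse on the two new regions.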
Solution: Prune a large tree from the leaves to the root.
\[ \text{minimize}_T \left(\sum_{m=1}^{|T|} \sum_{x_i \in R_m}(y_i - \hat{y}_{R_m})^2 + \alpha |T|\right) \]
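To see how the penalty \(\alpha |T|\) trades fit against size, the sketch below evaluates the objective for a few candidate subtrees. The sizes and RSS values are made-up numbers for illustration, and `best_size` is a hypothetical helper.

```r
# Hypothetical nested subtrees: number of leaves |T| and training RSS
size <- c(1, 2, 4, 8, 16)
rss  <- c(100, 60, 40, 34, 31)

# Penalized objective: RSS + alpha * |T|; return the size that minimizes it
best_size <- function(alpha) size[which.min(rss + alpha * size)]

best_size(0)    # 16: with no penalty the largest tree wins
best_size(2)    # 4:  a moderate penalty prefers a smaller tree
best_size(50)   # 1:  a heavy penalty leaves only the root
```

In practice \(\alpha\) would be chosen by cross-validation rather than fixed by hand.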
\[\sum_{m=1}^{|T|} \sum_{x_i\in R_m} \mathbf{1}(y_i \neq \hat y_{R_m})\]
\[\sum_{m=1}^{|T|} q_m \sum_{k=1}^K \hat p_{mk}(1-\hat p_{mk}),\]
\[- \sum_{m=1}^{|T|} q_m \sum_{k=1}^K \hat p_{mk}\log(\hat p_{mk}).\]
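Within a single region, each of these losses is a simple function of the class proportions \(\hat p_{mk}\). A small sketch, with hypothetical helper names and toy labels:

```r
# Class proportions in one region, from a vector of labels
props <- function(labels) as.numeric(table(labels)) / length(labels)

misclass <- function(p) 1 - max(p)                        # misclassification rate
gini     <- function(p) sum(p * (1 - p))                  # Gini index
entropy  <- function(p) -sum(ifelse(p > 0, p * log(p), 0)) # cross-entropy

p_pure  <- props(c("a", "a", "a", "a"))   # one class only
p_mixed <- props(c("a", "a", "b", "b"))   # evenly split between two classes

gini(p_pure);    gini(p_mixed)      # 0 and 0.5
entropy(p_pure); entropy(p_mixed)   # 0 and log(2)
```

All three are zero for a pure region and largest when the classes are evenly mixed.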
Downside: they don’t necessarily fit that well!
Bagging = Bootstrap Aggregating
In the Bootstrap, we replicate our dataset by sampling with replacement:
x = c(x1, x2, ..., x100)
boot1 = sample(x, 100, replace=TRUE)
…
bootB = sample(x, 100, replace=TRUE)
We used these samples to get the standard error of a parameter estimate: \[\text{SE}(\hat\beta_1) \approx \sqrt{\frac{1}{B-1}\sum_{b=1}^B \left(\hat\beta_1^{(b)} - \bar\beta_1\right)^2}, \qquad \bar\beta_1 = \frac{1}{B}\sum_{b=1}^B \hat\beta_1^{(b)}.\]
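The bootstrap standard error is the standard deviation of the estimates across bootstrap samples. A minimal sketch for a regression slope, using simulated data (the model and the true slope of 2 are assumptions for illustration):

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)   # simulated data; true slope is 2

B <- 1000
beta1 <- numeric(B)
for (b in 1:B) {
  idx <- sample(n, n, replace = TRUE)        # resample row indices
  beta1[b] <- coef(lm(y[idx] ~ x[idx]))[2]   # slope on the b-th bootstrap sample
}

se_boot <- sd(beta1)   # standard deviation of the bootstrap estimates
se_lm   <- summary(lm(y ~ x))$coefficients[2, "Std. Error"]  # model-based SE
```

On well-behaved data like this, `se_boot` and `se_lm` should be close.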
Let \(\hat y^{L,b}\) be the prediction of the Lasso applied to the \(b\)th bootstrap sample.
Bagging prediction:
\[\hat y^\text{bag} = \frac{1}{B} \sum_{b=1}^B \hat y^{L,b}.\]
When a regression method or a classifier has a tendency to overfit, bagging reduces the variance of the prediction.
When \(n\) is large, the empirical distribution is similar to the true distribution of the samples.
Bootstrap samples are similar to independent realizations of the data. They are actually conditionally independent, given the data.
Bagging smooths out an estimator, which can reduce variance.
Disadvantage: Every time we fit a decision tree to a Bootstrap sample, we get a different tree \(T^b\).
\(\to\) Loss of interpretability
Variable importance:
For each sample \(x_i\), find the prediction \(\hat y_{i}^b\) for all bootstrap samples \(b\) which do not contain \(x_i\). There should be around \(0.37B\) of them. Average these predictions to obtain \(\hat y_{i}^\text{oob}\).
Compute the error \((y_i-\hat y_{i}^\text{oob})^2\).
Average the errors over all observations \(i=1,\dots,n\).
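The three steps above can be sketched as follows. The data are simulated, and a cubic polynomial regression stands in for the decision tree (an assumption, made to keep the example in base R):

```r
set.seed(1)
n <- 200
B <- 500
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)  # simulated data

pred_sum <- numeric(n)   # running sum of OOB predictions for each observation
oob_cnt  <- numeric(n)   # number of bootstrap samples that exclude each observation

for (b in 1:B) {
  idx <- sample(n, n, replace = TRUE)
  oob <- setdiff(1:n, idx)                 # observations not in bootstrap sample b
  # Stand-in learner (assumption): cubic polynomial regression
  fit <- lm(y ~ poly(x, 3), data = data.frame(x = x[idx], y = y[idx]))
  pred_sum[oob] <- pred_sum[oob] + predict(fit, newdata = data.frame(x = x[oob]))
  oob_cnt[oob]  <- oob_cnt[oob] + 1
}

oob_frac <- mean(oob_cnt) / B     # close to 0.37
y_oob    <- pred_sum / oob_cnt    # averaged OOB prediction per observation
oob_mse  <- mean((y - y_oob)^2)   # OOB estimate of the test error
```

The fraction 0.37 comes from \((1 - 1/n)^n \approx e^{-1}\), the chance that a given observation is left out of one bootstrap sample.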
We fit a decision tree to different Bootstrap samples.
When growing the tree, we select a random sample of \(m<p\) predictors to consider in each step.
This leads to trees that are less correlated with one another across Bootstrap samples.
Finally, average the prediction of each tree.
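Why decorrelating the trees helps can be seen from a standard variance identity. Assume, for illustration only, that the \(B\) tree predictions \(\hat f^b(x)\) at a point \(x\) each have variance \(\sigma^2\) and pairwise correlation \(\rho\) (these symbols are not used elsewhere in these notes). Then

\[\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^B \hat f^b(x)\right) = \frac{1}{B^2}\left(B\sigma^2 + B(B-1)\rho\sigma^2\right) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2.\]

The second term vanishes as \(B\) grows, but the first does not. Restricting each split to \(m<p\) predictors lowers \(\rho\), and with it the floor on the variance of the average.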
Note on the figure: the labels OOB: Bagging and OOB: RandomForest are incorrect and should be swapped.
Boosting is another ensemble method (i.e. it uses a collection of learners).
Instead of randomizing each learner, each learner is fit to the residuals of the current fit (not unlike backfitting).
Set \(\hat f(x) = 0\), and \(r_i=y_i\) for \(i=1,\dots,n\).
For \(b=1,\dots,B\), iterate:
Output the final model: \[\hat f(x) = \sum_{b=1}^B \lambda \cdot \hat f^b(x).\]
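The sub-steps of the loop are not listed above. The sketch below assumes the usual ones: fit a small tree to the residuals, add a shrunken copy of it to the fit, and update the residuals. It uses stumps (trees with a single split) on simulated data; `fit_stump` and `predict_stump` are hypothetical helpers, and the values of \(\lambda\) and \(B\) are illustrative rather than tuned.

```r
# Hypothetical helper: fit a stump (a tree with one split) to residuals r
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cands <- (head(xs, -1) + tail(xs, -1)) / 2
  rss <- sapply(cands, function(s)
    sum((r[x < s] - mean(r[x < s]))^2) + sum((r[x >= s] - mean(r[x >= s]))^2))
  s <- cands[which.min(rss)]
  list(s = s, left = mean(r[x < s]), right = mean(r[x >= s]))
}
predict_stump <- function(st, x) ifelse(x < st$s, st$left, st$right)

set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)   # simulated data

lambda <- 0.1; B <- 200    # illustrative values
f_hat <- rep(0, n)         # start with f(x) = 0
r <- y                     # and r_i = y_i
train_mse <- numeric(B)
for (b in 1:B) {
  st   <- fit_stump(x, r)                 # fit a tree to the residuals
  step <- lambda * predict_stump(st, x)
  f_hat <- f_hat + step                   # add a shrunken copy to the fit
  r     <- r - step                       # update the residuals
  train_mse[b] <- mean(r^2)
}
```

At every iteration `f_hat + r` equals `y`, and the training error never increases; how far to run the loop is a tuning decision, since \(B\) that is too large can overfit.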
Boosting learns slowly
We first fit the samples that are easiest to predict, then slowly down-weight these cases and move on to harder samples.
The parameter \(\lambda=0.01\) in each case.
We can tune \(\lambda\), \(d\) (the number of splits in each tree), and \(B\) by cross-validation.
BART (Bayesian additive regression trees) is another ensemble method that uses decision trees as its building blocks.
Recall that bagging and random forests make predictions from an average of regression trees, each of which is built using a random sample of data and/or predictors. Each tree is built separately from the others.
By contrast, boosting uses a weighted sum of trees, each of which is constructed by fitting a tree to the residual of the current fit. Thus, each new tree attempts to capture signal that is not yet accounted for by the current set of trees.
BART is related to both random forests and boosting: each tree is constructed in a random manner as in bagging and random forests, and each tree tries to capture signal not yet accounted for by the current model, as in boosting.
BART can be applied to regression, classification, and other problems.
In the first iteration of the BART algorithm, all trees are initialized to have a single root node, with \(\hat{f}_k^{1}(x) = \frac{1}{nK} \sum_{i=1}^n y_i\), the mean of the response values divided by the total number of trees.
Thus,
\[\hat{f}^1(x) = \sum_{k=1}^K \hat{f}_k^{1}(x) = \frac{1}{n} \sum_{i=1}^n y_i\]
\[r_i = r^{b,k}_i = y_i -\sum_{k' <k } \hat{f}_{k'}^{b} (x_i) -\sum_{k' > k } \hat{f}_{k'}^{b-1} (x_i), \; i=1,\ldots,n \]
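The partial residual removes the predictions of every tree except the \(k\)th: trees before \(k\) use their iteration-\(b\) version, trees after \(k\) still use their iteration-\((b-1)\) version. A small sketch of that bookkeeping, with a hypothetical helper and toy numbers (both assumptions for illustration):

```r
# preds_new[i, k]: prediction of tree k at x_i, already updated in iteration b
# preds_old[i, k]: prediction of tree k at x_i, from iteration b - 1
partial_residual <- function(y, preds_new, preds_old, k) {
  K <- ncol(preds_old)
  before <- if (k > 1) rowSums(preds_new[, 1:(k - 1), drop = FALSE]) else 0
  after  <- if (k < K) rowSums(preds_old[, (k + 1):K, drop = FALSE]) else 0
  y - before - after
}

# Toy numbers: n = 3 observations, K = 3 trees
y <- c(3, 5, 7)
preds_old <- matrix(1, nrow = 3, ncol = 3)   # every tree predicted 1 in iteration b - 1
preds_new <- matrix(2, nrow = 3, ncol = 3)   # updated trees predict 2 in iteration b
r <- partial_residual(y, preds_new, preds_old, k = 2)   # y - 2 - 1, i.e. 0 2 4
```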
We may change the structure of the tree by adding or pruning branches.
We may change the prediction in each terminal node of the tree.
\[\hat f^b(x)= \sum_{k=1}^K \hat f_k^b(x), \quad \text{for } b=1,2,\ldots,B.\]
To obtain a single prediction, we simply take the average after some \(L\) burn-in iterations, \(\hat f(x) = \frac{1}{B-L} \sum_{b=L+1}^B \hat f^b(x)\).
The perturbation-style moves guard against overfitting since they limit how hard we fit the data in each iteration.
We can also compute quantities other than the average: for instance, the percentiles of \(\hat f^{L+1}(x), \dots, \hat f^B(x)\) provide a measure of uncertainty of the final prediction.
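Given the sequence of predictions at a point \(x\), both the burn-in average and the percentiles are one-liners. In this sketch the sequence is a simulated stand-in, not output from a fitted BART model:

```r
set.seed(1)
B <- 1000; L <- 100
# Stand-in for the predictions at one point x over B iterations (assumption):
# a noisy chain that starts away from its long-run level of 2
draws <- 2 + 3 * exp(-(1:B) / 20) + rnorm(B, sd = 0.2)

kept     <- draws[(L + 1):B]                    # discard the first L burn-in iterations
f_hat    <- mean(kept)                          # (1 / (B - L)) * sum of the kept draws
interval <- quantile(kept, c(0.025, 0.975))     # percentile-based uncertainty band
```

Dropping the first \(L\) draws matters here because the early iterations are still far from where the chain settles.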
\(K=200\) trees; the number of iterations is increased to \(10,000\). During the initial iterations (in gray), the test and training errors jump around a bit. After this initial burn-in period, the error rates settle down.
It turns out that the BART method can be viewed as a Bayesian approach to fitting an ensemble of trees.
The BART algorithm can be viewed as a Markov chain Monte Carlo procedure for fitting the BART model.
We typically choose large values for \(B\) and \(K\), and a moderate value for \(L\): for instance, \(K=200\), \(B=1000\), and \(L=100\) are reasonable choices. BART has been shown to have impressive out-of-the-box performance; that is, it performs well with minimal tuning.