Classification trees#

  • They work much like regression trees. Both are examples of decision trees.

  • We predict the response by majority vote, i.e. pick the most common class among the training observations in each region (see the sketch after this list).

  • Instead of trying to minimize the RSS:

\[\cancel{\sum_{m=1}^{|T|} \sum_{x_i\in R_m} (y_i-\bar y_{R_m})^2}\]

we minimize a classification loss function.
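
As a concrete illustration, here is a minimal sketch using scikit-learn on synthetic data (the dataset, tree depth, and all parameter choices are arbitrary assumptions): it grows a small tree and checks that the prediction in each region is that region's most common class.

```python
# Minimal sketch (scikit-learn, synthetic data): grow a small classification tree,
# then verify that the prediction in each region is the region's majority class.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

leaves = tree.apply(X)                        # region (leaf) index of each training point
for leaf in np.unique(leaves):
    in_region = leaves == leaf
    classes, counts = np.unique(y[in_region], return_counts=True)
    majority = classes[np.argmax(counts)]     # most common class in this region
    print(f"region {leaf}: majority class {majority}, "
          f"tree predicts {tree.predict(X[in_region][:1])[0]}")
```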


Classification losses#

  • The 0-1 loss or misclassification rate:

\[\sum_{m=1}^{|T|} \sum_{x_i\in R_m} \mathbf{1}(y_i \neq \hat y_{R_m})\]
  • Let \(\hat p_{mk}\) be the proportion of class \(k\) within \(R_m\), and \(q_m\) be the proportion of training samples that fall in \(R_m\). The Gini index is:

\[\sum_{m=1}^{|T|} q_m \sum_{k=1}^K \hat p_{mk}(1-\hat p_{mk}).\]
  • The cross-entropy:

\[- \sum_{m=1}^{|T|} q_m \sum_{k=1}^K \hat p_{mk}\log(\hat p_{mk}).\]
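
The three losses above are straightforward to compute from the region proportions. The sketch below uses made-up values of \(\hat p_{mk}\) and \(q_m\) for \(|T|=2\) regions and \(K=3\) classes (nothing here comes from a real dataset).

```python
import numpy as np

# Hypothetical class proportions p_hat[m, k] for |T| = 2 regions and K = 3 classes,
# and q[m] = proportion of training samples falling in region m.
p_hat = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.5, 0.3]])
q = np.array([0.6, 0.4])

# Misclassification rate (the 0-1 loss divided by the number of samples).
misclass = np.sum(q * (1 - p_hat.max(axis=1)))
# Gini index.
gini = np.sum(q * np.sum(p_hat * (1 - p_hat), axis=1))
# Cross-entropy (natural log; assumes all proportions are strictly positive).
entropy = -np.sum(q * np.sum(p_hat * np.log(p_hat), axis=1))

print(misclass, gini, entropy)
```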

Comments#

  • The Gini index and cross-entropy are more sensitive measures of the purity of a region than the misclassification rate, i.e. they are small when the region contains predominantly one class.

  • Motivation for the Gini index: if instead of predicting the most likely class, we predict a random draw from the distribution \((\hat p_{m1},\hat p_{m2},\dots,\hat p_{mK})\), the Gini index is the expected misclassification rate (see the derivation after this list).

  • It is typical to use the Gini index or cross-entropy for growing the tree, while using the misclassification rate when pruning the tree.
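
In more detail: within region \(R_m\), the true class of a random training point is distributed as \((\hat p_{m1},\dots,\hat p_{mK})\). If the prediction is an independent draw from the same distribution, the two agree with probability \(\sum_k \hat p_{mk}^2\), so the misclassification probability in \(R_m\) is

\[\sum_{k=1}^K \hat p_{mk}(1-\hat p_{mk}) = 1 - \sum_{k=1}^K \hat p_{mk}^2.\]

Weighting each region by \(q_m\) and summing over regions recovers the Gini index above.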


Example: Heart dataset#

[Fig 8.6: classification trees fit to the Heart data]
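
A rough sketch of how such a tree could be fit, assuming a local copy of the Heart data saved as `Heart.csv` with a binary response column `AHD` (the file path and column names are assumptions). Note that scikit-learn's cost-complexity pruning (`ccp_alpha`) prunes using the tree's impurity criterion rather than the misclassification rate, so it only approximates the grow-then-prune recipe described above.

```python
# Sketch only: fit a classification tree to the Heart data.
# Assumes Heart.csv exists locally with a binary response column "AHD";
# categorical predictors are one-hot encoded before fitting.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

heart = pd.read_csv("Heart.csv").dropna()
X = pd.get_dummies(heart.drop(columns=["AHD"]))   # one-hot encode categorical predictors
y = heart["AHD"]

# Grow with the Gini criterion, then prune via cost complexity (ccp_alpha).
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
tree.fit(X, y)
print(tree.get_n_leaves(), "terminal nodes")
```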