A decision tree is a hierarchical, non-parametric supervised learning model that organizes data using a divide-and-conquer strategy. It represents the possible outcomes of a decision as a sequence of feature tests, which makes the reasoning behind each prediction easy to follow. In classification tasks, the goal is to learn a tree that accurately represents the training data, so that labels for new, unseen examples can be predicted reliably. One popular algorithm we will explore is Classification and Regression Trees (CART), a widely used method for both classification and regression tasks.
Decision tree structure
As illustrated in the diagram below (a numerical example of a decision tree), a decision tree begins with a root node, which has no incoming branches. From this root, outgoing branches lead to internal nodes, also known as decision nodes. These nodes test the available features and split the data into increasingly homogeneous subsets, which terminate in leaf or terminal nodes. The leaf nodes indicate the possible outcomes within the dataset.
- Root Node: The topmost node in the decision tree, representing the entire dataset before any splits.
- Internal Nodes: Nodes that are not leaf nodes, representing a decision point where the dataset is split based on a feature and a threshold.
- Branches: The edges leaving a node, each representing one outcome of the feature test at that node; following branches from the root down to a leaf traces a complete decision path.
- Leaf Nodes or Leaves: Terminal nodes at the bottom of the tree that represent the final predicted class (in classification) or value (in regression).
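To make these terms concrete, the short sketch below shows one possible way to represent a tree node in Python (an illustrative assumption on our part, not the structure used by any particular library): internal nodes store a feature, a threshold, and two children, while leaves store a predicted value.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    """One node of a binary decision tree."""
    feature: Optional[str] = None      # feature tested at an internal node
    threshold: Optional[float] = None  # split threshold for that feature
    left: Optional["Node"] = None      # branch taken when the test is satisfied
    right: Optional["Node"] = None     # branch taken otherwise
    value: Optional[Any] = None        # predicted class (or value) at a leaf

    def is_leaf(self) -> bool:
        return self.value is not None
```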
Decision tree learning uses a divide-and-conquer approach, performing a greedy search for the best split point at each node and splitting top-down and recursively until most records have been classified. Smaller trees can more easily achieve pure leaf nodes; as a tree grows, purity becomes harder to maintain and data fragmentation (too few records in each subtree) often leads to overfitting. For this reason decision trees favor simplicity, in line with Occam's Razor, and pruning is used to reduce complexity. Accuracy can also be improved by forming ensembles, such as a random forest, which predicts more accurately when its trees are uncorrelated, as illustrated in the sketch below.
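As a quick, hedged illustration of these ideas, the snippet below fits a small CART-style tree with scikit-learn, constrains its complexity through `max_depth` and cost-complexity pruning (`ccp_alpha`), and compares it with a random forest ensemble. It assumes scikit-learn is installed and uses the bundled Iris dataset purely as a stand-in example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single tree, kept deliberately small: limiting depth and applying
# cost-complexity pruning both reduce overfitting.
tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)

# An ensemble of decorrelated trees usually generalizes better than one tree.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("single tree accuracy:", tree.score(X_test, y_test))
print("random forest accuracy:", forest.score(X_test, y_test))
```

Keeping the single tree shallow is the Occam's Razor idea in code: a simpler tree usually generalizes better than one grown until every leaf is pure.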
Types of Decision Trees
There are several types of decision trees, many of which are based on Hunt’s algorithm, developed in the 1960s to model human learning in psychology. Some of the most popular decision tree algorithms include:
- ID3 (Iterative Dichotomiser 3): Developed by Ross Quinlan, ID3 uses entropy and information gain to evaluate candidate splits. Quinlan’s research on ID3, dating back to 1986, remains influential.
- C4.5: A successor to ID3, also developed by Quinlan, C4.5 evaluates split points using information gain or gain ratios, making it a more refined version of ID3.
- CART (Classification and Regression Trees): Introduced by Leo Breiman, CART uses Gini impurity to identify the best attribute for splitting. Gini impurity measures the probability that a randomly chosen instance would be misclassified if it were labeled according to the class distribution at the node, so lower values indicate better splits (a short code sketch of both impurity measures follows this list).
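Both impurity measures are easy to compute directly. The helper functions below are a minimal sketch in plain Python (the function names are our own, not taken from any particular library); the sample labels simply preview the gift example used later in this article.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy H(S) of a list of class labels, the measure used by ID3 and C4.5."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels, the measure used by CART."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(entropy(["Pen"] * 5 + ["Bag"] * 2))  # ~0.863
print(gini(["Pen"] * 5 + ["Bag"] * 2))     # ~0.408
```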
How Does a Decision Tree Work?
As illustrated in the decision tree above, when working with a dataset containing multiple variables, we start by selecting one variable to split the data. For each resulting subset, we continue splitting based on one of the remaining variables. This process is repeated until no further splits are possible. The decision tree construction can be summarized as follows:
- Start: The algorithm begins at the root node, which contains the entire dataset.
- Feature Selection: Using an Attribute Selection Measure (ASM), the best feature in the dataset is selected.
- Splitting: The root node is divided into subsets based on the possible values of the selected feature.
- Node Creation: A new decision tree node is created, representing the best feature.
- Recursive Process: The feature selection and splitting steps are repeated on each subset, continuing until a node can no longer be split; such nodes become the terminal leaf nodes (see the sketch after this list).
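To show how these steps fit together, here is a compact, illustrative sketch of the recursive procedure for categorical features, using weighted Gini impurity as the attribute selection measure. All names here are our own, and a production implementation such as scikit-learn's additionally handles numeric thresholds, pruning, and many other details.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def build_tree(rows, labels, features):
    """Grow a decision tree for categorical features, stored as nested dicts."""
    # Stopping rule: the node is pure or nothing is left to split on -> leaf.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]

    def weighted_gini(feature):
        # Average Gini impurity of the subsets produced by splitting on `feature`.
        score = 0.0
        for value in {row[feature] for row in rows}:
            subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
            score += len(subset) / len(labels) * gini(subset)
        return score

    # Feature selection: greedily choose the split with the lowest weighted impurity.
    best = min(features, key=weighted_gini)

    # Splitting and recursion: build one subtree per value of the chosen feature.
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = build_tree(
            [rows[i] for i in keep],
            [labels[i] for i in keep],
            [f for f in features if f != best],
        )
    return tree
```

With the gift dataset introduced later in this article, a call such as `build_tree(rows, gifts, ["Sex", "Marital status"])` would return a nested dictionary in which each key acts as a decision node and each leaf holds a predicted gift.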
You might wonder why we began by splitting the data using the “Marital Status” feature. Is there a feature that would require fewer questions to classify data points into different classes? To determine the best feature to start with, we can use either Gini Impurity or Information Gain as follows.
\text{Information Gain}(A, S) = H(S) - \sum_{j=1}^{v} \left(\frac{|S_j|}{|S|} \cdot H(S_j)\right) = H(S) - H(A, S)
\text{Gini Gain}(A, S) = \text{Gini}(S) - \sum_{j=1}^{v} \left(\frac{|S_j|}{|S|} \cdot \text{Gini}(S_j)\right)
where
- H(S): Entropy of the whole dataset S
- |S|: Total number of instances in the dataset S
- |S_j|: Number of instances for which attribute A takes its j-th value
- v: Number of distinct values of attribute A
- H(S_j): Entropy of the subset S_j
- Gini(S_j): Gini index of the subset S_j
- H(A, S): Weighted average entropy of S after splitting on attribute A
The main difference between the two formulas is the impurity measure: Information Gain uses the entropy H(S), while Gini Gain uses the Gini index Gini(S). In practice the choice rarely changes the result, as the two measures behave very similarly. For a comprehensive comparison, see the article ‘Gini index vs Entropy in Decision Tree’.
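As a hedged illustration, the functions below translate the two gain formulas directly into code: `S` is the list of class labels in the parent node and `subsets` is the list of label lists S_1, ..., S_v produced by splitting on attribute A (the names are ours, chosen to mirror the notation above).

```python
from collections import Counter
from math import log2

def entropy(S):
    """H(S): entropy of a list of class labels."""
    return -sum((n / len(S)) * log2(n / len(S)) for n in Counter(S).values())

def gini_index(S):
    """Gini(S): Gini index of a list of class labels."""
    return 1.0 - sum((n / len(S)) ** 2 for n in Counter(S).values())

def information_gain(S, subsets):
    """H(S) minus the weighted entropy of the subsets S_1..S_v."""
    return entropy(S) - sum(len(Sj) / len(S) * entropy(Sj) for Sj in subsets)

def gini_gain(S, subsets):
    """Gini(S) minus the weighted Gini index of the subsets S_1..S_v."""
    return gini_index(S) - sum(len(Sj) / len(S) * gini_index(Sj) for Sj in subsets)
```

Each function returns 0 when a split leaves the class distribution unchanged and grows larger as the resulting subsets become purer.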
Decision tree examples with solutions
In the following table, we have three independent variables: Sex (gender), Age, and Marital Status, and one dependent variable: “Gift.” The problem involves deciding on gifts for our friends based on these variables. To start, we first compute the Entropy of the dependent variable “Gift” as follows:
| Sex | Age | Marital status | Gift |
|---|---|---|---|
| Male | 13 | Single | Pen |
| Male | 16 | Single | Pen |
| Female | 30 | Married | Pen |
| Male | 16 | Single | Pen |
| Female | 15 | Single | Bag |
| Female | 33 | Married | Pen |
| Male | 30 | Single | Bag |
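For readers following along in code, the table can be written out as plain Python lists (this is just a convenience for reproducing the calculations below, not part of the original worked example):

```python
from collections import Counter

# The seven observations from the table above.
rows = [
    {"Sex": "Male",   "Age": 13, "Marital status": "Single"},
    {"Sex": "Male",   "Age": 16, "Marital status": "Single"},
    {"Sex": "Female", "Age": 30, "Marital status": "Married"},
    {"Sex": "Male",   "Age": 16, "Marital status": "Single"},
    {"Sex": "Female", "Age": 15, "Marital status": "Single"},
    {"Sex": "Female", "Age": 33, "Marital status": "Married"},
    {"Sex": "Male",   "Age": 30, "Marital status": "Single"},
]
gifts = ["Pen", "Pen", "Pen", "Pen", "Bag", "Pen", "Bag"]

print(Counter(gifts))  # Counter({'Pen': 5, 'Bag': 2})
```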
To compute the entropy for the “Gift” feature, follow these steps:
- Number of `Pen` gifts: 5
- Number of `Bag` gifts: 2
- Total number of gifts: 7
Calculate Proportions
- p_{Pen} = \frac{5}{7} \approx 0.714
- p_{Bag} = \frac{2}{7} \approx 0.286
Calculate Entropy
The entropy H(S) is calculated as:
H(S) = - \left( p_{Pen} \log_2(p_{Pen}) + p_{Bag} \log_2(p_{Bag}) \right)
Substitute the proportions:
- p_{Pen} = \frac{5}{7} \approx 0.714
- p_{Bag} = \frac{2}{7} \approx 0.286
Calculate the logarithms:
- \log_2(0.714) \approx -0.485
- \log_2(0.286) \approx -1.807
Compute the entropy:
- H(S) = - \left( 0.714 \times (-0.485) + 0.286 \times (-1.807) \right)
- H(S) = - \left( -0.346 - 0.517 \right) = 0.863
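The same arithmetic can be checked in a couple of lines of Python:

```python
from math import log2

p_pen, p_bag = 5 / 7, 2 / 7
H = -(p_pen * log2(p_pen) + p_bag * log2(p_bag))
print(round(H, 3))  # 0.863
```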
Follow the accompanying figures to work through the remaining splits in the same way and complete the tree.