====== Decision Trees ======
Decision trees are the most commonly used base technique in classification. To describe the idea of decision trees, a simple data set might be considered, as presented in figure {{ref>Classificationproblemexample}}.
<figure Classificationproblemexample>
{{ : |}}
<caption>Classification problem example</caption>
</figure>
In this dataset, xn indicates the n-th observation, which is described by a set of factors and a class variable – whether technical assistance has to be called.
To build a decision tree for the given problem of calling the technical assistance, one might consider constructing a tree where each path from the root to a leaf represents a separate example xn with a complete set of factors and their values corresponding to the given example. This solution would provide the necessary outcome – all examples will be classified correctly. However, there are two significant problems:
  * The developed model is the same table encoded into a tree data structure, which might require the same amount of memory or even more since the model literally memorises all the examples.
  * The generalisation is lost, which is the essential feature of classification models – the ability to correctly classify unseen examples.
Referring to Occam's razor principle ((Schaffer, Jonathan (2015). "What Not to Multiply Without Necessity". Australasian Journal of Philosophy 93 (4): 644–664.)), the simplest tree that correctly classifies the examples should be preferred.
This means that one needs to select the most relevant factor and then the next most relevant factor until the decision is made without a doubt.
<figure Factor_selection_example>
{{ : |}}
<caption>Factor selection example</caption>
</figure>
In figure {{ref>Factor_selection_example}}, the left side considers one of the factors and the possible outcomes of testing it, while the right side considers a different factor with similar potential outputs.
In this simple example, it is obvious that checking if children are in the car is more effective than checking the engine status. However, a quantitative estimate is needed to assess the potential effectiveness of a given factor.
Ross Quinlan, in 1986, proposed the ID3 algorithm ((Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81–106.)), which employs an //entropy// measure to estimate the uncertainty of the class variable in a given data set D:
| {{: | {{: | ||
where:

**E(D)** - Entropy for a given data set D.

**C** - Total number of values c of the class variable in the given data set D.

**p%%(c)%%** - The proportion of examples with class value c to the total number of examples in D.
E(D) = 0, when only one class value is represented (the most desirable case), and E(D) = 1, when class values are evenly distributed in D (the least desirable case).
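As a minimal illustration of the entropy measure, the following Python sketch computes E(D) for a column of class labels. The label values (whether technical assistance is called) are assumed here for demonstration only and are not taken from the figures above.

<code python>
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy E(D) = -sum p(c) * log2 p(c) over the class values c in D."""
    total = len(labels)
    counts = Counter(labels)
    # p * log2(1/p) equals -p * log2(p) but avoids a -0.0 result for a pure set
    return sum((n / total) * log2(total / n) for n in counts.values())

# Hypothetical class column: should technical assistance be called?
labels = ["call", "call", "no call", "no call", "call", "no call"]
print(entropy(labels))                    # 1.0 - classes evenly distributed (least desirable case)
print(entropy(["call", "call", "call"]))  # 0.0 - only one class value (most desirable case)
</code>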
To select a particular factor, it is necessary to estimate how much uncertainty is removed from the data set after applying that factor (test). Quinlan proposed to use //information gain// for this purpose:
IG(D,A) = E(D) − Σ<sub>t ∈ T</sub> p(t) · E(t)
where:

**IG(D,A)** - Information gain of the dataset D, when factor A is applied to split it into subsets.

**E(D)** - Entropy for a given data set D.

**T** - Subsets of D, created by applying factor A.

**p(t)** - The proportion of examples in subset t to the total number of examples in D.

**E(t)** - Entropy of subset t.
The attribute with the highest information gain is selected to split the data set into subsets. Then, each subset is divided into subsets in the same way. The procedure is continued until each of the subsets has zero entropy or no factors remain to test.
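Continuing the sketch above, the next fragment shows how a single ID3 selection step could look: information gain is computed for each factor, and the factor with the highest gain is chosen for the split. The data set, its factor names and values are hypothetical and only mirror the car example used in the text; the entropy() helper is the one defined earlier.

<code python>
from collections import defaultdict

def information_gain(rows, labels, factor_index):
    """IG(D, A): entropy of D minus the weighted entropy of the subsets
    produced by splitting on the factor at factor_index."""
    subsets = defaultdict(list)
    for row, label in zip(rows, labels):
        subsets[row[factor_index]].append(label)
    total = len(labels)
    remainder = sum(len(subset) / total * entropy(subset)
                    for subset in subsets.values())
    return entropy(labels) - remainder

def best_factor(rows, labels, factor_indices):
    """ID3 selection step: choose the factor with the highest information gain."""
    return max(factor_indices, key=lambda i: information_gain(rows, labels, i))

# Hypothetical observations: (engine status, children in the car) -> class
rows = [("ok", "yes"), ("ok", "no"), ("failure", "yes"), ("failure", "no")]
labels = ["call", "no call", "call", "no call"]
print(best_factor(rows, labels, [0, 1]))  # 1 - "children in the car" removes all uncertainty
</code>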
The approach, in its essence, is a greedy search algorithm with one hypothesis, which is refined at each iteration. It uses statistics from the entire data set, which makes it relatively immune to missing values, contradictions or errors.
Since the algorithm seeks the best-fitting decision tree, it might run into a local minimum trap, where generalisation is lost. To avoid possible local minima solutions, it is necessary to simplify or generalise the decision tree. There are two common approaches:
  * Methods that monitor the hypothesis development and **stop it when overfitting** risks are significant. In this case, an accuracy change rate might be used, i.e. after every factor addition, the classification accuracy is measured. If the changes are small enough, it indicates that further model development does not bring significant improvements and can be stopped. For this reason, if the data set is large enough, a separate validation set might be held out to measure the accuracy on examples not used for training (a minimal sketch of such a stopping check is given after this list).
  * Methods that **allow overfitting and then pruning** the tree to simplify or generalise the decision tree. In this case, the decision tree is transformed into a set of IF-THEN rules, where each rule represents a path from the decision tree root to the leaves. Iteratively, the rules or their conditions that contribute least to the classification accuracy are removed, until further pruning starts to decrease the accuracy.
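The stopping criterion mentioned in the first approach can be expressed very compactly: grow the tree step by step, measure the accuracy on the held-out validation set after each added factor, and stop when the improvement falls below a threshold. The sketch below assumes such an accuracy history is already available; the threshold value is arbitrary.

<code python>
def should_stop(accuracy_history, min_improvement=0.01):
    """Return True when the latest split improved validation accuracy
    by less than min_improvement (i.e. further growth is not worth it)."""
    if len(accuracy_history) < 2:
        return False
    return (accuracy_history[-1] - accuracy_history[-2]) < min_improvement

# Hypothetical validation accuracies measured after each added factor
history = [0.62, 0.78, 0.85, 0.853]
print(should_stop(history))  # True - the last factor added only 0.003
</code>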
However, knowing the best factor to split the data set is not always helpful due to the costs related to the factor value estimation. For instance, in the medical domain, the most effective diagnostic methods might be the most expensive and, therefore, not always the most appropriate. Over time, different alternatives to information gain have been developed to respect the expenses related to factor value estimation:
**Alternative 1:**
**Alternative 2:**
{{ : |}}
Currently, many other alternatives to the known ID3 family are used, for instance, ILA ((Mehmet R. Tolun, Saleh M. Abu-Soud, ILA: an inductive learning algorithm for rule extraction, Expert Systems with Applications, Volume 14, Issue 3, 1998, Pages 361–370.)).