====== Random Forests ======
  
Random forests (([[https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#intro|Random forests]])) are among the best out-of-the-box methods highly valued by developers and data scientists. For a better understanding of the process, an imaginary weather forecast problem might be considered, represented by the following true decision tree (figure {{ref>Weatherforecastexample}}):
  
<figure Weatherforecastexample>
{{ :en:iot-reloaded:classification_6.png?800 | Weather Forecast Example}}
<caption> Weather Forecast Example </caption>
</figure>
  
Now, one might consider several forecast agents - friends or neighbours - where each provides their own forecast depending on the factor values. Some will be higher than the actual value, and some will be lower. However, since they all use some **experience-based knowledge**, the forecasts collected will be distributed around the actual value.
The Random Forest (RF) method uses hundreds of such forecast agents (decision trees) and then applies majority voting (figure {{ref>Weatherforecastvotingexample}}).
  
<figure Weatherforecastvotingexample>
{{ :en:iot-reloaded:classification_7.png?800 | Weather Forecast Voting Example}}
<caption> Weather Forecast Voting Example </caption>
</figure>
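To make the voting step concrete, the following short Python sketch (an illustration only, not part of the original example) collects hypothetical class forecasts from five agents and returns the class that receives the majority of votes:

<code python>
from collections import Counter

# Hypothetical class forecasts ("Sunny"/"Rainy") from five independent agents
# for the same input conditions; the agents and values are illustrative only.
agent_forecasts = ["Sunny", "Rainy", "Sunny", "Sunny", "Rainy"]

# Majority voting: the most frequent forecast becomes the ensemble prediction.
vote_counts = Counter(agent_forecasts)
ensemble_prediction, votes = vote_counts.most_common(1)[0]

print("Votes:", dict(vote_counts))
print("Ensemble prediction:", ensemble_prediction, f"({votes} of {len(agent_forecasts)} votes)")
</code>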
Some advantages:
  * RF uses more knowledge than a single decision tree.
  * Furthermore, the more diverse the initial information sources used, the more diverse the models will be and the more robust the final estimate.
  * This is true because a single data source might suffer from data anomalies reflected in model anomalies.
  
  * If the number of cases in the training set is N, a sample of N cases is taken at random - but with replacement - from the original data. Some samples will be represented more than once.
  * This sample will be the training set for growing the tree.
  * If there are M input factors, a number m<<M (m is significantly smaller than M) is specified such that at each node, m factors are selected randomly out of the M, and the best split on these m factors is used to split the node.
  * The value of m is held constant while the forest grows.
  * Each tree is grown to the largest extent possible. **There is no pruning**. (A minimal code sketch of these steps follows this list.)
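The procedure above can be sketched in a few lines of Python. This is an illustrative implementation only; it assumes scikit-learn and NumPy are available and uses a synthetic dataset as a stand-in for real data:

<code python>
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: N cases, M input factors (placeholder values).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
N, M = X.shape
m = 3          # m << M: number of factors considered at each split
n_trees = 100  # number of trees in the forest
rng = np.random.default_rng(0)

forest = []
for i in range(n_trees):
    # Bootstrap sample: N cases drawn at random WITH replacement,
    # so some cases appear more than once and some not at all.
    idx = rng.integers(0, N, size=N)
    # Grow the tree as far as possible (no pruning, no depth limit),
    # considering only m randomly chosen factors at each split.
    tree = DecisionTreeClassifier(max_features=m, random_state=i)
    tree.fit(X[idx], y[idx])
    forest.append(tree)

# Majority vote over all trees for each case (labels are 0/1 here).
votes = np.array([tree.predict(X) for tree in forest])
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("Training-set accuracy of the voted forest:", (majority == y).mean())
</code>

Because each tree is fully grown on its own bootstrap sample and restricted to m random factors per split, the trees differ from one another even though they are all derived from the same original data.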
===== Additional considerations =====
  
**Correlation Between Trees in the Forest:** The correlation between any two trees in a Random Forest refers to the similarity in their predictions across the same dataset. When trees are highly correlated, they will likely make similar mistakes on the same inputs. In other words, if many trees make similar errors, the model's aggregated predictions will not effectively reduce the bias and variance, and the overall error rate of the forest will increase. The Random Forest method addresses this by introducing randomness in two main ways (both are illustrated in the sketch after the list):
  * **Bootstrap Sampling:** Each tree is trained on a different bootstrapped sample (random sampling with replacement) of the training data, which helps to reduce the correlation between the trees.
  * **Feature Randomness:** A random subset of features is selected for each split within a tree. This subset size is denoted by m, the number of features considered at each split. By reducing m, fewer features are considered at each split, leading to more diversity among the trees and, consequently, lower correlation. Decreasing the correlation among trees increases the effectiveness of the ensemble because it reduces the variance of the overall model error, as the trees are less likely to make the same mistakes.
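In library implementations, these two sources of randomness are typically exposed as parameters. As an illustrative sketch only (assuming scikit-learn; the synthetic data is a placeholder), bootstrap sampling and the per-split subset size m correspond to the bootstrap and max_features parameters of RandomForestClassifier:

<code python>
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# bootstrap=True  -> each tree is trained on its own bootstrapped sample.
# max_features=3  -> m = 3 factors are considered at each split (m << M = 10).
forest = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,
    max_features=3,
    random_state=0,
)
forest.fit(X, y)
print("Number of trees in the forest:", len(forest.estimators_))
</code>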
**Strength of Each Individual Tree:** The strength of an individual tree refers to its classification accuracy on new data, i.e., its ability to perform as a strong classifier. In Random Forest terminology, a tree is strong if it has a low error rate. If each tree can classify well independently, the aggregate predictions of the forest will be more accurate.
  
Each tree's strength depends on various factors, including its depth and the features it uses for splitting. However, there is a trade-off between correlation and strength. For example, reducing m (the number of features considered at each split) increases the diversity among the trees, lowering correlation, but it may also reduce the strength of each tree, as it may limit its access to highly predictive features.
  
Despite this trade-off, Random Forests balance these dynamics by optimising m to minimise the ensemble error. Generally, a moderate reduction in m lowers correlation without significantly compromising the strength of each tree, thus leading to an overall decrease in the forest's error rate.
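One way to observe this balance empirically is to sweep m and compare the out-of-bag (OOB) error, which is estimated on the cases left out of each tree's bootstrap sample. The sketch below is an illustration only; it assumes scikit-learn and uses synthetic data:

<code python>
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder synthetic data with M = 20 input factors.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

# Sweep m (the number of factors considered at each split) and compare
# the out-of-bag error of the resulting forests.
for m in (1, 2, 4, 8, 16, 20):
    forest = RandomForestClassifier(n_estimators=300, max_features=m,
                                    oob_score=True, random_state=0)
    forest.fit(X, y)
    print(f"m = {m:2d}  OOB error = {1.0 - forest.oob_score_:.3f}")
</code>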
  
  
**Implications for the Forest Error Rate:** The forest error rate in a Random Forest model is influenced by both the correlation among the trees and the strength of each individual tree. Specifically:
  * Increasing correlation among trees typically increases the error rate, as it reduces the ensemble's ability to correct individual trees' errors.
  * Increasing the strength of each tree (i.e., reducing its error rate) generally decreases the forest error rate, as each tree becomes a more reliable classifier.
Consequently, an ideal Random Forest model balances individually strong and sufficiently diverse trees, typically achieved by tuning the m parameter.
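This balance can also be inspected directly on a trained forest. The sketch below is illustrative only: it assumes scikit-learn and NumPy, uses synthetic binary-labelled data (so that individual tree predictions compare directly against the labels), and takes the mean accuracy of single trees as a rough measure of strength and the mean pairwise agreement between trees as a rough proxy for their correlation:

<code python>
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder synthetic data with binary labels 0/1.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for m in (2, 8, 20):
    forest = RandomForestClassifier(n_estimators=50, max_features=m,
                                    random_state=0).fit(X_train, y_train)
    # Predictions of each individual tree on held-out data.
    preds = np.array([t.predict(X_test) for t in forest.estimators_])
    strength = np.mean([(p == y_test).mean() for p in preds])   # mean single-tree accuracy
    agreement = np.mean([(preds[i] == preds[j]).mean()          # crude correlation proxy
                         for i, j in combinations(range(len(preds)), 2)])
    print(f"m = {m:2d}  tree strength = {strength:.3f}  "
          f"pairwise agreement = {agreement:.3f}  "
          f"forest accuracy = {forest.score(X_test, y_test):.3f}")
</code>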