====== Decision Tree-based Classification Models ======
  
  
Classification is used in almost all domains of modern data analysis, including medicine, signal processing, pattern recognition, different types of diagnostics and other more specific applications.
  
===== Interpretation of the model output =====
  
  
The classification process consists of two steps: first, an existing data sample is used to train the classification model, and then, in the second step, the model is used to classify unseen objects, thereby predicting to which class the object belongs. As with any other prediction, in classification, the model output is described by the error rate, i.e., true prediction vs. wrong prediction. Usually, objects that belong to a given class are called positive examples, while those that do not belong are called negative examples.
  
Depending on a particular output, several cases might be identified:
  * True positive (TP) – the object belongs to the class and is classified as a class member.
  
**Example:** A SPAM message is classified as SPAM, or a patient classified as being in a particular condition is, in fact, experiencing this condition.
  
  * False positive (FP) – the object that does not belong to the class is classified as a class member.
The average statistics are used to describe the model.
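
To make these outcome counts and the resulting error rate concrete, below is a minimal illustrative sketch, assuming binary 0/1 labels; the arrays and values are made up for demonstration only.

<code python>
# Minimal sketch: tallying classification outcomes for a binary classifier.
# y_true and y_pred are synthetic, purely illustrative labels.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # 1 = positive example, 0 = negative
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])  # the model's predictions

tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives
fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives (defined analogously)
tn = np.sum((y_true == 0) & (y_pred == 0))   # true negatives (defined analogously)

error_rate = (fp + fn) / y_true.size         # wrong predictions vs. all predictions
print(tp, fp, fn, tn, error_rate)            # -> 3 1 1 3 0.25
</code>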
  
The model's results on the test subsample depend on different factors: noise in the data, the proportion of classes represented in the data (how evenly the classes are distributed), and others that are beyond the developer's reach. However, by manipulating the sample split, it is possible to provide more data for training and thereby expect better training results, since seeing more examples might lead to a better grasp of the class features. On the other hand, seeing too much might lead to a loss of generality and, consequently, dropped accuracy on test subsamples or previously unseen examples. Therefore, it is necessary to maintain a good balance between testing and training subsamples, usually 70% for training and 30% for testing, or 60% for training and 40% for testing. In real applications, if the initial data sample is large enough, a third subsample is used – a validation set, used only once to acquire the final statistics and not provided to developers. It is usually a small but representative subsample of 1-5% of the initial data sample.
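
As an illustration, the following minimal sketch shows one way such splits could be made with scikit-learn's ''train_test_split''; the data, proportions, and variable names are assumptions for demonstration, not a prescription.

<code python>
# Minimal sketch: 70/30 train/test split plus a small held-out validation set.
# X and y are synthetic placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)          # 1000 objects, 5 features (synthetic)
y = np.random.randint(0, 2, 1000)    # binary class labels (synthetic)

# First set aside ~5% as a validation set, to be used only once at the end.
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.05,
                                                random_state=42)

# Then split the remainder roughly 70/30 for training and testing.
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest,
                                                    test_size=0.3,
                                                    random_state=42)
</code>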
  
Unfortunately, in many practical cases the data sample is not large enough. Therefore, several testing techniques are used to ensure reliable statistics while respecting the scarcity of data. The method is called cross-validation; it uses the training and testing data subsets but saves data by not requiring a separate validation set.
===== Random sample =====
  
<figure Randomsample>
{{ :en:iot-reloaded:classification_1.png?800 | Random Sample}}
<caption> Random Sample </caption>
</figure>
  
Most of the data is used for training in random sample cases (figure {{ref>Randomsample}}), and only a few randomly selected samples are used to test the model. The procedure is repeated many times to estimate the model's average accuracy. Random selection has to be made without replacement. If the selection is made with replacement, the method is called bootstrapping, which is widely used and generally yields more optimistic estimates.
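
A minimal sketch of the difference between the two selection schemes, assuming a plain NumPy workflow over synthetic index data:

<code python>
# Minimal sketch: repeated random subsampling (without replacement)
# versus bootstrapping (with replacement). Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 100                                    # size of the (synthetic) data sample

for _ in range(10):                        # repeat the procedure many times
    # Random sample: test indices drawn WITHOUT replacement
    test_idx = rng.choice(n, size=10, replace=False)
    train_idx = np.setdiff1d(np.arange(n), test_idx)

    # Bootstrapping: training indices drawn WITH replacement;
    # objects never drawn ("out-of-bag") can serve as the test set.
    boot_idx = rng.choice(n, size=n, replace=True)
    oob_idx = np.setdiff1d(np.arange(n), boot_idx)
    # ...train and evaluate the model on each split, then average the results
</code>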
  
===== K-folds =====
  
<figure K-folds>
{{ :en:iot-reloaded:classification_2.png?800 | K-folds}}
<caption> K-folds </caption>
</figure>
This approach splits the training set into smaller sets called splits (in the figure {{ref>K-folds}} above, there are three splits). Then, for each split, the following steps are performed:
  * The model is trained using k-1 folds; in the figure above (figure {{ref>K-folds}}), every split (row) is divided into k folds, and, going split by split, the i-th fold is used for testing while the remaining k-1 folds are used for training.
  * The model's accuracy is assessed iteratively using the remaining fold for each split.
The overall performance of the k-fold cross-validation is the average of the individual performances computed for each split. It requires extra computing but respects data scarcity, which is why it is used in practical applications.
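
A minimal sketch of k-fold cross-validation, assuming scikit-learn and a decision tree classifier on synthetic data (all names and values here are illustrative):

<code python>
# Minimal sketch: 5-fold cross-validation with scikit-learn.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(300, 4)               # synthetic features
y = np.random.randint(0, 2, 300)         # synthetic binary labels

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # k = 5 folds
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=kf)

# The overall performance is the average of the per-fold accuracies.
print(scores.mean())
</code>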
  
===== One out =====
  
<figure One_out>
{{ :en:iot-reloaded:classification_3.png?800 | One Out}}
<caption> One Out </caption>
</figure>
  
This approach splits the training set into smaller sets called splits in the same way as the previous methods described here (in the figure {{ref>One_out}} above, there are three splits). Then, for each split, the following steps are performed:
  * The model is trained using n-1 samples, and only one sample is used for testing the model's performance.
  * The overall performance of the one-out cross-validation is the average of the individual performances computed for each split. It requires extra computing but respects data scarcity, which is why it is used in practical applications.
This method requires many iterations, since each testing set contains only a single sample.
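
A minimal sketch of leave-one-out cross-validation, again assuming scikit-learn and synthetic placeholder data:

<code python>
# Minimal sketch: leave-one-out cross-validation. Each iteration trains
# on n-1 samples and tests on the single sample left out.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(50, 4)                # small synthetic sample (n = 50)
y = np.random.randint(0, 2, 50)          # synthetic binary labels

scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=LeaveOneOut())
print(scores.mean())                     # average over n = 50 iterations
</code>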
  
<WRAP excludefrompdf>
Within the following sub-chapters, two very widely used algorithm groups are discussed:

  * [[en:iot-reloaded:Decision trees|]] - a fundamental set of methods and their variants.
  * [[en:iot-reloaded:Random forests|]] - one of the best out-of-the-box methods widely used by data analysts.
</WRAP>