This shows you the differences between two versions of the page.
en:iot-reloaded:clustering_models [2024/12/02 21:03] – [Summary about clustering] ktokarz | en:iot-reloaded:clustering_models [2024/12/10 21:34] (current) – pczekalski | ||
---|---|---|---|
Line 6: | Line 6: | ||
Clustering is a methodology that belongs to the class of unsupervised machine learning. It allows for finding regularities in data when the group or class identifier or marker is absent. To do this, the data structure is used as a tool to find the regularities. Because of this powerful feature, clustering is often used as part of a data analysis workflow prior to classification or other data analysis steps to find natural regularities or groups that may exist in data. | Clustering is a methodology that belongs to the class of unsupervised machine learning. It allows for finding regularities in data when the group or class identifier or marker is absent. To do this, the data structure is used as a tool to find the regularities. Because of this powerful feature, clustering is often used as part of a data analysis workflow prior to classification or other data analysis steps to find natural regularities or groups that may exist in data. | ||
- | This provides very insightful information about the data's internal organisation, | + | This provides very insightful information about the data's internal organisation, |
- | One might consider grouping customers by income estimate to explain the clustering better. It is very natural to assume some threshold values of 1KEUR per month, 10KEUR per month etc. However: | + | One might consider grouping customers by income estimate to explain the clustering better. It is natural to assume some threshold values of 1KEUR per month, 10KEUR per month, etc. However: |
* Do the groups reflect a natural distribution of customers by their behaviour? | * Do the groups reflect a natural distribution of customers by their behaviour? | ||
* For instance, does a customer with 10KEUR behave differently from the one with 11KEUR per month? | * For instance, does a customer with 10KEUR behave differently from the one with 11KEUR per month? | ||
- | It is obvious | + | It is evident |
In this context, a **cluster** refers to a collection of data points aggregated together because of certain similarities ((Understanding K-means Clustering in Machine Learning | by Education Ecosystem (LEDU) | Towards Data Science [[https:// | In this context, a **cluster** refers to a collection of data points aggregated together because of certain similarities ((Understanding K-means Clustering in Machine Learning | by Education Ecosystem (LEDU) | Towards Data Science [[https:// | ||
Line 17: | Line 17: | ||
* Cluster **centroid-based**, | * Cluster **centroid-based**, | ||
* Cluster **density-based**, | * Cluster **density-based**, | ||
- | In both cases, a distance measure estimates the distance among points or objects and the density of points around the given. Therefore, all factors used should | + | In both cases, a distance measure estimates the distance among points or objects and the density of points around a given point. Therefore, all factors used should be numerical, assuming a Euclidean space. | ||
- | <WRAP excludefrompdf> | ||
- | To illustrate the mentioned algorithm groups, the following algorithms are discussed in detail: | ||
- | * [[en: | ||
- | * [[en: | ||
- | </ | ||
==== Data preprocessing before clustering ==== | ==== Data preprocessing before clustering ==== | ||
- | Before starting clustering, several | + | Before starting clustering, several |
- | * **Check if the used data is metric:** In clustering, the primary measure is Euclidian distance (in most cases), which requires numeric data. While it is possible to encode some arbitrary data using numerical values, they must maintain the semantics of numbers, i.e. 1 < 2 < 3. Good examples of natural metric data are temperature, | + | * **Check if the data used is metric:** In clustering, the primary measure is Euclidean distance (in most cases), which requires numeric data. While it is possible to encode some arbitrary data using numerical values, they must maintain the semantics of numbers, i.e. 1 < 2 < 3. Good examples of natural metric data are temperature, | ||
* **Select the proper scale:** For the same reasons as the distance measure, the values of each dimension should be on the same scale. For instance, customers' | * **Select the proper scale:** For the same reasons as the distance measure, the values of each dimension should be on the same scale. For instance, customers' | ||
* **Unity interval:** The minimal factor value is subtracted from the given point value, and the difference is divided by the interval length (maximum minus minimum), giving a result between 0 and 1. | * **Unity interval:** The minimal factor value is subtracted from the given point value, and the difference is divided by the interval length (maximum minus minimum), giving a result between 0 and 1. | ||
Line 36: | Line 31: | ||
==== Summary about clustering ==== | ==== Summary about clustering ==== | ||
- | * Besides the discussed, there are many other clustering methods; however, all of them, including the discussed ones, require prior knowledge | + | * There are many other clustering methods |
- | * All of the clustering methods require setting some parameters | + | * All clustering methods require setting some parameters |
* Proper data coding in clustering may provide significant value even in complex application domains, including medicine, customer behaviour analysis, and fine-tuning of other data analysis algorithms. | * Proper data coding in clustering may provide significant value even in complex application domains, including medicine, customer behaviour analysis, and fine-tuning of other data analysis algorithms. | ||
- | * In data analysis, clustering is used among the first methods to acquire the internal structure of the data before applying more informed methods. | + | * In data analysis, clustering is one of the first methods |
+ | <WRAP excludefrompdf> | ||
+ | To illustrate the mentioned algorithm groups, the following algorithms are discussed in detail: | ||
+ | * [[en: | ||
+ | * [[en: | ||
+ | </ |
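The unity-interval scaling and the Euclidean distance mentioned in the preprocessing section can be illustrated with a minimal sketch. The customer records, the factor choice (monthly income and age) and the function names `unity_interval` and `euclidean` are assumptions made for illustration only; they do not come from the page itself. The sketch suggests why scaling matters: on raw values the factor with the larger numeric range dominates the distance.

```python
import math


def unity_interval(values):
    """Rescale numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant factor has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical customers: (monthly income in EUR, age in years).
customers = [(1000.0, 25.0), (1200.0, 60.0), (3000.0, 27.0)]

# Scale each factor (column) independently to the unity interval.
incomes = unity_interval([c[0] for c in customers])
ages = unity_interval([c[1] for c in customers])
scaled = list(zip(incomes, ages))

# Distances from the first customer to the other two, before and after scaling.
raw_d01 = euclidean(customers[0], customers[1])
raw_d02 = euclidean(customers[0], customers[2])
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])

print("raw:   ", raw_d01, raw_d02)
print("scaled:", scaled_d01, scaled_d02)
```

On the raw values the income difference dominates, so the large age gap between the first two customers barely registers. After scaling, both factors contribute on a comparable footing. Other scaling schemes are possible; this sketch covers only the unity interval.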