====== Clustering Models ======

===== Introduction =====
Clustering is a methodology that belongs to the class of unsupervised machine learning. It allows for finding regularities in data when the group or class identifier or marker is absent. To do this, the data structure is used as a tool to find the regularities. Because of this powerful feature, clustering is often used as part of a data analysis workflow prior to classification or other data analysis steps to find natural regularities or groups that may exist in data. This provides very insightful information about the data's internal organisation.

One might consider grouping customers by income estimate to explain the clustering better. It is natural to assume some threshold values of 1 KEUR per month, 10 KEUR per month, etc. However:
  * Do the groups reflect a natural distribution of customers by their behaviour?
  * For instance, does a customer with 10 KEUR per month behave differently from one with 11 KEUR per month?
It is evident that such predefined thresholds do not necessarily reflect the natural groups present in the data.

In this context, a **cluster** refers to a collection of data points aggregated together because of certain similarities ((Understanding K-means Clustering in Machine Learning | by Education Ecosystem (LEDU) | Towards Data Science)).
Two main groups of clustering models are distinguished here:
  * Cluster **centroid-based** models, where each cluster is represented by its central point (centroid) and points are assigned to the closest centroid;
  * Cluster **density-based** models, where clusters are regions of densely located points separated by regions of low density.
In both cases, a distance measure estimates the distance among points or objects and the density of points around a given point. Therefore, all factors used should be numerical, assuming a Euclidean space.
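
To make the distance requirement concrete, the following is a minimal Python sketch computing the Euclidean distance between two points; the coordinate values are invented for illustration.

<code python>
# Minimal sketch: Euclidean distance between two numeric points.
import numpy as np

p = np.array([1.0, 2.0])  # point p
q = np.array([4.0, 6.0])  # point q, measured on the same scale

# Square root of the sum of squared coordinate differences
distance = np.sqrt(np.sum((p - q) ** 2))
print(distance)  # 5.0
</code>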

===== K-Means =====
The first method discussed here is one of the most commonly used – K-Means. K-means clustering is a method that splits the initial set of points (objects) into groups using a distance measure, which represents the distance from a given point to the group's centre – the centroid. The algorithm steps are schematically represented in the following figure:

<figure K-means steps>
<caption>K-means algorithm steps</caption>
</figure>

In the figure:
  * **STEP 1:** The initial set of unlabelled data points is taken as the input;
  * **STEP 2:** The initial cluster centres are selected – typically K randomly chosen points of the data set;
  * **STEP 3:** For each point, the closest cluster centre is selected, which is the point marker;
  * **STEP 4:** Points sharing the same marker form the initial clusters;
  * **STEP 5:** The initial cluster centre is refined to minimise the average distance to the cluster centre from each cluster point. As a result, cluster centres might not be physical points any more; instead, they become imaginary;
  * **STEP 6:** Points are reassigned to the closest of the refined cluster centres.

Steps 4–6 are repeated until the cluster centres no longer change or the changes are insignificant. The distance is measured using the Euclidean distance:

<figure Euclidean distance>
$$ D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert x_n - m_k \rVert^2 $$
<caption>The K-means objective – the squared sum D of the distances</caption>
</figure>

where:
  * **Data points** – points x<sub>n</sub>, n = 1, …, N, each described by numeric factors;
  * **K** – the number of clusters set by the user;
  * **r<sub>nk</sub>** – a binary indicator equal to 1 if point x<sub>n</sub> is assigned to cluster k and 0 otherwise;
  * **m<sub>k</sub>** – the centre (centroid) of cluster k;
  * **D** – the squared sum of all distances d<sub>i</sub> to their assigned cluster centroids;
  * **Goal** – to find such values of the variables r<sub>nk</sub> and m<sub>k</sub> that minimise D (a minimal code sketch follows below).
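
The following is a minimal sketch of running K-means with scikit-learn; the synthetic blob data, K = 3 and the random seeds are assumptions made purely for illustration. The ''inertia_'' attribute corresponds to the objective D defined above.

<code python>
# Minimal K-means sketch on synthetic data (values assumed for illustration).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three synthetic 2-D blobs around different centres
data = np.vstack([rng.normal((0, 0), 0.5, (50, 2)),
                  rng.normal((5, 5), 0.5, (50, 2)),
                  rng.normal((0, 5), 0.5, (50, 2))])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

print(kmeans.cluster_centers_)  # refined (possibly imaginary) cluster centres
print(kmeans.labels_[:10])      # cluster markers of the first ten points
print(kmeans.inertia_)          # D - the squared sum of distances
</code>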

An example of initial data and the assigned cluster marks with cluster centres after running the K-means algorithm:

<figure K-means example 1>
<caption>Initial data and assigned cluster marks with cluster centres after running K-means</caption>
</figure>

Unfortunately, the number of clusters has to be set by the user in advance, and a poorly chosen number may produce groups that do not reflect the natural structure of the data. An example of setting different numbers of cluster centres:

<figure K-means example 2>
<caption>K-means results for different numbers of cluster centres</caption>
</figure>

==== Elbow method ====

In K-means clustering, a practical method – the Elbow method – is used to select a particular number of clusters. The Elbow method is based on finding the point at which adding more clusters does not significantly improve the model's performance.

As explained, K-means clustering optimises the sum of squared errors (SSE) or squared distances between each point and its corresponding cluster centroid. Since the optimal number of clusters (NC) is not known initially, it is wise to increase the NC iteratively. The SSE decreases as the number of clusters increases because the distances to the cluster centres also decrease. However, there is a point where the improvement in SSE diminishes significantly. This point is referred to as the "elbow".

Steps of the method:
  - **Plot SSE against the number of clusters:**
    * Computing the SSE for different values of NC, typically starting from NC = 2 up to a reasonable maximum value (e.g., 10 or 20).
    * Plotting the SSE values on the y-axis and the number of clusters NC on the x-axis.
  - **Observe the plot:**
    * As the number of clusters NC increases, the SSE will decrease because clusters become more specialised.
    * Initially, adding clusters will result in a significant drop in SSE.
    * After a certain point, the reduction in SSE will slow down, no longer showing a significant drop.
  - **The "elbow" point:**
    * The point on the curve where the rate of decrease sharply levels off forms the "elbow".
    * Adding more clusters beyond this point doesn't significantly reduce the SSE.
  - **Select optimal NC:**
    * The value of NC at the elbow point is often considered the optimal number of clusters because it balances the trade-off between model complexity and performance.

Since the method requires iteratively running the K-means algorithm, which might be resource-demanding, it is wise to limit the maximum NC to a reasonably small value (a code sketch of the procedure follows below).
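
The following sketch iterates K-means over a range of NC values and plots SSE against NC; the synthetic data and the NC range 2–10 are assumptions for illustration.

<code python>
# Elbow method sketch: plot SSE (inertia) for a range of cluster counts.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
data = np.vstack([rng.normal((0, 0), 0.5, (50, 2)),
                  rng.normal((5, 5), 0.5, (50, 2)),
                  rng.normal((0, 5), 0.5, (50, 2))])

ncs = range(2, 11)  # candidate numbers of clusters
sse = [KMeans(n_clusters=nc, n_init=10, random_state=0).fit(data).inertia_
       for nc in ncs]

plt.plot(list(ncs), sse, marker="o")
plt.xlabel("Number of clusters (NC)")
plt.ylabel("SSE")
plt.show()  # the "elbow" is where the curve visibly flattens
</code>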

Limitations:
  * The elbow point is not always obvious; in some cases, the curve may not show a distinct "elbow", making the choice of NC ambiguous.
  * The elbow method is heuristic and might not always lead to the perfect number of clusters, especially if the data structure is complex.
  * Other methods, like the Silhouette score, can complement the elbow method to help determine the optimal NC.

<figure Elbow example>
<caption>SSE curves with more and less obvious "elbow" points</caption>
</figure>

The figure above demonstrates more and less obvious "elbow" points.

==== Silhouette Score ====
The Silhouette Score is a metric used to evaluate the quality of a clustering result. It measures how similar an object (point) is to its own cluster (cohesion) compared to other clusters (separation). The score ranges from −1 to +1, where higher values indicate better-defined clusters ((Peter J. Rousseeuw (1987). "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis". Journal of Computational and Applied Mathematics. 20: 53–65.)).

The Silhouette score considers two main factors for each data point:

  * **Cohesion (a(i))** – The cohesion measure for the i-th point is the average distance between the point and all other points in the same cluster. It measures the point's similarity to its own cluster: the smaller a(i) is, the better the point fits its cluster;
  * **Separation (b(i))** – The separation measure for the i-th point estimates the average distance between the point and the points in the nearest neighbouring cluster – the cluster that is not its own but is closest to it. A large value of b(i) indicates that the point is far away from the closest other cluster, meaning it is well-separated.

The silhouette score for a point i is then calculated as:

<figure Silhouette score>
$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$
<caption>The silhouette score of a single point</caption>
</figure>

where:
  * s(i) is the silhouette score for point i;
  * a(i) is the average distance from point i to all other points in the same cluster;
  * b(i) is the average distance from point i to all points in the nearest other cluster;
  * s(i) ≈ +1 indicates that point i is well clustered;
  * s(i) around 0 indicates that the point lies close to the boundary between clusters;
  * s(i) ≈ −1 indicates that point i was most probably assigned to the wrong cluster.

Steps of the method:
  - **Plot the silhouette score (SC) against the number of clusters:**
    * Computing the SC for different values of NC, typically starting from NC = 2 up to a reasonable maximum value (e.g., 10 or 20).
    * Plotting the SC values on the y-axis and the number of clusters NC on the x-axis.
  - **Observe the plot:**
    * As the number of clusters NC increases, the SC shows different score values, which may or may not gradually decrease, as in the case of the "elbow" method.
    * The main goal is to observe the maximum SC value and the corresponding NC value.
  - **Select optimal NC:**
    * The value of NC at the maximum SC value is often considered the optimal number of clusters because it balances the trade-off between model complexity and performance.

Limitations:
  * It may not perform well if the data does not have a clear structure or if the clusters are of very different densities or sizes.
  * The Silhouette score might not always match intuitive or domain-specific clustering insights.

An example is provided in the following figure:

<figure Silhouette example>
<caption>Average silhouette scores for different numbers of clusters</caption>
</figure>

The user should look for the highest score, which in this case is for the 3-cluster option.
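
The sketch below computes the average silhouette score for a range of NC values and picks the highest one; the synthetic data and the NC range 2–10 are, again, assumptions for illustration.

<code python>
# Silhouette-based selection of the number of clusters (sketch).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
data = np.vstack([rng.normal((0, 0), 0.5, (50, 2)),
                  rng.normal((5, 5), 0.5, (50, 2)),
                  rng.normal((0, 5), 0.5, (50, 2))])

scores = {}
for nc in range(2, 11):
    labels = KMeans(n_clusters=nc, n_init=10, random_state=0).fit_predict(data)
    scores[nc] = silhouette_score(data, labels)  # mean s(i) over all points

best_nc = max(scores, key=scores.get)
print(best_nc, round(scores[best_nc], 3))  # three blobs -> NC = 3 expected
</code>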
===== DBSCAN =====
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) employs density measures to mark points in high-density regions and those in low-density regions – the noise. Because of this natural behaviour of the algorithm, it is particularly useful in signal processing and similar application domains ((Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). Simoudis, Evangelos; Han, Jiawei; Fayyad, Usama M. (eds.). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231. CiteSeerX 10.1.1.121.9220. ISBN 1-57735-004-9.)).

One of the essential concepts is the point's neighbourhood, defined as follows:

<figure Point's neighbourhood>
{{ :en:iot-reloaded:ClusterEq3.png?300 | Point's neighbourhood}}
<caption>Point's neighbourhood</caption>
</figure>

where:
  * **p** – the point of interest;
  * **N(p)** – the neighbourhood of the point p;
  * **q** – any other point;
  * **distance(p, q)** – a distance measure between points p and q;
  * **eps** – epsilon – a user-defined distance constant;
  * **D** – the initial set of points available for the algorithm.

The algorithm treats different points differently depending on the density of their neighbourhood: points with at least a minimum number of neighbours within eps become core points of a cluster, points that fall within a core point's neighbourhood without being dense enough themselves become border points, and all remaining points are treated as noise. A minimal sketch follows below.
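
The following is a minimal DBSCAN sketch with scikit-learn; the synthetic data as well as the ''eps'' and ''min_samples'' values are illustrative assumptions and normally require tuning for each dataset.

<code python>
# DBSCAN sketch: dense blobs are clustered, sparse points become noise.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
# Two dense blobs plus sparse uniform background noise
data = np.vstack([rng.normal((0, 0), 0.3, (50, 2)),
                  rng.normal((4, 4), 0.3, (50, 2)),
                  rng.uniform(-2, 6, (10, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(data)

print(set(db.labels_))                 # cluster ids; -1 marks noise points
print(int(np.sum(db.labels_ == -1)))   # number of points labelled as noise
</code>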

===== Data preprocessing before clustering =====

Before starting clustering, several necessary steps have to be performed:
  * **Check that the data used is metric:** In clustering, the primary measure is (in most cases) the Euclidean distance, which requires numeric data. While it is possible to encode some arbitrary data using numerical values, the encoding must maintain the semantics of numbers, i.e. 1 < 2 < 3. Good examples of naturally metric data are temperature, income or distance.
  * **Select the proper scale:** For the same reasons as the distance measure, the values of each dimension should be on the same scale. For instance, customers' ages and monthly incomes are measured in ranges that differ by orders of magnitude, so the factors have to be rescaled first; the most common options are the following (a code sketch follows after this list):
    * **Unity interval:** A minimal factor value is subtracted from the given point value and divided by the interval value, giving a result from 0 to 1.
    * **Z-scale:** The factor mean is subtracted from the given point value and divided by the standard deviation, giving a distribution with zero mean and unit variance.
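
The sketch below illustrates both rescaling options on a small hypothetical matrix of customer factors (age in years, income in EUR per month); the numbers are invented for the example.

<code python>
# Two common rescaling options before clustering (illustrative data).
import numpy as np

# Rows: customers; columns: age [years], income [EUR/month]
X = np.array([[25, 1200.0],
              [40, 3500.0],
              [58, 2100.0]])

# Unity interval (min-max): subtract the minimum, divide by the range
unity = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-scale: subtract the mean, divide by the standard deviation
zscale = (X - X.mean(axis=0)) / X.std(axis=0)

print(unity)   # every factor now lies in [0, 1]
print(zscale)  # every factor now has mean 0 and variance 1
</code>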

===== Summary about clustering =====

  * There are many other clustering methods besides the discussed ones; however, all of them, including the discussed ones, require prior knowledge of the problem domain.
  * All clustering methods require setting some parameters that drive the algorithms. In most cases, setting their values is not intuitive and requires experimentation.
  * Proper data coding in clustering may provide significant value even in complex application domains, including medicine, customer behaviour analysis, and fine-tuning of other data analysis algorithms.
  * In data analysis, clustering is one of the first methods used to uncover the internal structure of the data before applying more informed methods.

<WRAP excludefrompdf>
To illustrate the mentioned algorithm groups, the following algorithms are discussed in detail:
  * [[en:
  * [[en:
</WRAP>