Differences

This shows you the differences between two versions of the page.

en:iot-reloaded:clustering_models [2024/12/09 11:42] – [Summary about clustering] pczekalski
en:iot-reloaded:clustering_models [2024/12/10 21:34] (current) pczekalski
Line 6:
 Clustering is a methodology that belongs to the class of unsupervised machine learning. It allows for finding regularities in data when the group or class identifier or marker is absent. To do this, the data structure is used as a tool to find the regularities. Because of this powerful feature, clustering is often used as part of data analysis workflow prior to classification or other data analysis steps to find natural regularities or groups that may exist in data.
  
-This provides very insightful information about the data's internal organisation, possible groups, their number and distribution, and other internal regularities that might be used to better understand the data content.
+This provides very insightful information about the data's internal organisation, possible groups, their number and distribution, and other internal regularities that might help us better understand the data content.
-One might consider grouping customers by income estimate to explain the clustering better. It is very natural to assume some threshold values of 1KEUR per month, 10KEUR per month etc. However:
+One might consider grouping customers by income estimate to explain the clustering better. It is natural to assume some threshold values of 1KEUR per month, 10KEUR per month, etc. However:
   * Do the groups reflect a natural distribution of customers by their behaviour?
   * For instance, does a customer with 10KEUR behave differently from the one with 11KEUR per month?
  
-It is obvious that, most probably, customers' behaviour depends on factors like occupation, age, total household income, and others. While the need for considering other factors is obvious, grouping is not – how exactly different factors interact to decide which group a given customer belongs to. That is where clustering exposes its strength – revealing natural internal structures of the data (customers in the provided example).
+It is evident that, most probably, customers' behaviour depends on factors like occupation, age, total household income, and others. While the need for considering other factors is obvious, grouping is not – how exactly different factors interact to decide which group a given customer belongs to. That is where clustering exposes its strength – revealing natural internal structures of the data (customers in the provided example).
  
 In this context, a **cluster** refers to a collection of data points aggregated together because of certain similarities ((Understanding K-means Clustering in Machine Learning | by Education Ecosystem (LEDU) | Towards Data Science [[https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1]] – Cited 07.08.2024.)).
Line 22:
 ==== Data preprocessing before clustering ====
  
-Before starting clustering, several important steps have to be performed:
+Before starting clustering, several necessary steps have to be performed:
  
-  * **Check if the used data is metric:** In clustering, the primary measure is Euclidian distance (in most cases), which requires numeric data. While it is possible to encode some arbitrary data using numerical values, they must maintain the semantics of numbers, i.e. 1 < 2 < 3. Good examples of natural metric data are temperature, exam assessments, and the like. Bad examples: gender, colour.
+  * **Check if the used data is metric:** In clustering, the primary measure is Euclidian distance (in most cases), which requires numeric data. While it is possible to encode some arbitrary data using numerical values, they must maintain the semantics of numbers, i.e. 1 < 2 < 3. Good examples of natural metric data are temperature, exam assessments, and the like; bad examples are gender and colour.
   * **Select the proper scale:** For the same reasons as the distance measure, the values of each dimension should be on the same scale. For instance, customers' monthly incomes in euros and their credit ratios are typically at different scales – the incomes in thousands, while ratios between 0 and 1. If scales are not adjusted, the income dimension will dominate distance estimation among points, deforming the overall clustering results. A universal scale is usually applied to all dimensions to avoid this trap. For instance:
      * **Unity interval:** A minimal factor value is subtracted from the given point value and divided by the interval value, giving the result 0 to 1.
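
The unity-interval scaling described above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the page's own material: the function names ''unity_interval'' and ''euclidean'' and the three sample customers are hypothetical, and a constant dimension is mapped to 0 by assumption. It shows how the income dimension dominates the raw Euclidean distance and how rescaling each dimension to 0..1 lets the credit ratio contribute as well.

```python
import math


def unity_interval(values):
    """Rescale numbers to the 0..1 interval: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption: a constant dimension carries no information, map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical customers: monthly income in EUR and credit ratio (0..1).
incomes = [1000.0, 1100.0, 10000.0]
ratios = [0.10, 0.90, 0.15]

raw = list(zip(incomes, ratios))
scaled = list(zip(unity_interval(incomes), unity_interval(ratios)))

# Customers 0 and 1 have similar incomes but very different credit ratios.
raw_d01 = euclidean(raw[0], raw[1])        # almost entirely the 100 EUR gap
scaled_d01 = euclidean(scaled[0], scaled[1])  # now driven by the ratio gap

print(f"raw distance 0-1:    {raw_d01:.3f}")
print(f"scaled distance 0-1: {scaled_d01:.3f}")
```

Before scaling, the difference in credit ratio between the first two customers is practically invisible next to the 100 EUR income gap; after scaling, both dimensions span the same interval, so neither dominates merely because of its unit.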
en/iot-reloaded/clustering_models.1733744534.txt.gz · Last modified: 2024/12/09 11:42 by pczekalski
CC Attribution-Share Alike 4.0 International