


Clustering Models


Introduction

Clustering is a methodology that belongs to the class of unsupervised machine learning. It allows regularities to be found in data when a group or class identifier or marker is absent; instead, the structure of the data itself is used to reveal them. Because of this powerful feature, clustering is often applied as part of a data analysis workflow, prior to classification or other analysis steps, to find natural regularities or groups that may exist in the data.

This provides very insightful information about the data's internal organisation: possible groups, their number and distribution, and other internal regularities that help to better understand the data content. To explain clustering better, one might consider grouping customers by an income estimate. It is very natural to assume some threshold values of 1 KEUR per month, 10 KEUR per month, etc. However:

  • Do the groups reflect a natural distribution of customers by their behaviour?
  • For instance, does a customer earning 10 KEUR per month behave differently from one earning 11 KEUR per month?

It is obvious that, most probably, customers' behaviour depends on factors such as occupation, age, total household income, and others. While the need to consider other factors is obvious, the grouping is not – how exactly do different factors interact to decide which group a given customer belongs to? That is where clustering shows its strength – revealing the natural internal structure of the data (of the customers in the provided example).

In this context, a cluster refers to a collection of data points aggregated together because of certain similarities [1]. Within this chapter, two different approaches to clustering are discussed:

  • Cluster centroid-based, where the main idea is to find an imaginary centroid point representing the “centre of mass” of the cluster; in other words, the centroid represents a “typical” member of the cluster, which in most cases is an imaginary point.
  • Cluster density-based, where the density of points around a given point determines its membership in the cluster. In other words, the main feature of the cluster is its density.
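To illustrate the centroid idea, the following Python sketch computes the “centre of mass” of a small cluster as the coordinate-wise mean of its points; the cluster data is invented for illustration:

```python
def centroid(points):
    """Coordinate-wise mean of a cluster: its "centre of mass"."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

# Illustrative cluster: note that the centroid (2.0, 2.0) is an
# imaginary point - it is not itself a member of the cluster.
cluster = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0)]
print(centroid(cluster))  # → (2.0, 2.0)
```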

In both cases, a distance measure estimates the distance between points or objects and the density of points around a given one. Therefore, all factors used should generally be numerical, assuming a Euclidean space.
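As a minimal sketch, the Euclidean distance between two numerical feature vectors can be computed as follows; the customer data here is hypothetical, echoing the income example above:

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two equal-length numerical vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two hypothetical customers described by (monthly income in KEUR, age):
customer_a = (10.0, 35.0)
customer_b = (11.0, 37.0)
print(euclidean_distance(customer_a, customer_b))  # → 2.23606797749979
```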

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) employs density measures to distinguish points in high-density regions from those in low-density regions – the noise [2]. Because of this natural behaviour, the algorithm is particularly useful in signal processing and similar application domains.

One of the essential concepts is the point's p neighbourhood, which is the set of points reachable within the user-defined distance eps (epsilon):

  N(p) = { q ∈ D | distance(p, q) ≤ eps }
Figure 1: Point's neighbourhood

where:

  • p – the point of interest;
  • N(p) – neighbourhood of the point p;
  • q – any other point;
  • distance(p,q) – Euclidean distance between points p and q;
  • eps – epsilon – user-defined distance constant;
  • D – the initial set of points available to the algorithm.
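The definition above translates directly into a short Python sketch; the sample points D and the value of eps are illustrative choices:

```python
import math

def neighbourhood(p, D, eps):
    """N(p): every point q in D whose Euclidean distance to p is at most eps.

    Note that p is a member of its own neighbourhood, since
    distance(p, p) = 0 <= eps.
    """
    return [q for q in D if math.dist(p, q) <= eps]

D = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
print(neighbourhood((0.0, 0.0), D, eps=1.5))
# → [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
```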

The algorithm treats points differently depending on the density and distribution of neighbouring points around each point – its neighbourhood:

  1. Core points:
    • A point p is a core point if it has at least MinPts neighbours within the distance eps, where MinPts and eps are user-defined parameters, i.e. |N(p)| ≥ MinPts.
  2. Directly density-reachable points:
    • A point is directly density-reachable from a core point if it lies within the distance eps of that core point.
  3. Border points:
    • A point is a border point if it is not a core point itself but lies within the distance eps of at least one core point.
  4. Noise points:
    • A point is a noise point if it is neither a core point nor a border point; such points are not assigned to any cluster.
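The point types above can be expressed as a small Python function. This is a didactic sketch of DBSCAN's point typing only, not the full algorithm – it does not link density-reachable points into clusters – and the sample points and parameter values are illustrative:

```python
import math

def classify_points(D, eps, min_pts):
    """Label every point in D as 'core', 'border', or 'noise'.

    A sketch of DBSCAN's point typing only; cluster assembly is omitted.
    """
    # N(p): the eps-neighbourhood of each point (p itself included).
    neigh = {p: [q for q in D if math.dist(p, q) <= eps] for p in D}
    core = {p for p in D if len(neigh[p]) >= min_pts}
    labels = {}
    for p in D:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neigh[p]):
            labels[p] = "border"   # not core, but within eps of a core point
        else:
            labels[p] = "noise"    # isolated low-density point
    return labels

# Illustrative data: a dense unit square, one edge point, one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (3, 3)]
print(classify_points(points, eps=1.5, min_pts=4))
# → the four square corners are 'core', (2, 1) is 'border', (3, 3) is 'noise'
```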

[1] Education Ecosystem (LEDU). Understanding K-means Clustering in Machine Learning. Towards Data Science. https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1 – Cited 07.08.2024.
[2] Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). Simoudis, Evangelos; Han, Jiawei; Fayyad, Usama M. (eds.). A density-based algorithm for discovering clusters in large spatial databases with noise (PDF). Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231. CiteSeerX 10.1.1.121.9220. ISBN 1-57735-004-9.
en/iot-reloaded/clustering_models.1727267433.txt.gz · Last modified: 2024/09/25 12:30 by agrisnik
CC Attribution-Share Alike 4.0 International