What is Clustering in Data Mining?

Published April 1, 2015

arvindl

Clustering is the grouping of a set of objects according to their characteristics, so that similar objects end up together. In data mining, this means partitioning the data with a grouping algorithm chosen to suit the analysis at hand.

In hard partitioning, each object either belongs to a cluster or does not. In soft partitioning, by contrast, every object belongs to each cluster to some degree. Finer variations are also possible: objects may belong to several clusters at once, may be forced into exactly one cluster, or may be organized into hierarchical trees that capture the relationships between groups.
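The difference between hard and soft partitioning can be sketched in a few lines. This is only an illustration: the one-dimensional point, the two cluster centers, and the inverse-distance weighting are assumptions made for the example, not something the article prescribes.

```python
def hard_assign(x, centers):
    """Hard partitioning: return the index of the single nearest center."""
    return min(range(len(centers)), key=lambda i: abs(x - centers[i]))


def soft_assign(x, centers, eps=1e-9):
    """Soft partitioning: return membership degrees that sum to 1.

    Inverse-distance weighting is an illustrative choice, one of many.
    """
    weights = [1.0 / (abs(x - c) + eps) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]


centers = [0.0, 10.0]  # hypothetical cluster centers
point = 4.0

hard_label = hard_assign(point, centers)
soft_degrees = soft_assign(point, centers)

print(hard_label)                          # 0
print([round(d, 2) for d in soft_degrees])  # [0.6, 0.4]
```

The hard result says only that the point belongs to the first cluster, while the soft result also records that it is fairly close to the second.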

There are several ways to implement this partitioning, each based on a distinct cluster model. Different algorithms apply to each model, and they differ in their properties and results. The models are distinguished by how clusters are organized and how their members relate to one another. The most important ones are:

– Centroid – each cluster is represented by a single mean vector, and an object's values are compared to these means
– Distribution – clusters are built using statistical distributions
– Connectivity – membership is based on a distance function between elements
– Group – the algorithm provides only the grouping itself, without a refined model of the result
– Graph – cluster organization and the relationships between members are defined by a linked graph structure
– Density – members of a cluster are grouped by regions where observations are dense and similar

Clustering Algorithms in Data Mining

Based on the cluster models just described, many clustering algorithms can be applied to a data set to partition its information. This article briefly describes the most important ones. Every method has advantages and drawbacks, so the choice of algorithm depends on the characteristics of the data set and on the goal of the analysis.

Centroid-based

In this type of grouping method, every cluster is represented by a vector of values, its centroid. Each object belongs to the cluster whose centroid is closest to it. The number of clusters must be defined in advance, which is the biggest drawback of this kind of algorithm. This methodology is the closest to classification and is widely used for optimization problems.
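A well-known centroid-based method is k-means. The sketch below follows Lloyd's algorithm in plain Python. The article does not name a specific algorithm, so the choice of k-means, the sample points, and the initial centroids are all assumptions for illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Hypothetical sample data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])

print(labels)  # [0, 0, 0, 1, 1, 1]
```

The caller has to supply the initial centroids, and with them the number of clusters, which is exactly the limitation mentioned above.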

Distribution-based

Built on pre-defined statistical models, this methodology groups objects whose values are likely to come from the same distribution. Because the model assumes that values are generated randomly, it must be well defined, and often complex, to fit real data well. In return, these methods can reach an optimal solution for the chosen model and capture correlations and dependencies between attributes.
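One common way to realize this idea is a Gaussian mixture fitted with expectation-maximization (EM). The article does not name this method, so treat the sketch below as one possible instance under assumptions: one-dimensional data, two components, hand-picked initial means, and a small variance floor to avoid division by zero.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_mixture(data, means, iterations=50, min_var=1e-3):
    """EM for a 1-D Gaussian mixture; `means` holds the initial guesses."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [
                w * gaussian_pdf(x, m, v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)
    return means, variances, weights, resp


# Hypothetical sample: values scattered around 1 and around 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights, resp = fit_mixture(data, means=[0.0, 6.0])

# Each row of `resp` is a soft membership; taking the largest gives a hard label.
labels = [r.index(max(r)) for r in resp]
print([round(m, 2) for m in means])
print(labels)
```

The responsibilities are soft memberships, so this model also illustrates soft partitioning. The fitted means and variances describe each cluster as a distribution rather than as a single point.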

Connectivity-based

In this type of algorithm, every object is related to its neighbors, and the strength of that relationship depends on the distance between them. Clusters are therefore formed from nearby objects and can be characterized by the maximum distance needed to connect their members. Because clusters merge as that distance grows, they have a hierarchical representation. The choice of distance function depends on the focus of the analysis.
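A minimal sketch of this idea is single-linkage agglomerative clustering with a distance cut-off. The choice of single linkage, the Euclidean distance, the sample points, and the `max_distance` parameter are assumptions for illustration. The brute-force search is written for readability, not speed.

```python
import math


def single_linkage(points, max_distance):
    """Merge the closest clusters until the nearest pair exceeds max_distance."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point

    def gap(a, b):
        # Single linkage: distance between the two closest members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters (ties resolved by index order).
        d, i, j = min(
            (gap(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > max_distance:
            break  # remaining clusters are too far apart to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical sample data; clusters are reported as lists of point indices.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (10, 10)]
result = single_linkage(points, max_distance=1.5)

print(result)  # [[0, 1, 2], [3, 4], [5]]
```

Recording each merge together with the distance at which it happened would give the hierarchical tree. Raising `max_distance` corresponds to cutting that tree higher up, which yields fewer and larger clusters.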

Density-based

These algorithms create clusters in regions where the members of a data set are densely packed. They combine a notion of distance with a minimum density threshold to group members into clusters. These methods may be less effective at detecting the border areas of a group.
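DBSCAN is a widely used density-based algorithm. The article does not name it, so the simplified version below is an assumed example. The parameters `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a dense region) and the sample points are chosen for illustration.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [
            j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps
        ]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already processed
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is also a core point: keep expanding
    return labels


# Hypothetical sample: two dense groups and one isolated point.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
    (9.0, 1.0),
]
labels = dbscan(points, eps=0.5, min_pts=3)

print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The number of clusters is not specified in advance, and the isolated point is reported as noise. Border points are attached to whichever cluster reaches them first, which is one reason group boundaries can be less reliable.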

Main Applications of Cluster Analysis

Because it is such a valuable data analysis technique, clustering has many applications across the sciences. Large data sets of almost any kind can be processed with it, and it works with many distinct types of data.

One of the most important applications is image processing, where clustering detects distinct kinds of patterns in image data. This can be very effective in biology research, for distinguishing objects and identifying patterns. Another use is the classification of medical exams.

Personal data combined with shopping, location, interests, actions and countless other indicators can be analyzed with this methodology to reveal important information and trends. Examples include market research, marketing strategy, web analytics, and many others.

Other fields that apply clustering algorithms include climatology, robotics, recommender systems, and mathematical and statistical analysis, which gives the technique a broad spectrum of uses.

This article originally appeared here. Republished with permission.