What is clustering? | A way of grouping similar items/examples/instances
– Items are described by numerical features
– Feature extraction and scaling are needed before clustering
• Unsupervised learning
– No labeled data, unlike in classification
• Application areas:
– Image processing
– Data mining / big data
– Information retrieval
• Documents/texts |
Types of clustering | • Hierarchical methods
• Partitioning methods
• Hard clustering – Each item belongs to exactly one cluster
• Soft clustering – Each item belongs to some/all clusters with a certain probability |
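The hard/soft distinction can be illustrated with a minimal sketch (the data here is made up for illustration):

```python
# Hard vs. soft assignments for 3 items and 2 clusters.
hard = [0, 1, 0]  # hard clustering: each item belongs to exactly one cluster
soft = [[0.9, 0.1],  # soft clustering: each item has a membership
        [0.2, 0.8],  # probability for every cluster
        [0.6, 0.4]]

# In soft clustering, each item's membership probabilities sum to 1.
assert all(abs(sum(p) - 1.0) < 1e-9 for p in soft)
```

A hard assignment can always be recovered from a soft one by taking each item's most probable cluster.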
Hierarchical Clustering | • Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram – A tree-like diagram that records the sequences of merges or splits |
Strengths of Hierarchical Clustering | • No assumptions on the number of clusters – Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level |
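This ‘cutting’ can be sketched with SciPy, assuming it is available (the sample points are made up for illustration): `linkage` builds the dendrogram once, and `fcluster` then extracts any desired number of clusters from it without re-running the clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])
Z = linkage(X, method="single")  # build the dendrogram (merge history)

# 'Cut' the same dendrogram at different levels:
labels_2 = fcluster(Z, t=2, criterion="maxclust")  # 2 clusters
labels_3 = fcluster(Z, t=3, criterion="maxclust")  # 3 clusters
```

The `maxclust` criterion picks the cut height automatically so that at most `t` clusters result; alternatively, a distance threshold can be passed with `criterion="distance"`.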
Types of Hierarchical Clustering | – Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a single point (or there are k clusters) |
Complexity of hierarchical clustering | • A distance matrix is used for deciding which clusters to merge/split
• Storing and updating all pairwise distances makes the method at least quadratic, O(n²), in the number n of data points |
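The quadratic cost comes directly from the distance matrix itself, as a minimal sketch shows (the points are made up for illustration):

```python
from math import dist  # Euclidean distance, Python 3.8+

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
n = len(points)

# The full n x n distance matrix: n^2 entries, so O(n^2) time
# and memory before any merging has even started.
D = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
```

Since the matrix is symmetric with a zero diagonal, only n(n−1)/2 distances are distinct, but that is still quadratic in n.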
Agglomerative clustering algorithm | • Most popular hierarchical clustering technique
• Basic algorithm
1. Compute the distance matrix between the input data points
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the distance matrix
6. Until only a single cluster remains
• Key operation is the computation of the distance between two clusters
– Different definitions of the distance between clusters lead to different algorithms |
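The six steps above can be sketched in pure Python; this is a naive illustration, not a production implementation, and the function and parameter names are my own. For simplicity it recomputes cluster distances on demand instead of maintaining and updating an explicit distance matrix. The `linkage` parameter captures the key design choice mentioned above: `min` over pairwise point distances gives single link, `max` gives complete link.

```python
from math import dist

def agglomerative(points, k, linkage=min):
    """Naive agglomerative clustering down to k clusters.

    `linkage` combines the pairwise point distances between two
    clusters: min = single link, max = complete link.
    """
    # Steps 1-2: one cluster per data point (distances computed lazily).
    clusters = [[i] for i in range(len(points))]

    def d(a, b):
        return linkage(dist(points[i], points[j]) for i in a for j in b)

    # Steps 3-6: repeatedly merge the closest pair of clusters.
    while len(clusters) > k:
        a, b = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: d(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters.pop(b)
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerative(pts, 2))  # -> [[0, 1], [2, 3]]
```

Swapping `min` for `max` (or an average) changes which pair counts as "closest" and hence which algorithm this is, exactly as the last bullet states.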