Level 1 of Chapter 6 - Clustering

Questions and Answers List


Question: What is clustering?
Answer:
• A smart way of grouping items/examples/instances
  – Items are described by numerical features
  – Feature extraction and scaling are needed before clustering
• Unsupervised learning
  – No labeled data, unlike in classification
• Application areas:
  – Image processing
  – Data mining / big data
  – Information retrieval (documents/texts)
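As a minimal sketch of the feature-scaling step mentioned above, assuming scikit-learn is available; the feature matrix and its values are made up for illustration:

```python
# Minimal sketch: scaling numerical features before clustering
# (assumes scikit-learn; the toy data is illustrative).
import numpy as np
from sklearn.preprocessing import StandardScaler

# Rows are items, columns are numerical features with very different
# ranges (e.g., age in years vs. income in euros).
X = np.array([
    [25, 30_000],
    [47, 52_000],
    [35, 41_000],
    [52, 110_000],
])

# Standardize each feature to zero mean and unit variance so that
# no single feature dominates the distance computations used by clustering.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)
```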
Question: Types of clustering
Answer:
• Hierarchical methods
• Partitioning methods
• Hard clustering
  – Each item belongs to exactly one cluster
• Soft clustering
  – Each item belongs to some/all clusters with a certain probability
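The hard/soft distinction can be made concrete with a small sketch, assuming scikit-learn: k-means produces a hard assignment, while a Gaussian mixture produces per-cluster probabilities. The two-cluster toy data is illustrative:

```python
# Minimal sketch of hard vs. soft cluster assignments
# (assumes scikit-learn; data is illustrative).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])

# Hard clustering: each item gets exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(hard_labels)            # e.g. [0 0 1 1]

# Soft clustering: each item gets a probability for every cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X))   # each row sums to 1
```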
Question: Hierarchical clustering
Answer:
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits
Question: Strengths of hierarchical clustering
Answer:
• No assumptions on the number of clusters
  – Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
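A minimal sketch of building and cutting a dendrogram, assuming SciPy and Matplotlib; the 'average' linkage and the toy points are arbitrary choices for illustration:

```python
# Minimal sketch: building a dendrogram and 'cutting' it to get k clusters
# (assumes SciPy and Matplotlib; the toy data is illustrative).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.8], [9.0, 0.0]])

# Agglomerative clustering; 'average' linkage is just one possible choice.
Z = linkage(X, method="average")

# Visualize the nested clusters as a dendrogram.
dendrogram(Z)
plt.show()

# Cut the tree to obtain any desired number of clusters, e.g. k = 2,
# without having fixed the number of clusters in advance.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```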
Question: Types of hierarchical clustering
Answer:
• Agglomerative:
  – Start with the points as individual clusters
  – At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
• Divisive:
  – Start with one, all-inclusive cluster
  – At each step, split a cluster until each cluster contains a single point (or there are k clusters)
Question: Complexity of hierarchical clustering
Answer:
• A distance matrix is used for deciding which clusters to merge/split
• At least quadratic in the number of data points
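A short sketch of the distance matrix that drives this cost, assuming SciPy; the random data is illustrative:

```python
# Minimal sketch: the pairwise distance matrix behind hierarchical clustering
# (assumes SciPy; data is illustrative).
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(100, 3)          # 100 points with 3 features

# The condensed distance vector holds n*(n-1)/2 entries; the full
# symmetric matrix is n x n, i.e. quadratic in the number of points.
D = squareform(pdist(X, metric="euclidean"))
print(D.shape)                      # (100, 100)
```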
Question: Agglomerative clustering algorithm
Answer:
• The most popular hierarchical clustering technique
• Basic algorithm:
  1. Compute the distance matrix between the input data points
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the distance matrix
  6. Until only a single cluster remains
• Key operation is the computation of the distance between two clusters
  – Different definitions of the distance between clusters lead to different algorithms
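A minimal from-scratch sketch of the basic algorithm above, assuming NumPy/SciPy; single-link (minimum) cluster distance is used as one possible definition, and instead of updating the distance matrix in place, the sketch simply recomputes cluster distances from the point-level matrix:

```python
# Minimal from-scratch sketch of the basic agglomerative algorithm
# (assumes NumPy/SciPy; single-link distance is one possible choice,
# and no attempt is made to be efficient).
import numpy as np
from scipy.spatial.distance import cdist


def agglomerative(X, k=1):
    # 1.-2. Compute the distance matrix and start with one cluster per point.
    dist = cdist(X, X)
    clusters = [[i] for i in range(len(X))]

    def cluster_distance(a, b):
        # Single link: smallest point-to-point distance between the clusters.
        return min(dist[i, j] for i in a for j in b)

    # 3.-6. Repeatedly merge the two closest clusters until k remain.
    while len(clusters) > k:
        p, q = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: cluster_distance(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
print(agglomerative(X, k=2))   # e.g. [[0, 1], [2, 3]]
```

Replacing the single-link definition of cluster_distance with the maximum or average point-to-point distance would give complete-link or average-link clustering, which is how different distance definitions lead to different algorithms.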