Chapter 15: Advanced Machine Learning — Clustering Algorithms
🧩 Introduction to Clustering
Clustering is a key unsupervised learning technique where the goal is to discover natural groupings in data — without using predefined labels.
It helps uncover hidden structures and relationships that may not be obvious at first glance.
Common applications include:
- 🛒 Customer segmentation (grouping buyers by behavior)
- 🚨 Anomaly detection (finding outliers)
- 🧬 Gene expression analysis
- 🖼️ Image compression
🧠 1. How Clustering Works
Clustering algorithms measure similarity or distance between data points (often using Euclidean distance) and group similar ones together.
| Concept | Description |
|---|---|
| Unsupervised Learning | No labeled outputs — the model learns data structure by itself. |
| Cluster | A group of data points with similar patterns. |
| Centroid | The “center” of a cluster, representing its average point. |
| Distance Metric | A way to measure how far apart points are (Euclidean, Manhattan, etc.). |
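To make the Distance Metric row concrete, here is a tiny sketch (using NumPy, with made-up coordinate values) comparing the two metrics named above:

```python
import numpy as np

# Two illustrative points in a 2-D feature space
a = np.array([5.1, 3.5])
b = np.array([6.2, 2.9])

# Euclidean distance: straight-line distance between the points
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan distance: sum of absolute differences per feature
manhattan = np.sum(np.abs(a - b))

print(f"Euclidean: {euclidean:.3f}, Manhattan: {manhattan:.3f}")
```

Clustering algorithms group together points whose pairwise distances are small under the chosen metric.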
⚙️ 2. K‑Means Clustering — Core Idea
K‑Means partitions the dataset into K clusters, assigning each data point to the cluster whose centroid is nearest.
Algorithm Steps
- Choose K, the number of clusters.
- Randomly initialize K centroids.
- Assign each data point to the nearest centroid.
- Update centroids as the mean of all assigned points.
- Repeat until convergence (centroids stop moving).
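As a rough illustration of these steps, here is a minimal from-scratch sketch in NumPy (a teaching aid, not the Scikit-Learn implementation used in the next section):

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=42):
    """Bare-bones K-Means: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (empty-cluster handling omitted for brevity)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```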
🌸 3. Example — K‑Means on the Iris Dataset
Let’s implement K‑Means clustering using Scikit‑Learn and visualize results.
```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
iris = load_iris()
X = iris.data

# Choose number of clusters
k = 3

# Train K-Means model
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X)

# Extract results
labels = kmeans.labels_
centers = kmeans.cluster_centers_
sil_score = silhouette_score(X, labels)
print(f"Silhouette Score: {sil_score:.3f}")

# Visualize clusters (using first two features for simplicity)
plt.figure(figsize=(7, 5))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=labels, palette="viridis", s=60)
plt.scatter(centers[:, 0], centers[:, 1], marker="X", c="red", s=200, label="Centroids")
plt.title("K-Means Clustering (Iris Dataset)")
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.legend()
plt.show()
```
🔍 What You See
- Each color → a cluster found by the algorithm.
- Red “X” → cluster centroids.
- Silhouette Score (–1 to 1) measures how distinct and well‑formed clusters are — closer to 1 is better.
📊 4. Choosing the Right Number of Clusters (Elbow Method)
The Elbow Method helps estimate the optimal number of clusters by plotting inertia (sum of squared distances to centroids).
```python
inertias = []
k_values = range(1, 10)

for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow Method for Optimal k")
plt.show()
```
The “elbow point” (where the curve starts to bend) often indicates a good k value.
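The silhouette score offers a complementary check. As a small sketch (reusing X from the example above), compute it for each candidate k and prefer values of k where it peaks; the silhouette is undefined for a single cluster, so the loop starts at k = 2:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Silhouette score for each candidate k (undefined for k=1, so start at 2)
for k in range(2, 10):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_k = km.fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels_k):.3f}")
```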
📏 5. Evaluating Cluster Quality
| Metric | Description | Range | Ideal Value |
|---|---|---|---|
| Inertia | Total squared distance of points to their cluster center. | 0 → ∞ | Lower is better |
| Silhouette Score | Measures separation between clusters. | –1 → 1 | Closer to 1 |
| Calinski‑Harabasz Index | Ratio of between‑cluster to within‑cluster variance. | 0 → ∞ | Higher is better |
| Davies‑Bouldin Index | Average similarity between clusters. | 0 → ∞ | Lower is better (closer to 0) |
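All four metrics are easy to compute with Scikit-Learn (inertia is stored on the fitted model). A short sketch, assuming the kmeans, X, and labels objects from the Iris example above:

```python
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

# Inertia is an attribute of the fitted K-Means model
print(f"Inertia:           {kmeans.inertia_:.2f}")
print(f"Silhouette:        {silhouette_score(X, labels):.3f}")
print(f"Calinski-Harabasz: {calinski_harabasz_score(X, labels):.2f}")
print(f"Davies-Bouldin:    {davies_bouldin_score(X, labels):.3f}")
```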
🧮 6. Comparing Popular Clustering Algorithms
| Algorithm | Type | Strengths | Weaknesses |
|---|---|---|---|
| K‑Means | Centroid‑based | Fast, scalable, easy to interpret | Requires pre‑defining k, sensitive to outliers |
| DBSCAN | Density‑based | Finds irregular clusters, detects outliers | Struggles with varying density |
| Agglomerative (Hierarchical) | Hierarchical | Intuitive visualization via dendrograms | Slow on large datasets |
| Gaussian Mixture Models (GMM) | Probabilistic | Handles overlapping clusters | Computationally heavy |
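To compare the APIs side by side, here is a minimal sketch of running the three alternatives on the same Iris features X (parameter values such as eps=0.5 and min_samples=5 are illustrative, not tuned choices):

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Density-based: no k required; the label -1 marks points treated as noise
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Hierarchical: merges points bottom-up into the requested number of clusters
agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Probabilistic: fits a mixture of Gaussians and assigns the most likely component
gmm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(X)
```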
🧭 7. Advanced Visualization — Pair Plot
For a deeper look at how K‑Means grouped samples:
```python
import pandas as pd

df = pd.DataFrame(X, columns=iris.feature_names)
df['Cluster'] = labels

sns.pairplot(df, hue='Cluster', palette='viridis')
plt.suptitle("Pairwise Feature Relationships by Cluster", y=1.02)
plt.show()
```
Pair plots show how clusters differ across multiple features, revealing which dimensions matter most.
🧠 8. Practical Considerations
| Challenge | Explanation | Strategy |
|---|---|---|
| Scaling Needed | K‑Means uses Euclidean distance; unscaled features distort results. | Use StandardScaler or MinMaxScaler. |
| High‑Dimensional Data | Harder to visualize or interpret. | Apply PCA before clustering. |
| Categorical Data | K‑Means can’t handle non‑numerical features. | Use K‑Prototypes or encoding. |
| Initialization Sensitivity | Random starts may yield different results. | Use multiple initializations (n_init). |
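Combining the first two rows of this table, a common pattern is to standardize the features, optionally reduce dimensionality, and then cluster. A minimal sketch using a Scikit-Learn pipeline (the PCA step and n_components=2 are illustrative assumptions):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Standardize features, project onto 2 principal components, then cluster
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=3, random_state=42, n_init=10),
)
cluster_labels = pipeline.fit_predict(X)
```

Scaling first keeps large-valued features from dominating the Euclidean distances that K-Means relies on.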
🚀 9. Takeaways
- Clustering reveals hidden structure in data without labels.
- K‑Means is a simple yet powerful algorithm for continuous features.
- Use Elbow or Silhouette methods to choose k.
- For irregular or noisy data, explore DBSCAN or Hierarchical Clustering.
- Always scale your data before clustering.
🧭 Conclusion
Clustering algorithms like K‑Means open the door to unsupervised discovery — letting data speak for itself.
By mastering clustering techniques and evaluation metrics, you can uncover valuable insights in complex datasets.
“Clustering is the art of finding patterns when no one tells you what to look for.”