Chapter 15: Advanced Machine Learning — Clustering Algorithms

🧩 Introduction to Clustering

Clustering is a key unsupervised learning technique where the goal is to discover natural groupings in data — without using predefined labels.
It helps uncover hidden structures and relationships that may not be obvious at first glance.

Common applications include:

- Customer segmentation in marketing
- Anomaly and fraud detection
- Image compression and segmentation
- Grouping similar documents or articles

🧠 1. How Clustering Works

Clustering algorithms measure similarity or distance between data points (often using Euclidean distance) and group similar ones together.

| Concept | Description |
| --- | --- |
| Unsupervised Learning | No labeled outputs; the model learns the data's structure on its own. |
| Cluster | A group of data points with similar patterns. |
| Centroid | The "center" of a cluster, representing its average point. |
| Distance Metric | A way to measure how far apart points are (Euclidean, Manhattan, etc.). |
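
To make the distance metrics concrete, here is a quick comparison of Euclidean and Manhattan distance for a single pair of points:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.linalg.norm(a - b)  # sqrt(3**2 + 4**2) = 5.0
manhattan = np.abs(a - b).sum()    # |3| + |4| = 7.0
print(euclidean, manhattan)
```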

⚙️ 2. K‑Means Clustering — Core Idea

K‑Means partitions the dataset into K clusters where each data point belongs to the nearest cluster centroid.

Algorithm Steps

  1. Choose K, the number of clusters.
  2. Randomly initialize K centroids.
  3. Assign each data point to the nearest centroid.
  4. Update centroids as the mean of all assigned points.
  5. Repeat until convergence (centroids stop moving).
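
Before reaching for a library, it helps to see these steps in code. Below is a minimal from-scratch sketch in NumPy; it follows the steps above literally and skips production details such as empty-cluster handling and k-means++ initialization:

```python
import numpy as np

def kmeans_naive(X, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```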

🌸 3. Example — K‑Means on the Iris Dataset

Let’s implement K‑Means clustering using Scikit‑Learn and visualize results.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
iris = load_iris()
X = iris.data

# Choose number of clusters
k = 3

# Train K-Means model
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X)

# Extract results
labels = kmeans.labels_
centers = kmeans.cluster_centers_
sil_score = silhouette_score(X, labels)

print(f"Silhouette Score: {sil_score:.3f}")

# Visualize clusters (using first two features for simplicity)
plt.figure(figsize=(7, 5))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=labels, palette="viridis", s=60)
plt.scatter(centers[:, 0], centers[:, 1], marker="X", c="red", s=200, label="Centroids")
plt.title("K‑Means Clustering (Iris Dataset)")
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.legend()
plt.show()
```

🔍 What You See

The scatter plot shows the Iris samples colored by their assigned cluster, with the three centroids marked as red X's. One cluster (Iris setosa) separates cleanly from the others, while the remaining two overlap in this two-feature view, since versicolor and virginica are hard to distinguish by sepal measurements alone.

📊 4. Choosing the Right Number of Clusters (Elbow Method)

The Elbow Method helps estimate the optimal number of clusters by plotting inertia (sum of squared distances to centroids).

```python
inertias = []
k_values = range(1, 10)

for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow Method for Optimal k")
plt.show()
```

The “elbow point,” where adding more clusters stops reducing inertia substantially, often indicates a good value of k.
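
As a complementary check, the silhouette score can be computed for each candidate k; higher values suggest better-separated clusters. A short sketch reusing X and the silhouette_score import from Section 3 (the score is only defined for k ≥ 2):

```python
# Silhouette score for each candidate k (requires k >= 2)
for k in range(2, 10):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_k = km.fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels_k):.3f}")
```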


📏 5. Evaluating Cluster Quality

| Metric | Description | Range | Ideal Value |
| --- | --- | --- | --- |
| Inertia | Sum of squared distances of points to their cluster center. | 0 → ∞ | Lower is better |
| Silhouette Score | Measures separation between clusters. | −1 → 1 | Closer to 1 |
| Calinski-Harabasz Index | Ratio of between-cluster to within-cluster variance. | 0 → ∞ | Higher is better |
| Davies-Bouldin Index | Average similarity between clusters. | 0 → ∞ | Closer to 0 |
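
All four metrics are available in Scikit-Learn. A quick sketch, reusing the kmeans model, X, and labels fitted in Section 3:

```python
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

print(f"Inertia:                 {kmeans.inertia_:.2f}")
print(f"Silhouette Score:        {silhouette_score(X, labels):.3f}")
print(f"Calinski-Harabasz Index: {calinski_harabasz_score(X, labels):.2f}")
print(f"Davies-Bouldin Index:    {davies_bouldin_score(X, labels):.3f}")
```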

⚖️ 6. Comparing Clustering Algorithms

| Algorithm | Type | Strengths | Weaknesses |
| --- | --- | --- | --- |
| K-Means | Centroid-based | Fast, scalable, easy to interpret | Requires pre-defining k; sensitive to outliers |
| DBSCAN | Density-based | Finds irregular clusters, detects outliers | Struggles with clusters of varying density |
| Agglomerative (Hierarchical) | Hierarchical | Interpretable dendrogram visualizations | Slow on large datasets |
| Gaussian Mixture Models (GMM) | Probabilistic | Handles overlapping clusters | Computationally heavy |
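
To get a feel for how these alternatives behave, here is a brief sketch running all three on the same Iris features (the parameter values, such as eps=0.5, are illustrative defaults rather than tuned choices):

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Density-based: eps and min_samples control neighborhood density.
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Hierarchical: merges the closest clusters until 3 remain.
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Probabilistic: fits 3 Gaussian components and assigns each point
# to its most likely component.
gmm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(X)

print("DBSCAN labels:", set(dbscan_labels))  # -1 marks outliers
print("Agglomerative labels:", set(agglo_labels))
print("GMM labels:", set(gmm_labels))
```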

🧭 7. Advanced Visualization — Pair Plot

For a deeper look at how K‑Means grouped samples:

```python
import pandas as pd
df = pd.DataFrame(X, columns=iris.feature_names)
df['Cluster'] = labels

sns.pairplot(df, hue='Cluster', palette='viridis')
plt.suptitle("Pairwise Feature Relationships by Cluster", y=1.02)
plt.show()
```

Pair plots show how clusters differ across multiple features, revealing which dimensions matter most.


🧠 8. Practical Considerations

| Challenge | Explanation | Strategy |
| --- | --- | --- |
| Scaling needed | K-Means uses Euclidean distance; unscaled features distort results. | Use StandardScaler or MinMaxScaler. |
| High-dimensional data | Harder to visualize or interpret. | Apply PCA before clustering. |
| Categorical data | K-Means can't handle non-numerical features. | Use K-Prototypes or encode the categories. |
| Initialization sensitivity | Random starts may yield different results. | Use multiple initializations (n_init). |
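
One convenient way to combine the scaling and dimensionality-reduction strategies is a Scikit-Learn pipeline. A minimal sketch, reusing X from earlier (the choice of 2 principal components is illustrative):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Standardize features, project to 2 principal components, then cluster.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=3, random_state=42, n_init=10),
)
pipeline_labels = pipeline.fit_predict(X)

# Evaluated in the original feature space for comparison with Section 3.
print(f"Silhouette (scaled + PCA): {silhouette_score(X, pipeline_labels):.3f}")
```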

🚀 9. Takeaways

- Clustering finds structure in unlabeled data by grouping points that are close under a chosen distance metric.
- K-Means is fast and easy to interpret, but it requires choosing k up front and is sensitive to feature scales and outliers.
- Use the elbow method and silhouette score together to pick a reasonable k.
- Validate results with several metrics (inertia, silhouette, Calinski-Harabasz, Davies-Bouldin) rather than relying on one.
- No single algorithm fits every dataset; compare K-Means against DBSCAN, hierarchical clustering, and GMMs when the data is irregular.

🧭 Conclusion

Clustering algorithms like K‑Means open the door to unsupervised discovery — letting data speak for itself.
By mastering clustering techniques and evaluation metrics, you can uncover valuable insights in complex datasets.

“Clustering is the art of finding patterns when no one tells you what to look for.”