Chapter 15: Advanced Machine Learning — Classification Algorithms

🧠 Introduction to Classification

Classification is a core machine learning task that involves predicting categorical outcomes — determining which class or category a given sample belongs to based on its features.

Unlike regression (which predicts continuous values), classification predicts discrete labels, such as:

- Spam vs. not spam (binary)
- A flower's species (multi-class)
- A movie's genres (multi-label)

It’s one of the most widely used ML paradigms across industries — from fraud detection to sentiment analysis and image recognition.
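To make the contrast concrete, here is a minimal sketch (the toy data is invented for illustration) in which a classifier returns a discrete label while a regressor returns a continuous value:

```python
# Classification returns a discrete label; regression returns a continuous value.
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4]]          # one feature per sample
y_class = [0, 0, 1, 1]            # discrete labels  -> classification
y_value = [1.1, 1.9, 3.2, 3.9]    # continuous target -> regression

clf = LogisticRegression().fit(X, y_class)
reg = LinearRegression().fit(X, y_value)

print(clf.predict([[2.5]]))   # a class label: [0] or [1]
print(reg.predict([[2.5]]))   # a continuous estimate (here 2.525)
```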


⚙️ 1. How Classification Works

The model learns a decision boundary that separates data points from different classes; a small numerical sketch of such a boundary follows the table below.

| Problem Type | Example | Target Output |
|---|---|---|
| Binary Classification | Spam filtering | 0 or 1 |
| Multi-Class Classification | Iris flower species | Setosa / Versicolor / Virginica |
| Multi-Label Classification | Movie genres | [Action, Comedy] |
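To see what a learned boundary looks like numerically, here is a minimal sketch on synthetic data; logistic regression is used because its boundary is simply the line w·x + b = 0 (the dataset and parameter choices are illustrative):

```python
# A linear classifier's decision boundary is the line w1*x1 + w2*x2 + b = 0;
# samples on either side of it are assigned different classes.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_informative=2, random_state=0)
clf = LogisticRegression().fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print(f"Boundary: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
print("Predicted class at the origin:", clf.predict([[0.0, 0.0]])[0])
```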

🧩 2. Common Classification Algorithms

| Algorithm | Type | When to Use | Key Strength |
|---|---|---|---|
| Logistic Regression | Linear | Simple, interpretable models | Probability outputs |
| Support Vector Machine (SVM) | Non-linear | Small, clean datasets | Robust with clear margins |
| Decision Tree | Non-linear | Explainable models | Human-interpretable decisions |
| Random Forest | Ensemble | Complex problems | Handles non-linearity well |
| K-Nearest Neighbors (KNN) | Non-parametric | Small datasets | No training phase |
| Naïve Bayes | Probabilistic | Text classification | Fast, low resource usage |
| Neural Networks | Deep Learning | Large datasets | High accuracy potential |
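Before committing to one algorithm, it is often worth benchmarking several of them on the same data. A minimal sketch (default hyperparameters, Iris data; all choices here are illustrative) using 5-fold cross-validation:

```python
# Compare several classifiers from the table above with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (RBF)": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```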

🌸 3. Example — Iris Classification with SVM

Let’s build a Support Vector Machine (SVM) classifier using Scikit‑Learn’s Iris dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
model = SVC(kernel='rbf', gamma='auto', C=1)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Visualize Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, cmap='Blues', xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix — Iris Classification")
plt.show()
```

🧠 Key Observations

- With the RBF kernel, the SVM typically separates the three Iris species with high accuracy on this split.
- Setosa is linearly separable from the other two species; when errors occur, they are usually confusions between versicolor and virginica.
- The confusion matrix exposes these per-class errors, which a single accuracy figure can hide.

🔍 4. Decision Boundary Visualization (Optional)

For datasets with only two features, you can visualize the learned boundaries.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC          # re-imported so this block runs on its own
import numpy as np
import matplotlib.pyplot as plt      # re-imported so this block runs on its own
import seaborn as sns                # re-imported so this block runs on its own

# Generate sample 2D dataset
X, y = make_classification(n_features=2, n_redundant=0, n_informative=2, random_state=3, n_clusters_per_class=1)

model = SVC(kernel='linear')
model.fit(X, y)

# Plot decision regions
plt.figure(figsize=(6,5))
x_min, x_max = X[:,0].min() - 1, X[:,0].max() + 1
y_min, y_max = X[:,1].min() - 1, X[:,1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
sns.scatterplot(x=X[:,0], y=X[:,1], hue=y, edgecolor='k', palette='deep')
plt.title("SVM Decision Boundary")
plt.show()
```

The shaded regions represent how the SVM separates different classes using its decision boundary.


🧮 5. Evaluating Classification Models

| Metric | Description | Function |
|---|---|---|
| Accuracy | % of correct predictions. | accuracy_score() |
| Precision | % of predicted positives that are correct. | precision_score() |
| Recall (Sensitivity) | % of actual positives correctly identified. | recall_score() |
| F1-Score | Harmonic mean of precision & recall. | f1_score() |
| ROC-AUC | Area under the ROC curve; summarizes binary classifier quality across thresholds. | roc_auc_score() |

A classification report (via classification_report()) summarizes these metrics automatically.
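For multi-class problems like Iris, the individual metric functions also need an averaging strategy (e.g., average='macro' treats every class equally). A minimal sketch, reusing y_test and y_pred from the SVM example above:

```python
# Per-metric scores for a multi-class problem; 'macro' averages each metric
# over the classes, giving every class equal weight regardless of size.
from sklearn.metrics import precision_score, recall_score, f1_score

print("Precision (macro):", precision_score(y_test, y_pred, average='macro'))
print("Recall (macro):   ", recall_score(y_test, y_pred, average='macro'))
print("F1 (macro):       ", f1_score(y_test, y_pred, average='macro'))
```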


🎯 6. Hyperparameter Tuning with GridSearchCV

Optimizing hyperparameters can significantly improve model performance.

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}

grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best Cross‑Val Score:", grid.best_score_)

GridSearchCV automates trying combinations of parameters and cross‑validates results for robust tuning.
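After the search, the best parameter combination is refitted on the whole training set and exposed as grid.best_estimator_; a minimal sketch evaluating it on the held-out Iris test set from section 3:

```python
# The refitted best model should be judged on data the search never saw.
best_model = grid.best_estimator_
print("Test accuracy of tuned model:", best_model.score(X_test, y_test))
```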


⚖️ 7. Understanding Bias–Variance Trade‑off

| Concept | Description | Risk |
|---|---|---|
| High Bias (Underfitting) | Model too simple → misses patterns. | Low accuracy |
| High Variance (Overfitting) | Model too complex → memorizes training data. | Poor generalization |

Regularization helps balance bias and variance; in Scikit-Learn this is the C parameter for SVMs and logistic regression (smaller C means stronger regularization), or alpha for ridge and lasso models.
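One way to see the trade-off empirically is to sweep a regularization parameter and compare training scores against cross-validated scores. A minimal sketch using validation_curve on the Iris training split from section 3 (the C range is an illustrative choice):

```python
# Sweep C for an RBF SVM: very small C under-fits (high bias), while very
# large C can over-fit (high variance), visible as a train/CV score gap.
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

C_range = np.logspace(-3, 3, 7)
train_scores, val_scores = validation_curve(
    SVC(kernel='rbf'), X_train, y_train,
    param_name='C', param_range=C_range, cv=5)

for C, tr, va in zip(C_range, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"C={C:<8g} train={tr:.3f}  cv={va:.3f}")
```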


🚀 8. Takeaways

- Classification predicts discrete labels; frame the problem as binary, multi-class, or multi-label depending on the target.
- Algorithm choice depends on data size, linearity, and the need for interpretability (see the comparison table in section 2).
- Judge models with precision, recall, and F1 alongside accuracy, especially on imbalanced data.
- GridSearchCV combines hyperparameter search with cross-validation for robust tuning.
- Mind the bias-variance trade-off: regularization is the main lever for balancing it.

🧭 Conclusion

Classification algorithms form the foundation of intelligent decision systems — from email filters to diagnostic AI.
By mastering Scikit‑Learn’s tools for classification, evaluation, and tuning, you’ll be well‑equipped to tackle real‑world predictive challenges.

“Accuracy is important, but understanding why the model made a decision matters even more.”