Chapter 14: Introduction to Machine Learning — Scikit‑Learn Library

⚙️ Scikit‑Learn: The Engine of Practical Machine Learning

[Scikit‑Learn (sklearn)](https://scikit-learn.org) is the core library for classical machine learning in Python.
It provides efficient tools for data preprocessing, model training, evaluation, and hyperparameter tuning, all through a consistent and intuitive API.

Built on NumPy and SciPy, and commonly paired with pandas and Matplotlib, Scikit‑Learn integrates seamlessly with the broader data science ecosystem, making it a go‑to choice for beginners and professionals alike.

🧠 1. Why Scikit‑Learn?

| Feature | Description |
| --- | --- |
| Consistent API | Uniform method naming (`fit()`, `predict()`, `transform()`) across models. |
| Wide Algorithm Support | Regression, classification, clustering, dimensionality reduction, and more. |
| Preprocessing Tools | Handle missing data, scaling, normalization, encoding, and feature selection. |
| Pipelines | Combine preprocessing and modeling steps into reproducible workflows. |
| Performance | Optimized using C‑backed NumPy/SciPy operations for speed. |
| Documentation | Exceptional tutorials and examples for all levels of expertise. |
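
To make the "Consistent API" row concrete, here is a minimal sketch. The choice of estimators and the variable names (`X_iris`, `estimators`, `predictions`) are illustrative assumptions, not part of the original chapter. It shows a transformer using `fit_transform()` and three unrelated classifiers sharing the same `fit()` / `predict()` calls.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Transformers learn statistics with fit() and apply them with transform();
# fit_transform() does both in one call.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_iris)

# Three unrelated algorithms, chosen only for illustration.
estimators = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(random_state=0),
]

predictions = {}
for est in estimators:
    est.fit(X_scaled, y_iris)                 # same training call for every model
    name = type(est).__name__
    predictions[name] = est.predict(X_scaled[:3])  # same prediction call
    print(name, predictions[name])
```

Because the method names never change, swapping one algorithm for another usually means changing a single line.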

🧩 2. The Scikit‑Learn Workflow

Most Scikit‑Learn projects follow the same overall structure:

Data → Preprocessing → Model Training → Evaluation → Optimization → Deployment

Let’s see these phases in action using the famous Iris dataset.

🌸 3. Example — Classification with SVM (Iris Dataset)

We’ll train a Support Vector Machine (SVM) classifier to predict the species of an iris flower based on petal and sepal dimensions.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train an SVM classifier
model = SVC(kernel='linear', C=1.0)
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Visualize confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, cmap="Blues", xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix — Iris Classification")
plt.show()

🔍 Explanation

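`fit()` trains the model on the training data.
`predict()` makes predictions on unseen data.
`accuracy_score()` returns the fraction of predictions that are correct.
`confusion_matrix()` returns a matrix of counts (rows are true classes, columns are predicted classes); the heatmap visualizes it to show per‑class performance.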
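
The link between the confusion matrix and accuracy can be checked by hand. The sketch below uses small hypothetical label lists (they are not output from the Iris run above) and recovers accuracy and per‑class recall directly from the matrix.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for a three-class problem (illustration only).
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Correct predictions sit on the diagonal, so accuracy is trace / total.
accuracy_from_cm = np.trace(cm) / cm.sum()

# Recall for each class: correct predictions divided by the row total.
per_class_recall = np.diag(cm) / cm.sum(axis=1)

print("Accuracy from matrix:", accuracy_from_cm)
print("accuracy_score():    ", accuracy_score(y_true, y_pred))
print("Per-class recall:    ", per_class_recall)
```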

🧰 4. Data Preprocessing with Scikit‑Learn

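Before training, raw data usually needs to be prepared: numeric columns scaled and categorical columns encoded as numbers.
Scikit‑Learn provides transformers for these steps, which can be applied per column and chained with a model: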

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Example dataset
data = pd.DataFrame({
    'age': [25, 32, 47, 51, 62],
    'salary': [50000, 60000, 80000, 90000, 120000],
    'city': ['NY', 'LA', 'NY', 'SF', 'LA'],
    'purchased': [0, 1, 0, 1, 1]
})

# Features and target
X = data[['age', 'salary', 'city']]
y = data['purchased']

# Define column types
numeric_features = ['age', 'salary']
categorical_features = ['city']

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
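        # handle_unknown='ignore' avoids an error if a new city appears at prediction time
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)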
    ])

# Full ML pipeline
pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', LogisticRegression())
])

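pipeline.fit(X, y)

# New samples must be a DataFrame with the same column names,
# because the ColumnTransformer selects columns by name
new_data = pd.DataFrame({'age': [40], 'salary': [75000], 'city': ['SF']})
print("Prediction for new data:", pipeline.predict(new_data))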

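💡 Pipelines apply the same preprocessing at training and prediction time. When a pipeline is fitted only on training data (for example inside cross‑validation), scalers and encoders never see the held‑out data, which avoids data leakage.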
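
One way to see the leakage point in practice is to pass the whole pipeline to cross‑validation. This is a minimal sketch on the Iris data; the names `leak_free`, `X_iris`, and `scores` are assumptions for illustration. In each fold the scaler is re‑fitted on the training portion only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# The scaler is part of the estimator, so cross_val_score re-fits it
# inside every fold using only that fold's training rows.
leak_free = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(leak_free, X_iris, y_iris, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:  ", scores.mean())
```

Scaling the full dataset first and cross‑validating afterwards would let test‑fold statistics influence training, which is the leakage the pipeline prevents.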

📏 5. Model Evaluation

Scikit‑Learn includes many evaluation metrics beyond accuracy:

| Metric | Description | Function |
| --- | --- | --- |
| Accuracy | Fraction of correct predictions. | `accuracy_score()` |
| Precision / Recall / F1 | Per‑class metrics, useful on imbalanced datasets. | `classification_report()` |
| ROC‑AUC | Ranking quality of a binary classifier's scores. | `roc_auc_score()` |
| MSE / RMSE | Regression error measures (RMSE is the square root of MSE). | `mean_squared_error()` |
| R² | Proportion of variance explained by a regression model. | `r2_score()` |
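
The functions in the table can be tried on a few hand‑written values. All labels, scores, and regression targets below are hypothetical and exist only to show the call signatures.

```python
import numpy as np
from sklearn.metrics import (
    classification_report,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)

# --- Classification (hypothetical labels and predicted probabilities) ---
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# ROC-AUC is computed from scores, not from hard class labels.
auc = roc_auc_score(y_true, y_score)

# Threshold the scores at 0.5 to obtain class labels for the report.
y_pred_labels = [int(s >= 0.5) for s in y_score]
print(classification_report(y_true, y_pred_labels))
print("ROC-AUC:", auc)

# --- Regression (hypothetical targets and predictions) ---
y_reg_true = [3.0, -0.5, 2.0, 7.0]
y_reg_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_reg_true, y_reg_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
r2 = r2_score(y_reg_true, y_reg_pred)
print("MSE:", mse, "RMSE:", rmse, "R2:", r2)
```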

🎯 6. Model Selection and Tuning

Scikit‑Learn provides tools for cross‑validation and hyperparameter tuning:

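from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X_train and y_train are the Iris training split from Section 3
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)  # 5-fold cross-validation
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
# best_score_ is the mean cross-validated accuracy of the best combination
print("Best Score:", grid_search.best_score_)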

⚙️ GridSearchCV evaluates every combination in `param_grid` with cross‑validation and, by default, refits the best one on the full training set.
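
Grid search also works on a whole pipeline, which keeps preprocessing inside each cross‑validation fold. The sketch below is one possible setup rather than the chapter's own: the step names `scale` and `svc` are arbitrary labels chosen here, and parameters are addressed with the `step__parameter` naming convention.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# Step names ("scale", "svc") are arbitrary labels chosen for this sketch.
svm_pipeline = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# "<step name>__<parameter>" routes each value to the right pipeline step.
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(svm_pipeline, param_grid, cv=5)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy:", search.best_score_)
# The held-out test set is used only once, after tuning is finished.
print("Test accuracy:", search.score(X_te, y_te))
```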

🧮 7. Common Algorithms in Scikit‑Learn

| Category | Algorithm | Scikit‑Learn Class |
| --- | --- | --- |
| Regression | Linear Regression | `LinearRegression` |
| Classification | Logistic Regression, SVM, Decision Tree | `LogisticRegression`, `SVC`, `DecisionTreeClassifier` |
| Clustering | K‑Means, DBSCAN | `KMeans`, `DBSCAN` |
| Dimensionality Reduction | PCA, LDA | `PCA`, `LinearDiscriminantAnalysis` |
| Ensemble Methods | Random Forest, Gradient Boosting | `RandomForestClassifier`, `GradientBoostingRegressor` |
| Model Selection | Cross‑validation, Grid Search | `cross_val_score`, `GridSearchCV` |
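
The unsupervised classes in the table follow the same conventions, except that no target `y` is passed. As a rough sketch (the component and cluster counts are assumptions picked for the Iris data), `PCA` reduces the four features to two and `KMeans` then groups the projected points.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris, _ = load_iris(return_X_y=True)  # labels are ignored on purpose

# Dimensionality reduction: project 4 features onto 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_iris)
print("Variance explained:", pca.explained_variance_ratio_.sum())

# Clustering: group the projected points into 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_2d)
print("First ten cluster assignments:", cluster_labels[:10])
```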

💾 8. Saving and Loading Models

After training, you can save your model for later use:

import joblib

# Save model
joblib.dump(model, "svm_model.pkl")

# Load model
loaded_model = joblib.load("svm_model.pkl")
print("Loaded model prediction:", loaded_model.predict(X_test[:1]))

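🧠 Use joblib or pickle for model persistence. Load only files you trust, since unpickling can execute arbitrary code, and load them with the same Scikit‑Learn version that saved them.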
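
Saving the whole pipeline, rather than the bare model, keeps the preprocessing and the classifier together. The following sketch assumes a scaler plus linear SVM pipeline and a temporary directory; the file name `svm_pipeline.joblib` is made up for the example. It checks that the restored object predicts identically.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# Preprocessing and model are stored as one object.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X_iris, y_iris)

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "svm_pipeline.joblib")  # hypothetical file name
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions exactly.
same = np.array_equal(clf.predict(X_iris), restored.predict(X_iris))
print("Predictions identical after reload:", same)
```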

🧭 Conclusion

Scikit‑Learn is the cornerstone of applied machine learning in Python — from data preparation to evaluation and tuning.
Its consistent interface, extensive algorithms, and seamless integration make it ideal for both rapid prototyping and production‑grade ML systems.

“Learn the workflow, not just the algorithm — Scikit‑Learn makes experimentation effortless.”