Chapter 14: Introduction to Machine Learning — Scikit‑Learn Library

⚙️ Scikit‑Learn: The Engine of Practical Machine Learning

[Scikit‑Learn (sklearn)](https://scikit-learn.org) is the **core library for machine learning in Python**.
It provides efficient tools for data preprocessing, model training, evaluation, and model selection, all through a consistent API.

Built on NumPy and SciPy, Scikit‑Learn integrates with the broader data science ecosystem, including pandas and Matplotlib, making it a go‑to choice for beginners and professionals alike.


🧠 1. Why Scikit‑Learn?

| Feature | Description |
| --- | --- |
| Consistent API | Uniform method naming (fit(), predict(), transform()) across models. |
| Wide Algorithm Support | Regression, classification, clustering, dimensionality reduction, and more. |
| Preprocessing Tools | Handle missing data, scaling, normalization, encoding, and feature selection. |
| Pipelines | Combine preprocessing and modeling steps into reproducible workflows. |
| Performance | Optimized using C‑backed NumPy/SciPy operations for speed. |
| Documentation | Exceptional tutorials and examples for all levels of expertise. |
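To make the "Consistent API" row concrete, here is a minimal sketch that runs three unrelated classifiers through the same fit() and predict() calls. The choice of estimators and the dictionary name training_accuracy are illustrative assumptions, not a recommended setup, and the scores are training accuracy only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Transformers learn their parameters with fit() and apply them with transform()
scaler = StandardScaler()
X_scaled = scaler.fit(X).transform(X)

# Very different classifiers are driven by the same fit()/predict() calls
estimators = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(random_state=0),
]

training_accuracy = {}
for estimator in estimators:
    estimator.fit(X_scaled, y)
    predictions = estimator.predict(X_scaled)
    # Fraction of training samples predicted correctly (not a test score)
    training_accuracy[type(estimator).__name__] = float((predictions == y).mean())

for name, acc in training_accuracy.items():
    print(f"{name}: training accuracy = {acc:.3f}")
```

Because every estimator exposes the same methods, swapping one model for another usually means changing a single line.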

🧩 2. The Scikit‑Learn Workflow

Most ML projects in Scikit‑Learn follow the same structure:

Data → Preprocessing → Model Training → Evaluation → Optimization → Deployment

Let’s see these phases in action using the famous Iris dataset.


🌸 3. Example — Classification with SVM (Iris Dataset)

We’ll train a Support Vector Machine (SVM) classifier to predict the species of an iris flower based on petal and sepal dimensions.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train an SVM classifier
model = SVC(kernel='linear', C=1.0)
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Visualize confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, cmap="Blues", xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix — Iris Classification")
plt.show()

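🔍 Explanation

load_iris() returns 150 samples with four features (sepal and petal length and width) and three species labels. train_test_split() holds out 20% of the samples (30 flowers) for testing, and random_state=42 makes the split reproducible. SVC(kernel='linear', C=1.0) fits a linear decision boundary, where C controls regularization strength (smaller values regularize more). accuracy_score() reports the fraction of correct test predictions, and the confusion matrix shows which species are mistaken for one another.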


🧰 4. Data Preprocessing with Scikit‑Learn

Before training a model, data must be prepared properly.
Scikit‑Learn provides many preprocessing tools:

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Example dataset
data = pd.DataFrame({
    'age': [25, 32, 47, 51, 62],
    'salary': [50000, 60000, 80000, 90000, 120000],
    'city': ['NY', 'LA', 'NY', 'SF', 'LA'],
    'purchased': [0, 1, 0, 1, 1]
})

# Features and target
X = data[['age', 'salary', 'city']]
y = data['purchased']

# Define column types
numeric_features = ['age', 'salary']
categorical_features = ['city']

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)  # ignore cities not seen during fit
    ])

# Full ML pipeline
pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', LogisticRegression())
])

pipeline.fit(X, y)
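# New data must be a DataFrame: the ColumnTransformer selects columns by name
new_data = pd.DataFrame({'age': [40], 'salary': [75000], 'city': ['SF']})
print("Prediction for new data:", pipeline.predict(new_data))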

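💡 Pipelines apply the same preprocessing at training and prediction time. When a pipeline is fit inside cross‑validation, its transformers learn only from the training folds, which helps avoid data leakage.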
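As a rough illustration of how a pipeline guards against leakage, the sketch below cross‑validates a scaler and an SVM as one unit on the Iris data. The variable name leak_free_model and the choice of five folds are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The scaler is a pipeline step, so each CV split refits it on the training folds only;
# the held-out fold never influences the scaling statistics
leak_free_model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
scores = cross_val_score(leak_free_model, X, y, cv=5)

print("Fold accuracies:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f}")
```

Scaling the full dataset before splitting would let test‑fold statistics leak into training; keeping the scaler inside the pipeline avoids that.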


📏 5. Model Evaluation

Scikit‑Learn includes many evaluation metrics beyond accuracy:

| Metric | Description | Function |
| --- | --- | --- |
| Accuracy | Fraction of correct predictions. | accuracy_score() |
| Precision / Recall / F1 | Evaluate imbalanced datasets. | classification_report() |
| ROC‑AUC | Quality of binary classification. | roc_auc_score() |
| MSE / RMSE | Regression error measures. | mean_squared_error() |
| R² | How well regression fits data. | r2_score() |
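The functions in the table can be called directly on arrays of true and predicted values. The sketch below uses small made‑up label and score lists (hypothetical values chosen only for illustration) and derives RMSE as the square root of MSE.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)

# Hypothetical binary-classification results (illustrative values only)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]  # predicted probability of class 1

accuracy = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)  # needs scores or probabilities, not hard labels
print("Accuracy:", accuracy)
print("ROC-AUC:", auc)
print(classification_report(y_true, y_pred))  # precision, recall, F1 per class

# Hypothetical regression results (illustrative values only)
y_reg_true = [3.0, 5.0, 7.5, 10.0]
y_reg_pred = [2.5, 5.5, 7.0, 11.0]

mse = mean_squared_error(y_reg_true, y_reg_pred)
rmse = float(np.sqrt(mse))
r2 = r2_score(y_reg_true, y_reg_pred)
print(f"MSE: {mse:.4f}  RMSE: {rmse:.4f}  R2: {r2:.4f}")
```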

🎯 6. Model Selection and Tuning

Scikit‑Learn provides tools for cross‑validation and hyperparameter tuning:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Reuses X_train and y_train from the Iris example in section 3

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

⚙️ GridSearchCV evaluates every combination in param_grid with cross‑validation and reports the best‑scoring one.
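Grid search also works on a whole pipeline, which keeps preprocessing inside each cross‑validation fold. The following is a minimal sketch under assumed step names ("scale" and "svc" are arbitrary labels chosen here); parameters of a step are addressed as step name, two underscores, then the parameter name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Mean CV accuracy of best setting: {search.best_score_:.3f}")
# With the default refit=True, the best setting is refit on all training data
print(f"Held-out test accuracy: {search.score(X_test, y_test):.3f}")
```

The held‑out test set is used only once, after the search, so the reported test accuracy is not biased by the tuning itself.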


🧮 7. Common Algorithms in Scikit‑Learn

| Category | Algorithm | Scikit‑Learn Class |
| --- | --- | --- |
| Regression | Linear Regression | LinearRegression |
| Classification | Logistic Regression, SVM, Decision Tree | LogisticRegression, SVC, DecisionTreeClassifier |
| Clustering | K‑Means, DBSCAN | KMeans, DBSCAN |
| Dimensionality Reduction | PCA, LDA | PCA, LinearDiscriminantAnalysis |
| Ensemble Methods | Random Forest, Gradient Boosting | RandomForestClassifier, GradientBoostingRegressor |
| Model Selection | Cross‑validation, Grid Search | cross_val_score, GridSearchCV |
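The unsupervised rows of the table follow the same API as the classifiers. As a brief sketch, the code below projects the Iris features onto two principal components and clusters the samples with K‑Means; the choice of two components and three clusters is an assumption made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)  # species labels are deliberately ignored
X_scaled = StandardScaler().fit_transform(X)

# Dimensionality reduction: project four features onto two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))

# Clustering: group the samples without using the species labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_scaled)
print("Cluster sizes:", [int((cluster_ids == k).sum()) for k in range(3)])
```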

💾 8. Saving and Loading Models

After training, you can save your model for later use:

import joblib

# Save model
joblib.dump(model, "svm_model.pkl")

# Load model
loaded_model = joblib.load("svm_model.pkl")
print("Loaded model prediction:", loaded_model.predict(X_test[:1]))

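🧠 Use joblib or pickle for model persistence. Load only files you trust, since unpickling can execute arbitrary code, and load with the same Scikit‑Learn version that saved the model.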
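Saving the whole pipeline, rather than only the final model, keeps the preprocessing and the classifier together. The sketch below is a self‑contained round trip through a temporary directory; the file name iris_pipeline.joblib is a hypothetical choice.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
fitted = make_pipeline(StandardScaler(), SVC(kernel="linear")).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "iris_pipeline.joblib"  # hypothetical file name
    joblib.dump(fitted, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should give the same predictions as the original
same_predictions = bool(np.array_equal(fitted.predict(X), restored.predict(X)))
print("Restored pipeline reproduces the predictions:", same_predictions)
```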


🧭 Conclusion

Scikit‑Learn is the cornerstone of applied machine learning in Python, covering data preparation, model training, evaluation, and tuning.
Its consistent interface, broad algorithm coverage, and integration with the scientific Python stack make it well suited to rapid prototyping and to the modeling core of production ML systems.

“Learn the workflow, not just the algorithm — Scikit‑Learn makes experimentation effortless.”