Introduction to Machine Learning — Scikit‑Learn Library

Published: November 12, 2025 • Language: python • Chapter: 14 • Sub: 2 • Level: beginner

⚙️ Scikit‑Learn: The Engine of Practical Machine Learning

Scikit‑Learn (imported as sklearn) is one of the most widely used libraries for classical machine learning in Python.
It provides efficient tools for data preprocessing, model training, evaluation, and saving trained models for deployment, all through a consistent and intuitive API.

Built on top of NumPy, SciPy, and Matplotlib, Scikit‑Learn integrates seamlessly with the broader data science ecosystem, making it the go‑to choice for beginners and professionals alike.


🧠 1. Why Scikit‑Learn?

| Feature | Description |
| --- | --- |
| Consistent API | Uniform method naming (fit(), predict(), transform()) across models. |
| Wide Algorithm Support | Regression, classification, clustering, dimensionality reduction, and more. |
| Preprocessing Tools | Handle missing data, scaling, normalization, encoding, and feature selection. |
| Pipelines | Combine preprocessing and modeling steps into reproducible workflows. |
| Performance | Optimized using C‑backed NumPy/SciPy operations for speed. |
| Documentation | Exceptional tutorials and examples for all levels of expertise. |
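
As a minimal sketch of the consistent API described in the table above, the loop below drives three unrelated classifiers through the same fit(), predict(), and score() calls. The choice of estimators and the use of the Iris data are assumptions made for illustration. The scores are computed on the training data, so they show the calling convention rather than real-world performance.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Three unrelated algorithms, one calling convention.
estimators = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(random_state=0),
]

train_scores = {}
for est in estimators:
    name = type(est).__name__
    est.fit(X, y)                         # learn from the data
    first_preds = est.predict(X[:3])      # same method on every estimator
    train_scores[name] = est.score(X, y)  # mean accuracy on the data passed in
    print(name, first_preds, round(train_scores[name], 3))
```

Swapping one estimator for another usually means changing a single line, which is what makes quick model comparisons practical.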

🧩 2. The Scikit‑Learn Workflow

Most ML projects built with Scikit‑Learn follow the same structure:

Data → Preprocessing → Model Training → Evaluation → Optimization → Deployment

Let’s see these phases in action using the famous Iris dataset.


🌸 3. Example — Classification with SVM (Iris Dataset)

We'll train a Support Vector Machine (SVM) classifier to predict the species of an iris flower based on petal and sepal dimensions.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train an SVM classifier
model = SVC(kernel='linear', C=1.0)
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Visualize confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, cmap="Blues", xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix — Iris Classification")
plt.show()

🔍 Explanation

  • fit() trains the model.
  • predict() makes predictions on unseen data.
  • accuracy_score() returns the fraction of predictions that are correct.
  • confusion_matrix() returns a table of counts (true class by row, predicted class by column); the heatmap displays it visually.
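
To make the link between the two metrics concrete, here is a small sketch using hand-made labels (illustrative values, not taken from the Iris run above). It prints the raw confusion matrix and recomputes accuracy from its diagonal.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels for three classes (0, 1, 2); illustrative only.
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 2, 2, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Correct predictions sit on the diagonal, so accuracy = trace / total.
manual_accuracy = np.trace(cm) / cm.sum()
library_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)
```

The off-diagonal cells show which classes get confused with each other, which a single accuracy number hides.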

🧰 4. Data Preprocessing with Scikit‑Learn

Before training a model, data must be prepared properly.
Scikit‑Learn provides many preprocessing tools:

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Example dataset
data = pd.DataFrame({
    'age': [25, 32, 47, 51, 62],
    'salary': [50000, 60000, 80000, 90000, 120000],
    'city': ['NY', 'LA', 'NY', 'SF', 'LA'],
    'purchased': [0, 1, 0, 1, 1]
})

# Features and target
X = data[['age', 'salary', 'city']]
y = data['purchased']

# Define column types
numeric_features = ['age', 'salary']
categorical_features = ['city']

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])

# Full ML pipeline
pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', LogisticRegression())
])

pipeline.fit(X, y)
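# New samples must be a DataFrame with the same column names used during fit,
# because the ColumnTransformer selects columns by name.
new_data = pd.DataFrame([[40, 75000, 'SF']], columns=['age', 'salary', 'city'])
print("Prediction for new data:", pipeline.predict(new_data))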

💡 Pipelines apply the same preprocessing at training and prediction time. When a pipeline is cross‑validated, its transformers are fitted on the training folds only, which helps avoid data leakage.
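
The leakage point can be sketched with cross-validation. The example below is an assumption-laden illustration (Iris data, a StandardScaler plus LogisticRegression pipeline built with make_pipeline), not the only way to set this up. Because the scaler sits inside the pipeline, each split refits it on that split's training portion only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler is part of the estimator, so it never sees the validation
# fold while its mean and variance are being computed.
leak_free = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(leak_free, X, y, cv=5)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", round(float(scores.mean()), 3))
```

Scaling the full dataset first and cross-validating afterwards would let statistics from the validation folds influence training, which is the leakage the pipeline prevents.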


📏 5. Model Evaluation

Scikit‑Learn includes many evaluation metrics beyond accuracy:

| Metric | Description | Function |
| --- | --- | --- |
| Accuracy | Fraction of correct predictions. | accuracy_score() |
| Precision / Recall / F1 | Per‑class metrics, useful for imbalanced datasets. | classification_report() |
| ROC‑AUC | Ranking quality of a binary classifier's scores. | roc_auc_score() |
| MSE / RMSE | Regression error measures (RMSE is the square root of MSE). | mean_squared_error() |
| R² | How well the regression fits the data. | r2_score() |
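
A short sketch of the functions in the table, run on small made-up arrays so the numbers are easy to check by hand. All values below are illustrative assumptions, not results from the models in this chapter.

```python
import numpy as np
from sklearn.metrics import (
    classification_report,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)

# Classification: made-up binary labels, hard predictions, and scores.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probability of class 1

print(classification_report(y_true, y_pred))  # precision, recall, F1 per class
auc = roc_auc_score(y_true, y_score)          # needs scores, not hard labels
print("ROC-AUC:", auc)

# Regression: made-up targets and predictions.
y_reg_true = [3.0, -0.5, 2.0, 7.0]
y_reg_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_reg_true, y_reg_pred)
rmse = np.sqrt(mse)  # RMSE is in the same units as the target
r2 = r2_score(y_reg_true, y_reg_pred)
print("MSE:", mse, "RMSE:", round(float(rmse), 3), "R2:", round(r2, 3))
```

Note that roc_auc_score() expects continuous scores or probabilities, while accuracy and the classification report work on predicted labels.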

🎯 6. Model Selection and Tuning

Scikit‑Learn provides tools for cross‑validation and hyperparameter tuning:

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

⚙️ GridSearchCV tries every combination in param_grid, scores each one with cross‑validation, and by default refits the best combination on the full training set.
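
The cross-validated best score is measured on data the search itself used to pick the winner, so it tends to be slightly optimistic. A self-contained sketch of the usual follow-up, checking the refitted model on a held-out test split, might look like this. The stratified split and the variable names are assumptions added for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)  # cross-validation uses the training split only

# With the default refit=True, the best configuration is retrained on all
# of X_train and exposed as best_estimator_; search.score() delegates to it.
test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("CV accuracy:", round(search.best_score_, 3))
print("Held-out accuracy:", round(test_accuracy, 3))
```

The grid here has 3 × 2 = 6 candidates, each fitted 5 times, so the cost of a grid search grows quickly as parameters are added.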


🧮 7. Common Algorithms in Scikit‑Learn

| Category | Algorithm | Scikit‑Learn Class |
| --- | --- | --- |
| Regression | Linear Regression | LinearRegression |
| Classification | Logistic Regression, SVM, Decision Tree | LogisticRegression, SVC, DecisionTreeClassifier |
| Clustering | K‑Means, DBSCAN | KMeans, DBSCAN |
| Dimensionality Reduction | PCA, LDA | PCA, LinearDiscriminantAnalysis |
| Ensemble Methods | Random Forest, Gradient Boosting | RandomForestClassifier, GradientBoostingRegressor |
| Model Selection | Cross‑validation, Grid Search | cross_val_score, GridSearchCV |
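
The earlier examples all use supervised classifiers, so here is a brief sketch of two unsupervised rows from the table: PCA for dimensionality reduction followed by K‑Means clustering. Using the Iris features, two components, and three clusters are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # labels are ignored: this is unsupervised

# Dimensionality reduction: transformers expose fit_transform().
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
explained = float(pca.explained_variance_ratio_.sum())
print("Reduced shape:", X_2d.shape)
print("Variance explained:", round(explained, 3))

# Clustering: fit_predict() returns one cluster index per sample.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_2d)
sizes = sorted(int((cluster_ids == k).sum()) for k in range(3))
print("Cluster sizes:", sizes)
```

Cluster indices are arbitrary labels, so they will not necessarily line up with the species numbering in iris.target.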

💾 8. Saving and Loading Models

After training, you can save your model for later use:

import joblib

# Save model
joblib.dump(model, "svm_model.pkl")

# Load model
loaded_model = joblib.load("svm_model.pkl")
print("Loaded model prediction:", loaded_model.predict(X_test[:1]))

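🧠 Use joblib or pickle to persist a model for deployment. Load only files you trust, since unpickling can execute arbitrary code, and reload with the same Scikit‑Learn version that saved the model.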
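
When preprocessing is involved, it is usually safer to persist the whole pipeline rather than the bare model, so the fitted scaler travels with the classifier. The sketch below assumes an Iris pipeline and a hypothetical file name, and writes to a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear")).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "iris_pipeline.joblib"  # hypothetical file name
    joblib.dump(pipe, path)      # serialize scaler and classifier together
    restored = joblib.load(path)

# The restored object carries the fitted scaler and classifier.
same = np.array_equal(pipe.predict(X), restored.predict(X))
print("Predictions match after reload:", same)
```

Saving only the classifier would require reproducing the exact scaling separately at prediction time, which is easy to get wrong.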


🧭 Conclusion

Scikit‑Learn is the cornerstone of applied machine learning in Python — from data preparation to evaluation and tuning.
Its consistent interface, extensive algorithms, and seamless integration make it ideal for both rapid prototyping and production‑grade ML systems.

“Learn the workflow, not just the algorithm — Scikit‑Learn makes experimentation effortless.”