Chapter 14: Introduction to Machine Learning — Scikit‑Learn Library
⚙️ Scikit‑Learn: The Engine of Practical Machine Learning
[Scikit‑Learn (sklearn)](https://scikit-learn.org) is the most widely used general‑purpose library for classical machine learning in Python.
It provides efficient tools for data preprocessing, model training, evaluation, and model selection, all through a consistent API.
Built on top of NumPy and SciPy, with optional Matplotlib‑based plotting helpers, Scikit‑Learn integrates well with the broader data science ecosystem, which makes it a common choice for beginners and professionals alike.
🧠 1. Why Scikit‑Learn?
| Feature | Description |
|---|---|
| Consistent API | Uniform method naming (fit(), predict(), transform()) across models. |
| Wide Algorithm Support | Regression, classification, clustering, dimensionality reduction, and more. |
| Preprocessing Tools | Handle missing data, scaling, normalization, encoding, and feature selection. |
| Pipelines | Combine preprocessing and modeling steps into reproducible workflows. |
| Performance | Performance‑critical routines are implemented in Cython and compiled NumPy/SciPy operations. |
| Documentation | Exceptional tutorials and examples for all levels of expertise. |
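To make the "Consistent API" row concrete, here is a minimal sketch that trains three unrelated classifiers through the same `fit()`, `predict()`, and `score()` calls. The synthetic dataset and the particular choice of models are illustrative assumptions, not something the chapter prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Three different algorithms behind one shared interface
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, estimator in models.items():
    estimator.fit(X_train, y_train)          # same training call for every model
    predictions = estimator.predict(X_test)  # same prediction call
    scores[name] = estimator.score(X_test, y_test)  # mean accuracy for classifiers
    print(f"{name}: accuracy = {scores[name]:.2f}")
```

Because every estimator exposes the same methods, swapping one algorithm for another usually means changing a single line.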
🧩 2. The Scikit‑Learn Workflow
Most ML projects built with Scikit‑Learn follow the same structure:
Data → Preprocessing → Model Training → Evaluation → Optimization → Deployment
Let’s see these phases in action using the famous Iris dataset.
🌸 3. Example — Classification with SVM (Iris Dataset)
We’ll train a Support Vector Machine (SVM) classifier to predict the species of an iris flower based on petal and sepal dimensions.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train an SVM classifier
model = SVC(kernel='linear', C=1.0)
model.fit(X_train, y_train)
# Predict on test data
y_pred = model.predict(X_test)
# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
# Visualize confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, cmap="Blues", xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix — Iris Classification")
plt.show()
🔍 Explanation
- `fit()` trains the model.
- `predict()` makes predictions on unseen data.
- `accuracy_score()` measures how many predictions are correct.
- `confusion_matrix()` shows per‑class performance visually.
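As a small worked example of how these two metrics relate, the sketch below derives overall accuracy and per‑class recall directly from a confusion matrix. The toy label arrays are made up for illustration and are not Iris results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical true and predicted labels for a 3-class problem
toy_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
toy_pred = [0, 0, 1, 1, 2, 2, 2, 2, 1]

# Rows are true classes, columns are predicted classes
toy_cm = confusion_matrix(toy_true, toy_pred)
print(toy_cm)

# Accuracy: correct predictions (the diagonal) over all predictions
toy_accuracy = np.trace(toy_cm) / toy_cm.sum()

# Per-class recall: diagonal entry over the row total (all samples of that class)
per_class_recall = toy_cm.diagonal() / toy_cm.sum(axis=1)

print(f"Accuracy: {toy_accuracy:.3f}")
print("Recall per class:", per_class_recall.round(3))

# Cross-check against the library's own metric functions
assert np.isclose(toy_accuracy, accuracy_score(toy_true, toy_pred))
assert np.allclose(per_class_recall, recall_score(toy_true, toy_pred, average=None))
```

Reading the matrix this way shows which classes are confused with each other, something a single accuracy number hides.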
🧰 4. Data Preprocessing with Scikit‑Learn
Before training a model, data must be prepared properly.
Scikit‑Learn provides many preprocessing tools:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import pandas as pd
# Example dataset
data = pd.DataFrame({
    'age': [25, 32, 47, 51, 62],
    'salary': [50000, 60000, 80000, 90000, 120000],
    'city': ['NY', 'LA', 'NY', 'SF', 'LA'],
    'purchased': [0, 1, 0, 1, 1]
})
# Features and target
X = data[['age', 'salary', 'city']]
y = data['purchased']
# Define column types
numeric_features = ['age', 'salary']
categorical_features = ['city']
# Preprocessing pipeline
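preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        # handle_unknown='ignore' encodes unseen cities as all zeros instead of raising
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])
# Full ML pipeline
pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', LogisticRegression())
])
pipeline.fit(X, y)
# Columns are selected by name, so new data must also be a DataFrame
new_data = pd.DataFrame({'age': [40], 'salary': [75000], 'city': ['SF']})
print("Prediction for new data:", pipeline.predict(new_data))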
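💡 Pipelines apply the same fitted preprocessing at training and prediction time. When the whole pipeline is cross‑validated, the preprocessing steps are re‑fit on each training fold, which helps avoid data leakage.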
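The leakage point is easiest to see under cross‑validation. The following is a minimal sketch, assuming the bundled breast cancer dataset and a scaler plus logistic regression as stand‑ins; neither appears in the chapter's own example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cancer, y_cancer = load_breast_cancer(return_X_y=True)

# Scaler and classifier travel together, so in every CV split the scaler
# is fit on the training fold only and never sees the validation fold
leak_free_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

fold_scores = cross_val_score(leak_free_model, X_cancer, y_cancer, cv=5)
print("Accuracy per fold:", fold_scores.round(3))
print("Mean accuracy:", round(fold_scores.mean(), 3))
```

Scaling the full dataset first and cross‑validating afterwards would let statistics from the validation folds influence training, which is the leakage the pipeline prevents.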
📏 5. Model Evaluation
Scikit‑Learn includes many evaluation metrics beyond accuracy:
| Metric | Description | Function |
|---|---|---|
| Accuracy | Fraction of correct predictions. | accuracy_score() |
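| Precision / Recall / F1 | Per‑class metrics that are more informative than accuracy on imbalanced datasets. | classification_report() |
| ROC‑AUC | How well predicted scores rank positives above negatives in binary classification. | roc_auc_score() |
| MSE / RMSE | Regression error measures; RMSE is the square root of MSE. | mean_squared_error(), root_mean_squared_error() (scikit-learn 1.4+) |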
| R² | How well regression fits data. | r2_score() |
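The functions in the table can be tried on small hand‑made arrays. This sketch uses hypothetical labels, probabilities, and regression targets chosen only so the numbers are easy to check by hand.

```python
import numpy as np
from sklearn.metrics import (
    classification_report,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)

# --- Classification: hypothetical labels and predicted probabilities ---
true_labels = np.array([0, 0, 1, 1, 1, 0, 1, 0])
predicted_proba = np.array([0.1, 0.4, 0.8, 0.65, 0.3, 0.2, 0.9, 0.55])
predicted_labels = (predicted_proba >= 0.5).astype(int)  # threshold at 0.5

print(classification_report(true_labels, predicted_labels))

# ROC-AUC is computed from the scores, not from the thresholded labels
auc = roc_auc_score(true_labels, predicted_proba)
print(f"ROC-AUC: {auc:.3f}")

# --- Regression: hypothetical targets and predictions ---
true_values = np.array([3.0, 5.0, 2.5, 7.0])
predicted_values = np.array([2.5, 5.0, 3.0, 8.0])

mse = mean_squared_error(true_values, predicted_values)
rmse = np.sqrt(mse)  # computed manually so the sketch works on any version
r2 = r2_score(true_values, predicted_values)
print(f"MSE: {mse:.3f}  RMSE: {rmse:.3f}  R^2: {r2:.3f}")
```

Note that `roc_auc_score()` expects probabilities or decision scores; passing hard 0/1 predictions discards the ranking information it measures.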
🎯 6. Model Selection and Tuning
Scikit‑Learn provides tools for cross‑validation and hyperparameter tuning:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)
⚙️ `GridSearchCV` automates finding the best hyperparameters through cross‑validation.
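Grid search also works on a whole pipeline, which keeps preprocessing inside each cross‑validation fold. The sketch below is one possible setup, assuming a scaler in front of the SVM; the step names `scale` and `svm` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

svm_pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("svm", SVC()),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
pipeline_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
}

pipeline_search = GridSearchCV(svm_pipeline, pipeline_grid, cv=5)
pipeline_search.fit(X_fit, y_fit)

print("Best parameters:", pipeline_search.best_params_)
print(f"Cross-validated accuracy: {pipeline_search.best_score_:.3f}")

# With the default refit=True, the best configuration is retrained on all of X_fit
holdout_accuracy = pipeline_search.score(X_holdout, y_holdout)
print(f"Held-out accuracy: {holdout_accuracy:.3f}")
```

The grid here has 3 × 2 = 6 combinations, each evaluated with 5 folds, so the cost grows quickly as more parameters are added.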
🧮 7. Common Algorithms in Scikit‑Learn
| Category | Algorithm | Scikit‑Learn Class |
|---|---|---|
| Regression | Linear Regression | LinearRegression |
| Classification | Logistic Regression, SVM, Decision Tree | LogisticRegression, SVC, DecisionTreeClassifier |
| Clustering | K‑Means, DBSCAN | KMeans, DBSCAN |
| Dimensionality Reduction | PCA, LDA | PCA, LinearDiscriminantAnalysis |
| Ensemble Methods | Random Forest, Gradient Boosting | RandomForestClassifier, GradientBoostingClassifier (regressor variants also exist) |
| Model Selection | Cross‑validation, Grid Search | cross_val_score, GridSearchCV |
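The earlier examples are all supervised, so here is a short sketch of two unsupervised classes from the table, `PCA` and `KMeans`, applied to the Iris measurements. Choosing two components and three clusters is an assumption made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Labels are discarded: this is an unsupervised setting
features, _ = load_iris(return_X_y=True)

# Standardize so each measurement contributes equally to the projection
scaled = StandardScaler().fit_transform(features)

# Project the 4 measurements onto 2 principal components
pca = PCA(n_components=2)
projected = pca.fit_transform(scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))

# Group the projected points into 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(projected)
print("Cluster sizes:", [int((cluster_ids == k).sum()) for k in range(3)])
```

The same `fit`/`transform`/`predict` vocabulary applies here, with `fit_transform()` and `fit_predict()` as shortcuts that fit and apply in one call.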
💾 8. Saving and Loading Models
After training, you can save your model for later use:
import joblib
# Save model
joblib.dump(model, "svm_model.pkl")
# Load model
loaded_model = joblib.load("svm_model.pkl")
print("Loaded model prediction:", loaded_model.predict(X_test[:1]))
🧠 Use `joblib` or `pickle` for persistence and model deployment.
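When preprocessing is part of the model, it is usually worth persisting the whole pipeline rather than the bare estimator. The following round‑trip sketch assumes a scaler plus linear SVM and a hypothetical file name, and writes to a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
fitted_pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear"))
fitted_pipeline.fit(X_iris, y_iris)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "iris_pipeline.joblib"  # hypothetical file name
    joblib.dump(fitted_pipeline, model_path)
    # Unpickling can execute arbitrary code, so only load files you trust
    restored_pipeline = joblib.load(model_path)

# The restored pipeline carries the fitted scaler and classifier together
same_predictions = np.array_equal(
    fitted_pipeline.predict(X_iris), restored_pipeline.predict(X_iris)
)
print("Predictions identical after reload:", same_predictions)
```

Saved models are generally tied to the library version that produced them, so it helps to record the scikit-learn version alongside the file.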
🧭 Conclusion
Scikit‑Learn is the cornerstone of applied machine learning in Python — from data preparation to evaluation and tuning.
Its consistent interface, extensive algorithms, and seamless integration make it ideal for both rapid prototyping and production‑grade ML systems.
“Learn the workflow, not just the algorithm — Scikit‑Learn makes experimentation effortless.”