Chapter 16: Real‑World Projects — Creating a Basic Machine Learning Model
🤖 Introduction to Machine Learning Model Building
Machine Learning (ML) enables computers to learn from data and make predictions or decisions without explicit programming.
In this project, we’ll build a simple yet complete classification model using the classic Iris dataset, one of the most popular datasets in data science.
This chapter walks through every step of an ML pipeline — from data loading to model evaluation and saving your trained model for reuse.
🧩 1. Workflow Overview
- Load the dataset — bring your data into memory.
- Preprocess and split — separate features and target, then divide into training and test sets.
- Train the model — fit a classifier to learn from the data.
- Evaluate — check accuracy and visualize performance.
- Save and reuse — persist the trained model for future predictions.
🌸 2. Building a Basic Classifier with the Iris Dataset
We’ll use the K‑Nearest Neighbors (KNN) algorithm — a simple, intuitive model that classifies new data based on the majority class of its nearest neighbors.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
data = pd.read_csv('iris_dataset.csv')

# Separate features and target
X = data.drop('species', axis=1)
y = data['species']

# Split data into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the KNN model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"✅ Model Accuracy: {accuracy:.2f}\n")
print("📋 Classification Report:")
print(classification_report(y_test, y_pred))

# Visualize confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, cmap='Blues', fmt='d',
            xticklabels=model.classes_, yticklabels=model.classes_)
plt.title('Confusion Matrix — KNN Classifier')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
```
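To make the "majority class of its nearest neighbors" idea concrete, here is a minimal from-scratch sketch of how KNN classifies a single point. It assumes plain Euclidean distance and unweighted voting, and is meant for intuition only; scikit-learn's implementation is the one to use in practice.

```python
import numpy as np
from collections import Counter

def knn_predict_one(X_train, y_train, x_new, k=3):
    """Classify one sample by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Should usually agree with model.predict on the same point (ties may differ)
print(knn_predict_one(X_train.values, y_train.values, X_test.values[0]))
```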
🧠 3. Understanding the Evaluation Metrics
| Metric | Description | Use |
|---|---|---|
| Accuracy | Percentage of correct predictions | General performance |
| Precision | How many predicted positives are true | Avoid false positives |
| Recall | How many true positives were captured | Avoid missing positives |
| F1‑Score | Harmonic mean of precision & recall | Balanced metric |
In classification tasks, don’t rely on accuracy alone — always check the confusion matrix and F1‑score to understand your model’s real behavior.
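To see where those numbers come from, here is the arithmetic for a single class, using invented counts (8 true positives, 2 false positives, 1 false negative). `classification_report` performs the same computation for each class:

```python
# Hypothetical counts, purely for illustration
tp, fp, fn = 8, 2, 1

precision = tp / (tp + fp)                          # 8 / 10 = 0.80
recall = tp / (tp + fn)                             # 8 / 9  ≈ 0.89
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.84

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```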
⚙️ 4. Improving Model Performance
KNN’s performance depends heavily on its hyperparameter `n_neighbors` (the number of neighbors that vote on each prediction).
You can tune it by testing several values and keeping the one with the highest cross-validated accuracy.
```python
from sklearn.model_selection import cross_val_score
import numpy as np

neighbors = range(1, 11)
scores = []
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_score = cross_val_score(knn, X, y, cv=5).mean()
    scores.append(cv_score)

best_k = neighbors[np.argmax(scores)]
print(f"🔍 Best number of neighbors: {best_k}")
```
🧰 5. Comparing Models (Optional)
Try comparing multiple algorithms to see which performs best on your dataset.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

models = {
    "KNN": KNeighborsClassifier(n_neighbors=best_k),
    "Logistic Regression": LogisticRegression(max_iter=200),
    "Decision Tree": DecisionTreeClassifier(random_state=42)
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {acc:.3f}")
```
In many cases, simple models like Logistic Regression or Decision Trees can outperform KNN, especially with larger or noisier datasets.
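Keep in mind that an 80/20 split of the classic 150-row Iris data leaves only 30 test points, so these accuracies can shift with a different `random_state`. For a steadier comparison, cross-validate each model, as in this sketch:

```python
from sklearn.model_selection import cross_val_score

# Average accuracy over 5 folds gives a less noisy estimate than one split
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```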
💾 6. Saving and Loading the Model
In real‑world workflows, you’ll want to save trained models to reuse them later without retraining.
```python
import joblib

# Save the model
joblib.dump(model, 'iris_knn_model.joblib')
print("💾 Model saved as iris_knn_model.joblib")

# Load the model
loaded_model = joblib.load('iris_knn_model.joblib')
print("✅ Model reloaded successfully!")
```
📊 7. Visualizing Decision Boundaries (Bonus)
By training a helper model on just two features, you can plot how it separates the classes across the feature space.
```python
import numpy as np

# Use only the first two features for visualization
X_vis = X.iloc[:, :2].values
y_vis = y.values

knn_vis = KNeighborsClassifier(n_neighbors=3)
knn_vis.fit(X_vis, y_vis)

# Create a meshgrid covering the feature space
x_min, x_max = X_vis[:, 0].min() - 1, X_vis[:, 0].max() + 1
y_min, y_max = X_vis[:, 1].min() - 1, X_vis[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05),
                     np.arange(y_min, y_max, 0.05))

# Predict over the grid, then map class labels to integer codes so that
# contourf can plot them (classes_ is sorted, so searchsorted recovers indices)
Z = knn_vis.predict(np.c_[xx.ravel(), yy.ravel()])
Z = np.searchsorted(knn_vis.classes_, Z).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
sns.scatterplot(x=X_vis[:, 0], y=X_vis[:, 1], hue=y_vis, palette='Set2', s=60)
plt.title("Decision Boundary Visualization (KNN)")
plt.xlabel(X.columns[0])
plt.ylabel(X.columns[1])
plt.show()
```
🧭 8. Real‑World Use Cases
| Industry | Example | Goal |
|---|---|---|
| Healthcare | Classifying tumors as benign/malignant | Early diagnosis |
| Finance | Detecting fraudulent transactions | Risk mitigation |
| Retail | Predicting customer churn | Retention strategy |
| Agriculture | Classifying plant species | Crop monitoring |
🚀 9. Takeaways
- You’ve built a complete ML pipeline: load → train → evaluate → visualize → save.
- KNN is a simple yet effective algorithm for small datasets.
- Always analyze metrics beyond accuracy.
- Saving and reusing models is key for real‑world deployment.
- Experimentation and comparison are part of every ML workflow.
🧠 Conclusion
You now have a solid understanding of how to create, evaluate, and persist a machine learning model using Scikit‑Learn.
This is the foundation for exploring more advanced techniques — such as model tuning, pipelines, ensemble learning, and deep learning.
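As a first taste of pipelines, the sketch below chains feature scaling with KNN. It is worth trying on your own data because KNN is distance-based, so features measured on larger scales can otherwise dominate the vote:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# One object that scales features, then classifies; fit, predict, and save as usual
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
pipe.fit(X_train, y_train)
print(f"Pipeline accuracy on the test set: {pipe.score(X_test, y_test):.3f}")
```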
“Machine learning is not about magic — it’s about iteration, insight, and improvement.”