Real‑World Projects — Creating a Basic Machine Learning Model

Published: November 12, 2025 • Language: python • Chapter: 16 • Sub: 3 • Level: beginner

🤖 Introduction to Machine Learning Model Building

Machine Learning (ML) enables computers to learn from data and make predictions or decisions without explicit programming.
In this project, we’ll build a simple yet complete classification model using the classic Iris dataset, one of the most popular datasets in data science.

This chapter walks through every step of an ML pipeline — from data loading to model evaluation and saving your trained model for reuse.


🧩 1. Workflow Overview

  1. Load the dataset — bring your data into memory.
  2. Preprocess and split — separate features and target, then divide into training and test sets.
  3. Train the model — fit a classifier to learn from the data.
  4. Evaluate — check accuracy and visualize performance.
  5. Save and reuse — persist the trained model for future predictions.

🌸 2. Building a Basic Classifier with the Iris Dataset

We’ll use the K‑Nearest Neighbors (KNN) algorithm — a simple, intuitive model that classifies new data based on the majority class of its nearest neighbors.
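Before reaching for a library, it helps to see how small the core idea really is. Below is a minimal from-scratch sketch of KNN's majority vote (the function knn_predict and the choice of Euclidean distance are illustrative, not scikit-learn's internals):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote: the most common label among those k neighbors wins
    return Counter(y_train[nearest]).most_common(1)[0][0]

In practice we use scikit-learn, which implements this (with efficient neighbor search) for us. Here is the full pipeline: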

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset (assumes a local CSV with four feature columns plus a 'species'
# column; sklearn.datasets.load_iris(as_frame=True) is a built-in alternative)
data = pd.read_csv('iris_dataset.csv')

# Separate features and target
X = data.drop('species', axis=1)
y = data['species']

# Split data into training (80%) and testing (20%), stratified so each
# species appears in the same proportion in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Initialize and train the KNN model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"✅ Model Accuracy: {accuracy:.2f}\n")

print("📋 Classification Report:")
print(classification_report(y_test, y_pred))

# Visualize confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, cmap='Blues', fmt='d',
            xticklabels=model.classes_, yticklabels=model.classes_)
plt.title('Confusion Matrix — KNN Classifier')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

🧠 3. Understanding the Evaluation Metrics

  Metric     Description                               Use
  Accuracy   Percentage of correct predictions         General performance
  Precision  How many predicted positives are true     Avoiding false positives
  Recall     How many actual positives were captured   Avoiding missed positives
  F1‑Score   Harmonic mean of precision and recall     Balanced metric

In classification tasks, don’t rely on accuracy alone — always check the confusion matrix and F1‑score to understand your model’s real behavior.
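To make the relationships concrete, here's a tiny sketch computing precision, recall, and F1 by hand for a single class (the counts are made-up numbers, purely for illustration):

# Made-up counts for one class: true positives, false positives, false negatives
tp, fp, fn = 18, 2, 3

precision = tp / (tp + fp)                                  # 18/20 = 0.90
recall    = tp / (tp + fn)                                  # 18/21 ≈ 0.857
f1        = 2 * precision * recall / (precision + recall)   # ≈ 0.878
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")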


⚙️ 4. Improving Model Performance

KNN’s performance depends heavily on its hyperparameter n_neighbors (number of nearest points used for voting).
You can tune this parameter by testing multiple values and selecting the one with the highest validation accuracy.

from sklearn.model_selection import cross_val_score
import numpy as np

neighbors = range(1, 11)
scores = []

for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_score = cross_val_score(knn, X, y, cv=5).mean()
    scores.append(cv_score)

best_k = neighbors[np.argmax(scores)]
print(f"🔍 Best number of neighbors: {best_k}")

🧰 5. Comparing Models (Optional)

Try comparing multiple algorithms to see which performs best on your dataset.

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

models = {
    "KNN": KNeighborsClassifier(n_neighbors=best_k),
    "Logistic Regression": LogisticRegression(max_iter=200),
    "Decision Tree": DecisionTreeClassifier(random_state=42)
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {acc:.3f}")

In many cases, simple models like Logistic Regression or Decision Trees can outperform KNN, especially with larger or noisier datasets.
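Keep in mind that a single train/test split can flatter or penalize a model by chance. To ground such comparisons, you can cross-validate each candidate instead — a short sketch reusing the models dict from above:

# Cross-validated comparison: averages over 5 splits instead of relying on one
for name, clf in models.items():
    cv_acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: CV accuracy = {cv_acc:.3f}")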


💾 6. Saving and Loading the Model

In real‑world workflows, you’ll want to save trained models to reuse them later without retraining.

import joblib

# Save the model
joblib.dump(model, 'iris_knn_model.joblib')
print("💾 Model saved as iris_knn_model.joblib")

# Load the model
loaded_model = joblib.load('iris_knn_model.joblib')
print("✅ Model reloaded successfully!")

📊 7. Visualizing Decision Boundaries (Bonus)

If you’re working with only two features, you can visualize how your model separates classes.

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Use only the first two features, and encode the species names as integers
# so plt.contourf can colour the regions numerically
X_vis = X.iloc[:, :2].values
le = LabelEncoder()
y_vis = le.fit_transform(y)

knn_vis = KNeighborsClassifier(n_neighbors=3)
knn_vis.fit(X_vis, y_vis)

# Create a mesh grid covering the two-feature space
x_min, x_max = X_vis[:, 0].min() - 1, X_vis[:, 0].max() + 1
y_min, y_max = X_vis[:, 1].min() - 1, X_vis[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05),
                     np.arange(y_min, y_max, 0.05))

# Classify every grid point and shade the resulting regions
Z = knn_vis.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
sns.scatterplot(x=X_vis[:, 0], y=X_vis[:, 1], hue=y.values, palette='Set2', s=60)
plt.title("Decision Boundary Visualization (KNN)")
plt.xlabel(X.columns[0])
plt.ylabel(X.columns[1])
plt.show()

🧭 8. Real‑World Use Cases

  Industry     Example                                  Goal
  Healthcare   Classifying tumors as benign/malignant   Early diagnosis
  Finance      Detecting fraudulent transactions        Risk mitigation
  Retail       Predicting customer churn                Retention strategy
  Agriculture  Classifying plant species                Crop monitoring

🚀 9. Takeaways

  • You’ve built a complete ML pipeline: load → train → evaluate → visualize → save.
  • KNN is a simple yet effective algorithm for small datasets.
  • Always analyze metrics beyond accuracy.
  • Saving and reusing models is key for real‑world deployment.
  • Experimentation and comparison are part of every ML workflow.

🧠 Conclusion

You now have a solid understanding of how to create, evaluate, and persist a machine learning model using scikit-learn.
This is the foundation for exploring more advanced techniques — such as model tuning, pipelines, ensemble learning, and deep learning.
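As a small taste of what's next, scikit-learn's Pipeline lets you chain preprocessing and a model into one object — a minimal sketch adding feature scaling (often helpful for distance-based models like KNN) in front of the classifier:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# One object that scales features, then classifies — fit and evaluate in one go
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_train, y_train)
print(f"Pipeline accuracy: {pipe.score(X_test, y_test):.2f}")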

“Machine learning is not about magic — it’s about iteration, insight, and improvement.”