Chapter 14: Introduction to Machine Learning — Linear Regression Example

📈 Linear Regression: Predicting Continuous Values

Linear Regression is one of the most fundamental algorithms in machine learning.
It models the relationship between input variables (features) and a continuous output (target) using a linear equation.

It’s the foundation for many advanced algorithms and serves as the perfect starting point for understanding supervised learning.


🧩 1. Concept Overview

The Linear Regression Equation

y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε

Where:

- y: the predicted target value
- β₀: the intercept (the predicted value when all features are zero)
- β₁ … βₙ: the coefficients learned for each feature
- x₁ … xₙ: the input feature values
- ε: the error term (the noise the model cannot explain)
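
In code, this equation is just a dot product plus an intercept. Here is a minimal NumPy sketch; the values of X, beta0, and beta below are purely illustrative, not taken from any real dataset:

import numpy as np

# Purely illustrative values: 3 samples, 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
beta0 = 0.5                    # intercept β₀
beta = np.array([0.8, -0.3])   # coefficients β₁, β₂

# ŷ = β₀ + Xβ; the ε term is whatever the linear model cannot capture
y_hat = beta0 + X @ beta
print(y_hat)  # [0.7 1.7 2.7]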

🏠 2. Example — Predicting California Housing Prices

We’ll use Scikit‑Learn’s California Housing dataset, a modern replacement for the deprecated Boston dataset.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="MedianHouseValue")

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Evaluate performance
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse:.3f}")
print(f"Mean Absolute Error (MAE): {mae:.3f}")
print(f"R² Score: {r2:.3f}")

📊 3. Visualizing Predictions

Predicted vs Actual Values

plt.figure(figsize=(6, 6))
sns.scatterplot(x=y_test, y=y_pred, alpha=0.6, color='royalblue')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Predicted vs Actual House Prices")
plt.grid(True, linestyle='--', alpha=0.4)
plt.show()

The closer the points are to the red diagonal line, the better the model’s predictions.

Residual Plot (Error Distribution)

residuals = y_test - y_pred
sns.histplot(residuals, bins=30, kde=True, color="orange")
plt.title("Residual Distribution")
plt.xlabel("Prediction Error (Residual)")
plt.show()

Residuals centered around zero indicate a well‑fitted model.
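
One additional diagnostic worth knowing (not part of the script above, but it reuses the y_pred and residuals variables from it) is a residuals‑vs‑predicted scatter plot; a visible pattern such as a funnel or a curve suggests the linear fit is missing structure:

plt.figure(figsize=(6, 4))
plt.scatter(y_pred, residuals, alpha=0.4, color="seagreen")
plt.axhline(0, color="red", linestyle="--")  # perfect predictions sit on this line
plt.xlabel("Predicted Values")
plt.ylabel("Residual (Actual − Predicted)")
plt.title("Residuals vs Predicted Values")
plt.grid(True, linestyle="--", alpha=0.4)
plt.show()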


🧠 4. Understanding Model Coefficients

Each coefficient shows how much the target changes for a unit increase in the corresponding feature.

coef_df = pd.DataFrame({
    "Feature": X.columns,
    "Coefficient": model.coef_
}).sort_values(by="Coefficient", ascending=False)

print(coef_df)

# Optional: visualize
plt.figure(figsize=(8, 5))
sns.barplot(data=coef_df, x="Coefficient", y="Feature", palette="viridis")
plt.title("Feature Importance (Linear Coefficients)")
plt.show()

Positive coefficients increase the predicted price, negative ones decrease it.
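
To verify this interpretation, you can rebuild a single prediction by hand from the intercept and coefficients; the result should match model.predict up to floating‑point rounding:

# Decompose one prediction: ŷ = β₀ + Σ βᵢ·xᵢ
sample = X_test.iloc[0]
manual_pred = model.intercept_ + sample @ model.coef_
print(f"Manual reconstruction: {manual_pred:.4f}")
print(f"model.predict:         {model.predict(X_test.iloc[[0]])[0]:.4f}")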


📏 5. Regression Metrics Summary

| Metric | Description | Lower = Better? | Function |
|---|---|---|---|
| MSE (Mean Squared Error) | Average squared difference between predicted and actual values. | Yes | mean_squared_error() |
| RMSE (Root MSE) | Square root of MSE (in the same units as the target). | Yes | sqrt(mse) |
| MAE (Mean Absolute Error) | Average absolute difference. | Yes | mean_absolute_error() |
| R² (Coefficient of Determination) | Proportion of variance explained by the model (1 = perfect). | No (higher = better) | r2_score() |
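
RMSE is not computed in the script above; one way to add it is a small sketch using NumPy on the mse variable already computed:

import numpy as np

# RMSE is in the same units as the target (here, units of $100,000)
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse:.3f}")
# Newer scikit-learn versions (1.4+) also provide
# sklearn.metrics.root_mean_squared_error for this.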

Example output:

Mean Squared Error (MSE): 0.54
Mean Absolute Error (MAE): 0.48
R² Score: 0.74

Interpretation:

- The target is measured in units of $100,000, so an MAE of 0.48 means predictions are off by roughly $48,000 on average.
- An R² of 0.74 means the model explains about 74% of the variance in median house values.
- The remaining unexplained variance motivates the improvements discussed in the next section.


🔍 6. Common Pitfalls and Improvements

| Issue | Description | Fix |
|---|---|---|
| Outliers | Extreme data points distort the regression line. | Use robust scalers or remove outliers. |
| Non‑linear relationships | A linear model can't capture curves. | Use polynomial regression or tree‑based models. |
| Feature scaling | Unscaled features affect coefficient magnitude. | Apply StandardScaler (see the sketch below). |
| Multicollinearity | Highly correlated features distort coefficients. | Use PCA or remove redundant features. |
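
As a sketch of the feature‑scaling fix from the table: with plain Linear Regression, scaling does not change the predictions themselves (ordinary least squares is scale‑invariant), but it puts all coefficients on a comparable per‑standard‑deviation footing. This reuses X, X_train, and y_train from the earlier script:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale features, then fit the linear model, as one pipeline
scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scaled_model.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
lin = scaled_model.named_steps["linearregression"]
print(pd.Series(lin.coef_, index=X.columns).sort_values())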

🚀 7. Takeaways

- Linear Regression models a continuous target as a weighted sum of features plus an intercept.
- Split data into training and test sets before evaluating; never score the model on the data it was trained on.
- MSE, RMSE, MAE, and R² each capture a different aspect of error; report more than one.
- Predicted‑vs‑actual and residual plots reveal problems that a single metric can hide.
- Coefficients make the model interpretable, but their magnitudes depend on feature scaling.

🧭 Conclusion

Linear Regression is not just a basic algorithm — it’s the foundation of predictive modeling.
By applying it to real data (like the California housing dataset), you learn how to train, evaluate, visualize, and interpret ML models end‑to‑end.

“All models are wrong, but some are useful.” — George Box