Chapter 9: Data Analysis with Pandas
Sub-Chapter: Data Cleaning and Transformation – Preparing Data for Analysis
Data cleaning and transformation are the **foundation of data analysis**.
Before you can visualize or model data, you must ensure it's accurate, consistent, and properly formatted. Pandas provides powerful tools to detect, clean, and transform messy datasets efficiently.
🧩 1. Why Data Cleaning Matters
Real-world datasets are rarely perfect; they often contain:
- Missing values (`NaN`)
- Duplicates
- Inconsistent types (e.g., numbers stored as strings)
- Formatting issues (extra spaces, inconsistent casing)
- Outliers or invalid entries
Cleaning is part of the ETL process (Extract → Transform → Load), preparing data for accurate analysis and decision-making.
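As a minimal sketch of what "messy" looks like in practice, here is a small hypothetical DataFrame (names and values invented for illustration) that exhibits several of the problems above:

```python
import pandas as pd

# Hypothetical messy data: a missing name, a duplicate row,
# numbers stored as strings, stray whitespace, inconsistent casing
raw = pd.DataFrame({
    'Name': [' alice ', 'BOB', 'BOB', None],
    'Age': ['25', '30', '30', 'n/a'],
})

raw.info()  # Both columns show dtype 'object'; 'Name' has 3 non-null out of 4
```

The sections below walk through fixing each of these issues.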
🧹 2. Handling Missing Data
Missing values are represented by `NaN` (Not a Number). Pandas provides several ways to identify, drop, or fill them.
Detect Missing Data
```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', None],
        'Age': [25, None, 28, 30]}
df = pd.DataFrame(data)

print(df.isna())   # True where a value is missing
print(df.notna())  # True where a value is present
```
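Beyond the boolean masks, chaining `.sum()` gives per-column missing counts, which is often the first thing to check:

```python
print(df.isna().sum())        # Missing values per column
print(df.isna().sum().sum())  # Total missing values in the DataFrame
```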
Drop Missing Data
```python
df.dropna()                # Drop rows containing any NaN
df.dropna(axis=1)          # Drop columns containing NaN
df.dropna(subset=['Age'])  # Drop rows only if 'Age' is missing
```
Fill Missing Data
```python
df.fillna(0)                        # Fill with a constant value
df['Age'].fillna(df['Age'].mean())  # Fill with the column mean
df.ffill()                          # Forward fill (fillna(method='ffill') is deprecated)
df.bfill()                          # Backward fill
```
💡 Tip: Use `inplace=True` to modify the DataFrame directly, or assign the result to a new variable to preserve the original.
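A quick sketch of the difference, using the df defined above:

```python
cleaned = df.fillna(0)      # df is unchanged; the filled copy lives in 'cleaned'
df.fillna(0, inplace=True)  # df itself is modified; the call returns None
```

Assigning to a new variable is generally the safer habit, since it keeps the raw data intact.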
🔁 3. Removing Duplicates
Duplicates can distort your results; Pandas makes them easy to handle.
```python
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Bob'],
        'Age': [25, 30, 28, 30]}
df = pd.DataFrame(data)

df.duplicated()        # Boolean mask: True for rows that repeat an earlier row
df.drop_duplicates()   # Remove fully duplicated rows
df.drop_duplicates(subset=['Name'], keep='first', inplace=True)  # Keep first occurrence per name
```
⚠️ Sometimes duplicates are valid (e.g., multiple purchases by the same user). Always verify the context before dropping them.
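One way to verify that context is to inspect every member of each duplicate group before dropping anything; `keep=False` marks all occurrences rather than just the repeats:

```python
# Run this before drop_duplicates(): show all rows sharing a duplicated 'Name'
mask = df.duplicated(subset=['Name'], keep=False)
print(df[mask])
```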
🧱 4. Data Type Conversion
Data is often imported as strings even when the values are numeric or date-based. Use Pandas' conversion utilities to restore the correct types.
```python
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')             # Strings -> numbers
df['JoinDate'] = pd.to_datetime(df['JoinDate'], errors='coerce')  # Strings -> datetimes
df['Salary'] = df['Salary'].astype(float)                         # Explicit cast
```
Check and Fix Dtypes
```python
print(df.dtypes)                                    # Inspect current dtypes
df['Category'] = df['Category'].astype('category')  # Efficient for repeated labels
```
💡 Use `errors='coerce'` to safely convert invalid entries into `NaN` instead of raising an error.
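A quick sketch of coercion in action on a hypothetical mixed Series:

```python
import pandas as pd

s = pd.Series(['10', '20', 'oops', '40'])
print(pd.to_numeric(s, errors='coerce'))
# 0    10.0
# 1    20.0
# 2     NaN   <- 'oops' could not be parsed
# 3    40.0
# dtype: float64
```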
🧰 5. Applying and Mapping Functions
Transform data easily using `.apply()`, `.map()`, or vectorized string methods.
Using apply()
```python
df['Age'] = df['Age'].apply(lambda x: x + 1)  # Apply a function to every value
```
Using map() (Element-wise for Series)
```python
df['Name'] = df['Name'].map(lambda x: x.upper() if isinstance(x, str) else x)
```
String Cleaning with str Accessor
```python
df['Name'] = df['Name'].str.strip().str.title()  # Trim whitespace, normalize casing
```
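`.map()` also accepts a dictionary, which is handy for recoding values; anything without a dictionary entry becomes `NaN`. A small sketch with hypothetical status codes:

```python
status = pd.Series(['a', 'i', 'a', 'x'])  # hypothetical single-letter codes
labels = status.map({'a': 'Active', 'i': 'Inactive'})
print(labels)  # 'x' has no mapping, so it maps to NaN
```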
📊 6. Aggregating and Grouping Data
Aggregation helps summarize and analyze trends efficiently.
```python
sales = pd.DataFrame({
    "Region": ["North", "South", "North", "West", "South"],
    "Sales": [2500, 3200, 2800, 1500, 3100],
    "Profit": [400, 600, 500, 200, 550]
})

# Group by region and aggregate
summary = sales.groupby("Region").agg({
    "Sales": ["sum", "mean"],
    "Profit": ["mean"]
})
print(summary)
```
Output:

```
        Sales         Profit
          sum    mean   mean
Region
North    5300  2650.0  450.0
South    6300  3150.0  575.0
West     1500  1500.0  200.0
```
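The dict-style `agg` above produces MultiIndex columns (Sales/sum, Sales/mean, Profit/mean). If flat column names are easier to work with downstream, named aggregation, available since pandas 0.25, computes the same summary:

```python
summary = sales.groupby("Region").agg(
    total_sales=("Sales", "sum"),
    avg_sales=("Sales", "mean"),
    avg_profit=("Profit", "mean"),
)
print(summary)  # Columns: total_sales, avg_sales, avg_profit
```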
📏 7. Normalizing and Scaling Data
Normalization ensures data is on a consistent scale, which is crucial for machine learning and comparisons.
```python
# Min-max normalization: rescale values to the [0, 1] range
df["Normalized"] = (df["Age"] - df["Age"].min()) / (df["Age"].max() - df["Age"].min())

# Z-score standardization: mean 0, standard deviation 1
df["Standardized"] = (df["Age"] - df["Age"].mean()) / df["Age"].std()
```
🧪 8. Real-World Example: Cleaning a Sales Dataset
```python
import pandas as pd
import numpy as np

data = {
    "Customer": ["Ali", "Sara", "Ali", "Lina", None],
    "Product": ["Laptop", "Phone", "Laptop", "Tablet", "Phone"],
    "Price": [1200, np.nan, 1200, 800, 900],
    "Date": ["2024-01-05", "2024/02/10", "Invalid", "2024-03-15", "2024-04-01"]
}
df = pd.DataFrame(data)

# 1. Clean missing values (assign back instead of calling inplace on a column
#    slice, which triggers chained-assignment warnings in recent pandas)
df["Price"] = df["Price"].fillna(df["Price"].mean())
df.dropna(subset=["Customer"], inplace=True)

# 2. Convert types; format="mixed" (pandas >= 2.0) parses each date
#    individually, and errors="coerce" turns "Invalid" into NaT
df["Date"] = pd.to_datetime(df["Date"], format="mixed", errors="coerce")

# 3. Remove duplicates
df.drop_duplicates(subset=["Customer", "Product"], inplace=True)

# 4. Add a derived column
df["DiscountedPrice"] = df["Price"] * 0.9

# 5. Summary by product
summary = df.groupby("Product")["Price"].agg(["count", "mean", "max"])
print(summary)
```
Output:

```
         count    mean     max
Product
Laptop       1  1200.0  1200.0
Phone        1  1025.0  1025.0
Tablet       1   800.0   800.0
```

(The missing Phone price is filled with the overall mean, 1025.0, computed before the row with no customer is dropped.)
🧠 9. Best Practices for Data Cleaning
✅ Always start with `df.info()` and `df.describe()` to assess data quality.
✅ Keep raw data immutable: create a copy for cleaning.
✅ Use consistent data types (`astype`, `to_numeric`, `to_datetime`).
✅ Handle missing and duplicate data systematically.
✅ Document every transformation step for reproducibility.
✅ Chain operations (`.pipe()`, `.assign()`) for cleaner workflows, as sketched below.
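As a minimal sketch of that chaining style, using the sales df from section 8 and a hypothetical drop_missing_customers helper:

```python
def drop_missing_customers(frame):
    # Hypothetical helper: remove rows with no customer name
    return frame.dropna(subset=["Customer"])

cleaned = (
    df.pipe(drop_missing_customers)                        # Custom step via .pipe()
      .assign(DiscountedPrice=lambda d: d["Price"] * 0.9)  # Derived column via .assign()
      .drop_duplicates(subset=["Customer", "Product"])
)
```

Each step returns a new DataFrame, so the raw df stays untouched and the pipeline reads top to bottom.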
🧭 Summary
| Operation | Purpose | Example |
|---|---|---|
| Handle missing values | Fill or drop `NaN` | `df.fillna(df.mean(numeric_only=True))` |
| Remove duplicates | Ensure unique records | `df.drop_duplicates()` |
| Convert types | Ensure correct dtypes | `pd.to_datetime(df['Date'])` |
| Apply functions | Transform data | `df['Age'].apply(...)` |
| Aggregate | Summarize data | `df.groupby('Region')['Sales'].mean()` |
| Normalize | Scale values | `(x - min) / (max - min)` |
Clean data is reliable data. Mastering Pandas cleaning functions transforms raw, messy inputs into ready-to-analyze datasets.