Chapter 9: Data Analysis with Pandas
Sub-Chapter: Data Cleaning and Transformation – Preparing Data for Analysis
Data cleaning and transformation are the **foundation of data analysis**.
Before you can visualize or model data, you must ensure it's accurate, consistent, and properly formatted. Pandas provides powerful tools to detect, clean, and transform messy datasets efficiently.
🧩 1. Why Data Cleaning Matters
Real-world datasets are rarely perfect; they often contain:
- Missing values (`NaN`)
- Duplicates
- Inconsistent types (e.g., numbers stored as strings)
- Formatting issues (extra spaces, inconsistent casing)
- Outliers or invalid entries
Cleaning is part of the ETL process (Extract → Transform → Load), preparing data for accurate analysis and decision-making.
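As a minimal sketch of what "messy" looks like in practice, here is a small hypothetical DataFrame (names and values invented for illustration) that exhibits several of the problems above:

```python
import pandas as pd

# Hypothetical messy data: a missing name, a duplicate row,
# numbers stored as strings, stray whitespace, inconsistent casing
raw = pd.DataFrame({
    'Name': [' alice ', 'BOB', 'BOB', None],
    'Age': ['25', '30', '30', 'n/a'],
})

raw.info()  # Both columns show dtype 'object'; 'Name' has 3 non-null out of 4
```

The sections below walk through fixing each of these issues.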
🧹 2. Handling Missing Data
Missing values are represented by `NaN` (Not a Number). Pandas provides several ways to identify, drop, or fill them.
Detect Missing Data
```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', None],
        'Age': [25, None, 28, 30]}
df = pd.DataFrame(data)

print(df.isna())   # True where a value is missing
print(df.notna())  # True where a value is present
```
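Beyond the boolean masks, chaining `.sum()` gives per-column missing counts, which is often the first thing to check:

```python
print(df.isna().sum())        # Missing values per column
print(df.isna().sum().sum())  # Total missing values in the DataFrame
```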
Drop Missing Data
```python
df.dropna()                # Drop rows containing any NaN
df.dropna(axis=1)          # Drop columns containing NaN
df.dropna(subset=['Age'])  # Drop rows only if 'Age' is missing
```
Fill Missing Data
```python
df.fillna(0)                        # Fill with a constant value
df['Age'].fillna(df['Age'].mean())  # Fill with the column mean
df.ffill()                          # Forward fill (fillna(method='ffill') is deprecated)
df.bfill()                          # Backward fill
```
💡 Tip: Use `inplace=True` to modify the DataFrame directly, or assign the result to a new variable to preserve the original.
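A quick sketch of the difference, using the df defined above:

```python
cleaned = df.fillna(0)      # df is unchanged; the filled copy lives in 'cleaned'
df.fillna(0, inplace=True)  # df itself is modified; the call returns None
```

Assigning to a new variable is generally the safer habit, since it keeps the raw data intact.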
🔁 3. Removing Duplicates
Duplicates can distort your results; Pandas makes them easy to handle.
```python
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Bob'],
        'Age': [25, 30, 28, 30]}
df = pd.DataFrame(data)

df.duplicated()        # Boolean mask: True for rows that repeat an earlier row
df.drop_duplicates()   # Remove fully duplicated rows
df.drop_duplicates(subset=['Name'], keep='first', inplace=True)  # Keep first occurrence per name
```
⚠️ Sometimes duplicates are valid (e.g., multiple purchases by the same user). Always verify the context before dropping them.
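One way to verify that context is to inspect every member of each duplicate group before dropping anything; `keep=False` marks all occurrences rather than just the repeats:

```python
# Run this before drop_duplicates(): show all rows sharing a duplicated 'Name'
mask = df.duplicated(subset=['Name'], keep=False)
print(df[mask])
```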
🧱 4. Data Type Conversion
Data is often imported as strings even when the values are numeric or date-based. Use Pandas' conversion utilities to restore the correct types.
```python
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')             # Strings -> numbers
df['JoinDate'] = pd.to_datetime(df['JoinDate'], errors='coerce')  # Strings -> datetimes
df['Salary'] = df['Salary'].astype(float)                         # Explicit cast
```
Check and Fix Dtypes
```python
print(df.dtypes)                                    # Inspect current dtypes
df['Category'] = df['Category'].astype('category')  # Efficient for repeated labels
```
💡 Use `errors='coerce'` to safely convert invalid entries into `NaN` instead of raising an error.
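A quick sketch of coercion in action on a hypothetical mixed Series:

```python
import pandas as pd

s = pd.Series(['10', '20', 'oops', '40'])
print(pd.to_numeric(s, errors='coerce'))
# 0    10.0
# 1    20.0
# 2     NaN   <- 'oops' could not be parsed
# 3    40.0
# dtype: float64
```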
🧰 5. Applying and Mapping Functions
Transform data easily using `.apply()`, `.map()`, or vectorized string methods.
Using apply()
```python
df['Age'] = df['Age'].apply(lambda x: x + 1)  # Apply a function to every value
```
Using map() (Element-wise for Series)
```python
df['Name'] = df['Name'].map(lambda x: x.upper() if isinstance(x, str) else x)
```
String Cleaning with str Accessor
```python
df['Name'] = df['Name'].str.strip().str.title()  # Trim whitespace, normalize casing
```
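`.map()` also accepts a dictionary, which is handy for recoding values; anything without a dictionary entry becomes `NaN`. A small sketch with hypothetical status codes:

```python
status = pd.Series(['a', 'i', 'a', 'x'])  # hypothetical single-letter codes
labels = status.map({'a': 'Active', 'i': 'Inactive'})
print(labels)  # 'x' has no mapping, so it maps to NaN
```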
📊 6. Aggregating and Grouping Data
Aggregation helps summarize and analyze trends efficiently.
```python
sales = pd.DataFrame({
    "Region": ["North", "South", "North", "West", "South"],
    "Sales": [2500, 3200, 2800, 1500, 3100],
    "Profit": [400, 600, 500, 200, 550]
})

# Group by region and aggregate
summary = sales.groupby("Region").agg({
    "Sales": ["sum", "mean"],
    "Profit": ["mean"]
})
print(summary)
```
Output:

```
        Sales         Profit
          sum    mean   mean
Region
North    5300  2650.0  450.0
South    6300  3150.0  575.0
West     1500  1500.0  200.0
```
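The dict-style `agg` above produces MultiIndex columns (Sales/sum, Sales/mean, Profit/mean). If flat column names are easier to work with downstream, named aggregation, available since pandas 0.25, computes the same summary:

```python
summary = sales.groupby("Region").agg(
    total_sales=("Sales", "sum"),
    avg_sales=("Sales", "mean"),
    avg_profit=("Profit", "mean"),
)
print(summary)  # Columns: total_sales, avg_sales, avg_profit
```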
📏 7. Normalizing and Scaling Data
Normalization ensures data is on a consistent scale, which is crucial for machine learning and comparisons.
```python
# Min-max normalization: rescale values to the [0, 1] range
df["Normalized"] = (df["Age"] - df["Age"].min()) / (df["Age"].max() - df["Age"].min())

# Z-score standardization: mean 0, standard deviation 1
df["Standardized"] = (df["Age"] - df["Age"].mean()) / df["Age"].std()
```
🧪 8. Real-World Example: Cleaning a Sales Dataset
```python
import pandas as pd
import numpy as np

data = {
    "Customer": ["Ali", "Sara", "Ali", "Lina", None],
    "Product": ["Laptop", "Phone", "Laptop", "Tablet", "Phone"],
    "Price": [1200, np.nan, 1200, 800, 900],
    "Date": ["2024-01-05", "2024/02/10", "Invalid", "2024-03-15", "2024-04-01"]
}
df = pd.DataFrame(data)

# 1. Clean missing values (assign back instead of calling inplace on a column
#    slice, which triggers chained-assignment warnings in recent pandas)
df["Price"] = df["Price"].fillna(df["Price"].mean())
df.dropna(subset=["Customer"], inplace=True)

# 2. Convert types; format="mixed" (pandas >= 2.0) parses each date
#    individually, and errors="coerce" turns "Invalid" into NaT
df["Date"] = pd.to_datetime(df["Date"], format="mixed", errors="coerce")

# 3. Remove duplicates
df.drop_duplicates(subset=["Customer", "Product"], inplace=True)

# 4. Add a derived column
df["DiscountedPrice"] = df["Price"] * 0.9

# 5. Summary by product
summary = df.groupby("Product")["Price"].agg(["count", "mean", "max"])
print(summary)
```
Output:

```
         count    mean     max
Product
Laptop       1  1200.0  1200.0
Phone        1  1025.0  1025.0
Tablet       1   800.0   800.0
```

(The missing Phone price is filled with the overall mean, 1025.0, computed before the row with no customer is dropped.)
🧠 9. Best Practices for Data Cleaning
✅ Always start with `df.info()` and `df.describe()` to assess data quality.
✅ Keep raw data immutable: create a copy for cleaning.
✅ Use consistent data types (`astype`, `to_numeric`, `to_datetime`).
✅ Handle missing and duplicate data systematically.
✅ Document every transformation step for reproducibility.
✅ Chain operations (`.pipe()`, `.assign()`) for cleaner workflows, as sketched below.
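As a minimal sketch of that chaining style, using the sales df from section 8 and a hypothetical drop_missing_customers helper:

```python
def drop_missing_customers(frame):
    # Hypothetical helper: remove rows with no customer name
    return frame.dropna(subset=["Customer"])

cleaned = (
    df.pipe(drop_missing_customers)                        # Custom step via .pipe()
      .assign(DiscountedPrice=lambda d: d["Price"] * 0.9)  # Derived column via .assign()
      .drop_duplicates(subset=["Customer", "Product"])
)
```

Each step returns a new DataFrame, so the raw df stays untouched and the pipeline reads top to bottom.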
🧭 Summary
| Operation | Purpose | Example |
|---|---|---|
| Handle missing values | Fill or drop `NaN` | `df.fillna(df.mean(numeric_only=True))` |
| Remove duplicates | Ensure unique records | `df.drop_duplicates()` |
| Convert types | Ensure correct dtypes | `pd.to_datetime(df['Date'])` |
| Apply functions | Transform data | `df['Age'].apply(...)` |
| Aggregate | Summarize data | `df.groupby('Region')['Sales'].mean()` |
| Normalize | Scale values | `(x - min) / (max - min)` |
Clean data is reliable data. Mastering Pandas cleaning functions transforms raw, messy inputs into ready-to-analyze datasets.