ch9s4_GroupingAndAggregation
Grouping and aggregation are **core techniques** in data analysis that allow you to summarize and gain insights from large datasets.
Chapter 9: Data Analysis with Pandas
Sub-Chapter: Grouping and Aggregation — Summarizing Data Effectively
Grouping and aggregation are core techniques in data analysis that allow you to summarize and gain insights from large datasets.
In Pandas, the groupby() operation is incredibly powerful — letting you split data into groups, apply functions, and combine the results, much like SQL’s GROUP BY or Excel pivot tables.
🧩 1. The Split–Apply–Combine Concept
Pandas’ grouping process follows this pattern:
- Split the data into groups based on one or more keys.
- Apply an aggregation or transformation function to each group.
- Combine the results into a new DataFrame or Series.
import pandas as pd
data = {
"Category": ["A", "B", "A", "B", "A"],
"Value": [10, 20, 15, 25, 30]
}
df = pd.DataFrame(data)
grouped = df.groupby("Category")
📊 2. Simple Aggregations
Once data is grouped, you can apply built-in aggregation functions such as sum(), mean(), count(), max(), and min().
sum_values = grouped["Value"].sum()
average_values = grouped["Value"].mean()
max_values = grouped["Value"].max()
print(sum_values)
Output:
Category
A 55
B 45
Name: Value, dtype: int64
🧮 3. Multiple Aggregations with agg()
The .agg() method allows multiple aggregations at once and supports custom names.
aggregated = grouped.agg({
"Value": ["sum", "mean", "max"]
})
print(aggregated)
Output:
Value
sum mean max
Category
A 55 18.3 30
B 45 22.5 25
Named Aggregations (Modern Syntax)
summary = df.groupby("Category").agg(
total_value=("Value", "sum"),
avg_value=("Value", "mean")
)
🧠 4. Grouping by Multiple Columns
You can group by more than one column to perform hierarchical grouping.
sales = pd.DataFrame({
"Region": ["North", "South", "North", "South", "West"],
"Product": ["A", "A", "B", "B", "A"],
"Sales": [2500, 2700, 3000, 3100, 1800],
"Profit": [400, 450, 500, 480, 300]
})
grouped = sales.groupby(["Region", "Product"]).agg(
total_sales=("Sales", "sum"),
avg_profit=("Profit", "mean")
)
print(grouped)
Output:
total_sales avg_profit
Region Product
North A 2500 400.0
B 3000 500.0
South A 2700 450.0
B 3100 480.0
West A 1800 300.0
🔁 5. Transforming Grouped Data
Use .transform() when you want to perform an operation but retain the original DataFrame shape.
sales["NormalizedProfit"] = sales.groupby("Region")["Profit"].transform(
lambda x: (x - x.mean()) / x.std()
)
🧰 6. Applying Custom Functions
You can use .apply() for complex operations that return DataFrames or Series.
def profit_margin(group):
group["Margin"] = group["Profit"] / group["Sales"]
return group
sales = sales.groupby("Region").apply(profit_margin)
⚙️ 7. Sorting and Flattening Grouped Results
Grouped results often return MultiIndex columns. Use reset_index() and rename columns for clarity.
flat = grouped.reset_index()
flat.columns = ["Region", "Product", "TotalSales", "AvgProfit"]
print(flat)
📈 8. Pivot-Table Analogy
Grouping can be seen as the programmatic version of an Excel pivot table.
pivot = sales.pivot_table(
values="Sales",
index="Region",
columns="Product",
aggfunc="sum",
fill_value=0
)
print(pivot)
Output:
Product A B
Region
North 2500 3000
South 2700 3100
West 1800 0
🧪 9. Real-World Example — Employee Performance Analysis
data = {
"Department": ["HR", "HR", "IT", "IT", "Finance", "Finance"],
"Employee": ["Ali", "Sara", "John", "Lina", "Reza", "Omid"],
"Salary": [5000, 5200, 6200, 6700, 5900, 6100],
"Experience": [3, 4, 5, 6, 2, 3]
}
df = pd.DataFrame(data)
# Group and summarize
summary = df.groupby("Department").agg(
avg_salary=("Salary", "mean"),
max_experience=("Experience", "max"),
count_employees=("Employee", "count")
).reset_index()
# Add derived metric
summary["SalaryRange"] = df.groupby("Department")["Salary"].agg(lambda x: x.max() - x.min()).values
print(summary)
Output:
Department avg_salary max_experience count_employees SalaryRange
0 Finance 6000.0 3 2 200
1 HR 5100.0 4 2 200
2 IT 6450.0 6 2 500
🧭 10. Best Practices
✅ Use named aggregations for readable outputs.
✅ Always reset index after grouping when you need a flat DataFrame.
✅ Prefer .agg() for summaries, .transform() for scaling, and .apply() for complex logic.
✅ Chain operations cleanly:
(df.groupby("Category")
.agg(total=("Value", "sum"), avg=("Value", "mean"))
.reset_index())
🧠 Summary
| Operation | Description | Example |
|---|---|---|
groupby() | Split data into groups | df.groupby("Category") |
.agg() | Apply one or more aggregations | agg({"Value": ["sum", "mean"]}) |
.transform() | Return transformed data (same shape) | transform(lambda x: x - x.mean()) |
.apply() | Apply custom functions | apply(custom_func) |
.reset_index() | Flatten grouped result | reset_index() |
.pivot_table() | Create pivot-style summary | pivot_table(values, index, columns) |
Grouping and aggregation are the heart of exploratory data analysis (EDA) — they turn raw data into insights, trends, and summaries you can act upon.