Chapter 9: Data Analysis with Pandas

Sub-Chapter: Grouping and Aggregation — Summarizing Data Effectively

Grouping and aggregation are core techniques in data analysis that allow you to summarize and gain insights from large datasets.
In Pandas, the groupby() operation is the central tool for this: it lets you split data into groups, apply functions to each group, and combine the results, much like SQL's GROUP BY or an Excel pivot table.


🧩 1. The Split–Apply–Combine Concept

Pandas’ grouping process follows this pattern:

  1. Split the data into groups based on one or more keys.
  2. Apply an aggregation or transformation function to each group.
  3. Combine the results into a new DataFrame or Series.
import pandas as pd

data = {
    "Category": ["A", "B", "A", "B", "A"],
    "Value": [10, 20, 15, 25, 30]
}
df = pd.DataFrame(data)

# Split: build a GroupBy object keyed on the Category column
grouped = df.groupby("Category")
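
Nothing is computed at this point; grouped is just a GroupBy object describing the split. A quick way to inspect the groups (a minimal sketch; the loop variables name and group are arbitrary):

# Iterate over the split: each iteration yields the group key and that group's rows
for name, group in grouped:
    print(name)
    print(group)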

📊 2. Simple Aggregations

Once data is grouped, you can apply built-in aggregation functions such as sum(), mean(), count(), max(), and min().

sum_values = grouped["Value"].sum()
average_values = grouped["Value"].mean()
max_values = grouped["Value"].max()

print(sum_values)

Output:

Category
A    55
B    45
Name: Value, dtype: int64
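
count(), min(), and the related size() follow the same pattern; the variable names below are just illustrative. One subtlety worth knowing: count() counts non-null values per column, while size() counts rows regardless of nulls.

count_values = grouped["Value"].count()   # non-null values per group
min_values = grouped["Value"].min()
group_sizes = grouped.size()              # rows per group, nulls included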

🧮 3. Multiple Aggregations with agg()

The .agg() method applies several aggregations at once, either as a dict mapping columns to lists of functions or as named aggregations that give the output columns readable names.

aggregated = grouped.agg({
    "Value": ["sum", "mean", "max"]
})
print(aggregated)

Output:

         Value
           sum       mean  max
Category
A           55  18.333333   30
B           45  22.500000   25

Named Aggregations (Modern Syntax)

summary = df.groupby("Category").agg(
    total_value=("Value", "sum"),
    avg_value=("Value", "mean")
)
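
With the sample DataFrame from section 1, printing this summary gives one flat, readably named column per aggregation, roughly:

print(summary)

Output:

          total_value  avg_value
Category
A                  55  18.333333
B                  45  22.500000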

🧠 4. Grouping by Multiple Columns

You can group by more than one column to perform hierarchical grouping.

sales = pd.DataFrame({
    "Region": ["North", "South", "North", "South", "West"],
    "Product": ["A", "A", "B", "B", "A"],
    "Sales": [2500, 2700, 3000, 3100, 1800],
    "Profit": [400, 450, 500, 480, 300]
})

grouped = sales.groupby(["Region", "Product"]).agg(
    total_sales=("Sales", "sum"),
    avg_profit=("Profit", "mean")
)
print(grouped)

Output:

               total_sales  avg_profit
Region Product
North  A              2500       400.0
       B              3000       500.0
South  A              2700       450.0
       B              3100       480.0
West   A              1800       300.0
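
The result carries a MultiIndex of (Region, Product), so a single combination can be pulled out with .loc:

# Look up the summary row for one Region/Product pair
print(grouped.loc[("North", "A")])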

🔁 5. Transforming Grouped Data

Use .transform() when you want to compute something per group but keep the original DataFrame's shape and index, producing one value per row.

sales["NormalizedProfit"] = sales.groupby("Region")["Profit"].transform(
    lambda x: (x - x.mean()) / x.std()
)
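
Another common transform pattern is broadcasting a group statistic back onto every row; the column name AvgRegionSales is just an illustrative choice:

# Each row receives its region's mean Sales; the DataFrame keeps its shape
sales["AvgRegionSales"] = sales.groupby("Region")["Sales"].transform("mean")
print(sales[["Region", "Sales", "AvgRegionSales"]])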

🧰 6. Applying Custom Functions

You can use .apply() for complex operations that return DataFrames or Series.

def profit_margin(group):
    # Add a per-row Margin column within each group
    group["Margin"] = group["Profit"] / group["Sales"]
    return group

# group_keys=False keeps the original row index instead of adding Region to it
sales = sales.groupby("Region", group_keys=False).apply(profit_margin)
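
For this particular margin calculation a plain vectorized division (sales["Profit"] / sales["Sales"]) would work without grouping at all; .apply() earns its keep when the logic genuinely depends on the group. A sketch, picking the top rows per region (top_rows is an illustrative name):

# Keep the two highest-sales rows within each region
top_rows = sales.groupby("Region", group_keys=False).apply(lambda g: g.nlargest(2, "Sales"))
print(top_rows)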

⚙️ 7. Sorting and Flattening Grouped Results

Grouped results often come back with a MultiIndex, either in the row index (as in the example above) or in the columns (as with the dict-style agg in section 3). Use reset_index() and rename the columns when you need a flat table.

flat = grouped.reset_index()
flat.columns = ["Region", "Product", "TotalSales", "AvgProfit"]
print(flat)
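
Two follow-up steps fit here: sorting the flat result, and flattening MultiIndex columns when they arise (as with the dict-style agg from section 3). A sketch:

# Sort the flattened summary by total sales, largest first
print(flat.sort_values("TotalSales", ascending=False))

# When the columns themselves are a MultiIndex, joining the levels
# gives flat names such as "Value_sum"
aggregated.columns = ["_".join(col) for col in aggregated.columns]
print(aggregated.reset_index())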

📈 8. Pivot-Table Analogy

A pivot table is the spreadsheet version of grouping: pivot_table() groups the data and then reshapes it into a two-dimensional layout.

pivot = sales.pivot_table(
    values="Sales",
    index="Region",
    columns="Product",
    aggfunc="sum",
    fill_value=0
)
print(pivot)

Output:

Product     A     B
Region
North    2500  3000
South    2700  3100
West     1800     0
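
The same table can be built with groupby() plus unstack(), which makes the analogy explicit (equivalent is just an illustrative name):

# Equivalent groupby formulation of the pivot table above
equivalent = sales.groupby(["Region", "Product"])["Sales"].sum().unstack(fill_value=0)
print(equivalent)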

🧪 9. Real-World Example — Employee Performance Analysis

data = {
    "Department": ["HR", "HR", "IT", "IT", "Finance", "Finance"],
    "Employee": ["Ali", "Sara", "John", "Lina", "Reza", "Omid"],
    "Salary": [5000, 5200, 6200, 6700, 5900, 6100],
    "Experience": [3, 4, 5, 6, 2, 3]
}
df = pd.DataFrame(data)

# Group and summarize
summary = df.groupby("Department").agg(
    avg_salary=("Salary", "mean"),
    max_experience=("Experience", "max"),
    count_employees=("Employee", "count")
).reset_index()

# Add derived metric: salary spread per department
# (.values alignment works because both groupbys sort departments the same way)
summary["SalaryRange"] = df.groupby("Department")["Salary"].agg(lambda x: x.max() - x.min()).values

print(summary)

Output:

  Department  avg_salary  max_experience  count_employees  SalaryRange
0    Finance      6000.0               3                2           200
1         HR      5100.0               4                2           200
2         IT      6450.0               6                2           500

🧭 10. Best Practices

✅ Use named aggregations for readable output columns.
✅ Reset the index after grouping when you need a flat DataFrame.
✅ Prefer .agg() for summaries, .transform() for row-aligned results (scaling, normalizing), and .apply() for complex group-dependent logic.
✅ Chain operations cleanly (shown here with the Category/Value DataFrame from section 1):

(df.groupby("Category")
   .agg(total=("Value", "sum"), avg=("Value", "mean"))
   .reset_index())

🧠 Summary

| Operation | Description | Example |
| --- | --- | --- |
| groupby() | Split data into groups | df.groupby("Category") |
| .agg() | Apply one or more aggregations | agg({"Value": ["sum", "mean"]}) |
| .transform() | Return transformed data (same shape) | transform(lambda x: x - x.mean()) |
| .apply() | Apply custom functions | apply(custom_func) |
| .reset_index() | Flatten a grouped result | reset_index() |
| .pivot_table() | Create a pivot-style summary | pivot_table(values, index, columns) |

Grouping and aggregation are the heart of exploratory data analysis (EDA) — they turn raw data into insights, trends, and summaries you can act upon.