ch12s2_CombiningAndMergingDataFrames

In real-world data analysis, you often need to **combine multiple datasets** into a single, coherent structure.

Chapter 12: Advanced NumPy and Pandas — Combining and Merging DataFrames

🧩 Combining and Merging DataFrames with Pandas

In real-world data analysis, you often need to combine multiple datasets into a single, coherent structure.
Pandas provides flexible and efficient functions to concatenate, merge, and join DataFrames — enabling seamless data integration and analysis.

⚙️ 1. Overview of Data Combination Techniques

Method	Description	Common Use
`pd.concat()`	Stacks DataFrames along an axis	Combine datasets with same columns or indexes
`pd.merge()`	SQL-style join operation	Combine DataFrames using common keys
`.join()`	Simplified merge using index	Combine DataFrames based on row indexes

🧱 2. Concatenating DataFrames

Concatenation stacks DataFrames vertically (rows) or horizontally (columns).

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Vertical concatenation (default axis=0)
vertical = pd.concat([df1, df2])

# Horizontal concatenation
horizontal = pd.concat([df1, df2], axis=1)

print("Vertical:
", vertical)
print("\nHorizontal:
", horizontal)

✅ Useful Parameters

Parameter	Description
`axis`	0 = rows (stack), 1 = columns (side-by-side)
`ignore_index`	Reindex after concatenation
`keys`	Create hierarchical index labels
`join`	’inner’ or ‘outer’ join on columns

Example using keys and ignore_index:

result = pd.concat([df1, df2], keys=['Group1', 'Group2'], ignore_index=False)
print(result)

🔗 3. Merging DataFrames (SQL-style Joins)

Merging is similar to SQL joins and allows combining DataFrames using common columns or keys.

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'value_left': [1, 2, 3]})
right = pd.DataFrame({'key': ['B', 'C', 'D'], 'value_right': [4, 5, 6]})

merged = pd.merge(left, right, on='key', how='inner')
print(merged)

Join Types

Join Type	Description	Keeps
inner	Intersection	Only matching keys
outer	Union	All keys, filling missing with NaN
left	Left DataFrame priority	All keys from left
right	Right DataFrame priority	All keys from right

Example comparing join types:

for method in ['inner', 'outer', 'left', 'right']:
    print(f"\n{method.upper()} JOIN:")
    print(pd.merge(left, right, on='key', how=method))

🧮 4. Merging on Multiple Keys

You can merge on multiple columns by passing a list to on:

left = pd.DataFrame({'key1': ['A', 'B', 'B'], 'key2': [1, 2, 3], 'value_left': [10, 20, 30]})
right = pd.DataFrame({'key1': ['A', 'B', 'B'], 'key2': [1, 2, 4], 'value_right': [100, 200, 300]})

merged = pd.merge(left, right, on=['key1', 'key2'], how='outer')
print(merged)

🧭 5. Index-Based Merges

Merges can also align using DataFrame indexes:

left = pd.DataFrame({'value_left': [1, 2, 3]}, index=['A', 'B', 'C'])
right = pd.DataFrame({'value_right': [4, 5, 6]}, index=['B', 'C', 'D'])

merged = pd.merge(left, right, left_index=True, right_index=True, how='outer')
print(merged)

🔧 6. Using `.join()` for Simpler Index Joins

If you just need to join by index, .join() is simpler and faster:

df1 = pd.DataFrame({'value1': [10, 20, 30]}, index=['A', 'B', 'C'])
df2 = pd.DataFrame({'value2': [40, 50, 60]}, index=['B', 'C', 'D'])

joined = df1.join(df2, how='outer')
print(joined)

🧰 7. Handling Overlapping Columns and Missing Data

When merging DataFrames with overlapping column names, you can specify suffixes to differentiate them:

left = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
right = pd.DataFrame({'key': ['B', 'C'], 'value': [3, 4]})

merged = pd.merge(left, right, on='key', how='outer', suffixes=('_left', '_right'))
print(merged)

Handling missing data post-merge:

merged.fillna(0, inplace=True)
print(merged)

Or use combine_first() to fill gaps with another DataFrame’s data:

df1 = pd.DataFrame({'A': [1, None, 3]})
df2 = pd.DataFrame({'A': [4, 5, 6]})
result = df1.combine_first(df2)
print(result)

⚙️ 8. Validating Merges (Safety Check)

To prevent unexpected data duplication, use the validate parameter.

pd.merge(left, right, on='key', validate='one_to_one')

Validation options include:

one_to_one
one_to_many
many_to_one
many_to_many

🧠 9. Performance and Best Practices

✅ Use ignore_index=True when concatenating large data to save memory.
✅ Use categorical dtypes for key columns to speed up merges.
✅ Validate merges to avoid accidental data duplication.
✅ For massive joins, consider Dask or Polars.
✅ When stacking similar data, prefer pd.concat() for simplicity.

🧾 10. Summary — Merging Tools at a Glance

Function	Primary Use	Key Parameter	Works On
`pd.concat()`	Stack DataFrames	`axis`, `ignore_index`, `keys`	Index alignment
`pd.merge()`	SQL-style joins	`on`, `how`, `suffixes`	Columns or indexes
`.join()`	Index-based joins	`how`, `lsuffix`, `rsuffix`	Index alignment
`.combine_first()`	Fill missing data	–	DataFrames

🧭 Conclusion

Combining and merging DataFrames is an essential step in data integration and preprocessing.
By mastering concat(), merge(), and join(), you can bring together data from multiple sources, align them precisely, and handle missing or overlapping data confidently.

Pandas makes complex relational operations feel effortless — a few lines of code can replicate full SQL joins and ETL pipelines.