ch12s2_CombiningAndMergingDataFrames
In real-world data analysis, you often need to **combine multiple datasets** into a single, coherent structure.
Chapter 12: Advanced NumPy and Pandas — Combining and Merging DataFrames
🧩 Combining and Merging DataFrames with Pandas
In real-world data analysis, you often need to combine multiple datasets into a single, coherent structure.
Pandas provides flexible and efficient functions to concatenate, merge, and join DataFrames — enabling seamless data integration and analysis.
⚙️ 1. Overview of Data Combination Techniques
| Method | Description | Common Use |
|---|---|---|
pd.concat() | Stacks DataFrames along an axis | Combine datasets with same columns or indexes |
pd.merge() | SQL-style join operation | Combine DataFrames using common keys |
.join() | Simplified merge using index | Combine DataFrames based on row indexes |
🧱 2. Concatenating DataFrames
Concatenation stacks DataFrames vertically (rows) or horizontally (columns).
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Vertical concatenation (default axis=0)
vertical = pd.concat([df1, df2])
# Horizontal concatenation
horizontal = pd.concat([df1, df2], axis=1)
print("Vertical:
", vertical)
print("\nHorizontal:
", horizontal)
✅ Useful Parameters
| Parameter | Description |
|---|---|
axis | 0 = rows (stack), 1 = columns (side-by-side) |
ignore_index | Reindex after concatenation |
keys | Create hierarchical index labels |
join | ’inner’ or ‘outer’ join on columns |
Example using keys and ignore_index:
result = pd.concat([df1, df2], keys=['Group1', 'Group2'], ignore_index=False)
print(result)
🔗 3. Merging DataFrames (SQL-style Joins)
Merging is similar to SQL joins and allows combining DataFrames using common columns or keys.
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'value_left': [1, 2, 3]})
right = pd.DataFrame({'key': ['B', 'C', 'D'], 'value_right': [4, 5, 6]})
merged = pd.merge(left, right, on='key', how='inner')
print(merged)
Join Types
| Join Type | Description | Keeps |
|---|---|---|
| inner | Intersection | Only matching keys |
| outer | Union | All keys, filling missing with NaN |
| left | Left DataFrame priority | All keys from left |
| right | Right DataFrame priority | All keys from right |
Example comparing join types:
for method in ['inner', 'outer', 'left', 'right']:
print(f"\n{method.upper()} JOIN:")
print(pd.merge(left, right, on='key', how=method))
🧮 4. Merging on Multiple Keys
You can merge on multiple columns by passing a list to on:
left = pd.DataFrame({'key1': ['A', 'B', 'B'], 'key2': [1, 2, 3], 'value_left': [10, 20, 30]})
right = pd.DataFrame({'key1': ['A', 'B', 'B'], 'key2': [1, 2, 4], 'value_right': [100, 200, 300]})
merged = pd.merge(left, right, on=['key1', 'key2'], how='outer')
print(merged)
🧭 5. Index-Based Merges
Merges can also align using DataFrame indexes:
left = pd.DataFrame({'value_left': [1, 2, 3]}, index=['A', 'B', 'C'])
right = pd.DataFrame({'value_right': [4, 5, 6]}, index=['B', 'C', 'D'])
merged = pd.merge(left, right, left_index=True, right_index=True, how='outer')
print(merged)
🔧 6. Using .join() for Simpler Index Joins
If you just need to join by index, .join() is simpler and faster:
df1 = pd.DataFrame({'value1': [10, 20, 30]}, index=['A', 'B', 'C'])
df2 = pd.DataFrame({'value2': [40, 50, 60]}, index=['B', 'C', 'D'])
joined = df1.join(df2, how='outer')
print(joined)
🧰 7. Handling Overlapping Columns and Missing Data
When merging DataFrames with overlapping column names, you can specify suffixes to differentiate them:
left = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
right = pd.DataFrame({'key': ['B', 'C'], 'value': [3, 4]})
merged = pd.merge(left, right, on='key', how='outer', suffixes=('_left', '_right'))
print(merged)
Handling missing data post-merge:
merged.fillna(0, inplace=True)
print(merged)
Or use combine_first() to fill gaps with another DataFrame’s data:
df1 = pd.DataFrame({'A': [1, None, 3]})
df2 = pd.DataFrame({'A': [4, 5, 6]})
result = df1.combine_first(df2)
print(result)
⚙️ 8. Validating Merges (Safety Check)
To prevent unexpected data duplication, use the validate parameter.
pd.merge(left, right, on='key', validate='one_to_one')
Validation options include:
one_to_oneone_to_manymany_to_onemany_to_many
🧠 9. Performance and Best Practices
✅ Use ignore_index=True when concatenating large data to save memory.
✅ Use categorical dtypes for key columns to speed up merges.
✅ Validate merges to avoid accidental data duplication.
✅ For massive joins, consider Dask or Polars.
✅ When stacking similar data, prefer pd.concat() for simplicity.
🧾 10. Summary — Merging Tools at a Glance
| Function | Primary Use | Key Parameter | Works On |
|---|---|---|---|
pd.concat() | Stack DataFrames | axis, ignore_index, keys | Index alignment |
pd.merge() | SQL-style joins | on, how, suffixes | Columns or indexes |
.join() | Index-based joins | how, lsuffix, rsuffix | Index alignment |
.combine_first() | Fill missing data | – | DataFrames |
🧭 Conclusion
Combining and merging DataFrames is an essential step in data integration and preprocessing.
By mastering concat(), merge(), and join(), you can bring together data from multiple sources, align them precisely, and handle missing or overlapping data confidently.
Pandas makes complex relational operations feel effortless — a few lines of code can replicate full SQL joins and ETL pipelines.