Chapter 9: Data Analysis with Pandas
Sub-Chapter: Introduction to Pandas — Data Manipulation Made Easy
Pandas is the cornerstone of data analysis in Python.
Built on top of NumPy, it provides high-level data structures and tools designed for cleaning, transforming, analyzing, and visualizing structured data efficiently.
🧩 1. What Is Pandas and Why Use It?
Pandas was created to bring data-table functionality (like Excel or SQL) into Python — but faster, more flexible, and fully programmable.
It’s ideal for:
- Reading and writing CSV, Excel, SQL, or JSON files.
- Handling missing or inconsistent data.
- Performing grouping, aggregation, filtering, and visualization.
- Integrating seamlessly with NumPy, Matplotlib, and Scikit-learn.
import pandas as pd
import numpy as np
🧱 2. Core Data Structures in Pandas
1. Series — One-Dimensional Data
A Series is like a single column in Excel or a labeled NumPy array.
import pandas as pd
data = [10, 20, 30, 40]
series = pd.Series(data, name="Scores")
print(series)
Output:
0 10
1 20
2 30
3 40
Name: Scores, dtype: int64
Key Attributes:
- .values → underlying NumPy array
- .index → row labels
- .dtype → data type
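The three attributes listed above can be inspected directly. The following is a minimal sketch that reuses the Scores series from this section; the variable names `values`, `labels`, and `kind` are chosen here for illustration only.

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40], name="Scores")

values = series.values        # NumPy array holding the data
labels = list(series.index)   # default labels are 0, 1, 2, 3
kind = series.dtype           # integer dtype inferred from the data

print(type(values).__name__)  # ndarray
print(labels)                 # [0, 1, 2, 3]
print(kind)
```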
2. DataFrame — Two-Dimensional Data
A DataFrame is like an Excel sheet or SQL table: labeled rows and columns.
data = {
"Name": ["Alice", "Bob", "Charlie", "Diana"],
"Age": [25, 30, 28, 22],
"Score": [88, 92, 79, 95]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Score
0 Alice 25 88
1 Bob 30 92
2 Charlie 28 79
3 Diana 22 95
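To make the "labeled rows and columns" idea concrete, the sketch below rebuilds the same table and looks at its two sets of labels. It also shows that selecting a single column returns a Series. The name `age_column` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 28, 22],
    "Score": [88, 92, 79, 95],
})

age_column = df["Age"]            # selecting one column yields a Series

print(type(age_column).__name__)  # Series
print(list(df.columns))           # ['Name', 'Age', 'Score']
print(list(df.index))             # [0, 1, 2, 3]
```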
🔍 3. Exploring DataFrames
Pandas provides intuitive methods to explore and summarize datasets.
df.head() # First 5 rows
df.tail(2) # Last 2 rows
df.shape # (rows, columns)
df.info() # Column types & memory info
df.describe() # Statistical summary (mean, std, etc.)
df.columns # Column labels
df.dtypes # Data types of each column
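As a rough illustration, here are a few of these calls applied to the four-row table from Section 2. Note that `describe()` summarizes only the numeric columns by default, so Name is left out of the summary.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 28, 22],
    "Score": [88, 92, 79, 95],
})

print(df.shape)                     # (4, 3)
print(df.head(2))                   # first two rows only

summary = df.describe()             # numeric columns only by default
print(list(summary.columns))        # ['Age', 'Score']
print(summary.loc["mean", "Age"])   # 26.25
print(summary.loc["max", "Score"])  # 95.0
```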
✏️ 4. Creating Data in Different Ways
From Lists or Dictionaries
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28]}
df = pd.DataFrame(data)
From NumPy Arrays
import numpy as np
array = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(array, columns=["A", "B", "C"])
From CSV or Excel
df = pd.read_csv("data.csv")
df = pd.read_excel("sales.xlsx", sheet_name="Sheet1")
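The file names above are placeholders, so the snippets will fail unless those files exist. A self-contained way to try `read_csv` and `to_csv` is to use an in-memory text buffer instead of a file on disk. The sketch below assumes a tiny two-column CSV invented for this example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file such as data.csv.
csv_text = "Name,Age\nAlice,25\nBob,30\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2)

# Write the table back out; index=False omits the row labels.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Reading the written text again should give an identical table.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip.equals(df))  # True
```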
🎯 5. Accessing and Selecting Data
Access Columns
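ages = df["Age"] # Bracket access works for any column name
names = df.Name # Attribute access works only if the name is a valid identifier and does not clash with a DataFrame attribute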
Access Rows — Label vs. Integer Index
df.loc[0] # Access row by label
df.iloc[2] # Access row by integer position
Subset Rows and Columns
df.loc[1:3, ["Name", "Age"]] # Rows labeled 1 through 3 (end label included), specific columns
df.iloc[:, 1:] # All rows, all columns except first
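With the default index, labels and positions are the same numbers, which hides the difference between `.loc` and `.iloc`. The sketch below uses a hypothetical index of 10, 20, 30, 40 so the two can be told apart. It also shows that label slices include the end label while position slices exclude the end position.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "Diana"], "Age": [25, 30, 28, 22]},
    index=[10, 20, 30, 40],  # hypothetical labels chosen to differ from positions
)

by_label = df.loc[10:30, ["Name"]]  # label slice: includes both 10 and 30
by_position = df.iloc[0:2, [0]]     # position slice: excludes position 2

print(len(by_label))      # 3
print(len(by_position))   # 2
print(df.loc[20, "Age"])  # 30  (row labeled 20)
print(df.iloc[1, 1])      # 30  (second row, second column)
```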
🔎 6. Filtering and Conditional Selection
# Filter by condition
young_people = df[df["Age"] < 28]
# Multiple conditions (use & and | with parentheses)
filtered = df[(df["Age"] < 30) & (df["Score"] > 85)]
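The sketch below runs AND, OR, and NOT filters on the Section 2 table so the results can be checked by eye. The variable names are illustrative. Each comparison is wrapped in parentheses because `&`, `|`, and `~` bind more tightly than `<` and `>`.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 28, 22],
    "Score": [88, 92, 79, 95],
})

both = df[(df["Age"] < 30) & (df["Score"] > 85)]    # AND: both must hold
either = df[(df["Age"] > 29) | (df["Score"] > 90)]  # OR: at least one holds
not_young = df[~(df["Age"] < 28)]                   # NOT: invert the mask

print(both["Name"].tolist())       # ['Alice', 'Diana']
print(either["Name"].tolist())     # ['Bob', 'Diana']
print(not_young["Name"].tolist())  # ['Bob', 'Charlie']
```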
🔄 7. Modifying DataFrames
Add or Update Columns
df["Gender"] = ["F", "M", "M", "F"] # One value per row (assumes the 4-row df from Section 2)
df["Passed"] = df["Score"] >= 80
Rename Columns
df.rename(columns={"Score": "ExamScore"}, inplace=True)
Drop Columns or Rows
df.drop("Gender", axis=1, inplace=True) # Drop column
df.drop(0, axis=0, inplace=True) # Drop the row whose index label is 0
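The calls above change `df` in place. An alternative style, sketched here as one option rather than the required approach, is to leave the original untouched and chain methods that each return a new DataFrame. The name `result` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 28, 22],
    "Score": [88, 92, 79, 95],
})

result = (
    df.assign(Passed=df["Score"] >= 80)      # add a column
      .rename(columns={"Score": "ExamScore"})  # rename a column
      .drop(index=0)                           # drop the row labeled 0
)

print(list(result.columns))  # ['Name', 'Age', 'ExamScore', 'Passed']
print(len(result))           # 3
print(list(df.columns))      # ['Name', 'Age', 'Score']  (original unchanged)
```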
⚙️ 8. Sorting and Rearranging
# Sort by column
df.sort_values(by="Age", ascending=True, inplace=True)
# Sort by multiple columns
df.sort_values(by=["Passed", "Age"], ascending=[False, True], inplace=True)
# Reset the index to 0, 1, 2, ... after sorting (drop=True discards the old labels)
df = df.reset_index(drop=True)
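The sketch below applies the multi-column sort to the Section 2 table, with a Passed column added as in Section 7. It shows that sorting keeps the original row labels until `reset_index` replaces them.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 28, 22],
    "Score": [88, 92, 79, 95],
})
df["Passed"] = df["Score"] >= 80

# Passed students first (True sorts above False when descending), then youngest first.
ordered = df.sort_values(by=["Passed", "Age"], ascending=[False, True])
print(ordered["Name"].tolist())  # ['Diana', 'Alice', 'Bob', 'Charlie']
print(list(ordered.index))       # [3, 0, 1, 2]  (labels travel with the rows)

ordered = ordered.reset_index(drop=True)
print(list(ordered.index))       # [0, 1, 2, 3]
```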
🧮 9. Descriptive Statistics
print(df["Age"].mean()) # Average age
print(df["Score"].min()) # Minimum score
print(df["Score"].max()) # Maximum score
print(df["Age"].std()) # Standard deviation
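For reference, here are the values these calls give on the Section 2 table. One detail is worth knowing: pandas computes the sample standard deviation (`ddof=1`) by default, whereas NumPy's `std` defaults to the population version (`ddof=0`). The `stats` table at the end is an illustrative use of `agg` to collect several statistics at once.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 28, 22],
    "Score": [88, 92, 79, 95],
})

print(df["Age"].mean())   # 26.25
print(df["Score"].min())  # 79
print(df["Score"].max())  # 95
print(df["Age"].std())    # 3.5  (sample standard deviation, ddof=1)

population_std = df["Age"].std(ddof=0)  # smaller: divides by n instead of n - 1

# Several statistics for several columns in one call.
stats = df[["Age", "Score"]].agg(["mean", "min", "max"])
print(stats)
```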
🧠 10. Real-World Example — Mini Dataset
students = pd.DataFrame({
"Name": ["Ali", "Sara", "John", "Lina", "Omid"],
"Math": [78, 85, 90, 68, 92],
"Physics": [82, 79, 95, 70, 88],
"Chemistry": [80, 87, 89, 72, 91]
})
# Add average column
students["Average"] = students[["Math", "Physics", "Chemistry"]].mean(axis=1)
# Filter top performers
top_students = students[students["Average"] > 85]
# Sort by average descending
top_students = top_students.sort_values(by="Average", ascending=False)
print(top_students)
Output:
Name Math Physics Chemistry Average
2 John 90 95 89 91.333333
4 Omid 92 88 91 90.333333
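The averages in this dataset are thirds, so they print with many decimals. One possible variation, sketched below, rounds the Average column to one decimal and uses `nlargest` to pick the top rows directly instead of filtering and then sorting. The choice of two rows is arbitrary and only for illustration.

```python
import pandas as pd

students = pd.DataFrame({
    "Name": ["Ali", "Sara", "John", "Lina", "Omid"],
    "Math": [78, 85, 90, 68, 92],
    "Physics": [82, 79, 95, 70, 88],
    "Chemistry": [80, 87, 89, 72, 91],
})

# Row-wise mean of the three subjects, rounded to one decimal place.
subjects = ["Math", "Physics", "Chemistry"]
students["Average"] = students[subjects].mean(axis=1).round(1)

# The two highest averages, already sorted in descending order.
top_two = students.nlargest(2, "Average")

print(top_two["Name"].tolist())     # ['John', 'Omid']
print(top_two["Average"].tolist())  # [91.3, 90.3]
```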
🧾 11. Series vs. DataFrame — Quick Comparison
| Feature | Series | DataFrame |
|---|---|---|
| Structure | 1D (single column) | 2D (rows × columns) |
| Index | Row labels | Row & column labels |
| Creation | pd.Series([1,2,3]) | pd.DataFrame({...}) |
| Access | Single value: series.iloc[0] | Row/col: df.loc[0, "Age"] |
| Operations | Vectorized math | Row & column-wise operations |
🧭 12. Best Practices
✅ Use vectorized operations instead of loops.
✅ Use .copy() when modifying filtered subsets.
✅ Inspect data early with .info() and .describe().
✅ Name Series and columns clearly.
✅ Always verify column dtypes: for example, a numeric column that contains stray text may be read as strings (object dtype) instead of numbers.
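The first two practices can be sketched as follows, using the Section 2 table. The five-point adjustment is a made-up example. The vectorized expression gives the same numbers as an explicit loop, and calling `.copy()` on a filtered subset makes it an independent table, so editing it leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 28, 22],
    "Score": [88, 92, 79, 95],
})

# Vectorized: one expression over the whole column.
curved = df["Score"] + 5
# Equivalent explicit loop, shown only for comparison.
curved_loop = [score + 5 for score in df["Score"]]
print(curved.tolist() == curved_loop)  # True

# Copy the filtered subset before modifying it.
subset = df[df["Age"] < 28].copy()
subset["Score"] = subset["Score"] + 5

print(subset["Score"].tolist())  # [93, 100]
print(df["Score"].tolist())      # [88, 92, 79, 95]  (original unchanged)
```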
🧠 Summary
| Concept | Description | Example |
|---|---|---|
| Series | 1D labeled array | pd.Series([10,20,30]) |
| DataFrame | 2D labeled data table | pd.DataFrame({...}) |
| Access | Rows & columns | df.loc[1, "Age"] |
| Filter | Conditional selection | df[df["Age"]>25] |
| Modify | Add/update columns | df["New"] = ... |
| Explore | Inspect data | df.info(), df.describe() |
Pandas transforms Python into a data analysis powerhouse, bridging the gap between raw data and actionable insights.