Chapter 9: Data Analysis with Pandas

Sub-Chapter: Introduction to Pandas — Data Manipulation Made Easy

Pandas is the cornerstone of data analysis in Python.
Built on top of NumPy, it provides high-level data structures and tools designed for cleaning, transforming, analyzing, and visualizing structured data efficiently.

🧩 1. What Is Pandas and Why Use It?

Pandas was created to bring data-table functionality (like Excel or SQL) into Python, so that every operation on a table can be scripted, repeated, and combined with the rest of your code.

It’s ideal for:

Reading and writing CSV, Excel, SQL, or JSON files.
Handling missing or inconsistent data.
Performing grouping, aggregation, filtering, and visualization.
Integrating seamlessly with NumPy, Matplotlib, and Scikit-learn.

import pandas as pd
import numpy as np

🧱 2. Core Data Structures in Pandas

1. Series — One-Dimensional Data

A Series is like a single column in Excel or a labeled NumPy array.

import pandas as pd

data = [10, 20, 30, 40]
series = pd.Series(data, name="Scores")
print(series)

Output:

0    10
1    20
2    30
3    40
Name: Scores, dtype: int64

Key Attributes:

.values → underlying NumPy array
.index → row labels
.dtype → data type
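As a small sketch of these attributes, the snippet below rebuilds the Series from above with explicit string labels. The labels "a" to "d" are an assumption added for illustration; they are not part of the original example.

```python
import pandas as pd

# Same values as above; the string labels are hypothetical, for illustration only
scores = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"], name="Scores")

values = scores.to_numpy()    # underlying NumPy array (same data as .values)
labels = list(scores.index)   # row labels
kind = scores.dtype           # one data type shared by every element

print(values)        # [10 20 30 40]
print(labels)        # ['a', 'b', 'c', 'd']
print(kind)
print(scores["c"])   # label-based lookup -> 30
```

With a custom index, values are looked up by label rather than by position, which is what separates a Series from a plain NumPy array.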

2. DataFrame — Two-Dimensional Data

A DataFrame is like an Excel sheet or SQL table: labeled rows and columns.

data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 28, 22],
    "Score": [88, 92, 79, 95]
}
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age  Score
0    Alice   25     88
1      Bob   30     92
2  Charlie   28     79
3    Diana   22     95
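To connect the two structures, here is a minimal sketch showing that each DataFrame column is itself a Series sharing the table's row index. The variable names age_series and age_frame are chosen for this illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 28, 22],
    "Score": [88, 92, 79, 95],
})

age_series = df["Age"]     # single brackets return a Series
age_frame = df[["Age"]]    # a list of column names returns a DataFrame

print(type(age_series).__name__)           # Series
print(type(age_frame).__name__)            # DataFrame
print(age_series.index.equals(df.index))   # True: columns share the row index
```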

🔍 3. Exploring DataFrames

Pandas provides intuitive methods to explore and summarize datasets.

df.head()       # First 5 rows
df.tail(2)      # Last 2 rows
df.shape        # (rows, columns)
df.info()       # Column types & memory info
df.describe()   # Statistical summary (mean, std, etc.)
df.columns      # Column labels
df.dtypes       # Data types of each column
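Applied to the four-row table from section 2, these calls might be used as in the sketch below. The table is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 28, 22],
    "Score": [88, 92, 79, 95],
})

print(df.shape)            # (4, 3): four rows, three columns
print(list(df.columns))    # ['Name', 'Age', 'Score']

# describe() summarizes only the numeric columns by default,
# so "Name" does not appear in the result
summary = df.describe()
print(summary.loc["mean", "Age"])     # 26.25
print(summary.loc["mean", "Score"])   # 88.5
```

The result of describe() is itself a DataFrame, so individual statistics can be pulled out with .loc as shown.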

✏️ 4. Creating Data in Different Ways

From Lists or Dictionaries

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28]}
df = pd.DataFrame(data)

From NumPy Arrays

import numpy as np
array = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(array, columns=["A", "B", "C"])

From CSV or Excel

df = pd.read_csv("data.csv")
df = pd.read_excel("sales.xlsx", sheet_name="Sheet1")
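Because "data.csv" and "sales.xlsx" are placeholder file names, the following sketch reads CSV text from an in-memory buffer instead, so it runs without any file on disk. The sample rows are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = "Name,Age,Score\nAlice,25,88\nBob,30,92\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)   # (2, 3)

# Write the table back out; index=False omits the row labels
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Reading the written text again should reproduce the same table
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip.equals(df))   # True
```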

🎯 5. Accessing and Selecting Data

Access Columns

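ages = df["Age"]    # Bracket access works for any column name
names = df.Name     # Attribute access works only if the name is a valid identifier and does not clash with a DataFrame method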

Access Rows — Label vs. Integer Index

df.loc[0]     # Access row by label
df.iloc[2]    # Access row by integer position

Subset Rows and Columns

df.loc[1:3, ["Name", "Age"]]   # Rows labeled 1 through 3 (end label included), specific columns
df.iloc[:, 1:]                 # All rows, all columns except first
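With the default 0, 1, 2, ... index, labels and positions coincide, which hides the difference between .loc and .iloc. The sketch below uses a hypothetical non-default index (10, 20, 30, 40) to make the distinction visible.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie", "Diana"], "Age": [25, 30, 28, 22]},
    index=[10, 20, 30, 40],   # hypothetical labels, chosen for illustration
)

by_label = df.loc[20]       # row whose label is 20
by_position = df.iloc[2]    # third row, regardless of its label
print(by_label["Name"])     # Bob
print(by_position["Name"])  # Charlie

label_slice = df.loc[10:30]     # label slices include the end label
position_slice = df.iloc[0:2]   # position slices exclude the end, like Python lists
print(len(label_slice))         # 3
print(len(position_slice))      # 2
```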

🔎 6. Filtering and Conditional Selection

# Filter by condition
young_people = df[df["Age"] < 28]

# Multiple conditions (use & and | with parentheses)
filtered = df[(df["Age"] < 30) & (df["Score"] > 85)]
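A few more filtering patterns, sketched on the four-row table from section 2: OR with |, negation with ~, membership with .isin(), and the string-based .query() method as an alternative spelling. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 28, 22],
    "Score": [88, 92, 79, 95],
})

# OR: younger than 25, or scoring above 90
either = df[(df["Age"] < 25) | (df["Score"] > 90)]
print(either["Name"].tolist())      # ['Bob', 'Diana']

# NOT: invert a condition with ~
not_young = df[~(df["Age"] < 28)]
print(not_young["Name"].tolist())   # ['Bob', 'Charlie']

# Membership test against a list of values
picked = df[df["Name"].isin(["Alice", "Diana"])]
print(picked["Name"].tolist())      # ['Alice', 'Diana']

# The same AND filter as above, written as a query string
queried = df.query("Age < 30 and Score > 85")
print(queried["Name"].tolist())     # ['Alice', 'Diana']
```

The parentheses around each condition matter because & and | bind more tightly than comparison operators in Python.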

🔄 7. Modifying DataFrames

Add or Update Columns

df["Gender"] = ["F", "M", "M", "F"]
df["Passed"] = df["Score"] >= 80

Rename Columns

df.rename(columns={"Score": "ExamScore"}, inplace=True)

Drop Columns or Rows

df.drop("Gender", axis=1, inplace=True)  # Drop the "Gender" column
df.drop(0, axis=0, inplace=True)         # Drop the row whose index label is 0
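The calls above change df in place. As an alternative sketch, the same steps can return a new DataFrame and leave the original untouched, which makes it easier to rerun a notebook cell safely. The name updated is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 28, 22],
    "Score": [88, 92, 79, 95],
})

# Each method returns a new DataFrame, so the steps can be chained
updated = (
    df.assign(Passed=df["Score"] >= 80)       # add a column
      .rename(columns={"Score": "ExamScore"})  # rename a column
      .drop(index=0)                           # drop the row labeled 0
)

print(list(updated.columns))   # ['Name', 'Age', 'ExamScore', 'Passed']
print(list(updated.index))     # [1, 2, 3]
print(list(df.columns))        # ['Name', 'Age', 'Score'] (original unchanged)
```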

⚙️ 8. Sorting and Rearranging

# Sort by column
df.sort_values(by="Age", ascending=True, inplace=True)

# Sort by multiple columns
df.sort_values(by=["Passed", "Age"], ascending=[False, True], inplace=True)

# Reindex DataFrame
df = df.reset_index(drop=True)
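To see what sorting does to the row labels, this sketch sorts a fresh copy of the four-row table and then resets the index. It assumes a Passed column computed as Score >= 80, as in section 7.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 28, 22],
    "Score": [88, 92, 79, 95],
})
df["Passed"] = df["Score"] >= 80

# Passed students first (True before False), then youngest first
ordered = df.sort_values(by=["Passed", "Age"], ascending=[False, True])
print(ordered["Name"].tolist())   # ['Diana', 'Alice', 'Bob', 'Charlie']
print(list(ordered.index))        # [3, 0, 1, 2]: labels travel with their rows

# reset_index(drop=True) renumbers the rows and discards the old labels
renumbered = ordered.reset_index(drop=True)
print(list(renumbered.index))     # [0, 1, 2, 3]
```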

🧮 9. Descriptive Statistics

# "Score" was renamed to "ExamScore" in section 7
print(df["Age"].mean())           # Average age
print(df["ExamScore"].min())      # Minimum score
print(df["ExamScore"].max())      # Maximum score
print(df["Age"].std())            # Sample standard deviation (ddof=1)
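One detail worth checking by hand is that .std() in Pandas computes the sample standard deviation (ddof=1), whereas NumPy's np.std defaults to the population version (ddof=0). The sketch below uses a fresh copy of the four-row table from section 2, with its original column names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 28, 22],
    "Score": [88, 92, 79, 95],
})

sample_std = df["Age"].std()                  # divides by n - 1
population_std = df["Age"].std(ddof=0)        # divides by n
numpy_std = np.std(df["Age"].to_numpy())      # NumPy default is ddof=0

print(df["Age"].mean())   # 26.25
print(sample_std)         # 3.5 (up to floating-point rounding)

# Several statistics at once with agg()
score_stats = df["Score"].agg(["min", "max", "mean"])
print(score_stats["min"], score_stats["max"], score_stats["mean"])
```

The two standard deviations differ noticeably on small samples, so it helps to state which one a report uses.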

🧠 10. Real-World Example — Mini Dataset

students = pd.DataFrame({
    "Name": ["Ali", "Sara", "John", "Lina", "Omid"],
    "Math": [78, 85, 90, 68, 92],
    "Physics": [82, 79, 95, 70, 88],
    "Chemistry": [80, 87, 89, 72, 91]
})

# Add average column
students["Average"] = students[["Math", "Physics", "Chemistry"]].mean(axis=1)

# Filter top performers
top_students = students[students["Average"] > 85]

# Sort by average descending
top_students = top_students.sort_values(by="Average", ascending=False)
print(top_students)

Output:

   Name  Math  Physics  Chemistry    Average
2  John    90       95         89  91.333333
4  Omid    92       88         91  90.333333

🧾 11. Series vs. DataFrame — Quick Comparison

| Feature | Series | DataFrame |
| --- | --- | --- |
| Structure | 1D (single column) | 2D (rows × columns) |
| Index | Row labels | Row & column labels |
| Creation | `pd.Series([1,2,3])` | `pd.DataFrame({...})` |
| Access | Single value: `series[0]` | Row/col: `df.loc[0, "Age"]` |
| Operations | Vectorized math | Row & column-wise operations |
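The "Operations" row can be illustrated with a short sketch: arithmetic on a Series applies to every element at once, while a DataFrame reduction runs per column by default or per row with axis=1. The small grades table is made up for this example.

```python
import pandas as pd

scores = pd.Series([10, 20, 30], name="Scores")
doubled = scores * 2              # vectorized: applied to every element
print(doubled.tolist())           # [20, 40, 60]

# Hypothetical grades table for illustration
grades = pd.DataFrame({"Math": [78, 85, 90], "Physics": [82, 79, 95]})

col_means = grades.mean()         # one value per column
row_means = grades.mean(axis=1)   # one value per row
print(list(col_means.index))      # ['Math', 'Physics']
print(row_means.tolist())         # [80.0, 82.0, 92.5]
```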

🧭 12. Best Practices

✅ Use vectorized operations instead of loops.
✅ Use .copy() when modifying filtered subsets, so the change applies to an independent table instead of triggering a chained-assignment warning.
✅ Inspect data early with .info() and .describe().
✅ Name Series and columns clearly.
✅ Always verify column dtypes after loading, e.g., a numeric column that contains stray text is imported as object (or string) and will not support arithmetic.
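Three of these practices can be sketched together: a vectorized column update, an explicit .copy() before editing a filtered subset, and a dtype repair with pd.to_numeric. The Bonus column and the raw sample values are hypothetical, added only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Age": [25, 30, 28, 22],
    "Score": [88, 92, 79, 95],
})

# Vectorized: one expression instead of a Python loop over rows
df["Bonus"] = df["Score"] + 5
print(df["Bonus"].tolist())     # [93, 97, 84, 100]

# Copy the filtered subset before modifying it
young = df[df["Age"] < 28].copy()
young["Score"] = 100
print(young["Score"].tolist())  # [100, 100]
print(df["Score"].tolist())     # [88, 92, 79, 95] (original unchanged)

# Repair a text column that should be numeric; bad entries become NaN
raw = pd.Series(["1", "2", "x"])
numbers = pd.to_numeric(raw, errors="coerce")
print(numbers.isna().sum())     # 1
```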

🧠 Summary

| Concept | Description | Example |
| --- | --- | --- |
| Series | 1D labeled array | `pd.Series([10,20,30])` |
| DataFrame | 2D labeled data table | `pd.DataFrame({...})` |
| Access | Rows & columns | `df.loc[1, "Age"]` |
| Filter | Conditional selection | `df[df["Age"]>25]` |
| Modify | Add/update columns | `df["New"] = ...` |
| Explore | Inspect data | `df.info()`, `df.describe()` |

With Series, DataFrame, and the selection, filtering, and summary tools covered here, Pandas handles most of the path from raw data to a first analysis in Python.