Chapter 10: Advanced Topics — Web Scraping with Beautiful Soup
🕸️ Web Scraping with Beautiful Soup — Extracting Data from the Web
**Web scraping** is the process of automatically gathering data from websites. Python’s **Beautiful Soup** library, combined with the **requests** library, provides a simple yet powerful toolkit for extracting structured information from web pages.
🌐 1. How Web Scraping Works
When you visit a webpage, your browser downloads HTML code from a server. That HTML contains tags (like <div>, <p>, <a>) and attributes (like class or href) that define its structure.
Web scraping simulates this process programmatically:
- Fetch the page with an HTTP request.
- Parse the HTML.
- Extract relevant information (titles, links, tables, etc.).
⚙️ 2. Installing Required Libraries
You’ll need the following packages:
pip install requests beautifulsoup4 lxml
| Library | Purpose |
|---|---|
| requests | Downloads the webpage (HTTP requests). |
| beautifulsoup4 | Parses HTML content into a navigable tree. |
| lxml | Fast and robust HTML parser. |
🧱 3. Basic Example — Extracting a Webpage Title and Links
import requests
from bs4 import BeautifulSoup
# URL to scrape
url = "https://example.com"
# Send HTTP request (with headers to mimic a browser)
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
# Parse the HTML content
soup = BeautifulSoup(response.content, "lxml")
# Extract the title
title = soup.title.text.strip()
print("Page Title:", title)
# Extract all hyperlinks
links = [a.get("href") for a in soup.find_all("a", href=True)]
print("Found links:", links)
# Extract the first paragraph
first_paragraph = soup.find("p")
print("First paragraph:", first_paragraph.text.strip())
🔍 4. Navigating and Searching the DOM
Beautiful Soup allows flexible element selection using multiple methods.
find() and find_all()
# Find by tag
heading = soup.find("h1")
# Find by class
article_text = soup.find_all("div", class_="article-content")
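find() and find_all() also accept attribute dictionaries, result limits, and text matching. A small sketch; the attribute values and search text here are hypothetical:
# Filter by arbitrary attributes and cap the number of results
external = soup.find_all("a", attrs={"target": "_blank"}, limit=5)
# Match on an element's text content (the search phrase is a placeholder)
intro = soup.find("p", string=lambda s: s and "introduction" in s.lower())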
CSS Selectors (select())
# Select using CSS selectors
nav_links = soup.select("nav a")
for link in nav_links:
    print(link["href"])
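A related method, select_one(), returns only the first match (or None if nothing matches). For instance, with hypothetical class names:
# First matching element only
main_heading = soup.select_one("h1")
# Descendant and class combinators work as in standard CSS
summaries = soup.select("div.article-content p.summary")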
Extracting Attributes and Text
image = soup.find("img")
src = image["src"]
alt = image.get("alt", "No alt text")
print("Image source:", src, "| Alt text:", alt)
📊 5. Extracting Structured Data (e.g., Tables)
url = "https://www.w3schools.com/html/html_tables.asp"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
table = soup.find("table")
rows = table.find_all("tr")
for row in rows:
    cells = [cell.text.strip() for cell in row.find_all(["th", "td"])]
    print(cells)
💡 Tables are common in finance, sports, and scientific data — this technique helps convert them into usable lists or DataFrames.
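If you use pandas, its read_html() function performs this conversion in one step. A minimal sketch, assuming pandas is installed (it can use lxml as its parser):
import pandas as pd

# read_html returns one DataFrame per <table> element found on the page
tables = pd.read_html("https://www.w3schools.com/html/html_tables.asp")
print(tables[0].head())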
🔁 6. Handling Pagination and Multiple Pages
Many websites show results over multiple pages. You can loop over page URLs dynamically:
base_url = "https://quotes.toscrape.com/page/{}/"
for page in range(1, 4):
    url = base_url.format(page)
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "lxml")
    quotes = [q.text for q in soup.select(".text")]
    print(f"Quotes on page {page}:", quotes)
🧠 7. Handling Dynamic Content (JavaScript-Rendered Pages)
Some sites load content dynamically with JavaScript — meaning the data doesn’t exist in the initial HTML.
To handle this, you can:
- Use API endpoints (inspect the network tab in DevTools).
- Use Selenium or Playwright for browser automation if needed.
Example (detecting missing data):
if not soup.find("div", class_="product"):
    print("Likely rendered by JavaScript — try Selenium or API scraping.")
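For completeness, a minimal Selenium sketch, assuming Selenium 4+ and Chrome are installed (recent Selenium versions download a matching driver automatically):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://example.com")
html = driver.page_source  # HTML after JavaScript has executed
driver.quit()

soup = BeautifulSoup(html, "lxml")
print("Rendered title:", soup.title.text)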
🧰 8. Parsing Local HTML Files
You can also parse HTML from a saved file — useful for offline scraping or debugging.
from bs4 import BeautifulSoup
with open("local_page.html", "r", encoding="utf-8") as file:
soup = BeautifulSoup(file, "html.parser")
print("Title:", soup.title.text)
⚡ 9. Optimizing Your Scraper
| Technique | Benefit |
|---|---|
| Set custom User-Agent | Avoids blocking by servers |
| Use time.sleep() between requests | Prevents rate-limiting |
| Cache requests locally | Saves bandwidth |
| Use try/except for errors | Handles timeouts and missing tags gracefully |
Example with retry and delay:
import time
import requests

for page in range(1, 6):
    url = f"https://example.com/page/{page}"
    for attempt in range(3):  # retry up to 3 times per page
        try:
            res = requests.get(url, timeout=10)
            print(f"Fetched {url} ({len(res.text)} bytes)")
            break
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} failed:", e)
            time.sleep(2)  # brief back-off before retrying
    time.sleep(1)  # polite delay between pages
📦 10. Comparing Libraries
| Library | Focus | Strength |
|---|---|---|
| BeautifulSoup | General-purpose HTML parsing | Easy to learn, flexible |
| lxml | Fast XML/HTML parsing | High performance |
| Scrapy | Large-scale web crawling framework | Handles queues, pipelines, and scraping automation |
⚖️ 11. Ethical and Legal Scraping
Always scrape responsibly:
- ✅ Check the website’s robots.txt (e.g., https://example.com/robots.txt).
- ✅ Review the terms of service and avoid violating site policies.
- ✅ Limit request rates to prevent server overload.
- ✅ Never scrape personal or private data.
- ✅ Credit the data source when publishing.
🧠 Remember: “Just because you can scrape it, doesn’t mean you should.”
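Python’s standard library can even check robots.txt rules programmatically; the user-agent string "MyScraperBot" below is a hypothetical placeholder:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file
print(rp.can_fetch("MyScraperBot", "https://example.com/some/page"))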
🗞️ 12. Real-World Example — Scraping News Headlines
import requests
from bs4 import BeautifulSoup
url = "https://www.bbc.com/news"
res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")
headlines = [h3.text.strip() for h3 in soup.select("h3") if h3.text.strip()]
print("Top Headlines:")
for title in headlines[:10]:
print("-", title)
🧭 13. Best Practices
✅ Use proper headers and delays to avoid blocks.
✅ Handle exceptions (requests.exceptions.RequestException).
✅ Cache results to reduce load.
✅ Respect website rules and privacy.
✅ Consider APIs or RSS feeds before scraping raw HTML.
✅ Store extracted data in structured formats (CSV, JSON, SQLite).
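As a minimal sketch of that last point, the standard csv module can persist scraped results (the headlines list here is placeholder data standing in for a real scrape):
import csv

headlines = ["Example headline one", "Example headline two"]  # placeholder data
with open("headlines.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["headline"])  # header row
    writer.writerows([[h] for h in headlines])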
🧠 Summary
| Concept | Description | Example |
|---|---|---|
| Web Scraping | Automated extraction of web data | requests + BeautifulSoup |
| Parser | Converts HTML into a navigable structure | "html.parser", "lxml" |
| find() / select() | Methods to locate elements | soup.find("div"), soup.select(".class") |
| Ethical Scraping | Respect site policies and avoid overload | Check robots.txt |
| Advanced Tools | Selenium, Scrapy, Playwright | For dynamic pages |
With Beautiful Soup, the web becomes a massive, structured data source — handle it wisely and ethically.