Chapter 10: Advanced Topics — Web Scraping with Beautiful Soup

🕸️ Web Scraping with Beautiful Soup — Extracting Data from the Web

**Web scraping** is the process of automatically gathering data from websites. Python’s **Beautiful Soup** library, combined with the **requests** library, provides a simple yet powerful toolkit for extracting structured information from web pages.


🌐 1. How Web Scraping Works

When you visit a webpage, your browser downloads HTML code from a server. That HTML contains tags (like `<div>`, `<p>`, `<a>`) and attributes (like `class` or `href`) that define its structure.

Web scraping simulates this process programmatically, as the short sketch after this list shows:

  1. Fetch the page with an HTTP request.
  2. Parse the HTML.
  3. Extract relevant information (titles, links, tables, etc.).
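
To make the tags-and-attributes idea concrete, here is a minimal sketch that parses a small HTML string in memory, so no network request is needed; the snippet and its class and href values are invented for illustration:

```python
from bs4 import BeautifulSoup

# A tiny, hypothetical HTML document: one <div> with a class,
# one <p>, and one <a> carrying an href attribute.
html = """
<div class="intro">
  <p>Welcome to scraping.</p>
  <a href="https://example.com/about">About us</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.div["class"])   # ['intro'] -- attributes behave like a dict
print(soup.p.text)         # Welcome to scraping.
print(soup.a["href"])      # https://example.com/about
```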

⚙️ 2. Installing Required Libraries

You’ll need the following packages:

```
pip install requests beautifulsoup4 lxml
```

| Library | Purpose |
| --- | --- |
| requests | Downloads the webpage (HTTP requests). |
| beautifulsoup4 | Parses HTML content into a navigable tree. |
| lxml | Fast and robust HTML parser. |

🧪 3. A First Scraping Example

Fetching a page, parsing it, and extracting data takes only a few lines:

```python
import requests
from bs4 import BeautifulSoup

# URL to scrape
url = "https://example.com"

# Send an HTTP request (with headers to mimic a browser)
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)

# Parse the HTML content
soup = BeautifulSoup(response.content, "lxml")

# Extract the title
title = soup.title.text.strip()
print("Page Title:", title)

# Extract all hyperlinks
links = [a.get("href") for a in soup.find_all("a", href=True)]
print("Found links:", links)

# Extract the first paragraph
first_paragraph = soup.find("p")
print("First paragraph:", first_paragraph.text.strip())
```

🔍 4. Navigating and Searching the DOM

Beautiful Soup allows flexible element selection using multiple methods.

`find()` and `find_all()`

```python
# Find the first matching tag
heading = soup.find("h1")

# Find all tags with a given class (class_ avoids clashing with the keyword)
article_text = soup.find_all("div", class_="article-content")
```

CSS Selectors (`select()`)

```python
# Select using CSS selectors
nav_links = soup.select("nav a")
for link in nav_links:
    print(link["href"])
```

Extracting Attributes and Text

```python
image = soup.find("img")
src = image["src"]                      # raises KeyError if the attribute is missing
alt = image.get("alt", "No alt text")   # .get() supplies a default instead
print("Image source:", src, "| Alt text:", alt)
```

📊 5. Extracting Structured Data (e.g., Tables)

```python
url = "https://www.w3schools.com/html/html_tables.asp"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

table = soup.find("table")
rows = table.find_all("tr")

for row in rows:
    # Collect header (<th>) and data (<td>) cells alike
    cells = [cell.text.strip() for cell in row.find_all(["th", "td"])]
    print(cells)
```

💡 Tables are common in finance, sports, and scientific data — this technique helps convert them into usable lists or DataFrames.
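
If pandas is installed (a separate `pip install pandas`), its `read_html()` helper can perform the same table-to-DataFrame conversion in one step; a minimal sketch:

```python
import pandas as pd

# read_html() returns a list of DataFrames, one per <table> on the page.
tables = pd.read_html("https://www.w3schools.com/html/html_tables.asp")
print(tables[0].head())
```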


🔁 6. Handling Pagination and Multiple Pages

Many websites show results over multiple pages. You can loop over page URLs dynamically:

```python
base_url = "https://quotes.toscrape.com/page/{}/"

for page in range(1, 4):
    url = base_url.format(page)
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "lxml")

    quotes = [q.text for q in soup.select(".text")]
    print(f"Quotes on page {page}:", quotes)
```
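
When the number of pages isn’t known in advance, you can instead follow the site’s own “Next” link until it disappears. A sketch against the same quotes site (the `li.next a` selector matches that site’s markup):

```python
import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com/"
while url:
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    print(len(soup.select(".text")), "quotes on", url)

    next_link = soup.select_one("li.next a")  # None on the last page
    url = "https://quotes.toscrape.com" + next_link["href"] if next_link else None
```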

🧠 7. Handling Dynamic Content (JavaScript-Rendered Pages)

Some sites load content dynamically with JavaScript — meaning the data doesn’t exist in the initial HTML.
To handle this, you can:

  1. Use API endpoints (inspect the network tab in DevTools).
  2. Use Selenium or Playwright for browser automation if needed.

Example (detecting missing data):

```python
if not soup.find("div", class_="product"):
    print("Likely rendered by JavaScript — try Selenium or API scraping.")
```
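
If you do need the rendered DOM, a minimal Selenium sketch looks like this (assuming `pip install selenium` and a local Chrome; Selenium 4 downloads a matching driver automatically):

```python
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()           # launches a real browser
driver.get("https://example.com")     # JavaScript executes here
html = driver.page_source             # the HTML *after* rendering
driver.quit()

soup = BeautifulSoup(html, "lxml")    # hand the rendered HTML to Beautiful Soup
print(soup.title.text)
```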

🧰 8. Parsing Local HTML Files

You can also parse HTML from a saved file — useful for offline scraping or debugging.

```python
from bs4 import BeautifulSoup

with open("local_page.html", "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file, "html.parser")

print("Title:", soup.title.text)
```

⚡ 9. Optimizing Your Scraper

| Technique | Benefit |
| --- | --- |
| Set a custom User-Agent | Avoids blocking by servers |
| Use `time.sleep()` between requests | Prevents rate-limiting |
| Cache requests locally | Saves bandwidth |
| Use `try`/`except` for errors | Handles timeouts and missing tags gracefully |

Example with retry and delay:

```python
import time
import requests

for page in range(1, 6):
    url = f"https://example.com/page/{page}"
    for attempt in range(3):                     # up to 3 attempts per page
        try:
            res = requests.get(url, timeout=10)
            res.raise_for_status()               # treat HTTP errors as failures
            print(f"Fetched {url} ({len(res.text)} bytes)")
            break
        except requests.RequestException as e:
            print(f"Attempt {attempt + 1} failed for {url}: {e}")
            time.sleep(2)                        # back off before retrying
    time.sleep(1)                                # polite delay between pages
```
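
For the caching row in the table above, the third-party requests-cache package (`pip install requests-cache`) makes caching transparent; a minimal sketch:

```python
import requests_cache

# Every requests.get() call is now served from a local SQLite cache
# when possible; entries expire after one hour.
requests_cache.install_cache("scrape_cache", expire_after=3600)
```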

📦 10. Comparing Libraries

| Library | Focus | Strength |
| --- | --- | --- |
| Beautiful Soup | General-purpose HTML parsing | Easy to learn, flexible |
| lxml | Fast XML/HTML parsing | High performance |
| Scrapy | Large-scale web crawling framework | Handles queues, pipelines, and scraping automation |
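
To give a feel for the Scrapy column, here is a minimal spider for the quotes site from Section 6; a sketch only, run with `scrapy runspider quotes_spider.py -o quotes.json`:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote; Scrapy handles scheduling and output.
        for text in response.css(".text::text").getall():
            yield {"quote": text}
```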

⚖️ 11. Scraping Responsibly

Always scrape responsibly: check a site’s robots.txt and terms of service before you start, throttle your requests, and avoid collecting personal data.

🧠 Remember: “Just because you can scrape it, doesn’t mean you should.”
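
The standard library can check robots.txt for you before each request; a minimal sketch:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://example.com/some/page"))  # True if allowed
```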


🗞️ 12. Real-World Example — Scraping News Headlines

```python
import requests
from bs4 import BeautifulSoup

url = "https://www.bbc.com/news"
res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")

headlines = [h3.text.strip() for h3 in soup.select("h3") if h3.text.strip()]
print("Top Headlines:")
for title in headlines[:10]:
    print("-", title)
```

🧭 13. Best Practices

✅ Use proper headers and delays to avoid blocks.
✅ Handle exceptions (requests.exceptions.RequestException).
✅ Cache results to reduce load.
✅ Respect website rules and privacy.
✅ Consider APIs or RSS feeds before scraping raw HTML.
✅ Store extracted data in structured formats (CSV, JSON, SQLite), as sketched below.
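
For that last point, the standard library’s csv module is enough; a minimal sketch with placeholder data standing in for real scraped headlines:

```python
import csv

headlines = ["Example headline one", "Example headline two"]  # placeholder data

with open("headlines.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["headline"])              # header row
    writer.writerows([h] for h in headlines)   # one row per headline
```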


🧠 Summary

| Concept | Description | Example |
| --- | --- | --- |
| Web scraping | Automated extraction of web data | `requests` + `BeautifulSoup` |
| Parser | Converts HTML into a navigable structure | `"html.parser"`, `"lxml"` |
| `find()` / `select()` | Methods to locate elements | `soup.find("div")`, `soup.select(".class")` |
| Ethical scraping | Respect site policies and avoid overload | Check robots.txt |
| Advanced tools | Selenium, Scrapy, Playwright | For dynamic pages |

With Beautiful Soup, the web becomes a massive, structured data source — handle it wisely and ethically.