Real‑World Projects — Building a Web Scraper

Published: November 12, 2025 • Language: python • Chapter: 16 • Sub: 1 • Level: beginner


🕸️ Introduction to Web Scraping

Web scraping is the process of automatically extracting information from websites.
It’s an essential skill for gathering data for analysis, automating repetitive research, and building real‑time information dashboards.

Python offers several libraries for scraping, including:

  • requests → for making HTTP requests
  • BeautifulSoup → for parsing and navigating HTML
  • pandas → for organizing and exporting the collected data
  • Selenium / Playwright → for handling dynamic JavaScript content
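
None of these libraries ship with Python itself, so they need to be installed first. One common way is with pip (Selenium and Playwright are optional and only needed for dynamic pages):

pip install requests beautifulsoup4 pandas
pip install selenium playwright   # optional, for dynamic pages; Playwright also needs "playwright install" to download its browsers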

⚙️ 1. How Web Scraping Works

  1. Send an HTTP request to a web page (e.g., using requests).
  2. Retrieve the HTML response from the server.
  3. Parse and extract relevant content (like text, links, or images).
  4. Store the data in a structured format (CSV, JSON, or database).
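
Before the full project below, here is a minimal sketch of those four steps in order. The URL and the <h2> selector are placeholders rather than a real site's markup:

import csv
import requests
from bs4 import BeautifulSoup

# 1. Send an HTTP request
response = requests.get("https://www.example.com", timeout=10)

# 2. Retrieve the HTML response
html = response.text

# 3. Parse and extract relevant content (here, the text of every <h2> tag)
soup = BeautifulSoup(html, "html.parser")
items = [h.get_text(strip=True) for h in soup.find_all("h2")]

# 4. Store the data in a structured format (CSV)
with open("items.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["heading"])
    writer.writerows([item] for item in items)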

🧠 2. Static vs Dynamic Pages

Type          | Description                                   | Tools
Static pages  | Content is embedded directly in the HTML.     | requests + BeautifulSoup
Dynamic pages | Content loads via JavaScript after page load. | Selenium, Playwright, or APIs
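
A quick way to tell the two apart: fetch the page with requests and check whether the text you want is already present in the raw HTML. If it only appears in the browser, JavaScript is injecting it after load. A rough sketch, using a placeholder URL and search phrase:

import requests

html = requests.get("https://www.example.com", timeout=10).text

# Replace the phrase below with text you can see in your browser on the target page.
if "Example Domain" in html:
    print("Found in the raw HTML, so static scraping should work")
else:
    print("Missing from the raw HTML, so the content is likely loaded by JavaScript")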

📰 3. Example — Scraping News Headlines with BeautifulSoup

Let’s build a practical scraper that extracts headlines and article links from a sample news website.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

def fetch_page(url):
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; WebScraperBot/1.0)'}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return response.content
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

def parse_headlines(html):
    soup = BeautifulSoup(html, 'html.parser')
    # The tag and class below are specific to the target site; inspect the
    # page's HTML and adjust the selector to match its real markup.
    headlines = soup.find_all('h2', class_='headline')
    data = []
    for h in headlines:
        title = h.get_text(strip=True)
        link_tag = h.find('a')
        link = link_tag['href'] if link_tag else None
        data.append({'title': title, 'link': link})
    return data

def save_to_csv(data, filename="headlines.csv"):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"✅ Saved {len(df)} records to {filename}")

# URL of the site to scrape
url = "https://www.example-news-site.com"

html = fetch_page(url)
if html:
    articles = parse_headlines(html)
    save_to_csv(articles)

    for article in articles[:5]:  # Display sample
        print(f"📰 {article['title']}")
        print(f"🔗 {article['link']}\n")

    # Be polite — wait a random delay before next request
    time.sleep(random.uniform(1, 3))

🧩 Key Improvements Over a Bare-Bones Scraper

  • Added a custom User‑Agent header so requests identify themselves to the server.
  • Included error handling for failed requests.
  • Added CSV export with pandas.
  • Used a randomized delay to avoid overwhelming servers.

📈 4. Handling Pagination (Multiple Pages)

Many websites organize content across pages (e.g., /page/2, /page/3). You can automate fetching multiple pages:

base_url = "https://example-news-site.com/page/"
all_articles = []

for page in range(1, 6):  # scrape first 5 pages
    print(f"Scraping page {page}...")
    html = fetch_page(f"{base_url}{page}")
    if html:
        all_articles.extend(parse_headlines(html))
    time.sleep(random.uniform(1, 3))  # polite delay

save_to_csv(all_articles, "all_headlines.csv")

📊 5. Exporting to JSON

import json

with open("headlines.json", "w", encoding="utf-8") as f:
    json.dump(all_articles, f, ensure_ascii=False, indent=4)
print("✅ Saved data to headlines.json")

🔒 6. Ethics & Legality of Web Scraping

Web scraping should always follow ethical and legal standards.

Best Practice            | Description
Check robots.txt         | Each site's /robots.txt defines which pages allow automated access.
Respect Terms of Service | Avoid scraping content in ways that violate the site's usage policies.
Rate Limiting            | Add random delays to avoid overloading servers.
Avoid Personal Data      | Never collect sensitive or private user information.
Credit Sources           | Cite data sources when republishing results.

Example — check permissions:

import requests
print(requests.get("https://www.example-news-site.com/robots.txt").text)
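
Printing robots.txt is useful for a manual check, but the standard library's urllib.robotparser can also answer whether a specific URL may be fetched. A small sketch, reusing the bot name from the User-Agent header in fetch_page above:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.example-news-site.com/robots.txt")
rp.read()  # download and parse the robots.txt file

# can_fetch() returns True if the given user agent is allowed to access the URL
allowed = rp.can_fetch("WebScraperBot", "https://www.example-news-site.com/page/2")
print("Allowed to scrape:", allowed)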

🧰 7. When to Use Selenium or Playwright

If the site uses JavaScript to load data dynamically, static scraping won’t work.
Use automation frameworks that simulate a real browser:

# Example with Selenium (optional)
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()        # launches a real Chrome window (recent Selenium versions manage the driver automatically)
driver.get("https://example.com")  # the browser runs the page's JavaScript while loading
soup = BeautifulSoup(driver.page_source, "html.parser")  # parse the fully rendered HTML
driver.quit()                      # always close the browser when finished
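
Dynamic pages often finish rendering a moment after the initial load, so it is usually safer to wait for a specific element before reading page_source. A sketch using Selenium's explicit waits; the h2.headline selector is a placeholder and should match the real site's markup:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://example.com")

# Wait up to 10 seconds for at least one headline element to appear.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "h2.headline"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
headlines = [h.get_text(strip=True) for h in soup.select("h2.headline")]
driver.quit()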

💡 8. Practical Use Cases

Category     | Example
E‑commerce   | Price tracking, product catalog monitoring
News & Media | Headline aggregation, trend detection
Research     | Collecting academic abstracts or datasets
Social Media | Hashtag monitoring, public post tracking
Real Estate  | Listing aggregation, price analysis

🚀 9. Takeaways

  • Use BeautifulSoup for simple static websites.
  • Always handle errors and request delays gracefully.
  • Respect robots.txt and site policies.
  • For dynamic content, use Selenium or Playwright.
  • Store your data in structured formats like CSV or JSON.

🧭 Conclusion

Building a web scraper is a practical and valuable project that bridges programming, automation, and data analysis.
With a few lines of Python, you can extract, clean, and structure data from across the web — responsibly and efficiently.

“With great scraping power comes great responsibility.”