Chapter 16: Real‑World Projects — Building a Web Scraper

🕸️ Introduction to Web Scraping

Web scraping is the process of automatically extracting information from websites.
It’s an essential skill for gathering data for analysis, automating repetitive research, and building real‑time information dashboards.

Python offers several libraries for scraping, including:

  1. requests: sends HTTP requests and retrieves pages.
  2. BeautifulSoup (bs4): parses HTML into a navigable tree.
  3. Selenium and Playwright: drive a real browser for JavaScript-rendered pages.
  4. pandas: structures and exports the scraped data.

⚙️ 1. How Web Scraping Works

  1. Send an HTTP request to a web page (e.g., using requests).
  2. Retrieve the HTML response from the server.
  3. Parse and extract relevant content (like text, links, or images).
  4. Store the data in a structured format (CSV, JSON, or database).
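
These four steps map onto just a few lines of Python. Here is a minimal sketch of the whole pipeline, run against the placeholder site example.com:

import requests
from bs4 import BeautifulSoup

# 1. Send an HTTP request
response = requests.get("https://example.com", timeout=10)

# 2. Retrieve the HTML response
html = response.text

# 3. Parse and extract relevant content
soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").get_text(strip=True)
links = [a["href"] for a in soup.find_all("a", href=True)]

# 4. Store the data in a structured format
print({"title": title, "links": links})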

🧠 2. Static vs Dynamic Pages

| Type | Description | Tools |
|------|-------------|-------|
| Static Pages | Content is embedded directly in HTML. | requests + BeautifulSoup |
| Dynamic Pages | Content loads via JavaScript after page load. | Selenium, Playwright, or APIs |
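
Not sure which kind you are dealing with? A quick check (a small sketch; the test string is whatever text you can see in your browser): fetch the raw HTML with requests and search for that text. If it is missing, the content is most likely rendered by JavaScript.

import requests

html = requests.get("https://example.com", timeout=10).text
# "Example Domain" is visible in the browser; if it also appears in the
# raw HTML, the page is static enough for requests + BeautifulSoup.
print("Example Domain" in html)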

📰 3. Example — Scraping News Headlines with BeautifulSoup

Let’s build a practical scraper that extracts headlines and article links from a sample news website.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

def fetch_page(url):
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; WebScraperBot/1.0)'}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return response.content
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

def parse_headlines(html):
    soup = BeautifulSoup(html, 'html.parser')
    headlines = soup.find_all('h2', class_='headline')
    data = []
    for h in headlines:
        title = h.get_text(strip=True)
        anchor = h.find('a')
        link = anchor['href'] if anchor else None
        data.append({'title': title, 'link': link})
    return data

def save_to_csv(data, filename="headlines.csv"):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"✅ Saved {len(df)} records to {filename}")

# URL of the site to scrape
url = "https://www.example-news-site.com"

html = fetch_page(url)
if html:
    articles = parse_headlines(html)
    save_to_csv(articles)

    for article in articles[:5]:  # Display sample
        print(f"📰 {article['title']}")
        print(f"🔗 {article['link']}\n")

    # Be polite — wait a random delay before next request
    time.sleep(random.uniform(1, 3))

🧩 Key Improvements Over the Simple Example

  1. A descriptive User-Agent header identifies the bot honestly.
  2. raise_for_status() plus try/except handles HTTP and network errors gracefully.
  3. A request timeout prevents the scraper from hanging indefinitely.
  4. Random delays between requests reduce load on the server.
  5. Results are saved to a structured CSV file via pandas.

📈 4. Handling Pagination (Multiple Pages)

Many websites organize content across pages (e.g., /page/2, /page/3). You can automate fetching multiple pages:

base_url = "https://example-news-site.com/page/"
all_articles = []

for page in range(1, 6):  # scrape first 5 pages
    print(f"Scraping page {page}...")
    html = fetch_page(f"{base_url}{page}")
    if html:
        all_articles.extend(parse_headlines(html))
    time.sleep(random.uniform(1, 3))  # polite delay

save_to_csv(all_articles, "all_headlines.csv")
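
In practice you rarely know the page count in advance. One approach, reusing the fetch_page and parse_headlines helpers from above, is to keep going until a page yields no headlines:

page = 1
all_articles = []

while True:
    html = fetch_page(f"{base_url}{page}")
    articles = parse_headlines(html) if html else []
    if not articles:  # empty or failed page; assume we've reached the end
        break
    all_articles.extend(articles)
    page += 1
    time.sleep(random.uniform(1, 3))  # polite delay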

📊 5. Exporting to JSON

import json

with open("headlines.json", "w", encoding="utf-8") as f:
    json.dump(all_articles, f, ensure_ascii=False, indent=4)
print("✅ Saved data to headlines.json")

🔒 6. Ethics & Legality of Web Scraping

Web scraping should always follow ethical and legal standards.

| Best Practice | Description |
|---------------|-------------|
| Check robots.txt | Each site's /robots.txt defines which pages are allowed for automated access. |
| Respect Terms of Service | Avoid scraping content that violates usage policies. |
| Rate Limiting | Add random delays to avoid overloading servers. |
| Avoid Personal Data | Never collect sensitive or private user info. |
| Credit Sources | Cite data sources when republishing results. |

Example — check permissions:

import requests
print(requests.get("https://www.example-news-site.com/robots.txt").text)
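
Printing the file is a start, but Python's standard library can also answer the question programmatically. A small sketch using urllib.robotparser (the site URL is a placeholder):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.example-news-site.com/robots.txt")
rp.read()  # download and parse the rules

# May our bot's User-Agent fetch this path?
allowed = rp.can_fetch("WebScraperBot", "https://www.example-news-site.com/page/2")
print("Allowed" if allowed else "Disallowed")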

🧰 7. When to Use Selenium or Playwright

If the site uses JavaScript to load data dynamically, static scraping won’t work.
Use automation frameworks that simulate a real browser:

# Example with Selenium (optional)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://example.com")
# Wait (up to 10 s) for the dynamically rendered content to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "h1"))
)
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
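
Playwright offers the same capability with a more modern API. A brief sketch using its synchronous mode (after pip install playwright and playwright install):

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")  # waits for the page to finish loading
    soup = BeautifulSoup(page.content(), "html.parser")
    browser.close()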

💡 8. Practical Use Cases

| Category | Example |
|----------|---------|
| E‑commerce | Price tracking, product catalog monitoring |
| News & Media | Headline aggregation, trend detection |
| Research | Collecting academic abstracts or datasets |
| Social Media | Hashtag monitoring, public post tracking |
| Real Estate | Listing aggregation, price analysis |

🚀 9. Takeaways

  1. Use requests + BeautifulSoup for static pages; switch to Selenium or Playwright when content loads via JavaScript.
  2. Make requests robust: set a User-Agent, use timeouts, and handle errors with raise_for_status() and try/except.
  3. Be polite: add random delays, check robots.txt, and respect each site's terms of service.
  4. Store results in structured formats (CSV, JSON, or a database) so they are ready for analysis.

🧭 Conclusion

Building a web scraper is a practical and valuable project that bridges programming, automation, and data analysis.
With a few lines of Python, you can extract, clean, and structure data from across the web — responsibly and efficiently.

“With great scraping power comes great responsibility.”