Chapter 16: Real‑World Projects — Building a Web Scraper
🕸️ Introduction to Web Scraping
Web scraping is the process of automatically extracting information from websites.
It’s an essential skill for gathering data for analysis, automating repetitive research, and building real‑time information dashboards.
Python offers several libraries for scraping, including:
- requests → for making HTTP requests
- BeautifulSoup → for parsing and navigating HTML
- pandas → for organizing and exporting the collected data
- Selenium / Playwright → for handling dynamic JavaScript content
⚙️ 1. How Web Scraping Works
- Send an HTTP request to a web page (e.g., using requests).
- Retrieve the HTML response from the server.
- Parse and extract relevant content (like text, links, or images).
- Store the data in a structured format (CSV, JSON, or database).
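These four steps map onto only a few lines of Python. The minimal sketch below walks through them end to end; the URL, the `article h2` selector, and the output filename are placeholders for illustration, not a real site:

```python
import csv
import requests
from bs4 import BeautifulSoup

# Step 1: send an HTTP request (placeholder URL)
response = requests.get("https://example.com/articles", timeout=10)
response.raise_for_status()

# Step 2: the HTML response returned by the server
html = response.text

# Step 3: parse the HTML and extract the content we care about
soup = BeautifulSoup(html, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.select("article h2")]

# Step 4: store the data in a structured format (CSV here)
with open("titles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])
    writer.writerows([t] for t in titles)
```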
🧠 2. Static vs Dynamic Pages
| Type | Description | Tools |
|---|---|---|
| Static Pages | Content is embedded directly in HTML. | requests + BeautifulSoup |
| Dynamic Pages | Content loads via JavaScript after page load. | Selenium, Playwright, or APIs |
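A quick, rough way to tell the two apart is to fetch the raw HTML with requests and check whether text you can see in the browser actually appears in it. In this sketch both the URL and the search phrase are placeholders you would substitute yourself:

```python
import requests

url = "https://example.com/listings"   # placeholder URL
visible_text = "Sample headline"       # a phrase you can see in your browser

html = requests.get(url, timeout=10).text

if visible_text in html:
    print("Found in raw HTML -> likely static (requests + BeautifulSoup is enough)")
else:
    print("Missing from raw HTML -> likely rendered by JavaScript (consider Selenium, Playwright, or an API)")
```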
📰 3. Example — Scraping News Headlines with BeautifulSoup
Let’s build a practical scraper that extracts headlines and article links from a sample news website.
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random


def fetch_page(url):
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; WebScraperBot/1.0)'}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return response.content
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None


def parse_headlines(html):
    soup = BeautifulSoup(html, 'html.parser')
    headlines = soup.find_all('h2', class_='headline')
    data = []
    for h in headlines:
        title = h.get_text(strip=True)
        link = h.find('a')['href'] if h.find('a') else None
        data.append({'title': title, 'link': link})
    return data


def save_to_csv(data, filename="headlines.csv"):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"✅ Saved {len(df)} records to {filename}")


# URL of the site to scrape
url = "https://www.example-news-site.com"
html = fetch_page(url)

if html:
    articles = parse_headlines(html)
    save_to_csv(articles)
    for article in articles[:5]:  # Display sample
        print(f"📰 {article['title']}")
        print(f"🔗 {article['link']}\n")

# Be polite — wait a random delay before next request
time.sleep(random.uniform(1, 3))
```
🧩 Key Improvements Over the Simple Example
- Added user‑agent headers to mimic a real browser.
- Included error handling for failed requests (a retry variant is sketched after this list).
- Added CSV export with pandas.
- Used randomized delay to avoid overwhelming servers.
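Building on the error handling and polite delays above, a small retry wrapper with exponential backoff is a common next step. This is an optional sketch, not part of the original example; the fetch_page_with_retries name and the backoff values are illustrative:

```python
import time
import random
import requests

def fetch_page_with_retries(url, max_retries=3):
    """Retry transient failures with an increasing, slightly randomized delay."""
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; WebScraperBot/1.0)'}
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response.content
        except requests.RequestException as e:
            print(f"Attempt {attempt} failed for {url}: {e}")
            if attempt < max_retries:
                # back off: 2, 4, 8... seconds plus a little jitter
                time.sleep(2 ** attempt + random.uniform(0, 1))
    return None
```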
📈 4. Handling Pagination (Multiple Pages)
Many websites organize content across pages (e.g., /page/2, /page/3). You can automate fetching multiple pages:
```python
base_url = "https://example-news-site.com/page/"
all_articles = []

for page in range(1, 6):  # scrape first 5 pages
    print(f"Scraping page {page}...")
    html = fetch_page(f"{base_url}{page}")
    if html:
        all_articles.extend(parse_headlines(html))
    time.sleep(random.uniform(1, 3))  # polite delay

save_to_csv(all_articles, "all_headlines.csv")
```
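If you don't know the page count in advance, a common variation is to keep fetching until a page yields no headlines. This sketch reuses the fetch_page and parse_headlines functions defined earlier; the stop condition (an empty page) is an assumption about how the site behaves:

```python
page = 1
all_articles = []

while True:
    html = fetch_page(f"{base_url}{page}")
    articles = parse_headlines(html) if html else []
    if not articles:  # an empty page (or a failed request) ends the crawl
        break
    all_articles.extend(articles)
    page += 1
    time.sleep(random.uniform(1, 3))  # polite delay between pages

save_to_csv(all_articles, "all_headlines.csv")
```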
📊 5. Exporting to JSON
```python
import json

with open("headlines.json", "w", encoding="utf-8") as f:
    json.dump(all_articles, f, ensure_ascii=False, indent=4)

print("✅ Saved data to headlines.json")
```
🔒 6. Ethics & Legality of Web Scraping
Web scraping should always follow ethical and legal standards.
| Best Practice | Description |
|---|---|
| Check robots.txt | Each site’s /robots.txt defines which pages are allowed for automated access. |
| Respect Terms of Service | Avoid scraping content that violates usage policies. |
| Rate Limiting | Add random delays to avoid overloading servers. |
| Avoid Personal Data | Never collect sensitive or private user info. |
| Credit Sources | Cite data sources when republishing results. |
Example — check permissions:
```python
import requests

print(requests.get("https://www.example-news-site.com/robots.txt").text)
```
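The standard library can also evaluate the rules programmatically instead of just printing them. A small sketch using urllib.robotparser, with the same placeholder site and the bot name from the User-Agent header used earlier:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.example-news-site.com/robots.txt")
rp.read()

# May our bot fetch this specific path?
print(rp.can_fetch("WebScraperBot", "https://www.example-news-site.com/page/2"))
```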
🧰 7. When to Use Selenium or Playwright
If the site uses JavaScript to load data dynamically, static scraping won’t work.
Use automation frameworks that simulate a real browser:
```python
# Example with Selenium (optional)
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://example.com")
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
```
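For comparison, roughly the same flow with Playwright's synchronous API looks like this (assuming `pip install playwright` and `playwright install` have been run; the URL is the same placeholder):

```python
# Example with Playwright (optional)
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    soup = BeautifulSoup(page.content(), "html.parser")
    browser.close()
```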
💡 8. Practical Use Cases
| Category | Example |
|---|---|
| E‑commerce | Price tracking, product catalog monitoring |
| News & Media | Headline aggregation, trend detection |
| Research | Collecting academic abstracts or datasets |
| Social Media | Hashtag monitoring, public post tracking |
| Real Estate | Listing aggregation, price analysis |
🚀 9. Takeaways
- Use BeautifulSoup for simple static websites.
- Always handle errors and request delays gracefully.
- Respect robots.txt and site policies.
- For dynamic content, use Selenium or Playwright.
- Store your data in structured formats like CSV or JSON.
🧭 Conclusion
Building a web scraper is a practical and valuable project that bridges programming, automation, and data analysis.
With a few lines of Python, you can extract, clean, and structure data from across the web — responsibly and efficiently.
“With great scraping power comes great responsibility.”