Chapter 16: Real‑World Projects — Building a Web Scraper
🕸️ Introduction to Web Scraping
Web scraping is the process of automatically extracting information from websites.
It’s an essential skill for gathering data for analysis, automating repetitive research, and building real‑time information dashboards.
Python offers several libraries for scraping, including:
- requests → for making HTTP requests
- BeautifulSoup → for parsing and navigating HTML
- pandas → for organizing and exporting the collected data
- Selenium / Playwright → for handling dynamic JavaScript content
⚙️ 1. How Web Scraping Works
- Send an HTTP request to a web page (e.g., using requests).
- Retrieve the HTML response from the server.
- Parse and extract relevant content (like text, links, or images).
- Store the data in a structured format (CSV, JSON, or database).
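These four steps map onto only a few lines of Python. The minimal sketch below walks through them end to end; the URL, the `article h2` selector, and the output filename are placeholders for illustration, not a real site:

```python
import csv
import requests
from bs4 import BeautifulSoup

# Step 1: send an HTTP request (placeholder URL)
response = requests.get("https://example.com/articles", timeout=10)
response.raise_for_status()

# Step 2: the HTML response returned by the server
html = response.text

# Step 3: parse the HTML and extract the content we care about
soup = BeautifulSoup(html, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.select("article h2")]

# Step 4: store the data in a structured format (CSV here)
with open("titles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])
    writer.writerows([t] for t in titles)
```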
🧠 2. Static vs Dynamic Pages
| Type | Description | Tools |
|---|---|---|
| Static Pages | Content is embedded directly in HTML. | requests + BeautifulSoup |
| Dynamic Pages | Content loads via JavaScript after page load. | Selenium, Playwright, or APIs |
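A quick, rough way to tell the two apart is to fetch the raw HTML with requests and check whether text you can see in the browser actually appears in it. In this sketch both the URL and the search phrase are placeholders you would substitute yourself:

```python
import requests

url = "https://example.com/listings"   # placeholder URL
visible_text = "Sample headline"       # a phrase you can see in your browser

html = requests.get(url, timeout=10).text

if visible_text in html:
    print("Found in raw HTML -> likely static (requests + BeautifulSoup is enough)")
else:
    print("Missing from raw HTML -> likely rendered by JavaScript (consider Selenium, Playwright, or an API)")
```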
📰 3. Example — Scraping News Headlines with BeautifulSoup
Let’s build a practical scraper that extracts headlines and article links from a sample news website.
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random


def fetch_page(url):
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; WebScraperBot/1.0)'}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return response.content
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None


def parse_headlines(html):
    soup = BeautifulSoup(html, 'html.parser')
    headlines = soup.find_all('h2', class_='headline')
    data = []
    for h in headlines:
        title = h.get_text(strip=True)
        link = h.find('a')['href'] if h.find('a') else None
        data.append({'title': title, 'link': link})
    return data


def save_to_csv(data, filename="headlines.csv"):
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"✅ Saved {len(df)} records to {filename}")


# URL of the site to scrape
url = "https://www.example-news-site.com"
html = fetch_page(url)

if html:
    articles = parse_headlines(html)
    save_to_csv(articles)
    for article in articles[:5]:  # Display sample
        print(f"📰 {article['title']}")
        print(f"🔗 {article['link']}\n")

# Be polite — wait a random delay before next request
time.sleep(random.uniform(1, 3))
```
🧩 Key Improvements Over the Simple Example
- Added user‑agent headers to mimic a real browser.
- Included error handling for failed requests (a retry variant is sketched after this list).
- Added CSV export with pandas.
- Used randomized delay to avoid overwhelming servers.
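Building on the error handling and polite delays above, a small retry wrapper with exponential backoff is a common next step. This is an optional sketch, not part of the original example; the fetch_page_with_retries name and the backoff values are illustrative:

```python
import time
import random
import requests

def fetch_page_with_retries(url, max_retries=3):
    """Retry transient failures with an increasing, slightly randomized delay."""
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; WebScraperBot/1.0)'}
    for attempt in range(1, max_retries + 1):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response.content
        except requests.RequestException as e:
            print(f"Attempt {attempt} failed for {url}: {e}")
            if attempt < max_retries:
                # back off: 2, 4, 8... seconds plus a little jitter
                time.sleep(2 ** attempt + random.uniform(0, 1))
    return None
```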
📈 4. Handling Pagination (Multiple Pages)
Many websites organize content across pages (e.g., /page/2, /page/3). You can automate fetching multiple pages:
```python
base_url = "https://example-news-site.com/page/"
all_articles = []

for page in range(1, 6):  # scrape first 5 pages
    print(f"Scraping page {page}...")
    html = fetch_page(f"{base_url}{page}")
    if html:
        all_articles.extend(parse_headlines(html))
    time.sleep(random.uniform(1, 3))  # polite delay

save_to_csv(all_articles, "all_headlines.csv")
```
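If you don't know the page count in advance, a common variation is to keep fetching until a page yields no headlines. This sketch reuses the fetch_page and parse_headlines functions defined earlier; the stop condition (an empty page) is an assumption about how the site behaves:

```python
page = 1
all_articles = []

while True:
    html = fetch_page(f"{base_url}{page}")
    articles = parse_headlines(html) if html else []
    if not articles:  # an empty page (or a failed request) ends the crawl
        break
    all_articles.extend(articles)
    page += 1
    time.sleep(random.uniform(1, 3))  # polite delay between pages

save_to_csv(all_articles, "all_headlines.csv")
```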
📊 5. Exporting to JSON
```python
import json

with open("headlines.json", "w", encoding="utf-8") as f:
    json.dump(all_articles, f, ensure_ascii=False, indent=4)

print("✅ Saved data to headlines.json")
```
🔒 6. Ethics & Legality of Web Scraping
Web scraping should always follow ethical and legal standards.
| Best Practice | Description |
|---|---|
| Check robots.txt | Each site’s /robots.txt defines which pages are allowed for automated access. |
| Respect Terms of Service | Avoid scraping content that violates usage policies. |
| Rate Limiting | Add random delays to avoid overloading servers. |
| Avoid Personal Data | Never collect sensitive or private user info. |
| Credit Sources | Cite data sources when republishing results. |
Example — check permissions:
```python
import requests

print(requests.get("https://www.example-news-site.com/robots.txt").text)
```
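The standard library can also evaluate the rules programmatically instead of just printing them. A small sketch using urllib.robotparser, with the same placeholder site and the bot name from the User-Agent header used earlier:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.example-news-site.com/robots.txt")
rp.read()

# May our bot fetch this specific path?
print(rp.can_fetch("WebScraperBot", "https://www.example-news-site.com/page/2"))
```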
🧰 7. When to Use Selenium or Playwright
If the site uses JavaScript to load data dynamically, static scraping won’t work.
Use automation frameworks that simulate a real browser:
```python
# Example with Selenium (optional)
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://example.com")
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
```
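For comparison, roughly the same flow with Playwright's synchronous API looks like this (assuming `pip install playwright` and `playwright install` have been run; the URL is the same placeholder):

```python
# Example with Playwright (optional)
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    soup = BeautifulSoup(page.content(), "html.parser")
    browser.close()
```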
💡 8. Practical Use Cases
| Category | Example |
|---|---|
| E‑commerce | Price tracking, product catalog monitoring |
| News & Media | Headline aggregation, trend detection |
| Research | Collecting academic abstracts or datasets |
| Social Media | Hashtag monitoring, public post tracking |
| Real Estate | Listing aggregation, price analysis |
🚀 9. Takeaways
- Use BeautifulSoup for simple static websites.
- Always handle errors and request delays gracefully.
- Respect robots.txt and site policies.
- For dynamic content, use Selenium or Playwright.
- Store your data in structured formats like CSV or JSON.
🧭 Conclusion
Building a web scraper is a practical and valuable project that bridges programming, automation, and data analysis.
With a few lines of Python, you can extract, clean, and structure data from across the web — responsibly and efficiently.
“With great scraping power comes great responsibility.”