Python Web Scraping: Complete Tutorial for Data Collection

Master web scraping with Python. Learn BeautifulSoup, requests, handling pagination, extracting data from HTML, avoiding blocks, and ethical scraping practices with practical examples.

📅 Published: October 18, 2025 ✏️ Updated: October 28, 2025 By Ojaswi Athghara
#web-scraping #python #beautifulsoup #requests #data-collection #tutorial


My First Web Scraping Project

"We need product prices from 50 competitor websites." My boss dropped this bombshell on a Monday morning. Manual copying? That would take days. Then I discovered web scraping.

What seemed impossible became a 100-line Python script running automatically every morning. Web scraping transformed me from a data consumer to a data creator.

In this comprehensive tutorial, you'll learn everything about web scraping—from fetching HTML to handling complex scenarios. By the end, you'll build real scrapers that collect data automatically.

What is Web Scraping?

Web scraping is programmatically extracting data from websites. Instead of manually copying information, you write code that:

  • Fetches web pages
  • Parses HTML structure
  • Extracts specific data
  • Saves it in structured format (CSV, JSON, database)
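
To make those four steps concrete, here's the whole pipeline in miniature. It fetches a page, parses the HTML, pulls out the title, and writes the result to a JSON file. The URL is a placeholder; the rest of the tutorial fills in the details.

import json
import requests
from bs4 import BeautifulSoup

# 1. Fetch the web page
response = requests.get('https://example.com', timeout=10)

# 2. Parse the HTML structure
soup = BeautifulSoup(response.text, 'html.parser')

# 3. Extract specific data
data = {'title': soup.title.string if soup.title else None}

# 4. Save it in a structured format
with open('page_data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=2)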

When to Use Web Scraping

  • Competitor price monitoring
  • News aggregation
  • Real estate listings
  • Job postings
  • Product reviews
  • Research data collection
  • Market research

Important: Always check robots.txt and terms of service. Respect rate limits. Scrape ethically!

Essential Tools

# Install required libraries
# pip install requests beautifulsoup4 lxml pandas

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from urllib.parse import urljoin

Your First Scraper

Fetching a Web Page

# Simple GET request
url = 'https://example.com'

response = requests.get(url)
print(f"Status Code: {response.status_code}")
print(f"Content Length: {len(response.content)}")

# Get HTML content
html = response.text
print(html[:500])  # First 500 characters

Parsing HTML

# Parse with BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # or 'html.parser'

# Pretty print HTML
print(soup.prettify()[:1000])

# Extract title
title = soup.title.string
print(f"Page Title: {title}")

# Find specific tags
h1 = soup.find('h1')
if h1:
    print(f"Main Heading: {h1.text}")

# Find all links
links = soup.find_all('a')
print(f"Found {len(links)} links")

Selecting Elements

Basic Selectors

# By tag name
paragraphs = soup.find_all('p')
print(f"Paragraphs: {len(paragraphs)}")

# By class
items = soup.find_all(class_='product-item')

# By ID
header = soup.find(id='header')

# By attribute
links = soup.find_all('a', href=True)

# Match any of several classes (class1 or class2)
elements = soup.find_all(class_=['class1', 'class2'])

CSS Selectors

# CSS selector (more powerful)
products = soup.select('.product-card')

# Nested selection
prices = soup.select('.product-card .price')

# Attribute selectors
email_links = soup.select('a[href^="mailto:"]')

# Multiple selectors
elements = soup.select('div.content, div.main')

# nth-child (select_one returns the first match or None)
first_item = soup.select_one('.list-item:nth-child(1)')

Extracting Data

Text Content

# Get text
element = soup.find('h1')
text = element.text  # or element.get_text()
print(text)

# Clean whitespace
clean_text = element.get_text(strip=True)

# Navigate tree
parent = element.parent
siblings = element.find_next_siblings()

Attributes

# Get attribute values
link = soup.find('a')
href = link.get('href')  # or link['href']
title = link.get('title', 'No title')  # Default value

# All attributes
attrs = link.attrs
print(attrs)  # Dictionary of attributes

# Check if attribute exists
if link.has_attr('target'):
    print("Link has target attribute")

Real Example: Scraping Quotes

def scrape_quotes():
    """Scrape quotes from quotes.toscrape.com"""
    url = 'http://quotes.toscrape.com/'
    
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    
    quotes_data = []
    
    # Find all quote containers
    quotes = soup.find_all('div', class_='quote')
    
    for quote in quotes:
        # Extract text
        text = quote.find('span', class_='text').text
        
        # Extract author
        author = quote.find('small', class_='author').text
        
        # Extract tags
        tags = [tag.text for tag in quote.find_all('a', class_='tag')]
        
        quotes_data.append({
            'quote': text,
            'author': author,
            'tags': ', '.join(tags)
        })
    
    return quotes_data

# Run scraper
quotes = scrape_quotes()
print(f"Scraped {len(quotes)} quotes")

# Display results
for q in quotes[:3]:
    print(f"\n\"{q['quote']}\"")
    print(f"- {q['author']}")
    print(f"Tags: {q['tags']}")

# Save to CSV
df = pd.DataFrame(quotes)
df.to_csv('quotes.csv', index=False)
print("\nSaved to quotes.csv")

Handling Pagination

def scrape_multiple_pages(base_url, max_pages=5):
    """Scrape data from multiple pages"""
    all_data = []
    
    for page_num in range(1, max_pages + 1):
        url = f"{base_url}/page/{page_num}"
        print(f"Scraping page {page_num}...")
        
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            
            soup = BeautifulSoup(response.text, 'lxml')
            
            # Extract data (example)
            items = soup.find_all('div', class_='item')
            
            for item in items:
                # Extract details
                data = {
                    'title': item.find('h2').text if item.find('h2') else None,
                    'price': item.find(class_='price').text if item.find(class_='price') else None
                }
                all_data.append(data)
            
            # Be polite - wait between requests
            time.sleep(1)
            
        except requests.exceptions.RequestException as e:
            print(f"Error on page {page_num}: {e}")
            break
    
    return all_data
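
A fixed max_pages cap works, but many sites expose a "Next" link you can simply follow until it disappears. Here's a sketch of that pattern; the li.next a selector matches quotes.toscrape.com and will need adjusting for other sites.

def scrape_until_last_page(start_url):
    """Follow the 'Next' link until there isn't one"""
    all_data = []
    url = start_url
    
    while url:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'lxml')
        
        # Collect quote text from the current page
        all_data.extend(div.get_text(strip=True)
                        for div in soup.find_all('div', class_='quote'))
        
        # Look for a 'Next' link; stop when it's gone
        next_link = soup.select_one('li.next a')
        url = urljoin(url, next_link['href']) if next_link else None
        
        time.sleep(1)  # Be polite
    
    return all_data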

Following Links to Detail Pages

def scrape_with_detail_pages(list_url):
    """Scrape list page and follow links to detail pages"""
    response = requests.get(list_url)
    soup = BeautifulSoup(response.text, 'lxml')
    
    all_data = []
    
    # Find all product links
    product_links = soup.select('.product-card a')
    
    for link in product_links[:5]:  # Limit to 5 for demo
        detail_url = urljoin(list_url, link['href'])
        print(f"Scraping: {detail_url}")
        
        # Fetch detail page
        detail_response = requests.get(detail_url)
        detail_soup = BeautifulSoup(detail_response.text, 'lxml')
        
        # Extract detailed information
        data = {
            'title': detail_soup.find('h1').text if detail_soup.find('h1') else None,
            'description': detail_soup.find('div', class_='description').text if detail_soup.find('div', class_='description') else None,
            # Add more fields as needed
        }
        
        all_data.append(data)
        time.sleep(1)  # Be polite
    
    return all_data

Handling Common Challenges

User Agent

# Add headers to mimic browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

response = requests.get(url, headers=headers)
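
If you're making many requests, a requests.Session lets you set the headers once and reuses the underlying connection. A small convenience sketch (it reduces boilerplate; it's not a guarantee against blocking):

# Reuse one session so every request carries the same headers
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9'
})

response = session.get(url, timeout=10)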

Timeouts and Retries

from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def get_session_with_retries():
    """Create session with automatic retries"""
    session = requests.Session()
    
    retry = Retry(
        total=3,
        backoff_factor=0.5,
        status_forcelist=[500, 502, 503, 504]
    )
    
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    
    return session

# Use it
session = get_session_with_retries()
response = session.get(url, timeout=10)

Rate Limiting

import time
from datetime import datetime

class RateLimiter:
    def __init__(self, max_requests, time_window):
        self.max_requests = max_requests
        self.time_window = time_window  # in seconds
        self.requests = []
    
    def wait_if_needed(self):
        now = datetime.now()
        
        # Remove old requests outside time window
        self.requests = [req_time for req_time in self.requests 
                        if (now - req_time).total_seconds() < self.time_window]
        
        if len(self.requests) >= self.max_requests:
            sleep_time = self.time_window - (now - self.requests[0]).total_seconds()
            if sleep_time > 0:
                print(f"Rate limit reached. Waiting {sleep_time:.2f} seconds...")
                time.sleep(sleep_time)
                now = datetime.now()  # refresh the timestamp after waiting
        
        self.requests.append(now)

# Usage: 10 requests per minute
limiter = RateLimiter(max_requests=10, time_window=60)

for url in urls:  # urls: your list of target URLs
    limiter.wait_if_needed()
    response = requests.get(url)

Saving Data

CSV

import csv

data = [
    {'name': 'Product 1', 'price': 100},
    {'name': 'Product 2', 'price': 200}
]

# Write to CSV
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'price'])
    writer.writeheader()
    writer.writerows(data)

JSON

import json

# Write to JSON
with open('products.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

# Append to JSON file
try:
    with open('products.json', 'r') as f:
        existing_data = json.load(f)
except FileNotFoundError:
    existing_data = []

existing_data.extend(data)

with open('products.json', 'w') as f:
    json.dump(existing_data, f, indent=2)
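
SQLite

Flat files are fine for small scrapes, but the intro also mentioned databases. Here's a minimal sketch using Python's built-in sqlite3 module; the table and column names are illustrative and match the data list above.

import sqlite3

# Connect (creates products.db if it doesn't exist)
conn = sqlite3.connect('products.db')
conn.execute('''
    CREATE TABLE IF NOT EXISTS products (
        name TEXT,
        price REAL
    )
''')

# Insert scraped rows using named placeholders
conn.executemany(
    'INSERT INTO products (name, price) VALUES (:name, :price)',
    data  # list of dicts like {'name': ..., 'price': ...}
)
conn.commit()
conn.close()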

Complete Project: Job Scraper

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from urllib.parse import urljoin  # needed to build absolute job URLs

class JobScraper:
    def __init__(self):
        self.base_url = "https://example-jobs.com"
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
    
    def fetch_page(self, url):
        """Fetch and parse a page"""
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            return BeautifulSoup(response.text, 'lxml')
        except Exception as e:
            print(f"Error fetching {url}: {e}")
            return None
    
    def extract_job_data(self, job_element):
        """Extract data from a job listing element"""
        try:
            title = job_element.find('h2', class_='job-title').text.strip()
            company = job_element.find('span', class_='company').text.strip()
            location = job_element.find('span', class_='location').text.strip()
            
            salary_elem = job_element.find('span', class_='salary')
            salary = salary_elem.text.strip() if salary_elem else 'Not specified'
            
            link = job_element.find('a')['href']
            full_link = urljoin(self.base_url, link)
            
            return {
                'title': title,
                'company': company,
                'location': location,
                'salary': salary,
                'url': full_link
            }
        except Exception as e:
            print(f"Error extracting job data: {e}")
            return None
    
    def scrape_jobs(self, search_term, max_pages=3):
        """Scrape jobs for a search term"""
        all_jobs = []
        
        for page in range(1, max_pages + 1):
            url = f"{self.base_url}/search?q={search_term}&page={page}"
            print(f"Scraping page {page}...")
            
            soup = self.fetch_page(url)
            if not soup:
                break
            
            jobs = soup.find_all('div', class_='job-listing')
            print(f"Found {len(jobs)} jobs on page {page}")
            
            for job in jobs:
                job_data = self.extract_job_data(job)
                if job_data:
                    all_jobs.append(job_data)
            
            time.sleep(2)  # Be respectful
        
        return all_jobs
    
    def save_results(self, jobs, filename='jobs.csv'):
        """Save results to CSV"""
        df = pd.DataFrame(jobs)
        df.to_csv(filename, index=False)
        print(f"Saved {len(jobs)} jobs to {filename}")

# Usage
scraper = JobScraper()
jobs = scraper.scrape_jobs('data scientist', max_pages=3)
scraper.save_results(jobs)

Best Practices

1. Respect robots.txt

from urllib.robotparser import RobotFileParser

def can_scrape(url):
    """Check if URL can be scraped according to robots.txt"""
    rp = RobotFileParser()
    rp.set_url(urljoin(url, '/robots.txt'))
    rp.read()
    
    return rp.can_fetch('*', url)

target_url = 'https://example.com/products'  # placeholder: the page you want to scrape

if can_scrape(target_url):
    # Proceed with scraping
    pass
else:
    print("Scraping not allowed by robots.txt")

2. Error Handling

def safe_scrape(url):
    """Scrape with comprehensive error handling"""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.text, 'lxml')
        return soup
        
    except requests.exceptions.Timeout:
        print("Request timed out")
    except requests.exceptions.HTTPError as e:
        print(f"HTTP Error: {e}")
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    
    return None

3. Logging

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('scraper.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)

logger.info("Starting scraper...")
logger.warning("Unusual response detected")
logger.error("Failed to parse element")

Debugging Your Scraper

When scraping breaks, here's how I troubleshoot:

Problem 1: Element Not Found

# Add debug prints
soup = BeautifulSoup(html, 'html.parser')

# Check if element exists
element = soup.find('div', class_='target')
if element:
    print(f"Found: {element}")
else:
    print("Element not found!")
    print(f"Page content: {soup.prettify()[:500]}")

Problem 2: Empty Results

# One cause: the parsers handle malformed HTML differently
soup_lxml = BeautifulSoup(html, 'lxml')  # Try this
soup_html = BeautifulSoup(html, 'html.parser')  # Or this

# Check both
print(f"lxml found: {len(soup_lxml.find_all('div'))}")
print(f"html.parser found: {len(soup_html.find_all('div'))}")

# If both find nothing, the content is likely loaded by JavaScript
# after the page loads; requests only sees the initial HTML

Problem 3: Changing Website Structure

# Use multiple selectors as fallback
def safe_extract(soup):
    """Try multiple ways to extract data"""
    # Method 1: Class name
    result = soup.find('span', class_='price')
    if result:
        return result.text
    
    # Method 2: Data attribute
    result = soup.find('span', {'data-test': 'price'})
    if result:
        return result.text
    
    # Method 3: CSS selector
    result = soup.select_one('[class*="price"]')
    if result:
        return result.text
    
    return None

Problem 4: Encoding Issues

# Handle different encodings
response = requests.get(url)
response.encoding = response.apparent_encoding  # Auto-detect
html = response.text

# Or specify manually
response.encoding = 'utf-8'

Ethical Scraping Practices

Always:

  • Check terms of service before scraping
  • Respect robots.txt directives
  • Use rate limiting to avoid overload
  • Identify yourself with User-Agent
  • Don't overwhelm servers with requests
  • Cache responses when testing (see the sketch after these lists)
  • Consider data privacy regulations (GDPR, CCPA)

Never:

  • Scrape personal data without consent
  • Ignore copyright laws
  • Bypass authentication or paywalls
  • Scrape at high frequency without permission
  • Republish scraped content without attribution
  • Scrape sites that explicitly prohibit it
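
The "cache responses when testing" point is easy to act on: save each page to disk the first time you fetch it and reuse the saved copy on later runs, so you aren't re-hitting the site while debugging your parsing code. A minimal standard-library sketch (the cache directory name is arbitrary):

import hashlib
from pathlib import Path

CACHE_DIR = Path('.scrape_cache')
CACHE_DIR.mkdir(exist_ok=True)

def fetch_cached(url):
    """Fetch a URL, reusing a cached copy from disk if one exists"""
    cache_file = CACHE_DIR / (hashlib.md5(url.encode()).hexdigest() + '.html')
    
    if cache_file.exists():
        return cache_file.read_text(encoding='utf-8')
    
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    cache_file.write_text(response.text, encoding='utf-8')
    return response.text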

Your Scraping Toolkit

You now know:

  1. Basics - Fetching and parsing HTML
  2. Selectors - Finding elements efficiently
  3. Navigation - Following links and pagination
  4. Challenges - Headers, retries, rate limiting
  5. Storage - Saving to CSV, JSON
  6. Projects - Building complete scrapers
  7. Ethics - Scraping responsibly

Web scraping opens doors to unlimited data. Use it wisely!

My Scraping Workflow Evolution

Beginner Stage (Weeks 1-4): Started with simple requests + BeautifulSoup scripts, copying examples from tutorials, breaking constantly when websites changed structure.

Intermediate Stage (Months 2-6): Added proper error handling, logging, and rate limiting. Learned to respect robots.txt and handle different response types. Built reusable scraper templates.

Current Stage: Design scrapers defensively with multiple fallback selectors, comprehensive logging, automatic retry logic, and data validation. Can adapt quickly when sites change structure.

The progression is natural—start simple, learn from failures, build robustness gradually. Don't try to build the perfect scraper on day one.

One Final Piece of Advice

Before scraping any website, ask yourself: "Is there an easier way?" Check for:

  • Official APIs (always better than scraping)
  • Existing datasets (Kaggle, data.gov, etc.)
  • Data export options (many sites let you download your data)
  • Third-party data providers (sometimes worth the cost)

Scraping should be your solution when better alternatives don't exist. It's powerful but requires maintenance. Choose wisely!

Next Steps

Remember: The best scraper is one you don't need—check if an API exists first!

