Python Web Scraping: Complete Tutorial for Data Collection
Master web scraping with Python. Learn BeautifulSoup, requests, handling pagination, extracting data from HTML, avoiding blocks, and ethical scraping practices with practical examples.

My First Web Scraping Project
"We need product prices from 50 competitor websites." My boss dropped this bombshell on a Monday morning. Manual copying? That would take days. Then I discovered web scraping.
What seemed impossible became a 100-line Python script running automatically every morning. Web scraping transformed me from a data consumer to a data creator.
In this comprehensive tutorial, you'll learn everything about web scraping, from fetching HTML to handling complex scenarios. By the end, you'll build real scrapers that collect data automatically.
What is Web Scraping?
Web scraping is programmatically extracting data from websites. Instead of manually copying information, you write code that:
- Fetches web pages
- Parses HTML structure
- Extracts specific data
- Saves it in structured format (CSV, JSON, database)
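To make those four steps concrete, here's a minimal end-to-end sketch using requests and BeautifulSoup (both covered in detail below). The target URL and output filename are just placeholders:
import csv
import requests
from bs4 import BeautifulSoup

# 1. Fetch the page (example.com is a placeholder target)
response = requests.get('https://example.com', timeout=10)

# 2. Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# 3. Extract specific data (here: the page title and every link URL)
rows = [{'title': soup.title.string, 'link': a['href']}
        for a in soup.find_all('a', href=True)]

# 4. Save it in a structured format (CSV)
with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'link'])
    writer.writeheader()
    writer.writerows(rows)
The rest of this tutorial unpacks each of these steps and adds the robustness a real scraper needs.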
When to Use Web Scraping
- Competitor price monitoring
- News aggregation
- Real estate listings
- Job postings
- Product reviews
- Research data collection
- Market research
Important: Always check robots.txt and terms of service. Respect rate limits. Scrape ethically!
Essential Tools
# Install required libraries
# pip install requests beautifulsoup4 lxml pandas
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from urllib.parse import urljoin
Your First Scraper
Fetching a Web Page
# Simple GET request
url = 'https://example.com'
response = requests.get(url)
print(f"Status Code: {response.status_code}")
print(f"Content Length: {len(response.content)}")
# Get HTML content
html = response.text
print(html[:500]) # First 500 characters
Parsing HTML
# Parse with BeautifulSoup
soup = BeautifulSoup(html, 'lxml') # or 'html.parser'
# Pretty print HTML
print(soup.prettify()[:1000])
# Extract title
title = soup.title.string
print(f"Page Title: {title}")
# Find specific tags
h1 = soup.find('h1')
if h1:
    print(f"Main Heading: {h1.text}")
# Find all links
links = soup.find_all('a')
print(f"Found {len(links)} links")
Selecting Elements
Basic Selectors
# By tag name
paragraphs = soup.find_all('p')
print(f"Paragraphs: {len(paragraphs)}")
# By class
items = soup.find_all(class_='product-item')
# By ID
header = soup.find(id='header')
# By attribute
links = soup.find_all('a', href=True)
# Multiple classes
elements = soup.find_all(class_=['class1', 'class2'])
CSS Selectors
# CSS selector (more powerful)
products = soup.select('.product-card')
# Nested selection
prices = soup.select('.product-card .price')
# Attribute selectors
email_links = soup.select('a[href^="mailto:"]')
# Multiple selectors
elements = soup.select('div.content, div.main')
# nth-child
first_item = soup.select('.list-item:nth-child(1)')
Extracting Data
Text Content
# Get text
element = soup.find('h1')
text = element.text # or element.get_text()
print(text)
# Clean whitespace
clean_text = element.get_text(strip=True)
# Navigate tree
parent = element.parent
siblings = element.find_next_siblings()
Attributes
# Get attribute values
link = soup.find('a')
href = link.get('href') # or link['href']
title = link.get('title', 'No title') # Default value
# All attributes
attrs = link.attrs
print(attrs) # Dictionary of attributes
# Check if attribute exists
if link.has_attr('target'):
    print("Link has target attribute")
Real Example: Scraping Quotes
def scrape_quotes():
    """Scrape quotes from quotes.toscrape.com"""
    url = 'http://quotes.toscrape.com/'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    quotes_data = []
    # Find all quote containers
    quotes = soup.find_all('div', class_='quote')
    for quote in quotes:
        # Extract text
        text = quote.find('span', class_='text').text
        # Extract author
        author = quote.find('small', class_='author').text
        # Extract tags
        tags = [tag.text for tag in quote.find_all('a', class_='tag')]
        quotes_data.append({
            'quote': text,
            'author': author,
            'tags': ', '.join(tags)
        })
    return quotes_data

# Run scraper
quotes = scrape_quotes()
print(f"Scraped {len(quotes)} quotes")

# Display results
for q in quotes[:3]:
    print(f"\n\"{q['quote']}\"")
    print(f"- {q['author']}")
    print(f"Tags: {q['tags']}")

# Save to CSV
df = pd.DataFrame(quotes)
df.to_csv('quotes.csv', index=False)
print("\nSaved to quotes.csv")
Handling Pagination
def scrape_multiple_pages(base_url, max_pages=5):
    """Scrape data from multiple pages"""
    all_data = []
    for page_num in range(1, max_pages + 1):
        url = f"{base_url}/page/{page_num}"
        print(f"Scraping page {page_num}...")
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'lxml')
            # Extract data (example)
            items = soup.find_all('div', class_='item')
            for item in items:
                # Extract details
                data = {
                    'title': item.find('h2').text if item.find('h2') else None,
                    'price': item.find(class_='price').text if item.find(class_='price') else None
                }
                all_data.append(data)
            # Be polite - wait between requests
            time.sleep(1)
        except requests.exceptions.RequestException as e:
            print(f"Error on page {page_num}: {e}")
            break
    return all_data
Following Links
def scrape_with_detail_pages(list_url):
    """Scrape list page and follow links to detail pages"""
    response = requests.get(list_url)
    soup = BeautifulSoup(response.text, 'lxml')
    all_data = []
    # Find all product links
    product_links = soup.select('.product-card a')
    for link in product_links[:5]:  # Limit to 5 for demo
        detail_url = urljoin(list_url, link['href'])
        print(f"Scraping: {detail_url}")
        # Fetch detail page
        detail_response = requests.get(detail_url)
        detail_soup = BeautifulSoup(detail_response.text, 'lxml')
        # Extract detailed information
        data = {
            'title': detail_soup.find('h1').text if detail_soup.find('h1') else None,
            'description': detail_soup.find('div', class_='description').text if detail_soup.find('div', class_='description') else None,
            # Add more fields as needed
        }
        all_data.append(data)
        time.sleep(1)  # Be polite
    return all_data
Handling Common Challenges
User Agent
# Add headers to mimic browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers)
Timeouts and Retries
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def get_session_with_retries():
    """Create session with automatic retries"""
    session = requests.Session()
    retry = Retry(
        total=3,
        backoff_factor=0.5,
        status_forcelist=[500, 502, 503, 504]
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session
# Use it
session = get_session_with_retries()
response = session.get(url, timeout=10)
Rate Limiting
import time
from datetime import datetime
class RateLimiter:
    def __init__(self, max_requests, time_window):
        self.max_requests = max_requests
        self.time_window = time_window  # in seconds
        self.requests = []

    def wait_if_needed(self):
        now = datetime.now()
        # Remove old requests outside time window
        self.requests = [req_time for req_time in self.requests
                         if (now - req_time).total_seconds() < self.time_window]
        if len(self.requests) >= self.max_requests:
            sleep_time = self.time_window - (now - self.requests[0]).total_seconds()
            if sleep_time > 0:
                print(f"Rate limit reached. Waiting {sleep_time:.2f} seconds...")
                time.sleep(sleep_time)
        self.requests.append(now)

# Usage: 10 requests per minute
limiter = RateLimiter(max_requests=10, time_window=60)
for url in urls:  # urls: your own list of target URLs
    limiter.wait_if_needed()
    response = requests.get(url)
Saving Data
CSV
import csv
data = [
    {'name': 'Product 1', 'price': 100},
    {'name': 'Product 2', 'price': 200}
]

# Write to CSV
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'price'])
    writer.writeheader()
    writer.writerows(data)
JSON
import json
# Write to JSON
with open('products.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=2, ensure_ascii=False)

# Append to JSON file
try:
    with open('products.json', 'r') as f:
        existing_data = json.load(f)
except FileNotFoundError:
    existing_data = []
existing_data.extend(data)
with open('products.json', 'w') as f:
    json.dump(existing_data, f, indent=2)
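Database
The overview at the top also listed databases as an output format. As a minimal sketch using Python's built-in sqlite3 module (the filename, table name, and schema here are illustrative assumptions, not a fixed convention):
import sqlite3

# Store the same list of dicts used in the CSV/JSON examples.
# 'products.db' and the table schema are arbitrary choices for this sketch.
conn = sqlite3.connect('products.db')
conn.execute('''CREATE TABLE IF NOT EXISTS products (
    name TEXT,
    price INTEGER
)''')
conn.executemany(
    'INSERT INTO products (name, price) VALUES (:name, :price)',
    data
)
conn.commit()
conn.close()
A database becomes worthwhile once you scrape repeatedly and need to query, deduplicate, or track changes over time; for one-off jobs, CSV or JSON is usually enough.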
Complete Project: Job Scraper
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
from urllib.parse import urljoin

class JobScraper:
    def __init__(self):
        self.base_url = "https://example-jobs.com"
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

    def fetch_page(self, url):
        """Fetch and parse a page"""
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()
            return BeautifulSoup(response.text, 'lxml')
        except Exception as e:
            print(f"Error fetching {url}: {e}")
            return None

    def extract_job_data(self, job_element):
        """Extract data from a job listing element"""
        try:
            title = job_element.find('h2', class_='job-title').text.strip()
            company = job_element.find('span', class_='company').text.strip()
            location = job_element.find('span', class_='location').text.strip()
            salary_elem = job_element.find('span', class_='salary')
            salary = salary_elem.text.strip() if salary_elem else 'Not specified'
            link = job_element.find('a')['href']
            full_link = urljoin(self.base_url, link)
            return {
                'title': title,
                'company': company,
                'location': location,
                'salary': salary,
                'url': full_link
            }
        except Exception as e:
            print(f"Error extracting job data: {e}")
            return None

    def scrape_jobs(self, search_term, max_pages=3):
        """Scrape jobs for a search term"""
        all_jobs = []
        for page in range(1, max_pages + 1):
            url = f"{self.base_url}/search?q={search_term}&page={page}"
            print(f"Scraping page {page}...")
            soup = self.fetch_page(url)
            if not soup:
                break
            jobs = soup.find_all('div', class_='job-listing')
            print(f"Found {len(jobs)} jobs on page {page}")
            for job in jobs:
                job_data = self.extract_job_data(job)
                if job_data:
                    all_jobs.append(job_data)
            time.sleep(2)  # Be respectful
        return all_jobs

    def save_results(self, jobs, filename='jobs.csv'):
        """Save results to CSV"""
        df = pd.DataFrame(jobs)
        df.to_csv(filename, index=False)
        print(f"Saved {len(jobs)} jobs to {filename}")

# Usage
scraper = JobScraper()
jobs = scraper.scrape_jobs('data scientist', max_pages=3)
scraper.save_results(jobs)
Best Practices
1. Respect robots.txt
from urllib.robotparser import RobotFileParser

def can_scrape(url):
    """Check if URL can be scraped according to robots.txt"""
    rp = RobotFileParser()
    rp.set_url(urljoin(url, '/robots.txt'))
    rp.read()
    return rp.can_fetch('*', url)

if can_scrape(target_url):
    # Proceed with scraping
    pass
else:
    print("Scraping not allowed by robots.txt")
2. Error Handling
def safe_scrape(url):
    """Scrape with comprehensive error handling"""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'lxml')
        return soup
    except requests.exceptions.Timeout:
        print("Request timed out")
    except requests.exceptions.HTTPError as e:
        print(f"HTTP Error: {e}")
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    return None
3. Logging
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('scraper.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)
logger.info("Starting scraper...")
logger.warning("Unusual response detected")
logger.error("Failed to parse element")
Debugging Your Scraper
When scraping breaks, here's how I troubleshoot:
Problem 1: Element Not Found
# Add debug prints
soup = BeautifulSoup(html, 'html.parser')
# Check if element exists
element = soup.find('div', class_='target')
if element:
    print(f"Found: {element}")
else:
    print("Element not found!")
    print(f"Page content: {soup.prettify()[:500]}")
Problem 2: Empty Results
# Common issue: wrong parser
soup_lxml = BeautifulSoup(html, 'lxml') # Try this
soup_html = BeautifulSoup(html, 'html.parser') # Or this
# Check both
print(f"lxml found: {len(soup_lxml.find_all('div'))}")
print(f"html.parser found: {len(soup_html.find_all('div'))}")
Problem 3: Changing Website Structure
# Use multiple selectors as fallback
def safe_extract(soup):
    """Try multiple ways to extract data"""
    # Method 1: Class name
    result = soup.find('span', class_='price')
    if result:
        return result.text
    # Method 2: Data attribute
    result = soup.find('span', {'data-test': 'price'})
    if result:
        return result.text
    # Method 3: CSS selector
    result = soup.select_one('[class*="price"]')
    if result:
        return result.text
    return None
Problem 4: Encoding Issues
# Handle different encodings
response = requests.get(url)
response.encoding = response.apparent_encoding # Auto-detect
html = response.text
# Or specify manually
response.encoding = 'utf-8'
Legal and Ethical Considerations
Always:
- Check terms of service before scraping
- Respect robots.txt directives
- Use rate limiting to avoid overload
- Identify yourself with User-Agent
- Don't overwhelm servers with requests
- Cache responses when testing (see the sketch after these lists)
- Consider data privacy regulations (GDPR, CCPA)
Never:
- Scrape personal data without consent
- Ignore copyright laws
- Bypass authentication or paywalls
- Scrape at high frequency without permission
- Republish scraped content without attribution
- Scrape sites that explicitly prohibit it
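On the caching point above: while developing a scraper you tend to re-run the same script many times, and there's no reason to hit the site again for pages you already have. Here's a minimal sketch of disk caching using only the standard library plus requests; the cache directory name and the cached_get helper are my own choices, not a standard API:
import hashlib
import pathlib
import requests

CACHE_DIR = pathlib.Path('.http_cache')  # arbitrary directory name
CACHE_DIR.mkdir(exist_ok=True)

def cached_get(url):
    """Return page HTML, reusing a cached copy on disk if one exists."""
    cache_file = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
    if cache_file.exists():
        return cache_file.read_text(encoding='utf-8')
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    cache_file.write_text(response.text, encoding='utf-8')
    return response.text
During testing this means each page is fetched at most once; delete the cache directory when you want fresh data. The third-party requests-cache package offers a more complete version of the same idea.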
Your Scraping Toolkit
You now know:
- Basics - Fetching and parsing HTML
- Selectors - Finding elements efficiently
- Navigation - Following links and pagination
- Challenges - Headers, retries, rate limiting
- Storage - Saving to CSV, JSON
- Projects - Building complete scrapers
- Ethics - Scraping responsibly
Web scraping opens doors to unlimited data. Use it wisely!
My Scraping Workflow Evolution
Beginner Stage (Weeks 1-4): Started with simple requests + BeautifulSoup scripts, copying examples from tutorials, breaking constantly when websites changed structure.
Intermediate Stage (Months 2-6): Added proper error handling, logging, and rate limiting. Learned to respect robots.txt and handle different response types. Built reusable scraper templates.
Current Stage: Design scrapers defensively with multiple fallback selectors, comprehensive logging, automatic retry logic, and data validation. Can adapt quickly when sites change structure.
The progression is natural: start simple, learn from failures, build robustness gradually. Don't try to build the perfect scraper on day one.
One Final Piece of Advice
Before scraping any website, ask yourself: "Is there an easier way?" Check for:
- Official APIs (always better than scraping)
- Existing datasets (Kaggle, data.gov, etc.)
- Data export options (many sites let you download your data)
- Third-party data providers (sometimes worth the cost)
Scraping should be your solution when better alternatives don't exist. It's powerful but requires maintenance. Choose wisely!
Next Steps
- Learn Scrapy framework
- Handle JavaScript with Playwright (see the sketch below)
- Explore APIs as alternatives
- Practice with httpbin and quotes.toscrape.com
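For the Playwright item above, here's a minimal sketch of what rendering JavaScript looks like, using Playwright's synchronous API (install with pip install playwright, then run playwright install to download a browser). The target URL is a placeholder; once you have the rendered HTML, the BeautifulSoup workflow is exactly the same as before:
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://example.com')  # placeholder URL
    html = page.content()  # HTML after JavaScript has run
    browser.close()

soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)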
Remember: The best scraper is one you don't need; check if an API exists first!
Found this tutorial helpful? Share it with fellow data enthusiasts! Connect with me on Twitter or LinkedIn for more web scraping tips.
Support My Work
If this guide helped you with this topic, I'd really appreciate your support! Creating comprehensive, free content like this takes significant time and effort. Your support helps me continue sharing knowledge and creating more helpful resources for developers.
Buy me a coffee - Every contribution, big or small, means the world to me and keeps me motivated to create more content!
Cover image by ian dooley on Unsplash