Python Data Scraping Libraries: BeautifulSoup and Beyond
Complete guide to Python scraping libraries. Master BeautifulSoup, Scrapy, Selenium, Playwright, and lxml for efficient web data extraction with practical examples and comparisons.

Finding the Right Scraping Tool
I wasted two weeks building a scraper with BeautifulSoup, only to discover it couldn't handle JavaScript-heavy pages. Then I tried Selenium; it worked but was painfully slow. Finally, I found Scrapy, and everything changed.
Choosing the wrong scraping library is like using a hammer when you need a drill. Each tool has its purpose, strengths, and ideal use cases.
This guide compares Python's top scraping libraries, showing when to use each, with practical examples and real-world scenarios.
The Scraping Library Landscape
Quick Comparison
| Library | Best For | Speed | Learning Curve | JavaScript |
|---|---|---|---|---|
| BeautifulSoup | Simple scraping | Fast | Easy | No |
| Scrapy | Large-scale projects | Very Fast | Moderate | No |
| Selenium | JavaScript sites | Slow | Moderate | Yes |
| Playwright | Modern web apps | Medium | Moderate | Yes |
| lxml | Performance-critical | Very Fast | Hard | No |
BeautifulSoup: The Beginner's Friend
Why BeautifulSoup?
- Easiest to learn
- Perfect for small projects
- Great documentation
- Flexible parsing
Installation & Basic Usage
# pip install beautifulsoup4 lxml requests
from bs4 import BeautifulSoup
import requests
url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
# Find elements
title = soup.title.string
links = soup.find_all('a')
prices = soup.select('.price')
print(f"Found {len(links)} links")
Advanced BeautifulSoup Techniques
# Complex selectors
products = soup.select('div.product[data-available="true"]')

# Navigate tree
for product in products:
    name = product.find('h3').text
    price = product.find('span', class_='price').text
    # Get next sibling
    description = product.find_next_sibling('p')
    print(f"{name}: {price}")

# Extract attributes
images = soup.find_all('img')
image_urls = [img.get('src') for img in images if img.get('src')]

# Text extraction with cleaning
text = soup.get_text(separator=' ', strip=True)
Real Example: Product Scraper
def scrape_products(url):
    """Scrape product listings with BeautifulSoup"""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    products = []

    for item in soup.select('.product-card'):
        try:
            product = {
                'name': item.select_one('.product-name').text.strip(),
                'price': item.select_one('.price').text.strip(),
                'rating': item.select_one('.rating')['data-rating'],
                'url': item.select_one('a')['href']
            }
            products.append(product)
        except (AttributeError, TypeError) as e:
            print(f"Error parsing product: {e}")
            continue

    return products
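A quick usage sketch; the listings URL here is a placeholder, not a real endpoint:
# Hypothetical usage: scrape a listings page and dump the results to JSON
import json

products = scrape_products('https://example.com/products')
print(f"Scraped {len(products)} products")

with open('products.json', 'w') as f:
    json.dump(products, f, indent=2)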
Scrapy: The Professional Framework
Why Scrapy?
- Built for speed and scale
- Handles concurrency automatically
- Built-in data pipelines
- Middleware system
- Perfect for large projects
Creating a Scrapy Spider
# pip install scrapy
import scrapy
class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    custom_settings = {
        'CONCURRENT_REQUESTS': 16,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0...'
    }

    def parse(self, response):
        """Parse product listing page"""
        for product in response.css('.product-card'):
            yield {
                'name': product.css('.product-name::text').get(),
                'price': product.css('.price::text').get(),
                'url': product.css('a::attr(href)').get()
            }

        # Follow pagination
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_detail(self, response):
        """Parse product detail page (wire it up as a callback, e.g. response.follow(url, self.parse_detail))"""
        yield {
            'title': response.css('h1::text').get(),
            'description': response.css('.description::text').get(),
            'specifications': response.css('.specs li::text').getall()
        }
Running Scrapy
# Create project
scrapy startproject myproject
# Generate spider
scrapy genspider products example.com
# Run spider
scrapy crawl products -o products.json
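You can also run a spider from a plain Python script instead of the CLI. A minimal sketch, assuming the ProductSpider class defined above is importable:
# Programmatic run, roughly equivalent to `scrapy crawl products -o products.json`
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'FEEDS': {'products.json': {'format': 'json'}},
})
process.crawl(ProductSpider)
process.start()  # blocks until the crawl finishes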
Scrapy Pipelines
# pipelines.py
from scrapy.exceptions import DropItem

class DataCleaningPipeline:
    def process_item(self, item, spider):
        """Clean scraped data"""
        # Remove whitespace
        item['name'] = item.get('name', '').strip()

        # Parse price
        price_str = item.get('price', '').replace('$', '').replace(',', '')
        try:
            item['price'] = float(price_str)
        except ValueError:
            item['price'] = None

        return item

class DuplicatesPipeline:
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        """Remove duplicates"""
        identifier = (item['name'], item['price'])
        if identifier in self.seen:
            raise DropItem(f"Duplicate item: {identifier}")
        self.seen.add(identifier)
        return item
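Pipelines only run once they are registered in the project's settings.py; the numbers are priorities, and lower values run first:
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.DataCleaningPipeline': 300,
    'myproject.pipelines.DuplicatesPipeline': 400,
}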
Selenium: The JavaScript Handler
Why Selenium?
- Handles JavaScript rendering
- Interacts with dynamic content
- Can fill forms, click buttons
- Simulates real browser
Installation & Setup
# pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Setup Chrome
options = webdriver.ChromeOptions()
options.add_argument('--headless') # Run without GUI
options.add_argument('--no-sandbox')
driver = webdriver.Chrome(options=options)
Selenium Scraping Patterns
import time

class SeleniumScraper:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        self.driver = webdriver.Chrome(options=options)
        self.wait = WebDriverWait(self.driver, 10)

    def scrape_dynamic_content(self, url):
        """Scrape JavaScript-rendered content"""
        self.driver.get(url)

        # Wait for specific element
        self.wait.until(
            EC.presence_of_element_located((By.CLASS_NAME, 'product'))
        )

        # Extract data
        products = self.driver.find_elements(By.CLASS_NAME, 'product')
        data = [p.text for p in products]
        return data

    def handle_infinite_scroll(self, url):
        """Handle infinite scrolling"""
        self.driver.get(url)
        last_height = self.driver.execute_script(
            "return document.body.scrollHeight"
        )

        while True:
            # Scroll down
            self.driver.execute_script(
                "window.scrollTo(0, document.body.scrollHeight);"
            )

            # Wait for content to load
            time.sleep(2)

            # Check if reached bottom
            new_height = self.driver.execute_script(
                "return document.body.scrollHeight"
            )
            if new_height == last_height:
                break
            last_height = new_height

        # Extract all loaded content
        return self.driver.find_elements(By.CLASS_NAME, 'item')

    def login_and_scrape(self, login_url, username, password, target_url):
        """Login then scrape protected content"""
        self.driver.get(login_url)

        # Fill login form
        user_field = self.driver.find_element(By.ID, 'username')
        pass_field = self.driver.find_element(By.ID, 'password')
        user_field.send_keys(username)
        pass_field.send_keys(password)

        # Submit
        submit_btn = self.driver.find_element(By.ID, 'login-button')
        submit_btn.click()

        # Wait for login
        self.wait.until(EC.url_changes(login_url))

        # Navigate to target
        self.driver.get(target_url)
        return self.driver.page_source

    def close(self):
        self.driver.quit()
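A short usage sketch for the class above; the URL is a placeholder, and the try/finally ensures the browser is closed even if scraping fails:
# Hypothetical usage of SeleniumScraper
scraper = SeleniumScraper()
try:
    products = scraper.scrape_dynamic_content('https://example.com/catalog')
    print(f"Found {len(products)} products")
finally:
    scraper.close()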
Playwright: The Modern Alternative
Why Playwright?
- Faster than Selenium
- Better API design
- Auto-waiting for elements
- Built-in screenshots/videos
- Supports multiple browsers
Playwright Basics
# pip install playwright
# playwright install
from playwright.sync_api import sync_playwright
def scrape_with_playwright(url):
    """Scrape using Playwright"""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)

        # Wait for element
        page.wait_for_selector('.product')

        # Extract data
        products = page.query_selector_all('.product')
        data = [product.text_content() for product in products]

        # Take screenshot
        page.screenshot(path='page.png')

        browser.close()
        return data
Playwright Advanced Features
from playwright.sync_api import sync_playwright
class PlaywrightScraper:
    def __init__(self):
        self.playwright = sync_playwright().start()
        self.browser = self.playwright.chromium.launch()

    def scrape_spa(self, url):
        """Scrape Single Page Application"""
        page = self.browser.new_page()
        page.goto(url)

        # Wait for network to be idle
        page.wait_for_load_state('networkidle')

        # Extract data
        content = page.content()
        page.close()
        return content

    def handle_pagination(self, url):
        """Handle pagination with button clicks"""
        page = self.browser.new_page()
        page.goto(url)
        all_items = []

        while True:
            # Extract current page items
            items = page.query_selector_all('.item')
            all_items.extend([i.text_content() for i in items])

            # Try to click next button
            next_btn = page.query_selector('button.next')
            if not next_btn or not next_btn.is_enabled():
                break
            next_btn.click()
            page.wait_for_load_state('networkidle')

        page.close()
        return all_items

    def close(self):
        self.browser.close()
        self.playwright.stop()
lxml: The Speed Demon
Why lxml?
- Extremely fast parsing
- XPath support
- Low-level control
- Memory efficient
lxml Usage
# pip install lxml
from lxml import html
import requests
response = requests.get('https://example.com')
tree = html.fromstring(response.content)
# XPath selectors
titles = tree.xpath('//h2[@class="title"]/text()')
prices = tree.xpath('//span[@class="price"]/text()')
links = tree.xpath('//a/@href')
# CSS selectors (via cssselect)
from lxml.cssselect import CSSSelector
sel = CSSSelector('.product .name')
products = [e.text for e in sel(tree)]
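One reason lxml shines on repeated extractions is that XPath expressions can be precompiled with etree.XPath and reused across documents. A minimal sketch, with placeholder URLs:
# Compile the XPath once, then apply it to many parsed pages
import requests
from lxml import etree, html

extract_links = etree.XPath('//a/@href')

for url in ['https://example.com/page1', 'https://example.com/page2']:
    tree = html.fromstring(requests.get(url).content)
    print(url, len(extract_links(tree)))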
Choosing the Right Library
Decision Tree
Do you need JavaScript support?
├── No
│   ├── Simple project? → BeautifulSoup
│   ├── Large-scale? → Scrapy
│   └── Need speed? → lxml
└── Yes
    ├── Modern sites? → Playwright
    └── Legacy sites? → Selenium
Performance Comparison
import time
import requests
from bs4 import BeautifulSoup
from lxml import html

# Download the page once so we benchmark parsing, not the network
page_content = requests.get('https://example.com').content
runs = 100

# BeautifulSoup
start = time.time()
for _ in range(runs):
    soup = BeautifulSoup(page_content, 'lxml')
bs_time = time.time() - start

# lxml
start = time.time()
for _ in range(runs):
    tree = html.fromstring(page_content)
lxml_time = time.time() - start

print(f"BeautifulSoup: {bs_time:.2f}s")
print(f"lxml: {lxml_time:.2f}s")
print(f"lxml is {bs_time/lxml_time:.2f}x faster")
Combining Libraries for Maximum Power
Real-world scraping often requires using multiple libraries together.
Pattern 1: Playwright + BeautifulSoup
# Use Playwright to render JavaScript, BeautifulSoup to parse
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
def scrape_dynamic_site(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.wait_for_load_state('networkidle')

        # Get rendered HTML
        html = page.content()
        browser.close()

    # Parse with BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    data = soup.find_all('div', class_='product')
    return data
Pattern 2: Requests + lxml for Speed
# Fast scraping for static sites
import requests
from lxml import html
import concurrent.futures
def scrape_url(url):
    response = requests.get(url)
    tree = html.fromstring(response.content)
    return tree.xpath('//h1/text()')

urls = ['https://example.com/page{}'.format(i) for i in range(100)]

# Parallel scraping
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(scrape_url, urls))
Pattern 3: Scrapy + Selenium for Hybrid Approach
# Scrapy spider that hands page rendering to Selenium
from scrapy import Spider
from scrapy.selector import Selector
from selenium import webdriver

class HybridSpider(Spider):
    name = 'hybrid'
    start_urls = ['https://example.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()

    def parse(self, response):
        # Use Selenium for dynamic content
        self.driver.get(response.url)
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Parse the rendered page with Scrapy selectors
        sel = Selector(text=self.driver.page_source)
        for item in sel.css('.product'):
            yield {
                'title': item.css('h2::text').get(),
                'price': item.css('.price::text').get()
            }

    def closed(self, reason):
        self.driver.quit()
Advanced Scraping Techniques
Technique 1: Handle Anti-Scraping Measures
from fake_useragent import UserAgent
import requests
import random
import time

# Rotate user agents
ua = UserAgent()
headers = {'User-Agent': ua.random}

# Add delays
def smart_delay():
    time.sleep(random.uniform(2, 5))  # Random delay of 2-5 seconds

# Use proxies
proxies = [
    'http://proxy1:8080',
    'http://proxy2:8080',
]

def get_with_proxy(url):
    proxy = random.choice(proxies)
    response = requests.get(url, proxies={'http': proxy, 'https': proxy})
    return response
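Putting those pieces together, a hedged sketch that reuses the ua, proxies, and smart_delay names defined just above (the proxy addresses are placeholders):
def polite_get(url):
    """Hypothetical helper: random delay + rotating user agent + random proxy"""
    smart_delay()
    proxy = random.choice(proxies)
    return requests.get(
        url,
        headers={'User-Agent': ua.random},
        proxies={'http': proxy, 'https': proxy},
        timeout=10
    )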
Technique 2: Handle Pagination Intelligently
def scrape_all_pages(base_url):
    page = 1
    all_data = []

    while True:
        url = f"{base_url}?page={page}"
        soup = BeautifulSoup(requests.get(url).text, 'lxml')
        items = soup.find_all('div', class_='item')

        if not items:  # No more items, we're done
            break

        all_data.extend(items)
        page += 1
        time.sleep(1)  # Be polite

    return all_data
Technique 3: Robust Error Handling
import logging
import time
import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def robust_scrape(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'lxml')
            data = soup.find('div', class_='content')

            if data:
                return data
            else:
                logger.warning(f"No data found on {url}")
                return None

        except requests.Timeout:
            logger.error(f"Timeout on attempt {attempt + 1}/{max_retries}")
            time.sleep(2 ** attempt)  # Exponential backoff
        except requests.RequestException as e:
            logger.error(f"Request failed: {e}")
            if attempt == max_retries - 1:
                return None
            time.sleep(2)

    return None
Technique 4: Data Validation and Cleaning
def validate_and_clean(data):
    """Validate scraped data before saving"""
    cleaned = []

    for item in data:
        # Check required fields
        if not all([item.get('title'), item.get('price')]):
            continue

        # Clean price
        try:
            price_str = item['price'].replace('$', '').replace(',', '')
            item['price'] = float(price_str)
        except (ValueError, AttributeError):
            continue

        # Clean title
        item['title'] = item['title'].strip()

        # Validate ranges
        if 0 < item['price'] < 100000:
            cleaned.append(item)

    return cleaned
Library Comparison Summary
| Feature | BeautifulSoup | Scrapy | Selenium | Playwright | lxml |
|---|---|---|---|---|---|
| Ease of Use | ★★★★★ | ★★★ | ★★★ | ★★★★ | ★★★ |
| Speed | ★★★ | ★★★★★ | ★★ | ★★★★ | ★★★★★ |
| JavaScript | ❌ | ❌ | ✅ | ✅ | ❌ |
| Scalability | ★★ | ★★★★★ | ★★ | ★★★★ | ★★★★ |
| Learning Curve | Low | High | Medium | Medium | Medium |
| Best For | Quick tasks | Production | JS-heavy | Modern sites | Performance |
Your Complete Toolkit
You now understand:
- BeautifulSoup - Simple, beginner-friendly
- Scrapy - Professional framework for scale
- Selenium - JavaScript handling (slower)
- Playwright - Modern, fast JavaScript support
- lxml - Maximum performance
Choose based on your project needs, not the latest hype!
My Recommendation for Different Scenarios
Starting out? Use BeautifulSoup with requests. Simple, effective, and teaches fundamentals.
Building a large scraper? Learn Scrapy. The initial learning curve pays off with powerful features and scalability.
Dealing with JavaScript? Try Playwright first. It's more modern and faster than Selenium for most use cases.
Need maximum speed? Use lxml with multiprocessing for blazing-fast scraping of static sites.
Complex project? Combine libraries. Use Playwright for rendering, BeautifulSoup for parsing, and Scrapy for orchestration.
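For the maximum-speed scenario above, here is a minimal lxml + multiprocessing sketch; the URLs and the XPath are placeholders:
# Fan out fetching and parsing across CPU cores
import multiprocessing
import requests
from lxml import html

def fetch_and_parse(url):
    response = requests.get(url, timeout=10)
    tree = html.fromstring(response.content)
    return tree.xpath('//h1/text()')

if __name__ == '__main__':
    urls = [f'https://example.com/page{i}' for i in range(50)]
    with multiprocessing.Pool() as pool:  # one worker process per CPU core by default
        results = pool.map(fetch_and_parse, urls)
    print(f"Scraped {len(results)} pages")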
Next Steps
- Practice with quotes.toscrape.com
- Learn Scrapy documentation
- Explore Playwright guides
- Master XPath and CSS selectors
Start simple with BeautifulSoup, then graduate to specialized tools as needed!
Found this guide helpful? Share it with your dev community! Connect with me on Twitter or LinkedIn for more scraping insights.
Support My Work
If this guide helped you, I'd really appreciate your support! Creating comprehensive, free content like this takes significant time and effort. Your support helps me continue sharing knowledge and creating more helpful resources for developers.
☕ Buy me a coffee - Every contribution, big or small, means the world to me and keeps me motivated to create more content!
Cover image by Clay Banks on Unsplash