Python Collections Module: Counter, deque, defaultdict for Data Science
Master Python's collections module with Counter for frequency counting, deque for fast queues, defaultdict for automatic defaults, and namedtuple for clean code

When I Discovered the Collections Module
I was processing a dataset with word frequencies when my mentor looked at my code:
# My inefficient approach
word_count = {}
for word in words:
    if word in word_count:
        word_count[word] += 1
    else:
        word_count[word] = 1
He smiled and showed me:
from collections import Counter
word_count = Counter(words)
One line. That's all it took. The collections module has been my secret weapon ever since!
In this guide, I'll show you Python's powerful collections module: essential tools that make your data science code cleaner, faster, and more Pythonic.
Why Collections Module Matters
Python's built-in list, dict, and tuple are great, but the collections module provides specialized containers optimized for common patterns in data science:
Counter - Frequency tables in one line
defaultdict - No more KeyError checking
deque - Lightning-fast queues (O(1) operations)
namedtuple - Readable, memory-efficient records
OrderedDict - Ordered dictionary with utilities
ChainMap - Layer multiple dictionaries
These aren't just conveniences; they're performance optimizations used by professional data scientists!
Counter: Frequency Tables Made Easy
Counter is perfect for counting anything: words, labels, features, categories.
Basic Counting
from collections import Counter
# Count word frequencies
words = ["apple", "banana", "apple", "cherry", "banana", "apple"]
word_count = Counter(words)
print(word_count)
# Output: Counter({'apple': 3, 'banana': 2, 'cherry': 1})
# Access counts
print(word_count["apple"]) # 3
print(word_count["grape"]) # 0 (doesn't raise error!)
# Get most common items
print(word_count.most_common(2))
# Output: [('apple', 3), ('banana', 2)]
Key insight: Counter returns 0 for missing keys instead of raising KeyError!
ML Example: Class Distribution
from collections import Counter
# Analyze class distribution in dataset
labels = [0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1]
class_dist = Counter(labels)
print("Class distribution:")
for label, count in class_dist.items():
    percentage = (count / len(labels)) * 100
    print(f"Class {label}: {count} samples ({percentage:.1f}%)")
# Check for class imbalance
if max(class_dist.values()) / min(class_dist.values()) > 2:
    print("⚠️ Warning: Class imbalance detected!")
# Output:
# Class distribution:
# Class 0: 5 samples (41.7%)
# Class 1: 7 samples (58.3%)
Counter Arithmetic
# Combine counters
c1 = Counter(['a', 'b', 'c', 'a'])
c2 = Counter(['a', 'c', 'd', 'c'])
# Addition
combined = c1 + c2
print(combined) # Counter({'a': 3, 'c': 3, 'b': 1, 'd': 1})
# Subtraction
diff = c1 - c2
print(diff) # Counter({'a': 1, 'b': 1})
# Intersection (minimum)
intersection = c1 & c2
print(intersection) # Counter({'a': 1, 'c': 1})
# Union (maximum)
union = c1 | c2
print(union) # Counter({'a': 2, 'c': 2, 'b': 1, 'd': 1})
NLP Example: Word Frequency Analysis
from collections import Counter
import re
def analyze_text(text):
"""Analyze word frequencies in text."""
# Clean and tokenize
words = re.findall(r'\w+', text.lower())
# Count frequencies
word_freq = Counter(words)
# Analysis
print(f"Total words: {len(words)}")
print(f"Unique words: {len(word_freq)}")
print(f"\nTop 10 most common words:")
for word, count in word_freq.most_common(10):
print(f" {word}: {count}")
return word_freq
text = """
Python is amazing for data science. Python has great libraries.
Machine learning with Python is powerful. Data science uses Python.
"""
analyze_text(text)
defaultdict: Never Check for Keys Again
defaultdict automatically creates missing keys with a default value. Perfect for grouping, counting, and building nested structures!
The Problem with Regular Dicts
# Regular dict - tedious key checking
word_groups = {}
words = ["apple", "ant", "banana", "bear", "cat", "cherry"]
for word in words:
    first_letter = word[0]
    if first_letter not in word_groups:
        word_groups[first_letter] = []  # Initialize if missing
    word_groups[first_letter].append(word)
print(word_groups)
The defaultdict Solution
from collections import defaultdict
# defaultdict - automatic initialization!
word_groups = defaultdict(list)
words = ["apple", "ant", "banana", "bear", "cat", "cherry"]
for word in words:
    first_letter = word[0]
    word_groups[first_letter].append(word)  # No checking needed!
print(dict(word_groups))
# Output: {'a': ['apple', 'ant'], 'b': ['banana', 'bear'], 'c': ['cat', 'cherry']}
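One gotcha worth knowing: merely reading a missing key also creates it, because defaultdict calls the factory on every failed lookup. A quick sketch of that behavior:
from collections import defaultdict
groups = defaultdict(list)
print(groups["z"])    # [] - the lookup itself inserts 'z'
print(dict(groups))   # {'z': []}
# Use "in" (or .get()) when you only want to check, not create
print("y" in groups)  # False - membership tests never insert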
defaultdict with int (Counting)
from collections import defaultdict
# Count word frequencies
word_count = defaultdict(int)
words = ["apple", "banana", "apple", "cherry", "banana", "apple"]
for word in words:
    word_count[word] += 1  # Starts at 0 automatically!
print(dict(word_count))
# Output: {'apple': 3, 'banana': 2, 'cherry': 1}
ML Example: Group Samples by Label
from collections import defaultdict
# Group samples by their predicted class
samples_by_class = defaultdict(list)
predictions = [
("sample1", 0),
("sample2", 1),
("sample3", 0),
("sample4", 1),
("sample5", 1),
]
for sample_id, predicted_class in predictions:
    samples_by_class[predicted_class].append(sample_id)
for class_label, samples in samples_by_class.items():
    print(f"Class {class_label}: {samples}")
# Output:
# Class 0: ['sample1', 'sample3']
# Class 1: ['sample2', 'sample4', 'sample5']
Nested defaultdict
from collections import defaultdict
# Create nested structure automatically
tree = lambda: defaultdict(tree)
users = tree()
# Build nested structure without initialization
users['user1']['age'] = 25
users['user1']['city'] = 'NYC'
users['user2']['age'] = 30
users['user2']['city'] = 'LA'
print(dict(users))
deque: High-Performance Queues
deque (double-ended queue) provides O(1) append/pop operations from both ends. Perfect for sliding windows, BFS, and producer-consumer patterns!
Why deque is Faster
from collections import deque
import time
# List (slow for left operations)
regular_list = []
start = time.time()
for i in range(100000):
    regular_list.insert(0, i)  # O(n) - SLOW!
list_time = time.time() - start
# deque (fast for all operations)
dq = deque()
start = time.time()
for i in range(100000):
    dq.appendleft(i)  # O(1) - FAST!
deque_time = time.time() - start
print(f"List time: {list_time:.3f}s")
print(f"deque time: {deque_time:.3f}s")
print(f"deque is {list_time / deque_time:.0f}x faster!")
Basic deque Operations
from collections import deque
# Create deque
queue = deque()
# Add to right (like append)
queue.append(1)
queue.append(2)
queue.append(3)
print(queue) # deque([1, 2, 3])
# Add to left
queue.appendleft(0)
print(queue) # deque([0, 1, 2, 3])
# Remove from right (like pop)
queue.pop()
print(queue) # deque([0, 1, 2])
# Remove from left
queue.popleft()
print(queue) # deque([1, 2])
# Rotate
queue.rotate(1) # Rotate right
print(queue) # deque([2, 1])
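Graph Example: Breadth-First Search (BFS)
Since BFS was mentioned above, here's a minimal breadth-first search sketch over a toy adjacency-list graph (the graph and function name are purely illustrative); popleft() gives the FIFO behavior BFS needs:
from collections import deque
def bfs(graph, start):
    """Return nodes reachable from start, in breadth-first order."""
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()  # O(1) removal from the left
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order
# Toy graph (illustrative only)
graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
print(bfs(graph, 'A'))  # ['A', 'B', 'C', 'D']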
ML Example: Sliding Window
from collections import deque
def sliding_window_average(data, window_size):
"""Calculate moving average using deque."""
window = deque(maxlen=window_size)
averages = []
for value in data:
window.append(value)
if len(window) == window_size:
avg = sum(window) / window_size
averages.append(avg)
return averages
# Time series data
prices = [100, 102, 105, 103, 107, 110, 108, 112, 115]
# 3-day moving average
moving_avg = sliding_window_average(prices, window_size=3)
print(f"Prices: {prices}")
print(f"3-day moving average: {moving_avg}")
Bounded deque (Automatic Eviction)
from collections import deque
# Keep only last 5 items
recent_logs = deque(maxlen=5)
for i in range(10):
    recent_logs.append(f"Log entry {i}")
print(f"Current logs: {list(recent_logs)}")
# Output shows only last 5 are kept:
# Current logs: ['Log entry 5', 'Log entry 6', 'Log entry 7', 'Log entry 8', 'Log entry 9']
namedtuple: Readable Records
namedtuple creates lightweight, immutable records with named fields. Perfect for returning multiple values or creating simple data structures!
Creating namedtuples
from collections import namedtuple
# Define a Point type
Point = namedtuple('Point', ['x', 'y'])
# Create instances
p1 = Point(3, 4)
p2 = Point(x=5, y=12)
# Access fields
print(p1.x, p1.y) # 3 4
print(p2[0], p2[1]) # 5 12 (also works like tuple)
# Unpack
x, y = p1
print(f"Coordinates: ({x}, {y})")
ML Example: Model Results
from collections import namedtuple
# Define result structure
ModelResult = namedtuple('ModelResult', ['model_name', 'accuracy', 'precision', 'recall', 'f1_score'])
# Store results
results = [
ModelResult('RandomForest', 0.85, 0.83, 0.87, 0.85),
ModelResult('SVM', 0.92, 0.90, 0.93, 0.91),
ModelResult('NeuralNet', 0.88, 0.86, 0.89, 0.87),
]
# Easy to read and access
for result in results:
print(f"{result.model_name}:")
print(f" Accuracy: {result.accuracy:.1%}")
print(f" Precision: {result.precision:.1%}")
print(f" F1-Score: {result.f1_score:.1%}")
print()
# Find best model
best = max(results, key=lambda r: r.accuracy)
print(f"Best model: {best.model_name} ({best.accuracy:.1%})")
namedtuple Methods
from collections import namedtuple
Person = namedtuple('Person', ['name', 'age', 'city'])
p = Person('Alice', 25, 'NYC')
# Convert to dict
print(p._asdict())
# Output: {'name': 'Alice', 'age': 25, 'city': 'NYC'}
# Replace (creates new instance - immutable!)
p2 = p._replace(age=26)
print(p2) # Person(name='Alice', age=26, city='NYC')
# Create from iterable
data = ['Bob', 30, 'LA']
p3 = Person._make(data)
print(p3) # Person(name='Bob', age=30, city='LA')
OrderedDict: Ordered with Utilities
While modern dict preserves insertion order (Python 3.7+), OrderedDict adds useful methods like move_to_end().
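Before the full LRU cache, here's a quick look at move_to_end(), which the cache below relies on, plus popitem(last=False) for popping from the front (a minimal sketch):
from collections import OrderedDict
od = OrderedDict(a=1, b=2, c=3)
od.move_to_end('a')              # send 'a' to the right end
print(list(od))                  # ['b', 'c', 'a']
od.move_to_end('c', last=False)  # or send 'c' to the left end
print(list(od))                  # ['c', 'b', 'a']
print(od.popitem(last=False))    # ('c', 3) - pops from the left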
LRU Cache Pattern
from collections import OrderedDict
class LRUCache:
"""Least Recently Used cache implementation."""
def __init__(self, capacity):
self.cache = OrderedDict()
self.capacity = capacity
def get(self, key):
if key not in self.cache:
return None
# Move to end (most recently used)
self.cache.move_to_end(key)
return self.cache[key]
def put(self, key, value):
if key in self.cache:
self.cache.move_to_end(key)
self.cache[key] = value
# Evict oldest if over capacity
if len(self.cache) > self.capacity:
oldest = next(iter(self.cache))
del self.cache[oldest]
# Test LRU cache
cache = LRUCache(capacity=3)
cache.put('a', 1)
cache.put('b', 2)
cache.put('c', 3)
print(dict(cache.cache)) # {'a': 1, 'b': 2, 'c': 3}
cache.put('d', 4) # Evicts 'a'
print(dict(cache.cache)) # {'b': 2, 'c': 3, 'd': 4}
ChainMap: Layered Dictionaries
ChainMap searches multiple dictionaries as one view. Perfect for configuration management!
from collections import ChainMap
# Default configuration
defaults = {
'batch_size': 32,
'learning_rate': 0.001,
'epochs': 100,
'optimizer': 'adam'
}
# User overrides
user_config = {
'batch_size': 64,
'epochs': 50
}
# Layer them (user_config takes precedence)
config = ChainMap(user_config, defaults)
print(f"Batch size: {config['batch_size']}") # 64 (from user)
print(f"Learning rate: {config['learning_rate']}") # 0.001 (from defaults)
print(f"Epochs: {config['epochs']}") # 50 (from user)
print(f"Optimizer: {config['optimizer']}") # adam (from defaults)
Real-World ML Example: Complete Data Pipeline
from collections import Counter, defaultdict, deque, namedtuple
# Define data structures
Sample = namedtuple('Sample', ['id', 'features', 'label'])
class DataPipeline:
"""Complete data processing pipeline using collections."""
def __init__(self):
self.samples_by_label = defaultdict(list)
self.feature_stats = Counter()
self.recent_samples = deque(maxlen=100)
def add_sample(self, sample):
"""Process and store a sample."""
# Group by label
self.samples_by_label[sample.label].append(sample)
# Track feature occurrences
for feature in sample.features:
self.feature_stats[feature] += 1
# Keep recent samples
self.recent_samples.append(sample)
def get_statistics(self):
"""Get dataset statistics."""
total = sum(len(samples) for samples in self.samples_by_label.values())
print(f"Total samples: {total}")
print(f"\nClass distribution:")
for label, samples in self.samples_by_label.items():
count = len(samples)
pct = (count / total) * 100
print(f" Class {label}: {count} ({pct:.1f}%)")
print(f"\nTop 5 features:")
for feature, count in self.feature_stats.most_common(5):
print(f" {feature}: {count} occurrences")
# Test the pipeline
pipeline = DataPipeline()
# Add samples
samples = [
Sample('s1', ['age', 'income', 'education'], 0),
Sample('s2', ['age', 'income'], 1),
Sample('s3', ['age', 'education'], 0),
Sample('s4', ['income', 'education'], 1),
Sample('s5', ['age', 'income', 'education'], 1),
]
for sample in samples:
pipeline.add_sample(sample)
pipeline.get_statistics()
Conclusion: Collections Module Mastery
You've learned Python's powerful collections module:
✅ Counter - Effortless frequency counting and analysis
✅ defaultdict - Automatic default values, no KeyError
✅ deque - Lightning-fast double-ended queue
✅ namedtuple - Readable, immutable records
✅ OrderedDict - Ordered dict with move_to_end
✅ ChainMap - Layer multiple dictionaries
These aren't just conveniences; they're performance optimizations that make your data science code faster and more Pythonic!
Quick Reference
from collections import Counter, defaultdict, deque, namedtuple, OrderedDict, ChainMap
# Counter - frequency tables
counts = Counter(['a', 'b', 'a', 'c', 'b', 'a'])
# defaultdict - automatic defaults
groups = defaultdict(list)
# deque - fast queue
queue = deque(maxlen=100)
# namedtuple - readable records
Point = namedtuple('Point', ['x', 'y'])
# OrderedDict - ordered with utilities
od = OrderedDict()
# ChainMap - layered dicts
config = ChainMap(user_config, defaults)
Master these containers, and you'll write cleaner, faster data science code! 🚀
If you found this guide helpful and are using Python's collections module in your projects, I'd love to hear about it! Connect with me on Twitter or LinkedIn.
Support My Work
If this guide helped you master Python's collections module, use Counter, deque, and defaultdict effectively, or write more efficient code, I'd really appreciate your support! Creating comprehensive, practical Python tutorials like this takes significant time and effort. Your support helps me continue sharing knowledge and creating more helpful resources for Python developers.
☕ Buy me a coffee - Every contribution, big or small, means the world to me and keeps me motivated to create more content!
Cover image by Sunira Moses on Unsplash