Python Collections Module: Counter, deque, defaultdict for Data Science

Master Python's collections module with Counter for frequency counting, deque for fast queues, defaultdict for automatic defaults, and namedtuple for clean code

📅 Published: August 10, 2025 ✏️ Updated: September 15, 2025 By Ojaswi Athghara
#python #collections #counter #deque #defaultdict

Python Collections Module: Counter, deque, defaultdict for Data Science

When I Discovered Collections Module

I was processing a dataset with word frequencies when my mentor looked at my code:

# My inefficient approach
word_count = {}
for word in words:
    if word in word_count:
        word_count[word] += 1
    else:
        word_count[word] = 1

He smiled and showed me:

from collections import Counter
word_count = Counter(words)

One line. That's all it took. The collections module has been my secret weapon ever since!

In this guide, I'll show you Python's powerful collections module—essential tools that make your data science code cleaner, faster, and more Pythonic.

Why Collections Module Matters

Python's built-in list, dict, and tuple are great, but the collections module provides specialized containers optimized for common patterns in data science:

  • Counter - Frequency tables in one line
  • defaultdict - No more KeyError checking
  • deque - Lightning-fast queues (O(1) operations)
  • namedtuple - Readable, memory-efficient records
  • OrderedDict - Ordered dictionary with utilities
  • ChainMap - Layer multiple dictionaries

These aren't just conveniences: several of these containers are implemented in C, so they genuinely outperform the hand-rolled patterns they replace.

Counter: Frequency Tables Made Easy

Counter is perfect for counting anything—words, labels, features, categories.

Basic Counting

from collections import Counter

# Count word frequencies
words = ["apple", "banana", "apple", "cherry", "banana", "apple"]
word_count = Counter(words)

print(word_count)
# Output: Counter({'apple': 3, 'banana': 2, 'cherry': 1})

# Access counts
print(word_count["apple"])    # 3
print(word_count["grape"])    # 0 (doesn't raise error!)

# Get most common items
print(word_count.most_common(2))
# Output: [('apple', 3), ('banana', 2)]

Key insight: Counter returns 0 for missing keys instead of raising KeyError!

ML Example: Class Distribution

from collections import Counter

# Analyze class distribution in dataset
labels = [0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1]
class_dist = Counter(labels)

print("Class distribution:")
for label, count in class_dist.items():
    percentage = (count / len(labels)) * 100
    print(f"Class {label}: {count} samples ({percentage:.1f}%)")

# Check for class imbalance
if max(class_dist.values()) / min(class_dist.values()) > 2:
    print("⚠️  Warning: Class imbalance detected!")

# Output:
# Class distribution:
# Class 0: 5 samples (41.7%)
# Class 1: 7 samples (58.3%)

Counter Arithmetic

# Combine counters
c1 = Counter(['a', 'b', 'c', 'a'])
c2 = Counter(['a', 'c', 'd', 'c'])

# Addition
combined = c1 + c2
print(combined)  # Counter({'a': 3, 'c': 3, 'b': 1, 'd': 1})

# Subtraction (negative counts are dropped)
diff = c1 - c2
print(diff)  # Counter({'a': 1, 'b': 1})

# Intersection (minimum)
intersection = c1 & c2
print(intersection)  # Counter({'a': 1, 'c': 1})

# Union (maximum)
union = c1 | c2
print(union)  # Counter({'a': 2, 'c': 2, 'b': 1, 'd': 1})
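
Counter also supports in-place updates with update() and can expand a counter back into its individual elements with elements():

from collections import Counter

c = Counter(['a', 'b', 'a'])

# Add counts in place
c.update(['a', 'c'])
print(c)  # Counter({'a': 3, 'b': 1, 'c': 1})

# Expand back into repeated elements (insertion order; counts < 1 are skipped)
print(list(c.elements()))  # ['a', 'a', 'a', 'b', 'c']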

NLP Example: Word Frequency Analysis

from collections import Counter
import re

def analyze_text(text):
    """Analyze word frequencies in text."""
    # Clean and tokenize
    words = re.findall(r'\w+', text.lower())
    
    # Count frequencies
    word_freq = Counter(words)
    
    # Analysis
    print(f"Total words: {len(words)}")
    print(f"Unique words: {len(word_freq)}")
    print(f"\nTop 10 most common words:")
    for word, count in word_freq.most_common(10):
        print(f"  {word}: {count}")
    
    return word_freq

text = """
Python is amazing for data science. Python has great libraries.
Machine learning with Python is powerful. Data science uses Python.
"""

analyze_text(text)

defaultdict: Never Check for Keys Again

defaultdict automatically creates missing keys with a default value. Perfect for grouping, counting, and building nested structures!

The Problem with Regular Dicts

# Regular dict - tedious key checking
word_groups = {}
words = ["apple", "ant", "banana", "bear", "cat", "cherry"]

for word in words:
    first_letter = word[0]
    if first_letter not in word_groups:
        word_groups[first_letter] = []  # Initialize if missing
    word_groups[first_letter].append(word)

print(word_groups)

The defaultdict Solution

from collections import defaultdict

# defaultdict - automatic initialization!
word_groups = defaultdict(list)
words = ["apple", "ant", "banana", "bear", "cat", "cherry"]

for word in words:
    first_letter = word[0]
    word_groups[first_letter].append(word)  # No checking needed!

print(dict(word_groups))
# Output: {'a': ['apple', 'ant'], 'b': ['banana', 'bear'], 'c': ['cat', 'cherry']}

defaultdict with int (Counting)

from collections import defaultdict

# Count word frequencies
word_count = defaultdict(int)
words = ["apple", "banana", "apple", "cherry", "banana", "apple"]

for word in words:
    word_count[word] += 1  # Starts at 0 automatically!

print(dict(word_count))
# Output: {'apple': 3, 'banana': 2, 'cherry': 1}
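
Tip: for pure counting, Counter is usually the better fit since it adds most_common() and counter arithmetic on top; defaultdict(int) shines when counting is just one step in building a larger dict.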

ML Example: Group Samples by Label

from collections import defaultdict

# Group samples by their predicted class
samples_by_class = defaultdict(list)

predictions = [
    ("sample1", 0),
    ("sample2", 1),
    ("sample3", 0),
    ("sample4", 1),
    ("sample5", 1),
]

for sample_id, predicted_class in predictions:
    samples_by_class[predicted_class].append(sample_id)

for class_label, samples in samples_by_class.items():
    print(f"Class {class_label}: {samples}")

# Output:
# Class 0: ['sample1', 'sample3']
# Class 1: ['sample2', 'sample4', 'sample5']

Nested defaultdict

from collections import defaultdict

# Create nested structure automatically
tree = lambda: defaultdict(tree)
users = tree()

# Build nested structure without initialization
users['user1']['age'] = 25
users['user1']['city'] = 'NYC'
users['user2']['age'] = 30
users['user2']['city'] = 'LA'

print(dict(users))
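
One caveat: dict(users) only converts the top level, so the printout still shows the inner defaultdicts. A small recursive helper (an illustrative sketch continuing the example above) cleans up the whole tree:

def to_plain_dict(d):
    """Recursively convert nested defaultdicts into plain dicts."""
    if isinstance(d, defaultdict):
        return {key: to_plain_dict(value) for key, value in d.items()}
    return d

print(to_plain_dict(users))
# Output: {'user1': {'age': 25, 'city': 'NYC'}, 'user2': {'age': 30, 'city': 'LA'}}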

deque: High-Performance Queues

deque (double-ended queue) provides O(1) append/pop operations from both ends. Perfect for sliding windows, BFS, and producer-consumer patterns!

Why deque is Faster

from collections import deque
import time

# List (slow for left operations)
regular_list = []
start = time.time()
for i in range(100000):
    regular_list.insert(0, i)  # O(n) - SLOW!
list_time = time.time() - start

# deque (fast for all operations)
dq = deque()
start = time.time()
for i in range(100000):
    dq.appendleft(i)  # O(1) - FAST!
deque_time = time.time() - start

print(f"List time: {list_time:.3f}s")
print(f"deque time: {deque_time:.3f}s")
print(f"deque is {list_time / deque_time:.0f}x faster!")

Basic deque Operations

from collections import deque

# Create deque
queue = deque()

# Add to right (like append)
queue.append(1)
queue.append(2)
queue.append(3)
print(queue)  # deque([1, 2, 3])

# Add to left
queue.appendleft(0)
print(queue)  # deque([0, 1, 2, 3])

# Remove from right (like pop)
queue.pop()
print(queue)  # deque([0, 1, 2])

# Remove from left
queue.popleft()
print(queue)  # deque([1, 2])

# Rotate
queue.rotate(1)  # Rotate right
print(queue)  # deque([2, 1])
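
Because popleft() is O(1), deque is the natural queue for breadth-first search. Here's a minimal BFS sketch over a toy adjacency-list graph (the graph is made up for illustration):

from collections import deque

def bfs(graph, start):
    """Return nodes in breadth-first order from start."""
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()  # O(1) dequeue from the left
        order.append(node)
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order

graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
print(bfs(graph, 'A'))  # ['A', 'B', 'C', 'D']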

ML Example: Sliding Window

from collections import deque

def sliding_window_average(data, window_size):
    """Calculate moving average using deque."""
    window = deque(maxlen=window_size)
    averages = []
    
    for value in data:
        window.append(value)
        if len(window) == window_size:
            avg = sum(window) / window_size
            averages.append(avg)
    
    return averages

# Time series data
prices = [100, 102, 105, 103, 107, 110, 108, 112, 115]

# 3-day moving average
moving_avg = sliding_window_average(prices, window_size=3)
print(f"Prices: {prices}")
print(f"3-day moving average: {moving_avg}")

Bounded deque (Automatic Eviction)

from collections import deque

# Keep only last 5 items
recent_logs = deque(maxlen=5)

for i in range(10):
    recent_logs.append(f"Log entry {i}")
    print(f"Current logs: {list(recent_logs)}")

# Output shows only last 5 are kept:
# Current logs: ['Log entry 5', 'Log entry 6', 'Log entry 7', 'Log entry 8', 'Log entry 9']

namedtuple: Readable Records

namedtuple creates lightweight, immutable records with named fields. Perfect for returning multiple values or creating simple data structures!

Creating namedtuples

from collections import namedtuple

# Define a Point type
Point = namedtuple('Point', ['x', 'y'])

# Create instances
p1 = Point(3, 4)
p2 = Point(x=5, y=12)

# Access fields
print(p1.x, p1.y)  # 3 4
print(p2[0], p2[1])  # 5 12 (also works like tuple)

# Unpack
x, y = p1
print(f"Coordinates: ({x}, {y})")

ML Example: Model Results

from collections import namedtuple

# Define result structure
ModelResult = namedtuple('ModelResult', ['model_name', 'accuracy', 'precision', 'recall', 'f1_score'])

# Store results
results = [
    ModelResult('RandomForest', 0.85, 0.83, 0.87, 0.85),
    ModelResult('SVM', 0.92, 0.90, 0.93, 0.91),
    ModelResult('NeuralNet', 0.88, 0.86, 0.89, 0.87),
]

# Easy to read and access
for result in results:
    print(f"{result.model_name}:")
    print(f"  Accuracy:  {result.accuracy:.1%}")
    print(f"  Precision: {result.precision:.1%}")
    print(f"  F1-Score:  {result.f1_score:.1%}")
    print()

# Find best model
best = max(results, key=lambda r: r.accuracy)
print(f"Best model: {best.model_name} ({best.accuracy:.1%})")

namedtuple Methods

from collections import namedtuple

Person = namedtuple('Person', ['name', 'age', 'city'])

p = Person('Alice', 25, 'NYC')

# Convert to dict
print(p._asdict())
# Output: {'name': 'Alice', 'age': 25, 'city': 'NYC'}

# Replace (creates new instance - immutable!)
p2 = p._replace(age=26)
print(p2)  # Person(name='Alice', age=26, city='NYC')

# Create from iterable
data = ['Bob', 30, 'LA']
p3 = Person._make(data)
print(p3)  # Person(name='Bob', age=30, city='LA')

OrderedDict: Ordered with Utilities

While modern dict preserves insertion order (Python 3.7+), OrderedDict adds useful methods like move_to_end().
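
Here's move_to_end() in isolation before we put it to work:

from collections import OrderedDict

od = OrderedDict([('a', 1), ('b', 2), ('c', 3)])

od.move_to_end('a')              # send 'a' to the right end
print(list(od))                  # ['b', 'c', 'a']

od.move_to_end('c', last=False)  # send 'c' to the left end
print(list(od))                  # ['c', 'b', 'a']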

LRU Cache Pattern

from collections import OrderedDict

class LRUCache:
    """Least Recently Used cache implementation."""
    
    def __init__(self, capacity):
        self.cache = OrderedDict()
        self.capacity = capacity
    
    def get(self, key):
        if key not in self.cache:
            return None
        # Move to end (most recently used)
        self.cache.move_to_end(key)
        return self.cache[key]
    
    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        # Evict the least recently used entry (front of the OrderedDict)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

# Test LRU cache
cache = LRUCache(capacity=3)
cache.put('a', 1)
cache.put('b', 2)
cache.put('c', 3)
print(dict(cache.cache))  # {'a': 1, 'b': 2, 'c': 3}

cache.put('d', 4)  # Evicts 'a'
print(dict(cache.cache))  # {'b': 2, 'c': 3, 'd': 4}
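
Side note: if all you need is to memoize a function, the standard library already ships this pattern as the functools.lru_cache decorator. A hand-rolled OrderedDict cache like this one is mainly useful when you want to cache arbitrary key/value pairs or control eviction yourself.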

ChainMap: Layered Dictionaries

ChainMap searches multiple dictionaries as one view. Perfect for configuration management!

from collections import ChainMap

# Default configuration
defaults = {
    'batch_size': 32,
    'learning_rate': 0.001,
    'epochs': 100,
    'optimizer': 'adam'
}

# User overrides
user_config = {
    'batch_size': 64,
    'epochs': 50
}

# Layer them (user_config takes precedence)
config = ChainMap(user_config, defaults)

print(f"Batch size: {config['batch_size']}")      # 64 (from user)
print(f"Learning rate: {config['learning_rate']}")  # 0.001 (from defaults)
print(f"Epochs: {config['epochs']}")              # 50 (from user)
print(f"Optimizer: {config['optimizer']}")        # adam (from defaults)

Real-World ML Example: Complete Data Pipeline

from collections import Counter, defaultdict, deque, namedtuple

# Define data structures
Sample = namedtuple('Sample', ['id', 'features', 'label'])

class DataPipeline:
    """Complete data processing pipeline using collections."""
    
    def __init__(self):
        self.samples_by_label = defaultdict(list)
        self.feature_stats = Counter()
        self.recent_samples = deque(maxlen=100)
    
    def add_sample(self, sample):
        """Process and store a sample."""
        # Group by label
        self.samples_by_label[sample.label].append(sample)
        
        # Track feature occurrences
        for feature in sample.features:
            self.feature_stats[feature] += 1
        
        # Keep recent samples
        self.recent_samples.append(sample)
    
    def get_statistics(self):
        """Get dataset statistics."""
        total = sum(len(samples) for samples in self.samples_by_label.values())
        
        print(f"Total samples: {total}")
        print(f"\nClass distribution:")
        for label, samples in self.samples_by_label.items():
            count = len(samples)
            pct = (count / total) * 100
            print(f"  Class {label}: {count} ({pct:.1f}%)")
        
        print(f"\nTop 5 features:")
        for feature, count in self.feature_stats.most_common(5):
            print(f"  {feature}: {count} occurrences")

# Test the pipeline
pipeline = DataPipeline()

# Add samples
samples = [
    Sample('s1', ['age', 'income', 'education'], 0),
    Sample('s2', ['age', 'income'], 1),
    Sample('s3', ['age', 'education'], 0),
    Sample('s4', ['income', 'education'], 1),
    Sample('s5', ['age', 'income', 'education'], 1),
]

for sample in samples:
    pipeline.add_sample(sample)

pipeline.get_statistics()

Conclusion: Collections Module Mastery

You've learned Python's powerful collections module:

✅ Counter - Effortless frequency counting and analysis
✅ defaultdict - Automatic default values, no KeyError
✅ deque - Lightning-fast double-ended queue
✅ namedtuple - Readable, immutable records
✅ OrderedDict - Ordered dict with move_to_end
✅ ChainMap - Layer multiple dictionaries

These aren't just conveniences; they're specialized, optimized containers that make your data science code faster and more Pythonic!

Quick Reference

from collections import Counter, defaultdict, deque, namedtuple, OrderedDict, ChainMap

# Counter - frequency tables
counts = Counter(['a', 'b', 'a', 'c', 'b', 'a'])

# defaultdict - automatic defaults
groups = defaultdict(list)

# deque - fast queue
queue = deque(maxlen=100)

# namedtuple - readable records
Point = namedtuple('Point', ['x', 'y'])

# OrderedDict - ordered with utilities
od = OrderedDict()

# ChainMap - layered dicts
config = ChainMap(user_config, defaults)

Master these containers, and you'll write cleaner, faster data science code! 🚀


If you found this guide helpful and are using Python's collections module in your projects, I'd love to hear about it! Connect with me on Twitter or LinkedIn.

Support My Work

If this guide helped you master Python's collections module, use Counter, deque, and defaultdict effectively, or write more efficient code, I'd really appreciate your support! Creating comprehensive, practical Python tutorials like this takes significant time and effort. Your support helps me continue sharing knowledge and creating more helpful resources for Python developers.

☕ Buy me a coffee - Every contribution, big or small, means the world to me and keeps me motivated to create more content!


Cover image by Sunira Moses on Unsplash

Ojaswi Athghara

SDE, 4+ Years

© ojaswiat.com 2025-2027