Python Iterators and Generators: Memory-Efficient Data Processing
Master Python iterators and generators for efficient data processing. Learn yield, itertools, generator expressions, and memory-efficient techniques for large datasets.

When My Script Ran Out of Memory
I was processing a 10GB dataset when my script crashed with a MemoryError. I'd loaded everything into a list:
# This crashes with large files!
data = [process_line(line) for line in open('huge_file.txt')]
My mentor showed me generators:
# This works with ANY size file!
data = (process_line(line) for line in open('huge_file.txt'))
That one change solved everything. Generators are magic for big data!
Understanding the Iterator Protocol
How Iteration Works
When you use a for loop, Python uses the iterator protocol behind the scenes:
# What Python does internally:
items = [1, 2, 3]

# 1. Get iterator object
iterator = iter(items)  # Calls items.__iter__()

# 2. Loop until StopIteration
while True:
    try:
        item = next(iterator)  # Calls iterator.__next__()
        print(item)
    except StopIteration:
        break
Creating Custom Iterators
class Counter:
    """Iterator that counts from start to end."""

    def __init__(self, start, end):
        self.current = start
        self.end = end

    def __iter__(self):
        """Return the iterator object (self)."""
        return self

    def __next__(self):
        """Return the next value or raise StopIteration."""
        if self.current > self.end:
            raise StopIteration
        value = self.current
        self.current += 1
        return value

# Use the iterator
counter = Counter(1, 5)
for num in counter:
    print(num)  # 1, 2, 3, 4, 5

# Manual iteration
counter2 = Counter(10, 12)
print(next(counter2))  # 10
print(next(counter2))  # 11
print(next(counter2))  # 12
# next(counter2)  # Raises StopIteration
Iterable vs Iterator
Iterable: Object that can return an iterator (has __iter__())
Iterator: Object that produces values (has __iter__() and __next__())
# List is iterable, not an iterator
my_list = [1, 2, 3]
print(hasattr(my_list, '__iter__')) # True
print(hasattr(my_list, '__next__')) # False
# Get iterator from iterable
my_iter = iter(my_list)
print(hasattr(my_iter, '__iter__')) # True
print(hasattr(my_iter, '__next__')) # True
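The practical difference: an iterable can hand out a fresh iterator on every __iter__() call, so it can be looped over repeatedly, while a plain iterator is consumed once. A minimal sketch of a reusable iterable (the class name is just illustrative):

class CountdownRange:
    """Iterable (not an iterator): each __iter__() call returns a fresh generator."""
    def __init__(self, start):
        self.start = start

    def __iter__(self):
        n = self.start
        while n > 0:
            yield n
            n -= 1

countdown = CountdownRange(3)
print(list(countdown))  # [3, 2, 1]
print(list(countdown))  # [3, 2, 1] - reusable, unlike a plain iterator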
Generators: Simpler Iterators
Generators are functions that use yield instead of return. They automatically implement the iterator protocol!
Basic Generator
def counter(start, end):
    """Generator version - much simpler!"""
    current = start
    while current <= end:
        yield current
        current += 1

# Creates a generator object
gen = counter(1, 5)
print(type(gen))  # <class 'generator'>

# Use like any iterator
for num in gen:
    print(num)  # 1, 2, 3, 4, 5
How yield Works
def explain_yield():
    print("Before first yield")
    yield 1
    print("Between yields")
    yield 2
    print("Before last yield")
    yield 3
    print("After last yield")

gen = explain_yield()
print("Generator created")
print(next(gen))  # Before first yield → 1
print(next(gen))  # Between yields → 2
print(next(gen))  # Before last yield → 3
# next(gen)  # After last yield → StopIteration
Key insight: Execution pauses at yield and resumes on the next next() call!
Generator Functions: The Power of yield
Memory Efficiency
import sys

# List - stores all values in memory
def numbers_list(n):
    return [i for i in range(n)]

# Generator - produces values on demand
def numbers_gen(n):
    for i in range(n):
        yield i

# Compare memory usage
list_obj = numbers_list(1000000)
gen_obj = numbers_gen(1000000)
print(f"List size: {sys.getsizeof(list_obj)} bytes")      # ~8MB
print(f"Generator size: {sys.getsizeof(gen_obj)} bytes")  # ~120 bytes!
Classic Examples
Fibonacci sequence:
def fibonacci(n):
    """Generate first n Fibonacci numbers."""
    a, b = 0, 1
    count = 0
    while count < n:
        yield a
        a, b = b, a + b
        count += 1

# Memory efficient - generates on demand
for num in fibonacci(10):
    print(num)  # 0, 1, 1, 2, 3, 5, 8, 13, 21, 34
Infinite sequence:
def infinite_sequence():
    """Generate an infinite sequence of numbers."""
    num = 0
    while True:
        yield num
        num += 1

# Works with infinite data!
gen = infinite_sequence()
print(next(gen))  # 0
print(next(gen))  # 1
print(next(gen))  # 2
# Can continue forever...
Generator Methods
Generators support advanced control methods:
def controllable_gen():
    received = None
    while True:
        # yield returns the value sent via send()
        received = yield (received * 2 if received else 0)
        print(f"Received: {received}")

gen = controllable_gen()
next(gen)            # Prime the generator (runs to the first yield)
print(gen.send(5))   # Received: 5 → 10
print(gen.send(10))  # Received: 10 → 20
print(gen.send(3))   # Received: 3 → 6
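Besides send(), the generator protocol also provides throw() to raise an exception at the paused yield and close() to shut the generator down. A small sketch of both (the generator itself is a made-up example):

def resilient_gen():
    try:
        while True:
            try:
                yield "working"
            except ValueError:
                # An exception injected with throw() lands at the paused yield
                print("Recovered from ValueError, continuing")
    finally:
        # close() raises GeneratorExit inside the generator, so cleanup runs
        print("Generator cleaned up")

gen = resilient_gen()
print(next(gen))              # working
print(gen.throw(ValueError))  # Recovered from ValueError, continuing → working
gen.close()                   # Generator cleaned up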
yield from: Generator Delegation
def generator1():
    yield 1
    yield 2

def generator2():
    yield 3
    yield 4

def combined():
    yield from generator1()
    yield from generator2()
    yield 5

print(list(combined()))  # [1, 2, 3, 4, 5]

# Flattening nested lists
def flatten(nested_list):
    for item in nested_list:
        if isinstance(item, list):
            yield from flatten(item)  # Recursive!
        else:
            yield item

nested = [1, [2, [3, 4], 5], 6, [7, 8]]
print(list(flatten(nested)))  # [1, 2, 3, 4, 5, 6, 7, 8]
Generator Expressions: One-Line Generators
# List comprehension (loads everything)
squares_list = [x**2 for x in range(1000000)]  # Uses lots of memory!

# Generator expression (lazy evaluation)
squares_gen = (x**2 for x in range(1000000))   # Uses almost no memory!

# Process one at a time
for square in squares_gen:
    if square > 100:
        break
Data Science Example: Processing Large Files
def process_large_csv(filename):
    """Process CSV file line by line (memory efficient)."""
    with open(filename) as file:
        # Skip header
        next(file)
        for line in file:
            # Process one line at a time
            values = line.strip().split(',')
            yield {
                'name': values[0],
                'age': int(values[1]),
                'score': float(values[2])
            }

# Process millions of rows without loading all into memory!
for record in process_large_csv('huge_dataset.csv'):
    if record['score'] > 90:
        print(f"{record['name']}: {record['score']}")
itertools: Generator Superpowers
The itertools module provides powerful generator-based tools:
Infinite Iterators
import itertools

# count - infinite counter
for i in itertools.count(start=10, step=2):
    if i > 20:
        break
    print(i)  # 10, 12, 14, 16, 18, 20

# cycle - repeat sequence infinitely
colors = itertools.cycle(['red', 'green', 'blue'])
for i, color in enumerate(colors):
    if i >= 6:
        break
    print(color)  # red, green, blue, red, green, blue

# repeat - repeat single value
for x in itertools.repeat('hello', 3):
    print(x)  # hello, hello, hello
Combinatoric Iterators
# Combinations - order doesn't matter, no repeats
features = ['age', 'income', 'education']
feature_pairs = itertools.combinations(features, 2)
print(list(feature_pairs))
# [('age', 'income'), ('age', 'education'), ('income', 'education')]
# Permutations - order matters
letters = ['A', 'B', 'C']
perms = itertools.permutations(letters, 2)
print(list(perms))
# [('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B')]
# Product - Cartesian product
colors = ['red', 'blue']
sizes = ['S', 'M', 'L']
variants = itertools.product(colors, sizes)
print(list(variants))
# [('red', 'S'), ('red', 'M'), ('red', 'L'),
# ('blue', 'S'), ('blue', 'M'), ('blue', 'L')]
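A typical data-science use of product is sweeping a hyperparameter grid lazily; a quick sketch with made-up values:

# Hypothetical hyperparameter grid, generated lazily with product
learning_rates = [0.01, 0.1]
batch_sizes = [32, 64]
for lr, bs in itertools.product(learning_rates, batch_sizes):
    print(f"lr={lr}, batch_size={bs}")
# lr=0.01, batch_size=32
# lr=0.01, batch_size=64
# lr=0.1, batch_size=32
# lr=0.1, batch_size=64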
Data Processing Iterators
# chain - combine multiple iterables
combined = itertools.chain([1, 2], [3, 4], [5, 6])
print(list(combined)) # [1, 2, 3, 4, 5, 6]
# islice - slice an iterator
data = itertools.count() # Infinite!
first_10 = itertools.islice(data, 10)
print(list(first_10)) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# takewhile - take until condition is False
data = [1, 4, 6, 4, 1]
result = itertools.takewhile(lambda x: x < 5, data)
print(list(result)) # [1, 4]
# dropwhile - drop until condition is False
data = [1, 4, 6, 4, 1]
result = itertools.dropwhile(lambda x: x < 5, data)
print(list(result)) # [6, 4, 1]
# groupby - group consecutive equal elements
data = [1, 1, 2, 2, 2, 3, 1, 1]
for key, group in itertools.groupby(data):
    print(f"{key}: {list(group)}")
# 1: [1, 1]
# 2: [2, 2, 2]
# 3: [3]
# 1: [1, 1]
Practical ML Example
from itertools import islice, cycle
import numpy as np

def batch_generator(data, batch_size, epochs):
    """Generate batches for multiple epochs."""
    for epoch in range(epochs):
        # Shuffle data each epoch (in practice)
        for i in range(0, len(data), batch_size):
            yield data[i:i + batch_size]

# Or infinite batches with cycling
def infinite_batch_gen(data, batch_size):
    """Infinite batch generator with cycling."""
    data_cycle = cycle(data)
    while True:
        batch = list(islice(data_cycle, batch_size))
        if batch:
            yield np.array(batch)
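Since infinite_batch_gen never terminates on its own, islice is a convenient way to pull a fixed number of batches from it; a usage sketch with toy data:

# Example usage (toy data): take the first 3 batches of 4 items each
data = list(range(10))
for batch in islice(infinite_batch_gen(data, batch_size=4), 3):
    print(batch)
# [0 1 2 3]
# [4 5 6 7]
# [8 9 0 1]  - cycling wraps around the data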
Generator Pipelines
Chain generators together for data processing pipelines:
def read_file(filename):
    """Generator: read file line by line."""
    with open(filename) as f:
        for line in f:
            yield line.strip()

def filter_comments(lines):
    """Generator: filter out comments."""
    for line in lines:
        if not line.startswith('#'):
            yield line

def parse_numbers(lines):
    """Generator: convert to numbers."""
    for line in lines:
        try:
            yield float(line)
        except ValueError:
            pass

def process_data(filename):
    """Pipeline: compose generators."""
    lines = read_file(filename)
    clean_lines = filter_comments(lines)
    numbers = parse_numbers(clean_lines)
    return numbers

# Memory-efficient pipeline!
for number in process_data('data.txt'):
    print(number)
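Because every stage is lazy, aggregate functions like sum() or max() also consume the pipeline one value at a time, so nothing is ever materialized as a list:

# Stream the whole pipeline straight into an aggregate
total = sum(process_data('data.txt'))
print(f"Total: {total}")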
Common Pitfalls
Pitfall 1: Exhausted Generators
# Wrong - generator can only be iterated once!
gen = (x**2 for x in range(5))
print(list(gen)) # [0, 1, 4, 9, 16]
print(list(gen)) # [] - exhausted!
# Right - create new generator or use list
numbers = list(x**2 for x in range(5))
print(numbers) # [0, 1, 4, 9, 16]
print(numbers) # [0, 1, 4, 9, 16] - still works
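If you genuinely need two passes over the same stream without building the list yourself, itertools.tee splits one iterator into independent copies. Note that tee buffers whatever one copy has consumed and the other has not, so it only saves memory when the copies advance roughly together:

import itertools

gen = (x**2 for x in range(5))
a, b = itertools.tee(gen, 2)
print(list(a))  # [0, 1, 4, 9, 16]
print(list(b))  # [0, 1, 4, 9, 16] - independent copy (buffered internally)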
Pitfall 2: Holding References
# Wrong - holds all items in memory anyway!
def bad_generator(data):
    results = []
    for item in data:
        result = process(item)
        results.append(result)  # Defeating the purpose!
        yield result

# Right - don't store processed items
def good_generator(data):
    for item in data:
        yield process(item)  # Memory efficient
Pitfall 3: Side Effects in Generators
# Be careful with side effects
def generator_with_side_effects():
    print("Generating 1")
    yield 1
    print("Generating 2")
    yield 2

# Side effects don't run until iteration starts
gen = generator_with_side_effects()  # No output yet
print("Created generator")
print(next(gen))  # Now "Generating 1" prints
Performance Comparison
import itertools
import sys
import time

# Test with 10 million items
n = 10_000_000

# List - loads everything
start = time.time()
list_data = [i**2 for i in range(n)]
first_100 = list_data[:100]
list_time = time.time() - start
list_memory = sys.getsizeof(list_data)

# Generator - lazy evaluation
start = time.time()
gen_data = (i**2 for i in range(n))
first_100 = list(itertools.islice(gen_data, 100))
gen_time = time.time() - start
gen_memory = sys.getsizeof(gen_data)

print(f"List: {list_time:.2f}s, {list_memory / 1024 / 1024:.1f}MB")
print(f"Generator: {gen_time:.4f}s, {gen_memory} bytes")
# List: 1.2s, 76.3MB
# Generator: 0.0001s, 112 bytes
Best Practices
1. Use Generators for Large Data
# Bad - loads everything
def process_logs_bad(filename):
    with open(filename) as f:
        lines = f.readlines()  # All in memory!
        return [parse_log(line) for line in lines]

# Good - memory efficient
def process_logs_good(filename):
    with open(filename) as f:
        for line in f:  # One at a time
            yield parse_log(line)
2. Generator Expressions for Simple Cases
# Simple transformation
squares = (x**2 for x in range(1000))
# Filtering
even_squares = (x**2 for x in range(1000) if x % 2 == 0)
3. Use itertools for Complex Operations
from itertools import islice, chain, groupby
# Don't reinvent the wheel
# Use itertools' optimized C implementations
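For instance, merging several sources, truncating the stream, and grouping equal runs are each a single call; a small sketch:

from itertools import chain, groupby, islice

# Merge iterables lazily, keep the first 8 items, then group equal runs
merged = chain([1, 1, 2], [2, 3, 3, 3], [1, 4])
for key, run in groupby(islice(merged, 8)):
    print(key, len(list(run)))
# 1 2
# 2 2
# 3 3
# 1 1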
4. Profile Memory Usage
import tracemalloc
tracemalloc.start()
# Your generator code here
data = (i**2 for i in range(1000000))
result = sum(data)
current, peak = tracemalloc.get_traced_memory()
print(f"Current: {current / 1024:.1f}KB, Peak: {peak / 1024:.1f}KB")
tracemalloc.stop()
5. Document Generator Exhaustion
def my_generator():
    """
    Generate values from 1 to 10.

    Note: Generator can only be iterated once.
    Create a new generator for multiple iterations.

    Yields:
        int: Numbers from 1 to 10
    """
    for i in range(1, 11):
        yield i
Key Takeaways
- Generators are lazy - values computed on demand, not upfront
- Memory efficient - perfect for large datasets and infinite sequences
- One-time use - generators exhaust after iteration (unlike lists)
- Use yield - simpler than implementing __iter__ and __next__
- itertools is powerful - provides optimized generator utilities
- Pipeline pattern - chain generators for complex data processing
- Generator expressions - () instead of [] for simple cases
Generators transform how you handle data in Python. They enable processing datasets larger than memory, create infinite sequences, and build elegant data pipelines. Master generators, and you'll write more efficient, scalable Python code!
Connect with me on Twitter or LinkedIn.
Support My Work
If this guide helped you understand Python iterators, generators, and the yield keyword, write more memory-efficient code, or master lazy evaluation, I'd really appreciate your support! Creating comprehensive tutorials on advanced Python concepts like this takes significant time and effort. Your support helps me continue sharing knowledge and creating more helpful resources for Python developers.
☕ Buy me a coffee - Every contribution, big or small, means the world to me and keeps me motivated to create more content!
Cover image by Jason Yuen on Unsplash