Python Iterators and Generators: Memory-Efficient Data Processing
Master Python iterators and generators for efficient data processing. Learn yield, itertools, generator expressions, and memory-efficient techniques for large datasets.

When My Script Ran Out of Memory
I was processing a 10GB dataset when my script crashed with a MemoryError. I'd loaded everything into a list:
# This crashes with large files!
data = [process_line(line) for line in open('huge_file.txt')]
My mentor showed me generators:
# This works with ANY size file!
data = (process_line(line) for line in open('huge_file.txt'))
That one change solved everything. Generators are magic for big data!
Understanding the Iterator Protocol
How Iteration Works
When you use a for loop, Python uses the iterator protocol behind the scenes:
# What Python does internally:
items = [1, 2, 3]

# 1. Get iterator object
iterator = iter(items)  # Calls items.__iter__()

# 2. Loop until StopIteration
while True:
    try:
        item = next(iterator)  # Calls iterator.__next__()
        print(item)
    except StopIteration:
        break
Creating Custom Iterators
class Counter:
    """Iterator that counts from start to end."""

    def __init__(self, start, end):
        self.current = start
        self.end = end

    def __iter__(self):
        """Return the iterator object (self)."""
        return self

    def __next__(self):
        """Return the next value or raise StopIteration."""
        if self.current > self.end:
            raise StopIteration
        value = self.current
        self.current += 1
        return value

# Use the iterator
counter = Counter(1, 5)
for num in counter:
    print(num)  # 1, 2, 3, 4, 5

# Manual iteration
counter2 = Counter(10, 12)
print(next(counter2))  # 10
print(next(counter2))  # 11
print(next(counter2))  # 12
# next(counter2)  # Raises StopIteration
Iterable vs Iterator
Iterable: Object that can return an iterator (has __iter__())
Iterator: Object that produces values (has __iter__() and __next__())
# List is iterable, not an iterator
my_list = [1, 2, 3]
print(hasattr(my_list, '__iter__')) # True
print(hasattr(my_list, '__next__')) # False
# Get iterator from iterable
my_iter = iter(my_list)
print(hasattr(my_iter, '__iter__')) # True
print(hasattr(my_iter, '__next__')) # True
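The practical difference: an iterable can hand out a fresh iterator on every __iter__() call, so it can be looped over repeatedly, while a plain iterator is consumed once. A minimal sketch of a reusable iterable (the class name is just illustrative):

class CountdownRange:
    """Iterable (not an iterator): each __iter__() call returns a fresh generator."""
    def __init__(self, start):
        self.start = start

    def __iter__(self):
        n = self.start
        while n > 0:
            yield n
            n -= 1

countdown = CountdownRange(3)
print(list(countdown))  # [3, 2, 1]
print(list(countdown))  # [3, 2, 1] - reusable, unlike a plain iterator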
Generators: Simpler Iterators
Generators are functions that use yield instead of return. They automatically implement the iterator protocol!
Basic Generator
def counter(start, end):
    """Generator version - much simpler!"""
    current = start
    while current <= end:
        yield current
        current += 1

# Creates a generator object
gen = counter(1, 5)
print(type(gen))  # <class 'generator'>

# Use like any iterator
for num in gen:
    print(num)  # 1, 2, 3, 4, 5
How yield Works
def explain_yield():
    print("Before first yield")
    yield 1
    print("Between yields")
    yield 2
    print("Before last yield")
    yield 3
    print("After last yield")

gen = explain_yield()
print("Generator created")
print(next(gen))  # Before first yield → 1
print(next(gen))  # Between yields → 2
print(next(gen))  # Before last yield → 3
# next(gen)  # After last yield → StopIteration
Key insight: Execution pauses at yield and resumes on the next next() call!
Generator Functions: The Power of yield
Memory Efficiency
import sys

# List - stores all values in memory
def numbers_list(n):
    return [i for i in range(n)]

# Generator - produces values on demand
def numbers_gen(n):
    for i in range(n):
        yield i

# Compare memory usage
list_obj = numbers_list(1000000)
gen_obj = numbers_gen(1000000)
print(f"List size: {sys.getsizeof(list_obj)} bytes")      # ~8MB
print(f"Generator size: {sys.getsizeof(gen_obj)} bytes")  # ~120 bytes!
Classic Examples
Fibonacci sequence:
def fibonacci(n):
    """Generate first n Fibonacci numbers."""
    a, b = 0, 1
    count = 0
    while count < n:
        yield a
        a, b = b, a + b
        count += 1

# Memory efficient - generates on demand
for num in fibonacci(10):
    print(num)  # 0, 1, 1, 2, 3, 5, 8, 13, 21, 34
Infinite sequence:
def infinite_sequence():
    """Generate an infinite sequence of numbers."""
    num = 0
    while True:
        yield num
        num += 1

# Works with infinite data!
gen = infinite_sequence()
print(next(gen))  # 0
print(next(gen))  # 1
print(next(gen))  # 2
# Can continue forever...
Generator Methods
Generators support advanced control methods:
def controllable_gen():
    received = None
    while True:
        # yield returns the value sent via send()
        received = yield (received * 2 if received else 0)
        print(f"Received: {received}")

gen = controllable_gen()
next(gen)            # Prime the generator (runs to the first yield)
print(gen.send(5))   # Received: 5 → 10
print(gen.send(10))  # Received: 10 → 20
print(gen.send(3))   # Received: 3 → 6
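Besides send(), the generator protocol also provides throw() to raise an exception at the paused yield and close() to shut the generator down. A small sketch of both (the generator itself is a made-up example):

def resilient_gen():
    try:
        while True:
            try:
                yield "working"
            except ValueError:
                # An exception injected with throw() lands at the paused yield
                print("Recovered from ValueError, continuing")
    finally:
        # close() raises GeneratorExit inside the generator, so cleanup runs
        print("Generator cleaned up")

gen = resilient_gen()
print(next(gen))              # working
print(gen.throw(ValueError))  # Recovered from ValueError, continuing → working
gen.close()                   # Generator cleaned up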
yield from: Generator Delegation
def generator1():
    yield 1
    yield 2

def generator2():
    yield 3
    yield 4

def combined():
    yield from generator1()
    yield from generator2()
    yield 5

print(list(combined()))  # [1, 2, 3, 4, 5]

# Flattening nested lists
def flatten(nested_list):
    for item in nested_list:
        if isinstance(item, list):
            yield from flatten(item)  # Recursive!
        else:
            yield item

nested = [1, [2, [3, 4], 5], 6, [7, 8]]
print(list(flatten(nested)))  # [1, 2, 3, 4, 5, 6, 7, 8]
Generator Expressions: One-Line Generators
# List comprehension (loads everything)
squares_list = [x**2 for x in range(1000000)]  # Uses lots of memory!

# Generator expression (lazy evaluation)
squares_gen = (x**2 for x in range(1000000))   # Uses almost no memory!

# Process one at a time
for square in squares_gen:
    if square > 100:
        break
Data Science Example: Processing Large Files
def process_large_csv(filename):
    """Process CSV file line by line (memory efficient)."""
    with open(filename) as file:
        # Skip header
        next(file)
        for line in file:
            # Process one line at a time
            values = line.strip().split(',')
            yield {
                'name': values[0],
                'age': int(values[1]),
                'score': float(values[2])
            }

# Process millions of rows without loading all into memory!
for record in process_large_csv('huge_dataset.csv'):
    if record['score'] > 90:
        print(f"{record['name']}: {record['score']}")
itertools: Generator Superpowers
The itertools module provides powerful generator-based tools:
Infinite Iterators
import itertools

# count - infinite counter
for i in itertools.count(start=10, step=2):
    if i > 20:
        break
    print(i)  # 10, 12, 14, 16, 18, 20

# cycle - repeat sequence infinitely
colors = itertools.cycle(['red', 'green', 'blue'])
for i, color in enumerate(colors):
    if i >= 6:
        break
    print(color)  # red, green, blue, red, green, blue

# repeat - repeat single value
for x in itertools.repeat('hello', 3):
    print(x)  # hello, hello, hello
Combinatoric Iterators
# Combinations - order doesn't matter, no repeats
features = ['age', 'income', 'education']
feature_pairs = itertools.combinations(features, 2)
print(list(feature_pairs))
# [('age', 'income'), ('age', 'education'), ('income', 'education')]
# Permutations - order matters
letters = ['A', 'B', 'C']
perms = itertools.permutations(letters, 2)
print(list(perms))
# [('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B')]
# Product - Cartesian product
colors = ['red', 'blue']
sizes = ['S', 'M', 'L']
variants = itertools.product(colors, sizes)
print(list(variants))
# [('red', 'S'), ('red', 'M'), ('red', 'L'),
# ('blue', 'S'), ('blue', 'M'), ('blue', 'L')]
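A typical data-science use of product is sweeping a hyperparameter grid lazily; a quick sketch with made-up values:

# Hypothetical hyperparameter grid, generated lazily with product
learning_rates = [0.01, 0.1]
batch_sizes = [32, 64]
for lr, bs in itertools.product(learning_rates, batch_sizes):
    print(f"lr={lr}, batch_size={bs}")
# lr=0.01, batch_size=32
# lr=0.01, batch_size=64
# lr=0.1, batch_size=32
# lr=0.1, batch_size=64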
Data Processing Iterators
# chain - combine multiple iterables
combined = itertools.chain([1, 2], [3, 4], [5, 6])
print(list(combined)) # [1, 2, 3, 4, 5, 6]
# islice - slice an iterator
data = itertools.count() # Infinite!
first_10 = itertools.islice(data, 10)
print(list(first_10)) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# takewhile - take until condition is False
data = [1, 4, 6, 4, 1]
result = itertools.takewhile(lambda x: x < 5, data)
print(list(result)) # [1, 4]
# dropwhile - drop until condition is False
data = [1, 4, 6, 4, 1]
result = itertools.dropwhile(lambda x: x < 5, data)
print(list(result)) # [6, 4, 1]
# groupby - group consecutive equal elements
data = [1, 1, 2, 2, 2, 3, 1, 1]
for key, group in itertools.groupby(data):
    print(f"{key}: {list(group)}")
# 1: [1, 1]
# 2: [2, 2, 2]
# 3: [3]
# 1: [1, 1]
Practical ML Example
from itertools import islice, cycle
import numpy as np

def batch_generator(data, batch_size, epochs):
    """Generate batches for multiple epochs."""
    for epoch in range(epochs):
        # Shuffle data each epoch (in practice)
        for i in range(0, len(data), batch_size):
            yield data[i:i + batch_size]

# Or infinite batches with cycling
def infinite_batch_gen(data, batch_size):
    """Infinite batch generator with cycling."""
    data_cycle = cycle(data)
    while True:
        batch = list(islice(data_cycle, batch_size))
        if batch:
            yield np.array(batch)
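Since infinite_batch_gen never terminates on its own, islice is a convenient way to pull a fixed number of batches from it; a usage sketch with toy data:

# Example usage (toy data): take the first 3 batches of 4 items each
data = list(range(10))
for batch in islice(infinite_batch_gen(data, batch_size=4), 3):
    print(batch)
# [0 1 2 3]
# [4 5 6 7]
# [8 9 0 1]  - cycling wraps around the data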
Generator Pipelines
Chain generators together for data processing pipelines:
def read_file(filename):
    """Generator: read file line by line."""
    with open(filename) as f:
        for line in f:
            yield line.strip()

def filter_comments(lines):
    """Generator: filter out comments."""
    for line in lines:
        if not line.startswith('#'):
            yield line

def parse_numbers(lines):
    """Generator: convert to numbers."""
    for line in lines:
        try:
            yield float(line)
        except ValueError:
            pass

def process_data(filename):
    """Pipeline: compose generators."""
    lines = read_file(filename)
    clean_lines = filter_comments(lines)
    numbers = parse_numbers(clean_lines)
    return numbers

# Memory-efficient pipeline!
for number in process_data('data.txt'):
    print(number)
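Because every stage is lazy, aggregate functions like sum() or max() also consume the pipeline one value at a time, so nothing is ever materialized as a list:

# Stream the whole pipeline straight into an aggregate
total = sum(process_data('data.txt'))
print(f"Total: {total}")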
Common Pitfalls
Pitfall 1: Exhausted Generators
# Wrong - generator can only be iterated once!
gen = (x**2 for x in range(5))
print(list(gen)) # [0, 1, 4, 9, 16]
print(list(gen)) # [] - exhausted!
# Right - create new generator or use list
numbers = list(x**2 for x in range(5))
print(numbers) # [0, 1, 4, 9, 16]
print(numbers) # [0, 1, 4, 9, 16] - still works
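If you genuinely need two passes over the same stream without building the list yourself, itertools.tee splits one iterator into independent copies. Note that tee buffers whatever one copy has consumed and the other has not, so it only saves memory when the copies advance roughly together:

import itertools

gen = (x**2 for x in range(5))
a, b = itertools.tee(gen, 2)
print(list(a))  # [0, 1, 4, 9, 16]
print(list(b))  # [0, 1, 4, 9, 16] - independent copy (buffered internally)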
Pitfall 2: Holding References
# Wrong - holds all items in memory anyway!
def bad_generator(data):
    results = []
    for item in data:
        result = process(item)
        results.append(result)  # Defeating the purpose!
        yield result

# Right - don't store processed items
def good_generator(data):
    for item in data:
        yield process(item)  # Memory efficient
Pitfall 3: Side Effects in Generators
# Be careful with side effects
def generator_with_side_effects():
    print("Generating 1")
    yield 1
    print("Generating 2")
    yield 2

# Side effects don't run until iteration starts
gen = generator_with_side_effects()  # No output yet
print("Created generator")
print(next(gen))  # Now "Generating 1" prints
Performance Comparison
import itertools
import sys
import time

# Test with 10 million items
n = 10_000_000

# List - loads everything
start = time.time()
list_data = [i**2 for i in range(n)]
first_100 = list_data[:100]
list_time = time.time() - start
list_memory = sys.getsizeof(list_data)

# Generator - lazy evaluation
start = time.time()
gen_data = (i**2 for i in range(n))
first_100 = list(itertools.islice(gen_data, 100))
gen_time = time.time() - start
gen_memory = sys.getsizeof(gen_data)

print(f"List: {list_time:.2f}s, {list_memory / 1024 / 1024:.1f}MB")
print(f"Generator: {gen_time:.4f}s, {gen_memory} bytes")
# List: 1.2s, 76.3MB
# Generator: 0.0001s, 112 bytes
Best Practices
1. Use Generators for Large Data
# Bad - loads everything
def process_logs_bad(filename):
    with open(filename) as f:
        lines = f.readlines()  # All in memory!
        return [parse_log(line) for line in lines]

# Good - memory efficient
def process_logs_good(filename):
    with open(filename) as f:
        for line in f:  # One at a time
            yield parse_log(line)
2. Generator Expressions for Simple Cases
# Simple transformation
squares = (x**2 for x in range(1000))
# Filtering
even_squares = (x**2 for x in range(1000) if x % 2 == 0)
3. Use itertools for Complex Operations
from itertools import islice, chain, groupby
# Don't reinvent the wheel
# Use itertools' optimized C implementations
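For instance, merging several sources, truncating the stream, and grouping equal runs are each a single call; a small sketch:

from itertools import chain, groupby, islice

# Merge iterables lazily, keep the first 8 items, then group equal runs
merged = chain([1, 1, 2], [2, 3, 3, 3], [1, 4])
for key, run in groupby(islice(merged, 8)):
    print(key, len(list(run)))
# 1 2
# 2 2
# 3 3
# 1 1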
4. Profile Memory Usage
import tracemalloc
tracemalloc.start()
# Your generator code here
data = (i**2 for i in range(1000000))
result = sum(data)
current, peak = tracemalloc.get_traced_memory()
print(f"Current: {current / 1024:.1f}KB, Peak: {peak / 1024:.1f}KB")
tracemalloc.stop()
5. Document Generator Exhaustion
def my_generator():
    """
    Generate values from 1 to 10.

    Note: Generator can only be iterated once.
    Create a new generator for multiple iterations.

    Yields:
        int: Numbers from 1 to 10
    """
    for i in range(1, 11):
        yield i
Key Takeaways
- Generators are lazy - values computed on demand, not upfront
- Memory efficient - perfect for large datasets and infinite sequences
- One-time use - generators exhaust after iteration (unlike lists)
- Use yield - simpler than implementing __iter__ and __next__
- itertools is powerful - provides optimized generator utilities
- Pipeline pattern - chain generators for complex data processing
- Generator expressions - () instead of [] for simple cases
Generators transform how you handle data in Python. They enable processing datasets larger than memory, create infinite sequences, and build elegant data pipelines. Master generators, and you'll write more efficient, scalable Python code!
Connect with me on Twitter or LinkedIn.
Support My Work
If this guide helped you understand Python iterators, generators, and the yield keyword, write more memory-efficient code, or master lazy evaluation, I'd really appreciate your support! Creating comprehensive tutorials on advanced Python concepts like this takes significant time and effort. Your support helps me continue sharing knowledge and creating more helpful resources for Python developers.
☕ Buy me a coffee - Every contribution, big or small, means the world to me and keeps me motivated to create more content!
Cover image by Jason Yuen on Unsplash