Python File Handling: Read, Write, and Process Files for Data Science

Master Python file operations for data science. Learn to read/write text files, handle CSV data, use context managers, and manage file paths safely.

πŸ“… Published: February 14, 2025 ✏️ Updated: April 5, 2025 By Ojaswi Athghara
#python #file-io #csv #read-write #context

The Day I Lost All My Data

I was working on my first data science project when disaster struck. I'd spent hours preprocessing a dataset, but forgot to save the results. My laptop crashed, and everything was gone.

That's when I learned the hard way: file operations aren't optional in data scienceβ€”they're essential. Every dataset you load, every model you train, every result you generate needs to be saved and loaded properly.

In this guide, I'll teach you everything about Python file handling so you never lose your work again!

Why File I/O Matters for Data Scientists

File operations are the foundation of data science workflows:

  • Data Loading: Reading datasets (CSV, JSON, text)
  • Model Persistence: Saving trained models
  • Results Storage: Storing predictions and metrics
  • Configuration: Loading hyperparameters
  • Logging: Tracking experiments

Master file operations = Master data science!

Reading Files: The Basics

Method 1: Basic File Reading

# Open, read, close (manual approach)
file = open('data.txt', 'r')
content = file.read()
file.close()

print(content)

Problem: If an error occurs, file.close() never runs, leaving the file open!

# Better approach - auto-closes file
with open('data.txt', 'r') as file:
    content = file.read()
    print(content)
# File automatically closed here, even if error occurs!

Always use with statements for file operations!
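
You can verify the guarantee yourself: a file object's closed attribute flips to True as soon as the with block exits, even when an exception is raised inside it. A minimal sketch, assuming data.txt exists:

# Prove that 'with' closes the file even when an error occurs
try:
    with open('data.txt', 'r') as file:
        raise ValueError("Simulated failure mid-read")
except ValueError:
    pass

print(file.closed)  # True -- closed despite the exception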

Reading Line by Line

# Read all lines at once
with open('data.txt', 'r') as file:
    lines = file.readlines()  # Returns list of lines
    
for i, line in enumerate(lines, 1):
    print(f"Line {i}: {line.strip()}")

# Read line by line (memory efficient!)
with open('data.txt', 'r') as file:
    for line in file:
        print(line.strip())

Pro tip: For large files, iterate line-by-line to save memory!
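
To make the memory savings concrete, here's a minimal sketch that computes statistics over a file of any size while holding only one line in memory at a time (again assuming data.txt exists):

# Aggregate over a file without ever loading it whole
line_count = 0
char_count = 0

with open('data.txt', 'r') as file:
    for line in file:
        line_count += 1
        char_count += len(line)

print(f"{line_count} lines, {char_count} characters")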

Writing Files: Saving Your Work

Writing Text

# Write to file (overwrites existing content)
data = [
    "Machine learning is amazing!",
    "Python makes data science easy.",
    "Always save your work!"
]

with open('notes.txt', 'w') as file:
    for line in data:
        file.write(line + '\n')

print("Data saved successfully!")

Appending to Files

# Append to existing file
with open('notes.txt', 'a') as file:
    file.write("This line was added later.\n")
    file.write("Append mode preserves existing content.\n")

Data Science Example: Saving Results

# Save model evaluation results
results = {
    'accuracy': 0.87,
    'precision': 0.85,
    'recall': 0.89,
    'f1_score': 0.87
}

with open('model_results.txt', 'w') as file:
    file.write("Model Evaluation Results\n")
    file.write("=" * 30 + "\n")
    for metric, value in results.items():
        file.write(f"{metric}: {value:.2%}\n")
    
print("Results saved to model_results.txt")

File Modes: Choosing the Right Mode

Mode | Purpose          | Creates New? | Overwrites?
-----|------------------|--------------|----------------
'r'  | Read             | No           | N/A
'w'  | Write            | Yes          | Yes
'a'  | Append           | Yes          | No
'x'  | Exclusive create | Yes          | Error if exists
'r+' | Read + Write     | No           | No

# Demonstration of modes

# Write mode - overwrites!
with open('test.txt', 'w') as f:
    f.write("First content\n")

with open('test.txt', 'w') as f:
    f.write("Second content\n")  # First content GONE!

# Append mode - preserves
with open('test.txt', 'a') as f:
    f.write("Third content\n")  # Added to end

# Exclusive create - fails if exists
try:
    with open('test.txt', 'x') as f:
        f.write("New file\n")
except FileExistsError:
    print("File already exists!")

Working with CSV Files

CSV (Comma-Separated Values) is the most common data format in data science!

Method 1: Manual CSV Parsing

# Read CSV manually
with open('data.csv', 'r') as file:
    lines = file.readlines()
    
# Parse header
header = lines[0].strip().split(',')
print(f"Columns: {header}")

# Parse data
data = []
for line in lines[1:]:
    values = line.strip().split(',')
    data.append(values)

print(f"Loaded {len(data)} rows")

Method 2: Using the csv Module (BETTER)

import csv

# Write CSV
students = [
    ['Name', 'Age', 'Grade'],
    ['Alice', '20', 'A'],
    ['Bob', '22', 'B'],
    ['Charlie', '21', 'A'],
]

with open('students.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(students)

# Read CSV
with open('students.csv', 'r') as file:
    reader = csv.reader(file)
    header = next(reader)  # Skip header
    
    for row in reader:
        name, age, grade = row
        print(f"{name} is {age} years old and got {grade}")

CSV with Dictionaries

import csv

# Write CSV with dict
students = [
    {'name': 'Alice', 'age': 20, 'grade': 'A'},
    {'name': 'Bob', 'age': 22, 'grade': 'B'},
    {'name': 'Charlie', 'age': 21, 'grade': 'A'},
]

with open('students_dict.csv', 'w', newline='') as file:
    fieldnames = ['name', 'age', 'grade']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    
    writer.writeheader()
    writer.writerows(students)

# Read CSV as dict
with open('students_dict.csv', 'r') as file:
    reader = csv.DictReader(file)
    
    for row in reader:
        print(f"{row['name']}: Age {row['age']}, Grade {row['grade']}")

Path Handling: Cross-Platform Compatibility

Problem: Platform-Specific Paths

# Windows: C:\Users\Name\data.csv
# Mac/Linux: /Users/Name/data.csv

# Wrong approach: breaks on other OSes, and single backslashes are
# escape sequences in Python strings ('\f' here is a form feed!)
file_path = "data\\folder\\file.txt"  # Windows-only, even when escaped correctly

Solution 1: os.path

import os

# Join paths correctly for any OS
folder = 'data'
filename = 'dataset.csv'
file_path = os.path.join(folder, filename)

print(f"Path: {file_path}")

# Check if file exists
if os.path.exists(file_path):
    print("File found!")
    print(f"Size: {os.path.getsize(file_path)} bytes")
else:
    print("File not found!")

# Get current directory
print(f"Working directory: {os.getcwd()}")

# Create directory if it doesn't exist
os.makedirs('data/processed', exist_ok=True)

Solution 2: pathlib (MODERN APPROACH)

from pathlib import Path

# Create Path objects
data_dir = Path('data')
file_path = data_dir / 'dataset.csv'  # Clean syntax!

print(f"Path: {file_path}")

# Check existence
if file_path.exists():
    print(f"File exists!")
    print(f"Size: {file_path.stat().st_size} bytes")

# Read file directly
if file_path.exists():
    content = file_path.read_text()
    print(content[:100])  # First 100 chars

# Write file directly
output = Path('output.txt')
output.write_text("Results from model training")

# Create directories
Path('data/raw').mkdir(parents=True, exist_ok=True)
Path('data/processed').mkdir(parents=True, exist_ok=True)
Path('models').mkdir(exist_ok=True)
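
pathlib also shines at finding files, which comes up constantly when a dataset is split across many CSVs. A short sketch, assuming the data/raw folder created above contains some files:

# Glob patterns find matching files (like the shell's *.csv)
for csv_file in Path('data/raw').glob('*.csv'):
    print(f"Found: {csv_file.name} ({csv_file.stat().st_size} bytes)")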

Error Handling: Robust File Operations

def read_file_safely(filepath):
    """Safely read a file with proper error handling."""
    try:
        with open(filepath, 'r') as file:
            content = file.read()
            return content
    except FileNotFoundError:
        print(f"❌ Error: File '{filepath}' not found!")
        return None
    except PermissionError:
        print(f"❌ Error: No permission to read '{filepath}'!")
        return None
    except Exception as e:
        print(f"❌ Unexpected error: {e}")
        return None

# Test the function
content = read_file_safely('data.txt')
if content:
    print("βœ… File read successfully!")
else:
    print("Failed to read file")

Data Science Example: Complete Pipeline

import csv
from pathlib import Path

class DataPipeline:
    """Complete data processing pipeline with file operations."""
    
    def __init__(self, base_dir='data'):
        self.base_dir = Path(base_dir)
        self.raw_dir = self.base_dir / 'raw'
        self.processed_dir = self.base_dir / 'processed'
        
        # Create directories
        self.raw_dir.mkdir(parents=True, exist_ok=True)
        self.processed_dir.mkdir(parents=True, exist_ok=True)
    
    def load_csv(self, filename):
        """Load CSV file."""
        filepath = self.raw_dir / filename
        
        try:
            with open(filepath, 'r') as file:
                reader = csv.DictReader(file)
                data = list(reader)
            print(f"βœ… Loaded {len(data)} rows from {filename}")
            return data
        except FileNotFoundError:
            print(f"❌ File not found: {filename}")
            return []
    
    def process_data(self, data):
        """Process the data."""
        processed = []
        
        for row in data:
            # Example: Convert age to int, calculate age group
            try:
                age = int(row['age'])
                age_group = 'young' if age < 30 else 'senior'
                
                processed.append({
                    'name': row['name'],
                    'age': age,
                    'age_group': age_group,
                    'score': float(row.get('score', 0))
                })
            except ValueError:
                print(f"⚠️  Skipping invalid row: {row}")
        
        return processed
    
    def save_processed(self, data, filename):
        """Save processed data."""
        filepath = self.processed_dir / filename
        
        if not data:
            print("❌ No data to save!")
            return
        
        with open(filepath, 'w', newline='') as file:
            fieldnames = data[0].keys()
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            
            writer.writeheader()
            writer.writerows(data)
        
        print(f"βœ… Saved {len(data)} rows to {filename}")
    
    def run_pipeline(self, input_file, output_file):
        """Execute complete pipeline."""
        print("="*50)
        print("Starting Data Pipeline")
        print("="*50)
        
        # Load
        raw_data = self.load_csv(input_file)
        if not raw_data:
            return
        
        # Process
        processed_data = self.process_data(raw_data)
        
        # Save
        self.save_processed(processed_data, output_file)
        
        print("="*50)
        print("Pipeline Complete!")
        print("="*50)

# Create sample data
sample_data = [
    ['name', 'age', 'score'],
    ['Alice', '25', '85.5'],
    ['Bob', '32', '92.0'],
    ['Charlie', '28', '78.5'],
]

Path('data/raw').mkdir(parents=True, exist_ok=True)  # Make sure the folder exists first
with open('data/raw/students.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(sample_data)

# Run pipeline
pipeline = DataPipeline()
pipeline.run_pipeline('students.csv', 'students_processed.csv')

JSON Files: Structured Data

import json

# Python dict to JSON file
data = {
    'model': 'RandomForest',
    'params': {
        'n_estimators': 100,
        'max_depth': 10,
        'min_samples_split': 2
    },
    'metrics': {
        'accuracy': 0.87,
        'f1_score': 0.85
    }
}

# Save JSON
with open('model_config.json', 'w') as file:
    json.dump(data, file, indent=4)

# Load JSON
with open('model_config.json', 'r') as file:
    loaded_data = json.load(file)

print(f"Model: {loaded_data['model']}")
print(f"Accuracy: {loaded_data['metrics']['accuracy']:.1%}")

Best Practices for Data Science

1. Always Use Context Managers

# ❌ BAD
file = open('data.csv', 'r')
data = file.read()
file.close()

# βœ… GOOD
with open('data.csv', 'r') as file:
    data = file.read()

2. Handle Errors Properly

# βœ… GOOD
try:
    with open('data.csv', 'r') as file:
        data = file.read()
except FileNotFoundError:
    print("File not found!")
    data = None

3. Use Path Libraries

# ❌ BAD
filepath = 'data\\folder\\file.txt'

# βœ… GOOD
from pathlib import Path
filepath = Path('data') / 'folder' / 'file.txt'

4. Organize Your Files

project/
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ raw/           # Original datasets
β”‚   β”œβ”€β”€ processed/     # Cleaned data
β”‚   └── external/      # External sources
β”œβ”€β”€ models/            # Trained models
β”œβ”€β”€ results/           # Predictions
└── logs/              # Training logs

5. Save Intermediate Results

# Save after expensive operations
import pickle
from sklearn.ensemble import RandomForestClassifier

def train_model(X, y):
    # ... training code ...
    model = RandomForestClassifier()
    model.fit(X, y)
    
    # Save immediately!
    with open('model.pkl', 'wb') as f:
        pickle.dump(model, f)
    
    return model

Common Mistakes to Avoid

1. Not Closing Files

# ❌ BAD - file stays open
file = open('data.csv', 'r')
data = file.read()
# Forgot to close! Memory leak and file lock

# βœ… GOOD - automatically closed
with open('data.csv', 'r') as file:
    data = file.read()

2. Reading Entire Large Files into Memory

# ❌ BAD - loads 10GB file into RAM!
with open('huge_file.csv', 'r') as f:
    all_data = f.read()  # Memory error!

# βœ… GOOD - process line by line
with open('huge_file.csv', 'r') as f:
    for line in f:
        process(line)  # Only one line in memory at a time

3. Not Handling Encoding Issues

# ❌ BAD - crashes on special characters
with open('data.txt', 'r') as f:
    data = f.read()

# βœ… GOOD - specify encoding
with open('data.txt', 'r', encoding='utf-8') as f:
    data = f.read()
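
If you can't guarantee a file's encoding, the errors parameter controls what happens to undecodable bytes instead of crashing. A hedged sketch:

# errors='replace' swaps undecodable bytes for the οΏ½ character
with open('data.txt', 'r', encoding='utf-8', errors='replace') as f:
    data = f.read()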

4. Using Wrong CSV Mode

# ❌ BAD - overwrites the existing log on every run
with open('log.csv', 'w', newline='') as f:
    writer = csv.writer(f)

# βœ… GOOD - appends, so each run adds to the log
with open('log.csv', 'a', newline='') as f:
    writer = csv.writer(f)

5. Not Validating Paths Exist

# ❌ BAD - crashes if path doesn't exist
with open('nonexistent/data.csv', 'r') as f:
    data = f.read()

# βœ… GOOD - check first
from pathlib import Path
path = Path('nonexistent/data.csv')
if path.exists():
    with open(path, 'r') as f:
        data = f.read()
else:
    print(f"File not found: {path}")

6. Forgetting Binary Mode for Non-Text Files

# ❌ BAD - text mode can't decode binary data (this crashes)
import pickle

with open('model.pkl', 'r') as f:  # Wrong mode!
    model = pickle.load(f)

# βœ… GOOD - use binary mode
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

Conclusion: File Operations Mastery

You've learned essential file operations for data science:

βœ… Reading/Writing - Open, process, and save files safely
βœ… Context Managers - Automatic file closing with with
βœ… CSV Handling - Process tabular data efficiently
βœ… Path Management - Cross-platform file paths
βœ… Error Handling - Robust file operations
βœ… Best Practices - Professional data science workflows

Never lose your data again! Master file operations and your data science projects will be more reliable and professional.

Quick Reference

# Read file
with open('file.txt', 'r') as f:
    content = f.read()

# Write file
with open('file.txt', 'w') as f:
    f.write("Data\n")

# Append file
with open('file.txt', 'a') as f:
    f.write("More data\n")

# CSV
import csv
with open('data.csv', 'r') as f:
    reader = csv.DictReader(f)
    data = list(reader)

# Paths
from pathlib import Path
path = Path('data') / 'file.csv'

If you found this guide helpful and are building data science projects with proper file handling, I'd love to hear about it! Connect with me on Twitter or LinkedIn.

Support My Work

If this guide helped you master Python file operations, prevent data loss, or build better data science pipelines, I'd really appreciate your support! Creating comprehensive, practical Python tutorials like this takes significant time and effort. Your support helps me continue sharing knowledge and creating more helpful resources for Python developers.

β˜• Buy me a coffee - Every contribution, big or small, means the world to me and keeps me motivated to create more content!


Cover image by Zulfugar Karimov on Unsplash
