Python File Handling: Read, Write, and Process Files for Data Science
Master Python file operations for data science: learn to read and write text files, handle CSV data, use context managers, and manage file paths safely.

The Day I Lost All My Data
I was working on my first data science project when disaster struck. I'd spent hours preprocessing a dataset, but forgot to save the results. My laptop crashed, and everything was gone.
That's when I learned the hard way: file operations aren't optional in data science; they're essential. Every dataset you load, every model you train, every result you generate needs to be saved and loaded properly.
In this guide, I'll teach you everything about Python file handling so you never lose your work again!
Why File I/O Matters for Data Scientists
File operations are the foundation of data science workflows:
- Data Loading: Reading datasets (CSV, JSON, text)
- Model Persistence: Saving trained models
- Results Storage: Storing predictions and metrics
- Configuration: Loading hyperparameters
- Logging: Tracking experiments
Master file operations = Master data science!
Reading Files: The Basics
Method 1: Basic File Reading
```python
# Open, read, close (manual approach)
file = open('data.txt', 'r')
content = file.read()
file.close()
print(content)
```
Problem: If an error occurs, file.close() never runs, leaving the file open!
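Before `with` existed, the standard fix was a try/finally block; here's a minimal sketch of that older pattern, so you can recognize it in legacy code:
```python
# The pre-'with' workaround: guarantee close() with try/finally
file = open('data.txt', 'r')
try:
    content = file.read()
finally:
    file.close()  # Runs even if read() raises an error
```
It works, but it's verbose, which is exactly why context managers took over.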
Method 2: Context Manager (RECOMMENDED)
```python
# Better approach - auto-closes the file
with open('data.txt', 'r') as file:
    content = file.read()
    print(content)
# File is automatically closed here, even if an error occurred!
```
Always use `with` statements for file operations!
Reading Line by Line
```python
# Read all lines at once
with open('data.txt', 'r') as file:
    lines = file.readlines()  # Returns a list of lines

for i, line in enumerate(lines, 1):
    print(f"Line {i}: {line.strip()}")

# Read line by line (memory efficient!)
with open('data.txt', 'r') as file:
    for line in file:
        print(line.strip())
```
Pro tip: For large files, iterate line-by-line to save memory!
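If your file has no line structure (or you want bounded memory no matter what), you can also read in fixed-size chunks. A minimal sketch, reusing the same data.txt from above:
```python
# Read a large file in fixed-size chunks instead of all at once
CHUNK_SIZE = 1024 * 1024  # Up to ~1 million characters per read (tune as needed)

with open('data.txt', 'r') as file:
    while True:
        chunk = file.read(CHUNK_SIZE)
        if not chunk:  # An empty string means end of file
            break
        print(len(chunk))  # Stand-in for your real processing
```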
Writing Files: Saving Your Work
Writing Text
```python
# Write to a file (overwrites existing content)
data = [
    "Machine learning is amazing!",
    "Python makes data science easy.",
    "Always save your work!"
]

with open('notes.txt', 'w') as file:
    for line in data:
        file.write(line + '\n')

print("Data saved successfully!")
```
Appending to Files
```python
# Append to an existing file
with open('notes.txt', 'a') as file:
    file.write("This line was added later.\n")
    file.write("Append mode preserves existing content.\n")
```
Data Science Example: Saving Results
```python
# Save model evaluation results
results = {
    'accuracy': 0.87,
    'precision': 0.85,
    'recall': 0.89,
    'f1_score': 0.87
}

with open('model_results.txt', 'w') as file:
    file.write("Model Evaluation Results\n")
    file.write("=" * 30 + "\n")
    for metric, value in results.items():
        file.write(f"{metric}: {value:.2%}\n")

print("Results saved to model_results.txt")
```
File Modes: Choosing the Right Mode
| Mode | Purpose | Creates New? | Overwrites? |
|---|---|---|---|
| `'r'` | Read | No | N/A |
| `'w'` | Write | Yes | Yes |
| `'a'` | Append | Yes | No |
| `'x'` | Exclusive create | Yes | Error if exists |
| `'r+'` | Read + Write | No | No |
```python
# Demonstration of modes

# Write mode - overwrites!
with open('test.txt', 'w') as f:
    f.write("First content\n")

with open('test.txt', 'w') as f:
    f.write("Second content\n")  # First content GONE!

# Append mode - preserves
with open('test.txt', 'a') as f:
    f.write("Third content\n")  # Added to the end

# Exclusive create - fails if the file exists
try:
    with open('test.txt', 'x') as f:
        f.write("New file\n")
except FileExistsError:
    print("File already exists!")
```
Working with CSV Files
CSV (Comma-Separated Values) is the most common data format in data science!
Method 1: Manual CSV Parsing
```python
# Read a CSV manually
with open('data.csv', 'r') as file:
    lines = file.readlines()

# Parse the header
header = lines[0].strip().split(',')
print(f"Columns: {header}")

# Parse the data rows
data = []
for line in lines[1:]:
    values = line.strip().split(',')
    data.append(values)

print(f"Loaded {len(data)} rows")
Method 2: Using csv Module (BETTER)
```python
import csv

# Write a CSV
students = [
    ['Name', 'Age', 'Grade'],
    ['Alice', '20', 'A'],
    ['Bob', '22', 'B'],
    ['Charlie', '21', 'A'],
]

with open('students.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(students)

# Read a CSV
with open('students.csv', 'r') as file:
    reader = csv.reader(file)
    header = next(reader)  # Skip the header row
    for row in reader:
        name, age, grade = row
        print(f"{name} is {age} years old and got {grade}")
```
CSV with Dictionaries
```python
import csv

# Write a CSV from dicts
students = [
    {'name': 'Alice', 'age': 20, 'grade': 'A'},
    {'name': 'Bob', 'age': 22, 'grade': 'B'},
    {'name': 'Charlie', 'age': 21, 'grade': 'A'},
]

with open('students_dict.csv', 'w', newline='') as file:
    fieldnames = ['name', 'age', 'grade']
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(students)

# Read a CSV as dicts
with open('students_dict.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(f"{row['name']}: Age {row['age']}, Grade {row['grade']}")
```
Path Handling: Cross-Platform Compatibility
Problem: Platform-Specific Paths
```python
# Windows: C:\Users\Name\data.csv
# Mac/Linux: /Users/Name/data.csv

# Wrong approach (breaks on other OSes - and since '\f' is a
# form-feed escape character, this string is broken everywhere!)
file_path = "data\folder\file.txt"
```
Solution 1: os.path
```python
import os

# Join paths correctly for any OS
folder = 'data'
filename = 'dataset.csv'
file_path = os.path.join(folder, filename)
print(f"Path: {file_path}")

# Check if the file exists
if os.path.exists(file_path):
    print("File found!")
    print(f"Size: {os.path.getsize(file_path)} bytes")
else:
    print("File not found!")

# Get the current working directory
print(f"Working directory: {os.getcwd()}")

# Create a directory if it doesn't exist
os.makedirs('data/processed', exist_ok=True)
```
Solution 2: pathlib (MODERN APPROACH)
```python
from pathlib import Path

# Create Path objects
data_dir = Path('data')
file_path = data_dir / 'dataset.csv'  # Clean syntax!
print(f"Path: {file_path}")

# Check existence
if file_path.exists():
    print("File exists!")
    print(f"Size: {file_path.stat().st_size} bytes")

# Read a file directly
if file_path.exists():
    content = file_path.read_text()
    print(content[:100])  # First 100 characters

# Write a file directly
output = Path('output.txt')
output.write_text("Results from model training")

# Create directories
Path('data/raw').mkdir(parents=True, exist_ok=True)
Path('data/processed').mkdir(parents=True, exist_ok=True)
Path('models').mkdir(exist_ok=True)
```
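pathlib also makes it painless to find every data file in a folder, something that comes up constantly in data science. A minimal sketch, assuming the data/raw directory created above contains some CSVs:
```python
from pathlib import Path

# Find every CSV under data/raw, including subdirectories
for csv_path in sorted(Path('data/raw').rglob('*.csv')):
    print(f"{csv_path.name}: {csv_path.stat().st_size} bytes")
```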
Error Handling: Robust File Operations
```python
def read_file_safely(filepath):
    """Safely read a file with proper error handling."""
    try:
        with open(filepath, 'r') as file:
            content = file.read()
        return content
    except FileNotFoundError:
        print(f"❌ Error: File '{filepath}' not found!")
        return None
    except PermissionError:
        print(f"❌ Error: No permission to read '{filepath}'!")
        return None
    except Exception as e:
        print(f"❌ Unexpected error: {e}")
        return None

# Test the function
content = read_file_safely('data.txt')
if content:
    print("✅ File read successfully!")
else:
    print("Failed to read file")
```
Data Science Example: Complete Pipeline
```python
import csv
from pathlib import Path

class DataPipeline:
    """Complete data processing pipeline with file operations."""

    def __init__(self, base_dir='data'):
        self.base_dir = Path(base_dir)
        self.raw_dir = self.base_dir / 'raw'
        self.processed_dir = self.base_dir / 'processed'

        # Create directories
        self.raw_dir.mkdir(parents=True, exist_ok=True)
        self.processed_dir.mkdir(parents=True, exist_ok=True)

    def load_csv(self, filename):
        """Load a CSV file."""
        filepath = self.raw_dir / filename
        try:
            with open(filepath, 'r') as file:
                reader = csv.DictReader(file)
                data = list(reader)
            print(f"✅ Loaded {len(data)} rows from {filename}")
            return data
        except FileNotFoundError:
            print(f"❌ File not found: {filename}")
            return []

    def process_data(self, data):
        """Process the data."""
        processed = []
        for row in data:
            # Example: convert age to int, derive an age group
            try:
                age = int(row['age'])
                age_group = 'young' if age < 30 else 'senior'
                processed.append({
                    'name': row['name'],
                    'age': age,
                    'age_group': age_group,
                    'score': float(row.get('score', 0))
                })
            except ValueError:
                print(f"⚠️ Skipping invalid row: {row}")
        return processed

    def save_processed(self, data, filename):
        """Save processed data."""
        filepath = self.processed_dir / filename
        if not data:
            print("❌ No data to save!")
            return
        with open(filepath, 'w', newline='') as file:
            fieldnames = data[0].keys()
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)
        print(f"✅ Saved {len(data)} rows to {filename}")

    def run_pipeline(self, input_file, output_file):
        """Execute the complete pipeline."""
        print("=" * 50)
        print("Starting Data Pipeline")
        print("=" * 50)

        # Load
        raw_data = self.load_csv(input_file)
        if not raw_data:
            return

        # Process
        processed_data = self.process_data(raw_data)

        # Save
        self.save_processed(processed_data, output_file)

        print("=" * 50)
        print("Pipeline Complete!")
        print("=" * 50)

# Create the pipeline first so data/raw exists, then write sample data
pipeline = DataPipeline()

sample_data = [
    ['name', 'age', 'score'],
    ['Alice', '25', '85.5'],
    ['Bob', '32', '92.0'],
    ['Charlie', '28', '78.5'],
]

with open('data/raw/students.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(sample_data)

# Run the pipeline
pipeline.run_pipeline('students.csv', 'students_processed.csv')
```
JSON Files: Structured Data
```python
import json

# Python dict to JSON file
data = {
    'model': 'RandomForest',
    'params': {
        'n_estimators': 100,
        'max_depth': 10,
        'min_samples_split': 2
    },
    'metrics': {
        'accuracy': 0.87,
        'f1_score': 0.85
    }
}

# Save JSON
with open('model_config.json', 'w') as file:
    json.dump(data, file, indent=4)

# Load JSON
with open('model_config.json', 'r') as file:
    loaded_data = json.load(file)

print(f"Model: {loaded_data['model']}")
print(f"Accuracy: {loaded_data['metrics']['accuracy']:.1%}")
```
Best Practices for Data Science
1. Always Use Context Managers
```python
# ❌ BAD
file = open('data.csv', 'r')
data = file.read()
file.close()

# ✅ GOOD
with open('data.csv', 'r') as file:
    data = file.read()
```
2. Handle Errors Properly
```python
# ✅ GOOD
try:
    with open('data.csv', 'r') as file:
        data = file.read()
except FileNotFoundError:
    print("File not found!")
    data = None
```
3. Use Path Libraries
```python
# ❌ BAD
filepath = 'data\\folder\\file.txt'

# ✅ GOOD
from pathlib import Path
filepath = Path('data') / 'folder' / 'file.txt'
```
4. Organize Your Files
```
project/
├── data/
│   ├── raw/          # Original datasets
│   ├── processed/    # Cleaned data
│   └── external/     # External sources
├── models/           # Trained models
├── results/          # Predictions
└── logs/             # Training logs
```
5. Save Intermediate Results
```python
import pickle
from sklearn.ensemble import RandomForestClassifier

# Save after expensive operations
def train_model(X, y):
    model = RandomForestClassifier()
    model.fit(X, y)

    # Save immediately!
    with open('model.pkl', 'wb') as f:
        pickle.dump(model, f)

    return model
```
Common Mistakes to Avoid
1. Not Closing Files
```python
# ❌ BAD - the file stays open
file = open('data.csv', 'r')
data = file.read()
# Forgot to close! Resource leak and possible file lock

# ✅ GOOD - automatically closed
with open('data.csv', 'r') as file:
    data = file.read()
```
2. Reading Entire Large Files into Memory
```python
# ❌ BAD - loads a 10GB file into RAM!
with open('huge_file.csv', 'r') as f:
    all_data = f.read()  # Memory error!

# ✅ GOOD - process line by line
with open('huge_file.csv', 'r') as f:
    for line in f:
        process(line)  # Only one line in memory at a time
```
3. Not Handling Encoding Issues
```python
# ❌ BAD - crashes on special characters
with open('data.txt', 'r') as f:
    data = f.read()

# ✅ GOOD - specify the encoding
with open('data.txt', 'r', encoding='utf-8') as f:
    data = f.read()
```
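If you can't control the source encoding, the standard errors parameter of open() lets you trade strictness for robustness; a sketch:
```python
# Replace undecodable bytes instead of crashing
with open('data.txt', 'r', encoding='utf-8', errors='replace') as f:
    data = f.read()  # Bad bytes become the U+FFFD replacement character
```
Use this only when losing a few characters is acceptable, such as in exploratory analysis.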
4. Using Wrong CSV Mode
```python
# ❌ BAD - overwrites existing data and never closes the file
writer = csv.writer(open('log.csv', 'w'))

# ✅ GOOD - append mode, with the file closed automatically
with open('log.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['epoch', 'loss'])
```
5. Not Validating Paths Exist
```python
# ❌ BAD - crashes if the path doesn't exist
with open('nonexistent/data.csv', 'r') as f:
    data = f.read()

# ✅ GOOD - check first
from pathlib import Path

path = Path('nonexistent/data.csv')
if path.exists():
    with open(path, 'r') as f:
        data = f.read()
else:
    print(f"File not found: {path}")
```
6. Forgetting Binary Mode for Non-Text Files
```python
import pickle

# ❌ BAD - text mode breaks binary data
with open('model.pkl', 'r') as f:  # Wrong mode!
    model = pickle.load(f)

# ✅ GOOD - use binary mode
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
```
Conclusion: File Operations Mastery
You've learned essential file operations for data science:
- ✅ Reading/Writing - Open, process, and save files safely
- ✅ Context Managers - Automatic file closing with `with`
- ✅ CSV Handling - Process tabular data efficiently
- ✅ Path Management - Cross-platform file paths
- ✅ Error Handling - Robust file operations
- ✅ Best Practices - Professional data science workflows
Never lose your data again! Master file operations and your data science projects will be more reliable and professional.
Quick Reference
```python
# Read a file
with open('file.txt', 'r') as f:
    content = f.read()

# Write a file
with open('file.txt', 'w') as f:
    f.write("Data\n")

# Append to a file
with open('file.txt', 'a') as f:
    f.write("More data\n")

# CSV
import csv
with open('data.csv', 'r') as f:
    reader = csv.DictReader(f)
    data = list(reader)

# Paths
from pathlib import Path
path = Path('data') / 'file.csv'
```
If you found this guide helpful and are building data science projects with proper file handling, I'd love to hear about it! Connect with me on Twitter or LinkedIn.
Support My Work
If this guide helped you master Python file operations, prevent data loss, or build better data science pipelines, I'd really appreciate your support! Creating comprehensive, practical Python tutorials like this takes significant time and effort. Your support helps me continue sharing knowledge and creating more helpful resources for Python developers.
☕ Buy me a coffee - Every contribution, big or small, means the world to me and keeps me motivated to create more content!
Cover image by Zulfugar Karimov on Unsplash