Data Preprocessing with Python: Essential Cleaning Steps
Master data preprocessing in Python with essential cleaning techniques. Learn to handle missing values, normalize data, encode categories, scale features, and prepare datasets for machine learning.

When My Model Failed Because of Bad Preprocessing
I spent weeks building a sophisticated machine learning model. Great accuracy on training data. Then I deployed it, and predictions were nonsense. The problem? I forgot to scale my test data the same way as training data.
That painful lesson taught me: preprocessing isn't just a preliminary step; it's the foundation of every successful ML project. Garbage in, garbage out.
This guide covers essential preprocessing steps that transform raw data into model-ready datasets. Master these, and your models will actually work in production.
The Preprocessing Pipeline
Every dataset needs these core steps:
- Handle missing values - Fill or remove
- Remove duplicates - Ensure data quality
- Handle outliers - Detect and treat
- Encode categories - Convert text to numbers
- Scale features - Normalize ranges
- Feature engineering - Create new features
Step 1: Handling Missing Values
Detecting Missing Data
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'age': [25, np.nan, 35, 40, 28],
    'income': [50000, 60000, np.nan, 80000, 55000],
    'city': ['NY', 'LA', None, 'SF', 'NY']
})
# Check missing values
print("Missing values per column:")
print(df.isnull().sum())
print("\nMissing percentage:")
print((df.isnull().sum() / len(df) * 100).round(2))
# Visualize missing pattern
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Data Pattern')
plt.show()
Strategy 1: Remove Missing Data
# Remove rows with any missing value
df_clean = df.dropna()
# Remove rows with missing in specific columns
df_clean = df.dropna(subset=['age', 'income'])
# Remove columns with too many missing values
threshold = 0.5 # 50%
df_clean = df.loc[:, df.isnull().mean() < threshold]
Strategy 2: Fill Missing Values
# Fill with a constant (the snippets below are alternative strategies; pick one per column)
df['age'] = df['age'].fillna(0)
# Fill with mean (for numeric)
df['age'] = df['age'].fillna(df['age'].mean())
# Fill with median (better for outliers)
df['income'] = df['income'].fillna(df['income'].median())
# Fill with mode (for categorical)
df['city'] = df['city'].fillna(df['city'].mode()[0])
# Forward fill (use previous value); .ffill() replaces the deprecated fillna(method='ffill')
df['age'] = df['age'].ffill()
# Backward fill (use next value)
df['age'] = df['age'].bfill()
# Interpolate (for time series)
df['age'] = df['age'].interpolate()
Strategy 3: Advanced Imputation
from sklearn.impute import SimpleImputer, KNNImputer
# Simple imputer
imputer = SimpleImputer(strategy='mean') # mean, median, most_frequent, constant
df[['age', 'income']] = imputer.fit_transform(df[['age', 'income']])
# KNN imputer (uses similar rows)
knn_imputer = KNNImputer(n_neighbors=3)
df[['age', 'income']] = knn_imputer.fit_transform(df[['age', 'income']])
Step 2: Handling Duplicates
# Check for duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")
# Show duplicate rows
duplicates = df[df.duplicated(keep=False)]
print(duplicates)
# Remove duplicates
df_clean = df.drop_duplicates()
# Remove based on specific columns
df_clean = df.drop_duplicates(subset=['customer_id'], keep='first')
Step 3: Handling Outliers
Detection Methods
# Statistical method (Z-score)
from scipy import stats
z_scores = np.abs(stats.zscore(df['income'].dropna()))
outliers_z = z_scores > 3 # More than 3 standard deviations
print(f"Outliers (z-score): {outliers_z.sum()}")
# IQR method
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = (df['income'] < lower_bound) | (df['income'] > upper_bound)
print(f"Outliers (IQR): {outliers_iqr.sum()}")
Treatment Methods
# Method 1: Remove outliers
df_no_outliers = df[(df['income'] >= lower_bound) & (df['income'] <= upper_bound)]
# Method 2: Cap outliers (winsorization)
df['income_capped'] = df['income'].clip(lower=lower_bound, upper=upper_bound)
# Method 3: Transform (log transformation)
df['income_log'] = np.log1p(df['income']) # log1p handles zeros
Step 4: Encoding Categorical Variables
Label Encoding
from sklearn.preprocessing import LabelEncoder
# Note: LabelEncoder assigns integers alphabetically (here L=0, M=1, S=2, XL=3) and is meant
# for target labels; use OrdinalEncoder (below) when the category order matters
df['size'] = ['S', 'M', 'L', 'XL', 'M']
le = LabelEncoder()
df['size_encoded'] = le.fit_transform(df['size'])
print("Original:", df['size'].tolist())
print("Encoded:", df['size_encoded'].tolist())
# Inverse transform
original = le.inverse_transform(df['size_encoded'])
One-Hot Encoding (Nominal)
# For nominal categories (no order)
df['color'] = ['red', 'blue', 'green', 'red', 'blue']
# Method 1: Pandas get_dummies
df_encoded = pd.get_dummies(df, columns=['color'], prefix='color')
# Method 2: sklearn OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False, drop='first') # drop first to avoid multicollinearity; sparse_output replaces the older sparse argument
encoded = ohe.fit_transform(df[['color']])
encoded_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out())
print(encoded_df)
Ordinal Encoding (Custom Order)
from sklearn.preprocessing import OrdinalEncoder
# Define custom order
education_order = [['High School', 'Bachelor', 'Master', 'PhD']]
oe = OrdinalEncoder(categories=education_order)
df['education'] = ['Bachelor', 'Master', 'High School', 'PhD', 'Bachelor'] # one value per row of the example frame
df['education_encoded'] = oe.fit_transform(df[['education']])
Step 5: Feature Scaling
Normalization (Min-Max Scaling)
from sklearn.preprocessing import MinMaxScaler
# Scale to [0, 1]
scaler = MinMaxScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
# Scale to custom range
scaler = MinMaxScaler(feature_range=(-1, 1))
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
Standardization (Z-score)
from sklearn.preprocessing import StandardScaler
# Scale to mean=0, std=1
scaler = StandardScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
print(f"Mean: {df['age'].mean():.10f}") # ~0
print(f"Std: {df['age'].std():.10f}") # ~1
Robust Scaling (For Outliers)
from sklearn.preprocessing import RobustScaler
# Uses median and IQR (less affected by outliers)
scaler = RobustScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
When to Use Which Scaler?
- MinMaxScaler: When you need bounded values (e.g., neural networks)
- StandardScaler: Most common, works well with many algorithms
- RobustScaler: When you have outliers you can't remove
- No scaling: For tree-based models (Random Forest, XGBoost)
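To make the differences concrete, here is a minimal sketch on a small made-up array with one extreme value; the numbers are purely illustrative, not from any real dataset.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
# Toy data: four ordinary values plus one extreme outlier
values = np.array([[30], [35], [40], [45], [500]], dtype=float)
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    scaled = scaler.fit_transform(values).ravel()
    # MinMaxScaler squeezes the ordinary values toward 0 because the outlier sets the max;
    # StandardScaler's mean and std are both pulled by the outlier;
    # RobustScaler (median/IQR) keeps the ordinary values well spread out
    print(scaler.__class__.__name__, np.round(scaled, 2))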
Step 6: Feature Engineering
Creating New Features
# Date features
df['date'] = pd.to_datetime(['2024-01-15', '2024-02-20', '2024-03-10', '2024-04-05', '2024-05-12']) # one date per row of the example frame
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['quarter'] = df['date'].dt.quarter
# Mathematical combinations
df['age_income_ratio'] = df['age'] / df['income']
df['age_squared'] = df['age'] ** 2
# Binning continuous variables
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 35, 50, 100],
                         labels=['Youth', 'Young Adult', 'Adult', 'Senior'])
# Aggregations
df['city_avg_income'] = df.groupby('city')['income'].transform('mean')
Text Features
# String operations (fresh example frame for this snippet)
df = pd.DataFrame({'name': ['John Doe', 'Jane Smith', 'Bob Johnson']})
df['name_length'] = df['name'].str.len()
df['word_count'] = df['name'].str.split().str.len()
df['first_name'] = df['name'].str.split().str[0]
# Extract patterns
import re
df['email'] = ['john@example.com', 'jane@test.org', 'bob@company.co.uk']
df['domain'] = df['email'].str.extract(r'@(.+)')
Complete Preprocessing Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
class DataPreprocessor:
    def __init__(self, numeric_features, categorical_features):
        self.numeric_features = numeric_features
        self.categorical_features = categorical_features
        self.pipeline = self._build_pipeline()

    def _build_pipeline(self):
        """Build preprocessing pipeline"""
        # Numeric transformer
        numeric_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ])
        # Categorical transformer
        categorical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))
        ])
        # Combine transformers
        preprocessor = ColumnTransformer(
            transformers=[
                ('num', numeric_transformer, self.numeric_features),
                ('cat', categorical_transformer, self.categorical_features)
            ])
        return preprocessor

    def fit_transform(self, X):
        """Fit and transform data"""
        return self.pipeline.fit_transform(X)

    def transform(self, X):
        """Transform new data"""
        return self.pipeline.transform(X)
# Usage (X_train and X_test are your existing train/test split DataFrames)
preprocessor = DataPreprocessor(
    numeric_features=['age', 'income'],
    categorical_features=['city', 'education']
)
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
Real-World Example: Customer Dataset
import pandas as pd
import numpy as np
# Load data
df = pd.read_csv('customers.csv')
print("Original shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
# Step 1: Handle missing values
print("\n1. Handling missing values...")
missing_threshold = 0.3
cols_to_drop = df.columns[df.isnull().mean() > missing_threshold]
df = df.drop(columns=cols_to_drop)
# Impute numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
# Impute categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])
# Step 2: Remove duplicates
print("2. Removing duplicates...")
df = df.drop_duplicates()
# Step 3: Handle outliers
print("3. Handling outliers...")
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    df[col] = df[col].clip(lower, upper)
# Step 4: Encode categoricals
print("4. Encoding categorical variables...")
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
# Step 5: Scale features
print("5. Scaling features...")
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
print("\nPreprocessed shape:", df.shape)
print("Missing values:", df.isnull().sum().sum())
# Save preprocessed data
df.to_csv('customers_preprocessed.csv', index=False)
print("\nPreprocessed data saved!")
Validation and Quality Checks
def validate_preprocessed_data(df):
    """Validate data quality after preprocessing"""
    checks = {
        'no_missing': df.isnull().sum().sum() == 0,
        'no_duplicates': df.duplicated().sum() == 0,
        'no_inf': np.isinf(df.select_dtypes(include=[np.number])).sum().sum() == 0,
        'proper_dtypes': True  # Custom check
    }
    print("Data Quality Checks:")
    for check, passed in checks.items():
        status = "✓" if passed else "✗"
        print(f"  {status} {check}: {'PASS' if passed else 'FAIL'}")
    return all(checks.values())
# Run validation
is_valid = validate_preprocessed_data(df)
Common Preprocessing Mistakes
Mistake 1: Data Leakage
# WRONG: Fitting scaler on entire dataset
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df) # Includes test data!
# RIGHT: Fit only on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use fitted scaler
Mistake 2: Not Saving Preprocessing Objects
import joblib
# Save scaler for production use
joblib.dump(scaler, 'scaler.pkl')
joblib.dump(encoder, 'encoder.pkl')
# Load in production
scaler = joblib.load('scaler.pkl')
new_data_scaled = scaler.transform(new_data)
Mistake 3: Forgetting Feature Order
# Always maintain feature order
feature_columns = ['age', 'income', 'education']
# When preprocessing new data
X_new = X_new[feature_columns] # Ensure same order
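A minimal sketch of one way to enforce this in practice, assuming X_train and X_new are pandas DataFrames and reusing joblib from the previous mistake; 'feature_columns.pkl' is just a placeholder filename.
import joblib
# At training time: remember the exact column order the model was trained on
feature_columns = list(X_train.columns)
joblib.dump(feature_columns, 'feature_columns.pkl')  # placeholder filename
# In production: load the list and force new data into the same order;
# reindex() also exposes missing columns as NaN instead of silently misaligning them
feature_columns = joblib.load('feature_columns.pkl')
X_new = X_new.reindex(columns=feature_columns)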
Feature Engineering Basics
Creating new features from existing ones can dramatically improve model performance.
Datetime Features
# Extract time-based features
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['quarter'] = df['date'].dt.quarter
# Time since event
df['days_since_start'] = (df['date'] - df['date'].min()).dt.days
Interaction Features
# Combine features meaningfully
df['price_per_sqft'] = df['price'] / df['square_feet']
df['income_to_age_ratio'] = df['income'] / df['age']
# Polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['feature1', 'feature2']])
Binning Continuous Variables
# Create categorical bins from numeric data
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 35, 50, 65, 100],
                         labels=['Child', 'Young Adult', 'Adult', 'Senior', 'Elderly'])
# Or quantile-based bins
df['income_quartile'] = pd.qcut(df['income'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
Text Feature Extraction
# Basic text features
df['text_length'] = df['comment'].str.len()
df['word_count'] = df['comment'].str.split().str.len()
df['avg_word_length'] = df['text_length'] / df['word_count']
df['has_special_chars'] = df['comment'].str.contains('[!?@#$]').astype(int)
Aggregation Features
# Group-based statistics
customer_stats = df.groupby('customer_id').agg({
    'purchase_amount': ['mean', 'sum', 'count'],
    'days_since_last_purchase': 'min'
})
# Flatten the MultiIndex columns so the merge key is a plain column
customer_stats.columns = ['_'.join(col) for col in customer_stats.columns]
customer_stats = customer_stats.reset_index()
# Merge back to original dataframe
df = df.merge(customer_stats, on='customer_id', how='left')
Your Preprocessing Checklist
- Handle missing values (strategy chosen)
- Remove duplicates
- Detect and handle outliers
- Encode categorical variables
- Scale numeric features
- Create relevant features
- Validate preprocessed data
- Save preprocessing objects
- Document preprocessing steps
Next Steps
- Learn sklearn preprocessing
- Explore feature engineering techniques
- Study domain-specific preprocessing
- Practice with Kaggle datasets
Remember: Good preprocessing is invisible, but bad preprocessing ruins everything!
The 80/20 Rule of Preprocessing
In my experience, 80% of preprocessing issues come from 20% of common problems:
The Critical 20%:
- Missing Values: Handle them systematically with documented strategies rather than ad-hoc deletion
- Data Type Mismatches: Ensure numeric columns are actually numeric and dates are datetime objects (see the sketch below)
- Outliers: Detect and handle extreme values that can skew your models significantly
- Feature Scale: Normalize or standardize features so no single feature dominates due to scale
Focus intensely on these four areas, and you'll catch most data quality issues before they become problems.
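Missing values, outliers, and scaling all have code earlier in this guide; for data type mismatches, here is a minimal sketch of the kind of check I mean (the 'price' and 'signup_date' column names are hypothetical).
import pandas as pd
# Inspect what pandas actually inferred for each column
print(df.dtypes)
# Coerce columns that should be numeric or datetime; values that fail become NaN/NaT
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
# Anything that failed to convert now shows up as missing and can be handled in Step 1
print(df[['price', 'signup_date']].isnull().sum())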
Building Your Preprocessing Intuition
The more you preprocess data, the more you develop intuition for what's wrong:
Red flags I watch for (quick checks sketched below the list):
- Standard deviation much larger than mean (possible outliers or skewed distribution)
- Too many unique values in categorical column (might need grouping or is actually numeric)
- Suspiciously round numbers (might be placeholder values like -999 or 0)
- Date ranges that don't make business sense (future dates, dates before company existed)
- Correlation between features that should be independent (possible data leakage)
Trust your instincts when something feels off. Investigate anomalies rather than dismissing them.
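A rough sketch of the quick checks behind those red flags; the thresholds and the -999 sentinel are examples I happen to use, not fixed rules.
import numpy as np
numeric_cols = df.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    mean, std = df[col].mean(), df[col].std()
    # A std that dwarfs the mean often signals heavy skew or outliers
    if std > 3 * abs(mean):
        print(f"{col}: std ({std:.2f}) dwarfs mean ({mean:.2f}) - inspect for outliers or skew")
    # Suspiciously frequent sentinel-looking values such as -999
    if (df[col] == -999).any():
        print(f"{col}: contains -999 - possible placeholder for missing data")
# Categorical columns with very many unique values may need grouping or are really numeric
for col in df.select_dtypes(include=['object']).columns:
    if df[col].nunique() > 0.5 * len(df):
        print(f"{col}: {df[col].nunique()} unique values - high-cardinality category")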
Found this guide helpful? Share it with your ML community! Connect with me on Twitter or LinkedIn for more data science tips.
Support My Work
If this guide helped you master data preprocessing, build robust ML pipelines, or avoid common preprocessing mistakes, I'd really appreciate your support! Creating comprehensive, free content like this takes significant time and effort. Your support helps me continue sharing knowledge and creating more helpful resources for developers.
☕ Buy me a coffee - Every contribution, big or small, means the world to me and keeps me motivated to create more content!
Cover image by UX Indonesia on Unsplash