Data Preprocessing with Python: Essential Cleaning Steps
Master data preprocessing in Python with essential cleaning techniques. Learn to handle missing values, normalize data, encode categories, scale features, and prepare datasets for machine learning.

When My Model Failed Because of Bad Preprocessing
I spent weeks building a sophisticated machine learning model. Great accuracy on training data. Then I deployed it, and predictions were nonsense. The problem? I forgot to scale my test data the same way as training data.
That painful lesson taught me: preprocessing isn't just a preliminary step; it's the foundation of every successful ML project. Garbage in, garbage out.
This guide covers essential preprocessing steps that transform raw data into model-ready datasets. Master these, and your models will actually work in production.
The Preprocessing Pipeline
Every dataset needs these core steps:
- Handle missing values - Fill or remove
- Remove duplicates - Ensure data quality
- Handle outliers - Detect and treat
- Encode categories - Convert text to numbers
- Scale features - Normalize ranges
- Feature engineering - Create new features
Step 1: Handling Missing Values
Detecting Missing Data
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'age': [25, np.nan, 35, 40, 28],
    'income': [50000, 60000, np.nan, 80000, 55000],
    'city': ['NY', 'LA', None, 'SF', 'NY']
})
# Check missing values
print("Missing values per column:")
print(df.isnull().sum())
print("\nMissing percentage:")
print((df.isnull().sum() / len(df) * 100).round(2))
# Visualize missing pattern
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Data Pattern')
plt.show()
Strategy 1: Remove Missing Data
# Remove rows with any missing value
df_clean = df.dropna()
# Remove rows with missing in specific columns
df_clean = df.dropna(subset=['age', 'income'])
# Remove columns with too many missing values
threshold = 0.5 # 50%
df_clean = df.loc[:, df.isnull().mean() < threshold]
Strategy 2: Fill Missing Values
# Fill with a constant (the snippets below are alternative strategies; pick one per column)
df['age'] = df['age'].fillna(0)
# Fill with mean (for numeric)
df['age'] = df['age'].fillna(df['age'].mean())
# Fill with median (better for outliers)
df['income'] = df['income'].fillna(df['income'].median())
# Fill with mode (for categorical)
df['city'] = df['city'].fillna(df['city'].mode()[0])
# Forward fill (use previous value); .ffill() replaces the deprecated fillna(method='ffill')
df['age'] = df['age'].ffill()
# Backward fill (use next value)
df['age'] = df['age'].bfill()
# Interpolate (for time series)
df['age'] = df['age'].interpolate()
Strategy 3: Advanced Imputation
from sklearn.impute import SimpleImputer, KNNImputer
# Simple imputer
imputer = SimpleImputer(strategy='mean') # mean, median, most_frequent, constant
df[['age', 'income']] = imputer.fit_transform(df[['age', 'income']])
# KNN imputer (uses similar rows)
knn_imputer = KNNImputer(n_neighbors=3)
df[['age', 'income']] = knn_imputer.fit_transform(df[['age', 'income']])
Step 2: Handling Duplicates
# Check for duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")
# Show duplicate rows
duplicates = df[df.duplicated(keep=False)]
print(duplicates)
# Remove duplicates
df_clean = df.drop_duplicates()
# Remove based on specific columns
df_clean = df.drop_duplicates(subset=['customer_id'], keep='first')
Step 3: Handling Outliers
Detection Methods
# Statistical method (Z-score)
from scipy import stats
z_scores = np.abs(stats.zscore(df['income'].dropna()))
outliers_z = z_scores > 3 # More than 3 standard deviations
print(f"Outliers (z-score): {outliers_z.sum()}")
# IQR method
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = (df['income'] < lower_bound) | (df['income'] > upper_bound)
print(f"Outliers (IQR): {outliers_iqr.sum()}")
Treatment Methods
# Method 1: Remove outliers
df_no_outliers = df[(df['income'] >= lower_bound) & (df['income'] <= upper_bound)]
# Method 2: Cap outliers (winsorization)
df['income_capped'] = df['income'].clip(lower=lower_bound, upper=upper_bound)
# Method 3: Transform (log transformation)
df['income_log'] = np.log1p(df['income']) # log1p handles zeros
Step 4: Encoding Categorical Variables
Label Encoding
from sklearn.preprocessing import LabelEncoder
# Note: LabelEncoder assigns integers alphabetically (here L=0, M=1, S=2, XL=3) and is meant
# for target labels; use OrdinalEncoder (below) when the category order matters
df['size'] = ['S', 'M', 'L', 'XL', 'M']
le = LabelEncoder()
df['size_encoded'] = le.fit_transform(df['size'])
print("Original:", df['size'].tolist())
print("Encoded:", df['size_encoded'].tolist())
# Inverse transform
original = le.inverse_transform(df['size_encoded'])
One-Hot Encoding (Nominal)
# For nominal categories (no order)
df['color'] = ['red', 'blue', 'green', 'red', 'blue']
# Method 1: Pandas get_dummies
df_encoded = pd.get_dummies(df, columns=['color'], prefix='color')
# Method 2: sklearn OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False, drop='first') # drop first to avoid multicollinearity; sparse_output replaces the older sparse argument
encoded = ohe.fit_transform(df[['color']])
encoded_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out())
print(encoded_df)
Ordinal Encoding (Custom Order)
from sklearn.preprocessing import OrdinalEncoder
# Define custom order
education_order = [['High School', 'Bachelor', 'Master', 'PhD']]
oe = OrdinalEncoder(categories=education_order)
df['education'] = ['Bachelor', 'Master', 'High School', 'PhD', 'Bachelor'] # one value per row of the example frame
df['education_encoded'] = oe.fit_transform(df[['education']])
Step 5: Feature Scaling
Normalization (Min-Max Scaling)
from sklearn.preprocessing import MinMaxScaler
# Scale to [0, 1]
scaler = MinMaxScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
# Scale to custom range
scaler = MinMaxScaler(feature_range=(-1, 1))
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
Standardization (Z-score)
from sklearn.preprocessing import StandardScaler
# Scale to mean=0, std=1
scaler = StandardScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
print(f"Mean: {df['age'].mean():.10f}") # ~0
print(f"Std: {df['age'].std():.10f}") # ~1
Robust Scaling (For Outliers)
from sklearn.preprocessing import RobustScaler
# Uses median and IQR (less affected by outliers)
scaler = RobustScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
When to Use Which Scaler?
- MinMaxScaler: When you need bounded values (e.g., neural networks)
- StandardScaler: Most common, works well with many algorithms
- RobustScaler: When you have outliers you can't remove
- No scaling: For tree-based models (Random Forest, XGBoost)
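To make the differences concrete, here is a minimal sketch on a small made-up array with one extreme value; the numbers are purely illustrative, not from any real dataset.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
# Toy data: four ordinary values plus one extreme outlier
values = np.array([[30], [35], [40], [45], [500]], dtype=float)
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    scaled = scaler.fit_transform(values).ravel()
    # MinMaxScaler squeezes the ordinary values toward 0 because the outlier sets the max;
    # StandardScaler's mean and std are both pulled by the outlier;
    # RobustScaler (median/IQR) keeps the ordinary values well spread out
    print(scaler.__class__.__name__, np.round(scaled, 2))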
Step 6: Feature Engineering
Creating New Features
# Date features
df['date'] = pd.to_datetime(['2024-01-15', '2024-02-20', '2024-03-10', '2024-04-05', '2024-05-12']) # one date per row of the example frame
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['quarter'] = df['date'].dt.quarter
# Mathematical combinations
df['age_income_ratio'] = df['age'] / df['income']
df['age_squared'] = df['age'] ** 2
# Binning continuous variables
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 35, 50, 100],
                         labels=['Youth', 'Young Adult', 'Adult', 'Senior'])
# Aggregations
df['city_avg_income'] = df.groupby('city')['income'].transform('mean')
Text Features
# String operations (fresh example frame for this snippet)
df = pd.DataFrame({'name': ['John Doe', 'Jane Smith', 'Bob Johnson']})
df['name_length'] = df['name'].str.len()
df['word_count'] = df['name'].str.split().str.len()
df['first_name'] = df['name'].str.split().str[0]
# Extract patterns
import re
df['email'] = ['john@example.com', 'jane@test.org', 'bob@company.co.uk']
df['domain'] = df['email'].str.extract(r'@(.+)')
Complete Preprocessing Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
class DataPreprocessor:
    def __init__(self, numeric_features, categorical_features):
        self.numeric_features = numeric_features
        self.categorical_features = categorical_features
        self.pipeline = self._build_pipeline()

    def _build_pipeline(self):
        """Build preprocessing pipeline"""
        # Numeric transformer
        numeric_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ])
        # Categorical transformer
        categorical_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))
        ])
        # Combine transformers
        preprocessor = ColumnTransformer(
            transformers=[
                ('num', numeric_transformer, self.numeric_features),
                ('cat', categorical_transformer, self.categorical_features)
            ])
        return preprocessor

    def fit_transform(self, X):
        """Fit and transform data"""
        return self.pipeline.fit_transform(X)

    def transform(self, X):
        """Transform new data"""
        return self.pipeline.transform(X)
# Usage (X_train and X_test are your existing train/test split DataFrames)
preprocessor = DataPreprocessor(
    numeric_features=['age', 'income'],
    categorical_features=['city', 'education']
)
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
Real-World Example: Customer Dataset
import pandas as pd
import numpy as np
# Load data
df = pd.read_csv('customers.csv')
print("Original shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
# Step 1: Handle missing values
print("\n1. Handling missing values...")
missing_threshold = 0.3
cols_to_drop = df.columns[df.isnull().mean() > missing_threshold]
df = df.drop(columns=cols_to_drop)
# Impute numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
# Impute categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])
# Step 2: Remove duplicates
print("2. Removing duplicates...")
df = df.drop_duplicates()
# Step 3: Handle outliers
print("3. Handling outliers...")
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    df[col] = df[col].clip(lower, upper)
# Step 4: Encode categoricals
print("4. Encoding categorical variables...")
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
# Step 5: Scale features
print("5. Scaling features...")
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
print("\nPreprocessed shape:", df.shape)
print("Missing values:", df.isnull().sum().sum())
# Save preprocessed data
df.to_csv('customers_preprocessed.csv', index=False)
print("\nPreprocessed data saved!")
Validation and Quality Checks
def validate_preprocessed_data(df):
    """Validate data quality after preprocessing"""
    checks = {
        'no_missing': df.isnull().sum().sum() == 0,
        'no_duplicates': df.duplicated().sum() == 0,
        'no_inf': np.isinf(df.select_dtypes(include=[np.number])).sum().sum() == 0,
        'proper_dtypes': True  # Custom check
    }
    print("Data Quality Checks:")
    for check, passed in checks.items():
        status = "✓" if passed else "✗"
        print(f"  {status} {check}: {'PASS' if passed else 'FAIL'}")
    return all(checks.values())
# Run validation
is_valid = validate_preprocessed_data(df)
Common Preprocessing Mistakes
Mistake 1: Data Leakage
# WRONG: Fitting scaler on entire dataset
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df) # Includes test data!
# RIGHT: Fit only on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Use fitted scaler
Mistake 2: Not Saving Preprocessing Objects
import joblib
# Save scaler for production use
joblib.dump(scaler, 'scaler.pkl')
joblib.dump(encoder, 'encoder.pkl')
# Load in production
scaler = joblib.load('scaler.pkl')
new_data_scaled = scaler.transform(new_data)
Mistake 3: Forgetting Feature Order
# Always maintain feature order
feature_columns = ['age', 'income', 'education']
# When preprocessing new data
X_new = X_new[feature_columns] # Ensure same order
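A minimal sketch of one way to enforce this in practice, assuming X_train and X_new are pandas DataFrames and reusing joblib from the previous mistake; 'feature_columns.pkl' is just a placeholder filename.
import joblib
# At training time: remember the exact column order the model was trained on
feature_columns = list(X_train.columns)
joblib.dump(feature_columns, 'feature_columns.pkl')  # placeholder filename
# In production: load the list and force new data into the same order;
# reindex() also exposes missing columns as NaN instead of silently misaligning them
feature_columns = joblib.load('feature_columns.pkl')
X_new = X_new.reindex(columns=feature_columns)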
Feature Engineering Basics
Creating new features from existing ones can dramatically improve model performance.
Datetime Features
# Extract time-based features
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['quarter'] = df['date'].dt.quarter
# Time since event
df['days_since_start'] = (df['date'] - df['date'].min()).dt.days
Interaction Features
# Combine features meaningfully
df['price_per_sqft'] = df['price'] / df['square_feet']
df['income_to_age_ratio'] = df['income'] / df['age']
# Polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['feature1', 'feature2']])
Binning Continuous Variables
# Create categorical bins from numeric data
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 35, 50, 65, 100],
                         labels=['Child', 'Young Adult', 'Adult', 'Senior', 'Elderly'])
# Or quantile-based bins
df['income_quartile'] = pd.qcut(df['income'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
Text Feature Extraction
# Basic text features
df['text_length'] = df['comment'].str.len()
df['word_count'] = df['comment'].str.split().str.len()
df['avg_word_length'] = df['text_length'] / df['word_count']
df['has_special_chars'] = df['comment'].str.contains('[!?@#$]').astype(int)
Aggregation Features
# Group-based statistics
customer_stats = df.groupby('customer_id').agg({
    'purchase_amount': ['mean', 'sum', 'count'],
    'days_since_last_purchase': 'min'
})
# Flatten the MultiIndex columns so the merge key is a plain column
customer_stats.columns = ['_'.join(col) for col in customer_stats.columns]
customer_stats = customer_stats.reset_index()
# Merge back to original dataframe
df = df.merge(customer_stats, on='customer_id', how='left')
Your Preprocessing Checklist
- Handle missing values (strategy chosen)
- Remove duplicates
- Detect and handle outliers
- Encode categorical variables
- Scale numeric features
- Create relevant features
- Validate preprocessed data
- Save preprocessing objects
- Document preprocessing steps
Next Steps
- Learn sklearn preprocessing
- Explore feature engineering techniques
- Study domain-specific preprocessing
- Practice with Kaggle datasets
Remember: Good preprocessing is invisible, but bad preprocessing ruins everything!
The 80/20 Rule of Preprocessing
In my experience, 80% of preprocessing issues come from 20% of common problems:
The Critical 20%:
- Missing Values: Handle them systematically with documented strategies rather than ad-hoc deletion
- Data Type Mismatches: Ensure numeric columns are actually numeric and dates are datetime objects (see the sketch below)
- Outliers: Detect and handle extreme values that can skew your models significantly
- Feature Scale: Normalize or standardize features so no single feature dominates due to scale
Focus intensely on these four areas, and you'll catch most data quality issues before they become problems.
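Missing values, outliers, and scaling all have code earlier in this guide; for data type mismatches, here is a minimal sketch of the kind of check I mean (the 'price' and 'signup_date' column names are hypothetical).
import pandas as pd
# Inspect what pandas actually inferred for each column
print(df.dtypes)
# Coerce columns that should be numeric or datetime; values that fail become NaN/NaT
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
# Anything that failed to convert now shows up as missing and can be handled in Step 1
print(df[['price', 'signup_date']].isnull().sum())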
Building Your Preprocessing Intuition
The more you preprocess data, the more you develop intuition for what's wrong:
Red flags I watch for (quick checks sketched below the list):
- Standard deviation much larger than mean (possible outliers or skewed distribution)
- Too many unique values in categorical column (might need grouping or is actually numeric)
- Suspiciously round numbers (might be placeholder values like -999 or 0)
- Date ranges that don't make business sense (future dates, dates before company existed)
- Correlation between features that should be independent (possible data leakage)
Trust your instincts when something feels off. Investigate anomalies rather than dismissing them.
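A rough sketch of the quick checks behind those red flags; the thresholds and the -999 sentinel are examples I happen to use, not fixed rules.
import numpy as np
numeric_cols = df.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    mean, std = df[col].mean(), df[col].std()
    # A std that dwarfs the mean often signals heavy skew or outliers
    if std > 3 * abs(mean):
        print(f"{col}: std ({std:.2f}) dwarfs mean ({mean:.2f}) - inspect for outliers or skew")
    # Suspiciously frequent sentinel-looking values such as -999
    if (df[col] == -999).any():
        print(f"{col}: contains -999 - possible placeholder for missing data")
# Categorical columns with very many unique values may need grouping or are really numeric
for col in df.select_dtypes(include=['object']).columns:
    if df[col].nunique() > 0.5 * len(df):
        print(f"{col}: {df[col].nunique()} unique values - high-cardinality category")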
Found this guide helpful? Share it with your ML community! Connect with me on Twitter or LinkedIn for more data science tips.
Support My Work
If this guide helped you master data preprocessing, build robust ML pipelines, or avoid common preprocessing mistakes, I'd really appreciate your support! Creating comprehensive, free content like this takes significant time and effort. Your support helps me continue sharing knowledge and creating more helpful resources for developers.
☕ Buy me a coffee - Every contribution, big or small, means the world to me and keeps me motivated to create more content!
Cover image by UX Indonesia on Unsplash