Pandas Data Cleaning Techniques: Handle Missing Data Like a Pro
Master data cleaning with pandas in Python. Learn to handle missing values, remove duplicates, fix data types, and transform messy datasets into analysis-ready data with practical examples.

When Real Data Hit Me Like a Truck
My first data science project seemed simple: analyze customer data and build a prediction model. I loaded the CSV file, excited to dive into machine learning algorithms.
Then I saw it. NaN values everywhere. Dates as strings. "N/A" instead of numbers. Duplicate rows. Inconsistent categories. My clean textbook examples never prepared me for this mess.
I learned the hard way: data scientists spend most of their time, commonly estimated at around 80%, just cleaning data. Not glamorous, but absolutely critical. Garbage in, garbage out; no amount of fancy algorithms can fix dirty data.
This guide will teach you everything I wish I knew about data cleaning with pandas. From handling missing values to fixing messy categorical data, you'll learn to transform real-world chaos into analysis-ready datasets.
Why Data Cleaning Matters
import pandas as pd
import numpy as np
# Create messy data (like real-world datasets)
messy_data = {
'customer_id': [1, 2, 3, 4, 5, 5, 7], # Duplicate ID
'name': ['Alice', 'Bob', 'CHARLIE', 'david', None, 'Eve', 'Frank'],
'age': [25, np.nan, 35, '40', 28, 28, np.nan], # Mixed types, missing
'email': ['alice@example.com', 'bob@invalid', 'charlie@example.com',
'david@example.com', 'eve@example.com', 'eve@example.com', None],
'purchase_amount': [100.0, 150.5, None, '200', -50, 75.0, 125.0], # Negative value?
'date': ['2024-01-15', '2024/01/20', '15-01-2024', '2024-01-25',
'2024-01-30', '2024-01-30', 'invalid']
}
df = pd.DataFrame(messy_data)
print("Messy real-world data:")
print(df)
print(f"\nData types:\n{df.dtypes}")
This is what real data looks like. Let's clean it step by step.
Detecting Missing Data
Identifying Missing Values
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print("\nPercentage missing:")
print((df.isnull().sum() / len(df)) * 100)
# Boolean mask of missing values
print("\nMissing value mask (True = missing):")
print(df.isnull())
# Rows with any missing values
rows_with_missing = df[df.isnull().any(axis=1)]
print(f"\nRows with missing data: {len(rows_with_missing)}")
Different Types of Missing Values
# Create data with various missing indicators
df_missing = pd.DataFrame({
'A': [1, 2, None, 4, np.nan],
'B': [5, '', 'NA', 8, 'null'],
'C': [9, 10, -999, 12, 0]
})
print("Various missing indicators:")
print(df_missing)
# pandas only treats None and np.nan as missing here; the strings '', 'NA', 'null' and the sentinel -999 are not recognized
print("\nDetected as missing by pandas:")
print(df_missing.isnull())
# Replace custom missing indicators
df_cleaned = df_missing.replace(['', 'NA', 'null', -999], np.nan)
print("\nAfter replacing custom indicators:")
print(df_cleaned)
print("\nNow detected:")
print(df_cleaned.isnull())
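If the messy values come from a CSV file, you can often deal with them at load time instead: pandas already treats empty fields and 'NA' as missing when reading, and the na_values parameter extends that list. The filename below is purely illustrative:
# Treat custom sentinels as NaN while reading (hypothetical file)
df_from_csv = pd.read_csv('customers.csv', na_values=['null', 'N/A', -999])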
Handling Missing Data
Strategy 1: Remove Missing Data
df_example = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8],
'C': [9, 10, 11, 12]
})
# Drop rows with ANY missing values
df_dropped = df_example.dropna()
print("After dropping rows with any NaN:")
print(df_dropped)
# Drop rows with ALL values missing
df_example_all_nan = df_example.copy()
df_example_all_nan.loc[len(df_example_all_nan)] = [np.nan, np.nan, np.nan]
df_dropped = df_example_all_nan.dropna(how='all')
print("\nAfter dropping rows with all NaN:")
print(df_dropped)
# Drop columns with missing values
df_dropped = df_example.dropna(axis=1)
print("\nAfter dropping columns with NaN:")
print(df_dropped)
# Drop rows with missing in specific columns
df_dropped = df_example.dropna(subset=['A'])
print("\nAfter dropping rows where 'A' is NaN:")
print(df_dropped)
# Require minimum non-null values
df_dropped = df_example.dropna(thresh=2) # At least 2 non-null values
print("\nAfter requiring at least 2 non-null values per row:")
print(df_dropped)
Strategy 2: Fill Missing Data
df_example = pd.DataFrame({
'A': [1, 2, np.nan, 4, np.nan],
'B': [5, np.nan, 7, 8, np.nan],
'C': [9, 10, 11, 12, 13]
})
# Fill with constant
df_filled = df_example.fillna(0)
print("Filled with 0:")
print(df_filled)
# Fill with mean
df_filled = df_example.fillna(df_example.mean())
print("\nFilled with column mean:")
print(df_filled)
# Fill with median (more robust to outliers)
df_filled = df_example.fillna(df_example.median())
print("\nFilled with median:")
print(df_filled)
# Forward fill (use previous value)
df_filled = df_example.ffill()  # fillna(method='ffill') is deprecated in newer pandas
print("\nForward filled:")
print(df_filled)
# Backward fill
df_filled = df_example.bfill()  # fillna(method='bfill') is deprecated in newer pandas
print("\nBackward filled:")
print(df_filled)
# Fill with interpolation
df_filled = df_example.interpolate()
print("\nInterpolated:")
print(df_filled)
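You can also mix strategies by passing a dictionary to fillna, which applies a different fill value to each column. A small sketch using the same df_example:
# Constant for column A, column mean for column B; C has no missing values
df_filled = df_example.fillna({'A': 0, 'B': df_example['B'].mean()})
print("\nPer-column fill values:")
print(df_filled)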
Strategy 3: Fill by Group
# Create data with categories
df_grouped = pd.DataFrame({
'category': ['A', 'A', 'B', 'B', 'A', 'B'],
'value': [10, np.nan, 20, np.nan, 30, 25]
})
print("Original data:")
print(df_grouped)
# Fill missing with group mean
df_grouped['value'] = df_grouped.groupby('category')['value'].transform(
lambda x: x.fillna(x.mean())
)
print("\nFilled with group mean:")
print(df_grouped)
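An equivalent pattern skips the lambda entirely: transform('mean') broadcasts each group's mean back onto the original rows, and fillna accepts that Series directly. Shown as a sketch; run against the unfilled data it produces the same result as above:
# Series of group means aligned with the original index
group_means = df_grouped.groupby('category')['value'].transform('mean')
df_grouped['value'] = df_grouped['value'].fillna(group_means)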
Handling Duplicates
Detecting Duplicates
df_dupes = pd.DataFrame({
'customer_id': [1, 2, 3, 2, 4, 3],
'name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Charlie'],
'amount': [100, 150, 200, 150, 250, 200]
})
print("Data with duplicates:")
print(df_dupes)
# Check for duplicates
print("\nDuplicate rows:")
print(df_dupes.duplicated())
print(f"\nNumber of duplicates: {df_dupes.duplicated().sum()}")
# Show duplicate rows
print("\nDuplicate rows:")
print(df_dupes[df_dupes.duplicated(keep=False)])
Removing Duplicates
# Remove duplicate rows (keep first occurrence)
df_no_dupes = df_dupes.drop_duplicates()
print("After removing duplicates:")
print(df_no_dupes)
# Remove duplicates based on specific columns
df_no_dupes = df_dupes.drop_duplicates(subset=['customer_id'])
print("\nRemove duplicates based on customer_id:")
print(df_no_dupes)
# Keep last occurrence instead
df_no_dupes = df_dupes.drop_duplicates(keep='last')
print("\nKeeping last occurrence:")
print(df_no_dupes)
# Remove ALL duplicates (don't keep any)
df_no_dupes = df_dupes.drop_duplicates(keep=False)
print("\nRemove all occurrences of duplicates:")
print(df_no_dupes)
Fixing Data Types
Converting Data Types
df_types = pd.DataFrame({
'numbers_as_string': ['1', '2', '3', '4'],
'mixed_types': [1, '2', 3.0, '4'],
'dates_as_string': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04']
})
print("Original types:")
print(df_types.dtypes)
# Convert to numeric
df_types['numbers_as_string'] = pd.to_numeric(df_types['numbers_as_string'])
# Convert with error handling
df_types['mixed_types'] = pd.to_numeric(df_types['mixed_types'], errors='coerce')
# Convert to datetime
df_types['dates_as_string'] = pd.to_datetime(df_types['dates_as_string'])
print("\nConverted types:")
print(df_types.dtypes)
print("\nConverted data:")
print(df_types)
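One subtlety worth knowing: np.nan is a float, so an integer column with missing values silently becomes float64. If you want to keep whole numbers and missing values together, pandas offers the nullable 'Int64' dtype (note the capital I). This is optional and not used elsewhere in this guide:
# The NaN forces float64; the nullable Int64 dtype keeps integers plus <NA>
ages = pd.Series([25, np.nan, 35, 40])
print(ages.dtype)
print(ages.astype('Int64'))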
Handling Conversion Errors
messy_numbers = pd.Series(['1', '2', 'three', '4', '5.5', 'invalid'])
# Coerce errors to NaN
converted = pd.to_numeric(messy_numbers, errors='coerce')
print("Coerce errors to NaN:")
print(converted)
# Ignore errors (return original)
converted = pd.to_numeric(messy_numbers, errors='ignore')  # deprecated since pandas 2.2; returns the input unchanged
print("\nIgnore errors:")
print(converted)
# Raise errors (will fail)
try:
    converted = pd.to_numeric(messy_numbers, errors='raise')
except ValueError as e:
    print(f"\nRaise error: {e}")
Cleaning String Data
Standardizing Text
df_text = pd.DataFrame({
'name': [' Alice ', 'BOB', 'charlie', 'David '],
'email': ['ALICE@EXAMPLE.COM', 'bob@Example.com', 'charlie@example.COM', 'david@example.com']
})
print("Original text:")
print(df_text)
# Strip whitespace
df_text['name'] = df_text['name'].str.strip()
# Lowercase
df_text['name'] = df_text['name'].str.lower()
df_text['email'] = df_text['email'].str.lower()
# Capitalize first letter
df_text['name'] = df_text['name'].str.capitalize()
print("\nCleaned text:")
print(df_text)
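str.capitalize() only handles a single word. For names made of several words, str.title() capitalizes each word; a quick illustration on hypothetical values:
# 'mary jane smith' becomes 'Mary Jane Smith'
full_names = pd.Series(['mary jane smith', 'JOHN DOE'])
print(full_names.str.title())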
Text Replacement and Pattern Matching
df_text = pd.DataFrame({
'phone': ['123-456-7890', '(123) 456-7890', '123.456.7890', '1234567890']
})
print("Original phone numbers:")
print(df_text)
# Remove non-numeric characters
df_text['phone_cleaned'] = df_text['phone'].str.replace(r'[^0-9]', '', regex=True)
print("\nCleaned phone numbers:")
print(df_text)
# Extract patterns
emails = pd.Series(['alice@example.com', 'invalid', 'bob@test.org', 'not-an-email'])
valid_emails = emails.str.extract(r'([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})')
print("\nExtracted valid emails:")
print(valid_emails)
Handling Outliers
Detecting Outliers
np.random.seed(42)
data = np.random.normal(100, 15, 100)
# Add some outliers
data = np.append(data, [200, 250, 300, -50, -100])
df_outliers = pd.DataFrame({'value': data})
print("Statistical summary:")
print(df_outliers.describe())
# Z-score method (flag values more than 3 standard deviations from the mean)
mean = df_outliers['value'].mean()
std = df_outliers['value'].std()
df_outliers['z_score'] = (df_outliers['value'] - mean) / std
outliers = df_outliers[abs(df_outliers['z_score']) > 3]
print(f"\nOutliers (z-score > 3): {len(outliers)}")
print(outliers)
# IQR method
Q1 = df_outliers['value'].quantile(0.25)
Q3 = df_outliers['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = df_outliers[(df_outliers['value'] < lower_bound) |
(df_outliers['value'] > upper_bound)]
print(f"\nOutliers (IQR method): {len(outliers_iqr)}")
Removing and Capping Outliers
# Remove outliers
df_no_outliers = df_outliers[(df_outliers['value'] >= lower_bound) &
(df_outliers['value'] <= upper_bound)]
print(f"After removing outliers: {len(df_no_outliers)} rows")
# Cap outliers at the IQR bounds (winsorization-style capping)
df_capped = df_outliers.copy()
df_capped['value'] = df_capped['value'].clip(lower=lower_bound, upper=upper_bound)
print("\nAfter capping outliers:")
print(df_capped['value'].describe())
Date and Time Cleaning
Converting to Datetime
df_dates = pd.DataFrame({
'date_string': ['2024-01-15', '2024/01/20', '15-01-2024', 'Jan 25, 2024', 'invalid']
})
print("Original date strings:")
print(df_dates)
# Convert with error handling
df_dates['date'] = pd.to_datetime(df_dates['date_string'], errors='coerce')
print("\nConverted to datetime:")
print(df_dates)
print(f"Data type: {df_dates['date'].dtype}")
# Specify format for faster parsing
dates_consistent = pd.Series(['2024-01-15', '2024-01-20', '2024-01-25'])
converted = pd.to_datetime(dates_consistent, format='%Y-%m-%d')
print("\nWith specified format:")
print(converted)
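One caveat: pandas 2.0 and later infers a single format for the whole column, so genuinely mixed formats like the earlier example may need format='mixed' (and day-first strings such as '15-01-2024' may additionally need dayfirst=True). A sketch, assuming pandas 2.0+:
# Parse each element individually instead of inferring one shared format
mixed = pd.Series(['2024-01-15', '2024/01/20', 'Jan 25, 2024'])
print(pd.to_datetime(mixed, format='mixed'))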
Extracting Date Components
df_dates = pd.DataFrame({
'date': pd.to_datetime(['2024-01-15', '2024-02-20', '2024-03-10', '2024-04-05'])
})
# Extract components
df_dates['year'] = df_dates['date'].dt.year
df_dates['month'] = df_dates['date'].dt.month
df_dates['day'] = df_dates['date'].dt.day
df_dates['day_of_week'] = df_dates['date'].dt.dayofweek
df_dates['day_name'] = df_dates['date'].dt.day_name()
df_dates['quarter'] = df_dates['date'].dt.quarter
print("Date components:")
print(df_dates)
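When you plan to group by month or quarter, converting full dates to periods is often more convenient than juggling separate year and month columns. A small optional addition:
# Collapse full dates to a monthly period, e.g. 2024-01-15 -> 2024-01
df_dates['month_period'] = df_dates['date'].dt.to_period('M')
print(df_dates[['date', 'month_period']])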
Categorical Data Cleaning
Standardizing Categories
df_categories = pd.DataFrame({
'status': ['active', 'Active', 'ACTIVE', 'inactive', 'Inactive', 'pending']
})
print("Original categories:")
print(df_categories['status'].value_counts())
# Standardize
df_categories['status'] = df_categories['status'].str.lower()
print("\nStandardized:")
print(df_categories['status'].value_counts())
# Map similar categories
category_map = {
'active': 'Active',
'inactive': 'Inactive',
'pending': 'Pending'
}
df_categories['status'] = df_categories['status'].map(category_map)
print("\nMapped:")
print(df_categories['status'].value_counts())
Handling Unknown Categories
df_categories = pd.DataFrame({
'category': ['A', 'B', 'C', 'Unknown', 'N/A', '', 'Other', 'A']
})
print("Original:")
print(df_categories['category'].value_counts())
# Replace various "unknown" indicators
unknown_indicators = ['Unknown', 'N/A', '', 'Other']
df_categories['category'] = df_categories['category'].replace(unknown_indicators, np.nan)
# Fill with mode (most common value)
mode_value = df_categories['category'].mode()[0]
df_categories['category'] = df_categories['category'].fillna(mode_value)
print("\nCleaned:")
print(df_categories['category'].value_counts())
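Once the labels are consistent, converting the column to pandas' category dtype can save memory on large datasets and makes the set of allowed values explicit. Optional, shown only as a sketch:
# Category dtype stores each distinct label once and references it by code
df_categories['category'] = df_categories['category'].astype('category')
print(df_categories['category'].dtype)
print(df_categories['category'].cat.categories)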
Comprehensive Cleaning Pipeline
def clean_dataset(df):
    """Comprehensive data cleaning pipeline."""
    df_clean = df.copy()
    # 1. Remove completely empty rows/columns
    df_clean = df_clean.dropna(how='all')
    df_clean = df_clean.dropna(axis=1, how='all')
    # 2. Remove duplicates
    df_clean = df_clean.drop_duplicates()
    # 3. Standardize column names (strip stray spaces, then snake_case)
    df_clean.columns = df_clean.columns.str.strip().str.lower().str.replace(' ', '_')
    # 4. Handle missing values per column type
    for col in df_clean.columns:
        # Numeric columns: fill with median
        if pd.api.types.is_numeric_dtype(df_clean[col]):
            df_clean[col] = df_clean[col].fillna(df_clean[col].median())
        # Categorical columns: fill with mode
        elif df_clean[col].dtype == 'object':
            if df_clean[col].mode().shape[0] > 0:
                df_clean[col] = df_clean[col].fillna(df_clean[col].mode()[0])
    # 5. Strip whitespace from string columns
    string_cols = df_clean.select_dtypes(include=['object']).columns
    for col in string_cols:
        df_clean[col] = df_clean[col].str.strip()
    # 6. Reset index
    df_clean = df_clean.reset_index(drop=True)
    return df_clean
# Test the pipeline
messy_df = pd.DataFrame({
'Name ': [' Alice ', 'Bob', 'Charlie', 'Bob', None],
'Age': [25, np.nan, 35, 40, 28],
'City': ['NY', 'LA', None, 'LA', 'SF']
})
print("Before cleaning:")
print(messy_df)
print(f"\nMissing values:\n{messy_df.isnull().sum()}")
cleaned_df = clean_dataset(messy_df)
print("\n" + "="*50)
print("After cleaning:")
print(cleaned_df)
print(f"\nMissing values:\n{cleaned_df.isnull().sum()}")
print(f"\nColumn names: {cleaned_df.columns.tolist()}")
Real-World Example: Customer Data Cleaning
# Simulate messy customer data
np.random.seed(42)
customer_data = pd.DataFrame({
'customer_id': list(range(1, 101)) + [50, 75], # Duplicates
'name': ['Customer ' + str(i) if i % 10 != 0 else None for i in range(1, 103)],
'email': ['customer' + str(i) + '@example.com' if i % 15 != 0 else 'invalid'
for i in range(1, 103)],
'age': [np.random.randint(18, 80) if i % 20 != 0 else np.nan for i in range(1, 103)],
'purchase_amount': [np.random.uniform(10, 500) if i % 12 != 0 else None
for i in range(1, 103)],
'signup_date': ['2024-' + str((i % 12) + 1).zfill(2) + '-15' if i % 18 != 0 else 'invalid'
for i in range(1, 103)]
})
print("Original customer data:")
print(f"Shape: {customer_data.shape}")
print(f"Missing values:\n{customer_data.isnull().sum()}")
print(f"Duplicates: {customer_data.duplicated().sum()}")
# Clean the data
print("\n" + "="*50)
print("Cleaning process:")
# 1. Remove duplicates
print("1. Removing duplicates...")
customer_data = customer_data.drop_duplicates(subset=['customer_id'], keep='first')
print(f" Rows remaining: {len(customer_data)}")
# 2. Fix email addresses
print("2. Cleaning email addresses...")
valid_email_mask = customer_data['email'].str.contains('@', na=False)
customer_data.loc[~valid_email_mask, 'email'] = np.nan
print(f" Invalid emails replaced: {(~valid_email_mask).sum()}")
# 3. Handle missing ages
print("3. Handling missing ages...")
median_age = customer_data['age'].median()
customer_data['age'] = customer_data['age'].fillna(median_age)
print(f" Filled with median: {median_age}")
# 4. Handle missing purchase amounts
print("4. Handling missing purchase amounts...")
mean_purchase = customer_data['purchase_amount'].mean()
customer_data['purchase_amount'] = customer_data['purchase_amount'].fillna(mean_purchase)
print(f" Filled with mean: {mean_purchase:.2f}")
# 5. Clean dates
print("5. Cleaning dates...")
customer_data['signup_date'] = pd.to_datetime(customer_data['signup_date'], errors='coerce')
# Fill invalid dates with most common date
if customer_data['signup_date'].mode().shape[0] > 0:
    mode_date = customer_data['signup_date'].mode()[0]
    customer_data['signup_date'] = customer_data['signup_date'].fillna(mode_date)
# 6. Remove outliers in purchase amount
print("6. Handling outliers...")
Q1 = customer_data['purchase_amount'].quantile(0.25)
Q3 = customer_data['purchase_amount'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
before_count = len(customer_data)
customer_data = customer_data[
(customer_data['purchase_amount'] >= lower_bound) &
(customer_data['purchase_amount'] <= upper_bound)
]
print(f" Removed {before_count - len(customer_data)} outliers")
print("\n" + "="*50)
print("Final cleaned data:")
print(f"Shape: {customer_data.shape}")
print(f"Missing values:\n{customer_data.isnull().sum()}")
print(f"\nSample data:")
print(customer_data.head())
Data Quality Report
def generate_quality_report(df):
    """Generate comprehensive data quality report."""
    report = {
        'total_rows': len(df),
        'total_columns': len(df.columns),
        'missing_values': df.isnull().sum().to_dict(),
        'missing_percentage': (df.isnull().sum() / len(df) * 100).to_dict(),
        'duplicates': df.duplicated().sum(),
        'data_types': df.dtypes.to_dict()
    }
    print("DATA QUALITY REPORT")
    print("=" * 50)
    print(f"Total Rows: {report['total_rows']}")
    print(f"Total Columns: {report['total_columns']}")
    print(f"Duplicate Rows: {report['duplicates']}")
    print("\nMissing Values:")
    for col, count in report['missing_values'].items():
        if count > 0:
            pct = report['missing_percentage'][col]
            print(f"  {col}: {count} ({pct:.1f}%)")
    print("\nData Types:")
    for col, dtype in report['data_types'].items():
        print(f"  {col}: {dtype}")
    return report
# Generate report
report = generate_quality_report(customer_data)
Best Practices
1. Always Keep a Copy
# ALWAYS work on a copy
df_clean = df.copy()
2. Document Your Cleaning Steps
# Keep track of what you did
cleaning_log = []
# Remove duplicates
before = len(df)
df = df.drop_duplicates()
after = len(df)
cleaning_log.append(f"Removed {before - after} duplicates")
# Fill missing values
cleaning_log.append("Filled missing ages with median")
print("Cleaning log:")
for step in cleaning_log:
    print(f"- {step}")
3. Validate After Cleaning
# Check if cleaning worked
assert df['age'].isnull().sum() == 0, "Still have missing ages!"
assert df.duplicated().sum() == 0, "Still have duplicates!"
assert (df['purchase_amount'] >= 0).all(), "Negative purchases exist!"
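On larger projects it can help to collect every failed check instead of stopping at the first assert. A minimal sketch, assuming you only need a list of problems to print or log:
def validate(df):
    """Return a list of data quality problems instead of raising on the first one."""
    problems = []
    if df.isnull().sum().sum() > 0:
        problems.append('missing values remain')
    if df.duplicated().sum() > 0:
        problems.append('duplicate rows remain')
    if 'purchase_amount' in df.columns and (df['purchase_amount'] < 0).any():
        problems.append('negative purchase amounts')
    return problems
print(validate(df) or 'All checks passed')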
Your Data Cleaning Mastery
You've learned the essential data cleaning techniques:
- Missing data - Detect, remove, or fill
- Duplicates - Find and remove
- Data types - Convert and fix
- Text data - Standardize and clean
- Outliers - Detect and handle
- Dates - Parse and extract
- Categories - Standardize and map
Remember: Clean data is the foundation of every successful data science project.
Next Steps
- Practice with messy datasets from Kaggle
- Learn advanced imputation techniques
- Explore data validation frameworks
- Study pandas missing data documentation
Clean data isn't exciting, but it's essential. Master these techniques, and you'll save countless hours of frustration.