Pandas Data Cleaning Techniques: Handle Missing Data Like a Pro
Master data cleaning with pandas in Python. Learn to handle missing values, remove duplicates, fix data types, and transform messy datasets into analysis-ready data with practical examples.

When Real Data Hit Me Like a Truck
My first data science project seemed simple: analyze customer data and build a prediction model. I loaded the CSV file, excited to dive into machine learning algorithms.
Then I saw it. NaN values everywhere. Dates as strings. "N/A" instead of numbers. Duplicate rows. Inconsistent categories. My clean textbook examples never prepared me for this mess.
I learned the hard way: data scientists spend most of their time, commonly estimated at around 80%, just cleaning data. Not glamorous, but absolutely critical. Garbage in, garbage out; no amount of fancy algorithms can fix dirty data.
This guide will teach you everything I wish I knew about data cleaning with pandas. From handling missing values to fixing messy categorical data, you'll learn to transform real-world chaos into analysis-ready datasets.
Why Data Cleaning Matters
import pandas as pd
import numpy as np
# Create messy data (like real-world datasets)
messy_data = {
'customer_id': [1, 2, 3, 4, 5, 5, 7], # Duplicate ID
'name': ['Alice', 'Bob', 'CHARLIE', 'david', None, 'Eve', 'Frank'],
'age': [25, np.nan, 35, '40', 28, 28, np.nan], # Mixed types, missing
'email': ['alice@example.com', 'bob@invalid', 'charlie@example.com',
'david@example.com', 'eve@example.com', 'eve@example.com', None],
'purchase_amount': [100.0, 150.5, None, '200', -50, 75.0, 125.0], # Negative value?
'date': ['2024-01-15', '2024/01/20', '15-01-2024', '2024-01-25',
'2024-01-30', '2024-01-30', 'invalid']
}
df = pd.DataFrame(messy_data)
print("Messy real-world data:")
print(df)
print(f"\nData types:\n{df.dtypes}")
This is what real data looks like. Let's clean it step by step.
Detecting Missing Data
Identifying Missing Values
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print("\nPercentage missing:")
print((df.isnull().sum() / len(df)) * 100)
# Boolean mask of missing values
print("\nMissing value mask (True = missing):")
print(df.isnull())
# Rows with any missing values
rows_with_missing = df[df.isnull().any(axis=1)]
print(f"\nRows with missing data: {len(rows_with_missing)}")
Different Types of Missing Values
# Create data with various missing indicators
df_missing = pd.DataFrame({
'A': [1, 2, None, 4, np.nan],
'B': [5, '', 'NA', 8, 'null'],
'C': [9, 10, -999, 12, 0]
})
print("Various missing indicators:")
print(df_missing)
# pandas only treats None and np.nan as missing here; the strings '', 'NA', 'null' and the sentinel -999 are not recognized
print("\nDetected as missing by pandas:")
print(df_missing.isnull())
# Replace custom missing indicators
df_cleaned = df_missing.replace(['', 'NA', 'null', -999], np.nan)
print("\nAfter replacing custom indicators:")
print(df_cleaned)
print("\nNow detected:")
print(df_cleaned.isnull())
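If the messy values come from a CSV file, you can often deal with them at load time instead: pandas already treats empty fields and 'NA' as missing when reading, and the na_values parameter extends that list. The filename below is purely illustrative:
# Treat custom sentinels as NaN while reading (hypothetical file)
df_from_csv = pd.read_csv('customers.csv', na_values=['null', 'N/A', -999])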
Handling Missing Data
Strategy 1: Remove Missing Data
df_example = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8],
'C': [9, 10, 11, 12]
})
# Drop rows with ANY missing values
df_dropped = df_example.dropna()
print("After dropping rows with any NaN:")
print(df_dropped)
# Drop rows with ALL values missing
df_example_all_nan = df_example.copy()
df_example_all_nan.loc[len(df_example_all_nan)] = [np.nan, np.nan, np.nan]
df_dropped = df_example_all_nan.dropna(how='all')
print("\nAfter dropping rows with all NaN:")
print(df_dropped)
# Drop columns with missing values
df_dropped = df_example.dropna(axis=1)
print("\nAfter dropping columns with NaN:")
print(df_dropped)
# Drop rows with missing in specific columns
df_dropped = df_example.dropna(subset=['A'])
print("\nAfter dropping rows where 'A' is NaN:")
print(df_dropped)
# Require minimum non-null values
df_dropped = df_example.dropna(thresh=2) # At least 2 non-null values
print("\nAfter requiring at least 2 non-null values per row:")
print(df_dropped)
Strategy 2: Fill Missing Data
df_example = pd.DataFrame({
'A': [1, 2, np.nan, 4, np.nan],
'B': [5, np.nan, 7, 8, np.nan],
'C': [9, 10, 11, 12, 13]
})
# Fill with constant
df_filled = df_example.fillna(0)
print("Filled with 0:")
print(df_filled)
# Fill with mean
df_filled = df_example.fillna(df_example.mean())
print("\nFilled with column mean:")
print(df_filled)
# Fill with median (more robust to outliers)
df_filled = df_example.fillna(df_example.median())
print("\nFilled with median:")
print(df_filled)
# Forward fill (use previous value)
df_filled = df_example.ffill()  # fillna(method='ffill') is deprecated in newer pandas
print("\nForward filled:")
print(df_filled)
# Backward fill
df_filled = df_example.bfill()  # fillna(method='bfill') is deprecated in newer pandas
print("\nBackward filled:")
print(df_filled)
# Fill with interpolation
df_filled = df_example.interpolate()
print("\nInterpolated:")
print(df_filled)
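You can also mix strategies by passing a dictionary to fillna, which applies a different fill value to each column. A small sketch using the same df_example:
# Constant for column A, column mean for column B; C has no missing values
df_filled = df_example.fillna({'A': 0, 'B': df_example['B'].mean()})
print("\nPer-column fill values:")
print(df_filled)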
Strategy 3: Fill by Group
# Create data with categories
df_grouped = pd.DataFrame({
'category': ['A', 'A', 'B', 'B', 'A', 'B'],
'value': [10, np.nan, 20, np.nan, 30, 25]
})
print("Original data:")
print(df_grouped)
# Fill missing with group mean
df_grouped['value'] = df_grouped.groupby('category')['value'].transform(
lambda x: x.fillna(x.mean())
)
print("\nFilled with group mean:")
print(df_grouped)
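An equivalent pattern skips the lambda entirely: transform('mean') broadcasts each group's mean back onto the original rows, and fillna accepts that Series directly. Shown as a sketch; run against the unfilled data it produces the same result as above:
# Series of group means aligned with the original index
group_means = df_grouped.groupby('category')['value'].transform('mean')
df_grouped['value'] = df_grouped['value'].fillna(group_means)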
Handling Duplicates
Detecting Duplicates
df_dupes = pd.DataFrame({
'customer_id': [1, 2, 3, 2, 4, 3],
'name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Charlie'],
'amount': [100, 150, 200, 150, 250, 200]
})
print("Data with duplicates:")
print(df_dupes)
# Check for duplicates
print("\nDuplicate rows:")
print(df_dupes.duplicated())
print(f"\nNumber of duplicates: {df_dupes.duplicated().sum()}")
# Show duplicate rows
print("\nDuplicate rows:")
print(df_dupes[df_dupes.duplicated(keep=False)])
Removing Duplicates
# Remove duplicate rows (keep first occurrence)
df_no_dupes = df_dupes.drop_duplicates()
print("After removing duplicates:")
print(df_no_dupes)
# Remove duplicates based on specific columns
df_no_dupes = df_dupes.drop_duplicates(subset=['customer_id'])
print("\nRemove duplicates based on customer_id:")
print(df_no_dupes)
# Keep last occurrence instead
df_no_dupes = df_dupes.drop_duplicates(keep='last')
print("\nKeeping last occurrence:")
print(df_no_dupes)
# Remove ALL duplicates (don't keep any)
df_no_dupes = df_dupes.drop_duplicates(keep=False)
print("\nRemove all occurrences of duplicates:")
print(df_no_dupes)
Fixing Data Types
Converting Data Types
df_types = pd.DataFrame({
'numbers_as_string': ['1', '2', '3', '4'],
'mixed_types': [1, '2', 3.0, '4'],
'dates_as_string': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04']
})
print("Original types:")
print(df_types.dtypes)
# Convert to numeric
df_types['numbers_as_string'] = pd.to_numeric(df_types['numbers_as_string'])
# Convert with error handling
df_types['mixed_types'] = pd.to_numeric(df_types['mixed_types'], errors='coerce')
# Convert to datetime
df_types['dates_as_string'] = pd.to_datetime(df_types['dates_as_string'])
print("\nConverted types:")
print(df_types.dtypes)
print("\nConverted data:")
print(df_types)
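One subtlety worth knowing: np.nan is a float, so an integer column with missing values silently becomes float64. If you want to keep whole numbers and missing values together, pandas offers the nullable 'Int64' dtype (note the capital I). This is optional and not used elsewhere in this guide:
# The NaN forces float64; the nullable Int64 dtype keeps integers plus <NA>
ages = pd.Series([25, np.nan, 35, 40])
print(ages.dtype)
print(ages.astype('Int64'))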
Handling Conversion Errors
messy_numbers = pd.Series(['1', '2', 'three', '4', '5.5', 'invalid'])
# Coerce errors to NaN
converted = pd.to_numeric(messy_numbers, errors='coerce')
print("Coerce errors to NaN:")
print(converted)
# Ignore errors (return original)
converted = pd.to_numeric(messy_numbers, errors='ignore')  # deprecated since pandas 2.2; returns the input unchanged
print("\nIgnore errors:")
print(converted)
# Raise errors (will fail)
try:
    converted = pd.to_numeric(messy_numbers, errors='raise')
except ValueError as e:
    print(f"\nRaise error: {e}")
Cleaning String Data
Standardizing Text
df_text = pd.DataFrame({
'name': [' Alice ', 'BOB', 'charlie', 'David '],
'email': ['ALICE@EXAMPLE.COM', 'bob@Example.com', 'charlie@example.COM', 'david@example.com']
})
print("Original text:")
print(df_text)
# Strip whitespace
df_text['name'] = df_text['name'].str.strip()
# Lowercase
df_text['name'] = df_text['name'].str.lower()
df_text['email'] = df_text['email'].str.lower()
# Capitalize first letter
df_text['name'] = df_text['name'].str.capitalize()
print("\nCleaned text:")
print(df_text)
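str.capitalize() only handles a single word. For names made of several words, str.title() capitalizes each word; a quick illustration on hypothetical values:
# 'mary jane smith' becomes 'Mary Jane Smith'
full_names = pd.Series(['mary jane smith', 'JOHN DOE'])
print(full_names.str.title())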
Text Replacement and Pattern Matching
df_text = pd.DataFrame({
'phone': ['123-456-7890', '(123) 456-7890', '123.456.7890', '1234567890']
})
print("Original phone numbers:")
print(df_text)
# Remove non-numeric characters
df_text['phone_cleaned'] = df_text['phone'].str.replace(r'[^0-9]', '', regex=True)
print("\nCleaned phone numbers:")
print(df_text)
# Extract patterns
emails = pd.Series(['alice@example.com', 'invalid', 'bob@test.org', 'not-an-email'])
valid_emails = emails.str.extract(r'([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})')
print("\nExtracted valid emails:")
print(valid_emails)
Handling Outliers
Detecting Outliers
np.random.seed(42)
data = np.random.normal(100, 15, 100)
# Add some outliers
data = np.append(data, [200, 250, 300, -50, -100])
df_outliers = pd.DataFrame({'value': data})
print("Statistical summary:")
print(df_outliers.describe())
# Z-score method (flag values more than 3 standard deviations from the mean)
mean = df_outliers['value'].mean()
std = df_outliers['value'].std()
df_outliers['z_score'] = (df_outliers['value'] - mean) / std
outliers = df_outliers[abs(df_outliers['z_score']) > 3]
print(f"\nOutliers (z-score > 3): {len(outliers)}")
print(outliers)
# IQR method
Q1 = df_outliers['value'].quantile(0.25)
Q3 = df_outliers['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = df_outliers[(df_outliers['value'] < lower_bound) |
(df_outliers['value'] > upper_bound)]
print(f"\nOutliers (IQR method): {len(outliers_iqr)}")
Removing and Capping Outliers
# Remove outliers
df_no_outliers = df_outliers[(df_outliers['value'] >= lower_bound) &
(df_outliers['value'] <= upper_bound)]
print(f"After removing outliers: {len(df_no_outliers)} rows")
# Cap outliers at the IQR bounds (winsorization-style capping)
df_capped = df_outliers.copy()
df_capped['value'] = df_capped['value'].clip(lower=lower_bound, upper=upper_bound)
print("\nAfter capping outliers:")
print(df_capped['value'].describe())
Date and Time Cleaning
Converting to Datetime
df_dates = pd.DataFrame({
'date_string': ['2024-01-15', '2024/01/20', '15-01-2024', 'Jan 25, 2024', 'invalid']
})
print("Original date strings:")
print(df_dates)
# Convert with error handling
df_dates['date'] = pd.to_datetime(df_dates['date_string'], errors='coerce')
print("\nConverted to datetime:")
print(df_dates)
print(f"Data type: {df_dates['date'].dtype}")
# Specify format for faster parsing
dates_consistent = pd.Series(['2024-01-15', '2024-01-20', '2024-01-25'])
converted = pd.to_datetime(dates_consistent, format='%Y-%m-%d')
print("\nWith specified format:")
print(converted)
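One caveat: pandas 2.0 and later infers a single format for the whole column, so genuinely mixed formats like the earlier example may need format='mixed' (and day-first strings such as '15-01-2024' may additionally need dayfirst=True). A sketch, assuming pandas 2.0+:
# Parse each element individually instead of inferring one shared format
mixed = pd.Series(['2024-01-15', '2024/01/20', 'Jan 25, 2024'])
print(pd.to_datetime(mixed, format='mixed'))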
Extracting Date Components
df_dates = pd.DataFrame({
'date': pd.to_datetime(['2024-01-15', '2024-02-20', '2024-03-10', '2024-04-05'])
})
# Extract components
df_dates['year'] = df_dates['date'].dt.year
df_dates['month'] = df_dates['date'].dt.month
df_dates['day'] = df_dates['date'].dt.day
df_dates['day_of_week'] = df_dates['date'].dt.dayofweek
df_dates['day_name'] = df_dates['date'].dt.day_name()
df_dates['quarter'] = df_dates['date'].dt.quarter
print("Date components:")
print(df_dates)
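When you plan to group by month or quarter, converting full dates to periods is often more convenient than juggling separate year and month columns. A small optional addition:
# Collapse full dates to a monthly period, e.g. 2024-01-15 -> 2024-01
df_dates['month_period'] = df_dates['date'].dt.to_period('M')
print(df_dates[['date', 'month_period']])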
Categorical Data Cleaning
Standardizing Categories
df_categories = pd.DataFrame({
'status': ['active', 'Active', 'ACTIVE', 'inactive', 'Inactive', 'pending']
})
print("Original categories:")
print(df_categories['status'].value_counts())
# Standardize
df_categories['status'] = df_categories['status'].str.lower()
print("\nStandardized:")
print(df_categories['status'].value_counts())
# Map similar categories
category_map = {
'active': 'Active',
'inactive': 'Inactive',
'pending': 'Pending'
}
df_categories['status'] = df_categories['status'].map(category_map)
print("\nMapped:")
print(df_categories['status'].value_counts())
Handling Unknown Categories
df_categories = pd.DataFrame({
'category': ['A', 'B', 'C', 'Unknown', 'N/A', '', 'Other', 'A']
})
print("Original:")
print(df_categories['category'].value_counts())
# Replace various "unknown" indicators
unknown_indicators = ['Unknown', 'N/A', '', 'Other']
df_categories['category'] = df_categories['category'].replace(unknown_indicators, np.nan)
# Fill with mode (most common value)
mode_value = df_categories['category'].mode()[0]
df_categories['category'] = df_categories['category'].fillna(mode_value)
print("\nCleaned:")
print(df_categories['category'].value_counts())
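Once the labels are consistent, converting the column to pandas' category dtype can save memory on large datasets and makes the set of allowed values explicit. Optional, shown only as a sketch:
# Category dtype stores each distinct label once and references it by code
df_categories['category'] = df_categories['category'].astype('category')
print(df_categories['category'].dtype)
print(df_categories['category'].cat.categories)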
Comprehensive Cleaning Pipeline
def clean_dataset(df):
    """Comprehensive data cleaning pipeline."""
    df_clean = df.copy()
    # 1. Remove completely empty rows/columns
    df_clean = df_clean.dropna(how='all')
    df_clean = df_clean.dropna(axis=1, how='all')
    # 2. Remove duplicates
    df_clean = df_clean.drop_duplicates()
    # 3. Standardize column names (strip stray spaces, then snake_case)
    df_clean.columns = df_clean.columns.str.strip().str.lower().str.replace(' ', '_')
    # 4. Handle missing values per column type
    for col in df_clean.columns:
        # Numeric columns: fill with median
        if pd.api.types.is_numeric_dtype(df_clean[col]):
            df_clean[col] = df_clean[col].fillna(df_clean[col].median())
        # Categorical columns: fill with mode
        elif df_clean[col].dtype == 'object':
            if df_clean[col].mode().shape[0] > 0:
                df_clean[col] = df_clean[col].fillna(df_clean[col].mode()[0])
    # 5. Strip whitespace from string columns
    string_cols = df_clean.select_dtypes(include=['object']).columns
    for col in string_cols:
        df_clean[col] = df_clean[col].str.strip()
    # 6. Reset index
    df_clean = df_clean.reset_index(drop=True)
    return df_clean
# Test the pipeline
messy_df = pd.DataFrame({
'Name ': [' Alice ', 'Bob', 'Charlie', 'Bob', None],
'Age': [25, np.nan, 35, 40, 28],
'City': ['NY', 'LA', None, 'LA', 'SF']
})
print("Before cleaning:")
print(messy_df)
print(f"\nMissing values:\n{messy_df.isnull().sum()}")
cleaned_df = clean_dataset(messy_df)
print("\n" + "="*50)
print("After cleaning:")
print(cleaned_df)
print(f"\nMissing values:\n{cleaned_df.isnull().sum()}")
print(f"\nColumn names: {cleaned_df.columns.tolist()}")
Real-World Example: Customer Data Cleaning
# Simulate messy customer data
np.random.seed(42)
customer_data = pd.DataFrame({
'customer_id': list(range(1, 101)) + [50, 75], # Duplicates
'name': ['Customer ' + str(i) if i % 10 != 0 else None for i in range(1, 103)],
'email': ['customer' + str(i) + '@example.com' if i % 15 != 0 else 'invalid'
for i in range(1, 103)],
'age': [np.random.randint(18, 80) if i % 20 != 0 else np.nan for i in range(1, 103)],
'purchase_amount': [np.random.uniform(10, 500) if i % 12 != 0 else None
for i in range(1, 103)],
'signup_date': ['2024-' + str((i % 12) + 1).zfill(2) + '-15' if i % 18 != 0 else 'invalid'
for i in range(1, 103)]
})
print("Original customer data:")
print(f"Shape: {customer_data.shape}")
print(f"Missing values:\n{customer_data.isnull().sum()}")
print(f"Duplicates: {customer_data.duplicated().sum()}")
# Clean the data
print("\n" + "="*50)
print("Cleaning process:")
# 1. Remove duplicates
print("1. Removing duplicates...")
customer_data = customer_data.drop_duplicates(subset=['customer_id'], keep='first')
print(f" Rows remaining: {len(customer_data)}")
# 2. Fix email addresses
print("2. Cleaning email addresses...")
valid_email_mask = customer_data['email'].str.contains('@', na=False)
customer_data.loc[~valid_email_mask, 'email'] = np.nan
print(f" Invalid emails replaced: {(~valid_email_mask).sum()}")
# 3. Handle missing ages
print("3. Handling missing ages...")
median_age = customer_data['age'].median()
customer_data['age'] = customer_data['age'].fillna(median_age)
print(f" Filled with median: {median_age}")
# 4. Handle missing purchase amounts
print("4. Handling missing purchase amounts...")
mean_purchase = customer_data['purchase_amount'].mean()
customer_data['purchase_amount'] = customer_data['purchase_amount'].fillna(mean_purchase)
print(f" Filled with mean: {mean_purchase:.2f}")
# 5. Clean dates
print("5. Cleaning dates...")
customer_data['signup_date'] = pd.to_datetime(customer_data['signup_date'], errors='coerce')
# Fill invalid dates with most common date
if customer_data['signup_date'].mode().shape[0] > 0:
    mode_date = customer_data['signup_date'].mode()[0]
    customer_data['signup_date'] = customer_data['signup_date'].fillna(mode_date)
# 6. Remove outliers in purchase amount
print("6. Handling outliers...")
Q1 = customer_data['purchase_amount'].quantile(0.25)
Q3 = customer_data['purchase_amount'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
before_count = len(customer_data)
customer_data = customer_data[
(customer_data['purchase_amount'] >= lower_bound) &
(customer_data['purchase_amount'] <= upper_bound)
]
print(f" Removed {before_count - len(customer_data)} outliers")
print("\n" + "="*50)
print("Final cleaned data:")
print(f"Shape: {customer_data.shape}")
print(f"Missing values:\n{customer_data.isnull().sum()}")
print(f"\nSample data:")
print(customer_data.head())
Data Quality Report
def generate_quality_report(df):
    """Generate comprehensive data quality report."""
    report = {
        'total_rows': len(df),
        'total_columns': len(df.columns),
        'missing_values': df.isnull().sum().to_dict(),
        'missing_percentage': (df.isnull().sum() / len(df) * 100).to_dict(),
        'duplicates': df.duplicated().sum(),
        'data_types': df.dtypes.to_dict()
    }
    print("DATA QUALITY REPORT")
    print("=" * 50)
    print(f"Total Rows: {report['total_rows']}")
    print(f"Total Columns: {report['total_columns']}")
    print(f"Duplicate Rows: {report['duplicates']}")
    print("\nMissing Values:")
    for col, count in report['missing_values'].items():
        if count > 0:
            pct = report['missing_percentage'][col]
            print(f"  {col}: {count} ({pct:.1f}%)")
    print("\nData Types:")
    for col, dtype in report['data_types'].items():
        print(f"  {col}: {dtype}")
    return report
# Generate report
report = generate_quality_report(customer_data)
Best Practices
1. Always Keep a Copy
# ALWAYS work on a copy
df_clean = df.copy()
2. Document Your Cleaning Steps
# Keep track of what you did
cleaning_log = []
# Remove duplicates
before = len(df)
df = df.drop_duplicates()
after = len(df)
cleaning_log.append(f"Removed {before - after} duplicates")
# Fill missing values
cleaning_log.append("Filled missing ages with median")
print("Cleaning log:")
for step in cleaning_log:
    print(f"- {step}")
3. Validate After Cleaning
# Check if cleaning worked
assert df['age'].isnull().sum() == 0, "Still have missing ages!"
assert df.duplicated().sum() == 0, "Still have duplicates!"
assert (df['purchase_amount'] >= 0).all(), "Negative purchases exist!"
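On larger projects it can help to collect every failed check instead of stopping at the first assert. A minimal sketch, assuming you only need a list of problems to print or log:
def validate(df):
    """Return a list of data quality problems instead of raising on the first one."""
    problems = []
    if df.isnull().sum().sum() > 0:
        problems.append('missing values remain')
    if df.duplicated().sum() > 0:
        problems.append('duplicate rows remain')
    if 'purchase_amount' in df.columns and (df['purchase_amount'] < 0).any():
        problems.append('negative purchase amounts')
    return problems
print(validate(df) or 'All checks passed')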
Your Data Cleaning Mastery
You've learned the essential data cleaning techniques:
- Missing data - Detect, remove, or fill
- Duplicates - Find and remove
- Data types - Convert and fix
- Text data - Standardize and clean
- Outliers - Detect and handle
- Dates - Parse and extract
- Categories - Standardize and map
Remember: Clean data is the foundation of every successful data science project.
Next Steps
- Practice with messy datasets from Kaggle
- Learn advanced imputation techniques
- Explore data validation frameworks
- Study pandas missing data documentation
Clean data isn't exciting, but it's essential. Master these techniques, and you'll save countless hours of frustration.