Data Cleaning Best Practices: Transform Raw Data to Insights
Master data cleaning best practices for reliable analysis. Learn systematic approaches, validation techniques, automation strategies, and quality assurance methods to transform messy data into trustworthy insights.

The Day My Analysis Was Completely Wrong
I presented my findings to the executive team. Beautiful charts. Confident conclusions. Then someone asked, "Why do we have customers aged 250?" I froze. One uncaught data error invalidated weeks of work.
That humbling experience taught me: data cleaning isn't just about fixing errors—it's about building systematic processes that catch problems before they cause disasters.
This guide shares best practices I've developed through years of data work, saving you from similar embarrassments and ensuring your analyses are actually trustworthy.
The Data Cleaning Philosophy
Three Core Principles
- Document Everything - Future you will thank present you
- Automate Repeated Tasks - Humans make mistakes
- Validate Continuously - Trust, but verify
Best Practice 1: Start with Exploration
Never clean blindly. Understand your data first.
import pandas as pd
import numpy as np

def explore_dataset(df, name="Dataset"):
    """Comprehensive data exploration"""
    print(f"{'='*60}")
    print(f" {name} Overview")
    print(f"{'='*60}\n")

    # Basic info
    print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"Memory: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB\n")

    # Data types
    print("Data Types:")
    print(df.dtypes.value_counts())
    print()

    # Missing values
    missing = df.isnull().sum()
    if missing.sum() > 0:
        print("Missing Values:")
        missing_pct = (missing / len(df) * 100).round(2)
        missing_df = pd.DataFrame({
            'Count': missing[missing > 0],
            'Percentage': missing_pct[missing > 0]
        }).sort_values('Count', ascending=False)
        print(missing_df)
        print()

    # Duplicates
    dupes = df.duplicated().sum()
    print(f"Duplicate rows: {dupes:,} ({dupes/len(df)*100:.2f}%)\n")

    # Numeric summary
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        print("Numeric Columns Summary:")
        print(df[numeric_cols].describe())
        print()

    # Categorical summary
    cat_cols = df.select_dtypes(include=['object', 'category']).columns
    if len(cat_cols) > 0:
        print("Categorical Columns:")
        for col in cat_cols[:5]:  # First 5
            unique = df[col].nunique()
            print(f"  {col}: {unique:,} unique values")
            if unique <= 10:
                print(f"    Values: {df[col].value_counts().to_dict()}")
        print()

    return {
        'shape': df.shape,
        'missing': missing.sum(),
        'duplicates': dupes,
        'numeric_cols': list(numeric_cols),
        'categorical_cols': list(cat_cols)
    }

# Usage
info = explore_dataset(df, "Customer Data")
Best Practice 2: Create a Data Cleaning Log
Track every change you make.
class DataCleaningLog:
    """Track all data cleaning operations"""

    def __init__(self):
        self.operations = []
        self.original_shape = None

    def log_operation(self, operation, before, after, details=""):
        """Log a cleaning operation"""
        entry = {
            'timestamp': pd.Timestamp.now(),
            'operation': operation,
            'rows_before': before[0],
            'cols_before': before[1],
            'rows_after': after[0],
            'cols_after': after[1],
            'rows_changed': before[0] - after[0],
            'cols_changed': before[1] - after[1],
            'details': details
        }
        self.operations.append(entry)

    def get_summary(self):
        """Get cleaning summary"""
        if not self.operations:
            return "No operations logged"
        log_df = pd.DataFrame(self.operations)
        summary = f"""
Data Cleaning Summary
{'='*60}
Total Operations: {len(self.operations)}
Rows Removed: {log_df['rows_changed'].sum():,}
Columns Removed: {log_df['cols_changed'].sum()}
Operations Performed:
"""
        for i, op in enumerate(self.operations, 1):
            summary += f"{i}. {op['operation']}: {op['details']}\n"
        return summary

    def save_log(self, filename='cleaning_log.csv'):
        """Save log to file"""
        if self.operations:
            pd.DataFrame(self.operations).to_csv(filename, index=False)

# Usage
log = DataCleaningLog()

# Track operations
before = df.shape
df = df.drop_duplicates()
after = df.shape

log.log_operation('Remove Duplicates', before, after,
                  f"Removed {before[0] - after[0]} duplicate rows")

print(log.get_summary())
log.save_log()
Best Practice 3: Validate Data Types
Ensure columns have correct types.
def fix_data_types(df, type_mapping):
    """Fix data types with error handling"""
    for col, dtype in type_mapping.items():
        if col not in df.columns:
            print(f"Warning: Column '{col}' not found")
            continue
        try:
            if dtype == 'datetime':
                df[col] = pd.to_datetime(df[col], errors='coerce')
            elif dtype == 'numeric':
                df[col] = pd.to_numeric(df[col], errors='coerce')
            elif dtype == 'category':
                df[col] = df[col].astype('category')
            elif dtype == 'int':
                # Convert to float first (handles NaN), then to int
                df[col] = pd.to_numeric(df[col], errors='coerce')
                df[col] = df[col].fillna(0).astype('int64')
            else:
                df[col] = df[col].astype(dtype)
            print(f"✓ Converted {col} to {dtype}")
        except Exception as e:
            print(f"✗ Failed to convert {col} to {dtype}: {e}")
    return df

# Usage
type_mapping = {
    'customer_id': 'int',
    'signup_date': 'datetime',
    'age': 'numeric',
    'category': 'category'
}
df = fix_data_types(df, type_mapping)
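If zero is a meaningful value in your data, filling missing integers with 0 (as the 'int' branch above does) can quietly hide gaps. A minimal alternative sketch using pandas' nullable integer dtype, assuming you would rather keep the missing values visible:

# Alternative: keep missing values visible with pandas' nullable 'Int64' dtype
df['customer_id'] = pd.to_numeric(df['customer_id'], errors='coerce').astype('Int64')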
Best Practice 4: Standardize Text Data
Consistent formatting prevents downstream issues.
def standardize_text_columns(df, text_cols):
    """Standardize text data"""
    for col in text_cols:
        if col not in df.columns:
            continue
        # Strip whitespace
        df[col] = df[col].str.strip()
        # Lowercase
        df[col] = df[col].str.lower()
        # Remove extra spaces
        df[col] = df[col].str.replace(r'\s+', ' ', regex=True)
        # Remove special characters (optional)
        # df[col] = df[col].str.replace(r'[^a-z0-9\s]', '', regex=True)
        print(f"Standardized: {col}")
    return df

# Usage
text_cols = ['name', 'email', 'city']
df = standardize_text_columns(df, text_cols)
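Standardizing case and whitespace won't merge genuinely different spellings of the same thing. A small follow-on sketch, assuming a hand-built (purely illustrative) mapping for the city column:

# Collapse known spelling variants to one canonical value
# (the mapping below is illustrative; build yours from value_counts())
city_map = {'nyc': 'new york', 'ny': 'new york', 'sf': 'san francisco'}
df['city'] = df['city'].replace(city_map)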
Best Practice 5: Use Validation Rules
Define business rules and validate against them.
class DataValidator:
    """Validate data against business rules"""

    def __init__(self):
        self.errors = []

    def validate_range(self, df, col, min_val, max_val):
        """Validate numeric range"""
        invalid = df[(df[col] < min_val) | (df[col] > max_val)]
        if len(invalid) > 0:
            self.errors.append({
                'rule': f'{col} must be between {min_val} and {max_val}',
                'violations': len(invalid),
                'examples': invalid[col].head(5).tolist()
            })
            return False
        return True

    def validate_not_null(self, df, cols):
        """Validate required columns"""
        for col in cols:
            null_count = df[col].isnull().sum()
            if null_count > 0:
                self.errors.append({
                    'rule': f'{col} cannot be null',
                    'violations': null_count,
                    'percentage': f'{null_count/len(df)*100:.2f}%'
                })
        return len([e for e in self.errors if 'null' in e['rule']]) == 0

    def validate_unique(self, df, col):
        """Validate uniqueness"""
        dupes = df[df.duplicated(subset=[col], keep=False)]
        if len(dupes) > 0:
            self.errors.append({
                'rule': f'{col} must be unique',
                'violations': len(dupes),
                'examples': dupes[col].head(5).tolist()
            })
            return False
        return True

    def validate_format(self, df, col, pattern, description):
        """Validate format using regex"""
        invalid = df[~df[col].str.match(pattern, na=False)]
        if len(invalid) > 0:
            self.errors.append({
                'rule': f'{col} must match format: {description}',
                'violations': len(invalid),
                'examples': invalid[col].head(5).tolist()
            })
            return False
        return True

    def get_report(self):
        """Generate validation report"""
        if not self.errors:
            return "✓ All validation rules passed!"
        report = f"✗ Found {len(self.errors)} validation issues:\n\n"
        for i, error in enumerate(self.errors, 1):
            report += f"{i}. {error['rule']}\n"
            report += f"   Violations: {error['violations']}\n"
            if 'examples' in error:
                report += f"   Examples: {error['examples']}\n"
            report += "\n"
        return report

# Usage
validator = DataValidator()

# Run validations
validator.validate_range(df, 'age', 0, 120)
validator.validate_not_null(df, ['customer_id', 'email'])
validator.validate_unique(df, 'customer_id')
validator.validate_format(df, 'email',
                          r'^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$',
                          'valid email')

print(validator.get_report())
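In an automated job you usually want validation failures to stop the run rather than just print a report. One hedged pattern, reusing the validator's error list from above:

# Fail fast: abort the pipeline if any business rule was violated
if validator.errors:
    raise ValueError(validator.get_report())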
Best Practice 6: Handle Outliers Systematically
Don't arbitrarily remove outliers—understand them first.
def analyze_outliers(df, col):
    """Comprehensive outlier analysis"""
    data = df[col].dropna()

    # Statistical measures
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]

    print(f"Outlier Analysis for '{col}'")
    print(f"{'='*60}")
    print(f"Total values: {len(data):,}")
    print(f"Outliers: {len(outliers):,} ({len(outliers)/len(data)*100:.2f}%)")
    print("\nBounds:")
    print(f"  Lower: {lower_bound:.2f}")
    print(f"  Upper: {upper_bound:.2f}")
    if len(outliers) > 0:
        print("\nOutlier range:")
        print(f"  Min: {outliers[col].min():.2f}")
        print(f"  Max: {outliers[col].max():.2f}")

    # Visualization
    import matplotlib.pyplot as plt
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    # Box plot
    axes[0].boxplot(data)
    axes[0].set_title(f'{col} - Box Plot')
    axes[0].set_ylabel('Value')

    # Histogram
    axes[1].hist(data, bins=50, edgecolor='black')
    axes[1].axvline(lower_bound, color='r', linestyle='--', label='Lower bound')
    axes[1].axvline(upper_bound, color='r', linestyle='--', label='Upper bound')
    axes[1].set_title(f'{col} - Distribution')
    axes[1].legend()

    plt.tight_layout()
    plt.show()

    return outliers

# Usage
outliers = analyze_outliers(df, 'income')

# Decide on treatment
# Option 1: Remove
# df = df[~df.index.isin(outliers.index)]

# Option 2: Cap
# df['income'] = df['income'].clip(lower_bound, upper_bound)

# Option 3: Log transform
# df['income_log'] = np.log1p(df['income'])
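Whichever treatment you pick, record it. A minimal sketch that caps income outliers and logs the change with the DataCleaningLog from Best Practice 2 (the IQR bounds are recomputed here because they were local to analyze_outliers):

# Cap income outliers at the IQR bounds and log the decision
Q1, Q3 = df['income'].quantile([0.25, 0.75])
iqr = Q3 - Q1
low, high = Q1 - 1.5 * iqr, Q3 + 1.5 * iqr

before = df.shape
n_capped = ((df['income'] < low) | (df['income'] > high)).sum()
df['income'] = df['income'].clip(low, high)
log.log_operation('Cap Outliers', before, df.shape,
                  f"Capped {n_capped} income values to [{low:.2f}, {high:.2f}]")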
Best Practice 7: Create Reusable Cleaning Functions
Build a personal library of cleaning utilities.
class DataCleaner:
    """Reusable data cleaning utilities"""

    @staticmethod
    def remove_constant_columns(df, threshold=0.95):
        """Remove columns with same value > threshold%"""
        constant_cols = []
        for col in df.columns:
            if df[col].dtype == 'object':
                most_common_pct = df[col].value_counts(normalize=True).iloc[0]
            else:
                most_common_pct = (df[col] == df[col].mode()[0]).mean()
            if most_common_pct > threshold:
                constant_cols.append(col)
        if constant_cols:
            print(f"Removing {len(constant_cols)} constant columns:")
            print(f"  {constant_cols}")
            df = df.drop(columns=constant_cols)
        return df

    @staticmethod
    def fix_mixed_types(df, col):
        """Fix columns with mixed data types"""
        # Convert everything to string first
        df[col] = df[col].astype(str)
        # Try to convert to numeric
        numeric_version = pd.to_numeric(df[col], errors='coerce')
        # If most values are numeric, use numeric
        if numeric_version.notna().sum() / len(df) > 0.8:
            df[col] = numeric_version
            print(f"Converted {col} to numeric")
        else:
            print(f"Kept {col} as string")
        return df

    @staticmethod
    def standardize_dates(df, date_cols):
        """Standardize date columns"""
        for col in date_cols:
            df[col] = pd.to_datetime(df[col], errors='coerce')
            print(f"Converted {col} to datetime")
        return df

    @staticmethod
    def remove_special_characters(df, text_cols):
        """Remove special characters from text"""
        for col in text_cols:
            df[col] = df[col].str.replace(r'[^\w\s]', '', regex=True)
        return df

# Usage
cleaner = DataCleaner()
df = cleaner.remove_constant_columns(df)
df = cleaner.fix_mixed_types(df, 'price')
df = cleaner.standardize_dates(df, ['signup_date', 'last_purchase'])
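If a column like price fails the 80% numeric check, the culprit is often currency symbols or thousands separators rather than genuinely mixed data. A hedged follow-up sketch: strip them and retry the conversion:

# Strip currency symbols and thousands separators, then retry the conversion
df['price'] = df['price'].astype(str).str.replace(r'[$,]', '', regex=True)
df = cleaner.fix_mixed_types(df, 'price')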
Best Practice 8: Automate with Pipelines
Make cleaning reproducible.
from sklearn.base import BaseEstimator, TransformerMixin

class CustomCleaner(BaseEstimator, TransformerMixin):
    """Custom transformer for sklearn pipeline"""

    def __init__(self, remove_outliers=True, standardize_text=True):
        self.remove_outliers = remove_outliers
        self.standardize_text = standardize_text
        self.outlier_bounds = {}

    def fit(self, X, y=None):
        """Learn parameters from training data"""
        # Calculate outlier bounds for numeric columns
        if self.remove_outliers:
            numeric_cols = X.select_dtypes(include=[np.number]).columns
            for col in numeric_cols:
                Q1 = X[col].quantile(0.25)
                Q3 = X[col].quantile(0.75)
                IQR = Q3 - Q1
                self.outlier_bounds[col] = {
                    'lower': Q1 - 1.5 * IQR,
                    'upper': Q3 + 1.5 * IQR
                }
        return self

    def transform(self, X):
        """Apply cleaning to data"""
        X = X.copy()

        # Cap outliers at the bounds learned during fit
        if self.remove_outliers:
            for col, bounds in self.outlier_bounds.items():
                X[col] = X[col].clip(bounds['lower'], bounds['upper'])

        # Standardize text
        if self.standardize_text:
            text_cols = X.select_dtypes(include=['object']).columns
            for col in text_cols:
                X[col] = X[col].str.strip().str.lower()

        return X

# Create pipeline
from sklearn.pipeline import Pipeline

cleaning_pipeline = Pipeline([
    ('cleaner', CustomCleaner()),
    # Add more transformers as needed
])

# Use pipeline
X_train_clean = cleaning_pipeline.fit_transform(X_train)
X_test_clean = cleaning_pipeline.transform(X_test)
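Fitting the cleaner on training data and reusing it elsewhere only works if you keep the fitted object around. A small sketch persisting the pipeline with joblib, assuming pickle-based serialization is acceptable in your environment:

import joblib

# Persist the fitted pipeline so new data gets exactly the same treatment
joblib.dump(cleaning_pipeline, 'cleaning_pipeline.joblib')

# Later, or in another process: reload and reuse
cleaning_pipeline = joblib.load('cleaning_pipeline.joblib')
X_new_clean = cleaning_pipeline.transform(X_test)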
Best Practice 9: Version Your Data
Track data changes like code.
import hashlib
import json
from datetime import datetime
from pathlib import Path

class DataVersionControl:
    """Simple data versioning"""

    def __init__(self, base_path='./data_versions'):
        self.base_path = Path(base_path)
        self.base_path.mkdir(exist_ok=True)
        self.manifest_file = self.base_path / 'manifest.json'
        self.manifest = self._load_manifest()

    def _load_manifest(self):
        """Load version manifest"""
        if self.manifest_file.exists():
            with open(self.manifest_file, 'r') as f:
                return json.load(f)
        return {'versions': []}

    def _save_manifest(self):
        """Save version manifest"""
        with open(self.manifest_file, 'w') as f:
            json.dump(self.manifest, f, indent=2)

    def _calculate_hash(self, df):
        """Calculate dataframe hash"""
        return hashlib.md5(pd.util.hash_pandas_object(df).values).hexdigest()

    def save_version(self, df, description=""):
        """Save new version of data"""
        version_id = datetime.now().strftime('%Y%m%d_%H%M%S')
        file_hash = self._calculate_hash(df)

        # Save data
        filename = f"data_{version_id}.parquet"
        filepath = self.base_path / filename
        df.to_parquet(filepath)

        # Update manifest
        version_info = {
            'version_id': version_id,
            'filename': filename,
            'hash': file_hash,
            'shape': df.shape,
            'description': description,
            'timestamp': datetime.now().isoformat()
        }
        self.manifest['versions'].append(version_info)
        self._save_manifest()

        print(f"Saved version: {version_id}")
        print(f"  Shape: {df.shape}")
        print(f"  Description: {description}")
        return version_id

    def load_version(self, version_id):
        """Load specific version"""
        version = next((v for v in self.manifest['versions']
                        if v['version_id'] == version_id), None)
        if not version:
            raise ValueError(f"Version {version_id} not found")
        filepath = self.base_path / version['filename']
        return pd.read_parquet(filepath)

    def list_versions(self):
        """List all versions"""
        print("Data Versions:")
        print("="*60)
        for v in self.manifest['versions']:
            print(f"Version: {v['version_id']}")
            print(f"  Shape: {v['shape']}")
            print(f"  Description: {v['description']}")
            print(f"  Timestamp: {v['timestamp']}")
            print()

# Usage
dvc = DataVersionControl()

# Save versions
dvc.save_version(df, "Raw data after initial load")

df_cleaned = clean_data(df)
dvc.save_version(df_cleaned, "After cleaning and deduplication")

# List versions
dvc.list_versions()

# Load specific version
df_v1 = dvc.load_version('20241028_140523')
Best Practice 10: Document Your Process
Create a data dictionary and cleaning documentation.
def generate_data_dictionary(df, output_file='data_dictionary.md'):
    """Generate markdown data dictionary"""
    dictionary = "# Data Dictionary\n\n"
    dictionary += f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n"
    dictionary += f"**Shape:** {df.shape[0]:,} rows × {df.shape[1]} columns\n\n"
    dictionary += "## Columns\n\n"

    for col in df.columns:
        dictionary += f"### `{col}`\n\n"
        dictionary += f"- **Type:** {df[col].dtype}\n"
        dictionary += f"- **Missing:** {df[col].isnull().sum()} ({df[col].isnull().mean()*100:.2f}%)\n"
        dictionary += f"- **Unique:** {df[col].nunique():,}\n"
        if df[col].dtype in ['int64', 'float64']:
            dictionary += f"- **Min:** {df[col].min()}\n"
            dictionary += f"- **Max:** {df[col].max()}\n"
            dictionary += f"- **Mean:** {df[col].mean():.2f}\n"
        elif df[col].dtype == 'object':
            top_values = df[col].value_counts().head(5)
            dictionary += "- **Top values:**\n"
            for val, count in top_values.items():
                dictionary += f"  - `{val}`: {count:,} ({count/len(df)*100:.1f}%)\n"
        dictionary += "\n"

    with open(output_file, 'w') as f:
        f.write(dictionary)

    print(f"Data dictionary saved to {output_file}")

# Usage
generate_data_dictionary(df)
Your Data Cleaning Checklist
Before Cleaning
- Explore data thoroughly
- Document original state
- Identify business rules
- Back up original data
During Cleaning
- Log all operations
- Validate after each step
- Check data types
- Handle missing values systematically (a sketch follows this checklist)
- Standardize text/dates
- Remove/handle outliers
- Remove duplicates
After Cleaning
- Run validation rules
- Generate data dictionary
- Save cleaned version
- Document process
- Create reproducible pipeline
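One checklist item, handling missing values systematically, never got its own code above. Here is a minimal sketch under simple assumptions (drop columns that are mostly empty, median-fill numerics, mode-fill everything else); adapt the threshold and fill rules to your business context:

def handle_missing(df, drop_threshold=0.5):
    """Drop very sparse columns, median-fill numerics, mode-fill the rest."""
    # Drop columns where more than drop_threshold of the values are missing
    sparse_cols = df.columns[df.isnull().mean() > drop_threshold]
    df = df.drop(columns=sparse_cols)

    # Fill remaining gaps column by column
    for col in df.columns:
        if df[col].isnull().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].median())
            else:
                df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df

df = handle_missing(df)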
Common Pitfalls to Avoid
- Cleaning without understanding - Always explore first
- Not documenting changes - Future you will struggle
- Arbitrary outlier removal - Understand business context
- Inconsistent handling - Apply same rules to train/test
- Not validating results - Always check your work
Your Data Quality Arsenal
You now have:
- Exploration - Understand before cleaning
- Logging - Track every change
- Validation - Business rules enforcement
- Automation - Reproducible pipelines
- Versioning - Track data evolution
- Documentation - Clear process records
Remember: Clean data is the foundation of trustworthy insights!
Next Steps
- Build your own cleaning utilities library
- Learn TensorFlow Data Validation
- Explore Great Expectations
- Study domain-specific cleaning needs
Data cleaning isn't glamorous, but it's what separates good analysts from great ones!
Found these practices helpful? Share with your team! Connect with me on Twitter or LinkedIn for more data quality insights.
Support My Work
If this guide helped you, I'd really appreciate your support! Creating comprehensive, free content like this takes significant time and effort, and your support helps me continue sharing knowledge and creating more helpful resources for developers.
☕ Buy me a coffee - Every contribution, big or small, means the world to me and keeps me motivated to create more content!
Cover image by Jennifer Burk on Unsplash