# Pandas DataFrame Manipulation: Complete Data Handling Guide for Analysis
Master pandas DataFrame operations with this comprehensive guide. Learn data selection, filtering, merging, grouping, and transformation techniques for efficient data analysis in Python.

## The Moment Pandas Clicked for Me
I spent hours trying to manipulate data with pure Python lists and dictionaries. Filtering, grouping, merging—everything required nested loops and conditional logic that made my brain hurt.
Then a colleague showed me pandas. "Try `df[df['age'] > 25]`," she said. Wait, what? That one line did what took me 20 lines of code? That's when I realized: pandas isn't just a library, it's a completely different way of thinking about data.
In this guide, I'll show you everything I've learned about DataFrame manipulation. From basic selection to complex transformations, you'll learn to handle data like a pro.
## What are DataFrames?
Think of a DataFrame as Excel, but programmable. It's a 2D table with rows and columns, where:
- Each column has a name
- Each row has an index
- Columns can have different data types
- You can filter, sort, group, and transform data with simple commands
```python
import pandas as pd
import numpy as np

# Create a simple DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 40, 28],
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
    'salary': [70000, 80000, 90000, 85000, 75000]
}
df = pd.DataFrame(data)
print(df)
```
Output:

```
      name  age      city  salary
0    Alice   25  New York   70000
1      Bob   30    London   80000
2  Charlie   35     Paris   90000
3    David   40     Tokyo   85000
4      Eve   28    Sydney   75000
```
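And the promise from the intro pays off immediately: filtering this table is a single readable line.

```python
# Rows where age is greater than 25 (the one-liner from the intro)
print(df[df['age'] > 25])
```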
## Creating DataFrames
There are several ways to create a DataFrame, depending on where your data comes from.
### From a Dictionary
```python
# Dictionary of lists
data = {
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [1200, 25, 75, 300],
    'stock': [15, 150, 80, 25]
}
df = pd.DataFrame(data)
print(df)

# Specify a custom index
df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])
print(df)
```
### From Lists
```python
# List of lists
data = [
    ['Alice', 25, 70000],
    ['Bob', 30, 80000],
    ['Charlie', 35, 90000]
]
df = pd.DataFrame(data, columns=['name', 'age', 'salary'])
print(df)

# List of dictionaries
data = [
    {'name': 'Alice', 'age': 25},
    {'name': 'Bob', 'age': 30},
    {'name': 'Charlie', 'age': 35}
]
df = pd.DataFrame(data)
print(df)
```
### From a NumPy Array
```python
# Random integer data
data = np.random.randint(0, 100, size=(5, 3))
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)
```
### From a CSV File
```python
# Read from CSV (the most common case)
df = pd.read_csv('data.csv')

# With specific options
df = pd.read_csv('data.csv',
                 sep=',',                   # Delimiter
                 header=0,                  # Row to use for column names
                 index_col=0,               # Column to use as the index
                 na_values=['NA', 'null'])  # Extra missing-value indicators
```
## Inspecting DataFrames
Before manipulating data, understand what you're working with.
```python
# Create a sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'age': [25, 30, 35, 40, 28, 33],
    'city': ['NY', 'London', 'Paris', 'Tokyo', 'Sydney', 'Berlin'],
    'salary': [70000, 80000, 90000, 85000, 75000, 82000]
})

# First few rows
print(df.head())   # Default: 5 rows
print(df.head(3))  # First 3 rows

# Last few rows
print(df.tail())

# Shape (rows, columns)
print(f"Shape: {df.shape}")

# Column names
print(f"Columns: {df.columns.tolist()}")

# Data types
print(df.dtypes)

# Statistical summary
print(df.describe())

# General info (note: info() prints directly and returns None,
# so don't wrap it in print())
df.info()
```
## Selecting Data
The most fundamental DataFrame operation.
### Selecting Columns
```python
# Single column (returns a Series)
ages = df['age']
print(type(ages))  # pandas.Series

# Single column (returns a DataFrame)
ages_df = df[['age']]
print(type(ages_df))  # pandas.DataFrame

# Multiple columns
subset = df[['name', 'salary']]
print(subset)

# All columns except some
cols_to_drop = ['city']
subset = df[[col for col in df.columns if col not in cols_to_drop]]
```
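For the "everything except" case, `drop(columns=...)` is a tidier equivalent that returns a new DataFrame without modifying the original:

```python
# Same result as the list comprehension above
subset = df.drop(columns=['city'])
```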
### Selecting Rows
```python
# By integer position (iloc)
first_row = df.iloc[0]      # First row
print(first_row)

first_three = df.iloc[0:3]  # First 3 rows
print(first_three)

# By label (loc)
df_indexed = df.set_index('name')
alice = df_indexed.loc['Alice']
print(alice)

# Multiple rows by label
subset = df_indexed.loc[['Alice', 'Charlie']]
print(subset)
```
### Selecting Rows and Columns
```python
# iloc: position-based
value = df.iloc[0, 1]  # Row 0, column 1
print(f"Value: {value}")

subset = df.iloc[0:3, 0:2]  # First 3 rows, first 2 columns
print(subset)

# loc: label-based (unlike iloc, loc slices include BOTH endpoints)
df_indexed = df.set_index('name')
value = df_indexed.loc['Alice', 'age']
print(f"Alice's age: {value}")

subset = df_indexed.loc['Alice':'Charlie', 'age':'city']
print(subset)
```
## Filtering Data
Boolean indexing for powerful data filtering.
### Basic Filtering
```python
# Single condition
young_people = df[df['age'] < 30]
print("People under 30:")
print(young_people)

# Multiple conditions (AND)
filtered = df[(df['age'] > 25) & (df['salary'] > 75000)]
print("\nAge > 25 AND salary > 75000:")
print(filtered)

# Multiple conditions (OR)
filtered = df[(df['city'] == 'NY') | (df['city'] == 'London')]
print("\nFrom NY OR London:")
print(filtered)
```
### Advanced Filtering
```python
# String contains
df_with_names = df.copy()
filtered = df_with_names[df_with_names['name'].str.contains('a', case=False)]
print("Names containing 'a':")
print(filtered)

# String startswith
filtered = df[df['name'].str.startswith('A')]
print("\nNames starting with 'A':")
print(filtered)

# Values in a list
cities_of_interest = ['NY', 'London', 'Tokyo']
filtered = df[df['city'].isin(cities_of_interest)]
print("\nFrom specific cities:")
print(filtered)

# NOT in a list
filtered = df[~df['city'].isin(cities_of_interest)]
print("\nNOT from specific cities:")
print(filtered)

# Between values
filtered = df[df['age'].between(28, 35, inclusive='both')]
print("\nAge between 28 and 35:")
print(filtered)
```
## Adding and Modifying Columns
### Adding New Columns
```python
# From a calculation
df['annual_salary'] = df['salary'] * 12

# From a condition
df['senior'] = df['age'] >= 35

# Using apply with a named function
def categorize_salary(salary):
    if salary < 75000:
        return 'Low'
    elif salary < 85000:
        return 'Medium'
    else:
        return 'High'

df['salary_category'] = df['salary'].apply(categorize_salary)

# Using a lambda
df['age_group'] = df['age'].apply(lambda x: 'Young' if x < 30 else 'Senior')

# From multiple columns
df['name_city'] = df['name'] + ' - ' + df['city']
print(df)
```
### Modifying Existing Columns
```python
# Update all values
df['salary'] = df['salary'] * 1.1  # 10% raise

# Update based on a condition
df.loc[df['age'] > 35, 'salary'] = df['salary'] * 1.2

# Update specific values
df.loc[df['name'] == 'Alice', 'city'] = 'Boston'

# Replace values
df['city'] = df['city'].replace({'NY': 'New York'})
```
## Dropping Data
### Dropping Columns
```python
# Drop a single column
df_dropped = df.drop('salary', axis=1)  # axis=1 means columns

# Drop multiple columns
df_dropped = df.drop(['salary', 'city'], axis=1)

# Drop in place (modifies the original)
df.drop('annual_salary', axis=1, inplace=True)
```
### Dropping Rows
```python
# Drop by index label
df_dropped = df.drop(0)  # Drop the row with index 0

# Drop by condition (keep the rows you want)
df_filtered = df[df['age'] >= 30]  # Keep only age >= 30

# Drop duplicate rows
df_unique = df.drop_duplicates()

# Drop duplicates based on specific columns
df_unique = df.drop_duplicates(subset=['city'], keep='first')
```
## Sorting Data
```python
# Sort by a single column
df_sorted = df.sort_values('age')
print("Sorted by age (ascending):")
print(df_sorted)

# Sort descending
df_sorted = df.sort_values('salary', ascending=False)
print("\nSorted by salary (descending):")
print(df_sorted)

# Sort by multiple columns
df_sorted = df.sort_values(['city', 'age'])
print("\nSorted by city, then age:")
print(df_sorted)

# Mix sort directions across columns
df_sorted = df.sort_values(['city', 'age'], ascending=[True, False])

# Sort by index
df_sorted = df.sort_index()
```
## GroupBy Operations
One of pandas' most powerful features.
### Basic GroupBy
```python
# Create sample data
df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'IT', 'IT', 'HR', 'HR'],
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'salary': [70000, 80000, 90000, 85000, 75000, 72000],
    'experience': [3, 5, 7, 6, 4, 3]
})

# Group and aggregate
dept_salaries = df.groupby('department')['salary'].mean()
print("Average salary by department:")
print(dept_salaries)

# Multiple aggregations
dept_stats = df.groupby('department')['salary'].agg(['mean', 'sum', 'count'])
print("\nDepartment statistics:")
print(dept_stats)

# Group by multiple columns
grouped = df.groupby(['department', 'experience'])['salary'].sum()
print("\nGrouped by department and experience:")
print(grouped)
```
### Advanced GroupBy
```python
# Multiple aggregations on multiple columns
stats = df.groupby('department').agg({
    'salary': ['mean', 'max', 'min'],
    'experience': ['mean', 'sum']
})
print("Multiple aggregations:")
print(stats)

# Custom aggregation functions
def salary_range(x):
    return x.max() - x.min()

custom_agg = df.groupby('department')['salary'].agg([
    ('avg', 'mean'),
    ('total', 'sum'),
    ('range', salary_range)
])
print("\nCustom aggregations:")
print(custom_agg)

# Transform (broadcast results back to the original shape)
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')
print("\nWith department average:")
print(df)
```
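Pandas also supports named aggregation (available since pandas 0.25), which produces flat, readable column names instead of a MultiIndex. Here's a small sketch using the same `df`:

```python
# keyword = (column, aggregation function)
stats = df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    max_salary=('salary', 'max'),
    avg_experience=('experience', 'mean'),
)
print(stats)
```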
## Merging and Joining DataFrames
Combine data from multiple sources.
### Merge (SQL-style Joins)
```python
# Create sample DataFrames
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'dept_id': [10, 20, 10, 30]
})
departments = pd.DataFrame({
    'dept_id': [10, 20, 30],
    'dept_name': ['Sales', 'IT', 'HR']
})

# Inner join (the default)
merged = pd.merge(employees, departments, on='dept_id')
print("Inner join:")
print(merged)

# Left join
merged = pd.merge(employees, departments, on='dept_id', how='left')
print("\nLeft join:")
print(merged)

# Merge on different column names
employees2 = employees.rename(columns={'dept_id': 'department'})
merged = pd.merge(employees2, departments,
                  left_on='department', right_on='dept_id')
print("\nMerge on different columns:")
print(merged)
```
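When a merge doesn't produce the rows you expect, `merge()` accepts `indicator=True`, which adds a `_merge` column showing where each row came from:

```python
# _merge is 'both', 'left_only', or 'right_only' for each row
merged = pd.merge(employees, departments, on='dept_id',
                  how='outer', indicator=True)
print(merged[['name', 'dept_name', '_merge']])
# In this sample every dept_id matches, so all rows show 'both';
# unmatched rows would show 'left_only' or 'right_only'
```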
### Concatenate
```python
# Concatenate vertically (stack rows)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = pd.concat([df1, df2], ignore_index=True)
print("Vertical concatenation:")
print(result)

# Concatenate horizontally (side by side)
result = pd.concat([df1, df2], axis=1)
print("\nHorizontal concatenation:")
print(result)
```
## Pivot Tables
Reshape data for analysis.
```python
# Sample sales data
sales = pd.DataFrame({
    'date': ['2024-01', '2024-01', '2024-02', '2024-02', '2024-03', '2024-03'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'region': ['East', 'East', 'West', 'West', 'East', 'East'],
    'sales': [100, 150, 120, 180, 110, 160]
})

# Create a pivot table
pivot = sales.pivot_table(
    values='sales',
    index='product',
    columns='date',
    aggfunc='sum',
    fill_value=0
)
print("Pivot table:")
print(pivot)

# Multiple aggregations
pivot = sales.pivot_table(
    values='sales',
    index='product',
    columns='region',
    aggfunc=['sum', 'mean']
)
print("\nMultiple aggregations:")
print(pivot)
```
## Handling Missing Data
```python
# Create data with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})
print("Data with missing values:")
print(df)

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# Drop rows with any missing values
df_dropped = df.dropna()
print("\nAfter dropping NaN rows:")
print(df_dropped)

# Drop columns with any missing values
df_dropped = df.dropna(axis=1)
print("\nAfter dropping NaN columns:")
print(df_dropped)

# Fill missing values with a constant
df_filled = df.fillna(0)
print("\nFilled with 0:")
print(df_filled)

# Fill with each column's mean
df_filled = df.fillna(df.mean())
print("\nFilled with column mean:")
print(df_filled)

# Forward fill (fillna(method='ffill') is deprecated; use ffill())
df_filled = df.ffill()
print("\nForward filled:")
print(df_filled)
```
## Apply and Map Functions
Apply custom logic to DataFrames.
### Apply
```python
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40]
})

# Apply to a column
df['A_squared'] = df['A'].apply(lambda x: x ** 2)

# Apply across columns (row-wise with axis=1)
df['sum'] = df.apply(lambda row: row['A'] + row['B'], axis=1)

# Apply with a custom function
def categorize(value):
    if value < 15:
        return 'Low'
    elif value < 35:
        return 'Medium'
    else:
        return 'High'

df['B_category'] = df['B'].apply(categorize)
print(df)
```
### Map
```python
# Dictionary mapping
df = pd.DataFrame({
    'size': ['S', 'M', 'L', 'XL']
})
size_map = {'S': 'Small', 'M': 'Medium', 'L': 'Large', 'XL': 'Extra Large'}
df['size_name'] = df['size'].map(size_map)
print(df)
```
## Practical Example: Sales Analysis
```python
# Sample sales dataset
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=100)
sales_data = pd.DataFrame({
    'date': dates,
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet'], 100),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'quantity': np.random.randint(1, 20, 100),
    'price': np.random.randint(100, 1000, 100)
})

# Calculate revenue
sales_data['revenue'] = sales_data['quantity'] * sales_data['price']
print("Sales data preview:")
print(sales_data.head())

# Total revenue by product
product_revenue = sales_data.groupby('product')['revenue'].sum().sort_values(ascending=False)
print("\nRevenue by product:")
print(product_revenue)

# Average price by region
region_prices = sales_data.groupby('region')['price'].mean()
print("\nAverage price by region:")
print(region_prices)

# Top-selling product per region
top_products = sales_data.groupby(['region', 'product'])['quantity'].sum()
top_products = top_products.groupby('region').nlargest(1)
print("\nTop product per region:")
print(top_products)

# Monthly revenue trend
sales_data['month'] = sales_data['date'].dt.to_period('M')
monthly_revenue = sales_data.groupby('month')['revenue'].sum()
print("\nMonthly revenue:")
print(monthly_revenue)
```
## Performance Tips
### Use Vectorized Operations
```python
# BAD: looping over rows in Python
result = []
for value in df['column']:
    result.append(value * 2)
df['new_column'] = result

# GOOD: vectorized
df['new_column'] = df['column'] * 2
```
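The same principle extends to conditionals: `numpy.where` is a vectorized alternative to the row-wise `apply` used earlier (reusing the `age` column from this guide's running example):

```python
import numpy as np

# Equivalent to df['age'].apply(lambda x: 'Young' if x < 30 else 'Senior'),
# but computed in a single vectorized pass
df['age_group'] = np.where(df['age'] < 30, 'Young', 'Senior')
```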
### Use Query for Complex Filtering
```python
# Instead of:
filtered = df[(df['age'] > 25) & (df['salary'] > 70000)]

# Use query:
filtered = df.query('age > 25 and salary > 70000')
```
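`query()` can also reference local Python variables with the `@` prefix, which keeps thresholds out of the string literal:

```python
min_age = 25
min_salary = 70000

# @name looks up the variable in the surrounding Python scope
filtered = df.query('age > @min_age and salary > @min_salary')
```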
### Use the Categorical Data Type
```python
# For columns with repeated values
df['city'] = df['city'].astype('category')
# Saves memory and speeds up operations
```
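To see the memory savings for yourself, here's a minimal sketch (the column contents are made up for illustration) comparing a column stored as object strings versus `category`:

```python
import pandas as pd

# A column with lots of repeated string values
cities = pd.Series(['New York', 'London', 'Paris'] * 100_000)

as_object = cities.memory_usage(deep=True)
as_category = cities.astype('category').memory_usage(deep=True)

print(f"object dtype:   {as_object:,} bytes")
print(f"category dtype: {as_category:,} bytes")  # typically far smaller
```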
## Common Mistakes to Avoid
### Mistake 1: Chained Assignment
```python
# WRONG: chained indexing assigns to a temporary copy, so the original
# df may be left unchanged (pandas raises SettingWithCopyWarning)
df[df['age'] > 30]['salary'] = df['salary'] * 1.1

# RIGHT: a single loc call both selects and assigns
df.loc[df['age'] > 30, 'salary'] = df['salary'] * 1.1
```
### Mistake 2: Forgetting That Methods Return a Copy
```python
# Doesn't modify the original
df.drop('column', axis=1)

# Reassign the result
df = df.drop('column', axis=1)
# OR use inplace
df.drop('column', axis=1, inplace=True)
```
### Mistake 3: Index Issues After Filtering
```python
df_filtered = df[df['age'] > 30]
# The index keeps its original labels, e.g. [0, 2, 4, ...]

# Reset the index
df_filtered = df_filtered.reset_index(drop=True)
# Now the index is [0, 1, 2, ...]
```
## Your DataFrame Mastery Journey
You've now learned the essential DataFrame operations:
- **Creation**: from dictionaries, lists, NumPy arrays, and CSV files
- **Selection**: rows, columns, and conditions
- **Filtering**: boolean indexing
- **Transformation**: apply, map, and groupby
- **Merging**: combining multiple DataFrames
- **Aggregation**: statistical analysis
These skills form the foundation of every data analysis project.
## Next Steps
- Practice with real datasets from Kaggle
- Learn time series analysis with pandas
- Master data cleaning techniques
- Explore visualization with matplotlib/seaborn
- Study the pandas documentation
Remember: The best way to learn pandas is by doing. Load a dataset and start exploring!
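If you want a concrete starting point, here's a minimal first-look script; the filename is a placeholder, so swap in any CSV you've downloaded:

```python
import pandas as pd

df = pd.read_csv('your_dataset.csv')  # placeholder path

print(df.shape)           # How much data is there?
print(df.dtypes)          # What type is each column?
print(df.head())          # What do the first rows look like?
print(df.describe())      # Quick statistical summary
print(df.isnull().sum())  # Where is data missing?
```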
Found this guide valuable? Share it with your data science community! Connect with me on Twitter or LinkedIn to discuss pandas techniques.
## Support My Work
If this guide helped you with this topic, I'd really appreciate your support! Creating comprehensive, free content like this takes significant time and effort. Your support helps me continue sharing knowledge and creating more helpful resources for aspiring data scientists and engineers.
☕ Buy me a coffee - Every contribution, big or small, means the world to me and keeps me motivated to create more content!
Cover image by Luke Chesser on Unsplash