# Pandas DataFrame Manipulation: Complete Data Handling Guide for Analysis
Master pandas DataFrame operations with this comprehensive guide. Learn data selection, filtering, merging, grouping, and transformation techniques for efficient data analysis in Python.

## The Moment Pandas Clicked for Me
I spent hours trying to manipulate data with pure Python lists and dictionaries. Filtering, grouping, merging—everything required nested loops and conditional logic that made my brain hurt.
Then a colleague showed me pandas. "Try `df[df['age'] > 25]`," she said. Wait, what? That one line did what took me 20 lines of code? That's when I realized: pandas isn't just a library, it's a completely different way of thinking about data.
In this guide, I'll show you everything I've learned about DataFrame manipulation. From basic selection to complex transformations, you'll learn to handle data like a pro.
## What are DataFrames?
Think of a DataFrame as Excel, but programmable. It's a 2D table with rows and columns, where:
- Each column has a name
- Each row has an index
- Columns can have different data types
- You can filter, sort, group, and transform data with simple commands
```python
import pandas as pd
import numpy as np

# Create a simple DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 40, 28],
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
    'salary': [70000, 80000, 90000, 85000, 75000]
}
df = pd.DataFrame(data)
print(df)
```
Output:

```
      name  age      city  salary
0    Alice   25  New York   70000
1      Bob   30    London   80000
2  Charlie   35     Paris   90000
3    David   40     Tokyo   85000
4      Eve   28    Sydney   75000
```
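And the promise from the intro pays off immediately: filtering this table is a single readable line.

```python
# Rows where age is greater than 25 (the one-liner from the intro)
print(df[df['age'] > 25])
```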
## Creating DataFrames
There are several ways to create a DataFrame, depending on where your data comes from.
### From a Dictionary
```python
# Dictionary of lists
data = {
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [1200, 25, 75, 300],
    'stock': [15, 150, 80, 25]
}
df = pd.DataFrame(data)
print(df)

# Specify a custom index
df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])
print(df)
```
### From Lists
```python
# List of lists
data = [
    ['Alice', 25, 70000],
    ['Bob', 30, 80000],
    ['Charlie', 35, 90000]
]
df = pd.DataFrame(data, columns=['name', 'age', 'salary'])
print(df)

# List of dictionaries
data = [
    {'name': 'Alice', 'age': 25},
    {'name': 'Bob', 'age': 30},
    {'name': 'Charlie', 'age': 35}
]
df = pd.DataFrame(data)
print(df)
```
### From a NumPy Array
```python
# Random integer data
data = np.random.randint(0, 100, size=(5, 3))
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)
```
### From a CSV File
```python
# Read from CSV (the most common case)
df = pd.read_csv('data.csv')

# With specific options
df = pd.read_csv('data.csv',
                 sep=',',                   # Delimiter
                 header=0,                  # Row to use for column names
                 index_col=0,               # Column to use as the index
                 na_values=['NA', 'null'])  # Extra missing-value indicators
```
## Inspecting DataFrames
Before manipulating data, understand what you're working with.
```python
# Create a sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'age': [25, 30, 35, 40, 28, 33],
    'city': ['NY', 'London', 'Paris', 'Tokyo', 'Sydney', 'Berlin'],
    'salary': [70000, 80000, 90000, 85000, 75000, 82000]
})

# First few rows
print(df.head())   # Default: 5 rows
print(df.head(3))  # First 3 rows

# Last few rows
print(df.tail())

# Shape (rows, columns)
print(f"Shape: {df.shape}")

# Column names
print(f"Columns: {df.columns.tolist()}")

# Data types
print(df.dtypes)

# Statistical summary
print(df.describe())

# General info (note: info() prints directly and returns None,
# so don't wrap it in print())
df.info()
```
## Selecting Data
The most fundamental DataFrame operation.
### Selecting Columns
```python
# Single column (returns a Series)
ages = df['age']
print(type(ages))  # pandas.Series

# Single column (returns a DataFrame)
ages_df = df[['age']]
print(type(ages_df))  # pandas.DataFrame

# Multiple columns
subset = df[['name', 'salary']]
print(subset)

# All columns except some
cols_to_drop = ['city']
subset = df[[col for col in df.columns if col not in cols_to_drop]]
```
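For the "everything except" case, `drop(columns=...)` is a tidier equivalent that returns a new DataFrame without modifying the original:

```python
# Same result as the list comprehension above
subset = df.drop(columns=['city'])
```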
### Selecting Rows
```python
# By integer position (iloc)
first_row = df.iloc[0]      # First row
print(first_row)

first_three = df.iloc[0:3]  # First 3 rows
print(first_three)

# By label (loc)
df_indexed = df.set_index('name')
alice = df_indexed.loc['Alice']
print(alice)

# Multiple rows by label
subset = df_indexed.loc[['Alice', 'Charlie']]
print(subset)
```
### Selecting Rows and Columns
```python
# iloc: position-based
value = df.iloc[0, 1]  # Row 0, column 1
print(f"Value: {value}")

subset = df.iloc[0:3, 0:2]  # First 3 rows, first 2 columns
print(subset)

# loc: label-based (unlike iloc, loc slices include BOTH endpoints)
df_indexed = df.set_index('name')
value = df_indexed.loc['Alice', 'age']
print(f"Alice's age: {value}")

subset = df_indexed.loc['Alice':'Charlie', 'age':'city']
print(subset)
```
## Filtering Data
Boolean indexing for powerful data filtering.
### Basic Filtering
```python
# Single condition
young_people = df[df['age'] < 30]
print("People under 30:")
print(young_people)

# Multiple conditions (AND)
filtered = df[(df['age'] > 25) & (df['salary'] > 75000)]
print("\nAge > 25 AND salary > 75000:")
print(filtered)

# Multiple conditions (OR)
filtered = df[(df['city'] == 'NY') | (df['city'] == 'London')]
print("\nFrom NY OR London:")
print(filtered)
```
### Advanced Filtering
```python
# String contains
df_with_names = df.copy()
filtered = df_with_names[df_with_names['name'].str.contains('a', case=False)]
print("Names containing 'a':")
print(filtered)

# String startswith
filtered = df[df['name'].str.startswith('A')]
print("\nNames starting with 'A':")
print(filtered)

# Values in a list
cities_of_interest = ['NY', 'London', 'Tokyo']
filtered = df[df['city'].isin(cities_of_interest)]
print("\nFrom specific cities:")
print(filtered)

# NOT in a list
filtered = df[~df['city'].isin(cities_of_interest)]
print("\nNOT from specific cities:")
print(filtered)

# Between values
filtered = df[df['age'].between(28, 35, inclusive='both')]
print("\nAge between 28 and 35:")
print(filtered)
```
## Adding and Modifying Columns
### Adding New Columns
```python
# From a calculation
df['annual_salary'] = df['salary'] * 12

# From a condition
df['senior'] = df['age'] >= 35

# Using apply with a named function
def categorize_salary(salary):
    if salary < 75000:
        return 'Low'
    elif salary < 85000:
        return 'Medium'
    else:
        return 'High'

df['salary_category'] = df['salary'].apply(categorize_salary)

# Using a lambda
df['age_group'] = df['age'].apply(lambda x: 'Young' if x < 30 else 'Senior')

# From multiple columns
df['name_city'] = df['name'] + ' - ' + df['city']
print(df)
```
### Modifying Existing Columns
```python
# Update all values
df['salary'] = df['salary'] * 1.1  # 10% raise

# Update based on a condition
df.loc[df['age'] > 35, 'salary'] = df['salary'] * 1.2

# Update specific values
df.loc[df['name'] == 'Alice', 'city'] = 'Boston'

# Replace values
df['city'] = df['city'].replace({'NY': 'New York'})
```
## Dropping Data
### Dropping Columns
```python
# Drop a single column
df_dropped = df.drop('salary', axis=1)  # axis=1 means columns

# Drop multiple columns
df_dropped = df.drop(['salary', 'city'], axis=1)

# Drop in place (modifies the original)
df.drop('annual_salary', axis=1, inplace=True)
```
### Dropping Rows
```python
# Drop by index label
df_dropped = df.drop(0)  # Drop the row with index 0

# Drop by condition (keep the rows you want)
df_filtered = df[df['age'] >= 30]  # Keep only age >= 30

# Drop duplicate rows
df_unique = df.drop_duplicates()

# Drop duplicates based on specific columns
df_unique = df.drop_duplicates(subset=['city'], keep='first')
```
## Sorting Data
```python
# Sort by a single column
df_sorted = df.sort_values('age')
print("Sorted by age (ascending):")
print(df_sorted)

# Sort descending
df_sorted = df.sort_values('salary', ascending=False)
print("\nSorted by salary (descending):")
print(df_sorted)

# Sort by multiple columns
df_sorted = df.sort_values(['city', 'age'])
print("\nSorted by city, then age:")
print(df_sorted)

# Mix sort directions across columns
df_sorted = df.sort_values(['city', 'age'], ascending=[True, False])

# Sort by index
df_sorted = df.sort_index()
```
## GroupBy Operations
One of pandas' most powerful features.
### Basic GroupBy
```python
# Create sample data
df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'IT', 'IT', 'HR', 'HR'],
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'salary': [70000, 80000, 90000, 85000, 75000, 72000],
    'experience': [3, 5, 7, 6, 4, 3]
})

# Group and aggregate
dept_salaries = df.groupby('department')['salary'].mean()
print("Average salary by department:")
print(dept_salaries)

# Multiple aggregations
dept_stats = df.groupby('department')['salary'].agg(['mean', 'sum', 'count'])
print("\nDepartment statistics:")
print(dept_stats)

# Group by multiple columns
grouped = df.groupby(['department', 'experience'])['salary'].sum()
print("\nGrouped by department and experience:")
print(grouped)
```
### Advanced GroupBy
```python
# Multiple aggregations on multiple columns
stats = df.groupby('department').agg({
    'salary': ['mean', 'max', 'min'],
    'experience': ['mean', 'sum']
})
print("Multiple aggregations:")
print(stats)

# Custom aggregation functions
def salary_range(x):
    return x.max() - x.min()

custom_agg = df.groupby('department')['salary'].agg([
    ('avg', 'mean'),
    ('total', 'sum'),
    ('range', salary_range)
])
print("\nCustom aggregations:")
print(custom_agg)

# Transform (broadcast results back to the original shape)
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')
print("\nWith department average:")
print(df)
```
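Pandas also supports named aggregation (available since pandas 0.25), which produces flat, readable column names instead of a MultiIndex. Here's a small sketch using the same `df`:

```python
# keyword = (column, aggregation function)
stats = df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    max_salary=('salary', 'max'),
    avg_experience=('experience', 'mean'),
)
print(stats)
```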
## Merging and Joining DataFrames
Combine data from multiple sources.
### Merge (SQL-style Joins)
```python
# Create sample DataFrames
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'dept_id': [10, 20, 10, 30]
})
departments = pd.DataFrame({
    'dept_id': [10, 20, 30],
    'dept_name': ['Sales', 'IT', 'HR']
})

# Inner join (the default)
merged = pd.merge(employees, departments, on='dept_id')
print("Inner join:")
print(merged)

# Left join
merged = pd.merge(employees, departments, on='dept_id', how='left')
print("\nLeft join:")
print(merged)

# Merge on different column names
employees2 = employees.rename(columns={'dept_id': 'department'})
merged = pd.merge(employees2, departments,
                  left_on='department', right_on='dept_id')
print("\nMerge on different columns:")
print(merged)
```
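When a merge doesn't produce the rows you expect, `merge()` accepts `indicator=True`, which adds a `_merge` column showing where each row came from:

```python
# _merge is 'both', 'left_only', or 'right_only' for each row
merged = pd.merge(employees, departments, on='dept_id',
                  how='outer', indicator=True)
print(merged[['name', 'dept_name', '_merge']])
# In this sample every dept_id matches, so all rows show 'both';
# unmatched rows would show 'left_only' or 'right_only'
```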
### Concatenate
```python
# Concatenate vertically (stack rows)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = pd.concat([df1, df2], ignore_index=True)
print("Vertical concatenation:")
print(result)

# Concatenate horizontally (side by side)
result = pd.concat([df1, df2], axis=1)
print("\nHorizontal concatenation:")
print(result)
```
## Pivot Tables
Reshape data for analysis.
```python
# Sample sales data
sales = pd.DataFrame({
    'date': ['2024-01', '2024-01', '2024-02', '2024-02', '2024-03', '2024-03'],
    'product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'region': ['East', 'East', 'West', 'West', 'East', 'East'],
    'sales': [100, 150, 120, 180, 110, 160]
})

# Create a pivot table
pivot = sales.pivot_table(
    values='sales',
    index='product',
    columns='date',
    aggfunc='sum',
    fill_value=0
)
print("Pivot table:")
print(pivot)

# Multiple aggregations
pivot = sales.pivot_table(
    values='sales',
    index='product',
    columns='region',
    aggfunc=['sum', 'mean']
)
print("\nMultiple aggregations:")
print(pivot)
```
## Handling Missing Data
```python
# Create data with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})
print("Data with missing values:")
print(df)

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# Drop rows with any missing values
df_dropped = df.dropna()
print("\nAfter dropping NaN rows:")
print(df_dropped)

# Drop columns with any missing values
df_dropped = df.dropna(axis=1)
print("\nAfter dropping NaN columns:")
print(df_dropped)

# Fill missing values with a constant
df_filled = df.fillna(0)
print("\nFilled with 0:")
print(df_filled)

# Fill with each column's mean
df_filled = df.fillna(df.mean())
print("\nFilled with column mean:")
print(df_filled)

# Forward fill (fillna(method='ffill') is deprecated; use ffill())
df_filled = df.ffill()
print("\nForward filled:")
print(df_filled)
```
## Apply and Map Functions
Apply custom logic to DataFrames.
### Apply
```python
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40]
})

# Apply to a column
df['A_squared'] = df['A'].apply(lambda x: x ** 2)

# Apply across columns (row-wise with axis=1)
df['sum'] = df.apply(lambda row: row['A'] + row['B'], axis=1)

# Apply with a custom function
def categorize(value):
    if value < 15:
        return 'Low'
    elif value < 35:
        return 'Medium'
    else:
        return 'High'

df['B_category'] = df['B'].apply(categorize)
print(df)
```
### Map
```python
# Dictionary mapping
df = pd.DataFrame({
    'size': ['S', 'M', 'L', 'XL']
})
size_map = {'S': 'Small', 'M': 'Medium', 'L': 'Large', 'XL': 'Extra Large'}
df['size_name'] = df['size'].map(size_map)
print(df)
```
## Practical Example: Sales Analysis
```python
# Sample sales dataset
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=100)
sales_data = pd.DataFrame({
    'date': dates,
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet'], 100),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'quantity': np.random.randint(1, 20, 100),
    'price': np.random.randint(100, 1000, 100)
})

# Calculate revenue
sales_data['revenue'] = sales_data['quantity'] * sales_data['price']
print("Sales data preview:")
print(sales_data.head())

# Total revenue by product
product_revenue = sales_data.groupby('product')['revenue'].sum().sort_values(ascending=False)
print("\nRevenue by product:")
print(product_revenue)

# Average price by region
region_prices = sales_data.groupby('region')['price'].mean()
print("\nAverage price by region:")
print(region_prices)

# Top-selling product per region
top_products = sales_data.groupby(['region', 'product'])['quantity'].sum()
top_products = top_products.groupby('region').nlargest(1)
print("\nTop product per region:")
print(top_products)

# Monthly revenue trend
sales_data['month'] = sales_data['date'].dt.to_period('M')
monthly_revenue = sales_data.groupby('month')['revenue'].sum()
print("\nMonthly revenue:")
print(monthly_revenue)
```
## Performance Tips
### Use Vectorized Operations
```python
# BAD: looping over rows in Python
result = []
for value in df['column']:
    result.append(value * 2)
df['new_column'] = result

# GOOD: vectorized
df['new_column'] = df['column'] * 2
```
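The same principle extends to conditionals: `numpy.where` is a vectorized alternative to the row-wise `apply` used earlier (reusing the `age` column from this guide's running example):

```python
import numpy as np

# Equivalent to df['age'].apply(lambda x: 'Young' if x < 30 else 'Senior'),
# but computed in a single vectorized pass
df['age_group'] = np.where(df['age'] < 30, 'Young', 'Senior')
```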
### Use Query for Complex Filtering
```python
# Instead of:
filtered = df[(df['age'] > 25) & (df['salary'] > 70000)]

# Use query:
filtered = df.query('age > 25 and salary > 70000')
```
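`query()` can also reference local Python variables with the `@` prefix, which keeps thresholds out of the string literal:

```python
min_age = 25
min_salary = 70000

# @name looks up the variable in the surrounding Python scope
filtered = df.query('age > @min_age and salary > @min_salary')
```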
### Use the Categorical Data Type
```python
# For columns with repeated values
df['city'] = df['city'].astype('category')
# Saves memory and speeds up operations
```
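To see the memory savings for yourself, here's a minimal sketch (the column contents are made up for illustration) comparing a column stored as object strings versus `category`:

```python
import pandas as pd

# A column with lots of repeated string values
cities = pd.Series(['New York', 'London', 'Paris'] * 100_000)

as_object = cities.memory_usage(deep=True)
as_category = cities.astype('category').memory_usage(deep=True)

print(f"object dtype:   {as_object:,} bytes")
print(f"category dtype: {as_category:,} bytes")  # typically far smaller
```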
## Common Mistakes to Avoid
### Mistake 1: Chained Assignment
```python
# WRONG: chained indexing assigns to a temporary copy, so the original
# df may be left unchanged (pandas raises SettingWithCopyWarning)
df[df['age'] > 30]['salary'] = df['salary'] * 1.1

# RIGHT: a single loc call both selects and assigns
df.loc[df['age'] > 30, 'salary'] = df['salary'] * 1.1
```
### Mistake 2: Forgetting That Methods Return a Copy
```python
# Doesn't modify the original
df.drop('column', axis=1)

# Reassign the result
df = df.drop('column', axis=1)
# OR use inplace
df.drop('column', axis=1, inplace=True)
```
### Mistake 3: Index Issues After Filtering
```python
df_filtered = df[df['age'] > 30]
# The index keeps its original labels, e.g. [0, 2, 4, ...]

# Reset the index
df_filtered = df_filtered.reset_index(drop=True)
# Now the index is [0, 1, 2, ...]
```
## Your DataFrame Mastery Journey
You've now learned the essential DataFrame operations:
- **Creation**: from dictionaries, lists, NumPy arrays, and CSV files
- **Selection**: rows, columns, and conditions
- **Filtering**: boolean indexing
- **Transformation**: apply, map, and groupby
- **Merging**: combining multiple DataFrames
- **Aggregation**: statistical analysis
These skills form the foundation of every data analysis project.
## Next Steps
- Practice with real datasets from Kaggle
- Learn time series analysis with pandas
- Master data cleaning techniques
- Explore visualization with matplotlib/seaborn
- Study the pandas documentation
Remember: The best way to learn pandas is by doing. Load a dataset and start exploring!
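If you want a concrete starting point, here's a minimal first-look script; the filename is a placeholder, so swap in any CSV you've downloaded:

```python
import pandas as pd

df = pd.read_csv('your_dataset.csv')  # placeholder path

print(df.shape)           # How much data is there?
print(df.dtypes)          # What type is each column?
print(df.head())          # What do the first rows look like?
print(df.describe())      # Quick statistical summary
print(df.isnull().sum())  # Where is data missing?
```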
Found this guide valuable? Share it with your data science community! Connect with me on Twitter or LinkedIn to discuss pandas techniques.
## Support My Work
If this guide helped you with this topic, I'd really appreciate your support! Creating comprehensive, free content like this takes significant time and effort. Your support helps me continue sharing knowledge and creating more helpful resources for aspiring data scientists and engineers.
☕ Buy me a coffee - Every contribution, big or small, means the world to me and keeps me motivated to create more content!
Cover image by Luke Chesser on Unsplash