Data Engineering Fundamentals: Essential Tools and Skills to Get Started
Discover essential data engineering tools, skills, and technologies: learn Python for data pipelines, master SQL, get comfortable with cloud platforms (AWS, GCP) and ETL concepts, and build your data engineering toolkit from scratch.

The Tool Confusion
When I decided to learn data engineering, I Googled "data engineering tools" and nearly gave up.
The list was overwhelming: Hadoop, Spark, Kafka, Airflow, Flink, Storm, Hive, Presto, Redshift, BigQuery, Snowflake, dbt, Fivetran... and that's just scratching the surface!
"Do I need to learn ALL of these?" I wondered.
The answer is no. After talking to data engineers and spending months learning, I realized something important: you need to understand the fundamentals first, then learn tools based on what problems you're solving.
In this post, I'm breaking down the essential tools and skills you actually need to get started in data engineering, without the overwhelm.
The Core Skills Foundation
Think of data engineering skills as a pyramid. You need a solid foundation before climbing higher.
Level 1: The Non-Negotiables
These are the skills you absolutely need. No shortcuts here.
1. Python (Your Primary Language)
Python is the lingua franca of data engineering. Here's what I focused on learning:
Basic Python:
# Variables and data types
user_id = 12345
username = "john_doe"
is_active = True
# Lists for storing collections
user_ids = [101, 102, 103, 104]
# Dictionaries for structured data
user = {
    "id": 101,
    "name": "John Doe",
    "email": "john@example.com",
    "age": 28
}
# Functions for reusable code
def calculate_age_group(age):
    if age < 18:
        return "minor"
    elif age < 65:
        return "adult"
    else:
        return "senior"
# Loops for processing data
for user_id in user_ids:
    print(f"Processing user {user_id}")
Working with Data (Pandas):
import pandas as pd
# Reading data
df = pd.read_csv('users.csv')
# Viewing data
print(df.head()) # First 5 rows
print(df.info()) # Data types and null counts
# Filtering
active_users = df[df['is_active'] == True]
# Grouping and aggregating
user_counts = df.groupby('age_group')['user_id'].count()
# Creating new columns
df['full_name'] = df['first_name'] + ' ' + df['last_name']
# Handling missing data
df['email'] = df['email'].fillna('unknown@example.com')
Why Python?
- Easy to learn and read
- Massive ecosystem of data libraries
- Works well with all data engineering tools
- Industry standard for data pipelines
My learning path:
- Python basics (2-3 weeks)
- Pandas for data manipulation (2-3 weeks)
- Working with APIs using requests (1 week), sketched below
- Error handling and logging (1 week)
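To make the API and error-handling steps concrete, here's a minimal sketch using the requests library; the endpoint URL is just a placeholder, not a real service:
import requests
# Placeholder endpoint: swap in whichever API you actually want to pull from
API_URL = "https://api.example.com/users"
def fetch_users(page=1):
    """Fetch one page of users and return the parsed JSON."""
    response = requests.get(API_URL, params={"page": page}, timeout=10)
    response.raise_for_status()  # Fail loudly on 4xx/5xx responses
    return response.json()
try:
    users = fetch_users(page=1)
    print(f"Fetched {len(users)} users")
except requests.RequestException as exc:
    # In a real pipeline you'd log this and retry with backoff
    print(f"API request failed: {exc}")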
2. SQL (The Most Important Skill)
I can't emphasize this enough: SQL is critical. You'll use it every single day.
Basic Queries I Practice:
-- Selecting specific columns
SELECT
user_id,
username,
email,
created_at
FROM users
WHERE is_active = true;
-- Aggregations
SELECT
country,
COUNT(*) as user_count,
AVG(age) as avg_age
FROM users
GROUP BY country
HAVING COUNT(*) > 100
ORDER BY user_count DESC;
-- Joins (crucial for combining data)
SELECT
u.username,
u.email,
o.order_date,
o.total_amount
FROM users u
INNER JOIN orders o
ON u.user_id = o.user_id
WHERE o.order_date >= '2025-10-01';
-- Subqueries
SELECT
user_id,
username
FROM users
WHERE user_id IN (
SELECT user_id
FROM orders
WHERE total_amount > 1000
);
Advanced SQL I'm Learning:
-- Window functions (super useful!)
SELECT
user_id,
order_date,
total_amount,
SUM(total_amount) OVER (
PARTITION BY user_id
ORDER BY order_date
) as running_total,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY order_date DESC
) as order_rank
FROM orders;
-- Common Table Expressions (CTEs)
WITH active_users AS (
SELECT user_id, username
FROM users
WHERE last_login_date >= CURRENT_DATE - INTERVAL '30 days'
),
recent_orders AS (
SELECT user_id, COUNT(*) as order_count
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id
)
SELECT
au.username,
COALESCE(ro.order_count, 0) as orders
FROM active_users au
LEFT JOIN recent_orders ro ON au.user_id = ro.user_id;
How I'm learning SQL:
- HackerRank SQL problems (doing 5-10 per day)
- Mode Analytics SQL Tutorial
- LeetCode Database Problems
- Practicing with real datasets
3. Understanding Databases
You need to know how databases work, not just how to query them.
Relational Databases (PostgreSQL, MySQL):
# Connecting to PostgreSQL with Python
import psycopg2
# Establish connection
conn = psycopg2.connect(
    host="localhost",
    database="mydb",
    user="myuser",
    password="mypassword"
)
# Create cursor
cur = conn.cursor()
# Execute query
cur.execute("""
    SELECT user_id, username, email
    FROM users
    WHERE created_at >= %s
""", ('2025-01-01',))
# Fetch results
users = cur.fetchall()
for user in users:
    print(user)
# Close connections
cur.close()
conn.close()
Key concepts I learned (sketched after this list):
- Primary keys: Unique identifiers for records
- Foreign keys: Links between tables
- Indexes: Make queries faster
- Normalization: Organizing data efficiently
- ACID properties: Ensuring data reliability
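To make the first three concepts concrete, here's a minimal sketch using Python's built-in sqlite3 module; the table and column names are purely illustrative:
import sqlite3
conn = sqlite3.connect(":memory:")        # Throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
# Primary key: a unique identifier for each user
conn.execute("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        email   TEXT NOT NULL UNIQUE
    )
""")
# Foreign key: every order must point at an existing user
conn.execute("""
    CREATE TABLE orders (
        order_id     INTEGER PRIMARY KEY,
        user_id      INTEGER NOT NULL REFERENCES users(user_id),
        total_amount REAL
    )
""")
# Index: speeds up frequent lookups of a user's orders
conn.execute("CREATE INDEX idx_orders_user_id ON orders(user_id)")
conn.close()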
NoSQL Databases (MongoDB):
from pymongo import MongoClient
# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['mydb']
collection = db['users']
# Insert document
user = {
    "username": "john_doe",
    "email": "john@example.com",
    "age": 28,
    "interests": ["tech", "music", "travel"]
}
collection.insert_one(user)
# Query documents
active_users = collection.find({"is_active": True})
for user in active_users:
    print(user)
When to use what:
- Relational (SQL): Structured data with relationships
- NoSQL (MongoDB): Flexible schemas, nested data
- Redis: Caching, fast access (quick sketch after this list)
- Elasticsearch: Full-text search
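As a quick illustration of the caching case, here's a minimal sketch that assumes a local Redis server and the redis-py package:
import json
import redis
# Assumes Redis is running locally on the default port
cache = redis.Redis(host='localhost', port=6379, db=0)
user = {"id": 101, "name": "John Doe"}
# Cache the user profile for 5 minutes (300 seconds)
cache.setex("user:101", 300, json.dumps(user))
# Later reads hit the cache instead of the database
cached = cache.get("user:101")
if cached is not None:
    print(json.loads(cached))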
Level 2: Essential Data Engineering Concepts
Once you're comfortable with Python and SQL, these concepts are next.
ETL vs ELT
ETL (Extract, Transform, Load):
- Extract data from sources
- Transform it (clean, aggregate)
- Load into destination
# Simple ETL example
import pandas as pd
from datetime import datetime
def extract_from_api():
    """Extract data from API"""
    # Simulated API response
    data = {
        'user_id': [1, 2, 3, 4, 5],
        'name': ['John', 'Jane', 'Bob', None, 'Alice'],
        'email': ['john@ex.com', 'jane@ex.com', 'bob@ex', 'mike@ex.com', 'alice@ex.com'],
        'signup_date': ['2025-01-15', '2025-02-20', '2025-03-10', '2025-04-05', '2025-05-12']
    }
    return pd.DataFrame(data)
def transform_data(df):
    """Clean and transform data"""
    # Remove rows with null names
    df = df.dropna(subset=['name'])
    # Validate email format
    df = df[df['email'].str.contains('@', regex=False)]
    df = df[df['email'].str.contains('.', regex=False)]
    # Convert signup_date to datetime
    df['signup_date'] = pd.to_datetime(df['signup_date'])
    # Add computed field
    df['days_since_signup'] = (datetime.now() - df['signup_date']).dt.days
    return df
def load_to_database(df):
    """Load data to database"""
    # In a real scenario, this would write to an actual database
    df.to_csv('processed_users.csv', index=False)
    print(f"Loaded {len(df)} records")
# Run ETL pipeline
raw_data = extract_from_api()
clean_data = transform_data(raw_data)
load_to_database(clean_data)
ELT (Extract, Load, Transform):
- Extract data
- Load raw data first
- Transform inside the data warehouse
This is becoming more popular with modern cloud data warehouses!
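To see the difference in code, here's a minimal ELT sketch using Python's built-in sqlite3 as a stand-in for a cloud warehouse: the raw rows are loaded untouched, then one SQL statement does the transformation inside the "warehouse" (table names are illustrative):
import sqlite3
conn = sqlite3.connect("warehouse.db")
# Extract + Load: land the raw data exactly as it came in
conn.execute("CREATE TABLE IF NOT EXISTS raw_users (user_id INTEGER, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO raw_users VALUES (?, ?, ?)",
    [(1, "John", "john@ex.com"), (2, None, "jane@ex.com"), (3, "Bob", "bob@ex")],
)
# Transform: clean up inside the warehouse with plain SQL
conn.execute("DROP TABLE IF EXISTS clean_users")
conn.execute("""
    CREATE TABLE clean_users AS
    SELECT user_id, name, LOWER(email) AS email
    FROM raw_users
    WHERE name IS NOT NULL
      AND email LIKE '%@%.%'
""")
conn.commit()
conn.close()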
Data Modeling
How you structure your data matters. I learned two main approaches:
Star Schema (for data warehouses):
Fact Table (center): sales
- sale_id
- date_key (FK)
- product_key (FK)
- customer_key (FK)
- store_key (FK)
- quantity
- amount
Dimension Tables (points):
- dim_date (date_key, date, month, year, quarter)
- dim_product (product_key, name, category, price)
- dim_customer (customer_key, name, email, city)
- dim_store (store_key, store_name, location, region)
Why star schema?
- Simple queries
- Fast aggregations
- Easy to understand
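Here's a tiny pandas sketch of querying the star schema above with toy data (the values are made up): join the fact table to its dimensions, then aggregate.
import pandas as pd
# Toy fact and dimension tables following the star schema above
sales = pd.DataFrame({
    "product_key": [1, 1, 2],
    "date_key": [10, 11, 10],
    "amount": [20.0, 35.0, 15.0],
})
dim_product = pd.DataFrame({"product_key": [1, 2], "category": ["books", "toys"]})
dim_date = pd.DataFrame({"date_key": [10, 11], "month": ["2025-01", "2025-02"]})
# Join the fact table to its dimensions, then aggregate
report = (
    sales.merge(dim_product, on="product_key")
         .merge(dim_date, on="date_key")
         .groupby(["category", "month"], as_index=False)["amount"]
         .sum()
)
print(report)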
Data Quality
One thing I learned early: bad data is worse than no data.
Data Quality Checks:
def validate_data_quality(df):
    """Run data quality checks"""
    issues = []
    # Completeness checks
    null_counts = df.isnull().sum()
    for column, null_count in null_counts.items():
        if null_count > 0:
            percentage = (null_count / len(df)) * 100
            if percentage > 5:  # More than 5% nulls
                issues.append(f"{column}: {percentage:.2f}% null values")
    # Uniqueness checks
    if df['user_id'].duplicated().any():
        dup_count = df['user_id'].duplicated().sum()
        issues.append(f"Found {dup_count} duplicate user_ids")
    # Validity checks
    if (df['age'] < 0).any() or (df['age'] > 120).any():
        issues.append("Invalid age values found")
    # Consistency checks
    if (df['order_amount'] < 0).any():
        issues.append("Negative order amounts found")
    # Timeliness checks
    old_data = df[df['created_at'] < '2020-01-01']
    if len(old_data) > 0:
        issues.append(f"Found {len(old_data)} records older than 2020")
    # Report
    if issues:
        print("❌ Data quality issues:")
        for issue in issues:
            print(f" - {issue}")
        return False
    else:
        print("✅ All data quality checks passed!")
        return True
Essential Tools and Technologies
Now for the fun part: the actual tools! I'm organizing these by category.
Data Processing Tools
1. Pandas (For Small to Medium Data)
I use pandas almost every day for:
- Reading/writing data files (CSV, JSON, Excel)
- Data cleaning and transformation
- Quick analysis and exploration
When pandas works: Data that fits in memory (up to a few GB)
When to move on: When data gets too big for your RAM
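Before reaching for Spark, pandas can often stretch a bit further by streaming a large file in chunks instead of loading it all at once. A minimal sketch (file and column names assumed):
import pandas as pd
total_active = 0
# Process the file 100,000 rows at a time instead of loading it whole
for chunk in pd.read_csv("users.csv", chunksize=100_000):
    total_active += (chunk["is_active"] == True).sum()
print(f"Active users: {total_active}")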
2. Apache Spark (For Big Data)
Spark handles data that doesn't fit on one machine.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, avg
# Create Spark session
spark = SparkSession.builder \
    .appName("UserAnalysis") \
    .getOrCreate()
# Read data
df = spark.read.csv("users.csv", header=True, inferSchema=True)
# Transform data (runs distributed!)
result = df.filter(col("is_active") == True) \
    .groupBy("country") \
    .agg(
        count("user_id").alias("user_count"),
        avg("age").alias("avg_age")
    ) \
    .orderBy(col("user_count").desc())
# Show results
result.show()
# Write results
result.write.csv("output/user_stats.csv")
When to use Spark:
- Data larger than single machine memory
- Need distributed processing
- Working with data lakes
Workflow Orchestration
Apache Airflow
Airflow schedules and monitors your data pipelines.
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
# Define default arguments
default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2025, 10, 28),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
# Define the DAG
dag = DAG(
    'user_data_pipeline',
    default_args=default_args,
    description='Daily user data processing',
    schedule_interval='0 2 * * *',  # Run daily at 2 AM
)
# Define tasks
def extract_data():
    print("Extracting data...")
    # Your extraction logic
def transform_data():
    print("Transforming data...")
    # Your transformation logic
def load_data():
    print("Loading data...")
    # Your loading logic
# Create operators
extract_task = PythonOperator(
    task_id='extract',
    python_callable=extract_data,
    dag=dag,
)
transform_task = PythonOperator(
    task_id='transform',
    python_callable=transform_data,
    dag=dag,
)
load_task = PythonOperator(
    task_id='load',
    python_callable=load_data,
    dag=dag,
)
# Set dependencies
extract_task >> transform_task >> load_task
What Airflow does:
- Schedules pipeline runs
- Monitors task failures
- Handles retries
- Provides a web UI
- Manages dependencies
Cloud Platforms
Most data engineering happens in the cloud now. I started with AWS basics:
AWS Services I'm Learning
S3 (Simple Storage Service):
import boto3
# Connect to S3
s3 = boto3.client('s3')
# Upload file
s3.upload_file('local_file.csv', 'my-bucket', 'data/file.csv')
# Download file
s3.download_file('my-bucket', 'data/file.csv', 'downloaded_file.csv')
# List files
response = s3.list_objects_v2(Bucket='my-bucket', Prefix='data/')
for obj in response.get('Contents', []):
    print(obj['Key'])
Other AWS services:
- RDS: Managed relational databases
- Redshift: Data warehouse
- Lambda: Run code without servers (sketch below)
- Glue: ETL service
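For a taste of Lambda, here's a minimal sketch of a handler that runs whenever a file lands in S3 and simply logs what arrived; the S3 trigger itself is configured in AWS, not in the code:
def lambda_handler(event, context):
    """Triggered by an S3 upload; logs the bucket and key of each new object."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New file arrived: s3://{bucket}/{key}")
    return {"status": "ok"}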
Alternatives:
- Google Cloud Platform: BigQuery, Cloud Storage, Dataflow
- Azure: Synapse, Data Factory, Blob Storage
Data Warehouses
Where you store processed data for analysis.
Options I researched:
- Snowflake: Popular, easy to use, separates storage and compute
- Amazon Redshift: AWS native, fast queries
- Google BigQuery: Serverless, pay per query
- Azure Synapse: Microsoft ecosystem
-- Example: Loading data into Snowflake
COPY INTO users
FROM @my_stage/users.csv
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
ON_ERROR = 'CONTINUE';
-- Create aggregation table
CREATE TABLE user_stats AS
SELECT
country,
COUNT(*) as user_count,
AVG(age) as avg_age,
MAX(signup_date) as last_signup
FROM users
GROUP BY country;
My Tool Learning Strategy
I initially tried to learn everything. Bad idea! Here's what worked better:
Phase 1: Core Skills (Month 1-3)
- ✅ Python basics + pandas
- ✅ SQL (practice daily)
- ✅ PostgreSQL locally
- ✅ Git for version control
Phase 2: First Pipeline (Month 4-5)
- ✅ Build ETL with Python
- ✅ Schedule with cron
- ✅ Basic data quality checks
- ✅ Log results
Phase 3: Cloud Basics (Month 6-7)
- AWS account and S3
- RDS for databases
- Lambda for simple tasks
- CloudWatch for monitoring
Phase 4: Orchestration (Month 8-9)
- Apache Airflow
- More complex pipelines
- Error handling and alerts
- Documentation
Phase 5: Scale Up (Month 10-12)
- Introduction to Spark
- Data warehouses
- Real-time processing basics
- Portfolio projects
The Tools You DON'T Need (Yet)
As a beginner, you can safely ignore:
- ❌ Kafka, Flink, Storm (unless doing real-time)
- ❌ Kubernetes (unless DevOps focused)
- ❌ Scala (Python is enough)
- ❌ Hadoop (mostly legacy)
- ❌ Every new tool that gets hyped
Focus on fundamentals first!
Setting Up Your Learning Environment
Here's my current setup:
Local Development
# My tech stack on MacOS/Linux
- Python 3.9+
- PostgreSQL (local database)
- VS Code (with Python extensions)
- DBeaver (database GUI)
- Docker (for running services locally)
- Git (version control)
Cloud Setup (Free Tier)
# Services I use without paying (initially)
- AWS Free Tier (12 months)
- Google Cloud Free Tier ($300 credit)
- GitHub (for code hosting)
- Heroku Free Tier (for small apps)
Practice Projects That Build Skills
Theory is important, but projects teach more. Here's what I built:
Project 1: Personal Data Pipeline
- Extract your own data (bank statements, screen time, etc.)
- Clean and transform it
- Load to SQLite (minimal sketch below)
- Create simple visualizations
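For the load step, a minimal sketch using pandas plus the standard-library sqlite3 module (file and table names assumed):
import sqlite3
import pandas as pd
# Assume the extracted data has already been cleaned into a CSV
df = pd.read_csv("cleaned_transactions.csv")
# Write it into a local SQLite database for easy querying
conn = sqlite3.connect("personal_data.db")
df.to_sql("transactions", conn, if_exists="replace", index=False)
conn.close()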
Project 2: API Data Collector
- Find a free API (weather, stocks, COVID data)
- Fetch data daily
- Store in database
- Track changes over time
Project 3: E-commerce Data Warehouse
- Use sample e-commerce data
- Design star schema
- Create ETL pipeline
- Build analytics queries
Common Tool Mistakes I Made
Mistake 1: Choosing Tools Before Understanding Problems
I learned Spark before understanding when to use it. Wasted time!
Better: Learn a tool when you encounter a problem it solves.
Mistake 2: Following Every Hype
New tools emerge constantly. I'd jump to learn each one.
Better: Master fundamentals. Tools change, concepts don't.
Mistake 3: Tutorial Hell
I watched tutorials without building anything.
Better: Watch one tutorial, then build something without it.
Mistake 4: Not Learning SQL Deeply
I thought basic SELECT queries were enough. Very wrong!
Better: Invest serious time in SQL. It's the most used skill.
Resources That Actually Helped
For Python
- Official Python Tutorial
- Corey Schafer's Python YouTube Series
- "Python for Data Analysis" by Wes McKinney
For SQL
- HackerRank SQL (my daily practice)
- Mode Analytics Tutorial
- "SQL Queries for Mere Mortals" book
For Data Engineering
- Start Data Engineering (excellent blog)
- Data Engineering Podcast
- "Fundamentals of Data Engineering" by Joe Reis
For Hands-On Practice
- Kaggle (datasets and notebooks)
- Data Engineering Zoomcamp (free course)
- Build your own projects!
Conclusion: Focus on Fundamentals
After months of learning, here's what I wish someone told me on day one:
The 80/20 of Data Engineering:
- Python + SQL = 80% of your work
- Understanding ETL concepts > knowing every tool
- Building projects > watching tutorials
- Solving real problems > following hype
Your action plan:
- Week 1-4: Python basics and pandas
- Week 5-8: SQL practice daily
- Week 9-12: Build your first pipeline
- Month 4+: Add tools as you need them
You don't need to know everything to start. You just need strong fundamentals and the willingness to learn as you go.
The best time to start was yesterday. The second best time is now!
Building your data engineering toolkit? Share your progress! Connect with me on Twitter or LinkedIn. Let's learn together!
Support My Work
If this guide helped you understand essential data engineering tools and skills, or gave you a clear learning roadmap, I'd really appreciate your support! Creating comprehensive, free content like this takes significant time and effort. Your support helps me continue sharing knowledge and creating more helpful resources for aspiring data scientists and engineers.
☕ Buy me a coffee - Every contribution, big or small, means the world to me and keeps me motivated to create more content!
Cover image by ThisisEngineering on Unsplash