Database Design in System Design: SQL vs NoSQL, Sharding, Replication

Master database design for system design interviews. Learn SQL vs NoSQL, database sharding, replication, indexing with real examples from Facebook, Twitter, Uber

πŸ“… Published: July 20, 2025 ✏️ Updated: August 14, 2025 By Ojaswi Athghara
#database #sql #nosql #sharding #replication

Database Design in System Design: SQL vs NoSQL, Sharding, Replication

"NoSQL Is Always Faster Than SQL" – The Expensive Myth

I've heard this in dozens of system design interviews: "Let's use MongoDB because it's faster than PostgreSQL."

Wrong.

Instagram serves billions of photosβ€”PostgreSQL. Uber tracks millions of ridesβ€”PostgreSQL. Discord handles billions of messagesβ€”their custom database built on... Cassandra? No, they actually use Scylla, but store critical data in PostgreSQL.

The truth? Database choice isn't about "faster" or "slower." It's about matching your data model, access patterns, and consistency requirements to the right storage system.

In this guide, I'll break down SQL vs NoSQL, when to use each, sharding, replication, and how companies like Facebook, Twitter, and Uber design their databases. Let's dive in.


Types of Databases

1. SQL (Relational) Databases

Examples: PostgreSQL, MySQL, Oracle, SQL Server

Data Model: Tables with rows and columns

Example:

-- Users table
users
+----+----------+------------------+
| id | username | email            |
+----+----------+------------------+
| 1  | john_doe | john@example.com |
| 2  | jane_doe | jane@example.com |
+----+----------+------------------+

-- Posts table
posts
+----+---------+------------------+------------+
| id | user_id | content          | created_at |
+----+---------+------------------+------------+
| 1  | 1       | Hello world!     | 2025-01-01 |
| 2  | 1       | Learning SQL     | 2025-01-02 |
| 3  | 2       | NoSQL vs SQL     | 2025-01-03 |
+----+---------+------------------+------------+

-- Relationships
Foreign key: posts.user_id β†’ users.id
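This schema is easy to try in an in-memory SQLite database; a sketch mirroring the tables above (SQLite here is just for illustration):

```python
import sqlite3

# In-memory database mirroring the users/posts example above
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in to FK enforcement
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, email TEXT)")
conn.execute("""CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),  -- foreign key: posts.user_id -> users.id
    content TEXT,
    created_at TEXT)""")

conn.execute("INSERT INTO users VALUES (1, 'john_doe', 'john@example.com')")
conn.execute("INSERT INTO posts VALUES (1, 1, 'Hello world!', '2025-01-01')")

# A JOIN walks the relationship between the two tables
row = conn.execute("""
    SELECT users.username, posts.content
    FROM posts JOIN users ON posts.user_id = users.id
""").fetchone()
print(row)  # ('john_doe', 'Hello world!')
```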

Characteristics: βœ… ACID transactions (Atomicity, Consistency, Isolation, Durability) βœ… Structured schema (enforced data integrity) βœ… Complex queries with JOINs βœ… Mature ecosystem (40+ years old)

When to use:

  • Financial systems (banking, payments)
  • E-commerce (orders, inventory)
  • Applications with complex relationships
  • Need for strong consistency

Real Example: Instagram

Instagram uses PostgreSQL for:

  • User profiles
  • Relationships (followers/following)
  • Posts metadata
  • Comments

Why PostgreSQL?

  • Complex queries (fetch all posts from followed users)
  • Transactional integrity (ensure consistency)
  • Proven scalability (with sharding)

2. NoSQL Databases

"Not only SQL" – Multiple types:

a) Document Stores (MongoDB, Couchbase)

Data Model: JSON-like documents

Example:

{
  "_id": "user_123",
  "username": "john_doe",
  "email": "john@example.com",
  "profile": {
    "bio": "Software developer",
    "location": "San Francisco"
  },
  "posts": [
    {
      "content": "Hello world!",
      "created_at": "2025-01-01",
      "likes": 42
    }
  ]
}

Characteristics: βœ… Flexible schema (no predefined structure) βœ… Nested data (no JOINs needed) βœ… Horizontal scaling (sharding built-in)

When to use:

  • Rapidly changing schema
  • Nested/hierarchical data
  • Content management systems
  • Catalogs with varying attributes

Real Example: Facebook Messages

Facebook uses HBase (technically a column-family store, not a document store, but with similar schema flexibility) for messaging:

  • Billions of messages
  • Schema flexibility (attachments, reactions, etc.)
  • Write-heavy workload

b) Key-Value Stores (Redis, DynamoDB, Riak)

Data Model: Simple key β†’ value

Example:

# Redis
"session:abc123" β†’ {"user_id": 1, "logged_in": true}
"user:1:cart" β†’ ["product_1", "product_2"]
"rate_limit:user_1" β†’ 45

Characteristics: βœ… Extremely fast (in-memory) βœ… Simple operations (GET, SET, DELETE) βœ… TTL support (auto-expiration)

When to use:

  • Caching
  • Session storage
  • Real-time analytics (counters, leaderboards)
  • Rate limiting
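The TTL behavior is simple to sketch: a toy in-memory store (nothing like Redis's actual implementation) that lazily drops expired keys. The injectable clock is only there to make expiry deterministic:

```python
import time

class TTLStore:
    """Toy key-value store with per-key expiration (Redis-style TTL)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}     # key -> (value, expires_at or None)
        self._clock = clock

    def set(self, key, value, ttl=None):
        expires_at = self._clock() + ttl if ttl is not None else None
        self._data[key] = (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazy expiration on read
            return None
        return value

# Usage with a fake clock so the expiry is visible
now = [0.0]
store = TTLStore(clock=lambda: now[0])
store.set("session:abc123", {"user_id": 1}, ttl=30)
now[0] = 10.0
print(store.get("session:abc123"))  # {'user_id': 1} — still alive
now[0] = 31.0
print(store.get("session:abc123"))  # None — expired
```

Redis actually combines this lazy check with an active background sweep, but the key-level semantics are the same.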

Real Example: Twitter

Twitter uses Redis for:

  • Timeline caching
  • User session storage
  • Rate limiting (API calls per user)

c) Column-Family Stores (Cassandra, HBase, ScyllaDB)

Data Model: Rows with dynamic columns

Example:

Row key: user_123
  | Column family: profile
    | name: "John Doe"
    | email: "john@example.com"
  | Column family: posts
    | post_1: "Hello world!"
    | post_2: "Learning Cassandra"

Characteristics: βœ… Write-optimized (append-only) βœ… Massive scalability (petabytes) βœ… Tunable consistency

When to use:

  • Time-series data
  • High write throughput
  • Analytics / logging

Real Example: Netflix

Netflix uses Cassandra for:

  • Viewing history (billions of events)
  • User preferences
  • A/B testing data

Why Cassandra?

  • Handles 1+ million writes per second
  • Globally distributed (multi-region)
  • Always available (no single point of failure)

d) Graph Databases (Neo4j, Amazon Neptune)

Data Model: Nodes and edges (relationships)

Example:

(User: john_doe) -[:FOLLOWS]-> (User: jane_doe)
(User: john_doe) -[:LIKES]-> (Post: 123)
(Post: 123) -[:TAGGED]-> (Tag: #systemdesign)

When to use:

  • Social networks (friends, followers)
  • Recommendation engines
  • Fraud detection

Real Example: LinkedIn

LinkedIn uses graph database concepts for:

  • Connection recommendations ("People you may know")
  • 2nd/3rd degree connections
  • Shortest path between users
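Under the hood, "2nd/3rd degree connections" is breadth-first search over the graph. A minimal sketch with a hypothetical adjacency list:

```python
from collections import deque

def degrees_of_connection(graph, start, target):
    """BFS over an undirected connection graph; returns the number of
    hops between two users, or None if they aren't connected."""
    if start == target:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        user, depth = queue.popleft()
        for neighbor in graph.get(user, []):
            if neighbor == target:
                return depth + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return None

# Hypothetical connection graph
graph = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol"],
}
print(degrees_of_connection(graph, "alice", "dave"))  # 3 — a 3rd-degree connection
```

A dedicated graph database makes these traversals fast at scale because edges are stored as direct pointers rather than rows that need JOINs.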

SQL vs NoSQL: Decision Matrix

+------------------+-----------------------------+--------------------------------+
| Factor           | SQL                         | NoSQL                          |
+------------------+-----------------------------+--------------------------------+
| Schema           | Fixed, predefined           | Flexible, dynamic              |
| Transactions     | Full ACID ✅                | Limited (eventual consistency) |
| Scalability      | Vertical (harder to scale)  | Horizontal (built-in) ✅       |
| Complex queries  | JOINs, aggregations ✅      | Limited support                |
| Consistency      | Strong ✅                   | Eventual (tunable)             |
| Maturity         | 40+ years ✅                | ~15 years                      |
| Use case         | Banking, e-commerce         | Social media, analytics        |
+------------------+-----------------------------+--------------------------------+

Truth: Most large systems use both.


Database Scaling Techniques

1. Indexing

Problem: Query slow on large table

-- 10 million users
SELECT * FROM users WHERE email = 'john@example.com';
-- Scans entire table: 5 seconds ❌

Solution: Create index

CREATE INDEX idx_email ON users(email);

-- Now uses index: 5ms βœ…

How indexes work:

Think of a book:

  • Without index: Read every page to find "Chapter 5" (slow)
  • With index: Look at table of contents, jump to page 87 (fast)

Trade-off: βœ… Faster reads ❌ Slower writes (must update index)

Best practice: Index columns used in WHERE, ORDER BY, JOIN
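You can verify an index is actually used by inspecting the query plan. A SQLite sketch of the idea (PostgreSQL's EXPLAIN ANALYZE works similarly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(1000)])

query = "SELECT * FROM users WHERE email = 'user500@example.com'"

def plan(sql):
    # The last column of EXPLAIN QUERY PLAN output describes the access strategy
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

print(plan(query))  # before: a full scan of the users table

conn.execute("CREATE INDEX idx_email ON users(email)")
print(plan(query))  # after: a search using idx_email
```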

Real Example: Facebook

Facebook heavily indexes:

  • User IDs
  • Email addresses
  • Phone numbers
  • Timestamps (for timeline queries)

2. Replication

Strategy: Copy data to multiple servers

Master-Slave Replication

        [Master DB]
     (Handles writes)
            ↓
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”
    ↓               ↓
[Slave 1]       [Slave 2]
(Handles reads) (Handles reads)

How it works:

  1. All writes β†’ Master
  2. Master replicates changes to Slaves
  3. Reads β†’ Distributed across Slaves

Benefits: βœ… Scale reads (add more slaves) βœ… High availability (if master fails, promote slave) βœ… Backup (slaves have full copy)

Drawbacks: ❌ Replication lag (slaves slightly behind master) ❌ Single write point (master is bottleneck)

Real Example: Reddit

Reddit uses master-slave replication:

  • Master: Handles posts, comments, votes (writes)
  • Slaves (10+): Serve home feed, subreddit pages (reads)

Master-Master Replication

[Master 1] ←→ [Master 2]
  (Writes)      (Writes)

Both databases accept writes, replicate to each other.

Benefits: βœ… No single write bottleneck βœ… High availability

Drawbacks: ❌ Write conflicts (complex to resolve) ❌ Consistency challenges

When to use: Multi-region applications needing low write latency everywhere


3. Sharding (Horizontal Partitioning)

Problem: Database too large for single server

Database: 1 TB
Single server: 256 GB RAM
    ↓
Cannot fit in memory
    ↓
Slow queries

Solution: Split data across multiple databases

Sharding Strategies

a) Range-Based Sharding

Shard 1: Users with ID 1 - 1,000,000
Shard 2: Users with ID 1,000,001 - 2,000,000
Shard 3: Users with ID 2,000,001 - 3,000,000

Pros: βœ… Simple to implement βœ… Easy to add new shard (extend range)

Cons: ❌ Uneven distribution (hotspots) ❌ Shard 1 might have 80% traffic


b) Hash-Based Sharding

import hashlib

def get_shard(user_id, num_shards):
    # Use a stable hash: Python's built-in hash() is randomized per process
    # for strings, which would scatter keys across restarts
    digest = hashlib.md5(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Illustrative examples — each key always maps to the same shard:
# user_123 → Shard 1
# user_456 → Shard 3
# user_789 → Shard 2

Pros: βœ… Even distribution βœ… No hotspots

Cons: ❌ Adding shards requires re-sharding all data ❌ Range queries difficult (data scattered)
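That re-sharding cost is easy to measure: with modulo hashing, a key keeps its shard only when its hash agrees modulo both the old and new shard counts, so growing from 4 to 5 shards moves roughly 80% of the keys. A quick check using a stable hash:

```python
import hashlib

def get_shard(key, num_shards):
    # Stable hash so the shard mapping is the same in every process
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

keys = [f"user_{i}" for i in range(10_000)]
moved = sum(get_shard(k, 4) != get_shard(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys change shards")  # roughly 80%
```

This is why systems that expect to grow often use consistent hashing instead, which moves only about 1/N of the keys per added shard.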


c) Geographic Sharding

US users β†’ US database
EU users β†’ EU database
Asia users β†’ Asia database

Pros: βœ… Low latency (data close to users) βœ… Regulatory compliance (GDPR)

Cons: ❌ Uneven load (if user distribution varies) ❌ Cross-region queries expensive

Real Example: Uber

Uber shards by city:

  • New York rides β†’ New York database
  • San Francisco rides β†’ SF database
  • London rides β†’ London database

Why?

  • Rides are geographically bounded
  • Most queries within single city
  • Rare cross-city queries

Sharding Challenges

Problem 1: Joins across shards

-- User in Shard 1, Posts in Shard 2
SELECT users.name, posts.content
FROM users JOIN posts ON users.id = posts.user_id;

-- Requires fetching data from multiple shards ❌

Solution: Denormalize (duplicate data to avoid cross-shard queries)


Problem 2: Auto-increment IDs

-- Shard 1 generates: 1, 2, 3, ...
-- Shard 2 generates: 1, 2, 3, ... (collision!)

Solutions:

  • Use UUID (globally unique, but large)
  • Twitter Snowflake (timestamp + shard ID + sequence)
  • Instagram ID generation (timestamp + shard ID)
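A simplified sketch of the Snowflake idea (the real layout is 41 bits of millisecond timestamp, 10 bits of machine/shard ID, and 12 bits of sequence; the custom epoch below is an arbitrary assumption, and real implementations wait out sequence overflow rather than wrapping):

```python
import time

EPOCH_MS = 1_577_836_800_000  # 2020-01-01 UTC — an arbitrary custom epoch

class SnowflakeLite:
    """Simplified Snowflake-style IDs: timestamp | shard ID | sequence."""

    def __init__(self, shard_id):
        assert 0 <= shard_id < 1024  # shard ID fits in 10 bits
        self.shard_id = shard_id
        self.sequence = 0
        self.last_ms = -1

    def next_id(self, now_ms=None):
        ms = int(time.time() * 1000) - EPOCH_MS if now_ms is None else now_ms
        if ms == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence
        else:
            self.sequence = 0
            self.last_ms = ms
        # Pack: 41 bits timestamp, 10 bits shard, 12 bits sequence
        return (ms << 22) | (self.shard_id << 12) | self.sequence

gen = SnowflakeLite(shard_id=7)
a = gen.next_id()
b = gen.next_id()
# IDs are unique per shard, roughly time-sortable, and the shard is
# recoverable from the ID itself: (id >> 12) & 0x3FF
```

Time-sortability is the big win over UUIDs: recent rows cluster together in the index, which keeps inserts fast.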

Data Storage Types

1. OLTP (Online Transaction Processing)

Purpose: Handle day-to-day transactions

Characteristics:

  • Fast writes/reads
  • Small queries (fetch single user, update order)
  • High concurrency (thousands of queries/sec)

Examples: User profiles, orders, inventory

Databases: MySQL, PostgreSQL, MongoDB

Real Example: Amazon Orders

When you place an order:

BEGIN TRANSACTION;
  INSERT INTO orders (user_id, total) VALUES (123, 49.99);
  UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 789;
  INSERT INTO payments (order_id, status) VALUES (456, 'pending');
COMMIT;

Fast, transactional, consistent.
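Atomicity is the point: if any statement fails, the whole transaction rolls back and no half-placed order survives. A SQLite sketch of the same pattern (table names mirror the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, total REAL)")
conn.execute("CREATE TABLE inventory (product_id INT, quantity INT CHECK (quantity >= 0))")
conn.execute("INSERT INTO inventory VALUES (789, 1)")  # one unit left

def place_order(user_id, product_id, total):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("INSERT INTO orders (user_id, total) VALUES (?, ?)",
                         (user_id, total))
            conn.execute("UPDATE inventory SET quantity = quantity - 1 "
                         "WHERE product_id = ?", (product_id,))
        return True
    except sqlite3.IntegrityError:
        return False  # rolled back: no order row, no inventory change

print(place_order(123, 789, 49.99))  # True — last unit sold
print(place_order(124, 789, 49.99))  # False — CHECK fails, whole txn rolls back
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 1, not 2
```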


2. OLAP (Online Analytical Processing)

Purpose: Analyze historical data

Characteristics:

  • Complex queries (aggregations, JOINs)
  • Large datasets (billions of rows)
  • Read-heavy

Examples: Business intelligence, reports, dashboards

Databases: Snowflake, Google BigQuery, Amazon Redshift

Real Example: Netflix Analytics

Netflix analyzes billions of viewing events:

-- "What shows are most popular in India?"
SELECT show_name, COUNT(*) as views
FROM viewing_history
WHERE country = 'India' AND date >= '2025-01-01'
GROUP BY show_name
ORDER BY views DESC
LIMIT 10;

Query might take 30 secondsβ€”but that's okay, it's not real-time.


3. Data Warehouse

Purpose: Central repository for analytics

Architecture:

          [Data Warehouse]
                ↑
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    ↓           ↓           ↓
[OLTP DB 1] [OLTP DB 2] [Logs]
(Products)  (Users)     (Events)

Extract data from multiple sources, transform, load into warehouse (ETL).
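A toy version of that ETL flow, with two in-memory "sources" standing in for the OLTP databases (all names here are hypothetical):

```python
# Extract: rows from two hypothetical OLTP sources
users = [{"id": 1, "name": "Ada", "country": "UK"}]
orders = [{"order_id": 10, "user_id": 1, "total": 50.0},
          {"order_id": 11, "user_id": 1, "total": 15.0}]

# Transform: join and denormalize into one analytics-friendly row per order
users_by_id = {u["id"]: u for u in users}
warehouse = [
    {"order_id": o["order_id"],
     "total": o["total"],
     "user_name": users_by_id[o["user_id"]]["name"],
     "country": users_by_id[o["user_id"]]["country"]}
    for o in orders
]

# Load: in production these rows land in Snowflake/BigQuery/Redshift,
# where analysts aggregate them without touching the OLTP databases
revenue_by_country = {}
for row in warehouse:
    revenue_by_country[row["country"]] = (
        revenue_by_country.get(row["country"], 0) + row["total"])
print(revenue_by_country)  # {'UK': 65.0}
```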

Real Example: Airbnb

Airbnb consolidates:

  • Booking data (PostgreSQL)
  • Search logs (Elasticsearch)
  • User behavior (event tracking)

Into data warehouse (Presto) for analytics.


4. Data Lake

Purpose: Store raw, unstructured data

Difference from Data Warehouse:

  • Data Lake: Store everything (unprocessed)
  • Data Warehouse: Store structured, processed data

Real Example: Facebook

Facebook stores:

  • Photos (billions)
  • Videos
  • Logs
  • Everything

In data lake (Hadoop), process later as needed.


CAP Theorem (Briefly)

CAP Theorem: You can only choose 2 out of 3:

  • Consistency: All nodes see same data
  • Availability: Every request gets response
  • Partition Tolerance: System works despite network failures

Choices:

  • CP (Consistency + Partition Tolerance): MongoDB, HBase
  • AP (Availability + Partition Tolerance): Cassandra, DynamoDB
  • CA (Consistency + Availability): Not realistic (networks fail)

Most systems: Choose AP or CP based on needs


System Design Interview Tips

Common Question: "Design Twitter"

Database Strategy:

1. User Profile: PostgreSQL
   - Structured data
   - Relationships (followers)
   - ACID transactions

2. Tweets: Cassandra
   - Massive write volume (6,000 tweets/sec)
   - Time-series data
   - No complex queries

3. Timeline Cache: Redis
   - Pre-computed timelines
   - Fast access
   - TTL = 5 minutes

4. Analytics: Data Warehouse (BigQuery)
   - Aggregate metrics
   - Historical trends

Sharding:

Users: Hash-based sharding by user_id
Tweets: Range-based sharding by timestamp (recent tweets on hot shards)

Replication:

Master-slave replication:
  - 1 master (writes)
  - 5 slaves (reads)
  - Geographic distribution

What to Mention

βœ… SQL for structured data, NoSQL for unstructured/high-scale βœ… Replication for read scaling βœ… Sharding for write scaling βœ… Indexes for query optimization βœ… Cache layer (Redis) for hot data

Avoid These Mistakes

  • ❌ "Let's use NoSQL because it's faster" (no justification)
  • ❌ Not discussing replication/sharding
  • ❌ Ignoring consistency requirements
  • ❌ Not mentioning indexes


Practical Tips

1. Start with SQL

Unless you have a specific reason (massive scale, flexible schema), start with PostgreSQL or MySQL.

Why?

  • Mature, well-understood
  • Rich querying capabilities
  • Most applications fit relational model

Uber, Instagram, and Reddit all started with SQL.


2. Denormalize for Performance

Normalized (relational):

-- Fetch user's tweet with retweet count: 2 queries
SELECT * FROM tweets WHERE id = 123;
SELECT COUNT(*) FROM retweets WHERE tweet_id = 123;

Denormalized (fast):

tweets
+----+---------+----------------+
| id | content | retweet_count  |
+----+---------+----------------+
| 123| Hello   | 42             |
+----+---------+----------------+

-- Fetch in 1 query
SELECT * FROM tweets WHERE id = 123;

Trade-off: Update retweet_count on every retweet (extra write), but reads are faster.


3. Cache Aggressively

Pattern:

1. Check Redis cache
2. If miss: Query database
3. Store result in cache (TTL = 5 min)
4. Return data

Result: 90%+ requests served from cache (< 1ms), 10% from database (100ms)


4. Monitor Query Performance

Use:

  • PostgreSQL: EXPLAIN ANALYZE
  • MySQL: EXPLAIN
  • MongoDB: .explain()

Find slow queries, add indexes.


Conclusion

Database design is the foundation of scalable systems:

  • SQL: Structured data, transactions, complex queries
  • NoSQL: Flexibility, horizontal scaling, high throughput
  • Replication: Scale reads, high availability
  • Sharding: Scale writes, massive data
  • Indexing: Fast queries

Key takeaway: There's no "best" database. Match your data model and access patterns to the right storage system.

Instagram uses PostgreSQL. Facebook uses MySQL, Cassandra, HBase, and more. Twitter uses MySQL, Cassandra, and Redis.

Use the right tool for the job. Master the fundamentals, and you'll ace database design questions in system design interviewsβ€”and build systems that scale to billions of users.


Cover image by Kevin Ache on Unsplash

Support My Work

If this guide helped you learn something new, solve a problem, or ace your interviews, I'd really appreciate your support! Creating comprehensive, free content like this takes significant time and effort. Your support helps me continue sharing knowledge and creating more helpful resources for developers and students.

Buy me a Coffee

Every contribution, big or small, means the world to me and keeps me motivated to create more content!


Β© ojaswiat.com 2025-2027