High Availability in System Design: Achieving 99.99% Uptime
Learn high availability patterns for system design interviews. Master redundancy, failover, disaster recovery with examples from AWS, Netflix, Google achieving 99.99% uptime

October 4, 2021: Facebook, Instagram, and WhatsApp All Down for 6 Hours
3.5 billion users couldn't access their apps.
Businesses lost millions in revenue.
Facebook lost $60 million in ad revenue, every hour.
The cause? A configuration change took down their DNS servers, making all Facebook services unreachable. No redundancy, no failover.
This wasn't a hack. It wasn't a DDoS attack. It was a single point of failure that brought down the world's largest social network.
High availability isn't optional for critical systems; it's mandatory. In this guide, I'll show you exactly how companies like Netflix, AWS, and Google design systems for 99.99% uptime, even when servers fail, data centers burn down, or entire regions go offline.
What Is High Availability?
High Availability (HA) = System remains operational even when components fail
Goal: Minimize downtime
Understanding Uptime Percentages
| Availability | Downtime per Year | Downtime per Month | Downtime per Week |
|---|---|---|---|
| 90% (one nine) | 36.5 days | 72 hours | 16.8 hours |
| 99% (two nines) | 3.65 days | 7.2 hours | 1.68 hours |
| 99.9% (three nines) | 8.76 hours | 43.2 minutes | 10.1 minutes |
| 99.99% (four nines) | 52.56 minutes | 4.32 minutes | 1.01 minutes |
| 99.999% (five nines) | 5.26 minutes | 25.9 seconds | 6.05 seconds |
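These numbers fall straight out of downtime = (1 - availability) × period. Here is the yearly column reproduced in a few lines of Python:

```python
# Allowed downtime for a given availability target over a period of time.
def downtime_minutes(availability: float, period_hours: float) -> float:
    return (1 - availability) * period_hours * 60

HOURS_PER_YEAR = 365 * 24
for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {downtime_minutes(target, HOURS_PER_YEAR):.1f} min/year")
# 99.000% -> 5256.0 min/year ... 99.990% -> 52.6 min/year, 99.999% -> 5.3 min/year
```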
Real-World SLAs:
- AWS EC2: 99.99% (4 nines)
- Google Cloud: 99.95%
- Netflix: 99.99%+
- Banking systems: 99.999% (five nines)
Context:
- 99.9%: Acceptable for many apps (~8.8 hours downtime/year)
- 99.99%: Expected for enterprise apps (< 1 hour/year)
- 99.999%: Critical systems only (< 6 minutes/year, extremely expensive)
Principles of High Availability
1. Eliminate Single Points of Failure (SPOF)
Single Point of Failure = Component whose failure brings down entire system
Example: E-Commerce Site
Bad (SPOF):
```
Users → [Single Server] → [Single Database]
               ↓
      Fails? → Everything down ❌
```
Good (No SPOF):
```
            [Load Balancer]
                   ↓
       ┌───────────┼───────────┐
       ↓           ↓           ↓
  [Server 1]  [Server 2]  [Server 3]
       ↓           ↓           ↓
       [Master DB] → [Slave DB]
```
If:
- Server 2 fails → Load balancer routes to Server 1 & 3 ✅
- Master DB fails → Slave promoted to master ✅
2. Redundancy
Redundancy = Duplicate critical components
Types:
a) Active-Active (Both serve traffic)
```
     [Load Balancer]
            ↓
     ┌──────┴──────┐
     ↓             ↓
[Server A]    [Server B]
 (Active)      (Active)
```
Both handle 50% of traffic
Pros: ✅ Efficient resource usage ✅ Instant failover ✅ Better performance
Cons: ❌ Complex synchronization ❌ More expensive
Real Example: Netflix
Netflix runs active-active globally:
- US users → US servers (active)
- EU users → EU servers (active)
- All regions serve traffic simultaneously
b) Active-Passive (Standby waits)
```
     [Load Balancer]
            ↓
     ┌──────┴──────┐
     ↓             ↓
[Server A]    [Server B]
 (Active)     (Passive - standby)
```
Server B takes over only if A fails
Pros: ✅ Simpler ✅ Lower cost (standby can be smaller)
Cons: ❌ Wasted resources (standby idle) ❌ Failover delay (30-60 seconds)
When to use: Cost-sensitive applications where a short failover delay is acceptable
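To make the active-passive idea concrete, here's a minimal sketch of a failover watchdog: it polls the active node's health endpoint and promotes the standby after a few consecutive failures. The address and the promote_standby() hook are hypothetical placeholders, not any particular tool's API.

```python
import time
import requests

ACTIVE_HEALTH_URL = "http://10.0.0.1/health"   # hypothetical active node
FAILURE_THRESHOLD = 3                          # consecutive failures before failover

def is_healthy(url: str) -> bool:
    try:
        return requests.get(url, timeout=2).status_code == 200
    except requests.RequestException:
        return False

def promote_standby() -> None:
    # Placeholder: repoint a virtual IP / DNS record at the standby node.
    print("Promoting standby to active")

failures = 0
while True:
    failures = 0 if is_healthy(ACTIVE_HEALTH_URL) else failures + 1
    if failures >= FAILURE_THRESHOLD:
        promote_standby()              # fires after ~3 failed checks
        break
    time.sleep(5)                      # poll every 5 seconds
```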
3. Monitoring and Health Checks
Principle: Detect failures fast, recover faster
Health Check Types:
Shallow Health Check
```python
@app.route('/health')
def health():
    return {'status': 'healthy'}, 200
```
Checks: Server is responding
Response time: < 10ms
Deep Health Check
```python
@app.route('/health')
def health():
    # Check database connection
    try:
        db.execute("SELECT 1")
    except Exception:
        return {'status': 'unhealthy', 'reason': 'db_down'}, 503

    # Check cache connection
    try:
        cache.ping()
    except Exception:
        return {'status': 'unhealthy', 'reason': 'cache_down'}, 503

    # Check disk space (disk_usage() returns percent used)
    if disk_usage() > 90:
        return {'status': 'unhealthy', 'reason': 'disk_full'}, 503

    return {'status': 'healthy'}, 200
```
Checks:
- Database reachable
- Cache reachable
- Disk space available
Response time: 50-100ms
Best Practice: Use deep health checks (catch issues before they cause failures)
4. Failover
Failover = Automatically switch to backup when primary fails
Example: Database Failover
Without failover:
```
Master DB fails
      ↓
Application can't write
      ↓
Manual intervention needed (1-2 hours) ❌
      ↓
Business impact: lost revenue
```
With automatic failover:
```
Master DB fails
      ↓
Monitoring system detects the failure (10 seconds)
      ↓
Slave promoted to master (30 seconds)
      ↓
Application continues writing ✅
      ↓
Total downtime: 40 seconds
```
Tools:
- PostgreSQL: Patroni, repmgr
- MySQL: MHA (Master High Availability)
- MongoDB: Replica sets with automatic election
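Whichever tool handles the promotion, the application still has to survive the failover window. A minimal sketch of client-side retry logic with psycopg2, assuming the cluster is reached through one stable hostname (for example, kept pointing at the current primary by Patroni or a proxy); the hostname and retry budget are illustrative.

```python
import time
import psycopg2

def connect_with_retry(dsn: str, attempts: int = 10, delay: float = 3.0):
    """Keep retrying until the (possibly newly promoted) primary accepts connections."""
    for _ in range(attempts):
        try:
            return psycopg2.connect(dsn)
        except psycopg2.OperationalError:
            time.sleep(delay)          # wait out the failover window
    raise RuntimeError("primary did not come back within the retry budget")

# 'db-primary.internal' is an illustrative hostname that always points at the
# current primary (e.g. kept up to date by Patroni or a proxy such as HAProxy).
conn = connect_with_retry("host=db-primary.internal dbname=app user=app")
```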
Strategies for High Availability
1. Geographic Redundancy (Multi-Region)
Deploy across multiple geographic regions
Example: AWS Multi-Region Architecture
```
      [Route 53 - DNS Load Balancer]
                   ↓
        ┌──────────┴───────────┐
        ↓                      ↓
 [US-East Region]      [EU-West Region]
        ↓                      ↓
 [Load Balancer]        [Load Balancer]
        ↓                      ↓
  [App Servers]          [App Servers]
        ↓                      ↓
   [Database]             [Database]
```
Benefits: ✅ Disaster recovery (entire region fails → traffic reroutes) ✅ Low latency (users served from nearest region) ✅ Compliance (data stays in region for GDPR, etc.)
Real Example: Netflix
Netflix operates in 3 AWS regions:
- US-East
- US-West
- EU
If US-East fails (AWS outage), traffic automatically reroutes to US-West.
Result: Users in US experience slight latency increase, but service continues.
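Route 53 handles this rerouting at the DNS layer, but the same pattern can be sketched client-side: try the nearest regional endpoint and fall back to the next one if it's unreachable. The endpoint URLs are hypothetical.

```python
import requests

# Hypothetical regional endpoints, ordered by proximity to the caller.
REGION_ENDPOINTS = [
    "https://us-east.api.example.com",
    "https://us-west.api.example.com",
    "https://eu-west.api.example.com",
]

def fetch_with_regional_failover(path: str) -> requests.Response:
    last_error = None
    for base in REGION_ENDPOINTS:
        try:
            resp = requests.get(base + path, timeout=2)
            if resp.status_code < 500:
                return resp            # a healthy region answered
        except requests.RequestException as exc:
            last_error = exc           # region unreachable, try the next one
    raise RuntimeError(f"all regions failed: {last_error}")
```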
2. Availability Zones
Availability Zone (AZ) = Isolated data center within a region
Example: AWS US-East Region
```
US-East Region
├── AZ 1 (Data center in Virginia)
├── AZ 2 (Data center in Virginia, 10 miles away)
└── AZ 3 (Data center in Virginia, 15 miles away)
```
Each AZ has independent:
- Power supply
- Cooling
- Networking
High Availability Pattern:
```
        [Load Balancer]
              ↓
    ┌─────────┼─────────┐
    ↓         ↓         ↓
 [AZ 1]    [AZ 2]    [AZ 3]
  App       App       App
 Server    Server    Server
```
If AZ 1 loses power, load balancer routes to AZ 2 and AZ 3.
Real Example: Stripe
Stripe (payment processing) runs:
- Servers in all 3 AZs per region
- Database replicas in all 3 AZs
Result: AZ failure = < 1 second downtime (automatic failover)
3. Database High Availability
Master-Slave Replication
```
           [Master]
       (Handles writes)
              ↓
      [Async replication]
        ↓           ↓
   [Slave 1]    [Slave 2]
       (Handle reads)
```
Failover process:
1. Master fails
2. Monitoring detects (10 seconds)
3. Slave 1 promoted to master (30 seconds)
4. Application reconnects to new master
Total downtime: 40 seconds
Limitations:
- Replication lag: Slaves slightly behind master (100-500ms)
- Potential data loss: Last writes might not be replicated before failure
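A common application-level companion to this topology is read/write splitting: send writes to the master and spread reads across the slaves. A minimal sketch with illustrative connection strings; note that the replication-lag caveat above applies to every replica read.

```python
import random
import psycopg2

MASTER_DSN   = "host=db-master.internal dbname=app user=app"     # illustrative
REPLICA_DSNS = [
    "host=db-replica-1.internal dbname=app user=app",
    "host=db-replica-2.internal dbname=app user=app",
]

def get_connection(readonly: bool):
    """Send writes to the master, spread reads over the replicas.

    Because replication is asynchronous, a read right after a write
    may not see it yet (replication lag)."""
    dsn = random.choice(REPLICA_DSNS) if readonly else MASTER_DSN
    return psycopg2.connect(dsn)

# Writes hit the master...
with get_connection(readonly=False) as conn:
    conn.cursor().execute("INSERT INTO orders (user_id) VALUES (%s)", (42,))

# ...reads are spread across the replicas.
with get_connection(readonly=True) as conn:
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM orders")
```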
Multi-Master Replication
```
[Master 1]  ⇄  [Master 2]
(Region 1)     (Region 2)
```
Both accept writes, replicate to each other.
Benefits: ✅ No downtime on master failure ✅ Low write latency globally
Challenges: ❌ Write conflicts (same record updated on both)
Conflict Resolution:
- Last write wins (timestamp-based)
- Custom merge logic
Real Example: CockroachDB, Google Spanner
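Last write wins is the simplest of these strategies. A sketch of the merge rule, assuming every replica stamps each write with a timestamp; note that the losing write is silently dropped, and clock skew between replicas can pick the "wrong" winner.

```python
from dataclasses import dataclass

@dataclass
class VersionedValue:
    value: str
    timestamp: float        # e.g. time.time() recorded at write time, per replica

def last_write_wins(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Resolve a conflict by keeping the most recent write."""
    return a if a.timestamp >= b.timestamp else b
```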
4. Chaos Engineering
Principle: Intentionally break things in production to test resilience
Netflix Chaos Monkey:
- Randomly kills production servers
- Forces engineers to build fault-tolerant systems
- Netflix is confident enough in this approach that they open-sourced the tool as Chaos Monkey
Example:
```
10:00 AM: Chaos Monkey kills server in AZ 1
      ↓
Load balancer detects failure (5 seconds)
      ↓
Traffic reroutes to AZ 2 and AZ 3
      ↓
Auto-scaling launches new server in AZ 1 (2 minutes)
      ↓
System self-heals
```
Result: Netflix has extremely high availability because they constantly test failure scenarios.
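You don't need Netflix's tooling to apply the idea. A hedged sketch of one chaos round: with some probability, pick a random instance from the fleet and terminate it, then watch whether the load balancer and auto-scaling recover on their own. list_instances() and terminate() are placeholders for your own infrastructure API, not Chaos Monkey's interface.

```python
import random

def list_instances() -> list:
    # Placeholder: query your cloud provider / orchestrator for running instances.
    return ["i-0a1", "i-0b2", "i-0c3"]

def terminate(instance_id: str) -> None:
    # Placeholder: call your provider's terminate API here.
    print(f"terminating {instance_id}")

def chaos_round(probability: float = 0.1) -> None:
    """With some probability, kill one random instance and let the platform
    (load balancer + auto-scaling) prove it can recover on its own."""
    if random.random() < probability:
        terminate(random.choice(list_instances()))

chaos_round()
```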
Real-World High Availability Examples
1. Amazon.com
Black Friday 2020:
- Billions of requests
- 99.99% uptime (< 1 minute downtime)
How?
1. Multi-region deployment (US, EU, Asia)
2. Each region: 3 AZs with redundant servers
3. Auto-scaling (10,000+ servers during peak)
4. Circuit breakers (if checkout fails, show cached page; a sketch follows this list)
5. Graceful degradation (non-essential features disabled during overload)
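A circuit breaker (step 4 above) can be sketched in a few lines: after repeated failures it "opens" and serves a fallback instead of hammering the failing dependency, then retries after a cool-down. The thresholds and the checkout functions are illustrative.

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures   # failures before the circuit opens
        self.reset_after = reset_after     # seconds to wait before retrying
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While open and inside the cool-down window, fail fast to the fallback.
        if self.opened_at and time.time() - self.opened_at < self.reset_after:
            return fallback()
        try:
            result = fn()
            self.failures, self.opened_at = 0, None   # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()           # open the circuit
            return fallback()

# Usage (hypothetical functions):
#   breaker = CircuitBreaker()
#   page = breaker.call(render_checkout, cached_checkout_page)
```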
2. Gmail
SLA: 99.9% uptime
High Availability Strategy:
1. Multi-region (15+ global data centers)
2. Data replicated 3x (across data centers)
3. Database: Multi-master (writes go to nearest data center)
4. Failover: Automatic (< 10 seconds)
5. Offline mode (Progressive Web App caches emails)
Result: Gmail has been down for < 1 hour total in the past 5 years.
3. Uber
Challenge: Drivers and riders need real-time matching
Requirements:
- Low latency (match in < 5 seconds)
- High availability (drivers depend on app for income)
Architecture:
1. Geographic sharding (each city = separate database)
2. Multi-AZ deployment per city
3. Ringpop (peer-to-peer service discovery)
4. If one server fails, others in city handle requests
5. If entire city data center fails, route to nearest city
Result: 99.99% uptime globally
Graceful Degradation
Principle: When failures happen, degrade functionality instead of crashing
Example: Twitter
Full functionality:
```
Timeline → Fetch tweets from database
         → Rank by algorithm
         → Fetch user profiles
         → Fetch media (images/videos)
         → Return enriched timeline
```
Database under heavy load (degraded mode):
```
Timeline → Fetch from cache (5 minutes old) ✅
         → Skip ranking (show chronologically)
         → Show placeholder avatars
         → Lazy-load media
         → Return basic timeline (fast, but fewer features)
```
Result: Users still get timeline, even if it's not perfect.
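The same "serve something stale rather than nothing" behaviour can be sketched as a cache fallback around the primary path; the in-memory cache and the failing pipeline stub below are stand-ins for the real thing.

```python
import time

CACHE: dict = {}            # key -> (stored_at, value); stand-in for Redis/memcached
CACHE_TTL = 300             # serve cached timelines up to 5 minutes old

def fetch_full_timeline(user_id: int) -> list:
    # Placeholder for the real pipeline: DB fetch, ranking, profile/media joins.
    raise TimeoutError("database under heavy load")

def get_timeline(user_id: int) -> list:
    """Full pipeline when healthy; stale cache (degraded mode) when it fails."""
    key = f"timeline:{user_id}"
    try:
        timeline = fetch_full_timeline(user_id)
        CACHE[key] = (time.time(), timeline)
        return timeline
    except Exception:
        stored_at, cached = CACHE.get(key, (0.0, []))
        if time.time() - stored_at <= CACHE_TTL:
            return cached          # stale but usable
        return []                  # last resort: empty timeline, not an error page

print(get_timeline(42))            # pipeline fails, falls back, prints []
```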
Example: Netflix
Degradation steps:
1. Full quality: 4K streaming
2. Load increases: Reduce to 1080p
3. Heavy load: Reduce to 720p
4. Critical load: Show cached thumbnails, delay video start
5. Extreme: Show "Retry" page
Better than: Complete service outage.
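That ladder boils down to a mapping from measured load to the quality tier worth serving; the thresholds below are purely illustrative.

```python
def quality_for_load(load: float) -> str:
    """Pick a streaming tier from current load (0.0 = idle, 1.0 = saturated)."""
    if load < 0.60:
        return "4K"
    if load < 0.75:
        return "1080p"
    if load < 0.90:
        return "720p"
    if load < 0.97:
        return "thumbnails-only"   # cached thumbnails, delayed video start
    return "retry-page"            # last resort before a full outage

print(quality_for_load(0.8))       # -> 720p
```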
System Design Interview Tips
Common Question: "How would you ensure 99.99% uptime for your system?"
Answer Template:
1. Eliminate SPOF
- Load balancers (2+, with health checks)
- Application servers (10+, auto-scaling)
- Database (master-slave with automatic failover)
- Cache (Redis cluster, not single instance)
2. Multi-AZ Deployment
- Deploy across 3 Availability Zones
- If one AZ fails, traffic routes to others
- Downtime: < 30 seconds
3. Multi-Region (if global app)
- US, EU, Asia regions
- DNS-based routing to nearest region
- If region fails, failover to next nearest
4. Monitoring
- Health checks every 10 seconds
- Alerting (PagerDuty, Slack)
- Automatic remediation (restart failed services)
5. Graceful Degradation
- If database slow: serve cached data
- If payment gateway down: queue payments for retry
- Never show complete error page
6. Testing
- Chaos engineering (randomly kill servers)
- Load testing (simulate peak traffic)
- Disaster recovery drills (simulate region failure)
What to Mention
✅ Multiple availability zones
✅ Load balancers with health checks
✅ Database replication and failover
✅ Monitoring and alerting
✅ Graceful degradation
✅ Auto-scaling
Avoid These Mistakes
❌ Not mentioning a specific uptime percentage (say "99.99%")
❌ Only focusing on servers (forgetting database HA)
❌ No monitoring strategy
❌ Not discussing failover time (aim for < 1 minute)
Trade-Offs
High Availability = Higher Cost
| Component | Basic | High Availability | Cost Increase |
|---|---|---|---|
| Servers | 1 server | 6 servers (2 per AZ × 3 AZs) | 6x |
| Database | 1 instance | 1 master + 2 slaves | 3x |
| Load Balancer | None | 2 (redundant) | 2x hardware |
| Regions | 1 | 2-3 | 2-3x infrastructure |
Total: Achieving 99.99% can cost 5-10x more than basic setup.
When to invest in HA:
- High cost of downtime (e-commerce, banking)
- Business-critical (SaaS with SLA commitments)
- User expectation (Gmail, Netflix: users expect "always on")
When NOT to over-invest:
- Internal tools (99% might be fine)
- Early startups (better to spend on product)
- Low traffic (manual restart acceptable)
Practical Checklist
For 99.9% (Three Nines):
✅ Load balancer + 2+ servers
✅ Database replication (master-slave)
✅ Health checks
✅ Monitoring and alerting
For 99.99% (Four Nines):
✅ Multi-AZ deployment (3 zones)
✅ Auto-scaling
✅ Automatic database failover
✅ Redundant load balancers
✅ Cache layer (Redis cluster)
✅ Graceful degradation
For 99.999% (Five Nines):
✅ Multi-region (2-3 regions)
✅ Chaos engineering
✅ Zero-downtime deployments
✅ Active-active globally
✅ Advanced monitoring (Datadog, New Relic)
✅ Dedicated ops team (24/7)
Conclusion
High availability is about resilience:
- Eliminate single points of failure
- Deploy redundantly (multiple servers, AZs, regions)
- Monitor and failover automatically
- Degrade gracefully (never show complete failure)
Real-world uptime:
- 99.9%: Good (~8.8 hours downtime/year)
- 99.99%: Great (< 1 hour/year)
- 99.999%: Exceptional (< 6 minutes/year)
Facebook's 6-hour outage in 2021 was a $400 million lesson: high availability isn't optional for critical systems.
Design for failure, test relentlessly, and your system will stay up when it matters most.