High Availability in System Design: Achieving 99.99% Uptime


📅 Published: March 8, 2025 ✏️ Updated: April 12, 2025 By Ojaswi Athghara
#availability #uptime #redundancy #failover #system-design


October 4, 2021: Facebook, Instagram, and WhatsApp Down for 6 Hours

3.5 billion users couldn't access their apps.

Businesses lost millions in revenue.

Facebook lost an estimated $60 million in ad revenue for every hour of downtime.

The cause? A faulty configuration change withdrew Facebook's BGP routes, knocking its DNS servers off the internet and making every Facebook service unreachable. No redundancy, no failover.

This wasn't a hack. It wasn't a DDoS attack. It was a single point of failure that brought down the world's largest social network.

High availability isn't optional for critical systems; it's mandatory. In this guide, I'll show you exactly how companies like Netflix, AWS, and Google design systems for 99.99% uptime, even when servers fail, data centers burn down, or entire regions go offline.


What Is High Availability?

High Availability (HA) = System remains operational even when components fail

Goal: Minimize downtime


Understanding Uptime Percentages

Availability          Downtime per Year   Downtime per Month   Downtime per Week
90% (one nine)        36.5 days           72 hours             16.8 hours
99% (two nines)       3.65 days           7.2 hours            1.68 hours
99.9% (three nines)   8.76 hours          43.2 minutes         10.1 minutes
99.99% (four nines)   52.56 minutes       4.32 minutes         1.01 minutes
99.999% (five nines)  5.26 minutes        25.9 seconds         6.05 seconds

Real-World SLAs:

  • AWS EC2: 99.99% (four nines)
  • Google Cloud: 99.95%
  • Netflix: 99.99%+
  • Banking systems: 99.999% (five nines)

Context:

  • 99.9%: Acceptable for many apps (~8.8 hours downtime/year)
  • 99.99%: Expected for enterprise apps (< 1 hour/year)
  • 99.999%: Critical systems only (< 6 minutes/year, extremely expensive)
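
The numbers in the table are simple arithmetic: multiply the allowed failure fraction by the length of the time window. A quick sketch for computing any downtime budget:

def downtime_budget(availability_pct):
    """Convert an availability percentage into allowed downtime."""
    down_fraction = 1 - availability_pct / 100
    return {
        'per_year_hours': round(down_fraction * 365 * 24, 2),
        'per_month_minutes': round(down_fraction * 30 * 24 * 60, 2),
        'per_week_minutes': round(down_fraction * 7 * 24 * 60, 2),
    }

print(downtime_budget(99.99))
# {'per_year_hours': 0.88, 'per_month_minutes': 4.32, 'per_week_minutes': 1.01}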

Principles of High Availability

1. Eliminate Single Points of Failure (SPOF)

Single Point of Failure = Component whose failure brings down entire system

Example: E-Commerce Site

Bad (SPOF):

Users → [Single Server] → [Single Database]
           ↓                    ↓
        Fails? → Everything down ❌

Good (No SPOF):

            [Load Balancer]
                  ↓
       ┌──────────┼──────────┐
       ↓          ↓          ↓
   [Server 1] [Server 2] [Server 3]
       ↓          ↓          ↓
        [Master DB] → [Slave DB]

If:

  • Server 2 fails → Load balancer routes to Server 1 & 3 ✅
  • Master DB fails → Slave promoted to master ✅
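
Conceptually, the load balancer's job here is small: keep a pool of servers and skip any that fail health checks. A minimal sketch of that routing logic (real load balancers like HAProxy or AWS ELB implement this with configurable health checks):

import itertools

class RoundRobinBalancer:
    """Round-robin across servers, skipping unhealthy ones."""

    def __init__(self, servers, is_healthy):
        self.servers = servers
        self.is_healthy = is_healthy  # callable: server -> bool
        self._cycle = itertools.cycle(servers)

    def pick(self):
        # Try each server at most once per request
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if self.is_healthy(server):
                return server
        raise RuntimeError("no healthy servers available")

With three servers and one failure, pick() simply stops returning the dead server; clients never notice.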

2. Redundancy

Redundancy = Duplicate critical components

Types:

a) Active-Active (Both serve traffic)

        [Load Balancer]
              ↓
       ┌──────┴──────┐
       ↓             ↓
  [Server A]    [Server B]
  (Active)      (Active)

Each server handles ~50% of the traffic

Pros: ✅ Efficient resource usage ✅ Instant failover ✅ Better performance

Cons: ❌ Complex synchronization ❌ More expensive

Real Example: Netflix

Netflix runs active-active globally:

  • US users → US servers (active)
  • EU users → EU servers (active)
  • All regions serve traffic simultaneously

b) Active-Passive (Standby waits)

        [Load Balancer]
              ↓
       ┌──────┴──────┐
       ↓             ↓
  [Server A]    [Server B]
  (Active)      (Passive - standby)

Server B takes over only if A fails

Pros: ✅ Simpler ✅ Lower cost (standby can be smaller)

Cons: ❌ Wasted resources (standby idle) ❌ Failover delay (30-60 seconds)

When to use: Cost-sensitive applications, acceptable short downtime


3. Monitoring and Health Checks

Principle: Detect failures fast, recover faster

Health Check Types:

Shallow Health Check

from flask import Flask
app = Flask(__name__)

@app.route('/health')
def health():
    return {'status': 'healthy'}, 200

Checks: Server is responding

Response time: < 10ms


Deep Health Check

import shutil

def disk_usage_percent(path='/'):
    # Percentage of disk space currently in use at the given path
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

@app.route('/health')
def health():
    # db and cache are the application's database and cache clients
    # Check database connection
    try:
        db.execute("SELECT 1")
    except Exception:
        return {'status': 'unhealthy', 'reason': 'db_down'}, 503

    # Check cache connection
    try:
        cache.ping()
    except Exception:
        return {'status': 'unhealthy', 'reason': 'cache_down'}, 503

    # Check disk space
    if disk_usage_percent() > 90:
        return {'status': 'unhealthy', 'reason': 'disk_full'}, 503

    return {'status': 'healthy'}, 200

Checks:

  • Database reachable
  • Cache reachable
  • Disk space available

Response time: 50-100ms

Best Practice: Use deep health checks (catch issues before they cause failures)


4. Failover

Failover = Automatically switch to backup when primary fails

Example: Database Failover

Without failover:

Master DB fails
    ↓
Application can't write
    ↓
Manual intervention needed (1-2 hours) ❌
    ↓
Business impact: lost revenue

With automatic failover:

Master DB fails (detected in 10 seconds)
    ↓
Monitoring system detects failure
    ↓
Slave promoted to master (30 seconds)
    ↓
Application continues writing ✅
    ↓
Total downtime: 40 seconds

Tools:

  • PostgreSQL: Patroni, repmgr
  • MySQL: MHA (Master High Availability)
  • MongoDB: Replica sets with automatic election
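
Under the hood, all of these tools run some variant of a detect-and-promote loop. A minimal sketch, assuming you supply the health-check and promotion callables:

import time

def failover_loop(check_primary, promote_replica, interval_s=10, threshold=3):
    """Promote a replica after several consecutive failed health checks."""
    consecutive_failures = 0
    while True:
        if check_primary():       # e.g. run "SELECT 1" with a short timeout
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            # Requiring consecutive failures avoids failing over
            # on a single dropped packet (flapping).
            if consecutive_failures >= threshold:
                promote_replica()  # e.g. promote the most up-to-date replica
                return
        time.sleep(interval_s)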

Strategies for High Availability

1. Geographic Redundancy (Multi-Region)

Deploy across multiple geographic regions

Example: AWS Multi-Region Architecture

        [Route 53 - DNS Load Balancer]
                     ↓
          ┌───────────┼───────────┐
         ↓                       ↓
  [US-East Region]        [EU-West Region]
       ↓                          ↓
  [Load Balancer]           [Load Balancer]
       ↓                          ↓
  [App Servers]             [App Servers]
       ↓                          ↓
   [Database]                [Database]

Benefits: ✅ Disaster recovery (entire region fails → traffic reroutes) ✅ Low latency (users served from nearest region) ✅ Compliance (data stays in region for GDPR, etc.)

Real Example: Netflix

Netflix operates in 3 AWS regions:

  • US-East
  • US-West
  • EU

If US-East fails (AWS outage), traffic automatically reroutes to US-West.

Result: Users in US experience slight latency increase, but service continues.
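
DNS-level failover (Route 53 health checks) does this automatically, but the idea is easy to sketch client-side: try the nearest region first, then fall down an ordered list. The endpoints below are hypothetical:

import requests

# Hypothetical regional endpoints, ordered nearest-first for this client
REGIONS = [
    "https://us-east.api.example.com",
    "https://us-west.api.example.com",
    "https://eu-west.api.example.com",
]

def get_with_region_failover(path):
    """Try each region in order; fail over when one is unreachable."""
    last_error = None
    for base in REGIONS:
        try:
            resp = requests.get(base + path, timeout=2)
            if resp.status_code < 500:
                return resp
        except requests.RequestException as err:
            last_error = err  # region down or unreachable; try the next
    raise RuntimeError("all regions failed") from last_error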


2. Availability Zones

Availability Zone (AZ) = Isolated data center within a region

Example: AWS US-East Region

US-East Region
  ├── AZ 1 (Data center in Virginia)
  ├── AZ 2 (Data center in Virginia, 10 miles away)
  └── AZ 3 (Data center in Virginia, 15 miles away)

Each AZ has independent:

  • Power supply
  • Cooling
  • Networking

High Availability Pattern:

        [Load Balancer]
              ↓
     ┌─────────┼─────────┐
    ↓         ↓         ↓
  [AZ 1]    [AZ 2]    [AZ 3]
   App       App       App
  Server    Server    Server

If AZ 1 loses power, load balancer routes to AZ 2 and AZ 3.

Real Example: Stripe

Stripe (payment processing) runs:

  • Servers in all 3 AZs per region
  • Database replicas in all 3 AZs

Result: AZ failure = < 1 second downtime (automatic failover)


3. Database High Availability

Master-Slave Replication

      [Master]
   (Handles writes)
         ↓
  [Async replication]
         ↓
      [Slave 1]  [Slave 2]
   (Handle reads)

Failover process:

1. Master fails
2. Monitoring detects (10 seconds)
3. Slave 1 promoted to master (30 seconds)
4. Application reconnects to new master
Total downtime: 40 seconds

Limitations:

  • Replication lag: Slaves slightly behind master (100-500ms)
  • Potential data loss: Last writes might not be replicated before failure

Multi-Master Replication

[Master 1] ←→ [Master 2]
  (Region 1)   (Region 2)

Both accept writes, replicate to each other.

Benefits: ✅ No downtime on master failure ✅ Low write latency globally

Challenges: ❌ Write conflicts (same record updated on both)

Conflict Resolution:

  • Last write wins (timestamp-based)
  • Custom merge logic
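
Last write wins is the simplest to implement. A minimal sketch, assuming every write carries a timestamp plus an originating region as a deterministic tie-breaker:

from dataclasses import dataclass

@dataclass
class VersionedWrite:
    value: str
    timestamp: float  # wall clock or hybrid logical clock
    region: str       # tie-breaker so both masters pick the same winner

def resolve_lww(a, b):
    """Keep the later write; break timestamp ties by region name."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return a if a.region > b.region else b

Note the trade-off: the losing write is silently discarded, which is why clock skew makes last-write-wins risky for critical data.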

Real Example: CockroachDB and Google Spanner (which largely avoid write conflicts by using consensus and tightly synchronized clocks)


4. Chaos Engineering

Principle: Intentionally break things in production to test resilience

Netflix Chaos Monkey:

  • Randomly kills production servers
  • Forces engineers to build fault-tolerant systems
  • Netflix is so confident in the approach that they open-sourced Chaos Monkey

Example:

10:00 AM: Chaos Monkey kills server in AZ 1
    ↓
Load balancer detects failure (5 seconds)
    ↓
Traffic reroutes to AZ 2 and AZ 3
    ↓
Auto-scaling launches new server in AZ 1 (2 minutes)
    ↓
System self-heals

Result: Netflix has extremely high availability because they constantly test failure scenarios.
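
A chaos experiment needs surprisingly little code. A minimal sketch using boto3 against AWS EC2; the opt-in tag is a hypothetical convention, and the real Chaos Monkey adds scheduling, blast-radius limits, and safety switches:

import random
import boto3

def kill_random_instance(tag_key="chaos-opt-in"):
    """Terminate one random opted-in EC2 instance."""
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        return None  # nothing opted in; no experiment today
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim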


Real-World High Availability Examples

1. Amazon.com

Black Friday 2020:

  • Billions of requests
  • 99.99% uptime (< 1 minute downtime)

How?

1. Multi-region deployment (US, EU, Asia)
2. Each region: 3 AZs with redundant servers
3. Auto-scaling (10,000+ servers during peak)
4. Circuit breakers (if checkout fails, show cached page; see the sketch after this list)
5. Graceful degradation (non-essential features disabled during overload)
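
The circuit breaker in step 4 is worth sketching: after repeated failures it stops calling the broken dependency and serves a fallback until a cooldown expires. A minimal version (production systems typically use a library such as pybreaker or resilience4j):

import time

class CircuitBreaker:
    """Stop calling a failing dependency; serve a fallback instead."""

    def __init__(self, max_failures=5, reset_timeout_s=30):
        self.max_failures = max_failures
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        now = time.time()
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_timeout_s:
                return fallback()  # circuit open: skip the call entirely
            # cooldown expired: fall through and allow one trial call
        try:
            result = fn()
            self.failures = 0
            self.opened_at = None  # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures or self.opened_at is not None:
                self.opened_at = now  # open (or re-open) the circuit
            return fallback()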

2. Gmail

SLA: 99.9% uptime

High Availability Strategy:

1. Multi-region (15+ global data centers)
2. Data replicated 3x (across data centers)
3. Database: Multi-master (writes go to nearest data center)
4. Failover: Automatic (< 10 seconds)
5. Offline mode (Progressive Web App caches emails)

Result: Gmail has been down for < 1 hour total in the past 5 years.


3. Uber

Challenge: Drivers and riders need real-time matching

Requirements:

  • Low latency (match in < 5 seconds)
  • High availability (drivers depend on app for income)

Architecture:

1. Geographic sharding (each city = separate database)
2. Multi-AZ deployment per city
3. Ringpop (peer-to-peer service discovery and sharding)
4. If one server fails, others in city handle requests
5. If entire city data center fails, route to nearest city

Result: 99.99% uptime globally
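
City-level sharding boils down to a lookup table plus a fallback route. A toy sketch with hypothetical cities and connection strings:

# Hypothetical city -> database mapping
CITY_SHARDS = {
    "san_francisco": "postgres://sf-db.internal/rides",
    "new_york": "postgres://nyc-db.internal/rides",
    "london": "postgres://lon-db.internal/rides",
}

# Assumed nearest-city fallbacks for when a shard is down
NEAREST_CITY = {
    "san_francisco": "new_york",
    "new_york": "san_francisco",
    "london": "new_york",
}

def shard_for(city, is_healthy):
    """Route to the city's shard; fall back to the nearest city's shard."""
    shard = CITY_SHARDS[city]
    if is_healthy(shard):
        return shard
    return CITY_SHARDS[NEAREST_CITY[city]]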


Graceful Degradation

Principle: When failures happen, degrade functionality instead of crashing

Example: Twitter

Full functionality:

Timeline → Fetch tweets from database
        → Rank by algorithm
        → Fetch user profiles
        → Fetch media (images/videos)
        → Return enriched timeline

Database under heavy load (degraded mode):

Timeline → Fetch from cache (5 minutes old) ✅
        → Skip ranking (show chronologically)
        → Show placeholder avatars
        → Lazy-load media
        → Return basic timeline (fast, but fewer features)

Result: Users still get timeline, even if it's not perfect.
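
In code, this degradation is just a try/except around the fresh path. A sketch with hypothetical db and cache handles:

def get_timeline(user_id, db, cache):
    """Serve fresh data when possible; degrade to a stale cached copy."""
    key = f"timeline:{user_id}"
    try:
        tweets = db.fetch_timeline(user_id, timeout=0.5)  # hypothetical call
        cache.set(key, tweets, ttl=300)  # refresh the fallback copy
        return {"tweets": tweets, "degraded": False}
    except Exception:
        stale = cache.get(key)  # may be minutes old
        if stale is not None:
            return {"tweets": stale, "degraded": True}
        raise  # nothing cached: surface the error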


Example: Netflix

Degradation steps:

1. Full quality: 4K streaming
2. Load increases: Reduce to 1080p
3. Heavy load: Reduce to 720p
4. Critical load: Show cached thumbnails, delay video start
5. Extreme: Show "Retry" page

Better than: Complete service outage.
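
This kind of stepwise policy can be expressed as a simple load-to-quality mapping (the thresholds below are illustrative, not Netflix's actual values):

QUALITY_LADDER = ["4K", "1080p", "720p", "480p"]

def pick_quality(load_pct):
    """Step down the quality ladder as system load rises."""
    if load_pct < 60:
        return QUALITY_LADDER[0]
    if load_pct < 75:
        return QUALITY_LADDER[1]
    if load_pct < 90:
        return QUALITY_LADDER[2]
    return QUALITY_LADDER[3]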


System Design Interview Tips

Common Question: "How would you ensure 99.99% uptime for your system?"

Answer Template:

1. Eliminate SPOF

- Load balancers (2+, with health checks)
- Application servers (10+, auto-scaling)
- Database (master-slave with automatic failover)
- Cache (Redis cluster, not single instance)

2. Multi-AZ Deployment

- Deploy across 3 Availability Zones
- If one AZ fails, traffic routes to others
- Downtime: < 30 seconds

3. Multi-Region (if global app)

- US, EU, Asia regions
- DNS-based routing to nearest region
- If region fails, failover to next nearest

4. Monitoring

- Health checks every 10 seconds
- Alerting (PagerDuty, Slack)
- Automatic remediation (restart failed services)

5. Graceful Degradation

- If database slow: serve cached data
- If payment gateway down: queue payments for retry
- Never show complete error page

6. Testing

- Chaos engineering (randomly kill servers)
- Load testing (simulate peak traffic)
- Disaster recovery drills (simulate region failure)

What to Mention

✅ Multiple availability zones ✅ Load balancers with health checks ✅ Database replication and failover ✅ Monitoring and alerting ✅ Graceful degradation ✅ Auto-scaling

Avoid These Mistakes

❌ Not mentioning specific uptime percentage (say "99.99%") ❌ Only focusing on servers (forget about database HA) ❌ No monitoring strategy ❌ Not discussing failover time (aim for < 1 minute)


Trade-Offs

High Availability = Higher Cost

Component        Basic         High Availability               Cost Increase
Servers          1 server      6 servers (2 per AZ × 3 AZs)    6x
Database         1 instance    1 master + 2 slaves             3x
Load Balancer    None          2 (redundant)                   2x hardware
Regions          1             2-3                             2-3x infrastructure

Total: Achieving 99.99% can cost 5-10x more than a basic setup.

When to invest in HA:

  • High cost of downtime (e-commerce, banking)
  • Business-critical (SaaS with SLA commitments)
  • User expectation (Gmail, Netflix: users expect "always on")

When NOT to over-invest:

  • Internal tools (99% might be fine)
  • Early startups (better to spend on product)
  • Low traffic (manual restart acceptable)

Practical Checklist

For 99.9% (Three Nines):

✅ Load balancer + 2+ servers
✅ Database replication (master-slave)
✅ Health checks
✅ Monitoring and alerting

For 99.99% (Four Nines):

✅ Multi-AZ deployment (3 zones)
✅ Auto-scaling
✅ Automatic database failover
✅ Redundant load balancers
✅ Cache layer (Redis cluster)
✅ Graceful degradation

For 99.999% (Five Nines):

✅ Multi-region (2-3 regions)
✅ Chaos engineering
✅ Zero-downtime deployments
✅ Active-active globally
✅ Advanced monitoring (Datadog, New Relic)
✅ Dedicated ops team (24/7)

Conclusion

High availability is about resilience:

  • Eliminate single points of failure
  • Deploy redundantly (multiple servers, AZs, regions)
  • Monitor and failover automatically
  • Degrade gracefully (never show complete failure)

Real-world uptime:

  • 99.9%: Good (~8.8 hours downtime/year)
  • 99.99%: Great (< 1 hour/year)
  • 99.999%: Exceptional (< 6 minutes/year)

Facebook's 6-hour outage in 2021 was, by some estimates, a $400 million lesson: high availability isn't optional for critical systems.

Design for failure, test relentlessly, and your system will stay up when it matters most.


Cover image by Erik Mclean on Unsplash

Support My Work

If this guide helped you learn something new, solve a problem, or ace your interviews, I'd really appreciate your support! Creating comprehensive, free content like this takes significant time and effort. Your support helps me continue sharing knowledge and creating more helpful resources for developers and students.

Buy me a Coffee

Every contribution, big or small, means the world to me and keeps me motivated to create more content!
