High Availability in System Design: Achieving 99.99% Uptime
Learn high availability patterns for system design interviews. Master redundancy, failover, disaster recovery with examples from AWS, Netflix, Google achieving 99.99% uptime

October 4, 2021: Facebook, Instagram, and WhatsApp All Down for 6 Hours
3.5 billion users couldn't access their apps.
Businesses lost millions in revenue.
Facebook lost $60 million in ad revenue, every hour.
The cause? A configuration change took down their DNS servers, making all Facebook services unreachable. No redundancy, no failover.
This wasn't a hack. It wasn't a DDoS attack. It was a single point of failure that brought down the world's largest social network.
High availability isn't optional for critical systems; it's mandatory. In this guide, I'll show you exactly how companies like Netflix, AWS, and Google design systems for 99.99% uptime, even when servers fail, data centers burn down, or entire regions go offline.
What Is High Availability?
High Availability (HA) = System remains operational even when components fail
Goal: Minimize downtime
Understanding Uptime Percentages
| Availability | Downtime per Year | Downtime per Month | Downtime per Week |
|---|---|---|---|
| 90% (one nine) | 36.5 days | 72 hours | 16.8 hours |
| 99% (two nines) | 3.65 days | 7.2 hours | 1.68 hours |
| 99.9% (three nines) | 8.76 hours | 43.2 minutes | 10.1 minutes |
| 99.99% (four nines) | 52.56 minutes | 4.32 minutes | 1.01 minutes |
| 99.999% (five nines) | 5.26 minutes | 25.9 seconds | 6.05 seconds |
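These numbers fall straight out of downtime = (1 - availability) × period. Here is the yearly column reproduced in a few lines of Python:

```python
# Allowed downtime for a given availability target over a period of time.
def downtime_minutes(availability: float, period_hours: float) -> float:
    return (1 - availability) * period_hours * 60

HOURS_PER_YEAR = 365 * 24
for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {downtime_minutes(target, HOURS_PER_YEAR):.1f} min/year")
# 99.000% -> 5256.0 min/year ... 99.990% -> 52.6 min/year, 99.999% -> 5.3 min/year
```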
Real-World SLAs:
- AWS EC2: 99.99% (4 nines)
- Google Cloud: 99.95%
- Netflix: 99.99%+
- Banking systems: 99.999% (five nines)
Context:
- 99.9%: Acceptable for many apps (~8.8 hours downtime/year)
- 99.99%: Expected for enterprise apps (< 1 hour/year)
- 99.999%: Critical systems only (< 6 minutes/year, extremely expensive)
Principles of High Availability
1. Eliminate Single Points of Failure (SPOF)
Single Point of Failure = Component whose failure brings down entire system
Example: E-Commerce Site
Bad (SPOF):
```
Users → [Single Server] → [Single Database]
               ↓
      Fails? → Everything down ❌
```
Good (No SPOF):
```
            [Load Balancer]
                   ↓
       ┌───────────┼───────────┐
       ↓           ↓           ↓
  [Server 1]  [Server 2]  [Server 3]
       ↓           ↓           ↓
       [Master DB] → [Slave DB]
```
If:
- Server 2 fails → Load balancer routes to Server 1 & 3 ✅
- Master DB fails → Slave promoted to master ✅
2. Redundancy
Redundancy = Duplicate critical components
Types:
a) Active-Active (Both serve traffic)
```
     [Load Balancer]
            ↓
     ┌──────┴──────┐
     ↓             ↓
[Server A]    [Server B]
 (Active)      (Active)
```
Both handle 50% of traffic
Pros: ✅ Efficient resource usage ✅ Instant failover ✅ Better performance
Cons: ❌ Complex synchronization ❌ More expensive
Real Example: Netflix
Netflix runs active-active globally:
- US users → US servers (active)
- EU users → EU servers (active)
- All regions serve traffic simultaneously
b) Active-Passive (Standby waits)
```
     [Load Balancer]
            ↓
     ┌──────┴──────┐
     ↓             ↓
[Server A]    [Server B]
 (Active)     (Passive - standby)
```
Server B takes over only if A fails
Pros: ✅ Simpler ✅ Lower cost (standby can be smaller)
Cons: ❌ Wasted resources (standby idle) ❌ Failover delay (30-60 seconds)
When to use: Cost-sensitive applications where a short failover delay is acceptable
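To make the active-passive idea concrete, here's a minimal sketch of a failover watchdog: it polls the active node's health endpoint and promotes the standby after a few consecutive failures. The address and the promote_standby() hook are hypothetical placeholders, not any particular tool's API.

```python
import time
import requests

ACTIVE_HEALTH_URL = "http://10.0.0.1/health"   # hypothetical active node
FAILURE_THRESHOLD = 3                          # consecutive failures before failover

def is_healthy(url: str) -> bool:
    try:
        return requests.get(url, timeout=2).status_code == 200
    except requests.RequestException:
        return False

def promote_standby() -> None:
    # Placeholder: repoint a virtual IP / DNS record at the standby node.
    print("Promoting standby to active")

failures = 0
while True:
    failures = 0 if is_healthy(ACTIVE_HEALTH_URL) else failures + 1
    if failures >= FAILURE_THRESHOLD:
        promote_standby()              # fires after ~3 failed checks
        break
    time.sleep(5)                      # poll every 5 seconds
```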
3. Monitoring and Health Checks
Principle: Detect failures fast, recover faster
Health Check Types:
Shallow Health Check
```python
@app.route('/health')
def health():
    return {'status': 'healthy'}, 200
```
Checks: Server is responding
Response time: < 10ms
Deep Health Check
```python
@app.route('/health')
def health():
    # Check database connection
    try:
        db.execute("SELECT 1")
    except Exception:
        return {'status': 'unhealthy', 'reason': 'db_down'}, 503

    # Check cache connection
    try:
        cache.ping()
    except Exception:
        return {'status': 'unhealthy', 'reason': 'cache_down'}, 503

    # Check disk space (disk_usage() returns percent used)
    if disk_usage() > 90:
        return {'status': 'unhealthy', 'reason': 'disk_full'}, 503

    return {'status': 'healthy'}, 200
```
Checks:
- Database reachable
- Cache reachable
- Disk space available
Response time: 50-100ms
Best Practice: Use deep health checks (catch issues before they cause failures)
4. Failover
Failover = Automatically switch to backup when primary fails
Example: Database Failover
Without failover:
```
Master DB fails
      ↓
Application can't write
      ↓
Manual intervention needed (1-2 hours) ❌
      ↓
Business impact: lost revenue
```
With automatic failover:
```
Master DB fails
      ↓
Monitoring system detects the failure (10 seconds)
      ↓
Slave promoted to master (30 seconds)
      ↓
Application continues writing ✅
      ↓
Total downtime: 40 seconds
```
Tools:
- PostgreSQL: Patroni, repmgr
- MySQL: MHA (Master High Availability)
- MongoDB: Replica sets with automatic election
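Whichever tool handles the promotion, the application still has to survive the failover window. A minimal sketch of client-side retry logic with psycopg2, assuming the cluster is reached through one stable hostname (for example, kept pointing at the current primary by Patroni or a proxy); the hostname and retry budget are illustrative.

```python
import time
import psycopg2

def connect_with_retry(dsn: str, attempts: int = 10, delay: float = 3.0):
    """Keep retrying until the (possibly newly promoted) primary accepts connections."""
    for _ in range(attempts):
        try:
            return psycopg2.connect(dsn)
        except psycopg2.OperationalError:
            time.sleep(delay)          # wait out the failover window
    raise RuntimeError("primary did not come back within the retry budget")

# 'db-primary.internal' is an illustrative hostname that always points at the
# current primary (e.g. kept up to date by Patroni or a proxy such as HAProxy).
conn = connect_with_retry("host=db-primary.internal dbname=app user=app")
```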
Strategies for High Availability
1. Geographic Redundancy (Multi-Region)
Deploy across multiple geographic regions
Example: AWS Multi-Region Architecture
```
      [Route 53 - DNS Load Balancer]
                   ↓
        ┌──────────┴───────────┐
        ↓                      ↓
 [US-East Region]      [EU-West Region]
        ↓                      ↓
 [Load Balancer]        [Load Balancer]
        ↓                      ↓
  [App Servers]          [App Servers]
        ↓                      ↓
   [Database]             [Database]
```
Benefits: ✅ Disaster recovery (entire region fails → traffic reroutes) ✅ Low latency (users served from nearest region) ✅ Compliance (data stays in region for GDPR, etc.)
Real Example: Netflix
Netflix operates in 3 AWS regions:
- US-East
- US-West
- EU
If US-East fails (AWS outage), traffic automatically reroutes to US-West.
Result: Users in US experience slight latency increase, but service continues.
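Route 53 handles this rerouting at the DNS layer, but the same pattern can be sketched client-side: try the nearest regional endpoint and fall back to the next one if it's unreachable. The endpoint URLs are hypothetical.

```python
import requests

# Hypothetical regional endpoints, ordered by proximity to the caller.
REGION_ENDPOINTS = [
    "https://us-east.api.example.com",
    "https://us-west.api.example.com",
    "https://eu-west.api.example.com",
]

def fetch_with_regional_failover(path: str) -> requests.Response:
    last_error = None
    for base in REGION_ENDPOINTS:
        try:
            resp = requests.get(base + path, timeout=2)
            if resp.status_code < 500:
                return resp            # a healthy region answered
        except requests.RequestException as exc:
            last_error = exc           # region unreachable, try the next one
    raise RuntimeError(f"all regions failed: {last_error}")
```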
2. Availability Zones
Availability Zone (AZ) = Isolated data center within a region
Example: AWS US-East Region
```
US-East Region
├── AZ 1 (Data center in Virginia)
├── AZ 2 (Data center in Virginia, 10 miles away)
└── AZ 3 (Data center in Virginia, 15 miles away)
```
Each AZ has independent:
- Power supply
- Cooling
- Networking
High Availability Pattern:
```
        [Load Balancer]
              ↓
    ┌─────────┼─────────┐
    ↓         ↓         ↓
 [AZ 1]    [AZ 2]    [AZ 3]
  App       App       App
 Server    Server    Server
```
If AZ 1 loses power, load balancer routes to AZ 2 and AZ 3.
Real Example: Stripe
Stripe (payment processing) runs:
- Servers in all 3 AZs per region
- Database replicas in all 3 AZs
Result: AZ failure = < 1 second downtime (automatic failover)
3. Database High Availability
Master-Slave Replication
```
           [Master]
       (Handles writes)
              ↓
      [Async replication]
        ↓           ↓
   [Slave 1]    [Slave 2]
       (Handle reads)
```
Failover process:
1. Master fails
2. Monitoring detects (10 seconds)
3. Slave 1 promoted to master (30 seconds)
4. Application reconnects to new master
Total downtime: 40 seconds
Limitations:
- Replication lag: Slaves slightly behind master (100-500ms)
- Potential data loss: Last writes might not be replicated before failure
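A common application-level companion to this topology is read/write splitting: send writes to the master and spread reads across the slaves. A minimal sketch with illustrative connection strings; note that the replication-lag caveat above applies to every replica read.

```python
import random
import psycopg2

MASTER_DSN   = "host=db-master.internal dbname=app user=app"     # illustrative
REPLICA_DSNS = [
    "host=db-replica-1.internal dbname=app user=app",
    "host=db-replica-2.internal dbname=app user=app",
]

def get_connection(readonly: bool):
    """Send writes to the master, spread reads over the replicas.

    Because replication is asynchronous, a read right after a write
    may not see it yet (replication lag)."""
    dsn = random.choice(REPLICA_DSNS) if readonly else MASTER_DSN
    return psycopg2.connect(dsn)

# Writes hit the master...
with get_connection(readonly=False) as conn:
    conn.cursor().execute("INSERT INTO orders (user_id) VALUES (%s)", (42,))

# ...reads are spread across the replicas.
with get_connection(readonly=True) as conn:
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM orders")
```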
Multi-Master Replication
```
[Master 1]  ⇄  [Master 2]
(Region 1)     (Region 2)
```
Both accept writes, replicate to each other.
Benefits: ✅ No downtime on master failure ✅ Low write latency globally
Challenges: ❌ Write conflicts (same record updated on both)
Conflict Resolution:
- Last write wins (timestamp-based)
- Custom merge logic
Real Example: CockroachDB, Google Spanner
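Last write wins is the simplest of these strategies. A sketch of the merge rule, assuming every replica stamps each write with a timestamp; note that the losing write is silently dropped, and clock skew between replicas can pick the "wrong" winner.

```python
from dataclasses import dataclass

@dataclass
class VersionedValue:
    value: str
    timestamp: float        # e.g. time.time() recorded at write time, per replica

def last_write_wins(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Resolve a conflict by keeping the most recent write."""
    return a if a.timestamp >= b.timestamp else b
```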
4. Chaos Engineering
Principle: Intentionally break things in production to test resilience
Netflix Chaos Monkey:
- Randomly kills production servers
- Forces engineers to build fault-tolerant systems
- Netflix is confident enough in this approach that they open-sourced the tool as Chaos Monkey
Example:
```
10:00 AM: Chaos Monkey kills server in AZ 1
      ↓
Load balancer detects failure (5 seconds)
      ↓
Traffic reroutes to AZ 2 and AZ 3
      ↓
Auto-scaling launches new server in AZ 1 (2 minutes)
      ↓
System self-heals
```
Result: Netflix has extremely high availability because they constantly test failure scenarios.
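You don't need Netflix's tooling to apply the idea. A hedged sketch of one chaos round: with some probability, pick a random instance from the fleet and terminate it, then watch whether the load balancer and auto-scaling recover on their own. list_instances() and terminate() are placeholders for your own infrastructure API, not Chaos Monkey's interface.

```python
import random

def list_instances() -> list:
    # Placeholder: query your cloud provider / orchestrator for running instances.
    return ["i-0a1", "i-0b2", "i-0c3"]

def terminate(instance_id: str) -> None:
    # Placeholder: call your provider's terminate API here.
    print(f"terminating {instance_id}")

def chaos_round(probability: float = 0.1) -> None:
    """With some probability, kill one random instance and let the platform
    (load balancer + auto-scaling) prove it can recover on its own."""
    if random.random() < probability:
        terminate(random.choice(list_instances()))

chaos_round()
```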
Real-World High Availability Examples
1. Amazon.com
Black Friday 2020:
- Billions of requests
- 99.99% uptime (< 1 minute downtime)
How?
1. Multi-region deployment (US, EU, Asia)
2. Each region: 3 AZs with redundant servers
3. Auto-scaling (10,000+ servers during peak)
4. Circuit breakers (if checkout fails, show cached page; a sketch follows this list)
5. Graceful degradation (non-essential features disabled during overload)
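A circuit breaker (step 4 above) can be sketched in a few lines: after repeated failures it "opens" and serves a fallback instead of hammering the failing dependency, then retries after a cool-down. The thresholds and the checkout functions are illustrative.

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures   # failures before the circuit opens
        self.reset_after = reset_after     # seconds to wait before retrying
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        # While open and inside the cool-down window, fail fast to the fallback.
        if self.opened_at and time.time() - self.opened_at < self.reset_after:
            return fallback()
        try:
            result = fn()
            self.failures, self.opened_at = 0, None   # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()           # open the circuit
            return fallback()

# Usage (hypothetical functions):
#   breaker = CircuitBreaker()
#   page = breaker.call(render_checkout, cached_checkout_page)
```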
2. Gmail
SLA: 99.9% uptime
High Availability Strategy:
1. Multi-region (15+ global data centers)
2. Data replicated 3x (across data centers)
3. Database: Multi-master (writes go to nearest data center)
4. Failover: Automatic (< 10 seconds)
5. Offline mode (Progressive Web App caches emails)
Result: Gmail has been down for < 1 hour total in the past 5 years.
3. Uber
Challenge: Drivers and riders need real-time matching
Requirements:
- Low latency (match in < 5 seconds)
- High availability (drivers depend on app for income)
Architecture:
1. Geographic sharding (each city = separate database)
2. Multi-AZ deployment per city
3. Ringpop (peer-to-peer service discovery)
4. If one server fails, others in city handle requests
5. If entire city data center fails, route to nearest city
Result: 99.99% uptime globally
Graceful Degradation
Principle: When failures happen, degrade functionality instead of crashing
Example: Twitter
Full functionality:
```
Timeline → Fetch tweets from database
         → Rank by algorithm
         → Fetch user profiles
         → Fetch media (images/videos)
         → Return enriched timeline
```
Database under heavy load (degraded mode):
```
Timeline → Fetch from cache (5 minutes old) ✅
         → Skip ranking (show chronologically)
         → Show placeholder avatars
         → Lazy-load media
         → Return basic timeline (fast, but fewer features)
```
Result: Users still get timeline, even if it's not perfect.
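The same "serve something stale rather than nothing" behaviour can be sketched as a cache fallback around the primary path; the in-memory cache and the failing pipeline stub below are stand-ins for the real thing.

```python
import time

CACHE: dict = {}            # key -> (stored_at, value); stand-in for Redis/memcached
CACHE_TTL = 300             # serve cached timelines up to 5 minutes old

def fetch_full_timeline(user_id: int) -> list:
    # Placeholder for the real pipeline: DB fetch, ranking, profile/media joins.
    raise TimeoutError("database under heavy load")

def get_timeline(user_id: int) -> list:
    """Full pipeline when healthy; stale cache (degraded mode) when it fails."""
    key = f"timeline:{user_id}"
    try:
        timeline = fetch_full_timeline(user_id)
        CACHE[key] = (time.time(), timeline)
        return timeline
    except Exception:
        stored_at, cached = CACHE.get(key, (0.0, []))
        if time.time() - stored_at <= CACHE_TTL:
            return cached          # stale but usable
        return []                  # last resort: empty timeline, not an error page

print(get_timeline(42))            # pipeline fails, falls back, prints []
```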
Example: Netflix
Degradation steps:
1. Full quality: 4K streaming
2. Load increases: Reduce to 1080p
3. Heavy load: Reduce to 720p
4. Critical load: Show cached thumbnails, delay video start
5. Extreme: Show "Retry" page
Better than: Complete service outage.
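That ladder boils down to a mapping from measured load to the quality tier worth serving; the thresholds below are purely illustrative.

```python
def quality_for_load(load: float) -> str:
    """Pick a streaming tier from current load (0.0 = idle, 1.0 = saturated)."""
    if load < 0.60:
        return "4K"
    if load < 0.75:
        return "1080p"
    if load < 0.90:
        return "720p"
    if load < 0.97:
        return "thumbnails-only"   # cached thumbnails, delayed video start
    return "retry-page"            # last resort before a full outage

print(quality_for_load(0.8))       # -> 720p
```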
System Design Interview Tips
Common Question: "How would you ensure 99.99% uptime for your system?"
Answer Template:
1. Eliminate SPOF
- Load balancers (2+, with health checks)
- Application servers (10+, auto-scaling)
- Database (master-slave with automatic failover)
- Cache (Redis cluster, not single instance)
2. Multi-AZ Deployment
- Deploy across 3 Availability Zones
- If one AZ fails, traffic routes to others
- Downtime: < 30 seconds
3. Multi-Region (if global app)
- US, EU, Asia regions
- DNS-based routing to nearest region
- If region fails, failover to next nearest
4. Monitoring
- Health checks every 10 seconds
- Alerting (PagerDuty, Slack)
- Automatic remediation (restart failed services)
5. Graceful Degradation
- If database slow: serve cached data
- If payment gateway down: queue payments for retry
- Never show complete error page
6. Testing
- Chaos engineering (randomly kill servers)
- Load testing (simulate peak traffic)
- Disaster recovery drills (simulate region failure)
What to Mention
✅ Multiple availability zones
✅ Load balancers with health checks
✅ Database replication and failover
✅ Monitoring and alerting
✅ Graceful degradation
✅ Auto-scaling
Avoid These Mistakes
❌ Not mentioning a specific uptime percentage (say "99.99%")
❌ Only focusing on servers (forgetting database HA)
❌ No monitoring strategy
❌ Not discussing failover time (aim for < 1 minute)
Trade-Offs
High Availability = Higher Cost
| Component | Basic | High Availability | Cost Increase |
|---|---|---|---|
| Servers | 1 server | 6 servers (2 per AZ × 3 AZs) | 6x |
| Database | 1 instance | 1 master + 2 slaves | 3x |
| Load Balancer | None | 2 (redundant) | 2x hardware |
| Regions | 1 | 2-3 | 2-3x infrastructure |
Total: Achieving 99.99% can cost 5-10x more than basic setup.
When to invest in HA:
- High cost of downtime (e-commerce, banking)
- Business-critical (SaaS with SLA commitments)
- User expectation (Gmail, Netflix: users expect "always on")
When NOT to over-invest:
- Internal tools (99% might be fine)
- Early startups (better to spend on product)
- Low traffic (manual restart acceptable)
Practical Checklist
For 99.9% (Three Nines):
✅ Load balancer + 2+ servers
✅ Database replication (master-slave)
✅ Health checks
✅ Monitoring and alerting
For 99.99% (Four Nines):
✅ Multi-AZ deployment (3 zones)
✅ Auto-scaling
✅ Automatic database failover
✅ Redundant load balancers
✅ Cache layer (Redis cluster)
✅ Graceful degradation
For 99.999% (Five Nines):
✅ Multi-region (2-3 regions)
✅ Chaos engineering
✅ Zero-downtime deployments
✅ Active-active globally
✅ Advanced monitoring (Datadog, New Relic)
✅ Dedicated ops team (24/7)
Conclusion
High availability is about resilience:
- Eliminate single points of failure
- Deploy redundantly (multiple servers, AZs, regions)
- Monitor and failover automatically
- Degrade gracefully (never show complete failure)
Real-world uptime:
- 99.9%: Good (~8.8 hours downtime/year)
- 99.99%: Great (< 1 hour/year)
- 99.999%: Exceptional (< 6 minutes/year)
Facebook's 6-hour outage in 2021 was a $400 million lesson: high availability isn't optional for critical systems.
Design for failure, test relentlessly, and your system will stay up when it matters most.