Your infrastructure scales perfectly, but what happens when your database goes down? Learn strategies for graceful degradation, from application-level error handling to Route 53 health checks and CloudFront functions. Turn ugly 500 errors into honest, professional error pages.
Building Highly Available AWS Infrastructure: Graceful Failure - Part 4
You've done everything right. You followed Part 1 and built a highly available setup with ALB and Auto Scaling. You containerized with Part 2 using ECS. Maybe you even went serverless with Part 3 and Fargate.
Your application layer is bulletproof. It scales beautifully. Health checks are perfect. Your ALB is distributing traffic like a champ.
Then, at 2 AM on a Friday, your database goes down.
Suddenly, every single request returns a 500 error. Your perfectly scaled infrastructure becomes a perfectly scaled error generator. Your users see this:
500 Internal Server Error
The server encountered an internal error and was unable to complete your request.
Welcome to Part 4, where we talk about the harsh truth: No matter how well your applications scale, there are links in the chain that can still break.
🎯 The Cascading Failure Problem: When Health Checks Lie
Your architecture looks like this:
User → CloudFront ✅ → ALB ✅ → Fargate Tasks ✅ → RDS Database 💀 (DOWN)
Here's what happens when a critical dependency fails:
- Your Fargate tasks are passing health checks ✅ (they respond to /health)
- Your ALB is routing traffic normally ✅ (targets are "healthy")
- But every real request cascades into failure ❌ (database is down)
The root issue: your health checks only verify the infrastructure layer; they don't test the full dependency chain. Your app appears healthy because it responds to HTTP requests, even though it can't actually serve user traffic.
This isn't limited to databases. When dependencies fail—external APIs, payment processors, authentication services, or even entire AWS regions—your "healthy" infrastructure becomes a perfectly scaled error generator.
Real example: During the October 2025 US-East-1 incident, the region didn't go down completely. DNS resolution failures for DynamoDB cascaded through AWS's control plane, and one critical downstream effect was the inability to launch new EC2 instances. Existing instances continued running fine, but autoscaling was broken—if you experienced a traffic spike, you were stuck at current capacity. This created an inconsistent user experience: some requests succeeded (hitting healthy instances with capacity) while others failed (hitting overloaded instances). Multi-region architectures with geo-routing fared better, but single-region deployments had no escape valve.
What this article covers:
- Patterns to detect these failures early and show users honest, professional messages instead of cryptic "502 Bad Gateway", "503 Service Unavailable", or "504 Gateway Timeout" errors.
- While these won't fix AWS outages or broken dependencies, they prevent the cascading failure from reaching users as ugly error pages.
🎭 The User Experience Crisis
What your users actually see when things go wrong depends on where the failure occurs:
Scenario 1: All Servers Unhealthy (ALB Has No Targets)
502 Bad Gateway
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
When this happens:
- All your EC2/Fargate instances failed health checks
- Auto Scaling hasn't launched replacements yet
- ALB has no healthy targets to route to
- Classic "cold start" problem during rapid scale-up
Scenario 2: Database Connection Failures
500 Internal Server Error
Or sometimes just a blank white page in the browser.
When this happens:
- Your servers are healthy, but can't connect to RDS
- Security group changes blocked the connection
- RDS is failing over to standby or experiencing issues
- Connection pool exhausted from traffic spike
- Network partition between app and database subnets
Scenario 3: Timeout to External Dependencies
504 Gateway Timeout
When this happens:
- Your app calls an external API that's down
- Payment processor is slow/unresponsive
- Authentication service (Auth0, Cognito) has latency
- Third-party data provider is degraded
- CloudFront origin request times out (30 second default)
Scenario 4: AWS Regional Issues
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
When this happens:
- AWS Control Plane issues (like the recent US-East-1 incident)
- ALB itself is experiencing problems
- Route 53 DNS resolution delays
- Cross-AZ network degradation
- Lambda cold starts timing out
Edge Cases That Happen More Than You'd Think
Out of Memory Crashes:
502 Bad Gateway
Your container runs out of memory mid-request, crashes, ALB marks it unhealthy.
Slow Database Queries During Traffic Spike:
504 Gateway Timeout
Query works fine normally, but locks up under high load. Requests queue up and timeout.
SSL/TLS Certificate Expiration:
ERR_CERT_DATE_INVALID
Automated renewal failed, now all requests fail at the ALB level.
Cascading Failures from One Slow Endpoint:
500/502/504 (varies)
One API endpoint is slow, consumes all worker threads/connections, now everything fails.
Problems with all of these:
- Look broken and unprofessional
- No information about what's happening
- No estimated time to resolution
- No alternatives or status page link
- Users don't know if it's their problem or yours
- Makes users think your entire service is broken
What you want users to see instead:
<!-- An honest status page -->
<!DOCTYPE html>
<html>
<head><title>We'll Be Right Back</title></head>
<body style="font-family: Arial; text-align: center; padding: 50px;">
<h1>🔧 We'll be right back!</h1>
<p>We're experiencing some technical difficulties and are working to resolve them.</p>
<p>Please try again in a few minutes. We apologize for the inconvenience.</p>
<p><a href="https://status.example.com">Check our status page</a></p>
</body>
</html>
💡 Solution 1: Application-Level Graceful Degradation
The first line of defense is your application itself.
Strategy: Fail Gracefully for Different Scenarios
Handle all the real-world failure cases:
# Python/Flask example
from flask import Flask, render_template, request, jsonify
import psycopg2
import psycopg2.pool
import requests
from functools import wraps
from requests.exceptions import Timeout, RequestException

app = Flask(__name__)

# Track service health
service_health = {
    'database': True,
    'payment_api': True,
    'auth_service': True
}

def graceful_degradation(f):
    @wraps(f)
    def decorated_function(*args, **kwargs):
        try:
            # Try to execute the route
            return f(*args, **kwargs)
        except psycopg2.OperationalError as e:
            # Database connection failed
            service_health['database'] = False
            app.logger.error(f"Database error: {e}")
            return render_template('error.html', reason='database'), 503
        except psycopg2.pool.PoolError as e:
            # Connection pool exhausted
            app.logger.error(f"Connection pool exhausted: {e}")
            return render_template('error.html', reason='high_load'), 503
        except Timeout as e:
            # External API timeout
            app.logger.error(f"External service timeout: {e}")
            return render_template('error.html', reason='external_service'), 504
        except RequestException as e:
            # External API failed
            service_health['payment_api'] = False
            app.logger.error(f"External service error: {e}")
            return render_template('error.html', reason='external_service'), 503
        except MemoryError:
            # Out of memory - this is critical
            app.logger.critical("Out of memory!")
            return render_template('error.html', reason='resource_exhaustion'), 503
        except Exception as e:
            # Catch-all for unexpected errors
            app.logger.error(f"Unexpected error: {e}")
            return render_template('error.html', reason='unknown'), 503
    return decorated_function

@app.route('/api/users')
@graceful_degradation
def get_users():
    # This will fail gracefully if the database is down
    conn = get_db_connection()
    with conn.cursor() as cur:
        cur.execute('SELECT * FROM users')
        users = cur.fetchall()
    return jsonify(users)

@app.route('/api/process-payment', methods=['POST'])
@graceful_degradation
def process_payment():
    # Call the external payment API with a timeout
    response = requests.post(
        'https://payment-api.example.com/charge',
        json=request.json,
        timeout=5  # 5 second timeout to avoid gateway timeouts
    )
    return jsonify(response.json())

# Health check that ACTUALLY checks all dependencies
@app.route('/health')
def health_check():
    health_status = {'status': 'healthy', 'checks': {}}
    is_healthy = True

    # Check database connectivity
    try:
        conn = get_db_connection()
        with conn.cursor() as cur:
            cur.execute('SELECT 1')
        health_status['checks']['database'] = 'connected'
        service_health['database'] = True
    except Exception as e:
        health_status['checks']['database'] = f'error: {str(e)}'
        service_health['database'] = False
        is_healthy = False

    # Check external API connectivity (with short timeout)
    try:
        response = requests.get(
            'https://payment-api.example.com/health',
            timeout=2
        )
        if response.status_code == 200:
            health_status['checks']['payment_api'] = 'available'
            service_health['payment_api'] = True
        else:
            health_status['checks']['payment_api'] = 'degraded'
            service_health['payment_api'] = False
    except Exception:
        health_status['checks']['payment_api'] = 'unavailable'
        service_health['payment_api'] = False
        # Don't mark as unhealthy for external API issues
        # Only fail the health check for critical dependencies

    # Check memory usage
    import psutil
    memory_percent = psutil.virtual_memory().percent
    if memory_percent > 90:
        health_status['checks']['memory'] = f'critical: {memory_percent}%'
        is_healthy = False
    else:
        health_status['checks']['memory'] = f'ok: {memory_percent}%'

    if is_healthy:
        return jsonify(health_status), 200
    else:
        health_status['status'] = 'unhealthy'
        # Return 503 so the ALB marks this instance as unhealthy
        return jsonify(health_status), 503

# Error handlers - render the friendly error page for 503/504 responses
@app.errorhandler(503)
def service_unavailable(e):
    return render_template('error.html'), 503

@app.errorhandler(504)
def gateway_timeout(e):
    return render_template('error.html', reason='timeout'), 504
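The routes above assume a get_db_connection() helper that isn't shown. A minimal sketch of one, assuming psycopg2 and placeholder connection settings, that fails fast when the database is unreachable instead of hanging until the load balancer gives up:

# db.py - hypothetical connection helper (settings are placeholders)
import os
import psycopg2

def get_db_connection():
    """Open a connection with a short connect timeout so a down database
    raises psycopg2.OperationalError quickly instead of hanging."""
    return psycopg2.connect(
        host=os.environ.get("DB_HOST", "localhost"),
        dbname=os.environ.get("DB_NAME", "app"),
        user=os.environ.get("DB_USER", "app"),
        password=os.environ.get("DB_PASSWORD", ""),
        connect_timeout=2,  # seconds
    )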
.NET 9 / ASP.NET Core Implementation
Here's the same concept in modern .NET, handling all failure scenarios:
// Program.cs - ASP.NET Core 9
using Microsoft.EntityFrameworkCore;
using System.Data.Common;
var builder = WebApplication.CreateBuilder(args);
// Add services
builder.Services.AddDbContext<AppDbContext>(options =>
options.UseNpgsql(builder.Configuration.GetConnectionString("DefaultConnection")));
builder.Services.AddHttpClient();
builder.Services.AddControllers();
builder.Services.AddMemoryCache();
var app = builder.Build();
// Global exception handler middleware - handles all real-world failures
app.Use(async (context, next) =>
{
try
{
await next(context);
}
catch (DbException ex)
{
// Database connection or query failure
context.RequestServices.GetRequiredService<ILogger<Program>>()
.LogError(ex, "Database error on {Path}", context.Request.Path);
context.Response.StatusCode = 503;
context.Response.ContentType = "text/html";
await context.Response.WriteAsync(GetErrorPageHtml("database"));
}
catch (TimeoutException ex)
{
// External service timeout
context.RequestServices.GetRequiredService<ILogger<Program>>()
.LogError(ex, "Timeout on {Path}", context.Request.Path);
context.Response.StatusCode = 504;
context.Response.ContentType = "text/html";
await context.Response.WriteAsync(GetErrorPageHtml("timeout"));
}
catch (HttpRequestException ex)
{
// External API failure
context.RequestServices.GetRequiredService<ILogger<Program>>()
.LogError(ex, "External service error on {Path}", context.Request.Path);
context.Response.StatusCode = 503;
context.Response.ContentType = "text/html";
await context.Response.WriteAsync(GetErrorPageHtml("external_service"));
}
catch (OutOfMemoryException ex)
{
// Critical memory issue
context.RequestServices.GetRequiredService<ILogger<Program>>()
.LogCritical(ex, "Out of memory on {Path}", context.Request.Path);
context.Response.StatusCode = 503;
context.Response.ContentType = "text/html";
await context.Response.WriteAsync(GetErrorPageHtml("resource_exhaustion"));
}
catch (Exception ex)
{
// Catch-all for unexpected errors
context.RequestServices.GetRequiredService<ILogger<Program>>()
.LogError(ex, "Unexpected error on {Path}", context.Request.Path);
context.Response.StatusCode = 503;
context.Response.ContentType = "text/html";
await context.Response.WriteAsync(GetErrorPageHtml("unknown"));
}
});
app.MapControllers();
// Health check endpoint that checks ALL dependencies
app.MapGet("/health", async (
AppDbContext db,
IHttpClientFactory httpClientFactory,
ILogger<Program> logger) =>
{
var healthStatus = new Dictionary<string, object>
{
["status"] = "healthy",
["checks"] = new Dictionary<string, string>(),
["timestamp"] = DateTime.UtcNow
};
var isHealthy = true;
var checks = (Dictionary<string, string>)healthStatus["checks"];
// Check database connectivity
try
{
await db.Database.ExecuteSqlRawAsync("SELECT 1");
checks["database"] = "connected";
}
catch (Exception ex)
{
logger.LogError(ex, "Database health check failed");
checks["database"] = $"error: {ex.Message}";
isHealthy = false;
}
// Check external API connectivity (with timeout)
try
{
var httpClient = httpClientFactory.CreateClient();
httpClient.Timeout = TimeSpan.FromSeconds(2);
var response = await httpClient.GetAsync("https://payment-api.example.com/health");
if (response.IsSuccessStatusCode)
{
checks["payment_api"] = "available";
}
else
{
checks["payment_api"] = "degraded";
// Don't fail health check for non-critical dependencies
}
}
catch (Exception ex)
{
logger.LogWarning(ex, "External API health check failed");
checks["payment_api"] = "unavailable";
// Don't mark as unhealthy for external API issues
}
// Check memory usage
var memoryUsed = GC.GetTotalMemory(false);
var memoryMB = memoryUsed / 1024 / 1024;
if (memoryMB > 1024) // More than 1GB
{
checks["memory"] = $"high: {memoryMB}MB";
// Warning but not failing
}
else
{
checks["memory"] = $"ok: {memoryMB}MB";
}
if (isHealthy)
{
return Results.Ok(healthStatus);
}
else
{
healthStatus["status"] = "unhealthy";
// Return 503 so ALB marks this instance as unhealthy
return Results.Json(healthStatus, statusCode: 503);
}
});
app.Run();
static string GetErrorPageHtml(string reason = "unknown")
{
var message = reason switch
{
"database" => "We're experiencing database connectivity issues.",
"timeout" => "An external service is responding slowly.",
"external_service" => "A dependent service is temporarily unavailable.",
"resource_exhaustion" => "We're experiencing high system load.",
_ => "We're experiencing technical difficulties."
};
return $@"
<!DOCTYPE html>
<html lang=""en"">
<head>
<meta charset=""UTF-8"">
<meta name=""viewport"" content=""width=device-width, initial-scale=1.0"">
<title>We'll Be Right Back</title>
<style>
body {{
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
display: flex;
align-items: center;
justify-content: center;
height: 100vh;
margin: 0;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
}}
.container {{
text-align: center;
padding: 2rem;
max-width: 600px;
}}
h1 {{ font-size: 3rem; margin: 0; }}
p {{ font-size: 1.2rem; opacity: 0.9; }}
</style>
</head>
<body>
<div class=""container"">
<h1>🔧 We'll Be Right Back!</h1>
<p>{message}</p>
<p>Please try again in a few minutes.</p>
<p style=""font-size: 0.9rem; margin-top: 2rem; opacity: 0.7;"">
If this persists, contact support@acme.com
</p>
</div>
<script>
// Auto-refresh every 30 seconds
setTimeout(() => location.reload(), 30000);
</script>
</body>
</html>";
}
Key .NET Features Used:
✅ Global exception middleware - Catches all unhandled exceptions by type
✅ Minimal API health endpoint - Clean, simple health checks with multiple dependencies
✅ Multiple exception handlers - Handles DbException, TimeoutException, HttpRequestException, OutOfMemoryException
✅ Pattern matching - Switch expressions for customized error messages
✅ IHttpClientFactory - Proper HTTP client management for external service checks
✅ Structured logging - Different log levels (Error, Warning, Critical) for different scenarios
✅ Memory monitoring - Uses GC to track memory usage in health checks
Alternative: Using Middleware Class
For larger applications, create a dedicated middleware:
// GracefulDegradationMiddleware.cs
public class GracefulDegradationMiddleware
{
private readonly RequestDelegate _next;
private readonly ILogger<GracefulDegradationMiddleware> _logger;
public GracefulDegradationMiddleware(
RequestDelegate next,
ILogger<GracefulDegradationMiddleware> logger)
{
_next = next;
_logger = logger;
}
public async Task InvokeAsync(HttpContext context)
{
try
{
await _next(context);
}
catch (DbException ex)
{
_logger.LogError(ex, "Database error on {Path}", context.Request.Path);
await HandleFailureAsync(context, "Database temporarily unavailable");
}
catch (HttpRequestException ex)
{
_logger.LogError(ex, "External service error on {Path}", context.Request.Path);
await HandleFailureAsync(context, "External service temporarily unavailable");
}
}
private static async Task HandleFailureAsync(HttpContext context, string message)
{
context.Response.StatusCode = 503;
context.Response.ContentType = "application/json";
await context.Response.WriteAsJsonAsync(new
{
status = "service_unavailable",
message = message,
timestamp = DateTime.UtcNow
});
}
}
// Register in Program.cs
app.UseMiddleware<GracefulDegradationMiddleware>();
The Error Page Template
<!-- templates/error.html -->
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>We'll Be Right Back</title>
<style>
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Arial, sans-serif;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
display: flex;
align-items: center;
justify-content: center;
min-height: 100vh;
margin: 0;
padding: 20px;
}
.container {
text-align: center;
max-width: 600px;
background: rgba(255, 255, 255, 0.1);
backdrop-filter: blur(10px);
padding: 60px 40px;
border-radius: 20px;
box-shadow: 0 8px 32px rgba(0, 0, 0, 0.3);
}
h1 { font-size: 3em; margin: 0 0 20px; }
p { font-size: 1.2em; margin: 15px 0; opacity: 0.9; }
.status-link {
display: inline-block;
margin-top: 30px;
padding: 12px 30px;
background: white;
color: #667eea;
text-decoration: none;
border-radius: 25px;
font-weight: 600;
transition: transform 0.2s;
}
.status-link:hover { transform: translateY(-2px); }
.icon { font-size: 4em; margin-bottom: 20px; }
</style>
</head>
<body>
<div class="container">
<div class="icon">🔧</div>
<h1>We'll Be Right Back!</h1>
<p>We're currently experiencing technical difficulties.</p>
<p>Our team has been notified and is working to resolve the issue.</p>
<p style="font-size: 0.9em; opacity: 0.7;">Estimated resolution time: 15-30 minutes</p>
<a href="https://status.example.com" class="status-link">Check Status Page</a>
</div>
<script>
// Auto-refresh every 30 seconds
setTimeout(() => location.reload(), 30000);
</script>
</body>
</html>
Pros and Cons
✅ Pros:
- Handles all real failure scenarios: Database outages, external API timeouts, memory exhaustion, etc.
- Context-aware messages: Different errors show appropriate messages to users
- Immediate response: No DNS propagation delays, users see error pages instantly
- Health checks work properly: ALB can detect unhealthy instances and route around them
- Can include degraded mode: Return cached data or limited functionality instead of total failure
- Logging included: Every failure is logged with context for debugging
❌ Cons:
- Only works if app is running: If your container crashes completely (OOM kill), this won't help
- Requires code in every app: Each microservice needs its own error handling
- Still uses compute resources: Even rendering error pages consumes CPU/memory
- Won't fix AWS regional issues: If ALB itself is down, your graceful error handling can't run
- Complex health checks: Need to balance checking dependencies vs. marking instance unhealthy
- Edge case: Security group changes: If DB security group blocks your app, you still serve errors (but gracefully)
💡 Solution 2: Route 53 Health Checks with Failover
Take control at the DNS level before requests even reach your infrastructure.
Strategy: DNS Failover to Static Error/Status Site
Normal Operation:
User → DNS (app.example.com) → ALB → Your App
Database Down:
User → DNS (app.example.com) → S3 Static Site (Error/Status Page)
Setting It Up
1. Create an error/status page in S3:
# Create S3 bucket for error/status page
aws s3 mb s3://example-error-page
# Upload your error page
aws s3 cp error.html s3://example-error-page/index.html \
--content-type "text/html" \
--cache-control "no-cache, no-store, must-revalidate"
# Configure bucket for static website hosting
aws s3 website s3://example-error-page \
--index-document index.html
# Make it public (or use CloudFront for better security)
aws s3api put-bucket-policy \
--bucket example-error-page \
--policy '{
"Version": "2012-10-17",
"Statement": [{
"Sid": "PublicReadGetObject",
"Effect": "Allow",
"Principal": "*",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::example-error-page/*"
}]
}'
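With the policy applied, it's worth hitting the S3 website endpoint directly to confirm the page renders (bucket name and region below are placeholders; the endpoint format is <bucket>.s3-website-<region>.amazonaws.com):

# S3 website endpoints are HTTP-only
curl -s http://example-error-page.s3-website-us-east-1.amazonaws.com | head -n 5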
2. Create Route 53 health check:
# Health check that monitors your actual app health
aws route53 create-health-check \
--health-check-config '{
"Type": "HTTPS",
"ResourcePath": "/health",
"FullyQualifiedDomainName": "app.example.com",
"Port": 443,
"RequestInterval": 30,
"FailureThreshold": 3,
"MeasureLatency": true,
"EnableSNI": true
}' \
--caller-reference "app-health-check-$(date +%s)"
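Before wiring the health check into DNS failover, you can confirm the Route 53 checkers actually see your endpoint as healthy (the ID below is a placeholder):

# Shows the latest observation from each Route 53 health checker
aws route53 get-health-check-status \
--health-check-id abc123-health-check-id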
3. Configure Route 53 failover records:
# Primary record (your main ALB)
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch '{
"Changes": [{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "Primary",
"Failover": "PRIMARY",
"AliasTarget": {
"HostedZoneId": "Z35SXDOTRQ7X7K",
"DNSName": "my-alb-123456.us-east-1.elb.amazonaws.com",
"EvaluateTargetHealth": false
},
"HealthCheckId": "abc123-health-check-id"
}
}]
}'
# Secondary record (S3 error/status page)
# Note: for an S3 website alias, the bucket name must match the record name (app.example.com)
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch '{
"Changes": [{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "Secondary",
"Failover": "SECONDARY",
"AliasTarget": {
"HostedZoneId": "Z3AQBSTGFYJSTF",
"DNSName": "s3-website-us-east-1.amazonaws.com",
"EvaluateTargetHealth": false
}
}
}]
}'
Pros and Cons
✅ Pros:
- Completely offloads traffic: When failing over, zero load on your infrastructure
- Works even if app crashes: DNS-level failover doesn't require your app to be running
- Simple error page: Just static HTML in S3
- Cost-effective during outages: S3 hosting is pennies compared to running instances
- Automatic failover: Route 53 detects failure and switches automatically
❌ Cons:
- DNS propagation delay: Can take 30-60 seconds (or longer with caching) for failover to take effect
- TTL complications: Clients cache DNS for the TTL duration (typically 60-300 seconds)
- All or nothing: Either all traffic goes to the error page or none
- Limited customization: Static page can't show dynamic information
- No HTTPS on S3 website endpoints: To serve the failover page over HTTPS you need CloudFront in front of the bucket
- Health check costs: Route 53 health checks cost $0.50/month each
- Not granular: Can't fail over specific routes, only entire domains
💡 Solution 3: CloudFront with Edge Functions
Intercept and handle errors at the edge, closest to your users.
Strategy: CloudFront Functions or Lambda@Edge
CloudFront sits in front of your entire infrastructure and can inspect/modify responses:
User → CloudFront (Edge Location) → ALB → Your App
        ↓ (detects a 5xx response from the origin)
        ↓ (returns a professional error page to the viewer)
Option A: CloudFront Functions (Lightweight)
CloudFront Functions run in microseconds and are perfect for simple transformations:
// CloudFront Function (viewer-response event)
function handler(event) {
var response = event.response;
var statusCode = response.statusCode;
// If origin returned 5xx error, return error page
if (statusCode >= 500 && statusCode < 600) {
return {
statusCode: 503,
statusDescription: 'Service Unavailable',
headers: {
'content-type': { value: 'text/html; charset=utf-8' },
'cache-control': { value: 'no-cache, no-store, must-revalidate' }
},
body: `<!DOCTYPE html>
<html>
<head>
<title>We'll Be Right Back</title>
<style>
body {
font-family: Arial, sans-serif;
display: flex;
align-items: center;
justify-content: center;
min-height: 100vh;
margin: 0;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
text-align: center;
}
.container {
background: rgba(255, 255, 255, 0.1);
padding: 40px;
border-radius: 20px;
backdrop-filter: blur(10px);
}
h1 { font-size: 2.5em; margin: 0 0 20px; }
p { font-size: 1.1em; margin: 10px 0; }
</style>
</head>
<body>
<div class="container">
<h1>🔧 We'll Be Right Back!</h1>
<p>We're experiencing technical difficulties.</p>
<p>Our team is working to resolve the issue.</p>
<p style="font-size: 0.9em; opacity: 0.8;">Please try again in a few minutes.</p>
</div>
<script>setTimeout(() => location.reload(), 30000);</script>
</body>
</html>`
};
}
// Return original response if no error
return response;
}
Deploying the function:
# Create function
aws cloudfront create-function \
--name error-handler \
--function-config Comment="Handle 5xx errors gracefully",Runtime="cloudfront-js-1.0" \
--function-code fileb://error-handler.js
# Publish function
aws cloudfront publish-function \
--name error-handler \
--if-match ETVABCDEF12345
# Associate with the CloudFront distribution
# (abbreviated: in practice update-distribution needs the full distribution config and an --if-match ETag)
aws cloudfront update-distribution \
--id E1234ABCD \
--distribution-config '{
"DefaultCacheBehavior": {
"FunctionAssociations": {
"Quantity": 1,
"Items": [{
"FunctionARN": "arn:aws:cloudfront::123456:function/error-handler",
"EventType": "viewer-response"
}]
}
}
}'
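Before publishing, you can also exercise the function against a synthetic viewer-response event while it's still in the DEVELOPMENT stage; a sketch (the event file and ETag are placeholders):

# test-event.json contains a viewer-response event whose status is a 5xx
aws cloudfront test-function \
--name error-handler \
--if-match ETVABCDEF12345 \
--stage DEVELOPMENT \
--event-object fileb://test-event.json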
Option B: Lambda@Edge (Full Power)
For more complex logic, use Lambda@Edge:
# Lambda@Edge function (origin-response event)
import json
import boto3

def lambda_handler(event, context):
    response = event['Records'][0]['cf']['response']
    status = int(response['status'])

    # If 5xx error, check if it's a database issue
    if 500 <= status < 600:
        # Could check CloudWatch metrics, or RDS status here
        # For simplicity, return the error page for all 5xx
        error_page = """<!DOCTYPE html>
<html>
<head>
<title>We'll Be Right Back</title>
<style>
body {
  font-family: Arial, sans-serif;
  text-align: center;
  padding: 50px;
  background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
  color: white;
}
.container {
  max-width: 600px;
  margin: 0 auto;
  background: rgba(255,255,255,0.1);
  padding: 40px;
  border-radius: 20px;
}
h1 { font-size: 2.5em; }
</style>
</head>
<body>
<div class="container">
<h1>🔧 We're Having Technical Issues</h1>
<p>We're experiencing technical difficulties and our team is working to resolve them.</p>
<p>We'll be back shortly!</p>
</div>
</body>
</html>"""

        return {
            'status': '503',
            'statusDescription': 'Service Unavailable',
            'headers': {
                'content-type': [{'key': 'Content-Type', 'value': 'text/html'}],
                'cache-control': [{'key': 'Cache-Control', 'value': 'no-cache'}]
            },
            'body': error_page
        }

    return response
CloudFront Custom Error Pages (Simplest Option)
CloudFront also supports custom error pages without any code:
# (abbreviated: supply the full distribution config and an --if-match ETag in practice)
aws cloudfront update-distribution \
--id E1234ABCD \
--distribution-config '{
"CustomErrorResponses": {
"Quantity": 3,
"Items": [
{
"ErrorCode": 500,
"ResponsePagePath": "/error.html",
"ResponseCode": "503",
"ErrorCachingMinTTL": 10
},
{
"ErrorCode": 502,
"ResponsePagePath": "/error.html",
"ResponseCode": "503",
"ErrorCachingMinTTL": 10
},
{
"ErrorCode": 503,
"ResponsePagePath": "/error.html",
"ResponseCode": "503",
"ErrorCachingMinTTL": 10
}
]
}
}'
Then host error.html in your S3 origin bucket.
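One wrinkle: if the distribution's default origin is your ALB, error.html has to come from an origin that is still healthy when the ALB isn't. A common approach, sketched below with placeholder names and an abbreviated config, is to add the S3 bucket as a second origin and route /error.html to it with a dedicated cache behavior:

{
"Origins": {
"Quantity": 2,
"Items": [
{ "Id": "alb-origin", "DomainName": "my-alb-123456.us-east-1.elb.amazonaws.com" },
{ "Id": "error-page-s3", "DomainName": "myapp-error-page.s3.amazonaws.com" }
]
},
"CacheBehaviors": {
"Quantity": 1,
"Items": [{
"PathPattern": "/error.html",
"TargetOriginId": "error-page-s3",
"ViewerProtocolPolicy": "redirect-to-https"
}]
}
}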
Pros and Cons
✅ Pros:
- Edge-level response: Handled at CloudFront POPs, closest to users
- Fast failover: No DNS propagation delays
- Reduced origin load: Errors intercepted before hitting origin repeatedly
- Granular control: Can handle different error codes differently
- Custom logic: Lambda@Edge can check metrics, databases, etc.
- Consistent UX: Same error page for all users globally
- Low error cache TTL: Can recover quickly once origin is healthy
❌ Cons:
- Requires CloudFront: Additional infrastructure and cost
- CloudFront Functions limitations: 10KB size limit, limited runtime
- Lambda@Edge complexity: More expensive ($0.60 per 1M requests), longer latency
- Deployment time: Function updates take 15-30 minutes to propagate
- Cold starts: Lambda@Edge can have cold start latency
- Debugging challenges: Edge functions are harder to test and debug
🏆 Comparison Matrix
| Feature | App-Level | Route 53 Failover | CloudFront Functions | Lambda@Edge | Custom Error Pages |
|---|---|---|---|---|---|
| Response Time | Instant | 30-60s (DNS TTL) | Instant | Instant | Instant |
| Infrastructure Load | High | None (failover) | Low | Low | Low |
| Customization | Full | Limited (static) | Medium | High | Low (static) |
| Code Required | Yes | No | Yes (simple) | Yes (complex) | No |
| Cost | App compute | $0.50/month | $0.10 per 1M | $0.60 per 1M | Included |
| Maintenance Effort | Per app | DNS + S3 | Function updates | Function updates | Config only |
| Granularity | Per route | Per domain | Per distribution | Per distribution | Per error code |
| Works if app crashes | No | Yes | Yes | Yes | Yes |
| Edge/Global | No | Yes (DNS) | Yes | Yes | Yes |
💎 The Hybrid Approach (Best Practice)
Don't choose just one—layer your defenses:
Layer 1: Application-Level (First Line)
# Catch expected failures, show degraded functionality
@app.route('/api/users')
def get_users():
    try:
        return fetch_users_from_db()
    except psycopg2.DatabaseError:
        # Return cached data with a warning
        return {
            'users': get_cached_users(),
            'warning': 'Using cached data - live data temporarily unavailable'
        }, 200
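The cached fallback assumes some earlier successful read was stashed somewhere. A minimal sketch of that caching side (helper names are illustrative, not from the original):

import time

# Tiny in-process cache: the last good result plus when it was fetched
_user_cache = {'data': [], 'fetched_at': 0}

def fetch_users_from_db():
    """Read from the database and refresh the cache on success."""
    conn = get_db_connection()
    with conn.cursor() as cur:
        cur.execute('SELECT id, name FROM users')
        users = [{'id': row[0], 'name': row[1]} for row in cur.fetchall()]
    _user_cache['data'] = users
    _user_cache['fetched_at'] = time.time()
    return {'users': users}

def get_cached_users():
    """Return the last good result, however stale, rather than nothing."""
    return _user_cache['data']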
Layer 2: CloudFront Custom Error Pages (Second Line)
CustomErrorResponses:
  - ErrorCode: 503
    ResponsePagePath: /error.html
    ResponseCode: 503
    ErrorCachingMinTTL: 10  # Short TTL for quick recovery
Layer 3: Route 53 Failover (Nuclear Option)
# Only kicks in if health checks fail completely
PRIMARY: app.example.com → ALB
SECONDARY: app.example.com → S3 (Failover to static error page)
The Flow
1. Database goes down
2. App catches error, returns cached data or 503
3. If app returns 503, CloudFront shows professional error page
4. If entire app/ALB fails health checks, Route 53 fails over to S3
🎯 Real-World Implementation
Let's put it all together for a production setup:
#!/bin/bash
# Setup script for graceful failure handling
# 1. Create S3 bucket for error/status page
aws s3 mb s3://myapp-error-page
aws s3 cp error.html s3://myapp-error-page/index.html
aws s3 website s3://myapp-error-page --index-document index.html
# 2. Create Route 53 health check
HEALTH_CHECK_ID=$(aws route53 create-health-check \
--health-check-config Type=HTTPS,ResourcePath=/health,FullyQualifiedDomainName=app.example.com,Port=443 \
--caller-reference "health-$(date +%s)" \
--query 'HealthCheck.Id' --output text)
# 3. Create CloudFront function for error handling
aws cloudfront create-function \
--name error-handler \
--function-config Comment="Handle 5xx errors gracefully",Runtime="cloudfront-js-1.0" \
--function-code fileb://error-handler.js
# 4. Update CloudFront to use custom error pages
aws cloudfront update-distribution \
--id $DISTRIBUTION_ID \
--distribution-config file://distribution-config.json
# 5. Configure Route 53 failover
aws route53 change-resource-record-sets \
--hosted-zone-id $ZONE_ID \
--change-batch file://failover-config.json
echo "✅ Graceful failure handling configured!"
echo "Test by:"
echo "1. Taking down database"
echo "2. Watching CloudWatch metrics"
echo "3. Verifying users see professional error page"
📊 Monitoring and Alerting
Set up alerts to know when things go wrong:
# CloudWatch Alarms
DatabaseConnectionFailures:
  Metric: DatabaseConnectionErrors
  Threshold: "> 10 in 5 minutes"
  Action: SNS notification to ops team

ALB5xxErrors:
  Metric: HTTPCode_Target_5XX_Count
  Threshold: "> 50 in 2 minutes"
  Action: Page on-call engineer

Route53HealthCheckFailed:
  Metric: HealthCheckStatus
  Threshold: "< 1"
  Action: Trigger failover + alert

CloudFrontErrorRate:
  Metric: 5xxErrorRate
  Threshold: "> 5%"
  Action: Escalate to engineering lead
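The YAML above is shorthand, not a deployable template. As a concrete example, the ALB 5xx alarm could be created with the CLI roughly like this (the load balancer dimension and SNS topic ARN are placeholders):

aws cloudwatch put-metric-alarm \
--alarm-name alb-target-5xx-spike \
--namespace AWS/ApplicationELB \
--metric-name HTTPCode_Target_5XX_Count \
--dimensions Name=LoadBalancer,Value=app/my-alb/1234567890abcdef \
--statistic Sum \
--period 60 \
--evaluation-periods 2 \
--threshold 50 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-pager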
🎬 Testing Your Graceful Failure
Always test before you need it—simulate every real-world scenario:
Test 1: Database Connection Failure
# Block database access by revoking the app's ingress rule on the DB security group
# (security group IDs are placeholders)
aws ec2 revoke-security-group-ingress \
--group-id sg-database \
--protocol tcp \
--port 5432 \
--source-group sg-app
# Expected: Users see professional error page, not "502 Bad Gateway"
curl -I https://app.example.com
# Should return: HTTP/1.1 503 Service Unavailable
# Check ALB health
aws elbv2 describe-target-health --target-group-arn $TG_ARN
# Expected: Targets should be marked unhealthy
Test 2: Connection Pool Exhaustion
# Simulate high load that exhausts the connection pool
import concurrent.futures
import requests

def make_request():
    requests.get("https://app.example.com/api/users")

with concurrent.futures.ThreadPoolExecutor(max_workers=1000) as executor:
    futures = [executor.submit(make_request) for _ in range(10000)]

# Expected: Graceful degradation, not cascading failures
# Monitor: CloudWatch metrics for connection pool usage
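How gracefully this test goes depends heavily on how the pool is configured. A sketch of bounded pool settings, assuming SQLAlchemy (the article's examples use raw psycopg2, so treat this as illustrative), that makes exhaustion fail fast instead of queueing until the gateway times out:

from sqlalchemy import create_engine

# Bounded pool: at most 20 connections, wait at most 3 seconds for one.
# When the pool is exhausted, callers get an error quickly instead of
# queueing until the load balancer gives up with a 504.
engine = create_engine(
    "postgresql+psycopg2://app:password@db.example.com/app",
    pool_size=10,
    max_overflow=10,
    pool_timeout=3,
    pool_pre_ping=True,  # discard dead connections before handing them out
)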
Test 3: External API Timeout
# Use iptables to simulate slow external service
# On your app server, block the external API temporarily
sudo iptables -A OUTPUT -d payment-api.example.com -j DROP
# Make requests that depend on that API
curl https://app.example.com/api/checkout
# Expected: Should timeout gracefully with 504 after 5 seconds
# Not: Hang for 30+ seconds causing gateway timeout
Test 4: Memory Pressure / OOM
# Deploy a version with memory leak or set low memory limits
# Update task definition with only 256MB memory
aws ecs update-service \
--cluster production \
--service api-service \
--task-definition api:memory-test
# Generate load
hey -n 10000 -c 100 https://app.example.com/api/large-response
# Expected:
# - Health checks fail before OOM kill
# - ALB routes to other healthy instances
# - Users see professional error page, not blank page
Test 5: Route 53 Failover
# Mark primary as unhealthy manually
aws route53 update-health-check \
--health-check-id $HEALTH_CHECK_ID \
--disabled
# Wait 60 seconds for DNS propagation
sleep 60
# Check DNS resolution
dig app.example.com
# Expected: Should point to S3 error/status site
# Test with: curl https://app.example.com
# Should see: Static error page from S3
Test 6: CloudFront Error Interception
# Force origin to return 500 errors
# Either through app admin endpoint or by stopping services
aws ecs update-service \
--cluster production \
--service api-service \
--desired-count 0
# Make request
curl https://app.example.com
# Expected: CloudFront custom error page
# Not: Raw ALB "503 Service Temporarily Unavailable"
Test 7: SSL Certificate Issues
# Check certificate expiration
aws acm describe-certificate \
--certificate-arn $CERT_ARN \
| jq '.Certificate.NotAfter'
# Set up a monitoring alert for < 30 days to expiry
# (ACM publishes DaysToExpiry in the AWS/CertificateManager namespace)
aws cloudwatch put-metric-alarm \
--alarm-name ssl-cert-expiring \
--namespace AWS/CertificateManager \
--metric-name DaysToExpiry \
--dimensions Name=CertificateArn,Value=$CERT_ARN \
--statistic Minimum \
--period 86400 \
--evaluation-periods 1 \
--comparison-operator LessThanOrEqualToThreshold \
--threshold 30
# Expected: Alert fires before certificate expires
# Not: Finding out at 2 AM when all requests fail
Test 8: Cascading Failure from One Slow Endpoint
# Deploy a version with one intentionally slow endpoint
# In your app, add an artificial delay to one route
import time

@app.route('/api/slow')
def slow_endpoint():
    time.sleep(60)  # Simulate a slow query
    return jsonify({'data': 'slow'})
# Generate traffic to that endpoint
hey -n 1000 -c 50 https://app.example.com/api/slow &
# Try to use other endpoints
curl https://app.example.com/api/users
# Expected: Other endpoints still work (circuit breaker pattern)
# Or: Graceful degradation with timeout
# Not: Entire app becomes unresponsive
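The circuit breaker mentioned above isn't implemented anywhere in this series, so here is a minimal sketch of the idea (thresholds and names are illustrative):

import time

class CircuitBreaker:
    """Open the circuit after repeated failures so a slow or dead dependency
    fails fast instead of tying up every worker thread."""
    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open - failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result

# Usage: wrap calls to the slow dependency
payment_breaker = CircuitBreaker()
# payment_breaker.call(requests.post, 'https://payment-api.example.com/charge',
#                      json=payload, timeout=5)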
🎓 Key Takeaways
- Perfect infrastructure isn't enough: Even AWS regions fail, as the October 2025 US-East-1 incident showed
- Many failure modes exist: Database connection failures, external API timeouts, memory exhaustion, security group issues, certificate expiration, cascading failures
- Layer your defenses: Use application-level error handling + CloudFront functions + Route 53 failover together
- Fail gracefully, not silently: Show users honest, professional error pages, not blank white pages or "502 Bad Gateway"
- Health checks must be comprehensive: Check database, external APIs, memory usage—not just "is the process running"
- Edge cases happen more than you think: OOM kills, connection pool exhaustion, slow queries under load, SSL cert renewal failures
- Test the real scenarios: Block database access, disable external APIs, simulate memory pressure, force timeouts
- Monitor and alert appropriately: Different failure types need different responses and escalation paths
- Set user expectations: Error pages should honestly explain what's happening and when to expect recovery
- Graceful degradation > total failure: Sometimes returning cached data is better than returning nothing
🚀 What's Next?
You now have a complete picture of building highly available infrastructure on AWS:
- Part 1: ALB, Auto Scaling, and EC2 fundamentals
- Part 2: ECS with containers and two-dimensional scaling
- Part 3: Fargate serverless simplicity
- Part 4: Graceful failure handling and error recovery
Your infrastructure can now:
- Scale automatically based on demand ✅
- Handle instance failures ✅
- Distribute traffic intelligently ✅
- Fail gracefully when dependencies break ✅
- Provide great UX even during outages ✅
The final lesson: High availability isn't about preventing all failures—it's about handling them gracefully when they inevitably happen.
"Hope for the best, plan for the worst, and prepare to be surprised." Build systems that fail gracefully, monitor continuously, and always have a plan B (and C, and D).
Questions about graceful failure handling? Find me on social media or leave a comment below!