Your infrastructure scales perfectly, but what happens when your database goes down? Learn strategies for graceful degradation, from application-level error handling to Route 53 health checks and CloudFront functions. Turn ugly 500 errors into honest, professional error pages.
Building Highly Available AWS Infrastructure: Graceful Failure - Part 4
You've done everything right. You followed Part 1 and built a highly available setup with ALB and Auto Scaling. You containerized with Part 2 using ECS. Maybe you even went serverless with Part 3 and Fargate.
Your application layer is bulletproof. It scales beautifully. Health checks are perfect. Your ALB is distributing traffic like a champ.
Then, at 2 AM on a Friday, your database goes down.
Suddenly, every single request returns a 500 error. Your perfectly scaled infrastructure becomes a perfectly scaled error generator. Your users see this:
500 Internal Server Error
The server encountered an internal error and was unable to complete your request.
Welcome to Part 4, where we talk about the harsh truth: No matter how well your applications scale, there are links in the chain that can still break.
🎯 The Cascading Failure Problem: When Health Checks Lie
Your architecture looks like this:
User → CloudFront ✅ → ALB ✅ → Fargate Tasks ✅ → RDS Database 💀 (DOWN)
Here's what happens when a critical dependency fails:
- Your Fargate tasks are passing health checks ✅ (they respond to /health)
- Your ALB is routing traffic normally ✅ (targets are "healthy")
- But every real request cascades into failure ❌ (database is down)
The root issue: your health checks only verify the infrastructure layer; they don't test the full dependency chain. Your app appears healthy because it responds to HTTP requests, even though it can't actually serve user traffic.
This isn't limited to databases. When dependencies fail—external APIs, payment processors, authentication services, or even entire AWS regions—your "healthy" infrastructure becomes a perfectly scaled error generator.
Real example: During the October 2025 US-East-1 incident, the region didn't go down completely. DNS resolution failures for DynamoDB cascaded through AWS's control plane, and one critical downstream effect was the inability to launch new EC2 instances. Existing instances continued running fine, but autoscaling was broken—if you experienced a traffic spike, you were stuck at current capacity. This created an inconsistent user experience: some requests succeeded (hitting healthy instances with capacity) while others failed (hitting overloaded instances). Multi-region architectures with geo-routing fared better, but single-region deployments had no escape valve.
What this article covers:
- Patterns to detect these failures early and show users honest, professional messages instead of cryptic "502 Bad Gateway", "503 Service Unavailable", or "504 Gateway Timeout" errors.
- While these won't fix AWS outages or broken dependencies, they prevent the cascading failure from reaching users as ugly error pages.
🎭 The User Experience Crisis
What your users actually see when things go wrong depends on where the failure occurs:
Scenario 1: All Servers Unhealthy (ALB Has No Targets)
502 Bad Gateway
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
When this happens:
- All your EC2/Fargate instances failed health checks
- Auto Scaling hasn't launched replacements yet
- ALB has no healthy targets to route to
- Classic "cold start" problem during rapid scale-up
Scenario 2: Database Connection Failures
500 Internal Server Error
Or sometimes just a blank white page in the browser.
When this happens:
- Your servers are healthy, but can't connect to RDS
- Security group changes blocked the connection
- RDS is failing over to standby or experiencing issues
- Connection pool exhausted from traffic spike
- Network partition between app and database subnets
Scenario 3: Timeout to External Dependencies
504 Gateway Timeout
When this happens:
- Your app calls an external API that's down
- Payment processor is slow/unresponsive
- Authentication service (Auth0, Cognito) has latency
- Third-party data provider is degraded
- CloudFront origin request times out (30 second default)
Scenario 4: AWS Regional Issues
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
When this happens:
- AWS Control Plane issues (like the recent US-East-1 incident)
- ALB itself is experiencing problems
- Route 53 DNS resolution delays
- Cross-AZ network degradation
- Lambda cold starts timing out
Edge Cases That Happen More Than You'd Think
Out of Memory Crashes:
502 Bad Gateway
Your container runs out of memory mid-request, crashes, ALB marks it unhealthy.
Slow Database Queries During Traffic Spike:
504 Gateway Timeout
Query works fine normally, but locks up under high load. Requests queue up and timeout.
SSL/TLS Certificate Expiration:
ERR_CERT_DATE_INVALID
Automated renewal failed, now all requests fail at the ALB level.
Cascading Failures from One Slow Endpoint:
500/502/504 (varies)
One API endpoint is slow, consumes all worker threads/connections, now everything fails.
Problems with all of these:
- Look broken and unprofessional
- No information about what's happening
- No estimated time to resolution
- No alternatives or status page link
- Users don't know if it's their problem or yours
- Makes users think your entire service is broken
What you want users to see instead:
<!-- An honest status page -->
<!DOCTYPE html>
<html>
<head><title>We'll Be Right Back</title></head>
<body style="font-family: Arial; text-align: center; padding: 50px;">
<h1>🔧 We'll be right back!</h1>
<p>We're experiencing some technical difficulties and are working to resolve them.</p>
<p>Please try again in a few minutes. We apologize for the inconvenience.</p>
<p><a href="https://status.example.com">Check our status page</a></p>
</body>
</html>
💡 Solution 1: Application-Level Graceful Degradation
The first line of defense is your application itself.
Strategy: Fail Gracefully for Different Scenarios
Handle all the real-world failure cases:
# Python/Flask example
from flask import Flask, render_template, request, jsonify
import psycopg2
import psycopg2.pool
import requests
from functools import wraps
from requests.exceptions import Timeout, RequestException

app = Flask(__name__)

# Track service health
service_health = {
    'database': True,
    'payment_api': True,
    'auth_service': True
}

def graceful_degradation(f):
    @wraps(f)
    def decorated_function(*args, **kwargs):
        try:
            # Try to execute the route
            return f(*args, **kwargs)
        except psycopg2.OperationalError as e:
            # Database connection failed
            service_health['database'] = False
            app.logger.error(f"Database error: {e}")
            return render_template('error.html', reason='database'), 503
        except psycopg2.pool.PoolError as e:
            # Connection pool exhausted
            app.logger.error(f"Connection pool exhausted: {e}")
            return render_template('error.html', reason='high_load'), 503
        except Timeout as e:
            # External API timeout
            app.logger.error(f"External service timeout: {e}")
            return render_template('error.html', reason='external_service'), 504
        except RequestException as e:
            # External API failed
            service_health['payment_api'] = False
            app.logger.error(f"External service error: {e}")
            return render_template('error.html', reason='external_service'), 503
        except MemoryError:
            # Out of memory - this is critical
            app.logger.critical("Out of memory!")
            return render_template('error.html', reason='resource_exhaustion'), 503
        except Exception as e:
            # Catch-all for unexpected errors
            app.logger.error(f"Unexpected error: {e}")
            return render_template('error.html', reason='unknown'), 503
    return decorated_function

@app.route('/api/users')
@graceful_degradation
def get_users():
    # This will fail gracefully if the database is down
    conn = get_db_connection()
    with conn.cursor() as cur:
        cur.execute('SELECT * FROM users')
        users = cur.fetchall()
    return jsonify(users)

@app.route('/api/process-payment', methods=['POST'])
@graceful_degradation
def process_payment():
    # Call the external payment API with a timeout
    response = requests.post(
        'https://payment-api.example.com/charge',
        json=request.json,
        timeout=5  # 5 second timeout to avoid gateway timeouts
    )
    return jsonify(response.json())

# Health check that ACTUALLY checks all dependencies
@app.route('/health')
def health_check():
    health_status = {'status': 'healthy', 'checks': {}}
    is_healthy = True

    # Check database connectivity
    try:
        conn = get_db_connection()
        with conn.cursor() as cur:
            cur.execute('SELECT 1')
        health_status['checks']['database'] = 'connected'
        service_health['database'] = True
    except Exception as e:
        health_status['checks']['database'] = f'error: {str(e)}'
        service_health['database'] = False
        is_healthy = False

    # Check external API connectivity (with short timeout)
    try:
        response = requests.get(
            'https://payment-api.example.com/health',
            timeout=2
        )
        if response.status_code == 200:
            health_status['checks']['payment_api'] = 'available'
            service_health['payment_api'] = True
        else:
            health_status['checks']['payment_api'] = 'degraded'
            service_health['payment_api'] = False
    except Exception:
        health_status['checks']['payment_api'] = 'unavailable'
        service_health['payment_api'] = False
        # Don't mark as unhealthy for external API issues
        # Only fail the health check for critical dependencies

    # Check memory usage
    import psutil
    memory_percent = psutil.virtual_memory().percent
    if memory_percent > 90:
        health_status['checks']['memory'] = f'critical: {memory_percent}%'
        is_healthy = False
    else:
        health_status['checks']['memory'] = f'ok: {memory_percent}%'

    if is_healthy:
        return jsonify(health_status), 200
    else:
        health_status['status'] = 'unhealthy'
        # Return 503 so the ALB marks this instance as unhealthy
        return jsonify(health_status), 503

# Error handlers - render the friendly error page for 503/504 responses
@app.errorhandler(503)
def service_unavailable(e):
    return render_template('error.html'), 503

@app.errorhandler(504)
def gateway_timeout(e):
    return render_template('error.html', reason='timeout'), 504
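The routes above assume a get_db_connection() helper that isn't shown. A minimal sketch of one, assuming psycopg2 and placeholder connection settings, that fails fast when the database is unreachable instead of hanging until the load balancer gives up:

# db.py - hypothetical connection helper (settings are placeholders)
import os
import psycopg2

def get_db_connection():
    """Open a connection with a short connect timeout so a down database
    raises psycopg2.OperationalError quickly instead of hanging."""
    return psycopg2.connect(
        host=os.environ.get("DB_HOST", "localhost"),
        dbname=os.environ.get("DB_NAME", "app"),
        user=os.environ.get("DB_USER", "app"),
        password=os.environ.get("DB_PASSWORD", ""),
        connect_timeout=2,  # seconds
    )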
.NET 9 / ASP.NET Core Implementation
Here's the same concept in modern .NET, handling all failure scenarios:
// Program.cs - ASP.NET Core 9
using Microsoft.EntityFrameworkCore;
using System.Data.Common;
var builder = WebApplication.CreateBuilder(args);
// Add services
builder.Services.AddDbContext<AppDbContext>(options =>
options.UseNpgsql(builder.Configuration.GetConnectionString("DefaultConnection")));
builder.Services.AddHttpClient();
builder.Services.AddControllers();
builder.Services.AddMemoryCache();
var app = builder.Build();
// Global exception handler middleware - handles all real-world failures
app.Use(async (context, next) =>
{
try
{
await next(context);
}
catch (DbException ex)
{
// Database connection or query failure
context.RequestServices.GetRequiredService<ILogger<Program>>()
.LogError(ex, "Database error on {Path}", context.Request.Path);
context.Response.StatusCode = 503;
context.Response.ContentType = "text/html";
await context.Response.WriteAsync(GetErrorPageHtml("database"));
}
catch (TimeoutException ex)
{
// External service timeout
context.RequestServices.GetRequiredService<ILogger<Program>>()
.LogError(ex, "Timeout on {Path}", context.Request.Path);
context.Response.StatusCode = 504;
context.Response.ContentType = "text/html";
await context.Response.WriteAsync(GetErrorPageHtml("timeout"));
}
catch (HttpRequestException ex)
{
// External API failure
context.RequestServices.GetRequiredService<ILogger<Program>>()
.LogError(ex, "External service error on {Path}", context.Request.Path);
context.Response.StatusCode = 503;
context.Response.ContentType = "text/html";
await context.Response.WriteAsync(GetErrorPageHtml("external_service"));
}
catch (OutOfMemoryException ex)
{
// Critical memory issue
context.RequestServices.GetRequiredService<ILogger<Program>>()
.LogCritical(ex, "Out of memory on {Path}", context.Request.Path);
context.Response.StatusCode = 503;
context.Response.ContentType = "text/html";
await context.Response.WriteAsync(GetErrorPageHtml("resource_exhaustion"));
}
catch (Exception ex)
{
// Catch-all for unexpected errors
context.RequestServices.GetRequiredService<ILogger<Program>>()
.LogError(ex, "Unexpected error on {Path}", context.Request.Path);
context.Response.StatusCode = 503;
context.Response.ContentType = "text/html";
await context.Response.WriteAsync(GetErrorPageHtml("unknown"));
}
});
app.MapControllers();
// Health check endpoint that checks ALL dependencies
app.MapGet("/health", async (
AppDbContext db,
IHttpClientFactory httpClientFactory,
ILogger<Program> logger) =>
{
var healthStatus = new Dictionary<string, object>
{
["status"] = "healthy",
["checks"] = new Dictionary<string, string>(),
["timestamp"] = DateTime.UtcNow
};
var isHealthy = true;
var checks = (Dictionary<string, string>)healthStatus["checks"];
// Check database connectivity
try
{
await db.Database.ExecuteSqlRawAsync("SELECT 1");
checks["database"] = "connected";
}
catch (Exception ex)
{
logger.LogError(ex, "Database health check failed");
checks["database"] = $"error: {ex.Message}";
isHealthy = false;
}
// Check external API connectivity (with timeout)
try
{
var httpClient = httpClientFactory.CreateClient();
httpClient.Timeout = TimeSpan.FromSeconds(2);
var response = await httpClient.GetAsync("https://payment-api.example.com/health");
if (response.IsSuccessStatusCode)
{
checks["payment_api"] = "available";
}
else
{
checks["payment_api"] = "degraded";
// Don't fail health check for non-critical dependencies
}
}
catch (Exception ex)
{
logger.LogWarning(ex, "External API health check failed");
checks["payment_api"] = "unavailable";
// Don't mark as unhealthy for external API issues
}
// Check memory usage
var memoryUsed = GC.GetTotalMemory(false);
var memoryMB = memoryUsed / 1024 / 1024;
if (memoryMB > 1024) // More than 1GB
{
checks["memory"] = $"high: {memoryMB}MB";
// Warning but not failing
}
else
{
checks["memory"] = $"ok: {memoryMB}MB";
}
if (isHealthy)
{
return Results.Ok(healthStatus);
}
else
{
healthStatus["status"] = "unhealthy";
// Return 503 so ALB marks this instance as unhealthy
return Results.Json(healthStatus, statusCode: 503);
}
});
app.Run();
static string GetErrorPageHtml(string reason = "unknown")
{
var message = reason switch
{
"database" => "We're experiencing database connectivity issues.",
"timeout" => "An external service is responding slowly.",
"external_service" => "A dependent service is temporarily unavailable.",
"resource_exhaustion" => "We're experiencing high system load.",
_ => "We're experiencing technical difficulties."
};
return $@"
<!DOCTYPE html>
<html lang=""en"">
<head>
<meta charset=""UTF-8"">
<meta name=""viewport"" content=""width=device-width, initial-scale=1.0"">
<title>We'll Be Right Back</title>
<style>
body {{
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
display: flex;
align-items: center;
justify-content: center;
height: 100vh;
margin: 0;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
}}
.container {{
text-align: center;
padding: 2rem;
max-width: 600px;
}}
h1 {{ font-size: 3rem; margin: 0; }}
p {{ font-size: 1.2rem; opacity: 0.9; }}
</style>
</head>
<body>
<div class=""container"">
<h1>🔧 We'll Be Right Back!</h1>
<p>{message}</p>
<p>Please try again in a few minutes.</p>
<p style=""font-size: 0.9rem; margin-top: 2rem; opacity: 0.7;"">
If this persists, contact support@acme.com
</p>
</div>
<script>
// Auto-refresh every 30 seconds
setTimeout(() => location.reload(), 30000);
</script>
</body>
</html>";
}
Key .NET Features Used:
✅ Global exception middleware - Catches all unhandled exceptions by type
✅ Minimal API health endpoint - Clean, simple health checks with multiple dependencies
✅ Multiple exception handlers - Handles DbException, TimeoutException, HttpRequestException, OutOfMemoryException
✅ Pattern matching - Switch expressions for customized error messages
✅ IHttpClientFactory - Proper HTTP client management for external service checks
✅ Structured logging - Different log levels (Error, Warning, Critical) for different scenarios
✅ Memory monitoring - Uses GC to track memory usage in health checks
Alternative: Using Middleware Class
For larger applications, create a dedicated middleware:
// GracefulDegradationMiddleware.cs
public class GracefulDegradationMiddleware
{
private readonly RequestDelegate _next;
private readonly ILogger<GracefulDegradationMiddleware> _logger;
public GracefulDegradationMiddleware(
RequestDelegate next,
ILogger<GracefulDegradationMiddleware> logger)
{
_next = next;
_logger = logger;
}
public async Task InvokeAsync(HttpContext context)
{
try
{
await _next(context);
}
catch (DbException ex)
{
_logger.LogError(ex, "Database error on {Path}", context.Request.Path);
await HandleFailureAsync(context, "Database temporarily unavailable");
}
catch (HttpRequestException ex)
{
_logger.LogError(ex, "External service error on {Path}", context.Request.Path);
await HandleFailureAsync(context, "External service temporarily unavailable");
}
}
private static async Task HandleFailureAsync(HttpContext context, string message)
{
context.Response.StatusCode = 503;
context.Response.ContentType = "application/json";
await context.Response.WriteAsJsonAsync(new
{
status = "service_unavailable",
message = message,
timestamp = DateTime.UtcNow
});
}
}
// Register in Program.cs
app.UseMiddleware<GracefulDegradationMiddleware>();
The Error Page Template
<!-- templates/error.html -->
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>We'll Be Right Back</title>
<style>
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Arial, sans-serif;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
display: flex;
align-items: center;
justify-content: center;
min-height: 100vh;
margin: 0;
padding: 20px;
}
.container {
text-align: center;
max-width: 600px;
background: rgba(255, 255, 255, 0.1);
backdrop-filter: blur(10px);
padding: 60px 40px;
border-radius: 20px;
box-shadow: 0 8px 32px rgba(0, 0, 0, 0.3);
}
h1 { font-size: 3em; margin: 0 0 20px; }
p { font-size: 1.2em; margin: 15px 0; opacity: 0.9; }
.status-link {
display: inline-block;
margin-top: 30px;
padding: 12px 30px;
background: white;
color: #667eea;
text-decoration: none;
border-radius: 25px;
font-weight: 600;
transition: transform 0.2s;
}
.status-link:hover { transform: translateY(-2px); }
.icon { font-size: 4em; margin-bottom: 20px; }
</style>
</head>
<body>
<div class="container">
<div class="icon">🔧</div>
<h1>We'll Be Right Back!</h1>
<p>We're currently experiencing technical difficulties.</p>
<p>Our team has been notified and is working to resolve the issue.</p>
<p style="font-size: 0.9em; opacity: 0.7;">Estimated resolution time: 15-30 minutes</p>
<a href="https://status.example.com" class="status-link">Check Status Page</a>
</div>
<script>
// Auto-refresh every 30 seconds
setTimeout(() => location.reload(), 30000);
</script>
</body>
</html>
Pros and Cons
✅ Pros:
- Handles all real failure scenarios: Database outages, external API timeouts, memory exhaustion, etc.
- Context-aware messages: Different errors show appropriate messages to users
- Immediate response: No DNS propagation delays, users see error pages instantly
- Health checks work properly: ALB can detect unhealthy instances and route around them
- Can include degraded mode: Return cached data or limited functionality instead of total failure
- Logging included: Every failure is logged with context for debugging
❌ Cons:
- Only works if app is running: If your container crashes completely (OOM kill), this won't help
- Requires code in every app: Each microservice needs its own error handling
- Still uses compute resources: Even rendering error pages consumes CPU/memory
- Won't fix AWS regional issues: If ALB itself is down, your graceful error handling can't run
- Complex health checks: Need to balance checking dependencies vs. marking instance unhealthy
- Edge case: Security group changes: If DB security group blocks your app, you still serve errors (but gracefully)
💡 Solution 2: Route 53 Health Checks with Failover
Take control at the DNS level before requests even reach your infrastructure.
Strategy: DNS Failover to Static Error/Status Site
Normal Operation:
User → DNS (app.example.com) → ALB → Your App
Database Down:
User → DNS (app.example.com) → S3 Static Site (Error/Status Page)
Setting It Up
1. Create an error/status page in S3:
# Create S3 bucket for error/status page
aws s3 mb s3://example-error-page
# Upload your error page
aws s3 cp error.html s3://example-error-page/index.html \
--content-type "text/html" \
--cache-control "no-cache, no-store, must-revalidate"
# Configure bucket for static website hosting
aws s3 website s3://example-error-page \
--index-document index.html
# Make it public (or use CloudFront for better security)
aws s3api put-bucket-policy \
--bucket example-error-page \
--policy '{
"Version": "2012-10-17",
"Statement": [{
"Sid": "PublicReadGetObject",
"Effect": "Allow",
"Principal": "*",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::example-error-page/*"
}]
}'
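With the policy applied, it's worth hitting the S3 website endpoint directly to confirm the page renders (bucket name and region below are placeholders; the endpoint format is <bucket>.s3-website-<region>.amazonaws.com):

# S3 website endpoints are HTTP-only
curl -s http://example-error-page.s3-website-us-east-1.amazonaws.com | head -n 5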
2. Create Route 53 health check:
# Health check that monitors your actual app health
aws route53 create-health-check \
--health-check-config '{
"Type": "HTTPS",
"ResourcePath": "/health",
"FullyQualifiedDomainName": "app.example.com",
"Port": 443,
"RequestInterval": 30,
"FailureThreshold": 3,
"MeasureLatency": true,
"EnableSNI": true
}' \
--caller-reference "app-health-check-$(date +%s)"
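Before wiring the health check into DNS failover, you can confirm the Route 53 checkers actually see your endpoint as healthy (the ID below is a placeholder):

# Shows the latest observation from each Route 53 health checker
aws route53 get-health-check-status \
--health-check-id abc123-health-check-id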
3. Configure Route 53 failover records:
# Primary record (your main ALB)
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch '{
"Changes": [{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "Primary",
"Failover": "PRIMARY",
"AliasTarget": {
"HostedZoneId": "Z35SXDOTRQ7X7K",
"DNSName": "my-alb-123456.us-east-1.elb.amazonaws.com",
"EvaluateTargetHealth": false
},
"HealthCheckId": "abc123-health-check-id"
}
}]
}'
# Secondary record (S3 error/status page)
# Note: for an S3 website alias, the bucket name must match the record name (app.example.com)
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890ABC \
--change-batch '{
"Changes": [{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "app.example.com",
"Type": "A",
"SetIdentifier": "Secondary",
"Failover": "SECONDARY",
"AliasTarget": {
"HostedZoneId": "Z3AQBSTGFYJSTF",
"DNSName": "s3-website-us-east-1.amazonaws.com",
"EvaluateTargetHealth": false
}
}
}]
}'
Pros and Cons
✅ Pros:
- Completely offloads traffic: When failing over, zero load on your infrastructure
- Works even if app crashes: DNS-level failover doesn't require your app to be running
- Simple error page: Just static HTML in S3
- Cost-effective during outages: S3 hosting is pennies compared to running instances
- Automatic failover: Route 53 detects failure and switches automatically
❌ Cons:
- DNS propagation delay: Can take 30-60 seconds (or longer with caching) for failover to take effect
- TTL complications: Clients cache DNS for the TTL duration (typically 60-300 seconds)
- All or nothing: Either all traffic goes to the error page or none
- Limited customization: Static page can't show dynamic information
- No HTTPS on S3 website endpoints: To serve the failover page over HTTPS you need CloudFront in front of the bucket
- Health check costs: Route 53 health checks cost $0.50/month each
- Not granular: Can't fail over specific routes, only entire domains
💡 Solution 3: CloudFront with Edge Functions
Intercept and handle errors at the edge, closest to your users.
Strategy: CloudFront Functions or Lambda@Edge
CloudFront sits in front of your entire infrastructure and can inspect/modify responses:
User → CloudFront (Edge Location) → ALB → Your App
        ↓ (detects a 5xx response from the origin)
        ↓ (returns a professional error page to the viewer)
Option A: CloudFront Functions (Lightweight)
CloudFront Functions run in microseconds and are perfect for simple transformations:
// CloudFront Function (viewer-response event)
function handler(event) {
var response = event.response;
var statusCode = response.statusCode;
// If origin returned 5xx error, return error page
if (statusCode >= 500 && statusCode < 600) {
return {
statusCode: 503,
statusDescription: 'Service Unavailable',
headers: {
'content-type': { value: 'text/html; charset=utf-8' },
'cache-control': { value: 'no-cache, no-store, must-revalidate' }
},
body: `<!DOCTYPE html>
<html>
<head>
<title>We'll Be Right Back</title>
<style>
body {
font-family: Arial, sans-serif;
display: flex;
align-items: center;
justify-content: center;
min-height: 100vh;
margin: 0;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
text-align: center;
}
.container {
background: rgba(255, 255, 255, 0.1);
padding: 40px;
border-radius: 20px;
backdrop-filter: blur(10px);
}
h1 { font-size: 2.5em; margin: 0 0 20px; }
p { font-size: 1.1em; margin: 10px 0; }
</style>
</head>
<body>
<div class="container">
<h1>🔧 We'll Be Right Back!</h1>
<p>We're experiencing technical difficulties.</p>
<p>Our team is working to resolve the issue.</p>
<p style="font-size: 0.9em; opacity: 0.8;">Please try again in a few minutes.</p>
</div>
<script>setTimeout(() => location.reload(), 30000);</script>
</body>
</html>`
};
}
// Return original response if no error
return response;
}
Deploying the function:
# Create function
aws cloudfront create-function \
--name error-handler \
--function-config Comment="Handle 5xx errors gracefully",Runtime="cloudfront-js-1.0" \
--function-code fileb://error-handler.js
# Publish function
aws cloudfront publish-function \
--name error-handler \
--if-match ETVABCDEF12345
# Associate with the CloudFront distribution
# (abbreviated: in practice update-distribution needs the full distribution config and an --if-match ETag)
aws cloudfront update-distribution \
--id E1234ABCD \
--distribution-config '{
"DefaultCacheBehavior": {
"FunctionAssociations": {
"Quantity": 1,
"Items": [{
"FunctionARN": "arn:aws:cloudfront::123456:function/error-handler",
"EventType": "viewer-response"
}]
}
}
}'
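Before publishing, you can also exercise the function against a synthetic viewer-response event while it's still in the DEVELOPMENT stage; a sketch (the event file and ETag are placeholders):

# test-event.json contains a viewer-response event whose status is a 5xx
aws cloudfront test-function \
--name error-handler \
--if-match ETVABCDEF12345 \
--stage DEVELOPMENT \
--event-object fileb://test-event.json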
Option B: Lambda@Edge (Full Power)
For more complex logic, use Lambda@Edge:
# Lambda@Edge function (origin-response event)
import json
import boto3

def lambda_handler(event, context):
    response = event['Records'][0]['cf']['response']
    status = int(response['status'])

    # If 5xx error, check if it's a database issue
    if 500 <= status < 600:
        # Could check CloudWatch metrics, or RDS status here
        # For simplicity, return the error page for all 5xx
        error_page = """<!DOCTYPE html>
<html>
<head>
<title>We'll Be Right Back</title>
<style>
body {
  font-family: Arial, sans-serif;
  text-align: center;
  padding: 50px;
  background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
  color: white;
}
.container {
  max-width: 600px;
  margin: 0 auto;
  background: rgba(255,255,255,0.1);
  padding: 40px;
  border-radius: 20px;
}
h1 { font-size: 2.5em; }
</style>
</head>
<body>
<div class="container">
<h1>🔧 We're Having Technical Issues</h1>
<p>We're experiencing technical difficulties and our team is working to resolve them.</p>
<p>We'll be back shortly!</p>
</div>
</body>
</html>"""

        return {
            'status': '503',
            'statusDescription': 'Service Unavailable',
            'headers': {
                'content-type': [{'key': 'Content-Type', 'value': 'text/html'}],
                'cache-control': [{'key': 'Cache-Control', 'value': 'no-cache'}]
            },
            'body': error_page
        }

    return response
CloudFront Custom Error Pages (Simplest Option)
CloudFront also supports custom error pages without any code:
# (abbreviated: supply the full distribution config and an --if-match ETag in practice)
aws cloudfront update-distribution \
--id E1234ABCD \
--distribution-config '{
"CustomErrorResponses": {
"Quantity": 3,
"Items": [
{
"ErrorCode": 500,
"ResponsePagePath": "/error.html",
"ResponseCode": "503",
"ErrorCachingMinTTL": 10
},
{
"ErrorCode": 502,
"ResponsePagePath": "/error.html",
"ResponseCode": "503",
"ErrorCachingMinTTL": 10
},
{
"ErrorCode": 503,
"ResponsePagePath": "/error.html",
"ResponseCode": "503",
"ErrorCachingMinTTL": 10
}
]
}
}'
Then host error.html in your S3 origin bucket.
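One wrinkle: if the distribution's default origin is your ALB, error.html has to come from an origin that is still healthy when the ALB isn't. A common approach, sketched below with placeholder names and an abbreviated config, is to add the S3 bucket as a second origin and route /error.html to it with a dedicated cache behavior:

{
"Origins": {
"Quantity": 2,
"Items": [
{ "Id": "alb-origin", "DomainName": "my-alb-123456.us-east-1.elb.amazonaws.com" },
{ "Id": "error-page-s3", "DomainName": "myapp-error-page.s3.amazonaws.com" }
]
},
"CacheBehaviors": {
"Quantity": 1,
"Items": [{
"PathPattern": "/error.html",
"TargetOriginId": "error-page-s3",
"ViewerProtocolPolicy": "redirect-to-https"
}]
}
}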
Pros and Cons
✅ Pros:
- Edge-level response: Handled at CloudFront POPs, closest to users
- Fast failover: No DNS propagation delays
- Reduced origin load: Errors intercepted before hitting origin repeatedly
- Granular control: Can handle different error codes differently
- Custom logic: Lambda@Edge can check metrics, databases, etc.
- Consistent UX: Same error page for all users globally
- Low error cache TTL: Can recover quickly once origin is healthy
❌ Cons:
- Requires CloudFront: Additional infrastructure and cost
- CloudFront Functions limitations: 10KB size limit, limited runtime
- Lambda@Edge complexity: More expensive ($0.60 per 1M requests), longer latency
- Deployment time: Function updates take 15-30 minutes to propagate
- Cold starts: Lambda@Edge can have cold start latency
- Debugging challenges: Edge functions are harder to test and debug
🏆 Comparison Matrix
| Feature | App-Level | Route 53 Failover | CloudFront Functions | Lambda@Edge | Custom Error Pages |
|---|---|---|---|---|---|
| Response Time | Instant | 30-60s (DNS TTL) | Instant | Instant | Instant |
| Infrastructure Load | High | None (failover) | Low | Low | Low |
| Customization | Full | Limited (static) | Medium | High | Low (static) |
| Code Required | Yes | No | Yes (simple) | Yes (complex) | No |
| Cost | App compute | $0.50/month | $0.10 per 1M | $0.60 per 1M | Included |
| Maintenance Effort | Per app | DNS + S3 | Function updates | Function updates | Config only |
| Granularity | Per route | Per domain | Per distribution | Per distribution | Per error code |
| Works if app crashes | No | Yes | Yes | Yes | Yes |
| Edge/Global | No | Yes (DNS) | Yes | Yes | Yes |
💎 The Hybrid Approach (Best Practice)
Don't choose just one—layer your defenses:
Layer 1: Application-Level (First Line)
# Catch expected failures, show degraded functionality
@app.route('/api/users')
def get_users():
    try:
        return fetch_users_from_db()
    except psycopg2.DatabaseError:
        # Return cached data with a warning
        return {
            'users': get_cached_users(),
            'warning': 'Using cached data - live data temporarily unavailable'
        }, 200
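The cached fallback assumes some earlier successful read was stashed somewhere. A minimal sketch of that caching side (helper names are illustrative, not from the original):

import time

# Tiny in-process cache: the last good result plus when it was fetched
_user_cache = {'data': [], 'fetched_at': 0}

def fetch_users_from_db():
    """Read from the database and refresh the cache on success."""
    conn = get_db_connection()
    with conn.cursor() as cur:
        cur.execute('SELECT id, name FROM users')
        users = [{'id': row[0], 'name': row[1]} for row in cur.fetchall()]
    _user_cache['data'] = users
    _user_cache['fetched_at'] = time.time()
    return {'users': users}

def get_cached_users():
    """Return the last good result, however stale, rather than nothing."""
    return _user_cache['data']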
Layer 2: CloudFront Custom Error Pages (Second Line)
CustomErrorResponses:
  - ErrorCode: 503
    ResponsePagePath: /error.html
    ResponseCode: 503
    ErrorCachingMinTTL: 10  # Short TTL for quick recovery
Layer 3: Route 53 Failover (Nuclear Option)
# Only kicks in if health checks fail completely
PRIMARY: app.example.com → ALB
SECONDARY: app.example.com → S3 (Failover to static error page)
The Flow
1. Database goes down
2. App catches error, returns cached data or 503
3. If app returns 503, CloudFront shows professional error page
4. If entire app/ALB fails health checks, Route 53 fails over to S3
🎯 Real-World Implementation
Let's put it all together for a production setup:
#!/bin/bash
# Setup script for graceful failure handling
# 1. Create S3 bucket for error/status page
aws s3 mb s3://myapp-error-page
aws s3 cp error.html s3://myapp-error-page/index.html
aws s3 website s3://myapp-error-page --index-document index.html
# 2. Create Route 53 health check
HEALTH_CHECK_ID=$(aws route53 create-health-check \
--health-check-config Type=HTTPS,ResourcePath=/health,FullyQualifiedDomainName=app.example.com,Port=443 \
--caller-reference "health-$(date +%s)" \
--query 'HealthCheck.Id' --output text)
# 3. Create CloudFront function for error handling
aws cloudfront create-function \
--name error-handler \
--function-config Comment="Handle 5xx errors gracefully",Runtime="cloudfront-js-1.0" \
--function-code fileb://error-handler.js
# 4. Update CloudFront to use custom error pages
aws cloudfront update-distribution \
--id $DISTRIBUTION_ID \
--distribution-config file://distribution-config.json
# 5. Configure Route 53 failover
aws route53 change-resource-record-sets \
--hosted-zone-id $ZONE_ID \
--change-batch file://failover-config.json
echo "✅ Graceful failure handling configured!"
echo "Test by:"
echo "1. Taking down database"
echo "2. Watching CloudWatch metrics"
echo "3. Verifying users see professional error page"
📊 Monitoring and Alerting
Set up alerts to know when things go wrong:
# CloudWatch Alarms
DatabaseConnectionFailures:
  Metric: DatabaseConnectionErrors
  Threshold: "> 10 in 5 minutes"
  Action: SNS notification to ops team

ALB5xxErrors:
  Metric: HTTPCode_Target_5XX_Count
  Threshold: "> 50 in 2 minutes"
  Action: Page on-call engineer

Route53HealthCheckFailed:
  Metric: HealthCheckStatus
  Threshold: "< 1"
  Action: Trigger failover + alert

CloudFrontErrorRate:
  Metric: 5xxErrorRate
  Threshold: "> 5%"
  Action: Escalate to engineering lead
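The YAML above is shorthand, not a deployable template. As a concrete example, the ALB 5xx alarm could be created with the CLI roughly like this (the load balancer dimension and SNS topic ARN are placeholders):

aws cloudwatch put-metric-alarm \
--alarm-name alb-target-5xx-spike \
--namespace AWS/ApplicationELB \
--metric-name HTTPCode_Target_5XX_Count \
--dimensions Name=LoadBalancer,Value=app/my-alb/1234567890abcdef \
--statistic Sum \
--period 60 \
--evaluation-periods 2 \
--threshold 50 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-pager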
🎬 Testing Your Graceful Failure
Always test before you need it—simulate every real-world scenario:
Test 1: Database Connection Failure
# Block database access by revoking the app's ingress rule on the DB security group
# (security group IDs are placeholders)
aws ec2 revoke-security-group-ingress \
--group-id sg-database \
--protocol tcp \
--port 5432 \
--source-group sg-app
# Expected: Users see professional error page, not "502 Bad Gateway"
curl -I https://app.example.com
# Should return: HTTP/1.1 503 Service Unavailable
# Check ALB health
aws elbv2 describe-target-health --target-group-arn $TG_ARN
# Expected: Targets should be marked unhealthy
Test 2: Connection Pool Exhaustion
# Simulate high load that exhausts the connection pool
import concurrent.futures
import requests

def make_request():
    requests.get("https://app.example.com/api/users")

with concurrent.futures.ThreadPoolExecutor(max_workers=1000) as executor:
    futures = [executor.submit(make_request) for _ in range(10000)]

# Expected: Graceful degradation, not cascading failures
# Monitor: CloudWatch metrics for connection pool usage
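How gracefully this test goes depends heavily on how the pool is configured. A sketch of bounded pool settings, assuming SQLAlchemy (the article's examples use raw psycopg2, so treat this as illustrative), that makes exhaustion fail fast instead of queueing until the gateway times out:

from sqlalchemy import create_engine

# Bounded pool: at most 20 connections, wait at most 3 seconds for one.
# When the pool is exhausted, callers get an error quickly instead of
# queueing until the load balancer gives up with a 504.
engine = create_engine(
    "postgresql+psycopg2://app:password@db.example.com/app",
    pool_size=10,
    max_overflow=10,
    pool_timeout=3,
    pool_pre_ping=True,  # discard dead connections before handing them out
)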
Test 3: External API Timeout
# Use iptables to simulate slow external service
# On your app server, block the external API temporarily
sudo iptables -A OUTPUT -d payment-api.example.com -j DROP
# Make requests that depend on that API
curl https://app.example.com/api/checkout
# Expected: Should timeout gracefully with 504 after 5 seconds
# Not: Hang for 30+ seconds causing gateway timeout
Test 4: Memory Pressure / OOM
# Deploy a version with memory leak or set low memory limits
# Update task definition with only 256MB memory
aws ecs update-service \
--cluster production \
--service api-service \
--task-definition api:memory-test
# Generate load
hey -n 10000 -c 100 https://app.example.com/api/large-response
# Expected:
# - Health checks fail before OOM kill
# - ALB routes to other healthy instances
# - Users see professional error page, not blank page
Test 5: Route 53 Failover
# Mark primary as unhealthy manually
aws route53 update-health-check \
--health-check-id $HEALTH_CHECK_ID \
--disabled
# Wait 60 seconds for DNS propagation
sleep 60
# Check DNS resolution
dig app.example.com
# Expected: Should point to S3 error/status site
# Test with: curl https://app.example.com
# Should see: Static error page from S3
Test 6: CloudFront Error Interception
# Force origin to return 500 errors
# Either through app admin endpoint or by stopping services
aws ecs update-service \
--cluster production \
--service api-service \
--desired-count 0
# Make request
curl https://app.example.com
# Expected: CloudFront custom error page
# Not: Raw ALB "503 Service Temporarily Unavailable"
Test 7: SSL Certificate Issues
# Check certificate expiration
aws acm describe-certificate \
--certificate-arn $CERT_ARN \
| jq '.Certificate.NotAfter'
# Set up a monitoring alert for < 30 days to expiry
# (ACM publishes DaysToExpiry in the AWS/CertificateManager namespace)
aws cloudwatch put-metric-alarm \
--alarm-name ssl-cert-expiring \
--namespace AWS/CertificateManager \
--metric-name DaysToExpiry \
--dimensions Name=CertificateArn,Value=$CERT_ARN \
--statistic Minimum \
--period 86400 \
--evaluation-periods 1 \
--comparison-operator LessThanOrEqualToThreshold \
--threshold 30
# Expected: Alert fires before certificate expires
# Not: Finding out at 2 AM when all requests fail
Test 8: Cascading Failure from One Slow Endpoint
# Deploy a version with one intentionally slow endpoint
# In your app, add an artificial delay to one route
import time

@app.route('/api/slow')
def slow_endpoint():
    time.sleep(60)  # Simulate a slow query
    return jsonify({'data': 'slow'})
# Generate traffic to that endpoint
hey -n 1000 -c 50 https://app.example.com/api/slow &
# Try to use other endpoints
curl https://app.example.com/api/users
# Expected: Other endpoints still work (circuit breaker pattern)
# Or: Graceful degradation with timeout
# Not: Entire app becomes unresponsive
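The circuit breaker mentioned above isn't implemented anywhere in this series, so here is a minimal sketch of the idea (thresholds and names are illustrative):

import time

class CircuitBreaker:
    """Open the circuit after repeated failures so a slow or dead dependency
    fails fast instead of tying up every worker thread."""
    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open - failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result

# Usage: wrap calls to the slow dependency
payment_breaker = CircuitBreaker()
# payment_breaker.call(requests.post, 'https://payment-api.example.com/charge',
#                      json=payload, timeout=5)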
🎓 Key Takeaways
- Perfect infrastructure isn't enough: Even AWS regions fail, as the October 2025 US-East-1 incident showed
- Many failure modes exist: Database connection failures, external API timeouts, memory exhaustion, security group issues, certificate expiration, cascading failures
- Layer your defenses: Use application-level error handling + CloudFront functions + Route 53 failover together
- Fail gracefully, not silently: Show users honest, professional error pages, not blank white pages or "502 Bad Gateway"
- Health checks must be comprehensive: Check database, external APIs, memory usage—not just "is the process running"
- Edge cases happen more than you think: OOM kills, connection pool exhaustion, slow queries under load, SSL cert renewal failures
- Test the real scenarios: Block database access, disable external APIs, simulate memory pressure, force timeouts
- Monitor and alert appropriately: Different failure types need different responses and escalation paths
- Set user expectations: Error pages should honestly explain what's happening and when to expect recovery
- Graceful degradation > total failure: Sometimes returning cached data is better than returning nothing
🚀 What's Next?
You now have a complete picture of building highly available infrastructure on AWS:
- Part 1: ALB, Auto Scaling, and EC2 fundamentals
- Part 2: ECS with containers and two-dimensional scaling
- Part 3: Fargate serverless simplicity
- Part 4: Graceful failure handling and error recovery
Your infrastructure can now:
- Scale automatically based on demand ✅
- Handle instance failures ✅
- Distribute traffic intelligently ✅
- Fail gracefully when dependencies break ✅
- Provide great UX even during outages ✅
The final lesson: High availability isn't about preventing all failures—it's about handling them gracefully when they inevitably happen.
"Hope for the best, plan for the worst, and prepare to be surprised." Build systems that fail gracefully, monitor continuously, and always have a plan B (and C, and D).
Questions about graceful failure handling? Find me on social media or leave a comment below!