The Mystery of the Stubbornly Silent ECS Agent (Or: How I Learned to Stop Worrying and Love exit 0)

I wrote a whole blog post about fixing ECS agent registration issues. Turns out I was completely wrong. Here's what actually happened, the rabbit holes I fell into, and the one-line fix that ended my suffering.

Remember when I wrote that triumphant blog post about finally fixing the ECS agent registration issue? Yeah, about that...

I was completely wrong.

The issue came back. Multiple times. Like a horror movie villain that just won't stay dead. And every time, I'd apply my "fix" from the previous blog post, cross my fingers, sacrifice a rubber duck to the AWS gods, and... nothing. The ECS agent would still refuse to register.

So here's the real story. The one where I actually figured it out. Probably. Maybe. I'm like 95% sure this time.

The Symptom (Redux)

Picture this: It's late on a Saturday night (because of course it is). I deploy my infrastructure. CloudFormation succeeds. EC2 instances boot up beautifully. They have:

  • ✅ Public IP addresses
  • ✅ Proper security groups
  • ✅ Internet connectivity
  • ✅ SSM access (I can remote in!)
  • ✅ Docker running
  • ✅ All the right IAM roles

But the ECS console? Zero registered instances. The cluster sits there, lonely and empty, mocking me with its "Desired: 2, Running: 0" status.
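
If you'd rather confirm that mismatch from the CLI than squint at the console, a couple of read-only calls will do it (the cluster and ASG names here are from my setup; swap in yours):

# What does the ASG think it launched?
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names sandbox-beta-ecs-asg \
  --query 'AutoScalingGroups[0].[DesiredCapacity,length(Instances)]'

# And how many of those instances actually registered with ECS?
aws ecs describe-clusters \
  --clusters sandbox-beta-cluster \
  --query 'clusters[0].registeredContainerInstancesCount'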

The Troubleshooting Journey (A.K.A. The Descent Into Madness)

Theory #1: "It's Obviously the Network"

My first thought: "The instances can't reach the ECS endpoints!"

# Check if instance has public IP
aws ec2 describe-instances \
  --filters "Name=tag:aws:autoscaling:groupName,Values=*sandbox-beta-ecs-asg*" \
  --query 'Reservations[*].Instances[*].[InstanceId,PublicIpAddress,PrivateIpAddress]' \
  --output table

Result: Public IPs everywhere. Network is fine.

Theory #2: "Security Groups Are Blocking Something"

# Check security group rules
aws ec2 describe-security-groups \
  --group-ids sg-xxxxx \
  --query 'SecurityGroups[*].{Egress:IpPermissionsEgress}'

Result: Wide open 0.0.0.0/0 egress. Not the problem.

Theory #3: "The ECS Agent Isn't Running"

SSM into an instance:

sudo systemctl status ecs

Output:

○ ecs.service - Amazon Elastic Container Service - container agent
     Loaded: loaded (/usr/lib/systemd/system/ecs.service; enabled; preset: disabled)
     Active: inactive (dead)

AH-HA! The agent isn't running!

# Try to start it manually
sudo systemctl start ecs

Result: Command just... hangs. Forever. Like I'm waiting for Half-Life 3. Or for AWS to simplify their IAM policies.

But wait, I can start it the old-fashioned way:

sudo /usr/libexec/amazon-ecs-init start

Result: Works perfectly! Agent starts, registers with the cluster, containers deploy. Everything is sunshine and rainbows.

Theory #4: "It's the Docker Image!"

I remembered Amazon Linux 2023 uses a different image path. Added to user data:

sudo docker pull public.ecr.aws/ecs/amazon-ecs-agent:latest
sudo docker tag public.ecr.aws/ecs/amazon-ecs-agent:latest amazon/amazon-ecs-agent:latest

Result: Helped... but didn't fix the root cause.

Theory #5: "FUSE Kernel Module Is Breaking Things"

I was trying to enable s3fs mounting in containers, which requires the FUSE kernel module:

modprobe fuse
echo 'fuse' >> /etc/modules-load.d/fuse.conf

Removed all FUSE loading. Tested. Agent still wouldn't start automatically.

Then I noticed something weird: when FUSE was commented out, the agent would eventually start (after about 60 seconds). When FUSE was enabled, it wouldn't.

Conclusion: "FUSE must be the culprit!"

Narrator: It was not the culprit.

Theory #6: "Secrets Manager Timeout (Plot Twist)"

Around this time, my containers started timing out when accessing AWS Secrets Manager:

System.TimeoutException: Timeout retrieving secret from AWS Secrets Manager 
  for ARN arn:aws:secretsmanager:us-east-1:328553401036:secret:/beta/sandbox/rds/credentials

"Great," I thought, "now we have TWO problems!"

I went down another rabbit hole checking (roughly with the commands sketched after this list):

  • VPC endpoints
  • DNS resolution
  • Task IAM roles
  • Network ACLs
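
None of it turned up anything, which, in hindsight, was the clue. For posterity, the detour looked roughly like this (the VPC ID is a placeholder):

# Is there a Secrets Manager VPC endpoint, and is it available?
aws ec2 describe-vpc-endpoints \
  --filters "Name=service-name,Values=com.amazonaws.us-east-1.secretsmanager" \
  --query 'VpcEndpoints[*].[VpcEndpointId,State]'

# Does the endpoint resolve from inside the instance?
nslookup secretsmanager.us-east-1.amazonaws.com

# Are the network ACLs doing anything exotic?
aws ec2 describe-network-acls \
  --filters "Name=vpc-id,Values=vpc-xxxxx" \
  --query 'NetworkAcls[*].Entries'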

Turns out: This was a symptom, not the cause. The containers weren't running because the ECS agent wasn't starting. The ECS agent wasn't starting because...

The Breakthrough

I finally did what I should have done from the start: captured complete boot logs from a working instance and a failing instance.

I saved them side-by-side and compared them line-by-line like I was solving a murder mystery on CSI: AWS Edition.
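
If you want to do the same, you don't even need a shell on the boxes; the boot console output is enough. A rough sketch (the instance IDs are placeholders):

# Pull the boot log from a working and a failing instance
aws ec2 get-console-output --instance-id i-0aaaaaaaaaaaaaaaa \
  --query Output --output text > working.log
aws ec2 get-console-output --instance-id i-0bbbbbbbbbbbbbbbb \
  --query Output --output text > failing.log

# Diff them and look for where the failing one goes quiet
diff working.log failing.log | less

# Or, from an SSM session on the instance itself:
sudo cat /var/log/cloud-init-output.log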

The Failing Instance Log (Excerpt)

<13>Nov 17 08:45:00 user-data: User data complete: Mon Nov 17 08:45:00 UTC 2025
<13>Nov 17 08:45:00 user-data: + echo ECS_CLUSTER=sandbox-beta-cluster
<13>Nov 17 08:45:00 user-data: + systemctl restart ecs
# [crickets chirping]
# [tumbleweeds rolling]
# [heat death of the universe]

User data just... stops. Right at systemctl restart ecs. Forever.

The Working Instance Log (Excerpt)

<13>Nov 17 08:36:29 user-data: FUSE module loaded successfully
<13>Nov 17 08:36:29 user-data: + 'Done!'
<13>Nov 17 08:36:29 user-data: /var/lib/cloud/instance/scripts/part-001: line 30: Done!: command not found
[   11.204843] cloud-init[1654]: Failed to run module scripts-user

Wait. There's a syntax error (Done! without echo), causing the script to exit early. Then at 57 seconds:

[   57.319874] docker0: port 1(veth8ee4d53) entered blocking state

The ECS agent started! Containers launched!

The "Oh No" Moment

The working instance had a syntax error that caused user-data to exit early. The failing instance ran successfully to completion... and then hung forever.

Let me say that again for the people in the back:

The broken script worked. The working script broke.

The Root Cause: A Systemd Timing Deadlock

Here's what was actually happening:

  1. ECS-optimized AMIs have a built-in script that detects when you create /etc/ecs/ecs.config
  2. When detected, it automatically appends these commands after your user data:
    echo 'ECS_CLUSTER=your-cluster' >> /etc/ecs/ecs.config
    systemctl restart ecs
    
  3. This worked fine on Amazon Linux 2
  4. But Amazon Linux 2023 changed systemd boot timing

The Deadlock:

  • User data runs as part of cloud-init
  • Cloud-init runs during systemd boot sequence
  • Systemd waits for cloud-init to finish
  • systemctl commands in cloud-init wait for systemd to be "ready"
  • Circular dependency = infinite hang

It's like this:

User Data: "I'll finish once systemctl completes"
Systemd: "I'll be ready once user data completes"
Both: *stares at each other forever*
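
You can actually watch the standoff if you SSM into an instance while it's stuck. Exact output varies, but these are roughly the commands I'd reach for; expect cloud-init to report that it's still running and the ecs restart to be sitting in the systemd job queue:

# Jobs systemd still has queued or running
systemctl list-jobs

# cloud-init never reaches "done"
cloud-init status

# cloud-final.service is the stage that runs user data scripts
systemctl status cloud-final.service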

Why FUSE "Caused" the Problem

FUSE didn't cause anything. But it was the last command before the deadlock, so removing it made the script shorter, which sometimes (randomly!) changed the timing enough that systemd was ready by the time systemctl restart ecs ran.

It was pure coincidence. A red herring. A false prophet.

The Fix: One Line

Add this to the end of your user data:

echo "User data complete: $(date)"

# Exit to prevent AMI's automatic systemctl restart ecs from running during cloud-init
# ECS agent will start naturally after boot completes
exit 0

That's it. One line. exit 0.

By exiting early, you prevent the ECS-optimized AMI's automatic systemctl restart ecs from running during cloud-init. Instead, the ECS service starts naturally after boot completes (~60 seconds later), when systemd is fully ready.
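
Once the fix is in, verifying the hand-off is quick. From an SSM session on a freshly launched instance, this is roughly what I'd expect to see (give or take the timing):

# cloud-init now finishes cleanly instead of hanging
cloud-init status          # expect: status: done

# the ECS service comes up on its own once boot completes
systemctl is-active ecs    # expect: active (give it ~60 seconds)

# and the agent container should be running
sudo docker ps --filter name=ecs-agent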

The Complete Working User Data

#!/bin/bash
set -euxo pipefail
exec > >(tee /var/log/user-data.log | logger -t user-data -s 2>/dev/console) 2>&1

# Create ECS config
mkdir -p /etc/ecs
cat > /etc/ecs/ecs.config <<'EOF'
ECS_CLUSTER=your-cluster-name
ECS_ENABLE_CONTAINER_METADATA=true
ECS_ENABLE_TASK_IAM_ROLE=true
ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST=true
EOF
chmod 644 /etc/ecs/ecs.config

# Load FUSE for s3fs (yes, this is fine now!)
echo "Loading FUSE kernel module..."
modprobe fuse
echo 'fuse' >> /etc/modules-load.d/fuse.conf
lsmod | grep fuse
echo "FUSE module loaded successfully"

echo "User data complete: $(date)"

# Exit to prevent systemd deadlock
exit 0

Troubleshooting Commands for Future You

If you hit this issue, here's how to diagnose it:

1. Check if user data hung:

# View user data log
sudo cat /var/log/user-data.log

# Look for cloud-init status
cloud-init status

If user data hangs at systemctl restart ecs, you've found it.

2. Check ECS agent status:

# Check service status
sudo systemctl status ecs

# Try manual start (bypasses systemd timing issues)
sudo /usr/libexec/amazon-ecs-init start

# Check agent logs
sudo docker logs ecs-agent

3. Check instance registration:

# See if instance appears in ECS
aws ecs list-container-instances --cluster your-cluster-name

4. Verify network connectivity:

# From inside the instance
curl -v https://ecs.us-east-1.amazonaws.com
nslookup ecs.us-east-1.amazonaws.com

Lessons Learned

  1. Amazon Linux 2023 is not just "AL2 with a new version number". Systemd timing changed significantly.

  2. Always compare working vs. failing states side-by-side. I wasted days chasing red herrings because I focused on "what changed" instead of "what's different."

  3. Syntax errors that cause early exit can accidentally work around bugs. The universe has a dark sense of humor.

  4. When AWS says a service is "enabled," that doesn't mean "started during boot." The ECS agent service is enabled to start automatically... after systemd is fully ready. (A two-command way to see that distinction is sketched just after this list.)

  5. exit 0 is a valid design pattern. Sometimes the best fix is to stop trying so hard.
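
To make lesson #4 concrete: on a stuck instance, these two answers disagree, and that gap is the whole story.

# "Will systemd start this at boot?"
systemctl is-enabled ecs    # expect: enabled

# "Is it actually running right now?"
systemctl is-active ecs     # expect: inactive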

Is This a Known Issue?

Yes! But AWS doesn't document it clearly. You can find scattered references:

  • GitHub issues on ECS agent repositories
  • AWS forums with cryptic suggestions
  • Stack Overflow posts from other suffering souls

The issue specifically affects:

  • Amazon Linux 2023 (not AL2)
  • ECS-optimized AMIs
  • User data that creates /etc/ecs/ecs.config
  • Anyone who values their sanity

The Aftermath

After deploying the fix:

  • ✅ ECS agents start automatically
  • ✅ Instances register with the cluster
  • ✅ Containers deploy successfully
  • ✅ FUSE works for s3fs mounting
  • ✅ Secrets Manager access restored
  • ✅ My blood pressure returned to normal

Conclusion

I spent days chasing ghosts:

  • Network issues that didn't exist
  • Security groups that were fine
  • Docker images that worked perfectly
  • FUSE modules that were innocent bystanders
  • Secrets Manager timeouts that were just symptoms

The real problem? A systemd timing deadlock caused by calling systemctl during cloud-init on Amazon Linux 2023.

The solution? One line: exit 0.

So here I am, writing a second blog post to correct the first one, which was based on an incomplete understanding of what was actually broken. If you found my original post first, please disregard everything I said there. If you're reading this from the future, I apologize in advance for whatever I got wrong this time.

Now if you'll excuse me, I'm going to go update my original blog post with a giant "I WAS WRONG" banner and a link to this one.

Because if there's one thing I've learned from this experience, it's that debugging cloud infrastructure is an exercise in humility, stubbornness, and the occasional accidental syntax error that somehow saves the day.

TL;DR

  • Problem: ECS agent won't start on Amazon Linux 2023 instances
  • Symptom: User data hangs at systemctl restart ecs
  • Cause: Systemd timing deadlock during cloud-init
  • Solution: Add exit 0 at the end of your user data
  • Time wasted: Several days across multiple occurrences
  • Coffee consumed: Too much
  • Lessons learned: Sometimes the simplest fix is the right one

Stay caffeinated, friends. And may your ECS agents always register on the first try.
