Bad Health Checks With Dynamic Container Ports

Last night, we solved a long-standing bug in one of our Terraform modules. It had been haunting us for well over a year. We managed to reproduce it in a CloudFormation stack as well, but we just couldn’t figure out where it was coming from.

For reference, we were trying to stand up an Elastic Container Service cluster with containers using dynamic port routing. When you do this, you build it with an Application Load Balancer (ALB) plus a target group. The way things are supposed to work is that when a container is spun up on the cluster, it is assigned an ephemeral host port, and that port is registered with the target group, which then health-checks it.
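As a rough illustration of that dynamic port setup, here is a minimal sketch of a task definition and service. It is not the module from this post: the `nginx` task and service names, the image, the memory value, container port 80, and the `aws_ecs_cluster.demo` reference are all assumptions made for the example. Only `aws_alb_target_group.nginx` comes from the ASG snippet further down.

```hcl
# Sketch only: names, image, memory, and ports are assumed for illustration.
resource "aws_ecs_task_definition" "nginx" {
  family                   = "nginx"
  network_mode             = "bridge" # dynamic host ports rely on bridge networking
  requires_compatibilities = ["EC2"]

  container_definitions = jsonencode([
    {
      name      = "nginx"
      image     = "nginx:stable"
      memory    = 128
      essential = true
      portMappings = [
        {
          containerPort = 80
          hostPort      = 0 # 0 asks ECS to pick an ephemeral host port
          protocol      = "tcp"
        }
      ]
    }
  ])
}

resource "aws_ecs_service" "nginx" {
  name            = "nginx"
  cluster         = aws_ecs_cluster.demo.id # hypothetical cluster resource
  task_definition = aws_ecs_task_definition.nginx.arn
  desired_count   = 2

  # The service, not the ASG, registers each task's instance and
  # ephemeral port with the target group.
  load_balancer {
    target_group_arn = aws_alb_target_group.nginx.arn
    container_name   = "nginx"
    container_port   = 80
  }
}
```

With this wiring, every running task shows up in the target group as its own instance-and-port target, which is the only kind of target you want to see there.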

That was essentially working, but something was adding an extra, erroneous health check on the exposed port (https/443), which caused the auto-scaling group to think things were amiss and continuously terminate and rebuild instances. Not a fun situation. Our workaround was to manually remove the health check. But each time the ASG terminated and added an instance, the bad health check would come back. We finally figured this mess out.

What was happening? In our auto-scaling group definition (CloudFormation or Terraform, it doesn’t matter), defining a health check of “ELB” in the ASG resource produces this issue. You don’t want to define the health check in the ASG resource at all, actually. You want to leave that up to the target group. By default, the ASG uses an EC2 health check, and that’s all you need at the ASG level.

In Terraform, that looks like this:

resource "aws_autoscaling_group" "demo-cluster" {
  name                = "demo-cluster"
  vpc_zone_identifier = [aws_subnet.demo-public-1.id, aws_subnet.demo-public-2.id, aws_subnet.demo-public-3.id]
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  # launch_configuration = aws_launch_configuration.demo-cluster-lc.name

  # *** THESE TWO LINES PRODUCE THE BAD HEALTH CHECK
  # health_check_type = "ELB"
  # target_group_arns = [aws_alb_target_group.nginx.arn]
  # *** END SUSPECT LINES

  # With health_check_type left unset, the ASG falls back to its default EC2 check.
  launch_template {
    id      = aws_launch_template.demo-cluster-lt.id
    version = "$Latest"
  }
}

Remove those two lines from your Terraform ASG resource and the only health checks left will be the ones for your containers’ dynamic ports.
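Since the health check now lives entirely on the target group, a sketch of that resource may help. This is an assumed configuration, not the one from our module: the port, protocol, path, thresholds, and the `aws_vpc.demo` reference are placeholders, and our real setup exposed https/443.

```hcl
# Sketch only: port, protocol, path, and thresholds are assumed values.
resource "aws_alb_target_group" "nginx" {
  name        = "nginx"
  port        = 80 # default port; ECS overrides it per task with the ephemeral port
  protocol    = "HTTP"
  target_type = "instance"
  vpc_id      = aws_vpc.demo.id # hypothetical VPC resource

  health_check {
    path                = "/"
    port                = "traffic-port" # check each target on its own registered port
    protocol            = "HTTP"
    matcher             = "200"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}
```

The `traffic-port` setting is what lets one health check definition follow every container to whatever ephemeral port it landed on.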

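In CloudFormation, it’s similar: just remove the HealthCheckType property (and TargetGroupARNs, the counterpart of target_group_arns) from your AWS::AutoScaling::AutoScalingGroup resource.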

I’m really happy to put this one to bed. It was a real haunt.

Vermyndax / Bad Health Checks With Dynamic Container Ports