Last night, we solved a long-standing bug in one of our Terraform modules. It’s been haunting us for a long time - damn near over a year. We managed to produce this bug in a CloudFormation stack as well, but we just couldn’t figure out where it was coming from.
For reference, we were trying to stand up an Elastic Container Service cluster with containers using dynamic port routing. When you do this, you build it with an Application Load Balancer (ALB) plus a target group. The way things are supposed to work is that when a container is spun up on the cluster, it chooses an ephemeral port and that port is registered to the target group’s health checks.
That was essentially working, but something was adding an additional and erroneous health check to the exposed port (https/443) which would cause the auto-scaling group to think things were amiss… and continously terminate/rebuild instances. Not a fun situation. Our workaround was to manually remove the health check. But each time the ASG terminated and added an instance, the bad health check would come back. We finally figured this mess out.