Update: Details of “Instabilities on DHCP fr-par and api.scaleway.com” incident & response

Pierre Scacchi

On January 17 at 15:23 UTC, Scaleway encountered an incident in the FR-PAR region that impacted customers using the Scaleway API and some VPC products that rely on DHCP, including Load Balancers and Kapsule nodes. It was resolved at around 16:20 UTC the same day. Here’s an overview of what happened.

Timeline of the incident

  • At 15:23 UTC, an unusual type of request began to hit the fr-par internal DHCP service, generating approximately 100,000 API calls to the API Gateway over a few seconds.
  • At 15:24 UTC, all the nodes handling the api.scaleway.com endpoint hit OOM (out of memory) conditions, causing them to restart. This made the API Gateway unavailable for 17 seconds, and triggered our first alerts.
  • At 15:28 UTC, another internal alert was triggered, indicating that one DHCP server was down in the fr-par-2 Availability Zone (AZ).
  • Around 15:45 UTC, we issued a public communication. It was at this time that some customers began to report impacts on their side.
  • At 15:48 UTC, the API Gateway team determined that the unusual load originated from the VPC infrastructure.
  • At 16:07 UTC, the VPC team identified the cause of the unusual load, and began to take corrective action.
  • At 16:21 UTC, the source of the unusual load was isolated from the rest of the network, and all services were back to normal.

Root cause and resolution of the issue

On the VPC side

The analysis showed that one customer’s Elastic Metal machine was directing an unusually large volume of requests to the fr-par-2 DHCP servers. This was ultimately due to an unintentional misconfiguration on the customer’s side.

Each DHCP request led to an IPAM request. These requests transited through the API Gateway, which struggled to keep up with the volume.

The misbehaving Elastic Metal machine was quickly identified and isolated, which stopped the flood of DHCP requests. The impacted services were then restarted, and the situation progressively returned to normal.

To prevent future internal DHCP request floods, we are taking the following remediation actions:

  • Adding more restrictive rate limits on the DHCP service (one possible approach is sketched after this list).
  • Implementing a better strategy for IPAM requests in order to limit this type of traffic pattern in the future.
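
Scaleway has not published the exact mechanism behind these new limits. As a purely illustrative sketch, a per-client token bucket is one common way to cap how fast any single machine can hit a DHCP service; the client identifier, rate, and burst values below are assumptions, not production settings.

```python
import time
from collections import defaultdict

class TokenBucket:
    """Minimal per-client token bucket: each client may send up to `rate`
    requests per second, with short bursts of up to `burst` requests."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = defaultdict(lambda: burst)       # current tokens per client
        self.last = defaultdict(time.monotonic)        # last refill time per client

    def allow(self, client_id: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[client_id]
        self.last[client_id] = now
        # Refill tokens proportionally to the elapsed time, capped at the burst size.
        self.tokens[client_id] = min(self.burst, self.tokens[client_id] + elapsed * self.rate)
        if self.tokens[client_id] >= 1:
            self.tokens[client_id] -= 1
            return True
        return False  # over the limit: drop or defer the DHCP request

# Hypothetical policy: a single MAC address gets at most 5 DHCP requests/s, burst of 20.
limiter = TokenBucket(rate=5, burst=20)
if not limiter.allow("aa:bb:cc:dd:ee:ff"):
    pass  # ignore the request instead of triggering a downstream IPAM lookup
```

The key property is that a single misconfigured client exhausts only its own bucket, so its flood never turns into a flood of IPAM calls behind the API Gateway.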

On the API Gateway side

Scaleway’s API Gateway relies on the open source Envoy proxy and handles an average of 2000 requests per second. This load could be handled by a single node, but we run two nodes for reliability reasons.

The VPC incident added 4000 requests per second to the API Gateway. The total of 6000 requests per second is something our API Gateway would normally handle easily; however, a combination of three factors led Envoy to use an abnormal amount of RAM and never release it, resulting in an out of memory (OOM) crash:

  • The HTTP/2 initial flow-control window size was not properly configured.
  • The slow response time of the IPAM service increased the number of pending requests in Envoy. Combined with the high window size, this caused RAM usage to spike (see the back-of-envelope sketch after this list).
  • Further exacerbating the issue, by default Envoy does not release the RAM used by its flow-control mechanisms, in case another burst arrives.
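
To make the flow-control factor concrete, here is a rough back-of-envelope sketch. The pending-request count and window sizes are illustrative assumptions, not Scaleway’s actual production figures (Envoy’s documentation lists a large default initial stream window of 256 MiB).

```python
# Back-of-envelope sketch: why large HTTP/2 flow-control windows plus many
# pending requests can translate into a huge amount of buffered data.
# All numbers below are illustrative assumptions, not production figures.

KIB = 1024
MIB = 1024 * KIB

def worst_case_buffered_bytes(pending_streams: int, stream_window: int) -> int:
    """Upper bound on the data a proxy may have to hold for slow streams:
    each pending stream can have up to its advertised flow-control window
    in flight before backpressure kicks in."""
    return pending_streams * stream_window

pending = 5_000  # assumed backlog caused by slow IPAM responses

# With a very large per-stream window (Envoy documents 256 MiB as the default
# for initial_stream_window_size), the theoretical bound explodes:
print(worst_case_buffered_bytes(pending, 256 * MIB) // MIB, "MiB")  # 1,280,000 MiB

# With the window tuned down to 64 KiB (close to the HTTP/2 default of 65,535 bytes),
# the same backlog is bounded by a few hundred MiB:
print(worst_case_buffered_bytes(pending, 64 * KIB) // MIB, "MiB")   # 312 MiB
```

The actual memory consumed depends on how much data is really in flight, but the window size sets the ceiling, which is why the remediation described below caps it.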

We had identified the first issue in connection with CVE-2023-35945 and had a fix ready, but unfortunately it had not yet been deployed to our production environment.

  • At 15:23:41 UTC the first node crashed. Traffic was automatically switched to the other node, and the first node restarted.
  • At 15:24:03 UTC the second node crashed for the same reasons, but the first node hadn’t yet fully restarted. As noted previously, this resulted in a 17-second outage.
  • At 15:24:20 UTC the first node completed its restart and resumed handling the traffic. A few seconds later, the second node came back as well.
  • At 15:29:07 UTC the first node crashed again, causing all traffic to shift to the second node.

The following remedial actions were taken:

  • We deployed the flow-control change to limit the amount of RAM a single HTTP/2 connection can consume (see the sketch after this list).
  • We configured Envoy’s heap-shrink mechanism to run more often, so that Envoy releases RAM faster once connections are themselves released.
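
The exact values deployed are not public, so the following is only a hedged illustration of why capping flow-control windows bounds per-connection memory: HTTP/2 flow control limits unacknowledged data both per stream and per connection, and the smaller of the two bounds is the most a single connection can force the proxy to buffer. The stream counts and window sizes below are assumptions.

```python
# Illustrative sketch: HTTP/2 flow control bounds the unacknowledged data a
# peer can send both per stream and per connection. Capping those windows
# therefore caps how much RAM one connection can make the proxy buffer.
# The values below are assumptions, not the settings actually deployed.

KIB = 1024
MIB = 1024 * KIB

def per_connection_buffer_bound(streams: int, stream_window: int, connection_window: int) -> int:
    """Worst-case buffered bytes for a single HTTP/2 connection."""
    return min(streams * stream_window, connection_window)

# Before tuning: very large windows mean one slow connection can pin a lot of RAM.
print(per_connection_buffer_bound(100, 256 * MIB, 256 * MIB) // MIB, "MiB")  # 256 MiB

# After tuning: modest windows keep any single connection's footprint small.
print(per_connection_buffer_bound(100, 64 * KIB, 1 * MIB) // MIB, "MiB")     # 1 MiB
```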

Conclusion

We hope this communication was useful and helped you understand what happened. We are continuously working to improve our services, and we sincerely apologize for any inconvenience you may have experienced.

Scaleway provides real-time status updates for all of its services here. Feel free to contact us via the console with any questions moving forward. Thank you!
