Update: Kapsule & Kosmos incident in FR-PAR region & response

Jon Regueiro

On February 13th at 01:04 Central European Time (CET), Scaleway encountered an incident in the Paris region (FR-PAR) that impacted customers using the managed Kubernetes services, Kapsule and Kosmos, in both mutualized and dedicated environments.

The incident was resolved by 16:40 the same day. Here’s an overview of what happened.

Timeline of the Incident

February 13th, 2024

  • 01:04 - The Kubernetes API underwent RAM and CPU usage spikes, resulting in an Out of Memory (OOM) situation. Nodes in the FR-PAR region were unable to authenticate with their control plane, and some of them became “Not Ready”. This phenomenon was amplified by the auto-healing feature automatically replacing some of those nodes.
  • 01:46 - Resources allocated to the Kubernetes API were increased. The situation seemed to improve, but OOM was reached again.
  • 02:20 - Resources allocated to the Kubernetes API were increased again. An improvement was observed.
  • 02:30 - We applied some API traffic restrictions to reduce congestion by smoothing the incoming traffic profile.
  • 03:20 - The API traffic restrictions were progressively removed in 50-100 batches to avoid renewed congestion. Nodes in the FR-PAR region were then able to authenticate with their control plane and became “Ready”.
  • 03:40 - A script was launched to replace the failing nodes that were created during the Kubernetes API congestion.
  • 05:30 - Service delivery was stabilized and monitored closely.
  • 15:40 - We confirmed that the root cause of the incident was insufficient resource allocation on the Kubernetes API server. The Kubernetes API’s unavailability resulted in continuous node authentication failures and retries, which pushed nodes into the “Not Ready” state.

The authentication mechanism’s fail-and-retry behavior resulted in a congestion situation similar to the one observed during the Kubernetes incident on January 24th, 2024.
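
To make the fail-and-retry amplification concrete: if every node retries immediately after a failed authentication, an already overloaded API only receives more traffic. Below is a minimal sketch of the usual counter-measure, exponential backoff with jitter, where authenticate() is a hypothetical stand-in for a node’s call to its control plane; it illustrates the general pattern, not Scaleway’s implementation.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// authenticate simulates a node-authentication call that may fail while
// the API server is overloaded (hypothetical stand-in).
func authenticate() error {
	if rand.Intn(3) != 0 {
		return errors.New("API server timeout")
	}
	return nil
}

// authenticateWithBackoff retries with exponential backoff and full jitter,
// so thousands of nodes do not all retry at the same instant.
func authenticateWithBackoff(maxAttempts int) error {
	backoff := 500 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := authenticate(); err == nil {
			return nil
		}
		// Sleep a random duration up to the current backoff window.
		sleep := time.Duration(rand.Int63n(int64(backoff)))
		fmt.Printf("attempt %d failed, retrying in %v\n", attempt, sleep)
		time.Sleep(sleep)
		backoff *= 2
	}
	return errors.New("authentication failed after retries")
}

func main() {
	if err := authenticateWithBackoff(5); err != nil {
		fmt.Println(err)
	} else {
		fmt.Println("node authenticated")
	}
}
```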

The short-term risk mitigation actions implemented since January 24th were not sufficient to prevent a recurrence. We decided to roll back to the former authentication mechanism.

  • 15:45 - The rollback was executed progressively over a subset of the Kubernetes clusters, to avoid simultaneous authentication of the re-configured nodes.
  • 16:00 - The Kubernetes API congestion occurred again. Nodes in the FR-PAR region were unable to authenticate with their control plane, and some nodes became “Not Ready”.

As before, API traffic restrictions were applied and then progressively removed to avoid further congestion.
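
For readers unfamiliar with this kind of mitigation, “smoothing the incoming traffic profile” typically means putting a rate limiter in front of the congested API. Here is a minimal sketch using a token bucket, with an illustrative handler and illustrative limits; these are assumptions for the example, not Scaleway’s actual values or implementation.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// handleAuthRequest stands in for forwarding one node-authentication
// request to the Kubernetes API (hypothetical).
func handleAuthRequest(i int) {
	fmt.Printf("%s handled request %d\n", time.Now().Format(time.RFC3339Nano), i)
}

func main() {
	// Allow 50 requests per second with a burst of 100 (illustrative values).
	limiter := rate.NewLimiter(rate.Limit(50), 100)

	for i := 0; i < 500; i++ {
		// Wait blocks until a token is available, spreading a spike of
		// retries over time instead of letting it hit the API all at once.
		if err := limiter.Wait(context.Background()); err != nil {
			fmt.Println("rate limiter:", err)
			return
		}
		handleAuthRequest(i)
	}
}
```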

  • 16:40 - All nodes were “Ready”. Service was stabilized and monitored closely.

To prevent incident recurrence, we decided:

  • to closely monitor the clusters where the authentication mechanism rollback was applied, to confirm they remained stable;
  • to disable the features that could lead to simultaneous node authentication: the cluster upgrade and configuration update features.

February 16th, 2024

  • 14:00 - The subset of Kubernetes clusters where the rollback was applied was stable. We decided to progressively roll back to the former authentication mechanism on all Kubernetes clusters.

February 19th, 2024

  • 19:41 - The rollback was completed over all the Kubernetes clusters. Service was stable and monitored closely.

February 20th, 2024

  • 10:30 - Cluster upgrade and update features were re-enabled.
  • 17:56 - Incident was closed.

Impact

  • Increase in Kubernetes API response times, resulting in partial unavailability (requests reaching timeouts).
  • Kubernetes nodes were unable to authenticate and thus temporarily entered the “Not Ready” state, leading to potential service disruption.
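
For cluster operators, the second symptom is visible directly from the Kubernetes API. Below is a minimal client-go sketch that lists nodes whose Ready condition is not True; the kubeconfig path is a placeholder and error handling is kept deliberately short.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder path: use the kubeconfig of the cluster you want to inspect.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			// A node is healthy only when its Ready condition is True.
			if cond.Type == corev1.NodeReady && cond.Status != corev1.ConditionTrue {
				fmt.Printf("node %s is not Ready: %s\n", node.Name, cond.Message)
			}
		}
	}
}
```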

Root Cause

On February 12th, a new Kubernetes release was made available to all Kubernetes users. During the night from February 12th to 13th, many clusters triggered their auto-upgrade simultaneously, resulting in Kapsule API slowness and congestion. Because of that congestion, the new node authentication mechanism deployed on January 24th continuously failed and retried authenticating, worsening the congestion. As a result, Kubernetes nodes entered the “Not Ready” state.
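
To illustrate the amplification loop described above: once authentication attempts per interval exceed what the congested API can serve, immediate retries keep the backlog from draining, and auto-healing adds replacement nodes that must authenticate as well. The toy model below uses purely illustrative numbers and is not a measurement of the actual incident.

```go
package main

import "fmt"

func main() {
	const (
		nodes       = 3000 // nodes trying to (re)authenticate (illustrative)
		apiCapacity = 200  // authentications the congested API can serve per tick (illustrative)
	)

	pending := nodes
	for tick := 1; tick <= 8; tick++ {
		served := apiCapacity
		if served > pending {
			served = pending
		}
		failed := pending - served
		// Failed nodes retry immediately on the next tick, and auto-healing
		// replaces a fraction of them with new nodes that also need to
		// authenticate, so the backlog grows instead of draining.
		replacements := failed / 5
		pending = failed + replacements
		fmt.Printf("tick %d: served %d, backlog now %d\n", tick, served, pending)
	}
}
```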

Corrective actions

The corrective actions identified after the Kubernetes incident of January 24th, 2024 were not sufficient to prevent a recurrence. We therefore decided to roll back to the former authentication mechanism to stabilize service delivery, and to work on the medium- and long-term solutions with the highest priority.

Actions taken

  • Kubernetes API resource allocation increase
  • Rollback to former authentication mechanism

Ongoing actions

  • Implementation of a local authentication cache on the nodes (a sketch of this kind of cache follows this list)
  • Rework of our resource allocation strategy to achieve better behavior under load spikes
  • New request queuing mechanisms for Kubernetes
  • Separation of the different components to reduce side effects under high load or during failures
  • Close monitoring of key metrics to identify anomalous patterns earlier
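
As a rough idea of the first ongoing action: a node-local cache lets a node keep using a still-valid credential instead of calling the control plane on every check, so a transient API slowdown does not immediately turn into a retry storm. The sketch below uses illustrative types and durations; it is an assumption about the general approach, not the actual design.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type cachedToken struct {
	token     string
	expiresAt time.Time
}

// AuthCache caches the last credential obtained from the control plane
// and only refreshes it after the TTL expires.
type AuthCache struct {
	mu    sync.Mutex
	entry *cachedToken
	ttl   time.Duration
	// fetch calls the control plane; it is only invoked on a cache miss
	// or after expiry, so every readiness check no longer costs one API call.
	fetch func() (string, error)
}

func (c *AuthCache) Token() (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.entry != nil && time.Now().Before(c.entry.expiresAt) {
		return c.entry.token, nil // serve from the local cache
	}
	tok, err := c.fetch()
	if err != nil {
		return "", err
	}
	c.entry = &cachedToken{token: tok, expiresAt: time.Now().Add(c.ttl)}
	return tok, nil
}

func main() {
	cache := &AuthCache{
		ttl: 10 * time.Minute,
		fetch: func() (string, error) {
			fmt.Println("fetching fresh credentials from the control plane")
			return "node-token", nil
		},
	}
	// Only the first call reaches the control plane; the rest hit the cache.
	for i := 0; i < 3; i++ {
		tok, _ := cache.Token()
		fmt.Println("using token:", tok)
	}
}
```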

Conclusion

As the number of Kubernetes clusters and nodes deployed by our customers increases dramatically, this incident underscores the importance of enhancing existing traffic control mechanisms to prevent congestion situations.

We are committed to learning from this incident, and we are already implementing some of the improvements listed above. Moreover, we are working on refunding customers who deployed nodes during this period but could not use them normally.

We apologize for any inconvenience caused and are grateful for your understanding and support as we continue to enhance our systems to serve you better.

Scaleway provides real-time status updates for all of its services here. Feel free to contact us via the console with any questions moving forward. Thank you!
