On January 24 at 23:48 Central European Time (CET), Scaleway encountered an incident in the FR-PAR region that impacted customers using our managed Kubernetes services, Kapsule and Kosmos (both mutualized and dedicated environments), as well as some other products such as Cockpit.
The incident was resolved by 01:03 the same night. Here’s an overview of what happened.
Timeline of the Incident
- 23:48: The incident began when the Scaleway Kubernetes API in the FR-PAR region experienced an Out of Memory (OOM) situation. Nodes were unable to authenticate with their control plane, causing some of them to become NotReady.
- 23:48: Increased load detected on the API Gateway, with requests spiking significantly
- 00:00:10: The incident was escalated internally and handled by Scaleway's on-call engineers
- 00:00 - 00:30: Diagnosis ongoing; the OOM situation was identified as the root cause of the outage
- 00:35 - 00:40: Measures were taken to increase the Kubernetes API's memory allocation and launch additional replicas
- 00:43: The remediation took effect; API Gateway request volume decreased rapidly
- 00:54: Kubernetes nodes began coming back up; the heavy queue of pending actions on Instances was monitored
- 01:03: All FR-PAR clusters began to stabilize as further measures were implemented, such as purging the registry cache, and monitoring continued for any remaining issues
- 01:13: Metrics on Cockpit began to reappear, indicating full recovery.
Post-incident: Continuous monitoring and adjustments were made to ensure stability, including changes to service configurations.
Impact of the Incident
- An increase in API Gateway response times, resulting in some unavailability (requests reaching timeouts)
- Kubernetes nodes unable to authenticate, temporarily entering the 'NotReady' state and causing service disruption
- Where relevant, activation of the cluster auto-healing process, triggering automatic node replacement
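As background on the 'NotReady' mechanism: Kubernetes marks a node NotReady when its kubelet stops renewing its node lease within the controller's grace period (roughly 40 seconds by default). A minimal, simplified sketch of that decision, with illustrative names:

```python
from datetime import datetime, timedelta

# Approximate default node-monitor grace period of the controller manager.
NODE_MONITOR_GRACE_PERIOD = timedelta(seconds=40)

def node_condition(last_lease_renewal: datetime, now: datetime) -> str:
    """Return 'Ready' or 'NotReady' based on the age of the node's lease.

    This mirrors, in a simplified way, how the node lifecycle controller
    flags a node as unhealthy when its kubelet stops renewing its lease,
    e.g. because it can no longer authenticate to the API server.
    """
    if now - last_lease_renewal > NODE_MONITOR_GRACE_PERIOD:
        return "NotReady"
    return "Ready"
```

During this incident, kubelets could not reach the authentication path of the API, so leases went stale and nodes crossed that threshold.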
Root Cause and Resolution of the Issue
On Kubernetes Side
The incident originated from an Out of Memory (OOM) scenario within the Scaleway Kubernetes API, triggered by simultaneous deployments in FR-PAR. The situation was further complicated because this API also handles authentication: kubelets could not renew their node leases, so nodes became NotReady and retried indefinitely.
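Unbounded, immediate retries from many nodes at once are what amplified the load downstream. A common client-side mitigation for this "thundering herd" pattern is exponential backoff with jitter; a minimal sketch of the general technique (not Scaleway's or the kubelet's actual retry logic):

```python
import random

def backoff_delays(base: float = 0.5, cap: float = 30.0, attempts: int = 6):
    """Yield retry delays that grow exponentially, capped, with full jitter.

    Randomizing each delay spreads clients out over time, so a fleet of
    nodes recovering from an outage does not retry in lockstep and
    overwhelm the API all over again.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield random.uniform(0.0, ceiling)
```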
The following corrective actions were taken for remediation:
- Technical Adjustments: Immediate increase in the memory limit and garbage collection thresholds for the Kubernetes managed services.
- Scaling: Deployment of additional replicas for key services to handle increased load.
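In Kubernetes terms, those two actions correspond to raising the container's memory limit and the replica count on the relevant Deployment. A hypothetical fragment, with illustrative names and values rather than Scaleway's actual manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubernetes-api        # illustrative name
spec:
  replicas: 6                 # scaled up to absorb the extra load
  template:
    spec:
      containers:
        - name: api
          resources:
            limits:
              memory: "4Gi"   # raised to avoid further OOM kills
```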
We are planning to implement the following safeguards:
- Implementation of a local authentication cache
- Further improvements to node authentication.
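A local authentication cache lets nodes keep authenticating against recently validated credentials even when the central API is degraded. A minimal TTL-cache sketch of the idea (illustrative only, not Scaleway's design):

```python
import time

class AuthCache:
    """Cache successful authentication results for a short TTL.

    When the central auth API is unavailable, a recent positive result can
    be served from this cache instead of failing immediately and retrying,
    which both keeps nodes Ready and reduces pressure on the API.
    """

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._entries: dict[str, float] = {}  # token -> expiry timestamp

    def put(self, token: str) -> None:
        """Record a token that the central API just validated."""
        self._entries[token] = time.monotonic() + self.ttl

    def is_valid(self, token: str) -> bool:
        """Return True if the token was validated within the TTL."""
        expiry = self._entries.get(token)
        if expiry is None or expiry < time.monotonic():
            self._entries.pop(token, None)
            return False
        return True
```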
On API Gateway Side
The API Gateway experienced a significant increase in load, primarily due to the overwhelming number of requests originating from the malfunctioning components on the Kubernetes side.
Despite the increased load, the API Gateway continued to operate without a complete outage. Adjustments already implemented following a past incident allowed us to better absorb sudden spikes in requests.
This incident underscores the importance of robust monitoring and rapid response mechanisms in managing unexpected system behaviors. We are committed to learning from this incident and have already implemented several improvements to prevent such occurrences in the future.
These measures include enhancing our monitoring capabilities, adjusting our rate limiting strategies, and improving our incident response protocols. We apologize for any inconvenience caused and are grateful for your understanding and support as we continue to enhance our systems to serve you better.
Scaleway provides real-time status updates for all of its services on its status page. Feel free to contact us via the console with any questions moving forward. Thank you!