On January 24 at 23:48 Central European Time (CET), Scaleway encountered an incident in the FR-PAR region that impacted customers using our managed Kubernetes services, Kapsule and Kosmos (both mutualized and dedicated environments), as well as some other products such as Cockpit.
The incident was resolved by 01:03 the same night. Here’s an overview of what happened.
Timeline of the Incident
- 23:48: The incident began when the Scaleway Kubernetes API in the FR-PAR region experienced an Out of Memory (OOM) situation. Nodes were unable to authenticate with their control plane, causing some of them to become NotReady.
- 23:48: Increased load detected on the API Gateway, with requests spiking significantly
- 00:00:10: The incident was escalated internally and handled by Scaleway's on-call engineers
- 00:00 - 00:30: Diagnosis ongoing; the OOM situation was identified as the root cause of the outage
- 00:35 - 00:40: Measures were taken to increase the Kubernetes API's memory allocation and launch additional replicas
- 00:43: The remediation took effect; API Gateway request volume decreased rapidly
- 00:54: Kubernetes nodes began coming back up; the heavy queue of pending actions on Instances was monitored
- 01:03: All FR-PAR clusters began to stabilize as further measures were implemented, such as purging the registry cache, and monitoring continued for any remaining issues
- 01:13: Metrics on Cockpit began to reappear, indicating full recovery.
Post-incident: Continuous monitoring and adjustments were made to ensure stability, including changes to service configurations.
Impact of the Incident
- An increase in API Gateway response times, resulting in some unavailability (requests reaching timeouts)
- Kubernetes nodes unable to authenticate, temporarily entering the 'NotReady' state and causing service disruption
- Where relevant, activation of the cluster auto-healing process, triggering automatic node replacement
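As background on the 'NotReady' mechanism: Kubernetes marks a node NotReady when its kubelet stops renewing its node lease within the controller's grace period (roughly 40 seconds by default). A minimal, simplified sketch of that decision, with illustrative names:

```python
from datetime import datetime, timedelta

# Approximate default node-monitor grace period of the controller manager.
NODE_MONITOR_GRACE_PERIOD = timedelta(seconds=40)

def node_condition(last_lease_renewal: datetime, now: datetime) -> str:
    """Return 'Ready' or 'NotReady' based on the age of the node's lease.

    This mirrors, in a simplified way, how the node lifecycle controller
    flags a node as unhealthy when its kubelet stops renewing its lease,
    e.g. because it can no longer authenticate to the API server.
    """
    if now - last_lease_renewal > NODE_MONITOR_GRACE_PERIOD:
        return "NotReady"
    return "Ready"
```

During this incident, kubelets could not reach the authentication path of the API, so leases went stale and nodes crossed that threshold.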
Root Cause and Resolution of the Issue
On Kubernetes Side
The incident originated from an Out of Memory (OOM) scenario within the Scaleway Kubernetes API, triggered by simultaneous deployments in FR-PAR. The situation was further complicated because this API also handles authentication: kubelets could not renew their node leases, so nodes became NotReady and retried indefinitely.
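Unbounded, immediate retries from many nodes at once are what amplified the load downstream. A common client-side mitigation for this "thundering herd" pattern is exponential backoff with jitter; a minimal sketch of the general technique (not Scaleway's or the kubelet's actual retry logic):

```python
import random

def backoff_delays(base: float = 0.5, cap: float = 30.0, attempts: int = 6):
    """Yield retry delays that grow exponentially, capped, with full jitter.

    Randomizing each delay spreads clients out over time, so a fleet of
    nodes recovering from an outage does not retry in lockstep and
    overwhelm the API all over again.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield random.uniform(0.0, ceiling)
```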
The following corrective actions were taken for remediation:
- Technical Adjustments: Immediate increase in the memory limit and garbage collection thresholds for the Kubernetes managed services.
- Scaling: Deployment of additional replicas for key services to handle increased load.
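In Kubernetes terms, those two actions correspond to raising the container's memory limit and the replica count on the relevant Deployment. A hypothetical fragment, with illustrative names and values rather than Scaleway's actual manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubernetes-api        # illustrative name
spec:
  replicas: 6                 # scaled up to absorb the extra load
  template:
    spec:
      containers:
        - name: api
          resources:
            limits:
              memory: "4Gi"   # raised to avoid further OOM kills
```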
We are planning to implement the following safeguards:
- Implementation of a local authentication cache
- Further improvements to node authentication.
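A local authentication cache lets nodes keep authenticating against recently validated credentials even when the central API is degraded. A minimal TTL-cache sketch of the idea (illustrative only, not Scaleway's design):

```python
import time

class AuthCache:
    """Cache successful authentication results for a short TTL.

    When the central auth API is unavailable, a recent positive result can
    be served from this cache instead of failing immediately and retrying,
    which both keeps nodes Ready and reduces pressure on the API.
    """

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._entries: dict[str, float] = {}  # token -> expiry timestamp

    def put(self, token: str) -> None:
        """Record a token that the central API just validated."""
        self._entries[token] = time.monotonic() + self.ttl

    def is_valid(self, token: str) -> bool:
        """Return True if the token was validated within the TTL."""
        expiry = self._entries.get(token)
        if expiry is None or expiry < time.monotonic():
            self._entries.pop(token, None)
            return False
        return True
```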
On API Gateway Side
The API Gateway experienced a significant increase in load, primarily due to the overwhelming number of requests originating from the malfunctioning components on the Kubernetes side.
Despite the increased load, the API Gateway continued to operate without a complete outage. Adjustments already implemented following a past incident allowed us to better absorb sudden spikes in requests.
This incident underscores the importance of robust monitoring and rapid response mechanisms in managing unexpected system behaviors. We are committed to learning from this incident and have already implemented several improvements to prevent such occurrences in the future.
These measures include enhancing our monitoring capabilities, adjusting our rate limiting strategies, and improving our incident response protocols. We apologize for any inconvenience caused and are grateful for your understanding and support as we continue to enhance our systems to serve you better.
Scaleway provides real-time status updates for all of its services on its status page. Feel free to contact us via the console with any questions moving forward. Thank you!