Object Storage’s recent performance status: What happened and how we fixed it

In 2021, Scaleway's customer base began growing quickly, and the back end of our Object Storage product started to run into performance issues: a limited number of objects per bucket, instability under heavy write load, and other erratic behavior such as “unsynced buckets”.

Following that, we spent a year designing and building Hive, an innovative, scalable, and globally distributed database that allows us to grow beyond our current limitations, paves the way for the next decades of services, and supports a new standard Multi-AZ storage class.

In February 2022, we pushed Hive into production. This was a huge statement for us, demonstrating that a European cloud provider can build its own independent software stack, developed entirely in-house. It also made us the first European provider to support data replication across Multi-AZ regions.

A few weeks after the start of the migration, the performance of Paris, our first Multi-AZ region, started deteriorating significantly. This affected both external and internal users, with repercussions on other products that rely on our Object Storage, such as Container Registry.

The technical issue

In late May, we began experiencing significant issues as usage of our Multi-AZ Standard Storage class increased. The graph below shows the business impact on users who were seeing high HTTP error rates. The servers faced an unanticipated load increase, as our clients used the new storage class much more than expected. Based on what we monitored, performance was good 95% of the time, acceptable 99% of the time, and poor 0.1% of the time.

What stood out was the negative impact on a small share of user requests, which experienced increased latency. This led to unbearable 99th percentile latency, spurious 503 errors, timeouts, and similar failures. Due to the scale of the issue, the product was widely perceived as malfunctioning. Even though the error rate never spiked above 10%, and was well under this figure most of the time, the situation was unacceptable.
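To illustrate why a small share of slow or failing requests can dominate how a service is perceived, here is a minimal Python sketch on made-up data. The function names and sample values are hypothetical and are not taken from our monitoring; the sketch only shows how a healthy median can coexist with a very poor 99th percentile and a low overall error rate.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending-sorted, non-empty list.

    `pct` is an integer between 1 and 100.
    """
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(sorted_values) + 99) // 100
    return sorted_values[max(rank, 1) - 1]


def summarize(latencies_ms, status_codes):
    """Return median, tail latencies, and the share of 5xx responses."""
    ordered = sorted(latencies_ms)
    errors = sum(1 for code in status_codes if code >= 500)
    return {
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
        "error_rate": errors / len(status_codes),
    }


# Hypothetical sample: 1,000 requests, of which 20 are slow and end in a 503.
latencies = [20] * 980 + [5000] * 20
statuses = [200] * 980 + [503] * 20

summary = summarize(latencies, statuses)
print(summary)
```

In this invented sample, the median and 95th percentile stay at 20 ms and only 2% of requests fail, yet the 99th percentile is 5,000 ms. An application that issues many requests per user action is likely to hit that tail regularly.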

Solving the incident

To resolve the incident, we needed to add new hardware to the cluster, which was struggling under the increased workload and unable to deliver the highest level of service to our customers. However, due to the global shortage of electronic components that followed the pandemic, we experienced supply chain delays that prevented us from adding compute power to the cluster as quickly as required.

While awaiting new hardware, we worked hard to minimize the impact on customers, either by optimizing the usage of components or by limiting the stack running on the servers experiencing the highest loads.

In mid-June, in an effort to improve the situation, we even tried adding smaller servers to the pool while waiting for the delivery of larger ones. This did not bring significant improvements for our customers.

Finally, on July 5, thanks to the hard work of the Supply Chain, Datacenter, and Network teams, we were able to add a significant amount of compute power to our clusters. We saw a notably positive impact on error rates and tail latency, meaning that the clusters were finally back to a fully functional, optimal state. As a result, the business impact on customers and the number of support tickets started decreasing dramatically, as you can see in the graphs below.

What's next?

We learned a lot from this. We have now secured large quantities of hardware to allow us to grow further, and we have reviewed our platform hardware specifications to rely more on standard parts that will be easier to source, and to replace with alternatives, should another shortage come our way.

We’d like to thank you again for your patience, your continuous feedback in our Slack community, and your support. We are committed to building the cloud of choice and delivering products with the highest level of service possible.

Finally, we are proud of doing what no other European cloud provider has done before: developing a sovereign, Amazon S3-compatible Multi-AZ Object Storage solution. You can find more information about our platform in our Hive whitepaper.

Gaspard Plantrou and the Storage team
