Building a distributed Key-Value store to power S3

Louis Solofrizzo

As a cloud provider, we handle a considerable amount of data on a daily basis, especially with our Object Storage products. So we needed a way to distribute this data globally, with configurable consistency, replication, and database sharding to keep read and write latencies linear as the data grows.

We designed the database we needed in-house: a platform that can scale up to millions of different databases with billions of entries, all while maintaining client separation, low latency, and high performance. It also had to be:

  • Flexible, because some regions have multi-datacenter replication
  • Consistent, because our production has strict operational requirements
  • Scalable, because the platform needs to support continuous growth
  • Reliable, because even the slightest outage has significant repercussions on client trust (and financial consequences)

Meet the reliability and scaling needs of S3

Hive is a database that stores key-value pairs and shards data across many RAFT clusters, with replication to ensure global availability at the highest level of abstraction. We designed it to scale up to thousands of machines across multiple data centers worldwide and billions of database entries.

For any cloud provider, dealing with failures in infrastructure comprising millions of components is a standard mode of operation; there are always a significant number of failures at any given time. So, we designed Hive to treat failures as the typical case without having an impact on availability or performance.

Ensure consistency and data repartitioning

Clients can use Hive to store data safely, with optimizations tailored to specific access patterns. A client can choose a consistency level per read or write request.

For instance, for a DNS database engine, relaxed consistency might be preferable to keep write latencies low. But for an S3 engine, strong consistency is paramount.
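To make the per-request choice concrete, here is a minimal client-side sketch. The 'Consistency' constants, 'PutOptions', and 'Put' signature are illustrative assumptions, not Hive's actual API:

```go
package hive

// Consistency levels a caller could choose per request.
// These names are illustrative; they are not Hive's actual API.
type Consistency int

const (
	// Eventual: acknowledge as soon as the local replica accepts the write.
	Eventual Consistency = iota
	// Strong: acknowledge only after a quorum has committed the write.
	Strong
)

// PutOptions carries per-request settings.
type PutOptions struct {
	Consistency Consistency
}

// Client is a stand-in for a connection to a Hive node.
type Client struct{ /* connection details elided */ }

// Put writes a key-value pair with the requested consistency.
// A DNS-style engine might pass Eventual for low write latency,
// while an S3-style engine would pass Strong.
func (c *Client) Put(key string, value []byte, opts PutOptions) error {
	// ... send the request to a Hive node, tagged with opts.Consistency ...
	return nil
}
```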

Rather than creating a generic database engine with a query language, we decided to create specific storage engines optimized for their dedicated use cases.

Hive’s main client is Scaleway’s Object Storage. It stores references to objects in the database and uses them for specific S3 operations (versioning, delimiter handling, or prefix listing).

It also relies on Hive’s consistency features to ensure bucket unicity worldwide, and on strongly consistent multi-datacenter replication to ensure safety.

Our main problem with the previous S3 database architecture was database sharding: the databases kept growing larger, which impacted latency and sometimes even replication. We solved this by splitting S3 objects lexicographically among many shards, automatically splitting and merging shards when needed.

We used a modified version of the RAFT quorum protocol to ensure consistency and replication for a shard. In order to avoid having one big RAFT cluster, we split all the shards into RAFT groups, which are RAFT state machines dedicated to specific data sets.
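As a rough illustration of splitting keys lexicographically into RAFT-backed shards, the sketch below routes an object key to the shard owning its range. The 'Shard' and 'Ring' types and the lookup are assumptions for illustration, not Hive's internals:

```go
package sharding

import "sort"

// Shard covers a lexicographic key range [Start, End) and is
// replicated by its own dedicated RAFT group.
type Shard struct {
	Start, End  string // End == "" means "no upper bound"
	RaftGroupID uint64
}

// Ring is the ordered list of shards for one bucket's object index.
// Splitting a hot shard or merging two small ones only edits this list.
type Ring struct {
	shards []Shard // sorted by Start
}

// Locate returns the shard responsible for an object key.
func (r *Ring) Locate(key string) Shard {
	// Find the first shard whose range starts strictly after the key,
	// then step back one: that shard owns the key.
	i := sort.Search(len(r.shards), func(i int) bool {
		return r.shards[i].Start > key
	})
	if i == 0 {
		// The key precedes every range start; the first shard (starting at "") owns it.
		return r.shards[0]
	}
	return r.shards[i-1]
}
```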

This also enables us to avoid catastrophic failures: a quorum loss or another internal error stays confined to its RAFT group.

Design overview

Hive is composed of many nodes spread across different machines, data centers, and regions.

A node stores multiple clusters (thousands of clusters per node) and responds to queries for reading and writing on those clusters. A node can also take clients’ requests and redirect operations to the specific nodes holding the information.

So we split the cluster into two logical parts: Storage & API.

Designing the API

Any node can respond to a request, regardless of whether or not it holds the data. The node does a cluster resolve on each request, caching most of the results for subsequent ones, and then knows which node it needs to talk to in order to fulfill the request.

This is the opposite of the traditional client redirection found in RAFT architectures, as the node does the redirection on behalf of the client. This approach enables multiple optimizations, but the main ones are that we can cache the path and the nodes of a request to avoid resolving them again, and we can put a simple load balancer in front of Hive without worrying about redirections.
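Here is a minimal sketch of that flow on the API side, assuming hypothetical 'resolveCluster' and 'forwardGet' helpers; a real node would also handle leader changes and keep the cache in an AVL tree rather than a plain map:

```go
package api

import "context"

// resolution is a cached answer to "which cluster holds this key, and who leads it?"
type resolution struct {
	clusterID uint64
	leader    string // address of the last known leader node
}

// Node is the API side of a Hive node.
type Node struct {
	// A real node caches per cluster; a map keyed by the object key keeps this sketch short.
	cache map[string]resolution
}

// HandleGet serves a read without ever redirecting the caller: the node
// itself resolves the owning cluster and forwards the request to its leader.
func (n *Node) HandleGet(ctx context.Context, key string) ([]byte, error) {
	res, ok := n.cache[key]
	if !ok {
		var err error
		res, err = n.resolveCluster(ctx, key) // ask the metadata layer
		if err != nil {
			return nil, err
		}
		n.cache[key] = res // cache the path for the next request
	}
	// Proxy to the leader over the internal API; if the leader has moved,
	// the reply carries the new one and the cache entry gets refreshed.
	return n.forwardGet(ctx, res, key)
}

// resolveCluster and forwardGet are stubbed: the real versions talk to
// other Hive nodes over the internal protobuf API.
func (n *Node) resolveCluster(ctx context.Context, key string) (resolution, error) {
	return resolution{clusterID: 1, leader: "10.0.0.2:7000"}, nil
}

func (n *Node) forwardGet(ctx context.Context, res resolution, key string) ([]byte, error) {
	return nil, nil
}
```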

We built multiple APIs for Hive: front-facing ones, usually an HTTP API, that clients talk to, and an internal one, using protobuf, for node-to-node communication.

This design allowed us to bind the client-facing server to a private IP address and use IPv6 internet addresses for node-to-node communication. To ensure the safety of communications between nodes, every packet is end-to-end encrypted.

Designing the storage

Each Hive cluster is composed of at least one storage backend. Backends can use different database engines and maintain an in-memory state machine. A storage backend implements a subset of a RAFT cluster: it applies the log to its database engine. It does not know which other backends it is paired with, even though a cluster shares the same RAFT log across all of its backends. Each backend can use a different storage engine: backend ’A’ can use SQLite, and backend ’B’ can use LMDB, without any issues.
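One way to picture that contract is an interface like the one below: apply committed log entries and answer local reads, with the engine behind it free to be SQLite, LMDB, or anything else. The shape is an assumption for illustration, not Hive's actual code:

```go
package storage

// LogEntry is one committed record from the cluster's shared RAFT log.
// Its payload can be an engine operation or a "meta" operation, such as
// adding a node to the cluster or changing the default consistency.
type LogEntry struct {
	Index uint64
	Data  []byte
}

// Backend is the contract every storage backend fulfils: apply committed
// RAFT log entries and answer reads. A backend does not know which other
// backends share the same log.
type Backend interface {
	Apply(entry LogEntry) error     // apply a committed log entry to the engine
	Get(key []byte) ([]byte, error) // local read against the engine's state
	Snapshot() ([]byte, error)      // used for log compaction and catching up followers
}

// Two backends in the same cluster may implement this interface with entirely
// different engines, e.g. one persisting to SQLite and another to LMDB.
```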

A common RAFT log is shared by the cluster, storing all the cluster operations and some "meta" operations, like adding a node into a cluster or changing the default consistency of a cluster.

Each node maintains a global cache to quickly resolve a cluster when needed. This cache is kept in RAM as an AVL tree and stores the addresses of the nodes composing a cluster, along with the last known leader. When the leader changes, any node can redirect the caller to it, and the caller then updates its cache. Upon cluster deletion or re-creation, the node returns a "Does not exist" code, resulting in a cache flush for this particular entry on the caller’s side.
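The two cache-maintenance paths described above could look roughly like this; the entry layout, the map (standing in for the AVL tree), and the error value are illustrative assumptions:

```go
package cache

import "errors"

// entry is what the resolve cache remembers about one cluster.
type entry struct {
	Nodes  []string // addresses of all nodes composing the cluster
	Leader string   // last known leader
}

// ErrDoesNotExist stands in for Hive's "Does not exist" return code.
var ErrDoesNotExist = errors.New("cluster does not exist")

// Cache is a simplified stand-in for the per-node resolve cache,
// which is really an in-RAM AVL tree.
type Cache struct {
	entries map[uint64]entry
}

// OnRedirect is called when a contacted node tells us the leader has changed.
func (c *Cache) OnRedirect(clusterID uint64, newLeader string) {
	e := c.entries[clusterID]
	e.Leader = newLeader
	c.entries[clusterID] = e
}

// OnError flushes the entry when the remote node reports the cluster is gone
// (deleted or re-created), forcing a fresh resolve on the next request.
func (c *Cache) OnError(clusterID uint64, err error) {
	if errors.Is(err, ErrDoesNotExist) {
		delete(c.entries, clusterID)
	}
}
```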

Hive: bird’s eye view

S3 is a key-value store, so we exposed a key-value store to operate it

S3 is an Amazon API which is, in essence, a key-value store. The key is an object name; the value is binary content. Some metadata can be set by the user on an object: tags, access control lists, or simple flags. On top of this, the backend has to store the data’s physical position if the object content is not stored alongside its metadata.
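In other words, the metadata store keeps one record per object that points at the content rather than containing it. A possible shape for such a record, with purely illustrative field names, could be:

```go
package s3meta

import "time"

// ObjectMeta is an illustrative shape for what a metadata store keeps per
// object: user-visible metadata plus a pointer to where the bytes physically live.
type ObjectMeta struct {
	Bucket       string
	Key          string // the object name, i.e. the key of the key-value pair
	Size         int64
	ETag         string
	LastModified time.Time
	Tags         map[string]string // user-defined tags
	ACL          string            // access control list, simplified to a canned value here
	// Physical location of the content, stored separately from the content itself.
	DataLocation string
}
```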

Hive was created for this purpose: it exposes a key-value store HTTP API to an S3 API gateway and handles most S3 operations under the hood. It also supports huge buckets, with billions of objects, without a performance penalty.

Hive is a database, but it is not designed to store the object content, only the metadata.

Likewise, Hive is not an S3 API. Its HTTP API maps to S3 API calls but cannot be exposed directly to the client: another component must handle features like request signing, ACLs, and data storage.

Object storage API

Hive exposes an HTTP API for S3 in the form of a one-route, body-action JSON API. The payload varies with the type of call, but the general rule is that every client-facing S3 call has an equivalent Hive call.

The idea is for the gateway to do as little work as possible, leaving Hive to handle S3-specific behavior such as Versioning, Multipart Upload, and other operations. Even though HTTP comes with a bit of overhead, it was chosen to make client implementation easier: every language has a library for an HTTP client with JSON payloads.
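As a hedged illustration of the one-route, body-action pattern, the gateway could POST a JSON body whose action field selects the operation. The endpoint path, field names, and action values below are assumptions, not Hive's documented API:

```go
package gateway

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// request mirrors the "one route, action in the body" style: every
// client-facing S3 call maps to one such Hive call.
type request struct {
	Action string            `json:"action"` // e.g. "PutObject" (illustrative)
	Bucket string            `json:"bucket"`
	Key    string            `json:"key"`
	Params map[string]string `json:"params,omitempty"`
}

// call posts the action to Hive's single HTTP route (the URL is hypothetical).
func call(client *http.Client, req request) (*http.Response, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return nil, err
	}
	return client.Post("http://hive.internal/v1", "application/json", bytes.NewReader(body))
}
```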

Key unicity across multiple regions

A key S3 feature is that a bucket name is unique across all S3 regions. We use Hive to ensure strongly consistent writes and name unicity.

The safety calls and two-step commits are applied automatically on the ’CreateBucket’ API call, leaving almost no work for the gateway.

To ensure unicity, a Redis-like backend is used with a SET ... NX operation, which sets a key only if it does not already exist. This backend exposes a Redis-like API over protobuf but cannot be accessed directly by the caller. Its cluster is automatically created with the first bucket, with a worldwide failure domain configured.
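The unicity check boils down to those set-if-not-exists semantics: exactly one CreateBucket for a given name can win. A minimal sketch, assuming a hypothetical client interface for the Redis-like backend:

```go
package bucket

import (
	"context"
	"errors"
)

// ErrBucketExists is returned when the name is already taken worldwide.
var ErrBucketExists = errors.New("bucket name already exists")

// KV abstracts the Redis-like backend; SetNX returns true only if the key
// was absent and has now been set (the "set if not exists" semantics).
type KV interface {
	SetNX(ctx context.Context, key string, value []byte) (bool, error)
}

// ReserveBucketName implements the worldwide unicity check behind CreateBucket:
// only the first caller ever wins the SetNX for a given bucket name.
func ReserveBucketName(ctx context.Context, kv KV, name, owner string) error {
	ok, err := kv.SetNX(ctx, "bucket:"+name, []byte(owner))
	if err != nil {
		return err
	}
	if !ok {
		return ErrBucketExists
	}
	return nil
}
```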

Hive has been in production for months now

Since Hive went into production on the Scaleway S3 product, 5 billion entries have been written, with a write P90 of around 1 ms and a read P90 of around 150 µs.

Our initial tests showed that the aggregate availability (non-5xx responses / total requests × 100) is 99.9998%. The remaining errors are mainly due to bugs and crashes, which are to be expected with a first deployment, yet availability is still higher than what our current platform offers.
We are pretty happy with those numbers, as node crashes had almost no impact on client-facing calls, despite occasional spikes in tail latency.
