Creating Cockpit: From ecosystem tool to observability product

Maxime Besson

At Scaleway, we provide our customers with two types of services: on the one hand, we have billed products. On the other hand, we provide certain features for free — those are ecosystem tools.

One of the tools most awaited by our customers was Scaleway’s monitoring system. That journey began several years ago and led to the public release of our fully-managed observability solution, Cockpit, on May 9, 2023.

But here’s the thing: Cockpit is actually both an ecosystem tool and a product. Initially, it was purely an internal tool, a monitoring solution for our product teams. But as Cockpit grew, it became a fully-fledged product for our customers.

Now, you can monitor both your Scaleway infrastructure data and your application data. And you can import data from other cloud providers you might be using.

Product-specific monitoring wasn’t enough

Let’s take a trip down memory lane. Before we created a shared infrastructure monitoring tool for our customers, each Scaleway product had to handle monitoring for itself. Teams had to:

  • Select relevant metrics and logs
  • Retrieve them
  • Store them
  • Process them
  • Create the necessary APIs for data retrieval

This approach had several challenges:

  • Information storage was scattered across multiple locations
  • Data retrieval APIs weren’t standardized
  • There were discrepancies in metric and log availability between products

It soon became clear that a uniform and comprehensive solution for all our products (and consequently our customers) was needed.

Balancing the needs of multiple stakeholders

Our team started by trying to understand the infrastructure monitoring needs of both our product teams (especially the platform team) and our customers. One thing became clear fairly quickly: while the general metrics for each product were the same for 90% of users, some of the data, the way it is processed, and the collection frequency were very specific to each product.

So we couldn’t just create a one-size-fits-all solution for all products in our ecosystem. Instead, each product team had to remain in charge of the metrics and logs available for their product within our monitoring solution.

So we established guidelines and provided support to help the teams with the following:

  • Key observability concepts
  • The importance of selecting the appropriate frequency for each metric
  • Scalability and storage considerations related to metric cardinality
  • Best practices for creating dashboards
  • Best practices for setting up alerts
  • The necessary technical documentation for sending metrics and logs to our internal APIs
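To make the points about frequency and cardinality concrete, here is a minimal Python sketch of the arithmetic involved. It is an illustration only, not Scaleway's internal tooling: the function names, the label names, and the counts are all hypothetical. It assumes the common model in which every distinct combination of label values becomes its own time series, so the worst case for a metric is the product of the number of values each label can take.

```python
from math import prod


def series_count(label_values: dict[str, list[str]]) -> int:
    """Worst-case number of time series for one metric.

    Each distinct combination of label values is a separate series,
    so the upper bound is the product of the distinct values per label.
    """
    return prod(len(set(values)) for values in label_values.values())


def samples_per_day(series: int, interval_seconds: int) -> int:
    """Samples stored per day when every series is collected at a fixed interval."""
    return series * (86_400 // interval_seconds)


# Hypothetical labels for a single metric (illustrative values only).
labels = {
    "region": ["fr-par", "nl-ams", "pl-waw"],
    "instance_id": [f"instance-{i}" for i in range(200)],
    "status": ["ok", "error"],
}

series = series_count(labels)
print(series)                       # 3 * 200 * 2 = 1200
print(samples_per_day(series, 60))  # 1200 * 1440 = 1728000
print(samples_per_day(series, 10))  # 1200 * 8640 = 10368000
```

Adding one more label with ten possible values would multiply every figure by ten, which is why both the label set and the collection frequency are worth choosing deliberately for each metric.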

Scalability through components

Monitoring infrastructure is already complex enough when it’s just for a single company. We knew that scaling the system and managing the volume of data when making monitoring available to all our customers would be a huge challenge.

We partly addressed this by building a single unified system, since completely separate systems for each product wouldn’t scale well, and then breaking that system down into components that can each scale independently.

Each component of our infrastructure is developed individually to ensure stability, performance, and the ability to scale. As a result, each component can handle a significant workload. And that makes the entire system scale.

Think of it like a bunch of LEGO bricks! If each brick is stable on its own, the whole structure built from the bricks is stable as well, and it can grow.

Taking it from tool to product with Cockpit

The initial tests of our architecture design were highly satisfying and gave us great confidence in our ability to handle the load. This sparked an idea: what if we opened our APIs to our customers, allowing them to push their metrics and logs for applications as well?

This would have significant benefits for them:

  • The ability to consolidate Scaleway infrastructure data with their application data
  • A single place to manage their observability solution
  • A unified, open-source alternative to proprietary options, which addresses a previously identified need

So, we began conducting more extensive user research. Several crucial points quickly emerged:

  • The need to retrieve infrastructure data (our tool was already providing that)
  • A need for transparency and understanding of observability costs, along with greater predictability of what companies will have to pay
  • The requirement for an easy-to-use solution

Opening our APIs

Clearly, a more comprehensive observability solution was needed. So we decided to design and develop our infrastructure to receive metrics and logs from our customers. We opened our APIs, enabling customers to push their data and read it back remotely.

Our observability solution now includes the following:

  • Grafana-as-a-Service: A managed Grafana solution for our customers, developed entirely by our team. Dashboards load and respond quickly, even though the service runs on a serverless architecture that scales down to zero when the customer isn’t viewing their dashboards. By default, it comes with pre-built dashboards for all customers, so they can monitor their Scaleway infrastructure within five minutes of activating the monitoring solution.
  • A remote read API based on the open-source, standardized PromQL and LogQL query languages
  • A managed alert manager, also pre-populated with default alerts for all Scaleway products (which the customer can activate or deactivate)
  • An information hub accessible to all product teams and a front end to display information in the Scaleway console (e.g., product metrics, consumption, etc.)
  • Coming soon: A new version of our graphs in the console allowing product-specific monitoring of service health
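As a rough illustration of what reading data back with PromQL and LogQL can look like, the Python sketch below builds query URLs following the public Prometheus and Loki HTTP API conventions (`/api/v1/query` and `/loki/api/v1/query_range`). Everything specific here is an assumption: the base URLs are placeholders, the function names are hypothetical, and the real Cockpit endpoints, path prefixes, and authentication scheme should be taken from the product documentation. The sketch only builds the URLs; it does not send any request.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus-style instant query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def log_range_query_url(
    base_url: str, logql: str, start_ns: int, end_ns: int, limit: int = 100
) -> str:
    """Build a Loki-style range query URL for a LogQL expression.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Placeholder endpoints, not real Cockpit URLs.
METRICS_URL = "https://metrics.example.invalid"
LOGS_URL = "https://logs.example.invalid"

metrics_url = instant_query_url(METRICS_URL, "rate(http_requests_total[5m])")
logs_url = log_range_query_url(
    LOGS_URL,
    '{app="checkout"} |= "error"',
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_600_000_000_000,
)

print(metrics_url)
print(logs_url)
```

Because both query languages are open, the same expressions can be used from Grafana or any other compatible client, not only from custom code.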

During the private and public betas, we monitored traffic, scale, and how customers were using the product. We learned that we could handle significant workloads and scale to meet the expectations of Scaleway customers and products. We also noticed that we needed to set appropriate limits and best practices to ensure system security and sustainability.

Over time, we developed the right techniques and tools to allow our customers to adapt their usage of our product based on real-world needs. Finally, Cockpit — as both a tool and a product — was born and made available to all Scaleway users in March 2023.

Based on what we learned about how the product was being used, we also devised a pricing strategy that lets customers predict their costs and keep an eye on them at all times. You’ll read more about that in detail in our next blog post, so stay tuned!
