“Who can do what,” or why Identity and Access Management matters

Olivier Cano

On February 22nd, we activated IAM for all Scaleway organizations. Identity and Access Management (IAM) is a security framework to control the authentication and authorization of individuals and manage their access to resources.

IAM is no small task; at Scaleway, every API call goes through IAM before being executed. For each second you spend reading this post, about a thousand permission requests are executed on IAM.

But here’s the thing: on February 22nd, IAM had already been live for eight months. You just didn’t know it. How did we manage to migrate so seamlessly? That’s what we will look at in this series of two blog posts.

In this first part, we’ll dive into the history of IAM at Scaleway and the different database implementations we used over the years. The second post will go further into the technical details of the migration itself.

IAM at Scaleway prior to 2022

IAM can be summarized with the question “who can access what?” The answer is defined in what we call a policy. The “who” defines the identity: an individual, a team, or even a non-human user. The “what” defines the level of granularity at which access is granted: an organization, a project, or a particular resource.

The list of “who” or “what” in a policy depends on the business you run. Let’s have a look at these two example policies:

Two sample sentences to demonstrate access: “Alice can access the organization Foo” and “People with green eyes born during a full moon can access the secret room during odd days”

Even though both are valid policies, the first one is much simpler to understand and implement. In a nutshell, the policy dilemma is that you have to weigh simplicity vs. granularity.

2015: A permissions-based system

Scaleway started in 2015 as an internal startup within Online SAS. At the time, IAM was not called IAM yet. Instead, we would refer to it as the “permissions system,” which was very simple: a user has many permissions.

Disclaimer: database models shown in this blog post are not exhaustive.

Database model showing the permissions-based system

A permission represents an action to perform on a resource in a product. For example, the permission instance:server:read allows a user to read information about servers in their organization, but it does not allow them to create servers. The permission is then checked by the product API managing this resource.

This approach worked well for many years, but it had its limits:

  • Every time we created a user, we had to insert entries for each of the P permissions in the users__permissions table.
  • Every time we created a permission, we had to insert entries for each of the U users in the users__permissions table.
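The cost described in these two bullets can be sketched in a few lines of Python. This is a minimal illustration, not Scaleway's actual schema: the in-memory set stands in for the users__permissions table, the user names and the permission instance:server:write are made up, and the sketch assumes every new user or permission is seeded against every existing permission or user.

```python
# Hypothetical in-memory stand-in for the users__permissions join table.
users: list[str] = []
permissions: list[str] = []
users_permissions: set[tuple[str, str]] = set()


def create_permission(permission: str) -> int:
    """Seed a new permission: one row per existing user (U inserts)."""
    permissions.append(permission)
    rows = {(user, permission) for user in users}
    users_permissions.update(rows)
    return len(rows)


def create_user(user: str) -> int:
    """Seed a new user: one row per existing permission (P inserts)."""
    users.append(user)
    rows = {(user, permission) for permission in permissions}
    users_permissions.update(rows)
    return len(rows)


def is_allowed(user: str, permission: str) -> bool:
    """The lookup a product API would run before serving a request."""
    return (user, permission) in users_permissions


create_permission("instance:server:read")  # no users yet, so no rows
for name in ("alice", "bob", "carol"):
    create_user(name)  # one row each, since only one permission exists
inserted = create_permission("instance:server:write")  # one row per user
```

The table grows with U × P, and the number of inserts needed for a new permission grows with the number of users, which is why seeding slows down as the user base grows.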

As Scaleway grew, more permissions were added for new products, and a lot of new users joined Scaleway. At some point, it took a full day to seed a new permission in the database. We had clearly hit the limits of the permissions model.

2020: The switch to a role-based system

We improved the performance of the permissions model by introducing permission sets: groups of one or more permissions that we can assign to one or many users. We took this refactoring opportunity to introduce a Role-Based Access Control (RBAC) model: when a user assumes a role in an organization, they get access to various permission sets.

Database model showing how the role-based system worked

At the time, organizations equaled users, so each organization could only have one user. The idea with RBAC was to enable a multi-user feature by letting a user assume one of the four predefined roles in any organization:

  • owner: can do everything in the organization
  • administrator: can do everything except delete the organization
  • billing administrator: can only manage the billing and payment
  • editor: can access cloud products

In this system, users could be part of many organizations, but they remained global across Scaleway. It was a simple approach, but it meant that an organization owner couldn’t take any action on their users other than removing them from the organization. For example, the owner of an organization could not enforce multi-factor authentication on their users.
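As a rough sketch of how such a role-based lookup might work, the Python below maps roles to permission sets and permission sets to permissions. Only the four role names come from this post; the permission set names, the individual permissions, and the users and organizations are assumptions made for the example.

```python
# Permission set -> permissions. Names and contents are illustrative only.
PERMISSION_SETS = {
    "billing": {"billing:invoice:read", "billing:payment:write"},
    "products": {"instance:server:read", "instance:server:write"},
    "organization_admin": {"organization:member:write"},
    "organization_delete": {"organization:organization:delete"},
}

# Role -> permission sets, following the four predefined roles.
ROLES = {
    "owner": {"billing", "products", "organization_admin", "organization_delete"},
    "administrator": {"billing", "products", "organization_admin"},
    "billing administrator": {"billing"},
    "editor": {"products"},
}

# (user, organization) -> role. Users are global and may join many organizations.
memberships = {
    ("alice", "org-foo"): "owner",
    ("bob", "org-foo"): "editor",
    ("bob", "org-bar"): "billing administrator",
}


def is_allowed(user: str, organization: str, permission: str) -> bool:
    """Resolve user -> role -> permission sets -> permission."""
    role = memberships.get((user, organization))
    if role is None:
        return False
    return any(permission in PERMISSION_SETS[name] for name in ROLES[role])
```

With this layout, adding a permission means updating one permission set instead of writing one row per user.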

2020: Pivoting from user-based to project-based API keys

In 2020, we also introduced a new feature: projects. This feature allows you to organize resources by isolating them in projects. Implementing this into our RBAC model was trivial: instead of assigning a role in an organization, we now assign it within a project.

With this feature, we also introduced project-scoped API keys. Before, API keys were bound to a global user, which means the same API key had the permissions associated with the user across all the organizations the user belongs to. With project-scoped API keys, the API key is bound to a certain role on a particular project. This was a big step toward adding more granularity to our policies.

Before:

Example: “API key has access to all resources of the organizations Bob belongs to”

After:

Example: “API key has access to all resources of a project”
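The difference in reach between the two kinds of keys can be sketched as follows. This is an illustration under assumed names: the classes UserApiKey and ProjectApiKey, the lookup tables, and the organization and project IDs are all hypothetical.

```python
from dataclasses import dataclass

# Organization -> projects and user -> organizations. All names are made up.
ORG_PROJECTS = {"org-foo": {"proj-a", "proj-b"}, "org-bar": {"proj-c"}}
USER_ORGS = {"bob": {"org-foo", "org-bar"}}


@dataclass(frozen=True)
class UserApiKey:
    """Before: the key inherits everything its global user can reach."""
    user: str

    def reachable_projects(self) -> set[str]:
        projects: set[str] = set()
        for org in USER_ORGS.get(self.user, set()):
            projects |= ORG_PROJECTS[org]
        return projects


@dataclass(frozen=True)
class ProjectApiKey:
    """After: the key is bound to one role on one project."""
    project: str
    role: str

    def reachable_projects(self) -> set[str]:
        return {self.project}


old_key = UserApiKey(user="bob")
new_key = ProjectApiKey(project="proj-a", role="editor")
```

A leaked project-scoped key exposes a single project, whereas a leaked user-bound key exposes every organization the user belongs to.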

The limits of the RBAC model

The project feature was a quick win, but we knew the RBAC model would be limited when it came to defining scope when accessing a product. We identified four possible scopes to configure access to a product:

  • Accessing a product on a project
  • Accessing a product on many projects
  • Accessing a product on all current projects
  • Accessing a product on all current and future projects

Every time we tried to address these four use cases with RBAC, we ended up twisting the model in a way that didn’t satisfy us. But finer granularity in IAM is the feature our users request most, and we have big plans for IAM, such as adding granularity at the resource level. We didn’t want to compromise here, so we needed a new approach with better handling of project and organization scopes.

Designing our current IAM system

In 2022, we started working on our IAM system as it exists today. The main requirements were:

  • Simpler user management for audit and security features
  • Support of both project and organization scopes
  • Letting the organization owners define the policy they want
  • Anticipating the need for more granularity in the future

The first thing we did was to analyze many existing IAM systems, especially from the open-source world. None of them fully addressed our needs. Sometimes the UX was too complicated. In other cases, the “project” concept did not exist. Most of the time, we disagreed with implementation details. The truth is that open-source software will only rarely fit your business out of the box.

The policy to rule them all

So we started designing our own solution by defining the list of “who” — often referred to as the principal — and the list of “what.”

There might be more in the future, but for now, we have identified three types of principals at Scaleway: IAM user, IAM application, and IAM group.

Schema illustrating how principals get access through policies

Then we defined the “what” by a series of rules that give access to permission sets on a scope:

  • Project scope: a list of specific projects
  • All-projects scope: all current and future projects
  • Organization scope: for organization-level permission sets

Rules hold the granularity definitions of IAM and are the most difficult part of IAM to explain and use. So very early on, we focused on the user experience in the console and Terraform to make it easier for users to understand.
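To make the three scopes concrete, here is a minimal Python sketch of how a single rule might be matched against a request. It is not Scaleway's implementation: the names ScopeKind, Rule, and rule_grants are hypothetical, and the sketch assumes all projects belong to the same organization.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ScopeKind(Enum):
    PROJECTS = "projects"          # an explicit list of projects
    ALL_PROJECTS = "all_projects"  # every project, current and future
    ORGANIZATION = "organization"  # organization-level permission sets


@dataclass(frozen=True)
class Rule:
    kind: ScopeKind
    permission_sets: frozenset[str]
    project_ids: frozenset[str] = frozenset()


def rule_grants(rule: Rule, permission_set: str, project_id: Optional[str]) -> bool:
    """Return True if `rule` grants `permission_set` on the requested target.

    `project_id` is None for organization-level requests such as billing.
    """
    if permission_set not in rule.permission_sets:
        return False
    if rule.kind is ScopeKind.ORGANIZATION:
        return project_id is None
    if project_id is None:
        return False
    if rule.kind is ScopeKind.ALL_PROJECTS:
        return True  # no project list to maintain, so future projects match too
    return project_id in rule.project_ids
```

Because the all-projects scope stores no project list, a project created after the rule was written is covered without touching the policy, which addresses the “all current and future projects” case that RBAC handled poorly.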

Terraform-driven design

Most Scaleway users use infrastructure-as-code tools to deploy their infrastructure, the most popular being Terraform. So we decided very early on to base our design of the IAM system on what it would be like to define a policy in this particular tool. After a couple of iterations, we arrived at the following setup:

resource "scaleway_iam_policy" "object_storage_and_billing" {
  description = "gives read only access to object storage in project A and full access to billing"

  // Who
  user_id = scaleway_iam_user.alice.id

  // What
  rule {
    project_ids          = [scaleway_account_project.my_project_a.id]
    permission_set_names = ["ObjectStorageReadOnly", "RedisReadOnly"]
  }

  rule {
    organization_id      = data.scaleway_account_organization.my_orga.id
    permission_set_names = ["BillingFullAccess"]
  }
}

We noticed many small details, such as:

  • Using different attributes for principal — user_id, application_id, group_id — is more comprehensible than using a single attribute principal_id.
  • Binding a policy to a single principal encourages the use of a group when giving a policy to many users or applications.
  • It’s much clearer to define all rules on the same Terraform resource than using one Terraform resource per rule.

These kinds of details helped us design a better API and database model.

Schema showing how a policy is a list of rules

In the database, a policy is a list of IAM rules attached to a principal. Each rule grants one or more permission sets on a scope (project_ids or organization_id), just as in the Terraform resource.
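A policy lookup over this model might look like the Python sketch below. It is a simplification under stated assumptions: a single principal_id field replaces the separate user, application, and group attributes, group membership lives in a hypothetical GROUPS table, only project-scoped requests are resolved, and all IDs as well as the InstancesFullAccess set name are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    permission_set_names: frozenset[str]
    project_ids: frozenset[str] = frozenset()
    organization_id: Optional[str] = None


@dataclass(frozen=True)
class Policy:
    principal_id: str  # a user, application, or group ID
    rules: tuple[Rule, ...]


# Hypothetical group membership: group ID -> member principal IDs.
GROUPS = {"group-devs": {"user-alice", "app-ci"}}


def principals_of(principal_id: str) -> set[str]:
    """A principal matches policies bound to itself or to any of its groups."""
    groups = {g for g, members in GROUPS.items() if principal_id in members}
    return {principal_id} | groups


def permission_sets_on_project(policies, principal_id: str, project_id: str) -> set[str]:
    """Collect the permission sets granted to a principal on one project."""
    ids = principals_of(principal_id)
    granted: set[str] = set()
    for policy in policies:
        if policy.principal_id not in ids:
            continue
        for rule in policy.rules:
            if project_id in rule.project_ids:
                granted |= rule.permission_set_names
    return granted


policies = [
    Policy("user-alice", (
        Rule(frozenset({"ObjectStorageReadOnly", "RedisReadOnly"}),
             project_ids=frozenset({"proj-a"})),
        Rule(frozenset({"BillingFullAccess"}), organization_id="org-1"),
    )),
    Policy("group-devs", (
        Rule(frozenset({"InstancesFullAccess"}), project_ids=frozenset({"proj-b"})),
    )),
]
```

Here Alice gets her own policy on proj-a plus whatever her group grants on proj-b, while the application app-ci only receives the group's permission sets.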

While designing this data model, we also tested its extensibility against future features we plan to add to the IAM policy, such as resource-level scopes, deny rules, conditions, and resource principals.

We have clear plans on how to implement each of them. If you want to help us prioritize, please upvote the feature that matters most to you or create a new feature request for it.

From design to migration — how we made the switch

So now you know how we designed the IAM policy as a compromise between user needs and user experience, using the Terraform representation as a source of inspiration. If you want to know more about these topics, we discussed them in a Dev’Obs episode (French podcast).

You might have noticed that the new IAM concepts are quite different from the previous role-based access control. Principals, policies, groups: many of these concepts did not exist in the previous model. And the permissions API is one of our busiest internal APIs, with thousands of requests per second.

So how did we manage to completely flip the model with no downtime? Stay tuned! All of this will be revealed in the next blog post. :-)