Arthur Busser is Site Reliability Engineer at financial data company Pigment. This K8s expert par excellence describes Kubernetes as “the perfect level of abstraction on top of a cloud provider’s VMs”. He first discovered K8s in banking, before joining cluster specialists Padok. We caught up with him to discuss Kubernetes security, via secret management… and discovered he’d made an app especially for Scaleway, further to his own secret-related horror story!
Why is Kubernetes security such a big issue?
Firstly, with Serverless, your focus is mostly on application-level security. You trust your cloud service provider (CSP) to secure their infrastructure. With Kubernetes, you have to worry about the application-level security — which is different for everybody — but also about your infrastructure-level security, and that’s the same for everybody. That’s why we talk about it a lot; Kubernetes security basically comes down to collaboration across companies.
Secondly, our systems have become a lot more capable, hence complicated, hence vulnerable.
A computer has to be connected to the world to bring value. And the more connected it is, the more opportunities there are for an attacker to get in.
Kubernetes' use has grown really quickly globally, with countless new capabilities unlocked. They include high availability, replication, autoscaling, load balancing, orchestration, containerization and more.
Each new capability brings its own security risks. A lot of people built complicated platforms they don’t fully understand, that are not fully secured. Securing your platform is constant work. Open-source security solutions are getting built, and everyone’s waiting for them.
Why is Kubernetes secret management so crucial?
A K8s service’s code is the same across all environments we deploy a service to. What changes is the service’s configuration. Sometimes, part of that configuration is sensitive, like database credentials, or API keys.
The sensitive bits of configurations, or “secrets”, are usually related to authentication. If hackers can get their hands on those credentials, they can impersonate your service elsewhere. And usually, that blows up in your face…
Maybe they can spam other people, making them think they’re you. You also hear about someone’s AWS credentials getting hacked. The hackers use that person’s account to mine bitcoin… and the user only finds out when they get their cloud bill!
You need to balance security and usability. This trade-off has always existed. Today, to do security well, you need to make the easiest way the secure way.
The easiest way would be to put the secret — e.g. an API key — in the source code. But a hacker could access that.
So you need something better! That’s why, the same way you’d use a password manager, we suggest using a secret manager.
How can you best deploy a Kubernetes secret management system?
Today, a lot of people use secret managers. Most CSPs provide them. They are basically key value stores, but they use cryptography and key rotation to keep those secrets secure. They also integrate with the CSP’s IAM, and they usually keep audit logs, so you can know who read what secret at what time.
Alternatively, you can run your own, with self-hosted tools like HashiCorp’s Vault. Or you can even just store them in a (secure) bucket. Either way, your secrets are stored in an external system, which means your service needs a way to access them.
The easiest way to do that is to use your secret manager’s API. When your app starts, it gets those credentials, and uses them to access the database. Simple! But there’s no abstraction.
So if you want to switch to another secret manager, you’ll have to edit every single service you have before you can migrate over.
A lot of people have the service read the secret from an environment variable; the service just expects it to be there.
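On the service side, this pattern can be as small as the sketch below. It is only an illustration: the variable name `DB_PASSWORD` and the helper `read_secret` are hypothetical, and the choice to fail at startup when the variable is missing is an assumption about what a careful service would do.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def read_secret(name, env=None):
    """Return the secret stored in environment variable `name`.

    Fails fast instead of letting the service start half-configured.
    """
    env = os.environ if env is None else env
    value = env.get(name, "")
    if not value:
        raise MissingSecretError(f"environment variable {name} is not set")
    return value


if __name__ == "__main__":
    # DB_PASSWORD is a hypothetical variable name used for illustration.
    demo_env = {"DB_PASSWORD": "not-a-real-password"}
    password = read_secret("DB_PASSWORD", demo_env)
    print("secret loaded, length:", len(password))
```

The service never learns where the value came from, which is what makes it possible to swap the secret manager without touching the code.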
The most common way to do that is with the External Secrets Operator (ESO), which fetches secrets from your secret manager and copies them into your K8s cluster.
This way, if you move CSPs, all you need to update is the ESO configuration, so that instead of fetching from Scaleway’s secret manager, for example, it fetches from Vault.
I am not a fan of this approach. Firstly, because it compromises security by copying secrets from a highly secure secret manager to a likely much less secure Kubernetes cluster. Secondly, because it only works 99.9% of the time. You have a reliability problem because of the 0.1%.
0.1% doesn’t seem like much of a risk. Why should we worry about it?
Your service reads from the ESO’s equivalent of a cache and not directly from your secret manager. If that cache is not up to date, your service will read the wrong version of your secret. I have my own horror story where ESO didn’t update the cache for the mere 45 seconds that the secret manager was unavailable.
I had a service using an API to contact PayPal. I got an email from PayPal about the API key expiring in 3 days. So I went to PayPal, generated a new key, and uploaded it to my secret manager. Then I had to update my ESO configuration to read the new version of the secret, and restart my service.
I opened a pull request to change the configuration, had it reviewed, and merged it. My deployment pipeline updated the ESO configuration in my production cluster and triggered a rolling restart of my service.
At that moment, ESO couldn’t connect to the secret manager. So the secret wasn’t up to date in my Kubernetes cluster. New instances of my service started anyway, but with the old API key.
The issue lasted for 45 seconds, after which ESO updated the cache. More instances of my service started, but with the new API key this time. The rolling restart completed and my pipeline exited successfully.
So then the problem was fixed, right?
No! I had no idea that the old API key was still in use by some instances. 3 days later, the key expired. Suddenly, half of our payments were failing and my team got alerted. We quickly fixed the issue by replacing the faulty instances, but the incident lasted 25 minutes in total.
That was a big chunk of our error budget, so we wanted to identify and fix the root cause. The problem was the stale cache that ESO was supposed to keep in sync.
This cache is a form of state, shared by all versions and instances of our service. Our service wasn’t stateless anymore. This exposed us to a nasty kind of bug: race conditions.
A race condition can happen when asynchronous operations share state. Here, the asynchronous operations were ESO updating the cache on one side, and our service starting on the other. If these happen in the wrong order, then the end result can be very different.
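To make the ordering problem concrete, here is a toy simulation of the rolling restart described above. It is a simplified model, not ESO's behaviour: the `Cluster` class, the key names, and the number of instances are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model: a cached secret copy plus the instances that read it."""

    cached_key: str
    instances: list = field(default_factory=list)

    def sync(self, secret_manager_key):
        # The operator copies the latest secret into the cluster.
        self.cached_key = secret_manager_key

    def start_instance(self):
        # An instance reads the cached copy once, at startup.
        self.instances.append(self.cached_key)


def rolling_restart(sync_before_instance, total=4):
    """Start `total` instances; the cache sync lands just before
    the instance with index `sync_before_instance`."""
    cluster = Cluster(cached_key="old-key")
    for i in range(total):
        if i == sync_before_instance:
            cluster.sync("new-key")
        cluster.start_instance()
    return cluster.instances


if __name__ == "__main__":
    print(rolling_restart(0))  # sync wins the race: every instance is current
    print(rolling_restart(2))  # sync is late: the first two instances are stale
```

Both runs "succeed" from the pipeline's point of view. Only the order of two independent operations decides whether stale instances are left behind.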
What sort of alternatives exist?
There are three types of solutions:
1. Keep ESO, but update your system to make sure your service only ever restarts after ESO has updated the secret. Reloader is open-source software that watches the secret in your K8s API and restarts your service when it changes. Once the ESO update is complete, Reloader restarts your service again. So even if the race condition happens, it’s automatically mitigated.
2. Get rid of the ESO, and replace it with:
- Berglas for GCP Secret Manager
- Bank-Vaults for HashiCorp Vault
- Murmur for GCP, AWS, Azure, and Scaleway.
Murmur is a small executable I created that you put in your container. Before your service starts, it fetches the secret from the secret manager, puts it in an environment variable, and starts the service for you. So you can’t have that race condition anymore, because there’s no intermediate copy of your secret. Murmur won’t start your service if the secret manager isn’t available. That’s way better than my service starting in an unexpected state. Murmur works whatever language your service is written in, because every single language understands environment variables.
3. “Hot reloading”, a process whereby a service’s credentials are changed without restarting the service. Each instance of your service has a file which is a copy of the secrets stored in your secret manager. Your service watches the file, so when the file changes, the service reads it again, and uses the new credentials. This is useful because you can’t change an environment variable of a running service. But you can change a file read by a running service. This is a very advanced use case, for optimal data security requirements, and overkill for a lot of people. But we use it at Pigment because we sometimes run very long database requests which we don’t want to interrupt with a restart.
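Option three can be sketched roughly as follows. This is a minimal illustration, not how Pigment implements it: it assumes the credentials live in a JSON file, polls the file's modification time rather than using a file-watching library, and uses hypothetical names (`ReloadingCredentials`, `write_credentials`, `db.json`).

```python
import json
import os
import tempfile


class ReloadingCredentials:
    """Re-reads a JSON credentials file whenever its modification time changes."""

    def __init__(self, path):
        self.path = path
        self._mtime_ns = None
        self._data = {}

    def get(self):
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            # The file changed since the last read: load the new credentials.
            with open(self.path, encoding="utf-8") as f:
                self._data = json.load(f)
            self._mtime_ns = mtime_ns
        return self._data


def write_credentials(path, data, mtime_ns):
    """Demo helper: write the file and pin its mtime so the change is deterministic."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    os.utime(path, ns=(mtime_ns, mtime_ns))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "db.json")
        write_credentials(path, {"password": "old"}, 1_000_000_000)
        creds = ReloadingCredentials(path)
        print(creds.get()["password"])
        # Something outside the service rotates the secret on disk.
        write_credentials(path, {"password": "new"}, 2_000_000_000)
        print(creds.get()["password"])
```

The service calls `get()` each time it opens a new connection, so a long-running request keeps its existing connection while the next one picks up the rotated credentials.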
Which ones would you recommend, and why?
Option one is the easiest if you already use ESO. But I recommend option two. It is the simplest and most reliable, and strikes a sweet spot between security and ease of use. And option three, as I said, is only for very high-security situations.
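The fetch-then-start pattern behind option two can be sketched in a few lines. This is not Murmur's code or syntax: the `demo-secret:` reference scheme, the `fake_fetch` stand-in for a secret manager client, and all names below are hypothetical and exist only to show the control flow.

```python
import os

PREFIX = "demo-secret:"  # hypothetical reference scheme, not Murmur's syntax


class SecretUnavailable(RuntimeError):
    """Raised when the secret manager cannot return a secret."""


def resolve_environment(env, fetch):
    """Replace every `demo-secret:<name>` value with the fetched secret.

    `fetch` is any callable returning the secret for a name, or raising.
    If a single fetch fails, resolution fails and nothing gets started.
    """
    resolved = {}
    for key, value in env.items():
        if value.startswith(PREFIX):
            resolved[key] = fetch(value[len(PREFIX):])
        else:
            resolved[key] = value
    return resolved


def run(argv, env, fetch):
    """Resolve secrets, then replace this process with the real service."""
    resolved = resolve_environment(env, fetch)  # raises before exec on failure
    os.execvpe(argv[0], argv, resolved)


def fake_fetch(name):
    """Stand-in for a real secret manager client (assumption for the demo)."""
    store = {"paypal-api-key": "key-v2"}
    if name not in store:
        raise SecretUnavailable(name)
    return store[name]


if __name__ == "__main__":
    env = {"PAYPAL_API_KEY": "demo-secret:paypal-api-key", "PORT": "8080"}
    print(resolve_environment(env, fake_fetch))
```

Because the fetch happens in the same process that then becomes the service, an unavailable secret manager turns into a failed start rather than a running instance with a stale key.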
People think a lot about security, but after a while they also start thinking about user experience. Once they’re there, they need to start thinking about reliability. I found out I had a problem when it blew up in my face. Twice!
I realized a lot of people have this problem in their infrastructure, but they don’t know about it. So I want people to know there is a solution, but above all to start thinking about how reliable their platform really is.