As Platform Engineers specialized in monitoring and observability solutions, my coworkers and I had the chance to experiment a lot with monitoring practices and tools. Now we would like to share our views on implementing monitoring in a way that allows your product to become more resilient and responsive.
Monitoring has a significant impact on your product and its quality. In our experience, naive attempts to implement a monitoring system always fail when they are not accompanied by organizational and cultural shifts. Those shifts are what allow teams to compensate for the high costs of monitoring and to guarantee that the system in place is aligned with their goals.
A primer on monitoring
This post is aimed at engineers, engineering managers, and product managers. We will define monitoring and how it differs from observability and go over some general principles that should be kept in mind during the whole lifecycle of implementing and maintaining your monitoring system. We’ll also look at some common types of monitoring and, finally, how to continuously improve your product by taking advantage of the insights provided by monitoring.
Monitoring is all about collecting telemetries (metrics, logs, traces, etc.) from systems and storing them in a time series database (TSDB). A background system watches that data and issues alerts when certain behaviors are observed, and other systems can take those alerts or time series as input (e.g., Kubernetes auto-scaling on custom metrics).
Common telemetries: metrics, logs, and traces
Alerts can be routed to other systems to trigger automated procedures, be displayed on a dashboard for further investigation, trigger a page to summon the on-call engineer (which is expensive), and much more.
Metrics are the most used among all telemetries as they are typically fresh data and easily aggregable, so they scale easily with your production.
Logs describe unique events, so they’re not directly used for monitoring: they’re too verbose and harder for a machine to manipulate than plain numbers like metrics. But they can be counted and turned into metrics (e.g., by incrementing a counter for every log describing an HTTP 500). Their main use case is still in the name: to log, with a certain verbosity, what’s happening on a system for traceability and post hoc investigation.
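To make the log-to-metric idea concrete, here is a minimal sketch in Python. The log format (method, path, and status code on one line) and the function name are assumptions made for illustration; a real log pipeline would have its own format and tooling.

```python
import re
from collections import Counter

# Assumed (hypothetical) log format: "<method> <path> <status>"
STATUS_RE = re.compile(r"\s(\d{3})$")


def count_status_codes(log_lines):
    """Turn verbose log lines into counters keyed by HTTP status code."""
    counts = Counter()
    for line in log_lines:
        match = STATUS_RE.search(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


logs = [
    "GET /cart 200",
    "POST /checkout 500",
    "GET /cart 200",
    "POST /checkout 500",
    "GET /health 204",
]

# Each counter behaves like a metric: a plain number that is cheap to
# aggregate and to alert on, unlike the raw log lines it came from.
http_status_total = count_status_codes(logs)
print(http_status_total["500"])
```

The counter for status 500 is the kind of number a monitoring rule can evaluate, while the original lines remain available for investigation.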
Traces are mostly used to track a user request that may be distributed among different services. They are basically a collection of the logs triggered by one user request traversing several of your service components and are not really used in monitoring as such but more in observability (more on this later).
Why do you need monitoring?
If you’ve ever looked into the literature on monitoring, you’ve probably come across Part III of the Google SRE Book, which presents a Hierarchy of Needs for service reliability. Google puts monitoring at the “base” of the pyramid, saying it’s impossible to run a reliable service if you are unaware of its state.
Meanwhile, if we look at Service Quality literature, one of the most widely accepted models, which has been in use since the early 90s, is the SERVQUAL model, which evaluates the quality of services in five dimensions. Among them, reliability (“The ability to perform the promised service dependably and accurately”) and responsiveness (“The willingness to help customers and to provide prompt service”) are two criteria that monitoring systems can improve.
Without monitoring, we are unable to detect whether our system is drifting away from its nominal state. Our users will therefore perceive our services as unreliable, as they can’t trust them to perform consistently. And because we are unaware of those drifts, we cannot quickly put the system back on the rails. This negatively impacts our Mean Time To Recover (MTTR), resulting in a drop in responsiveness and in our customers’ trust in us.
With repeated occurrences, the perceived quality of our service will drop to a point where the user is no longer willing to pay for it (the decision to use a service is mainly motivated by the quality-cost ratio), which unavoidably leads to churn.
The difference between monitoring and observability
There have been some shifts in recent years, and we’ve seen the emergence of solutions labeled “observability”. Some have argued that observability is just a buzzy tech word and yet another synonym for monitoring. But we disagree.
The need for human intervention keeps shrinking as autonomous systems develop, but the need for a qualified workforce to maintain those automations is rising. This effectively means that we don’t have to scale our operational team proportionally to our production scale. And that’s what created the need for observability next to monitoring.
Monitoring has allowed us to create automations and/or autonomous systems by collecting telemetries and launching automated procedures based on the evaluation of a set of rules taking input from those telemetries. On the other hand, observability has been designed for humans. It allows us to observe our increasingly complex systems effectively.
The more we automate our systems, the more they look like a black box for us. Observability was born because we made our systems more and more autonomous to handle more and more complex problems, but we realized it made the cognitive load needed to reason about them humanly unbearable.
At the end of the day, all observability solutions advertise themselves as easy to use with a good developer experience for this exact reason: they allow humans to observe things they couldn’t with the naked eye.
That’s why observability and monitoring are complementary, not interchangeable.
The principles of monitoring
Now that we’ve established that there is a need for monitoring, let’s look at the important principles you should keep in mind when working on implementing a solution.
Monitoring is not just technical
Don’t ignore the product and business implications of monitoring. You should have regular meetings (e.g., every quarter) with those parties to speak about your objectives and your Quality of Service (QoS).
This will allow you to assess the relevance of your current monitoring solution, which should be aligned with your service objectives. And if the alert volume is already high and you struggle to meet the objectives, product management should either allocate time to improve the reliability of your service or downgrade the targeted objectives.
Alerts should have a severity spectrum
Alerts typically have a spectrum of severity: you may get HTTP 500 errors for 0.1% of the requests, or you may have all user requests failing. Being able to classify the severity of the symptoms will help you adopt an adequate response. We don’t need to page engineers for every single abnormal behavior, especially when there is no user-facing impact. But if the abnormal behavior happens every time, then someone should take a look at it.
If you want to keep your system simple, avoid using too many severity classes. The minimalistic yet effective classification we recommend is:
- INFO: Alerts giving information that requires some attention. Those shouldn’t trigger notifications but should appear on a dashboard.
- PAGE: Alerts that should page engineers and require an immediate response.
Adding more levels adds quite a lot of complexity to the system. For example, adding a WARN level over the INFO level to prioritize some alerts in the dashboard will require you to set up inhibition rules to silence lower-level alerts corresponding to the same base symptoms, or you will have duplicate entries on your dashboard.
Remember that not all alerts should lead to paging. Some may just be used as informational tickets in dashboards or as automation triggers. If they trigger a page, then they need to be instantly actionable and describe a real urgency.
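As an illustration, the two-level classification can be sketched as a tiny routing function. The `Alert` fields and the destination names ("pager", "dashboard") are hypothetical and only serve to show the idea; they do not come from a specific alerting tool.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    INFO = "info"  # appears on a dashboard, triggers no notification
    PAGE = "page"  # pages the on-call engineer, must be actionable


@dataclass
class Alert:
    name: str
    severity: Severity


def route(alert):
    """Return the (hypothetical) destination for an alert."""
    if alert.severity is Severity.PAGE:
        return "pager"
    return "dashboard"


print(route(Alert("AllRequestsFailing", Severity.PAGE)))
print(route(Alert("FewHttp500", Severity.INFO)))
```

With only two classes there is nothing to inhibit or deduplicate: an alert either interrupts someone or waits on the dashboard.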
Your on-call engineer will experience an adrenaline rush when they get a page, so make their work easier by adding dashboards and runbooks to every alert that help them easily find the root cause.
Alarm fatigue is a real phenomenon that has been observed and studied in other industries as well. So use paging wisely.
Keep it simple to keep the noise down
If you look online, you will find a lot of fancy setups with complex anomaly detection implementation and a long chain of alerts with complex inhibition rules. Use them sparingly.
Most of those setups are hard to maintain because they create a hard coupling between your monitoring and your current situation (your current infrastructure, your current volume, the current context, etc.). In an environment with high change velocity, this is a source of significant bit rot.
There may be cases where you will need a more complex setup, but if it becomes a source of noise, it should definitely be replaced as soon as possible with long-term measures or simpler rules, such as linear predictions, if applicable.
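As an example of such a simpler rule, here is a sketch of a linear prediction: a least-squares line is fitted over recent samples and extrapolated into the future. The function name echoes PromQL's `predict_linear`, but this is a standalone illustration; the sample values and the 100% threshold are assumptions.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares linear extrapolation.

    samples: list of (timestamp_seconds, value) pairs.
    Returns the predicted value `seconds_ahead` after the last sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


# Hypothetical disk usage in percent, sampled once per hour.
disk_usage = [(0, 70.0), (3600, 72.0), (7200, 74.0), (10800, 76.0)]

in_4h = predict_linear(disk_usage, 4 * 3600)
in_24h = predict_linear(disk_usage, 24 * 3600)
print(round(in_4h, 1), in_24h >= 100.0)
```

A rule like "alert if the disk is predicted to be full within 24 hours" stays valid when volumes change, which is what makes it cheaper to maintain than a hand-tuned anomaly detector.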
Monitoring needs to be refined
Monitoring is not plug-and-play. Every product team should always have at least one “monitoring person”; two, if the system is complex. They perform typical engineering jobs but are also responsible for maintaining the monitoring of their product.
Allocating engineers to monitoring isn’t the only organizational measure needed. You also need a ritual to compare the alert volume across different time windows, assess the relevance of the alerts, and decide how to refine the monitoring system.
There are three main types of monitoring we’ll look at:
- Symptom-based monitoring
- Cause-based monitoring
- Resource-based monitoring
This is not an either/or situation. Rather, which type you use, and when, will depend on what you want to do or need to know about your system.
With symptom-based monitoring, the focus is on what the user is experiencing. We want to monitor what our customers are experiencing and warn our team when the system drifts from its nominal state.
The method is based on service-level objectives (SLO), so we need to take a quick detour into the basics. An SLO essentially defines how reliable a service should be over time. Meanwhile, service-level indicators (SLI) describe user-facing concerns, are measured over time, and indicate whether an SLO has been met or violated.
But one SLO violation alone doesn’t immediately mean a service is unreliable. A certain allowance is made for failure through an error budget, which describes the maximum time allowed for a given type of error. The burn rate, in turn, measures how quickly you’re consuming that error budget.
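The arithmetic behind these two notions is short enough to sketch. The 99.9% target, the 30-day period, and the observed error ratio below are assumptions chosen for illustration. A burn rate of 1 means the budget would be exactly used up at the end of the SLO period.

```python
def error_budget(slo_target):
    """Fraction of failure the SLO tolerates, e.g. about 0.001 for 99.9%."""
    return 1.0 - slo_target


def burn_rate(error_ratio, slo_target):
    """How many times faster than 'sustainable' the budget is consumed."""
    return error_ratio / error_budget(slo_target)


SLO_TARGET = 0.999            # assumed objective: 99.9% of requests succeed
PERIOD_MINUTES = 30 * 24 * 60  # assumed 30-day SLO period

# Expressed as time: minutes of full outage the budget allows per period.
budget_minutes = error_budget(SLO_TARGET) * PERIOD_MINUTES

# With 0.5% of requests currently failing, the budget burns about 5x too fast.
current_burn = burn_rate(0.005, SLO_TARGET)

print(round(budget_minutes, 1), round(current_burn, 1))
```

At that pace the whole budget would be gone in roughly a fifth of the period, which is the kind of signal worth acting on before the SLO is violated.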
Monitoring symptoms through SLOs
Once you have defined SLOs, the first thing that may come to mind is to alert when you’ve consumed your error budget. However, you want to be able to act before an actual violation happens.
A good method is to use burn rates by defining different alert conditions that are each made of:
- A budget consumption, e.g., 2% of the budget is consumed,
- A look-back window, e.g., over 1 hour,
- A severity, e.g., using the spectrum we previously presented when we talked about when to page.
Using this method, you will come up with multiple alert conditions, each with their own severity.
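The alert conditions described above can be sketched as data plus one evaluation function. Only the "2% over 1 hour" pair comes from the example in the text; the other two conditions, the 30-day period, and the observed error ratios are assumptions for illustration.

```python
from dataclasses import dataclass

SLO_PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period


@dataclass(frozen=True)
class BurnRateCondition:
    budget_consumed: float  # fraction of the error budget, e.g. 0.02 for 2%
    window_hours: float     # look-back window
    severity: str           # "page" or "info"

    @property
    def burn_rate_threshold(self):
        # Burn rate at which `budget_consumed` is spent within the window.
        return self.budget_consumed * SLO_PERIOD_HOURS / self.window_hours


CONDITIONS = [
    BurnRateCondition(0.02, 1, "page"),   # from the example in the text
    BurnRateCondition(0.05, 6, "page"),   # assumed
    BurnRateCondition(0.10, 72, "info"),  # assumed
]


def firing_severities(slo_target, error_ratio_by_window):
    """Return the severity of every condition whose threshold is exceeded.

    error_ratio_by_window maps window_hours to the observed error ratio.
    """
    budget = 1.0 - slo_target
    fired = []
    for condition in CONDITIONS:
        observed = error_ratio_by_window[condition.window_hours] / budget
        if observed >= condition.burn_rate_threshold:
            fired.append(condition.severity)
    return fired


# Hypothetical measurements: a sharp spike in the last hour only.
fired = firing_severities(0.999, {1: 0.02, 6: 0.004, 72: 0.0005})
print(fired)
```

Short windows with high thresholds catch fast burns that deserve a page, while long windows with low thresholds catch slow leaks that can wait on a dashboard.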
The benefits of symptom-based monitoring
Symptom-based monitoring is by far the most effective type of monitoring because if you are currently failing to meet your SLOs, you can be certain your product is out of its nominal state, and the customers can feel it. There is almost no risk of false positives with this method.
What we like about this methodology:
- It’s data-driven: SLIs and SLOs are an operationalization of the customer’s voice, so you can use them to remind everyone what your customers’ primary interests are by using significant metrics.
- It has clear KPIs to present to the business every quarter.
- Risk management: It models an error budget for innovation. If you’re clearly over-performing your SLOs, you have room to innovate and try new features. Conversely, if you struggle to meet your SLOs, engineers’ time should be spent on making the product more resilient.
Using SLIs and SLOs to monitor your service’s symptoms is an emerging practice, and there are some interesting tools available to help you get started:
- If you are good at learning by code or by example, take a look at the OpenSLO initiative that aims at creating a declarative model for SLIs, SLOs, and their monitoring;
- There also are tools that help you turn declarative specifications into implementations; for example, if you are a Prometheus user, you may be interested in Sloth.
You should keep one thing in mind, though: SLI/SLO-based monitoring will only be effective if your SLOs are significant enough for your users and capture real pain points. Also, if your SLOs are way too ambitious, you will end up with noisy monitoring.
So before starting symptom-based monitoring, you have to ensure that SLIs/SLOs are well integrated into your organization’s processes.
Monitoring SLOs with burn rates gives us a measurement of the pain our users are experiencing with our service in real-time, but it doesn’t explain why it’s happening. You will need to investigate further, using observability tools to find the root cause (or failure).
There is a causal relation between failures and symptoms: failures lead to symptoms, so we should remediate those failures. The general consensus in the industry is that we should focus on symptoms instead of causes (or failures), as cause-based monitoring can quickly become noisy.
This is a dangerous statement if interpreted too drastically: some may say cause-based monitoring is completely useless because the symptoms will be detected by our symptom-based monitoring anyway. While that makes sense, it also means our customers will be impacted if we always wait for symptom-based monitoring to alert us that something is wrong. Sometimes, failures can be detected preemptively with cause-based monitoring, so we can avoid our users being impacted entirely.
The workflow of cause-based monitoring
Typically, we start to experience some incidents that are all linked to the same root cause. In the beginning, we manually remediate these incidents. With time, this root cause may evolve from an occasional occurrence into a common one.
If we are able to detect this failure preemptively, we may decide to monitor it and page our on-call engineer every time it happens and have them deal with it before our service degrades. If the failure and the symptoms appear at the same time, creating an alert on the failure has little to no value, as the symptom-based monitoring should already be ringing anyway.
The problem starts when the failure occurs so often that monitoring becomes really noisy and causes a ton of toil for engineers; this is the moment when you’re supposed to “turn off” the monitoring. In those cases, you may choose to drop the alert and accept the resulting degradation of service or, if the degradation is not acceptable, invest some of your time into remediations and mitigations.
Remediation: addressing failures
In most cases, failures can be addressed and fixed automatically (i.e., without human intervention). Once you develop a remediation script, you simply redirect the alerts for those failures from paging to an automation engine. That way, on-call engineers don’t have to waste hours manually remediating the same problem over and over again.
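Redirecting alerts from paging to automation can be sketched as a small dispatcher. Everything here is hypothetical: the failure names, the alert fields, and the `page` callback stand in for whatever automation engine and paging tool you use.

```python
# Registry of automated remediations, keyed by failure name.
REMEDIATIONS = {}


def remediation(failure_name):
    """Decorator registering a remediation script for a known failure."""
    def register(func):
        REMEDIATIONS[failure_name] = func
        return func
    return register


@remediation("worker_stuck")
def restart_worker(alert):
    # Placeholder for the real remediation script.
    return f"restarted {alert['instance']}"


def handle(alert, page):
    """Run the automation if one exists, otherwise page a human."""
    handler = REMEDIATIONS.get(alert["failure"])
    if handler is None:
        page(alert)  # new or rare failure: a human has to look at it
        return "paged"
    try:
        handler(alert)
    except Exception:
        page(alert)  # the automation itself failed: escalate
        return "paged"
    return "remediated"


pages = []
known = handle({"failure": "worker_stuck", "instance": "w-1"}, pages.append)
unknown = handle({"failure": "disk_corruption", "instance": "w-2"}, pages.append)
print(known, unknown, len(pages))
```

Falling back to a page when the remediation raises keeps the automation from silently hiding a failure it cannot fix.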
Of course, there will be times when short-term fixes are impossible and will require your team to work towards long-term remediations. But honestly, this is the normal life of any service. Day-2 Ops is always the biggest part of our job, so we should work to make it as enjoyable as possible.
We know this part will be the hardest to accept for most engineers, but failures are normal and to be expected. We love to control everything and usually aim for a completely symptom-free system where every possible failure is mapped to fully-tailored remediation. But there are so many things that could possibly go wrong that a failure-free system isn’t even possible.
Mitigation: reducing the effects of failures
This is why we don’t rely on remediation alone. Remediation is made to react to the known unknown, but there are a lot of failure classes yet to be discovered (yay!) or failure classes that are just too costly or irregular to remediate without creating bit rot. So, instead, we couple remediation with failure mitigation mechanisms such as redundancy and failure domain isolation which do not prevent failures from happening but mitigate their impact.
In the lifetime of your service, accept failures and learn from them. Failures are an opportunity to improve the antifragility of your team and your product (which is the ultimate goal but harder to achieve than mere reliability). When your system is, from time to time, put under stress, you train your team to
- identify the weaknesses of your system,
- develop their incident response process,
- keep their cool in the face of stressful situations and adrenaline rush, and
- implement new mitigation strategies and remediations which improve your product quality overall.
Imagine a team facing incidents every one or two months and another experiencing their first incident after two years. Which one do you think will respond better?
How to avoid the toil spiral
When facing noisy cause-based monitoring, there are only two reasonable things to do: drop the alert and accept the service degradation or invest time in automated remediations and/or mitigations so humans no longer have to handle the failure.
Choose an in-between, and you will enter a vicious spiral where your engineers get paged more and more, hence more and more toil, leading to alarm fatigue and less time to work on features and long-term projects.
If you keep releasing new features instead of working on automated remediations and mitigations, those features will create new classes of failure and even more paging. In the end, this will completely nullify your team’s velocity in delivering value on your roadmap, as their focus is 100% on toil work. And that will make it harder to find people to join your team as well, as working only on toil isn’t very attractive to engineers.
While cause-based monitoring requires complex decisions, you should avoid noise generated at the resource level at all costs. There are only a few cases when you want to produce resource-based pages, for example:
- When overshooting your capacity planning and quotas: In those cases, you may need human intervention (e.g., to get approval from management to allocate more resources).
- When your automation to scale up your resources (limited by quotas) fails.
- When you perform migration and allocation manually (which doesn’t scale, but sometimes it’s expensive to automate too early).
- When the page is actionable and describes a real issue. Otherwise, you should avoid those “there may be problems” pages.
The following situations are examples of acceptable resource-based monitoring:
- Disk saturation: once a local disk is full, the affected system won’t really restart without intervention. If you are using remote storage, the recovery can be automated most of the time.
- Memory levels when memory is overcommitted and crash-looping processes lead to data loss, consistency errors, or heavy contention on your nodes.
What you should do, though, is prepare dashboards to ease root cause analysis when symptom-based monitoring is ringing. You can, for example, take advantage of the USE method to abstract away your resources’ low-level details with three high-level metrics: Utilization, Saturation, and Errors. This allows you to iterate over your resources one by one during root cause analysis and find which component is causing a drop in your SLIs.
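Iterating over resources with the USE method could look like the sketch below. The snapshot structure, the sample values, and the thresholds are assumptions; in practice these numbers would come from your dashboards rather than from hard-coded data.

```python
from dataclasses import dataclass


@dataclass
class UseSnapshot:
    resource: str
    utilization: float  # fraction of time the resource was busy (0..1)
    saturation: float   # amount of queued work, e.g. run-queue length
    errors: int         # error events over the observed window


def suspects(snapshots, max_utilization=0.9, max_saturation=0.0):
    """Return the resources worth a closer look, in the order given."""
    return [
        s.resource
        for s in snapshots
        if s.errors > 0
        or s.saturation > max_saturation
        or s.utilization > max_utilization
    ]


# Hypothetical readings taken while a symptom-based alert is firing.
readings = [
    UseSnapshot("cpu", utilization=0.45, saturation=0.0, errors=0),
    UseSnapshot("disk", utilization=0.97, saturation=12.0, errors=0),
    UseSnapshot("network", utilization=0.30, saturation=0.0, errors=4),
]
print(suspects(readings))
```

The point of the three metrics is that the same check applies to every resource, so the person investigating doesn't need to remember resource-specific details under stress.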
To conclude this part, we want to highlight that most of the time, the real remediation of a resource-based failure will be smarter capacity planning, but this is yet another subject that deserves its own blog post.
The alert review: why monitoring needs to be part of your culture
There is one thing that most engineering teams will overlook: the organizational and cultural measures needed to make their monitoring effective. Your monitoring system can get quite complex, and it’s important to review it regularly so you can continuously improve it and your product. To do so, we’ve implemented a ritual we call the “alert review”.
The benefits of this ritual are:
- We can track the evolution of the reliability of our product over different time windows: weekly, monthly, quarterly, and yearly.
- We can see if our objectives are way too ambitious for the current state of our product and decide:
  - to prioritize stabilization over innovation (i.e., slowing down new features) for the time being or
  - to downgrade our objectives.
- We can identify new remediations and mitigations that may be needed to improve our product.
- We can identify which components of our product are causing the most problems and decide to focus on those.
- We can also identify when a manually remediated failure is getting more and more frequent and decide it’s time to automate things to avoid the toil spiral.
- We can save costs by removing unnecessary telemetries and/or reducing the resolution of some time series.
The concept is simple: one team member hosts the meeting, and another one takes notes. Then the alert review can start:
- Macro review: The host starts with overall statistics: how many alerts fired by region (and possibly zone) and severity (page or info), mean time to acknowledge, mean time to remediate, and so on. Start small, and you will incrementally discover the KPIs you need for this ritual. Once familiar with the process, you could even use time windows to observe relative and absolute variations to spot any degradation of your monitoring (e.g., since the last alert review, alerts have become 10% noisier).
- Micro review: The host starts to go over each alert group that has been fired since the last iteration. Once again, using different time windows to spot trends can be helpful. For example, maybe a cause-based alert that is, for now, mitigated manually will start to get more frequent, so the team will decide to start working on automation.
- (Optional) SLI/SLO correlation: The host shares SLI/SLO stats. After going over alert trends, it should be easy to empirically discover some correlation (even if it doesn’t imply causation) between an alert being more frequent and a drop in some SLIs.
- Backlog feeding: The final step is to review the notes made during the meeting and issue tickets for the backlog.
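The macro-review statistics are easy to compute once alerts are exported as records. The sketch below assumes a hypothetical export format (region, severity, seconds to acknowledge, seconds to remediate); adapt the field names to whatever your alerting tool provides.

```python
from collections import Counter
from statistics import mean

# Hypothetical export of the alerts fired since the last review.
alerts = [
    {"region": "eu-west", "severity": "page", "ack_s": 120, "remediate_s": 1800},
    {"region": "eu-west", "severity": "page", "ack_s": 240, "remediate_s": 600},
    {"region": "eu-west", "severity": "info", "ack_s": None, "remediate_s": None},
    {"region": "us-east", "severity": "info", "ack_s": None, "remediate_s": None},
]


def macro_review(alerts):
    """Overall statistics the host can open the alert review with."""
    pages = [a for a in alerts if a["severity"] == "page"]
    return {
        "count_by_region_severity": dict(
            Counter((a["region"], a["severity"]) for a in alerts)
        ),
        "mean_time_to_ack_s": mean([a["ack_s"] for a in pages]) if pages else None,
        "mean_time_to_remediate_s": (
            mean([a["remediate_s"] for a in pages]) if pages else None
        ),
    }


def relative_change(previous, current):
    """Relative variation between two reviews, e.g. 0.1 for 10% noisier."""
    return (current - previous) / previous


stats = macro_review(alerts)
noise_change = relative_change(previous=40, current=44)
print(stats["mean_time_to_ack_s"], stats["mean_time_to_remediate_s"])
```

Starting with a handful of numbers like these is enough; more KPIs can be added as the team discovers which ones actually drive decisions in the review.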
Monitoring is not plug-and-play! It’s an engineering practice that requires discipline and continuous improvement in order to bring antifragility to your product while staying cost-effective.
Monitoring costs boil down to the cost of collecting telemetries, storing them long-term, and indexing them for further querying.
Monitoring and observability can get really expensive and typically scale with your production, so when you start to see which telemetries are significant and important for you to collect, you should down-sample or remove the others to reduce your costs.
We went over the general principles of effective monitoring and its different types.
Symptom-based monitoring exploits SLI/SLO concepts, allowing us to produce alerts that reflect actual customer pain. You should page when you start burning your SLO budget.
Cause-based (or failure-based) monitoring allows us to detect failures leading to high customer impact. Paging should only be used when a failure is too new or rare to have automated remediations and/or mitigations in place.
Resource-based monitoring is sometimes unavoidable but should only be used when resource exhaustion cannot be automatically remediated (e.g., with auto-scaling mechanisms) or when you are exceeding your capacity planning. In both cases, page if, and only if, the page is actionable for your on-call engineer. Good capacity planning is the real key there.
We really can’t stress enough how important it is to refine your monitoring with rituals like alert reviews. This is especially true when you use cause-based and/or resource-based monitoring, as those tend to be short-term and operational. They give you insight into the current failure rate, the adequacy of your capacity planning, and your level of toil. If you don’t want to be confined to operational views, reviewing your symptom-based monitoring and the related SLIs/SLOs should allow you to take your customers’ point of view.
As we saw, monitoring helps you improve the responsiveness and reliability of your service. It’s not an easy discipline and can get really expensive when misused, but an experienced engineering team should be able to design a monitoring system whose benefits clearly outweigh the costs.
For all these reasons, I personally think monitoring shouldn’t be considered plug-and-play: it is a fundamental engineering discipline.