How to Build Observable Systems - An Introduction to Observability

Introduction

Recently I have been hearing about observability more and more. The promise that an observable system can help us to debug and identify performance and reliability issues in a microservice architecture sounded quite good to me. Hence I decided to read up and learn more on this topic. In this blog post I will try to summarise what I have learned about observability so far.

What is Observability?

Observability is a concept that has its roots in control theory. According to Wikipedia -

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

… A system is said to be observable if, for any possible evolution of state and control vectors, the current state can be estimated using only the information from outputs (physically, this generally corresponds to information obtained by sensors). In other words, one can determine the behavior of the entire system from the system’s outputs. On the other hand, if the system is not observable, there are state trajectories that are not distinguishable by only measuring the outputs.

Extending the same concept to a software system, we can say -

A software system is observable if we can ask new questions from the outside to understand what is going on on the inside, all without deploying new code.

So in effect, observability is a measure of how well we can make sense of what is going on with our application by asking arbitrary questions (the unknown-unknowns) about the system, without having to know the questions in advance. The more observable our systems are, the more arbitrary questions we are able to ask. Used effectively, it can greatly improve our application quality and reliability by making it relatively easy to debug and identify potential performance and reliability issues in production.

Why should I care about Observability?

Because software is becoming far more complex than it used to be, which makes it harder to predict most production issues in advance.

Consider a regular monolithic application. In such applications the entire codebase is in one place, making it possible to browse through different use cases end-to-end and anticipate most (if not all) of the production issues in advance. Once the problematic areas have been identified, we augment the application code to collect and report various metrics, visualise these metrics in dashboards, and create alerts. Combining application logs with these collected metrics is often enough to debug most of the performance and reliability issues in production. If we also throw distributed tracing into the mix, our chances of finding and quickly fixing these issues increase even further.

Contrast this with the current trend in the software world. Nowadays we see a clear preference among companies to decompose large monolithic applications into smaller microservices in order to achieve greater business agility. As a result, systems are becoming more and more distributed. Small, independent teams work in parallel on different services whose code bases are separate. What used to be function invocations have now become network calls between remote applications. With the adoption of DevOps practices releases are becoming more frequent, reducing the time needed to release features to production once they are ready. All of this results in a system with more moving parts, each changing frequently at its own pace, which makes it difficult to predict how an application will behave in production. Often, questions like “will the release of this new feature in application X interfere with the existing features in application A? What about application B? Or C?” can be answered only after we release application X to production. As a direct consequence of adopting a microservice-oriented architecture, debugging gets more difficult than in a monolithic system.

This is where the concept of observability is particularly useful. Being able to ask arbitrary questions about our entire system without having to know them in advance can greatly reduce the burden of debugging and identifying performance and reliability issues in a microservice architecture.

The 3 pillars of Observability

Metrics, Logs, and Distributed Traces are often called the 3 pillars of observability because these are the tools we have traditionally used to make sense of our systems. Most of us are already familiar with these concepts and the related tools, so we are not going to dive deep into them in this article. Instead, we will try to understand why these 3 pillars are not enough to create an observable system.

Metrics

Metrics are numerical measurements taken over intervals of time. We use metrics to measure the response times of requests, to count the number of requests that failed to get a valid response, and so on. For a long time metrics have been the standard way to monitor the overall health of an application - number of live instances, current memory consumption, CPU usage, response times, query execution time etc. They are also used to trigger alerts in case of emergencies like instances going down, low memory, high CPU usage etc. All in all, a very useful tool.

However, traditional metrics-based tools are not enough to create an observable system. One of the primary reasons is that metric-based tools can deal with only low-cardinality dimensions. Things like user ids, purchase ids, shopping cart ids - any data with high cardinality - are not collected with these tools, as otherwise the cost would blow up. Also, in order to keep the associated costs low, metrics are aggregated on the client side and then stored in their aggregated form, losing even more granularity. Without high-cardinality data it is difficult to investigate and debug issues in a microservice architecture. As a result, any question that a metric-based tool answers has to be defined in advance so that we can collect targeted metrics to answer it. This is the antithesis of observability, which requires being able to ask arbitrary questions about a system without knowing them in advance. Without high-cardinality data, this is not possible.
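To see why the cost blows up, here is a minimal sketch of the arithmetic. It assumes a metrics backend that stores one time series per distinct combination of label values (a common design, but an assumption here), and the label names and counts are made up for illustration.

```python
from math import prod
from typing import Mapping


def series_count(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on the number of distinct time series for one metric:
    the product of the number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical labels attached to a single response-time metric.
low_cardinality = {"service": 20, "endpoint": 30, "status_code": 5}

# The same metric with one high-cardinality label added.
high_cardinality = {**low_cardinality, "user_id": 1_000_000}

print(series_count(low_cardinality))   # 3000
print(series_count(high_cardinality))  # 3000000000
```

A single user id label turns a few thousand series into billions in the worst case, which is why such dimensions are usually left out of metrics.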

Another downside is that the collected metrics are not tied to the request which triggered them. In a microservice-oriented architecture a single user request can hit many different services, query different databases or caches, send messages to queues or Kafka topics, or interact with any combination of these. We can collect metrics from each of these sources, but once collected we can never link them back together. This makes it difficult to answer questions like: why does this particular group of users see response times of 10 seconds while our metrics dashboard is showing a p99 of 1 second?
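The following sketch illustrates that last question with made-up data: an aggregated p99 looks healthy while a handful of users wait 10 seconds. The user ids, timings, and the nearest-rank percentile helper are all assumptions chosen for the example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-request events: (user id, response time in seconds).
events = [(f"user-{i}", 0.2) for i in range(985)]
events += [(f"user-{i}", 1.0) for i in range(985, 995)]
events += [(f"vip-{i}", 10.0) for i in range(5)]

# What an aggregated dashboard would show.
p99 = percentile([seconds for _, seconds in events], 99)

# What we can only answer if each measurement still carries its user id.
slow_users = sorted({user for user, seconds in events if seconds > p99})

print(p99)         # 1.0
print(slow_users)  # ['vip-0', 'vip-1', 'vip-2', 'vip-3', 'vip-4']
```

Once the timings are aggregated into a single p99 value, the second question can no longer be answered, because the link between a measurement and its request is gone.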

Logs

Logs are a useful tool that helps us debug issues by providing context-dependent messages and stack traces. However, they cannot be used effectively to create an observable system.

One of the primary downsides of using logging to create an observable system is the associated cost. Systems that use logging to improve observability become too expensive to maintain. This is how Ben Sigelman, co-founder of Lightstep, explains the problem in one of his articles on the Lightstep blog (I highly recommend giving the entire article a thorough read) -

If we want to use logs to account for individual transactions (like we used to in the days of a monolithic web server’s request logs), we would need to pay for the following:
 Application transaction rate
     * all microservices
     * cost of network and storage
     * weeks of data retention
 = way, way too much $$$$
Logging systems can’t afford to store data about every transaction anymore because the cost of those transactional logs is proportional to the number of microservices touched by an average transaction.
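To get a feel for the multiplication in the quote above, here is a back-of-the-envelope sketch. Every number in it (transaction rate, bytes per log line, retention period, number of services) is a made-up assumption, and it only estimates data volume, not an actual bill.

```python
SECONDS_PER_DAY = 86_400


def retained_log_volume_gb(tx_per_second, services_per_tx,
                           bytes_per_log_line, retention_days):
    """Rough volume of transactional logs kept at any time, assuming each
    service touched by a transaction writes exactly one log line."""
    total_bytes = (tx_per_second * services_per_tx * bytes_per_log_line
                   * SECONDS_PER_DAY * retention_days)
    return total_bytes / 1e9


# Hypothetical workload: 1000 transactions/s, 500-byte lines, 14 days retention.
monolith = retained_log_volume_gb(1000, 1, 500, 14)
microservices = retained_log_volume_gb(1000, 20, 500, 14)

print(monolith)       # 604.8
print(microservices)  # 12096.0
```

With the same traffic, the volume grows linearly with the number of services an average transaction touches, which is the point the quote makes.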

Another downside is that in order to answer arbitrary questions about our system we would have to log quite aggressively. Since traditional logging libraries cannot dynamically sample logs, logging excessively could adversely affect the performance of the application as a whole.

Distributed Tracing

This is how OpenTracing, a Cloud Native Computing Foundation project, defines Distributed Tracing -

Distributed tracing, also called distributed request tracing, is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.

Distributed tracing has its use in building an observable system. After all, traces are the threads with which we can connect an end-to-end request in a microservice architecture. However, they come with a few challenges of their own.

The first challenge is choosing the right sampling strategy. Traditionally, distributed tracing tools have decided whether to sample a request at the very beginning, when the request enters the infrastructure for the first time from outside. Because this decision is made before we know how the request will turn out, the resulting strategy either samples too aggressively and collects too much data, which is expensive to store and analyse, or samples too sparingly and does not collect enough data to help us with observability.
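A minimal sketch of such an up-front (often called head-based) decision is shown below. The function name, the hashing scheme, and the 10% ratio are assumptions for illustration and are not taken from any particular tracing tool.

```python
import hashlib


def head_sample(trace_id: str, keep_ratio: float) -> bool:
    """Decide at the edge whether to record a trace.

    The decision depends only on the trace id, so every service that sees
    the same id reaches the same verdict. Note what is missing from the
    inputs: the status code and the duration, which are not known yet.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < keep_ratio


# Roughly one in ten traces is kept, whether or not it later fails.
kept = sum(head_sample(f"trace-{i}", 0.1) for i in range(10_000))
print(kept)
```

Whatever ratio we pick applies equally to boring and interesting requests, which is why a single up-front ratio tends to be either too expensive or too sparse.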

The second challenge is the UI with which we analyse the trace data. Tracing tools usually come with a UI component which displays all the traces in what is called a trace view. In a system with hundreds of services where a typical request touches 20 or 30 of them, the trace view becomes too complex for a human to analyse without any automated support. In addition, spans, which tracing systems treat as units of work and which are responsible for capturing trace data from services, are too low level to be used for debugging purposes. Cindy Sridharan wrote an excellent article on this topic where she explains the problem much better -

Admittedly, some tracing systems provide condensed traceviews when the number of spans in a trace are so exceedingly large that they cannot be displayed in a single visualization. Yet, the amount of information being encapsulated even in such pared down views still squarely puts the onus on the engineers to sift through all the data the traceview exposes and narrow down the set of culprit services. This is an endeavor machines are truly faster, more repeatable and less error-prone than humans at accomplishing.

… The fundamental problem with the traceview is that a span is too low-level a primitive for both latency and “root cause” analysis. It’s akin to looking at individual CPU instructions to debug an exception when a much higher level entity like a backtrace would benefit day-to-day engineers the most.

Furthermore, I’d argue that what is ideally required isn’t the entire picture of what happened during the lifecycle of a request that modern day traces depict. What is instead required is some form of higher level abstraction of what went wrong (analogous to the backtrace) along with some context. Instead of seeing an entire trace, what I really want to be seeing is a portion of the trace where something interesting or unusual is happening. Currently, this process is entirely manual: given a trace, an engineer is required to find relevant spans to spot anything interesting. Humans eyeballing spans in individual traces in the hopes of finding suspicious behavior simply isn’t scalable, especially when they have to deal with the cognitive overhead of making sense of all the metadata encoded in all the various spans like the span ID, the RPC method name, the duration of the span, logs, tags and so forth.

I highly recommend giving the article a thorough read.
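As a small illustration of what "automated support" could mean, the sketch below ranks the spans of one trace by their self time (their own duration minus that of their direct children) instead of leaving a human to eyeball them. This is not how any particular tracing tool works: the Span fields and the example trace are hypothetical, and the calculation assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


def self_times(spans: List[Span]) -> Dict[str, float]:
    """Duration of each span minus the total duration of its direct children."""
    child_total: Dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0)
            for s in spans}


# Hypothetical trace of a single slow request.
trace = [
    Span("1", None, "GET /checkout", 1200),
    Span("2", "1", "auth", 50),
    Span("3", "1", "cart-service", 1100),
    Span("4", "3", "db query", 1000),
    Span("5", "3", "cache lookup", 20),
]

times = self_times(trace)
hottest = max(trace, key=lambda s: times[s.span_id])
print(hottest.name, times[hottest.span_id])  # db query 1000.0
```

Even this crude ranking points at the database query directly, instead of asking the reader to walk a five-span tree, let alone one with hundreds of spans.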

What does an ideal Observability tool look like?

Charity Majors, co-founder of Honeycomb, wrote an excellent article on the Honeycomb blog where she lists the criteria that a tool must fulfil in order to deliver observability -

Arbitrarily-wide structured raw events
Context persisted through the execution path
Without indexes or schemas
High-cardinality, high-dimensionality
Ordered dimensions for traceability
Client-side dynamic sampling
An exploratory visual interface that lets you slice and dice and combine dimensions
In close to real-time
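To make the first few criteria more concrete, here is a sketch of what one arbitrarily-wide structured event could look like: one event per request per service, with no fixed schema, carrying high-cardinality fields and the trace context. All field names and values are hypothetical, and emitting the event as a JSON line is just one possible transport.

```python
import json
import time
import uuid


def new_event(trace_id=None, **fields):
    """Start an event for one request; any number of fields can be added
    later because there is no predefined schema."""
    event = {"trace_id": trace_id or uuid.uuid4().hex, "timestamp": time.time()}
    event.update(fields)
    return event


# Context is accumulated along the execution path of the request.
event = new_event(service="checkout", endpoint="/cart/pay")
event["user_id"] = "user-48213"        # high-cardinality field
event["cart_id"] = "cart-9931"         # high-cardinality field
event["db.query"] = "SELECT * FROM carts WHERE id = ?"
event["db.duration_ms"] = 380.2
event["duration_ms"] = 412.7
event["error"] = None

# Emit the raw event, unaggregated, as a single JSON line.
line = json.dumps(event, sort_keys=True)
print(line)
```

Because the event is stored raw rather than pre-aggregated, any of these fields can later be used to slice the data, including ones nobody thought to put on a dashboard.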

She then goes on to explain the reasoning behind her choices, all of which I fully agree with. I highly recommend giving the article a thorough read.

Trying out an existing Observability tool - Honeycomb

In the same article, Charity describes how Honeycomb was built to deliver on these promises. Hence I decided to give it a try by checking out their live play scenarios.

In the Play with Tracing and BubbleUp scenario I followed the step by step guide to identify some outlier requests which were taking longer than the rest. By the end of the demo I was able to nail the problem down to the individual user who was experiencing the slower response times. I could definitely see how this technique could help me to debug performance issues in production which are affecting a portion of the users but are not visible in my pre-defined metrics dashboard.

Next I tried out the Play with Events scenario, which contains data about an actual production incident that Honeycomb faced back in 2018. Using the step by step guide as before, I was able to identify the failed database that was the root of the issue.

I noticed the following aspects of the tool -

High-cardinality data: In the first scenario I was able to link the response time with an individual user, and then link the slower response time with the individual query that was being executed. Without high-cardinality data this would have been impossible. At best I could have tried to guess the issue and added sporadic log statements here and there, but I would still have had to rely on luck to give me a break. Debugging should not be tied to luck.
Metrics are also tied to requests/traces: All response times were tied to individual traces, making it easy to identify the requests which were slow.
Wide events: The trace events contained a lot of data, including even the database query that was executed for the affected user! Without this query it would have been difficult to narrow the issue down to a database performance problem.
Dynamic dashboards: All the dashboards that are generated are fully dynamic, and it’s possible to create dashboards per dimension!
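The kind of question I was asking in the first scenario (what do the slow requests have in common?) can be sketched on top of wide events as below. This is only my own simplified illustration, not how Honeycomb implements BubbleUp; the event fields, the 1000 ms threshold, and the scoring by difference in share are all assumptions.

```python
from collections import Counter


def overrepresented(events, is_outlier, dimensions):
    """For every value of every dimension, compare its share among the
    outlier events with its share among the remaining events, and return
    (difference, dimension, value) tuples, largest difference first."""
    outliers = [e for e in events if is_outlier(e)]
    baseline = [e for e in events if not is_outlier(e)]
    findings = []
    for dim in dimensions:
        out_counts = Counter(e.get(dim) for e in outliers)
        base_counts = Counter(e.get(dim) for e in baseline)
        for value, count in out_counts.items():
            out_share = count / len(outliers)
            base_share = base_counts.get(value, 0) / len(baseline) if baseline else 0.0
            findings.append((out_share - base_share, dim, value))
    return sorted(findings, key=lambda f: f[0], reverse=True)


# Hypothetical wide events: 95 fast requests and 5 slow ones from one user.
events = [{"user_id": f"user-{i % 10}", "endpoint": "/search", "duration_ms": 120}
          for i in range(95)]
events += [{"user_id": "user-42", "endpoint": "/search", "duration_ms": 4000}
           for _ in range(5)]

ranking = overrepresented(events, lambda e: e["duration_ms"] > 1000,
                          ["user_id", "endpoint"])
print(ranking[0])  # (1.0, 'user_id', 'user-42')
```

The endpoint is the same for fast and slow requests, so it scores 0, while the user id stands out immediately. This only works because the high-cardinality user id was kept on every event.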

At this point I was curious to know the strategy Honeycomb uses to decide which requests to sample. I searched the docs and found the section about Dynamic Sampling, which describes how the sampling decision can be based on whether an HTTP request encounters an error -

For example: when recording HTTP events, we may care about seeing every server error but need less resolution when looking at successful requests. We can then set the sample rate for successful requests to 100 (storing one in a hundred successful events). We include the sample rate along with each event—100 for successful events and 1 for error events.

Hence with Honeycomb it is possible to delay the sampling decision until the request has been fully executed. This strategy is very handy: we can sample failed or problematic requests aggressively, which makes them easy to debug, while sampling successful requests more sparingly to keep the data volume low.
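Here is a rough sketch of the idea from the quoted docs, using the same rates (1 for errors, 100 for successes). It is not the Honeycomb SDK: the function names and the event shape are my own assumptions. The point it shows is that storing the sample rate with each kept event lets us estimate the original counts afterwards.

```python
import random

ERROR_SAMPLE_RATE = 1      # keep every server error
SUCCESS_SAMPLE_RATE = 100  # keep about one in a hundred successes


def sample_rate_for(status_code: int) -> int:
    return ERROR_SAMPLE_RATE if status_code >= 500 else SUCCESS_SAMPLE_RATE


def sample(events, rng):
    """Decide after each request has finished, and record the rate used."""
    kept = []
    for event in events:
        rate = sample_rate_for(event["status_code"])
        if rng.random() < 1 / rate:
            kept.append({**event, "sample_rate": rate})
    return kept


def estimated_total(kept_events):
    """Each kept event stands in for sample_rate original events."""
    return sum(e["sample_rate"] for e in kept_events)


# Hypothetical traffic: 10,000 successes and 20 server errors.
traffic = [{"status_code": 200}] * 10_000 + [{"status_code": 500}] * 20
kept = sample(traffic, random.Random(42))

kept_errors = [e for e in kept if e["status_code"] >= 500]
kept_successes = [e for e in kept if e["status_code"] < 500]
print(len(kept_errors), estimated_total(kept_errors))        # 20 20
print(len(kept_successes), estimated_total(kept_successes))  # roughly 100 and 10000
```

Every error survives, while the stored successes shrink to about one percent of the traffic without losing the ability to estimate how many there were.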

One other thing that I noticed - in order to identify which requests are slow, we first need to define what a slow request looks like. For some applications it may be perfectly acceptable if a request completes within 4 seconds, while for other types of applications that might be too slow. Since SLIs, SLOs, and SLAs are gaining more and more popularity in our industry, it would make sense to use SLOs to define these criteria, and then create sampling strategies based on this definition. If a request fails our SLO, we can always decide to sample and store the request so that we can later debug why it failed. If it meets the SLO, we can adopt a more relaxed sampling rate.
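A sketch of how such an SLO-driven rule might look is given below. The class, its field names, and the thresholds and rates are all hypothetical; the only idea taken from the text is that the definition of "slow" comes from each application's SLO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySlo:
    threshold_ms: float          # slower than this breaches the SLO
    breach_sample_rate: int = 1  # keep every breaching or failed request
    ok_sample_rate: int = 50     # keep roughly 1 in 50 compliant requests

    def sample_rate(self, duration_ms: float, failed: bool) -> int:
        """Return the rate to use once the request has finished."""
        if failed or duration_ms > self.threshold_ms:
            return self.breach_sample_rate
        return self.ok_sample_rate


# The same 4.5 second request is a breach for one application
# and perfectly fine for another.
checkout_slo = LatencySlo(threshold_ms=4_000)
batch_export_slo = LatencySlo(threshold_ms=60_000)

print(checkout_slo.sample_rate(4_500, failed=False))      # 1
print(batch_export_slo.sample_rate(4_500, failed=False))  # 50
```

Keeping the threshold in one place means the sampling rule and the SLO cannot silently drift apart.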

All in all, Honeycomb has left quite a good impression on me. Indeed it’s an excellent tool!

Are Metric-based monitoring tools going to be obsolete?

I don’t think so. Metrics-based monitoring tools are still the best choice when we want to answer any pre-defined questions about our system (also called known-unknowns) -

How many application instances are live at the moment?
What is the amount of memory being consumed by the applications?
What is the CPU usage?

Observability tools, on the other hand, are best at answering the unknown-unknowns, things like -

Why does this user see a response time of 4 seconds?
Why do the database queries hitting the database instances in region X take more than 5 seconds to complete?

An ideal observable system would combine both.

Conclusion

I am still at a very early stage of my observability journey, and still learning the concepts and the tools used in this field. However, I am already convinced that in a distributed system architecture observability practices are invaluable and can help us improve the quality and the reliability of our applications a great deal. I intend to apply these practices in my day to day work, and as I learn more I will definitely try to share what I learn on my blog (time permitting)!

Acknowledgements

These are the resources which helped me learn what observability truly is and how to build an observable system -