Recently I have been hearing about observability more and more. The promise that an observable system can help us debug and identify performance and reliability issues in a microservice architecture sounded quite appealing to me, so I decided to read up on the topic. In this blog post I will try to summarise what I have learned about observability so far.
What is Observability?
Observability is a concept that has its roots in control theory. According to Wikipedia -
Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
... A system is said to be observable if, for any possible evolution of state and control vectors, the current state can be estimated using only the information from outputs (physically, this generally corresponds to information obtained by sensors). In other words, one can determine the behavior of the entire system from the system's outputs. On the other hand, if the system is not observable, there are state trajectories that are not distinguishable by only measuring the outputs.
Extending the same concept to a software system, we can say -
A software system is observable if we can ask new questions from the outside to understand what is going on on the inside, all without deploying new code.
Why should I care about Observability?
The 3 pillars of Observability
Logs, metrics, and traces are often called the three pillars of observability, yet each of them has well-known limitations at microservice scale. On the cost of logging every transaction:
If we want to use logs to account for individual transactions (like we used to in the days of a monolithic web server's request logs), we would need to pay for the following:

Application transaction rate * all microservices * cost of network and storage * weeks of data retention = way, way too much $$$$

Logging systems can't afford to store data about every transaction anymore because the cost of those transactional logs is proportional to the number of microservices touched by an average transaction.
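To get a feel for why that multiplication blows up, here is a back-of-the-envelope sketch. Every number in it is an assumption I made up for illustration, not data from any real system:

```python
# Back-of-the-envelope estimate of transactional log cost.
# Every number below is a made-up assumption for illustration only.

requests_per_second = 1_000          # application transaction rate
microservices_per_request = 20       # services touched by an average transaction
bytes_per_log_line = 500             # size of one structured log line
retention_days = 30                  # data retention window
cost_per_gb_month = 0.50             # combined network + storage cost (USD)

lines_per_day = requests_per_second * 86_400 * microservices_per_request
gb_per_day = lines_per_day * bytes_per_log_line / 1e9
monthly_gb_stored = gb_per_day * retention_days
monthly_cost = monthly_gb_stored * cost_per_gb_month

print(f"~{gb_per_day:,.0f} GB/day, ~{monthly_cost:,.0f} USD/month")
# -> ~864 GB/day, ~12,960 USD/month, and it scales linearly with the
#    number of microservices an average transaction touches
```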
Distributed tracing, also called distributed request tracing, is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.
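Most tracing systems share roughly the same data model: a trace is a tree of timed spans tied together by IDs. Here is a minimal sketch of that model; the field names loosely follow common conventions (e.g., OpenTelemetry's), but the class itself is my own illustration, not any particular product's API:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One timed operation inside a distributed trace."""
    name: str                      # e.g. "POST /pay" or "db.query"
    trace_id: str                  # shared by every span in the same request
    parent_id: Optional[str]       # the span that caused this one; None for the root
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.time)
    duration_ms: float = 0.0
    tags: dict = field(default_factory=dict)  # service name, status code, ...

# A two-service request: the API gateway calls the payment service.
root = Span(name="POST /pay", trace_id=uuid.uuid4().hex, parent_id=None)
child = Span(name="payments.charge", trace_id=root.trace_id,
             parent_id=root.span_id, tags={"service": "payments"})
```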
Admittedly, some tracing systems provide condensed traceviews when the number of spans in a trace is so exceedingly large that they cannot be displayed in a single visualization. Yet, the amount of information encapsulated even in such pared-down views still squarely puts the onus on the engineers to sift through all the data the traceview exposes and narrow down the set of culprit services. This is an endeavor machines are truly faster, more repeatable and less error-prone than humans at accomplishing...

The fundamental problem with the traceview is that a span is too low-level a primitive for both latency and “root cause” analysis. It’s akin to looking at individual CPU instructions to debug an exception when a much higher-level entity like a backtrace would benefit day-to-day engineers the most.

Furthermore, I’d argue that what is ideally required isn’t the entire picture of what happened during the lifecycle of a request that modern-day traces depict. What is instead required is some form of higher-level abstraction of what went wrong (analogous to the backtrace) along with some context. Instead of seeing an entire trace, what I really want to be seeing is a portion of the trace where something interesting or unusual is happening. Currently, this process is entirely manual: given a trace, an engineer is required to find relevant spans to spot anything interesting. Humans eyeballing spans in individual traces in the hopes of finding suspicious behavior simply isn’t scalable, especially when they have to deal with the cognitive overhead of making sense of all the metadata encoded in the various spans, like the span ID, the RPC method name, the duration of the span, logs, tags and so forth.
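The quote argues that machines, not humans, should narrow a trace down to its interesting spans. As a toy illustration of what such automation could look like (my own naive sketch, not a feature of any existing tracer), here is a heuristic that flags spans whose duration is an outlier among spans with the same name:

```python
from statistics import mean, pstdev

def interesting_spans(spans, z_threshold=2.0):
    """Naive anomaly filter: flag spans whose duration is more than
    z_threshold standard deviations above the mean for their operation name."""
    by_name = {}
    for s in spans:
        by_name.setdefault(s["name"], []).append(s)
    flagged = []
    for name, group in by_name.items():
        durations = [s["duration_ms"] for s in group]
        mu, sigma = mean(durations), pstdev(durations)
        for s in group:
            if sigma > 0 and (s["duration_ms"] - mu) / sigma > z_threshold:
                flagged.append(s)
    return flagged

spans = [
    {"name": "db.query", "duration_ms": 12},
    {"name": "db.query", "duration_ms": 15},
    {"name": "db.query", "duration_ms": 13},
    {"name": "db.query", "duration_ms": 14},
    {"name": "db.query", "duration_ms": 11},
    {"name": "db.query", "duration_ms": 900},  # the outlier worth looking at
]
print(interesting_spans(spans))  # only the 900 ms span is surfaced
```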
What does an ideal Observability tool look like?
- Arbitrarily-wide structured raw events (see the sketch after this list)
- Context persisted through the execution path
- Without indexes or schemas
- High-cardinality, high-dimensionality
- Ordered dimensions for traceability
- Client-side dynamic sampling
- An exploratory visual interface that lets you slice and dice and combine dimensions
- In close to real-time
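To make the first few items concrete, here is what an arbitrarily-wide, high-cardinality structured event might look like for a single request. All field names and values are invented for illustration:

```python
import json
import time

# One arbitrarily-wide structured event per unit of work. Real
# instrumentation would pull these fields from the request context
# as it flows through the service.
event = {
    "timestamp": time.time(),
    "service": "checkout",
    "trace.trace_id": "4bf92f35",      # context carried along the execution path
    "trace.span_id": "00f067aa",
    "http.method": "POST",
    "http.status_code": 200,
    "duration_ms": 212.4,
    "user.id": "user_8675309",         # high-cardinality: unique per user
    "cart.items": 7,
    "db.query": "SELECT id, total FROM orders WHERE user_id = ?",
    "feature_flags": ["new_pricing"],
    "build.sha": "9f2c1d0",
}
print(json.dumps(event))  # ship to the event store; no schema declared up front
```

Because the event is just a flat map, new dimensions can be added at any time without a schema migration, and any field (user.id, build.sha, ...) can later be used to slice the data.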
Trying out an existing Observability tool - Honeycomb
- High-cardinality data: In the first scenario I was able to link the response time to an individual user, and then link the slower response time to the individual query that was being executed. Without high-cardinality data this would have been impossible; at best I could have guessed at the issue and added sporadic log statements here and there, hoping for luck to give me a break. Debugging should not be tied to luck.
- Metrics are also tied to requests/traces: All response times were tied to individual traces, making it easy to identify which requests were slow.
- Wide events: The trace events contained a lot of data, including even the database query that was executed by the affected user! Without this query it would have been difficult to nail the issue down to a database performance problem. (A sketch of sending such a wide event follows this list.)
- Dynamic dashboards: All the dashboards being generated are fully dynamic, and it's possible to create dashboards per dimension!
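For reference, sending such a wide event to Honeycomb takes only a few lines with the libhoney-py client. This sketch is based on my reading of the libhoney docs; double-check the signatures against the current README, and note that the write key and dataset name are placeholders:

```python
# Sketch of sending a wide trace event to Honeycomb with libhoney-py.
import libhoney

libhoney.init(writekey="YOUR_WRITE_KEY", dataset="my-service")

ev = libhoney.new_event()
ev.add({
    "trace.trace_id": "4bf92f35",
    "name": "db.query",
    "duration_ms": 4012.7,
    "user.id": "user_8675309",                     # the high-cardinality field
    "db.query": "SELECT id, total FROM orders",    # the wide-event payload
})
ev.send()
libhoney.close()  # flush pending events before the process exits
```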
For example: when recording HTTP events, we may care about seeing every server error but need less resolution when looking at successful requests. We can then set the sample rate for successful requests to 100 (storing one in a hundred successful events). We include the sample rate along with each event—100 for successful events and 1 for error events.
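A minimal sketch of that sampling rule, with function and field names of my own choosing:

```python
import random

def sample(event):
    """Keep every error, one in a hundred successes.
    Returns (keep, sample_rate) so a stored event carries its own weight."""
    is_error = event["http.status_code"] >= 500
    rate = 1 if is_error else 100          # 1 = keep all, 100 = keep 1-in-100
    keep = is_error or random.randint(1, rate) == 1
    return keep, rate

event = {"http.status_code": 200, "duration_ms": 87.2}
keep, rate = sample(event)
if keep:
    event["sample_rate"] = rate  # downstream analysis multiplies counts by this
    # send(event)
```

Because each stored event carries its sample rate, the backend can multiply counts back up and still report accurate totals for successful requests while retaining every error in full detail.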
Are Metric-based monitoring tools going to be obsolete?
Metric-based tools are still a good fit for aggregate, fleet-level questions such as:
- How many application instances are live at the moment?
- How much memory are the applications consuming?
- What is the CPU usage?
Observability tools, on the other hand, are built to answer high-cardinality questions such as:
- Why does this user see a response time of 4 seconds?
- Why do the database queries hitting the database instances in region X take more than 5 seconds to complete?