5 Basic Strategies for Observability of Distributed Systems


In software architecture, observability is the ability to infer the internal state of a system from its external outputs. Most complex, highly distributed application systems emit measurable signals that reflect the internal back-end state of the system, as well as the impact of external inputs on that state. With the right combination of monitoring, logging, documentation, and visualization tools, software teams can assemble a tight distributed systems observability strategy.

However, teams should apply these tools in a way that provides adequate transparency, but does not waste resources, decrease application performance, or impede development operations. This requires following some basic observability guidelines and practices. Let’s review some of the important metrics to monitor, ways to maintain effective event logs, practical approaches to observability tools, effective visualization strategies, and finally some of the pitfalls to avoid.

Focus on the right metrics

A well-designed observability approach predicts when a potential error or failure will occur and then identifies the root cause of those problems, rather than reacting to problem situations as they arise. In addition to other monitoring and testing tools, a variety of data collection and analysis mechanisms play an important role in the quest for transparency.

To begin with, a distributed systems observability plan should focus on a set of metrics called the four golden signals: latency, traffic, errors, and saturation. Point-in-time metrics, such as those gathered by an external data store that continually collects state data, help track the internal state of the system over time. This high-level status data may not be particularly granular, but it does provide a picture of when and how often a certain error occurs. Combining this information with other data, such as event logs, makes it easier to identify the underlying cause of a problem.
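As a rough illustration of the four golden signals, the sketch below summarizes one time window of requests. The `Request` record, the `golden_signals` function, and the `in_flight` and `capacity` parameters are hypothetical names chosen for this example, and mean latency is used only to keep it short.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    duration_ms: float
    ok: bool


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one time window as the four golden signals."""
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        # Latency: how long requests took (mean, for brevity).
        "latency_ms": mean(r.duration_ms for r in requests) if total else 0.0,
        # Traffic: demand on the system, in requests per second.
        "traffic_rps": total / window_s,
        # Errors: share of requests that failed.
        "error_rate": errors / total if total else 0.0,
        # Saturation: how full the system is (assumed: in-flight work / capacity).
        "saturation": in_flight / capacity,
    }


window = [
    Request(120.0, True),
    Request(80.0, True),
    Request(400.0, False),
    Request(200.0, True),
]
signals = golden_signals(window, window_s=2.0, in_flight=30, capacity=40)
print(signals)
```

In practice a metrics system would compute these values continuously; the point here is only that each signal reduces to a small, cheap aggregate.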

Stay on top of event logs

Event logs are a rich source of distributed system observability data for architecture and development teams. Dedicated tools capture and record these occurrences: log management platforms such as Splunk index the events themselves, while Prometheus, a metrics-oriented monitoring system, can track how often they occur. Such occurrences include the successful completion of an application process, a major system failure, periods of unexpected downtime, or traffic surges that cause overload.

Event logs combine timestamps and sequential records to provide a detailed breakdown of what happened, so teams can quickly identify when an incident occurred and the sequence of events leading up to it. This is especially important for debugging and error handling, as it gives developers the forensic information they need to identify faulty components or problematic component interactions.
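To make the idea of timestamped, sequential records concrete, here is a minimal sketch that builds structured events and then retrieves the ones leading up to an incident. The event names, the `order_id` field, and the helper functions are all hypothetical; real logging tools define their own schemas.

```python
import json
from datetime import datetime, timedelta, timezone
from itertools import count

_sequence = count(1)  # monotonically increasing sequence number


def make_event(name, level, timestamp, **context):
    """Build one structured event with a timestamp and a sequence number."""
    return {
        "seq": next(_sequence),
        "ts": timestamp.isoformat(),
        "level": level,
        "event": name,
        **context,
    }


def events_leading_to(events, incident_seq, lookback):
    """Return up to `lookback` events that precede the incident, oldest first."""
    ordered = sorted(events, key=lambda e: e["seq"])
    prior = [e for e in ordered if e["seq"] < incident_seq]
    return prior[-lookback:] if lookback > 0 else []


start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
log = [
    make_event("checkout.started", "INFO", start, order_id="A1"),
    make_event("payment.retry", "WARNING", start + timedelta(seconds=2), order_id="A1"),
    make_event("payment.timeout", "ERROR", start + timedelta(seconds=5), order_id="A1"),
]

incident = log[-1]
trail = events_leading_to(log, incident["seq"], lookback=2)
for event in trail:
    print(json.dumps(event))
```

Emitting events as one JSON object per line is a common choice because it keeps the context fields searchable, though the format here is an assumption rather than anything the tools above require.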

Provide toggle switches for tools

Extensive event logging can dramatically increase a system’s data throughput and processing requirements, and add troublesome levels of cardinality. For this reason, logging tools can quickly degrade application performance and exhaust available resources. They can also become unsustainable as system scaling requirements increase over time, which is often the case in complex cloud-based distributed systems.

To strike a balance, development teams should install tool-based mechanisms that start, stop, or adjust logging operations without the need to completely restart an application or update large sections of code. For example, resource-intensive debugging tools should only activate when single-system error rates exceed a predetermined limit, rather than allowing them to continually consume application resources.
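One way such a switch might look, using Python's standard `logging` module: the sketch below raises a logger to DEBUG only while the recent error rate is above a limit, with no restart involved. The `DebugToggle` class, the `orders` logger name, and the 5% threshold are assumptions made for illustration.

```python
import logging


class DebugToggle:
    """Switch a logger between INFO and DEBUG based on a recent error rate."""

    def __init__(self, logger, threshold):
        self.logger = logger
        self.threshold = threshold
        self.logger.setLevel(logging.INFO)  # cheap default

    def update(self, errors, total):
        """Re-evaluate the level from the latest error counts."""
        rate = errors / total if total else 0.0
        level = logging.DEBUG if rate > self.threshold else logging.INFO
        if self.logger.level != level:
            # setLevel takes effect immediately for the running process.
            self.logger.setLevel(level)
        return level


logger = logging.getLogger("orders")  # hypothetical service logger
toggle = DebugToggle(logger, threshold=0.05)

quiet = toggle.update(errors=1, total=100)    # 1% error rate
noisy = toggle.update(errors=10, total=100)   # 10% error rate
print(logging.getLevelName(quiet), logging.getLevelName(noisy))
```

The same pattern could be driven by a feature flag or an admin endpoint instead of an error rate; what matters is that the expensive output is off by default.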

Trace requests diligently

Request tracing is a process that tracks the individual calls made to and from a system, as well as the execution time of each call from start to finish. On its own, trace data cannot explain why a particular request failed. However, it does provide valuable insight into exactly where the issue occurred in an application’s workflow and where teams should focus their attention.
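The following sketch shows the basic shape of that data: each call becomes a timed span that shares a trace ID with its parent. It uses an in-memory list and a hand-rolled `span` context manager, both hypothetical; production systems typically rely on a tracing library rather than code like this.

```python
import time
import uuid
from contextlib import contextmanager

spans = []  # in-memory collector; a real system would export these


@contextmanager
def span(name, trace_id, parent_id=None):
    """Time one call and record it as part of a trace."""
    record = {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:8],
        "parent_id": parent_id,
        "name": name,
    }
    start = time.perf_counter()
    try:
        yield record
    finally:
        # Recorded even if the wrapped call raises.
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        spans.append(record)


trace_id = uuid.uuid4().hex
with span("GET /checkout", trace_id) as root:
    with span("inventory.lookup", trace_id, parent_id=root["span_id"]):
        time.sleep(0.01)  # stand-in for a downstream call
    with span("payment.charge", trace_id, parent_id=root["span_id"]):
        time.sleep(0.02)  # stand-in for a slower downstream call

for s in spans:
    print(f'{s["name"]}: {s["duration_ms"]:.1f} ms (parent={s["parent_id"]})')
```

Comparing the child durations against the root shows where the time went in the workflow, even though nothing here says why a call was slow.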

Like event logs, request tracing creates high levels of data throughput and cardinality that make it expensive to store. Again, it’s important that teams activate resource-intensive request tracers only when unusual activity or errors occur. In some cases, teams can use request tracing to pull individual samples of transaction histories on a regular, sequential schedule, creating a cost-effective and resource-efficient way to continuously monitor a distributed system.
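A minimal sketch of that sampling idea, assuming a simple policy of keeping every Nth trace plus any trace that ended in an error. The `TraceSampler` class and the 1-in-100 rate are illustrative choices, not a recommendation.

```python
class TraceSampler:
    """Keep every Nth trace, and always keep traces that ended in an error."""

    def __init__(self, every_nth):
        self.every_nth = every_nth
        self._seen = 0

    def should_keep(self, is_error):
        self._seen += 1
        return is_error or self._seen % self.every_nth == 0


sampler = TraceSampler(every_nth=100)

# Simulate 1,000 requests where only request 42 fails.
outcomes = [i == 42 for i in range(1, 1001)]
kept = [
    i for i, is_error in enumerate(outcomes, start=1)
    if sampler.should_keep(is_error)
]
print(len(kept), "of", len(outcomes), "traces stored")
```

Storage cost then scales with the sampling rate rather than with total traffic, while the failing request is still retained for debugging.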

Create accessible data visualizations

Once a team manages to aggregate the observability data, the next step is to condense the information into a readable and shareable format. Often this is done by creating visual representations of this data using tools like Kibana or Grafana. From there, team members can share this information with each other or distribute it to other teams who are also working on the application.

Such data visualization can tax a system with millions of downstream queries, but don’t worry too much about median response times. Instead, most teams will be better served by focusing on 95th to 99th percentile response times and matching those numbers against the SLA requirements. It’s entirely possible that these numbers meet the SLA requirements, even if they are buried under piles of median response time data.
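To show how the median can hide what the tail reveals, the sketch below computes nearest-rank percentiles over a made-up set of latencies and compares them with a hypothetical 300 ms SLA. The sample data and the SLA value are invented for the example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# 100 made-up response times: most are fast, a few are very slow.
latencies_ms = [100] * 90 + [250] * 8 + [900] * 2
sla_ms = 300  # hypothetical SLA target

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

for label, value in (("p50", p50), ("p95", p95), ("p99", p99)):
    status = "within SLA" if value <= sla_ms else "breaches SLA"
    print(f"{label}: {value} ms ({status})")
```

With this data the median and the 95th percentile sit comfortably inside the target while the 99th does not, which is the kind of distinction a dashboard built only on medians would miss.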

Some common pitfalls

While observability can bring transparency to a system, a poorly managed approach can lead to two especially detrimental effects, one related to alerts and the other to data volumes.

The first of these effects is that observability tools for distributed systems often generate large amounts of statistical noise. Teams can feel overwhelmed with constant alerts that may or may not need their attention, and those alerts become useless if developers increasingly ignore them. As a result, critical events go undetected until a full disaster occurs.
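One common way to cut this kind of noise, offered here as an assumption rather than something the tools above do by default, is to suppress repeats of the same alert within a cooldown window. The `AlertThrottle` class, the alert keys, and the 300-second cooldown are hypothetical.

```python
class AlertThrottle:
    """Suppress repeats of the same alert within a cooldown window."""

    def __init__(self, cooldown_s):
        self.cooldown_s = cooldown_s
        self._last_sent = {}  # alert key -> time the last notification went out

    def should_notify(self, key, now_s):
        last = self._last_sent.get(key)
        if last is not None and now_s - last < self.cooldown_s:
            return False  # still inside the cooldown: stay quiet
        self._last_sent[key] = now_s
        return True


throttle = AlertThrottle(cooldown_s=300)

# (alert key, seconds since start) for a burst of incoming alerts.
incoming = [("disk_full", 0), ("disk_full", 60), ("db_down", 90), ("disk_full", 400)]
sent = [(key, t) for key, t in incoming if throttle.should_notify(key, t)]
print(sent)
```

Distinct problems still get through immediately, and a recurring one resurfaces once the cooldown expires, so a persistent failure is not silenced for good.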

The second effect is that logging and tracing efforts can be very time-consuming if the logs lack sufficient granularity or do not provide the situational context of an event. IT teams may be able to identify the beginning of a failure, but it can still be difficult and time-consuming to sort through the large amount of contextual data needed to find the root cause of the problem. Again, the solution is to give developers the ability to adjust the amount of data that individual logging tools return, or to disable those tools if necessary.

