Observability

Observability is the practice of instrumenting systems and code to gain insight into their operation through metrics, tracing, and logging. These three pillars serve different purposes, but used together they provide comprehensive visibility into system behavior, enabling you to understand, debug, and improve your systems. Both the OpenTelemetry documentation and the SRE book, Site Reliability Engineering, cover the topic well.

Metrics

Metrics provide an overall picture of system performance and behavior over time. They let you track, monitor, and alert on the health of your system, build dashboards to visualize the data, and debug performance issues or abnormal behavior.

They aggregate data points to show trends and patterns, such as:

HTTP requests per second per endpoint
Number of products created by sellers from a controller
Messages processed per second by a worker
Database updates created by a worker

Metrics excel at this high-level view, but they carry limited context. They are stored in time-series databases and can be labeled with key-value pairs. Each label adds a dimension, and each unique combination of label values becomes a separate time series stored on disk.

Important: Metrics become expensive with high-cardinality labels, that is, labels that can take many distinct values (user IDs or request IDs, for example). To keep your metrics system performant and queries efficient, constrain labels to a known, bounded set of values, and use only the labels you actually need.
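To make the cost of labels concrete, here is a minimal sketch of a labeled counter using only the Python standard library. It is a toy, not the API of any real metrics client: the LabeledCounter class and the metric and label names are hypothetical and chosen for illustration. It shows that every distinct combination of label values becomes its own series.

```python
from collections import Counter


class LabeledCounter:
    """Toy in-memory counter: one series per unique combination of label values."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self.series = Counter()  # maps a tuple of label values to a running total

    def inc(self, amount=1, **labels):
        # Reject unknown or missing labels so the set of dimensions stays fixed.
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}, got {tuple(labels)}")
        key = tuple(labels[name] for name in self.label_names)
        self.series[key] += amount


# Bounded labels: the number of series is capped by the allowed values.
requests = LabeledCounter("http_requests_total", ["endpoint", "status"])
observed = [("/products", "200"), ("/products", "200"), ("/products", "500"), ("/orders", "200")]
for endpoint, status in observed:
    requests.inc(endpoint=endpoint, status=status)

# Unbounded label: every user ID adds a new series.
per_user = LabeledCounter("http_requests_by_user_total", ["endpoint", "user_id"])
for user_id in range(1000):
    per_user.inc(endpoint="/products", user_id=str(user_id))

print(len(requests.series))  # 3 series for 4 requests
print(len(per_user.series))  # 1000 series, one per user
```

The first counter stays small however much traffic it sees, while the second grows with the number of users. Identifiers like that usually fit better in span attributes or log fields.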

A good primer on metrics is this blog post from engineers at Spotify.

Tracing

Traces measure the duration and flow of operations through your system. A trace represents a complete request or operation, while spans represent individual operations within that trace. Spans are organized hierarchically, creating a tree structure that shows the sequence and timing of events.

For example, an HTTP request would create a trace, with spans added for:

Database calls
Function calls that perform heavy computation
Cache operations
External API calls

This hierarchical structure lets you build flame graphs and see both the sequence of events and the time each operation took. Spans can include attributes (key-value pairs) that provide additional context, such as:

SQL queries executed
User IDs or seller IDs
Request parameters
Error messages

By propagating trace context through your system, you can correlate logs with specific traces and spans for deeper investigation.

The SRE book also covers traces in chapter 6.

Logging

Structured logs provide detailed information for debugging and analysis. They are relatively inexpensive to store and process compared to metrics and traces. To maximize their value, include trace IDs and span IDs in your log entries. This correlation allows you to:

Filter logs by specific traces or spans
Understand the full context of an operation
Debug issues by following a request through your entire system
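As a minimal sketch of this correlation, the example below uses Python's standard logging module to emit JSON log lines that carry a trace ID and span ID, then filters them by trace. The JsonFormatter class, the field names, the logger name, and the ID values are assumptions made for illustration. In a real service the IDs would come from the active trace context rather than being hard-coded.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including trace correlation fields."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(entry)


stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output confined to `stream`

logger.info("payment authorized", extra={"trace_id": "trace-a", "span_id": "span-1"})
logger.info("order stored", extra={"trace_id": "trace-a", "span_id": "span-2"})
logger.info("cart viewed", extra={"trace_id": "trace-b", "span_id": "span-3"})

# Filter the structured logs down to a single trace.
entries = [json.loads(line) for line in stream.getvalue().splitlines()]
trace_a = [entry["message"] for entry in entries if entry["trace_id"] == "trace-a"]
print(trace_a)  # ['payment authorized', 'order stored']
```

Because each line is a self-describing JSON object, a log backend can index trace_id and span_id and return every entry for one request in a single query.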

When combined with metrics and traces, structured logging completes the observability picture, giving you the granular detail needed to diagnose and resolve issues efficiently.

The SRE book also covers logging in chapter 6.
