Observability
Observability is the practice of instrumenting systems and code to gain insight into their operation through metrics, tracing, and logging. These three pillars serve different purposes, but used together they provide comprehensive visibility into system behavior, so you can understand, debug, and improve your systems. Both the OpenTelemetry documentation and the SRE book (Site Reliability Engineering) cover this well.
Metrics
Metrics provide an overall picture of system performance and behavior over time. They let you track, monitor, and alert on the health of your system, build dashboards to visualize that data, and debug performance issues or abnormal behavior.
They aggregate data points to show trends and patterns, such as:
- HTTP requests per second per endpoint
- Number of products created by sellers from a controller
- Messages processed per second by a worker
- Database updates created by a worker
Metrics excel at showing aggregate health and performance, but they carry limited context about any individual event. They are stored in time-series databases and can be labeled with key-value pairs. Each label adds a dimension, and each unique combination of label values becomes a separate time series that must be stored and indexed.
Important: Metrics become expensive with high-cardinality labels, that is, labels that can take many distinct values (for example user IDs or request IDs), because every new value creates another time series. To keep your metrics system performant and queries efficient, constrain labels to a known, bounded set of values and keep only the labels you need.
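To make the cardinality point concrete, here is a minimal sketch of a labeled counter that keeps one running count per unique combination of label values. `LabeledCounter`, the metric name, and the label names are hypothetical and exist only for illustration. This is not the API of any real metrics library.

```python
from collections import defaultdict


class LabeledCounter:
    """Toy counter: one time series per unique combination of label values."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self.series = defaultdict(int)  # label-value tuple -> running count

    def inc(self, **labels):
        # Reject unknown or missing labels so the label set stays constrained.
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}, got {tuple(labels)}")
        key = tuple(labels[name] for name in self.label_names)
        self.series[key] += 1

    def series_count(self):
        return len(self.series)


# Bounded labels: method and endpoint come from small, known sets.
bounded = LabeledCounter("http_requests_total", ["method", "endpoint"])
for method in ("GET", "POST"):
    for endpoint in ("/products", "/orders", "/health"):
        bounded.inc(method=method, endpoint=endpoint)

# Unbounded label: a user ID adds one series per user.
unbounded = LabeledCounter("http_requests_total", ["method", "user_id"])
for user_id in range(1000):
    unbounded.inc(method="GET", user_id=str(user_id))

print(bounded.series_count())    # 6 series (2 methods x 3 endpoints)
print(unbounded.series_count())  # 1000 series (1 method x 1000 users)
```

The bounded counter stays at six series no matter how much traffic arrives, while the `user_id` variant grows with the user base. Identifiers like these are usually better placed in span attributes or log fields.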
A good primer on metrics is this blog post from engineers at Spotify.
Tracing
Traces measure the duration and flow of operations through your system. A trace represents a complete request or operation, while spans represent individual operations within that trace. Spans are organized hierarchically, creating a tree structure that shows the sequence and timing of events.
For example, an HTTP request would create a trace, with spans added for:
- Database calls
- Function calls that perform heavy computation
- Cache operations
- External API calls
This hierarchical structure lets you build flame graphs and see both the order of events and how long each operation took. Spans can include attributes (key-value pairs) that provide additional context, such as:
- SQL queries executed
- User IDs or seller IDs
- Request parameters
- Error messages
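The trace and span model described above can be sketched with the standard library alone. `Tracer`, `Span`, and `render` are hypothetical names invented for this illustration, and the sketch assumes single-threaded use. A real system would more likely use an instrumentation library such as OpenTelemetry.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: dict = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0
    children: list = field(default_factory=list)

    @property
    def duration_ms(self):
        return (self.end - self.start) * 1000


class Tracer:
    def __init__(self):
        self._stack = []  # currently open spans, innermost last
        self.roots = []   # one root span per trace

    @contextmanager
    def span(self, name, **attributes):
        parent = self._stack[-1] if self._stack else None
        span = Span(
            name=name,
            # Child spans inherit the trace ID; a root span starts a new trace.
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            attributes=dict(attributes),
        )
        (parent.children if parent else self.roots).append(span)
        self._stack.append(span)
        span.start = time.perf_counter()
        try:
            yield span
        except Exception as exc:
            # Record the error message as a span attribute, then re-raise.
            span.attributes["error.message"] = str(exc)
            raise
        finally:
            span.end = time.perf_counter()
            self._stack.pop()


def render(span, depth=0):
    """Return the span tree as indented lines, one line per span."""
    lines = [f"{'  ' * depth}{span.name} {span.duration_ms:.1f}ms {span.attributes}"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = Tracer()
with tracer.span("GET /products", seller_id="s-42") as root:
    with tracer.span("cache.get", key="products:s-42"):
        time.sleep(0.001)  # stands in for a cache lookup
    with tracer.span("db.query", statement="SELECT * FROM products WHERE seller_id = ?"):
        time.sleep(0.005)  # stands in for a database call

print("\n".join(render(root)))
```

The indented output is a text form of the tree a flame graph would draw: the root span covers the whole request, and each child shows where the time went.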
By propagating trace context through your system, you can correlate logs with specific traces and spans for deeper investigation.
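As an illustration of propagation, the sketch below passes trace context between services in an HTTP header. It assumes the W3C Trace Context `traceparent` format (version, trace ID, parent span ID, flags), which the text above does not name. The function names `inject` and `extract` are hypothetical, and the parser accepts only version `00`.

```python
import re
import secrets

# Format: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_id():
    return secrets.token_hex(16)  # 32 hex characters


def new_span_id():
    return secrets.token_hex(8)  # 16 hex characters


def inject(headers, trace_id, span_id, sampled=True):
    """Caller side: write the current trace context into outgoing headers."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
    return headers


def extract(headers):
    """Callee side: return (trace_id, parent_span_id, sampled), or None if absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C format.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


# Service A starts a trace and calls service B.
trace_id, caller_span = new_trace_id(), new_span_id()
outgoing = inject({}, trace_id, caller_span)

# Service B continues the same trace with a new span of its own.
incoming_trace_id, parent_span, sampled = extract(outgoing)
callee_span = new_span_id()
print(incoming_trace_id == trace_id, parent_span == caller_span)  # True True
```

Because both services share the same trace ID, any log line or span that records it can be joined back to the original request.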
The SRE book also covers traces in chapter 6.
Logging
Structured logs provide detailed information for debugging and analysis. They are relatively inexpensive to store and process compared to metrics and traces. To maximize their value, include trace IDs and span IDs in your log entries. This correlation allows you to:
- Filter logs by specific traces or spans
- Understand the full context of an operation
- Debug issues by following a request through your entire system
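A minimal sketch of this correlation, using only Python's standard `logging` and `json` modules. The logger name, field names, and ID values are made up for illustration. In practice the IDs would come from the active span rather than hard-coded dictionaries.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # Correlation fields; None when the caller did not supply them.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        })


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Hypothetical contexts for two concurrent requests.
ctx_a = {"trace_id": "a" * 32, "span_id": "1" * 16}
ctx_b = {"trace_id": "b" * 32, "span_id": "2" * 16}

logger.info("loading cart", extra=ctx_a)
logger.info("loading cart", extra=ctx_b)
logger.error("payment declined", extra=ctx_a)

# Filter logs by trace: keep only the entries that belong to trace A.
entries = [json.loads(line) for line in stream.getvalue().splitlines()]
trace_a = [e["message"] for e in entries if e["trace_id"] == ctx_a["trace_id"]]
print(trace_a)  # ['loading cart', 'payment declined']
```

Emitting one JSON object per line keeps the logs machine-parseable, so a log backend can apply the same trace ID filter at query time.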
When combined with metrics and traces, structured logging completes the observability picture, giving you the granular detail needed to diagnose and resolve issues efficiently.
The SRE book also covers logging in chapter 6.