Technical Glossary

Observability

Definition: The practice of understanding a system’s behavior through its external outputs (logs, metrics, and traces), which enables investigation of unknown problems that traditional monitoring does not cover.

— Source: NERVICO, Product Development Consultancy

What is Observability

Observability is the ability to understand a system’s internal state by examining its external outputs. It rests on three fundamental pillars: logs (event records), metrics (numerical measurements over time), and traces (tracking a request’s journey across multiple services). Unlike traditional monitoring, which answers predefined questions, observability enables investigating problems that were not anticipated.

The concept originates from control theory and has become an essential practice for operating modern distributed systems.

How it works

Logs capture discrete events with detailed context: what happened, when, and under what circumstances. Metrics aggregate numerical data into time series: p99 latency, error rates, CPU usage. Traces connect the operations of an individual request as it traverses multiple services, revealing where latency occurs or where failures happen.
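As a rough illustration of the three signal types, here is a minimal Python sketch using only the standard library. The names (Span, log_event, p99, slowest_span) and the field layout are assumptions made for this example, not the API of any particular observability tool.

```python
import json
import time
from dataclasses import dataclass


@dataclass
class Span:
    """Trace building block: one timed operation within a request."""
    trace_id: str      # shared by every span of the same request
    service: str
    operation: str
    duration_ms: float


def log_event(trace_id, service, message, **context):
    """Log: one discrete event with its context, as a JSON line."""
    record = {"ts": time.time(), "trace_id": trace_id,
              "service": service, "message": message, **context}
    return json.dumps(record)


def p99(samples):
    """Metric: 99th-percentile latency (nearest-rank method)."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def slowest_span(spans):
    """Trace: the span where most of a request's time was spent."""
    return max(spans, key=lambda s: s.duration_ms)
```

Carrying the same trace_id in both the log record and the spans is what later makes it possible to move between the signals.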

Tools like Datadog or Grafana with Loki and Tempo collect, correlate, and visualize these three data types, often fed by OpenTelemetry, which standardizes how telemetry is generated and exported rather than storing or visualizing it. The key is correlation: being able to go from an anomalous metric to the relevant traces, and from there to the specific error logs.
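The correlation step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the dictionary shapes (minute, trace_id, duration_ms, level) and the function name correlate are hypothetical, and real tools perform this join inside their own query engines.

```python
def correlate(latency_by_minute, traces, logs, threshold_ms):
    """Walk from an anomalous metric to slow traces to their error logs."""
    # 1. Metric: minutes whose latency exceeds the threshold.
    bad_minutes = {minute for minute, ms in latency_by_minute.items()
                   if ms > threshold_ms}

    # 2. Traces: slow requests that started during those minutes.
    slow_trace_ids = {t["trace_id"] for t in traces
                      if t["minute"] in bad_minutes
                      and t["duration_ms"] > threshold_ms}

    # 3. Logs: error records that share a trace_id with a slow trace.
    return [entry for entry in logs
            if entry["trace_id"] in slow_trace_ids
            and entry["level"] == "error"]
```

The shared trace_id is the design choice that matters here: without a common identifier on traces and logs, step 3 would fall back to matching by timestamp, which is much less precise.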

Why it matters

In distributed systems with dozens of services, identifying the root cause of a problem without observability is practically impossible. A user reports slowness, but the problem could be in any one of, say, 15 services involved in the request. Without distributed traces, diagnosis can take hours. With proper observability, it can take minutes.

Practical example

A SaaS platform detects that the search endpoint latency has increased from 200 ms to 2 seconds. The team checks metrics and confirms the increase started at 14:00. They review traces from slow requests and discover that 90% of the time is consumed by a call to the cache service. That service’s logs reveal that an automatic Redis update changed the eviction policy, flushing the cache. The full diagnosis takes 10 minutes, thanks to correlation across the three pillars.
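The trace analysis step in this example amounts to simple arithmetic: each service's share of the request's total duration. A minimal Python sketch follows. The span durations and the service names "index" and "ranking" are invented for illustration, and the calculation assumes the child spans run one after another without overlapping.

```python
from collections import defaultdict


def time_share_by_service(spans, total_ms):
    """Return each service's fraction of the request's total duration."""
    per_service = defaultdict(float)
    for service, duration_ms in spans:
        per_service[service] += duration_ms
    return {service: ms / total_ms for service, ms in per_service.items()}


# Illustrative child spans of one slow search request (2000 ms in total).
spans = [("cache", 1800.0), ("index", 150.0), ("ranking", 50.0)]
shares = time_share_by_service(spans, total_ms=2000.0)
worst = max(shares, key=shares.get)  # service with the largest share
```

With these made-up numbers the cache call accounts for 1800 of 2000 ms, which matches the 90% figure in the scenario and points the team at that service's logs.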
