Definition: Observability technique that tracks the complete journey of a request across multiple services in distributed architectures to diagnose performance issues.
— Source: NERVICO, Product Development Consultancy
What Is Distributed Tracing
Distributed tracing is an observability technique that follows the complete journey of a request as it traverses multiple services in a distributed architecture. Each service that processes the request generates a span containing timing, context, and result information. Together, these spans form a trace that visualizes the end-to-end flow and reveals where latency or errors occur.
How It Works
When a request enters the system, it is assigned a unique trace identifier (trace ID), which is propagated to every service that takes part in processing it, via HTTP headers or message metadata. Each service creates a span that records its start time, duration, result, and its parent span, so the hierarchy of calls can be rebuilt later. Tools such as AWS X-Ray, Jaeger, or Zipkin collect these spans, reconstruct the complete trace, and visualize it as a waterfall diagram showing the temporal sequence and the dependencies between services.
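The propagation mechanism described above can be sketched in a few lines of Python. This is a minimal, in-process illustration and not how any particular tracing tool is implemented: the header names `x-trace-id` and `x-parent-span-id`, the `Span` class, the `handle` function, and the `COLLECTED` list are all hypothetical names chosen for this example. Production systems typically rely on a standard header format such as W3C Trace Context (`traceparent`).

```python
import time
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence

# Hypothetical header names for this sketch only.
TRACE_HEADER = "x-trace-id"
PARENT_HEADER = "x-parent-span-id"


@dataclass
class Span:
    trace_id: str             # shared by every span in the same trace
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # None for the root span
    service: str
    start: float
    duration: float = 0.0
    status: str = "ok"


COLLECTED: List[Span] = []  # stands in for a tracing backend


def handle(service: str, headers: Dict[str, str],
           downstream: Sequence[str] = ()) -> None:
    """Record a span for `service` and call each downstream service."""
    # Reuse the incoming trace ID, or start a new trace at the entry point.
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    span = Span(trace_id, uuid.uuid4().hex[:16],
                headers.get(PARENT_HEADER), service, time.perf_counter())
    # Outgoing calls carry the same trace ID and name this span as parent.
    out_headers = {TRACE_HEADER: trace_id, PARENT_HEADER: span.span_id}
    try:
        for child in downstream:
            handle(child, out_headers)  # simulated remote call
    except Exception:
        span.status = "error"
        raise
    finally:
        span.duration = time.perf_counter() - span.start
        COLLECTED.append(span)


# A request enters at the gateway, which calls two other services.
handle("gateway", {}, downstream=["orders", "payments"])
for s in COLLECTED:
    print(s.service, s.trace_id, s.parent_id)
```

All three spans share one trace ID, and the `orders` and `payments` spans point to the `gateway` span as their parent, which is the information a backend needs to rebuild the waterfall view.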
Key Use Cases
- Diagnosing high latency on specific endpoints by identifying which intermediate service is the bottleneck
- Analyzing errors in microservice call chains to locate the service originating the failure
- Optimizing performance by identifying redundant or unnecessary calls between services
- Validating the actual impact of architecture changes by comparing traces before and after the change
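As an illustration of the first use case, the sketch below estimates which service is the bottleneck in a single trace by computing each span's self time, meaning its duration minus the time spent in its direct children. The span data is invented for the example, and the calculation assumes that child calls run sequentially; with parallel calls, the subtraction would overstate the time spent in children.

```python
from collections import defaultdict

# Hypothetical spans from one trace (durations in milliseconds).
spans = [
    {"id": "a", "parent": None, "service": "gateway", "duration_ms": 480},
    {"id": "b", "parent": "a", "service": "orders", "duration_ms": 450},
    {"id": "c", "parent": "b", "service": "inventory", "duration_ms": 60},
    {"id": "d", "parent": "b", "service": "payments", "duration_ms": 350},
]


def self_time_by_service(spans):
    """Return time spent in each service, excluding its direct child calls."""
    child_total = defaultdict(int)
    for s in spans:
        if s["parent"] is not None:
            child_total[s["parent"]] += s["duration_ms"]

    totals = defaultdict(int)
    for s in spans:
        totals[s["service"]] += s["duration_ms"] - child_total[s["id"]]
    return dict(totals)


times = self_time_by_service(spans)
bottleneck = max(times, key=times.get)
print(times)       # {'gateway': 30, 'orders': 40, 'inventory': 60, 'payments': 350}
print(bottleneck)  # payments
```

Although the `gateway` and `orders` spans have the longest total durations, nearly all of that time is spent waiting on `payments`, which is what self time makes visible.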
Advantages and Considerations
Distributed tracing provides visibility that individual service logs or metrics cannot offer on their own, because it shows the causal relationships between services. As a rule of thumb, it becomes indispensable in microservices architectures with more than about five services. On the other hand, every service must be instrumented, which takes effort, and the volume of trace data can be significant. It is common practice to apply sampling, storing only a percentage of traces, to reduce costs.
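One common way to apply sampling is to derive the keep-or-drop decision from the trace ID itself, so that every service reaches the same decision for a given trace without coordination. The following is a minimal sketch of that idea, assuming hexadecimal trace IDs whose leading characters are uniformly random; the function name `should_sample` and the 10% rate are illustrative choices, not a specific tool's API.

```python
import uuid


def should_sample(trace_id: str, rate: float) -> bool:
    """Keep a trace if its ID falls in the lowest `rate` fraction of the ID space."""
    # Map the first 32 bits of the hex trace ID to a number in [0, 1).
    bucket = int(trace_id[:8], 16) / 0x1_0000_0000
    return bucket < rate


# Simulate 10,000 traces and keep roughly 10% of them.
trace_ids = [uuid.uuid4().hex for _ in range(10_000)]
kept = [t for t in trace_ids if should_sample(t, 0.10)]
print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID, a sampled trace is stored in full rather than with spans missing from some services. The trade-off is that rare errors or slow requests may fall among the dropped traces.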