What Is Distributed Tracing?
The rise of microservices has enabled developers to build distributed applications that consist of modular services rather than a single functional unit. This modularity makes testing and deployment easier and reduces the risk that one failing component takes down the whole application.
As applications scale and spread their resources across multiple cloud-native services, following a single transaction by hand becomes tedious and nearly impossible. This is the problem that distributed tracing techniques address.
Distributed tracing tracks a single transaction from the front end through the backend services while providing visibility into the systems’ behavior.
How Distributed Tracing Works
Distributed tracing rests on one fundamental idea: following every transaction through the distributed components of the application. To achieve this visibility, the tracing system tags each transaction with a unique identifier, the trace ID. It then assembles each trace from the data reported by the various components that share this identifier, building a timeline of the transaction.
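As a rough illustration of how a shared trace ID lets a backend reassemble a transaction, the sketch below groups span records by trace ID and orders them by start time. The record layout and field names (`trace_id`, `service`, `start_ms`, `end_ms`) are assumptions made for this example and do not come from any particular tool.

```python
from collections import defaultdict

# Hypothetical span records as they might arrive from different services.
# Field names are assumptions for illustration only.
spans = [
    {"trace_id": "t-1", "service": "checkout", "start_ms": 12, "end_ms": 40},
    {"trace_id": "t-2", "service": "frontend", "start_ms": 3, "end_ms": 9},
    {"trace_id": "t-1", "service": "frontend", "start_ms": 0, "end_ms": 55},
    {"trace_id": "t-1", "service": "payments", "start_ms": 20, "end_ms": 35},
]


def build_timelines(records):
    """Group spans by trace ID and order each group by start time."""
    traces = defaultdict(list)
    for record in records:
        traces[record["trace_id"]].append(record)
    return {
        trace_id: sorted(group, key=lambda s: s["start_ms"])
        for trace_id, group in traces.items()
    }


timelines = build_timelines(spans)
for span in timelines["t-1"]:
    print(f'{span["start_ms"]:>3} ms  {span["service"]}')
```

Records can arrive in any order and interleaved with other transactions; the trace ID alone is enough to sort them back into one timeline per transaction.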
Each trace consists of one or more spans, each of which represents a single operation within that trace. A span can be the parent of another span, which indicates that the operation in the parent span triggered the operation in the child span.
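The parent and child relationship can be pictured as a tree. The following is a minimal sketch, assuming each span carries its own ID and an optional parent ID; the `Span` class and the operation names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    name: str
    parent_id: Optional[str] = None  # None marks the root span
    children: list = field(default_factory=list)


def link_spans(spans):
    """Attach each span to its parent and return the root spans."""
    by_id = {s.span_id: s for s in spans}
    roots = []
    for s in spans:
        parent = by_id.get(s.parent_id) if s.parent_id else None
        if parent is None:
            # No parent ID, or the parent was never reported: treat as a root.
            roots.append(s)
        else:
            parent.children.append(s)
    return roots


def render(span, depth=0):
    """Return the span tree as indented lines."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


trace = [
    Span("a", "GET /checkout"),
    Span("b", "charge card", parent_id="a"),
    Span("c", "INSERT order", parent_id="a"),
    Span("d", "call payment gateway", parent_id="b"),
]
roots = link_spans(trace)
print("\n".join(render(roots[0])))
```

Reading the indented output top to bottom shows which operation triggered which, which is the same information a tracing UI draws as nested bars.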
Implementing Distributed Tracing
Setting up distributed tracing depends on the selected solution. However, every solution involves the same three steps, which give developers a solid base to start their distributed tracing journey:
- Setting up a distributed tracing system.
- Instrumenting code for tracing.
- Collecting and storing trace data.
1. Setting Up a Distributed Tracing System
Selecting the right distributed tracing solution is crucial. Evaluate key aspects such as compatibility with your technology stack and the scale of trace data the solution must handle.
Many distributed tracing tools support a range of languages and runtimes, including Node.js, Python, Go, .NET, and Java. This allows developers to use a single solution for distributed tracing across multiple services.
2. Instrumenting Code for Tracing
The method of integration depends on the solution. The most common approach is an SDK that collects the data at runtime.
For example, developers using Helios with Node.js need to install the latest Helios OpenTelemetry SDK by running the following command:
npm install --save helios-opentelemetry-sdk
Afterward, define the following environment variables, which enable the SDK to collect the necessary data from the service:
export NODE_OPTIONS="--require helios-opentelemetry-sdk"
export HS_TOKEN="{{HELIOS_API_TOKEN}}"
export HS_SERVICE_NAME="<Lambda01>"
export HS_ENVIRONMENT="<ServiceEnvironment01>"
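To give a feel for what an instrumentation SDK does at runtime, here is a simplified sketch written with the Python standard library. It is not the Helios or OpenTelemetry API; `start_span`, `finished_spans`, and the span fields are hypothetical names chosen for this illustration.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

# The span currently in progress in this execution context, if any.
_current_span = contextvars.ContextVar("current_span", default=None)
finished_spans = []  # stand-in for the queue an exporter would read from


@contextmanager
def start_span(name):
    """Record a span around a block of code, nesting under any active span."""
    parent = _current_span.get()
    span = {
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
        "start": time.perf_counter(),
    }
    token = _current_span.set(span)
    try:
        yield span
    finally:
        span["duration_s"] = time.perf_counter() - span["start"]
        _current_span.reset(token)  # restore the parent as the active span
        finished_spans.append(span)


def load_cart():
    with start_span("load_cart"):
        time.sleep(0.01)  # simulated work


with start_span("handle_request"):
    load_cart()
```

Real SDKs usually apply this kind of wrapping automatically to HTTP handlers, database clients, and messaging libraries, which is why loading the SDK at startup is often all that is needed.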
3. Collecting and Storing Trace Data
In most distributed tracing systems, trace data collection occurs automatically at runtime. The data is then sent to the distributed tracing solution, where analysis and visualization occur.
The collection and storage of trace data depend on the solution in use. If the solution is SaaS-based, the provider takes care of collecting and storing all trace data. If it is self-hosted, that responsibility falls on the administrators of the solution.
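For a self-hosted setup, the collection side often comes down to buffering spans and writing them out in batches. The sketch below assumes a local JSON Lines file as the store; the `BatchingCollector` class and its batch size are hypothetical and stand in for whatever collector and storage backend a real solution provides.

```python
import json
import tempfile
from pathlib import Path


class BatchingCollector:
    """Buffer finished spans and flush them in batches to a JSON Lines file."""

    def __init__(self, path, batch_size=3):
        self.path = Path(path)
        self.batch_size = batch_size
        self.buffer = []

    def record(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        with self.path.open("a", encoding="utf-8") as fh:
            for span in self.buffer:
                fh.write(json.dumps(span) + "\n")
        self.buffer.clear()


store = Path(tempfile.mkdtemp()) / "spans.jsonl"
collector = BatchingCollector(store, batch_size=3)
for i in range(4):
    collector.record({"trace_id": "t-1", "span_id": f"s-{i}"})

# Only the first full batch has been written at this point.
written_before_flush = len(store.read_text(encoding="utf-8").splitlines())

collector.flush()  # push the remaining partial batch, for example at shutdown
stored = [json.loads(line) for line in store.read_text(encoding="utf-8").splitlines()]
```

Batching keeps the per-request cost low, at the price of losing buffered spans if the process dies before a flush.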
Analyzing Trace Data
Analyzing trace data can be tedious. However, visualizing the trace data makes it easier for developers to understand the actual transaction flow and identify anomalies or bottlenecks.
A trace visualization shows the flow of the transaction through the various services and components of the application. An advanced distributed tracing system may also highlight the errors and bottlenecks that each transaction runs into.
Since the trace data contains the time it takes for each service to process the transaction, developers can analyze the latencies and identify abnormalities that may impact the application’s performance.
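A small sketch of this kind of latency analysis follows. It assumes each span records start and end times in milliseconds and that each service has a latency budget; the sample spans and the `budgets_ms` values are made up for illustration.

```python
# Hypothetical spans from one trace; times are milliseconds since trace start.
spans = [
    {"service": "frontend", "start_ms": 0, "end_ms": 480},
    {"service": "auth", "start_ms": 5, "end_ms": 25},
    {"service": "inventory", "start_ms": 30, "end_ms": 450},
    {"service": "payments", "start_ms": 455, "end_ms": 475},
]

# Assumed per-service latency budgets; real values would come from your SLOs.
budgets_ms = {"frontend": 500, "auth": 50, "inventory": 100, "payments": 100}


def over_budget(records, budgets):
    """Return (service, duration, budget) for spans that exceed their budget."""
    findings = []
    for span in records:
        duration = span["end_ms"] - span["start_ms"]
        budget = budgets.get(span["service"])
        if budget is not None and duration > budget:
            findings.append((span["service"], duration, budget))
    # Worst offenders first, ranked by how far they exceed the budget.
    return sorted(findings, key=lambda f: f[1] - f[2], reverse=True)


slow = over_budget(spans, budgets_ms)
for service, duration, budget in slow:
    print(f"{service}: {duration} ms (budget {budget} ms)")
```

Here the frontend span looks slow in isolation, but the per-service view shows that almost all of its time is spent waiting on the inventory service.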
Identifying an issue with the distributed tracing solution provides insight into what went wrong. However, to find the cause, developers may need additional observability tools or the ability to correlate traces with logs.
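One common way to correlate traces with logs is to stamp every log line with the active trace ID. The sketch below does this with Python's standard `logging` module; the `current_trace_id` variable, the logger name, and the sample trace ID are assumptions for this example.

```python
import contextvars
import io
import logging

# Trace ID of the request being handled; "-" when there is none.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


stream = io.StringIO()  # captured in memory so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.error("payment declined")

log_output = stream.getvalue()
print(log_output, end="")
```

With the trace ID in every log line, a developer can jump from a failing span to the exact log entries written while that transaction was being processed.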
Distributed tracing solutions, such as Helios, offer insight into the error’s details, which eases the developer’s burden.
Best Practices for Distributed Tracing
A comprehensive distributed tracing solution empowers developers to respond to crucial issues swiftly. The following best practices set the fundamentals for a successful distributed tracing solution.
1. Ensuring Trace Data Accuracy and Completeness
Collecting trace data from services enables developers to identify the performance and latency of all the services each transaction flows through. However, when the trace data lacks information from a specific service, the entire trace becomes less accurate and less complete.
To get the most out of distributed tracing, the system must collect accurate trace information from all services so that each trace reflects what actually happened.
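One simple completeness check is to look for spans that reference a parent the backend never received, which usually means a service in the middle of the call chain is not reporting. The sketch below assumes spans carry `span_id` and `parent_id` fields; the sample trace and service names are hypothetical.

```python
def find_gaps(spans):
    """Return spans whose parent is referenced but absent from the trace."""
    known_ids = {s["span_id"] for s in spans}
    return [
        s for s in spans
        if s["parent_id"] is not None and s["parent_id"] not in known_ids
    ]


# Hypothetical trace in which the service that owns span "b" reported nothing.
trace = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "c", "parent_id": "b", "service": "ledger"},
    {"span_id": "d", "parent_id": "a", "service": "inventory"},
]

orphans = find_gaps(trace)
for span in orphans:
    print(f'{span["service"]} span {span["span_id"]} points to missing parent {span["parent_id"]}')
```

Running a check like this over recent traces gives a quick list of services that still need instrumentation.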
2. Balancing Trace Overhead and Detail
Collecting all trace information from all services provides the most comprehensive trace. However, it comes at the cost of overhead on the overall application or the individual service.
The tradeoff between the amount of data collected and the acceptable overhead is crucial. Planning for this tradeoff ensures that the overhead of distributed tracing does not outweigh the benefits it brings.
Another way to balance these aspects is to filter and sample the trace information so that only what is required gets collected. However, this requires additional planning and a thorough understanding of which trace information is valuable.
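As one possible sampling approach, the sketch below keeps a fixed fraction of traces by hashing the trace ID. The `should_sample` function and the 10% rate are assumptions for illustration; tracing SDKs ship their own samplers, and this is not a copy of any of them.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Decide deterministically whether to keep a trace.

    Hashing the trace ID means every service reaches the same decision
    for the same trace, so sampled traces stay complete.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in trace_ids if should_sample(t, rate=0.10)]
print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

Deciding per trace rather than per span matters here: dropping individual spans at random would leave gaps in the traces that are kept.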
3. Protecting Sensitive Data in Trace Data
Collecting trace information from transactions includes collecting payloads of the actual transaction. This information is usually considered sensitive since it may contain personally identifiable information of customers, such as driver’s license numbers or banking information.
Regulations worldwide define what information may be stored during business operations and how it must be handled. Therefore, it is essential that the collected information undergoes data obfuscation.
Helios enables its users to easily obfuscate sensitive data from the payloads collected, thereby enabling compliance with regulations. In addition to obfuscation, Helios provides other techniques to enhance and filter out the data sent to the Helios platform.
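To show what payload obfuscation can look like in general, here is a minimal sketch that masks sensitive fields before a payload is attached to a span. It does not reflect how Helios implements obfuscation; the `obfuscate` function, the `SENSITIVE_KEYS` set, and the sample payload are hypothetical.

```python
# Assumed list of sensitive field names; a real deployment would derive
# this from its own data classification rules.
SENSITIVE_KEYS = {"card_number", "drivers_license", "iban", "password"}


def obfuscate(payload, sensitive=SENSITIVE_KEYS, mask="****"):
    """Return a copy of the payload with sensitive values masked, at any depth."""
    if isinstance(payload, dict):
        return {
            key: mask if key.lower() in sensitive else obfuscate(value, sensitive, mask)
            for key, value in payload.items()
        }
    if isinstance(payload, list):
        return [obfuscate(item, sensitive, mask) for item in payload]
    return payload  # scalars pass through unchanged


request_body = {
    "customer": {"name": "Ada", "drivers_license": "D123-4567"},
    "payment": {"card_number": "4111111111111111", "amount": 42.5},
}
safe_body = obfuscate(request_body)
print(safe_body)
```

Masking before the data leaves the service means the sensitive values never reach the tracing backend, which is generally easier to reason about than scrubbing them afterward.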
Distributed Tracing Tools
Today, numerous distributed tracing tools are available to help developers resolve issues more quickly.
1. Lightstep
Lightstep is a cloud-agnostic distributed tracing tool that provides full-context distributed tracing across multi-cloud environments or microservices. It enables developers to integrate the solution with complex systems with little extra effort.
It also provides a free plan with the features developers need to get started on their distributed tracing journey, including data ingestion, analysis, and monitoring.
Source: Lightstep UI
2. Zipkin
Zipkin is an open-source solution that provides distributed tracing with easy-to-use steps to get started. It also integrates with Elasticsearch for efficient searching of the collected data.
Source: Zipkin UI
It was developed at Twitter to gather crucial timing data needed to troubleshoot latency issues in service architectures, and it is straightforward to set up with a simple Docker command:
docker run -d -p 9411:9411 openzipkin/zipkin
3. Jaeger Tracing
Jaeger Tracing is yet another open-source solution that provides end-to-end distributed tracing and the ability to perform root cause analysis to identify performance issues or bottlenecks across each trace.
It also supports Elasticsearch for data persistence and exposes Prometheus metrics by default to help developers derive meaningful insights. In addition, it allows filtering traces based on duration, service, and tags using the pre-built Jaeger UI.
Source: Jaeger Tracing
4. SigNoz
SigNoz is an open-source tool that enables developers to perform distributed tracing across microservices-based systems while capturing logs, traces, and metrics and later visualizing them within its unified UI. It also provides insightful performance metrics such as the p50, p95, and p99 latency.
Its key benefits are this consolidated view of logs, metrics, and traces and its support for OpenTelemetry.
Source: SigNoz UI
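The p50, p95, and p99 latency figures mentioned for SigNoz are percentiles of request duration. The sketch below computes them with the nearest-rank method, which is one of several common definitions and not necessarily the one SigNoz uses; the latency samples are invented for this example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: mostly fast, two slow outliers.
latencies_ms = list(range(1, 99)) + [400, 900]

for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
```

The median barely notices the two slow requests, while p99 exposes them, which is why tail percentiles are usually more useful than averages for spotting latency problems.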
5. New Relic
New Relic is a distributed tracing solution that can observe 100% of an application’s traces. It provides compatibility with a vast technology stack and support for industry-standard frameworks such as OpenTelemetry. It also supports alerts to diagnose errors before they become major issues.
New Relic has the advantage of being a fully managed, cloud-native service with support for on-demand scalability. In addition, developers can use a single agent to automatically instrument the entire application code.
Source: New Relic UI
6. Datadog
Datadog is a well-recognized solution that offers cloud monitoring as a service. It provides distributed tracing capabilities with Datadog APM, including additional features to correlate distributed tracing, browser sessions, logs, profiles, network, processes, and infrastructure metrics.
In addition, Datadog APM integrates easily with the application. Developers can use the same solution to instrument application code and monitor cloud infrastructure.
Source: Datadog UI
7. Splunk
Splunk offers a distributed tracing tool capable of ingesting all application data while using an AI-driven service to identify error-prone microservices. It also correlates application and infrastructure metrics to help developers better understand the fault at hand.
You can start with a free tier that brings in essential features. However, it is crucial to understand that this solution will store data in the cloud; this may cause compliance issues in some industries.
Source: Splunk UI
8. Honeycomb
Honeycomb brings in distributed tracing capabilities in addition to its native observability functionalities. One of its standout features is that it uses anomaly detection to pinpoint which spans are tied to bad user experiences.
It supports OpenTelemetry, which lets developers instrument code without being locked in to a single vendor, and offers a pay-as-you-go pricing model so that you only pay for what you use.
Source: Honeycomb UI
9. Helios
Helios brings advanced distributed tracing techniques that give developers actionable insight into the end-to-end application flow by building on OpenTelemetry’s context propagation framework.
The solution provides visibility into your system across microservices, serverless functions, databases, and third-party APIs, thus enabling you to quickly identify, reproduce, and resolve issues.
Source: Helios Sandbox
Furthermore, Helios provides a free trace visualization tool based on OpenTelemetry that allows developers to visualize and analyze a trace file by simply uploading it.
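Context propagation, mentioned above, is what carries the trace ID from one service to the next. A widely used wire format for this is the W3C Trace Context `traceparent` HTTP header. The sketch below writes and reads that header using only the standard library; the `inject` and `extract` helpers are hypothetical names for this example and are not the OpenTelemetry or Helios API.

```python
import re
import secrets

# W3C Trace Context "traceparent" layout: version-traceid-parentid-flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id():
    return secrets.token_hex(8)  # 16 hex characters


def inject(headers, trace_id, span_id, sampled=True):
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Read trace context from incoming headers, or None if absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": (int(flags, 16) & 1) == 1,
    }


# Service A starts a trace and calls service B.
trace_id = secrets.token_hex(16)  # 32 hex characters
caller_span = new_span_id()
outgoing = inject({}, trace_id, caller_span)

# Service B continues the same trace with a new child span.
context = extract(outgoing)
child_span = {
    "trace_id": context["trace_id"],
    "span_id": new_span_id(),
    "parent_id": context["parent_id"],
}
print(outgoing["traceparent"])
```

Because the child span reuses the incoming trace ID and records the caller's span ID as its parent, the backend can later stitch both services into a single trace.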
Conclusion
Distributed tracing has seen many iterations and feature enhancements that allow developers to easily identify issues within the application. It reduces the time taken to detect and respond to performance issues and helps understand the relationships between individual microservices.
The future of distributed tracing is likely to incorporate multi-cloud tracing, enabling developers to troubleshoot issues across various cloud platforms. Such solutions would consolidate each trace, removing the need for developers to follow transactions across each cloud platform manually, which is time-consuming and nearly impossible.
I hope you have found this helpful. Thank you for reading!