OpenTelemetry in practice

Recently, two standards, OpenTracing and OpenCensus, finally merged into one: OpenTelemetry, a new standard for distributed tracing and monitoring. But even though library development is in full swing, there is still not much real-world experience with it.



Ilya Kaznacheev, who has been developing software for eight years and works as a backend developer at MTS, is ready to share how to use OpenTelemetry in Golang projects. At the Golang Live 2020 conference, he talked about how to adopt the new tracing and monitoring standard and make it work with the infrastructure a project already has.





OpenTelemetry is a relatively recent standard, introduced late last year. Even so, it has already gained wide adoption and support from many vendors of tracing and monitoring software.



Observability is a term from control theory that describes how well the internal state of a system can be judged from its external outputs. In software architecture, it means a set of approaches for monitoring the state of a system at runtime. These approaches include logging, tracing, and monitoring.







There are many vendor solutions for tracing and monitoring. Until recently, there were two open standards: OpenTracing from the CNCF, which appeared in 2016, and OpenCensus from Google, which appeared in 2018.



These are two pretty good standards that competed with each other for a while, until in 2019 they decided to merge into one new standard called OpenTelemetry.







This standard covers both distributed tracing and monitoring, and it is compatible with the first two. Moreover, OpenTracing and OpenCensus support will be discontinued within two years, which makes the transition to OpenTelemetry inevitable.



Use cases



The standard is designed to let you combine almost anything with almost anything: in essence, it is an intermediary layer between the sources of metrics and traces and their consumers.

Let's take a look at the main scenarios.



For distributed tracing, you can set up a direct connection to Jaeger or whatever tracing service you are using.
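A rough sketch of that direct setup, assuming the dedicated Jaeger exporter package that ships with opentelemetry-go (the package paths here belong to a newer SDK version than the pre-release API used in the rest of this article and have changed between releases; the endpoint is a local Jaeger collector):

import (
      "go.opentelemetry.io/otel"
      "go.opentelemetry.io/otel/exporters/jaeger"
      sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func initTracer() (*sdktrace.TracerProvider, error) {
      // Send spans straight to a Jaeger collector, no OpenTelemetry collector in between.
      exp, err := jaeger.New(jaeger.WithCollectorEndpoint(
            jaeger.WithEndpoint("http://localhost:14268/api/traces"),
      ))
      if err != nil {
            return nil, err
      }

      tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
      otel.SetTracerProvider(tp) // spans created via the global tracer now go to Jaeger
      return tp, nil
}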







If traces are already exported directly, switching is just a matter of configuration: you simply replace the library.



If your application already uses OpenTracing, you can use the OpenTracing Bridge, a wrapper that converts calls to the OpenTracing API into calls to the OpenTelemetry API at the top level.







To collect metrics, you can likewise configure Prometheus to scrape the metrics endpoint of your application directly.
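A sketch of that direct setup (again with current package names rather than the pre-release API used elsewhere in this article; the port and the /metrics path are assumptions):

import (
      "net/http"

      "github.com/prometheus/client_golang/prometheus/promhttp"
      "go.opentelemetry.io/otel"
      "go.opentelemetry.io/otel/exporters/prometheus"
      sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func initMetrics() error {
      // The Prometheus exporter is plugged into the SDK as a metrics reader.
      exporter, err := prometheus.New()
      if err != nil {
            return err
      }
      otel.SetMeterProvider(sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter)))

      // Expose the endpoint that Prometheus will scrape.
      http.Handle("/metrics", promhttp.Handler())
      go func() { _ = http.ListenAndServe(":2222", nil) }()
      return nil
}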







This is useful if you have a simple infrastructure and you collect metrics directly. But the standard also provides more flexibility.



The main scenario for using the standard is to collect metrics and traces through a collector, which runs as a separate application or container in your infrastructure. There is a ready-made container image you can simply deploy in your own environment.



To do this, it is enough to configure an exporter in the OTLP format in the application. OTLP is a gRPC-based protocol for transmitting data in the OpenTelemetry format. On the collector side, you configure the formats and parameters for exporting metrics and traces to their consumers, or into other formats — for example, OpenCensus.
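On the application side, that boils down to something like this (a sketch with current package names, which differ from the pre-release API used elsewhere in this article; the collector address is an assumption):

import (
      "context"

      "go.opentelemetry.io/otel"
      "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
      sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func initOTLP(ctx context.Context) (*sdktrace.TracerProvider, error) {
      // Ship spans over gRPC (OTLP) to the OpenTelemetry collector.
      exp, err := otlptracegrpc.New(ctx,
            otlptracegrpc.WithEndpoint("otel-collector:4317"),
            otlptracegrpc.WithInsecure(),
      )
      if err != nil {
            return nil, err
      }

      tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
      otel.SetTracerProvider(tp)
      return tp, nil
}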







The collector lets you connect many kinds of data sources on the input side and many data sinks on the output side.







Thus, the OpenTelemetry standard provides compatibility with many open source and vendor standards.



The collector is also extensible, so most vendors already provide ready-made exporters for their own solutions where they exist. You can use OpenTelemetry even if you send metrics and traces to some proprietary vendor, which solves the vendor lock-in problem. And even if something is not yet available for OpenTelemetry directly, it can be forwarded through OpenCensus.



The collector itself is very easy to configure through a plain YAML config:
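For example, a minimal sketch of such a config (the component names must match what is compiled into your collector build, and the endpoints here are assumptions):

receivers:
  otlp:
    protocols:
      grpc:
  prometheus:
    config:
      scrape_configs:
        - job_name: my-app
          static_configs:
            - targets: ["my-app:2222"]

processors:
  batch:

exporters:
  jaeger:
    endpoint: "jaeger:14250"
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch]
      exporters: [prometheus]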







Receivers are specified here; your application may have some other source (Kafka, etc.).







Exporters are the data recipients.

Processors are the ways data is processed inside the collector.







And pipelines directly define how each data stream flowing through the collector will be handled.







Let's look at one illustrative example.







Let's say you have a microservice to which you have already wired up and configured OpenTelemetry, and another service with a similar setup.



So far, everything is easy. But there are:



  • legacy services that report through OpenCensus;
  • a database that sends data in its own format (for example, straight to Prometheus, as PostgreSQL does);
  • some other service that runs in a container and exposes metrics in its own format — you do not want to rebuild that container or bolt sidecars onto it to reformat the metrics, you just want to pick them up and ship them;
  • hardware from which you also collect metrics and want to use them somehow.


All these metrics can be combined in one collector.







It already supports many of the metric and trace sources used in existing applications, and if you use something exotic, you can implement your own plugin. In practice this is rarely needed, because applications that export metrics or traces generally rely either on common formats or on open standards like OpenCensus.



Now we want to use this information. You can specify Jaeger as the trace exporter and send metrics to Prometheus, or to something compatible with it — say, everyone's favorite VictoriaMetrics.



But what if we suddenly decide to move to AWS and use its native X-Ray tracer? No problem: the data can be forwarded through OpenCensus, which has an X-Ray exporter.



Thus, from these pieces you can assemble all your infrastructure for metrics and traces.



The theory is over. Let's talk about how to use tracing in practice.



Golang Application Instrumentation: Tracing



First, you need to create a root span, from which the call tree will grow.



ctx := context.Background()
tr := global.Tracer("github.com/me/otel-demo")
ctx, span := tr.Start(ctx, "root")
span.AddEvent(ctx, "I am a root span!")
doSomeAction(ctx, "12345")
span.End()
      
      





The tracer name — here "github.com/me/otel-demo" — is the name of your service or library. It lets you tell which spans in a trace belong to your own application and which come from imported libraries.



Next, a root span is created with the name:



ctx, span := tr.Start(ctx, "root")
      
      





Choose a name that clearly describes the trace level. It can be, for example, a method name (or class and method), or an architecture layer: infrastructure layer, logic layer, database layer, and so on.



The span data is also put into context:



ctx, span := tr.Start(ctx, "root")
span.AddEvent(ctx, "I am a root span!")
doSomeAction(ctx, "12345")

      
      





Therefore, you need to pass the context into the methods you want to trace.



A span represents an operation at a specific level of the call tree. You can attach attributes, logs, and, if one occurs, an error status to it. A span must be closed at the end; when it is closed, its duration is calculated.



ctx, span := tr.Start(ctx, "root")
span.AddEvent(ctx, "I am a root span!")
doSomeAction(ctx, "12345")
span.End()
      
      





This is how our span looks in Jaeger.







You can expand it and see its logs and attributes.



Later you can get the same span back from the context if you don't want to start a new one. For example, you may want one architectural layer to be recorded as a single span, even though the layer is spread across several methods and call levels. You take the span out of the context, write to it, and it is closed where it was created.



func doSomeAction(ctx context.Context, requestID string) {
      span := trace.SpanFromContext(ctx)
      span.AddEvent(ctx, "I am the same span!")
      ...
}
      
      





Note that you do not need to close it here, because it will be closed in the method where it was created; we are just taking it out of the context.



Writing a message to the root span:







Sometimes you need to create a new child span so that it exists separately.



func doSomeAction(ctx context.Context, requestID string) {
   ctx, span := global.Tracer("github.com/me/otel-demo").
      Start(ctx, "child")
   defer span.End()
   span.AddEvent(ctx, "I am a child span!")
   ...
}
      
      





Here we get the global tracer by the library name. This call can be wrapped in a helper method, or the tracer can be kept in a global variable, since it is the same throughout your service.



Next, a child span is created from the context, and a name is assigned to it, similar to how we did it at the beginning:



   Start(ctx, "child")
      
      





Remember to close the span at the end of the method in which it was created.



ctx, span := global.Tracer("github.com/me/otel-demo").
      Start(ctx, "child")
defer span.End()
      
      





We write messages into it, and they end up in the child span.







Here you can see that the messages are displayed hierarchically and the child span is nested under the parent. As expected, it is shorter, because it was a synchronous call.



The following example shows the attributes that can be written to a span:



func doSomeAction(ctx context.Context, requestID string) {
      ...
      span.SetAttributes(label.String("request.id", requestID))
      span.AddEvent(ctx, "request validation ok")
   span.AddEvent(ctx, "entities loaded", label.Int64("count", 123))
      span.SetStatus(codes.Error, "insertion error")
}
      
      





For example, this is where our request.id ends up:







You can add events:



   span.AddEvent(ctx, "request validation ok")
      
      





You can also attach a label here; this works much like structured logging with logrus:



span.AddEvent(ctx, "entities loaded", label.Int64("count", 123))
      
      





Here we see our message in the span log. You can expand it and see the labels; in our case, a count label was added:







It will come in handy later when filtering traces in a search.



If an error occurs, you can add a status to the span. In this case, it will be marked as invalid.



  span.SetStatus(codes.Error, "insertion error")
      
      





The standard used to take its error codes from OpenCensus, which in turn used the gRPC codes. Now only OK, ERROR, and UNSET are left: OK is the default, and ERROR is set when an error occurs.



Here you can see that the trace with the error is marked with a red icon, with an error code and a message:







Keep in mind that tracing is not a replacement for logs. Its main point is to track the flow of a request through a distributed system, and for that you need to propagate the trace context in network requests and be able to read it back on the other side.



Tracing microservices



OpenTelemetry already ships many ready-made interceptor and middleware implementations for various frameworks and libraries. They can be found in the contrib repository: github.com/open-telemetry/opentelemetry-go-contrib



List of frameworks for which there are interceptors and middleware:



  • beego
  • go-restful
  • gin
  • gocql
  • mux
  • echo
  • http
  • grpc
  • sarama
  • memcache
  • mongo
  • macaron


Let's see how to use them, taking the standard http client and server as an example.



middleware client



In the client, we simply add an interceptor as the transport, after which our requests are enriched with the trace.id and the information needed to continue the trace.



client := http.Client{
      Transport: otelhttp.NewTransport(http.DefaultTransport),
}
req, _ := http.NewRequestWithContext(ctx, "GET", url, nil)
resp, err := client.Do(req)
      
      





middleware server



A small middleware with the name of the library is added on the server:



http.Handle("/", otelhttp.NewHandler(
      http.HandlerFunc(get), "root"))
err := http.ListenAndServe(addr, nil)
      
      





Then, as usual: get the span from the context, work with it, write something into it, create child spans, close them, etc.



This is what a simple request passing through three services looks like:







The screenshot shows the hierarchy of calls, division into services, their duration, sequence. You can click on each of them and see more detailed information.



And this is what an error looks like:







It is easy to track where it happened, when and how much time has passed.

In span, you can see detailed information about the context in which the error occurred:







Moreover, fields that apply to the whole span (request IDs, key fields from the affected table, any other metadata you want to record) can be attached to the span when it is created. Roughly speaking, you don't need to copy and paste these fields into every place where you handle an error: you write them to the span once.
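For example, a rough sketch using the same calls as above (the span name and field names are made up): the attributes are set once right after the span is created, and every event or error recorded later in that span is already accompanied by them.

ctx, span := tr.Start(ctx, "order-handler")
defer span.End()

// Request-scoped fields attached once, instead of repeating them at every error site.
span.SetAttributes(
      label.String("request.id", requestID),
      label.String("order.id", orderID), // hypothetical key field
)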



middleware func



Here's a little bonus: how to wrap this into a middleware function so it can be used as a global middleware in routers like Gorilla and Gin:



middleware := func(h http.Handler) http.Handler {
      return otelhttp.NewHandler(h, "root")
}
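
For instance, with gorilla/mux (a rough sketch; the router comes from github.com/gorilla/mux, and Gin wires middleware in through its own adapter instead):

r := mux.NewRouter()
r.Use(middleware)      // every handler now starts a server span
r.HandleFunc("/", get)
err := http.ListenAndServe(addr, r)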
      
      





Golang Application Instrumentation: Monitoring



It's time to talk about monitoring.



Connection to the monitoring system is configured in the same way as for tracing.



Measurements are divided into two types:



1. Synchronous, when the user explicitly passes values at the time of the call:



  • Counter
  • UpDownCounter
  • ValueRecorder


Value types: int64, float64.



2. Asynchronous, which the SDK reads at the moment it collects data from the application:



  • SumObserver
  • UpDownSumObserver
  • ValueObserver


Value types: int64, float64.



The metrics themselves are:



  • Additive and monotone (Counter, SumObserver) that sum positive numbers and do not decrease.
  • Additive but not monotone (UpDownCounter, UpDownSumObserver), which can sum positive and negative numbers.
  • Non-additive (ValueRecorder, ValueObserver) that simply record a sequence of values. For example, some kind of distribution.


At the beginning of the program, a global meter is created, and the name of the library or service is passed to it.



meter := global.Meter("github.com/ilyakaznacheev/otel-demo")
floatCounter := metric.Must(meter).NewFloat64Counter(
         "float_counter",
         metric.WithDescription("Cumulative float counter"),
   ).Bind(label.String("label_a", "some label"))
defer floatCounter.Unbind()
      
      





Next, a metric is created:



floatCounter := metric.Must(meter).NewFloat64Counter(
         "float_counter",
         metric.WithDescription("Cumulative float counter"),
   ).Bind(label.String("label_a", "some label"))
      
      





It is given a name:



   "float_counter",
      
      





Description:



…
         metric.WithDescription("Cumulative float counter"),
…
      
      





And a set of labels by which you can later filter queries, for example when building dashboards in Grafana:



…
    ).Bind(label.String("label_a", "some label"))
…

      
      





At the end of the program, you also need to call Unbind for each metric, which will free resources and close it correctly:



…
defer floatCounter.Unbind()
…

      
      





Recording measurements is simple:



var (
      counter       metric.BoundFloat64Counter
      udCounter     metric.BoundFloat64UpDownCounter
      valueRecorder metric.BoundFloat64ValueRecorder
)
...
counter.Add(ctx, 1.5)
udCounter.Add(ctx, -2.5)
valueRecorder.Record(ctx, 3.5)

      
      





Counter accepts only positive numbers; UpDownCounter sums any numbers, positive or negative; and ValueRecorder records any values as well. For all kinds of instruments, Go supports int64 and float64.



This is what we get at the output:



# HELP float_counter Cumulative float counter
# TYPE float_counter counter
float_counter{label_a="some label"} 20
      
      





This is our metric with its description and label. You can then either scrape it directly with Prometheus or export it through the OpenTelemetry collector and use it wherever you need.
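The examples above use only synchronous instruments. With the asynchronous ones, you register a callback and the SDK calls it whenever it collects data. A rough sketch with the current metrics API (in newer SDK versions the observers listed above were renamed to ObservableCounter, ObservableUpDownCounter, and ObservableGauge, so the calls differ from the pre-release API used elsewhere in this article); queueLen is a stand-in for whatever you want to measure:

import (
      "context"

      "go.opentelemetry.io/otel"
      "go.opentelemetry.io/otel/metric"
)

// reportQueueLength registers an asynchronous gauge that the SDK reads at collection time.
func reportQueueLength(queueLen func() int64) error {
      meter := otel.Meter("github.com/me/otel-demo")

      gauge, err := meter.Int64ObservableGauge(
            "queue_length",
            metric.WithDescription("Current queue length"),
      )
      if err != nil {
            return err
      }

      // The callback runs on every collection, e.g. on each Prometheus scrape.
      _, err = meter.RegisterCallback(func(ctx context.Context, o metric.Observer) error {
            o.ObserveInt64(gauge, queueLen())
            return nil
      }, gauge)
      return err
}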



Golang Application Instrumentation: Libraries



The last thing I want to cover is what the standard offers for instrumenting libraries.



Previously, with OpenCensus and OpenTracing, you could not really instrument individual libraries, especially open source ones, because doing so tied them to a vendor. Anyone who has worked closely with tracing has probably noticed that large client libraries, or large cloud-service APIs, occasionally fail with hard-to-explain errors.



Tracing would be very useful here, especially in production, when you have some unclear situation and would really like to know why it happened, but all you have is an error message from an imported library.



OpenTelemetry solves this problem.







Since the SDK and the API are separated in the standard, the tracing and metrics API can be used independently of the SDK and of specific data-export settings. Moreover, you can instrument your methods first and only later configure how this data is exported.

This way, you can instrument the imported library without worrying about how and where the data will be exported. This will work for both internal and open source libraries.
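For example, a rough sketch of what this looks like inside a hypothetical imported library (the client, user type, and method are made up): only the OpenTelemetry API packages are imported — the ones providing global.Tracer, label, and codes — and no SDK, so if the host application never configures an exporter, these calls are effectively no-ops.

// FetchUser is a hypothetical library method instrumented only with the OpenTelemetry API.
func (c *Client) FetchUser(ctx context.Context, id string) (*User, error) {
      ctx, span := global.Tracer("github.com/me/mylib").Start(ctx, "mylib.FetchUser")
      defer span.End()

      span.SetAttributes(label.String("user.id", id))

      user, err := c.fetch(ctx, id) // the library's actual logic
      if err != nil {
            span.SetStatus(codes.Error, "fetch failed")
      }
      return user, err
}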



No need to worry about vendor lock-in, no need to worry about how this information will be used or whether it will be used at all. Libraries and applications are instrumented in advance, and the data export configuration is specified when the application is initialized.



So the configuration lives in the application's SDK setup. There you define the exporters for traces and metrics — it can be a single OTLP exporter if you export to the OpenTelemetry collector. After that, all the necessary trace and metric data ends up in the context, which is propagated from method to method down the call tree.



The rest of the spans in the application are inherited from the root span, simply through the OpenTelemetry API and the data in the context. Imported libraries receive the context as a method argument and try to read the parent span information from it; if there is none, they create their own root span and instrument their logic under it. This way you can instrument your library up front.



Moreover, you can instrument everything, leave the data exporters unconfigured, and just deploy it.



It may well run in production like that: until the infrastructure is settled, you simply won't have tracing and monitoring configured. Later you configure them, deploy a collector and whatever applications consume this data, and everything starts working — without changing anything in the methods themselves.



Thus, if you have an open source library, you can instrument it using OpenTelemetry. Then the people who use it will configure OpenTelemetry and use this data.



In conclusion, I would like to say that the OpenTelemetry standard looks promising. Perhaps this is finally the universal standard we have all been waiting for.



Our company actively uses the OpenCensus standard for tracing and monitoring its microservice landscape; we plan to move to OpenTelemetry after its release.


