Connecting the dots using OpenTelemetry (Part - 1)

Engineering

Author: Mohit Shukla, Bureau Team

November 11, 2024

OpenTelemetry is a set of tools, APIs, and SDKs that allows us to measure and collect telemetry data such as metrics, logs, and traces from our software. These tools give us insight into the performance and behavior of our systems, which helps with debugging, troubleshooting, and identifying performance bottlenecks.

At Bureau, we have been using OpenTelemetry quite extensively. We rely on multiple data sources, internal and external, to generate insights. These data sources are often inconsistent, creating blind spots in our infrastructure, and OpenTelemetry has helped us understand how well each source is performing.

Why OpenTelemetry?

As of 2023, OpenTelemetry continues to be the second most active CNCF project after Kubernetes. It offers several advantages over alternative tools.

  • Vendor agnostic: You do not have to change the instrumentation in your applications if you decide to move from one backend to another. You can also design your observability strategy by mixing and matching components independently of the backend.
  • Speed: Being one of the most active CNCF projects, with a large number of contributors, helps it move far faster than any proprietary tool ever can.
  • Data Governance: It provides capabilities to govern your data, such as controlling data retention, masking/redacting, filtering, and sampling (a configuration sketch follows this list).
  • Community: OpenTelemetry is an industry standard for gathering telemetry data from a system, and it is backed by a strong and supportive community.
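
As a rough illustration of what such governance can look like in a collector configuration (the attribute keys below are hypothetical examples, not ones we actually use), the attributes processor can hash or drop sensitive fields before data leaves your infrastructure:

processors:
  # Illustrative only: hash a potentially sensitive attribute and drop another.
  # The attribute keys are hypothetical.
  attributes/redact:
    actions:
      - key: user.email
        action: hash
      - key: internal.debug_payload
        action: delete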

Architecture

OpenTelemetry architecture diagram (Reference: https://opentelemetry.io/docs/)

How do we use OpenTelemetry at Bureau?

We use OpenTelemetry to improve observability by collecting data from all layers of the application stack, including distributed traces, metrics, and logs. This comprehensive view helps us make informed decisions quickly and respond faster when issues arise in production environments and customer-facing services.

We run an OTEL Collector as an HTTP service to which all other services export their telemetry data. The collector then forwards this data to the respective backends.
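
As a rough sketch of how a service points at such a collector (the hostname, port, and service name below are assumptions, not our actual setup), the standard OTLP exporter environment variables are typically enough for an instrumented application:

# Illustrative container environment for an instrumented service;
# the collector hostname and service name are hypothetical.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.internal:4318"   # OTLP over HTTP
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "http/protobuf"
  - name: OTEL_SERVICE_NAME
    value: "payments-api"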

What is an OpenTelemetry Collector?

The OpenTelemetry Collector is a vendor-agnostic way of receiving, processing, and exporting telemetry data from different systems. It removes the need to run multiple agents to collect signal data. It supports most open-source observability data formats (e.g. Jaeger, Prometheus, Fluent Bit) and can send the data to one or more open-source or commercial backends such as honeycomb.io, newrelic.com, and lightstep.com.

The OTEL Collector supports multiple types of components to receive, process, and export data (a minimal wiring sketch follows the list):

  • Receivers
  • Processors
  • Exporters
  • Extensions
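
A minimal sketch of how these component types fit together is shown below; every configured component must be referenced under service.pipelines (or service.extensions) to take effect. Our full configuration appears further down.

# Minimal wiring example: components are defined at the top level
# and activated by referencing them in the service section.
receivers:
  otlp:
    protocols:
      http:

processors:
  batch:

exporters:
  logging:

extensions:
  health_check:

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging]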

We use the filter, tail_sampling, and batch processors to drop health-check routes, sample traces, and batch them, respectively, before exporting to the backend. This gives us governance over the data, helps control backend costs, and keeps the focus on the interesting traces from the system. Various other processors are available as part of the collector and the collector-contrib repository.

OpenTelemetry Collector pipeline diagram (Reference: opentelemetry.io)

Below is a typical OTEL Collector configuration.

receivers:
 otlp:
  protocols:
     http:
     grpc:
 fluentforward:
   endpoint: 0.0.0.0:24224
 prometheus/infra:
   config:
     scrape_configs:
       - job_name: '${SRV_ENV_NAME}-kafka'
         scrape_interval: 60s
         scrape_timeout: 60s
         static_configs:
            - targets: ['kafka-1.${SRV_ENV_NAME}.internal:11002', 'kafka-2.${SRV_ENV_NAME}.internal:11002', 'kafka-3.${SRV_ENV_NAME}.internal:11002']
              labels:
                resource: 'amazon-msk'
                env: '${SRV_ENV_NAME}'
                type: 'node-exporter'
            - targets: ['kafka-1.${SRV_ENV_NAME}.internal:11001', 'kafka-2.${SRV_ENV_NAME}.internal:11001', 'kafka-3.${SRV_ENV_NAME}.internal:11001']
              labels:
                resource: 'amazon-msk'
                env: '${SRV_ENV_NAME}'
                type: 'jmx-exporter'
       - job_name: '${SRV_ENV_NAME}-otel-collector-service'
         scrape_interval: 30s
         static_configs:
           - targets: ['0.0.0.0:8888']
         metric_relabel_configs:
           - source_labels: [ __name__ ]
             regex: '.*grpc_io.*'
             action: drop

processors:
 filter:
   spans:
     exclude:
       match_type: regexp
       attributes:
         - key: http.target
           value: (^/health|^/metrics)
       
 batch:

 tail_sampling:
   decision_wait: 60s
   num_traces: 10000
   expected_new_traces_per_sec: 1000
   policies:
     [          
       {
         name: errors-policy,
         type: status_code,
         status_code: {status_codes: [ERROR]}
       },
       {
         name: randomized-policy,
         type: probabilistic,
         probabilistic: {sampling_percentage: 25}
       },
     ]
   
exporters:
 logging:
   verbosity: detailed
   sampling_initial: 5
   sampling_thereafter: 200
 otlp:
   endpoint: $NEW_RELIC_ENDPOINT
   headers:
     api-key: $NEW_RELIC_API_KEY
 prometheusremotewrite/1:
   endpoint: $ENDPOINT
   auth:
     authenticator: basicauth/prw
   external_labels:
     server: otel-collector

extensions:
 basicauth/prw:
   client_auth:
     username: $TOKEN_USERNAME
     password: $TOKEN_PASSWORD
 health_check:
   endpoint: :13133
 zpages:
   endpoint: :55679

service:
 telemetry:
   logs:
     level: "$LOG_LEVEL"
   metrics:
     address: :8888
 extensions: [health_check,zpages,basicauth/prw]
 pipelines:
   traces:
     receivers: [otlp]
     processors: [filter, tail_sampling, batch]
     exporters: [otlp]
   metrics:
     receivers: [prometheus/infra]
     processors: [batch]
      exporters: [prometheusremotewrite/1]
   logs:
     receivers: [fluentforward, otlp]
     processors: [batch]
     exporters: [otlp]

Below is an example of a distributed trace in which one or more spans have the attribute otel.status_code set to "ERROR" because an HTTP POST request failed. We can get more information, such as the HTTP endpoint and HTTP status code, from the span attributes for analysis and alerting.

Failed request trace (screenshot)

A log message is linked to the trace using a trace.id attribute, which helps you understand what happened before and after the failed span. You can also use the log to carry the complete payload, as sketched below.

Log in context (screenshot)
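
As an illustrative sketch (field names follow the OTLP log data model; every value below is a placeholder), a correlated log record carries the trace and span identifiers alongside the log body, which the backend can surface as attributes such as trace.id:

# Hypothetical log record; all values are placeholders.
body: "payment request failed: upstream returned 502"
severity_text: ERROR
attributes:
  service.name: payments-api
  http.status_code: 502
trace_id: 4bf92f3577b34da6a3ce929d0e0e4736
span_id: 00f067aa0ba902b7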

Limitations of using OpenTelemetry

There are a couple of limitations as well when using the OpenTelemetry Collector.

  • It is still a work in progress. Many of its components are still in the alpha or beta stage.
  • Scalability is still a concern for the collector. While some components support running multiple instances of the collector, many don't. For example, the tail-based sampling processor does not support running multiple collector instances without the load-balancing exporter (a sketch follows below).
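
A common workaround (sketched below with assumed hostnames) is a two-tier setup: a front layer of collectors uses the load-balancing exporter to route all spans of a trace to the same downstream collector, which then runs tail sampling.

# Illustrative front-layer collector; the downstream hostnames are assumptions.
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loadbalancing:
    routing_key: traceID     # keep every span of a trace on the same downstream collector
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - sampling-collector-1.internal:4317
          - sampling-collector-2.internal:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]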

Special thanks to Abhinav, Nandeesh, and Shekh from the Bureau team.
