Connecting the dots using OpenTelemetry (Part - 1)

Engineering

Author: Mohit Shukla, Bureau Team

November 11, 2024

OpenTelemetry is a set of tools, APIs, and SDKs that allows us to measure and collect telemetry data such as metrics, logs, and traces from our software. These tools give us insight into the performance and behavior of our systems, which helps with debugging, troubleshooting, and identifying performance bottlenecks.

At Bureau, we have been using OpenTelemetry quite extensively. We rely on multiple data sources, internal and external, to generate insights. These data sources are often inconsistent, creating blind spots in our infrastructure, and OpenTelemetry has helped us understand how well each source is performing.

Why OpenTelemetry?

As of 2023, OpenTelemetry continues to be the second most active CNCF project after Kubernetes. It offers several advantages over alternative tools.

  • Vendor agnostic: You do not have to change the instrumentation in your applications if you decide to move from one backend to another. You can also design your observability strategy by mixing and matching components independently of the backend.
  • Speed: Being one of the most active CNCF projects, with a large number of contributors, helps it move far faster than any proprietary tool ever can.
  • Data Governance: It provides capabilities to govern your data, such as controlling data retention, masking/redacting, filtering, and sampling (a configuration sketch follows this list).
  • Community: OpenTelemetry is an industry standard for gathering telemetry data from a system, and it is backed by a strong and supportive community.
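
As a rough illustration of what such governance can look like in a collector configuration (the attribute keys below are hypothetical examples, not ones we actually use), the attributes processor can hash or drop sensitive fields before data leaves your infrastructure:

processors:
  # Illustrative only: hash a potentially sensitive attribute and drop another.
  # The attribute keys are hypothetical.
  attributes/redact:
    actions:
      - key: user.email
        action: hash
      - key: internal.debug_payload
        action: delete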

Architecture

OpenTelemetry architecture diagram (Reference: https://opentelemetry.io/docs/)

How do we use OpenTelemetry at Bureau?

We use OpenTelemetry to improve observability by collecting data from all layers of the application stack, including distributed traces, metrics, and logs. This comprehensive view helps us make informed decisions quickly and respond faster when issues arise in production environments and customer-facing services.

We run an OTEL Collector as an HTTP service to which all other services export their telemetry data. The collector then forwards this data to the respective backends.
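
As a rough sketch of how a service points at such a collector (the hostname, port, and service name below are assumptions, not our actual setup), the standard OTLP exporter environment variables are typically enough for an instrumented application:

# Illustrative container environment for an instrumented service;
# the collector hostname and service name are hypothetical.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.internal:4318"   # OTLP over HTTP
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "http/protobuf"
  - name: OTEL_SERVICE_NAME
    value: "payments-api"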

What is an OpenTelemetry Collector?

The OpenTelemetry Collector is a vendor-agnostic way of receiving, processing, and exporting telemetry data from different systems. It removes the need to run multiple agents to collect signal data. It supports most open-source observability data formats (e.g. Jaeger, Prometheus, Fluent Bit) and can send the data to one or more open-source or commercial backends such as honeycomb.io, newrelic.com, and lightstep.com.

The OTEL Collector supports multiple types of components to receive, process, and export data (a minimal wiring sketch follows the list):

  • Receivers
  • Processors
  • Exporters
  • Extensions
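
A minimal sketch of how these component types fit together is shown below; every configured component must be referenced under service.pipelines (or service.extensions) to take effect. Our full configuration appears further down.

# Minimal wiring example: components are defined at the top level
# and activated by referencing them in the service section.
receivers:
  otlp:
    protocols:
      http:

processors:
  batch:

exporters:
  logging:

extensions:
  health_check:

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging]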

We use the filter, tail_sampling, and batch processors to drop health-check routes, sample traces, and batch them, respectively, before exporting to the backend. This gives us governance over the data, helps control backend costs, and keeps the focus on the interesting traces from the system. Various other processors are available as part of the collector and the collector-contrib repository.

OpenTelemetry Collector pipeline diagram (Reference: opentelemetry.io)

Below is a typical OTEL Collector configuration.

receivers:
 otlp:
  protocols:
     http:
     grpc:
 fluentforward:
   endpoint: 0.0.0.0:24224
 prometheus/infra:
   config:
     scrape_configs:
       - job_name: '${SRV_ENV_NAME}-kafka'
         scrape_interval: 60s
         scrape_timeout: 60s
         static_configs:
            - targets: ['kafka-1.${SRV_ENV_NAME}.internal:11002', 'kafka-2.${SRV_ENV_NAME}.internal:11002', 'kafka-3.${SRV_ENV_NAME}.internal:11002']
              labels:
                resource: 'amazon-msk'
                env: '${SRV_ENV_NAME}'
                type: 'node-exporter'
            - targets: ['kafka-1.${SRV_ENV_NAME}.internal:11001', 'kafka-2.${SRV_ENV_NAME}.internal:11001', 'kafka-3.${SRV_ENV_NAME}.internal:11001']
              labels:
                resource: 'amazon-msk'
                env: '${SRV_ENV_NAME}'
                type: 'jmx-exporter'
       - job_name: '${SRV_ENV_NAME}-otel-collector-service'
         scrape_interval: 30s
         static_configs:
           - targets: ['0.0.0.0:8888']
         metric_relabel_configs:
           - source_labels: [ __name__ ]
             regex: '.*grpc_io.*'
             action: drop

processors:
 filter:
   spans:
     exclude:
       match_type: regexp
       attributes:
         - key: http.target
           value: (^/health|^/metrics)
       
 batch:

 tail_sampling:
   decision_wait: 60s
   num_traces: 10000
   expected_new_traces_per_sec: 1000
   policies:
     [          
       {
         name: errors-policy,
         type: status_code,
         status_code: {status_codes: [ERROR]}
       },
       {
         name: randomized-policy,
         type: probabilistic,
         probabilistic: {sampling_percentage: 25}
       },
     ]
   
exporters:
 logging:
   verbosity: detailed
   sampling_initial: 5
   sampling_thereafter: 200
 otlp:
   endpoint: $NEW_RELIC_ENDPOINT
   headers:
     api-key: $NEW_RELIC_API_KEY
 prometheusremotewrite/1:
   endpoint: $ENDPOINT
   auth:
     authenticator: basicauth/prw
   external_labels:
     server: otel-collector

extensions:
 basicauth/prw:
   client_auth:
     username: $TOKEN_USERNAME
     password: $TOKEN_PASSWORD
 health_check:
   endpoint: :13133
 zpages:
   endpoint: :55679

service:
 telemetry:
   logs:
     level: "$LOG_LEVEL"
   metrics:
     address: :8888
 extensions: [health_check,zpages,basicauth/prw]
 pipelines:
   traces:
     receivers: [otlp]
     processors: [filter, tail_sampling, batch]
     exporters: [otlp]
   metrics:
     receivers: [prometheus/infra]
     processors: [batch]
      exporters: [prometheusremotewrite/1]
   logs:
     receivers: [fluentforward, otlp]
     processors: [batch]
     exporters: [otlp]

Below is an example of a distributed trace in which one or more spans have the attribute otel.status_code set to "ERROR" because an HTTP POST request failed. We can get more information, such as the HTTP endpoint and HTTP status code, from the span attributes for analysis and alerting.

Failed request trace (screenshot)

A log message is linked to the trace using a trace.id attribute, which helps you understand what happened before and after the failed span. You can also use the log to carry the complete payload, as sketched below.

Log in context (screenshot)
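
As an illustrative sketch (field names follow the OTLP log data model; every value below is a placeholder), a correlated log record carries the trace and span identifiers alongside the log body, which the backend can surface as attributes such as trace.id:

# Hypothetical log record; all values are placeholders.
body: "payment request failed: upstream returned 502"
severity_text: ERROR
attributes:
  service.name: payments-api
  http.status_code: 502
trace_id: 4bf92f3577b34da6a3ce929d0e0e4736
span_id: 00f067aa0ba902b7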

Limitations of using OpenTelemetry

There are a couple of limitations as well when using the OpenTelemetry Collector.

  • It is still a work in progress. Many of its components are still in the alpha or beta stage.
  • Scalability is still a concern for the collector. While some components support running multiple instances of the collector, many don't. For example, the tail-based sampling processor does not support running multiple collector instances without the load-balancing exporter (a sketch follows below).
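
A common workaround (sketched below with assumed hostnames) is a two-tier setup: a front layer of collectors uses the load-balancing exporter to route all spans of a trace to the same downstream collector, which then runs tail sampling.

# Illustrative front-layer collector; the downstream hostnames are assumptions.
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loadbalancing:
    routing_key: traceID     # keep every span of a trace on the same downstream collector
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - sampling-collector-1.internal:4317
          - sampling-collector-2.internal:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]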

Special thanks to Abhinav, Nandeesh, and Shekh from the Bureau team.
