Zero to Telemetry

Jan 7, 2023 | Observability

Establishing the tools and systems needed to collect telemetry and enable observability can be a daunting but essential task. Over the last decade I have found myself in this exact position multiple times. I want to share the high-level principles I use when approaching this problem space.

When I break down the problem of “collect telemetry from running systems”, I tend to split it into 3 key areas. The first is the telemetry/observability platform that you will use to store and analyze the signals you are collecting. The other 2 are instrumenting your apps, and collecting and forwarding the data to your platform of choice. Depending on preference, instrumentation and collection can be treated as 1 piece or 2.

The Platform

Selecting the platform you will use is a hugely consequential task, as it will dictate your observability strategy going forward. It is also very hard to remove and replace if something goes wrong.

Because of this, it makes a lot of sense to take your time figuring out which platform has the capabilities you need and best handles the signals you care about. Each platform makes different trade-offs in how it handles your telemetry.

For example, log querying can vary greatly between platforms: some require you to predefine the indexes you want to query your logs with, while others allow you to search on any string.

Some platforms have cost control mechanisms within the platform, while others will require you to modify the data you send.

Key Considerations

How are you charged for your data? Is it ingest only? Do they charge per query?
What protocols are supported for telemetry collection? OpenTelemetry? Prometheus? Or is the platform focused on proprietary SDKs and protocols?
If you choose self-hosted, what expertise does your team have to make it successful?
Does the platform have controls to help you manage costs? If not, how are you going to control your costs?
What application platform does your application run on, and does your chosen telemetry platform support it (Lambda, EKS, ECS, GKE, etc.)?
How invested in the OpenTelemetry ecosystem is the platform?
Do queries rely on indexes for logs? Is trace data queryable?

How you weigh each of these considerations, and which others you add, will vary widely depending on your situation. Maybe you only care about a single key telemetry signal. On the other hand, maybe you need strong support for all 3 signals (logs, metrics, and traces). Whatever your situation, understand that each platform, regardless of what it promises, is going to be a set of trade-offs.
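One way to make that weighing explicit is a small weighted scoring matrix. The sketch below is only an illustration: the consideration names, weights, platform names, and scores are all hypothetical placeholders that you would replace with your own.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical weights (1 = nice to have, 5 = critical) for a few of the
# considerations listed above. Adjust the keys and values to your situation.
WEIGHTS: Dict[str, int] = {
    "pricing_model": 5,
    "opentelemetry_support": 4,
    "cost_controls": 3,
    "log_querying": 2,
}


@dataclass
class Candidate:
    name: str
    scores: Dict[str, int]  # 0-5 per consideration, assigned by your team


def weighted_score(candidate: Candidate, weights: Dict[str, int]) -> float:
    """Weighted average of the scores; a missing consideration counts as 0."""
    total_weight = sum(weights.values())
    total = sum(w * candidate.scores.get(key, 0) for key, w in weights.items())
    return total / total_weight


def rank(candidates: List[Candidate], weights: Dict[str, int]) -> List[Candidate]:
    """Return candidates ordered from best to worst weighted score."""
    return sorted(candidates, key=lambda c: weighted_score(c, weights), reverse=True)


# Placeholder platforms and scores, purely for illustration.
platform_a = Candidate("platform_a", {
    "pricing_model": 4, "opentelemetry_support": 2,
    "cost_controls": 5, "log_querying": 3,
})
platform_b = Candidate("platform_b", {
    "pricing_model": 3, "opentelemetry_support": 5,
    "cost_controls": 3, "log_querying": 4,
})

ranking = rank([platform_a, platform_b], WEIGHTS)
for candidate in ranking:
    print(f"{candidate.name}: {weighted_score(candidate, WEIGHTS):.2f}")
```

The number that comes out matters less than the conversation it forces about which trade-offs your team is actually willing to make.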

Code Instrumentation

Code ergonomics can be a very controversial topic, and as such I usually strive to be as library-agnostic as I can.

Vendor Specific SDKs

I tend to avoid platform-specific SDKs where possible. They tend to feel like unnecessary vendor lock-in: changing platforms, hopefully an extremely rare occurrence, would require fleetwide code changes. This isn’t a complete ban on vendor/platform-specific SDKs, but I generally default to more general open source SDKs if possible.
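To make the lock-in concern concrete, here is a minimal sketch of the separation that vendor-agnostic SDKs give you: application code depends only on a neutral interface, and the backend is chosen in one place. This is not the API of OpenTelemetry or of any real SDK; `MetricsBackend`, `InMemoryBackend`, and `handle_checkout` are hypothetical names used only for illustration.

```python
from typing import Dict, List, Protocol, Tuple


class MetricsBackend(Protocol):
    """Hypothetical vendor-neutral interface that application code depends on."""

    def increment(self, name: str, value: int, attributes: Dict[str, str]) -> None:
        ...


class InMemoryBackend:
    """Stand-in backend that records calls; a real one would wrap an exporter."""

    def __init__(self) -> None:
        self.recorded: List[Tuple[str, int, Dict[str, str]]] = []

    def increment(self, name: str, value: int, attributes: Dict[str, str]) -> None:
        self.recorded.append((name, value, dict(attributes)))


def handle_checkout(backend: MetricsBackend, region: str) -> None:
    """Application code: it knows the interface, never the vendor behind it."""
    backend.increment("checkout.completed", 1, {"region": region})


# The backend is selected once at startup. Swapping platforms means changing
# this line, not every call site across the fleet.
backend = InMemoryBackend()
handle_checkout(backend, "eu-west-1")
print(backend.recorded)
```

With a vendor-specific SDK, the equivalent of `handle_checkout` would call the vendor's client directly, and that is where the fleetwide code changes come from.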

Vendor Agnostic SDKs

Key Considerations

Can the SDK support your application platform?
How mature is the support for the languages your applications/engineers use? How mature are the OpenTelemetry SDKs for your languages? What happens if the SDK misbehaves and crashes your application?
Will your application talk directly to your platform or will it require an agent to ship the data?
Does your application always have network access to your platform?
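The questions about a misbehaving SDK and unreliable network access can be explored with a small sketch. It assumes a hypothetical `send` callable standing in for whatever actually ships data to the platform, and a hypothetical `SafeEmitter` wrapper that never lets a telemetry failure reach application code and keeps only a bounded number of events while the platform is unreachable.

```python
import logging
from collections import deque
from typing import Callable, Deque, Dict

logger = logging.getLogger("telemetry")


class SafeEmitter:
    """Hypothetical wrapper: telemetry failures never propagate to the caller."""

    def __init__(self, send: Callable[[Dict], None], max_buffered: int = 1000) -> None:
        self._send = send
        # Bounded buffer: once full, appending drops the oldest event so that
        # memory use stays capped during a long outage.
        self._buffer: Deque[Dict] = deque(maxlen=max_buffered)

    def emit(self, event: Dict) -> None:
        self._buffer.append(event)
        self.flush()

    def flush(self) -> None:
        while self._buffer:
            event = self._buffer[0]
            try:
                self._send(event)
            except Exception:
                # Keep the event for a later retry and hand control back to the app.
                logger.warning("telemetry send failed; %d event(s) buffered", len(self._buffer))
                return
            self._buffer.popleft()


# Simulated outage followed by recovery.
delivered = []
network = {"up": False}


def send(event: Dict) -> None:
    if not network["up"]:
        raise ConnectionError("platform unreachable")
    delivered.append(event)


emitter = SafeEmitter(send, max_buffered=2)
emitter.emit({"id": 1})
emitter.emit({"id": 2})
emitter.emit({"id": 3})  # buffer is full, so the oldest event (id 1) is dropped
network["up"] = True
emitter.flush()
print(delivered)
```

Dropping the oldest events is only one possible policy. Whether to drop, block, or spill to disk is exactly the kind of behavior worth checking in any SDK you evaluate.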

Optional Agent/Pipeline

I nearly always include an agent/pipeline component for a couple of reasons.

First, not all things you monitor will let you just inject an SDK (databases, queues, etc.). Second, sometimes it’s nice to process your telemetry signal data before shipping it to the platform (data reduction, metadata cleansing, addition, etc.). Finally, it provides a bit more flexibility to your overall telemetry collection systems.
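As a rough sketch of the kind of processing meant above, the function below drops debug records (data reduction), strips sensitive attributes (metadata cleansing), and attaches environment metadata (addition). The record shape, `SENSITIVE_KEYS`, and `STATIC_METADATA` are assumptions made for illustration; real agents express these steps through their own configuration rather than hand-written code.

```python
from typing import Dict, Iterable, Iterator, Optional

# Hypothetical values, for illustration only.
SENSITIVE_KEYS = {"password", "authorization"}
STATIC_METADATA = {"env": "production", "cluster": "example-cluster"}


def process(record: Dict) -> Optional[Dict]:
    """Return the cleaned record, or None if it should be dropped."""
    # Data reduction: debug-level records are not forwarded.
    if record.get("level") == "debug":
        return None
    # Metadata cleansing: remove attributes that must not leave the host.
    attributes = {
        key: value
        for key, value in record.get("attributes", {}).items()
        if key.lower() not in SENSITIVE_KEYS
    }
    # Metadata addition: attach environment details the application does not know.
    attributes.update(STATIC_METADATA)
    return {**record, "attributes": attributes}


def pipeline(records: Iterable[Dict]) -> Iterator[Dict]:
    """Yield only the records that survive processing."""
    for record in records:
        processed = process(record)
        if processed is not None:
            yield processed


records = [
    {"level": "debug", "message": "cache miss", "attributes": {}},
    {"level": "error", "message": "login failed",
     "attributes": {"user": "alice", "password": "hunter2"}},
]
forwarded = list(pipeline(records))
print(forwarded)
```

Keeping this logic in the agent rather than in each application means a change in policy is a single pipeline change instead of a fleetwide redeploy.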

However, particularly with platforms that provide public endpoints, you may find you wish to skip this step, and that’s just fine.

Open Source Agents

FluentD/FluentBit (Logs)
OpenTelemetry Collector/Contrib Collector (Logs, Metrics, Traces)
Telegraf (Metrics)

Conclusion

Thinking about telemetry collection as these 3 components has been very helpful for me in breaking down the work of launching telemetry collection for a company. It lets me work on much smaller pieces, visualize how those pieces will interact, plan for future interactions, and build in the flexibility that is important for my company.

“A journey of a thousand miles begins with a single step”

- Lao Tzu
