A gentle introduction to observability

Friday, 5:43 PM. And an incident happens.

Within seconds, you are in a war room with other developers, trying to understand an issue reported by a customer in production. The system is a true black box: no logs, no traces, and several microservices interacting with each other to handle each client request.

Where do you start?

The importance of observability

Modern systems process increasingly large and complex volumes of information. As a consequence, they need more robust architectures built on containers, orchestrators, queues and dozens of microservices in constant communication, which makes identifying and solving problems in production a complex task.

Without observability, answering questions like "where is the problem", "how many users are being impacted" or "which services are degraded" becomes a slow and exhausting process that costs the team hours and compromises troubleshooting. This directly affects MTTR (Mean Time To Recovery) and the user experience.

Observability also lets you stay two steps ahead: identifying failures before customers notice them, visualizing bottlenecks to optimize internal flows and increasing confidence in system stability.

What observability is

Observability is the ability to understand the state of a system based on the information it exposes. A system is considered observable when, by analyzing signals such as logs, metrics and traces, we can understand its internal behavior without modifying the application during an incident.

It makes the application more transparent by providing evidence, context and explanations about how it works, which makes it much easier to investigate and understand what is happening in production.

The three pillars of observability

Logs

A log is a timestamped text record of an event that occurred in the system at a specific moment. Besides the timestamp, each entry describes the event either as plain text or in a structured format, where additional fields and metadata make the events easier to query.

2024-12-06 14:23:45 [INFO] Application started successfully
2024-12-06 14:23:46 [INFO] Database connection established to postgres://prod-db:5432
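
To make the structured variant more concrete, here is a minimal sketch using only Python's standard logging module. The JsonFormatter class, the "checkout" logger name and the service and user_id fields are illustrative assumptions, not part of any particular stack.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter that renders each record as one JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record;
        # here we expect an optional dict under the (assumed) key "context".
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
log.info("Order created", extra={"context": {"service": "order-service", "user_id": 42}})
print(stream.getvalue().strip())
```

Because every entry is a JSON object, a log backend can filter on fields such as user_id instead of searching free text.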

Metrics

A metric is a numerical value that measures some quantitative aspect of the system or its environment over time. Metrics can describe hardware usage such as CPU and memory, or application traffic and latency, such as how many requests a service receives and how long it takes to respond. They can also quantify business-related information.

# How many GET requests to the endpoint /api/users returned 200 as a response?
http_requests_total{method="GET", endpoint="/api/users", status="200"} 1250
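
As a rough illustration of how such a value comes to exist, the sketch below keeps a labeled counter in memory and renders it in a line format modeled on the Prometheus text format. The Counter class is a hypothetical stand-in written for this example; a real service would normally use an existing client library instead of hand-rolling one.

```python
from collections import defaultdict


class Counter:
    """Hypothetical in-memory counter keyed by label values (illustration only)."""

    def __init__(self, name: str, label_names: tuple):
        self.name = name
        self.label_names = label_names
        self._values = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        if amount < 0:
            raise ValueError("a counter can only go up")
        key = tuple(labels[name] for name in self.label_names)
        self._values[key] += amount

    def render(self) -> str:
        """Render one line per label combination: name{labels} value."""
        lines = []
        for key, value in sorted(self._values.items()):
            labels = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{labels}}} {value:g}")
        return "\n".join(lines)


requests_total = Counter("http_requests_total", ("method", "endpoint", "status"))
for _ in range(3):
    requests_total.inc(method="GET", endpoint="/api/users", status="200")
requests_total.inc(method="GET", endpoint="/api/users", status="500")
print(requests_total.render())
```

Each distinct combination of label values is its own time series, which is why the 200 and 500 responses end up on separate lines.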

Traces

A trace tracks the execution of a request across the system. Each step along the way is recorded as a span with its own duration, and spans are nested to show which operation called which. Especially useful in distributed systems, traces help us understand how requests are processed, identify possible performance issues inside the application and make debugging easier.

Trace ID: 7f8a9b2c-4d3e-5f6g-7h8i-9j0k1l2m3n4o
Total Duration: 1.245s

[Span 1] API Gateway: POST /checkout
│  Duration: 1.245s
│
├─ [Span 2] Authentication Service
│  │  Duration: 45ms
│  └─ [Span 3] Redis Query (session cache)
│        Duration: 3ms
│
└─ [Span 4] Order Service
   │  Duration: 890ms
   ├─ [Span 5] PostgreSQL Query (check stock)
   │     Duration: 120ms
   │
   ├─ [Span 6] Payment Service
   │  │  Duration: 650ms
   │  └─ [Span 7] External Payment API Gateway
   │        Duration: 640ms
   │
   └─ [Span 8] PostgreSQL Update (create order)
         Duration: 85ms
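
One way to read a trace like this is to ask where the time is actually spent. The sketch below rebuilds the example as a tree and computes each span's "self time", meaning its duration minus the duration of its direct children. The Span class and helper names are hypothetical, the nesting assumes that Span 1 is the root with the Authentication and Order services as its direct children, and the calculation assumes child spans run one after another without overlapping.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: float
    children: list = field(default_factory=list)

    def self_time_ms(self) -> float:
        """Time spent in this span that is not covered by its direct children."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def walk(span):
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


# Durations are copied from the example trace; the nesting is an assumption.
trace = Span("API Gateway: POST /checkout", 1245, [
    Span("Authentication Service", 45, [Span("Redis Query (session cache)", 3)]),
    Span("Order Service", 890, [
        Span("PostgreSQL Query (check stock)", 120),
        Span("Payment Service", 650, [Span("External Payment API Gateway", 640)]),
        Span("PostgreSQL Update (create order)", 85),
    ]),
])

slowest = max(walk(trace), key=Span.self_time_ms)
print(f"{slowest.name}: {slowest.self_time_ms():g} ms of self time")
```

Under these assumptions the external payment call stands out as the bottleneck, which is the kind of conclusion a tracing UI lets you reach at a glance.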

The observability ecosystem

As the need for observability has grown, so has the number of related tools. Usage is not evenly distributed among them, however, and a few products dominate the market.

OpenTelemetry, or OTel, is an open source, vendor-neutral framework made up of specifications, APIs, SDKs and tools that standardize the collection, processing and export of telemetry data (logs, metrics, traces and profiling). OTel is a Cloud Native Computing Foundation (CNCF) project and is currently the dominant standard in the observability market.

Prometheus is a monitoring and metrics collection tool that is also a CNCF project. It collects metrics by periodically scraping an HTTP endpoint exposed by the application, an approach known as the pull model, and stores them in a TSDB (Time Series Database). It also provides its own query language, PromQL, to work with this data.
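
The pull model can be sketched with nothing but the Python standard library: the application exposes its current values over HTTP, and a collector fetches them whenever it decides to. Everything below is a simplified assumption made for illustration (the MetricsHandler class, the scrape function, the single hard-coded counter); it mimics the idea, not the real Prometheus implementation.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application state: a single request counter.
REQUEST_COUNT = 1250


class MetricsHandler(BaseHTTPRequestHandler):
    """Application side: expose the current metric values on /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = f"http_requests_total {REQUEST_COUNT}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example output quiet
        pass


def scrape(host: str, port: int) -> dict:
    """Collector side: pull the current samples from the /metrics endpoint."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/metrics")
        text = conn.getresponse().read().decode()
    finally:
        conn.close()
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return samples


server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

samples = scrape("127.0.0.1", server.server_port)
print(samples)

server.shutdown()
server.server_close()
```

The application never pushes anything: it only answers when asked. A real collector would repeat the scrape at a fixed interval and store each sample with a timestamp.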

Prometheus remains the most common choice for metrics. The combination of OpenTelemetry and Prometheus was adopted by 71% of the companies that responded to the latest Observability Survey.

Grafana is a data visualization and analysis platform. It allows the creation of customizable dashboards for metrics, logs and traces, connecting to many data sources such as Prometheus, Loki, Elastic, InfluxDB and hundreds of other systems.

It is the most popular tool today for visualizing and monitoring operational information, widely used because of its flexibility, its plugin ecosystem and its open source license.
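
A dashboard panel backed by Prometheus is ultimately driven by a PromQL expression. As a small, hedged sketch, the snippet below builds the URL for an instant query against the Prometheus HTTP API. The base URL assumes a local server on the default port, the helper name is hypothetical and the expression reuses the http_requests_total counter from the metrics example; no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

PROMETHEUS_URL = "http://localhost:9090"  # assumption: local Prometheus server


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query on the /api/v1/query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


# Per-second rate of requests to /api/users, averaged over the last 5 minutes
# and summed across all remaining labels (method, status, ...).
promql = 'sum(rate(http_requests_total{endpoint="/api/users"}[5m]))'

url = instant_query_url(PROMETHEUS_URL, promql)
print(url)

# Decoding the URL again recovers the original expression.
print(parse_qs(urlsplit(url).query)["query"][0])
```

The expression itself is the interesting part: rate() turns an ever-growing counter into requests per second, and sum() collapses the individual label combinations into one series for the panel.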

In general, although some companies such as Grafana Labs offer complete observability ecosystems, the most common scenario is a combination of products from different vendors, all connected through the OTel standard.

This information and more details about the current state of the observability ecosystem can be found in the Observability Survey 2025 from Grafana Labs.

Conclusion

Observability has become a necessity for any modern system. As applications grow in complexity, the ability to understand the internal behavior of services to respond quickly to incidents and identify bottlenecks becomes essential to maintain trust, performance and user experience.

With logs, metrics and traces working together and with standards like OpenTelemetry unifying the ecosystem, observability stops being a set of isolated tools and becomes a structured practice that brings more transparency to systems.

Now that we understand the core concepts, the next step is to see all of this in practice. In the next part we will instrument a real application using Prometheus for metric collection and Grafana for visualization, building a complete observability pipeline from scratch.