In the first part of this series, we introduced observability and discussed its significance, particularly in microservice architectures. Now, we shift our focus to the tools that will bring observability to life: Grafana and Prometheus. These are powerful platforms that allow us to visualize and monitor our systems effectively. In this post, we will introduce key observability concepts that one must understand before diving deeper into Grafana and Prometheus.

Grafana

Grafana is an open-source platform for monitoring and observability, primarily used to visualize metrics, logs, and traces through customizable dashboards. It integrates with a wide variety of data sources, including Prometheus, and offers multiple types of visualizations such as line charts, heatmaps, and bar graphs. Grafana also provides alerting features, enabling users to set up and manage alerts that are triggered by predefined conditions. In this series, we will not cover alerting, but will instead focus on visualizing metrics and traces.

Prometheus

Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability in modern, cloud-native environments. It focuses on metrics as the primary data type and uses a time-series database to store this data. Prometheus collects data by “scraping” HTTP endpoints exposed by services: it retrieves metrics at regular intervals and stores them in its time-series database. In this series, we treat Prometheus as the database that stores our metrics, which are then displayed in Grafana.
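
To make the scraping model concrete, here is a minimal sketch in Python (standard library only) of a service exposing a /metrics endpoint in the Prometheus text exposition format. The metric name app_requests_total, the REQUEST_COUNT variable, and port 8000 are assumptions made up for illustration. A real service would normally use an official client library rather than building the output by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter; a real service would update it per request.
REQUEST_COUNT = 0


def render_metrics(request_count: int) -> str:
    """Render one counter in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Total requests handled.",
        "# TYPE app_requests_total counter",
        f"app_requests_total {request_count}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current metric values whenever Prometheus scrapes /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNT).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Start the endpoint; port 8000 is an arbitrary choice for this sketch."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint. Prometheus would then be configured to scrape it at a fixed interval, and each scrape records the current counter value as a new sample.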

Key observability concepts in Grafana/Prometheus

These concepts together form the foundation of effective monitoring and visualization:

  • Metrics: Metrics are numerical representations of system data that we can measure over time. Examples of common metrics include CPU utilization, memory usage, request counts, and error rates. Prometheus is responsible for collecting and storing these metrics.
  • Time-Series Data: Prometheus stores metrics as time-series data, meaning each metric is recorded as a sequence of samples, each with a timestamp and a value. This allows for efficient querying of historical data over specified periods. For example, we can track how CPU usage has changed over the past hour, day, or week.
  • Labels: Labels are key-value pairs attached to metrics in Prometheus, providing additional context. Labels allow us to filter, group, and differentiate between instances of metrics. For example, if we are monitoring CPU usage across different servers, we can use labels such as server_name="server1" to identify the source of the metric.
  • Trace: A trace represents the complete journey of a request across a system. Each segment of the journey (from service to service) is called a span. Spans carry information such as the service name, timestamp, duration, and contextual metadata (like HTTP status codes). By aggregating spans, we create a full trace of how a request traveled through the system, providing insight into bottlenecks or failures.
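
The concepts above can be sketched as a small in-memory model. This is not how Prometheus or any tracing backend is implemented; the class and function names (TinyTSDB, series_key, Span, assemble_trace) and all sample values are hypothetical. The sketch only shows that a time series is identified by a metric name plus its labels, that each sample is a (timestamp, value) pair, and that a trace is the set of spans sharing one trace ID.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeriesKey:
    """A series is identified by its metric name and its full label set."""
    name: str
    labels: frozenset  # frozenset of (label_name, label_value) pairs


def series_key(name: str, **labels: str) -> SeriesKey:
    return SeriesKey(name, frozenset(labels.items()))


class TinyTSDB:
    """Toy store: maps each series to a list of (timestamp, value) samples."""

    def __init__(self):
        self._series = {}

    def append(self, key: SeriesKey, timestamp: int, value: float) -> None:
        self._series.setdefault(key, []).append((timestamp, value))

    def select(self, name: str, **matchers: str) -> dict:
        """Return every series with this name whose labels include the matchers."""
        wanted = set(matchers.items())
        return {
            key: samples
            for key, samples in self._series.items()
            if key.name == name and wanted <= key.labels
        }


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    start_ms: int
    duration_ms: int


def assemble_trace(spans, trace_id: str):
    """Collect the spans of one trace, ordered by start time."""
    return sorted(
        (s for s in spans if s.trace_id == trace_id),
        key=lambda s: s.start_ms,
    )


db = TinyTSDB()
db.append(series_key("cpu_usage", server_name="server1"), 1700000000, 0.42)
db.append(series_key("cpu_usage", server_name="server1"), 1700000015, 0.47)
db.append(series_key("cpu_usage", server_name="server2"), 1700000000, 0.10)
server1 = db.select("cpu_usage", server_name="server1")

spans = [
    Span("t1", "b", "a", "orders", 5, 80),
    Span("t1", "a", None, "gateway", 0, 100),
    Span("t1", "c", "b", "payments", 20, 60),
    Span("t2", "x", None, "gateway", 0, 10),
]
trace = assemble_trace(spans, "t1")
```

Here server1 holds a single series with two samples, while selecting cpu_usage without a label matcher returns both servers' series. The assembled trace lists the gateway, orders, and payments spans in the order the request reached them.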

How Grafana and Prometheus interact

  • Data collection: Services expose metrics, which Prometheus scrapes and ingests at regular intervals. These metrics include, but are not limited to, CPU usage, memory consumption, request and error rates, throughput, and latency. Traces are not stored in Prometheus itself; they are sent to a dedicated tracing backend (for example, Grafana Tempo or Jaeger). Services today are typically instrumented with an existing framework such as OpenTelemetry.
  • Data querying: Once metrics are ingested by Prometheus, we can use PromQL, Prometheus's query language, to retrieve them and display them in graphical user interfaces provided by platforms such as Grafana.
  • Data visualization: A popular Grafana feature is the dashboard, which lets us create multiple panels that display metrics ingested by Prometheus. Grafana can also display distributed traces from a tracing data source, showing spans across different services.
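
As a rough illustration of the querying step, the sketch below builds a request URL for the Prometheus HTTP API instant-query endpoint (/api/v1/query) and parses a response in the shape that endpoint returns. The base URL http://localhost:9090, the metric http_requests_total, its labels, and the sample response are assumptions for illustration only; no network call is made. Grafana issues queries of this kind on our behalf when a panel uses a Prometheus data source.

```python
import json
from urllib.parse import urlencode

# Hypothetical PromQL: per-second rate of HTTP 500 responses over 5 minutes.
PROMQL = 'rate(http_requests_total{status="500"}[5m])'


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload: str):
    """Turn an instant-vector response into (labels, value) pairs."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError("query failed")
    # Each result carries its labels and a [timestamp, "value"] pair;
    # the value arrives as a string and is converted to float here.
    return [
        (item["metric"], float(item["value"][1]))
        for item in body["data"]["result"]
    ]


# Made-up response body, shaped like a successful instant-query result.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"status": "500", "service": "orders"},
                "value": [1700000000, "0.25"],
            }
        ],
    },
})

url = instant_query_url("http://localhost:9090", PROMQL)
results = parse_vector(SAMPLE_RESPONSE)
```

Each parsed result pairs a label set with a numeric value, which is the kind of data a dashboard panel plots.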

Conclusion

In this post, we introduced the core observability concepts behind Grafana, Prometheus, and distributed tracing, including metrics, labels, time-series data, and traces. We also covered how Grafana and Prometheus interact to collect, store, and visualize data. While we are using Grafana and Prometheus as our observability solution, note that these concepts are not exclusive to these tools. In the next part of the series, we’ll explore how to implement observability with an example.