This repository is an overview of Prometheus and Grafana. Understanding these two observability tools is important for seeing how much they can help us.
Observability is understanding or inferring the internal state and behavior of a system from its external outputs.
In the context of software engineering, observability refers to the ability to monitor and understand the behavior of a software application in real-time, especially in production environments.
1.1. What Data to Collect
Logs: Records of events in a system.
Metrics: Also known as Telemetry Data.
Measurements of a system's performance, e.g., CPU usage.
Traces: Records of a request as it moves through the system.
1.2. What is Time Series data?
A piece of data that has the following attributes:
Timestamp: When was the data stored?
Metric or Name: What data is this?
Value: The value of the data.
Example:
| Timestamp | Metric Name | Value |
| --- | --- | --- |
| 13:56:20 | CPU Usage | 63 |
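As a minimal sketch, the three attributes above could be modeled in Python as follows. The `DataPoint` class and its field names are hypothetical, chosen only for illustration; the first point reuses the values from the example, and the second point is invented to show ordering.

```python
from dataclasses import dataclass
from datetime import time
from typing import List


@dataclass(frozen=True)
class DataPoint:
    """One time series data point (hypothetical illustration type)."""
    timestamp: time      # when the data was stored
    metric_name: str     # what the data is
    value: float         # the value of the data


# The example from the text: CPU usage of 63 at 13:56:20.
point = DataPoint(timestamp=time(13, 56, 20), metric_name="CPU Usage", value=63)

# A time series is a sequence of such points ordered by timestamp.
# The second point is an invented value, added only to show ordering.
series: List[DataPoint] = sorted(
    [DataPoint(time(13, 56, 30), "CPU Usage", 58), point],
    key=lambda p: p.timestamp,
)
```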
1.3. What is a Time Series Database?
A time series database:
Stores time series data.
Has special functions to work with time series data.
Has a special query language to query time series data.
Is performance-tuned for working with time series data.
Examples of Time Series databases:
Prometheus - Arguably the most common TSD (Time Series Database).
InfluxDB - High performance read and write.
Graphite - Original TSD of Grafana.
Amazon Timestream (AWS) - TSD on AWS.
1.4. How is Time Series data collected?
Push: Data is written to the database by the application or an agent.
Scrape: Data is read by the database from the monitored targets.
Examples of Push-based TSDs: Graphite and InfluxDB.
Example of a Scrape-based TSD: Prometheus.
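The difference between the two models is mostly about who initiates the transfer. The sketch below contrasts them with two in-memory stand-ins; `PushDatabase`, `ScrapeDatabase`, and the metric name `cpu_usage` are hypothetical names for illustration and do not correspond to any real database API.

```python
from typing import Callable, Dict, List, Tuple

# (timestamp, metric name, value)
Sample = Tuple[float, str, float]


class PushDatabase:
    """Push model: sources call write() to send samples to the database."""

    def __init__(self) -> None:
        self.samples: List[Sample] = []

    def write(self, timestamp: float, name: str, value: float) -> None:
        self.samples.append((timestamp, name, value))


class ScrapeDatabase:
    """Scrape model: the database reads current values from its targets."""

    def __init__(self) -> None:
        self.targets: List[Callable[[], Dict[str, float]]] = []
        self.samples: List[Sample] = []

    def add_target(self, target: Callable[[], Dict[str, float]]) -> None:
        self.targets.append(target)

    def scrape(self, timestamp: float) -> None:
        # The database initiates the read and assigns the timestamp itself.
        for target in self.targets:
            for name, value in target().items():
                self.samples.append((timestamp, name, value))


push_db = PushDatabase()
push_db.write(1.0, "cpu_usage", 63.0)  # the source initiates the write

scrape_db = ScrapeDatabase()
scrape_db.add_target(lambda: {"cpu_usage": 63.0})
scrape_db.scrape(timestamp=1.0)  # the database initiates the read
```

Both databases end up holding the same sample; only the direction of the call differs.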
1.5. How is data pushed to a TSD?
Data is pushed by a Network Daemon or Agent.
Examples:
StatsD: Network daemon for collecting and aggregating data from different sources and pushing it to Graphite.
Telegraf: Data collector agent for InfluxDB that collects and aggregates data from different sources and pushes it to InfluxDB.
Fluentd: Open-source data collector designed for collecting and forwarding logs, events, and metrics from various sources.
OpenTelemetry: Set of tools and APIs to collect and forward data.
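To illustrate what a push looks like on the wire, the sketch below formats a metric in the plain-text StatsD line format (`<name>:<value>|<type>`) and sends it over UDP. The function names, the metric names, and the address `127.0.0.1:8125` (the conventional StatsD port) are assumptions for illustration, and sending only has an effect if a StatsD daemon is actually listening there.

```python
import socket


def format_statsd(name: str, value: float, metric_type: str) -> bytes:
    """Build one StatsD line: <name>:<value>|<type>.

    Common types are "c" (counter), "g" (gauge), and "ms" (timer).
    """
    return f"{name}:{value}|{metric_type}".encode("ascii")


def push_metric(line: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one line to a StatsD daemon over UDP (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line, (host, port))


# Hypothetical metrics: one received order and a CPU usage reading.
order_line = format_statsd("orders.received", 1, "c")
cpu_line = format_statsd("cpu.usage", 63, "g")
```

Calling `push_metric(order_line)` would hand the counter increment to the daemon, which aggregates it and forwards the result to the database.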
2. Telemetry
2.1. What is Telemetry?
Telemetry is the automatic recording and transmission of data from remote or inaccessible sources to an IT system in a different location for monitoring and analysis.
In software, it refers to collecting business and diagnostic data from software in production, then storing and visualising it to diagnose production issues or improve the business.
2.2. Examples of Telemetric Data
Average time taken to connect to a database over time.
The number of received orders per minute.
The average value of refunds per day.
How many errors and exceptions occur?
What is the response time?
How many times is the API called?
How many servers?
How many users from Brazil?
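Most of these examples are simple aggregations over timestamped events. As a rough sketch, the number of received orders per minute could be derived like this; the function name and the sample timestamps are invented for illustration.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable


def orders_per_minute(order_times: Iterable[datetime]) -> Counter:
    """Count orders in each one-minute bucket."""
    # Truncating seconds and microseconds maps every order to its minute.
    return Counter(t.replace(second=0, microsecond=0) for t in order_times)


# Invented sample timestamps, for illustration only.
orders = [
    datetime(2022, 1, 1, 10, 0, 5),
    datetime(2022, 1, 1, 10, 0, 40),
    datetime(2022, 1, 1, 10, 1, 15),
]
counts = orders_per_minute(orders)
```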
2.3. What is the challenge?
Organizations rely more and more on telemetric data.
Companies want to bring different data together.
Telemetric data reside in different datasources.
3. Grafana
Open-source software that:
Visualizes Time Series (telemetry) Data.
Time Series data has timestamps attached to it, e.g., one order at 01/01/2022.
Brings data from different datasources together.
Defines alerts.
Is extensible through plugins.
Supports multiple organizations.
3.1. Alerts in Grafana
Alerts are defined on a Graph Panel.
Each Graph Panel can have one to many alerts.
An alert is raised when a Rule is violated.
A Rule indicates if a value on the graph is above or below a threshold.
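Conceptually, evaluating such a rule is a threshold comparison. The sketch below is not Grafana's API; `Rule`, `raised_alerts`, the rule names, and the thresholds are all hypothetical and only illustrate the idea of several alerts attached to one panel.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    """Hypothetical threshold rule: violated when the value crosses it."""
    name: str
    threshold: float
    direction: str  # "above" or "below"

    def is_violated(self, value: float) -> bool:
        if self.direction == "above":
            return value > self.threshold
        if self.direction == "below":
            return value < self.threshold
        raise ValueError(f"unknown direction: {self.direction!r}")


def raised_alerts(rules: List[Rule], value: float) -> List[str]:
    """Return the names of all rules that the given value violates."""
    return [rule.name for rule in rules if rule.is_violated(value)]


# One panel with two alerts, evaluated against the latest graph value.
panel_rules = [
    Rule("High CPU", threshold=80, direction="above"),
    Rule("Low CPU", threshold=5, direction="below"),
]
```

With these assumed thresholds, a value of 92 violates only the "High CPU" rule, while a value of 63 violates none.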
There are a number of libraries and servers which help in exporting existing metrics from third-party systems as Prometheus metrics.
This is useful for cases where it is not feasible to instrument a given system with Prometheus metrics directly (for example, HAProxy or Linux system stats).
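To make the idea concrete, here is a minimal sketch of an exporter that uses only the Python standard library: it reads a Linux system stat and serves it at `/metrics` in the Prometheus text exposition format. The metric name `demo_load1`, the port, and the helper names are assumptions for illustration; real exporters are usually built on an official client library, and this sketch does not escape special characters in label values.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Optional


def render_gauge(name: str, help_text: str, value: float,
                 labels: Optional[Dict[str, str]] = None) -> str:
    """Render one gauge in the Prometheus text exposition format."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {value}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Translate an existing system stat into a Prometheus metric.
        load1 = os.getloadavg()[0]  # Unix-only call
        body = render_gauge("demo_load1", "1-minute load average", load1)
        payload = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Serve /metrics until interrupted (port 8000 is an arbitrary choice)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; Prometheus would then be configured to scrape that address as a target.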
4.2. Retrieving Metrics
4.2.1. Data Model
Prometheus stores data as time series.
Every time series is identified by metric name and labels.
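The sketch below illustrates what this identity means in practice: two samples belong to the same time series only if both the metric name and every label match. The in-memory `storage` dictionary, the helper names, and the metric `http_requests_total` with its labels are hypothetical and say nothing about how Prometheus stores data internally.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]


def series_key(name: str, labels: Dict[str, str]) -> SeriesKey:
    """Identity of a series: metric name plus the full set of labels."""
    return (name, frozenset(labels.items()))


def format_series(name: str, labels: Dict[str, str]) -> str:
    """Render the identity in the usual name{label="value"} notation."""
    pairs = ", ".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{pairs}}}"


# Samples are (timestamp, value) pairs grouped under their series identity.
storage: Dict[SeriesKey, List[Tuple[float, float]]] = defaultdict(list)


def append(name: str, labels: Dict[str, str], timestamp: float, value: float) -> None:
    storage[series_key(name, labels)].append((timestamp, value))


get_orders = {"method": "GET", "handler": "/orders"}
append("http_requests_total", get_orders, 1.0, 10)
# Same name and labels (label order does not matter): same series.
append("http_requests_total", {"handler": "/orders", "method": "GET"}, 2.0, 12)
# One label value differs: a separate series.
append("http_requests_total", {"method": "POST", "handler": "/orders"}, 2.0, 3)
```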