-
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud.
-
Prometheus collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.
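For illustration, one sample in Prometheus's text exposition format carries the metric name, its labels, the value, and an optional timestamp (the metric below is hypothetical):

```
http_requests_total{method="post",code="200"} 1027 1712345678000
```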
-
Features:
Prometheus's main features are:
- a multi-dimensional data model with time series data identified by metric name and key/value pairs
- PromQL, a flexible query language to leverage this dimensionality
- time series collection happens via a pull model over HTTP
- pushing time series is supported via an intermediary gateway
- targets are discovered via service discovery or static configuration
- multiple modes of graphing and dashboarding support
- Black box monitoring
- White box monitoring
- Black box monitoring:
- As end users (meaning us), we test the app from the outside: for example, whether login works, or whether our posts get published in a social media app. These are the kinds of tests we can run/monitor as end users.
- White box monitoring:
- Monitoring of the application's internals. Only the internal team (Facebook's own team, or any other company's team) monitors this.
- Example:
- Server CPU, RAM, and disk usage
- Number of requests
- Latency: time taken to respond to our requests
- Log monitoring
- There are four golden signals for good application monitoring; we will follow these in the ELK stack as well.
- Latency
- Traffic
- Saturation (how full resources such as memory and CPU are)
- Errors
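As an illustration, each signal maps to a PromQL query; the HTTP metric names below are hypothetical and depend on how the application is instrumented (the memory metrics are standard node_exporter ones):

```promql
rate(http_requests_total[5m])          # traffic: requests per second
rate(http_requests_errors_total[5m])   # errors: error rate
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))  # latency: 95th percentile
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes               # saturation: memory headroom
```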
- CCTV (analogy):
- CC cameras
- centralized monitors
- CC cameras capture the video, and the centralized system pulls those videos.
- Live/instant video
- Historical video
- Pull & Push
- the pull model has agents on the nodes
- the push model does not have agents
- Prometheus is pull-model monitoring.
- we need to install the node_exporter agent on the nodes; that way Prometheus will fetch the data from the nodes every 5 or 10 seconds (as we configure).
- Time Series Database:
- a database that stores every value we collect together with the date and time it was recorded is called a time series database.
-
29-JUL  500
30-JUL  600
31-JUL  499

Such data can then be aggregated weekly, quarterly, or yearly.
-
Prometheus will fetch the data from the nodes and store it in the time series database.
-
When a user queries Prometheus over HTTP, Prometheus reads from its time series database and returns the results to the user.
-
-
Create an EC2 instance for Prometheus
-
Login to EC2
cd /opt
-
Download the prometheus for Linux from official site:
https://prometheus.io/download
wget https://github.com/prometheus/prometheus/releases/download/v2.54.0-rc.0/prometheus-2.54.0-rc.0.linux-amd64.tar.gz
-
extract the prometheus tar file
tar -xvzf prometheus-2.54.0-rc.0.linux-amd64.tar.gz
-
Rename the extracted prometheus directory to a short name, just for convenience.
mv prometheus-2.54.0-rc.0.linux-amd64 prometheus
-
Create systemd unit file (service file) for prometheus
```ini
# vim /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target

[Service]
Restart=on-failure
ExecStart=/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml

[Install]
WantedBy=multi-user.target
```
-
Start and enable the prometheus service (systemctl daemon-reload, systemctl start prometheus, systemctl enable prometheus), then check its status.
-
netstat -lntp
- Prometheus listens on port 9090.

```
[ root@ip-172-31-32-21 /opt/prometheus ]# netstat -lntp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address    Foreign Address  State   PID/Program name
tcp   0      0      0.0.0.0:22       0.0.0.0:*        LISTEN  1245/sshd: /usr/sbi
tcp6  0      0      :::9090          :::*             LISTEN  1605/prometheus
tcp6  0      0      :::22            :::*             LISTEN  1245/sshd: /usr/sbi
```
-
Allow the port 9090 in prometheus EC2 instance’s security group.
-
Take the public IP of the EC2 instance and open prometheus in a browser: http://<public-IP-of-EC2>:9090
-
Prometheus monitors itself (the same localhost) as well as the other nodes.
- `up` - instant data (current value)
- `up[1m]` - historical data, i.e. the last 1 minute of samples
- Refer to the below article for a PromQL cheat sheet.
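For illustration, a few starter PromQL queries (the `job` label value is whatever you set in the scrape config):

```promql
up                         # instant vector: current up (1) / down (0) state of every target
up[1m]                     # range vector: the samples from the last 1 minute
up{job="nodes"}            # filter by the job label from the scrape config
avg_over_time(up[5m])      # fraction of the last 5 minutes each target was up
```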
-
The scrape configuration specifies which node to monitor, how frequently to scrape it (5s, 15s, 1m), the job name, and other options; scrape_interval sets the frequency.
```yaml
# prometheus.yml — my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
```
-
If you click the globe icon in the prometheus UI, you can browse all available metrics.
-
In the GUI > Status > Configuration we can see all of its configuration data.
-
Create 2 EC2 nodes
-
log in to them
-
download and install node_exporter from the Prometheus downloads page (https://prometheus.io/download/)
cd /opt
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
-
extract it and rename the directory to the short name node_exporter, just for convenience.
-
cd node_exporter
- you will see the node_exporter binary here. This software/tool runs on the nodes and collects all the data; it then provides that data to prometheus whenever prometheus asks for it.
- rather than running node_exporter directly, we will run it as a service.
create node_exporter service file.
```ini
# vim /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network-online.target

[Service]
Restart=on-failure
ExecStart=/opt/node_exporter/node_exporter

[Install]
WantedBy=multi-user.target
```
-
Start and enable node_exporter service
```shell
systemctl daemon-reload
systemctl start node_exporter
systemctl enable node_exporter
systemctl status node_exporter
```
-
netstat -lntp
- node_exporter listens on port 9100.
Add a new job for the nodes in the prometheus config file, so that it will monitor them.
```yaml
# prometheus.yml
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
        labels:
          name: prometheus
  - job_name: "nodes"
    static_configs:
      - targets: ["172.31.43.191:9100"]
        labels:
          name: node-1
```
-
Restart the prometheus service.
systemctl restart prometheus
-
check the logs
less /var/log/messages
- it will load the config file and get ready to receive metrics.
-
Now check the prometheus UI; you will see the node there, along with its status.
- you can query it with `up`, `up[1m]`, etc.
-
For test purposes, shut down the node and check its status again in prometheus.
-
Go to official grafana documentation > Docs > opensource > grafana > Install on linux option > select RHEL from left
```shell
wget -q -O gpg.key https://rpm.grafana.com/gpg.key
# If gpg.key does not download, open the URL in a browser and create a file named gpg.key on the server with its contents.
sudo rpm --import gpg.key
```
-
Create /etc/yum.repos.d/grafana.repo with the following content:
```ini
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
```
-
install Grafana:
sudo dnf install grafana
-
Complete the following steps to start the Grafana server using systemd and verify that it is running.
```shell
sudo systemctl daemon-reload
sudo systemctl start grafana-server
sudo systemctl status grafana-server
sudo systemctl enable grafana-server.service
```
-
netstat -lntp
- Grafana runs on port 3000.
Note: for Grafana there is no need to create a systemd service file; the RPM package provides one.
Note: we have installed grafana on the prometheus server itself, so we use the prometheus server's public IP to connect to grafana.
-
Take the public IP of prometheus and browse it with 3000 port for grafana.
http://<public-IP-of-prometheus-server>:3000
username: admin
pwd: admin
give new pwd: anything
-
In the Grafana UI, do the following to add prometheus as a connection.
- Go to Connections > Add new connection > search for prometheus and select it > add new data source > for the prometheus server URL, give the localhost URL http://localhost:9090 > save and test.
- Grafana now takes its input data from the local prometheus.
-
Follow below steps in grafana to create a dashboard.
-
- Dashboards > create dashboard > add visualization > prometheus > click code > give a query such as `up` or `up[1m]` > explore the options > save it. This creates a dashboard in grafana.
-
- In real-time environments as well, we can install prometheus and grafana on the same server; there is then no network latency, which speeds up data fetches.
- If there are any open-source grafana dashboards, we can import them using the dashboard ID (any 4- or 5-digit ID, e.g. 1235 or 78612).
-
The service discovery component finds newly and automatically created VMs (e.g. from auto scaling groups and replica sets) and adds them to node monitoring in prometheus.
-
Prometheus should find the nodes automatically using service discovery and add them to the scrape list.
- scrape list means the list of node targets to be monitored.
-
The underlying nodes should have node_exporter installed and started.
-
To find the EC2 instances automatically, prometheus describes the EC2 instances to get their IPs and other details.
-
So, the prometheus server needs the describe permission in AWS.
- In AWS, Go to IAM > create role > give role name
- Go to policy > create policy > EC2 > describe instances > provide name for policy.
- Go to role > attach newly created EC2 describe policy here.
- Go to prometheus EC2 in AWS > actions > security > modify IAM role > choose the new role we created
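As a sketch, the describe policy from the steps above can be as small as this JSON (assuming only EC2 service discovery is needed; scope it down further if your security standards require):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:DescribeInstances",
      "Resource": "*"
    }
  ]
}
```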
-
If the Monitoring tag is set to true on an EC2 instance, prometheus will enable monitoring for it.
- Go to EC2 > Tags > Monitoring: true # set the monitoring tag to true
**example:**

```yaml
# add under scrape_configs:
  - job_name: "ec2_sd"
    ec2_sd_configs:
      - region: us-east-1   # was eu-east-1, which is not a valid AWS region
        port: 9100
        filters:
          - name: tag:Monitoring
            values:
              - "true"
    relabel_configs:
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance_id
      - source_labels: [__meta_ec2_private_ip]
        target_label: private_ip
      - source_labels: [__meta_ec2_tag_Name]   # tag keys are case-sensitive; the default AWS name tag is "Name"
        target_label: name
```
-
Alerting with Prometheus is separated into two parts. Alerting rules in Prometheus servers send alerts to an Alertmanager. The Alertmanager then manages those alerts, including silencing, inhibition, aggregation and sending out notifications via methods such as email, on-call notification systems, and chat platforms.
-
The main steps to setting up alerting and notifications are:
- Setup and configure the Alertmanager
- Configure Prometheus to talk to the Alertmanager
- Create alerting rules in Prometheus
- we configure the rules for the alerting mechanism in the prometheus configuration file
-
Below is an example alerting rule that creates an alert for instance-down events.

```yaml
# vim instancedown.yaml
groups:
  - name: InstanceDown
    rules:
      - alert: InstanceDownalert
        expr: up < 1   # up < 1 means up == 0, i.e. the system is down; this raises an active alert
        for: 1m        # the condition must hold for 1 minute before the alert fires
        labels:
          severity: critical   # severity levels: critical, warning, error, info
        annotations:
          summary: Instance is down
```
-
Now configure the prometheus configuration file to read these alerting rule files.
```yaml
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "alert_rule/*.yaml"   # fetch all the alerting rules under the `alert_rule` directory
  # - "second_rules.yml"
```
-
Restart the prometheus service and check the prometheus UI > Alerts: you will see the alert raised (make sure you tick all the check boxes).
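Pending and firing alerts can also be queried from the same UI via the built-in `ALERTS` series; the alert name below matches the example rule, so adjust it if yours differs:

```promql
ALERTS{alertname="InstanceDownalert", alertstate="firing"}
```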
-
Installing and configuring Alertmanager:
-
Go to prometheus official documentation
https://prometheus.io/download/
-
cd /opt
-
Download alertmanager on the Prometheus EC2 instance.
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
-
Extract the alertmanager tar file.
tar -xvzf alertmanager-0.27.0.linux-amd64.tar.gz
-
Rename the extracted directory
mv alertmanager-0.27.0.linux-amd64 alertmanager
-
Create alertmanager service file
```ini
# vim /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
WorkingDirectory=/opt/alertmanager/
ExecStart=/opt/alertmanager/alertmanager --config.file=/opt/alertmanager/alertmanager.yml

[Install]
WantedBy=multi-user.target
```
-
Configure the `alertmanager.yml` file with your email and SMTP endpoint details.
- In AWS > SES > create an identity with your email address > then verify it from Gmail.
- Go to SMTP settings > create SMTP credentials > create user > note down the username and password.

```yaml
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'ses'
receivers:
  - name: 'ses'
    email_configs:
      - smarthost: email-smtp.us-east-1.amazonaws.com:587   # this SMTP endpoint is the same for all AWS users in the region
        auth_username: your-username   # change to your SMTP username
        auth_password: your-password   # change to your SMTP password
        from: your-from-address        # add your from email
        to: your-to-address
        headers:
          subject: Prometheus Mail Alert
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
```
-
restart the service
```shell
systemctl daemon-reload
systemctl start alertmanager
systemctl enable alertmanager
systemctl status alertmanager
```
-
netstat -lntp
- alertmanager listens on port 9093 (and 9094 for clustering).
Now go to `prometheus.yml` and configure it to talk to alertmanager:

```yaml
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
```
-
Restart the prometheus service.
```shell
systemctl restart prometheus
systemctl status prometheus
```
-
Go to prometheus > Alerts and check whether the active alert is `firing`.
-
Once the alert status is `firing` in prometheus, open a browser with the prometheus server's public IP address: http://<public-IP-of-prometheus-server>:9093 (port 9093 serves the alertmanager UI).
- this works because alertmanager is installed on the prometheus server itself
-
Check in the alertmanager UI to see if it received that alert.
- we should also have received an email in our gmail from alertmanager.
-
-
ELK is a powerful set of tools used for searching, analyzing, and visualizing large volumes of data in real-time. The acronym ELK stands for Elasticsearch, Logstash, and Kibana.
- Elasticsearch:
- Purpose: It is a search and analytics engine.
- Function: Stores data and allows you to search and analyze it quickly and in real-time.
- Elasticsearch is like a repository which holds/stores data, logs, etc.
- Key Feature: It's highly scalable and can handle large amounts of data.
- Logstash:
- Logstash filters/stashes the logs, converts them into the format we want, and then sends the logs to elasticsearch.
- it filters the logs and converts semi-structured/unstructured logs into structured logs as per our needs.
- It is better to have elasticsearch and Kibana on the same server.
- Purpose: It is a data processing pipeline.
- Function: Collects, processes, and forwards data.
- Key Feature: Can take data from multiple sources, transform it, and send it to Elasticsearch (or other destinations).
- Kibana:
- Kibana will query the logs and provide the visualization.
- Purpose: It is a visualization tool.
- Function: Allows you to create and share dynamic dashboards based on the data stored in Elasticsearch.
- Key Feature: Provides powerful charts and graphs to visualize data insights.
-
Creation of resources for ELK:
- create an EC2 instance and name it ELK
- follow the steps in this URL to install all the resources below: https://github.com/daws-78s/concepts/blob/main/elk.MD
- in the ELK EC2, install the below resources:
- Elasticsearch
- Logstash
- Kibana
- Once the above steps are done, create an EC2 instance and name it frontend.
- run the frontend shell script from the expense-shell git repo.
- access the web UI with the public IP and click around a few times to generate logs.
- Install filebeat on the frontend server.
-
Generally, web application logs (and most other logs) are unstructured or semi-structured. So, to turn them into structured logs, we use the filebeat agent to send the log data to elasticsearch; kibana then filters it and gives us the structured data.
-
Filebeat:
- install filebeat on the frontend server; whichever server's application logs we want to monitor must have the filebeat agent installed (just like crowdstrike, tanium, node_exporter, etc.)
- Now go to the elasticsearch/kibana UI > Discover > Stack Management > Index Management > Index pattern > give any name > @timestamp > save
- developers will check the logs in kibana
- in this case, filebeat sends the logs to logstash first to filter and format them, and then on to elasticsearch.
- check the log messages on the frontend to confirm that filebeat connected to elasticsearch and sent the logs.
- for filebeat, the application logs are the input, and it sends those logs to elasticsearch as the output.
- filebeat reads the logs and pushes them to elasticsearch
- most logs are semi-structured; only DB data is structured.
- to see how the logs are structured, check the filebeat config file.
- each line stored in the logs (/var/log/messages) is pushed to elasticsearch; in the UI we can see the log count.
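A minimal filebeat.yml for the setup above might look like this sketch; the log path and the `<ELK-server-IP>` placeholder are assumptions to adjust for your environment:

```yaml
filebeat.inputs:
  - type: log                          # read log lines from files
    enabled: true
    paths:
      - /var/log/nginx/access.log      # assumed frontend log path
output.logstash:                       # ship to logstash first for grok filtering,
  hosts: ["<ELK-server-IP>:5044"]      # matching the beats input on port 5044
# or, to skip logstash and send straight to elasticsearch:
# output.elasticsearch:
#   hosts: ["http://<ELK-server-IP>:9200"]
```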
-
In the elasticsearch UI, an index is nothing but the data prepared and made ready for visualization.
-
Filter:
-
input - filter(logstash) - output
-
in between input and output we have 'filter'.
-
so to filter the data we need to use grok patterns.
-
go to the nginx config file and change the log format so that it matches the grok pattern
-
sample logstash:
```
input {            # the filter takes its input from here
  beats {
    port => 5044
  }
}
filter {           # filter and format the data as per our needs
  grok {
    match => { "message" => "%{IP:client_ip} \[%{HTTPDATE:timestamp}\] %{WORD:http_method} %{URIPATH:request_path} %{NOTSPACE:http_version} %{NUMBER:status:int} %{NUMBER:response_size:int} \"%{URI:referrer}\" %{NUMBER:response_time:float}" }
  }
}
output {           # send the filtered data to elasticsearch
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "%{[@metadata][beat]}-%{[@metadata][version]}"
  }
}
```
-
-
Once the logstash config is done, generate a few logs from the frontend expense app by clicking around.
-
You can see them in the elasticsearch UI under the logs.
-
so now you can create a visualization for those logs
- In the elasticsearch UI > Analytics > Visualization > visualization library > create visualization > lens.
- now give values for the vertical and horizontal axes as per the requirement, then save it as a new dashboard.
-
Interview purpose:
-
what you do for elasticsearch:
- We install filebeat, elasticsearch, kibana, and logstash on the servers, then we send the logs/data to logstash.
- The ELK team also works closely with us, so we understand stashing using grok patterns in logstash; this filtered logstash data goes to elasticsearch, where we use visualizations.
- we also have access to the dashboard, where we monitor latency, errors, and traffic.