ansible-prometheus's Introduction

DEPRECATED

This role has been deprecated in favor of the prometheus-community/ansible collection.

Ansible Role: prometheus

Description

Deploy the Prometheus monitoring system using Ansible.

Upgradability notice

When upgrading from version <= 2.4.0 of this role to >= 2.4.1, please turn off your Prometheus instance. More information is in the 2.4.1 release notes.

Requirements

  • Ansible >= 2.7 (It might work on previous versions, but we cannot guarantee it)
  • jmespath on deployer machine. If you are using Ansible from a Python virtualenv, install jmespath to the same virtualenv via pip.
  • gnu-tar on Mac deployer host (brew install gnu-tar)

Role Variables

All variables which can be overridden are stored in the defaults/main.yml file, as well as in the table below.

Name Default Value Description
prometheus_version 2.27.0 Prometheus package version. Also accepts latest as parameter. Only prometheus 2.x is supported
prometheus_skip_install false Prometheus installation tasks get skipped when set to true.
prometheus_binary_local_dir "" Allows using local packages instead of the ones distributed on GitHub. It takes the path to a directory on the host where Ansible is run which holds both the prometheus AND promtool binaries. This overrides the prometheus_version parameter
prometheus_config_dir /etc/prometheus Path to directory with prometheus configuration
prometheus_db_dir /var/lib/prometheus Path to directory with prometheus database
prometheus_read_only_dirs [] Additional paths that Prometheus is allowed to read (useful for SSL certs outside of the config directory)
prometheus_web_listen_address "0.0.0.0:9090" Address on which prometheus will be listening
prometheus_web_config {} A Prometheus web config yaml for configuring TLS and auth.
prometheus_web_external_url "" External address on which prometheus is available. Useful when behind reverse proxy. Ex. http://example.org/prometheus
prometheus_storage_retention "30d" Data retention period
prometheus_storage_retention_size "0" Data retention limit by size ("0" disables size-based retention)
prometheus_config_flags_extra {} Additional configuration flags passed to prometheus binary at startup
prometheus_alertmanager_config [] Configuration responsible for pointing to where the Alertmanagers are. This should be specified as a list in YAML format. It is compatible with the official <alertmanager_config>
prometheus_alert_relabel_configs [] Alert relabeling rules. This should be specified as a list in YAML format. It is compatible with the official <alert_relabel_configs>
prometheus_global { scrape_interval: 60s, scrape_timeout: 15s, evaluation_interval: 15s } Prometheus global config. Compatible with official configuration
prometheus_remote_write [] Remote write. Compatible with official configuration
prometheus_remote_read [] Remote read. Compatible with official configuration
prometheus_external_labels environment: "{{ ansible_fqdn | default(ansible_host) | default(inventory_hostname) }}" Provide map of additional labels which will be added to any time series or alerts when communicating with external systems
prometheus_targets {} Targets which will be scraped. A better example is provided in our demo site
prometheus_scrape_configs defaults/main.yml#L58 Prometheus scrape jobs provided in same format as in official docs
prometheus_config_file "prometheus.yml.j2" Variable used to provide custom prometheus configuration file in form of ansible template
prometheus_alert_rules defaults/main.yml#L81 Full list of alerting rules which will be copied to {{ prometheus_config_dir }}/rules/ansible_managed.rules. Alerting rules can also be provided in other files located in {{ prometheus_config_dir }}/rules/ which have the *.rules extension
prometheus_alert_rules_files defaults/main.yml#L78 List of folders where ansible will look for files containing alerting rules which will be copied to {{ prometheus_config_dir }}/rules/. Files must have *.rules extension
prometheus_static_targets_files defaults/main.yml#L78 List of folders where ansible will look for files containing custom static target configuration files which will be copied to {{ prometheus_config_dir }}/file_sd/.

Relation between prometheus_scrape_configs and prometheus_targets

Short version

prometheus_targets is just a map used to create multiple files located in the "{{ prometheus_config_dir }}/file_sd" directory, where file names are composed from the top-level keys in that map with a .yml suffix. Those files store file_sd scrape target data, and they need to be read in prometheus_scrape_configs.

Long version

The part of the prometheus.yml configuration file which describes what is scraped by Prometheus is stored in prometheus_scrape_configs. This variable uses the same configuration options as described in the Prometheus docs.

Meanwhile, prometheus_targets is our way of adopting the Prometheus file_sd scrape type. It defines a map of files together with their content. The top-level keys are the base names of the files, which need their own scrape job in prometheus_scrape_configs, and the values are the content of those files.

All this means that you CAN use a custom prometheus_scrape_configs with prometheus_targets set to {}. However, anything you set in prometheus_targets needs to be mapped in prometheus_scrape_configs; if it isn't, you'll get an error in the preflight checks.

Example

Let's look at our default configuration, which shows all the features. By default we have this prometheus_targets:

prometheus_targets:
  node:  # This is a base file name. File is located in "{{ prometheus_config_dir }}/file_sd/<<BASENAME>>.yml"
    - targets:              #
        - localhost:9100    # All this is a targets section in file_sd format
      labels:               #
        env: test           #

Such a config will result in creating one file named node.yml in the {{ prometheus_config_dir }}/file_sd directory.
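
For reference, the generated file follows the standard file_sd format; a sketch of its likely content (not verbatim output of the role) would be:

# {{ prometheus_config_dir }}/file_sd/node.yml
- targets:
    - localhost:9100
  labels:
    env: test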

Next, this file needs to be loaded into the scrape config. Here is a modified version of our default prometheus_scrape_configs:

prometheus_scrape_configs:
  - job_name: "prometheus"    # Custom scrape job, here using `static_config`
    metrics_path: "/metrics"
    static_configs:
      - targets:
          - "localhost:9090"
  - job_name: "example-node-file-servicediscovery"
    file_sd_configs:
      - files:
          - "{{ prometheus_config_dir }}/file_sd/node.yml" # This line loads file created from `prometheus_targets`

Example

Playbook

---
- hosts: all
  roles:
  - cloudalchemy.prometheus
  vars:
    prometheus_targets:
      node:
      - targets:
        - localhost:9100
        - demo.cloudalchemy.org:9100
        labels:
          env: demosite

Demo site

The Prometheus organization provides a demo site for a full monitoring solution based on Prometheus and Grafana. The repository with the code and links to running instances is available on GitHub.

Defining alerting rules files

Alerting rules are defined in the prometheus_alert_rules variable. The format is almost identical to the one defined in the Prometheus 2.0 documentation. Due to similarities in templating engines, every template should be wrapped in {% raw %} and {% endraw %} statements. An example is provided in the defaults/main.yml file.
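
As a minimal sketch (this rule is illustrative, not copied from defaults/main.yml), the raw wrapping looks like this:

prometheus_alert_rules:
  - alert: InstanceDown
    expr: "up == 0"
    for: "5m"
    labels:
      severity: critical
    annotations:
      summary: "{% raw %}Instance {{ $labels.instance }} down{% endraw %}"
      description: "{% raw %}{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.{% endraw %}"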

Local Testing

The preferred way of locally testing the role is to use Docker and molecule (v2.x). You will have to install Docker on your system. See "Get started" for a Docker package suitable for your system. We use tox to simplify the process of testing on multiple Ansible versions. To install tox, execute:

pip3 install tox

To run tests on all Ansible versions (WARNING: this can take some time):

tox

To run a custom molecule command in a custom environment with only the default test scenario:

tox -e py35-ansible28 -- molecule test -s default

For more information about molecule go to their docs.

If you would like to run tests on a remote Docker host, just specify the DOCKER_HOST variable before running the tox tests.

CircleCI

Combining molecule and CircleCI allows us to test how new PRs will behave when used with multiple Ansible versions and multiple operating systems. This also allows us to create test scenarios for different role configurations. As a result we have quite a large test matrix which takes more time than local testing, so please be patient.

Contributing

See contributor guideline.

Troubleshooting

See troubleshooting.

License

This project is licensed under the MIT License. See LICENSE for more details.

ansible-prometheus's People

Contributors

aeber anisse asatblurbs bartoszcisek bngsudheer carpenterbees cloudalchemybot devil0000 dreig ecksun jkrol2 juliusv krzyzakp mehonoshin mjbnz morsicus noraab paulfantom porkepix rdemachkovych roidelapluie sarphram sdarwin soloradish sparanoid superq turgon37 wbh1 wzyboy zxyz

ansible-prometheus's Issues

Autostart fails

After restarting the server, prometheus fails to start up again.

galen@liffey:~$ prometheus
level=info ts=2018-11-23T10:06:36.460176795Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.2, branch=HEAD, revision=c305ffaa092e94e9d2dbbddf8226c4813b1190a0)"
level=info ts=2018-11-23T10:06:36.460270875Z caller=main.go:239 build_context="(go=go1.10.3, user=root@dcde2b74c858, date=20180921-07:22:29)"
level=info ts=2018-11-23T10:06:36.460309357Z caller=main.go:240 host_details="(Linux 4.15.0-39-generic #42-Ubuntu SMP Tue Oct 23 15:48:01 UTC 2018 x86_64 liffey (none))"
level=info ts=2018-11-23T10:06:36.460344059Z caller=main.go:241 fd_limits="(soft=1024, hard=1048576)"
level=info ts=2018-11-23T10:06:36.4603725Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2018-11-23T10:06:36.461755235Z caller=main.go:554 msg="Starting TSDB ..."
level=info ts=2018-11-23T10:06:36.46185093Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-11-23T10:06:36.473924184Z caller=main.go:564 msg="TSDB started"
level=info ts=2018-11-23T10:06:36.474044284Z caller=main.go:624 msg="Loading configuration file" filename=prometheus.yml
level=info ts=2018-11-23T10:06:36.474171003Z caller=main.go:423 msg="Stopping scrape discovery manager..."
level=info ts=2018-11-23T10:06:36.474209444Z caller=main.go:437 msg="Stopping notify discovery manager..."
level=info ts=2018-11-23T10:06:36.474237334Z caller=main.go:459 msg="Stopping scrape manager..."
level=info ts=2018-11-23T10:06:36.4742675Z caller=main.go:433 msg="Notify discovery manager stopped"
level=info ts=2018-11-23T10:06:36.474281513Z caller=main.go:453 msg="Scrape manager stopped"
level=info ts=2018-11-23T10:06:36.474298368Z caller=main.go:419 msg="Scrape discovery manager stopped"
level=info ts=2018-11-23T10:06:36.474326699Z caller=manager.go:638 component="rule manager" msg="Stopping rule manager..."
level=info ts=2018-11-23T10:06:36.474496861Z caller=manager.go:644 component="rule manager" msg="Rule manager stopped"
level=info ts=2018-11-23T10:06:36.479371575Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
level=info ts=2018-11-23T10:06:36.479521783Z caller=main.go:608 msg="Notifier manager stopped"
level=error ts=2018-11-23T10:06:36.479592537Z caller=main.go:617 err="error loading config from \"prometheus.yml\": couldn't load configuration (--config.file=\"prometheus.yml\"): open prometheus.yml: no such file or directory"
galen@liffey:/$ systemctl status prometheus
prometheus.service - Prometheus
   Loaded: loaded (/etc/systemd/system/prometheus.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

The playbook I used:

- hosts: monitors
  become: True
  vars:
    prometheus_version: 2.4.2
    prometheus_targets:
      node:
        - targets:
          - 192.168.128.16:9100
          - 192.168.128.32:9100 
          - 192.168.128.64:9100 
  roles:
    - cloudalchemy.prometheus

NO_NEW_PRIVILEGES on Ubuntu 16.04

My prometheus server process fails to start when configured by this wonderful Ansible role.

Feb 20 21:10:34 localhost systemd[82276]: prometheus.service: Failed at step NO_NEW_PRIVILEGES spawning /usr/local/bin/prometheus: Invalid argument

When I change this line from true to false it starts fine:
https://github.com/cloudalchemy/ansible-prometheus/blob/master/templates/prometheus.service.j2#L31

I appreciate those changes made in #110 but is there a way to toggle them?

# systemd --version
systemd 229
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN

can't create prometheus user again

If there is another prometheus installation in /opt/prometheus and the service prometheus is started, the playbook can't create the user prometheus again:

TASK [cloudalchemy.prometheus : create prometheus system user] ********************************************************************************************************************************************* fatal: [nexus]: FAILED! => {"changed": false, "msg": "usermod: user prometheus is currently used by process 25091\n", "name": "prometheus", "rc": 8}

In fact, the user is already created and used by running process.

The solution is to stop the prometheus service and try again. I believe that the playbook could detect such a situation and automatically resolve it.
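
A sketch of a possible workaround until the role handles this itself (it assumes the pre-existing service is also named prometheus):

- hosts: monitors
  become: true
  pre_tasks:
    - name: stop a previously installed prometheus before the role manages its user
      service:
        name: prometheus
        state: stopped
      failed_when: false   # ignore hosts where no such service exists yet
  roles:
    - cloudalchemy.prometheus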

Executable HTML files in consoles and console_libraries in prometheus_config_dir

The HTML files in /etc/prometheus/consoles and /etc/prometheus/console_libraries get copied with 755 (executable) permissions. They should have mode 0644 instead.

vagrant@ubuntu:~$ sudo ls -la  /etc/prometheus/consoles
total 44
drwxr-xr-x 2 root root       4096 Dec 26 12:45 .
drwxr-x--- 7 root prometheus 4096 Dec 26 12:45 ..
-rwxr-xr-x 1 root root        623 Dec 26 12:45 index.html.example
-rwxr-xr-x 1 root root       2675 Dec 26 12:45 node-cpu.html
-rwxr-xr-x 1 root root       3513 Dec 26 12:45 node-disk.html
-rwxr-xr-x 1 root root       1444 Dec 26 12:45 node.html
-rwxr-xr-x 1 root root       5794 Dec 26 12:45 node-overview.html
-rwxr-xr-x 1 root root       1332 Dec 26 12:45 prometheus.html
-rwxr-xr-x 1 root root       4128 Dec 26 12:45 prometheus-overview.html

I also think that the owner/group of these files should be prometheus instead of root. Would that be appropriate?
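
Until this is fixed upstream, a workaround sketch run after the role could look like this (paths are the defaults shown above):

- name: find console template files
  find:
    paths:
      - /etc/prometheus/consoles
      - /etc/prometheus/console_libraries
  register: console_files

- name: make console templates non-executable and owned by prometheus
  file:
    path: "{{ item.path }}"
    mode: "0644"
    owner: prometheus
    group: prometheus
  loop: "{{ console_files.files }}"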

Outdated documentation

Currently README.md is outdated. Issues which need fixing:

  • cleanup example usage
  • alerting rules definitions
  • comment variables in defaults/main.yml so everyone knows what each variable does

Does not clean the alerts directory

Good day!

I tried to use the latest version of this role and it is awesome! But I ran into a rather unfortunate issue:
I removed an alert rule file from the ./prometheus/rules directory and re-ran the playbook. The old alert rule file was left on the target server. I believe that the desired behaviour is to clean the /etc/prometheus/rules directory on the target server of files that are absent from the ansible inventory.

I removed the obsolete file on the target host by hand and re-ran the playbook. Unfortunately it didn't detect the change, so the alert list in Prometheus stayed old, and I was forced to restart the service manually too. :-(

Support every prometheus configuration option

Role should somehow support everything prometheus supports. This means current template for prometheus.yml should be expanded to include other scrape configurations.

Maybe there should be a more generic approach, something like in L49
This is open for discussion.

Error creating systemd template

Hello,

I can't install Prometheus properly (using the default configuration). There is an issue with the systemd configuration file.

Information

Distribution: Ubuntu 16.04

Error message

TASK [ansible-prometheus : create systemd service unit] ********************************************************************************************************************************************************
Wednesday 23 May 2018  16:39:26 +0200 (0:00:00.868)       0:01:10.611 *********
fatal: [prometheus-devtest-000]: FAILED! => {"changed": false, "msg": "AnsibleError: Unexpected templating type error occurred on (# {{ ansible_managed }}\n[Unit]\nDescription=Prometheus\nAfter=network.target
\n\n[Service]\nType=simple\nEnvironment=\"GOMAXPROCS={{ ansible_processor_vcpus|default(ansible_processor_count) }}\"\nUser=prometheus\nGroup=prometheus\nExecReload=/bin/kill -HUP $MAINPID\nExecStart=/usr/loc
al/bin/prometheus \\\n  --config.file={{ prometheus_config_dir }}/prometheus.yml \\\n  --storage.tsdb.path={{ prometheus_db_dir }} \\\n  --storage.tsdb.retention={{ prometheus_storage_retention }} \\\n  --web
.listen-address={{ prometheus_web_listen_address }} \\\n  --web.external-url={{ prometheus_web_external_url }}{% for flag, flag_value in prometheus_config_flags_extra.items() %}\\\n  --{{ flag }}={{ flag_valu
e }} {% endfor %}\n\nPrivateTmp=true\nPrivateDevices=true\nProtectHome=true\nNoNewPrivileges=true\n{% if prometheus_systemd_version.stdout >= 231 %}\nReadWritePaths={{ prometheus_db_dir }}\n{% else %}\nReadWr
iteDirectories={{ prometheus_db_dir }}\n{% endif %}\n{% if prometheus_systemd_version.stdout >= 232 %}\nProtectSystem=strict\nProtectControlGroups=true\nProtectKernelModules=true\nProtectKernelTunables=true\n
{% else %}\nProtectSystem=full\n{% endif %}\n\n{% if http_proxy is defined %}\nEnvironment=\"HTTP_PROXY={{ http_proxy }}\"{% if https_proxy is defined %} \"HTTPS_PROXY={{ https_proxy }}{% endif %}\"\n{% endif
 %}\n\nSyslogIdentifier=prometheus\nRestart=always\n\n[Install]\nWantedBy=multi-user.target\n): '>=' not supported between instances of 'AnsibleUnsafeText' and 'int'"}

Targeting problems

The issue seems to be related to https://github.com/cloudalchemy/ansible-prometheus/blob/master/templates/prometheus.service.j2#L24 and https://github.com/cloudalchemy/ansible-prometheus/blob/master/templates/prometheus.service.j2#L29

The registered object is a dict containing stdout and stderr (a shell command is used to generate it), so comparing its stdout string with an int is not possible.

Possible fix

{% if prometheus_systemd_version >= 231 %}

could be fixed by:

{% if prometheus_systemd_version.stdout >= '231' %}

Checking this fix, if it works i will do a PR. If you have other ideas, feel free. :)
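
Another possible fix (an untested sketch) would be to cast the captured output to an integer, which avoids relying on lexicographic string comparison:

{% if prometheus_systemd_version.stdout | int >= 231 %}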

Thanks !

Support recording rules without alerts

It doesn't seem there is a straightforward way to use recording rules with this playbook. You can specify them as part of the alerting rules, but there is a conditional on the task that manages alert rules which stops it if no prometheus_alertmanager_config is specified.

- name: alerting rules file
  template:
    src: "alert.rules.j2"
    dest: "{{ prometheus_config_dir }}/rules/basic.rules"
    owner: prometheus
    group: prometheus
    mode: 0644
    validate: "{{ prometheus_rules_validator }}"
  when:
    - prometheus_alertmanager_config != []
    - prometheus_alert_rules != []
  notify:
    - reload prometheus

I was able to work around this by adding a stub alertmanager config:

prometheus_alertmanager_config:
  - timeout: 10s

https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/

Convert scrape_configs to simple attribute tree

Rather than having a lot of custom templating, the scrape_configs section should be a simple tree of attributes that are sent to the config with to_nice_yaml

In defaults/main.yml:

prometheus_scrape_configs:
  - job_name: prometheus
   ...

Then in templates/prometheus.yml.j2:

scrape_configs:
{{ prometheus_scrape_configs | to_nice_yaml }}

Prometheus failed to start on Ubuntu 18.04: LimitNOFILE: Operation not permitted

On a fresh Ubuntu 18.04 install: "systemctl start prometheus" fails with

mars 08 12:53:01 ip-172-31-28-36 systemd[8512]: prometheus.service: Failed to adjust resource limit LimitNOFILE: Operation not permitted
mars 08 12:53:01 ip-172-31-28-36 systemd[8512]: prometheus.service: Failed at step LIMITS spawning /usr/local/bin/prometheus: Operation not permitted

Any help appreciated :) (besides commenting out LimitNOFILE=infinity in the systemd unit file)
Thanks !

Parallel CI build

To speed up testing we could use parallel pipelines in Travis CI. For example, we could test every platform on a separate machine (Ubuntu, Debian, CentOS...).
This could also be used to test many Ansible versions.

File copy globbing

Use a with_fileglob loop to copy files. This would allow for easily deploying user-supplied rules and targets files.

  • prometheus/targets/*
  • prometheus/rules/*

Using relative paths is nice because it will include files in the full Ansible search path outside of the role directory.
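
A sketch of what such a task could look like (the source path and handler name are assumptions based on snippets elsewhere on this page):

- name: copy user-supplied alerting rule files
  copy:
    src: "{{ item }}"
    dest: "{{ prometheus_config_dir }}/rules/"
    owner: prometheus
    group: prometheus
    mode: "0644"
  with_fileglob:
    - prometheus/rules/*
  notify:
    - reload prometheus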

prometheus_alert_rules are not copied as expected

Hello, I am having problems with setting up default alerts in configuration.

Version: 2.3.3


Expected behaviour

When the prometheus_alert_rules variable is populated with some rules and I then provision the instance, a file should be created at {{ prometheus_config_dir }}/rules/basic.rules containing those rules.

Actual behaviour

The {{ prometheus_config_dir }}/rules/ directory exists but is empty, and Prometheus sees no alerts configured. This occurs even if I simply use the default value for prometheus_alert_rules.


Other properties I am using (for example, prometheus_scrape_configs and prometheus_alertmanager_config) seem to work just fine.

Use systemd for service control

Since the role already assumes systemd service units, it would be better to manage them with the systemd module rather than service. The service module does not understand or automatically run daemon-reload.
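
A minimal sketch of what the service task could look like with the systemd module:

- name: ensure prometheus is started and enabled
  systemd:
    daemon_reload: true
    name: prometheus
    state: started
    enabled: true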

Implement Uninstalls based on specific variables

The way the current playbook is set up does not allow for shutting down/uninstalling the Prometheus server.

We should allow this, so that we can close out an old server if we're moving to a new one.

Allow specifying `latest` version

It would be nice to have a mechanism which allows specifying latest as a parameter for prometheus_version. This way we could allow upgrades without locking to a specific version.
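
With such a mechanism in place, usage would be as simple as overriding the version variable (the role variables table above already documents latest as an accepted value):

prometheus_version: latest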

Role fails on RedHat if SELinux is disabled

Role fails with:

TASK [cloudalchemy.prometheus : Allow prometheus to bind to port in SELinux on RedHat OS family] fatal: []: FAILED! => {"changed": false, "msg": "SELinux is disabled on this host."}

Rate limiter in GitHub API

Sometimes we hit the rate limit while asking the GitHub API for the latest Prometheus version when executing Travis CI tests.
This can be prevented by adding authentication to the uri module when executing the request to the GitHub API.
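
A sketch of how that request could be authenticated (GITHUB_TOKEN is an assumed environment variable, not something the role currently reads):

- name: get latest prometheus release through the GitHub API
  uri:
    url: https://api.github.com/repos/prometheus/prometheus/releases/latest
    headers:
      Authorization: "token {{ lookup('env', 'GITHUB_TOKEN') }}"
    return_content: true
  register: prometheus_latest_release
  delegate_to: localhost
  run_once: true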

Release 1.0.0

I would like to release 1.0 and have a feature-frozen, production-ready version of this role.

@SuperQ @rdemachkovych do you think we are missing something or we need to refactor some part?

If there are things to do before 1.0.0 release, please mention them in this thread.

When used on GCE with gce_sd_config discovery scrape config, the ProtectHome=Yes option prevents Prometheus to discover Google Cloud Instances

When this role is used to provision an instance on GCE, and this instance should auto-discover other instances in GCE using https://prometheus.io/docs/prometheus/latest/configuration/configuration/#gce_sd_config, the implicit authentication mechanism (https://cloud.google.com/docs/authentication/production#auth-cloud-implicit-go) fails because it cannot find the well-known auth file in the user's home directory.
The auth file cannot be accessed because the systemd service file uses sandboxing (https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Sandboxing); the ProtectHome=Yes option is the cause.

Steps to reproduce:

  1. Provision a GCE instance using this role; have the instance run with a service account.
  2. Use gce_sd_config discovery, for example with this variable:
prometheus_scrape_configs:
  - job_name: "prometheus"
    metrics_path: "{{ prometheus_metrics_path }}"
    static_configs:
      - targets:
        - "{{ ansible_fqdn | default(ansible_host) | default('localhost') }}:9090"

  - job_name: "gce_instances"
    gce_sd_configs:
      - project: "my-test-project"
        zone: us-central1-c
        port: 9100
  3. Start Prometheus

Expected behavior:

Current behavior:

  • The function findDefaultCredentials (https://github.com/golang/oauth2/blob/master/google/default.go#L43) fails at step 2. It tries to open the well-known credentials file (/home/prometheus/.config/gcloud/application_default_credentials.json in this case). This fails with a Permission Denied error returned from the OS.
  • Permission Denied is caused by systemd sandboxing; this error is returned even if the home directory exists with correct permissions and the file is in place.
  • This relates to golang/oauth2#337

To circumvent the golang/oauth2#337 issue, let me propose one of two solutions (a sketch of the first one follows the list):

  1. set Sandboxing to read-only in systemd
  2. set /tmp and the prometheus user home directory
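
A sketch of the first option, expressed as a systemd drop-in deployed from a playbook (the drop-in path and restart step are assumptions, not something this role ships):

- name: create a drop-in directory for the prometheus unit
  file:
    path: /etc/systemd/system/prometheus.service.d
    state: directory
    mode: "0755"

- name: relax ProtectHome so the GCE credentials file can be read
  copy:
    dest: /etc/systemd/system/prometheus.service.d/gce-sd.conf
    content: |
      [Service]
      ProtectHome=read-only

- name: reload systemd and restart prometheus
  systemd:
    daemon_reload: true
    name: prometheus
    state: restarted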

failed parsing YAML File

Hello,
When I execute this role I get the following error:

TASK [cloudalchemy.prometheus : configure prometheus] *************************************************************************************************************************************************************************************************************************************************************************
fatal: [192.168.33.10]: FAILED! => {"changed": false, "checksum": "cfbf9d2e8fb35eb6ed385d41664584b67942be98", "exit_status": 1, "msg": "failed to validate", "stderr": "  FAILED: parsing YAML file /home/vagrant/.ansible/tmp/ansible-tmp-1530090642.5528903-66166601425282/source: yaml: unmarshal errors:\n  line 39: cannot unmarshal !!map into string\n", "stderr_lines": ["  FAILED: parsing YAML file /home/vagrant/.ansible/tmp/ansible-tmp-1530090642.5528903-66166601425282/source: yaml: unmarshal errors:", "  line 39: cannot unmarshal !!map into string"], "stdout": "Checking /home/vagrant/.ansible/tmp/ansible-tmp-1530090642.5528903-66166601425282/source\n\n", "stdout_lines": ["Checking /home/vagrant/.ansible/tmp/ansible-tmp-1530090642.5528903-66166601425282/source", ""]}

I use these vars:

---
prometheus_targets:
  node:
    - targets:
      - 192.168.33.10
      - 192.168.33.11
  alertmanager:
    - targets:
      - "{{ inventory_hostname }}:9093"

prometheus_scrape_configs:
- job_name: "prometheus"
  metrics_path: "/metrics"
  static_configs:
  - targets:
    - "{{ inventory_hostname }}:9090"
- job_name: "node"
  file_sd_configs:
  - files:
    - "/etc/prometheus/file_sd/node.yml"
- job_name: "alertmanager"
  file_sd_configs:
  - files:
    - "/etc/prometheus/file_sd/alertmanager.yml"
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]
    static_configs:
    - targets:
      - "https://xxx.xxx"
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: "{{ inventory_hostname }}:9115"

alertmanager_receivers:
- name: 'rocketchat'
  webhook_configs:
  - send_resolved: false
    url: 'xxx.xxx.xxx.xxx'
alertmanager_route:
  group_interval: 1m
  repeat_interval: 1m
  receiver: 'rocketchat'

grafana_security:
  admin_user: admin
  admin_password: admin

grafana_datasources:
  - name: "Prometheus"
    type: "prometheus"
    access: "proxy"
    url: "http://{{ inventory_hostname }}:9090"
    isDefault: true
grafana_dashboards:
  - dashboard_id: '1860'
    revision_id: '8'
    datasource: '{{ grafana_datasources.0.name }}'
  - dashboard_id: '3662'
    revision_id: '2'
    datasource: '{{ grafana_datasources.0.name }}'
  - dashboard_id: '4271'
    revision_id: '3'
    datasource: '{{ grafana_datasources.0.name }}'

My playbook looks like this:

- name: Deploy node_exporter
  remote_user: vagrant
  hosts: nodeexporter
  become: yes
  roles:
    - cloudalchemy.node-exporter
  tags:
    - node_exporter

- name: Deploy blackbox_exporter
  remote_user: vagrant
  hosts: blackboxexporter
  become: yes
  roles:
    - cloudalchemy.blackbox-exporter
  tags:
    - blackbox_exporter

- name: Setup core monitoring software
  remote_user: vagrant
  hosts: prometheus
  become: yes
  roles:
    - cloudalchemy.prometheus
    - cloudalchemy.alertmanager
  tags:
    - prometheus


- name: Deploy grafana
  remote_user: vagrant
  hosts: grafana
  become: yes
  roles:
    - cloudalchemy.grafana
  tags:
    - grafana

Do you have any idea? :( I have tried debugging this myself by keeping the generated YAML file and looking it over, but the generated file looks OK as well.

Wrong directory permissions

After changing prometheus_root_dir to /usr/local/bin, Ansible sets the wrong directory permissions on this directory.

Install prometheus using docker

Hi,

I would like to use this role to install prometheus with docker.

From what I can see, it would mean changing some things here and there (adding a generic "pre-install" task, introducing some variables to give a choice, etc.).

Would this feature be something interesting for your role to have?

Get checksum for * architecture: "HTTP Error 400: Bad Request"

I get an error message when trying to launch this playbook:

TASK [cloudalchemy.prometheus : Get checksum for amd64 architecture] ***********************************************************************************
fatal: [ci]: FAILED! => {"msg": "An unhandled exception occurred while running the lookup plugin 'url'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Received HTTP error for https://github.com/prometheus/prometheus/releases/download/v2.3.2/sha256sums.txt : HTTP Error 400: Bad Request"}

When I download this file manually, it's accessible and contains correct data.

I think we need the ability to skip checksum verification.

Allow Multiple blackbox exporters

Hi,
is it possible to allow multiple blackbox exporters? In your demo-site example you just deploy one blackbox exporter on the Prometheus node. I need a decentralized approach and have to deploy multiple blackbox exporters on several nodes. My current approach is a dirty hack. Let me show you my hosts file:

[prometheus]
192.168.33.10

[grafana]
192.168.33.10

[nodeexporter]
192.168.33.10
192.168.33.11
192.168.33.12

[blackboxexporter]
192.168.33.11
192.168.33.12

What I expect from this hosts file is one Prometheus + Grafana on node 192.168.33.10, a node exporter on each node, and a blackbox exporter on nodes 33.11 and 33.12. My goal is to let both run under the same job.

My /etc/group_vars/all/vars file looks like this at the moment:

---
prometheus_targets:
  node:
    - targets:
      - 192.168.33.10:9100
      - 192.168.33.11:9100
      - 192.168.33.12:9100
  alertmanager:
    - targets:
      - "{{ inventory_hostname }}:9093"

prometheus_scrape_configs:
- job_name: "prometheus"
  metrics_path: "/metrics"
  static_configs:
  - targets:
    - "{{ inventory_hostname }}:9090"
- job_name: "node"
  file_sd_configs:
  - files:
    - "/etc/prometheus/file_sd/node.yml"
- job_name: "alertmanager"
  file_sd_configs:
  - files:
    - "/etc/prometheus/file_sd/alertmanager.yml"
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
      - "https://tu-clausthal.de"
      - "https://rz.tu-clausthal.de"
      - "https://service.rz.tu-clausthal.de"
      labels:
        blackbox_instance: "192.168.33.11"
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: "192.168.33.11:9115"
- job_name: 'blackbox2'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
      - "https://tu-clausthal.de"
      - "https://rz.tu-clausthal.de"
      - "https://service.rz.tu-clausthal.de"
      labels:
        blackbox_instance: "192.168.33.12"
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: "192.168.33.12:9115"


alertmanager_receivers:
- name: 'rocketchat'
  webhook_configs:
  - send_resolved: false
    url: 'https://chat.rz.tu-clausthal.de/hooks/<censored>'
alertmanager_route:
  group_interval: 1m
  repeat_interval: 1m
  receiver: 'rocketchat'

grafana_security:
  admin_user: admin
  admin_password: foobar

grafana_datasources:
  - name: "Prometheus"
    type: "prometheus"
    access: "proxy"
    url: "http://{{ inventory_hostname }}:9090"
    isDefault: true
grafana_dashboards:
  - dashboard_id: 1860
    revision_id: 13
    datasource: '{{ grafana_datasources.0.name }}'
  - dashboard_id: 3662
    revision_id: 2
    datasource: '{{ grafana_datasources.0.name }}'
  - dashboard_id: 5345
    revision_id: 3
    datasource: '{{ grafana_datasources.0.name }}'
  - dashboard_id: 9719
    revision_id: 4
    datasource: '{{ grafana_datasources.0.name }}'

You may see my problem. I need to hardcode the blackbox exporters into that group_vars/all file. Isn't there a way to split the configuration into host_vars, or at least into groups other than all?
And why do I need to hardcode the node exporters as well? I also need to configure 2 different blackbox exporter jobs... I would like to have them both in one job.

Support remote read option

Since we now support remote write, I think we should also support remote read.
Here are the docs

Also, this is probably the last thing before the 1.0 release, since I cannot think of anything else.

Allow specifying source url for Prometheus archive

Hi there,

currently working for a client with a highly restricted environment, meaning controllers and nodes cannot access the outside world. They want to rely as much as possible on existing and maintained Ansible roles, use variables as much as possible, and avoid tweaking the tasks.

Long story short, they host tar.gz releases internally and don't want to do nasty stuff like lying DNS servers redirecting traffic aimed at Github or anywhere else to internal servers.

The question is: would it be possible to move the variables related to external resources to the defaults/main.yml file and use those variables in the tasks, allowing the user to specify their own location?

I see you're making use of the delegate_to: localhost directive so only the controller needs access to Github, and I'm still trying to convince them that it's going to be a nightmare to ask every Ansible role developer to allow url overwriting.

Have you ever seen cases with such limited connectivity, and how do you usually handle them?

Thanks : )

Better handling of alerting rules

Looking at alert rules in Prometheus 1.x and 2.x, it seems that we could define alerting rules in YAML (like in Prometheus 2.x) and use Jinja templates to create rules compatible with Prometheus 1.x.

Take this config for example:

  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
      summary: Instance {{ $labels.instance }} down

it should easily map to v1:

ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} down",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.",
  }

Also this way we can get rid of static alerting rules, and create them with jinja templates.

@rdemachkovych what do you think?

Allow multiple targets templates files.

It should be possible to configure multiple targets files with yaml-only syntax.

prometheus_targets:
  node:
  - targets:
    - localhost:9100
    labels:
      env: test
      job: node
  other-job:
  - targets:
    - foo:8080

Looping over prometheus_targets will create target files named after the job key.

Can't download release due to Github redirect the request

At some point, GitHub started redirecting download requests to AWS S3. And since the Ansible unarchive module with remote_src won't follow the redirect URL, the task will actually download an HTML page and report a tar error like this:

fatal: [test -> localhost]: FAILED! => {"attempts": 1, "changed": false, "msg": "Failed to find handler for \"/private/var/folders/5m/1f26j_6d3jnbp4y30g8rhlgc0000gn/T/ansible_CxcU89/prometheus-2.2.1.linux-amd64.tar.gz\". Make sure the required command to extract the file is installed. Command \"/usr/bin/tar\" detected as tar type bsd. GNU tar required. Command \"/usr/bin/unzip\" could not handle archive."}

My two cents on this, following the discussion in ansible/ansible#19985 (comment): we might split the download and unarchive steps into two tasks. It would be even better if we could extract the download src as an overridable variable.
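
A sketch of the proposed split (the URL layout mirrors the official release naming and is an assumption, not the role's current tasks):

- name: download prometheus release archive
  get_url:
    url: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
    dest: "/tmp/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
  delegate_to: localhost
  run_once: true

- name: unpack the downloaded archive
  unarchive:
    src: "/tmp/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
    dest: /tmp
    remote_src: true   # the archive already sits on the delegated host
  delegate_to: localhost
  run_once: true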

Default rules not working correctly

With the defaults as they are now, the CPU, RAM and disk space alerts need to be reworked.

eg.
(100 * (1 - avg(irate(node_cpu{job="node",mode="idle"}[5m])) BY (instance))) > 96
won't work, but
100 - (avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) * 100) > 96
will

I think the PromQL just changed, as I remember those defaults working correctly (unless I am missing something and the first one is OK).

[Question] can't specify blackbox as target

Hello,
Do you have any idea why your Ansible role is not happy with this config?

---
prometheus_targets:
  node:
    - targets:
      - 192.168.33.10:9100
      - 192.168.33.11:9100
  alertmanager:
    - targets:
      - "{{ inventory_hostname }}:9093"
  blackbox:
    - targets:
      - "{{ inventory_hostname }}:9115"

prometheus_scrape_configs:
- job_name: "prometheus"
  metrics_path: "/metrics"
  static_configs:
  - targets:
    - "{{ inventory_hostname }}:9090"
- job_name: "node"
  file_sd_configs:
  - files:
    - "/etc/prometheus/file_sd/node.yml"
- job_name: "alertmanager"
  file_sd_configs:
  - files:
    - "/etc/prometheus/file_sd/alertmanager.yml"
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
      - "https://nullday.de"
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: "{{ inventory_hostname }}:9115"

alertmanager_receivers:
- name: 'rocketchat'
  webhook_configs:
  - send_resolved: false
    url: 'https://chat.rz.tu-clausthal.de/hooks/sCyvZuLWYXG7erK6J/QsE2zGrpwfXrAjMgWD7KQKte3eY4mLAWFW6qKS92hFQ57RoJ'
alertmanager_route:
  group_interval: 1m
  repeat_interval: 1m
  receiver: 'rocketchat'

grafana_security:
  admin_user: admin
  admin_password: admin

grafana_datasources:
  - name: "Prometheus"
    type: "prometheus"
    access: "proxy"
    url: "http://{{ inventory_hostname }}:9090"
    isDefault: true
grafana_dashboards:
  - dashboard_id: '1860'
    revision_id: '8'
    datasource: '{{ grafana_datasources.0.name }}'
  - dashboard_id: '3662'
    revision_id: '2'
    datasource: '{{ grafana_datasources.0.name }}'
  - dashboard_id: '4271'
    revision_id: '3'
    datasource: '{{ grafana_datasources.0.name }}'

The error message I get is this one:

TASK [cloudalchemy.prometheus : Fail when file_sd targets are not defined in scrape_configs] **********************************************************************************************************************************************************************************************************************************
skipping: [192.168.33.10] => (item={'key': 'node', 'value': [{'targets': ['192.168.33.10:9100', '192.168.33.11:9100']}]}) 
skipping: [192.168.33.10] => (item={'key': 'alertmanager', 'value': [{'targets': ['192.168.33.10:9093']}]}) 
failed: [192.168.33.10] (item={'key': 'blackbox', 'value': [{'targets': ['192.168.33.10:9115']}]}) => {"changed": false, "item": {"key": "blackbox", "value": [{"targets": ["192.168.33.10:9115"]}]}, "msg": "Oh, snap! `blackbox` couldn't be found in you scrape configs. Please ensure you provided all targets from prometheus_targets in prometheus_scrape_configs"}

I am providing the blackbox target in the scrape_config, am I not? So what is the issue here?

Another Problem I have is this:

Get http://192.168.33.10:9115/probe?module=http_2xx&target=https%3A%2F%2Fnullday.de: dial tcp 192.168.33.10:9115: connect: connection refused

Somehow the blackbox exporter doesn't get configured and started on my prometheus node. Did I miss something?
