
prometheus-k8s-operator's Introduction

Prometheus Charmed Operator for Kubernetes


Description

The Prometheus Charmed Operator for Juju provides a monitoring solution using Prometheus, an open-source monitoring system and alerting toolkit. In addition, it handles instantiation, scaling, configuration, and Day 2 operations specific to Prometheus.

This Charm is also part of the Canonical Observability Stack and deploys only the monitoring component of Prometheus in a Kubernetes cluster. An alerting service for Prometheus is offered through a separate Charm.

Getting Started

Basic Deployment

Create a Juju model for your operator, say "observability":

juju add-model observability

The Prometheus Charmed Operator may now be deployed using the Juju command line:

juju deploy prometheus-k8s --channel=stable

Checking deployment status

The progress and current status of the Prometheus charm deployment may be viewed at any time using the Juju command line.

juju status --color --relations

Once the Prometheus charm deployment completes, you may expect to see a status result such as:

$ juju status --relations

Model  Controller           Cloud/Region        Version  SLA          Timestamp
cos    charm-dev-batteries  microk8s/localhost  3.0.2    unsupported  14:27:58-03:00

App             Version  Status  Scale  Charm           Channel  Rev  Address         Exposed  Message
prometheus-k8s  2.33.5   active      1  prometheus-k8s  stable    79  10.152.183.227  no

Unit               Workload  Agent  Address     Ports  Message
prometheus-k8s/0*  active    idle   10.1.36.90

Relation provider                Requirer                         Interface         Type  Message
prometheus-k8s:prometheus-peers  prometheus-k8s:prometheus-peers  prometheus_peers  peer

Accessing the Prometheus User Interface

Prometheus provides a user interface that lets us explore the data it has collected, such as metrics and associated alerts. This UI is accessed on port 9090 at the Prometheus charm's address, assuming that address is reachable from the host your browser runs on. For example, the Juju status output shown above is from a MicroK8s cluster, so navigating to http://10.152.183.227:9090 will show the Prometheus UI.
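The same port 9090 also serves the Prometheus HTTP API, so the data behind the UI can be queried programmatically. A minimal sketch, assuming the unit address from the status output above (replace it with your own):

```python
# Illustrative sketch: querying the Prometheus HTTP API on port 9090.
# The address is the example unit address from the `juju status` output
# above; substitute your own unit's address.
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://10.152.183.227:9090"  # assumed address from the status output


def query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build the URL for an instant query against /api/v1/query."""
    return base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def instant_query(promql: str, base_url: str = PROMETHEUS_URL) -> dict:
    """Run an instant PromQL query and return the parsed JSON response."""
    with urllib.request.urlopen(query_url(promql, base_url)) as resp:
        return json.load(resp)


# Example (requires network access to the unit):
# result = instant_query("up")
# for series in result["data"]["result"]:
#     print(series["metric"].get("job"), series["value"][1])
```

The `up` metric is a convenient first query: Prometheus synthesizes it for every scrape target, so it shows at a glance which endpoints are being scraped successfully.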

Adding scrape targets

When Prometheus is deployed it will only scrape metrics from itself; this is the default behavior of Prometheus. Additional metrics endpoints may be added to the Prometheus charm through relations with other charms. Currently the following charms are supported:

  • Any charm that uses the prometheus_scrape interface to provide a metrics endpoint and optionally alert rules for Prometheus.
  • The Prometheus Scrape Target charm may be used to scrape metrics endpoints that are not part of any Juju model.
  • The Prometheus Scrape Config charm may be used to scrape metrics endpoints across different Juju models. The charm also supports overriding some of the scrape job configurations provided by metrics endpoints.

Relations

At present the Prometheus Charmed Operator for Kubernetes supports eight relations.

Requires

Metrics Endpoint:

  metrics-endpoint:
    interface: prometheus_scrape

Charms may forward information about their metrics endpoints and associated alert rules to the Prometheus charm over the metrics-endpoint relation using the prometheus_scrape interface. For these metrics to be aggregated by this Prometheus charm, all that is required is to relate the two charms, as in:

juju relate kube-state-metrics-k8s prometheus-k8s

Charms that seek to provide metrics endpoints and alert rules for Prometheus must do so using the provided prometheus_scrape charm library. By implementing the metrics-endpoint relation, this library not only ensures that scrape jobs and alert rules are forwarded to Prometheus, but also that they are updated any time the metrics provider charm is upgraded. For example, new alert rules may be added, or old ones removed, by updating and releasing a new version of the metrics provider charm. While it is safe to update alert rules as desired, care must be taken when updating scrape job specifications, as this has the potential to break the continuity of the scraped metrics time series. In particular, changing the following keys in the scrape job can break time series continuity:

  • job_name
  • relabel_configs
  • metric_relabel_configs
  • Any label set by static_configs
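A hypothetical helper (not part of the charm library) sketching how a change to a scrape job specification could be checked against the continuity-breaking keys listed above before releasing a new charm version:

```python
# Hypothetical helper, for illustration only: flag scrape-job changes that
# can break time series continuity, per the keys listed above.
BREAKING_KEYS = ("job_name", "relabel_configs", "metric_relabel_configs")


def continuity_breaking_changes(old_job: dict, new_job: dict) -> list:
    """Return the continuity-breaking keys that differ between two scrape jobs."""
    changed = [k for k in BREAKING_KEYS if old_job.get(k) != new_job.get(k)]
    # Labels set by static_configs also identify the series, so compare them too.
    old_labels = [c.get("labels", {}) for c in old_job.get("static_configs", [])]
    new_labels = [c.get("labels", {}) for c in new_job.get("static_configs", [])]
    if old_labels != new_labels:
        changed.append("static_configs labels")
    return changed


old = {"job_name": "cassandra", "static_configs": [{"targets": ["10.0.0.1:9500"]}]}
new = {"job_name": "cassandra-k8s", "static_configs": [{"targets": ["10.0.0.1:9500"]}]}
print(continuity_breaking_changes(old, new))  # ['job_name']
```

In the example, renaming the job is flagged: the old and new series would carry different `job` labels, so Prometheus treats them as unrelated time series even though the target is unchanged.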

Evaluation of alert rules forwarded through the prometheus_scrape interface is automatically scoped to the charmed application that provided the rules, ensuring that each charm's rules are evaluated only against its own metrics.

Alerting

  alertmanager:
    interface: alertmanager_dispatch

The Alertmanager Charm aggregates, deduplicates, groups, and routes alerts to selected "receivers". Alertmanager receives its alerts from Prometheus; this interaction is set up and configured through the alertmanager relation using the alertmanager_dispatch interface. Over this relation, the Alertmanager charm keeps Prometheus informed of all Alertmanager instances (units) to which alerts must be forwarded. If your charm sets any alert rules, it will almost always need a relation with an Alertmanager charm configured to forward alerts to specific receivers. In the absence of such a relation, alerts, even when raised, will only be visible in the Prometheus user interface. A prudent approach to setting up an observability stack is to do so in a manner that draws your attention to alerts as and when they are raised, without you having to periodically check a dashboard.

Ingress

  ingress:
    interface: ingress_per_unit
    limit: 1

Interactions with the Prometheus charm cannot be assumed to originate within the same Juju model, let alone the same Kubernetes cluster or even the same Juju cloud. Hence the Prometheus charm also supports an ingress relation. There are multiple use cases that require an ingress, in particular:

  • Using the Prometheus remote write endpoint across network boundaries.
  • Querying the Prometheus HTTP API endpoint across network boundaries.
  • Self monitoring of Prometheus that must happen across network boundaries to ensure robustness of self monitoring.
  • Supporting the Loki push API.
  • Exposing the Prometheus remote write endpoint to Grafana agent.

Prometheus typically needs a "per unit" ingress. This is necessary because Prometheus exposes a remote write endpoint on a per-unit basis. A per-unit ingress relation is available in the traefik-k8s charm, and this Prometheus charm supports that relation over the ingress_per_unit interface.

Catalogue

  catalogue:
    interface: catalogue

Through this relation, Prometheus provides its URL to the Catalogue K8s Charmed Operator, a landing page that helps users locate the user interfaces of the charms it relates to.

Provides

Grafana Source

  grafana-source:
    interface: grafana_datasource

The Grafana Charm provides a data visualization solution for metrics aggregated by Prometheus and supports the creation of bespoke dashboards for such visualization. Grafana requires a data source for its dashboards; this Prometheus charm provides that data source through the grafana-source relation using the grafana_datasource interface. To visualize your charm's metrics using Grafana, the following steps are required:

  • Add a relation between your charm (say cassandra-k8s) and Prometheus so that Prometheus can aggregate the metrics.
  • Add a relation between the Grafana and Prometheus charms so that Prometheus is registered as a data source in Grafana.
  • Add a relation between your charm and Grafana so that your charm can forward dashboards for its metrics to Grafana.

For example

juju relate cassandra-k8s prometheus-k8s
juju relate prometheus-k8s grafana-k8s
juju relate cassandra-k8s grafana-k8s

Remote Write

  receive-remote-write:
    interface: prometheus_remote_write

Metrics may also be pushed to this Prometheus charm through the receive-remote-write relation using the prometheus_remote_write interface. This can be used with the Grafana Agent Charm to have metrics scraped by the Grafana Agent sent over to Prometheus.

Self metrics endpoint

  self-metrics-endpoint:
    interface: prometheus_scrape

This Prometheus charm may forward information about its own metrics endpoint and associated alert rules to another Prometheus charm over the self-metrics-endpoint relation using the prometheus_scrape interface. For these metrics to be aggregated by the remote Prometheus charm, all that is required is to relate the two charms, as in:

juju relate \
    prometheus-k8s:self-metrics-endpoint \
    remote-prometheus-charm:metrics-endpoint

Grafana dashboard

  grafana-dashboard:
    interface: grafana_dashboard

Over the grafana-dashboard relation using the grafana_dashboard interface, this Prometheus charm also provides meaningful dashboards about its metrics, to be shown in a Grafana Charm.

In order to add these dashboards to Grafana all that is required is to relate the two charms in the following way:

juju relate \
    prometheus-k8s:grafana-dashboard \
    grafana-k8s:grafana-dashboard

OCI Images

This charm by default uses the latest version of the canonical/prometheus image.

prometheus-k8s-operator's People

Contributors

abuelodelanada, balbirthomas, beliaev-maksim, dstathis, facundobatista, gboutry, guillaumebeuzeboc, ibraaoad, jnsgruk, justinmclark, lengau, lucabello, michaeldmitry, mmkay, mthaddon, neoaggelos, nobuto-m, observability-noctua-bot, pietropasotti, rbarry82, sabaini, sed-i, simondeziel, simskij, tonyandrewmeyer, xavpaice


prometheus-k8s-operator's Issues

Can't configure custom Prometheus configuration

The Prometheus Charm doesn't seem to have an easy way to configure any custom scraping. The default configuration (prometheus.yaml) is hardcoded by the _prometheus_config(self) method and only scrapes Prometheus itself.

With that, it’s not possible to configure the Prometheus Charm to scrape anything else, like other deployed Charms or Juju itself.

Is that assumption correct or am I missing something?

Support for targeting non charm targets

I think there should be some way of targeting systems that are not managed by Juju. This is important for a couple of reasons.

  1. Most production environments will inevitably require some systems that either can't have a charm or don't yet have a charm. Furthermore, it may be a large amount of work for some existing systems to be migrated to Juju. Without having the ability to target these systems, admins will need to deploy a separate Prometheus instance outside of Juju if they want to monitor these systems. At which point they likely won't deploy this charm at all. I would like to encourage adoption of our charms and Juju in general as much as possible.
  2. When writing Prometheus support for a new charm, it is useful to have a quick and easy way to point monitoring at my system while I am still developing the data exporting functionality. Otherwise I have to write all of the code before I ever know if the data exporting works in the first place.

Feature request: actions to manage a pool of "static" endpoints.

This approach seems extremely promising, and I wonder if we could have some actions to help with migration pains. In particular, there could be a manually curated list of scraped endpoints, managed either from Juju config or via actions.

The idea would be that there could be legacy systems unreachable via cross-model relations from a new install of this Prometheus service, but which need to be considered in alerting or graphing queries. So either a simple list of endpoint URLs to scrape in the config, or a CRUD interface through actions to manage a similar list, would help with the "we can't scrape our old x until we migrate it" problem.
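A minimal sketch of the config-driven variant of this proposal. This is not existing charm behavior; the option name `static-targets` and job name `juju-config-static` are made up for illustration:

```python
# Sketch of the proposed feature, not existing charm behavior: turn a
# comma-separated config value of "static" endpoints into a Prometheus
# scrape job. The option name ("static-targets") and job name
# ("juju-config-static") are hypothetical.
def static_scrape_job(config_value: str) -> dict:
    """Build a scrape job from a comma-separated list of host:port endpoints."""
    targets = [t.strip() for t in config_value.split(",") if t.strip()]
    return {
        "job_name": "juju-config-static",
        "static_configs": [{"targets": targets}],
    }


# e.g. juju config prometheus-k8s static-targets="10.0.0.5:9100, legacy-db.internal:9104"
job = static_scrape_job("10.0.0.5:9100, legacy-db.internal:9104")
print(job["static_configs"][0]["targets"])
# ['10.0.0.5:9100', 'legacy-db.internal:9104']
```

The action-based CRUD variant would maintain the same target list in stored state instead of config, but the resulting scrape job shape would be identical.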

Race condition in generating config

It looks like 986a86e created a race condition in this charm's ability to serve the monitoring relation, due to the self._stored.provider_ready flag only being set in the update-status hook, which will come after the relation hooks unless an artificial delay is introduced during deployment. As an example, see this failure and these details from a similarly stuck model:

$ juju show-unit kube-state-metrics/0
kube-state-metrics/0:
  opened-ports: []
  charm: local:focal/kube-state-metrics-0
  leader: true
  relation-info:
  - endpoint: monitoring
    related-endpoint: monitoring
    application-data:
      targets: '["10.1.148.201:8080", "10.1.148.201:8081"]'
    related-units:
      prometheus-k8s/0:
        in-scope: true
        data: {}
  provider-id: kube-state-metrics-0
  address: 10.1.148.201
$ juju debug-log --replay -n10000 -i prometheus-k8s/0 | grep 'Prometheus config\|Prometheus provider is available\|running operation run .* hook' | head -n14
unit-prometheus-k8s-0: 17:31:30 DEBUG juju.worker.uniter.operation running operation run install hook for prometheus-k8s/0
unit-prometheus-k8s-0: 17:31:32 DEBUG juju.worker.uniter.operation running operation run leader-elected hook for prometheus-k8s/0
unit-prometheus-k8s-0: 17:31:32 DEBUG juju.worker.uniter.operation running operation run pebble-ready hook for prometheus-k8s/0
unit-prometheus-k8s-0: 17:31:33 DEBUG unit.prometheus-k8s/0.juju-log Prometheus config : {'global': {'scrape_interval': '1m', 'scrape_timeout': '10s', 'evaluation_interval': '1m'}, 'scrape_configs': [{'job_name': 'prometheus', 'scrape_interval': '5s', 'scrape_timeout': '5s', 'metrics_path': '/metrics', 'honor_timestamps': True, 'scheme': 'http', 'static_configs': [{'targets': ['localhost:9090']}]}]}
unit-prometheus-k8s-0: 17:31:35 DEBUG juju.worker.uniter.operation running operation run storage-attached (database/0) hook for prometheus-k8s/0
unit-prometheus-k8s-0: 17:31:35 DEBUG juju.worker.uniter.operation running operation run config-changed hook for prometheus-k8s/0
unit-prometheus-k8s-0: 17:31:36 DEBUG unit.prometheus-k8s/0.juju-log Prometheus config : {'global': {'scrape_interval': '1m', 'scrape_timeout': '10s', 'evaluation_interval': '1m'}, 'scrape_configs': [{'job_name': 'prometheus', 'scrape_interval': '5s', 'scrape_timeout': '5s', 'metrics_path': '/metrics', 'honor_timestamps': True, 'scheme': 'http', 'static_configs': [{'targets': ['localhost:9090']}]}]}
unit-prometheus-k8s-0: 17:31:37 DEBUG juju.worker.uniter.operation running operation run start hook for prometheus-k8s/0
unit-prometheus-k8s-0: 17:31:57 DEBUG juju.worker.uniter.operation running operation run relation-created (0; app: kube-state-metrics) hook for prometheus-k8s/0
unit-prometheus-k8s-0: 17:31:58 DEBUG juju.worker.uniter.operation running operation run relation-joined (0; unit: kube-state-metrics/0) hook for prometheus-k8s/0
unit-prometheus-k8s-0: 17:31:59 DEBUG juju.worker.uniter.operation running operation run relation-changed (0; unit: kube-state-metrics/0) hook for prometheus-k8s/0
unit-prometheus-k8s-0: 17:32:00 DEBUG juju.worker.uniter.operation running operation run relation-changed (0; app: kube-state-metrics) hook for prometheus-k8s/0
unit-prometheus-k8s-0: 17:32:25 DEBUG juju.worker.uniter.operation running operation run update-status hook for prometheus-k8s/0
unit-prometheus-k8s-0: 17:32:25 DEBUG unit.prometheus-k8s/0.juju-log Prometheus provider is available
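The ordering problem can be reduced to a toy simulation (illustrative only, not charm code): if the readiness flag is only set in update-status, every relation hook that fires before it sees the provider as unready and never serves the relation.

```python
# Toy simulation of the hook ordering shown in the log above. Not charm
# code; class and attribute names are made up for illustration.
class ToyCharm:
    def __init__(self):
        self.provider_ready = False  # stands in for self._stored.provider_ready
        self.relation_data_written = False

    def on_relation_changed(self):
        # Relation hooks run before the first update-status, so the flag
        # is still False and the relation data is never written.
        if self.provider_ready:
            self.relation_data_written = True

    def on_update_status(self):
        self.provider_ready = True


charm = ToyCharm()
# Juju fires the relation hooks before the first update-status:
charm.on_relation_changed()
charm.on_update_status()
print(charm.relation_data_written)  # False: the relation was never served
```

Setting the flag in an earlier hook (or re-emitting the relation logic once ready) would close the gap, since nothing else re-triggers the relation hooks after update-status runs.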

The right address for Consumer units

Introduction

At present, Prometheus consumer charm units provide their bind_address to the Prometheus provider. Is this the right address to use? Is it guaranteed that this address is visible to both the Prometheus provider application container and the Prometheus provider charm container?

Support juju actions for adding a new scrape target

Use case

A system administrator would like to add a new scrape target using a Juju action.

Open Questions

  1. Why is this required/useful in addition to providing scrape targets using config and relation data?
  2. What parameters should the action have (an IP address, for example)?

Integration tests fail

Running the integration tests, I am getting the following:

➜  prometheus-operator git:(main) tox -e integration
integration installed: attrs==21.2.0,backcall==0.2.0,bcrypt==3.2.0,cachetools==4.2.4,certifi==2021.10.8,cffi==1.15.0,charset-normalizer==2.0.7,cryptography==35.0.0,decorator==5.1.0,google-auth==2.3.0,idna==3.3,iniconfig==1.1.1,ipdb==0.13.9,ipython==7.28.0,jedi==0.18.0,Jinja2==3.0.2,juju==2.9.4,jujubundlelib==0.5.6,kubernetes==18.20.0,macaroonbakery==1.3.1,MarkupSafe==2.0.1,matplotlib-inline==0.1.3,mypy-extensions==0.4.3,oauthlib==3.1.1,packaging==21.0,paramiko==2.8.0,parso==0.8.2,pexpect==4.8.0,pickleshare==0.7.5,pluggy==1.0.0,prompt-toolkit==3.0.21,protobuf==3.19.0,ptyprocess==0.7.0,py==1.10.0,pyasn1==0.4.8,pyasn1-modules==0.2.8,pycparser==2.20,Pygments==2.10.0,pymacaroons==0.13.0,PyNaCl==1.4.0,pyparsing==2.4.7,pyRFC3339==1.1,pytest==6.2.5,pytest-asyncio==0.16.0,pytest-operator==0.8.4,python-dateutil==2.8.2,pytz==2021.3,PyYAML==6.0,requests==2.26.0,requests-oauthlib==1.3.0,rsa==4.7.2,six==1.16.0,theblues==0.5.2,toml==0.10.2,toposort==1.7,traitlets==5.1.0,typing-extensions==3.10.0.2,typing-inspect==0.7.1,urllib3==1.26.7,wcwidth==0.2.5,websocket-client==1.2.1,websockets==7.0
integration run-test-pre: PYTHONHASHSEED='1200984586'
integration run-test: commands[0] | pytest -v --tb native --log-cli-level=INFO -s /home/jose/trabajos/canonical/repos/prometheus-operator/tests/integration
====================================================================================== test session starts ======================================================================================
platform linux -- Python 3.8.10, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /home/jose/trabajos/canonical/repos/prometheus-operator/.tox/integration/bin/python
cachedir: .tox/integration/.pytest_cache
rootdir: /home/jose/trabajos/canonical/repos/prometheus-operator
plugins: asyncio-0.16.0, operator-0.8.4
collected 1 item

tests/integration/test_charm.py::test_build_and_deploy_with_ubuntu_image /snap/bin/juju
/snap/bin/charmcraft

---------------------------------------------------------------------------------------- live log setup -----------------------------------------------------------------------------------------
INFO     pytest_operator.plugin:plugin.py:160 Using tmp_path: /home/jose/trabajos/canonical/repos/prometheus-operator/.tox/integration/tmp/pytest/test-charm-qbtp0
INFO     pytest_operator.plugin:plugin.py:222 Adding model mk8s:test-charm-qbtp
WARNING  juju.client.connection:connection.py:729 unknown facade CAASModelConfigManager
WARNING  juju.client.connection:connection.py:753 unexpected facade CAASModelConfigManager found, unable to decipher version to use
WARNING  juju.client.connection:connection.py:729 unknown facade RaftLease
WARNING  juju.client.connection:connection.py:753 unexpected facade RaftLease found, unable to decipher version to use
WARNING  juju.client.connection:connection.py:729 unknown facade Secrets
WARNING  juju.client.connection:connection.py:753 unexpected facade Secrets found, unable to decipher version to use
WARNING  juju.client.connection:connection.py:729 unknown facade SecretsManager
WARNING  juju.client.connection:connection.py:753 unexpected facade SecretsManager found, unable to decipher version to use
WARNING  juju.client.connection:connection.py:729 unknown facade SecretsRotationWatcher
WARNING  juju.client.connection:connection.py:753 unexpected facade SecretsRotationWatcher found, unable to decipher version to use
----------------------------------------------------------------------------------------- live log call -----------------------------------------------------------------------------------------
INFO     pytest_operator.plugin:plugin.py:338 Building charm prometheus-k8s
INFO     juju.model:model.py:1873 Deploying local:focal/prometheus-k8s-0
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [allocating] waiting: installing agent
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
FAILED
--------------------------------------------------------------------------------------- live log teardown ---------------------------------------------------------------------------------------
INFO     pytest_operator.plugin:plugin.py:260 Model status:

Unit           Machine  Status       Message
prometheus/0*  no-machine  blocked      Failed to load Prometheus config

Machine  Series      Status
INFO     pytest_operator.plugin:plugin.py:280 Juju error logs:

controller-0: 19:02:03 ERROR juju.worker.caasapplicationprovisioner.runner exited "prometheus": Operation cannot be fulfilled on pods "prometheus-0": the object has been modified; please apply your changes to the latest version and try again
unit-prometheus-0: 19:02:10 ERROR unit.prometheus/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 1103, in _request_raw
    response = self.opener.open(request, timeout=self.timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/model.py", line 1133, in restart
    self._pebble.restart_services(service_names)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 1250, in restart_services
    return self._services_action('restart', services, timeout, delay)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 1263, in _services_action
    resp = self._request('POST', '/v1/services', body=body)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 1074, in _request
    response = self._request_raw(method, path, query, headers, data)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 1114, in _request_raw
    raise APIError(body, code, status, message)
ops.pebble.APIError: action "restart" is unsupported

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/charm.py", line 311, in <module>
    main(PrometheusCharm)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/main.py", line 408, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/main.py", line 142, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/framework.py", line 275, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/framework.py", line 735, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/framework.py", line 782, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 120, in _configure
    container.restart(self._name)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/model.py", line 1138, in restart
    for svc in self.get_services(service_names):
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/model.py", line 1179, in get_services
    services = self._pebble.get_services(service_names)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 1398, in get_services
    query = {'names': ','.join(names)}
TypeError: sequence item 0: expected str instance, tuple found
unit-prometheus-0: 19:02:10 ERROR juju.worker.uniter.operation hook "prometheus-pebble-ready" (via hook dispatching script: dispatch) failed: exit status 1
unit-prometheus-0: 19:02:10 ERROR juju.worker.uniter pebble poll failed for container "prometheus": hook failed

INFO     pytest_operator.plugin:plugin.py:294 Destroying model test-charm-qbtp
ERROR    asyncio:base_events.py:1707 Task exception was never retrieved
future: <Task finished name='Task-739' coro=<WebSocketCommonProtocol.recv() done, defined at /home/jose/trabajos/canonical/repos/prometheus-operator/.tox/integration/lib/python3.8/site-packages/websockets/protocol.py:369> exception=ConnectionClosed('WebSocket connection is closed: code = 1000 (OK), no reason')>
Traceback (most recent call last):
  File "/home/jose/trabajos/canonical/repos/prometheus-operator/.tox/integration/lib/python3.8/site-packages/websockets/protocol.py", line 434, in recv
    yield from self.ensure_open()
  File "/home/jose/trabajos/canonical/repos/prometheus-operator/.tox/integration/lib/python3.8/site-packages/websockets/protocol.py", line 644, in ensure_open
    raise ConnectionClosed(
websockets.exceptions.ConnectionClosed: WebSocket connection is closed: code = 1000 (OK), no reason


=========================================================================================== FAILURES ============================================================================================
____________________________________________________________________________ test_build_and_deploy_with_ubuntu_image ____________________________________________________________________________
Traceback (most recent call last):
  File "/home/jose/trabajos/canonical/repos/prometheus-operator/tests/integration/test_charm.py", line 21, in test_build_and_deploy_with_ubuntu_image
    await ops_test.model.wait_for_idle(apps=["prometheus"], status="active")
  File "/home/jose/trabajos/canonical/repos/prometheus-operator/.tox/integration/lib/python3.8/site-packages/juju/model.py", line 2625, in wait_for_idle
    raise jasyncio.TimeoutError("Timed out waiting for model:\n" + busy)
asyncio.exceptions.TimeoutError: Timed out waiting for model:
prometheus/0 [idle] blocked: Failed to load Prometheus config
-------------------------------------------------------------------------------------- Captured log setup ---------------------------------------------------------------------------------------
INFO     pytest_operator.plugin:plugin.py:160 Using tmp_path: /home/jose/trabajos/canonical/repos/prometheus-operator/.tox/integration/tmp/pytest/test-charm-qbtp0
INFO     pytest_operator.plugin:plugin.py:222 Adding model mk8s:test-charm-qbtp
WARNING  juju.client.connection:connection.py:729 unknown facade CAASModelConfigManager
WARNING  juju.client.connection:connection.py:753 unexpected facade CAASModelConfigManager found, unable to decipher version to use
WARNING  juju.client.connection:connection.py:729 unknown facade RaftLease
WARNING  juju.client.connection:connection.py:753 unexpected facade RaftLease found, unable to decipher version to use
WARNING  juju.client.connection:connection.py:729 unknown facade Secrets
WARNING  juju.client.connection:connection.py:753 unexpected facade Secrets found, unable to decipher version to use
WARNING  juju.client.connection:connection.py:729 unknown facade SecretsManager
WARNING  juju.client.connection:connection.py:753 unexpected facade SecretsManager found, unable to decipher version to use
WARNING  juju.client.connection:connection.py:729 unknown facade SecretsRotationWatcher
WARNING  juju.client.connection:connection.py:753 unexpected facade SecretsRotationWatcher found, unable to decipher version to use
--------------------------------------------------------------------------------------- Captured log call ---------------------------------------------------------------------------------------
INFO     pytest_operator.plugin:plugin.py:338 Building charm prometheus-k8s
INFO     juju.model:model.py:1873 Deploying local:focal/prometheus-k8s-0
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [allocating] waiting: installing agent
INFO     juju.model:model.py:2627 Waiting for model:
  prometheus/0 [idle] blocked: Failed to load Prometheus config
[the previous two lines repeat until the wait times out]
------------------------------------------------------------------------------------- Captured log teardown -------------------------------------------------------------------------------------
INFO     pytest_operator.plugin:plugin.py:260 Model status:

Unit           Machine  Status       Message
prometheus/0*  no-machine  blocked      Failed to load Prometheus config

Machine  Series      Status
INFO     pytest_operator.plugin:plugin.py:280 Juju error logs:

controller-0: 19:02:03 ERROR juju.worker.caasapplicationprovisioner.runner exited "prometheus": Operation cannot be fulfilled on pods "prometheus-0": the object has been modified; please apply your changes to the latest version and try again
unit-prometheus-0: 19:02:10 ERROR unit.prometheus/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 1103, in _request_raw
    response = self.opener.open(request, timeout=self.timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/model.py", line 1133, in restart
    self._pebble.restart_services(service_names)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 1250, in restart_services
    return self._services_action('restart', services, timeout, delay)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 1263, in _services_action
    resp = self._request('POST', '/v1/services', body=body)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 1074, in _request
    response = self._request_raw(method, path, query, headers, data)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 1114, in _request_raw
    raise APIError(body, code, status, message)
ops.pebble.APIError: action "restart" is unsupported

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/charm.py", line 311, in <module>
    main(PrometheusCharm)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/main.py", line 408, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/main.py", line 142, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/framework.py", line 275, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/framework.py", line 735, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/framework.py", line 782, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 120, in _configure
    container.restart(self._name)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/model.py", line 1138, in restart
    for svc in self.get_services(service_names):
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/model.py", line 1179, in get_services
    services = self._pebble.get_services(service_names)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/ops/pebble.py", line 1398, in get_services
    query = {'names': ','.join(names)}
TypeError: sequence item 0: expected str instance, tuple found
unit-prometheus-0: 19:02:10 ERROR juju.worker.uniter.operation hook "prometheus-pebble-ready" (via hook dispatching script: dispatch) failed: exit status 1
unit-prometheus-0: 19:02:10 ERROR juju.worker.uniter pebble poll failed for container "prometheus": hook failed

INFO     pytest_operator.plugin:plugin.py:294 Destroying model test-charm-qbtp
ERROR    asyncio:base_events.py:1707 Task exception was never retrieved
future: <Task finished name='Task-739' coro=<WebSocketCommonProtocol.recv() done, defined at /home/jose/trabajos/canonical/repos/prometheus-operator/.tox/integration/lib/python3.8/site-packages/websockets/protocol.py:369> exception=ConnectionClosed('WebSocket connection is closed: code = 1000 (OK), no reason')>
Traceback (most recent call last):
  File "/home/jose/trabajos/canonical/repos/prometheus-operator/.tox/integration/lib/python3.8/site-packages/websockets/protocol.py", line 434, in recv
    yield from self.ensure_open()
  File "/home/jose/trabajos/canonical/repos/prometheus-operator/.tox/integration/lib/python3.8/site-packages/websockets/protocol.py", line 644, in ensure_open
    raise ConnectionClosed(
websockets.exceptions.ConnectionClosed: WebSocket connection is closed: code = 1000 (OK), no reason
==================================================================================== short test summary info ====================================================================================
FAILED tests/integration/test_charm.py::test_build_and_deploy_with_ubuntu_image - asyncio.exceptions.TimeoutError: Timed out waiting for model:
================================================================================= 1 failed in 654.98s (0:10:54) =================================================================================
ERROR: InvocationError for command /home/jose/trabajos/canonical/repos/prometheus-operator/.tox/integration/bin/pytest -v --tb native --log-cli-level=INFO -s tests/integration (exited with code 1)
____________________________________________________________________________________________ summary ____________________________________________________________________________________________
ERROR:   integration: commands failed

reload_configuration times out after service restart

https://github.com/canonical/prometheus-operator/blob/ccc2858ac1b2c8d1a4dc1a0293969b94687598f0/src/charm.py#L123

When I change scrape-config via juju config, Prometheus tries to reload its configuration, but the workload is probably not running yet.

  File "./src/charm.py", line 123, in _configure
    reloaded = self._prometheus_server.reload_configuration()
  File "/var/lib/juju/agents/unit-prometheus-0/charm/src/prometheus_server.py", line 38, in reload_configuration
    response = post(url, timeout=self.api_timeout)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/requests/api.py", line 117, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/var/lib/juju/agents/unit-prometheus-0/charm/venv/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=9090): Read timed out. (read timeout=2.0)
unit-prometheus-0: 14:46:05.350 ERROR juju.worker.uniter.operation hook "metrics-endpoint-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1
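One way to make the reload less fragile is to retry the reload endpoint for a short window after a restart, since the server may simply not be listening yet. The sketch below is a hypothetical helper, not the charm's actual code; `post` stands in for any `requests.post`-compatible callable, and the retry counts and delays are illustrative.

```python
import time

def reload_with_retry(post, url, attempts=5, delay=2.0, timeout=2.0):
    """Retry Prometheus' reload endpoint until it answers or attempts run out.

    `post` is any requests.post-compatible callable; returns True on a 200
    response, False if the endpoint never came up within the retry budget.
    """
    for _ in range(attempts):
        try:
            resp = post(url, timeout=timeout)
            if resp.status_code == 200:
                return True
        except Exception:
            # Connection refused / read timeout: the workload is likely
            # still starting up, so wait and try again.
            pass
        time.sleep(delay)
    return False
```

The caller can then set a blocked or maintenance status (or defer the event) when the helper returns False, instead of letting the ReadTimeout crash the hook.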

Forwarding Juju topology through intermediate charms

Introduction

Future revisions of the Prometheus charm may support relations with scrape targets through intermediate charms such as Scrape Target Integrator or Scrape Configuration charm. This may require a workflow for Juju topology labeling where the topology information is passed through from the scrape target via the intermediate charm.

Alert rules not resilient against leader failure

If the leader unit of a charm with built-in alert rules fails, the alert rules are not reliably sent to Prometheus. We should consider re-publishing the alert rules into the relation data whenever a new leader unit of that charm is elected.
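A minimal sketch of the re-publish idea, assuming a leader-elected handler writes the built-in rules into the application data bag. The class and attribute names here are illustrative, not the charm library's API:

```python
class AlertRulesPublisher:
    """Hypothetical sketch: restore alert rules when leadership changes."""

    def __init__(self, rules):
        self.rules = rules          # the charm's built-in alert rules
        self.relation_data = {}     # stands in for the app relation data bag

    def on_leader_elected(self):
        # Re-publish unconditionally: if the previous leader died before
        # sending the rules, the new leader repairs the relation data.
        self.relation_data["alert_rules"] = self.rules
```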

Test and improve design to handle relations before Prometheus is ready

Introduction

At present the Prometheus charm checks whether the Prometheus server is ready and only then instantiates the Prometheus provider. This runs the risk of failing to handle a relation that arrives very early in the charm lifecycle.

Open Questions

  1. Can improvements to the Prometheus charm library handle this problem?
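The usual ops-framework pattern for this situation is to defer relation events that arrive before the workload is ready, so Juju re-emits them later. The function below is a simplified, duck-typed sketch of that pattern (the real handler would live on the charm class and check Pebble readiness):

```python
def handle_relation_changed(event, workload_ready):
    """Defer the event when Prometheus is not up yet; handle it otherwise.

    `event` only needs a defer() method here; returns True when the
    relation was actually processed.
    """
    if not workload_ready:
        event.defer()   # Juju will re-run this hook later
        return False
    # ... configure scrape jobs from the relation data ...
    return True
```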

Pending cleanups for integration tests

  1. Provide a cleanup routine that removes the copied-over Prometheus library file. Perhaps also create and delete the library file directory.
  2. Move the assertion that Prometheus is ready into the test code rather than the helper file.
  3. Explore whether the model reset can be done with a test-scoped autouse fixture.
  4. Gather consensus on whether all, some, or none of the tests should abort on failure.

Support multiple ports, paths and scrape jobs

Introduction

Recent changes to the Prometheus charm library fundamentally changed its structure in order to support unit-specific Juju topology scrape target labels. This change implemented a workflow that restricted all units to a single metrics path and port, and restricted each charm to a single scrape job. The purpose of this ticket is to broaden the Prometheus scrape library to support multiple scrape jobs for the same charm, allowing different paths and ports.

Should the charm library use stored state?

The Prometheus charm library currently uses stored state to save its list of scrape targets. This was motivated by the need to allow use of the library regardless of whether a Prometheus relation existed at the time add_endpoint was invoked. However, stored state is per-unit state, which may be problematic.
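The alternative is to treat relation data as the source of truth and recompute the scrape jobs on every hook instead of caching them per unit. A hypothetical sketch of that approach, with relations represented as plain dicts for illustration:

```python
def scrape_jobs(relations):
    """Derive scrape jobs from relation data on each hook invocation.

    `relations` is a list of dicts standing in for relation data bags;
    nothing is cached in per-unit StoredState.
    """
    jobs = []
    for rel in relations:
        targets = rel.get("targets", [])
        if targets:
            jobs.append({
                "job_name": rel.get("name", "job"),
                "static_configs": [{"targets": targets}],
            })
    return jobs
```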

Ingress relation for Prometheus

Introduction

The purpose of this issue is to gather requirements and refine the implementation design for the use cases of an ingress controller for the Prometheus operator charm. Some alternatives are:

  1. Only allow relations from charms in the same model. If this is the case no ingress controller relation is required for Prometheus.
  2. Provide an ingress controller relation and allow scraping of data from any related charm. However, for security reasons, only support the Grafana relation within the same model, since that relation allows data exfiltration. This still leaves open the possibility of data poisoning by charms outside the model.
  3. Support cross-model relations of any type dynamically. This implies the Prometheus charm will detect whether an incoming relation (scrape target, Alertmanager, Grafana, etc.) is intra-model or cross-model. For cross-model relations the charm will accept and provide only public IP addresses; in all other cases it will accept and provide private IP addresses.

Creating relation to prometheus-k8s fails with "no relations found"

Hello,

I was trying to create a basic monitoring setup with prometheus-k8s, grafana-k8s and kube-state-metrics on a microk8s cluster with Juju. prometheus-k8s from latest/stable (version 1, revision 1) works, but the latest/edge channel and a self-built version from the repository (master commit c2d3bc5) do not.

Issue: Cannot relate any other charm with prometheus-k8s

What is also suspicious to me is that prometheus does not have a port listed. When I deployed latest/stable, the default port 9090 was listed.

$ juju status
Model       Controller             Cloud/Region           Version  SLA          Timestamp
monitoring  microk8s-vm-localhost  microk8s-vm/localhost  2.9.10   unsupported  08:57:48+02:00

App                 Version                  Status  Scale  Charm               Store     Channel  Rev  OS          Address         Message
grafana-k8s         grafana/grafana@7f26ece  active      1  grafana-k8s         charmhub  stable     1  kubernetes  10.152.183.2    
kube-state-metrics                           active      1  kube-state-metrics  charmhub  stable     3  kubernetes  10.152.183.205  
prometheus-k8s                               active      1  prometheus-k8s      local                0  kubernetes  10.152.183.151  

Unit                   Workload  Agent  Address      Ports     Message
grafana-k8s/0*         active    idle   10.1.254.74  3000/TCP  
kube-state-metrics/0*  active    idle   10.1.254.76            
prometheus-k8s/0*      active    idle   10.1.254.75  
          
$ juju --debug relate kube-state-metrics prometheus-k8s
08:58:25 INFO  juju.cmd supercommand.go:56 running juju [2.9.11 0 7fcbdb3115b295c1610287d0db7323dfa72e8f21 gc go1.14.15]
08:58:25 DEBUG juju.cmd supercommand.go:57   args: []string{"/snap/juju/16977/bin/juju", "--debug", "relate", "kube-state-metrics", "prometheus-k8s"}
08:58:25 INFO  juju.juju api.go:78 connecting to API addresses: [10.152.183.136:17070]
08:58:25 DEBUG juju.api apiclient.go:618 starting proxier for connection
08:58:25 DEBUG juju.api apiclient.go:622 tunnel proxy in use at localhost on port 33873
08:58:25 DEBUG juju.api apiclient.go:1149 successfully dialed "wss://localhost:33873/model/2879f1c6-f9ee-41a5-84df-da21268bc4eb/api"
08:58:25 INFO  juju.api apiclient.go:1051 cannot resolve "localhost": operation was canceled
08:58:25 INFO  juju.api apiclient.go:680 connection established to "wss://localhost:33873/model/2879f1c6-f9ee-41a5-84df-da21268bc4eb/api"
08:58:25 DEBUG juju.api monitor.go:35 RPC connection died
ERROR no relations found
08:58:25 DEBUG cmd supercommand.go:537 error stack: 
no relations found
/build/snapcraft-juju-35d6cf/parts/juju/src/rpc/client.go:178: 
/build/snapcraft-juju-35d6cf/parts/juju/src/api/apiclient.go:1248: 

$ juju --debug relate grafana-k8s prometheus-k8s
08:59:06 INFO  juju.cmd supercommand.go:56 running juju [2.9.11 0 7fcbdb3115b295c1610287d0db7323dfa72e8f21 gc go1.14.15]
08:59:06 DEBUG juju.cmd supercommand.go:57   args: []string{"/snap/juju/16977/bin/juju", "--debug", "relate", "grafana-k8s", "prometheus-k8s"}
08:59:06 INFO  juju.juju api.go:78 connecting to API addresses: [10.152.183.136:17070]
08:59:06 DEBUG juju.api apiclient.go:618 starting proxier for connection
08:59:06 DEBUG juju.api apiclient.go:622 tunnel proxy in use at localhost on port 33161
08:59:06 DEBUG juju.api apiclient.go:1149 successfully dialed "wss://localhost:33161/model/2879f1c6-f9ee-41a5-84df-da21268bc4eb/api"
08:59:06 INFO  juju.api apiclient.go:1051 cannot resolve "localhost": operation was canceled
08:59:06 INFO  juju.api apiclient.go:680 connection established to "wss://localhost:33161/model/2879f1c6-f9ee-41a5-84df-da21268bc4eb/api"
08:59:06 DEBUG juju.api monitor.go:35 RPC connection died
ERROR no relations found
08:59:06 DEBUG cmd supercommand.go:537 error stack: 
no relations found
/build/snapcraft-juju-35d6cf/parts/juju/src/rpc/client.go:178: 
/build/snapcraft-juju-35d6cf/parts/juju/src/api/apiclient.go:1248: 

version information

microk8s (v1.21.3) is running in a VM created with multipass

$ multipass exec microk8s-vm -- snap info microk8s | grep installed
installed:          v1.21.3             (2346) 191MB classic

host is running ubuntu 21.04 desktop

$ hostnamectl | grep -E 'Op|Kernel|Arch'
  Operating System: Ubuntu 21.04
            Kernel: Linux 5.11.0-25-generic
      Architecture: x86-64

juju is running on the host (version 2.9.11-ubuntu-amd64), juju controller inside microk8s cluster

$ snap info multipass  | grep installed
installed:          1.7.1                              (5309) 326MB -
$ snap info juju  | grep installed
installed:          2.9.11                        (16977) 104MB classic

how to reproduce

install microk8s in VM

sudo snap install multipass
export NAME="microk8s-vm"
multipass launch --name $NAME --cpus 4 --mem 8G --disk 50G
multipass exec $NAME -- sudo snap install microk8s --classic
multipass exec $NAME -- microk8s status --wait-ready
multipass exec $NAME -- sudo iptables -P FORWARD ACCEPT
multipass exec $NAME -- sudo usermod -aG microk8s ubuntu
multipass exec $NAME -- microk8s enable dns storage ingress registry
multipass exec $NAME -- microk8s status --wait-ready
multipass exec $NAME -- microk8s config > .kube/config_$NAME

install juju and bootstrap microk8s

$ sudo snap install juju --classic
$ cat .kube/config_microk8s-vm | juju add-k8s microk8s-vm --cluster-name=microk8s-cluster --client
k8s substrate "microk8s-cluster" added as cloud "microk8s-vm".
You can now bootstrap to this cloud by running 'juju bootstrap microk8s-vm'.
$ juju bootstrap microk8s-vm
$ juju create-storage-pool operator-storage kubernetes storage-class=microk8s-hostpath

install charms

$ juju add-model monitor
Added 'monitor' model on microk8s-vm/localhost with credential 'microk8s-vm' for user 'admin'
$ juju deploy grafana-k8s
Located charm "grafana-k8s" in charm-hub, revision 1
Deploying "grafana-k8s" from charm-hub charm "grafana-k8s", revision 1 in channel stable
$ juju deploy prometheus-k8s --channel edge
Located charm "prometheus-k8s" in charm-hub, revision 6
Deploying "prometheus-k8s" from charm-hub charm "prometheus-k8s", revision 6 in channel edge
$ juju status --relations --storage
Model    Controller             Cloud/Region           Version  SLA          Timestamp
monitor  microk8s-vm-localhost  microk8s-vm/localhost  2.9.10   unsupported  11:51:00+02:00

App             Version                  Status  Scale  Charm           Store     Channel  Rev  OS          Address         Message
grafana-k8s     grafana/grafana@7f26ece  active      1  grafana-k8s     charmhub  stable     1  kubernetes  10.152.183.242  
prometheus-k8s                           active      1  prometheus-k8s  charmhub  edge       6  kubernetes  10.152.183.134  

Unit               Workload  Agent  Address      Ports     Message
grafana-k8s/0*     active    idle   10.1.254.81  3000/TCP  
prometheus-k8s/0*  active    idle   10.1.254.82            

Relation provider    Requirer             Interface      Type  Message
grafana-k8s:grafana  grafana-k8s:grafana  grafana-peers  peer  

Storage Unit      Storage id  Type        Pool        Mountpoint                        Size    Status    Message
grafana-k8s/0     sqlitedb/0  filesystem  kubernetes  /var/lib/grafana                  35MiB   attached  Successfully provisioned volume pvc-22287b49-c618-4123-bf2d-1272fc701f39
prometheus-k8s/0  database/1  filesystem  kubernetes  /var/lib/juju/storage/database/0  1.0GiB  attached  Successfully provisioned volume pvc-0c0ba296-bf07-41bd-9fd3-4dddf7baca88
$ juju --debug relate grafana-k8s prometheus-k8s
11:51:18 INFO  juju.cmd supercommand.go:56 running juju [2.9.11 0 7fcbdb3115b295c1610287d0db7323dfa72e8f21 gc go1.14.15]
11:51:18 DEBUG juju.cmd supercommand.go:57   args: []string{"/snap/juju/16977/bin/juju", "--debug", "relate", "grafana-k8s", "prometheus-k8s"}
11:51:18 INFO  juju.juju api.go:78 connecting to API addresses: [10.152.183.136:17070]
11:51:18 DEBUG juju.api apiclient.go:618 starting proxier for connection
11:51:18 DEBUG juju.api apiclient.go:622 tunnel proxy in use at localhost on port 41723
11:51:18 DEBUG juju.api apiclient.go:1149 successfully dialed "wss://localhost:41723/model/89545807-8205-4b09-8ecd-8f68940b235e/api"
11:51:18 INFO  juju.api apiclient.go:1051 cannot resolve "localhost": operation was canceled
11:51:18 INFO  juju.api apiclient.go:680 connection established to "wss://localhost:41723/model/89545807-8205-4b09-8ecd-8f68940b235e/api"
11:51:18 DEBUG juju.api monitor.go:35 RPC connection died
ERROR no relations found
11:51:18 DEBUG cmd supercommand.go:537 error stack: 
no relations found
/build/snapcraft-juju-35d6cf/parts/juju/src/rpc/client.go:178: 
/build/snapcraft-juju-35d6cf/parts/juju/src/api/apiclient.go:1248: 

"src" is missing from DEFAULT_ALERT_RULES_RELATIVE_PATH

My alert rule is suddenly not being picked up and integration tests are failing (example).
It seems DEFAULT_ALERT_RULES_RELATIVE_PATH etc. is currently resolving to <charm root>/prometheus_alert_rules instead of <charm root>/src/prometheus_alert_rules.

logger.warning("charm.charm_dir = %s, cwd = %s; charm_dir = %s", charm.charm_dir, os.getcwd(), charm_dir)
unit-avalanche-0: 14:18:39.359 WARNING unit.avalanche/0.juju-log charm.charm_dir = /var/lib/juju/agents/unit-avalanche-0/charm, cwd = /var/lib/juju/agents/unit-avalanche-0/charm; charm_dir = /var/lib/juju/agents/unit-avalanche-0/charm
> jsh avalanche/0
# ls /var/lib/juju/agents/unit-avalanche-0/charm
LICENSE  README.md  config.yaml  dispatch  hooks  lib  manifest.yaml  metadata.yaml  revision  src  venv

After moving prometheus_alert_rules one folder up, the alert rule is picked up and integration tests pass again.
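The fix amounts to including the "src" component when resolving the rules directory against the charm root. A hedged sketch of the intended resolution (function name and default are illustrative, not the library's actual constants):

```python
from pathlib import Path

def alert_rules_dir(charm_dir, relative_path="src/prometheus_alert_rules"):
    """Resolve the alert rules directory relative to the charm root.

    Note the leading "src" component, which was missing from
    DEFAULT_ALERT_RULES_RELATIVE_PATH and caused rules to be looked up
    in <charm root>/prometheus_alert_rules instead.
    """
    return Path(charm_dir) / relative_path
```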

Investigate effects of scaling Prometheus

Introduction

What is the effect of scaling Prometheus to more than one unit in the presence of relations with one or more scrape targets, Grafana and Alertmanager?

Delete useless actions

Clean up actions.yaml. I am under the impression that it is a legacy of PodSpec and makes no sense in an Operator SDK charm.

Provides vs Requires for monitoring endpoint

As discussed in MM, the prometheus2 machine charm uses requires for the scrape endpoint (which uses the prometheus interface protocol), whereas this charm uses provides for the equivalent monitoring endpoint. The choice of provides or requires is largely subjective, although Chris thinks that CMR might impose hard restrictions. If that is true, then provides is the correct choice, but it means all existing charms that support this interface will have to be updated. If CMR does not impose any restrictions, then for compatibility reasons it might be better to keep the existing choice.

./run_tests fails

I've just downloaded the charm code and, following the README.md, tried running the tests, but got:

$ ./run_tests 
tests.test_charm (unittest.loader._FailedTest) ... ERROR

======================================================================
ERROR: tests.test_charm (unittest.loader._FailedTest)
----------------------------------------------------------------------
ImportError: Failed to import test module: tests.test_charm
Traceback (most recent call last):
  File "/usr/lib/python3.8/unittest/loader.py", line 436, in _find_test_path
    module = self._get_module_from_name(name)
  File "/usr/lib/python3.8/unittest/loader.py", line 377, in _get_module_from_name
    __import__(name)
  File "/home/mthaddon/repos/prometheus-operator/prometheus-operator/tests/test_charm.py", line 8, in <module>
    from ops.testing import Harness
ModuleNotFoundError: No module named 'ops'


----------------------------------------------------------------------
Ran 1 test in 0.000s

FAILED (errors=1)

Investigate if prometheus-tester could be replaced with avalanche

I still believe prometheus-tester should not be needed (should not exist):

  • test scraping/remote-write + alerting with avalanche and test-specific rule file
  • test relations etc. with real charms: deploy Alertmanager and/or Grafana as part of your integration test. If the test involves many LMA charms, it should probably be a bundle test (bundle tests are to be run anyway as part of the integration tests).

Originally posted by @sed-i in #150 (comment)

Drop the *.rule format and only use the standard *.rules

From looking at an internal client's rule files:

  1. all the rules are in the same folder (not divided into group dirs)
  2. in all rule files there is only one group per file, and group name matches filename
  3. in all rule files there are multiple alerts per file

This is in contrast to our current working assumptions:

  1. Only one rule per file, and the "groups" level is omitted.
  2. All rules from files in the same directory are loaded into a single group.
  3. Rule files are loaded recursively and the relpath to root becomes part of the group name.

It seems the prom lib should support "proper" rules files so that users do not need to refactor their entire rules collection.
Should we drop the *.rule format we rolled on our own and focus only on the standard *.rules?
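For reference, converting our single-rule *.rule format into the standard *.rules structure is a mechanical wrap: each file's rules become one named group under a top-level "groups" key. A hypothetical helper illustrating the shape (names are illustrative):

```python
def to_standard_rules(group_name, rules):
    """Wrap bare alert rules (our *.rule format) into the standard
    Prometheus *.rules structure: one named group holding the rules.
    """
    return {"groups": [{"name": group_name, "rules": rules}]}
```

Under the client's conventions above, `group_name` would simply be the filename stem, matching the one-group-per-file layout they already use.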

config_hash calculated incorrectly

https://github.com/canonical/prometheus-operator/blob/56c820f5cd093846fc0fe0f642e087f5782d30d9/src/charm.py#L91

config_hash may be the same value even for different strings:

>>> prometheus_config1 = "{'global': {'scrape_interval': '1m', 'scrape_timeout': '10s', 'evaluation_interval': '1m'}, 'scrape_configs': [{'job_name': 'prometheus', 'scrape_interval': '5s', 'scrape_timeout': '5s', 'metrics_path': '/metrics', 'honor_timestamps': True, 'scheme': 'http', 'static_configs': [{'targets': ['localhost:9090']}]}], 'alerting': {'alertmanagers': [{'static_configs': [{'targets': ['10.1.157.127', '10.1.157.71', '10.1.157.74']}]}]}}"
>>> prometheus_config2 = "{'global': {'scrape_interval': '1m', 'scrape_timeout': '10s', 'evaluation_interval': '1m'}, 'scrape_configs': [{'job_name': 'prometheus', 'scrape_interval': '5s', 'scrape_timeout': '5s', 'metrics_path': '/metrics', 'honor_timestamps': True, 'scheme': 'http', 'static_configs': [{'targets': ['localhost:9090']}]}], 'alerting': {'alertmanagers': [{'static_configs': [{'targets': ['10.1.157.127', '10.1.157.71', '10.1.157.72', '10.1.157.74']}]}]}}"
>>> str(hashlib.md5(str(prometheus_config1).encode('utf-8')))
'<md5 HASH object @ 0x7f54ddcdfa10>'
>>> str(hashlib.md5(str(prometheus_config2).encode('utf-8')))
'<md5 HASH object @ 0x7f54ddcdfa10>'

A call to .hexdigest() is missing:

>>> hashlib.md5(str(prometheus_config1).encode('utf-8')).hexdigest()
'48e7ae9dce66f3022a3b5342e002c356'
>>> hashlib.md5(str(prometheus_config2).encode('utf-8')).hexdigest()
'6e1dcd35bee04b36d76674128782230b'

Alert rules not updated over juju refresh

When a new version of a charm embeds different alert rules than the previous one, the new alert rules do not seem to be propagated to Prometheus. We are probably missing a hook in the Prometheus library for provider charm updates. This lack of updating on charm upgrade is likely also the case for scrape jobs, by the way :-)

Hook error on config-changed (on master)

I encountered the following hook error in a test run:

controller-0: 20:52:24 ERROR juju.worker.caasapplicationprovisioner.runner exited "prometheus-k8s": restart immediately
unit-prometheus-k8s-0: 20:52:38 ERROR unit.prometheus-k8s/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/usr/lib/python3.8/urllib/request.py", line 1350, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/lib/python3.8/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1301, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1250, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1010, in _send_output
    self.send(msg)
  File "/usr/lib/python3.8/http/client.py", line 950, in send
    self.connect()
  File "/var/lib/juju/agents/unit-prometheus-k8s-0/charm/venv/ops/pebble.py", line 58, in connect
    self.sock.connect(self.socket_path)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-prometheus-k8s-0/charm/venv/ops/pebble.py", line 737, in _request_raw
    response = self.opener.open(request, timeout=self.timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/var/lib/juju/agents/unit-prometheus-k8s-0/charm/venv/ops/pebble.py", line 72, in http_open
    return self.do_open(_UnixSocketConnection, req, socket_path=self.socket_path)
  File "/usr/lib/python3.8/urllib/request.py", line 1353, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 2] No such file or directory>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/charm.py", line 436, in <module>
    main(PrometheusCharm)
  File "/var/lib/juju/agents/unit-prometheus-k8s-0/charm/venv/ops/main.py", line 406, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-prometheus-k8s-0/charm/venv/ops/main.py", line 140, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-prometheus-k8s-0/charm/venv/ops/framework.py", line 278, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-prometheus-k8s-0/charm/venv/ops/framework.py", line 722, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-prometheus-k8s-0/charm/venv/ops/framework.py", line 767, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 79, in _on_config_changed
    container.push(PROMETHEUS_CONFIG, prometheus_config)
  File "/var/lib/juju/agents/unit-prometheus-k8s-0/charm/venv/ops/model.py", line 1135, in push
    self._pebble.push(path, source, encoding=encoding, make_dirs=make_dirs,
  File "/var/lib/juju/agents/unit-prometheus-k8s-0/charm/venv/ops/pebble.py", line 1037, in push
    response = self._request_raw('POST', '/v1/files', None, headers, data)
  File "/var/lib/juju/agents/unit-prometheus-k8s-0/charm/venv/ops/pebble.py", line 750, in _request_raw
    raise ConnectionError(e.reason)
ops.pebble.ConnectionError: [Errno 2] No such file or directory
unit-prometheus-k8s-0: 20:52:38 ERROR juju.worker.uniter.operation hook "config-changed" (via hook dispatching script: dispatch) failed: exit status 1

Not sure what's going on there, and it does not happen most of the time. It might be a Pebble issue.
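A common mitigation is to guard workload operations on Pebble availability and bail out (or defer) instead of letting the ConnectionError crash the hook. The sketch below is duck-typed on the ops Container API (can_connect, push); the helper itself is hypothetical:

```python
def safe_push(container, path, source):
    """Push a file only when the Pebble socket is reachable.

    Returns False when the socket is not up yet, letting the caller defer
    the event instead of raising ops.pebble.ConnectionError.
    """
    if not container.can_connect():
        return False
    container.push(path, source, make_dirs=True)
    return True
```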

Rename prometheus relation interface

As we are likely to expose other interfaces in the future, such as prometheus_remote_write, we should rename the overly generic prometheus interface to prometheus_scrape.
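A hypothetical metadata.yaml fragment illustrating the rename (the endpoint name metrics-endpoint is assumed here for illustration):

```yaml
provides:
  metrics-endpoint:
    interface: prometheus_scrape
```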

Reconsider alertmanager interface name

alertmanager is not a general relation name; it is really the interface of the relation. We need to find a more general name and use it as the standard name for "we push alerts" relations.

Prometheus image path as config vs resource

The general recommendation from Juju is to use a Resource of type "oci-image" for the image that a charm uses for its application.
This is because it lets us snapshot an exact digest at charm publish time, while still allowing the user to override the image at deploy time.

For example:
https://github.com/juju-solutions/bundle-kubeflow/blob/master/charms/dex-auth/metadata.yaml

I realize that Mattermost did not do this, but as Prometheus is going to be our example charm, we want to align with the recommended path so that others copying it follow suit.
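For illustration, a hypothetical metadata.yaml fragment declaring the workload image as an oci-image resource (the resource name and upstream-source value are assumptions, not taken from the charm):

```yaml
# metadata.yaml (sketch): image as a resource instead of a config option
resources:
  prometheus-image:
    type: oci-image
    description: OCI image for the Prometheus workload
    upstream-source: ubuntu/prometheus:latest
```

Declared this way, `juju deploy` uses the published digest by default, and `--resource prometheus-image=...` lets the user override it.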
