canonical / cos-lite-bundle

Canonical Observability Stack Lite, or COS Lite, is a light-weight, highly-integrated, Juju-based observability suite running on Kubernetes.

Home Page: https://charmhub.io/cos-lite

License: Apache License 2.0

Jinja 3.22% Python 62.43% HCL 23.94% JavaScript 0.98% TypeScript 2.60% Shell 6.83%
alertmanager grafana juju kubernetes loki observability prometheus hacktoberfest juju-bundle

cos-lite-bundle's Introduction

COS Lite bundle

The lite flavor of Canonical Observability Stack, also called COS Lite, is a light-weight, highly-integrated, Juju-based observability suite running on Kubernetes.

This Juju bundle deploys the stack, consisting of the following interrelated charms:

  • prometheus-k8s
  • loki-k8s
  • grafana-k8s
  • alertmanager-k8s
  • traefik-k8s
  • catalogue-k8s

This bundle is under development. Join us on Discourse and Mattermost!

The Vision

The Canonical Observability Stack is the go-to solution for monitoring Canonical appliances when the end user does not already have an established observability stack. COS Lite, being a flavor of the Canonical Observability Stack, is designed for:

  • Best-in-class monitoring of software charmed with Juju
  • Limited resource consumption
  • High integration and out-of-the-box value
  • Running on MicroK8s

With COS Lite now generally available, we are working on a highly-available, highly-scalable flavor. It will use many of the same components as COS Lite, plus some additional new ones, provide the same overall user experience, and focus on scalability, resilience and broad compatibility with the Kubernetes distributions out there.

Usage

For Traefik ingress to work, you may first need to enable the metallb MicroK8s addon. See the tutorial for full details.

The --trust option is needed by the charms in the cos-lite bundle so that they can patch their K8s services as needed.

Before deploying the bundle you will most likely want to create a dedicated model for it:

$ juju add-model cos
$ juju switch cos

Deploy from charmhub

You can deploy the bundle from charmhub with:

$ juju deploy cos-lite --trust

Deploy using this repo

To deploy the bundle from a local file:

# render bundle with "edge" charms
$ tox -e render-edge

$ juju deploy ./bundle.yaml --trust

Deploy for testing

tox -e render-edge
juju deploy ./bundle.yaml --trust \
  --overlay overlays/tls-overlay.yaml \
  --overlay overlays/testing-overlay.yaml

Deploy for testing with local charms

# generate and activate a virtual environment with dependencies
$ tox -e integration --notest
$ source .tox/integration/bin/activate

# render bundle, overriding charm paths
$ ./render_bundle.py bundle.yaml --channel=edge \
  --traefik=$(pwd)/../path/to/traefik.charm \
  --prometheus=$(pwd)/../path/to/prometheus.charm \
  --alertmanager=$(pwd)/../path/to/alertmanager.charm \
  --grafana=$(pwd)/../path/to/grafana.charm \
  --loki=$(pwd)/../path/to/loki.charm

# deploy rendered bundle
$ juju deploy ./bundle.yaml --trust

Overlays

We also make available some overlays for convenience:

  • offers: exposes as offers the relation endpoints of the COS Lite charms that are likely to be consumed over cross-model relations.
  • storage-small: provides a small storage setup for the various COS Lite charms. Using an overlay for storage is essential for a production setup, as you cannot change the amount of storage assigned to the various charms after COS Lite has been deployed.
  • tls: adds an internal CA to encrypt all inter-workload communications.
  • testing: adds an avalanche relation to prometheus and a watchdog alert (always firing) to test prometheus and alertmanager.

In order to use the overlays above, you need to:

  1. Download the overlays (or clone the repository)
  2. Pass the --overlay <path-to-overlay-file-1> --overlay <path-to-overlay-file-2> ... arguments to the juju deploy command

For example, to deploy the COS Lite bundle with the offers overlay, you would do the following:

$ curl -L https://raw.githubusercontent.com/canonical/cos-lite-bundle/main/overlays/offers-overlay.yaml -O
$ juju deploy cos-lite --trust --overlay ./offers-overlay.yaml

To use COS Lite with machine charms, see cos-proxy (source).

Publishing

$ tox -e render-edge  # creates bundle.yaml
$ charmcraft pack
$ charmcraft upload cos-lite.zip
$ charmcraft release cos-lite --channel=edge --revision=4

cos-lite-bundle's People

Contributors

abuelodelanada, lucabello, mmkay, sed-i, simskij, sudeephb

cos-lite-bundle's Issues

Fix how `kubectl top pod` is called

cmd = "/snap/microk8s/current/kubectl --kubeconfig /var/snap/microk8s/current/credentials/client.config top pod -n ${JUJU_MODEL_NAME} --no-headers".split()

This is way too brittle. Calling various microk8s commands differs between strictly confined and unconfined installs. The amount of scraping needed to actually pull this in "properly" is enormous, and it's honestly a PITA to get any of these out without something running inside the cluster, but there is a middle ground between "literally call kubectl top pod" and running inside k8s.

That middle ground is:
kubectl get --raw /apis/metrics.k8s.io/v1beta1 | jq

{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "nodes",
      "singularName": "",
      "namespaced": false,
      "kind": "NodeMetrics",
      "verbs": [
        "get",
        "list"
      ]
    },
    {
      "name": "pods",
      "singularName": "",
      "namespaced": true,
      "kind": "PodMetrics",
      "verbs": [
        "get",
        "list"
      ]
    }
  ]
}

You can drill into nodes or pods.
For kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods" | jq, you'll get a list of all of the objects, including metadata such as the namespace, which can be used to filter.

    {
      "metadata": {
        "name": "coredns-597584b69b-tddzt",
        "namespace": "kube-system",
        "creationTimestamp": "2023-02-26T02:08:55Z",
        "labels": {
          "k8s-app": "kube-dns",
          "pod-template-hash": "597584b69b"
        }
      },
      "timestamp": "2023-02-26T02:08:41Z",
      "window": "17.843s",
      "containers": [
        {
          "name": "coredns",
          "usage": {
            "cpu": "3491677n",
            "memory": "24656Ki"
          }
        }
      ]
    }

All of the certs are in /snap/microk8s/current/certs-beta/. You can pretty much yank all of these in pure Python (no subprocess) with something like:

curl $KUBE_API/apis/apps/v1/deployments \
  --cacert /snap/microk8s/current/certs-beta/ca.crt \
  --cert /snap/microk8s/current/certs-beta/client.crt \
  --key /snap/microk8s/current/certs-beta/client.key

Yank the certs into the exporter on startup, and use requests or urllib or whatever you want to read it all and transform it.

Originally posted by @rbarry82 in #65 (comment)
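
A minimal sketch of that pure-Python approach, assuming the cert paths quoted above and a default MicroK8s API server address; the 127.0.0.1:16443 endpoint, the namespace handling and the function name are illustrative assumptions, not part of the original comment:

# Sketch: read pod metrics from the metrics API in pure Python (no subprocess),
# using the MicroK8s client certs mentioned above. The API address and cert
# directory are assumed defaults for a MicroK8s install.
import os
import requests

KUBE_API = "https://127.0.0.1:16443"          # assumed default MicroK8s API address
CERTS = "/snap/microk8s/current/certs-beta"   # path quoted in the comment above

def pod_metrics(namespace: str) -> dict:
    """Return {pod_name: first container's usage} for all pods in a namespace."""
    resp = requests.get(
        f"{KUBE_API}/apis/metrics.k8s.io/v1beta1/namespaces/{namespace}/pods",
        verify=os.path.join(CERTS, "ca.crt"),
        cert=(os.path.join(CERTS, "client.crt"), os.path.join(CERTS, "client.key")),
        timeout=10,
    )
    resp.raise_for_status()
    return {
        item["metadata"]["name"]: item["containers"][0]["usage"]
        for item in resp.json()["items"]
    }

if __name__ == "__main__":
    print(pod_metrics(os.environ.get("JUJU_MODEL_NAME", "kube-system")))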

Catalogue Resource Blocked Due to MIME Type Mismatch

Bug Description

When attempting to access resources from the "cos-lite-catalogue" after deploying cos-lite and navigating to it, the following error message is displayed:

The resource from “http://192.168.1.0/vanilla-framework-3.7.1.min.css” was blocked due to MIME type (“text/plain”) mismatch (X-Content-Type-Options: nosniff). cos-lite-catalogue
The resource from “http://192.168.1.0/ui.css” was blocked due to MIME type (“text/plain”) mismatch (X-Content-Type-Options: nosniff). cos-lite-catalogue
The resource from “http://192.168.1.0/handlebars.min.js” was blocked due to MIME type (“text/plain”) mismatch (X-Content-Type-Options: nosniff). cos-lite-catalogue
The resource from “http://192.168.1.0/iconify.min.js” was blocked due to MIME type (“text/plain”) mismatch (X-Content-Type-Options: nosniff). cos-lite-catalogue
The resource from “http://192.168.1.0/ui.js” was blocked due to MIME type (“text/plain”) mismatch (X-Content-Type-Options: nosniff). cos-lite-catalogue

To Reproduce

  • Deploy cos-lite
  • Navigate to cos-lite-catalogue

Environment

I'm running it locally on LXD on my machine, with cos-lite deployed on microk8s

juju controllers
Use --refresh option with this command to see the latest information.

Controller  Model      User   Access     Cloud/Region         Models  Nodes    HA  Version
lxd*        k8s-cloud  admin  superuser  localhost/localhost       4      1  none  2.9.38

I'm using latest/stable channel

Model     Controller  Cloud/Region         Version  SLA          Timestamp
cos-lite  lxd         micro-k8s/localhost  2.9.38   unsupported  19:31:10+01:00

App           Version  Status  Scale  Charm             Channel  Rev  Address         Exposed  Message
alertmanager  0.23.0   active      1  alertmanager-k8s  stable    47  10.152.183.128  no       
catalogue              active      1  catalogue-k8s     stable    10  10.152.183.174  no       
grafana       9.2.1    active      1  grafana-k8s       stable    64  10.152.183.134  no       
loki          2.4.1    active      1  loki-k8s          stable    60  10.152.183.226  no       
prometheus    2.33.5   active      1  prometheus-k8s    stable   103  10.152.183.170  no       
traefik       2.9.6    active      1  traefik-k8s       stable   110  192.168.1.0     no       

Unit             Workload  Agent  Address       Ports  Message
alertmanager/0*  active    idle   10.1.218.146         
catalogue/0*     active    idle   10.1.218.135         
grafana/0*       active    idle   10.1.218.149         
loki/0*          active    idle   10.1.218.147         
prometheus/0*    active    idle   10.1.218.148         
traefik/0*       active    idle   10.1.218.145         

Offer                            Application   Charm             Rev  Connected  Endpoint              Interface                Role
alertmanager-karma-dashboard     alertmanager  alertmanager-k8s  47   0/0        karma-dashboard       karma_dashboard          provider
grafana-dashboards               grafana       grafana-k8s       64   1/1        grafana-dashboard     grafana_dashboard        requirer
loki-logging                     loki          loki-k8s          60   1/1        logging               loki_push_api            provider
prometheus-receive-remote-write  prometheus    prometheus-k8s    103  0/0        receive-remote-write  prometheus_remote_write  provider
prometheus-scrape                prometheus    prometheus-k8s    103  1/1        metrics-endpoint      prometheus_scrape        requirer

Relevant log output

2023-03-07T17:07:57.042Z [pebble] HTTP API server listening on ":38812".
2023-03-07T17:07:57.042Z [pebble] Started daemon.
2023-03-07T17:07:57.071Z [pebble] POST /v1/services 28.526368ms 202
2023-03-07T17:07:57.071Z [pebble] Started default services with change 1.
2023-03-07T17:07:57.099Z [pebble] Service "container-agent" starting: /charm/bin/containeragent unit --data-dir /var/lib/juju --append-env "PATH=$PATH:/charm/bin" --show-log --charm-modified-version 0
2023-03-07T17:07:57.148Z [container-agent] 2023-03-07 17:07:57 INFO juju.cmd supercommand.go:56 running containerAgent [2.9.38 6d211be0d72d6f4d625c61c7c4ddb4e9325226c8 gc go1.19.4]
2023-03-07T17:07:57.148Z [container-agent] starting containeragent unit command
2023-03-07T17:07:57.149Z [container-agent] containeragent unit "unit-traefik-0" start (2.9.38 [gc])
2023-03-07T17:07:57.149Z [container-agent] 2023-03-07 17:07:57 INFO juju.cmd.containeragent.unit runner.go:556 start "unit"
2023-03-07T17:07:57.149Z [container-agent] 2023-03-07 17:07:57 INFO juju.worker.upgradesteps worker.go:60 upgrade steps for 2.9.38 have already been run.
2023-03-07T17:07:57.149Z [container-agent] 2023-03-07 17:07:57 INFO juju.worker.probehttpserver server.go:157 starting http server on [::]:65301
2023-03-07T17:07:57.367Z [container-agent] 2023-03-07 17:07:57 INFO juju.api apiclient.go:687 connection established to "wss://10.48.188.215:17070/model/e4adda24-152d-49d6-8cf2-c78cc7c0090e/api"
2023-03-07T17:07:57.369Z [container-agent] 2023-03-07 17:07:57 INFO juju.worker.apicaller connect.go:163 [e4adda] "unit-traefik-0" successfully connected to "10.48.188.215:17070"
2023-03-07T17:07:57.583Z [container-agent] 2023-03-07 17:07:57 INFO juju.api apiclient.go:687 connection established to "wss://10.48.188.215:17070/model/e4adda24-152d-49d6-8cf2-c78cc7c0090e/api"
2023-03-07T17:07:57.584Z [container-agent] 2023-03-07 17:07:57 INFO juju.worker.apicaller connect.go:163 [e4adda] "unit-traefik-0" successfully connected to "10.48.188.215:17070"
2023-03-07T17:07:57.606Z [container-agent] 2023-03-07 17:07:57 INFO juju.worker.migrationminion worker.go:142 migration phase is now: NONE
2023-03-07T17:07:57.607Z [container-agent] 2023-03-07 17:07:57 INFO juju.worker.logger logger.go:120 logger worker started
2023-03-07T17:07:57.609Z [container-agent] 2023-03-07 17:07:57 WARNING juju.worker.proxyupdater proxyupdater.go:282 unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
2023-03-07T17:07:57.710Z [container-agent] 2023-03-07 17:07:57 INFO juju.worker.caasupgrader upgrader.go:113 abort check blocked until version event received
2023-03-07T17:07:57.710Z [container-agent] 2023-03-07 17:07:57 INFO juju.worker.caasupgrader upgrader.go:119 unblocking abort check
2023-03-07T17:07:57.749Z [container-agent] 2023-03-07 17:07:57 INFO juju.agent.tools symlinks.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-traefik-0
2023-03-07T17:07:57.749Z [container-agent] 2023-03-07 17:07:57 INFO juju.worker.leadership tracker.go:194 traefik/0 promoted to leadership of traefik
2023-03-07T17:07:57.757Z [container-agent] 2023-03-07 17:07:57 INFO juju.worker.uniter uniter.go:326 unit "traefik/0" started
2023-03-07T17:07:57.757Z [container-agent] 2023-03-07 17:07:57 INFO juju.worker.uniter uniter.go:631 resuming charm install
2023-03-07T17:07:57.758Z [container-agent] 2023-03-07 17:07:57 INFO juju.worker.uniter.charm bundles.go:78 downloading ch:amd64/focal/traefik-k8s-110 from API server
2023-03-07T17:07:57.758Z [container-agent] 2023-03-07 17:07:57 INFO juju.downloader download.go:110 downloading from ch:amd64/focal/traefik-k8s-110
2023-03-07T17:07:57.803Z [container-agent] 2023-03-07 17:07:57 INFO juju.downloader download.go:93 download complete ("ch:amd64/focal/traefik-k8s-110")
2023-03-07T17:07:57.822Z [container-agent] 2023-03-07 17:07:57 INFO juju.downloader download.go:173 download verified ("ch:amd64/focal/traefik-k8s-110")
2023-03-07T17:08:07.056Z [pebble] Check "readiness" failure 1 (threshold 3): received non-20x status code 418
2023-03-07T17:08:17.049Z [pebble] Check "readiness" failure 2 (threshold 3): received non-20x status code 418
2023-03-07T17:08:27.049Z [pebble] Check "readiness" failure 3 (threshold 3): received non-20x status code 418
2023-03-07T17:08:27.049Z [pebble] Check "readiness" failure threshold 3 hit, triggering action
2023-03-07T17:08:37.049Z [pebble] Check "readiness" failure 4 (threshold 3): received non-20x status code 418
2023-03-07T17:08:40.482Z [container-agent] 2023-03-07 17:08:40 INFO juju.worker.uniter uniter.go:344 hooks are retried true
2023-03-07T17:08:40.581Z [container-agent] 2023-03-07 17:08:40 INFO juju.worker.uniter resolver.go:149 found queued "install" hook
2023-03-07T17:08:42.196Z [container-agent] 2023-03-07 17:08:42 INFO juju-log Running legacy hooks/install.
2023-03-07T17:08:42.830Z [container-agent] 2023-03-07 17:08:42 INFO juju-log Kubernetes service 'traefik' patched successfully
2023-03-07T17:08:43.559Z [container-agent] 2023-03-07 17:08:43 INFO juju.worker.uniter.operation runhook.go:146 ran "install" hook (via hook dispatching script: dispatch)
2023-03-07T17:08:46.828Z [container-agent] 2023-03-07 17:08:46 INFO juju-log ingress-per-unit:3: Kubernetes service 'traefik' patched successfully
2023-03-07T17:08:47.045Z [pebble] Check "readiness" failure 5 (threshold 3): received non-20x status code 418
2023-03-07T17:08:47.398Z [container-agent] 2023-03-07 17:08:47 INFO juju.worker.uniter.operation runhook.go:146 ran "ingress-per-unit-relation-created" hook (via hook dispatching script: dispatch)
2023-03-07T17:08:48.438Z [container-agent] 2023-03-07 17:08:48 INFO juju-log ingress-per-unit:4: Kubernetes service 'traefik' patched successfully
2023-03-07T17:08:48.903Z [container-agent] 2023-03-07 17:08:48 INFO juju.worker.uniter.operation runhook.go:146 ran "ingress-per-unit-relation-created" hook (via hook dispatching script: dispatch)
2023-03-07T17:08:50.073Z [container-agent] 2023-03-07 17:08:50 INFO juju-log ingress:19: Kubernetes service 'traefik' patched successfully
2023-03-07T17:08:50.657Z [container-agent] 2023-03-07 17:08:50 INFO juju.worker.uniter.operation runhook.go:146 ran "ingress-relation-created" hook (via hook dispatching script: dispatch)
2023-03-07T17:08:52.250Z [container-agent] 2023-03-07 17:08:52 INFO juju.worker.uniter.operation runhook.go:146 ran "traefik-route-relation-created" hook (via hook dispatching script: dispatch)
2023-03-07T17:08:53.508Z [container-agent] 2023-03-07 17:08:53 INFO juju.worker.uniter.operation runhook.go:146 ran "metrics-endpoint-relation-created" hook (via hook dispatching script: dispatch)
2023-03-07T17:08:54.686Z [container-agent] 2023-03-07 17:08:54 INFO juju-log ingress:6: Kubernetes service 'traefik' patched successfully
2023-03-07T17:08:55.216Z [container-agent] 2023-03-07 17:08:55 INFO juju.worker.uniter.operation runhook.go:146 ran "ingress-relation-created" hook (via hook dispatching script: dispatch)
2023-03-07T17:08:55.788Z [container-agent] 2023-03-07 17:08:55 INFO juju.worker.uniter resolver.go:149 found queued "leader-elected" hook
2023-03-07T17:08:57.045Z [pebble] Check "readiness" failure 6 (threshold 3): received non-20x status code 418
2023-03-07T17:08:57.301Z [container-agent] 2023-03-07 17:08:57 INFO juju.worker.uniter.operation runhook.go:146 ran "leader-elected" hook (via hook dispatching script: dispatch)
2023-03-07T17:08:58.746Z [container-agent] 2023-03-07 17:08:58 INFO juju.worker.uniter.operation runhook.go:146 ran "configurations-storage-attached" hook (via hook dispatching script: dispatch)
2023-03-07T17:09:00.486Z [container-agent] 2023-03-07 17:09:00 INFO juju.worker.uniter.operation runhook.go:146 ran "config-changed" hook (via hook dispatching script: dispatch)
2023-03-07T17:09:00.584Z [container-agent] 2023-03-07 17:09:00 INFO juju.worker.uniter resolver.go:149 found queued "start" hook
2023-03-07T17:09:01.047Z [container-agent] 2023-03-07 17:09:01 INFO juju-log Running legacy hooks/start.
2023-03-07T17:09:02.137Z [container-agent] 2023-03-07 17:09:02 INFO juju.worker.uniter.operation runhook.go:146 ran "start" hook (via hook dispatching script: dispatch)
2023-03-07T17:09:03.048Z [container-agent] 2023-03-07 17:09:03 INFO juju-log ingress-per-unit:3: Kubernetes service 'traefik' patched successfully
2023-03-07T17:09:03.527Z [container-agent] 2023-03-07 17:09:03 INFO juju.worker.uniter.operation runhook.go:146 ran "ingress-per-unit-relation-joined" hook (via hook dispatching script: dispatch)
2023-03-07T17:09:04.672Z [container-agent] 2023-03-07 17:09:04 INFO juju-log ingress-per-unit:4: Kubernetes service 'traefik' patched successfully
2023-03-07T17:09:05.173Z [container-agent] 2023-03-07 17:09:05 INFO juju.worker.uniter.operation runhook.go:146 ran "ingress-per-unit-relation-joined" hook (via hook dispatching script: dispatch)
2023-03-07T17:09:06.349Z [container-agent] 2023-03-07 17:09:06 INFO juju-log ingress-per-unit:4: Kubernetes service 'traefik' patched successfully
2023-03-07T17:09:06.820Z [container-agent] 2023-03-07 17:09:06 INFO juju.worker.uniter.operation runhook.go:146 ran "ingress-per-unit-relation-changed" hook (via hook dispatching script: dispatch)
2023-03-07T17:09:08.072Z [container-agent] 2023-03-07 17:09:08 INFO juju-log ingress:19: Kubernetes service 'traefik' patched successfully
2023-03-07T17:09:08.602Z [container-agent] 2023-03-07 17:09:08 INFO juju.worker.uniter.operation runhook.go:146 ran "ingress-relation-changed" hook (via hook dispatching script: dispatch)
2023-03-07T17:09:10.002Z [container-agent] 2023-03-07 17:09:10 INFO juju-log traefik-route:5: Kubernetes service 'traefik' patched successfully
2023-03-07T17:09:10.584Z [container-agent] 2023-03-07 17:09:10 INFO juju.worker.uniter.operation runhook.go:146 ran "traefik-route-relation-changed" hook (via hook dispatching script: dispatch)
2023-03-07T17:09:13.028Z [container-agent] 2023-03-07 17:09:13 INFO juju-log ingress:19: Kubernetes service 'traefik' patched successfully
2023-03-07T17:09:13.548Z [container-agent] 2023-03-07 17:09:13 INFO juju.worker.uniter.operation runhook.go:146 ran "ingress-relation-joined" hook (via hook dispatching script: dispatch)
2023-03-07T17:09:14.901Z [container-agent] 2023-03-07 17:09:14 INFO juju-log ingress:19: Kubernetes service 'traefik' patched successfully
2023-03-07T17:09:15.461Z [container-agent] 2023-03-07 17:09:15 INFO juju.worker.uniter.operation runhook.go:146 ran "ingress-relation-changed" hook (via hook dispatching script: dispatch)
2023-03-07T17:09:17.320Z [container-agent] 2023-03-07 17:09:17 INFO juju.worker.uniter.operation runhook.go:146 ran "traefik-route-relation-joined" hook (via hook dispatching script: dispatch)
2023-03-07T17:09:18.985Z [container-agent] 2023-03-07 17:09:18 INFO juju-log ingress:6: Kubernetes service 'traefik' patched successfully
2023-03-07T17:09:19.486Z [container-agent] 2023-03-07 17:09:19 INFO juju.worker.uniter.operation runhook.go:146 ran "ingress-relation-changed" hook (via hook dispatching script: dispatch)
2023-03-07T17:09:21.548Z [container-agent] 2023-03-07 17:09:21 INFO juju-log traefik-route:5: Kubernetes service 'traefik' patched successfully
2023-03-07T17:09:22.154Z [container-agent] 2023-03-07 17:09:22 INFO juju.worker.uniter.operation runhook.go:146 ran "traefik-route-relation-changed" hook (via hook dispatching script: dispatch)
2023-03-07T17:09:23.741Z [container-agent] 2023-03-07 17:09:23 INFO juju-log ingress-per-unit:3: Kubernetes service 'traefik' patched successfully
2023-03-07T17:09:24.247Z [container-agent] 2023-03-07 17:09:24 INFO juju.worker.uniter.operation runhook.go:146 ran "ingress-per-unit-relation-changed" hook (via hook dispatching script: dispatch)
2023-03-07T17:09:26.326Z [container-agent] 2023-03-07 17:09:26 INFO juju.worker.uniter.operation runhook.go:146 ran "metrics-endpoint-relation-joined" hook (via hook dispatching script: dispatch)
2023-03-07T17:09:28.326Z [container-agent] 2023-03-07 17:09:28 INFO juju.worker.uniter.operation runhook.go:146 ran "metrics-endpoint-relation-changed" hook (via hook dispatching script: dispatch)
2023-03-07T17:09:29.744Z [container-agent] 2023-03-07 17:09:29 INFO juju-log ingress:6: Kubernetes service 'traefik' patched successfully
2023-03-07T17:09:30.197Z [container-agent] 2023-03-07 17:09:30 INFO juju.worker.uniter.operation runhook.go:146 ran "ingress-relation-joined" hook (via hook dispatching script: dispatch)
2023-03-07T17:09:32.364Z [container-agent] 2023-03-07 17:09:32 INFO juju-log ingress:6: Kubernetes service 'traefik' patched successfully
2023-03-07T17:09:32.864Z [container-agent] 2023-03-07 17:09:32 INFO juju.worker.uniter.operation runhook.go:146 ran "ingress-relation-changed" hook (via hook dispatching script: dispatch)
2023-03-07T17:10:29.790Z [container-agent] 2023-03-07 17:10:29 INFO juju.worker.uniter.operation runhook.go:146 ran "traefik-pebble-ready" hook (via hook dispatching script: dispatch)
2023-03-07T17:13:43.541Z [container-agent] 2023-03-07 17:13:43 WARNING juju-log No relation: certificates
2023-03-07T17:13:44.794Z [container-agent] 2023-03-07 17:13:44 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-03-07T17:19:06.475Z [container-agent] 2023-03-07 17:19:06 WARNING juju-log No relation: certificates
2023-03-07T17:19:07.563Z [container-agent] 2023-03-07 17:19:07 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-03-07T17:24:23.691Z [container-agent] 2023-03-07 17:24:23 WARNING juju-log No relation: certificates
2023-03-07T17:24:24.750Z [container-agent] 2023-03-07 17:24:24 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-03-07T17:28:59.949Z [container-agent] 2023-03-07 17:28:59 WARNING juju-log No relation: certificates
2023-03-07T17:29:00.995Z [container-agent] 2023-03-07 17:29:00 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-03-07T17:33:33.087Z [container-agent] 2023-03-07 17:33:33 WARNING juju-log No relation: certificates
2023-03-07T17:33:34.146Z [container-agent] 2023-03-07 17:33:34 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-03-07T17:39:31.636Z [container-agent] 2023-03-07 17:39:31 WARNING juju-log No relation: certificates
2023-03-07T17:39:32.688Z [container-agent] 2023-03-07 17:39:32 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-03-07T17:45:16.545Z [container-agent] 2023-03-07 17:45:16 WARNING juju-log No relation: certificates
2023-03-07T17:45:17.601Z [container-agent] 2023-03-07 17:45:17 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-03-07T17:50:39.487Z [container-agent] 2023-03-07 17:50:39 WARNING juju-log No relation: certificates
2023-03-07T17:50:40.589Z [container-agent] 2023-03-07 17:50:40 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-03-07T17:54:44.745Z [container-agent] 2023-03-07 17:54:44 WARNING juju-log No relation: certificates
2023-03-07T17:54:45.864Z [container-agent] 2023-03-07 17:54:45 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-03-07T17:59:02.119Z [container-agent] 2023-03-07 17:59:02 WARNING juju-log No relation: certificates
2023-03-07T17:59:03.182Z [container-agent] 2023-03-07 17:59:03 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-03-07T18:04:26.041Z [container-agent] 2023-03-07 18:04:26 WARNING juju-log No relation: certificates
2023-03-07T18:04:27.102Z [container-agent] 2023-03-07 18:04:27 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-03-07T18:09:04.400Z [container-agent] 2023-03-07 18:09:04 WARNING juju-log No relation: certificates
2023-03-07T18:09:05.537Z [container-agent] 2023-03-07 18:09:05 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-03-07T18:14:16.244Z [container-agent] 2023-03-07 18:14:16 WARNING juju-log No relation: certificates
2023-03-07T18:14:17.315Z [container-agent] 2023-03-07 18:14:17 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-03-07T18:18:50.177Z [container-agent] 2023-03-07 18:18:50 WARNING juju-log No relation: certificates
2023-03-07T18:18:51.512Z [container-agent] 2023-03-07 18:18:51 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-03-07T18:24:12.603Z [container-agent] 2023-03-07 18:24:12 WARNING juju-log No relation: certificates
2023-03-07T18:24:14.560Z [container-agent] 2023-03-07 18:24:14 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)
2023-03-07T18:29:42.750Z [container-agent] 2023-03-07 18:29:42 WARNING juju-log No relation: certificates
2023-03-07T18:29:44.184Z [container-agent] 2023-03-07 18:29:44 INFO juju.worker.uniter.operation runhook.go:146 ran "update-status" hook (via hook dispatching script: dispatch)

Additional context

No response

Scrape metrics from kubernetes nodes and container runtime

Enhancement Proposal

Given that COS is a Kubernetes bundle, I think it would be a really nice addition to discover and scrape all the Kubernetes components without the need for manual configuration, and also nodes and pods, although the latter would probably require another charm to be deployed alongside the current charms in the bundle.

Adding Grafana dashboards for these metrics could come in handy too.

Thank you!

Endpoints are not reachable via localhost

Bug Description

COS lite charms use ingress, and are reachable via the proxied endpoint

curl http://pd-ssd-4cpu-8gb.us-central1-a.c.lma-light-load-testing.internal:80/cos-lite-load-test-prometheus-0/metrics

but not via localhost

$ curl http://localhost/cos-lite-load-test-prometheus-0/metrics
curl: (7) Failed to connect to localhost port 80 after 0 ms: Connection refused

Seems like they all should be reachable via localhost too?

To Reproduce

Deploy the bundle and try curl http://localhost/cos-lite-load-test-prometheus-0/metrics.

Environment

Bundle from edge.

Relevant log output

$ curl http://localhost/cos-lite-load-test-prometheus-0/metrics
curl: (7) Failed to connect to localhost port 80 after 0 ms: Connection refused

Additional context

No response

Use juju-bundle to simplify testing logic

juju-bundle is a bundle management plugin for juju maintained by @knkski.

A juju plugin is any executable binary in $PATH that has the name juju-*; you can call it as juju foo instead of juju-foo. So they're called plugins, but really it's just a convenience bit of functionality rather than a plugin to juju itself.

Most of the Kubeflow CI/CD uses it, because it handles things like filling in default resources. Here are a couple of examples:
https://github.com/canonical/kfp-operators/blob/master/.github/workflows/integrate.yaml#L65
https://github.com/canonical/kfp-operators/blob/master/.github/workflows/publish.yaml#L29-L32
https://github.com/canonical/bundle-kubeflow/blob/master/.github/workflows/publish.yaml#L34-L36

If you were to create a charms directory at the root of this repo and create a symlink inside for each charm with e.g. ln -s ../../prometheus-operator prometheus, you could run juju bundle deploy --build to have each charm built and deployed as part of the bundle. Does that meet your needs vs a bundle template? Otherwise, I'd be interested in seeing if juju-bundle could be expanded to meet any extra needs.

I've updated the juju-bundle plugin to handle these four use cases:

juju bundle deploy                           # Don't build anything
juju bundle deploy --build                   # Build every app with a source
juju bundle deploy --build foo bar           # Build both `foo` and `bar` from their sources
juju bundle deploy --build foo bar=../bar    # Build `foo` from its source, and build `bar` with a source of `../bar`

it also has a few other features, e.g. instead of using {%- if testing %} to include the avalanche app, you can just include it and use juju bundle deploy --exclude avalanche when you don't want to include it

You could run juju bundle deploy -b /tmp/RANDOM/bundle.yaml --build alertmanager=../ and the juju-bundle plugin would then go off and build alertmanager only, saving an updated bundle.yaml to a tempfile.
--- @knkski

So with juju-bundle the tests may look like this:

Alertmanager integration tests:

  1. alertmanager tox command downloads lma bundle
  2. alertmanager tox command calls juju bundle deploy -b /path/to/lma/bundle.yaml --build alertmanager={toxinidir}
  3. alertmanager tox command calls pytest on lma bundle (same as now)
  4. lma test runs tests but does not build anything, only deploys the bundle

lma bundle integration tests:

  1. lma bundle tox command calls juju bundle deploy -b /path/to/lma/bundle.yaml
  2. lma bundle tox command calls pytest (same as now but do not build anything)

Incorrect command for fetching/organizing endpoints

I am using Juju v3.1.6
On this page, under the "Browse dashboards" section, the command listed did not work for me:

juju run traefik/0 show-proxied-endpoints | yq '."unit-traefik-0".results."proxied-endpoints"' | jq

Instead, I modified it and produced the intended result using the following command:

juju run traefik/0 show-proxied-endpoints | yq '."proxied-endpoints"' | jq

Auto fetch images from `upstream-source`

wouldn't it be even nicer to template the images and fetch them from the individual charm's metadata.resources.{image_name}.upstream-source?

Originally posted by @PietroPasotti in #51 (review)

Instead of hardcoding the images in the .j2.yaml and having to remember to update them when we update the underlying charms, the rendering script could auto-fetch the upstream-source for every charm.

This is relevant for testing purposes when rendering a bundle with one or more charms from local path, e.g.:

./render_bundle.py local.yaml \
  --prometheus ../prometheus-k8s-operator/prometheus-k8s_ubuntu-20.04-amd64.charm
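
A rough sketch of what such auto-fetching could look like inside the rendering script, assuming it has access to each charm's checked-out metadata.yaml; the function name and example paths are illustrative:

# Sketch: pull the oci-image upstream-sources out of a charm's metadata.yaml so
# the bundle template does not need to hardcode them. Paths/names illustrative.
from pathlib import Path
import yaml

def upstream_sources(charm_root: str) -> dict:
    """Map resource name -> upstream-source for every oci-image resource."""
    meta = yaml.safe_load((Path(charm_root) / "metadata.yaml").read_text())
    return {
        name: res["upstream-source"]
        for name, res in meta.get("resources", {}).items()
        if res.get("type") == "oci-image" and "upstream-source" in res
    }

# e.g. upstream_sources("../prometheus-k8s-operator")
# -> {"prometheus-image": "<upstream OCI image reference>"}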

Document external dependencies for COS lite

Bug Description

We need a list of external dependencies needed to have COS installed and working correctly.

This is mostly needed to communicate to customers how to configure proxies/firewalls to allow a COS setup.

Ideally this should also include dependencies that do not come directly from components in the cos-lite bundle but are still core to a COS deployment, such as grafana-agent.

To Reproduce

Not applicable

Environment

Not applicable

Relevant log output

Not applicable

Additional context

No response

Add integration test prom-loki-traefik

Enhancement Proposal

We should have an integration test as described in this Prometheus PR (a rough sketch of the kind of check involved follows the list):

  • Deploy prometheus, loki and traefik
  • Relate prometheus and loki
    • Check that the loki scrape job in the prometheus config uses an internal IP or a FQDN
  • Relate loki and traefik
    • Check that the loki scrape job in the prometheus config uses the external URL provided by traefik
  • Remove the relation between loki and traefik
    • Check that the loki scrape job in the prometheus config uses an internal IP or a FQDN again
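
A rough sketch of the kind of assertion such a test could make, using Prometheus's targets API (the same API queried in other issues below); the address handling and job-name matching are placeholders, and a real test would drive juju relate / remove-relation around these checks:

# Sketch: assert on where the loki scrape job points, via the Prometheus
# /api/v1/targets API. Address and job-name matching are placeholders.
import requests

def loki_scrape_urls(prometheus_address: str) -> list:
    targets = requests.get(
        f"http://{prometheus_address}:9090/api/v1/targets", timeout=10
    ).json()["data"]["activeTargets"]
    return [t["scrapeUrl"] for t in targets if "loki" in t["labels"]["job"]]

def assert_internal(urls: list):
    # Before relating loki and traefik: jobs should use the internal FQDN
    # (a real check would also accept an internal pod IP).
    assert all(".svc.cluster.local" in u for u in urls), urls

def assert_external(urls: list, traefik_external_url: str):
    # After relating loki and traefik: jobs should go through the ingress URL.
    assert all(traefik_external_url in u for u in urls), urls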

Prometheus stuck in installing charm software

Bug Description

The Solutions QA team had a failed run when deploying the COS layer (latest/stable:11) on top of microk8s v1.29.0: the prometheus charm stayed in the maintenance state with the message "installing charm software" until it eventually timed out.

From the logs, no indication of the root cause of the error was found.

Failed run:

https://solutions.qa.canonical.com/testruns/25f56cba-e9c3-4ca2-abd5-c807b6d8b902

Logs:

https://oil-jenkins.canonical.com/artifacts/25f56cba-e9c3-4ca2-abd5-c807b6d8b902/index.html

To Reproduce

On top of MAAS 3.3.5, we bootstrap a juju 3.3.1 controller, deploy microk8s v1.29, and deploy COS from latest/stable.

Environment

This test was run on virtual machines in our lab3-litleo environment.

Relevant log output

prometheus/0     maintenance  idle                        installing charm software

Additional context

No response

Integration tests keep failing because prometheus fails to scrape grafana

Description

Integration tests keep failing on

# All jobs should be up
health = {target["labels"]["job"]: target["health"] for target in as_dict["activeTargets"]}
assert set(health.values()) == {"up"}

AssertionError: assert {'up', 'down'} == {'up'}
  Extra items in the left set:
  'down'
  Full diff:
  - {'up'}
  + {'up', 'down'}

because prometheus fails to scrape grafana

      {
        "discoveredLabels": {
          "__address__": "grafana-0.grafana-endpoints.test-bundle-8n1r.svc.cluster.local:3000",
          "__metrics_path__": "/metrics",
          "__scheme__": "https",
          "__scrape_interval__": "1m",
          "__scrape_timeout__": "10s",
          "job": "juju_test-bundle-8n1r_e252fd59_grafana_prometheus_scrape",
          "juju_application": "grafana",
          "juju_charm": "grafana-k8s",
          "juju_model": "test-bundle-8n1r",
          "juju_model_uuid": "e252fd59-3737-4887-80af-f8a9c426125a"
        },
        "labels": {
          "instance": "test-bundle-8n1r_e252fd59-3737-4887-80af-f8a9c426125a_grafana",
          "job": "juju_test-bundle-8n1r_e252fd59_grafana_prometheus_scrape",
          "juju_application": "grafana",
          "juju_charm": "grafana-k8s",
          "juju_model": "test-bundle-8n1r",
          "juju_model_uuid": "e252fd59-3737-4887-80af-f8a9c426125a"
        },
        "scrapePool": "juju_test-bundle-8n1r_e252fd59_grafana_prometheus_scrape",
        "scrapeUrl": "https://grafana-0.grafana-endpoints.test-bundle-8n1r.svc.cluster.local:3000/metrics",
        "globalUrl": "https://grafana-0.grafana-endpoints.test-bundle-8n1r.svc.cluster.local:3000/metrics",
        "lastError": "Get \"https://10.43.8.206/test-bundle-8n1r-grafana/metrics\": tls: failed to verify certificate: x509: certificate signed by unknown authority",
        "lastScrape": "2024-01-16T18:37:16.119998201Z",
        "lastScrapeDuration": 0.005085648,
        "health": "down",
        "scrapeInterval": "1m",
        "scrapeTimeout": "10s"
      },

Potential issue

There's a TLS error tls: failed to verify certificate: x509: certificate signed by unknown authority.
All scrape targets in the test are behind TLS, but only grafana fails:

$ curl -sk https://10.1.166.115:9090/api/v1/targets | jq | grep https
          "__scheme__": "https",
        "scrapeUrl": "https://alertmanager-0.alertmanager-endpoints.test-bundle-8n1r.svc.cluster.local:9093/metrics",
        "globalUrl": "https://alertmanager-0.alertmanager-endpoints.test-bundle-8n1r.svc.cluster.local:9093/metrics",
          "__scheme__": "https",
        "scrapeUrl": "https://grafana-0.grafana-endpoints.test-bundle-8n1r.svc.cluster.local:3000/metrics",
        "globalUrl": "https://grafana-0.grafana-endpoints.test-bundle-8n1r.svc.cluster.local:3000/metrics",
        "lastError": "Get \"https://10.43.8.206/test-bundle-8n1r-grafana/metrics\": tls: failed to verify certificate: x509: certificate signed by unknown authority",
          "__scheme__": "https",
        "scrapeUrl": "https://loki-0.loki-endpoints.test-bundle-8n1r.svc.cluster.local:3100/metrics",
        "globalUrl": "https://loki-0.loki-endpoints.test-bundle-8n1r.svc.cluster.local:3100/metrics",
          "__scheme__": "https",
        "scrapeUrl": "https://prometheus-0.prometheus-endpoints.test-bundle-8n1r.svc.cluster.local:9090/metrics",
        "globalUrl": "https://prometheus-0.prometheus-endpoints.test-bundle-8n1r.svc.cluster.local:9090/metrics",
$ curl -sk https://10.1.166.115:9090/api/v1/targets | jq | grep health
        "health": "up",
        "health": "up",
        "health": "up",
        "health": "down",
        "health": "up",
        "health": "up",
        "health": "up",

Perhaps this is related to the grafana 9 vs grafana 10 ingress+redirect issue. We could retry after the grafana 9.5.3 rock is published by oci-factory and the grafana metadata is updated to point there.

Update Bundle Readme

Enhancement Proposal

Document Traefik and Catalogue as part of the cos-lite bundle.

The cos-lite bundle readme makes no mention of Traefik or Catalogue.

Additionally, it seems as though Traefik needs an additional charm gateway configuration that is not mentioned.

Add itest: all metric names used in alert rules are in fact served by the workload

Enhancement Proposal

IIUC, if any metric of an expr is missing, the alert won't fire.
This may happen when the workload is updated with a breaking change that removes/renames metrics.

Before releasing a charm, it could be handy to verify that alerts would in fact evaluate (a rough sketch of such a check follows below). This is relevant for:

  • alerts updated by the PR
  • the charm's OCI image resource being updated to a workload version with a breaking change that removes/renames metrics

Related: canonical/cos-tool#3
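
A very rough sketch of what such a check could look like; the metric-name extraction here is a naive regex that will also pick up label names (a real implementation would parse the PromQL expressions properly, e.g. building on cos-tool), and the rule-file path and metrics URL are placeholders:

# Sketch: verify that every metric name referenced in an alert rule expr is
# actually served by the workload's /metrics endpoint. Naive regex extraction;
# label names and functions would need to be filtered out in a real check.
import re
import urllib.request
import yaml

METRIC_RE = re.compile(r"\b[a-zA-Z_:][a-zA-Z0-9_:]*\b")
# Words that match the regex but are PromQL syntax/functions, not metric names.
PROMQL_WORDS = {"by", "on", "and", "or", "unless", "rate", "sum", "avg", "max", "min", "count"}

def metrics_in_rules(rule_file: str) -> set:
    """Metric-ish names appearing in the exprs of an alert rule file."""
    with open(rule_file) as f:
        rules = yaml.safe_load(f)
    names = set()
    for group in rules.get("groups", []):
        for rule in group.get("rules", []):
            names |= set(METRIC_RE.findall(str(rule.get("expr", ""))))
    return names - PROMQL_WORDS

def metrics_served(metrics_url: str) -> set:
    """Metric names exposed by the workload, e.g. http://<workload>:9090/metrics."""
    body = urllib.request.urlopen(metrics_url).read().decode()
    return {line.split("{")[0].split(" ")[0]
            for line in body.splitlines()
            if line and not line.startswith("#")}

def test_alert_metrics_exist(rule_file: str, metrics_url: str):
    missing = metrics_in_rules(rule_file) - metrics_served(metrics_url)
    assert not missing, f"alert rules reference metrics not served: {missing}"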

Merge offers overlay into the bundle

Enhancement Proposal

Charmcraft and charmhub accept offers in the bundle yaml, e.g.:

  prometheus:
    charm: prometheus-k8s
    scale: 1
    trust: true
    channel: edge
+   offers:
+     prometheus-scrape:
+       endpoints:
+         - metrics-endpoint

On the one hand, it's convenient to have all offers in place right away after deployment.
On the other hand, the names of the offers may be a matter of personal taste (prometheus-scrape or scrape or metrics-endpoint or ...).

Wdyt?

Integration tests for TLS

Enhancement Proposal

An integration (scenario?) test suite for TLS is needed to test whether TLS is enabled between COS components.
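
A minimal sketch of one check such a suite could run, assuming the unit addresses are looked up from juju status; the addresses below are placeholders, and the ports are the ones the COS workloads use elsewhere in this document:

# Sketch: check whether each COS component endpoint actually speaks TLS.
# Addresses are placeholders; a real test would pull them from juju status.
import socket
import ssl

ENDPOINTS = {
    "prometheus": ("<prometheus-unit-ip>", 9090),
    "loki": ("<loki-unit-ip>", 3100),
    "alertmanager": ("<alertmanager-unit-ip>", 9093),
    "grafana": ("<grafana-unit-ip>", 3000),
}

def speaks_tls(host: str, port: int) -> bool:
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # only checking that TLS is offered at all
    try:
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return True
    except (ssl.SSLError, OSError):
        return False

def test_all_components_speak_tls():
    plain = {name for name, (host, port) in ENDPOINTS.items() if not speaks_tls(host, port)}
    assert not plain, f"components not serving TLS: {plain}"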

Tracking issue for TLS

Enhancement Proposal

This is a tracking issue for end-to-end TLS in COS Lite.

Spec

TODO: Update

Steps / History

Traefik

Loki

Alertmanager

Prometheus

Prometheus – Alertmanager

Prometheus - grafana agent

  • TODO: Cert for remote-write

Grafana agent

Grafana

Catalogue

For the whole stack

  • Remove insecureSkipVerify
  • #78

To do later

Won't do

  • COS Config
  • COS Proxy

Unresolved Questions

  • With ingress per app, is it correct for all units of the same app to have the same web_external_url in their certificates' SANs? I.e. would a CA complain if asked to generate two different certs for the same DNS name? (The SANs of a unit's cert are made up of the unit's FQDN and the app's external URL.)
  • Should we also allow provisioning certificates with IPs as part of the SAN?

grafana_dashboard lib raises when processing dashboards exported by Grafana 9 (COS Lite version)

Bug Description

I have found this same error in Grafana's unit while trying to deploy a charm with a Grafana 9 dashboard and relate it with the COS Lite bundle. When inspecting the Grafana charm's code by accessing the container, the fix applied to the code to solve this bug wasn't implemented there, so maybe the charm in the bundle needs an update. Could you please check if that is the case? Thank you! :D

To Reproduce

Clone this branch: https://github.com/canonical/content-cache-k8s-operator/tree/add_cos
Then, inside the repo:
change the grafana dashboard to a grafana 9 dashboard (copy & paste)
docker build . -t localhost:32000/content-cache:latest --no-cache -f content-cache.Dockerfile
docker push localhost:32000/content-cache:latest
charmcraft pack
juju add-model cosdepl && juju deploy cos-lite --trust --overlay offer.yaml
juju add-model contentdepl && juju deploy ./content-cache-k8s_ubuntu-22.04-amd64.charm --resource content-cache-image=localhost:32000/content-cache:latest --resource nginx-prometheus-exporter-image=nginx/nginx-prometheus-exporter:0.11.0 && juju deploy hello-kubecon && juju deploy nginx-ingress-integrator && juju relate hello-kubecon content-cache-k8s:ingress-proxy && juju relate nginx-ingress-integrator content-cache-k8s:ingress
juju relate content-cache-k8s:grafana-dashboard admin/cosdepl.grafana-dashboards
juju relate content-cache-k8s:metrics-endpoint admin/cosdepl.prometheus-scrape
juju relate content-cache-k8s:logging admin/cosdepl.loki-logging
juju switch cosdepl
microk8s kubectl exec -ti -n cosdepl grafana-0 -- /bin/bash
tac /var/lib/juju/agents/unit-grafana-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py
Then you can inspect the code and see that the fix is not there.

Environment

Deploying everything locally on my juju/microk8s installation

Relevant log output

unit-grafana-0: 14:03:20 ERROR unit.grafana/0.juju-log grafana-dashboard:23: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 1156, in <module>
    main(GrafanaCharm, use_juju_for_storage=True)
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/main.py", line 438, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/main.py", line 150, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/framework.py", line 355, in emit
    framework._emit(event)  # noqa
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/framework.py", line 856, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-grafana-0/charm/venv/ops/framework.py", line 931, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-grafana-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1272, in _on_grafana_dashboard_relation_changed
    changes = self._render_dashboards_and_signal_changed(event.relation)
  File "/var/lib/juju/agents/unit-grafana-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 1369, in _render_dashboards_and_signal_changed
    content = _convert_dashboard_fields(content, inject_dropdowns)
  File "/var/lib/juju/agents/unit-grafana-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 599, in _convert_dashboard_fields
    dict_content = _replace_template_fields(dict_content, datasources, existing_templates)
  File "/var/lib/juju/agents/unit-grafana-0/charm/lib/charms/grafana_k8s/v0/grafana_dashboard.py", line 633, in _replace_template_fields
    if panel["datasource"].lower() in replacements.values():
AttributeError: 'dict' object has no attribute 'lower'
unit-grafana-0: 14:03:20 ERROR juju.worker.uniter.operation hook "grafana-dashboard-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1

Additional context

No response

Missing open ports in COS charms when running juju status

Bug Description

$ juju status
Model  Controller    Cloud/Region            Version  SLA          Timestamp
lma    aws-microk8s  aws-microk8s/localhost  2.9.34   unsupported  18:37:06-04:00

App           Version  Status  Scale  Charm             Channel  Rev  Address         Exposed  Message
alertmanager  0.23.0   active      1  alertmanager-k8s  edge      33  10.152.183.69   no       
grafana       8.2.6    active      1  grafana-k8s       edge      45  10.152.183.205  no       
loki          2.4.1    active      1  loki-k8s          edge      45  10.152.183.210  no       
prometheus    2.33.5   active      1  prometheus-k8s    edge      75  10.152.183.254  no       

Unit             Workload  Agent  Address      Ports  Message
alertmanager/0*  active    idle   10.1.100.12         
grafana/0*       active    idle   10.1.57.73          
loki/0*          active    idle   10.1.100.13         
prometheus/0*    active    idle   10.1.100.14         

To Reproduce

On a Microk8s cloud

$ juju add-model lma
$ juju deploy ./microk8s/lma/lma_servers.yaml --trust
$ juju status

Environment

Cloud: Microk8s v1.25/stable (Remote)
Juju: 2.9.34 (Client and Controller)

COS bundle: ./microk8s/lma/lma_servers.yaml

# https://charmhub.io/topics/canonical-observability-stack/editions/lite
# https://charmhub.io/cos-lite
# channel = latest/edge (rev 5)

bundle: kubernetes
name: cos-lite
description: >
  COS Lite is a light-weight, highly-integrated, observability stack running on Kubernetes
applications:
  # I am ignoring Traefik for now
  # since some older charms do not have 
  # juju endpoint relation "ingress" implemented 
  # 
  # traefik:
  #   charm: traefik-k8s
  #   scale: 1
  #   trust: true
  #   channel: edge
  alertmanager:
    charm: alertmanager-k8s
    scale: 1
    trust: true
    channel: edge
  prometheus:
    charm: prometheus-k8s
    scale: 1
    trust: true
    channel: edge
  grafana:
    charm: grafana-k8s
    scale: 1
    trust: true
    channel: edge
  loki:
    charm: loki-k8s
    scale: 1
    trust: true
    channel: edge

relations:
# ----- Servers
# - [traefik:ingress-per-unit, prometheus:ingress]
# - [traefik:ingress-per-unit, loki:ingress]
- [ grafana:grafana-source, prometheus:grafana-source ]
- [ grafana:grafana-source, loki:grafana-source ]
- [ grafana:grafana-source, alertmanager:grafana-source ]

- [ alertmanager:alerting, prometheus:alertmanager ]
- [ alertmanager:alerting, loki:alertmanager ]

# ----- Agents on servers
# - [traefik:metrics-endpoint, prometheus:metrics-endpoint]
- [ grafana:grafana-dashboard, loki:grafana-dashboard ]
- [ grafana:grafana-dashboard, prometheus:grafana-dashboard ]
- [ grafana:grafana-dashboard, alertmanager:grafana-dashboard ]

- [ prometheus:metrics-endpoint, alertmanager:self-metrics-endpoint ]
- [ prometheus:metrics-endpoint, loki:metrics-endpoint ]

Relevant log output

n/a

Additional context

I did not use Traefik as an ingress controller in this Microk8s deployment since I wanted to expose other k8s charm services deployed on the same Microk8s which do not have the ingress relation implemented (apps model: redis-k8s, mattermost-k8s, discourse-k8s, wordpress-k8s).

When running juju config <app> juju-external-hostname=<>; juju expose <app> for any charm in the apps model, I would get an ingress resource created on Microk8s when using the ingress addon (kubectl get ingress --all-namespaces).

However, when doing the same for the COS charms, no ingress resource would get created.

I would think that showing open ports for the COS charms would make it possible to expose their services with the Microk8s ingress addon, just like the k8s charms I have in the apps model.

Support use of certificates provided by third parties

Enhancement Proposal

Hi,

Currently, the TLS overlay relies on the self-signed certificates charm. To use COS in production, however, we need to be able to use certificates provided by ANY third party. The use of charms that rely on "Let's Encrypt" (e.g. https://charmhub.io/route53-acme-operator) is not a solution since there are organizations that will require the use of certificates signed by their own Certificate Authorities or will use external CAs (such as DigiCert or GlobalSign to name a few).

Currently, such a scenario does not seem to be supported or at least it's not documented.

A good start would be an example of how to use a wildcard certificate in COS.

Add self-monitoring relations

Enhancement Proposal

Since we are adding self-monitoring to our charms, we should add those relations to the bundle.

Integrate the COS charms with Loki

Enhancement Proposal

Almost none of our charms currently integrate with Loki. We should fix this using the Pebble Log Forwarding feature.

  • Mimir Worker (Mimir)
  • Blackbox Exporter
  • Alertmanager
  • Prometheus
  • Grafana
  • Grafana Agent (K8s)
  • Grafana Agent (VMs)
  • Traefik
  • Catalogue
  • Loki itself :)
  • COS Proxy
  • Mimir Coordinator (Nginx)
  • Tempo
  • COS Configuration

Enhance promotion CI

Enhancement Proposal

Seems like the quality gates and air-gapped stories intersected:

  • We should have a callable "promote" workflow which should create a PR that includes a bundle file with all the revisions and resource revisions pinned, as well as OCI image URLs pointing to charmhub URLs.
  • A condition to merge should be a matrix test across juju and microk8s versions.

Add cluster crash recovery itest

Enhancement Proposal

Would be handy to have a cluster crash recovery itest, for when, for example, daemon-kubelite fails to (re)start.

After manually restarting daemon-kubelite, the deployment was stuck on unknown/lost.

Model               Controller  Cloud/Region        Version  SLA          Timestamp
cos-lite-load-test  uk8s        microk8s/localhost  2.9.34   unsupported  19:43:43Z

App            Version  Status   Scale  Charm                         Channel  Rev  Address         Exposed  Message
alertmanager   0.23.0   active       1  alertmanager-k8s              edge      37  10.152.183.156  no       
catalogue               active       1  catalogue-k8s                 edge       5  10.152.183.232  no       
cos-config     3.5.0    active       1  cos-configuration-k8s         edge      14  10.152.183.20   no       
grafana        9.2.1    waiting    0/1  grafana-k8s                   edge      55  10.152.183.105  no       waiting for units to settle down
loki           2.4.1    waiting    0/1  loki-k8s                      edge      47  10.152.183.135  no       waiting for units to settle down
prometheus     2.33.5   waiting    0/1  prometheus-k8s                edge      87  10.152.183.141  no       waiting for units to settle down
scrape-config  n/a      active       1  prometheus-scrape-config-k8s  edge      38  10.152.183.231  no       
scrape-target  n/a      active       1  prometheus-scrape-target-k8s  edge      23  10.152.183.233  no       
traefik                 waiting    0/1  traefik-k8s                   edge      95  10.128.0.6      no       waiting for units to settle down

Unit              Workload  Agent  Address      Ports  Message
alertmanager/0*   active    idle   10.1.79.238         
catalogue/0*      active    idle   10.1.79.215         
cos-config/0*     active    idle   10.1.79.197         
grafana/0         unknown   lost                       agent lost, see 'juju show-status-log grafana/0'
loki/0            unknown   lost                       agent lost, see 'juju show-status-log loki/0'
prometheus/0      unknown   lost                       agent lost, see 'juju show-status-log prometheus/0'
scrape-config/0*  active    idle   10.1.79.204         
scrape-target/0*  active    idle   10.1.79.213         
traefik/0         unknown   lost                       agent lost, see 'juju show-status-log traefik/0'
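
A rough sketch of what such an itest would exercise (the snap service name applies to MicroK8s; the watch interval is arbitrary):

# Simulate a kubelite crash/restart and observe whether the model recovers:
$ sudo snap restart microk8s.daemon-kubelite
$ juju status --watch 5s
# Units may stay in unknown/lost instead of settling back to active/idle.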

Grafana frequent "Bad Gateway" and "Templating Failed to upgrade legacy queries Datasource was not found" error messages

Bug Description

When using Grafana, error messages frequently appear, and the filters at the top are removed (emptied) for the Kubeflow dashboards (notebooks, etc.).

To Reproduce

  1. Bootstrap juju using MAAS VMs (HA)
  2. Create a VM for COS in MAAS (Ubuntu 22.04)
  3. Install microk8s on COS VM
  4. Add microk8s to juju controller
  5. Deploy COS on microk8s
  6. Deploy Charmed K8s
  7. Use the same controller to add Charmed K8s as a cloud
  8. Deploy Kubeflow and add relations
#prometheus
$ juju add-relation grafana-agent-k8s:send-remote-write admin/cos.prometheus-receive-remote-write


#grafana
$ juju add-relation jupyter-controller:grafana-dashboard admin/cos.grafana-dashboards
$ juju add-relation seldon-controller-manager:grafana-dashboard admin/cos.grafana-dashboards
$ juju add-relation argo-controller:grafana-dashboard admin/cos.grafana-dashboards
$ juju add-relation dex-auth:grafana-dashboard admin/cos.grafana-dashboards
$ juju add-relation katib-controller:grafana-dashboard admin/cos.grafana-dashboards
$ juju add-relation kfp-api:grafana-dashboard admin/cos.grafana-dashboards
$ juju add-relation metacontroller-operator:grafana-dashboard admin/cos.grafana-dashboards
$ juju add-relation minio:grafana-dashboard admin/cos.grafana-dashboards
$ juju add-relation training-operator:grafana-dashboard admin/cos.grafana-dashboards
$ juju add-relation mlflow-server:grafana-dashboard admin/cos.grafana-dashboards

Environment

COS - latest/stable
microk8s - 1.24/stable
Charmed K8s - 1.24/stable
juju - 2.9.42

Relevant log output

https://pastebin.canonical.com/p/K5Ks2nwPqy/

Additional context

No response

Add disk space alerts to loki, prometheus

Bug Description

When Loki fills up the disk, the COS charms fail.
In juju status this may look like the following:

Model               Controller  Cloud/Region        Version  SLA          Timestamp
cos-lite-load-test  uk8s        microk8s/localhost  3.1.6    unsupported  03:26:19Z

App            Version  Status   Scale  Charm                         Channel  Rev  Address         Exposed  Message
alertmanager   0.25.0   active       1  alertmanager-k8s              edge      96  10.152.183.245  no       
catalogue               active       1  catalogue-k8s                 edge      31  10.152.183.95   no       
cos-config     3.5.0    active       1  cos-configuration-k8s         edge      39  10.152.183.106  no       
grafana        9.2.1    active       1  grafana-k8s                   edge      93  10.152.183.29   no       
loki           2.8.4    waiting    0/1  loki-k8s                      edge     104  10.152.183.180  no       waiting for units to settle down
prometheus     2.47.2   waiting    0/1  prometheus-k8s                edge     156  10.152.183.54   no       waiting for units to settle down
scrape-config  n/a      active       1  prometheus-scrape-config-k8s  edge      44  10.152.183.33   no       
scrape-target  n/a      active       1  prometheus-scrape-target-k8s  edge      31  10.152.183.159  no       
traefik                 waiting    0/1  traefik-k8s                   edge     164  10.128.0.6      no       waiting for units to settle down

Unit              Workload  Agent  Address       Ports  Message
alertmanager/0*   active    idle   10.1.174.152         
catalogue/0*      active    idle   10.1.174.140         
cos-config/0*     active    idle   10.1.174.144         
grafana/0*        active    idle   10.1.174.186         
loki/0            unknown   lost   10.1.174.171         agent lost, see 'juju show-status-log loki/0'
prometheus/0      unknown   lost   10.1.174.160         agent lost, see 'juju show-status-log prometheus/0'
scrape-config/0*  active    idle   10.1.174.147         
scrape-target/0*  active    idle   10.1.174.150         
traefik/0         unknown   lost   10.1.174.184         agent lost, see 'juju show-status-log traefik/0'

and juju debug-log doesn't show anything obvious.

It would be handy to include baked-in disk space alerts (using predict_linear, etc.); a sketch follows.
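
For example, a rule along the following lines could be baked in (a sketch only; the metric name assumes node-exporter-style filesystem metrics, which may not match what the COS charms actually scrape):

# Example Prometheus alert rule predicting disk exhaustion within 24 hours:
$ cat > disk-full.rule <<'EOF'
groups:
  - name: disk-space
    rules:
      - alert: FilesystemWillFillIn24Hours
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[6h], 24 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: Filesystem predicted to fill up within 24 hours.
EOF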

To Reproduce

Keep Loki running long enough for disk space to be exhausted.

Environment

Not limited to a particular env.

Relevant log output

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        97G   96G  1.4G  99% /
tmpfs           7.9G     0  7.9G   0% /dev/shm
tmpfs           3.2G   11M  3.2G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      105M  6.1M   99M   6% /boot/efi
tmpfs           1.6G  4.0K  1.6G   1% /run/user/1000

Additional context

No response

`storage-small-overlay.yaml` has bigger size than the default

Bug Description

The following two documents suggest applying storage-small-overlay.yaml for a small setup.

https://charmhub.io/topics/canonical-observability-stack/tutorials/install-microk8s#heading--deploy-the-cos-lite-bundle-with-overlays

the storage-small overlay applies some defaults for the various storages used by the COS Lite components.

https://github.com/canonical/cos-lite-bundle/blob/fbfe8bd8769fb711695d5dbfbcff6b43bcf28ddc/README.md#overlays

storage-small: provides a setup of the various storages for the COS Lite charms for a small setup. Using an overlay for storage is fundamental for a productive setup, as you cannot change the amount of storage assigned to the various charms after the deployment of COS Lite.

However, the default size of Juju storage is 1GiB if I'm not mistaken, so it doesn't make much sense to have bigger sizes (2G, 10G) in the "small" overlay.

https://juju.is/docs/juju/storage-constraint

<size>: determined from the charm’s minimum storage size, or 1GiB if the charmed operator does not specify a minimum
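
For context, a bundle overlay pins storage sizes with standard Juju bundle storage constraints, and deploy time is the only point at which they can be set, since they cannot be changed afterwards. A minimal sketch with illustrative sizes (not the actual values from storage-small-overlay.yaml):

# Custom overlay with storage sizes chosen at deploy time:
$ cat > my-storage-overlay.yaml <<'EOF'
applications:
  prometheus:
    storage:
      database: 20G
  loki:
    storage:
      loki-chunks: 20G
EOF
$ juju deploy cos-lite --trust --overlay ./my-storage-overlay.yaml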

To Reproduce

  1. Follow the tutorial https://charmhub.io/topics/canonical-observability-stack/tutorials/install-microk8s

Environment

latest/stable of cos-lite bundle.

latest/stable: revision 11, released 2022-10-21

Relevant log output

PVC capacities are 1Gi without the storage-small overlay:

$ kubectl get pvc -n cos
NAME                                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
alertmanager-data-68d6b87a-alertmanager-0     Bound    pvc-239703cd-f2d4-4067-a4af-5e795ef775d4   1Gi        RWO            microk8s-hostpath   14h
grafana-database-bcc2ac5f-grafana-0           Bound    pvc-7673d498-55f7-495a-9888-7e1806ce8149   1Gi        RWO            microk8s-hostpath   14h
loki-active-index-directory-15d17a1f-loki-0   Bound    pvc-3355eea3-57f9-4f75-9a1a-a97b7a1a0f84   1Gi        RWO            microk8s-hostpath   14h
loki-loki-chunks-15d17a1f-loki-0              Bound    pvc-320fd438-338f-4b18-8c8c-ad06d56b45c0   1Gi        RWO            microk8s-hostpath   14h
prometheus-database-b2afe454-prometheus-0     Bound    pvc-c3ee21e2-11d2-4545-87c3-67ef44c8713b   1Gi        RWO            microk8s-hostpath   14h
traefik-configurations-45747e2e-traefik-0     Bound    pvc-00da7694-3b53-4e5a-a279-c640cfd0770a   1Gi        RWO            microk8s-hostpath   14h

Bigger sizes like 2Gi or 10Gi when the overlay was applied:

$ kubectl get pvc -n cos
NAME                                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
alertmanager-data-d1a74255-alertmanager-0     Bound    pvc-6ad038f4-6853-47ad-a27c-b2ab410f34c3   2Gi        RWO            microk8s-hostpath   65m
grafana-database-02c4415a-grafana-0           Bound    pvc-abd920da-5870-44ca-b194-0c8528a244ad   2Gi        RWO            microk8s-hostpath   65m
loki-active-index-directory-4234c5e0-loki-0   Bound    pvc-c111064b-5cd1-46b5-b677-b5f7a5d1c5b3   2Gi        RWO            microk8s-hostpath   65m
loki-loki-chunks-4234c5e0-loki-0              Bound    pvc-43376e03-15b1-473f-90ea-ebae1ba29ce1   10Gi       RWO            microk8s-hostpath   65m
prometheus-database-d3dff8a1-prometheus-0     Bound    pvc-b6b4f8ee-18f0-4fc0-b07c-b3cf247ba7c9   10Gi       RWO            microk8s-hostpath   64m
traefik-configurations-9d719ca9-traefik-0     Bound    pvc-4de44324-c1f4-4591-acbd-13b84ce21319   1Gi        RWO            microk8s-hostpath   64m

Additional context

No response

Grafana admin password not retrievable

Bug Description

Querying a grafana-k8s charm unit by running the action juju run-action --wait grafana/0 get-admin-password for the first time fails with the output "Admin password has been changed by an administrator".

To Reproduce

  1. Deploy cos-lite: juju deploy ch:cos-lite --overlay offers-overlay.yaml --overlay storage-small-overlay.yaml
  2. Run the action: juju run-action --wait grafana/0 get-admin-password

Environment

I am running Juju in a KVM virtual machine. grafana-k8s channel is latest/edge.
Juju snap version is 2.9.43.

Relevant log output

microk8s kubectl logs -f -n cos grafana-0 -c charm

2023-07-14T19:46:49.141Z [container-agent] 2023-07-14 19:46:49 DEBUG juju.worker.uniter.operation executor.go:121 lock released for grafana/0
2023-07-14T19:46:49.141Z [container-agent] 2023-07-14 19:46:49 DEBUG juju.worker.uniter.storage resolver.go:173 next hook op for storage-database-1: {Kind:2 Life:alive Attached:true Location:/var/lib/juju/storage/database/0}
2023-07-14T19:46:49.141Z [container-agent] 2023-07-14 19:46:49 DEBUG juju.worker.uniter resolver.go:188 no operations in progress; waiting for changes
2023-07-14T19:46:49.141Z [container-agent] 2023-07-14 19:46:49 DEBUG juju.worker.uniter.relation resolver.go:285 unit "alertmanager/0" already joined relation 18
2023-07-14T19:46:49.141Z [container-agent] 2023-07-14 19:46:49 DEBUG juju.worker.uniter.relation resolver.go:285 unit "prometheus/0" already joined relation 8
2023-07-14T19:46:49.141Z [container-agent] 2023-07-14 19:46:49 DEBUG juju.worker.uniter.relation resolver.go:285 unit "remote-efb43826e77245cc8563ea6c3555d026/1" already joined relation 24
2023-07-14T19:46:49.141Z [container-agent] 2023-07-14 19:46:49 DEBUG juju.worker.uniter.relation resolver.go:285 unit "prometheus/0" already joined relation 17
2023-07-14T19:46:49.141Z [container-agent] 2023-07-14 19:46:49 DEBUG juju.worker.uniter.relation resolver.go:285 unit "prometheus/0" already joined relation 15
2023-07-14T19:46:49.141Z [container-agent] 2023-07-14 19:46:49 DEBUG juju.worker.uniter.relation resolver.go:285 unit "loki/0" already joined relation 9
2023-07-14T19:46:49.142Z [container-agent] 2023-07-14 19:46:49 DEBUG juju.worker.uniter.relation resolver.go:285 unit "loki/0" already joined relation 16
2023-07-14T19:46:49.142Z [container-agent] 2023-07-14 19:46:49 DEBUG juju.worker.uniter.relation resolver.go:285 unit "traefik/0" already joined relation 5
2023-07-14T19:46:49.142Z [container-agent] 2023-07-14 19:46:49 DEBUG juju.worker.uniter.relation resolver.go:285 unit "catalogue/0" already joined relation 20
2023-07-14T19:46:49.142Z [container-agent] 2023-07-14 19:46:49 DEBUG juju.worker.uniter.relation resolver.go:285 unit "alertmanager/0" already joined relation 10
2023-07-14T19:51:27.225Z [container-agent] 2023-07-14 19:51:27 DEBUG juju.worker.uniter.remotestate watcher.go:595 got action change for grafana/0: [2] ok=true
2023-07-14T19:51:27.225Z [container-agent] 2023-07-14 19:51:27 DEBUG juju.worker.uniter.operation executor.go:85 running operation run action 2 for grafana/0
2023-07-14T19:51:27.225Z [container-agent] 2023-07-14 19:51:27 DEBUG juju.machinelock machinelock.go:162 acquire machine lock for grafana/0 uniter (run action 2)
2023-07-14T19:51:27.225Z [container-agent] 2023-07-14 19:51:27 DEBUG juju.machinelock machinelock.go:172 machine lock acquired for grafana/0 uniter (run action 2)
2023-07-14T19:51:27.225Z [container-agent] 2023-07-14 19:51:27 DEBUG juju.worker.uniter.operation executor.go:132 preparing operation "run action 2" for grafana/0
2023-07-14T19:51:27.274Z [container-agent] 2023-07-14 19:51:27 DEBUG juju.worker.uniter.operation executor.go:132 executing operation "run action 2" for grafana/0
2023-07-14T19:51:27.274Z [container-agent] 2023-07-14 19:51:27 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] executing: running action get-admin-password
2023-07-14T19:51:27.290Z [container-agent] 2023-07-14 19:51:27 DEBUG juju.worker.uniter.runner runner.go:380 running action "get-admin-password" on 1
2023-07-14T19:51:27.290Z [container-agent] 2023-07-14 19:51:27 DEBUG juju.worker.uniter.runner runner.go:728 starting jujuc server  {unix @/var/lib/juju/agents/unit-grafana-0/agent.socket <nil>}
2023-07-14T19:51:27.774Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "juju-log" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.774Z [container-agent] 2023-07-14 19:51:27 DEBUG juju-log Operator Framework 2.2.0+1.g734e12d up and running.
2023-07-14T19:51:27.785Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-get" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.796Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-get" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.803Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-get" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.811Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-get" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.819Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "juju-log" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.819Z [container-agent] 2023-07-14 19:51:27 DEBUG juju-log load_ssl_context verify='/var/run/secrets/kubernetes.io/serviceaccount/ca.crt' cert=None trust_env=True http2=False
2023-07-14T19:51:27.827Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "juju-log" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.827Z [container-agent] 2023-07-14 19:51:27 DEBUG juju-log load_verify_locations cafile='/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'
2023-07-14T19:51:27.834Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "relation-ids" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.840Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "relation-list" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.846Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-get" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.853Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "config-get" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.863Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "is-leader" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.870Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "relation-get" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.880Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-get" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.886Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "juju-log" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.886Z [container-agent] 2023-07-14 19:51:27 DEBUG juju-log Emitting Juju event get_admin_password_action.
2023-07-14T19:51:27.886Z [container-agent] 2023-07-14 19:51:27 DEBUG juju-log Emitting Juju event get_admin_password_action.
2023-07-14T19:51:27.892Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-set" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.892Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-set" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.896Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-get" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.902Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-set" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.902Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-set" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.909Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-get" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.915Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-get" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.921Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "action-get" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.927Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "juju-log" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.927Z [container-agent] 2023-07-14 19:51:27 DEBUG juju-log Starting new HTTP connection (1): localhost:3000
2023-07-14T19:51:27.935Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "juju-log" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.935Z [container-agent] 2023-07-14 19:51:27 DEBUG juju-log http://localhost:3000 "GET /api/health HTTP/1.1" 200 0
2023-07-14T19:51:27.959Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "juju-log" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.959Z [container-agent] 2023-07-14 19:51:27 DEBUG juju-log http://localhost:3000 "GET //api/org HTTP/1.1" 401 0
2023-07-14T19:51:27.966Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "action-set" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.971Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-get" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.975Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-set" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.976Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-set" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.981Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-delete" for grafana/0-get-admin-password-2875596045297097082
2023-07-14T19:51:27.987Z [container-agent] 2023-07-14 19:51:27 DEBUG jujuc server.go:213 running hook tool "state-set" for grafana/0-get-admin-password-2875596045297097082

Additional context

There is a workaround for this.
microk8s kubectl exec -it -n cos grafana-0 -c grafana -- /bin/bash

root@grafana-0:/usr/share/grafana# ./bin/grafana-cli admin reset-admin-password <new-password>

Integrate Loki and Alertmanager

Loki has an integration with Alertmanager that is currently not leveraged by the bundle. Besides the one-liner to add the missing relation (sketched below), we need integration tests in which Loki-generated alerts are routed through Alertmanager.
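
For reference, the missing one-liner would be along these lines; the endpoint names are assumptions here, so check them with juju info loki-k8s and juju info alertmanager-k8s:

# Hypothetical relation letting Loki send its ruler-evaluated alerts to Alertmanager:
$ juju relate loki:alertmanager alertmanager:alerting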
