Comments (16)
Hi @amc94, let me try to understand the situation:
- You have deployed COS-Lite in one k8s model.
- You have applied two more layers to this deployment. (Could you please share those layers?)
- In a machine model you deploy cos-proxy and relate it to Prometheus.
- In the k8s model you see the traceback you posted.

Are you able to reproduce the same behaviour using edge instead of stable?
from cos-lite-bundle.
In charm code we call `self.container.exec(["update-ca-certificates", "--fresh"]).wait()` behind a `can_connect` guard. It is one of those cases that we deemed "ok to go into error state". We often see pebble exceptions after a `can_connect` guard when testing on a slow VM (although this is the first time I have seen `http.client.RemoteDisconnected`). But the crash loop backoff is curious.
Is that a transient error? In the logs (1, 2, 3) it is active/idle.
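The guard pattern described above can be sketched as follows. This is a minimal illustration with a stand-in `FakeContainer` class, not the real charm code; in an actual charm, `container` would be an ops `Container` and the exceptions would be the pebble ones from the ops framework.

```python
class ExecError(Exception):
    """Stand-in for the pebble exec/transport errors a real charm would see."""


class FakeContainer:
    """Minimal stand-in for ops' Container, just enough for the sketch."""

    def __init__(self, up: bool):
        self._up = up

    def can_connect(self) -> bool:
        return self._up

    def exec(self, command):
        if not self._up:
            raise ExecError("cannot connect to pebble")

        class _Proc:
            def wait(self):
                return None

        return _Proc()


def refresh_ca_certificates(container) -> bool:
    """Run update-ca-certificates behind a can_connect guard.

    Returns False (deferring to a later event) if pebble is unreachable.
    """
    if not container.can_connect():
        return False
    # NOTE: can_connect() is only a snapshot -- pebble can still go away
    # between the check and the exec, which is the race discussed above.
    container.exec(["update-ca-certificates", "--fresh"]).wait()
    return True
```

The point is that `can_connect()` narrows but does not close the window in which the exec can fail, so transient pebble errors after the guard are expected on slow machines.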
Hi, I tried edge instead of stable and managed to run into it again.
Juju status:
From the cos-proxy logs:
From the telegraf monitoring cos-proxy:
It's not necessarily two more layers; in the first run only a landscape layer is deployed. That juju log output was collected 5 hours before the end of the run, so once the cos layer finished deployment, the later output shows:
```
Model  Controller            Cloud/Region              Version  SLA          Timestamp
cos    foundations-microk8s  microk8s_cloud/localhost  3.1.7    unsupported  17:06:51Z

App           Version  Status   Scale  Charm                     Channel  Rev  Address         Exposed  Message
alertmanager  0.26.0   active   2      alertmanager-k8s          stable   101  10.152.183.99   no
avalanche              active   2      avalanche-k8s             edge     39   10.152.183.56   no
ca                     active   1      self-signed-certificates  edge     117  10.152.183.227  no
catalogue              active   1      catalogue-k8s             stable   33   10.152.183.89   no
external-ca            active   1      self-signed-certificates  edge     117  10.152.183.212  no
grafana       9.5.3    active   1      grafana-k8s               stable   105  10.152.183.116  no
loki          2.9.4    active   1      loki-k8s                  stable   118  10.152.183.232  no
prometheus    2.49.1   waiting  1      prometheus-k8s            stable   170  10.152.183.187  no       installing agent
traefik       2.10.5   active   1      traefik-k8s               stable   169  10.246.167.216  no

Unit             Workload     Agent      Address      Ports  Message
alertmanager/0*  active       idle       10.1.81.16
alertmanager/1   active       idle       10.1.216.9
avalanche/0*     active       idle       10.1.81.11
avalanche/1      active       idle       10.1.216.6
ca/0*            active       idle       10.1.81.12
catalogue/0*     active       idle       10.1.81.13
external-ca/0*   active       idle       10.1.216.7
grafana/0*       active       idle       10.1.216.10
loki/0*          active       idle       10.1.89.5
prometheus/0*    maintenance  executing  10.1.81.17          Configuring Prometheus
traefik/0*       active       idle       10.1.81.15

Offer         Application   Charm             Rev  Connected  Endpoint              Interface                Role
alertmanager  alertmanager  alertmanager-k8s  101  0/0        karma-dashboard       karma_dashboard          provider
grafana       grafana       grafana-k8s       105  1/1        grafana-dashboard     grafana_dashboard        requirer
loki          loki          loki-k8s          118  1/1        logging               loki_push_api            provider
prometheus    prometheus    prometheus-k8s    170  2/2        metrics-endpoint      prometheus_scrape        requirer
                                                              receive-remote-write  prometheus_remote_write  provider

Unit           Workload  Agent  Address     Ports      Message
controller/0*  active    idle   10.1.216.4  37017/TCP
```
And pods.txt in the cos crashdump shows:
prometheus-0 1/2 CrashLoopBackOff 42 (34s ago) 5h46m
Also, sorry about the less-than-beautiful screenshots.
@amc94 From the screenshot it looks like prometheus was in error for about 40 sec and then eventually active/idle?
Can you confirm whether this is transient or persistent?
It would also be handy to see the output of `describe pod` to find the reason for the crash loop backoff:
kubectl -n cos describe pod prometheus-0
Name: prometheus-0
Namespace: cos
Priority: 0
Service Account: prometheus
Node: microk8s-27-3-3/10.246.167.163
Start Time: Thu, 21 Mar 2024 15:09:31 +0000
Labels: app.kubernetes.io/name=prometheus
apps.kubernetes.io/pod-index=0
controller-revision-hash=prometheus-7ff58f989c
statefulset.kubernetes.io/pod-name=prometheus-0
Annotations: cni.projectcalico.org/containerID: c1bd838033801c0a6112899cd335f3c7859d545f8541e73be7936d2a58c2800b
cni.projectcalico.org/podIP: 10.1.81.8/32
cni.projectcalico.org/podIPs: 10.1.81.8/32
controller.juju.is/id: 5e202d63-f30a-41b1-8e96-023b50669e08
juju.is/version: 3.3.3
model.juju.is/id: 883d2661-9ec5-4f40-878f-38e0b778205c
unit.juju.is/id: prometheus/0
Status: Running
IP: 10.1.81.8
IPs:
IP: 10.1.81.8
Controlled By: StatefulSet/prometheus
Init Containers:
charm-init:
Container ID: containerd://0ed257779317430360e5a618330e69228ef2b3fa72e1e91717ac9d2cc4966a0d
Image: public.ecr.aws/juju/jujud-operator:3.3.3
Image ID: public.ecr.aws/juju/jujud-operator@sha256:0c48818b8aceb3a2c98cf0a79ae472a51d3ad74e217f348b5d948ab22cdf5937
Port: <none>
Host Port: <none>
Command:
/opt/containeragent
Args:
init
--containeragent-pebble-dir
/containeragent/pebble
--charm-modified-version
0
--data-dir
/var/lib/juju
--bin-dir
/charm/bin
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 21 Mar 2024 15:09:40 +0000
Finished: Thu, 21 Mar 2024 15:09:40 +0000
Ready: True
Restart Count: 0
Environment Variables from:
prometheus-application-config Secret Optional: false
Environment:
JUJU_CONTAINER_NAMES: prometheus
JUJU_K8S_POD_NAME: prometheus-0 (v1:metadata.name)
JUJU_K8S_POD_UUID: (v1:metadata.uid)
Mounts:
/charm/bin from charm-data (rw,path="charm/bin")
/charm/containers from charm-data (rw,path="charm/containers")
/containeragent/pebble from charm-data (rw,path="containeragent/pebble")
/var/lib/juju from charm-data (rw,path="var/lib/juju")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bgxjs (ro)
Containers:
charm:
Container ID: containerd://14d81c28503399b3cacde0f93a58dce331beb6ba5c769d47f264447b5c5b5cf0
Image: public.ecr.aws/juju/charm-base:ubuntu-20.04
Image ID: public.ecr.aws/juju/charm-base@sha256:2c3ca53095187fc456bb84b939a69cb1fadb829aaee1c5f200b7d42f1e75a304
Port: <none>
Host Port: <none>
Command:
/charm/bin/pebble
Args:
run
--http
:38812
--verbose
State: Running
Started: Thu, 21 Mar 2024 15:09:41 +0000
Ready: True
Restart Count: 0
Liveness: http-get http://:38812/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
Readiness: http-get http://:38812/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
Startup: http-get http://:38812/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
Environment:
JUJU_CONTAINER_NAMES: prometheus
HTTP_PROBE_PORT: 3856
Mounts:
/charm/bin from charm-data (ro,path="charm/bin")
/charm/containers from charm-data (rw,path="charm/containers")
/var/lib/juju from charm-data (rw,path="var/lib/juju")
/var/lib/juju/storage/database/0 from prometheus-database-5b4ad243 (rw)
/var/lib/pebble/default from charm-data (rw,path="containeragent/pebble")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bgxjs (ro)
prometheus:
Container ID: containerd://7bc1b456c12525a0a4c52aa9d0fc8a9cd50962e083572811735bcd04590b4ac6
Image: registry.jujucharms.com/charm/h9a0wskime1pr9ve26xf9oj0yp09xk5potmgk/prometheus-image@sha256:27753c83f6e9766fb3b0ff158a2da79f6e7a26b3f873c39facd724c07adf54bd
Image ID: registry.jujucharms.com/charm/h9a0wskime1pr9ve26xf9oj0yp09xk5potmgk/prometheus-image@sha256:27753c83f6e9766fb3b0ff158a2da79f6e7a26b3f873c39facd724c07adf54bd
Port: <none>
Host Port: <none>
Command:
/charm/bin/pebble
Args:
run
--create-dirs
--hold
--http
:38813
--verbose
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Thu, 21 Mar 2024 22:40:34 +0000
Finished: Thu, 21 Mar 2024 22:41:30 +0000
Ready: False
Restart Count: 57
Limits:
cpu: 250m
memory: 209715200
Requests:
cpu: 250m
memory: 200Mi
Liveness: http-get http://:38813/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
Readiness: http-get http://:38813/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
Environment:
JUJU_CONTAINER_NAME: prometheus
PEBBLE_SOCKET: /charm/container/pebble.socket
Mounts:
/charm/bin/pebble from charm-data (ro,path="charm/bin/pebble")
/charm/container from charm-data (rw,path="charm/containers/prometheus")
/var/lib/prometheus from prometheus-database-5b4ad243 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bgxjs (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
prometheus-database-5b4ad243:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: prometheus-database-5b4ad243-prometheus-0
ReadOnly: false
charm-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-bgxjs:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/arch=amd64
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 3m51s (x1194 over 5h18m) kubelet Back-off restarting failed container prometheus in pod prometheus-0_cos(e46453e4-4594-49ad-8a5a-d425dad7e920)
@sed-i It's persistent; it hits active/idle for a small amount of time after each restart.
Thanks @amc94, we have another hint: prometheus is being OOMKilled:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Any chance prometheus has accumulated a large WAL that doesn't fit into memory? (Could you attach the output of `juju config avalanche`?)
You could check with:
juju ssh --container prometheus prometheus/0 du -hs /var/lib/prometheus/wal
This type of failure could be more obvious if you apply resource limits to the pod:
juju config prometheus cpu=2 memory=4Gi
application: avalanche
application-config:
juju-application-path:
default: /
description: the relative http path used to access an application
source: default
type: string
value: /
juju-external-hostname:
description: the external hostname of an exposed application
source: unset
type: string
kubernetes-ingress-allow-http:
default: false
description: whether to allow HTTP traffic to the ingress controller
source: default
type: bool
value: false
kubernetes-ingress-class:
default: nginx
description: the class of the ingress controller to be used by the ingress resource
source: default
type: string
value: nginx
kubernetes-ingress-ssl-passthrough:
default: false
description: whether to passthrough SSL traffic to the ingress controller
source: default
type: bool
value: false
kubernetes-ingress-ssl-redirect:
default: false
description: whether to redirect SSL traffic to the ingress controller
source: default
type: bool
value: false
kubernetes-service-annotations:
description: a space separated set of annotations to add to the service
source: unset
type: attrs
kubernetes-service-external-ips:
description: list of IP addresses for which nodes in the cluster will also accept
traffic
source: unset
type: string
kubernetes-service-externalname:
description: external reference that kubedns or equivalent will return as a CNAME
record
source: unset
type: string
kubernetes-service-loadbalancer-ip:
description: LoadBalancer will get created with the IP specified in this field
source: unset
type: string
kubernetes-service-loadbalancer-sourceranges:
description: traffic through the load-balancer will be restricted to the specified
client IPs
source: unset
type: string
kubernetes-service-target-port:
description: name or number of the port to access on the pods targeted by the
service
source: unset
type: string
kubernetes-service-type:
description: determines how the Service is exposed
source: unset
type: string
trust:
default: false
description: Does this application have access to trusted credentials
source: user
type: bool
value: true
charm: avalanche-k8s
settings:
label_count:
default: 10
description: Number of labels per-metric.
source: default
type: int
value: 10
labelname_length:
default: 5
description: Modify length of label names.
source: default
type: int
value: 5
metric_count:
default: 500
description: Number of metrics to serve.
source: user
type: int
value: 10
metric_interval:
default: 3.6e+07
description: |
Change __name__ label values every {interval} seconds. Avalanche's CLI default value is 120, but this is too low and quickly overloads the scraper. Using 3600000 (10k hours ~ 1 year) in lieu of "inf" (never refresh).
source: default
type: int
value: 3.6e+07
metricname_length:
default: 5
description: Modify length of metric names.
source: default
type: int
value: 5
series_count:
default: 10
description: Number of series per-metric.
source: user
type: int
value: 2
series_interval:
default: 3.6e+07
description: |
Change series_id label values every {interval} seconds. Avalanche's CLI default value is 60, but this is too low and quickly overloads the scraper. Using 3600000 (10k hours ~ 1 year) in lieu of "inf" (never refresh).
source: default
type: int
value: 3.6e+07
value_interval:
default: 30
description: Change series values every {interval} seconds.
source: default
type: int
value: 30
And the WAL size from `du`:
16M /var/lib/prometheus/wal
Yep — with metric_count=10 and series_count=2 per the config above, that's only ~20 series per avalanche unit changing values every 30 sec; not a high ingestion load at all, and the 16M WAL reflects it.
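The back-of-the-envelope arithmetic can be made explicit. This is a rough estimate only, and the formula (metrics × series × units ÷ value interval) is an assumption about how avalanche's knobs combine, not something stated in the thread:

```python
def avalanche_samples_per_sec(metric_count: int, series_count: int,
                              units: int, value_interval_s: int) -> float:
    """Rough rate of new sample values avalanche exposes for scraping.

    Assumption: each of `units` avalanche units serves metric_count metrics
    with series_count series each, and values change every value_interval_s.
    """
    return metric_count * series_count * units / value_interval_s


# Values from the `juju config avalanche` output above: metric_count=10,
# series_count=2, with 2 avalanche units and value_interval=30.
rate = avalanche_samples_per_sec(metric_count=10, series_count=2,
                                 units=2, value_interval_s=30)
print(rate)  # a couple of samples per second -- negligible for prometheus
```

Even with the charm's defaults (metric_count=500, series_count=10) the rate would be on the order of a few hundred samples per second, which prometheus handles comfortably, so the OOMKilled is unlikely to come from this ingestion alone.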
Can we dig a bit deeper? Could you share the output of:
journalctl | grep eviction
journalctl --no-pager -kqg 'killed process' -o verbose --output-fields=MESSAGE
kubectl get pod prometheus-0 -o=jsonpath='{.status}' -n cos
journalctl output was empty for both commands. The pod status:
{"conditions":[{"lastProbeTime":null,"lastTransitionTime":"2024-03-22T07:14:26Z","status":"True","type":"Initialized"},{"lastProbeTime":null,"lastTransitionTime":"2024-03-22T13:52:11Z","message":"containers with unready status: [prometheus]","reason":"ContainersNotReady","status":"False","type":"Ready"},{"lastProbeTime":null,"lastTransitionTime":"2024-03-22T13:52:11Z","message":"containers with unready status: [prometheus]","reason":"ContainersNotReady","status":"False","type":"ContainersReady"},{"lastProbeTime":null,"lastTransitionTime":"2024-03-22T07:14:13Z","status":"True","type":"PodScheduled"}],"containerStatuses":[{"containerID":"containerd://b97ff807f8b8738db2c91851d21deb317448ab489a9c2b81d161630c448fc20a","image":"public.ecr.aws/juju/charm-base:ubuntu-20.04","imageID":"public.ecr.aws/juju/charm-base@sha256:accafa4a09fea590ba0c5baba90fec90e6c51136fe772695e3724b3d8c879dd2","lastState":{},"name":"charm","ready":true,"restartCount":0,"started":true,"state":{"running":{"startedAt":"2024-03-22T07:14:26Z"}}},{"containerID":"containerd://ab166870ead535a311590ed8bec4ba71520fbbfb7895bbd72d3d78eca3e71ebd","image":"sha256:d09e269a1213ea7586369dfd16611f33823897871731d01588e1096e2c146614","imageID":"registry.jujucharms.com/charm/h9a0wskime1pr9ve26xf9oj0yp09xk5potmgk/prometheus-image@sha256:27753c83f6e9766fb3b0ff158a2da79f6e7a26b3f873c39facd724c07adf54bd","lastState":{"terminated":{"containerID":"containerd://ab166870ead535a311590ed8bec4ba71520fbbfb7895bbd72d3d78eca3e71ebd","exitCode":137,"finishedAt":"2024-03-22T13:52:10Z","reason":"OOMKilled","startedAt":"2024-03-22T13:51:21Z"}},"name":"prometheus","ready":false,"restartCount":48,"started":false,"state":{"waiting":{"message":"back-off 5m0s restarting failed container=prometheus 
pod=prometheus-0_cos(1513187a-9472-491c-a5d5-065665d3a8b4)","reason":"CrashLoopBackOff"}}}],"hostIP":"10.246.164.182","initContainerStatuses":[{"containerID":"containerd://32e5b91441deabf9e5a0f35b0c3f3be2c7203e2dd2efcebd56fe66d7bb9b82bd","image":"public.ecr.aws/juju/jujud-operator:3.3.3","imageID":"public.ecr.aws/juju/jujud-operator@sha256:2921a3ee54d7f7f7847a8e8bc9a132b1deb40ed32c37098694df68b9e1a6808b","lastState":{},"name":"charm-init","ready":true,"restartCount":0,"started":false,"state":{"terminated":{"containerID":"containerd://32e5b91441deabf9e5a0f35b0c3f3be2c7203e2dd2efcebd56fe66d7bb9b82bd","exitCode":0,"finishedAt":"2024-03-22T07:14:24Z","reason":"Completed","startedAt":"2024-03-22T07:14:24Z"}}}],"phase":"Running","podIP":"10.1.240.201","podIPs":[{"ip":"10.1.240.201"}],"qosClass":"Burstable","startTime":"2024-03-22T07:14:14Z"}
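The interesting fields are buried in that JSON blob. A small sketch of how one might pull the crash reason and restart count out of the pod status (the `status_json` below is a trimmed sample of the output above, not the full document):

```python
import json

# Trimmed sample of `kubectl get pod prometheus-0 -o jsonpath='{.status}'`.
status_json = """{"containerStatuses": [
  {"name": "charm", "restartCount": 0,
   "lastState": {}, "state": {"running": {}}},
  {"name": "prometheus", "restartCount": 48,
   "lastState": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
   "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}"""

status = json.loads(status_json)
for cs in status["containerStatuses"]:
    # lastState.terminated carries the reason for the most recent crash.
    last = cs["lastState"].get("terminated", {})
    print(cs["name"], cs["restartCount"], last.get("reason", "-"))
```

Running this over the real status surfaces the same two facts discussed below: the `prometheus` container has `reason=OOMKilled` with 48 restarts, while the `charm` container is healthy.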
Really odd to see `"reason":"OOMKilled"` and `"restartCount":48` with such a small ingestion load.
Anything noteworthy from prometheus itself?
kubectl -n cos logs prometheus-0 -c prometheus
@sed-i
We've currently stopped deploying cos proxy so prometheus isn't hitting this issue. Could it be that cos-proxy was writing enough data in a single go that it caused prometheus to hit oom?
(Technically, cos-proxy doesn't send metrics; cos-proxy sends scrape job specs over relation data to prometheus, and prometheus does the scraping.)
It's possible that there are a lot of metrics to scrape, but I somehow doubt you hit that in a testing env.
It is much more likely that loki gets overloaded. When both prom and loki consume a lot of resources, I've seen the OOM-kill algorithm select prometheus over loki.
From the jenkins logs you shared I couldn't spot the bundle yamls that are related to the cos charms.
Would you be able to link them here?
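For context on the parenthetical above: the scrape job specs cos-proxy forwards are essentially fragments of Prometheus scrape configuration serialized into relation data. A minimal sketch of what such a spec might look like — the field names and `scrape_jobs` key below follow the general shape of the `prometheus_scrape` interface, but treat the exact schema as illustrative rather than authoritative:

```python
import json

# Illustrative scrape job spec, roughly the shape cos-proxy would forward
# over the prometheus_scrape relation. Target address and labels are
# made up for the example.
scrape_jobs = [
    {
        "job_name": "telegraf",
        "metrics_path": "/metrics",
        "static_configs": [
            {
                "targets": ["10.246.167.10:9103"],
                "labels": {"juju_application": "telegraf"},
            }
        ],
    }
]

# Juju relation data is string-valued, so the job list is JSON-serialized
# before being written into the relation databag.
relation_data = {"scrape_jobs": json.dumps(scrape_jobs)}
print(relation_data["scrape_jobs"])
```

So the load on prometheus depends on what those jobs point at, which is why seeing the actual bundle yamls would help.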
Thank you for explaining.
The bundle file for openstack
Have you seen this error recently?
It has not.