Comments (16)
Hi @amc94, let me try to understand the situation:
- You have deployed COS-Lite in one k8s model.
- You have applied two more layers to this deployment. (Could you please share those layers?)
- In a machine model you deploy cos-proxy and relate it to Prometheus.
- In the k8s model you see the traceback you posted.

Are you able to reproduce the same behaviour using edge instead of stable?
from cos-lite-bundle.
In charm code we call `self.container.exec(["update-ca-certificates", "--fresh"]).wait()` behind a `can_connect` guard. It is one of those cases that we deemed "ok to go into error state". We often see pebble exceptions after a `can_connect` guard when testing on a slow VM (although this is the first time I have seen `http.client.RemoteDisconnected`). But the crash loop backoff is curious.
Is that a transient error? In the logs (1, 2, 3) it is active/idle.
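The guard pattern described above can be sketched as follows. This is a minimal illustration with a stand-in `FakeContainer` class, not the real charm code; in an actual charm, `container` would be an ops `Container` and the exceptions would be the pebble ones from the ops framework.

```python
class ExecError(Exception):
    """Stand-in for the pebble exec/transport errors a real charm would see."""


class FakeContainer:
    """Minimal stand-in for ops' Container, just enough for the sketch."""

    def __init__(self, up: bool):
        self._up = up

    def can_connect(self) -> bool:
        return self._up

    def exec(self, command):
        if not self._up:
            raise ExecError("cannot connect to pebble")

        class _Proc:
            def wait(self):
                return None

        return _Proc()


def refresh_ca_certificates(container) -> bool:
    """Run update-ca-certificates behind a can_connect guard.

    Returns False (deferring to a later event) if pebble is unreachable.
    """
    if not container.can_connect():
        return False
    # NOTE: can_connect() is only a snapshot -- pebble can still go away
    # between the check and the exec, which is the race discussed above.
    container.exec(["update-ca-certificates", "--fresh"]).wait()
    return True
```

The point is that `can_connect()` narrows but does not close the window in which the exec can fail, so transient pebble errors after the guard are expected on slow machines.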
Hi, I tried edge instead of stable and managed to run into it again.
Juju status:
From the cos-proxy logs:
From the telegraf monitoring cos-proxy:
It's not necessarily two more layers; in the first run only a landscape layer is deployed. That juju log output was collected 5 hours before the end of the run, so once the cos layer finished deployment, the later output shows:
```
Model  Controller            Cloud/Region              Version  SLA          Timestamp
cos    foundations-microk8s  microk8s_cloud/localhost  3.1.7    unsupported  17:06:51Z

App           Version  Status   Scale  Charm                     Channel  Rev  Address         Exposed  Message
alertmanager  0.26.0   active   2      alertmanager-k8s          stable   101  10.152.183.99   no
avalanche              active   2      avalanche-k8s             edge     39   10.152.183.56   no
ca                     active   1      self-signed-certificates  edge     117  10.152.183.227  no
catalogue              active   1      catalogue-k8s             stable   33   10.152.183.89   no
external-ca            active   1      self-signed-certificates  edge     117  10.152.183.212  no
grafana       9.5.3    active   1      grafana-k8s               stable   105  10.152.183.116  no
loki          2.9.4    active   1      loki-k8s                  stable   118  10.152.183.232  no
prometheus    2.49.1   waiting  1      prometheus-k8s            stable   170  10.152.183.187  no       installing agent
traefik       2.10.5   active   1      traefik-k8s               stable   169  10.246.167.216  no

Unit             Workload     Agent      Address      Ports  Message
alertmanager/0*  active       idle       10.1.81.16
alertmanager/1   active       idle       10.1.216.9
avalanche/0*     active       idle       10.1.81.11
avalanche/1      active       idle       10.1.216.6
ca/0*            active       idle       10.1.81.12
catalogue/0*     active       idle       10.1.81.13
external-ca/0*   active       idle       10.1.216.7
grafana/0*       active       idle       10.1.216.10
loki/0*          active       idle       10.1.89.5
prometheus/0*    maintenance  executing  10.1.81.17          Configuring Prometheus
traefik/0*       active       idle       10.1.81.15

Offer         Application   Charm             Rev  Connected  Endpoint              Interface                Role
alertmanager  alertmanager  alertmanager-k8s  101  0/0        karma-dashboard       karma_dashboard          provider
grafana       grafana       grafana-k8s       105  1/1        grafana-dashboard     grafana_dashboard        requirer
loki          loki          loki-k8s          118  1/1        logging               loki_push_api            provider
prometheus    prometheus    prometheus-k8s    170  2/2        metrics-endpoint      prometheus_scrape        requirer
                                                              receive-remote-write  prometheus_remote_write  provider

Unit           Workload  Agent  Address     Ports      Message
controller/0*  active    idle   10.1.216.4  37017/TCP
```
And pods.txt in the cos crashdump shows:
prometheus-0 1/2 CrashLoopBackOff 42 (34s ago) 5h46m
Also, sorry about the less-than-beautiful screenshots.
@amc94 From the screenshot it looks like prometheus was in error for about 40 sec and then eventually active/idle?
Can you confirm whether this is transient or persistent?
It would also be handy to see the output of `describe pod` to find the reason for the crash loop backoff:
kubectl -n cos describe pod prometheus-0
Name: prometheus-0
Namespace: cos
Priority: 0
Service Account: prometheus
Node: microk8s-27-3-3/10.246.167.163
Start Time: Thu, 21 Mar 2024 15:09:31 +0000
Labels: app.kubernetes.io/name=prometheus
apps.kubernetes.io/pod-index=0
controller-revision-hash=prometheus-7ff58f989c
statefulset.kubernetes.io/pod-name=prometheus-0
Annotations: cni.projectcalico.org/containerID: c1bd838033801c0a6112899cd335f3c7859d545f8541e73be7936d2a58c2800b
cni.projectcalico.org/podIP: 10.1.81.8/32
cni.projectcalico.org/podIPs: 10.1.81.8/32
controller.juju.is/id: 5e202d63-f30a-41b1-8e96-023b50669e08
juju.is/version: 3.3.3
model.juju.is/id: 883d2661-9ec5-4f40-878f-38e0b778205c
unit.juju.is/id: prometheus/0
Status: Running
IP: 10.1.81.8
IPs:
IP: 10.1.81.8
Controlled By: StatefulSet/prometheus
Init Containers:
charm-init:
Container ID: containerd://0ed257779317430360e5a618330e69228ef2b3fa72e1e91717ac9d2cc4966a0d
Image: public.ecr.aws/juju/jujud-operator:3.3.3
Image ID: public.ecr.aws/juju/jujud-operator@sha256:0c48818b8aceb3a2c98cf0a79ae472a51d3ad74e217f348b5d948ab22cdf5937
Port: <none>
Host Port: <none>
Command:
/opt/containeragent
Args:
init
--containeragent-pebble-dir
/containeragent/pebble
--charm-modified-version
0
--data-dir
/var/lib/juju
--bin-dir
/charm/bin
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 21 Mar 2024 15:09:40 +0000
Finished: Thu, 21 Mar 2024 15:09:40 +0000
Ready: True
Restart Count: 0
Environment Variables from:
prometheus-application-config Secret Optional: false
Environment:
JUJU_CONTAINER_NAMES: prometheus
JUJU_K8S_POD_NAME: prometheus-0 (v1:metadata.name)
JUJU_K8S_POD_UUID: (v1:metadata.uid)
Mounts:
/charm/bin from charm-data (rw,path="charm/bin")
/charm/containers from charm-data (rw,path="charm/containers")
/containeragent/pebble from charm-data (rw,path="containeragent/pebble")
/var/lib/juju from charm-data (rw,path="var/lib/juju")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bgxjs (ro)
Containers:
charm:
Container ID: containerd://14d81c28503399b3cacde0f93a58dce331beb6ba5c769d47f264447b5c5b5cf0
Image: public.ecr.aws/juju/charm-base:ubuntu-20.04
Image ID: public.ecr.aws/juju/charm-base@sha256:2c3ca53095187fc456bb84b939a69cb1fadb829aaee1c5f200b7d42f1e75a304
Port: <none>
Host Port: <none>
Command:
/charm/bin/pebble
Args:
run
--http
:38812
--verbose
State: Running
Started: Thu, 21 Mar 2024 15:09:41 +0000
Ready: True
Restart Count: 0
Liveness: http-get http://:38812/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
Readiness: http-get http://:38812/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
Startup: http-get http://:38812/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
Environment:
JUJU_CONTAINER_NAMES: prometheus
HTTP_PROBE_PORT: 3856
Mounts:
/charm/bin from charm-data (ro,path="charm/bin")
/charm/containers from charm-data (rw,path="charm/containers")
/var/lib/juju from charm-data (rw,path="var/lib/juju")
/var/lib/juju/storage/database/0 from prometheus-database-5b4ad243 (rw)
/var/lib/pebble/default from charm-data (rw,path="containeragent/pebble")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bgxjs (ro)
prometheus:
Container ID: containerd://7bc1b456c12525a0a4c52aa9d0fc8a9cd50962e083572811735bcd04590b4ac6
Image: registry.jujucharms.com/charm/h9a0wskime1pr9ve26xf9oj0yp09xk5potmgk/prometheus-image@sha256:27753c83f6e9766fb3b0ff158a2da79f6e7a26b3f873c39facd724c07adf54bd
Image ID: registry.jujucharms.com/charm/h9a0wskime1pr9ve26xf9oj0yp09xk5potmgk/prometheus-image@sha256:27753c83f6e9766fb3b0ff158a2da79f6e7a26b3f873c39facd724c07adf54bd
Port: <none>
Host Port: <none>
Command:
/charm/bin/pebble
Args:
run
--create-dirs
--hold
--http
:38813
--verbose
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Thu, 21 Mar 2024 22:40:34 +0000
Finished: Thu, 21 Mar 2024 22:41:30 +0000
Ready: False
Restart Count: 57
Limits:
cpu: 250m
memory: 209715200
Requests:
cpu: 250m
memory: 200Mi
Liveness: http-get http://:38813/v1/health%3Flevel=alive delay=30s timeout=1s period=5s #success=1 #failure=1
Readiness: http-get http://:38813/v1/health%3Flevel=ready delay=30s timeout=1s period=5s #success=1 #failure=1
Environment:
JUJU_CONTAINER_NAME: prometheus
PEBBLE_SOCKET: /charm/container/pebble.socket
Mounts:
/charm/bin/pebble from charm-data (ro,path="charm/bin/pebble")
/charm/container from charm-data (rw,path="charm/containers/prometheus")
/var/lib/prometheus from prometheus-database-5b4ad243 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bgxjs (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
prometheus-database-5b4ad243:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: prometheus-database-5b4ad243-prometheus-0
ReadOnly: false
charm-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-bgxjs:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/arch=amd64
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 3m51s (x1194 over 5h18m) kubelet Back-off restarting failed container prometheus in pod prometheus-0_cos(e46453e4-4594-49ad-8a5a-d425dad7e920)
@sed-i It's persistent; it hits active/idle for a small amount of time after each restart.
Thanks @amc94, we have another hint: prometheus is being OOMKilled:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Any chance prometheus has accumulated a large WAL that doesn't fit into memory? (Could you attach the output of `juju config avalanche`?)
You could check with:
juju ssh --container prometheus prometheus/0 du -hs /var/lib/prometheus/wal
This type of failure could be more obvious if you apply resource limits to the pod:
juju config prometheus cpu=2 memory=4Gi
application: avalanche
application-config:
juju-application-path:
default: /
description: the relative http path used to access an application
source: default
type: string
value: /
juju-external-hostname:
description: the external hostname of an exposed application
source: unset
type: string
kubernetes-ingress-allow-http:
default: false
description: whether to allow HTTP traffic to the ingress controller
source: default
type: bool
value: false
kubernetes-ingress-class:
default: nginx
description: the class of the ingress controller to be used by the ingress resource
source: default
type: string
value: nginx
kubernetes-ingress-ssl-passthrough:
default: false
description: whether to passthrough SSL traffic to the ingress controller
source: default
type: bool
value: false
kubernetes-ingress-ssl-redirect:
default: false
description: whether to redirect SSL traffic to the ingress controller
source: default
type: bool
value: false
kubernetes-service-annotations:
description: a space separated set of annotations to add to the service
source: unset
type: attrs
kubernetes-service-external-ips:
description: list of IP addresses for which nodes in the cluster will also accept
traffic
source: unset
type: string
kubernetes-service-externalname:
description: external reference that kubedns or equivalent will return as a CNAME
record
source: unset
type: string
kubernetes-service-loadbalancer-ip:
description: LoadBalancer will get created with the IP specified in this field
source: unset
type: string
kubernetes-service-loadbalancer-sourceranges:
description: traffic through the load-balancer will be restricted to the specified
client IPs
source: unset
type: string
kubernetes-service-target-port:
description: name or number of the port to access on the pods targeted by the
service
source: unset
type: string
kubernetes-service-type:
description: determines how the Service is exposed
source: unset
type: string
trust:
default: false
description: Does this application have access to trusted credentials
source: user
type: bool
value: true
charm: avalanche-k8s
settings:
label_count:
default: 10
description: Number of labels per-metric.
source: default
type: int
value: 10
labelname_length:
default: 5
description: Modify length of label names.
source: default
type: int
value: 5
metric_count:
default: 500
description: Number of metrics to serve.
source: user
type: int
value: 10
metric_interval:
default: 3.6e+07
description: |
Change __name__ label values every {interval} seconds. Avalanche's CLI default value is 120, but this is too low and quickly overloads the scraper. Using 3600000 (10k hours ~ 1 year) in lieu of "inf" (never refresh).
source: default
type: int
value: 3.6e+07
metricname_length:
default: 5
description: Modify length of metric names.
source: default
type: int
value: 5
series_count:
default: 10
description: Number of series per-metric.
source: user
type: int
value: 2
series_interval:
default: 3.6e+07
description: |
Change series_id label values every {interval} seconds. Avalanche's CLI default value is 60, but this is too low and quickly overloads the scraper. Using 3600000 (10k hours ~ 1 year) in lieu of "inf" (never refresh).
source: default
type: int
value: 3.6e+07
value_interval:
default: 30
description: Change series values every {interval} seconds.
source: default
type: int
value: 30
And the WAL size from `du`:
16M /var/lib/prometheus/wal
Yep — with metric_count=10 and series_count=2 per the config above, that's only ~20 series per avalanche unit changing values every 30 sec; not a high ingestion load at all, and the 16M WAL reflects it.
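The back-of-the-envelope arithmetic can be made explicit. This is a rough estimate only, and the formula (metrics × series × units ÷ value interval) is an assumption about how avalanche's knobs combine, not something stated in the thread:

```python
def avalanche_samples_per_sec(metric_count: int, series_count: int,
                              units: int, value_interval_s: int) -> float:
    """Rough rate of new sample values avalanche exposes for scraping.

    Assumption: each of `units` avalanche units serves metric_count metrics
    with series_count series each, and values change every value_interval_s.
    """
    return metric_count * series_count * units / value_interval_s


# Values from the `juju config avalanche` output above: metric_count=10,
# series_count=2, with 2 avalanche units and value_interval=30.
rate = avalanche_samples_per_sec(metric_count=10, series_count=2,
                                 units=2, value_interval_s=30)
print(rate)  # a couple of samples per second -- negligible for prometheus
```

Even with the charm's defaults (metric_count=500, series_count=10) the rate would be on the order of a few hundred samples per second, which prometheus handles comfortably, so the OOMKilled is unlikely to come from this ingestion alone.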
Can we dig a bit deeper? Could you share the output of:
journalctl | grep eviction
journalctl --no-pager -kqg 'killed process' -o verbose --output-fields=MESSAGE
kubectl get pod prometheus-0 -o=jsonpath='{.status}' -n cos
journalctl output was empty for both commands. The pod status:
{"conditions":[{"lastProbeTime":null,"lastTransitionTime":"2024-03-22T07:14:26Z","status":"True","type":"Initialized"},{"lastProbeTime":null,"lastTransitionTime":"2024-03-22T13:52:11Z","message":"containers with unready status: [prometheus]","reason":"ContainersNotReady","status":"False","type":"Ready"},{"lastProbeTime":null,"lastTransitionTime":"2024-03-22T13:52:11Z","message":"containers with unready status: [prometheus]","reason":"ContainersNotReady","status":"False","type":"ContainersReady"},{"lastProbeTime":null,"lastTransitionTime":"2024-03-22T07:14:13Z","status":"True","type":"PodScheduled"}],"containerStatuses":[{"containerID":"containerd://b97ff807f8b8738db2c91851d21deb317448ab489a9c2b81d161630c448fc20a","image":"public.ecr.aws/juju/charm-base:ubuntu-20.04","imageID":"public.ecr.aws/juju/charm-base@sha256:accafa4a09fea590ba0c5baba90fec90e6c51136fe772695e3724b3d8c879dd2","lastState":{},"name":"charm","ready":true,"restartCount":0,"started":true,"state":{"running":{"startedAt":"2024-03-22T07:14:26Z"}}},{"containerID":"containerd://ab166870ead535a311590ed8bec4ba71520fbbfb7895bbd72d3d78eca3e71ebd","image":"sha256:d09e269a1213ea7586369dfd16611f33823897871731d01588e1096e2c146614","imageID":"registry.jujucharms.com/charm/h9a0wskime1pr9ve26xf9oj0yp09xk5potmgk/prometheus-image@sha256:27753c83f6e9766fb3b0ff158a2da79f6e7a26b3f873c39facd724c07adf54bd","lastState":{"terminated":{"containerID":"containerd://ab166870ead535a311590ed8bec4ba71520fbbfb7895bbd72d3d78eca3e71ebd","exitCode":137,"finishedAt":"2024-03-22T13:52:10Z","reason":"OOMKilled","startedAt":"2024-03-22T13:51:21Z"}},"name":"prometheus","ready":false,"restartCount":48,"started":false,"state":{"waiting":{"message":"back-off 5m0s restarting failed container=prometheus 
pod=prometheus-0_cos(1513187a-9472-491c-a5d5-065665d3a8b4)","reason":"CrashLoopBackOff"}}}],"hostIP":"10.246.164.182","initContainerStatuses":[{"containerID":"containerd://32e5b91441deabf9e5a0f35b0c3f3be2c7203e2dd2efcebd56fe66d7bb9b82bd","image":"public.ecr.aws/juju/jujud-operator:3.3.3","imageID":"public.ecr.aws/juju/jujud-operator@sha256:2921a3ee54d7f7f7847a8e8bc9a132b1deb40ed32c37098694df68b9e1a6808b","lastState":{},"name":"charm-init","ready":true,"restartCount":0,"started":false,"state":{"terminated":{"containerID":"containerd://32e5b91441deabf9e5a0f35b0c3f3be2c7203e2dd2efcebd56fe66d7bb9b82bd","exitCode":0,"finishedAt":"2024-03-22T07:14:24Z","reason":"Completed","startedAt":"2024-03-22T07:14:24Z"}}}],"phase":"Running","podIP":"10.1.240.201","podIPs":[{"ip":"10.1.240.201"}],"qosClass":"Burstable","startTime":"2024-03-22T07:14:14Z"}
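The interesting fields are buried in that JSON blob. A small sketch of how one might pull the crash reason and restart count out of the pod status (the `status_json` below is a trimmed sample of the output above, not the full document):

```python
import json

# Trimmed sample of `kubectl get pod prometheus-0 -o jsonpath='{.status}'`.
status_json = """{"containerStatuses": [
  {"name": "charm", "restartCount": 0,
   "lastState": {}, "state": {"running": {}}},
  {"name": "prometheus", "restartCount": 48,
   "lastState": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
   "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}"""

status = json.loads(status_json)
for cs in status["containerStatuses"]:
    # lastState.terminated carries the reason for the most recent crash.
    last = cs["lastState"].get("terminated", {})
    print(cs["name"], cs["restartCount"], last.get("reason", "-"))
```

Running this over the real status surfaces the same two facts discussed below: the `prometheus` container has `reason=OOMKilled` with 48 restarts, while the `charm` container is healthy.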
Really odd to see `"reason":"OOMKilled"` and `"restartCount":48` with such a small ingestion load.
Anything noteworthy from prometheus itself?
kubectl -n cos logs prometheus-0 -c prometheus
@sed-i
We've currently stopped deploying cos proxy so prometheus isn't hitting this issue. Could it be that cos-proxy was writing enough data in a single go that it caused prometheus to hit oom?
(Technically, cos-proxy doesn't send metrics; cos-proxy sends scrape job specs over relation data to prometheus, and prometheus does the scraping.)
It's possible that there are a lot of metrics to scrape, but I somehow doubt you hit that in a testing env.
It is much more likely that loki gets overloaded. When both prom and loki consume a lot of resources, I've seen the OOM-kill algorithm select prometheus over loki.
From the jenkins logs you shared I couldn't spot the bundle yamls that are related to the cos charms.
Would you be able to link them here?
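For context on the parenthetical above: the scrape job specs cos-proxy forwards are essentially fragments of Prometheus scrape configuration serialized into relation data. A minimal sketch of what such a spec might look like — the field names and `scrape_jobs` key below follow the general shape of the `prometheus_scrape` interface, but treat the exact schema as illustrative rather than authoritative:

```python
import json

# Illustrative scrape job spec, roughly the shape cos-proxy would forward
# over the prometheus_scrape relation. Target address and labels are
# made up for the example.
scrape_jobs = [
    {
        "job_name": "telegraf",
        "metrics_path": "/metrics",
        "static_configs": [
            {
                "targets": ["10.246.167.10:9103"],
                "labels": {"juju_application": "telegraf"},
            }
        ],
    }
]

# Juju relation data is string-valued, so the job list is JSON-serialized
# before being written into the relation databag.
relation_data = {"scrape_jobs": json.dumps(scrape_jobs)}
print(relation_data["scrape_jobs"])
```

So the load on prometheus depends on what those jobs point at, which is why seeing the actual bundle yamls would help.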
Thank you for explaining.
The bundle file for openstack
Have you seen this error recently?
It has not.