zmon-agent-core's Introduction

ZMON source code on GitHub is no longer in active development. Zalando will no longer actively review issues or merge pull-requests.

ZMON is still being used at Zalando and serves us well for many purposes. We are now deeper into our observability journey and understand better that we need other telemetry sources and tools to elevate our understanding of the systems we operate. We support the OpenTelemetry initiative and recommend that others starting their journey begin there.

If members of the community are interested in continuing to develop ZMON, consider forking it. Please review the licence before you do.

ZMON AGENT CORE

WIP

ZMON agent core for infrastructure discovery.

Supports:

  • Kubernetes discovery
  • Kubernetes Spilo PostgreSQL clusters/databases

zmon-agent-core's People

Contributors

a1exsh, aermakov-zalando, arjunrn, avaczi, epaul, erthalion, hjacobs, jan-m, linki, mikkeloscar, mohabusama, rafiasabih, rajatparida86, twz123, vetinari

zmon-agent-core's Issues

Add 'environment' to entities.

In our Kubernetes setup we have the concept of Environments e.g. Production and Test.
It would be useful to be able to filter on this in entities, similar to what we can do for cluster alias or cluster ID.

Add namespace also to postgresql_database entities

The postgresql_cluster entities have a "namespace" field, which allows filtering the databases one wants to look at. It would be nice if the same field were also available in the postgresql_database entities.

As I understand it, each database is in exactly one cluster, which has a namespace, so there should be no problem populating this field too.
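
The idea above could be sketched as follows. This is only an illustration under assumptions: the entities are plain dictionaries, and the `id` and `postgresql_cluster` field names used to link a database to its cluster are hypothetical, not taken from the agent's code.

```python
def add_namespace_to_databases(cluster_entities, database_entities):
    """Copy each cluster's namespace onto the database entities that reference it."""
    # Hypothetical linkage: a database entity names its cluster in 'postgresql_cluster'.
    namespaces = {c['id']: c.get('namespace', '') for c in cluster_entities}
    return [
        dict(db, namespace=namespaces.get(db.get('postgresql_cluster'), ''))
        for db in database_entities
    ]


# Illustrative data only.
clusters = [{'id': 'pg-orders', 'type': 'postgresql_cluster', 'namespace': 'shop'}]
databases = [{'id': 'orders-db', 'type': 'postgresql_database', 'postgresql_cluster': 'pg-orders'}]

enriched = add_namespace_to_databases(clusters, databases)
```

A database whose cluster is unknown gets an empty namespace here rather than raising an error; whether that is the right default is a design choice left open.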

Remove protected fields warning

The warning about skipping protected fields can get really spammy in the logs, and the change has already been released for a while. We can remove the warning now.

Create Entities for K8S Custom Resources

If one wants to write a check e.g. to monitor etcd clusters deployed with etcd-operator, it would be handy to have the actual custom resources as entities in ZMON.

In the case of etcd-operator, the K8S entity even exposes some health metrics that could be very useful in a check, e.g.:

status:
  clientPort: 2379
  conditions:
  - lastTransitionTime: 2018-12-03T09:38:46Z
    lastUpdateTime: 2018-12-03T09:38:46Z
    reason: Cluster available
    status: "True"
    type: Available
  currentVersion: 3.3.10
  members:
    ready:
    - etcd-cluster-2z4b5scjgp
    - etcd-cluster-4qltzm25xr
    - etcd-cluster-r6nw5mc9hg
    - etcd-cluster-szgbpl7zdk
    - etcd-cluster-wktt76nszx
  phase: Running
  serviceName: etcd-cluster-client
  size: 5
  targetVersion: ""

Kubernetes Ingress entities

Kubernetes Ingress resources should be added as ZMON entities. The exact structure needs to be discussed as the Ingress resource can have host and path rules.

Agent crashes if statefulset doesn't have a version label

The agent should be able to handle statefulsets without a version label.

2018-02-22 16:04:18,537 ERROR: ZMON agent failed. Retrying after 60 seconds ...
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/zmon_agent-0.1-py3.6.egg/zmon_agent/main.py", line 104, in sync
    all_current_entities = discovery.get_entities() + [account_entity]
  File "/usr/local/lib/python3.6/dist-packages/zmon_agent-0.1-py3.6.egg/zmon_agent/discovery/kubernetes/cluster.py", line 134, in get_entities
    postgresql_cluster_member_entities, postgresql_database_entities))
  File "/usr/local/lib/python3.6/dist-packages/zmon_agent-0.1-py3.6.egg/zmon_agent/discovery/kubernetes/cluster.py", line 454, in get_cluster_statefulsets
    'version': obj['metadata']['labels']['version']
KeyError: 'version'

Introduced in #53
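
A minimal sketch of a defensive label lookup, assuming the statefulset is available as a plain dictionary shaped like the Kubernetes API object. The helper name `get_label` and the empty-string default are illustrative assumptions, not the agent's actual code.

```python
def get_label(obj, name, default=''):
    """Return a metadata label, or `default` when the labels or the key are missing."""
    # 'labels' may be absent or None in the API object, so fall back to {}.
    labels = (obj.get('metadata') or {}).get('labels') or {}
    return labels.get(name, default)


# Hypothetical statefulset objects: with a version label, without one, and without labels.
with_version = {'metadata': {'name': 'db', 'labels': {'version': 'v1'}}}
without_version = {'metadata': {'name': 'cache', 'labels': {'application': 'cache'}}}
no_labels = {'metadata': {'name': 'bare'}}

versions = [get_label(o, 'version') for o in (with_version, without_version, no_labels)]
```

Using `get_label(obj, 'version')` in place of `obj['metadata']['labels']['version']` would avoid the `KeyError` shown in the traceback.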

Add kube_pvc entities.

Users want to monitor the disk usage of the PVCs they provision. It would therefore be very handy to have a kube_pvc entity that can collect this information, making it simple to create checks/alerts per PVC.

PVCs by themselves do not have any information about the EBS volume actually backing them; this is only available in the corresponding Persistent Volume bound to the claim. It would therefore be convenient to also expose the volume ID in the kube_pvc entity: you could then get CloudWatch metrics for the EBS volume and filter on the PVC name, which is human-readable, rather than the PV name, which is a UUID.
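
A rough sketch of how such an entity could be assembled, assuming the PVC and PV are plain dictionaries shaped like the Kubernetes API objects and that the volume uses the in-tree `awsElasticBlockStore` source. The entity field names (`volume_name`, `volume_id`) are hypothetical.

```python
def pvc_entity(pvc, pvs_by_name):
    """Build a kube_pvc-style entity, joining the claim with its bound PV."""
    meta = pvc['metadata']
    # spec.volumeName holds the name of the PV bound to this claim (empty if unbound).
    volume_name = (pvc.get('spec') or {}).get('volumeName', '')
    pv = pvs_by_name.get(volume_name, {})
    ebs = (pv.get('spec') or {}).get('awsElasticBlockStore') or {}
    return {
        'type': 'kube_pvc',
        'name': meta['name'],
        'namespace': meta.get('namespace', ''),
        'volume_name': volume_name,
        'volume_id': ebs.get('volumeID', ''),  # empty when the PV is not EBS-backed
    }


# Illustrative objects only.
pvc = {'metadata': {'name': 'data-db-0', 'namespace': 'shop'},
       'spec': {'volumeName': 'pvc-1234'}}
pvs = {'pvc-1234': {'spec': {'awsElasticBlockStore': {'volumeID': 'aws://eu-central-1a/vol-0abc'}}}}

entity = pvc_entity(pvc, pvs)
```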

ZMON should detect old pod entities and clean them up.

We often see ZMON running checks on old pods which have been deleted for several hours.

It seems like the agent doesn't clean up the old entities as it should.

For now we can work around the problem by clicking "cleanup" in the ZMON UI, but it's a bit inconvenient :)
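
The expected cleanup can be pictured as a set difference between what is stored and what was just discovered. This is only a sketch under the assumption that entities are identified by a unique string ID; the function name and the example IDs are hypothetical.

```python
def stale_entity_ids(existing_ids, current_ids):
    """Return IDs that are still stored but were not seen in the latest discovery."""
    return sorted(set(existing_ids) - set(current_ids))


# Illustrative IDs: one pod has been deleted since the last sync.
stored = ['pod-a[cluster-1]', 'pod-b[cluster-1]', 'pod-c[cluster-1]']
discovered = ['pod-a[cluster-1]', 'pod-c[cluster-1]']

to_delete = stale_entity_ids(stored, discovered)
```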

Define a scope for ZMON agent

ZMON agent is currently used to discover Kubernetes resources and Postgresql resources running on Kubernetes. Eventually zmon-aws-agent should be part of this unified agent as well.

The problem with mixing infrastructure discovery with applications/DB discovery is that it leads to:

  • Slower development and release cycles.
  • Bugs and issues that are not fully related to ZMON.

The idea is to define the following:

  • ZMON discovery scope is mainly related to infrastructure discovery.
  • ZMON agent discovery could be extended (one way or another) by external users.
  • ZMON agent should make infrastructure resources available to all extensible solutions.

Add cluster alias entity to kubernetes zmon-agent.

We introduced an alias for the clusters in the cluster registry e.g. stups or teapot, so it's easier to identify what cluster we are talking about.

It would be helpful if the zmon-agent could add this field to the cluster entities so we can easily display the alias value in our zmon alerts.

If the agent supported an environment variable such as ZMON_AGENT_KUBERNETES_CLUSTER_ALIAS (see https://github.com/zalando-incubator/kubernetes-on-aws/blob/dev/cluster/manifests/zmon-agent/deployment.yaml#L46), we could provide the alias value when provisioning a cluster.
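
A small sketch of reading the proposed variable and attaching it to an entity. The entity is assumed to be a plain dictionary, and the `alias` field name and the helper name are assumptions made for illustration.

```python
import os


def add_cluster_alias(entity, environ=os.environ):
    """Return a copy of `entity` with the cluster alias set, if one is configured."""
    alias = environ.get('ZMON_AGENT_KUBERNETES_CLUSTER_ALIAS')
    enriched = dict(entity)
    if alias:
        enriched['alias'] = alias  # hypothetical field name
    return enriched


# A dict stands in for the process environment to keep the example self-contained.
cluster = {'id': 'kube-cluster-1', 'type': 'local'}
aliased = add_cluster_alias(cluster, {'ZMON_AGENT_KUBERNETES_CLUSTER_ALIAS': 'teapot'})
unaliased = add_cluster_alias(cluster, {})
```

Leaving the field out when the variable is unset keeps existing entities unchanged for clusters that have no alias.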

Add Namespace to pod entities

Description

Since all pods have a namespace, it would be nice to index/pass the namespace also for extra filtering.

Entities should be scraped more resiliently

There have been multiple cases where scraping failed with an exception for one of the entities, completely stopping the agent from working. While this should not happen in general, the agent should be much more robust. One option would be to process different entity types independently from each other; another would be to skip entities on errors and just report them to monitoring so the underlying problem can be fixed.
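
The first option could look roughly like the sketch below, which assumes each entity type has its own discovery callable returning a list of entity dictionaries. The function name, the `discoverers` mapping and the example callables are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def collect_entities(discoverers):
    """Run each discovery callable independently; return (entities, failed_types)."""
    entities, failed = [], []
    for entity_type, discover in discoverers.items():
        try:
            entities.extend(discover())
        except Exception:
            # Log with traceback and continue with the remaining entity types.
            logger.exception('Discovery failed for %s, skipping', entity_type)
            failed.append(entity_type)
    return entities, failed


def discover_pods():
    return [{'id': 'pod-a', 'type': 'kube_pod'}]


def discover_statefulsets():
    raise KeyError('version')  # simulates the kind of failure reported in other issues


entities, failed = collect_entities({
    'kube_pod': discover_pods,
    'kube_statefulset': discover_statefulsets,
})
```

Returning the failed types lets the caller avoid deleting existing entities of a type whose discovery failed, since an empty result in that case does not mean the resources are gone.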

Handle image not defined in Statefulsets

Currently it's possible to define a statefulset without an image defined for the podSpec template (upstream issue: kubernetes/kubernetes#47233)

If there is such a statefulset in the cluster, then zmon-agent will fail:

    all_current_entities = discovery.get_entities() + [account_entity]
  File "/usr/local/lib/python3.6/dist-packages/zmon_agent-0.1-py3.6.egg/zmon_agent/discovery/kubernetes/cluster.py", line 105, in get_entities
    namespace=self.namespace)
  File "/usr/local/lib/python3.6/dist-packages/zmon_agent-0.1-py3.6.egg/zmon_agent/discovery/kubernetes/cluster.py", line 400, in get_cluster_statefulsets
    'containers': {c['name']: c['image'] for c in containers},
  File "/usr/local/lib/python3.6/dist-packages/zmon_agent-0.1-py3.6.egg/zmon_agent/discovery/kubernetes/cluster.py", line 400, in <dictcomp>
    'containers': {c['name']: c['image'] for c in containers},
KeyError: 'image'

which results in not updating any entities.

The agent should be able to handle this case.
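
One possible way to tolerate the missing key, sketched under the assumption that `containers` is the list of container dictionaries from the pod template. The helper name and the empty-string placeholder are illustrative choices.

```python
def container_images(containers):
    """Map container name to image, using '' when a container has no image set."""
    return {c['name']: c.get('image', '') for c in containers}


# Illustrative containers: the second one has no image defined.
containers = [
    {'name': 'app', 'image': 'registry.example.org/app:1.0'},
    {'name': 'sidecar'},
]

images = container_images(containers)
```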

Allow non-ready pods to be sync'd

Currently, the agent ignores (and in turn removes) pod entities in a non-ready state. The non-ready state can be useful in monitoring.
