
zookeeper-monitoring's Introduction

Tools and Recipes for ZooKeeper Monitoring
------------------------------------------

UPDATE: This repository has been committed [1] to the ZooKeeper trunk as a contrib. You can find it under src/contrib/monitoring. Please use the ZooKeeper JIRA [2] to submit issues and feature requests. It's going to be part of the upcoming 3.4.0 release. Thanks.

[1] https://issues.apache.org/jira/browse/ZOOKEEPER-799
[2] https://issues.apache.org/jira/browse/ZOOKEEPER

How To Monitor
--------------

A ZooKeeper cluster can be monitored in two ways:
 1. by using the 'mntr' 4letterword command
 2. by using JMX to query the MBeans 

This repo contains tools and recipes for monitoring ZooKeeper using the first method. 

Check the file JMX-RESOURCE for some links to resources that could help you monitor a ZooKeeper cluster using the JMX interface. 

Requirements
------------

ZooKeeper 3.4.0 or later, or the latest 3.3.x release with the ZOOKEEPER-744 patch applied.
The server must understand the 'mntr' 4letterword command.

$ echo 'mntr' | nc localhost 2181
zk_version  3.4.0--1, built on 06/19/2010 15:07 GMT
zk_avg_latency  141
zk_max_latency  1788
zk_min_latency  0
zk_packets_received 385466
zk_packets_sent 435364
zk_outstanding_requests 0
zk_server_state follower
zk_znode_count  5
zk_watch_count  0
zk_ephemerals_count 0
zk_approximate_data_size    41
zk_open_file_descriptor_count   20
zk_max_file_descriptor_count    1024

Python 2.6 (it may work on earlier versions, but that has not been tested yet).
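
The 'mntr' reply shown above is a list of key/value lines separated by whitespace, which makes it easy to turn into a dictionary. The following is a minimal sketch of that parsing step, not the code used by check_zookeeper.py. The function name parse_mntr is hypothetical, and the sketch is written for Python 3 rather than the Python 2.6 listed above.

```python
def parse_mntr(text):
    """Turn 'key<whitespace>value' lines into a dict.

    Values made only of digits (optionally with a leading minus sign)
    become ints; everything else, such as zk_version, stays a string.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        key, value = parts
        value = value.strip()
        if value.lstrip("-").isdigit():
            stats[key] = int(value)
        else:
            stats[key] = value
    return stats
```

With the sample reply above, parse_mntr would give an integer for zk_avg_latency and the string "follower" for zk_server_state.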

In a nutshell
-------------

All you need is check_zookeeper.py. It has no external dependencies.


*** On Nagios call the script like this:

./check_zookeeper.py -o nagios -s "<server-or-list-of-servers>" -k <key> -w <warning> -c <critical>


*** On Cacti define a custom data input method using the script like this:

./check_zookeeper.py -o cacti -s "<list-of-servers>" -k <key> --leader

-- outputs a single value for the given key fetched from the cluster leader

OR 

./check_zookeeper.py -o cacti -s "<list-of-servers>" -k <key> 

-- outputs multiple values, one for each cluster node
ex: localhost_2182:0  localhost_2183:0  localhost_2181:0  localhost_2184:0  localhost_2185:0
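
The per-node output above is a sequence of host_port:value fields. A minimal sketch of how such a line could be built is shown below. The function name format_cacti and the (host, port) dictionary keys are assumptions made for illustration, and the real script may order or separate the fields differently.

```python
def format_cacti(values_by_server):
    """Format {(host, port): value} as 'host_port:value' fields.

    Fields are sorted so the output is stable between runs; this
    ordering is a choice made for the sketch.
    """
    fields = []
    for (host, port), value in sorted(values_by_server.items()):
        fields.append("%s_%s:%s" % (host, port, value))
    return "  ".join(fields)
```

For example, format_cacti({("localhost", 2181): 0, ("localhost", 2182): 0}) produces two fields in the same shape as the example line.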

*** On Ganglia:

install the plugin found in the ganglia/ subfolder OR

./check_zookeeper.py -o ganglia -s "<current-zookeeper-node>"

it will use gmetric to send zookeeper node status data.


Check the subfolders for configuration details and samples for each platform.

ZooKeeper 4letterwords Commands
-------------------------------

http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands

zookeeper-monitoring's People

Contributors

andreisavu


zookeeper-monitoring's Issues

There is a bug in nagios module when using 'stat' keyword

in _send_cmd method (check_zookeeper.py line 169)

-- data = s.recv(2048)

++ data = ''
++ while True:
++     pack = s.recv(2048)
++     if not pack:  # an empty read means the server closed the connection
++         break
++     data += pack

I'm not very familiar with Python, so feel free to rewrite this for better readability or performance.

Thank you for your plugin~
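
The fix proposed in this issue reads until the server closes the connection instead of trusting a single recv call, which matters for replies longer than one buffer. Below is a self-contained sketch of the same idea for Python 3. The names recv_all and send_cmd are hypothetical and are not claimed to match the script's _send_cmd method.

```python
import socket


def recv_all(sock, bufsize=2048):
    """Read from sock until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:  # empty bytes object: the peer has closed its side
            break
        chunks.append(chunk)
    return b"".join(chunks)


def send_cmd(host, port, cmd, timeout=5.0):
    """Send a four-letter-word command (bytes) and return the reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd)
        return recv_all(sock).decode("utf-8", errors="replace")
```

Collecting chunks in a list and joining once avoids repeated string concatenation. A call such as send_cmd("localhost", 2181, b"stat") assumes a ZooKeeper server is listening on that address.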

support zk versions prior to 3.4

Any way you could have something (a script?) that converts "stat" output (etc..., and provides defaults for metrics not available prior to 3.4) for users running older versions of ZK? It would really enable more people to try it out, granted with reduced metric coverage.
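
One way to approach this request is a small converter that maps 'stat' fields onto the 'mntr' key names and fills the rest with a default. The sketch below assumes the 'stat' layout typically printed by 3.3.x servers ("Latency min/avg/max: a/b/c", "Received: n", "Mode: x", and so on); that layout, the function name stat_to_mntr, and the list of keys treated as missing are all assumptions and should be checked against a real server.

```python
import re

# 'mntr' keys assumed here to have no counterpart in 'stat' output.
MISSING_KEYS = (
    "zk_watch_count",
    "zk_ephemerals_count",
    "zk_approximate_data_size",
    "zk_open_file_descriptor_count",
    "zk_max_file_descriptor_count",
)

# 'stat' field name -> 'mntr' key (assumed mapping).
SIMPLE_FIELDS = {
    "Zookeeper version": "zk_version",
    "Received": "zk_packets_received",
    "Sent": "zk_packets_sent",
    "Outstanding": "zk_outstanding_requests",
    "Mode": "zk_server_state",
    "Node count": "zk_znode_count",
}


def stat_to_mntr(text, default=0):
    """Convert 'stat' output into a dict keyed like 'mntr' output."""
    stats = dict.fromkeys(MISSING_KEYS, default)
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no "name: value" structure on this line
        value = value.strip()
        if name == "Latency min/avg/max":
            match = re.match(r"(\d+)/(\d+)/(\d+)$", value)
            if match:
                low, avg, high = (int(g) for g in match.groups())
                stats["zk_min_latency"] = low
                stats["zk_avg_latency"] = avg
                stats["zk_max_latency"] = high
        elif name in SIMPLE_FIELDS:
            key = SIMPLE_FIELDS[name]
            stats[key] = int(value) if value.isdigit() else value
    return stats
```

Lines that do not match a known field, such as the per-client connection entries, are ignored.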

Fault (trace) when run it.

Hello.
When I run the script I get this error:

./check_zookeeper.py -o nagios -s localhost
Traceback (most recent call last):
  File "./check_zookeeper.py", line 337, in <module>
    sys.exit(main())
  File "./check_zookeeper.py", line 255, in main
    cluster_stats = get_cluster_stats(opts.servers)
  File "./check_zookeeper.py", line 290, in get_cluster_stats
    for host, port in servers:
ValueError: need more than 1 value to unpack

But mntr works fine:

echo 'mntr' | nc localhost 2181

zk_version 3.4.11-37e277162d567b55a07d1755f0b31c32e93c01a0, built on 11/01/2017 18:06 GMT
zk_avg_latency 0
zk_max_latency 316
zk_min_latency 0
zk_packets_received 171365
zk_packets_sent 171364
zk_num_alive_connections 6
zk_outstanding_requests 0
zk_server_state standalone
zk_znode_count 18
zk_watch_count 2
zk_ephemerals_count 3
zk_approximate_data_size 633
zk_open_file_descriptor_count 33
zk_max_file_descriptor_count 4096
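
The traceback above ends in "for host, port in servers", which suggests that a server given without a port (here "-s localhost") cannot be unpacked into a host and port pair. A tolerant parser could fall back to a default port. The sketch below assumes a comma-separated list of host:port entries and uses 2181, ZooKeeper's conventional client port, as the default; parse_servers is a hypothetical name, not the script's own function.

```python
DEFAULT_PORT = 2181  # ZooKeeper's conventional client port


def parse_servers(spec, default_port=DEFAULT_PORT):
    """Parse 'host1:port1,host2' into a list of (host, port) tuples."""
    servers = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        host, sep, port = item.rpartition(":")
        if not sep:
            # No colon found: rpartition leaves the whole text in the last
            # slot, so treat it as the host and use the default port.
            host, port = item, default_port
        servers.append((host, int(port)))
    return servers
```

With this, "localhost" becomes ("localhost", 2181), while explicit entries such as "localhost:2182" keep their port. Until the script handles this case, passing the port explicitly (-s localhost:2181) is the likely workaround.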

unreachable should be critical

Shouldn't a missing key (e.g. because the host/service is down) result in a critical error?

*** check_zookeeper.py.orig 2013-02-15 15:31:48.417663551 +0100
--- check_zookeeper.py  2013-02-15 15:23:33.100174871 +0100
***************
*** 47,55 ****
--- 47,57 ----
              return 2
  
          warning_state, critical_state, values = [], [], []
+   key_found = (opts.key is None)
          for host, stats in cluster_stats.items():
              if opts.key in stats:
  
+       key_found = True
                  value = stats[opts.key]
                  values.append('%s=%s;%s;%s' % (host, value, warning, critical))
  
***************
*** 60,65 ****
--- 62,72 ----
                      critical_state.append(host)
  
          values = ' '.join(values)
+ 
+   if not key_found:
+             print 'Critical "%s" %s!|%s' % (opts.key, ', '.join(critical_state), values)
+             return 2
+ 
          if critical_state:
              print 'Critical "%s" %s!|%s' % (opts.key, ', '.join(critical_state), values)
              return 2
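
The idea behind the patch can also be shown as a standalone function. The sketch below is written for Python 3 and is not the script's actual code: the name evaluate is hypothetical, the thresholds are assumed to mean "higher is worse", and the message for the missing-key case is invented for illustration. The exit codes 0, 1 and 2 are the standard Nagios plugin codes for OK, WARNING and CRITICAL.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def evaluate(cluster_stats, key, warning, critical):
    """Return (exit_code, message) for one key across all servers.

    cluster_stats maps a host label to that server's stats dict. An
    unreachable server is assumed to contribute an empty dict or no
    entry at all.
    """
    warning_hosts, critical_hosts, perfdata = [], [], []
    key_found = False
    for host, stats in sorted(cluster_stats.items()):
        if key not in stats:
            continue
        key_found = True
        value = stats[key]
        perfdata.append("%s=%s;%s;%s" % (host, value, warning, critical))
        if value >= critical:
            critical_hosts.append(host)
        elif value >= warning:
            warning_hosts.append(host)
    perf = " ".join(perfdata)
    if not key_found:
        # No server reported the key: treat it as a hard failure.
        return CRITICAL, 'Critical "%s" not reported by any server!|%s' % (key, perf)
    if critical_hosts:
        return CRITICAL, 'Critical "%s" %s!|%s' % (key, ", ".join(critical_hosts), perf)
    if warning_hosts:
        return WARNING, 'Warning "%s" %s!|%s' % (key, ", ".join(warning_hosts), perf)
    return OK, 'Ok "%s"!|%s' % (key, perf)
```

A caller would print the message and pass the code to sys.exit, so that a check against a single unreachable localhost server ends in CRITICAL rather than OK.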

No error when service not reachable

Hi,

thanks for sharing that test helper.

I guess it makes some sense not to fail hard when one server can't be reached. But at the machine level I think that should be a hard fault.
How can I make the nagios helper report critical if the service is not reachable (testing only localhost on every machine in that case)?
Is there a trick to do that, or should I test differently?

Also, I'm no ZooKeeper expert. Do any of the metrics measured in https://github.com/andreisavu/zookeeper-monitoring/blob/master/nagios/services.cfg monitor something as generic as 'cluster status good/bad'?
