andreisavu / zookeeper-monitoring
Tools and Recipes for Monitoring Apache Zookeeper
Home Page: https://zookeeper.apache.org/
License: Apache License 2.0
Tools and Recipes for ZooKeeper Monitoring
------------------------------------------

UPDATE: This repository has been committed [1] to the ZooKeeper trunk as a
contrib. You can find it under src/contrib/monitoring. Please use the
ZooKeeper JIRA [2] to submit issues and feature requests. It's going to be
part of the upcoming 3.4.0 release. Thanks.

[1] https://issues.apache.org/jira/browse/ZOOKEEPER-799
[2] https://issues.apache.org/jira/browse/ZOOKEEPER

How To Monitor
--------------

A ZooKeeper cluster can be monitored in two ways:

  1. by using the 'mntr' 4letterword command
  2. by using JMX to query the MBeans

This repo contains tools and recipes for monitoring ZooKeeper using the first
method. Check the file JMX-RESOURCE for some links to resources that could
help you monitor a ZooKeeper cluster using the JMX interface.

Requirements
------------

ZooKeeper 3.4.0 or later, or the latest 3.3.x release with the ZOOKEEPER-744
patch applied. The server should understand the 'mntr' 4letterword command:

  $ echo 'mntr' | nc localhost 2181
  zk_version 3.4.0--1, built on 06/19/2010 15:07 GMT
  zk_avg_latency 141
  zk_max_latency 1788
  zk_min_latency 0
  zk_packets_received 385466
  zk_packets_sent 435364
  zk_outstanding_requests 0
  zk_server_state follower
  zk_znode_count 5
  zk_watch_count 0
  zk_ephemerals_count 0
  zk_approximate_data_size 41
  zk_open_file_descriptor_count 20
  zk_max_file_descriptor_count 1024

Python 2.6 (it may work on earlier versions, but that has not been tested yet).

In a nutshell
-------------

All you need is check_zookeeper.py. It has no external dependencies.

*** On Nagios call the script like this:

  ./check_zookeeper.py -o nagios -s "<server-or-list-of-servers>" -k <key> -w <warning> -c <critical>

*** On Cacti define a custom data input method using the script like this:

  ./check_zookeeper.py -o cacti -s "<list-of-servers>" -k <key> --leader

  -- outputs a single value for the given key fetched from the cluster leader

OR

  ./check_zookeeper.py -o cacti -s "<list-of-servers>" -k <key>

  -- outputs multiple values, one for each cluster node
  ex: localhost_2182:0 localhost_2183:0 localhost_2181:0 localhost_2184:0 localhost_2185:0

*** On Ganglia:

  install the plugin found in the ganglia/ subfolder

OR

  ./check_zookeeper.py -o ganglia -s "<current-zookeeper-node>"

  it will use gmetric to send zookeeper node status data.

Check the subfolders for configuration details and samples for each platform.

ZooKeeper 4letterwords Commands
-------------------------------

http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands
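
For readers who want to consume the 'mntr' output from their own code, here is a minimal sketch of turning that text into a dictionary. It is not taken from check_zookeeper.py; the function name parse_mntr and the SAMPLE string are illustrative assumptions, and the sketch is written for Python 3 rather than the Python 2.6 the script targets.

```python
def parse_mntr(text):
    """Turn raw 'mntr' output into a dict; numeric values become ints.

    Hypothetical helper for illustration, not part of check_zookeeper.py.
    """
    stats = {}
    for line in text.splitlines():
        # Each line is "<key><whitespace><value>"; skip anything else.
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts
        try:
            stats[key] = int(value)
        except ValueError:
            # Non-numeric values such as zk_version or zk_server_state.
            stats[key] = value
    return stats


# A few lines copied from the sample output in the Requirements section.
SAMPLE = (
    "zk_version\t3.4.0--1, built on 06/19/2010 15:07 GMT\n"
    "zk_avg_latency\t141\n"
    "zk_server_state\tfollower\n"
    "zk_znode_count\t5\n"
)

if __name__ == "__main__":
    for key, value in sorted(parse_mntr(SAMPLE).items()):
        print(key, value)
```

Splitting on the first run of whitespace keeps multi-word values such as the version string intact.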
In the _send_cmd method (check_zookeeper.py line 169):

++ data = ''
++ while True:
++     pack = s.recv(2048)
++     if len(pack) == 0:
++         break
++     data += pack

I'm not very familiar with Python, so feel free to rewrite this for better readability or performance.
Thank you for your plugin~
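
The idea behind the suggested change, reading until the server closes the connection instead of relying on a single recv call, can be sketched as follows. This is an illustration written for Python 3, not the code in check_zookeeper.py; recv_all and demo are hypothetical names, and the demo uses a local socket pair in place of a ZooKeeper server.

```python
import socket


def recv_all(sock, bufsize=2048):
    """Read from sock until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:  # an empty result means the peer closed the socket
            break
        chunks.append(chunk)
    # Joining once avoids repeated string concatenation in the loop.
    return b"".join(chunks)


def demo():
    """Send more than one buffer's worth of data through a socket pair."""
    writer, reader = socket.socketpair()
    try:
        writer.sendall(b"zk_avg_latency\t0\n" * 200)  # 3400 bytes > 2048
        writer.close()  # signals end of data, like the server closing
        return recv_all(reader)
    finally:
        reader.close()


if __name__ == "__main__":
    print(len(demo()))
```

Collecting chunks in a list and joining at the end is a common alternative to `data += pack` when responses can grow large.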
Any way you could have something (a script?) that converts "stat" output (etc..., and provides defaults for metrics not available prior to 3.4) for users running older versions of ZK? It would really enable more people to try it out, granted with reduced metric coverage.
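
One possible shape for such a converter is sketched below. It assumes the 'stat' output layout of 3.3-era servers (lines such as "Latency min/avg/max: 0/2/15" and "Mode: standalone"); the sample text is made up for illustration, and stat_to_mntr, SIMPLE_FIELDS, and UNAVAILABLE are hypothetical names, not part of this repository. Metrics that 'stat' does not report are filled with a caller-supplied default. The sketch is written for Python 3.

```python
# Assumed mapping from 'stat' labels to 'mntr' key names.
SIMPLE_FIELDS = {
    "Zookeeper version": ("zk_version", str),
    "Received": ("zk_packets_received", int),
    "Sent": ("zk_packets_sent", int),
    "Outstanding": ("zk_outstanding_requests", int),
    "Mode": ("zk_server_state", str),
    "Node count": ("zk_znode_count", int),
}

# 'mntr' keys with no counterpart in 'stat' output (assumption).
UNAVAILABLE = (
    "zk_watch_count",
    "zk_ephemerals_count",
    "zk_approximate_data_size",
    "zk_open_file_descriptor_count",
    "zk_max_file_descriptor_count",
)


def stat_to_mntr(text, default=None):
    """Convert 'stat' output into a dict keyed like 'mntr' output."""
    stats = dict((key, default) for key in UNAVAILABLE)
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        if label == "Latency min/avg/max":
            low, avg, high = (int(v) for v in value.split("/"))
            stats["zk_min_latency"] = low
            stats["zk_avg_latency"] = avg
            stats["zk_max_latency"] = high
        elif label in SIMPLE_FIELDS:
            name, convert = SIMPLE_FIELDS[label]
            stats[name] = convert(value)
    return stats


# Illustrative sample, not captured from a real server.
SAMPLE_STAT = """Zookeeper version: 3.3.1-942149, built on 05/07/2010 17:14 GMT
Clients:
 /127.0.0.1:55829[1](queued=0,recved=1,sent=0)

Latency min/avg/max: 0/2/15
Received: 120
Sent: 119
Outstanding: 0
Zxid: 0x1b
Mode: standalone
Node count: 4
"""

if __name__ == "__main__":
    for key, value in sorted(stat_to_mntr(SAMPLE_STAT).items()):
        print(key, value)
```

Using None as the default makes it visible downstream that a metric was unavailable rather than zero.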
Hello.
When I run the script I get this error:
./check_zookeeper.py -o nagios -s localhost
Traceback (most recent call last):
  File "./check_zookeeper.py", line 337, in <module>
    sys.exit(main())
  File "./check_zookeeper.py", line 255, in main
    cluster_stats = get_cluster_stats(opts.servers)
  File "./check_zookeeper.py", line 290, in get_cluster_stats
    for host, port in servers:
ValueError: need more than 1 value to unpack
But mntr works fine:
zk_version 3.4.11-37e277162d567b55a07d1755f0b31c32e93c01a0, built on 11/01/2017 18:06 GMT
zk_avg_latency 0
zk_max_latency 316
zk_min_latency 0
zk_packets_received 171365
zk_packets_sent 171364
zk_num_alive_connections 6
zk_outstanding_requests 0
zk_server_state standalone
zk_znode_count 18
zk_watch_count 2
zk_ephemerals_count 3
zk_approximate_data_size 633
zk_open_file_descriptor_count 33
zk_max_file_descriptor_count 4096
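
The traceback is consistent with a server entry that has no port: unpacking into host and port fails when the entry is just "localhost". Passing "localhost:2181" would likely avoid it. Below is a minimal sketch of a more forgiving parser, assuming the -s option takes comma-separated host:port entries; parse_servers and DEFAULT_PORT are hypothetical names, not the script's own code, and the sketch is written for Python 3.

```python
DEFAULT_PORT = 2181  # ZooKeeper's usual client port, used when none is given


def parse_servers(spec, default_port=DEFAULT_PORT):
    """Parse 'host1:port1,host2' into a list of (host, port) tuples."""
    servers = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.partition(":")
        # Fall back to the default port when the entry has no port part.
        servers.append((host, int(port) if port else default_port))
    return servers


if __name__ == "__main__":
    print(parse_servers("localhost"))
    print(parse_servers("localhost:2182, localhost:2183"))
```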
Shouldn't a missing key (e.g. because the host/service is down) result in a critical error?
*** check_zookeeper.py.orig 2013-02-15 15:31:48.417663551 +0100
--- check_zookeeper.py 2013-02-15 15:23:33.100174871 +0100
***************
*** 47,55 ****
--- 47,57 ----
              return 2

          warning_state, critical_state, values = [], [], []
+         key_found = (opts.key is None)
          for host, stats in cluster_stats.items():
              if opts.key in stats:
+                 key_found = True

                  value = stats[opts.key]
                  values.append('%s=%s;%s;%s' % (host, value, warning, critical))

***************
*** 60,65 ****
--- 62,72 ----
                      critical_state.append(host)

          values = ' '.join(values)
+
+         if not key_found:
+             print 'Critical "%s" %s!|%s' % (opts.key, ', '.join(critical_state), values)
+             return 2
+
          if critical_state:
              print 'Critical "%s" %s!|%s' % (opts.key, ', '.join(critical_state), values)
              return 2
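
The intent of the patch can also be expressed as a small standalone function. The sketch below is a stricter variant than the patch: it reports critical when any host lacks the key, not only when every host does. It assumes higher values are worse (warning below critical) and that cluster_stats maps host names to stat dictionaries; nagios_status and the constants are hypothetical names, and the code is Python 3 rather than the script's Python 2.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def nagios_status(cluster_stats, key, warning, critical):
    """Return (exit_code, hosts_responsible) for one metric.

    Assumes higher values are worse, i.e. warning < critical.
    """
    if not cluster_stats:
        return CRITICAL, []  # no server answered at all
    missing = sorted(h for h, s in cluster_stats.items() if key not in s)
    if missing:
        return CRITICAL, missing  # key absent, e.g. host or service down
    over_critical = sorted(
        h for h, s in cluster_stats.items() if s[key] >= critical)
    if over_critical:
        return CRITICAL, over_critical
    over_warning = sorted(
        h for h, s in cluster_stats.items() if s[key] >= warning)
    if over_warning:
        return WARNING, over_warning
    return OK, []


if __name__ == "__main__":
    stats = {"zk1:2181": {"zk_avg_latency": 3}, "zk2:2181": {}}
    print(nagios_status(stats, "zk_avg_latency", 10, 20))
```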
Hi,
Thanks for sharing that test helper.
I guess it makes some sense not to fail hard when one server can't be reached. But at the machine level I think that should be a hard fault.
How can I make the Nagios helper report critical if the service is not reachable (only testing localhost on every machine in that case)?
Is there a trick to do that, or should I test differently?
Also, I'm no ZooKeeper expert. Do any of the metrics measured in https://github.com/andreisavu/zookeeper-monitoring/blob/master/nagios/services.cfg monitor something as generic as 'cluster status good/bad'?
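
One way to test differently is a separate per-machine reachability check built on the 'ruok' 4letterword command, which a healthy server answers with 'imok'. The sketch below is an assumption about how such a check could look, not something shipped in this repository; check_ruok is a hypothetical name and the code is Python 3. Newer ZooKeeper releases may require four-letter commands to be explicitly enabled in the server configuration.

```python
import socket
import sys


def check_ruok(host="localhost", port=2181, timeout=5.0):
    """Return (nagios_exit_code, message) for a single ZooKeeper server."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        # Connection refused, timeout, DNS failure: treat all as critical.
        return 2, "CRITICAL - cannot connect to %s:%s (%s)" % (host, port, exc)
    try:
        sock.sendall(b"ruok")
        reply = sock.recv(16)
    except OSError as exc:
        return 2, "CRITICAL - no reply from %s:%s (%s)" % (host, port, exc)
    finally:
        sock.close()
    if reply == b"imok":
        return 0, "OK - %s:%s answered imok" % (host, port)
    return 2, "CRITICAL - unexpected reply %r from %s:%s" % (reply, host, port)


if __name__ == "__main__":
    code, message = check_ruok()
    print(message)
    sys.exit(code)
```

Run on each machine against localhost, this reports critical whenever the local server cannot be reached, independently of the cluster-wide metric checks.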