mperham / inspeqtor

Monitor your application infrastructure!

License: GNU General Public License v3.0

inspeqtor's Introduction

Inspeqtor

Inspeqtor monitors your application infrastructure. It gathers and verifies key metrics from all the moving parts in your application and alerts you when something looks wrong. It understands the application deployment workflow so it won't bother you during a deploy.

What it does:

Monitor init.d-, systemd-, upstart-, runit- or launchd-managed services
Monitor process memory and CPU usage
Monitor daemon-specific metrics (e.g. redis, memcached, mysql, nginx...)
Monitor and alert based on host CPU, load, swap and disk usage
Alert or restart a process if a rule threshold is breached
Alert if a process disappears or changes PID
Signal deploy start/stop to silence alerts during deploy

What it doesn't:

Monitor or control arbitrary processes; services must be init-managed
Have any runtime dependencies at all, not even libc

If you've used monit before, Inspeqtor will look familiar. Same high-level goals but in a modern package.

Status

Inspeqtor is feature complete, reliable and (mostly?) bug-free. This repo does not see a lot of code changes because of this, not because it is unmaintained.

Installation

See the Inspeqtor wiki for complete documentation.

Requirements

Linux 3.0+. It will run on OS X. FreeBSD is untested. It uses about 5-10MB of RAM at runtime.

License

GPLv3.

Want to Help?

See the Development wiki page for details on how to get the source code and build Inspeqtor locally.

Author

Inspeqtor is written by Mike Perham of Contributed Systems. We build awesome open source-based infrastructure to help you build awesome apps.

We also develop Sidekiq and sell Sidekiq Pro, the best Ruby background job processing system.

inspeqtor's Issues

E: Unable to locate package inspeqtor

I tried the installation on Ubuntu 12.10, but the last step fails.

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=12.10
DISTRIB_CODENAME=quantal
DISTRIB_DESCRIPTION="Ubuntu 12.10"

$ curl -L https://bit.ly/InspeqtorDEB | sudo bash
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   168  100   168    0     0    345      0 --:--:-- --:--:-- --:--:--   473
  0     0    0  1654    0     0   1252      0 --:--:--  0:00:01 --:--:--     0
Detected operating system as ubuntu/quantal.
Checking for curl...
Detected curl...
Installing apt-transport-https... done.
Installing /etc/apt/sources.list.d/contribsys_inspeqtor.list...done.
Importing packagecloud gpg key... done.
Running apt-get update... done.

$ sudo apt-get install inspeqtor
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package inspeqtor

Create man page

We need to ship a man page. This looks promising:

https://github.com/sunaku/md2man

inspeqtorctl help

inspeqtorctl should support -h or --help. Right now --help hangs for some reason.

inspeqtorctl should not require sudo

inspeqtorctl requires sudo because /var/run/inspeqtor.sock is 644. We need write access to the socket to send inspeqtor commands.

Monitoring redis-specific metrics causes a runtime error

Attempting to configure the redis daemon support with essentially the following config:

check service redis-6379 with hostname localhost, port 6379, password REDACTED
  if memory:rss > 2g then alert
  if redis:connected_clients > 1000 then alert

The logs indicate that the init.d/redis-6379 job was found:

I 2014-11-05T15:34:28.177893Z 18654 Activating redis-specific metrics
I 2014-11-05T15:34:28.178311Z 18654 Found init.d/redis-6379 with status <nil>

That yields a lengthy backtrace:

My assumption is that this stems from the status <nil> segment above. Of course running sudo service redis status just complains that it only handles start or stop.

Will inspeqtor monitor redis if it is controlled by upstart instead of init.d? It seems to me that there are some things I can do to prevent the issue, which I'd like to figure out and give back to the wiki. Separately, a more graceful exit or error from inspeqtor would be nice in this situation.

Support expvars

expvars are a really neat, native way to pull runtime metrics out of Golang daemons.

http://golang.org/pkg/expvar/

Something like this:

check service godaemon with port XXX
  if go:page_hits > 100 then alert
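
For context, here is a minimal sketch of both sides of such a check: a Go daemon exporting a counter through expvar, and a monitor reading it back over HTTP. The counter name page_hits comes from the rule above; scrape, demo and the throwaway test server are hypothetical stand-ins, not Inspeqtor code.

```go
package main

import (
	"encoding/json"
	"expvar"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// pageHits is a hypothetical counter. Importing expvar registers a handler
// at /debug/vars on http.DefaultServeMux that publishes it as JSON.
var pageHits = expvar.NewInt("page_hits")

// scrape fetches /debug/vars from baseURL and returns one numeric variable,
// roughly what a monitor would do on each cycle.
func scrape(baseURL, name string) (float64, error) {
	resp, err := http.Get(baseURL + "/debug/vars")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	vars := map[string]json.RawMessage{}
	if err := json.NewDecoder(resp.Body).Decode(&vars); err != nil {
		return 0, err
	}
	raw, ok := vars[name]
	if !ok {
		return 0, fmt.Errorf("expvar %q is not exported", name)
	}
	var value float64
	err = json.Unmarshal(raw, &value)
	return value, err
}

// demo serves the default mux on a throwaway local port, records two page
// hits and reads the counter back over HTTP.
func demo() (float64, error) {
	srv := httptest.NewServer(http.DefaultServeMux)
	defer srv.Close()
	pageHits.Add(2)
	return scrape(srv.URL, "page_hits")
}

func main() {
	value, err := demo()
	fmt.Println("page_hits:", value, "err:", err)
}
```

A real daemon would serve the default mux on its own port (the XXX in the rule) instead of using httptest.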

Add support for collecting Postgres stats

It would be great to have support for collecting and monitoring stats from Postgres databases. This isn't as straightforward as MySQL, as there are a lot of metrics available and they all need to be enabled. However, I'd like to start a discussion about which stats exist, where they are documented, which ones are important, and finally which stats Inspeqtor can reasonably support.

Stat Statements: http://www.postgresql.org/docs/current/static/pgstatstatements.html

The pg_stat_statements module provides a means for tracking execution statistics of all SQL statements executed by a server.

Useful for "tracking execution statistics", making recent statements query-able by number of calls, total time, cache hit percentage, etc.

Activity Stats: http://www.postgresql.org/docs/current/static/monitoring-stats.html

PostgreSQL's statistics collector is a subsystem that supports collection and reporting of information about server activity. Presently, the collector can count accesses to tables and indexes in both disk-block and individual-row terms. It also tracks the total number of rows in each table, and information about vacuum and analyze actions for each table. It can also count calls to user-defined functions and the total time spent in each one.

Useful for tracking a lot of information. Of particular use is pg_stat_database and pg_stat_all_tables. It's difficult to prescribe exact metrics that are important to track, but those include values that I particularly care about.

Heroku Postgres: https://devcenter.heroku.com/articles/heroku-postgres-database-tuning

Heroku's pg-extras provides some very helpful and digestible stats views for introspecting on performance. Each of the commands is just a SQL query. For example, pg:blocking and table-size.

The commands exposed by pg-extras digest the output of pg_stats and pg_activity into recognizable and obviously useful reports. My personal preference would be to target the pg-extras stats.

Parse error for inspeqtor.conf comments without space

The following config triggers a parse error:

#send alerts via gmail
#  with username mike, password fuzzbucket, to_email [email protected]

Error in S12: value(5,#send), Pos(offset=995, line=43, column=7), expected one of: $ send set

While this works:

# send alerts via gmail
#   with username mike, password fuzzbucket, to_email [email protected]

Memory/GC monitoring for go daemons

The expvar package provides runtime memory and GC information. Build a new Inspeqtor Pro feature which allows Inspeqtor to monitor any local Go daemon which has exported its runtime memory data.

Here's what I have so far. This is Inspeqtor monitoring itself.

Synchronize internal versions with release tag

The currently tagged version is 0.7.0-2, but the hard-coded versions in the Makefile and inspeqtor.go are simply 0.7.0. To my knowledge there isn't a real 0.7.0 release, but that is difficult to ascertain when pulling from apt-get.

@mperham: Are you ok with me bumping up the coded version?

Inspeqtor monitors disk /home without being told to

Inspeqtor 0.6.0 with a default /etc/inspeqtor/host.inq should only monitor the root file system, ie:

  #
  # Running out of disk space will not make you look good to the boss.
  # Alert if the root partition is running out of space.
  #
  # You can add any mount point, e.g.
  #    if disk:/backups > 90% then alert
  #
  if disk:/ > 90% then alert

So imagine my surprise to find inspeqtor monitoring /home as well:

# inspeqtorctl status 
Inspeqtor 0.6.0, uptime: 12m8.809127238s, pid: 1868

Host: web02
  cpu                                 152.9%         
  cpu:iowait                          0.3%           
  cpu:steal                           0.0%           
  cpu:system                          4.5%           
! cpu:user                            102.0%          95%
  disk:/                              60.0%           90%
  disk:/home                          41.0%          
  load:1                              15.78          
  load:15                             8.25           
  load:5                              13.07          
  swap                                1.0%            20%

Obviously not a huge issue. But it should monitor what I tell it to monitor. No more, no less! 😄

Go live checklist

I'm hoping to launch Inspeqtor next week!!1!!one11! So EXCITED!

TODO

monitoring for standalone Passenger

Add GoDoc badge

Add Godoc badge for easy accessibility of docs

Support watching children (eg. unicorn workers)

A common way to monitor unicorn in production is to watch its children and kill them if they exceed a memory or CPU threshold. Example extract from a bluepill config:

app.process('unicorn') do |process|
  process.monitor_children do |child_process|
    child_process.stop_command = 'kill -QUIT {{PID}}'

    child_process.checks :mem_usage, every: 15.seconds, below: 150.megabytes, times: [3,4], fires: :stop
    child_process.checks :cpu_usage, every: 15.seconds, below: 90, times: [3,4], fires: :stop
  end
end

monitoring for sidekiq

Daily alert digest email

Someone suggested a daily alert summary email, detailing the alerts which were fired in the last day. I'm not sold on the idea, more email is the last thing I want, especially if it's not actionable. WDYT?

Use full import paths

I tried to go get github.com/mperham/inspeqtor which didn't work because there were a lot of relative imports, for example:

package github.com/mperham/inspeqtor
        imports inspeqtor/conf/global/ast: unrecognized import path "inspeqtor/conf/global/ast"
package github.com/mperham/inspeqtor
        imports inspeqtor/conf/global/lexer: unrecognized import path "inspeqtor/conf/global/lexer"
package github.com/mperham/inspeqtor
        imports inspeqtor/conf/global/parser: unrecognized import path "inspeqtor/conf/global/parser"
package github.com/mperham/inspeqtor
        imports inspeqtor/conf/inq/ast: unrecognized import path "inspeqtor/conf/inq/ast"
package github.com/mperham/inspeqtor
        imports inspeqtor/conf/inq/lexer: unrecognized import path "inspeqtor/conf/inq/lexer"
package github.com/mperham/inspeqtor
        imports inspeqtor/conf/inq/parser: unrecognized import path "inspeqtor/conf/inq/parser"
package github.com/mperham/inspeqtor
        imports inspeqtor/metrics: unrecognized import path "inspeqtor/metrics"
package github.com/mperham/inspeqtor
        imports inspeqtor/services: unrecognized import path "inspeqtor/services"
package github.com/mperham/inspeqtor
        imports inspeqtor/util: unrecognized import path "inspeqtor/util"

To get around this, I think you need to use the full import path. So, instead of importing "inspeqtor/util" import "github.com/mperham/inspeqtor/util". A bit wordier but go get-friendly.

Nginx monitoring not working on Ubuntu 14.04.1

inspeqtorctl status shows host stats, but nothing for the nginx service:

...
Service: nginx [Unknown/0]
  cpu:system                          0.0%           
  cpu:total_system                    0.0%           
  cpu:total_user                      0.0%           
  cpu:user                            0.0%           
  memory:rss                          -0.00m          100m
  nginx:Active_connections            -1.0            1
  nginx:requests                      0.0             1000

... with the following INQ. /etc/inspeqtor/conf.d/nginx.inq:

check service nginx with hostname localhost, port 80, endpoint /nginx_status
   if memory:rss > 100m then alert
   if nginx:Active_connections > 1 then alert
   if nginx:requests > 1000 then alert

... but nginx stub_status is on:

# curl localhost/nginx_status
Active connections: 2 
server accepts handled requests
 36 36 34 
Reading: 0 Writing: 2 Waiting: 0

Configuration is Ubuntu 14.04.1 with nginx 1.6.1 and inspeqtor 0.5.0.

Optionally include the hostname for statsd metrics

It would be quite helpful to prefix host or app metrics with the server's hostname. For a setup with multiple app servers it would make monitoring the health of individual servers more practical.

Statsd support is pro only, so I can't volunteer for this one.

Run 'go vet' on codebase

Add a Makefile target which runs go vet on the codebase.
Fix the issues raised.

Starting inspeqtor with a service that triggers causes Go panic

Given a service config that monitors a process that is already above a threshold, inspeqtor will crash immediately after performing the check. Changing the checked value to a higher number prevents the panic. I haven't confirmed whether this applies to other metrics or only total_rss. The stack trace is included below.

Seems to be simply because there aren't any previous values.

puma.conf

check service puma
  if memory:total_rss > 2g then alert, reload

trace

W 2014-12-01T18:00:31.279422Z 15590 puma[memory:total_rss] triggered.  Current value = 1217777664.0
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x20 pc=0x448ede]

goroutine 24 [running]:
runtime.panic(0x7bac20, 0xa298b3)
        /usr/local/Cellar/go/1.3.3/libexec/src/pkg/runtime/panic.c:279 +0xf5
github.com/mperham/inspeqtor.(*Service).Verify(0xc20807e810, 0x0, 0x0, 0x0)
        /Users/mike/src/github.com/mperham/inspeqtor/types.go:191 +0x30e
github.com/mperham/inspeqtor.(*Inspeqtor).verify(0xc208040160)
        /Users/mike/src/github.com/mperham/inspeqtor/inspeqtor.go:310 +0x202
github.com/mperham/inspeqtor.(*Inspeqtor).scanSystem(0xc208040160)
        /Users/mike/src/github.com/mperham/inspeqtor/inspeqtor.go:271 +0x35
github.com/mperham/inspeqtor.(*Inspeqtor).runLoop(0xc208040160)
        /Users/mike/src/github.com/mperham/inspeqtor/inspeqtor.go:250 +0x115
created by github.com/mperham/inspeqtor.(*Inspeqtor).Start
        /Users/mike/src/github.com/mperham/inspeqtor/inspeqtor.go:82 +0x1dc

goroutine 16 [chan receive]:
github.com/mperham/inspeqtor.HandleSignals()
        /Users/mike/src/github.com/mperham/inspeqtor/inspeqtor.go:146 +0x170
main.main()
        /Users/mike/src/github.com/mperham/inspeqtor-pro/main.go:57 +0x488

goroutine 19 [finalizer wait]:
runtime.park(0x416160, 0xa30e88, 0xa2f0c9)
        /usr/local/Cellar/go/1.3.3/libexec/src/pkg/runtime/proc.c:1369 +0x89
runtime.parkunlock(0xa30e88, 0xa2f0c9)
        /usr/local/Cellar/go/1.3.3/libexec/src/pkg/runtime/proc.c:1385 +0x3b
runfinq()
        /usr/local/Cellar/go/1.3.3/libexec/src/pkg/runtime/mgc0.c:2644 +0xcf
runtime.goexit()
        /usr/local/Cellar/go/1.3.3/libexec/src/pkg/runtime/proc.c:1445

goroutine 20 [syscall]:
os/signal.loop()
        /usr/local/Cellar/go/1.3.3/libexec/src/pkg/os/signal/signal_unix.go:21 +0x1e
created by os/signal.init·1
        /usr/local/Cellar/go/1.3.3/libexec/src/pkg/os/signal/signal_unix.go:27 +0x32

goroutine 21 [chan receive]:
main.ping(0xc20803e720, 0x13, 0xc208004420)
        /Users/mike/src/github.com/mperham/inspeqtor-pro/licensing.go:123 +0x1d3
created by main.phoneHome
        /Users/mike/src/github.com/mperham/inspeqtor-pro/licensing.go:105 +0x157

goroutine 22 [select]:
github.com/mperham/inspeqtor-pro/jobs.func·001()
        /Users/mike/src/github.com/mperham/inspeqtor-pro/jobs/types.go:68 +0x13a
created by github.com/mperham/inspeqtor-pro/jobs.Watch
        /Users/mike/src/github.com/mperham/inspeqtor-pro/jobs/types.go:80 +0xb0

goroutine 23 [IO wait]:
net.runtime_pollWait(0x7fe4c8b5b008, 0x72, 0x0)
        /usr/local/Cellar/go/1.3.3/libexec/src/pkg/runtime/netpoll.goc:146 +0x66
net.(*pollDesc).Wait(0xc20802a680, 0x72, 0x0, 0x0)
        /usr/local/Cellar/go/1.3.3/libexec/src/pkg/net/fd_poll_runtime.go:84 +0x46
net.(*pollDesc).WaitRead(0xc20802a680, 0x0, 0x0)
        /usr/local/Cellar/go/1.3.3/libexec/src/pkg/net/fd_poll_runtime.go:89 +0x42
net.(*netFD).accept(0xc20802a620, 0x8c59e0, 0x0, 0x7fe4c8b59418, 0xb)
        /usr/local/Cellar/go/1.3.3/libexec/src/pkg/net/fd_unix.go:419 +0x343
net.(*UnixListener).AcceptUnix(0xc2080747c0, 0x0, 0x0, 0x0)
        /usr/local/Cellar/go/1.3.3/libexec/src/pkg/net/unixsock_posix.go:293 +0x73
net.(*UnixListener).Accept(0xc2080747c0, 0x0, 0x0, 0x0, 0x0)
        /usr/local/Cellar/go/1.3.3/libexec/src/pkg/net/unixsock_posix.go:304 +0x4b
github.com/mperham/inspeqtor.(*Inspeqtor).acceptCommand(0xc208040160, 0x0)
        /Users/mike/src/github.com/mperham/inspeqtor/commands.go:51 +0x63
github.com/mperham/inspeqtor.(*Inspeqtor).safelyAccept(0xc208040160, 0x0)
        /Users/mike/src/github.com/mperham/inspeqtor/inspeqtor.go:95 +0x40
github.com/mperham/inspeqtor.func·002()
        /Users/mike/src/github.com/mperham/inspeqtor/inspeqtor.go:74 +0x33
created by github.com/mperham/inspeqtor.(*Inspeqtor).Start
        /Users/mike/src/github.com/mperham/inspeqtor/inspeqtor.go:79 +0x199

Clean up golint issues

Running golint ./... points out dozens of warnings. Fix them if reasonable.

Redis(using init.d) monitoring doesn't work

I can't get the redis metrics I configured; they always return 0.

My redis config is like:

$ sudo vim /etc/inspeqtor/services.d/redis6379.inq
check service redis6379
  if redis:used_memory_rss > 10g then alert
  if redis:used_memory_peak > 115g then alert

My service redis6379 is in:

$ ls /etc/init.d/redis6379
/etc/init.d/redis6379

And I've symlinked the pid file as described in the wiki:

$ ls -l /var/run/redis6379.pid
lrwxrwxrwx 1 root root 34 Nov  9 15:55 /var/run/redis6379.pid -> /var/run/redis/6379/redis_6379.pid

The pid matches the running process:

$ cat /var/run/redis6379.pid
3962
$ sudo service redis6379 status
redis6379 /var/run/redis/6379/redis_6379.pid exists, pid is 3962, should be running

I can get the redis info directly:

$ redis-cli info | grep used_memory_rss
used_memory_rss:22825369600
$ redis-cli info | grep used_memory_peak
used_memory_peak:64174271032

But when I start inspeqtor in verbose mode, I only get 0 for the metrics I configured:

D 2014-11-10T07:54:57.861971Z 67609 Parsing config in /etc/inspeqtor/services.d
V 2014-11-10T07:54:57.862023Z 67609 Parsing /etc/inspeqtor/services.d/redis6379.inq
V 2014-11-10T07:54:57.862375Z 67609 Rule: {Entity:redis6379 [Unknown/0] MetricFamily:redis MetricName:used_memory_rss Op:greater than DisplayThreshold:10g Threshold:1.073741824e+10 CurrentValue:0 PerSec:false CycleCount:1 TrippedCount:0 State:Ok Actions:[0xc20801a640]}
V 2014-11-10T07:54:57.862485Z 67609 Rule: {Entity:redis6379 [Unknown/0] MetricFamily:redis MetricName:used_memory_peak Op:greater than DisplayThreshold:115g Threshold:1.2348030976e+11 CurrentValue:0 PerSec:false CycleCount:1 TrippedCount:0 State:Ok Actions:[0xc20801a910]}
V 2014-11-10T07:54:57.862537Z 67609 Service: {Entity:0xc20802e080 EventHandler:0xc20801a5a0 Process:Unknown/0 Manager:<nil>}
V 2014-11-10T07:54:57.862580Z 67609 Config: &{GlobalConfig:{CycleTime:15 DeployLength:300 Variables:map[log_level:info deploy_length:300 cycle_time:15]} AlertRoutes:map[:0xc20800fc20]}
V 2014-11-10T07:54:57.862632Z 67609 Host: &{Entity:0xc20802f380}
V 2014-11-10T07:54:57.862659Z 67609 Service: redis6379 [Unknown/0]
I 2014-11-10T07:54:57.862696Z 67609 Activating redis-specific metrics
D 2014-11-10T07:54:57.862722Z 67609 Watching redis(used_memory_rss)
D 2014-11-10T07:54:57.862756Z 67609 Watching redis(used_memory_peak)
D 2014-11-10T07:54:57.862796Z 67609 Starting command socket
D 2014-11-10T07:54:57.862892Z 67609 Starting main run loop
V 2014-11-10T07:54:57.863077Z 67609 Resolving services
D 2014-11-10T07:54:57.863117Z 67609 upstart doesn't have redis6379
D 2014-11-10T07:54:57.863165Z 67609 Executing systemctl [systemctl show -p MainPID redis6379]
D 2014-11-10T07:54:57.864043Z 67609 Executing /bin/df [df -P]
D 2014-11-10T07:54:57.865450Z 67609 Collection complete in 2.22935ms
D 2014-11-10T07:54:57.865493Z 67609 redis6379 is Unknown, skipping...
D 2014-11-10T07:55:12.866414Z 67609 Executing /bin/df [df -P]
D 2014-11-10T07:55:12.867731Z 67609 Collection complete in 2.041772ms
D 2014-11-10T07:55:12.867758Z 67609 neo-redis-prod-1[cpu:user] tripped. Current: 99.3, Threshold: 95.0
D 2014-11-10T07:55:12.867776Z 67609 redis6379 is Unknown, skipping...

Why do I always get Service: redis6379 [Unknown/0] or redis6379 is Unknown, skipping..., and why is CurrentValue always 0 in lines like Rule: {Entity:redis6379 [Unknown/0] MetricFamily:redis MetricName:used_memory_peak ... CurrentValue:0 ...}?

I can't figure out why it doesn't work.

Inspeqtor does not detect runit service restarts

OS: Ubuntu 14.04.1
Inspeqtor: 0.6.0-2

If a runit service being monitored by Inspeqtor is restarted via runit and not inspeqtorctl, Inspeqtor will alert and think the process has gone missing. This can be reproduced by tracking a runit-managed process and calling sv restart on it so that a new pid is generated. Inspeqtor will begin alerting that the process has gone missing.

Shows Down status for running init.d service

Hi @mperham, I'm trying to use inspeqtor to monitor resque-scheduler daemon
Service is up and running but inspeqtor shows service status as "Down"

> cat /etc/inspeqtor/services.d/resque_scheduler.inq
check service resque_scheduler with pidfile /home/deploy/app/current/tmp/resque-scheduler.pid
  if memory:rss > 2g then alert
  if cpu:user > 90% for 20 cycles then alert

 > /etc/init.d/resque_scheduler start
Starting resque-scheduler:                                 [  OK  ]

> /etc/init.d/resque_scheduler status
resque-scheduler (pid  16715) is running...

> ps aux | grep `cat /home/deploy/app/current/tmp/resque-scheduler.pid`
deploy   16715  0.0  1.9 542084 152840 ?       Sl   08:04   0:00 resque-scheduler-2.2.0: Schedules Loaded

> /usr/bin/inspeqtor -c /etc/inspeqtor -s /var/run/inspeqtor.sock
Inspeqtor 0.6.0
Copyright © 2014 Contributed Systems LLC
Licensed under the GNU Public License 3.0

Want more? Upgrade to Inspeqtor Pro for more features and support.
See http://contribsys.com/inspeqtor for details.

I 2014-10-17T15:05:29.805827Z 16726 Detected upstart in /etc/init
I 2014-10-17T15:05:29.805975Z 16726 Detected init.d in /etc/init.d
I 2014-10-17T15:05:29.806719Z 16726 Found init.d/resque_scheduler with status Down/0

Pro installation fails on apt-update for 64bit systems

I'm opening this up as a discussion prior to editing the wiki, and so that other users can have a reference.

On a 64-bit system users cannot complete apt-get update, as apt will check for i386 packages first and fail:

W: Failed to fetch https://dl.contribsys.com/inspeqtor-pro/apt/dists/ubuntu/Release  Unable to find expected entry 'trusty/binary-i386/Packages' in Release file (Wrong sources.list entry or malformed file)

The solution to this issue is to specify the architecture within the inspector-pro.list file, as described in Debian Multi Architecture:

deb [arch=amd64] $PROSOURCE/inspeqtor-pro/apt ubuntu trusty

After the change apt-get update will work again. @mperham I think this is something best fixed with documentation, but we may want to change the default pro install instructions. Thoughts?

nginx:Active_connections and nginx:requests reports

[PROD]ubuntu@awsprodapp1 ~ $ sudo inspeqtorctl show nginx nginx:Active_connections
nginx nginx:Active_connections min 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.0 max 0.0 avg 0.0

[PROD]ubuntu@awsprodapp1 ~ $ sudo inspeqtorctl show nginx nginx:requests
nginx nginx:requests min 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.0 max 0.0 avg 0.0
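
The min printed above is exactly Go's math.MaxFloat64 (about 1.8e308), which suggests the minimum was seeded with the largest float and never updated because no samples had been collected. A minimal sketch of a summary that reports "no data" instead, assuming a hypothetical summarize helper rather than Inspeqtor's actual code:

```go
package main

import (
	"fmt"
	"math"
)

// summarize returns the minimum, maximum and average of samples.
// ok is false when there are no samples, so the caller can print "n/a"
// instead of leaking the math.MaxFloat64 seed value.
func summarize(samples []float64) (lo, hi, avg float64, ok bool) {
	if len(samples) == 0 {
		return 0, 0, 0, false
	}
	lo, hi = math.MaxFloat64, -math.MaxFloat64
	sum := 0.0
	for _, v := range samples {
		lo = math.Min(lo, v)
		hi = math.Max(hi, v)
		sum += v
	}
	return lo, hi, sum / float64(len(samples)), true
}

func main() {
	if _, _, _, ok := summarize(nil); !ok {
		fmt.Println("nginx:requests: no data collected yet")
	}
	lo, hi, avg, _ := summarize([]float64{2, 26, 5})
	fmt.Printf("min %.1f max %.1f avg %.1f\n", lo, hi, avg)
}
```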

Extend SMTP Server options

Currently Inspeqtor requires Authentication and TLS for e-mail integration:

Your SMTP server must accept TLS connections on port 587.

Would you accept a change that allows connecting to a weaker SMTP server (e.g. anonymous + plain on port 25)?

Incorrect postgresql:total_size value reported

In this case "incorrect" simply means 0. I have confirmed that running the size computation SQL directly yields the correct value:

psql (9.3.5, server 9.3.4)
Type "help" for help.

app.com/production=# select sum(pg_total_relation_size(pg_class.oid))
FROM pg_class LEFT JOIN pg_namespace N ON (N.oid = pg_class.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema') AND
nspname !~ '^pg_toast' AND relkind IN ('r');
    sum
------------
 3127959552

The reported status:

Service: postgresql [Up/32068]
  cpu:system                          0.1%
  cpu:total_system                    0.1%
  cpu:total_user                      0.0%
  cpu:user                            0.0%
  memory:rss                          72.97m          4g
  postgresql:blk_hit_rate             99.9%           95%
  postgresql:numbackends              37.0            100
  postgresql:total_size               0.00m           10g

Init.d support not working?

Testing init.d support in inspeqtor 0.6.0 on Ubuntu 14.04. I'm trying nginx with the nginx.org stable PPA. While hitting the server with siege there is definitely a load:

$ curl http://localhost/nginx_status
Active connections: 26 
server accepts handled requests
 371 371 381 
Reading: 0 Writing: 25 Waiting: 1

But inspeqtor doesn't show this in the status output:

# inspeqtorctl status 
Inspeqtor 0.6.0, uptime: 2m12.402191666s, pid: 31489

Host: web02
  cpu                                 190.3%         
  cpu:iowait                          0.0%           
  cpu:steal                           0.0%           
  cpu:system                          4.0%           
! cpu:user                            185.5%          95%
  disk:/                              60.0%           90%
  disk:/home                          41.0%          
  load:1                              13.56          
  load:15                             5.80           
  load:5                              10.61          
  swap                                1.0%            20%

Service: nginx [Unknown/0]
  cpu:system                          0.0%           
  cpu:total_system                    0.0%           
  cpu:total_user                      0.0%           
  cpu:user                            0.0%           
  memory:rss                          -0.00m          100m
  nginx:Active_connections            -1.0            500
  nginx:requests                      0.0             1000/sec

The /etc/inspeqtor/services.d/nginx.inq is from the wiki, and looks like this:

check service nginx with hostname localhost, port 80, endpoint /nginx_status
  if memory:rss > 100m then alert
  if nginx:Active_connections > 500 then alert
  if nginx:requests > 1000/sec then alert

Race condition with cron job check

The cron job checker verifies that the job has run exactly one hour after it was last seen. Any variance in runtime will cause a superfluous alert. Add a bit of splay to the check, e.g. 1-2% of the time interval.
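
A minimal sketch of the splay idea, assuming a hypothetical overdue helper: the job is only considered late once the elapsed time exceeds the interval plus a small grace fraction.

```go
package main

import (
	"fmt"
	"time"
)

// overdue reports whether a job that last checked in `elapsed` ago has
// missed its interval, allowing an extra splay fraction (0.02 means 2%).
func overdue(elapsed, interval time.Duration, splay float64) bool {
	grace := time.Duration(float64(interval) * splay)
	return elapsed > interval+grace
}

func main() {
	hour := time.Hour
	// 30 seconds late on an hourly job is inside the 72 second grace window.
	fmt.Println(overdue(hour+30*time.Second, hour, 0.02))
	// 100 seconds late is not.
	fmt.Println(overdue(hour+100*time.Second, hour, 0.02))
}
```

With a 2% splay an hourly job gets 72 seconds of slack and a daily job gets about 29 minutes, so the tolerance scales with the interval.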

Support for haproxy?

Any chance of adding support to inspeqt haproxy? I use it extensively for my experiments, and would love to monitor and take action based on statistics of how many connections are dropped, how many servers are still alive, etc.

regs
Vivek

Alert on down Sidekiq Job/Services

Does anyone know how to monitor Sidekiq and send an alert when Sidekiq crashes? I know there are ways to monitor memory and alert when the service exceeds or drops below a given amount, but what we need to know is whether Sidekiq is down.

Currently we are using Sidekiq Pro with upstart. However, we have yet to find a way to solve this problem.

View historical metric data

Inspeqtor currently stores 240 values, one hour of data, per metric. It would be nice to be able to review this data and visualize the history. Right now, inspeqtorctl info just shows the current value for each metric.

How do we visualize it? Web front-end? Terminal output? (An ASCII/ANSI graph would be cool!) Generate a Google Chart URL?
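
As one possible take on the terminal option, here is a sketch of a fixed-size history buffer and a Unicode sparkline renderer. The ring type and sparkline function are hypothetical; 240 slots at a 15 second cycle gives the one hour of history mentioned above.

```go
package main

import (
	"fmt"
	"strings"
)

// ring keeps the most recent len(vals) samples.
type ring struct {
	vals []float64
	next int
	full bool
}

func newRing(size int) *ring { return &ring{vals: make([]float64, size)} }

// add overwrites the oldest sample once the buffer is full.
func (r *ring) add(v float64) {
	r.vals[r.next] = v
	r.next = (r.next + 1) % len(r.vals)
	if r.next == 0 {
		r.full = true
	}
}

// history returns a copy of the samples, oldest first.
func (r *ring) history() []float64 {
	if !r.full {
		return append([]float64(nil), r.vals[:r.next]...)
	}
	out := append([]float64(nil), r.vals[r.next:]...)
	return append(out, r.vals[:r.next]...)
}

// sparkline scales the samples between their min and max and draws one
// block character per sample.
func sparkline(vals []float64) string {
	if len(vals) == 0 {
		return ""
	}
	bars := []rune("▁▂▃▄▅▆▇█")
	lo, hi := vals[0], vals[0]
	for _, v := range vals {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	var b strings.Builder
	for _, v := range vals {
		i := 0
		if hi > lo {
			i = int((v - lo) / (hi - lo) * float64(len(bars)-1))
		}
		b.WriteRune(bars[i])
	}
	return b.String()
}

func main() {
	r := newRing(240)
	for _, v := range []float64{41, 42, 60, 95, 80, 55, 41} {
		r.add(v)
	}
	fmt.Println("disk:/", sparkline(r.history()))
}
```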

Incorrect memory:rss reported for service

I'm attempting to monitor sidekiq's memory usage but don't seem to be getting accurate readings. The service is being run through upstart, is reporting status correctly, and is apparently sending metrics out to statsd. However, the memory holds steady at ~0.63m no matter what sidekiq actually balloons to. Here is the output of inspeqtorctl status:

Service: sidekiq [Up/4890]
  cpu:system                          0.0%
  cpu:total_system                    0.0%
  cpu:total_user                      0.0%
  cpu:user                            0.0%
  memory:rss                          0.63m           1g

And here is the info through htop:

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 4892 dscout     20   0 1697M  714M 11320 S  0.5 18.1  0:53.57 sidekiq 3.2.6 dscout [0 of 2 busy]

And here again through ps:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dscout    4892  0.6 17.8 1737756 721176 ?      Sl   07:35   0:58 sidekiq 3.2.6 dscout [0 of 2 busy]

Would any other information be helpful? The sidekiq.conf upstart config?

Monitor cron execution

One common application issue is cron jobs that silently fail. It would be awesome to have inspeqtor monitor cron execution so it could fire an alert email if a signal has not been received in the last interval. Something like:

check cron
  job foo every day
  job bar every hour

And then your cron job would need to fire a success message to Inspeqtor:

inspeqtorctl cron foo

This is similar to SaaSes like Dead Man's Snitch today.

Time out commands

Currently the metric collection commands can hang forever and block Inspeqtor's collection loop. We should probably put a hard limit of 5 seconds on any external command.

One idea here: https://stackoverflow.com/questions/11886531/terminating-a-process-started-with-os-exec-in-golang
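
In current Go the standard library covers this with exec.CommandContext, which kills the child process when the context expires. A minimal sketch, assuming a hypothetical runWithTimeout wrapper (the 5 second limit and the df -P command are taken from this page; nothing here is Inspeqtor's actual code):

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runWithTimeout runs an external command and kills it if it has not
// finished within limit. It returns the combined stdout and stderr.
func runWithTimeout(limit time.Duration, name string, args ...string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel()

	out, err := exec.CommandContext(ctx, name, args...).CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return out, fmt.Errorf("%s timed out after %s", name, limit)
	}
	return out, err
}

func main() {
	out, err := runWithTimeout(5*time.Second, "df", "-P")
	if err != nil {
		fmt.Println("collection failed:", err)
		return
	}
	fmt.Printf("collected %d bytes of df output\n", len(out))
}
```

One caveat: if the command spawns grandchildren that keep the output pipe open, waiting can still block after the direct child is killed, so commands that fork need extra care.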

Unable to upstart inspeqtor

When init tries to load the config I'm seeing this error:

Nov 12 19:53:19 databot init: /etc/init/inspeqtor.conf:10: Unknown stanza

Line 10 is setgid adm.

[root@databot init]# cat /etc/issue
CentOS release 6.6 (Final)
Kernel \r on an \m

[root@databot init]# rpm -q inspeqtor
inspeqtor-0.6.0-2.x86_64
[root@databot init]# cat /etc/init/inspeqtor.conf 
#
# Upstart script for Ubuntu 12.04 and 14.04, CentOS 6
#
description "Inspeqtor: Application infrastructure monitoring"

start on runlevel [2345]
stop on runlevel [016]

# allow any 'adm' user to run inspeqtorctl without sudo
setgid adm
umask 0002

# if we crash, restart
respawn
# don't try to restart anymore if we fail 5 times in 5 seconds
respawn limit 5 5

exec /usr/bin/inspeqtor -c /etc/inspeqtor -s /var/run/inspeqtor.sock
# ensure our socket is cleaned up, even if we crash
pre-start exec /bin/rm -f /var/run/inspeqtor.sock
[root@databot init]#

Change metric naming format

Currently metrics are written like family(name), e.g. memory(rss). One major issue is that parens must be escaped on the command line so they are painful to type. Switch format to memory:rss.
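
A minimal sketch of parsing the new format, assuming a hypothetical parseMetric helper. Splitting on the first colon only keeps names such as disk:/home intact and lets bare families such as swap through with an empty name.

```go
package main

import (
	"fmt"
	"strings"
)

// parseMetric splits "family:name" on the first colon. A metric without a
// colon, such as "swap", is returned as a family with an empty name.
func parseMetric(s string) (family, name string) {
	parts := strings.SplitN(s, ":", 2)
	if len(parts) == 2 {
		return parts[0], parts[1]
	}
	return parts[0], ""
}

func main() {
	for _, m := range []string{"memory:rss", "disk:/home", "swap"} {
		family, name := parseMetric(m)
		fmt.Printf("%-12s family=%q name=%q\n", m, family, name)
	}
}
```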

Action errors are swallowed

The code calling action.Trigger() doesn't do anything with the returned error. It should at least be logged. For instance, failure to send email is swallowed.
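
A minimal sketch of the fix, assuming a hypothetical action interface whose Trigger method returns an error as described above. The two implementations and triggerAll are illustrative only.

```go
package main

import (
	"errors"
	"log"
)

// action mirrors the shape implied by the issue: Trigger returns an error.
// The interface and both implementations are hypothetical.
type action interface {
	Trigger() error
}

type okAction struct{}

func (okAction) Trigger() error { return nil }

type failingAction struct{}

func (failingAction) Trigger() error { return errors.New("smtp: connection refused") }

// triggerAll fires every action, logs each failure instead of dropping it,
// and returns how many actions failed.
func triggerAll(actions []action) int {
	failed := 0
	for _, a := range actions {
		if err := a.Trigger(); err != nil {
			log.Printf("action %T failed: %v", a, err)
			failed++
		}
	}
	return failed
}

func main() {
	failed := triggerAll([]action{okAction{}, failingAction{}})
	log.Printf("%d action(s) failed", failed)
}
```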

Push stats to carbon/statsd

It would be nice if this could also push the data to carbon or statsd so that they could be viewed in Graphite or equivalent.

Tools like collectd do this, but it's a pain to install (the latest version, which has the write_graphite backend).

Is this planned?

Create intro screencast

It would be nice to have a short video showing how quick it is to get started with Inspeqtor. "Look at all the stuff I'm not configuring!"

Support rates

A blog reader pointed out that counter values aren't transformed into per second rates. Things like mysql:Queries should be divided by the cycle time so the rule thresholds remain stable, even if the cycle_time is changed.

check service mysql
  # today
  if mysql:Queries > 750 then alert
  # should be
  if mysql:Queries > 50 then alert
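
The arithmetic behind the two thresholds: with the default 15 second cycle, 750 queries per cycle is 750 / 15 = 50 queries per second. A minimal sketch of the conversion, assuming a hypothetical perSecond helper that also guards against counter resets:

```go
package main

import "fmt"

// perSecond converts two successive readings of a monotonically increasing
// counter into a per-second rate. A reading that goes backwards (for example
// after the daemon restarts) yields 0 rather than a negative rate.
func perSecond(prev, cur, cycleSeconds float64) float64 {
	if cycleSeconds <= 0 || cur < prev {
		return 0
	}
	return (cur - prev) / cycleSeconds
}

func main() {
	// 750 new queries during one 15 second cycle.
	fmt.Println(perSecond(1000, 1750, 15)) // 50
	// The same traffic measured over a 30 second cycle gives the same rate.
	fmt.Println(perSecond(1000, 2500, 30)) // 50
}
```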

Safer goroutine usage

Right now the metric collection for a service can lock up and thus lock up Inspeqtor because of this Wait().

Integrate better error handling and collection timeouts for goroutine usage. Start with the Context pattern and see where that takes us.

Use a stopping channel

Right now there's a global Stopping boolean which doesn't wake up child goroutines. Instead use select multiplexing on time.After and a stopping channel so everything shuts down cleanly immediately.
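
A minimal sketch of that pattern, assuming a hypothetical runLoop function: the loop selects on both the cycle timer and a stopping channel, so closing the channel wakes it immediately instead of waiting out the rest of the cycle.

```go
package main

import (
	"fmt"
	"time"
)

// runLoop calls scan once per cycle until stopping is closed and returns
// the number of scans performed.
func runLoop(cycle time.Duration, stopping <-chan struct{}, scan func()) int {
	scans := 0
	for {
		select {
		case <-stopping:
			return scans
		case <-time.After(cycle):
			scan()
			scans++
		}
	}
}

func main() {
	stopping := make(chan struct{})
	done := make(chan int)

	go func() {
		done <- runLoop(10*time.Millisecond, stopping, func() {})
	}()

	time.Sleep(35 * time.Millisecond)
	close(stopping) // every goroutine selecting on the channel wakes up
	fmt.Println("scans before shutdown:", <-done)
}
```

Closing the channel, rather than sending on it, broadcasts to any number of child goroutines at once.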

Emails should be per-cycle, not per-rule

When something breaks, it's often the case that several rules will break at the same time. Each cycle's email should summarize all the rules that are broken and not a single email per rule. This summary allows the reader to see everything broken at a glance and possibly understand the root cause.

Support init.d in OSS package

I've been thinking that fighting against the use of init.d by withholding support is just hurting the users and will be a HUGE support issue. Common infrastructure like nginx.org's PPA deb is still distributed with an init.d script.

My proposal:

Move init.d support from Pro into OSS.
Add cron job monitoring to Pro as a nice replacement feature.

Inspeqtor will not start on Ubuntu 14.04

Hi there. I've been configuring and setting up Inspeqtor on a new Ubuntu 14.04.1 instance via the debian package. After setting up the configuration files and following the guide in the wiki, it still does not start. Here is the output of Inspeqtor with verbose logging enabled:

$ sudo /usr/bin/inspeqtor -c /etc/inspeqtor -s /var/run/inspeqtor.sock -l verbose
Inspeqtor 0.6.0
Copyright © 2014 Contributed Systems LLC
Licensed under the GNU Public License 3.0

Want more? Upgrade to Inspeqtor Pro for more features and support.
See http://contribsys.com/inspeqtor for details.

I 2014-11-07T23:44:56.157100Z 20160 Detected upstart in /etc/init
I 2014-11-07T23:44:56.157649Z 20160 Detected runit in /etc/service
I 2014-11-07T23:44:56.157798Z 20160 Detected systemd in /etc/systemd
I 2014-11-07T23:44:56.158181Z 20160 Detected init.d in /etc/init.d/
D 2014-11-07T23:44:56.158262Z 20160 Parsing /etc/inspeqtor/inspeqtor.conf
V 2014-11-07T23:44:56.158398Z 20160 Global config: &{GlobalConfig:{CycleTime:15 DeployLength:300 Variables:map[log_level:info deploy_length:300]} AlertRoutes:map[]}
V 2014-11-07T23:44:56.171633Z 20160 Parsing /etc/inspeqtor/host.inq
V 2014-11-07T23:44:56.171950Z 20160 Rule: <nil>
No alert route configured!

The /etc/inspeqtor/host.inq is the default and here is my /etc/inspeqtor/inspeqtor.conf

$ cat /etc/inspeqtor/inspeqtor.conf
# a deploy will timeout after this many seconds
set deploy_length 300

# controls the log verbosity
set log_level info

Is metric export supported

I've given the wiki doco a quick read-through and it seems like inspeqtor is a really nice way to collect information from a number of systems and configure email alerts for threshold violations. Is there a way to export the metrics that inspeqtor collects during runtime? I didn't see it in the documentation. Thanks!