My apologies for posting this here - I was looking for the op5 mailing list (which I was subscribed to years ago), but it appears op5.org is gone ...
I am switching from a very old Nagios/Merlin install ... probably at least six years old ... and I am seeing some behavior in the newer version that I do not understand. I am running Naemon 1.2.4-1 and pulled Merlin from git today.
In the past, the Nagios node that actually ran a check would log the 'SERVICE ALERT' for that check. I had all three of our peered Nagios/Merlin machines syslogging to each other, so I had a nice consolidated view of where checks were being run and what was happening.
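(For reference, the cross-node logging is nothing fancy ... roughly a couple of forwarding rules like this in rsyslog on each node, with the peer names swapped around. This is a sketch from memory, not our exact config:)

cat >/etc/rsyslog.d/naemon-peers.conf <<'EOF'
# on node_a: forward naemon's log lines to the two peer nodes over TCP
if $programname == 'naemon' then @@node_b:514
if $programname == 'naemon' then @@node_c:514
EOF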
With the new version, it appears certain events are being "echoed" by all the Naemon daemons, like this:
Jul 3 21:41:31 node_a naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;WARNING;SOFT;4;SNMP WARNING - Supply Humidity *673*
Jul 3 21:41:31 node_c naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;WARNING;SOFT;4;SNMP WARNING - Supply Humidity *673*
Jul 3 21:41:31 node_b naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;WARNING;SOFT;4;SNMP WARNING - Supply Humidity *673*
Jul 3 21:42:32 node_a naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;OK;SOFT;5;SNMP OK - Supply Humidity 666
Jul 3 21:42:32 node_c naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;OK;SOFT;5;SNMP OK - Supply Humidity 666
Jul 3 21:42:32 node_b naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;OK;SOFT;5;SNMP OK - Supply Humidity 666
However, on some occasions the behavior is quite different:
Jul 3 21:04:12 node_a naemon: SERVICE NOTIFICATION SUPPRESSED: pdu-f10;Infeed Power Factor;Re-notification blocked for this problem.
Jul 3 21:04:12 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;58;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul 3 21:09:14 node_b naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;51;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul 3 21:09:14 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;59;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul 3 21:11:37 node_b naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;52;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.66
Jul 3 21:11:37 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;60;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.66
Jul 3 21:21:37 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;61;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul 3 21:21:37 node_b naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;53;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul 3 21:31:37 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;62;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul 3 21:31:37 node_b naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;54;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul 3 21:41:37 node_b naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;55;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul 3 21:41:37 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;63;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
In the second example only node_b and node_c appear to be echoing these events, and what is additionally concerning is that the retry counts do not stay in sync: at 21:41:37 node_b thought this was the 55th retry while node_c thought it was the 63rd.
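(In case it is useful, this is roughly how I tally the per-node retry counts out of the merged syslog ... the log path is whatever your syslog daemon writes to; /var/log/messages is just my guess at a common default:)

# print "node retry" pairs for the service in question
grep 'SERVICE ALERT: pdu-f10;Infeed Power Factor' /var/log/messages \
  | awk -F';' '{ split($1, w, " "); print w[4], $5 }'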
Here is the output of mon node status:
Total checks (host / service): 259 / 4780
#00 2/2:2 local ipc: ACTIVE - 0.000s latency
Uptime: 38m 43s. Connected: 38m 43s. Last alive: 0s ago
Host checks (handled, expired, total) : 86, 0, 259 (33.20% : 33.20%)
Service checks (handled, expired, total): 1593, 0, 4780 (33.33% : 33.33%)
#01 1/2:2 peer node_b: ACTIVE - 0.000s latency - (ENCRYPTED)
Uptime: 38m 43s. Connected: 38m 43s. Last alive: 0s ago
Host checks (handled, expired, total) : 86, 0, 259 (33.20% : 33.20%)
Service checks (handled, expired, total): 1593, 0, 4780 (33.33% : 33.33%)
#02 0/2:2 peer node_c: ACTIVE - 0.000s latency - (ENCRYPTED)
Uptime: 38m 45s. Connected: 38m 43s. Last alive: 0s ago
Host checks (handled, expired, total) : 87, 0, 259 (33.59% : 33.59%)
Service checks (handled, expired, total): 1594, 0, 4780 (33.35% : 33.35%)
I looked through the troubleshooting notes and did get identical hashes on all three nodes from this command:
mon node ctrl --type=peer -- mon oconf hash
But my objects.cache files do not have the same hash value. I compared the files side by side (and with diff), and it looks like Naemon lists the 'members' of certain host/service/etc. groups in a randomized order, which throws the hashes off from each other.
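(If it helps anyone reproduce this: sorting each comma-separated members list before hashing should take the ordering out of the comparison. A rough sketch, assuming each peer's objects.cache has been copied locally as objects.cache.node_*:)

# sort each 'members' list so ordering differences wash out, then compare hashes
for f in objects.cache.node_*; do
  perl -pe 's/^(\s*members\s+)(.*)$/$1 . join(",", sort split(",", $2))/e' "$f" > "$f.sorted"
done
md5sum objects.cache.node_*.sorted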
I looked at the merlin database and found that the 'report_data' table held wildly different data on each node. That could have been leftover from testing the system, so I truncated the table and started over ... now all three have roughly the same number of rows ... 612, 619, 618. But that hasn't really resolved the problem with what Naemon is logging.
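(For reference, the reset and re-check were just the obvious statements on each node ... I am assuming the default 'merlin' database name and whatever credentials you normally use:)

mysql merlin -e "TRUNCATE TABLE report_data"
mysql merlin -e "SELECT COUNT(*) FROM report_data"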
I ran a specific query for the device in question, pdu-f10, against the report_data table, and all three databases have exactly the same info ... 'timestamp' and 'retry' are consistent across them ... but the 'id' of each row differs (which I guess makes sense, since it is auto-increment). So one Naemon node logs the right retry value, another logs a lower one, and the third logs nothing at all.
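The query was essentially this (again assuming the default 'merlin' database), and the result set below is identical on all three nodes apart from 'id':

mysql merlin -e "SELECT id, timestamp, host_name, service_description, state, hard, retry
                 FROM report_data
                 WHERE host_name = 'pdu-f10'
                   AND service_description = 'Infeed Power Factor'
                 ORDER BY timestamp"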
+-----+------------+-----------+---------------------+-------+------+-------+
| id  | timestamp  | host_name | service_description | state | hard | retry |
+-----+------------+-----------+---------------------+-------+------+-------+
| 130 | 1625363424 | pdu-f10   | Infeed Power Factor |     1 |    0 |    36 |
| 160 | 1625363725 | pdu-f10   | Infeed Power Factor |     1 |    0 |    37 |
| 178 | 1625364029 | pdu-f10   | Infeed Power Factor |     1 |    0 |    38 |
| 193 | 1625364330 | pdu-f10   | Infeed Power Factor |     1 |    0 |    47 |
| 222 | 1625364632 | pdu-f10   | Infeed Power Factor |     1 |    0 |    48 |
| 286 | 1625364933 | pdu-f10   | Infeed Power Factor |     1 |    0 |    49 |
| 359 | 1625365235 | pdu-f10   | Infeed Power Factor |     1 |    0 |    50 |
| 384 | 1625365537 | pdu-f10   | Infeed Power Factor |     1 |    0 |    51 |
| 401 | 1625365838 | pdu-f10   | Infeed Power Factor |     1 |    0 |    52 |
| 411 | 1625366140 | pdu-f10   | Infeed Power Factor |     1 |    0 |    45 |
| 430 | 1625366646 | pdu-f10   | Infeed Power Factor |     1 |    0 |    54 |
| 456 | 1625366948 | pdu-f10   | Infeed Power Factor |     1 |    0 |    55 |
| 483 | 1625367249 | pdu-f10   | Infeed Power Factor |     1 |    0 |    56 |
| 499 | 1625367551 | pdu-f10   | Infeed Power Factor |     1 |    0 |    57 |
| 518 | 1625367852 | pdu-f10   | Infeed Power Factor |     1 |    0 |    50 |
| 531 | 1625368154 | pdu-f10   | Infeed Power Factor |     1 |    0 |    51 |
+-----+------------+-----------+---------------------+-------+------+-------+
Again, my sincere apologies for posting this as an 'issue' rather than a general support request elsewhere!