My apologies for posting this here - I was looking for the op5 mailing list (which I was subscribed to years ago), but it appears op5.org is gone ...
I am switching from a very old Nagios/Merlin install ... probably at least six years old ... and I am seeing some behavior in the newer version that I do not understand. I am running Naemon 1.2.4-1 and pulled Merlin from git today.
In the past, the Nagios node that actually ran a check would log the 'SERVICE ALERT' for that check. I had all three of our peered Nagios/Merlin machines syslogging to each other, so I had a nice consolidated view of where checks were being run and what was happening.
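(For reference, the cross-node logging is nothing fancy ... roughly a couple of forwarding rules like this in rsyslog on each node, with the peer names swapped around. This is a sketch from memory, not our exact config:)

cat >/etc/rsyslog.d/naemon-peers.conf <<'EOF'
# on node_a: forward naemon's log lines to the two peer nodes over TCP
if $programname == 'naemon' then @@node_b:514
if $programname == 'naemon' then @@node_c:514
EOF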
With the new version, it appears certain events are being "echoed" by all the Naemon daemons, like this:
Jul 3 21:41:31 node_a naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;WARNING;SOFT;4;SNMP WARNING - Supply Humidity *673*
Jul 3 21:41:31 node_c naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;WARNING;SOFT;4;SNMP WARNING - Supply Humidity *673*
Jul 3 21:41:31 node_b naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;WARNING;SOFT;4;SNMP WARNING - Supply Humidity *673*
Jul 3 21:42:32 node_a naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;OK;SOFT;5;SNMP OK - Supply Humidity 666
Jul 3 21:42:32 node_c naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;OK;SOFT;5;SNMP OK - Supply Humidity 666
Jul 3 21:42:32 node_b naemon: SERVICE ALERT: crac-A03;InRow Supply Humidity;OK;SOFT;5;SNMP OK - Supply Humidity 666
However, on some occasions the behavior is quite different:
Jul 3 21:04:12 node_a naemon: SERVICE NOTIFICATION SUPPRESSED: pdu-f10;Infeed Power Factor;Re-notification blocked for this problem.
Jul 3 21:04:12 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;58;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul 3 21:09:14 node_b naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;51;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul 3 21:09:14 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;59;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul 3 21:11:37 node_b naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;52;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.66
Jul 3 21:11:37 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;60;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.66
Jul 3 21:21:37 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;61;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul 3 21:21:37 node_b naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;53;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul 3 21:31:37 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;62;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul 3 21:31:37 node_b naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;54;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul 3 21:41:37 node_b naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;55;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
Jul 3 21:41:37 node_c naemon: SERVICE ALERT: pdu-f10;Infeed Power Factor;WARNING;SOFT;63;WARNING - RACK F10 (CORE), Master_L2(AB) PowerFactor: 0.67
In the second example only node_b and node_c appear to be echoing these events, and what is additionally concerning is that the retry counts do not stay in sync: at 21:41:37 node_b thought this was the 55th retry while node_c thought it was the 63rd.
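(In case it is useful, this is roughly how I tally the per-node retry counts out of the merged syslog ... the log path is whatever your syslog daemon writes to; /var/log/messages is just my guess at a common default:)

# print "node retry" pairs for the service in question
grep 'SERVICE ALERT: pdu-f10;Infeed Power Factor' /var/log/messages \
  | awk -F';' '{ split($1, w, " "); print w[4], $5 }'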
Here is the output of mon node status:
Total checks (host / service): 259 / 4780
#00 2/2:2 local ipc: ACTIVE - 0.000s latency
Uptime: 38m 43s. Connected: 38m 43s. Last alive: 0s ago
Host checks (handled, expired, total) : 86, 0, 259 (33.20% : 33.20%)
Service checks (handled, expired, total): 1593, 0, 4780 (33.33% : 33.33%)
#01 1/2:2 peer node_b: ACTIVE - 0.000s latency - (ENCRYPTED)
Uptime: 38m 43s. Connected: 38m 43s. Last alive: 0s ago
Host checks (handled, expired, total) : 86, 0, 259 (33.20% : 33.20%)
Service checks (handled, expired, total): 1593, 0, 4780 (33.33% : 33.33%)
#02 0/2:2 peer node_c: ACTIVE - 0.000s latency - (ENCRYPTED)
Uptime: 38m 45s. Connected: 38m 43s. Last alive: 0s ago
Host checks (handled, expired, total) : 87, 0, 259 (33.59% : 33.59%)
Service checks (handled, expired, total): 1594, 0, 4780 (33.35% : 33.35%)
I looked through the troubleshooting notes and did get identical hashes on all three nodes from this command:
mon node ctrl --type=peer -- mon oconf hash
But my objects.cache files do not have the same hash value. I compared the files side by side (and with diff), and it looks like Naemon lists the 'members' of certain host/service/etc. groups in a randomized order, which throws the hashes off from each other.
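(If it helps anyone reproduce this: sorting each comma-separated members list before hashing should take the ordering out of the comparison. A rough sketch, assuming each peer's objects.cache has been copied locally as objects.cache.node_*:)

# sort each 'members' list so ordering differences wash out, then compare hashes
for f in objects.cache.node_*; do
  perl -pe 's/^(\s*members\s+)(.*)$/$1 . join(",", sort split(",", $2))/e' "$f" > "$f.sorted"
done
md5sum objects.cache.node_*.sorted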
I looked at the merlin database and found that the 'report_data' table held wildly different data on each node. That could have been leftover from testing the system, so I truncated the table and started over ... now all three have roughly the same number of rows ... 612, 619, 618. But that hasn't really resolved the problem with what Naemon is logging.
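(For reference, the reset and re-check were just the obvious statements on each node ... I am assuming the default 'merlin' database name and whatever credentials you normally use:)

mysql merlin -e "TRUNCATE TABLE report_data"
mysql merlin -e "SELECT COUNT(*) FROM report_data"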
I ran a specific query for the device in question, pdu-f10, against the report_data table, and all three databases have exactly the same info ... 'timestamp' and 'retry' are consistent across them ... but the 'id' of each row differs (which I guess makes sense, since it is auto-increment). So one Naemon node logs the right retry value, another logs a lower one, and the third logs nothing at all.
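The query was essentially this (again assuming the default 'merlin' database), and the result set below is identical on all three nodes apart from 'id':

mysql merlin -e "SELECT id, timestamp, host_name, service_description, state, hard, retry
                 FROM report_data
                 WHERE host_name = 'pdu-f10'
                   AND service_description = 'Infeed Power Factor'
                 ORDER BY timestamp"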
+-----+------------+-----------+---------------------+-------+------+-------+
| id  | timestamp  | host_name | service_description | state | hard | retry |
+-----+------------+-----------+---------------------+-------+------+-------+
| 130 | 1625363424 | pdu-f10   | Infeed Power Factor |     1 |    0 |    36 |
| 160 | 1625363725 | pdu-f10   | Infeed Power Factor |     1 |    0 |    37 |
| 178 | 1625364029 | pdu-f10   | Infeed Power Factor |     1 |    0 |    38 |
| 193 | 1625364330 | pdu-f10   | Infeed Power Factor |     1 |    0 |    47 |
| 222 | 1625364632 | pdu-f10   | Infeed Power Factor |     1 |    0 |    48 |
| 286 | 1625364933 | pdu-f10   | Infeed Power Factor |     1 |    0 |    49 |
| 359 | 1625365235 | pdu-f10   | Infeed Power Factor |     1 |    0 |    50 |
| 384 | 1625365537 | pdu-f10   | Infeed Power Factor |     1 |    0 |    51 |
| 401 | 1625365838 | pdu-f10   | Infeed Power Factor |     1 |    0 |    52 |
| 411 | 1625366140 | pdu-f10   | Infeed Power Factor |     1 |    0 |    45 |
| 430 | 1625366646 | pdu-f10   | Infeed Power Factor |     1 |    0 |    54 |
| 456 | 1625366948 | pdu-f10   | Infeed Power Factor |     1 |    0 |    55 |
| 483 | 1625367249 | pdu-f10   | Infeed Power Factor |     1 |    0 |    56 |
| 499 | 1625367551 | pdu-f10   | Infeed Power Factor |     1 |    0 |    57 |
| 518 | 1625367852 | pdu-f10   | Infeed Power Factor |     1 |    0 |    50 |
| 531 | 1625368154 | pdu-f10   | Infeed Power Factor |     1 |    0 |    51 |
+-----+------------+-----------+---------------------+-------+------+-------+
Again, my sincere apologies for posting this as an 'issue' rather than a general support request elsewhere!