Sunday 22 July 2012

Nagios check_cluster

The other day, we got an escalation from Nagios in the middle of the night that email was down.  Looking at the system, I quickly found that while yes, one SMTP relay was down, the other was up.  So how do you monitor services that require multiple failures before service is disrupted?


check_cluster is a service check which doesn't check a service itself, but checks the results of other service (or host) checks.  The documentation is pretty clear on how to set this up.

So where before I had been checking two SMTP services and escalating to SMS on each, I still have checks for the two SMTP services, but I have added a check_cluster check which looks at the results of both and only goes critical if all the SMTP services are down.  I now escalate based on the check_cluster check instead of the check_smtp ones.
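
To make the idea concrete, here is a rough Python sketch of the kind of decision a cluster check makes for the two-relay setup described above.  It is not the plugin's actual code: the function name cluster_state and its warn_above/crit_above parameters are made up for illustration, and the only thing borrowed from Nagios is the standard state numbering (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Standard Nagios state IDs (also the plugin exit codes).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def cluster_state(member_states, warn_above, crit_above):
    """Derive one cluster state from the states of its member checks.

    Hypothetical helper for illustration only: it counts the members
    that are not OK and compares that count with two simple
    "more than N" thresholds.
    """
    non_ok = sum(1 for state in member_states if state != OK)
    if non_ok > crit_above:
        return CRITICAL
    if non_ok > warn_above:
        return WARNING
    return OK


# Two SMTP relays: warn when one is down, go critical only when both are.
all_up = cluster_state([OK, OK], warn_above=0, crit_above=1)
one_down = cluster_state([OK, CRITICAL], warn_above=0, crit_above=1)
both_down = cluster_state([CRITICAL, CRITICAL], warn_above=0, crit_above=1)
```

With thresholds like these, a single dead relay only produces a warning, so nobody gets an SMS at night unless both relays are down at once.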

