Ranting, Technically Speaking: July 2012

The other day, we got an escalation from Nagios in the middle of the night that email was down. Looking at the system, I quickly found that while yes, one SMTP relay was down, the other was up. So how do you monitor services that require multiple failures before service is disrupted?

check_cluster

This is a service check which doesn't check a service itself, but instead checks the results of other service (or host) checks. The documentation is pretty clear on how to set this up.

So where before I had been checking two SMTP services and escalating to SMS on each, I still have checks for the two SMTP services, but I have added a check_cluster service which looks at the results of both and is only critical if all the SMTP services are down. Then I escalate based on the check_cluster check instead of the individual check_smtp ones.
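To make the aggregation idea concrete, here is a minimal Python sketch of the decision described above. It is not the check_cluster plugin's actual implementation; the `State` enum and `cluster_state` function are hypothetical names for illustration. The numeric values follow the standard Nagios state IDs (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Returning WARNING when only some members are down is my assumption; the setup above only requires that CRITICAL means every member is down.

```python
from enum import IntEnum


class State(IntEnum):
    """Standard Nagios service state IDs."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def cluster_state(member_states):
    """Aggregate the states of redundant services into one cluster state.

    CRITICAL only when every member is non-OK; WARNING when some but not
    all members are non-OK (an assumption of this sketch); OK otherwise.
    """
    if not member_states:
        # Nothing to aggregate, so no meaningful result can be reported.
        return State.UNKNOWN
    failed = sum(1 for s in member_states if s != State.OK)
    if failed == len(member_states):
        return State.CRITICAL
    if failed > 0:
        return State.WARNING
    return State.OK


if __name__ == "__main__":
    # One relay down, one up: degraded, but mail still flows.
    print(cluster_state([State.CRITICAL, State.OK]).name)
    # Both relays down: this is the case worth an SMS.
    print(cluster_state([State.CRITICAL, State.CRITICAL]).name)
```

With this logic, the middle-of-the-night case (one relay down, the other up) would not reach the state that triggers the SMS escalation, while losing both relays still would.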

Sunday, 22 July 2012

Nagios check_cluster