The other day, Nagios escalated an alert to us in the middle of the night saying that email was down. Looking at the system, I quickly found that while one SMTP relay was indeed down, the other was still up. So how do you monitor services that require multiple failures before service is disrupted?
check_cluster
This is a service check which doesn't check a service itself, but instead checks the results of other service (or host) checks. The documentation is pretty clear on how to set this up.
Before, I was checking two SMTP services and escalating to SMS on each. I still have checks for both SMTP services, but I added a check_cluster check which looks at the results of both and is only critical if all the SMTP services are down. I now escalate based on the check_cluster check instead of the check_smtp ones.
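To make the logic concrete, here is a rough Python sketch of the decision described above. It is a simplified model, not the plugin's actual code: the function name `cluster_state` and its parameters are made up for illustration, and it assumes plain integer thresholds where the cluster alerts once the number of non-OK members exceeds the threshold.

```python
# Nagios state IDs as returned by service checks.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def cluster_state(member_states, warn_threshold, crit_threshold):
    """Hypothetical model of a cluster check.

    member_states: state IDs of the underlying checks (e.g. two check_smtp results).
    The cluster alerts when the count of non-OK members exceeds a threshold.
    """
    non_ok = sum(1 for state in member_states if state != OK)
    if non_ok > crit_threshold:
        return CRITICAL
    if non_ok > warn_threshold:
        return WARNING
    return OK


# Two SMTP relays: warn when one is down, go critical only when both are down.
one_relay_down = cluster_state([CRITICAL, OK], warn_threshold=0, crit_threshold=1)
both_relays_down = cluster_state([CRITICAL, CRITICAL], warn_threshold=0, crit_threshold=1)
```

With these assumed thresholds, a single failed relay only produces a warning on the cluster check, so the SMS escalation fires only when every relay is down.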
Sunday 22 July 2012