The other day, we got an escalation from Nagios in the middle of the night that email was down. Looking at the system, I quickly found that while yes, one SMTP relay was down, the other was up. So how do you monitor services that require multiple failures before service is disrupted?
check_cluster
This is a service check which doesn't check a service itself, but instead checks the results of other service (or host) checks. The documentation is pretty clear on how to set this up.
Before, I had been checking two SMTP services and escalating to SMS on each. I still have checks for both SMTP services, but I added a check_cluster check which looks at the results of both and is only critical if all the SMTP services are down. Then I escalate based on the check_cluster check instead of the check_smtp ones.
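To make the idea concrete, here is a rough sketch in Python of the kind of decision a cluster check makes. This is not the check_cluster plugin's code. The function name `cluster_state` and the "at least N members not OK" threshold semantics are my own assumptions for illustration, and the real plugin has its own threshold syntax, so check its documentation before copying numbers across. The only thing taken from Nagios itself is the standard state numbering (0 OK, 1 WARNING, 2 CRITICAL).

```python
# Standard Nagios state IDs.
OK, WARNING, CRITICAL = 0, 1, 2


def cluster_state(member_states, warn_at, crit_at):
    """Summarise several check results into one cluster state.

    Hypothetical helper: warn_at and crit_at are the number of
    non-OK members needed to raise WARNING or CRITICAL. The real
    check_cluster plugin may express its thresholds differently.
    """
    # Count every member that is in any state other than OK.
    not_ok = sum(1 for state in member_states if state != OK)
    if not_ok >= crit_at:
        return CRITICAL
    if not_ok >= warn_at:
        return WARNING
    return OK


# Two SMTP relays: one down is worth a warning, both down is critical.
one_relay_down = cluster_state([CRITICAL, OK], warn_at=1, crit_at=2)
both_relays_down = cluster_state([CRITICAL, CRITICAL], warn_at=1, crit_at=2)
all_relays_up = cluster_state([OK, OK], warn_at=1, crit_at=2)
```

With two relays and the critical threshold set to two, a single relay failing only produces a warning, which is what keeps the SMS escalation quiet until email is really down.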