Monday, 24 February 2014

Principles of Alerting - Part 1

One of my colleagues recently drew my attention to this story about last year's security breach at Neiman Marcus, and specifically the 60,000 or so alerts that were apparently generated but not acted upon.

This led me to once again consider the process for handling the myriad log entries, automated alerts and status emails that are generated on a daily basis within the applications and infrastructure my team supports.

Thanks to the team's ongoing efforts we have in place good tool support, comprehensive alerting and monitoring, and a proven, mature process. That's great mitigation, but the reality is that just one missed alert has the potential to cause significant trouble.

So, this seems an opportune moment for our management and engineering teams to refresh our thinking on the guiding principles underpinning our monitoring and alerting activities. I thought I'd start by documenting some of my candidate principles here and then use this to start a conversation within the team.

Here are the first five, in no particular order...

Work on your signal to noise ratio.

Think about what you're alerting on and why you're doing it. Make every alert earn its place in the application landscape.

A "success!" email at the end of an 8-hour batch process is probably useful - all sorts of people in the organisation will want to know about that. An "everything's OK!" alarm that sends an email to the support team every 5 minutes and clogs up the team's inboxes would need some serious justification.

The more background noise the team need to wade through before they get to useful information, the greater the chance that something important will be missed. Or inadvertently moved to a seldom-opened subfolder, via an email filter...defined by an engineer who is drowning in well-meaning spam.
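One way to cut this kind of noise is to report a check only when its status changes, rather than on every run. The sketch below is a minimal illustration of that idea, not a description of any particular tool: the `StateChangeNotifier` class, its `report` method and the status strings are all hypothetical names made up for this example.

```python
from typing import Callable, Dict


class StateChangeNotifier:
    """Forward a check result only when its status differs from the last one seen."""

    def __init__(self, send: Callable[[str], None]) -> None:
        # `send` is whatever delivers the message (email, SMS, console...).
        self._send = send
        self._last_status: Dict[str, str] = {}

    def report(self, check_name: str, status: str) -> bool:
        """Record a result; return True if a notification was sent."""
        previous = self._last_status.get(check_name)
        self._last_status[check_name] = status
        # A first-ever "OK" is treated as noise: nothing has changed for the reader.
        if previous is None and status == "OK":
            return False
        # Same status as last time: stay quiet.
        if status == previous:
            return False
        self._send(f"{check_name}: {previous or 'UNKNOWN'} -> {status}")
        return True


sent = []
notifier = StateChangeNotifier(sent.append)
for status in ["OK", "OK", "FAILED", "FAILED", "OK"]:
    notifier.report("nightly-batch", status)
print(sent)
```

Note that this sketch also stays quiet about a failure that persists; a real setup would probably want to repeat or escalate unresolved failures on a schedule.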

Give us a clue.

There are some pretty cryptic alerts out there - weird job step IDs, arcane code fragments, and terse (or missing) descriptions are all too common. Such a format can work if you're the person that developed the module in question, but chances are that when the alert pops up on a console at 2:30 am on a Saturday morning you'll be elsewhere, and a member of the out of hours monitoring team will have to hit the Ops Guide to see what remedial action an ERR-28001 requires.

Every second counts when you have a live incident, and critical information should be pushed to the people who need it. Why not include the remedial action (if known) in the alert text? Give the unfortunate recipient of the bad news the information they need to rescue the situation and minimise or negate the impact on end users.
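As a rough sketch of what that could look like, the alert below carries its own impact statement and remedial action alongside the error code. The `Alert` class and its field names are assumptions made for illustration, and the summary, impact and action text attached to ERR-28001 here are invented placeholders, not the real meaning of that code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    code: str
    summary: str
    impact: str
    remedial_action: Optional[str] = None  # None when no fix is known yet

    def render(self) -> str:
        """Build the text that the out of hours team will actually read."""
        action = self.remedial_action or "No known fix; escalate to the on-call engineer."
        return (
            f"[{self.code}] {self.summary}\n"
            f"Impact: {self.impact}\n"
            f"Action: {action}"
        )


# Placeholder wording only: substitute the real description from your Ops Guide.
alert = Alert(
    code="ERR-28001",
    summary="Nightly import job stopped at step 4",
    impact="Morning reports will be stale until the job completes",
    remedial_action="Re-run the job from step 4 using the Ops Guide procedure",
)
print(alert.render())
```

The fallback line matters as much as the happy path: when no fix is known, the alert should at least say who to call.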

Configurable is good.

Rather than following the traditional approach of hardcoding alerts into the application, consider using a tool that allows you to implement and adapt your alerting strategy at runtime. This approach has numerous benefits, including:
  • You won't need to change application code to add, remove or modify alerts - this is a good thing because as we all know every change is a risk, right?
  • You'll be able to reconfigure escalation strategies on demand. Need to include the Ops Director on a specific alert over the peak trading period? Want to SMS the team in Poland if it's a Wednesday? Need to change the email address for out of hours alerting? All eminently possible with monitoring and alerting tools.

Many tools exist that'll perform some or all of the functionality described above; if you're looking for somewhere to start I can recommend the Open Source version of Hyperic.
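To make the idea concrete, here is a minimal sketch of escalation rules held as data rather than code. It is not how Hyperic or any other product works: the JSON rule format, the alert names, the addresses and the `recipients` function are all hypothetical, chosen to mirror the examples in the list above.

```python
import json
from datetime import datetime
from typing import Dict, List

# Hypothetical rule format. In practice this would live in a file or in the
# monitoring tool's own configuration, and be changed without a redeploy.
RULES_JSON = """
[
  {"alert": "*", "channel": "email", "target": "support-team@example.com"},
  {"alert": "*", "channel": "sms", "target": "team-poland", "weekdays": [2]},
  {"alert": "checkout-failure", "channel": "email", "target": "ops-director@example.com"}
]
"""


def load_rules(text: str) -> List[Dict]:
    """Parse the rule list; editing the text changes behaviour, not the code."""
    return json.loads(text)


def recipients(rules: List[Dict], alert_name: str, when: datetime) -> List[str]:
    """Return 'channel:target' strings for every rule matching this alert and time."""
    matched = []
    for rule in rules:
        if rule["alert"] not in ("*", alert_name):
            continue
        # datetime.weekday() numbers Monday as 0, so Wednesday is 2.
        if "weekdays" in rule and when.weekday() not in rule["weekdays"]:
            continue
        matched.append(f"{rule['channel']}:{rule['target']}")
    return matched


rules = load_rules(RULES_JSON)
print(recipients(rules, "checkout-failure", datetime(2014, 2, 26)))  # a Wednesday
```

Adding the Ops Director to another alert, or changing the out of hours address, then becomes an edit to the rules rather than a change to application code.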

Make sure someone is listening.

Even better, make sure that "someone" is the right person (or team), that they know what to do and who to contact, and that they understand the likely impact of the underlying issue (see Give us a clue, above).

Typically your service desk and out of hours operations teams will be receiving and performing triage on alerts, so ensure they spend time with the engineering team on a regular basis to understand how key processes can fail and what can happen when they do.

A communication plan that includes internal, customer and third party escalation paths is key to reacting quickly and keeping everyone in the loop when a serious incident occurs. Ensure your front line team know who is on call, and how to get hold of them in a hurry.

Define a standard process for each known failure mode where possible, and get these agreed with your customer. When an incident occurs out of hours and you're convening a conference call at 3am, you'll be glad that you made some of the big decisions up front.

Consider leveraging your customer's prior investment.

Not as generally applicable as the principles described above, but if you're providing support services to a client there's a good chance that they will have already invested in alerting and monitoring tools. In this case it may make sense to piggyback on the infrastructure they've already put in place rather than implementing a parallel tool stack.

Going down this route can make ongoing maintenance easier, as the customer may well take responsibility for ensuring the agents are kept in good working order, and the risk of unwanted side effects from running two monitoring systems on the same server is removed.

You may need additional resources to accomplish this (for example, Microsoft's SCOM requires that a gateway exist within the customer's domain), but if the opportunity presents itself it is definitely worth considering.

Well, those are my initial thoughts. I'd love to hear suggestions for additions (and corrections!) to this list - please feel free to comment below.