On GitHub: thomaswsdyer / ItsATrap
Tom Dyer
Systems Engineer - L.I.V.E. - Solium
http://tomdyer.ca / @thomaswsdyer
Solium's Monitoring and Alerting System
Nagios watches the infrastructure.
Admiral Ackbar looks at behaviour:
We do this monitoring through "Traps".
An email to the entire team
Pre-formatted text for JIRA
Step-by-Step Resolution Instructions
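As a rough illustration of the three items above, here is a minimal sketch of how a trap notification could be assembled. The `Trap` class, its fields, the `build_trap_email` helper, and the sample trap are all hypothetical and are not taken from the ItsATrap code; the sketch only shows one email body carrying both the JIRA-ready text and the numbered resolution steps.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trap:
    """A hypothetical behaviour check; field names are illustrative only."""
    name: str
    description: str
    resolution_steps: List[str] = field(default_factory=list)


def build_trap_email(trap: Trap, details: str) -> str:
    """Render the plain-text email body that would go to the whole team."""
    lines = [
        f"Trap triggered: {trap.name}",
        "",
        "JIRA text (copy and paste):",
        f"  Summary: [Trap] {trap.name}",
        f"  Description: {trap.description} Details: {details}",
        "",
        "Resolution steps:",
    ]
    # Number the steps so responders can follow them in order.
    lines += [
        f"  {i}. {step}"
        for i, step in enumerate(trap.resolution_steps, start=1)
    ]
    return "\n".join(lines)


# Made-up example trap, for illustration only.
trap = Trap(
    name="Stale export job",
    description="No export has completed in the expected window.",
    resolution_steps=["Check the job log.", "Restart the export worker."],
)
email_body = build_trap_email(trap, "last success 3 hours ago")
print(email_body)
```

Keeping the JIRA text and the steps in the same message means whoever picks up the email can file the ticket and start the fix without looking anything up.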
OpsGenie for Alerting
Two parallel schedules for different alert "types".
"Production alerts"
3 people
"Business Hours"
6 people
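The two-schedule setup can be sketched as a small routing table. This is an assumption-laden model in plain Python, not the OpsGenie API: the `Schedule` class, the `route` and `on_call` helpers, the alert type keys, and the member names are all placeholders. Only the schedule names and head counts (3 and 6) come from the slides.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Schedule:
    """A hypothetical on-call rotation; members are placeholder names."""
    name: str
    members: List[str]


# Two parallel schedules, sized as described on the slides.
SCHEDULES: Dict[str, Schedule] = {
    "production": Schedule("Production alerts", ["prod-1", "prod-2", "prod-3"]),
    "business_hours": Schedule(
        "Business Hours", ["bh-1", "bh-2", "bh-3", "bh-4", "bh-5", "bh-6"]
    ),
}


def route(alert_type: str) -> Schedule:
    """Pick the schedule that handles a given alert type."""
    try:
        return SCHEDULES[alert_type]
    except KeyError:
        raise ValueError(f"no schedule for alert type {alert_type!r}")


def on_call(schedule: Schedule, rotation_index: int) -> str:
    """Return who is on call for a given rotation slot (simple round robin)."""
    return schedule.members[rotation_index % len(schedule.members)]


print(on_call(route("production"), 4))      # slot 4 of a 3-person rotation
print(on_call(route("business_hours"), 7))  # slot 7 of a 6-person rotation
```

Under this round-robin assumption, each person on the 3-person schedule is on call every third slot, twice as often as anyone on the 6-person schedule.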
Directed Alerts!
Auto Close alerts!
Links to Wiki playbooks!
OpsGenie!
Alert / Pager Fatigue for 3 people
Some alerts could only be handled by certain people
More schedules = More overhead
Wait...who's on call?!?
OpsGenie with ONE Schedule!
Alerts go to HipChat and Pager.
Admiral Ackbar for multiple business units!
More Traps!
Automagic Remediation?
Questions?