Here at Server Density we monitor 100,000+ servers, processing 2B metrics a day. Our service has to monitor our customers' infrastructure continuously, so downtime is critical for us, and we keep training to react to incidents. We organize internal War Games where all engineers practice the processes involved in incident handling. We have seen how this improves the associated human factors, our processes and our tools.
We will go through these points:
- The cost of uptime
- Expect downtime: prepare, respond, postmortem
- Human factors and how to improve them
- Train: War Games! A realistic simulation
- The incident handling process
- Results:
  - revealing deficiencies
  - increase confidence / reduce panic
  - coordination / improve time to resolution
- Train:
  - your people
  - your processes
  - your tools
- Review and repeat!