tl;dr: The bottom line is that we had fewer actual site-down interruptions and fewer false-alarm escalations in 2015.
Here’s a non-exhaustive list of contributing factors to these improvements over the years:
- Eliminating scheduled maintenance that would take a site offline
- Limiting API abuse (and a general decrease in the number of abuse incidents)
- Automated blocking of other common abuse traffic
- Fairly generous ongoing hardware refresh with better distribution across cabinets
- Completely new core and top-of-rack network switches
- Hiring the right people (and the right number of people)
- Moving to more stable storage (EMC / Isilon to Cleversafe)
- Taking control of our public Internet connectivity and routing (our own IP space, our own routers, carefully selected providers, and traffic filtering)
- Right-sizing database hardware for every major application
- Better development/deployment practices and consistency in following those practices (local tests/CI, staging, rollout, production)
- Practicing incident response and keeping playbooks up to date
- Vastly improved metrics and dashboards
- Better application architecture and design choices with regard to availability and failure modes
- Being ruthless in tuning our internal monitoring and alerting (Nagios) so that only alerts that really need escalation get escalated (see the sketch after this list)
- (Full disclosure: we actually had more incidents escalated from our internal monitoring this year. The “quality” of those escalations is higher, though.)
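To give a concrete sense of that last point, here is a minimal sketch of a conservatively tuned Nagios service definition. The host, check command, and contact group names are hypothetical and the thresholds are only examples; the general pattern is to require several consecutive failures, delay the first notification, and restrict notifications to critical and recovery events so that a page only goes out for a sustained hard CRITICAL state, not a warning or a flapping check.

```
# Sketch of a conservatively tuned Nagios service definition.
# Names and thresholds are hypothetical examples, not our actual config.
define service {
    use                      generic-service
    host_name                app-db01                 ; hypothetical host
    service_description      MySQL replication lag
    check_command            check_mysql_replication ; hypothetical check
    max_check_attempts       5    ; require 5 consecutive failures before a HARD state
    retry_interval           1    ; recheck every minute while in a SOFT state
    first_notification_delay 10   ; wait 10 minutes before the first notification
    notification_options     c,r  ; notify only on CRITICAL and recovery, not WARNING/UNKNOWN
    flap_detection_enabled   1    ; suppress notifications while the state is flapping
    contact_groups           oncall                   ; hypothetical contact group
}
```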