My dad is a firefighter and fire captain in Niagara Falls, NY. When I told him I had on-call duty at my new job, he was beyond excited. After he relayed stories of waking in the middle of the night to head into the hall, and of picking up overtime shifts from buddies who were sick or didn’t want to wake up for work, I had to explain that it’s a different kind of action for me. I face issues of a less life-threatening variety, but there are still plenty of virtual fires to extinguish. Here’s a primer on what being an on-call programmer means at 37signals.
On-call programmers rotate every two weeks, and all 8 programmers currently get their fair share of customer support and interaction. Usually we have 2 to 3 programmers on call in a given week. Support issues from email and Twitter are routed through Assistly and handled by our awesome support team.
If there’s an issue they can’t figure out, we get a ping in our On Call room in Campfire to check out the issue.
These issues range in severity, and they might look like:
- Got an “Oops” page (our standard 500 Server Error page)
- Can’t log in
- Page is messed up or broken
- Incoming email didn’t process or show up
The next step is to FIX IT! If possible, get the customer’s problem figured out in the short term, and hopefully push out a bug fix to stomp it out permanently. Luckily, we’ve got a few tools to help us with debugging. We use a GitHub wiki to share code snippets that make debugging common issues go faster.
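A wiki snippet might look something like this one for triaging an “Oops” page: pull the account and exception class out of a production log line so we know where to start. This is a hedged sketch — the log format, regex, and `parse_oops` helper are invented for illustration, not our actual wiki contents:

```ruby
# Hypothetical snippet: extract the account subdomain and exception class
# from a production log line. The log line format here is made up.
LOG_LINE = /\[(?<subdomain>[\w-]+)\] (?<exception>\w+(?:::\w+)*): (?<message>.*)/

def parse_oops(line)
  match = LOG_LINE.match(line)
  return nil unless match
  { subdomain: match[:subdomain],
    exception: match[:exception],
    message:   match[:message] }
end

info = parse_oops("[acmecorp] ActiveRecord::RecordNotFound: Couldn't find Contact with id=42")
# info[:subdomain] tells us which account hit the Oops page,
# info[:exception] tells us what to grep the exception feed for.
```

Having snippets like this in one shared place means the second person to hit an issue doesn’t start from zero.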
A newer piece of our weapon rack as on-call programmers is 37, our internal tool for doing pretty much anything with our production servers and data. Some of what it covers:
- SSH into any box on our cluster
- Watch a live feed of exceptions streaming in
- Fire up a Rails or MySQL console for any application
- Grep or tail logs over multiple days from a specific account
37 has helped immensely just by virtue of being a shared tool where we can use Ruby (and Bash!) to automate our daily work. It started as a small set of scripts tossed around on various machines, but it really became a vital part of our workflow once we turned it into a repo and documented each command and script.
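We can’t share 37 itself, but the general shape of such a tool is simple: a dispatcher that maps subcommands to the shell commands you’d otherwise type by hand. Everything below — the command names, hostnames, and paths — is hypothetical, a sketch of the pattern rather than our actual tool:

```ruby
# Sketch of a 37-style ops CLI. Each subcommand builds a shell command
# string; a real tool would Kernel#exec the result. All hosts, paths,
# and app names here are invented for illustration.
COMMANDS = {
  "ssh"     => ->(box) { "ssh deploy@#{box}.example.internal" },
  "console" => ->(app) { "ssh -t deploy@#{app}-app-1.example.internal " \
                         "'cd /apps/#{app} && bin/rails console'" },
  "tail"    => ->(app) { "ssh deploy@#{app}-app-1.example.internal " \
                         "'tail -f /apps/#{app}/log/production.log'" },
}

def build_command(name, arg)
  builder = COMMANDS.fetch(name) { raise ArgumentError, "unknown command: #{name}" }
  builder.call(arg)
end

# e.g. `37 ssh db-1` would exec:
puts build_command("ssh", "db-1")
```

The payoff isn’t the code — it’s that checking the commands into a repo and documenting them makes everyone’s one-off shell incantations available to the whole team.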
Once the issue has been debugged, solved, and deployed, we’ll log it as fixed in Assistly. The support crew handles telling the customer it’s fixed, but sometimes I’ll jump in if it’s a developer- or API-related issue.
There are a few other channels we pay attention to throughout the day as well. In Campfire we get alerts from Nagios about things that might be going wrong.
Here’s one such example: a contact export from Highrise that was stuck in the queue for too long.
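A check like that can boil down to comparing the oldest pending job’s enqueue time against a threshold. This is a hedged sketch, not our actual Nagios check — the job shape (hashes with `:enqueued_at`) and the 15-minute threshold are invented:

```ruby
# Hypothetical "job stuck in queue" check. A real Nagios plugin would
# exit 2 (CRITICAL) when the result is non-empty.
STUCK_AFTER = 15 * 60  # seconds a job may wait before we call it stuck

def stuck_jobs(jobs, now: Time.now, threshold: STUCK_AFTER)
  jobs.select { |job| now - job[:enqueued_at] > threshold }
end

now = Time.now
jobs = [
  { id: 1, kind: "contact_export", enqueued_at: now - 3600 },  # an hour old
  { id: 2, kind: "welcome_email",  enqueued_at: now - 60 },
]
stuck_jobs(jobs, now: now).each { |job| puts "stuck: #{job[:kind]} ##{job[:id]}" }
```

Piping alerts like this into Campfire means the whole on-call crew sees the smoke at the same time.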
Another incoming channel is the API support Google Group. Although this is more of a support forum for existing API consumers, we’ll try to jump in and help debug issues if there’s time left over from handling direct customer issues.
Our on-call emergencies come in many different flavors. An all-out, five-alarm fire is rare, but does happen. Typically it’s a slow-burning fire: something is smoldering, and we need to put it out before it gets worse.
We’ll learn of these problems in a few ways:
- Nagios alert in Campfire about downtime/inaccessibility
- Twitter eruption
- Support ticket flood
Once we’ve confirmed it’s not a false alarm, we’ll update the status site and notify customers via @37signals. The number one priority from there is putting out the fire, which usually involves discussing our attack plan and deploying the fix.
If the fire is really getting out of control and tickets are piling up, sometimes we’ll help answer them to let customers know we’re on it. The status site also knows everyone’s phone numbers, so if it’s off-hours we can text and/or call in backup to help solve the problem.
Here are a few tips I’ve learned from our fires so far:
Be honest
Given that I’m relatively new to our apps and their infrastructure, I don’t know my way around 100% yet. Being honest about what you do and don’t know given the information at hand is extremely important. Be vocal about what you’re attempting, any commands you might be running, and what your thought process is. Jumping on Skype or voice iChat with others attempting to debug the problem can also be useful, but if someone is actively trying to hunt down the problem, be aware that it might break their flow.
Cool your head
Each problem we get is different, and it’s most likely pretty hairy to solve. Becoming frustrated at it isn’t going to help! Stepping away might not be the best way to de-stress if the fire is burning, but staying calm and focused on the problem goes a long way. Frustration is immediately visible in communication, even via Campfire, and venting isn’t going to solve the problem any faster.
Write up what happened
Post-mortems are an awesome way to understand how the fires were put out, and what we can do better the next time one happens. Our writeups get pretty in-depth, and knowing the approach used can definitely help out the next time. Especially since we’re remote and spread across many different timezones, reviewing a full report and reflection on the problem is extremely helpful, no matter how small the fire was.
There is no undo!
Issues that crop up repeatedly usually point to a larger fix. Losing work to browser crashes and closed windows was a big support issue until autosave came to the rescue. When not putting out fires, on-call programmers work on “slack” issues like these, or fix other long-standing bugs.
Experimenting with how on-call works is definitely encouraged so we can help our customers faster, and I’m sure our process will continue to evolve. It’s not perfect (what process is?), but I can definitely say it’s been improving. Putting out our virtual fires is still a lot of fun despite all of the stress!