Are you finding the root cause?

David wrote this on Jul 30 2008. There are 43 comments.

We circle the on-call responsibility between all the programmers at 37signals. Every day is someone’s day to take care of the technical issues that bubble up from support but can’t be resolved there. And that seemed to work pretty well in the beginning, but we’re starting to think that we need a more systematic approach.

The problem with passing the support monkey around is that everyone just wants to get rid of him as soon as possible. There’s not a whole lot of vested interest in dealing with the root cause of the issues, so you solve one-off problems for individual customers and get on with your day.

For the individual programmer, that approach will appear to work reasonably well because the feedback cycle is so long. You forget next week that you’ve actually already dealt with this problem before. And you certainly don’t get the feedback of knowing that the issue caused three other incidents for other people during the week. So your personal incentive to fix the true cause isn’t building naturally.

I’ve found that to ever get anything done, you really need to align personal incentives with the task at hand. That’s why we’ve been thinking about doing support weeks.

A single programmer gets assigned to work the support monkey all week and has to solve the root cause of every issue he encounters. No I’ll-just-deal-with-this-guy one-offs. Not just because of the directive that it’s what you’re supposed to do, but because it’ll come ever so naturally when you’ve solved the same problem three days in a row.

Are you finding the root causes for your daily grind or do the wheels just keep spinning on the same issues?


Fake David

on 30 Jul 08

Get back to work David ! 37Signals has become a blogging company really. You can be a competitor to techcrunch.

Mike

on 30 Jul 08

There are only 3 ways to find the root cause of a problem: 1. Right Way 2. Wrong way 3. 37 Signals way.

Matt Todd

on 30 Jul 08

This is actually pretty similar to the idea posted in a slightly more technical blog post over at Highgroove Studios’ Scout App blog.

http://blog.scoutapp.com/articles/2008/07/29/4-simple-steps-to-detect-and-fix-slow-rails-requests

The important part is that you are looking for the root cause, not getting distracted (by elaborate technical contests or laziness/aversion). Don’t solve the wrong problem, or band-aid things.

Matt Todd

on 30 Jul 08

Here’s the (clickable) link: 4 Simple Steps to Detect & Fix Slow Rails Requests.

Chris

on 30 Jul 08

Sounds like a good opportunity to hire a support staffer who’s more experienced in problem management (identifying underlying causes of individual incidents, trending etc.) ...

Know anybody like that? ;)

Dave O'Flynn

on 30 Jul 08

At Atlassian, most of our teams have the concept of a Developer On Support. This is the person that support escalates issues to, who helps the support engineers improve their knowledge. This is usually a two-week rotation.

On Bamboo, we’ve got a kick-ass support engineer (hi Ajay!), so the DoS’ other function, when not helping support, is fixing bugs that are causing customer pain. Because it’s a two-week rotation, the developer can regularly confirm a bug and fix it for the next patch release.

We tried one-week rotations, but found it didn’t give us enough time to get many of the bugs fixed by the developer that raised them, and the hand-over to another dev had too much communication overhead for most bugs.

In addition, it gives the other developers longer stretches of time to work on cool new features without being interrupted for support help.

alex

on 30 Jul 08

Every week is support week in our office. Each of our developers specialises in one area of our application, and support calls relevant to that area are passed on to that person.

What happens is that we always end up finding the root cause: because nobody wants to be on the phone all the time, it is in their best interest to solve the problem properly. The downside is that it gets incredibly boring dealing with the same area of the application all the time.

It works great for simple code fixes. If it is a larger fix, one that would require more than a few hours’ work, management always says to leave it for later, as they prefer adding new features to the software over fixing old problems.

SH

on 30 Jul 08

Sounds like a good opportunity to hire a support staffer who’s more experienced in problem management (identifying underlying causes of individual incidents, trending etc.) ...

This isn’t really the fault of a support person who can’t identify trends, and it’s a little presumptuous to think so. You can spot trends all day long and report them all day long, but trends are just trends until they are resolved and actually fixed. The majority of repeat incidents I can see a mile away and vocalize with our team often.

We do a pretty great job of keeping a watchful eye on issues and bugs, and while we can always improve on everything we do, the big issue here isn’t about whether the support person is good at “problem management.” It’s about what you do with those problems that matters, and whether you fix them as a one-off and allow trends to grow, or take the time to resolve them for good.

Dan

on 30 Jul 08

Reminds me of W. E. Deming, father of TQM, and the distinction between special-cause variation and common-cause variation. If each “problem” is seen as an isolated incident you’ll miss the common cause. The key part to improving quality is understanding the difference between the 2 things.

That isn’t always easy.

Chris

on 30 Jul 08

Ack—my bad.

My comment was not a shot at you Sarah, nor the kind of support you provide your customers. I’ve heard nothing but good things about your professionalism and abilities.

This was more aimed at Jason\David from a conversation in the past. Reading it now I can see I didn’t think it through enough.

My apologies for any offense—hastily written and not intended!

GeeIWonder

on 30 Jul 08

I’d argue that this belies the real root cause.

Nice post. Not so much because I think you’ve found the right solution (changing the cycle to a week might help, or it might not), but because it’s a clever way to try and vet the idea/brainstorm over alternatives.

Glenn

on 30 Jul 08

A week is probably a good start. I guess a lot depends on the amount of issues. I am also presuming no night support or any longer could be considered torture.

I have worked everywhere from companies that demanded (yes, a shame) that the programmers were always responsible for their area 24/7 to my last company, where we rotated the night calls every month (DBA – not that many calls). Although I do not like being called at night, I always made it clear that the next day’s highest priority was going to be correcting the issue. That helps take the stink out of support – for me anyway.

Louise

on 30 Jul 08

In my last job support was rotated between all developers for a 2 week long cycle each time – the length of one iteration.

Like yourselves, we found this to be particularly effective in allowing the support person to find and fix the underlying bugs that were causing support issues. The extended timeframe encouraged each of us to take ownership of the issue(s) while it was our turn, and to resolve them before support was handed over to the next developer.

We also found that this approach removed the knowledge silos that had developed before we implemented the rotation. Before, the same people always dealt with the same areas of the system – much like what Alex described. The support rotation gave all developers exposure to all aspects of the system and removed the inevitable panic when someone was away and ‘their’ part of the system needed support.

SH

on 30 Jul 08

@Chris, perhaps I was a bit hasty in my reply as well. ;)

My point is that issues like this need to be approached and resolved as a team, that means the support person at the very bottom of the totem pole to all the programmers in between and the CEO at the top. It’s so easy to say that any problem a company has is the result of the person at the bottom, in fact it’s more than easy, it’s instinctive.

Luckily, we work fluidly as a team, so we address issues like this as a team. Sometimes, like in this case, the root of the issue isn’t necessarily a person’s weakness but the weakness of a process, which we’re hoping to resolve and strengthen.

John

on 30 Jul 08

At my last company we had a Duty Supervisor that handled all escalations. That rotated between each person on a weekly basis. Each Thursday it rotated to the next person. We kept a paper binder that listed all on going issues. If something or someone kept coming up we worked on it as a team until it was resolved.

Now I am the only programmer, so everything is handled by me all the time.

Scott Semple

on 31 Jul 08

An excellent post.

I’m not in the tech industry, but the thinking is applicable everywhere: how to instill personal ownership of issues in ourselves and in our team (regardless of on-duty assignments)?

I’m not familiar with software development, but is it possible to determine what areas problems arise from - i.e. who wrote the code in the first place? - and assign the problem solving to them? Would do two things: solve the problem and make the person more attentive to their development in the future.

TD

on 31 Jul 08

I often find that the root cause of problems that appear again and again in slightly different forms is with design and not the code itself.

Dan

on 31 Jul 08

Nice post.

Good problem management is a mindset that IT dudes need to get into more. As you say, we concentrate far too much upon fire-fighting the day-to-day tasks, without bringing our heads up out of the sand. If we just take 10% of time to analyse our incidents (assuming they are logged) it’s easy to pick out the ones that are causing all the grief. And good logging goes some way to ease support handover woes.

There’s a monster service management framework some of you will have no doubt heard of, called ITIL. It’s a long slog to read it all, but it’s all common sense, and stuff that we “should” be doing, but we often don’t. We’re currently building an ITIL easier.

Dan Lee

on 31 Jul 08

Ooops. Try this link.

Kristoffer

on 31 Jul 08

Possibly related in an interesting way: 5 Whys of Toyota

Ben

on 31 Jul 08

Maybe your support work is different than ours, but I don’t think going to a week is going to make that much difference.

We used to do a weekly rotation among 5 developers and it was the same thing – you still just want the monkey off your back. If you’re the unlucky one to get the bad week then it really sucks. You’re still going to fix the same problem you fixed last week, or last month.. same problem, but longer intervals.

Solving the root cause is the right thing to do, but going to a week isn’t going to magically make that happen.

Eric

on 31 Jul 08

Maybe you should stop calling it a “support monkey” and project a more positive and productive attitude about it.

Dmitriy

on 31 Jul 08

Great insight on mismatched incentives. I also like Eric’s idea on trying to “project a more positive [...] attitude about it.” Non-monetary incentives are very powerful (see Tyler Cowen’s “Discover Your Inner Economist” and Dan Ariely’s “Predictably Irrational” on this topic).

To see my modest attempts exploring incentives in IT, please see Operations Alerts and Tragedy of The Commons.

Vicky H

on 31 Jul 08

I can totally relate to this post and think that this is a great idea.

Many times when a tech is ‘on call’, they do look for the quick solution to pass the puck, so to say. Also, sometimes an initial problem is a symptom of the next issue (not for 37s in particular, but in general), and the person ‘on call’ the next day may know the overall picture of what occurred the previous day but not the specifics, so you end up bringing 2 techs in to resolve it.

I also think that it helps the technician in a way to have a week rotation. Many times being ‘on call’ can be intimidating to the tech if it is not outlined what the expectations are or if you are a salary employee. Some employers don’t want to see you put in much time when you’re on call due to OT, budget, or other things. It also is nice to be able to plan your schedule with family and friends in advance w/o the constraints of a 4 or 5 day rotation.

I think this is a smart decision on 37signals’ part and would be interested in hearing how it goes after they’ve tried it for a while.

Vicky H

Chad G

on 31 Jul 08

Forgetting that you’re not all in the same office, I really imagined this little stuffed monkey that you passed around from desk to desk each week as a reminder that it’s your turn. Hey, really…why don’t you?

Charlie Triplett

on 31 Jul 08

Louise: “to resolve them before support was handed over to the next developer.”

Perhaps that’s the answer- find motivation to hand the next guy a more peaceful monkey.

Ideally the motivation would be love for your neighbor.

GeeIWonder

on 31 Jul 08

I think you need to create actual ownership of the problem, and stamp out this ‘hand off’ mentality; otherwise the week cycle will just result in delays of weeks rather than days.

If the problem is the lack of time to handle the problem, the longer cycle might work. But that’s not the image that’s suggested by the overall tone.

How do you create ownership? Well, for one, you could literally create ownership—many privately held companies are employee owned, and many others that aren’t prepared to go so far still structure their performance pay based on shared profits.

Another idea is to make the ‘support monkey’ an opportunity rather than a negative: maybe make a contest out of resolving issues (based on relative breadth and depth) once and for all (say, 3 months without the same issue cropping up), with a suitable reward system. Turn the ‘oh no, it’s my week’ into a ‘yay, a chance to make some extra money/get some free tickets/whatever else’.

Vicky H

on 31 Jul 08

@Geelwonder Those are good suggestions, I like it, for my job thinking more about what you said :-)

@FakeDavid Are you like a FakeSteveJobs for DHH? OMG, I hope not.

Marty

on 31 Jul 08

@John: “We kept a paper binder that listed all on going issues.”

That’s a very lo-fi ticketing system!

And in response to the post itself, get a better ticketing system. The first step to stopping repeat incidents and band-aid solutions is to document their occurrence in the first place. Make it part of the incident response process to check for existing tickets.

tyler rooney

on 31 Jul 08

David, a rotating support week is a great idea. I spent 3 years at Amazon and I saw the worst of what can happen when developers own support and time isn’t properly allocated for it. That said, I also got to see it work quite well.

My last team had around 15 developers and owned at least a dozen services including a few monsters. Our on-call rotation was usually 4 days long, where your primary task was putting out any front line fires. Your secondary task was responding to service requests from internal customers. Once we started collecting stats on which services caused the most tickets it was much easier to prioritize and address support issues. Every project on our team operated on a 2-week sprint cycle so we added a separate “support” project. One developer, rotated by project, would be donated to the support project and was responsible for addressing a long-term support issue in that sprint. I thought this was a great idea as it really shows you how much time your entire team spends on support and how much time your new projects suffer because of it.

That setup might be too rigid for some but I think a variant of that can work for even small teams. I found it worked incredibly well for us especially considering we had to work in sync with a dev team on the other side of the world. That said, I was a developer and a manager on the team might have had a different opinion.

Lester A. McGrath-Rosario

on 31 Jul 08

Where I work support was assigned on a daily basis. I don’t know if it ever was changed to a weekly, but currently we have a team who works on this on a permanent basis.

Three guys; three shifts. Our primary task is user support and issue tracking. Secondary task is support related or non-critical development, so we can concentrate on the issues (by nagging the developers, solving them ourselves and/or creating SOP’s to handle the issue in the future).

David Wagner

on 31 Jul 08

A previous post mentioned ITIL. This is my area of speciality. Most people who have heard of or seen ITIL probably shy away from the stacks of manuals that define this best practice. However, my suggestion is to take what you need from it and leave the rest.

The first problem I see with your “support monkey” is how you are dealing with problems. When a customer calls with an issue, they are reporting an Incident (in ITIL terminology); the goal for a reported Incident is to restore the customer’s service as soon as possible, without having to focus on the root cause. This is what your weekly support monkey focuses on: restoring the service if possible.

Next, as a group (i.e. all the support monkeys), review current and past Incidents looking for potential Problems (another ITIL term), i.e. underlying issues that keep recurring and cause Incidents to happen. The high-priority Problems are then worked on (by the whole team) to find the root cause and implement fixes that eliminate the underlying Problem from the environment completely, i.e. no more recurring Incidents from this Problem.

This, from the ITIL world, is the distinction between Incident and Problem management, i.e. two different goals with different ways to organize and focus on them.
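The review step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the incident IDs, the cause labels, the `find_problems` name, and the `min_occurrences` threshold are all hypothetical, and it assumes someone has already tagged each Incident with a suspected cause.

```python
from collections import Counter


def find_problems(incidents, min_occurrences=2):
    """Return (cause, count) pairs for causes seen at least min_occurrences times.

    `incidents` is an iterable of (incident_id, suspected_cause) pairs.
    Both fields are hypothetical labels chosen for this sketch.
    """
    counts = Counter(cause for _, cause in incidents)
    # most_common() yields causes from most to least frequent.
    return [(cause, n) for cause, n in counts.most_common() if n >= min_occurrences]


incidents = [
    ("INC-1", "mail-delivery"),
    ("INC-2", "timezone"),
    ("INC-3", "mail-delivery"),
    ("INC-4", "csv-export"),
    ("INC-5", "mail-delivery"),
    ("INC-6", "timezone"),
]

print(find_problems(incidents))
# [('mail-delivery', 3), ('timezone', 2)]
```

Anything that shows up in the result is a candidate Problem for the whole team; the one-off `csv-export` Incident stays an Incident until it repeats.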

Sorry for the rambling, but I see this issue a lot in my work.

Lally Singh

on 31 Jul 08

We use a two-tier system:

1. a user-level trouble tracker. One ticket per user problem.

2. A developer-level trouble tracker. Bugs and necessary features. User-level tickets that are due to software problems (even if only symptoms are listed) are ref’d to the dev trouble tickets.

More user-level tickets linked to a dev-level ticket --> higher priority.

Anytime a user support issue could’ve been reduced/avoided by a change in the software, a trouble ticket is added (or existing one added to). Let the developers prioritize/ignore them at will. This way you keep the devs sync’d with what’s really going on.
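A minimal Python sketch of the two-tier idea above, assuming nothing about any real tracker: the `DevTicket` and `Tracker` names, the ticket IDs, and the summaries are all made up for illustration. The only rule it implements is the one stated in the comment, that more linked user-level tickets means higher priority.

```python
from dataclasses import dataclass, field


@dataclass
class DevTicket:
    """A developer-level ticket: a bug or a necessary feature."""
    ticket_id: str
    summary: str
    user_ticket_ids: set = field(default_factory=set)

    @property
    def priority(self) -> int:
        # More linked user-level tickets means higher priority.
        return len(self.user_ticket_ids)


class Tracker:
    """Hypothetical in-memory store linking user tickets to dev tickets."""

    def __init__(self) -> None:
        self.dev_tickets = {}

    def link(self, user_ticket_id: str, dev_ticket_id: str, summary: str = "") -> DevTicket:
        """Reference a user-level ticket from a dev ticket, creating it if needed."""
        ticket = self.dev_tickets.setdefault(
            dev_ticket_id, DevTicket(dev_ticket_id, summary)
        )
        ticket.user_ticket_ids.add(user_ticket_id)
        return ticket

    def by_priority(self) -> list:
        """Dev tickets ordered by number of linked user tickets, highest first."""
        return sorted(
            self.dev_tickets.values(),
            key=lambda t: (-t.priority, t.ticket_id),
        )


tracker = Tracker()
tracker.link("U-1", "DEV-7", "Export times out on large projects")
tracker.link("U-2", "DEV-7")
tracker.link("U-3", "DEV-9", "Confusing invite email")

print([t.ticket_id for t in tracker.by_priority()])
# ['DEV-7', 'DEV-9']
```

Storing the linked IDs in a set means that referencing the same user ticket twice does not inflate a dev ticket's priority.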

Corey Reid

on 31 Jul 08

At FreshBooks, we appoint a support developer per iteration. So if the current iteration is two weeks, one developer is the Support developer for that iteration. That means their primary customers are our Support folks, and they take care of whatever issues the Support people want taken care of.

That might be bugs, but it might also be little enhancements that customers are asking for. Two weeks I think is sufficient to build up a sense of ownership—and the rotation ensures that all the developers on the team have chances to work on different parts of the code and learn about how it works.

If you know your peers are going to be coming along behind you and working in your code, you tend to be more careful about your solutions.

Our developers have differing skill sets so some iterations Support will focus on specific areas—if Jeff’s their guy they’ll go after interface stuff, whereas if Taavi is on deck they want to see back-end fixes.

I think it’s a great system that encourages ownership, fits with an agile model and keeps “exploding bugs” from interfering with ongoing product work.

But we’re just starting with this model so it might blow up in my face. You never know—my face is plenty scarred already!

SZ

on 31 Jul 08

I agree w Ben.

“Solving the root cause is the right thing to do, but going to a week isn’t going to magically make that happen.”

I think it has more to do w the person’s analytical skills and their sense of responsibility than anything else. We use a one-week approach and it does not make one bit of difference. It is always about fixing the incident, not the problem.

Charles T

on 01 Aug 08

Two observations about human nature:

1. Engineers like to engineer.

2. People care about their appearance.

So I think the best solution is going to work off of these. Here are some random untested ideas to think about:

For #1:

a. Have a manager assign points to bugs, and let everyone see the queue. Points can be raised as problems linger, and if an issue is re-opened then the closer loses points. Every X points means a free lunch or a movie ticket. Engineers love to optimize. Some will dive for the easy points, others will take on 4 bugs if they think the root cause is the same issue. If someone fixes just the symptom they’ll open a new root-cause bug worth more points.

Everyone’s points for the last 4 weeks are tracked weekly. Managers get to use the points per month as way to size the cost of support.

b. Instead of fixing bugs, have developers write tests to reproduce them. Then schedule the work of fixing the bug with someone else. This avoids future bugs, while giving more time and credit to fixing hard stuff.

For #2:

a. Have the rotation be 1 week of fixing, then 1 week of reviewing bugs. This way there are two sets of eyes on every problem. The fixer reassigns the bug to the reviewer when they think it’s nailed (with code diffs). Knowing that someone else is double checking makes folks more careful. It helps the fixers learn from reviewers as well.

b. If there’s a regular weekly meeting have the fixer for the past week review the list of bugs and resolutions for bugs that were closed. That motivates the fixer to do a good job and be ready for questions, and helps spotlight repeating problems across the team.
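As a rough illustration of the points scheme in idea 1a above, here is a small Python sketch. Everything in it is an assumption made for the example: the `Bug` and `Scoreboard` names, the aging bonus, the rule that a re-opened bug costs the closer exactly the points they earned for it, and the engineer name.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bug:
    bug_id: str
    points: int                      # assigned by a manager
    closed_by: Optional[str] = None  # None while the bug is open


class Scoreboard:
    """Hypothetical ledger of points earned by closing bugs."""

    def __init__(self, points_per_reward: int) -> None:
        self.points_per_reward = points_per_reward
        self.scores = Counter()

    def age(self, bug: Bug, bonus: int = 1) -> None:
        """Raise the value of a bug that is still open and lingering."""
        if bug.closed_by is None:
            bug.points += bonus

    def close(self, bug: Bug, engineer: str) -> None:
        """Credit the engineer with the bug's current points."""
        bug.closed_by = engineer
        self.scores[engineer] += bug.points

    def reopen(self, bug: Bug) -> None:
        """A re-opened bug costs the closer the points earned for it."""
        if bug.closed_by is not None:
            self.scores[bug.closed_by] -= bug.points
            bug.closed_by = None

    def rewards(self, engineer: str) -> int:
        """Number of rewards (lunches, movie tickets) earned so far."""
        return max(self.scores[engineer], 0) // self.points_per_reward


board = Scoreboard(points_per_reward=10)
bug = Bug("B-1", points=8)

board.age(bug, bonus=2)        # the bug lingered, so it is now worth 10
board.close(bug, "dev_a")
print(board.rewards("dev_a"))  # 1

board.reopen(bug)              # the fix did not hold
print(board.rewards("dev_a"))  # 0
```

The re-open penalty is what nudges the scheme toward root causes: closing a symptom quickly only pays off if the bug stays closed.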

Merle

on 01 Aug 08

Just fix the problem once and for all when it occurs.

You can only build higher on a solid foundation.

If you don’t get to the “root cause” of an existing problem you are SERIOUSLY jeopardizing the integrity of any subsequent development work.

Tom G

on 01 Aug 08

It strikes me that 37 Signals is small enough that the Tech Support staff would know the person who knows the most about the area of concern.

Don’t you escalate support issues to the best qualified person depending on the situation?

It’s possible you need to hire an up and coming programmer willing to do second level tech support for the mundane stuff as well as a total quality management program.

As for solving problems once and for all, that should go without saying. There are exceptional cases where you have to apply a band-aid in an emergency to buy time for a permanent solution. It’s really critical that band-aid get left in place though.

Anonymous Coward

on 01 Aug 08

As for solving problems once and for all, that should go without saying.

A couple of people have said this. It sounds right, but I don’t think it is.

I think many, and especially the authors of ‘Getting Real’, have to be careful here. It’s very disingenuous to say ‘Build it now’ and then scold or suggest that must be final, particularly when the code base itself is in a constant state of flux. In my experience, as often as not new features are as culpable as old code.

Sending mixed messages will lead to more confusion and even less of a sense of ownership.

GeeIWonder

on 01 Aug 08

That was me—sorry.

Stephan

on 01 Aug 08

This could be interesting for this blog in general: Eureka Carpark.

Charles400

on 01 Aug 08

You’re on the right track. Some suggestions:

1. One week is too short. Go two weeks or more.

2. Schedule this out. Make sure the same person doesn’t get stuck with the support monkey for all the major holidays.

When everyone (repeat, everyone) has a good balance between support and new development: a. Everyone owns support issues b. You begin to identify and solve root causes c. Everyone continues to “feel the pain” of the customer.

The worst thing to do is hire a dedicated support staff. The insulation would hurt your product in the long run.

(I managed a software and support staff of 40 people for a commercial banking product; I learned these things the hard way)

Alex Beamish

on 05 Aug 08

I just finished doing a week’s worth of Technical Support, and it sure does teach you a lot about what products and services the company provides. I think I ended up filing just two Mantis bugs, and didn’t do any root cause work at all, but even a week spent answering customer E-Mails is a great eye-opener, and every developer should do it.

Of course, we have a web application that automates everything, and we’re usually second line support, but once we take over a ticket, we deal with all of the communication after that.

I disagree, however, that a developer should figure out the ‘root cause’ while they’re still on support. Development is a long, slow process that doesn’t mix well with stopping every half hour to answer a few more E-Mails.

This discussion is closed.

About David

Creator of Ruby on Rails, partner at 37signals, best-selling author, public speaker, race-car driver, hobbyist photographer, and family man.
