We’ve got something cooking and we could really use your help. Grab your recent Android device, Basecamp (not Classic) account, and join the beta.
Chip Pedersen has been in tech for more than 25 years, 18 in the game industry. He’s led teams at Microsoft and Activision; done R&D for Apple; managed projects for huge brands like DreamWorks, MTV, Discovery Channel, History Channel, the MLB and the NHL.
You should probably hire him.
The catch is that he lives in Minneapolis, and he’s not going anywhere. “I’m just a Minnesota kid who wants to stay here,” he says. “All my three sons were born in different states. We moved back to Minnesota where we’re from. We like it here. We don’t want to go anywhere.”
“I have had offers to return to the West Coast, but I just don’t want to do that.” He says his old Silicon Valley pals keep cajoling him: “‘come back out West; we’ll hire you right now!’ and my wife’s like, ‘no!’ We just celebrated our 25th wedding anniversary, and she said, ‘I’ve moved for you all these years, now can you do it for me and stay home?’ and I said, ‘sure.’”
So home they’ll stay. Pedersen is currently a gun for hire (with his company Golden Gear Consulting), and he likes it that way — he just wishes more businesses were open to the idea of remote work. “You can get great talent and let them be where they are,” he says, “and not have to put up with the cost of living in San Francisco.”
Most of the recruiters and hiring managers who reach out to developer Eric Fleming want him to move, though he’s confident he does a better job working from home. “I can concentrate on my work and there’s no one here to distract me from that,” he says. “There’s no one coming over and tapping me on the shoulder and asking me about something. They may send me a message on Skype or Google Hangouts or something like that, but I can ignore that easier than I can someone coming into my personal space.”
Fleming recently tweeted about being in the market for a new position if anybody has a need for an experienced developer. Predictably, the first reply came from an IT recruiter: “Eric, would you consider moving to Austin or are you looking to remain in J-ville?”
That recruiter — Mark Cunningham, owner of The Bidding Network in Austin — says zero percent of his clients (primarily startups) are open to hiring remote workers. “If we’ve got some crackerjack Java developer who just has something amazing but he lives [20 miles away] in Cedar Park and the startup’s located downtown, we might work something out,” Cunningham says. But for the most part, his clients want to take advantage of the chemistry that results from everyone working in close concert.
“They worry about the loss of synergy, and the collaboration, and then the fires that are stoked from elite software engineers and elite professionals being together face-to-face and what comes from that,” Cunningham says. “That’s where they’re hesitating.”
Fair enough — there’s no denying there are advantages to having everyone in the same room. But when you stack the advantages that come with putting local heads together against the advantages of hiring the best heads from everywhere and collaborating remotely … well, it’s fairly clear where we stand on that.
“Give people the flexibility to work where they feel more comfortable working,” Fleming says. “They’re going to give you better results. It’s better for the company overall.”
Pedersen feels that for the more established companies he’s worked with, the hesitation comes from being stuck in a “face time = work time” paradigm. If you aren’t working onsite, “they think you’re goofing off,” he says.
“I’ve definitely worked at a number of companies where it was about the time you spent there. You may not have been doing much, but you were there. Microsoft was a little bit like that … I had a futon in my office and I would sleep there.”
What will it take for that cultural shift to happen, for companies to begin to allow people to work from wherever they like as long as the work is getting done? A leap of faith, Pedersen says.
“Do a small test,” he suggests. “Try it out. If you can’t find the person you’re trying to hire — if you’ve been looking forever to hire somebody and you can’t find them because they’re not in your region — look for a remote worker. You’re probably going to find an excellent person to meet your needs and get your stuff done. Probably within your budget and faster. Take those leaps when you see the opportunity.”
It comes down to results, Pedersen says. With the teams he manages, he does his best to treat everyone like adults and focus on the work itself. “If they’re getting their stuff done … I’m staying with that person. They got it done last time; they keep getting it done. I don’t care if they live in Venezuela; they’re getting it done.”
The natural tendency of growth is towards specialization. When you only have a few people, they must by necessity do everything. When you have more people, there’s enough room and slack to let people build specialization kingdoms that only they have the keys to. Don’t be so eager to let that happen.
Specialization might give you a temporary boost in productivity, but it comes at the expense of overall functional cohesion and shared ownership. If only Jeff can fiddle with the billing system, any change to the billing system is bottlenecked on Jeff, and who’s going to review his work on a big change?
But it goes even deeper than that. For example, we have all programmers work on-call as well. Everyone gets to feel the impact of customers having trouble with our code (this is on top of Everyone on support).
This was really put to the test recently when we started working on a number of iOS and Android projects. Should we hire new specialists from the outside, or should everyone do everything, with our existing team learning the ropes? In that case, we ended up doing both: hiring a little, because we needed the capacity anyway and wanted someone with experience, but also investing in the existing team by having them learn iOS and Android from scratch.
Good programmers are good programmers. Good designers are good designers. Don’t be so eager to pigeonhole people into just one technology, one aspect of your code base, or one part of your business. Most people are eager to learn and grow if you give them a supportive environment to do so.
Guess what these Google domain icons do. I’ll go first: Send a locksmith, Start a party, Call a handyman, Jump out the window, Put on your seatbelt, Use a lifeline, Start the machine.
Employee benefits for technology companies are often focused on making people stay at the office longer: Foosball tables, game rooms, on-site training rooms, gourmet chefs, hell, some even offer laundry services. We don’t do any of that (although we do have a ping-pong table in a back room that gets wheeled out for our bi-yearly meetups).
Instead we focus on benefits that get people out of the office as much as possible. 37signals is in it for the long term, and we designed our benefits system to reflect that. One of the absolute keys to going the distance, and not burning out in the process, is going at a sustainable pace.
Here’s the list of benefits we offer to get people away from the computer:
- Vacations: For the last three years in a row, we’ve worked with a professional travel agent to prepare a buffet of travel packages that employees could pick from as a holiday gift. Everything paid for and included. Having it be specific, pre-arranged trips — whether for a family to go to Disneyland or a couple to tour Spain — has helped make sure people actually take their vacations.
- 4-day Summer Weeks: From May through October, everyone who’s been with the company for more than a year gets to work just four days a week. This started out as “Fridays off”, but roles like customer support and operations need to cover all hours, so now it’s just a 4-day Summer Week.
- Sabbaticals: For every three years someone has been with the company, we offer the option of a one-month sabbatical. This in particular has been very helpful at preventing or dealing with burnout. There’s nothing like a good, long, solid, continuous break away from work to refocus and rekindle.
To come up with the best ideas, you need a fresh mind. These travel and time-off benefits help everyone stay sharp. But it goes beyond that. Even the weeks when people are working full-on, we offer benefits focused on keeping everyone healthy in other ways too:
- CSA stipend: We offer a stipend for people to get weekly fresh, local vegetables from community-supported agriculture. Eating well is good, cooking at home is good, doing both is great.
- Exercise stipend: Whether people want to take yoga classes or spend money on their mountain bike, the company chips in. Eating healthy goes hand-in-hand with getting good exercise. And we sit down for too much of the day as it is, so helping people be active is important.
These benefits form the core of our long-term outlook: Frequent time to refresh, constant encouragement to eat and live healthy. Pair that with the flexibility that remote working offers, and I think we have a pretty good package.
It’s always a real pleasure and a proud moment when our internal Campfire lights up with an anniversary announcement. Like Jeff celebrating 6 years this month, Sam celebrating 8 years and Ann 3 years last month.
We ultimately want 37signals to have the potential of being the last job our people ever need. When you think about what it’ll take to keep someone happy and fulfilled for 10, 20, 30 years into the future, you adopt a very different vantage point from our industry norm.
Name: Chris Hoffman
Title: Co-Founder, Director of Marketing Strategy
Company: It Collective
Based in: Colorado Springs, Colo.
What does your company do?
We offer film production and content marketing strategy services. On the marketing strategy side, we work with clients to identify key stories and messages that will resonate with and be shared amongst a target audience — then we help them tell those stories through the creation of that content and the execution of a marketing strategy. On the film side, we produce everything from commercial spots to short films, and just recently finished our first feature-length production — a live concert film for Gungor, an incredibly talented band who have recently been nominated for a couple of Grammys.
How many people work for the company, and of those, how many work remotely?
We are 100% remote. Our business model is project-based, so our team changes in size depending on the number and types of projects we have in house. We went the contractor direction instead of hiring full-time employees for a number of reasons. Primarily, it allows the flexibility to resource the ideal skill sets for each project. Secondarily, hiring individuals who prefer working in a contract setting helps us filter out the people who require micro-management — in other words, people who are not suited for a remote work system. The people we hire are used to managing their own time and workflow. We have around 10 team members that we work with on a regular basis.
Did you start out as a remote company?
We did, and I’d love to say that we had some great strategy behind that decision. In reality, it was made because we didn’t have the startup capital to pay for an office space. We strongly believe in the concept of bootstrapping, and have gotten off the ground without taking on any debt or external capital investment.
We’ve found that we have a great love for hosting face-to-face meetings in coffee shop or home office settings, and that our clients often love meeting in those settings as well. We recently conducted a major client review meeting on a film project in the living room of Andy Catarisano — our Co-Founder and Director of Film Production. We picked apart the final edits over homemade popcorn and cookies. I think our clients loved the experience as much as the final product. It was significantly more effective than presenting in a polished boardroom.
When we need a larger space we rent the tricked-out conference room of a local co-working establishment. Obviously there are occasions when the home office and local Starbucks won’t work, and we don’t pretend that our system will work for everyone. We’ve found a way that works for us to do business without a set physical space, and we aren’t in a hurry to change that.
What challenges did you face in setting up as a remote company?
One major challenge (for those of us who came from a traditional corporate environment) was overcoming the mentality of a 9-5 workday that had been ingrained more deeply than we realized. For me personally, it has taken a very intentional effort to ask myself the right questions about my daily activities. I’ve had to learn to look at the day through the lens of, “What is the most high-impact use of my time?” as opposed to, “It’s 3 p.m. — I should be at my desk.”
U.S. work culture has conditioned employees to feel like they are fulfilling their duty to the company they work for by being in their seats for 8 hours in a day. In reality, those employees may or may not be producing anything of value. The amount of time spent at a desk is completely irrelevant to the value and quality of work, and that has been a tough lesson to learn.
What do you see as the major benefits of being a remote company?
The first major benefit is the effect it has on morale, and in turn, the increase in quality of work and dedication to the company. Here is one very practical example of this benefit: Commute time.
Think about how ridiculous it is to demand that an employee sit in rush hour for an hour or more each morning and evening, just to be in by 9 a.m. and leave at 5 p.m. How simple of a switch would it be to allow that team member to work from home until 10 a.m., then arrive at the office in 30 minutes or less with no traffic? That switch translates to well over 200 hours of time given back to that person every year to do as he or she pleases — to spend extra time with family, invest in a personal project, or just take some additional space for decompression.
I’d love for someone to give me a reason that justifies not giving one of your staff 200 hours of their lives back each year in exchange for zero productivity loss. An unwillingness to discuss these types of changes to a work schedule that provide such tangible benefits is just plain arrogance on the part of a management team.
A second huge benefit is the expansion of the talent pool it provides. Instead of being limited to the labor pool within 100 miles of our location, we literally have worldwide talent at our fingertips. We regularly work with a film colorist who lives in Sydney, Australia, and the quality of work we received was vastly superior to anything within our immediate geographic area.
One really interesting thing about working with international teams is that you have almost 24 straight hours of productivity at your disposal. We’d do work in the U.S. on the project, meet briefly at the end of the day with our team member in Sydney before signing off, and then turn it over to him to continue the work. It’s an amazing experience to go to bed, get a great night’s sleep and wake up to a project that is further along than when you left it.
The other really major benefit for us is the freedom to tailor the work environment to the type of work being performed. An example that really stands out to me is one of our recent projects, a feature-length film. Editing a 90-minute film together is one of the most incredibly detailed processes I’ve ever seen, and it requires a huge amount of focus and precision. We worked with an amazing team of editors on this project — the kicker was that they preferred to do their editing nocturnally, from about 10 p.m. to 8 or 9 a.m. The world is quiet then — there are zero interruptions and that was their period of ultimate creativity and effectiveness. A remote work environment allowed us to say yes to that request, and the results were outstanding.
Any advice for other companies who are considering going remote?
The thing about remote work is that it magnifies existing dysfunction in the workplace. An organization with a highly functional team and a deep understanding of role clarity and how to work together in an effective manner is going to have a much easier time transitioning to a remote work structure. A dysfunctional team is going to have a much more difficult time making that leap, because the freedom of working remotely magnifies those inefficiencies.
A physical office space has long been used as a safety net for managers to sweep the messes of their team dynamics under the rug rather than address them. Being able to walk down the hallway every 15 minutes to micromanage employees can (sometimes) cover up poor hiring decisions. It can compensate for a failure to plan. It can also provide a false sense of security for a manager who needs to micromanage to feel effective in their position. Working remotely immediately removes those safety nets and exposes the true functionality of a team. If you’re thinking about making the leap to a remote work environment, it’s important to ask these questions about your team and be very honest in your answers.
Visit It Collective.
Picked up a great lesson from the book Turn The Ship Around. David Marquet, the author and nuclear sub captain, says you can’t empower people by decree. While you might be able to ask someone to make a decision for themselves, that’s not true empowerment (or true leadership). Why? Because you’re still making the decision to ask them to make the decision. That means they can’t move, or think, or act without you. The way to empower people is by creating an environment where they naturally start making decisions for themselves. That’s true empowerment. Leaving space, creating trust, and having the full faith that someone else will rise to the challenge themselves.
Doing business with a company means you’re not just buying their products, but the experience of having their people, opinions and expertise, too.
Some companies really understand great customer support and service; others fail hard. The latter was the case with my recent (and now only) experience with the Canadian online menswear retailer Frank & Oak.
My story is common: I ordered a couple of items, but one got lost in transit. I had full faith that customer service at Frank & Oak could help me track it.
I got a week of radio silence through their online form and email. Resorting to Twitter, I finally got a reply a couple of days later: “we’ll email you.”
Fast forward three weeks from their first reply and we’ve got two valuable lessons from their final correspondence:
I usually answer my email within 3-4 days, but since you sent 3 emails, the number of days showing since our last communication stayed the same. Please wait for a response next time, so that I don’t loose track of our communication.
1. Blame the customer: Three emails in a three-week span; of course it’s my fault.
2. Passive-aggressively tell the customer they’re annoying: In 2013, most email clients order messages by time of receipt. My fault, I didn’t know that yours doesn’t.
Every bit of this Frank & Oak email makes it my fault. So much for making customers feel like a bad ass.
For examples of how to avoid bad customer service like this, you can read how Ryan switched to T-Mobile and had a great experience, or how we turned our own disasters into gold. And whether you work on a support team or not, everyone should give Carnegie a read. You’ll make more friends, and probably more customers.
In the meantime, I’m going to find a place to buy a nice shirt.
Back in June we launched Know Your Company, a tool for helping company founders, owners, and CEOs get to know their companies again.
A few hundred one-on-one demos later, we’re about to hit our 100th paid customer.
Because of Know Your Company, thousands of employees have a louder voice, and a hundred company owners have bigger ears. Employees are sharing things they’ve never been asked about before, and owners are hearing things they’ve never heard before. New insights come weekly, and more feedback is flowing in both directions. Things are changing for the better at Know Your Company companies.
Back of the napkin financials
From the business side, in just six months, Know Your Company has booked $390,000 in revenue (and is profitable). The pricing model is $100 per employee, one time (once you pay for someone, you never pay for them again). The smallest customer has 16 employees, the largest has 105. As existing customers grow or replace employees, about 20 new employees are added to the system every week. Customer retention is holding strong at 99% (unfortunately we’ve had one cancellation).
Referrals are healthy too – we get a fair number of emails from CEOs who’ve heard of Know Your Company from existing Know Your Company customers. Even more promising, we’ve been hearing from CEOs who heard about Know Your Company from their employees!
What started as a hunch, then launched as an internal experiment before ultimately becoming a commercial product, has blossomed into a thriving business.
In the spirit of continued experimentation, we’re about to take it up a notch and try something we’ve never done before: We’re spinning off Know Your Company into its own business.
In January 2014, Know Your Company the product will become Know Your Company the company, separate from 37signals.
Meet Claire Lew, the new CEO of Know Your Company
The new company will be co-owned by 37signals and Claire Lew. Claire will be the CEO and run all day-to-day operations. We’ll be on the sidelines purely as advisors, ready to help if called upon. If all goes well, Claire will ultimately own more of the company than 37signals will.
So who’s Claire? Claire’s someone we’ve had our eye on for a while. They don’t come much sharper (and nicer!) than Claire. In fact, we originally contemplated hiring Claire to run Know Your Company from the start, but things just didn’t come together.
Claire went off to start ClarityBox, a consulting practice aimed at helping owners understand what their employees really thought. You can watch her talk about it here:
ClarityBox’s mission was similar to Know Your Company’s. We obviously saw the same kinds of problems out there and wanted to help solve them in similar ways.
So once it was clear that Know Your Company had legs, and that we wanted to spin it off into its own company, Claire was the natural match to run it.
I pitched her the idea and she was into it. We hammered out a deal and related details in a couple of weeks and signed the formal agreement yesterday. We’ll be transitioning the company and product over to Claire this month, and she’ll run it completely starting in January. I’ve heard some of her initial ideas so I’m excited to see where she takes it.
Know Your Company
So if you’re a founder, owner, or CEO of a company between 25 and 75 people, and you feel like you don’t know as much about your company as you used to, it’s time to get to Know Your Company again. Claire will show you how.
It took more than a year and three distinct attempts to get Google Docs in Basecamp ... and still, the damn thing almost didn’t get built. Why was it so hard?
We knew we needed it. Integration with Google Docs was a super-popular feature request, and usage in general is on the rise. Since Basecamp is a repository for everything project-related, it made sense to show the same love to Google Docs we show to any other type of file you can store in a Basecamp project.
Problem was, we don’t really use Google Docs ourselves. And we’re kind of notorious for scratching our own itch and not building shit we don’t need. It’s absolutely the exception that we would create a feature we didn’t plan on using. (For years, to-dos in Basecamp Classic didn’t have due dates, because we just work on things until they’re shippable. It wasn’t until enough customers hollered at us that we eventually added them.)
“We know tons of our customers use Google Docs; they have to,” says Jason Z. “Everybody’s using Google Docs. So we know it’s useful, we know people are asking for it all the time. There just comes a point where we have to figure it out.”
Shortly after launching the new Basecamp in March 2012, a small team explored what it would take to link to Google Docs from Basecamp. “We started with a little experiment to see whether the tools Google provides are enough to do basic integration,” said Jeremy, the programmer on that first spike. The goal was to be able to “pick a file from Google without having to commit to deep integration that changes the way Basecamp works.”
Google’s file picker made integrating with Google Docs easy, but rendered switching between accounts (if you’re signed in as one user and need to sign in as someone else) nigh on impossible. And we got hung up on what to do about permissions: Our choices seemed to be either allowing anyone who had the link to edit the document, or letting Google handle permissions and suffer the nasty flow and UI that resulted (more on that later).
With the account switching problem, our choices were to wait for Google to improve their tools, or scrap that and find some other way to integrate — i.e., roll up our sleeves and build our own picker. “That led to a waiting game,” Jeremy recalled: “if Google’s own tools got good enough that we could use them, then we’d have an easier time integrating.” So we punted.
Managing the two steps separately gave us the flexibility we needed to resolve the account switching issue, but the permissions demon was still rearing its ugly head. We punted again until we’d have more time to explore it.
Each time we felt like we were getting close, we’d reach the same stalemate. No one knew which of the two options for handling permissions was the lesser of two evils:
- Allow anyone with the link to view the document. This route would have meant sharing a Google Doc in Basecamp = changing its permissions so anyone with the link could view and change it. Other tools handle permissions this way; it makes things pretty easy and keeps the UI clean. But it creates a pretty gnarly security concern, in that there’s no way to revoke access later. People no longer employed at an organization might be removed from its Basecamp account, but still have access to proprietary information stored in Google Docs. Or users might share the link with outsiders who could then access and edit the document anonymously. No bueno.
- Let Google be the gatekeeper. When permissions are set within the Google account and Basecamp doesn’t mess with them, we get to wash our hands of security concerns. Convenient for us! But it passes this potential morass of access seeking and granting onto our users: The viewer has to be signed into Google, and they need permission to view the document to see the preview in Basecamp. If they don’t have permission, they can request it through Basecamp. They’ll then be directed to a Google page, and from there, the request is emailed to the Google Doc’s owner. When the owner grants access to the document, Google sends an automated email to the viewer with a link to view it. “A lot of us were feeling like this leads to a pretty crappy experience,” Javan says, “because you click on the doc and then you hit this brick wall.”
“I was worried that people wouldn’t understand that, because I didn’t understand it,” recalls Ann from QA. “I did an experiment with the support team where I shared a Google Doc with them … I got all kinds of requests to view the document, because I hadn’t given them permission yet. I was afraid that oh my God, every customer was going to see that.” Adding a private file to a Basecamp project with 150 people on it might generate 150 email requests for access to the file. That was too big of a burden to pass along to customers.
The temptation was to punt a third time — only that was no longer an option. “We decided very clearly that if we don’t do it this time, if we don’t figure this out, we’re basically saying that Basecamp is not ever going to have this,” Jason Z. says. “Because why would we take a fourth attempt? That would be ridiculous.”
The pressure to “ship or get off the pot” led the team to explore other possibilities, like building a folder system that would copy Google Docs into a Basecamp project folder on Google Drive, or using Box.net’s Google Docs integration. We finally started to wonder whether the people who wanted Google Docs in Basecamp might already have the permissions thing dialed in. Jeremy chimed in at that point:
Companies switch to Google Apps from company Exchange email and central network fileservers. They “go Google.” Everyone at work is on Google, signed in, and has access to email, drive, calendar, contacts, etc. Google Apps recommends default sharing settings that are a lot like having an old-school central fileserver: newly created files are visible to others by default. There’s no sharing step or permissions-request dance: https://support.google.com/a/answer/60781. This is a golden path. It’s well-integrated and it’s the default when a company goes Google.
That perspective alleviated a lot of the trepidation we had about what users would see when they clicked on a Google Doc — the hope was that if people were already using Google Docs at work, they could probably already access everything they needed by default. The access nightmare we envisioned wouldn’t occur if companies’ Google Apps admins were already setting up good defaults, the way Google recommends.
We still weren’t 100 percent convinced we had it right, but it felt good enough for v.1 — to be hands-off, and let the people who use it figure it out (with help, of course). “It’s funny how long the project went on, and in the end, it’s almost simpler than where we started,” Javan says. “But I guess that makes sense.”
“We made a bet on this permissions thing,” Jason Z. says. “We don’t use the feature, so we don’t know. We can’t anticipate what the pain points are going to be here.”
A month or so after shipping, it’s looking like we made the right bet. The majority of feedback has been of the thank-you-so-much-for-adding-this! variety. So far, 56 percent of users are logged into Google when trying to preview a document from within Basecamp, and of those, 91.5 percent already have access to the document they were trying to view. For how much concern there was over whether we were making the right call with permissions, it’s been super quiet. “We were really expecting more confusion, because we were confused,” Ann says. “The people who do use it know how to use it, and I guess we’ve fallen in with their expectations.”
“That’s a super important lesson just in product design in general,” Jason Z. concludes. “You can engineer all kinds of things, and they might be the wrong things if you don’t know. So it’s better to under-engineer and let the pain kind of bubble up organically, than to guess wrong.”
On Dec. 4 around 5:30 p.m. CT, a number of our sites began throwing errors and were basically unusable. Specifically, Basecamp Classic was briefly impacted as it was very slow. Campfire users experienced elevated errors and transcripts were not updated for quite some time. Highrise was the most significantly impacted: For two hours every page view produced an error.
Why our sites failed
When you visit a site like Basecamp it sends you information that’s generated from a number of database and application servers. These servers all talk to each other to share and consume data via connections to the same network.
Recently, we’ve been working to improve download speeds for Basecamp. On Tuesday afternoon we set up one server with software that simulates a user with a bad Internet connection. This bad traffic tickled a bug in a number of the database and application servers, causing them to become inaccessible. That, ultimately, is why users received error messages while visiting our sites.
How we fixed the sites
We powered off the server sending out the bad traffic. We powered back on the database and application servers that were affected. We checked the consistency of the data and then restarted each affected site.
How we will prevent this from happening again
- We successfully duplicated this problem so we have an understanding of the cause and effect.
- We asked all staff not to run that specific piece of software again.
- We know someone might forget or make a mistake, so we set up alerts to notify us if the software is running anywhere on the network. We verified the check works too.
- We are working with our vendors to remove the bugs that caused the servers to go offline.
Our network is configured with multiple redundant switches in the core, two top-of-rack (TOR) switches per cabinet, and every server has at least 2×10GbE or 2×1GbE connections split over the TOR switches. Servers are spread among cabinets to isolate the impact of a loss of network or power in any given cabinet: application servers are spread throughout multiple cabinets, master and slave database pairs are separated, etc. Finally, the cabinets are physically divided into two “compute rooms” with separate power and cooling.
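For context, the server side of such a setup typically bonds the two TOR-facing NICs into one logical interface. A rough sketch with iproute2 (interface names and bond mode are illustrative; the matching switch-side LACP configuration is not shown):

```shell
# Illustrative sketch, not our actual config: enslave two NICs
# (one cabled to each TOR switch) into an LACP (802.3ad) bond.
# Requires root; eth0/eth1/bond0 are hypothetical names.
ip link add bond0 type bond mode 802.3ad
ip link set eth0 down
ip link set eth0 master bond0
ip link set eth1 down
ip link set eth1 master bond0
ip link set bond0 up

# The kernel exposes bond state for inspection:
cat /proc/net/bonding/bond0
```

With this arrangement, losing a single TOR switch (or one cable) degrades bandwidth but keeps the server reachable.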
Before the failure
We’ve been investigating ways to improve the user experience for our customers located outside the U.S. Typically these customers are located far enough away that best-case latency is around 200 ms to the origin, and many traverse circuits and peering points with high levels of congestion and packet loss. To simulate this type of connectivity we used netem. Other significant changes preceding the event included: an update to our knife plugin that allows us to make network reconfiguration changes, the decommissioning of a syslog server, and an update of check_mk.
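netem is applied through the kernel’s traffic-control tool, `tc`, on an egress interface. A sketch of the kind of impairment used to mimic a distant, lossy connection (interface name and exact numbers are illustrative, not our actual test parameters):

```shell
# Hypothetical netem setup on a test box (requires root; eth0 is illustrative).
# Adds ~200 ms of delay with jitter and 1% packet loss to all egress traffic.
tc qdisc add dev eth0 root netem delay 200ms 40ms loss 1%

# Verify the qdisc is in place.
tc qdisc show dev eth0

# Remove the impairment when testing is done.
tc qdisc del dev eth0 root
```

Loading netem pulls in the `sch_netem` kernel module, which matters later when it comes to detecting that someone is running it.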
At 5:25 p.m. CT, Nagios alerted us that two database and two bigdata hosts were down. A few seconds later Nagios notified us that 10 additional hosts were down. A “help” notification was posted in Campfire and all our teams followed the documented procedure to join a predefined (private) Jabber chat.
One immediate effect of the original problem was that we lost both our internal DNS servers. To address this, we added two backup DNS servers to the virtual server on the load balancer. While this issue was being addressed, other engineers identified that the affected applications and servers were in multiple cabinets. Since we were unable to access the affected servers via out-of-band management, we suspected a possible power issue. Because the datacenter provides remote-hands service, we immediately contacted them to request a technician go to one of our cabinets and inspect the affected servers.
We prioritized our database and nosql (redis) servers first, since they were preventing some applications from working even in a degraded mode. (Both our master and slave servers were affected, and even our backup db host was affected. Talk about bad luck …) About five minutes after we had a few of the servers online, they stopped responding again. We asked the onsite technician to reboot them again, and we began copying data off to hosts that were unaffected. But the servers failed again before the data was successfully copied.
From our network graphs we could see that broadcast traffic was up. We ran tcpdump on a few hosts that weren’t affected, but nothing looked amiss. Even though we didn’t have a ton of supporting evidence that it was the problem, we decided to clear the ARP cache on our core, in case we had somehow poisoned it with bad records. That didn’t seem to change anything.
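The diagnosis steps above have straightforward Linux equivalents. A sketch (the commands on the core switch itself are vendor-specific; these are what you’d run on a Linux host, with an illustrative interface name):

```shell
# Watch broadcast traffic on a healthy host to look for a storm or
# anything anomalous (requires root; eth0 is illustrative):
tcpdump -ni eth0 broadcast

# Inspect the local ARP/neighbor cache for suspect entries:
ip neigh show

# Flush it, in case it holds poisoned records:
ip neigh flush all
```

Clearing the cache is cheap and low-risk: entries are simply re-learned from the network, which is why it’s a reasonable shot in the dark even with thin evidence.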
We decided to regroup and review any information we might have missed in our earlier diagnosis: “Let’s take a few seconds and review what every person worked on today … just name everything you did even if it’s something obvious.” We each recited our work. It became clear we had four likely suspects: “knife switch,” our knife plugin for making changes to our network; syslog-02, which had just been decommissioned; an upgraded version of the check_mk plugin that was rolled out to some hosts; and the chef-testing-01 box with netem for simulating end-user performance.
It seemed pretty likely that knife-switch or chef-testing-01 were the culprits. We reviewed our chef configuration and manually inspected a few hosts to rule out syslog-02. We were able to determine that the check_mk plugin wasn’t upgraded everywhere, and that there were no errors logged.
We shut down chef-testing-01 and had the remote-hands technician power on the servers that had just gone AWOL again. Since we were fairly sure this was a networking issue, very likely related to LACP/bonding/something similar, we also shut down one interface on each server in the hope that it would prevent a repeat performance. We disabled a single port in each bond, both on the switch and on the server. Then we waited 15 long minutes (about 10 minutes after the servers were booted and we had confirmed the ports were shut down correctly) before calling the all-clear. During this time we let the databases reload their LRU dumps so they were “warm,” restarted replication and let it catch up, and got the redis instances started up.
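Taking one leg of a bond out of service on the server side is a one-liner; a sketch under the same hypothetical names as before (the matching switch-port shutdown is vendor-specific and not shown):

```shell
# Illustrative: take one slave interface out of the bond (requires root).
ip link set eth1 down

# Confirm the bond is now running on a single active slave.
cat /proc/net/bonding/bond0
```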
With these critical services back online our sites began functioning normally again. Almost 2.5 long hours had passed at this point.
Finally, we made a prioritized list of application hosts that were still offline. For those with working out-of-band management, we used our internal tools to reboot them. For the rest we had the datacenter technician power cycle them in person.
- We were able to reproduce this failure with the same hardware during our after-incident testing. We know what happens on the network, but we have not identified the specific code paths that cause this failure. (The change logs for the network drivers leave lots to be desired!)
- We have adjusted the configuration of the internal DNS virtual server to automatically serve via the backup servers if the two primary servers are unavailable.
- We have added additional redis slaves on hosts that were not previously affected by the outage.
- We are continuing to pursue our investigation with the vendor and through our own testing.
- Everyone on the operations team has made a commitment to halt further testing (with netem) until we can demonstrate it will not cause this failure again.
- We have added “netem” to our Nagios check for blacklisted modules in case anyone forgets about that commitment.
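A check along these lines can be a simple probe against `/proc/modules`. This is a hypothetical sketch, not our actual plugin; the function name is made up, and the exit codes follow standard Nagios conventions (0 = OK, 2 = CRITICAL):

```shell
#!/bin/sh
# Hypothetical sketch of a blacklisted-kernel-module check.
# /proc/modules lists each loaded module, one per line, name first.
check_blacklisted_module() {
    module="$1"
    if grep -q "^${module} " /proc/modules 2>/dev/null; then
        echo "CRITICAL: blacklisted module ${module} is loaded"
        return 2
    fi
    echo "OK: ${module} is not loaded"
    return 0
}
```

Pointed at `sch_netem` (the kernel module behind netem) and run from the monitoring agent on every host, a check like this flags anyone who starts a netem test despite the moratorium.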
- We are updating our tools so that physically locating servers when Campfire (and thus our Campfire bot) is broken isn’t a hassle.
We’ve built a Google spreadsheet which outlines information about the hosts that were affected. We’re being a bit cautious with reporting every single configuration detail because this could easily be used to maliciously impact someone’s (internal) network. If you’d like more information please contact netem (at) 37signals and we’ll vet each request individually.