You’re reading Signal v. Noise, a publication about the web by Basecamp since 1999.


Guess what these Google domain icons do. I’ll go first: Send a locksmith, Start a party, Call a handyman, Jump out the window, Put on your seatbelt, Use a lifeline, Start the machine.

Healthy benefits for the long run

David wrote this (35 comments)

Employee benefits for technology companies are often focused on making people stay at the office longer: foosball tables, game rooms, on-site training rooms, gourmet chefs; hell, some even offer laundry services. We don’t do any of that (although we do have a ping-pong table in a back room that gets wheeled out for our bi-yearly meetups).

Instead we focus on benefits that get people out of the office as much as possible. 37signals is in it for the long term, and we designed our benefits system to reflect that. One of the absolute keys to going the distance, and not burning out in the process, is going at a sustainable pace.

Here’s the list of benefits we offer to get people away from the computer:

  • Vacations: For the last three years in a row, we’ve worked with a professional travel agent to prepare a buffet of travel packages that employees could pick from as a holiday gift. Everything paid for and included. Having it be specific, pre-arranged trips — whether for a family to go to Disneyland or a couple to tour Spain — has helped make sure people actually take their vacations.
  • 4-day Summer Weeks: From May through October, everyone who’s been with the company for more than a year gets to work just four days out of the week. This started out as “Fridays off”, but roles like customer support and operations need to cover all hours, so now it’s just a 4-day Summer Week.
  • Sabbaticals: For every three years someone has been with the company, we offer the option of a 1-month sabbatical. This in particular has been very helpful at preventing or dealing with burnout. There’s nothing like a good, long, solid, continuous break away from work to refocus and rekindle.

To come up with the best ideas, you need a fresh mind. These travel and time-off benefits help everyone stay sharp. But it goes beyond that. Even the weeks when people are working full-on, we offer benefits focused around keeping everyone healthy in other ways too:

  • CSA stipend: We offer a stipend for people to get weekly fresh, local vegetables from community-supported agriculture. Eating well is good, cooking at home is good, doing both is great.
  • Exercise stipend: Whether people want to take yoga classes or spend money on their mountain bike, the company chips in. Eating healthy goes hand-in-hand with getting good exercise. And we sit down for too much of the day as it is, so helping people be active is important.

These benefits form the core of our long-term outlook: Frequent time to refresh, constant encouragement to eat and live healthy. Pair that with the flexibility that remote working offers, and I think we have a pretty good package.

It’s always a real pleasure and a proud moment when our internal Campfire lights up with an anniversary announcement. Like Jeff celebrating 6 years this month, Sam celebrating 8 years and Ann 3 years last month.

We ultimately want 37signals to have the potential of being the last job our people ever need. When you think about what it’ll take to keep someone happy and fulfilled for 10, 20, 30 years into the future, you adopt a very different vantage point from our industry norm.

Remote Works: It Collective

Emily Wilder wrote this (1 comment)
Name: Chris Hoffman
Title: Co-Founder, Director of Marketing Strategy
Company: It Collective
Based in: Colorado Springs, Colo.
Established: 2012

What does your company do?
We offer film production and content marketing strategy services. On the marketing strategy side, we work with clients to identify key stories and messages that will resonate with and be shared amongst a target audience — then we help them tell those stories through the creation of that content and the execution of a marketing strategy. On the film side, we produce everything from commercial spots to short films, and just recently finished our first feature-length production — a live concert film for Gungor, an incredibly talented band who have recently been nominated for a couple of Grammys.
How many people work for the company, and of those, how many work remotely?
We are 100% remote. Our business model is project-based, so our team changes in size depending on the number and types of projects we have in house. We went the contractor direction instead of hiring full-time employees for a number of reasons. Primarily, it allows the flexibility to resource the ideal skill sets for each project. Secondarily, hiring individuals who prefer working in a contract setting helps us filter out the people who require micro-management — in other words, people who are not suited for a remote work system. The people we hire are used to managing their own time and workflow. We have around 10 team members that we work with on a regular basis.

The editing team works during the peace and quiet of a night shift, 10 p.m.-6 a.m.

Did you start out as a remote company?
We did, and I’d love to say that we had some great strategy behind that decision. In reality, it was made because we didn’t have the startup capital to pay for an office space. We strongly believe in the concept of bootstrapping, and have gotten off the ground without taking on any debt or external capital investment.
We’ve found that we have a great love for hosting face-to-face meetings in coffee shop or home office settings, and that our clients often love meeting in those settings as well. We recently conducted a major client review meeting on a film project in the living room of Andy Catarisano — our Co-Founder and Director of Film Production. We picked apart the final edits over homemade popcorn and cookies. I think our clients loved the experience as much as the final product. It was significantly more effective than presenting in a polished boardroom.
When we need a larger space we rent the tricked-out conference room of a local co-working establishment. Obviously there are occasions when the home office and local Starbucks won’t work, and we don’t pretend that our system will work for everyone. We’ve found a way that works for us to do business without a set physical space, and we aren’t in a hurry to change that.
What challenges did you face in setting up as a remote company?
One major challenge (for those of us who came from a traditional corporate environment) was overcoming the mentality of a 9-5 workday that had been ingrained more deeply than we realized. For me personally, it has taken a very intentional effort to ask myself the right questions about my daily activities. I’ve had to learn to look at the day through the lens of “What is the most high-impact use of my time?” as opposed to “It’s 3 p.m. — I should be at my desk.”
U.S. work culture has conditioned employees to feel like they are fulfilling their duty to the company they work for by being in their seats for 8 hours in a day. In reality, those employees may or may not be producing anything of value. The amount of time spent at a desk is completely irrelevant to the value and quality of work, and that has been a tough lesson to learn.
What do you see as the major benefits of being a remote company?
The first major benefit is the effect it has on morale, and in turn, the increase in quality of work and dedication to the company. Here is one very practical example of this benefit: Commute time.

Think about how ridiculous it is to demand that an employee sit in rush hour for an hour or more each morning and evening, just to be in by 9 a.m. and leave at 5 p.m. How simple of a switch would it be to allow that team member to work from home until 10 a.m., then arrive at the office in 30 minutes or less with no traffic? That switch translates to well over 200 hours of time given back to that person every year to do as he or she pleases — to spend extra time with family, invest in a personal project, or just take some additional space for decompression.
I’d love for someone to give me a reason that justifies not giving one of your staff 200 hours of their lives back each year in exchange for zero productivity loss. An unwillingness to discuss these types of changes to a work schedule that provide such tangible benefits is just plain arrogance on the part of a management team.
A second huge benefit is the expansion of the talent pool that it provides for us. Instead of being limited to the labor pool within 100 miles of our location, we literally have worldwide talent at our fingertips. We regularly work with a film colorist who lives in Sydney, Australia. The quality of work that we received was vastly superior to anything within our immediate geographic area.
One really interesting thing about working with international teams is that you have almost 24 straight hours of productivity at your disposal. We’d do work in the U.S. on the project, meet briefly at the end of the day with our team member in Sydney before signing off, and then turn it over to him to continue the work. It’s an amazing experience to go to bed, get a great night’s sleep and wake up to a project that is further along than when you left it.
The other really major benefit for us is providing the freedom to tailor the work environment to the type of work being performed. An example of this that really stands out to me is one of our recent projects, a feature-length film. Editing a 90-minute film together is one of the most incredibly detailed processes I’ve ever seen, and it requires a huge amount of focus and precision. We worked with an amazing team of editors for this project — the kicker was that they preferred to do their editing nocturnally, from about 10 p.m. to 8 or 9 a.m. The world is quiet then — there are zero interruptions, and that was their period of ultimate creativity and effectiveness. A remote work environment allowed us to say yes to that request, and the results were outstanding.
Any advice for other companies who are considering going remote?
The thing about remote work is that it magnifies existing dysfunction in the workplace. An organization with a highly functional team and a deep understanding of role clarity and how to work together in an effective manner is going to have a much easier time transitioning to a remote work structure. A dysfunctional team is going to have a much more difficult time making that leap, because the freedom of working remotely magnifies those inefficiencies.
A physical office space has long been used as a safety net for managers to sweep the messes of their team dynamics under the rug as opposed to addressing them. Being able to walk down the hallway every 15 minutes to micromanage employees can (sometimes) cover up poor hiring decisions. It can compensate for a failure to plan. It can also provide a false sense of security for a manager who needs to micromanage to feel effective in their position. Working remotely immediately removes those safety nets and exposes the true functionality of a team. If you’re thinking about making the leap to a remote work environment, it’s important to ask these questions about your team and be very honest in your answers.
Visit It Collective.

Picked up a great lesson from the book Turn The Ship Around. David Marquet, the author and nuclear sub captain, says you can’t empower people by decree. While you might be able to ask someone to make a decision for themselves, that’s not true empowerment (or true leadership). Why? Because you’re still making the decision to ask them to make the decision. That means they can’t move, or think, or act without you. The way to empower people is by creating an environment where they naturally start making decisions for themselves. That’s true empowerment. Leaving space, creating trust, and having the full faith that someone else will rise to the challenge themselves.

Jason Fried on Dec 24, 2013 (4 comments)

Lessons from Frank & Oak’s Support

Mig Reyes wrote this (19 comments)

Doing business with a company means you’re not just buying their products, but the experience of having their people, opinions and expertise, too.

Some companies really understand great customer support and service, others fall hard. The latter is the case with my recent (now only) experience with Canadian online menswear retailer, Frank & Oak.
My story is common: I ordered a couple of items, but one got lost in transit. I had full faith that customer service at Frank & Oak could help me track it.
I got a week of radio silence through their online form and email. Resorting to Twitter, I finally got a reply a couple of days later: “we’ll email you.”
Fast forward three weeks from their first reply and we’ve got two valuable lessons from their final correspondence:

I usually answer my email within 3-4 days, but since you sent 3 emails, the number of days showing since our last communication stayed the same. Please wait for a response next time, so that I don’t loose track of our communication.

1. Blame the customer: Three emails in a three-week span; of course it’s my fault.
2. Passive-aggressively tell the customer they’re annoying: In 2013, most email clients order messages by time of receipt. My fault, I didn’t know that yours doesn’t.
Every bit of this Frank & Oak email makes it my fault. So much for making customers feel like a bad ass.
For examples on how to avoid bad customer service like this, you can read how Ryan switched to T-Mobile and had a great experience, or you can read how we turned our own disasters into gold. And whether you work on a support team or not, everyone should give Carnegie a read. You’ll make more friends, and probably more customers.
In the meantime, I’m going to find a place to buy a nice shirt.

Big: Know Your Company grows up and moves out.

Jason Fried wrote this (22 comments)

Back in June we launched Know Your Company, a tool for helping company founders, owners, and CEOs get to know their companies again.

A few hundred one-on-one demos later, we’re about to hit our 100th paid customer.

Because of Know Your Company, thousands of employees have a louder voice, and a hundred company owners have bigger ears. Employees are sharing things they’ve never been asked about before, and owners are hearing things they’ve never heard before. New insights come weekly, and more feedback is flowing in both directions. Things are changing for the better at Know Your Company companies.

Back of the napkin financials

From the business side, in just six months, Know Your Company has booked $390,000 in revenue (and is profitable). The pricing model is $100 per employee, one time (once you pay for someone, you never pay for them again). The smallest customer has 16 employees, the largest has 105. As existing customers grow or replace employees, about 20 new employees are added to the system every week. Customer retention is holding strong at 99% (unfortunately we’ve had one cancellation).

Referrals are healthy too – we get a fair number of emails from CEOs who’ve heard of Know Your Company from existing Know Your Company customers. Even more promising, we’ve been hearing from CEOs who heard about Know Your Company from their employees!

Growing up

What started as a hunch, then launched as an internal experiment before ultimately becoming a commercial product, has blossomed into a thriving business.

In the spirit of continued experimentation, we’re about to take it up a notch and try something we’ve never done before: We’re spinning off Know Your Company into its own business.

In January 2014, Know Your Company the product will become Know Your Company the company, separate from 37signals.

Meet Claire Lew, the new CEO of Know Your Company

The new company will be co-owned by 37signals and Claire Lew. Claire will be the CEO and run all day-to-day operations. We’ll be on the sidelines purely as advisors, ready to help if called upon. If all goes well, Claire will ultimately own more of the company than 37signals will.

So who’s Claire? Claire’s someone we’ve had our eye on for a while. They don’t come much sharper (and nicer!) than Claire. In fact, we originally contemplated hiring Claire to run Know Your Company from the start, but things just didn’t come together.

Claire went off to start ClarityBox, a consulting practice aimed at helping owners understand what their employees really thought. You can watch her talk about it here:

ClarityBox’s mission was similar to Know Your Company’s. We obviously saw the same kinds of problems out there and wanted to help solve them in similar ways.

So once it was clear that Know Your Company had legs, and that we wanted to spin it off into its own company, Claire was the natural match to run it.

I pitched her the idea and she was into it. We hammered out a deal and related details in a couple of weeks and signed the formal agreement yesterday. We’ll be transitioning the company and product over to Claire this month, and she’ll run it completely starting in January. I’ve heard some of her initial ideas so I’m excited to see where she takes it.

Know Your Company

So if you’re a founder, owner, or CEO of a company between 25 and 75 people, and you feel like you don’t know as much about your company as you used to, it’s time to get to Know Your Company again. Claire will show you how.

The feature that almost wasn’t

Emily Wilder wrote this (9 comments)

It took more than a year and three distinct attempts to get Google Docs in Basecamp ... and still, the damn thing almost didn’t get built. Why was it so hard?

We knew we needed it. Integration with Google Docs was a super-popular feature request, and usage in general is on the rise. Since Basecamp is a repository for everything project-related, it made sense to show the same love to Google Docs we show to any other type of file you can store in a Basecamp project.

Problem was, we don’t really use Google Docs ourselves. And we’re kind of notorious for scratching our own itch and not building shit we don’t need. It’s absolutely the exception that we would create a feature we didn’t plan on using. (For years, to-dos in Basecamp Classic didn’t have due dates, because we just work on things until they’re shippable. It wasn’t until enough customers hollered at us that we eventually added them.)

“We know tons of our customers use Google Docs; they have to,” says Jason Z. “Everybody’s using Google Docs. So we know it’s useful, we know people are asking for it all the time. There just comes a point where we have to figure it out.”

Shortly after launching the new Basecamp in March 2012, a small team explored what it would take to link to Google Docs from Basecamp. “We started with a little experiment to see whether the tools Google provides are enough to do basic integration,” said Jeremy, the programmer on that first spike. The goal was to be able to “pick a file from Google without having to commit to deep integration that changes the way Basecamp works.”

Google’s file picker made integrating with Google Docs easy, but rendered switching between accounts (if you’re signed in as one user and need to sign in as someone else) nigh on impossible. And we got hung up on what to do about permissions: Our choices seemed to be either allowing anyone who had the link to edit the document, or letting Google handle permissions and suffer the nasty flow and UI that resulted (more on that later).

With the account switching problem, our choices were to wait for Google to improve their tools, or scrap that and find some other way to integrate — i.e., roll up our sleeves and build our own picker. “That led to a waiting game,” Jeremy recalled: “if Google’s own tools got good enough that we could use them, then we’d have an easier time integrating.” So we punted.

Nearly a year later, a different group took a second look. While there still wasn’t a straightforward path for switching accounts, Javan experimented with a ton of different parameters and landed on treating authentication as a separate, first step to lead into the file picker, using Google’s JavaScript client.

What a Basecamp user sees before signing into Google
When a user is signed into a different Google account, they can sign out and choose which account to link to Basecamp.

Managing the two steps separately gave us the flexibility we needed to resolve the account switching issue, but the permissions demon was still rearing its ugly head. We punted again until we’d have more time to explore it.

Each time we felt like we were getting close, we’d reach the same stalemate. No one knew which of the two options for handling permissions was the lesser of two evils:

  1. Allow anyone with the link to view the document. This route would have meant that sharing a Google Doc in Basecamp changed its permissions so anyone with the link could view and change it. Other tools handle permissions this way; it makes things pretty easy and keeps the UI clean. But it creates a pretty gnarly security concern, in that there’s no way to revoke access later. People no longer employed at an organization might be removed from its Basecamp account, but still have access to proprietary information stored in Google Docs. Or users might share the link with outsiders who could then access and edit the document anonymously. No bueno.
  2. Let Google be the gatekeeper. When permissions are set within the Google account and Basecamp doesn’t mess with them, we get to wash our hands of security concerns. Convenient for us! But it passes this potential morass of access seeking and granting onto our users: The viewer has to be signed into Google, and they need permission to view the document to see the preview in Basecamp. If they don’t have permission, they can request it through Basecamp. They’ll then be directed to a Google page, and from there, the request is emailed to the Google Doc’s owner. When the owner grants access to the document, Google sends an automated email to the viewer with a link to view it. “A lot of us were feeling like this leads to a pretty crappy experience,” Javan says, “because you click on the doc and then you hit this brick wall.”
Step eleventy-bajillion-and-four in trying to preview a Google Doc in Basecamp that you haven’t been granted access to

“I was worried that people wouldn’t understand that, because I didn’t understand it,” recalls Ann from QA. “I did an experiment with the support team where I shared a Google Doc with them … I got all kinds of requests to view the document, because I hadn’t given them permission yet. I was afraid that oh my God, every customer was going to see that.” Adding a private file to a Basecamp project with 150 people on it might generate 150 email requests for access to the file. That was too big of a burden to pass along to customers.

The temptation was to punt a third time — only that was no longer an option. “We decided very clearly that if we don’t do it this time, if we don’t figure this out, we’re basically saying that Basecamp is not ever going to have this,” Jason Z. says. “Because why would we take a fourth attempt? That would be ridiculous.”

The pressure to “ship or get off the pot” led the team to explore other possibilities, like building a folder system that would copy Google Docs into a Basecamp project folder on Google Drive, or using a third party’s Google Docs integration. We finally started to wonder whether the people who wanted Google Docs in Basecamp might already have the permissions thing dialed in. Jeremy chimed in at that point:

Companies switch to Google Apps from company Exchange email and central network fileservers. They “go Google.” Everyone at work is on Google, signed in, and has access to email, drive, calendar, contacts, etc. Google Apps recommends default sharing settings that are a lot like having an old-school central fileserver: newly created files are visible to others by default. There’s no sharing step or permissions-request dance: This is a golden path. It’s well-integrated and it’s the default when a company goes Google.

That perspective alleviated a lot of the trepidation we had about what users would see when they clicked on a Google Doc. The hope was that if people were already using Google Docs at work, they could probably already access the links they needed by default. The access nightmare we envisioned wouldn’t occur if companies’ Google Apps admins were already setting up good defaults, the way Google recommends.

We still weren’t 100 percent convinced we had it right, but it felt good enough for v.1 — to be hands-off, and let the people who use it figure it out (with help, of course). “It’s funny how long the project went on, and in the end, it’s almost simpler than where we started,” Javan says. “But I guess that makes sense.”

“We made a bet on this permissions thing,” Jason Z. says. “We don’t use the feature, so we don’t know. We can’t anticipate what the pain points are going to be here.”

A month or so after shipping, it’s looking like we made the right bet. The majority of feedback has been of the thank-you-so-much-for-adding-this! variety. So far, 56 percent of users are logged into Google when trying to preview a document from within Basecamp, and of those, 91.5 percent already have access to the document they were trying to view. For how much concern there was over whether we were making the right call with permissions, it’s been super quiet. “We were really expecting more confusion, because we were confused,” Ann says. “The people who do use it know how to use it, and I guess we’ve fallen in with their expectations.”

“That’s a super important lesson just in product design in general,” Jason Z. concludes. “You can engineer all kinds of things, and they might be the wrong things if you don’t know. So it’s better to under-engineer and let the pain kind of bubble up organically, than to guess wrong.”

December 4th Basecamp Classic, Campfire and Highrise Outage

Taylor wrote this (13 comments)

Basic Explanation

Some background

On Dec. 4 around 5:30 p.m. CT, a number of our sites began throwing errors and were basically unusable. Specifically, Basecamp Classic was briefly impacted as it was very slow. Campfire users experienced elevated errors and transcripts were not updated for quite some time. Highrise was the most significantly impacted: For two hours every page view produced an error.

Why our sites failed

When you visit a site like Basecamp, it sends you information that’s generated by a number of database and application servers. These servers all talk to each other to share and consume data via connections to the same network.

Recently, we’ve been working to improve download speeds for Basecamp. On Tuesday afternoon we set up one server with software that simulates a user with a bad Internet connection. This bad traffic tickled a bug in a number of the database and application servers, which caused them to become inaccessible. Ultimately, this is why users received error messages while visiting our sites.

How we fixed the sites

We powered off the server sending out the bad traffic. We powered back on the database and application servers that were affected. We checked the consistency of the data and then restarted each affected site.

How we will prevent this from happening again

  • We successfully duplicated this problem so we have an understanding of the cause and effect.
  • We asked all staff not to run that specific piece of software again.
  • We know someone might forget or make a mistake, so we set up alerts to notify us if the software is running anywhere on the network. We verified the check works too.
  • We are working with our vendors to remove the bugs that caused the servers to go offline.

In-Depth Explanation


Our network is configured with multiple redundant switches in the core, two top-of-rack (ToR) switches per cabinet, and every server has at least 2×10GbE or 2×1GbE connections split over the ToR switches. Servers are spread among cabinets to isolate the impact of a loss of network or power in any given cabinet. As such, application servers are spread throughout multiple cabinets; master and slave database pairs are separated, etc. Finally, the cabinets are physically divided into two “compute rooms” with separate power and cooling.

Before the failure

We’ve been investigating ways to improve the user experience for our customers located outside the U.S. Typically these customers are located far enough away that best-case latency is around 200 ms to the origin, and many traverse circuits and peering points with high levels of congestion/packet loss. To simulate this type of connectivity we used netem. Other significant changes preceding the event included: an update to our knife plugin that allows us to make network reconfiguration changes, the decommissioning of a syslog server, and an update of check_mk.
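For reference, a netem invocation of this general kind looks like the following. The interface name and the exact delay/loss figures here are illustrative assumptions, not our actual settings; netem only requires the `tc` tool from iproute2 and root privileges:

```shell
# Emulate a distant, lossy client by shaping this host's outbound traffic.
# "eth0" and the figures below are examples, not our real configuration.
tc qdisc add dev eth0 root netem delay 200ms loss 1%

# Inspect the active queueing discipline:
tc qdisc show dev eth0

# Remove the emulation when testing is done:
tc qdisc del dev eth0 root netem
```

Note that netem shapes traffic on the egress side of the chosen interface, which is why a single test box running it can end up emitting oddly timed traffic toward everything it talks to.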


At 5:25 p.m. CT, Nagios alerted us that two database and two bigdata hosts were down. A few seconds later, Nagios notified us that 10 additional hosts were down. A “help” notification was posted in Campfire and all our teams followed the documented procedure to join a predefined (private) Jabber chat.

One immediate effect of the original problem was that we lost both our internal DNS servers. To address this we added two backup DNS servers to the virtual server on the load balancer. While this issue was being addressed, other engineers identified that the affected applications and servers were in multiple cabinets. Since we were unable to access the affected servers via out-of-band management, we suspected a possible power issue. Because the datacenter provides remote hands service, we immediately contacted them to request a technician go to one of our cabinets and inspect the affected servers.


We prioritized our database and NoSQL (Redis) servers first, since they were preventing some applications from working even in a degraded mode. (Both our master and slave servers were affected, and even our backup db host was affected. Talk about bad luck …) About five minutes after we had a few of the servers online, they stopped responding again. We asked the onsite technician to reboot them again, and we began copying data off to hosts that were unaffected. But the servers failed again before the data was successfully copied.

From our network graphs we could see that broadcast traffic was up. We ran tcpdump on a few hosts that weren’t affected, but nothing looked amiss. Even though we didn’t have a ton of supporting evidence that it was the problem, we decided to clear the ARP cache on our core, in case we had somehow poisoned it with bad records. That didn’t seem to change anything.

We decided to regroup and review any information we might have missed in our earlier diagnosis: “Let’s take a few seconds and review what every person worked on today … just name everything you did, even if it’s something obvious.” We each recited our work. It became clear we had four likely suspects: “knife switch,” our knife plugin for making changes to our network; syslog-02, which had just been decommissioned; an upgraded version of the check_mk plugin that was rolled out to some hosts; and the chef-testing-01 box with netem for simulating end user performance.

It seemed pretty likely that knife-switch or chef-testing-01 were the culprits. We reviewed our chef configuration and manually inspected a few hosts to rule out syslog-02. We were able to determine that the check_mk plugin wasn’t upgraded everywhere, and that there were no errors logged.

We shut down chef-testing-01 and had the remote hands technician power on the servers that had just gone AWOL again. Since we were pretty sure this was a networking issue, very likely related to LACP/bonding/something similar, we decided to shut down one interface on each server in the hope that that, too, would help prevent a repeat performance. We disabled a single port in each bond, both on the switch and on the server. Then we waited 15 long minutes (about 10 minutes after the server was booted and we had confirmed the ports were shut down correctly) before we called the all-clear. During this time we let the databases reload their LRU dumps so they were “warm.” We also restarted replication, let it catch up, and got the Redis instances started up.

With these critical services back online our sites began functioning normally again. Almost 2.5 long hours had passed at this point.

Finally, we made a prioritized list of application hosts that were still offline. For those with working out-of-band management, we used our internal tools to reboot them. For the rest we had the datacenter technician power cycle them in person.

Follow-up actions

  • We were able to reproduce this failure with the same hardware during our after-incident testing. We know what happens on the network, but we have not identified the specific code paths that cause this failure. (The change logs for the network drivers leave a lot to be desired!)
  • We have adjusted the configuration of the internal DNS virtual server to automatically serve via the backup servers if the two primary servers are unavailable.
  • We have added additional redis slaves on hosts that were not previously affected by the outage.
  • We are continuing to pursue our investigation with the vendor and through our own testing.
  • Everyone on the operations team has made a commitment to halt further testing (with netem) until we can demonstrate it will not cause this failure again.
  • We have added “netem” to our Nagios check for blacklisted modules in case anyone forgets about that commitment.
  • We are updating our tools so that physically locating servers when Campfire (and thus our Campfire bot) is broken isn’t a hassle.
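
The blacklisted-module check is conceptually tiny: if a forbidden kernel module shows up in the loaded-module list, alarm. Here's a hedged Ruby sketch of the idea; the exit codes follow the standard Nagios plugin convention, but this is not our actual check_mk plugin:

```ruby
# Nagios-style check: CRITICAL if a blacklisted kernel module is loaded.
# Exit-code convention: 0 = OK, 2 = CRITICAL.
BLACKLIST = %w[sch_netem]  # netem ships as the sch_netem module

def check_modules(proc_modules, blacklist = BLACKLIST)
  # /proc/modules lines start with the module name, e.g. "ext4 737280 1 - Live"
  loaded = proc_modules.each_line.map { |line| line.split.first }
  bad = loaded & blacklist
  if bad.empty?
    [0, "OK: no blacklisted modules loaded"]
  else
    [2, "CRITICAL: blacklisted module(s) loaded: #{bad.join(', ')}"]
  end
end

# A real plugin would feed it File.read("/proc/modules") and exit with the status.
status, message = check_modules("sch_netem 16384 0 - Live\next4 737280 1 - Live\n")
puts message
```

The point of wiring this into monitoring is social as much as technical: the commitment to halt netem testing doesn't depend on anyone's memory.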

Additional information

We’ve built a Google spreadsheet which outlines information about the hosts that were affected. We’re being a bit cautious with reporting every single configuration detail because this could easily be used to maliciously impact someone’s (internal) network. If you’d like more information please contact netem (at) 37signals and we’ll vet each request individually.

Quality Assurance

Michael Berger wrote this on 16 comments

Back in March of 2009 I joined 37signals as Signal #13 and the other half of our two-person support team. At the time we relied mostly on bug reports from customers to identify rough spots in our software. This required the full-time attention of one or more “on-call programmers”: firefighters who tamed quirks as they arose. The approach worked for a while, but we weren’t making software quality a priority.

I had a chat with Jason Fried in late 2011 about how my critical tendencies could help improve our products. Out of that, the QA department was born. Kind of. I didn’t know much about QA and it wasn’t part of the development process at 37signals. So my first move was to order a stack of books about QA to help figure out what the hell I was supposed to be doing.

It’s been almost two years since our first project “with QA” back in 2012. Ann Goliak (another support team alum) recently joined me. Our QA process isn’t traditional, and it works a bit differently for every feature. Here’s a look at how QA fits into our development process, using the recent phone verification project as an example.

Step 1. I sat down with Sam Stephenson back in early July for our first walkthrough of phone verification. Hearing Sam talk about “creating a verification profile” or “completing a verification challenge” familiarized me with the terminology and flows that would be helpful descriptors in my bug reports. Here’s what the notes look like from that first conversation with Sam.
Step 2. After the introduction I’ll dive right into clicking around in a staging or beta environment to get a feel for the feature and what other parts of the app it touches. This is often the first time that someone not designing/coding the feature has a chance to give it a spin, and the fresh perspective always produces some new insights.
Step 3. There are lots of variables to consider when testing. Here are some of the things we keep in mind when putting together a test plan:

  • Does the API need to be updated to support this?
  • Does this feature affect Project templates?
  • Does this feature affect Basecamp Personal?
  • Does our iPhone app support it?
  • Do our mobile web views need to be updated?
  • Does this impact email-in?
  • Does this impact loop-in?
  • Does this impact moving and copying content?
  • Does this impact project imports from Basecamp Classic?
  • Test at various BCX plan levels
  • Test at various content limits (storage, projects)

Project states

  • Active project, Archived project, Project template, Draft (unpublished) project, Trashed project.

Types of content

  • To-do lists, To-do items (assigned + unassigned + dated), Messages, Files, Google docs, Text documents, Events (one time + recurring).

Where content appears

  • Progress screen, In-project latest activity block, History blocks (for each type of content), Calendar, Person pages, Trash, Digest emails.

When these variables are combined you end up with a script of tasks like this one to guide the testing. These lists are unique for each project.
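
Combining those variables is just a cartesian product. A hypothetical Ruby sketch of how such a checklist could be generated; the state and content names come from the lists above, but the loop and output format are invented for illustration:

```ruby
PROJECT_STATES = ["Active project", "Archived project", "Project template",
                  "Draft (unpublished) project", "Trashed project"]
CONTENT_TYPES  = ["To-do lists", "Messages", "Files", "Google docs",
                  "Text documents", "Events"]

# One task per (state, content) pair; a real plan also folds in plan
# levels, platforms, and all the places each piece of content appears.
tasks = PROJECT_STATES.product(CONTENT_TYPES).map do |state, content|
  "#{state}: verify #{content}"
end

puts tasks.first        # first checklist item
puts tasks.size         # 5 states x 6 content types = 30 tasks
```

Even this toy matrix shows why the scripts get long fast, and why each project's list has to be pruned by hand rather than tested exhaustively.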
Step 4. In Basecamp we make a few QA-specific to-do lists in each project: the first for unsorted discoveries, a second for tasks that have been allocated, and a third for rough spots support should know about (essentially “known issues”).

When I find a bug I’ll make a new to-do item that describes it including: 1) A thorough description of what I’m seeing, often with a suggested fix; 2) Specific steps to recreate the behavior; 3) The browser(s) and/or platform(s) where this was observed; and 4) Relevant URLs, screenshots, or a screen recording.

We use ScreenFlow to capture screen recordings on the Mac, and Reflector to do the same in iOS. We’re fans of LittleSnapper (now Ember) for annotating and organizing still screenshots.
Step 5. The designer and programmer on the project will periodically sift through the unsorted QA inbox. Some items get moved to the QA allocated list and fixed, then reassigned to QA for verification. Other “bugs” will trigger a conversation about why a decision was intentional, or outside the scope of the iteration.
Step 6. Before each new feature launch, QA hosts a video walkthrough for the support team. We’ll highlight any potential areas of confusion and other things to be on the lookout for. After the walkthrough, a member of support will spend some time putting together a help section page that covers the new feature.
Step 7. Within a couple of weeks after a feature launch the team will usually have a retrospective phone call. We talk through the highs and lows of the iteration, and I use the chance to ask how QA can be better next time around.
At the end of a project there are usually some “nice to haves” and edge cases that didn’t make the pre-launch cut. These bugs get moved into a different Basecamp project used for tracking long-standing issues, then every few months we’ll eradicate some of them during a company-wide “bug mash”.
So that’s a general overview of how QA works at 37signals. We find anywhere from 30 to 80 bugs per project. Having QA has helped reduce the size of our on-call team to one. The best compliment: after trying it out, no one at the company was interested in shipping features without dedicated QA.

Server-generated JavaScript Responses

David wrote this on 29 comments

The majority of Ajax operations in Basecamp are handled with Server-generated JavaScript Responses (SJR). It works like this:

  1. Form is submitted via an XMLHttpRequest-powered form.
  2. Server creates or updates a model object.
  3. Server generates a JavaScript response that includes the updated HTML template for the model.
  4. Client evaluates the JavaScript returned by the server, which then updates the DOM.

This simple pattern has a number of key benefits.

Benefit #1: Reuse templates without sacrificing performance
You get to reuse the template that represents the model for both first-render and subsequent updates. In Rails, you’d have a partial like messages/message that’s used for both cases.

If you only returned JSON, you’d have to implement your templates for showing that message twice (once for first-response on the server, once for subsequent-updates on the client) — unless you’re doing a single-page JavaScript app where even the first response is done with JSON/client-side generation.

That latter model can be quite slow, since you won’t be able to display anything until your entire JavaScript library has been loaded and then the templates generated client-side. (This was the model that Twitter originally tried and then backed out of). But at least it’s a reasonable choice for certain situations and doesn’t require template duplication.

Benefit #2: Less computational power needed on the client
While the JavaScript with the embedded HTML template might result in a response that’s marginally larger than the same response in JSON (although that’s usually negligible when you compress with gzip), it doesn’t require much client-side computation to update.

This means it might well be faster, from an end-to-end perspective, to send JavaScript+HTML than JSON with client-side templates, depending on the complexity of those templates and the computational power of the client. This is doubly so because the server-generated templates can often be cached and shared amongst many users (see Russian Doll caching).
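
The size claim is easy to sanity-check with Ruby's standard Zlib. A toy illustration (the payloads are invented, and real servers would use gzip framing rather than raw deflate, but the comparison is the same):

```ruby
require "json"
require "zlib"

# A toy SJR response (JS with embedded HTML) vs. an equivalent JSON payload.
html = %{<div id="message_1" class="message"><strong>Bob</strong>: Hello, world!</div>}
sjr  = "$('#messages').prepend('#{html}');"
json = JSON.generate(id: 1, author: "Bob", body: "Hello, world!")

sjr_gz  = Zlib::Deflate.deflate(sjr)
json_gz = Zlib::Deflate.deflate(json)

puts "SJR:  #{sjr.bytesize} bytes raw, #{sjr_gz.bytesize} deflated"
puts "JSON: #{json.bytesize} bytes raw, #{json_gz.bytesize} deflated"
```

Markup compresses well because it's repetitive, so the gap between the two responses shrinks further on the wire as the templates grow.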

Benefit #3: Easy-to-follow execution flow
It’s very easy to follow the execution flow with SJR. The request mechanism is standardized with helper logic like form_for @post, remote: true. There’s no need for per-action request logic. The controller then renders the response partial view in exactly the same way it would render a full view; the template is just JavaScript instead of straight HTML.

Complete example
0) First-use of the message template.

<h1>All messages:</h1>
<%# renders messages/_message.html.erb %>
<%= render @messages %>

1) Form submitting via Ajax.

<%= form_for @message, remote: true do |form| %>
  <%= form.submit "Send message" %>
<% end %>

2) Server creates the model object.

class MessagesController < ActionController::Base
  def create
    @message = @project.messages.create!(message_params)

    respond_to do |format|
      format.html { redirect_to @message } # fallback for non-JS clients
      format.js   # just renders messages/create.js.erb
    end
  end
end

3) Server generates a JavaScript response with the HTML embedded.

<%# renders messages/_message.html.erb %>
$('#messages').prepend('<%=j render @message %>');
$('#<%= dom_id @message %>').highlight();

The final step of evaluating the response is automatically handled by the XMLHttpRequest-powered form generated by form_for. The view is thus updated with the new message, which is then highlighted via a JS/CSS animation.

Beyond RJS
When we first started using SJR, we used it together with a transpiler called RJS, which had you write Ruby templates that were then turned into JavaScript. It was a poor man’s version of CoffeeScript (or Opalrb, if you will), and it erroneously turned many people off the SJR pattern.

These days we don’t use RJS any more (the generated responses are usually so simple that the win just wasn’t big enough for the rare cases where you actually do need something more complicated), but we’re as committed as ever to SJR.

This doesn’t mean that there’s no place for generating JSON on the server and views on the client. We do that for the minority case where UI fidelity is very high and lots of view state is maintained, like our calendar. When that route is called for, we use Sam’s excellent Eco template system (think ERB for CoffeeScript).

If your web application is all high-fidelity UI, it’s completely legit to go this route all the way. You’re paying a high price to buy yourself something fancy. No sweat. But if your application is more like Basecamp or GitHub or the majority of applications on the web that are proud of their document-based roots, then you really should embrace SJR with open arms.

The combination of Russian Doll-caching, Turbolinks, and SJR is an incredibly powerful cocktail for making fast, modern, and beautifully coded web applications. Enjoy!


The silhouettes and imagined dystopia of work was bad. Images of real people prioritizing their Merchandise Update over their family on a Skype call is just fucking horrendous.