You’re reading Signal v. Noise, a publication about the web by Basecamp since 1999.

David

About David

Creator of Ruby on Rails, partner at 37signals, best-selling author, public speaker, race-car driver, hobbyist photographer, and family man.

Dragons on the far side of the histogram

David wrote this · 9 comments

Performance tuning is a fun sport, but how you’re keeping score matters more than you think, if winning is to have real impact. When it comes to web applications, the first mistake is to start with what’s easiest to measure: server-side generation times.

In Rails, that’s the almighty X-Runtime header — reported to the 6th decimal of a second, for that extra punch of authority. A clear target, easily measured, and squarely within the safe realm of your own code, which makes it appear fully controllable and scientific. But what good is shaving off milliseconds for a 50ms internal target, if your shit (or non-existent!) CDNs are costing you seconds in New Zealand? Pounds, not pennies, is where the wealth is.
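
As a quick sanity check on what that header actually covers, here’s a minimal sketch in plain Ruby (the handle method is a hypothetical stand-in for your app): X-Runtime is just wall-clock time around your own server-side code, formatted to six decimals the way Rack’s middleware does it. Everything else (SSL handshakes, CDN misses, asset downloads, client-side rendering) is invisible to it.

```ruby
# Hypothetical request handler standing in for a Rails action. The point:
# X-Runtime only measures time spent inside your own server-side code.
def handle(request)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  body = "Hello, #{request}"  # stand-in for model work + view rendering
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [body, { "X-Runtime" => format("%0.6f", elapsed) }]
end

body, headers = handle("world")
puts headers["X-Runtime"]  # six decimals of authority, e.g. "0.000003"
```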

Yet that’s still the easy, level-one part of the answer: Don’t worry too much about your internal performance metrics until you’ve taken care of the full stack of SSL termination overhead, CDN optimization, JS/CSS asset minimization, and client-side computational overhead (the latter easily catches out people following the “just do a server-side API” approach, since the JSON may well generate in 50ms, but then the client-side computation takes a full second on a below-average device — doh!).

Level two, once reasonable efforts have been made to trim the fat around the X-Runtime itself, is getting some big numbers up on the board: Mean and the 90th percentile. Those really are great places to start. If your mean is an embarrassing 500ms+, well, then you have some serious, fundamental problems that need fixing, which will benefit everyone using your app. Get to it.
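
As a sketch of that scorekeeping, here’s a nearest-rank percentile calculation over a toy sample (in production you’d feed this from your request logs; the numbers here are invented for illustration):

```ruby
# Toy sample of request runtimes in seconds; real numbers come from your logs.
runtimes = [0.04, 0.05, 0.06, 0.06, 0.07, 0.09, 0.20, 0.35, 0.70, 1.80]

# Nearest-rank percentile: a rough approximation, good enough for a first look.
def percentile(sorted, pct)
  sorted[((pct / 100.0) * (sorted.size - 1)).round]
end

sorted = runtimes.sort
mean = runtimes.sum / runtimes.size.to_f

puts format("mean: %.0fms  90th: %.0fms  99th: %.0fms",
            mean * 1000, percentile(sorted, 90) * 1000, percentile(sorted, 99) * 1000)
# -> mean: 342ms  90th: 700ms  99th: 1800ms
```

Note how one slow outlier drags the mean far above the typical request — which is exactly why the mean alone is a poor scoreboard.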

Keep going beyond even the 99th

Just don’t stop there. Not at the mean, nor at the 90th. Don’t even stop at the 99th! At Basecamp, we sorta fell into that trap for a while. Our means were looking pretty at around 60ms, the 90th was 200ms, and even the 99th was a respectable 700ms. Victory, right?

Well, victory for the requests that fell into the 1st to 99th percentile. But when you process about fifty million requests a day, there’s still an awful lot of requests hidden on the far side of the 99th. And there, young ones, is where the dragons lie.

A while back we started shining a light into that cave. And even though I expected there to be dragons, I was still shocked at just how large and plentiful they were at our scale. Just 0.4% of requests took 1-2 seconds to resolve, but that’s still a staggering 200,000 requests when you’re doing fifty million a day.

Yet it gets worse. Just 0.0025% of requests took 10-30 seconds, but that’s still a whopping 1,250 requests. While some of those come from API requests that users do not see directly, a fair slice is indeed from real, impatient human beings. That’s just embarrassing! And a far, far away land from that pretty picture painted by the 60ms mean. Ugh.

Finally, there was the true elite: The 0.0001%, for a total of 50 instances. Those guys sat and waited between 30 and 60 seconds on their merry request to complete. Triple ugh.
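
The arithmetic behind those counts is trivial, which is rather the point: tiny percentages of a big number are still big numbers.

```ruby
# Tiny tail percentages of fifty million daily requests are still big numbers.
total = 50_000_000

tail = {
  "1-2s"   => 0.4,       # percent of daily requests
  "10-30s" => 0.0025,
  "30-60s" => 0.0001,
}

tail.each do |bucket, pct|
  puts format("%-7s %.4f%% = %d requests/day", bucket, pct, total * pct / 100)
end
# -> 1-2s    0.4000% = 200000 requests/day
#    10-30s  0.0025% = 1250 requests/day
#    30-60s  0.0001% = 50 requests/day
```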

Dragon slaying

Since lighting up the cave, we’ve already been pointed to big, obvious holes in our setup that we weren’t looking at before. One simple example was file uploads: We’d stage files in one area, then copy them over to their final resting place as part of the record creation process. That’s no problem when it’s a couple of 10MB audio files, but try that again with 20 400MB video files — it takes a while! So now we stage straight in the final resting place and cut out the copy process. Voilà: Lots of dragons dead.
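
A hypothetical sketch of that before/after (file names and storage layout invented for illustration, with a tmpdir standing in for real storage):

```ruby
require "fileutils"
require "tmpdir"

final_files = nil

Dir.mktmpdir do |root|
  staging = File.join(root, "staging")
  final   = File.join(root, "final")
  FileUtils.mkdir_p([staging, final])

  # Before: the upload lands in a staging area, and record creation then
  # copies every byte to its final resting place. Fine for a 10MB audio
  # file, painful for twenty 400MB videos.
  staged = File.join(staging, "video.mp4")
  File.write(staged, "...pretend this is 400MB of video...")
  FileUtils.cp(staged, File.join(final, "video.mp4"))

  # After: the upload handler writes straight to the final resting place,
  # so record creation only has to save metadata. No copy, no dragons.
  File.write(File.join(final, "clip.mp4"), "...pretend this is 400MB of video...")

  final_files = Dir.children(final).sort
end

puts final_files.inspect  # => ["clip.mp4", "video.mp4"]
```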

There’s still much more work to do. Not just because it sucks for the people who actually hit those monster requests, but also because it’s a real drain on the rest of the system. Maybe it’s an N+1 case that only appears under very special circumstances, but every time the request hits, it’s still an onslaught on the database, and everyone else’s fast queries might well be slowed down as a result.

But it really does also just suck for those who actually have to sit through a 30 second request. It doesn’t really help them very much to know that everyone else is having a good time. In fact, that might just piss them off.

It’s like going to the lobby of your hotel to complain about the cockroaches, and then seeing the smug smile of the desk clerk saying “oh, don’t worry about that, none of our other 499 guests have that problem… just deal with it”. You wouldn’t come back next summer.

So do have a look at the far side of your histogram. And use actual request counts, not just feel-good percentiles.

Basecamp network attack postmortem

David wrote this · 13 comments

As we detailed in Basecamp was under network attack, criminals assaulted our network with a DDoS attack on March 24. This is the technical postmortem that we promised.

The main attack lasted a total of an hour and 40 minutes starting at 8:32 central time and ending around 10:12. During that window, Basecamp and the other services were completely unavailable for 45 minutes, and intermittently up and down or slow for the rest. In addition to the attack itself, Basecamp got put in network quarantine by other providers, so it wasn’t until 11:08 that access was restored for everyone, everywhere.

The attack was a combination of SYN flood, DNS reflection, ICMP flooding, and NTP amplification. The combined flow was in excess of 20Gbps. Our mitigation strategy included filtering through a single provider and working with them to remove bogus traffic.

To reiterate, no data was compromised in this attack. This was solely an attack on our customers’ ability to access Basecamp and the other services.

There are two main areas we will improve upon following this event. Regarding our shield against future network attacks:

  1. We’ve formed a DDoS Survivors group to collaborate with other sites who’ve been subject to the same or similar attacks. That’s been enormously helpful already.
  2. We’re exploring all sorts of vendor shields to be able to mitigate future attacks even faster. While it’s tough to completely prevent any interruption in the face of a massive attack, there are options to minimize the disturbance.
  3. Law enforcement has been contacted, we’ve added our statement to their case file, and we’ll continue to assist them in catching the criminals behind this attack.

Regarding the communication:

  1. There was a 20-minute delay between our first learning of the attack and reporting it to our customers via Twitter and status. That’s unacceptable. We’ll make changes to ensure that it doesn’t take more than a maximum of 5 minutes to report something like this again.
  2. Although we were successful at posting information to our status site (which is hosted off site), the site received more traffic than ever in the past, and it too had availability problems. We’ve already upgraded the servers that power the site and we’ll be conducting additional load and availability testing in the coming days.

We will continue to be on high alert in case there is another attack. We have discussed plans with our providers, and we’re initiating new conversations with some of the top security vendors.

Monday was a rough day and we’re incredibly sorry we weren’t more effective at minimizing this interruption. We continue to sincerely appreciate your patience and support. Thank you.

Basecamp was under network attack this morning

David wrote this · 12 comments

Criminals attacked the Basecamp network with a distributed denial-of-service attack (DDoS) this morning. The attackers tried to extort us for money to make it stop. We refused to give in and worked with our network providers to mitigate the attack the best we could. Then, about two hours after the attack started, it suddenly stopped.

We’ve been in contact with multiple other victims of the same group, and unfortunately the pattern in those cases was one of on/off attacks. So while things are currently back to normal for almost everyone (a few lingering network quarantine issues remain, but should be cleared up shortly), there’s no guarantee that the attack will not resume.

So for the time being we remain on high alert. We’re collaborating with the other victims of the same group and with law enforcement. These criminals are sophisticated and well-armed.

Still, we want to apologize for such mayhem on a Monday morning. Basecamp, and our other services, are an integral part of how most of our customers get work done. While no data was compromised in this attack, not being able to get to your data when you need it is unacceptable.

During the attack we were able to keep everyone up to date using a combination of status.basecamp.com, Twitter, and an off-site Gist (thank you GitHub!). We’ll use the same channels in case we’re attacked again. If the attack does not resume, we will post a complete technical postmortem within 48 hours.

We want to thank all our customers who were affected by this outage for their patience and support. It means the world to us. Thank you.

Finding your workbench muse

David wrote this · 21 comments

Much intellectual capital is spent examining the logical advantages and disadvantages of our programming tools. Much ego invested in becoming completely objective examiners of productivity. The exalted ideal: To have no emotional connection to the workbench.

Hogwash. There is no shame in being inspired by your tools. There is no shame in falling in love with your tools. Nobody would chastise a musician for clinging to their favorite, outdated, beat-up guitar for that impossible-to-explain “special” sound. Some authors even still write their manuscripts on actual typewriters, just for the love of it.

This highlights the tension between programmers as either engineers or craftsmen. A false dichotomy, but a prevalent one. It’s entirely possible to dip inspiration and practice from both cans.

I understand where it’s coming from, of course—strong emotions often run counter to good arguments. It’s hard to convince people who’ve declared their admiration or love of something otherwise. Foolhardy, even. It can make other types of progress harder. If we all fell madly in love with Fortran and punch cards, would that still be the state of the art?

I find the benefits far outweigh the risks, though. We don’t have to declare our eternal fidelity to our tools for them to serve as our muse in the moment. And in that moment, we can enjoy the jolt of energy that can come from using a tool fitting your hand or mind just right. It’s exhilarating.

So much so that it’s worth accepting the limitations of your understanding. Why do I enjoy Ruby so very much? Well, there’s a laundry list of specific features and values to point to, but that still wouldn’t add up to the total sum. I’ve stopped questioning it constantly, and instead just embraced it.

Realizing that it’s not entirely rational, or explainable, also frees you from feeling you have to push your muse onto others. It’s understandable to be proud and interested in inviting others to share in your wonder, but mainly if they haven’t already found their own.

If someone is already beholden to Python, and you can sense that glow, then trying to talk them into Ruby isn’t going to get you anywhere. Just be happy that they too found their workbench muse.

At the end of the day, nobody should tell you how to feel about your tools (let alone police it out of you, under the guise of what’s proper for an engineer). There’s no medal for appearances, only great work.


The 3 1/2” floppy disk as the go-to save icon gets a lot of play, but I don’t think enough attention is paid to the venerable trash can icon.

First of all, trash cans never looked like this while I was growing up in Denmark — I only knew what it was from American TV. Second, I don’t think there are any of these left even in the US.

Canceling eFax

David wrote this · 19 comments
Wish to cancel your account? You may do so conveniently with an Online Chat Representative during 6AM-6PM Pacific Time, Monday through Friday. Or, you may call us after hours at (323) 817-3205.

Interesting use of the word “conveniently”. After days of missing the window, I finally hit it at the right time. Here’s how that convenient chat went:

  • Please wait for a site operator to respond. You are currently number 1 of 1 in the queue. Thank you for your patience.
  • You are now chatting with ‘Mike B.’
  • David Hansson: Hi there, please cancel my account.
  • Mike B.: Hello, David. Welcome to online Fax support. I am Mike Berry, your online Live Support Representative. How may I assist you?
  • Mike B.: I am glad to help you. Could you please provide me your fax number, registered email address and billing zip code for verification?
  • David Hansson: 555555555, david@loudthinking.com, 99999
  • Mike B.: I am sorry, the zip code provided is incorrect. Please confirm the 4-digit PIN or last 4 digits of the credit card on file.
  • David Hansson: pin: 1111
  • Mike B.: Thank you for providing your information. Please give me a moment while I pull up your account.
  • Mike B.: In the meantime, please type the number corresponding to your reason for cancellation:
  • Mike B.: 1) Moving to another provider
  • Mike B.: 2) Bought a fax machine
  • Mike B.: 3) Business or role changed
  • Mike B.: 4) Short term project completed
  • Mike B.: 5) Financial reasons
  • Mike B.: 6) Problems with faxing or billing
  • Mike B.: 7) Dissatisfied with quality of service
  • Mike B.: 8) Too costly
  • David Hansson: no need for fax
  • Mike B.: David, as we’d like to keep your business, I can offer you a discount and also waive your subscription fee for 1 month.
  • Mike B.: The discounted monthly fee would be $12.95 per month. This new plan includes 150 free inbound and 150 free outbound pages monthly.
  • Mike B.: There is no contract and you may cancel anytime. Shall I switch you to this plan?
  • David Hansson: no thanks, just cancel
  • Mike B.: Alright.
  • Mike B.: I completely understand your wish to discontinue. Conversely, May I offer you a waiver of 2 months on subscription fee so that you can re-evaluate your needs?
  • Mike B.: There is no contract and you may cancel anytime.
  • David Hansson: no thanks, just cancel
  • Mike B.: Okay.
  • Mike B.: If you wish to consider the offer, I can set your account to auto-close at the end of the 2-month waiver period, wherein you need not have to contact us again for cancelling the account.
  • Mike B.: However, if you choose to continue, you would need to get back to us so that we could remove the auto-closure of your account.
  • Mike B.: Would that be fine with you?
  • David Hansson: nope, canceling now is what I would like
  • Mike B.: Okay, I will go ahead and cancel your account.
  • Mike B.: An e-mail confirming that your account has been canceled will be sent to your registered e-mail address.
  • Mike B.: Is there anything else I may assist you with?
  • David Hansson: that’s it, thanks
  • Mike B.: Thank you for contacting online Fax support. I hope you found our session helpful. Goodbye and take care.
  • Chat session has been terminated by the site operator.

I hardly need to add commentary to illustrate just how ridiculous and unfair this process is, but I can’t help myself. If you allow a customer to sign up 24/7/365, you should damn well allow that customer to cancel their service 24/7/365. If you allow them to sign up self-service, you should damn well allow that customer to cancel by self-service. Anything less is just crummy.

(I wish credit card companies would help enforce consumer protection against this: Unless it’s as easy to cancel as it is to sign up, chargeback is automatic).

Everyone does everything

David wrote this · 13 comments

The natural tendency of growth is towards specialization. When you only have a few people, they must by necessity do everything. When you have more people, there’s enough room and slack to let people build specialization kingdoms that only they have the keys to. Don’t be so eager to let that happen.

Specialization might give you a temporary boost in productivity, but it comes at the expense of overall functional cohesion and shared ownership. If only Jeff can fiddle with the billing system, any change to the billing system is bottlenecked on Jeff, and who’s going to review his work on a big change?

But it goes even deeper than that. For example, we have all programmers work on-call as well. Everyone gets to feel the impact of customers having trouble with our code (this is on top of Everyone on support).

This really was put to the test lately when we started working on a number of iOS and Android projects. Should we hire new specialists from the outside, or should everyone do everything, and thus have our existing team learn the ropes? Well, in that case we ended up doing both: Hiring a little, because we needed that anyway and wanted someone with experience, but also choosing to invest in the existing team by having them learn iOS and Android from scratch.

Good programmers are good programmers. Good designers are good designers. Don’t be so eager to pigeonhole people into just one technology, one aspect of your code base, or one part of your business. Most people are eager to learn and grow if you give them a supportive environment to do so.


Guess what these Google domain icons do. I’ll go first: Send a locksmith, Start a party, Call a handyman, Jump out the window, Put on your seatbelt, Use a lifeline, Start the machine.

Healthy benefits for the long run

David wrote this · 35 comments

Employee benefits at technology companies are often focused on making people stay at the office longer: Foosball tables, game rooms, on-site training rooms, gourmet chefs, hell, some even offer laundry services. We don’t do any of that (although we do have a ping-pong table in a back room that gets wheeled out for our bi-yearly meetups).

Instead we focus on benefits that get people out of the office as much as possible. 37signals is in it for the long term, and we designed our benefits system to reflect that. One of the absolute keys to going the distance, and not burning out in the process, is going at a sustainable pace.

Here’s the list of benefits we offer to get people away from the computer:

  • Vacations: For the last three years in a row, we’ve worked with a professional travel agent to prepare a buffet of travel packages that employees could pick from as a holiday gift. Everything paid for and included. Having it be specific, pre-arranged trips — whether for a family to go to Disneyland or a couple to tour Spain — has helped make sure people actually take their vacations.
  • 4-day Summer Weeks: From May through October, everyone who’s been with the company for more than a year gets to work just four days out of the week. This started out as “Fridays off”, but roles like customer support and operations need to cover all hours, so now it’s just a 4-day Summer Week.
  • Sabbaticals: For every three years someone has been with the company, we offer the option of a 1-month sabbatical. This in particular has been very helpful at preventing or dealing with burnout. There’s nothing like a good, long, solid, continuous break away from work to refocus and rekindle.

To come up with the best ideas, you need a fresh mind. These travel and time-off benefits help everyone stay sharp. But it goes beyond that. Even the weeks when people are working full-on, we offer benefits focused around keeping everyone healthy in other ways too:

  • CSA stipend: We offer a stipend for people to get weekly fresh, local vegetables from community-supported agriculture. Eating well is good, cooking at home is good, doing both is great.
  • Exercise stipend: Whether people want to take yoga classes or spend money on their mountain bike, the company chips in. Eating healthy goes hand-in-hand with getting good exercise. And we sit down for too much of the day as it is, so helping people be active is important.

These benefits form the core of our long-term outlook: Frequent time to refresh, constant encouragement to eat and live healthy. Pair that with the flexibility that remote working offers, and I think we have a pretty good package.

It’s always a real pleasure and a proud moment when our internal Campfire lights up with an anniversary announcement. Like Jeff celebrating 6 years this month, Sam celebrating 8 years and Ann 3 years last month.

We ultimately want 37signals to have the potential of being the last job our people ever need. When you think about what it’ll take to keep someone happy and fulfilled for 10, 20, 30 years into the future, you adopt a very different vantage point from our industry norm.

Server-generated JavaScript Responses

David wrote this · 29 comments

The majority of Ajax operations in Basecamp are handled with Server-generated JavaScript Responses (SJR). It works like this:

  1. Form is submitted via an XMLHttpRequest-powered form.
  2. Server creates or updates a model object.
  3. Server generates a JavaScript response that includes the updated HTML template for the model.
  4. Client evaluates the JavaScript returned by the server, which then updates the DOM.

This simple pattern has a number of key benefits.

Benefit #1: Reuse templates without sacrificing performance
You get to reuse the template that represents the model for both first-render and subsequent updates. In Rails, you’d have a partial like messages/message that’s used for both cases.

If you only returned JSON, you’d have to implement your templates for showing that message twice (once for first-response on the server, once for subsequent-updates on the client) — unless you’re doing a single-page JavaScript app where even the first response is done with JSON/client-side generation.

That latter model can be quite slow, since you won’t be able to display anything until your entire JavaScript library has been loaded and the templates generated client-side. (This was the model Twitter originally tried and then backed out of.) But it’s at least a reasonable choice for certain situations and doesn’t require template duplication.

Benefit #2: Less computational power needed on the client
While the JavaScript with the embedded HTML template might result in a response that’s marginally larger than the same response in JSON (although that’s usually negligible when you compress with gzip), it doesn’t require much client-side computation to update.

This means it might well be faster, from an end-to-end perspective, to send JavaScript+HTML than JSON with client-side templates, depending on the complexity of those templates and the computational power of the client. This is doubly so because the server-generated templates can often be cached and shared amongst many users (see Russian Doll caching).
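
Russian Doll caching slots right into this because the same partial serves both renders. A sketch in Rails ERB (assuming Rails fragment caching; the cache helper keys the fragment on the record, so it expires when the message is touched):

```erb
<%# app/views/messages/_message.html.erb %>
<% cache message do %>
  <div id="<%= dom_id(message) %>" class="message">
    <%= message.body %>
  </div>
<% end %>
```

Both the first full-page render and the subsequent SJR responses go through this same partial, so they can hit the same warmed cache entry.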

Benefit #3: Easy-to-follow execution flow
It’s very easy to follow the execution flow with SJR. The request mechanism is standardized with helper logic like form_for @post, remote: true. There’s no need for per-action request logic. The controller then renders the response partial view in exactly the same way it would render a full view; the template is just JavaScript instead of straight HTML.

Complete example
0) First-use of the message template.

<h1>All messages:</h1>
<%# renders messages/_message.html.erb %>
<%= render @messages %>

1) Form submitting via Ajax.

<%= form_for @project.messages.new, remote: true do |form| %>
  ...
  <%= form.submit "Send message" %>
<% end %>

2) Server creates the model object.

class MessagesController < ActionController::Base
  def create
    # @project is assumed to be loaded by a before_action (not shown)
    @message = @project.messages.create!(message_params)

    respond_to do |format|
      format.html { redirect_to @message } # fallback when JS is unavailable
      format.js   # just renders messages/create.js.erb
    end
  end
end

3) Server generates a JavaScript response with the HTML embedded.

<%# renders messages/_message.html.erb %>
$('#messages').prepend('<%=j render @message %>');
$('#<%= dom_id @message %>').highlight();

The final step of evaluating the response is automatically handled by the XMLHttpRequest-powered form generated by form_for, and the view is thus updated with the new message and that new message is then highlighted via a JS/CSS animation.

Beyond RJS
When we first started using SJR, we used it together with a transpiler called RJS, which had you write Ruby templates that were then turned into JavaScript. It was a poor man’s version of CoffeeScript (or Opalrb, if you will), and it erroneously turned many people off the SJR pattern.

These days we don’t use RJS any more (the generated responses are usually so simple that the win just wasn’t big enough for the rare cases where you actually do need something more complicated), but we’re as committed as ever to SJR.

This doesn’t mean that there’s no place for generating JSON on the server and views on the client. We do that for the minority case where UI fidelity is very high and lots of view state is maintained, like our calendar. When that route is called for, we use Sam’s excellent Eco template system (think ERB for CoffeeScript).

If your web application is all high-fidelity UI, it’s completely legit to go this route all the way. You’re paying a high price to buy yourself something fancy. No sweat. But if your application is more like Basecamp or GitHub or the majority of applications on the web that are proud of their document-based roots, then you really should embrace SJR with open arms.

The combination of Russian Doll-caching, Turbolinks, and SJR is an incredibly powerful cocktail for making fast, modern, and beautifully coded web applications. Enjoy!