
The feature that almost wasn’t

Emily Triplett Lentz wrote this · 9 comments

It took more than a year and three distinct attempts to get Google Docs in Basecamp ... and still, the damn thing almost didn’t get built. Why was it so hard?

We knew we needed it. Integration with Google Docs was a super-popular feature request, and Google Docs usage in general is on the rise. Since Basecamp is a repository for everything project-related, it made sense to show the same love to Google Docs that we show to any other type of file you can store in a Basecamp project.

Problem was, we don’t really use Google Docs ourselves. And we’re kind of notorious for scratching our own itch and not building shit we don’t need. It’s absolutely the exception that we would create a feature we didn’t plan on using. (For years, to-dos in Basecamp Classic didn’t have due dates, because we just work on things until they’re shippable. It wasn’t until enough customers hollered at us that we eventually added them.)

“We know tons of our customers use Google Docs; they have to,” says Jason Z. “Everybody’s using Google Docs. So we know it’s useful, we know people are asking for it all the time. There just comes a point where we have to figure it out.”

Shortly after launching the new Basecamp in March 2012, a small team explored what it would take to link to Google Docs from Basecamp. “We started with a little experiment to see whether the tools Google provides are enough to do basic integration,” said Jeremy, the programmer on that first spike. The goal was to be able to “pick a file from Google without having to commit to deep integration that changes the way Basecamp works.”

Google’s file picker made integrating with Google Docs easy, but rendered switching between accounts (if you’re signed in as one user and need to sign in as someone else) nigh on impossible. And we got hung up on what to do about permissions: our choices seemed to be either allowing anyone who had the link to edit the document, or letting Google handle permissions and suffering the nasty flow and UI that resulted (more on that later).

With the account switching problem, our choices were to wait for Google to improve their tools, or scrap that and find some other way to integrate — i.e., roll up our sleeves and build our own picker. “That led to a waiting game,” Jeremy recalled: “if Google’s own tools got good enough that we could use them, then we’d have an easier time integrating.” So we punted.

Nearly a year later, a different group took a second look. While there still wasn’t a straightforward path for switching accounts, Javan experimented with a ton of different parameters and landed on treating authentication as a separate first step leading into the file picker, using Google’s JavaScript client.

What a Basecamp user sees before signing into Google
When a user is signed into a different Google account, they can sign out and choose which account to link to Basecamp.

Managing the two steps separately gave us the flexibility we needed to resolve the account switching issue, but the permissions demon was still rearing its ugly head. We punted again until we’d have more time to explore it.

Each time we felt like we were getting close, we’d reach the same stalemate. No one knew which of the two options for handling permissions was the lesser of two evils:

  1. Allow anyone with the link to view and edit the document. This route would have meant that sharing a Google Doc in Basecamp = changing its permissions so anyone with the link could view and change it. Other tools handle permissions this way; it makes things pretty easy and keeps the UI clean. But it creates a pretty gnarly security concern, in that there’s no way to revoke access later. People no longer employed at an organization might be removed from its Basecamp account, but still have access to proprietary information stored in Google Docs. Or users might share the link with outsiders, who could then access and edit the document anonymously. No bueno.
  2. Let Google be the gatekeeper. When permissions are set within the Google account and Basecamp doesn’t mess with them, we get to wash our hands of security concerns. Convenient for us! But it passes this potential morass of access seeking and granting onto our users: The viewer has to be signed into Google, and they need permission to view the document to see the preview in Basecamp. If they don’t have permission, they can request it through Basecamp. They’ll then be directed to a Google page, and from there, the request is emailed to the Google Doc’s owner. When the owner grants access to the document, Google sends an automated email to the viewer with a link to view it. “A lot of us were feeling like this leads to a pretty crappy experience,” Javan says, “because you click on the doc and then you hit this brick wall.”
Step eleventy-bajillion-and-four in trying to preview a Google Doc in Basecamp that you haven’t been granted access to

“I was worried that people wouldn’t understand that, because I didn’t understand it,” recalls Ann from QA. “I did an experiment with the support team where I shared a Google Doc with them … I got all kinds of requests to view the document, because I hadn’t given them permission yet. I was afraid that oh my God, every customer was going to see that.” Adding a private file to a Basecamp project with 150 people on it might generate 150 email requests for access to the file. That was too big of a burden to pass along to customers.

The temptation was to punt a third time — only that was no longer an option. “We decided very clearly that if we don’t do it this time, if we don’t figure this out, we’re basically saying that Basecamp is not ever going to have this,” Jason Z. says. “Because why would we take a fourth attempt? That would be ridiculous.”

The pressure to “ship or get off the pot” led the team to explore other possibilities, like building a folder system that would copy Google Docs into a Basecamp project folder on Google Drive, or using Box.net’s Google Docs integration. We finally started to wonder whether the people who wanted Google Docs in Basecamp might already have the permissions thing dialed in. Jeremy chimed in at that point:

Companies switch to Google Apps from company Exchange email and central network fileservers. They “go Google.” Everyone at work is on Google, signed in, and has access to email, drive, calendar, contacts, etc. Google Apps recommends default sharing settings that are a lot like having an old-school central fileserver: newly created files are visible to others by default. There’s no sharing step or permissions-request dance: https://support.google.com/a/answer/60781. This is a golden path. It’s well-integrated and it’s the default when a company goes Google.

That perspective alleviated a lot of the trepidation we had about what users would see when they clicked on a Google Doc: if people were already using Google Docs at work, they could probably already open everything they needed by default. The access nightmare we envisioned wouldn’t occur if companies’ Google Apps admins were already setting up good defaults, the way Google recommends.

We still weren’t 100 percent convinced we had it right, but it felt good enough for v.1 — to be hands-off, and let the people who use it figure it out (with help, of course). “It’s funny how long the project went on, and in the end, it’s almost simpler than where we started,” Javan says. “But I guess that makes sense.”

“We made a bet on this permissions thing,” Jason Z. says. “We don’t use the feature, so we don’t know. We can’t anticipate what the pain points are going to be here.”

A month or so after shipping, it’s looking like we made the right bet. The majority of feedback has been of the thank-you-so-much-for-adding-this! variety. So far, 56 percent of users are logged into Google when trying to preview a document from within Basecamp, and of those, 91.5 percent already have access to the document they were trying to view. For how much concern there was over whether we were making the right call with permissions, it’s been super quiet. “We were really expecting more confusion, because we were confused,” Ann says. “The people who do use it know how to use it, and I guess we’ve fallen in with their expectations.”

“That’s a super important lesson just in product design in general,” Jason Z. concludes. “You can engineer all kinds of things, and they might be the wrong things if you don’t know. So it’s better to under-engineer and let the pain kind of bubble up organically, than to guess wrong.”

December 4th Basecamp Classic, Campfire and Highrise Outage

Taylor wrote this · 13 comments

Basic Explanation

Some background

On Dec. 4 around 5:30 p.m. CT, a number of our sites began throwing errors and were basically unusable. Specifically, Basecamp Classic was impacted only briefly, though it was very slow during that window. Campfire users experienced elevated errors, and transcripts went without updates for quite some time. Highrise was the most significantly impacted: for two hours, every page view produced an error.

Why our sites failed

When you visit a site like Basecamp, it sends you information that’s generated by a number of database and application servers. These servers all talk to each other to share and consume data via connections to the same network.

Recently, we’ve been working to improve download speeds for Basecamp. On Tuesday afternoon we set up one server with software that simulates a user with a bad Internet connection. This bad traffic tickled a bug in a number of the database and application servers which caused them to become inaccessible. Ultimately this is why users received error messages while visiting our sites.

How we fixed the sites

We powered off the server sending out the bad traffic. We powered back on the database and application servers that were affected. We checked the consistency of the data and then restarted each affected site.

How we will prevent this from happening again

  • We successfully duplicated this problem so we have an understanding of the cause and effect.
  • We asked all staff not to run that specific piece of software again.
  • We know someone might forget or make a mistake, so we set up alerts to notify us if the software is running anywhere on the network. We verified the check works too.
  • We are working with our vendors to remove the bugs that caused the servers to go offline.


In-Depth Explanation

Topology

Our network is configured with multiple redundant switches in the core, two top-of-rack (TOR) switches per cabinet, and every server has at least 2×10GbE or 2×1GbE connections split over the TOR switches. Servers are spread among cabinets to isolate the impact of a loss of network or power in any given cabinet. As such, application servers are spread throughout multiple cabinets, master and slave database pairs are separated, etc. Finally, the cabinets are physically divided into two “compute rooms” with separate power and cooling.

Before the failure

We’ve been investigating ways to improve the user experience for our customers located outside the U.S. Typically these customers are located far enough away that best-case latency is around 200 ms to the origin, and many traverse circuits and peering points with high levels of congestion and packet loss. To simulate this type of connectivity we used netem. Other significant changes preceding the event included an update to our knife plugin that allows us to make network reconfiguration changes, the decommissioning of a syslog server, and an update of check_mk.

Failure

At 5:25 p.m. CT, Nagios alerted us that two database and two bigdata hosts were down. A few seconds later, Nagios notified us that 10 additional hosts were down. A “help” notification was posted in Campfire and all our teams followed the documented procedure to join a predefined (private) Jabber chat.

One immediate effect of the original problem was that we lost both our internal DNS servers. To address this we added two backup DNS servers to the virtual server on the load balancer. While this issue was being addressed, other engineers identified that the affected applications and servers were in multiple cabinets. Since we were unable to access the affected servers via out-of-band management, we suspected a possible power issue. Because the datacenter provides a remote hands service, we immediately contacted them to request that a technician go to one of our cabinets and inspect the affected servers.

Recovery

We prioritized our database and NoSQL (Redis) servers first, since they were preventing some applications from working even in a degraded mode. (Both our master and slave servers were affected, and even our backup db host was affected. Talk about bad luck …) About five minutes after we had a few of the servers online, they stopped responding again. We asked the onsite technician to reboot them again, and we began copying data off to hosts that were unaffected. But the servers failed again before the data was successfully copied.

From our network graphs we could see that broadcast traffic was up. We ran tcpdump on a few hosts that weren’t affected, but nothing looked amiss. Even though we didn’t have a ton of supporting evidence that it was the problem, we decided to clear the ARP cache on our core, in case we had somehow poisoned it with bad records. That didn’t seem to change anything.

We decided to regroup and review any information we might have missed in our earlier diagnosis: “Let’s take a few seconds and review what every person worked on today … just name everything you did, even if it’s something obvious.” We each recited our work. It became clear we had four likely suspects: “knife-switch,” our knife plugin for making changes to our network; syslog-02, which had just been decommissioned; an upgraded version of the check_mk plugin that was rolled out to some hosts; and the chef-testing-01 box with netem for simulating end-user performance.

It seemed pretty likely that knife-switch or chef-testing-01 were the culprits. We reviewed our chef configuration and manually inspected a few hosts to rule out syslog-02. We were able to determine that the check_mk plugin wasn’t upgraded everywhere, and that there were no errors logged.

We shut down chef-testing-01 and had the remote hands technician power on the servers that had just gone AWOL again. Since we were pretty sure this was a networking issue, very likely related to LACP/bonding/something in that area, we decided to shut down one interface on each server in the hope that this, too, would prevent a repeat performance. We disabled a single port in each bond, both on the switch and on the server. Then we waited 15 long minutes (about 10 minutes after the server was booted and we had confirmed the ports were shut down correctly) before we called the all-clear. During this time we let the databases reload their LRU dumps so they were “warm.” We also restarted replication, let it catch up, and got the Redis instances started up.

With these critical services back online our sites began functioning normally again. Almost 2.5 long hours had passed at this point.

Finally, we made a prioritized list of application hosts that were still offline. For those with working out-of-band management, we used our internal tools to reboot them. For the rest we had the datacenter technician power cycle them in person.

Resolution

  • We were able to reproduce this failure with the same hardware during our after-incident testing. We know what happens on the network, but we have not identified the specific code paths that cause this failure. (The change logs for the network drivers leave a lot to be desired!)
  • We have adjusted the configuration of the internal DNS virtual server to automatically serve via the backup servers if the two primary servers are unavailable.
  • We have added additional redis slaves on hosts that were not previously affected by the outage.
  • We are continuing to pursue our investigation with the vendor and through our own testing.
  • Everyone on the operations team has made a commitment to halt further testing (with netem) until we can demonstrate it will not cause this failure again.
  • We have added “netem” to our Nagios check for blacklisted modules in case anyone forgets about that commitment. (A minimal sketch of such a check follows after this list.)
  • We are updating our tools so that physically locating servers when Campfire (and thus our Campfire bot) is broken isn’t a hassle.
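
A check along these lines can be tiny. Here’s a hedged sketch in Ruby; it assumes netem loads as the sch_netem kernel module and follows the standard Nagios exit-code convention, and our actual check may differ in detail:

#!/usr/bin/env ruby
# Nagios-style check: complain loudly if a blacklisted kernel module is loaded.
# Assumption: netem shows up as "sch_netem" in /proc/modules.
BLACKLIST = %w[sch_netem]

loaded = File.readlines("/proc/modules").map { |line| line.split.first }
found  = BLACKLIST & loaded

if found.any?
  puts "CRITICAL: blacklisted kernel modules loaded: #{found.join(', ')}"
  exit 2 # Nagios CRITICAL
else
  puts "OK: no blacklisted kernel modules loaded"
  exit 0 # Nagios OK
end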

Additional information

We’ve built a Google spreadsheet that outlines information about the hosts that were affected. We’re being a bit cautious about reporting every single configuration detail, because this could easily be used to maliciously impact someone’s (internal) network. If you’d like more information, please contact netem (at) 37signals and we’ll vet each request individually.

Quality Assurance

Michael Berger wrote this · 16 comments

Back in March of 2009 I joined 37signals as Signal #13 and the other half of our two-person support team. At the time we relied mostly on bug reports from customers to identify rough spots in our software. This required the full-time attention of one or more “on-call programmers”: firefighters who tamed quirks as they arose. The approach worked for a while, but we weren’t making software quality a priority.

I had a chat with Jason Fried in late 2011 about how my critical tendencies could help improve our products. Out of that, the QA department was born. Kind of. I didn’t know much about QA and it wasn’t part of the development process at 37signals. So my first move was to order a stack of books about QA to help figure out what the hell I was supposed to be doing.

It’s been almost two years since our first project “with QA” back in 2012. Ann Goliak (another support team alumna) recently joined me on the QA team. Our QA process isn’t traditional and goes a bit differently for every feature. Here’s a look at how QA fits into our development process, using the recent phone verification project as an example.


Step 1. I sat down with Sam Stephenson back in early July for our first walkthrough of phone verification. Hearing Sam talk about “creating a verification profile” or “completing a verification challenge” familiarized me with the terminology and flows that would be helpful descriptors in my bug reports. Here’s what the notes look like from that first conversation with Sam.
Step 2. After the introduction I’ll dive right into clicking around in a staging or beta environment to get a feel for the feature and what other parts of the app it touches. This is often the first time that someone not designing/coding the feature has a chance to give it a spin, and the fresh perspective always produces some new insights.
Step 3. There are lots of variables to consider when testing. Here are some of the things we keep in mind when putting together a test plan:

  • Does the API need to be updated to support this?
  • Does this feature affect Project templates?
  • Does this feature affect Basecamp Personal?
  • Does our iPhone app support it?
  • Do our mobile web views need to be updated?
  • Does this impact email-in?
  • Does this impact loop-in?
  • Does this impact moving and copying content?
  • Does this impact project imports from Basecamp Classic?
  • Test at various BCX plan levels
  • Test at various content limits (storage, projects)

Project states

  • Active project, Archived project, Project template, Draft (unpublished) project, Trashed project.

Types of content

  • To-do lists, To-do items (assigned + unassigned + dated), Messages, Files, Google docs, Text documents, Events (one time + recurring).

Views

  • Progress screen, In-project latest activity block, History blocks (for each type of content), Calendar, Person pages, Trash, Digest emails.

When these variables are combined you end up with a script of tasks like this one to guide the testing. These lists are unique for each project.
Step 4. In Basecamp we make a few QA-specific to-do lists in each project: the first for unsorted discoveries, a second for tasks that have been allocated, and a third for rough spots support should know about (essentially “known issues”).

When I find a bug I’ll make a new to-do item that describes it including: 1) A thorough description of what I’m seeing, often with a suggested fix; 2) Specific steps to recreate the behavior; 3) The browser(s) and/or platform(s) where this was observed; and 4) Relevant URLs, screenshots, or a screen recording.

We use ScreenFlow to capture screen recordings on the Mac, and Reflector to do the same in iOS. We’re fans of LittleSnapper (now Ember) for annotating and organizing still screenshots.
Step 5. The designer and programmer on the project will periodically sift through the unsorted QA inbox. Some items get moved to the QA allocated list and fixed, then reassigned to QA for verification. Other “bugs” will trigger a conversation about why a behavior was intentional, or outside the scope of the iteration.
Step 6. Before each new feature launch, QA hosts a video walkthrough for the support team. We’ll highlight any potential areas of confusion and other things to be on the lookout for. After the walkthrough, a member of support will spend some time putting together a help section page that covers the new feature.
Step 7. Within a couple weeks after a feature launch, the team will usually have a retrospective phone call. We talk through the highs and lows of the iteration, and I use the chance to ask how QA can be better next time around.
At the end of a project there are usually some “nice to haves” and edge cases that didn’t make the pre-launch cut. These bugs get moved into a different Basecamp project used for tracking long-standing issues; then every few months we’ll eradicate some of them during a company-wide “bug mash”.
So that’s a general overview of how QA works at 37signals. We find anywhere from 30 to 80 bugs per project. Having QA has helped reduce the size of our on-call team to one. The best compliment: after trying it out, no one at the company was interested in shipping features without dedicated QA.

Server-generated JavaScript Responses

David wrote this · 29 comments

The majority of Ajax operations in Basecamp are handled with Server-generated JavaScript Responses (SJR). It works like this:

  1. Form is submitted via an XMLHttpRequest-powered form.
  2. Server creates or updates a model object.
  3. Server generates a JavaScript response that includes the updated HTML template for the model.
  4. Client evaluates the JavaScript returned by the server, which then updates the DOM.

This simple pattern has a number of key benefits.

Benefit #1: Reuse templates without sacrificing performance
You get to reuse the template that represents the model for both first-render and subsequent updates. In Rails, you’d have a partial like messages/message that’s used for both cases.

If you only returned JSON, you’d have to implement your templates for showing that message twice (once for first-response on the server, once for subsequent-updates on the client) — unless you’re doing a single-page JavaScript app where even the first response is done with JSON/client-side generation.

That latter model can be quite slow, since you won’t be able to display anything until your entire JavaScript library has been loaded and the templates have been generated client-side. (This was the model that Twitter originally tried and then backed out of.) But it is at least a reasonable choice for certain situations, and it doesn’t require template duplication.

Benefit #2: Less computational power needed on the client
While the JavaScript with the embedded HTML template might result in a response that’s marginally larger than the same response in JSON (although that’s usually negligible when you compress with gzip), it doesn’t require much client-side computation to update.

This means it might well be faster, from an end-to-end perspective, to send JavaScript+HTML than JSON with client-side templates, depending on the complexity of those templates and the computational power of the client. This is doubly so because the server-generated templates can often be cached and shared amongst many users (see Russian doll caching).

Benefit #3: Easy-to-follow execution flow
It’s very easy to follow the execution flow with SJR. The request mechanism is standardized with helper logic like form_for @post, remote: true. There’s no need for per-action request logic. The controller then renders the response partial view in exactly the same way it would render a full view, the template is just JavaScript instead of straight HTML.

Complete example
0) First-use of the message template.

<h1>All messages:</h1>
<%# renders messages/_message.html.erb %>
<%= render @messages %>

1) Form submitting via Ajax.

<%= form_for @project.messages.new, remote: true do |form| %>
  ...
  <%= form.submit "Send message" %>
<% end %>

2) Server creates the model object.

class MessagesController < ActionController::Base
  def create
    # @project is assumed to be loaded by a before_filter (not shown)
    @message = @project.messages.create!(message_params)

    respond_to do |format|
      format.html { redirect_to @message } # fallback when JS is unavailable
      format.js   # just renders messages/create.js.erb
    end
  end
end

3) Server generates a JavaScript response with the HTML embedded.

<%# renders messages/_message.html.erb %>
$('#messages').prepend('<%=j render @message %>');
$('#<%= dom_id @message %>').highlight();

The final step of evaluating the response is automatically handled by the XMLHttpRequest-powered form generated by form_for, and the view is thus updated with the new message and that new message is then highlighted via a JS/CSS animation.

Beyond RJS
When we first started using SJR, we used it together with a transpiler called RJS, which had you write Ruby templates that were then turned into JavaScript. It was a poor man’s version of CoffeeScript (or Opalrb, if you will), and it erroneously turned many people off the SJR pattern.

These days we don’t use RJS any more (the generated responses are usually so simple that the win just wasn’t big enough for the rare cases where you actually do need something more complicated), but we’re as committed as ever to SJR.

This doesn’t mean that there’s no place for generating JSON on the server and views on the client. We do that for the minority case where UI fidelity is very high and lots of view state is maintained, like our calendar. When that route is called for, we use Sam’s excellent Eco template system (think ERB for CoffeeScript).

If your web application is all high-fidelity UI, it’s completely legit to go this route all the way. You’re paying a high price to buy yourself something fancy. No sweat. But if your application is more like Basecamp or GitHub or the majority of applications on the web that are proud of their document-based roots, then you really should embrace SJR with open arms.

The combination of Russian doll caching, Turbolinks, and SJR is an incredibly powerful cocktail for making fast, modern, and beautifully coded web applications. Enjoy!

MicrosoftsDystopia.jpg

The silhouettes and imagined dystopia of work were bad. Images of real people prioritizing their Merchandise Update over their family on a Skype call are just fucking horrendous.

Evaluating a redesign

Jason Fried wrote this · 14 comments

When evaluating a redesign, your first instinct is to compare the new design to the old design. But don’t do that.

The first step is to understand what you’re evaluating. If you just put the new design up against the old design, and compare the two, the old design will strongly influence your evaluation of the new design.

This is OK if nothing’s changed since the original design was launched. But it’s likely a lot has changed since then – especially if many months or years have passed.

Maybe there are new insights, maybe there’s new data, maybe there’s a new goal, maybe there’s a new hunch, or maybe there’s a whole new strategy at play. Maybe “make it readable” was important 3 years ago, while “help people see things they couldn’t see before” is more important today. Or maybe it’s both now.

But if the old design sets the tone about what’s important, then you may be losing out on an opportunity to make a significant leap forward. A design should never set the tone – ideas should set the tone. Ideas are independent of the design.

So, when evaluating a redesign you have to know what you’re looking for, not just what you’re looking at. How the new design compares to the old may be the least important thing to consider.

It’s a subtle thing, but it can make all the difference.

Basecamp's cache-friendly local time

Javan wrote this · Discuss

An important ingredient in the caching to the max recipe is sharing caches between users. When cache keys are free of user-specific attributes, requests are far more likely to hit a warm cache and remain snappy.

This cache-sharing constraint means we can’t render local times on the server. If we did, the time would be cached correctly for the first viewer and incorrectly for everyone in a different time zone.

Basecamp solves this problem with JavaScript. We always render <time> tags in UTC and then convert them client-side to the browser’s local time. We’ve been doing it this way since Basecamp launched, and it has served us well.
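
The mechanics are simple enough to sketch. Assume a comment partial (the model name, the data attribute, and the formats here are illustrative, not the exact code we extracted):

<%# Every viewer gets identical, cache-friendly UTC markup %>
<time datetime="<%= comment.created_at.utc.iso8601 %>" data-local-time>
  <%= comment.created_at.utc.strftime("%B %e, %Y %H:%M UTC") %>
</time>

<script>
  // After the page loads, rewrite each tag into the browser's local time.
  document.addEventListener("DOMContentLoaded", function() {
    var tags = document.querySelectorAll("time[data-local-time]");
    for (var i = 0; i < tags.length; i++) {
      var utc = new Date(tags[i].getAttribute("datetime"));
      tags[i].textContent = utc.toLocaleString();
    }
  });
</script>

The cached fragment never varies by user; only the browser knows (and needs to know) the viewer’s time zone.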

We don’t want to be the only ones enjoying this so I plucked all the relevant bits from Basecamp and packaged them up. Here you go.

The performance impact of "Russian doll" caching

Noah wrote this · 7 comments

One of the key strategies we use to keep the new Basecamp as fast as possible is extensive caching, primarily using the “Russian doll” approach that David wrote about last year. We stuffed a few servers full of RAM and started filling up memcached instances.
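
As a quick refresher, the “Russian doll” part means nesting cached fragments inside one another, so a change to an inner record only re-renders the dolls that wrap it. A schematic ERB example (the view and model names are illustrative, not Basecamp’s actual templates):

<%# Outer doll: keyed on the project, busted whenever the project is touched %>
<% cache project do %>
  <h1><%= project.name %></h1>

  <% project.todolists.each do |todolist| %>
    <%# Inner doll: reused unchanged until this particular list changes %>
    <% cache todolist do %>
      <%= render todolist %>
    <% end %>
  <% end %>
<% end %>

Paired with belongs_to :project, touch: true on the to-do list model, an inner change also busts the outer key, which keeps nested fragments from going stale.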
A few times in the last two years we’ve invalidated large swaths of cache or restarted memcached processes, and observed that our aggregate response time increases by 30-75%, depending on the amount of invalidation and the time of day. We then see caches refill and response times return to normal within a few hours.

On a day-to-day basis, caching is incredibly important to page load times too. For example, look at the distribution of response time for the overview page of a project in the case of a cache hit (blue) or a miss (pink): median request time on a cache hit is 31 milliseconds; on a miss it jumps to 287 milliseconds.
Until recently, we’ve never taken a really in-depth look at the performance impact of caching on a granular level. In particular, I’ve long had a hypothesis that there are parts of the application where we are overcaching; I believed that there are likely places where caching is doing more harm than good.

Hit rate: just the starting point

From the memcached instances themselves (using memcache-top), we know we achieve roughly a 67% hit rate: about two out of every three requests we make to the caching server return a valid fragment. By parsing Rails logs, we can break apart this overall hit rate into a hit rate for each piece of cached content.
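
The parsing itself can be rough-and-ready. Here’s a sketch of the idea, assuming fragment-cache logging is enabled and that a miss shows up as a “Read fragment” line followed by a “Write fragment” line for the same key (the exact log format varies by Rails version and configuration):

# Approximate per-fragment hit rates from Rails fragment-cache log lines.
# IDs are normalized to "?" so keys like views/projects/123/... group together.
def normalize(key)
  key.gsub(/\b\d+\b/, "?")
end

reads  = Hash.new(0)
writes = Hash.new(0)

File.foreach("log/production.log") do |line|
  case line
  when /Read fragment (\S+)/  then reads[normalize($1)]  += 1
  when /Write fragment (\S+)/ then writes[normalize($1)] += 1
  end
end

reads.each do |key, total|
  hits = total - writes[key] # a write right after a read implies a miss
  printf("%-55s %5.1f%% hit rate\n", key, 100.0 * hits / total)
end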
Unsurprisingly, there’s a wide range of hit rates for different fragments. At the top of the heap, cache keys like views/projects/?/index_card/people/? have a 98.5% hit rate. These fragments represent the portion of a project “card” that contains the faces of people on the project:

This fragment has a high hit rate in large part because it’s rarely updated—we only “bust” the cache when someone is added or removed from a project or some other permissions change is made, which are relatively infrequent events.
At the other end of cache performance, with a 0.5% hit rate, are views/projects/?/todolists/filter/? fragments, which represent the filters available on the side of a project’s full listing of to-dos:

Because these filters are based on who is on a project and which to-dos are due when, the cache here is busted every time project access changes or any to-do is updated. As a result, we rarely have a cached fragment available here, and 99 times out of 100 we end up rendering the fragment from scratch.
Hit rate is a great starting point for figuring out what caching is likely effective and what isn’t, but it doesn’t tell the whole story. Caching isn’t free – memcached is blazingly fast, but you still incur some overhead with every cache request whether you get a hit or a miss. That means that a cache fragment with a low hit rate that is also quick to render on a miss might be better off not being cached at all — the costs of all of the misses (the fruitless memcache request) outweigh the benefits of a hit. Conversely, a low hit rate isn’t always bad—a template that is extremely slow to render might still benefit on net even if only 10% of cache requests are successful.
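
One way to make that trade-off concrete is to compare the expected cost per request with and without the cache: a hit costs one lookup, while a miss costs the lookup plus the render plus the write. A back-of-the-envelope sketch (the render and overhead times below are made up for illustration; only the hit rates come from our logs):

# Expected cost per request of a cached fragment, in milliseconds.
def expected_ms(hit_rate, lookup_ms, render_ms, write_ms)
  hit_rate * lookup_ms +
    (1 - hit_rate) * (lookup_ms + render_ms + write_ms)
end

# People-on-a-project fragment: 98.5% hit rate, caching is a clear win.
puts expected_ms(0.985, 0.5, 25.0, 0.5) # ~0.9 ms vs. 25 ms uncached

# To-do filter fragment: 0.5% hit rate. If the template renders in ~2 ms
# anyway, the cache makes the fragment slower on average.
puts expected_ms(0.005, 0.5, 2.0, 0.5)  # ~3.0 ms vs. 2 ms uncached

By this model the filter fragment costs more cached than uncached, which is exactly the overcaching hypothesis above.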

Calculating net cache impact

Continued…

Customer Spotlight: Aardvark London

Emily Triplett Lentz wrote this · Discuss
Name: Christopher Johns
Title: Commercial Director
Company: Aardvark London
Established: 1996
Number of employees: 17
Basecamp customer since: 2004


Tell us a little about Aardvark — what kind of work do you do?
We’ve got two sides of the business. One is the digital agency, where we design, build and support digital experiences for clients. (See examples here.) The other side is our eCRM platform called Nudge. That’s the back end, effectively, for all our client websites. It manages customer inquiries, email marketing, customer communications, the customer database and behavioral tracking as well.

Christopher Johns, Commercial Director of Aardvark

You’ve been using Basecamp for a long time, more than 9 years. Do you remember how you first found out about it?
There had been quite a lot of buzz … I think Mike Arrington of TechCrunch had written something about Basecamp. At that time, we were growing and looking for ways to manage our client expectations and their projects. It seemed like a no-brainer, an easy way for everyone to track what’s going on and get customer buy-in as well.
What kinds of problems were you having with managing client expectations?
In what we do, there’s a lot of going back and forth with clients, and various people within various positions. So you might be speaking to the digital marketing person, the head of operations, the head of marketing. Historically, you would have done things via email. You would have sent someone a concept or something, and said “What do you think of this concept?” You’d CC three people on that email, and have three people coming back with comments, some of which are conflicting. Then you start replying to those and indenting your comments on their original comments. Pretty soon, you know what it’s like; everything gets lost. The original sort of thrust, the momentum, of the project gets bogged down in trying to manage the communication between you and these disparate parties.
Clients may not be as technically proficient as you want them to be. My son is 12 and my youngest is three. The three-year-old is very, very capable of using my iPhone and my iPad and doing all of that, whereas my mother is not.
So your clients tend to be older or less technologically adept?
Some of them are, some of them aren’t. What they don’t want to do is have to learn a whole new thing just to work with us. Our job is to make their lives easier. If they have to go and learn a whole new thing in order to just communicate, you’re off to a bad start to the relationship.
Basecamp is very simple. They don’t have to go and log into anything; all they do is reply to an email, and post their comments. And being able to track all those comments coming through, from all of those disparate sources … you’ve got a nice clean audit trail so that if anybody questions it at a later date, you can say, “well, have a look at this. This is what you said on this date.”
Does that happen?
Yeah, it does. Not that frequently.
What kinds of projects have you managed with Basecamp?
We’ve just done a project for Transport for London, the authority that runs all the underground for London and the buses, Boris Bikes, and the river buses as well. We’ve just built a digital sign project for them — you’ll see signs all over London now with these maps on them that have real-time bus departure information overlaid, that’s localized to the location of the digital sign.
With digital signs for Transport for London, it’s about thousands of people all in one location, and how they’re all interacting with that information, and how that sign helps guide them to their chosen destination. So for example, you can’t use colors on a digital sign to indicate a particular stop, because people would be looking around them in the bus station looking for yellow, and that yellow doesn’t exist.

A sample of one of the digital bus timetables Aardvark created for Transport for London

In London, there are literally millions of people who use the buses every day. Millions of people need to know where to go, and digital signs are helping them in that process. It’s a very interesting user experience development process to try to communicate to people via digital signs that are in the real-world environment that have to show complex data and make it simple for all these people to find where they’re going.
What’s your work culture like?
We’re a collaborative culture; we’re a small team. Everyone brings something to the party, and it’s about the respect for what each individual brings to the party that gets the most out of every person working on the team. We’re not a personality-led business. People don’t come to us and say “I want to work with Aardvark because Chris is there.” They work with us because they see the output that we have as a team and they want that for themselves.
I understand you just moved into some new offices?
We are moving, but it’s been delayed. It takes between 60 and 90 days to get fiber installed. We’re moving closer to Transport for London and a couple of our other clients.
We’re changing how we work as well. We just instituted an opportunity for the staff, where they can go and work 25 days per annum from wherever they want to. We have one guy who spent two months earlier this year working out of Peru. If you want to work from home for one day, that’s fine. Basecamp is one of the systems we use that will enable that process to be more effective. At the moment I live in the city with my wife and children and dog, and we’re hoping to move out of London a bit to get a little more greenery. So it’s actually going to help me as well.
Why 25 days in particular?
It’s pretty much an experiment to start with; the 25 days is a pretty arbitrary figure at the moment. If everything goes well, if people like it and they’re motivated and they feel good about the whole thing, productivity is shown to be positive, I don’t see any reason why we couldn’t do more.
Visit Aardvark.

I think we sometimes overlook things we don’t realize we’re already good at or have limited experience with. You may be beating yourself up about not having good enough grades in biology to go to medical school while overlooking the fact that you’ve been working in your family’s hardware store over the summer for eight years and have an extraordinary sense of how to deal with people. That’s a skill that a lot of doctors in their 50s would kill for: they’ve never learned to understand and be empathetic towards others. People have all kinds of soft skills that you can’t train someone to have, but they beat themselves up because it’s not the thing they think they’re supposed to be good at.

Jason Fried on Nov 19, 2013 · 4 comments