Nearly 3 years ago we asked “What would happen if a truck crashed into the datacenter?” The resulting discussion could be summarized as “Well, we would probably be offline for days, if not weeks or months. We wouldn’t have many happy customers by the time Basecamp was back online.” No one was satisfied with that answer and, in fact, we were embarrassed. So we worked really hard to be prepared with an answer that made us proud.
This past Sunday, February 15th, 2015, we demonstrated that answer in public. With one command we moved Basecamp’s live traffic out of our Chicago, IL datacenter and into our Ashburn, VA datacenter in about 60 seconds.
Not one of our customers even noticed the change, which is exactly as we planned it. A few hours later we ran one command and moved it all back. Again, no one noticed.
This probably qualifies as the least publicly visible project at Basecamp. And we hope it stays that way. But if it doesn’t, just know Basecamp will be online even if disaster strikes.
(There’s much to share about how we accomplished this and what we learned along the way. I’ll share the technical nitty gritty in future posts.)
Bobby S.
on 19 Feb 15
Which datacenter are you guys running out of in Ashburn?
James
on 19 Feb 15
Nice to see it working so slick. I must admit I’m a little surprised you haven’t had a viable and fast disaster plan until now, though, with so many customers relying on your services.
DHH
on 19 Feb 15
James, we’ve had a backup data center for years. We’ve also had many backup options should something like this happen (like off-site backups of databases and files), but we haven’t boiled it down to something that can run in 60 seconds. Very, very few companies have a solution like this at the moment. Ask any of your other hosted software providers, and I think you’ll see that hardly any of them have an automated contingency for a complete data center outage.
James
on 19 Feb 15
@DHH, cool, really impressive getting it boiled down to something so simple and fast.
Taylor
on 19 Feb 15
@Bobby RagingWire East and a little bit of Equinix DC2.
Ruud
on 19 Feb 15
Wow :) I don’t think you are using DNS to do the switching, so what is it? Cloudflare?
GregT
on 19 Feb 15
Hopefully someone will be around (with a computer) to execute that “one command” when the time comes.
John
on 19 Feb 15
@Ruud The switching is all DNS-based with very short TTLs.
Taylor
on 19 Feb 15
@Ruud What John said is 98% true. The other 2% is the stuff he didn’t mention that happens behind the scenes. That 2% is where the rubber meets the road.
Felipe
on 19 Feb 15
So far so good! But why don’t you run in two datacenters at once? I mean, why not run on multiple servers across multiple datacenters so that the switch happens automatically?
Thanks for the article!
Ruud
on 19 Feb 15
But not all providers respect the short TTL settings, right?
Justin
on 19 Feb 15
How big is the data loss window? Or, how far (approx) are the async MySQL replicas lagging behind?
Bryan
on 19 Feb 15
What if both data centers go out simultaneously? I know that’s stretching it and there is only so much you can do. Just curious if you have now asked that question and, if so, did you come up with something for that?
DHH
on 19 Feb 15
Bryan, we have all files and backups in a 3rd location as well. So that extremely unlikely contingency is also addressed, but we wouldn’t be back online in 60 seconds.