On the evening of Monday, November 12, we experienced a few hours of downtime due to an explosion at Rackspace’s main data center in Dallas, TX. This event led to the eventual failure of a backup cooling system. Without adequate cooling, our servers had to be shut down to prevent permanent damage. We have detailed the events that led to the downtime. We deeply apologize for any inconvenience this may have caused and will work hard to reduce the likelihood of this happening again.
Anonymous Coward
on 13 Nov 07What was that? A Suicide bomber?
Eli Duke
on 13 Nov 07get this. i’m living (and working) in antarctica right now, and i’ve been using campfire to chat with my friends and family back home. i set up a time and people just show up and come and go and we have a great time with it. this crazy downtime just happened to occur at the exact moment i scheduled.
what are the chances?
MikeInAZ
on 13 Nov 07Funny, just the other day, I was bragging to my friend how Rackspace has virtually 100% uptime. Even fail-safe systems sometimes fail.
Mike McDerment
on 13 Nov 07Jason – looks like we are server farm neighbors.
@AC – no…truck drove into a transformer.
Joe
on 13 Nov 07I’ve read that this happens, that AC units don’t turn back on after a power interruption. Aren’t they supposed to be regularly tested?
asdf king
on 13 Nov 07I hope no one got hurt.
AW
on 13 Nov 07AC a suicide bomber in a data center?? You should ease up on Fox News and get a life and some ability to see reality. What an ass!
Tim
on 13 Nov 07I’m actually surprised to learn your datacenter is in Texas, since most of you are in Chicago. But I suppose that in these days of information highways, that’s far from uncommon…
Sam
on 13 Nov 07I know this was nearly an impossible chain of events, but maybe if you can’t have your website/apps up on servers elsewhere, at least redirect web requests to another working server that simply serves up a static page with information about the situation, keeping everyone in the loop.
Frank
on 13 Nov 07Jason,
Redundancy has its advantages but there is the expense. I am located very close to a military installation with, well, can’t talk about that, but with the popularity of your products, please consider an inspection of the facilities, especially after this episode. You might find that your very users have some ideas on other server farms that are more “secure” and anticipate the worst. It’s your call, but consider moving your business to a different server farm just to send a message that you can’t have this sort of problem. Not after 9/11.
In this day and age it’s amazing what might happen when a squirrel gets too curious. I’m glad there wasn’t any damage to the servers! Find the server farms that have thought outside the box; you owe it to your users. And I love this stuff. My friends and colleagues really enjoy it. Perfect for the medical and dental research I am doing, being able to collaborate with people everywhere, and the learning curve isn’t hard.
Thanks, and I hope you didn’t get too many messages at home!
Frank
Dylan Jones
on 13 Nov 07Shit happens, but it’s the way you deal with it that makes the difference.
I really respect the way you guys provided a clear, concise reason why it happened, without any blame-mongering or excuses but with a commitment to move forward and get even better.
This is in stark contrast to several big corporates who have f*ed up my services lately but tried to pass the buck, not apologise, and generally develop very slopey shoulders when it comes to accountability.
You’ve turned a potentially damaging incident into something that once again makes me smile at your level of professionalism compared to the big guys who are in the stone age when it comes to good service. Well done.
Hope no-one was seriously hurt. Dylan
Muki
on 13 Nov 07Jason
Thanks for being so open about this issue. As always we love to do business with you guys. You guys are professional.
Remember that you are providing your services in the flat world. Downtime affected us out here in India.
Shawn
on 13 Nov 07I guess Rackspace can now no longer advertise that they have had 100% network uptime.
Nick
on 13 Nov 07Jason & Co.,
I’d love it if you guys posted about your disaster preparedness and redundancy as you implement new strategies to handle stuff like this. My company was right there alongside you (clients’ servers in Rackspace DFW), but we were also affected by the Sunday AM and Monday AM outages that you guys missed, so we’re in the middle of the same process.
I know you can’t go into all the details, but I think we’d all benefit from what you could share.
karl
on 13 Nov 07At minimum, maybe host at least one dns server external to the main datacenter – and maybe host 37svn and/or product blog external as well so that there’s an opportunity for “live updates”.
To me, some downtime is unavoidable/acceptable when it’s totally accidental like this, but getting information about what’s happening asap is paramount in those scenarios. Thankfully last night techcrunch/valleywag/etc were all over this like an hour after it happened, so it was less mysterious than it would have been even though no word from 37s.
MI
on 13 Nov 07karl: That’s absolutely one of the first things we’re doing. Not being able to post status updates during the downtime was a big problem, but fortunately it’s one of the easier ones for us to solve and we’ll be addressing that quickly.
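One minimal way to implement karl’s suggestion is a watchdog running outside the primary data center that redirects visitors to an externally hosted static status page only after several consecutive failed health checks, so a single dropped check doesn’t cause flapping. This is only an illustrative sketch, not 37signals’ actual plan; the hostnames and threshold are hypothetical:

```python
# Sketch of an out-of-band status-page watchdog (hypothetical names).
# It decides whether DNS should point visitors at an externally hosted
# static status page after N consecutive failed checks of the primary.

FAILURE_THRESHOLD = 3  # consecutive failures before switching over

def choose_target(check_results, primary="primary.example.com",
                  status_page="status.example.com"):
    """Return which host DNS should point at, given recent health-check
    results (True = primary responded), ordered oldest to newest."""
    recent = check_results[-FAILURE_THRESHOLD:]
    if len(recent) == FAILURE_THRESHOLD and not any(recent):
        return status_page  # primary looks dark: serve the status page
    return primary
```

The key point is that the watchdog and the status page both live outside the data center being monitored, so they stay reachable when it goes dark.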
Anonymous Coward
on 13 Nov 07I am surprised that a company with 37signals’ resources does not yet have fail-over capabilities. I applaud you, too, for being transparent and professional; however, redundancy should be a given for technology sophisticates like 37signals. Is anyone else surprised by this, or am I being naive? It sounds like 37signals is well on the way to recovery – here’s to a speedy and full recovery, and redundancy is better late than never. Best wishes Jason et al.
DHH
on 13 Nov 07All our servers are available in duplicates, triplicates, or more within the data center, which has given us the fail-over capability to avoid widespread downtime for a very long time. Like anyone, we’ve had servers fail in the past, but thanks to the fail-over setup we’ve been able to keep on trucking without inconveniencing our customers.
But last night none of our fail-over systems could get around the fact that the entire data center went dark. Naturally, this has prodded us to explore what we can do to become geographically redundant as well as machine redundant.
We’ll definitely be working on this and other initiatives to avoid similar incidents in the future.
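The kind of machine redundancy DHH describes can be sketched as a simple selection over duplicate servers: requests go to the first replica that passes a health check, so one failed machine causes no downtime. This is an illustrative sketch with made-up names, not 37signals’ actual setup; it also shows why the scheme has nothing to offer when every replica sits in the same dark facility:

```python
# Illustrative in-datacenter failover (hypothetical server names):
# route to the first healthy replica among duplicates of a service.

def pick_server(replicas, is_healthy):
    """Return the first replica that passes the health check,
    or None if every replica is down."""
    for server in replicas:
        if is_healthy(server):
            return server
    return None  # whole facility dark: machine redundancy can't help
```

When `pick_server` returns None, only geographic redundancy (replicas in a second data center) could keep the service up.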
Anonymous Coward
on 13 Nov 07Good point DHH – I should have been more clear – I meant geographic redundancy. I am sure within the hosting facility you are infinitely scalable.
brad
on 13 Nov 07Geographic redundancy is indeed a big issue. How many of us depend on servers in earthquake-prone areas of California? I often wonder how safe my backups and stored files are on .Mac, for example. My brother lives and works in Menlo Park, and after the quake a couple of weeks ago he set up an SVN repository and asked me to keep a replica of his coding projects on my machine here in Montreal.
LDMiller
on 13 Nov 07Our company has used Rackspace for over 3 years and we love them. This is an extremely rare occurrence for Rackspace. We have found the uptime with Rackspace to be far superior to any other host we have used in the past.
I would not consider switching to another company if I were you. Their support is the best in the business!
Rackspace has data centers all over the world; so they have the ability to offer geographic redundancy. You need to negotiate that into your pricing package.
James Byers
on 13 Nov 07DHH: “we’ve been able to keep on trucking without inconveniencing our customers”
Given the circumstances, a funny turn of phrase.
I’m curious to hear your take on geographic redundancy once you’ve had the chance to sort through this outage. It strikes me that true geographical redundancy, a full warm or hot copy of your infrastructure, is something most companies should skip. I don’t think the costs are worth it unless you’re really operating on a global scale. Forget about hardware and hosting costs; I’m talking about time spent on systems engineering and application maintenance going forward.
If it’s a hot copy, you’ve got lots of data and cache synchronization problems to sort through.
If it’s a warm copy, you’ve got to worry about how to fail over to it quickly when the primary colo goes down. Can you do this faster than the primary colo comes back online? What will be the “chillers didn’t restart” equivalent in your application?
That said, Wordpress.com happily serves traffic from multiple colos on the back of MySQL replication. Lots of applications, like search, distribute well. Maybe it’s just a question of applying the right constraints to a multi-colo deployment.
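The warm-copy dilemma James raises can be made concrete: failing over to a replicated standby only pays off if the primary colo will stay down longer than a failover takes, and only if replication lag is small enough that the data loss is acceptable. A hedged sketch of that decision, with entirely hypothetical thresholds:

```python
# Sketch of the warm-standby trade-off (all thresholds hypothetical):
# promote the remote copy only when the primary colo has been down
# longer than it typically takes to recover, and only if replication
# lag means an acceptable amount of un-replicated data would be lost.

MAX_ACCEPTABLE_LAG_SECONDS = 60     # un-replicated writes we'd sacrifice
EXPECTED_RECOVERY_SECONDS = 1800    # how fast the primary usually returns

def should_promote_standby(outage_seconds, replication_lag_seconds):
    """Decide whether failing over beats waiting for the primary."""
    if outage_seconds < EXPECTED_RECOVERY_SECONDS:
        return False  # primary will likely be back before failover pays off
    return replication_lag_seconds <= MAX_ACCEPTABLE_LAG_SECONDS
```

In the “chillers didn’t restart” scenario the hard part is estimating the outage duration in the first place, which is exactly why the warm-copy decision is rarely automatic.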
Mike
on 13 Nov 07If you guys are looking for multiple data center implementations, check out (and maybe talk to) the guys over at Automattic. They seem to have something like three data centers, all sync’ed together, running their web apps, including Wordpress.com.
See http://barry.wordpress.com/2007/04/16/additional-capacity/ and http://photomatt.net/2006/09/11/new-servers/
John Topley
on 13 Nov 07It’s an interesting subject. I went to a presentation once where Werner Vogels from Amazon was plugging their S3 and EC2 services. Someone asked him where the data were physically located. He didn’t answer that specific question but did say that Amazon could afford to lose two entire data centres without service being affected.
Rick
on 13 Nov 07I appreciate the notice, but it would be really great if we could dismiss the notification from the dashboard. It is very long and I am scroll intolerant.
Cliff
on 13 Nov 07@Rick, if you’re referring to Basecamp, there’s a ‘Hide This Notice’ link in the upper-right corner of the dashboard notice.
Justin Reese
on 13 Nov 07Michael: “Really? When did that start?” Tobias: “Well, I don’t want to blame it all on 9/11, but it certainly didn’t help.”
Please please let’s not usher in the Brave New World with fallacious associations, pleeease.
metacircular
on 13 Nov 07well, I think 37signals is handling this as well as possible. i mean, being transparent and acknowledging it as soon as possible.
it’s not like you didn’t take reasonable preparations beforehand, this is almost an Act of God (unforeseeable) type thing.
Huh
on 13 Nov 07Am I the only person who’s a little surprised that 37s’ host is just Rackspace?
I mean, they’re supposedly reliable and have great support, but I had always assumed that they’d be running their own servers, or at least using a host as cool and innovative as their products…
MI
on 14 Nov 07Huh: We’ve been extremely happy with Rackspace since we switched to them mid last year. “Reliable” and “great service” are the two most important attributes that I look for in a hosting provider.
As far as running our own servers is concerned, we do run them. The only help we get from Rackspace is when the machines are installed or we need some help at the physical hardware level. Beyond that, we manage our systems ourselves.
3stripe
on 14 Nov 07“100% Network Uptime Isn’t Wishful Thinking, It’s A Guaranteed Reality”
Erm maybe they should take this claim off the Rackspace website?
Anyhow I’ve been off work so I didn’t even notice the downtime :)
Jay Levitt
on 15 Nov 07Kudos for the openness.
James: “I’m curious to hear your take on geographic redundancy once you’ve had the chance to sort through this outage.”
Ditto. I agree, it’s difficult, especially at the hot-spare level. On the other hand, wouldn’t it be nice if Rails could abstract some of that complexity?
Out of pure selfishness, I’m glad that 37Signals is experiencing the failure of low-level redundancy systems (sorry, guys). Rails is built on YAGNI and “extract, don’t envision”. That means that things don’t generally go into Rails until they’ve caused 37Signals pain. It keeps the core lean and mean, just as it’s supposed to. But it also means (especially given the youth of the framework) that there’s not a lot of “this bit me once and I’ll be damned if it will ever bite me again” hardening in Rails yet.
So outages like this really mean a better Rails for all of us. Here’s wishing 37S a year full of TEMPFAILs, router flaps, gamma rays and simplexed disks!