All the 37signals properties were offline for two hours this morning between 10AM and 12PM CST (16:00 to 18:00 GMT) as our load balancer blew out and knocked out the network connection for all our servers. No data was lost and the machines all kept running, but they weren’t accessible from the internet.
We’re very, very sorry for this interruption of service. While we were able to report on the progress of this interruption through our status site at http://status.37signals.com (all the products and 37signals.com pointed to that site during the majority of the outage), that’s small consolation when you need to access data stored on our services right now. It was just not good enough.
While we don’t have a formal service-level agreement (SLA), we still want to compensate anyone who felt they were negatively affected in their work because of this outage. Please write [email protected] (and include your application URL) and we’ll get that taken care of.
Naturally, we’re going to have a long, serious talk with our service provider (Rackspace). They’re supposed to be the best in the business, but in this instance they failed us, so we in turn failed you. We’ll do everything we can to make sure that something as simple as a load balancer (or firewall or switch or any other network equipment) going bad does not cause two hours of downtime.
Again, we’re truly sorry for this interruption. This is not how Fridays are supposed to be.
Tobias
on 18 Jan 08I just wanted to say that I thought you all handled the situation brilliantly. Posting the frequent updates was much appreciated.
Ross
on 18 Jan 08Whilst it wasn’t really the end of the world, you wouldn’t have believed it if you had seen my client’s sales and marketing guys this afternoon (in the UK) – the sky is falling, the sky is falling :) I guess this at least reflects on how useful your products are.
I tried to contact Rackspace as soon as it happened (my client is a heavy user) but they didn’t really want to talk and suggested I talk to you guys – easier said than done when they’ve taken out all of your infrastructure :(
This is the second time Rackspace have let you down – time to move?
Sean
on 18 Jan 08It’s always horrible to feel helpless with a service provider. An unfortunate situation for many people, but I’m sure you’ll tighten things up with Rackspace. Thank you for the status updates this AM (and great products in general)... you guys still rock.
Onward and upward!
_sean
Joshua
on 18 Jan 08I’m having an issue with you guys pointing the finger at Rackspace. I use both Highrise and Basecamp every day, and I also use Rackspace to host all my boxes.
However, you guys made it clear that you only have one Load Balancer in your setup, and when it failed, as all hardware inevitably fails… your WHOLE business went down, every last site.
That seems to be a massive failure of your planning more than anything having to do with Rackspace.
Please don’t point the finger at other companies; that’s far too easy and lacks a bit of professionalism. Pay for another Load Balancer to be on standby if your entire internet presence relies on it.
Dave
on 18 Jan 08I would first ask who the IT genius was that decided a CF card was a good place to store critical (and apparently not backed up) configuration data…
ET
on 18 Jan 08Yes; while the outage was inconvenient, the frequent updates and hard work on your part to get it back up (not to mention offering compensation to those who might have been tangibly affected) were and are very admirable.
On another note, I’m not sure if this is a known issue or if it was caused by something going awry this morning: While grouphub.com led to your alert/update page, adding the www prefix did not work—in fact, this appears to still be the case (the former works, the latter leads to a blank page).
Thomas
on 18 Jan 08Thanks for your clear and frequent updates about this issue.
Cameron Watters
on 18 Jan 08You keep referring to the load balancer in the singular. Are you not using multiple fault-tolerant load balancers configured to fail over?
Anon
on 18 Jan 08Way to throw rackspace under the bus.
37s should know better than to have a single point of failure for EVERY ONE of their apps.
Ivan
on 18 Jan 08@Joshua: A second LB would be a wise investment, I am sure. Although, having run an operations team, I would put some of the blame on Rackspace as well. There should have been someone there who recommended a second standby LB.
@Dave: Perhaps there should be a backup CF card with the configuration standing by? A small investment for peace of mind, I would think.
@ET: Yeah, not only is my own grouphub.com domain still pointing to the status page, the HTTPS version is still timing out. Anyone help?
Jim
on 18 Jan 08Try Pair (http://pair.com) for hosting. Always solid.
Ian Ragsdale
on 18 Jan 08As a former Rackspace customer, I’d say that Rackspace was at one time the best in the business, but I think their service has gone WAY downhill (at least it had a couple of years ago when we switched away from them).
When we started there, their service was top-notch – not only was there never any hold time to talk to a tech, but they were also extremely competent. When we left, there still wasn’t any hold time, but their ability to solve problems quickly had gone way down, yet they were still charging premium prices.
For the amount of money we were paying them (we had a lot of servers there), it became both cheaper and more effective to go bare-bones on hosting, and hire and train our own operations staff.
BZ
on 18 Jan 08As both a Rackspace and Backpack customer, it is a big bummer.
That being said, I have been with Rackspace for 5+ years and they are one of the best.
BZ
Ivan
on 18 Jan 08One more thing… the twitter posts rocked too. Thanks for that.
Cameron Watters
on 18 Jan 08@Ivan: As a Rackspace customer w/ redundant load balancers, I can vouch for the fact that there’s no shortage of people and/or web pages at Rackspace recommending fail-over stuffs (including load balancers) for high-availability applications.
Christian
on 18 Jan 08The communication this morning was very good. It was easy for everyone in the organization to know where we were at.
Ivan
on 18 Jan 08@Jim: I second that… pair is the bomb. I would move to pair Networks before hosting with Rackspace any day of the week. Also, LogicWorks kicks serious Enterprise butt too… let me know if you need a contact there.
Kevin Mackie
on 18 Jan 08Time to update the risk management plan…
David
on 18 Jan 08Well handled nonetheless, guys. Your response to the consumer (me) was frequently updated via the default site, and the way you wrote, and the tone in which you wrote, was transparent enough to ensure that our respect for the service you provide continues. We enjoy the Basecamp service here at Fi and this has been the first time we have experienced downtime during work hours.
Ivan
on 18 Jan 08@Cameron: Yeah, that’s why I made the comment I did. When we were shopping for managed hosting, Rackspace certainly recommended failover LBs… pricey… woo!
Kevin Haggerty
on 18 Jan 08Thanks for the updates – much better than the week long downtime over at Strongspace!
GeeIWonder
on 18 Jan 08I applaud the attempts to be forthright and keep the lines of communication open, but I also tend to agree with Joshua and the subsequent comments.
Also, it’s very confusing when the top item on the updates ‘Status’ page switches to the bottom item in the middle of a crisis.
Anyhow, glad you guys are up and running again. Lessons learned? Other than blaming the service provider, I mean… is there anything you guys could’ve done better?
Micah Calabrese
on 18 Jan 08Aside from the fact that switching to any new host is a complete nightmare, is Amazon’s Elastic Compute Cloud an option for you guys? I’d imagine you’d be able to get a deal considering your relationship with them.
Deano
on 18 Jan 08I’m appalled, disgusted, outraged!
Seriously I went and did the shopping and then it was working again. Shit does happen but I wasn’t put out that much.
The key thing is not losing data. That’s the one thing that worries me about web applications. I can back up my gmail but can’t with 37Signals products….
Thanks for communicating and keeping people up to date.
DHH
on 18 Jan 08Joshua, you’re right that there should have been a second load-balancer sitting in the rack. By the end of the day, there will be.
But even that could have failed (as was the case when the backup cooling system failed after a truck knocked out their power last time), and we would again have relied on how quickly they could get new hardware into the rack and configured.
The same goes for the configuration of the balancer, which should have been backed up. Rackspace is responsible for managing all the network equipment and we don’t even have access to touch it.
But at the end of the day, it is our fault that the servers were down. It’ll always be our fault if something is down. The buck stops here.
Again, we’re very sorry to have this happen. And I can’t blame anyone for being frustrated or even angry.
Joshua
on 18 Jan 08@ Ivan: I think he may have been referring to Ian.. :]
This is the kind of post I really hoped wouldn’t go up after the service interruption. Now you have people talking about the reliability of web hosts and such, when in reality it has nothing to do with the service provider; it has to do with poor pre-planning.
Whether that’s the fault of 37 Signals or the Business Developer Consultants at Rackspace, who knows. But an educated guess is that 37 Signals is full of smart folk who should know better.
Dan
on 18 Jan 08Another suggestion is to have a status.37signals.com site hosted with another provider for you to provide status updates when something goes wrong. This is what Slicehost does and it comes in handy whenever the main system has (rarely) gone down.
DHH
on 18 Jan 08Dan, status.37signals.com is hosted with another provider. That’s how we were able to keep it updated during the downtime. Certainly a best practice for anyone running a service like this: have your status site on another host.
Marc
on 18 Jan 08I only wish my cable company handled outages this well.
Ivan
on 18 Jan 08@Marc: Your cable company has outages? :o I’m shocked!!! I thought that was impossible!!
DHH
on 18 Jan 08GeeIWonder, we could certainly have done better. Regardless of our disappointment in the response time from Rackspace, it’s ultimately our problem to ensure that we don’t even have to rely on Rackspace’s response time to stay up.
It was no doubt a mistake not to have a second load balancer sitting ready. We’re working with them right now to ensure that we have redundant hardware for all the network pieces ready by the end of the day.
Again, I agree with Joshua in the sense that there’s no way we can sidestep our responsibility here. The buck does stop at our doorstep.
MI
on 18 Jan 08Joshua: You’re right, we should have known better. It was a failure of ours as well as Rackspace’s. We have been generally very happy with the level of support we’ve gotten from Rackspace in the past, and it made us somewhat complacent about a critical factor of redundancy that we should have taken responsibility for ourselves. Lesson learned; redundant hardware should hopefully be online today.
Ivan
on 18 Jan 08@DHH, @MI: I love how forthright you guys are… it is a really refreshing and well respected quality of your team. Keep going at it… you guys do a great job.
And, BTW, my grouphub.com page is now back up.
drew olanoff
on 18 Jan 08great job handling this, thanks for the update!
Colin
on 18 Jan 08What? No pic of Homer Simpson to say you’re sorry?
In all seriousness, though, cheers for the great communication, folks.
Kevin Milden
on 18 Jan 08Hey, things happen. But… I think it is time to ask either Amazon or Google to help you guys with your infrastructure. Both companies understand what your applications mean to the community. I think both companies would be more than happy to help make sure you never have a service interruption ever again.
Just an idea.
Another load balancer will solve this issue this time.
As your apps become critical to more companies, interruptions in service will become more expensive.
Tony
on 18 Jan 08It was surprising this morning but at least I had a chuckle. Thanks for the updates! In times like these, it is healthy to smile.
Ryan
on 18 Jan 08Hey Guys, Rackspace is not the industry best. Everyone just thinks they are.
Check out softlayer or liquid web. Both are great replacements for Rackspace. I personally use both.
Justin
on 18 Jan 08pulls pants back up
Brad M
on 18 Jan 08You know what? IT WAS ONLY TWO HOURS. You guys are doing a great job and honestly there was no reason to post anything more than a simple apology note (that is if the blogs hadn’t started blowing up).
If two hours of 37signals downtime for someone was anything more than an inconvenience they need to reevaluate themselves and how they work.
Unless you were using Basecamp to monitor organs for transplants or something, this small (in the scheme of things) outage was no big deal. Hopefully you took the time to walk in a park or lighten up and relax a bit (it is a Friday, after all); the business you had stored in 37signals products will probably still be there when you get back.
I’m amazed at how quickly 100% uptime has become the expectation.
TR
on 18 Jan 08Obviously, things like this are bound to happen from time to time. That being said – normally when I make any changes that could result in downtime or hiccups, I try to schedule those for weekends or at times in the middle of the night. That way, the tree can fall in the woods, but no one is there to hear it.
Seems weird to schedule this on a weekday, late morning to early afternoon.
Deano
on 18 Jan 08I agree, Brad, but you could look at it another way. That’s 2-3 hours of time lost in the working day if you rely upon 37Signals products. You can still get work done, but if the data is unavailable to you it can be a big issue. You could also, for dramatic effect, multiply the downtime by the number of customers, and that’s a lot of productivity down the drain.
Offering compensation to those who feel most affected seems sensible, though I wonder if that will reward those who complain the loudest.
Still, the upside seems to be that the network will be more robust. I personally would love an offline mode for these apps, perhaps using Google Gears or some equivalent. Just being able to look up my contacts’ details in Highrise is the minimum feature I require at all times.
DHH
on 18 Jan 08TR, this wasn’t a scheduled change. We would never schedule a change like this in the middle of the day in the work week. The load balancer died without warning and we had to get a replacement.
Eric
on 18 Jan 08Disappointed with Rackspace’s response time? It seems that having your hardware replaced and being back up and running within 2 hours is both very quick and in line with their SLA.
Joshua Go
on 18 Jan 08Multiple redundant load balancers could be a step in the right direction, but whether Rackspace has their act together or not, there’s really no telling what can happen.
Have you guys ever considered hosting with one or two other companies aside from Rackspace, and decentralizing your infrastructure with round-robin DNS entries? I know this introduces additional overhead in terms of manpower required, and takes up your time in managing your infrastructure (having to deal with MySQL replication across data centers).
The advantage of such an approach works something like this. Say you have a choice of three providers who can each give you 99.0% uptime—most hosts provide better. That’s a 1% chance of downtime/failure. With three providers, the probability of the overall system failing in its entirety would be (0.01×0.01×0.01), or 0.0001% after you move the proper decimal places.
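As a quick sanity check of that arithmetic in a few lines of Python (and assuming, as above, that the three providers fail independently, which real outages rarely honor):

per_provider_downtime = 0.01                  # 99.0% uptime means a 1% chance any one provider is down
all_three_down = per_provider_downtime ** 3   # 0.01 x 0.01 x 0.01, failures assumed independent
print(f"P(all three down) = {all_three_down:.6%}")  # prints 0.000100%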
We all have limited time to focus on the things that matter, but I thought I’d offer some suggestions. I love your apps and my company is a paying customer for Basecamp.
GeeIWonder
on 18 Jan 08Exactly. In the same way that 37signals is re-evaluating Rackspace and how they work. If those two hours are right before a meeting, as a deal is closing or whatever, then those two hours can be of ridiculous importance. Compensation is meaningless in those terms. Unfortunately (for some, fortunately for others) this also happened on a Friday, so everything is potentially even more crippling.
The people who had real problems aren’t the ones posting here this afternoon. They’re the ones cancelling their weekend plans and looking into how they can host their webware themselves.
The 37signals guys know this, and did their best, and are owning up. But saying it’s not a big deal is just plain naive.
carlivar
on 18 Jan 08So what happens if the datacenter catches on fire next time or, heaven forbid, a plane crashes into it?
What’s the BCP, in other words?
Chris Carter
on 18 Jan 08Having been in the middle of situations like this – good job on keeping the downtime to a minimum and communicating exactly what was going on. I know many people are up in arms over anything, but I commend you on how you handled it. My company uses Campfire internally for developer communications, and we can always fall back to e-mail or IM if necessary, so we weren’t getting hot and/or bothered about the situation.
However, I have both a question and a suggestion: why don’t you invest in a hot site, rather than relying on a single hosting data center? Is it a business decision, or a financial decision?
A second load balancer is a great move – removing any single point of failure from your rack is key. But you’ve had problems at your data center before, in fact I seem to remember an outage just a few months back due to a power supply…? In that case, a second load balancer wouldn’t solve your problem, but a hot-site would. With commodity networking hardware, it’s becoming routine to either use round-robin DNS or even BGP to set up hot-site failover. And modern SAN technology is getting cheap enough now that you can set up SAN to SAN replication over moderately inexpensive network connections.
Just a thought.
Travis
on 18 Jan 08I’m truly appreciative of you keeping us in the loop via Twitter. Professionalism far and above any other service providers I’ve dealt with.
For those relentless complainers, it was two hours that you’ll never remember. And if you do, you should analyze your priorities in life and dependence on external services.
Would have loved to have watched you all back in the Y2K craze. “Run for your lives!”
Ben
on 18 Jan 08During the outage, you know what our team did? We talked face to face. It was refreshing. We discovered that we look just like our avatars! There was an employee in the back of our office that we never even knew existed.
GiP
on 18 Jan 08In my time zone (GMT+1) the outage was not a big deal.
While completely unrelated to this outage, I second the point about Rackspace going downhill in the past two years; the Texas datacenter has experienced all kinds of ridiculous and unbelievable issues. At the moment they’re overpriced for the level of service they offer, in my humble opinion. They provide old, overpriced equipment.
I encourage you to evaluate solid alternatives in the long term; there are a few. I finished relocating my assets away from them 4 months ago.
Neil Elver
on 18 Jan 08What we do for our mission-critical sites is run them in two locations: at our office (cheap bandwidth, but no diesel, just UPSes) and at our leased colocation. We use round-robin DNS to split the bandwidth and server load. The DNS service allows auto-failover, so if one location goes down it fails all traffic to the other and vice versa. The round-robin DNS keeps people pegged to one location per session (or DNS lookup), so session state is not an issue (users are not hopping between locations). The only downside is the extensive database and file replication, but once set up it works with a few seconds of latency and is not a problem. When our datacenter goes down (and, like Rackspace, they ‘assure us’ of their resiliency), we just dump all traffic to our office. This may seem extreme, but despite catastrophic downtime at one location or another we’ve only had about 5 minutes down (including scheduled!) in the last six years.
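A rough sketch of that failover logic in Python; the addresses, port, and health-check policy below are illustrative placeholders, not Neil’s actual setup:

import socket

# Hypothetical IPs for the two locations described above (placeholders, not real hosts).
LOCATIONS = {
    "colo":   "203.0.113.10",
    "office": "198.51.100.20",
}
HEALTH_CHECK_PORT = 80

def is_up(ip, timeout=2.0):
    # Treat a location as healthy if it accepts a TCP connection on the check port.
    try:
        with socket.create_connection((ip, HEALTH_CHECK_PORT), timeout=timeout):
            return True
    except OSError:
        return False

def records_to_publish():
    # Publish A records for every healthy location (round-robin); if only one is
    # healthy, publish just that one so all traffic fails over to it.
    healthy = [ip for ip in LOCATIONS.values() if is_up(ip)]
    return healthy or list(LOCATIONS.values())  # if both look down, publish both anyway

print("A records to publish:", records_to_publish())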
Joga Luce
on 18 Jan 08You guys remind me of MediaTemple. They’re also quick to take responsibility when something goes wrong and also maintain great status updates. Weren’t you guys hosted with them back in the day? Back when they still did design work. Like 2001ish? wow. I just made myself feel old.
MI
on 18 Jan 08For those of you who mentioned having a second site, it’s something we’ve been evaluating and working with vendors on for a while now. We definitely intend to bring up a second redundant site, but it’s a time consuming process and it’s not going to happen overnight.
For the record, we’ve got a great working relationship with Rackspace and we’re working hard with them right now to ensure that we’re not vulnerable to this type of outage in the future. We were a little disappointed with some aspects of the response this morning, maybe more than a little at times when the fur was flying, but we’re happy with the follow-up and steps to solve the underlying problems.
Julien Le Nestour
on 18 Jan 08I’ve got one significant complaint (see below).
Before that, let me second other comments and say you’ve handled the downtime brilliantly. Posting frequent and transparent updates, as well as resolving the problem in 2 hours, was very good. Though you could have planned better for hardware failure, you are doing it right now. You should probably just switch to round-robin DNS as suggested by others, given the nature of the 37signals biz and the expectations you set for yourself.
My one complaint is that our basecamp address ( https://xxx.projectpath.com ) was just timing out, and not redirecting to the status updates. So, in effect, the actual users of our Basecamp account were not kept posted and saw that as an unplanned failure without communication.
Since using 37signals in a big corporation isn’t easy to begin with, this was an unneeded blow to credibility in the eyes of the users.
Richard
on 18 Jan 08Read the posting as:
Gee, we’re cheap and it’s not our fault!
Less features, does that mean less hardware too?
For those who haven’t looked at hardware recently, a second load balancer is so inexpensive in the scheme of things.
I wasn’t impacted by the outage, but WOW, guys, what the heck? I have never deployed a system (even a startup) without an N+1 configuration.
This isn’t the fault of Rackspace. You simply configured an environment in an unprofessional manner. Period.
Step up and take the blame.
Shawn Oster
on 18 Jan 08Brings up a lot of questions about how ready people are to move to a pure hosted application solution. Personally, if someone can’t handle a two-hour outage then they shouldn’t be using a web-based application.
Brad M above hit on the same theme yet from another angle, saying basically “Come on guys, it was only two hours.” If you can’t turn around to your customer and say “Hey, seems one of our tools isn’t working, can you give us two hours?” then it’s time to get a desktop or intranet application.
Brad M writes off those two hours like no big deal yet there are quite a few situations where two hours are absolutely critical. Basecamp and like tools are awesome, just make sure you know their limitations and weak spots. Personally I don’t use them for anything critical but for everything else they’re very nice.
MI
on 18 Jan 08Julien: “Just use round robin DNS” doesn’t make much sense. Setting up a secondary redundant site is a complicated proposition, particularly with database sizes like ours. As I mentioned above, we’re working on it.
Regarding DNS, if you weren’t getting redirected to the status page, either your browser or your upstream DNS server was incorrectly caching old DNS data. We had a 10 minute maximum TTL (time-to-live) set on all of our DNS records prior to the problem. About 20 minutes into the downtime we changed all DNS records to point to the status page and reduced the TTL to 1 minute. Once things came back up, we switched the DNS records to point back to the proper location.
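To illustrate that caching behavior, here is a simplified model in Python using the timings from this comment; it assumes resolvers honor TTLs, which some upstream servers evidently did not:

OLD_TTL = 10     # minutes; the maximum TTL on the records before the incident
SWITCH_AT = 20   # minutes into the outage when records were repointed at the status page

def worst_case_sees_status_page(minute):
    # A resolver that cached the old record just before the switch can keep
    # serving the old (unreachable) address until that cached copy expires.
    return minute >= SWITCH_AT + OLD_TTL

for t in (15, 20, 25, 30, 35):
    view = "status page" if worst_case_sees_status_page(t) else "old address (stale cache)"
    print(f"t={t:2d} min: worst-case client sees the {view}")

Lowering the TTL to 1 minute at the same time meant that, once service was restored, the switch back to the normal addresses propagated almost immediately for resolvers that obeyed it.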
Richard: We’ve repeatedly taken responsibility for the problem. It’s not a matter of being cheap, it’s a side effect of not being able to manage our own network infrastructure that we took it for granted. As I said earlier: Lesson Learned. We will have fully redundant firewalls, load balancers, and network switches installed today and hope to bring them online within the next couple of days, if not sooner.
Richard
on 18 Jan 08MI: “It’s not a matter of being cheap, it’s a side effect of not being able to manage our own network infrastructure that we took it for granted.”
Does this mean you never asked if the Load Balancers were in an N+1 configuration?
Chris Carter
on 18 Jan 08Just out of curiosity (and I truly mean that, I don’t expect an answer if you don’t want to give it): how big are your various databases? Or rather – how much data do you actually retain in your transactional environment?
Give it up Richard – the error was made and it sounds like they’re doing their best to remedy it for the future.
Allan White
on 18 Jan 08So, our strongspace account is down (more on the Joyent blog); Basecamp was out for two hours this morning, and my fallback tool, Google Docs, wouldn’t create new documents.
Conspiracy! The terrorists won today.
Allan W.
on 18 Jan 08@ Richard/NYB: They stepped up and took the blame. They won’t make the same mistake again. No need to rub their face in it, now.
@ Oster & Brad M. – great points about desktop apps. However, single machines (running on single hard drives) are also prone to failure, many times more so than a proper datacenter. While I’d love to see offline data available, desktop apps are not inherently safer.
And regarding hosting apps yourselves: that might remove the fear of the unknown, but then all the reliability challenges are simply moving to your office. For you to plan, manage, and maintain.
Randy J. Hunt
on 18 Jan 08Thanks for handling it all with great responsibility and transparency. Always appreciated.
BenC
on 18 Jan 08@AllanW—To your point, SmugMug was down last night, as well. Conspiracy, indeed.
Jack
on 18 Jan 08Desktop app vs Web app
Nick
on 18 Jan 08@Richard: Seriously, give it a rest. If you’re going to continue to try to get mileage out of this, at least give us a link to your large-scale web application and evidence of its 100% uptime and flawless hardware infrastructure.
I personally think that they’ve handled this respectably. They were up front and now they’ll learn from the oversight. And I’m sure any users that were exceptionally affected by the downtime will be compensated.
MI
on 18 Jan 08In an effort to be as transparent as possible, I want to recap a quick conversation that we just had with Rackspace. I mentioned earlier that we were working to have a fully redundant set of load balancers and firewalls online this weekend—that’s not going to happen. Unfortunately, getting to that configuration will require us to take an extended downtime to move our existing servers to new cabinets and we don’t feel like we’re in a position to do that after today’s unplanned outage.
What we have decided to do is to go ahead and build out the new network configuration in the new cabinets and perform all of the prep work for the move, but to hold off on the actual move for a couple of weeks. We’ll post an announcement as soon as we can about when that move will actually take place, and when the downtime will be, but rest assured that we’ll do everything we can to schedule things to cause as little disruption as possible.
In the meantime, we’re in the process of adding cold-standby networking gear to our existing racks so that we can recover from a hardware failure by swapping cables and refreshing the configuration on the devices. We’ve also made sure that the backup processes for our existing networking hardware have been double-checked so we don’t have to rebuild configurations on the fly again.
Again, we apologize for today’s downtime, and we’re doing everything we possibly can to ensure that this kind of thing doesn’t happen again.
Jack
on 18 Jan 08“However, single machines (running on single hard drives) are also prone to failure, many times more so than a proper datacenter. While I’d love to see offline data available, desktop apps are not inherently safer.”
Who the fuck made this comment? I have a machine that I built 5 years ago and I use it daily. The machine I built is an Asus barebone, and the CPU is an AMD XP 200+. That should tell you how old this desktop is, and both PATA Western Digital hard disks have NEVER gone down even ONCE!
At least you have control over your own machine, while using a hosted service means giving up all your rights. All your data is hosted on their servers and you have no way to know when, if ever, it will come back. What if you need critical data but can’t access it and lose millions!
GeeIWonder
on 18 Jan 08I’m not trying to harp on, but this is an interesting development for an interesting application and an interesting business model. The opportunity to learn (for all of us, really) is significant.
One interesting question: at what point does 37signals have an implicit guarantee of service and/or compensation? Is there any legal precedent here for 37signals or a similar company, out of curiosity? Does offering some compensation weaken your case? I think it might.
Kevin Smith
on 18 Jan 08Since the topic is redundancy, it may be a good time to check out dedicated servers at The Planet. Our data centers are a great place for redundant servers. Everyone living off web apps should have redundancy today. Our low price and network make us a cost-effective solution. Give us a try.
Tal Giat
on 18 Jan 08The only question I have is why Jason did not respond, or write anything about today’s outage.
DHH
on 18 Jan 08Tal, Jason is actually speaking at a conference today. But everyone else was on board to handle the situation.
MI
on 18 Jan 08Tal Giat: Jason has been at the SEED conference all day. Ironically, the facility where the conference is being held suffered a power outage not long after our network outage began.
Cameron Watters
on 18 Jan 08@MI:
Great Update! That kind of transparency and forthrightness is awesome. You guys have also taken your licks in this forum and responded well.
Gotta love a culture where that can be practiced without people fearing for their jobs (or, I assume that’s the case anyway).
Anon
on 18 Jan 08Since The Planet wants to come in and issue comment spam, I suppose it’s worth pointing out a really bad datacenter outage where they didn’t notify any of their customers for 6 hours.
Come on guys, there’s no point coming in here and advertising.
Brad M
on 18 Jan 08I’m curious to see how much the complainers actually pay for the service. You can’t honestly expect 100% uptime on a service that your company probably pays less than $1000 per year for.
As for critical deals and whatnot being lost, don’t you think that if you are doing a deal that is so sensitive that a 2 hour delay could cause it to disappear, that you should look into making your information itself redundant. Personally I would have had a second copy of the info somewhere if that was the case.
Take precautions and get off the “my work is life or death” bandwagon are the two things that should be taken away from this.
Jon S
on 18 Jan 08Thanks for the updates throughout the outage, you guys handled it as well as could be expected.
Will
on 18 Jan 08Kevin, bad form man. Also shame on whoever is plugging Pair. Show some class.
Joshua
on 18 Jan 08Brad, I don’t think it’s like that. I think people are a little surprised that it happened in the first place, because one can argue that Basecamp is a mission-critical app for a bunch of companies (I definitely sat for a while twiddling my thumbs because I couldn’t check out the task lists I had set up for today) and it’s an easy thing to prevent. But now that appropriate steps have been taken to rectify this, that’s great. The chances of this happening again are very, very slim, and considering the service is pretty stable anyway, it just makes it even more reliable.
GeeIWonder
on 18 Jan 08@Brad: Anyone who has ever issued or submitted something as trivial as a paper or an RFP knows precisely how important 2 hours can be. Yesterday’s (probably last Friday’s, in most cases) backup doesn’t do you much good when your team has spent a Zulu-time workday on it since. As has been said, projects don’t fail from a lack of charts, graphs, stats, or reports; they fail from a lack of clear communication. Kudos to the 37signals team for keeping what lines of communication open they could via the status blog.
Now listen: Just because something can’t possibly be as important as all that to you doesn’t mean it can’t be or isn’t vital to others. It’s backwards to fault people for using a product the way it is intended to be used, and saying ‘take precautions’ reeks of misplaced condescension. It’s entirely possible, even for those users who make regular backups, have critical infrastructure plans, and do everything else right, to be severely affected by two hours. If you can’t wrap your mind around that, the fault is yours, not theirs.
I’ll take your two things away from this if you take my one.
Brad M
on 19 Jan 08Not to drag this down, but if you’re severely affected by 2 hours you shouldn’t be betting it all on a web app. If its worth that much to you, invest in something more stable. I’ll leave it at that.
Rabbit
on 19 Jan 08It’s comments like this that make me laugh.
I laugh because, strictly in my view, the authors of such comments need to lighten up.
Language like “of ridiculous importance” speaks volumes about what you find important in life. I promise you, nothing you can do inside the confines of an office is that important, unless you’re in Washington.
We’re taught, and we think, that what we do is important, but it’s mostly bullshit.
Joshua
on 19 Jan 08That’s like saying if your money is so important to you, you shouldn’t be putting it in a bank; you should be stashing it under your bed.
GeeIWonder
on 19 Jan 08unless you’re in Washington.
Wow. Just wow.
Mari G
on 19 Jan 08I’m really wondering, why we must beg for some type of compensation. Why not just give a discount to all your paying users. Even something as simple as $5 off the next month would do a lot for your customer service.
Nate Berkopec
on 19 Jan 08Dugg.
Brad M
on 19 Jan 08No, @Joshua, it is not. It’s like saying if you value your money you shouldn’t be putting it all in one place.
Jack
on 19 Jan 08“don’t you think that if you are doing a deal that is so sensitive that a 2 hour delay could cause it to disappear, that you should look into making your information itself redundant”
how can you access your redundant copy of the data if you don’t have the application?
Jack
on 19 Jan 08“Take precautions and get off the “my work is life or death” bandwagon are the two things that should be taken away from this.”
please tell your boss that in the middle of a meeting or a work day.
moe
on 19 Jan 08I read this post while seeing a banner for rackspace on the right. lol
Michelle
on 19 Jan 08Wow, a 2 hour outage! I can’t believe hardware can fail! I can’t believe you didn’t think to have at least 3 facilities dedicated to running your applications, with redundancy in between. While you are at it, I’m sure your customers would gladly pay the extra cost to cover these facilities. LOL Yes, I’m being sarcastic. Sorry, I’ve just been in this business too long and have to laugh at the “sky is falling” mentality. But then again, I can remember a time when people did work without computers. •gasps• ;) Remember, things happen, but it’s the response and, more importantly, the post-response that matters most. If you expect a Web application to never experience an outage, you are in the wrong business.
Gavin McLelland
on 19 Jan 08@Michelle: A 2 hour downtime is not going to kill anyone, that’s for sure. However, when you have as many paying customers as 37signals does on multiple hosted apps, there is just no valid reason (other than not understanding how to plan a fail-safe hosted infrastructure) not to have planned far enough in advance to eliminate a single point of failure such as a load balancer.
The bottom line is hardware is extremely cheap, and protecting against an issue like this is as simple as running in multiple data centers… it’s really not as hard as it sounds.
@37Signals: well played, but you need to spend some time on your infrastructure beyond the boys at Rackspace. Check out SoftLayer; they have blown me away so far.
AaronS
on 19 Jan 08On this very same day our ISP lost a circuit near our building. Our ad agency was without Internet connectivity for 1.5 hours. Our CEO was in the middle of pitching to a client via a WebEx meeting. One of our AD’s was in the middle of uploading a 60 second spot to a client for approval. It needed to be approved that day in order to make it to the station in time to start playing on Sunday. Various other important tasks couldn’t be completed because we did not have Internet connectivity.
We were not contacted by our service provider, we had to contact them. After contacting them we were put on hold for 15 minutes while “they checked things out”. After they came back they gave us a ticket number and told us they would work on it.
Over an hour later our service finally came back up. We were not contacted by the service provider to check if our service was back up and running. We had to contact them. There were no other updates given during this time.
We requested compensation for our downtime and were told that since we did not have a service level agreement, compensation would not be given.
As the IT manager at my agency people were breathing down my neck wanting to know why things weren’t working during a critical time. I couldn’t give them an answer other than “they were working on it”.
2 hours can be a big deal to people. If I am going to have a service outage, I would gladly have it with 37signals because I know they will keep me updated, they will be doing everything they can to get back up and running, and they will work to prevent the issue in the future.
Good work.
MarkS
on 19 Jan 08I’m amazed that 37 Signals has single points of failure in their network and thinks Rackspace is the cause of the outage!
Rackspace seems to have done a great job getting the site back in 2 hours; without their spare equipment the site could have been down for far longer waiting for replacement equipment from a vendor.
Very disappointing to hear them blaming their hosting company for their mistakes and reduces my confidence in 37 Signals.
ayjay
on 19 Jan 0837S: The buck stops with us, we take the blame, we’re going to fix it.
Richard: Why don’t you take responsibility instead of blaming other people?
37S: As we said, the buck stops with us, we take the blame, we’re going to fix it.
MarkS: Why don’t you take responsibility instead of blaming other people?
This could go on for quite a while. You know, these conversations tend to go better when people actually read threads before adding their comments.
bear454
on 19 Jan 08One question – why doesn’t status.37signals.com have an RSS or Atom feed?
Here I am at my desk, hitting F5 like it was a Woot-off...
bob
on 19 Jan 08What happens when somebody like Tim McVeigh decides to take out a data center like Rackspace, or a couple of their centers? Hmmm.
Michal
on 19 Jan 08Hi Jason and Team,
It was Friday afternoon in Europe so the atmosphere was somewhat relaxed. We’ve been using your service for more than a year now and this is the first time a problem occurred. Not so bad if you compare it to our hosting company.
I’d like to thank you for quick and honest communication. This is how it should be done.
Cheers
Anonymous Coward
on 19 Jan 08Desktop app vs Web app
Meaning? What if your computer crashes? Do you have a spare computer hot backed up that you can turn on? What if your laptop is stolen? Do you have a spare in your luggage? Talk about single points of failure.
Desktop apps are not more reliable than hosted apps.
Kevin
on 19 Jan 08Wow… You guys still can’t seem to take responsibility for your business decision even after so many people have called you out on it. Not having control of the networking hardware, etc. really doesn’t make a difference. You knew the risk and I’m sure your account manager explained the risks. Rackspace has a one-hour hardware replacement guarantee – they replaced your hardware within that span of time and were working with you to configure it.
Rackspace runs a completely redundant network up to the point where it enters your hardware. I highly doubt you would have got a faster response from any other company that had to both replace the hardware and rebuild the configuration.
Cheers, Kevin
DHH
on 19 Jan 08Kevin, please do read back through this thread. As we’ve said countless times, the buck does stop here. We do ultimately have the responsibility. But the replacement guarantee of 1 hour wasn’t kept as we were down for two hours because of this.
Sean H
on 19 Jan 08I guess this is all part of 37 Signals maturing as a business.
To 37Signals’ credit, you appear to have communicated well with your client base, which is important.
It’s just not as important as ensuring that you’ve put some planning and investment into system capacity and failover/redundancy before things come crashing down. It’s very easy for companies going through rapid growth to forget about investing in infrastructure until it’s too late. Having designed and built many HA systems, my experience is that doing it right costs a lot of money. You really only see the return on those kind of investments when you’ve avoided a major outage.
I just hope that the 37Signals management and team have learned the lesson and will make the required investments.
Manifester
on 19 Jan 08“I’m really wondering, why we must beg for some type of compensation.”
Whoever you are, I hope you don’t run a business. Seriously. Manifest. Manifest, punk!
MI
on 19 Jan 08Sean: The load balancer was one of the very few single points of failure that we had on our network and we definitely relied too much on being able to get it very quickly replaced in the event of a failure. We have spent a lot of money making our systems as redundant as possible and we’ll continue to do so going forward. It’s a testament to the level of service that we’re used to with regard to our networking infrastructure that we took it for granted. We won’t make that mistake again.
Matt Russell
on 19 Jan 08As someone who uses Basecamp a lot (but for whom it’s not mission-critical), my first thought upon seeing the status page was “One more example of 37 being in rare form, even when they’re down”. I suppose it’s possible to critique the prior config, but I don’t think it’s possible to find fault in your response. Kudos.
And, albeit guiltily, I appreciate seeing you suffer a problem because I can learn as much from that as reading your blog.
David Risley
on 19 Jan 08I used to be a paying customer of Basecamp. Not anymore. However, I just wanted to say I’m impressed with the professionalism shown with this. 37 Signals is a class act.
Anonymous Coward
on 20 Jan 08What a massive fuss over nothing
singingdancingbear
on 20 Jan 08It seems that 37Signals took a page out of Rackspace’s book on responding to customers during an outage. If I remember correctly Rackspace was praised up and down for responding and keeping their customers informed.
Good job on copying the response tactic and then blaming them for not being hard-nosed salespeople about selling you a second load balancer. An SLA is as strong as the risk you want to accept.
Tom Brady
on 20 Jan 08Can we insert a little common sense in here for a moment? 37signals deployed a configuration that made technical sense and met their budget. You can’t have everything you want; we all make choices based on a variety of factors. 37signals was great with its communication, and Rackspace was great with its response. Can The Planet, Softlayer, LogicWorks or any of the other advertisers in here identify and then replace a failed dedicated hardware load balancer in less than two hours?
There are lots of scenarios; lots of things can fail. If you truly can’t handle a two-hour outage in a month you have NO CHOICE but to deploy a cluster in one location and a second config in a second data center. If you don’t do that, you will fail at ANY provider. Rackspace and 37signals handled the situation fine… all of the haters need to find something else to gripe about, like Philip Rivers and LaDainian’s bad knees. Go Patriots! ;-)
alan
on 20 Jan 08c’mon. the apps went down for a little while. how about resorting back to say, erm, email in the meantime?
Matt Carey
on 20 Jan 08Come on, the world has not ended and we are still breathing. Really, why is everyone getting so heated about this?
I think 37s handled it extremely well. Panic can set in in those situations and they handled it calmly and professionally. I personally really appreciated the twitter updates.
Seriously, this is the risk (however small) of using web-based apps. Either the net connection or the app could go down—that is the risk you take.
Who here has wasted work time with an OS or desktop app crashing, losing all the work they have done that day because they forgot to save? I bet you don’t go flaming M$, Adobe, Apple, etc.?
Another LB
on 21 Jan 08I think the outage was handled well. This should be expected by anyone using an ASP, and I cannot remember the last time BC was down since I signed up when you first launched. Nice job!
Although I was disappointed by the laying of blame on Rackspace. It is poor business and bad manners. It should be between you and the service provider. It should be a united front until the problem is solved, and then just switch if you are not happy and keep it to yourself. If you look at the posts above, you have potentially negatively impacted their business based on one mistake. I agree you should command “Fanatical Service”, but you can’t get that by shifting blame. Whenever they made a mistake with us they have taken care of it ASAP and apologized repeatedly. Sorry for the rant, but it just seemed so out of character for 37signals.
Disclosure: I am a 5+ year Rackspace customer and I have blamed others for problems with our apps. One of my clients showed me there is a better way to handle it. Usually I follow this advice.
No one is perfect but you all seemed to handle it well. Keep up the good work.
Jason Leister
on 21 Jan 08I appreciate how you guys handled this. A good example for the rest of us.
And very different from Joyent’s strongspace.com outage that’s been going on for the past WEEK.
Sebhelyesfarku
on 21 Jan 08Looks like 37signals fanboys are apologists like Maczealots.
Robin
on 21 Jan 08I haven’t read all of the comments in detail, and I was not affected by the outage. However, there is a question that nobody seems to have asked – was the problem a result of using Rails, or exacerbated by it? In other words, if the app was written in PHP, would there have been the same problem for the same size of database?
By the way, it is somewhat ironic that this outage happened on the same day that someone drew attention to Dreamhost’s problems – pot and kettle, etc.
JF
on 21 Jan 08Robin, this was a physical hardware failure. It had nothing to do with software.
Harry
on 21 Jan 08For all the people pointing fingers at Basecamp, I offer this solution. Get your own box and install http://www.activecollab.com/. Then you can go to sleep knowing you are in complete control.
I get that it’s the community feedback that provides the motivation for 37signals to be a better business. But c’mon, you think this level of service is commonplace? Appreciate what you’ve got before 37signals decides to raise the prices just to make it worth listening to some of these entitlement attitudes. Most good developers charge in a day what I pay for a year of Basecamp.
This isn’t about “being cheap” or “cutting corners”. This is about a balance between cost and price. If most people were honest, they would know that the value/price ratio of Basecamp is through the roof. Sure it was a major outage, but they have remedied it and protected the future. Sometimes you have to be bitten to know the value of redundancy.
One thing I’ve learned is that people don’t remember you for your failures. They remember you for how you responded to them. We’re human. We’re going to make mistakes.
As a contract developer, I know the dollar value of my time. And I certainly couldn’t produce the results Basecamp does for what it costs me each month.
@Robin: “I haven’t read all of the comments in detail, and I was not affected by the outage. However, there is a question that nobody seems to have asked – was the problem a result of using Rails, or exacerbated by it?”
Try starting with the big white area above the comments. You will find valuable information there.
IH-RS
on 22 Jan 08Another Rackspace horror story : http://blog.davidville.com/2008/01/21/downtime-today/
Vive le Web 2.0 :-D !
Robin
on 22 Jan 08@JF – Yes, I sort of gathered it was a hardware problem. But why is the hardware needed? I’m new to Rails, and from what I read there is a need to have a collection of Mongrel “servers” (I use the word loosely) to counteract some performance shortcomings. This leads me to wonder if a PHP system might not have needed a load balancer. And your rather sharp answer does not allay my concerns.
From reading various bits on the web (including on 37Signals site – eg the recent long comment on shared hosting) there also seems to be a somewhat dismissive attitude to some users concerns.
Don’t get me wrong, I think Rails is a wonderful intellectual achievement. But if it wants to get real market penetration it probably would be better if it sold for a small charge (eg $20 – $40) to provide funds for dealing with customer concerns and providing proper documentation.
@Harry – I had read the “official statement” – it’s only the comments that I skimmed.
MI
on 22 Jan 08@Robin: It is common, regardless of the framework or language that is used, for web applications to require more than one web server in order to meet the performance and redundancy demands of their users. When you need more than one web server to handle the traffic demands, you have to have a way for incoming web requests to be routed to more than one server. That’s where the load balancer comes in. The same situation would have arisen no matter what technology we had chosen to write our applications in.
There is a fairly extensive entry on Wikipedia that explains load balancing more fully; I’d suggest you give it a read if you’re interested in understanding how load balancing works.
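For readers who want a concrete picture, here is a minimal round-robin sketch in Python. It only illustrates the concept; it is not the hardware balancer 37signals actually runs, and the backend names are made up:

from itertools import cycle

BACKENDS = ["app1.internal:8000", "app2.internal:8000", "app3.internal:8000"]  # hypothetical pool

class RoundRobinBalancer:
    # Hands each incoming request to the next healthy backend in turn.
    def __init__(self, backends):
        self.backends = list(backends)
        self._ring = cycle(self.backends)

    def pick(self, healthy):
        # Walk the ring until we find a backend that is currently marked healthy.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in healthy:
                return candidate
        raise RuntimeError("no healthy backends available")

balancer = RoundRobinBalancer(BACKENDS)
healthy = set(BACKENDS) - {"app2.internal:8000"}  # pretend one app server has died
for request_id in range(4):
    print(f"request {request_id} -> {balancer.pick(healthy)}")

The point MI makes above is visible here: requests keep flowing even with one backend marked down, which is exactly why the balancer makes the overall system more stable even though it was itself the piece that failed this time.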
Robin
on 23 Jan 08@MI – Thanks, that’s a much more comprehensive answer.
I understand perfectly that, if there was a load balancer operating with a PHP application, a failure of the load balancer would have had the same consequences.
However I detect that you may still be evading the issue of whether there would have been a load balancer, in your particular case, if your application was written in (say) PHP.
MI
on 23 Jan 08@Robin: I’ll say this as clearly as I possibly can: We would have required a load balancer no matter what technology we had used on the backend. A single web/application server is not nearly enough to handle the volume of traffic that we have, nor is it resilient in the face of a failure. The load balancer makes us MORE stable, not less, despite this incident. It allows for the failure of web and application servers without creating a user visible downtime.
GeeIWonder
on 23 Jan 08Unless your argument is that, with the apps coded in PHP vs. RoR, one server could handle the full load of the webapps, the blog apps (since they went down too), and the webserver at any time, it holds no water, Robin.
This is pretty unlikely to happen. I’d suggest the differences are smaller than the difference in having another processor and RAM for half the users.
Also, your single server is a single point of failure. Unless you want to have a second server hanging around doing nothing. Wait, that seems silly. Why not have two servers hanging around working at half load (and better performance for all users, plus greater capacity to handle spikes and even maintenance) rather than one server that would lock up during spikes and another sitting around doing nothing? If only there was some piece of software or hardware that could do that for us, ‘balance the load’ as it were…
Oh, wait…
Keith
on 23 Jan 08Well handled folks. Thanks for the updates. Anyone who deals with technology certainly understands and plans for hiccups in the products they rely on every day! Thank you for the status updates.
Keith
on 23 Jan 08@ Harry
ActiveCollab? You’re kidding right? These are the guys who came on and said Basecamp was a joke because it was a commercial product that had too high a price point and that they would design a better solution in PHP more quickly and have it be free.
They designed what amounted to a weak imitation and then, instead of fixing the problems, decided to restart their project and charge for it. Their price point is higher than Basecamp’s, last I checked.
So…I wouldn’t exactly be tooting their horn too loudly given their history.
Robin
on 23 Jan 08@GeeIWonder – I’m not making any argument – just asking a question and hoping for a comprehensive answer from the people who created Rails. I hope the answer will be reassuring, but it hasn’t come yet.
I can speculate about how the 37Signals web applications are hosted but that is not the same as them telling us how it is done.
The more I don’t see a simple answer, the more I think I may have asked an awkward question.
Let me repeat the question (for answer by someone from 37Signals). If the 37Signals web apps had been developed in PHP rather than Rails would the server arrangement have been different and would that difference have reduced the inconvenience to customers?
And if the answer is “no” I would like some supporting explanation.
Mi
on 23 Jan 08@Robin: Scroll up.
Robin
on 24 Jan 08@Mi – Humble apologies. For some reason I missed your penultimate reply.
I’m happy now.
This discussion is closed.