Basecamp has suffered through three serious outages in the last week: on Friday, August 28; on Tuesday, September 1; and again today. It’s embarrassing, and we’re deeply sorry.
This is more than a blip or two. Basecamp has been down during the middle of your day. We know these outages have really caused issues for you and your work. We’ve put you in the position of explaining Basecamp’s reliability to your customers and clients, too.
We’ve been leaning on your goodwill and we’re all out of it.
Here’s what has happened, what we’re doing to recover from these outages, and our plan to get Basecamp reliability back on track.
What happened
Friday, August 28
- What you saw: Basecamp 3 Campfire chat rooms and Pings stopped loading. You couldn’t chat with each other or your teams for 40 minutes, from 12:15pm to 12:55pm Central Time (17:15–17:55 UTC). Incident timeline.
- What we saw: We have two independent, redundant network links connecting our two redundant datacenters. The fiber optic line carrying one of those links was cut in a construction incident. No problem, right? We have a redundant link! Not today. Due to a surprise interdependency between our network providers, we lost the redundant link as well, briefly disconnecting our datacenters. That disconnect broke our cross-datacenter Redis replication: we exceeded the maximum replication buffer size, which triggered a catastrophic loop of full resyncs that overloaded the primary Redis server and made responses very slow. This took Basecamp 3 Campfire chats and Pings out of commission. (A rough sketch of the Redis settings involved follows below.)
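For the curious, here is a minimal sketch of the Redis settings behind this kind of failure. It is not our actual tooling: it assumes the redis-py client, and the hostname and limits are illustrative placeholders. In Redis, when a replica falls behind by more than the primary’s replica output buffer allows, the primary drops the connection, the replica requests another full resync, and under load the cycle can repeat. A larger output buffer and replication backlog give a lagging replica room to catch up, or to resume with a partial resync after a brief disconnect.

```python
# A minimal sketch, not Basecamp's actual tooling. Assumes the redis-py
# client and a reachable primary; the hostname and limits are placeholders.
import redis

r = redis.Redis(host="redis-primary.example.internal", port=6379)

# Inspect the current replication-related limits on the primary.
print(r.config_get("client-output-buffer-limit"))
print(r.config_get("repl-backlog-size"))

# Illustrative values only: for replica connections, a 2 GB hard limit and a
# 1 GB soft limit sustained for 120 seconds before the primary disconnects a
# lagging replica, plus a 512 MB backlog so brief disconnects can resume with
# a partial resync instead of a full one.
r.config_set("client-output-buffer-limit", "replica 2147483648 1073741824 120")
r.config_set("repl-backlog-size", "536870912")
```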
Tuesday, September 1
- What you saw: You couldn’t load Basecamp at all for 17 minutes, from 9:51am to 10:08am Central Time (14:51–15:08 UTC). Nothing seemed to work. When Basecamp came back online, everything seemed back to normal. Incident timeline.
- What we saw: Same deal, with a new twist. Our network links went offline, taking down Basecamp 3 Campfire chats and Pings again. While recovering from this, one of our load balancers (a hardware device that directs Internet traffic to Basecamp servers) crashed. A standby load balancer picked up operations immediately, but that triggered a third issue: our network routers failed to automatically synchronize with the new load balancer. That required manual intervention, extending the outage. (A sketch of one common cause of this kind of desynchronization follows below.)
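What does “failed to synchronize” look like in practice? One common culprit in a setup like this is a stale ARP entry: the routers keep mapping the shared virtual IP to the failed unit’s hardware address until the newly active unit announces itself. Here’s a hypothetical sketch of that announcement, a gratuitous ARP. It assumes the scapy library and root privileges, and the IP, MAC address, and interface below are placeholders, not our real topology.

```python
# A hypothetical illustration, not Basecamp's actual configuration. Assumes
# scapy and root privileges; the VIP, MAC, and interface are placeholders.
# After a failover, upstream routers can keep a stale ARP entry pointing the
# shared virtual IP (VIP) at the old load balancer. A gratuitous ARP from the
# newly active unit refreshes those caches.
from scapy.all import ARP, Ether, sendp

VIP = "203.0.113.10"           # virtual IP shared by the load balancer pair
NEW_MAC = "02:00:00:aa:bb:cc"  # MAC address of the newly active unit
IFACE = "eth0"                 # interface facing the routers

# op=2 is an ARP reply ("is-at"), the usual form of a gratuitous ARP: it
# claims the VIP for the new unit's MAC and is broadcast so every device on
# the segment updates its cache.
garp = Ether(dst="ff:ff:ff:ff:ff:ff", src=NEW_MAC) / ARP(
    op=2,
    psrc=VIP,
    hwsrc=NEW_MAC,
    pdst=VIP,
    hwdst="ff:ff:ff:ff:ff:ff",
)
sendp(garp, iface=IFACE, verbose=False)
```

When an announcement like that doesn’t happen automatically, someone has to refresh the routers by hand, and that kind of manual intervention is what stretches an outage.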
Wednesday, September 2
- What you saw: You couldn’t load Basecamp for 15 minutes, from 10:50am to 11:05am Central Time (15:50–16:05 UTC). When Basecamp came back online, chat messages felt slow and sluggish for hours afterward. Incident timeline.
- What we saw: Earlier in the morning, the primary load balancer in our Virginia datacenter crashed again. Failover to its secondary load balancer proceeded as expected. Later that morning, the secondary load balancer also crashed and failed back to the former primary. This led to the same desynchronization issue from yesterday, which again required manual intervention to fix.
All told, we’ve tickled three obscure, tricky issues in a five-day span that led to overlapping, interrelated failure modes. These are exactly the kinds of woes we plan for. We detect and avert issues like these daily, so this week was a stark wake-up call: why didn’t we this time? We’re working to find out.
What we’re doing to recover from these outages
We’re pursuing multiple options in parallel to recover, with contingencies in place in case our primary recovery plans fall through.
- We’re getting to the bottom of the load balancer crash with our vendor. We have a preliminary assessment and bugfix.
- We’re replacing our hardware load balancers. We’ve been pushing them hard, and traffic overload was a driving factor in one of the outages.
- We’re rerouting our redundant cross-datacenter network paths to ensure proper circuit diversity, eliminating the surprise interdependency between our network providers.
- As a contingency, we’re evaluating moving from hardware to software load balancers to decrease provisioning time. When a hardware device has an issue, we’re days out from a replacement. New software can be deployed in minutes.
- As a contingency, we’re evaluating decentralizing our load balancer architecture to limit the impact of any one failure.
What we’re doing to get our reliability back on track
We engineer our systems with multiple levels of redundancy and resilience precisely to avoid disasters like this one, including practicing our response to catastrophic failures within our live systems.
We didn’t catch these specific incidents, and we don’t expect to catch them all! But cascading failures that expose unexpected fragility and leave us with difficult paths to recovery shouldn’t catch us by surprise. Those, we can prepare for.
We’ll be assessing our systems for resilience, fragility, and risk, and we’ll review our assessment process itself. We’ll share what we learn and the steps we take with you.
We’re sorry. We’re making it right.
We’re really sorry for the repeated disruption this week. One thing after another. There’s nothing like trying to get your own work done while your computer glitches out on you or just won’t cooperate. This one’s on us. We’ll make it right.
We really appreciate all the understanding and patience you’ve shown us. We’ll do our best to earn back the credibility and goodwill you’ve extended to us as we get Basecamp back to rock-solid reliability. Expect Basecamp to be up 24/7.
As always, you can follow live updates about Basecamp status here, catch the play-by-play on Twitter, and get in touch with our support team anytime.
Hey folks, we are trialing Basecamp and obviously something like this doesn’t instill confidence. But we totally understand and this seems like a “Black Swan” series of events. What’s the probability?! However, I must confess that this transparent root cause analysis with a clear plan to prevent future issues has raised my hopes. You folks are an impressive company and I’m confident this will be behind you soon. Wishing you all the best and we are looking forward to becoming a customer soon.
Thanks Tahir. It doesn’t instill confidence, you’re right. But we’re doing all we can to regain your confidence. We’re confident in our long-term uptime track record – it’s generally stellar – but we clearly have work to do. Thanks for giving us another chance, and we hope to have you as a customer when you’re ready.
I really appreciate these posts. Your honesty and transparency are thoroughly refreshing. Appreciate all that you do!
Our customers deserve this level of clarity and transparency. It’s what I’d want from the companies I do business with.
Cheering you on, guys. This is the standard for customer communication and transparency when SHTF.
Great, thanks for the breakdown. We were considering moving away from Basecamp but this changed our mind. Love the product and clearly a great team too!
Thanks Waka, we appreciate your continued trust. And sorry about this.
It’s software. Software runs on hardware. Hardware contains software. There is no such thing as perfect software or perfect hardware.
But there is such a thing as a perfect, honest response. True transparency. Once again Basecamp you are best in class!
Keep up the good work. We love you =D
Thank you thank you. Appreciate your support, Pär!
@Basecamp
I thought you guys hosted in the cloud (AWS).
Am I naive in thinking that these problems shouldn’t be isolated to just your company? Wouldn’t the kinds of issues you described above affect way more AWS customers?
Good instincts! Yes, we do run many of our applications in the cloud, and a disaster there would affect many other sites as well.
Basecamp 3 and some supporting services (login, billing, and the like) are not in the cloud, however. They’re hosted exclusively in our on-premises datacenters.
It’s a bit more complex than that. I work in operations for a large enterprise that has over 200 AWS accounts. We see outages in a single account or a cluster of accounts far more frequently than we see full outages with AWS. AWS works hard to contain the “blast radius” of any outage, which means that when they do have an outage, it usually only impacts a small subset of their entire infrastructure.
I should also say that we don’t see outages very frequently in AWS, but they do happen.
I have been a big fan of Basecamp for a while. But after reading this post, it’s become my dream to join and work with this incredibly transparent team!
The cascading problems are the result of a complex, tightly coupled system without enough buffering. In IT, when things go wrong they happen so fast that manual interventions mostly miss the target and more cascades follow. Try to make the system simpler, add buffering, and maybe add diversity (cultural and technical) to the architecture teams overseeing the solution so they can better spot the “soft spots”. Also take care not to put in too many alerts and triggers, but have the ones in place prioritized by urgency.
In short work it like the aircraft industry 🙂
Kudos for openness – but that I had expected 🙂
And yes we can all fail – what counts is what we learn and how we “get back up”.
Oh – and Basecamp is really an eye opener in how to work efficiently, thank you for that!