From 2:13am GMT March 13 / 9:13pm Central March 12 until around 4:10am GMT / 11:10pm Central, Basecamp 3 was mostly offline and Basecamp 2 was unable to process file uploads and downloads, as our cloud storage provider had a severe, sustained outage. We continued to have minor disruptions in service from 4:10am GMT / 11:10pm Central until everything was cleared at 6:53am GMT / 1:53am Central.
This is the second time in a week that I’m forced to write “I’m so sorry”. That’s incredibly painful. Both because we’re failing our customers for the second time in a week, and because it’s showing us just how unprepared we’ve been as an organization to deal with these cloud challenges, despite our belief otherwise.
I’m not going to bother you with platitudes about “lessons to be learnt”, because I’ve already done that just a few days ago. This goes much deeper than just a few lessons. It has called into question our entire risk management and operational structure at Basecamp.
It’s also been a mighty fall. From reaching for 99.999% in uptime – the hallowed five nines! – we’re now scrambling for two of them. From riches of reliance to rags of shambles. To say this is humbling is an epic understatement.
We’re stopping all major product development at Basecamp for the moment, and dedicating all our attention to fixing these single points of failure that the recent cloud outages have revealed. We’re also going to pull back from our big migration to the cloud for a while, until we’re able to comfortably commit to a multi-region, multi-provider setup that’s more resilient against these outages.
I’m sorry. I’m really sorry (and ashamed).
For posterity, here are the complete play-by-play updates:
Starting 2:13am GMT / 9:13pm Central, Basecamp 2 and Basecamp 3 both started going intermittently offline, as our cloud storage provider had a severe, sustained outage.
As of 3:21am GMT / 10:21pm Central, these problems are ongoing. We are working all emergency avenues and all fallback options.
As of 3:33am GMT / 10:33pm Central, we’ve managed to make both Basecamp 2 and 3 accessible more of the time. But file uploads and downloads are still offline. We’re working on all of this.
As of 3:43am GMT / 10:43pm Central, we’re seeing some improvement in both file downloads and uploads for both Basecamp 2 and 3 at the moment. But it’s still very intermittent. We’ve managed to somewhat stabilize general access to the system, though even that is somewhat intermittent as well.
As of 3:57am GMT / 10:57pm Central, our cloud storage provider believes they’ve found the root cause, and they’re working on resolving it. In the meantime, we’re working on making sure that the apps remain accessible without interruption for everything other than file uploads and downloads.
As of 4:10am GMT / 11:10pm Central, both Basecamp 2 and 3 continue to be mostly available and file uploads and downloads mostly working. We are not stopping until things are fully and completely working for everyone and for sure.
As of 4:24am GMT / 11:24pm Central, things continue to look mostly good for both Basecamp 2 and 3 and file uploads and downloads are mostly working. Mostly. (Not exactly a comforting word in situations like this, but here we are! 😢). Work continues.
As of 4:50am GMT / 11:50pm Central, the issues continue at a simmer. Basecamp 2 and Basecamp 3 continue to have blips, and file uploads and downloads continue to see some errors as well. The cloud storage provider is now rolling out their root-cause fix, albeit slowly. We continue to fight the best we can on our end.
As of 5:10am GMT / 12:10am Central, we continue to manage the situation the best we can on our end, including developing better error messages, and preparing backup provisions. The cloud storage provider continues their work as well.
As of 5:24am GMT / 12:24am Central, Basecamp 2 and 3 are still available, but there’s a trickle of errors with file uploads and downloads. If you see an error, retry the upload and it’ll most likely go through.
As of 5:43am GMT / 12:43am Central, Basecamp 2 and 3 are still online, but we’re not out of the woods yet. We’ll continue to update here, on our Status Page and on Twitter.
As of 6:11am GMT / 1:11am Central, we’re still working to bring Basecamp 2 and Basecamp 3 back up and fully stable.
As of 6:35am GMT / 1:35am Central, we’re not seeing the same level of errors on our end for Basecamp 2 and Basecamp 3. Neither are we seeing any customers reporting issues. But, we want to wait a beat to make sure we’re in the all clear.
As of 6:53am GMT / 1:53am Central, we’re in the all clear. Everything is back up and behaving as expected in Basecamp 2 and Basecamp 3. All file uploads, file downloads, and avatars are working again.
Whoa! Ease up on the self-flagellation! Yes, it’s unfortunate to have 2 notable outages in a week. And yes, it’s time to do some deeper dives. But it takes time to make deep dives, so if two fairly unrelated problems hit in the same week, there’s no way you could have prepared for it, short of charging 10X what you charge for Basecamp today and ballooning your staff by a huge factor (which creates its own set of problems).
Basecamp is very important to me and my team. But honestly, it’s $100/month. I can’t expect perfection, and I don’t. I’m just very, very appreciative that you guys care as much as you do and you make a high-value product.
Keep up the good work.
You’re too kind, John. The sad fact is that this issue too was avoidable (everything ultimately is). And that we don’t set our standards on “it’s only $100/month”. We ask people to trust us with their important data, and we take that incredibly seriously.
But thank you. Again 🙏❤️
David,
Thanks for the extraordinary status reporting. Although the team is currently unaware of the problems as they are asleep, they soon will be when I present this fine example of customer care and support.
For your own sanity and peace of mind, it might be good to know that charging an extra few dollars to every single user for a more robust back up solution would probably not result in a mass exodus. Basecamp is one of a kind and the whole community loves and supports it. You might want to hang that one out there to everyone at some point.
I agree with John, you shouldn’t wear your heart on your sleeve quite so much. We all know how proud you are of the Basecamp ecosystem because we are. It also lets all our teamwork be the best it can be. But, you need to sign off positively and keep the personal disappointment up your sleeve 😉
Onwards!
Thank you, Joe. Means a lot to hear you be so supportive. This is really tough for us to swallow after having such a stellar record for so many years. We got ahead of ourselves on a cloud migration, and we didn’t do a proper job of getting backup and failure models mapped out. We will improve.
❤️
Chill out, man.
You have a great product. We’ve come to expect outages, as the web is a multi-layered mega-marvel; something usually goes wrong.
Waiting to see what actually went wrong from an engineering POV.
Thank you, Hemanth. We were caught up in the global outage of Google’s cloud storage, which also took down Gmail, Drive, and other Google applications: https://www.theguardian.com/technology/2019/mar/13/googles-gmail-and-drive-suffer-global-outages
Which is your cloud provider? Do they have a status page?
We were using Google Cloud Storage, but given too many issues big and small over the year, we’ll be migrating to AWS S3 for storage going forward.
The obvious care, transparency and honesty which you and the team demonstrate more than offsets a little downtime.
Keep up the great work.
Thank you, Jon. That’s very kind of you. We really don’t have the standing at the moment to ask for more patience or understanding, but we’re very grateful when it’s offered nonetheless.
Thank you for the way that you have chosen to manage your company. Failures are hard to take at any time, but the way that you update and tell the world what is happening, not only technically but also with your thoughts and comments along the way and after the fact, is something that more companies should do.
You allow us to see that you are not perfect, but you are always working at improving. Keep being as awesome as you are!
The two coins principle at work!
It doesn’t have to be crazy at work… Great book!
You need to fix your cloud plan and Basecamp 3. It’s a slow, painful experience using this version, as Basecamp 2 was. User management is a joke at over 10 seconds per action needed to add or delete people or even list who you added to your project.
It was a good reminder for anyone providing a customer system of what can and does go wrong. Thanks for the honesty and transparency.
David,
As a user, candidate, and (recovering) CEO/entrepreneur, I applaud how you’ve led the company thus far and through these challenging times. Your accountability makes it clear that it’s not about the wins, it’s about the promise of getting there and refinement as the CEO. This moment proves exactly why leadership is so tough for so many, and why there are so few leaders.
Onward + upward,
Marissa Spano
The Brave