Basic Explanation
Some background
On Dec. 4 around 5:30 p.m. CT, a number of our sites began throwing errors and were basically unusable. Specifically, Basecamp Classic was briefly impacted and very slow. Campfire users experienced elevated errors, and transcripts were not updated for quite some time. Highrise was the most significantly impacted: for two hours, every page view produced an error.
Why our sites failed
When you visit a site like Basecamp it sends you information that’s generated from a number of database and application servers. These servers all talk to each other to share and consume data via connections to the same network.
Recently, we’ve been working to improve download speeds for Basecamp. On Tuesday afternoon we set up one server with software that simulates a user with a bad Internet connection. This bad traffic tickled a bug in a number of the database and application servers, which caused them to become inaccessible. Ultimately, this is why users received error messages while visiting our sites.
How we fixed the sites
We powered off the server sending out the bad traffic. We powered back on the database and application servers that were affected. We checked the consistency of the data and then restarted each affected site.
How we will prevent this from happening again
- We successfully duplicated this problem so we have an understanding of the cause and effect.
- We asked all staff not to run that specific piece of software again.
- We know someone might forget or make a mistake, so we set up alerts to notify us if the software is running anywhere on the network. We verified the check works too.
- We are working with our vendors to remove the bugs that caused the servers to go offline.
In-Depth Explanation
Topology
Our network is configured with multiple redundant switches in the core, two top-of-rack (TOR) switches per cabinet, and every server has at least 2×10GbE or 2×1GbE connections split over the TOR switches. Servers are spread among cabinets to isolate the impact of a loss of network or power in any given cabinet. As such, application servers are spread throughout multiple cabinets; master and slave database pairs are separated, etc. Finally, the cabinets are physically divided into two “compute rooms” with separate power and cooling.
Before the failure
We’ve been investigating ways to improve the user experience for our customers located outside the U.S. Typically these customers are located far enough away that best-case latency is around 200 ms to the origin, and many traverse circuits and peering points with high levels of congestion/packet loss. To simulate this type of connectivity we used netem. Other significant changes preceding the event included: an update to our knife plugin that allows us to make network reconfiguration changes, the decommissioning of a syslog server, and an update of check_mk.
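For readers unfamiliar with netem: it is attached to a network interface as a tc queueing discipline that delays and drops packets. The sketch below shows the general shape of such a setup (it must run as root); the interface name, delay, jitter, and loss figures are illustrative, not the exact values we used.

import subprocess

INTERFACE = "eth0"   # illustrative interface name

def enable_netem():
    """Attach a netem qdisc that adds ~200 ms latency with jitter and 1% loss."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", INTERFACE, "root", "netem",
         "delay", "200ms", "40ms", "loss", "1%"],
        check=True,
    )

def disable_netem():
    """Remove the qdisc, restoring normal behavior on the interface."""
    subprocess.run(["tc", "qdisc", "del", "dev", INTERFACE, "root"], check=True)

if __name__ == "__main__":
    enable_netem()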
Failure
At 5:25 p.m. CT, Nagios alerted us that two database and two bigdata hosts were down. A few seconds later Nagios notified us that 10 additional hosts were down. A “help” notification was posted in Campfire and all our teams followed the documented procedure to join a predefined (private) Jabber chat.
One immediate effect of the original problem was that we lost both of our internal DNS servers. To address this we added two backup DNS servers to the virtual server on the load balancer. While this issue was being addressed, other engineers identified that the affected applications and servers spanned multiple cabinets. Since we were unable to access the affected servers via out-of-band management, we suspected a possible power issue. Because the datacenter provides a remote hands service, we immediately contacted them to request that a technician go to one of our cabinets and inspect the affected servers.
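To give a rough idea of what verifying those resolvers looks like, here is a minimal sketch that queries each one with dig; the addresses and the record name are made up for illustration and are not our real configuration.

import subprocess

# Hypothetical addresses: the DNS virtual server plus the two backup resolvers.
RESOLVERS = {
    "dns-vip": "10.0.0.53",
    "backup-1": "10.0.1.53",
    "backup-2": "10.0.2.53",
}
TEST_RECORD = "db-01.internal.example"  # hypothetical internal record

def resolver_answers(address, record=TEST_RECORD):
    """Return True if the resolver at `address` answers a query for `record`."""
    result = subprocess.run(
        ["dig", f"@{address}", record, "+short", "+time=2", "+tries=1"],
        capture_output=True, text=True,
    )
    return result.returncode == 0 and result.stdout.strip() != ""

for name, address in RESOLVERS.items():
    status = "ok" if resolver_answers(address) else "NOT ANSWERING"
    print(f"{name} ({address}): {status}")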
Recovery
We prioritized our database and nosql (redis) servers first, since they were preventing some applications from working even in a degraded mode. (Both our master and slave servers were affected, and even our backup db host was affected. Talk about bad luck …) About five minutes after we had a few of the servers online, they stopped responding again. We asked the onsite technician to reboot them again, and we began copying data off to hosts that were unaffected. But the servers failed again before the data was successfully copied.
From our network graphs we could see that broadcast traffic was up. We ran tcpdump on a few hosts that weren’t affected, but nothing looked amiss. Even though we didn’t have a ton of supporting evidence that it was the problem, we decided to clear the ARP cache on our core, in case we had somehow poisoned it with bad records. That didn’t seem to change anything.
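For context, a broadcast spot check can look something like the sketch below, which tallies the source MACs of a sample of Ethernet broadcast frames so a single chatty host stands out; it is illustrative (and needs root), not the exact commands we ran, and the interface name and sample size are arbitrary.

import subprocess
from collections import Counter

INTERFACE = "eth0"   # illustrative interface name
SAMPLE = "200"       # number of broadcast frames to capture

capture = subprocess.run(
    ["tcpdump", "-i", INTERFACE, "-c", SAMPLE, "-e", "-n", "ether broadcast"],
    capture_output=True, text=True,
)

sources = Counter()
for line in capture.stdout.splitlines():
    # With -e, typical Ethernet output is "timestamp srcmac > dstmac, ..." so
    # the source MAC is the second whitespace-separated field.
    fields = line.split()
    if len(fields) > 2 and fields[2] == ">":
        sources[fields[1]] += 1

for mac, count in sources.most_common(5):
    print(f"{count:5d}  {mac}")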
We decided to regroup and review any information we might have missed in our earlier diagnosis: “Let’s take a few seconds and review what every person worked on today … just name everything you did even if it’s something obvious.” We each recited our work. It became clear we had four likely suspects: “knife switch,” our knife plugin for making changes to our network; syslog-02, which had just been decommissioned; an upgraded version of the check_mk plugin that was rolled out to some hosts; and the chef-testing-01 box with netem for simulating end-user performance.
It seemed pretty likely that knife-switch or chef-testing-01 was the culprit. We reviewed our chef configuration and manually inspected a few hosts to rule out syslog-02. We were also able to determine that the check_mk plugin wasn’t upgraded everywhere and that there were no errors logged, which ruled it out as well.
We shut down chef-testing-01 and had the remote hands technician power on the servers that had just gone AWOL again. Since we were pretty sure this was a networking issue, very likely related to LACP/bonding/something similar, we decided to shut down one interface on each server as an additional precaution against a repeat performance. We disabled a single port in each bond, both on the switch and on the server. Then we waited 15 long minutes (about 10 minutes after the servers were booted and we had confirmed the ports were shut down correctly) before we called the all-clear. During this time we let the databases reload their LRU dumps so they were “warm,” restarted replication and let it catch up, and got the redis instances started up.
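As a rough illustration of that precaution (not our exact procedure), on a Linux host the members of a bond can be read from /proc/net/bonding and a single member taken down administratively with ip link; the bond and interface names below are hypothetical, it must run as root, and the corresponding switch port still has to be shut down separately.

import subprocess
from pathlib import Path

BOND = "bond0"            # hypothetical bond name
DISABLE_SLAVE = "eth1"    # hypothetical bond member to take down on the server

def bond_slaves(bond=BOND):
    """Parse /proc/net/bonding/<bond> and return the member interface names."""
    text = Path(f"/proc/net/bonding/{bond}").read_text()
    return [line.split(":", 1)[1].strip()
            for line in text.splitlines()
            if line.startswith("Slave Interface:")]

def shut_down_slave(interface=DISABLE_SLAVE):
    """Administratively down one bond member on the server side."""
    subprocess.run(["ip", "link", "set", "dev", interface, "down"], check=True)

if __name__ == "__main__":
    print("bond members:", bond_slaves())
    shut_down_slave()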
With these critical services back online our sites began functioning normally again. Almost 2.5 long hours had passed at this point.
Finally, we made a prioritized list of application hosts that were still offline. For those with working out-of-band management, we used our internal tools to reboot them. For the rest we had the datacenter technician power cycle them in person.
Resolution
- We were able to reproduce this failure with the same hardware during our after-incident testing. We know what happens on the network, but we have not identified the specific code paths that cause this failure. (The change logs for the network drivers leave lots to be desired!)
- We have adjusted the configuration of the internal DNS virtual server to automatically serve via the backup servers if the two primary servers are unavailable.
- We have added additional redis slaves on hosts that were not previously affected by the outage.
- We are continuing to pursue our investigation with the vendor and through our own testing.
- Everyone on the operations team has made a commitment to halt further testing (with netem) until we can demonstrate it will not cause this failure again.
- We have added “netem” to our Nagios check for blacklisted modules in case anyone forgets about that commitment. (A sketch of what a check like this looks like follows this list.)
- We are updating our tools so that physically locating servers when Campfire (and thus our Campfire bot) is broken isn’t a hassle.
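To give a sense of what the blacklisted-module check looks like, here is a minimal Nagios-style sketch; it is illustrative rather than our exact plugin, and the blacklist (sch_netem is the kernel module netem loads) is just an example.

#!/usr/bin/env python
import sys

BLACKLISTED = {"sch_netem"}   # netem is provided by the sch_netem module

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def loaded_modules(path="/proc/modules"):
    """Return the set of currently loaded kernel module names."""
    with open(path) as f:
        return {line.split()[0] for line in f if line.strip()}

def main():
    try:
        found = BLACKLISTED & loaded_modules()
    except OSError as e:
        print(f"UNKNOWN - could not read module list: {e}")
        return UNKNOWN
    if found:
        print("CRITICAL - blacklisted module(s) loaded: " + ", ".join(sorted(found)))
        return CRITICAL
    print("OK - no blacklisted modules loaded")
    return OK

if __name__ == "__main__":
    sys.exit(main())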
Additional information
We’ve built a Google spreadsheet which outlines information about the hosts that were affected. We’re being a bit cautious with reporting every single configuration detail because this could easily be used to maliciously impact someone’s (internal) network. If you’d like more information please contact netem (at) 37signals and we’ll vet each request individually.
drawtheweb
on 16 Dec 13
This is a great post-mortem. I wish all vendors could be open like this.
If only there was a way for servers to tell you what they really want!
Toedip
on 16 Dec 13
This is a great post and I appreciate the in-depth analysis, but it makes me wonder about testing such things against a live system with real customers. You did not expect the downtime, but isn’t it playing with fire to test a live system in this way? Would it not be better to replicate the entire system, even with a subset of the data or a subset of fake data, and then test these kinds of things on that replicated system? Then you could test all kinds of scenarios, even crazy ones, without concern over bringing down the real system. Anyways, thanks again for the post.
Andrew
on 16 Dec 13
Shouldn’t the 1st question when something goes wrong be… what just changed? Asking this 1st sounds like it would have saved about 2 hours off your downtime in this case.
Justin
on 16 Dec 13
Sorry to be juvenile, but your Google Spreadsheet has the header of “Pubic host information for Dec 3rd outage : Information”, yes it said pubic.
Nathan
on 16 Dec 13
@Justin: Thank you. Very unfortunate editing error. All fixed now.
Matthew
on 17 Dec 13
This is a great piece, but interestingly, the word “Sorry” doesn’t appear once. I’m pretty sure you guys actually wrote a piece about that once.
Still a great explanation.
Jonta
on 17 Dec 13
“through or own testing” should be “through our own testing”. While we’re polishing spelling.
Great stuff. I especially liked the non-technical intro.
Taylor
on 17 Dec 13
@Toedip When we think there is any appreciable risk we always use a separate environment. In this case we didn’t have any reason to believe that risk was present. Lesson learned.
@Andrew Before posting this we discussed that someone might read that part and assume we had not started with “what just changed”. In fact we did, as individuals, but verbalizing it one after another “all at once” really helped bring clarity. It’s something we are considering making part of the formalized response procedure.
@Matthew Glad you found it valuable. We apologized numerous times via Twitter and our status page both during and after the outage. If you are one of our customers who experienced this outage please know that I am incredibly sorry your work was interrupted and we failed miserably at keeping these sites available and performant during this two hour period. We absolutely have to earn your trust back, and I hope by being transparent about the outage, and continuing to improve our site reliability we will be able to do that.
@Jonta I’ve corrected that mistake. Thanks for your feedback. Writing the “simple” version is actually much harder!
GeeIWonder
on 17 Dec 13
@Taylor Lesson learned, but which lesson? Understanding that this particular mode of failure is a risk, and that therefore you should not use a specific tool in this way on the live network, is one lesson.
Understanding that you (as a company) evidently can’t judge the risk correctly, and therefore should not be doing any of this kind of testing on the live environment, is a far more important lesson and the one you should reassure your customers that you’ve learnt.
On another note, there’s an interesting theme in (or at least between the lines of) this post about the challenges of new scale, both in terms of workforce and software suites. I find the set of new systems (including people and “don’t do that anymore” meetings/briefs) you’ve adopted to address this issue very interesting, and I wonder if it makes more sense to try and scale the workflow, management and practices by adding hodge-podge rules and more complexity, or if it might make sense to rethink the underlying habits/systems.
Drew
on 17 Dec 13
Seems odd you would test something as non-trivial as simulated network errors in a production domain, but kudos for maintaining the utmost transparency.
Toedip
on 17 Dec 13
@GeeIWonder I think they thought simulating a spotty, noncompliant connection/user would not bring down the system, but would instead give them insight into helping users with latency and bad connections, etc. I am guessing here, but I think many of us rely on low-level networking implementations that are often stock in default unix/linux distros, along with network adapter software drivers provided by manufacturers or third parties. If we had to reinvent every wheel to do anything on a computer, nothing would be possible. We all travel on the backs of others who have developed software, many of whom are unsung. The end result, though, is that we are using stacks of software, and sometimes things happen that are difficult to diagnose and resolve and difficult to predict. But, in general, if you have a working system, I do think it is best to replicate it and test on the replicated system. The problem is that replicating an entire operation is a PITA, there is pushback from smart (and headstrong) mavericks not to do so, and there is often a cost/benefit analysis, which in this case was probably something like, “hey, there is no way this will cause a problem”. Just an example of famous last words, I guess!
GeeIWonder
on 17 Dec 13
@Toedip
I think that’s probably exactly right.
Either way, customers should know.
The underlying problem here was not a software one. So the ban on that specific program is only a superficial fix.
Andrew
on 18 Dec 13
@Taylor Many thanks for the reply. Much appreciated.