On Dec. 4 around 5:30 p.m. CT, several of our sites began throwing errors and became effectively unusable. Basecamp Classic was briefly affected and ran very slowly. Campfire users saw elevated error rates, and transcripts were not updated for quite some time. Highrise was hit hardest: for two hours, every page view produced an error.
Why our sites failed
When you visit a site like Basecamp it sends you information that’s generated from a number of database and application servers. These servers all talk to each other to share and consume data via connections to the same network.
Recently, we’ve been working to improve download speeds for Basecamp. On Tuesday afternoon we set up one server with software that simulates a user with a bad Internet connection. This bad traffic triggered a bug in a number of the database and application servers, which caused them to become inaccessible. Ultimately, this is why users received error messages while visiting our sites.
How we fixed the sites
We powered off the server sending out the bad traffic. We powered back on the database and application servers that were affected. We checked the consistency of the data and then restarted each affected site.
How we will prevent this from happening again
- We successfully duplicated this problem so we have an understanding of the cause and effect.
- We asked all staff not to run that specific piece of software again.
- We know someone might forget or make a mistake, so we set up alerts to notify us if the software is running anywhere on the network. We verified the check works too.
- We are working with our vendors to remove the bugs that caused the servers to go offline.
Our network is configured with multiple redundant switches in the core, two top-of-rack (TOR) switches per cabinet, and every server has at least 2×10GbE or 2×1GbE connections split over the TOR switches. Servers are spread among cabinets to isolate the impact of a loss of network or power in any given cabinet. As such, application servers are spread throughout multiple cabinets, master and slave database pairs are separated, and so on. Finally, the cabinets are physically divided into two “compute rooms” with separate power and cooling.
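The placement rule described above (members of a replicated pair never share a cabinet) is the kind of thing that can be checked mechanically. The following is only a minimal sketch of such a check: the inventory, host names, cabinet names, and group names are all invented for illustration and do not describe the real environment.

```python
from collections import defaultdict

# Hypothetical inventory: host name -> (cabinet, replication group).
# Every name here is made up for the example.
INVENTORY = {
    "db-01": ("cab-a", "main-db"),
    "db-02": ("cab-b", "main-db"),
    "redis-01": ("cab-a", "cache"),
    "redis-02": ("cab-a", "cache"),  # deliberately misplaced
}


def colocated_groups(inventory):
    """Return {group: cabinet} for groups whose members all sit in one cabinet."""
    cabinets_by_group = defaultdict(set)
    for _host, (cabinet, group) in inventory.items():
        cabinets_by_group[group].add(cabinet)
    # A group confined to a single cabinet loses every member if that cabinet fails.
    return {
        group: next(iter(cabinets))
        for group, cabinets in cabinets_by_group.items()
        if len(cabinets) == 1
    }


if __name__ == "__main__":
    for group, cabinet in colocated_groups(INVENTORY).items():
        print(f"{group}: every member is in {cabinet}")
```

Note that this kind of separation protects against the loss of a cabinet, not against a fault that reaches hosts in several cabinets at once.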
Before the failure
We’ve been investigating ways to improve the user experience for our customers located outside the U.S. Typically these customers are far enough away that best-case latency to the origin is around 200 ms, and many traverse circuits and peering points with high levels of congestion and packet loss. To simulate this type of connectivity we used netem. Other significant changes preceding the event included an update to our knife plugin that allows us to make network reconfiguration changes, the decommissioning of a syslog server, and an update of check_mk.
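For readers unfamiliar with netem: it is a Linux queueing discipline configured through the `tc` command. The sketch below only builds such a command and never runs it. The interface name and the 1% loss figure are assumptions for illustration; only the 200 ms delay comes from the text above, and this is not the exact configuration used on chef-testing-01.

```python
import shlex


def netem_command(interface, delay_ms, loss_percent):
    """Build (but do not execute) a tc command that attaches a netem qdisc."""
    return [
        "tc", "qdisc", "add", "dev", interface, "root", "netem",
        "delay", f"{delay_ms}ms",    # added one-way latency
        "loss", f"{loss_percent}%",  # random packet loss
    ]


if __name__ == "__main__":
    # "eth0" and 1% loss are illustrative values, not taken from the incident.
    print(shlex.join(netem_command("eth0", 200, 1)))
```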
At 5:25 p.m. CT, Nagios alerted us that two database and two bigdata hosts were down. A few seconds later Nagios notified us that 10 additional hosts were down. A “help” notification was posted in Campfire and all our teams followed the documented procedure to join a predefined (private) Jabber chat.
One immediate effect of the original problem was that we lost both our internal DNS servers. To address this we added two backup DNS servers to the virtual server on the load balancer. While this issue was being addressed, other engineers identified that the affected applications and servers were in multiple cabinets. Since we were unable to access the affected servers via out-of-band management, we suspected a possible power issue. Because the datacenter provides remote hands service, we immediately contacted them to request that a technician go to one of our cabinets and inspect the affected servers.
We prioritized our database and nosql (redis) servers first, since they were preventing some applications from working even in a degraded mode. (Both our master and slave servers were affected, and even our backup db host was affected. Talk about bad luck …) About five minutes after we had a few of the servers online, they stopped responding again. We asked the onsite technician to reboot them again, and we began copying data off to hosts that were unaffected. But the servers failed again before the data was successfully copied.
From our network graphs we could see that broadcast traffic was up. We ran tcpdump on a few hosts that weren’t affected, but nothing looked amiss. Even though we had little evidence that it was the cause, we decided to clear the ARP cache on our core, in case we had somehow poisoned it with bad records. That didn’t seem to change anything.
We decided to regroup and review any information we might have missed in our earlier diagnosis: “Let’s take a few seconds and review what every person worked on today … just name everything you did even if it’s something obvious.” We each recited our work. It became clear we had four likely suspects: “knife switch,” our knife plugin for making changes to our network; syslog-02, which had just been decommissioned; an upgraded version of the check_mk plugin that was rolled out to some hosts; and the chef-testing-01 box with netem for simulating end-user performance.
It seemed pretty likely that either knife-switch or chef-testing-01 was the culprit. We reviewed our chef configuration and manually inspected a few hosts to rule out syslog-02. We were also able to determine that the check_mk plugin wasn’t upgraded everywhere, and that there were no errors logged.
We shut down chef-testing-01 and had the remote hands technician power on the servers that had just gone AWOL again. Since we were pretty sure this was a networking issue, and very likely related to LACP, bonding, or something similar, we decided to shut down one interface on each server in case that also helped prevent a repeat performance. We disabled a single port in each bond, both on the switch and on the server. Then we waited 15 long minutes (about 10 minutes after the server was booted and we had confirmed the ports were shut down correctly) before we called the all-clear. During this time we let the databases reload their LRU dumps so they were “warm.” We also restarted replication, let it catch up, and got the redis instances started up.
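Confirming that exactly one port in a bond is still up can be done on the server side by reading the bonding driver's status file under `/proc/net/bonding/`. The sketch below parses text in that format; the sample content, the bond mode shown in it, and the interface names are invented for the example and are not captured from the affected hosts.

```python
def slave_status(bonding_text):
    """Map each slave interface to its MII status from bonding status text."""
    status = {}
    current = None
    for line in bonding_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            # Ignores the bond-level "MII Status" that precedes the first slave.
            status[current] = value
    return status


def active_slaves(bonding_text):
    """Return the sorted names of slave interfaces whose link is up."""
    return sorted(
        name for name, state in slave_status(bonding_text).items() if state == "up"
    )


# Invented sample in the style of /proc/net/bonding/bond0.
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: down
"""

if __name__ == "__main__":
    print(active_slaves(SAMPLE))
```

On a real host the text would come from reading the bond's file (for example `/proc/net/bonding/bond0`, if the bond is named `bond0`) instead of `SAMPLE`.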
With these critical services back online our sites began functioning normally again. Almost 2.5 long hours had passed at this point.
Finally, we made a prioritized list of application hosts that were still offline. For those with working out-of-band management, we used our internal tools to reboot them. For the rest we had the datacenter technician power cycle them in person.
- We were able to reproduce this failure with the same hardware during our after-incident testing. We know what happens on the network, but we have not identified the specific code paths that cause this failure. (The change logs for the network drivers leave a lot to be desired!)
- We have adjusted the configuration of the internal DNS virtual server to automatically serve via the backup servers if the two primary servers are unavailable.
- We have added additional redis slaves on hosts that were not previously affected by the outage.
- We are continuing to pursue our investigation with the vendor and through our own testing.
- Everyone on the operations team has made a commitment to halt further testing (with netem) until we can demonstrate it will not cause this failure again.
- We have added “netem” to our Nagios check for blacklisted modules in case anyone forgets about that commitment.
- We are updating our tools so that physically locating servers when Campfire (and thus our Campfire bot) is broken isn’t a hassle.
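A check for blacklisted kernel modules, like the one mentioned in the list above, can be sketched as a small Nagios-style plugin. This is an illustration under stated assumptions, not the actual check: it assumes netem shows up in `/proc/modules` as `sch_netem` (the name of the Linux netem scheduler module), and it relies on the standard Nagios plugin exit codes, where 0 means OK and 2 means CRITICAL.

```python
# Assumption: netem appears in /proc/modules under the name sch_netem.
BLACKLIST = {"sch_netem"}

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def loaded_modules(proc_modules_text):
    """Return module names from /proc/modules content (first column of each line)."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def check(proc_modules_text, blacklist=BLACKLIST):
    """Return (exit_code, message) in the shape a Nagios plugin reports."""
    found = sorted(set(blacklist) & loaded_modules(proc_modules_text))
    if found:
        return CRITICAL, "CRITICAL - blacklisted modules loaded: " + ", ".join(found)
    return OK, "OK - no blacklisted modules loaded"


def main(path="/proc/modules"):
    """Run the check against the live module list and print the status line."""
    with open(path) as handle:
        code, message = check(handle.read())
    print(message)
    return code
```

A plugin wrapper would finish with `sys.exit(main())` so that Nagios sees the exit code. One limitation of this sketch: it only catches netem when it is built as a loadable module, since code compiled into the kernel does not appear in `/proc/modules`.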
We’ve built a Google spreadsheet which outlines information about the hosts that were affected. We’re being a bit cautious with reporting every single configuration detail because this could easily be used to maliciously impact someone’s (internal) network. If you’d like more information please contact netem (at) 37signals and we’ll vet each request individually.