Since I joined 37signals, I have been working to improve our monitoring infrastructure. We use Nagios for the majority of our monitoring. Nagios is like an old Volvo – it might not be the prettiest or the fastest, but it’s easy to work on and it won’t leave you stranded.

To give you some context, in January 2009 we had 350 Nagios services. By September of 2010 that had grown to 797, and currently we are up to 7,566. While growing that number, we have also drastically reduced the number of alerts that escalate to paging someone in the middle of the night. There have certainly been some bumps along the road to better monitoring, and in this post I hope to provide some insight into how we use Nagios, along with some helpful hints for folks who want to expand and improve their own monitoring systems.

Like most things at 37signals, our Nagios environment is controlled by Chef. When new hosts are provisioned, they get added to our monitoring system automatically. A year or so ago, we were only automatically monitoring a handful of things on our hosts: disk use, load and memory.
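When Chef runs on the monitoring host, it searches for every node it knows about and regenerates the Nagios configuration. A minimal sketch of that kind of recipe looks roughly like this; the cookbook layout, template and service names here are hypothetical, not our actual code:

    # Find every node registered with the Chef server and render a
    # Nagios host definition for each one.
    monitored_nodes = search(:node, "*:*")

    template "/etc/nagios3/conf.d/chef_hosts.cfg" do
      source   "chef_hosts.cfg.erb"   # ERB that emits one "define host { ... }" block per node
      owner    "nagios"
      group    "nagios"
      mode     "0644"
      variables(:nodes => monitored_nodes)
      notifies :reload, "service[nagios3]"
    end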

Monitoring More

The first step toward improving the situation was to install Check_MK, a Nagios plugin that automatically inventories hosts, gathers performance data and provides a nicer UI. With Check_MK, we now monitor about 20 metrics per host automatically, everything from Postfix queues to open TCP connections. Check_MK also provides a very helpful backend, mk_livestatus, which lets you query Nagios for real-time host and service information and send it commands to be processed. For example, we used Livestatus to train Tally, our friendly Campfire bot, to acknowledge alerts and set downtime; thanks to Tally, almost all of our Nagios interactions now take place directly from a Campfire room.
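Talking to Livestatus is just a matter of writing a query or a command to its Unix socket. Here is a rough Ruby sketch of both halves of that conversation; the socket path and the method names are assumptions for illustration:

    require "socket"

    # Where mk_livestatus listens; the path depends on your installation.
    LIVESTATUS_SOCKET = "/var/lib/nagios3/rw/live"

    # Ask Livestatus for every service currently in a CRITICAL (2) state.
    def critical_services
      UNIXSocket.open(LIVESTATUS_SOCKET) do |socket|
        socket.write("GET services\nColumns: host_name description plugin_output\nFilter: state = 2\n\n")
        socket.close_write                     # signal that the query is complete
        socket.read.split("\n").map { |line| line.split(";") }
      end
    end

    # Acknowledge a problem – roughly what a chat command like
    # "ack db-01/Load" could translate into.
    def acknowledge(host, service, author, comment)
      command = "ACKNOWLEDGE_SVC_PROBLEM;#{host};#{service};1;1;1;#{author};#{comment}"
      UNIXSocket.open(LIVESTATUS_SOCKET) do |socket|
        socket.write("COMMAND [#{Time.now.to_i}] #{command}\n")
      end
    end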

We’ve also added a large amount of application-specific monitoring to Nagios over time – we track response time, error codes and various other metrics about our applications’ performance using statsd, as well as a range of MySQL, Redis, and Memcached statistics. These are all things we want to monitor before our customers notice a problem. These additional checks give us far more visibility into our operations than we had before, but they come at a cost: the performance of our Nagios installation and the host that it lives on has suffered as we’ve ramped up our monitoring.
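To give a flavor of the application side, instrumenting a request might look something like the Rack middleware sketched below. The statsd-ruby client and the metric names are assumptions here, not necessarily what our apps use:

    require "statsd"   # statsd-ruby gem

    STATSD = Statsd.new("statsd.internal", 8125)   # hypothetical statsd host

    # Report every request's duration and status code to statsd, which our
    # dashboard aggregates and Nagios later checks against thresholds.
    class MetricsMiddleware
      def initialize(app)
        @app = app
      end

      def call(env)
        started = Time.now
        status, headers, body = @app.call(env)
        STATSD.timing("app.response_time", (Time.now - started) * 1000)
        STATSD.increment("app.status.#{status}")
        [status, headers, body]
      end
    end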

The Problem

Nagios works well out of the box for a small to medium-sized installation, but we quickly ran into some limitations that caused us problems. First, it was taking 45 seconds from the time a service was scheduled to be checked until Nagios had the resources to run the check. To reduce this latency, we enabled Large Installation Tweaks, which had an instant impact: average service latency dropped to less than 0.3 seconds. Unfortunately, it also had an instant impact on our monitoring host’s load – our high check latency had effectively been acting as a throttle on the number of checks Nagios could execute at a given time. When we removed that bottleneck, load went from 5 to around 30 (our primary monitoring server runs on two Xeon E5530 processors).
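Enabling the tweaks is a one-line change in nagios.cfg (shown here for reference; the exact set of related options varies by Nagios version):

    # nagios.cfg
    use_large_installation_tweaks=1
    # The tweaks skip some per-check work; turning off environment macros
    # is a commonly paired optimization.
    enable_environment_macros=0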

Eventually, I decided that the load was getting out of control and set about trying to reduce it. Reducing the frequency of our check_mk agent checks had very little impact on load, but checking our other active services half as often had a huge impact, dropping load from around 30 to under 10. This clearly demonstrated that active services were our enemy and had to be eliminated at all costs.
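For concreteness, halving the check frequency is just a matter of doubling the interval in the service definition – in our case via Chef templates, but a hand-written example would look like this (host and command names are made up):

    define service {
        use                   generic-service
        host_name             db-01
        service_description   MySQL Replication Lag
        check_command         check_mysql_replication
        check_interval        10    ; was 5, so the service is checked half as often
        retry_interval        1
    }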

A Short Primer on Nagios Services

  • Active Services are checks defined by executable check scripts that Nagios runs directly. The services are scheduled at defined intervals, placed into a scheduling queue and then executed when worker threads are available. Nagios must shell out, execute the check script, wait for the result, parse it, append it to the command buffer and then process it. For the entire duration of the check, the thread is held and cannot be used for anything else.
  • Passive Services are checks that are triggered either by Nagios, like the check_mk agent checks, or by some other mechanism, but are not actively run by the Nagios server. When a passive check result is available, the external process simply appends it to the command buffer directly, where Nagios processes it just like an active check result (see the sketch below this list). Because Nagios never schedules these checks or spends resources executing them, they consume only a tiny fraction of the resources of an active check.
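The classic way to feed the command buffer is to write a PROCESS_SERVICE_CHECK_RESULT line into Nagios’ external command file. A minimal Ruby sketch, with an assumed command-file path and made-up service names:

    # Append a passive check result to Nagios' external command pipe.
    COMMAND_FILE = "/var/lib/nagios3/rw/nagios.cmd"

    def submit_passive_result(host, service, state, output)
      # state: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
      line = "[#{Time.now.to_i}] PROCESS_SERVICE_CHECK_RESULT;#{host};#{service};#{state};#{output}\n"
      File.open(COMMAND_FILE, "a") { |pipe| pipe.write(line) }
    end

    submit_passive_result("app-01", "queue_latency", 0, "OK - queue latency 12ms")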

A large number of our active services were making an HTTP request to our internal dashboard application to get the application and database metrics mentioned previously. Rather than have Nagios actively check each of those, we decided to push updates from statsd over websockets at regular intervals (using the very nice Slanger library). To do this, we generate a configuration file from Chef that determines which metrics are needed and with what thresholds, and a small daemon subscribes to those metrics and periodically sends check result data to Livestatus, which appends it to the command buffer for processing. We also supplemented these dashboard-driven pushes with other checks that push their results directly from the check script.
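Stripped down to its essentials, that daemon is a websocket client that speaks the Pusher protocol to Slanger and forwards results to Livestatus. The sketch below is a rough approximation; the URL, channel names, thresholds and the faye-websocket client are all assumptions rather than our production code:

    require "eventmachine"
    require "faye/websocket"
    require "json"
    require "socket"

    SLANGER_URL = "ws://stats.internal:8080/app/APP_KEY?protocol=7"   # hypothetical
    LIVESTATUS  = "/var/lib/nagios3/rw/live"
    THRESHOLDS  = { "app.response_time" => { :warn => 250, :crit => 500 } }

    # Hand a check result to Nagios through Livestatus' COMMAND interface.
    def submit(host, service, state, output)
      cmd = "PROCESS_SERVICE_CHECK_RESULT;#{host};#{service};#{state};#{output}"
      UNIXSocket.open(LIVESTATUS) { |s| s.write("COMMAND [#{Time.now.to_i}] #{cmd}\n") }
    end

    EM.run do
      ws = Faye::WebSocket::Client.new(SLANGER_URL)

      ws.on :open do |_event|
        # Subscribe to one channel per metric we care about.
        THRESHOLDS.each_key do |metric|
          ws.send({ :event => "pusher:subscribe", :data => { :channel => metric } }.to_json)
        end
      end

      ws.on :message do |event|
        message = JSON.parse(event.data)
        next if message["event"].to_s.start_with?("pusher")    # skip protocol chatter
        next unless (limits = THRESHOLDS[message["channel"]])

        value = JSON.parse(message["data"])["value"].to_f
        state = value >= limits[:crit] ? 2 : (value >= limits[:warn] ? 1 : 0)
        submit("statsd", message["channel"], state, "#{message["channel"]} is #{value}")
      end
    end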

Results

As expected, moving these services to be passive had a large impact on our Nagios CPU usage, as shown in the graphs below.

All in all, we have reduced the number of Active Services from around 1900 to 745. Most of the remaining checks have to be active – we want ping checks, Check_MK agents, and HTTP checks for applications to be active so they fail quickly and loudly.

To some extent, this just shifts load – some of that load is now being incurred on other hosts, either from the check scripts or from the pusher daemon that sends the results to Nagios. While that’s beneficial in and of itself (we were able to spread load out to servers with more excess capacity), we also improved the overall efficiency of the system by rewriting some check scripts and eliminating the overhead of thousands of HTTP requests. More importantly, we have restored our original check intervals and added some new monitoring while keeping load around 3 and latency under half a second.

I hope this gives you some sense of how we solved a problem in our monitoring infrastructure by stepping away from the conventional “add another executable script” way of monitoring, and perhaps gives you some ideas about how to improve the performance of your own monitoring system.