Our team adds new checks and alerts every week so that we can stay ahead of new issues. We try very hard to make sure that each alert is configured and tested such that it provides timely and credible evidence of a real problem. Sometimes, though, when things go wrong we are inundated with alerts, and that flood of information actually hinders and confuses problem identification and resolution.
A real world example
A server with two 10 Gigabit network connections experiences a hardware failure and spontaneously reboots. Our Campfire room is filled with alerts not just for the host being down, but also for the switch (ports) the host is connected to.
We monitor the switch ports because we want to know that they are at the correct speed, that there are no individual failures, and that no “foreign” devices have been plugged into the network. In the case of a host failure, the information about the switch ports is secondary to the information about the host—but it represents 2x the volume of alert data we receive.
In cases like this we need to make our monitoring system more aware of the dependencies that exist between these checks so that we can eliminate the noise. To do so we use a number of open source technologies:
Link Layer Discovery Protocol
The Link Layer Discovery Protocol (LLDP) is a vendor-neutral link layer protocol in the Internet Protocol Suite used by network devices for advertising their identity, capabilities, and neighbors on an IEEE 802 local area network, principally wired Ethernet.
(Via http://en.wikipedia.org/wiki/Link_Layer_Discovery_Protocol.)
A more human readable description is that LLDP is a link layer protocol that allows us to find out which switches (and switch ports) a given server is plugged in to.
First we had to configure our switches to support LLDP. We did so using a basic global configuration entry:
protocol lldp advertise management-tlv system-description system-name
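The hosts have to speak LLDP as well: lldpctl comes from the lldpd daemon, so each server needs lldpd installed and running. On a Debian-based system (package and service names may differ on other distributions) that looks something like:

apt-get install lldpd
service lldpd start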
After the switches are configured we can collect information from each host through the use of lldpctl. Here’s some sample output from lldpctl:
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface:    eth0, via: LLDP, RID: 1, Time: 16 days, 18:45:59
  Chassis:
    ChassisID:    mac 00:01:e8:8b:0a:c1
    SysName:      zk100-switch1
    SysDescr:     Dell Force10 Real Time Operating System...
  Port:
    PortID:       ifname TenGigabitEthernet 0/23
    PortDescr:    Not received
-------------------------------------------------------------------------------
Interface:    eth1, via: LLDP, RID: 2, Time: 16 days, 18:46:03
  Chassis:
    ChassisID:    mac 00:01:e8:8b:0a:82
    SysName:      zk100-switch2
    SysDescr:     Dell Force10 Real Time Operating System...
  Port:
    PortID:       ifname TenGigabitEthernet 0/23
    PortDescr:    Not received
As you can see we get information on each connected interface. If the interfaces are in a port-channel (bonded) we would get information about the port-channel too.
What’s nice is that lldpctl can present the output in multiple ways. By passing in ‘-f keyvalue’ we get the same information, but formatted in a way that we can easily parse:
lldp.eth0.via=LLDP
lldp.eth0.rid=1
lldp.eth0.age=16 days, 18:49:22
lldp.eth0.chassis.mac=00:01:e8:8b:0a:c1
lldp.eth0.chassis.name=zk100-switch1
lldp.eth0.chassis.descr=Dell Force10 Real Time Operating System...
lldp.eth0.port.ifname=TenGigabitEthernet 0/23
lldp.eth0.port.descr=Not received
lldp.eth1.via=LLDP
lldp.eth1.rid=2
lldp.eth1.age=16 days, 18:49:26
lldp.eth1.chassis.mac=00:01:e8:8b:0a:82
lldp.eth1.chassis.name=zk100-switch2
lldp.eth1.chassis.descr=Dell Force10 Real Time Operating System...
lldp.eth1.port.ifname=TenGigabitEthernet 0/23
lldp.eth1.port.descr=Not received
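As a quick illustration (just a sketch, not code we run in production), a couple of lines of Ruby are enough to turn that output into a hash we can query:

output = `lldpctl -f keyvalue`
pairs  = Hash[output.lines.map { |line| line.chomp.split("=", 2) }]
pairs["lldp.eth0.chassis.name"]  # => "zk100-switch1"
pairs["lldp.eth0.port.ifname"]   # => "TenGigabitEthernet 0/23"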
Gathering the information provided by LLDP
So how do we gather this data on every server so that we can use it to construct the correct service dependency? Since we use Chef for configuration management we can leverage an Ohai plugin!
Ohai is a tool that is used to detect certain properties about a node’s environment and provide them to the chef-client during every Chef run.
So every time the chef-client is run, we can gather the data up and make it available. As if that wasn’t easy enough, John Dewey posted just such an Ohai plugin that we can use:
#
# Cookbook Name:: ohai
# Plugin:: llpd
#
# "THE BEER-WARE LICENSE" (Revision 42):
# <[email protected]> wrote this file. As long as you retain this notice you
# can do whatever you want with this stuff. If we meet some day, and you think
# this stuff is worth it, you can buy me a beer in return John-B Dewey Jr.
#
provides "linux/llpd"
lldp Mash.new
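# Recursively fold a list like ["eth0", "chassis", "name", "zk100-switch1"]
# into nested hashes, treating the last element as the value.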
def hashify h, list
if list.size == 1
return list.shift
end
key = list.shift
h[key] ||= {}
h[key] = hashify h[key], list
h
end
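# Run lldpctl, strip the leading "lldp" segment from each dotted key, and
# merge every key/value pair into the lldp Mash.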
begin
cmd = "lldpctl -f keyvalue"
status, stdout, stderr = run_command(:command => cmd)
stdout.split("\n").each do |element|
key, value = element.split(/=/)
elements = key.split(/\./)[1..-1].push value
hashify lldp, elements
end
lldp
rescue => e
Chef::Log.warn "Ohai llpd plugin failed with: '#{e}'"
end
Now every one of our Chef node (server) objects has an lldp attribute which we can use to build the correct service dependency.
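A quick way to eyeball that data is to ask the Chef server for just this attribute (the node name here is only an example):

knife node show cats-02 -a lldp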
Since we manage our Nagios configuration via Chef we just need to add a few lines to the service_dependency erb template:
<% @hardware_nodes.each do |hardware_node| %>
<% if hardware_node[:lldp] && hardware_node['hostname'] != node['hostname'] %>
<% hardware_node[:lldp].each do |int, data| %>
<% if data['port']['ifname'] and data['chassis']['name'] %>
define servicedependency {
host_name <%= hardware_node['hostname'] %>
service_description Check_MK
dependent_service_description Interface <%= data['port']['ifname'] %>
dependent_host_name <%= data['chassis']['name'] + "." + hardware_node[:domain] %>
notification_failure_criteria w,c
}
<% end %>
<% end %>
<% end %>
<% end %>
Which results in service dependency entries like this:
define servicedependency {
host_name cats-02
service_description Check_MK
dependent_service_description Interface TenGigabitEthernet 0/34
dependent_host_name zk100-switch2.sc-chi-int.37signals.com
notification_failure_criteria w,c
}
In the above example we are declaring that the switch port check on zk100-switch2 depends on cats-02. The outcome of this dependency is that instead of getting an alert for both the host and the associated switch ports being down, we only get alerted that the host is down. (We know/expect the switch ports to be down too.)
Additional Configuration
We also needed to set “soft_state_dependencies=1” in our Nagios configuration:
This option determines whether or not Nagios will use soft state information when checking host and service dependencies. Normally Nagios will only use the latest hard host or service state when checking dependencies. If you want it to use the latest state (regardless of whether its a soft or hard state type), enable this option.
(Via http://nagios.sourceforge.net/docs/3_0/configmain.html.)
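In the main Nagios configuration file (nagios.cfg) that works out to a single line:

soft_state_dependencies=1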
Here’s the difference it makes in our Campfire when we get these alerts:
Without Service Dependency
sc-chi Interface TenGigabitEthernet 0/10 CRITICAL zk100-switch1.sc-chi-int.37signals.com PROBLEM CRIT - (down)(!!) MAC: 00:01:e8:8b:08:2c, 1GBit/s, in: 478.74kB/s(0.4%), out: 0.00B/s(0.0%).
sc-chi Interface TenGigabitEthernet 0/10 CRITICAL zk100-switch2.sc-chi-int.37signals.com PROBLEM CRIT - (down)(!!) MAC: 00:01:e8:8b:08:2c, 1GBit/s, in: 403.11kB/s(0.3%), out: 0.00B/s(0.0%)
sc-chi Interface TenGigabitEthernet 0/6 CRITICAL zk100-switch1.sc-chi-int.37signals.com PROBLEM CRIT - (down)(!!) MAC: 00:01:e8:8b:08:2c, 1GBit/s, in: 0.00B/s(0.0%), out: 0.00B/s(0.0%).
sc-chi Interface TenGigabitEthernet 0/6 CRITICAL zk100-switch2.sc-chi-int.37signals.com PROBLEM CRIT - (down)(!!) MAC: 00:01:e8:8b:08:2c, 1GBit/s, in: 0.00B/s(0.0%), out: 0.00B/s(0.0%).
sc-chi cats-02 DOWN PROBLEM DOWN CRITICAL - 10.99.22.37: Host unreachable @ 10.99.22.1. rta nan, lost 100%
With Service Dependency
sc-chi cats-02 DOWN PROBLEM DOWN CRITICAL - 10.99.22.37: Host unreachable @ 10.99.22.1. rta nan, lost 100%
Devon
on 15 Jul 13
In reference to:
Does your datacenter actually provide you 10 gigabit links? Or are you just using a network card capable of 10 gigabit but you're on an actual 100/1000 megabit link?
Taylor
on 16 Jul 13
@Devon,
We manage all of our own infrastructure so the datacenter only provides space, power and cooling, and connections to the meet me room. We utilize multiple 1 Gigabit links from a number of providers for our Internet connections.
In the text you referenced I was referring to our local network. We use multiple 10 Gigabit top of rack switches and we connect every server to two switches for redundancy. This allows us to have a reasonable level of fault tolerance and to conduct maintenance on the network without user facing interruption.
Jaime
on 19 Jul 13
Hi Taylor,
Do you use any third-party (external) monitoring solution as well?
I’m looking for outstanding options for a benchmarking study.
Thank you! Jaime.co