Monitoring PERC3Di controllers with afacli and Nagios on SLES9

Intro

These docs describe the basic process of monitoring a Dell PERC3Di controller (as found on the PowerEdge 1650) via Nagios and afacli under SuSE Linux Enterprise Server 9 (SLES9).

For the record, Nagios is a very useful open source tool for monitoring network services and the like. You can find full details on it at the Nagios home.

Also, these directions should presumably work for any other system, Dell PowerEdge or not, with the same family of Adaptec RAID controllers, which use the aacraid driver and can thus be monitored via the afacli utility.

As always, any comments, code enhancements, etc. that you might have are appreciated.

The Problem

So, I've got a rack full of Dell PowerEdge servers... mostly 1650s and 1750s. They have nifty RAID controllers, but we hadn't really been monitoring them actively, beyond the occasional check of the status lights on the systems. Not much point in having a RAID if you don't know when it stops having redundancy.

Now, with Dell it would seem that if I ran Red Hat in their preferred releases, I would be able to use some of the canned Dell management systems for Linux. One problem (of many) is that I am lazy and I didn't want to go through the whole hassle of trying to get the Dell management solution running under SLES. The other problem is that I just don't trust running the Dell stuff. Besides, I already have Nagios installed and it rocks.

The Solution

The basic way that things work is like so:

  1. My Nagios central monitoring system polls the remote server for its RAID status as the schedule demands.
  2. A daemon process listening on the remote server receives the requests and kicks off the local plugin.
  3. The local plugin dumps a set of commands to the command line RAID utility and then parses the logged output.
  4. The plugin returns an appropriate result code for the interpreted logs back to the Nagios server.
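
Step 4 relies on the usual Nagios plugin convention: one line of status text on standard output, plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The following is only a minimal sketch of that convention in Python, not the Perl used by check_afacli. The function name plugin_result, the "RAID" prefix and the shape of the findings list are all made up for illustration.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def plugin_result(findings):
    """Reduce findings to one exit code and one status line.

    `findings` is a hypothetical list of (code, text) pairs produced by
    whatever log parsing the plugin does; an empty list means all is well.
    """
    if not findings:
        return OK, "RAID OK"
    # In this sketch the numerically highest code wins.
    worst = max(code for code, _ in findings)
    texts = "; ".join(text for code, text in findings if code == worst)
    return worst, f"RAID {LABELS[worst]}: {texts}"


code, line = plugin_result([(WARNING, "example finding")])
print(line)  # Nagios displays the first line of stdout as the status text
# A real plugin would finish with sys.exit(code) so Nagios sees the state.
```

A plugin that cannot run its check at all (say, the RAID utility is missing) would normally report UNKNOWN rather than guess.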

Getting all of this to work requires three basic parts:

  1. The RAID controller monitoring utility, afacli.
  2. A basic Nagios installation.
  3. The Nagios plugin which provides the glue between afacli and the centralized monitoring, check_afacli.

afacli

afacli is the command line interface (thus the cli in afacli) for the Adaptec RAID controller which Dell uses as their PERC 3Di. There are links to some RPMs for it from Dell's Linux RAID page. The most recent version listed on that page (at the time of writing) includes the afaapps-2.7 RPM. 2.7 works fine, but whoever built it is a real tool and managed to leave some dependencies on audio libs (WTF???) in the package. So, if you use that, you actually need to install the arts RPM.

Otherwise you want to find afaapps-2.8, which is less broken. I found that with 2.8 I really needed to run sh MAKEDEV.afa afa0, using the MAKEDEV.afa provided in the RPM, to make the appropriate device. This was not an issue with 2.7.

Nagios

I can't and won't go into the details of setting up Nagios monitoring; please refer to the Nagios home for that. For the purposes of this doc, I am assuming that you are remotely monitoring the RAID. If it is a local RAID, then you can obviously cut out many steps.

There isn't much Nagios-wise that needs to be installed on the system to be monitored. Basically, you need to install all of the glue to enable the remote execution and results gathering from the Nagios plugin. SLES9 comes with a nagios-plugins-1.3.1 RPM which I installed. This gives me some Perl libs that my plugin depends upon and other plugins that I would want to use anyways.

Because I am checking the state of the RAID remotely, I need to set up a daemon on the system to answer the requests to check on the RAID. The tool used for this is nrpe (Nagios Remote Plugin Executor), which can be downloaded from the Nagios: Extras and Addons page. This is pretty trivial to build and install. Be sure to create an unprivileged nagios user and group for nrpe to run under as a daemon.

nrpe needs to be configured with info on which commands it will accept and what it does when they are called, so my nrpe.cfg has the following line in it to call my plugin:
command[check_afacli]=/usr/lib/nagios/plugins/check_afacli
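
The name inside the square brackets is what the Nagios server asks for. nrpe looks that name up and runs the command line on the right-hand side, and it refuses names that are not defined. As a rough illustration of that lookup (this is not nrpe's own parser, and the function name parse_nrpe_commands is made up), the mapping can be sketched in Python like so:

```python
import re

# Matches lines of the form: command[name]=command line
COMMAND_RE = re.compile(r"^command\[([^\]]+)\]=(.*)$")


def parse_nrpe_commands(config_text):
    """Return {command name: command line} from nrpe.cfg-style text."""
    commands = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = COMMAND_RE.match(line)
        if match:
            commands[match.group(1)] = match.group(2).strip()
    return commands


# Hypothetical excerpt of an nrpe.cfg, using the line shown above.
sample = """
# RAID check
command[check_afacli]=/usr/lib/nagios/plugins/check_afacli
"""
commands = parse_nrpe_commands(sample)
print(commands["check_afacli"])
```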

The Nagios server has to know how to call check_afacli on the remote system, so my checkcommands.cfg has an entry like:

define command{
        command_name    check_afacli
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_afacli -t 30
        }
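
In that definition, $USER1$ and $HOSTADDRESS$ are Nagios macros. $USER1$ is normally set to the plugin directory in the resource file, and $HOSTADDRESS$ comes from the host being checked. The -c option names the nrpe command to run on the remote side, and -t 30 is the timeout in seconds. To show what the server ends up executing, here is a small Python sketch of the substitution. The helper expand_macros, the plugin directory and the address are assumptions for illustration, and this is not how Nagios itself is implemented.

```python
import re


def expand_macros(command_line, macros):
    """Replace $NAME$ tokens with values from `macros`; unknown ones are left alone."""
    return re.sub(
        r"\$([A-Z0-9_]+)\$",
        lambda m: macros.get(m.group(1), m.group(0)),
        command_line,
    )


template = "$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_afacli -t 30"
macros = {
    "USER1": "/usr/lib/nagios/plugins",  # assumed plugin directory
    "HOSTADDRESS": "192.0.2.10",         # example address from the documentation range
}
expanded = expand_macros(template, macros)
print(expanded)
# → /usr/lib/nagios/plugins/check_nrpe -H 192.0.2.10 -c check_afacli -t 30
```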

check_afacli

There isn't a whole lot to say about my script, check_afacli. It is written in my unpolished Perl. I would like to think that the code is sound, but my regex may be ugly. You are warned.

If you decide to adopt it for your own use, you will need to customize any paths to required files as needed, of course.

Also, the script is being executed as the nagios user by the nrpe daemon, so you need to be sure that the nagios user has permission to run afacli. I did this by enabling the nagios user to sudo afacli without requiring a password. So my sudoers file has a line like this:
nagios ALL=(ALL) NOPASSWD: /sbin/afacli
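
With that entry in place, anything running as the nagios user can start afacli through sudo and feed it commands on standard input. The sketch below shows one way that call could look in Python; it is not the Perl from check_afacli. The command file location /usr/local/etc/afascript, the 25 second limit and the runner parameter (there only so the function can be exercised without a RAID controller) are all assumptions.

```python
import subprocess

AFACLI = "/sbin/afacli"              # path from the sudoers line above
SCRIPT = "/usr/local/etc/afascript"  # hypothetical location of the command file


def afacli_command():
    """Command list for running afacli as root via sudo."""
    return ["sudo", AFACLI]


def run_afacli(script_path=SCRIPT, runner=subprocess.run):
    """Feed the command file to afacli on stdin and return its exit status."""
    with open(script_path, "rb") as script:
        result = runner(
            afacli_command(),
            stdin=script,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=25,  # stay under the 30 second check_nrpe timeout
        )
    return result.returncode
```

If sudo is not set up correctly, the call will either fail or wait for a password until the timeout expires, which a plugin would reasonably report as UNKNOWN.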

The plugin redirects a set of commands to afacli from a file called afascript which looks like:

    logfile start '/tmp/afacli.log'
    open afa0
    controller details
    container list /all /full
    enclosure show slot
    close
    logfile end
    exit

Yes, the spaces do seem to be required in there; that isn't just indenting for the sake of it. You could also add more commands to be passed to afacli, but check_afacli won't do much with them. What it does do is:

To-Dos

Some things I need to work on with this:

