Jonathan Delgado's Weblog: check_megaraid

June 7, 2007

check_megaraid_sas Nagios plugin

This is somewhat related to my earlier posting about updating the megaraid drivers. I use Nagios at work for system monitoring, and one thing I like to check is the status of the volumes managed by the RAID controller. When I first started configuring Nagios on my new PowerEdge 1950 and 2950 systems, I found a check_perc5i plugin over on Nagios Exchange.

Unfortunately the plugin only looked like it worked properly. It would correctly report things like the number of volumes online, the number of disks, and the number of failed disks, but if you had a failed disk it would not actually return the proper error status. It just kept blindly saying OK : Bad Disks=3.
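For context, Nagios takes a check's state from the plugin's exit code rather than from the text it prints: 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. Here is a minimal Python sketch of keeping the two in step. The function names and the rule that any failed disk is CRITICAL are assumptions for illustration only; this is not code from check_perc5i or check_megaraid_sas.

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(bad_disks, other_errors=0):
    """Return (exit_code, message) for two hypothetical disk counters.

    Assumed policy: any failed disk is CRITICAL, any other
    (non-fatal) error is WARNING, otherwise OK.
    """
    if bad_disks > 0:
        code = CRITICAL
    elif other_errors > 0:
        code = WARNING
    else:
        code = OK
    # Build the label from the same code we exit with, so the text
    # can never say OK while the counters say otherwise.
    message = "%s: Bad Disks=%d" % (LABELS[code], bad_disks)
    return code, message


def main(bad_disks, other_errors=0):
    code, message = evaluate(bad_disks, other_errors)
    print(message)
    sys.exit(code)  # the exit code is what sets the state in Nagios
```

With three failed disks, evaluate(3) gives exit code 2 and the text "CRITICAL: Bad Disks=3" instead of a cheerful OK.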

So I have written my own script to check the RAID controller status, check_megaraid_sas. It is somewhat similar to the work I did for the PERC3Di with afacli and Nagios quite a while back.

To use it you need LSI's MegaCli utility installed, and the user executing the script needs passwordless sudo privileges to run MegaCli. Then you will end up with output like:
OK: 0:0:RAID-1:2 drives:68GB:Optimal 1:0:RAID-5:7 drives:2792GB:Optimal Drives:10 Hotspare(s):1
or (less good)
WARNING: 0:0:RAID-1:2 drives:74GB:Optimal 0:1:RAID-5:4 drives:1396GB:Optimal Drives:6 (3 Errors)

The warning is due to the detection of "other" disk errors on the drives, which is the (3 Errors) at the end of the line. I am trying to find out from Dell whether I can reset this count in the controller. Otherwise, if the count is cumulative, I will probably modify my code to take an argument for a threshold under which non-fatal errors are ignored. The output above is basically in the form:
<status> <controller #>:<volume #>:<RAID level>:<volume drive count>:<volume size>:<volume status> ... Drives:<total drives attached to controller(s)>
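If you want to post-process that line (for graphing, say), it can be picked apart with a couple of regular expressions. The following is a rough Python sketch based only on the two sample lines above, not the plugin's own code. The names Volume, parse_status and should_warn are hypothetical, it assumes the volume status is a single word, and should_warn is just one guess at how an error threshold might behave.

```python
import re
from typing import NamedTuple


class Volume(NamedTuple):
    controller: int
    volume: int
    raid_level: str
    drives: int
    size_gb: int
    state: str


# <controller #>:<volume #>:<RAID level>:<n> drives:<size>GB:<status>
# Assumes the status is one word with no spaces (e.g. "Optimal").
VOLUME_RE = re.compile(r"(\d+):(\d+):(RAID-\d+):(\d+) drives:(\d+)GB:(\S+)")


def _int_or_zero(match):
    """Return the first captured group as an int, or 0 if no match."""
    return int(match.group(1)) if match else 0


def parse_status(line):
    """Split one line of plugin output into its fields."""
    volumes = [
        Volume(int(c), int(v), level, int(n), int(size), state)
        for c, v, level, n, size, state in VOLUME_RE.findall(line)
    ]
    return {
        "status": line.split(":", 1)[0].strip(),
        "volumes": volumes,
        # Capital D only: the per-volume "drives:" is lower case.
        "total_drives": _int_or_zero(re.search(r"\bDrives:(\d+)", line)),
        "hotspares": _int_or_zero(re.search(r"Hotspare\(s\):(\d+)", line)),
        "errors": _int_or_zero(re.search(r"\((\d+) Errors?\)", line)),
    }


def should_warn(errors, threshold=1):
    """Assumed threshold rule: error counts below the threshold are ignored."""
    return errors >= threshold
```

Run against the WARNING line above, parse_status finds two volumes, six drives and three errors; should_warn(3, 5) would then stay quiet, while the default threshold of 1 would warn.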

Posted by delgado at June 7, 2007 11:11 AM
