[Ipmitool-devel] SDR Failure Assertion for Nagios Check_IPMI
Brian A. Seklecki
2010-04-08 03:35:42 UTC
All:

This is more of a general IPMI question. Sorry, there isn't
a -users@ list.

There's an old Nagios monitoring script that would look
through 'ipmitool sdr list', and search the status column
for values != "ok".

It turns out that this simple logic may be insufficient for
checking values such as 'Power Supply Fully Redundant'.

For example:

Consider sensor 7.1 on a PowerEdge 2950r1:

Sensor ID              : PS Redundancy (0x74)
 Entity ID             : 7.1 (System Board)
 Sensor Type (Discrete): Power Supply
 States Asserted       : Redundancy State
                         [Fully Redundant]
 Assertion Events      : Redundancy State
                         [Fully Redundant]
 Assertions Enabled    : Redundancy State
                         [Fully Redundant]
                         [Redundancy Lost]


---------------------------------------------------


When the primary power supply is missing or unplugged, the
SDR listing still reports the sensor with an 'ok' status:

% sudo ipmitool -U foo -H system-lom sdr elist all
PS Redundancy | 74h | ok | 7.1 | Redundancy Lost

Note how the sensor status reads 'ok' in almost all
conditions (except possibly when both power supplies
are 'not present' or 'failed', which would be hard
to test! >:} )
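
To make the problem concrete, here is a rough sketch of the kind of
status-column check described above, run against the line from the
elist output. It is not the actual Nagios script; the function name
naive_problems and the five-column split are assumptions based on the
output format shown in this message.

```python
def naive_problems(elist_output):
    """Return names of sensors whose status column is not 'ok'.

    Assumes the five-column 'sdr elist' layout seen in this thread:
    name | sensor id | status | entity | reading
    """
    problems = []
    for line in elist_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # skip blank or wrapped lines
        name, status = fields[0], fields[2]
        if status.lower() != "ok":
            problems.append(name)
    return problems


sample = "PS Redundancy | 74h | ok | 7.1 | Redundancy Lost"
print(naive_problems(sample))  # prints []
```

The check finds nothing wrong, even though the reading column says
'Redundancy Lost'.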

I'm a bit confused about the data structures, but I understand
thresholds for assertion and de-assertion can be programmed
using OpenIPMI. (A co-worker had to do this for a broken Dell
DRAC card in an r710 or 2950r3 that was reading the upper warning
threshold wrong.)

So is there a way to program 7.1 or 10.1/10.2 to set status
NOT OK during: 1) Predictive Failure 2) Power loss 3) Absence?

As an alternative, I can start scripting additional
string matching for keywords in specific sensor categories:

For example, sdr type "Power Supply"

----------------------------

$ ipmitool -P XX -U netadmin -H system-lom sdr entity 10
Presence | 54h | ok | 10.1 | Absent
Presence | 55h | ok | 10.2 | Present
Status | 64h | ok | 10.1 | Failure detected, Power Supply AC lost
Status | 65h | ok | 10.2 | Presence detected



With the power cable pulled:

% ipmitool -P XX -U netadmin -H system-lom sdr entity 10
Presence | 54h | ok | 10.1 | Present
Presence | 55h | ok | 10.2 | Present
Status | 64h | ok | 10.1 | Presence detected, Failure detected, Power Supply AC lost
Status | 65h | ok | 10.2 | Presence detected
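
A minimal sketch of the string-matching alternative, assuming the
five-column elist layout shown above. The keyword list is only a guess
drawn from the readings seen in this message, and the names CRITICAL
and critical_sensors are made up for illustration. Whether 'Absent'
should count as a failure depends on whether the slot is supposed to
be populated.

```python
import re

# Keywords taken from the readings in this thread; an assumption, not a
# complete list of IPMI state strings.
CRITICAL = re.compile(
    r"redundancy lost|failure detected|predictive failure|ac lost|\babsent\b",
    re.IGNORECASE,
)


def critical_sensors(elist_output):
    """Return (name, entity, reading) for sensors whose reading matches."""
    hits = []
    for line in elist_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue  # not a complete sensor line
        name, _sensor_id, _status, entity, reading = fields
        if CRITICAL.search(reading):
            hits.append((name, entity, reading))
    return hits


sample = """\
Presence | 54h | ok | 10.1 | Absent
Presence | 55h | ok | 10.2 | Present
Status | 64h | ok | 10.1 | Failure detected, Power Supply AC lost
Status | 65h | ok | 10.2 | Presence detected
"""
for name, entity, reading in critical_sensors(sample):
    print(f"CRITICAL: {name} ({entity}): {reading}")
```

On the sample, only the two 10.1 lines are flagged. A real check would
then exit with the Nagios CRITICAL code (2) if the list is non-empty.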


Thanks, ~BAS
Brian A. Seklecki
2010-04-08 15:02:36 UTC
Post by Brian A. Seklecki
I'm a bit confused about the data structures, but I understand
thresholds for assertion and de-assertion can be programmed
So after digging some more, it would seem that the worst case scenario
is true:

Sensors can be of 'threshold' type (RPMs, degC, etc.) or they can be
'discrete', meaning that they don't have threshold levels such as
non-critical, just a series of boolean states.

There is a concept of a "Sensor Specific Offset" in the "IPMI Platform
Event Trap Format Specification v1.0" which is a series of hex values:

For example, a sensor of type "08h" (Power Supply) can have status codes:

Power Supply   00h  Presence Detected
               01h  Power Supply Failure Detected
               02h  Predictive Failure Asserted
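
As a sketch, the three offsets listed above could be turned into a
lookup table for a check script. The severity assigned to each offset
is my own assumption (the spec only defines the states, not how Nagios
should treat them), and the names PSU_STATES and worst_severity are
hypothetical.

```python
# Offsets for sensor type 08h as listed above; severities are assumed.
PSU_STATES = {
    0x00: ("Presence Detected", "OK"),
    0x01: ("Power Supply Failure Detected", "CRITICAL"),
    0x02: ("Predictive Failure Asserted", "WARNING"),
}
SEVERITY_ORDER = ["OK", "WARNING", "CRITICAL"]


def worst_severity(offsets):
    """Return the most severe level among the asserted offsets.

    Offsets not present in PSU_STATES are ignored in this sketch.
    """
    worst = "OK"
    for offset in offsets:
        _name, severity = PSU_STATES.get(offset, ("Unknown", "OK"))
        if SEVERITY_ORDER.index(severity) > SEVERITY_ORDER.index(worst):
            worst = severity
    return worst


print(worst_severity([0x00]))        # prints OK
print(worst_severity([0x00, 0x01]))  # prints CRITICAL
```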

These types of status code prefixes seem to be common in Event Traps and in
SDR Sensor lists.

For example, with one power supply unplugged:

PS Redundancy | 0x0 | discrete | 0x0280|

Normally this should read 0x0180.


For hot swap power supplies installed -> removed from chassis:
Presence | 0x0 | discrete | 0x0180|
--->
Presence | 0x0 | discrete | 0x0280|
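
A small sketch of how these discrete readings might be decoded in a
check script. It assumes the first printed byte is a bitmask in which
bit N means state offset N is asserted, and that the trailing 0x80 is a
reserved bit to be masked off. That interpretation fits the two values
above but should be verified against the IPMI spec. The function name
asserted_offsets is made up.

```python
def asserted_offsets(word):
    """Decode a discrete reading such as '0x0180' into state offsets.

    Assumption: first byte = offsets 0-7, second byte = offsets 8-14
    with its top bit reserved.
    """
    value = int(word, 16)
    low = (value >> 8) & 0xFF  # first printed byte
    high = value & 0x7F        # second printed byte, reserved bit dropped
    offsets = [bit for bit in range(8) if low & (1 << bit)]
    offsets += [8 + bit for bit in range(7) if high & (1 << bit)]
    return offsets


for word in ("0x0180", "0x0280"):
    print(word, asserted_offsets(word))
# prints:
# 0x0180 [0]
# 0x0280 [1]
```

Under that assumption, the normal reading has offset 0 asserted and the
degraded reading has offset 1 asserted, so a check could alarm on any
offset other than the expected one for each sensor.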

It seems like we should be able to toggle sensor status flags for
'discrete' sensors in this fashion.

Until then, I'll re-code the check to parse through each sensor type and
regex-match common assertion types for critical status.

Meh. Coding. >:}

~BAS
