In many cases, manufacturer-specific IPMI Platform Events are stored in binary form in the System Event Log, which makes it very difficult to understand the platform state. This document specifies a solution for presenting manufacturer-specific IPMI Platform Events in a human-readable form by defining a generic framework for parsing and defining new messages in an easy and scalable way. Events originating from the Intel Management Engine (ME) are used as a case study. The general design of the solution is followed by a tailored-down implementation for OpenBMC, described in detail.
IPMI is designed to be a compact and efficient binary format for data exchanged between entities in a data center. The recipient is responsible for receiving the data and for properly analyzing, parsing and translating the binary representation into a human-readable format. IPMI Platform Events are one type of these messages, used to inform the recipient about the occurrence of a particular, well-defined situation.
Some IPMI Platform Events are standardized, described in the specification, and already have an open-source implementation [6]; however, this is only part of the spectrum. The increasing complexity of datacenter systems has multiplied the possible sources of events, which are defined by manufacturer-specific extensions to platform event data. One of these sources is Intel ME, which is able to deliver information about its own state of operation and, in some cases, notify about certain erroneous system-wide conditions, like interface errors.
These OEM-specific messages lack support in existing open-source implementations. They require a manual, documentation-based [5] implementation, which has historically been the source of many interpretation errors. Any document update requires manual code modification according to the specific changes, which is neither efficient nor scalable. Furthermore, the documentation is not always clear on event severity or possible resolution actions.
A generic, OEM-agnostic algorithm is proposed to produce human-readable output for a binary IPMI Platform Event.
In general, each event consists of a predefined payload:
[GeneratorID][SensorNumber][EventType][EventData[2]]
where:
- GeneratorID - used to determine the source of the event,
- SensorNumber - generator-specific unique sensor number,
- EventType - sensor-specific group of events,
- EventData - array with detailed event data.

Each consecutive event field narrows down the domain of event interpretations, starting with GeneratorID at the top and ending with EventData at the end of a decision tree. Software should be able to determine the meaning of the event by applying a divide and conquer approach to a predefined list of well-known event definitions. Such a decision tree may also be needed to break down EventData itself, as is the case in many OEM-specific IPMI implementations.
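The payload fields listed above can be pictured as a small structure. The following is only a minimal sketch: the names PlatformEvent and unpackEvent are hypothetical, and it assumes one byte per field in exactly the order shown above, which is the logical layout of this document rather than the raw IPMI request format.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical in-memory view of the payload
// [GeneratorID][SensorNumber][EventType][EventData[2]].
struct PlatformEvent
{
    std::uint8_t generatorId;              // source of the event
    std::uint8_t sensorNumber;             // generator-specific sensor number
    std::uint8_t eventType;                // sensor-specific group of events
    std::array<std::uint8_t, 2> eventData; // detailed event data
};

// Unpacks the five payload bytes in the order given above.
// Assumption: every field occupies exactly one byte.
// Returns std::nullopt when the buffer does not hold exactly five bytes.
inline std::optional<PlatformEvent>
    unpackEvent(const std::vector<std::uint8_t>& raw)
{
    constexpr std::size_t payloadSize = 5;
    if (raw.size() != payloadSize)
    {
        return std::nullopt;
    }
    return PlatformEvent{raw[0], raw[1], raw[2], {raw[3], raw[4]}};
}
```

Keeping the fields in decision order (generator first, event data last) mirrors the way each field narrows down the set of possible interpretations.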
The implementation should therefore be a series of filters with increasing specialization at each level. A recursive algorithm for this looks as follows:
[Figure: recursive parsing loop. Step 1: analyze the currently analyzed chunk and choose the proper 'subtree' parser. Step 2: 'cut' the remainder and go back to Step 1.]
The described process is repeated until there is nothing left to break down and a single, unique event interpretation (an EventId) has been determined.
Not all event data is a decision point: certain chunks of data should be kept as-is or formatted in a certain way so that they can be included in the human-readable Message. Parser operation should therefore also include logic for extracting Parameters during the traversal process.
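The recursive traversal with parameter extraction can be sketched as follows. This is an illustration of the idea only, not the OpenBMC implementation: Node, ParsedEvent, parse and exampleTree are hypothetical names, and the sketch assumes that each decision consumes exactly one byte and that a leaf may format the single byte that follows it as a parameter.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct ParsedEvent
{
    std::string eventId;
    std::vector<std::string> parameters;
};

// One level of the decision tree (hypothetical structure).
struct Node
{
    std::uint8_t key;    // chunk value that selects this node
    std::string eventId; // set on leaves only
    // Optional: formats the byte following a leaf as a parameter
    std::function<std::string(std::uint8_t)> parameter;
    std::vector<Node> children; // more specialized sub-parsers
};

// Step 1: choose the subtree matching the currently analyzed chunk.
// Step 2: cut the remainder and recurse into that subtree.
inline bool parse(const std::vector<Node>& level, const std::uint8_t* data,
                  std::size_t size, ParsedEvent& out)
{
    if (size == 0)
    {
        return false; // ran out of data before reaching a leaf
    }
    for (const Node& node : level)
    {
        if (node.key != data[0])
        {
            continue;
        }
        const std::uint8_t* rest = data + 1;
        const std::size_t restSize = size - 1;
        if (!node.children.empty())
        {
            return parse(node.children, rest, restSize, out);
        }
        // Leaf reached: a unique EventId has been determined.
        if (node.parameter)
        {
            if (restSize == 0)
            {
                return false; // parameter byte is missing
            }
            out.parameters.push_back(node.parameter(rest[0]));
        }
        out.eventId = node.eventId;
        return true;
    }
    return false; // no known definition matches this chunk
}

// Example tree for one ME event: 0x2C -> 0x17 -> 0x00 -> 0x0A.
inline std::vector<Node> exampleTree()
{
    Node wearout{0x0A, "FlashWearoutWarning",
                 [](std::uint8_t value) { return std::to_string(value); },
                 {}};
    Node fwStatus{0x00, "", nullptr, {wearout}};
    Node meHealth{0x17, "", nullptr, {fwStatus}};
    Node me{0x2C, "", nullptr, {meHealth}};
    return {me};
}
```

Calling parse on the bytes 0x2C 0x17 0x00 0x0A 0x32 with this tree yields the EventId FlashWearoutWarning and a single parameter, the last byte rendered in decimal.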
Effectively, both the EventId and an optional collection of Parameters should then be used as input for a lookup mechanism that generates the final Event Message. Each message consists of the following entries:

- EventId - associated unique event,
- Severity - determines how severely this particular event might affect usual datacenter operation,
- Resolution - suggested steps to mitigate the possible problem,
- Message - human-readable message, possibly with predefined placeholders for Parameters.

An example of such a message parsing process is shown below:
[Figure: example decision path. [GeneratorId] 0x2C (ME) -> [SensorNumber] 0x17 (ME Health) -> [EventType] 0x00 (FW Status) -> [EventData[0]] 0x0A (FlashWearoutWarning) -> [EventData[1]] 0x## (Percentage). Resulting ParsedEvent: 'EventId' = FlashWearoutWarning, 'Parameters' = [ toDecimal(EventData[1]) ].]
The determined ParsedEvent can then be passed to the lookup mechanism, which contains human-readable information for each EventId:
[Figure: stack of lookup entries, with the top one shown.]

    EventId:    FlashWearoutWarning
    Severity:   Warning
    Resolution: No immediate repair action needed
    Message:    Warning threshold for number of flash operations has been
                exceeded. Current percentage of write operations
                capacity: %1
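The lookup step can be sketched as a map from EventId to its entry, followed by substitution of the %1-style placeholders. This is a minimal sketch under stated assumptions: MessageEntry, fillPlaceholders and renderMessage are hypothetical names, and placeholders are assumed to be numbered from %1 in the order of the Parameters collection.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical lookup entry, mirroring the fields listed above.
struct MessageEntry
{
    std::string severity;
    std::string resolution;
    std::string message; // may contain %1, %2, ... placeholders
};

// Replaces %N with parameters[N - 1]. Higher indices are handled first
// so that "%1" never consumes the prefix of "%10".
inline std::string fillPlaceholders(std::string text,
                                    const std::vector<std::string>& parameters)
{
    for (std::size_t i = parameters.size(); i-- > 0;)
    {
        const std::string token = "%" + std::to_string(i + 1);
        std::size_t pos = 0;
        while ((pos = text.find(token, pos)) != std::string::npos)
        {
            text.replace(pos, token.size(), parameters[i]);
            pos += parameters[i].size(); // skip past the inserted text
        }
    }
    return text;
}

// Looks up the entry for eventId and renders its message.
// Returns std::nullopt for an unknown EventId.
inline std::optional<std::string>
    renderMessage(const std::map<std::string, MessageEntry>& registry,
                  const std::string& eventId,
                  const std::vector<std::string>& parameters)
{
    const auto entry = registry.find(eventId);
    if (entry == registry.end())
    {
        return std::nullopt;
    }
    return fillPlaceholders(entry->second.message, parameters);
}
```

With the FlashWearoutWarning entry shown above and the parameter "50", the rendered message ends with "capacity: 50".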
The proposed algorithm is delivered as part of the open-source OpenBMC project [3]. As this software stack is built with a micro-service architecture in mind, the implementation had to be divided into multiple parts:
- openbmc/intel-ipmi-oem ([7])
  - openbmc/intel-ipmi-oem/src/sensorcommands.cpp
  - openbmc/intel-ipmi-oem/src/ipmi_to_redfish_hooks.cpp
  - openbmc/intel-ipmi-oem/src/me_to_redfish_hooks.cpp
- systemd journal ([4])
- MessageRegistry in bmcweb ([2], [8])
  - openbmc/bmcweb/redfish-core/include/registries/openbmc_message_registry.hpp

Processing of a single event proceeds as follows:

- intel-ipmi-oem is notified about an incoming Platform Event (NetFn=0x4, Cmd=0x2),
- the handler in intel-ipmi-oem/src/sensorcommands.cpp is notified,
- the event is passed on to intel-ipmi-oem/src/ipmi_to_redfish_hooks.cpp as a call to sel::checkRedfishHooks,
- sel::checkRedfishHooks analyzes the data; BIOS events are handled in-place, while ME events are delegated to intel-ipmi-oem/src/me_to_redfish_hooks.cpp,
- me::messageHook is called with the payload; the parsing algorithm determines the final EventId and Parameters,
- me::utils::storeRedfishEvent(EventId, Parameters) is called; it stores the event securely in the system journal.

Each IPMI Platform Event is parsed using the aforementioned me::messageHook handler. The implementation of the proposed algorithm is as follows.

Based on EventType, the proper designated handler is called:

    namespace me
    {
    // Dispatches an ME health event to the handler matching its event type.
    static bool messageHook(const SELData& selData, std::string& eventId,
                            std::vector<std::string>& parameters)
    {
        const HealthEventType healthEventType =
            static_cast<HealthEventType>(selData.offset);

        switch (healthEventType)
        {
            case HealthEventType::FirmwareStatus:
                return fw_status::messageHook(selData, eventId, parameters);

            case HealthEventType::SmbusLinkFailure:
                return smbus_failure::messageHook(selData, eventId, parameters);
        }

        // No handler is registered for this event type.
        return false;
    }
    } // namespace me

An example handler for FirmwareStatus, trimmed down to the essential, distinctive use cases:

    namespace fw_status
    {
    // Maps EventData[1] to specified message.
    // Declared before messageHook, which refers to it.
    static const boost::container::flat_map<uint8_t, std::string>
        manufacturingError = {{0x00, "Generic error"},
                              {0x01, "Wrong or missing VSCC table"}};

    static bool messageHook(const SELData& selData, std::string& eventId,
                            std::vector<std::string>& parameters)
    {
        // Maps EventData[0] to either a resolution or further action
        static const boost::container::flat_map<
            uint8_t,
            std::pair<std::string,
                      std::optional<std::variant<utils::ParserFunc,
                                                 utils::MessageMap>>>>
            eventMap = {
                // EventData[0]=0
                // > MessageId=MERecoveryGpioForced
                {0x00, {"MERecoveryGpioForced", {}}},

                // EventData[0]=3
                // > call specific handler to determine MessageId and
                //   Parameters
                {0x03, {{}, flash_state::messageHook}},

                // EventData[0]=7
                // > MessageId=MEManufacturingError
                // > Use manufacturingError map to translate EventData[1] to
                //   string and add it to Parameters collection
                {0x07, {"MEManufacturingError", manufacturingError}},

                // EventData[0]=9
                // > MessageId=MEFirmwareException
                // > Use a function to log specified byte of payload as
                //   Parameter in chosen format. Here it stores 2-nd byte in
                //   hex format.
                {0x09, {"MEFirmwareException", utils::logByteHex<2>}}};

        return utils::genericMessageHook(eventMap, selData, eventId,
                                         parameters);
    }
    } // namespace fw_status

Cascading function calls, logging utilities and map resolutions populate both std::string& eventId and std::vector<std::string>& parameters. This data is then used to form a valid system log entry, which is stored in the system journal.
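To make the role of the optional variant in the event map more concrete, the sketch below shows one way a generic hook could dispatch on it. It is not the utils::genericMessageHook from intel-ipmi-oem: EventBytes, Action, EventMap, genericHookSketch and exampleMap are hypothetical names, std::map replaces boost::container::flat_map to keep the example self-contained, and logByteHex here assumes a 1-based byte index into a two-byte EventData array.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <variant>
#include <vector>

using EventBytes = std::array<std::uint8_t, 2>; // EventData[0], EventData[1]
using ParserFunc = bool (*)(const EventBytes&, std::string&,
                            std::vector<std::string>&);
using MessageMap = std::map<std::uint8_t, std::string>;
using Action = std::optional<std::variant<ParserFunc, MessageMap>>;
using EventMap = std::map<std::uint8_t, std::pair<std::string, Action>>;

// Stores the N-th byte (1-based, an assumption of this sketch) as "0xNN".
template <std::size_t N>
bool logByteHex(const EventBytes& data, std::string& /*eventId*/,
                std::vector<std::string>& parameters)
{
    static_assert(N >= 1 && N <= 2, "N is a 1-based index into EventData");
    char text[5];
    std::snprintf(text, sizeof(text), "0x%02X",
                  static_cast<unsigned>(data[N - 1]));
    parameters.emplace_back(text);
    return true;
}

inline bool genericHookSketch(const EventMap& eventMap, const EventBytes& data,
                              std::string& eventId,
                              std::vector<std::string>& parameters)
{
    const auto match = eventMap.find(data[0]);
    if (match == eventMap.end())
    {
        return false; // EventData[0] is not a known event
    }
    const auto& [id, action] = match->second;
    eventId = id;
    if (!action)
    {
        return true; // plain EventId, no parameters to extract
    }
    if (const auto* parser = std::get_if<ParserFunc>(&*action))
    {
        return (*parser)(data, eventId, parameters); // delegate further
    }
    // Otherwise translate EventData[1] into a string parameter.
    const auto& messages = std::get<MessageMap>(*action);
    const auto text = messages.find(data[1]);
    if (text == messages.end())
    {
        return false;
    }
    parameters.push_back(text->second);
    return true;
}

// Three of the entries from the FirmwareStatus example above.
inline EventMap exampleMap()
{
    const MessageMap manufacturingError = {
        {0x00, "Generic error"}, {0x01, "Wrong or missing VSCC table"}};
    const ParserFunc hexOfSecondByte = &logByteHex<2>;
    EventMap events;
    events.emplace(0x00, std::make_pair(std::string("MERecoveryGpioForced"),
                                        Action{}));
    events.emplace(0x07, std::make_pair(std::string("MEManufacturingError"),
                                        Action{manufacturingError}));
    events.emplace(0x09, std::make_pair(std::string("MEFirmwareException"),
                                        Action{hexOfSecondByte}));
    return events;
}
```

The variant keeps every map entry declarative: an entry either names its EventId directly, points at a lookup table for one more byte, or hands over to a more specialized function.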
Event data is accessible as Redfish resources in two places:

- MessageRegistry - stores all event 'metadata' (severity, resolution notes, messageId),
- EventLog - lists all detected events in the system in processed, human-readable form.

The implementation of the bmcweb MessageRegistry contents can be found at openbmc/bmcweb/redfish-core/include/registries/openbmc_message_registry.hpp.
Intel-specific events have proper prefix in MessageId: either 'BIOS' or 'ME'.
The registry can be read by the user by calling GET on the Redfish resource /redfish/v1/Registries/OpenBMC/OpenBMC. It contains a JSON array of entries in standard Redfish format, like so:

    "MEFlashWearOutWarning": {
        "Description": "Indicates that Intel ME has reached certain threshold of flash write operations.",
        "Message": "Warning threshold for number of flash operations has been exceeded. Current percentage of write operations capacity: %1",
        "NumberOfArgs": 1,
        "ParamTypes": [
            "number"
        ],
        "Resolution": "No immediate repair action needed.",
        "Severity": "Warning"
    }

The system-wide EventLog is implemented in bmcweb at openbmc/bmcweb/redfish-core/lib/log_services.hpp.
The log can be read by the user by calling GET on the Redfish resource /redfish/v1/Systems/system/LogServices/EventLog. It contains a JSON array of log entries in standard Redfish format, like so:

    {
        "@odata.id": "/redfish/v1/Systems/system/LogServices/EventLog/Entries/37331",
        "@odata.type": "#LogEntry.v1_4_0.LogEntry",
        "Created": "1970-01-01T10:22:11+00:00",
        "EntryType": "Event",
        "Id": "37331",
        "Message": "Warning threshold for number of flash operations has been exceeded. Current percentage of write operations capacity: 50",
        "MessageArgs": ["50"],
        "MessageId": "OpenBMC.0.1.MEFlashWearOutWarning",
        "Name": "System Event Log Entry",
        "Severity": "Warning"
    }