Hardware Fault Monitor

Author: Claire Weinan (cweinan@google.com, daylight22)

Primary assignee: Claire Weinan (cweinan@google.com, daylight22), Heinz Boehmer Fiehn (heinzboehmer@google.com)

Other contributors: Drew Walton (acwalton@google.com)

Created: Aug 5, 2021

Problem Description

The goal is to create a new hardware fault monitor that provides a framework for collecting various fault and sensor information and making it available externally via Redfish for data center monitoring and management purposes. The information logged would include a wide variety of chipset registers and data from manageability hardware. In addition to collecting information through BMC interfaces, the hardware fault monitor will also receive information via Redfish from the associated host kernel, specifically for cases in which the desired information cannot be collected directly by the BMC, for example registers that are read and cleared by the host kernel.

Future expansion of the hardware fault monitor would include adding the means to locally analyze fault and sensor information and then based on specified criteria trigger repair actions in the host BIOS or kernel. In addition, the hardware fault monitor could receive repair action requests via Redfish from external data center monitoring software.

Background and References

The following are a few related existing OpenBMC modules:

  • Host Error Monitor (https://github.com/openbmc/host-error-monitor): monitors host error signals such as CATERR and collects crashdumps.

  • phosphor-debug-collector (https://github.com/openbmc/phosphor-debug-collector): collects and manages dump files on the BMC.

  • bmcweb (https://github.com/openbmc/bmcweb): the OpenBMC Redfish webserver.

There is an OpenCompute Fault Management Infrastructure proposal that also recommends delivering error logs from the BMC: https://drive.google.com/file/d/1A9Qc7hB3THw0wiEK_dbXYj85_NOJWrb5/

Requirements

  • The users of this solution are Redfish clients in data center software. The goal of the fault monitor is to enable rich error logging (OEM and CPU vendor specific) for data center tools to monitor servers, manage repairs, predict crashes, etc.

  • The fault monitor must be able to handle receiving fault information that is polled periodically as well as fault information that may come in sporadically based on fault incidents (e.g. crash dumps).

  • The fault monitor should allow for logging fault information entries of a variety of sizes (on the order of bytes to megabytes). In general, more severe errors, which require more fault information to be collected, tend to occur less frequently, while less severe errors such as correctable errors require less logging but may happen more frequently.

  • Fault information must be added to a Redfish LogService in a timely manner (within a few seconds of the original event) to be available to external data center monitoring software.

  • The fault monitor must allow for custom overwrite rules for its log entries (e.g. on overflow, save first errors and more severe errors), or guarantee that enough space is available in its log such that all data from the most recent couple of hours is always kept intact. The log does not have to be stored persistently (though it can be).

Proposed Design

A generic fault monitor will be created to collect fault information. First, we discuss a few example use cases:

  • On CATERR, the Host Error Monitor requests a crash dump (this is an existing capability). The crash dump includes chipset registers but doesn’t include platform-specific system-level data. The fault monitor would therefore additionally collect system-level data such as clock, thermal, and power information. This information would be bundled, logged, and associated with the crash dump so that it could be post-processed by data center monitoring tools without having to join multiple data sources.

  • The fault monitor would monitor link level retries and link retrainings of high speed serial links such as UPI links. These are not typically monitored by the host kernel at runtime, and the host kernel is not able to log them during a crash. The fault monitor in the BMC could check link level retries and link retrainings at runtime by polling over PECI (see the sketch after this list). If an MCERR or IERR occurred, the fault monitor could then add additional information such as high speed serial link statistics to error logs.

  • In order to monitor memory out of band, a system could be configured to give the BMC exclusive access to memory error logging registers (to prevent the host kernel from being able to access and clear the registers before the BMC could collect the register data). For corrected memory errors, the fault monitor could log error registers either through polling or interrupts. Data center monitoring tools would use the logs to determine whether memory should be swapped or a machine should be removed from usage.
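
To make the PECI polling in the second use case more concrete, the following is a minimal sketch using libpeci's RdPkgConfig call. The PECI client address and the package configuration space (PCS) index/parameter values are placeholders; the actual registers that expose link level retry and retraining counters are CPU-generation specific.

```cpp
// Illustrative only: reads one 4-byte package configuration space value over
// PECI, the same mechanism the fault monitor could use for periodic polling.
// The PCS index/parameter below are placeholders, not real register indexes.
#include <peci.h>

#include <cstdint>
#include <cstdio>

int main()
{
    constexpr uint8_t cpuAddr = 0x30;     // PECI client address of CPU 0
    constexpr uint8_t pcsIndex = 0x00;    // placeholder PCS index
    constexpr uint16_t pcsParam = 0x0000; // placeholder PCS parameter

    uint32_t counter = 0;
    uint8_t cc = 0; // PECI completion code

    if (peci_RdPkgConfig(cpuAddr, pcsIndex, pcsParam, sizeof(counter),
                         reinterpret_cast<uint8_t*>(&counter), &cc) ==
        PECI_CC_SUCCESS)
    {
        // In the fault monitor this value would be written to the fault log;
        // here we simply print it.
        std::printf("link counter: %u (cc=0x%02x)\n", counter, cc);
    }
    return 0;
}
```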

The fault monitor will not have its own dedicated OpenBMC repository, but will consist of components incorporated into the existing repositories host-error-monitor, bmcweb, and phosphor-debug-collector.

In the existing Host Error Monitor module, new monitors will be created to add functionality needed for the fault monitor. For instance, based on the needs of the OEM, the fault monitor will register to be notified of D-Bus signals of interest in order to be alerted when fault events occur. The fault monitor will also poll registers of interest and log their values to the fault log (described in more detail below). In addition, the host will be able to write fault information to the fault log via a POST (Create) request to its corresponding Redfish log resource collection. When the fault monitor becomes aware of a new fault occurrence through any of these paths, it may add fault information to the fault log. The fault monitor may also gather relevant sensor data (read via D-Bus from the dbus-sensors services) and add it to the fault log, with a reference to the original fault event information. The EventGroupID in a Redfish LogEntry could potentially be used to associate multiple log entries related to the same fault event.
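
As a rough illustration of the two notification paths described above, here is a minimal sketch, in the style of Host Error Monitor's boost::asio plus sdbusplus event loop, of a monitor that both listens for a D-Bus signal and polls on a timer. The signal interface and member names, the polling interval, and the logToFaultLog() helper are hypothetical placeholders, not existing interfaces.

```cpp
// Sketch of a new Host Error Monitor style monitor: (a) registers a D-Bus
// signal match to be alerted when a fault event occurs, and (b) polls a
// register of interest on a timer.  The signal interface/member names and
// logToFaultLog() are hypothetical placeholders for this design sketch.
#include <boost/asio/io_context.hpp>
#include <boost/asio/steady_timer.hpp>
#include <sdbusplus/asio/connection.hpp>
#include <sdbusplus/bus/match.hpp>

#include <chrono>
#include <functional>
#include <iostream>
#include <memory>
#include <string>

static void logToFaultLog(const std::string& data)
{
    // Placeholder: the real implementation would create a new entry through
    // the fault log's D-Bus interface in phosphor-debug-collector.
    std::cerr << "fault log <- " << data << "\n";
}

int main()
{
    boost::asio::io_context io;
    auto conn = std::make_shared<sdbusplus::asio::connection>(io);

    // (a) React to a (hypothetical) fault signal emitted by another service.
    sdbusplus::bus::match_t faultMatch(
        static_cast<sdbusplus::bus_t&>(*conn),
        sdbusplus::bus::match::rules::type::signal() +
            sdbusplus::bus::match::rules::interface(
                "xyz.openbmc_project.Example.FaultSignal") + // placeholder
            sdbusplus::bus::match::rules::member("FaultOccurred"),
        [](sdbusplus::message_t&) { logToFaultLog("fault signal received"); });

    // (b) Poll a register of interest periodically (interval is arbitrary).
    boost::asio::steady_timer pollTimer(io);
    std::function<void(const boost::system::error_code&)> poll =
        [&](const boost::system::error_code& ec) {
            if (ec)
            {
                return; // timer cancelled
            }
            logToFaultLog("polled register value");
            pollTimer.expires_after(std::chrono::seconds(10));
            pollTimer.async_wait(poll);
        };
    pollTimer.expires_after(std::chrono::seconds(10));
    pollTimer.async_wait(poll);

    io.run();
    return 0;
}
```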

The fault log for storing relevant fault information (and exposing it to external data center monitoring software) will be a new Redfish LogService (/redfish/v1/Systems/system/LogServices/FaultLog) with OverwritePolicy=unknown, in order to implement custom overwrite rules such as prioritizing retaining first and/or more severe faults. The back-end implementation of the fault log, including saving and managing log files, will be added to the existing Phosphor Debug Collector repository, with an associated D-Bus object (e.g. /xyz/openbmc_project/dump/faultlog) whose interface will include methods for writing new data into the log, retrieving data from the log, and clearing the log. The fault log will be implemented as a new dump type in an existing Phosphor Debug Collector daemon (specifically the one whose main() function is in dump_manager_main.cpp). The new fault log would contain dump files that are collected in a variety of ways and in a variety of formats. A new fault log dump entry class (deriving from the "Entry" class in dump_entry.hpp) would be defined with an additional "dump type" member variable to identify the type of data that a fault log dump entry's corresponding dump file contains.
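
For illustration, here is a simplified, self-contained sketch of what such a fault log dump entry could look like. The stand-in Entry base class below is reduced to a few fields and does not reproduce the actual constructor of phosphor-debug-collector's Entry class; the FaultDataType values and the file path in the example are likewise assumptions.

```cpp
// Simplified sketch of the proposed fault log dump entry.  The real
// implementation would derive from the existing "Entry" class in
// phosphor-debug-collector's dump_entry.hpp; the stand-in base class here is
// reduced to the essentials for illustration.
#include <cstdint>
#include <filesystem>
#include <utility>

// Stand-in for the type of data a fault log dump file contains (assumed
// values; the real set would be defined by the OEM's needs).
enum class FaultDataType
{
    crashdump,
    cperRecord,
    registerSnapshot,
};

// Stand-in for the existing dump Entry base class.
class Entry
{
  public:
    Entry(uint32_t dumpId, uint64_t timestampUs, uint64_t sizeBytes,
          std::filesystem::path file) :
        dumpId(dumpId), timestampUs(timestampUs), sizeBytes(sizeBytes),
        file(std::move(file))
    {}
    virtual ~Entry() = default;

  protected:
    uint32_t dumpId;
    uint64_t timestampUs;
    uint64_t sizeBytes;
    std::filesystem::path file;
};

// Proposed fault log entry: identical to a regular dump entry except for the
// additional member identifying what kind of data the dump file holds.
class FaultLogEntry : public Entry
{
  public:
    FaultLogEntry(uint32_t dumpId, uint64_t timestampUs, uint64_t sizeBytes,
                  std::filesystem::path file, FaultDataType type) :
        Entry(dumpId, timestampUs, sizeBytes, std::move(file)), type(type)
    {}

    FaultDataType getType() const
    {
        return type;
    }

  private:
    FaultDataType type; // the new "dump type" member
};

int main()
{
    // Illustrative file path and values only.
    FaultLogEntry entry(1, 1628200000000000ULL, 4096,
                        "/tmp/faultlog/entry1", FaultDataType::crashdump);
    return entry.getType() == FaultDataType::crashdump ? 0 : 1;
}
```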

bmcweb will be used as the associated Redfish webserver for external entities to read and write the fault log. Functionality for handling a POST (Create) request to a Redfish log resource collection will be added in bmcweb. When delivering a Redfish fault log entry to a Redfish client, large-sized fault information (e.g. crashdumps) can be specified as an attachment sub-resource (AdditionalDataURI) instead of being inlined. Redfish events (EventService schema) will be used to send external notifications, such as when the fault monitor needs to notify external data center monitoring software of new fault information being available. Redfish events may also be used to notify the host kernel and/or BIOS of any repair actions that need to be triggered based on the latest fault information.
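
As an illustration of the host-to-BMC path, the snippet below builds the kind of JSON body that could be POSTed to the FaultLog entry collection (/redfish/v1/Systems/system/LogServices/FaultLog/Entries). The property names come from the standard Redfish LogEntry schema, but exactly which properties the bmcweb fault log will accept is an open implementation detail, so treat this as an assumption-laden sketch.

```cpp
// Builds an example Redfish LogEntry creation body using nlohmann::json (the
// JSON library bmcweb itself uses).  Property names are from the Redfish
// LogEntry schema; the specific values are illustrative only.
#include <nlohmann/json.hpp>

#include <iostream>

int main()
{
    nlohmann::json entry = {
        {"EntryType", "Oem"},     // LogEntry.EntryType
        {"Severity", "Critical"}, // LogEntry.Severity
        {"Message", "Example: uncorrectable memory error reported by host"},
        {"DiagnosticDataType", "OEM"} // kind of diagnostic data attached
    };

    // Large data (e.g. a crashdump) would not be inlined in the POST body;
    // the stored entry would instead reference it via AdditionalDataURI.

    // Serialized body for an HTTP POST with Content-Type: application/json to
    // /redfish/v1/Systems/system/LogServices/FaultLog/Entries.
    std::cout << entry.dump(2) << std::endl;
    return 0;
}
```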

Alternatives Considered

We considered adding the fault logs into the main system event log (/redfish/v1/Systems/system/LogServices/EventLog) or other logs already existing in bmcweb (e.g. /redfish/v1/Systems/system/LogServices/Dump, /redfish/v1/Managers/bmc/LogServices/Dump), but we would like to implement a separate custom overwrite policy to ensure the most important information (such as first errors and most severe errors) is retained for local analysis.

Impacts

There may be situations where external consumers of fault monitor logs (e.g. data center monitoring tools) are running software that is newer or older than the version matching the BMC software running on a machine. In such cases, consumers can ignore any types of fault information provided by the fault monitor that they are not prepared to handle.

Errors are expected to happen infrequently, or to be throttled, so we expect little to no performance impact.

Testing

Error injection mechanisms or simulations may be used to artificially create error conditions that will be logged by the fault monitor module.

There is no significant impact expected with regard to CI testing, but we do intend to add unit testing for the fault monitor.