Fail Boot on Hardware Errors

Author: Andrew Geissler (geissonator)

Primary assignee: Andrew Geissler (geissonator)

Other contributors:

Created: Feb 20, 2020

Problem Description

Some groups, for example a manufacturing team, have a requirement for the BMC firmware to halt a system if an error log is created which calls out a piece of hardware. The reason behind this is to ensure a system is not shipped to a customer if it has any type of hardware issue. It also ensures when an error is found, it is identified quickly and all activity stops until the issue is fixed. If the system has a hardware issue once shipped from manufacturing, then the BMC firmware behavior should be to report the error, but allow the system to continue to boot and operate.

OpenBMC firmware needs a mechanism to support this use case.

Background and References

Within IBM, this function has been enabled/disabled by what is called manufacturing flags. They were bits the user could set in registry variables which the firmware would then query. These registry variables were only settable by someone with admin authority to the system. These flags were not used outside of manufacturing and test.

Extensions within phosphor-logging may process logs that do not always come through the standard phosphor-logging interfaces (for example logs sent down by the host). In these cases the system must still halt if those logs contain hardware callouts.

This email thread was sent on this topic to the list.

Requirements

Provide a mechanism to cause the OpenBMC firmware to halt a system if a phosphor-logging log is created with a inventory callout
- The mechanism to enable/disable this feature does not need to be an external API (i.e. Redfish). It can simply be a busctl command one runs in an ssh to the BMC
- The halt must be obvious to the user when it occurs
  - The log which causes the halt must be identifiable
- The halt must only stop the chassis/host instance that encountered the error
- The halt must stop the host (run obmc-host-stop@X.target) associated with the error and attempt to leave system in the fail state (i.e. chassis power remains on if it is on)
- The chassis/host instance pair will not be allowed to power on until the log that caused the halt is resolved or deleted
  - A BMC reset will clear this power on prevention
Quiesce the associated host during this failure

Special Note: Initially the associated host and chassis will be hard coded to chassis0 and host0. More work throughout the BMC stack is required to handle multiple chassis and hosts. This design allows that type of feature to be enabled at a later time.

Proposed Design

Create a phosphor-settingsd setting, xyz.openbmc_project.Logging.Settings. Within this create a boolean property called QuiesceOnHwError. This property will be hosted under the xyz.openbmc_project.Settings service.

Define a new D-Bus interface which will indicate an error has been created which will prevent the boot of a chassis/host instance: xyz.openbmc_project.Logging.ErrorBlocksTransition

This interface will be hosted under a instance based D-Bus object /xyz/openbmc_project/logging/blockX where X is the instance of the chassis/host pair being blocked.

When an error is created via a phosphor-logging interface, the software will check to see if the error has a callout, and if so it will check the new xyz.openbmc_project.Logging.Settings.QuiesceOnHwError. If this is true then phosphor-logging will create a /xyz/openbmc_project/logging/blockX D-Bus object with a xyz.openbmc_project.Logging.ErrorBlocksTransition interface under it. A mapper association between the log and this new D-Bus object will be created. The corresponding host instance will be put in quiesce by phosphor-logging.

The blocked state can be exited by rebooting the BMC or clearing the log responsible for the blocking. Other system specific policies could be placed in the appropriate targets (for example if a chassis power off should clear the block)

See the phosphor-logging callout design for more information on callouts.

The appropriate obmc-host-stop@.target instance will also be called when obmc-bmc-quiesce.target is started. This ensures the host is stopped as soon as the error is discovered.

obmcutil will be enhanced to look for these block interfaces and notify the user via the obmcutil state command if a block is enabled and what log is associated with it.

The goal is to build upon this concept when future design work is done to allow developers to associate certain error logs with causing a halt to the system until a log is handled.

Alternatives Considered

Currently this feature is a part of the base phosphor-logging design. If no one other then IBM sees value, we could roll this into the PEL-specific portion of phosphor-logging.

A systemd target could be created to do the host stop and quiesce (and any other system specific things people need) but at this point there doesn't seem to be a ton of value in it. Could always be added later if needed.

Impacts

This will require some additional checking on reported logs but should have minimal overhead.

There will be no changes to system behavior unless a user turns on this new setting.

Testing

Unit tests will be run to ensure logic to detect errors with logs and verify both possible values of the new setting.

Test cases will need to look for this new blocking D-Bus object and handle appropriately.