Author: Lakshminarayana Kammath
Other contributors: Jayanth Othayoth
Created: 2019-05-21
Currently, servers that use OpenBMC cannot capture relevant debug data when the host is unresponsive or hung. These systems need the ability to diagnose the root cause of the hang and to recover the host with the debug data collected.
At customer sites and in labs, the host can become unresponsive and hang the system (https://github.com/ibm-openbmc/dev/issues/457). Today there is no way to determine what went wrong with a host in a hung state, so the system has to be recovered without any relevant debug data being captured.
Whether the host is unresponsive or running, the admin needs to be able to trigger an NMI event, which in turn triggers an architecture-dependent procedure that fires an interrupt on all available processors in the system.
This proposal aims to trigger an NMI, which in turn invokes an architecture-specific procedure that enables data collection followed by recovery of the host. This will allow host/OS development teams to analyze and fix issues where the host hangs and the system becomes unresponsive.
Introduce a new D-Bus interface in the Control.Host namespace (/openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/Control/Host/NMI.interface.yaml) and implement the new D-Bus back-end for the respective processor-specific targets.
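As a rough illustration of how such an interface might be exercised from the BMC shell, the sketch below only assembles a `busctl call` command line; it does not run it. The interface name is derived from the YAML path above. The service name, the object path layout, and a no-argument method named `NMI` are assumptions made for this example and are not defined by this proposal.

```python
from typing import List

# Interface name derived from the YAML path given in the proposal.
NMI_INTERFACE = "xyz.openbmc_project.Control.Host.NMI"

# Assumptions (not stated in this document): the service name, the
# object path layout, and a no-argument method called "NMI".
NMI_SERVICE = "xyz.openbmc_project.Control.Host.NMI"
NMI_METHOD = "NMI"


def nmi_object_path(host_index: int = 0) -> str:
    """Return the assumed D-Bus object path for one host instance."""
    if host_index < 0:
        raise ValueError("host_index must be non-negative")
    return f"/xyz/openbmc_project/control/host{host_index}/nmi"


def build_nmi_busctl_command(host_index: int = 0) -> List[str]:
    """Build (but do not execute) a busctl invocation of the NMI method.

    busctl syntax: busctl call SERVICE OBJECT INTERFACE METHOD
    """
    return [
        "busctl",
        "call",
        NMI_SERVICE,
        nmi_object_path(host_index),
        NMI_INTERFACE,
        NMI_METHOD,
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before being run on a BMC.
    print(" ".join(build_nmi_busctl_command()))
```

Keeping the command construction separate from execution makes it easy to review the exact call before it is issued against a real system.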
Enable the NMI phosphor D-Bus interface and support it via Redfish.
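One plausible Redfish mapping is the standard `ComputerSystem.Reset` action with `ResetType` set to `Nmi`. The sketch below builds such a request without sending it. The system ID `system`, the host name `bmc.example.com`, and the choice of this action as the Redfish entry point are assumptions for illustration; authentication is deliberately left out.

```python
import json
import urllib.request


def build_nmi_reset_request(
    bmc_host: str, system_id: str = "system"
) -> urllib.request.Request:
    """Build (but do not send) a Redfish reset request asking for an NMI.

    Assumptions: the BMC exposes the system under the given system_id and
    maps ResetType "Nmi" to the NMI D-Bus interface. Session or basic
    authentication headers would have to be added by the caller.
    """
    url = (
        f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
        "/Actions/ComputerSystem.Reset"
    )
    body = json.dumps({"ResetType": "Nmi"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # bmc.example.com is a placeholder host name.
    req = build_nmi_reset_request("bmc.example.com")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Sending the request would additionally need credentials and TLS settings appropriate for the BMC in question.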
An alternative that was considered is to extend the existing D-Bus interface in the State.Host namespace (/openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/State/Host.interface.yaml) to support a new RequestedHostTransition value called Nmi. The D-Bus back-end could then internally invoke a processor-specific target to trigger the NMI and perform the associated actions.
There were strong reasons to move away from the above approach. phosphor-state-manager has always been focused on the states of the BMC, chassis, and host, whereas NMI is more of an action against the host than a state.
This implementation only needs to make some changes to the system state when an NMI is initiated, irrespective of the state the host OS is in, so it has minimal impact on the rest of the system.
Depending on the platform hardware design, this test requires a host OS kernel module driver to create a hard lockup/hang and then verify that the hang is detected and handled as expected. Also, one can invoke an NMI to get a crash dump and confirm via the console logs that the host received the NMI.