Improve BMC error handling for OCC comm failures
- Delay starting an OCC reset until all OCCs have been detected (or the
wait times out). This prevents multiple resets from being triggered and
helps detect when the reset has completed (the active sensor is set
after the reset is complete).
- Wait for the PLDM response to OCC reset and HRESET requests, and retry
if they fail.
- If HRESET returns NOT_READY, collect SBE FFDC and try an OCC reset. A
persistent failure will put the system into safe state.
- Prevent overwriting the dvfs over-temp filename for p10 and beyond,
since the old file is only present in older kernels.
- Prevent an assert when opening sysfs files (added a catch that creates
an OCC comm failure PEL, which forces an OCC reset).
- Check the return code after reading sysfs files to confirm success. If
the read fails, try a reset to recover.
- Update traces to include which processor/OCC encountered the issue.
- Improve recovery to close windows that were leaving the system in a
partial good state.
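The sysfs open/read handling described in the items above can be
sketched roughly as follows. This is an illustrative C++ sketch only,
not code from this change: readSysfsLine and any path passed to it are
hypothetical, and the actual change also creates a PEL and requests an
OCC reset when the open or read fails.

```cpp
#include <exception>
#include <fstream>
#include <optional>
#include <string>

// Hypothetical helper (not part of this change): read the first line of
// a sysfs-style file without throwing or asserting.
// Returns std::nullopt if the file cannot be opened or read, so the
// caller can report a comm failure and attempt a reset.
std::optional<std::string> readSysfsLine(const std::string& path)
{
    try
    {
        std::ifstream file(path);
        if (!file.is_open())
        {
            return std::nullopt; // open failed
        }

        std::string line;
        if (!std::getline(file, line))
        {
            return std::nullopt; // read failed or file was empty
        }
        return line;
    }
    catch (const std::exception&)
    {
        // Defensive: never let an exception escape to the caller.
        return std::nullopt;
    }
}
```

A caller would treat an empty result as an OCC comm failure and request
a reset, rather than letting an exception or assert end the process.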
JIRA: PFES-66
Change-Id: I0b087d0e05bd8562682062e1c662f9e18164a720
Signed-off-by: Chris Cain <cjcain@us.ibm.com>
diff --git a/occ_errors.hpp b/occ_errors.hpp
index 8cd97af..e3ae412 100644
--- a/occ_errors.hpp
+++ b/occ_errors.hpp
@@ -20,6 +20,8 @@
constexpr auto SAFE_ERROR_PATH = "org.open_power.OCC.Device.Error.SafeState";
constexpr auto MISSING_OCC_SENSORS_PATH =
"org.open_power.OCC.Firmware.Error.MissingOCCSensors";
+constexpr auto OCC_COMM_ERROR_PATH =
+ "org.open_power.OCC.Device.Error.OpenFailure";
/** @class Error
* @brief Monitors for OCC device error condition