watchdog: Trigger system dump when host load fails

The xyz.openbmc_project.State.Host interface does not reliably
distinguish hostboot from the host during the handoff between them.
While the host is still initializing and has not reached full
functionality, the host state is still assumed to be hostboot. If the
host hits an early boot failure and the watchdog timer expires because
boot did not finish within the configured time, the watchdog handler
misinterprets the failure as a hostboot problem.

To address this, use a core scratch register that hostboot updates
just before transferring control to the host. Checking this register
tells us whether the transition to the host has already occurred.

With this check we can determine which boot subsystem has failed or
stopped responding, and extract the dump from the right subsystem.
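The decision described above can be sketched as follows. This is only an
illustration: DumpType, selectDump and toString are hypothetical names that
do not appear in this change, and the scratch register read is assumed to
have been reduced to a single boolean already.

```cpp
#include <string_view>

// Hypothetical type for illustration only; not part of this change.
enum class DumpType
{
    Hostboot, // failure happened while hostboot still owned the boot
    System    // failure happened after control moved to the host
};

// Assumption: the caller has already read the core scratch register and
// reduced it to "has control transitioned to the host?".
constexpr DumpType selectDump(bool controlTransitionedToHost)
{
    return controlTransitionedToHost ? DumpType::System : DumpType::Hostboot;
}

// Human-readable name of the chosen dump, e.g. for a log message.
constexpr std::string_view toString(DumpType type)
{
    return type == DumpType::System ? "system dump" : "hostboot dump";
}
```

On a watchdog timeout after SBE boot has completed, the handler would call
selectDump() with the register-derived flag and collect the matching dump.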
Tested:
Oct 03 09:35:36 p10bmc watchdog_timeout[7099]: Host did not respond
within watchdog timeout interval
Oct 03 09:35:36 p10bmc watchdog_timeout[7099]: PHYP boot failure,
triggering system dump
Oct 03 09:35:37 p10bmc phosphor-log-manager[372]: Created PEL
0x50001924 (BMC ID 252) with SRC BD5EC101
Oct 03 09:35:37 p10bmc ibm-panel[1208]: Resolution is empty for
PEL = /xyz/openbmc_project/logging/entry/252
Oct 03 09:35:37 p10bmc phosphor-host-state-manager[795]: Received
signal that host has crashed, decrement reboot count
...
...
Oct 03 09:35:38 p10bmc systemd[1]: Starting Start memory preserving
reboot host0...
Oct 03 09:35:38 p10bmc pldmd[719]: BIOS:pvm_sys_dump_active, updated
to value: Enabled(16), by BMC: true
Oct 03 09:35:38 p10bmc phosphor-dump-manager[473]: OriginatorId is
not provided
Oct 03 09:35:38 p10bmc phosphor-dump-manager[473]: OriginatorType is
not provided. Replacing the string with the default value
Oct 03 09:35:38 p10bmc sh[7151]:
o "/xyz/openbmc_project/dump/bmc/entry/2"
Oct 03 09:35:38 p10bmc openpower-proc-control[7153]: Starting memory
preserving reboot
Change-Id: I312ad7201e9258d23f6e784fab504d0fb8f0f712
Signed-off-by: Deepa Karthikeyan <deepakala.karthikeyan@ibm.com>
diff --git a/watchdog/watchdog_main.cpp b/watchdog/watchdog_main.cpp
index b124be3..e650f70 100644
--- a/watchdog/watchdog_main.cpp
+++ b/watchdog/watchdog_main.cpp
@@ -33,6 +33,42 @@
transitionHost(HOST_STATE_QUIESCE_TGT);
}
+void triggerSystemDump()
+{
+ try
+ {
+ // Create the PEL before starting the target
+ constexpr auto eventName =
+ "org.open_power.Host.Boot.Error.WatchdogTimedOut";
+
+ // CreatePELWithFFDCFiles requires a vector of FFDCTuple.
+ auto emptyFfdc = std::vector<FFDCTuple>{};
+
+ std::map<std::string, std::string> additionalData;
+
+ // Create PEL with empty additional data.
+ createPel(eventName, additionalData, emptyFfdc);
+
+ // Transition the host by starting the appropriate systemd target
+ constexpr auto target = "obmc-host-crash@0.target";
+
+ auto bus = sdbusplus::bus::new_system();
+ auto method = bus.new_method_call(
+ "org.freedesktop.systemd1", "/org/freedesktop/systemd1",
+ "org.freedesktop.systemd1.Manager", "StartUnit");
+
+ method.append(target); // target unit to start
+ method.append("replace"); // mode = replace conflicting queued jobs
+
+ bus.call_noreply(method); // start the service
+ }
+ catch (const sdbusplus::exception::SdBusError& e)
+ {
+ lg2::error("triggerSystemDump: D-Bus call exception, errorMsg({ERROR})",
+ "ERROR", e.what());
+ }
+}
+
/**
* @brief get SBE special callout information
*
diff --git a/watchdog/watchdog_main.hpp b/watchdog/watchdog_main.hpp
index bb90ee2..b915a0c 100644
--- a/watchdog/watchdog_main.hpp
+++ b/watchdog/watchdog_main.hpp
@@ -26,5 +26,13 @@
*/
void handleSbeBootError(struct pdbg_target* procTarget, const uint32_t timeout);
+/**
+ * @brief Create a PEL and trigger a system dump
+ *
+ * @details Logs a watchdog-timeout PEL, then starts the host crash target
+ *
+ */
+void triggerSystemDump();
+
} // namespace dump
} // namespace watchdog
diff --git a/watchdog_timeout.cpp b/watchdog_timeout.cpp
index f616312..aa2ed0a 100644
--- a/watchdog_timeout.cpp
+++ b/watchdog_timeout.cpp
@@ -62,9 +62,22 @@
return EXIT_SUCCESS;
}
- // SBE boot done, Need to collect hostboot dump
- lg2::info("Handle Hostboot boot failure");
- triggerHostbootDump(timeout);
+ // If hostboot has already handed control to the host and the
+ // hypervisor fails before it finishes loading, we end up here,
+ // so trigger a system dump
+ if (openpower::phal::pdbg::hasControlTransitionedToHost())
+ {
+ // hostboot done, Need to collect system dump
+ lg2::info(
+ "Failure while loading hypervisor, triggering system dump");
+ triggerSystemDump();
+ }
+ else
+ {
+ // SBE boot done, Need to collect hostboot dump
+ lg2::info("Handle Hostboot boot failure");
+ triggerHostbootDump(timeout);
+ }
}
else
{