hiomap: Handle SIGTERM to ensure delivery of HIOMAP BMC status
We need to jump through a few hoops to ensure that the host firmware
reliably receives indication of changes to the HIOMAP daemon's state
over the IPMI transport. This is driven partly by IPMI's design, partly
by ipmid's implementation, and partly by the original design of the
HIOMAP DBus transport interface.
A long comment has been added outlining the race conditions eliminated
by this change and its related patches, however it's worth drawing
attention to the issue not addressed there - the original design of the
HIOMAP DBus transport:
The HIOMAP BMC status is composed of two distinct types of data:
1. Stateful: BMC_EVENT_DAEMON_READY and BMC_EVENT_FLASH_CONTROL_LOST
2. Events: BMC_EVENT_PROTOCOL_RESET and BMC_EVENT_WINDOW_RESET
The data types described by 1 and 2 map directly onto the DBus concepts
of Properties and Signals. Originally the specification for the HIOMAP
DBus transport exploited this direct mapping, however experience from
dealing with SIGTERM handling has shown that it was a poor mapping to
exploit.
On shutdown the HIOMAP daemon, mboxd, needs to atomically both clear the
BMC_EVENT_DAEMON_READY bit and set the BMC_EVENT_PROTOCOL_RESET bit.
With the data exposed as distinct types this results on two messages
emitted on the bus: A PropertiesChanged signal to advertise the update
to BMC_EVENT_DAEMON_READY, and a raw signal to advertise the update to
BMC_EVENT_PROTOCOL_RESET.
With two separate signals to be processed by ipmid's event loop it is
hard to ensure both will be propagated to the host before the SIGTERM
from systemd is processed by ipmid.
The solution to eliminate the race is to rework the events exposed as
signals into properties and live with the slight mismatch of intent.
This results in the removal of the associated signal handling code in
the plugin.
Change-Id: Ic05c40b52138c132eface6f8d873088e7e66585b
Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
diff --git a/hiomap.cpp b/hiomap.cpp
index 5d3408e..ba4d11b 100644
--- a/hiomap.cpp
+++ b/hiomap.cpp
@@ -7,7 +7,12 @@
#include <endian.h>
#include <host-ipmid/ipmid-api.h>
+#include <signal.h>
+#include <string.h>
+#include <systemd/sd-bus.h>
+#include <systemd/sd-event.h>
+#include <cassert>
#include <cstring>
#include <fstream>
#include <functional>
@@ -23,6 +28,196 @@
#include <tuple>
#include <utility>
+/*
+
+Design and integration notes
+============================
+
+The primary motivation of the Host I/O Mapping protocol (HIOMAP) is to mediate
+host access to a BMC-controlled flash chip housing the host's boot firmware.
+
+openpower-host-ipmi-flash facilitates the system design of transporting the
+HIOMAP protocol[1] over IPMI. This is somewhat abusive of IPMI, basically
+treating the BT interface as a mailbox with an interrupt each way between the
+BMC and the host.
+
+[1] https://github.com/openbmc/mboxbridge/blob/master/Documentation/protocol.md
+
+Using IPMI in this way has a number of challenges, a lot of them on the host
+side where we need to bring up the LPC and BT interfaces to enable IPMI before
+accessing the flash, and before any interrupts are enabled. There are also
+challenges on the BMC side with the design of the current implementation. We
+will cover those here.
+
+BMC-side System Design and Integration Issues
+---------------------------------------------
+
+The current design is that we have the HIOMAP daemon, mboxd (to be renamed),
+exposing a set of DBus interfaces. Whilst the spec defines the IPMI transport
+message packing, mboxd knows nothing of IPMI itself, instead relying on the
+DBus interface to receive messages from ipmid. ipmid in-turn knows nothing of
+the interfaces communicating with it, also relying on DBus to receive messages
+from interface-specific daemons, e.g. btbridged[2].
+
+[2] https://github.com/openbmc/btbridge
+
+For this design to function correctly we must ensure that the daemons are
+started and shut down in a reasonable order, however defining that order is
+somewhat tricky:
+
+1. systemd uses Wants=/Before=/After= relationships in units to define both
+ start-up *and* shutdown order, in stack push / pop order respectively.
+2. Clearly ipmid depends on btbridged to receive messages sent by signals and
+ replied to by method calls, so it needs a Wants=/After= relationship on
+ btbridged
+3. mboxd depends on ipmid to receive messages sent by method call, and issues a
+ PropertiesChanged signal to notify of state changes.
+
+Point 3. suggests mboxd should have a Wants=/Before= relationship with ipmid to
+ensure ipmid can call into mboxd as messages arrive. However, this causes some
+grief with shutdown of the BMC, as mboxd needs to issue a state-change
+notification when it is shut down to inform the host that will not repsond to
+future requests and that the protocol state has been reset. If mboxd has a
+Wants=/Before= relationship with ipmid this message will never propagate to the
+host, as ipmid will be shut by systemd before mboxd.
+
+The above leads to mboxd having a Wants=/After= relationship with ipmid. This
+ensures that if mboxd is restarted on its own the correct state changes will be
+propagated to the host. The case where ipmid attempts to call into mboxd's DBus
+interface before mboxd is ready is mitigated by the ready bit in the protocol's
+BMC status, which will not yet be set, preventing a conforming host from
+attempting to contact mboxd.
+
+While this ordering prevents mboxd from being terminated before ipmid, there is
+no control over the *scheduling* of processes to ensure the PropertiesChanged
+signal emitted by mboxd before mboxd is terminated is seen by ipmid before
+*ipmid* is also terminated. This leads to our first implementation wart:
+
+ On the basis that mboxd has a Wants=/After= relationship with ipmid,
+ openpower-host-ipmi-flash will emit an HIOMAP BMC status event to the host
+ with the value BMC_EVENT_PROTOCOL_RESET upon receiving SIGTERM iff the BMC
+ state is not already set to BMC_EVENT_PROTOCOL_RESET.
+
+If ipmid has received SIGTERM the assumption is that it is systemd that is
+sending it, and that the Wants=/After= relationship requires that mboxd has
+been terminated before ipmid receives SIGTERM. By ensuring
+openpower-host-ipmi-flash emits the BMC event state we close the race where the
+host is not informed of the termination of mboxd due to scheduling ipmid (to
+deliver SIGTERM) prior to scheduling dbus-daemon, where the PropertiesChanged
+event would be delivered from mboxd to ipmid.
+
+Observations on the IPMI Specification and Design Details of ipmid
+------------------------------------------------------------------
+
+In addition to the system-level design problems with delivering
+PropertiesChanged signals during shutdown, IPMI specification and ipmid design
+issues exist that make it tedious to ensure that events will be correctly
+delivered to the host.
+
+The first necessary observation is that the mechanism for delivering BMC state
+change events from mboxd to the host over IPMI uses the SMS ATN bit to indicate
+a message is ready for delivery from the BMC to the host system. Retrieving the
+BMC state data involves the host recognising that the SMS ATN bit is set,
+performing Get Message Flags transaction with the BMC followed by a subsequent
+Get Message transaction. Thus, delivery of the HIOMAP protocol's BMC status is
+not an atomic event.
+
+The second necessary observation is that the kernel delivers signals
+asynchronously. This couples badly with IPMI's event delivery not being atomic:
+ipmid can win the race against SIGTERM to receive the PropertiesChanged event
+from mboxd, but lose the race to complete delivery to the host.
+
+ On this basis, we need to block the delivery of SIGTERM to ipmid until ipmid
+ has completed the set of `SMS ATN`/`Get Message Flags`/`Get Message`
+ transactions with the host
+
+One approach to this would be to configure a custom SIGTERM handler that sets
+some global application state to indicate that SIGTERM has been delivered. A
+better approach that avoids the need for global application state is to simply
+block the signal until we are ready to handle it, which we can do via
+sigprocmask(2).
+
+The existing design of ipmid makes it feasible to block and unblock
+asynchronous SIGTERM as we require. ipmid_send_cmd_to_host() takes a CallBack
+function as an argument, which is invoked by
+phosphor::host::command::Manager::getNextCommand(). The documentation for
+phosphor::host::command::Manager::getNextCommand() says:
+
+ @brief Extracts the next entry in the queue and returns
+ Command and data part of it.
+
+ @detail Also calls into the registered handlers so that they can now
+ send the CommandComplete signal since the interface contract
+ is that we emit this signal once the message has been
+ passed to the host (which is required when calling this)
+
+ Also, if the queue has more commands, then it will alert the
+ host
+
+However, its description is not entirely accurate. The callback function is
+invoked when ipmid *dequeues* the data to send to the host: Delivery of the
+data to the host occurs at some *after* the callback has been invoked.
+
+Invoking the callback before completion of delivery of the data to the host
+nullifies the approach of unblocking asynchronous SIGTERM in the callback
+associated with sending the HIOMAP BMC state event to the host, as the BMC
+kernel can asynchronously terminate the process between the callback being
+invoked and the host receiving the BMC state event data.
+
+Overcoming this issue hinges on a significant implementation detail of ipmid:
+
+ ipmid uses an sd_event loop in the main function to pump DBus events.
+
+This leads to a third necessary observation:
+
+ sd_event can be used to process UNIX signals as well as other events by way
+ of Linux's signalfd(2) interface.
+
+The fact that sd_event is used to pump DBus events means that ipmid can remain
+a single-threaded process. By remaining single-threaded we know that events
+processing is sequencial and no two events can be processed simultaneously. A
+corollary of this is that DBus events and UNIX signals are serialised with
+respect to eachother.
+
+The fourth necessary observation is that we do not need to pump sd_event in
+order to complete DBus method calls; sd_bus will handle the pumping independent
+of the main loop in order to complete the method invocation.
+
+Implementing Reliable HIOMAP BMC Status Event Delivery
+------------------------------------------------------
+
+We achieve reliable delivery of HIOMAP BMC status events in the following way:
+
+1. During plugin initialisation, mask SIGTERM using sigprocmask(2)
+2. Subsequent to masking SIGTERM, register
+ openpower::flash::hiomap_protocol_reset() as the SIGTERM handler using
+ sd_event_add_signal() to hook a signalfd(2) into sd_event
+3. openpower::flash::hiomap_protocol_reset() implements the logic to send the
+ BMC_EVENT_PROTOCOL_RESET state to the host if necessary, otherwise terminate
+ the sd_event loop.
+4. If it is necessary to send BMC_EVENT_PROTOCOL_RESET to the host in 3, assign
+ a callback handler that terminates the sd_event loop, which is only
+ processed after the current iteration is complete.
+
+This process and its use of signalfd integration in the sd_event loop
+eliminates the following three races:
+
+1. The scheduler race between mboxd, dbus-daemon and ipmid, by having
+ openpower-host-ipmi-flash conditionally deliver the protocol reset event if
+ no such message has been received from mboxd
+2. The race between delivering the BMC status event to the host and ipmid
+ receiving asynchronous SIGTERM after receiving the PropertiesChanged event
+ from mboxd
+3. The race to deliver the BMC status data to the host after unblocking
+ asynchronous SIGTERM in the host command callback and before receiving
+ asynchronous SIGTERM.
+
+Ultimately, ipmid could benefit from a redesign that fires the callback *after*
+delivering the associated data to the host, but brief inspection determined
+that this involved a non-trivial amount of effort.
+
+*/
+
using namespace sdbusplus;
using namespace phosphor::host::command;
@@ -46,14 +241,24 @@
constexpr auto DBUS_IFACE_PROPERTIES = "org.freedesktop.DBus.Properties";
+/* XXX: ipmid is currently single-threaded, pumping dbus events in sequence
+ * via the main event loop. Thus the code is not forced to be re-entrant. We
+ * also know that the callback and DBus event handling will not be running
+ * concurrently.
+ *
+ * ipmid_send_cmd_to_host() takes a callback that doesn't define a context
+ * pointer, so instead use a global. active_event_updates gates manipulation of
+ * process state, so its definition as a global at least aligns with its use.
+ */
+static int active_event_updates;
+static struct sd_event_source* event_source;
+
struct hiomap
{
bus::bus* bus;
/* Signals */
bus::match::match* properties;
- bus::match::match* window_reset;
- bus::match::match* bmc_reboot;
/* Protocol state */
std::map<std::string, int> event_lookup;
@@ -114,18 +319,53 @@
static void ipmi_hiomap_event_response(IpmiCmdData cmd, bool status)
{
using namespace phosphor::logging;
+ int rc;
if (!status)
{
log<level::ERR>("Failed to deliver host command",
entry("SEL_COMMAND=%x:%x", cmd.first, cmd.second));
}
+
+ assert(active_event_updates);
+ active_event_updates--;
+ if (!active_event_updates)
+ {
+ rc = sd_event_source_set_enabled(event_source, SD_EVENT_ON);
+ if (rc < 0)
+ {
+ log<level::WARNING>("Failed to unblock SIGTERM delivery",
+ entry("RC=%d", rc));
+ }
+ else
+ {
+ log<level::DEBUG>("Unblocked SIGTERM");
+ }
+ }
}
static int hiomap_handle_property_update(struct hiomap* ctx,
sdbusplus::message::message& msg)
{
+ using namespace phosphor::logging;
+
std::map<std::string, sdbusplus::message::variant<bool>> msgData;
+ int rc;
+
+ if (!active_event_updates)
+ {
+ rc = sd_event_source_set_enabled(event_source, SD_EVENT_OFF);
+ if (rc < 0)
+ {
+ log<level::WARNING>("Failed to block SIGTERM delivery",
+ entry("RC=%d", rc));
+ }
+ else
+ {
+ log<level::DEBUG>("Blocked SIGTERM");
+ }
+ }
+ active_event_updates++;
std::string iface;
msg.read(iface, msgData);
@@ -158,6 +398,41 @@
return 0;
}
+static int hiomap_protocol_reset_response(IpmiCmdData cmd, bool status)
+{
+ return sd_event_exit(ipmid_get_sd_event_connection(), status ? 0 : EIO);
+}
+
+static int hiomap_protocol_reset(sd_event_source* source,
+ const struct signalfd_siginfo* si,
+ void* userdata)
+{
+ struct hiomap* ctx = static_cast<struct hiomap*>(userdata);
+
+ if (ctx->bmc_events == BMC_EVENT_PROTOCOL_RESET)
+ {
+ return sd_event_exit(ipmid_get_sd_event_connection(), 0);
+ }
+
+ /*
+ * Send an attention indicating the hiomapd has died
+ * (BMC_EVENT_DAEMON_READY cleared) and that the protocol has been reset
+ * (BMC_EVENT_PROTOCOL_RESET set) to indicate to the host that it needs to
+ * wait for the BMC to come back and renegotiate the protocol.
+ *
+ * We know this to be the case in systems that integrate
+ * openpower-host-ipmi-flash, as hiomapd's unit depends on
+ * phosphor-ipmi-host, and thus hiomapd has been terminated before ipmid
+ * receives SIGTERM.
+ */
+ auto cmd = std::make_pair(IPMI_CMD_HIOMAP_EVENT, BMC_EVENT_PROTOCOL_RESET);
+
+ auto cmdHandler = std::make_tuple(cmd, hiomap_protocol_reset_response);
+ ipmid_send_cmd_to_host(cmdHandler);
+
+ return 0;
+}
+
static bus::match::match hiomap_match_properties(struct hiomap* ctx)
{
auto properties =
@@ -170,31 +445,6 @@
return match;
}
-static int hiomap_handle_signal_v2(struct hiomap* ctx, const char* name)
-{
- ctx->bmc_events |= ctx->event_lookup[name];
-
- auto cmd = std::make_pair(IPMI_CMD_HIOMAP_EVENT, ctx->bmc_events);
-
- ipmid_send_cmd_to_host(std::make_tuple(cmd, ipmi_hiomap_event_response));
-
- return 0;
-}
-
-static bus::match::match hiomap_match_signal_v2(struct hiomap* ctx,
- const char* name)
-{
- using namespace bus::match;
-
- auto signals = rules::type::signal() + rules::path(HIOMAPD_OBJECT) +
- rules::interface(HIOMAPD_IFACE_V2) + rules::member(name);
-
- bus::match::match match(*ctx->bus, signals,
- std::bind(hiomap_handle_signal_v2, ctx, name));
-
- return match;
-}
-
static ipmi_ret_t hiomap_reset(ipmi_request_t request, ipmi_response_t response,
ipmi_data_len_t data_len, ipmi_context_t context)
{
@@ -458,10 +708,6 @@
{
auto reply = ctx->bus->call(m);
- /* Update our cache: Necessary because the signals do not carry a value
- */
- ctx->bmc_events &= ~acked;
-
*data_len = 0;
}
catch (const exception::SdBusError& e)
@@ -592,10 +838,13 @@
static void register_openpower_hiomap_commands()
{
+ using namespace phosphor::logging;
using namespace openpower::flash;
- /* FIXME: Clean this up? Can we unregister? */
+ sigset_t _sigset, *sigset = &_sigset;
struct hiomap* ctx = new hiomap();
+ struct sd_event* events;
+ int rc;
/* Initialise mapping from signal and property names to status bit */
ctx->event_lookup["DaemonReady"] = BMC_EVENT_DAEMON_READY;
@@ -613,10 +862,37 @@
*/
ctx->properties =
new bus::match::match(std::move(hiomap_match_properties(ctx)));
- ctx->bmc_reboot = new bus::match::match(
- std::move(hiomap_match_signal_v2(ctx, "ProtocolReset")));
- ctx->window_reset = new bus::match::match(
- std::move(hiomap_match_signal_v2(ctx, "WindowReset")));
+
+ rc = sigemptyset(sigset);
+ if (rc < 0)
+ {
+ log<level::ERR>("sigemptyset() failed", entry("RC=%d", rc));
+ return;
+ }
+
+ rc = sigaddset(sigset, SIGTERM);
+ if (rc < 0)
+ {
+ log<level::ERR>("sigaddset() failed", entry("RC=%d", rc));
+ return;
+ }
+
+ rc = sigprocmask(SIG_BLOCK, sigset, NULL);
+ if (rc < 0)
+ {
+ log<level::ERR>("sigprocmask() failed", entry("RC=%d", rc));
+ return;
+ }
+
+ events = ipmid_get_sd_event_connection();
+
+ rc = sd_event_add_signal(events, &event_source, SIGTERM,
+ openpower::flash::hiomap_protocol_reset, ctx);
+ if (rc < 0)
+ {
+ log<level::ERR>("sd_event_add_signal() failed", entry("RC=%d", rc));
+ return;
+ }
ipmi_register_callback(NETFUN_IBM_OEM, IPMI_CMD_HIOMAP, ctx,
openpower::flash::hiomap_dispatch, SYSTEM_INTERFACE);