Define openbmc systemd target and service error handling
Change-Id: Icc83ca78b7b485374d3a922fe5372c71390b3b67
Signed-off-by: Andrew Geissler <andrewg@us.ibm.com>
diff --git a/openbmc-systemd.md b/openbmc-systemd.md
index d52dae8..1a91ce7 100644
--- a/openbmc-systemd.md
+++ b/openbmc-systemd.md
@@ -82,3 +82,71 @@
xyz.openbmc_project.State.Host.Transition.On
Underneath the covers, this is calling systemd with obmc-chassis-start@0.target
+
+## Error Handling of Systemd
+With great numbers of targets and services come great chances for failure.
+To make OpenBMC a robust and productive system, it needs an error handling
+policy for when services and their targets fail.
+
+When a failure occurs, the OpenBMC software needs to notify the users of the
+system and provide mechanisms for either the system to automatically retry the
+failed operation (e.g. reboot the system) or to stay in a quiesced state so that
+error data can be collected and the failure can be investigated.
+
+There are two main failure scenarios when it comes to OpenBMC and systemd usage:
+
+1. A service within a target fails
+   - If the service is a "oneshot" type, and the service is required
+     (not wanted) by the target, then the target will fail if the service
+     fails
+     - Define a behavior for when the target fails using the
+       "OnFailure" option (e.g. go to a new failure target if any required
+       service fails)
+   - If the service is not a "oneshot", then it cannot fail the target
+     (the target only knows that it started successfully)
+     - Define a behavior for when the service fails using the "OnFailure"
+       option
+     - The service cannot have "RemainAfterExit=yes"; otherwise, the
+       OnFailure action does not occur until the service is stopped (instead
+       of when it fails)
+     - See more information below on [RemainAfterExit](#RemainAfterExit)
+
+2. A failure outside of a normal systemd target/service (host watchdog expires,
+   host checkstop detected)
+   - The service which detects this failure is responsible for logging the
+     appropriate error, and instructing systemd to go to the appropriate target
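+
+The first scenario can be sketched as a pair of unit files. This is only an
+illustration: the target `example-host-start@.target`, the failure target
+`example-host-failure@.target`, the service `example-host-check@.service`, and
+the binary `/usr/bin/example-host-check` are all hypothetical names, not units
+that exist in OpenBMC.
+
+```ini
+# example-host-start@.target (hypothetical)
+[Unit]
+Description=Example host start target for instance %i
+# Unit to activate if this target enters the failed state
+OnFailure=example-host-failure@%i.target
+
+# example-host-check@.service (hypothetical)
+[Unit]
+Description=Example required check for instance %i
+
+[Service]
+Type=oneshot
+ExecStart=/usr/bin/example-host-check
+
+[Install]
+# Required (not wanted), so a failure of this service fails the target
+RequiredBy=example-host-start@%i.target
+```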
+
+Within OpenBMC, there is a host quiesce target. This is the target that other
+host related targets should go to when they hit a failure. Other software within
+OpenBMC can then monitor for the entry into this quiesce target and will handle
+the halt vs. automatic reboot functionality.
+
+Targets which are not host related will need special thought with regard to
+their error handling. For example, the target responsible for applying chassis
+power, obmc-power-chassis-on@0.target, will have an
+"OnFailure=obmc-power-chassis-off@%i.target" error path. That is, if the
+chassis power on target fails, then power off the chassis.
+
+The above info sets up some general **guidelines** for our host related
+targets and services:
+
+- All targets should have an "OnFailure=obmc-quiesce-host@.target" setting
+- All services which are required for a target to achieve its function should
+be RequiredBy that target (not WantedBy)
+- All services should first try to be "Type=oneshot" so that we can just rely on
+the target failure path
+- If a service cannot be "Type=oneshot", then it needs to have an
+"OnFailure=obmc-quiesce-host@.target" and ideally set "RemainAfterExit=no"
+(but see caveats on this below)
+- If a service cannot follow any of these guidelines, then it is up to the
+service application to call systemd with the obmc-quiesce-host@.target on
+failures
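+
+As a rough sketch of the guideline for services that cannot be
+"Type=oneshot", consider the unit below. The service name and the daemon
+`/usr/bin/example-host-monitor` are hypothetical, and the use of `%i` to pass
+the instance to the quiesce target is an assumption modeled on the chassis
+power example above.
+
+```ini
+# example-host-monitor@.service (hypothetical)
+[Unit]
+Description=Example host monitor for instance %i
+# The target cannot see this service fail, so the service carries its own
+# failure path into the quiesce target
+OnFailure=obmc-quiesce-host@%i.target
+
+[Service]
+Type=simple
+ExecStart=/usr/bin/example-host-monitor
+# "no" is the systemd default; shown explicitly to match the guideline
+RemainAfterExit=no
+```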
+
+### RemainAfterExit
+This is set to "yes" for most OpenBMC services to handle the situation where
+someone starts the same target twice. If the service associated with that
+target is no longer active (i.e. it has exited and RemainAfterExit=no), then
+the service will be
+executed again. Think about someone accidentally running the
+obmc-chassis-start@.target twice. If you execute it when the operating system
+is up and running, and the service which toggles the pgood pin is re-executed,
+you're going to crash your system. Given this info, the goal should always be
+to write "oneshot" services that have RemainAfterExit set to yes.
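+
+A minimal sketch of that goal, assuming a hypothetical service
+`example-toggle-pgood@.service` and a hypothetical script
+`/usr/bin/example-toggle-pgood`:
+
+```ini
+# example-toggle-pgood@.service (hypothetical)
+[Service]
+Type=oneshot
+# Keep the unit active after the process exits, so starting the target
+# again does not re-run ExecStart
+RemainAfterExit=yes
+ExecStart=/usr/bin/example-toggle-pgood
+```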