fail boot on hw error

This is a design proposal for handling our manufacturing team's
requirements: ensure a computer system is not shipped to a customer
with broken components, and ensure any issue is detected and repaired
quickly.

Change-Id: I32dedd31e2f04b7d9602214974a2ac2e1573885a
Signed-off-by: Andrew Geissler <geissonator@yahoo.com>
diff --git a/designs/fail-boot-on-hw-error.md b/designs/fail-boot-on-hw-error.md
new file mode 100644
index 0000000..fc1f007
--- /dev/null
+++ b/designs/fail-boot-on-hw-error.md
@@ -0,0 +1,128 @@
+# Fail Boot on Hardware Errors
+
+Author: Andrew Geissler (geissonator)
+
+Primary assignee: Andrew Geissler (geissonator)
+
+Other contributors:
+
+Created: Feb 20, 2020
+
+## Problem Description
+Some groups, for example a manufacturing team, have a requirement for the BMC
+firmware to halt a system if an error log is created which calls out a piece of
+hardware. This ensures a system is not shipped to a customer with any type of
+hardware issue. It also ensures that when an error is found, it is identified
+quickly and all activity stops until the issue is fixed. If the system
+develops a hardware issue after it has shipped from manufacturing, the BMC
+firmware should report the error but allow the system to continue to boot and
+operate.
+
+OpenBMC firmware needs a mechanism to support this use case.
+
+## Background and References
+Within IBM, this function has been enabled/disabled by what are called
+manufacturing flags. These were bits the user could set in registry variables
+which the firmware would then query. These registry variables were only
+settable by someone with admin authority to the system. These flags were not
+used outside of manufacturing and test.
+
+Extensions within phosphor-logging may process logs that do not always come
+through the standard phosphor-logging interfaces (for example logs sent
+down by the host). In these cases the system must still halt if those logs
+contain hardware callouts.
+
+[This][1] email thread on the topic was sent to the OpenBMC mailing list.
+
+## Requirements
+- Provide a mechanism to cause the OpenBMC firmware to halt a system if a
+  phosphor-logging log is created with an inventory callout
+  - The mechanism to enable/disable this feature does not need to be an
+    external API (e.g. Redfish). It can simply be a busctl command run in
+    an SSH session on the BMC
+  - The halt must be obvious to the user when it occurs
+    - The log which causes the halt must be identifiable
+  - The halt must only stop the chassis/host instance that encountered the error
+  - The halt must stop the host (run obmc-host-stop@X.target) associated with
+    the error and attempt to leave the system in the failed state (i.e. chassis
+    power remains on if it is on)
+  - The chassis/host instance pair will not be allowed to power on until
+    the log that caused the halt is resolved or deleted
+    - A BMC reset will clear this power-on prevention
+- Quiesce the associated host during this failure
+
+**Special Note:** Initially the associated host and chassis will be hard-coded to
+chassis0 and host0. More work throughout the BMC stack is required to handle
+multiple chassis and hosts. This design allows that type of feature to be
+enabled at a later time.
+
+## Proposed Design
+Create a [phosphor-settingsd][2] setting,
+`xyz.openbmc_project.Logging.Settings`. Within it, create a boolean property
+called `QuiesceOnHwError`. This property will be hosted under the
+`xyz.openbmc_project.Settings` service.
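+
+As a rough illustration of the "busctl command one runs on the BMC"
+requirement, the setting could be toggled as sketched below. The object path
+`/xyz/openbmc_project/logging/settings` is an assumption made for this
+example; this design does not fix the path.
+
+```shell
+# Enable the halt-on-hardware-error behavior (object path is assumed).
+busctl set-property xyz.openbmc_project.Settings \
+    /xyz/openbmc_project/logging/settings \
+    xyz.openbmc_project.Logging.Settings QuiesceOnHwError b true
+
+# Read the current value back.
+busctl get-property xyz.openbmc_project.Settings \
+    /xyz/openbmc_project/logging/settings \
+    xyz.openbmc_project.Logging.Settings QuiesceOnHwError
+```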
+
+Define a new D-Bus interface which will indicate an error has been created which
+will prevent the boot of a chassis/host instance:
+`xyz.openbmc_project.Logging.ErrorBlocksTransition`
+
+This interface will be hosted under an instance-based D-Bus object
+`/xyz/openbmc_project/logging/blockX` where X is the instance of the
+chassis/host pair being blocked.
+
+When an error is created via a phosphor-logging interface, the software will
+check whether the error has a callout, and if so it will check the new
+`xyz.openbmc_project.Logging.Settings.QuiesceOnHwError` property. If it is true,
+phosphor-logging will create a `/xyz/openbmc_project/logging/blockX` D-Bus
+object with an `xyz.openbmc_project.Logging.ErrorBlocksTransition` interface
+under it. A mapper [association][3] between the log and this new D-Bus
+object will be created. The corresponding host instance will be put
+into the quiesced state by phosphor-logging.
+
+The blocked state can be exited by rebooting the BMC or clearing the log
+responsible for the blocking. Other system-specific policies could be placed
+in the appropriate targets (for example, if a chassis power off should clear
+the block).
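+
+A minimal sketch of inspecting and clearing a block from the BMC shell follows.
+It assumes the block objects live in the `xyz.openbmc_project.Logging` service
+and that the offending log is entry 1, which is a hypothetical ID; deletion
+uses the `xyz.openbmc_project.Object.Delete` interface implemented by
+phosphor-logging entries.
+
+```shell
+# List logging objects; a blocked host would show a .../logging/blockX object.
+busctl tree xyz.openbmc_project.Logging
+
+# Delete the log responsible for the block (entry ID 1 is hypothetical).
+busctl call xyz.openbmc_project.Logging \
+    /xyz/openbmc_project/logging/entry/1 \
+    xyz.openbmc_project.Object.Delete Delete
+```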
+
+See the phosphor-logging [callout][4] design for more information on callouts.
+
+The appropriate `obmc-host-stop@.target` instance will also be called when
+`obmc-bmc-quiesce.target` is started. This ensures the host is stopped as soon as
+the error is discovered.
+
+obmcutil will be enhanced to look for these block interfaces and notify the
+user via the `obmcutil state` command if a block is enabled and what log
+is associated with it.
+
+The goal is to build upon this concept in future design work so that
+developers can designate certain error logs as halting the system until the
+log is handled.
+
+## Alternatives Considered
+Currently this feature is a part of the base phosphor-logging design. If no
+one other than IBM sees value, we could roll this into the PEL-specific
+portion of phosphor-logging.
+
+A systemd target could be created to do the host stop and quiesce (and any
+other system-specific things people need), but at this point there does not
+seem to be much value in it. It could always be added later if needed.
+
+## Impacts
+This will require some additional checking on reported logs but should have
+minimal overhead.
+
+There will be no changes to system behavior unless a user turns on this new
+setting.
+
+## Testing
+Unit tests will exercise the logic that detects error logs with callouts and
+will verify both possible values of the new setting.
+
+Test cases will need to look for this new blocking D-Bus object and handle
+it appropriately.
+
+
+[1]: https://lists.ozlabs.org/pipermail/openbmc/2020-February/020575.html
+[2]: https://github.com/openbmc/phosphor-settingsd
+[3]: https://github.com/openbmc/docs/blob/master/architecture/object-mapper.md#associations
+[4]: https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/xyz/openbmc_project/Common/Callout/README.md