designs: BMC service failure debug and recovery

The design outlines a mechanism for capturing debug data in the event
that some BMC interfaces become unresponsive.

Change-Id: I0dbf8879b0e1ca337535f8878d3775ab5d2c531d
Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
diff --git a/designs/bmc-service-failure-debug-and-recovery.md b/designs/bmc-service-failure-debug-and-recovery.md
new file mode 100644
index 0000000..90a83b6
--- /dev/null
+++ b/designs/bmc-service-failure-debug-and-recovery.md
@@ -0,0 +1,435 @@
+# BMC Service Failure Debug and Recovery
+
+Author:
+Andrew Jeffery <andrew@aj.id.au> @arj
+
+Primary Assignee:
+Andrew Jeffery <andrew@aj.id.au> @arj
+
+Created:
+6th May 2021
+
+## Problem Description
+
+The capability to debug critical failures of the BMC firmware is essential to
+meet the reliability and serviceability claims made for some platforms.
+
+A class of failure exists under which we can attempt debug data collection
+despite being unable to communicate with the BMC via standard protocols.
+
+This proposal argues for, and outlines, a software-driven mechanism for
+capturing debug data from and recovering a failed BMC.
+
+## Background and References
+
+By necessity, BMCs are not self-contained systems. BMCs exist to service the
+needs both of the host system, by providing in-band platform services such as
+thermal and power management, and of system operators, by providing
+out-of-band system management interfaces such as error reporting, platform
+telemetry and firmware management.
+
+As such, failures of BMC subsystems may impact external consumers.
+
+The BMC firmware stack is not trivial, in the sense that common implementations
+are usually domain-specific Linux distributions with complex or highly
+coupled relationships to platform subsystems.
+
+Complexity and coupling drive concern around the risk of critical failures in
+the BMC firmware. The BMC firmware design should provide for resilience and
+recovery in the face of well-defined error conditions, but the need to mitigate
+ill-defined error conditions and entry into unintended software states remains.
+
+The ability for a system to recover in the face of an error condition depends
+on its ability to detect the failure. Thus, error conditions can be assigned to
+various classes based on the ability to externally observe the error:
+
+1. Continued operation: The service detects the error and performs the actions
+   required to return to its operating state
+
+2. Graceful exit: The service detects an error it cannot recover from, but
+   gracefully cleans up its resources before exiting with an appropriate exit
+   status
+
+3. Crash: The service detects it is in an unintended software state and exits
+   immediately, failing to gracefully clean up its resources before exiting
+
+4. Unresponsive: The service fails to detect it cannot make progress and
+   continues to run but is unresponsive
+
+As the state transformations to enter the ill-defined or unintended software
+state are unanticipated, the actions required to gracefully return to an
+expected state are also not well defined. The general approaches to recover a
+system or service to a known state in the face of entering an unknown state
+are:
+
+1. Restart the affected service
+2. Restart the affected set of services
+3. Restart all services
+
+In the face of continued operation due to internal recovery, a service restart
+is unnecessary, while in the case of an unresponsive service the need to restart
+cannot be detected by service state alone. Implementation of resiliency by way
+of service restarts via a service manager is only possible in the face of a
+graceful exit or application crash. Handling of services that have entered an
+unresponsive state can only begin upon receiving external input.
+
+Like error conditions, services exposed by the BMC can be divided into several
+external interface classes:
+
+1. Providers of platform data
+2. Providers of platform data transports
+
+Examples of the first are applications that expose various platform sensors or
+provide data about the firmware itself. Failure of the first class of
+applications usually yields a system that can continue to operate in a reduced
+capacity.
+
+Examples of the second are the operating system itself and applications that
+implement IPMI, HTTPS (e.g. for Redfish), MCTP and PLDM. This second class also
+covers implementation-specific data transports such as D-Bus, which requires a
+broker service. Failure of a platform data transport may result in one or all
+external interfaces becoming unresponsive and be viewed as a critical failure
+of the BMC.
+
+Like error conditions and services, the BMC's external interfaces can be
+divided into several classes:
+
+1. Out-of-band interfaces: Remote, operator-driven platform management
+2. In-band interfaces: Local, host-firmware-driven platform management
+
+Failures of platform data transports generally leave out-of-band interfaces
+unresponsive to the point that the BMC cannot be recovered except via external
+means, usually by issuing a (disruptive) AC power cycle. On the other hand, if
+the host can detect the BMC is unresponsive on the in-band interface(s), an
+appropriate platform design can enable the host to reset the BMC without
+disrupting its own operation.
+
+### Analysis of eBMC Error State Management and Mitigation Mechanisms
+
+Assessing OpenBMC userspace with respect to the error classes outlined above,
+the system manages and mitigates error conditions as follows:
+
+| Condition           | Mechanism                                           |
+|---------------------|-----------------------------------------------------|
+| Continued operation | Application-specific error handling                 |
+| Graceful exit       | Application-specific error handling                 |
+| Crash               | Signal, unhandled exceptions, `assert()`, `abort()` |
+| Unresponsive        | None                                                |
+
+These mechanisms inform systemd (the service manager) of an event, which it
+handles according to the restart policy encoded in the unit file for the
+service.
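+
+As a minimal sketch, a drop-in like the following (the service name is
+hypothetical) asks systemd to restart a service after a crash or failed exit,
+while leaving an unresponsive hang undetected:
+
+```sh
+# Illustrative only: add a restart policy for a hypothetical service
+mkdir -p /etc/systemd/system/phosphor-example.service.d
+cat > /etc/systemd/system/phosphor-example.service.d/restart.conf <<'EOF'
+[Service]
+Restart=on-failure
+RestartSec=5
+EOF
+systemctl daemon-reload
+```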
+
+Assessing the OpenBMC operating system with respect to the error classes, it
+manages and mitigates error conditions as follows:
+
+| Condition           | Mechanism                              |
+|---------------------|----------------------------------------|
+| Continued operation | ramoops, ftrace, `printk()`            |
+| Graceful exit       | System reboot                          |
+| Crash               | kdump or ramoops                       |
+| Unresponsive        | `hardlockup_panic`, `softlockup_panic` |
+
+Crash conditions in the Linux kernel trigger panics, which are handled by kdump
+(though may be handled by ramoops until kdump support is integrated). Kernel
+lockup conditions can be configured to trigger panics, which in turn trigger
+either ramoops or kdump.
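+
+A minimal sketch of the lockup configuration referenced above, using the
+standard sysctls (availability depends on the kernel configuration):
+
+```sh
+# Escalate detected kernel lockups to a panic so ramoops/kdump can capture state
+sysctl -w kernel.softlockup_panic=1
+sysctl -w kernel.hardlockup_panic=1
+# Reboot shortly after the panic handler (and any dump capture) has run
+sysctl -w kernel.panic=10
+```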
+
+### Synthesis
+
+In the context of the information above, handling of application lock-up error
+conditions is not provided. For applications in the platform-data-provider
+class, a lock-up means the system continues to operate with reduced
+functionality. For applications in the platform-data-transport-provider class,
+a lock-up represents a critical failure of the firmware that must have
+accompanying debug data.
+
+## Requirements
+
+### Recovery Mechanisms
+
+The ability for external consumers to control the recovery behaviour of BMC
+services is usually coarse; nuanced handling is left to the BMC
+implementation. Where available, the options for external consumers tend to
+be, in ascending order of severity:
+
+| Severity | BMC Recovery Mechanism  | Used for                                                              |
+|----------|-------------------------|-----------------------------------------------------------------------|
+| 1        | Graceful reboot request | Normal circumstances or recovery from platform data provider failures |
+| 2        | Forceful reboot request | Recovery from unresponsive platform data transport providers          |
+| 3        | External hardware reset | Unresponsive operating system                                         |
+
+Of course it's not possible to issue these requests over interfaces that are
+unresponsive. A robust platform design should be capable of issuing all three
+restart requests over separate interfaces to minimise the impact of any one
+interface becoming unresponsive. Further, the more severe the reset type, the
+fewer dependencies should be in its execution path.
+
+Given the out-of-band path is often limited to just the network, it's not
+feasible for the BMC to provide any of the above in the event of some kind of
+network or relevant data transport failure. The considerations here are
+therefore limited to recovery of unresponsive in-band interfaces.
+
+The need to escalate above mechanism 1 should come with data that captures why
+it was necessary, i.e. dumps for the services that failed in the execution path
+of mechanism 1. However, by escalating straight to 3, the BMC will necessarily
+miss out on capturing a debug dump because there is no opportunity for software
+to intervene in the reset. Therefore, mechanism 2 should exist in the system
+design, and its implementation should capture any appropriate data needed to
+debug both the need to reboot and the inability to execute on mechanism 1.
+
+The need to escalate to 3 would indicate that the BMC's own mechanisms for
+detecting a kernel lockup have failed. Had they not failed, we would have
+ramoops or kdump data to analyse. As data cannot be captured with an escalation
+to mechanism 3, the need to invoke it will require its own specialised debug
+experience. Given this and the kernel's own lockup detection and data
+collection mechanism, support for 2 can be implemented in BMC userspace.
+
+Mechanism 1 is typically initiated by the usual in-band interfaces, either IPMI
+or PLDM. In order to avoid these in the implementation of mechanism 2, the host
+needs an interface to the BMC that is dedicated to the role of BMC recovery,
+with minimal dependencies on the BMC side for initiating the dump collection
+and reboot. At its core, all that is needed is the ability to trigger a BMC
+IRQ, which could be as simple as monitoring a GPIO.
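+
+Were a spare GPIO available, such a trigger could be as small as the following
+sketch (the chip and line offset are hypothetical):
+
+```sh
+# Illustrative only: block until the host drives the recovery line, then
+# invoke the crash path so the configured dump mechanism runs
+gpiomon --num-events=1 --rising-edge gpiochip0 7
+echo c > /proc/sysrq-trigger
+```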
+
+### Behavioural Requirements for Recovery Mechanism 2
+
+The system behaviour requirement for the mechanism is:
+
+1. The BMC executes collection of debug data and then reboots once it observes
+   a recovery message from the host
+
+It's desirable that:
+
+1. The host has some indication that the recovery process has been activated
+2. The host has some indication that a BMC reset has taken place
+
+It's necessary that:
+
+1. The host make use of a timeout to escalate to recovery mechanism 3 as it's
+   possible the BMC will be unresponsive to recovery mechanism 2
+
+### Analysis of BMC Recovery Mechanisms for Power10 Platforms
+
+The implementation of recovery mechanism 1 is already accounted for in the
+in-band protocols between the host and the BMC and so is considered resolved
+for the purpose of the discussion.
+
+To address recovery mechanism 3, the Power10 platform designs wire up a GPIO
+driven by the host to the BMC's EXTRST pin. If the host firmware detects that
+the BMC has become unresponsive to its escalating recovery requests, it can
+drive the hardware to forcefully reset the BMC.
+
+However, host-side GPIOs are in short supply, and we do not have a dedicated
+pin to implement recovery mechanism 2 in the platform designs.
+
+### Analysis of Implementation Methods on Power10 Platforms
+
+The implementation of recovery mechanism 2 is limited to using existing
+interfaces between the host and the BMC. These largely consist of:
+
+1. FSI
+2. LPC
+3. PCIe
+
+FSI is inappropriate because the host is the peripheral in its relationship
+with the BMC. If the BMC has become unresponsive, it is possible it's in a
+state where it would not accept FSI traffic (which it needs to drive in the
+first place) and we would need a mechanism architected into FSI for the BMC to
+recognise it is in a bad state. PCIe and LPC are preferable by comparison as
+the BMC is the peripheral in this relationship, with the host driving cycles
+into it over either interface. Comparatively, PCIe is more complex than LPC, so
+an LPC-based approach is preferred.
+
+The host already makes use of several LPC peripherals exposed from the BMC:
+
+1. Mapped LPC FW cycles
+2. iBT for IPMI
+3. The VUARTs for system and debug consoles
+4. A KCS device for a vendor-defined MCTP LPC binding
+
+The host could take advantage of any of the following LPC peripherals for
+implementing recovery mechanism 2:
+
+1. The SuperIO-based iLPC2AHB bridge
+2. The LPC mailbox
+3. An otherwise unused KCS device
+
+In ASPEED BMC SoCs prior to the AST2600 the LPC mailbox required configuration
+via the SuperIO device, which exposes the unrestricted iLPC2AHB backdoor into
+the BMC's physical address space. The iLPC2AHB capability could not be
+mitigated without disabling SuperIO support entirely, and so the ability to use
+the mailbox went with it. This security issue is resolved in the AST2600
+design, so the mailbox could be used in the Power10 platforms, but we have
+lower-complexity alternatives for generating an IRQ on the BMC. We could use
+the iLPC2AHB from the host to drive one of the watchdogs in the BMC to trigger
+a reset, but this exposes a stability risk due to the unrestricted power of the
+interface, let alone the security implications, and like the mailbox is more
+complex than the alternatives.
+
+This draws us towards the use of a KCS device, which is best aligned with the
+simple need of generating an IRQ on the BMC. The AST2600 has at least 4 KCS
+devices, of which one is already in use for IBM's vendor-defined MCTP LPC
+binding, leaving at least 3 from which to choose.
+
+## Proposed Design
+
+The proposed design is for a simple daemon started at BMC boot to invoke the
+desired crash dump handler according to the system policy upon receiving the
+external signal. The implementation should have no IPC dependencies or
+interactions with `init`, as the reason for invoking the recovery mechanism is
+unknown and any of these interfaces might be unresponsive.
+
+A trivial implementation of the daemon is:
+
+```sh
+# Block until the host writes a byte to the trigger device at $path
+dd if="$path" bs=1 count=1
+# Capture debug data and reboot by forcing a kernel panic via sysrq
+echo c > /proc/sysrq-trigger
+```
+
+For systems with kdump enabled, this will result in a kernel crash dump
+collection and the BMC being rebooted.
+
+A more elegant implementation might be to invoke `kexec` directly, but this
+requires that kexec support is already available on the platform.
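+
+As a sketch, assuming kexec-tools is present in the BMC image and that the
+kernel and initrd paths below stand in for the platform's real artefacts, the
+two approaches look like:
+
+```sh
+# kdump-style: stage a crash-capture kernel at boot so that the sysrq-triggered
+# panic above lands in it (requires crashkernel= memory to be reserved)
+kexec -p /boot/zImage-dump --initrd=/boot/initrd-dump --append="maxcpus=1"
+
+# Direct approach: load a fresh kernel and jump straight into it, bypassing
+# the panic path and a clean reboot
+kexec -l /boot/zImage --initrd=/boot/initrd --reuse-cmdline
+kexec -e
+```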
+
+Other activities in userspace might be feasible if it can be assumed that
+whatever failure has occurred will not prevent debug data collection, but no
+statement about this can be made in general.
+
+### An Idealised KCS-based Protocol for Power10 Platforms
+
+The proposed implementation provides for both the required and desired
+behaviours outlined in the requirements section above.
+
+The host and BMC protocol operates as follows, starting with the BMC
+application invoked during the boot process:
+
+1. Set the `Ready` bit in STR
+
+2. Wait for an `IBF` interrupt
+
+3. Read `IDR`. The hardware clears IBF as a result
+
+4. If the read value is 0x44 (`D` for "Debug") then execute the debug dump
+   collection process and reboot. Otherwise,
+
+5. Go to step 2.
+
+On the host:
+
+1. If the `Ready` bit in STR is clear, escalate to recovery mechanism 3.
+   Otherwise,
+
+2. If the `IBF` bit in STR is set, escalate to recovery mechanism 3. Otherwise,
+
+3. Start an escalation timer
+
+4. Write 0x44 (`D` for "Debug") to the Input Data Register (IDR). The hardware
+   sets IBF as a result
+
+5. If `IBF` clears before expiry, restart the escalation timer
+
+6. If an STR read generates an LPC SYNC No Response abort, or `Ready` clears
+   before expiry, restart the escalation timer
+
+7. If `Ready` becomes set before expiry, disarm the escalation timer. Recovery
+   is complete. Otherwise,
+
+8. Escalate to recovery mechanism 3 if the escalation timer expires at any
+   point
+
+A SerIRQ is unnecessary for correct operation of the protocol. The BMC-side
+implementation is not required to emit one and the host implementation must
+behave correctly without one. Recovery is only necessary if other paths have
+failed, so STR can be read by the host when it decides recovery is required,
+and then read by time-based polling thereafter.
+
+The host must be prepared to handle LPC SYNC errors when accessing the KCS
+device IO addresses, particularly "No Response" aborts. It is not guaranteed
+that the KCS device will remain available during BMC resets.
+
+As STR is polled by the host it's not necessary for the BMC to write to ODR.
+The protocol only requires the host to write to IDR and periodically poll STR
+for changes to IBF and Ready state. This removes bi-directional dependencies.
+
+The uni-directional writes and the lack of SerIRQ reduce the features required
+for correct operation of the protocol and thus the surface area for failure of
+the recovery protocol.
+
+The layout of the KCS Status Register (STR) is as follows:
+
+| Bit | Owner    | Definition               |
+|-----|----------|--------------------------|
+| 7   | Software |                          |
+| 6   | Software |                          |
+| 5   | Software |                          |
+| 4   | Software | Ready                    |
+| 3   | Hardware | Command / Data           |
+| 2   | Software |                          |
+| 1   | Hardware | Input Buffer Full (IBF)  |
+| 0   | Hardware | Output Buffer Full (OBF) |
+
+### A Real-World Implementation of the KCS Protocol for Power10 Platforms
+
+Implementing the protocol described above in userspace is challenging due to
+available kernel interfaces[1], and implementing the behaviour in the kernel
+falls afoul of the de facto "mechanism, not policy" rule of kernel support.
+
+Realistically, on the host side the only requirements are the use of a timer
+and writing the appropriate value to the Input Data Register (IDR). All the
+proposed status bits can be ignored. With this in mind, the BMC's
+implementation can be reduced to reading an appropriate value from IDR.
+Reducing requirements on the BMC's behaviour in this way allows the use of the
+`serio_raw` driver (which has the restriction that userspace can't access the
+status value).
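+
+Under these constraints, a sketch of the reduced BMC-side daemon (the
+serio_raw device node path is illustrative) is:
+
+```sh
+while true; do
+    # Block until the host writes a byte to IDR; serio_raw exposes it as a
+    # byte stream on the raw device node
+    byte="$(dd if=/dev/serio_raw0 bs=1 count=1 2>/dev/null)"
+    # 'D' (0x44) is the protocol's "collect debug data and reboot" command
+    if [ "$byte" = "D" ]; then
+        echo c > /proc/sysrq-trigger
+    fi
+done
+```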
+
+[1] https://lore.kernel.org/lkml/37e75b07-a5c6-422f-84b3-54f2bea0b917@www.fastmail.com/
+
+### Prototype Implementation Supporting Power10 Platforms
+
+A concrete implementation of the proposal's userspace daemon is available on
+Github:
+
+https://github.com/amboar/debug-trigger/
+
+Deployment requires additional kernel support in the form of patches at [2].
+
+[2] https://github.com/amboar/linux/compare/2dbb5aeba6e55e2a97e150f8371ffc1cc4d18180...for/openbmc/kcs-raw
+
+## Alternatives Considered
+
+See the discussion in Background.
+
+## Impacts
+
+The proposal has some security implications. The mechanism provides an
+unauthenticated means for the host firmware to crash and/or reboot the BMC,
+which can itself become a concern for stability and availability. Use of this
+feature requires that the host firmware is trusted, that is, that the host and
+BMC firmware must be in the same trust domain. If a platform concept requires
+that the BMC and host firmware remain in disjoint trust domains then this
+feature must not be provided by the BMC.
+
+As the feature might provide surprising system behaviour, there is an impact on
+documentation for systems deploying this design: the mechanism must be
+documented so that a reboot of the BMC under these circumstances does not come
+as a surprise to operators.
+
+Developers are impacted in the sense that they may have access to better debug
+data than might otherwise be possible. There are no obvious developer-specific
+drawbacks.
+
+Due to simplicity being a design-point of the proposal, there are no
+significant API, performance or upgradability impacts.
+
+## Testing
+
+Generally, testing this feature requires complex interactions with
+host firmware and platform-specific mechanisms for triggering the reboot
+behaviour.
+
+For Power10 platforms this feature may be safely tested under QEMU by scripting
+the monitor to inject values on the appropriate KCS device. Implementing this
+for automated testing may need explicit support in CI.