enhance target-fail-monitoring for services
BMC systems are becoming more and more critical in todays data centers.
If a critical service within the BMC fails, it's paramount that
appropriate debug data be collected and the user of the system be
notified that the BMC is in an undefined state.
This new feature will allow a user to pass in a json file with systemd
service names that they wish this function to monitor. If a monitored
services goes into an error state (exhausted all retries and service has
been stopped) then the monitor service will create an error and collect
appropriate debug data.
Change-Id: I9ddc05387f23c80e490d72561397ab2850a05a3e
Signed-off-by: Andrew Geissler <geissonator@yahoo.com>
diff --git a/designs/target-fail-monitoring.md b/designs/target-fail-monitoring.md
index 786baef..f98224c 100644
--- a/designs/target-fail-monitoring.md
+++ b/designs/target-fail-monitoring.md
@@ -1,4 +1,4 @@
-# Monitoring and Logging OpenBMC Systemd Target Failures
+# Monitoring and Logging OpenBMC Systemd Target and Service Failures
 
 Author: Andrew Geissler
   < geissonator >
@@ -6,6 +6,7 @@
   Primary assignee: Andrew Geissler
 
   Created: June 18, 2019
+  Last Updated: Feb 21, 2022
 
 ## Problem Description
 
@@ -17,21 +18,32 @@
 [openbmc-systemd.md][1] has a good summary of systemd and the basics of how
 it is used within OpenBMC.
 
+At a high level, systemd is composed of units. For OpenBMC, the two key
+unit types are target and service.
+
 There are situations where OpenBMC systemd targets fail. They fail due to a
 target or service within them hitting some sort of error condition. In some
-cases, the service which caused the failure will log an error to
+cases, the unit which caused the failure will log an error to
 phosphor-logging but in some situations (segfault, unhandled exception,
 unhandled error path, ...), there will be no indication to the end user that
 something failed. It is critical that if a systemd target fails within OpenBMC,
 the user be notified.
 
+There are also scenarios where a system has successfully started all targets
+but a running service within that target fails. Some services are not all
+that critical, but something like fan-control or a power monitoring service,
+could be very critical. At a minimum, need to ensure the user of the system
+is informed that a critical service has failed. The solution proposed in
+this document does not preclude service owners from doing other recovery, this
+solution just ensures a bare minimum of reporting is done when a failure occurs.
+
 ## Background and References
 
 See the [phosphor-state-manager][2] repository for background information on
 state management and systemd within OpenBMC.
 
-systemd provides signals when targets complete and provides status on that
-completed target. See the JobNew()/JobRemoved() section in the [systemd dbus
+systemd provides signals when units complete and provides status on that
+completed unit. See the JobNew()/JobRemoved() section in the [systemd dbus
 API][3]. The six different results are:
 ```
 done, canceled, timeout, failed, dependency, skipped
@@ -44,8 +56,13 @@
 The proposal within this document is to monitor for these other results and
 log an appropriate error to phosphor-logging.
 
+A systemd unit that is a service will only enter into a `failed` state after
+all systemd defined retries have been executed. For OpenBMC systems, that
+involves 2 restarts within a 30 second window.
+
 ## Requirements
 
+### Systemd Target Units
 - Must be able to monitor any arbitrary systemd target and log a defined error
   based on the type of failure the target encountered
 - Must be configurable
@@ -75,10 +92,21 @@
     - Fully qualified target name must be input (i.e. no templated / wild card
       target support)
 
+### Systemd Service Units
+- Must be able to monitor any arbitrary systemd service within OpenBMC
+  - Service(s) to monitor must be configurable
+- Log an error indicating a service has failed (with service name in log)
+  - `xyz.openbmc_project.State.Error.CriticalServiceFailure`
+- Collect a BMC dump
+- Changes BMC state (CurrentBMCState) to indicate a degraded mode of the BMC
+- Report changed state externally via Redfish managers/bmc state
+
 ## Proposed Design
 
 Create a new standalone application in phosphor-state-manager which will load
-json file(s) on startup. The json file(s) would have the following format:
+json file(s) on startup.
+
+The json file(s) would have the following format for targets:
 ```
 {
     "targets" : [
@@ -96,6 +124,16 @@
 }
 ```
 
+The json (files) would have the following format for services:
+```
+{
+    "services" : [
+        "xyz.openbmc_project.biosconfig_manager.service",
+        "xyz.openbmc_project.Dump.Manager.service"
+        ]
+}
+```
+
 On startup, all input json files will be loaded and monitoring will be setup.
 
 This application will not register any interfaces on D-Bus but will subscribe
@@ -106,19 +144,30 @@
 ensure the cause of the failure can be determined. This requires the errors
 logged by this service be put into `phosphor-debug-errors/errors_watch.yaml`.
 
+For service failures, a dump will be collected by default because the BMC
+will be moved into a Quisced state.
+
+Note that services which are short running applications responsible for
+transitioning the system from one target to another (i.e. chassis power on/off,
+starting/stopping the host, ...) are critical services, but they are not the
+type of services to be monitored. Their failures will cause systemd targets to
+fail which will fall into the target monitoring piece of this design. The
+service monitoring is meant for long running services which stay running
+after a target has completed.
+
 ## Alternatives Considered
 
-Targets have an OnError directive. Could put the error logging logic within that
+Units have an OnError directive. Could put the error logging logic within that
 path but it introduces more complexity to OpenBMC systemd usage which is already
 quite complicated.
 
 Could implement this within the existing state manager applications since they
-are already monitoring some of these targets. The standalone application and
-generic capability to monitor any target was chosen as a better option.
+are already monitoring some of these units. The standalone application and
+generic capability to monitor any unit was chosen as a better option.
 
 ## Impacts
 
-A phosphor-logging event will be logged when one of the targets listed above
+A phosphor-logging event will be logged when one of the units listed above
 fails. This will be viewable by owners of the system. There may be situations
 where two logs are generated for the same issue. For example, if the power
 application detects an issue during power on and logs it to phosphor-logging and
@@ -132,7 +181,7 @@
 not result in phosphor-state-manager generating an error.
 
 ## Testing
-Need to cause all targets mentioned within this design to fail. They should fail
+Need to cause all units mentioned within this design to fail. They should fail
 for each of the reasons defined within this design and the error generated for
 each scenario should be verified.