design: error and event logging
Signed-off-by: Patrick Williams <patrick@stwcx.xyz>
Change-Id: I028d7ada80c1ba05ec6f9d12f8f6506202c926f6
diff --git a/designs/event-logging.md b/designs/event-logging.md
new file mode 100644
index 0000000..af85787
--- /dev/null
+++ b/designs/event-logging.md
@@ -0,0 +1,985 @@
+# Error and Event Logging
+
+Author: [Patrick Williams][patrick-email] `<stwcx>`
+
+[patrick-email]: mailto:patrick@stwcx.xyz
+
+Other contributors:
+
+Created: May 16, 2024
+
+## Problem Description
+
+There is currently not a consistent end-to-end error and event reporting design
+for the OpenBMC code stack. There are two different implementations, one
+primarily using phosphor-logging and one using rsyslog, both of which have gaps
+that a complete solution should address. This proposal is intended to be an
+end-to-end design handling both errors and tracing events which facilitate
+external management of the system in an automated and maintainable manner.
+
+## Background and References
+
+### Redfish LogEntry and Message Registry
+
+In Redfish, the [`LogEntry` schema][LogEntry] is used for a range of items that
+could be considered "logs", but one such use within OpenBMC is for an equivalent
+of the IPMI "System Event Log (SEL)".
+
+The IPMI SEL is the location where the BMC can collect errors and events,
+sometimes coming from other entities, such as the BIOS. Examples of these might
+be "DIMM-A0 encountered an uncorrectable ECC error" or "System boot successful".
+These SEL records are exposed as human readable strings, either natively by an
+OEM SEL design or by tools such as `ipmitool`, which are typically unique to
+each system or manufacturer, and could hypothetically change with a BMC or
+firmware update, and are thus difficult to create automated tooling around. Two
+different vendors might use different strings to represent a critical
+temperature threshold exceeded: ["temperature threshold exceeded"][HPE-Example]
+and ["Temperature #0x30 Upper Critical going high"][Oracle-Example]. There is
+also no mechanism with IPMI to ask the machine "what are all of the SELs you
+might create".
+
+In order to solve two aspects of this problem, listing of possible events and
+versioning, Redfish has Message Registries. A message registry is a versioned
+collection of all of the error events that a system could generate and hints as
+to how they might be parsed and displayed to a user. An [informative
+reference][Registry-Example] from the DMTF gives this example:
+
+```json
+{
+  "@odata.type": "#MessageRegistry.v1_0_0.MessageRegistry",
+  "Id": "Alert.1.0.0",
+  "RegistryPrefix": "Alert",
+  "RegistryVersion": "1.0.0",
+  "Messages": {
+    "LanDisconnect": {
+      "Description": "A LAN Disconnect on %1 was detected on system %2.",
+      "Message": "A LAN Disconnect on %1 was detected on system %2.",
+      "Severity": "Warning",
+      "NumberOfArgs": 2,
+      "Resolution": "None"
+    }
+  }
+}
+```
+
+This example defines an event, `Alert.1.0.LanDisconnect`, which can record the
+disconnect state of a network device and contains placeholders for the affected
+device and system. When this event occurs, there might be a `LogEntry` recorded
+containing something like:
+
+```json
+{
+  "Message": "A LAN Disconnect on EthernetInterface 1 was detected on system /redfish/v1/Systems/1.",
+  "MessageId": "Alert.1.0.LanDisconnect",
+  "MessageArgs": ["EthernetInterface 1", "/redfish/v1/Systems/1"]
+}
+```
+
+The `Message` contains a human readable string which was created by applying the
+`MessageArgs` to the placeholders from the `Message` field in the registry.
+System management software can rely on the message registry (referenced from the
+`MessageId` field in the `LogEntry`) and `MessageArgs` to avoid needing to
+perform string processing for reacting to the event.
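+
+As a rough illustration of that substitution (a minimal sketch, not the `bmcweb`
+implementation), the hypothetical helper `expandMessage` below fills `%1`
+through `%9` placeholders from an argument list. The function name and the
+single-digit placeholder limit are assumptions made for brevity.
+
+```cpp
+#include <cstddef>
+#include <string>
+#include <vector>
+
+// Hypothetical helper: replace %1..%9 in a registry `Message` template with
+// the matching entry of `MessageArgs` (1-based). Placeholders without a
+// corresponding argument are left untouched.
+std::string expandMessage(const std::string& tmpl,
+                          const std::vector<std::string>& args)
+{
+    std::string out;
+    for (std::size_t i = 0; i < tmpl.size(); ++i)
+    {
+        const bool isPlaceholder = tmpl[i] == '%' && i + 1 < tmpl.size() &&
+                                   tmpl[i + 1] >= '1' && tmpl[i + 1] <= '9';
+        if (!isPlaceholder)
+        {
+            out += tmpl[i];
+            continue;
+        }
+        const auto index = static_cast<std::size_t>(tmpl[i + 1] - '1');
+        if (index < args.size())
+        {
+            out += args[index];
+        }
+        else
+        {
+            out += tmpl[i];
+            out += tmpl[i + 1];
+        }
+        ++i; // skip the digit that followed '%'
+    }
+    return out;
+}
+```
+
+A management client would normally not need this step, since it can act on
+`MessageId` and `MessageArgs` directly; the sketch only shows how the `Message`
+string relates to the other two fields.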
+
+Within OpenBMC, there is currently a [limited design][existing-design] for this
+Redfish feature and it requires inserting specially formed Redfish-specific
+logging messages into any application that wants to record these events, tightly
+coupling all applications to the Redfish implementation. It has also been
+observed that these [strings][app-example], when used, are often out of date
+with the [message registry][openbmc-registry] advertised by `bmcweb`. Some
+maintainers have rejected adding new Redfish-specific logging messages to their
+applications.
+
+[LogEntry]:
+  https://github.com/openbmc/bmcweb/blob/de0c960c4262169ea92a4b852dd5ebbe3810bf00/redfish-core/schema/dmtf/json-schema/LogEntry.v1_16_0.json
+[HPE-Example]:
+  https://support.hpe.com/hpesc/public/docDisplay?docId=sd00002092en_us&docLocale=en_US&page=GUID-D7147C7F-2016-0901-06CE-000000000422.html
+[Oracle-Example]:
+  https://docs.oracle.com/cd/E19464-01/820-6850-11/IPMItool.html#50602039_63068
+[Registry-Example]:
+  https://www.dmtf.org/sites/default/files/Redfish%20School%20-%20Events_0.pdf
+[existing-design]:
+  https://github.com/openbmc/docs/blob/master/architecture/redfish-logging-in-bmcweb.md
+[app-example]:
+  https://github.com/openbmc/phosphor-post-code-manager/blob/f2da78deb3a105c7270f74d9d747c77f0feaae2c/src/post_code.cpp#L143
+[openbmc-registry]:
+  https://github.com/openbmc/bmcweb/blob/4ba5be51e3fcbeed49a6a312b4e6b2f1ea7447ba/redfish-core/include/registries/openbmc.json#L5
+
+### Existing phosphor-logging implementation
+
+**Note**: While the word 'exception' is used in this section, the existing (and
+proposed) types can be used by applications and execution contexts with
+exceptions disabled. They are 'exceptions' because they do inherit from
+`std::exception` and there is support in the `sdbusplus` bindings for them to be
+used in exception handling.
+
+The `sdbusplus` bindings have the capability to define new C++ exception types
+which can be thrown by a DBus server and turned into an error response to the
+client. `phosphor-logging` extended this to also add metadata associated with
+the log type. See the following example error definitions and usages.
+
+`sdbusplus` error binding definition (in
+`xyz/openbmc_project/Certs.errors.yaml`):
+
+```yaml
+- name: InvalidCertificate
+  description: Invalid certificate file.
+```
+
+`phosphor-logging` metadata definition (in
+`xyz/openbmc_project/Certs.metadata.yaml`):
+
+```yaml
+- name: InvalidCertificate
+  meta:
+    - str: "REASON=%s"
+      type: string
+```
+
+Application code reporting an error:
+
+```cpp
+elog<InvalidCertificate>(Reason("Invalid certificate file format"));
+// or
+report<InvalidCertificate>(Reason("Existing certificate file is corrupted"));
+```
+
+In this sample, an error named
+`xyz.openbmc_project.Certs.Error.InvalidCertificate` has been defined, which can
+be sent between applications as a DBus response. The `InvalidCertificate` is
+expected to have additional metadata `REASON` which is a string. The two APIs
+`elog` and `report` have slightly different behaviors: `elog` throws an
+exception which can either result in an error DBus result or be handled
+elsewhere in the application, while `report` sends the event directly to
+`phosphor-logging`'s daemon for recording. As a side-effect of both calls, the
+metadata is inserted into the `systemd` journal.
+
+When an error is sent to the `phosphor-logging` daemon, it will:
+
+1. Search back through the journal for recorded metadata associated with the
+   event (this is a relatively slow operation).
+2. Create an [`xyz.openbmc_project.Logging.Entry`][Logging-Entry] DBus object
+ with the associated data extracted from the journal.
+3. Persist a serialized version of the object.
+
+Within `bmcweb` there is support for translating
+`xyz.openbmc_project.Logging.Entry` objects advertised by `phosphor-logging`
+into Redfish `LogEntries`, but this support does not reference a Message
+Registry. This makes the events of limited utility for consumption by system
+management software, as it cannot know all of the event types and is left to
+perform (hand-coded) regular-expressions to extract any information from the
+`Message` field of the `LogEntry`. Furthermore, these regular-expressions are
+likely to become outdated over time as internal OpenBMC error reporting
+structure, metadata, or message strings evolve.
+
+[Logging-Entry]:
+ https://github.com/openbmc/phosphor-dbus-interfaces/blob/9012243e543abdc5851b7e878c17c991b2a2a8b7/yaml/xyz/openbmc_project/Logging/Entry.interface.yaml#L1
+
+### Issues with the Status Quo
+
+- There are two different implementations of error logging, neither of which is
+  both complete and fully accepted by maintainers. These implementations also do
+  not cover tracing events.
+
+- The `REDFISH_MESSAGE_ID` log approach leads to differences between the Redfish
+ Message Registry and the reporting application. It also requires every
+ application to be "Redfish aware" which limits decoupling between applications
+ and external management interfaces. This also leaves gaps for reporting errors
+ in different management interfaces, such as inband IPMI and PLDM. The approach
+  also does not provide compile-time assurance of appropriate metadata
+ collection, which can lead to producing code being out-of-date with the
+ message registry definitions.
+
+- The `phosphor-logging` approach does not provide compile-time assurance of
+ appropriate metadata collection and requires expensive daemon processing of
+ the `systemd` journal on each error report, which limits scalability.
+
+- The `sdbusplus` bindings for error reporting do not currently handle lossless
+ transmission of errors between DBus servers and clients.
+
+- Similar applications can produce different Redfish `LogEntry` records for the
+  same error scenario. This has been observed in sensor threshold exceeded events
+ between `dbus-sensors`, `phosphor-hwmon`, `phosphor-virtual-sensor`, and
+ `phosphor-health-monitor`. One cause of this is two different error reporting
+ approaches and disagreements amongst maintainers as to the preferred approach.
+
+## Requirements
+
+- Applications running on the BMC must be able to report errors and failures
+  which are persisted and available for external system management through
+  standards such as Redfish.
+
+  - These errors must be structured and versioned, and the complete set of
+    errors able to be created by the BMC should be available at build-time of a
+    BMC image.
+  - The set of errors, able to be created by the BMC, must be able to be
+    transformed into relevant data sets, such as Redfish Message Registries.
+    - For Redfish, the transformation must comply with the Redfish standard
+      requirements, such as conforming to semantic versioning expectations.
+    - For Redfish, the transformation should allow mapping internally defined
+      events to pre-existing Redfish Message Registries for broader
+      compatibility.
+    - For Redfish, the implementation must also support the EventService
+      mechanics for push-reporting.
+  - Errors reported by the BMC should contain sufficient information to allow
+    service of the system for these failures, either by humans or automation
+    (depending on the individual system requirements).
+
+- Applications running on the BMC should be able to report important tracing
+ events relevant to system management and/or debug, such as the system
+ successfully reaching a running state.
+
+  - All requirements relevant to errors are also applicable to tracing events.
+  - The implementation must have a mechanism for vendors to be able to disable
+    specific tracing events to conform to their own system design requirements.
+
+- Applications running on the BMC should be able to determine when a previously
+ reported error is no longer relevant and mark it as "resolved", while
+ maintaining the persistent record for future usages such as debug.
+
+- The BMC should provide a mechanism for managed entities within the server to
+ report their own errors and events. Examples of managed entities would be
+ firmware, such as the BIOS, and satellite management controllers.
+
+- The implementation on the BMC should scale to a minimum of
+  [10,000][error-discussion] errors and events without impacting the BMC or
+  managed system performance.
+
+- The implementation should provide a mechanism to allow OEM or vendor
+ extensions to the error and event definitions (and generated artifacts such as
+ the Redfish Message Registry) for usage in closed-source or non-upstreamed
+ code. These extensions must be clearly identified, in all interfaces, as
+ vendor-specific and not be tied to the OpenBMC project.
+
+- APIs to implement error and event reporting should have good ergonomics. These
+ APIs must provide compile-time identification, for applicable programming
+ languages, of call sites which do not conform to the BMC error and event
+ specifications.
+
+  - The generated error classes and APIs should not require exceptions but
+    should also integrate with the `sdbusplus` client and server bindings,
+    which do leverage exceptions.
+
+[error-discussion]:
+ https://discord.com/channels/775381525260664832/855566794994221117/867794201897992213
+
+## Proposed Design
+
+The proposed design has a few high-level design elements:
+
+- Consolidate the `sdbusplus` and `phosphor-logging` implementation of error
+ reporting; expand it to cover tracing events; improve the ergonomics of the
+ associated APIs and add compile-time checking of missing metadata.
+
+- Add APIs to `phosphor-logging` to enable daemons to easily look up their own
+ previously reported events (for marking as resolved).
+
+- Add to `phosphor-logging` a compile-time mechanism to disable recording of
+ specific tracing events for vendor-level customization.
+
+- Generate a Redfish Message Registry for all errors and events defined in
+  `phosphor-dbus-interfaces`, using binding generators from `sdbusplus`. Enhance
+  the `bmcweb` implementation of the `Logging.Entry` to `LogEntry`
+  transformation to cover the Redfish Message Registry and `phosphor-logging`
+  enhancements. Leverage the Redfish `LogEntry.DiagnosticData` field to provide
+  a Base64-encoded JSON representation of the entire `Logging.Entry` for
+  additional diagnostics [[does this need to be optional?]]. Add support to the
+  `bmcweb` EventService implementation to support `phosphor-logging`-hosted
+  events.
+
+### `sdbusplus`
+
+The `Foo.errors.yaml` content will be combined with the content formerly in the
+`Foo.metadata.yaml` files specified by `phosphor-logging`, and both will be
+expressed in a new file type, `Foo.events.yaml`. This `Foo.events.yaml` format
+will cover the current `error` and `metadata` information and augment it with
+the additional information necessary to generate external facing datasets, such
+as Redfish Message Registries. The current `Foo.errors.yaml` and
+`Foo.metadata.yaml` files will be deprecated as their usage is replaced by the
+new format.
+
+The `sdbusplus` library will be enhanced to provide the following:
+
+- JSON serialization and de-serialization of generated exception types with
+ their assigned metadata; assignment of the JSON serialization to the `message`
+ field of `sd_bus_error_set` calls when errors are returned from DBus server
+ calls.
+
+- A facility to register exception types, at library load time, with the
+ `sdbusplus` library for automatic conversion back to C++ exception types in
+ DBus clients.
+
+The binding generator(s) will be expanded to do the following:
+
+- Generate complete C++ exception types, with compile-time checking of missing
+ metadata and JSON serialization, for errors and events. Metadata can be of one
+ of the following types:
+
+ - size-type and signed integer
+ - floating-point number
+ - string
+ - DBus object path
+
+- Generate a format that `bmcweb` can use to create and populate a Redfish
+  Message Registry, and translate from `phosphor-logging` to Redfish `LogEntry`
+  for a set of errors and events.
+
+For general users of `sdbusplus` these changes should have no impact, except for
+the availability of new generated exception types and that specialized instances
+of `sdbusplus::exception::generated_exception` will become available in DBus
+clients.
+
+### `phosphor-dbus-interfaces`
+
+Refactoring will be done to migrate existing `Foo.metadata.yaml` and
+`Foo.errors.yaml` content to the `Foo.events.yaml` as migration is done by
+applications. Minor changes will take place to utilize the new binding
+generators from `sdbusplus`. A small library enhancement will be done to
+register all generated exception types with `sdbusplus`. Future contributors
+will be able to contribute new error and tracing event definitions.
+
+### `phosphor-logging`
+
+> TODO: Should a tracing event be a `Logging.Entry` with severity of
+> `Informational` or should they be a new type, such as `Logging.Event` and
+> managed separately. The `phosphor-logging` default `meson.options` have
+> `error_cap=200` and `error_info_cap=10`. If we increase the total number of
+> events allowed to 10K, the majority of them are likely going to be information
+> / tracing events.
+
+The `Logging.Entry` interface's `AdditionalData` property should change to
+`dict[string, variant[string,int64_t,size_t,object_path]]`.
+
+The `Logging.Create` interface will have a new method added:
+
+```yaml
+- name: CreateEntry
+  parameters:
+    - name: Message
+      type: string
+    - name: Severity
+      type: enum[Logging.Entry.Level]
+    - name: AdditionalData
+      type: dict[string, variant[string,int64_t,size_t,object_path]]
+    - name: Hint
+      type: string
+      default: ""
+  returns:
+    - name: Entry
+      type: object_path
+```
+
+The `Hint` parameter is used for daemons to be able to query for their
+previously recorded error, for marking as resolved. These strings need to be
+globally unique and are suggested to be of the format `"<service_name>:<key>"`.
+
+A `Logging.SearchHint` interface will be created, which will be recorded at the
+same object path as a `Logging.Entry` when the `Hint` parameter was not an empty
+string:
+
+```yaml
+- property: Hint
+  type: string
+```
+
+The `Logging.Manager` interface will be added with a single method:
+
+```yaml
+- name: FindEntry
+  parameters:
+    - name: Hint
+      type: string
+  returns:
+    - name: Entry
+      type: object_path
+  errors:
+    - xyz.openbmc_project.Common.ResourceNotFound
+```
+
+A `lg2::commit` API will be added to support the new `sdbusplus` generated
+exception types, calling the new `Logging.Create.CreateEntry` method proposed
+earlier. This new API will support `sdbusplus::bus_t` for synchronous DBus
+operations and both `sdbusplus::async::context_t` and
+`sdbusplus::asio::connection` for asynchronous DBus operations.
+
+There are outstanding performance concerns with the `phosphor-logging`
+implementation that may impact the ability for scaling to 10,000 event records.
+This issue is expected to be self-contained within `phosphor-logging`, except
+for potential future changes to the log-retrieval interfaces used by `bmcweb`.
+In order to decouple the transition to this design, by callers of the logging
+APIs, from the experimentation and improvements in `phosphor-logging`, we will
+add a compile option and Yocto `DISTRO_FEATURE` that can turn `lg2::commit`
+behavior into an `OPENBMC_MESSAGE_ID` record in the journal, along the same
+approach as the previous `REDFISH_MESSAGE_ID`, and corresponding `rsyslog`
+configuration and `bmcweb` support to use these directly. This will allow
+systems which knowingly scale to a large number of event records, using
+`rsyslog` mechanics, the same level of performance. One caveat of this support
+is that the hint and resolution behavior will not exist when that option is
+enabled.
+
+### `bmcweb`
+
+`bmcweb` already has support for build-time conversion from a Redfish Message
+Registry, codified in JSON, to header files it uses to serve the registry; this
+will be expanded to support Redfish Message Registries generated by `sdbusplus`.
+`bmcweb` will add a Meson option for additional message registries, provided
+from bitbake from `phosphor-dbus-interfaces` and vendor-specific event
+definitions as a path to a directory of Message Registry JSONs. Support will
+also be added for adding `phosphor-dbus-interfaces` as a Meson subproject for
+stand-alone testing.
+
+It is desirable for `sdbusplus` to generate a Redfish Message Registry directly,
+leveraging the existing scripts for integration with `bmcweb`. As part of this
+we would like to support mapping a `Logging.Entry` event to an existing
+standardized Redfish event (such as those in the Base registry). The generated
+information must contain the `Logging.Entry::Message` identifier, the
+`AdditionalData` to `MessageArgs` mapping, and the translation from the
+`Message` identifier to the Redfish Message ID (when the Message ID is not from
+"this" registry). In order to facilitate this, we will need to add OEM fields to
+the Redfish Message Registry JSON, which are only used by the `bmcweb`
+processing scripts, to generate the information necessary for this additional
+mapping.
+
+The `xyz.openbmc_project.Logging.Entry` to `LogEntry` conversion needs to be
+enhanced, to utilize these Message Registries, in four ways:
+
+1. A Base64-encoded JSON representation of the `Logging.Entry` will be assigned
+ to the `DiagnosticData` property.
+
+2. If the `Logging.Entry::Message` contains an identifier corresponding to a
+ Registry entry, the `MessageId` property will be set to the corresponding
+ Redfish Message ID. Otherwise, the `Logging.Entry::Message` will be used
+ directly with no further transformation (as is done today).
+
+3. If the `Logging.Entry::Message` contains an identifier corresponding to a
+ Registry entry, the `MessageArgs` property will be filled in by obtaining the
+ corresponding values from the `AdditionalData` dictionary and the `Message`
+ field will be generated from combining these values with the `Message` string
+ from the Registry.
+
+4. A mechanism should be implemented to translate DBus `object_path` references
+ to Redfish Resource URIs. When an `object_path` cannot be translated,
+ `bmcweb` will use a prefix such as `object_path:` in the `MessageArgs` value.
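+
+Steps 3 and 4 can be sketched as follows. This is an illustration under stated
+assumptions rather than `bmcweb` code: the `ObjectPath` wrapper, the `UriMap`
+lookup table, and the `toArg`/`buildMessageArgs` helpers are all hypothetical,
+and the list of primary field names is assumed to come from the generated
+registry data.
+
+```cpp
+#include <cstdint>
+#include <map>
+#include <string>
+#include <variant>
+#include <vector>
+
+// Hypothetical stand-in for a DBus object path value.
+struct ObjectPath
+{
+    std::string str;
+};
+
+using Value =
+    std::variant<std::string, std::int64_t, std::uint64_t, ObjectPath>;
+using AdditionalData = std::map<std::string, Value>;
+// Hypothetical table of known object_path -> Redfish URI translations.
+using UriMap = std::map<std::string, std::string>;
+
+// Render one AdditionalData value as a MessageArgs string. Object paths with
+// no known URI keep an "object_path:" prefix, as described in step 4.
+std::string toArg(const Value& v, const UriMap& uris)
+{
+    if (const auto* s = std::get_if<std::string>(&v))
+    {
+        return *s;
+    }
+    if (const auto* i = std::get_if<std::int64_t>(&v))
+    {
+        return std::to_string(*i);
+    }
+    if (const auto* u = std::get_if<std::uint64_t>(&v))
+    {
+        return std::to_string(*u);
+    }
+    const std::string& path = std::get<ObjectPath>(v).str;
+    const auto it = uris.find(path);
+    return it != uris.end() ? it->second : "object_path:" + path;
+}
+
+// Collect MessageArgs in the order given by the primary metadata names.
+// A missing field yields an empty string in this sketch.
+std::vector<std::string> buildMessageArgs(
+    const AdditionalData& data, const std::vector<std::string>& primary,
+    const UriMap& uris)
+{
+    std::vector<std::string> args;
+    for (const auto& name : primary)
+    {
+        const auto it = data.find(name);
+        args.push_back(it != data.end() ? toArg(it->second, uris)
+                                        : std::string{});
+    }
+    return args;
+}
+```
+
+Fields that are not listed as primary (such as `ERRNO` in the later YAML
+example) would not appear in `MessageArgs` and would only be visible through
+`DiagnosticData`.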
+
+The implementation of `EventService` should be enhanced to support
+`phosphor-logging` hosted events. The implementation of `LogService` should be
+enhanced to support log paging for `phosphor-logging` hosted events.
+
+### `phosphor-sel-logger`
+
+The `phosphor-sel-logger` has a meson option `send-to-logger` which toggles
+between using `phosphor-logging` or the [`REDFISH_MESSAGE_ID`
+mechanism][existing-design]. The `phosphor-logging`-utilizing paths will be
+updated to utilize `phosphor-dbus-interfaces` specified errors and events.
+
+### YAML format
+
+Consider an example file in `phosphor-dbus-interfaces` as
+`yaml/xyz/openbmc_project/Software/Update.events.yaml` with hypothetical errors
+and events:
+
+```yaml
+version: 1.3.1
+
+errors:
+  - name: UpdateFailure
+    severity: critical
+    metadata:
+      - name: TARGET
+        type: string
+        primary: true
+      - name: ERRNO
+        type: int64
+      - name: CALLOUT_HARDWARE
+        type: object_path
+        primary: true
+    en:
+      description: While updating the firmware on a device, the update failed.
+      message: A failure occurred updating {TARGET} on {CALLOUT_HARDWARE}.
+      resolution: Retry update.
+
+  - name: BMCUpdateFailure
+    severity: critical
+    deprecated: 1.0.0
+    en:
+      description: Failed to update the BMC
+    redfish-mapping: OpenBMC.FirmwareUpdateFailed
+
+events:
+  - name: UpdateProgress
+    metadata:
+      - name: TARGET
+        type: string
+        primary: true
+      - name: COMPLETION
+        type: double
+        primary: true
+    en:
+      description: An update is in progress and has reached a checkpoint.
+      message: Updating of {TARGET} is {COMPLETION}% complete.
+```
+
+Each `foo.events.yaml` file would be used to generate both the C++ classes (via
+`sdbusplus`) for exception handling and event reporting, as well as a versioned
+Redfish Message Registry for the errors and events. The YAML schema is as
+follows:
+
+```yaml
+$id: https://openbmc-project.xyz/sdbusplus/events.schema.yaml
+$schema: https://json-schema.org/draft/2020-12/schema
+title: Event and error definitions
+type: object
+$defs:
+  event:
+    type: array
+    items:
+      type: object
+      properties:
+        name:
+          type: string
+          description:
+            An identifier for the event in UpperCamelCase; used as the class and
+            Redfish Message ID.
+        en:
+          type: object
+          description: The details for English.
+          properties:
+            description:
+              type: string
+              description:
+                A developer-applicable description of the error reported. These
+                form the "description" of the Redfish message.
+            message:
+              type: string
+              description:
+                The end-user message, including placeholders for arguments.
+            resolution:
+              type: string
+              description: The end-user resolution.
+        severity:
+          enum:
+            - emergency
+            - alert
+            - critical
+            - error
+            - warning
+            - notice
+            - informational
+            - debug
+          description:
+            The `xyz.openbmc_project.Logging.Entry.Level` value for this
+            error. Only applicable for 'errors'.
+        redfish-mapping:
+          type: string
+          description:
+            Used when a `sdbusplus` event should map to a specific Redfish
+            Message rather than a generated one. This is useful when an internal
+            error has an analog in a standardized registry.
+        deprecated:
+          type: string
+          pattern: "^[0-9]+\\.[0-9]+\\.[0-9]+$"
+          description:
+            Indicates that the event is now deprecated and should not be created
+            by any OpenBMC software, but is required to still exist for
+            generation in the Redfish Message Registry. The version listed here
+            should be the first version where the error is no longer used.
+        metadata:
+          type: array
+          items:
+            type: object
+            properties:
+              name:
+                type: string
+                description: The name of the metadata field.
+              type:
+                enum:
+                  - string
+                  - size
+                  - int64
+                  - uint64
+                  - double
+                  - object_path
+                description: The type of the metadata field.
+              primary:
+                type: boolean
+                description:
+                  Set to true when the metadata field is expected to be part of
+                  the Redfish `MessageArgs` (and not only in the extended
+                  `DiagnosticData`).
+properties:
+  version:
+    type: string
+    pattern: "^[0-9]+\\.[0-9]+\\.[0-9]+$"
+    description:
+      The version of the file, which will be used as the Redfish Message
+      Registry version.
+  errors:
+    $ref: "#/$defs/event"
+  events:
+    $ref: "#/$defs/event"
+```
+
+The above example YAML would generate C++ classes similar to:
+
+```cpp
+namespace sdbusplus::errors::xyz::openbmc_project::software::update
+{
+
+class UpdateFailure
+{
+  public:
+    template <typename... Args>
+    UpdateFailure(Args&&... args);
+};
+
+}
+
+namespace sdbusplus::events::xyz::openbmc_project::software::update
+{
+
+class UpdateProgress
+{
+  public:
+    template <typename... Args>
+    UpdateProgress(Args&&... args);
+};
+
+}
+```
+
+The constructors here are variadic templates because the generated constructor
+implementation will provide compile-time assurance that all of the metadata
+fields have been populated (in any order). To raise an `UpdateFailure` a
+developer might do something like:
+
+```cpp
+// Immediately report the event:
+lg2::commit(UpdateFailure("TARGET", "BMC Flash A", "ERRNO", rc, "CALLOUT_HARDWARE", bmc_object_path));
+// or send it in a dbus response (when using sdbusplus generated binding):
+throw UpdateFailure("TARGET", "BMC Flash A", "ERRNO", rc, "CALLOUT_HARDWARE", bmc_object_path);
+```
+
+If one of the fields, such as `ERRNO`, is omitted, a compile failure will be
+raised indicating the first missing field.
+
+### Versioning Policy
+
+Assume the version follows semantic versioning `MAJOR.MINOR.PATCH` convention.
+
+- Adjusting a description or message should result in a `PATCH` increment.
+- Adding a new error or event, or adding metadata to an existing error or event,
+ should result in a `MINOR` increment.
+- Deprecating an error or event should result in a `MAJOR` increment.
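+
+The three rules can be expressed as a small function. This is only a sketch:
+the `Change` and `Version` types and the `bump` helper are hypothetical, and
+resetting the lower fields to zero on a `MINOR` or `MAJOR` increment is the
+usual semantic versioning convention rather than something stated in this
+policy.
+
+```cpp
+#include <string>
+
+// Kinds of change named in the policy above.
+enum class Change
+{
+    Wording,     // description or message adjusted
+    Addition,    // new error/event, or new metadata on an existing one
+    Deprecation, // an error or event was deprecated
+};
+
+struct Version
+{
+    unsigned majorPart;
+    unsigned minorPart;
+    unsigned patchPart;
+};
+
+// Return the version a file should carry after the given kind of change.
+Version bump(Version v, Change c)
+{
+    switch (c)
+    {
+        case Change::Wording:
+            return {v.majorPart, v.minorPart, v.patchPart + 1};
+        case Change::Addition:
+            return {v.majorPart, v.minorPart + 1, 0};
+        case Change::Deprecation:
+            return {v.majorPart + 1, 0, 0};
+    }
+    return v;
+}
+
+std::string toString(const Version& v)
+{
+    return std::to_string(v.majorPart) + "." + std::to_string(v.minorPart) +
+           "." + std::to_string(v.patchPart);
+}
+```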
+
+There is [guidance on maintenance][registry-guidance] of the OpenBMC Message
+Registry. We will incorporate that guidance into the equivalent
+`phosphor-dbus-interfaces` policy.
+
+[registry-guidance]:
+ https://github.com/openbmc/bmcweb/blob/master/redfish-core/include/registries/openbmc_message_registry.readmefirst.md
+
+### Generated Redfish Message Registry
+
+[DSP0266][dsp0266], the Redfish specification, gives requirements for Redfish
+Message Registries and dictates guidelines for identifiers.
+
+The hypothetical events defined above would create a message registry similar
+to:
+
+```json
+{
+  "Id": "OpenBMC_Base_Xyz_OpenbmcProject_Software_Update.1.3.1",
+  "Language": "en",
+  "Messages": {
+    "UpdateFailure": {
+      "Description": "While updating the firmware on a device, the update failed.",
+      "Message": "A failure occurred updating %1 on %2.",
+      "Resolution": "Retry update.",
+      "NumberOfArgs": 2,
+      "ParamTypes": ["string", "string"],
+      "Severity": "Critical"
+    },
+    "UpdateProgress": {
+      "Description": "An update is in progress and has reached a checkpoint.",
+      "Message": "Updating of %1 is %2% complete.",
+      "Resolution": "None",
+      "NumberOfArgs": 2,
+      "ParamTypes": ["string", "number"],
+      "Severity": "OK"
+    }
+  }
+}
+```
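+
+One step in producing such a registry is rewriting the named `{FIELD}`
+placeholders from the YAML `message` into positional `%N` placeholders. The
+sketch below shows one way this could work; the helper name
+`toRegistryMessage` is hypothetical, and numbering arguments by the order in
+which `primary` metadata fields are listed is an assumption that happens to
+match the example above.
+
+```cpp
+#include <cstddef>
+#include <string>
+#include <vector>
+
+// Hypothetical helper: replace each "{NAME}" in a YAML message with "%N",
+// where N is the 1-based position of NAME in the list of primary fields.
+std::string toRegistryMessage(const std::string& message,
+                              const std::vector<std::string>& primary)
+{
+    std::string out = message;
+    for (std::size_t i = 0; i < primary.size(); ++i)
+    {
+        const std::string key = "{" + primary[i] + "}";
+        const std::string arg = "%" + std::to_string(i + 1);
+        for (auto pos = out.find(key); pos != std::string::npos;
+             pos = out.find(key, pos + arg.size()))
+        {
+            out.replace(pos, key.size(), arg);
+        }
+    }
+    return out;
+}
+```
+
+With the primary fields `TARGET` and `CALLOUT_HARDWARE`, the `UpdateFailure`
+message becomes `A failure occurred updating %1 on %2.`, and `NumberOfArgs`
+would be the number of primary fields.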
+
+The prefix `OpenBMC_Base` shall be exclusively reserved for use by events from
+`phosphor-logging`. Events defined in other repositories will be expected to use
+some other prefix. Vendor-defined repositories should use a vendor-owned prefix
+as directed by [DSP0266][dsp0266].
+
+[dsp0266]:
+ https://www.dmtf.org/sites/default/files/standards/documents/DSP0266_1.20.0.pdf
+
+### Vendor implications
+
+As specified above, vendors must use their own identifiers in order to conform
+with the Redfish specification (see [DSP0266][dsp0266] for requirements on
+identifier naming). The `sdbusplus` (and `phosphor-logging` and `bmcweb`)
+implementation(s) will enable vendors to create their own events for downstream
+code and Registries for integration with Redfish, by creating downstream
+repositories of error definitions. Vendors are responsible for ensuring their
+own versioning and identifiers conform to the expectations in the [Redfish
+specification][dsp0266].
+
+One potential bad behavior on the part of vendors would be forking and modifying
+`phosphor-dbus-interfaces` defined events. Vendors must not add their own events
+to `phosphor-dbus-interfaces` in downstream implementations, because their
+implementation would then advertise a message as part of an OpenBMC-owned
+Registry that does not actually contain it. Instead, they should add such events
+to their own repositories with a separate identifier. Similarly, if a vendor
+were to _backport_ upstream changes into their fork, they would need to ensure
+that the `foo.events.yaml` file for that version matches identically with the
+upstream implementation.
+
+## Alternatives Considered
+
+Many alternatives have been explored and referenced through earlier work. Within
+this proposal there are many minor alternatives that have been assessed.
+
+### Exception inheritance
+
+The original `phosphor-logging` error descriptions allowed inheritance between
+two errors. This is not supported by the proposal for two reasons:
+
+- This introduces complexity in the Redfish Message Registry versioning because
+ a change in one file should induce version changes in all dependent files.
+
+- It makes it difficult for a developer to clearly identify all of the fields
+ they are expected to populate without traversing multiple files.
+
+### sdbusplus Exception APIs
+
+There are a few possible syntaxes I came up with for constructing the generated
+exception types. It is important that these have good ergonomics, are easy to
+understand, and can provide compile-time awareness of missing metadata fields.
+
+```cpp
+ using Example = sdbusplus::error::xyz::openbmc_project::Example;
+
+ // 1)
+ throw Example().fru("Motherboard").value(42);
+
+ // 2)
+ throw Example(Example::fru_{}, "Motherboard", Example::value_{}, 42);
+
+ // 3)
+ throw Example("FRU", "Motherboard", "VALUE", 42);
+
+ // 4)
+ throw Example([](auto e) { return e.fru("Motherboard").value(42); });
+
+ // 5)
+ throw Example({.fru = "Motherboard", .value = 42});
+```
+
+**Note**: These examples are all shown using `throw` syntax, but could also be
+saved in local variables, returned from functions, or immediately passed to
+`lg2::commit`.
+
+1. This would be my preference for ergonomics and clarity, as it would allow
+ LSP-enabled editors to give completions for the metadata fields.
+ Unfortunately, there is no mechanism in C++ to define a type which can be
+ constructed but not thrown, which means we cannot get compile-time checking
+ of all metadata fields.
+
+2. This syntax uses tag-dispatch to enable compile-time checking of all
+ metadata fields and potential LSP-completion of the tag-types, but is more
+ verbose than option (3).
+
+3. This syntax is less verbose than (2) and follows conventions already used in
+ `phosphor-logging`'s `lg2` API, but does not allow LSP-completion of the
+ metadata tags.
+
+4. This syntax is similar to option (1) but uses an indirection of a lambda to
+ enable compile-time checking that all metadata fields have been populated by
+ the lambda. The LSP-completion is likely not as strong as option (1), due to
+ the use of `auto`, and the lambda necessity will likely be a hang-up for
+ unfamiliar developers.
+
+5. This syntax has similar characteristics to option (1) and likewise does not
+ provide compile-time confirmation that all fields have been populated.
+
+The proposal therefore suggests option (3) is most suitable.
+
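As a rough illustration of how option (3) could keep compile-time checking, the
sketch below assumes each key literal is accepted through a small helper type
whose `consteval` constructor (C++20) rejects any other string. The names
`sketch::Key`, `FruName`, `ValueName`, `SKETCH_CONSTEVAL`, and `throwAndCatch`,
as well as the hand-written `Example` class, are hypothetical stand-ins for
illustration and are not the generated `sdbusplus` types.

```cpp
#include <cassert>
#include <cstdint>
#include <exception>
#include <string>
#include <string_view>
#include <utility>

// `consteval` needs C++20. On older standards this sketch falls back to
// `constexpr`, which turns the key check into a run-time check.
#if defined(__cpp_consteval)
#define SKETCH_CONSTEVAL consteval
#else
#define SKETCH_CONSTEVAL constexpr
#endif

namespace sketch
{

// Each metadata field gets a small type carrying its expected key name.
struct FruName
{
    static constexpr std::string_view value{"FRU"};
};
struct ValueName
{
    static constexpr std::string_view value{"VALUE"};
};

// A key parameter that only accepts the literal named by `Name`.
template <typename Name>
struct Key
{
    SKETCH_CONSTEVAL Key(const char* name)
    {
        if (std::string_view{name} != Name::value)
        {
            // Reaching a throw during constant evaluation is a compile error.
            throw "unexpected metadata key";
        }
    }
};

// Hypothetical stand-in for a generated event type (not the sdbusplus API).
class Example : public std::exception
{
  public:
    // One (key, value) pair per metadata field: omitting a pair fails
    // overload resolution, and a misspelled key fails the Key check.
    Example(Key<FruName>, std::string fru, Key<ValueName>, std::int64_t value) :
        fru_(std::move(fru)), value_(value)
    {}

    const char* what() const noexcept override
    {
        return "xyz.openbmc_project.Example";
    }

    const std::string& fru() const noexcept { return fru_; }
    std::int64_t value() const noexcept { return value_; }

  private:
    std::string fru_;
    std::int64_t value_;
};

} // namespace sketch

// Throws and catches the sketch event, returning the FRU it carried.
inline std::string throwAndCatch()
{
    try
    {
        throw sketch::Example("FRU", "Motherboard", "VALUE", 42);
    }
    catch (const sketch::Example& e)
    {
        assert(e.value() == 42);
        return e.fru();
    }
}
```

With a C++20 compiler, writing `sketch::Example("FRUX", "Motherboard", "VALUE",
42)` or omitting the `"VALUE", 42` pair fails to compile, which is the property
this section asks for.
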
+### Redfish Translation Support
+
+The proposed YAML format allows future addition of translation but it is not
+enabled at this time. Future development could enable the Redfish Message
+Registry to be generated in multiple languages if the `message:language` exists
+for those languages.
+
+### Redfish Registry Versioning
+
+The Redfish Message Registries are required to be versioned and have 3 digit
+fields (i.e. `XX.YY.ZZ`), but only the first 2 are supposed to be used in the
+Message ID. Rather than using the manually specified version, we could take a
+few other approaches:
+
+- Use a date code (e.g. `2024.17.x`) representing the ISO 8601 week when the
+ registry was built.
+
+ - This does not cover vendors that may choose to branch for stabilization
+ purposes, so we can end up with two machines having the same
+ OpenBMC-versioned message registry with different content.
+
+- Use the most recent `openbmc/openbmc` tag as the version.
+
+ - This does not cover vendors that build off HEAD and may deploy multiple
+ images between two OpenBMC releases.
+
+- Generate the version based on the git-history.
+
+ - This requires `phosphor-dbus-interfaces` to be built from a git repository,
+ which may not always be true for Yocto source mirrors, and requires
+ non-trivial processing that continues to scale over time.
+
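The rule that only the first two version fields appear in the Message ID can be
sketched as follows. This assumes the usual Redfish `Prefix.Major.Minor.Key`
layout; `makeMessageId` is a hypothetical helper, not an existing `bmcweb` or
`phosphor-logging` function.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <string_view>

// Builds "<prefix>.<major>.<minor>.<key>" from a three-field registry
// version, dropping the final field.
inline std::string makeMessageId(std::string_view prefix,
                                 std::string_view version,
                                 std::string_view key)
{
    assert(!prefix.empty() && !key.empty());

    const auto firstDot = version.find('.');
    const auto secondDot = (firstDot == std::string_view::npos)
                               ? std::string_view::npos
                               : version.find('.', firstDot + 1);
    if (secondDot == std::string_view::npos)
    {
        throw std::invalid_argument("expected a version of the form XX.YY.ZZ");
    }

    std::string id(prefix);
    id += '.';
    id += version.substr(0, secondDot); // keep only "XX.YY"
    id += '.';
    id += key;
    return id;
}
```

For example, `makeMessageId("OpenBMC", "0.4.0", "FanInserted")` produces
`OpenBMC.0.4.FanInserted`.
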
+### Existing OpenBMC Redfish Registry
+
+There are currently 191 messages defined in the existing Redfish Message
+Registry at version `OpenBMC.0.4.0`. Of those, not a single one in the codebase
+is emitted with the correct version. 96 of those are only emitted by
+Intel-specific code that is not pulled into any upstreamed machine, 39 are
+emitted by potentially common code, and 56 are not even referenced in the
+codebase outside of the bmcweb registry. Of the 39 common messages, half have an
+equivalent in one of the standard registries that should be leveraged, and many
+of the others do not have attributes that would facilitate a multi-host
+configuration, so the registry at a minimum needs to be updated. No part of the
+current implementation has the capability to handle Redfish Resource URIs.
+
+The proposal therefore is to deprecate the existing registry and replace it with
+the new generated registries. For repositories that currently emit events in the
+existing format, we can maintain those call-sites for a time period of 1-2
+years.
+
+If this aspect of the proposal is rejected, the YAML format allows mapping from
+`phosphor-dbus-interfaces` defined events to the current `OpenBMC.0.4.0`
+registry `MessageIds`.
+
+Potentially common:
+
+- phosphor-post-code-manager
+ - BIOSPOSTCode (unique)
+- dbus-sensors
+ - ChassisIntrusionDetected (unique)
+ - ChassisIntrusionReset (unique)
+ - FanInserted
+ - FanRedundancyLost (unique)
+ - FanRedundancyRegained (unique)
+ - FanRemoved
+ - LanLost
+ - LanRegained
+ - PowerSupplyConfigurationError (unique)
+ - PowerSupplyConfigurationErrorRecovered (unique)
+ - PowerSupplyFailed
+ - PowerSupplyFailurePredicted (unique)
+ - PowerSupplyFanFailed
+ - PowerSupplyFanRecovered
+ - PowerSupplyPowerLost
+ - PowerSupplyPowerRestored
+ - PowerSupplyPredictedFailureRecovered (unique)
+ - PowerSupplyRecovered
+- phosphor-sel-logger
+ - IPMIWatchdog (unique)
+ - `SensorThreshold*` : 8 different events
+- phosphor-net-ipmid
+ - InvalidLoginAttempted (unique)
+- entity-manager
+ - InventoryAdded (unique)
+ - InventoryRemoved (unique)
+- estoraged
+ - ServiceStarted
+- x86-power-control
+ - NMIButtonPressed (unique)
+ - NMIDiagnosticInterrupt (unique)
+ - PowerButtonPressed (unique)
+ - PowerRestorePolicyApplied (unique)
+ - PowerSupplyPowerGoodFailed (unique)
+ - ResetButtonPressed (unique)
+ - SystemPowerGoodFailed (unique)
+
+Intel-only implementations:
+
+- intel-ipmi-oem
+ - ADDDCCorrectable
+ - BIOSPostERROR
+ - BIOSRecoveryComplete
+ - BIOSRecoveryStart
+ - FirmwareUpdateCompleted
+ - IntelUPILinkWidthReducedToHalf
+ - IntelUPILinkWidthReducedToQuarter
+ - LegacyPCIPERR
+ - LegacyPCISERR
+ - `ME*` : 29 different events
+ - `Memory*` : 9 different events
+ - MirroringRedundancyDegraded
+ - MirroringRedundancyFull
+ - `PCIeCorrectable*`, `PCIeFatal` : 29 different events
+ - SELEntryAdded
+ - SparingRedundancyDegraded
+- pfr-manager
+ - BIOSFirmwareRecoveryReason
+ - BIOSFirmwarePanicReason
+ - BMCFirmwarePanicReason
+ - BMCFirmwareRecoveryReason
+ - BMCFirmwareResiliencyError
+ - CPLDFirmwarePanicReason
+ - CPLDFirmwareResiliencyError
+ - FirmwareResiliencyError
+- host-error-monitor
+ - CPUError
+ - CPUMismatch
+ - CPUThermalTrip
+ - ComponentOverTemperature
+ - SsbThermalTrip
+ - VoltageRegulatorOverheated
+- s2600wf-misc
+ - DriveError
+ - InventoryAdded
+
+## Impacts
+
+- New APIs are defined for error and event logging. This will deprecate existing
+ `phosphor-logging` APIs, with a time to migrate, for error reporting.
+
+- The design should improve performance by eliminating the regular parsing of
+ the `systemd` journal. The design may decrease performance by allowing the
+ number of error and event logs to be dramatically increased, which has an
+ impact on file system utilization and the potential for DBus impacts on some
+ services such as `ObjectMapper`.
+
+- Backwards compatibility and documentation should be improved by the automatic
+ generation of the Redfish Message Registry corresponding to all error and
+ event reports.
+
+### Organizational
+
+- **Does this proposal require a new repository?**
+ - No
+- **Who will be the initial maintainer(s) of this repository?**
+ - N/A
+- **Which repositories are expected to be modified to execute this design?**
+ - `sdbusplus`
+ - `phosphor-dbus-interfaces`
+ - `phosphor-logging`
+ - `bmcweb`
+ - Any repository creating an error or event.
+
+## Testing
+
+- Unit tests will be written in `sdbusplus` and `phosphor-logging` for the error
+ and event generation, creation APIs, and to provide coverage on any changes to
+ the `Logging.Entry` object management.
+
+- Unit tests will be written for `bmcweb` for basic `Logging.Entry`
+ transformation and Message Registry generation.
+
+- Integration tests should be leveraged (and enhanced as necessary) from
+ `openbmc-test-automation` to cover the end-to-end error creation and Redfish
+ reporting.