Author: Patrick Williams <stwcx>
Other contributors:
Created: May 16, 2024
There is currently not a consistent end-to-end error and event reporting design for the OpenBMC code stack. There are two different implementations, one primarily using phosphor-logging and one using rsyslog, both of which have gaps that a complete solution should address. This proposal is intended to be an end-to-end design handling both errors and tracing events which facilitate external management of the system in an automated and maintainable manner.
In Redfish, the LogEntry
schema is used for a range of items that could be considered "logs", but one such use within OpenBMC is for an equivalent of the IPMI "System Event Log (SEL)".
The IPMI SEL is the location where the BMC can collect errors and events, sometimes coming from other entities, such as the BIOS. Examples of these might be "DIMM-A0 encountered an uncorrectable ECC error" or "System boot successful". These SEL records are exposed as human readable strings, either natively by a OEM SEL design or by tools such as ipmitool
, which are typically unique to each system or manufacturer, and could hypothethically change with a BMC or firmware update, and are thus difficult to create automated tooling around. Two different vendors might use different strings to represent a critical temperature threshold exceeded: "temperature threshold exceeded" and "Temperature #0x30 Upper Critical going high". There is also no mechanism with IPMI to ask the machine "what are all of the SELs you might create".
In order to solve two aspects of this problem, listing of possible events and versioning, Redfish has Message Registries. A message registry is a versioned collection of all of the error events that a system could generate and hints as to how they might be parsed and displayed to a user. An informative reference from the DMTF gives this example:
{ "@odata.type": "#MessageRegistry.v1_0_0.MessageRegistry", "Id": "Alert.1.0.0", "RegistryPrefix": "Alert", "RegistryVersion": "1.0.0", "Messages": { "LanDisconnect": { "Description": "A LAN Disconnect on %1 was detected on system %2.", "Message": "A LAN Disconnect on %1 was detected on system %2.", "Severity": "Warning", "NumberOfArgs": 2, "Resolution": "None" } } }
This example defines an event, Alert.1.0.LanDisconnect
, which can record the disconnect state of a network device and contains placeholders for the affected device and system. When this event occurs, there might be a LogEntry
recorded containing something like:
{ "Message": "A LAN Disconnnect on EthernetInterface 1 was detected on system /redfish/v1/Systems/1.", "MessageId": "Alert.1.0.LanDisconnect", "MessageArgs": ["EthernetInterface 1", "/redfish/v1/Systems/1"] }
The Message
contains a human readable string which was created by applying the MessageArgs
to the placeholders from the Message
field in the registry. System management software can rely on the message registry (referenced from the MessageId
field in the LogEntry
) and MessageArgs
to avoid needing to perform string processing for reacting to the event.
Within OpenBMC, there is currently a limited design for this Redfish feature and it requires inserting specially formed Redfish-specific logging messages into any application that wants to record these events, tightly coupling all applications to the Redfish implementation. It has also been observed that these strings, when used, are often out of date with the message registry advertised by bmcweb
. Some maintainers have rejected adding new Redfish-specific logging messages to their applications.
Note: While the word 'exception' is used in this section, the existing (and proposed) types can be used by applications and execution contexts with exceptions disabled. They are 'exceptions' because they do inherit from std::exception
and there is support in the sdbusplus
bindings for them to be used in exception handling.
The sdbusplus
bindings have the capability to define new C++ exception types which can be thrown by a DBus server and turned into an error response to the client. phosphor-logging
extended this to also add metadata associated to the log type. See the following example error definitions and usages.
sdbusplus
error binding definition (in xyz/openbmc_project/Certs.errors.yaml
):
- name: InvalidCertificate description: Invalid certificate file.
phosphor-logging
metadata definition (in xyz/openbmc_project/Certs.metadata.yaml
):
- name: InvalidCertificate meta: - str: "REASON=%s" type: string
Application code reporting an error:
elog<InvalidCertificate>(Reason("Invalid certificate file format")); // or report<InvalidCertificate>(Reason("Existing certificate file is corrupted"));
In this sample, an error named xyz.openbmc_project.Certs.Error.InvalidCertificate
has been defined, which can be sent between applications as a DBus response. The InvalidCertificate
is expected to have additional metadata REASON
which is a string. The two APIs elog
and report
have slightly different behaviors: elog
throws an exception which can either result in an error DBus result or be handled elsewhere in the application, while report
sends the event directly to phosphor-logging
's daemon for recording. As a side-effect of both calls, the metadata is inserted into the systemd
journal.
When an error is sent to the phosphor-logging
daemon, it will:
xyz.openbmc_project.Logging.Entry
DBus object with the associated data extracted from the journal.Within bmcweb
there is support for translating xyz.openbmc_project.Logging.Entry
objects advertised by phosphor-logging
into Redfish LogEntries
, but this support does not reference a Message Registry. This makes the events of limited utility for consumption by system management software, as it cannot know all of the event types and is left to perform (hand-coded) regular-expressions to extract any information from the Message
field of the LogEntry
. Furthermore, these regular-expressions are likely to become outdated over time as internal OpenBMC error reporting structure, metadata, or message strings evolve.
There are two different implementations of error logging, neither of which are both complete and fully accepted by maintainers. These implementations also do not cover tracing events.
The REDFISH_MESSAGE_ID
log approach leads to differences between the Redfish Message Registry and the reporting application. It also requires every application to be "Redfish aware" which limits decoupling between applications and external management interfaces. This also leaves gaps for reporting errors in different management interfaces, such as inband IPMI and PLDM. The approach also does not provide comple-time assurance of appropriate metadata collection, which can lead to producing code being out-of-date with the message registry definitions.
The phosphor-logging
approach does not provide compile-time assurance of appropriate metadata collection and requires expensive daemon processing of the systemd
journal on each error report, which limits scalability.
The sdbusplus
bindings for error reporting do not currently handle lossless transmission of errors between DBus servers and clients.
Similar applications can result in different Redfish LogEntry
for the same error scenario. This has been observed in sensor threshold exceeded events between dbus-sensors
, phosphor-hwmon
, phosphor-virtual-sensor
, and phosphor-health-monitor
. One cause of this is two different error reporting approaches and disagreements amongst maintainers as to the preferred approach.
Applications running on the BMC must be able to report errors and failure which are persisted and available for external system management through standards such as Redfish.
Applications running on the BMC should be able to report important tracing events relevant to system management and/or debug, such as the system successfully reaching a running state.
Applications running on the BMC should be able to determine when a previously reported error is no longer relevant and mark it as "resolved", while maintaining the persistent record for future usages such as debug.
The BMC should provide a mechanism for managed entities within the server to report their own errors and events. Examples of managed entities would be firmware, such as the BIOS, and satellite management controllers.
The implementation on the BMC should scale to a minimum of 10,000 error and events without impacting the BMC or managed system performance.
The implementation should provide a mechanism to allow OEM or vendor extensions to the error and event definitions (and generated artifacts such as the Redfish Message Registry) for usage in closed-source or non-upstreamed code. These extensions must be clearly identified, in all interfaces, as vendor-specific and not be tied to the OpenBMC project.
APIs to implement error and event reporting should have good ergonomics. These APIs must provide compile-time identification, for applicable programming languages, of call sites which do not conform to the BMC error and event specifications.
sdbusplus
client and server bindings, which do leverage exceptions.The proposed design has a few high-level design elements:
Consolidate the sdbusplus
and phosphor-logging
implementation of error reporting; expand it to cover tracing events; improve the ergonomics of the associated APIs and add compile-time checking of missing metadata.
Add APIs to phosphor-logging
to enable daemons to easily look up their own previously reported events (for marking as resolved).
Add to phosphor-logging
a compile-time mechanism to disable recording of specific tracing events for vendor-level customization.
Generate a Redfish Message Registry for all error and events defined in phosphor-dbus-interfaces
, using binding generators from sdbusplus
. Enhance bmcweb
implementation of the Logging.Entry
to LogEvent
transformation to cover the Redfish Message Registry and phosphor-logging
enhancements; Leverage the Redfish LogEntry.DiagnosticData
field to provide a Base64-encoded JSON representation of the entire Logging.Entry
for additional diagnostics [[does this need to be optional?]]. Add support to the bmcweb
EventService implementation to support phosphor-logging
-hosted events.
sdbusplus
The Foo.errors.yaml
content will be combined with the content formerly in the Foo.metadata.yaml
files specified by phosphor-logging
and specified by a new file type Foo.events.yaml
. This Foo.events.yaml
format will cover both the current error
and metadata
information as well as augment with additional information necessary to generate external facing datasets, such as Redfish Message Registries. The current Foo.errors.yaml
and Foo.metadata.yaml
files will be deprecated as their usage is replaced by the new format.
The sdbusplus
library will be enhanced to provide the following:
JSON serialization and de-serialization of generated exception types with their assigned metadata; assignment of the JSON serialization to the message
field of sd_bus_error_set
calls when errors are returned from DBus server calls.
A facility to register exception types, at library load time, with the sdbusplus
library for automatic conversion back to C++ exception types in DBus clients.
The binding generator(s) will be expanded to do the following:
Generate complete C++ exception types, with compile-time checking of missing metadata and JSON serialization, for errors and events. Metadata can be of one of the following types:
Generate a format that bmcweb
can use to create and populate a Redfish Message Registry, and translate from phosphor-logging
to Redfish LogEntry
for a set of errors and events
For general users of sdbusplus
these changes should have no impact, except for the availability of new generated exception types and that specialized instances of sdbusplus::exception::generated_exception
will become available in DBus clients.
phosphor-dbus-interfaces
Refactoring will be done to migrate existing Foo.metadata.yaml
and Foo.errors.yaml
content to the Foo.events.yaml
as migration is done by applications. Minor changes will take place to utilize the new binding generators from sdbusplus
. A small library enhancement will be done to register all generated exception types with sdbusplus
. Future contributors will be able to contribute new error and tracing event definitions.
phosphor-logging
TODO: Should a tracing event be a
Logging.Entry
with severity ofInformational
or should they be a new type, such asLogging.Event
and managed separately. Thephosphor-logging
defaultmeson.options
haveerror_cap=200
anderror_info_cap=10
. If we increase the total number of events allowed to 10K, the majority of them are likely going to be information / tracing events.
The Logging.Entry
interface's AdditionalData
property should change to dict[string, variant[string,int64_t,size_t,object_path]]
.
The Logging.Create
interface will have a new method added:
- name: CreateEntry parameters: - name: Message type: string - name: Severity type: enum[Logging.Entry.Level] - name: AdditionalData type: dict[string, variant[string,int64_t,size_t,object_path]] - name: Hint type: string default: "" returns: - name: Entry type: object_path
The Hint
parameter is used for daemons to be able to query for their previously recorded error, for marking as resolved. These strings need to be globally unique and are suggested to be of the format "<service_name>:<key>"
.
A Logging.SearchHint
interface will be created, which will be recorded at the same object path as a Logging.Entry
when the Hint
parameter was not an empty string:
- property: Hint type: string
The Logging.Manager
interface will be added with a single method:
- name: FindEntry parameters: - name: Hint type: String returns: - name: Entry type: object_path errors: - xyz.openbmc_project.Common.ResourceNotFound
A lg2::commit
API will be added to support the new sdbusplus
generated exception types, calling the new Logging.Create.CreateEntry
method proposed earlier. This new API will support sdbusplus::bus_t
for synchronous DBus operations and both sdbusplus::async::context_t
and sdbusplus::asio::connection
for asynchronous DBus operations.
There are outstanding performance concerns with the phosphor-logging
implementation that may impact the ability for scaling to 10,000 event records. This issue is expected to be self-contained within phosphor-logging
, except for potential future changes to the log-retrieval interfaces used by bmcweb
. In order to decouple the transition to this design, by callers of the logging APIs, from the experimentation and improvements in phosphor-logging
, we will add a compile option and Yocto DISTRO_FEATURE
that can turn lg2::commit
behavior into an OPENBMC_MESSAGE_ID
record in the journal, along the same approach as the previous REDFISH_MESSAGE_ID
, and corresponding rsyslog
configuration and bmcweb
support to use these directly. This will allow systems which knowingly scale to a large number of event records, using rsyslog
mechanics, the same level of performance. One caveat of this support is that the hint and resolution behavior will not exist when that option is enabled.
bmcweb
bmcweb
already has support for build-time conversion from a Redfish Message Registry, codified in JSON, to header files it uses to serve the registry; this will be expanded to support Redfish Message Registries generated by sdbusplus
. bmcweb
will add a Meson option for additional message registries, provided from bitbake from phosphor-dbus-interfaces
and vendor-specific event definitions as a path to a directory of Message Registry JSONs. Support will also be added for adding phosphor-dbus-interfaces
as a Meson subproject for stand-alone testing.
It is desirable for sdbusplus
to generate a Redfish Message Registry directly, leveraging the existing scripts for integration with bmcweb
. As part of this we would like to support mapping a Logging.Entry
event to an existing standardized Redfish event (such as those in the Base registry). The generated information must contain the Logging.Entry::Message
identifier, the AdditionalData
to MessageArgs
mapping, and the translation from the Message
identifier to the Redfish Message ID (when the Message ID is not from "this" registry). In order to facilitate this, we will need to add OEM fields to the Redfish Message Registry JSON, which are only used by the bmcweb
processing scripts, to generate the information necessary for this additional mapping.
The xyz.openbmc_project.Logging.Entry
to LogEvent
conversion needs to be enhanced, to utilize these Message Registries, in four ways:
A Base64-encoded JSON representation of the Logging.Entry
will be assigned to the DiagnosticData
property.
If the Logging.Entry::Message
contains an identifier corresponding to a Registry entry, the MessageId
property will be set to the corresponding Redfish Message ID. Otherwise, the Logging.Entry::Message
will be used directly with no further transformation (as is done today).
If the Logging.Entry::Message
contains an identifier corresponding to a Registry entry, the MessageArgs
property will be filled in by obtaining the corresponding values from the AdditionalData
dictionary and the Message
field will be generated from combining these values with the Message
string from the Registry.
A mechanism should be implemented to translate DBus object_path
references to Redfish Resource URIs. When an object_path
cannot be translated, bmcweb
will use a prefix such as object_path:
in the MessageArgs
value.
The implementation of EventService
should be enhanced to support phosphor-logging
hosted events. The implementation of LogService
should be enhanced to support log paging for phosphor-logging
hosted events.
phosphor-sel-logger
The phosphor-sel-logger
has a meson option send-to-logger
which toggles between using phosphor-logging
or the REDFISH_MESSAGE_ID
mechanism. The phosphor-logging
-utilizing paths will be updated to utilize phosphor-dbus-interfaces
specified errors and events.
Consider an example file in phosphor-dbus-interfaces
as yaml/xyz/openbmc_project/Software/Update.events.yaml
with hypothetical errors and events:
version: 1.3.1 errors: - name: UpdateFailure severity: critical metadata: - name: TARGET type: string primary: true - name: ERRNO type: int64 - name: CALLOUT_HARDWARE type: object_path primary: true en: description: While updating the firmware on a device, the update failed. message: A failure occurred updating {TARGET} on {CALLOUT_HARDWARE}. resolution: Retry update. - name: BMCUpdateFailure severity: critical deprecated: 1.0.0 en: description: Failed to update the BMC redfish-mapping: OpenBMC.FirmwareUpdateFailed events: - name: UpdateProgress metadata: - name: TARGET type: string primary: true - name: COMPLETION type: double primary: true en: description: An update is in progress and has reached a checkpoint. message: Updating of {TARGET} is {COMPLETION}% complete.
Each foo.events.yaml
file would be used to generate both the C++ classes (via sdbusplus
) for exception handling and event reporting, as well as a versioned Redfish Message Registry for the errors and events. The YAML schema is contained in the sdbusplus repository.
The above example YAML would generate C++ classes similar to:
namespace sdbusplus::errors::xyz::openbmc_project::software::update { class UpdateFailure { template <typename... Args> UpdateFailure(Args&&... args); }; } namespace sdbusplus::events::xyz::openbmc_project::software::update { class UpdateProgress { template <typename... Args> UpdateProgress(Args&&... args); }; }
The constructors here are variadic templates because the generated constructor implementation will provide compile-time assurance that all of the metadata fields have been populated (in any order). To raise an UpdateFailure
a developers might do something like:
// Immediately report the event: lg2::commit(UpdateFailure("TARGET", "BMC Flash A", "ERRNO", rc, "CALLOUT_HARDWARE", bmc_object_path)); // or send it in a dbus response (when using sdbusplus generated binding): throw UpdateFailure("TARGET", "BMC Flash A", "ERRNO", rc, "CALLOUT_HARDWARE", bmc_object_path);
If one of the fields, such as ERRNO
were omitted, a compile failure will be raised indicating the first missing field.
Assume the version follows semantic versioning MAJOR.MINOR.PATCH
convention.
PATCH
increment.MINOR
increment.MAJOR
increment.There is guidance on maintenance of the OpenBMC Message Registry. We will incorporate that guidance into the equivalent phosphor-dbus-interfaces
policy.
DSP0266, the Redfish specification, gives requirements for Redfish Message Registries and dictates guidelines for identifiers.
The hypothetical events defined above would create a message registry similar to:
{ "Id": "OpenBMC_Base_Xyz_OpenbmcProject_Software_Update.1.3.1", "Language": "en", "Messages": { "UpdateFailure": { "Description": "While updating the firmware on a device, the update failed.", "Message": "A failure occurred updating %1 on %2.", "Resolution": "Retry update." "NumberOfArgs": 2, "ParamTypes": ["string", "string"], "Severity": "Critical", }, "UpdateProgress" : { "Description": "An update is in progress and has reached a checkpoint." "Message": "Updating of %1 is %2\% complete.", "Resolution": "None", "NumberOfArgs": 2, "ParamTypes": ["string", "number"], "Severity": "OK", } } }
The prefix OpenBMC_Base
shall be exclusively reserved for use by events from phosphor-logging
. Events defined in other repositories will be expected to use some other prefix. Vendor-defined repositories should use a vendor-owned prefix as directed by DSP0266.
As specified above, vendors must use their own identifiers in order to conform with the Redfish specification (see DSP0266 for requirements on identifier naming). The sdbusplus
(and phosphor-logging
and bmcweb
) implementation(s) will enable vendors to create their own events for downstream code and Registries for integration with Redfish, by creating downstream repositories of error definitions. Vendors are responsible for ensuring their own versioning and identifiers conform to the expectations in the Redfish specification.
One potential bad behavior on the part of vendors would be forking and modifying phosphor-dbus-interfaces
defined events. Vendors must not add their own events to phosphor-dbus-interfaces
in downstream implementations because it would lead to their implementation advertising support for a message in an OpenBMC-owned Registry which is not the case, but they should add them to their own repositories with a separate identifier. Similarly, if a vendor were to backport upstream changes into their fork, they would need to ensure that the foo.events.yaml
file for that version matches identically with the upstream implementation.
Many alternatives have been explored and referenced through earlier work. Within this proposal there are many minor-alternatives that have been assessed.
The original phosphor-logging
error descriptions allowed inheritance between two errors. This is not supported by the proposal for two reasons:
This introduces complexity in the Redfish Message Registry versioning because a change in one file should induce version changes in all dependent files.
It makes it difficult for a developer to clearly identify all of the fields they are expected to populate without traversing multiple files.
There are a few possible syntaxes I came up with for constructing the generated exception types. It is important that these have good ergonomics, are easy to understand, and can provide compile-time awareness of missing metadata fields.
using Example = sdbusplus::error::xyz::openbmc_project::Example; // 1) throw Example().fru("Motherboard").value(42); // 2) throw Example(Example::fru_{}, "Motherboard", Example::value_{}, 42); // 3) throw Example("FRU", "Motherboard", "VALUE", 42); // 4) throw Example([](auto e) { return e.fru("Motherboard").value(42); }); // 5) throw Example({.fru = "Motherboard", .value = 42});
Note: These examples are all show using throw
syntax, but could also be saved in local variables, returned from functions, or immediately passed to lg2::commit
.
This would be my preference for ergonomics and clarity, as it would allow LSP-enabled editors to give completions for the metadata fields but unfortunately there is no mechanism in C++ to define a type which can be constructed but not thrown, which means we cannot get compile-time checking of all metadata fields.
This syntax uses tag-dispatch to enables compile-time checking of all metadata fields and potential LSP-completion of the tag-types, but is more verbose than option 3.
This syntax is less verbose than (2) and follows conventions already used in phosphor-logging
's lg2
API, but does not allow LSP-completion of the metadata tags.
This syntax is similar to option (1) but uses an indirection of a lambda to enable compile-time checking that all metadata fields have been populated by the lambda. The LSP-completion is likely not as strong as option (1), due to the use of auto
, and the lambda necessity will likely be a hang-up for unfamiliar developers.
This syntax has similar characteristics as option (1) but similarly does not provide compile-time confirmation that all fields have been populated.
The proposal therefore suggests option (3) is most suitable.
The proposed YAML format allows future addition of translation but it is not enabled at this time. Future development could enable the Redfish Message Registry to be generated in multiple languages if the message:language
exists for those languages.
The Redfish Message Registries are required to be versioned and has 3 digit fields (ie. XX.YY.ZZ
), but only the first 2 are suppose to be used in the Message ID. Rather than using the manually specified version we could take a few other approaches:
Use a date code (ex. 2024.17.x
) representing the ISO 8601 week when the registry was built.
Use the most recent openbmc/openbmc
tag as the version.
Generate the version based on the git-history.
phosphor-dbus-interfaces
to be built from a git repository, which may not always be true for Yocto source mirrors, and requires non-trivial processing that continues to scale over time.There are currently 191 messages defined in the existing Redfish Message Registry at version OpenBMC.0.4.0
. Of those, not a single one in the codebase is emitted with the correct version. 96 of those are only emitted by Intel-specific code that is not pulled into any upstreamed machine, 39 are emitted by potentially common code, and 56 are not even referenced in the codebase outside of the bmcweb registry. Of the 39 common messages half of them have an equivalent in one of the standard registries that should be leveraged and many of the others do not have attributes that would facilitate a multi-host configuration, so the registry at a minimum needs to be updated. None of the current implementation has the capability to handle Redfish Resource URIs.
The proposal therefore is to deprecate the existing registry and replace it with the new generated registries. For repositories that currently emit events in the existing format, we can maintain those call-sites for a time period of 1-2 years.
If this aspect of the proposal is rejected, the YAML format allows mapping from phosphor-dbus-interfaces
defined events to the current OpenBMC.0.4.0
registry MessageIds
.
Potentially common:
SensorThreshold*
: 8 different eventsIntel-only implementations:
ME*
: 29 different eventsMemory*
: 9 different eventsPCIeCorrectable*
, PCIeFatal
: 29 different eventsNew APIs are defined for error and event logging. This will deprecate existing phosphor-logging
APIs, with a time to migrate, for error reporting.
The design should improve performance by eliminating the regular parsing of the systemd
journal. The design may decrease performance by allowing the number of error and event logs to be dramatically increased, which have an impact to file system utilization and potential for DBus impacts some services such as ObjectMapper
.
Backwards compatibility and documentation should be improved by the automatic generation of the Redfish Message Registry corresponding to all error and event reports.
sdbusplus
phosphor-dbus-interfaces
phosphor-logging
bmcweb
Unit tests will be written in sdbusplus
and phosphor-logging
for the error and event generation, creation APIs, and to provide coverage on any changes to the Logging.Entry
object management.
Unit tests will be written for bmcweb
for basic Logging.Entry
transformation and Message Registry generation.
Integration tests should be leveraged (and enhanced as necessary) from openbmc-test-automation
to cover the end-to-end error creation and Redfish reporting.