Error and Event Logging

Author: Patrick Williams <stwcx>

Other contributors:

Created: May 16, 2024

Problem Description

There is currently no consistent end-to-end error and event reporting design for the OpenBMC code stack. There are two different implementations, one primarily using phosphor-logging and one using rsyslog, both of which have gaps that a complete solution should address. This proposal is intended to be an end-to-end design handling both errors and tracing events, facilitating external management of the system in an automated and maintainable manner.

Background and References

Redfish LogEntry and Message Registry

In Redfish, the LogEntry schema is used for a range of items that could be considered "logs", but one such use within OpenBMC is for an equivalent of the IPMI "System Event Log (SEL)".

The IPMI SEL is the location where the BMC can collect errors and events, sometimes coming from other entities, such as the BIOS. Examples of these might be "DIMM-A0 encountered an uncorrectable ECC error" or "System boot successful". These SEL records are exposed as human-readable strings, either natively by an OEM SEL design or by tools such as ipmitool. The strings are typically unique to each system or manufacturer and could hypothetically change with a BMC or firmware update, which makes it difficult to create automated tooling around them. Two different vendors might use different strings to represent a critical temperature threshold exceeded: "temperature threshold exceeded" and "Temperature #0x30 Upper Critical going high". There is also no mechanism with IPMI to ask the machine "what are all of the SELs you might create".

In order to solve two aspects of this problem, listing of possible events and versioning, Redfish has Message Registries. A message registry is a versioned collection of all of the error events that a system could generate and hints as to how they might be parsed and displayed to a user. An informative reference from the DMTF gives this example:

{
  "@odata.type": "#MessageRegistry.v1_0_0.MessageRegistry",
  "Id": "Alert.1.0.0",
  "RegistryPrefix": "Alert",
  "RegistryVersion": "1.0.0",
  "Messages": {
    "LanDisconnect": {
      "Description": "A LAN Disconnect on %1 was detected on system %2.",
      "Message": "A LAN Disconnect on %1 was detected on system %2.",
      "Severity": "Warning",
      "NumberOfArgs": 2,
      "Resolution": "None"
    }
  }
}

This example defines an event, Alert.1.0.LanDisconnect, which can record the disconnect state of a network device and contains placeholders for the affected device and system. When this event occurs, there might be a LogEntry recorded containing something like:

{
  "Message": "A LAN Disconnect on EthernetInterface 1 was detected on system /redfish/v1/Systems/1.",
  "MessageId": "Alert.1.0.LanDisconnect",
  "MessageArgs": ["EthernetInterface 1", "/redfish/v1/Systems/1"]
}

The Message contains a human readable string which was created by applying the MessageArgs to the placeholders from the Message field in the registry. System management software can rely on the message registry (referenced from the MessageId field in the LogEntry) and MessageArgs to avoid needing to perform string processing for reacting to the event.
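
The substitution described above can be sketched in a few lines of C++. This is only an illustration of how the `%N` placeholders relate to `MessageArgs`; the function name `formatMessage` is hypothetical and is not an API of bmcweb or any other OpenBMC component. The sketch assumes placeholders are 1-based and leaves any placeholder without a matching argument untouched.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: expand %1, %2, ... in a registry Message template
// using the positional values from MessageArgs.
std::string formatMessage(std::string_view tmpl,
                          const std::vector<std::string>& args)
{
    std::string out;
    std::size_t i = 0;
    while (i < tmpl.size())
    {
        const bool placeholder =
            tmpl[i] == '%' && i + 1 < tmpl.size() &&
            std::isdigit(static_cast<unsigned char>(tmpl[i + 1]));
        if (!placeholder)
        {
            out += tmpl[i];
            ++i;
            continue;
        }

        // Parse the decimal index that follows the '%'.
        std::size_t j = i + 1;
        std::size_t index = 0;
        while (j < tmpl.size() &&
               std::isdigit(static_cast<unsigned char>(tmpl[j])))
        {
            index = index * 10 + static_cast<std::size_t>(tmpl[j] - '0');
            ++j;
        }
        assert(j > i + 1); // at least one digit was consumed

        if (index >= 1 && index <= args.size())
        {
            out += args[index - 1];
        }
        else
        {
            // No matching argument: keep the placeholder text as-is.
            out.append(tmpl.substr(i, j - i));
        }
        i = j;
    }
    return out;
}
```

Called with the registry `Message` for `LanDisconnect` and the `MessageArgs` from the LogEntry above, this produces the human-readable `Message` string shown in that LogEntry.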

Within OpenBMC, there is currently a limited design for this Redfish feature and it requires inserting specially formed Redfish-specific logging messages into any application that wants to record these events, tightly coupling all applications to the Redfish implementation. It has also been observed that these strings, when used, are often out of date with the message registry advertised by bmcweb. Some maintainers have rejected adding new Redfish-specific logging messages to their applications.

Existing phosphor-logging implementation

Note: While the word 'exception' is used in this section, the existing (and proposed) types can be used by applications and execution contexts with exceptions disabled. They are 'exceptions' because they do inherit from std::exception and there is support in the sdbusplus bindings for them to be used in exception handling.

The sdbusplus bindings have the capability to define new C++ exception types which can be thrown by a DBus server and turned into an error response to the client. phosphor-logging extended this to also add metadata associated to the log type. See the following example error definitions and usages.

sdbusplus error binding definition (in xyz/openbmc_project/Certs.errors.yaml):

- name: InvalidCertificate
  description: Invalid certificate file.

phosphor-logging metadata definition (in xyz/openbmc_project/Certs.metadata.yaml):

- name: InvalidCertificate
  meta:
    - str: "REASON=%s"
      type: string

Application code reporting an error:

elog<InvalidCertificate>(Reason("Invalid certificate file format"));
// or
report<InvalidCertificate>(Reason("Existing certificate file is corrupted"));

In this sample, an error named xyz.openbmc_project.Certs.Error.InvalidCertificate has been defined, which can be sent between applications as a DBus response. The InvalidCertificate is expected to have additional metadata REASON which is a string. The two APIs elog and report have slightly different behaviors: elog throws an exception which can either result in an error DBus result or be handled elsewhere in the application, while report sends the event directly to phosphor-logging's daemon for recording. As a side-effect of both calls, the metadata is inserted into the systemd journal.

When an error is sent to the phosphor-logging daemon, it will:

  1. Search back through the journal for recorded metadata associated with the event (this is a relatively slow operation).
  2. Create an xyz.openbmc_project.Logging.Entry DBus object with the associated data extracted from the journal.
  3. Persist a serialized version of the object.

Within bmcweb there is support for translating xyz.openbmc_project.Logging.Entry objects advertised by phosphor-logging into Redfish LogEntries, but this support does not reference a Message Registry. This makes the events of limited utility for consumption by system management software, as it cannot know all of the event types and is left to perform (hand-coded) regular-expressions to extract any information from the Message field of the LogEntry. Furthermore, these regular-expressions are likely to become outdated over time as internal OpenBMC error reporting structure, metadata, or message strings evolve.

Issues with the Status Quo

  • There are two different implementations of error logging, neither of which is both complete and fully accepted by maintainers. These implementations also do not cover tracing events.

  • The REDFISH_MESSAGE_ID log approach leads to differences between the Redfish Message Registry and the reporting application. It also requires every application to be "Redfish aware", which limits decoupling between applications and external management interfaces. This also leaves gaps for reporting errors in different management interfaces, such as inband IPMI and PLDM. The approach also does not provide compile-time assurance of appropriate metadata collection, which can lead to the producing code being out of date with the message registry definitions.

  • The phosphor-logging approach does not provide compile-time assurance of appropriate metadata collection and requires expensive daemon processing of the systemd journal on each error report, which limits scalability.

  • The sdbusplus bindings for error reporting do not currently handle lossless transmission of errors between DBus servers and clients.

  • Similar applications can produce different Redfish LogEntries for the same error scenario. This has been observed in sensor threshold exceeded events between dbus-sensors, phosphor-hwmon, phosphor-virtual-sensor, and phosphor-health-monitor. One cause of this is two different error reporting approaches and disagreements amongst maintainers as to the preferred approach.

Requirements

  • Applications running on the BMC must be able to report errors and failures which are persisted and available for external system management through standards such as Redfish.

    • These errors must be structured and versioned, and the complete set of errors able to be created by the BMC should be available at build-time of a BMC image.
    • The set of errors, able to be created by the BMC, must be able to be transformed into relevant data sets, such as Redfish Message Registries.
      • For Redfish, the transformation must comply with the Redfish standard requirements, such as conforming to semantic versioning expectations.
      • For Redfish, the transformation should allow mapping internally defined events to pre-existing Redfish Message Registries for broader compatibility.
      • For Redfish, the implementation must also support the EventService mechanics for push-reporting.
    • Errors reported by the BMC should contain sufficient information to allow service of the system for these failures, either by humans or automation (depending on the individual system requirements).
  • Applications running on the BMC should be able to report important tracing events relevant to system management and/or debug, such as the system successfully reaching a running state.

    • All requirements relevant to errors are also applicable to tracing events.
    • The implementation must have a mechanism for vendors to be able to disable specific tracing events to conform to their own system design requirements.
  • Applications running on the BMC should be able to determine when a previously reported error is no longer relevant and mark it as "resolved", while maintaining the persistent record for future usages such as debug.

  • The BMC should provide a mechanism for managed entities within the server to report their own errors and events. Examples of managed entities would be firmware, such as the BIOS, and satellite management controllers.

  • The implementation on the BMC should scale to a minimum of 10,000 errors and events without impacting the BMC or managed system performance.

  • The implementation should provide a mechanism to allow OEM or vendor extensions to the error and event definitions (and generated artifacts such as the Redfish Message Registry) for usage in closed-source or non-upstreamed code. These extensions must be clearly identified, in all interfaces, as vendor-specific and not be tied to the OpenBMC project.

  • APIs to implement error and event reporting should have good ergonomics. These APIs must provide compile-time identification, for applicable programming languages, of call sites which do not conform to the BMC error and event specifications.

    • The generated error classes and APIs should not require exceptions but should also integrate with the sdbusplus client and server bindings, which do leverage exceptions.

Proposed Design

The proposed design has a few high-level design elements:

  • Consolidate the sdbusplus and phosphor-logging implementation of error reporting; expand it to cover tracing events; improve the ergonomics of the associated APIs and add compile-time checking of missing metadata.

  • Add APIs to phosphor-logging to enable daemons to easily look up their own previously reported events (for marking as resolved).

  • Add to phosphor-logging a compile-time mechanism to disable recording of specific tracing events for vendor-level customization.

  • Generate a Redfish Message Registry for all errors and events defined in phosphor-dbus-interfaces, using binding generators from sdbusplus. Enhance the bmcweb implementation of the Logging.Entry to LogEntry transformation to cover the Redfish Message Registry and phosphor-logging enhancements; leverage the Redfish LogEntry.DiagnosticData field to provide a Base64-encoded JSON representation of the entire Logging.Entry for additional diagnostics [[does this need to be optional?]]. Add support to the bmcweb EventService implementation for phosphor-logging-hosted events.

sdbusplus

The Foo.errors.yaml content will be combined with the content formerly in the Foo.metadata.yaml files specified by phosphor-logging and specified by a new file type Foo.events.yaml. This Foo.events.yaml format will cover both the current error and metadata information as well as augment with additional information necessary to generate external facing datasets, such as Redfish Message Registries. The current Foo.errors.yaml and Foo.metadata.yaml files will be deprecated as their usage is replaced by the new format.

The sdbusplus library will be enhanced to provide the following:

  • JSON serialization and de-serialization of generated exception types with their assigned metadata; assignment of the JSON serialization to the message field of sd_bus_error_set calls when errors are returned from DBus server calls.

  • A facility to register exception types, at library load time, with the sdbusplus library for automatic conversion back to C++ exception types in DBus clients.

The binding generator(s) will be expanded to do the following:

  • Generate complete C++ exception types, with compile-time checking of missing metadata and JSON serialization, for errors and events. Metadata can be of one of the following types:

    • size-type and signed integer
    • floating-point number
    • string
    • DBus object path
  • Generate a format that bmcweb can use to create and populate a Redfish Message Registry, and translate from phosphor-logging to Redfish LogEntry for a set of errors and events

For general users of sdbusplus these changes should have no impact, except for the availability of new generated exception types and that specialized instances of sdbusplus::exception::generated_exception will become available in DBus clients.

phosphor-dbus-interfaces

Refactoring will be done to migrate existing Foo.metadata.yaml and Foo.errors.yaml content to the Foo.events.yaml as migration is done by applications. Minor changes will take place to utilize the new binding generators from sdbusplus. A small library enhancement will be done to register all generated exception types with sdbusplus. Future contributors will be able to contribute new error and tracing event definitions.

phosphor-logging

TODO: Should a tracing event be a Logging.Entry with severity of Informational, or should it be a new type, such as Logging.Event, and managed separately? The phosphor-logging default meson.options have error_cap=200 and error_info_cap=10. If we increase the total number of events allowed to 10K, the majority of them are likely going to be informational / tracing events.

The Logging.Entry interface's AdditionalData property should change to dict[string, variant[string,int64_t,size_t,object_path]].

The Logging.Create interface will have a new method added:

- name: CreateEntry
  parameters:
    - name: Message
      type: string
    - name: Severity
      type: enum[Logging.Entry.Level]
    - name: AdditionalData
      type: dict[string, variant[string,int64_t,size_t,object_path]]
    - name: Hint
      type: string
      default: ""
  returns:
    - name: Entry
      type: object_path

The Hint parameter is used for daemons to be able to query for their previously recorded error, for marking as resolved. These strings need to be globally unique and are suggested to be of the format "<service_name>:<key>".

A Logging.SearchHint interface will be created, which will be recorded at the same object path as a Logging.Entry when the Hint parameter was not an empty string:

- property: Hint
  type: string

The Logging.Manager interface will be added with a single method:

- name: FindEntry
  parameters:
    - name: Hint
      type: string
  returns:
    - name: Entry
      type: object_path
  errors:
    - xyz.openbmc_project.Common.ResourceNotFound
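
To illustrate how the Hint, SearchHint, and FindEntry pieces fit together, here is a minimal in-memory sketch. It is not the phosphor-logging implementation: the class `HintIndex` and its methods are hypothetical, entries are represented only by their object path strings, and an empty `std::optional` stands in for the `ResourceNotFound` error. Rejecting a second record with the same hint is an assumption of this sketch, based on the statement that hints need to be globally unique.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical index from Hint string to Logging.Entry object path.
class HintIndex
{
  public:
    // Build a hint in the suggested "<service_name>:<key>" format.
    static std::string makeHint(const std::string& service,
                                const std::string& key)
    {
        return service + ":" + key;
    }

    // Remember the entry created with this hint. Returns false when the
    // hint is already in use (assumption: duplicates are rejected).
    bool record(const std::string& hint, const std::string& entryPath)
    {
        // Entries created with an empty Hint carry no SearchHint and are
        // therefore never indexed.
        assert(!hint.empty());
        return entries.emplace(hint, entryPath).second;
    }

    // Analogue of FindEntry: an empty optional stands in for
    // xyz.openbmc_project.Common.ResourceNotFound.
    std::optional<std::string> find(const std::string& hint) const
    {
        const auto it = entries.find(hint);
        if (it == entries.end())
        {
            return std::nullopt;
        }
        return it->second;
    }

  private:
    std::unordered_map<std::string, std::string> entries;
};
```

A daemon would keep only the hint string it generated; on recovery it can look up the entry path by hint and mark that entry as resolved.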

A lg2::commit API will be added to support the new sdbusplus generated exception types, calling the new Logging.Create.CreateEntry method proposed earlier. This new API will support sdbusplus::bus_t for synchronous DBus operations and both sdbusplus::async::context_t and sdbusplus::asio::connection for asynchronous DBus operations.

There are outstanding performance concerns with the phosphor-logging implementation that may impact the ability for scaling to 10,000 event records. This issue is expected to be self-contained within phosphor-logging, except for potential future changes to the log-retrieval interfaces used by bmcweb. In order to decouple the transition to this design, by callers of the logging APIs, from the experimentation and improvements in phosphor-logging, we will add a compile option and Yocto DISTRO_FEATURE that can turn lg2::commit behavior into an OPENBMC_MESSAGE_ID record in the journal, along the same approach as the previous REDFISH_MESSAGE_ID, and corresponding rsyslog configuration and bmcweb support to use these directly. This will allow systems which knowingly scale to a large number of event records, using rsyslog mechanics, the same level of performance. One caveat of this support is that the hint and resolution behavior will not exist when that option is enabled.

bmcweb

bmcweb already has support for build-time conversion from a Redfish Message Registry, codified in JSON, to header files it uses to serve the registry; this will be expanded to support Redfish Message Registries generated by sdbusplus. bmcweb will add a Meson option for additional message registries, provided from bitbake from phosphor-dbus-interfaces and vendor-specific event definitions as a path to a directory of Message Registry JSONs. Support will also be added for adding phosphor-dbus-interfaces as a Meson subproject for stand-alone testing.

It is desirable for sdbusplus to generate a Redfish Message Registry directly, leveraging the existing scripts for integration with bmcweb. As part of this we would like to support mapping a Logging.Entry event to an existing standardized Redfish event (such as those in the Base registry). The generated information must contain the Logging.Entry::Message identifier, the AdditionalData to MessageArgs mapping, and the translation from the Message identifier to the Redfish Message ID (when the Message ID is not from "this" registry). In order to facilitate this, we will need to add OEM fields to the Redfish Message Registry JSON, which are only used by the bmcweb processing scripts, to generate the information necessary for this additional mapping.

The xyz.openbmc_project.Logging.Entry to LogEntry conversion needs to be enhanced, to utilize these Message Registries, in four ways:

  1. A Base64-encoded JSON representation of the Logging.Entry will be assigned to the DiagnosticData property.

  2. If the Logging.Entry::Message contains an identifier corresponding to a Registry entry, the MessageId property will be set to the corresponding Redfish Message ID. Otherwise, the Logging.Entry::Message will be used directly with no further transformation (as is done today).

  3. If the Logging.Entry::Message contains an identifier corresponding to a Registry entry, the MessageArgs property will be filled in by obtaining the corresponding values from the AdditionalData dictionary and the Message field will be generated from combining these values with the Message string from the Registry.

  4. A mechanism should be implemented to translate DBus object_path references to Redfish Resource URIs. When an object_path cannot be translated, bmcweb will use a prefix such as object_path: in the MessageArgs value.
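
Steps 3 and 4 above can be sketched as follows. This is a simplified illustration and not bmcweb code: `ObjectPath`, `Value`, `ResourceMap`, `toArg`, and `buildMessageArgs` are hypothetical names, the lookup table from DBus object paths to Redfish URIs is assumed to be supplied by some other mechanism, and a primary field missing from AdditionalData is rendered as an empty string so that argument positions stay aligned.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Stand-in for a DBus object_path, kept distinct from a plain string.
struct ObjectPath
{
    std::string path;
};

using Value = std::variant<std::string, std::int64_t, std::uint64_t, double,
                           ObjectPath>;
using AdditionalData = std::map<std::string, Value>;
// Hypothetical table: DBus object path -> Redfish Resource URI.
using ResourceMap = std::map<std::string, std::string>;

// Render one AdditionalData value as a MessageArgs string.
std::string toArg(const Value& value, const ResourceMap& resources)
{
    return std::visit(
        [&resources](const auto& v) -> std::string {
            using T = std::decay_t<decltype(v)>;
            if constexpr (std::is_same_v<T, std::string>)
            {
                return v;
            }
            else if constexpr (std::is_same_v<T, ObjectPath>)
            {
                const auto it = resources.find(v.path);
                if (it != resources.end())
                {
                    return it->second;
                }
                // Untranslatable paths keep a recognizable prefix.
                return "object_path:" + v.path;
            }
            else
            {
                return std::to_string(v);
            }
        },
        value);
}

// Collect the primary metadata fields, in order, as MessageArgs.
std::vector<std::string> buildMessageArgs(
    const std::vector<std::string>& primaryFields, const AdditionalData& data,
    const ResourceMap& resources)
{
    std::vector<std::string> args;
    for (const auto& name : primaryFields)
    {
        const auto it = data.find(name);
        if (it == data.end())
        {
            args.emplace_back(); // keep positions aligned
            continue;
        }
        args.push_back(toArg(it->second, resources));
    }
    assert(args.size() == primaryFields.size());
    return args;
}
```

The resulting vector can then be combined with the registry Message string to produce the LogEntry Message field.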

The implementation of EventService should be enhanced to support phosphor-logging hosted events. The implementation of LogService should be enhanced to support log paging for phosphor-logging hosted events.

phosphor-sel-logger

The phosphor-sel-logger has a meson option send-to-logger which toggles between using phosphor-logging or the REDFISH_MESSAGE_ID mechanism. The phosphor-logging-utilizing paths will be updated to utilize phosphor-dbus-interfaces specified errors and events.

YAML format

Consider an example file in phosphor-dbus-interfaces as yaml/xyz/openbmc_project/Software/Update.events.yaml with hypothetical errors and events:

version: 1.3.1

errors:
  - name: UpdateFailure
    severity: critical
    metadata:
      - name: TARGET
        type: string
        primary: true
      - name: ERRNO
        type: int64
      - name: CALLOUT_HARDWARE
        type: object_path
        primary: true
    en:
      description: While updating the firmware on a device, the update failed.
      message: A failure occurred updating {TARGET} on {CALLOUT_HARDWARE}.
      resolution: Retry update.

  - name: BMCUpdateFailure
    severity: critical
    deprecated: 1.0.0
    en:
      description: Failed to update the BMC
    redfish-mapping: OpenBMC.FirmwareUpdateFailed

events:
  - name: UpdateProgress
    metadata:
      - name: TARGET
        type: string
        primary: true
      - name: COMPLETION
        type: double
        primary: true
    en:
      description: An update is in progress and has reached a checkpoint.
      message: Updating of {TARGET} is {COMPLETION}% complete.

Each foo.events.yaml file would be used to generate both the C++ classes (via sdbusplus) for exception handling and event reporting, as well as a versioned Redfish Message Registry for the errors and events. The YAML schema is as follows:

$id: https://openbmc-project.xyz/sdbusplus/events.schema.yaml
$schema: https://json-schema.org/draft/2020-12/schema
title: Event and error definitions
type: object
$defs:
  event:
    type: array
    items:
      type: object
      properties:
        name:
          type: string
          description:
            An identifier for the event in UpperCamelCase; used as the class and
            Redfish Message ID.
        en:
          type: object
          description: The details for English.
          properties:
            description:
              type: string
              description:
                A developer-applicable description of the error reported. These
                form the "description" of the Redfish message.
            message:
              type: string
              description:
                The end-user message, including placeholders for arguments.
            resolution:
              type: string
              description: The end-user resolution.
        severity:
          enum:
            - emergency
            - alert
            - critical
            - error
            - warning
            - notice
            - informational
            - debug
          description:
            The `xyz.openbmc_project.Logging.Entry.Level` value for this
            error.  Only applicable for 'errors'.
        redfish-mapping:
          type: string
          description:
            Used when a `sdbusplus` event should map to a specific Redfish
            Message rather than a generated one. This is useful when an internal
            error has an analog in a standardized registry.
        deprecated:
          type: string
          pattern: "^[0-9]+\\.[0-9]+\\.[0-9]+$"
          description:
            Indicates that the event is now deprecated and should not be created
            by any OpenBMC software, but is required to still exist for
            generation in the Redfish Message Registry. The version listed here
            should be the first version where the error is no longer used.
        metadata:
          type: array
          items:
            type: object
            properties:
              name:
                type: string
                description: The name of the metadata field.
              type:
                enum:
                  - string
                  - size
                  - int64
                  - uint64
                  - double
                  - object_path
                description: The type of the metadata field.
              primary:
                type: boolean
                description:
                  Set to true when the metadata field is expected to be part of
                  the Redfish `MessageArgs` (and not only in the extended
                  `DiagnosticData`).
properties:
  version:
    type: string
    pattern: "^[0-9]+\\.[0-9]+\\.[0-9]+$"
    description:
      The version of the file, which will be used as the Redfish Message
      Registry version.
  errors:
    $ref: "#/$defs/event"
  events:
    $ref: "#/$defs/event"

The above example YAML would generate C++ classes similar to:

namespace sdbusplus::errors::xyz::openbmc_project::software::update
{

class UpdateFailure
{
  public:
    template <typename... Args>
    UpdateFailure(Args&&... args);
};

}

namespace sdbusplus::events::xyz::openbmc_project::software::update
{

class UpdateProgress
{
  public:
    template <typename... Args>
    UpdateProgress(Args&&... args);
};

}

The constructors here are variadic templates because the generated constructor implementation will provide compile-time assurance that all of the metadata fields have been populated (in any order). To raise an UpdateFailure, a developer might do something like:

// Immediately report the event:
lg2::commit(UpdateFailure("TARGET", "BMC Flash A", "ERRNO", rc, "CALLOUT_HARDWARE", bmc_object_path));
// or send it in a dbus response (when using sdbusplus generated binding):
throw UpdateFailure("TARGET", "BMC Flash A", "ERRNO", rc, "CALLOUT_HARDWARE", bmc_object_path);

If one of the fields, such as ERRNO, were omitted, a compile failure would be raised indicating the first missing field.

Versioning Policy

Assume the version follows semantic versioning MAJOR.MINOR.PATCH convention.

  • Adjusting a description or message should result in a PATCH increment.
  • Adding a new error or event, or adding metadata to an existing error or event, should result in a MINOR increment.
  • Deprecating an error or event should result in a MAJOR increment.
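
The policy above can be expressed as a small version-bump function. This is a sketch only; the names `Version`, `Change`, `bump`, and `toString` are hypothetical, and resetting the lower fields to zero on a MINOR or MAJOR increment is the usual semantic versioning convention rather than something stated in this proposal.

```cpp
#include <cassert>
#include <string>

struct Version
{
    unsigned majorPart;
    unsigned minorPart;
    unsigned patchPart;
};

// The kinds of change named in the versioning policy.
enum class Change
{
    TextEdit,    // description or message adjusted
    Addition,    // new error/event, or new metadata on an existing one
    Deprecation, // error or event deprecated
};

Version bump(const Version& v, Change change)
{
    switch (change)
    {
        case Change::TextEdit:
            return {v.majorPart, v.minorPart, v.patchPart + 1};
        case Change::Addition:
            return {v.majorPart, v.minorPart + 1, 0};
        case Change::Deprecation:
            return {v.majorPart + 1, 0, 0};
    }
    assert(false && "unhandled Change value");
    return v;
}

std::string toString(const Version& v)
{
    return std::to_string(v.majorPart) + "." + std::to_string(v.minorPart) +
           "." + std::to_string(v.patchPart);
}
```

Starting from the example file version 1.3.1, a message rewording would give 1.3.2, a new event 1.4.0, and a deprecation 2.0.0.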

There is guidance on maintenance of the OpenBMC Message Registry. We will incorporate that guidance into the equivalent phosphor-dbus-interfaces policy.

Generated Redfish Message Registry

DSP0266, the Redfish specification, gives requirements for Redfish Message Registries and dictates guidelines for identifiers.

The hypothetical events defined above would create a message registry similar to:

{
  "Id": "OpenBMC_Base_Xyz_OpenbmcProject_Software_Update.1.3.1",
  "Language": "en",
  "Messages": {
    "UpdateFailure": {
      "Description": "While updating the firmware on a device, the update failed.",
      "Message": "A failure occurred updating %1 on %2.",
      "Resolution": "Retry update.",
      "NumberOfArgs": 2,
      "ParamTypes": ["string", "string"],
      "Severity": "Critical"
    },
    "UpdateProgress": {
      "Description": "An update is in progress and has reached a checkpoint.",
      "Message": "Updating of %1 is %2% complete.",
      "Resolution": "None",
      "NumberOfArgs": 2,
      "ParamTypes": ["string", "number"],
      "Severity": "OK"
    }
  }
}
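
The step from the YAML `message` (with `{NAME}` placeholders) to the registry `Message` (with positional `%N` placeholders) might look like the sketch below. The function name `toRedfishMessage` is hypothetical and this is not the sdbusplus generator; it assumes that arguments are numbered by the order in which the `primary: true` metadata fields are declared, and that a placeholder naming a non-primary field is left as-is.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: rewrite {NAME} placeholders as %N, where N is the
// 1-based position of NAME in the list of primary metadata fields.
std::string toRedfishMessage(std::string_view message,
                             const std::vector<std::string>& primaryFields)
{
    std::string out;
    std::size_t pos = 0;
    while (pos < message.size())
    {
        const std::size_t open = message.find('{', pos);
        const std::size_t close = (open == std::string_view::npos)
                                      ? std::string_view::npos
                                      : message.find('}', open + 1);
        if (close == std::string_view::npos)
        {
            // No further complete placeholder: copy the remainder.
            out.append(message.substr(pos));
            break;
        }
        assert(close > open);

        out.append(message.substr(pos, open - pos));
        const std::string_view name =
            message.substr(open + 1, close - open - 1);
        const auto it =
            std::find(primaryFields.begin(), primaryFields.end(), name);
        if (it != primaryFields.end())
        {
            out += '%';
            out += std::to_string(
                std::distance(primaryFields.begin(), it) + 1);
        }
        else
        {
            // Not a primary field: keep the original placeholder text.
            out.append(message.substr(open, close - open + 1));
        }
        pos = close + 1;
    }
    return out;
}
```

For the `UpdateFailure` definition, whose primary fields are TARGET and CALLOUT_HARDWARE, this turns the YAML message into the `Message` value shown in the registry above.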

The prefix OpenBMC_Base shall be exclusively reserved for use by events from phosphor-logging. Events defined in other repositories will be expected to use some other prefix. Vendor-defined repositories should use a vendor-owned prefix as directed by DSP0266.

Vendor implications

As specified above, vendors must use their own identifiers in order to conform with the Redfish specification (see DSP0266 for requirements on identifier naming). The sdbusplus (and phosphor-logging and bmcweb) implementation(s) will enable vendors to create their own events for downstream code and Registries for integration with Redfish, by creating downstream repositories of error definitions. Vendors are responsible for ensuring their own versioning and identifiers conform to the expectations in the Redfish specification.

One potential bad behavior on the part of vendors would be forking and modifying phosphor-dbus-interfaces defined events. Vendors must not add their own events to phosphor-dbus-interfaces in downstream implementations because it would lead to their implementation advertising support for a message in an OpenBMC-owned Registry which is not the case, but they should add them to their own repositories with a separate identifier. Similarly, if a vendor were to backport upstream changes into their fork, they would need to ensure that the foo.events.yaml file for that version matches identically with the upstream implementation.

Alternatives Considered

Many alternatives have been explored and referenced through earlier work. Within this proposal there are many minor-alternatives that have been assessed.

Exception inheritance

The original phosphor-logging error descriptions allowed inheritance between two errors. This is not supported by the proposal for two reasons:

  • This introduces complexity in the Redfish Message Registry versioning because a change in one file should induce version changes in all dependent files.

  • It makes it difficult for a developer to clearly identify all of the fields they are expected to populate without traversing multiple files.

sdbusplus Exception APIs

There are a few possible syntaxes I came up with for constructing the generated exception types. It is important that these have good ergonomics, are easy to understand, and can provide compile-time awareness of missing metadata fields.

    using Example = sdbusplus::error::xyz::openbmc_project::Example;

    // 1)
    throw Example().fru("Motherboard").value(42);

    // 2)
    throw Example(Example::fru_{}, "Motherboard", Example::value_{}, 42);

    // 3)
    throw Example("FRU", "Motherboard", "VALUE", 42);

    // 4)
    throw Example([](auto e) { return e.fru("Motherboard").value(42); });

    // 5)
    throw Example({.fru = "Motherboard", .value = 42});

Note: These examples are all shown using throw syntax, but could also be saved in local variables, returned from functions, or immediately passed to lg2::commit.

  1. This would be my preference for ergonomics and clarity, as it would allow LSP-enabled editors to give completions for the metadata fields. Unfortunately, there is no mechanism in C++ to define a type which can be constructed but not thrown, which means we cannot get compile-time checking of all metadata fields.

  2. This syntax uses tag-dispatch to enable compile-time checking of all metadata fields and potential LSP-completion of the tag-types, but is more verbose than option 3.

  3. This syntax is less verbose than (2) and follows conventions already used in phosphor-logging's lg2 API, but does not allow LSP-completion of the metadata tags.

  4. This syntax is similar to option (1) but uses an indirection of a lambda to enable compile-time checking that all metadata fields have been populated by the lambda. The LSP-completion is likely not as strong as option (1), due to the use of auto, and the lambda necessity will likely be a hang-up for unfamiliar developers.

  5. This syntax has similar characteristics as option (1) but similarly does not provide compile-time confirmation that all fields have been populated.

The proposal therefore suggests option (3) is most suitable.
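
To make the notion of compile-time checking concrete, here is a minimal sketch of the tag-dispatch style of option (2), which is the simplest of the checked variants to show in a few lines. It is not the generated sdbusplus code: the helper `hasTag`, the accessors, and the private `assign` overloads are hypothetical, and checking the string-literal names of option (3) at compile time would need additional machinery that is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>
#include <utility>

// True when Tag appears among the (decayed) argument types.
template <typename Tag, typename... Args>
constexpr bool hasTag()
{
    return (std::is_same_v<Tag, std::decay_t<Args>> || ...);
}

class Example
{
  public:
    // Empty tag types naming each metadata field.
    struct fru_
    {};
    struct value_
    {};

    // Accepts (tag, value) pairs in any order; omitting a tag is a
    // compile error thanks to the static_asserts.
    template <typename... Args>
    explicit Example(Args&&... args)
    {
        static_assert(hasTag<fru_, Args...>(), "missing metadata: FRU");
        static_assert(hasTag<value_, Args...>(), "missing metadata: VALUE");
        assign(std::forward<Args>(args)...);
    }

    const std::string& fru() const
    {
        return fruField;
    }
    std::int64_t value() const
    {
        return valueField;
    }

  private:
    void assign() {} // end of the (tag, value) list

    template <typename... Rest>
    void assign(fru_, std::string v, Rest&&... rest)
    {
        assert(!v.empty()); // sketch-level sanity check on the FRU name
        fruField = std::move(v);
        assign(std::forward<Rest>(rest)...);
    }

    template <typename... Rest>
    void assign(value_, std::int64_t v, Rest&&... rest)
    {
        valueField = v;
        assign(std::forward<Rest>(rest)...);
    }

    std::string fruField;
    std::int64_t valueField = 0;
};
```

With this sketch, `Example(Example::fru_{}, "Motherboard", Example::value_{}, 42)` compiles, while dropping either tag fails the corresponding `static_assert` with a message naming the missing field.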

Redfish Translation Support

The proposed YAML format allows future addition of translation but it is not enabled at this time. Future development could enable the Redfish Message Registry to be generated in multiple languages if the message:language exists for those languages.

Redfish Registry Versioning

Redfish Message Registries are required to be versioned with three numeric fields (i.e. XX.YY.ZZ), but only the first two are supposed to be used in the Message ID. Rather than using the manually specified version, we could take a few other approaches:

  • Use a date code (ex. 2024.17.x) representing the ISO 8601 week when the registry was built.

    • This does not cover vendors that may choose to branch for stabilization purposes, so we can end up with two machines having the same OpenBMC-versioned message registry with different content.
  • Use the most recent openbmc/openbmc tag as the version.

    • This does not cover vendors that build off HEAD and may deploy multiple images between two OpenBMC releases.
  • Generate the version based on the git-history.

    • This requires phosphor-dbus-interfaces to be built from a git repository, which may not always be true for Yocto source mirrors, and requires non-trivial processing that continues to scale over time.
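
As an illustration of the versioning rules above, the following C++ sketch builds a Message ID from only the first two version fields and derives a date-code version from the ISO 8601 week. The helper names makeMessageId and isoWeekVersion are hypothetical and do not exist in any OpenBMC repository; the Registry.Major.Minor.Key layout is an assumption based on the description above.

```cpp
#include <cassert>
#include <cstdlib>
#include <ctime>
#include <string>

// Hypothetical helper: join the registry name, the major and minor version
// fields, and the message key. The third version field is deliberately not
// part of the resulting id.
std::string makeMessageId(const std::string& registry, int major, int minor,
                          const std::string& key)
{
    assert(major >= 0 && minor >= 0);
    return registry + "." + std::to_string(major) + "." +
           std::to_string(minor) + "." + key;
}

// Hypothetical helper: "<ISO year>.<ISO week>" for a given calendar date.
std::string isoWeekVersion(int year, int month, int day)
{
    std::tm tm{};
    tm.tm_year = year - 1900;
    tm.tm_mon = month - 1;
    tm.tm_mday = day;
    tm.tm_hour = 12;  // midday keeps the date stable across DST changes
    tm.tm_isdst = -1;
    std::mktime(&tm); // fills in tm_wday and tm_yday, needed by %G and %V

    char isoYear[16];
    char isoWeek[16];
    std::strftime(isoYear, sizeof(isoYear), "%G", &tm);
    std::strftime(isoWeek, sizeof(isoWeek), "%V", &tm);
    // atoi drops the leading zero that %V emits for single-digit weeks.
    return std::string(isoYear) + "." + std::to_string(std::atoi(isoWeek));
}
```

With these helpers, makeMessageId("OpenBMC", 0, 4, "BIOSPOSTCode") gives OpenBMC.0.4.BIOSPOSTCode, and isoWeekVersion(2024, 4, 22) gives 2024.17. Note that the date code depends only on the build date, which is why two branches built in the same week would carry the same version.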

Existing OpenBMC Redfish Registry

There are currently 191 messages defined in the existing Redfish Message Registry at version OpenBMC.0.4.0. Of those, not a single one in the codebase is emitted with the correct version. 96 of those are only emitted by Intel-specific code that is not pulled into any upstreamed machine, 39 are emitted by potentially common code, and 56 are not even referenced in the codebase outside of the bmcweb registry. Of the 39 common messages, half have an equivalent in one of the standard registries that should be leveraged, and many of the others do not have attributes that would facilitate a multi-host configuration, so the registry at a minimum needs to be updated. No part of the current implementation has the capability to handle Redfish Resource URIs.

The proposal therefore is to deprecate the existing registry and replace it with the new generated registries. For repositories that currently emit events in the existing format, we can maintain those call-sites for a time period of 1-2 years.

If this aspect of the proposal is rejected, the YAML format allows mapping from phosphor-dbus-interfaces defined events to the current OpenBMC.0.4.0 registry MessageIds.

Potentially common:

  • phosphor-post-code-manager
    • BIOSPOSTCode (unique)
  • dbus-sensors
    • ChassisIntrusionDetected (unique)
    • ChassisIntrusionReset (unique)
    • FanInserted
    • FanRedundancyLost (unique)
    • FanRedundancyRegained (unique)
    • FanRemoved
    • LanLost
    • LanRegained
    • PowerSupplyConfigurationError (unique)
    • PowerSupplyConfigurationErrorRecovered (unique)
    • PowerSupplyFailed
    • PowerSupplyFailurePredicted (unique)
    • PowerSupplyFanFailed
    • PowerSupplyFanRecovered
    • PowerSupplyPowerLost
    • PowerSupplyPowerRestored
    • PowerSupplyPredictedFailureRecovered (unique)
    • PowerSupplyRecovered
  • phosphor-sel-logger
    • IPMIWatchdog (unique)
    • SensorThreshold* : 8 different events
  • phosphor-net-ipmid
    • InvalidLoginAttempted (unique)
  • entity-manager
    • InventoryAdded (unique)
    • InventoryRemoved (unique)
  • estoraged
    • ServiceStarted
  • x86-power-control
    • NMIButtonPressed (unique)
    • NMIDiagnosticInterrupt (unique)
    • PowerButtonPressed (unique)
    • PowerRestorePolicyApplied (unique)
    • PowerSupplyPowerGoodFailed (unique)
    • ResetButtonPressed (unique)
    • SystemPowerGoodFailed (unique)

Intel-only implementations:

  • intel-ipmi-oem
    • ADDDCCorrectable
    • BIOSPostERROR
    • BIOSRecoveryComplete
    • BIOSRecoveryStart
    • FirmwareUpdateCompleted
    • IntelUPILinkWidthReducedToHalf
    • IntelUPILinkWidthReducedToQuarter
    • LegacyPCIPERR
    • LegacyPCISERR
    • ME* : 29 different events
    • Memory* : 9 different events
    • MirroringRedundancyDegraded
    • MirroringRedundancyFull
    • PCIeCorrectable*, PCIeFatal : 29 different events
    • SELEntryAdded
    • SparingRedundancyDegraded
  • pfr-manager
    • BIOSFirmwareRecoveryReason
    • BIOSFirmwarePanicReason
    • BMCFirmwarePanicReason
    • BMCFirmwareRecoveryReason
    • BMCFirmwareResiliencyError
    • CPLDFirmwarePanicReason
    • CPLDFirmwareResilencyError
    • FirmwareResiliencyError
  • host-error-monitor
    • CPUError
    • CPUMismatch
    • CPUThermalTrip
    • ComponentOverTemperature
    • SsbThermalTrip
    • VoltageRegulatorOverheated
  • s2600wf-misc
    • DriveError
    • InventoryAdded
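
The counts quoted earlier can be cross-checked against the two lists above. The sketch below is only a tally: the array and function names are illustrative, and the per-repository numbers are counted by hand from the lists, with each wildcard entry contributing its stated number of events.

```cpp
#include <array>
#include <cstddef>

// Sum the entries of a fixed-size array at compile time.
template <std::size_t N>
constexpr int sum(const std::array<int, N>& counts)
{
    int total = 0;
    for (int c : counts)
    {
        total += c;
    }
    return total;
}

// "Potentially common" messages per repository.
constexpr std::array<int, 7> commonCounts = {
    1,     // phosphor-post-code-manager
    18,    // dbus-sensors
    1 + 8, // phosphor-sel-logger: IPMIWatchdog + 8 SensorThreshold* events
    1,     // phosphor-net-ipmid
    2,     // entity-manager
    1,     // estoraged
    7,     // x86-power-control
};

// "Intel-only" messages per repository.
constexpr std::array<int, 4> intelCounts = {
    13 + 29 + 9 + 29, // intel-ipmi-oem: 13 named + ME* + Memory* + PCIe*
    8,                // pfr-manager
    6,                // host-error-monitor
    2,                // s2600wf-misc
};

// Messages not referenced outside of the bmcweb registry.
constexpr int unreferenced = 56;

static_assert(sum(commonCounts) == 39);
static_assert(sum(intelCounts) == 96);
static_assert(sum(commonCounts) + sum(intelCounts) + unreferenced == 191);
```

The static assertions hold, so the lists are consistent with the 39, 96, and 191 figures given in the summary paragraph.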

Impacts

  • New APIs are defined for error and event logging. This will deprecate existing phosphor-logging APIs, with a time to migrate, for error reporting.

  • The design should improve performance by eliminating the regular parsing of the systemd journal. The design may decrease performance by allowing the number of error and event logs to be dramatically increased, which has an impact on file system utilization and the potential for DBus impacts on some services such as ObjectMapper.

  • Backwards compatibility and documentation should be improved by the automatic generation of the Redfish Message Registry corresponding to all error and event reports.

Organizational

  • Does this proposal require a new repository?
    • No
  • Who will be the initial maintainer(s) of this repository?
    • N/A
  • Which repositories are expected to be modified to execute this design?
    • sdbusplus
    • phosphor-dbus-interfaces
    • phosphor-logging
    • bmcweb
    • Any repository creating an error or event.

Testing

  • Unit tests will be written in sdbusplus and phosphor-logging for the error and event generation, creation APIs, and to provide coverage on any changes to the Logging.Entry object management.

  • Unit tests will be written for bmcweb for basic Logging.Entry transformation and Message Registry generation.

  • Integration tests should be leveraged (and enhanced as necessary) from openbmc-test-automation to cover the end-to-end error creation and Redfish reporting.