commit | ed0af21ca6092a7ceb19660b69b76a5c304efcb0 | |
---|---|---|
author | Aditya Kurdunkar <akurdunkar@nvidia.com> | Wed Jun 11 04:38:52 2025 +0530 |
committer | Ed Tanous <ed@tanous.net> | Thu Jul 10 15:01:22 2025 +0000 |
tree | aa28efd09d908cee935cdb23d964a3c64483fb56 | |
parent | bd815c7f79301fe3932093bfd799122879122680 | |
gpu: add support for per EID request queuing

The Nvidia Extension of OCP MCTP VDM Protocol specifies that there should be only one outstanding request message to a GPU Device implementing the VDM protocol. This introduces a requirement for request queuing per EID. This patch implements the same.

This patch renames the MctpRequester to Requester and introduces a new QueuingRequester that composes on top of the Requester and introduces per EID queuing. Each call to `sendRecvMsg` now enqueues the request (instead of sending it immediately). If there is no ongoing request the requester will send the request out right away. Otherwise the requester waits for the ongoing request to finish before sending out the previously enqueued request. This ensures the serialization of the requests and makes sure that there is only one request "in flight" at a time.

For minimal/no client changes, QueuingRequester is type aliased to MctpRequester.

Tested.

Build an image for gb200nvl-obmc machine with the following patches cherry picked. These patches are needed to enable the mctp stack.

https://gerrit.openbmc.org/c/openbmc/openbmc/+/79422

Pick the following changes (in order) that enable multiple GPU sensors:

'''
https://gerrit.openbmc.org/c/openbmc/dbus-sensors/+/79970
https://gerrit.openbmc.org/c/openbmc/dbus-sensors/+/80031
https://gerrit.openbmc.org/c/openbmc/dbus-sensors/+/80078
https://gerrit.openbmc.org/c/openbmc/dbus-sensors/+/80099
https://gerrit.openbmc.org/c/openbmc/dbus-sensors/+/80566
https://gerrit.openbmc.org/c/openbmc/dbus-sensors/+/80567
'''

Check if all sensors are available on redfish.
'''
~ % curl -s -k -u 'root:0penBmc' https://10.137.203.137/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/
{
  "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors",
  "@odata.type": "#SensorCollection.SensorCollection",
  "Description": "Collection of Sensors for this Chassis",
  "Members": [
    {
      "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/energy_NVIDIA_GB200_GPU_0_Energy_0"
    },
    {
      "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/power_NVIDIA_GB200_GPU_0_Power_0"
    },
    {
      "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_0"
    },
    {
      "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_1"
    },
    {
      "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/voltage_NVIDIA_GB200_GPU_0_Voltage_0"
    }
  ],
  "Members@odata.count": 5,
  "Name": "Sensors"
}
'''

Check Individual Sensor Updates.

'''
curl -s -k -u 'root:0penBmc' https://10.137.203.245/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_0
{
  "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_0",
  "@odata.type": "#Sensor.v1_2_0.Sensor",
  "Id": "temperature_NVIDIA_GB200_GPU_0_TEMP_0",
  "Name": "NVIDIA GB200 GPU 0 TEMP 0",
  "Reading": 27.71875,
  "ReadingRangeMax": 127.0,
  "ReadingRangeMin": -128.0,
  "ReadingType": "Temperature",
  "ReadingUnits": "Cel",
  "Status": {
    "Health": "OK",
    "State": "Enabled"
  }
}

curl -s -k -u 'root:0penBmc' https://10.137.203.245/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_1
{
  "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_1",
  "@odata.type": "#Sensor.v1_2_0.Sensor",
  "Id": "temperature_NVIDIA_GB200_GPU_0_TEMP_1",
  "Name": "NVIDIA GB200 GPU 0 TEMP 1",
  "Reading": 57.28125,
  "ReadingRangeMax": 127.0,
  "ReadingRangeMin": -128.0,
  "ReadingType": "Temperature",
  "ReadingUnits": "Cel",
  "Status": {
    "Health": "OK",
    "State": "Enabled"
  }
}

curl -s -k -u 'root:0penBmc' https://10.137.203.245/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/power_NVIDIA_GB200_GPU_0_Power_0
{
  "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/power_NVIDIA_GB200_GPU_0_Power_0",
  "@odata.type": "#Sensor.v1_2_0.Sensor",
  "Id": "power_NVIDIA_GB200_GPU_0_Power_0",
  "Name": "NVIDIA GB200 GPU 0 Power 0",
  "Reading": 27.468,
  "ReadingRangeMax": 4294967.295,
  "ReadingRangeMin": 0.0,
  "ReadingType": "Power",
  "ReadingUnits": "W",
  "Status": {
    "Health": "OK",
    "State": "Enabled"
  }
}

curl -s -k -u 'root:0penBmc' https://10.137.203.245/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/energy_NVIDIA_GB200_GPU_0_Energy_0
{
  "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/energy_NVIDIA_GB200_GPU_0_Energy_0",
  "@odata.type": "#Sensor.v1_2_0.Sensor",
  "Id": "energy_NVIDIA_GB200_GPU_0_Energy_0",
  "Name": "NVIDIA GB200 GPU 0 Energy 0",
  "Reading": 45058.545,
  "ReadingRangeMax": 1.8446744073709552e+16,
  "ReadingRangeMin": 0.0,
  "ReadingType": "EnergyJoules",
  "ReadingUnits": "J",
  "Status": {
    "Health": "OK",
    "State": "Enabled"
  }
}

curl -s -k -u 'root:0penBmc' https://10.137.203.245/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/voltage_NVIDIA_GB200_GPU_0_Voltage_0
{
  "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/voltage_NVIDIA_GB200_GPU_0_Voltage_0",
  "@odata.type": "#Sensor.v1_2_0.Sensor",
  "Id": "voltage_NVIDIA_GB200_GPU_0_Voltage_0",
  "Name": "NVIDIA GB200 GPU 0 Voltage 0",
  "Reading": 0.735,
  "ReadingRangeMax": 4294.967295,
  "ReadingRangeMin": 0.0,
  "ReadingType": "Voltage",
  "ReadingUnits": "V",
  "Status": {
    "Health": "OK",
    "State": "Enabled"
  }
}
'''

Change-Id: Ic3b892ef2c76c4c703aa55f5b2a66c22a5d71bdf
Signed-off-by: Aditya Kurdunkar <akurdunkar@nvidia.com>
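The per EID queuing described in the commit message can be illustrated with a small sketch. This is not the patch's C++ code: the class name, method names, and the synchronous `send` callable below are all hypothetical, chosen only to show the idea that the head of each EID's queue is the single request "in flight" and that the next request goes out only once a response arrives.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple

# Hypothetical types and names; the real implementation is asynchronous C++.
ResponseCallback = Callable[[bytes], None]


class QueuingRequesterSketch:
    """Allows at most one outstanding request per EID."""

    def __init__(self, send: Callable[[int, bytes], None]) -> None:
        # `send` stands in for the underlying requester that puts one
        # request on the wire.
        self._send = send
        # Per-EID FIFO; the head entry is the request currently in flight.
        self._queues: Dict[int, Deque[Tuple[bytes, ResponseCallback]]] = (
            defaultdict(deque)
        )

    def send_recv_msg(
        self, eid: int, request: bytes, on_response: ResponseCallback
    ) -> None:
        queue = self._queues[eid]
        queue.append((request, on_response))
        if len(queue) == 1:
            # Nothing was in flight for this EID, so send right away.
            self._send(eid, request)

    def on_response(self, eid: int, response: bytes) -> None:
        queue = self._queues[eid]
        if not queue:
            return  # Unexpected response; nothing was in flight.
        _, callback = queue.popleft()
        # Start the next queued request before running the callback, so a
        # callback that enqueues more work cannot cause a double send.
        if queue:
            next_request, _ = queue[0]
            self._send(eid, next_request)
        callback(response)
```

Two requests to the same EID are serialized in this sketch, while a request to a different EID is sent immediately because each EID has its own queue.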
dbus-sensors is a collection of sensor applications that provide the xyz.openbmc_project.Sensor collection of interfaces. They read sensor values from hwmon, d-bus, or direct driver access to provide readings. Some advanced non-sensor features such as fan presence, pwm control, and automatic cpu detection (x86) are also supported.
runtime re-configurable from d-bus (entity-manager or the like)
isolated: each sensor type is isolated into its own daemon, so a bug in one sensor is unlikely to affect another, and single sensor modifications are possible
async single-threaded: uses sdbusplus/asio bindings
multiple data inputs: hwmon, d-bus, direct driver access
A typical dbus-sensors object supports the following d-bus interfaces:
Path
/xyz/openbmc_project/sensors/<type>/<sensor_name>

Interfaces
xyz.openbmc_project.Sensor.Value
xyz.openbmc_project.Sensor.Threshold.Critical
xyz.openbmc_project.Sensor.Threshold.Warning
xyz.openbmc_project.State.Decorator.Availability
xyz.openbmc_project.State.Decorator.OperationalStatus
xyz.openbmc_project.Association.Definitions
The sensor interface collection is described in phosphor-dbus-interfaces.
Consumers of these interfaces include Redfish, Phosphor-Pid-Control, and IPMI SDR.
dbus-sensor daemons are reactors that dynamically create and update sensor configuration when the system configuration is updated.
Using asio timers and async calls, dbus-sensor daemons read sensor values and check thresholds periodically. PropertiesChanged signals are broadcast for other services to consume when a value or threshold status changes. OperationalStatus is set to false if the sensor is determined to be faulty.
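The read, check, and notify cycle can be sketched roughly as follows. This is an illustration in Python's asyncio, not the daemons' asio code: `poll_sensor`, `threshold_status`, the two upper thresholds, and the `on_change` callback (standing in for a PropertiesChanged signal) are all assumptions made for the example.

```python
import asyncio
from typing import Callable, Optional


def threshold_status(value: float, warning_high: float, critical_high: float) -> str:
    """Classify a reading against two hypothetical upper thresholds."""
    if value >= critical_high:
        return "critical"
    if value >= warning_high:
        return "warning"
    return "ok"


async def poll_sensor(
    read_value: Callable[[], float],
    on_change: Callable[[float, str], None],
    warning_high: float,
    critical_high: float,
    interval_s: float,
    iterations: int,
) -> None:
    """Read periodically and notify only when value or status changes."""
    last_value: Optional[float] = None
    last_status: Optional[str] = None
    for _ in range(iterations):
        value = read_value()
        status = threshold_status(value, warning_high, critical_high)
        if value != last_value or status != last_status:
            # Stand-in for emitting a PropertiesChanged signal.
            on_change(value, status)
            last_value, last_status = value, status
        # Timer-driven wait; other tasks on the loop can run meanwhile.
        await asyncio.sleep(interval_s)
```

A caller would run it with `asyncio.run(poll_sensor(...))`. A real daemon loops until shutdown; the bounded `iterations` argument here only keeps the sketch easy to exercise.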
A simple sensor example can be found in entity-manager examples.
Sensor devices are described using Exposes records in a configuration file. The Name and Type fields are required. Different sensor types have different fields. Refer to the entity-manager schema for the complete list.
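As a rough illustration of the "Name and Type are required" rule, the sketch below checks each Exposes record of an already parsed configuration. The function name and the shape of the error strings are hypothetical, and it does not replace validation against the entity-manager schema.

```python
from typing import Any, Dict, List

# The two fields the text above says every Exposes record must carry.
REQUIRED_FIELDS = ("Name", "Type")


def missing_required_fields(config: Dict[str, Any]) -> List[str]:
    """Return one message per required field missing from an Exposes record."""
    problems: List[str] = []
    for position, record in enumerate(config.get("Exposes", [])):
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append(f"Exposes[{position}] is missing {field}")
    return problems
```

An empty list means every record carries both fields. Type-specific fields such as "Index" are not checked here.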
ADC sensors are sensors based on an Analog to Digital Converter. They are read via the Linux kernel Industrial I/O subsystem (IIO).
One of the more common use cases within OpenBMC is for reading these sensors from the ADC on the Aspeed ASTXX cards.
To use the ADC sensor feature within OpenBMC you must first define and enable it within the kernel device tree.
When using a common OpenBMC device like the AST2600 you will find "adc0" and "adc1" sections in the aspeed-g6.dtsi file. These are disabled by default, so in your system-specific dts you would enable and configure what you want with something like this:
iio-hwmon {
    compatible = "iio-hwmon";
    io-channels = <&adc0 0>;
    ...
};

&adc0 {
    status = "okay";
    ...
};

&adc1 {
    status = "okay";
    ...
};
Note that this is not meant to be an exhaustive guide to the nuances of configuring a device tree, but rather to point users in the general direction.
You will then create an entity-manager configuration file that is of type "ADC". A very simple example would look like this:

"Index": 0,
"Name": "P12V",
"PowerState": "Always",
"ScaleFactor": 1.0,
"Type": "ADC"
When your system is booted, an "in0_input" file will be created within the hwmon subsystem (/sys/class/hwmon/hwmonX). The adcsensor application will scan d-bus for any ADC entity-manager objects, look up their "Index" value, and try to match that with the hwmon inY_input files. When it finds a match, it will create a d-bus sensor under the xyz.openbmc_project.ADCSensor service. The sensor will be periodically updated based on readings from the hwmon file.
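The Index-to-file matching step can be sketched as below. This is not the adcsensor source: `find_input_file` is a hypothetical helper, and the `offset` parameter exists because the exact relation between "Index" and the Y in inY_input is an assumption here. The default of 0 follows the text above, where Index 0 pairs with in0_input.

```python
import re
from pathlib import Path
from typing import Optional

# Matches hwmon voltage input files such as in0_input or in12_input.
_INPUT_FILE = re.compile(r"^in(\d+)_input$")


def find_input_file(hwmon_dir: Path, index: int, offset: int = 0) -> Optional[Path]:
    """Return the inY_input file where Y == index + offset, or None."""
    wanted = index + offset
    for candidate in sorted(hwmon_dir.iterdir()):
        match = _INPUT_FILE.match(candidate.name)
        if match and int(match.group(1)) == wanted:
            return candidate
    return None
```

On a target this might be called as `find_input_file(Path("/sys/class/hwmon/hwmon0"), 0)`, where `hwmon0` is only an example since the hwmonX number varies between systems. A `None` result corresponds to the case where no sensor is created for that configuration entry.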