Update with current implementation
Update the document to match the current implementation and fix some of the
basic config field names.
Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
Change-Id: Ibec30f2d14607065254a60990b7bb436f808c5b8
diff --git a/designs/bmc-health-monitor.md b/designs/bmc-health-monitor.md
index 6ff5294..f94b8c1 100644
--- a/designs/bmc-health-monitor.md
+++ b/designs/bmc-health-monitor.md
@@ -12,10 +12,10 @@
## Problem Description
The problem is to monitor the health of a system with a BMC so we have some
-means to make sure the BMC is working correctly. Set of monitored metrics may
-include CPU and memory utilization, uptime, free disk space, I2C bus stats,
-and so on. Actions can be taken based on monitoring data to correct the BMC’s
-state.
+means to make sure the BMC is working correctly. Users can instantly retrieve
+the metrics data specified in the configuration. The set of monitored metrics
+may include CPU and memory utilization, uptime, free disk space, I2C bus stats,
+and so on. Actions can be taken based on monitoring data to correct the BMC’s state.
For this purpose, there may exist a metric producer (the subject of discussion
of this document), and a metric consumer (a program that makes use of health
@@ -93,47 +93,47 @@
For example,
```json
- "cpu" : {
- "frequency" : 1,
- "window_size": 120,
- "threshold":
+ "CPU" : {
+ "Frequency" : 1,
+ "Window_size": 120,
+ "Threshold":
{
- "critical":
+ "Critical":
{
- "value": 90.0,
- "log": true,
- "target": "reboot.target"
+ "Value": 90.0,
+ "Log": true,
+ "Target": "reboot.target"
},
- "warning":
+ "Warning":
{
- "value": 80.0,
- "log": false,
- "target": "systemd unit file"
+ "Value": 80.0,
+ "Log": false,
+ "Target": "systemd unit file"
}
}
},
- "memory" : {
- "frequency" : 1,
- "window_size": 120,
- "threshold":
+ "Memory" : {
+ "Frequency" : 1,
+ "Window_size": 120,
+ "Threshold":
{
- "critical":
+ "Critical":
{
- "value": 90.0,
- "log": true,
- "target": "reboot.target"
+ "Value": 90.0,
+ "Log": true,
+ "Target": "reboot.target"
}
}
}
```
-frequency : It is time in second when these data are collected in regular
+Frequency : The time in seconds at which these data are collected at a regular
interval.
-window_size: This is a value for number of samples taken to average out usage
+Window_size: This is a value for number of samples taken to average out usage
of system rather than taking a spike in usage data.
-log : A boolean value which allows to log an alert. This field is an
+Log : A boolean value which allows to log an alert. This field is an
optional with default value for this in critical is 'true' and in
warning it is 'false'.
-target : This is a systemd target unit file which will called once value
+Target : This is a systemd target unit file which will be called once the value
crosses its threshold and it is optional.
For 2) Metric collection, this will be done by running certain functions
@@ -150,8 +150,8 @@
```
/xyz/openbmc_project
└─/xyz/openbmc_project/sensors
- └─/xyz/openbmc_project/sensors/utilization/bmc_cpu
- └─/xyz/openbmc_project/sensors/utilization/bmc_memory
+ └─/xyz/openbmc_project/sensors/utilization/CPU
+ └─/xyz/openbmc_project/sensors/utilization/Memory
```
## Alternatives Considered
@@ -173,5 +173,9 @@
DBus traffic generated by the Daemon, and the readings on the Daemon’s DBus
objects.
+This can also be tested over IPMI/Redfish using sensor commands, since some
+metrics data are exposed as sensors; for example, CPU and Memory are exposed as
+utilization sensors.
+
To verify the performance aspect, we can stress-test the Daemon’s DBus
interfaces to make sure the interfaces do not cause a high overhead.
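
Reviewer note: to illustrate how the renamed `Window_size` and `Threshold` fields from the JSON example might interact, here is a minimal Python sketch. It is not the daemon's implementation. The `WindowedMetric` class, the `>=` comparison, and the shortened `Window_size` of 3 are assumptions made for illustration. Only the field names and the `Log` defaults come from the document.

```python
import json
from collections import deque

# Hypothetical config fragment; field names follow the example in the diff,
# but Window_size is shortened to 3 to keep the illustration small.
CONFIG = json.loads("""
{
  "CPU": {
    "Frequency": 1,
    "Window_size": 3,
    "Threshold": {
      "Critical": {"Value": 90.0, "Log": true, "Target": "reboot.target"},
      "Warning": {"Value": 80.0}
    }
  }
}
""")

# Defaults stated in the document: Log is true for Critical, false for Warning.
DEFAULT_LOG = {"Critical": True, "Warning": False}


class WindowedMetric:
    """Hypothetical helper: averages the last Window_size samples."""

    def __init__(self, name, cfg):
        self.name = name
        # deque with maxlen drops the oldest sample once the window is full.
        self.samples = deque(maxlen=cfg["Window_size"])
        self.thresholds = cfg.get("Threshold", {})

    def add_sample(self, value):
        """Record one reading and return the current windowed average."""
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)

    def check(self, average):
        """Return (level, log, target) for the most severe crossed threshold.

        Uses >= as the crossing rule, which is an assumption; returns None
        when no threshold is crossed. Target is optional and may be None.
        """
        for level in ("Critical", "Warning"):
            threshold = self.thresholds.get(level)
            if threshold is not None and average >= threshold["Value"]:
                log = threshold.get("Log", DEFAULT_LOG[level])
                return level, log, threshold.get("Target")
        return None


cpu = WindowedMetric("CPU", CONFIG["CPU"])
for reading in (70.0, 80.0, 96.0):
    average = cpu.add_sample(reading)
print(average, cpu.check(average))
```

In this sketch a single 96% reading does not reach the critical threshold, because the windowed average over the three samples is 82%, which only crosses the warning level. That is the spike-smoothing behavior the `Window_size` description refers to.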