nvme_manager: add configurable SMBus error retry
An NVMe drive is sometimes too busy to respond to SMBus commands. The
resulting SMBus error marks the sensor as failed, which triggers fan
failsafe.
Add a configurable SMBus error retry count so that a single SMBus error
no longer causes a sensor failure.
readNvmeData() may return without creating the NvmeSSD object when an
SMBus error occurs at service startup. Check that the NvmeSSD object
exists before calling setSensorAvailability() to avoid a service crash.
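The existence check can be sketched in isolation as follows. This is a
minimal illustration, not the code in this patch: FakeSsd, SsdMap and
setAvailabilityIfPresent() are hypothetical names, and the map value type
(a shared_ptr) is an assumption about how the objects are stored.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for the NvmeSSD D-Bus object.
struct FakeSsd
{
    bool available = false;
    void setSensorAvailability(bool value)
    {
        available = value;
    }
};

// Assumed container shape: drive index -> owning pointer.
using SsdMap = std::map<std::string, std::shared_ptr<FakeSsd>>;

// Returns true if the object exists and was updated, false if the call
// was skipped because no object has been created for this index yet.
bool setAvailabilityIfPresent(SsdMap& nvmes, const std::string& index,
                              bool isPwrGood)
{
    auto iter = nvmes.find(index); // single lookup, reused below
    if (iter == nvmes.end())
    {
        // Object was never created, e.g. the first read failed at startup.
        return false;
    }
    // A stored null pointer would be a programming error.
    assert(iter->second != nullptr);
    iter->second->setSensorAvailability(isPwrGood);
    return true;
}
```

Keeping the iterator from find() avoids a second lookup and makes it
impossible to dereference end().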
If maxSmbusErrorRetry is not present in the config, the default is 0
(no retries).
Example that sets the maximum number of SMBus error retries to 3:
{
    "config": [
        ...
    ],
    "threshold": [
        ...
    ],
    "maxSmbusErrorRetry": 3
}
Tested on Bletchley:
```
root@bletchley:~# journalctl _PID=4790 | grep -v SendSmbusRWCmdRAW
Nov 22 18:57:33 bletchley nvme_main[4790]: Send command code 0 fail!
Nov 22 18:57:33 bletchley nvme_main[4790]: getNVMeInfobyBusID failed, retry...
Nov 22 18:57:36 bletchley nvme_main[4790]: getNVMeInfobyBusID failed, retry...
Nov 22 18:57:39 bletchley nvme_main[4790]: getNVMeInfobyBusID failed, retry...
Nov 22 18:57:42 bletchley nvme_main[4790]: SSD plug.
Nov 22 18:57:42 bletchley nvme_main[4790]: Drive status is good but can not get data.
```
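The log above appears consistent with a per-bus error counter: with
maxSmbusErrorRetry set to 3, three failed polls are skipped and the fourth
is handled normally. A minimal standalone sketch of that counting logic
follows. RetryTracker, shouldSkip() and demo() are hypothetical names used
only for illustration; maxRetry and errCnt stand in for maxSmbusErrorRetry
and nvmeSmbusErrCnt.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-in for the retry bookkeeping added by this patch.
struct RetryTracker
{
    unsigned int maxRetry;              // role of maxSmbusErrorRetry
    std::map<int, unsigned int> errCnt; // role of nvmeSmbusErrCnt

    // Returns true when the caller should skip this polling cycle and
    // try again on the next one; false when the result (success or
    // failure) should be processed as usual.
    bool shouldSkip(int busID, bool success)
    {
        if (success)
        {
            errCnt[busID] = 0; // any success resets the counter
            return false;
        }
        if (errCnt[busID] < maxRetry)
        {
            ++errCnt[busID]; // tolerate this error, retry next cycle
            return true;
        }
        return false; // retries exhausted, report the failure
    }
};

// Four consecutive failures with maxRetry = 3: the first three polls are
// skipped, the fourth is treated as a real failure.
inline void demo()
{
    RetryTracker tracker{3, {}};
    assert(tracker.shouldSkip(1, false));
    assert(tracker.shouldSkip(1, false));
    assert(tracker.shouldSkip(1, false));
    assert(!tracker.shouldSkip(1, false));
}
```

With maxRetry set to 0 the tracker never skips, which matches the default
behavior when maxSmbusErrorRetry is absent from the config.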
Signed-off-by: Potin Lai <potin.lai@quantatw.com>
Change-Id: Ibc95efc53a212e55dcd5c5cfa7a654839a13342d
diff --git a/nvme_manager.cpp b/nvme_manager.cpp
index e641edc..54ca992 100644
--- a/nvme_manager.cpp
+++ b/nvme_manager.cpp
@@ -14,6 +14,7 @@
#include "i2c.h"
#define MONITOR_INTERVAL_SECONDS 1
+#define MAX_SMBUS_ERROR_RETRY 0
#define NVME_SSD_SLAVE_ADDRESS 0x6a
#define NVME_SSD_VPD_SLAVE_ADDRESS 0x53
#define GPIO_BASE_PATH "/sys/class/gpio/gpio"
@@ -423,6 +424,8 @@
std::vector<Json> thresholds = data.value("threshold", empty);
monitorIntervalSec =
data.value("monitorIntervalSec", MONITOR_INTERVAL_SECONDS);
+ maxSmbusErrorRetry =
+ data.value("maxSmbusErrorRetry", MAX_SMBUS_ERROR_RETRY);
if (!thresholds.empty())
{
@@ -556,6 +559,23 @@
auto success = getNVMeInfobyBusID(config.busID, nvmeData);
auto iter = nvmes.find(config.index);
+ if (success)
+ {
+ nvmeSmbusErrCnt[config.busID] = 0;
+ }
+ else
+ {
+ if (nvmeSmbusErrCnt[config.busID] < maxSmbusErrorRetry)
+ {
+ // skip this time if error count less than maxSmbusErrorRetry
+ nvmeSmbusErrCnt[config.busID]++;
+ log<level::INFO>("getNVMeInfobyBusID failed, retry...",
+ entry("INDEX=%s", config.index.c_str()),
+ entry("ERRCNT=%u", nvmeSmbusErrCnt[config.busID]));
+ return;
+ }
+ }
+
// can not find. create dbus
if (iter == nvmes.end())
{
@@ -676,7 +696,10 @@
// (To make thermal loop know that the sensor reading
// is invalid).
readNvmeData(config);
- nvmes.find(config.index)->second->setSensorAvailability(isPwrGood);
+ if (nvmes.find(config.index) != nvmes.end())
+ {
+ nvmes.find(config.index)->second->setSensorAvailability(isPwrGood);
+ }
}
}
} // namespace nvme