nvme_manager: add configurable smbus error retry
An NVMe drive is sometimes too busy to respond to smbus commands, which
triggers fan failsafe because the sensor is marked as failed (smbus error).
Add support for configurable smbus error retries so that a single smbus
error does not cause a sensor failure.
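A minimal sketch of how such a per-bus retry counter might behave. The
SmbusRetryTracker class and its method names are hypothetical and only
illustrate the idea; the member names maxSmbusErrorRetry and
nvmeSmbusErrCnt mirror the ones added in nvme_manager.hpp, but the logic
here is an assumption, not the code in this change:

```cpp
#include <cstdint>
#include <limits>
#include <unordered_map>

// Hypothetical helper (not part of this patch): counts consecutive smbus
// errors per bus and reports when the retry budget is exhausted.
class SmbusRetryTracker
{
  public:
    explicit SmbusRetryTracker(uint16_t maxRetry) :
        maxSmbusErrorRetry(maxRetry)
    {}

    // Record a failed read on busID. Returns true once the number of
    // consecutive errors exceeds maxSmbusErrorRetry, meaning the sensor
    // should now be treated as failed.
    bool recordError(int busID)
    {
        uint16_t& count = nvmeSmbusErrCnt[busID];
        // Saturate instead of wrapping around on very long error runs.
        if (count < std::numeric_limits<uint16_t>::max())
        {
            ++count;
        }
        return count > maxSmbusErrorRetry;
    }

    // A successful read resets the error count for that bus.
    void recordSuccess(int busID)
    {
        nvmeSmbusErrCnt[busID] = 0;
    }

  private:
    uint16_t maxSmbusErrorRetry;
    std::unordered_map<int, uint16_t> nvmeSmbusErrCnt;
};
```

With a limit of 3, the first three consecutive errors on a bus are
tolerated and the fourth reports a failure. With a limit of 0, the first
error reports a failure immediately.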
readNvmeData() may return without creating the NvmeSSD object when an
smbus error occurs at service startup. Add an extra NvmeSSD object check
before setSensorAvailability() to avoid service crashes.
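The kind of guard described above could look roughly like the sketch
below. The NvmeSSD struct is a stand-in for the real class, and the
trySetSensorAvailability() helper and the map layout are assumptions made
for illustration only:

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Stand-in for the real NvmeSSD class; only what the sketch needs.
struct NvmeSSD
{
    bool available = true;

    void setSensorAvailability(bool avail)
    {
        available = avail;
    }
};

// Hypothetical helper: update availability only if the object exists.
// Returns false when no NvmeSSD has been created for the given index,
// instead of dereferencing a missing entry.
bool trySetSensorAvailability(
    const std::unordered_map<std::string, std::shared_ptr<NvmeSSD>>& nvmes,
    const std::string& index, bool avail)
{
    auto iter = nvmes.find(index);
    if (iter == nvmes.end() || !iter->second)
    {
        return false;
    }
    iter->second->setSensorAvailability(avail);
    return true;
}
```

Looking the entry up with find() avoids both a null dereference and the
accidental insertion of an empty entry that operator[] would cause.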
The default is 0 retries if maxSmbusErrorRetry does not exist in the config.
Example of setting the maximum smbus error retry count to 3:
{
    "config": [
        ...
    ],
    "threshold": [
        ...
    ],
    "maxSmbusErrorRetry": 3
}
Tested on Bletchley:
```
root@bletchley:~# journalctl _PID=4790 | grep -v SendSmbusRWCmdRAW
Nov 22 18:57:33 bletchley nvme_main[4790]: Send command code 0 fail!
Nov 22 18:57:33 bletchley nvme_main[4790]: getNVMeInfobyBusID failed, retry...
Nov 22 18:57:36 bletchley nvme_main[4790]: getNVMeInfobyBusID failed, retry...
Nov 22 18:57:39 bletchley nvme_main[4790]: getNVMeInfobyBusID failed, retry...
Nov 22 18:57:42 bletchley nvme_main[4790]: SSD plug.
Nov 22 18:57:42 bletchley nvme_main[4790]: Drive status is good but can not get data.
```
Signed-off-by: Potin Lai <potin.lai@quantatw.com>
Change-Id: Ibc95efc53a212e55dcd5c5cfa7a654839a13342d
diff --git a/nvme_manager.hpp b/nvme_manager.hpp
index ffdb20a..08cfc76 100644
--- a/nvme_manager.hpp
+++ b/nvme_manager.hpp
@@ -158,6 +158,10 @@
/** @brief Monitor interval in second */
size_t monitorIntervalSec;
+ /** @brief Maximum smbus error retry */
+ uint16_t maxSmbusErrorRetry;
+ /** @brief Map of each NVMe smbus error count */
+ std::unordered_map<int, uint16_t> nvmeSmbusErrCnt;
};
} // namespace nvme
} // namespace phosphor