Increase StartLimitIntervalSec to 240s
DefaultTimeoutStartSec is 90s. If a service repeatedly hits
this timeout, StartLimitIntervalSec must be large enough to
cover that worst-case scenario. Otherwise the start limit is
never reached within the interval and the service that is
timing out gets restarted indefinitely.
This means it needs to be set to at least:
  StartLimitBurst * DefaultTimeoutStartSec +
  StartLimitBurst * <worst-case processing time> (30s)
which currently is 2x90 + 2x30 = 240s.
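The arithmetic above can be sketched as a small helper. This is
only an illustration of the formula in this commit message; the
function name and parameter names are hypothetical and are not
part of systemd or of this change.

```python
def min_start_limit_interval(burst: int,
                             timeout_start_sec: int,
                             worst_case_processing_sec: int) -> int:
    """Return the smallest interval (seconds) that still covers `burst`
    start attempts which each run into the start timeout.

    Hypothetical helper: mirrors the formula from the commit message,
    burst * timeout + burst * worst-case processing time.
    """
    return burst * timeout_start_sec + burst * worst_case_processing_sec


# Values taken from this change: StartLimitBurst=2,
# DefaultTimeoutStartSec=90s, worst-case processing time=30s.
interval = min_start_limit_interval(2, 90, 30)
print(f"DefaultStartLimitIntervalSec={interval}s")
```

If DefaultStartLimitBurst or the start timeout is changed later,
the interval would need to be recomputed in the same way.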
Ref: systemd-system.conf(5)
Tested: Verified that a service hitting the 90s start timeout
is no longer restarted after 2 attempts.
Resolves openbmc/openbmc#3379
Change-Id: I12eca7bc23f54d77b4bf0327e44eb042359aaeae
Signed-off-by: Andrew Geissler <geissonator@yahoo.com>
diff --git a/recipes-phosphor/systemd-policy/phosphor-systemd-policy/service-restart-policy.conf b/recipes-phosphor/systemd-policy/phosphor-systemd-policy/service-restart-policy.conf
index 54516c2..17c9e6b 100644
--- a/recipes-phosphor/systemd-policy/phosphor-systemd-policy/service-restart-policy.conf
+++ b/recipes-phosphor/systemd-policy/phosphor-systemd-policy/service-restart-policy.conf
@@ -13,19 +13,23 @@
# restarting once does the job or restarting all 5 times does not help
# and we just end up hitting the 5 limit anyway.
#
-# - Change the StartLimitIntervalSec to 30s
+# - Change the StartLimitIntervalSec to 240s
# The BMC CPU performance is already challenged. When a service is
# failing and a core dump is being generated and collected into a dump,
# it's even more challenged. Recent failures have shown situations where
# the service does not fail again until 15-20 seconds after the initial
# failure which means the default of 10s for this results in the service
-# being restarted indefinitely. Change this to 30s to only allow a service
-# to be restarted StartLimitBurst times within a 30s interval before
-# being put in a permanent fail state.
+# being restarted indefinitely.
+# Another issue that has cropped up recently is that DefaultTimeoutStartSec
+# is 90s. If a service repeatedly hits this timeout, the same problem
+# occurs as noted above. Because of this, StartLimitIntervalSec
+# needs to be StartLimitBurst * DefaultTimeoutStartSec +
+# StartLimitBurst * <worst-case processing time> (30s),
+# which currently is 2x90 + 2x30 = 240s.
#
# See systemd-system.conf(5) for details on the conf files
[Manager]
DefaultRestartSec=1s
DefaultStartLimitBurst=2
-DefaultStartLimitIntervalSec=30s
+DefaultStartLimitIntervalSec=240s