docs: anti-pattern: unexpected errors

This adds an anti-pattern for ineffective error handlers.

Signed-off-by: Joseph Reynolds <joseph-reynolds@charter.net>
Change-Id: Iaf5d8b8100a8720b7cae1b8051888da3035c27aa
diff --git a/anti-patterns.md b/anti-patterns.md
index 2af9901..83ec545 100644
--- a/anti-patterns.md
+++ b/anti-patterns.md
@@ -221,3 +221,97 @@
 
 ### Resolution
 Install OpenBMC applications in /usr/bin/.
+
+## Handling unexpected error codes and exceptions
+
+### Identification
+The anti-pattern is for an application to continue processing after it
+encounters unexpected conditions in the form of error codes and exceptions and
+not capturing the data needed to resolve the problem.
+
+Example C++ code:
+```
+using InternalFailure = sdbusplus::xyz::openbmc_project::Common::Error::InternalFailure;
+try
+{
+  ... use d-Bus APIs...
+}
+catch (InternalFailure& e)
+{
+    phosphor::logging::commit<InternalFailure>();
+}
+```
+
+### Description
+Suppressing unexpected errors can lead an application to incorrect or erratic
+behavior which can affect the service it provides and the overall system.
+
+Writing a log entry instead of a core dump may not give enough information to
+debug a truly unexpected problem, so developers do not get a chance to
+investigate problems and the system's reliability does not improve over time.
+
+### Background
+Programmers want their application to continue processing when it encounters
+conditions it can recover from.  Sometimes they try too hard and continue when
+it is not appropriate.
+
+Programmers also want to log what is happening in the application, so they
+write log entries that give debug data when something goes wrong, typically
+targeted for their use.  They don't consider how the log entry is consumed
+by the BMC administrator or automated service tools.
+
+The `InternalFailure` in the [Phosphor logging README][] is overused.
+
+[Phosphor logging README]: https://github.com/openbmc/phosphor-logging/blob/master/README.md
+
+### Resolution
+
+Several items are needed:
+1. Check all places where a return code or errno value is given.  Strongly
+   consider that a default error handler should throw an exception, for
+   example `std::system_error`.
+2. Have a good reason to handle specific error codes.  See below.
+3. Have a good reason to handle specific exceptions.  Allow other exceptions
+   to propagate.
+4. Document (in terms of impacts to other services) what happens when this
+   service fails, stops, or is restarted.  Use that to inform the recovery
+   strategy.
+
+In the error handler:
+
+1. Consider what data (if any) should be logged.  Determine who will consume
+   the log entry: BMC developers, administrator, or an analysis tool.  Usually
+   the answer is more than one of these.
+
+   The following example log entries use an IPMI request to set network access
+   on, but the user input is invalid.
+
+    - BMC Developer: Reference internal applications, services, pids, etc.
+      the developer would be familiar with.
+
+      Example: `ipmid service successfully processed a network setting packet,
+      however the user input of USB0 is not a valid network interface to
+      configure.`
+
+    - Administrator: Reference the external interfaces of the BMC such as the
+      REST API.  They can respond to feedback about invalid input, or a need
+      to restart the BMC.
+
+      Example: `The network interface of USB0 is not a valid option. Retry the
+      IPMI command with a valid interface.`
+
+    - Analyzer tool: Consider breaking the log down and including several
+      properties which an analyzer can leverage.  For instance, tagging the
+      log with 'Internal' is not helpful.  However, breaking that down into
+      something like [UserInput][IPMI][Network] tells at a quick glance that
+      the input received for configuring the network via an IPMI command was
+      invalid.  Categorization and system impact are key things to focus on
+      when creating logs for an analysis application.
+
+      Example: `[UserInput][IPMI][Network][Config][Warning] Interface USB0 not
+      valid.`
+
+2. Determine if the application can fully recover from the condition.  If not,
+   don't continue.  Allow the system to determine if it writes a core dump or
+   restarts the service.  If there are severe impacts when the service fails,
+   consider using a better error recovery mechanism.