Add designs/uart-mux-support.md
Document the uart mux support design.
Change-Id: I40deb1f5b5f2f5d4386af769730ebfdde525820f
Signed-off-by: Alexander Hansen <alexander.hansen@9elements.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
diff --git a/designs/uart-mux-support.md b/designs/uart-mux-support.md
new file mode 100644
index 0000000..d71a6d9
--- /dev/null
+++ b/designs/uart-mux-support.md
@@ -0,0 +1,488 @@
+# uart-mux-support design
+
+Author: Alexander Hansen <alexander.hansen@9elements.com>
+
+Other contributors: Andrew Jeffery <andrew@codeconstruct.com.au> @arj, Jeremy
+Kerr <jk@ozlabs.org>, Patrick Williams <patrick@stwcx.xyz>
+
+Created: June 17, 2024
+
+## Problem Description
+
+Some hardware configurations feature a UART mux which can be switched via GPIOs.
+To support this configuration, obmc-console needs to provide a method for
+console selection to avoid manually setting GPIOs.
+
+## Background and References
+
+There are already [open changes for obmc-console][obmc-console-uart-mux-series]
+but it has been determined that this feature needs a design document.
+
+[obmc-console-uart-mux-series]:
+ https://gerrit.openbmc.org/c/openbmc/obmc-console/+/71864
+
+The background here is that some design choices may affect other subprojects:
+not by causing regressions, but later, when the mentioned hardware
+configuration needs to be supported in those projects.
+
+## Requirements
+
+- The user can select a console to be muxed.
+
+- Platform policy (whichever service implements it) can select the appropriate
+ console depending on the host state and other information.
+
+- It is clear to whoever is reading the logs of that console when a console was
+ connected or disconnected via mux control. There should be no inexplicable
+ gaps in log files.
+
+- The mux configuration can be specified in a single file.
+
+- Console selection (implies mux control) must be possible from an external
+ application.
+
+The scope of this change is obmc-console and other projects which rely on the
+APIs exposed by it.
+
+The change will not affect users who do not have this hardware configuration.
+
+## Design Considerations
+
+There are a number of choices available for adding mux support into
+obmc-console:
+
+1. What the "connection endpoint" (Unix domain socket, D-Bus object) represents.
+ This could be either:
+
+ 1. The TTY device exposed by Linux
+ 2. The desired downstream mux port
+
+2. How the mux state is controlled. We might control it by any of:
+
+ 1. An out-of-band command (e.g. via a D-Bus method that's somehow associated
+ with the connection endpoint)
+ 2. An in-band command (e.g. introducing an SSH-style escape-sequence)
+ 3. Selecting the mux port based on the endpoint to which the user has
+ connected
+
+3. The circumstances under which we allow the mux state to be changed
+
+ 1. Active connections prevent the mux state from being changed
+ 2. The mux state can always change but will terminate any existing
+ conflicting connections
+ 3. The mux state can always change and has no impact on existing conflicting
+ connections
+
+4. Whether we want the data stream on a given connection to represent:
+ 1. The console IO regardless of the mux state
+ 2. The console IO isolated to a specific mux port
+
+There are constraints on some combinations of these. For instance:
+
+- If the connection endpoint represents the TTY device exposed by Linux (1.1)
+ then we can't select the mux port based on the endpoint to which the user has
+ connected (2.3) as we simply don't have the information required.
+
+- If the connection endpoint represents the desired downstream mux port (1.2)
+ then it doesn't make sense to implement support for an in-band command to
+ change the mux state (2.2) as it's a violation of the abstraction.
+
+- If the connection endpoint represents the desired downstream mux port (1.2)
+ then it can't provide the console IO of another mux port (4.1) as that's
+ contrary to the definition.
+
+With these in mind we end up with the following table of design options:
+
+| ID | Connection Endpoint (1) | Mux Control Defined By (2) | Mux Control Policy (3) | Stream Data (4) |
+| --- | ----------------------- | -------------------------- | ------------------------------------------------ | ----------------- |
+| A | TTY (1.1) | Out-of-band command (2.1) | Active connections prevent mux change (3.1) | Isolated (4.2) |
+| B | TTY | Out-of-band command | Mux change with disconnections (3.2) | Isolated |
+| C | TTY | Out-of-band command | Mux change without disconnections (3.3) | Multiplexed (4.1) |
+| D | TTY | In-band command (2.2) | Mux change without disconnections | Multiplexed |
+| E | Mux port (1.2) | Connection-based (2.3) | Conflicting connections prevent mux change (3.1) | Isolated |
+| F | Mux port | Connection-based | Mux change with disconnections | Isolated |
+| G | Mux port | Connection-based | Mux change without disconnections | Isolated |
+| H | Mux port | Out-of-band command | Conflicting connections prevent mux change | Isolated |
+| I | Mux port | Out-of-band command | Mux change with disconnections | Isolated |
+| J | Mux port | Out-of-band command | Mux change without disconnections | Isolated |
+
+### Scenarios and Use Cases
+
+1. A UART mux selecting between a satellite BMC on a blade and the blade host
+
+ A software update is in progress on the satellite BMC and the mux has been
+ switched to capture the output of whatever the satellite is printing. It is
+ important to log the output of the update process to understand any failures
+ that might result.
+
+ While the satellite BMC update is in progress, a user chooses to connect to
+ the host console.
+
+2. A blade's satellite BMC, CPLD and host are all on separate ports of a UART
+ mux, and relevant output from the blade's boot process must be captured
+
+ The boot process for a blade requires a sequence of actions across its
+ satellite BMC, CPLD and host. Each component contributes critical information
+ about the boot process, which is output on the respective consoles at various
+ points in time.
+
+ For ease of correlation, their output should be logged together.
+
+### Discussion
+
+Scenario 1 is problematic. It highlights the fundamental concern of ownership of
+the mux state. In the scenario the system is in a sensitive state where a
+specific mux configuration is required (to output update progress from the
+satellite BMC), but a user has shown intent for the selection of another (to
+interact with the host console).
+
+What should occur? And does this choice impact how we choose to control the mux?
+
+Taking a connection-based approach to setting the mux state (2.3) will cause the
+user connecting to the host console endpoint to immediately disrupt the update
+progress output from the satellite BMC.
+
+By contrast, by setting the mux state with an out-of-band command (2.1) and not
+on the initiation of a connection (2.3), the user connecting to the host console
+will not immediately disrupt the update progress output from the satellite BMC.
+
+However, we can presume the user is connecting to the host console endpoint for
+a reason. With extra actions, using the out-of-band command interface, they may
+equally choose to switch the mux without regard for the system state, disrupting
+the update progress output from the satellite BMC.
+
+This highlights that the fundamental problem is access to the system by multiple
+users who coordinate neither with each other nor with the system state. The
+question that follows is:
+
+Should it be the responsibility of obmc-console to coordinate otherwise
+un-coordinated users?
+
+This is a question of policy: How those users should be coordinated will likely
+look very different based on concerns such as the role of the platform in a
+larger system, the roles and needs of the users interacting with it, and the
+concrete design of the platform itself.
+
+obmc-console should implement a mechanism to control the mux state, but likely
+shouldn't apply any policy governing access to the muxed consoles.
+
+A further concern for the out-of-band command approach is its interactions with
+other components exposing consoles:
+
+1. The dropbear/obmc-console-client integration exposing consoles via SSH
+2. [bmcweb](https://github.com/openbmc/bmcweb/blob/master/include/obmc_console.hpp)
+3. [phosphor-net-ipmid](https://github.com/openbmc/phosphor-net-ipmid/blob/master/sol/sol_manager.hpp)
+
+With the out-of-band command approach these components have to choose between:
+
+- Provide no capability to change the mux state, instead deferring to the user
+ logging in via SSH to effect the change themselves
+
+- Expose some mechanism for setting the mux state in terms of their own external
+ interfaces
+
+- Assume that a user connecting to the exposed console endpoint wants to select
+ that console if it's behind a mux
+
+The first assumes that SSH is exposed at all and accessible by users who need
+access to the muxed consoles. It's not yet clear whether this is a reasonable
+expectation.
+
+The second assumes that these external interfaces have the capability to model
+the problem. It's not yet clear that this is the case for either of IPMI or
+Redfish, and it's not the case for serial over SSH.
+
+The third implies that we must add capability to all three components to drive
+the out-of-band command interface when they receive a connection for a given
+console. The net result is no behavioural difference from obmc-console
+implementing this itself (2.3), but increased complexity across the system.
+
+## Implementation Considerations
+
+### How are muxed consoles represented on D-Bus?
+
+Every console will have its own D-Bus name, as this is backwards-compatible with
+the current implementation.
+
+Multiple consoles can be represented as a split or unified object tree.
+
+### Tradeoffs of unified vs split object tree on D-Bus
+
+With a split tree, it is not clear which consoles belong to the same UART mux;
+with a unified tree, this is clear.
+
+With a unified tree, one console is reachable via the D-Bus name of another,
+effectively creating multiple ways to reach the same console.
+
+Example:
+
+```
+busctl call xyz.openbmc_project.Console.host1 \
+/xyz/openbmc_project/console/host2 \
+xyz.openbmc_project.Console.Access Connect
+```
+
+So a choice has to be made about how to represent multiple consoles on D-Bus,
+and what information needs to be exposed to other subprojects.
+
+Unified Tree:
+
+```
+busctl tree --user xyz.openbmc_project.Console.host1
+└─/xyz
+ └─/xyz/openbmc_project
+ └─/xyz/openbmc_project/console
+ ├─/xyz/openbmc_project/console/host1
+ └─/xyz/openbmc_project/console/host2
+```
+
+Split Tree:
+
+```
+busctl tree --user xyz.openbmc_project.Console.host1
+└─/xyz
+ └─/xyz/openbmc_project
+ └─/xyz/openbmc_project/console
+ └─/xyz/openbmc_project/console/host1
+
+busctl tree --user xyz.openbmc_project.Console.host2
+└─/xyz
+ └─/xyz/openbmc_project
+ └─/xyz/openbmc_project/console
+ └─/xyz/openbmc_project/console/host2
+```
+
+The choice of representation impacts how the mux can be described on D-Bus,
+which is necessary if the out-of-band command strategy (2.1) is chosen. Two
+possibilities for exposing an out-of-band mux control on D-Bus are:
+
+1. Implement an interface on each console object that defines a boolean `Active`
+ property, and an `Activate()` method. The `Activate()` method, by nature of
+ being implemented on the console object, has all the context it needs to
+ switch the mux without requiring caller-supplied parameters. The `Active`
+ property is `true` when the mux is configured for the console of interest,
+ and `false` otherwise. A `PropertiesChanged` D-Bus signal for the `Active`
+ property may alert local users to changes of mux state.
+
+2. Implement a `Mux` interface on an object common to all consoles exposed by
+ the mux. The `Mux` interface might have a writable string `Selected` property
+ that represents the state of the mux and provides a mechanism to switch it to
+ a given console.
+
+These have both been [discussed on an existing patch to
+phosphor-dbus-interfaces][pdi-uart-mux-control-interface].
+
+[pdi-uart-mux-control-interface]:
+ https://gerrit.openbmc.org/c/openbmc/phosphor-dbus-interfaces/+/71878/comment/dd34b099_66dbc49e/
+
+The second approach is quite explicit - directly representing the mux state
+makes it easy to discover the state of the system. However, it motivates the
+choice of a unified object tree to provide a common object path to host the
+`Mux` interface (e.g. at `/xyz/openbmc_project/console`). This is desired to
+avoid an alternative instance of the "multiple representations of one thing"
+problem highlighted in the discussion of claiming multiple bus names for the
+unified object tree: If the tree isn't unified, this `Mux` interface would have
+to be represented and synchronised on objects across multiple D-Bus connections.
+
+The first approach doesn't have this limitation. However, it does have the
+trade-off previously mentioned, that it's unclear how any of the consoles in the
+system are related, and what the impact might be of activating any one of them.
+
+Choosing a strategy for D-Bus representation is required if we add to the D-Bus
+API, i.e. with the out-of-band command design point (2.1). However, the choice
+becomes more of an implementation detail if either of design options 2.2 or 2.3
+is selected. The choice in those cases is instead motivated by the level of
+clarity we desire in describing the relationships between consoles.
+
+## Pruning the Design Decision Tree
+
+To help shape the choices here, we have the existing behaviours of obmc-console
+[discussed on the PDI patch][pdi-uart-mux-control-interface]:
+
+1. We already have support for concurrent console server instances
+
+2. Concurrent console support is implemented as one obmc-console-server process
+ per Linux TTY device
+
+3. As each Linux TTY device is paired with its obmc-console-server process, each
+ obmc-console-server D-Bus connection needs a unique name
+
+4. We use the unique console-ids to name global resources, including both the
+ D-Bus connection and the instance's Unix domain socket.
+
+As in the linked discussion, given the `console-id` value really represents
+what's at the remote end of the BMC's TTY device for regular unmuxed consoles,
+it stands to reason that we should continue this strategy for muxed consoles.
+Taking this approach avoids adding a new endpoint ABI to obmc-console and
+eliminates design options A-D inclusive.
+
+Further, on the basis of frustrating behaviour in the face of lingering network
+connections, preventing mux changes on the grounds of an existing connection
+seems like a bad path forward.
+
+This leaves us with design options `F`, `G`, `I`, and `J`, which are
+differentiated by how the mux is switched, and its effect on already-connected
+clients.
+
+Concentrating on how the mux is switched, based on the discussion about the
+D-Bus representation above, the discussion on the PDI patch, and the impact on
+related applications, it's reasonable to say there are some complications with
+the out-of-band command method (2.1).
+
+By contrast we can consider the alternative: We make the mux state reflect the
+endpoint of the most recent connection. This has the benefit of functioning for
+both the Unix domain socket and D-Bus access with no further effort. Neither
+bmcweb nor phosphor-net-ipmid need be patched. The choice also eliminates the
+D-Bus complications mentioned above as there's no need for the additional D-Bus
+interface.
+
+This reasoning leaves us the choice of design options `F` and `G`.
+
+`F` and `G` are differentiated by whether or not we drop connections on
+endpoints that are not the endpoint selected by the mux. There's been some back
+and forth on that subject elsewhere[[1][drop-connections-discussion-1]]
+[[2][drop-connections-discussion-2]], but it seems that not disconnecting
+clients is effectively a worse implementation of design option `C`, which we've
+already eliminated. It's worse than `C` because instead of 1 connection we could
+have `N` connections for `N` mux ports, `(N - 1)` of which are idle. Not only
+that, but the `(N - 1)` connections are effectively zombies, as they have no way
+to switch the mux back to their associated port without establishing yet another
+connection. It follows that if we're establishing a subsequent connection in
+order to switch the mux we may as well disconnect the existing session, in which
+case it may as well have been disconnected when the mux switched away to begin
+with[^1].
+
+[drop-connections-discussion-1]:
+ https://gerrit.openbmc.org/c/openbmc/obmc-console/+/71228/comment/62a5fce9_60c3ad3e/
+[drop-connections-discussion-2]:
+ https://gerrit.openbmc.org/c/openbmc/obmc-console/+/71867/comment/756f0abe_5ebe8d66/
+
+[^1]: which also saves resources
+
+These arguments combined eliminate all but option `F`. It seems to sit at a neat
+nexus in terms of existing ABI, desired behaviour, and implementation
+complexity.
+
+Addendum: Discussions so far have been around a _minimal_ design that
+achieves the desired console behaviour. It's worth noting that design option `F`
+(connection-based mux control which disconnects conflicting clients) allows us
+to _optionally_ implement an out-of-band command interface in addition, because
+the observable behaviour is no different to a new connection being accepted:
+conflicting clients are disconnected and the mux is switched. This may be
+helpful to implement platform policy around logging.
+
+## Proposed Design
+
+It's proposed that we use one obmc-console-server process to expose the `N`
+consoles connected to a UART mux, where each console represents one mux port.
+The mux is switched based on the endpoint of the most recent client connection,
+and any conflicting clients are disconnected. This is design option `F` in the
+table above.
+
+The internal data structures of obmc-console will change to accommodate the
+design.
+
+We will use one config file for the `N` muxed consoles. The configuration will
+provide a similar approach for specifying the mux GPIOs to that used by [the
+i2c-mux-gpio devicetree binding][linux-i2c-mux-gpio].
+
+[linux-i2c-mux-gpio]:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/devicetree/bindings/i2c/i2c-mux-gpio.yaml?h=v6.9#n12
+
+Below is a block diagram of the relationships between the software and hardware
+components:
+
+```
+ +--------------------+
+ | server.conf |
+ +--------------------+
+ |
+ |
+ |
+ |
+ +----+----+ +-----+ +-------+
+ | | | | | |
+ | | +-------+ +-------+ | +-----+ UART1 |
++-----------------------------------+ | | | | | | | | | |
+| xyz.openbmc_project.Console.host1 +-----+ +-----+ ttyS0 +-----+ UART0 +-----+ | +-------+
++-----------------------------------+ | | | | | | | |
+ | obmc | +-------+ +-------+ | |
+ | console | | MUX |
+ | server | +-------+ | |
++-----------------------------------+ | | | | | |
+| xyz.openbmc_project.Console.host2 +-----+ +-------------------+ GPIO +-----+ | +-------+
++-----------------------------------+ | | | | | | | |
+ | | +-------+ | +-----+ UART2 |
+ | | | | | |
+ +----+----+ +-----+ +-------+
+
+```
+
+To inform people who may be reading log files for a console, connection and
+disconnection events of a console via mux control will produce messages for
+clients and in log files.
+
+Requirements are:
+
+- Making it clear this message is from obmc-console
+- Timestamp
+- Indication of connected/disconnected
+
+These messages are not meant as an API or reliable means to get information
+about mux state. Any application on the other side of the UART could also
+produce the exact same messages, even if unlikely.
+
+The initial format of these messages will be something like:
+
+```
+[obmc-console] %Y-%m-%d %H:%M:%S UTC CONNECTED
+[obmc-console] %Y-%m-%d %H:%M:%S UTC DISCONNECTED
+```
+
+for the connect and disconnect cases, respectively.
+
+For the D-Bus representation we choose the unified tree.
+
+## Other Alternatives Considered
+
+### Kernel implementation
+
+A kernel implementation was not pursued since the support can be implemented in
+userspace. It also may not be merged, since the hardware configuration it
+supports may not be widely available. It may be better to have a userspace
+implementation to refer back to if someone wants a kernel implementation later.
+
+### Multiple obmc-console-server processes for the multiple consoles
+
+This was considered and implemented as a PoC, but discarded later as it would be
+easier to synchronize everything in a single process.
+
+### Multiple configuration files for multiple consoles
+
+This was considered but it would duplicate configuration, like the definition of
+the mux GPIOs. Inconsistencies across the files would also need to be managed.
+
+## Impacts
+
+### API Impact
+
+### Performance Impact
+
+Minimal to none.
+
+### Developer Impact
+
+Minimal. Existing users do not need to change anything about their
+configuration.
+
+### Organizational
+
+- Does this design require a new repository? No
+- Who will be the initial maintainer(s) of this repository?
+- Which repositories are expected to be modified to execute this design?
+ obmc-console, docs
+- Make a list, and add listed repository maintainers to the gerrit review.
+
+## Testing
+
+There are already integration tests for this feature available on Gerrit.