mctp: Add initial kernel MCTP design definition

Covering network & interface representation, sockets API implementation
and configuration interface.

Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Change-Id: Ia7b7e51c7528b08941c2811b94f4775c51031efb

commit: f0ca2e418c0b4669d3a759775f654343f0b9e504
author: Jeremy Kerr <jk@codeconstruct.com.au> Mon Feb 08 18:01:21 2021 +0800
committer: Jeremy Kerr <jk@codeconstruct.com.au> Thu Mar 04 10:22:10 2021 +0800
tree: dd2ac90444a0841a9b54ab9d99035f479065bc36
parent: bac8940ea96db6928748089e2b7e9d65fa659120
diff --git a/designs/mctp/mctp-kernel.md b/designs/mctp/mctp-kernel.md
new file mode 100644
index 0000000..ac19dee
--- /dev/null
+++ b/designs/mctp/mctp-kernel.md

@@ -0,0 +1,952 @@
+# OpenBMC in-kernel MCTP
+
+Author: Jeremy Kerr `<jk@codeconstruct.com.au>`
+
+Please refer to the [MCTP Overview](mctp.md) document for general MCTP design
+description, background and requirements.
+
+This document describes a kernel-based implementation of MCTP infrastructure,
+providing a sockets-based API for MCTP communication within an OpenBMC-based
+platform.
+
+# Requirements for a kernel implementation
+
+ * The MCTP messaging API should be an obvious application of the existing POSIX
+   socket interface.
+
+ * Configuration should be simple for a straightforward MCTP endpoint: a single
+   network with a single local endpoint id (EID).
+
+ * Infrastructure should be flexible enough to allow for more complex MCTP
+   networks, allowing:
+
+    - each MCTP network (as defined by section 3.2.31 of DSP0236) may
+      consist of multiple local physical interfaces, and/or multiple EIDs;
+
+    - multiple distinct (ie., non-bridged) networks, possibly containing
+      duplicated EIDs between networks;
+
+    - multiple local EIDs on a single interface, and
+
+    - customisable routing/bridging configurations within a network.
+
+
+# Proposed Design #
+
+The design contains several components:
+
+ * An interface for userspace applications to send and receive MCTP messages: A
+   mapping of the sockets API to MCTP usage
+
+ * Infrastructure for control and configuration of the MCTP network(s),
+   consisting of a configuration utility, and a kernel messaging facility for
+   this utility to use.
+
+ * Kernel drivers for physical interface bindings.
+
+In general, the kernel components cover the transport functionality of MCTP,
+such as message assembly/disassembly, packet forwarding, and physical interface
+implementations.
+
+Higher-level protocols (such as PLDM) are implemented in userspace, through the
+introduced socket API. This also includes the majority of the MCTP Control
+Protocol implementation (DSP0236, section 11) - MCTP endpoints will typically
+have a specific process to request and respond to control protocol messages.
+However, the kernel will include a small subset of control protocol code to
+allow very simple endpoints, with static EID allocations, to run without this
+process. MCTP endpoints that require more than just single-endpoint
+functionality (bus owners, bridges, etc), and/or dynamic EID allocation, would
+include the control message protocol process.
+
+A new driver is introduced to handle each physical interface binding. These
+drivers expose the appropriate `struct net_device` to handle transmission and
+reception of MCTP packets on their associated hardware channels. Under Linux,
+the namespace for these interfaces is separate from other network interfaces -
+such as those for ethernet.
+
+## Structure: interfaces & networks ##
+
+The kernel models the local MCTP topology through two items: interfaces and
+networks.
+
+An interface (or "link") is an instance of an MCTP physical transport binding
+(as defined by DSP0236, section 3.2.47), likely connected to a specific hardware
+device. This is represented as a `struct net_device`, and has a user-visible
+name and index (`ifindex`). Non-hardware-attached interfaces are permitted, to
+allow local loopback and/or virtual interfaces.
+
+A network defines a unique address space for MCTP endpoints by endpoint-ID
+(described by DSP0236, section 3.2.31). A network has a user-visible identifier
+to allow references from userspace. Route definitions are specific to one
+network.
+
+Interfaces are associated with one network. A network may be associated with one
+or more interfaces.
+
+If multiple networks are present, each may contain EIDs that are also present on
+other networks.
+
+## Sockets API ##
+
+### Protocol definitions ###
+
+We define a new address family (and corresponding protocol family) for MCTP:
+
+```c
+    #define AF_MCTP /* TBD */
+    #define PF_MCTP AF_MCTP
+```
+
+MCTP sockets are created with the `socket()` syscall, specifying `AF_MCTP` as
+the domain. Currently, only a `SOCK_DGRAM` socket type is defined.
+
+```c
+    int sd = socket(AF_MCTP, SOCK_DGRAM, 0);
+```
+
+The only (current) value for the `protocol` argument is 0. Future protocol
+implementations may be added later.
+
+MCTP sockets opened with a protocol value of 0 will communicate directly at the
+transport layer; message buffers received by the application will consist of
+message data from reassembled MCTP packets, and will include the full message
+including message type byte and optional message integrity check (IC).
+Individual packet headers are not included; they may be accessible through a
+future `SOCK_RAW` socket type.
+
+As with all socket address families, source and destination addresses are
+specified with a new `sockaddr` type:
+
+```c
+    /* defined first, as struct sockaddr_mctp embeds it by value */
+    struct mctp_addr {
+            uint8_t             s_addr;
+    };
+
+    struct sockaddr_mctp {
+            sa_family_t         smctp_family; /* = AF_MCTP */
+            int                 smctp_network;
+            struct mctp_addr    smctp_addr;
+            uint8_t             smctp_type;
+            uint8_t             smctp_tag;
+    };
+
+    /* MCTP network values */
+    #define MCTP_NET_ANY        0
+
+    /* MCTP EID values */
+    #define MCTP_ADDR_ANY       0xff
+    #define MCTP_ADDR_BCAST     0xff
+
+    /* MCTP type values. Only the least-significant 7 bits of
+     * smctp_type are used for tag matches; the specification defines
+     * the type to be 7 bits.
+     */
+    #define MCTP_TYPE_MASK      0x7f
+
+    /* MCTP tag definitions; used for smctp_tag field of sockaddr_mctp */
+    /* MCTP-spec-defined fields */
+    #define MCTP_TAG_MASK    0x07
+    #define MCTP_TAG_OWNER   0x08
+    /* Others: reserved */
+
+    /* Helpers */
+    #define MCTP_TAG_RSP(x) ((x) & MCTP_TAG_MASK) /* response to a request: clear TO, keep value */
+```
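+
+As a minimal sketch of how these tag definitions combine, a tag byte can be
+split into its tag-owner bit and 3-bit tag value. The helper names
+`mctp_tag_is_owner()` and `mctp_tag_value()` are illustrative assumptions, and
+are not part of the proposed API:
+
+```c
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    /* values copied from the definitions above */
+    #define MCTP_TAG_MASK    0x07
+    #define MCTP_TAG_OWNER   0x08
+    #define MCTP_TAG_RSP(x)  ((x) & MCTP_TAG_MASK)
+
+    /* hypothetical helper: true if the tag-owner (TO) bit is set */
+    static inline bool mctp_tag_is_owner(uint8_t tag)
+    {
+            return (tag & MCTP_TAG_OWNER) != 0;
+    }
+
+    /* hypothetical helper: extract the 3-bit tag value */
+    static inline uint8_t mctp_tag_value(uint8_t tag)
+    {
+            return tag & MCTP_TAG_MASK;
+    }
+```
+
+A responder would pass the received `smctp_tag` through `MCTP_TAG_RSP()` before
+replying, which keeps the tag value and clears the TO bit.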
+
+### Syscall behaviour ###
+
+The following sections describe the MCTP-specific behaviours of the standard
+socket system calls. These behaviours have been chosen to map closely to the
+existing sockets APIs.
+
+#### `bind()`: set local socket address ####
+
+Sockets that receive incoming request packets will bind to a local address,
+using the `bind()` syscall.
+
+```c
+    struct sockaddr_mctp addr;
+
+    addr.smctp_family = AF_MCTP;
+    addr.smctp_network = MCTP_NET_ANY;
+    addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
+    addr.smctp_type = MCTP_TYPE_PLDM;
+    addr.smctp_tag = MCTP_TAG_OWNER;
+
+    int rc = bind(sd, (struct sockaddr *)&addr, sizeof(addr));
+```
+
+This establishes the local address of the socket. Incoming MCTP messages that
+match the network, address, and message type will be received by this socket.
+The reference to 'incoming' is important here; a bound socket will only receive
+messages with the TO bit set, to indicate an incoming request message, rather
+than a response.
+
+The `smctp_tag` value will configure the tags accepted from the remote side of
+this socket. Given the above, the only valid value is `MCTP_TAG_OWNER`, which
+will result in remotely "owned" tags being routed to this socket. Since
+`MCTP_TAG_OWNER` is set, the 3 least-significant bits of `smctp_tag` are
+not used; callers must set them to zero. See the [Tag behaviour for transmitted
+messages](#tag-behaviour-for-transmitted-messages) section for more details. If
+the `MCTP_TAG_OWNER` bit is not set, `bind()` will fail with an errno of
+`EINVAL`.
+
+A `smctp_network` value of `MCTP_NET_ANY` will configure the socket to receive
+incoming packets from any locally-connected network. A specific network value
+will cause the socket to only receive incoming messages from that network.
+
+The `smctp_addr` field specifies a local address to bind to. A value of
+`MCTP_ADDR_ANY` configures the socket to receive messages addressed to any
+local destination EID.
+
+The `smctp_type` field specifies which message types to receive. Only the lower
+7 bits of the type is matched on incoming messages (ie., the most-significant IC
+bit is not part of the match). This results in the socket receiving packets with
+and without a message integrity check footer.
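+
+The type match described above could be sketched as below. This is only an
+illustration of the masking rule; the function name `mctp_type_matches()` is an
+assumption, and the in-kernel matching code may differ:
+
+```c
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    #define MCTP_TYPE_MASK  0x7f
+
+    /* Hypothetical helper: compare the bound type against the first byte
+     * of an incoming message, ignoring the most-significant (IC) bit. */
+    static inline bool mctp_type_matches(uint8_t bound_type, uint8_t msg_type)
+    {
+            return ((bound_type ^ msg_type) & MCTP_TYPE_MASK) == 0;
+    }
+```
+
+With this rule, a socket bound to type 0x01 receives messages whose type byte
+is either 0x01 (no integrity check) or 0x81 (integrity check present).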
+
+#### `connect()`: set remote socket address ####
+
+Applications may specify a socket's remote address with the `connect()` syscall:
+
+```c
+    struct sockaddr_mctp addr;
+    int rc;
+
+    addr.smctp_family = AF_MCTP;
+    addr.smctp_network = MCTP_NET_ANY;
+    addr.smctp_addr.s_addr = 8;
+    addr.smctp_tag = MCTP_TAG_OWNER;
+    addr.smctp_type = MCTP_TYPE_PLDM;
+
+    rc = connect(sd, (struct sockaddr *)&addr, sizeof(addr));
+```
+
+This establishes the remote address of a socket, used for future message
+transmission. Like other `SOCK_DGRAM` behaviour, this does not generate any MCTP
+traffic directly, but just sets the default destination for messages sent from
+this socket.
+
+The `smctp_network` field may specify a locally-attached network, or the value
+`MCTP_NET_ANY`, in which case the kernel will select a suitable MCTP network.
+This is guaranteed to work for single-network configurations, but may require
+additional routing definitions for endpoints attached to multiple distinct
+networks. See the [Addressing](#addressing) section for details.
+
+The `smctp_addr` field specifies a remote EID. This may be `MCTP_ADDR_BCAST`,
+the MCTP broadcast EID (0xff).
+
+The `smctp_type` field specifies the type field of messages transferred over
+this socket.
+
+The `smctp_tag` value will configure the tag used for the local side of this
+socket. The only valid value is `MCTP_TAG_OWNER`, which will result in an
+"owned" tag being allocated for this socket. This tag remains allocated for all
+future outgoing messages, until either the socket is closed or `connect()` is
+called again. If a tag cannot be allocated, `connect()` will report an error,
+with an errno value of `EAGAIN`. See the [Tag behaviour for transmitted
+messages](#tag-behaviour-for-transmitted-messages) section for more details. If
+the `MCTP_TAG_OWNER` bit is not set, `connect()` will fail with an errno of
+`EINVAL`.
+
+Requesters which connect to a single responder will typically use `connect()` to
+specify the peer address and tag for future outgoing messages.
+
+#### `sendto()`, `sendmsg()`, `send()` & `write()`: transmit an MCTP message ####
+
+An MCTP message is transmitted using one of the `sendto()`, `sendmsg()`, `send()`
+or `write()` syscalls. Using `sendto()` as the primary example:
+
+```c
+    struct sockaddr_mctp addr;
+    char buf[14];
+    ssize_t len;
+
+    /* set message destination */
+    addr.smctp_family = AF_MCTP;
+    addr.smctp_network = 0;
+    addr.smctp_addr.s_addr = 8;
+    addr.smctp_tag = MCTP_TAG_OWNER;
+    addr.smctp_type = MCTP_TYPE_ECHO;
+
+    /* arbitrary message to send, with message-type header */
+    buf[0] = MCTP_TYPE_ECHO;
+    memcpy(buf + 1, "hello, world!", sizeof(buf) - 1);
+
+    len = sendto(sd, buf, sizeof(buf), 0,
+                    (struct sockaddr *)&addr, sizeof(addr));
+```
+
+The address argument is treated the same way as for `connect()`: the network and
+address fields define the remote address to send to. If `smctp_tag` has the
+`MCTP_TAG_OWNER` bit set, the kernel will ignore any bits set in the tag value
+field (`MCTP_TAG_MASK`), and generate a tag value suitable for the destination
+EID. If `MCTP_TAG_OWNER` is not set, the message will be sent with the tag value
+as specified. If a tag value cannot be allocated, the system call will report an
+errno of `EAGAIN`.
+
+The application must provide the message type byte as the first byte of the
+message buffer passed to `sendto()`. If a message integrity check is to be
+included in the transmitted message, it must also be provided in the message
+buffer, and the most-significant bit of the message type byte must be 1.
+
+If the first byte of the message does not match the message type value, then the
+system call will return an error of `EPROTO`.
+
+The `send()` and `write()` system calls behave in a similar way, but do not
+specify a remote address. Therefore, `connect()` must be called beforehand; if
+not, these calls will return an error of `EDESTADDRREQ` (Destination address
+required).
+
+Using `sendto()` or `sendmsg()` on a connected socket may override the remote
+socket address specified in `connect()`. The `connect()` address and tag will
+remain associated with the socket, for future unaddressed sends. The tag
+allocated through a call to `sendto()` or `sendmsg()` on a connected socket is
+subject to the same invalidation logic as on an unconnected socket: It is
+expired either by timeout or by a subsequent `sendto()`.
+
+The `sendmsg()` system call allows a more compact argument interface, and the
+message buffer to be specified as a scatter-gather list. At present no
+ancillary message types (used for the `msg_control` data passed to `sendmsg()`)
+are defined.
+
+Transmitting a message on an unconnected socket with `MCTP_TAG_OWNER` specified
+will cause an allocation of a tag, if no valid tag is already allocated for that
+destination. The (destination-eid,tag) tuple acts as an implicit local socket
+address, to allow the socket to receive responses to this outgoing message. If
+any previous allocation has been performed (for a different remote EID), that
+allocation is lost. This tag behaviour can be controlled through the
+`MCTP_TAG_CONTROL` socket option.
+
+Sockets will only receive responses to requests they have sent (with TO=1) and may
+only respond (with TO=0) to requests they have received.
+
+#### `recvfrom()`, `recvmsg()`, `recv()` & `read()`: receive an MCTP message ####
+
+An MCTP message can be received by an application using one of the `recvfrom()`,
+`recvmsg()`, `recv()` or `read()` system calls. Using `recvfrom()` as the
+primary example:
+
+```c
+    struct sockaddr_mctp addr;
+    socklen_t addrlen;
+    char buf[14];
+    ssize_t len;
+
+    addrlen = sizeof(addr);
+
+    len = recvfrom(sd, buf, sizeof(buf), 0,
+                    (struct sockaddr *)&addr, &addrlen);
+
+    /* We can expect addr to describe an MCTP address */
+    assert(addrlen >= sizeof(addr));
+    assert(addr.smctp_family == AF_MCTP);
+
+    printf("received %zd bytes from remote EID %d\n", len, addr.smctp_addr.s_addr);
+```
+
+The address argument to `recvfrom` and `recvmsg` is populated with the remote
+address of the incoming message, including tag value (this will be needed in
+order to reply to the message).
+
+The first byte of the message buffer will contain the message type byte. If an
+integrity check follows the message, it will be included in the received buffer.
+
+The `recv()` and `read()` system calls behave in a similar way, but do not
+provide a remote address to the application. Therefore, these are only useful if
+the remote address is already known, or the message does not require a reply.
+
+Like the send calls, sockets will only receive responses to requests they have
+sent (TO=1) and may only respond (TO=0) to requests they have received.
+
+#### `getsockname()` & `getpeername()`: query local/remote socket address ####
+
+The `getsockname()` system call returns the `struct sockaddr_mctp` value for the
+local side of this socket, `getpeername()` for the remote (i.e., the address used
+in `connect()`). Since the tag value is a property of the remote address,
+`getpeername()` may be used to retrieve a kernel-allocated tag value.
+
+Calling `getpeername()` on an unconnected socket will result in an error of
+`ENOTCONN`.
+
+#### Socket options ####
+
+The following socket options are defined for MCTP sockets:
+
+##### `MCTP_ADDR_EXT`: Use extended addressing information in sendmsg/recvmsg #####
+
+Enabling this socket option allows an application to specify extended addressing
+information on transmitted packets, and access the same on received packets.
+
+When the `MCTP_ADDR_EXT` socket option is enabled, the application may specify
+an expanded `struct sockaddr` to the `recvfrom()` and `sendto()` system calls.
+This is defined as:
+
+```c
+    struct sockaddr_mctp_ext {
+            /* fields exactly match struct sockaddr_mctp */
+            sa_family_t         smctp_family; /* = AF_MCTP */
+            int                 smctp_network;
+            struct mctp_addr    smctp_addr;
+            uint8_t             smctp_type;
+            uint8_t             smctp_tag;
+            /* extended addressing */
+            int                 smctp_ifindex;
+            uint8_t             smctp_halen;
+            unsigned char       smctp_haddr[/* TBD */];
+    };
+```
+
+If the `addrlen` specified to `sendto()` or `recvfrom()` is sufficient to
+contain this larger structure, then the extended addressing fields are consumed
+/ populated respectively.
+
+
+##### `MCTP_TAG_CONTROL`: manage outgoing tag allocation behaviour #####
+
+The set/getsockopt argument is a `mctp_tagctl` structure:
+
+    struct mctp_tagctl {
+        bool            retain;
+        struct timespec timeout;
+    };
+
+This allows an application to control the behaviour of allocated tags for
+non-connected sockets when transferring messages to multiple different
+destinations (ie., where a `struct sockaddr_mctp` is provided for individual
+messages, and the `smctp_addr` destination for those messages may vary across
+calls).
+
+The `retain` flag indicates to the kernel that the socket should not release tag
+allocations when a message is sent to a new destination EID. This causes the
+socket to continue to receive incoming messages to the old (dest,tag) tuple, in
+addition to the new tuple.
+
+The `timeout` value specifies a maximum amount of time to retain tag values.
+This should be based on the reply timeout for any upper-level protocol.
+
+The kernel may reject a request to set values that would cause excessive tag
+allocation by this socket. The kernel may also reject subsequent tag-allocation
+requests (through send or connect syscalls) which would cause excessive tags to
+be consumed by the socket, even though the tag control settings were accepted in
+the setsockopt operation.
+
+Changing the default tag control behaviour should only be required when:
+
+ * the socket is sending messages with TO=1 (ie, is a requester); and
+ * messages are sent to multiple different destination EIDs from the one
+   socket.
+
+
+#### Syscalls not implemented ####
+
+The following system calls are not implemented for MCTP, primarily as they are
+not used in `SOCK_DGRAM`-type sockets:
+
+ * `listen()`
+ * `accept()`
+ * `ioctl()`
+ * `shutdown()`
+ * `mmap()`
+
+### Userspace examples ###
+
+These examples cover three general use-cases:
+
+ - **requester**: sends requests to a particular (EID, type) target, and
+   receives responses to those packets
+
+   This is similar to a typical UDP client
+
+ - **responder**: receives all locally-addressed messages of a specific
+   message-type, and responds to the requester immediately.
+
+   This is similar to a typical UDP server
+
+ - **controller**: a specific service for a bus owner; may send broadcast
+   messages, manage EID allocations, update local MCTP stack state. Will
+   need low-level packet data.
+
+   This is similar to a DHCP server.
+
+#### Requester ####
+
+"Client"-side implementation to send requests to a responder, and receive a response.
+This uses a (fictitious) message type of `MCTP_TYPE_ECHO`.
+
+```c
+    int main() {
+            struct sockaddr_mctp addr;
+            socklen_t addrlen;
+            struct {
+                uint8_t type;
+                uint8_t data[14];
+            } msg;
+            int sd, rc;
+
+            sd = socket(AF_MCTP, SOCK_DGRAM, 0);
+
+            addr.smctp_family = AF_MCTP;
+            addr.smctp_network = MCTP_NET_ANY; /* any network */
+            addr.smctp_addr.s_addr = 9;    /* remote eid 9 */
+            addr.smctp_tag = MCTP_TAG_OWNER; /* kernel will allocate an owned tag */
+            addr.smctp_type = MCTP_TYPE_ECHO; /* fictitious message type */
+            addrlen = sizeof(addr);
+
+            /* set message type and payload */
+            msg.type = MCTP_TYPE_ECHO;
+            strncpy((char *)msg.data, "hello, world!", sizeof(msg.data));
+
+            /* send message */
+            rc = sendto(sd, &msg, sizeof(msg), 0,
+                            (struct sockaddr *)&addr, addrlen);
+
+            if (rc < 0)
+                    err(EXIT_FAILURE, "sendto");
+
+            /* Receive reply. This will block until a reply arrives,
+             * which may never happen. Actual code would need a timeout
+             * here. */
+            rc = recvfrom(sd, &msg, sizeof(msg), 0,
+                        (struct sockaddr *)&addr, &addrlen);
+            if (rc < 0)
+                    err(EXIT_FAILURE, "recvfrom");
+
+            assert(msg.type == MCTP_TYPE_ECHO);
+            /* ensure we're nul-terminated */
+            msg.data[sizeof(msg.data)-1] = '\0';
+
+            printf("reply: %s\n", msg.data);
+
+            return EXIT_SUCCESS;
+    }
+```
+
+#### Responder ####
+
+"Server"-side implementation to receive requests and respond. Like the client,
+this uses a (fictitious) message type of `MCTP_TYPE_ECHO` in the `struct
+sockaddr_mctp`; only messages matching this type will be received.
+
+```c
+    int main() {
+            struct sockaddr_mctp addr;
+            socklen_t addrlen;
+            int sd, rc;
+
+            sd = socket(AF_MCTP, SOCK_DGRAM, 0);
+
+            addr.smctp_family = AF_MCTP;
+            addr.smctp_network = MCTP_NET_ANY; /* any network */
+            addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
+            addr.smctp_type = MCTP_TYPE_ECHO;
+            addr.smctp_tag = MCTP_TAG_OWNER;
+            addrlen = sizeof(addr);
+
+            rc = bind(sd, (struct sockaddr *)&addr, addrlen);
+            if (rc)
+                    err(EXIT_FAILURE, "bind");
+
+            for (;;) {
+                    struct {
+                        uint8_t type;
+                        uint8_t data[14];
+                    } msg;
+
+                    /* addrlen is a value-result argument; reset it each time */
+                    addrlen = sizeof(addr);
+
+                    rc = recvfrom(sd, &msg, sizeof(msg), 0,
+                                    (struct sockaddr *)&addr, &addrlen);
+                    if (rc < 0)
+                            err(EXIT_FAILURE, "recvfrom");
+                    if (rc < 1) {
+                            warnx("not enough data for a message type");
+                            continue;
+                    }
+
+                    assert(addrlen == sizeof(addr));
+                    assert(msg.type == MCTP_TYPE_ECHO);
+
+                    printf("%d bytes from EID %d\n", rc, addr.smctp_addr.s_addr);
+
+                    /* Reply to requester; this macro just clears the TO-bit.
+                     * Other addr fields will describe the remote endpoint,
+                     * so use those as-is.
+                     */
+                    addr.smctp_tag = MCTP_TAG_RSP(addr.smctp_tag);
+
+                    rc = sendto(sd, &msg, rc, 0,
+                                (struct sockaddr *)&addr, addrlen);
+                    if (rc < 0)
+                            err(EXIT_FAILURE, "sendto");
+            }
+
+            return EXIT_SUCCESS;
+    }
+```
+
+#### Broadcast request ####
+
+Sends a request to a broadcast EID, and receives (unicast) replies. This is a
+typical control protocol pattern.
+
+```c
+    int main() {
+            struct sockaddr_mctp txaddr, rxaddr;
+            struct timespec start, cur;
+            struct pollfd pollfds[1];
+            socklen_t addrlen;
+            uint8_t buf[2];
+            int sd, rc, timeout;
+
+            sd = socket(AF_MCTP, SOCK_DGRAM, 0);
+
+            /* destination address setup */
+            txaddr.smctp_family = AF_MCTP;
+            txaddr.smctp_network = 1; /* specific network required for broadcast */
+            txaddr.smctp_addr.s_addr = MCTP_ADDR_BCAST; /* broadcast dest */
+            txaddr.smctp_type = MCTP_TYPE_CONTROL;
+            txaddr.smctp_tag = MCTP_TAG_OWNER;
+
+            buf[0] = MCTP_TYPE_CONTROL;
+            buf[1] = 'a';
+
+            /* We're doing a sendto() to a broadcast address here. If we were
+             * sending more than one broadcast message, we'd be better off
+             * doing connect(); sendto();, in order to retain the tag
+             * reservation across all transmitted messages. However, since this
+             * is a single transmit, that makes no difference in this
+             * particular case.
+             */
+            rc = sendto(sd, buf, 2, 0, (struct sockaddr *)&txaddr,
+                            sizeof(txaddr));
+            if (rc < 0)
+                    err(EXIT_FAILURE, "sendto");
+
+            /* Set up poll behaviour, and record our starting time for
+             * reply timeouts */
+            pollfds[0].fd = sd;
+            pollfds[0].events = POLLIN;
+            clock_gettime(CLOCK_MONOTONIC, &start);
+
+            for (;;) {
+                    /* Calculate the amount of time left for replies */
+                    clock_gettime(CLOCK_MONOTONIC, &cur);
+                    timeout = calculate_timeout(&start, &cur, 1000);
+
+                    rc = poll(pollfds, 1, timeout);
+                    if (rc < 0)
+                        err(EXIT_FAILURE, "poll");
+
+                    /* timeout receiving a reply? */
+                    if (rc == 0)
+                        break;
+
+                    /* sanity check that we have a message to receive */
+                    if (!(pollfds[0].revents & POLLIN))
+                        break;
+
+                    addrlen = sizeof(rxaddr);
+
+                    rc = recvfrom(sd, &buf, 2, 0, (struct sockaddr *)&rxaddr,
+                            &addrlen);
+                    if (rc < 0)
+                            err(EXIT_FAILURE, "recvfrom");
+
+                    assert(addrlen >= sizeof(rxaddr));
+                    assert(rxaddr.smctp_family == AF_MCTP);
+
+                    printf("response from EID %d\n", rxaddr.smctp_addr.s_addr);
+            }
+
+            return EXIT_SUCCESS;
+    }
+```
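+
+The example above calls `calculate_timeout()` without defining it. One possible
+implementation is sketched below, assuming the third argument is the total
+reply budget in milliseconds, and that the return value is passed directly to
+`poll()`. Since a negative `poll()` timeout means "wait forever", the result is
+clamped at zero:
+
+```c
+    #include <time.h>
+
+    /* Milliseconds remaining of a total_ms budget that began at *start,
+     * as measured at *cur. Never returns a negative value. */
+    static int calculate_timeout(const struct timespec *start,
+                                 const struct timespec *cur, int total_ms)
+    {
+            long long elapsed_ms;
+
+            elapsed_ms = (long long)(cur->tv_sec - start->tv_sec) * 1000
+                       + (cur->tv_nsec - start->tv_nsec) / 1000000;
+
+            /* guard against a clock that appears to run backwards */
+            if (elapsed_ms < 0)
+                    elapsed_ms = 0;
+
+            if (elapsed_ms >= total_ms)
+                    return 0;
+
+            return (int)(total_ms - elapsed_ms);
+    }
+```
+
+Once the budget is exhausted, `poll()` is called with a timeout of 0, returns
+immediately, and the loop exits through the `rc == 0` check if no reply is
+pending.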
+
+### Implementation notes ###
+
+#### Addressing ####
+
+Transmitted messages (through `sendto()` and related system calls) specify their
+destination via the `smctp_network` and `smctp_addr` fields of a `struct
+sockaddr_mctp`.
+
+The `smctp_addr` field maps directly to the destination endpoint's EID.
+
+The `smctp_network` field specifies a locally defined network identifier. To
+simplify situations where there is only one network defined, the special value
+`MCTP_NET_ANY` is allowed. This will allow the kernel to select a specific
+network for transmission.
+
+This selection is entirely user-configured; one specific network may be defined
+as the system default, in which case it will be used for all message
+transmission where `MCTP_NET_ANY` is used as the destination network.
+
+In particular, the destination EID is never used to select a destination
+network.
+
+MCTP responders should use the EID and network values of an incoming request to
+specify the destination for any responses.
+
+#### Bridging/routing ####
+
+The network and interface structure allows multiple interfaces to share a common
+network. By default, packets are not forwarded between interfaces.
+
+A network can be configured for "forwarding" mode. In this mode, packets may be
+forwarded if their destination EID is non-local, and matches a route for another
+interface on the same network.
+
+As per DSP0236, packet reassembly does not occur during the forwarding process.
+If the packet is larger than the MTU for the destination interface/route, then
+the packet is dropped.
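+
+The forwarding rules above might be condensed into a single decision, as in
+the sketch below. All names here (`enum fwd_action`, `mctp_fwd_decide()` and
+its parameters) are hypothetical, chosen only to illustrate the rules; they do
+not describe actual kernel code:
+
+```c
+    #include <stdbool.h>
+    #include <stddef.h>
+
+    enum fwd_action { FWD_DELIVER_LOCAL, FWD_FORWARD, FWD_DROP };
+
+    /* Hypothetical per-packet decision for a received packet */
+    static enum fwd_action mctp_fwd_decide(bool dest_is_local,
+                                           bool net_forwarding,
+                                           bool have_route,
+                                           size_t pkt_len, size_t route_mtu)
+    {
+            if (dest_is_local)
+                    return FWD_DELIVER_LOCAL;
+
+            /* forwarding is off by default, and needs a matching route */
+            if (!net_forwarding || !have_route)
+                    return FWD_DROP;
+
+            /* no reassembly while forwarding: oversized packets are dropped */
+            if (pkt_len > route_mtu)
+                    return FWD_DROP;
+
+            return FWD_FORWARD;
+    }
+```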
+
+#### Tag behaviour for transmitted messages ####
+
+On every message sent with the tag-owner bit set ("TO" in DSP0236), the kernel
+must allocate a tag that will uniquely identify responses over a (destination
+EID, source EID, tag-owner, tag) tuple. The tag value is 3 bits in size.
+
+To allow this, a `sendto()` with the `MCTP_TAG_OWNER` bit set in the `smctp_tag`
+field will cause the kernel to allocate a unique tag for subsequent replies from
+that specific remote EID.
+
+This allocation will expire when any of the following occur:
+
+ * the socket is closed
+ * a new message is sent to a new destination EID
+ * an implementation-defined timeout expires
+
+Because the "tag space" is limited, it may not be possible for the kernel to
+allocate a unique tag for the outgoing message. In this case, the `sendto()`
+call will fail with errno `EAGAIN`. This is analogous to the UDP behaviour when
+a local port cannot be allocated for an outgoing message.
+
+The implementation-defined timeout value shall be chosen to reasonably cover
+standard reply timeouts. If necessary, this timeout may be modified through the
+`MCTP_TAG_CONTROL` socket option.
+
+For applications that expect to perform an ongoing message exchange with a
+particular destination address, they may use the `connect()` call to set a
+persistent remote address. In this case, the tag will be allocated during
+`connect()`, and remain reserved for this socket until any of the following occur:
+
+ * the socket is closed
+ * the remote address is changed through another call to `connect()`.
+
+In particular, calling `sendto()` with a different address does not release the
+tag reservation.
+
+Broadcast messages are particularly onerous for tag reservations. When a message
+is transmitted with TO=1 and dest=0xff (the broadcast EID), the kernel must
+reserve the tag across the entire range of possible EIDs. Therefore, a
+particular tag value must be currently-unused across all EIDs to allow a
+`sendto()` to a broadcast address. Additionally, this reservation is not cleared
+when a reply is received, as there may be multiple replies to a broadcast.
+
+For this reason, applications wanting to send to the broadcast address should
+use the `connect()` system call to reserve a tag, and guarantee its availability
+for future message transmission. Note that this will remove the tag value for
+use with *any other EID*. Sending to the broadcast address should be avoided; we
+expect few applications will need this functionality.
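+
+To illustrate why broadcast reservations are costly, the sketch below models
+tag allocation with one bitmask of in-use tags per remote EID. The
+`struct tag_table` layout and the `tag_alloc()` function are assumptions made
+for illustration; the kernel's bookkeeping may be organised quite differently
+(for example, it would also key on the source EID and handle expiry):
+
+```c
+    #include <stdint.h>
+
+    #define MCTP_ADDR_BCAST 0xff
+    #define MCTP_EID_COUNT  256
+    #define MCTP_NUM_TAGS   8       /* the tag value is 3 bits */
+
+    /* hypothetical bookkeeping: in-use tag bits for each remote EID */
+    struct tag_table {
+            uint8_t used[MCTP_EID_COUNT];
+    };
+
+    /* Returns a tag in the range 0..7, or -1 in the case where the
+     * system call would report EAGAIN. */
+    static int tag_alloc(struct tag_table *t, uint8_t dest)
+    {
+            int bcast = (dest == MCTP_ADDR_BCAST);
+            uint8_t busy = 0;
+            int eid, tag;
+
+            /* a broadcast tag must be unused across every EID */
+            for (eid = 0; eid < MCTP_EID_COUNT; eid++)
+                    if (bcast || eid == dest)
+                            busy |= t->used[eid];
+
+            for (tag = 0; tag < MCTP_NUM_TAGS; tag++) {
+                    if (busy & (1u << tag))
+                            continue;
+                    /* reserve for one EID, or for all EIDs on broadcast */
+                    for (eid = 0; eid < MCTP_EID_COUNT; eid++)
+                            if (bcast || eid == dest)
+                                    t->used[eid] |= (uint8_t)(1u << tag);
+                    return tag;
+            }
+
+            return -1;
+    }
+```
+
+In this model, a single broadcast allocation consumes one of the eight tag
+values for every possible destination, which is the reason such sends should
+be rare.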
+
+
+#### MCTP Control Protocol implementation ####
+
+Aside from the "Resolve Endpoint ID" message, the MCTP control protocol
+implementation would exist as a userspace process, `mctpd`. This process is
+responsible for responding to incoming control protocol messages, any dynamic
+EID allocations (for bus owner devices) and maintaining the MCTP route table
+(for bridging devices).
+
+This process would create a socket bound to the type `MCTP_TYPE_CONTROL`, with
+the `MCTP_ADDR_EXT` socket option enabled in order to access physical addressing
+data on incoming control protocol requests. It would interact with the kernel's
+route table via a netlink interface - the same as that implemented for the
+[Utility and configuration interfaces](#utility-and-configuration-interfaces).
+
+### Neighbour and routing implementation ###
+
+The packet-transmission behaviour of the MCTP infrastructure relies on a single
+routing table to look up both route and neighbour information. Entries in this
+table are of the format:
+
+ | EID range | interface | physical address | metric | MTU | flags | expiry |
+ |-----------|-----------|------------------|--------|-----|-------|--------|
+
+This table can be updated from two sources:
+
+  * From userspace, via a netlink interface (see the
+    [Utility and configuration interfaces](#utility-and-configuration-interfaces)
+    section).
+
+  * Directly within the kernel, when basic neighbour information is discovered.
+    Kernel-originated routes are marked as such in the flags field, and have a
+    maximum validity age, indicated by the expiry field.
+
+Kernel-discovered routing information can originate from two sources:
+
+  * physical-to-EID mappings discovered through received packets
+
+  * explicit endpoint physical-address resolution requests
+
+When a packet is to be transmitted to an EID that does not have an entry in the
+routing table, the kernel may attempt to resolve the physical address of that
+endpoint using the Resolve Endpoint ID command of the MCTP Control Protocol
+(section 12.9 of DSP0236). The response message will be used to add a
+kernel-originated route into the routing table.
+
+This is the only kernel-internal usage of MCTP Control Protocol messages.
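+
+The table format and lookup described in this section could be modelled along
+the following lines. This is a sketch under stated assumptions, not the
+proposed kernel implementation: `struct route`, `ROUTE_F_KERNEL` and
+`route_lookup()` are hypothetical names, the physical-address size is
+arbitrary, and preferring the lowest metric is an assumption borrowed from IP
+routing conventions.
+
+```c
+    #include <assert.h>
+    #include <stdbool.h>
+    #include <stddef.h>
+    #include <stdint.h>
+
+    #define ROUTE_F_KERNEL 0x01  /* hypothetical flag: kernel-originated */
+
+    /* One routing-table entry, following the column layout in the text. */
+    struct route {
+        uint8_t      eid_min, eid_max;  /* EID range (inclusive) */
+        int          ifindex;           /* interface */
+        uint8_t      hwaddr[8];         /* physical address (size arbitrary) */
+        uint8_t      hwaddr_len;
+        unsigned int metric;
+        unsigned int mtu;
+        unsigned int flags;
+        uint64_t     expiry;            /* used only with ROUTE_F_KERNEL */
+    };
+
+    /* Return the best usable route for `eid` at time `now`, or NULL.
+     * Assumption: a lower metric wins. */
+    static const struct route *route_lookup(const struct route *tbl, size_t n,
+                                            uint8_t eid, uint64_t now)
+    {
+        const struct route *best = NULL;
+
+        assert(tbl || n == 0);
+        for (size_t i = 0; i < n; i++) {
+            const struct route *r = &tbl[i];
+
+            if (eid < r->eid_min || eid > r->eid_max)
+                continue;
+            /* kernel-originated entries have a maximum validity age */
+            if ((r->flags & ROUTE_F_KERNEL) && now >= r->expiry)
+                continue;
+            if (!best || r->metric < best->metric)
+                best = r;
+        }
+        return best;
+    }
+```
+
+A `NULL` result corresponds to the case above where the kernel may issue a
+Resolve Endpoint ID request before transmitting.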
+
+## Utility and configuration interfaces ##
+
+A small utility will be developed to control the state of the kernel MCTP stack.
+This will be similar in design to the 'iproute2' tools, which perform a similar
+function for the IPv4 and IPv6 protocols.
+
+The utility will be invoked as `mctp`, and provide subcommands for managing
+different aspects of the kernel stack.
+
+### `mctp link`: manage interfaces ###
+
+```sh
+    mctp link set <link> <up|down>
+    mctp link set <link> network <network-id>
+    mctp link set <link> mtu <mtu>
+    mctp link set <link> bus-owner <hwaddr>
+```
+
+### `mctp network`: manage networks ###
+
+```sh
+    mctp network create <network-id>
+    mctp network set <network-id> forwarding <on|off>
+    mctp network set <network-id> default [<true|false>]
+    mctp network delete <network-id>
+```
+
+### `mctp address`: manage local EID assignments ###
+
+```sh
+    mctp address add <eid> dev <link>
+    mctp address del <eid> dev <link>
+```
+
+### `mctp route`: manage routing tables ###
+
+```sh
+    mctp route add net <network-id> eid <eid|eid-range> via <link> [hwaddr <addr>] [mtu <mtu>] [metric <metric>]
+    mctp route del net <network-id> eid <eid|eid-range> via <link> [hwaddr <addr>] [mtu <mtu>] [metric <metric>]
+    mctp route show [net <network-id>]
+```
+
+### `mctp stat`: query socket status ###
+
+```sh
+    mctp stat
+```
+
+A set of netlink message formats will be defined to support these control
+functions.
+
+
+# Design points & alternatives considered #
+
+## Including message-type byte in send/receive buffers ##
+
+This design specifies that message buffers passed to the kernel in send syscalls
+and from the kernel in receive syscalls will have the message type byte as the
+first byte of the buffer. This corresponds to the definition of an MCTP message
+payload in DSP0236.
+
+This somewhat duplicates the type data provided in `struct sockaddr_mctp`; it's
+superficially possible for the kernel to prepend this byte on send, and remove
+it on receive.
+
+However, the exact format of the MCTP message payload is not precisely defined
+by the specification. Particularly, any message integrity check data (which
+would also need to be appended / stripped in conjunction with the type byte) is
+defined by the type specification, not DSP0236. The kernel would need knowledge
+of all protocols in order to correctly deconstruct the payload data.
+
+Therefore, we transfer the message payload as-is to userspace, without any
+modification by the kernel.
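+
+As an illustration of this buffer layout, an application might assemble and
+inspect message buffers roughly as follows. The helper names (`msg_build()`,
+`msg_type()`, `msg_has_ic()`) are hypothetical; the bit positions follow the
+DSP0236 definition of the first message byte (integrity-check flag in bit 7,
+message type in bits 6:0).
+
+```c
+    #include <assert.h>
+    #include <stdbool.h>
+    #include <stddef.h>
+    #include <stdint.h>
+    #include <string.h>
+
+    #define MSG_TYPE_MASK 0x7f  /* bits 6:0 of the first byte */
+    #define MSG_IC_BIT    0x80  /* bit 7: integrity check present */
+
+    /* Build a send buffer: type byte first, then the type-specific body.
+     * Returns the total length, or 0 if `buf` is too small. */
+    static size_t msg_build(uint8_t *buf, size_t buflen, uint8_t type,
+                            const void *body, size_t body_len)
+    {
+        assert(buf && (body || body_len == 0));
+        if (buflen == 0 || body_len > buflen - 1)
+            return 0;
+        buf[0] = type & MSG_TYPE_MASK;
+        if (body_len)
+            memcpy(buf + 1, body, body_len);
+        return body_len + 1;
+    }
+
+    /* Read the message type from a received buffer. */
+    static uint8_t msg_type(const uint8_t *buf, size_t len)
+    {
+        assert(buf && len >= 1);
+        return buf[0] & MSG_TYPE_MASK;
+    }
+
+    /* Whether the sender flagged a trailing integrity check; its format is
+     * defined by the message type, which is why the kernel leaves it alone. */
+    static bool msg_has_ic(const uint8_t *buf, size_t len)
+    {
+        assert(buf && len >= 1);
+        return (buf[0] & MSG_IC_BIT) != 0;
+    }
+```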
+
+## MCTP message-type specification: using `sockaddr_mctp.smctp_type` rather than protocol ##
+
+This design specifies message-types to be passed in the `smctp_type` field of
+`struct sockaddr_mctp`. An alternative would be to pass it in the `protocol`
+argument of the `socket()` system call:
+
+```c
+    int socket(int domain /* = AF_MCTP */, int type /* = SOCK_DGRAM */, int protocol);
+```
+
+The `smctp_type` implementation was chosen as it better matches the "addressing"
+model of the message type; sockets are bound to an incoming message type,
+similar to the IP protocol's model of binding UDP sockets to a local port number.
+
+There is no kernel behaviour that depends on the specific type (particularly
+given the design choice above), so the protocol argument is not a good fit
+here.
+
+Future additions that perform protocol-specific message handling, and so alter
+the send/receive buffer format, may use a new protocol argument.
+
+
+## Networks referenced by index rather than UUID ##
+
+This design proposes referencing networks by an integer index. The MCTP standard
+does optionally associate an RFC4122 UUID with a network; it would be possible
+to use this UUID where we pass a network identifier.
+
+This approach does not incorporate knowledge of network UUIDs in the kernel.
+Given that the Get Network ID message in the MCTP Control Protocol is
+implemented entirely in userspace, the kernel does not need to be aware of
+network UUIDs, and requiring UUIDs in network references (for example, making
+the `smctp_network` field of `struct sockaddr_mctp` a `uuid_t`) would complicate
+assignment.
+
+Instead, an integer index is used, in a similar fashion to the integer
+index used to reference `struct net_device`s elsewhere in the network stack.
+
+
+## Tag behaviour alternatives ##
+
+We considered *several* different designs for the tag handling behaviour. The
+following is a brief overview of the more feasible ones, and why they were
+rejected:
+
+### Each socket is allocated a unique tag value on creation ###
+
+We could allocate a tag for each socket on creation, and use that value when a
+tag is required. This, however:
+
+ * needlessly consumes a tag on non-tag-owning sockets (ie, those which send
+   with TO=0 - responders); and
+
+ * limits us to 8 sockets per network.
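+
+The limit of 8 follows from the width of the tag field in the MCTP transport
+header. As a sketch, the final header byte defined by DSP0236 can be packed and
+unpacked as below; the macro and function names are illustrative and not part
+of this design.
+
+```c
+    #include <assert.h>
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    /* Final byte of the MCTP transport header, per DSP0236 */
+    #define HDR_SOM       0x80  /* start of message */
+    #define HDR_EOM       0x40  /* end of message */
+    #define HDR_SEQ_SHIFT 4
+    #define HDR_SEQ_MASK  0x03  /* 2-bit packet sequence number */
+    #define HDR_TO        0x08  /* tag owner */
+    #define HDR_TAG_MASK  0x07  /* 3-bit message tag: values 0..7 */
+
+    static uint8_t hdr_flags_pack(bool som, bool eom, unsigned int seq,
+                                  bool to, unsigned int tag)
+    {
+        assert(seq <= HDR_SEQ_MASK && tag <= HDR_TAG_MASK);
+        return (uint8_t)((som ? HDR_SOM : 0) | (eom ? HDR_EOM : 0) |
+                         (seq << HDR_SEQ_SHIFT) | (to ? HDR_TO : 0) | tag);
+    }
+
+    static unsigned int hdr_tag(uint8_t flags)
+    {
+        return flags & HDR_TAG_MASK;
+    }
+
+    static bool hdr_tag_owner(uint8_t flags)
+    {
+        return (flags & HDR_TO) != 0;
+    }
+```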
+
+### Tags only used for message packetisation / reassembly ###
+
+An alternative would be to completely dissociate tag allocation from sockets,
+and only allocate a tag for the (short-lived) task of packetising a message and
+sending those packets. Tags would be released when the last packet has been sent.
+
+However, this removes any facility to correlate responses with the correct
+socket, which is the purpose of the TO bit in DSP0236. In order for the sending
+application to receive the response, we would either need to:
+
+ * limit the system to one socket of each message type (which, for example,
+   precludes running a requester and a responder of the same type); or
+
+ * forward all incoming messages of a specific message-type to all sockets
+   listening on that type, making it trivial to eavesdrop on MCTP data of
+   other applications
+
+### Allocate a tag for one request/response pair ###
+
+Another alternative would be to allocate a tag on each outgoing TO=1 message,
+and then release that allocation after the incoming response to that tag (TO=0) is
+observed.
+
+However, MCTP protocols exist that do not have a 1:1 mapping of responses to
+requests - more than one response may be valid for a given request message. For
+example, in response to a request, an NVMe-MI implementation may send an
+in-progress reply before the final reply. In this case, we would release the tag
+after the first response is received, and then have no way to correlate the
+second message with the socket.
+
+Broadcast MCTP request messages may have multiple replies from multiple
+endpoints, meaning we cannot release the tag allocation on the first reply.

diff --git a/designs/mctp/mctp.md b/designs/mctp/mctp.md
index d958e3f..6e2d7d7 100644
--- a/designs/mctp/mctp.md
+++ b/designs/mctp/mctp.md

@@ -91,7 +91,8 @@
    described in [MCTP Userspace](mctp-userspace.md).
 
  - A kernel-based approach, using a sockets API for client and server
-   applications. This approach is in a design stage.
+   applications. This approach is in a design stage, and is described
+   in [MCTP Kernel](mctp-kernel.md)
 
 Design details for both approaches are covered in their relevant
 documents, but both share the same Problem Description, Background and