OpenBMC in-kernel MCTP

Author: Jeremy Kerr <jk@codeconstruct.com.au>

Please refer to the MCTP Overview document for general MCTP design description, background and requirements.

This document describes a kernel-based implementation of MCTP infrastructure, providing a sockets-based API for MCTP communication within an OpenBMC-based platform.

Requirements for a kernel implementation

  • The MCTP messaging API should be an obvious application of the existing POSIX socket interface

  • Configuration should be simple for a straightforward MCTP endpoint: a single network with a single local endpoint id (EID).

  • Infrastructure should be flexible enough to allow for more complex MCTP networks, allowing:

    • MCTP networks (as defined by section 3.2.31 of DSP0236) that consist of multiple local physical interfaces, and/or multiple EIDs;

    • multiple distinct (i.e., non-bridged) networks, possibly containing duplicated EIDs between networks;

    • multiple local EIDs on a single interface, and

    • customisable routing/bridging configurations within a network.

Proposed Design

The design contains several components:

  • An interface for userspace applications to send and receive MCTP messages: A mapping of the sockets API to MCTP usage

  • Infrastructure for control and configuration of the MCTP network(s), consisting of a configuration utility, and a kernel messaging facility for this utility to use.

  • Kernel drivers for physical interface bindings.

In general, the kernel components cover the transport functionality of MCTP, such as message assembly/disassembly, packet forwarding, and physical interface implementations.

Higher-level protocols (such as PLDM) are implemented in userspace, through the introduced socket API. This also includes the majority of the MCTP Control Protocol implementation (DSP0236, section 11) - MCTP endpoints will typically have a specific process to request and respond to control protocol messages. However, the kernel will include a small subset of control protocol code to allow very simple endpoints, with static EID allocations, to run without this process. MCTP endpoints that require more than just single-endpoint functionality (bus owners, bridges, etc), and/or dynamic EID allocation, would include the control message protocol process.

A new driver is introduced to handle each physical interface binding. These drivers expose the appropriate struct net_device to handle transmission and reception of MCTP packets on their associated hardware channels. Under Linux, the namespace for these interfaces is separate from other network interfaces - such as those for ethernet.

Structure: interfaces & networks

The kernel models the local MCTP topology through two items: interfaces and networks.

An interface (or "link") is an instance of an MCTP physical transport binding (as defined by DSP0236, section 3.2.47), likely connected to a specific hardware device. This is represented as a struct net_device, and has a user-visible name and index (ifindex). Non-hardware-attached interfaces are permitted, to allow local loopback and/or virtual interfaces.

A network defines a unique address space for MCTP endpoints by endpoint-ID (described by DSP0236, section 3.2.31). A network has a user-visible identifier to allow references from userspace. Route definitions are specific to one network.

Interfaces are associated with one network. A network may be associated with one or more interfaces.

If multiple networks are present, each may contain EIDs that are also present on other networks.

Sockets API

Protocol definitions

We define a new address family (and corresponding protocol family) for MCTP:

    #define AF_MCTP /* TBD */
    #define PF_MCTP AF_MCTP

MCTP sockets are created with the socket() syscall, specifying AF_MCTP as the domain. Currently, only a SOCK_DGRAM socket type is defined.

    int sd = socket(AF_MCTP, SOCK_DGRAM, 0);

The only (current) value for the protocol argument is 0. Further protocol values may be defined in future.

MCTP sockets opened with a protocol value of 0 will communicate directly at the transport layer; message buffers received by the application will consist of message data from reassembled MCTP packets, and will include the full message including message type byte and optional message integrity check (IC). Individual packet headers are not included; they may be accessible through a future SOCK_RAW socket type.

As with all socket address families, source and destination addresses are specified with a new sockaddr type:

    struct mctp_addr {
            uint8_t             s_addr;
    };

    struct sockaddr_mctp {
            sa_family_t         smctp_family; /* = AF_MCTP */
            int                 smctp_network;
            struct mctp_addr    smctp_addr;
            uint8_t             smctp_type;
            uint8_t             smctp_tag;
    };

    /* MCTP network values */
    #define MCTP_NET_ANY        0

    /* MCTP EID values */
    #define MCTP_ADDR_ANY       0xff
    #define MCTP_ADDR_BCAST     0xff

    /* MCTP type values. Only the least-significant 7 bits of
     * smctp_type are used for type matches; the specification defines
     * the type to be 7 bits.
     */
    #define MCTP_TYPE_MASK      0x7f

    /* MCTP tag definitions; used for smctp_tag field of sockaddr_mctp */
    /* MCTP-spec-defined fields */
    #define MCTP_TAG_MASK    0x07
    #define MCTP_TAG_OWNER   0x08
    /* Others: reserved */

    /* Helpers */
    #define MCTP_TAG_RSP(x) ((x) & MCTP_TAG_MASK) /* response to a request: clear TO, keep value */
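
As a minimal sketch of how the tag definitions above fit together, the following helpers compose and decode an smctp_tag value. The function names are hypothetical and are not part of the proposed API; only MCTP_TAG_MASK and MCTP_TAG_OWNER come from this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MCTP_TAG_MASK  0x07
#define MCTP_TAG_OWNER 0x08

/* Hypothetical helper: build an smctp_tag value from a 3-bit tag value
 * and the tag-owner (TO) flag. */
static uint8_t mctp_tag_make(uint8_t value, bool owner)
{
        assert(value <= MCTP_TAG_MASK);
        return (uint8_t)(value | (owner ? MCTP_TAG_OWNER : 0));
}

/* Hypothetical helper: true if the TO bit is set, as it is on requests. */
static bool mctp_tag_is_owner(uint8_t tag)
{
        return (tag & MCTP_TAG_OWNER) != 0;
}

/* Hypothetical helper: tag to use when replying. Keeps the 3-bit value
 * and clears the TO bit, which is what MCTP_TAG_RSP() does. */
static uint8_t mctp_tag_response(uint8_t tag)
{
        return tag & MCTP_TAG_MASK;
}
```

For example, a request received with tag value 5 and TO set carries smctp_tag 0x0d, and the matching response is sent with smctp_tag 0x05.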

Syscall behaviour

The following sections describe the MCTP-specific behaviours of the standard socket system calls. These behaviours have been chosen to map closely to the existing sockets APIs.

bind(): set local socket address

Sockets that receive incoming request packets will bind to a local address, using the bind() syscall.

    struct sockaddr_mctp addr;

    addr.smctp_family = AF_MCTP;
    addr.smctp_network = MCTP_NET_ANY;
    addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
    addr.smctp_type = MCTP_TYPE_PLDM;
    addr.smctp_tag = MCTP_TAG_OWNER;

    int rc = bind(sd, (struct sockaddr *)&addr, sizeof(addr));

This establishes the local address of the socket. Incoming MCTP messages that match the network, address, and message type will be received by this socket. The reference to 'incoming' is important here; a bound socket will only receive messages with the TO bit set, to indicate an incoming request message, rather than a response.

The smctp_tag value will configure the tags accepted from the remote side of this socket. Given the above, the only valid value is MCTP_TAG_OWNER, which will result in remotely "owned" tags being routed to this socket. Since MCTP_TAG_OWNER is set, the 3 least-significant bits of smctp_tag are not used; callers must set them to zero. See the Tag behaviour for transmitted messages section for more details. If the MCTP_TAG_OWNER bit is not set, bind() will fail with an errno of EINVAL.

A smctp_network value of MCTP_NET_ANY will configure the socket to receive incoming packets from any locally-connected network. A specific network value will cause the socket to only receive incoming messages from that network.

The smctp_addr field specifies a local address to bind to. A value of MCTP_ADDR_ANY configures the socket to receive messages addressed to any local destination EID.

The smctp_type field specifies which message types to receive. Only the lower 7 bits of the type are matched on incoming messages (i.e., the most-significant IC bit is not part of the match). This results in the socket receiving packets with and without a message integrity check footer.
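
The matching rules described in this section can be summarised as a single predicate. This is only an illustrative sketch: struct mctp_bind, struct mctp_rx and mctp_bind_matches() are hypothetical names, and the real kernel lookup may be structured quite differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MCTP_NET_ANY   0
#define MCTP_ADDR_ANY  0xff
#define MCTP_TYPE_MASK 0x7f
#define MCTP_TAG_OWNER 0x08

/* Hypothetical record of the values passed to bind(). */
struct mctp_bind {
        int     network;
        uint8_t addr;
        uint8_t type;
        uint8_t tag;
};

/* Hypothetical summary of one reassembled incoming message. */
struct mctp_rx {
        int     network;   /* network the message arrived on */
        uint8_t dest_eid;  /* local EID the message was addressed to */
        uint8_t type_byte; /* first byte of the message, IC bit included */
        uint8_t tag;       /* TO bit plus 3-bit tag value */
};

static bool mctp_bind_matches(const struct mctp_bind *b,
                              const struct mctp_rx *m)
{
        /* bind() would have failed with EINVAL without the owner bit */
        assert(b->tag & MCTP_TAG_OWNER);

        /* a bound socket only receives requests (TO set) */
        if (!(m->tag & MCTP_TAG_OWNER))
                return false;
        if (b->network != MCTP_NET_ANY && b->network != m->network)
                return false;
        if (b->addr != MCTP_ADDR_ANY && b->addr != m->dest_eid)
                return false;
        /* the IC bit is not part of the type match */
        return (b->type & MCTP_TYPE_MASK) == (m->type_byte & MCTP_TYPE_MASK);
}
```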

connect(): set remote socket address

Applications may specify a socket's remote address with the connect() syscall:

    struct sockaddr_mctp addr;
    int rc;

    addr.smctp_family = AF_MCTP;
    addr.smctp_network = MCTP_NET_ANY;
    addr.smctp_addr.s_addr = 8;
    addr.smctp_tag = MCTP_TAG_OWNER;
    addr.smctp_type = MCTP_TYPE_PLDM;

    rc = connect(sd, (struct sockaddr *)&addr, sizeof(addr));

This establishes the remote address of a socket, used for future message transmission. Like other SOCK_DGRAM behaviour, this does not generate any MCTP traffic directly, but just sets the default destination for messages sent from this socket.

The smctp_network field may specify a locally-attached network, or the value MCTP_NET_ANY, in which case the kernel will select a suitable MCTP network. This is guaranteed to work for single-network configurations, but may require additional routing definitions for endpoints attached to multiple distinct networks. See the Addressing section for details.

The smctp_addr field specifies a remote EID. This may be MCTP_ADDR_BCAST, the MCTP broadcast EID (0xff).

The smctp_type field specifies the type field of messages transferred over this socket.

The smctp_tag value will configure the tag used for the local side of this socket. The only valid value is MCTP_TAG_OWNER, which will result in an "owned" tag being allocated for this socket. That tag will remain allocated for all future outgoing messages, until either the socket is closed, or connect() is called again. If a tag cannot be allocated, connect() will report an error, with an errno value of EAGAIN. See the Tag behaviour for transmitted messages section for more details. If the MCTP_TAG_OWNER bit is not set, connect() will fail with an errno of EINVAL.

Requesters which connect to a single responder will typically use connect() to specify the peer address and tag for future outgoing messages.

sendto(), sendmsg(), send() & write(): transmit an MCTP message

An MCTP message is transmitted using one of the sendto(), sendmsg(), send() or write() syscalls. Using sendto() as the primary example:

    struct sockaddr_mctp addr;
    char buf[14];
    ssize_t len;

    /* set message destination */
    addr.smctp_family = AF_MCTP;
    addr.smctp_network = 0;
    addr.smctp_addr.s_addr = 8;
    addr.smctp_tag = MCTP_TAG_OWNER;
    addr.smctp_type = MCTP_TYPE_ECHO;

    /* arbitrary message to send, with message-type header */
    buf[0] = MCTP_TYPE_ECHO;
    memcpy(buf + 1, "hello, world!", sizeof(buf) - 1);

    len = sendto(sd, buf, sizeof(buf), 0,
                    (struct sockaddr *)&addr, sizeof(addr));

The address argument is treated the same way as for connect(): the network and address fields define the remote address to send to. If smctp_tag has the MCTP_TAG_OWNER bit set, the kernel will ignore any bits set in the tag value field (MCTP_TAG_MASK), and generate a tag value suitable for the destination EID. If MCTP_TAG_OWNER is not set, the message will be sent with the tag value as specified. If a tag value cannot be allocated, the system call will fail with an errno of EAGAIN.

The application must provide the message type byte as the first byte of the message buffer passed to sendto(). If a message integrity check is to be included in the transmitted message, it must also be provided in the message buffer, and the most-significant bit of the message type byte must be 1.

If the first byte of the message does not match the message type value, then the system call will return an error of EPROTO.
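
The buffer layout and the type check can be illustrated in userspace. The sketch below is not kernel code; mctp_msg_build() and mctp_msg_check_type() are hypothetical helpers, and comparing only the low 7 bits (so that a set IC bit does not cause a mismatch) is an assumption drawn from the 7-bit definition of smctp_type.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MCTP_TYPE_MASK 0x7f

/* Hypothetical helper: place the type byte first, then the payload.
 * Returns the total message length, or -1 with errno set to EMSGSIZE
 * if the buffer is too small. */
static long mctp_msg_build(uint8_t *buf, size_t buflen, uint8_t type_byte,
                           const void *payload, size_t payload_len)
{
        assert(buf != NULL && (payload != NULL || payload_len == 0));

        if (buflen < 1 || payload_len > buflen - 1) {
                errno = EMSGSIZE;
                return -1;
        }
        buf[0] = type_byte;
        if (payload_len > 0)
                memcpy(buf + 1, payload, payload_len);
        return (long)(payload_len + 1);
}

/* Hypothetical helper mirroring the check described above: returns 0 if
 * the first byte agrees with smctp_type, or -1 with errno set to EPROTO.
 * Assumption: the IC bit is excluded from the comparison. */
static int mctp_msg_check_type(const uint8_t *buf, size_t len,
                               uint8_t smctp_type)
{
        if (len < 1 ||
            (buf[0] & MCTP_TYPE_MASK) != (smctp_type & MCTP_TYPE_MASK)) {
                errno = EPROTO;
                return -1;
        }
        return 0;
}
```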

The send() and write() system calls behave in a similar way, but do not specify a remote address. Therefore, connect() must be called beforehand; if not, these calls will return an error of EDESTADDRREQ (Destination address required).

Using sendto() or sendmsg() on a connected socket may override the remote socket address specified in connect(). The connect() address and tag will remain associated with the socket, for future unaddressed sends. The tag allocated through a call to sendto() or sendmsg() on a connected socket is subject to the same invalidation logic as on an unconnected socket: It is expired either by timeout or by a subsequent sendto().

The sendmsg() system call allows a more compact argument interface, and the message buffer to be specified as a scatter-gather list. At present no ancillary message types (used for the msg_control data passed to sendmsg()) are defined.

Transmitting a message on an unconnected socket with MCTP_TAG_OWNER specified will cause an allocation of a tag, if no valid tag is already allocated for that destination. The (destination-eid,tag) tuple acts as an implicit local socket address, to allow the socket to receive responses to this outgoing message. If any previous allocation has been performed (for a different remote EID), that allocation is lost. This tag behaviour can be controlled through the MCTP_TAG_CONTROL socket option.

Sockets will only receive responses to requests they have sent (with TO=1) and may only respond (with TO=0) to requests they have received.

recvfrom(), recvmsg(), recv() & read(): receive an MCTP message

An MCTP message can be received by an application using one of the recvfrom(), recvmsg(), recv() or read() system calls. Using recvfrom() as the primary example:

    struct sockaddr_mctp addr;
    socklen_t addrlen;
    char buf[14];
    ssize_t len;

    addrlen = sizeof(addr);

    len = recvfrom(sd, buf, sizeof(buf), 0,
                    (struct sockaddr *)&addr, &addrlen);

    /* We can expect addr to describe an MCTP address */
    assert(addrlen >= sizeof(addr));
    assert(addr.smctp_family == AF_MCTP);

    printf("received %zd bytes from remote EID %d\n", len, addr.smctp_addr.s_addr);

The address argument to recvfrom and recvmsg is populated with the remote address of the incoming message, including tag value (this will be needed in order to reply to the message).

The first byte of the message buffer will contain the message type byte. If an integrity check follows the message, it will be included in the received buffer.

The recv() and read() system calls behave in a similar way, but do not provide a remote address to the application. Therefore, these are only useful if the remote address is already known, or the message does not require a reply.

Like the send calls, sockets will only receive responses to requests they have sent (TO=1) and may only respond (TO=0) to requests they have received.

getsockname() & getpeername(): query local/remote socket address

The getsockname() system call returns the struct sockaddr_mctp value for the local side of this socket, and getpeername() returns the value for the remote side (i.e., that used in a connect()). Since the tag value is a property of the remote address, getpeername() may be used to retrieve a kernel-allocated tag value.

Calling getpeername() on an unconnected socket will result in an error of ENOTCONN.

Socket options

The following socket options are defined for MCTP sockets:

MCTP_ADDR_EXT: Use extended addressing information in sendmsg/recvmsg

Enabling this socket option allows an application to specify extended addressing information on transmitted packets, and access the same on received packets.

When the MCTP_ADDR_EXT socket option is enabled, the application may specify an expanded struct sockaddr to the recvfrom() and sendto() system calls. This is defined as:

    struct sockaddr_mctp_ext {
            /* fields exactly match struct sockaddr_mctp */
            sa_family_t         smctp_family; /* = AF_MCTP */
            int                 smctp_network;
            struct mctp_addr    smctp_addr;
            uint8_t             smctp_type;
            uint8_t             smctp_tag;
            /* extended addressing */
            int                 smctp_ifindex;
            uint8_t             smctp_halen;
            unsigned char       smctp_haddr[/* TBD */];
    };

If the addrlen specified to sendto() or recvfrom() is sufficient to contain this larger structure, then the extended addressing fields are consumed / populated respectively.
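
One way to picture the addrlen rule is as a classification of the caller-supplied length. The sketch below is an assumption-laden illustration: the hardware address size is still TBD in this design, so MCTP_HADDR_MAX is a placeholder, the leading fields are assumed to mirror struct sockaddr_mctp including smctp_type, and mctp_classify_addrlen() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/socket.h>

/* Placeholder only: the design leaves the hardware address size TBD. */
#define MCTP_HADDR_MAX 32

struct mctp_addr {
        uint8_t s_addr;
};

struct sockaddr_mctp {
        sa_family_t      smctp_family;
        int              smctp_network;
        struct mctp_addr smctp_addr;
        uint8_t          smctp_type;
        uint8_t          smctp_tag;
};

struct sockaddr_mctp_ext {
        sa_family_t      smctp_family;
        int              smctp_network;
        struct mctp_addr smctp_addr;
        uint8_t          smctp_type;
        uint8_t          smctp_tag;
        int              smctp_ifindex;
        uint8_t          smctp_halen;
        unsigned char    smctp_haddr[MCTP_HADDR_MAX];
};

static_assert(sizeof(struct sockaddr_mctp_ext) > sizeof(struct sockaddr_mctp),
              "extended address must be the larger structure");

enum mctp_addr_kind { MCTP_ADDR_INVALID, MCTP_ADDR_BASIC, MCTP_ADDR_EXTENDED };

/* Hypothetical helper: decide how much of the caller's address buffer
 * would be consumed (send) or populated (receive). */
static enum mctp_addr_kind mctp_classify_addrlen(socklen_t addrlen,
                                                 bool ext_enabled)
{
        if (addrlen < sizeof(struct sockaddr_mctp))
                return MCTP_ADDR_INVALID;
        if (ext_enabled && addrlen >= sizeof(struct sockaddr_mctp_ext))
                return MCTP_ADDR_EXTENDED;
        return MCTP_ADDR_BASIC;
}
```

With this rule, existing callers that pass sizeof(struct sockaddr_mctp) keep working unchanged after the option is enabled.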

MCTP_TAG_CONTROL: manage outgoing tag allocation behaviour

The set/getsockopt argument is a mctp_tagctl structure:

    struct mctp_tagctl {
            bool            retain;
            struct timespec timeout;
    };

This allows an application to control the behaviour of allocated tags for non-connected sockets when transferring messages to multiple different destinations (i.e., where a struct sockaddr_mctp is provided for individual messages, and the smctp_addr destination for those messages may vary across calls).

The retain flag indicates to the kernel that the socket should not release tag allocations when a message is sent to a new destination EID. This causes the socket to continue to receive incoming messages to the old (dest,tag) tuple, in addition to the new tuple.

The timeout value specifies a maximum amount of time to retain tag values. This should be based on the reply timeout for any upper-level protocol.

The kernel may reject a request to set values that would cause excessive tag allocation by this socket. The kernel may also reject subsequent tag-allocation requests (through send or connect syscalls) which would cause excessive tags to be consumed by the socket, even though the tag control settings were accepted in the setsockopt operation.

Changing the default tag control behaviour should only be required when:

  • the socket is sending messages with TO=1 (i.e., is a requester); and
  • messages are sent to multiple different destination EIDs from the one socket.

Syscalls not implemented

The following system calls are not implemented for MCTP, primarily as they are not used in SOCK_DGRAM-type sockets:

  • listen()
  • accept()
  • ioctl()
  • shutdown()
  • mmap()

Userspace examples

These examples cover three general use-cases:

  • requester: sends requests to a particular (EID, type) target, and receives responses to those packets

    This is similar to a typical UDP client

  • responder: receives all locally-addressed messages of a specific message-type, and responds to the requester immediately.

    This is similar to a typical UDP server

  • controller: a specific service for a bus owner; may send broadcast messages, manage EID allocations, update local MCTP stack state. Will need low-level packet data.

    This is similar to a DHCP server.

Requester

"Client"-side implementation to send requests to a responder, and receive a response. This uses a (fictitious) message type of MCTP_TYPE_ECHO.

    int main() {
            struct sockaddr_mctp addr;
            socklen_t addrlen;
            struct {
                uint8_t type;
                uint8_t data[14];
            } msg;
            int sd, rc;

            sd = socket(AF_MCTP, SOCK_DGRAM, 0);

            addr.smctp_family = AF_MCTP;
            addr.smctp_network = MCTP_NET_ANY; /* any network */
            addr.smctp_addr.s_addr = 9;    /* remote eid 9 */
            addr.smctp_tag = MCTP_TAG_OWNER; /* kernel will allocate an owned tag */
            addr.smctp_type = MCTP_TYPE_ECHO; /* fictitious message type */
            addrlen = sizeof(addr);

            /* set message type and payload */
            msg.type = MCTP_TYPE_ECHO;
            strncpy((char *)msg.data, "hello, world!", sizeof(msg.data));

            /* send message */
            rc = sendto(sd, &msg, sizeof(msg), 0,
                            (struct sockaddr *)&addr, addrlen);

            if (rc < 0)
                    err(EXIT_FAILURE, "sendto");

            /* Receive reply. This will block until a reply arrives,
             * which may never happen. Actual code would need a timeout
             * here. */
            rc = recvfrom(sd, &msg, sizeof(msg), 0,
                        (struct sockaddr *)&addr, &addrlen);
            if (rc < 0)
                    err(EXIT_FAILURE, "recvfrom");

            assert(msg.type == MCTP_TYPE_ECHO);
            /* ensure we're nul-terminated */
            msg.data[sizeof(msg.data)-1] = '\0';

            printf("reply: %s\n", msg.data);

            return EXIT_SUCCESS;
    }

Responder

"Server"-side implementation to receive requests and respond. Like the client, this uses a (fictitious) message type of MCTP_TYPE_ECHO in the struct sockaddr_mctp; only messages matching this type will be received.

    int main() {
            struct sockaddr_mctp addr;
            socklen_t addrlen;
            int sd, rc;

            sd = socket(AF_MCTP, SOCK_DGRAM, 0);

            addr.smctp_family = AF_MCTP;
            addr.smctp_network = MCTP_NET_ANY; /* any network */
            addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
            addr.smctp_type = MCTP_TYPE_ECHO;
            addr.smctp_tag = MCTP_TAG_OWNER;
            addrlen = sizeof(addr);

            rc = bind(sd, (struct sockaddr *)&addr, addrlen);
            if (rc)
                    err(EXIT_FAILURE, "bind");

            for (;;) {
                    struct {
                        uint8_t type;
                        uint8_t data[14];
                    } msg;

                    /* recvfrom() updates addrlen, so reset it each time */
                    addrlen = sizeof(addr);
                    rc = recvfrom(sd, &msg, sizeof(msg), 0,
                                    (struct sockaddr *)&addr, &addrlen);
                    if (rc < 0)
                            err(EXIT_FAILURE, "recvfrom");
                    if (rc < 1) {
                            warnx("not enough data for a message type");
                            continue;
                    }

                    assert(addrlen == sizeof(addr));
                    assert(msg.type == MCTP_TYPE_ECHO);

                    printf("%d bytes from EID %d\n", rc, addr.smctp_addr.s_addr);

                    /* Reply to requester; this macro just clears the TO-bit.
                     * Other addr fields will describe the remote endpoint,
                     * so use those as-is.
                     */
                    addr.smctp_tag = MCTP_TAG_RSP(addr.smctp_tag);

                    rc = sendto(sd, &msg, rc, 0,
                                (struct sockaddr *)&addr, addrlen);
                    if (rc < 0)
                            err(EXIT_FAILURE, "sendto");
            }

            return EXIT_SUCCESS;
    }

Broadcast request

Sends a request to a broadcast EID, and receives (unicast) replies. Typical control protocol pattern.

    int main() {
            struct sockaddr_mctp txaddr, rxaddr;
            struct timespec start, cur;
            struct pollfd pollfds[1];
            socklen_t addrlen;
            uint8_t buf[2];
            int sd, rc, timeout;

            sd = socket(AF_MCTP, SOCK_DGRAM, 0);

            /* destination address setup */
            txaddr.smctp_family = AF_MCTP;
            txaddr.smctp_network = 1; /* specific network required for broadcast */
            txaddr.smctp_addr.s_addr = MCTP_ADDR_BCAST; /* broadcast dest */
            txaddr.smctp_type = MCTP_TYPE_CONTROL;
            txaddr.smctp_tag = MCTP_TAG_OWNER;

            buf[0] = MCTP_TYPE_CONTROL;
            buf[1] = 'a';

            /* We're doing a sendto() to a broadcast address here. If we were
             * sending more than one broadcast message, we'd be better off
             * doing connect(); sendto();, in order to retain the tag
             * reservation across all transmitted messages. However, since this
             * is a single transmit, that makes no difference in this
             * particular case.
             */
            rc = sendto(sd, buf, 2, 0, (struct sockaddr *)&txaddr,
                            sizeof(txaddr));
            if (rc < 0)
                    err(EXIT_FAILURE, "sendto");

            /* Set up poll behaviour, and record our starting time for
             * reply timeouts */
            pollfds[0].fd = sd;
            pollfds[0].events = POLLIN;
            clock_gettime(CLOCK_MONOTONIC, &start);

            for (;;) {
                    /* Calculate the amount of time left for replies */
                    clock_gettime(CLOCK_MONOTONIC, &cur);
                    timeout = calculate_timeout(&start, &cur, 1000);

                    rc = poll(pollfds, 1, timeout);
                    if (rc < 0)
                        err(EXIT_FAILURE, "poll");

                    /* timeout receiving a reply? */
                    if (rc == 0)
                        break;

                    /* sanity check that we have a message to receive */
                    if (!(pollfds[0].revents & POLLIN))
                        break;

                    addrlen = sizeof(rxaddr);

                    rc = recvfrom(sd, &buf, 2, 0, (struct sockaddr *)&rxaddr,
                            &addrlen);
                    if (rc < 0)
                            err(EXIT_FAILURE, "recvfrom");

                    assert(addrlen >= sizeof(rxaddr));
                    assert(rxaddr.smctp_family == AF_MCTP);

                    printf("response from EID %d\n", rxaddr.smctp_addr.s_addr);
            }

            return EXIT_SUCCESS;
    }
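
The broadcast example calls calculate_timeout(), which is not defined in this document. One possible implementation is sketched below, on the assumption that the third argument is the total reply budget in milliseconds and that the return value is the remaining time, suitable for passing to poll().

```c
#include <time.h>

/* Hypothetical helper for the example above: milliseconds remaining of a
 * total_ms budget that began at *start, given the current time *cur.
 * Returns 0 once the budget is spent, so poll() returns immediately. */
static int calculate_timeout(const struct timespec *start,
                             const struct timespec *cur, int total_ms)
{
        long long elapsed_ms;

        elapsed_ms = (long long)(cur->tv_sec - start->tv_sec) * 1000 +
                     (cur->tv_nsec - start->tv_nsec) / 1000000;

        /* CLOCK_MONOTONIC should not go backwards; clamp defensively */
        if (elapsed_ms < 0)
                elapsed_ms = 0;
        if (elapsed_ms >= total_ms)
                return 0;

        return (int)(total_ms - elapsed_ms);
}
```

Recomputing the remaining time on each loop iteration keeps the overall wait bounded, however many replies arrive.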

Implementation notes

Addressing

Transmitted messages (through sendto() and related system calls) specify their destination via the smctp_network and smctp_addr fields of a struct sockaddr_mctp.

The smctp_addr field maps directly to the destination endpoint's EID.

The smctp_network field specifies a locally defined network identifier. To simplify situations where there is only one network defined, the special value MCTP_NET_ANY is allowed. This will allow the kernel to select a specific network for transmission.

This selection is entirely user-configured; one specific network may be defined as the system default, in which case it will be used for all message transmission where MCTP_NET_ANY is used as the destination network.

In particular, the destination EID is never used to select a destination network.

MCTP responders should use the EID and network values of an incoming request to specify the destination for any responses.
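
The network selection rule can be sketched as follows. This is an illustration only: struct mctp_net_config and mctp_select_network() are hypothetical, and both the single-network fallback and the ENETUNREACH error value are assumptions, since this document does not specify what happens when MCTP_NET_ANY is used without a configured default. Note that the destination EID is deliberately not an input.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MCTP_NET_ANY 0

/* Hypothetical view of the user-configured network state. */
struct mctp_net_config {
        const int *nets;        /* configured network ids, all non-zero */
        size_t     n_nets;
        int        default_net; /* MCTP_NET_ANY if no default is set */
};

/* Hypothetical helper: resolve the smctp_network value of an outgoing
 * message to a concrete network id, or return -1 with errno set. */
static int mctp_select_network(const struct mctp_net_config *cfg,
                               int requested)
{
        size_t i;

        assert(cfg != NULL);

        if (requested == MCTP_NET_ANY) {
                if (cfg->default_net != MCTP_NET_ANY)
                        return cfg->default_net;
                /* assumption: a lone network is an implicit default */
                if (cfg->n_nets == 1)
                        return cfg->nets[0];
                errno = ENETUNREACH;
                return -1;
        }

        for (i = 0; i < cfg->n_nets; i++)
                if (cfg->nets[i] == requested)
                        return requested;

        errno = ENETUNREACH;
        return -1;
}
```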

Bridging/routing

The network and interface structure allows multiple interfaces to share a common network. By default, packets are not forwarded between interfaces.

A network can be configured for "forwarding" mode. In this mode, packets may be forwarded if their destination EID is non-local, and matches a route for another interface on the same network.

As per DSP0236, packet reassembly does not occur during the forwarding process. If the packet is larger than the MTU for the destination interface/route, then the packet is dropped.
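
The forwarding rules above amount to a short decision function. The following is a sketch under stated assumptions, not the kernel implementation: the type and function names are hypothetical, and the caller is assumed to have already determined whether the destination EID is local and looked up a route on the same network.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum mctp_fwd_action { MCTP_FWD_DELIVER, MCTP_FWD_FORWARD, MCTP_FWD_DROP };

/* Hypothetical result of a route lookup within the packet's network. */
struct mctp_fwd_route {
        int    ifindex; /* outgoing interface */
        size_t mtu;     /* MTU for that interface/route */
};

/* Hypothetical helper: decide what to do with one received packet.
 * route may be NULL if no route matched the destination EID. */
static enum mctp_fwd_action
mctp_forward_decide(bool forwarding, bool dest_is_local, int in_ifindex,
                    const struct mctp_fwd_route *route, size_t pkt_len)
{
        assert(pkt_len > 0);

        if (dest_is_local)
                return MCTP_FWD_DELIVER;
        if (!forwarding)
                return MCTP_FWD_DROP;
        /* only forward via a route for another interface */
        if (route == NULL || route->ifindex == in_ifindex)
                return MCTP_FWD_DROP;
        /* no reassembly or re-fragmentation while forwarding */
        if (pkt_len > route->mtu)
                return MCTP_FWD_DROP;

        return MCTP_FWD_FORWARD;
}
```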

Tag behaviour for transmitted messages

On every message sent with the tag-owner bit set ("TO" in DSP0236), the kernel must allocate a tag that will uniquely identify responses over a (destination EID, source EID, tag-owner, tag) tuple. The tag value is 3 bits in size.

To allow this, a sendto() with the MCTP_TAG_OWNER bit set in the smctp_tag field will cause the kernel to allocate a unique tag for subsequent replies from that specific remote EID.

This allocation will expire when any of the following occur:

  • the socket is closed
  • a new message is sent to a new destination EID
  • an implementation-defined timeout expires

Because the "tag space" is limited, it may not be possible for the kernel to allocate a unique tag for the outgoing message. In this case, the sendto() call will fail with errno EAGAIN. This is analogous to the UDP behaviour when a local port cannot be allocated for an outgoing message.

The implementation-defined timeout value shall be chosen to reasonably cover standard reply timeouts. If necessary, this timeout may be modified through the MCTP_TAG_CONTROL socket option.

For applications that expect to perform an ongoing message exchange with a particular destination address, they may use the connect() call to set a persistent remote address. In this case, the tag will be allocated during connect(), and remain reserved for this socket until any of the following occur:

  • the socket is closed
  • the remote address is changed through another call to connect().

In particular, calling sendto() with a different address does not release the tag reservation.

Broadcast messages are particularly onerous for tag reservations. When a message is transmitted with TO=1 and dest=0xff (the broadcast EID), the kernel must reserve the tag across the entire range of possible EIDs. Therefore, a particular tag value must be currently-unused across all EIDs to allow a sendto() to a broadcast address. Additionally, this reservation is not cleared when a reply is received, as there may be multiple replies to a broadcast.

For this reason, applications wanting to send to the broadcast address should use the connect() system call to reserve a tag, and guarantee its availability for future message transmission. Note that this will remove the tag value for use with any other EID. Sending to the broadcast address should be avoided; we expect few applications will need this functionality.
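
To make the cost of broadcast reservations concrete, here is a deliberately simplified allocator sketch. It assumes a single local EID and a single network (so the source EID part of the tuple is ignored), and struct mctp_tag_table, mctp_tag_alloc() and mctp_tag_release() are hypothetical names rather than kernel interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MCTP_ADDR_BCAST 0xff
#define MCTP_TAG_COUNT  8 /* tag values are 3 bits wide */

/* Hypothetical allocator state: bit t of in_use[eid] is set while tag t
 * is reserved for destination eid. */
struct mctp_tag_table {
        uint8_t in_use[256];
};

/* Reserve a tag for dest_eid. Returns the tag value (0-7), or -1 with
 * errno set to EAGAIN if the tag space is exhausted. */
static int mctp_tag_alloc(struct mctp_tag_table *t, uint8_t dest_eid)
{
        uint8_t busy = 0;
        int i, tag;

        if (dest_eid == MCTP_ADDR_BCAST) {
                /* a broadcast tag must be unused across every EID */
                for (i = 0; i < 256; i++)
                        busy |= t->in_use[i];
        } else {
                /* tags held for broadcast are unavailable to all EIDs */
                busy = t->in_use[dest_eid] | t->in_use[MCTP_ADDR_BCAST];
        }

        for (tag = 0; tag < MCTP_TAG_COUNT; tag++) {
                if (!(busy & (1u << tag))) {
                        t->in_use[dest_eid] |= (uint8_t)(1u << tag);
                        return tag;
                }
        }

        errno = EAGAIN;
        return -1;
}

/* Release a reservation, e.g. on close, timeout or a new destination. */
static void mctp_tag_release(struct mctp_tag_table *t, uint8_t dest_eid,
                             int tag)
{
        assert(tag >= 0 && tag < MCTP_TAG_COUNT);
        t->in_use[dest_eid] &= (uint8_t)~(1u << tag);
}
```

In this model a single outstanding broadcast reservation reduces the tags available to every other destination from eight to seven, which is why broadcast use is discouraged above.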

MCTP Control Protocol implementation

Aside from the "Resolve Endpoint ID" message, the MCTP control protocol implementation would exist as a userspace process, mctpd. This process is responsible for responding to incoming control protocol messages, any dynamic EID allocations (for bus owner devices) and maintaining the MCTP route table (for bridging devices).

This process would create a socket bound to the type MCTP_TYPE_CONTROL, with the MCTP_ADDR_EXT socket option enabled in order to access physical addressing data on incoming control protocol requests. It would interact with the kernel's route table via a netlink interface - the same as that implemented for the Utility and configuration interfaces.

Neighbour and routing implementation

The packet-transmission behaviour of the MCTP infrastructure relies on a single routing table to look up both route and neighbour information. Entries in this table are of the format:

| EID range | interface | physical address | metric | MTU | flags | expiry |

This table can be updated from two sources:

  • From userspace, via a netlink interface (see the Utility and configuration interfaces section).

  • Directly within the kernel, when basic neighbour information is discovered. Kernel-originated routes are marked as such in the flags field, and have a maximum validity age, indicated by the expiry field.

Kernel-discovered routing information can originate from two sources:

  • physical-to-EID mappings discovered through received packets

  • explicit endpoint physical-address resolution requests

When a packet is to be transmitted to an EID that does not have an entry in the routing table, the kernel may attempt to resolve the physical address of that endpoint using the Resolve Endpoint ID command of the MCTP Control Protocol (section 12.9 of DSP0236). The response message will be used to add a kernel-originated route into the routing table.

This is the only kernel-internal usage of MCTP Control Protocol messages.
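
A lookup over the combined route and neighbour table might look like the sketch below. Everything beyond the column names is an assumption made for illustration: the field types, the MCTP_RT_KERNEL flag name, the 8-byte address size, the rule that only kernel-originated entries expire, and the choice of the lowest metric among matching entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define MCTP_RT_KERNEL 0x01 /* hypothetical flag: kernel-originated entry */

/* Hypothetical in-memory form of one routing table entry. */
struct mctp_route {
        uint8_t       eid_min, eid_max; /* EID range, inclusive */
        int           ifindex;          /* interface */
        unsigned char haddr[8];         /* physical address (size illustrative) */
        uint8_t       halen;
        unsigned int  metric;
        unsigned int  mtu;
        unsigned int  flags;
        time_t        expiry;           /* used only with MCTP_RT_KERNEL */
};

/* Return the best unexpired entry covering eid, or NULL if none exists
 * (at which point the kernel may try to resolve the endpoint). */
static const struct mctp_route *
mctp_route_lookup(const struct mctp_route *tbl, size_t n, uint8_t eid,
                  time_t now)
{
        const struct mctp_route *best = NULL;
        size_t i;

        assert(tbl != NULL || n == 0);

        for (i = 0; i < n; i++) {
                const struct mctp_route *rt = &tbl[i];

                if (eid < rt->eid_min || eid > rt->eid_max)
                        continue;
                if ((rt->flags & MCTP_RT_KERNEL) && rt->expiry <= now)
                        continue;
                if (best == NULL || rt->metric < best->metric)
                        best = rt;
        }

        return best;
}
```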

Utility and configuration interfaces

A small utility will be developed to control the state of the kernel MCTP stack. This will be similar in design to the 'iproute2' tools, which perform a similar function for the IPv4 and IPv6 protocols.

The utility will be invoked as mctp, and provide subcommands for managing different aspects of the kernel stack.

mctp link: manage interfaces

    mctp link set <link> <up|down>
    mctp link set <link> network <network-id>
    mctp link set <link> mtu <mtu>
    mctp link set <link> bus-owner <hwaddr>

mctp network: manage networks

    mctp network create <network-id>
    mctp network set <network-id> forwarding <on|off>
    mctp network set <network-id> default [<true|false>]
    mctp network delete <network-id>

mctp address: manage local EID assignments

    mctp address add <eid> dev <link>
    mctp address del <eid> dev <link>

mctp route: manage routing tables

    mctp route add net <network-id> eid <eid|eid-range> via <link> [hwaddr <addr>] [mtu <mtu>] [metric <metric>]
    mctp route del net <network-id> eid <eid|eid-range> via <link> [hwaddr <addr>] [mtu <mtu>] [metric <metric>]
    mctp route show [net <network-id>]

mctp stat: query socket status

    mctp stat

A set of netlink message formats will be defined to support these control functions.

Design points & alternatives considered

Including message-type byte in send/receive buffers

This design specifies that message buffers passed to the kernel in send syscalls and from the kernel in receive syscalls will have the message type byte as the first byte of the buffer. This corresponds to the definition of a MCTP message payload in DSP0236.

This somewhat duplicates the type data provided in struct sockaddr_mctp; it's superficially possible for the kernel to prepend this byte on send, and remove it on receive.

However, the exact format of the MCTP message payload is not precisely defined by the specification. In particular, any message integrity check data (which would also need to be appended / stripped in conjunction with the type byte) is defined by the specification for each message type, not by DSP0236. The kernel would need knowledge of all protocols in order to correctly deconstruct the payload data.

Therefore, we transfer the message payload as-is to userspace, without any modification by the kernel.
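
A minimal sketch of what this means for an application assembling a send buffer is given below. The helper names are hypothetical. The only layout assumed is the one described above (type byte first, followed by type-specific data), plus the DSP0236 convention that the top bit of that byte is the integrity check (IC) flag and the low seven bits are the message type.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: place the message type byte at buf[0], followed by
 * the type-specific body. Returns the total message length, or 0 if the
 * buffer is too small.
 */
size_t sketch_build_msg(uint8_t *buf, size_t buflen, uint8_t type_byte,
                        const uint8_t *body, size_t body_len)
{
    if (buflen < 1 || body_len > buflen - 1)
        return 0;
    buf[0] = type_byte;
    if (body_len)
        memcpy(buf + 1, body, body_len);
    return body_len + 1;
}

/* Message type: low seven bits of the first byte of a received buffer. */
uint8_t sketch_msg_type(const uint8_t *buf)
{
    return buf[0] & 0x7f;
}

/* Integrity check (IC) flag: top bit of the first byte. */
int sketch_msg_has_ic(const uint8_t *buf)
{
    return (buf[0] & 0x80) != 0;
}
```

For example, a PLDM requester (message type 0x01) would pass 0x01 as type_byte and its PLDM message as body, then hand the resulting buffer to the send syscall unchanged.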

MCTP message-type specification: using sockaddr_mctp.smctp_type rather than protocol

This design specifies message-types to be passed in the smctp_type field of struct sockaddr_mctp. An alternative would be to pass it in the protocol argument of the socket() system call:

    int socket(int domain /* = AF_MCTP */, int type /* = SOCK_DGRAM */, int protocol);

The smctp_type implementation was chosen as it better matches the "addressing" model of the message type; sockets are bound to an incoming message type, similar to the IP protocol's model of binding UDP sockets to a local port number.

No kernel behaviour depends on the specific message type (particularly given the design choice above), so the protocol argument is not a good fit for carrying it.

Future additions that perform protocol-specific message handling, and so alter the send/receive buffer format, may use a new protocol argument.
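
To illustrate the "addressing" model, the following userspace sketch treats the message type like a UDP port: a bind registers a socket against a (network, type) pair, and incoming messages are demultiplexed on the same pair. The table, its size, the function names and the rule that a second bind to the same pair fails are assumptions made for this example only.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_BINDS 16

/* Hypothetical bind record: one socket per (network, type) pair. */
struct sketch_bind {
    int in_use;
    int network;
    uint8_t type;
    int sock_id;
};

struct sketch_bind_table {
    struct sketch_bind binds[SKETCH_MAX_BINDS];
};

/* Register sock_id for (network, type). Returns 0, or -1 if taken/full. */
int sketch_bind(struct sketch_bind_table *t, int network, uint8_t type,
                int sock_id)
{
    struct sketch_bind *free_slot = NULL;

    for (size_t i = 0; i < SKETCH_MAX_BINDS; i++) {
        struct sketch_bind *b = &t->binds[i];
        if (!b->in_use) {
            if (!free_slot)
                free_slot = b;
            continue;
        }
        if (b->network == network && b->type == type)
            return -1; /* already bound, as with a UDP port in use */
    }
    if (!free_slot)
        return -1;
    *free_slot = (struct sketch_bind){ 1, network, type, sock_id };
    return 0;
}

/* Find the socket bound to (network, type), or -1 if there is none. */
int sketch_demux(const struct sketch_bind_table *t, int network, uint8_t type)
{
    for (size_t i = 0; i < SKETCH_MAX_BINDS; i++) {
        const struct sketch_bind *b = &t->binds[i];
        if (b->in_use && b->network == network && b->type == type)
            return b->sock_id;
    }
    return -1;
}
```

Nothing in this lookup inspects the payload beyond the type value itself, which is why the type sits more naturally in the address than in the protocol argument.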

Networks referenced by index rather than UUID

This design proposes referencing networks by an integer index. The MCTP standard does optionally associate an RFC 4122 UUID with a network; it would be possible to use this UUID where we pass a network identifier.

This approach does not incorporate knowledge of network UUIDs in the kernel. Because the Get Network ID message of the MCTP Control Protocol is implemented entirely in userspace, the kernel does not need to be aware of network UUIDs, and requiring UUID-typed network references (for example, making the smctp_network field of struct sockaddr_mctp a uuid_t) would complicate assignment.

Instead, an integer index is used, in a similar fashion to the integer index used to reference struct netdevices elsewhere in the network stack.

Tag behaviour alternatives

We considered several different designs for the tag handling behaviour. A brief overview of the more feasible of these, and why they were rejected, follows:

Each socket is allocated a unique tag value on creation

We could allocate a tag for each socket on creation, and use that value when a tag is required. This, however:

  • needlessly consumes a tag on non-tag-owning sockets (ie, those which send with TO=0 - responders); and

  • limits us to 8 sockets per network, since the MCTP tag field is only 3 bits wide.
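
The second limitation can be seen in a small sketch of a per-network tag pool. The pool structure and function names are hypothetical; the only fact relied on is that a 3-bit tag field gives eight distinct values.

```c
#include <stdint.h>

#define SKETCH_NUM_TAGS 8 /* a 3-bit tag field gives 8 values */

/* Hypothetical per-network pool: bit N set means tag N is allocated. */
struct sketch_tag_pool {
    uint8_t used;
};

/* Allocate the lowest free tag, or return -1 if all are in use. */
int sketch_tag_alloc(struct sketch_tag_pool *p)
{
    for (int tag = 0; tag < SKETCH_NUM_TAGS; tag++) {
        if (!(p->used & (1u << tag))) {
            p->used |= (uint8_t)(1u << tag);
            return tag;
        }
    }
    return -1;
}

/* Return a tag to the pool. */
void sketch_tag_free(struct sketch_tag_pool *p, int tag)
{
    if (tag >= 0 && tag < SKETCH_NUM_TAGS)
        p->used &= (uint8_t)~(1u << tag);
}
```

If every socket held a tag from creation until close, the ninth socket on a network would fail at sketch_tag_alloc(), whether or not the first eight ever sent a tag-owning message.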

Tags only used for message packetisation / reassembly

An alternative would be to completely dissociate tag allocation from sockets; and only allocate a tag for the (short-lived) task of packetising a message, and sending those packets. Tags would be released when the last packet has been sent.

However, this removes any facility to correlate responses with the correct socket, which is the purpose of the TO bit in DSP0236. In order for the sending application to receive the response, we would either need to:

  • limit the system to one socket of each message type (which, for example, precludes running a requester and a responder of the same type); or

  • forward all incoming messages of a specific message-type to all sockets listening on that type, making it trivial to eavesdrop on MCTP data of other applications.

Allocate a tag for one request/response pair

Another alternative would be to allocate a tag on each outgoing TO=1 message, and then release that allocation after the incoming response to that tag (TO=0) is observed.

However, MCTP protocols exist that do not have a 1:1 mapping of responses to requests - more than one response may be valid for a given request message. For example, in response to a request, an NVMe-MI implementation may send an in-progress reply before the final reply. In this case, we would release the tag after the first response is received, and then have no way to correlate the second message with the socket.

Similarly, broadcast MCTP request messages may have multiple replies from multiple endpoints, meaning we cannot release the tag allocation on the first reply.
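
The failure mode of this rejected alternative can be shown with a small model. Everything here is hypothetical (the ownership table, the function names, the socket identifiers); it only demonstrates that releasing the tag on the first response leaves any later response without an owning socket.

```c
#define SKETCH_TAG_COUNT 8
#define SKETCH_NO_SOCKET (-1)

/* Hypothetical table mapping each tag value to the socket that owns it. */
struct sketch_tag_owners {
    int sock_id[SKETCH_TAG_COUNT];
};

void sketch_owners_init(struct sketch_tag_owners *o)
{
    for (int i = 0; i < SKETCH_TAG_COUNT; i++)
        o->sock_id[i] = SKETCH_NO_SOCKET;
}

/* Outgoing TO=1 message: allocate a tag for sock_id, or -1 if none free. */
int sketch_request_sent(struct sketch_tag_owners *o, int sock_id)
{
    for (int tag = 0; tag < SKETCH_TAG_COUNT; tag++) {
        if (o->sock_id[tag] == SKETCH_NO_SOCKET) {
            o->sock_id[tag] = sock_id;
            return tag;
        }
    }
    return -1;
}

/*
 * Incoming TO=0 message: return the owning socket and release the tag
 * immediately, as the rejected alternative proposes.
 */
int sketch_response_received(struct sketch_tag_owners *o, int tag)
{
    if (tag < 0 || tag >= SKETCH_TAG_COUNT)
        return SKETCH_NO_SOCKET;
    int owner = o->sock_id[tag];
    o->sock_id[tag] = SKETCH_NO_SOCKET;
    return owner;
}
```

With this model, the first call to sketch_response_received() for a tag returns the requesting socket, while a second call for the same tag (an in-progress reply followed by a final reply, or a second responder to a broadcast) returns SKETCH_NO_SOCKET.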