tree 2b53340fedbfa241c79bcebdeabdffe992bdd791
parent 6014707b9a95f3ca0604eff41561d159540ad39f
author Andrew Jeffery <andrew@aj.id.au> 1617278774 +1030
committer Brad Bishop <bradleyb@fuzziesquirrel.com> 1619782118 +0000

phosphor-mmc-init: exec switch_root(8) rather than chroot(1)

It was found that perf(1) had some issues with recording and analysing
data on Rainier systems:

```
root@rainier:~# perf probe --add mem_serial_in
root@rainier:~# perf record -e probe:mem_serial_in -aR sleep 1
[ perf record: Woken up 1 times to write data ]
assertion failed at util/namespaces.c:257
No kallsyms or vmlinux with build-id e4e9c7cff1deb3bf32958039c696f094dc76cf5c was found
[ perf record: Captured and wrote 0.377 MB perf.data (25 samples) ]
root@rainier:~# perf script -v
build id event received for [kernel.kallsyms]: e4e9c7cff1deb3bf32958039c696f094dc76cf5c
broken or missing trace data
incompatible file format (rerun with -v to learn more)
```

Starting with the failed assertion in the recording, we find the
relevant code is the following WARN_ON_ONCE():

```
void nsinfo__mountns_exit(struct nscookie *nc)
{
	...

        if (nc->oldcwd) {
                WARN_ON_ONCE(chdir(nc->oldcwd));
                zfree(&nc->oldcwd);
        }
```

A strace of `perf record` demonstrates the relevant syscall sequence,
where /home/root is the working directory at the time when `perf record`
is invoked.

```
openat(AT_FDCWD, "/proc/self/ns/mnt", O_RDONLY|O_LARGEFILE) = 12
openat(AT_FDCWD, "/proc/142/ns/mnt", O_RDONLY|O_LARGEFILE) = 13
setns(13, CLONE_NEWNS)                  = 0
statx(AT_FDCWD, "/mnt/rofs/bin/udevadm", AT_STATX_SYNC_AS_STAT|AT_NO_AUTOMOUNT, STATX_BASIC_STATS, {stx_mask=STATX_BASIC_STATS|0x1000, stx_attributes=0, stx_mode=S_IFREG|0755, stx_size=978616, ...}) = 0
openat(AT_FDCWD, "/mnt/rofs/bin/udevadm", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 14
setns(12, CLONE_NEWNS)                  = 0
chdir("/home/root")                     = -1 ENOENT (No such file or directory)
```

From the path of the binary, PID 142 is executing in an unanticipated
environment. Its path is representative of the state of the filesystem
prior to the initramfs handing over to /sbin/init in the real root,
suggesting an issue with the initramfs' /init implementation.
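
The two openat(2) calls on .../ns/mnt in the strace show perf moving
between its own mount namespace and that of PID 142. A minimal sketch
for checking by hand whether two processes share a mount namespace
follows; `same_mntns` is a hypothetical helper name and is not part of
perf or of /init.

```shell
# same_mntns PID_A PID_B: succeed if both PIDs share a mount namespace.
# /proc/<pid>/ns/mnt is a symlink whose target names the namespace, so
# equal targets mean the same namespace. Reading the link of a process
# owned by another user generally needs root.
same_mntns() {
    a=$(readlink "/proc/$1/ns/mnt") || return 2
    b=$(readlink "/proc/$2/ns/mnt") || return 2
    [ "$a" = "$b" ]
}

# Example: compare this shell with PID 142 from the strace above.
# same_mntns $$ 142 && echo shared || echo different
```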

In /init we find a bunch of setup to discover and mount the root device.
At the end of the script we prepare for the real root by exec'ing chroot.

From `man 2 chroot`[0]:

```
DESCRIPTION
       chroot()  changes the root directory of the calling process to that speci‐
       fied in path.  This directory will be used for pathnames beginning with /.
       The root directory is inherited by all children of the calling process.
```

Specifically, this outlines that chroot(2) affects the state of the
calling *process* and not the state of the mount namespace in use by
the process.

Further, a call to `setns(..., CLONE_NEWNS)` explicitly replaces the
mount namespace for the *process*, and as such destroys any chroot state
that might have been associated with the process' original mount
namespace. As the chroot state is not a property of a mount namespace,
switching *back* to the application's original mount namespace does not
restore the process' original chroot state.
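
The effect can be sketched from a shell with chroot(1) and nsenter(1).
This is an illustration under assumptions rather than a tested
reproducer: `show_chroot_reset` is a hypothetical helper, it needs
root, the chroot tree passed as its argument must provide /bin/sh, ls,
nsenter and a mounted /proc, and ls must also exist under the
namespace's real root.

```shell
# Sketch only; defines a helper but does not call it.
show_chroot_reset() {
    newroot=$1
    chroot "$newroot" /bin/sh -c '
        echo "root as seen inside the chroot:"
        ls /
        # nsenter(1) calls setns(2) on the given namespace file. The
        # namespace itself is unchanged, but the root and working
        # directory are reset to the root of that namespace, so the
        # chroot state is gone for the command that follows.
        echo "root after setns into the same mount namespace:"
        nsenter --mount=/proc/self/ns/mnt ls /
    '
}

# Example (as root): show_chroot_reset /mnt/rofs
```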

As such, the chdir(2) in the strace output above returns an error. The
get_current_dir_name(3) call that yielded the provided path was issued
prior to switching into the target process' mount namespace, and the
path was thus derived in the chroot context. The path is therefore
invalid once the original mount namespace is restored via the second
setns(2), as the process has already lost the chroot context for the
original namespace.

For perf(1) to work in its current implementation the effective root for
PID 1 must remain the absolute path "/" with respect to the kernel's VFS
layer. This requires /init to use either pivot_root(8) or
switch_root(8). pivot_root is ruled out by the pivot_root(2) man-page[1]:

```
NOTES
       ...

       The rootfs (initial ramfs)  cannot  be  pivot_root()ed.   The  recommended
       method  of  changing  the root filesystem in this case is to delete every‐
       thing in rootfs, overmount rootfs with the  new  root,  attach  stdin/std‐
       out/stderr to the new /dev/console, and exec the new init(1).  Helper pro‐
       grams for this process exist; see switch_root(8).

       ...
```

As noted, the recommendation is a description of the switch_root(8)
application[2]. The details of why the specific sequence for
switch_root(8) is necessary are documented in [3].

Change /init to use switch_root(8) to avoid the nasty interaction of
chroot(2) and setns(2).
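
As a rough sketch of the shape of the change, assuming the real root is
mounted at /mnt/rofs (inferred from the paths in the strace output) and
using the hypothetical names `rodir` and `handoff_to_real_root`, which
need not match the actual script:

```shell
# Hypothetical tail of an initramfs /init.
rodir=/mnt/rofs

handoff_to_real_root() {
    # Previously the root was changed for the process only:
    #   exec chroot "$rodir" /sbin/init
    # switch_root(8) overmounts "/" with the new root before exec'ing
    # init, so "/" stays valid after a later setns(2).
    exec switch_root "$rodir" /sbin/init
}

# /init would call handoff_to_real_root as its final step.
```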

[0] https://man7.org/linux/man-pages/man2/chroot.2.html#DESCRIPTION
[1] https://man7.org/linux/man-pages/man2/pivot_root.2.html#NOTES
[2] https://man7.org/linux/man-pages/man8/switch_root.8.html
[3] https://git.busybox.net/busybox/tree/util-linux/switch_root.c?h=1_32_1#n298

Change-Id: Iac29b53a462b03559d18fe9b600aefcd1951057e
Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
