| 1 | Release Notes for OpenPower Firmware v2.3.1 |
| 2 | =========================================== |
| 3 | |
| 4 | op-build v2.3.1 was released on July 9th, 2019 and contains several important |
| 5 | fixes for POWER8 and POWER9 systems. |
| 6 | |
| 7 | For POWER8 and POWER9 systems there are updated skiboot, Linux, and buildroot. |
| 8 | There's also an updated hostboot for POWER8 systems. |
| 9 | |
| 10 | skiboot |
| 11 | ------- |
| 12 | |
| 13 | Bug fixes included in this release are: |
| 14 | |
| 15 | - npu2: Purge cache when resetting a GPU |
| 16 | |
| 17 | After putting all a GPU's links in reset, do a cache purge in case we |
| 18 | have CPU cache lines belonging to the now-inaccessible GPU memory. |
| 19 | |
| 20 | - npu2: Reset NVLinks when resetting a GPU |
| 21 | |
| 22 | Resetting a V100 GPU brings its NVLinks down, and if an NPU tries using |
| 23 | those, an HMI occurs. We were lucky not to observe this because bare metal |
| 24 | does not normally reset a GPU, and when passed through, GPUs are usually |
| 25 | before NPUs in the QEMU command line or Libvirt XML and because of that NPUs |
| 26 | are naturally reset first. However, simply changing the device order |
| 27 | triggers HMIs. |
| 28 | |
| 29 | This defines a bus control filter for a PCI slot with a GPU with NVLinks |
| 30 | so that when the host system issues a secondary bus reset to the slot, it resets |
| 31 | the associated NVLinks. |
| 32 | |
| 33 | - hw/phb4: Assert Link Disable bit after ETU init |
| 34 | |
| 35 | The cursed RAID card in ozrom1 has a bug where it ignores PERST being |
| 36 | asserted. The PCIe Base spec is a little vague about what happens |
| 37 | while PERST is asserted, but it does clearly specify that when |
| 38 | PERST is de-asserted the Link Training and Status State Machine |
| 39 | (LTSSM) of a device should return to the initial state (Detect) |
| 40 | defined in the spec and the link training process should restart. |
| 41 | |
| 42 | This bug was worked around in 9078f8268922 ("phb4: Delay training till |
| 43 | after PERST is deasserted") by setting the link disable bit at the |
| 44 | start of the FRESET process and clearing it after PERST was |
| 45 | de-asserted. Although this fixed the bug, the patch offered no |
| 46 | explanation of why the fix worked. |
| 47 | |
| 48 | In b8b4c79d4419 ("hw/phb4: Factor out PERST control") the link disable |
| 49 | workaround was moved into phb4_assert_perst(). This is always |
| 50 | called in the CRESET case, but a following patch resulted in |
| 51 | assert_perst() not being called if phb4_freset() was entered following a |
| 52 | CRESET since p->skip_perst was set in the CRESET handler. This is bad |
| 53 | since a side-effect of the CRESET is that the Link Disable bit is |
| 54 | cleared. |
| 55 | |
| 56 | This, combined with the RAID card ignoring PERST, results in the PCIe |
| 57 | link being trained by the PHB while we're waiting out the 100ms |
| 58 | ETU reset time. If we hack skiboot to print a DLP trace after returning |
| 59 | from phb4_init_hw() we get: :: |
| 60 | |
| 61 | PHB#0001[0:1]: Initialization complete |
| 62 | PHB#0001[0:1]: TRACE:0x0000102101000000 0ms presence GEN1:x16:polling |
| 63 | PHB#0001[0:1]: TRACE:0x0000001101000000 23ms GEN1:x16:detect |
| 64 | PHB#0001[0:1]: TRACE:0x0000102101000000 23ms presence GEN1:x16:polling |
| 65 | PHB#0001[0:1]: TRACE:0x0000183101000000 29ms training GEN1:x16:config |
| 66 | PHB#0001[0:1]: TRACE:0x00001c5881000000 30ms training GEN1:x08:recovery |
| 67 | PHB#0001[0:1]: TRACE:0x00001c5883000000 30ms training GEN3:x08:recovery |
| 68 | PHB#0001[0:1]: TRACE:0x0000144883000000 33ms presence GEN3:x08:L0 |
| 69 | PHB#0001[0:1]: TRACE:0x0000154883000000 33ms trained GEN3:x08:L0 |
| 70 | PHB#0001[0:1]: CRESET: wait_time = 100 |
| 71 | PHB#0001[0:1]: FRESET: Starts |
| 72 | PHB#0001[0:1]: FRESET: Prepare for link down |
| 73 | PHB#0001[0:1]: FRESET: Assert skipped |
| 74 | PHB#0001[0:1]: FRESET: Deassert |
| 75 | PHB#0001[0:1]: TRACE:0x0000154883000000 0ms trained GEN3:x08:L0 |
| 76 | PHB#0001[0:1]: TRACE: Reached target state |
| 77 | PHB#0001[0:1]: LINK: Start polling |
| 78 | PHB#0001[0:1]: LINK: Electrical link detected |
| 79 | PHB#0001[0:1]: LINK: Link is up |
| 80 | PHB#0001[0:1]: LINK: Went down waiting for stabilty |
| 81 | PHB#0001[0:1]: LINK: DLP train control: 0x0000105101000000 |
| 82 | PHB#0001[0:1]: CRESET: Starts |
| 83 | |
| 84 | What has happened here is that the link is trained to 8x Gen3 33ms after |
| 85 | we return from phb4_init_hw(), and before we've waited the 100ms |
| 86 | that we normally wait after re-initialising the ETU. When we "deassert" |
| 87 | PERST later on in the FRESET handler the link is in the L0 (normal) state. At |
| 88 | this point we try to read from the Vendor/Device ID register to verify |
| 89 | that the link is stable and immediately get a PHB fence due to a PCIe |
| 90 | Completion Timeout. Skiboot attempts to recover by doing another CRESET, |
| 91 | but this will encounter the same issue. |
| 92 | |
| 93 | This patch fixes the problem by setting the Link Disable bit (by calling |
| 94 | phb4_assert_perst()) immediately after we return from phb4_init_hw(). |
| 95 | This prevents the link from being trained while PERST is asserted which |
| 96 | seems to avoid the Completion Timeout. With the patch applied we get: :: |
| 97 | |
| 98 | PHB#0001[0:1]: Initialization complete |
| 99 | PHB#0001[0:1]: TRACE:0x0000102101000000 0ms presence GEN1:x16:polling |
| 100 | PHB#0001[0:1]: TRACE:0x0000001101000000 23ms GEN1:x16:detect |
| 101 | PHB#0001[0:1]: TRACE:0x0000102101000000 23ms presence GEN1:x16:polling |
| 102 | PHB#0001[0:1]: TRACE:0x0000909101000000 29ms presence GEN1:x16:disabled |
| 103 | PHB#0001[0:1]: CRESET: wait_time = 100 |
| 104 | PHB#0001[0:1]: FRESET: Starts |
| 105 | PHB#0001[0:1]: FRESET: Prepare for link down |
| 106 | PHB#0001[0:1]: FRESET: Assert skipped |
| 107 | PHB#0001[0:1]: FRESET: Deassert |
| 108 | PHB#0001[0:1]: TRACE:0x0000001101000000 0ms GEN1:x16:detect |
| 109 | PHB#0001[0:1]: TRACE:0x0000102101000000 0ms presence GEN1:x16:polling |
| 110 | PHB#0001[0:1]: TRACE:0x0000001101000000 24ms GEN1:x16:detect |
| 111 | PHB#0001[0:1]: TRACE:0x0000102101000000 36ms presence GEN1:x16:polling |
| 112 | PHB#0001[0:1]: TRACE:0x0000183101000000 97ms training GEN1:x16:config |
| 113 | PHB#0001[0:1]: TRACE:0x00001c5881000000 97ms training GEN1:x08:recovery |
| 114 | PHB#0001[0:1]: TRACE:0x00001c5883000000 97ms training GEN3:x08:recovery |
| 115 | PHB#0001[0:1]: TRACE:0x0000144883000000 99ms presence GEN3:x08:L0 |
| 116 | PHB#0001[0:1]: TRACE: Reached target state |
| 117 | PHB#0001[0:1]: LINK: Start polling |
| 118 | PHB#0001[0:1]: LINK: Electrical link detected |
| 119 | PHB#0001[0:1]: LINK: Link is up |
| 120 | PHB#0001[0:1]: LINK: Link is stable |
| 121 | PHB#0001[0:1]: LINK: Card [9005:028c] Optimal Retry:disabled |
| 122 | PHB#0001[0:1]: LINK: Speed Train:GEN3 PHB:GEN4 DEV:GEN3 |
| 123 | PHB#0001[0:1]: LINK: Width Train:x08 PHB:x08 DEV:x08 |
| 124 | PHB#0001[0:1]: LINK: RX Errors Now:0 Max:8 Lane:0x0000 |
| 125 | |
| 126 | - npu2: Reset PID wildcard and refcounter when mapped to LPID |
| 127 | |
| 128 | Since 105d80f85b "npu2: Use unfiltered mode in XTS tables" we do not |
| 129 | register every PID in the XTS table so the table has one entry per LPID. |
| 130 | Then we added a reference counter to keep track of the entry use when |
| 131 | switching GPU between the host and guest systems (the "Fixes:" tag below). |
| 132 | |
| 133 | The POWERNV platform setup creates such entries and references them |
| 134 | at boot time when initializing IOMMUs and only removes them when |
| 135 | a GPU is passed through to a guest. This creates a problem as POWERNV |
| 136 | boots via kexec and no dereferencing happens; the XTS table state remains |
| 137 | undefined. So when the host kernel boots, skiboot thinks there are valid |
| 138 | XTS entries and does not update the XTS table which breaks ATS. |
| 139 | |
| 140 | This resets the reference counter and the XTS entry when a GPU is |
| 141 | assigned to an LPID, since we cannot rely on the kernel to clean that up. |
| 142 | |
| 143 | - hw/phb4: Use read/write_reg in assert_perst |
| 144 | |
| 145 | While the PHB is fenced we can't use the MMIO interface to access PHB |
| 146 | registers. While processing a complete reset we inject a PHB fence to |
| 147 | isolate the PHB from the rest of the system because the PHB won't |
| 148 | respond to MMIOs from the rest of the system while being reset. |
| 149 | |
| 150 | We assert PERST after the fence has been erected which requires us to |
| 151 | use the XSCOM indirect interface to access the PHB registers rather than |
| 152 | the MMIO interface. Previously we did that when asserting PERST in the |
| 153 | CRESET path. However, in b8b4c79d4419 ("hw/phb4: Factor out PERST |
| 154 | control") this was re-written to use the raw in_be64() accessor. This |
| 155 | means that PERST would not be asserted in the reset path. On some |
| 156 | Mellanox cards this would prevent them from re-loading their firmware |
| 157 | when the system was fast-reset. |
| 158 | |
| 159 | This patch fixes the problem by replacing the raw {in|out}_be64() |
| 160 | accessors with the phb4_{read|write}_reg() functions. |
| 161 | |
| 162 | - opal-prd: Fix prd message size issue |
| 163 | |
| 164 | If the prd message buffer is too small, the read_prd_msg() call fails with |
| 165 | the error below, and the caller does not reallocate a sufficiently large |
| 166 | buffer. It is also hard to guess the required size. |
| 167 | |
| 168 | Sample log: :: |
| 169 | |
| 170 | Mar 28 03:31:43 zz24p1 opal-prd: FW: error reading from firmware: alloc 32 rc -1: Invalid argument |
| 171 | Mar 28 03:31:43 zz24p1 opal-prd: FW: error reading from firmware: alloc 32 rc -1: Invalid argument |
| 172 | Mar 28 03:31:43 zz24p1 opal-prd: FW: error reading from firmware: alloc 32 rc -1: Invalid argument |
| 173 | .... |
| 174 | |
| 175 | Let's use the opal-msg-size device tree property to allocate memory |
| 176 | for the prd message. |
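
As a minimal sketch of the idea (not the opal-prd code itself), the buffer size can be decoded from the raw property bytes and used for the allocation. It assumes the property is a single big-endian 32-bit cell, which is the usual device tree encoding; the helper names and the fallback size are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative fallback only, not a value taken from opal-prd. */
#define FALLBACK_MSG_SIZE 4096u

/* Device tree cells are stored big-endian; decode one 32-bit cell. */
static uint32_t dt_cell_to_u32(const uint8_t cell[4])
{
	return ((uint32_t)cell[0] << 24) | ((uint32_t)cell[1] << 16) |
	       ((uint32_t)cell[2] << 8) | (uint32_t)cell[3];
}

/*
 * Hypothetical helper: allocate a message buffer sized from the raw
 * property bytes. prop may be NULL when the property is absent, in
 * which case the fallback size is used.
 */
static void *alloc_msg_buf(const uint8_t *prop, size_t prop_len,
			   size_t *out_size)
{
	size_t size = FALLBACK_MSG_SIZE;

	assert(out_size != NULL);

	if (prop && prop_len == 4 && dt_cell_to_u32(prop) != 0)
		size = dt_cell_to_u32(prop);

	*out_size = size;
	return malloc(size);
}
```

Sizing the buffer from what firmware advertises avoids guessing a constant that may be too small for larger messages.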
| 177 | |
| 178 | - npu2: Fix clearing the FIR bits |
| 179 | |
| 180 | FIR registers are SCOM-only so they cannot be accessed with the indirect |
| 181 | write, and yet we use SCOM-based addresses for these; fix this. |
| 182 | |
| 183 | - opal-gard: Account for ECC size when clearing partition |
| 184 | |
| 185 | When 'opal-gard clear all' is run, it works by erasing the GUARD then |
| 186 | using blocklevel_smart_write() to write nothing to the partition. This |
| 187 | second write call is needed because we rely on libflash to set the ECC |
| 188 | bits appropriately when the partition contained ECCed data. |
| 189 | |
| 190 | The API for this is a little odd with the caller specifying how much |
| 191 | actual data to write, and libflash writing size + size/8 bytes |
| 192 | since there is one additional ECC byte for every eight bytes of data. |
| 193 | |
| 194 | We currently do not account for the extra space consumed by the ECC data |
| 195 | in reset_partition() which is used to handle the 'clear all' command, |
| 196 | which results in the partition following the GUARD partition being |
| 197 | partially overwritten when the command is used. This patch fixes the |
| 198 | problem by reducing the length we would normally write by the number |
| 199 | of ECC bytes required. |
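
The arithmetic can be sketched as follows. These helpers are hypothetical and are not the libflash API; they only illustrate the "one ECC byte per eight data bytes" rule described above and assume the data length is a multiple of eight.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes occupied on flash by data_len bytes of
 * payload, with one ECC byte for every eight data bytes.
 */
static uint64_t ecc_total_size(uint64_t data_len)
{
	assert(data_len % 8 == 0);
	return data_len + data_len / 8;
}

/*
 * Hypothetical helper: largest payload (a multiple of eight bytes)
 * whose ECC-encoded form still fits in part_size bytes of flash.
 * Each group of 9 flash bytes carries 8 data bytes.
 */
static uint64_t ecc_max_payload(uint64_t part_size)
{
	return (part_size / 9) * 8;
}
```

For an illustrative 4608-byte partition, the payload to request is 4096 bytes, which occupies exactly 4608 bytes once ECC is added. Requesting 4608 bytes of payload instead would write 5184 bytes and spill 576 bytes into the next partition.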
| 200 | |
| 201 | - nvram: Flag dangerous NVRAM options |
| 202 | |
| 203 | Most nvram options used by skiboot are just for debug or testing for |
| 204 | regressions. They should never be used long term. |
| 205 | |
| 206 | We've hit a number of issues in testing and the field where nvram |
| 207 | options have been set "temporarily" but haven't been properly cleared |
| 208 | after, resulting in crashes or real bugs being masked. |
| 209 | |
| 210 | This patch marks most nvram options used by skiboot as dangerous and |
| 211 | prints a chicken to remind users of the problem. |
| 212 | |
| 213 | - devicetree: Don't set path to dtc in makefile |
| 214 | |
| 215 | By setting the path we fail to build under buildroot which has its own |
| 216 | set of host tools in PATH, but not at /usr/bin. |
| 217 | |
| 218 | Keep the variable so it can be set if need be but default to whatever |
| 219 | 'dtc' is in the user's path. |
| 220 | |
| 221 | |
| 222 | Linux and buildroot |
| 223 | ------------------- |
| 224 | |
| 225 | Move to Linux v5.1.15-openpower1 and buildroot 2019.02.3 |
| 226 | |
| 227 | This updates to an in-support stable Linux release, resolving potential |
| 228 | security and stability issues. Notably, this includes fixes for |
| 229 | CVE-2019-12817, CVE-2019-11477, CVE-2019-11478, and CVE-2019-11479. |
| 230 | |
| 231 | Buildroot stays on the same major version with the .2 and .3 stable |
| 232 | releases added in. |
| 233 | |
| 234 | The skiroot defconfig is updated to ensure we still run the MMU in Radix |
| 235 | mode (see http://git.kernel.org/torvalds/c/8adddf349fda0). It also |
| 236 | disables xmon by default. |
| 237 | |
| 238 | Hostboot |
| 239 | -------- |
| 240 | |
| 241 | Point op-build P8 hostboot at commit to report cache-count-disabled OS flag |
| 242 | |
| 243 | Points OP build at the P8 hostboot package commit which enables reporting to the OS |
| 244 | that the cache-count-disabled Spectre workaround is available. |