blob: c8f52df626ab7eb817276ba880e6e53955354991 [file] [log] [blame] [edit]
*** Settings ***
Documentation Stress the system GPUs using the HTX exerciser.
# Test Parameters:
# OPENBMC_HOST The BMC host name or IP address.
# OS_HOST The OS host name or IP Address.
# OS_USERNAME The OS login userid (usually "root").
# OS_PASSWORD The password for the OS login.
# HTX_DURATION Duration of HTX run, for example, "2h", or "30m".
# HTX_LOOP The number of times to loop HTX.
# HTX_INTERVAL The time delay between consecutive checks of HTX
# status, for example, "15m".
# In summary: Run HTX for $HTX_DURATION, looping
# $HTX_LOOP times checking for errors every
# $HTX_INTERVAL. Then allow extra time for OS
# Boot, HTX startup, shutdown.
# HTX_KEEP_RUNNING If set to 1, this indicates that the HTX is to
# continue running after an error was found.
Resource ../syslib/utils_os.robot
Suite Setup Run Keyword Start SOL Console Logging
Test Setup Test Setup Execution
Test Teardown Test Teardown Execution
Force Tags GPU_Stress
*** Variables ****
${HTX_DURATION} 1h
${HTX_LOOP} ${1}
${HTX_INTERVAL} 30m
${HTX_KEEP_RUNNING} ${0}
${stack_mode} skip
*** Test Cases ***
GPU Stress Test
[Documentation] Stress the GPU using HTX exerciser.
[Tags] GPU_Stress_Test
# Get number of GPU reported by the BMC.
${num_bmc_gpus}= Count GPUs From BMC
Rpvars num_bmc_gpus
# The BMC and OS should report the same number of GPUs.
${failmsg01}= Catenate OS reports ${num_os_gpus} GPUs, but BMC
... reports ${num_bmc_gpus} present and functional GPUs.
Run Keyword If '${num_os_gpus}' != '${num_bmc_gpus}'
... Fail msg=${failmsg01}
# Show parameters for HTX stress test.
Printn
Rpvars HTX_DURATION HTX_LOOP HTX_INTERVAL
# Set the iteration (loop) counter.
Set Suite Variable ${iteration} ${0} children=true
# Shutdown HTX if it is already running.
${status}= Is HTX Running
Run Keyword If '${status}' == 'True'
... Shutdown HTX Exerciser
Repeat Keyword ${HTX_LOOP} times Execute GPU Test
*** Keywords ***
Execute GPU Test
[Documentation] Start HTX exerciser.
# Test Flow:
# - Power on
# - Establish SSH connection session
# - Collect GPU nvidia status output
# - Create HTX mdt profile
# - Run GPU specific HTX exerciser
# - Check for errors
Set Suite Variable ${iteration} ${iteration + 1}
${loop_count}= Catenate Starting iteration: ${iteration}
Printn
Rpvars loop_count
REST Power On stack_mode=skip
Run Key U Sleep \ 15s
# Collect data before the test starts.
Collect NVIDIA Log File start
# Collect NVIDIA maximum limits.
${power_max}= Get GPU Power Limit
${temperature_max}= Get GPU Temperature Limit
${clock_max}= Get GPU Clock Limit
Run Keyword If '${HTX_MDT_PROFILE}' == 'mdt.bu'
... Create Default MDT Profile
Run MDT Profile
Loop HTX Health Check
# Post test loop look out for dmesg error logged.
Check For Errors On OS Dmesg Log
# Check NVIDIA power, temperature, and clocks.
${power}= Get GPU Max Power
${temperature}= Get GPU Max Temperature
${temperature_via_rest}= Get GPU Temperature Via REST
${clock}= Get GPU Clock
Printn
Rpvars power power_max temperature temperature_via_rest
... temperature_max clock clock_max
Run Keyword If ${power} > ${power_max} Fail
... msg=GPU Power ${power} exceeds limit of ${power_max}.
${err_msg}= Catenate GPU temperature of ${temperature} exceeds limit
... of ${temperature_max}.
Run Keyword If ${temperature} > ${temperature_max} Fail msg=${err_msg}
Run Keyword If ${clock} > ${clock_max} Fail
... msg=GPU clock of ${clock} exceeds limit of ${clock_max}.
${err_msg}= Catenate The GPU temperature reported by REST is not within
... 5 degrees of the nvidia_smi reported temperature.
${upper_limit}= Evaluate ${temperature_via_rest}+5
${lower_limit}= Evaluate ${temperature_via_rest}-5
Run Keyword If
... ${temperature} > ${upper_limit} or ${temperature} < ${lower_limit}
... Fail msg=${err_msg}
Shutdown HTX Exerciser
Collect NVIDIA Log File end
Error Logs Should Not Exist
REST Power Off
Flush REST Sessions
Print Timen HTX Test ran for: ${HTX_DURATION}
${loop_count}= Catenate Ending iteration: ${iteration}
Printn
Rpvars loop_count
Loop HTX Health Check
[Documentation] Run until HTX exerciser fails.
Repeat Keyword ${HTX_DURATION}
... Run Keywords Check HTX Run Status
... AND Sleep ${HTX_INTERVAL}
Test Setup Execution
[Documentation] Do the initial test setup.
REST Power On stack_mode=skip
Run Key U Sleep \ 15s
Delete All Error Logs
Tool Exist lspci
Tool Exist htxcmdline
Tool Exist nvidia-smi
# Get number of GPUs reported by the OS.
${cmd}= Catenate lspci | grep NVIDIA | wc -l
${num_os_gpus} ${stderr} ${rc}= OS Execute Command ${cmd}
Printn
Rpvars num_os_gpus
# If no GPUs detected, we cannot continue.
Run Keyword If '${num_os_gpus}' == '${0}' Fail
... msg=No GPUs detected so cannot run test.
Set Suite Variable ${num_os_gpus} children=true
Test Teardown Execution
[Documentation] Do the post test teardown.
# Keep HTX running if user set HTX_KEEP_RUNNING to 1.
Run Keyword If '${TEST_STATUS}' == 'FAIL' and ${HTX_KEEP_RUNNING} == ${0}
... Shutdown HTX Exerciser
${keyword_buf}= Catenate Stop SOL Console Logging
... \ targ_file_path=${EXECDIR}${/}logs${/}SOL.log
Run Key ${keyword_buf}
FFDC On Test Case Fail
Close All Connections