blob: 4661e24853711e29f3b3daf0de569d59b434834d [file] [log] [blame]
*** Settings ***
Documentation This suite tests checkstop operations through HOST.
Resource ../lib/utils.robot
Resource ../lib/openbmc_ffdc.robot
Resource ../lib/ras/host_utils.robot
Resource ../lib/resource.txt
Resource ../lib/state_manager.robot
Resource ../lib/openbmc_ffdc_methods.robot
Resource ../lib/boot_utils.robot
Variables ../lib/ras/variables.py
Library DateTime
Library OperatingSystem
Library random
Library Collections
Suite Setup RAS Suite Setup
Test Setup RAS Test Setup
Test Teardown FFDC On Test Case Fail
Suite Teardown RAS Suite Cleanup
*** Variables ***
${stack_mode} normal
*** Test Cases ***
# Memory channel (MCACALIFIR) related error injection.
Verify Recoverable Callout Handling For MCA With Threshold 1
[Documentation] Verify recoverable callout handling for MCACALIFIR with
... threshold 1.
[Tags] Verify_Recoverable_Callout_Handling_For_MCA_With_Threshold_1
${value}= Get From Dictionary ${ERROR_INJECT_DICT} MCACALIFIR_RECV1
${err_log_path}= Catenate ${RAS_LOG_DIR_PATH}mcacalfir_th1
Inject Recoverable Error With Threshold Limit Through Host
... ${value[0]} ${value[1]} 1 ${value[2]} ${err_log_path}
Verify Recoverable Callout Handling For MCA With Threshold 32
[Documentation] Verify recoverable callout handling for MCACALIFIR with
... threshold 32.
[Tags] Verify_Recoverable_Callout_Handling_For_MCA_With_Threshold_32
${value}= Get From Dictionary ${ERROR_INJECT_DICT} MCACALIFIR_RECV32
${err_log_path}= Catenate ${RAS_LOG_DIR_PATH}mcacalfir_th32
Inject Recoverable Error With Threshold Limit Through Host
... ${value[0]} ${value[1]} 32 ${value[2]} ${err_log_path}
Verify Unrecoverable Callout Handling For MCA
[Documentation] Verify unrecoverable callout handling for MCACALIFIR.
[Tags] Verify_Unrecoverable_Callout_Handling_For_MCA
${value}= Get From Dictionary ${ERROR_INJECT_DICT} MCACALIFIR_UE
${err_log_path}= Catenate ${RAS_LOG_DIR_PATH}mcacalfir
Inject Unrecoverable Error Through Host
... ${value[0]} ${value[1]} 1 ${value[2]} ${err_log_path}
# Memory buffer (MCIFIR) related error injection.
Verify Recoverable Callout Handling For MCI With Threshold 1
[Documentation] Verify recoverable callout handling for mci with
... threshold 1.
[Tags] Verify_Recoverable_Callout_Handling_For_MCI_With_Threshold_1
${value}= Get From Dictionary ${ERROR_INJECT_DICT} MCS_RECV1
${err_log_path}= Catenate ${RAS_LOG_DIR_PATH}mcifir_th1
Inject Recoverable Error With Threshold Limit Through Host
... ${value[0]} ${value[1]} 1 ${value[2]} ${err_log_path}
Verify Unrecoverable Callout Handling For MCI
[Documentation] Verify unrecoverable callout handling for mci.
[Tags] Verify_Unrecoverable_Callout_Handling_For_MCI
${value}= Get From Dictionary ${ERROR_INJECT_DICT} MCS_UE
${err_log_path}= Catenate ${RAS_LOG_DIR_PATH}mcifir
Inject Unrecoverable Error Through Host
... ${value[0]} ${value[1]} 1 ${value[2]} ${err_log_path}
# CAPP accelerator (CXAFIR) related error injection.
Verify Recoverable Callout Handling For CXA With Threshold 5
[Documentation] Verify recoverable callout handling for CXA with
... threshold 5.
[Tags] Verify_Recoverable_Callout_Handling_For_CXA_With_Threshold_5
${value}= Get From Dictionary ${ERROR_INJECT_DICT} CXA_RECV5
${err_log_path}= Catenate ${RAS_LOG_DIR_PATH}cxafir_th5
Inject Recoverable Error With Threshold Limit Through Host
... ${value[0]} ${value[1]} 5 ${value[2]} ${err_log_path}
Verify Recoverable Callout Handling For CXA With Threshold 32
[Documentation] Verify recoverable callout handling for CXA with
... threshold 32.
[Tags] Verify_Recoverable_Callout_Handling_For_CXA_With_Threshold_32
${value}= Get From Dictionary ${ERROR_INJECT_DICT} CXA_RECV32
${err_log_path}= Catenate ${RAS_LOG_DIR_PATH}cxafir_th32
Inject Recoverable Error With Threshold Limit Through Host
... ${value[0]} ${value[1]} 32 ${value[2]} ${err_log_path}
# OBUSFIR related error injection.
Verify Recoverable Callout Handling For OBUS With Threshold 32
[Documentation] Verify recoverable callout handling for OBUS with
... threshold 32.
[Tags] Verify_Recoverable_Callout_Handling_For_OBUS_With_Threshold_32
${value}= Get From Dictionary ${ERROR_INJECT_DICT} OBUS_RECV32
${err_log_path}= Catenate ${RAS_LOG_DIR_PATH}obusfir_th32
Inject Recoverable Error With Threshold Limit Through Host
... ${value[0]} ${value[1]} 32 ${value[2]} ${err_log_path}
# Nvidia graphics processing units (NPU0FIR) related error injection.
Verify Recoverable Callout Handling For NPU0 With Threshold 32
[Documentation] Verify recoverable callout handling for NPU0 with
... threshold 32.
[Tags] Verify_Recoverable_Callout_Handling_For_NPU0_With_Threshold_32
${value}= Get From Dictionary ${ERROR_INJECT_DICT} NPU0_RECV32
${err_log_path}= Catenate ${RAS_LOG_DIR_PATH}npu0fir_th32
Inject Recoverable Error With Threshold Limit Through Host
... ${value[0]} ${value[1]} 32 ${value[2]} ${err_log_path}
# Nest accelerator NXDMAENGFIR related error injection.
Verify Recoverable Callout Handling For NXDMAENG With Threshold 1
[Documentation] Verify recoverable callout handling for NXDMAENG with
... threshold 1.
[Tags] Verify_Recoverable_Callout_Handling_For_NXDMAENG_With_Threshold_1
${value}= Get From Dictionary ${ERROR_INJECT_DICT} NX_RECV1
${err_log_path}= Catenate ${RAS_LOG_DIR_PATH}nxfir_th1
Inject Recoverable Error With Threshold Limit Through Host
... ${value[0]} ${value[1]} 1 ${value[2]} ${err_log_path}
Verify Recoverable Callout Handling For NXDMAENG With Threshold 32
[Documentation] Verify recoverable callout handling for NXDMAENG with
... threshold 32.
[Tags] Verify_Recoverable_Callout_Handling_For_NXDMAENG_With_Threshold_32
${value}= Get From Dictionary ${ERROR_INJECT_DICT} NX_RECV32
${err_log_path}= Catenate ${RAS_LOG_DIR_PATH}nxfir_th32
Inject Recoverable Error With Threshold Limit Through Host
... ${value[0]} ${value[1]} 32 ${value[2]} ${err_log_path}
Verify Unrecoverable Callout Handling For NXDMAENG
[Documentation] Verify unrecoverable callout handling for NXDMAENG.
[Tags] Verify_Unrecoverable_Callout_Handling_For_NXDMAENG
${value}= Get From Dictionary ${ERROR_INJECT_DICT} NX_UE
${err_log_path}= Catenate ${RAS_LOG_DIR_PATH}nxfir_ue
Inject Unrecoverable Error Through Host
... ${value[0]} ${value[1]} 1 ${value[2]} ${err_log_path}
# L2FIR related error injection.
Verify Recoverable Callout Handling For L2FIR With Threshold 1
[Documentation] Verify recoverable callout handling for L2FIR with
... threshold 1.
[Tags] Verify_Recoverable_Callout_Handling_For_L2FIR_With_Threshold_1
${value}= Get From Dictionary ${ERROR_INJECT_DICT} L2FIR_RECV1
${translated_fir}= Fetch FIR Address Translation Value 0 ${value[0]} EX
${err_log_path}= Catenate ${RAS_LOG_DIR_PATH}l2fir_th1
Inject Recoverable Error With Threshold Limit Through Host
... ${translated_fir} ${value[1]} 1 ${value[2]} ${err_log_path}
# L3FIR related error injection.
Verify Recoverable Callout Handling For L3FIR With Threshold 1
[Documentation] Verify recoverable callout handling for L3FIR with
... threshold 1.
[Tags] Verify_Recoverable_Callout_Handling_For_L3FIR_With_Threshold_1
${value}= Get From Dictionary ${ERROR_INJECT_DICT} L3FIR_RECV1
${translated_fir}= Fetch FIR Address Translation Value 0 ${value[0]} EX
${err_log_path}= Catenate ${RAS_LOG_DIR_PATH}l3fir_th1
Inject Recoverable Error With Threshold Limit Through Host
... ${translated_fir} ${value[1]} 1 ${value[2]} ${err_log_path}
Verify Recoverable Callout Handling For L3FIR With Threshold 32
[Documentation] Verify recoverable callout handling for L3FIR with
... threshold 32.
[Tags] Verify_Recoverable_Callout_Handling_For_L3FIR_With_Threshold_32
${value}= Get From Dictionary ${ERROR_INJECT_DICT} L3FIR_RECV32
${translated_fir}= Fetch FIR Address Translation Value 0 ${value[0]} EX
${err_log_path}= Catenate ${RAS_LOG_DIR_PATH}l3fir_th32
Inject Recoverable Error With Threshold Limit Through Host
... ${translated_fir} ${value[1]} 32 ${value[2]} ${err_log_path}
# On chip controller (OCCFIR) related error injection.
Verify Recoverable Callout Handling For OCC With Threshold 1
[Documentation] Verify recoverable callout handling for OCCFIR with
... threshold 1.
[Tags] Verify_Recoverable_Callout_Handling_For_OCC_With_Threshold_1
${value}= Get From Dictionary ${ERROR_INJECT_DICT} OCCFIR_RECV1
${err_log_path}= Catenate ${RAS_LOG_DIR_PATH}occfir_th1
Inject Recoverable Error With Threshold Limit Through Host
... ${value[0]} ${value[1]} 1 ${value[2]} ${err_log_path}
# Core management engine (CMEFIR) related error injection.
Verify Recoverable Callout Handling For CMEFIR With Threshold 1
[Documentation] Verify recoverable callout handling for CMEFIR with
... threshold 1.
[Tags] Verify_Recoverable_Callout_Handling_For_CMEFIR_With_Threshold_1
${value}= Get From Dictionary ${ERROR_INJECT_DICT} CMEFIR_RECV1
${translated_fir}= Fetch FIR Address Translation Value 0 ${value[0]} EX
${err_log_path}= Catenate ${RAS_LOG_DIR_PATH}cmefir_th1
Inject Recoverable Error With Threshold Limit Through Host
... ${translated_fir} ${value[1]} 1 ${value[2]} ${err_log_path}
*** Keywords ***
Inject Error Through HOST
[Documentation] Inject checkstop on processor through HOST.
... Test sequence:
... 1. Boot To HOST
... 2. Clear any existing gard records
... 3. Inject Error on processor/centaur
[Arguments] ${fir} ${chip_address} ${threshold_limit}
# Description of argument(s):
# fir FIR (Fault isolation register) value (e.g. 2011400).
# chip_address chip address (e.g 2000000000000000).
# threshold_limit Threshold limit (e.g 1, 5, 32).
Delete Error Logs
Login To OS Host
Gard Operations On OS clear all
# Fetch processor chip IDs.
${chip_ids}= Get ProcChipId From OS Processor
${proc_ids}= Split String ${chip_ids}
${proc_id}= Get From List ${proc_ids} 1
${threshold_limit}= Convert To Integer ${threshold_limit}
:FOR ${i} IN RANGE ${threshold_limit}
\ Run Keyword Putscom Operations On OS ${proc_id} ${fir} ${chip_address}
# Adding delay after each error injection.
\ Sleep 10s
# Adding delay to get error log after error injection.
Sleep 120s
Verify And Clear Gard Records On HOST
[Documentation] Verify And Clear gard records on HOST.
${output}= Gard Operations On OS list
Should Not Contain ${output} 'No GARD entries to display'
Gard Operations On OS clear all
Verify Error Log Entry
[Documentation] Verify error log entry & signature description.
[Arguments] ${signature_desc} ${log_prefix}
# Description of argument(s):
# signature_desc Error log signature description.
# log_prefix Log path prefix.
${resp}= OpenBMC Get Request ${BMC_LOGGING_ENTRY}/list
Should Not Be Equal As Strings ${resp.status_code} ${HTTP_NOT_FOUND}
Collect eSEL Log ${log_prefix}
${error_log_file_path}= Catenate ${log_prefix}esel.txt
${rc} ${output} = Run and Return RC and Output
... grep -i ${signature_desc} ${error_log_file_path}
Should Not Be Empty ${output}
Inject Recoverable Error With Threshold Limit Through Host
[Documentation] Inject and verify recoverable error on processor through
... host.
... Test sequence:
... 1. Enable Auto Reboot Setting
... 2. Inject Error on processor/centaur
... 3. Check If HOST is running.
... 4. Verify error log entry & signature description.
... 4. Verify & clear gard records.
[Arguments] ${fir} ${chip_address} ${threshold_limit}
... ${signature_desc} ${log_prefix}
# Description of argument(s):
# fir FIR (Fault isolation register) value (e.g. 2011400).
# chip_address Chip address (e.g 2000000000000000).
# threshold_limit Threshold limit (e.g 1, 5, 32).
# signature_desc Error log signature description.
# log_prefix Log path prefix.
Set Auto Reboot 1
Inject Error Through HOST ${fir} ${chip_address} ${threshold_limit}
Is Host Running
${output}= Gard Operations On OS list
Should Contain ${output} No GARD
Verify Error Log Entry ${signature_desc} ${log_prefix}
Inject Unrecoverable Error Through Host
[Documentation] Inject and verify recoverable error on processor through
... host.
... Test sequence:
... 1. Enable Auto Reboot Setting
... 2. Inject Error on processor/centaur
... 3. Check If HOST is rebooted.
... 4. Verify error log entry & signature description.
... 4. Verify & clear gard records.
[Arguments] ${fir} ${chip_address} ${threshold_limit}
... ${signature_desc} ${log_prefix}
# Description of argument(s):
# fir FIR (Fault isolation register) value (e.g. 2011400).
# chip_address Chip address (e.g 2000000000000000).
# threshold_limit Threshold limit (e.g 1, 5, 32).
# signature_desc Error Log signature description.
# (e.g 'mcs(n0p0c0) (MCFIR[0]) mc internal recoverable')
# log_prefix Log path prefix.
Set Auto Reboot 1
Inject Error Through HOST ${fir} ${chip_address} ${threshold_limit}
Wait Until Keyword Succeeds 500 sec 20 sec Is Host Rebooted
Wait for OS
Verify And Clear Gard Records On HOST
Verify Error Log Entry ${signature_desc} ${log_prefix}
Fetch FIR Address Translation Value
[Documentation] Fetch FIR address translation value through HOST.
[Arguments] ${proc_chip_id} ${fir} ${target_type}
# Description of argument(s):
# proc_chip_id Processor chip ID (e.g '0', '8').
# fir FIR (Fault isolation register) value (e.g. 2011400).
# core_id Core ID (e.g. 9).
# target_type Target type (e.g. 'EX', 'EQ', 'C').
Login To OS Host
Copy Address Translation Utils To HOST OS
${core_ids}= Get Core IDs From OS 0
# Ignoring master core ID.
${output}= Get Slice From List ${core_ids} 1
# Feth random non-master core ID.
${core_ids_sub_list}= Evaluate random.sample(${core_ids}, 1) random
${core_id}= Get From List ${core_ids_sub_list} 0
${translated_fir_addr}= FIR Address Translation Through HOST
... ${fir} ${core_id} ${target_type}
[Return] ${translated_fir_addr}
RAS Test SetUp
[Documentation] Validates input parameters.
Should Not Be Empty
... ${OS_HOST} msg=You must provide DNS name/IP of the OS host.
Should Not Be Empty
... ${OS_USERNAME} msg=You must provide OS host user name.
Should Not Be Empty
... ${OS_PASSWORD} msg=You must provide OS host user password.
# Boot to OS.
REST Power On quiet=${1}
# Adding delay to after host bring up.
Sleep 60s
RAS Suite Setup
[Documentation] Create RAS log directory to store all RAS test logs.
${RAS_LOG_DIR_PATH}= Catenate ${EXECDIR}/RAS_logs/
Set Suite Variable ${RAS_LOG_DIR_PATH}
Create Directory ${RAS_LOG_DIR_PATH}
OperatingSystem.Directory Should Exist ${RAS_LOG_DIR_PATH}
Empty Directory ${RAS_LOG_DIR_PATH}
RAS Suite Cleanup
[Documentation] Perform RAS suite cleanup and verify that host
... boots after test suite run.
# Boot to OS.
REST Power On quiet=${1}
Delete Error Logs
Gard Operations On OS clear all