# Search-Based Testing of Reinforcement Learning

Martin Tappler1,3, Filip Cano Córdoba2, Bernhard K. Aichernig1,3, Bettina Könighofer2,4

1 Institute of Software Technology, Graz University of Technology
2 Institute of Applied Information Processing and Communications, Graz University of Technology
3 TU Graz-SAL DES Lab, Silicon Austria Labs, Graz, Austria
4 Lamarr Security Research

martin.tappler@ist.tugraz.at, filip.cano@iaik.tugraz.at, aichernig@ist.tugraz.at, bettina.koenighofer@lamarr.at

Abstract

Evaluation of deep reinforcement learning (RL) is inherently challenging. In particular, the opaqueness of learned policies and the stochastic nature of both agents and environments make testing the behavior of deep RL agents difficult. We present a search-based testing framework that enables a wide range of novel analysis capabilities for evaluating the safety and performance of deep RL agents. For safety testing, our framework utilizes a search algorithm that searches for a reference trace that solves the RL task. The backtracking states of the search, called boundary states, pose safety-critical situations. We create safety test-suites that evaluate how well the RL agent escapes safety-critical situations near these boundary states. For robust performance testing, we create a diverse set of traces via fuzz testing. These fuzz traces are used to bring the agent into a wide variety of potentially unknown states, from which the average performance of the agent is compared to the average performance of the fuzz traces. We apply our search-based testing approach to RL agents for Nintendo's Super Mario Bros.

1 Introduction

In reinforcement learning (RL) [Sutton and Barto, 1998], an agent aims to maximize the total amount of reward through trial-and-error via interactions with an unknown environment. Recently, RL algorithms have achieved stunning results in playing video games and complex board games [Schrittwieser et al., 2020]. To achieve broad acceptance and enlarge the application areas of learned controllers, there is an urgent need to reliably evaluate trained RL agents. When evaluating trained agents, two fundamental questions need to be answered: (Q1) Does the trained deep RL agent avoid safety violations? (Q2) Does the trained deep RL agent perform well from a wide variety of states?

Testing deep RL agents is notoriously difficult. The first challenge arises from the environment, which is often not fully known and has an immense state space, combined with the byzantine complexity of the agent's model and the lack of determinism of both the agent and the environment. Secondly, to evaluate the performance of a trained agent's policy, an estimation of the performance of the optimal policy is needed.

To address these challenges, we transfer well-established search-based concepts from software testing into the RL setting. Search algorithms like backtracking-based depth-first search (DFS) are standard tools to find valid and invalid program executions. Fuzz testing refers to automated software testing techniques that generate interesting test cases with the goal of exposing corner cases that have not been properly handled in the program under test. In this work, we propose a search-based testing framework to reliably evaluate trained RL agents and to answer the questions Q1 and Q2. Our testing framework comprises four steps:

Step 1: Search for reference trace and boundary states.
In the first step, we use a DFS algorithm to search for a reference trace that solves the RL task by sampling the black-box environment. This idea is motivated by experience from AI competitions, like the Mario AI and NetHack Challenges [Karakovskiy and Togelius, 2012; Küttler et al., 2020], where the best performers are symbolic agents that provide a reference solution of the task faster than neural-based agents. Furthermore, since the DFS algorithm backtracks when reaching an unsafe state in the environment, the search reveals safety-critical situations that we call boundary states.

Step 2: Testing for safety. To answer Q1, our testing framework computes safety test-suites that bring the agent into safety-critical situations near the boundary states. Based on the ability of the agent to succeed in these safety-critical situations, we can evaluate the safety of agents. Our intuition is that a safe agent should not violate safety regardless of the situation it faces.

Step 3: Generation of fuzz traces. As a basis for performance testing, our testing framework applies a search-based fuzzing method to compute a diverse set of traces from the reference trace (Step 1), aiming for traces that gain high rewards and cover large parts of the state space.

Step 4: Testing for performance. To answer Q2, we create performance test-suites from the fuzz traces to bring the agent into a diverse set of states within the environment. As a performance metric, we propose to point-wise compare the averaged performance gained by executing the agent's policy with the averaged performance gained by executing the fuzz traces.

[Figure 1: Super Mario Bros. Up: reference trace and boundary states. Down: reference trace and fuzz traces.]

Our approach is very general and can be adapted to several application areas. In settings where initial traces are given, for example, from demonstrations by humans, such traces can be used as a basis for fuzzing. Our approach only requires the ability to sample the environment as an oracle. Even in the case of partial observability, our testing framework can be applied successfully, since we only need the information whether a trace successfully completed the task to be learned, partially completed the task, or violated safety. Exact state information is not required.

In our case study, we apply our framework to test the safety and performance of a set of deep RL agents trained to play Super Mario Bros. Fig. 1 shows the reference trace (red) and boundary states (white points) computed in Step 1, and the fuzz traces (yellow) from Step 3, computed in our case study. Since we consider the environment (as well as the trained agent, since a learned policy may need to break ties) to be probabilistic, we execute every test case a number of times and present the averaged results.

Related Work. While RL has proven successful in solving many complex tasks [Silver et al., 2016] and often outperforms classical controllers [Kiran et al., 2022], safety concerns prevent learned controllers from being widely used in safety-critical applications. Research on safe RL aims to guarantee safety during both the training and the execution phase of the RL agent [García and Fernández, 2015]. Safe RL has attracted a lot of attention in the formal methods community, culminating in a growing body of work on the verification of trained networks [Ehlers, 2017; Pathak et al., 2017; Corsi et al., 2021].
However, all of these approaches suffer from scalability issues and are not yet able to verify industrial-size deep neural networks. An alternative line of research aims to enforce safe operation of an RL agent during runtime, using techniques from runtime monitoring and enforcement [Alshiekh et al., 2018; Pranger et al., 2021]. These methods typically require a complete and faithful model of the environment dynamics, which is often not available. While a large amount of work on offline and runtime verification of RL agents exists, studying suitable testing methods for RL has attracted less attention.

The development of RL algorithms has greatly benefited from benchmark environments for performance evaluation, including the Arcade Learning Environment [Bellemare et al., 2013], OpenAI Gym [Brockman et al., 2016], and the DeepMind Control Suite [Tassa et al., 2018], to name a few. Safety Gym [Achiam and Amodei, 2019] was designed specifically to evaluate the safety of RL algorithms during exploration.

Most work on testing for RL evaluates the aggregate performance by comparing mean and median scores across tasks. Recently, testing metrics addressing the statistical uncertainty in such point estimates have been proposed [Agarwal et al., 2021]. We extend previous work by proposing search-based testing tailored toward (deep) RL. We use search-based methods to automatically create safety-critical test cases and test cases for robust performance testing.

RL has been proposed for software testing and in particular also for fuzz testing [Böttinger et al., 2018; Wang et al., 2021; Scott et al., 2021; Drozd and Wagner, 2018]. In contrast, we propose a novel search-based testing framework, including fuzzing, to test RL agents. Fuzzing has been applied to efficiently solve complex tasks [Aschermann et al., 2020; Schumilo et al., 2022]. We perform a backtracking-based search to efficiently solve the task, while fuzzing serves to cover a large portion of the state space. Related is also the work of Trujillo et al. [Trujillo et al., 2020], which analyzes the adequacy of neuron coverage for testing deep RL, whereas our adequacy criteria are inspired by traditional boundary-value and combinatorial testing.

We used our testing framework to evaluate trained deep Q-learning agents, i.e., agents that internally use deep neural networks to approximate the Q-function. Recent years have seen a surge in works on testing deep neural networks. Techniques like DeepTest [Tian et al., 2018], DeepXplore [Pei et al., 2019], and DeepRoad [Zhang et al., 2018] are orthogonal to our proposed framework. While we focus on the stateful, reactive nature of RL agents, viewing them as a whole, these techniques are used to test sensor-related aspects of networks and find application in particular in image processing. Furthermore, we may consider taking into account neural-network-specific testing criteria [Ma et al., 2018]. However, doubts about the adequacy of neuron coverage and related criteria have been raised recently [Harel-Canada et al., 2020]. Hence, more research is necessary in this area, as has been pointed out by Trujillo et al. [Trujillo et al., 2020].

Outline. The remainder of the paper is structured as follows. In Sec. 2, we give the background and notation. In Sec. 3 to 6, we present and discuss in detail Steps 1 to 4 of our testing framework. We present a detailed case study in Sec. 7.
2 Preliminaries

A Markov decision process (MDP) $\mathcal{M} = (S, s_0, A, P, R)$ is a tuple with a finite set $S$ of states including the initial state $s_0$, a finite set $A = \{a_1, \ldots, a_n\}$ of actions, a probabilistic transition function $P : S \times A \times S \to [0, 1]$, and an immediate reward function $R : S \times A \times S \to \mathbb{R}$. For all $s \in S$, the available actions are $A(s) = \{a \in A \mid \exists s'.\, P(s, a, s') \neq 0\}$, and we assume $|A(s)| \geq 1$. A memoryless deterministic policy $\pi : S \to A$ maps each state to an action. The set of all memoryless deterministic policies is denoted by $\Pi$.

An MDP with terminal states is an MDP $\mathcal{M}$ with a set of terminal states $S_T \subseteq S$ in which the MDP terminates, i.e., the execution of a policy $\pi$ on $\mathcal{M}$ yields a trace $exec_\pi(\pi, s_0) = \langle s_0, a_1, r_1, s_1, \ldots, r_n, s_n \rangle$ with only $s_n$ being a state in $S_T$. $S_T$ consists of two types of states: goal states $S_G \subseteq S_T$, representing states in which the task to be learned is accomplished by reaching them, and undesired unsafe states $S_U \subseteq S_T$. A safety violation occurs whenever a state in $S_U$ is entered. We define the set of bad states $S_B$ as all states that almost surely lead to an unsafe state in $S_U$, i.e., a state $s_B \in S$ is in $S_B$ if applying any policy $\pi \in \Pi$ starting in $s_B$ leads to a state in $S_U$ with probability 1. The set of boundary states $S_{BO}$ is defined as the set of non-bad states with successor states within the bad states, i.e., a state $s_{BO} \in S$ is in $S_{BO}$ if $s_{BO} \notin S_B$ and there exists a state $s \in S_B$ and an action $a \in A$ with $P(s_{BO}, a, s) > 0$.

We consider reinforcement learning (RL), in which an agent learns a task through trial-and-error via interactions with an unknown environment modeled by an MDP $\mathcal{M} = (S, s_0, A, P, R)$ with terminal states $S_T \subseteq S$. At each step $t$, the agent receives an observation $s_t$. It then chooses an action $a_{t+1} \in A$. The environment then moves to a state $s_{t+1}$ with probability $P(s_t, a_{t+1}, s_{t+1})$. The reward is determined by $r_{t+1} = R(s_t, a_{t+1}, s_{t+1})$. If the environment enters a terminal state in $S_T$, the training episode ends. The time step at which the episode ends is denoted by $t_{end}$. The return $ret = \sum_{t=1}^{t_{end}} \gamma^t r_t$ is the cumulative future discounted reward per episode, using the discount factor $\gamma \in [0, 1]$. The objective of the agent is to learn an optimal policy $\pi : S \to A$ that maximizes the expectation of the return, i.e., $\max_{\pi \in \Pi} \mathbb{E}_\pi(ret)$. The accumulated reward per episode is $R = \sum_{t=1}^{t_{end}} r_t$.

Traces. A trace $\tau = \langle s_0, a_1, r_1, s_1, \ldots, a_n, r_n, s_n \rangle$ is the state-action-reward sequence induced by a policy during an episode starting from the initial state $s_0$. We denote a set of traces by $\mathcal{T}$. Given a trace $\tau = \langle s_0, a_1, r_1, s_1, \ldots, r_n, s_n \rangle$, we use $\tau[i]$ to denote the $i$th state of $\tau$ ($s_i = \tau[i]$), $\tau^{\downarrow i}$ to denote the prefix of $\tau$ consisting of all entries of $\tau$ from position 0 to $i$, and $\tau^{+i}$ to denote the suffix of $\tau$ consisting of all entries of $\tau$ from position $i$ to $n$. We denote by $|\tau| = n$ the length of the trace. We denote the first appearance of state $s$ in trace $\tau$ by $d(\tau, s)$ (if $d(\tau, s) = i$, then $\tau[i] = s$). We call the action sequence resulting from omitting the states and rewards from $\tau$ an action trace $\tau_A = \langle a_1, a_2, \ldots, a_n \rangle$; $\tau_A[i]$ gives the $i$th action, i.e., $a_i = \tau_A[i]$. Executing $\tau_A$ on $\mathcal{M}$ from $s_0$ yields a trace $exec_\tau(\tau_A, s_0) = \langle s_0, a_1, r_1, s_1, \ldots, r_n, s_n \rangle$ with $n = |\tau_A|$.
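To make the trace notation concrete, the following minimal Python sketch shows one possible representation of traces, prefixes, suffixes, and action traces, together with the replay of an action trace on an environment. It is only an illustration: the `Trace` class and the `execute` helper are our own names, and the classic OpenAI Gym `reset()`/`step()` interface is assumed.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Trace:
    """State-action-reward sequence <s0, a1, r1, s1, ..., an, rn, sn>."""
    states: List[Any]     # s0, ..., sn
    actions: List[Any]    # a1, ..., an
    rewards: List[float]  # r1, ..., rn

    def __len__(self):            # |tau| = n
        return len(self.actions)

    def state(self, i):           # tau[i]
        return self.states[i]

    def prefix(self, i):          # tau^{down i}: all entries up to position i
        return Trace(self.states[:i + 1], self.actions[:i], self.rewards[:i])

    def suffix(self, i):          # tau^{+i}: all entries from position i to n
        return Trace(self.states[i:], self.actions[i:], self.rewards[i:])

    def action_trace(self):       # tau_A = <a1, ..., an>
        return list(self.actions)

def first_appearance(trace, s):   # d(tau, s)
    return trace.states.index(s)

def execute(env, action_trace):
    """exec_tau(tau_A, s0): replay an action trace on a gym-style environment."""
    s = env.reset()
    states, rewards = [s], []
    for a in action_trace:
        s, r, done, _ = env.step(a)
        states.append(s)
        rewards.append(r)
        if done:                  # a terminal state in S_T was entered
            break
    return Trace(states, action_trace[:len(rewards)], rewards)
```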
3 Step 1 - Search for Reference Trace and Boundary States

The first step of our testing framework is to perform a search for a reference trace $\tau_{ref}$ that performs the task to be learned by the RL agent (not necessarily in an optimal way) and to detect boundary states $S'_{BO} \subseteq S_{BO}$ along the reference trace. We propose to compute $\tau_{ref}$ using a backtracking-based depth-first search (DFS) that samples the MDP $\mathcal{M}$. For the DFS, we abstract away the stochastic behavior of $\mathcal{M}$ by exploring all possible behaviors in visited states, repeating actions sufficiently often [Khalili and Tacchella, 2014]. Assuming that $p = P(s, a, s')$ is the smallest transition probability greater than 0 for any $s, s' \in S$ and $a \in A$ in $\mathcal{M}$, we compute the number of repetitions $rep$ required to match a confidence level $c$ via $rep(c, p) = \lceil \log(1 - c)/\log(1 - p) \rceil$. This ensures observing all possible states with a probability of at least $c$.

Example. Assume that $p = 0.1$ is the smallest probability $> 0$ in $\mathcal{M}$. To achieve a confidence level of 90% that the search visited every reachable state, the DFS has to perform $rep(0.9, 0.1) = 22$ repetitions of every action in every state.

```
Algorithm 1: Search for Reference Trace τ_ref
  input : MDP M = (S, s_0, A, P, R), repetitions rep
  output: τ_ref, S'_BO

  VS ← [s_0]; VA ← [ ]; Explored ← ∅; success ← false
  τ_ref ← [s_0]; S'_BO ← ∅
  DFS(s_0)
  if success then
      s_prev ← s_0
      for i ← 1, ..., |VA| do
          a, s ← VA[i], VS[i+1]
          if s ∉ Explored then
              r ← R(s_prev, a, s)
              Push(τ_ref, ⟨a, r, s⟩)
              s_prev ← s
              if VS[i+2] ∈ Explored then      /* next state is a backtracking point */
                  S'_BO ← S'_BO ∪ {s}

  Function DFS(s):
      if s ∈ S_U then
          Explored ← Explored ∪ {s}; return
      if s ∈ S_G or success then
          success ← true; return
      for a ∈ A do
          repeat rep times
              Sample s' from P(s, a, ·)
              if s' ∉ VS then
                  Push(VA, a); Push(VS, s')
                  DFS(s')
      if ¬success then Explored ← Explored ∪ {s}
```

Algorithm 1 gives the pseudocode of our search algorithm to compute $\tau_{ref} \in \mathcal{T}$ and a set of boundary states $S'_{BO} \subseteq S_{BO}$. The list $VS$ stores states that have already been visited, and $VA$ stores the executed actions leading to the corresponding states in $VS$. Every time the search visits an unsafe state, the algorithm backtracks. A non-terminal state $s$ is added to $Explored$ if the DFS backtracked to $s$ from all successor states. By keeping track of visited states in the variable $VS$, we ensure that we do not explore a state twice along the same trace; that is, we use $VS$ to detect cycles. When visiting a goal state, $DFS(s_0)$ terminates successfully. In this case, $\tau_{ref}$ is built from the visited states that were not part of a backtracking branch of the search, i.e., $s$ is in $\tau_{ref}$ if $s \in VS$ and $s \notin Explored$, together with the corresponding actions in $VA$. States $s \in \tau_{ref}$ that have successor states $s' \in Explored$ are boundary states, i.e., $s \in S'_{BO}$.

Example. Figure 3 shows part of an MDP $\mathcal{M}$ that was explored during a run of our search algorithm. Unsafe states found during the search are marked red. After visiting $s_{10} \in S_G$ (green circle), the search function $DFS(s_0)$ returns with $VS = [s_0, \ldots, s_{10}]$, $VA = [a, a, b, a, b, b, a, a, b, b]$, and $Explored = \{s_2, s_3, s_4, s_5, s_8, s_9\}$. The reference trace (omitting rewards) is $\tau_{ref} = \langle s_0, a, s_1, b, s_6, a, s_7, b, s_{10} \rangle$ and the subset of boundary states is $S'_{BO} = \{s_1, s_7\}$ (blue circles).

[Figure 3: Run of the search algorithm]
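For illustration, a Python sketch of this search against a black-box simulator is given below. It is a sketch under our own assumptions, not the paper's implementation: `sim` is a hypothetical simulator offering `reset`, `save_state`, `load_state`, `step`, `is_unsafe`, and `is_goal`; boundary states are detected via the prose characterization above (a reference-trace state with a backtracked-from successor) rather than the exact bookkeeping of Algorithm 1.

```python
import math

def required_reps(confidence, p_min):
    """rep(c, p) = ceil(log(1 - c) / log(1 - p))."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_min))

def search_reference_trace(sim, actions, rep):
    """Backtracking DFS over a black-box simulator, in the spirit of Algorithm 1.

    Returns the reference trace as a list of (action, reward, state) entries
    together with the detected boundary states, or (None, set()) on failure.
    """
    s0 = sim.reset()
    visited = {s0}           # VS: cycle detection
    explored = set()         # states the DFS fully backtracked from
    successors = {}          # state -> set of successor states observed by sampling
    path = []                # current candidate trace: list of (a, r, s)
    found = []               # holds the reference trace once a goal state is reached

    def dfs(s, snapshot):
        if sim.is_unsafe(s):
            explored.add(s)
            return False
        if sim.is_goal(s):
            found.extend(path)
            return True
        for a in actions:
            for _ in range(rep):              # repeat to observe stochastic successors
                sim.load_state(snapshot)
                s2, r = sim.step(a)
                successors.setdefault(s, set()).add(s2)
                if s2 not in visited:
                    visited.add(s2)
                    path.append((a, r, s2))
                    if dfs(s2, sim.save_state()):
                        return True
                    path.pop()
        explored.add(s)                        # every successor forced a backtrack
        return False

    if not dfs(s0, sim.save_state()):
        return None, set()
    trace_states = [s0] + [s for (_, _, s) in found]
    boundary = {s for s in trace_states
                if any(nxt in explored for nxt in successors.get(s, ()))}
    return found, boundary
```

For long tasks, the recursion depth and the cost of repeated sampling make the state-space abstraction discussed below essential.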
Optimizing Search. Proper abstractions of the state space may be used to merge similar states, thereby pruning the search space and enabling the DFS to find cycles in the abstract state space. Detecting cycles speeds up the search, since the DFS backtracks when finding an edge to an already visited state. An example of such an abstraction is omitting the execution time from the state space in order to merge states.

4 Step 2 - Testing for Safety

Based on $\tau_{ref}$ and $S'_{BO}$ computed in Step 1, we propose several test suites to identify weak points of a policy, indicated by a high frequency of fail verdicts, i.e., safety violations. After discussing suitable test suites, we discuss how to execute them to test the safety of RL agents.

Simple Boundary Test Suite. We use the boundary states $S'_{BO}$ in $\tau_{ref}$ for boundary value testing [Pezzè and Young, 2007]. We compute a simple test suite that consists of all prefixes of $\tau_{ref}$ that end in a boundary state. From these traces, we use the action traces to bring the RL agent into safety-critical situations and to test its behavior regarding safety. Formally, let $D_B$ be the sequence of depths of the boundary states $S'_{BO}$ in $\tau_{ref}$, i.e., for any $s_{BO} \in S'_{BO}$: $d(\tau_{ref}, s_{BO}) \in D_B$. Using $D_B$, we compute a set of traces $T$ by $T = \{\tau_{ref}^{\downarrow D_B[i]} \mid 1 \leq i \leq |D_B|\}$. Omitting states and rewards from the traces in $T$ results in a set of action traces that forms a simple boundary test suite called $ST$. We say the action trace $\tau_{A,ref}^{\downarrow D_B[i]} \in ST$ is the test case for the $i$th boundary state in $S'_{BO}$.

Test Suites using Boundary Intervals. Boundary value testing checks not only boundary values, but also inputs slightly off the boundary [Pezzè and Young, 2007]. To transfer this concept to RL testing, we introduce boundary intervals to test at additional states near the boundary. In contrast to boundary testing of conventional software, our test cases stay in states traversed by $\tau_{ref}$. This choice is motivated by the definition of a boundary state: a state with successor states that necessarily lead to an unsafe state. Bringing the RL agent into such a losing position would not provide additional insight concerning the learned safety objective, since the agent has no other choice than to violate safety. However, testing states of $\tau_{ref}$ within an offset of boundary states provides insights into how well the RL agent escapes safety-critical situations. Given a simple test suite $ST$ and an interval size $is$, we create an interval test suite $IT(is)$ by adding additional test cases to $ST$, such that $IT(is) = \{\tau_{A,ref}^{\downarrow D_B[i]+off} \mid \tau_{A,ref}^{\downarrow D_B[i]} \in ST, -is \leq off \leq is\}$, where $\tau_{A,ref}$ is the reference action trace. The test case $\tau_{A,ref}^{\downarrow D_B[i]+off}$ tests the agent at boundary state $i$ with offset $off$.

Test Suites using Action Coverage. Combinatorial testing covers issues resulting from combinations of input values. We adapt this concept by creating test suites that cover combinations of actions near boundary states, i.e., the test suite evaluates which actions cause unsafe behavior in boundary regions. Given the reference action trace $\tau_{A,ref}$, a simple test suite $ST$, and a $k \geq 1$, we generate a $k$-wise action-coverage test suite $AC(k)$ by creating $|A|^k$ test cases for every test case in $ST$, covering all $k$-wise combinations of actions at the $k$th predecessor of a boundary state. The test suite is given by $AC(k) = \{\tau_{A,ref}^{\downarrow D_B[i]-k} \cdot ac \mid \tau_{A,ref}^{\downarrow D_B[i]} \in ST, ac \in A^k\}$.
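The three test suites follow directly from the definitions above. The sketch below is our illustration: `ref_actions` stands for the reference action trace $\tau_{A,ref}$ (as a Python list), `boundary_depths` for $D_B$, and `actions` for the action set $A$.

```python
from itertools import product

def simple_suite(ref_actions, boundary_depths):
    """ST: one prefix of the reference action trace per boundary state."""
    return [ref_actions[:d] for d in boundary_depths]

def interval_suite(ref_actions, boundary_depths, interval_size):
    """IT(is): prefixes ending within +/- interval_size steps of each boundary state."""
    suite = []
    for d in boundary_depths:
        for off in range(-interval_size, interval_size + 1):
            cut = d + off
            if 0 <= cut <= len(ref_actions):
                suite.append(ref_actions[:cut])
    return suite

def action_coverage_suite(ref_actions, boundary_depths, actions, k):
    """AC(k): prefix up to the k-th predecessor of a boundary state,
    extended by every k-wise combination of actions."""
    suite = []
    for d in boundary_depths:
        prefix = ref_actions[:max(d - k, 0)]
        for combo in product(actions, repeat=k):
            suite.append(prefix + list(combo))
    return suite
```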
Test-Case Execution & Verdict. To test the behavior of an agent regarding safety, we use a safety test-suite to bring the agent into safety-critical situations. A single test-case execution takes an action trace $\tau_A$, an initial state $s_0$, and a test length $l$ as parameters. To test an RL agent using $\tau_A$, we first execute $\tau_A$, yielding a trace $exec(\tau_A) = \langle s_0, a_1, r_1, s_1, \ldots, a_n, r_n, s_n \rangle$. A test case $\tau_A$ is invalid if $exec(\tau_A)$ consistently visits a terminal state in $S_T$ when executed repeatedly. Starting from $s_n$, we pick the next $l$ actions according to the policy of the agent. Note that $l$ should be chosen large enough to evaluate the behavior of the agent regarding safety; therefore, it should be considerably larger than the shortest path to the next unsafe state in $S_U$. After performing $l$ steps of the agent's policy, we evaluate the test case. A test can fail or pass: a test fails if, starting from $s_n$, the agent reaches an unsafe state in $S_U$ within $l$ steps; otherwise, the test passes. To execute a test suite $T$, we perform every test case of $T$ $n$ times and compute the relative frequency of fail verdicts resulting from executing each individual test case.

5 Step 3 - Generation of Fuzz Traces

Our testing framework evaluates the performance of RL agents using fuzz traces. The traces are used to compare gained rewards as well as to bring the agent into a variety of states and to evaluate its performance from these states onward. In this section, we discuss the generation of fuzz traces for performance testing. For this purpose, we propose a search-based fuzzing method [Zeller et al., 2021] based on genetic algorithms. The goal is to find action traces that (1) cover a large portion of the state space while (2) accomplishing the task to be learned by the RL agent.

Overview of the Computation of Fuzz Traces. Given the reference trace $\tau_{ref}$ that solves the RL task (i.e., $s_n \in S_G$) and parameter values for the number of generations $g$ and the population size $p$, the fuzz traces are computed as follows:

1. Initialize $T_0$, the trace population: $T_0 := \{\tau_{A,ref}\}$.
2. For $i = 1$ to $g$ generations do:
   (a) Create $p$ action traces (called offspring) from $T_{i-1}$ to yield a new population $T_i$ of size $p$, by either mutating a single parent trace from $T_{i-1}$ or through crossover of two parents from $T_{i-1}$, with a specified crossover probability.
   (b) Evaluate the fitness of every offspring trace in $T_i$.
3. Return $T_{fit}$, containing the fittest trace of each generation.

The fitness of a trace is defined in terms of state-space coverage and the degree to which the RL task is solved. The computation of the fuzz traces searches iteratively for traces with high fitness by choosing parent traces with a probability proportional to their fitness. To promote diversity, we favor mutation over crossover by setting the crossover probability to a value < 0.5. The set of the fittest traces $T_{fit}$ is used in Step 4 for performance testing. Using the single fittest trace from every generation helps to enforce variety.
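The overview above is a standard generational genetic algorithm over action traces. The following sketch is our illustration of that loop, not the authors' implementation: `fitness`, `mutate`, and `crossover` are assumed to be supplied (e.g., the fitness function described next), parents are chosen with probability proportional to their fitness, and mutation is favored by keeping the crossover probability below 0.5.

```python
import random

def fuzz_traces(ref_action_trace, fitness, mutate, crossover,
                generations=50, population_size=20, crossover_prob=0.25):
    """Search-based fuzzing of action traces via a genetic algorithm.

    Returns T_fit, the fittest trace of each generation.
    """
    population = [list(ref_action_trace)]               # T_0 = {tau_A,ref}
    fittest_per_generation = []

    for _ in range(generations):
        # fitness-proportional parent selection (small epsilon avoids all-zero weights)
        weights = [fitness(t) + 1e-9 for t in population]
        offspring = []
        for _ in range(population_size):
            if random.random() < crossover_prob and len(population) > 1:
                p1, p2 = random.choices(population, weights=weights, k=2)
                child = crossover(p1, p2)
            else:
                parent = random.choices(population, weights=weights, k=1)[0]
                child = mutate(parent)
            offspring.append(child)
        population = offspring
        fittest_per_generation.append(max(population, key=fitness))

    return fittest_per_generation
```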
Fitness Computation. We propose a fitness function especially suited for testing RL agents. For an action trace $\tau_A$, the fitness $F(\tau_A)$ is the weighted sum of three normalized terms:

- The positive-reward term $r_{pos}(\tau_A, s_0)$ is the normalized positive reward gained in $exec_\tau(\tau_A, s_0)$.
- The negative-reward term $r_{neg}(\tau_A, s_0)$ is the normalized, inverted negative reward gained in $exec_\tau(\tau_A, s_0)$.
- The coverage fitness term $f_c(\tau_A, s_0)$ describes the number of states newly visited by $exec_\tau(\tau_A, s_0)$, normalized by dividing by the maximum number of newly visited states of any action trace in the current population.

Positive rewards correspond to the degree to which the RL task is solved by $\tau_A$. Negative rewards often correspond to the time required to solve the RL task; hence, if $\tau_A$ solves the task fast, it is assigned a small negative reward. We invert the negative reward so that all fitness terms are positive. We normalize $r_{pos}$ and $r_{neg}$ by dividing them by the highest positive and negative reward, respectively, in the current generation. The coverage fitness term depends on all states visited in previous populations. Assume that the current generation is $i$, and let $Cov_{pop}$ be the set of all states visited by the previous populations $T_j$, $j < i$.
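A sketch of how the fitness terms could be computed for one generation is shown below. It reflects one reading of the description above and is not the authors' code: `execute` is assumed to return the positive reward, the negative reward, and the visited states of a run; the negative reward is made positive by negation; and the weights are free parameters. Because of the per-generation normalization, fitness is naturally computed for a whole population at once; the per-trace `fitness` callable in the previous sketch is a simplification of this.

```python
def generation_fitness(action_traces, execute, covered_so_far,
                       w_pos=1.0, w_neg=1.0, w_cov=1.0):
    """Compute F(tau_A) = w_pos*r_pos + w_neg*r_neg + w_cov*f_c for one generation.

    `execute(tau_A)` is assumed to return (positive_reward, negative_reward,
    visited_states) of a run from s0; `covered_so_far` is Cov_pop, the set of
    states visited by previous populations.
    """
    runs = [execute(t) for t in action_traces]
    pos = [p for (p, _, _) in runs]
    neg = [-n for (_, n, _) in runs]             # negate so the term is positive
    new_cov = [len(set(visited) - covered_so_far) for (_, _, visited) in runs]

    max_pos = max(pos) or 1.0                    # avoid division by zero
    max_neg = max(neg) or 1.0
    max_cov = max(new_cov) or 1

    fitness = [w_pos * p / max_pos + w_neg * n / max_neg + w_cov * c / max_cov
               for p, n, c in zip(pos, neg, new_cov)]

    for (_, _, visited) in runs:                 # update Cov_pop for the next generation
        covered_so_far.update(visited)
    return fitness
```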