# Search-Based Testing of Reinforcement Learning

Martin Tappler1,3, Filip Cano Córdoba2, Bernhard K. Aichernig1,3, Bettina Könighofer2,4

1 Institute of Software Technology, Graz University of Technology
2 Institute of Applied Information Processing and Communications, Graz University of Technology
3 TU Graz-SAL DES Lab, Silicon Austria Labs, Graz, Austria
4 Lamarr Security Research

martin.tappler@ist.tugraz.at, filip.cano@iaik.tugraz.at, aichernig@ist.tugraz.at, bettina.koenighofer@lamarr.at

Abstract

Evaluation of deep reinforcement learning (RL) is inherently challenging. In particular, the opaqueness of learned policies and the stochastic nature of both agents and environments make testing the behavior of deep RL agents difficult. We present a search-based testing framework that enables a wide range of novel analysis capabilities for evaluating the safety and performance of deep RL agents. For safety testing, our framework utilizes a search algorithm that searches for a reference trace that solves the RL task. The backtracking states of the search, called boundary states, pose safety-critical situations. We create safety test-suites that evaluate how well the RL agent escapes safety-critical situations near these boundary states. For robust performance testing, we create a diverse set of traces via fuzz testing. These fuzz traces are used to bring the agent into a wide variety of potentially unknown states, from which the average performance of the agent is compared to the average performance of the fuzz traces. We apply our search-based testing approach to RL agents for Nintendo's Super Mario Bros.

1 Introduction

In reinforcement learning (RL) [Sutton and Barto, 1998], an agent aims to maximize the total amount of reward through trial-and-error via interactions with an unknown environment. Recently, RL algorithms have achieved stunning results in playing video games and complex board games [Schrittwieser et al., 2020]. To achieve broad acceptance and enlarge the application areas of learned controllers, there is an urgent need to reliably evaluate trained RL agents. When evaluating trained agents, two fundamental questions need to be answered: (Q1) Does the trained deep RL agent avoid safety violations? (Q2) Does the trained deep RL agent perform well from a wide variety of states?

Testing deep RL agents is notoriously difficult. The first challenge arises from the environment, which is often not fully known and has an immense state space, combined with the byzantine complexity of the agent's model and the lack of determinism of both the agent and the environment. Secondly, to evaluate the performance of a trained agent's policy, an estimation of the performance of the optimal policy is needed.

To address these challenges, we transfer well-established search-based concepts from software testing into the RL setting. Search algorithms like backtracking-based depth-first search (DFS) are standard tools to find valid and invalid program executions. Fuzz testing refers to automated software testing techniques that generate interesting test cases with the goal of exposing corner cases that have not been properly handled in the program under test. In this work, we propose a search-based testing framework to reliably evaluate trained RL agents and to answer the questions Q1 and Q2. Our testing framework comprises four steps:

Step 1: Search for reference trace and boundary states.
In the first step, we use a DFS algorithm to search for a reference trace that solves the RL task by sampling the black-box environment. This idea is motivated by experience from AI competitions, like the Mario AI and NetHack Challenges [Karakovskiy and Togelius, 2012; Küttler et al., 2020], where the best performers are symbolic agents that provide a reference solution of the task faster than neural-based agents. Furthermore, since the DFS algorithm backtracks when reaching an unsafe state in the environment, the search reveals safety-critical situations that we call boundary states.

Step 2: Testing for safety. To answer Q1, our testing framework computes safety test-suites that bring the agent into safety-critical situations near the boundary states. Based on the ability of the agent to succeed in these safety-critical situations, we can evaluate the safety of agents. Our intuition is that a safe agent should not violate safety regardless of the situation it faces.

Step 3: Generation of fuzz traces. As a basis for performance testing, our testing framework applies a search-based fuzzing method to compute a diverse set of traces from the reference trace (Step 1), aiming for traces that gain high rewards and cover large parts of the state space.

Step 4: Testing for performance. To answer Q2, we create performance test-suites from the fuzz traces to bring the agent into a diverse set of states within the environment. As a performance metric, we propose to point-wise compare the averaged performance gained by executing the agent's policy with the averaged performance gained by executing the fuzz traces.

[Figure 1: Super Mario Bros. Up: reference trace and boundary states. Down: reference trace and fuzz traces.]

Our approach is very general and can be adapted to several application areas. In settings where initial traces are given, for example, from demonstrations by humans, such traces can be used as a basis for fuzzing. Our approach only requires the ability to sample the environment as an oracle. Even in the case of partial observability, our testing framework can be applied successfully, since we only need the information whether a trace successfully completed the task to be learned, partially completed the task, or violated safety. Exact state information is not required.

In our case study, we apply our framework to test the safety and performance of a set of deep RL agents trained to play Super Mario Bros. Fig. 1 shows the reference trace (red) and boundary states (white points) computed in Step 1, and the fuzz traces (yellow) from Step 3, computed in our case study. Since we consider the environment (as well as the trained agent, since a learned policy may need to break ties) to be probabilistic, we execute every test case a number of times and present the averaged results.

Related Work. While RL has proven successful in solving many complex tasks [Silver et al., 2016] and often outperforms classical controllers [Kiran et al., 2022], safety concerns prevent learned controllers from being widely used in safety-critical applications. Research on safe RL aims to guarantee safety during both the training and the execution phase of the RL agent [García and Fernández, 2015]. Safe RL has attracted a lot of attention in the formal methods community, culminating in a growing body of work on the verification of trained networks [Ehlers, 2017; Pathak et al., 2017; Corsi et al., 2021].
However, all of these approaches suffer from scalability issues and are not yet able to verify industrial-size deep neural networks. An alternative line of research aims to enforce safe operation of an RL agent during runtime, using techniques from runtime monitoring and enforcement [Alshiekh et al., 2018; Pranger et al., 2021]. These methods typically require a complete and faithful model of the environment dynamics, which is often not available. While a large amount of work on offline and runtime verification of RL agents exists, studying suitable testing methods for RL has attracted less attention.

The development of RL algorithms has greatly benefited from benchmark environments for performance evaluation, including the Arcade Learning Environment [Bellemare et al., 2013], OpenAI Gym [Brockman et al., 2016], and the DeepMind Control Suite [Tassa et al., 2018], to name a few. Safety Gym [Achiam and Amodei, 2019] was designed specifically to evaluate the safety of RL algorithms during exploration.

Most work on testing for RL evaluates the aggregate performance by comparing mean and median scores across tasks. Recently, testing metrics addressing the statistical uncertainty in such point estimates have been proposed [Agarwal et al., 2021]. We extend previous work by proposing search-based testing tailored toward (deep) RL. We use search-based methods to automatically create safety-critical test cases and test cases for robust performance testing.

RL has been proposed for software testing and in particular also for fuzz testing [Böttinger et al., 2018; Wang et al., 2021; Scott et al., 2021; Drozd and Wagner, 2018]. In contrast, we propose a novel search-based testing framework, including fuzzing, to test RL agents. Fuzzing has been applied to efficiently solve complex tasks [Aschermann et al., 2020; Schumilo et al., 2022]. We perform a backtracking-based search to efficiently solve the task, while fuzzing serves to cover a large portion of the state space. Related is also the work of Trujillo et al. [Trujillo et al., 2020], which analyzes the adequacy of neuron coverage for testing deep RL, whereas our adequacy criteria are inspired by traditional boundary-value and combinatorial testing.

We used our testing framework to evaluate trained deep Q-learning agents, i.e., agents that internally use deep neural networks to approximate the Q-function. Recent years have seen a surge in works on testing deep neural networks. Techniques like DeepTest [Tian et al., 2018], DeepXplore [Pei et al., 2019], and DeepRoad [Zhang et al., 2018] are orthogonal to our proposed framework. While we focus on the stateful, reactive nature of RL agents, viewing them as a whole, these techniques are used to test sensor-related aspects of networks and find application in particular in image processing. Furthermore, we may consider taking into account neural-network-specific testing criteria [Ma et al., 2018]. However, doubts about the adequacy of neuron coverage and related criteria have been raised recently [Harel-Canada et al., 2020]. Hence, more research is necessary in this area, as has been pointed out by Trujillo et al. [Trujillo et al., 2020].

Outline. The remainder of the paper is structured as follows. In Sec. 2, we give the background and notation. In Sec. 3 to 6, we present and discuss in detail Steps 1 to 4 of our testing framework. We present a detailed case study in Sec. 7.
2 Preliminaries

A Markov decision process (MDP) $\mathcal{M} = (S, s_0, A, P, R)$ is a tuple with a finite set $S$ of states including the initial state $s_0$, a finite set $A = \{a_1, \ldots, a_n\}$ of actions, a probabilistic transition function $P : S \times A \times S \to [0, 1]$, and an immediate reward function $R : S \times A \times S \to \mathbb{R}$. For all $s \in S$, the available actions are $A(s) = \{a \in A \mid \exists s'.\, P(s, a, s') \neq 0\}$, and we assume $|A(s)| \geq 1$. A memoryless deterministic policy $\pi : S \to A$ maps each state to an action. The set of all memoryless deterministic policies is denoted by $\Pi$.

An MDP with terminal states is an MDP $\mathcal{M}$ with a set of terminal states $S_T \subseteq S$ in which the MDP terminates, i.e., the execution of a policy $\pi$ on $\mathcal{M}$ yields a trace $exec_\pi(\pi, s_0) = \langle s_0, a_1, r_1, s_1, \ldots, r_n, s_n \rangle$ with only $s_n$ being a state in $S_T$. $S_T$ consists of two types of states: goal states $S_G \subseteq S_T$, representing states in which the task to be learned is accomplished by reaching them, and undesired unsafe states $S_U \subseteq S_T$. A safety violation occurs whenever a state in $S_U$ is entered. We define the set of bad states $S_B$ as all states that almost surely lead to an unsafe state in $S_U$, i.e., a state $s_B \in S$ is in $S_B$ if applying any policy $\pi \in \Pi$ starting in $s_B$ leads to a state in $S_U$ with probability 1. The set of boundary states $S_{BO}$ is defined as the set of non-bad states with successor states within the bad states, i.e., a state $s_{BO} \in S$ is in $S_{BO}$ if $s_{BO} \notin S_B$ and there exists a state $s \in S_B$ and an action $a \in A$ with $P(s_{BO}, a, s) > 0$.

We consider reinforcement learning (RL), in which an agent learns a task through trial-and-error via interactions with an unknown environment modeled by an MDP $\mathcal{M} = (S, s_0, A, P, R)$ with terminal states $S_T \subseteq S$. At each step $t$, the agent receives an observation $s_t$. It then chooses an action $a_{t+1} \in A$. The environment then moves to a state $s_{t+1}$ with probability $P(s_t, a_{t+1}, s_{t+1})$. The reward is determined by $r_{t+1} = R(s_t, a_{t+1}, s_{t+1})$. If the environment enters a terminal state in $S_T$, the training episode ends. The time step at which the episode ends is denoted by $t_{end}$. The return $ret = \sum_{t=1}^{t_{end}} \gamma^t r_t$ is the cumulative future discounted reward per episode, using the discount factor $\gamma \in [0, 1]$. The objective of the agent is to learn an optimal policy $\pi : S \to A$ that maximizes the expectation of the return, i.e., $\max_{\pi \in \Pi} \mathbb{E}_\pi(ret)$. The accumulated reward per episode is $R = \sum_{t=1}^{t_{end}} r_t$.

Traces. A trace $\tau = \langle s_0, a_1, r_1, s_1, \ldots, a_n, r_n, s_n \rangle$ is the state-action-reward sequence induced by a policy during an episode starting from the initial state $s_0$. We denote a set of traces by $\mathcal{T}$. Given a trace $\tau = \langle s_0, a_1, r_1, s_1, \ldots, r_n, s_n \rangle$, we use $\tau[i]$ to denote the $i$th state of $\tau$ ($s_i = \tau[i]$), $\tau^{\downarrow i}$ to denote the prefix of $\tau$ consisting of all entries of $\tau$ from position 0 to $i$, and $\tau^{+i}$ to denote the suffix of $\tau$ consisting of all entries of $\tau$ from position $i$ to $n$. We denote by $|\tau| = n$ the length of the trace. We denote the first appearance of state $s$ in trace $\tau$ by $d(\tau, s)$ (if $d(\tau, s) = i$, then $\tau[i] = s$). We call the action sequence resulting from omitting the states and rewards from $\tau$ an action trace $\tau_A = \langle a_1, a_2, \ldots, a_n \rangle$; $\tau_A[i]$ gives the $i$th action, i.e., $a_i = \tau_A[i]$. Executing $\tau_A$ on $\mathcal{M}$ from $s_0$ yields a trace $exec_\tau(\tau_A, s_0) = \langle s_0, a_1, r_1, s_1, \ldots, r_n, s_n \rangle$ with $n = |\tau_A|$.
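To make the trace notation concrete, the following minimal Python sketch shows one possible representation of traces, prefixes, suffixes, and action traces, together with the replay of an action trace on an environment. It is only an illustration: the `Trace` class and the `execute` helper are our own names, and the classic OpenAI Gym `reset()`/`step()` interface is assumed.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Trace:
    """State-action-reward sequence <s0, a1, r1, s1, ..., an, rn, sn>."""
    states: List[Any]     # s0, ..., sn
    actions: List[Any]    # a1, ..., an
    rewards: List[float]  # r1, ..., rn

    def __len__(self):            # |tau| = n
        return len(self.actions)

    def state(self, i):           # tau[i]
        return self.states[i]

    def prefix(self, i):          # tau^{down i}: all entries up to position i
        return Trace(self.states[:i + 1], self.actions[:i], self.rewards[:i])

    def suffix(self, i):          # tau^{+i}: all entries from position i to n
        return Trace(self.states[i:], self.actions[i:], self.rewards[i:])

    def action_trace(self):       # tau_A = <a1, ..., an>
        return list(self.actions)

def first_appearance(trace, s):   # d(tau, s)
    return trace.states.index(s)

def execute(env, action_trace):
    """exec_tau(tau_A, s0): replay an action trace on a gym-style environment."""
    s = env.reset()
    states, rewards = [s], []
    for a in action_trace:
        s, r, done, _ = env.step(a)
        states.append(s)
        rewards.append(r)
        if done:                  # a terminal state in S_T was entered
            break
    return Trace(states, action_trace[:len(rewards)], rewards)
```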
3 Step 1 - Search for Reference Trace and Boundary States

The first step of our testing framework is to perform a search for a reference trace $\tau_{ref}$ that performs the task to be learned by the RL agent (not necessarily in an optimal way) and to detect boundary states $S'_{BO} \subseteq S_{BO}$ along the reference trace. We propose to compute $\tau_{ref}$ using a backtracking-based depth-first search (DFS) that samples the MDP $\mathcal{M}$. For the DFS, we abstract away the stochastic behavior of $\mathcal{M}$ by exploring all possible behaviors in visited states, repeating actions sufficiently often [Khalili and Tacchella, 2014]. Assuming that $p = P(s, a, s')$ is the smallest transition probability greater than 0 for any $s, s' \in S$ and $a \in A$ in $\mathcal{M}$, we compute the number of repetitions $rep$ required to match a confidence level $c$ via $rep(c, p) = \lceil \log(1 - c)/\log(1 - p) \rceil$. This ensures observing all possible states with a probability of at least $c$.

Example. Assume that $p = 0.1$ is the smallest probability $> 0$ in $\mathcal{M}$. To achieve a confidence level of 90% that the search visited every reachable state, the DFS has to perform $rep(0.9, 0.1) = 22$ repetitions of every action in every state.

```
Algorithm 1: Search for Reference Trace τ_ref
  input : MDP M = (S, s_0, A, P, R), repetitions rep
  output: τ_ref, S'_BO

  VS ← [s_0]; VA ← [ ]; Explored ← ∅; success ← false
  τ_ref ← [s_0]; S'_BO ← ∅
  DFS(s_0)
  if success then
      s_prev ← s_0
      for i ← 1, ..., |VA| do
          a, s ← VA[i], VS[i+1]
          if s ∉ Explored then
              r ← R(s_prev, a, s)
              Push(τ_ref, ⟨a, r, s⟩)
              s_prev ← s
              if VS[i+2] ∈ Explored then      /* next state is a backtracking point */
                  S'_BO ← S'_BO ∪ {s}

  Function DFS(s):
      if s ∈ S_U then
          Explored ← Explored ∪ {s}; return
      if s ∈ S_G or success then
          success ← true; return
      for a ∈ A do
          repeat rep times
              Sample s' from P(s, a, ·)
              if s' ∉ VS then
                  Push(VA, a); Push(VS, s')
                  DFS(s')
      if ¬success then Explored ← Explored ∪ {s}
```

Algorithm 1 gives the pseudocode of our search algorithm to compute $\tau_{ref} \in \mathcal{T}$ and a set of boundary states $S'_{BO} \subseteq S_{BO}$. The list $VS$ stores states that have already been visited, and $VA$ stores the executed actions leading to the corresponding states in $VS$. Every time the search visits an unsafe state, the algorithm backtracks. A non-terminal state $s$ is added to $Explored$ if the DFS backtracked to $s$ from all successor states. By keeping track of visited states in the variable $VS$, we ensure that we do not explore a state twice along the same trace; that is, we use $VS$ to detect cycles. When visiting a goal state, $DFS(s_0)$ terminates successfully. In this case, $\tau_{ref}$ is built from the visited states that were not part of a backtracking branch of the search, i.e., $s$ is in $\tau_{ref}$ if $s \in VS$ and $s \notin Explored$, together with the corresponding actions in $VA$. States $s \in \tau_{ref}$ that have successor states $s' \in Explored$ are boundary states, i.e., $s \in S'_{BO}$.

Example. Figure 3 shows part of an MDP $\mathcal{M}$ that was explored during a run of our search algorithm. Unsafe states found during the search are marked red. After visiting $s_{10} \in S_G$ (green circle), the search function $DFS(s_0)$ returns with $VS = [s_0, \ldots, s_{10}]$, $VA = [a, a, b, a, b, b, a, a, b, b]$, and $Explored = \{s_2, s_3, s_4, s_5, s_8, s_9\}$. The reference trace (omitting rewards) is $\tau_{ref} = \langle s_0, a, s_1, b, s_6, a, s_7, b, s_{10} \rangle$ and the subset of boundary states is $S'_{BO} = \{s_1, s_7\}$ (blue circles).

[Figure 3: Run of the search algorithm]
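For illustration, a Python sketch of this search against a black-box simulator is given below. It is a sketch under our own assumptions, not the paper's implementation: `sim` is a hypothetical simulator offering `reset`, `save_state`, `load_state`, `step`, `is_unsafe`, and `is_goal`; boundary states are detected via the prose characterization above (a reference-trace state with a backtracked-from successor) rather than the exact bookkeeping of Algorithm 1.

```python
import math

def required_reps(confidence, p_min):
    """rep(c, p) = ceil(log(1 - c) / log(1 - p))."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_min))

def search_reference_trace(sim, actions, rep):
    """Backtracking DFS over a black-box simulator, in the spirit of Algorithm 1.

    Returns the reference trace as a list of (action, reward, state) entries
    together with the detected boundary states, or (None, set()) on failure.
    """
    s0 = sim.reset()
    visited = {s0}           # VS: cycle detection
    explored = set()         # states the DFS fully backtracked from
    successors = {}          # state -> set of successor states observed by sampling
    path = []                # current candidate trace: list of (a, r, s)
    found = []               # holds the reference trace once a goal state is reached

    def dfs(s, snapshot):
        if sim.is_unsafe(s):
            explored.add(s)
            return False
        if sim.is_goal(s):
            found.extend(path)
            return True
        for a in actions:
            for _ in range(rep):              # repeat to observe stochastic successors
                sim.load_state(snapshot)
                s2, r = sim.step(a)
                successors.setdefault(s, set()).add(s2)
                if s2 not in visited:
                    visited.add(s2)
                    path.append((a, r, s2))
                    if dfs(s2, sim.save_state()):
                        return True
                    path.pop()
        explored.add(s)                        # every successor forced a backtrack
        return False

    if not dfs(s0, sim.save_state()):
        return None, set()
    trace_states = [s0] + [s for (_, _, s) in found]
    boundary = {s for s in trace_states
                if any(nxt in explored for nxt in successors.get(s, ()))}
    return found, boundary
```

For long tasks, the recursion depth and the cost of repeated sampling make the state-space abstraction discussed below essential.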
Optimizing Search. Proper abstractions of the state space may be used to merge similar states, thereby pruning the search space and enabling the DFS to find cycles in the abstract state space. Detecting cycles speeds up the search, since the DFS backtracks when finding an edge to an already visited state. An example of such an abstraction is omitting the execution time from the state space in order to merge states.

4 Step 2 - Testing for Safety

Based on $\tau_{ref}$ and $S'_{BO}$ computed in Step 1, we propose several test suites to identify weak points of a policy, indicated by a high frequency of fail verdicts, i.e., safety violations. After discussing suitable test suites, we discuss how to execute them to test the safety of RL agents.

Simple Boundary Test Suite. We use the boundary states $S'_{BO}$ in $\tau_{ref}$ for boundary value testing [Pezzè and Young, 2007]. We compute a simple test suite that consists of all prefixes of $\tau_{ref}$ that end in a boundary state. From these traces, we use the action traces to bring the RL agent into safety-critical situations and to test its behavior regarding safety. Formally, let $D_B$ be the sequence of depths of the boundary states $S'_{BO}$ in $\tau_{ref}$, i.e., for any $s_{BO} \in S'_{BO}$: $d(\tau_{ref}, s_{BO}) \in D_B$. Using $D_B$, we compute a set of traces $T$ by $T = \{\tau_{ref}^{\downarrow D_B[i]} \mid 1 \leq i \leq |D_B|\}$. Omitting states and rewards from the traces in $T$ results in a set of action traces that forms a simple boundary test suite called $ST$. We say the action trace $\tau_{A,ref}^{\downarrow D_B[i]} \in ST$ is the test case for the $i$th boundary state in $S'_{BO}$.

Test Suites using Boundary Intervals. Boundary value testing checks not only boundary values, but also inputs slightly off the boundary [Pezzè and Young, 2007]. To transfer this concept to RL testing, we introduce boundary intervals to test at additional states near the boundary. In contrast to boundary testing of conventional software, our test cases stay in states traversed by $\tau_{ref}$. This choice is motivated by the definition of a boundary state: a state with successor states that necessarily lead to an unsafe state. Bringing the RL agent into such a losing position would not provide additional insight concerning the learned safety objective, since the agent has no other choice than to violate safety. However, testing states of $\tau_{ref}$ within an offset of boundary states provides insights into how well the RL agent escapes safety-critical situations. Given a simple test suite $ST$ and an interval size $is$, we create an interval test suite $IT(is)$ by adding additional test cases to $ST$, such that $IT(is) = \{\tau_{A,ref}^{\downarrow D_B[i]+off} \mid \tau_{A,ref}^{\downarrow D_B[i]} \in ST, -is \leq off \leq is\}$, where $\tau_{A,ref}$ is the reference action trace. The test case $\tau_{A,ref}^{\downarrow D_B[i]+off}$ tests the agent at boundary state $i$ with offset $off$.

Test Suites using Action Coverage. Combinatorial testing covers issues resulting from combinations of input values. We adapt this concept by creating test suites that cover combinations of actions near boundary states, i.e., the test suite evaluates which actions cause unsafe behavior in boundary regions. Given the reference action trace $\tau_{A,ref}$, a simple test suite $ST$, and a $k \geq 1$, we generate a $k$-wise action-coverage test suite $AC(k)$ by creating $|A|^k$ test cases for every test case in $ST$, covering all $k$-wise combinations of actions at the $k$th predecessor of a boundary state. The test suite is given by $AC(k) = \{\tau_{A,ref}^{\downarrow D_B[i]-k} \cdot ac \mid \tau_{A,ref}^{\downarrow D_B[i]} \in ST, ac \in A^k\}$.
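The three test suites follow directly from the definitions above. The sketch below is our illustration: `ref_actions` stands for the reference action trace $\tau_{A,ref}$ (as a Python list), `boundary_depths` for $D_B$, and `actions` for the action set $A$.

```python
from itertools import product

def simple_suite(ref_actions, boundary_depths):
    """ST: one prefix of the reference action trace per boundary state."""
    return [ref_actions[:d] for d in boundary_depths]

def interval_suite(ref_actions, boundary_depths, interval_size):
    """IT(is): prefixes ending within +/- interval_size steps of each boundary state."""
    suite = []
    for d in boundary_depths:
        for off in range(-interval_size, interval_size + 1):
            cut = d + off
            if 0 <= cut <= len(ref_actions):
                suite.append(ref_actions[:cut])
    return suite

def action_coverage_suite(ref_actions, boundary_depths, actions, k):
    """AC(k): prefix up to the k-th predecessor of a boundary state,
    extended by every k-wise combination of actions."""
    suite = []
    for d in boundary_depths:
        prefix = ref_actions[:max(d - k, 0)]
        for combo in product(actions, repeat=k):
            suite.append(prefix + list(combo))
    return suite
```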
Test-Case Execution & Verdict. To test the behavior of an agent regarding safety, we use a safety test-suite to bring the agent into safety-critical situations. A single test-case execution takes an action trace $\tau_A$, an initial state $s_0$, and a test length $l$ as parameters. To test an RL agent using $\tau_A$, we first execute $\tau_A$, yielding a trace $exec(\tau_A) = \langle s_0, a_1, r_1, s_1, \ldots, a_n, r_n, s_n \rangle$. A test case $\tau_A$ is invalid if $exec(\tau_A)$ consistently visits a terminal state in $S_T$ when executed repeatedly. Starting from $s_n$, we pick the next $l$ actions according to the policy of the agent. Note that $l$ should be chosen large enough to evaluate the behavior of the agent regarding safety; therefore, it should be considerably larger than the shortest path to the next unsafe state in $S_U$. After performing $l$ steps of the agent's policy, we evaluate the test case. A test can fail or pass: a test fails if, starting from $s_n$, the agent reaches an unsafe state in $S_U$ within $l$ steps; otherwise, the test passes. To execute a test suite $T$, we perform every test case of $T$ $n$ times and compute the relative frequency of fail verdicts resulting from executing each individual test case.

5 Step 3 - Generation of Fuzz Traces

Our testing framework evaluates the performance of RL agents using fuzz traces. The traces are used to compare gained rewards as well as to bring the agent into a variety of states and to evaluate its performance from these states onward. In this section, we discuss the generation of fuzz traces for performance testing. For this purpose, we propose a search-based fuzzing method [Zeller et al., 2021] based on genetic algorithms. The goal is to find action traces that (1) cover a large portion of the state space while (2) accomplishing the task to be learned by the RL agent.

Overview of the Computation of Fuzz Traces. Given the reference trace $\tau_{ref}$ that solves the RL task (i.e., $s_n \in S_G$) and parameter values for the number of generations $g$ and the population size $p$, the fuzz traces are computed as follows:

1. Initialize $T_0$, the trace population: $T_0 := \{\tau_{A,ref}\}$.
2. For $i = 1$ to $g$ generations do:
   (a) Create $p$ action traces (called offspring) from $T_{i-1}$ to yield a new population $T_i$ of size $p$, by either mutating a single parent trace from $T_{i-1}$ or through crossover of two parents from $T_{i-1}$, with a specified crossover probability.
   (b) Evaluate the fitness of every offspring trace in $T_i$.
3. Return $T_{fit}$, containing the fittest trace of each generation.

The fitness of a trace is defined in terms of state-space coverage and the degree to which the RL task is solved. The computation of the fuzz traces searches iteratively for traces with high fitness by choosing parent traces with a probability proportional to their fitness. To promote diversity, we favor mutation over crossover by setting the crossover probability to a value < 0.5. The set of the fittest traces $T_{fit}$ is used in Step 4 for performance testing. Using the single fittest trace from every generation helps to enforce variety.
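The overview above is a standard generational genetic algorithm over action traces. The following sketch is our illustration of that loop, not the authors' implementation: `fitness`, `mutate`, and `crossover` are assumed to be supplied (e.g., the fitness function described next), parents are chosen with probability proportional to their fitness, and mutation is favored by keeping the crossover probability below 0.5.

```python
import random

def fuzz_traces(ref_action_trace, fitness, mutate, crossover,
                generations=50, population_size=20, crossover_prob=0.25):
    """Search-based fuzzing of action traces via a genetic algorithm.

    Returns T_fit, the fittest trace of each generation.
    """
    population = [list(ref_action_trace)]               # T_0 = {tau_A,ref}
    fittest_per_generation = []

    for _ in range(generations):
        # fitness-proportional parent selection (small epsilon avoids all-zero weights)
        weights = [fitness(t) + 1e-9 for t in population]
        offspring = []
        for _ in range(population_size):
            if random.random() < crossover_prob and len(population) > 1:
                p1, p2 = random.choices(population, weights=weights, k=2)
                child = crossover(p1, p2)
            else:
                parent = random.choices(population, weights=weights, k=1)[0]
                child = mutate(parent)
            offspring.append(child)
        population = offspring
        fittest_per_generation.append(max(population, key=fitness))

    return fittest_per_generation
```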
Fitness Computation. We propose a fitness function especially suited for testing RL agents. For an action trace $\tau_A$, the fitness $F(\tau_A)$ is the weighted sum of three normalized terms:

- The positive-reward term $r_{pos}(\tau_A, s_0)$ is the normalized positive reward gained in $exec_\tau(\tau_A, s_0)$.
- The negative-reward term $r_{neg}(\tau_A, s_0)$ is the normalized, inverted negative reward gained in $exec_\tau(\tau_A, s_0)$.
- The coverage fitness term $f_c(\tau_A, s_0)$ describes the number of states newly visited by $exec_\tau(\tau_A, s_0)$, normalized by dividing by the maximum number of newly visited states of any action trace in the current population.

Positive rewards correspond to the degree to which the RL task is solved by $\tau_A$. Negative rewards often correspond to the time required to solve the RL task; hence, if $\tau_A$ solves the task fast, it is assigned a small negative reward. We invert the negative reward so that all fitness terms are positive. We normalize $r_{pos}$ and $r_{neg}$ by dividing them by the highest positive and negative reward, respectively, in the current generation. The coverage fitness term depends on all states visited in previous populations. Assume that the current generation is $i$, and let $Cov_{pop}$ be the set of all states visited by the previous populations $T_j$, $j < i$.
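A sketch of how the fitness terms could be computed for one generation is shown below. It reflects one reading of the description above and is not the authors' code: `execute` is assumed to return the positive reward, the negative reward, and the visited states of a run; the negative reward is made positive by negation; and the weights are free parameters. Because of the per-generation normalization, fitness is naturally computed for a whole population at once; the per-trace `fitness` callable in the previous sketch is a simplification of this.

```python
def generation_fitness(action_traces, execute, covered_so_far,
                       w_pos=1.0, w_neg=1.0, w_cov=1.0):
    """Compute F(tau_A) = w_pos*r_pos + w_neg*r_neg + w_cov*f_c for one generation.

    `execute(tau_A)` is assumed to return (positive_reward, negative_reward,
    visited_states) of a run from s0; `covered_so_far` is Cov_pop, the set of
    states visited by previous populations.
    """
    runs = [execute(t) for t in action_traces]
    pos = [p for (p, _, _) in runs]
    neg = [-n for (_, n, _) in runs]             # negate so the term is positive
    new_cov = [len(set(visited) - covered_so_far) for (_, _, visited) in runs]

    max_pos = max(pos) or 1.0                    # avoid division by zero
    max_neg = max(neg) or 1.0
    max_cov = max(new_cov) or 1

    fitness = [w_pos * p / max_pos + w_neg * n / max_neg + w_cov * c / max_cov
               for p, n, c in zip(pos, neg, new_cov)]

    for (_, _, visited) in runs:                 # update Cov_pop for the next generation
        covered_so_far.update(visited)
    return fitness
```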