# Programmatic Reward Design by Example

Weichao Zhou and Wenchao Li
Boston University
{zwc662,wenchao}@bu.edu

Reward design is a fundamental problem in reinforcement learning (RL). A misspecified or poorly designed reward can result in low sample efficiency and undesired behaviors. In this paper, we propose the idea of programmatic reward design, i.e. using programs to specify the reward functions in RL environments. Programs allow human engineers to express sub-goals and complex task scenarios in a structured and interpretable way. The challenge of programmatic reward design, however, is that while humans can provide the high-level structures, properly setting the low-level details, such as the right amount of reward for a specific subtask, remains difficult. A major contribution of this paper is a probabilistic framework that can infer the best candidate programmatic reward function from expert demonstrations. Inspired by recent generative-adversarial approaches, our framework searches for the most likely programmatic reward function under which the optimally generated trajectories cannot be differentiated from the demonstrated trajectories. Experimental results show that programmatic reward functions learned using this framework can significantly outperform those learned using existing reward learning algorithms, and enable RL agents to achieve state-of-the-art performance on highly complex tasks.

Introduction

Reward signals are an integral part of reinforcement learning (RL). Most conventional reward functions are goal-driven: they reward the agent only at the end of each episode. The sparsity of such reward signals, however, can make RL algorithms highly inefficient. As the complexity of the task increases, it becomes difficult for the agent to grasp the intricacies of the task solely from a goal-driven reward (Amodei et al. 2016). Inverse Reinforcement Learning (IRL) is a general paradigm that aims at recovering the intrinsic reward function of human experts from their demonstrations (Ng and Russell 2000; Ziebart et al. 2008). Earlier works on IRL require the provision of multiple feature functions to construct the reward function. More recent attempts use function approximation by means of deep neural networks to alleviate this limitation and have had considerable success (Fu, Luo, and Levine 2018; Finn, Levine, and Abbeel 2016; Finn et al. 2016). However, due to the lack of interpretability of the approximated reward functions, it is difficult to enforce specific correctness constraints in the reward learning process. Recent works proposed logic-based reward functions (Li, Vasile, and Belta 2017; Camacho et al. 2019) to endow an RL agent with high-level knowledge of the task via logical specifications. Logic-constrained rewards (Hasanbeig, Abate, and Kroening 2019) and reward machines (Icarte et al. 2018) explicitly represent reward functions as automata. However, it is still cumbersome to design the automata, and they can be difficult to understand when the automata are large. In this paper, we propose programmatic reward functions, i.e. reward functions represented as programs expressed in a human-readable domain-specific language (DSL). There are several benefits of using programmatic reward functions for RL policy training.
First, programs allow human engineers to express sub-goals and complex task scenarios in a structured and interpretable way. The inclusion of such domain knowledge forms inductive biases that help improve the sample efficiency and performance of RL agents. Second, engineers can take advantage of the rich semantics of DSLs to explicitly memorize, manipulate and leverage hindsight experiences of the RL agent. Lastly, programs are amenable to the specification of symbolic constraints over the holes.

In a typical design routine of a programmatic reward function, we assume that an engineer can provide human insights in the form of a partial program, or a sketch (Solar-Lezama 2008), analogous to humans providing feature functions in IRL and logical specifications in logic-based reward designs, to express the high-level structures of the task specification or reward function. The sketch in essence defines the set of events, and certain interactions among them, that the human engineer deems relevant to the task. The low-level details, such as the right amount of reward for a specific event or sub-task, are left as holes. Similar to Programming by Example (PBE) (Menon et al. 2013), we propose to infer the holes in a programmatic reward sketch from expert-demonstrated trajectories. A key difference of our approach from PBE is that the demonstrated trajectories do not directly correspond to inputs or (intermediate) outputs of the program, but instead are assumed to be generated by an unknown expert policy that is optimal under some realization of the programmatic reward function. A major contribution of this paper is a probabilistic learning algorithm that can complete a given programmatic reward sketch based on expert demonstrations. Our overall framework, called Programmatic Reward Design by Example (PRDBE), consists of three components: a set of example trajectories demonstrated by a human expert, a program sketch, and a symbolic constraint that the complete program should satisfy. Directly searching in the program space is a combinatorial problem and can easily become intractable. Our approach is to search for the most likely program that matches the expert's intrinsic reward function. Our solution is inspired by generative adversarial approaches (Finn et al. 2016; Jeon, Seo, and Kim 2018) which introduce a discriminator to distinguish agent trajectories from the expert's. However, instead of formulating an agent's policy as a generative model, we sample trajectories that are optimal under a candidate programmatic reward function and iteratively improve the candidate program to maximize the chance of the discriminator making false predictions on those trajectories. To circumvent the issue of non-differentiability of programs, we employ a sampler to sample candidate programs from the space of valid programs. In particular, we use self-normalized importance sampling to sample trajectories from an agent's policy.

We summarize our contributions below.

- We propose Programmatic Reward Design by Example (PRDBE), a novel paradigm to design and learn program-like reward functions for RL problems.
- We develop a probabilistic learning framework that can infer the most likely candidate reward program from expert demonstrations.
- Our approach enables RL agents to achieve state-of-the-art performance on highly complex environments with only a few demonstrations.
- In addition, we show that programmatic reward functions generalize across different environment configurations of the same task.

Related Work

Inverse Reinforcement Learning. IRL (Ng and Russell 2000; Abbeel and Ng 2004) instantiates a learning-from-demonstrations (LfD) framework to effectively infer the intrinsic reward functions of human experts. It is also known to admit an infinite number of solutions. To resolve the ambiguity of IRL, max-entropy (Ziebart et al. 2008), max-margin (Abbeel and Ng 2004; Ratliff, Bagnell, and Zinkevich 2006) and Bayesian (Ramachandran and Amir 2007) methods have been proposed. However, those IRL methods assume linear rewards on the basis of non-linear state features, and also call RL in a loop, repeatedly solving the entire environment. Recent works (Fu, Luo, and Levine 2018; Ho and Ermon 2016; Jeon, Seo, and Kim 2018; Finn et al. 2016), drawing a connection between IRL and Generative Adversarial Networks (GANs) (Goodfellow et al. 2014), have substantially improved the scalability of IRL by using deep neural networks and data-driven approaches. Our work, while embracing a data-driven ideology, represents the reward function in an interpretable way.

Reward Design. Reward shaping (Ng, Harada, and Russell 1999) greatly speeds up RL training by modifying a sparse reward function with state-based potential functions. Intrinsic reward generation (Bellemare et al. 2016; Pathak et al. 2017; Alshiekh et al. 2018) and adversarially-guided (Flet-Berliac et al. 2021) techniques aim at motivating the agent to exhaustively explore the environment. Unlike these approaches, our work aims at capturing human knowledge in the reward function and does not generate uninterpretable reward signals densely ranging over the entire state space. Logic-based reward designs (Li, Vasile, and Belta 2017; Hasanbeig, Abate, and Kroening 2019; Camacho et al. 2019) present human knowledge in reward functions with specification languages such as linear temporal logic (LTL) (Baier and Katoen 2008). Reward machine theories (Icarte et al. 2022) further directly represent the reward functions as finite state automata which can be translated into logic formulas. Our work distinguishes itself by 1) using programming languages instead of logics to expressively represent human insights in the reward functions; and 2) adopting LfD to implement the low-level details in the reward functions. Regarding LfD, Inverse Reward Design (IRD) (Hadfield-Menell et al. 2017) designs reward functions in a manner similar to IRL. Safety-aware apprenticeship learning (Zhou and Li 2018) incorporates formal specifications and formal verification into IRL. However, those works restrict the reward function to be a linear combination of state features. Our work does not have such limitations.

Interpretable Reinforcement Learning. Learning interpretable RL policies has drawn continual attention (Andre and Russell 2001, 2002; Verma et al. 2018; Zhu et al. 2019; Yang et al. 2021; Tian et al. 2020). Our work focuses on programmatic reward functions instead of policies because well-designed reward functions can benefit diverse RL training methods. There have been a variety of works on learning-based program synthesis (Ellis et al. 2021; Parisotto et al. 2016; Itzhaky et al. 2010; Gulwani and Jain 2017; Ellis, Solar-Lezama, and Tenenbaum 2015). Our work is inspired by the concept of sketch synthesis (Solar-Lezama 2008). Realizing sketch synthesis from examples for reward functions is our main contribution in this paper.
Background

Reinforcement Learning. RL problems model an environment as a Markov Decision Process (MDP) $M := \langle S, A, P, d_0 \rangle$ where $S$ is a state space, $A$ is an action space, $P(s'|s,a)$ is the probability of reaching a state $s'$ by performing an action $a$ at a state $s$, and $d_0$ is an initial state distribution. A trajectory $\tau = s^{(0)}a^{(0)}s^{(1)}a^{(1)}\ldots s^{(T)}a^{(T)}$ is produced by sequentially performing actions for $T$ steps after starting from an initial state $s^{(0)} \sim d_0$. An RL agent can select actions according to a control policy $\pi(a|s)$ which determines the probability of performing action $a$ at any state $s$. A reward function is a mapping $f: S \times A \rightarrow \mathbb{R}$ from state-action pairs to the reals. With a slight abuse of notation, we also write the total reward and the joint probability along a trajectory $\tau$ as $f(\tau) = \sum_{t=0}^{T} f(s^{(t)}, a^{(t)})$ and $\pi(\tau) = \prod_{t=0}^{T} \pi(a^{(t)}|s^{(t)})$ respectively. Let $H(\pi)$ be the entropy of $\pi$. The objective of entropy-regularized RL (Levine 2018) is to minimize $J_{RL}(\pi) = -\mathbb{E}_{\tau\sim\pi}[f(\tau)] - H(\pi)$.

Learning from Demonstrations. When the reward function is not given but a set $E$ of expert trajectories $\tau_E$'s is demonstrated by some unknown expert policy $\pi_E$, GAIL (Ho and Ermon 2016) trains an agent policy $\pi_A$ to imitate $\pi_E$ via a generative adversarial objective as in (2), where a discriminator $D: S \times A \rightarrow [0, 1]$ is trained to maximize (2) so that it can identify whether an $(s, a)$ is sampled from the $\tau_E$'s, and $\pi_A$ as the generator is optimized to minimize (2) so that its trajectories $\tau_A$'s are indistinguishable from the $\tau_E$'s. Bayesian GAIL (Jeon, Seo, and Kim 2018) labels any expert trajectory $\tau_E$ with $1_E$ and $0_E$ to respectively indicate classifying $\tau_E$ as being sampled from $E$ or not. Likewise, $1_A$ and $0_A$ indicate classifying any agent trajectory $\tau_A$ as from $E$ or not. Bayesian GAIL learns the most likely $D$ that makes the correct classifications $0_A, 1_E$ in terms of

$p(D \mid 0_A, 1_E; \pi_A; E) \propto p(D)\,p(0_A, 1_E \mid \pi_A, D; E) \propto p(D)\sum_{\tau_A} p(\tau_A|\pi_A)\,p(0_A|\tau_A; D)\sum_{\tau_E} p(\tau_E|E)\,p(1_E|\tau_E; D)$

where $p(1|\tau; D) = \prod_{t=0}^{T} D(s^{(t)}, a^{(t)})$. Its logarithm (1) is lower-bounded by (2) due to Jensen's inequality. It is shown in (Fu, Luo, and Levine 2018) that by representing $D(s,a) := \frac{\exp(f(s,a))}{\exp(f(s,a)) + \pi_A(a|s)}$ with a neural network $f$, (2) is maximized when $f = \log \pi_E$, implying that $f$ is a reward function under which $\pi_E$ is optimal. Hence, by representing $D$ in (1) and (2) with $f$, an IRL objective of solving for the most likely expert reward function $f$ is obtained.

$\log p(0_A, 1_E \mid \pi_A, D; E) = \log\Big(\sum_{\tau_E} p(\tau_E|E)\,p(1_E|\tau_E; D)\Big) + \log\Big(\sum_{\tau_A} p(\tau_A|\pi_A)\,p(0_A|\tau_A; D)\Big)$ (1)

$\mathbb{E}_{\tau_E \sim E}\Big[\sum_{t=0}^{T} \log D(s_E^{(t)}, a_E^{(t)})\Big] + \mathbb{E}_{\tau_A \sim \pi_A}\Big[\sum_{t=0}^{T} \log\big(1 - D(s_A^{(t)}, a_A^{(t)})\big)\Big]$ (2)
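As a concrete illustration of this discriminator parameterization and the lower bound (2), the following numpy sketch evaluates $D$ and the GAIL-style objective on batches of per-step values. The array shapes and function names are illustrative assumptions, not part of the paper's implementation.

```python
import numpy as np

def discriminator(f_sa, pi_a_sa):
    """AIRL-style discriminator D(s,a) = exp(f(s,a)) / (exp(f(s,a)) + pi_A(a|s))."""
    ef = np.exp(f_sa)
    return ef / (ef + pi_a_sa)

def gail_lower_bound(f_expert, pi_a_expert, f_agent, pi_a_agent):
    """Objective (2): E_expert[sum_t log D] + E_agent[sum_t log(1 - D)].

    f_expert, f_agent: arrays of shape (num_trajs, T) holding f(s^(t), a^(t)).
    pi_a_expert, pi_a_agent: pi_A(a^(t)|s^(t)) along the same trajectories.
    """
    d_expert = discriminator(f_expert, pi_a_expert)
    d_agent = discriminator(f_agent, pi_a_agent)
    expert_term = np.log(d_expert).sum(axis=1).mean()      # E_{tau_E}[sum_t log D]
    agent_term = np.log(1.0 - d_agent).sum(axis=1).mean()  # E_{tau_A}[sum_t log(1 - D)]
    return expert_term + agent_term
```

The discriminator is maximizing this quantity while the generator (the agent policy) is trained to minimize it, which is the adversarial game referred to above.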
Program Synthesis by Sketching. In the sketch synthesis problem, a human designer provides a sketch e, i.e., an incomplete program wherein certain details are left empty. Each unknown detail is called a hole and denoted as ?id, indexed by id in order of its appearance in e. The formalism of the sketches follows a general grammar Λ as in (3) and (4), where G is a family of functions customized with domain-specific knowledge for the tasks and x represents the input argument of the program. The grammar Λ induces a set E of syntactically allowable sketches. When a sketch e ∈ E is given by the designer, ?e = {?1, ?2, ...} denotes an ordered set of the holes appearing in e. Let H be a set of possible assignments to ?e. The sketch e and H induce a set L = {l ≡ e[h/?e] | h ∈ H} of complete programs, where e[h/?e] means substituting ?e in e with an assignment h. Besides the syntax, the program l is required to be a function with a list-type [(S, A)] argument. A valid assignment to the argument can be a list-represented trajectory, i.e., by writing τ = s(0)a(0)...s(t)a(t) as [(s(0), a(0)), ..., (s(t), a(t))]. Hereinafter we refer to τ either as a sequence of state-action pairs or as a list of state-action tuples depending on the context. The rules in (5) and (6) define the semantics of Λ, wherein (6) replaces x with τ in g(e1, ..., en). The result of directly applying an input τ to a sketch e is written as [[e]](τ), which induces a partial program with ?e as free variables. For any complete program l, given a trajectory input τ, the output [[l]](τ) is required to be a |τ|-length real-valued list.

Sketch e := u | g(e1, ..., en),  g ∈ G (3)
Term u := const | ?id | x (4)
[[const]](τ) := const,  [[?id]](τ) := ?id,  [[x]](τ) := τ (5)
[[g(e1, ..., en)]](τ) := g(e1, ..., en)[τ/x] (6)

Figure 1: (a) MiniGrid 8x8 Door-Key; (b) a programmatic reward sketch for (a); (c) a reward function as a finite state automaton; (d) a programmatic reward sketch for (c).

Motivating Example

In this section, we motivate the problem of programmatic reward design with the pseudo-code of two programmatic reward function sketches: one for a navigation task in a gridworld and the other for representing a reward function that has been formulated as a finite state automaton (FSA). For these two tasks, we assume that the domain expert provides a set of tokens, such as 'reach goal' and 'unlock door' representing the behavior of the agent, and A, B, C representing the FSA states. A predicate function pred(·) ∈ G of a trajectory can output those tokens according to the last state-action pair in the input trajectory.

Example 1 (Door-Key Task). Fig. 1a is an 8x8 Door-Key task in a MiniGrid environment (Chevalier-Boisvert, Willems, and Pal 2018). An agent needs to pick up a key, unlock the yellow door on the grey wall, and reach the green goal tile. At every step, the agent can observe at most a 7x7 area in front of it if the area is not blocked by walls and doors. By default, the environment only returns a reward when the agent reaches the goal tile.

A sketch reward_fn of a programmatic reward function for Example 1 is shown in Fig. 1b. The holes {?id} for id = 1, ..., 5 are unknown numerical values. The input argument is written as traj instead of x for readability. The output is a real-valued list concatenating the results of each recursive call to reward_fn. The elements of the output list are the rewards for the successive state-action pairs in the input trajectory. This sketch responds to events such as reaching the goal and unlocking the door. We note that this programmatic reward function, while sparse in appearance considering the large state space, differs from the default goal-driven reward by informing the agent of the stage-wise completion of the task. Lines 7, 10 and 13 explicitly leverage hindsight experience to determine the reward for the current step. Critically, the ?id's in Fig. 1b should not be arbitrarily assigned. Suppose that a penalty ?5 ≤ 0 for dropping the key is smaller in magnitude than an award ?4 ≥ 0 for picking up the key. The agent would then repeatedly pick up and drop the key to obtain a high net gain ?5 + ?4 > 0 instead of reaching for the goal state. Besides, we observe that the penalties for redundant actions such as closing the door should not be overwhelmingly high; otherwise an under-trained agent may be incentivized to stay away from the door for good.
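Since Fig. 1b is not reproduced in the text, the following Python sketch is only a hypothetical reconstruction of the structure described above. The toy pred, the underscored event tokens, and the assignment of holes ?2 and ?3 to specific events are assumptions; ?1 (reaching the goal), ?4 (picking up the key) and ?5 (dropping the key) follow the discussion above.

```python
# Hypothetical rendering of a Door-Key reward sketch in the spirit of Fig. 1b.
# Holes h[1..5] are the unknown reals ?1..?5 that PRDBE must infer.

EVENTS = {"reach_goal", "unlock_door", "pickup_key", "drop_key", "close_door"}

def pred(traj):
    """Toy predicate: here the 'action' component of the last pair is already an event token."""
    state, action = traj[-1]
    return action if action in EVENTS else None

def reward_fn(traj, h):
    """Return a |traj|-length real-valued list of per-step rewards."""
    if not traj:
        return []
    prefix_rewards = reward_fn(traj[:-1], h)                     # recursive call on the prefix
    event = pred(traj)
    past_events = {pred(traj[:k]) for k in range(1, len(traj))}  # hindsight over earlier steps
    if event == "reach_goal":
        r = h[1]
    elif event == "unlock_door" and "unlock_door" not in past_events:
        r = h[2]                       # reward only the first unlocking (assumed)
    elif event == "close_door":
        r = h[3]                       # mild penalty for a redundant action (assumed)
    elif event == "pickup_key" and "unlock_door" not in past_events:
        r = h[4]                       # key is only useful before the door is unlocked (assumed)
    elif event == "drop_key":
        r = h[5]                       # penalty for dropping the key
    else:
        r = 0.0
    return prefix_rewards + [r]

holes = {1: 1.0, 2: 0.5, 3: -0.1, 4: 0.3, 5: -0.4}   # one constraint-satisfying assignment
demo = [("s0", "pickup_key"), ("s1", None), ("s2", "unlock_door"), ("s3", "reach_goal")]
print(reward_fn(demo, holes))   # [0.3, 0.0, 0.5, 1.0]
```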
Example 2 (Reward Function as a Finite State Automaton (FSA)). In Fig. 1c, a reward function for some RL problem is represented as an FSA comprised of at least 3 states A, B, C, among which A indicates task initialization, B indicates task completion, and C indicates the occurrence of some other event of interest. Each directed edge represents a state transition triggered by a step in the environment. Other states and transitions are omitted. Each transition is annotated with (pid, ?id), where pid is an unknown transition probability dependent on both the environment and the RL policy, and ?id is the reward sent to the agent at the moment of the transition. An RL agent is supposed to accomplish the task with a minimal number of steps in the environment. Note that the states in Fig. 1c are not to be confused with the MDP states in the environment; one step in the environment does not necessarily trigger the transitions drawn in Fig. 1c either.

FSAs such as the one in Example 2 are explicitly or implicitly adopted in several logic-based reward designs such as reward machines (Icarte et al. 2018). Such an FSA can be represented by the sketch in Fig. 1d by introducing a local variable q as the FSA state. The key problem, again, is to determine appropriate values for the ?id's. We discuss the challenges facing this problem in the next section.

Problem Formulation

In this section we augment the concept of sketching. Then we characterize the problem of Programmatic Reward Design (PRD) and discuss the challenges in PRD.

Sketching with Symbolic Constraints

Given a sketch e, a designer can put constraints on ?e. A symbolic constraint c is a combination of multiple predicates, $c := \mu \mid \neg\mu \mid c_1 \wedge c_2 \mid c_1 \vee c_2$, where the atomic predicate $\mu: H \rightarrow \{\top, \bot\}$ is the basic building block. Under Boolean operations, the predicates follow the semantics in (7), where the expression µ[h/?e] substitutes ?e in µ with h to output a Boolean value. A satisfying implementation of the mapping $\mathrm{eval}: \{\top, \bot\} \rightarrow \mathbb{R}$ is $\mathrm{eval}(\cdot) := 2\,\mathbb{I}(\cdot) - 1$, where $\mathbb{I}: \{\top, \bot\} \rightarrow \{0, 1\}$ is an indicator function. A symbolic constraint c is satisfied if $[[c]](h) \geq 0$. Table 1 instantiates two predicates $c_1, c_2$ for Example 1; then $c = c_1 \wedge c_2$ can be a symbolic constraint for Example 1. For simplicity, we only consider conjunctions of atomic predicates in this paper. We are now ready to state the Programmatic Reward Design problem as in Definition 1.

$[[\mu]](h) := \mathrm{eval}(\mu[h/?_e])$,  $[[\neg\mu]](h) := -[[\mu]](h)$,  $[[c_1 \wedge c_2]](h) := \min([[c_1]](h), [[c_2]](h))$,  $[[c_1 \vee c_2]](h) := \max([[c_1]](h), [[c_2]](h))$ (7)

Table 1: The correspondence between two desired properties and the predicates for the sketch in Fig. 1b.
- c1 (reward reaching the goal): $\bigwedge_{id=1}^{5} (?_{id} \leq ?_1)$
- c2 (penalize dropping an unused key): $?_5 + ?_4 \leq 0$

Definition 1 (Programmatic Reward Design (PRD)). For an RL task, a PRD problem is a tuple ⟨e, ?e, H, c⟩ where e is a sketch whose holes ?e take values from a set H, and c is a symbolic constraint. The solution to a PRD problem ⟨e, ?e, H, c⟩ is any valid program $l \equiv e[h/?_e]$ subject to $h \in H \wedge [[c]](h) \geq 0$.
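To make the semantics in (7) concrete, the following Python sketch evaluates a symbolic constraint built from the Table 1 predicates on a candidate hole assignment. The tuple-based constraint representation and the example values of h are illustrative assumptions, not the paper's DSL.

```python
def eval_bool(b):
    """eval: {True, False} -> R, eval(.) = 2*I(.) - 1."""
    return 2 * int(b) - 1

def sem(c, h):
    """[[c]](h) for c = ('atom', mu) | ('not', c) | ('and', c1, c2) | ('or', c1, c2)."""
    tag = c[0]
    if tag == "atom":
        return eval_bool(c[1](h))          # mu[h/?e] evaluated to a Boolean
    if tag == "not":
        return -sem(c[1], h)
    if tag == "and":
        return min(sem(c[1], h), sem(c[2], h))
    if tag == "or":
        return max(sem(c[1], h), sem(c[2], h))
    raise ValueError(tag)

# Table 1 predicates for the Door-Key sketch (hole 1 = reach goal, 4 = pick up key, 5 = drop key).
c1 = ("atom", lambda h: all(h[i] <= h[1] for i in range(1, 6)))   # reward reaching the goal most
c2 = ("atom", lambda h: h[5] + h[4] <= 0)                         # penalize dropping an unused key
c = ("and", c1, c2)

h = {1: 1.0, 2: 0.5, 3: -0.1, 4: 0.3, 5: -0.4}
print(sem(c, h) >= 0)   # True: this assignment satisfies the symbolic constraint
```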
Challenges in PRD

We note that solving the PRD problem does not guarantee that the resulting reward function will be effective. The challenge of assigning proper values to the holes is faced not only by PRD but also by other symbolic reward design approaches. We use Example 2 to illustrate two aspects of this challenge.

A) Goal-driven rewards. The reward functions specified via logically guided approaches (Hasanbeig, Abate, and Kroening 2019) can usually be translated into transition systems similar to the one in Fig. 1c. Many of those approaches end up only assigning non-trivial values to ?1 and ?4 while equalizing the rewards for all the other transitions. However, when p1 and p4 are extremely small, e.g. $p_1, p_4 \ll p_2, p_3$, such a goal-driven reward assignment barely provides any guidance when the agent explores the environment.

B) Unknown dynamics. The reward shaping approach from (Camacho et al. 2019) adopts reward structures similar to the one in Fig. 1c but allocates rewards to all transitions while ignoring the dynamics of the environment. This approach may result in inefficiency in large environments. For instance, if $p_2 p_4 < p_1 < p_2$, a globally optimal policy would only aim at triggering the single transition A→B. However, the shaped reward may assign a non-trivial reward to ?2, causing the RL algorithm to spend excessive training episodes on a locally optimal policy that lingers over A→C→B.

Given a sketch such as the one in Fig. 1d, a PRD designer would also face those challenges. Thus we assume that a PRD problem is accompanied by a set of demonstrations that show how an expert can accomplish the task, similar to the setting of IRL. These demonstrations will help narrow down the solutions to the PRD problem. We hereby propose Programmatic Reward Design by Example (PRDBE).

Programmatic Reward Design by Example

Similar to Bayesian GAIL introduced in the Background section, we consider a probabilistic inference perspective for formulating PRDBE. We first introduce a term to correlate programmatic reward functions with trajectory distributions.

Definition 2 (Nominal Trajectory Distribution). Given a programmatic reward function l, a nominal trajectory distribution of l is $\hat{p}(\tau|l) = p(\tau)\exp(l(\tau))$, where $p(\tau)$ is the probability of sampling τ under the passive dynamics of the environment and $l(\tau)$ is short for $\sum_t [[l]](\tau)[t]$. Furthermore, a normalized nominal trajectory distribution of l is $p(\tau|l) = p(\tau)\exp(l(\tau))/Z_l$, where $Z_l = \sum_{\tau} \hat{p}(\tau|l) = \sum_{\tau} p(\tau)\exp(l(\tau))$.

The nominal trajectory distribution $\hat{p}(\tau|l)$ can be viewed as a proxy of a possibly non-Markov policy $\pi_l(a^{(t)}|s^{(t)}) \propto \exp([[l]](\tau)[t])$. Such a policy trivially minimizes $J_{RL}(\pi_l)$ (Levine 2018), the RL loss described in the Background section. Intuitively, we search for an l such that $\pi_l$ matches $\pi_E$. Given this intuition, we formally define the problem of Programmatic Reward Design by Example in Definition 3.

Definition 3 (Programmatic Reward Design by Example (PRDBE)). Given a set of expert-demonstrated trajectories E and a PRD problem ⟨e, ?e, H, c⟩, the PRDBE problem ⟨e, ?e, H, c, E⟩ is to find a solution $l^*$ to the PRD problem such that for any τ the nominal trajectory distribution satisfies $\hat{p}(\tau|l^*) = p(\tau|l^*) = p_E(\tau)$, where $p_E$ is the probability of sampling τ from E.

However, solving the PRDBE problem requires addressing the following challenges: a) the set of solutions to the PRD problem may not contain a satisfying solution $l^*$ for PRDBE, and b) the sketch may not be differentiable w.r.t. the holes. In other words, there may not exist a perfect solution to the PRDBE problem, and gradient-based optimization may not be readily applicable to PRDBE. To overcome these issues, we propose a learning framework with a relaxed objective.

A Generative Adversarial Learning Framework

Our learning framework realizes the probability matching between $\hat{p}(\tau|l)$, $p(\tau|l)$ and $p_E(\tau)$ in a generative-adversarial fashion.
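To illustrate Definition 2, the following numpy sketch computes the nominal and normalized nominal trajectory distributions over a small, enumerable set of trajectories; in realistic environments these sums are intractable, which is precisely why the framework below resorts to sampling. The function names and the finite trajectory set are assumptions made for illustration.

```python
import numpy as np

def program_return(l, traj):
    """l(tau) = sum_t [[l]](tau)[t]: total programmatic reward along a trajectory."""
    return float(sum(l(traj)))

def nominal_distribution(l, trajs, passive_probs):
    """Definition 2 on an enumerable trajectory set.

    Returns (unnormalized p_hat(tau|l), normalized p(tau|l)) as numpy arrays.
    """
    passive_probs = np.asarray(passive_probs, dtype=float)   # p(tau) under passive dynamics
    weights = np.array([np.exp(program_return(l, t)) for t in trajs])
    p_hat = passive_probs * weights                          # p_hat(tau|l) = p(tau) * exp(l(tau))
    return p_hat, p_hat / p_hat.sum()                        # Z_l = sum_tau p_hat(tau|l)
```

Here `l` can be any complete program, e.g. `lambda traj: reward_fn(traj, holes)` using the Door-Key sketch shown earlier.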
It searches for an $l^*$ such that even the best discriminator, represented with a reward function f as mentioned in the Background section, cannot distinguish trajectories sampled from $p_E$ from those sampled from $p(\tau|l)$. Given an intermediate learned reward function f, while Bayesian GAIL trains a $\pi_A$ to minimize the log-likelihood (1) of correct classifications between agent trajectories $\tau_A$ and expert trajectories $\tau_E$, we learn an l to maximize the log-likelihood of false classifications, i.e.,

$\log p(1_A, 0_E \mid l, f; E) = \log\Big(\sum_{\tau_A} p(\tau_A|l)\,p(1_A|\tau_A; l, f)\Big) + \log\Big(\sum_{\tau_E} p_E(\tau_E)\,p(0_E|\tau_E; l, f)\Big)$

with $p(1_A|\tau_A; l, f) := \prod_{t=0}^{T}\frac{\exp(f(s_A^{(t)}, a_A^{(t)}))}{\exp(f(s_A^{(t)}, a_A^{(t)})) + \exp([[l]](\tau_A)[t])}$ and $p(0_E|\tau_E; l, f) := \prod_{t=0}^{T}\frac{\exp([[l]](\tau_E)[t])}{\exp(f(s_E^{(t)}, a_E^{(t)})) + \exp([[l]](\tau_E)[t])}$ being the discriminators, and $p(\tau_A|l)$ being the generator of the $\tau_A$'s. Since the l's are non-differentiable, we do not directly optimize l. Recalling that L is the program space induced by H, we optimize a sampler $q: L \rightarrow [0, 1]$ to concentrate the distribution density on those candidate programs l that incur high values of $\log p(1_A, 0_E|l, f; E)$. Due to the introduction of q, the log-likelihood of false classification becomes $\log p(1_A, 0_E|q, f; E) \geq \sum_{l\in L} q(l)\log p(1_A, 0_E|l, f; E)$, which is further lower-bounded by (8). We then introduce an agent policy $\pi_A$ for importance sampling of $p(\tau_A|l)$ in (9). Thus q and $\pi_A$ together constitute the generator. After cancelling out the passive-dynamics-induced $p(\tau_A)$ in the numerator and denominator as in (10), we handle the normalization constant $Z_l$ via self-normalized importance sampling (Bugallo et al. 2017) as in (11), by i.i.d. sampling a set $\{\tau_{A,i}\}_{i=1}^{m}$ of trajectories with $\pi_A$. We refer to (11) as the generative objective $J_{gen}(q)$.

$\sum_{l\in L} q(l)\,\log p(1_A, 0_E|l, f; E) \;\geq\; \sum_{l\in L} q(l)\Big\{\mathbb{E}_{\tau_A\sim p(\tau_A|l)}\big[\log p(1_A|\tau_A; l, f)\big] + \mathbb{E}_{\tau_E\sim E}\big[\log p(0_E|\tau_E; l, f)\big]\Big\}$ (8)

$= \sum_{l\in L} q(l)\Big\{\mathbb{E}_{\tau_A\sim \pi_A}\Big[\frac{p(\tau_A|l)}{p(\tau_A|\pi_A)}\log p(1_A|\tau_A; l, f)\Big] + \mathbb{E}_{\tau_E\sim E}\big[\log p(0_E|\tau_E; l, f)\big]\Big\}$ (9)

$= \sum_{l\in L} q(l)\Big\{\mathbb{E}_{\tau_A\sim \pi_A}\Big[\frac{\exp(l(\tau_A))}{Z_l\,\pi_A(\tau_A)}\log p(1_A|\tau_A; l, f)\Big] + \mathbb{E}_{\tau_E\sim E}\big[\log p(0_E|\tau_E; l, f)\big]\Big\}$ (10)

$\approx \sum_{l\in L} q(l)\Bigg\{\frac{\sum_{i=1}^{m}\frac{\exp(l(\tau_{A,i}))}{\pi_A(\tau_{A,i})}\log p(1_A|\tau_{A,i}; l, f)}{\sum_{i=1}^{m}\frac{\exp(l(\tau_{A,i}))}{\pi_A(\tau_{A,i})}} + \mathbb{E}_{\tau_E\sim E}\big[\log p(0_E|\tau_E; l, f)\big]\Bigg\}$ (11)

Next, we note that the existence of the symbolic constraint c imposes a prior $p(l|c)$ over the search space of l. We let $p(l|c)$ be a uniform distribution over the programs that do not violate c, i.e., $\{l \equiv e[h/?_e] \mid [[c]](h) \geq 0\} \subseteq L$, while being zero anywhere else. Then the objective of our learning framework for q becomes minimizing $D_{KL}\big(q(l)\,\|\,p(l|1_A, 0_E, f; E, c)\big)$, where $p(l|1_A, 0_E, f; E, c) = p(1_A, 0_E|l, f; E)\,p(l|c)\,/\,p(1_A, 0_E|f; E)$. We minimize this KL-divergence by maximizing its evidence lower bound (ELBO) as in (12), which equals the sum of $J_{gen}(q)$, an entropy term $H(q)$, and $J_c(q) := \mathbb{E}_{l\sim q}[\log p(l|c)]$.

$\mathrm{ELBO}(q) = J_{gen}(q) - D_{KL}\big(q(l)\,\|\,p(l|c)\big) = H(q) + J_c(q) + J_{gen}(q)$ (12)
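The inner term of (11) can be estimated as in the following numpy sketch, which computes the self-normalized importance weights and the false-classification log-likelihoods for a single candidate program l. The expectation over l ∼ q and all gradient machinery of Algorithm 1 are omitted, and the array layouts and names are illustrative assumptions.

```python
import numpy as np

def log_p_false_agent(f_vals, l_vals):
    """log p(1_A | tau_A; l, f) = sum_t log[ exp(f_t) / (exp(f_t) + exp(l_t)) ]."""
    f_vals, l_vals = np.asarray(f_vals), np.asarray(l_vals)
    return np.sum(f_vals - np.logaddexp(f_vals, l_vals))

def log_p_false_expert(f_vals, l_vals):
    """log p(0_E | tau_E; l, f) = sum_t log[ exp(l_t) / (exp(f_t) + exp(l_t)) ]."""
    f_vals, l_vals = np.asarray(f_vals), np.asarray(l_vals)
    return np.sum(l_vals - np.logaddexp(f_vals, l_vals))

def j_gen_single_program(agent_f, agent_l, agent_logpi, expert_f, expert_l):
    """Self-normalized IS estimate of the bracketed term in (11) for one candidate program l.

    agent_f[i], agent_l[i]: per-step f and [[l]] values along sampled trajectory tau_{A,i}.
    agent_logpi[i]: log pi_A(tau_{A,i}) = sum_t log pi_A(a_t|s_t).
    expert_f[j], expert_l[j]: per-step values along demonstrated trajectory tau_{E,j}.
    """
    # Importance weights exp(l(tau_A,i)) / pi_A(tau_A,i), computed in log space for stability.
    log_w = np.array([np.sum(l) - lp for l, lp in zip(agent_l, agent_logpi)])
    w = np.exp(log_w - log_w.max())
    w = w / w.sum()                                        # self-normalization replaces 1/Z_l
    agent_term = sum(wi * log_p_false_agent(f, l)
                     for wi, f, l in zip(w, agent_f, agent_l))
    expert_term = np.mean([log_p_false_expert(f, l)
                           for f, l in zip(expert_f, expert_l)])
    return agent_term + expert_term
```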
In our implementation, the holes in the designed sketches are all unknown reals, i.e., $H = \mathbb{R}^{|?_e|}$. Rather than sampling from L directly, we use a neural network qϕ to specify the mean and variance of a |?e|-dimensional multivariate Gaussian distribution over H. We use qϕ(l) to denote the probability of this Gaussian distribution producing an h ∈ H such that l = e[h/?e]. The mean of the Gaussian corresponds to the most likely l*, which we use to train a neural network policy πφ as the agent policy πA in (11). To calculate the gradient of $J_{gen}$, we use the logarithmic trick (Peters and Schaal 2008), i.e., $\nabla_{\phi_n}\mathbb{E}_{l\sim q_{\phi_n}}[\,\cdot\,] \approx \frac{1}{K}\sum_{k=1}^{K}\nabla_{\phi_n}\log q_{\phi_n}(l_k)\,[\,\cdot\,]$. We note that $J_c(q)$ is $-\infty$ once the support of q contains an l that violates c. Since the support of the Gaussian specified by qϕ is the whole real space, $J_c(q_\phi)$ would always be $-\infty$. Hence, we relax $J_c$ by using a ReLU function and a cross-entropy loss to only penalize qϕ if l violates c. We also train a neural reward function fθ as the f in (11) by maximizing an adversarial objective $J_{adv}(f_\theta)$ that differs from the generative objective $J_{gen}$ by changing $1_A, 0_E$ into $0_A, 1_E$, such that $p(0_A|\tau_A; l, f_\theta) := \prod_{t=0}^{T}\frac{\exp([[l]](\tau_A)[t])}{\exp(f_\theta(s_A^{(t)}, a_A^{(t)})) + \exp([[l]](\tau_A)[t])}$ and $p(1_E|\tau_E; l, f_\theta) := \prod_{t=0}^{T}\frac{\exp(f_\theta(s_E^{(t)}, a_E^{(t)}))}{\exp(f_\theta(s_E^{(t)}, a_E^{(t)})) + \exp([[l]](\tau_E)[t])}$. The algorithm is summarized in Algorithm 1 and depicted in Fig. 2. In a nutshell, the algorithm iteratively samples trajectories with πφ, trains πφ with l*, and then updates fθ and qϕ respectively.

Algorithm 1: Generative Adversarial PRDBE
Input: PRDBE tuple ⟨e, ?e, H, c, E⟩, iteration number N, sample numbers m, K, optimization parameters α, β
Output: l*, π*
1: Initialization: program space L = {e[h/?e] | h ∈ H}; an agent policy πφ0; a neural reward function fθ0; a sampler qϕ0 : L → [0, 1]
2: while iteration number n = 0, 1, ..., N do
3:   Sample a trajectory set {τA,i}, i = 1, ..., m, by using policy πφn
4:   Calculate rewards {[[l*n]](τA,i)} with the most likely program l*n = argmax_l qϕn(l)
5:   Update φn to φn+1 using a policy learning algorithm, e.g. PPO (Schulman et al. 2017)
6:   Sample K programs {lk}, k = 1, ..., K, by using qϕn
7:   Calculate rewards {[[lk]](τA,i)} for all k and i
8:   Update θn+1 ← θn + α ∇θ Jadv(fθn)
9:   Update ϕn+1 ← ϕn + β ∇ϕ ELBO(qϕn)
10: end while
11: return l* := argmax_l qϕN(l) and π* := πφN

Figure 2: Flowchart for our learning framework.
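Line 9 of Algorithm 1 updates the sampler with a score-function (logarithmic-trick) gradient. The sketch below is a minimal stand-in for the neural sampler qϕ: a diagonal Gaussian over hole assignments whose mean is updated with REINFORCE-style gradients. The fixed variance, learning rate, and placeholder scores are assumptions made for illustration.

```python
import numpy as np

class GaussianProgramSampler:
    """Diagonal Gaussian over hole assignments h in R^{|?e|} (illustrative stand-in for q_phi)."""

    def __init__(self, dim, lr=1e-2):
        self.mu = np.zeros(dim)          # mean = most likely hole assignment
        self.log_std = np.zeros(dim)     # kept fixed here for simplicity
        self.lr = lr

    def sample(self, k):
        std = np.exp(self.log_std)
        return self.mu + std * np.random.randn(k, len(self.mu))

    def grad_log_prob(self, h):
        """Score function w.r.t. the mean: d/d_mu log q(h) = (h - mu) / std^2."""
        return (h - self.mu) / np.exp(2 * self.log_std)

    def reinforce_update(self, holes, scores):
        """mu <- mu + lr * (1/K) * sum_k score_k * grad log q(h_k)."""
        grads = np.array([s * self.grad_log_prob(h) for h, s in zip(holes, scores)])
        self.mu += self.lr * grads.mean(axis=0)

# Illustrative use: scores[k] would be the bracketed ELBO term evaluated on program e[h_k/?e].
sampler = GaussianProgramSampler(dim=5)
hs = sampler.sample(k=8)
scores = np.random.randn(8)              # placeholder for per-program objective values
sampler.reinforce_update(hs, scores)
print(sampler.mu)
```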
Experiments

In this section, we experimentally investigate: A. Performance: whether Algorithm 1 can efficiently train an agent policy to attain high performance; B. Example Efficiency: whether Algorithm 1 can work with only a few demonstrated trajectories; C. Generalization: whether the programs learned via Algorithm 1 in one environment can generalize across different environments of the same task.

Benchmark. We select from the MiniGrid environments (Chevalier-Boisvert, Willems, and Pal 2018) three challenging RL tasks in ascending order of difficulty. The basic setup and the default reward function for the DoorKey task have been described in Example 1. KeyCorridor, shown in Fig. 3e, and ObstructedMaze, shown in Fig. 3i, both require the agent to travel from room to room to find the key(s) to pick up a colored ball in some locked room. In KeyCorridor, all but one room are unlocked. In ObstructedMaze, most of the rooms are locked and obstructed by green balls, and the keys are hidden in grey boxes which the agent must open first. Note that the agent can only carry one object at a time, which makes picking up and dropping the same objects multiple times almost inevitable. The environments can vary in size by changing the number of rooms and tiles (e.g. DoorKey-8x8 vs. DoorKey-16x16). The placements of the objects and doors are randomized in each instance of an environment (e.g. ObstructedMaze-Full).

Baselines. We compare Algorithm 1 with the IRL algorithms GAN-GCL (Fu, Luo, and Levine 2018) and GAIL (Ho and Ermon 2016) to answer question A. We use PPO (Schulman et al. 2017) and AGAC (Flet-Berliac et al. 2021) for RL training in line 5 of Algorithm 1. We answer question B by varying the number of demonstrated trajectories when running Algorithm 1. We answer question C by using the programmatic reward functions learned via Algorithm 1 in small environments to train RL agents in larger environments (the results are annotated with AGAC/PPO prog). In all three tasks, we also provide the results of running the aforementioned RL algorithms, as well as an intrinsic-reward augmented RL algorithm, RIDE (Raileanu and Rocktäschel 2020), with the default rewards. The sketches and symbolic constraints are in a similar form to those described in Example 1. We implement an LSTM reward function fθ for Algorithm 1 and the IRL baselines, an MLP qϕ, and both a CNN version and an LSTM version of the actor-critic RL agent πφ, but only report the version with the higher performance.

Figure 3: (a) DoorKey-16x16; (b) DoorKey-8x8-v0; (c) DoorKey-8x8-v0; (d) DoorKey-16x16-v0; (e) KeyCorridorS6R3; (f) KeyCorridorS3R3; (g) KeyCorridorS3R3; (h) KeyCorridorS4/6R3; (i) ObstructedMaze-Full; (j) ObstructedMaze-2Dlhb; (k) ObstructedMaze-2Dlhb; (l) ObstructedMaze-Full. Frames are the number of interactions with the environment. Average return is the average default reward achieved over a series of episodes and is no larger than 1. "Alg1 w/ AGAC/PPO" indicates using AGAC or PPO as the policy learning algorithm in line 5 of Algorithm 1. "AGAC/PPO prog" indicates training an AGAC or PPO agent with the reward provided by a programmatic reward function prog. CNN and LSTM indicate the structures of the actor-critic networks.

DoorKey. We use 10 example trajectories demonstrated in a DoorKey-8x8 environment. The PRDBE problem and its solution are both annotated as prog. In Fig. 3b, running Algorithm 1 with PPO and with AGAC in line 5 respectively produces a high-performance policy with fewer frames than training PPO or AGAC with the default reward. On the other hand, RIDE, GAN-GCL and GAIL all fail with close-to-zero returns. In Fig. 3c we reduce the number of examples from 10 to 1, and this does not affect the number of frames that Algorithm 1 needs to produce a policy with an average return of at least 0.8, regardless of whether PPO or AGAC is used in line 5. This shows that Algorithm 1 is example efficient. In Fig. 3d, we use the learned program to train PPO and AGAC agents in a larger DoorKey environment and achieve high performance with fewer frames than training PPO, AGAC or RIDE with the default reward. This shows that the learned programmatic reward generalizes well across environments.

KeyCorridor. We use 10 example trajectories demonstrated in a 6x6 KeyCorridorS3R3 environment. In Fig. 3f, by using PPO and AGAC in line 5 of Algorithm 1, we respectively obtain high performance with significantly fewer frames than by training AGAC with the default reward. We omit GAIL and GAN-GCL because both fail in this task. As shown in Fig. 3g, reducing the number of examples (to 1) does not affect the performance of Algorithm 1. In Fig. 3h, we use the learned program to train AGAC agents in two larger environments, 10x10 KeyCorridorS4R3 and 16x16 KeyCorridorS6R3, and achieve high performance with fewer frames than training AGAC with the default reward. Note that when using the default reward, AGAC has the prior SOTA performance for this task.

ObstructedMaze. We use 10 example trajectories demonstrated in a two-room ObstructedMaze-2Dlhb environment. When training the PPO agent, we discount PPO by episodic state-visitation counts as in many exploration-driven approaches, including AGAC. In Fig. 3j we show that Algorithm 1 can produce high-performance policies with fewer frames than AGAC trained with the default reward. Since AGAC (CNN) is the SOTA for this task, we do not show the results of other methods.
In Fig. 3l, we use the learned program to train AGAC agents in a larger environment with as many as 9 rooms, ObstructedMaze-Full, and achieve higher performance with fewer frames than other methods trained with the default reward. Regarding the number of demonstrations, as shown in Fig. 3k, the performance of Algorithm 1 does not change much when the number is decreased to 6, but it drops drastically when the number is further decreased, possibly due to the higher complexity of this task.

Conclusion

We propose a novel paradigm for using programs to specify the reward functions in RL environments. We have developed a framework to complete a reward program sketch by learning from expert demonstrations. We experimentally validate our approach on challenging benchmarks and by comparing with SOTA baselines. Our future work will focus on reducing human effort in the sketch creation process.

References

Abbeel, P.; and Ng, A. Y. 2004. Apprenticeship Learning via Inverse Reinforcement Learning. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML '04, 1. New York, NY, USA: ACM. ISBN 1-58113-838-5.
Alshiekh, M.; Bloem, R.; Ehlers, R.; Könighofer, B.; Niekum, S.; and Topcu, U. 2018. Safe reinforcement learning via shielding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P. F.; Schulman, J.; and Mané, D. 2016. Concrete Problems in AI Safety. arXiv:1606.06565.
Andre, D.; and Russell, S. J. 2001. Programmable reinforcement learning agents. In Advances in Neural Information Processing Systems, 1019–1025.
Andre, D.; and Russell, S. J. 2002. State abstraction for programmable reinforcement learning agents. In AAAI/IAAI, 119–125.
Baier, C.; and Katoen, J.-P. 2008. Principles of Model Checking (Representation and Mind Series). The MIT Press. ISBN 026202649X.
Bellemare, M.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; and Munos, R. 2016. Unifying count-based exploration and intrinsic motivation. Advances in Neural Information Processing Systems, 29: 1471–1479.
Bugallo, M. F.; Elvira, V.; Martino, L.; Luengo, D.; Miguez, J.; and Djuric, P. M. 2017. Adaptive Importance Sampling: The past, the present, and the future. IEEE Signal Processing Magazine, 34(4): 60–79.
Camacho, A.; Toro Icarte, R.; Klassen, T. Q.; Valenzano, R.; and McIlraith, S. A. 2019. LTL and Beyond: Formal Languages for Reward Function Specification in Reinforcement Learning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 6065–6073. International Joint Conferences on Artificial Intelligence Organization.
Chevalier-Boisvert, M.; Willems, L.; and Pal, S. 2018. Minimalistic Gridworld Environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid. Accessed: 2022-04-25.
Ellis, K.; Solar-Lezama, A.; and Tenenbaum, J. 2015. Unsupervised Learning by Program Synthesis. In Cortes, C.; Lawrence, N. D.; Lee, D. D.; Sugiyama, M.; and Garnett, R., eds., Advances in Neural Information Processing Systems 28, 973–981. Curran Associates, Inc.
Ellis, K.; Wong, C.; Nye, M.; Sablé-Meyer, M.; Morales, L.; Hewitt, L.; Cary, L.; Solar-Lezama, A.; and Tenenbaum, J. B. 2021. DreamCoder: bootstrapping inductive program synthesis with wake-sleep library learning. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 835–850.
Finn, C.; Christiano, P. F.; Abbeel, P.; and Levine, S. 2016. A Connection between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models. arXiv:1611.03852.
Finn, C.; Levine, S.; and Abbeel, P. 2016. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, 49–58. PMLR.
Flet-Berliac, Y.; Ferret, J.; Pietquin, O.; Preux, P.; and Geist, M. 2021. Adversarially Guided Actor-Critic. In International Conference on Learning Representations.
Fu, J.; Luo, K.; and Levine, S. 2018. Learning Robust Rewards with Adversarial Inverse Reinforcement Learning. In International Conference on Learning Representations.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.
Gulwani, S.; and Jain, P. 2017. Programming by Examples: PL meets ML. Springer.
Hadfield-Menell, D.; Milli, S.; Abbeel, P.; Russell, S. J.; and Dragan, A. 2017. Inverse Reward Design. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Hasanbeig, M.; Abate, A.; and Kroening, D. 2019. Logically-constrained neural fitted q-iteration. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2012–2014. International Foundation for Autonomous Agents and Multiagent Systems.
Ho, J.; and Ermon, S. 2016. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, 4565–4573.
Icarte, R. T.; Klassen, T.; Valenzano, R.; and McIlraith, S. 2018. Using Reward Machines for High-Level Task Specification and Decomposition in Reinforcement Learning. In Dy, J.; and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 2107–2116. PMLR.
Icarte, R. T.; Klassen, T. Q.; Valenzano, R.; and McIlraith, S. A. 2022. Reward machines: Exploiting reward function structure in reinforcement learning. Journal of Artificial Intelligence Research, 73: 173–208.
Itzhaky, S.; Gulwani, S.; Immerman, N.; and Sagiv, M. 2010. A Simple Inductive Synthesis Methodology and its Applications. In OOPSLA/SPLASH '10, October 17-21, 2010, Reno/Tahoe, Nevada, USA.
Jeon, W.; Seo, S.; and Kim, K.-E. 2018. A Bayesian Approach to Generative Adversarial Imitation Learning. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.
Levine, S. 2018. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review.
Li, X.; Vasile, C.-I.; and Belta, C. 2017. Reinforcement learning with temporal logic rewards. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 3834–3839. IEEE.
Menon, A. K.; Tamuz, O.; Gulwani, S.; Lampson, B.; and Kalai, A. T. 2013. A Machine Learning Framework for Programming by Example. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML '13, I-187–I-195. JMLR.org.
Ng, A. Y.; Harada, D.; and Russell, S. J. 1999. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML '99, 278–287. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. ISBN 1-55860-612-2.
Ng, A. Y.; and Russell, S. J. 2000. Algorithms for Inverse Reinforcement Learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, 663–670. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. ISBN 1-55860-707-2.
Parisotto, E.; Mohamed, A.; Singh, R.; Li, L.; Zhou, D.; and Kohli, P. 2016. Neuro-Symbolic Program Synthesis. arXiv:1611.01855.
Pathak, D.; Agrawal, P.; Efros, A. A.; and Darrell, T. 2017. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, 2778–2787. PMLR.
Peters, J.; and Schaal, S. 2008. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4): 682–697.
Raileanu, R.; and Rocktäschel, T. 2020. RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments. In International Conference on Learning Representations.
Ramachandran, D.; and Amir, E. 2007. Bayesian inverse reinforcement learning. Urbana, 51(61801): 1–4.
Ratliff, N. D.; Bagnell, J. A.; and Zinkevich, M. A. 2006. Maximum Margin Planning. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, 729–736. New York, NY, USA: ACM. ISBN 1-59593-383-2.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms. arXiv:1707.06347.
Solar-Lezama, A. 2008. Program synthesis by sketching. University of California, Berkeley.
Tian, L.; Ellis, K.; Kryven, M.; and Tenenbaum, J. 2020. Learning abstract structure for drawing by efficient motor program induction. Advances in Neural Information Processing Systems, 33.
Verma, A.; Murali, V.; Singh, R.; Kohli, P.; and Chaudhuri, S. 2018. Programmatically interpretable reinforcement learning. In International Conference on Machine Learning, 5045–5054. PMLR.
Yang, Y.; Inala, J. P.; Bastani, O.; Pu, Y.; Solar-Lezama, A.; and Rinard, M. 2021. Program Synthesis Guided Reinforcement Learning. arXiv:2102.11137.
Zhou, W.; and Li, W. 2018. Safety-aware apprenticeship learning. In International Conference on Computer Aided Verification, 662–680. Springer.
Zhu, H.; Xiong, Z.; Magill, S.; and Jagannathan, S. 2019. An inductive synthesis framework for verifiable reinforcement learning. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, 686–701.
Ziebart, B. D.; Maas, A.; Bagnell, J. A.; and Dey, A. K. 2008. Maximum Entropy Inverse Reinforcement Learning. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, AAAI '08, 1433–1438. AAAI Press. ISBN 978-1-57735-368-3.