# SYNTHESIZING PROGRAMMATIC POLICIES THAT INDUCTIVELY GENERALIZE

Published as a conference paper at ICLR 2020

Jeevana Priya Inala, MIT CSAIL, jinala@csail.mit.edu
Osbert Bastani, University of Pennsylvania, obastani@seas.upenn.edu
Zenna Tavares, MIT CSAIL, zenna@mit.edu
Armando Solar-Lezama, MIT CSAIL, asolar@csail.mit.edu

ABSTRACT

Deep reinforcement learning has successfully solved a number of challenging control tasks. However, learned policies typically have difficulty generalizing to novel environments. We propose an algorithm for learning programmatic state machine policies that can capture repeating behaviors. By doing so, they have the ability to generalize to instances requiring an arbitrary number of repetitions, a property we call inductive generalization. However, state machine policies are hard to learn since they consist of a combination of continuous and discrete structures. We propose a learning framework called adaptive teaching, which learns a state machine policy by imitating a teacher; in contrast to traditional imitation learning, our teacher adaptively updates itself based on the structure of the student. We show that our algorithm can be used to learn policies that inductively generalize to novel environments, whereas traditional neural network policies fail to do so.

1 INTRODUCTION

Existing deep reinforcement learning (RL) approaches have difficulty generalizing to novel environments (Packer et al., 2018; Rajeswaran et al., 2017). More specifically, consider a task that requires performing a repeating behavior; we would like to be able to learn a policy that generalizes to instances requiring an arbitrary number of repetitions. We refer to this property as inductive generalization. In supervised learning, specialized neural network architectures have been proposed that exhibit inductive generalization on tasks such as list manipulation (Cai et al., 2017), but it is not obvious how those techniques would generalize to the control problems discussed in this paper. Alternatively, algorithms have been proposed for learning programmatic policies for control problems that generalize better than traditional neural network policies (Verma et al., 2019; 2018), but existing approaches have focused on simple stateless policies that make learning generalizable repetitive behaviors hard; e.g., a stateless program cannot internally keep track of the number of repetitions made so far and decide the next action based on that progress.

We propose an algorithm for learning programmatic state machine policies. Such a policy consists of a set of internal states, called modes, each of which is associated with a controller that is applied while in that mode. The policy also includes transition predicates that describe how the mode is updated. These policies are sufficiently expressive to capture tasks of interest, e.g., they can perform repeating tasks by cycling through some subset of modes during execution. Additionally, state machine policies are strongly biased towards policies that inductively generalize, an inductive bias that deep RL policies lack. In other words, this policy class is both realizable (i.e., it contains a policy that solves the problem for all environments) and identifiable (i.e., we can learn the right policy from limited data). However, state machine policies are challenging to learn because their discrete state transitions make it difficult to use gradient-based optimization.
One standard solution is to soften the state transitions by making them probabilistic. However, these techniques alone are insufficient; they still run into local optima due to the constraints on the structure of the policy function, as well as the relatively few parameters they possess.

Figure 1: Running example: retrieving an autonomous car from a tight parking spot. Panels (a), (b), and (c) show training scenarios (Train 1-3), and panel (e) shows the state machine based policy (false edges are dropped). The goal is to learn a state-machine policy (e) that is trained on scenarios (a), (b), and (c), and that generalizes to scenario (d).

To address this issue, we propose an approach called adaptive teaching that alternatingly learns a teacher and a student. The teacher is an over-parameterized version of the student; the student is a state-machine policy trained to mimic the teacher. Because the teacher is over-parameterized, it can be easily learned using model-based numerical optimization (but does not generalize as well as the student). Furthermore, our approach is different from traditional imitation learning (Schaal, 1999; Ross et al., 2011) since the teacher is regularized to favor strategies similar to the ones taken by the student, to ensure the student can successfully mimic the teacher. As the student improves, the teacher improves as well. This alternating optimization can naturally be derived within the framework of variational inference, where the teacher encodes the variational distribution (Wainwright et al., 2008).

We implement our algorithm and evaluate it on a set of reinforcement learning problems focused on tasks that require inductive generalization. We show that traditional deep RL approaches perform well on the original task, but fail to generalize inductively, whereas our state machine policies successfully generalize beyond the training distribution. We emphasize that we do not focus on problems that require large state machines, which is a qualitatively different problem from ours and would require different algorithms to solve. We believe that state machines are most useful when only a few modes are required. In particular, we are interested in problems where a relatively simple behavior must be repeated a certain number of times to solve the given task. The key premise behind our approach, as shown by our evaluation, is that, in these cases, compact state machines can represent policies that both have good performance and are generalizable. In fact, our algorithm solves all of our benchmarks using state-machine policies with at most 4 modes. When many modes are needed, the number of possible transition structures grows exponentially, making it difficult to learn the right structure without an exponential amount of training data.

Example. Consider the autonomous car in Figure 1, which consists of a blue car (the agent) parked between two stationary black cars. The system state is $(x, y, \theta, d)$, where $(x, y)$ is the center of the car, $\theta$ is the orientation, and $d$ is the distance between the two black cars. The actions are $(v, \psi)$, where $v$ is velocity and $\psi$ is steering angle (we consider velocity control since the speed is low). The dynamics are standard bicycle dynamics. The goal is to drive out of the parked spot to an adjacent lane while avoiding collisions. This task is easy when $d$ is large (Figure 1a). It is somewhat more involved when $d$ is small, since it requires multiple maneuvers (Figures 1b and 1c). However, it becomes challenging when $d$ is very small (Figure 1d). A standard RL algorithm will train a policy that performs well on the distances seen during training but does not generalize to smaller distances. In contrast, our goal is to train an agent on scenarios (a), (b), and (c) that generalizes to scenario (d).

In Figure 1e, we show a state machine policy synthesized by our algorithm for this task. We use $d_f$ and $d_b$ to denote the distances between the agent and the front and back black cars, respectively. This policy has three different modes (besides a start mode $m_s$ and an end mode $m_e$). Roughly speaking, this policy says (i) immediately shift from mode $m_s$ to $m_1$, and drive the car forward and to the left, (ii) continue until close to the car in front; then, transition to mode $m_2$, and drive the car backwards and to the right, (iii) continue until close to the car behind; then, transition back to mode $m_1$, (iv) iterate between $m_1$ and $m_2$ until the car can safely exit the parking spot; then, transition to mode $m_3$, and drive forward and to the right to make the car parallel to the lane. This policy inductively generalizes since it captures the iterative behavior of driving forward and then backward until exiting the parking spot. Thus, it successfully solves the scenario in Figure 1d.
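To make the structure of such a policy concrete, below is a minimal, illustrative Python sketch of the kind of state-machine policy shown in Figure 1e. The mode structure mirrors the example, but the observation layout, constant actions, and switching thresholds are hypothetical placeholders, not the learned parameters.

```python
# Hypothetical observation: (x, y, theta, d_f, d_b), where d_f and d_b are the
# gaps to the front and back cars; modes follow Figure 1e.
START, M1, M2, M3, END = "ms", "m1", "m2", "m3", "me"

def step(obs, mode):
    """One step of the state-machine policy: (observation, mode) -> (action, next_mode)."""
    x, y, theta, d_f, d_b = obs
    # Constant action function per mode: (velocity, steering angle).
    actions = {START: (0.0, 0.0), M1: (1.0, 0.35), M2: (-1.0, -0.35),
               M3: (1.0, -0.35), END: (0.0, 0.0)}
    action = actions[mode]
    # Switching conditions (hypothetical thresholds).
    if mode == START:
        mode = M1                        # immediately start driving forward-left
    elif mode == M1 and theta > 0.3:
        mode = M3                        # enough clearance to exit the spot
    elif mode == M1 and d_f < 0.2:
        mode = M2                        # too close to the front car: back up
    elif mode == M2 and d_b < 0.2:
        mode = M1                        # too close to the back car: go forward
    elif mode == M3 and abs(theta) < 0.05:
        mode = END                       # parallel to the lane: terminate
    return action, mode
```

Iterating between M1 and M2 is exactly the repeating maneuver that lets such a policy scale to arbitrarily tight parking spots.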
Related work. There has been growing interest in using program synthesis to aid machine learning (Lake et al., 2015; Ellis et al., 2015; 2018; Valkov et al., 2018; Young et al., 2019). Our work is most closely related to recent work using imitation learning to learn programmatic policies (Verma et al., 2018; Bastani et al., 2018; Zhu et al., 2019; Verma et al., 2019). These approaches use a neural network policy as the teacher. However, they are focused on learning stateless policies; hence, they use a supervised dataset of state-action pairs from the teacher and a domain-specific program synthesizer to learn programmatic policies. Building such a synthesizer for state machine policies is challenging since they contain both discrete and continuous parameters and internal state. The student in our algorithm needs to learn the state-machine policy from entire trajectory traces in order to learn the internal state. In particular, each trajectory trace not only consists of the sequence of states and actions from the initial state to the goal state visited by the teacher, but also encodes which states correspond to mode changes for the teacher. In the teacher's iteration, the teacher's mode changes are regularized to align more closely with the possible student mode changes. As a consequence, in the student's iteration, it is easier for the student to mimic the teacher's mode changes. Leveraging this connection between the teacher structure and the student structure is critical for us to be able to learn state-machine policies. Additionally, with the exception of (Verma et al., 2019), for the other approaches, there is no feedback from the student to the teacher.

State machines have been previously used to represent policies that have internal state (typically called memory). To learn these policies, gradient ascent methods assume a fixed structure and optimize over real-valued parameters (Meuleau et al., 1999; Peshkin et al., 2001; Aberdeen & Baxter, 2002), whereas policy iteration methods use dynamic programming to extend the structure (Hansen, 1998).
Our method combines both, but similarly to Poupart & Boutilier (2004), the structure space is bounded. In addition, programmatic state machines use programs to represent state transitions and action rules, and as a result can perform well while remaining small in size. Hierarchies of Abstract Machines (HAMs) also use programmatic state machines for hierarchical reinforcement learning, but assume a fixed, hand-designed structure (Parr & Russell, 1998; Andre & Russell, 2002). Our inductive generalization goal is related to that of meta-learning (Finn et al., 2017); however, whereas meta-learning trains on a few examples from the novel environment, our goal is to generalize without additional training. Our work is also related to guided policy search, which uses a teacher in the form of a trajectory optimizer to train a neural network student (Levine & Koltun, 2013). However, training programmatic policies is more challenging since the teacher must mirror the structure of the student. Finally, it has recently been shown that over-parameterization is essential in helping neural networks avoid local minima (Allen-Zhu et al., 2019). Relaxing optimization problems by adding more parameters is a well established technique; in many cases, re-parameterization can make difficult non-convex problems efficiently solvable (Carlone & Calafiore, 2018).

2 PROBLEM FORMULATION

Dynamics. We are interested in synthesizing control policies for deterministic, continuous-time dynamical systems with continuous state and action spaces. In particular, we consider partially observable Markov decision processes (POMDPs) $\langle X, A, O, F, Z, X_0, \phi_S, \phi_G \rangle$ with states $X \subseteq \mathbb{R}^{d_X}$, actions $A \subseteq \mathbb{R}^{d_A}$, observations $O \subseteq \mathbb{R}^{d_O}$, deterministic dynamics $F : X \times A \to X$ (i.e., $\dot{x} = F(x, a)$), deterministic observation function $Z : X \to O$, and initial state distribution $x_0 \sim X_0$. We consider a safety specification $\phi_S : X \to \mathbb{R}$ and a goal specification $\phi_G : X \to \mathbb{R}$. Then, the agent aims to reach a goal state $\phi_G(x) \le 0$ while staying in safe states $\phi_S(x) \le 0$. A positive value of $\phi_S(x)$ (resp., $\phi_G(x)$) quantifies the degree to which $x$ is unsafe (resp., away from the goal).

Policies. We consider policies with internal memory $\pi : O \times S \to A \times S$, where $S \subseteq \mathbb{R}^{d_S}$ is the set of internal states; we assume the memory is initialized to a constant $s_0$. Given such a policy $\pi$, we sample a rollout (or trajectory) $\tau = (x_0, x_1, \dots, x_N)$ with horizon $N \in \mathbb{N}$ by sampling $x_0 \sim X_0$ and then performing a discrete-time simulation $x_{n+1} = x_n + F(x_n, a_n) \cdot \Delta$, where $(a_n, s_{n+1}) = \pi(Z(x_n), s_n)$ and $\Delta \in \mathbb{R}_{>0}$ is the time increment. Since $F$, $Z$, and $\pi$ are deterministic, $\tau$ is fully determined by $x_0$ and $\pi$; $\tau$ can also be represented as a list of actions combined with the initial state, i.e., $\tau = \langle x_0, (a_0, a_1, \dots, a_N) \rangle$. The degree to which $\phi_S$ and $\phi_G$ are satisfied along a trajectory is quantified by a reward function $R(\pi, x_0) = -\phi_G(x_N)^+ - \sum_{n=0}^{N} \phi_S(x_n)^+$, where $x^+ = \max(0, x)$. The optimal policy $\pi^*$ in some policy class $\Pi$ is one which maximizes the expected reward $\mathbb{E}_{x_0 \sim X_0}[R(\pi, x_0)]$.

Inductive generalization. Beyond optimizing reward, we want a policy that inductively generalizes to unseen environments. Formally, we consider two initial state distributions: a training distribution $X_0^{\text{train}}$, and a test distribution $X_0^{\text{test}}$ that includes extreme states never encountered during training. Then, the goal is to train a policy according to $X_0^{\text{train}}$, i.e.,

$$\pi^* = \arg\max_{\pi \in \Pi} \mathbb{E}_{x_0 \sim X_0^{\text{train}}}[R(\pi, x_0)], \qquad (1)$$

but measure its performance according to $X_0^{\text{test}}$, i.e., $\mathbb{E}_{x_0 \sim X_0^{\text{test}}}[R(\pi^*, x_0)]$.
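The rollout and reward above translate directly into code. The sketch below is a minimal version assuming generic callables for the dynamics F, observation function Z, specifications phi_S and phi_G, and a memory-carrying policy; all names are placeholders rather than the paper's implementation.

```python
import numpy as np

def rollout_reward(pi, x0, F, Z, phi_S, phi_G, s0, N=200, dt=0.05):
    """Simulate the discrete-time rollout of Section 2 and return its reward."""
    x, s = np.asarray(x0, dtype=float), s0
    safety_penalty = 0.0
    for _ in range(N):
        a, s = pi(Z(x), s)                    # policy with internal memory
        safety_penalty += max(0.0, phi_S(x))  # accumulate phi_S(x)^+ violations
        x = x + F(x, a) * dt                  # Euler step with time increment dt
    safety_penalty += max(0.0, phi_S(x))      # include the final state x_N
    # Reward penalizes the final goal error and the accumulated safety violations.
    return -max(0.0, phi_G(x)) - safety_penalty
```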
3 PROGRAMMATIC STATE MACHINE POLICIES

To achieve inductive generalization, we aim to synthesize programmatic policies in the form of state machines. At a high level, state machines can be thought of as compositions of much simpler policies, where the internal state of the state machine (called its mode) indicates which simple policy is currently being used. Thus, state machines are capable of encoding complex nonlinear control tasks such as iteratively repeating a complex sequence of actions (e.g., the car example in Figure 1). At the same time, state machines are substantially more structured than more typical policy classes such as neural networks and decision trees.

More precisely, a state machine $\pi$ is a tuple $\langle M, H, G, m_s, m_e \rangle$. The modes $m_i \in M$ of $\pi$ are the internal memory of the state machine. Each mode $m_i \in M$ corresponds to an action function $H_{m_i} \in H$, which is a function $H_{m_i} : O \to A$ mapping observations to actions. When in mode $m_i$, the agent takes action $a = H_{m_i}(o)$. Furthermore, each pair of modes $(m_i, m_j)$ corresponds to a switching condition $G_{m_i}^{m_j} \in G$, which is a function $G_{m_i}^{m_j} : O \to \mathbb{R}$. When an agent in mode $m_i$ observes $o$ such that $G_{m_i}^{m_j}(o) \ge 0$, then the agent transitions from mode $m_i$ to mode $m_j$. If there are multiple modes $m_j$ with non-negative switching weight $G_{m_i}^{m_j}(o) \ge 0$, then the agent transitions to the one that is greatest in magnitude; if there are several modes of equal weight, we take the first one according to a fixed ordering. Finally, $m_s, m_e \in M$ are the start and end modes, respectively; the state machine mode is initialized to $m_s$, and the state machine terminates when it transitions to $m_e$. Formally, $\pi(o_n, s_n) = (a_n, s_{n+1})$, where $a_n = H_{s_n}(o_n)$, $s_0 = m_s$, and

$$s_{n+1} = \begin{cases} m^* = \mathrm{dargmax}_m\, G_{s_n}^{m}(o_n) & \text{if } G_{s_n}^{m^*}(o_n) \ge 0 \\ s_n & \text{otherwise} \end{cases} \qquad (2)$$

where dargmax is a deterministic arg max that breaks ties as described above.

Action functions and switching conditions are specified by grammars that encode the space of possible functions as a space of programs. Different grammars can be used for different problems. Typical grammars for action functions include constants $\{C_\alpha : o \mapsto \alpha\}$ and proportional controls $\{P^i_{\alpha_0,\alpha_1} : o \mapsto \alpha_0(o[i] - \alpha_1)\}$. A typical grammar for switching conditions is the grammar $B ::= \{o[i] \le \alpha\}_i \mid \{o[i] \ge \alpha\}_i \mid B_1 \wedge B_2 \mid B_1 \vee B_2$ of Boolean predicates over the current observation $o$, where $o[i]$ is the $i$th component of $o$. In all these grammars, $\alpha_i \in \mathbb{R}$ are parameters to be learned. The grammar for switching conditions also has discrete parameters encoding the choice of expression. For example, in Figure 1, the action functions are constants, and the switching conditions are inequalities over components of $o$.
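For concreteness, the following is a minimal sketch of the deterministic semantics in Eq. (2), with action functions and switching conditions stored as plain Python callables. The class layout and names are illustrative assumptions, not the paper's implementation.

```python
class StateMachinePolicy:
    def __init__(self, H, G, modes, m_start, m_end):
        self.H = H                    # dict: mode -> action function H_m(o)
        self.G = G                    # dict: (m_i, m_j) -> switching condition G_{m_i}^{m_j}(o)
        self.modes = modes            # fixed ordering, used to break ties
        self.m_start, self.m_end = m_start, m_end

    def act(self, obs, mode):
        """One step of Eq. (2): returns (action, next mode)."""
        action = self.H[mode](obs)
        best_mode, best_val = mode, None
        for m in self.modes:          # fixed ordering => first of equal weights wins
            g = self.G.get((mode, m))
            if g is None:
                continue
            val = g(obs)
            if val >= 0 and (best_val is None or val > best_val):
                best_mode, best_val = m, val
        return action, best_mode      # execution starts at m_start and stops at m_end
```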
4 FRAMEWORK FOR SYNTHESIZING PROGRAMMATIC POLICIES

We now describe the adaptive teaching framework for synthesizing state machine policies. In this section, the teacher is abstractly represented as a collection of trajectories $\tau_{x_0}$ (i.e., an open-loop controller consisting of a fixed sequence of actions) for each initial state $x_0$. A key insight is that we can parameterize $\tau_{x_0}$ in a way that mirrors the structure of the state machine student. As we discuss in Section 4.2, we parameterize $\tau_{x_0}$ as a loop-free state machine. Intuitively, our algorithm efficiently computes $\tau_{x_0}$ (from multiple initial states $x_0$) using gradient-based optimization, and then glues them together using maximum likelihood to construct a state machine policy.

4.1 ADAPTIVE TEACHING VIA VARIATIONAL INFERENCE

We derive the adaptive teaching formulation by reformulating the learning problem in the framework of probabilistic reinforcement learning, and also consider policies $\pi$ that are probabilistic state machines (see Section 4.3). Then, we use a variational approach to break the problem into the teacher and the student steps. In this approach, the log-likelihood of a policy $\pi$ is defined as follows:

$$\ell(\pi) = \log \mathbb{E}_{p(\tau \mid \pi)}\big[e^{\lambda R(\tau)}\big] \qquad (3)$$

where $p(\tau \mid \pi)$ is the probability of sampling rollout $\tau$ when using policy $\pi$ from a random initial state $x_0$, $\lambda \in \mathbb{R}_{\ge 0}$ is a hyperparameter, and $R(\tau)$ is the reward assigned to $\tau$. We have

$$\ell(\pi) = \log \mathbb{E}_{q(\tau)}\left[\frac{e^{\lambda R(\tau)}\, p(\tau \mid \pi)}{q(\tau)}\right] \ge \mathbb{E}_{q(\tau)}\big[\lambda R(\tau) + \log p(\tau \mid \pi) - \log q(\tau)\big] \qquad (4)$$

where $q(\tau)$ is the variational distribution and the inequality follows from Jensen's inequality. Thus, we can optimize $\pi$ by maximizing the lower bound Eq (4) on $\ell(\pi)$. Since the first and third terms of Eq (4) are constant with respect to $\pi$, we have

$$\pi^* = \arg\max_{\pi} \mathbb{E}_{q(\tau)}[\log p(\tau \mid \pi)]. \qquad (5)$$

Next, the optimal choice for $q$ (i.e., to minimize the gap in the inequality in Eq (4)) is

$$q^* = \arg\min_{q} D_{KL}\big(q(\tau) \,\big\|\, e^{\lambda R(\tau)}\, p(\tau \mid \pi)/Z\big) \qquad (6)$$

where $Z$ is a normalizing constant. We choose $q$ to have the form $q(\tau) = p(x_0)\,\delta(\tau - \tau_{x_0})$, where $\delta$ is the Dirac delta function, $p(x_0)$ is the initial state distribution, and the $\tau_{x_0}$ are the parameters to be optimized, where $\tau_{x_0}$ encodes a trajectory from $x_0$. Then, up to constants, the objective of Eq (6) equals

$$\mathbb{E}_{p(x_0)}\Big[\log p(x_0) + \mathbb{E}_{\delta(\tau - \tau_{x_0})}[\log \delta(\tau - \tau_{x_0})] - \big(\lambda R(\tau_{x_0}) + \log p(\tau_{x_0} \mid \pi, x_0)\big)\Big].$$

The first term is constant; the second term is degenerate, but it is also constant. Thus, we have

$$q^* = \arg\max_{\{\tau_{x_0}\}} \mathbb{E}_{p(x_0)}\big[\lambda R(\tau_{x_0}) + \log p(\tau_{x_0} \mid \pi, x_0)\big]. \qquad (7)$$

Thus, we can optimize Eq (3) by alternatingly optimizing Eq (5) and Eq (7).

We interpret these equations as adaptive teaching. At a high level, the teacher (i.e., the variational distribution $q$ in Eq (7)) is used to guide the optimization of the student (i.e., the state machine policy $\pi$ in Eq (5)). Rather than compute the teacher in closed form, we approximate it by sampling finitely many initial states $x_0^k \sim X_0$ and then computing the optimal rollout from $x_0^k$. Formally, on the $i$th iteration, the teacher and student are updated as follows:

$$\text{Teacher:} \quad q^i = \sum_{k=1}^{K} \delta(\tau_k^i), \quad \text{where } \tau_k^i = \arg\max_{\tau}\, \lambda R(\tau) + \log p(\tau \mid \pi^{i-1}, x_0^k) \quad (x_0^k \sim X_0) \qquad (8)$$

$$\text{Student:} \quad \pi^i = \arg\max_{\pi} \sum_{k=1}^{K} \log p(\tau_k^i \mid \pi, x_0^k) \qquad (9)$$

The teacher objective Eq (8) is to both maximize the reward $R(\tau)$ from a random initial state $x_0$ and to maximize the probability $p(\tau \mid \pi, x_0)$ of obtaining the rollout $\tau$ from initial state $x_0$ according to the current student $\pi$. The latter encourages the teacher to match the structure of the student. Furthermore, the teacher is itself updated at each step to account for the changing structure of the student. The student objective Eq (9) is to imitate the distribution of rollouts according to the teacher. Figure 2 shows the different components of our algorithm.

Figure 2: Flowchart connecting the different components of the algorithm. The teacher generates loop-free policies $\tau$ via model-based trajectory optimization; the student learns the best structured program policy by learning the mode mapping $p(\mu)$ (fixing $\tau$ and $\pi$), the best action functions $H_m$ (fixing $\tau$ and $p(\mu)$), and the best switching conditions $G_m^{m'}$ (fixing $\tau$ and $p(\mu)$).
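A schematic sketch of this alternation is shown below. It assumes two generic helpers, trajectory_opt (the model-based optimizer described in Section 4.2) and fit_state_machine (the maximum-likelihood student fit of Section 4.3); both names are placeholders.

```python
def adaptive_teaching(init_states, reward, log_p_traj, trajectory_opt,
                      fit_state_machine, student0, lam=100.0, iters=10):
    """Alternate the teacher update (Eq. 8) and the student update (Eq. 9)."""
    student = student0
    for _ in range(iters):
        # Teacher step: one optimized loop-free rollout per initial state,
        # regularized towards the current student via log p(tau | student, x0).
        taus = [trajectory_opt(lambda tau, x0=x0: lam * reward(tau, x0)
                               + log_p_traj(tau, student, x0), x0)
                for x0 in init_states]
        # Student step: fit the state-machine policy to the teacher rollouts.
        student = fit_state_machine(taus, init_states)
    return student
```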
4.2 TEACHER: COMPUTING LOOP-FREE POLICIES

We begin by describing how the teacher solves the trajectory optimization problem Eq (8), i.e., computing $\tau_k$ for a given initial state $x_0^k$.

Parameterization. One approach is to parameterize $\tau$ as an arbitrary action sequence $(a_0, a_1, \dots)$ and use gradient-based optimization to compute $\tau$. However, this approach can perform poorly: even though we regularize $\tau$ towards the student, it could exhibit behaviors that are hard for the student to capture. Instead, we parameterize $\tau$ in a way that mirrors the student. In particular, we parameterize $\tau$ like a state machine, but rather than having modes and switching conditions that adaptively determine the sequence of action functions to be executed and the duration of execution, the sequence of action functions is fixed and each action function is executed for a fixed duration. More precisely, we represent $\tau$ as a loop-free policy $\tau = \langle H, T \rangle$. To execute $\tau$, each action function $H_i \in H$ is applied for the corresponding duration $T_i \in T$, after which $H_{i+1}$ is applied. The action functions are from the same grammar of action functions as the student. The obvious way to represent a duration $T_i$ is as a number of time steps $T_i \in \mathbb{N}$. However, with this choice, we cannot use continuous optimization to optimize $T_i$. Instead, we fix the number of discretization steps $P$ for which $H_i$ is executed, and vary the time increment $\Delta_i = T_i / P$, i.e., $x_{n+1} = x_n + F(x_n, H_i(o_n)) \cdot \Delta_i$. We enforce $\Delta_i \le \Delta_{\max}$ for a small $\Delta_{\max}$ to ensure that the discrete-time approximation of the dynamics is sufficiently accurate. Figure 3(a) and (d) show examples of loop-free policies for two different initial states and two different teacher iterations. The loop-free policies in (d) are regularized to match the student's state-machine policy learned in the previous iteration (shown in Figure 3(c)).
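The loop-free parameterization above is straightforward to execute. The sketch below assumes the same generic dynamics F and observation function Z as in Section 2; P is the fixed number of discretization sub-steps per segment, and the names are illustrative.

```python
import numpy as np

def run_loop_free(H, T, x0, F, Z, P=20):
    """Execute a loop-free policy <H, T>: H[i] runs for duration T[i] over P sub-steps."""
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    for h, duration in zip(H, T):
        dt = duration / P               # continuous time increment we can optimize
        for _ in range(P):
            x = x + F(x, h(Z(x))) * dt  # Euler step with segment-specific increment
            traj.append(x.copy())
    return traj
```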
Figure 3: Visualization showing the student-teacher interaction for two iterations. (a) The loop-free policies (with their corresponding rewards) learned by the teacher for two different initial states. Here, the boxes signify the different segments in the loop-free policies, the colors signify different actions, and the lengths of the boxes signify the durations of the segments. (b) The mapping between the segments and the modes in the state machine, i.e., $p(\mu = m_j)$. Each box shows the composition of modes vertically distributed according to their probabilities. For example, the third segment in the loop-free policy for $x_0^1$ has $p(\mu = \text{Green}) = 0.65$ and $p(\mu = \text{Brown}) = 0.35$. (c) The most probable rollouts from the state-machine policy learned by the student. Finally, (d), (e), and (f) are similar to (a), (b), and (c), but for the second iteration.

Optimization. We use model-based trajectory optimization to compute loop-free policies. The main challenge is handling the term $p(\tau \mid \pi, x_0)$ in the objective. Symbolically computing this probability is hard because of the discrete-continuous structure of $\pi$. Another alternative is to precompute the probabilities of all the trajectories $\tau$ that can be derived from $\pi$. However, this is also infeasible because the number of trajectories is unbounded. Thus, we perform trajectory optimization in two phases. First, we use a sampling-based optimization algorithm to obtain a set of good trajectories $\tau^1, \dots, \tau^L$. Then, we apply gradient-based optimization, replacing $p(\cdot \mid \pi, x_0)$ with a term that regularizes $\tau$ to be close to $\{\tau^\ell\}_{\ell=1}^{L}$.

The first phase proceeds as follows: (i) sample $\tau^1, \dots, \tau^L$ using $\pi$ from $x_0$, and let $p_\ell$ be the probability of $\tau^\ell$ according to $\pi$, (ii) sort these samples in decreasing order of the objective $p_\ell \cdot e^{\lambda R(\tau^\ell)}$, and (iii) discard all but the top $\rho$ samples. This phase essentially performs one iteration of CEM (Mannor et al., 2003). Then, in the second phase, we replace the probability expression with

$$p(\tau \mid \pi, x_0) \approx \frac{\sum_{\ell=1}^{\rho} p_\ell\, e^{-d(\tau, \tau^\ell)}}{\sum_{\ell=1}^{\rho} p_\ell},$$

which we optimize using gradient-based optimization. Here, $d(\tau, \tau^\ell)$ is a distance metric between two loop-free policies, defined as the L2 distance between the parameters of $\tau$ and $\tau^\ell$. We chose the number of retained samples $\rho = 10$. For our benchmarks, we did not notice any improvement in the number of student-teacher iterations by increasing $\rho$ above 10. So, we believe we are not losing any information from this approximation.
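A minimal sketch of these two ingredients follows: one CEM-like selection step over rollouts sampled from the current student, and the smooth surrogate used in place of $\log p(\tau \mid \pi, x_0)$ during gradient-based optimization. The helper names (sample_rollout, params) and data layout are assumptions made for illustration.

```python
import numpy as np

def select_top_rollouts(sample_rollout, reward, lam=100.0, L=100, rho=10):
    """One CEM-like iteration: keep the top-rho rollouts by p * exp(lam * R)."""
    samples = [sample_rollout() for _ in range(L)]        # list of (tau, p_tau) pairs
    samples.sort(key=lambda s: s[1] * np.exp(lam * reward(s[0])), reverse=True)
    return samples[:rho]

def surrogate_log_prob(tau, top, params):
    """Smooth stand-in for log p(tau | pi, x0) built from the retained rollouts."""
    # d(tau, tau_l) is the L2 distance between loop-free policy parameters.
    num = sum(p * np.exp(-np.linalg.norm(params(tau) - params(t))) for t, p in top)
    den = sum(p for _, p in top)
    return float(np.log(num / den + 1e-12))
```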
4.3 STUDENT: LEARNING STRUCTURED STATE MACHINE POLICIES VIA IMITATION

Next, we describe how the student solves the maximum likelihood problem Eq (9) to compute $\pi^*$.

Probabilistic state machines. Although the output of our algorithm is a student policy that is a deterministic state machine, our algorithm internally relies on distributions over rollouts induced by the student policy to guide the teacher. Thus, we represent the student policy as a probabilistic state machine during learning. To do so, we simply make the action functions $\mathcal{H}_{m_j}$ and switching conditions $\mathcal{G}_{m_{j1}}^{m_{j2}}$ probabilistic: instead of constant parameters in the grammars for action functions and switching conditions, we now have Gaussian distributions $\mathcal{N}(\alpha, \sigma)$. Then, when executing $\pi$, we obtain i.i.d. samples of the parameters $H_{m_j} \sim \mathcal{H}_{m_j}$ and $G_{m_j}^{m_{j'}} \sim \mathcal{G}_{m_j}^{m_{j'}}$ for each $m_{j'}$ every time we switch to mode $m_j$, and act according to these sampled $H_{m_j}$ and $G_{m_j}^{m_{j'}}$ until the mode switches again. By re-sampling these parameters on every mode switch, we avoid dependencies across different parts of a rollout or different rollouts. On the other hand, by not re-sampling these parameters within a mode switch, we ensure that the structure of $\pi$ remains intact within a mode.

Optimization. Each $\tau_k$ can be decomposed into segments $(k, i)$ where action function $H_{k,i}$ is executed for duration $T_{k,i}$. For example, each block in Figure 3(a) is a segment. Furthermore, for the student $\pi$, let $\mathcal{H}_{m_j}$ be the action function distribution for mode $m_j$ and $\mathcal{G}_{m_{j1}}^{m_{j2}}$ be the switching condition distribution from mode $m_{j1}$ to mode $m_{j2}$. Note that $\mathcal{H}_{m_j}$ and $\mathcal{G}_{m_{j1}}^{m_{j2}}$ are distributions, whereas $H_{k,i}$ and $T_{k,i}$ are constants. We have

$$p(\tau_k \mid \pi, x_0^k) = \prod_i p(H_{k,i} \mid \pi, x_0^k)\, p(T_{k,i} \mid \pi, x_0^k).$$

For each $(k, i)$, let $\mu_{k,i}$ be the latent random variable indicating the $i$th mode used by $\pi$ starting from $x_0^k$; in particular, $\mu_{k,i}$ is a categorical random variable that takes values in the modes $\{m_j\}$. Here, $\mu_{k,i} = m_j$ means that $H_{k,i}$ is sampled from the distribution $\mathcal{H}_{m_j}$ and $T_{k,i}$ is determined by the switching conditions sampled from the distributions $\{\mathcal{G}_{m_j}^{m_{j'}}\}$. Introducing the latent variable $\mu_{k,i}$ allows the student to compute $\pi^*$ by computing $\mathcal{H}_{m_j}$ and $\mathcal{G}_{m_{j1}}^{m_{j2}}$ separately. In Figure 3, (b) and (e) show the learned mode mappings $p(\mu = m_j)$ for the segments in the loop-free policies shown in (a) and (d), respectively.

Since directly optimizing the maximum likelihood $\pi^*$ is hard in the presence of the latent variables $\mu_{k,i}$, we use the standard expectation maximization (EM) approach to optimizing $\pi$, where the E-step computes the distributions $p(\mu_{k,i} = m_j)$ assuming $\pi$ is fixed, and the M-step optimizes $\pi$ assuming the probabilities $p(\mu_{k,i} = m_j)$ are fixed. See Appendix A for details. In Figure 3, (c) and (f) show the most probable rollouts from the state-machine policies learned at the end of the EM approach for two different student iterations.

Figure 4: Comparison of performances on train (left) and test (middle) distributions. Our approach outperforms the baselines on all benchmarks in terms of test performance. An empty bar indicates that the policy learned for that experiment failed on all runs. We also plot test performance for different choices of training distribution for the Car benchmark (right).

5 EXPERIMENTS

Benchmarks. We use 6 control problems, each with different training and test distributions (summarized in Figure 8 in Appendix C): (i) Car, the benchmark in Figure 1, (ii) Quad, where the goal is to maneuver a 2D quadcopter through an obstacle course by controlling its vertical acceleration, where we vary the obstacle course length (see Figure 6, leftmost), (iii) Quad PO, a variant where the obstacles are unobserved but periodic (so the agent can perform well using a repeating motion), see Figure 6 (second from left), (iv) Pendulum, where we vary the pendulum mass, (v) Cart-Pole, where we vary the time horizon and pole length, and (vi) Swimmer, where the goal is to move the swimmer forward through a viscous liquid, where we vary the length of the segments comprising the robot swimmer.

Baselines. We compare against: (i) RL: PPO with a feed-forward neural network policy, (ii) RL-LSTM: PPO with an LSTM, (iii) Direct-Opt: learning a state machine policy directly via numerical optimization. Hyper-parameters are chosen to maximize performance on the training distribution. More details about the baselines and the hyper-parameters can be found in Appendices B.2, B.3, & B.4. Each algorithm is trained 5 times; we choose the one that performs best on the training distribution. Note that for the comparison to RL approaches, we use model-free algorithms, whereas, in our algorithm, the teacher uses model-based optimization. We do not compare against model-based RL approaches because (a) even model-free RL approaches achieve almost perfect performance on the training distribution (see Figure 4, left) and (b) our main goal is to compare the performance of our policies and the neural network policies on the test distribution, not the training distribution. Moreover, in case the model of the system is unknown, we can use known algorithms to infer the model from data (Ahmadi et al., 2018) and then use this learned model in our algorithm.

Figure 5: (a-c) The RL policy generates unstructured trajectories, and therefore does not generalize from (a) the training distribution to (b) the test distribution. In contrast, our state machine policy in (c) generates a highly structured trajectory that generalizes well. (c-e) A user can modify our state machine policy to improve performance.
In (d), the user sets the steering angle to the maximum value 0.5, and in (e), the user sets the thresholds in the switching conditions $G_{m_1}^{m_2}$, $G_{m_2}^{m_1}$ to 0.1.

Figure 6: Left: Trajectories for the Quad (leftmost) and Quad PO (second from the left) benchmarks using our state machine policy. Right: Graph of vertical acceleration over time for both our policy (red) and the neural network policy (blue), for Quad (second from the right) and Quad PO (rightmost).

5.1 RESULTS

Figure 4 shows results on both training and test distributions. We measure performance as the fraction of rollouts (out of 1000) that both satisfy the safety specification and reach the goal.

Inductive generalization. For all benchmarks, our policy generalizes well on the test distribution. In four cases, we generalize perfectly (all runs satisfy the metric). For Quad and Quad PO, the policies result in collisions on some runs, but only towards the end of the obstacle course.

Comparison to RL. The RL policies mostly achieve good training performance, but generalize poorly since they over-specialize to states seen during training. The exceptions are Pendulum and Swimmer. Even in these cases, the RL policies take longer to reach the goals than our state machine policies (see Figure 10 and Figure 11 in Appendix C). For Quad PO, the RL policy does not achieve good training performance since the states are partially observed. We might expect the LSTM policies to alleviate this issue. However, the LSTM policies often perform poorly even on the training distribution, and also generalize worse than the feed-forward neural network policies.

Comparison to direct-opt. The state machine policies learned using direct-opt perform poorly even in training because of the numerous local optima arising from the structural constraints. This illustrates the need to use adaptive teaching to learn state machine policies.

5.2 QUALITATIVE ANALYSIS

Behavior of policy. We empirically analyze the policies. Figure 5 shows the trajectory taken by the RL policy (a), compared to our policy (c), from a training initial state for the Car benchmark. The RL policy does not exhibit a repeating behavior, which causes it to fail on the trajectory from a test state shown in (b). Similarly, Figure 6 (right) compares the actions taken by our policy to those taken by the RL policy on Quad and Quad PO. Our policy produces smooth repeating actions, whereas the RL policy does not. Action vs. time graphs for other benchmarks can be found in the appendix (Figures 12, 13, & 14), and they all show similar behaviors.

Varying the training distribution. We study how the test performance changes as we vary the training distribution on the Car benchmark. We vary $X_0^{\text{train}}$ as $d \in [d_{\min}, 13]$, where $d_{\min} \in \{13, 12.5, 12, 11.5, 11.2, 11\}$, but fix $X_0^{\text{test}}$ to $d \in [11, 12]$. Figure 4 (right) shows how test performance varies with $d_{\min}$ for both our policy and the RL policy. Our policy inductively generalizes for a wide range of training distributions. In contrast, the test performance of the RL policy initially increases as the training distribution gets bigger, but it eventually starts declining. The reason is that its training performance actually starts to decline. Thus, in some settings, our approach (even when trained on smaller distributions) can produce policies that outperform the neural network policies produced by RL (even when trained on the full distribution).

Interpretability.
An added benefit of our state machine policies is interpretability. In particular, we demonstrate the interpretability of our policies by showing how a user can modify a learned state machine policy. Consider the policy from Figure 1e for the autonomous car. We manually make the following changes: (i) increase the steering angle in $H_{m_1}$ to its maximum value 0.5, and (ii) decrease the gap maintained between the agent and the black cars by changing the switching condition $G_{m_1}^{m_2}$ to $d_f \le 0.1$ and $G_{m_2}^{m_1}$ to $d_b \le 0.1$. Figure 5 demonstrates these changes: it shows trajectories obtained using the original policy (c), the first modified policy (d), and the second modified policy (e). There is no straightforward way to make these kinds of changes to a neural network policy.

6 CONCLUSION

We have proposed an algorithm for learning state machine policies that inductively generalize to novel environments. Our approach is based on a framework called adaptive teaching that alternately learns a student that imitates a teacher, which in turn adapts to the structure of the student. We demonstrate that our policies inductively generalize better than RL policies. In the future, we will explore more complex grammars for the action functions and the switching conditions, for example, with some parts being small neural networks, while still retaining the ability to learn generalizable behaviors. Moreover, we will extend our approach to use model-free techniques in the teacher's algorithm to make our approach more aligned with the reinforcement learning premise. Finally, we believe that the idea of learning programmatic representations and using the adaptive teaching algorithm to deal with mixed discrete-continuous problems can be applied to other learning settings such as supervised learning and unsupervised learning.

ACKNOWLEDGMENTS

This work was supported by ONR N00014-17-1-2699 and NSF Award CCF-1910769.

REFERENCES

Douglas Aberdeen and Jonathan Baxter. Scaling internal-state policy-gradient methods for POMDPs. In ICML, pp. 3-10, 2002.

Mohamadreza Ahmadi, Ufuk Topcu, and Clarence Rowley. Control-oriented learning of Lagrangian and Hamiltonian systems. In 2018 Annual American Control Conference (ACC), pp. 520-525. IEEE, 2018.

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 242-252, Long Beach, California, USA, 09-15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/allen-zhu19a.html.

David Andre and Stuart J Russell. State abstraction for programmable reinforcement learning agents. In AAAI/IAAI, pp. 119-125, 2002.

Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. Verifiable reinforcement learning via policy extraction. In Advances in Neural Information Processing Systems, pp. 2494-2504, 2018.

Pavol Bielik, Veselin Raychev, and Martin Vechev. Program synthesis for character level language modeling. In ICLR, 2017.

Jonathon Cai, Richard Shin, and Dawn Song. Making neural programming architectures generalize via recursion. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL https://openreview.net/forum?id=BkbY4psgg.

Luca Carlone and Giuseppe C. Calafiore. Convex relaxations for pose graph optimization with outliers.
IEEE Robotics and Automation Letters, 3(2):1160-1167, 2018. doi: 10.1109/LRA.2018.2793352. URL https://doi.org/10.1109/LRA.2018.2793352.

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.

Kevin Ellis, Armando Solar-Lezama, and Josh Tenenbaum. Unsupervised learning by program synthesis. In Advances in Neural Information Processing Systems, pp. 973-981, 2015.

Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, and Josh Tenenbaum. Learning to infer graphics programs from hand-drawn images. In Advances in Neural Information Processing Systems, pp. 6060-6069, 2018.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1126-1135. JMLR.org, 2017.

Eric A Hansen. Solving POMDPs by searching in policy space. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 211-219. Morgan Kaufmann Publishers Inc., 1998.

Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015.

Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pp. 1-9, 2013.

Shie Mannor, Reuven Y Rubinstein, and Yohai Gat. The cross entropy method for fast policy search. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 512-519, 2003.

Nicolas Meuleau, Leonid Peshkin, Kee-Eung Kim, and Leslie Pack Kaelbling. Learning finite-state controllers for partially observable environments. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 427-436. Morgan Kaufmann Publishers Inc., 1999.

Charles Packer, Katelyn Gao, Jernej Kos, Philipp Krähenbühl, Vladlen Koltun, and Dawn Song. Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282, 2018.

Ronald Parr and Stuart J Russell. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, pp. 1043-1049, 1998.

Leonid Peshkin, Nicolas Meuleau, and Leslie Kaelbling. Learning policies with external memory. arXiv preprint cs/0103003, 2001.

Pascal Poupart and Craig Boutilier. Bounded finite state controllers. In Advances in Neural Information Processing Systems, pp. 823-830, 2004.

Aravind Rajeswaran, Kendall Lowrey, Emanuel V Todorov, and Sham M Kakade. Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems, pp. 6550-6561, 2017.

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627-635, 2011.

Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233-242, 1999.

Lazar Valkov, Dipak Chaudhari, Akash Srivastava, Charles Sutton, and Swarat Chaudhuri. Houdini: Lifelong learning as program synthesis. In Advances in Neural Information Processing Systems, pp. 8701-8712, 2018.

Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning.
arXiv preprint arXiv:1804.02477, 2018.

Abhinav Verma, Hoang Minh Le, Yisong Yue, and Swarat Chaudhuri. Imitation-projected policy gradient for programmatic reinforcement learning. CoRR, abs/1907.05431, 2019. URL http://arxiv.org/abs/1907.05431.

Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.

Halley Young, Osbert Bastani, and Mayur Naik. Learning neurosymbolic generative models via program synthesis. In International Conference on Machine Learning, pp. 7144-7153, 2019.

He Zhu, Zikang Xiong, Stephen Magill, and Suresh Jagannathan. An inductive synthesis framework for verifiable reinforcement learning. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 686-701. ACM, 2019.

A EXPECTATION MAXIMIZATION FOR STUDENT OPTIMIZATION

A.1 COMPUTING $p(\tau \mid \pi, x_0)$

First, note that we have

$$p(\tau_k \mid \pi, x_0^k) = \prod_i p(H_{k,i} \mid \pi, x_0^k)\, p(T_{k,i} \mid \pi, x_0^k).$$

The first factor marginalizes over the latent mode assignment:

$$p(H_{k,i} \mid \pi, x_0^k) = \sum_j p(H_{k,i} \mid \mathcal{H}_{m_j})\, p(\mu_{k,i} = m_j).$$

Similarly, the duration $T_{k,i}$ is determined both by the current mode $\mu_{k,i} = m_{j1}$, and by the switching conditions $\mathcal{G}_{m_{j1}} = \{\mathcal{G}_{m_{j1}}^{m_{j2}}\}_{m_{j2}}$ from the current mode $m_{j1}$ into some other mode $m_{j2}$. More precisely, let $\gamma_{k,i}$ denote the trajectory (sequence of states) of the $(k, i)$ segment of $\tau_k$, and let $\zeta(\gamma_{k,i}, \mathcal{G}_{m_j})$ denote the earliest time at which a switching condition $G \in \mathcal{G}_{m_j}$ becomes true along $\gamma_{k,i}$. Since the $G \in \mathcal{G}_{m_j}$ are distributions, $\zeta(\gamma_{k,i}, \mathcal{G}_{m_j})$ is a distribution over transition times. Then, we have

$$p(T_{k,i} \mid \pi, x_0^k) = \sum_{m_{j2}} p(\mu_{k,i} = m_{j1})\, p(\mu_{k,i+1} = m_{j2})\, p(T_{k,i} \mid \mathcal{G}_{m_{j1}}^{m_{j2}}, \mathcal{G}_{m_{j1}})$$

$$p(T_{k,i} \mid \mathcal{G}_{m_{j1}}^{m_{j2}}, \mathcal{G}_{m_{j1}}) = p\big(T_{k,i} = \zeta(\gamma_{k,i}, \mathcal{G}_{m_{j1}}^{m_{j2}})\big) \prod_{m_{j3} \neq m_{j2}} p\big(T_{k,i} < \zeta(\gamma_{k,i}, \mathcal{G}_{m_{j1}}^{m_{j3}})\big).$$

In other words, $T_{k,i}$ is the duration until $\mathcal{G}_{m_{j1}}^{m_{j2}}$ triggers, conditioned on none of the conditions $\mathcal{G}_{m_{j1}}^{m_{j3}}$ triggering (where $m_{j3} \neq m_{j2}$).

A.2 OPTIMIZING THE STUDENT POLICY

Numerically optimizing the maximum likelihood objective to compute $\pi^*$ is hard because it requires integrating over all possible choices of the latent variables $\mu_{k,i}$. For example, if the teacher generates 10 loop-free policies every iteration and there are 10 modes in each loop-free policy, and 4 modes in the state machine, the number of choices for the latent variables is $4^{100}$, which makes enumeration infeasible. The expectation-maximization method provides an efficient way of computing the maximum likelihood, by alternatingly optimizing over the latent variables and the state-machine parameters. The E-step computes the probability distributions $p(\mu_{k,i} = m_j)$ for a fixed $\pi$, and the M-step optimizes $\mathcal{H}_{m_j}$ and $\mathcal{G}_{m_{j1}}^{m_{j2}}$ given $p(\mu_{k,i} = m_j)$.

E-step. Assuming $\pi$ is fixed, we have

$$p(\mu_{k,i} = m_j \mid \pi, \{\tau_k\}) = \frac{p(H_{k,i} \mid \mathcal{H}_{m_j})\, p(T_{k,i} = \zeta(\gamma_{k,i}, \mathcal{G}_{m_j}))}{\sum_{m_{j'}} p(H_{k,i} \mid \mathcal{H}_{m_{j'}})\, p(T_{k,i} = \zeta(\gamma_{k,i}, \mathcal{G}_{m_{j'}}))}. \qquad (10)$$

M-step. Assuming $p(\mu_{k,i} = m_j)$ is fixed, we solve

$$\arg\max_{\{\mathcal{H}_{m_j}\}} \sum_{k,i} p(\mu_{k,i} = m_j)\, \log p(H_{k,i} \mid \mathcal{H}_{m_j}) \qquad (11)$$

$$\arg\max_{\{\mathcal{G}_{m_{j1}}^{m_{j2}}\}} \sum_{k,i} p(\mu_{k,i} = m_{j1})\, p(\mu_{k,i+1} = m_{j2})\, \log p\big(T_{k,i} = \zeta(\gamma_{k,i}, \mathcal{G}_{m_{j1}}^{m_{j2}})\big) + p(\mu_{k,i} = m_{j1})\, \big(1 - p(\mu_{k,i+1} = m_{j2})\big)\, \log p\big(T_{k,i} < \zeta(\gamma_{k,i}, \mathcal{G}_{m_{j1}}^{m_{j2}})\big) \qquad (12)$$

For $\mathcal{G}_{m_{j1}}^{m_{j2}}$, the first term handles the case $\mu_{k,i+1} = m_{j2}$, where we maximize the probability that $\mathcal{G}_{m_{j1}}^{m_{j2}}$ makes the transition at duration $T_{k,i}$, and the second term handles the case $\mu_{k,i+1} \neq m_{j2}$, where we maximize the probability that $\mathcal{G}_{m_{j1}}^{m_{j2}}$ does not make the transition until after duration $T_{k,i}$.
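A minimal sketch of the E-step responsibilities in Eq. (10) is shown below. The likelihood functions lik_action and lik_duration are placeholders for the Gaussian densities of the action-function and switching-condition parameters of each mode.

```python
import numpy as np

def e_step(segments, modes, lik_action, lik_duration):
    """For each segment (k, i), compute p(mu_{k,i} = m_j) for every mode m_j (Eq. 10)."""
    responsibilities = []
    for seg in segments:  # each seg carries (H_{k,i}, T_{k,i}, gamma_{k,i})
        w = np.array([lik_action(seg, m) * lik_duration(seg, m) for m in modes])
        responsibilities.append(w / w.sum())   # normalize over modes
    return responsibilities
```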
We briefly discuss how to solve these equations. For action functions, suppose that $\mathcal{H}$ encodes the distribution $\mathcal{N}(\alpha_{\mathcal{H}}, \sigma_{\mathcal{H}}^2)$ over action function parameters. Then, we have

$$\alpha_{\mathcal{H}_{m_j}} = \frac{\sum_{k,i} p(\mu_{k,i} = m_j)\, \alpha_{H_{k,i}}}{\sum_{k,i} p(\mu_{k,i} = m_j)}, \qquad (\sigma_{\mathcal{H}_{m_j}})^2 = \frac{\sum_{k,i} p(\mu_{k,i} = m_j)\, (\alpha_{H_{k,i}} - \alpha_{\mathcal{H}_{m_j}})(\alpha_{H_{k,i}} - \alpha_{\mathcal{H}_{m_j}})^T}{\sum_{k,i} p(\mu_{k,i} = m_j)}.$$
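This closed-form update is just a responsibility-weighted mean and covariance. A small sketch, assuming the segment parameters are stacked into an array:

```python
import numpy as np

def m_step_action(alphas, resp):
    """Weighted Gaussian fit for one mode: alphas is (n_segments, d), resp is
    (n_segments,) holding p(mu_{k,i} = m_j)."""
    w = resp / resp.sum()
    mean = (w[:, None] * alphas).sum(axis=0)
    diff = alphas - mean
    cov = (w[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0)
    return mean, cov
```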
Solving for the parameters of $\mathcal{G}_{m_{j1}}^{m_{j2}}$ is more challenging, since there can be multiple kinds of expressions in the grammar for switching conditions, which correspond to discrete parameters, and we need to optimize over these discrete choices. To do so, we perform a greedy search over these discrete choices (see Section A.3 for details on the greedy strategy). For each choice considered during the greedy search, we encode Eq (12) as a numerical optimization problem and solve it to compute the corresponding means $\alpha_{\mathcal{G}_{m_{j1}}^{m_{j2}}}$ and standard deviations $\sigma_{\mathcal{G}_{m_{j1}}^{m_{j2}}}$. Then, we choose the discrete choice that achieves the best objective value according to Eq (12).

Computing the optimal parameters for switching conditions is more expensive than doing so for action functions; thus, on each student iteration, we iteratively solve Eq (10) and Eq (11) multiple times, but only solve Eq (12) once. The EM method does not guarantee global optima but usually works well in practice. In addition, since computing the switching conditions is expensive, we had to restrict the number of EM iterations. However, note that even if the EM algorithm did not converge, our overall algorithm can still recover by using additional teacher-student interactions. The alternative would be to run the EM algorithm multiple times or for longer to get better results per student iteration, and perhaps reduce the total number of teacher-student iterations. We say "perhaps" because the EM algorithm might have already converged to the global optimum, making the extra EM iterations useless. The trade-off between our approach and this alternative depends on whether the teacher's algorithm or the student's algorithm is expensive for a particular benchmark. However, from Figure 15, we can see that some of our benchmarks already use very few (< 5) teacher-student iterations (Car, Quad PO, Pendulum, Mountain car, and Swimmer). Of the other three benchmarks that needed many iterations, for two of them (Cartpole and Acrobot), the student's algorithm is as expensive as the teacher's algorithm. This justifies our decision to not run the EM algorithm multiple times or for longer.

A.3 SYNTHESIZING SWITCHING CONDITIONS

Next, we describe how we search over the large number of discrete choices in the grammar for switching conditions. It is not hard to show that in Eq (12), the objectives for the switching condition parameters $\mathcal{G}_{m_{j1}}^{m_{j2}}$ corresponding to different transitions $(m_{j1}, m_{j2})$ decompose into separate problems. Therefore, we can perform the search for each transition $(m_{j1}, m_{j2})$ separately. For each transition, the naive approach would be to search over the possible derivations in the context-free grammar for switching conditions to some bounded depth. However, this search space is exponential in the depth due to the productions $B ::= B \wedge B$ and $B ::= B \vee B$. Thus, we employ a greedy search strategy to avoid the exponential blowup. Intuitively, our search strategy is to represent switching conditions as a kind of decision tree, and then perform a greedy algorithm to search over decision trees.[1] Our search strategy is similar to (but simpler than) the one in Bielik et al. (2017).

[1] However, our algorithm is not very similar to greedy decision tree learning algorithms.

In particular, we can equivalently represent a switching condition as a decision tree, where the internal nodes have the form $o[i] \le \alpha$ or $o[i] \ge \alpha$ (where $i \in \{1, \dots, d_O\}$ and $\alpha \in \mathbb{R}$ are parameters), and the leaf nodes are labeled with "Switch" or "Don't switch"; e.g., Figure 7 shows two examples of switching conditions expressed as decision trees. Then, our algorithm initializes the switching condition to a single leaf node, i.e., $G_{\text{cur}} \leftarrow$ "Switch". At each step, we consider switching conditions $G \in \text{next}(G_{\text{cur}})$ that expand a single leaf node of $G_{\text{cur}}$; among these, we choose $G_{\text{cur}}$ to be the one that minimizes a loss $\text{cost}(G)$.

Figure 7: Switching conditions represented as decision trees.

Algorithm 1: Greedy algorithm for learning switching conditions.
procedure LearnSwitchingCondition
  $G_{\text{cur}} \leftarrow$ "Switch"
  while $|G_{\text{cur}}| < N$ do
    $G_{\text{cur}} \leftarrow \arg\min_{G \in \text{next}(G_{\text{cur}})} \text{cost}(G)$
  return $G_{\text{cur}}$

More precisely, to construct $\text{next}(G_{\text{cur}})$, we iterate over all leaf nodes $L \in \text{leaves}(G_{\text{cur}})$ and all expressions $E \in \mathcal{E}$, where

$$E = \big(\text{if } o[i] \bowtie \alpha \text{ then Switch else Don't switch}\big), \qquad i \in \{1, \dots, d_O\},\ \alpha \in \mathbb{R},\ \bowtie \in \{\le, \ge\}.$$

Here, $\bowtie \in \{\le, \ge\}$ is an inequality relation, $i \in \{1, \dots, d_O\}$ is a component of $o$, and $\alpha \in \mathbb{R}$ is a threshold. For each pair $L$ and $E$, we consider the decision tree $G$ obtained by replacing $L$ with $E$ in $G_{\text{cur}}$. The set $\text{next}(G_{\text{cur}})$ contains all $G$ constructed in this way. Next, the loss function $\text{cost}(G)$ is given by Eq (12). In each iteration, our algorithm optimizes $\text{cost}(G)$ over $G \in \text{next}(G_{\text{cur}})$, and updates $G_{\text{cur}} \leftarrow G$. To solve this optimization problem, we enumerate the possible choices of $\bowtie$ and $i$, and use numerical optimization to compute $\alpha$ (since $\alpha$ is a continuous parameter). An example of a single iteration of our algorithm is shown in Figure 7. In particular, letting $G$ be the tree on the left and $G'$ be the tree on the right, the left-most leaf node of $G$ is expanded to get $G'$. Our algorithm is summarized in Algorithm 1. Overall, our algorithm searches over $N(N-1)\,d_O$ different discrete structures, where $N$ is the number of nodes in the decision tree and $d_O$ is the length of the observation vector $o$.
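As an illustration of Algorithm 1, the sketch below grows a decision-tree switching condition by greedily expanding one leaf at a time. Here candidate_expansions and cost stand in for next(G) (leaf x observation component x inequality, with the threshold fit numerically) and the Eq. (12) objective, and the tree encoding is an assumption made for the example.

```python
def tree_size(tree):
    """Count nodes of a tree encoded as a leaf string or (i, ineq, alpha, left, right)."""
    if isinstance(tree, str):          # leaf: "Switch" or "Don't switch"
        return 1
    _, _, _, left, right = tree
    return 1 + tree_size(left) + tree_size(right)

def learn_switching_condition(candidate_expansions, cost, max_nodes=7):
    """Greedy loop of Algorithm 1: repeatedly replace one leaf with the best expansion."""
    g_cur = "Switch"                   # start from a single leaf node
    while tree_size(g_cur) < max_nodes:
        candidates = candidate_expansions(g_cur)
        if not candidates:
            break
        g_cur = min(candidates, key=cost)
    return g_cur
```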
B IMPLEMENTATION DETAILS

B.1 BENCHMARKS AND STATE-MACHINE STATISTICS

Figure 8 shows statistics for the benchmarks, such as the number of action variables and observation variables, and the sets of initial states used for training and testing. Figure 8 also shows the different aspects of the grammar used to describe the space of possible state-machine policies. We are able to learn policies for these benchmarks using 2 to 4 distinct modes in the state machine with either a constant or a proportional grammar for the action functions. We use the Boolean tree grammar of depth 1 or 2 for all the switching conditions.

For the Quad benchmark, the action variable is the acceleration of the quadcopter in the vertical direction. The observations include the position $x, y$, the velocities $v_x, v_y$, and four sensors $o_{xl}, o_{xu}, o_{yl}, o_{yu}$ that describe the obstacle course in the near neighborhood. The Quad PO benchmark has the same action space as the Quad benchmark, but can only observe $x$, $y$, $v_x$, and $v_y$. The synthesized state-machine policies for these benchmarks are shown in Figure 16 and Figure 17. The action functions used for these benchmarks choose the acceleration to be proportional to $v_y$.

The goal for the Pendulum benchmark is to control the force (continuous) at the actuated link in order to invert the link. The observation space includes the full state, i.e., the angle $\theta$ and the angular velocity $\omega$. Figure 18 shows the synthesized state-machine policy for the Pendulum benchmark.

| Bench | #A | #O | $X_0^{\text{train}}$ | $X_0^{\text{test}}$ | # modes | A_G | C_G |
|---|---|---|---|---|---|---|---|
| Car | 2 | 5 | d ∈ [12, 13.5] m | d ∈ [11, 12] m | 3 | Constant | Boolean tree (depth 1) |
| Quad | 1 | 8 | x dist = 40 m | x dist = 80 m | 2 | Proportional | Boolean tree (depth 1) |
| Quad PO | 1 | 4 | x dist = 60 m | x dist = 120 m | 2 | Proportional | Boolean tree (depth 1) |
| Pendulum | 1 | 2 | mass ∈ [1, 1.5] kg | mass ∈ [1.5, 5] kg | 2 | Constant | Boolean tree (depth 2) |
| Cartpole | 1 | 4 | time = 5 s, len = 0.5 | time = 300 s, len = 1.0 | 2 | Constant | Boolean tree (depth 2) |
| Acrobot | 1 | 4 | masses = [0.2, 0.5] | masses = [0.5, 2] | 2 | Constant | Boolean tree (depth 2) |
| Mountain car | 1 | 2 | power = [5, 15]e-4 | power = [3, 5]e-4 | 2 | Constant | Boolean tree (depth 1) |
| Swimmer | 3 | 10 | len = 1 unit | len = 0.75 unit | 4 | Proportional | Boolean tree (depth 2) |

Figure 8: Summary of our benchmarks. #A is the action dimension, #O is the observation dimension, $X_0^{\text{train}}$ is the set of initial states used for training, $X_0^{\text{test}}$ is the set of initial states used to test inductive generalization, # modes is the number of modes in the state machine policy, and A_G and C_G are the grammars for action functions and switching conditions, respectively. The depth of C_G indicates the number of levels in the Boolean tree.

The Cartpole benchmark consists of a pole attached to a cart. The goal is to keep the pole upright by applying a continuous force to move the cart to the right or to the left. The observations include the position $x$ and the velocity $v$ of the cart, and the angle $\theta$ and the angular velocity $\omega$ of the pole. The synthesized solution is shown in Figure 19.

The Acrobot benchmark is similar to the Pendulum benchmark but with two links; only the top link can be actuated, and the goal is to drive the bottom link above a certain height. The observations are the angles $\theta_1, \theta_2$ and the angular velocities $\omega_1, \omega_2$ of the two links. For this benchmark, we vary the mass of the links between the training and the test distributions. The synthesized solution is shown in Figure 20.

For the Mountain car benchmark, the goal is to drive a low-powered car to the top of a hill. The agent has to drive back and forth to gain enough momentum to be able to cross the hill. The agent controls the force (continuous) to move the car to the right or left and observes the position $x$ and the velocity $v$ at every timestep. We vary the power of the car between the training and the test distributions. The synthesized solution is shown in Figure 21.

The Swimmer benchmark is based on MuJoCo's swimmer. To make this benchmark more challenging, we use 4 segments instead of 3. There are three actions that control the torques at the joints, and the goal is to make the swimmer move forward through a viscous liquid. The agent can observe the swimmer's global angle $\theta$, the joint angles $(\theta_1, \theta_2, \theta_3)$, the swimmer's global angular velocity $\omega$, the angular velocities of the joints $(\omega_1, \omega_2, \omega_3)$, and the velocity of the center of mass $(v_x, v_y)$. We vary the length of the segments between the training and the test distributions. The actions are chosen to be proportional to their corresponding angles. The synthesized state machine policy is shown in Figure 22.

B.2 HYPER-PARAMETERS

There are three main hyper-parameters in our algorithm:
- The maximum number of segments/modes in a loop-free policy. A large number of segments makes the teacher's numerical optimization slow, while a small number of segments might not be sufficient to obtain a high reward.
- The maximum time for which a segment can be executed in a loop-free policy. This maximum-time constraint helps the numerical optimization avoid local optima that arise from executing a particular (non-convex) action function for too long.
- The parameter λ in Section 4.1. This parameter strikes a balance between preferring high-reward loop-free policies and preferring policies that are similar to the state machine learned so far.

The first two parameters solely affect the teacher's algorithm; thus, we choose them by randomly sampling from a set and selecting the values that produce high-reward loop-free policies. We use λ = 100 for all our experiments.

B.3 THE DIRECT-OPT BASELINE

For this baseline, we convert the problem of synthesizing a state machine policy into a numerical optimization problem. To do this, we first encode the discreteness in the grammar for switching conditions into a continuous one-hot representation. For example, the set of expressions $o[i] \ge \alpha_0$ or $o[i] \le \alpha_0$ is encoded as $\alpha_s(\alpha_1 o[1] + \alpha_2 o[2] + \cdots + \alpha_n o[n]) \ge \alpha_0$ with constraints $-1 \le \alpha_s \le 1$, $0 \le \alpha_i \le 1$ for $i \in \{1, \dots, n\}$, and $\sum_{i=1}^{n} \alpha_i = 1$. The choices between the leaf expressions, conjunctions, and disjunctions are also encoded in a one-hot fashion. We also tried an encoding without the extra constraints on $\alpha$, i.e., where the switching conditions are linear functions of the observations. We would expect the linear encoding to be less generalizable than the one-hot encoding. However, we found that it is hard to even synthesize a policy that works well on the training set with either of the encodings.

Another difficulty with direct optimization is that we need to optimize the combined reward from all the initial states at once. In contrast, the numerical optimization performed by the teacher in our approach can optimize the reward for each initial state separately. To deal with this issue, we use a batch optimization technique that uses 10 initial states for every batch and seeds the starting point of the numerical optimization for each batch with the parameters found so far. It also restarts the process with a random starting point if the numerical optimization stalls. We carry out this process in parallel using 10 threads until either a solution is found or the time exceeds 2 hours.
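The relaxation above can be sketched as a single differentiable predicate. This is a hypothetical helper, not the baseline's actual implementation; the clipping and normalization are one simple way to keep the parameters inside the stated constraints.

```python
import numpy as np

def soft_predicate(obs, alpha_s, alpha, alpha_0):
    """Soft one-hot encoding of a leaf predicate o[i] >= alpha_0 (or <=, via alpha_s)."""
    alpha = np.clip(np.asarray(alpha, dtype=float), 0.0, None)
    alpha = alpha / (alpha.sum() + 1e-12)           # 0 <= alpha_i <= 1, sum to 1
    alpha_s = float(np.clip(alpha_s, -1.0, 1.0))    # -1 <= alpha_s <= 1 picks the direction
    return alpha_s * float(alpha @ np.asarray(obs, dtype=float)) - alpha_0  # >= 0 => switch
```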
B.4 RL BASELINES
We use the PPO2 implementation from OpenAI Baselines (Dhariwal et al., 2017) with the standard MLP and LSTM networks for our RL baselines, using 10^7 timesteps for training.

Environment featurization. We used the same action spaces, observation spaces, and sets of initial states as for our approach. The one exception is the Car benchmark, for which we appended the observation vector with the observations from the previous timestep. This modification was essential for the RL baseline to achieve good performance on the training dataset.

Designing reward functions. While our approach takes in a safety specification φ_S(x) ≤ 0 and a goal specification φ_G(x) ≤ 0, the RL baselines need a reward function. For the classic control problems (cartpole, pendulum, acrobot, mountain car, and swimmer), we used the standard reward functions specified by their OpenAI environments. For the Quad and Quad PO benchmarks, since the goal is to avoid collisions for as long as possible, we use a reward of 1 for every timestep that the agent is alive, and the episode terminates as soon as the agent collides with any of the obstacles.

Designing the reward function for the Car benchmark was tricky because this benchmark has both a goal and a safety specification, and finding the right balance between them is crucial for learning. We tried various forms of reward functions and found that the following version achieves the best performance on the training distribution (measured as the fraction of roll-outs that satisfy both the goal and the safety property):

    r(x, a) = −φ_G(x)_+ − L   if φ_S(x) > 0
    r(x, a) = −φ_G(x)_+        otherwise

where φ_G(x)_+ = max(φ_G(x), 0); that is, the reward penalizes the numerical error for not satisfying the goal and adds a constant penalty L at every timestep at which the safety specification is violated. We tried different values L ∈ {0.1, 1, 2, 10, 20} and found that L = 10 achieves the best performance on the training distribution.

Hyper-parameters search. We performed a search over the various hyper-parameters of the PPO2 algorithm. We ran 10 instances of the PPO2 algorithm with parameters uniformly sampled from the space given below and chose the one that performs best on the training distribution. This sampling is not exhaustive, but our results in Figure 4 (leftmost) show that we did find parameters that achieve good training performance for most of our benchmarks.
- The number of training minibatches per update, nminibatches ∈ {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048}. For the LSTM network, we set this hyper-parameter to 1.
- The policy entropy coefficient in the optimization objective, ent_coef ∈ {0.0, 0.01, 0.05, 0.1}.
- The number of training epochs per update, noptepochs ∈ {3, ..., 36}.
- The clipping range, cliprange ∈ {0.1, 0.2, 0.3}.
- The learning rate, lr ∈ [5 × 10^-6, 0.003].

                          Performance on Train dist.   Performance on Test dist.
Bench         Algorithm     G        T_G                 G        T_G
Acrobot       Ours          0.08     7.9s                0.02     31.8s
              RL            0.16     6.5s                0.0      45.2s
              Direct-opt    timeout  timeout             timeout  timeout
Mountain car  Ours          0.001    168.5s              0.008    290.1s
              RL            0.0      98.7s               0.0      214.7s
              Direct-opt    0.006    105.3s              2.18     216.0s

Figure 9: Experiment results for additional benchmarks. G is the average goal error (closer to 0 is better). T_G is the average number of timesteps to reach the goal (lower is better). "timeout" indicates that the method timed out. We can see that both our approach and RL generalize for these benchmarks.

Figure 10: Trajectories taken by our state machine policy (left) and the RL policy (middle) on Pendulum for a test environment (i.e., a heavier pendulum). Green (resp., red) indicates positive (resp., negative) torque. Our policy performs optimally by applying positive torque when the angular velocity ω ≥ 0 and negative torque otherwise. In contrast, the RL policy performs sub-optimally (especially at the beginning of the trajectory).

C ADDITIONAL RESULTS
C.1 ADDITIONAL PERFORMANCE RESULTS
Figure 9 shows the training and test performance for the Acrobot and Mountain car benchmarks. We can see that both our approach and RL generalize for these benchmarks. Figure 10 qualitatively compares the policies learned by our approach and by RL for the Pendulum benchmark; the RL policy performs slightly sub-optimally compared to our policy.
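For reference, the switching behavior visible in Figure 10 can be written down directly: apply a positive torque while ω ≥ 0 and a negative torque otherwise. The sketch below is an illustrative rendering of that rule only; the torque magnitude is a placeholder and is not a constant taken from the synthesized policy in Figure 18.

```python
# Illustrative two-branch torque rule matching the behavior in Figure 10:
# positive torque while the angular velocity is non-negative, negative otherwise.
TORQUE = 1.0  # placeholder magnitude, not a synthesized constant


def pendulum_action(theta: float, omega: float) -> float:
    # theta is part of the observation but unused by this simple rule.
    return TORQUE if omega >= 0.0 else -TORQUE


print(pendulum_action(0.3, 0.5))   # +1.0
print(pendulum_action(0.3, -0.5))  # -1.0
```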
Figure 11 shows the trajectories of the learned state machine policy and the RL policy on Swimmer for a train environment and a test environment. While both policies generalize, the swimmer with the state machine policy is slightly faster (it takes about 35s to cover a distance of 10 units, while the RL policy takes about 45s).

Figures 12, 13, and 14 show the action-versus-time plots for the various benchmarks using the learned state-machine policies and the neural network policies. We can see that the state-machine policies produce smooth actions, whereas the RL policies do not.

Figure 11: Trajectories taken by our state machine policy on Swimmer for (a) a train environment with segments of length 1, and (b) a test environment with segments of length 0.75. The colors indicate different modes. The axes are the x and y coordinates of the center of mass of the swimmer. Trajectories taken by the RL policy on Swimmer for (c) a train environment, and (d) a test environment. While both policies generalize, the swimmer with the state machine policy is slightly faster (it takes about 35s to cover a distance of 10 units, while the RL policy takes about 45s).

Figure 12: Action vs. time graphs for the Car benchmark for both our policy (red) and the neural network policy (blue). Left: the velocity of the agent; Right: the steering angle.

C.2 ANALYSIS OF RUNNING TIME
Figure 15 shows the synthesis times for the various benchmarks, along with the number of student-teacher iterations and the time spent by the teacher and by the student separately. The teacher optimizes the loop-free policies for different initial states in parallel. The student optimizes the switching conditions between different pairs of modes in parallel. We used a parallelized implementation with 10 threads and report wall-clock time.

Figure 13: Action vs. time graphs for the Pendulum benchmark (left) and the Cartpole benchmark (right) for both our policy (red) and the neural network policy (blue).

Figure 14: Action vs. time graphs for the Swimmer benchmark, showing the three torques at the three joints of the swimmer. The blue line is the neural network policy and the red line is the state machine policy.

Figure 15: Synthesis times (in seconds, wall-clock time) for learning state machine policies for the different benchmarks. The plot breaks down the total synthesis time into the time taken by the teacher, the student, and other miscellaneous parts of the algorithm; Misc. mostly includes the time spent checking convergence at every iteration. The plot also shows the number of teacher-student iterations taken for each benchmark.

Figure 16: Synthesized state-machine policy for the Quad benchmark.
Figure 17: Synthesized state-machine policy for the Quad PO benchmark.
Figure 18: Synthesized state-machine policy for Pendulum.
Figure 19: Synthesized state-machine policy for Cartpole.
Figure 20: Synthesized state-machine policy for Acrobot.
Figure 21: Synthesized state-machine policy for Mountain car.
Figure 22: Synthesized state-machine policy for Swimmer.