# The Logical Options Framework

Brandon Araki 1, Xiao Li 1, Kiran Vodrahalli 2, Jonathan DeCastro 3, J. Micah Fry 4, Daniela Rus 1

Abstract: Learning composable policies for environments with complex rules and tasks is a challenging problem. We introduce a hierarchical reinforcement learning framework called the Logical Options Framework (LOF) that learns policies that are satisfying, optimal, and composable. LOF efficiently learns policies that satisfy tasks by representing the task as an automaton and integrating it into learning and planning. We provide and prove conditions under which LOF will learn satisfying, optimal policies. Lastly, we show how LOF's learned policies can be composed to satisfy unseen tasks with only 10–50 retraining steps on our benchmarks. We evaluate LOF on four tasks in discrete and continuous domains, including a 3D pick-and-place environment.

1 CSAIL, Massachusetts Institute of Technology, Cambridge, MA, USA. 2 Department of Computer Science, Columbia University, New York City, NY, USA. 3 Toyota Research Institute, Cambridge, MA, USA. 4 MIT Lincoln Laboratory, Lexington, MA, USA. Correspondence to: Brandon Araki. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

1. Introduction

To operate in the real world, intelligent agents must be able to make long-term plans by reasoning over symbolic abstractions while also maintaining the ability to react to low-level stimuli in their environment (Zhang & Sridharan, 2020). Many environments obey rules that can be represented as logical formulae; e.g., the rules a driver follows while driving, or a recipe a chef follows to cook a dish. Traditional motion and path planning techniques struggle to plan over these long-horizon tasks, but hierarchical approaches such as hierarchical reinforcement learning (HRL) can solve lengthy tasks by planning over both the high-level rules and the low-level environment. However, solving these problems involves trade-offs among multiple desirable properties, which we identify as satisfaction, optimality, and composability (described below). Today's hierarchical planning algorithms lack at least one of these objectives. For example, Reward Machines (Icarte et al., 2018) are satisfying and optimal, but not composable; the options framework (Sutton et al., 1999) is composable and hierarchically optimal, but cannot satisfy specifications. An algorithm that achieves all three of these properties would be very powerful because it would enable a model learned on one set of rules to generalize to arbitrary rules. We introduce the Logical Options Framework, which builds upon the options framework and aims to combine symbolic reasoning and low-level control to achieve satisfaction, optimality, and composability with as few compromises as possible. Furthermore, we demonstrate that models learned with our framework generalize to arbitrary sets of rules without any further learning, and we also show that our framework is compatible with arbitrary domains and planning algorithms, from discrete domains and value iteration to continuous domains and proximal policy optimization (PPO).

Satisfaction: An agent operating in an environment governed by rules must be able to satisfy the specified rules. Satisfaction is a concept from formal logic, in which the input to a logical formula causes the formula to evaluate to True. Logical formulae can encapsulate rules and tasks like the ones described in Fig. 1, such as "pick up the groceries" and "do not drive into a lake". In this paper, we state conditions under which our method is guaranteed to learn satisfying policies.
Optimality: Optimality requires that the agent maximize its expected cumulative reward for each episode. In general, satisfaction can be achieved by rewarding the agent for satisfying the rules of the environment. In hierarchical planning there are several types of optimality, including hierarchical optimality (optimal with respect to the hierarchy) and optimality (optimal with respect to everything). We prove in this paper that our method is hierarchically optimal and, under certain conditions, optimal.

Composability: Our method is also composable: once it has learned the low-level components of a task, the learned model can be rearranged to satisfy arbitrary tasks. More specifically, the rules of an environment can be factored into liveness and safety properties, which we discuss in Sec. 3. The learned model has high-level actions called options that can be composed to satisfy new liveness properties. A shortcoming of many RL models is that they are not composable; trained to solve one specific task, they are incapable of handling even small variations in the task structure. However, the real world is a dynamic and unpredictable place, so the ability to use a learned model to automatically reason over as-yet-unseen tasks is a crucial element of intelligence.

Figure 1. Many parents face this task after school ends: who picks up the kid, and who gets the groceries? (a) The natural language instructions: "Go grocery shopping, pick up the kid, and go home, unless your partner calls telling you that they will pick up the kid, in which case just go grocery shopping and then go home. And don't drive into the lake." These instructions can be transformed into an FSA, shown in (b). (b) The FSA representing the natural language instructions; the propositions are divided into subgoal, safety, and event propositions. (c) The low-level MDP and corresponding policy that satisfies the instructions. The pictorial symbols represent propositions, which are true or false depending on the state of the environment. The arrows in (c) represent sub-policies, and the colors of the arrows match the corresponding transitions in the FSA. The boxed phone at the beginning of some of the arrows represents how these sub-policies can occur only after the agent receives a phone call.

Fig. 1 gives an example of how LOF works. The environment is a world with a grocery store, your (hypothetical) kid, your house, and some lakes, and in which you, the agent, are driving a car. The propositions are divided into subgoals, representing events that can be achieved, such as going grocery shopping; safety propositions, representing events that you must avoid (driving into a lake); and event propositions, corresponding to events that you have no control over (receiving a phone call) (Fig. 1b). In this environment, you have to follow rules (Fig. 1a). These rules can be converted into a logical formula, and from there into a finite state automaton (FSA) (Fig. 1b). LOF learns an option for each subgoal (illustrated by the arrows in Fig. 1c), and a meta-policy for choosing amongst the options to reach the goal state of the FSA. After learning, the options can be recombined to fulfill arbitrary tasks.
1.1. Contributions

This paper introduces the Logical Options Framework (LOF) and makes four contributions to the hierarchical reinforcement learning literature:

1. The definition of a hierarchical semi-Markov Decision Process (SMDP) that is the product of a logical FSA and a low-level environment MDP.
2. A planning algorithm for learning options and meta-policies for the SMDP that allows the options to be composed to solve new tasks with only 10–50 retraining steps on our benchmarks and no additional samples from the environment.
3. Conditions and proofs for satisfaction and optimality.
4. Experiments on a discrete delivery domain, a continuous 2D reacher domain, and a continuous 3D pick-and-place domain on four tasks demonstrating satisfaction, optimality, and composability.

2. Background

Linear Temporal Logic: We use linear temporal logic (LTL) to formally specify rules (Clarke et al., 2001). LTL can express tasks and rules using temporal operators such as "eventually" and "always". LTL formulae are used only indirectly in LOF, as they are converted into automata that the algorithm uses directly. We chose to use LTL to represent rules because LTL corresponds closely to natural language and has proven to be a more natural way of expressing tasks and rules for engineers than designing FSAs by hand (Kansou, 2019). Formulae φ have the syntax grammar

φ := p | ¬φ | φ1 ∨ φ2 | ◯φ | φ1 U φ2

where p is a proposition (a boolean-valued truth statement that can correspond to objects or events in the world), ¬ is negation, ∨ is disjunction, ◯ is "next", and U is "until". The derived rules are conjunction (∧), implication (⟹), equivalence (⟺), eventually (◊φ ≡ True U φ), and always (□φ ≡ ¬◊¬φ) (Baier & Katoen, 2008). φ1 U φ2 means that φ1 is true until φ2 is true, ◊φ means that there is a time where φ is true, and □φ means that φ is always true.

The Options Framework: The options framework is a framework for defining and solving semi-Markov Decision Processes (SMDPs) with a type of macro-action called an option (Sutton et al., 1999). The inclusion of options in an MDP problem turns it into an SMDP problem, because actions are dependent not just on the previous state but also on the identity of the currently active option, which could have been initiated many time steps before the current time. An option o is a variable-length sequence of actions defined as o = (I, π, β, R_o(s), T_o(s′|s)). I ⊆ S is the initiation set of the option. π : S × A → [0, 1] is the policy of the option. β : S → [0, 1] is the termination condition. R_o(s) is the reward model of the option. T_o(s′|s) is the transition model. A major challenge in option learning is that, in general, the number of time steps before the option terminates, k, is a random variable. With this in mind, R_o(s) is defined as the expected cumulative reward of option o given that the option is initiated in state s at time t and ends after k time steps. Letting r_t be the reward received by the agent at t time steps from the beginning of the option,

$$R_o(s) = \mathbb{E}\left[r_1 + \gamma r_2 + \dots + \gamma^{k-1} r_k\right] \tag{1}$$

T_o(s′|s) is the combined probability p_o(s′, k) that option o will terminate at state s′ after k time steps:

$$T_o(s' \mid s) = \sum_k p_o(s', k)\,\gamma^k \tag{2}$$

A crucial benefit of using options is that they can be composed in arbitrary ways. In the next section, we describe how LOF composes them to satisfy logical specifications.
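To make the option tuple and the models in Eqs. 1–2 concrete, here is a minimal sketch of an option container and a Monte-Carlo estimate of R_o(s) and T_o(s′|s) from rollouts. It assumes a gym-style simulator with an illustrative `reset_to(s)` hook; none of these names come from the paper's implementation.

```python
import numpy as np
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

@dataclass
class Option:
    """An option o = (I, pi, beta, R_o, T_o) in the sense of Sutton et al. (1999)."""
    initiation_set: Callable[[object], bool]   # I(s): may the option start in s?
    policy: Callable[[object], object]         # pi(s) -> action
    termination: Callable[[object], float]     # beta(s): probability of terminating in s
    reward_model: Dict[object, float] = field(default_factory=dict)                 # R_o(s), Eq. (1)
    transition_model: Dict[Tuple[object, object], float] = field(default_factory=dict)  # T_o(s'|s), Eq. (2)

def estimate_option_models(env, option, states, gamma=0.99, n_rollouts=100):
    """Monte-Carlo estimates of R_o(s) and T_o(s'|s) from rollouts (Eqs. 1-2).

    `env.reset_to(s)` and gym-style `env.step(a)` are assumed simulator hooks,
    not part of the paper.
    """
    R_o, T_o = {}, {}
    for s in states:
        returns, discounted_terminations = [], {}
        for _ in range(n_rollouts):
            state, ret, disc = env.reset_to(s), 0.0, 1.0
            # run the option's policy until beta says to terminate
            while np.random.rand() >= option.termination(state):
                state, r, done, _ = env.step(option.policy(state))
                ret += disc * r            # accumulates r_1 + gamma*r_2 + ...
                disc *= gamma
                if done:
                    break
            returns.append(ret)
            # accumulate gamma^k mass on the termination state s' (Eq. 2)
            discounted_terminations[state] = discounted_terminations.get(state, 0.0) + disc
        R_o[s] = float(np.mean(returns))
        T_o.update({(s_prime, s): mass / n_rollouts
                    for s_prime, mass in discounted_terminations.items()})
    option.reward_model, option.transition_model = R_o, T_o
    return option
```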
3. Logical Options Framework

Here is a brief overview of how we will present our formulation of LOF:

1. The LTL formula is decomposed into liveness and safety properties. The liveness property defines the task specification and the safety property defines the costs for violating rules.
2. The propositions are divided into subgoals, safety propositions, and event propositions. Each subgoal is associated with its own option, whose goal is to achieve that subgoal. Safety propositions are used to define rules. Event propositions serve as control flow variables that affect the task.
3. We define an SMDP that is the product of a low-level MDP and a high-level logical FSA.
4. We define the logical options.
5. We present an algorithm for finding the hierarchically optimal policy on the SMDP.
6. We state conditions under which satisfaction of the LTL specification is guaranteed, and we prove that the planning algorithm converges to an optimal policy by showing that the hierarchically optimal SMDP policy is the same as the optimal MDP policy.

The Logic Formula: LTL formulae can be translated into Büchi automata using automatic translation tools such as SPOT (Duret-Lutz et al., 2016). All Büchi automata can be decomposed into liveness and safety properties (Alpern & Schneider, 1987). We assume here that the LTL formula itself can be divided into liveness and safety formulae, φ = φ_liveness ∧ φ_safety. For the case where the LTL formula cannot be factored, see App. A. The liveness property describes things that must happen to satisfy the LTL formula. It is a task specification and is used in planning to determine which subgoals the agent must achieve. The safety property describes things that can never happen and is used to define costs for violating the rules. In LOF, the liveness property is written using a finite-trace subset of LTL called syntactically co-safe LTL (Bhatia et al., 2010), in which □ ("always") is not allowed and ◯, U, and ◊ are only used in positive normal form. This way, the liveness property can be satisfied by finite sequences of propositions, so the property can be represented as an FSA.

Propositions: Propositions are boolean-valued truth statements corresponding to goals, objects, and events in the environment. We distinguish between three types of propositions: subgoals P_G, safety propositions P_S, and event propositions P_E. Subgoals must be achieved in order to satisfy the liveness property. They are associated with goals such as "the agent is at the grocery store". They only appear in φ_liveness. Each subgoal may only be associated with one state. Note that in general, it may be impossible to avoid having subgoals appear in φ_safety; App. A describes how to deal with this scenario. Safety propositions are propositions that the agent must avoid, for example, driving into a lake. They only appear in φ_safety. Event propositions are not goals, but they can affect the task specification, for example, whether or not a phone call is received. They may occur in φ_liveness, and, with extensions described in App. A, in φ_safety. In the fully observable setting, event propositions are somewhat trivial because the agent knows exactly when/if the event will occur, but in the partially observable setting, they enable complex control flow. Our optimality guarantees only apply in the fully observable setting; however, LOF's properties of satisfaction and composability still apply in the partially observable setting. The goal state of the liveness FSA must be reachable from every other state using only subgoals. This means that no matter what event propositions occur, it must be possible for the agent to satisfy the liveness property. T_PG : S → 2^{P_G} and T_PS : S → 2^{P_S} relate states to the subgoal and safety propositions that are true at that state. T_PE : 2^{P_E} → {0, 1} assigns truth labels to the event propositions.
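As a concrete illustration of the three labeling functions, the sketch below hard-codes them for a hypothetical delivery grid. The cell coordinates, the obstacle cost, and the treatment of T_PE as an indicator over event-proposition assignments are assumptions made for the example, not taken from the paper.

```python
# Illustrative labeling functions for a hypothetical 2D delivery grid.
# The coordinates and cost values below are invented for the example;
# the paper's actual domain definitions are in its App. C.

SUBGOAL_CELLS = {(0, 4): "a", (4, 4): "b", (4, 0): "c", (0, 0): "h"}  # delivery sites + home
OBSTACLE_CELLS = {(2, 2), (2, 3)}                                     # "lake" / obstacle cells

def T_PG(state):
    """T_PG : S -> 2^PG, the subgoal propositions true at `state`."""
    return {SUBGOAL_CELLS[state]} if state in SUBGOAL_CELLS else set()

def T_PS(state):
    """T_PS : S -> 2^PS, the safety propositions true at `state`."""
    return {"o"} if state in OBSTACLE_CELLS else set()

def T_PE(event_assignment):
    """T_PE : 2^PE -> {0, 1}: a truth label over the event propositions.
    Here the only event proposition is `can`, and we assume it never occurs."""
    return 1.0 if event_assignment == frozenset() else 0.0

def R_S(safety_props, cost=-10.0):
    """R_S : 2^PS -> R, the cost for violating safety propositions."""
    return cost * len(safety_props)
```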
Hierarchical SMDP: LOF defines a hierarchical semi-Markov Decision Process (SMDP), learns the options, and plans over them. The high level of the SMDP is an FSA specified with LTL. The low level is an environment MDP. We assume that the LTL specification φ can be decomposed into a liveness property φ_liveness and a safety property φ_safety. The propositions P are the union of the subgoals P_G, safety propositions P_S, and event propositions P_E. We assume that the liveness property can be translated into an FSA T = (F, P, T_F, R_F, f_0, f_g). F is the set of automaton states; P is the set of propositions; T_F is the transition function relating the current state and proposition to the next state, T_F : F × P × F → [0, 1]. In practice, T_F is deterministic despite our use of probabilistic notation. We assume that there is a single initial state f_0 and final state f_g, and that the goal state f_g is reachable from every state f ∈ F using only subgoals. The reward function assigns a reward to every FSA state, R_F : F → ℝ. In our experiments, the safety property takes the form ⋀_{p_s ∈ P_S} □¬p_s, which means that no safety proposition is allowed, and the safety propositions have associated costs, R_S : 2^{P_S} → ℝ. φ_safety is not limited to this form; App. A covers the general case. There is a low-level environment MDP E = (S, A, R_E, T_E, γ). S is the state space and A is the action space. They can be discrete or continuous. R_E : S × A → ℝ is a low-level reward function that characterizes, for example, distance or actuation costs. The shaped reward R̄_E combines the safety reward function R_S and R_E, e.g. R̄_E(s, a) = R_E(s, a) + R_S(T_PS(s)). The transition function of the environment is T_E : S × A × S → [0, 1].

From these parts we define a hierarchical SMDP M = (S × F, A, P, O, T_E ∘ T_P ∘ T_F, R_SMDP, γ). The hierarchical state space contains two elements: low-level states S and FSA states F. The action space is A. The set of propositions is P. The set of options (one option associated with each subgoal in P_G) is O. The transition function consists of the low-level environment transitions T_E and the FSA transitions T_F. T_P = T_PG × T_PS × T_PE. We call T_P, relating states to propositions, a transition function because it determines when FSA transitions occur. The transitions are applied in the order T_E, T_P, T_F. The reward function is R_SMDP(f, s, o) = R_F(f) R_o(s), so R_F(f) is a weighting on the option rewards. The SMDP has the same discount factor γ as E. Planning is done on the SMDP in two steps: first, the options O are learned over E using an appropriate policy-learning algorithm such as PPO or Reward Machines. Next, a meta-policy over the task specification T is found using the learned options and the reward function R_SMDP.

Algorithm 1: Learning and Planning with Logical Options
1: Given: propositions P partitioned into subgoals P_G, safety propositions P_S, and event propositions P_E; logical FSA T = (F, P_G ∪ P_E, T_F, R_F, f_0, f_g) derived from φ_liveness; low-level MDP E = (S, A, R̄_E, T_E, γ), where R̄_E(s, a) = R_E(s, a) + R_S(T_PS(s)) combines the environment and safety rewards; proposition labeling functions T_PG : S → 2^{P_G}, T_PS : S → 2^{P_S}, and T_PE : 2^{P_E} → {0, 1}
2: To learn:
3:   Set of options O, one for each subgoal p ∈ P_G
4:   Meta-policy µ(f, s, o), Q(f, s, o), and V(f, s)
5: Learn logical options:
6: for p ∈ P_G do
7:   Learn an option that achieves p, o_p = (I_{o_p}, π_{o_p}, β_{o_p}, R_{o_p}(s), T_{o_p}(s′|s))
8:   I_{o_p} = S
9:   β_{o_p}(s) = 1 if p ∈ T_PG(s), 0 otherwise
10:  π_{o_p} = optimal policy on E with rollouts terminating when p ∈ T_PG(s)
11:  T_{o_p}(s′|s) = E[γ^k] if p ∈ T_PG(s′), where k is the number of time steps to reach p; 0 otherwise
12:  R_{o_p}(s) = E[R̄_E(s, a_1) + γ R̄_E(s_1, a_2) + ... + γ^{k−1} R̄_E(s_{k−1}, a_k)]
13: end for
14: Find a meta-policy µ over the options:
15: Initialize Q : F × S × O → ℝ and V : F × S → ℝ to 0
16: for (k, f, s) ∈ [1, ..., n] × F × S do
17:   for o ∈ O do
18:     Q_k(f, s, o) ← R_F(f) R_o(s) + Σ_{f′∈F} Σ_{p_e∈2^{P_E}} Σ_{s′∈S} T_F(f′ | f, T_P(s′), p_e) T_PE(p_e) T_o(s′|s) V_{k−1}(f′, s′)
19:   end for
20:   V_k(f, s) ← max_{o∈O} Q_k(f, s, o)
21: end for
22: µ(f, s, o) = argmax_{o∈O} Q(f, s, o)
23: Return: options O, meta-policy µ(f, s, o), and Q- and value functions Q(f, s, o), V(f, s)
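The option-learning stage of Alg. 1 (lines 5–13) could look roughly like the following sketch, which uses tabular Q-learning with rollouts that terminate at the subgoal and then estimates R_o and T_o as in lines 11–12. The environment interface (`reset()`, `step()`, `step_from()`, `states`, `actions`, `horizon`) and the deterministic model estimate are illustrative assumptions; the paper's actual implementations (Q-learning, PPO) are described in its App. C.

```python
import numpy as np
from collections import defaultdict

def learn_logical_option(env, p, T_PG, n_episodes=2000, alpha=0.1, gamma=1.0, eps=0.1):
    """Sketch of Alg. 1, lines 5-13, for one subgoal p: tabular Q-learning on the
    environment MDP with rollouts that terminate when p becomes true.
    `env` is assumed to expose gym-style reset()/step() and discrete `actions`."""
    Q = defaultdict(lambda: np.zeros(len(env.actions)))
    for _ in range(n_episodes):
        s = env.reset()
        while p not in T_PG(s):                        # beta_o(s) = 1 iff p in T_PG(s)
            a = (np.random.randint(len(env.actions)) if np.random.rand() < eps
                 else int(np.argmax(Q[s])))
            s2, r_bar, done, _ = env.step(a)           # r_bar = R_E + R_S (safety-shaped reward)
            target = r_bar + gamma * np.max(Q[s2]) * (p not in T_PG(s2))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
            if done:
                break
    pi = {s: int(np.argmax(q)) for s, q in Q.items()}  # pi_o: the option's greedy policy

    # Option models (Alg. 1, lines 11-12), estimated by rolling out pi from every state.
    # `env.step_from(s, a)` is an assumed deterministic one-step helper.
    R_o, T_o = {}, {}
    for s0 in env.states:
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(env.horizon):                   # bounded rollout for the estimate
            if p in T_PG(s):
                break
            s, r_bar, done, _ = env.step_from(s, pi.get(s, 0))
            ret += disc * r_bar
            disc *= gamma
        R_o[s0], T_o[s0] = ret, (s, disc)              # T_o: (subgoal state reached, gamma^k)
    return pi, R_o, T_o
```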
Logical Options: The first step of Alg. 1 is to learn the logical options. We associate every subgoal p with an option o_p = (I_{o_p}, π_{o_p}, β_{o_p}, R_{o_p}, T_{o_p}). These terms are defined starting at Alg. 1 line 5. One assumption we make is that the initiation set of every option is the entire state space S. This assumption can be easily loosened as long as the liveness property does not require an infeasible sequence of options; e.g., if the task is to go to Room A and then Room B, the initiation set of the Room B option must include Room A. Every o_p has a policy π_{o_p} whose goal is to reach the state s_p where p is true. Options are learned by training on the environment MDP E and terminating only when s_p is reached. As we discuss in Sec. 3.1, under certain conditions the optimal option policy is guaranteed to always terminate at the subgoal. This allows us to simplify the transition model of Eq. 2 to the form in Alg. 1 line 11. In the experiments, we further simplify this expression by setting γ = 1.

Logical Value Iteration: After finding the logical options, the next step is to find a meta-policy for FSA T over the options (see Alg. 1 line 14). Q- and value functions are found for the SMDP using the Bellman update equations:

$$Q_k(f, s, o) \leftarrow R_F(f)\,R_o(s) + \sum_{f' \in F}\ \sum_{p_e \in 2^{P_E}}\ \sum_{s' \in S} T_F(f' \mid f, T_{P_G}(s'), p_e)\, T_{P_E}(p_e)\, T_o(s' \mid s)\, V_{k-1}(f', s') \tag{3}$$

$$V_k(f, s) \leftarrow \max_{o \in O} Q_k(f, s, o) \tag{4}$$

Eq. 3 differs from the generic equations for SMDP value iteration in that the transition function has two extra components, T_F(f′ | f, T_P(s′), p_e) and T_PE(p_e), summed over f′ ∈ F and p_e ∈ 2^{P_E}. The equations are derived from Araki et al. (2019) and the fact that, on every step in the environment, three transitions are applied: the option transition T_o, the event proposition transition T_PE, and the FSA transition T_F. Note that R_o(s) and T_o(s′|s) compress the consequences of choosing an option o at a state s from a multi-step trajectory into two real-valued numbers, allowing for more efficient planning.
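A direct, if inefficient, rendering of the Bellman updates in Eqs. 3–4 (Alg. 1, lines 14–22) is sketched below for small discrete F, S, and P_E. The data layout (dictionaries keyed by states and options, callables for the transition and labeling functions) is an assumption made for readability, not the paper's implementation.

```python
import numpy as np
from itertools import combinations

def powerset(props):
    """All subsets of the event propositions, i.e. the elements of 2^{P_E}."""
    return [frozenset(c) for r in range(len(props) + 1) for c in combinations(props, r)]

def logical_value_iteration(F, S, options, T_F, T_P, T_PE, R_F, P_E, n_iters=50):
    """Sketch of the meta-policy Bellman updates (Eqs. 3-4 / Alg. 1, lines 14-22).
    `options` maps option id -> (R_o, T_o), where R_o[s] is a scalar and
    T_o[s] is a dict {s': discounted termination probability}."""
    O = list(options)
    V = {(f, s): 0.0 for f in F for s in S}
    Q = {(f, s, o): 0.0 for f in F for s in S for o in O}
    for _ in range(n_iters):
        V_prev = dict(V)                                 # synchronous update: use V_{k-1}
        for f in F:
            for s in S:
                for o in O:
                    R_o, T_o = options[o]
                    backup = 0.0
                    for s2, prob in T_o[s].items():      # sum over s'
                        for p_e in powerset(P_E):        # sum over event assignments
                            for f2 in F:                 # sum over next FSA states
                                backup += (T_F(f2, f, T_P(s2), p_e)
                                           * T_PE(p_e) * prob * V_prev[(f2, s2)])
                    Q[(f, s, o)] = R_F(f) * R_o[s] + backup
                V[(f, s)] = max(Q[(f, s, o)] for o in O)
    mu = {(f, s): max(O, key=lambda o: Q[(f, s, o)]) for f in F for s in S}
    return mu, Q, V
```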
3.1. Conditions for Satisfaction and Optimality

Here we give an overview of the proofs and necessary conditions for satisfaction and optimality. The full proofs and definitions are in App. B.

First, we describe the condition for an optimal option to always reach its subgoal. Let π′(s|s′) be the optimal goal-conditioned policy for reaching a goal s′. If the optimal option policy equals the goal-conditioned policy for reaching the subgoal s_g, i.e. π*(s) = π_g(s|s_g), then the option will always reach the subgoal. This can be stated in terms of value functions: let V^{π′}(s|s′) be the expected return of π′(s|s′). If V^{π_g}(s|s_g) > V^{π′}(s|s′) for all s and all s′ ≠ s_g, then π*(s) = π_g(s|s_g). This occurs, for example, if −∞ < R̄_E(s, a) < 0 and if the episode terminates when the agent reaches s_g. Then V^{π_g} is a bounded negative number, and V^{π′} for all other states is −∞. We show that if every option is guaranteed to achieve its subgoal, then there must exist at least one sequence of options that satisfies the specification.

We then give the condition for the hierarchically optimal meta-policy µ*(s) to always achieve the FSA goal state f_g. In our context, hierarchical optimality means that the meta-policy is optimal over the available options. Let µ′(f, s|f′) be the hierarchically optimal goal-conditioned meta-policy for reaching FSA state f′. If the hierarchically optimal meta-policy equals the goal-conditioned meta-policy for reaching the FSA goal state f_g, i.e. µ*(f, s) = µ_g(f, s|f_g), then µ*(f, s) will always reach f_g. In terms of value functions: let V^{µ′}(f, s|f′) be the expected return for µ′. If V^{µ_g}(f, s|f_g) > V^{µ′}(f, s|f′) for all f, s, and f′ ≠ f_g, then µ* = µ_g. This occurs if all FSA rewards R_F(f) > 0, all environment rewards satisfy −∞ < R̄_E(s, a) < 0, and the episode only terminates when the agent reaches f_g. Then V^{µ_g} is a bounded negative number, and V^{µ′} for all other states is −∞. Because LOF uses the Bellman update equations to learn the meta-policy, the LOF meta-policy will converge to the hierarchically optimal meta-policy.

Consider the SMDP where planning is allowed over low-level actions, and let us call it the hierarchical MDP (HMDP), with optimal policy π*_HMDP. Our result is:

Theorem 3.1. Given that the conditions for satisfaction and hierarchical optimality are met, the LOF hierarchically optimal meta-policy µ_g with optimal option sub-policies π_g has the same expected returns as the optimal policy π*_HMDP and satisfies the task specification.

3.2. Composability

The results in Sec. 3.1 guarantee that LOF's learned model can be composed to satisfy new tasks. Furthermore, the composed policy has the same properties as the original policy: satisfaction and optimality. LOF's possession of composability along with satisfaction and optimality derives from two facts: 1) options are inherently composable because they can be executed in any order; 2) if the conditions of Thm. 3.1 are met, LOF is guaranteed to find a (hierarchically) optimal policy over the options that will satisfy any liveness property that uses subgoals associated with the options. The composability of LOF distinguishes it from other algorithms that can achieve satisfaction and optimality.
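In code, composability amounts to re-running the meta-policy backup against a new FSA while leaving the learned options untouched, and then executing options according to the new meta-policy. The sketch below builds on the hypothetical `logical_value_iteration` routine from the previous section; the `new_fsa` bundle, the `fsa.step` helper, and the convention that an option is named after its subgoal proposition are all illustrative assumptions.

```python
# Composability, as a sketch: once the options (and their models R_o, T_o) are learned,
# satisfying a new task only requires re-running the meta-policy backup on the new FSA.

def recompose(new_fsa, S, options, T_P, T_PE, P_E, n_iters=50):
    """Re-plan over a new liveness FSA without touching the environment.
    `new_fsa` is assumed to bundle (F, T_F, R_F, f0, fg) for the new task."""
    mu, _, _ = logical_value_iteration(
        F=new_fsa.F, S=S, options=options,
        T_F=new_fsa.T_F, T_P=T_P, T_PE=T_PE,
        R_F=new_fsa.R_F, P_E=P_E, n_iters=n_iters)
    return mu  # new meta-policy over the same, unchanged options

def execute(env, fsa, mu, option_policies, T_PG, T_P, max_steps=500):
    """Run the composed policy: follow the chosen option until its subgoal holds,
    then advance the FSA and pick the next option."""
    f, s, steps = fsa.f0, env.reset(), 0
    while f != fsa.fg and steps < max_steps:
        o = mu[(f, s)]                            # options are named after their subgoals
        while o not in T_PG(s) and steps < max_steps:
            s, _, done, _ = env.step(option_policies[o](s))
            steps += 1
            if done:
                return f
        f = fsa.step(f, T_P(s))                   # assumed helper applying T_F deterministically
    return f
```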
4. Experiments & Results

Experiments: We performed experiments to demonstrate satisfaction and composability. (Code for the discrete domain experiments is available at https://github.com/braraki/logical-options-framework; code for the other domains is available in the supplementary material.) For satisfaction, we measure cumulative reward over training steps. Cumulative reward is a proxy for satisfaction, as the environments can only achieve the maximum reward when they satisfy their tasks. For the composability experiments, we take the trained options and record how many meta-policy retraining steps it takes to learn an optimal meta-policy for a new task.

Environments: We measure the performance of LOF on three environments. The first environment is a discrete gridworld (Fig. 3a) called the delivery domain, as it can represent a delivery truck delivering packages to three locations (a, b, c) and having a home base h. There are also obstacles o (the black squares). The second environment is called the reacher domain, from OpenAI Gym (Fig. 3d). It is a two-link arm that has continuous state and action spaces. There are four subgoals represented by colored balls: red r, green g, blue b, and yellow y. The third environment is called the pick-and-place domain, and it is a continuous 3D environment with a robotic Panda arm from CoppeliaSim and PyRep (James et al., 2019). It is inspired by the lunchbox-packing experiments of Araki et al. (2019), in which subgoals r, g, and b are food items that must be packed into lunchbox y. All environments also have an event proposition called can, which represents when the need to fulfill part of a task is cancelled.

Tasks: We test satisfaction and composability on four tasks. The first task is a sequential task. For the delivery domain, the LTL formula is ◊(a ∧ ◊(b ∧ ◊(c ∧ ◊h))) ∧ □¬o: "deliver package a, then b, then c, and then return to home h. And always avoid obstacles." The next task is the IF task (equivalent to the task shown in Fig. 1b): ((◊(c ∧ ◊a) ∧ ¬◊can) ∨ (◊c ∧ ◊can)) ∧ □¬o: "deliver package c, and then a, unless a gets cancelled. And always avoid obstacles." We call the third task the OR task, ◊((a ∨ b) ∧ ◊c) ∧ □¬o: "deliver package a or b, then c, and always avoid obstacles." The composite task has elements of all three of the previous tasks: ((◊((a ∨ b) ∧ ◊(c ∧ ◊h)) ∧ ¬◊can) ∨ (◊((a ∨ b) ∧ ◊h) ∧ ◊can)) ∧ □¬o: "deliver package a or b, and then c, unless c gets cancelled, and then return to home h. And always avoid obstacles." The tasks for the reacher and pick-and-place environments are equivalent, except that there are no obstacles for the reacher and arm to avoid.

Figure 2. In this environment, the agent must either pick up the kid or go grocery shopping, and then go home (the OR task). Starting at S0, the greedy algorithm picks the next step in the FSA with the lowest cost (picking up the kid), which leads to a higher overall cost. LOF finds the optimal path through the FSA.

The sequential task is meant to show that planning is efficient and effective even for long-time-horizon tasks. The IF task shows that the agent's policy can respond to event propositions, such as being alerted that a delivery is cancelled. The OR task is meant to demonstrate the optimality of our algorithm versus a greedy algorithm, as discussed in Fig. 2. Lastly, the composite task shows that learning and planning are efficient and effective even for complex tasks.
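The paper derives its automata from LTL automatically with SPOT, but for intuition, the sequential delivery task's FSA T = (F, P, T_F, R_F, f_0, f_g) can be written out by hand as a small transition table. The state names and the constant R_F below are assumptions made for this example, not the paper's exact construction.

```python
# A hand-written FSA for the sequential delivery task F(a & F(b & F(c & F h))),
# with obstacles handled by the safety reward rather than the FSA.
# The paper derives such automata automatically with SPOT; this table is illustrative.

F_STATES = ["f0", "f1", "f2", "f3", "fg"]           # need a, then b, then c, then h
F0, FG = "f0", "fg"

def T_F(f_next, f, subgoals_true, event_props=frozenset()):
    """Deterministic transition relation T_F(f' | f, T_PG(s'), p_e) as an indicator.
    This task uses no event propositions, so `event_props` is ignored."""
    order = {"f0": "a", "f1": "b", "f2": "c", "f3": "h"}
    if f == "fg":
        return float(f_next == "fg")                # goal state is absorbing
    if order[f] in subgoals_true:                   # achieved the next required subgoal
        advanced = F_STATES[F_STATES.index(f) + 1]
        return float(f_next == advanced)
    return float(f_next == f)                       # otherwise stay in the same FSA state

def R_F(f):
    """FSA state reward; any positive weighting satisfies the optimality condition."""
    return 1.0
```

For example, T_F("f1", "f0", {"a"}) evaluates to 1.0, while T_F("f1", "f0", set()) evaluates to 0.0, so the automaton only advances when the required subgoal proposition becomes true.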
Baselines: We test four baselines against our algorithm. Our algorithm is LOF-VI, short for Logical Options Framework with Value Iteration, because it uses value iteration for high-level planning. LOF-QL uses Q-learning instead (details are in App. C.3). Unlike LOF-VI, LOF-QL does not need explicit knowledge of T_F, the FSA transition function. Greedy is a naive implementation of task satisfaction; it uses its knowledge of the FSA to select the next subgoal with the lowest cost to attain. This leaves it vulnerable to choosing suboptimal paths through the FSA, as shown in Fig. 2. Flat Options uses the options framework with no knowledge of the FSA. Its SMDP formulation is not hierarchical: the state space and transition function do not contain high-level states F or the transition function T_F. The last baseline is RM, short for Q-Learning for Reward Machines (Icarte et al., 2018). Whereas LOF learns options to accomplish subgoals, RM learns sub-policies for every FSA state. App. C.4 discusses the differences between RM and LOF in detail.

Implementation: For the delivery domain, options were learned using Q-learning with an ε-greedy exploration policy. RM was learned using the Q-Learning for Reward Machines (QRM) algorithm described in Icarte et al. (2018). For the reacher and pick-and-place domains, options were learned by using proximal policy optimization (PPO) (Schulman et al., 2017) to train goal-oriented policy and value functions, which were represented using 128 × 128 and 128 × 128 × 128 fully connected neural networks, respectively. Deep-QRM was used to train RM. The implementation details are discussed more fully in App. C.

Figure 3. Performance on the satisfaction and composability experiments, averaged over all tasks. The first row is the delivery domain (a: domain, b: satisfaction performance, c: composability performance), the second row is the reacher domain (d–f), and the third row is the pick-and-place domain (g–i). Note that LOF-VI composes new meta-policies in just 10–50 retraining steps. All results, including RM performance on the reacher and pick-and-place domains, are in App. C.6.

4.1. Results

Satisfaction: Results for the satisfaction experiments, averaged over all four tasks, are shown in Figs. 3b, 3e, and 3h. (Results on all tasks are in App. C.6.) As expected, Flat Options shows no ability to satisfy tasks, as it has no knowledge of the FSAs. Greedy trains as quickly as LOF-VI and LOF-QL, but its returns plateau before the others because it chooses suboptimal paths in the composite and OR tasks. The difference is small in the continuous domains but still present. LOF-QL achieves as high a return as LOF-VI, but it is less composable (discussed below). RM learns much more slowly than the other methods. This is because for RM, a reward is only given for reaching the goal state, whereas in the LOF-based methods, options are rewarded for reaching their subgoals, so during training LOF-based methods have a richer reward function than RM. For the continuous domains, RM takes an order of magnitude more steps to train, so we left it out of the figures for clarity (see App. Figs. 14 and 16). However, in the continuous domains, RM eventually achieves a higher return than the LOF-based methods. This is because for those domains, we define the subgoals to be spherical regions rather than single states, violating one of the conditions for optimality. Therefore, for example, it is possible that the meta-policy does not take advantage of the dynamics of the arm to swing through the subgoals more efficiently. RM does not have this condition and learns a single policy that can take advantage of inter-subgoal dynamics to learn a more optimal policy.

Composability: The composability experiments were done on the three composable baselines, LOF-VI, LOF-QL, and Greedy. App. C.4 discusses why RM is not composable. Flat Options is not composable because its formulation does not include the FSA T; therefore it is completely incapable of recognizing and adjusting to changes in the FSA. The composability results are shown in Figs. 3c, 3f, and 3i. Greedy requires no retraining steps to learn a meta-policy on a new FSA: given its current FSA state, it simply chooses the next available FSA state that has the lowest cost to achieve. However, its meta-policy may be arbitrarily suboptimal.
LOF-QL learns optimal (or, in the continuous case, close-to-optimal) policies, but it takes 50–250 retraining steps, versus 10–50 for LOF-VI. Therefore LOF-VI strikes a balance between Greedy and LOF-QL, requiring far fewer steps than LOF-QL to retrain, and achieving better performance than Greedy.

5. Related Work

We distinguish our work from related work in HRL by its possession of three desirable properties: composability, satisfaction, and optimality. Most other works possess two of these properties at the cost of the third.

Not Composable: The previous work most similar to ours is Icarte et al. (2018), which introduces a framework to solve tasks defined by automata called Reward Machines. Their algorithm, Q-Learning for Reward Machines, learns a sub-policy for every state of the automaton that achieves satisfaction and optimality. However, the learned sub-policies have limited composability because they end up learning a specific path through the automaton, and if the structure of the automaton is changed, there is no guarantee that the sub-policies will be able to satisfy the new automaton without re-training. By contrast, LOF learns a sub-policy for every subgoal, independent of the automaton, and therefore the sub-policies can be arranged to satisfy arbitrary tasks. Another similar work is Logical Value Iteration (LVI) (Araki et al., 2019; 2020). LVI defines a hierarchical MDP and value iteration equations that find satisfying and optimal policies; however, the algorithm is limited to discrete domains and has limited composability. A number of HRL algorithms use reward shaping to guide the agent through the states of an automaton (Li et al., 2017; 2019; Camacho et al., 2019; Hasanbeig et al., 2018; Jothimurugan et al., 2019; Shah et al., 2020; Yuan et al., 2019; De Giacomo et al., 2019). While these algorithms can guarantee satisfaction and sometimes optimality, they cannot be composed because their policies are not hierarchical. Another approach is to use a symbolic planner to find a satisfying sequence of tasks and use an RL agent to learn and execute that sequence of tasks (Gordon et al., 2019; Illanes et al., 2020; Lyu et al., 2019). However, the meta-controllers of Gordon et al. (2019) and Lyu et al. (2019) are not composable, as they are trained together with the low-level controllers. Although the work of Illanes et al. (2020) is amenable to transfer learning, it is not composable. Paxton et al. (2017) and Mason et al. (2017) use logical constraints to guide exploration, and while these approaches are also satisfying and optimal, they are not composable, as the agent is trained for a specific set of rules. LOF is composable, unlike the above methods, because it has a hierarchical action space with high-level options. Once the options are learned, they can be composed arbitrarily.

Not Satisfying: Most hierarchical frameworks cannot satisfy tasks. Instead, they focus on using state and action abstractions to make learning more efficient (Dietterich, 2000; Dayan & Hinton, 1993; Parr & Russell, 1998; Diuk et al., 2008; Oh et al., 2019). The options framework (Sutton et al., 1999) stands out because of its composability and its guarantee of hierarchical optimality, which is why we based our work on it. There is also a class of HRL algorithms that builds on the idea of goal-oriented policies that can navigate to nearby subgoals (Eysenbach et al., 2019; Ghosh et al., 2018; Faust et al., 2018).
By sampling sequences of subgoals and using a goal-oriented policy to navigate between them, these algorithms can travel much longer distances than a policy can travel on its own. Although these algorithms are composable in that they can navigate to far-away goals without further training, they are not able to solve tasks. Andreas et al. (2017) present an algorithm for solving simple policy sketches which is also composable; however, sketches are considerably less expressive than automata and linear temporal logic, which we use. Unlike the above methods, LOF is satisfying because it has a hierarchical state space with low-level MDP states and high-level FSA states. Therefore LOF can satisfy tasks by learning policies that reach the FSA goal state.

Not Optimal: In HRL, there are at least three types of optimality: hierarchical, recursive, and overall. As defined in Dietterich (2000), the hierarchically optimal policy is the optimal policy given the constraints of the hierarchy, and recursive optimality is when a policy is optimal given the policies of its children. For example, the options framework is hierarchically optimal, while MAXQ and abstract MDPs (Gopalan et al., 2017) are recursively optimal. Icarte et al. (2019) introduce a composable method for learning Reward Machines, but their approach is focused on the reinforcement learning setting and is only guaranteed to learn hierarchically optimal policies. Leon et al. (2020) introduce a neurosymbolic method for generating policies that satisfy logical specifications and can zero-shot generalize to new specifications. However, the symbolic module of their network has no guarantees on being able to optimally satisfy specifications. The method described in Kuo et al. (2020) is fully composable, but not optimal, as it uses a recurrent neural network to generate a sequence of high-level actions and is therefore not guaranteed to find optimal policies. The approach in Kuo et al. (2020) has some advantages in terms of scalability versus our approach, since they use a single deep model to learn policies for all operators and propositions. However, for LOF, a single deep goal-oriented policy could be trained to reach all subgoals, as is done in Leon et al. (2020). We therefore believe that LOF could be made equally scalable as Kuo et al. (2020), although this would be at the cost of losing most of the guarantees of the framework. However, one guarantee that LOF would keep is the hierarchical optimality of the meta-policy, which is a result of finding the meta-policy using value iteration. We therefore believe that LOF should have better performance on unseen specifications than Kuo et al. (2020), because Kuo et al. (2020) has no guarantees for satisfying specifications. LOF is hierarchically optimal because it finds an optimal meta-policy over the high-level options, and, as we state in the paper, there are also conditions under which the overall policy is optimal.

6. Discussion and Conclusion

In this work, we claim that LOF has a unique combination of three properties: satisfaction, optimality, and composability. We state and prove the conditions for satisfaction and optimality in Sec. 3.1. The experimental results confirm our claims while also pointing out some weaknesses. LOF-VI achieves optimal or near-optimal policies and trains an order of magnitude faster than the existing work most similar to it, RM.
However, the optimality condition that each subgoal be associated with one state cannot be met for continuous domains, and therefore RM eventually outperforms LOF-VI. But even when optimality is not guaranteed, LOF-VI is hierarchically optimal, which is why it outperforms Greedy in the composite and OR tasks. Next, the composability experiments show that LOF-VI can compose its learned options to perform new tasks in about 10–50 iterations on the benchmarks. Although Greedy requires no retraining steps, 10–50 retraining iterations is a tiny fraction of the tens of thousands of steps required to learn the original policy. Lastly, we have shown that LOF learns policies efficiently, and that it can be used with a variety of domains and policy-learning algorithms.

Acknowledgments

This material is supported by the Toyota Research Institute within the Toyota-CSAIL joint research center, NSF grant IIS-1723943, ONR N00014-18-1-2830, the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001, and SUTD Project Astralis. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

References

Abel, D. and Winder, J. The expected-length model of options. In IJCAI, 2019.

Alpern, B. and Schneider, F. B. Recognizing safety and liveness. Distributed Computing, 2(3):117–126, 1987.

Andreas, J., Klein, D., and Levine, S. Modular multitask reinforcement learning with policy sketches. In International Conference on Machine Learning, pp. 166–175, 2017.

Araki, B., Vodrahalli, K., Leech, T., Vasile, C. I., Donahue, M., and Rus, D. Learning to plan with logical automata. In Proceedings of Robotics: Science and Systems, Freiburg im Breisgau, Germany, June 2019. doi: 10.15607/RSS.2019.XV.064.

Araki, B., Vodrahalli, K., Leech, T., Vasile, C. I., Donahue, M., and Rus, D. Deep Bayesian nonparametric learning of rules and plans from demonstrations with a learned automaton prior. In AAAI, pp. 10026–10034, 2020.

Baier, C. and Katoen, J. Principles of Model Checking. MIT Press, 2008. ISBN 978-0-262-02649-9.

Bhatia, A., Kavraki, L. E., and Vardi, M. Y. Sampling-based motion planning with temporal goals. In 2010 IEEE International Conference on Robotics and Automation, pp. 2689–2696. IEEE, 2010.

Camacho, A., Icarte, R. T., Klassen, T. Q., Valenzano, R. A., and McIlraith, S. A. LTL and beyond: Formal languages for reward function specification in reinforcement learning. In IJCAI, volume 19, pp. 6065–6073, 2019.

Clarke, E. M., Grumberg, O., and Peled, D. Model Checking. MIT Press, 2001. ISBN 978-0-262-03270-4.

Dayan, P. and Hinton, G. E. Feudal reinforcement learning. In Advances in Neural Information Processing Systems, pp. 271–278, 1993.

De Giacomo, G., Iocchi, L., Favorito, M., and Patrizi, F. Foundations for restraining bolts: Reinforcement learning with LTLf/LDLf restraining specifications. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, pp. 128–136, 2019.

Dietterich, T. G. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.

Diuk, C., Cohen, A., and Littman, M. L. An object-oriented representation for efficient reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 240–247, 2008.

Duret-Lutz, A., Lewkowicz, A., Fauchille, A., Michaud, T., Renault, E., and Xu, L. Spot 2.0: a framework for LTL and ω-automata manipulation. In Proceedings of the 14th International Symposium on Automated Technology for Verification and Analysis (ATVA'16), volume 9938 of Lecture Notes in Computer Science, pp. 122–129. Springer, October 2016. doi: 10.1007/978-3-319-46520-3_8.
Eysenbach, B., Salakhutdinov, R. R., and Levine, S. Search on the replay buffer: Bridging planning and reinforcement learning. In Advances in Neural Information Processing Systems, pp. 15220–15231, 2019.

Faust, A., Oslund, K., Ramirez, O., Francis, A., Tapia, L., Fiser, M., and Davidson, J. PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 5113–5120. IEEE, 2018.

Ghosh, D., Gupta, A., and Levine, S. Learning actionable representations with goal-conditioned policies. arXiv preprint arXiv:1811.07819, 2018.

Gopalan, N., Littman, M. L., MacGlashan, J., Squire, S., Tellex, S., Winder, J., Wong, L. L., et al. Planning with abstract Markov decision processes. In Twenty-Seventh International Conference on Automated Planning and Scheduling, 2017.

Gordon, D., Fox, D., and Farhadi, A. What should I do now? Marrying reinforcement learning and symbolic planning. arXiv preprint arXiv:1901.01492, 2019.

Hasanbeig, M., Abate, A., and Kroening, D. Logically-constrained reinforcement learning. arXiv preprint arXiv:1801.08099, 2018.

Icarte, R. T., Klassen, T., Valenzano, R., and McIlraith, S. Using reward machines for high-level task specification and decomposition in reinforcement learning. In International Conference on Machine Learning, pp. 2107–2116, 2018.

Icarte, R. T., Waldie, E., Klassen, T., Valenzano, R., Castro, M., and McIlraith, S. Learning reward machines for partially observable reinforcement learning. In Advances in Neural Information Processing Systems, pp. 15523–15534, 2019.

Illanes, L., Yan, X., Icarte, R. T., and McIlraith, S. A. Symbolic plans as high-level instructions for reinforcement learning. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 30, pp. 540–550, 2020.

James, S., Freese, M., and Davison, A. J. PyRep: Bringing V-REP to deep robot learning. arXiv preprint arXiv:1906.11176, 2019.

Jothimurugan, K., Alur, R., and Bastani, O. A composable specification language for reinforcement learning tasks. In Advances in Neural Information Processing Systems, pp. 13041–13051, 2019.

Kansou, B. K. A. Converting a subset of LTL formula to Büchi automata. International Journal of Software Engineering & Applications (IJSEA), 10(2), 2019.

Kuo, Y.-L., Katz, B., and Barbu, A. Encoding formulas as deep networks: Reinforcement learning for zero-shot execution of LTL formulas. arXiv preprint arXiv:2006.01110, 2020.

Leon, B. G., Shanahan, M., and Belardinelli, F. Systematic generalisation through task temporal logic and deep reinforcement learning. arXiv preprint arXiv:2006.08767, 2020.

Li, X., Vasile, C.-I., and Belta, C. Reinforcement learning with temporal logic rewards. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3834–3839. IEEE, 2017.

Li, X., Serlin, Z., Yang, G., and Belta, C. A formal methods approach to interpretable reinforcement learning for robotic planning. Science Robotics, 4(37), 2019.

Lyu, D., Yang, F., Liu, B., and Gustafson, S. SDRL: interpretable and data-efficient deep reinforcement learning leveraging symbolic planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 2970–2977, 2019.
Mason, G. R., Calinescu, R. C., Kudenko, D., and Banks, A. Assured reinforcement learning with formally verified abstract policies. In 9th International Conference on Agents and Artificial Intelligence (ICAART), York, 2017.

Oh, Y., Patel, R., Nguyen, T., Huang, B., Pavlick, E., and Tellex, S. Planning with state abstractions for non-Markovian task specifications. arXiv preprint arXiv:1905.12096, 2019.

Parr, R. and Russell, S. J. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, pp. 1043–1049, 1998.

Paxton, C., Raman, V., Hager, G. D., and Kobilarov, M. Combining neural networks and tree search for task and motion planning in challenging environments. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6059–6066. IEEE, 2017.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shah, A., Li, S., and Shah, J. Planning with uncertain specifications (PUnS). IEEE Robotics and Automation Letters, 5(2):3414–3421, 2020.

Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

Yuan, L. Z., Hasanbeig, M., Abate, A., and Kroening, D. Modular deep reinforcement learning with temporal logic specifications. arXiv preprint arXiv:1909.11591, 2019.

Zhang, S. and Sridharan, M. A survey of knowledge-based sequential decision making under uncertainty. arXiv preprint arXiv:2008.08548, 2020.