# Robust Subtask Learning for Compositional Generalization

Kishor Jothimurugan 1, Steve Hsu 1, Osbert Bastani 1, Rajeev Alur 1

1 University of Pennsylvania. Correspondence to: Kishor Jothimurugan.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Compositional reinforcement learning is a promising approach for training policies to perform complex long-horizon tasks. Typically, a high-level task is decomposed into a sequence of subtasks and a separate policy is trained to perform each subtask. In this paper, we focus on the problem of training subtask policies in a way that they can be used to perform any task; here, a task is given by a sequence of subtasks. We aim to maximize the worst-case performance over all tasks as opposed to the average-case performance. We formulate the problem as a two-agent zero-sum game in which the adversary picks the sequence of subtasks. We propose two RL algorithms to solve this game: one is an adaptation of existing multi-agent RL algorithms to our setting and the other is an asynchronous version which enables parallel training of subtask policies. We evaluate our approach on two multi-task environments with continuous states and actions and demonstrate that our algorithms outperform state-of-the-art baselines.

1. Introduction

Reinforcement learning (RL) has proven to be a promising strategy for solving complex control tasks such as walking (Fujimoto et al., 2018), autonomous driving (Ivanov et al., 2021), and dexterous manipulation (Akkaya et al., 2019). However, a key challenge facing the deployment of reinforcement learning in real-world tasks is its high sample complexity: solving any new task requires training a new policy designed to solve that task. One promising solution is compositional reinforcement learning, where individual options (or skills) are first trained to solve simple tasks; then, these options can be composed together to solve more complex tasks (Lee et al., 2018; 2021; Ivanov et al., 2021). For example, if a driving robot learns how to make left and right turns and to drive in a straight line, it can then drive along any route composed of these primitives.

A key challenge facing compositional reinforcement learning is the generalizability of the learned options. In particular, options trained under one distribution of tasks may no longer work well if used in a new task, since the distribution of initial states from which the options are used may shift. An alternate approach is to train the options separately to perform specific subtasks, but options trained this way might cause the system to reach states from which future subtasks are hard to perform. One can overcome this issue by handcrafting rewards to encourage avoiding such states (Ivanov et al., 2021), in which case they generalize well, but this approach relies heavily on human time and expertise.

We propose a novel framework that addresses these challenges by formulating the option learning problem as an adversarial reinforcement learning problem. At a high level, the adversary chooses the task that minimizes the reward achieved by composing the available options. Thus, the goal is to learn a set of robust options that perform well across all potential tasks. Then, we provide two algorithms for solving this problem. The first adapts existing ideas for using reinforcement learning to solve Markov games to our setting.
The second shows how to leverage the compositional structure of our problem to learn options in parallel at each step of a value iteration procedure; in some cases, by enabling such parallelism, we can reduce the computational cost of learning. We validate our approach on two benchmarks: (i) a rooms environment where a point mass robot must navigate any given sequence of rooms, and (ii) a simulated version of the F1/10th car, where a small racing car must navigate any given racetrack that is composed of several different predefined track segments. In both environments, our empirical results demonstrate that robust options are critical for performing well on a wide variety of tasks.

In summary, our contributions are: (i) a game-theoretic formulation of the compositional reinforcement learning problem, (ii) an algorithm for solving this problem in the finite-state setting with guaranteed convergence in the limit, (iii) two learning algorithms for continuous-state systems, and (iv) an empirical evaluation demonstrating the effectiveness of our approach.

Figure 1. F1/10th Environment. (a) The F1/10th car. (b) The segment types (right, left, straight, sharp right, sharp left) and an example track. The entry and exit regions for the right and sharp right segments are shown in green and blue respectively.

Motivating example. Let us consider the small-scale autonomous racing car shown in Figure 1 (a). We would like to train a controller that can be used to navigate the car through all tracks constructed using five kinds of segments; the possible segments are shown in Figure 1 (b) along with an example track. A state of the car is a vector s = (x, y, v, θ) where (x, y) is its position on the track relative to the current segment, v is its current speed, and θ is the heading angle. At any state s, the controller observes the measurements o(s) ∈ R^1081 of a LiDAR sensor and outputs an action (a, ω) ∈ R^2 where a is the throttle input and ω is the steering angle. In this environment, completing each segment is considered a subtask and a task corresponds to completing a sequence of segments (which may describe a track), e.g., straight → right → left → sharp-right. Upon completion of a subtask, the car enters the next segment and a change-of-coordinates¹ is applied to the car's state, which is now relative to the new segment. The goal here is to learn one option for each subtask such that the agent can perform any task using these options.

¹The change-of-coordinates is assumed for simpler modelling and does not affect the LiDAR measurements or the controller.

If one trains the options independently with the only goal of reaching the end of each segment (e.g., using distance-based rewards), it might (and does) happen that the car reaches the end of a segment in a state that was not part of the initial states used to train the policy corresponding to the next segment. Therefore, one should make sure that the initial state distribution used during training includes such states as well, either manually or using dataset aggregation (Ross et al., 2011). Furthermore, it is possible that the car reaches a state in the exit region of a segment from which it is challenging to complete the next subtask, e.g., a state in which the car is close to and facing towards a wall. Our algorithm identifies during training that, in order to perform future subtasks, it is better to reach the end of a segment in a configuration where the car is facing straight relative to the next segment.
Finally, we want the trained options to be such that they compose well for every sequence of segments, i.e., every track geometry. Therefore, we are interested in maximizing the worst-case performance with respect to the choice of high-level task.

Our framework is broadly applicable in many real-world scenarios. For instance, the F1/10th example can be seen as a miniature version of an autonomous driving scenario where the agent needs to learn to perform maneuvers such as turning left/right, changing lanes, etc. Here, the policies for performing these maneuvers need to ensure that the car is in a safe and controllable state for future maneuvers. Another interesting scenario is when a household robot has to perform multiple tasks such as cleaning, cooking, etc., but needs to ensure that the policies for performing these tasks leave the house in a favorable state for future tasks, e.g., learning to cook without making the place too dirty (as it might be hard to perform the cleaning task later).

2. Problem Formulation

A multi-task Markov decision process (MDP) is a tuple M = (S, A, P, Σ, R, F, T, γ, η, σ_0), where S are the states, A are the actions, P(s′ | s, a) ∈ [0, 1] is the probability of transitioning from s to s′ on action a, η is the initial state distribution, and γ ∈ (0, 1) is the discount factor. Furthermore, Σ is a set of subtasks and, for each subtask σ ∈ Σ, R_σ : S × A → R is a reward function², F_σ ⊆ S is a set of final states in which the subtask is considered completed, and T_σ : F_σ × S → [0, 1] is the jump probability function; upon reaching a state s ∈ F_σ, the system jumps to a new state s′ with probability T_σ(s′ | s). For the sake of clarity, we assume³ that T_σ(s′ | s) = 0 for any s′ with s′ ∈ F_σ′ for some σ′. Finally, σ_0 ∈ Σ is the initial subtask which has to be completed first⁴. A multi-task MDP can be viewed as a discrete-time variant of a hybrid automaton model (Ivanov et al., 2021).

²We can also have a reward function R_σ : S × A × S → R that depends on the next state, but we omit it for clarity of presentation.
³This assumption can be removed by adding a fictitious copy of F_σ to S for each σ ∈ Σ.
⁴When there is no fixed initial subtask, we can add a fictitious initial subtask.

In the case of our motivating example, the set of subtasks is given by Σ = {left, right, straight, sharp-left, sharp-right}, with F_σ denoting the exit region of the segment corresponding to subtask σ. We use the jump transitions T to model the change-of-coordinates performed upon reaching an exit region. The reward function R_σ for a subtask σ is given by

R_σ(s, a, s′) = −‖s′ − c_σ‖²₂ + B · 1(s′ ∈ F_σ),

where c_σ is the center of the exit region and the subtask completion bonus B is a positive constant.

A task τ is defined to be an infinite sequence⁵ of subtasks τ = σ_0 σ_1 . . ., and 𝒯 denotes the set of all tasks. For any task τ ∈ 𝒯, τ[i] denotes the i-th subtask σ_i in τ. In our setting, the task is chosen by the environment nondeterministically. Given a task τ, a configuration of the environment is a pair (s, i) ∈ S × Z_{≥0} with s ∉ F_{τ[i]}, denoting that the system is in state s and the current subtask is τ[i]. The initial distribution over configurations assigns probability η(s) to the configuration (s, 0) and probability 0 to every other configuration. The probability of transitioning from (s, i) to (s′, j) on an action a is given by

Pr((s′, j) | (s, i), a) =
    ∑_{s′′ ∈ F_{τ[i]}} P(s′′ | s, a) · T_{τ[i]}(s′ | s′′)    if j = i + 1,
    P(s′ | s, a)                                              if j = i and s′ ∉ F_{τ[i]},
    0                                                          otherwise.

Intuitively, the system transitions to the next subtask if the current subtask is completed and stays in the current subtask otherwise.

⁵A finite sequence can be appended with an infinite sequence of fictitious subtasks with zero reward.
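For concreteness, the following is a minimal Python sketch of a finite multi-task MDP and one step of the configuration-level dynamics defined above. The container name `MultiTaskMDP` and the helpers `sample` and `step_configuration` are illustrative assumptions of ours and are not taken from the paper or its released implementation.

```python
# Minimal sketch of the multi-task MDP interface from Section 2 for finite
# state/action spaces. Names are illustrative, not the paper's code.
import random
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
Subtask = str

@dataclass
class MultiTaskMDP:
    states: List[State]
    actions: List[Action]
    P: Callable[[State, Action], Dict[State, float]]        # P(. | s, a)
    subtasks: List[Subtask]
    R: Dict[Subtask, Callable[[State, Action], float]]      # R_sigma(s, a)
    F: Dict[Subtask, set]                                    # final states F_sigma
    T: Dict[Subtask, Callable[[State], Dict[State, float]]]  # jump probabilities T_sigma(. | s)
    gamma: float
    eta: Dict[State, float]                                  # initial state distribution
    sigma0: Subtask

def sample(dist: Dict[State, float]) -> State:
    """Draw a sample from a finite distribution represented as a dict."""
    xs, ps = zip(*dist.items())
    return random.choices(xs, weights=ps, k=1)[0]

def step_configuration(mdp: MultiTaskMDP, task: List[Subtask],
                       config: Tuple[State, int], a: Action) -> Tuple[Tuple[State, int], float]:
    """One step of the configuration-level dynamics: act under subtask task[i];
    if the next state is final for that subtask, apply the jump transition
    (e.g., the change of coordinates) and advance to subtask i + 1."""
    s, i = config
    sigma = task[i]
    r = mdp.R[sigma](s, a)
    s_next = sample(mdp.P(s, a))
    if s_next in mdp.F[sigma]:                  # current subtask completed
        s_next = sample(mdp.T[sigma](s_next))   # jump transition
        return (s_next, i + 1), r
    return (s_next, i), r
```

Rolling out a fixed task τ with a set of subtask policies then amounts to repeatedly calling `step_configuration` with a = π_{τ[i]}(s).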
A (deterministic) policy is a function π : S → A, where a = π(s) is the action to take in state s. Our goal is to learn one policy π_σ for each subtask σ such that the discounted reward over the worst-case task τ is maximized. Formally, given a set of policies Π = {π_σ | σ ∈ Σ} and a task τ, we can define a Markov chain over configurations with the initial distribution defined above and transition probabilities given by P_Π((s′, j) | (s, i)) = Pr((s′, j) | (s, i), π_{τ[i]}(s)). We denote by D^Π_τ the distribution over infinite sequences of configurations ρ = (s_0, i_0)(s_1, i_1) . . . generated by τ and Π. Then, we define the objective function as

J(Π) = inf_{τ ∈ 𝒯} E_{ρ ∼ D^Π_τ} [ ∑_{t=0}^∞ γ^t R_{τ[i_t]}(s_t, π_{τ[i_t]}(s_t)) ].

These definitions can be naturally extended to stochastic policies as well. In our motivating example, choosing a large enough completion bonus B guarantees the discounted reward to be higher for runs in which more subtasks are completed. Our aim is to compute a set of policies Π* ∈ argmax_Π J(Π).

Each subtask policy π_σ defines an option (Sutton et al., 1999) o_σ = (π_σ, I_σ, β_σ) where I_σ = S \ F_σ and β_σ(s) = 1(s ∈ F_σ). Here, the choice of which option to trigger is made by the environment rather than the agent.

3. Reduction to Stagewise Markov Games

The problem statement naturally leads to a game-theoretic view in which the environment is the adversary. We can formally reduce the problem to a two-agent zero-sum Markov game G = (S̄, A_1, A_2, P̄, R̄, γ̄, η̄), where S̄ = S × Σ is the set of states, A_1 = A are the actions of agent 1 (the agent representing the options) and A_2 = Σ are the actions of agent 2 (the adversary). The transition probability function P̄ : S̄ × A_1 × A_2 × S̄ → [0, 1] is given by

P̄((s′, σ′) | (s, σ), a_1, a_2) =
    P(s′ | s, a_1)    if s ∉ F_σ and σ′ = σ,
    T_σ(s′ | s)       if s ∈ F_σ and σ′ = a_2,
    0                  otherwise.

We observe that the states are partitioned into two sets S̄ = S̄_1 ∪ S̄_2, where S̄_1 = {(s, σ) | s ∉ F_σ} is the set of states where agent 1 acts (causing a step in M) and S̄_2 = {(s, σ) | s ∈ F_σ} is the set of states where agent 2 takes actions (causing a change of subtask); this makes G a stagewise game. The reward function R̄ : S̄ × A_1 → R is given by R̄((s, σ), a) = R_σ(s, a) if s ∉ F_σ and 0 otherwise. The discount factor depends on the state and is given by γ̄(s, σ) = γ if s ∉ F_σ and 1 otherwise; this is because a change of subtask does not involve a step in M. The initial state distribution η̄ is given by η̄(s, σ) = η(s) · 1(σ = σ_0).

A run of the game is a sequence ρ̄ = s̄_0 a^1_0 a^2_0 s̄_1 a^1_1 a^2_1 . . . where s̄_t ∈ S̄ and a^i_t ∈ A_i. A (deterministic) policy for agent i is a function π_i : S̄ → A_i. Given policies π_1 and π_2 for agents 1 and 2, respectively, and a state s̄ ∈ S̄, we denote by D^G_s̄(π_1, π_2) the distribution over runs generated by π_1 and π_2 starting at s̄. Then, the value of a state s̄ is defined by

V^{π_1, π_2}(s̄) = E_{ρ̄ ∼ D^G_s̄(π_1, π_2)} [ ∑_{t=0}^∞ ( ∏_{k=0}^{t−1} γ̄(s̄_k) ) R̄(s̄_t, a^1_t) ].

We are interested in computing a policy π*_1 maximizing J_G(π_1) = E_{s̄ ∼ η̄}[ min_{π_2} V^{π_1, π_2}(s̄) ]. Given a policy π_1 for agent 1, we can construct a policy π_σ for any subtask σ given by π_σ(s) = π_1(s, σ); we denote by Π(π_1) the set of subtask policies constructed this way. The following theorem connects the objective of the game with our multi-task learning objective; all proofs are in Appendix A.

Theorem 3.1. For any policy π_1 of agent 1 in G, we have J(Π(π_1)) ≥ J_G(π_1).
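The stagewise structure of G can be made concrete with a short sketch of a single game step, reusing the `MultiTaskMDP` container and the `sample` helper from the sketch in Section 2; `game_step` and its return convention are illustrative assumptions, not part of the paper.

```python
# Sketch of one step of the stagewise game G from Section 3. At a state
# (s, sigma) with s not final for sigma, agent 1 moves and the step is
# discounted by gamma; at a final state, agent 2 picks the next subtask and
# the discount is 1 because no step of M is taken.
from typing import Tuple

def game_step(mdp, game_state: Tuple, a1, a2) -> Tuple[Tuple, float, float]:
    """Returns (next game state, reward for agent 1, discount applied)."""
    s, sigma = game_state
    if s not in mdp.F[sigma]:
        # Agent 1's stage: an ordinary environment step under subtask sigma.
        r = mdp.R[sigma](s, a1)
        s_next = sample(mdp.P(s, a1))
        return (s_next, sigma), r, mdp.gamma
    # Agent 2's stage: the adversary chooses the next subtask a2 and the jump
    # transition of the current subtask is applied; no reward, no discounting.
    s_next = sample(mdp.T[sigma](s))
    return (s_next, a2), 0.0, 1.0
```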
Theorem 3.1 shows that J_G(π_1) is a lower bound on the objective J(Π(π_1)), which we seek to maximize. Now, let us define the optimal value of a state s̄ by V*(s̄) = max_{π_1} min_{π_2} V^{π_1, π_2}(s̄). The following theorem shows that it is possible to construct a policy π*_1 that maximizes J_G(π_1) from the optimal value function V*.

Theorem 3.2. For any policy π*_1 such that, for all (s, σ) ∈ S̄_1,

π*_1(s, σ) ∈ argmax_{a ∈ A} { R̄((s, σ), a) + γ ∑_{s′ ∈ S} P(s′ | s, a) V*(s′, σ) },

we have that π*_1 ∈ argmax_{π_1} J_G(π_1).

3.1. Value Iteration

In this section, we briefly look at two value iteration algorithms to compute V*, which we later adapt in Section 4 to obtain learning algorithms. Let 𝒱 = {V : S̄_1 → R} denote the set of all value functions over S̄_1. Given a value function V ∈ 𝒱, we define its extension ⟦V⟧ to all of S̄ using

⟦V⟧(s, σ) =
    min_{σ′ ∈ Σ} ∑_{s′ ∈ S} T_σ(s′ | s) V(s′, σ′)    if s ∈ F_σ,
    V(s, σ)                                            otherwise.        (1)

For a state s ∈ F_σ, ⟦V⟧(s, σ) denotes the worst-case value (according to V) with respect to the possible choices of next subtask σ′. Now, we consider the Bellman operator F : 𝒱 → 𝒱 defined by

F(V)(s, σ) = max_{a ∈ A} { R̄((s, σ), a) + γ ∑_{s′ ∈ S} P(s′ | s, a) ⟦V⟧(s′, σ) }        (2)

for all (s, σ) ∈ S̄_1. Let us denote by V*|_{S̄_1} the restriction of V* to S̄_1. The following theorem follows straightforwardly, giving us our first value iteration procedure.

Theorem 3.3. F is a contraction mapping with respect to the ℓ∞-norm on 𝒱, and V*|_{S̄_1} is the unique fixed point of F with lim_{n→∞} F^n(V) = V*|_{S̄_1} for all V ∈ 𝒱.

Asynchronous VI. Next, we consider an asynchronous value iteration procedure which allows us to parallelize the computation of subtask policies for different subtasks. Given a subtask σ and a value function V ∈ 𝒱, we define a subtask MDP M^V_σ which behaves like M until reaching a final state s ∈ F_σ, after which it transitions to a dead state ⊥, achieving a reward of ⟦V⟧(s, σ). Formally, M^V_σ = (S_σ, A, P_σ, R^V_σ, γ) where S_σ = S ∪ {⊥}, with ⊥ being a special dead state. The transition probabilities are

P_σ(s′ | s, a) =
    P(s′ | s, a)    if ⊥ ≠ s ∉ F_σ,
    1(s′ = ⊥)        otherwise.

The reward function is given by

R^V_σ(s, a) =
    R_σ(s, a)     if ⊥ ≠ s ∉ F_σ,
    ⟦V⟧(s, σ)      if ⊥ ≠ s ∈ F_σ,
    0               otherwise.

We denote by W_σ(V) the optimal value function of the MDP M^V_σ. We then define the asynchronous value iteration operator F_async : 𝒱 → 𝒱 using

F_async(V)(s, σ) = W_σ(V)(s).        (3)

We can show that repeated application of F_async leads to the optimal value function V* of the game G.

Theorem 3.4. For any V ∈ 𝒱, lim_{n→∞} F^n_async(V) = V*|_{S̄_1}.

Since each W_σ(V) can be computed independently, we can parallelize the computation of F_async, giving us the procedure in Algorithm 1.

Algorithm 1 Asynchronous value iteration algorithm for computing optimal subtask policies.
1: function AsyncValueIteration(M, V)
2:   while stopping criterion is not met do
3:     for σ ∈ Σ do   // in parallel
4:       Compute W_σ(V)
5:     V ← F_async(V)   // using Equation (3)

We can also show that it is not necessary to compute W_σ(V) exactly. Let 𝒱_σ = {V̂ : S_σ → R} be the set of all value functions over S_σ. For a fixed V ∈ 𝒱, let F_{σ,V} : 𝒱_σ → 𝒱_σ denote the usual Bellman operator for the MDP M^V_σ, given by

F_{σ,V}(V̂)(s) = max_{a ∈ A} { R^V_σ(s, a) + γ ∑_{s′ ∈ S_σ} P_σ(s′ | s, a) V̂(s′) }

for all V̂ ∈ 𝒱_σ and s ∈ S_σ. Now, for any V ∈ 𝒱 and σ ∈ Σ, we define a corresponding V_σ ∈ 𝒱_σ using V_σ(s) = ⟦V⟧(s, σ) if s ∈ S and V_σ(⊥) = 0. Then, for any integer m > 0 and V ∈ 𝒱, we can use F^m_{σ,V}(V_σ) as an approximation to W_σ(V). Let us define F_m : 𝒱 → 𝒱 using F_m(V)(s, σ) = F^m_{σ,V}(V_σ)(s). Intuitively, F_m corresponds to performing m steps of value iteration in each subtask MDP M^V_σ (which can be parallelized), starting at V_σ.
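For finite state and action spaces, these operators can be implemented directly. The following tabular sketch of ⟦·⟧ and F_m (again assuming the `MultiTaskMDP` container from the earlier sketch; the function names are ours) makes the parallel structure explicit.

```python
# Tabular sketch of the extension [[V]] (Equation 1) and of F_m (m Bellman
# backups per subtask MDP M_sigma^V). V is a dict keyed by (state, subtask)
# pairs in S_1. Illustration only, not the paper's implementation.

def extend(mdp, V):
    """[[V]](s, sigma): worst case over the next subtask at final states,
    V itself elsewhere."""
    def ext(s, sigma):
        if s in mdp.F[sigma]:
            return min(
                sum(p * V[(s2, sig2)] for s2, p in mdp.T[sigma](s).items())
                for sig2 in mdp.subtasks
            )
        return V[(s, sigma)]
    return ext

def backup_subtask(mdp, sigma, ext, V_sigma, m):
    """m steps of standard value iteration in M_sigma^V: final states pay
    [[V]](s, sigma) and then move to the dead state (which has value 0)."""
    for _ in range(m):
        new = {}
        for s in mdp.states:
            if s in mdp.F[sigma]:
                new[s] = ext(s, sigma)
            else:
                new[s] = max(
                    mdp.R[sigma](s, a)
                    + mdp.gamma * sum(p * V_sigma[s2] for s2, p in mdp.P(s, a).items())
                    for a in mdp.actions
                )
        V_sigma = new
    return V_sigma

def F_m(mdp, V, m):
    """One application of F_m; the per-subtask loop is embarrassingly parallel."""
    ext = extend(mdp, V)
    V_new = dict(V)
    for sigma in mdp.subtasks:                          # could run in parallel
        V_sigma = {s: ext(s, sigma) for s in mdp.states}
        V_sigma = backup_subtask(mdp, sigma, ext, V_sigma, m)
        for s in mdp.states:
            if s not in mdp.F[sigma]:
                V_new[(s, sigma)] = V_sigma[s]
    return V_new
```

Each call to `backup_subtask` touches only one subtask MDP, which is what allows the F_async and F_m computations to be distributed across workers.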
The following theorem guarantees convergence when using F_m instead of F_async.

Theorem 3.5. For any V ∈ 𝒱 and m > 0, lim_{n→∞} F^n_m(V) = V*|_{S̄_1}.

4. Learning Algorithms

In this section, we present RL algorithms for solving the game G. We first consider the finite MDP setting, for which we can obtain a modified Q-learning algorithm with a convergence guarantee. We then present two algorithms based on Soft Actor Critic (SAC) for the continuous-state setting.

4.1. Finite MDP

Assuming finite states and actions, we can obtain a Q-learning variant for solving G which we call Robust Option Q-learning. We assume that the jump transitions T are known to the learner; this is usually the case since jump transitions are used to model subtask transitions and (potential) change-of-coordinates within the controller. However, we believe that the algorithm can be easily extended to the scenario where T is unknown. We maintain a function Q : S̄_1 × A → R, with Q(s, σ, a) denoting Q((s, σ), a). The corresponding value function V_Q is defined using V_Q(s, σ) = max_{a ∈ A} Q(s, σ, a) and is extended to all of S̄ as ⟦V_Q⟧. Note that, given a Q-function, the extended value function ⟦V_Q⟧ can be computed exactly. Robust Option Q-learning is an iterative process: in each iteration t, it takes a step ((s, σ), a_1, a_2, (s′, σ)) in G with (s, σ) ∈ S̄_1 and updates the Q-function using

Q_{t+1}(s, σ, a_1) ← (1 − α_t) Q_t(s, σ, a_1) + α_t ( R̄((s, σ), a_1) + γ ⟦V_{Q_t}⟧(s′, σ) ),

where Q_t is the Q-function in iteration t and ⟦V_{Q_t}⟧ is the corresponding extended value function. Under standard assumptions on the learning rates α_t, similar to Q-learning, we can show that Robust Option Q-learning converges to the optimal Q-function almost surely. Here, the optimal Q-function is defined by Q*(s, σ, a) = R̄((s, σ), a) + γ ∑_{s′ ∈ S} P(s′ | s, a) V*(s′, σ) for all (s, σ) ∈ S̄_1. Let α_t(s, σ, a) denote the learning rate used in iteration t if Q_t(s, σ, a) is updated and 0 otherwise. Then, we have the following theorem.

Theorem 4.1. If ∑_t α_t(s, σ, a) = ∞ and ∑_t α²_t(s, σ, a) < ∞ for all (s, σ) ∈ S̄_1 and a ∈ A, then lim_{t→∞} Q_t = Q* almost surely.

4.2. Continuous States and Actions

In the case of continuous states and actions, we can adapt any Q-function-based RL algorithm, such as Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al., 2016) or Soft Actor Critic (SAC) (Haarnoja et al., 2018), to our setting. Here we present an SAC-based algorithm that we call Robust Option SAC (ROSAC), which is outlined in Algorithm 2. This algorithm, like SAC, adds an entropy bonus to the reward function to improve exploration.

Algorithm 2 Robust Option Soft Actor Critic.
Inputs: Learning rates α_ψ, α_θ, entropy weight β and Polyak rate δ.
1: function ROSAC(α_ψ, α_θ, β, δ)
2:   Initialize params {ψ_σ}_{σ∈Σ}, {ψ^targ_σ}_{σ∈Σ}, {θ_σ}_{σ∈Σ}
3:   Initialize replay buffer B
4:   for each iteration do
5:     for each episode do
6:       s_0 ∼ η
7:       σ_0 ← initial subtask
8:       for each step t do
9:         a_t ∼ π_{θ_{σ_t}}(· | s_t)
10:        s_{t+1} ∼ P(· | s_t, a_t)
11:        B ← B ∪ {(s_t, a_t, s_{t+1})}
12:        if s_{t+1} ∈ F_{σ_t} then
13:          s_{t+1} ∼ T_{σ_t}(· | s_{t+1})
14:          σ_{t+1} ← Greedy(ε, argmin_σ V̂(s_{t+1}, σ), Σ)
15:        else
16:          σ_{t+1} ← σ_t
17:     for each gradient step do
18:       Sample batch B̂ ⊆ B
19:       for σ ∈ Σ do
20:         ψ_σ ← ψ_σ − α_ψ ∇_{ψ_σ} L_Q(ψ_σ, B̂)
21:         θ_σ ← θ_σ − α_θ ∇_{θ_σ} L_π(θ_σ, B̂)
22:         ψ^targ_σ ← δ ψ_σ + (1 − δ) ψ^targ_σ

We maintain two Q-functions for each subtask σ: Q_{ψ_σ} : S × A → R, parameterized by ψ_σ, and a target function Q_{ψ^targ_σ}, parameterized by ψ^targ_σ. We also maintain a stochastic subtask policy π_{θ_σ} : S → D(A) for each subtask σ, where D(A) denotes the set of distributions over A.
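This per-subtask parameterization can be sketched as follows; the PyTorch module layout, network sizes, and helper names below are illustrative assumptions of ours rather than the released implementation.

```python
# Sketch of the per-subtask parameterization used by ROSAC: one critic, one
# target critic (a copy updated by Polyak averaging), and one Gaussian policy
# per subtask. Illustration only.
import copy
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class GaussianPolicy(nn.Module):
    """pi_theta(. | s): a diagonal Gaussian whose mean and log-std come from an
    MLP; rsample() supports the reparametrization trick mentioned below."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = mlp(obs_dim, 2 * act_dim, hidden)
        self.act_dim = act_dim

    def sample(self, obs):
        mean, log_std = self.net(obs).split(self.act_dim, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
        a = dist.rsample()
        return a, dist.log_prob(a).sum(-1)

def make_rosac_modules(subtasks, obs_dim, act_dim, hidden=64):
    critics = {s: mlp(obs_dim + act_dim, 1, hidden) for s in subtasks}
    target_critics = {s: copy.deepcopy(critics[s]) for s in subtasks}
    policies = {s: GaussianPolicy(obs_dim, act_dim, hidden) for s in subtasks}
    return critics, target_critics, policies

def polyak_update(critic, target, delta=0.005):
    # psi_targ <- delta * psi + (1 - delta) * psi_targ (line 22 of Algorithm 2)
    with torch.no_grad():
        for p, p_targ in zip(critic.parameters(), target.parameters()):
            p_targ.mul_(1 - delta).add_(delta * p)
```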
Given a step (s, a, s′) in M and a subtask σ with s ∉ F_σ, we define the target value by

y_σ(s, a, s′) = R_σ(s, a) + γ ⟦V̂⟧(s′, σ),

where ⟦V̂⟧(s′, σ) is estimated as follows. If s′ ∉ F_σ, we use V̂(s′, σ) = Q_{ψ^targ_σ}(s′, ã) − β log π_{θ_σ}(ã | s′) with ã ∼ π_{θ_σ}(· | s′). If s′ ∈ F_σ, we estimate ⟦V̂⟧(s′, σ) using min_{σ′ ∈ Σ} V̂(s′′, σ′), where V̂(s′′, σ′) = Q_{ψ^targ_{σ′}}(s′′, ã) − β log π_{θ_{σ′}}(ã | s′′) with ã ∼ π_{θ_{σ′}}(· | s′′) and s′′ ∼ T_σ(· | s′). Now, given a batch B̂ = {(s, a, s′)} of steps in M, we update ψ_σ using one step of gradient descent corresponding to the loss

L_Q(ψ_σ, B̂) = ∑_{(s,a,s′) ∈ B̂} ( Q_{ψ_σ}(s, a) − y_σ(s, a, s′) )²,

and the subtask policy π_{θ_σ} is updated using the loss

L_π(θ_σ, B̂) = −∑_{(s,a,s′) ∈ B̂} E_{ã ∼ π_{θ_σ}(·|s)} [ Q_{ψ_σ}(s, ã) − β log π_{θ_σ}(ã | s) ].

Figure 2. Rooms environment.

The gradient ∇_{θ_σ} L_π(θ_σ, B̂) can be estimated using the reparametrization trick if, for example, π_{θ_σ}(· | s) is a Gaussian distribution whose parameters are differentiable w.r.t. θ_σ. We use Polyak averaging to update the target Q-networks {Q_{ψ^targ_σ} | σ ∈ Σ}. Note that we do not train a separate policy for the adversary. During exploration, we use the ε-greedy strategy to select subtasks. We first estimate the worst subtask for a state s using σ = argmin_{σ′} V̂(s, σ′), where V̂(s, σ′) is estimated as before. Then the function Greedy(ε, σ, Σ) chooses σ with probability 1 − ε and picks a subtask uniformly at random from Σ with probability ε. Furthermore, we can easily impose constraints on the sequences of subtasks considered (e.g., only consider realistic tracks in F1/10th) by restricting the adversary to pick the next subtask σ_{t+1} from a subset Σ_{t+1} ⊆ Σ of possible subtasks, i.e., by performing the argmin (in Line 14) over Σ_{t+1} instead of Σ.

Asynchronous ROSAC. We can also obtain an asynchronous version of the above algorithm which lets us train subtask policies in parallel. Asynchronous Robust Option SAC (AROSAC) is outlined in Algorithm 3. Here we use one replay buffer for each subtask. We maintain an initial state distribution η̃ over S to be used for training every subtask policy in {π_σ}_{σ∈Σ}; η̃ is represented using a finite set of states D from which a state is sampled uniformly at random. The value function V̂ : S × Σ → R is estimated as before. To be specific, in each iteration, an estimate of any value V̂(s, σ) is obtained on the fly using the Q-functions and the subtask policies from the previous iteration. The SAC subroutine runs the standard Soft Actor Critic algorithm for N iterations on the subtask MDP M^V̂_σ (defined in Section 3)⁶ with initial state distribution η̃ (which defaults to η if D = ∅). It returns the updated parameters along with the states X_σ ⊆ F_σ visited during exploration. The states in X_σ are used to update the initial state distribution for the next iteration, following the Dataset Aggregation principle (Ross et al., 2011).

Algorithm 3 Asynchronous Robust Option SAC.
Inputs: Learning rate α, entropy weight β, Polyak rate δ and number of SAC iterations N.
1: function AROSAC(α, β, δ, N)
2:   Initialize params Ψ = {ψ_σ}_{σ∈Σ}, Θ = {θ_σ}_{σ∈Σ}
3:   Initialize target params Ψ^targ = {ψ^targ_σ}_{σ∈Σ}
4:   Initialize replay buffers {B_σ}_{σ∈Σ}
5:   Initialize D = {}
6:   for each iteration do
7:     V̂ ← ObtainValueEstimator(Ψ, Θ)
8:     for σ ∈ Σ do   // in parallel
9:       Run SAC(M^V̂_σ, D, ψ_σ, ψ^targ_σ, θ_σ, α, β, δ, N)
10:      Update ψ_σ, ψ^targ_σ, θ_σ and X_σ to new values
11:    for σ ∈ Σ do
12:      for s ∈ X_σ do
13:        s′ ∼ T_σ(· | s) and D ← D ∪ {s′}

⁶Note that it is possible to obtain samples from M^V̂_σ as long as one can obtain samples from M and membership in F_σ can be decided.
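The two adversarial ingredients shared by ROSAC and AROSAC — the sampled worst-case value estimate and the ε-greedy subtask selection of Line 14 of Algorithm 2 — can be summarized in a short sketch. The interfaces `q_targ[sigma](s, a)`, `policy[sigma].sample(s) -> (a, log_prob)`, and `jump[sigma](s)` are assumptions for illustration, not the paper's API.

```python
# Sketch of the worst-case value estimate and the epsilon-greedy adversarial
# subtask selection used during exploration. Illustration only.
import random

def soft_value(q_targ, policy, beta, s, sigma):
    """V_hat(s, sigma) = Q_targ(s, a~) - beta * log pi(a~ | s), a~ ~ pi(. | s)."""
    a, log_prob = policy[sigma].sample(s)
    return q_targ[sigma](s, a) - beta * log_prob

def worst_case_value(q_targ, policy, jump, beta, s, sigma, subtasks):
    """Estimate of [[V_hat]](s, sigma) at a final state: apply the jump
    transition of sigma and take the minimum soft value over next subtasks."""
    s2 = jump[sigma](s)                      # sample s'' ~ T_sigma(. | s)
    return min(soft_value(q_targ, policy, beta, s2, sig2) for sig2 in subtasks)

def greedy_subtask(q_targ, policy, beta, s, subtasks, eps):
    """Adversarial subtask choice: worst subtask w.p. 1 - eps, random w.p. eps."""
    if random.random() < eps:
        return random.choice(subtasks)
    return min(subtasks, key=lambda sig: soft_value(q_targ, policy, beta, s, sig))
```

Because the adversary is derived directly from the learned Q-functions and policies, no separate adversary network needs to be trained.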
5. Experiments

We evaluate⁷ our algorithms ROSAC and AROSAC on two multi-task environments: a rooms environment and an F1/10th racing car environment (F110).

Rooms environment. We consider the environment shown in Figure 2, which depicts a room with walls and exits. Initially, the robot is placed at the green triangle. The L-shaped obstacles indicate walls within the room that the robot cannot cross. A state of the system is a position (x, y) ∈ R², and an action is a pair (v, θ) where v is the speed and θ is the heading angle to follow during the next time-step. There are three exits: left (blue), right (yellow) and up (grey), reaching each of which is a subtask. Upon reaching an exit, the robot enters another identical room where the exit is identified (via change-of-coordinates) with the bottom entry region of the current room. A task is a sequence of directions, e.g., left → right → up → right, indicating that the robot should reach the left exit, followed by the right exit in the subsequent room, and so on. Although the dynamics are simple, the obstacles make learning challenging in the adversarial setting.

F1/10th environment. This is the environment in the motivating example. A publicly available simulator (F110) of the F1/10th car is used for training and testing. The policies use the LiDAR measurements from the car as input (as opposed to the state) and we assume that the controller can detect the completion of each segment; as shown in prior work (Ivanov et al., 2021), one can train a separate neural network to predict subtask completion.

Figure 3. Plots for the Rooms environment: (a) success probability against the random adversary, (b) success probability against the MCTS adversary, (c) number of subtasks completed against the MCTS adversary, for ROSAC, AROSAC, NAIVE, DAGGER, MADDPG and PAIRED. The x-axis denotes the number of sample steps and the y-axis denotes either the average number of subtasks completed or the probability of completing 5 subtasks. Results are averaged over 10 runs. Error bars indicate std.

Figure 4. Plots for the F1/10th environment: (a) success probability against the random adversary, (b) success probability against the MCTS adversary, (c) number of subtasks completed against the MCTS adversary, for ROSAC, AROSAC, NAIVE, DAGGER, MADDPG and PAIRED. The x-axis denotes the number of sample steps and the y-axis denotes either the average number of subtasks completed or the probability of completing 25 subtasks. Results are averaged over 5 runs. Error bars indicate std.

Baselines. We compare our approach to four baselines. The baseline NAIVE trains one controller for each subtask with the only aim of completing the subtask, similar to the approach used by Ivanov et al. (2021), using a manually designed initial state distribution. DAGGER is a similar approach which, instead of using a manually designed initial state distribution for training, infers the initial state distribution using the Dataset Aggregation principle (Ross et al., 2011). In other words, DAGGER is similar to AROSAC except that M is used instead of M^V̂_σ (Line 9 of Algorithm 3) for training the options in each iteration.

⁷Our implementation is available online and can be found at https://github.com/keyshor/rosac.
The MADDPG baseline solves the game G using the multi-agent RL algorithm proposed by Lowe et al. (2017) for solving concurrent Markov games with continuous states and actions. We use the open-source implementation of MADDPG by the authors. The PAIRED baseline refers to the unsupervised environment design approach proposed by Dennis et al. (2020), in which the multi-task RL problem is viewed as a two-agent zero-sum game similar to our approach (but is not designed for long-horizon and compositional tasks). We implemented PAIRED with PPO as the base algorithm to train the policies of the protagonist and the antagonist, each of which consists of one NN policy per subtask. The adversary is parameterized by the probabilities (logits) of choosing various subtasks at different timesteps and is trained using REINFORCE updates.

Evaluation. We evaluate the performance of these algorithms against two adversaries. One adversary is the random adversary, which picks the next subtask uniformly at random from the set of all subtasks. The other adversary estimates the worst sequence of subtasks for a given set of options using the Monte Carlo Tree Search (MCTS) algorithm (Kocsis and Szepesvári, 2006). The MCTS adversary is trained by assigning a reward of 1 if it selects a subtask which the corresponding policy is unable to complete within a fixed time budget and a reward of 0 otherwise. For the Rooms environment, we consider subtask sequences of length at most 5, whereas for the F1/10th environment, we consider sequences of subtasks of length at most 25. We evaluate both the average number of subtasks completed as well as the probability of completing the set maximum number of subtasks.

Results. The plots for the Rooms environment are shown in Figure 3 and the plots for the F1/10th environment are shown in Figure 4. We can observe that ROSAC and AROSAC outperform other approaches and learn robust options.

Figure 5. Ablation study in the Rooms environment: (a) success probability against the random adversary, (b) success probability against the MCTS adversary, (c) number of subtasks completed against the MCTS adversary, for ROSAC and ROSAC_a. The x-axis denotes the number of sample steps and the y-axis denotes either the average number of subtasks completed or the probability of completing 5 subtasks. Results are averaged over 10 runs. Error bars indicate std.

AROSAC performs worse as compared to ROSAC; however, it has the added benefit of being parallelizable. The DAGGER and NAIVE baselines are unable to learn policies that can be used to perform long sequences of subtasks; this is mostly due to the fact that they learn options that cause the system to reach states from which future subtasks are difficult to perform, e.g., in the Rooms environment, the agent sometimes reaches the left half of the exits, from where it is difficult to reach the right exit in the subsequent room. Although MADDPG uses the same reduction to two-player games as ROSAC, it ignores all the structure in the game and treats it as a generic Markov game. As a result, it learns a separate NN policy for each player, which leads to the issue of unstable training, primarily due to the non-stationary nature of the environment observed by either agent. As shown in the plots, this leads to poor performance when applied to the problem of learning robust options.
Similarly, PAIRED is unable to learn good options, likely due to two reasons. Firstly, it does not use a compositional or a hierarchical approach for training the options, which causes it to scale poorly to long-horizon tasks. Secondly, it relies on on-policy algorithms (such as PPO) for training, which are less efficient than SAC for our environments.

Random vs MCTS adversary. Given that the number of subtasks is small in our benchmarks, the random adversary is able to select difficult sequences of subtasks often (e.g., a right after a left in the Rooms environment), which explains the similarity in the observed performance against the two different adversaries. Nonetheless, even when the performance is measured against a random adversary, our approach significantly outperforms all baselines.

An ablation study in the Rooms environment shows that selecting subtasks randomly during training is not as effective as the ε-greedy approach. In Figure 5, ROSAC_a denotes the ablation in which ε is set to 1 in line 14 of Algorithm 2. As the plots show, although this ablation is able to learn good subtask policies, it requires more samples than ROSAC. Note that the ablation still uses the worst-case target value estimate V̂ over future subtasks for training the Q-networks.

6. Related Work

Options and Hierarchical RL. The options framework (Sutton et al., 1999) is commonly used to model subtask policies as temporally extended actions. In hierarchical RL (Nachum et al., 2018; 2019; Kulkarni et al., 2016; Dietterich, 2000; Bacon et al., 2017; Tiwari and Thomas, 2019), options are trained along with a high-level controller that chooses the sequence of options (plan) to execute in order to complete a specific high-level task. In contrast, our approach aims to train options that work for multiple high-level plans. There is also work on discovering options, e.g., using intrinsic motivation (Machado et al., 2017), entropy maximization (Eysenbach et al., 2018), semi-supervised RL (Finn et al., 2017), skill chaining (Konidaris and Barto, 2009; Bagaria and Konidaris, 2019; Bagaria et al., 2021a), expectation maximization (Daniel et al., 2016) and subgoal identification (Stolle and Precup, 2002). Our approach is complementary to option discovery methods, since our algorithm can be used to learn robust policies corresponding to the options, which can then be used in multiple contexts. There has also been a lot of research on planning using learned options (Abel et al., 2020; Jothimurugan et al., 2021; Precup et al., 1998; Theocharous and Kaelbling, 2004; Konidaris et al., 2014). Some hierarchical RL algorithms have also been shown to enable few-shot generalization to unseen tasks (Jothimurugan et al., 2021). Closely related to our work is the work on compositional RL in the multi-task setting (Ivanov et al., 2021), in which the subtask policies are trained using standard RL algorithms in a naive way, without guarantees regarding worst-case performance.

Multi-task RL. There has been some work on RL for zero-shot generalization (Vaezipoor et al., 2021; Oh et al., 2017; Sohn et al., 2018; Kuo et al., 2020; Andreas et al., 2017); however, in these works, the learning objective is to maximize average performance with respect to a fixed distribution over tasks as opposed to the worst-case.
Most closely related to our work is the line of work on minimax/robust RL (Pinto et al., 2017; Campero et al., 2020; Dennis et al., 2020), which aims to train policies to maximize the worst-case performance across multiple tasks/environments. However, there are a few key differences between existing work and our approach. Firstly, existing methods train a single NN policy to perform multiple tasks as opposed to training composable options for subtasks. As shown in the experiments, training a single policy does not scale well to complex long-horizon tasks. Secondly, these approaches do not provide strong convergence guarantees whereas we provide convergence guarantees in the finitestate setting. Finally, we utilize the structure in the problem to infer an adversary directly from the value functions instead of training a separate adversary as done in previous approaches. Skill Chaining. Research on skill chaining (Bagaria and Konidaris, 2019; Bagaria et al., 2021b;a; Lee et al., 2021) has led to algorithms for discovering and training options so that they compose well. However, in these approaches, the aim is to compose the options to perform a specific task which corresponds to executing the options in specific order(s). In contrast, our objective is independent of specific tasks and seeks to maximize performace across all tasks. Also, such approaches, at best, only provide hierarchical optimality guarantees whereas our algorithm converges to the optimal policy with respect to our game formulation. There has also been work on skill composition using transition policies (Lee et al., 2018); this method assumes that the subtask policies are fixed and learns one transition policy per subtask which takes the system from an end state to a favourable initial state for the subtask. However, poorly trained subtask policies can lead to situations in which it is not possible to achieve such transitions. In contrast, our approach trains subtask policies which compose well without requiring additional transition policies. Multi-agent RL. There has been a lot of research on multi-agent RL algorithms (Lowe et al., 2017; Hu and Wellman, 2003; Hu et al., 1998; Littman, 2001; Perolat et al., 2017; Prasad et al., 2015; Akchurina, 2008) including algorithms for two-agent zero-sum games (Bai and Jin, 2020; Wei et al., 2017; Littman, 1994). In this paper, we utilize the specific structure of our game to obtain a simple algorithm that neither requires solving matrix games nor trains a separate policy for the adversary. Furthermore, we show that we can obtain an asynchronous RL algorithm which enables training options in parallel. 7. Conclusions We have proposed a framework for training robust options which can be used to perform arbitrary sequences of subtasks. We first reduce the problem to a two-agent zerosum stagewise Markov game which has a unique structure. We utilized this structure to design two algorithms, namely ROSAC and AROSAC, and demonstrated that they outperform existing approaches for training options with respect to multi-task performance. One potential limitation of our approach is that the set of subtasks is fixed and has to be provided by the user. An interesting direction for future work is to address this limitation by combining our approach with option discovery methods. Acknowledgements. We thank the anonymous reviewers for their helpful comments. This research was partially supported by ONR award N00014-20-1-2115. 
David Abel, Nathan Umbanhowar, Khimya Khetarpal, Dilip Arumugam, Doina Precup, and Michael L. Littman. Value preserving state-action abstractions. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2020. Natalia Akchurina. Multi-agent reinforcement learning algorithm with variable optimistic-pessimistic criterion. In ECAI, volume 178, pages 433 437, 2008. Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob Mc Grew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik s cube with a robot hand. ar Xiv preprint ar Xiv:1910.07113, 2019. Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In International Conference on Machine Learning, pages 166 175. PMLR, 2017. Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. Akhil Bagaria and George Konidaris. Option discovery using deep skill chaining. In International Conference on Learning Representations, 2019. Akhil Bagaria, Jason K Senthil, and George Konidaris. Skill discovery for exploration and planning using deep skill graphs. In International Conference on Machine Learning, pages 521 531. PMLR, 2021a. Akhil Bagaria, Jason K Senthil, Matthew Slivinski, and George Konidaris. Robustly learning composable op- Robust Subtask Learning for Compositional Generalization tions in deep reinforcement learning. In IJCAI, pages 2161 2169, 2021b. Yu Bai and Chi Jin. Provable self-play algorithms for competitive reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, 2020. D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, Belmont, MA, 1996. Andres Campero, Roberta Raileanu, Heinrich K uttler, Joshua B Tenenbaum, Tim Rockt aschel, and Edward Grefenstette. Learning with amigo: Adversarially motivated intrinsic goals. ar Xiv preprint ar Xiv:2006.12122, 2020. Christian Daniel, Herke Van Hoof, Jan Peters, and Gerhard Neumann. Probabilistic inference for determining options in reinforcement learning. Machine Learning, 104 (2):337 357, 2016. Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre Bayen, Stuart Russell, Andrew Critch, and Sergey Levine. Emergent complexity and zero-shot transfer via unsupervised environment design. Advances in neural information processing systems, 33:13049 13061, 2020. Thomas G Dietterich. State abstraction in maxq hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pages 994 1000, 2000. Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In ICLR, 2018. F110. F1/10 Autonomous Racing Competition. http://f1tenth.org. Chelsea Finn, Tianhe Yu, Justin Fu, Pieter Abbeel, and Sergey Levine. Generalizing skills with semi-supervised reinforcement learning. In ICLR, 2017. Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587 1596, 2018. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861 1870. PMLR, 2018. Junling Hu and Michael P Wellman. Nash q-learning for general-sum stochastic games. 
Journal of machine learning research, 4(Nov):1039 1069, 2003. Junling Hu, Michael P Wellman, et al. Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, volume 98, pages 242 250. Citeseer, 1998. Radoslav Ivanov, Kishor Jothimurugan, Steve Hsu, Shaan Vaidya, Rajeev Alur, and Osbert Bastani. Compositional learning and verification of neural network controllers. ACM Transactions on Embedded Computing Systems (TECS), 20(5s):1 26, 2021. Kishor Jothimurugan, Osbert Bastani, and Rajeev Alur. Abstract value iteration for hierarchical reinforcement learning. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 1162 1170. PMLR, 13 15 Apr 2021. Levente Kocsis and Csaba Szepesv ari. Bandit based montecarlo planning. In European conference on machine learning, pages 282 293. Springer, 2006. George Konidaris and Andrew G Barto. Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in neural information processing systems, pages 1015 1023, 2009. George Konidaris, Leslie Kaelbling, and Tomas Lozano Perez. Constructing symbolic representations for highlevel planning. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014. Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675 3683, 2016. Yen-Ling Kuo, Boris Katz, and Andrei Barbu. Encoding formulas as deep networks: Reinforcement learning for zero-shot execution of ltl formulas. 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5604 5610, 2020. Youngwoon Lee, Shao-Hua Sun, Sriram Somasundaram, Edward S Hu, and Joseph J Lim. Composing complex skills by learning transition policies. In International Conference on Learning Representations, 2018. Youngwoon Lee, Joseph J Lim, Anima Anandkumar, and Yuke Zhu. Adversarial skill chaining for long-horizon robot manipulation via terminal state regularization. ar Xiv preprint ar Xiv:2111.07999, 2021. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016. Robust Subtask Learning for Compositional Generalization Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pages 157 163. Elsevier, 1994. Michael L Littman. Friend-or-foe q-learning in generalsum games. In ICML, volume 1, pages 322 328, 2001. Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6382 6393, 2017. Marios C Machado, Marc G Bellemare, and Michael Bowling. A laplacian framework for option discovery in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2295 2304. JMLR. org, 2017. Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pages 3303 3313, 2018. Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. 
Near-optimal representation learning for hierarchical reinforcement learning. In ICLR, 2019. Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. In International Conference on Machine Learning, pages 2661 2670. PMLR, 2017. Stephen David Patek. Stochastic and shortest path games: theory and algorithms. Ph D thesis, Massachusetts Institute of Technology, 1997. Julien Perolat, Florian Strub, Bilal Piot, and Olivier Pietquin. Learning Nash Equilibrium for General-Sum Markov Games from Batch Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017. Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In International Conference on Machine Learning, pages 2817 2826. PMLR, 2017. HL Prasad, Prashanth LA, and Shalabh Bhatnagar. Twotimescale algorithms for learning nash equilibria in general-sum stochastic games. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1371 1379, 2015. Doina Precup, Richard S Sutton, and Satinder Singh. Theoretical results on reinforcement learning with temporally abstract options. In European conference on machine learning, pages 382 393. Springer, 1998. St ephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627 635, 2011. Sungryull Sohn, Junhyuk Oh, and Honglak Lee. Hierarchical reinforcement learning for zero-shot generalization with subtask dependencies. Advances in Neural Information Processing Systems, 31, 2018. Martin Stolle and Doina Precup. Learning options in reinforcement learning. In International Symposium on abstraction, reformulation, and approximation, pages 212 223. Springer, 2002. Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181 211, 1999. Georgios Theocharous and Leslie P Kaelbling. Approximate planning in pomdps with macro-actions. In Advances in Neural Information Processing Systems, pages 775 782, 2004. Saket Tiwari and Philip S Thomas. Natural option critic. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5175 5182, 2019. Pashootan Vaezipoor, Andrew C Li, Rodrigo A Toro Icarte, and Sheila A. Mcilraith. Ltl2action: Generalizing ltl instructions for multi-task rl. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10497 10508. PMLR, 18 24 Jul 2021. Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Online reinforcement learning in stochastic games. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 4994 5004, 2017. Robust Subtask Learning for Compositional Generalization In this section, we prove the theorems presented in the paper through a series of lemmas. The proofs here are adaptations of proofs of similar results for zero-sum concurrent games with a single discount factor (Patek, 1997). A.1. Definitions We first introduce some notation and definitions. Recall that we defined S1 = {(s, σ) | σ / Fσ}, S2 = {(s, σ) | σ Fσ} and S = S1 S2. 
V = {V : S1 R} denotes the set of value functions over S1 and V = {V : S R} denotes the set of value functions over S. We use to denote the ℓ -norm. For any V V, (s, σ) S2 and any σ Σ, define JV Kσ (s, σ) = X s S Tσ(s | s)V (s , σ ). Note that for any (s, σ) S2, we have JV K(s, σ) = minσ Σ JV Kσ (s, σ). Similarly, for any V V, (s, σ) S1 and a A, let us define Fa(V )(s, σ) = R((s, σ), a) + γ X s S P(s | s, a)JV K(s , σ) with F(V )(s, σ) = maxa A Fa(V )(s, σ). Given any policy π1 : S A1 for agent 1 in G, we define the resulting MDP G(π1) = ( S, A2, Pπ1, Rπ1, γ) with states S and actions A2 = Σ as follows. The transition probability function is given by Pπ1((s , σ ) | (s, σ), a2) = ( P((s , σ ) | (s, σ), π1(s, σ), a2) if (s, σ) S1 P s S Tσ(s | s) P((s , σ ) | (s , a2), π1(s , a2), a2) if (s, σ) S2 and the reward function is given by Rπ1((s, σ), a2) = ( R((s, σ), π1(s, σ)) if (s, σ) S1 P s S Tσ(s | s) R((s , a2), π1(s , a2)) if (s, σ) S2. Intuitively, the MDP G(π1) merges every step of G in which a change of subtask occurs with the subsequent step in the environment, while using π1 to choose actions for agent 1. For any s S, let DG(π1) s (π2) denote the distribution over infinite trajectories generated by π2 starting at s in G(π1). Then we define the value function for the MDP G(π1) using V π2 G(π1)( s) = Eρ DG(π1) s (π2) t=0 γt Rπ1( st, at) i for all π2 : S A2 and s S. A.2. Necessary Lemmas We need a few intermediate results in order to prove the main theorems. We begin by analyzing the operators J K : V V and F : V V defined in Section 3.1. Lemma A.1. For any V1, V2 V, we have JV1K JV2K = V1 V2 . Proof. For any (s, σ) S1, we have |JV1K(s, σ) JV2K(s, σ)| = |V1(s, σ) V2(s, σ)|. For any (s, σ) S2 and σ Σ we have |JV1Kσ (s, σ) JV2Kσ (s, σ)| = X s S Tσ(s | s)V1(s , σ ) X s S Tσ(s | s)V2(s , σ ) s S Tσ(s | s)|V1(s , σ ) V2(s , σ )| Now we have |JV1K(s, σ) JV2K(s, σ)| = |minσ JV1Kσ (s, σ) minσ JV2Kσ (s, σ)| V1 V2 which concludes the proof. Robust Subtask Learning for Compositional Generalization Now we are ready to show that F is a contraction. Lemma A.2. F : V V is a contraction mapping w.r.t the norm . Proof. Let V1, V2 V. Then for any (s, σ) S1 and a A, |Fa(V1)(s, σ) Fa(V2)(s, σ)| = |γ X s S P(s | s, a)JV1K(s , σ) γ X s S P(s | s, a)JV2K(s , σ)| s S P(s | s, a)|JV1K(s , σ) JV2K(s , σ)| where the last inequality followed from Lemma A.1. Therefore, for any (s, σ) S1, we have |F(V1)(s, σ) F(V2)(s, σ)| = |maxa Fa(V1)(s, σ) maxa Fa(V2)(s, σ)| γ V1 V2 showing that F is a contraction. Now we connect the value function of the game G with that of the MDP G(π1). Lemma A.3. For any π1 : S A1, π2 : S A2 and s S, V π1,π2( s) = V π2 G(π1)( s). Proof. Given an infinite trajectory ρ = s0a1 0a2 0 s1a1 1a2 1 . . . in G we define a corresponding trajectory ρ2 = si0a2 i0 si1a2 i1 in G(π1) as a subsequence where i0 = 0 and it+1 = it + 1 if sit S1 and it+1 = it + 2 if sit S2. Then for any s S we have V π1,π2( s) = E ρ DG s (π1,π2) h X k=0 γ( sk) R( st, a1 t) i (1) = E ρ DG s (π1,π2) h X t=0 γt Rπ1( sit, a2 it) i (2) = Eρ DG(π1) s (π2) t=0 γt Rπ1( st, at) i = V π2 G(π1)( s) where (1) followed from the definitions of γ and Rπ1 and the fact that R( st, a1 t) = 0 if st S2, and (2) followed from the fact that sampling a trajectory ρ by first sampling ρ from DG s (π1, π2) and then constructing the subsequence ρ2 is the same as sampling an infinite trajectory ρ from DG(π1) s (π2)8. Lemma A.2 shows that for any V V we have lim n Fn(V ) = Vlim where Vlim V is the unique fixed point of F. 
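The proofs of Lemmas A.1 and A.2 rest on the elementary fact that pointwise minima and maxima are non-expansive in the sup norm. The following quick numerical illustration (ours, not part of the paper) checks these two inequalities on random inputs.

```python
# Illustrative check of the steps used in Lemmas A.1 and A.2:
# |min_i x_i - min_i y_i| <= max_i |x_i - y_i|, and similarly for max.
import random

rng = random.Random(0)
for _ in range(10_000):
    n = rng.randint(1, 5)
    x = [rng.uniform(-10, 10) for _ in range(n)]
    y = [rng.uniform(-10, 10) for _ in range(n)]
    gap = max(abs(a - b) for a, b in zip(x, y))
    assert abs(min(x) - min(y)) <= gap + 1e-12   # min is non-expansive
    assert abs(max(x) - max(y)) <= gap + 1e-12   # max is non-expansive
print("min and max are non-expansive on all sampled instances")
```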
Now we define two policies π 1 and π 2 for agents 1 and 2 respectively, as follows. For (s, σ) S1 we have π 1(s, σ) arg maxa A Fa(Vlim)(s, σ) (4) and for (s, σ) S2, we have π 2(s, σ) arg minσ JVlim Kσ (s, σ). (5) Note that the actions taken by π 1 in S2 and π 2 in S1 can be arbitrary since they do not affect the transitions of the game G. Now we show that for any s S, π 1 maximizes V π1,π 2 ( s) and π 2 minimizes V π 1,π2( s). Lemma A.4. For any s S, V π 1,π 2 ( s) = minπ2 V π 1,π2( s) = JVlim K( s). 8This can be shown formally by analyzing the probabilities assigned by the two distributions on cylinder sets. Robust Subtask Learning for Compositional Generalization Proof. Let G(π 1) = ( S, A2, Pπ 1 , Rπ 1 , γ). For any (s, σ) S2, we have JVlim K(s, σ) = min σ Σ s S Tσ(s | s)JVlim K(s , σ ) (3) = min σ Σ s S Tσ(s | s) R((s , σ ), a) + γ X s S P(s | s, a)JVlim K(s , σ ) a=π 1(s ,σ ) (4) = min a2 A2 Rπ 1 ((s, σ), a2) + γ X s S Pπ 1 ( s | (s, σ), a2)JVlim K( s) where (3) followed from the definitions of Vlim and π 1 and (4) followed from the definitions of Rπ 1 and Pπ 1 . Since JVlim K satisfies the Bellman equations for the MDP G(π 1), the optimal value function for G(π 1) is given by V G(π 1) = JVlim K. Now, from the definition of π 2 we can conclude that π 2 is an optimal policy for G(π 1). Therefore, Lemma A.3 implies that for any s S, min π2 V π 1,π2( s) = min π2 V π2 G(π 1)( s) = V π 2 G(π 1)( s) = V π 1,π 2 ( s). Hence, we have proved the desired result. The following lemma can be shown using a similar argument and the proof is omitted. Lemma A.5. For any s S, V π 1,π 2 ( s) = maxπ1 V π1,π 2 ( s) = JVlim K( s). The following lemma shows that it does not matter which agent picks its policy first. Lemma A.6. For any policies π 1 and π 2 satisfying Equations 4 and 5 respectively, for all s S, V π 1,π 2 ( s) = min π2 max π1 V π1,π2( s) = max π1 min π2 V π1,π2( s) = V ( s). Proof. We have, for any s S, V ( s) = max π1 min π2 V π1,π2( s) min π2 V π 1,π2( s) (1) = V π 1,π 2 ( s) (2) = max π1 V π1,π 2 ( s) min π2 max π1 V π1,π2( s) max π1 min π2 V π1,π2( s) where the (1) followed from Lemma A.4 and (2) followed from Lemma A.5. A.3. Proof of Theorem 3.1 Let Π(π1) = {πσ | σ Σ} be the set of subtask policies defined by π1. Let τ = σ0σ1 . . . be a task. Then we define a history-dependent policy πτ 2 in G(π1) which maintains an index i denoting the current subtask and picks σi+1 upon Robust Subtask Learning for Compositional Generalization reaching any state in S2 while simultaneously updating the index to i + 1. Then we have J(Π(π1)) = inf τ T Eρ DΠ τ t=0 γt Rτ[it](st, πτ[it](st)) i (1) = inf τ T E s η h Eρ DG(π1) s (πτ 2 ) t=0 γt Rπ1( st, at) i = inf τ T E s η[ V πτ 2 G(π1)( s)] E s η[ sup τ T V πτ 2 G(π1)( s)] (2) E s η[ max π2 V π2 G(π1)( s)] (3) = E s η[min π2 V π1,π2( s)] where (1) followed from the definitions of πτ 2 and G(π1), (2) followed from the fact that there is an optimal stationary policy maximizing V π2 G(π1)( s) and (3) followed from Lemma A.4. A.4. Proof of Theorem 3.2 Since V = JVlim K, for all (s, σ) S we have π 1(s, σ) arg maxa A Fa(Vlim)(s, σ). Now for any π 2 satisfying Equation 5, we can conclude from Lemma A.6 that, for any s S, JG(π 1) = E s η[min π2 V π 1,π2( s)] = E s η[V π 1,π 2 ( s)] = E s η[max π1 min π2 V π1,π2( s)] max π1 E s η[min π2 V π1,π2( s)] = max π1 JG(π1) which shows that π 1 maximizes JG(π1). A.5. Proof of Theorem 3.3 From Lemma A.2, we can conclude that F is a contraction over V w.r.t. the ℓ -norm. Lemmas A.6 and A.4 gives us that V S1= Vlim. 
Now the definition of Vlim implies that limn Fn(V ) = V S1 for all V V. A.6. Proof of Theorems 3.4 and 3.5 This proof is similar to the proof of convergence of asynchronous value iteration for MDPs presented in the book by Bertsekas and Tsitsiklis (1996). It is easy to see that, for any V V and σ Σ, the operators J K, F, Fasync, and Fσ,V are monotonic. Recall that, for any V V and σ Σ, we defined the corresponding Vσ Vσ using Vσ(s) = JV K(s, σ) if s S and Vσ( ) = 0. Also, we have F(V )(s, σ) = Fσ,V (Vσ)(s) = F1(V )(s, σ) for all (s, σ) S1. Now let V V be a value function such that F(V ) V . Then we have Fσ,V (Vσ) Vσ for all σ Σ. Therefore, using monotonicity of Fσ,V , we get that Fm σ,V (Vσ) Fm 1 σ,V (Vσ) Vσ for all m > 0 which implies Fm(V ) Fm 1(V ) V . Hence, for any (s, σ) S1, Fasync(V )(s, σ) = Wσ(V )(s) = lim m Fm σ,V (Vσ)(s) Fσ,V (Vσ)(s) = F(V )(s, σ). Furthermore, letting V m = Fm(V ) we get that JV m K JV K and hence V m σ Fm σ,V (Vσ) for all σ Σ. Also, for (s, σ) S1, F(V m)(s, σ) = Fσ,V m(V m σ )(s) Fσ,V (V m σ )(s) Fm+1 σ,V (Vσ)(s) = V m+1(s, σ). Therefore, using Robust Subtask Learning for Compositional Generalization continuity of F, we have F(Fasync(V )) = F(limm V m) = limm F(V m) limm V m+1 = Fasync(V ). Now we can show by induction on n that, for any V V with F(V ) V and n 1, we have F(Fn async(V )) Fn async(V ) and Vlim Fn async(V ) Fn(V ). Taking the limit as n gives us that limn Fn async(V ) = Vlim if F(V ) V . Using a symmetric argument, we get that limn Fn async(V ) = Vlim if F(V ) V . Let I V be defined by I(s, σ) = 1 for all (s, σ) S1. For a general V V, we can find a δ > 0 such that we have V = Vlim δI V Vlim + δI = V + and F(V ) V and F(V +) V +. Therefore, using monotonicity of Fasync we get Fn async(V ) Fn async(V ) Fn async(V +) for all n 0. Taking the limit as n tends to gives us the required result. Theorem 3.5 follows from a similar argument. A.7. Proof of Theorem 4.1 Given a function Q : S1 A R we define a new function H(Q) using H(Q)(s, σ, a) = R((s, σ), a) + γ X s S P(s | s, a)JVQK(s , σ) for all (s, σ) S1 and a A. Then, Robust Option Q-learning is of the form Qt+1(s, σ, a) = (1 αt(s, σ, a))Qt(s, σ, a) + αt(s, σ, a) H(Qt)(s, σ, a) + wt(s, σ, a) where the noise factor is defined by wt(s, σ, a) = γJVQt K( s, σ) γ X s S P(s | s, a)JVQK(s , σ) with s P( | s, a) being the observed sample. Let Xt denote the measure space generated by the set of random vectors {Q0, Q1, . . . , Qt, w0, . . . , wt 1, α0, . . . , αt}. Then, for all (s, σ) S1, a A and t 0, we have E[wt(s, σ, a) | Xt] = 0 and E[w2 t (s, σ, a) | Xt] 4γ2 max s S n JVQt K2(s , σ) o 4γ2 max (s ,σ ) S1,a A n Q2 t(s , σ , a ) o . Furthermore, using Lemmas A.1 and A.2 and the definition of VQ we can conclude that H is a contraction w.r.t the ℓ -norm and Q is the unique fixed point of H. Therefore, the random sequence of Q-functions {Qt}t 0 satisfies all assumptions in Proposition 4.4 of Bertsekas and Tsitsiklis (1996) implying that Qt Q as t with probability 1. B. Experimental Details All experiments were run on a 48-core machine with 512GB of memory and 8 GPUs. In all approaches (ours and baselines) except for MADDPG, the policy consists of one fully-connected NN per subtask, each with two hidden layers. MADDPG consists of two policies, one for the agent and one for the adversary, each with two hidden layers. In the case of MADDPG, the subtask is encoded in the observation using a one-hot vector. All hyperparameters were computed by grid search over a small set of values. Rooms environment. 
The hidden dimension used is 64 for all approaches except MADDPG for which we use 128 dimensional hidden layers. For DAGGER, NAIVE and AROSAC we run SAC with Adam optimizer (learning rate of α = 0.01), entropy weight β = 0.05, Polyac rate 0.005 and batch size of 100. In each iteration of AROSAC and DAGGER, SAC is run for N = 10000 steps. Similarly, ROSAC is run with Adam optimizer (learning rates αψ = αθ = 0.01), entropy weight β = 0.05, Polyac rate 0.005 and batch size of 300. The MADDPG baseline uses a learning rate of 0.0003 and batch size of 256. PAIRED uses PPO with a learning rate of 0.02, batch size of 512, minibatch size of 128 and 4 epochs for each policy update. The adversary is trained using REINFORCE with a learning rate of 0.003. Robust Subtask Learning for Compositional Generalization F1/10th environment. The hidden dimension used is 128 for all approaches. For DAGGER, NAIVE and AROSAC we run SAC with Adam optimizer (learning rate of α = 0.001), entropy weight β = 0.03, Polyac rate 0.005 and batch size of 128. In each iteration of AROSAC and DAGGER, SAC is run for N = 10000 steps. Similarly, ROSAC is run with Adam optimizer (learning rates αψ = αθ = 0.001), entropy weight β = 0.03, Polyac rate 0.005 and batch size of 5 128. The MADDPG baseline uses a learning rate of 0.0003 and batch size of 256. PAIRED uses PPO with a learning rate of 0.001, batch size of 512, minibatch size of 128 and 4 epochs for each policy update. The adversary is trained using REINFORCE with a learning rate of 0.003.
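For reference, the settings above could be collected into configuration dictionaries along the following lines. The key names are ours and do not correspond to the released code, and the ROSAC batch size for the F1/10th environment is read as 5 × 128 (an assumption based on the text).

```python
# Hypothetical consolidation of the reported hyperparameters; values are taken
# from the text above, key names and structure are ours.
ROOMS_CONFIG = {
    "hidden_dim": 64,                       # 128 for MADDPG
    "sac": {"optimizer": "Adam", "lr": 0.01, "entropy_weight": 0.05,
            "polyak_rate": 0.005, "batch_size": 100},
    "rosac": {"optimizer": "Adam", "lr_q": 0.01, "lr_pi": 0.01,
              "entropy_weight": 0.05, "polyak_rate": 0.005, "batch_size": 300},
    "arosac_dagger_sac_steps_per_iteration": 10_000,
    "maddpg": {"lr": 0.0003, "batch_size": 256},
    "paired": {"ppo_lr": 0.02, "batch_size": 512, "minibatch_size": 128,
               "epochs_per_update": 4, "adversary_reinforce_lr": 0.003},
}

F110_CONFIG = {
    "hidden_dim": 128,
    "sac": {"optimizer": "Adam", "lr": 0.001, "entropy_weight": 0.03,
            "polyak_rate": 0.005, "batch_size": 128},
    "rosac": {"optimizer": "Adam", "lr_q": 0.001, "lr_pi": 0.001,
              "entropy_weight": 0.03, "polyak_rate": 0.005, "batch_size": 5 * 128},
    "arosac_dagger_sac_steps_per_iteration": 10_000,
    "maddpg": {"lr": 0.0003, "batch_size": 256},
    "paired": {"ppo_lr": 0.001, "batch_size": 512, "minibatch_size": 128,
               "epochs_per_update": 4, "adversary_reinforce_lr": 0.003},
}
```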