# On Efficiency in Hierarchical Reinforcement Learning

Zheng Wen (DeepMind) zhengwen@google.com · Doina Precup (DeepMind) doinap@google.com · Morteza Ibrahimi (DeepMind) mibrahimi@google.com · Andre Barreto (DeepMind) andrebarreto@google.com · Benjamin Van Roy (DeepMind) benvanroy@google.com · Satinder Singh (DeepMind) baveja@google.com

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Hierarchical Reinforcement Learning (HRL) approaches promise to provide more efficient solutions to sequential decision making problems, both in terms of statistical as well as computational efficiency. While this has been demonstrated empirically over time in a variety of tasks, theoretical results quantifying the benefits of such methods are still few and far between. In this paper, we discuss the kind of structure in a Markov decision process that gives rise to efficient HRL methods. Specifically, we formalize the intuition that HRL can effectively exploit repeating "sub-MDPs" with similar reward and transition structure. We show that, under reasonable assumptions, a model-based Thompson sampling-style HRL algorithm that exploits this structure is statistically efficient, as established through a finite-time regret bound. We also establish conditions under which planning with structure-induced options is near-optimal and computationally efficient.

## 1 Introduction

Hierarchical reinforcement learning (HRL) refers to the ability of an agent to act and plan at multiple levels of temporal abstraction [Sutton et al., 1999, Barto and Mahadevan, 2003]. In principle, this ability can provide several benefits: (1) more efficient exploration, by employing policies that help an agent circulate more efficiently over the state space; (2) more efficient credit assignment, because temporally extended models propagate credit over many time steps, and the same stream of data can be re-used to learn about many possible contingencies (for example, expressed as sub-goals); and (3) the ability to solve smaller problems and compose the resulting policies and models quickly in new situations. From a theoretical point of view, (1) has been investigated in recent work [Fruit et al., 2017]. Aspects (2) and (3) have received empirical validation in a large number of papers, but not much theoretical analysis, except for some special cases, such as the work of Mann et al. [2015].

In this paper, we present two general results which highlight the types of problems in which HRL is expected to provide benefits, in terms of planning speed as well as statistical efficiency. First, as has been highlighted empirically in the past, having repeated structure in the Markov decision process (MDP) can lead to large speedups in both respects. Second, in terms of planning, HRL provides benefits when we are able to "insulate" sub-problems well, so that they can be solved in isolation and their solutions "stitched" together. We formalize the latter intuition.

Contributions: First, we formalize a notion of MDP decomposition into sub-problems, using state partitions. Second, we show that the existence of hierarchical structure in the environment can lead to statistically efficient learning, by extending the results in Osband et al. [2013] to establish a regret bound that separates errors due to sub-optimal planning from errors due to learning. If the original
MDP can be decomposed into repeating sub-problems that are relatively small, the expected regret of a posterior sampling exploration algorithm can be dramatically reduced. Finally, we study planning under decompositions of the original problem. We establish a relationship between the complexity of planning and the number of contingencies under which sub-problems are solved, formalized through a notion of exit profiles. We show formally that near-optimal planning can be obtained much faster if the original problem can be partitioned into repeated sub-problems that are small and well separated.

## 2 Problem formulation

Consider a finite-time horizon MDP M = ⟨S, A, P, r, s_e, s_0⟩, where S is a finite state space, A is a finite action space, P and r respectively encode the transition model and the reward model, s_e ∈ S is a fixed terminal state, and s_0 ∈ S is a fixed initial state. (The fixed initial state assumption can easily be relaxed to allow an initial state drawn from any fixed distribution.) If the agent takes action a ∈ A at a non-terminal state s ∈ S \ {s_e}, it receives a random reward drawn from distribution r(·|s, a) and transits to the next state s′ ∈ S with probability P(s′|s, a). Without loss of generality, we assume that the support of the reward distribution r(·|s, a) is contained in [0, 1] for all s ∈ S \ {s_e} and a ∈ A. We use r(s, a) to denote the mean of r(·|s, a).

In this paper, we consider two different but related problems in MDPs: planning and reinforcement learning (RL). In the planning setting, the agent knows M, and its goal is to compute a near-optimal policy π : S → A, i.e., a policy that maximizes the expected total reward max_π E^π[ Σ_{h=1}^{τ−1} r_h ], where r_h ∼ r(·|s_h, a_h) is the reward received by taking action a_h = π(s_h) in time period h, and τ is a random variable denoting the time at which the agent enters s_e. To simplify the exposition, we assume that under any policy π, τ ≤ τ_max with probability 1 and E[τ] ≤ H.

In the RL setting, the agent knows S, A, s_e, and s_0, but does not know P or r. The agent repeatedly interacts with M for T episodes. An episode always starts at s_0 and ends immediately upon entering s_e. The agent's goal is to maximize its expected cumulative reward over the T episodes, max Σ_{t=1}^{T} E[ Σ_{h=1}^{τ_t} r_{th} ], where we use t to index episodes and h to index periods within an episode, r_{th} ∼ r(·|s_{th}, a_{th}) is the reward received in period h of episode t, and τ_t is the duration of episode t. Note that an RL algorithm might call a planning algorithm to compute a policy (for example, if the RL algorithm is model-based).

## 3 Defining sub-problems and hierarchical structure

We would like to capture the intuitive notion of modularity in the context of MDPs. Modularity allows a large problem to be broken down into sub-problems, which can be tackled and solved independently from the rest. Sub-problem solutions can then be "stitched" together to (approximately) solve the entire problem. Intuitively, if these sub-problems are relatively small and repeated, this approach can lead to large computational gains. We now formalize these intuitions for MDPs.

Definition 1 Consider a partition of the non-terminal states S \ {s_e} into L disjoint subsets ℋ = {S_i}_{i=1}^{L}. We define an induced sub-MDP M_i = ⟨S_i ∪ E_i, A, P_i, r_i, E_i⟩ as follows: S_i is the internal state set, and the action space is still A. The exit state set E_i is defined as E_i = {e ∈ S \ S_i : ∃ (s, a) ∈ S_i × A s.t. P(e|s, a) > 0}. The state space of M_i is S_i ∪ E_i. P_i and r_i are respectively the restrictions of P and r to the domain S_i × A. The sub-MDP M_i terminates once it reaches a state in E_i (i.e., an exit state).

Given a partition ℋ of M, consider the set of induced sub-MDPs, {M_i}_{i=1}^{L}.
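To make Definition 1 concrete, the following sketch extracts each sub-MDP's internal and exit state sets from a tabular model, and also tallies the maximum sub-MDP size M and the overall exit-state set E used in equation (1) just below. The representation (a dict P[(s, a)] = {s': probability}) and all function names are illustrative assumptions, not constructs from the paper.

```python
def induced_sub_mdps(partition, P):
    """Given a partition {i: set of internal states} of S \\ {s_e} and a tabular
    transition model P[(s, a)] = {s_next: prob}, return each induced sub-MDP's
    internal state set S_i and exit state set E_i (Definition 1)."""
    sub_mdps = {}
    for i, S_i in partition.items():
        E_i = set()
        for (s, a), successors in P.items():
            if s not in S_i:
                continue
            for s_next, prob in successors.items():
                # An exit state of M_i is any state outside S_i that is reachable
                # in one step from an internal state with positive probability.
                if prob > 0 and s_next not in S_i:
                    E_i.add(s_next)
        sub_mdps[i] = {"internal": set(S_i), "exits": E_i}
    # The quantities of equation (1): maximum sub-MDP size and the set of all exits.
    M = max(len(m["internal"]) + len(m["exits"]) for m in sub_mdps.values())
    E = set().union(*(m["exits"] for m in sub_mdps.values()))
    return sub_mdps, M, E
```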
We define M, the maximum size of any sub-MDP, and E, the set of all exit states, as follows:

M = max_i |S_i ∪ E_i|   and   E = ∪_{i=1}^{L} E_i,    (1)

where |·| denotes the cardinality of a set. Intuitively, each sub-MDP can be viewed as a sub-problem of the original MDP. We can exploit the fact that some sub-problems may be similar to each other by solving only one instance and re-using this solution. We now define the notion of equivalent sub-MDPs in order to capture this idea.

Definition 2 (Equivalent sub-MDPs) Two sub-MDPs M_i and M_j are equivalent if there is a bijection f : S_i ∪ E_i → S_j ∪ E_j such that f(S_i) = S_j, f(E_i) = E_j, and, through f, the sub-MDPs have the same transition probabilities and rewards at internal states.

Note that the constraints f(S_i) = S_j and f(E_i) = E_j ensure that an internal (respectively, exit) state in M_i is mapped to an internal (respectively, exit) state in M_j. Let K ≤ L be the number of equivalence classes of sub-MDPs induced by a particular partition ℋ of M. When there is no repeatable structure, K = L; when the partition produces repeatable structure, K < L. In summary, any state space partition ℋ yields three parameters: the maximum size M of an induced sub-MDP, the number K of sub-MDP equivalence classes, and the total number of exit states |E|. Our analyses will depend on these three quantities, rather than on |S|, the number of states in M.

While the results that we will present hold for any MDP M and any partition ℋ, they will offer dramatic improvements over the standard algorithms only if M exhibits hierarchical structure with respect to a partition ℋ, by which we mean:

1. MK ≪ |S|;
2. the number of exit states |E| is small relative to the total number of states |S|.

Intuitively, condition 1 can be satisfied by having small M, small K, or both. If the number of equivalence classes K is small, solving a sub-MDP once may produce solutions that can be re-used in many other parts of the original problem. If M is small, all sub-MDPs have small size, so they are relatively easy to solve. Finally, if |E| is small, intuitively we have a small number of states that connect the sub-problems. We can think of these as "bottleneck" states in M, which have been shown before to enable computationally efficient planning (see e.g. Sutton et al. [1999], McGovern and Barto [2001], Stolle and Precup [2002], Simsek and Barto [2009], Solway et al. [2014]).

To make this notion of hierarchical structure more concrete, consider a garbage-collecting robot navigating in a building. The robot's goal is to maximize the amount of trash it collects before exiting. The building has L floors, and each floor belongs to one of K floor types defined according to some criterion relevant to the robot. A floor of type i has |E_i| ≤ |E| exits to other floors (elevators, stairs, etc.). When exiting a floor, the agent will either get to another floor or leave the building. Each floor has L′ rooms, which can also be grouped into K′ groups. A room of type j can be divided into M′_j ≤ M′ regions that define the robot's current state; some of these regions contain garbage that should be collected. Rooms are connected through up to |E′| doors. If we think of the robot as an RL agent and the building as an MDP, the problem can be partitioned at two different levels: each sub-MDP can be either a floor or a room.
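As a sanity check on Definition 2, the sketch below groups sub-MDPs into equivalence classes by brute-force search over bijections between their state sets; it is only practical for very small sub-MDPs. It reuses the `sub_mdps` structure and tabular `P` from the previous sketch, assumes mean rewards are given as a dict `r[(s, a)]`, that states are sortable (e.g. ints or strings), and uses exact equality of probabilities for simplicity; all names are illustrative.

```python
from itertools import permutations

def equivalent(mi, mj, P, r, A):
    """Brute-force test of Definition 2: look for a bijection f that maps internal
    states to internal states, exits to exits, and preserves the transition
    probabilities and mean rewards at internal states."""
    Si, Ei = sorted(mi["internal"]), sorted(mi["exits"])
    Sj, Ej = sorted(mj["internal"]), sorted(mj["exits"])
    if (len(Si), len(Ei)) != (len(Sj), len(Ej)):
        return False
    for internal_map in permutations(Sj):
        for exit_map in permutations(Ej):
            f = dict(zip(Si, internal_map)) | dict(zip(Ei, exit_map))
            if all(
                r.get((s, a)) == r.get((f[s], a))
                and all(P.get((s, a), {}).get(t, 0.0) == P.get((f[s], a), {}).get(f[t], 0.0)
                        for t in Si + Ei)
                for s in Si for a in A
            ):
                return True
    return False

def equivalence_classes(sub_mdps, P, r, A):
    """Group the induced sub-MDPs into equivalence classes; their number is K."""
    classes = []  # each class is a list of sub-MDP indices
    for i, mi in sub_mdps.items():
        for cls in classes:
            if equivalent(sub_mdps[cls[0]], mi, P, r, A):
                cls.append(i)
                break
        else:
            classes.append([i])
    return classes
```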
Each of these partitions will result in different values for the constants appearing in conditions 1 and 2 above, which will in turn determine how efficiently the agent can solve the problem.

In the next two sections, we analyze RL (Section 4) and planning (Section 5), assuming that the agent starts with a known partition ℋ in which the equivalence classes of the induced sub-MDPs are also known. We establish results showing that leveraging hierarchical structure allows agents to achieve better statistical efficiency in the case of learning, and better computational complexity in the case of planning (at the cost of some controlled sub-optimality).

## 4 Statistically Efficient Learning with Hierarchical Structure

It is natural to think of hierarchical reinforcement learning as what an agent does when it possesses prior knowledge that the environment obeys hierarchical structure. As we establish in this section, hierarchical structure can enable more efficient learning, and the difference can be dramatic if the MDP exhibits highly repetitive structure. This occurs when the number of sub-MDPs far exceeds the number of equivalence classes. We formally characterize the improvements in statistical efficiency by studying a specific reinforcement learning algorithm that can leverage prior knowledge in a coherent manner. We expect our qualitative insights to extend to other algorithms that carefully account for prior knowledge.

### 4.1 Posterior Sampling for Reinforcement Learning

Posterior sampling for reinforcement learning (PSRL), as introduced by Strens [2000] and analyzed in Osband et al. [2013] and Gopalan and Mannor [2015], offers an often effective approach to episodic reinforcement learning. Before each episode of interaction, the agent samples a model of the environment, possibly in the form of an MDP, from the posterior distribution over environments conditioned on the data gathered over previous episodes. Then, the agent computes an optimal policy for the sampled model and applies it to select actions over the next episode.

To guide exploration, PSRL relies on a representation of epistemic uncertainty in terms of a probability distribution over environments. The prior distribution reflects the agent's initial partial knowledge about the environment. To quantify performance, regret bounds that apply under any prior distribution are established in Osband et al. [2013]. These bounds do not capture the benefits of greater degrees of prior knowledge, but subsequent results [Osband and Van Roy, 2014a,b] demonstrate that stronger regret bounds, reflecting dramatic improvements in agent performance, are possible when the prior distribution reflects knowledge of special environment structure.

Algorithm 1 offers pseudocode for a generalized form of PSRL. Over each episode, a sampler produces an MDP M_t that represents a statistically plausible model of the environment given the state of knowledge P_t. Then, a planner computes a policy π_t that approximately optimizes M_t. This policy is executed over the episode, producing a data set D_t made up of the trajectory of states, actions, and rewards. Finally, the state of knowledge is updated by an inference algorithm. Ideally, as in the pure form of PSRL, P_t encodes a posterior distribution over MDPs, the sampling algorithm draws M_t from P_t, the planner computes an optimal policy for M_t, and the inference algorithm applies Bayes' rule.
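This loop (stated as Algorithm 1 below) is compact enough to write directly; here is a schematic Python rendering in which `sample`, `plan`, `run_episode`, and `infer` are placeholders for the components the paper deliberately leaves abstract.

```python
def psrl(P0, sample, plan, run_episode, infer, T):
    """Generalized PSRL (Algorithm 1): sample a statistically plausible model,
    plan against it, act for one episode, then update the state of knowledge."""
    P_t, policies = P0, []
    for t in range(1, T + 1):
        M_t = sample(P_t)         # plausible MDP given current knowledge P_t
        pi_t = plan(M_t)          # (approximately) optimal policy for M_t
        D_t = run_episode(pi_t)   # trajectory of states, actions, and rewards
        P_t = infer(P_t, D_t)     # e.g. Bayes' rule in the pure form of PSRL
        policies.append(pi_t)
    return policies
```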
However, the more flexible generalization in Algorithm 1 can be applied more broadly, even when it is infeasible to exactly represent, compute, or sample from the posterior distribution, or to identify an optimal policy.

    Algorithm 1: PSRL with a Planner, Sampler, and Inferer
    Initialization: prior knowledge P_0; planning algorithm plan; sampling algorithm sample;
                    inference algorithm infer
    for episode t = 1, 2, ..., T do
        sample  M_t ~ sample(P_t)
        plan    π_t = plan(M_t)
        execute π_t over episode t and observe D_t
        infer   P_{t+1} = infer(P_t, D_t)
    end

### 4.2 Hierarchical Reinforcement Learning

We consider posterior sampling for hierarchical reinforcement learning (PSHRL) to be PSRL applied with a particular kind of prior distribution, and possibly with a planner that is customized for such an environment. In particular, we consider priors that include only MDPs that obey hierarchical structure, as described earlier, for fixed values of M and K. As for the planner, we alternately consider an optimal planner and one designed to produce approximately optimal policies more efficiently by leveraging hierarchical structure, as discussed in Section 5.

The per-episode computational complexity of PSHRL depends on the special structure obeyed by P_t and M_t, as well as the choice of plan, sample, and infer. As we show in Section 5, suitable hierarchical structure can be leveraged to improve computational efficiency. The choice of P_0, sample, and infer determines the tractability of sampling and inference; again, suitable hierarchical structure can allow for more efficient execution of these steps. Recall that K is the number of sub-MDP equivalence classes and M is the maximal number of states per sub-MDP. Hierarchical structure is especially informative when MK is small relative to the number of MDP states |S|. This can yield dramatic improvements in statistical efficiency, as we establish via a regret bound.

### 4.3 Regret Bound

For any learning algorithm alg, we define the Bayesian regret over the first T episodes as

BayesRegret(alg, T) = Σ_{t=1}^{T} E[ V*(s_0) − V^{π_t}(s_0) ],    (2)

where V* is the optimal value function of M, and V^{π_t} is the value function under policy π_t. The regret is "Bayesian" in the sense that the expectation integrates over M with respect to the prior distribution P_0. Note that minimizing BayesRegret(alg, T) is equivalent to maximizing Σ_{t=1}^{T} E[V^{π_t}(s_0)] = Σ_{t=1}^{T} E[ Σ_{h=1}^{τ_t} r_{th} ].

We will study BayesRegret(PSRL, T), the Bayesian regret of PSRL, when applied with a prior that exhibits hierarchical structure. Recall that, under any policy π, E[τ] ≤ H and τ ≤ τ_max with probability 1. The following theorem is our main result on statistical efficiency.

Theorem 1 (Regret Bound) If P_0 exhibits hierarchical structure with a maximum of M states per sub-MDP and K sub-MDP equivalence classes, sample draws from the posterior distribution, and infer applies Bayes' rule, then

BayesRegret(PSRL, T) ≤ E[ V*(s_0) − V^π̄(s_0) ] T + O( H^{3/2} M √(K|A|T) log(|A|KHτ_max T) ),

where the first term is due to sub-optimal planning, the second term is due to learning, and π̄ = plan(M).

Note that the expectation in E[V*(s_0) − V^π̄(s_0)] represents an integral with respect to the prior distribution P_0 of M. Also note that if plan exactly optimizes M, then π̄ = π*, and therefore E[V*(s_0) − V^π̄(s_0)] = 0. Approximately optimal planning can contribute to the regret a term E[V*(s_0) − V^π̄(s_0)] T that grows linearly with the number T of episodes.
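To make the sampling and inference hypotheses of Theorem 1 concrete under a hierarchical prior, here is a hedged sketch in which every sub-MDP in an equivalence class shares one unknown model, so the agent keeps Dirichlet posteriors over transitions and Beta posteriors over Bernoulli rewards per class rather than per sub-MDP (roughly MK|A| distinct distributions in total). The conjugate-prior choice, the class-local indexing, and all names are illustrative assumptions rather than the paper's prescription.

```python
import numpy as np

class HierarchicalPosterior:
    """Per-equivalence-class Dirichlet/Beta posteriors.  Assumes every observed
    transition has already been mapped, through the known bijections, to
    class-local coordinates (class k, local state s, action a, local next
    state s2, binary reward)."""

    def __init__(self, K, M, A, prior=1.0):
        # Dirichlet counts over at most M local successor states per (k, s, a).
        self.trans = np.full((K, M, A, M), prior)
        # Beta counts (successes, failures) for Bernoulli rewards per (k, s, a).
        self.rew = np.full((K, M, A, 2), prior)

    def sample(self, rng):
        """The `sample` step of PSRL: draw one plausible model per class."""
        P = np.apply_along_axis(rng.dirichlet, -1, self.trans)
        r = rng.beta(self.rew[..., 0], self.rew[..., 1])
        return P, r  # tile these class models across sub-MDPs to assemble M_t

    def update(self, transitions):
        """The `infer` step of PSRL: a conjugate Bayesian update."""
        for k, s, a, s2, reward in transitions:
            self.trans[k, s, a, s2] += 1.0
            self.rew[k, s, a, 0 if reward else 1] += 1.0
```

With `rng = np.random.default_rng()`, each call to `sample` yields the class-level models from which a full MDP M_t can be assembled before calling the planner.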
One factor of O(H) in the second term is due to the magnitude of the optimal value E[V*(s_0)], which can grow with the horizon and is therefore inevitable. We show that hierarchical structure can enable statistically more efficient learning by comparing with the PSRL regret bound of Osband et al. [2013], which applies under any prior. Assuming that the MDP has fixed horizon τ = H and that plan always returns an optimal policy, Osband et al. [2013] established BayesRegret(PSRL, T) ≤ Õ(H^{3/2} |S| √(|A|T)), where the Õ(·) notation hides logarithmic factors. (Some notation in Osband et al. [2013] has a different meaning: their "τ" corresponds to our H, and their "T" corresponds to our HT.) Our regret bound, under the same assumptions, is Õ(H^{3/2} M √(K|A|T)); that is, we have replaced O(|S|) with O(M√K). Notice that if M is treated as a single sub-MDP, then K = L = 1 and M = |S|, so our regret bound reduces to that of Osband et al. [2013]. On the other hand, when M√K ≪ |S|, our regret bound conveys a dramatic improvement. This improvement can be interpreted in terms of two components. First, the replacement of O(√|S|) with O(√(MK)) arises because the agent needs to learn about a smaller number of distinct states in the hierarchical MDP. Second, O(√|S|) is replaced by O(√M) because at each state-action pair of the hierarchical MDP, the agent can transition to at most M successor states.

The proof of Theorem 1 is partially motivated by the analysis in Osband et al. [2013]. However, we consider a different setting and our results are technically more involved. Specifically, compared with Osband et al. [2013], Theorem 1 accounts for hierarchical structure and allows for both sub-optimal planning and a random time horizon τ. Please refer to Appendix A for the detailed proof of Theorem 1.

## 5 Computationally Efficient Planning with Hierarchical Structure

We now turn our attention to the problem of planning in M given a partition ℋ. This problem has been tackled in the framework of options, using option models [Sutton et al., 1999]. Options can be viewed as policies that act in a subset of states, and with which one can associate temporally extended reward and transition models. The option policies can be thought of as solutions to sub-problems generated by an MDP's structure and some subgoals, which are additional rewards associated with particular states [Sutton et al., 1999]. One can view this approach as constructing options corresponding to sub-MDPs. To make this problem well defined, one needs to consider possible combinations of values associated with the exit states of a sub-MDP. For example, an agent that is in a room with two doors might consider making either door a subgoal by giving it a high reward. An option can then be trained inside the room, which amasses treasure if it makes sense to do so and then exits through the designated door. We now formalize this intuition through the notion of exit profiles.

Definition 3 (Exit Profile) An exit profile J for sub-MDP M_i is a vector of values J(e), e ∈ E_i.
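For intuition, the sketch below computes the option induced by an exit profile on a single sub-MDP: exit states are treated as terminal with value J(e), and, assuming the sub-MDP's transition graph is acyclic so that the internal states can be swept in a reverse topological order (in the spirit of Assumption 1 in Section 5.1), one backward pass of value iteration recovers the option and its value function. The tabular representation and names are illustrative assumptions.

```python
def option_from_exit_profile(E_i, A, P, r, J, order):
    """Compute the option pi_{i,J} induced by exit profile J on sub-MDP M_i.
    `order` lists the internal states so that every successor of a state has
    already been processed (exit states are handled up front)."""
    V = {e: J[e] for e in E_i}   # exit states: value fixed by the exit profile
    pi = {}
    for s in order:              # backward sweep over the internal states
        best_a, best_q = None, float("-inf")
        for a in A:
            q = r[(s, a)] + sum(p * V[s2] for s2, p in P[(s, a)].items())
            if q > best_q:
                best_a, best_q = a, q
        V[s], pi[s] = best_q, best_a
    return pi, V                 # V gives the option's value at every internal start state
```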
Note that, from the perspective of a sub-MDP, the structure outside of it is summarized in an exit profile J. An exit profile induces an optimal policy π_{i,J} for the sub-MDP M_i, which we will think of as an option. By definition, an exit profile J induces the same option for equivalent sub-MDPs.

Once a set of options and associated models has been computed, one can define an induced high-level MDP M^G = ⟨S^G, A^G, P^G, r^G⟩, whose state space S^G = E ∪ {s_0} is the union of all exit states and the initial state s_0. For each s ∈ S^G, if s is a state in a sub-MDP M_i, then its action space A^G(s) is the set of options computed for M_i; r^G(s, π_{i,J}) is the expected reward obtained from s ∈ S_i under option π_{i,J} until the option reaches an exit state e ∈ E_i, and P^G(e|s, π_{i,J}) gives the probability of transitioning to e ∈ E_i. These quantities form the option model for π_{i,J}, defined as in Sutton et al. [1999], which can be computed at the same time as the option itself. (Because of the finite-time horizon there is no discount factor, which simplifies the model and gives rise to an MDP at the high level as well, instead of an SMDP.) The process of creating a set of options and models corresponding to a set of exit profiles, and then using them to solve M^G, is summarized in Algorithm 2, which we call Planning with Exit Profiles (PEP).

    Algorithm 2: Planning with Exit Profiles (PEP)
    Input: MDP M; K sets of exit profiles 𝒥_k, one for each sub-MDP equivalence class k
    Step 1 (option generation):
        for k = 1, 2, ..., K do
            for each exit profile J ∈ 𝒥_k, compute one option π_{k,J} for the sub-MDPs
            in equivalence class k, together with its associated model
        end
    Step 2 (plan with options):
        compute a policy for the induced high-level MDP M^G, which induces a policy π for M
    Return: π

### 5.1 Computational Complexity of Planning with Options

PEP is a blueprint for planning with options, which can be instantiated using different dynamic programming approaches [Bertsekas, 2015] to implement Steps 1 and 2. We discuss the complexity of this algorithm when value iteration (VI) is used for both steps, but similar analyses can easily be carried out for other algorithms (e.g., policy iteration). In order to simplify the analysis and ensure that VI terminates in a finite number of steps, we make the following assumption. (We make Assumption 1 only to simplify the exposition of the computational complexity results; it is not strictly necessary and can be relaxed.)

Assumption 1 For M, all its induced sub-MDPs with exit profiles, and the induced M^G, the transition probability graph corresponding to an optimal policy is acyclic.

Under Assumption 1, VI computes the value function in n iterations under a proper initialization, where n is the cardinality of the state space (see Section 3.4.1 of Bertsekas [2015]). Under this assumption, the computational complexity of VI in M is O(|S|²|A|M), because, by our definition, M is an upper bound on the number of states to which any state-action pair can transition. Let X = max_k |𝒥_k| denote the maximum cardinality of an exit profile set over all sub-MDP equivalence classes (see Algorithm 2). The computational complexity of PEP can then be expressed as

O(KXM²|A|M) (Step 1) + O(|E|²XM) (Step 2) ≤ O( X [KM²|A| + |E|²] M ).    (3)

Roughly speaking, planning with options will be efficient if XM²K < O(|S|²) and |E|²X < O(|S|²|A|), which means that all sub-MDPs are small, a small number of exit profiles is used to generate options for each sub-MDP equivalence class, and the total number of exit states |E| is small.

### 5.2 Performance of Planning with Options

We now provide a performance bound for PEP, based on the "quality" of the exit profiles. Let 𝒥_i ⊆ [0, H]^{|E_i|} be the space of possible exit profiles for sub-MDP M_i. Let V^π_{i,J} denote the value of a policy π for an exit profile J in M_i. We denote by V*_{i,J} the value of π*_{i,J}, the optimal policy of M_i with respect to exit profile J.
We now define the suboptimality of a set of exit profiles:

Definition 4 (Exit Profile Suboptimality) The suboptimality of a set of exit profiles 𝒥′ for M_i is defined as

∆_i(𝒥′) = max_{s ∈ S_i^0, J ∈ 𝒥_i} [ V*_{i,J}(s) − max_{J′ ∈ 𝒥′} V_{i,J}^{π_{i,J′}}(s) ],    (4)

where S_i^0 is the set of possible start states in M_i, and 𝒥_i is the space of possible exit profiles.

Notice that V_{i,J}^{π_{i,J′}}(s) is the value, under exit profile J and at start state s, of the policy π_{i,J′} that is optimal for another exit profile J′. In other words, the definition of ∆_i(𝒥′) ensures that for any exit profile J, there exists an exit profile in 𝒥′ that induces a ∆_i(𝒥′)-optimal policy under J. Recall that in Algorithm 2, for a sub-MDP M_i in equivalence class k, the exit profiles 𝒥_k are used for option generation. Thus, we define ∆ = max_i ∆_i(𝒥_{k_i}), where k_i is the equivalence class that M_i is in. We can also prove that PEP with VI is near-optimal under Assumption 1 and a mild technical assumption.

Proposition 1 If in Step 2 of PEP the agent uses VI with initial V = 0 to compute a policy π, then under Assumption 1 and a mild technical assumption, we have V*(s_0) − V^π(s_0) ≤ ∆|E|.

Please refer to Appendix B.1 for the proof of this proposition. Roughly speaking, Proposition 1 states that if the total number of exit states |E| is small, and the exit profiles used in PEP have high quality (∆ is small), then PEP returns a near-optimal policy.

### 5.3 Sufficient Conditions for High-Quality Exit Profiles

The previous results indicate that, in order to ensure that planning with options is both computationally efficient and returns a near-optimal policy, we need the set of exit profiles considered by PEP to have small cardinality (for computational efficiency) and high quality (for near-optimality). We now discuss how to choose such exit profile sets. In general, an exit profile set can be chosen based on an ε-cover, defined as follows: a finite set 𝒥′ is an ε-cover for 𝒥 if for any J ∈ 𝒥, there exists J′ ∈ 𝒥′ such that ‖J − J′‖ ≤ ε. We then have the following result:

Proposition 2 For a sub-MDP M_i in equivalence class k, if 𝒥_k is an ε-cover for 𝒥_i, then ∆_i(𝒥_k) ≤ 2ε.

Please refer to Appendix B.2 for the proof of Proposition 2. Since 𝒥_i ⊆ [0, H]^{|E_i|}, there always exists a finite ε-cover 𝒥_k for 𝒥_i with |𝒥_k| ≤ ⌈H/ε⌉^{|E_i|}. In general, the cardinality of an ε-cover is too large to guarantee PEP's computational efficiency. However, if max_i |E_i| is very small (e.g., max_i |E_i| ≤ 3), then PEP with an ε-cover will be computationally efficient.

Another favorable special case is when sub-MDP M_i has deterministic exiting; that is, for any start state s and under any deterministic policy π, the agent exits M_i at a single state e_{s,π} ∈ E_i with probability 1. Note that in general e_{s,π} depends on both the start state s and the deterministic policy π. Deterministic exiting occurs, for instance, if M_i has only one exit state or has deterministic transitions. For any e ∈ E_i, we define J_e : E_i → ℝ as J_e(s) = (H + 1)·1[s = e], where 1[·] is the indicator function. We then have the following result:

Proposition 3 For a sub-MDP M_i in equivalence class k, if M_i has deterministic exiting and 𝒥_k = {J_e : e ∈ E_i}, then ∆_i(𝒥_k) = 0.

Please refer to Appendix B.3 for the proof of Proposition 3. Note that in this case |𝒥_k| = |E_i|. Based on Proposition 1, if all the sub-MDPs have deterministic exiting, then we can efficiently compute an optimal policy π for M by using PEP.
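The two constructions of Section 5.3 are easy to write down. The sketch below builds a grid ε-cover of [0, H]^{|E_i|} (a cover in the sup norm, which is one reasonable reading of the unspecified norm above) and the deterministic-exiting profiles J_e of Proposition 3. Both are illustrative; the grid has roughly (H/ε)^{|E_i|} elements, so it is only practical when |E_i| is very small.

```python
from itertools import product
from math import ceil

def epsilon_cover_profiles(E_i, H, eps):
    """A grid epsilon-cover of the profile space [0, H]^{|E_i|} (Proposition 2).
    Only practical for very small |E_i|: the grid has (ceil(H/eps)+1)^{|E_i|} points."""
    grid = [min(j * eps, H) for j in range(ceil(H / eps) + 1)]
    exits = sorted(E_i)
    return [dict(zip(exits, values)) for values in product(grid, repeat=len(exits))]

def deterministic_exit_profiles(E_i, H):
    """The profiles J_e of Proposition 3: value H + 1 at the designated exit e
    and 0 at every other exit, one profile per exit state."""
    return [{s: (H + 1 if s == e else 0) for s in E_i} for e in E_i]
```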
## 6 Summary of Results

    learning alg.   planner   regret bound                        computation per episode
    PSRL            VI        Õ(H^{3/2} |S| √(|A|T))              O(|S|²|A|M)
    PSHRL           VI        Õ(H^{3/2} M √(K|A|T))               O(|S|²|A|M)
    PSHRL           PEP       ∆|E|T + Õ(H^{3/2} M √(K|A|T))       O(X(M²K|A| + |E|²)M)

Table 1: Algorithm comparison. Recall that S and A are the state and action spaces of M; H is a bound on the expected time horizon of M; T is the number of interaction episodes; M is the maximum sub-MDP size; K is the number of sub-MDP equivalence classes; E is the set of all exit states; and ∆ and X respectively measure the quality and the number of exit profiles used in PEP.

In Table 1, we summarize the regret bounds and per-episode computational complexities of three algorithms: (1) PSRL with value iteration (VI) planning, (2) PSHRL with VI planning, and (3) PSHRL with PEP planning. (We assume that the chosen sample and infer in Algorithm 1 can be executed efficiently and that most of the computation is due to plan, as is typical in model-based RL.) As discussed before, if a partition with many repeated, small sub-MDPs is available, the regret bound for (2) is much smaller than that for (1), since PSHRL exploits hierarchical structure during learning. Moreover, if all the sub-MDPs also have few exit states and admit a small set of high-quality exit profiles, then (3) is computationally much more efficient than (1) and (2), and only incurs an additional O(∆|E|T) regret compared with (2), due to sub-optimal planning.

## 7 Related work

Several works have tackled decompositions of MDPs into sub-problems in a classical planning context [Dean and Lin, 1995, Singh and Cohn, 1998, Meuleau et al., 1998]. The closest to our work is the framework of weakly coupled MDPs [Meuleau et al., 1998], which considers a decomposition of an MDP into sub-MDPs that are solved independently by planning and whose solutions are related through a set of constraints. Loose coupling of the sub-problems allows a small set of such constraints, which intuitively has the same effect as our exit profiles. Sub-tasks with a small number of exits have been modelled through the notion of bottleneck states [McGovern and Barto, 2001, Stolle and Precup, 2002, Simsek and Barto, 2009, Solway et al., 2014]. According to our analysis, the existence of such states would indeed imply a small number of exit profiles, and hence efficient planning. Some existing work, e.g. Solway et al. [2014], Harutyunyan et al. [2019], has proposed optimization objectives for identifying such states using information-related criteria.

It is worth pointing out that the "equivalent sub-MDP" notion used in this paper can be further relaxed. In particular, rather than assuming that sub-MDPs in the same class have the same reward/transition models, we can assume that they have similar reward/transition models. Most results in this paper can be extended to that case by leveraging results on planning with approximate models [Jiang et al., 2016], such as approximate MDP homomorphisms [Ravindran and Barto, 2004], or by using bisimulation metrics to assess the similarity of sub-MDPs within a class [Ferns et al., 2004]. We leave such extensions for future work. Our notion of sub-MDPs and exit profiles can also be used to model the use of HRL for transfer learning (see Taylor and Stone [2009] for an overview of the latter topic, which includes both HRL and methods for state equivalence).

Finally, we briefly compare this paper with Mann et al. [2015]. To summarize, Mann et al.
[2015] discuss and analyze two algorithms: Option-Fitted Value Iteration (OFVI) and Landmark Approximate Value Iteration (LAVI). The OFVI analysis relies on the discounted-average concentrability of the future state distributions in the semi-MDP defined by the options, so it is a result of a very different flavor. LAVI relies on options that travel to designated landmark states and that are computed by solving a deterministic relaxation of the semi-MDP in a neighborhood of the landmarks. In our terminology, such options have a single exit state, and LAVI then solves the problem of jumping between landmarks. There is no repeating structure in this approach; in fact, each option only applies in a small neighborhood of the state space around a landmark. Our result could be applied to the LAVI setup directly, but it would be hard to compare with their bound due to the very different quantities involved.

## 8 Concluding Remarks

We have presented two theoretical results which illuminate a kind of problem structure that benefits HRL algorithms. Briefly, the ability to partition an MDP into repeating sub-MDPs leads to improved regret bounds for a Thompson sampling-style algorithm that takes advantage of this structure. Furthermore, we highlighted and formalized a trade-off between the quality and the complexity of planning in MDPs whose structure allows a small number of sub-problems to be solved.

The insights provided in this work can be used not just to further the theoretical understanding of HRL, but also for two immediate practical goals. First, one could use the MDP structure we introduced as a template to build examples for studying HRL algorithms under controlled conditions [Osband et al., 2020]. Second, these results can be turned into objective functions for HRL algorithms that discover the right structure. This could lead, for example, to algorithms for the discovery of options endowed with initiation sets (corresponding to the sub-MDP partitions), or to anytime planning algorithms that construct exit profiles as needed.

Ultimately, we would like to bring the learning and planning results together into a theoretical framework that elucidates the trade-off among three components that are crucial to RL agents: solution quality (as expressed by the true value of the policy found), sample complexity, and computational complexity. This trade-off is enabled by HRL and can be understood through the lens of the analysis we presented: HRL allows controlling the complexity of an agent's policy class, and this control should take into account both the agent's ability to acquire data and its available computational resources. We hope to develop this line of inquiry in future work.

## Broader Impact

This is a theoretical investigation and as such does not present any foreseeable immediate societal impact beyond the general concerns over progress in artificial intelligence. Specifically, we focused on the question of how to design efficient hierarchical agents for reinforcement learning problems with repeating sub-problem structure. The insights provided have the potential to help practitioners build more efficient hierarchical agents for real-world reinforcement learning problems, whose societal impact depends on the specific application.

## Acknowledgments and Disclosure of Funding

All authors of this paper are employees of DeepMind. This paper is not supported by any external funding.

## References

Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning.
Discrete Event Dynamic Systems: Theory and Applications, 13:41–77, 2003.

Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Vol. II, 4th edition. Athena Scientific, 2015.

Thomas Dean and S-H. Lin. Decomposition techniques for planning in stochastic domains. In IJCAI, 1995.

Norman Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite Markov decision processes. In UAI, 2004.

Ronan Fruit, Alessandro Lazaric, Matteo Pirotta, and Emma Brunskill. Regret minimization in MDPs with options without prior knowledge. In NeurIPS, 2017.

Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. Journal of Machine Learning Research, 40:1–38, 2015.

Anna Harutyunyan, Will Dabney, Diana Borsa, Nicolas Heess, Remi Munos, and Doina Precup. The termination critic. In AISTATS, 2019.

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(51):1563–1600, 2010.

Nan Jiang, Satinder Singh, and Ambuj Tewari. On structural properties of MDPs that bound loss due to shallow planning. In IJCAI, 2016.

Timothy A. Mann, Doina Precup, and Shie Mannor. Approximate value iteration with temporally extended actions. JAIR, 53:375–438, 2015.

Amy McGovern and Andrew G. Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. In ICML, pages 361–368, 2001.

Nicolas Meuleau, Milos Hauskrecht, Kee-Eung Kim, Leonid Peshkin, Leslie P. Kaelbling, and Thomas Dean. Solving very large weakly coupled Markov decision processes. In AAAI, 1998.

Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems 27, pages 1466–1474, 2014a.

Ian Osband and Benjamin Van Roy. Near-optimal reinforcement learning in factored MDPs. In Advances in Neural Information Processing Systems 27, pages 604–612, 2014b.

Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.

Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvári, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, and Hado van Hasselt. Behaviour suite for reinforcement learning. In International Conference on Learning Representations, 2020.

Balaraman Ravindran and Andrew G. Barto. Approximate homomorphisms: A framework for non-exact minimization in Markov decision processes. In Proceedings of the Fifth International Conference on Knowledge Based Computer Systems, 2004.

Ozgur Simsek and Andrew G. Barto. Skill characterization based on betweenness. In NeurIPS, pages 1497–1504, 2009.

Satinder Singh and David Cohn. How to dynamically merge Markov decision processes. In NeurIPS, 1998.

Alec Solway, Carlos Diuk, Natalia Cordova, Debbie Yee, Andrew G. Barto, Yael Niv, and Matthew M. Botvinick. Optimal behavioral hierarchy. PLOS Computational Biology, 10(8), 2014.

Martin Stolle and Doina Precup. Learning options in reinforcement learning. In Lecture Notes in Computer Science, 2371, pages 212–223, 2002.

Malcolm Strens. A Bayesian framework for reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, 2000.

Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. JMLR, 10:1633–1685, 2009.

Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep., 2003.