# Composing Value Functions in Reinforcement Learning

Benjamin van Niekerk\* ¹, Steven James\* ¹, Adam Earle ¹, Benjamin Rosman ¹ ²

\*Equal contribution. ¹School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa. ²Council for Scientific and Industrial Research, Pretoria, South Africa. Correspondence to: Benjamin van Niekerk, Steven James. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

An important property for lifelong-learning agents is the ability to combine existing skills to solve new, unseen tasks. In general, however, it is unclear how to compose existing skills in a principled manner. Under the assumption of deterministic dynamics, we prove that optimal value function composition can be achieved in entropy-regularised reinforcement learning (RL), and extend this result to the standard RL setting. Composition is demonstrated in a high-dimensional video game, where an agent with an existing library of skills is immediately able to solve new tasks without the need for further learning.

## 1. Introduction

A major challenge in artificial intelligence is creating agents capable of leveraging existing knowledge for inductive transfer (Taylor & Stone, 2009). Lifelong learning, in particular, requires that an agent be able to act effectively when presented with new, unseen tasks. One promising approach is to combine behaviours learned in various separate tasks to create new skills. This compositional approach allows us to build rich behaviours from relatively simple ones, resulting in a combinatorial explosion in the agent's abilities (Saxe et al., 2017). Additionally, composition allows us to incrementally expand an agent's abilities, an important property for lifelong learning.

Consider, for example, a robot in a warehouse that has the ability to pack objects on shelves. Given a new object, we would want to avoid retraining the robot from scratch. Rather, we would like the robot to learn the skill of packing only the new object, and then compose it with its previous abilities.

In reinforcement learning (RL), one existing approach to composition is the framework of linearly-solvable Markov decision processes (LMDPs) (Todorov, 2007), which structures the reward function so that the Bellman equation becomes linear in the exponentiated value function. Todorov (2009) proves that the optimal value functions of a set of LMDPs can be composed to produce the optimal value function for a composite task. This is a particularly attractive property, since solving new tasks requires no further learning. However, the LMDP framework has so far been restricted to the tabular case with known dynamics, limiting its usefulness.

Related work has focused on entropy-regularised RL (Haarnoja et al., 2017; Schulman et al., 2017; Nachum et al., 2017), where rewards are augmented with an entropy-based penalty term. This has been shown to lead to improved exploration and rich, multimodal value functions. Haarnoja et al. (2018) demonstrate that these value functions can be composed to approximately solve the intersection of tasks. We complement these results by proving optimal composition for the union of tasks in the total-reward, absorbing-state setting. Thus, any task that can be expressed as a combination of a set of base tasks can be solved immediately, without any further learning.
We provide a formula for optimally composing value functions, and demonstrate our method in a video game. Results show that an agent is able to compose existing policies learned from high-dimensional pixel input to generate new, optimal behaviours.

## 2. Background

A Markov decision process (MDP) is defined by the 4-tuple $(S, A, \rho, r)$ where (i) the state space $S$ is standard Borel; (ii) the action space $A$ is finite (and therefore a compact metric space when equipped with the discrete metric); (iii) the transition dynamics $\rho$ define a Markov kernel $(s, a) \mapsto \rho_{(s,a)}$ from $S \times A$ to $S$; and (iv) the reward $r$ is a real-valued function on $S \times A$ that is bounded and measurable.

In RL, an agent's goal is to maximise its utility by making a sequence of decisions. At each time step, the agent receives an observation from $S$ and executes an action from $A$ according to its policy. As a consequence of its action, the agent receives feedback (reward) and transitions to a new state. Whereas the rewards represent only the immediate outcome, the utility captures the long-term consequences of actions. Historically, many utility functions have been investigated (Puterman, 2014), but in this paper we consider the total-reward criterion (see Section 2.1).

We consider the class of MDPs with an absorbing set $\mathcal{G}$, which is a Borel subset of the state space. We augment the state space with a virtual state $g$ such that $\rho_{(s,a)}(\{g\}) = 1$ for all $(s, a) \in \mathcal{G} \times A$, and $r = 0$ after reaching $g$. In the control literature, this class of MDPs is often referred to as stochastic shortest path problems (Bertsekas & Tsitsiklis, 1991), and it naturally models domains that terminate after the agent achieves some goal.

We restrict our attention to stationary Markov policies, or simply policies. A policy $s \mapsto \pi_s$ is a Markov kernel from $S$ to $A$. Together with an initial distribution $\nu$ over $S$, a policy defines a probability measure over trajectories. To formalise this, we construct the set of $n$-step histories inductively by defining $H_0 = S$ and $H_n = H_{n-1} \times A \times S$ for $n \in \mathbb{N}$. The $n$-step histories represent the set of all possible trajectories of length $n$ in the MDP. The probability measure on $H_n$ induced by the policy $\pi$ is then

$$P^\pi_{\nu,n} = \nu \otimes \underbrace{\pi \otimes \rho \otimes \cdots \otimes \pi \otimes \rho}_{n \text{ times}}.$$

Using the standard construction (Klenke, 1995), we can define a unique probability measure $P^\pi_\nu$ on $H_\infty$ consistent with the measures $P^\pi_{\nu,n}$ in the sense that

$$P^\pi_\nu(E \times A \times S \times A \times \cdots) = P^\pi_{\nu,n}(E),$$

for any $n \in \mathbb{N}$ and any Borel set $E \subseteq H_n$. If $\nu$ is concentrated on a single state $s$, we simply write $P^\pi_\nu = P^\pi_s$. Additionally, for any real-valued bounded measurable function $f$ on $H_n$, we define $\mathbb{E}^\pi_\nu[f]$ to be the expected value of $f$ under $P^\pi_\nu$.

Finally, we introduce the notion of a proper policy: a policy under which the probability of reaching $\mathcal{G}$ after $n$ steps converges to 1 uniformly over $S$ as $n \to \infty$. Our definition extends that of Bertsekas & Tsitsiklis (1995) to general state spaces, and is equivalent to the definition of transient policies used by James & Collins (2006):

**Definition 1.** A stationary Markov policy $\pi$ is said to be proper if $\sum_{t=0}^{\infty} P^\pi_s(s_t \notin \mathcal{G}) < \infty$ for every $s \in S$. Otherwise, we say that $\pi$ is improper.

### 2.1. Entropy-Regularised RL

In the standard RL setting, the expected reward at state $s$ under policy $\pi$ is given by $\mathbb{E}_{a \sim \pi_s}[r(s, a)]$. Entropy-regularised RL (Ziebart, 2010; Fox et al., 2016; Haarnoja et al., 2017; Schulman et al., 2017; Nachum et al., 2017) augments the reward function with a term that penalises deviating from some reference policy $\bar{\pi}$.
That is, the expected reward is given by $\mathbb{E}_{a \sim \pi_s}[r(s, a)] - \tau \mathrm{KL}[\pi_s \,\|\, \bar{\pi}_s]$, where $\tau$ is a positive scalar temperature parameter and $\mathrm{KL}[\pi_s \,\|\, \bar{\pi}_s]$ is the Kullback-Leibler divergence between $\pi$ and the reference policy $\bar{\pi}$ at state $s$. When $\bar{\pi}$ is the uniform random policy, the regularised reward is equivalent to the standard entropy bonus up to an additive constant (Schulman et al., 2017). This results in policies that are able to preserve multimodality when faced with a task that can be solved in multiple different ways (Haarnoja et al., 2017). Additionally, the reference policy can be used to encode prior knowledge through expert demonstration.

Based on the above regularisation, we define the $n$-step value function starting from $s$ and following policy $\pi$ as

$$V_{\pi,n}(s) = \mathbb{E}^\pi_s\left[\sum_{t=0}^{n-1} \big(r(s_t, a_t) - \tau \mathrm{KL}[\pi_{s_t} \,\|\, \bar{\pi}_{s_t}]\big)\right].$$

Note that since the KL-divergence term is measurable (Dupuis & Ellis, 2011, Lemma 1.4.3), $V_{\pi,n}$ is well-defined. The infinite-horizon value function, which represents the total expected return after executing $\pi$ from $s$, is then

$$V_\pi(s) = \limsup_{n \to \infty} V_{\pi,n}(s).$$

Since the reward function and KL-divergence are bounded,¹ $V_\pi$ is well defined. Similarly, we define the Q-function to be the expected reward after taking action $a$ in state $s$, and thereafter following policy $\pi$:

$$Q_\pi(s, a) = r(s, a) + \int_S V_\pi(s')\,\rho_{(s,a)}(ds'). \tag{1}$$

¹Under the assumptions that $A$ is finite and $\pi$ is chosen so that $\pi_s$ is absolutely continuous with respect to $\bar{\pi}_s$ for any state $s$ and policy $\pi$.

Given the definitions above, we say that a measurable function $V^*$ is optimal if $V^*(s) = \sup_\pi V_\pi(s)$ for all $s \in S$. Furthermore, a policy $\pi^*$ is optimal if $V_{\pi^*} = V^*$. In the standard RL case, the optimal policy is always deterministic and is defined by $\operatorname{argmax}_a Q^*(s, a)$. On the other hand, entropy-regularised problems may not admit an optimal deterministic policy. This owes to the KL-divergence term, which penalises deviation from the reference policy $\bar{\pi}$: if $\bar{\pi}$ is stochastic, then a deterministic policy may incur more cost than a stochastic one.

## 3. Soft Value and Policy Iteration

The composition results to be discussed in Section 4 hold only in the total-reward, entropy-regularised MDPs defined in Section 2. While value and policy iteration in entropy-regularised RL have been analysed previously (Nachum et al., 2017), convergence results are limited to discounted MDPs. Therefore, in this section, we sketch an argument that an optimal proper policy exists under the total-reward criterion and that the entropy-regularised versions of value and policy iteration (see Algorithms 1 and 2) converge to optimal solutions. More details can be found in the supplementary material.

We begin by defining the Bellman operators

$$[T_\pi V_\pi](s) = \int_A Q_\pi(s, a)\,\pi_s(da) - \tau \mathrm{KL}[\pi_s \,\|\, \bar{\pi}_s], \tag{2}$$

$$[T V](s) = \sup_\pi\, [T_\pi V](s). \tag{3}$$

Equations (2) and (3) are analogous to the standard Bellman operator and Bellman optimality operator respectively. Note that since the optimal policy may not be deterministic, the Bellman optimality operator selects the supremum over policies instead of actions. We also define the soft Bellman operator

$$[\mathcal{L} V_\pi](s) = \tau \log \int_A \exp\big(Q_\pi(s, a)/\tau\big)\,\bar{\pi}_s(da). \tag{4}$$

Here $\mathcal{L}$ is referred to as "soft", since it is a smooth approximation of the max operator. The soft Bellman operator is connected to the Bellman optimality operator through the following result:

**Lemma 1.** Let $V : S \to \mathbb{R}$ be a bounded measurable function. Then $T V = \mathcal{L} V$, and the supremum is attained uniquely by the Boltzmann policy $\mathcal{B}[V]$ defined by

$$\frac{d\pi_s}{d\bar{\pi}_s}(a) = \frac{\exp\big(Q(s, a)/\tau\big)}{\int_A \exp\big(Q(s, a')/\tau\big)\,\bar{\pi}_s(da')}.$$
*Proof.* Follows directly from Dupuis & Ellis (2011, Proposition 1.4.2).

Analogous to the standard RL setting, we can define value and policy iteration in the entropy-regularised context, where the Bellman operators are replaced with their soft equivalents:

**Algorithm 1: Soft Value Iteration**
- Input: MDP, temperature $\tau > 0$, bounded function $V$
- Output: optimal value function $V^*$
- Initialise $V' \leftarrow V$
- Repeat:
  - replace $V \leftarrow V'$
  - apply the soft Bellman operator: $V' \leftarrow \mathcal{L}[V]$
- Until convergence

**Algorithm 2: Soft Policy Iteration**
- Input: MDP, temperature $\tau > 0$, proper policy $\pi$
- Output: optimal policy $\pi^*$
- Initialise $\pi' \leftarrow \pi$
- Repeat:
  - replace $\pi \leftarrow \pi'$
  - policy evaluation: find $V_\pi$, the fixed point of $T_\pi$
  - policy improvement: compute the Boltzmann policy $\pi' \leftarrow \mathcal{B}[V_\pi]$
- Until convergence

In order to prove the convergence of the above algorithms in the total-reward setting, we require the following assumptions:

**Assumption 1.** Suppose that the following hold: (i) the map $a \mapsto r(s, a)$ is upper semicontinuous for all $s \in S$; and (ii) for every state $s \in S$ and every bounded measurable function $f$, the map $a \mapsto \int_S f(s')\,\rho_{(s,a)}(ds')$ is continuous.

These conditions on the transition dynamics and reward function are trivially satisfied if $S$ is countable. We require a further two assumptions:

**Assumption 2.** Suppose that the following hold: (i) there exists at least one proper continuous policy; and (ii) if a policy is improper, then its value function is unbounded below.

Along with the compactness of $A$, these assumptions are identical to those of James & Collins (2006). We now proceed with the main result of this section. The proof follows closely along the lines of Bertsekas & Tsitsiklis (1991) and James & Collins (2006), but special care is taken to account for the fact that optimal policies are not necessarily deterministic.

**Theorem 1.** Suppose that Assumptions 1 and 2 hold and that the optimal value function is bounded above. Then: (i) there exists an optimal proper policy; (ii) the optimal value function is the unique bounded measurable solution to the optimality equation; (iii) the soft policy iteration algorithm converges to the optimal policy starting from any proper policy; (iv) the soft value iteration algorithm converges to the optimal value function starting from any proper policy.

*Proof.* See supplementary material.
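To make Algorithm 1 concrete, the following is a minimal tabular sketch of soft value iteration with a uniform reference policy. The array layout, the inclusion of the virtual absorbing state in the transition table, and the convergence tolerance are our own illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(P, r, tau=1.0, tol=1e-8, max_iters=100_000):
    """Tabular sketch of Algorithm 1 (soft value iteration).

    P:   transition probabilities of shape (S, A, S); the virtual absorbing
         state g (zero reward, self-loop) is assumed to be included, so no
         special termination handling is needed.
    r:   reward table of shape (S, A).
    tau: temperature of the entropy-regularised objective.
    A uniform reference policy bar_pi is assumed throughout.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(max_iters):
        # Equation (1): Q(s, a) = r(s, a) + E_{s' ~ rho_(s,a)}[ V(s') ]
        Q = r + P @ V
        # Soft Bellman operator (4) under a uniform reference policy:
        # [LV](s) = tau * log( (1/|A|) * sum_a exp(Q(s, a) / tau) )
        V_new = tau * (logsumexp(Q / tau, axis=1) - np.log(n_actions))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new
    return V, Q
```

Soft policy iteration follows the same pattern, alternating evaluation of $T_\pi$ with the Boltzmann improvement step of Lemma 1.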
## 4. Compositionality

In a lifelong-learning context, an agent is presented with a series of tasks drawn from some distribution. The agent's goal is to exploit knowledge gained in previous tasks to improve performance in the current task. We consider an environment with fixed state space $S$, action space $A$, deterministic transition dynamics $\rho$, and absorbing set $\mathcal{G}$. Let $\mathcal{D}$ be a fixed but unknown distribution over $(S, A, \rho, r)$. The agent is then presented with tasks sampled from $\mathcal{D}$, which differ only in their reward functions; that is, a task is directly specified by its reward function. In this section, we describe a compositional approach to tackle this problem.

### 4.1. OR Composition

Suppose further that the reward functions drawn from $\mathcal{D}$ differ only on the absorbing set $\mathcal{G}$. This restriction was introduced by Todorov (2009), and is a strict subset of the successor-representation framework (Dayan, 1993; Barreto et al., 2017). The composition we describe in this section can be viewed as OR task composition: if the objectives of two tasks are to achieve goals A and B respectively, then we aim to compose the individual Q-functions to solve A OR B. To construct the composed OR task, we take the maximum (or soft-maximum) of the reward functions of the constituent tasks.

We now show that, having encountered a set of tasks, we can combine the library of previously-learned Q-functions to solve any new task that lies in their "span", without the need for further learning:

**Theorem 2 (Optimal Composition).** Let $M_1, \ldots, M_m$ be a library of tasks drawn from $\mathcal{D}$. Let $Q^{*,k}_\tau$ be the optimal entropy-regularised Q-function, and $r_k$ the reward function, for $M_k$. Define the functions $\mathbf{r}, \mathbf{Q}^*_\tau : S \times A \to \mathbb{R}^m$ by $\mathbf{r} = [r_1, \ldots, r_m]$ and $\mathbf{Q}^*_\tau = [Q^{*,1}_\tau, \ldots, Q^{*,m}_\tau]$. Given a set of non-negative weights $\mathbf{w}$ with $\|\mathbf{w}\|_1 = 1$, consider a further task drawn from $\mathcal{D}$ with reward function satisfying

$$r(s, a) = \tau \log\big(\|\exp(\mathbf{r}(s, a)/\tau)\|_{\mathbf{w}}\big) \tag{5}$$

for all $s \in \mathcal{G}$, where $\|\cdot\|_{\mathbf{w}}$ denotes the weighted 1-norm. Then the optimal Q-value for this task is given by

$$Q^*_\tau(s, a) = \tau \log\big(\|\exp(\mathbf{Q}^*_\tau(s, a)/\tau)\|_{\mathbf{w}}\big). \tag{6}$$

That is, the optimal Q-functions for the library of tasks can be composed to form $Q^*_\tau$.

*Proof.* Since $\rho$ is deterministic, we can find a measurable function $f : S \times A \to S$ such that $\rho_{(s,a)} = \delta_{f(s,a)}$. For any Q-function, define the desirability function

$$Z(s, a) = \exp\big(Q(s, a)/\tau\big),$$

and define the operator $\mathcal{U}$ on the space of non-negative bounded measurable functions by

$$[\mathcal{U} Z](s, a) = \exp\big(r(s, a)/\tau\big) \int_A Z(f(s, a), a')\,\bar{\pi}_{s'}(da'), \quad \text{where } s' = f(s, a).$$

We now show that the desirability function $Z^*_\tau$ is a fixed point of $\mathcal{U}$. Since $V^*_\tau$ is the fixed point of the Bellman optimality operator, by combining Lemma 1 and Theorem 1 we have $V^*_\tau(s) = [T V^*_\tau](s) = [\mathcal{L} V^*_\tau](s)$. Then, using the definition of the soft Bellman operator,

$$V^*_\tau(s) = \tau \log \int_A \exp\big(Q^*_\tau(s, a')/\tau\big)\,\bar{\pi}_s(da').$$

Additionally, under the assumption of a deterministic environment, Equation (1) can be rewritten as $Q^*_\tau(s, a) = r(s, a) + V^*_\tau(f(s, a))$. It then follows that

$$\begin{aligned}
[\mathcal{U} Z^*_\tau](s, a) &= \exp\big(r(s, a)/\tau\big) \int_A Z^*_\tau(f(s, a), a')\,\bar{\pi}_{s'}(da') \\
&= \exp\big(r(s, a)/\tau\big) \int_A \exp\big(Q^*_\tau(f(s, a), a')/\tau\big)\,\bar{\pi}_{s'}(da') \\
&= \exp\big(r(s, a)/\tau\big) \exp\big(V^*_\tau(f(s, a))/\tau\big) \\
&= \exp\big(\big(r(s, a) + V^*_\tau(f(s, a))\big)/\tau\big) \\
&= \exp\big(Q^*_\tau(s, a)/\tau\big) = Z^*_\tau(s, a).
\end{aligned}$$

Hence $Z^*_\tau$ is a fixed point of $\mathcal{U}$.

Given a task $M_k$ and terminal state $s \in \mathcal{G}$, the optimal Q-value at that state is simply $r_k(s, a)$. Therefore, for the combined task with reward function (5), the optimal Q-value satisfies

$$Q^*_\tau(s, a) = \tau \log\big(\|\exp(\mathbf{r}(s, a)/\tau)\|_{\mathbf{w}}\big) = \tau \log\big(\|\exp(\mathbf{Q}^*_\tau(s, a)/\tau)\|_{\mathbf{w}}\big)$$

on $\mathcal{G}$. Thus, restricted to $\mathcal{G}$, the desirability function $Z^*_\tau$ is a linear combination of the desirability functions for the family of tasks. Finally, since (6) holds on $\mathcal{G}$ and $\mathcal{U}$ is a linear operator on the exponentiated Q-function, (6) holds everywhere.
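Equation (6) says that composition amounts to a weighted log-sum-exp of the library Q-values. A minimal sketch, assuming the library Q-functions have already been evaluated and stacked into an array; the function name and array shapes are ours, not the paper's:

```python
import numpy as np
from scipy.special import logsumexp

def compose_or(q_library, weights, tau):
    """Entropy-regularised OR composition, Equation (6).

    q_library: array of shape (m, ...) stacking the library Q-values
               Q*_tau^1, ..., Q*_tau^m at the same state-action pairs.
    weights:   non-negative task weights w with ||w||_1 = 1.
    tau:       the temperature used when the library tasks were solved.
    Returns tau * log( sum_k w_k * exp(Q_k / tau) ), computed stably.
    """
    q = np.asarray(q_library, dtype=float)
    w = np.asarray(weights, dtype=float).reshape((-1,) + (1,) * (q.ndim - 1))
    return tau * logsumexp(q / tau, b=w, axis=0)
```

Setting all weights equal gives an unweighted soft OR; skewing them trades off the relative importance of the constituent goals, as explored empirically in Section 5.2.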
The following lemma links the previous result to the standard RL setting. Recall that entropy-regularisation appends a temperature-controlled penalty term to the reward function. As the temperature parameter tends to 0, the reward provided by the environment dominates the entropy penalty, and the problem reduces to the standard RL case:

**Lemma 2.** Let $\{\tau_n\}_{n=1}^\infty$ be a sequence in $\mathbb{R}$ such that $\tau_n \downarrow 0$. Let $Q^*_{\tau_n}$ be the optimal Q-value function for $\mathrm{MDP}(\tau_n)$, the entropy-regularised MDP with temperature parameter $\tau_n$. Let $Q^*_0$ be the optimal Q-value function for the standard MDP. Then $Q^*_{\tau_n} \to Q^*_0$ as $n \to \infty$.

*Proof.* First note that for a fixed policy $\pi$, state $s$ and action $a$, we have $Q^\pi_{\tau_n}(s, a) \to Q^\pi_0(s, a)$ as $n \to \infty$. This follows directly from the definition of the entropy-regularised value function, and the fact that the KL-divergence is non-negative. Then, using Lemma 3.14 of Hinderer (1970) to interchange the limit and supremum, we have

$$\lim_{n \to \infty} Q^*_{\tau_n} = \lim_{n \to \infty} \sup_\pi Q^\pi_{\tau_n} = \sup_\pi \lim_{n \to \infty} Q^\pi_{\tau_n} = \sup_\pi Q^\pi_0 = Q^*_0.$$

Hence $Q^*_{\tau_n} \to Q^*_0$ as $n \to \infty$.

We can now show that composition holds in the standard RL setting by using Lemma 2 to take the low-temperature limit of Theorem 2.

**Corollary 1.** Let $\{\tau_n\}_{n=1}^\infty$ be a sequence in $\mathbb{R}$ such that $\tau_n \downarrow 0$, and let $Q^*_0$ be the optimal Q-function for the standard MDP with a composite reward satisfying $r = \max \mathbf{r}$. Then $\max \mathbf{Q}^*_{\tau_n} \to Q^*_0$ as $n \to \infty$.

*Proof.* For a fixed state $s$ and action $a$, and after a possible reordering of the vector $\mathbf{Q}^*_0(s, a)$, we may suppose without loss of generality that $Q^{*,1}_0(s, a) = \max \mathbf{Q}^*_0(s, a)$. Then by Lemma 2, we can find an $N \in \mathbb{N}$ such that $Q^{*,1}_{\tau_n}(s, a) = \max \mathbf{Q}^*_{\tau_n}(s, a)$ for all $n \geq N$. Since $\log$ is continuous, we have from Theorem 2 that

$$\lim_{n \to \infty} Q^*_{\tau_n} = \log \lim_{n \to \infty} \big\|\exp(\mathbf{Q}^*_{\tau_n})\big\|^{1/\tau_n}_{\mathbf{w}},$$

where $\|\cdot\|^p_{\mathbf{w}}$ denotes the weighted $p$-norm. By factoring $\exp(Q^{*,1}_{\tau_n})$ out of $\|\exp(\mathbf{Q}^*_{\tau_n})\|^{1/\tau_n}_{\mathbf{w}}$, we are left with

$$\big\|\,[1, \exp(\Delta_2), \ldots, \exp(\Delta_m)]\,\big\|^{1/\tau_n}_{\mathbf{w}},$$

where $\Delta_i = Q^{*,i}_{\tau_n} - Q^{*,1}_{\tau_n}$ for $i = 2, \ldots, m$. Since $Q^{*,1}_{\tau_n}(s, a)$ is the maximum of $\mathbf{Q}^*_{\tau_n}(s, a)$ for all $n \geq N$, the limit of this factor as $n \to \infty$ is 1. It then follows that

$$\lim_{n \to \infty} Q^*_{\tau_n}(s, a) = \log \lim_{n \to \infty} \exp\big(Q^{*,1}_{\tau_n}(s, a)\big) = Q^{*,1}_0(s, a).$$

Since $s$ and $a$ were arbitrary and $Q^{*,i}_{\tau_n} \to Q^{*,i}_0$ for each $i$, we have that $\max \mathbf{Q}^*_{\tau_n} \to Q^*_0$ as $n \to \infty$.

Comparing Theorem 2 to Corollary 1, we see that as the temperature parameter decreases to zero, the weight vector has less influence on the composed Q-function. In the limit, the optimal Q-function is independent of the weights and is simply the maximum of the library functions. This suggests a natural trade-off between our ability to interpolate between Q-functions and the stochasticity of the optimal policy.

Finally, we note that the assumption of deterministic dynamics is necessary. To see this, consider an MDP with a fixed start state and three goal states Red, Purple and Blue. The MDP has three actions a, b and c, with transition dynamics given by:

| Action | Red | Purple | Blue |
|--------|-----|--------|------|
| a      | 0.1 | 0.8    | 0.1  |
| b      | 0.1 | 0.1    | 0.8  |
| c      | 0.0 | 0.5    | 0.5  |

Suppose that tasks A and B are to reach the Purple and Blue states respectively (with a reward of 1 for reaching the correct state and 0 otherwise). The optimal policy for the composed task A OR B is therefore to select action c, yielding a return of 1. However, the policy given by Corollary 1 selects either a or b, with a return of 0.9.
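The limiting behaviour in Corollary 1 is easy to check numerically: as $\tau$ shrinks, the weighted soft composition of Equation (6) approaches the elementwise maximum of the library Q-values, and the weights stop mattering. The toy Q-values below are made up purely for illustration.

```python
import numpy as np
from scipy.special import logsumexp

q1 = np.array([2.0, 0.5, 1.0])        # toy Q-values for task 1
q2 = np.array([1.5, 1.0, 1.0])        # toy Q-values for task 2
library = np.stack([q1, q2])
w = np.array([[0.9], [0.1]])          # deliberately lopsided task weights

for tau in (1.0, 0.1, 0.01):
    # Theorem 2: Q*(s, a) = tau * log( sum_k w_k * exp(Q_k(s, a) / tau) )
    composed = tau * logsumexp(library / tau, b=w, axis=0)
    print(tau, np.round(composed, 3))

# As tau -> 0 the composition approaches np.maximum(q1, q2) = [2.0, 1.0, 1.0]
# regardless of the weights, matching the standard-RL composition of Corollary 1.
```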
### 4.2. AND Composition

Haarnoja et al. (2018) show that an approximate AND composition is also possible in entropy-regularised RL. That is, if the goals A and B partially overlap, the composed Q-function will achieve A AND B approximately. The following result, included for completeness, demonstrates that the optimal Q-function for the composite task can be approximated by the average of the library Q-functions:

**Lemma 3 (Haarnoja et al., 2018).** Let $Q^{*,1}_\tau$ and $Q^{*,2}_\tau$ be the optimal Q-functions for two tasks drawn from $\mathcal{D}$ with rewards $r_1$ and $r_2$. Define the averaged Q-function $Q_{\mathrm{ave}} := (Q^{*,1}_\tau + Q^{*,2}_\tau)/2$. Then the optimal Q-function $Q^*_\tau$ for the task with reward function $r = (r_1 + r_2)/2$ satisfies

$$Q_{\mathrm{ave}} \geq Q^*_\tau \geq Q_{\mathrm{ave}} - C^*_\tau,$$

where $C^*_\tau$ is a fixed point of

$$C(s, a) \mapsto \tau\, \mathbb{E}_{s' \sim \rho_{(s,a)}}\Big[D_{\frac{1}{2}}\big(\pi^{*,1}_{s'} \,\|\, \pi^{*,2}_{s'}\big) + \max_{a'} C(s', a')\Big],$$

the policy $\pi^{*,i}_s$ is the optimal Boltzmann policy for task $i$, and $D_{\frac{1}{2}}(\cdot \,\|\, \cdot)$ is the Rényi divergence of order $\tfrac{1}{2}$.

**Theorem 3 (Haarnoja et al., 2018).** Using the definitions in Lemma 3, the value of the composed policy $\pi_{\mathrm{ave}}$ satisfies

$$Q^{\pi_{\mathrm{ave}}} \geq Q^*_\tau - F^*_\tau,$$

where $F^*_\tau$ is a fixed point of

$$F(s, a) \mapsto \tau\, \mathbb{E}_{s' \sim \rho_{(s,a)}}\Big[\mathbb{E}_{a' \sim \pi^{\mathrm{ave}}_{s'}}\big[C^*_\tau(s', a') - F(s', a')\big]\Big].$$

We believe that the low-temperature result from Lemma 2 can be used to obtain similar results for the standard RL framework. We provide empirical evidence of this in the next section, and leave a formal proof to future work.

## 5. Experiments

To demonstrate composition, we perform a series of experiments in a high-dimensional video game (Figure 1b). The goal of the game is to collect items of different colours and shapes. The agent has four actions that move it a single step in any of the cardinal directions, unless it collides with a wall. Each object in the domain is one of two shapes (squares and circles) and one of three colours (blue, beige and purple), for a total of six objects (see Figure 1a).

Figure 1. The video game domain: (a) the items to be collected; (b) the layout of the grid-world. The position of the walls and obstacles remains fixed, but at the start of each episode, the agent and objects are randomly positioned.

We construct a number of different tasks based on the objects that the agent must collect, the task's name specifying the objects to be collected. For example, Purple refers to the task where an agent must collect any purple object, while Beige Square requires collecting the single beige square. For each task, the episode begins by randomly positioning the six objects and the agent. At each timestep, the agent receives a reward of -0.1. If the correct object is collected, the agent receives a reward of 1 and the episode terminates.

We first learn to solve a number of base tasks using (soft) deep Q-learning (Mnih et al., 2015; Schulman et al., 2017), where each task is trained with a separate network. Each network is trained for 1.5 million timesteps to ensure near-optimal convergence. The resulting networks are collected into a library from which we later compose new Q-functions. The input to our network is a single RGB frame of size 84 × 84, which is passed through three convolutional layers and two fully-connected layers before outputting the predicted Q-values for the given state.² Using the results in Section 4, we compose optimal Q-functions from those in the library.

²A full description of the architecture and hyperparameters is provided in the supplementary material.

In all our results, we visualise value functions by placing the agent at each cell in the domain and feeding the resulting state into the learned Q-function. We take the maximum output as the value of the state, and then interpolate between cells to form a smooth surface.

### 5.1. OR Composition

Here we consider new tasks that can be described as the union of a set of base tasks in the standard RL setting. We train an agent separately on the Purple and Blue tasks, adding the corresponding Q-functions to our library. We then use Corollary 1 to produce the optimal Q-function for the composite Purple Or Blue task, which requires the agent to pick up either blue or purple objects, without any further learning. Results are given in Figure 2. The local maxima over blue and purple objects illustrate the multimodality of the value function (Figure 2a). This is similar to approaches such as soft Q-learning (Haarnoja et al., 2017), which are also able to learn multimodal policies. However, we have anecdotally observed that directly learning a truly multimodal policy for the composite task can be difficult. If the entropy regularisation is too high, the resulting policy is extremely stochastic; too low, and the policy quickly collapses to a single mode without exploring alternatives. It is instead far easier to learn unimodal value functions for each of the base tasks, and then compose them to produce optimal multimodal value functions.

Figure 2. (a) The optimal value function for Purple Or Blue, produced by composing the Purple and Blue Q-functions. The multimodality of the composite value function is clearly visible. (b) Sample trajectories for the composite Purple Or Blue task, with the agent beginning at different positions. The agent selects the shortest path to any of the target objects. (c) Returns from 50k episodes. The first two box plots are the results of acting in the Purple Or Blue task using only one of the base Q-functions, while the third uses the composite Q-function.
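The value-function surfaces in Figures 2 to 4 can be reproduced with a sketch along the following lines; the `q_function` and `render_state` interfaces are hypothetical stand-ins for a trained network and an environment helper, not part of the paper's code.

```python
import numpy as np

def value_map(q_function, render_state, free_cells):
    """Sketch of the visualisation procedure described above.

    q_function:   maps an 84 x 84 RGB observation to a vector of Q-values,
                  e.g. a trained (soft) deep Q-network (assumed interface).
    render_state: renders the frame with the agent placed at a given grid
                  cell (hypothetical environment helper).
    free_cells:   iterable of (row, col) cells the agent may occupy.
    """
    values = {}
    for cell in free_cells:
        frame = render_state(agent_position=cell)
        # The value of a state is taken to be the maximum predicted Q-value.
        values[cell] = float(np.max(q_function(frame)))
    return values  # interpolate over cells to obtain the smooth surface
```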
### 5.2. Linear Task Combinations

In Theorem 2 we showed that, in the entropy-regularised setting, the composed Q-function depends on a weight vector $\mathbf{w}$. This allows us to achieve a more general type of composition. In particular, we can immediately compute any optimal Q-function that lies in the "span" of the library Q-functions. Indeed, according to Theorem 2, the exponentiated optimal Q-function is a linear combination of the exponentiated library functions. The weights can therefore be used to modulate the relative importance of the library functions, modelling the situation in which an agent has multiple concurrent objectives of unequal importance.

We illustrate the effect of the weight vector $\mathbf{w}$ using soft Q-learning with a temperature parameter $\tau = 1$. We construct a new task by composing the tasks Purple Circle and Beige Square, and assign different weights to these tasks. The resulting weighted value functions are given in Figure 3.

Figure 3. (a-g) The weighted composed value function for the task Beige Square Or Purple Circle, with the weight assigned to the Beige Square Q-function varied over 0.0, 0.05, 0.1, 0.5, 0.9, 0.95 and 1.0. (h) The number of beige squares compared to purple circles collected by the agent as the weights are varied in steps of 0.05. Results for each weight were averaged over 80 runs of 100 episodes.

### 5.3. AND Composition

Here we consider tasks which can be described as the intersection of tasks in the library. In general, this form of composition will not yield an optimal policy for the composite task, owing to the presence of local optima in the composed value function. However, in many cases we can obtain a good approximation to the composite task by simply averaging the Q-values of the constituent tasks. While Haarnoja et al. (2018) consider this type of composition in the entropy-regularised case, we posit that it can be extended to the standard RL setting by taking the low-temperature limit. We illustrate this by composing the optimal policies for the Blue and Square tasks, which produces a good approximation to the optimal policy for collecting the blue square. Results are shown in Figure 4.

Figure 4. (a) The approximately optimal value function of the composed policies. Local optima are clearly visible. (b) Sample trajectories from the composed policy beginning from different starting positions. The agent exhibits suboptimal, but sensible, behaviour near beige squares. (c) The IQR of returns from 50k episodes. The first box plot is the return from the optimal solution to the union of tasks, the second is the result of the approximate intersection of tasks, and the third is the true optimal policy.

### 5.4. Temporal

Our final experiment demonstrates the use of composition for long-lived agents. We compose the base Q-functions for the tasks Blue, Beige and Purple, and use the resulting Q-function to solve the task of collecting all objects. Sample trajectories are illustrated in Figure 5. Despite the fact that the individual tasks terminate after collecting the required object, if we allow the episode to continue, the composed Q-function is able to collect all objects in a greedy fashion. This shows the power of composition: if we possess a library of skills learned from previous tasks, we can compose them to solve any task in their union continually.

Figure 5. (a) and (b) Sample trajectories for the task of collecting all objects. (c) Returns from 50k episodes. The first box plot is the return of the composed Q-function, while the second is the result of a DQN trained to collect all objects explicitly.
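A minimal sketch of the behaviour in this experiment, assuming a list of per-task Q-networks and a gym-style environment interface (both hypothetical names): the agent simply acts greedily with respect to the max-composed Q-function and carries on after each object is collected.

```python
import numpy as np

def run_composed_greedy(env, q_networks, max_steps=200):
    """Greedy control with a max-composed Q-function (cf. Section 5.4).

    env:        environment with reset()/step(action) returning RGB frames
                (assumed interface, not the paper's API).
    q_networks: per-task Q-functions, e.g. for Blue, Beige and Purple, each
                mapping a frame to a vector of Q-values (assumed interface).
    """
    frame = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        # Corollary 1: the composed Q-function is the elementwise maximum
        # of the library Q-functions.
        q_composed = np.max([np.asarray(q(frame)) for q in q_networks], axis=0)
        frame, reward, done, _ = env.step(int(np.argmax(q_composed)))
        total_reward += reward
        if done:  # episode ends once all objects have been collected
            break
    return total_reward
```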
## 6. Related Work

As mentioned, the optimal composition described here can be achieved through the LMDP framework (Todorov, 2009). However, LMDPs require the passive dynamics (an $S \times S$ matrix) to be specified upfront, which restricts their applicability to low-dimensional settings. Our approach, on the other hand, shows that the same composition can be achieved in both the entropy-regularised and standard RL settings. As a result, we can perform composition in high-dimensional state spaces such as video games.

Using the maximum of a set of previously-learned Q-functions also appears in other contexts. Corollary 1 mirrors generalised policy improvement (Barreto et al., 2017; 2018), which uses the successor-representation framework (Dayan, 1993) to show that maximising over Q-functions results in an improved policy. In our work, the resulting Q-function is not merely an improvement, but is in fact optimal. More generally, Abel et al. (2018) provide the MAXQInit algorithm, which can be used to solve a series of tasks that differ only in reward function. When faced with a new task, initialising the value function to be the maximum over learned Q-functions is shown to preserve optimism and lower the sample complexity with high probability.

Finally, the composition described here differs from the options framework (Sutton et al., 1999), which sequences low-level actions to form high-level skills. Whereas options compose actions temporally, our composition is concurrent; we note, however, that we are sometimes able to mimic temporal composition (see Figure 5). Since options themselves contain policies, it is likely that they can be composed using the theory described here, although we would need to account for options whose initiation sets and termination conditions differ.

## 7. Conclusion

We showed that in entropy-regularised RL, value functions can be optimally composed to solve the union of tasks. Extending this result by taking the low-temperature limit, we showed that composition is also possible in standard RL. However, there is a trade-off between our ability to smoothly interpolate between tasks and the stochasticity of the optimal policy. We demonstrated, in a high-dimensional environment, that a library of optimal Q-functions can be composed to solve composite tasks consisting of unions, intersections or temporal sequences of simpler tasks.
The proposed compositional framework is a step towards lifelong-learning agents that are able to combine existing skills to solve new, unseen tasks.

## Acknowledgements

The authors wish to thank the anonymous reviewers for their thorough feedback and helpful comments. SJ is supported by a Google PhD Fellowship in Machine Learning. BR is supported by a Google Faculty Research Award in Machine Learning.

## References

Abel, D., Jinnai, Y., Guo, Y., Konidaris, G., and Littman, M. Policy and value transfer in lifelong reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2018.

Barreto, A., Dabney, W., Munos, R., Hunt, J., Schaul, T., van Hasselt, H., and Silver, D. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4055–4065, 2017.

Barreto, A., Borsa, D., Quan, J., Schaul, T., Silver, D., Hessel, M., Mankowitz, D., Žídek, A., and Munos, R. Transfer in deep reinforcement learning using successor features and generalised policy improvement. In Proceedings of the International Conference on Machine Learning, pp. 501–510, 2018.

Bertsekas, D. and Tsitsiklis, J. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595, 1991.

Bertsekas, D. and Tsitsiklis, J. Neuro-dynamic programming: an overview. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1, pp. 560–564. IEEE, 1995.

Dayan, P. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.

Dupuis, P. and Ellis, R. A Weak Convergence Approach to the Theory of Large Deviations. 2011.

Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In 32nd Conference on Uncertainty in Artificial Intelligence, 2016.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pp. 1352–1361, 2017.

Haarnoja, T., Pong, V., Zhou, A., Dalal, M., Abbeel, P., and Levine, S. Composable deep reinforcement learning for robotic manipulation. arXiv preprint arXiv:1803.06773, 2018.

Hinderer, K. Foundations of Non-stationary Dynamic Programming with Discrete Time Parameter. Lecture Notes in Operations Research and Mathematical Systems, volume 33. 1970.

James, H. and Collins, E. An analysis of transient Markov decision processes. Journal of Applied Probability, 43(3):603–621, 2006.

Klenke, A. Probability Theory: A Comprehensive Course, volume 158. 1995. ISBN 9781447153603.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M., Graves, A., Riedmiller, M., Fidjeland, A., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2772–2782, 2017.

Puterman, M. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Saxe, A., Earle, A., and Rosman, B. Hierarchy through composition with multitask LMDPs. In Proceedings of the 34th International Conference on Machine Learning, pp. 3017–3026, 2017.
Schulman, J., Abbeel, P., and Chen, X. Equivalence between policy gradients and soft Q-learning. pp. 1–15, 2017.

Sutton, R., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

Taylor, M. and Stone, P. Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research, 10:1633–1685, 2009.

Todorov, E. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pp. 1369–1376, 2007.

Todorov, E. Compositionality of optimal control laws. In Advances in Neural Information Processing Systems, pp. 1856–1864, 2009.

Ziebart, B. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. Carnegie Mellon University, 2010.