RD2: Reward Decomposition with Representation Disentanglement

Zichuan Lin (Tsinghua University, lzcthu12@gmail.com), Derek Yang (UC San Diego, dyang1206@gmail.com), Li Zhao (Microsoft Research, lizo@microsoft.com), Tao Qin (Microsoft Research, taoqin@microsoft.com), Guangwen Yang (Tsinghua University, ygw@tsinghua.edu.cn), Tie-Yan Liu (Microsoft Research, tyliu@microsoft.com)

Abstract

Reward decomposition, which aims to decompose the full reward into multiple sub-rewards, has been proven beneficial for improving sample efficiency in reinforcement learning. Existing works on discovering reward decomposition are mostly policy-dependent, constraining diversified or disentangled behavior among the policies induced by different sub-rewards. In this work, we propose a set of novel policy-independent reward decomposition principles that constrain the uniqueness and compactness of the state representations relevant to different sub-rewards. Our principles encourage sub-rewards with minimal relevant features while maintaining the uniqueness of each sub-reward. We derive a deep learning algorithm based on these principles and refer to our method as RD2, since we learn Reward Decomposition and Disentangled representation jointly. We evaluate RD2 on a toy case, where the true reward structure is known, and on chosen Atari environments, where a reward structure exists but is unknown to the agent, to demonstrate the effectiveness of RD2 against existing reward decomposition methods.

1 Introduction

Since deep Q-learning was proposed by Mnih et al. [2015], reinforcement learning (RL) has achieved great success in decision-making problems. While general RL algorithms have been extensively studied, here we focus on RL tasks with multiple reward channels. In such tasks, we are aware of the existence of multiple reward channels but only have access to the full reward. Reward decomposition has been proposed for such tasks to decompose the reward into sub-rewards, which can then be used to train an RL agent with improved sample efficiency.

Existing works mostly perform reward decomposition by constraining the behavior of the different policies induced by different sub-rewards. Grimm and Singh [2019] propose encouraging each policy to obtain only its corresponding sub-reward. However, their work requires that the environment can be reset to arbitrary states, and so cannot be applied to general RL settings. Lin et al. [2019] propose encouraging diversified behavior between such policies, but their method only obtains sub-rewards on transition data generated by its own policy, so it cannot decompose rewards for arbitrary state-action pairs.

In this paper, we propose a set of novel principles for reward decomposition by exploring the relation between sub-rewards and their relevant features. We illustrate our principles with a toy environment, Monster-Treasure, in which the agent receives a negative reward $r_{\text{monster}}$ when it runs into the wandering monster and a positive reward $r_{\text{treasure}}$ when it runs into the treasure chest. A good decomposition would split the reward $r$ into $r_{\text{monster}}$ and $r_{\text{treasure}}$, where only some features are relevant to each sub-reward; for instance, only the monster and the agent are relevant to predicting $r_{\text{monster}}$. A bad decomposition could split the reward into $r/2$ and $r/2$, or into $r$ and $0$.
The first of these is not compact, in the sense that all features are relevant to both sub-rewards. The second is trivial, in the sense that none of the features is relevant to the $0$ sub-reward. We argue that if each sub-reward used to train our agent is relevant to limited but unique features only, then the representations of the sub-returns induced by the sub-rewards will also be compact and easy to learn.

Motivated by the example above, we propose decomposing a reward into sub-rewards by constraining the relevant features/representations of different sub-rewards to be compact and non-trivial. We first derive our principles for reward decomposition under the factored Markov Decision Process (fMDP). We then relax and integrate these principles into deep learning settings, which leads to our algorithm, Reward Decomposition with Representation Disentanglement (RD2). Compared with existing works, RD2 can decompose the reward for arbitrary state-action pairs under general RL settings and does not rely on policies. It is also associated with a disentangled representation, so the reward decomposition is self-explanatory and can be easily visualized. We demonstrate our reward decomposition algorithm on the Monster-Treasure environment discussed earlier, and test it on chosen Atari games with multiple reward channels. Empirically, RD2 discovers meaningful reward decompositions and disentangled representations, and it improves the sample efficiency of deep RL algorithms more than existing reward decomposition methods.

2 Background and Related Works

2.1 Markov Decision Process

We consider general reinforcement learning, in which the interaction of the agent and the environment can be viewed as a Markov Decision Process (MDP) [Puterman, 1994]. Denoting the state space by $S$, the action space by $A$, the state transition function by $P$, the state-action dependent reward function by $R$, and the discount factor by $\gamma$, we write this MDP as $(S, A, R, P, \gamma)$. Here a reward $r$ depends on the state $s \in S$ and action $a \in A$:

$$r = R(s, a) \qquad (1)$$

A common approach to solving an MDP is to estimate the action-value $Q^{\pi}(s, a)$, which represents the expected total return for each state-action pair $(s, a)$ under a given policy $\pi$.

2.2 Factored MDP

Our theoretical foundation is the factored MDP (fMDP). In a factored MDP [Boutilier et al., 1995, 1999], a state $s \in S$ can be described as a set of factors $s = (x_1, x_2, \dots, x_N)$. In some factored MDP settings, the reward function $R$ can be decomposed into multiple parts, where each part returns a sub-reward, or localized reward. Let $s_i$ be a fixed subset of the factors in $s$, denoted by $s_i \subseteq s$. Localized rewards $r_i$ depend only on sub-states:

$$r_i = R_i(s_i, a) \qquad (2)$$

and the full reward is obtained by $R(s, a) = \sum_{i=1}^{K} R_i(s_i, a)$. In most environments, while the reward structure exists latently, we do not know the sub-reward functions $R_i$ or the sub-rewards $r_i$; only the full reward $r$ is observable.

2.3 Reward Decomposition

Having access to sub-rewards $r_i$ can greatly accelerate training in RL [Schneider et al., 1999, Littman and Boyan, 1993, Russell and Zimdars, 2003, Bagnell and Ng, 2006, Marthi, 2007, Van Seijen et al., 2017, OpenAI et al., 2019]. Hybrid Reward Architecture (HRA) [Van Seijen et al., 2017] proposes learning multiple Q-functions, each trained with its corresponding sub-reward, and showed significant improvements compared to training a single Q-function. However, in HRA the rewards are decomposed manually.
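To make this idea concrete, here is a minimal sketch (our illustration, not the HRA authors' code) of a multi-head Q-network in which each head is regressed onto a manually specified sub-reward and the greedy action is taken with respect to the aggregated Q-values; the class names, network sizes, and PyTorch framing are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HRAQNetwork(nn.Module):
    """One Q-head per (manually defined) sub-reward channel."""
    def __init__(self, state_dim, n_actions, n_channels, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_channels)]
        )

    def forward(self, s):
        h = self.trunk(s)
        # stacked per-channel Q-values, shape (K, batch, n_actions)
        return torch.stack([head(h) for head in self.heads])

def hra_td_loss(q_net, target_net, s, a, sub_r, s_next, gamma=0.99):
    """Each head i regresses onto its own sub-reward r_i; the greedy
    next action is taken with respect to the sum of all heads."""
    q_all = q_net(s)                                  # (K, B, A)
    with torch.no_grad():
        q_next = target_net(s_next)                   # (K, B, A)
        a_star = q_next.sum(dim=0).argmax(dim=1)      # (B,) global greedy action
    loss = 0.0
    for i in range(q_all.shape[0]):
        q_sa = q_all[i].gather(1, a.unsqueeze(1)).squeeze(1)
        target = sub_r[:, i] + gamma * q_next[i].gather(1, a_star.unsqueeze(1)).squeeze(1)
        loss = loss + F.mse_loss(q_sa, target)
    return loss
```

The sub-rewards `sub_r` are given by hand in this sketch; the remainder of this paper is about discovering them automatically from the full reward alone.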
In Dota 2 [OpenAI et al., 2019], over 10 reward types associated with different parts of the state, e.g. gold, kills, mana, etc., are hand-designed to help the agent plan better. Reward decomposition can also be used in multi-agent settings [Russell and Zimdars, 2003]. Given the potential of utilizing sub-rewards, finding a good reward decomposition in an unknown environment becomes an important line of research. Reward decomposition seeks to find the sub-rewards $r_i$ without any domain knowledge. Grimm and Singh [2019] and Lin et al. [2019] both rely on an assumption of policy disagreement to perform reward decomposition. Lin et al. [2019] are the first to perform reinforcement learning jointly with reward decomposition without domain knowledge or manipulation of the environment. However, their method can only compute sub-rewards from sub-values for transition data generated by its own policy, making it hard to apply the learned sub-rewards to downstream tasks such as training new agents.

2.4 Disentangled Representation

A recent line of work has argued that disentangled representations are an important step towards better representation learning [Bengio et al., 2013, Peters et al., 2017, Higgins et al., 2017, Chen et al., 2016, 2018, Hsu et al., 2017]. The key idea is that a disentangled representation should separate the distinct, informative factors of variation in the data. In particular, entropy reduction has been used for representation disentanglement in prior work [Li et al., 2019]. Different from those works, we focus on reward decomposition in RL and learn a compact representation for each sub-reward. Although we encourage the compactness and diversity of the representations for different sub-rewards, there is usually some overlap between them, which differs from the idea of disentangled representation. For example, in the Monster-Treasure environment, the agent information is important for the representations of both $r_{\text{monster}}$ and $r_{\text{treasure}}$.

3 Minimal Supporting Principle for Reward Decomposition

In this section, we introduce our principles for finding a minimal supporting reward decomposition under fMDP. The first principle is that the relevant features of the sub-rewards should contain as little information as possible, which implies compactness. To define relevant features formally, we first define the minimal sufficient supporting sub-state. We further define the K-minimal supporting reward decomposition, which directly leads to our second principle: each sub-reward should be unique in that its relevant features contain exclusive information. The second principle encourages diversified sub-rewards and features that represent different parts of the reward dynamics.

3.1 Minimal Sufficient Supporting Sub-state

We first consider an fMDP with known sub-reward structure, e.g., $r_{\text{monster}}$ and $r_{\text{treasure}}$ in the Monster-Treasure environment introduced above. Let the state be composed of $N$ factors, denoted by $s = \{x_1, x_2, x_3, \dots, x_N\}$, and denote the $i$-th sub-reward at state $s$ by $r_i(s)$, $i \in [1, K]$. For example, the state in the Monster-Treasure environment is $\{s_{\text{agent}}, s_{\text{monster}}, s_{\text{treasure}}\}$, where $s_{\text{agent}}$/$s_{\text{monster}}$/$s_{\text{treasure}}$ represent the state of the agent/monster/treasure chest respectively. A sub-state $s_i$ is extracted from state $s$ by selecting a subset of variables. For the sub-reward $r_{\text{monster}}$, the best sub-state would be $\{s_{\text{agent}}, s_{\text{monster}}\}$ because it contains only the information relevant for predicting $r_{\text{monster}}$.
Motivated by this observation, we define the minimal sufficient supporting sub-state in Definition 1.

Definition 1. A sub-state $s_i \subseteq s$ is the minimal sufficient supporting sub-state of $r_i$ if

$$H(s_i) = \min_{\hat{s}_i \in M_i} H(\hat{s}_i), \qquad M_i = \{\hat{s}_i \mid H(r_i \mid \hat{s}_i, a) = \min_{\tilde{s} \subseteq s} H(r_i \mid \tilde{s}, a)\},$$

where $H(r_i \mid s_i, a)$ denotes conditional entropy. If $s_i \in M_i$ but $H(s_i) \neq \min_{\hat{s}_i \in M_i} H(\hat{s}_i)$, we refer to such a sub-state as a sufficient supporting sub-state.

The intuition of the minimal sufficient supporting sub-state is that it contains all and only the information required to compute a sub-reward. Note that $H(r_i \mid s_i, a)$ is not necessarily $0$ because of intrinsic randomness.

3.2 K-Minimal Supporting Reward Decomposition

To introduce our principles for reward decomposition, we start from several undesired trivial decompositions and one specific desired decomposition in the Monster-Treasure environment discussed in the previous section. The first trivial decomposition splits the total reward into two equal halves, i.e. $r/2$ and $r/2$, for which the minimal sufficient supporting sub-state of both channels is $s_1 = s_2 = s$. Another trivial decomposition is $r_1 = r$ and $r_2 = 0$ with corresponding minimal sufficient supporting sub-states $s_1 = s$ and $s_2 = \emptyset$; notice that the second channel does not contain any information. A more general case of trivial decomposition is $r_1 = r + f(s_{\text{agent}})$ and $r_2 = -f(s_{\text{agent}})$ with corresponding minimal sufficient supporting sub-states $s_1 = s$ and $s_2 = \{s_{\text{agent}}\}$, where $f$ is an arbitrary function. The second channel does contain information, but that information is in fact redundant. The last undesired decomposition is $r_1 = r_{\text{monster}} + \frac{1}{2} r_{\text{treasure}}$ and $r_2 = \frac{1}{2} r_{\text{treasure}}$, where the corresponding minimal sufficient supporting sub-states are $s_1 = s$ and $s_2 = \{s_{\text{agent}}, s_{\text{treasure}}\}$; $s_{\text{treasure}}$ in $s_1$ is clearly redundant. The ideal decomposition for the Monster-Treasure environment is to decompose the reward $r$ into $r_{\text{monster}}$ and $r_{\text{treasure}}$, because it is a compact decomposition in which each sub-reward has a compact minimal sufficient supporting sub-state.

To distinguish the ideal decomposition from the trivial ones, the first principle is that each channel should contain exclusive information that the other channels do not. On top of that, the second principle is that the sum of the information contained in each channel should be minimized. Motivated by the above observations, we define K-minimal supporting reward decomposition as follows:

Definition 2. Let $s_i$ and $\hat{s}_i$ be the minimal sufficient supporting sub-states of $r_i$ and $\hat{r}_i$ respectively. A set of sub-rewards $\{r_i(s)\}_{i=1}^{K}$ forms a K-minimal supporting reward decomposition if

$$\sum_i H(s_i) = \min_{\{\hat{r}_i\}} \sum_i H(\hat{s}_i) \quad \text{s.t.} \quad \sum_{i=1}^{K} \hat{r}_i = r, \quad \hat{s}_i \not\subseteq \hat{s}_j \ \ \forall i \neq j.$$

Note that there can be multiple K-minimal reward decompositions; e.g., swapping two channels of a K-minimal reward decomposition creates a new one. The intuition of the K-minimal supporting reward decomposition is to encourage a non-trivial and compact decomposition, while no sub-state $s_i$ is a subset of another sub-state $s_j$.

4 RD2 Algorithm

The minimal supporting principles define our ideal reward decomposition under a factored MDP, where selecting factors is inherently the optimization of a boolean mask over factors. However, complex environments pose more challenges for developing a practical algorithm. Specifically, the first challenge is to allow complex states such as raw images as input, rather than extracted factors. The second challenge is that estimating entropy in deep learning, using either sample-based or neural estimation methods, can be time-consuming.
In this section, we propose several techniques to overcome these two challenges.

4.1 Objectives

To overcome the challenge of taking raw images as input, instead of viewing pixels as factors, we use an $H \times W \times N$ feature map $f(s)$ as a map of factors, each encoding regional information. Here $H$ and $W$ represent the height and width after convolution, and $N$ is the number of channels of the feature map. In Section 3, we assumed that $s_i$ picks a fixed subset of $s$ as its sub-state, which is inherently a fixed binary mask. However, in image-based RL environments, even when we use a feature map instead of raw pixels, it is not realistic to assume that the mask is fixed for all states. This is similar to the attention mechanism; e.g., in the Monster-Treasure environment the mask would need to follow the monster's position to extract its information. To this end, we allow the mask on the feature map to depend on the state $s$, given by $m_i(s)$. The sub-state $\hat{s}_i$ can then be represented by

$$\hat{s}_i = f(s) \odot m_i(s) \qquad (3)$$

Definition 2 implies that the final objective for a reward decomposition is reached by minimizing $\sum_i H(\hat{s}_i)$. Normally we would first find the minimal sufficient supporting sub-state $s_i$ of a given reward decomposition, represented by $r_i$, and then evaluate $\sum_i H(s_i)$. However, this objective cannot backpropagate to $r_i$, since the operation of finding the minimal sufficient supporting sub-state is not differentiable. To tackle this issue, we let $r_i$ depend directly on $\hat{s}_i$ through $r_i = g_{\theta_i}(\hat{s}_i, a)$. The first constraint of the K-minimal supporting reward decomposition then leads to a straightforward objective:

$$L_{\text{sum}} = \Big(r - \sum_{i=1}^{K} g_{\theta_i}(\hat{s}_i, a)\Big)^2 \qquad (4)$$

Note that $\hat{s}_i$ is always a sufficient supporting sub-state for $r_i$, but not necessarily a minimal one. However, the minimality condition in Definition 1 can be approximated by minimizing $H(\hat{s}_i)$, which is also the objective of the K-minimal supporting reward decomposition given by Definition 2. So our second objective is given by

$$L_{\text{mini}} = \sum_{i=1}^{K} H(\hat{s}_i) \qquad (5)$$

The above two terms are still not sufficient for finding a K-minimal supporting reward decomposition. The second constraint of Definition 2, which is the non-triviality requirement, demands that $\hat{s}_i \not\subseteq \hat{s}_j$ for all $i \neq j$, which is equivalent to $H(\hat{s}_i \mid \hat{s}_j) > 0$ in general cases. We find this constraint to be critical in our experiments. Alternatively, an equivalent objective according to Definition 1 is $H(r_i \mid \hat{s}_i, a) < H(r_i \mid \hat{s}_j, a)$. Instead of simply demanding inequality, we further maximize $H(\hat{s}_i \mid \hat{s}_j)$ or $H(r_i \mid \hat{s}_j, a)$ to encourage diversity between sub-states. The last objective is given by

$$L_{\text{div1}} = -\sum_{i=1}^{K} \sum_{j=1, j \neq i}^{K} H(\hat{s}_i \mid \hat{s}_j) \qquad (6)$$

or

$$L_{\text{div2}} = -\sum_{i=1}^{K} \sum_{j=1, j \neq i}^{K} H(r_i \mid \hat{s}_j, a). \qquad (7)$$

4.2 Surrogate Loss for Entropy Estimation

Computing $L_{\text{mini}}$ and $L_{\text{div}}$ requires entropy estimation. Since the state space in Atari is very large, using sampling-based entropy estimation methods is unrealistic. Reliable neural entropy estimation methods exist, but they are in general time-consuming. We instead introduce approximate losses that are reasonable and convenient in our setting.

Approximating $H(\hat{s}_i)$. Recall that $H(cX) = H(X) + \log(|c|)$ and $H(X \mid cY) = H(X \mid Y)$ when $c$ is a constant and $c \neq 0$. Since we let $m_i \in (0, 1)^N$ and $\hat{s}_i = f(s) \odot m_i(s)$, an empirical estimate of $H(\hat{s}_i)$ can be derived:

$$H(\hat{s}_i) \approx H(f(s)) + \sum_{l=1}^{N} \log(m_{i,l}(s)) \leq H(f(s)) + \log\Big(\sum_{l=1}^{N} m_{i,l}(s)\Big) \qquad (8)$$

where $N$ is the size of the feature map. Note that if $m_i$ is fixed, the first approximation becomes an equality. The last inequality gives an upper bound that resolves the numerical issues of taking the log of a small float.
Since the entropy of the feature map $H(f(s))$ is irrelevant to the mask, we can optimize $H(\hat{s}_i)$ approximately by minimizing the second term:

$$L_{\text{mini}} = \sum_{i=1}^{K} \log\Big(\sum_{l=1}^{N} m_{i,l}(s)\Big) \qquad (9)$$

Approximating $H(\hat{s}_i \mid \hat{s}_j)$. Inspired by the method for estimating $H(\hat{s}_i)$, we propose an intuitive approximate loss for $H(\hat{s}_i \mid \hat{s}_j)$ that resembles $L_{\text{mini}}$:

$$L_{\text{div1}} = -\sum_{i=1}^{K} \sum_{j=1, j \neq i}^{K} \log\Big(\sum_{l=1}^{N} \mathrm{ReLU}\big(m_{i,l}(s) - m_{j,l}(s)\big)\Big). \qquad (10)$$

To explain the intuition behind $L_{\text{div1}}$, consider a factored MDP where a factor is either chosen or not chosen for each sub-state. A factor $x_k$ contributes to $H(\hat{s}_i \mid \hat{s}_j)$ only if $x_k$ is chosen by $\hat{s}_i$ but not chosen by $\hat{s}_j$, i.e. $m_{i,k} = 1$ and $m_{j,k} = 0$. A simple way to extend this logical expression to real values is to use $\mathrm{ReLU}(m_{i,k} - m_{j,k})$.

Approximating $H(r_i \mid \hat{s}_j, a)$. Estimating $H(r_i \mid \hat{s}_j, a)$ can be complicated in general. However, if we assume that $H(r_i \mid \hat{s}_j, a)$ is related only to the logarithm of its variance (e.g. a Gaussian distribution), i.e. $H(r_i \mid \hat{s}_j, a) \propto \log(\mathrm{Var}(r_i \mid \hat{s}_j, a))$, then a surrogate objective can be derived. Recall the definition of variance, $\mathrm{Var}(r_i \mid \hat{s}_j, a) = \mathbb{E}\big[r_i - \mathbb{E}(r_i \mid \hat{s}_j, a)\big]^2$. To obtain an estimate of $\mathbb{E}(r_i \mid \hat{s}_j, a)$, we use a network $\hat{r}_i = g_{\theta_{ij}}(\hat{s}_j, a)$ and minimize $\mathrm{MSE}(r_i, \hat{r}_i)$ over the parameters $\theta_{ij}$. We can then use $\hat{r}_i$ as an estimate of $\mathbb{E}(r_i \mid \hat{s}_j, a)$ and $\mathrm{MSE}(r_i, \hat{r}_i)$ as an approximation of $\mathrm{Var}(r_i \mid \hat{s}_j, a)$. Thus, maximizing $\mathrm{MSE}(r_i, \hat{r}_i)$ over $\hat{s}_j$ is equivalent to increasing $\log(\mathrm{Var}(r_i \mid \hat{s}_j, a))$, i.e. $H(r_i \mid \hat{s}_j, a)$:

$$L_{\text{div2}} = -\sum_{i=1}^{K} \sum_{j=1, j \neq i}^{K} \log\Big(\min_{\theta_{ij}} \big(g_{\theta_i}(\hat{s}_i, a) - g_{\theta_{ij}}(\hat{s}_j, a)\big)^2\Big). \qquad (11)$$

$L_{\text{div2}}$ penalizes information in $\hat{s}_j$ that is related to $r_i$, which enforces different channels to contain diversified information. The final objective of RD2 is given by

$$L = \alpha L_{\text{sum}} + \beta L_{\text{mini}} + \gamma L_{\text{div}} \qquad (12)$$

where $L_{\text{div}}$ has the two alternatives above and $\alpha$, $\beta$, $\gamma$ are coefficients. We provide the pseudo code of our algorithm in Appendix 1.

5 Experiment

In our experiments, we aim to answer the following questions: (1) Can RD2 learn reward decomposition? (2) Does RD2 learn meaningful masks on the state input? (3) How does RD2 perform in terms of using the decomposed rewards to improve sample efficiency?

5.1 Toycase

Figure 1: Monster-Treasure.

In this section, we test RD2 with mini-gridworld [Chevalier-Boisvert et al., 2018], configured as the Monster-Treasure environment discussed earlier and shown in Figure 1. In this environment, $r_{\text{treasure}} = 2$ when the agent (red triangle) finds the treasure (green grid), otherwise $r_{\text{treasure}} = 0$. The agent also receives a reward of $r_{\text{monster}} = -2$ when it collides with the moving monster (blue ball), otherwise $r_{\text{monster}} = 0$. Note that if the agent finds the treasure and collides with the monster at the same time, the reward $r = r_{\text{treasure}} + r_{\text{monster}}$ will be $0$. The coordinates of the objects are extracted into factors, given by $\{agent_x, agent_y, monster_x, monster_y, treasure_x, treasure_y\}$. The network takes the factors and the action as input and is trained with Equation 12 using the $L_{\text{div1}}$ variant. The mask in this case is trainable but does not depend on the input. Note that only $r = r_{\text{treasure}} + r_{\text{monster}}$ is used as a training signal. We find that RD2 is able to completely separate $r_{\text{treasure}}$ and $r_{\text{monster}}$ when trained only with $r$. As shown in Figure 2, the MSE loss for $r_{\text{treasure}}$ and $r_{\text{monster}}$ eventually converges to $0$. The mask gradually converges to the optimal mask, where $\hat{s}_1 = \{agent_x, agent_y, treasure_x, treasure_y\}$ and $\hat{s}_2 = \{agent_x, agent_y, monster_x, monster_y\}$.
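For concreteness, the sketch below shows one way the objective of Equation 12 (with the $L_{\text{div1}}$ variant) could be instantiated for this factored toy case. It is a minimal sketch under our own assumptions: the sigmoid parameterization of the trainable state-independent masks, the hidden sizes, the small constants inside the logarithms, and the loss coefficients are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# hypothetical sizes: 6 coordinate factors, 4 actions, K = 2 channels
N_FACTORS, N_ACTIONS, K = 6, 4, 2

class RD2Toy(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        # trainable, state-independent masks in (0, 1), one per channel
        self.mask_logits = nn.Parameter(torch.zeros(K, N_FACTORS))
        # one sub-reward head g_{theta_i}(s_hat_i, a) per channel
        self.g = nn.ModuleList([
            nn.Sequential(nn.Linear(N_FACTORS + N_ACTIONS, hidden),
                          nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(K)
        ])

    def forward(self, s, a_onehot):
        m = torch.sigmoid(self.mask_logits)                # (K, N_FACTORS)
        sub_r = []
        for i in range(K):
            s_hat_i = s * m[i]                             # masked sub-state
            sub_r.append(self.g[i](torch.cat([s_hat_i, a_onehot], dim=1)))
        return torch.cat(sub_r, dim=1), m                  # (B, K), (K, N_FACTORS)

def rd2_loss(model, s, a_onehot, r, alpha=1.0, beta=0.1, gamma_=0.1):
    sub_r, m = model(s, a_onehot)
    # L_sum: predicted sub-rewards must add up to the observed full reward
    l_sum = F.mse_loss(sub_r.sum(dim=1), r)
    # L_mini: keep each mask (a proxy for the sub-state entropy) small
    l_mini = torch.log(m.sum(dim=1) + 1e-6).sum()
    # L_div1: reward channel i for keeping factors that channel j drops
    l_div = 0.0
    for i in range(K):
        for j in range(K):
            if i != j:
                l_div = l_div - torch.log(F.relu(m[i] - m[j]).sum() + 1e-6)
    return alpha * l_sum + beta * l_mini + gamma_ * l_div
```

Note that only the observed full reward `r` enters the loss; the split into two sub-reward heads and two masks emerges from the trade-off between $L_{\text{sum}}$, $L_{\text{mini}}$, and $L_{\text{div1}}$.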
Figure 2: Monster-Treasure training curves. (a) Treasure reward error; (b) treasure mask; (c) monster reward error; (d) monster mask.

r_treasure   r_monster   r       predicted r_treasure   predicted r_monster   predicted r
2.00         0.00        2.00    1.85                   0.13                  1.99
0.00         -2.00       -2.00   -0.14                  -1.86                 -2.00
0.00         0.00        0.00    0.11                   -0.15                 -0.04
2.00         -2.00       0.00    1.92                   -1.84                 0.08

Table 1: Example of reward decomposition on Monster-Treasure.

In Monster-Treasure, there are two kinds of $(s, a)$ pairs that receive a reward of $0$. The first is $r_{\text{treasure}} = 0$ and $r_{\text{monster}} = 0$, which is trivial. The second is $r_{\text{treasure}} = 2$ and $r_{\text{monster}} = -2$, meaning that the agent finds the treasure but bumps into the monster at the same time. Notably, while $r$ does not reveal the difference between these two cases, RD2 is capable of telling them apart even though the total rewards are both $0$, since both $r_{\text{treasure}}$ and $r_{\text{monster}}$ are predicted accurately, as shown in Table 1. One specific observation, due to the continuous masking between $0$ and $1$, is that although both channel masks have non-zero values on agent-related factors, the values of channel 2 are significantly larger than those of channel 1 due to $L_{\text{div1}}$. However, as long as the value does not go to zero, we can consider that channel 1 treats the agent coordinates as required factors.

5.2 Atari Domain

We also run our algorithm on the more complicated Atari benchmark. Following Lin et al. [2019], we experiment with Atari games that have a structure of multiple reward sources. We first present the results of reward decomposition and visualize the trained masks using saliency maps on several Atari games, and then show that our decomposed rewards can accelerate the training process of existing RL algorithms. We show that RD2 achieves much better sample efficiency than the recently proposed reward decomposition method DRDRL [Lin et al., 2019] and Rainbow [Hessel et al., 2018].

Reward decomposition. We demonstrate that RD2 can learn a meaningful reward decomposition on Atari games with a multiple-reward structure. Figure 3 shows the results. In the game UpNDown, the agent receives a reward of 3 when it hits a flag and a reward of 2 when it jumps on another car. Our algorithm decomposes these two reward signals into two channels: when the agent jumps on another car, the first channel is activated and outputs a reward of 2; when it hits a flag, the second channel dominates the reward prediction and outputs a reward close to 3.

Visualization. To better understand how our algorithm works, we visualize the saliency map [Simonyan et al., 2013] by computing the absolute value of the Jacobian $|\partial r_i / \partial s|$ for each channel ($i = 1, 2$) in Figure 4 for the games UpNDown and Gopher. We find that RD2 successfully learns meaningful state decompositions. In UpNDown (top row), the first channel (blue) attends to the flag when the agent hits it (top left), while the second channel (pink) attends to other cars which the agent jumps on (top right). In Gopher (bottom row), the agent receives a reward of 0.15 when it fills the hole in the ground (bottom left) and a reward of 0.8 when it catches a gopher (bottom right). We notice that RD2 learns a saliency map that accurately distinguishes these two cases.
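As a rough illustration of this Jacobian-based saliency computation (a sketch under the assumption that the trained model returns per-channel sub-reward predictions as its first output; not the authors' code), one can backpropagate a single channel's predicted sub-reward to the input and take the absolute gradient:

```python
import torch

def sub_reward_saliency(model, state, action, channel):
    """Return |d r_i / d state| for one channel.

    `model(state, action)` is assumed to return per-channel sub-reward
    predictions of shape (batch, K) as its first output.
    """
    state = state.clone().detach().requires_grad_(True)
    sub_r = model(state, action)[0]
    # backpropagate only the chosen channel's predicted sub-reward
    sub_r[:, channel].sum().backward()
    return state.grad.abs()   # same shape as the input state / frame
```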
The first channel (blue) attends to the ground and predicts the 0.15 reward, while the second channel (pink) attends to the gopher and predicts the 0.8 reward. We also find that, with the help of the dynamic mask, the second channel (pink) always keeps its attention on the gopher.

Figure 3: Reward decomposition results (predicted sub-rewards on UpNDown frames with total reward 2 or 3, and on Gopher frames with total reward 0.15 or 0.8).

Figure 4: Saliency map visualization.

Joint training performance. We now train the sub-reward functions and the sub-Q networks simultaneously and use the decomposed reward to directly train the sub-Q networks for each channel, as in Lin et al. [2019] and Van Seijen et al. [2017]. In brief, we train multiple Q-networks and introduce an additional sub-Q TD error defined by

$$L_{TD_i} = \big[Q_i(s, a) - r_i - \gamma Q_i(s', a^*)\big]^2 \qquad (13)$$

Note that we use the global action $a^* = \arg\max_{a'} \sum_i Q_i(s', a')$ instead of local actions $a_i^* = \arg\max_{a'} Q_i(s', a')$ to keep the optimal Q-function unchanged. For a detailed version of combining RD2 with Q-learning, please refer to Appendix A. Q-learning combined with RD2 shows great improvements in sample efficiency compared with both Rainbow and DRDRL, as shown in Figure 5. At early epochs the curves of RD2 are below the baselines due to noise in the sub-reward signals, but once the reward decomposition is partly trained, it accelerates the agent's learning significantly.

Figure 5: Joint training performance on Atari games. Each curve is averaged over three random seeds.

6 Discussion and Conclusion

In this paper, we propose a set of novel reward decomposition principles which encourage sub-rewards to have compact and non-trivial representations, and a corresponding algorithm, termed RD2. Compared with existing methods, RD2 is capable of decomposing rewards for arbitrary state-action pairs under general RL settings and does not rely on policies. Experiments demonstrate that RD2 greatly improves sample efficiency over existing reward decomposition methods. One possible explanation for the performance of RD2 is its relation to learning compact state representations. Each learned sub-reward depends only on a subset of the state, allowing the corresponding sub-value to also depend on a subset of the state and thus to learn a compact representation. Therefore, RD2 naturally has a close connection to learning compact representations for sub-values, which speeds up RL algorithms. In the future, we will explore reward decomposition in the multi-agent RL setting. The state in multi-agent RL may have a natural graph structure modeling the agents' interactions, and we will explore how to leverage such structure for a better reward decomposition.

Broader Impact

Reinforcement learning has a wide range of applications in real life. In board games [Schrittwieser et al., 2019], RL has shown the potential to beat humans and thereby provide valuable insights. In optimal control, RL has also been widely used as a search policy that guarantees convergence. In general planning problems such as traffic control or recommendation systems, introducing RL is also an active line of research. Reward decomposition has many potential impacts, especially in multi-agent settings, where each agent should obtain a portion of the total reward, and in problems that require interpretation, such as recommendation systems.
RD2 is capable both of decomposing rewards into sub-rewards and, on top of that, of providing meaningful interpretation due to its disentangled representation. Integrating RD2 into those settings would provide benefits in both training and interpretability. However, the rise of autonomous analytic algorithms will inevitably decrease the demand for human data analysts.

References

Drew Bagnell and Andrew Y. Ng. On local rewards and scaling distributed reinforcement learning. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 91-98. MIT Press, 2006. URL http://papers.nips.cc/paper/2951-on-local-rewards-and-scaling-distributed-reinforcement-learning.pdf.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013. URL http://arxiv.org/abs/1206.5538.

C. Boutilier, T. Dean, and S. Hanks. Decision-Theoretic Planning: Structural Assumptions and Computational Leverage. JAIR, 11:1-94, 1999.

Craig Boutilier, Richard Dearden, and Moisés Goldszmidt. Exploiting structure in policy construction. In IJCAI, pages 1104-1113. Morgan Kaufmann, 1995. URL http://dblp.uni-trier.de/db/conf/ijcai/ijcai95.html#BoutilierDG95.

Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A research framework for deep reinforcement learning. 2018. URL http://arxiv.org/abs/1812.06110.

Tian Qi Chen, Xuechen Li, Roger B. Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, NeurIPS, pages 2615-2625, 2018. URL http://dblp.uni-trier.de/db/conf/nips/nips2018.html#ChenLGD18.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2172-2180. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6399-infogan-interpretable-representation-learning-by-information-maximizing-generative-adversarial-nets.pdf.

Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018.

Christopher Grimm and Satinder Singh. Learning independently-obtainable reward functions. CoRR, abs/1901.08649, 2019. URL http://dblp.uni-trier.de/db/journals/corr/corr1901.html#abs-1901-08649.

Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR (Poster). OpenReview.net, 2017. URL http://dblp.uni-trier.de/db/conf/iclr/iclr2017.html#HigginsMPBGBML17.

Wei-Ning Hsu, Yu Zhang, and James R. Glass. Unsupervised learning of disentangled and interpretable representations from sequential data. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, NIPS, pages 1878-1889, 2017. URL http://dblp.uni-trier.de/db/conf/nips/nips2017.html#HsuZG17.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Yuanpeng Li, Liang Zhao, Jianyu Wang, and Joel Hestness. Compositional generalization for primitive substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4284-4293, 2019.

Zichuan Lin, Li Zhao, Derek Yang, Tao Qin, Tie-Yan Liu, and Guangwen Yang. Distributional reward decomposition for reinforcement learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 6212-6221. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8852-distributional-reward-decomposition-for-reinforcement-learning.pdf.

Michael Littman and Justin Boyan. A distributed reinforcement learning scheme for network routing, 1993.

Bhaskara Marthi. Automatic shaping and decomposition of reward functions, 2007.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, February 2015. ISSN 0028-0836. URL http://dx.doi.org/10.1038/nature14236.

OpenAI, Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique Pondé de Oliveira Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with large scale deep reinforcement learning, 2019.

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 2017. ISBN 978-0-262-03731-0. URL https://mitpress.mit.edu/books/elements-causal-inference.

Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994. ISBN 0471619779.

Stuart Russell and Andrew Zimdars. Q-decomposition for reinforcement learning agents. Volume 2, pages 656-663, January 2003.

Jeff Schneider, Weng-Keen Wong, Andrew Moore, and Martin Riedmiller. Distributed value functions. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 371-378. Morgan Kaufmann, 1999.

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Harm Van Seijen, Mehdi Fatemi, Joshua Romoff, Romain Laroche, Tavian Barnes, and Jeffrey Tsang. Hybrid reward architecture for reinforcement learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5392-5402. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7123-hybrid-reward-architecture-for-reinforcement-learning.pdf.