Published as a conference paper at ICLR 2021

VARIATIONAL INTRINSIC CONTROL REVISITED

Taehwan Kwon
NC
kth315@ncsoft.com

ABSTRACT

In this paper, we revisit variational intrinsic control (VIC), an unsupervised reinforcement learning method for finding the largest set of intrinsic options available to an agent. In the original work by Gregor et al. (2016), two VIC algorithms were proposed: one that represents the options explicitly, and one that does so implicitly. We show that the intrinsic reward used in the latter is subject to bias in stochastic environments, causing convergence to suboptimal solutions. To correct this behavior and achieve the maximal empowerment, we propose two methods based respectively on a transitional probability model and a Gaussian mixture model. We substantiate our claims through rigorous mathematical derivations and experimental analyses.

1 INTRODUCTION

Variational intrinsic control (VIC), proposed by Gregor et al. (2016), is an unsupervised reinforcement learning algorithm that aims to discover as many intrinsic options as possible, i.e., policies with a termination condition that meaningfully affect the world. The main idea of VIC is to maximize the mutual information between the set of options and the final states, called empowerment. Maximal empowerment is desirable because it maximizes the information about the final states that the agent can achieve with the available options. These options are independent of the extrinsic reward of the environment, so they can be considered the agent's universal knowledge about the environment.

The concept of empowerment was introduced in (Klyubin et al., 2005; Salge et al., 2014), along with methods for measuring it based on expectation maximization (Arimoto, 1972; Blahut, 1972). These works defined an option as a sequence of a fixed number of actions. Yeung (2008) proposed to maximize the empowerment using the Blahut-Arimoto (BA) algorithm, but its complexity increases exponentially with the sequence length, rendering it impractical for high-dimensional and long-horizon options. Mohamed & Rezende (2015) adopted techniques from deep learning and variational inference (Barber & Agakov, 2003) and successfully applied empowerment maximization to high-dimensional and long-horizon control. However, this method maximizes the empowerment over open-loop options, meaning that the sequence of actions is chosen in advance and executed regardless of the (potentially stochastic) environment dynamics. This often impairs performance, as the agent cannot properly react to the environment, leading to a significant underestimation of empowerment (Gregor et al., 2016).

To overcome this limitation, Gregor et al. (2016) proposed to use closed-loop options, meaning that the sequence of actions is chosen taking the visited states into account. This type of option differs from those in Klyubin et al. (2005), Salge et al. (2014) and Mohamed & Rezende (2015) in that it has a termination condition instead of a fixed number of actions. They presented two algorithms: VIC with explicit options and VIC with implicit options (we will call them explicit and implicit VIC from here on). Explicit VIC defines a fixed number of options, and an option is sampled at the beginning of each trajectory, conditioning the policy of the agent until termination. In other words, both the state and the sampled option are inputs to the policy function of the agent.
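To make this option-conditioned policy concrete, the following is a minimal PyTorch sketch of a policy network that receives both the state and a sampled option index. The discrete option set, layer sizes and module names are illustrative assumptions, not the architecture used by Gregor et al. (2016) or in this paper.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class OptionConditionedPolicy(nn.Module):
    """pi(a | s, Omega): the option sampled at t = 0 conditions every action
    until the termination action is taken (explicit VIC)."""
    def __init__(self, state_dim, num_options, num_actions, hidden=64):
        super().__init__()
        self.option_emb = nn.Embedding(num_options, hidden)
        self.net = nn.Sequential(
            nn.Linear(state_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state, option_id):
        x = torch.cat([state, self.option_emb(option_id)], dim=-1)
        return Categorical(logits=self.net(x))

policy = OptionConditionedPolicy(state_dim=2, num_options=8, num_actions=5)
omega = torch.randint(0, 8, (1,))               # option sampled once per trajectory
action = policy(torch.zeros(1, 2), omega).sample()
```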
One clear limitation of explicit VIC is that it requires a preset number of options. This applies not only to explicit VIC but also to some recent unsupervised learning algorithms that adopt a discrete option or skill from a predefined set (Machado et al., 2017; Eysenbach et al., 2018). Obviously, presetting the number of options limits the number of options an agent can learn, impeding the maximal level of empowerment. Moreover, choosing a proper number of options is not straightforward, since the maximum of the objective for a given number of options depends on several unknown environmental factors, such as the cardinality of the state space and the transitional model. To overcome this issue, Gregor et al. (2016) proposed implicit VIC, which defines the option as the trajectory (i.e., the sequence of states and actions) until termination. There exist multiple trajectories that lead to the same final state, and implicit VIC learns to maximize the mutual information between the final state and the trajectory by controlling the latter. As a result, the number of options is no longer limited by a preset number, and neither is the empowerment. Despite this advantage, however, these implicit options induce a bias in the intrinsic reward that hinders implicit VIC from achieving maximal empowerment. This effect grows with the stochasticity of the environment, and it may cause serious degradation of empowerment, limiting the agent from learning the universal knowledge of the environment.

In this paper, we aim to solve this empowerment degradation of implicit VIC under stochastic dynamics. To this end, we revisit variational intrinsic control and make the following contributions:

1. We show that the intrinsic reward in implicit VIC suffers from a variational bias in stochastic environments, causing convergence to suboptimal solutions (Section 2).
2. To compensate for this bias and achieve the maximal empowerment, we suggest two modifications of implicit VIC: environment dynamics modeling with a transitional probability model (Section 3) and with a Gaussian mixture model (Section 4).

2 VARIATIONAL BIAS OF IMPLICIT VIC IN STOCHASTIC ENVIRONMENTS

In this section, we derive the variational bias of implicit VIC under stochastic environment dynamics. First, we adopt the definitions of the termination action and the final state from Gregor et al. (2016) and briefly review VIC. The termination action terminates the option and yields the final state s_f = s_t, independently of the environmental action space. VIC aims to maximize the empowerment, i.e., the mutual information between the option Ω and the final state s_f, which can be written as follows:

I(Ω, s_f | s_0) = -∑_Ω p(Ω|s_0) log p(Ω|s_0) + ∑_{Ω,s_f} p(s_f|s_0, Ω) p(Ω|s_0) log p(Ω|s_0, s_f).   (1)

Since p(Ω|s_0, s_f) is intractable, VIC (Gregor et al., 2016) derives the variational bound I^{VB} ≤ I and maximizes it instead:

I^{VB}(Ω, s_f | s_0) = -∑_Ω p(Ω|s_0) log p(Ω|s_0) + ∑_{Ω,s_f} p(s_f|s_0, Ω) p(Ω|s_0) log q(Ω|s_0, s_f),   (2)

where q(Ω|s_0, s_f) is the inference model to be trained. When I^{VB} is maximized, we have q(Ω|s_0, s_f) = p(Ω|s_0, s_f) and achieve the maximum of I. As explained in Section 1, explicit VIC samples an explicit option at the beginning of a trajectory, and the option conditions the policy as π(a|s, Ω) until termination. Due to the randomness of the policy, the final state is undetermined for a given option until the policy converges to achieve a specific option.
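To make the relation between (1) and (2) concrete, here is a small numerical sketch. The two-option, two-final-state distributions and the imperfect inference model q are made-up toy numbers, used only to illustrate that I^{VB} lower-bounds I and matches it when q equals the true posterior.

```python
import numpy as np

# Toy example: 2 options, 2 final states (illustrative numbers only).
p_omega = np.array([0.5, 0.5])                    # p(Omega | s0)
p_sf_given_omega = np.array([[0.9, 0.1],          # p(s_f | s0, Omega)
                             [0.2, 0.8]])
p_joint = p_omega[:, None] * p_sf_given_omega     # p(Omega, s_f | s0)
p_sf = p_joint.sum(axis=0)                        # p(s_f | s0)
posterior = p_joint / p_sf                        # p(Omega | s0, s_f)

# True mutual information, eq. (1).
I = -(p_omega * np.log(p_omega)).sum() + (p_joint * np.log(posterior)).sum()

# Variational bound, eq. (2), with an arbitrary (imperfect) inference model q.
q = np.array([[0.7, 0.3],                         # q(Omega | s0, s_f), columns sum to 1
              [0.3, 0.7]])
I_VB = -(p_omega * np.log(p_omega)).sum() + (p_joint * np.log(q)).sum()

assert I_VB <= I + 1e-12       # bound holds; tight iff q equals the posterior
print(I, I_VB)
```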
Unlike explicit VIC, implicit VIC defines its option as a trajectory until termination, and hence the final state is determined for a given option. This can be expressed as

p(s_f | Ω, s_0) = 1 if s_f = s_{f|Ω}, and 0 otherwise,   (3)

where s_{f|Ω} is the final state of an option Ω. This is a key characteristic of implicit VIC and is essential for deriving the main results of this paper. We will often use this equation to eliminate s_f in the derivations. Interestingly, this makes a difference in empowerment maximization between explicit and implicit VIC, which can be explained by rewriting I(Ω, s_f|s_0) as follows:

I(Ω, s_f|s_0) = H(s_f|s_0) - H(s_f|Ω, s_0).

Note that H(s_f|Ω, s_0) is 0 for implicit VIC since s_f is determined for a given Ω. One can notice that to maximize empowerment, the agent needs to learn (1) to maximize H(s_f|s_0) and (2) to minimize H(s_f|Ω, s_0). While explicit VIC needs to learn both (1) and (2), implicit VIC needs to learn only (1), since (2) is already achieved by the definition of the option. This makes the learning of implicit VIC easier and faster. Moreover, implicit VIC is scalable, as explained in Section 1. Despite these strengths, implicit VIC suffers from a variational bias in the intrinsic reward under stochastic environments that can seriously degrade the empowerment.

We derive this variational bias by decomposing p(Ω|s_0) and p(Ω|s_f, s_0). Since implicit VIC defines the option Ω as the trajectory of the agent, i.e., the sequence of states and actions Ω = (s_0, a_0, s_1, a_1, ..., s_{T-1}, a_{T-1} = a_f, s_T = s_f = s_{T-1}), the probability of an option can be decomposed as

p(Ω|s_0) = ∏_{(τ_t, a_t, s_{t+1}) ∈ Ω} p(a_t|τ_t) p(s_{t+1}|τ_t, a_t)   (4)

using the Bayes rule, with τ_t = (s_0, a_0, ..., s_t). Similarly, p(Ω|s_0, s_f) can be expressed as

p(Ω|s_0, s_{f|Ω}) = ∏_{(τ_t, a_t, s_{t+1}) ∈ Ω} p(a_t|τ_t, s_{f|Ω}) p(s_{t+1}|τ_t, a_t, s_{f|Ω}).   (5)

Note that s_f is replaced by s_{f|Ω} since it is determined by the given Ω. Using (4), (5) and (3), we can rewrite the mutual information (1) of implicit VIC as

I(Ω, s_f|s_0) = ∑_Ω p(Ω|s_0) ∑_{(τ_t, a_t, s_{t+1}) ∈ Ω} [ log ( p(a_t|τ_t, s_{f|Ω}) / p(a_t|τ_t) ) + log ( p(s_{t+1}|τ_t, a_t, s_{f|Ω}) / p(s_{t+1}|τ_t, a_t) ) ].   (6)

The intrinsic reward of implicit VIC (Gregor et al., 2016) is given by

r^{VB}_Ω = ∑_{(τ_t, a_t, s_{t+1}) ∈ Ω} log ( q(a_t|τ_t, s_{f|Ω}) / p(a_t|τ_t) ),   (7)

where q(a_t|τ_t, s_f) is the inference model and p(a_t|τ_t) is the policy of the agent. It can be shown that r^{VB}_Ω comes from the first part of (6) (see Appendix A for details). Under deterministic environment dynamics, the transitional part log p(s_{t+1}|τ_t, a_t, s_{f|Ω}) / p(s_{t+1}|τ_t, a_t) cancels out, since both the numerator and the denominator are always 1. However, this is not the case under stochastic environment dynamics, and it yields the variational bias b^{VIC}_Ω in the intrinsic reward:

b^{VIC}_Ω = ∑_{(τ_t, a_t, s_{t+1}) ∈ Ω} log ( p(s_{t+1}|τ_t, a_t, s_{f|Ω}) / p(s_{t+1}|τ_t, a_t) ).   (8)

From (8), we see that this bias comes from the difference between the transitional probabilities with and without the given final state. This difference is large when s_{f|Ω} in the numerator plays a crucial role, which then causes a large bias. One extreme case is when s_{t+1} is a necessary transition to reach s_{f|Ω} but not the only possible transition from (τ_t, a_t). Then p(s_{t+1}|τ_t, a_t, s_{f|Ω}) is 1 but p(s_{t+1}|τ_t, a_t) is not, yielding a large bias. In Section 5, we provide experimental evidence that this variational bias leads to suboptimal training. Even though the original VIC (Gregor et al., 2016) subtracts a baseline b(s_0) from r^{VB}_Ω to reduce the variance of learning, it cannot compensate for this bias, since the bias depends on Ω while the baseline does not.
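The following short sketch computes the bias term (8) for one hypothetical trajectory; the per-step probabilities are made up for illustration. It shows how a few moderately stochastic but "necessary" transitions already add a policy-independent bonus of about one nat to the intrinsic reward.

```python
import numpy as np

def bias_vic(p_trans_given_sf, p_trans):
    """b^VIC_Omega from (8): sum of log-ratios of the per-step transition
    probabilities with and without conditioning on the final state s_f|Omega."""
    return float(np.sum(np.log(np.asarray(p_trans_given_sf) /
                               np.asarray(p_trans))))

# Hypothetical 3-step trajectory: each realized transition is necessary to
# reach s_f|Omega (probability 1 given s_f) but occurs with probability 0.7
# under the environment dynamics alone.
print(bias_vic([1.0, 1.0, 1.0], [0.7, 0.7, 0.7]))   # 3 * log(1/0.7) ~= 1.07 nats
```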
In the next section, we analyze the mutual information (1) in more detail under stochastic environment dynamics and define a variational estimate of (1), I^{VE}, for training.

3 IMPLICIT VIC WITH TRANSITIONAL PROBABILITY MODEL

In this section, we analyze I(Ω, s_f|s_0) under stochastic environment dynamics and propose to explicitly model the transitional probabilities to fix the variational bias. First, for a given option and final state, we define p_π(Ω|s_f, s_0), p_ρ(Ω|s_f, s_0), p_π(Ω|s_0) and p_ρ(Ω|s_0) as follows:

p_π(Ω|s_f, s_0) = ∏_{(τ_t, a_t, s_{t+1}) ∈ Ω} p_π(a_t|τ_t, s_f),    p_ρ(Ω|s_f, s_0) = ∏_{(τ_t, a_t, s_{t+1}) ∈ Ω} p_ρ(s_{t+1}|τ_t, a_t, s_f),
p_π(Ω|s_0) = ∏_{(τ_t, a_t, s_{t+1}) ∈ Ω} p_π(a_t|τ_t),    p_ρ(Ω|s_0) = ∏_{(τ_t, a_t, s_{t+1}) ∈ Ω} p_ρ(s_{t+1}|τ_t, a_t).   (9)

Note that p(Ω|s_f, s_0) = p_π(Ω|s_f, s_0) p_ρ(Ω|s_f, s_0) is the true distribution of Ω for a given s_f, where p_π(Ω|s_f, s_0) is the policy-related part and p_ρ(Ω|s_f, s_0) is the transitional part of p(Ω|s_f, s_0); the same decomposition holds for p(Ω|s_0) with p_π(Ω|s_0) and p_ρ(Ω|s_0). It is necessary to consider the transitional probabilities, since they induce the variational bias in the intrinsic reward. Hence we model (9) as follows:

π_q(Ω|s_f, s_0) = ∏_{(τ_t, a_t, s_{t+1}) ∈ Ω} π_q(a_t|τ_t, s_f),    ρ_q(Ω|s_f, s_0) = ∏_{(τ_t, a_t, s_{t+1}) ∈ Ω} ρ_q(s_{t+1}|τ_t, a_t, s_f),
π_p(Ω|s_0) = ∏_{(τ_t, a_t, s_{t+1}) ∈ Ω} π_p(a_t|τ_t),    ρ_p(Ω|s_0) = ∏_{(τ_t, a_t, s_{t+1}) ∈ Ω} ρ_p(s_{t+1}|τ_t, a_t),   (10)

where π_q, ρ_q, π_p and ρ_p are the models to be trained. We know the policy of the agent, so we have p_π(a_t|τ_t) = π_p(a_t|τ_t). The other probabilities are trained by our algorithms. Now we can rewrite I(Ω, s_f|s_0) using (9) as

I(Ω, s_f|s_0) = ∑_{Ω,s_f} p(Ω, s_f|s_0) [ log p_ρ(Ω|s_f, s_0) p_π(Ω|s_f, s_0) - log p_ρ(Ω|s_0) p_π(Ω|s_0) ].   (11)

Using (10), we define I^{VE} as follows:

I^{VE}(Ω, s_f|s_0) = ∑_{Ω,s_f} p(Ω, s_f|s_0) [ log ρ_q(Ω|s_f, s_0) π_q(Ω|s_f, s_0) - log ρ_p(Ω|s_0) π_p(Ω|s_0) ].   (12)

This is an estimate of the mutual information between Ω and s_f with transitional models. Note that, unlike I^{VB}, I^{VE} is not a variational lower bound on I, hence the derivation of VIC in Gregor et al. (2016) is not applicable in this case. To tackle this problem, we start from the absolute difference |I - I^{VE}|, which can be bounded as

|I - I^{VE}| ≤ U^{VE} = ∑_{s_f} p(s_f|s_0) D_KL( p_π(·|s_f, s_0) p_ρ(·|s_f, s_0) || π_q(·|s_f, s_0) ρ_q(·|s_f, s_0) ) + D_KL( p_π(·|s_0) p_ρ(·|s_0) || π_p(·|s_0) ρ_p(·|s_0) ).   (13)

See Appendix B for the derivation. Note that U^{VE} is a sum of positively weighted KL divergences, which means that U^{VE} → 0 if and only if π_q(·|s_f, s_0) → p_π(·|s_f, s_0), ρ_q(·|s_f, s_0) → p_ρ(·|s_f, s_0) and ρ_p(·|s_0) → p_ρ(·|s_0) for all s_f. In other words, our estimate of the mutual information converges to the true value as our models (10) converge to the true distributions (9). It makes sense that we can estimate the true value of the mutual information if we know the true distributions. Hence we minimize U^{VE} with respect to π_q, ρ_q and ρ_p. If π_q, ρ_q and ρ_p are parameterized by θ^q_π, θ^q_ρ and θ^p_ρ, we can obtain the gradients of U^{VE} using (3) as follows:

∇_{θ^q_π} U^{VE} = -∑_Ω p(Ω|s_0) ∇_{θ^q_π} log π_q(Ω|s_{f|Ω}, s_0),
∇_{θ^q_ρ} U^{VE} = -∑_Ω p(Ω|s_0) ∇_{θ^q_ρ} log ρ_q(Ω|s_{f|Ω}, s_0),
∇_{θ^p_ρ} U^{VE} = -∑_Ω p(Ω|s_0) ∇_{θ^p_ρ} log ρ_p(Ω|s_0),   (14)

which can be estimated from the sample mean. Once we have (π_q(·|s_f, s_0), ρ_q(·|s_f, s_0), ρ_p(·|s_0)) ≈ (p_π(·|s_f, s_0), p_ρ(·|s_f, s_0), p_ρ(·|s_0)) for all s_f, we can update the policy to maximize I.
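In practice, the three gradients in (14) are Monte Carlo estimates of maximum-likelihood (negative log-likelihood) updates on sampled trajectories. The sketch below assumes the per-step log-probabilities have already been computed by the respective models (e.g. on top of a trajectory encoder); the function name and the dummy stand-in tensors are illustrative.

```python
import torch

def model_losses(logp_pi_q, logp_rho_q, logp_rho_p):
    """Per-trajectory losses whose gradients are sample estimates of (14).

    Each argument is a 1-D tensor of per-step log-probabilities for one
    sampled option Omega:
        logp_pi_q[t]  = log pi_q(a_t | tau_t, s_f)
        logp_rho_q[t] = log rho_q(s_{t+1} | tau_t, a_t, s_f)
        logp_rho_p[t] = log rho_p(s_{t+1} | tau_t, a_t)
    Minimizing each negative log-likelihood descends U^VE with respect to
    theta^q_pi, theta^q_rho and theta^p_rho, respectively.
    """
    return -logp_pi_q.sum(), -logp_rho_q.sum(), -logp_rho_p.sum()

# Toy usage: leaf tensors stand in for real model outputs so backward() runs.
T = 4
probs = [torch.rand(T, requires_grad=True) for _ in range(3)]
losses = model_losses(*[torch.log(p.clamp(min=1e-3)) for p in probs])
sum(losses).backward()          # gradients flow into each model separately
```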
If π_p is parameterized by θ^p_π, the gradients ∇_{θ^p_π} I and ∇_{θ^p_π} I^{VE} can be obtained as follows using (3):

∇_{θ^p_π} I = ∑_Ω p(Ω|s_0) [ log p_π(Ω|s_{f|Ω}, s_0) p_ρ(Ω|s_{f|Ω}, s_0) - log π_p(Ω|s_0) p_ρ(Ω|s_0) ] ∇_{θ^p_π} log π_p(Ω|s_0),
∇_{θ^p_π} I^{VE} = ∑_Ω p(Ω|s_0) [ log π_q(Ω|s_{f|Ω}, s_0) ρ_q(Ω|s_{f|Ω}, s_0) - log π_p(Ω|s_0) ρ_p(Ω|s_0) ] ∇_{θ^p_π} log π_p(Ω|s_0),   (15)

where the bracketed term in the first line is the true intrinsic reward r^I_Ω, the bracketed term in the second line is the estimated intrinsic reward r^{VE}_Ω used in Algorithm 1, and p_π(Ω|s_0) is replaced by π_p(Ω|s_0) since we know the true value of the policy (see Appendix A for details). From (15), we see that ∇_{θ^p_π} I^{VE} → ∇_{θ^p_π} I as π_q(·|s_f, s_0) → p_π(·|s_f, s_0), ρ_q(·|s_f, s_0) → p_ρ(·|s_f, s_0) and ρ_p(·|s_0) → p_ρ(·|s_0) for all s_f. This means that we can estimate the correct gradient of the mutual information with respect to the policy as our models (10) converge to the true distributions (9). Note that for deterministic environments we can omit ρ_q(·|s_f, s_0) and ρ_p(·|s_0), since they are always 1. Then we have I^{VE} = I^{VB} and ∇_{θ^q_π} U^{VE} = -∇_{θ^q_π} I^{VB}, which means that maximizing I^{VB} is equivalent to minimizing U^{VE} with respect to θ^q_π (i.e., it is equivalent to the original implicit VIC).

Finally, Algorithm 1 summarizes the modified implicit VIC with the transitional probability model. The steps added to the original implicit VIC are marked with (*). Note that Algorithm 1 is not always practically applicable, since it is hard to model p(s_{t+1}|τ_t, a_t) and p(s_{t+1}|τ_t, a_t, s_f) when the cardinality of the state space is unknown (we cannot set the number of output nodes of the softmax). In our experiments with Algorithm 1, we assume that we know the cardinality of the state space. This allows us to model the transitional distributions ρ_p(s_{t+1}|τ_t, a_t) and ρ_q(s_{t+1}|τ_t, a_t, s_f) with a softmax. In the next section, we propose a practically applicable method that avoids this intractability.

Algorithm 1 Implicit VIC with transitional probability model

Initialize s_0, η, T_train, θ^q_π, θ^q_ρ, θ^p_π and θ^p_ρ.
for i_train = 1 to T_train do
    Follow π_p(a_t|τ_t), resulting in Ω = (s_0, a_0, ..., s_f).
    r^{VE}_Ω ← ∑_t [ log π_q(a_t|τ_t, s_f) - log π_p(a_t|τ_t) ]    from (15)
    r^{VE}_Ω ← r^{VE}_Ω + ∑_t [ log ρ_q(s_{t+1}|τ_t, a_t, s_f) - log ρ_p(s_{t+1}|τ_t, a_t) ]    from (15)  (*)
    Update each parameter:
        θ^p_π ← θ^p_π + η r^{VE}_Ω ∇_{θ^p_π} ∑_t log π_p(a_t|τ_t)    from (15)
        θ^q_π ← θ^q_π + η ∇_{θ^q_π} ∑_t log π_q(a_t|τ_t, s_f)    from (14)
        θ^p_ρ ← θ^p_ρ + η ∇_{θ^p_ρ} ∑_t log ρ_p(s_{t+1}|τ_t, a_t)    from (14)  (*)
        θ^q_ρ ← θ^q_ρ + η ∇_{θ^q_ρ} ∑_t log ρ_q(s_{t+1}|τ_t, a_t, s_f)    from (14)  (*)
end for

4 IMPLICIT VIC WITH GAUSSIAN MIXTURE MODEL

In this section, we propose an alternative method that overcomes the limitation of Algorithm 1 by modeling smoothed transitional distributions. This allows us to use a Gaussian mixture model (GMM) (Pearson, 1894) or other continuous distributional models for modeling the transitional distribution. First, we smooth p(s_{t+1}|τ_t, a_t, s_f) and p(s_{t+1}|τ_t, a_t) into f_σ(x_{t+1}|τ_t, a_t, s_f) and f_σ(x_{t+1}|τ_t, a_t), with x_{t+1} = s_{t+1} + z_{t+1} and z_{t+1} ~ N(0, σ²I_d):

f_σ(x_{t+1}|τ_t, a_t, s_f) = ∑_{s' ∈ S(τ_t, a_t, s_f)} p(s'|τ_t, a_t, s_f) f_σ(x_{t+1} - s'; 0, σ²I_d),
f_σ(x_{t+1}|τ_t, a_t) = ∑_{s' ∈ S(τ_t, a_t)} p(s'|τ_t, a_t) f_σ(x_{t+1} - s'; 0, σ²I_d),   (16)

where S(τ_t, a_t, s_f) = { s' | p(s'|τ_t, a_t, s_f) > 0 }, S(τ_t, a_t) = { s' | p(s'|τ_t, a_t) > 0 }, f_σ(·; 0, σ²I_d) is the Gaussian density with mean 0 and covariance σ²I_d, and d is the dimension of the state. Then, using the GMM, we model (16) as f^q_σ(x_{t+1}|τ_t, a_t, s_f) and f^p_σ(x_{t+1}|τ_t, a_t):

f^q_σ(x_{t+1}|τ_t, a_t, s_f) = ∑_{i=1}^{n_gmm} w_i(τ_t, a_t, s_f) f_σ(x_{t+1}; μ_i(τ_t, a_t, s_f), σ²I_d),
f^p_σ(x_{t+1}|τ_t, a_t) = ∑_{i=1}^{n_gmm} w_i(τ_t, a_t) f_σ(x_{t+1}; μ_i(τ_t, a_t), σ²I_d).   (17)

Note that if we set n_gmm > max_{τ_t, a_t} |S(τ_t, a_t)|, then (17) can perfectly fit (16). Now, using (17), we define I^{VE}_σ, the variational estimate with smoothing, as follows:

I^{VE}_σ(Ω, s_f|s_0) = ∑_{Ω,s_f} p(Ω, s_f|s_0) [ log π_q(Ω|s_f, s_0) f^q_σ(Ω|s_f, s_0) - log π_p(Ω|s_0) f^p_σ(Ω|s_0) ].   (18)
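A GMM head of the form (17) only has to output mixture weights and means; the shared covariance is fixed to σ²I_d. The sketch below evaluates the log-density of such a mixture at a smoothed next state, which is the quantity needed by the intrinsic reward and by the fitting step later in this section; the tensor shapes and toy values are illustrative.

```python
import math
import torch

def gmm_log_density(x, weights, means, sigma):
    """log f_sigma(x) for a Gaussian mixture with shared covariance sigma^2 I_d,
    as in (17). `weights` sums to 1 over the mixture components.

    x:       (d,)        query point, e.g. a smoothed next state x_{t+1}
    weights: (n_gmm,)    w_i(tau_t, a_t[, s_f]) predicted by the model
    means:   (n_gmm, d)  mu_i(tau_t, a_t[, s_f]) predicted by the model
    """
    d = x.shape[-1]
    sq_dist = ((x.unsqueeze(0) - means) ** 2).sum(-1)                  # (n_gmm,)
    log_norm = -0.5 * d * math.log(2.0 * math.pi * sigma ** 2)
    log_comp = -0.5 * sq_dist / sigma ** 2 + log_norm
    return torch.logsumexp(torch.log(weights) + log_comp, dim=0)

# Toy check with a 2-component mixture and sigma = 0.25 as in Section 5.
x = torch.tensor([1.0, 0.0]) + 0.25 * torch.randn(2)   # s_{t+1} + z_{t+1}
w = torch.tensor([0.7, 0.3])
mu = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
print(gmm_log_density(x, w, mu, sigma=0.25))
```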
Note that I^{VE}_σ is not a variational lower bound on I either, hence the derivation of implicit VIC in Gregor et al. (2016) again does not apply. As in Section 3, we start from the absolute difference between I and I^{VE}_σ. An upper bound on |I - I^{VE}_σ| can be obtained as follows: |I - I^{VE}_σ| ≤ U^{VE}_{σ,1} + U^{VE}_{σ,2} with

U^{VE}_{σ,1} = ∑_{Ω,s_f} p(Ω, s_f|s_0) | log [ ( p_ρ(Ω|s_f, s_0) / p_ρ(Ω|s_0) ) · ( f^p_σ(Ω|s_0) / f^q_σ(Ω|s_f, s_0) ) ] |,
U^{VE}_{σ,2} = ∑_{s_f} p(s_f|s_0) D_KL( p_π(·|s_f, s_0) p_ρ(·|s_f, s_0) || π_q(·|s_f, s_0) p_ρ(·|s_f, s_0) ).   (19)

See Appendix C for the derivation. Note that U^{VE}_{σ,2} differs from the first part of U^{VE} in (13). This upper bound implies that U^{VE}_{σ,1} → 0 as f^q_σ(·|s_f, s_0)/f^p_σ(·|s_0) → p_ρ(·|s_f, s_0)/p_ρ(·|s_0) for all s_f, and that U^{VE}_{σ,2} → 0 if and only if π_q(·|s_f, s_0) → p_π(·|s_f, s_0) for all s_f, since U^{VE}_{σ,2} is a sum of positively weighted KL divergences. To estimate the correct value of the mutual information from (18), we minimize U^{VE}_{σ,1} and U^{VE}_{σ,2}. The gradient of U^{VE}_{σ,2} can be obtained as below using (3):

∇_{θ^q_π} U^{VE}_{σ,2} = -∑_Ω p(Ω|s_0) ∇_{θ^q_π} log π_q(Ω|s_{f|Ω}, s_0),   (20)

which can be estimated from the sample mean. As in Algorithm 1, it satisfies ∇_{θ^q_π} U^{VE}_{σ,2} = -∇_{θ^q_π} I^{VB}, which means that minimizing U^{VE}_{σ,2} is equivalent to maximizing I^{VB} with respect to θ^q_π. Since U^{VE}_{σ,2} is 0 if and only if π_q(·|s_f, s_0) = p_π(·|s_f, s_0) for all s_f, this update makes π_q(·|s_f, s_0) converge to p_π(·|s_f, s_0) for all s_f.

Unlike U^{VE}_{σ,2}, it is difficult to minimize U^{VE}_{σ,1} directly due to the absolute value function. However, it can be minimized by estimating the correct ratio between the transitional distributions, p_ρ(·|s_f, s_0)/p_ρ(·|s_0), even though their individual true values are unknown. To estimate this ratio, we fit (17) to (16) by minimizing D_KL[ f_σ(·|τ_t, a_t, s_f) || f^q_σ(·|τ_t, a_t, s_f) ] and D_KL[ f_σ(·|τ_t, a_t) || f^p_σ(·|τ_t, a_t) ]. If f^q_σ and f^p_σ are parameterized by θ^q_ρ and θ^p_ρ, the gradients of the KL divergences can be obtained as follows:

∇_{θ^q_ρ} D_KL[ f_σ(·|τ_t, a_t, s_f) || f^q_σ(·|τ_t, a_t, s_f) ] = -∫_{x_{t+1}} f_σ(x_{t+1}|τ_t, a_t, s_f) ∇_{θ^q_ρ} log f^q_σ(x_{t+1}|τ_t, a_t, s_f),
∇_{θ^p_ρ} D_KL[ f_σ(·|τ_t, a_t) || f^p_σ(·|τ_t, a_t) ] = -∫_{x_{t+1}} f_σ(x_{t+1}|τ_t, a_t) ∇_{θ^p_ρ} log f^p_σ(x_{t+1}|τ_t, a_t),   (21)

which can be estimated from the sample mean. As our estimated smoothed transitional distributions (17) converge to the true smoothed transitional distributions (16), we obtain f^q_σ(·|s_f, s_0)/f^p_σ(·|s_0) → p_ρ(·|s_f, s_0)/p_ρ(·|s_0), which leads to U^{VE}_{σ,1} → 0 for finite T̄ and σ ≪ d_min, where T̄ is the average length of the trajectories and d_min is the minimal distance between two different states (see Appendix D for details). Hence, for small enough σ and finite trajectory lengths, we can minimize U^{VE}_{σ,1} to nearly zero. This implies that if we smooth the original transitional distribution with a smaller σ, the smoothed transitions will be sharper and the ratio between the transitional probabilities will be estimated more accurately.

Once we have (π_q(·|s_f, s_0), f^q_σ(·|s_f, s_0), f^p_σ(·|s_0)) ≈ (p_π(·|s_f, s_0), f_σ(·|s_f, s_0), f_σ(·|s_0)) for all s_f after the updates (20) and (21), we have I^{VE}_σ ≈ I. Now we can update the policy by obtaining ∇_{θ^p_π} I^{VE}_σ using (3):

∇_{θ^p_π} I = ∑_Ω p(Ω|s_0) [ log ( p_π(Ω|s_{f|Ω}, s_0) / π_p(Ω|s_0) ) + log ( p_ρ(Ω|s_{f|Ω}, s_0) / p_ρ(Ω|s_0) ) ] ∇_{θ^p_π} log π_p(Ω|s_0),
∇_{θ^p_π} I^{VE}_σ = ∑_Ω p(Ω|s_0) [ log ( π_q(Ω|s_{f|Ω}, s_0) / π_p(Ω|s_0) ) + log ( f^q_σ(Ω|s_{f|Ω}, s_0) / f^p_σ(Ω|s_0) ) ] ∇_{θ^p_π} log π_p(Ω|s_0),   (22)

where the bracketed term in the first line is again r^I_Ω, the bracketed term in the second line is the estimated intrinsic reward r^{VE_σ}_Ω used in Algorithm 2, and ∇_{θ^p_π} I is rewritten from (15). From (22), we see that ∇_{θ^p_π} I^{VE}_σ → ∇_{θ^p_π} I as π_q(·|s_f, s_0) → p_π(·|s_f, s_0) and f^q_σ(·|s_f, s_0)/f^p_σ(·|s_0) → p_ρ(·|s_f, s_0)/p_ρ(·|s_0) for all s_f, which is achieved by the updates (20) and (21) as explained previously. Hence, we can estimate the correct gradient of the mutual information with respect to the policy using the estimated smoothed transitional distributions, for finite T̄ and σ ≪ d_min.
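The policy update implied by (22) is a REINFORCE-style update in which the corrected intrinsic reward is treated as a constant. The sketch below builds a surrogate loss from per-step log-probabilities; the function name, the optional baseline argument and the dummy inputs are illustrative assumptions, not the paper's implementation.

```python
import torch

def policy_loss(logp_pi_p, logp_pi_q, logp_f_q, logp_f_p, baseline=0.0):
    """Surrogate loss whose gradient is a sample estimate of -(22).

    Per-step log-probabilities for one sampled option Omega:
        logp_pi_p[t] = log pi_p(a_t | tau_t)                    (behavior policy)
        logp_pi_q[t] = log pi_q(a_t | tau_t, s_f)               (inference model)
        logp_f_q[t]  = log f^q_sigma(s_{t+1} | tau_t, a_t, s_f)
        logp_f_p[t]  = log f^p_sigma(s_{t+1} | tau_t, a_t)
    The intrinsic reward r^{VE_sigma}_Omega is detached, so only the policy
    receives a gradient; `baseline` plays the role of b(s_0) and only reduces
    variance.
    """
    r = (logp_pi_q - logp_pi_p + logp_f_q - logp_f_p).sum().detach()
    return -(r - baseline) * logp_pi_p.sum()

# Toy usage: only the first stand-in (the policy log-probs) requires gradients.
T = 4
logs = [torch.log(torch.rand(T, requires_grad=(i == 0)).clamp(min=1e-3))
        for i in range(4)]
policy_loss(*logs).backward()
```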
However, choosing an appropriate value of σ is not straightforward, since d_min and T̄ are usually unknown and depend on the environment. Besides, a too small σ makes the training of f^q_σ and f^p_σ unstable due to the extreme gradients of the Gaussian function near its mean. Another issue of the GMM is the choice of a proper n_gmm in (17). As explained previously, we can perfectly fit (17) to (16) for n_gmm > max_{τ_t, a_t} |S(τ_t, a_t)|. We may choose a very large n_gmm for a perfect fit, but this makes training harder due to its complexity. We leave the proper choice of σ and n_gmm as future work and use empirically chosen values (σ = 0.25 and n_gmm = 10) in this paper. Finally, we summarize our method in Algorithm 2. The steps added to the original implicit VIC are marked with (*).

Algorithm 2 Implicit VIC with Gaussian mixture model

Initialize s_0, η, T_train, T_smooth, θ^q_π, θ^q_ρ, θ^p_π and θ^p_ρ.
for i_train = 1 to T_train do
    Follow π_p(a_t|τ_t), resulting in Ω = (s_0, a_0, ..., s_f).
    r^{VE_σ}_Ω ← ∑_t [ log π_q(a_t|τ_t, s_f) - log π_p(a_t|τ_t) ]    from (22)
    r^{VE_σ}_Ω ← r^{VE_σ}_Ω + ∑_t [ log f^q_σ(s_{t+1}|τ_t, a_t, s_f) - log f^p_σ(s_{t+1}|τ_t, a_t) ]    from (22)  (*)
    Update each parameter:
        θ^p_π ← θ^p_π + η r^{VE_σ}_Ω ∇_{θ^p_π} ∑_t log π_p(a_t|τ_t)    from (22)
        θ^q_π ← θ^q_π + η ∇_{θ^q_π} ∑_t log π_q(a_t|τ_t, s_f)    from (20)
        Δθ^p_ρ ← 0  (*)
        Δθ^q_ρ ← 0  (*)
        for i_smooth = 1 to T_smooth do  (*)
            Sample (z_1, z_2, ..., z_f), z_i ~ N(0, σ²I_d)
            Δθ^p_ρ ← Δθ^p_ρ + η ∇_{θ^p_ρ} ∑_t log f^p_σ(s_{t+1} + z_{t+1}|τ_t, a_t)    from (21)
            Δθ^q_ρ ← Δθ^q_ρ + η ∇_{θ^q_ρ} ∑_t log f^q_σ(s_{t+1} + z_{t+1}|τ_t, a_t, s_f)    from (21)
        end for
        θ^p_ρ ← θ^p_ρ + Δθ^p_ρ / T_smooth  (*)
        θ^q_ρ ← θ^q_ρ + Δθ^q_ρ / T_smooth  (*)
end for

5 EXPERIMENTS

In this section, we evaluate implicit VIC (Gregor et al., 2016), Algorithm 1 and Algorithm 2. We use an LSTM (Hochreiter & Schmidhuber, 1997) to encode τ_t = (s_0, a_0, ..., s_t) into a vector. We conduct experiments in both deterministic and stochastic environments and evaluate the results by measuring the mutual information I from samples. To measure I, we rewrite (1) using (3) as follows:

I(Ω, s_f|s_0) = -∑_{s_f} p(s_f|s_0) log p(s_f|s_0) + ∑_{Ω,s_f} p(Ω|s_0) p(s_f|Ω, s_0) log p(s_f|Ω, s_0) = -∑_{s_f} p(s_f|s_0) log p(s_f|s_0),

where the second term vanishes by (3), and which is maximized when s_f is distributed uniformly. We estimate Î using the distribution of s_f from the samples, i.e., p̂(s_f|s_0). We apply an exponential moving average (with smoothing factor 0.99) to an average of 5 repetitions for estimating p̂(s_f|s_0). We manually set T_max (the maximum length of a trajectory) for each experiment such that the termination action is the only available action at the T_max-th step. For training the GMM of Algorithm 2, the transitional models are trained to predict the distribution of Δx_{t+1} = x_{t+1} - s_t instead of x_{t+1}, since predicting the difference is usually easier than predicting the whole state. For the training, we have a warm-up phase which trains the baseline function b(s_0) from Gregor et al. (2016) and the transitional models. After the warm-up phase, we update the baseline function, the policy and the transitional models simultaneously. Please see Appendix F.1 for details on the hyper-parameter settings.
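The empowerment reported in our figures is estimated purely from the empirical distribution of final states. The sketch below shows one way to do this; for simplicity it applies the exponential moving average to the entropy estimate itself, whereas the experiments smooth the estimate of p̂(s_f|s_0), so the smoothing target and the toy inputs here are illustrative.

```python
import numpy as np
from collections import Counter

def estimate_empowerment(final_states, prev=None, alpha=0.99):
    """I_hat = -sum_sf p_hat(s_f|s_0) log p_hat(s_f|s_0), estimated from a batch
    of sampled final states and optionally smoothed with an EMA (factor alpha)."""
    counts = Counter(final_states)
    p_hat = np.array(list(counts.values()), dtype=float) / len(final_states)
    i_hat = float(-(p_hat * np.log(p_hat)).sum())
    return i_hat if prev is None else alpha * prev + (1.0 - alpha) * i_hat

# Toy usage: five trajectories ending in three distinct final states.
print(estimate_empowerment([(0, 0), (1, 0), (1, 0), (2, 0), (0, 0)]))
```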
(a) Deterministic 1D world. (b) Deterministic 2D world. (c) Deterministic tree world.
Figure 1: Deterministic environments. Fig. 1a shows the deterministic 1D world; the agent can go left and right. Fig. 1b shows the deterministic 2D world; the agent can go left, up, right and down. Fig. 1c shows the deterministic tree world; the agent can go left and right.

We compare the algorithms in the deterministic environments of Fig. 1. Note that although (8) is zero for deterministic environments, we still train the transitional models of Algorithms 1 and 2 to show that they converge to the optimum. Fig. 2 shows that all three algorithms rapidly achieve the maximal empowerment. We can observe that all the states in the environments are visited uniformly after training, which is achieved when the mutual information is maximized. Fig. 3 shows the training results of implicit VIC and Algorithm 2 in the 25x25 grid world with 4 rooms used in Gregor et al. (2016). Both implicit VIC and our Algorithm 2 successfully learn to pass the narrow doors between rooms and visit almost all reachable states for the given T_max = 25. Additional results in MuJoCo (Todorov et al., 2012), showing the applicability of our Algorithm 2, are given in Appendix E.

(a) Deterministic 1D world with T_max = 5. (b) Deterministic 2D world with T_max = 5. (c) Deterministic tree world with T_max = 4.
Figure 2: Estimated empowerment during training in the deterministic environments. Fig. 2a shows the deterministic 1D world and its training results; the agent can go left and right. Fig. 2b shows the deterministic 2D world and its training results; the agent can go left, up, right and down. Fig. 2c shows the deterministic tree world and its training results; the agent can go left and right. Green shows the distribution of s_f.

Figure 3: Deterministic grid world with 4 rooms. The environment is a 25x25 grid world with 4 rooms. The agent can go left, up, right and down. The agent starts from (4, 4) and (10, 4) with T_max = 25. Green shows the distribution of s_f.

We then compare the algorithms in stochastic environments. Please see Appendix F.2 for details on the stochasticity of the environments. Fig. 4 shows the results in the simple stochastic environments. We see that while implicit VIC converges to a sub-optimum, our two algorithms achieve the maximal empowerment. In Fig. 5, implicit VIC fails to reach far rooms, whereas our Algorithm 2 reaches every room in the environment. From Fig. 4 and 5, we notice that implicit VIC fails to reach far states under stochastic dynamics. This happens because the variational bias of implicit VIC accumulates over the trajectory, i.e., it grows as the length of the trajectory increases. Fig. 5 also shows the results of training with external rewards. The same mixed reward as in Gregor et al. (2016) is used, r = r^I + α r^E with α = 30. For training the random agent, only α r^E is used and an entropy loss is applied for exploration. While the random agent and implicit VIC converge to sub-optima, our Algorithm 2 achieves the optimal solution.

(a) Stochastic 1D world with T_max = 5. Algorithms 1 and 2 achieve a 196% empowerment gain compared to implicit VIC. (b) Stochastic 2D world with T_max = 5. Algorithms 1 and 2 achieve a 142% empowerment gain compared to implicit VIC. (c) Stochastic tree world with T_max = 4. Algorithms 1 and 2 achieve a 406% empowerment gain compared to implicit VIC.
Figure 4: Estimated empowerment during training in the stochastic environments. The environments are the same as in Fig. 1 except for their stochasticity. Green shows the distribution of s_f.
Figure 5: Stochastic grid world with 35 rooms. The environment is a 15x15 grid world with 35 rooms (black cells surrounded by gray walls). We set T_max = 25. The agent can go up, down and right. It starts from (0, 6). Once the agent enters a room, the environment returns done and only the termination action is available. Green shows the distribution of s_f. The external reward is composed of -0.1 for every time step, +1 for entering a normal room and +100 for entering the special room. The sub-optimal solution is to reach the closest room. The optimal solution is to try to reach the special room (the room with the red box), while entering the closest normal room as soon as possible when reaching the special room becomes impossible due to the stochastic transitions.

6 CONCLUSION

In this work, we revisited variational intrinsic control (VIC) proposed by Gregor et al. (2016). We showed that for VIC with implicit options, the environmental stochasticity induces a variational bias in the intrinsic reward, leading to convergence to sub-optima. To reduce this bias and achieve maximal empowerment, we proposed to model the environment dynamics using either a transitional probability model or a Gaussian mixture model. Evaluations on stochastic environments demonstrated the superiority of our methods over the original VIC algorithm with implicit options.

ACKNOWLEDGMENTS

This research was conducted with the support of NC. We thank Seong Hun Lee at University of Zaragoza for his sincere feedback on our work, Yujeong Lee at KL Partners for her encouragement, and Seungeun Rho at Seoul National University, Jinyun Chung, Yongchan Park, Hyunsoo Park, Sangbin Moon, Inseok Oh, Seongho Son and Minkyu Yang at NC for their useful comments.

REFERENCES

Suguru Arimoto. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory, 18(1):14-20, 1972.

David Barber and Felix V. Agakov. The IM algorithm: A variational approach to information maximization. In Advances in Neural Information Processing Systems, 2003.

Richard Blahut. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460-473, 1972.

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2018.

Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Alexander S. Klyubin, Daniel Polani, and Chrystopher L. Nehaniv. Empowerment: A universal agent-centric measure of control. In 2005 IEEE Congress on Evolutionary Computation, volume 1, pp. 128-135. IEEE, 2005.

Marlos C. Machado, Marc G. Bellemare, and Michael Bowling. A Laplacian framework for option discovery in reinforcement learning. In International Conference on Machine Learning, pp. 2295-2304, 2017.

Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2125-2133, 2015.

Karl Pearson. III. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London (A), 185:71-110, 1894.

Christoph Salge, Cornelius Glackin, and Daniel Polani. Empowerment: An introduction. In Guided Self-Organization: Inception, pp. 67-114. Springer, 2014.
Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033. IEEE, 2012.

Raymond W. Yeung. The Blahut-Arimoto algorithms. In Information Theory and Network Coding, pp. 211-228. Springer, 2008.

A DERIVATION OF INTRINSIC REWARD FROM MUTUAL INFORMATION

Here we derive the intrinsic reward of VIC by taking the gradient of (1) with respect to the policy parameters θ. We omit s_0 for simplicity. Note that p(Ω), p(s_f) and p(Ω, s_f) can be parameterized by θ since they are all determined by the policy. We start by rewriting (1):

I(Ω, s_f) = ∑_{Ω,s_f} p_θ(Ω, s_f) [ log p_θ(Ω, s_f) - log p_θ(Ω) p_θ(s_f) ].

Then, taking the gradient with respect to θ, we obtain

∇_θ I(Ω, s_f) = ∑_{Ω,s_f} p_θ(Ω, s_f) [ log p_θ(Ω, s_f) - log p_θ(Ω) p_θ(s_f) ] ∇_θ log p_θ(Ω, s_f)
              + ∑_{Ω,s_f} p_θ(Ω, s_f) [ ∇_θ p_θ(Ω, s_f) / p_θ(Ω, s_f) - ∇_θ p_θ(Ω) / p_θ(Ω) - ∇_θ p_θ(s_f) / p_θ(s_f) ]
            = ∑_{Ω,s_f} p_θ(Ω, s_f) [ log p_θ(Ω, s_f) - log p_θ(Ω) p_θ(s_f) ] ∇_θ log p_θ(Ω, s_f)
              + ∑_{Ω,s_f} [ ∇_θ p_θ(Ω, s_f) - p_θ(s_f|Ω) ∇_θ p_θ(Ω) - p_θ(Ω|s_f) ∇_θ p_θ(s_f) ].

Using ∑_{Ω,s_f} ∇_θ p_θ(Ω, s_f) = 0, ∑_{Ω,s_f} p_θ(s_f|Ω) ∇_θ p_θ(Ω) = 0, ∑_{Ω,s_f} p_θ(Ω|s_f) ∇_θ p_θ(s_f) = 0, and (3), we have

∇_θ I(Ω, s_f) = ∑_{Ω,s_f} p_θ(Ω, s_f) [ log p_θ(Ω, s_f) - log p_θ(Ω) p_θ(s_f) ] ∇_θ log p_θ(Ω, s_f) = ∑_Ω p_θ(Ω) r^I_Ω ∇_θ log p_θ(Ω),   (23)

where r^I_Ω = log p_θ(Ω|s_{f|Ω}) - log p_θ(Ω). Similarly, we can obtain

∇_θ I^{VE}(Ω, s_f) = ∑_{Ω,s_f} p_θ(Ω, s_f) [ log q(Ω|s_f) - log p_θ(Ω) ] ∇_θ log p_θ(Ω, s_f) = ∑_Ω p_θ(Ω) r^{VE}_Ω ∇_θ log p_θ(Ω),

where r^{VE}_Ω = log q(Ω|s_{f|Ω}) - log p_θ(Ω). Using (4) and (5), we can rewrite r^I_Ω as

r^I_Ω = ∑_{(τ_t, a_t, s_{t+1}) ∈ Ω} log [ p_θ(a_t|τ_t, s_{f|Ω}) p_θ(s_{t+1}|τ_t, a_t, s_{f|Ω}) / ( p_θ(a_t|τ_t) p(s_{t+1}|τ_t, a_t) ) ].

Since p_θ(a_t|τ_t, s_{f|Ω}) is intractable, we may replace it with the variational inference q_φ(a_t|τ_t, s_{f|Ω}), which results in

r^{VE}_Ω = ∑_{(τ_t, a_t, s_{t+1}) ∈ Ω} log [ q_φ(a_t|τ_t, s_{f|Ω}) p_θ(s_{t+1}|τ_t, a_t, s_{f|Ω}) / ( p_θ(a_t|τ_t) p(s_{t+1}|τ_t, a_t) ) ].

For deterministic environments, we have p_θ(s_{t+1}|τ_t, a_t, s_{f|Ω}) = p(s_{t+1}|τ_t, a_t) = 1, and both r^I_Ω and r^{VE}_Ω reduce to

r^I_Ω = ∑_{(τ_t, a_t, s_{t+1}) ∈ Ω} log [ p_θ(a_t|τ_t, s_{f|Ω}) / p_θ(a_t|τ_t) ],
r^{VE}_Ω = ∑_{(τ_t, a_t, s_{t+1}) ∈ Ω} log [ q_φ(a_t|τ_t, s_{f|Ω}) / p_θ(a_t|τ_t) ] = r^{VB}_Ω.

B DERIVATION OF U^{VE}

Here we derive U^{VE} from |I - I^{VE}|, with s_0 omitted for simplicity:

|I - I^{VE}| = | ∑_{Ω,s_f} p(Ω, s_f) [ log ( p_π(Ω|s_f) p_ρ(Ω|s_f) / ( π_q(Ω|s_f) ρ_q(Ω|s_f) ) ) - log ( p_π(Ω) p_ρ(Ω) / ( π_p(Ω) ρ_p(Ω) ) ) ] |
  ≤ | ∑_{Ω,s_f} p(Ω, s_f) log ( p_π(Ω|s_f) p_ρ(Ω|s_f) / ( π_q(Ω|s_f) ρ_q(Ω|s_f) ) ) | + | ∑_{Ω,s_f} p(Ω, s_f) log ( p_π(Ω) p_ρ(Ω) / ( π_p(Ω) ρ_p(Ω) ) ) |
  = | ∑_{Ω,s_f} p(s_f) p(Ω|s_f) log ( p_π(Ω|s_f) p_ρ(Ω|s_f) / ( π_q(Ω|s_f) ρ_q(Ω|s_f) ) ) | + | ∑_{Ω,s_f} p(Ω) p(s_f|Ω) log ( p_π(Ω) p_ρ(Ω) / ( π_p(Ω) ρ_p(Ω) ) ) |.

Using (3), (9) and (10), we obtain

|I - I^{VE}| ≤ | ∑_{Ω,s_f} p(s_f) p(Ω|s_f) log ( p_π(Ω|s_f) p_ρ(Ω|s_f) / ( π_q(Ω|s_f) ρ_q(Ω|s_f) ) ) | + | ∑_Ω p(Ω) log ( p_π(Ω) p_ρ(Ω) / ( π_p(Ω) ρ_p(Ω) ) ) |
  ≤ ∑_{s_f} p(s_f) D_KL( p_π(·|s_f) p_ρ(·|s_f) || π_q(·|s_f) ρ_q(·|s_f) ) + D_KL( p_π(·) p_ρ(·) || π_p(·) ρ_p(·) ) = U^{VE}.

C DERIVATION OF U^{VE}_{σ,1} AND U^{VE}_{σ,2}

Here we derive (19). We again omit s_0 for simplicity and start from |I - I^{VE}_σ|:

|I - I^{VE}_σ| = | ∑_{Ω,s_f} p(Ω, s_f) [ log ( p_π(Ω|s_f) p_ρ(Ω|s_f) / ( π_q(Ω|s_f) f^q_σ(Ω|s_f) ) ) - log ( p_π(Ω) p_ρ(Ω) / ( π_p(Ω) f^p_σ(Ω) ) ) ] |
  = | ∑_{Ω,s_f} p(Ω, s_f) [ log ( p_π(Ω|s_f) p_ρ(Ω|s_f) / ( π_q(Ω|s_f) f^q_σ(Ω|s_f) ) ) - log ( p_ρ(Ω) / f^p_σ(Ω) ) ] |      (since π_p(Ω) = p_π(Ω))
  = | ∑_{Ω,s_f} p(Ω, s_f) [ log ( p_ρ(Ω|s_f) f^p_σ(Ω) / ( p_ρ(Ω) f^q_σ(Ω|s_f) ) ) + log ( p_π(Ω|s_f) p_ρ(Ω|s_f) / ( π_q(Ω|s_f) p_ρ(Ω|s_f) ) ) ] |
  ≤ | ∑_{Ω,s_f} p(Ω, s_f) log ( p_ρ(Ω|s_f) f^p_σ(Ω) / ( p_ρ(Ω) f^q_σ(Ω|s_f) ) ) | + | ∑_{Ω,s_f} p(Ω, s_f) log ( p_π(Ω|s_f) p_ρ(Ω|s_f) / ( π_q(Ω|s_f) p_ρ(Ω|s_f) ) ) |
  ≤ ∑_{Ω,s_f} p(Ω, s_f) | log ( p_ρ(Ω|s_f) f^p_σ(Ω) / ( p_ρ(Ω) f^q_σ(Ω|s_f) ) ) | + ∑_{s_f} p(s_f) ∑_Ω p(Ω|s_f) log ( p_π(Ω|s_f) p_ρ(Ω|s_f) / ( π_q(Ω|s_f) p_ρ(Ω|s_f) ) )
  = U^{VE}_{σ,1} + U^{VE}_{σ,2}.
D DERIVATION OF THE BOUNDS ON I - I_σ

Here we derive the bounds on the estimation error of the mutual information with smoothing. First, we derive an upper bound on f_σ(s_{t+1}|τ_t, a_t):

√((2πσ²)^d) f_σ(s_{t+1}|τ_t, a_t) = ∑_{s' ∈ S(τ_t, a_t)} p(s'|τ_t, a_t) exp( -||s_{t+1} - s'||² / (2σ²) )
  = p(s_{t+1}|τ_t, a_t) + ∑_{s' ≠ s_{t+1}, s' ∈ S(τ_t, a_t)} p(s'|τ_t, a_t) exp( -||s_{t+1} - s'||² / (2σ²) )
  ≤ p(s_{t+1}|τ_t, a_t) + ∑_{s' ≠ s_{t+1}, s' ∈ S(τ_t, a_t)} p(s'|τ_t, a_t) exp( -d_min² / (2σ²) )
  = p(s_{t+1}|τ_t, a_t) + (1 - p(s_{t+1}|τ_t, a_t)) exp( -d_min² / (2σ²) )
  ≤ p(s_{t+1}|τ_t, a_t) + exp( -d_min² / (2σ²) ).

Obviously, we also have p(s_{t+1}|τ_t, a_t) ≤ √((2πσ²)^d) f_σ(s_{t+1}|τ_t, a_t), which results in

p(s_{t+1}|τ_t, a_t) ≤ √((2πσ²)^d) f_σ(s_{t+1}|τ_t, a_t) ≤ p(s_{t+1}|τ_t, a_t) + exp( -d_min² / (2σ²) ).   (24)

Similarly, we can obtain the bounds on f_σ(s_{t+1}|τ_t, a_t, s_f) as follows:

p(s_{t+1}|τ_t, a_t, s_f) ≤ √((2πσ²)^d) f_σ(s_{t+1}|τ_t, a_t, s_f) ≤ p(s_{t+1}|τ_t, a_t, s_f) + exp( -d_min² / (2σ²) ).   (25)

Combining (24) and (25), we obtain

p(s_{t+1}|τ_t, a_t, s_f) / [ p(s_{t+1}|τ_t, a_t) + exp( -d_min² / (2σ²) ) ] ≤ f_σ(s_{t+1}|τ_t, a_t, s_f) / f_σ(s_{t+1}|τ_t, a_t) ≤ [ p(s_{t+1}|τ_t, a_t, s_f) + exp( -d_min² / (2σ²) ) ] / p(s_{t+1}|τ_t, a_t).   (26)

Taking the log of (26) and using log(a + b) ≤ log a + b/a for a, b > 0, we have

log [ p(s_{t+1}|τ_t, a_t, s_f) / p(s_{t+1}|τ_t, a_t) ] - (1/p_min) exp( -d_min² / (2σ²) ) ≤ log [ f_σ(s_{t+1}|τ_t, a_t, s_f) / f_σ(s_{t+1}|τ_t, a_t) ] ≤ log [ p(s_{t+1}|τ_t, a_t, s_f) / p(s_{t+1}|τ_t, a_t) ] + (1/p_min,f) exp( -d_min² / (2σ²) ),

which results in

-(1/p_min,f) exp( -d_min² / (2σ²) ) ≤ log [ p(s_{t+1}|τ_t, a_t, s_f) / p(s_{t+1}|τ_t, a_t) ] - log [ f_σ(s_{t+1}|τ_t, a_t, s_f) / f_σ(s_{t+1}|τ_t, a_t) ] ≤ (1/p_min) exp( -d_min² / (2σ²) ),   (27)

where p_min and p_min,f denote the minimal nonzero transition probabilities without and with conditioning on s_f, respectively. Using (9) and (27), we obtain

-(T_Ω/p_min,f) exp( -d_min² / (2σ²) ) ≤ log [ p_ρ(Ω|s_f, s_0) / p_ρ(Ω|s_0) ] - log [ f_σ(Ω|s_f, s_0) / f_σ(Ω|s_0) ] ≤ (T_Ω/p_min) exp( -d_min² / (2σ²) ).

By taking the expectation of each side, we have

-(T_max/p_min,f) exp( -d_min² / (2σ²) ) ≤ -(T̄/p_min,f) exp( -d_min² / (2σ²) ) ≤ I - I_σ ≤ (T̄/p_min) exp( -d_min² / (2σ²) ) ≤ (T_max/p_min) exp( -d_min² / (2σ²) ),

with T_max = max_Ω T_Ω and T̄ = ∑_{Ω,s_f} p(Ω, s_f|s_0) T_Ω. This implies that the estimation error of the mutual information with smoothing converges to 0 for finite T_max or T̄ as σ → 0. Also, for finite T̄ and σ ≪ d_min, we have |I - I_σ| ≈ 0.

E ADDITIONAL EXPERIMENT RESULTS

Here we show additional experiment results on HalfCheetah-v3 in the MuJoCo environments (Todorov et al., 2012) to show the applicability of our Algorithm 2. We expect that both implicit VIC and our Algorithm 2 will show results similar to Fig. 1 and Fig. 2, since it is a deterministic environment. We fix the length of the trajectory to T = 100 to force the agent to learn long enough trajectories. The action space of each motor is quantized into 5 actions. We can observe that exciting movements (especially triple backflips) are learned by Algorithm 2. Another exciting fact is that, since the number of options it can learn is unlimited, it keeps showing new behaviors as learning goes on.

Figure 6: Learned behaviors by Algorithm 2 in HalfCheetah-v3. A question mark is written when the result of the flip is unknown. Various behaviors such as forward flips, backflips, etc. are learned by Algorithm 2.

F EXPERIMENTAL DETAILS

Here we specify the experimental details and the environment details.

F.1 HYPER-PARAMETERS

Here we show the hyper-parameters used in our experiments. We used a learning rate of 1e-3 for Fig. 2 and Fig. 4 and 1e-4 otherwise.

Table 1: Hyper-parameters used for the experiments.

Hyper-parameter         Value
Optimizer               Adam
Learning rate           1e-3, 1e-4
Betas                   (0.9, 0.999)
Weight initialization   Gaussian with std. 0.1 and mean 0
Batch size              128
T_smooth                128
σ (GMM)                 0.25
n_gmm (GMM)             10

F.2 STOCHASTIC ENVIRONMENT DETAILS

Here we specify the transition tables used in this paper.
Table 2: Transition tables of the stochastic environments. Each row is the action chosen by the agent; each column is the resulting transition and its probability.

(a) Transition table of the stochastic 1D world.

Action      go left   go right
go left     0.7       0.3
go right    0.3       0.7

(b) Transition table of the stochastic 2D world.

Action      go left   go up   go right   go down
go left     0.7       0.1     0.1        0.1
go up       0.1       0.7     0.1        0.1
go right    0.1       0.1     0.7        0.1
go down     0.1       0.1     0.1        0.7

(c) Transition table of the stochastic tree world.

Action      go left   go right   no move
go left     0.8       0.0        0.2
go right    0.6       0.2        0.2

(d) Transition table of the stochastic grid world with 35 rooms.

Action      go up   go down   go right   no move
go up       0.7     0.0       0.0        0.3
go down     0.0     0.7       0.0        0.3
go right    0.0     0.0       0.7        0.3
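As a concrete illustration of Table 2(a), the following is a minimal Python sketch of the stochastic 1D world. The grid size, start position and the handling of T_max (forcing termination at the last step) are illustrative assumptions that are not specified by the table.

```python
import random

class Stochastic1DWorld:
    """Stochastic 1D world of Table 2(a): the chosen direction is followed with
    probability 0.7 and inverted with probability 0.3."""
    ACTIONS = ("go_left", "go_right", "terminate")

    def __init__(self, size=11, t_max=5):
        self.size, self.t_max = size, t_max
        self.reset()

    def reset(self):
        self.pos, self.t = self.size // 2, 0
        return self.pos

    def step(self, action):
        self.t += 1
        if action == "terminate" or self.t >= self.t_max:
            return self.pos, True          # the final state s_f is the current state
        move = -1 if action == "go_left" else 1
        if random.random() < 0.3:          # flip the direction with probability 0.3
            move = -move
        self.pos = min(max(self.pos + move, 0), self.size - 1)
        return self.pos, False

env = Stochastic1DWorld()
s, done = env.reset(), False
while not done:
    s, done = env.step(random.choice(env.ACTIONS))
print("s_f =", s)
```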