# Compress and Control

Joel Veness, Marc G. Bellemare, Marcus Hutter, Alvin Chua, Guillaume Desjardins
Google DeepMind, Australian National University
{veness,bellemare,alschua,gdesjardins}@google.com, marcus.hutter@anu.edu.au

This paper describes a new information-theoretic policy evaluation technique for reinforcement learning. This technique converts any compression or density model into a corresponding estimate of value. Under appropriate stationarity and ergodicity conditions, we show that the use of a sufficiently powerful model gives rise to a consistent value function estimator. We also study the behavior of this technique when applied to various Atari 2600 video games, where the use of suboptimal modeling techniques is unavoidable. We consider three fundamentally different models, all too limited to perfectly model the dynamics of the system. Remarkably, we find that our technique provides sufficiently accurate value estimates for effective on-policy control. We conclude with a suggestive study highlighting the potential of our technique to scale to large problems.

## 1 Introduction

In recent years, a number of information-theoretic approaches have emerged as practical alternatives to traditional machine learning algorithms. Noteworthy examples include the compression-based approaches of Frank, Chui, and Witten (2000) and Bratko et al. (2006) to classification, and of Cilibrasi and Vitányi (2005) to clustering. What differentiates these techniques from more traditional machine learning approaches is that they rely on the ability to compress the raw input, rather than on combining or learning features relevant to the task at hand. This family of techniques has therefore proven most successful in situations where the nature of the data makes it unwieldy to specify or learn appropriate features. This class of methods can be formally justified by appealing to various notions within algorithmic information theory, such as Kolmogorov complexity (Li and Vitányi 2008).

In this paper we show how a similarly inspired approach can be applied to reinforcement learning, or more specifically, to the tasks of policy evaluation and on-policy control. Policy evaluation refers to the task of estimating the value function associated with a given policy, for an arbitrary given environment. The performance of well-known reinforcement learning techniques such as policy iteration (Howard 1960), approximate dynamic programming (Bertsekas and Tsitsiklis 1996; Powell 2011) and actor-critic methods (Sutton and Barto 1998) crucially depends on how well policy evaluation can be performed. In this paper we introduce a model-based approach to policy evaluation, which transforms the task of estimating a value function into that of learning a particular kind of probabilistic state model.

To better put our work into context, it is worth making the distinction between two fundamentally different classes of model-based reinforcement learning methods. Simulation-based techniques involve learning some kind of forward model of the environment from which future samples can be generated. Given access to such models, planning can be performed directly using search. Noteworthy recent examples include the work of Doshi-Velez (2009), Walsh, Goschin, and Littman (2010), Veness et al. (2010), Veness et al. (2011),
Asmuth and Littman (2011), Guez, Silver, and Dayan (2012), Hamilton, Fard, and Pineau (2013) and Tziortziotis, Dimitrakakis, and Blekas (2014). Although the aforementioned works demonstrate quite impressive performance on small domains possessing complicated dynamics, scaling these methods to large state or observation spaces has proven challenging. The main difficulty that arises when using learnt forward models is that modeling errors tend to compound when reasoning over long time horizons (Talvitie 2014).

In contrast, another family of techniques, referred to in the literature as planning as inference, attempts to side-step the need to perform accurate simulations by reducing the planning task to one of probabilistic inference within a generative model of the system. These ideas have recently been explored in both the neuroscience (Botvinick and Toussaint 2012; Solway and Botvinick 2012) and machine learning (Attias 2003; Poupart, Lang, and Toussaint 2011) literature. The experimental results to date have been somewhat inconclusive, making it far from clear whether the transformed problem is any easier to solve in practice.

Our main contribution in this paper is to show how to set up a particularly tractable form of inference problem by generalizing compression-based classification to reinforcement learning. The key novelty is to focus the modeling effort on learning the stationary distribution of a particular kind of augmented Markov chain describing the system, from which we can approximate a type of dual representation (Wang, Bowling, and Schuurmans 2007; Wang et al. 2008) of the value function. Using this technique, we were able to produce effective controllers on a problem domain orders of magnitude larger than what has previously been addressed with simulation-based methods.

## 2 Background

We start with a brief overview of the parts of reinforcement learning and information theory needed to describe our work, before reviewing compression-based classification.

### 2.1 Markov Decision Processes

A Markov Decision Process (MDP) is a type of probabilistic model widely used within reinforcement learning (Sutton and Barto 1998; Szepesvári 2010) and control (Bertsekas and Tsitsiklis 1996). In this work, we limit our attention to finite-horizon, time-homogeneous MDPs whose action and state spaces are finite. Formally, an MDP is a triplet $(\mathcal{S}, \mathcal{A}, \mu)$, where $\mathcal{S}$ is a finite, non-empty set of states, $\mathcal{A}$ is a finite, non-empty set of actions, and $\mu$ is the transition probability kernel that assigns to each state-action pair $(s, a) \in \mathcal{S} \times \mathcal{A}$ a probability measure $\mu(\cdot \mid s, a)$ over $\mathcal{S} \times \mathbb{R}$. $\mathcal{S}$ and $\mathcal{A}$ are known as the state space and action space respectively. The transition probability kernel gives rise to the state transition kernel $\mathcal{P}(s' \mid s, a) := \mu(\{s'\} \times \mathbb{R} \mid s, a)$, which gives the probability of transitioning from state $s$ to state $s'$ if action $a$ is taken in $s$.

An agent's behavior is determined by a policy that defines, for each state $s \in \mathcal{S}$ and time $t \in \mathbb{N}$, a probability measure over $\mathcal{A}$ denoted by $\pi_t(\cdot \mid s)$. A stationary policy is a policy that is independent of time, which we will denote by $\pi(\cdot \mid s)$ where appropriate. At each time $t$, the agent communicates an action $A_t \sim \pi_t(\cdot \mid S_{t-1})$ to the system in state $S_{t-1} \in \mathcal{S}$. The system then responds with a state-reward pair $(S_t, R_t) \sim \mu(\cdot \mid S_{t-1}, A_t)$.
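To make this interaction protocol concrete, here is a minimal Python sketch (our own illustration, not from the paper) of the sampling loop just described, for a hypothetical two-state, two-action MDP; the transition kernel, policy and all numbers are made up for the example.

```python
# Minimal sketch: sample A_t ~ pi(.|S_{t-1}) and (S_t, R_t) ~ mu(.|S_{t-1}, A_t)
# for a toy finite MDP. All names and probabilities are illustrative only.
import random

# Hypothetical kernel mu: maps (state, action) to ((next_state, reward), probability) pairs.
mu = {
    (0, 0): [((0, 0.0), 0.9), ((1, 1.0), 0.1)],
    (0, 1): [((1, 1.0), 0.8), ((0, 0.0), 0.2)],
    (1, 0): [((1, 0.0), 0.5), ((0, 1.0), 0.5)],
    (1, 1): [((0, 1.0), 1.0)],
}

# A stationary stochastic policy pi(a | s), also illustrative.
pi = {0: [0.5, 0.5], 1: [0.9, 0.1]}

def sample(pairs):
    """Draw an outcome from a list of (outcome, probability) pairs."""
    outcomes, probs = zip(*pairs)
    return random.choices(outcomes, weights=probs, k=1)[0]

def rollout(s0, t):
    """Generate the sequence A_1, S_1, R_1, ..., A_t, S_t, R_t starting from s0."""
    s, history = s0, []
    for _ in range(t):
        a = random.choices([0, 1], weights=pi[s], k=1)[0]  # A_t ~ pi(.|S_{t-1})
        s, r = sample(mu[(s, a)])                          # (S_t, R_t) ~ mu(.|S_{t-1}, A_t)
        history.append((a, s, r))
    return history

print(rollout(s0=0, t=5))
```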
Here we will assume that each reward is bounded, lying in $[r_{\min}, r_{\max}] \subset \mathbb{R}$, and that the system starts in a state $s_0$ and executes for an infinite number of steps. Thus the execution of the system can be described by a sequence of random variables $A_1, S_1, R_1, A_2, S_2, R_2, \ldots$ The finite $m$-horizon return from time $t$ is defined as $Z_t := \sum_{i=t}^{t+m-1} R_i$. The expected $m$-horizon return from time $t$, also known as the value function, is denoted by $V^\pi(s_t) := \mathbb{E}[Z_{t+1} \mid S_t = s_t]$. The return space $\mathcal{Z}$ is the set of all possible returns. The action-value function is defined by $Q^\pi(s_t, a_{t+1}) := \mathbb{E}[Z_{t+1} \mid S_t = s_t, A_{t+1} = a_{t+1}]$. An optimal policy, denoted by $\pi^*$, is a policy that maximizes the expected return $\mathbb{E}[Z_{t+1} \mid S_t]$ for all $t$; in our setting, a state-dependent deterministic optimal policy always exists.

### 2.2 Compression and Sequential Prediction

We now review sequential probabilistic prediction in the context of statistical data compression. An alphabet $\mathcal{X}$ is a set of symbols. A string of data $x_1 x_2 \ldots x_n \in \mathcal{X}^n$ of length $n$ is denoted by $x_{1:n}$. The prefix $x_{1:j}$ of $x_{1:n}$, $j \le n$, is denoted by $x_{\le j}$ or $x_{<j+1}$. A coding distribution $\rho$ assigns a probability $\rho(x_{1:n})$ to each string $x_{1:n}$, with the familiar chain rule $\rho(x_{1:n}) = \prod_{i=1}^n \rho(x_i \mid x_{<i})$.

A homogeneous Markov chain (HMC) $\{X_t\}_{t \ge 1}$ over a state space $\mathcal{X}$ is said to be:

- (AP) aperiodic iff $\gcd\{n \ge 1 : \mathbb{P}[X_n = x \mid X_0 = x] > 0\} = 1$ for all $x \in \mathcal{X}$;
- (PR) positive recurrent iff $\mathbb{E}[\min\{n \ge 1 : X_n = x\} \mid X_0 = x] < \infty$ for all $x \in \mathcal{X}$;
- (IR) irreducible iff for all $x, x'$ there exists $n \ge 1$ such that $\mathbb{P}[X_n = x' \mid X_0 = x] > 0$;
- (EA) essentially aperiodic iff $\gcd\{n \ge 1 : \mathbb{P}[X_n = x \mid X_0 = x] > 0\} \in \{1, \infty\}$ for all $x \in \mathcal{X}$.

Note also that EA+IR implies AP. Although the term ergodic is sometimes used to describe particular combinations of these properties (e.g. AP+PR+IR), here we avoid it in favor of being more explicit.

**Lemma 1.** Consider a stochastic process $\{X_t\}_{t \ge 1}$ over state space $\mathcal{X}$ that is independent of a sequence of $\mathcal{U}$-valued random variables $\{U_t\}_{t \ge 1}$ in the sense that $\mathbb{P}(x_t \mid x_{<t}, u_{<t}) = \mathbb{P}(x_t \mid x_{<t})$ for all $t$. If $\{X_t\}_{t \ge 1}$ is an (IR/EA/PR) HMC, then $\{(X_t, U_t)\}_{t \ge 1}$ is an (IR/EA/PR) HMC over $\{(x, u) \in \mathcal{X} \times \mathcal{U} : \mathbb{P}(u \mid x) > 0\}$.

Figure 1: Lemma 1 applied to $\{((A_t, S_t), R_t)\}_{t \ge 1}$.

Lemma 1 allows the HMC $\{X_t := (A_t, S_t)\}_{t \ge 1}$ to be augmented to obtain the HMC $\{Y_t := (X_t, R_t)\}_{t \ge 1}$, where $A_t$, $S_t$ and $R_t$ denote the action, state and reward at time $t$ respectively; see Figure 1 for a graphical depiction of the dependence structure. The second result allows the HMC $\{Y_t\}_{t \ge 1}$ to be further augmented to give the snake HMC $\{Y_{t:t+m}\}_{t \ge 1}$ (Brémaud 1999). This construction ensures that there is sufficient information within each augmented state to be able to condition on the $m$-horizon return.

**Lemma 2.** If $\{Y_t\}_{t \ge 1}$ is an (IR/EA/PR) HMC over state space $\mathcal{Y}$, then for any $m \in \mathbb{N}$, the stochastic process $\{W_t\}_{t \ge 1}$, where $W_t := (Y_t, \ldots, Y_{t+m})$, is an (IR/EA/PR) HMC over $\mathcal{W} := \{y_{0:m} \in \mathcal{Y}^{m+1} : \mathbb{P}(y_{1:m} \mid y_0) > 0\}$.

Now if we assume that the HMC defined by $\mathcal{M}$ and $\pi$ is (IR+EA+PR), Lemmas 1 and 2 imply that there exists a unique stationary distribution $\nu$ over the augmented state space $(\mathcal{A} \times \mathcal{S} \times \mathbb{R})^{m+1}$. Furthermore, if we let $(A'_0, S'_0, R'_0, \ldots, A'_m, S'_m, R'_m) \sim \nu$ and define $Z' := \sum_{i=1}^m R'_i$, it is clear that there exists a joint distribution $\nu$ over $\mathcal{Z} \times (\mathcal{A} \times \mathcal{S} \times \mathbb{R})^{m+1}$ such that $(Z', A'_0, S'_0, R'_0, \ldots, A'_m, S'_m, R'_m) \sim \nu$. Hence the $\nu$-probability $\mathbb{P}[Z' \mid S'_0, A'_1]$ is well defined, which allows us to express the action-value function $Q^\pi$ as

$$Q^\pi(s, a) = \mathbb{E}_\nu[Z' \mid S'_0 = s, A'_1 = a]. \tag{2}$$

Finally, by expanding the expectation and applying Bayes rule, Equation 2 can be further rewritten as

$$Q^\pi(s, a) = \sum_{z \in \mathcal{Z}} z\, \nu(z \mid s, a) = \sum_{z \in \mathcal{Z}} z\, \frac{\nu(s \mid z, a)\, \nu(z \mid a)}{\sum_{z' \in \mathcal{Z}} \nu(s \mid z', a)\, \nu(z' \mid a)}. \tag{3}$$

The CNC approach to policy evaluation involves directly learning the conditional distributions $\nu(s \mid z, a)$ and $\nu(z \mid a)$ in Equation 3 from data, and then using these learnt distributions to form a plug-in estimate of $Q^\pi(s, a)$.
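As a concrete illustration of the plug-in idea, the following Python sketch (our own, not the authors' code) evaluates Equation 3 given two learnt conditional models; the toy models `example_nu_s` and `example_nu_z` and their numbers are purely illustrative placeholders for whatever density estimators are actually used.

```python
# Illustrative sketch of the Equation 3 plug-in estimate, assuming we already
# have estimates of nu(s|z,a) and nu(z|a) as callables.

def q_estimate(s, a, returns, nu_s_given_za, nu_z_given_a):
    """Compute sum_z z * nu(z|s,a), with nu(z|s,a) obtained via Bayes rule."""
    # Unnormalised posterior weight of each return z given (s, a).
    weights = {z: nu_s_given_za(s, z, a) * nu_z_given_a(z, a) for z in returns}
    total = sum(weights.values())
    if total == 0.0:
        return 0.0  # degenerate case for the sketch: no return has support
    return sum(z * w for z, w in weights.items()) / total

def example_nu_s(s, z, a):
    # Made-up state model nu(s | z, a): states loosely "track" the return.
    return 0.7 if s == z else 0.3

def example_nu_z(z, a):
    # Made-up return model nu(z | a): uniform over the two possible returns.
    return 0.5

print(q_estimate(s=1, a=0, returns=[0, 1],
                 nu_s_given_za=example_nu_s, nu_z_given_a=example_nu_z))
```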
Notice that $\nu(s \mid z, a)$ conditions on the return, similar in spirit to prior work on planning as inference (Attias 2003; Botvinick and Toussaint 2012; Solway and Botvinick 2012). The distinguishing property of CNC is that the conditioning is performed with respect to a stationary distribution that has been explicitly constructed to allow for efficient modeling and inference.

### 3.3 Online Policy Evaluation

We now provide an online algorithm for compression-based policy evaluation. This will produce, for all times $t \in \mathbb{N}$, an estimate $\hat{Q}^\pi_t(s, a)$ of the $m$-horizon expected return $Q^\pi(s, a)$ as a function of the first $t - m$ action-observation-reward triples. Constructing our estimate involves modeling the $\nu$-probability terms in Equation 3 using two different coding distributions, $\rho_S$ and $\rho_Z$ respectively; $\rho_S$ will encode states conditional on return-action pairs, and $\rho_Z$ will encode returns conditional on actions. Sample states, actions and returns can be generated by directly executing the system $(\mathcal{M}, \pi)$; provided the HMC defined by $\mathcal{M}$ and $\pi$ is (IR+EA+PR), Lemmas 1 and 2 ensure that the empirical distributions formed from a sufficiently large sample of action/state/return triples will be arbitrarily close to the required conditional $\nu$-probabilities.

Next we describe how the coding distributions are trained. Given a history $h_t := s_0, a_1, s_1, r_1, \ldots, a_{n+m}, s_{n+m}, r_{n+m}$ with $t = n + m$, we define the $m$-lagged return at any time $i \le n + 1$ by $z_i := r_i + \cdots + r_{i+m-1}$. The sequence of the first $n$ states occurring in $h_t$ can be mapped to a subsequence, denoted by $s^{z,a}_{0:n-1}$, that is defined by keeping only the states $(s_i : z_{i+1} = z \wedge a_{i+1} = a)_{i=0}^{n-1}$. Similarly, a sequence of $m$-lagged returns $z_{1:n}$ can be mapped to a subsequence $z^a_{1:n}$ formed by keeping only the returns $(z_i : a_i = a)_{i=1}^{n}$ from $z_{1:n}$. Our value estimate at time $t$ of taking action $a$ in state $s$ can now be defined as

$$\hat{Q}^\pi_t(s, a) := \sum_{z \in \mathcal{Z}} z\, w^{z,a}_t(s), \tag{4}$$

where

$$w^{z,a}_t(s) := \frac{\rho_S(s \mid s^{z,a}_{0:n-1})\, \rho_Z(z \mid z^a_{1:n})}{\sum_{z' \in \mathcal{Z}} \rho_S(s \mid s^{z',a}_{0:n-1})\, \rho_Z(z' \mid z^a_{1:n})} \tag{5}$$

approximates the probability of receiving a return of $z$ if action $a$ is selected in state $s$.

**Implementation.** The action-value function estimate $\hat{Q}^\pi_t$ can be computed efficiently by maintaining $|\mathcal{Z}||\mathcal{A}|$ buckets, each corresponding to a particular return-action pair $(z, a)$. Each bucket contains an instance of the coding distribution $\rho_S$ encoding the state sequence $s^{z,a}_{0:n-1}$. Similarly, $|\mathcal{A}|$ buckets containing instances of $\rho_Z$ are created to encode the various return subsequences. This procedure is summarized in Algorithm 1. To obtain a particular state-action value estimate, Equations 4 and 5 can be computed directly by querying the appropriate bucketed coding distributions. Assuming that the time required to compute each conditional probability using $\rho_S$ and $\rho_Z$ is constant, the time complexity for computing $\hat{Q}_t(s, a)$ is $O(|\mathcal{Z}|)$.

**Algorithm 1** CNC Policy Evaluation
Require: Stationary policy $\pi$, environment $\mathcal{M}$
Require: Finite planning horizon $m \in \mathbb{N}$
Require: Coding distributions $\rho_S$ and $\rho_Z$

1. for $i = 1$ to $t$ do
2. Perform $a_i \sim \pi(\cdot \mid s_{i-1})$
3. Observe $(s_i, r_i) \sim \mu(\cdot \mid s_{i-1}, a_i)$
4. if $i \ge m$ then
5. Update $\rho_S$ in bucket $(z_{i-m+1}, a_{i-m+1})$ with $s_{i-m}$
6. Update $\rho_Z$ in bucket $a_{i-m+1}$ with $z_{i-m+1}$
7. end if
8. end for
9. return $\hat{Q}^\pi_t$
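The bucketed bookkeeping of Algorithm 1 and the estimate of Equations 4 and 5 can be sketched as follows. This is a rough Python illustration rather than the authors' implementation: it substitutes Laplace-smoothed frequency counts for a real compression-based coding distribution, and all class and variable names are our own.

```python
# Rough sketch of Algorithm 1's bucketed bookkeeping (not the authors' code).
# Each "coding distribution" here is just a Laplace-smoothed frequency count,
# standing in for a real compressor/density model.
from collections import Counter, defaultdict

class CNCEvaluator:
    def __init__(self, states, returns, alpha=1.0):
        self.states, self.returns, self.alpha = list(states), list(returns), alpha
        # rho_S: one bucket of state counts per (return, action) pair.
        self.state_counts = defaultdict(Counter)   # key (z, a) -> Counter over s
        # rho_Z: one bucket of return counts per action.
        self.return_counts = defaultdict(Counter)  # key a -> Counter over z

    def update(self, s, z, a):
        """Algorithm 1, steps 5-6: file s under bucket (z, a) and z under bucket a."""
        self.state_counts[(z, a)][s] += 1
        self.return_counts[a][z] += 1

    def _rho_S(self, s, z, a):
        c = self.state_counts[(z, a)]
        return (c[s] + self.alpha) / (sum(c.values()) + self.alpha * len(self.states))

    def _rho_Z(self, z, a):
        c = self.return_counts[a]
        return (c[z] + self.alpha) / (sum(c.values()) + self.alpha * len(self.returns))

    def q(self, s, a):
        """Equations 4-5: posterior weight over returns, then its expectation."""
        weights = {z: self._rho_S(s, z, a) * self._rho_Z(z, a) for z in self.returns}
        total = sum(weights.values())
        return sum(z * w for z, w in weights.items()) / total

# Usage: feed (state, m-lagged return, action) triples as they become available.
ev = CNCEvaluator(states=[0, 1], returns=[0, 1, 2])
ev.update(s=0, z=2, a=1)
ev.update(s=1, z=0, a=1)
print(ev.q(s=0, a=1))
```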
### 3.4 Analysis

We now show that the state-action estimates defined by Equation 4 are consistent, provided that consistent density estimators are used for both $\rho_S$ and $\rho_Z$. We will also say that $f_n$ converges stochastically to 0 with rate $n^{-1/2}$ if and only if for all $\delta \in (0, 1]$ there exists $c > 0$ such that $\mathbb{P}\big[\,|f_n(\omega)| \ge c \sqrt{1/n}\,\big] \le \delta$ for all $n$, and will denote this by writing $f_n(\omega) \in O_P(n^{-1/2})$.

**Theorem 1.** Given an $m$-horizon, finite state space, finite action space, time-homogeneous MDP $\mathcal{M} := (\mathcal{S}, \mathcal{A}, \mu)$ and a stationary policy $\pi$ that gives rise to an (IR+EA+PR) HMC, for all $\epsilon > 0$ we have, for any state $s \in \mathcal{S}$ and action $a \in \mathcal{A}$, that

$$\lim_{n \to \infty} \mathbb{P}\big[\,|\hat{Q}^\pi_n(s, a) - Q^\pi(s, a)| \ge \epsilon\,\big] = 0,$$

provided $\rho_S$ and $\rho_Z$ are consistent estimators of $\nu(s \mid z, a)$ and $\nu(z \mid a)$ respectively. Furthermore, if $|\rho_S(s \mid z, a) - \nu(s \mid z, a)| \in O_P(n^{-1/2})$ and $|\rho_Z(z \mid a) - \nu(z \mid a)| \in O_P(n^{-1/2})$, then $|\hat{Q}^\pi_n(s, a) - Q^\pi(s, a)| \in O_P(n^{-1/2})$.

Next we state consistency results for two types of estimators often used in model-based reinforcement learning.

**Theorem 2.** The frequency estimator ρ(xn|x