# Proper Value Equivalence

Christopher Grimm
Computer Science & Engineering, University of Michigan
crgrimm@umich.edu

André Barreto, Gregory Farquhar, David Silver, Satinder Singh
DeepMind
{andrebarreto,gregfar,davidsilver,baveja}@google.com

Abstract

One of the main challenges in model-based reinforcement learning (RL) is to decide which aspects of the environment should be modeled. The value-equivalence (VE) principle proposes a simple answer to this question: a model should capture the aspects of the environment that are relevant for value-based planning. Technically, VE distinguishes models based on a set of policies and a set of functions: a model is said to be VE to the environment if the Bellman operators it induces for the policies yield the correct result when applied to the functions. As the number of policies and functions increases, the set of VE models shrinks, eventually collapsing to a single point corresponding to a perfect model. A fundamental question underlying the VE principle is thus how to select the smallest sets of policies and functions that are sufficient for planning. In this paper we take an important step towards answering this question. We start by generalizing the concept of VE to order-$k$ counterparts defined with respect to $k$ applications of the Bellman operator. This leads to a family of VE classes that grow in size with $k$. In the limit, all functions become value functions, and we have a special instantiation of VE which we call proper VE, or simply PVE. Unlike VE, the PVE class may contain multiple models even in the limit when all value functions are used. Crucially, all these models are sufficient for planning, meaning that they will yield an optimal policy despite the fact that they may ignore many aspects of the environment. We construct a loss function for learning PVE models and argue that popular algorithms such as MuZero can be understood as minimizing an upper bound for this loss. We leverage this connection to propose a modification to MuZero and show that it can lead to improved performance in practice.

1 Introduction

It has long been argued that, in order for reinforcement learning (RL) agents to solve truly complex tasks, they must build a model of the environment that allows for counterfactual reasoning [29]. Since representing the world in all its complexity is a hopeless endeavor, especially under capacity constraints, the agent must be able to ignore aspects of the environment that are irrelevant for its purposes. This is the premise behind the value equivalence (VE) principle, which provides a formalism for focusing on the aspects of the environment that are crucial for value-based planning [17].

VE distinguishes models based on a set of policies and a set of real-valued scalar functions of state (henceforth, just functions). Roughly, a model is said to be VE to the environment if the Bellman operators it induces for the policies yield the same result as the environment's Bellman operators when applied to the functions. The policies and functions thus become a language to specify which parts of the environment a model should capture. As the number of policies and functions increases, the requirements on the model become more stringent, which is to say that the class of VE models shrinks. In the limit, the VE class collapses to a single point corresponding to a perfect model.
Although this result is reassuring, in practice we want to stop short of this collapse: after all, at that point the agent is no longer ignoring irrelevant aspects of the environment. A fundamental question is thus how to select the smallest sets of policies and functions such that a resulting VE model is sufficient for planning. In this paper we take an important additional step in this direction: we show that the VE principle can be formulated with respect to value functions only. This result drastically reduces the space of functions that must be considered by VE, as in general only a small fraction of the set of all functions will qualify as value functions in a given environment. Since every policy has an associated value function, this new formulation of VE removes the need for selecting functions, requiring only policies. We name our new formulation proper value equivalence (PVE) to emphasize its explicit use of value functions.

PVE has several desirable properties. Unlike with VE, the class of PVE models does not collapse to a singleton in the limit. This means that, even if all value functions are used, we generally end up with multiple PVE models, which can be beneficial if some of these are easier to learn or represent than others. Crucially, all of these models are sufficient for planning, meaning that they will yield an optimal policy despite the fact that they may ignore many aspects of the environment. Finally, we make more precise Grimm et al.'s [17] suggestion that the VE principle may help explain the good empirical performance of several modern algorithms [38, 33, 24, 12, 30]. Specifically, we show that, under mild assumptions, minimizing the loss of the MuZero algorithm [31] can be understood as minimizing a PVE error. We then leverage this connection to suggest a modification to MuZero and show a small but significant improvement on the Arcade Learning Environment [3].

2 Background

The agent's interaction with the environment will be modeled as a Markov decision process (MDP) $M \equiv (\mathcal{S}, \mathcal{A}, r, p, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $r(s, a)$ is the expected reward for taking action $a$ in state $s$, $p(s' \mid s, a)$ is the transition kernel, and $\gamma \in [0, 1)$ is a discount factor [27]. A policy is a mapping $\pi : \mathcal{S} \to \mathscr{P}(\mathcal{A})$, where $\mathscr{P}(\mathcal{A})$ is the space of probability distributions over $\mathcal{A}$; we define $\Pi \equiv \{\pi \mid \pi : \mathcal{S} \to \mathscr{P}(\mathcal{A})\}$ as the set of all possible policies. A policy $\pi$ is deterministic if $\pi(a \mid s) > 0$ for only one action $a$ per state $s$. A policy's value function is defined as

$$v_\pi(s) \equiv E_\pi\!\left[\, \sum_{i=0}^{\infty} \gamma^i r(S_{t+i}, A_{t+i}) \;\middle|\; S_t = s \right], \tag{1}$$

where $E_\pi[\cdot]$ denotes expectation over the trajectories induced by $\pi$ and the random variables $S_t$ and $A_t$ indicate the state occupied and the action selected by the agent at time step $t$. The agent's goal is to find a policy $\pi \in \Pi$ that maximizes the value of every state [36, 37].

Usually, a crucial step to carry out this search is to compute the value function of candidate policies. This process can be cast in terms of the policy's Bellman operator:

$$T_\pi[v](s) \equiv E_{A \sim \pi(\cdot\mid s),\, S' \sim p(\cdot\mid s, A)}\!\left[\, r(s, A) + \gamma v(S') \,\right], \tag{2}$$

where $v$ is any function in the space $\mathcal{V} \equiv \{f \mid f : \mathcal{S} \to \mathbb{R}\}$. It is known that $\lim_{n \to \infty} T^n_\pi v = v_\pi$; that is, starting from any $v \in \mathcal{V}$, the repeated application of $T_\pi$ will eventually converge to $v_\pi$. Since in RL the agent does not know $p$ and $r$, it cannot apply (2) directly. One solution is to learn a model $\tilde{m} \equiv (\tilde{r}, \tilde{p})$ and use it to compute (2) with $p$ and $r$ replaced by $\tilde{p}$ and $\tilde{r}$ [36]. We denote the set of all models as $\mathcal{M}$.
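To make the background concrete, the following is a minimal tabular sketch of Eqs. (1)-(2). The MDP, policy, and constants are randomly generated stand-ins (not from the paper); the point is only to show one application of $T_\pi$ and its convergence to $v_\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small, randomly generated tabular MDP (illustrative; any (r, p) works).
n_states, n_actions, gamma = 5, 3, 0.9
r = rng.normal(size=(n_states, n_actions))                         # r(s, a)
p = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(s'|s, a)

# A fixed stochastic policy pi(a|s).
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

def bellman_operator(v, pi, r, p, gamma):
    """One application of T_pi to v, as in Eq. (2)."""
    r_pi = (pi * r).sum(axis=1)                        # E_{A~pi}[r(s, A)]
    p_pi = np.einsum("sa,saz->sz", pi, p)              # P_pi(s'|s)
    return r_pi + gamma * p_pi @ v

# Repeated application converges to v_pi (Eq. (1)); compare with the
# closed-form solution v_pi = (I - gamma * P_pi)^{-1} r_pi.
v = np.zeros(n_states)
for _ in range(1000):
    v = bellman_operator(v, pi, r, p, gamma)

r_pi = (pi * r).sum(axis=1)
p_pi = np.einsum("sa,saz->sz", pi, p)
v_exact = np.linalg.solve(np.eye(n_states) - gamma * p_pi, r_pi)
print(np.max(np.abs(v - v_exact)))   # ~0: lim_n T_pi^n v = v_pi
```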
The value equivalence principle defines a model $\tilde{m} \equiv (\tilde{r}, \tilde{p})$ as value equivalent (VE) to the environment $m^* \equiv (r, p)$ with respect to a set of policies $\Pi$ and a set of functions $\mathcal{V}$ if it produces the same Bellman updates as $m^*$ when using $\Pi$ and $\mathcal{V}$ [17]. Classes of such models are expressed as follows:

$$\mathcal{M}(\Pi, \mathcal{V}) \equiv \{\tilde{m} \in \mathcal{M} : \tilde{T}_\pi v = T_\pi v \;\; \forall \pi \in \Pi, \forall v \in \mathcal{V}\}, \tag{3}$$

where $\mathcal{M}$ is a class of models, $\tilde{T}_\pi$ denotes the Bellman operator induced by model $\tilde{m}$ and policy $\pi$, and $T_\pi$ is the environment's Bellman operator for $\pi$.

Grimm et al. [17] showed that the VE principle can be used to learn models that disregard aspects of the environment which are not related to the task of interest. (A related approach is taken in value-aware model learning [11], which minimizes the discrepancy between the Bellman optimality operators induced by the model and the environment.) Classical approaches to model learning do not take the eventual use of the model into account, potentially modeling irrelevant aspects of the environment. Accordingly, Grimm et al. [17] have shown that, under the same capacity constraints, models learned using VE can outperform their classical counterparts.

3 Proper value equivalence

One can define a spectrum of VE classes corresponding to different numbers of applications of the Bellman operator. We define an order-$k$ VE class as:

$$\mathcal{M}^k(\Pi, \mathcal{V}) \equiv \{\tilde{m} \in \mathcal{M} : \tilde{T}^k_\pi v = T^k_\pi v \;\; \forall \pi \in \Pi, \forall v \in \mathcal{V}\}, \tag{4}$$

where $T^k_\pi v$ denotes $k$ applications of $T_\pi$ to $v$. Under our generalized definition of VE, Grimm et al. [17] studied order-one VE classes of the form $\mathcal{M}^1(\Pi, \mathcal{V})$. They showed that $\mathcal{M}^1(\Pi, \mathcal{V})$ either contains only the environment or is empty. This is not generally true for $k > 1$. The limiting behavior of order-$k$ VE classes can be described as follows.

Proposition 1. Let $\mathcal{V}$ be a set of functions such that if $v \in \mathcal{V}$ then $T_\pi v \in \mathcal{V}$ for all $\pi \in \Pi$. Then, for $k, K \in \mathbb{Z}^+$ such that $k$ divides $K$, it follows that:
(i) For any class of models $\mathcal{M}$ and any set of policies $\Pi$, we have that $\mathcal{M}^k(\Pi, \mathcal{V}) \subseteq \mathcal{M}^K(\Pi, \mathcal{V})$.
(ii) If $\Pi$ is non-empty and $\mathcal{V}$ contains at least one constant function, then there exist environments such that $\mathcal{M}^k(\Pi, \mathcal{V}) \subsetneq \mathcal{M}^K(\Pi, \mathcal{V})$.

We defer all proofs of theoretical results to Appendix A.2. Based on Proposition 1 we can relate different VE model classes according to the greatest common divisor of their respective orders; specifically, two classes $\mathcal{M}^k(\Pi, \mathcal{V})$ and $\mathcal{M}^K(\Pi, \mathcal{V})$ will intersect at $\mathcal{M}^{\gcd(k,K)}(\Pi, \mathcal{V})$ (Figure 1). Proposition 1 also implies that, in contrast to order-one VE classes, higher-order VE classes potentially include multiple models, even if VE is defined with respect to all policies $\Pi$ and all functions $\mathcal{V}$. In addition, the size of a VE class cannot decrease as we increase its order from $k$ to a multiple of $k$ (and in some cases it will strictly increase). This invites the question of what happens in the limit as we keep increasing the VE order. To answer this question, we introduce a crucial concept for this paper:

Definition 1 (Proper value equivalence). Given a set of policies $\Pi$, let

$$\mathcal{M}^\infty(\Pi) = \lim_{k \to \infty} \mathcal{M}^k(\Pi, \mathcal{V}) = \{\tilde{m} \in \mathcal{M} : \tilde{v}_\pi = v_\pi \;\; \forall \pi \in \Pi\}, \tag{5}$$

where $\mathcal{V}$ is the set of all functions, and $\tilde{v}_\pi$ and $v_\pi$ are the value functions of $\pi$ induced by the model $\tilde{m}$ and by the environment, respectively. We say that each $\tilde{m} \in \mathcal{M}^\infty(\Pi)$ is a proper value equivalent model to the environment with respect to $\Pi$.

Figure 1: Topology of the space of order-$k$ VE classes. Given a set of policies $\Pi$, a set of functions $\mathcal{V}$ closed under Bellman updates, and $k, K \in \mathbb{Z}^+$ such that $k$ divides $K$, we have that $\mathcal{M}^k(\Pi, \mathcal{V}) \subseteq \mathcal{M}^K(\Pi, \mathcal{V})$.
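For intuition, here is a sketch of what membership in the order-$k$ class of Eq. (4) means computationally for tabular models. The MDPs, policies, functions, and tolerance below are illustrative assumptions, not objects from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 2, 0.9

def random_mdp(rng):
    r = rng.normal(size=(nS, nA))
    p = rng.dirichlet(np.ones(nS), size=(nS, nA))
    return r, p

def T(v, pi, r, p):
    """Bellman operator T_pi[v] for a tabular MDP."""
    return (pi * r).sum(1) + gamma * np.einsum("sa,saz->sz", pi, p) @ v

def is_order_k_ve(env, model, policies, functions, k, tol=1e-8):
    """Check membership in the order-k VE class of Eq. (4):
    T_pi^k v must match under model and environment for all pi, v."""
    (r, p), (r_m, p_m) = env, model
    for pi in policies:
        for v in functions:
            u_env, u_mod = v.copy(), v.copy()
            for _ in range(k):
                u_env = T(u_env, pi, r, p)
                u_mod = T(u_mod, pi, r_m, p_m)
            if np.max(np.abs(u_env - u_mod)) > tol:
                return False
    return True

env = random_mdp(rng)
policies = [rng.dirichlet(np.ones(nA), size=nS) for _ in range(3)]
functions = [rng.normal(size=nS) for _ in range(3)]

# The environment is trivially order-k VE to itself for every k;
# a perturbed model generally is not.
print(is_order_k_ve(env, env, policies, functions, k=3))        # True
r2, p2 = env[0] + 0.1, env[1]
print(is_order_k_ve(env, (r2, p2), policies, functions, k=3))   # False
```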
Because the process of repeatedly applying a policy's Bellman operator to a function converges to the same fixed point regardless of the function, in an order-$\infty$ VE class the set $\Pi$ uniquely determines the set $\mathcal{V}$. This reduces the problem of defining $\Pi$ and $\mathcal{V}$ to defining the former only. Also, since all functions in an order-$\infty$ VE class are value functions, we call it proper VE, or simply PVE. It is easy to show that Proposition 1 is valid for any $k \in \mathbb{Z}^+$ when $K = \infty$ (Corollary 2 in Appendix A.2). Thus, in some sense, $\mathcal{M}^\infty$ is the biggest VE class. It is also possible to define this special VE class in terms of any other:

Proposition 2. For any set of policies $\Pi$ and any $k \in \mathbb{Z}^+$ it follows that

$$\mathcal{M}^\infty(\Pi) = \bigcap_{\pi \in \Pi} \mathcal{M}^k(\{\pi\}, \{v_\pi\}), \tag{6}$$

where $v_\pi$ is the value function of policy $\pi$ in the environment.

We thus have two equivalent ways to describe the class of models which are PVE with respect to a set of policies $\Pi$. The first, given in (5), is the order-$\infty$ limit of value equivalence with respect to $\Pi$ and the set of all functions $\mathcal{V}$. The second, given in (6), is the intersection of the classes of models that are order-$k$ VE with respect to the singleton policies in $\Pi$ and their respective value functions. This latter form is valid for any $k$, and will underpin our practical algorithmic instantiations of PVE.

Setting $k = 1$ in Proposition 2, we see that PVE can be written in terms of order-one VE. This means that $\mathcal{M}^\infty$ inherits many of the topological properties of $\mathcal{M}^1$ shown by Grimm et al. [17]. Specifically, we know that $\mathcal{M}'^{\infty}(\Pi) \subseteq \mathcal{M}^{\infty}(\Pi)$ whenever $\mathcal{M}' \subseteq \mathcal{M}$, and also that $\mathcal{M}^{\infty}(\Pi') \supseteq \mathcal{M}^{\infty}(\Pi)$ whenever $\Pi' \subseteq \Pi$ (these follow directly from Grimm et al.'s [17] Properties 1 and 3, respectively). Proposition 2 also sheds further light on the relation between PVE and order-$k$ VE more generally. Let $\Pi$ be a set of policies and $\mathcal{V}_\Pi$ the set of their value functions. Then, for any $k \in \mathbb{Z}^+$, we have that

$$\mathcal{M}^k(\Pi, \mathcal{V}_\Pi) = \bigcap_{\pi \in \Pi} \bigcap_{v \in \mathcal{V}_\Pi} \mathcal{M}^k(\{\pi\}, \{v\}) \subseteq \bigcap_{\pi \in \Pi} \mathcal{M}^k(\{\pi\}, \{v_\pi\}) = \mathcal{M}^\infty(\Pi), \tag{7}$$

which is another way of saying that $\mathcal{M}^\infty$ is, in some sense, the largest among all the VE classes.

The reason why the size of VE classes is important is that it directly reflects the main motivation behind the VE principle. VE's premise is that models should be constructed taking into account their eventual use: if some aspects of the environment are irrelevant for value-based planning, it should not matter whether a model captures them or not. This means that all models that differ only with respect to these irrelevant aspects, but are otherwise correct, qualify as valid VE solutions. A larger VE class generally means that more irrelevant aspects of the environment are being ignored by the agent. We now make this intuition more concrete by showing how irrelevant aspects of the environment that are eventually captured by order-one VE are always ignored by PVE:

Proposition 3. Let $\Pi$ be a set of policies. If the environment state can be factored as $\mathcal{S} = \mathcal{X} \times \mathcal{Y}$, where $|\mathcal{Y}| > 1$ and $v_\pi(s) = v_\pi((x, y)) = v_\pi(x)$ for all $\pi \in \Pi$, then $\mathcal{M}^1(\Pi, \mathcal{V}) \subsetneq \mathcal{M}^\infty(\Pi)$.

Note that the subset relation appearing in Proposition 3 is strict. We can think of the variable $y$ appearing in Proposition 3 as superfluous features that do not influence the RL task, like the background of an image or any other sensory data that is irrelevant to the agent's goal. A model is free to assign arbitrary dynamics to such irrelevant aspects of the state without affecting planning performance. Since order-one VE eventually pins down a model that describes everything about the environment, one would expect the size of $\mathcal{M}^\infty$ relative to $\mathcal{M}^1$ to increase as more superfluous features are added.
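To make this intuition concrete numerically, the sketch below builds an environment whose state factors as $(x, y)$ with rewards and $x$-dynamics independent of $y$, and a model that copies the environment on $x$ but assigns arbitrary dynamics to $y$. Under the additional simplifying assumption (ours, stronger than the premise of Proposition 3) that policies depend only on $x$, the two have identical value functions, so the model is PVE despite having different dynamics.

```python
import numpy as np

rng = np.random.default_rng(2)
nX, nY, nA, gamma = 3, 4, 2, 0.9
nS = nX * nY   # state s = (x, y), flattened as s = x * nY + y

# Environment: rewards and x-dynamics ignore y; y follows its own kernel q.
r_x = rng.normal(size=(nX, nA))
p_x = rng.dirichlet(np.ones(nX), size=(nX, nA))
q_env = rng.dirichlet(np.ones(nY), size=nY)
q_model = rng.dirichlet(np.ones(nY), size=nY)    # arbitrary, different y-dynamics

def build(p_x, r_x, q):
    """Full MDP over S = X x Y with product dynamics and y-independent rewards."""
    p = np.zeros((nS, nA, nS))
    r = np.zeros((nS, nA))
    for x in range(nX):
        for y in range(nY):
            s = x * nY + y
            r[s] = r_x[x]
            for a in range(nA):
                for x2 in range(nX):
                    for y2 in range(nY):
                        p[s, a, x2 * nY + y2] = p_x[x, a, x2] * q[y, y2]
    return r, p

def value(r, p, pi):
    """Exact v_pi via (I - gamma * P_pi)^{-1} r_pi."""
    r_pi = (pi * r).sum(1)
    p_pi = np.einsum("sa,saz->sz", pi, p)
    return np.linalg.solve(np.eye(nS) - gamma * p_pi, r_pi)

# A policy that depends only on the relevant factor x (assumption of this sketch).
pi_x = rng.dirichlet(np.ones(nA), size=nX)
pi = np.repeat(pi_x, nY, axis=0)                 # pi(a | (x, y)) = pi(a | x)

env, model = build(p_x, r_x, q_env), build(p_x, r_x, q_model)
v_env, v_model = value(*env, pi), value(*model, pi)
print(np.max(np.abs(v_env - v_model)))           # ~0: same values despite different dynamics
```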
Indeed, in our proof of Proposition 3 we construct a set of models in $\mathcal{M}^\infty(\Pi)$ which are in one-to-one correspondence with $\mathcal{Y}$, confirming this intuition (see Appendix A.2).

Proper value equivalence yields models that are sufficient for optimal planning.

In general, PVE does not collapse to a single model even when $\Pi$ contains all policies. At first this may give the impression that one is left with the extra burden of selecting one among the PVE models. However, it can be shown that no such choice needs to be made:

Proposition 4. An optimal policy for any $\tilde{m} \in \mathcal{M}^\infty(\Pi)$ is also an optimal policy in the environment.

According to Proposition 4, any model $\tilde{m} \in \mathcal{M}^\infty(\Pi)$ used for planning will yield an optimal policy for the environment. In fact, in the spirit of ignoring as many aspects of the environment as possible, we can define an even larger PVE class by focusing on deterministic policies only:

Corollary 1. Let $\Pi_{\mathrm{det}}$ be the set of all deterministic policies. An optimal policy for any $\tilde{m} \in \mathcal{M}^\infty(\Pi_{\mathrm{det}})$ is also optimal in the environment.

Given that both $\mathcal{M}^\infty(\Pi)$ and $\mathcal{M}^\infty(\Pi_{\mathrm{det}})$ are sufficient for optimal planning, one may wonder if these classes are in fact the same. The following result states that the class of PVE models with respect to deterministic policies can be strictly larger than its counterpart defined with respect to all policies:

Proposition 5. There exist environments and model classes for which $\mathcal{M}^\infty(\Pi) \subsetneq \mathcal{M}^\infty(\Pi_{\mathrm{det}})$.

Figure 2 illustrates Proposition 5 with an example of an environment and a model $\tilde{m}$ such that $\tilde{m} \in \mathcal{M}^\infty(\Pi_{\mathrm{det}})$ but $\tilde{m} \notin \mathcal{M}^\infty(\Pi)$.

Figure 2: An environment/model pair with the same values for all deterministic policies but not all stochastic policies. The environment has three states and two actions, $\mathcal{A} = \{L, R\}$. The percentages in the figure indicate the probability of a given transition and the corresponding tuples $(r, a)$ indicate the reward associated with a given action. A deterministic policy cannot dither between $s_1$ and $s_3$, but a stochastic policy can. Note that the dynamics of the pair differ when taking action $R$ from $s_2$. This difference affects the dithering behavior of such a stochastic policy in a way that results in different model and environment values.

To conclude our discussion on models that are sufficient for optimal planning, we argue that, in the absence of additional information about the environment or the agent, $\mathcal{M}^\infty(\Pi_{\mathrm{det}})$ is in fact the largest possible VE class that is guaranteed to yield optimal performance. To see why this is so, suppose we remove a single deterministic policy $\pi$ from $\Pi_{\mathrm{det}}$ and pick an arbitrary model $\tilde{m} \in \mathcal{M}^\infty(\Pi_{\mathrm{det}} \setminus \{\pi\})$. Let $\tilde{v}_\pi$ be the value function of $\pi$ according to the model $\tilde{m}$. Because $\pi$ is not included in the set of policies used to enforce PVE, $\tilde{v}_\pi$ may not coincide with $v_\pi$, the actual value function of $\pi$ in the environment. Now, if $\pi$ happens to be the only optimal policy in the environment and $v_\pi$ is not the optimal value function of $\tilde{m}$, the policy returned by this model will clearly be sub-optimal.

4 Learning a proper value-equivalent model

Having established that we want to find a model in $\mathcal{M}^\infty(\Pi_{\mathrm{det}})$, we now turn our attention to how this can be done in practice. Following Grimm et al.
[17], given a finite set of policies $\Pi$ and a finite set of functions $\mathcal{V}$, we cast the search for a model $\tilde{m} \in \mathcal{M}^k(\Pi, \mathcal{V})$ as the minimization of deviations from (4):

$$\ell^k_{\Pi, \mathcal{V}}(m^*, \tilde{m}) \equiv \sum_{\pi \in \Pi} \sum_{v \in \mathcal{V}} \left\| T^k_\pi v - \tilde{T}^k_\pi v \right\|, \tag{8}$$

where $\tilde{T}_\pi$ are the Bellman operators induced by $\tilde{m}$ and $\|\cdot\|$ is a norm. (We can also impose VE with infinite sets of functions and policies by replacing the respective sums with integrals; in this case one may consider taking a supremum over VE terms to avoid situations where VE is not necessarily satisfied on measure-zero sets.) Note that setting $k = \infty$ in (8) yields a loss that requires computing $\tilde{m}$'s value functions, which is impractical to do if $\tilde{m}$ is being repeatedly updated. Thankfully, by leveraging the connection between order-$k$ VE and PVE given in Proposition 2, we can derive a practical PVE loss:

$$\ell^k_{\Pi, \infty}(m^*, \tilde{m}) \equiv \sum_{\pi \in \Pi} \left\| T^k_\pi v_\pi - \tilde{T}^k_\pi v_\pi \right\| = \sum_{\pi \in \Pi} \left\| v_\pi - \tilde{T}^k_\pi v_\pi \right\|. \tag{9}$$

Interestingly, given a set of policies $\Pi$, minimizing (9) for any $k$ will result in a model $\tilde{m} \in \mathcal{M}^\infty(\Pi)$ (cf. Proposition 2). As we will discuss shortly, this property can be exploited to generate multiple loss functions that provide a richer learning signal in practical scenarios.

Contrasting loss functions (8) and (9), we observe an important fact: unlike other order-$k$ VE classes, PVE requires actual value functions to be enforced in practice. Since value functions require data and compute to be obtained, it is reasonable to ask whether the benefits of PVE justify the associated additional burden. Concretely, one may ask whether the sample transitions and computational effort spent in computing the value functions to be used with PVE would not be better invested in enforcing other forms of VE over arbitrary functions that can be readily obtained.

We argue that in many cases one does not have to choose between order-$k$ VE and PVE. Value-based RL algorithms usually compute value functions iteratively, generating a sequence of functions $v_1, v_2, \ldots$ that will eventually converge to $v_\pi$ for some $\pi$. A model-based algorithm that computes $v_\pi$ in this way has to somehow interleave this process with the refinement of the model $\tilde{m}$. When it comes to VE, one extreme solution is to use only the final approximation of $v_\pi$ in an attempt to enforce PVE through (9). It turns out that, as long as the sequence $v_1, v_2, \ldots$ is approaching $v_\pi$, one can use all the functions in the sequence to enforce PVE with respect to $\pi$. Our argument is based on the following result:

Proposition 6. For any $\pi \in \Pi$, $v \in \mathcal{V}$ and $k, n \in \mathbb{Z}^+$, we have that

$$\left\| v_\pi - \tilde{T}^k_\pi v_\pi \right\| \le (\gamma^k + \gamma^n) \underbrace{\left\| v_\pi - v \right\|}_{\epsilon_v} + \underbrace{\left\| T^n_\pi v - \tilde{T}^k_\pi v \right\|}_{\epsilon_{ve}}. \tag{10}$$

Note that the left-hand side of (10) corresponds to one of the terms of the PVE loss (9) associated with a given $\pi$. This means that, instead of minimizing this quantity directly, one can minimize the upper bound on the right-hand side of (10). The first term in this upper bound, $\epsilon_v$, is the conventional value-function approximation error that most value-based methods aim to minimize (either directly or indirectly). The second term, $\epsilon_{ve}$, is similar to the terms appearing in the order-$k$ VE loss (8), except that here the number of applications of $T_\pi$ and of its approximation $\tilde{T}_\pi$ do not have to coincide. All the quantities appearing in $\epsilon_{ve}$ are readily available or can be easily approximated using sample transitions [17]. Thus, $\epsilon_{ve}$ can be used to refine the model $\tilde{m}$ using functions $v$ that are not necessarily value functions. As $v \to v_\pi$, two things happen. First, $\epsilon_{ve}$ approaches one of the terms of the PVE loss (9) associated with policy $\pi$. Second, $\epsilon_v$ vanishes.
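The quantities in (9) and (10) are easy to compute exactly in the tabular case. Below is a minimal sketch, assuming a randomly generated environment, a slightly perturbed model, and the sup norm; it evaluates one term of the PVE loss and checks the bound of Proposition 6 numerically.

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma, k, n = 5, 2, 0.9, 3, 6

r = rng.normal(size=(nS, nA)); p = rng.dirichlet(np.ones(nS), size=(nS, nA))   # environment m*
rt = r + 0.05 * rng.normal(size=(nS, nA)); pt = p                              # a perturbed model
pi = rng.dirichlet(np.ones(nA), size=nS)

def T(v, r, p):        # Bellman operator T_pi[v] (Eq. (2)), pi fixed above
    return (pi * r).sum(1) + gamma * np.einsum("sa,saz->sz", pi, p) @ v

def Tn(v, r, p, m):    # m applications of T_pi
    for _ in range(m):
        v = T(v, r, p)
    return v

# Exact v_pi in the environment.
r_pi = (pi * r).sum(1); P_pi = np.einsum("sa,saz->sz", pi, p)
v_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

v = v_pi + rng.normal(size=nS)    # some intermediate value estimate, v != v_pi

sup = lambda x: np.max(np.abs(x))
pve_term = sup(v_pi - Tn(v_pi, rt, pt, k))          # one term of the PVE loss (9)
eps_v = sup(v_pi - v)                               # value-approximation error
eps_ve = sup(Tn(v, r, p, n) - Tn(v, rt, pt, k))     # VE-style model error

# Proposition 6: the PVE term is bounded by (gamma^k + gamma^n) * eps_v + eps_ve.
bound = (gamma**k + gamma**n) * eps_v + eps_ve
print(pve_term, bound, pve_term <= bound)
```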
Interestingly, the importance of $\epsilon_v$ also decreases with $n$ and $k$, the number of times $T_\pi$ and $\tilde{T}_\pi$ are applied in $\epsilon_{ve}$, respectively. This makes sense: since $T^n_\pi v \to v_\pi$ as $n \to \infty$ and, by definition, VE approaches PVE as $k \to \infty$, we have that $\epsilon_{ve}$ approaches the left-hand side of (10) as both $n$ and $k$ grow.

An extended example: MuZero through the lens of value equivalence.

Grimm et al. [17] suggested that the VE principle might help to explain the empirical success of recent RL algorithms like Value Iteration Networks, the Predictron, Value Prediction Networks, TreeQN, and MuZero [38, 33, 24, 12, 31]. In this section we investigate this hypothesis further and describe a possible way to interpret one of these algorithms, MuZero, through the lens of VE. We acknowledge that the derivation that follows abstracts away many details of MuZero and involves a few approximations of its mechanics, but we believe it captures and explains the algorithm's essence.

MuZero is a model-based RL algorithm that achieved state-of-the-art performance across both board games, such as chess and Go, and Atari 2600 games [31]. The model $\tilde{m}$ in MuZero is trained on sequences of states, actions and rewards resulting from executing a behavior policy in the environment, $s_{t:t+n+K}$, $a_{t:t+n+K}$, $r_{t:t+n+K}$, where $n$ and $K$ are hyperparameters of the agent that will be explained shortly. The agent produces an agent state $z^0_t$ from $s_t$ and subsequently generates $z^{1:K}_t$ by using its model to predict the next $K$ agent states following actions $a_{t:t+K}$. The agent also maintains reward and value function estimates as functions of agent states, which we denote $\hat{r}(z)$ and $v(z)$, respectively. A variant of MuZero's per-state model loss can thus be expressed as:

$$\ell_\mu(s_t) \equiv \sum_{k=0}^{K} \left( V_{t+k} - v(z^k_t) \right)^2 + \left( r_{t+k} - \hat{r}(z^k_t) \right)^2, \tag{11}$$

where $V_{t+k} = r_{t+k} + \gamma r_{t+k+1} + \cdots + \gamma^{n-1} r_{t+k+n-1} + \gamma^n v_{\mathrm{targ}}(z^0_{t+k+n})$. (In reality, MuZero uses a categorical representation for its value and reward functions and trains them with a cross-entropy objective; we argue that this choice is not essential to its underlying ideas and use scalar representations with a squared loss to simplify our analysis.) The term $v_{\mathrm{targ}}$ is a value target produced by Monte-Carlo tree search (MCTS) [7]. Because the behavior policy is itself computed via MCTS, we have that $v_{\mathrm{targ}} \approx v$; for simplicity we will assume that $v_{\mathrm{targ}} = v$ and only use $v$.

In what follows we show, subject to a modest smoothness assumption, that minimizing MuZero's loss with respect to its behavior policy $\pi$ also minimizes a corresponding PVE loss. Put precisely:

$$C \cdot E_{d_\pi}[\ell_\mu(S_t)] \ge \left( \ell^K_{\{\pi\}, \infty}(m^*, \tilde{m}) \right)^2 \tag{12}$$

for some $C > 0$, where $d_\pi$ is a stationary distribution. We proceed by combining two derivations: a lower bound on $E_{d_\pi}[\ell_\mu(S_t)]$ in (15), and an upper bound on $\left(\ell^K_{\{\pi\}, \infty}(m^*, \tilde{m})\right)^2$ in (17). As a preliminary step we note that $\ell_\mu(s_t)$ and $\ell^K_{\{\pi\}, \infty}(m^*, \tilde{m})$ are expressed in terms of samples and expectations, respectively. We note the following connection between these quantities:

$$E[r_{t+k} \mid s_t] = P^k_\pi[r_\pi](s_t), \qquad E[V_{t+k} \mid s_t] = P^k_\pi T^n_\pi[v](s_t),$$
$$E[\hat{r}(z^k_t) \mid s_t] = \tilde{P}^k_\pi[\tilde{r}_\pi](s_t), \qquad E[v(z^k_t) \mid s_t] = \tilde{P}^k_\pi[v](s_t), \tag{13}$$

where $P^k_\pi$ is the $k$-step environment transition operator under policy $\pi$, $P^k_\pi[x](s_t) = E[x(S_{t+k}) \mid s_t, m^*, \pi]$, $r_\pi(s) = E_{A \sim \pi}[r(s, A)]$, and $\tilde{P}^k_\pi$ and $\tilde{r}_\pi$ are the corresponding quantities using the model instead of the environment. The above expectations are taken with respect to the environment or the model and $\pi$, as appropriate.
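To ground the notation of Eq. (11), the sketch below assembles the per-state loss from a trajectory using the simplified scalar heads of our variant. All function names and the toy instantiation (encoder outputs, dynamics, value and reward heads) are hypothetical stand-ins, not MuZero's actual architecture or training code.

```python
import numpy as np

gamma, n, K = 0.99, 5, 4

def muzero_per_state_loss(rewards, z0_states, v, r_hat, dynamics, actions, t):
    """Schematic version of the per-state loss in Eq. (11).

    rewards[t]   : observed reward r_t from the environment trajectory
    z0_states[t] : agent state z_t^0 produced by the encoder from s_t
    v, r_hat     : value and reward heads defined on agent states
    dynamics     : model mapping (z, a) -> next agent state (hypothetical signature)
    """
    # Unroll the model K steps from z_t^0 along the executed actions.
    z = [z0_states[t]]
    for k in range(K):
        z.append(dynamics(z[k], actions[t + k]))

    loss = 0.0
    for k in range(K + 1):
        # n-step bootstrapped target V_{t+k} (with v standing in for v_targ).
        V = sum(gamma**i * rewards[t + k + i] for i in range(n))
        V += gamma**n * v(z0_states[t + k + n])
        loss += (V - v(z[k]))**2 + (rewards[t + k] - r_hat(z[k]))**2
    return loss

# Toy instantiation (everything here is a stand-in).
rng = np.random.default_rng(4)
T_len = 64
rewards = rng.normal(size=T_len)
actions = rng.integers(0, 3, size=T_len)
z0_states = rng.normal(size=(T_len, 8))            # pretend encoder outputs
W = 0.1 * rng.normal(size=(8, 8))
dynamics = lambda z, a: np.tanh(W @ z + a)         # toy dynamics model
v = lambda z: float(z.sum())                       # toy value head
r_hat = lambda z: float(z.mean())                  # toy reward head

print(muzero_per_state_loss(rewards, z0_states, v, r_hat, dynamics, actions, t=10))
```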
We now derive our lower bound on $E_{d_\pi}[\ell_\mu(S_t)]$:

$$
\begin{aligned}
E_{d_\pi}[\ell_\mu(S_t)] &= E_{d_\pi}\!\left[ \sum_{k=0}^{K} E\!\left[ (V_{t+k} - v(z^k_t))^2 \mid S_t \right] + \sum_{k=0}^{K} E\!\left[ (r_{t+k} - \hat{r}(z^k_t))^2 \mid S_t \right] \right] \\
&\ge E_{d_\pi}\!\left[ \sum_{k=0}^{K} \left( E[V_{t+k} \mid S_t] - E[v(z^k_t) \mid S_t] \right)^2 + \sum_{k=0}^{K} \left( E[r_{t+k} \mid S_t] - E[\hat{r}(z^k_t) \mid S_t] \right)^2 \right] \\
&= \sum_{k=0}^{K} E_{d_\pi}\!\left[ \left( P^k_\pi T^n_\pi v(S_t) - \tilde{P}^k_\pi v(S_t) \right)^2 \right] + \sum_{k=0}^{K} E_{d_\pi}\!\left[ \left( P^k_\pi r_\pi(S_t) - \tilde{P}^k_\pi \tilde{r}_\pi(S_t) \right)^2 \right],
\end{aligned} \tag{14}
$$

where we apply the tower property, Jensen's inequality and the identities in (13). We write the expression using norms and drop all terms except $k \in \{0, K\}$ in the first sum to obtain:

$$E_{d_\pi}[\ell_\mu(S_t)] \ge \left\| T^n_\pi v - v \right\|^2_{d_\pi} + \left\| P^K_\pi T^n_\pi v - \tilde{P}^K_\pi v \right\|^2_{d_\pi} + \sum_{k=0}^{K} \left\| P^k_\pi r_\pi - \tilde{P}^k_\pi \tilde{r}_\pi \right\|^2_{d_\pi}, \tag{15}$$

recalling that $\|x - y\|^2_{d_\pi} = E_{d_\pi}[(x(S_t) - y(S_t))^2]$.

To derive an upper bound for $\left(\ell^K_{\{\pi\}, \infty}(m^*, \tilde{m})\right)^2$, we assume that the error in value estimation is smooth in the sense that there is some $g > 0$ (independent of $v$) such that $\| v - v_\pi \|_\infty < g \| v - v_\pi \|_{d_\pi}$. We can then use a modified version of (10) for the $d_\pi$-weighted $\ell_2$-norm (see Appendix A.2), plugging in $n + K$ and $K$:

$$
\begin{aligned}
\left\| v_\pi - \tilde{T}^K_\pi v_\pi \right\|_{d_\pi}
&\le (g\gamma^K + \gamma^{n+K}) \left\| v_\pi - v \right\|_{d_\pi} + \left\| T^{K+n}_\pi v - \tilde{T}^K_\pi v \right\|_{d_\pi} \\
&\le \gamma^K (g + \gamma^n) \left\| v_\pi - v \right\|_{d_\pi} + \left\| P^K_\pi T^n_\pi v - \tilde{P}^K_\pi v \right\|_{d_\pi} + \sum_{k=0}^{K} \left\| P^k_\pi r_\pi - \tilde{P}^k_\pi \tilde{r}_\pi \right\|_{d_\pi} \\
&\le \frac{\gamma^K (g + \gamma^n)}{1 - \gamma^n} \left\| T^n_\pi v - v \right\|_{d_\pi} + \left\| P^K_\pi T^n_\pi v - \tilde{P}^K_\pi v \right\|_{d_\pi} + \sum_{k=0}^{K} \left\| P^k_\pi r_\pi - \tilde{P}^k_\pi \tilde{r}_\pi \right\|_{d_\pi}.
\end{aligned} \tag{16}
$$

From here we can square both sides and apply Jensen's inequality:

$$\left\| v_\pi - \tilde{T}^K_\pi v_\pi \right\|^2_{d_\pi} \le ab \left\| T^n_\pi v - v \right\|^2_{d_\pi} + b \left\| P^K_\pi T^n_\pi v - \tilde{P}^K_\pi v \right\|^2_{d_\pi} + b \sum_{k=0}^{K} \left\| P^k_\pi r_\pi - \tilde{P}^k_\pi \tilde{r}_\pi \right\|^2_{d_\pi}, \tag{17}$$

where $a = \gamma^K (g + \gamma^n)(1 - \gamma^n)^{-1}$ and $b = a + K + 2$. Combining (17) and (15) we obtain:

$$ab \, E_{d_\pi}[\ell_\mu(S_t)] \ge \left\| v_\pi - \tilde{T}^K_\pi v_\pi \right\|^2_{d_\pi} = \left( \ell^K_{\{\pi\}, \infty}(m^*, \tilde{m}) \right)^2; \tag{18}$$

thus minimizing MuZero's loss minimizes a squared PVE loss with respect to a single policy.

5 Experiments

We first provide results from tabular experiments on a stochastic version of the Four Rooms domain, which serve to corroborate our theoretical claims. Then, we present results from experiments across the full Atari-57 benchmark [3] showing that the insights from studying PVE and its relationship to MuZero can provide a benefit in practice at scale. See Appendix A.3 for a full account of our experimental procedure.

In Section 3 we described the topological relationships between order-$k$ and PVE classes. This is summarized by Proposition 1, which shows that, for appropriately defined $\mathcal{V}$ and $\Pi$, $\mathcal{M}^k \subseteq \mathcal{M}^K$ if $K$ is a multiple of $k$. We illustrate this property empirically by randomly initializing a set of models and then using (8) (or (9) for the limiting case of $k = \infty$) to iteratively update them towards $\mathcal{M}^k(\Pi, \mathcal{V})$, with $k \in \{1, 30, 40, 50, 60, \infty\}$. We take the vectors representing these models and project them onto their first two principal components in order to visualize their paths through learning. The results are shown on the top row of Figure 3. In accordance with the theory, we see that the space of converged models, represented by the brightest yellow regions, grows with $k$. This trend is summarized in the rightmost plot, which shows the diameter of the scatter plots for each $k$.

Figure 3: All scatter plots are generated by tracking the training progress over 500,000 iterations of models with different order-$k$ VE objectives. In each plot 120 models were tracked; every 1000 timesteps the full set of models is converted into vector form and projected onto their first two principal components before being plotted (details in the appendix). Top row: points are colored according to their progress through training. Bottom row: points are colored according to the average value of the associated model's optimal policy on the environment. Rightmost plot: line plot of the diameters of these scatter plots against the model-class order.
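As an aside, the projection step described above can be sketched as follows. The snapshots here are random stand-ins rather than our actual models; the sketch only shows the centering and two-component PCA used to produce plots of this kind.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)

# Stand-in for training snapshots: rows are flattened model parameters
# (e.g., reward and transition tables concatenated), one row per checkpoint.
n_checkpoints, dim = 500, 120
snapshots = np.cumsum(rng.normal(size=(n_checkpoints, dim)), axis=0)  # fake trajectory

# PCA via SVD: center, then project onto the top-2 right singular vectors.
centered = snapshots - snapshots.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T

plt.scatter(coords[:, 0], coords[:, 1], c=np.arange(n_checkpoints), s=4)
plt.colorbar(label="training progress")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```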
In the bottom row of Figure 3 we use color to show the value that the optimal policy of each model achieves in the true environment. As predicted by our theory, the space of models that are sufficient for optimal planning also grows with $k$. Model classes containing many models with optimal planning performance are particularly advantageous when the set of models that an agent can represent is restricted, since the larger the set of suitable models, the greater the chance of an overlap between this set and the set of models representable by the agent.

Proposition 4 and Corollary 1 compared $\mathcal{M}^\infty(\Pi)$ and $\mathcal{M}^\infty(\Pi_{\mathrm{det}})$, showing that, although any model in either class is sufficient for planning, $\mathcal{M}^\infty(\Pi) \subseteq \mathcal{M}^\infty(\Pi_{\mathrm{det}})$. This suggests that it might be better to learn a model in $\mathcal{M}^\infty(\Pi_{\mathrm{det}})$ when the agent has limited capacity. We illustrate that this is indeed the case in Figure 4a: we progressively restrict the space of models that the agent can represent and attempt to learn models in either $\mathcal{M}^\infty(\Pi)$ or $\mathcal{M}^\infty(\Pi_{\mathrm{det}})$. Indeed, we find that the larger class, $\mathcal{M}^\infty(\Pi_{\mathrm{det}})$, yields superior planning performance as agent capacity decreases.

Figure 4: (a) Comparison of the performance of optimal policies obtained from capacity-constrained models trained to be in $\mathcal{M}^\infty(\Pi_{\mathrm{det}})$ and $\mathcal{M}^\infty(\Pi)$. For each action $a \in \mathcal{A}$, the transition dynamics $P_a$ is constrained to have rank at most $k$. The red dashed line represents the performance of the optimal environment policy. (b) Trajectories starting from the bottom-right state (red dot) sampled from the optimal environment policy in both the environment and a PVE model. Note the numerous diagonal transitions in the PVE model, which are not permitted in the environment.

Given their importance, we provide intuition on the ways in which individual PVE models differ from the environment. In Figure 4b we compare trajectories starting at the same initial state (denoted by a red circle) from the optimal environment policy in both the environment and a randomly sampled model from $\mathcal{M}^\infty(\Pi)$. In the PVE model there are numerous diagonal transitions not permitted by the environment. Note that while the PVE model has very different dynamics from the environment, these differences must "balance out," as it still assigns the same value as the environment to every policy.

In Section 4 we showed that minimizing MuZero's loss function is analogous to minimizing an upper bound on a PVE loss (9) with respect to the agent's current policy $\pi$, which corresponds to finding a model in $\mathcal{M}^\infty(\Pi)$ with $\Pi = \{\pi\}$. Note that our guarantee on the performance of PVE models (Corollary 1) holds when $\Pi$ contains all deterministic policies. While it is not feasible to enforce $\Pi = \Pi_{\mathrm{det}}$, we can use previously seen policies by augmenting the MuZero algorithm with a buffer of past policies and their approximate value functions (we do so by periodically storing the corresponding parameters). We can then add to MuZero an additional loss with the form of the original value loss but using the past value functions. We still use sampled rewards to construct value targets, but use the stored policies to compute off-policy corrections using V-trace [9]. To test this proposal we use an on-policy (i.e., without a replay buffer) implementation of MuZero run for 500M frames (as opposed to the 20B frames in the online result of [31]) on the Atari-57 benchmark, and find that our additional loss yields an advantage in the human-normalized median performance shown in Figure 5.

Figure 5: Comparison of our proposed modification to MuZero with an unmodified baseline.
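The sketch below illustrates the kind of target construction this modification relies on: V-trace corrected value targets [9] computed from sampled rewards, a stored value function of a past policy, and the importance ratios between the stored policy and the behavior policy. The surrounding usage (the buffer contents and the auxiliary squared loss) is a schematic reading of the description above, not our exact implementation.

```python
import numpy as np

def vtrace_targets(rewards, values, rho, gamma=0.997, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets (Espeholt et al. [9]) computed backwards in time.

    rewards[t] : reward observed at step t (sampled from the environment)
    values[t]  : stored value estimate of the past policy at state s_t,
                 with values[T] the bootstrap value (length T + 1)
    rho[t]     : importance ratio pi_past(a_t|s_t) / mu(a_t|s_t), where mu is
                 the behavior policy that generated the trajectory
    """
    T = len(rewards)
    clipped_rho = np.minimum(rho, rho_bar)
    clipped_c = np.minimum(rho, c_bar)
    vs = np.array(values, dtype=float)      # will hold corrected targets
    acc = 0.0
    for t in reversed(range(T)):
        delta = clipped_rho[t] * (rewards[t] + gamma * values[t + 1] - values[t])
        acc = delta + gamma * clipped_c[t] * acc
        vs[t] = values[t] + acc
    return vs[:T]

# Hypothetical usage inside the auxiliary loss: regress the model's value
# prediction for a stored past policy towards these corrected targets.
rng = np.random.default_rng(6)
T = 10
rewards = rng.normal(size=T)
values = rng.normal(size=T + 1)                  # stored value head of a past policy
rho = np.exp(rng.normal(scale=0.1, size=T))      # pi_past / mu ratios
targets = vtrace_targets(rewards, values, rho)

model_pred = rng.normal(size=T)                  # stand-in for the model's value prediction
aux_value_loss = np.mean((targets - model_pred) ** 2)   # schematic squared loss
print(targets, aux_value_loss)
```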
MuZero's data efficiency can also be improved with the aid of a replay buffer of trajectories from past policies [32], which may also capture some of the advantages of expanding the set of policies used for PVE.

6 Related work

Our work is closely related to value-aware model learning (VAML, IterVAML; [11, 10]), which learns models that minimize the discrepancy between their own Bellman optimality operators and the environment's on the optimal value function (an a priori unknown quantity). To handle this, VAML specifies a family of potential value functions and minimizes the worst-case discrepancy across them, whereas IterVAML minimizes the discrepancy with respect to the model's current estimate of the value function in a value-iteration-inspired scheme. PVE and the VAML family are complementary lines of work, with VAML addressing the induced optimization problems and PVE addressing the induced model classes. Both, however, advocate for learning models with their eventual use in mind, a view that is aligned with many criticisms of the maximum-likelihood objective for model learning [11, 21, 2, 23].

It is worth mentioning the relationship between PVE and TD models [35], which, for a given policy, define any $R \in \mathbb{R}^{|\mathcal{S}|}$ and $P \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$ with $\lim_{k \to \infty} P^k = 0$ as a valid model if $V = R + PV$, where $V \in \mathbb{R}^{|\mathcal{S}|}$ represents $v_\pi$. Clearly all models in $\mathcal{M}^\infty(\{\pi\})$ are valid TD models; however, since $P$ is not necessarily a transition matrix, the converse does not hold. While TD models are restricted to prediction rather than control, their generality warrants further inquiry.

Order-$k$ and PVE model classes form equivalences between MDPs and can thus be situated among other equivalence notions which can be formulated as state aggregations [8, 26, 25, 16, 28, 13, 34, 39, 5, 40]. As pointed out by Grimm et al. [17], the interaction between arbitrary state aggregations and models can be captured with special cases of order-one VE. Our extension to higher-order VE potentially offers the possibility of blending existing notions of state aggregation with PVE. A notable instance of state aggregation is bisimulation [22], which uses a relation to aggregate states that have the same immediate rewards and transition dynamics into other aggregated states. Bisimulation metrics [13] provide smooth measures of how closely pairs of states satisfy bisimulation relations. These concepts have become increasingly popular in deep reinforcement learning, where they are used to guide the learning of effective representations [43, 42, 15, 1]. Although both bisimulation and PVE provide direction for learning internal aspects of an agent, they are fundamentally different in their purview: bisimulation concerns representation learning, while PVE concerns the learning of models given a representation of state.

Beyond bisimulation, representation learning has a wide literature [41, 20, 6, 14, 4], including several modern works that explicitly study the conjunction of model learning with state representation [44, 42, 15]. These are further complemented by efforts to learn state representations and models jointly in the service of value-based planning [12, 33, 24, 18, 31, 38].

7 Conclusion and future work

We extended the value equivalence principle by defining a spectrum of order-$k$ VE sets in which models induce the same $k$-step Bellman operators as the environment. We then explored the topology of the resulting equivalence classes and defined the limiting class as $k \to \infty$ as PVE.
If a model is PVE to the environment with respect to a set of policies $\Pi$, then the value functions of all policies in $\Pi$ are the same in the environment and in the model. The fact that PVE classes can be defined using only a set of policies eliminates the need for specifying a set of functions to induce VE, resolving a fundamental issue left open by Grimm et al. [17]. Importantly, we showed that being PVE with respect to all deterministic policies is sufficient for a model to plan optimally in the environment. In the absence of additional information, this is the largest possible VE class that yields optimal planning. On the practical side, we showed how the MuZero algorithm can be understood as minimizing an upper bound on a PVE loss, and leveraged this insight to improve the algorithm's performance.

Though our efforts have advanced the understanding of value equivalence and proven useful algorithmically, there is still work to be done in developing a VE theory whose assumptions hold in practice. This remaining work can be broadly grouped into two areas: (1) understanding the role of approximation in VE, and (2) establishing performance guarantees for VE models with arbitrary sets of policies and functions. We leave these as future work.

Acknowledgements

We thank Angelos Filos and Sonya Kotov for many thought-provoking discussions. Christopher Grimm's work was made possible by the support of the Lifelong Learning Machines (L2M) grant from the Defense Advanced Research Projects Agency. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors.

References

[1] Rishabh Agarwal, Marlos C. Machado, Pablo Samuel Castro, and Marc G. Bellemare. Contrastive behavioral similarity embeddings for generalization in reinforcement learning. arXiv preprint arXiv:2101.05265, 2021.
[2] Alex Ayoub, Zeyu Jia, Csaba Szepesvári, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463-474. PMLR, 2020.
[3] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.
[4] Ondrej Biza, Robert Platt, Jan-Willem van de Meent, and Lawson L. S. Wong. Learning discrete state abstractions with deep variational inference. arXiv preprint arXiv:2003.04300, 2020.
[5] Pablo Samuel Castro. Scalable methods for computing state similarity in deterministic Markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10069-10076, 2020.
[6] Dane Corneil, Wulfram Gerstner, and Johanni Brea. Efficient model-based deep reinforcement learning with variational state tabulation. In International Conference on Machine Learning, pages 1049-1058. PMLR, 2018.
[7] Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, pages 72-83. Springer, 2006.
[8] Thomas Dean and Robert Givan. Model minimization in Markov decision processes. In AAAI/IAAI, pages 106-111, 1997.
[9] Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning, pages 1407-1416. PMLR, 2018.
[10] Amir-massoud Farahmand.
Iterative value-aware model learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 9090-9101, 2018.
[11] Amir-massoud Farahmand, André Barreto, and Daniel Nikovski. Value-aware loss function for model-based reinforcement learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54, pages 1486-1494, 2017.
[12] G. Farquhar, T. Rocktäschel, M. Igl, and S. Whiteson. TreeQN and ATreeC: Differentiable tree-structured models for deep reinforcement learning. In 6th International Conference on Learning Representations, ICLR 2018, Conference Track Proceedings. ICLR, 2018.
[13] Norm Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite Markov decision processes. In UAI, volume 4, pages 162-169, 2004.
[14] Vincent François-Lavet, Yoshua Bengio, Doina Precup, and Joelle Pineau. Combined reinforcement learning via abstract representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3582-3589, 2019.
[15] Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G. Bellemare. DeepMDP: Learning continuous latent space models for representation learning. In International Conference on Machine Learning, pages 2170-2179. PMLR, 2019.
[16] Robert Givan, Thomas Dean, and Matthew Greig. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147(1-2):163-223, 2003.
[17] Christopher Grimm, André Barreto, Satinder Singh, and David Silver. The value equivalence principle for model-based reinforcement learning. Advances in Neural Information Processing Systems, 33, 2020.
[18] Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, and Hado van Hasselt. Muesli: Combining improvements in policy optimization. arXiv preprint arXiv:2104.06159, 2021.
[19] Matteo Hessel, Manuel Kroiss, Aidan Clark, Iurii Kemaev, John Quan, Thomas Keck, Fabio Viola, and Hado van Hasselt. Podracer architectures for scalable reinforcement learning. CoRR, abs/2104.06272, 2021.
[20] Maximilian Igl, Luisa Zintgraf, Tuan Anh Le, Frank Wood, and Shimon Whiteson. Deep variational reinforcement learning for POMDPs. In International Conference on Machine Learning, pages 2117-2126. PMLR, 2018.
[21] Joshua Joseph, Alborz Geramifard, John W. Roberts, Jonathan P. How, and Nicholas Roy. Reinforcement learning with misspecified model classes. In 2013 IEEE International Conference on Robotics and Automation, pages 939-946. IEEE, 2013.
[22] Robin Milner. Communication and Concurrency, volume 84. Prentice Hall, 1989.
[23] Aditya Modi, Nan Jiang, Ambuj Tewari, and Satinder Singh. Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pages 2010-2020. PMLR, 2020.
[24] Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In Advances in Neural Information Processing Systems, pages 6118-6128, 2017.
[25] Pascal Poupart and Craig Boutilier. Value-directed belief state approximation for POMDPs. CoRR, abs/1301.3887, 2013. URL http://arxiv.org/abs/1301.3887.
[26] Pascal Poupart, Craig Boutilier, et al. Value-directed compression of POMDPs. Advances in Neural Information Processing Systems, pages 1579-1586, 2003.
[27] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
[28] Balaraman Ravindran and Andrew G. Barto.
Approximate homomorphisms: A framework for non-exact minimization in Markov decision processes. 2004.
[29] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, 3rd edition, 2003.
[30] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.
[31] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604-609, 2020.
[32] Julian Schrittwieser, Thomas Hubert, Amol Mandhane, Mohammadamin Barekatain, Ioannis Antonoglou, and David Silver. Online and offline reinforcement learning by planning with a learned model. arXiv preprint arXiv:2104.06294, 2021.
[33] David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, André Barreto, et al. The Predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3191-3199. JMLR.org, 2017.
[34] John P. Spencer, Michael S. C. Thomas, and J. L. McClelland. Toward a unified theory of development. In J. P. Spencer, M. S. C. Thomas, and J. L. McClelland (Eds.), pages 86-118, 2009.
[35] Richard S. Sutton. TD models: Modeling the world at a mixture of time scales. In Proceedings of the Twelfth International Conference on Machine Learning, pages 531-539, 1995.
[36] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018. URL https://mitpress.mit.edu/books/reinforcement-learning-second-edition.
[37] Csaba Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
[38] Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154-2162, 2016.
[39] Jonathan Taylor, Doina Precup, and Prakash Panangaden. Bounding performance loss in approximate MDP homomorphisms. Advances in Neural Information Processing Systems, 21:1649-1656, 2008.
[40] Elise van der Pol, Thomas Kipf, Frans A. Oliehoek, and Max Welling. Plannable approximations to MDP homomorphisms: Equivariance under actions. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pages 1431-1439, 2020.
[41] Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2, pages 2746-2754, 2015.
[42] Amy Zhang, Zachary C. Lipton, Luis Pineda, Kamyar Azizzadenesheli, Anima Anandkumar, Laurent Itti, Joelle Pineau, and Tommaso Furlanello. Learning causal state representations of partially observable environments. arXiv preprint arXiv:1906.10437, 2019.
[43] Amy Zhang, Rowan McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning invariant representations for reinforcement learning without reconstruction. arXiv preprint arXiv:2006.10742, 2020.
[44] Marvin Zhang, Sharad Vikram, Laura Smith, Pieter Abbeel, Matthew Johnson, and Sergey Levine. SOLAR: Deep structured representations for model-based reinforcement learning. In International Conference on Machine Learning, pages 7444-7453. PMLR, 2019.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] See appendix.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Code for the illustrative experiments is available at a URL provided in Appendix A.3. Sufficient detail for reproducibility is provided in the same section.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]