Published in Transactions on Machine Learning Research (12/2023)

Equivariant MuZero

Andreea Deac deacandr@mila.quebec Mila, Université de Montréal
Théophane Weber theophane@google.com Google DeepMind
George Papamakarios gpapamak@google.com Google DeepMind

Reviewed on OpenReview: https://openreview.net/forum?id=ExbGarTbLE

Abstract

Deep reinforcement learning has shown considerable success in closed, well-defined domains such as games (Chess, Go, StarCraft). The next frontier is real-world scenarios, where setups are numerous and varied. For this, agents need to learn the underlying environment dynamics, so as to robustly generalise to conditions that differ from those they were trained on. Model-based reinforcement learning algorithms, such as MuZero or Dreamer, aim to accomplish this by learning a world model. However, leveraging a world model has not yet consistently shown greater generalisation capabilities compared to model-free alternatives. In this work, we propose improving the data efficiency and generalisation capabilities of MuZero by explicitly incorporating the symmetries of the environment in its world-model architecture. We prove that, so long as the neural networks used by MuZero are equivariant to a particular symmetry group acting on the environment, the entirety of MuZero's action-selection algorithm will also be equivariant to that group. As such, Equivariant MuZero is guaranteed to behave symmetrically in symmetrically-transformed states, and will hence be more data-efficient when learning its world models. We evaluate Equivariant MuZero on procedurally-generated MiniPacman and on Chaser from the ProcGen suite: training on a set of mazes, and then testing on unseen rotated versions, demonstrating the benefits of equivariance. We verify that our improvements hold even when only some of the components of Equivariant MuZero obey strict equivariance, which highlights the robustness of our construction.
1 Introduction

Reinforcement learning (RL) is a potent paradigm for solving sequential decision-making problems in a dynamically changing environment. Successful examples of its uses include game playing (Vinyals et al., 2019), drug design (Segler et al., 2018), robotics (Ibarz et al., 2021) and theoretical computer science (Fawzi et al., 2022). However, the generality of RL often leads to data inefficiency, poor generalisation and a lack of safety guarantees. This is an issue especially in domains where data is scarce or difficult to obtain, such as medicine or human-in-the-loop scenarios. Most RL approaches do not directly attempt to capture the regularities present in the environment. As an example, consider a grid-world: moving down in a maze is equivalent to moving left in the 90° clockwise rotation of the same maze. Such equivalences can be formalised via Markov Decision Process homomorphisms (Ravindran, 2004; Ravindran & Barto, 2004), and while some works incorporate them (e.g. van der Pol et al., 2020; Rezaei-Shoshtari et al., 2022), most deep reinforcement learning agents would act differently in such equivalent states if they do not observe enough data. This becomes even more problematic when the number of equivalent states is large. One common example is 3D regularities, such as changing camera angles in robotic tasks. In recent years, there has been significant progress in building deep neural networks that explicitly obey such regularities, often termed geometric deep learning (Bronstein et al., 2021). In this context, the regularities are formalised using symmetry groups, and architectures are built by composing transformations that are equivariant to these symmetry groups (e.g. convolutional neural networks for the translation group, graph neural networks and transformers for the permutation group).

Work performed while the author was at Google DeepMind.
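To make the grid-world equivalence concrete, here is a small NumPy sketch (using a hypothetical 3×3 maze, not taken from the paper) checking that moving down and then rotating the maze 90° clockwise lands the agent in the same cell as first rotating and then moving left:

```python
import numpy as np

# Toy illustration: a 3x3 maze where 1 = wall, 0 = free, with the agent
# position tracked separately as (row, col).
maze = np.array([[0, 1, 0],
                 [0, 0, 0],
                 [1, 0, 0]])
agent = (0, 0)

def rotate90_cw(maze, agent):
    """Rotate the grid-world 90 degrees clockwise; remap the agent position.
    Under a clockwise rotation, cell (r, c) moves to (c, n - 1 - r)."""
    n = maze.shape[0]
    r, c = agent
    return np.rot90(maze, k=-1), (c, n - 1 - r)

def step(agent, action):
    """Apply a movement action to the agent position."""
    dr, dc = {'up': (-1, 0), 'down': (1, 0),
              'left': (0, -1), 'right': (0, 1)}[action]
    return (agent[0] + dr, agent[1] + dc)

# Path 1: move down in the original maze, then rotate.
_, rot_agent = rotate90_cw(maze, step(agent, 'down'))
# Path 2: rotate first, then move left in the rotated maze.
rot_maze, rot_agent2 = rotate90_cw(maze, agent)
assert rot_agent == step(rot_agent2, 'left')  # the two paths coincide
```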
As we are looking to capture the symmetries present in an environment, a fitting place is within the framework of model-based RL (MBRL). MBRL leverages explicit world models to forecast the effect of action sequences, either in the form of next-state or immediate reward predictions. These imagined trajectories are used to construct plans that optimise the forecasted returns. In the context of the state-of-the-art MBRL agent MuZero (Schrittwieser et al., 2020), a Monte-Carlo tree search is executed over these world models in order to perform action selection. In this paper, we demonstrate that equivariance and MBRL can be effectively combined by proposing Equivariant MuZero (EqMuZero, shown in Figure 2), a variant of MuZero where equivariance constraints are enforced by design in its constituent neural networks. As MuZero does not use these networks directly to act, but rather executes a search algorithm on top of their predictions, it is not immediately obvious that the actions taken by the EqMuZero agent would obey the same constraints: is it guaranteed to produce a rotated action when given a rotated maze? One of our key contributions is a proof that guarantees this: as long as all neural networks are equivariant to a symmetry group, all actions taken will also be equivariant to that same symmetry group. Consequently, EqMuZero can be more data-efficient than standard MuZero, as it knows by construction how to act in states it has never seen before. On the practical side, we present a specific implementation of EqMuZero that is equivariant to 90° rotations by design, and we empirically verify its generalisation capabilities in two rotationally symmetric grid-worlds: procedurally-generated MiniPacman and the Chaser game in the ProcGen suite.
2 Background

2.1 Reinforcement Learning

The reinforcement learning problem is typically formalised as a Markov Decision Process (S, A, P, R, γ) formed from a set of states S, a set of actions A, a discount factor γ ∈ [0, 1], and two functions that model the outcome of taking action a in state s: the transition distribution P(s′|s, a) specifying the next-state probabilities, and the reward function R(s, a) specifying the reward. The aim is to learn a policy π(a|s), a function specifying (probabilities of) actions to take in state s, such that the agent maximises the (expected) cumulative reward G(tr) = Σ_{t=0}^{T} γ^t R(s_t, a_t), where tr = (s_0, a_0, s_1, a_1, . . . , s_T, a_T) is the trajectory taken by the agent starting in the initial state s_0 and following the policy to decide a_t based on s_t.

2.2 MuZero

Reinforcement learning agents broadly fall into two categories: model-free and model-based. The specific agent we extend here, MuZero (Schrittwieser et al., 2020), is a model-based agent for deterministic environments (where P(s′|s, a) = 1 for exactly one s′, for all s ∈ S and a ∈ A). MuZero relies on several neural-network components that are composed to create a world model, which estimates the unseen processes in the MDP. The neural networks are:

- The state encoder, h : S → Z, which embeds states into a latent space Z (e.g. Z = R^k).
- The action encoder, h_A : A → Z_A, which embeds actions into a latent space Z_A (e.g. Z_A = R^k).
- The transition function¹, τ : Z × A → Z, which predicts embeddings of next states.
- The reward function, ρ : Z × A → R, which predicts the immediate expected reward after taking an action in a particular state.
- The value function, v : Z → R, which predicts the value (expected cumulative reward) from a given state.

¹Note that both the transition and reward functions will implicitly call the action encoder, h_A, in order to appropriately embed the input action. We expand on this in Section 3.1.
- The policy function, p : Z → [0, 1]^{|A|}, which predicts the probability of taking each action from the current state. These probabilities should add up to 1: Σ_{a∈A} p(a|z) = 1.

Using these, MuZero computes its corresponding transition, reward, value and policy models respectively through composition of neural networks. To plan its next action, MuZero executes a Monte Carlo tree search (MCTS) over many simulated trajectories, generated using the above models. We use superscript notation to denote information belonging to planning, such as z^0 being the hidden representation of the root state in an MCTS step, and subscripts for interacting with the environment, such as a_t being the action taken at time t. Moreover, please note that we use s for input observations and z for latent-space representations, while the original MuZero paper (Schrittwieser et al., 2020) uses o and s respectively. MuZero has demonstrated state-of-the-art capabilities over a variety of deterministic or near-deterministic environments, such as Go, Chess, Shogi and Atari, and has been successfully applied to real-world domains such as video compression (Mandhane et al., 2022). Although here we focus on MuZero for deterministic environments, we note that extensions to stochastic environments also exist (Antonoglou et al., 2021) and are an interesting target for future work.

2.3 Groups, Representations and Symmetries

A group (G, ∘) is a set G equipped with a composition operation ∘ : G × G → G (written concisely as g ∘ h = gh), satisfying the following axioms:

- Associativity: (gh)l = g(hl) for all g, h, l ∈ G.
- Identity: there exists a unique e ∈ G satisfying eg = ge = g for all g ∈ G.
- Inverse: for every g ∈ G there exists a unique g⁻¹ ∈ G such that gg⁻¹ = g⁻¹g = e.

Groups are a natural way to describe symmetries: object transformations that leave the object unchanged.
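As a concrete illustration, the 4-element rotation group C4 used later in the paper can be modelled as integers under addition modulo 4; the sketch below checks the three axioms exhaustively:

```python
# A minimal model of the cyclic group C4 (rotations by multiples of 90 degrees):
# element k stands for a rotation by 90*k degrees, and composition is
# addition modulo 4.
C4 = [0, 1, 2, 3]
compose = lambda g, h: (g + h) % 4
e = 0                              # identity: rotation by 0 degrees
inverse = lambda g: (-g) % 4       # e.g. the inverse of r90 is r270

# Check the group axioms exhaustively over all elements.
for g in C4:
    assert compose(e, g) == compose(g, e) == g           # identity
    assert compose(g, inverse(g)) == e                   # inverse
    for h in C4:
        for l in C4:
            # associativity: (gh)l == g(hl)
            assert compose(compose(g, h), l) == compose(g, compose(h, l))
```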
Since group elements are just abstract set elements, in order to reason about them as transformations we define the notion of a group action act : G × Ω → Ω, where Ω is the space of the input data. For example, if Ω = Z_n² refers to the pixels of an image and G = Z_n² refers to circular translations, the group action act translates each pixel accordingly:

act((a, b), (u, v)) = ((u + a) mod n, (v + b) mod n)    (1)

where (a, b) ∈ G specifies the translation operation and (u, v) ∈ Ω is the pixel being translated. In most cases of interest, the group actions will be linear transformations of the data in Ω. We can thus reason about them in the context of linear algebra by using their real representations: functions ρ_V : G → R^{N×N} that give, for every group element g ∈ G, a real matrix demonstrating how this element acts on a vector space V. For example, for the rotation group G = SO(n), the representation ρ_V would provide an appropriate n × n rotation matrix for each rotation g. We perform the group action by multiplying the matrix corresponding to the representation with our data. Note that we use ρ to denote both a group representation and the reward function of MuZero. However, they should be easy to distinguish, as the representation function will always be applied to group elements such as g, whereas the reward function will be applied to neural network embeddings z.

2.4 Equivariance and Invariance

As symmetries are assumed to not change the essence of the data they act on, we would like to construct neural networks that adequately represent such symmetry-transformed inputs. Assume we have a neural network f : X → Y, mapping between vector spaces X and Y, and that we would like this network to respect the symmetries within a group G. Then we can impose the following condition, for all group elements g ∈ G and inputs x ∈ X:

f(ρ_X(g)x) = ρ_Y(g)f(x).
(2)

This condition is known as G-equivariance: for any group element, it does not matter whether we act with it on the input or on the output of the function f; the end result is the same. A special case of this, G-invariance, is when the output representation is trivial (ρ_Y(g) = I):

f(ρ_X(g)x) = f(x).    (3)

In geometric deep learning, equivariance to reflections, rotations, translations and permutations has been of particular interest (Bronstein et al., 2021). For the specific context of reinforcement learning and MuZero, the possible inputs to our neural networks may be both states and actions. Therefore, when further discussing these conditions, note that x may refer to either state or action representations, or both at once. Generally speaking, there are three ways to obtain an equivariant model: a) data augmentation, b) data canonicalisation and c) specialised architectures. Data augmentation creates additional training data by applying group elements g to input/output pairs (x, y); equivariance is encouraged by training on the transformed data and/or minimising auxiliary losses such as ‖ρ_Y(g)f(x) − f(ρ_X(g)x)‖. Data augmentation can be simple to apply, but it results in only approximately equivariant models. Data canonicalisation requires a method to standardise the input, such as breaking the translation symmetry for molecular representation by centering the atoms around the origin (Musil et al., 2021); however, in many cases such a canonical transformation may not exist. Specialised architectures have the downside of being harder to build, but they can guarantee exact equivariance; as such, they reduce the search space of functions, potentially reducing the number of parameters and increasing training efficiency. Bronstein et al.
(2021) discuss many ways in which one can construct state-of-the-art specialised architectures, from rudimentary approaches relying on exhaustive view generation (Cohen & Welling, 2016a), to more efficient approaches like Steerable CNNs (Cohen & Welling, 2016b; Weiler & Cesa, 2019), all the way to advanced approaches preserving gauge equivariance over manifolds and meshes (De Haan et al., 2020). All of these approaches can be leveraged to build constituent networks in our Equivariant MuZero framework.

2.5 Equivariance in RL and Related Work

There has been previous work at the intersection of reinforcement learning and equivariance. While leveraging multi-agent symmetries was repeatedly shown to hold promise (van der Pol et al., 2021; Muglich et al., 2022), of particular interest to us are the symmetries emerging from the environment, in a single-agent scenario. Related work in this space can be summarised by the commutative diagram in Figure 1. This diagram represents, as arrows, various transformations we might want to make on data that originates from RL. Some of these arrows are neural network layers (such as Enc : S → Z), some are action executions (such as a : S → S), while others are symmetry transformations (such as ρ_Z(g) : Z → Z). In a commutative diagram, each pair of paths connecting the same start and endpoint specifies a mathematical constraint: the compositions of arrows in these two paths must be the same transformation. For example, in Figure 1, consider the front-facing square of the lower cube, i.e., the path applying ρ_S(g) : S → S followed by Enc : S → Z, and the path applying Enc : S → Z followed by ρ_Z(g) : Z → Z. As these two paths both start in the same point and end in the same point, we recover the previously discussed G-equivariance condition on Enc:

Enc(ρ_S(g)s) = ρ_Z(g)Enc(s).    (4)

When considering only the cube at the bottom, we recover Park et al. (2022): a supervised learning task where a latent transition model T learns to predict the next state embedding.
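Conditions like Equations 2 and 4 can be checked numerically on a toy example; the sketch below (our illustration, not from the paper) uses circular translations acting on a 1D signal via np.roll, with a circular moving average followed by a pointwise nonlinearity standing in for a translation-equivariant layer:

```python
import numpy as np

rng = np.random.default_rng(0)

def shift(x, g):
    """Group action: circular translation of a 1D signal by g positions."""
    return np.roll(x, g)

def f(x):
    """A toy translation-equivariant 'layer': a circular moving average
    followed by a pointwise nonlinearity."""
    smoothed = (np.roll(x, -1) + x + np.roll(x, 1)) / 3.0
    return np.tanh(smoothed)

x = rng.normal(size=16)
for g in range(16):
    # Equivariance (Eq. 2): acting on the input and then applying f
    # equals applying f and then acting on the output.
    assert np.allclose(f(shift(x, g)), shift(f(x), g))

# A sum-reduction of the output is invariant (Eq. 3): the group acts
# trivially on a scalar.
assert np.isclose(f(x).sum(), f(shift(x, 5)).sum())
```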
They show that if T is equivariant, the encoder can pick up the symmetries of the environment even if it is not fully equivariant by design.

Figure 1: Commutative diagram of symmetries in RL. State transitions due to an action a are back-to-front, transformations due to a symmetry g are left-to-right, and state encoding and decoding by the model is bottom-to-top.

Mondal et al. (2022) build a model-free agent by combining an equivariant-by-design encoder and enforcing the remaining equivariances via regularisation losses. They also consider the invariance of the reward, captured in Figure 1 by taking the decoder to be the reward model and l = 1. The work of van der Pol et al. (2020) can be described by having the value model as the decoder, while the work of Wang et al. (2022) has the policy model as the decoder and l = |A|. While our work is the first to study the interplay of equivariance and Monte-Carlo tree search (MCTS)-backed MBRL algorithms such as MuZero, we would like to point out three recent related works that are generally inspired by exploiting symmetries in model-based planning. Firstly, SymPlan (Zhao et al., 2022) takes inspiration from the symmetries present in the Value Iteration algorithm (Bellman, 1966) to extend Value Iteration Networks (Tamar et al., 2016) with additional equivariance constraints. Note that, while there exists an MBRL inspiration in this work, it is still an implicit planner, i.e. an end-to-end agent model without any explicit invocation of model-based methods. Secondly, EDGI (Brehmer et al., 2023) is a recent extension of diffusion-based planning (Janner et al., 2022) that incorporates group equivariance constraints. As planning with diffusion naturally occurs in a high-dimensional space with rich geometry, imposing additional spatial constraints on it is a natural direction to follow. Lastly, in a work concurrent to ours, Zhao et al.
(2023) study how various path-planning algorithms (e.g., model predictive control) behave under symmetries, and derive arguments that are similar in spirit to our theoretical analysis of MuZero under equivariance.

3 Equivariant MuZero

This section describes Equivariant MuZero (EqMuZero), a variant of MuZero whose constituent neural networks (encoder, transition model, reward model, value model and policy model) are equivariant by design. Section 3.1 describes a specific implementation of EqMuZero, where the equivariance is with respect to the 4-element cyclic group C4, that is, the group of 90° rotations. Section 3.2 formally proves that if the components of EqMuZero are equivariant with respect to any symmetry group G, the action selection will also be equivariant with respect to G.

3.1 Implementation of C4-equivariant MuZero

We present a concrete instance of EqMuZero for the 4-element cyclic group C4, and we describe how its various components can be designed to obey C4-equivariance (Figure 2). The C4 group is applicable, for example, in rotationally symmetric grid-worlds, where the elements of the group represent rotating the grid-world by all four possible multiples of 90°, that is, C4 = {e, r90°, r180°, r270°}. Such grid-worlds are symmetric with respect to 90° rotations, in the sense that moving down in a grid-world map is the same as moving left in the 90° clockwise rotated version of the same map. As our exemplar, we will use the G-CNN architecture from Cohen & Welling (2016a), which is among the earliest prominent architectures that directly supports equivariance to small finite groups, such as C4. In what follows we assume the environment is a grid-world for concreteness; however, we note that our EqMuZero implementation is directly applicable to any environment with the same symmetry structure. Let s denote the state of the grid-world, represented as a 2D array (i.e. an image).
For simplicity, we assume there are only four directional movement actions in the environment, denoted by A = {↑, ↓, ←, →}. Any additional non-movement actions (such as the "do nothing" action) can be included without difficulty. We assume that the C4 group acts on states and actions in the obvious way: ρ_S(r90°)s = R90°s is the grid-world state rotated clockwise by 90°, and ρ_A(r90°)↓ = ←.

State and action encoders. To enforce C4-equivariance in the state encoder, we first need to specify the effect of rotations on the latent state z. In our implementation, the latent state consists of 4 equally shaped arrays, z = (z1, z2, z3, z4), and we prescribe that a 90° clockwise rotation manifests as a cyclical permutation: ρ_Z(r90°)z = (z2, z3, z4, z1). Then, our equivariant encoder embeds state s as follows:

h_eq(s) = (h(s), h(R90°s), h(R180°s), h(R270°s)) = (z1, z2, z3, z4)    (5)

where h is an arbitrary neural network (we use a CNN). Similarly, our equivariant action encoder embeds action a as follows:

h_A-eq(a) = (h_A(a), h_A(ρ_A(r90°)a), h_A(ρ_A(r180°)a), h_A(ρ_A(r270°)a)) = (z_{a;1}, z_{a;2}, z_{a;3}, z_{a;4})    (6)

where h_A is an MLP.

Transition model. We propose two ways of building a C4-equivariant transition model. The first one works by maintaining the structure in the latent space:

τ_eq(z, a) = (τ(z1, h_A-eq(a)1), τ(z2, h_A-eq(a)2), τ(z3, h_A-eq(a)3), τ(z4, h_A-eq(a)4)),    (7)

where h_A-eq(a)1 represents the first element of the tuple obtained from encoding action a. The network τ is arbitrary; in our implementation, we broadcast the output of h_A across all pixels of h's output, followed by a ResNet. This equation satisfies C4-equivariance, that is, τ_eq(ρ_Z(r90°)z, ρ_A(r90°)a) = ρ_Z(r90°)τ_eq(z, a). Our second transition model is less constrained but more involved, as it allows components of z to interact, while still retaining C4-equivariance:

τ_eq(z, a) = (τ(z̃1, z̃2, z̃3, z̃4), τ(z̃2, z̃3, z̃4, z̃1), τ(z̃3, z̃4, z̃1, z̃2), τ(z̃4, z̃1, z̃2, z̃3)).    (8)

For brevity, z̃1 denotes the pair (z1, h_A-eq(a)1).
Note that, in this version, τ takes more arguments than before. In our experiments, we use the more constrained variant for MiniPacman, and the less constrained variant for Chaser, as more data is available for the latter.

Policy model. We make a C4-equivariant policy by combining state and action embeddings from all four latents as follows:

p_eq(a | z) = 1/4 (p(a | z1) + p(ρ_A(r90°)a | z2) + p(ρ_A(r180°)a | z3) + p(ρ_A(r270°)a | z4))    (9)

It is straightforward to verify that Σ_{a∈A} p_eq(a | z) = 1, i.e. p_eq(· | z) is properly normalised, and that p_eq(ρ_A(r90°)a | ρ_Z(r90°)z) = p_eq(a | z), i.e. it satisfies C4-equivariance.

Figure 2: Architecture of Equivariant MuZero, where h, h_A are encoders, τ is the transition model, ρ is the reward model, v is the value model and p is the policy predictor. Each colour represents an element of the C4 group {e, r90°, r180°, r270°} applied to the input (environment state and action).

Reward and value models. Lastly, the reward and value networks (ρ_inv, v_inv) should be C4-invariant. We can satisfy this constraint by aggregating the latent space with any C4-invariant function, such as sum, average or max. Here we use summation:

ρ_inv(z, a) = ρ(z1, a) + ρ(z2, ρ_A(r90°)a) + ρ(z3, ρ_A(r180°)a) + ρ(z4, ρ_A(r270°)a)    (10)
v_inv(z) = v(z1) + v(z2) + v(z3) + v(z4).    (11)

Composing the equivariant components described above (Equations 5–10), we construct the end-to-end equivariant EqMuZero agent, displayed in Figure 2. While the specific implementation presented here represents a valid EqMuZero architecture for the C4 symmetry group, we reiterate that it is only an exemplar.
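As a sanity check on Equations 5, 9 and 11, the following NumPy sketch builds the four-view encoder, averaged policy and summed value around a toy base network h (a tanh of a random linear map, an assumption for illustration rather than the paper's CNN), and verifies encoder equivariance, policy equivariance and value invariance numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = ('up', 'right', 'down', 'left')
# r90 (clockwise) relabels actions: up->right, right->down, down->left, left->up
rho_A = {'up': 'right', 'right': 'down', 'down': 'left', 'left': 'up'}

W = rng.normal(size=(25, 8))    # toy weights for the base encoder h
Wp = rng.normal(size=(8, 4))    # toy weights for the per-view policy head
V = rng.normal(size=8)          # toy weights for the per-view value head

def h(s):
    """Toy base network: flatten the 5x5 state and apply a nonlinearity."""
    return np.tanh(s.reshape(-1) @ W)

def h_eq(s):
    """Equivariant encoder (Eq. 5): embed all four clockwise-rotated views."""
    return tuple(h(np.rot90(s, k=-i)) for i in range(4))

def rho_Z(z):
    """Latent action of r90: cyclically permute the four components."""
    return (z[1], z[2], z[3], z[0])

def p(z_i):
    """Toy per-view policy: softmax distribution over the four actions."""
    logits = z_i @ Wp
    ex = np.exp(logits - logits.max())
    return dict(zip(ACTIONS, ex / ex.sum()))

def p_eq(a, z):
    """Equivariant policy (Eq. 9): average the four per-view probabilities,
    relabelling the queried action once more for each successive view."""
    total, act = 0.0, a
    for z_i in z:
        total += p(z_i)[act]
        act = rho_A[act]
    return total / 4

def v_inv(z):
    """Invariant value (Eq. 11): sum the per-view values over the four views."""
    return sum(z_i @ V for z_i in z)

s = rng.normal(size=(5, 5))
s_rot = np.rot90(s, k=-1)          # 90-degree clockwise rotation of the state
z, z_rot = h_eq(s), h_eq(s_rot)

# Encoder equivariance: encoding the rotated state = permuting the latents.
assert all(np.allclose(u, w) for u, w in zip(z_rot, rho_Z(z)))
# Policy equivariance: the rotated action in the rotated state has the same
# probability as the original action in the original state.
for a in ACTIONS:
    assert np.isclose(p_eq(rho_A[a], z_rot), p_eq(a, z))
# Value invariance.
assert np.isclose(v_inv(z_rot), v_inv(z))
```

Note that the 1/4 factor in the policy is what keeps p_eq normalised: each per-view distribution sums to one, and the action relabelling is a bijection over A.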
If we were interested in expanding to a broader symmetry group, such as D4, which includes eight elements spanning both rotations and reflections, we could extend our representations to include all eight transformed views, or utilise a more memory-efficient approach such as Steerable CNNs (Cohen & Welling, 2016b; Weiler & Cesa, 2019; Zhao et al., 2022) or explicit symmetrisation (van der Pol et al., 2020). In fact, we now prove a theorem stating that, no matter the symmetry group of choice or the method used to ensure equivariance to it (Bronstein et al., 2021), so long as equivariance is guaranteed, EqMuZero will select actions in a way that respects the underlying symmetries.

3.2 Proof of Action Equivariance for Arbitrary Symmetry Groups

As discussed in Section 2.2, MuZero does not use its constituent networks directly for acting, but it uses them to perform a Monte-Carlo tree search (MCTS), from which an action is selected. Therefore, even if MuZero's constituent networks are fully equivariant, it is not obvious that the action selection is too. In other words, are we guaranteed that EqMuZero will behave in an equivariant manner; that is, will it always perform a rotated action in a rotated grid-world? The following theorem shows that the answer is yes.

Theorem 1. If all the relevant neural networks used by MuZero are G-equivariant, the proposed EqMuZero agent will select actions in a G-equivariant manner; that is, for every state s ∈ S and for every g ∈ G, if EqMuZero selects action a while in s, then it must select ρ_A(g)a while in ρ_S(g)s.

Proof. Assume that the EqMuZero agent uses the following equivariant neural networks: h_eq for the encoder (which also includes the action encoder, for simplicity), τ_eq for the transition model, p_eq for the policy model, v_inv for the value model and ρ_inv for the reward function. By assumption, h_eq, τ_eq and p_eq are G-equivariant, and v_inv and ρ_inv are G-invariant.
MuZero relies on multi-step rollouts using its internal models. Specifically, it requires simulating the reward, value, policy and transition models n steps in the future, starting from state s and simulating the action sequence a_1, a_2, . . . , a_n. We can simulate these rollouts as the following compositions of neural networks:

r^n = ρ_inv(τ_eq(· · · τ_eq(τ_eq(h_eq(s), a_1), a_2) · · · , a_{n−1}), a_n)
v^n = v_inv(τ_eq(· · · τ_eq(τ_eq(h_eq(s), a_1), a_2) · · · , a_n))
p^n = p_eq(τ_eq(· · · τ_eq(τ_eq(h_eq(s), a_1), a_2) · · · , a_n))
z^n = τ_eq(· · · τ_eq(τ_eq(h_eq(s), a_1), a_2) · · · , a_n).    (12)

As these rollout simulations are computed by composing several G-equivariant and G-invariant functions, they are themselves G-equivariant (in the case of p^n and z^n) and G-invariant (in the case of r^n and v^n). MuZero also computes returns starting from intermediate simulated state embeddings, G(z^k), where k is the depth of this state, assuming the actions simulated after this depth are a_{k+1}, a_{k+2}, . . . , a_{l+1}. The return is also a G-invariant function, as it is a linear combination of several G-invariant functions:

G(z^k) = Σ_{τ=0}^{l−k} γ^τ ρ_inv(z^{k+τ}, a_{k+1+τ}) + γ^{l−k} v_inv(z^l, a_{l+1}).    (13)

To prove that one planning step is equivariant, we need to show that the action selection is G-equivariant. Since the outcome of MuZero's MCTS function is based on the initial observation, s, we denote the internal state of MCTS as {Q^s(z, a), N^s(z, a), . . .}, for each node embedding z that we simulate as we expand the tree, and each action a. These values correspond respectively to Q-values, visit counts, expansion policy estimates, etc. We will use identical notation to Schrittwieser et al. (2020) for these states, even though we denote the MuZero models, h_eq, τ_eq, p_eq, ρ_inv and v_inv, somewhat differently. First, MuZero simulates a rollout by repeatedly selecting the next action to simulate, a^k, as follows:

a^k = argmax_a [ Q^s(z^{k−1}, a) + P^s(z^{k−1}, a) · (√(Σ_b N^s(z^{k−1}, b)) / (1 + N^s(z^{k−1}, a))) · (c1 + log((Σ_b N^s(z^{k−1}, b) + c2 + 1) / c2)) ]    (14)

where c1 and c2 are constant hyperparameters.
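For concreteness, the selection rule of Equation 14 can be sketched in a few lines of Python; the constants c1 = 1.25 and c2 = 19652 follow Schrittwieser et al. (2020), and the toy Q, N, P values below are illustrative assumptions:

```python
import math

def select_action(Q, N, P, c1=1.25, c2=19652.0):
    """One MCTS action-selection step in the style of Equation 14.
    Q, N, P map each action of a single node to its Q-value, visit count
    and prior policy probability, respectively."""
    total_visits = sum(N.values())
    def score(a):
        exploration = (P[a] * math.sqrt(total_visits) / (1 + N[a])
                       * (c1 + math.log((total_visits + c2 + 1) / c2)))
        return Q[a] + exploration
    return max(Q, key=score)

# A node where 'left' has the best Q-value so far, but 'up' has a strong
# prior and no visits yet, so exploration dominates:
Q = {'up': 0.0, 'left': 0.9, 'down': 0.1, 'right': 0.1}
N = {'up': 0, 'left': 50, 'down': 10, 'right': 10}
P = {'up': 0.4, 'left': 0.3, 'down': 0.15, 'right': 0.15}
print(select_action(Q, N, P))
```

With these counts the unvisited high-prior action wins, illustrating how the visit-count term drives exploration early in the search.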
Once the rollout is completed, the intermediate MCTS states are updated accordingly:

Q^s_t(z^{k−1}, a^k) = (N^s_{t−1}(z^{k−1}, a^k) Q^s_{t−1}(z^{k−1}, a^k) + G(z^{k−1})) / (N^s_{t−1}(z^{k−1}, a^k) + 1)
N^s_t(z^{k−1}, a^k) = N^s_{t−1}(z^{k−1}, a^k) + 1.    (15)

Note that we will, from now on, use shortened notation to represent group-transformed states, actions and embeddings. Specifically, we will use g_s s as shorthand for ρ_S(g)s, g_a a as shorthand for ρ_A(g)a, and g_z z as shorthand for ρ_Z(g)z, in order to maintain brevity. As discussed previously, we need to show that, for each MCTS internal state (e.g. N^s), if we assume h_eq, τ_eq, p_eq, ρ_inv and v_inv to be G-equivariant functions, the resulting state would also be G-equivariant under transformations of the initial observation. That is, for all s, a, z:

N^{g_s s}(g_z z, g_a a) = N^s(z, a).    (16)

To prove this, we will use induction on the number of backups performed by MCTS, t. We proceed:

Base case (t = 0):
N^{g_s s}_0(g_z z, g_a a) = N^s_0(z, a) = 0
Q^{g_s s}_0(g_z z, g_a a) = Q^s_0(z, a) = 0.    (17)

Case t:
N^{g_s s}_t(g_z z, g_a a) = N^s_t(z, a)
Q^{g_s s}_t(g_z z, g_a a) = Q^s_t(z, a).    (18)

We will start by showing that the states and actions expanded by MCTS under the initial G-transformed observation g_s s, (z̃^0, ã^1, z̃^1, ã^2, . . .), would exactly correspond to (g_z z^0, g_a a^1, g_z z^1, g_a a^2, . . .), where (z^0, a^1, z^1, a^2, . . .) are the states and actions expanded under the non-transformed observation, s. By equivariance of the encoder, z̃^0 = h_eq(g_s s) = g_z h_eq(s) = g_z z^0, as expected. Next, we show that the actions selected by MCTS also obey a G-equivariance constraint, in the sense that: if z̃^{k−1} = g_z z^{k−1}, then ã^k = g_a a^k. As we assumed N^s_t to be G-equivariant (Case t), it must hold that Σ_b N^s_t(z, b) is G-invariant (as a sum-reduction of equivariant functions).
Hence, we can rewrite Equation 14 as:

a^k = argmax_a [ Q^s_t(z^{k−1}, a) + P^s_t(z^{k−1}, a) · ϵ(z^{k−1}) / (1 + N^s_t(z^{k−1}, a)) ]    (19)

where ϵ is G-invariant, P^s is G-equivariant by composition of functions that are G-equivariant by assumption, and Q^s is G-equivariant by the assumption of Case t. Hence, using this formula to define ã^k, we recover:

ã^k = argmax_a [ Q^{g_s s}_t(z̃^{k−1}, a) + P^{g_s s}_t(z̃^{k−1}, a) · ϵ(z̃^{k−1}) / (1 + N^{g_s s}_t(z̃^{k−1}, a)) ]
    = argmax_a [ Q^{g_s s}_t(g_z z^{k−1}, a) + P^{g_s s}_t(g_z z^{k−1}, a) · ϵ(g_z z^{k−1}) / (1 + N^{g_s s}_t(g_z z^{k−1}, a)) ]
    = argmax_a [ Q^{g_s s}_t(g_z z^{k−1}, g_a g_a⁻¹ a) + P^{g_s s}_t(g_z z^{k−1}, g_a g_a⁻¹ a) · ϵ(g_z z^{k−1}) / (1 + N^{g_s s}_t(g_z z^{k−1}, g_a g_a⁻¹ a)) ]
    = argmax_a [ Q^s_t(z^{k−1}, g_a⁻¹ a) + P^s_t(z^{k−1}, g_a⁻¹ a) · ϵ(z^{k−1}) / (1 + N^s_t(z^{k−1}, g_a⁻¹ a)) ]
    = g_a argmax_a [ Q^s_t(z^{k−1}, a) + P^s_t(z^{k−1}, a) · ϵ(z^{k−1}) / (1 + N^s_t(z^{k−1}, a)) ].

We note that we have taken the g_a out of the argmax, which is an unambiguous operation only if there is a unique action a^k that maximises the expression in Equation 19. To avoid breaking the symmetry in practice, we propose that tiebreaks for a^k are resolved in a purely randomised fashion. Having shown this, we now only need to verify that the updates to N_t and Q_t (in Equation 15) are equivariant for all state-action pairs along the trajectory. Values of N and Q for all other state-action pairs will be unchanged from step t, and therefore trivially still G-equivariant. First we show this for N:

N^{g_s s}_{t+1}(z̃^{k−1}, ã^k) = N^{g_s s}_{t+1}(g_z z^{k−1}, g_a a^k)
    = N^{g_s s}_t(g_z z^{k−1}, g_a a^k) + 1
    = N^s_t(z^{k−1}, a^k) + 1
    = N^s_{t+1}(z^{k−1}, a^k).

Hence, Case t + 1 still holds for N. Now we turn our attention to Q. First, by invariance of ρ_inv and v_inv, we can show that G(z^k) is a sum of G-invariant functions and therefore also invariant. Plugging into the Q update:

Q^{g_s s}_{t+1}(z̃^{k−1}, ã^k) = Q^{g_s s}_{t+1}(g_z z^{k−1}, g_a a^k)
    = (N^{g_s s}_t(g_z z^{k−1}, g_a a^k) Q^{g_s s}_t(g_z z^{k−1}, g_a a^k) + G(g_z z^{k−1})) / (N^{g_s s}_t(g_z z^{k−1}, g_a a^k) + 1)
    = (N^s_t(z^{k−1}, a^k) Q^s_t(z^{k−1}, a^k) + G(z^{k−1})) / (N^s_t(z^{k−1}, a^k) + 1)
    = Q^s_{t+1}(z^{k−1}, a^k).
Hence, Case t + 1 also holds for Q. The other states stored by MCTS, such as P, are computed by directly evaluating the expressions in Equation 12. As discussed before, these expressions are G-equivariant by composition. Having proved that all internal state of MCTS consistently remains transformed by G under transformed input observations, we can conclude that the final policy given by MCTS will be exactly G-equivariant.

4 Experiments and results

4.1 Environments

We consider two 2D grid-world environments, MiniPacman (Guez et al., 2019) and Chaser (Cobbe et al., 2020), that feature an agent navigating in a 2D maze. In both environments, the grid-world state is represented as a 2D array (i.e. an image) and an action is a direction to move (one of {↑, ↓, ←, →})². Both of these grid-worlds are symmetric with respect to 90° rotations, in the sense that moving down in some map is the same as moving left in the 90° clockwise rotated version of the same map. Hence, we take the symmetry group to be C4 = {e, r90°, r180°, r270°}, the 4-element cyclic group, which in our case represents rotating the grid-world state by all four possible multiples of 90°.

4.2 Model hyperparameters

The architecture of all agents we study here is based on MuZero (Schrittwieser et al., 2020), with similar hyperparameters to Hamrick et al. (2020). The encoder and the transition model are implemented as standard ResNet modules (He et al., 2016) with a hidden dimension of 128. Each ResNet starts with a 3×3 convolutional layer, followed by layer normalisation (Ba et al., 2016) and five residual blocks, each of them consisting of the following sequence of operations: layer normalisation, ReLU, 3×3 convolution, layer normalisation, ReLU and another convolution, to which the input of the residual block is added. After the residual blocks are applied, a final ReLU activation is used.
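The residual block just described can be sketched in plain NumPy as follows (a toy channel count and zero "same" padding are our assumptions; the actual implementation details may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv3x3(x, w):
    """'Same'-padded 3x3 convolution; x: (H, W, Cin), w: (3, 3, Cin, Cout)."""
    H, W_, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W_, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            # Accumulate the contribution of each of the 9 filter taps.
            out += xp[i:i + H, j:j + W_] @ w[i, j]
    return out

def layer_norm(x, eps=1e-5):
    """Normalise over the channel dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, w1, w2):
    """LN -> ReLU -> 3x3 conv -> LN -> ReLU -> 3x3 conv, plus the skip."""
    y = conv3x3(np.maximum(layer_norm(x), 0), w1)
    y = conv3x3(np.maximum(layer_norm(y), 0), w2)
    return x + y

C = 8  # toy channel count; the paper uses a hidden dimension of 128
x = rng.normal(size=(5, 5, C))
w1, w2 = rng.normal(size=(3, 3, C, C)), rng.normal(size=(3, 3, C, C))
out = residual_block(x, w1, w2)
assert out.shape == x.shape  # residual blocks preserve the feature shape
```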
For training the models used by MuZero, we maintain a prioritised experience replay buffer (Schaul et al., 2015), with batch size 512, discount factor γ = 0.97 and a trajectory length of n = 10 for computing n-step returns. All models are trained using the Adam SGD optimiser (Kingma & Ba, 2014) with learning rate 10⁻³. To prototype the use of equivariance in various parts of the MuZero models, we augment these ResNets as described in Section 3.1. In implementation, this augmentation amounts to applying the ResNet to several C4-transformed views of the input, and appropriately aggregating or concatenating the results as specified.

4.3 Results

We compare our C4-equivariant implementation of EqMuZero (as described in Section 3.1) with a standard MuZero that uses non-equivariant components: ResNet-style networks for the encoder and transition models, and MLP-based policy, value and reward models, following Hamrick et al. (2020). As the encoder and the policy of EqMuZero are the only two components which require knowledge of how the symmetry group acts

²Note that, to conform with the API of ProcGen, Chaser also contains some actions which do not transform equivariantly, though they also have no direct effect on the game state if played. We note that such actions will have equivariant effects in equivariant states, and hence our framework as presented can support their inclusion.
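The view-augmentation trick can be sketched as follows; `base_net` is a stand-in for a (possibly non-equivariant) network tower, and averaging the rotated-back outputs is one of the aggregation choices mentioned above. The exact aggregation used per component (encoder, transition, policy) follows Section 3.1, so treat this as an illustrative sketch rather than the paper's implementation.

```python
import numpy as np

def base_net(x):
    # Stand-in for a ResNet tower: any image-to-image map, not assumed
    # equivariant on its own (the shift along axis 0 breaks the symmetry).
    return np.tanh(x + 0.5 * np.roll(x, 1, axis=0))

def c4_symmetrised(net, x):
    # Apply the network to all four C4-transformed views of the input,
    # rotate each output back, and aggregate (here: average). The result
    # is C4-equivariant regardless of what `net` does.
    outs = [np.rot90(net(np.rot90(x, k)), -k) for k in range(4)]
    return np.mean(outs, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))

# Equivariance check: rotating the input rotates the output identically.
y = c4_symmetrised(base_net, x)
y_rot = c4_symmetrised(base_net, np.rot90(x))
assert np.allclose(np.rot90(y), y_rot)
```

Concatenating the four rotated-back views along a channel axis, instead of averaging, would preserve more information while keeping the same symmetry property.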
Figure 3: Results on procedurally-generated MiniPacman (top) and Chaser from ProcGen (bottom). [Each panel plots reward against learner updates for the "same", "rotated" and "different" map settings, comparing Equivariant MuZero, Equivariant MuZero with a standard policy, Equivariant MuZero with a standard encoder, Standard MuZero, and Standard MuZero with an equivariant encoder.] The results are aggregated over three random seeds per model variant. The dark shaded area is one standard deviation from the mean, and the light shaded area is the minimum/maximum value across the seeds.

on the environment, we include the following ablations in order to evaluate the trade-off between end-to-end equivariance and general applicability: Standard MuZero with an equivariant encoder, Equivariant MuZero with a standard encoder, and Equivariant MuZero with a standard policy model. We train each agent on a set of maps, X. To test for generalisation, we measure the agent's performance in three progressively harder settings. Namely, we evaluate the agent on X with randomised initial agent positions (denoted "same" in our results), on the set of rotated maps RX, where R ∈ {R90°, R180°, R270°} (denoted "rotated") and, lastly, on a set of maps Y such that Y ∩ X = ∅ and Y ∩ RX = ∅ (denoted "different"). Figure 3 (top) presents the results of the agents on MiniPacman. First, we empirically confirm that, for EqMuZero, the average reward on layouts X, seen during training, matches the average reward gathered on the rotations of the same mazes, RX.
Second, we notice that replacing the equivariant policy with a non-equivariant one does not significantly impact performance. However, the same swap in the encoder brings the performance of the agent down to that of Standard MuZero; this suggests that the structure in the latent space of the transition model, when not combined with some explicit method of imposing equivariance in the encoder, does not provide noticeable benefits. Third, we notice that Equivariant MuZero is generally robust to layout variations, as the learnt high-reward behaviours also transfer to Y. At the same time, Standard MuZero drops significantly in performance on both Y and RX. We note that experiments on MiniPacman were done in a low-data scenario, using 5 maps of size 14×14 for training; we observed that the differences between agents diminished when all agents were trained with at least 20 times more maps.

Figure 3 (bottom) compares the performance of the agents on the ProcGen game Chaser, which has similar dynamics to MiniPacman, but larger mazes of size 64×64 and a more complex action space. Due to the complexity of the action space, we only use EqMuZero with a standard policy, rather than a fully equivariant version. We use 500 maze instances for training. Our results demonstrate that, even when the problem complexity is increased in this way, Equivariant MuZero still consistently outperforms the other agents, leading to more robust plans being discovered.

Our theoretical analysis proves that the internal decision-making of MuZero's MCTS algorithm and, hence, its derived visit counts will be G-equivariant, provided its constituent networks are G-equivariant. However, as discussed in MuZero (Schrittwieser et al., 2020), the actions actually executed by the MuZero agent depend on stochastically sampling the MCTS policy, whose logits are based on these visit counts.
As the action sampling is stochastic, executions of MuZero with different random seeds will yield different exact action sequences, leading to different downstream performance, even when the input maps are exactly rotated. Regardless, the simplification of the function search space for MuZero's neural models arising from equivariance constraints can indeed bring about clear improvements in performance, both on rotated maps and on entirely unseen ones. This is especially the case in simpler, lower-data settings such as the MiniPacman environment. In the Chaser environment, the agent is exposed to significantly visually richer input and a much larger dataset of maps. We posit that, in such a setting, rotated maps do not pose a substantially easier challenge than any other unseen map, yet the equivariant versions of MuZero are still capable of outperforming all other baselines we tested.

5 Limitations and future work

While the theory of Equivariant MuZero generalises to any symmetry group, in this work we test an instance where the component neural networks satisfy the criteria for the C4 group. Leveraging neural networks equivariant to continuous rotations, such as the group SO(3), would make EqMuZero applicable to different problems, such as molecular tasks. However, for components such as the encoder, the transition model and the policy, a different strategy would be required, as it is impossible to transform the input state and action with every element of an infinite group. Future work could consider enforcing the equivariance constraints via additional losses, possibly combined with an approach such as that of Park et al. (2022), keeping in mind that the theoretical guarantees would no longer apply in their current form. More generally, this work can also be composed with a module that discovers symmetries, such as in the work of Yang et al. (2023).
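The soft alternative mentioned above, enforcing equivariance through an auxiliary loss rather than by construction, can be sketched as follows. The sampling-based penalty, its name and the use of C4 rotations here are our own illustrative assumptions, not a method from the paper; for an infinite group one would sample group elements instead of enumerating them.

```python
import numpy as np

def equivariance_penalty(net, x, ks, rng, n_samples=2):
    # Penalise ||net(g.x) - g.net(x)||^2 for sampled group elements g.
    # Here g ranges over C4 rotations (k quarter-turns) for illustration.
    sampled = rng.choice(ks, size=n_samples, replace=False)
    loss = 0.0
    for k in sampled:
        loss += np.mean((net(np.rot90(x, k)) - np.rot90(net(x), k)) ** 2)
    return loss / n_samples

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6))

# A C4-equivariant map (pointwise scaling) incurs zero penalty...
assert np.isclose(equivariance_penalty(lambda z: 2.0 * z, x, [1, 2, 3], rng), 0.0)
# ...while a direction-dependent map is penalised.
assert equivariance_penalty(lambda z: np.roll(z, 1, axis=0), x, [1, 2, 3], rng) > 0.0
```

In training, such a term would be added with some weight to MuZero's standard losses; as noted above, the exact-equivariance guarantees would then no longer hold, since the constraint is only encouraged rather than enforced.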
It is important to note the key benefit of enforcing equivariance within a world model: it reduces the hypothesis space of our world model to only those functions respecting the relevant symmetries. This can, clearly, only improve the data efficiency of learning the world model, and when the complexities of the RL problem align nicely with these symmetric behaviours, our approach can be highly beneficial for downstream performance. That being said, we acknowledge that there exist tasks requiring hard exploration (such as Montezuma's Revenge), where the agent is tasked with discovering long sequences of actions before any reward can be observed. Furthermore, these action sequences may not be amenable to the same symmetries as the underlying environment (horizontal reflections in the case of Montezuma's Revenge). EqMuZero's innovations are not sufficient to resolve such challenges, but they are likely to be complementary with any exploration-based research that is conducted in the future.

6 Conclusions

We present Equivariant MuZero, a model-based agent that is, by construction, equivariant. We theoretically verify its properties with respect to general symmetry groups, proving the agent's overall equivariance provided the appropriate conditions are met by its constituent neural networks. Moreover, we empirically demonstrate that a C4-equivariant Equivariant MuZero agent generalises to unseen rotations of the training data, and also performs more robustly on test mazes, with diminishing returns when presented with 100 times more data.

References

Ioannis Antonoglou, Julian Schrittwieser, Sherjil Ozair, Thomas K Hubert, and David Silver. Planning in stochastic environments with a learned model. In International Conference on Learning Representations, 2021.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Richard Bellman. Dynamic programming.
Science, 153(3731):34–37, 1966.

Johann Brehmer, Joey Bose, Pim De Haan, and Taco Cohen. EDGI: Equivariant diffusion for planning with embodied agents. arXiv preprint arXiv:2303.12410, 2023.

Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.

Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In International Conference on Machine Learning, pp. 2048–2056. PMLR, 2020.

Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990–2999. PMLR, 2016a.

Taco S Cohen and Max Welling. Steerable CNNs. arXiv preprint arXiv:1612.08498, 2016b.

Pim De Haan, Maurice Weiler, Taco Cohen, and Max Welling. Gauge equivariant mesh CNNs: Anisotropic convolutions on geometric graphs. arXiv preprint arXiv:2003.05425, 2020.

Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J R Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, 2022.

Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sébastien Racanière, Théophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, Greg Wayne, David Silver, and Timothy Lillicrap. An investigation of model-free planning. In International Conference on Machine Learning, pp. 2464–2473. PMLR, 2019.

Jessica B Hamrick, Abram L Friesen, Feryal Behbahani, Arthur Guez, Fabio Viola, Sims Witherspoon, Thomas Anthony, Lars Buesing, Petar Veličković, and Théophane Weber. On the role of planning in model-based deep reinforcement learning. arXiv preprint arXiv:2011.04021, 2020.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Julian Ibarz, Jie Tan, Chelsea Finn, Mrinal Kalakrishnan, Peter Pastor, and Sergey Levine. How to train your robot with deep reinforcement learning: lessons we have learned. The International Journal of Robotics Research, 40(4-5):698–721, 2021.

Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Amol Mandhane, Anton Zhernov, Maribeth Rauh, Chenjie Gu, Miaosen Wang, Flora Xue, Wendy Shang, Derek Pang, Rene Claus, Ching-Han Chiang, et al. MuZero with self-competition for rate control in VP9 video compression. arXiv preprint arXiv:2202.06626, 2022.

Arnab Kumar Mondal, Vineet Jain, Kaleem Siddiqi, and Siamak Ravanbakhsh. EqR: Equivariant representations for data-efficient reinforcement learning. In International Conference on Machine Learning, pp. 15908–15926. PMLR, 2022.

Darius Muglich, Christian Schroeder de Witt, Elise van der Pol, Shimon Whiteson, and Jakob Foerster. Equivariant networks for zero-shot coordination. arXiv preprint arXiv:2210.12124, 2022.

Felix Musil, Andrea Grisafi, Albert P Bartók, Christoph Ortner, Gábor Csányi, and Michele Ceriotti. Physics-inspired structural representations for molecules and materials. Chemical Reviews, 121(16):9759–9815, 2021.

Jung Yeon Park, Ondrej Biza, Linfeng Zhao, Jan Willem van de Meent, and Robin Walters. Learning symmetric embeddings for equivariant world models. arXiv preprint arXiv:2204.11371, 2022.

Balaraman Ravindran. An algebraic approach to abstraction in reinforcement learning. University of Massachusetts Amherst, 2004.

Balaraman Ravindran and Andrew G Barto.
Approximate homomorphisms: A framework for non-exact minimization in Markov decision processes. 2004.

Sahand Rezaei-Shoshtari, Rosie Zhao, Prakash Panangaden, David Meger, and Doina Precup. Continuous MDP homomorphisms and homomorphic policy gradient. arXiv preprint arXiv:2209.07364, 2022.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.

Marwin H S Segler, Mike Preuss, and Mark P Waller. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555(7698):604–610, 2018.

Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. Advances in Neural Information Processing Systems, 29, 2016.

Elise van der Pol, Daniel Worrall, Herke van Hoof, Frans Oliehoek, and Max Welling. MDP homomorphic networks: Group symmetries in reinforcement learning. Advances in Neural Information Processing Systems, 33:4199–4210, 2020.

Elise van der Pol, Herke van Hoof, Frans A Oliehoek, and Max Welling. Multi-agent MDP homomorphic networks. arXiv preprint arXiv:2110.04495, 2021.

Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.

Dian Wang, Robin Walters, and Robert Platt. SO(2)-equivariant reinforcement learning. arXiv preprint arXiv:2203.04439, 2022.

Maurice Weiler and Gabriele Cesa. General E(2)-equivariant steerable CNNs. Advances in Neural Information Processing Systems, 32, 2019.
Jianke Yang, Robin Walters, Nima Dehmamy, and Rose Yu. Generative adversarial symmetry discovery, 2023.

Linfeng Zhao, Xupeng Zhu, Lingzhi Kong, Robin Walters, and Lawson LS Wong. Integrating symmetry into differentiable planning with steerable convolutions. In The Eleventh International Conference on Learning Representations, 2022.

Linfeng Zhao, Owen Lewis Howell, Jung Yeon Park, Xupeng Zhu, Robin Walters, and Lawson LS Wong. Can Euclidean symmetry help in reinforcement learning and planning? 2023.