# Structuring Representations Using Group Invariants

Mehran Shakerinava, Arnab Kumar Mondal, Siamak Ravanbakhsh
McGill University and Mila, Montréal, Canada
{mehran.shakerinava, arnab.mondal, siamak.ravanbakhsh}@mila.quebec

A finite set of invariants can identify many interesting transformation groups. For example, distances, inner products, and angles are preserved by Euclidean, Orthogonal, and Conformal transformations, respectively. In an equivariant representation, the group invariants should remain constant on the embedding as we transform the input. This gives a procedure for learning equivariant representations without knowing the possibly nonlinear action of the group in the input space. Rather than enforcing such hard invariance constraints on the latent space, we show how to use invariants for symmetry regularization of the latent space while guaranteeing equivariance through other means. We also show the feasibility of learning disentangled representations using this approach and provide favorable qualitative and quantitative results on downstream tasks, including world modeling and reinforcement learning.

## 1 Introduction

Sample-efficient representation learning is a critical open challenge in deep learning for AI. When we have prior information about transformations that are relevant to a particular domain, building representations that are aware of these transformations can lead to better sample efficiency and generalization. One way to use such symmetry priors is to make the network invariant to the given transformations. A generalization of this idea is called equivariance, where transforming the input transforms the output in a specific way. An equivariant network that makes good predictions for a particular input also generalizes to all transformations of that input, making symmetry a useful prior.

While recent years have witnessed a range of exciting equivariant deep models, there are several limitations. First, most equivariant networks constrain the network architecture, often requiring specialized implementations. Moreover, the transformations considered by existing methods are often assumed to be linear in both the input and representation space. This is the case for architectures designed for finite permutation groups and continuous Lie groups. Approaches that go beyond linear transformations in the input space often assume access to group information, i.e., the group member that transforms one input to another is known. This paper introduces a simple approach that addresses all of these limitations.

Our approach uses the invariants of a given linear representation of a transformation group. Invariants were previously used to connect different geometries and group theory in Klein's Erlangen program [32]. According to this view, geometries are concerned with quantities that are invariant under certain transformations. For example, Euclidean geometry is concerned with the length, angle, and parallelism of lines, among others, because Euclidean transformations preserve these. However, moving to the more general and less structured affine geometry, notions of distance and angle are no longer relevant, while parallelism remains an invariant of the geometry. The corresponding symmetry groups are examples of Lie groups that have a subgroup relation, $E(n) < \mathrm{Aff}(n)$, thereby enabling the groups to characterize a hierarchy (or lattice) of different geometries.

*These authors contributed equally to this work.*
From this geometric perspective, our proposal in this work is to induce a geometry on the embedding and make it equivariant to a given group by enforcing the invariants of the group's defining action. For example, distance is the invariant of Euclidean geometry, which means all distance-preserving transformations are Euclidean. Therefore, to enforce equivariance to the Euclidean group, it is sufficient to ensure that the embeddings of any two data points have the same distance before and after the same transformation of the inputs; see Figure 1. While this approach uses the defining action of different groups in the embedding space, the same group can have a non-linear and unknown action on the input space. In the pendulum example of Figure 1, the group $E(3)$ acts on the value of each input image pixel using an unknown and non-linear action. Moreover, this approach does not require the pairing of group members with transformations, a piece of information that is often unavailable.

Figure 1: $E(3)$-equivariant embedding for the pendulum. The input $x$ consists of a pair of images that identify both the angle and the angular velocity of a pendulum. The equivariant embedding learns to encode both: the true angle is shown by a change of color and the angular velocity by a change of brightness. The two circular ends (black and white) correspond to states of maximum angular velocity in opposite directions. The SymReg objective for the Euclidean group learns this embedding by preserving the pairwise distance between the codes before, $(f(x), f(x'))$, and after, $(f(t_X(g, x)), f(t_X(g, x')))$, transformations of the input by $t_X$. Therefore, the dashed lines have equal lengths. For the pendulum, the transformations take the form of applying positive or negative torque in some range.

In the rest of the paper, we arrive at the idea above from a different path: after reviewing related works in Section 2 and providing background in Section 3, Section 4 observes that equivariance, in its general form, can be a weak inductive bias. This is because having an injective code is sufficient for equivariance to any transformation group. However, in this manifestation of equivariance, the group action on the embedding can be highly non-linear. Since the simplicity of the action on the embedding seems essential for equivariance to become a useful learning bias, Section 5 proposes to regularize the group action on the code to make it "simple". This symmetry regularization (SymReg) objective is group-dependent and the essence of our approach. Enforcing geometric invariants in the latent space is proposed as a symmetry regularization. While we focus on equivariant representation learning through self-supervision, in principle, supervised tasks can also benefit from the proposed SymReg. An important benefit of a symmetry-based representation is its ability to produce disentangled representations through group decomposition [27]. Section 6 studies disentanglement using SymReg. Section 7 presents a range of experiments to understand its behavior and puts it in the context of comparable baselines.

## 2 Related Works

Finding effective priors and objectives for deep representation learning is an integral part of the quest for AI [3]. Among these priors, learning equivariant deep representations has been the subject of many works over the past decade.
Many recent efforts in this direction have focused on the design of equivariant maps [57, 13, 47, 34, 15, 23, 54, 19, 6] where the linear action of the group on the data is known. A particularly relevant example here is Villar et al. [54], which uses group invariants to construct equivariant maps where the group acts through its linear defining action in the input space. Due to this constraint, the application of these models has been focused on fixed geometric data such as images [36], sets [60, 45], graphs [39, 33], spherical data and the (special) orthogonal group [14, 1, 50, 22], the Euclidean group [52, 55, 24], or other physically motivated groups such as the Lorentz [4] or Poincaré group [54], among others. In the present work, the group action is unknown and possibly non-linear.

Our setup is closer to the body of work on generative representation learning [7, 11, 40], in which the (linear) transformation is applied to the latent space [46, 58, 35, 37, 16, 21]. Among these generative coding methods, the transforming autoencoder [29] is a closely related early work, which, in addition to equivariance, seeks to represent the part-whole hierarchy in the data. What additionally contrasts our work with the follow-up works on capsule networks [48, 38] is that SymReg is agnostic to the choice of architecture and training; we rely only on our objective function to enforce equivariance.

Since we consider learning equivariant representations through self-supervision, exciting recent progress in this area is also quite relevant [25, 42, 9, 53, 26, 61, 20, 41]. While the use of transformations is prominent in these works, in many settings the objective encourages invariance to certain transformations, making such models useful for invariant downstream tasks such as classification. Similar to many of these methods, we also use transformed pairs to learn a representation, with the distinction of learning an equivariant representation. An exception is the recent work of Dangovski et al. [17], which learns an equivariant representation by separating the invariant embedding from the pose, where the relative pose is learned through supervision. Therefore, in that work, in contrast to ours, one needs to know the transformation that maps one input to another.

When considering the Euclidean group, SymReg preserves distances in the embedding space under non-linear transformations of the input. This embedding should not be confused with an isometric embedding [51], where the objective is to maintain the pairwise distances between points across the input and the embedding space.

## 3 Background on Symmetry Transformations

We can think of transformations as a set of bijective maps on a domain $X$. Since these maps are composable, we can identify their compositional structure using an abstract group $G$. For this reason, such transformations are called group actions. To formally define transformation groups, we first define an abstract group. A group $G$ is a set equipped with a binary operation such that: the set is closed under the operation, $g g' \in G\ \forall g, g' \in G$; every $g \in G$ has a unique inverse such that $g g^{-1} = e$, where $e$ is the identity element of the group; and the group operation is associative, $(g g') g'' = g (g' g'')$. A $G$-action on a set $X$ is defined by a function $t: G \times X \to X$, which can be thought of as a bijective transformation parameterized by $g \in G$.
In order to maintain the group structure, the action should satisfy the following two properties: (1) the action of the identity is the identity transformation, $t(e, x) = x$; (2) the composition of two actions is equal to the action of the composition of group elements, $t(g, t(g', x)) = t(g g', x)$. The action $t$ is faithful to $G$ if the transformations of $X$ using each $g \in G$ are unique, i.e., $\forall g \neq g'\ \exists x \in X$ s.t. $t(g, x) \neq t(g', x)$. If a $G$-action is defined on a set $X$, we call $X$ a $G$-set. Many groups are defined through their defining action; for example, $SO(3)$ is the group of rotations in 3D space. While this defining action is a linear transformation, the same group can act non-linearly on $\mathbb{R}^n$ through an action $t: SO(3) \times \mathbb{R}^n \to \mathbb{R}^n$.

## 4 Equivariance is Cheap, Actions Matter

A symmetry-based representation or embedding is a function $f: X \to Z$ such that both $X$ and $Z$ are $G$-sets, and furthermore, $f$ knows about the $G$-actions, in the sense that transformations of the input using $t_X$ have the same effect as transformations of the output using some action $t_Z$:

$$f(t_X(g, x)) = t_Z(g, f(x)) \quad \forall g \in G,\ x \in X \qquad (1)$$

The following claim shows that, despite many efforts in designing equivariant networks, simply asking for the representation to be equivariant is not a strong inductive bias, and we argue that the action matters. Put another way, the strong performance of existing equivariant networks should be attributed to the fact that the group action on the embedding space is simple (linear).

**Proposition 4.1.** Given a transformation group $t_X: G \times X \to X$, the function $f: X \to Z$ is an equivariant representation if

$$\forall g \in G,\ x, x' \in X: \quad f(x) = f(x') \iff f(t_X(g, x)) = f(t_X(g, x')). \qquad (2)$$

That is, two embeddings are identical iff they are identical for all transformations. The proof is in the appendix. The condition above is satisfied by all injective functions, indicating that many functions are equivariant to any group.

**Corollary 4.2.** Any injective function $f: X \to Z$ is equivariant to any transformation group $t_X: G \times X \to X$, if we define the $G$-action on the embedding space as

$$t_Z(g, z) \doteq f(t_X(g, f^{-1}(z))) \quad \forall g \in G,\ z \in Z \qquad (3)$$

The ramifications of the above results for what follows are two-fold:

1. While injectivity ensures equivariance, the group action on the embedding, as shown in Equation (3), can become highly non-linear. Intuitively, this action recovers $x = f^{-1}(z)$, applies the group action $x' = t_X(g, x)$ in the input domain, and maps back to the embedding space with $f(x')$ to ensure equivariance. In the following, we push $t_Z$ towards a simple linear $G$-action through optimization of $f$. This objective can be interpreted as a symmetry regularization or a symmetry prior (SymReg).

2. Although Corollary 4.2 uses injectivity of $f$ on the entire $X$, we only need this on the data manifold. In practice, one could enforce injectivity on the training dataset $D$ using a decoder, architectural choices such as a momentum encoder [26], or loss functions defined on the training data, such as a hinge loss [25], $L_{\mathrm{hinge}}(f, D) = \sum_{x, x' \neq x \in D} \max(\epsilon - \|f(x) - f(x')\|, 0)$, or other losses that monotonically decrease with distance, such as $\frac{1}{\|f(x) - f(x')\|}$, or its logarithm, $-\log(\|f(x) - f(x')\|)$. In experiments, we use the logarithmic barrier function; a sketch of these losses is given below.
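To make the injectivity-enforcing losses above concrete, here is a minimal sketch in PyTorch (our choice of framework, which the paper does not prescribe), operating on a batch of codes $f(x)$. The margin `eps`, the numerical floor `min_dist`, and averaging rather than summing over pairs are illustrative assumptions.

```python
import torch

def pairwise_distances(z: torch.Tensor) -> torch.Tensor:
    """Euclidean distances between all pairs of rows in a batch of codes z: (B, d)."""
    return torch.cdist(z, z, p=2)

def hinge_injectivity_loss(z: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """L_hinge: max(eps - ||f(x) - f(x')||, 0), averaged over distinct pairs."""
    dist = pairwise_distances(z)
    off_diag = ~torch.eye(z.shape[0], dtype=torch.bool, device=z.device)
    return torch.clamp(eps - dist[off_diag], min=0.0).mean()

def log_barrier_injectivity_loss(z: torch.Tensor, min_dist: float = 1e-6) -> torch.Tensor:
    """Logarithmic barrier -log ||f(x) - f(x')||, the variant used in the experiments."""
    dist = pairwise_distances(z)
    off_diag = ~torch.eye(z.shape[0], dtype=torch.bool, device=z.device)
    return -torch.log(dist[off_diag].clamp(min=min_dist)).mean()
```

In practice, one of these terms would be added, with a weighting coefficient, to the SymReg objectives of the next section.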
## 5 Symmetry Regularization Objectives

In learning equivariant representations, we often do not know the abstract group $G$ and how it transforms our data, $t_X$. We assume that one can pick a reasonable abstract group $G$ that contains the ground-truth abstract group acting on the data, i.e., the $G$-action on the input may not be faithful. Our goal is to learn an $f: X \to Z$ that is equivariant w.r.t. the actions $t_X, t_Z$, where $t_X: G \times X \to X$ is unknown and $t_Z$ is some (simple) $G$-action on $Z$ of our choosing.

**More Informed but Less Practical Setting.** In the most informed case, the dataset also contains information about which group member $g \in G$ can be used to transform $x$ to $x_t$; that is, the dataset consists of triples $(x, g, x_t = t_X(g, x))$. By having access to this information, we can regularize the embedding using the following loss function:

$$L^{\mathrm{informed}}_G(f, D) = \sum_{(x, g, x_t) \in D} \ell\big(f(x_t) - t_Z(g, f(x))\big),$$

where $\ell$ is an appropriate loss function, such as the square loss. At its minimum, we have $f(x_t) = t_Z(g, f(x))$, or $f(t_X(g, x)) = t_Z(g, f(x))$, enforcing the equivariance condition of Equation (1). However, even if the optimal value is not reached, due to its injectivity, $f$ is still $G$-equivariant, and the objective above regularizes the $G$-action on the code. This informed setup is used in the equivariant contrastive learning of [17]. The assumption of having access to $g$ is realistic when we know the action $t_X$, so that we can generate $(x, g, x_t)$ triplets. Fortunately, using group invariants, we may still learn an equivariant embedding even if we do not have the group information tied to the dataset. Here, we first introduce our method for several well-known groups and then elaborate on the more general treatment.

**Example 1 (Euclidean Group).** The defining action of the Euclidean group $E(n)$ is the set of transformations that preserve the Euclidean distance between any two points in $\mathbb{R}^n$, a.k.a. isometries. These transformations are compositions of translations, rotations, and reflections. Since, in the real domain, all Euclidean isometries are linear and belong to $E(n)$, we can enforce the group structure on the embedding by ensuring that distances between the embeddings before and after any transformation match. For this, we need the dataset $D$ to be a set of pairs of pairs $((x, x_t = t_X(g, x)), (x', x'_t = t_X(g, x')))$, where $x, x'$ are transformed using the same unknown group member $g$. The distance-preservation loss below, combined with an injectivity-enforcing loss, is sufficient to produce an $E(n)$-regularized embedding:

$$L_{E(n)}(f, D) = \sum_{((x, x_t), (x', x'_t)) \in D} \ell\big(\underbrace{\|f(x) - f(x')\|}_{\text{distance before the transformation}} - \underbrace{\|f(x_t) - f(x'_t)\|}_{\text{distance after the transformation}}\big) \qquad (4)$$

For example, in the standard RL setup, where we have access to triplets $(s, a, s')$, we can easily form $D$ by unrolling an episode and collecting two different state transitions corresponding to a particular action. In practice, with a finite number of actions, we can efficiently generate this dataset by keeping a separate buffer for each action, where we store state transitions for that action and sample from that buffer to train the embedding function $f$. We provide the algorithm in Appendix C.

**Example 2 (Orthogonal and Unitary Groups).** The defining action of the orthogonal group $O(n)$ preserves the inner product between two vectors. The analogous group in the complex domain is the unitary group, which preserves the complex inner product. Our symmetry-regularization objective enforces this invariant:

$$L_{O(n)}(f, D) = \sum_{((x, x_t), (x', x'_t)) \in D} \ell\big(\langle f(x), f(x')\rangle - \langle f(x_t), f(x'_t)\rangle\big).$$

For the unitary group, one additionally needs to embed into the complex domain $Z = \mathbb{C}^n$, where the only difference is in the definition of the inner product.
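For concreteness, a minimal sketch of the informed objective $L^{\mathrm{informed}}_G$ introduced at the start of this section, with the square loss and assuming the chosen linear latent action $t_Z$ is supplied as a batch of matrices; the encoder `f` and all names here are our illustrative assumptions, not fixed by the paper.

```python
import torch

def informed_symreg_loss(f, x, g_mats, x_t):
    """L^informed_G: square loss between f(x_t) and t_Z(g, f(x)), where the
    chosen latent action t_Z(g, z) = g_mats @ z is linear.
    x, x_t: input batches with x_t = t_X(g, x); g_mats: (B, d, d) matrices."""
    z = f(x)                                                 # (B, d)
    z_pred = torch.bmm(g_mats, z.unsqueeze(-1)).squeeze(-1)  # t_Z(g, f(x))
    return ((f(x_t) - z_pred) ** 2).sum(dim=-1).mean()       # square loss
```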
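Likewise, a sketch of the uninformed invariant-matching objectives of Examples 1 and 2, again with the square loss; each batch element holds two input pairs transformed by the same unknown group element, matching the dataset layout of Equation (4).

```python
import torch

def euclidean_symreg_loss(f, x, x2, x_t, x2_t):
    """L_E(n) (Equation 4): match pairwise code distances before/after t_X."""
    d_before = (f(x) - f(x2)).norm(dim=-1)
    d_after = (f(x_t) - f(x2_t)).norm(dim=-1)
    return ((d_before - d_after) ** 2).mean()

def orthogonal_symreg_loss(f, x, x2, x_t, x2_t):
    """L_O(n): match code inner products before/after t_X."""
    ip_before = (f(x) * f(x2)).sum(dim=-1)
    ip_after = (f(x_t) * f(x2_t)).sum(dim=-1)
    return ((ip_before - ip_after) ** 2).mean()
```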
**Example 3 (Conformal Group).** The invariant of conformal geometry is the angle. In a Euclidean embedding, conformal transformations comprise combinations of translation, rotation, dilation, and inversion with respect to an $(n-1)$-sphere. To enforce this group structure, we need triplets of inputs before and after a transformation, $((x, x_t), (x', x'_t), (x'', x''_t))$, so that we can calculate an angle in the embedding. The conformal SymReg objective, which preserves angles, imposes a weaker constraint on the embedding than the distance preservation of the Euclidean group, since the latter implies the former. Moreover, it has the additional benefit that, compared to $L_{E(n)}$, the loss cannot be minimized by simply shrinking the embedding. Therefore, in practice, the injectivity-enforcing losses of Section 4 are not as crucial when using conformal symmetry regularization.

### 5.1 General Setting

Given a group $G$ acting linearly on a vector space $Z$, the invariant polynomials associated with this action are those polynomials satisfying $P(t_Z(g, z)) = P(z)\ \forall g \in G$. These polynomials form an algebra studied in the field of invariant theory [56, 44]. In particular, a relevant problem is the question of whether there exists a finite set of bases for the invariant polynomials of a given group representation. This question was one of Hilbert's 23 problems, and it was answered affirmatively by Hilbert himself for linear reductive groups, which include the classical Lie groups [28]. Our proposal, in its most general form, is to ensure the invariance of polynomial bases within the orbits of the latent space before and after transformation of the input.

Some examples of classical Lie groups and their invariants are: volume and orientation preservation by the special linear group, where the corresponding invariant polynomial is the determinant; the Lorentz and Poincaré groups, which are the analogs of the orthogonal and Euclidean groups in Minkowski space, respectively, and are therefore equipped with similar invariants; and the symplectic group, which preserves another bilinear form. Finite groups also possess invariants. We show this use of invariants for SymReg through the important example of the symmetric group.

**Example 4 (Symmetric Group).** Symmetric polynomials $P(z_1, \ldots, z_n)$ that are invariant under all permutations of the variables have a finite set of elementary bases: $e_1(z) = \sum_{1 \leq j \leq n} z_j$, $e_2(z) = \sum_{1 \leq j < k \leq n} z_j z_k, \ldots$
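As a closing illustration of the general recipe, here is a hedged sketch of a few of the invariants discussed in this section as differentiable functions of codes, together with a generic loss that matches any invariant before and after the (unknown) input transformation. The `eps` constants are our numerical-stability choices, and the functions are illustrative rather than the paper's implementation.

```python
import torch

def angle_invariant(z, z2, z3, eps: float = 1e-7):
    """Conformal invariant: the angle at z formed by the directions to z2 and z3."""
    u, v = z2 - z, z3 - z
    cos = (u * v).sum(-1) / (u.norm(dim=-1) * v.norm(dim=-1) + eps)
    return torch.acos(cos.clamp(-1.0 + eps, 1.0 - eps))

def determinant_invariant(z_stack):
    """SL(n) invariant: determinant of n codes stacked as (B, n, n) matrices."""
    return torch.linalg.det(z_stack)

def elementary_symmetric_invariants(z):
    """Symmetric-group invariants of Example 4 for codes z: (B, n):
    e1 = sum_j z_j and e2 = sum_{j<k} z_j z_k, via the Newton-Girard identity
    e2 = (e1^2 - sum_j z_j^2) / 2."""
    e1 = z.sum(dim=-1)
    e2 = 0.5 * (e1 ** 2 - (z ** 2).sum(dim=-1))
    return torch.stack((e1, e2), dim=-1)

def invariant_matching_loss(invariant, codes_before, codes_after):
    """Generic SymReg term: square loss between an invariant evaluated on codes
    before and after the same unknown transformation of the inputs."""
    inv_b, inv_a = invariant(*codes_before), invariant(*codes_after)
    return ((inv_b - inv_a) ** 2).mean()
```

For the conformal case, `codes_before` would be the triple $(f(x), f(x'), f(x''))$ and `codes_after` its transformed counterpart, mirroring the triplet dataset of Example 3.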