Unsupervised Learning of Equivariant Structure from Sequences

Takeru Miyato 1,2   Masanori Koyama 1   Kenji Fukumizu 3,1   (equal contribution)
1 Preferred Networks, Inc.   2 University of Tübingen   3 The Institute of Statistical Mathematics

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

In this study, we present meta-sequential prediction (MSP), an unsupervised framework to learn the symmetry from time sequences of length at least three. Our method leverages the stationary property (e.g., constant velocity, constant acceleration) of the time sequence to learn the underlying equivariant structure of the dataset by simply training the encoder-decoder model to be able to predict future observations. We will demonstrate that, with our framework, the hidden disentangled structure of the dataset naturally emerges as a by-product of applying simultaneous block-diagonalization to the transition operators in the latent space, a procedure commonly used in representation theory to decompose the feature space based on the type of response to group actions. We will showcase our method from both empirical and theoretical perspectives. Our result suggests that finding a simple structured relation and learning a model with extrapolation capability are two sides of the same coin. The code is available at https://github.com/takerum/meta_sequential_prediction.

1 Introduction

The recent evolution and successes of neural networks in machine learning have shown the importance of symmetry-aware neural network models [18, 45, 38, 63]. In particular, symmetries in the form of geometric/algebraic constraints have proven useful in various applications involving high-dimensional, highly structured observations. For example, recent literature in robotics and reinforcement learning has succeeded in exploiting the knowledge of geometrical symmetries to improve sample efficiency [62, 65] or to train a model that generalizes to unseen observations [58]. However, building an inductive bias that matches a given dataset of interest is challenging, and recent studies have been exploring ways to learn the symmetry itself from observational sequences. Many of these approaches consider settings with relatively restrictive assumptions or weak supervision. For example, [56] allows the trainer to use the knowledge of the identities of the actions used in making the transitions. Meanwhile, [4, 35, 34] essentially assume that the transition velocities of all sequences in the dataset are the same. These studies indicate that there is still much room left for the question of what is required of the dataset/model to enable the unsupervised learning of the equivariance relation corresponding to the underlying symmetry. This paper advances this investigation by showing that if the sequential dataset consists of time series with a certain stationary property (constant velocity, constant acceleration), we can learn the underlying symmetries by simply training the model to be able to predict the future with a linear transition in the latent space. Our theory in Section 3 shows that this strategy can learn a model that is almost equivariant. Moreover, we will experimentally show that, by training an encoder-decoder model in a framework of meta-learning which we call meta-sequential prediction (MSP), we can actually learn an equivariant model. In particular, we show that we can
learn a hidden equivariance structure in the dataset by splitting (1) the internal training step, which computes the prediction loss of the linear transition in the latent space, from (2) the external training step, which updates the encoder and decoder. We will also empirically show that, in alignment with group representation theory [42], the learned linear latent transitions in our framework can be simultaneously block-diagonalized, and that each block corresponds to a disentangled factor of variation.

2 Related works

Recently, numerous studies have explored ways to learn symmetry in a data-driven manner. There is a rich literature on unsupervised/weakly supervised approaches that use sequential datasets to exploit the structure that is shared across time, and they all differ in the type of inductive bias. For example, the object-centric approach introduces inductive bias in the form of architecture, and equips the model with pre-defined slots to be allocated to objects [39, 33, 40]. Meanwhile, [22, 1] assume that the symmetry to be found takes the form of an energy conservation law, and learn each variable in the law as a function of observations. While this approach assumes that some energy is preserved in each observation, we assume that transition parameters like velocity and acceleration are preserved within each observation. Other, more indirect forms of inductive bias include those relevant to distributional sparseness and symmetry defined through algebraic constraints. [31], for instance, assumed that every stationary component of a given time series is generated by a finite and independent latent time series. [41] proposed to sparsely model the transition with a distribution of large kurtosis. Our work belongs to a family of unsupervised learning methods that seek to find the underlying symmetry of the dataset based on the algebraic inductive bias that the transitions can be represented linearly in some latent space. In this sense, our inductive bias also has a connection to the Koopman operator [43, 25, 3]. We differ from these studies in that we aim to learn a common encoding function (i.e., a lifting function) under which a set of sequences following different dynamics can be described linearly. Also, [70] applied the Koopman operator to pedestrian walking sequences, and [23] used the Koopman operator to separate the foreground from the background. However, [70] does not set out their model to solve extrapolation, and neither [70] nor [23] discusses the natural algebraic decomposition of the latent space that results solely from the objective of predicting the unseen future.

Unsupervised learning with algebraic/geometrical constraints Many studies impose algebraic constraints that reflect some form of geometrical assumption. [16] uses a known coordinate-map parametrization of a Lie group family to construct a posterior distribution on the manifold. [54] assumes that the observations are dense enough on the data manifold to describe its tangent space, and exploits a property of random walks on the product manifold to decompose the data space. In the analysis of sequential datasets, [13, 59, 10, 56, 12] make some Lie-group-type assumptions about the transition. [56] also assumes that the identity of the actions in the sequences is known. As for approaches with a less explicitly geometrical touch, [35] uses a capsule structure in their probabilistic framework to model a finitely cyclic structure while retaining the computability of the posterior distribution.
By design, [35] assumes that all sequences in the dataset transition with the same cyclic velocity. [69] enforces the underlying transition action to be commutative. Some of these studies learn the representation so that the linear transition in the latent space can be explicitly computed [56, 12, 4]. In particular, [4] presents a theory suggesting that a representation without this feature would have topological defects, such as discontinuity. Our approach shares a similar philosophy with these works except that, instead of imposing a strong assumption about the underlying symmetry, we only make a relatively weak stationarity assumption about the dataset; although we assume each sequence to be transitioning with constant velocity/acceleration, we allow the velocity/acceleration to vary across different sequences.

Disentangled Representation Learning Disentangled structure [29, 30] is a form of symmetry that has also been actively studied. It is known that, under the i.i.d. assumption of examples, unsupervised learning of disentangled representations is not achievable without some inductive biases encoded in models and datasets [49]. In response to this work, subsequent works have explored different frameworks such as weakly-/semi-supervised settings [51, 50] and learning on sequential examples [41] to learn disentangled representations. For example, PhyDNet [24] disentangles the known physical dynamics from the unknown factors by preparing an explicit module called PhyCell. ICA [32] and recent works [71, 64] also discuss the identifiability of learned representations. Classical methods like [28, 6, 36] take the approach of incorporating the inductive bias in the form of a probabilistic model. We are different from many previous methods in that we do not equip our model with an explicit disentanglement framework. Our method achieves disentanglement as a by-product of training a model that can predict the future linearly in the latent space. The set of latent linear transformations estimated by our method for different time sequences can be simultaneously block-diagonalized, and the latent space of each block corresponds to a disentangled feature. Our data assumption about constant velocity/acceleration might be similar in taste to the setting used by [32], in which the observed time series can be split into a finite number of stationary components.

3 Learning of equivariant structure from stationary time sequences

Our goal is to learn the underlying symmetry structure of a dataset in an unsupervised way that helps us predict the future. What do we humans require of a dataset when we are tasked to, for example, predict where a thrown ball would be in the next second? We hypothesize that we solve such a prediction task by analyzing a short, past time frame with a certain stationary property (e.g., constant velocity/acceleration). Indeed, people with good dynamic visual acuity can chase a fast-moving object, because they can identify such a short stationary time frame and use it to predict the near future linearly in their latent space. Based on this intuition, we propose to provide the trainer with a dataset consisting of constant-velocity/acceleration sequences. We formalize this idea below.

Dataset structure Our dataset S consists of sequences in some ambient space X, so that each member s ∈ S takes the form s = [s_t ∈ X ; t = 1, . . . , T].
Because we want s to describe a sequence that transitions with constant velocity, we assume that all s_t in a given instance of s ∈ S are related by a fixed transition operation g ∈ G so that s_{t+1} = g · s_t for all t, where G is the set of transition operators on X and each g ∈ G acts on x ∈ X by sending x to g · x. We assume that X is closed under G; that is, g · x ∈ X for all x ∈ X, g ∈ G. We allow G to be continuous as well, so that g might not have a finite order (for instance, if g is a rotation with speed 2πr for irrational r, no finite repetition of g agrees with the identity mapping). This way, our setting is different from that used in [35], which explores a cyclic structure using capsules of the same size. We emphasize that the transition action g is generally assumed to differ across the different members of S. For example, if G is a set of rotations and X a set of images, then the rotational speed, the direction, and the initial image may all be different for any two distinct sequences s and s′. Because each instance of S is characterized by s_1 and g, we may write s(s_1, g) to denote a sequence that begins with initial frame s_1 and transitions with g. Summarizing, S is a subset of {[g^t · s_1 ; t = 0, . . . , T−1] ; s_1 ∈ X, g ∈ G}.

Prediction framework through equivariance Our strategy is to exploit the stationary property of each s ∈ S to seek an invertible continuous function Φ : X → R^{a×m} such that there exists some M : G → R^{a×a} satisfying

M_g Φ(x) = Φ(g · x) for all x ∈ X and g ∈ G.    (1)

This relation is known as equivariance [11], and this type of tensorial latent space has also been used in [44] for the unsupervised learning of the structure underlying the dataset. In other words, we seek a model in which X and R^{a×m} are invertibly related by an equivariance relation with respect to G, where g ∈ G acts on Φ(x) ∈ R^{a×m} via the map Φ(x) ↦ M_g Φ(x) with M_g ∈ R^{a×a}. We assume m > 1 in our study.^1 In this framework, we predict the sequence s(s_1, g) with the relation Φ^{-1}(M_g Φ(s_{t−1})) = s_t. When G is a group, (1) would imply M_{gh} Φ(x) = Φ(gh · x) = M_g Φ(h · x) = M_g M_h Φ(x) for all x, and the map g ↦ M_g is called a representation of G [9, 67, 10]. Because we are aiming to establish a framework in which the representation of each action g can be explicitly computed, our philosophy has much in common with the proposition of [4]. This approach is in contrast to [35], which encodes a predefined cyclic structure in the model.

^1 We note that, when m > 1, the action of M_g on R^{a×m} can be realized by applying the same M_g to m copies of R^a. In other words, our prediction framework assumes that there are m subspaces that react to g in the same way. This m > 1 assumption is also considered in [4]. The case in which there are no copies of subspaces that act in the same way is called multiplicity-free in the literature of representation theory [9, 19], and is known to be a special case that happens only under a restrictive condition on the dimension of the observation space [46]. A similar idea has been used in model architecture as well [15].

Figure 1: Visualization of intra-orbital homogeneity vs. full equivariance. During the training, the model was trained to satisfy M(g, s_1) = M(g, g · s_1) for all g and s_1. When intra-orbital homogeneity holds, M(g, x) = M(g, h · x) for all h, g ∈ G and x. When full equivariance holds, M(g, x) is invariant across different orbits.
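As a concrete illustration of the dataset structure described above, the following is a minimal sketch (ours, not part of the paper's released code; it assumes NumPy and SciPy, and all names are hypothetical) that generates constant-velocity rotation sequences, so that each sequence satisfies s_{t+1} = g · s_t for a sequence-specific rotation g whose angular velocity is sampled once per sequence, roughly mirroring the Sequential MNIST construction of Section 4.

```python
import numpy as np
from scipy.ndimage import rotate

def constant_velocity_sequence(image, T=4, rng=None):
    """Generate s = [s_1, ..., s_T] with s_{t+1} = g . s_t, where g is a rotation
    by a fixed angular velocity sampled once for this sequence."""
    rng = np.random.default_rng() if rng is None else rng
    omega = rng.uniform(-90.0, 90.0)  # degrees per step; roughly [-pi/2, pi/2)
    return np.stack([rotate(image, omega * t, reshape=False, order=1) for t in range(T)])

def make_dataset(images, T=4, seed=0):
    rng = np.random.default_rng(seed)
    return np.stack([constant_velocity_sequence(x, T, rng) for x in images])

# toy usage: 100 random 32x32 "images", sequences of length T = 4
S = make_dataset(np.random.rand(100, 32, 32), T=4)
print(S.shape)  # (100, 4, 32, 32)
```

The velocity differs across sequences but is constant within each sequence, which is the only stationarity the method relies on.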
3.1 Learning equivariance relation from stationary sequential dataset

However, training a model satisfying (1) with just the constant-velocity assumption is not a trivial task, because this model assumption only assures that, for each sequence s(s_1, g), there is a sequence-specific operator M(g, s_1) that is guaranteed only to predict the sequence that transitions with g and begins from s_1, in the sense that M(g, s_1) Φ(s_t) = Φ(s_{t+1}) = Φ(g · s_t) (the leftmost panel in Figure 1). In order to satisfy full equivariance ((1), or the rightmost panel in Figure 1), M(g, s_1) must not depend on s_1 (i.e., it must be homogeneous with respect to s_1). At the same time, because the constant-velocity assumption applies to each sequence over all time intervals, it at least assures that the latent transition M is well defined within each sequence; that is, M(g, s_1) = M(g, g · s_1) = M(g, g^2 · s_1) and so on. It turns out that, with some regularity assumptions on the model and the choice of G, we can extend this observation to say that M satisfies intra-orbital homogeneity (the middle panel in Figure 1); that is, M(g, x) is constant on the orbit G · x = {g · x ; g ∈ G} for each x.

Proposition 3.1. Suppose that Φ(s_t) = M(g, s_1) Φ(s_{t−1}) for all s and t. If m > a and if G is a compact commutative Lie group, then M satisfies intra-orbital homogeneity.

Also, if M satisfies intra-orbital homogeneity, M(g, x) and M(g, x′) for any pair (x, x′) in different orbits G · x ≠ G · x′ can be shown to be at least similar.

Proposition 3.2. Suppose that M(g, x) satisfies intra-orbital homogeneity, and suppose that G is a compact connected group. If M(g, x) is continuous with respect to x, then for all (x, x′), there exists some P such that P M(g, x) P^{-1} = M(g, x′).

Thus, much of the equivariance property can be satisfied automatically by training the representation on a set of stationary sequences. Interestingly, as we experimentally demonstrate later, our training method in the next section successfully learns a fully equivariant Φ without explicitly enforcing the change of basis P to be I.

3.2 Learning Φ via solving a meta-sequential prediction task

We propose a meta-learning way to learn a homeomorphic function Φ : X → R^{a×m} with the equivariance property by seeking an injective Φ such that M(g, s_1) Φ(s_t) = Φ(s_{t+1}) for all g ∈ G, s_1 ∈ X. We learn such a Φ by casting this problem as a meta-learning problem in which M(g, s_1) is to be internally estimated for each s. In other words, we seek a pair of an encoder Φ and a decoder Ψ such that

L(Φ, Ψ | s) = min_M Σ_{(s_t, s_{t+1}) ⊂ s} ‖Ψ(M Φ(s_t)) − s_{t+1}‖²₂

is optimized for each s. We conduct this optimization by splitting each s = {s_1, . . . , s_T} into a conditional time sequence s_c = {s_1, . . . , s_{T_c}} and a validation time sequence s_p = {s_{T_c+1}, . . . , s_T}, using the former for the internal optimization of M to force the linear algebraic relation in the latent space and using the latter for the prediction loss. More precisely, we solve the following optimization problem for Φ:

L_p(Φ, Ψ) := Σ_s Σ_{t=T_c+1}^{T} ‖Ψ(M*(s_c|Φ)^{t−T_c} Φ(s_{T_c})) − s_t‖²₂,  where  M*(s_c|Φ) = argmin_M Σ_{t=1}^{T_c−1} ‖M Φ(s_t) − Φ(s_{t+1})‖²_F.    (2)

Since M is obtained from the latent sequence in the internal optimization, we call this learning framework meta-sequential prediction (MSP). It might appear as if we could also set s = s_c and optimize the following reconstruction version of Eq. (2):

L_r(Φ, Ψ) := Σ_s Σ_{t=2}^{T_c} ‖Ψ(M*(s|Φ)^{t−1} Φ(s_1)) − s_t‖²₂.    (3)

However, as we will see in the experiment section, the use of the validation sequence s_p makes a substantial difference in the learned representation.
This is most likely because the minimization of the validation error of M* on s_p would encourage Φ to exclude the s_c-specific information from the transition M*. We will illustrate this effect in the experiment section. We shall note that we can also parameterize M as M := exp(A), where A ∈ R^{a×a} is a Lie algebra element to be internally optimized. This type of approach was used in [14] for building a Lie group convolutional network and in [59, 12] for predicting a sequence that is not necessarily stationary. In order to train their model, [59] used additional parameters to diagonalize each algebra element as well as hyperparameters to stabilize the training. We also experimented with a Lie-algebra-style representation of M and used SGD to internally optimize the exponent parameters, but we needed to carefully tune the hyperparameter of the norm regularization term to stabilize the training and never really succeeded in training the model without collapsing. Our internal optimization procedure is free of such parameter tuning.

Figure 2: Overview of meta-sequential prediction (MSP) when T_c = 2 and T_p = t. After the encoder encodes the observations into tensor representations Φ(s_1), Φ(s_2), the method solves the least-squares problem M* = argmin_M ‖M Φ(s_1) − Φ(s_2)‖²_F. The model then predicts the future observations by ŝ_{t+2} = Ψ((M*)^t Φ(s_2)). These processes (including the linear problem) are all differentiable.

Internal Optimization of M Because the internal optimization in Eq. (2) is a linear problem, it can be solved analytically as

M*(s_c|Φ) = H_{+1} H_{+0}^†,    (4)

where H_{+0} = [Φ(s_1); . . . ; Φ(s_{T_c−1})] ∈ R^{a×(T_c−1)m} and H_{+1} = [Φ(s_2); . . . ; Φ(s_{T_c})] ∈ R^{a×(T_c−1)m} are the horizontal concatenations of the encoded frames, and H_{+0}^† is the Moore-Penrose pseudo-inverse of H_{+0}. Because M*(s_c|Φ) is a closed form with respect to Φ, the loss (2) can be directly optimized by differentiating it with respect to the parameters of both Φ and Ψ. Thus, the training is done in an end-to-end manner. Figure 2 summarizes the overall procedure to make a prediction on a given sequence when T_c = 2, T_p = t. We note that, although we have assumed the dataset to consist of constant-velocity sequences, we can readily extend our method to datasets consisting of time series with higher-order stationarity, such as constant acceleration. See Section 4.4 for the detailed explanation of the model extension and the experimental results.
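The following PyTorch sketch paraphrases Eqs. (2) and (4) (our own sketch, not the released implementation; `encoder` and `decoder` are assumed to map a batch of images to R^{a×m} latent tensors and back, and the shapes follow the a = 16, m = 256 setting used later in the experiments):

```python
import torch

def lstsq_transition(H0, H1):
    """Closed-form M* = H_{+1} H_{+0}^+ (Eq. (4)); H0, H1: (B, a, (Tc-1)*m)."""
    return H1 @ torch.linalg.pinv(H0)  # (B, a, a), differentiable w.r.t. H0 and H1

def msp_loss(encoder, decoder, s, Tc=2, a=16, m=256):
    """Meta-sequential prediction loss (Eq. (2)) for a batch s of shape (B, T, C, H, W)."""
    B, T = s.shape[:2]
    z = encoder(s.flatten(0, 1)).view(B, T, a, m)          # Phi(s_t) in R^{a x m}
    H0 = torch.cat(list(z[:, :Tc - 1].unbind(1)), dim=-1)  # [Phi(s_1); ...; Phi(s_{Tc-1})]
    H1 = torch.cat(list(z[:, 1:Tc].unbind(1)), dim=-1)     # [Phi(s_2); ...; Phi(s_{Tc})]
    M = lstsq_transition(H0, H1)                           # internal optimization on s_c
    loss, zt = 0.0, z[:, Tc - 1]                           # roll out from Phi(s_{Tc})
    for t in range(Tc, T):                                 # prediction loss on s_p
        zt = M @ zt
        loss = loss + ((decoder(zt) - s[:, t]) ** 2).flatten(1).sum(-1).mean()
    return loss
```

Because the pseudo-inverse is differentiable, backpropagating this loss updates Φ and Ψ end-to-end; the reconstruction variant L_r of Eq. (3) would instead fit M on all frames and roll it out on those same frames rather than on held-out ones.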
3.3 Irreducible decomposition of M*'s

Representation theory guarantees that, if G is a compact connected group, any representation D : G → R^{a×a} can be simultaneously block-diagonalized; that is, there is a common change of basis U such that V_g := U D_g U^T = ⊕_j V_g^{(j)}, where each V_g^{(j)} is called an irreducible representation and cannot be block-diagonalized any further [42, 10, 67]. The equivariance of our Φ, which we show in the experimental section, suggests that the M*(s|Φ) may be simultaneously block-diagonalizable as well. This block-diagonalization sometimes reveals a disentanglement structure, because any irreducible representation of G_1 × G_2 is of the form V^{(1)} ⊗ V^{(2)}, where V^{(k)} is an irreducible representation of G_k.

Figure 3: Simultaneous block-diagonalization (SBD) applied to the set of M*'s obtained from 3DShapes sequences. SBD finds the common change of basis U under which all matrices V* = U M* U^{-1} simultaneously take the form of block-diagonal matrices with the same block positions. For clarity, we provide in this figure the visualizations of M̃* := M* − I and Ṽ* := V* − I instead of M* and V*.

In particular, if the irreducible representations of the M*'s have the form V^{(1)} ⊗ 1 or 1 ⊗ V^{(2)}, then each block corresponds either to the action of G_1 or to that of G_2.^2

^2 V : G → {1} is a valid irreducible representation for any G.

To find a U that simultaneously block-diagonalizes all M*(s|Φ), we optimized U based on the following objective function, which measures the block-ness of V^{(s)} := U M*(s|Φ) U^{-1} based on the normalized graph Laplacian operator L̃:

L_bd(V^{(s)}) := ‖L̃(A(V^{(s)}))‖_trace = Σ_{d=1}^{a} σ_d(L̃(A(V^{(s)}))),    (5)

where A(V^{(s)}) = abs(V^{(s)}) abs(V^{(s)})^T, with abs(V^{(s)}) representing the matrix such that abs(V^{(s)})_{ij} = |V^{(s)}_{ij}|. Our objective function is based on the fact that, if we are given an adjacency matrix A of a graph, then the number of connected components in the graph can be identified by looking at the rank of the graph Laplacian. For the derivation, please see Appendix E. Through this decomposition, we are able to uncover the hidden block structure of the M*'s. See Figure 3 for the actual block-decompositions of the M*'s through our simultaneous block-diagonalization. We show in Section 4.3 that each block component of V* with the optimized U corresponds to a disentangled factor of variation in the dataset.
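A sketch of one way to implement this block-ness objective (our reading of Eq. (5); the exact normalization and optimization details live in the paper's Appendix E, so treat this as an assumption-laden approximation): form V^{(s)} = U M* U^{-1}, build A = abs(V) abs(V)^T, and penalize the trace of the normalized graph Laplacian, which for this symmetric positive semi-definite matrix equals the sum of its singular values.

```python
import torch

def blockness_loss(M, U):
    """Eq. (5)-style objective. M: (B, a, a) transition matrices M*(s|Phi),
    U: (a, a) learnable change of basis shared across all sequences."""
    V = U @ M @ torch.linalg.inv(U)                   # V^(s) = U M* U^{-1}
    A = V.abs() @ V.abs().transpose(-1, -2)           # adjacency-like matrix abs(V) abs(V)^T
    d_inv_sqrt = torch.diag_embed(A.sum(-1).clamp_min(1e-8).rsqrt())
    L = torch.eye(A.shape[-1], device=A.device) - d_inv_sqrt @ A @ d_inv_sqrt  # normalized Laplacian
    # trace == sum of eigenvalues == nuclear norm here, since L is symmetric PSD
    return L.diagonal(dim1=-2, dim2=-1).sum(-1).mean()

# hypothetical usage: fit U on transition matrices collected from training sequences
a = 16
U = torch.nn.Parameter(torch.eye(a) + 0.01 * torch.randn(a, a))
optimizer = torch.optim.Adam([U], lr=1e-3)
# for M_batch in transition_matrix_loader:
#     optimizer.zero_grad(); blockness_loss(M_batch, U).backward(); optimizer.step()
```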
4 Experiments

We conducted several experiments to investigate the efficacy of our framework. In this section, we briefly explain the experimental settings. For more details, please see Appendix D. We tested our framework on Sequential MNIST, 3DShapes [5], and SmallNORB [48]. Sequential MNIST is created from the MNIST dataset [47]. For all experiments, we used a ResNet [26]-based encoder-decoder architecture and we set a = 16 and m = 256, so that the latent space lives in R^{16×256}. For Sequential MNIST, we chose our G to be the set of all combinations of three types of transformations (shape rotation, hue rotation, and translation), and randomly sampled a single instance of g ∈ G for each sequence (see Appendix D for examples of sequences). To create each sequence, we first resized the MNIST image to 24×24, applied repetitions of a randomly sampled, fixed member g ∈ G, and embedded the results into 32×32 images. For shape and hue rotations, we randomly sampled the angular velocity from the uniform distribution on the interval [−π/2, π/2) for each sequence. For translation, we randomly sampled the start point and end point in the range of [−10, 10], and then moved the digit images on a straight line between the sampled points. We also experimented on Sequential MNIST with background (Sequential MNIST-bg). For Sequential MNIST-bg, we used the same generation rule as Sequential MNIST but added background images behind the moving digits. For the background, we used randomly sampled images from ImageNet [57], which were all resized to 32×32. Also, we only used the images of digit 4 for most of the experiments on Sequential MNIST/MNIST-bg. Unless otherwise noted, all evaluations in this paper for Sequential MNIST are based on training with only digit 4. 3DShapes and SmallNORB are datasets with multiple factors of variation. We created a set of constant-velocity sequences from these datasets by varying a fixed combination of factors for each sequence. That is, on these datasets, we chose our G to be the set of variations of factors, and sampled each g as ∏_i g_i^{ℓ_i}, where g_i represents the increase of the i-th factor by one unit and ℓ_i ∈ Z. Thus, the value of ℓ_i represents the velocity in the direction of the i-th factor on the grid. For 3DShapes, we chose wall hue, floor hue, object hue, scale, and orientation as the factors to vary. We varied elevation and azimuth for SmallNORB. For the split of each sequence into (s_c, s_p), we set T_c = 2 and T_p = 1 in all of the constant-velocity experiments.

Figure 4: (a) Predictions made by meta-sequential prediction on Sequential MNIST and MNIST-bg. The ground-truth sequence is placed below the predictions, with the first two images representing s_c. (b) Typical failure examples generated by the comparative methods: (1), (2), (3), and (4) are Neural M*, Neural transition, Rec. model, and Ours w/ fixed blocks, respectively. See Appendix B for more examples.

Figure 5: Prediction errors L_p with T_c = 2 and T_p = 1, . . . , 18 on (a) Sequential MNIST, (b) Sequential MNIST-bg, and (c) 3DShapes, comparing Neural transition, Neural M*, Rec. model, Ours w/ fixed blocks, and Ours. During the training phase, models are trained to predict the observations only at T_p = 1. The prediction errors at T_p > 1 indicate the extrapolation performance. The results on SmallNORB can be found in Appendix A.

We also conducted experiments for sequences with constant acceleration on Sequential MNIST. To create a sequence with constant acceleration, we chose a pair g_a, g_v ∈ G for each sequence and generated s by setting s_{t+1} = g_a^t g_v · s_t. We elaborate on the details of this extension in Section 4.4.

As ablations, we tested several variants of our method: fixed 2×2 blocks (abbreviated as fixed blocks), Neural M*, Reconstruction model (abbreviated as Rec. model), and Neural transition. For the method of fixed 2×2 blocks, we separated the latent tensor Φ(s) ∈ R^{16×256} into 8 subtensors {Φ^{(k)}(s) ∈ R^{2×256}}_{k=1}^{8} and calculated the pseudo-inverse for each k to compute the transition in each R^{2×256}-dimensional space. This variant yields M* as a direct sum of eight 2×2 matrices. We tested this variant to see the effect of introducing a predetermined representation-theoretic structure as in [10]. For Rec. model, we trained Φ and Ψ based on L_r in Eq. (3) with T_c = 3. We tested this variant to see the effect of our use of T_p. For Neural M*, we trained an additional network M_θ that maps s_c to a transition matrix, replaced the M* of Eq. (4) with M_θ, and optimized θ and (Φ, Ψ) simultaneously. We may see Neural M* as a variant in which the meta part of the internal and external training is removed from our method. For Neural transition, we trained 1×1 1D-convolutional networks to be applied to the latent sequences in the past to produce the latent tensor in the next time step; for instance, ŝ_{t+1} = Ψ(1DCNN(Φ(s_t), Φ(s_{t−1}))) when T_c = 2.^3 The 1DCNN was applied along the multiplicity dimension m. In this variant, the relation between Φ(s_t) and Φ(s_{t+1}) is not necessarily linear. Section D.1 in the Appendix describes each of the comparison methods in more detail. In testing all of these variants, we used the same pair of encoder and decoder architectures as the proposed method.

^3 This model can be seen as a simplified version of [60].

4.1 Qualitative and quantitative results on the prediction

Figure 4 shows example sequences generated by the proposed model and the comparative models. Figure 5 presents the prediction performance at T_c + T_p when T_c = 2.
To produce this result, we back-propagated the prediction error at T_p = 1 to the encoder during training, and the predictions at T_p > 1 were used to evaluate the extrapolation performance. Our method successfully predicts the images for T_p > 1. Neural transition and Neural M* had almost the same prediction performance at T_p = 1, but they both failed in extrapolation. Our fixed 2×2 blocks variant failed in extrapolation as well. This might be because the over-regularized structure of the 2×2 blocks hindered the SGD optimization [17]. To evaluate how our learned representation relates to the structural features in the dataset, we also regressed the factors of transition from M* and regressed the class of the digits from Φ(s_1) (Figures 10 and 11 in Appendix A). SimCLR [8] and contrastive predictive coding (CPC) [61, 27] are tested as baselines. Please see Appendix D for the detailed experimental settings for SimCLR and CPC. Our method yields the representation with better prediction performance than the comparative methods on the test datasets.

Figure 6: Quantitative and qualitative evaluation of learned equivariance. (a) Dependency of M* on the sequence. The 2×3 tiled images in the leftmost panel represent two sequences s^{(1)}, s^{(2)} with the same transition action g. We consider the effects of M*^{(1)} and M*^{(2)} inferred respectively from s^{(1)} and s^{(2)}. Neural M* fails in prediction when M*^{(2)} is used to predict s^{(1)}. Our method does not fail under this swap, indicating M*^{(2)} = M*^{(1)}. (b) Equivariance performance based on L_p (Eq. (2)) and L_p-equiv (Eq. (6)) with T_c = 2 and T_p = 1. To evaluate equivariance errors in more difficult settings, we used all of the digits in Sequential MNIST-bg for both training and test sets.

| Method | MNIST L_p | MNIST L_p-equiv | MNIST-bg L_p | MNIST-bg L_p-equiv | 3DShapes L_p | 3DShapes L_p-equiv | SmallNORB L_p | SmallNORB L_p-equiv |
|---|---|---|---|---|---|---|---|---|
| Rec. Model | 48.91 | 64.22 | 87.05 | 95.66 | 153.39 | 258.20 | 57.01 | 78.13 |
| Neural M* | 4.99 | 64.25 | 20.60 | 83.18 | 2.09 | 217.73 | 28.98 | 53.24 |
| MSP (Ours) | 6.42 | 15.91 | 27.38 | 36.41 | 2.74 | 2.87 | 31.14 | 44.77 |

4.2 Equivariance performance

As we have described in Sections 3.1 and 3.2, equivariance is achieved when M*(s(s_1, g)|Φ) in (2) does not depend on s_1, where we recall that s(s_1, g) represents the sequence that begins with s_1 and transitions with g. To see how equivariant the trained model is to the transformations in the sequential dataset, we therefore calculated the equivariance error, which is the prediction error obtained by applying M*(s|Φ) to Φ(s′_{T_c}) for a pair s ≠ s′ that transitions with the same g. In other words, when T_c = 2, we compute the following:

L_p-equiv := E_g E_{s,s′ ∈ S(g)} [ ‖Ψ(M*(s|Φ) Φ(s′_2)) − s′_3‖²₂ ],    (6)

where S(g) represents the set of all sequences that transition with g. For each pair s ≠ s′ we set T_c = 2 and T_p = 1 as done in the experiments in the previous sections. Table 6b compares the Neural M* method against our method in terms of the equivariance error. Figure 6a shows the result of applying M*(s|Φ) to Φ(s′_2) and applying M*(s′|Φ) to Φ(s_2). We see that when we swap M* this way, Neural M* also swaps the digits; this implies that the M* learned by Neural M* encodes the sequence-specific information together with the transition. On the other hand, swapping M* does not affect the prediction for our method, suggesting that our method succeeds in learning an equivariant model. This is somewhat surprising, because our model does not have an explicit mechanism to enforce full equivariance (P = I in Section 3.1).
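For concreteness, here is a sketch of how this equivariance error can be computed from paired sequences that share the same transition g (ours; `encoder`, `decoder`, and the latent shape a = 16, m = 256 are assumptions carried over from the sketch in Section 3.2):

```python
import torch

def fit_transition(z1, z2):
    """Least-squares M* with M* Phi(s_1) ~ Phi(s_2); z1, z2: (B, a, m)."""
    return z2 @ torch.linalg.pinv(z1)

def equivariance_error(encoder, decoder, s, s_prime, a=16, m=256):
    """Eq. (6) with Tc = 2, Tp = 1: fit M* on each sequence s, then use it to
    predict the third frame of the paired sequence s' that shares the same g."""
    B, T = s.shape[:2]
    z = encoder(s.flatten(0, 1)).view(B, T, a, m)
    zp = encoder(s_prime.flatten(0, 1)).view(B, T, a, m)
    M = fit_transition(z[:, 0], z[:, 1])        # M*(s | Phi)
    pred = decoder(M @ zp[:, 1])                # swapped: apply M*(s|Phi) to Phi(s'_2)
    return ((pred - s_prime[:, 2]) ** 2).flatten(1).sum(-1).mean()
```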
Figure 7: (a) Simultaneously block-diagonalized matrices (before and after SBD): simultaneous block-diagonalization of M*. The top-right matrix is the visualization of abs(V* − I) averaged over all of the training sequences. Each of the five matrices below it is the visualization of abs(V* − I) averaged over the set of sequences on which only a single factor (floor hue, wall hue, object hue, scale, or orientation) was varied. Coordinates are permuted for better visibility. (b) Disentangled transition representation on 3DShapes: sequences generated by applying the transformation of just one block. To produce the disentangled sequences in each row from the leftmost two images in the bottom row, we performed the internal optimization of M* while setting all but the block positions corresponding to each factor of variation to the identity. We elaborate on this result and on the results for Sequential MNIST, MNIST-bg, and SmallNORB in Appendix A.

4.3 Structures found by simultaneous block-diagonalization of M*'s

We have seen in the previous section that the trained Φ is fully equivariant to the transformations in G, which implies that each M* is a representation of the corresponding transformation g ∈ G. As we describe in Section 3.3, we apply simultaneous block-diagonalization to uncover the symmetry structure captured by the M*'s. Figure 7a shows the structure revealed by simultaneous block-diagonalization through the change of basis U trained by minimizing the average of L_bd in Eq. (5) over all s. Figure 7b shows the results of applying the transformation of only one block. We can see that each block only alters one factor of variation. Our results suggest that the learned M* captures the hidden disentangled structure of the group actions behind the datasets.

4.4 Extension to sequences with constant acceleration

We have seen that meta-sequential prediction successfully learns an equivariant structure from a set of constant-velocity sequences. In this section, we show that we can extend our concept to a set of sequences sharing stationarity of higher order (constant acceleration). By definition, the pair of Φ(s_t) and Φ(s_{t−1}) encodes the information about the velocity at t. When the multiplicity m is sufficiently large,^4 the velocity can be estimated by ¹M_t = Φ(s_t) Φ(s_{t−1})^†. Because this yields a sequence of velocities, we can simply apply our method again to estimate the constant acceleration by

²M* = ¹M_{+1} ¹M_{+0}^†,  where  ¹M_{+0} = [¹M_2; . . . ; ¹M_{T_c−1}] ∈ R^{a×(T_c−2)a},  ¹M_{+1} = [¹M_3; . . . ; ¹M_{T_c}] ∈ R^{a×(T_c−2)a}.

We can then predict the future observation ŝ_t for t = T_c+1, . . . , T by

ŝ_t = Ψ( (∏_{t′=T_c+1}^{t} ¹M̂_{t′}) Φ(s_{T_c}) ),  where  ¹M̂_{t′} = (²M*)^{t′−T_c} ¹M_{T_c},    (7)

and ∏ denotes multiplication from the left. We train Φ and Ψ by minimizing the mean squared error between ŝ_t and s_t for t = T_c+1, . . . , T as in Eq. (2).

^4 If m is less than a, we cannot obtain the pseudo-inverse because of the rank deficiency of Φ(s_{t−1})Φ(s_{t−1})^T; thus m should be at least as large as a.
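A sketch of this second-order extension (our paraphrase of Eq. (7), with the same hypothetical `encoder`/`decoder` and latent shape as before): estimate the per-step velocities ¹M_t, fit a constant acceleration ²M* by least squares, and roll the product out into the future.

```python
import torch

def second_order_rollout(encoder, decoder, s, Tc=5, Tp=5, a=16, m=256):
    """Estimate velocities 1M_t and acceleration 2M*, then predict s_{Tc+1..Tc+Tp}."""
    B = s.shape[0]
    z = encoder(s[:, :Tc].flatten(0, 1)).view(B, Tc, a, m)
    # per-step velocities: 1M_t Phi(s_{t-1}) ~ Phi(s_t) for t = 2..Tc
    M1 = torch.stack([z[:, t] @ torch.linalg.pinv(z[:, t - 1]) for t in range(1, Tc)], dim=1)
    # constant acceleration 2M* mapping 1M_t to 1M_{t+1} (least squares on concatenations)
    H0 = torch.cat(list(M1[:, :-1].unbind(1)), dim=-1)   # [1M_2; ...; 1M_{Tc-1}]
    H1 = torch.cat(list(M1[:, 1:].unbind(1)), dim=-1)    # [1M_3; ...; 1M_{Tc}]
    M2 = H1 @ torch.linalg.pinv(H0)
    preds, zt, Mt = [], z[:, Tc - 1], M1[:, -1]           # start from Phi(s_Tc) and 1M_Tc
    for _ in range(Tp):
        Mt = M2 @ Mt                                      # 1M_{Tc+k} = (2M*)^k 1M_Tc
        zt = Mt @ zt
        preds.append(decoder(zt))
    return torch.stack(preds, dim=1)
```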
To create a sequence of constant acceleration from the MNIST dataset, we only used shape and color rotations. We chose the initial velocity of these rotations randomly on the interval [−π/5, π/5) for each sequence, and chose the acceleration on the interval [−π/40, π/40). The results are shown in Figure 8. Neural transition and the constant-velocity version of our method failed to predict the accelerated sequences, while the 2nd-order model succeeded in predicting the sequence even after T_p > 5. Also, Figure 15f and Table 16c in the Appendix show that the accelerated version of our proposed model again achieves learning of the equivariance relation.

Figure 8: Results on accelerated Sequential MNIST: (a) generated images and (b) prediction error for Neural transition, Ours (1st order), and Ours (2nd order). Every model was trained with T_c = 5, T_p = 5. Neural transition overfitted and collapsed from T_p = 5 onward (beyond the training horizon).

5 Discussion & Limitations

How is full equivariance achieved in our method? The theoretical results we provided in Section 3.1 only assure that M(g, x) and M(g, x′) are similar when the underlying group is commutative, compact, and connected. However, as we have shown experimentally, our method seems to be learning a Φ for which the estimators of M satisfy M*(s(g, x)|Φ) = M*(s(g, x′)|Φ). This may be happening because our framework and the training method based on the internal optimization in the latent space somehow encourage M* to be orthogonal (see the loss curve of the orthogonality of M* in Appendix A). Perhaps this is forcing the change-of-basis matrix P such that P M(g, x) P^{-1} = M(g, x′) to be a rotation as well, which commutes with M(g, x) itself. Also, Figures 4b and 5 show that, as reported in [34], models trained with a reconstruction loss like (3) do not capture the group transformations behind the sequences well: the encoder representation was found to be significantly worse than that of the model trained with (2). We hypothesize that (3) fails to remove the sequence-specific information from M*, while (2) succeeds in doing so by training the model to be able to predict the unseen images.

Towards learning symmetries from more realistic observations As we are making a connection between our prediction framework and group equivariance, we are essentially assuming that the transitions are always invertible, because a group is closed under inversion. However, this might not always be the case in real-world applications; for instance, if the image sequences are sequential renderings of a rotating 3D object, the transitions are generally not invertible because only a part of the object is visible at each time step. We experimented with Sequential ShapeNet, which is created from the ShapeNet [7] dataset. A series of rendered images is generated by sequentially applying 3D rotations of different speeds for each axis. Generated results on Sequential ShapeNet (see Figure 26 in Appendix B) show that our current method was not able to generate the images on 3D-rotated datasets. If the transitions are not invertible, some measures must be taken in order to resolve the indeterminacy, such as probabilistic modeling or additional structural inductive bias.

Broader impact Because our study generally contributes to prediction and extrapolation, it has as much potential to negatively affect society as most other prediction methods. In particular, applications of our method to image sequences could potentially be integrated into weapon systems, for example. At the same time, our unsupervised learning of the symmetrical structure from sequential datasets may also contribute to new discoveries in finance, medical science, physics, and other fields of ML such as reinforcement learning.

References

[1] F. Alet, D. Doblar, A. Zhou, J. Tenenbaum, K. Kawaguchi, and C. Finn. Noether networks: meta-learning useful conserved quantities. NeurIPS, pages 16384-16397, 2021.
[2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[3] P. Bevanda, S. Sosnowski, and S. Hirche. Koopman operator dynamical models: Learning, analysis and control. Annual Reviews in Control, 52:197-212, 2021.
[4] D. Bouchacourt, M. Ibrahim, and S. Deny. Addressing the topological defects of disentanglement via distributed operators. arXiv preprint arXiv:2102.05623, 2021.
[5] C. Burgess and H. Kim. 3D Shapes dataset. https://github.com/deepmind/3dshapes-dataset/, 2018.
[6] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in β-VAE. In NeurIPS Workshop on Learning Disentangled Features, 2017.
[7] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University, Princeton University, Toyota Technological Institute at Chicago, 2015. URL https://shapenet.org/terms.
[8] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597-1607, 2020.
[9] M. Clausen and U. Baum. Fast Fourier Transforms. Wissenschaftsverlag, 1993.
[10] T. Cohen and M. Welling. Learning the irreducible representations of commutative Lie groups. In ICML, pages 1755-1763, 2014.
[11] T. Cohen and M. Welling. Group equivariant convolutional networks. In ICML, pages 2990-2999, 2016.
[12] M. Connor and C. Rozell. Representing closed transformation paths in encoded network latent space. In AAAI, pages 3666-3675, 2020.
[13] B. Culpepper and B. Olshausen. Learning transport operators for image manifolds. NeurIPS, 22:423-431, 2009.
[14] N. Dehmamy, R. Walters, Y. Liu, D. Wang, and R. Yu. Automatic symmetry discovery with Lie algebra convolutional network. NeurIPS, pages 2503-2515, 2021.
[15] C. Deng, O. Litany, Y. Duan, A. Poulenard, A. Tagliasacchi, and L. J. Guibas. Vector neurons: A general framework for SO(3)-equivariant networks. In ICCV, pages 12180-12189, 2021.
[16] L. Falorsi, P. de Haan, T. R. Davidson, N. De Cao, M. Weiler, P. Forré, and T. S. Cohen. Explorations in homeomorphic variational auto-encoding. arXiv preprint arXiv:1807.04689, 2018.
[17] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.
[18] K. Fukushima and S. Miyake. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets, pages 267-285. Springer, 1982.
[19] W. Fulton and J. Harris. Representation Theory: A First Course, volume 129. Springer Science & Business Media, 1991.
[20] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, pages 315-323, 2011.
[21] K. Greff, F. Belletti, L. Beyer, C. Doersch, Y. Du, D. Duckworth, D. J. Fleet, D. Gnanapragasam, F. Golemo, C. Herrmann, T. Kipf, A. Kundu, D. Lagun, I. Laradji, H.-T. D. Liu, H. Meyer, Y. Miao, D. Nowrouzezahrai, C. Oztireli, E. Pot, N. Radwan, D. Rebain, S. Sabour, M. S. M. Sajjadi, M. Sela, V. Sitzmann, A. Stone, D. Sun, S. Vora, Z. Wang, T. Wu, K. M. Yi, F. Zhong, and A. Tagliasacchi. Kubric: A scalable dataset generator. In CVPR, 2022.
[22] S. Greydanus, M. Dzamba, and J. Yosinski. Hamiltonian neural networks. In NeurIPS, pages 15353-15363, 2019.
[23] J. Grosek and J. N. Kutz. Dynamic mode decomposition for real-time background/foreground separation in video. arXiv preprint arXiv:1404.7592, 2014.
[24] V. L. Guen and N. Thome. Disentangling physical dynamics from unknown factors for unsupervised video prediction. In CVPR, pages 11474-11484, 2020.
[25] Y. Han, W. Hao, and U. Vaidya. Deep learning of Koopman representation for control. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 1890-1895. IEEE, 2020.
[26] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
[27] O. J. Hénaff, A. Srinivas, J. D. Fauw, A. Razavi, C. Doersch, S. M. A. Eslami, and A. van den Oord. Data-efficient image recognition with contrastive predictive coding. In ICML, pages 4182-4192, 2020.
[28] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
[29] I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, and A. Lerchner. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.
[30] I. Higgins, S. Racanière, and D. Rezende. Symmetry-based representations for artificial and biological general intelligence. Frontiers in Computational Neuroscience, page 28, 2022.
[31] A. Hyvarinen and H. Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. NeurIPS, 2016.
[32] A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411-430, 2000.
[33] R. Kabra, D. Zoran, G. Erdogan, L. Matthey, A. Creswell, M. Botvinick, A. Lerchner, and C. Burgess. SIMONe: View-invariant, temporally-abstracted object representations via unsupervised video decomposition. In NeurIPS, pages 20146-20159, 2021.
[34] T. Keller and M. Welling. Predictive coding with topographic variational autoencoders. In ICCV Workshop, pages 1086-1091, 2021.
[35] T. Keller and M. Welling. Topographic VAEs learn equivariant capsules. In NeurIPS, pages 28585-28597, 2021.
[36] H. Kim and A. Mnih. Disentangling by factorising. In ICML, pages 2649-2658, 2018.
[37] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[38] T. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2016.
[39] T. Kipf, E. van der Pol, and M. Welling. Contrastive learning of structured world models. In ICLR, 2020.
[40] T. Kipf, G. F. Elsayed, A. Mahendran, A. Stone, S. Sabour, G. Heigold, R. Jonschkowski, A. Dosovitskiy, and K. Greff. Conditional object-centric learning from video. In ICLR, 2021.
[41] D. Klindt, L. Schott, Y. Sharma, I. Ustyuzhaninov, W. Brendel, M. Bethge, and D. Paiton. Towards nonlinear disentanglement in natural data with temporal sparse coding. In ICLR, 2021.
[42] I. R. Kondor. Group theoretical methods in machine learning. Columbia University, 2008.
[43] B. O. Koopman. Hamiltonian systems and transformation in Hilbert space. Proceedings of the National Academy of Sciences of the United States of America, 17(5):315, 1931.
[44] M. Koyama, T. Miyato, and K. Fukumizu. Invariance-adapted decomposition and Lasso-type contrastive learning. In ICML 2022 Workshop on Topology, Algebra, and Geometry in Machine Learning, 2022.
[45] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. NeurIPS, 25, 2012.
[46] A. S. Leahy. A classification of multiplicity free representations. Journal of Lie Theory, 8:367-391, 1998.
[47] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998. URL http://yann.lecun.com/exdb/mnist/License.
[48] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), volume 2, pages II-104. IEEE, 2004. URL https://cs.nyu.edu/~ylclab/data/norb-v1.0-small/.
[49] F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf, and O. Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In ICML, pages 4114-4124, 2019.
[50] F. Locatello, B. Poole, G. Rätsch, B. Schölkopf, O. Bachem, and M. Tschannen. Weakly-supervised disentanglement without compromises. In ICML, pages 6348-6359, 2020.
[51] F. Locatello, M. Tschannen, S. Bauer, G. Rätsch, B. Schölkopf, and O. Bachem. Disentangling factors of variation using few labels. In ICLR, 2020.
[52] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
[53] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807-814, 2010.
[54] D. Pfau, I. Higgins, A. Botev, and S. Racanière. Disentangling by subspace diffusion. In NeurIPS, pages 17403-17415, 2020.
[55] S. Qiao, H. Wang, C. Liu, W. Shen, and A. Yuille. Weight standardization. arXiv preprint arXiv:1903.10520, 2019.
[56] R. Quessard, T. Barrett, and W. Clements. Learning disentangled representations and group structure of dynamical environments. NeurIPS, pages 19727-19737, 2020.
[57] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[58] A. Simeonov, Y. Du, A. Tagliasacchi, J. B. Tenenbaum, A. Rodriguez, P. Agrawal, and V. Sitzmann. Neural descriptor fields: SE(3)-equivariant object representations for manipulation. arXiv preprint arXiv:2112.05124, 2021.
[59] J. Sohl-Dickstein, C. M. Wang, and B. A. Olshausen. An unsupervised algorithm for learning Lie group transformations. arXiv preprint arXiv:1001.1027, 2010.
[60] A. Van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[61] A. Van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv e-prints, 2018.
[62] E. van der Pol, D. Worrall, H. van Hoof, F. Oliehoek, and M. Welling. MDP homomorphic networks: Group symmetries in reinforcement learning. In NeurIPS, pages 4199-4210, 2020.
[63] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. NeurIPS, 30, 2017.
[64] J. Von Kügelgen, Y. Sharma, L. Gresele, W. Brendel, B. Schölkopf, M. Besserve, and F. Locatello. Self-supervised learning with data augmentations provably isolates content from style. NeurIPS, 34, 2021.
[65] D. Wang, R. Walters, and R. Platt. SO(2)-equivariant reinforcement learning. arXiv preprint arXiv:2203.04439, 2022.
[66] Z. Wang and J.-C. Liu. Translating math formula images to LaTeX sequences using deep neural networks with sequence-level training. International Journal on Document Analysis and Recognition (IJDAR), 24(1):63-75, 2021.
[67] S. H. Weintraub. Representation Theory of Finite Groups: Algebra and Arithmetic, volume 59. American Mathematical Society, 2003.
[68] Y. Wu and K. He. Group normalization. In ECCV, pages 3-19, 2018.
[69] T. Yang, X. Ren, Y. Wang, W. Zeng, and N. Zheng. Towards building a group-based unsupervised representation disentanglement framework. In ICLR, 2021.
[70] S. Zhang, Y. Wang, and A. Li. Cross-view gait recognition with deep universal linear embeddings. In CVPR, pages 9095-9104, 2021.
[71] R. S. Zimmermann, Y. Sharma, S. Schneider, M. Bethge, and W. Brendel. Contrastive learning inverts the data generating process. In ICML, pages 12979-12990, 2021.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] The abstract and introduction reflect the contributions of this paper.
(b) Did you describe the limitations of your work? [Yes] In the Discussion & Limitations section, we discuss the type of dataset to which our work cannot be directly applied, as well as the theoretical problem that has not been solved.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] We included the broader impact discussion in Section 5.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] We read the guidelines and confirmed that the contents of our paper conform to them.
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] We included the full set of assumptions. We provide the details of the assumptions in the supplementary material.
(b) Did you include complete proofs of all theoretical results? [Yes] We provide complete proofs of the theoretical results in the supplementary material.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We provided code for reproducing the experimental results as supplementary material.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] We explained how we chose the hyperparameters and how we split the dataset into training and test sets in Section 4 and Appendix D.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] We include the standard deviation values of the equivariance performance in Appendix A (we omitted them in the main submission because of the space limitation) and the linear classification performance of the transition parameters in Appendix A.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] We put the information on the GPUs which we used to train the models. Also, we included the approximate costs for reproducing the full results of the experiments in Appendix D.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] We cited [47] for MNIST, [7] for ShapeNet, [36] for 3DShapes, and [48] for SmallNORB.
(b) Did you mention the license of the assets? [Yes] For each dataset, we included the URL to the license information in the reference section.
(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [Yes] For the ShapeNet dataset, we are approved to download the assets.
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] We do not find any personally identifiable information in the images used in our experiments.
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]