# Regularizing Towards Soft Equivariance Under Mixed Symmetries

Hyunsu Kim 1, Hyungi Lee 1, Hongseok Yang 2 3, Juho Lee 1 4

1 Kim Jaechul Graduate School of AI, KAIST, Daejeon, South Korea. 2 School of Computing, KAIST, Daejeon, South Korea. 3 Discrete Mathematics Group, Institute for Basic Science (IBS), Daejeon, South Korea. 4 AITRICS, Seoul, South Korea. Correspondence to: Juho Lee, Hongseok Yang. Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, Hawaii, USA. PMLR 202, 2023.

Abstract

Datasets often have their own intrinsic symmetries, and particular deep-learning models called equivariant or invariant models have been developed to exploit these symmetries. However, if some or all of these symmetries are only approximate, which frequently happens in practice, these models may be suboptimal due to the architectural restrictions imposed on them. We tackle this issue of approximate symmetries in a setup where symmetries are mixed, i.e., they are symmetries of not single but multiple different types and the degree of approximation varies across these types. Instead of proposing a new architectural restriction as in most previous approaches, we present a regularizer-based method for building a model for a dataset with mixed approximate symmetries. The key component of our method is what we call the equivariance regularizer for a given type of symmetry, which measures how much a model is equivariant with respect to the symmetries of that type. Our method is trained with these regularizers, one per symmetry type, and the strength of the regularizers is automatically tuned during training, leading to the discovery of the approximation levels of some candidate symmetry types without explicit supervision. Using synthetic function approximation and motion forecasting tasks, we demonstrate that our method achieves better accuracy than prior approaches while discovering the approximate symmetry levels correctly.

Figure 1: Illustrative example of a system with mixed symmetries and soft equivariances. (a) Equivariant trajectory; (b) soft-equivariant trajectory.

1. Introduction

Exploiting symmetries in a dataset is one of the key principles for building an effective deep-learning model. A popular approach for implementing this principle is to restrict the architecture of a neural network in the model so that the model has the desired symmetries by construction. This approach has been highly successful, leading to a range of effective so-called equivariant or invariant models (Bronstein et al., 2021), such as CNNs (Cohen & Welling, 2016; Cohen et al., 2019; 2018) and GNNs (Kipf & Welling, 2016; Veličković et al., 2018), that cover different types of symmetries, such as translation invariance. In practice, however, the symmetries implied in data are often approximate, partially due to measurement noise or unexpected external effects. For such scenarios, models that are equivariant or invariant by construction may be suboptimal due to their architectural restrictions. Moreover, while most previous works assume a single type of symmetry, many real-world data come with mixed symmetries, that is, multiple types of symmetries may exist in data. Equivariant models assuming symmetries of just a single type cannot easily be combined to model such mixed symmetries.
Even more, those mixed symmetries may be approximate, so different types of symmetries may exhibit different approximation levels. As an example, imagine we want to model the trajectory of a golf ball in 3D space as in Figure 1. The trajectory is O(3)-equivariant or, put differently, exhibits mixed symmetries w.r.t. Ox(2), Oy(2), and Oz(2). Now assume that a wind is blowing along the y-axis. While the trajectory is still Oy(2)-equivariant, it is only approximately equivariant to Ox(2) and Oz(2). An O(3)-equivariant model would be too restrictive in this case by design, and a model equivariant only with respect to Oy(2) would miss the soft equivariance along the x and z axes.

In this paper, we tackle the modeling problem under mixed approximate symmetries, i.e., there are multiple types of symmetries with varying degrees of approximation across the types. Instead of building models that are symmetric by design, we propose a regularizer-based method, where an unconstrained model is regularized toward equivariance. A regularizer is attached for each potential symmetry type expected to be implied in the data, and the degree of equivariance approximation of the type is captured by the strength of the regularizer for it, namely its regularization coefficient. Since it is almost impossible to know the degrees of approximation in advance, the regularization coefficients must be carefully tuned to capture the approximation levels correctly. Our method, without explicit supervision, can automatically tune the coefficients during training and thus automatically discover the varying degrees of equivariance approximation (from prescribed candidate groups) in mixed-symmetry settings.

We are not the first to study approximate symmetries. However, the existing works mostly rely on architectural restrictions in relaxed forms (Finzi et al., 2021a; van der Ouderaa et al., 2022; Wang et al., 2022). Moreover, they do not consider multiple types of symmetries with different approximation levels. In contrast, our method does not impose architectural restrictions on a model but relies solely on the equivariance regularizers. As we will show later, the regularizer-based method is especially useful in mixed-symmetry settings, while the existing works are not straightforwardly extended to those settings. We experimentally evaluated our method on a synthetic function-approximation task and a motion forecasting task. Our method could correctly discover the degrees of approximation of different symmetry types in relative terms and achieve better test accuracy. We summarize our contributions below:

- We tackle the problem where we have multiple types of (approximate) symmetries with different levels of equivariance/invariance errors.
- We propose a novel method regularizing an unrestricted model with (approximate) symmetry constraints, and present an algorithm that can automatically identify the approximation levels of different symmetry types during training.
- We demonstrate the effectiveness of our approach on synthetic and real-world tasks with multiple types of (approximate) symmetries.

2. Backgrounds

We start with a review of the formalization of symmetries of neural networks in terms of groups. We also review the so-called residual pathway prior (Finzi et al., 2021a), a recent proposal for handling approximate symmetries.
2.1. Group Representation and Equivariance

A representation of a group G on a Euclidean space R^n is a function ρ from G to the general linear group on R^n (i.e., the group of invertible n × n matrices with matrix multiplication as group composition) such that ρ preserves the composition operator of the group. When we have representations of a group G on Euclidean spaces X and Y, denoted ρ_X and ρ_Y, we say that a function f : X → Y is G-equivariant if for all g ∈ G and x ∈ X, we have

$$f(\rho_X(g)(x)) = \rho_Y(g)(f(x)). \tag{1}$$

Intuitively, this condition means that f does not actively use information that can be altered by group elements g. The convolution layers in CNNs are a leading example that is equivariant with respect to the translation group (in an ideal setup where the images are defined over the entire plane R²). A range of neural-network architectures that ensure equivariance (including the equivariant multilayer perceptrons explained next) have been developed because they usually generalize better than their non-equivariant counterparts.

2.2. Equivariant Multilayer Perceptrons

Equivariant Multilayer Perceptrons (EMLPs) (Finzi et al., 2021b) are models that are guaranteed to be equivariant with respect to a given group G and its representation ρ. As the name indicates, an EMLP is identical to a standard multilayer perceptron except for one thing: its weights and biases are not network parameters but are constructed from other parameters. This further parameterization of weights and biases ensures that all the linear layers of the EMLP are equivariant by construction.

To describe the linear layers of an EMLP formally, we need to recall a few facts. First, when G has representations ρ on R^n and ρ′ on R^m, the set of G-equivariant linear maps from R^n to R^m forms a vector space. Thus, it has an orthonormal basis B = {M_1, ..., M_d}, where the M_i's are m × n matrices representing G-equivariant linear maps and, when reshaped into vectors by stacking columns (i.e., vec(M_1), ..., vec(M_d)), the matrices become orthonormal vectors of dimension m · n. Second, the set of vectors v in R^m that are invariant with respect to G and ρ′ (i.e., ρ′(g)(v) = v for all g ∈ G) forms a subspace of R^m, so this subspace also has an orthonormal basis. The linear layers of an EMLP are defined in terms of these two bases.

Assume that the l-th layer of the network has n_l input nodes and n_{l+1} output nodes. Formally, each linear layer l of an EMLP is an affine map Linear_EMLP : R^{n_l} → R^{n_{l+1}} defined as follows:

$$\mathrm{Linear}_{\mathrm{EMLP}}(x) = Wx + b, \qquad \mathrm{vec}(W) = Q\theta, \qquad b = R\beta, \tag{2}$$

where vec(W) is the vector obtained by stacking the columns of the matrix W, Q is a fixed matrix with n_{l+1} · n_l rows and d columns, and R is a fixed matrix with n_{l+1} rows and r columns. The columns of the matrix Q, when reshaped into n_{l+1} × n_l matrices via unstacking, form an orthonormal basis of the space of G-equivariant linear maps from R^{n_l} to R^{n_{l+1}}. Similarly, the columns of the other matrix R form a basis of the subspace of G-invariant vectors in R^{n_{l+1}}. The parameters to be trained are θ ∈ R^d and β ∈ R^r, the coefficients combining the orthonormal bases.
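For concreteness, here is a minimal sketch of an equivariant linear layer in the sense of Eq. (2). The bases `Q` and `R` are assumed to be precomputed for the chosen group representation and layer sizes (e.g., with the method of Finzi et al. (2021b)); all names are illustrative and this is not the released EMLP code.

```python
import torch
import torch.nn as nn

class EquivariantLinear(nn.Module):
    """Linear layer parameterized as in Eq. (2): vec(W) = Q @ theta, b = R @ beta.

    Q: (n_out * n_in, d) orthonormal basis of vectorized G-equivariant maps.
    R: (n_out, r) orthonormal basis of G-invariant vectors.
    Both are fixed (precomputed) for the chosen group representation.
    """

    def __init__(self, Q: torch.Tensor, R: torch.Tensor, n_in: int, n_out: int):
        super().__init__()
        self.register_buffer("Q", Q)   # fixed bases, not trained
        self.register_buffer("R", R)
        self.n_in, self.n_out = n_in, n_out
        self.theta = nn.Parameter(torch.randn(Q.shape[1]) / Q.shape[1] ** 0.5)
        self.beta = nn.Parameter(torch.zeros(R.shape[1]))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # vec(.) stacks columns, so un-stack: reshape to (n_in, n_out), then transpose.
        W = (self.Q @ self.theta).reshape(self.n_in, self.n_out).T
        b = self.R @ self.beta
        return x @ W.T + b
```

With exact bases, such a layer satisfies Eq. (1) up to numerical error: transforming the input by ρ_X(g) and applying the layer gives the same result as applying the layer and then ρ_Y(g).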
2.3. Residual Pathway Prior

The Residual Pathway Prior (RPP) (Finzi et al., 2021a) is a recent proposal for learning an approximately equivariant neural network. It is based on the idea of combining equivariant and non-equivariant transformations in each network layer. Concretely, it is the following variant of the EMLP, which adds a standard linear layer, called the residual pathway, to each equivariant linear layer of the EMLP:

$$\mathrm{Linear}_{\mathrm{RPP}}(x) = Wx + b, \qquad \mathrm{vec}(W) = QQ^\top \mathrm{vec}(W_1) + \mathrm{vec}(W_2), \qquad b = RR^\top b_1 + b_2, \tag{3}$$

where Q and R are from the equations in (2), and Q^⊤ vec(W_1) and R^⊤ b_1 correspond to θ and β in those equations, respectively. Note that Linear_RPP(x) is the sum of the EMLP's linear layer applied to x and W_2 x + b_2; the residual pathway refers to the latter part. The parameters of an RPP are trained with the following ℓ2-regularization, which comes from the prior distributions on those parameters:

$$R_{\mathrm{RPP}}(W_1, b_1, W_2, b_2) = \frac{\|\mathrm{vec}(W_1)\|^2 + \|b_1\|^2}{2\sigma_1^2} + \frac{\|\mathrm{vec}(W_2)\|^2 + \|b_2\|^2}{2\sigma_2^2}, \tag{4}$$

with σ_2 being substantially smaller than σ_1, which encourages the residual layers to play only a minor role in inference.

3. Equivariance Regularizer

In this section, we present our equivariance regularizer, the key conceptual contribution of the paper. We assume that a collection of groups G_1, G_2, ..., G_K is given, capturing different types of symmetries, and also that these groups come with representations for the input and output spaces of all linear layers. The latter assumption enables us to talk about G_k-equivariant linear or affine maps at every layer. In our presentation, we fix a layer l and describe how our regularizers constrain the network parameters at that layer. For notational simplicity, we omit the layer indices from the parameters unless they need to be specified.

Figure 2: The projection-based equivariance regularizer for a group G measures the distance ‖W̄ − QQ^⊤ W̄‖, where W̄ is either vec(W) or b in the standard linear layer and Q is an orthonormal basis of the space of G-equivariant matrices or G-invariant vectors.

3.1. Projection-Based Equivariance Regularizer

For every k = 1, ..., K, write Q_k and R_k for the matrices from the equations in (2): the columns of Q_k form an orthonormal basis of the G_k-equivariant linear maps from R^{n_l} to R^{n_{l+1}} after being reshaped into n_{l+1} × n_l matrices, and the columns of R_k form an orthonormal basis of the G_k-invariant vectors in R^{n_{l+1}}. Our Projection-based Equivariance Regularizer (PER) for a group G_k is defined by

$$R^{\mathrm{PER}}_k(W, b) = \frac{\lambda_k}{2}\Big( \big\|\mathrm{vec}(W) - Q_k Q_k^\top \mathrm{vec}(W)\big\|^2 + \big\|b - R_k R_k^\top b\big\|^2 \Big), \tag{5}$$

where W and b are the parameters of the l-th layer of the network, and λ_k is a regularization coefficient for the group G_k. Modulo the reshaping into vector form, the term Q_k Q_k^⊤ vec(W) is the projection of W (expressing a linear map from R^{n_l} to R^{n_{l+1}}) onto the space of G_k-equivariant linear maps expressed as n_{l+1} × n_l matrices. Thus, the first summand measures the ℓ2-distance from W to the space of G_k-equivariant linear maps. Similarly, the second summand uses the projection of the bias term and measures the ℓ2-distance from b to the space of G_k-invariant vectors. This regularizer can be a part of a learning objective during training, so that training moves the parameters W and b towards the space of G_k-equivariant linear maps or G_k-invariant vectors. An advantage of this regularizer-based approach for enforcing symmetries is that we can easily combine multiple regularizers for different groups simply by adding them to the objective function.
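A minimal sketch of the per-group regularizer in Eq. (5) for one unconstrained linear layer is given below; `Q_k` and `R_k` are the same precomputed bases as in Eq. (2), and the function name is illustrative rather than taken from the authors' code.

```python
import torch

def per_regularizer(W: torch.Tensor, b: torch.Tensor,
                    Q_k: torch.Tensor, R_k: torch.Tensor,
                    lam_k: float) -> torch.Tensor:
    """Eq. (5): squared distance of (W, b) to the G_k-equivariant/invariant subspaces.

    W:   (n_out, n_in) weight of an ordinary linear layer.
    b:   (n_out,) bias.
    Q_k: (n_out * n_in, d_k) orthonormal basis of vectorized G_k-equivariant maps.
    R_k: (n_out, r_k) orthonormal basis of G_k-invariant vectors.
    """
    w = W.T.reshape(-1)                 # vec(W): stack the columns of W
    w_proj = Q_k @ (Q_k.T @ w)          # projection onto the equivariant subspace
    b_proj = R_k @ (R_k.T @ b)          # projection onto the invariant subspace
    return 0.5 * lam_k * ((w - w_proj).pow(2).sum() + (b - b_proj).pow(2).sum())
```

Summing such terms over layers and groups gives the combined regularizer introduced next.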
Concretely, in our setup of K different groups, we can use the following regularizer for the parameters of the l-th layer:

$$R^{\mathrm{PER}}(W, b) = \sum_{k=1}^{K} R^{\mathrm{PER}}_k(W, b). \tag{6}$$

The regularization coefficients λ_k control the strength of enforcing the different types of symmetries formalized by the different groups G_1, ..., G_K. Ideally, these parameters are set according to the approximation levels of the different symmetry types. However, we don't know the approximation levels in advance. In the next subsection, we explain how to infer these parameters during training without explicit supervision.

An implicit assumption behind the regularizer R^PER(W, b) is that the ℓ2-distance measures how much the symmetry with respect to G_k is violated by the corresponding parameters of the network. The following proposition supports that assumption, showing that minimizing the ℓ2-distances indeed minimizes the equivariance error.

Proposition 3.1. Let f be an S-layer MLP with weight matrix W^{(l)} and bias term b^{(l)} at each layer l. Assume that the activation functions of f are G-equivariant and L-Lipschitz continuous. Also, assume a constant U > 0 such that ‖x‖ < U for every x ∈ X, and that the operator norms ‖ρ_X(g)‖_op and ‖ρ_Y(g)‖_op for any g ∈ G are also bounded by U. Then there exists a constant C > 0, depending only on S, L, and U, such that for all {(W^{(l)}, b^{(l)})}_{l=1,...,S} with ‖W^{(l)}‖_op and the ℓ2 norm ‖b^{(l)}‖ bounded by U for every l, we have

$$\sup_{x \in X,\, g \in G} \big\|\rho_Y(g) f(x) - f(\rho_X(g)x)\big\| \;\le\; C \sum_{l=1}^{S} \Big( \big\|\mathrm{vec}(W^{(l)}) - Q^{(l)} Q^{(l)\top}\mathrm{vec}(W^{(l)})\big\| + \big\|b^{(l)} - R^{(l)} R^{(l)\top} b^{(l)}\big\| \Big), \tag{7}$$

where Q^{(l)} and R^{(l)} denote the bases of the G-equivariant maps and G-invariant vectors at layer l, as in Section 3.1.

The proof of a refined version of this proposition is given in Appendix A. According to Proposition 3.1, the equivariance error of a model is bounded by the ℓ2-distances of its parameters to the equivariance subspaces, and the minimum equivariance error is achieved when the ℓ2-distances are zero, which happens when the value of the regularizer is zero. Our method thus provides functionality comparable to RPP, yet it allows the model to discover softly equivariant weights with fewer parameters. The distinction between RPP and PER is depicted visually in Appendix B.

3.2. Adjustment of Hyperparameters of Groupwise Equivariance Regularizers

The regularization coefficients λ_1, ..., λ_K in (6) play the important role of controlling the strengths of the groupwise equivariance constraints that we impose on the model. We empirically observed that a better model is learned when the regularization coefficients for different groups (and hence the strengths of regularization for these groups) are correlated with the approximation levels of the symmetries of those groups in a dataset. That is, if (λ*_1, ..., λ*_K) are the coefficients leading to the best model with the lowest validation error after training, then a smaller λ*_k value means weaker symmetry (more approximation error) for the group G_k, and a larger λ*_k means more exact symmetry for the group G_k. Based on this observation, we propose an automatic tuning procedure that discovers the approximation levels of different symmetry types (formalized by different groups and captured through the regularizers) in a data-driven way. Given an S-layer MLP f, let R^PER_k(f) = Σ_{l=1}^{S} R^PER_k(W^{(l)}, b^{(l)}).
We first initialize all the regularization coefficients with the same value, and in the early stage of training, adjust the coefficients {λ_k}_{k=1,...,K} based on the magnitudes of the corresponding regularizers {R^PER_k(f)}_{k=1,...,K} with the following formula:

$$\lambda_k \leftarrow \lambda_k \left( \frac{\min_{j=1,\dots,K} R^{\mathrm{PER}}_j(f)}{R^{\mathrm{PER}}_k(f)} \right)^{\gamma}, \tag{8}$$

where γ is a scaling factor calibrating how much the approximation difference is reflected in the coefficients. We empirically confirmed that setting γ ∈ [2, 5] gives reasonable results.

3.3. Extension of EMLP for Mixed Symmetries

Unlike our method, which can conveniently combine multiple regularizers for mixed symmetries, it is not straightforward to extend existing (approximately) equivariant models to mixed-symmetry settings. Here, as a baseline, we describe a naïve extension of EMLP for our setup, which assumes multiple types of symmetries formalized by groups G_1, ..., G_K. Assume the model should be equivariant to the first L groups G_1, ..., G_L and softly equivariant to the rest. For G_1, ..., G_L, we first compute a joint subspace by solving the set of equivariance constraints for the L groups and denote the corresponding bases Q_1 and R_1. Similarly, we compute a joint subspace for all groups G_1, ..., G_K and denote the corresponding bases Q_2 and R_2. A Mixed EMLP (MEMLP) is defined as

$$\mathrm{Linear}_{\mathrm{MEMLP}}(x) = W_1 x + b_1 + W_2 x + b_2, \qquad \mathrm{vec}(W_q) = Q_q \theta_q, \quad b_q = R_q \beta_q \ \text{ for } q = 1, 2. \tag{9}$$

Here, both W_1 x + b_1 and W_2 x + b_2 are equivariant to G_1, ..., G_L, so the overall model is equivariant to them. On the other hand, since W_1 x + b_1 is not equivariant to G_{L+1}, ..., G_K, the overall model is only softly equivariant to them. The level of soft equivariance is controlled by the prior variances for W_1 and W_2, as in the case of RPP.

Table 1: Test MSE for the moment of inertia task. EMLP and RPP are built with O(3), and MEMLP is built with O(3)-EMLP and O(ax)-EMLP where ax ∈ {x, y, z}.

| Equiv. group | MLP | O(2)-EMLP | O(3)-EMLP | RPP | MEMLP | PER |
|---|---|---|---|---|---|---|
| O(3) | 4.25 ± 0.17 | - | 1.13 ± 0.36 | 2.66 ± 1.43 | - | 0.27 ± 0.23 |
| Ox(2) | 2.84 ± 0.12 | 1.75 ± 0.63 | 62.36 ± 41.10 | 2.06 ± 1.12 | 3.38 ± 0.92 | 0.25 ± 0.16 |
| Oy(2) | 2.78 ± 0.12 | 1.73 ± 0.11 | 29.11 ± 14.06 | 1.72 ± 0.61 | 2.87 ± 0.31 | 0.56 ± 0.48 |
| Oz(2) | 2.69 ± 0.10 | 1.56 ± 0.17 | 46.32 ± 10.18 | 1.86 ± 1.13 | 2.75 ± 0.29 | 0.32 ± 0.26 |
| - | 6.81 ± 0.23 | - | 10.65 ± 2.08 | 4.16 ± 0.49 | - | 0.34 ± 0.28 |

4. Experiments

To demonstrate the effectiveness of our method, especially its utility in discovering mixed symmetries from data, we compare it to (approximately) equivariant baseline models on a synthetic function-approximation task and a real-world motion forecasting task. The baselines we compare against include EMLP, RPP, and the MEMLP described in Section 3.3. The network architectures used for these models, including ours, have four layers with gated nonlinearities and bilinear layers, as described in Finzi et al. (2021b;a). Throughout all experiments, to see the net effect of the models' abilities to capture equivariances, we controlled the sizes of the competing models so that all of them have a similar number of parameters. Additional information regarding the experiments, such as the specific hyperparameters employed and the data preprocessing applied, can be found in Appendix F. Furthermore, Appendix C provides recommendations for effective initialization of neural networks in the PER setting. Additionally, Appendix G presents supplementary experiments conducted to assess the robustness of our method.
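Before turning to the experiments, the following sketch shows how the pieces of Sections 3.1 and 3.2 fit together in a training loop: the combined regularizer of Eq. (6), summed over layers, is added to the task loss, and the coefficients are rescaled once, at a fixed early epoch, by the rule of Eq. (8). It reuses the `per_regularizer` sketch from Section 3.1; `model.linear_layers`, `task_loss`, and the other names are illustrative placeholders, not the authors' released code.

```python
import torch

def total_per(model, bases, lams):
    """Eq. (6) summed over layers: for each group k, bases[k] is a list of
    (Q_k, R_k) pairs, one per linear layer, matching that layer's sizes."""
    reg = {k: 0.0 for k in bases}
    for k, layer_bases in bases.items():
        for layer, (Q_k, R_k) in zip(model.linear_layers, layer_bases):
            reg[k] = reg[k] + per_regularizer(layer.weight, layer.bias,
                                              Q_k, R_k, lam_k=1.0)
    total = sum(lams[k] * reg[k] for k in bases)
    return total, reg

def adjust_coefficients(lams, reg, gamma=2.0):
    """Eq. (8): shrink lambda_k for groups whose regularizer is large,
    i.e. whose symmetry is more strongly violated by the current model."""
    r_min = min(float(v) for v in reg.values())
    return {k: lams[k] * (r_min / float(v)) ** gamma for k, v in reg.items()}

# Inside a training loop, with the adjustment applied once at a fixed early
# epoch (cf. the adjustment epochs listed in Table 6):
#   total_reg, reg_k = total_per(model, bases, lams)
#   loss = task_loss(model(x), y) + total_reg
#   loss.backward(); optimizer.step()
#   if epoch == adjust_epoch:
#       lams = adjust_coefficients(lams, {k: v.detach() for k, v in reg_k.items()}, gamma)
```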
4.1. Synthetic Function-Approximation Task

4.1.1. The Moment of Inertia Function

We generate a synthetic dataset having mixed symmetries by adding a perturbation to a symmetric function that computes the moment of inertia. Given the masses and positions of five particles, denoted by (m_{1:5}, x_{1:5}) := (m_i, x_i)_{i=1}^{5}, the moment of inertia is computed as follows:

$$I(m_{1:5}, x_{1:5}) := \sum_{i=1}^{5} m_i \big( x_i^\top x_i\, I - x_i x_i^\top \big). \tag{10}$$

The moment-of-inertia function is equivariant with respect to the group O(3), which consists of rotations and reflections. That is, for a group element g ∈ O(3), ρ(g)I(m_{1:5}, x_{1:5}) = I(ρ(g)(m_{1:5}, x_{1:5})). Here g acts on each position x_i, that is, ρ(g)(m_{1:5}, x_{1:5}) = (m_i, g x_i)_{i=1}^{5}, where g in g x_i is represented as a 3 × 3 matrix. The output of the function M = I(m_{1:5}, x_{1:5}) is a 3 × 3 matrix, and g acts on M as ρ(g)(M) = g M g^{-1}.

To generate data, we draw x_{1:5} i.i.d. from N(0, I) and m′_{1:5} i.i.d. from N(0, 1), and then compute m_i = softplus(m′_i). We then compute the moment of inertia with (10) and add five different types of errors to the output. Let x̂, ŷ, ẑ ∈ R³ be the orthonormal basis vectors of the x, y, and z axes, respectively. The five types of errors and the corresponding approximate symmetries are as follows:

1. 0 (no error): O(3)-equivariant.
2. I x̂x̂^⊤: Ox(2)-equivariant, softly O(3)-equivariant.
3. I ŷŷ^⊤: Oy(2)-equivariant, softly O(3)-equivariant.
4. I ẑẑ^⊤: Oz(2)-equivariant, softly O(3)-equivariant.
5. 0.3 I(x̂x̂^⊤ − ŷŷ^⊤ + ẑẑ^⊤): softly O(3)-equivariant.

For the baselines, we consider O(3)-EMLP, O(3)-RPP, and O(ax)-O(3)-MEMLP, which is equivariant to O(ax) and softly equivariant to O(3), where ax ∈ {x, y, z} is chosen according to the symmetry in the data. Our model, denoted PER, regularizes an MLP with equivariance regularizers for the groups (Ox(2), Oy(2), Oz(2)).

4.1.2. The CosSim Function

Another synthetic function-approximation task we consider is the CosSim function, which computes the average cosine similarity between three particles. Given the positions of three particles in 3D space, denoted by x_{1:3} := {x_i}_{i=1}^{3} with each x_i ∈ R³, the CosSim function computes

$$\mathrm{AvgCS}(x_{1:3}) = \frac{\mathrm{CS}(x_1, x_2) + \mathrm{CS}(x_2, x_3) + \mathrm{CS}(x_1, x_3)}{3}, \qquad \mathrm{CS}(a, b) := \frac{a \cdot b}{\|a\|\,\|b\|}. \tag{11}$$

The AvgCS function is invariant to both SO(3) and S(3), where SO(3) is the rotation group in R³ and S(3) is the scaling group in R³. That is, for a group element g ∈ SO(3) or g ∈ S(3), AvgCS(ρ(g)x_{1:3}) = AvgCS(x_{1:3}), where ρ(g)x_{1:3} = {g x_i}_{i=1}^{3}.

Table 2: Test MSE for the CosSim task. EMLP and RPP are built with (SO(3), S(3)), and MEMLP is built with (SO(3), S(3))-EMLP and SO(3)- or S(3)-EMLP. Sub-EMLP stands for either SO(3)- or S(3)-EMLP, and EMLP stands for (SO(3), S(3))-EMLP. All values are on a scale of 10^{-1}.

| Inv. group | MLP | Sub-EMLP | EMLP | RPP | MEMLP | PER |
|---|---|---|---|---|---|---|
| SO(3), S(3) | 0.41 ± 0.03 | - | 1.10 ± 0.02 | 1.10 ± 0.03 | - | 0.32 ± 0.02 |
| SO(3) | 0.46 ± 0.07 | 0.39 ± 0.30 | 2.54 ± 0.10 | 2.57 ± 0.10 | 2.56 ± 0.10 | 0.44 ± 0.03 |
| S(3) | 0.69 ± 0.04 | 2.14 ± 0.11 | 2.14 ± 0.11 | 2.18 ± 0.09 | 2.18 ± 0.09 | 0.65 ± 0.09 |
| - | 3.76 ± 0.32 | - | 3.76 ± 0.32 | 3.84 ± 0.04 | - | 0.66 ± 0.13 |

Figure 3: The training progress of PER on a dataset equivariant to Oz(2) and softly equivariant to Ox(2) and Oy(2). From top to bottom: data equivariance error (defined in Equation 12), model equivariance error, the values of the equivariance regularizers (R^PER_k(f))_{k=1}^{3}, and the regularization coefficients (λ_k)_{k=1}^{3}. The coefficients are adjusted automatically at epoch 2000.

Similarly to the inertia task, to generate data, we draw x_{1:3} i.i.d.
from N(0, I), compute (11), and inject four different types of errors:

1. 0 (no error): SO(3)- and S(3)-invariant.
2. $\frac{1}{3}\sum_{i=1}^{3} \|x_i\|$: SO(3)-invariant, softly S(3)-invariant.
3. $\frac{\sum_{i=1}^{3} |x_i \cdot \hat{x}|}{\sum_{j=1}^{3} \big(|x_j \cdot \hat{y}| + |x_j \cdot \hat{z}|\big)}$: softly SO(3)-invariant, S(3)-invariant.
4. $\sum_{i=1}^{3} \|x_i\| \cdot \frac{\sum_{i=1}^{3} |x_i \cdot \hat{x}|}{\sum_{j=1}^{3} \big(|x_j \cdot \hat{y}| + |x_j \cdot \hat{z}|\big)}$: softly SO(3)- and S(3)-invariant.

For the baselines, we consider (SO(3), S(3))-EMLP, (SO(3), S(3))-RPP, SO(3)-S(3)-MEMLP (equivariant to SO(3) and softly equivariant to S(3)), and S(3)-SO(3)-MEMLP (equivariant to S(3) and softly equivariant to SO(3)). For computing the basis of the joint equivariant subspace of SO(3) and S(3), we solve the conjunction of the equivariance constraints of the two groups, as explained in Section 3.3.

4.1.3. Analysis of the Results

Overall results. We summarize the results for the moment of inertia task in Table 1 and the results for the CosSim task in Table 2. For both tasks, PER significantly outperforms the baselines across all error types having different kinds of approximate equivariance. Below, we empirically show that this is because PER correctly captures the approximate equivariance in the data and adjusts the regularization coefficients accordingly.

Discovery of approximate equivariance. We check whether our model correctly learns the degree of approximate equivariance implied in the dataset. For instance, in the moment of inertia task, when the data is perturbed by error type 3, our model should be able to detect that the data is Oy(2)-equivariant and softly O(3)-equivariant. Figure 3 illustrates the progress of the equivariance errors, the values of the regularizers, and their coefficients during training. Here, the data is perturbed by error type 4, so it is Oz(2)-equivariant and softly equivariant to Ox(2) and Oy(2). As we can see in the figure, our model captures the difference between the equivariance error levels and adjusts the regularization coefficients at epoch 2000. Here, our model lowers the regularization coefficients for Ox(2) (to 2.91) and Oy(2) (to 3.07) while keeping the coefficient for Oz(2) at 100.0. As a result, the model trained with the adjusted regularization correctly matches the equivariance errors assumed in the data.

Figure 4: Absolute Pearson correlation coefficients between the data equivariance errors and (the model equivariance error, the values of the equivariance regularizers, the regularization coefficients (negatively correlated)), measured across 13 datasets with different degrees and types of approximate equivariance.

To further demonstrate that our model indeed captures the equivariance error levels from data, we measure the Pearson correlation coefficients between (the model equivariance errors, the values of the equivariance regularizers (R^PER_k(f))_{k=1}^{3}, the regularization coefficients (λ_k)_{k=1}^{3}) and the equivariance errors assumed in the data. Here, the model equivariance error is measured as a Monte Carlo approximation of the following expected (scaled) equivariance error of a model f:

$$\mathbb{E}_{x, g}\left[ \frac{\big\|\rho_Y(g) f(x) - f(\rho_X(g) x)\big\|}{\big\|\rho_Y(g) f(x)\big\| + \big\|f(\rho_X(g) x)\big\|} \right]. \tag{12}$$

We measure the correlations across 13 different types of datasets with varying error types and scales, and summarize the result in Figure 4 (the specific values for each sample are given in Appendix D). The model equivariance error is highly correlated with the data equivariance error, indicating that the model correctly captures the equivariance errors implied in the data. The values of the equivariance regularizers and their coefficients are also correlated with the data equivariance error, supporting our claim that the automatic tuning procedure in our method can discover the approximate equivariance (from prescribed candidate groups) in a data-driven way.
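A sketch of the Monte Carlo estimate behind this metric follows; the denominator shown is one plausible reading of the "scaled" error in Eq. (12), and the samplers and representation callbacks are assumed to be supplied by the user.

```python
import torch

def equivariance_error(f, sample_x, sample_g, rho_in, rho_out, n_samples=1000):
    """Monte Carlo estimate of the scaled equivariance error of Eq. (12).

    f:        the model mapping inputs to outputs.
    sample_x: returns a test input x.
    sample_g: returns a group element g (e.g. a random rotation).
    rho_in:   applies the input representation, rho_in(g, x).
    rho_out:  applies the output representation, rho_out(g, y).
    """
    errs = []
    for _ in range(n_samples):
        x, g = sample_x(), sample_g()
        lhs = rho_out(g, f(x))      # rho_Y(g) f(x)
        rhs = f(rho_in(g, x))       # f(rho_X(g) x)
        num = torch.linalg.norm(lhs - rhs)
        den = torch.linalg.norm(lhs) + torch.linalg.norm(rhs)
        errs.append(num / den)
    return torch.stack(errs).mean()
```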
4.2. Motion Forecasting Task

4.2.1. Task Description

The goal of this task is to predict the future positions of a moving vehicle given its past positions. The position of the vehicle is represented with a 3D coordinate (x, y, z). We collect the trajectories from the Waymo Open Motion Dataset (WOMD) (Ettinger et al., 2021), which contains trajectories of vehicles moving on roads. We use 16,814 trajectories for training, 3,339 for validation, and 3,563 for testing. Each trajectory consists of T = 6 past positions x^{(1:T)} := {x^{(t)}}_{t=1}^{T} and T = 6 future positions y^{(1:T)} := {y^{(t)}}_{t=1}^{T} to be predicted, and the positions are measured at a frequency of 2.5 Hz. We assess the performance of the models trained for this task using the Average Distance Error (ADE), defined as follows:

$$\mathrm{ADE}(y^{(1:T)}, \hat{y}^{(1:T)}) = \frac{1}{T} \sum_{t=1}^{T} \big\| y^{(t)} - \hat{y}^{(t)} \big\|, \tag{13}$$

where y^{(1:T)} and ŷ^{(1:T)} are the ground-truth and predicted future trajectories, respectively.

In principle, the trajectory of a moving vehicle is equivariant to rotations about the z-axis. Therefore, an Oz(2)-equivariant model is expected to perform better than non-equivariant models. Indeed, on the WOMD dataset, Assaad et al. (2022) reported that an Oz(2)-equivariant transformer works better than a non-equivariant transformer. However, they also reported that on the same task, the Oz(2)-equivariant transformer performs worse than a softly Oz(2)-equivariant transformer. In our experiment, we attempt to see why this is the case and also to find out what other types of (approximate) symmetries the dataset might exhibit. To this end, we compare Oz(2)-EMLP, O(3)-EMLP, O(3)-RPP, Oz(2)-RPP, Oz(2)-O(3)-MEMLP, and an MLP with (Ox(2), Oy(2), Oz(2))-PER.

4.2.2. Normalization Methods

Typically, for a regression problem, we preprocess the inputs either by normalizing or scaling them. However, we find that training with trajectories preprocessed in such a typical way performs poorly, due to the high variance across trajectories. Hence, before the actual normalization, we first center each trajectory to bring it near the origin. Given the i-th trajectory x_i^{(1:T)}, the centering is defined as

$$\mathrm{centering}(x_i^{(1:T)}) = \big( x_i^{(t)} - \bar{x}_i \big)_{t=1}^{T} =: c_i^{(1:T)}, \tag{14}$$

where $\bar{x}_i := \sum_{t=1}^{T} x_i^{(t)} / T$. Even after centering, we still suffer from the varying scales of the coordinates (the values of the z-axis are significantly smaller than the values of the other axes because most vehicles run on horizontal roads). To resolve this, we may normalize each coordinate separately, but doing so might also break the symmetry implied in the data. Hence, we consider three different types of normalization schemes, where each scheme induces different (approximate) symmetries, and compare the models on the datasets preprocessed with them. The goal of the experiment is to show that our method can capture the different types of symmetries induced by the normalizations and thus perform robustly across datasets. Example trajectories for each normalization are compared visually in Appendix E.
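For reference, minimal implementations of the ADE metric of Eq. (13) and the per-trajectory centering of Eq. (14); a trajectory is assumed to be an array of shape (T, 3), and the function names are illustrative.

```python
import numpy as np

def ade(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Average Distance Error, Eq. (13); both arguments have shape (T, 3)."""
    return float(np.linalg.norm(y_pred - y_true, axis=-1).mean())

def center(x: np.ndarray) -> np.ndarray:
    """Per-trajectory centering, Eq. (14): subtract the trajectory's mean position."""
    return x - x.mean(axis=0, keepdims=True)
```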
Scale-aware normalization. Assume we have N trajectories in the training set. Let μ ∈ R³ and σ ∈ R³₊ be the element-wise mean and standard deviation of the trajectories in the training set,

$$\mu = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} c_i^{(t)}, \qquad \sigma = \sqrt{ \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \big( c_i^{(t)} - \mu \big)^{\circ 2} }, \tag{15}$$

where ∘ denotes element-wise exponentiation. Given μ and σ, the first normalization scheme is defined as

$$\mathrm{normalize}(c_i^{(1:T)}) = \big( (c_i^{(t)} - \mu) \oslash \sigma \big)_{t=1}^{T}, \tag{16}$$

where ⊘ denotes element-wise division. We call this a scale-aware normalization, since it adjusts the data for each coordinate separately so that all the (x, y, z) coordinates have similar scales.

Symmetry-aware normalization. Note that the scale-aware normalization breaks the rotation symmetry because it scales each coordinate with a different value. In that case, we may lose the benefits of utilizing rotation equivariance in a model. In the second normalization scheme, instead of element-wise scaling, we use the total standard deviation for the scaling:

$$m = \frac{1}{3NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \mathbf{1}_3^\top c_i^{(t)}, \qquad s^2 = \frac{1}{3NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \big\| c_i^{(t)} - m \mathbf{1}_3 \big\|^2, \qquad \mathrm{normalize}(c_i^{(1:T)}) = \big( (c_i^{(t)} - \mu)/s \big)_{t=1}^{T}, \tag{17}$$

where 1_3 = [1, 1, 1]^⊤. We call this normalization symmetry-aware since the rotation symmetry of the resulting trajectory is not broken by the normalization.

Symmetry-scale-aware normalization. While the symmetry-aware normalization preserves the rotation symmetry, it still has the problem of a small z-scale in the training set. To further resolve this, as the third scheme, we modify the centering step as follows:

$$\mathrm{centering}(x_i^{(1:T)}) = \big( x_i^{(t)} - \alpha \odot \bar{x}_i \big)_{t=1}^{T}, \tag{18}$$

where ⊙ denotes element-wise multiplication and α ∈ R³ is a scaling factor. We set α = (1, 1, 0.993), so the values for the z-axis remain similar to those of the other axes after centering. Then we normalize the centered data as in the symmetry-aware normalization. Since the values of the z-axis were similar to those of the other axes, even after the scaling, the values of the three coordinates have a similar scale. We call this scheme symmetry-scale-aware since it is both scale-aware and preserves rotation symmetry.
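A sketch of the three schemes on a stacked array of trajectories of shape (N, T, 3) follows; the first two functions expect already-centered trajectories (Eq. 14), while the third starts from raw trajectories because it modifies the centering step (Eq. 18). This is an illustrative paraphrase of Eqs. (15)–(18), not the authors' preprocessing script.

```python
import numpy as np

def scale_aware(c: np.ndarray) -> np.ndarray:
    """Eqs. (15)-(16): per-coordinate standardization; breaks rotation symmetry."""
    mu = c.reshape(-1, 3).mean(axis=0)       # element-wise mean over all points
    sigma = c.reshape(-1, 3).std(axis=0)     # element-wise standard deviation
    return (c - mu) / sigma

def symmetry_aware(c: np.ndarray) -> np.ndarray:
    """Eq. (17): one scalar scale for all coordinates, so rotations are preserved."""
    flat = c.reshape(-1, 3)
    mu = flat.mean(axis=0)
    s = np.sqrt(((flat - flat.mean()) ** 2).mean())   # total standard deviation
    return (c - mu) / s

def symmetry_scale_aware(x: np.ndarray,
                         alpha=np.array([1.0, 1.0, 0.993])) -> np.ndarray:
    """Eq. (18): rescale the per-trajectory mean before subtracting it, then apply
    the symmetry-aware normalization, so z-values keep a usable scale."""
    xbar = x.mean(axis=1, keepdims=True)     # per-trajectory mean positions, (N, 1, 3)
    c = x - alpha * xbar
    return symmetry_aware(c)
```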
4.2.3. Analysis of the Results

We expect that the scale-aware normalization breaks the Oz(2) equivariance because it normalizes the x and y axes with different scales, but the degree of approximate equivariance should not be severe because the x axis and the y axis have similar (but still different) scales. Indeed, Figure 5 shows that the models (approximately) equivariant to Oz(2), namely Oz(2)-EMLP, Oz(2)-O(3)-MEMLP, and PER, perform better than the others. Interestingly, as can be seen in Figure 6, PER discovers that the data has soft Oz(2) equivariance, which coincides with our expectation that the scale-aware normalization mildly breaks the Oz(2) equivariance. Note that Oz(2)-EMLP exhibits a tiny equivariance error; this is due to a numerical error in calculating the equivariant bases Q and R in Equation 2.

Figure 5: Test ADE results for the WOMD dataset under the scale-aware, symmetry-aware, and symmetry-scale-aware normalizations, comparing Oz(2)-EMLP (Oz(2) equivariant), O(3)-EMLP (O(3) equivariant), Oz(2)-RPP (Oz(2) soft), O(3)-RPP (O(3) soft), Oz(2)-O(3)-MEMLP (Oz(2) equivariant, O(3) soft), and (Ox(2), Oy(2), Oz(2))-PER.

Even though the symmetry-aware normalization does not break the O(3) equivariance, the dataset itself has soft O(3) equivariance due to gravity acting on the vehicles. However, the significantly small scale of the z-coordinates under the symmetry-aware normalization causes a model to underestimate the Ox(2) and Oy(2) equivariance. Consequently, the small equivariance errors discovered by PER led to the best performance. As shown in Figure 6, while Oz(2)-EMLP captures only large equivariance errors on Ox(2) and Oy(2), PER captures small equivariance errors on O(3).

Figure 6: The model equivariance errors captured by Oz(2)-EMLP and our algorithm under the scale-aware, symmetry-aware, and symmetry-scale-aware normalizations.

For the symmetry-scale-aware scheme, PER shows the best performance. As with the scale-aware scheme, all models perform well except for O(3)-EMLP. Together with the captured equivariance errors in Figure 6, this indicates that the symmetry-scale-aware scheme is also softly O(3)-equivariant. Whereas the element-wise scaling causes the soft O(3) equivariance in the scale-aware scheme, in the symmetry-scale-aware scheme it is mainly gravity acting on the vehicles along the z-axis that results in the soft O(3) equivariance of the data. Moreover, the relatively large equivariance errors on Oz(2) helped the performance of PER, which is coherent with the results of Assaad et al. (2022). To summarize, for all three normalizations, PER robustly outperforms the baselines and discovers reasonable soft symmetries.

5. Related Work

The translation equivariance of CNNs and the permutation equivariance of GNNs are the most popular examples of symmetries built into neural networks. Recently, there have been several works designing neural networks with desirable group equivariance. EMLP (Finzi et al., 2021b) is a framework that builds an MLP equivariant to various groups. LieConv (Finzi et al., 2020) is a variant of CNN targeting equivariance to Lie groups. Another variant, called G-CNN (Cohen & Welling, 2016), is equivariant w.r.t. 90-degree rotations, reflections, and translations.

Most softly equivariant models impose architectural restrictions for the soft equivariance. RPP (Finzi et al., 2021a) builds a softly equivariant model via a residual layer added to the equivariant linear layer, where the degree of equivariance is determined by the prior variances assigned to the equivariant layer and the residual pathway. Relaxed group convolution (Wang et al., 2022) implements a softly equivariant CNN by interpolating multiple convolution operations with different weights, and the number of convolutions determines the degree of equivariance. Relaxed G-steerable group convolution (Wang et al., 2022) introduces spatial-location-dependent weights that replace the weights in the G-steerable CNN. Relaxed G- and G-steerable CNNs use group-action-based regularizers to restrict the relaxation. There are some previous works allowing automatic symmetry discovery from data (Dehmamy et al., 2021; Krippendorf & Syvaeri, 2021). However, to our knowledge, ours is the first to discover the varying degrees of approximate equivariance across multiple groups under mixed-symmetry settings.

6. Conclusion

In this paper, we tackle learning problems under mixed symmetries, where a dataset contains multiple types of symmetries with different levels of equivariance error. While previous methods focus on a single type of symmetry and bake the equivariance constraint into the architecture as an inductive bias, ours takes a regularizer-based approach, where a model without any equivariance constraint is regularized towards it using a projection-based regularizer. One notable advantage is that our method can automatically detect the levels of equivariance errors and adapt to those error levels by controlling the regularization coefficients.
This is done during training without any explicit supervision. Using a synthetic function approximation task and a real-world motion forecasting task, we demonstrate that our proposed model can indeed capture mixed symmetries, identify the different levels of equivariance errors, and predict better than existing methods. In this paper, we mainly focused on MLP architectures, so extending our framework to arbitrary neural network architectures such as CNNs, RNNs, or transformers (Vaswani et al., 2017) would be an interesting direction for future research.

Acknowledgements

This work was partially supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)), the Artificial Intelligence Innovation Hub (No. 2022-0-00713), and the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2021M3E5D9025030). HY was supported by the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921) and also by the Institute for Basic Science (IBS-R029-C1). We are grateful to Seongho Keum, who helped us throughout the process of building up this work.

References

Assaad, S., Downey, C., Al-Rfou, R., Nayakanti, N., and Sapp, B. VN-Transformer: Rotation-equivariant attention for vector neurons, 2022.

Bronstein, M. M., Bruna, J., Cohen, T., and Veličković, P. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.

Cohen, T. and Welling, M. Group equivariant convolutional networks. In Proceedings of The 33rd International Conference on Machine Learning (ICML 2016), 2016.

Cohen, T., Geiger, M., Köhler, J., and Welling, M. Spherical CNNs. arXiv preprint arXiv:1801.10130, 2018.

Cohen, T., Weiler, M., Kicanaoglu, B., and Welling, M. Gauge equivariant convolutional networks and the icosahedral CNN. In Proceedings of The 36th International Conference on Machine Learning (ICML 2019), 2019.

Dehmamy, N., Walters, R., Liu, Y., Wang, D., and Yu, R. Automatic symmetry discovery with Lie algebra convolutional network. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021.

Ettinger, S., Cheng, S., Caine, B., Liu, C., Zhao, H., Pradhan, S., Chai, Y., Sapp, B., Qi, C. R., Zhou, Y., Yang, Z., Chouard, A., Sun, P., Ngiam, J., Vasudevan, V., McCauley, A., Shlens, J., and Anguelov, D. Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021.

Finzi, M., Stanton, S., Izmailov, P., and Wilson, A. G. Generalizing convolutional neural networks for equivariance to Lie groups on arbitrary continuous data. In Proceedings of The 37th International Conference on Machine Learning (ICML 2020), 2020.

Finzi, M., Benton, G., and Wilson, A. G. Residual pathway priors for soft equivariance constraints. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021a.

Finzi, M., Welling, M., and Wilson, A. G. A practical method for constructing equivariant multilayer perceptrons for arbitrary matrix groups. In Proceedings of The 38th International Conference on Machine Learning (ICML 2021), 2021b.
Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), 2010.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In 2015 IEEE International Conference on Computer Vision (ICCV 2015), 2015.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR 2015), 2015.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2016.

Krippendorf, S. and Syvaeri, M. Detecting symmetries with neural networks. Machine Learning: Science and Technology, 2021.

Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations (ICLR 2017), 2017.

Puny, O., Atzmon, M., Smith, E. J., Misra, I., Grover, A., Ben-Hamu, H., and Lipman, Y. Frame averaging for invariant and equivariant network design. In The Tenth International Conference on Learning Representations (ICLR 2022), 2022.

van der Ouderaa, T. F. A., Romero, D. W., and van der Wilk, M. Relaxing equivariance constraints with non-stationary continuous filters, 2022.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.

Wang, R., Walters, R., and Yu, R. Approximately equivariant networks for imperfectly symmetric dynamics, 2022.

A. Proof of Proposition 3.1

Proof. We write V for the vectorized form of a weight W and mat(·) for the operation that converts the vector form back to matrix form, i.e., V = vec(W) and W = mat(V). By the definitions of Q_w and Q_b, we can use the identities

$$\rho_Y(g)\,\mathrm{mat}(Q_w Q_w^\top V) = \mathrm{mat}(Q_w Q_w^\top V)\,\rho_X(g), \qquad \rho_Y(g)\, Q_b Q_b^\top b = Q_b Q_b^\top b. \tag{19}$$

We first prove Proposition 3.1 when f is a linear (affine) function. Adding and subtracting the projected terms, using (19), and applying the triangle inequality,

$$\begin{aligned}
&\|\rho_Y(g)(Wx + b) - W\rho_X(g)x - b\| \\
&\quad\le \|\rho_Y(g)Wx - \rho_Y(g)\,\mathrm{mat}(Q_w Q_w^\top V)\,x\| + \|\rho_Y(g)b - \rho_Y(g)Q_b Q_b^\top b\| \\
&\qquad + \|W\rho_X(g)x - \mathrm{mat}(Q_w Q_w^\top V)\,\rho_X(g)x\| + \|b - Q_b Q_b^\top b\| \\
&\quad= \|\rho_Y(g)\big(W - \mathrm{mat}(Q_w Q_w^\top V)\big)x\| + \|\rho_Y(g)(b - Q_b Q_b^\top b)\| \\
&\qquad + \|\big(W - \mathrm{mat}(Q_w Q_w^\top V)\big)\rho_X(g)x\| + \|b - Q_b Q_b^\top b\|. && (20)\text{--}(26)
\end{aligned}$$

We can split off ρ(g) and x using operator norms, and the operator norm is bounded by the Frobenius norm:

$$\begin{aligned}
\|\rho_Y(g)\big(W - \mathrm{mat}(Q_w Q_w^\top V)\big)x\| &\le \|\rho_Y(g)\|_{\mathrm{op}}\,\|W - \mathrm{mat}(Q_w Q_w^\top V)\|_F\,\|x\|, && (27)\\
\|\rho_Y(g)(b - Q_b Q_b^\top b)\| &\le \|\rho_Y(g)\|_{\mathrm{op}}\,\|b - Q_b Q_b^\top b\|, && (28)\\
\|\big(W - \mathrm{mat}(Q_w Q_w^\top V)\big)\rho_X(g)x\| &\le \|W - \mathrm{mat}(Q_w Q_w^\top V)\|_F\,\|\rho_X(g)x\|. && (29)
\end{aligned}$$

Therefore, since the Frobenius norm above equals the ℓ2 norm of the corresponding vectorization, the G-equivariance error is bounded as follows:

$$\begin{aligned}
\sup_{x,g}\;&\|\rho_Y(g)(Wx + b) - W\rho_X(g)x - b\| && (30)\\
&\le \Big(\sup_g \|\rho_Y(g)\|_{\mathrm{op}} \sup_x \|x\| + \sup_{x,g}\|\rho_X(g)x\|\Big)\,\|V - Q_w Q_w^\top V\| && (31)\\
&\quad + \Big(\sup_g \|\rho_Y(g)\|_{\mathrm{op}} + 1\Big)\,\|b - Q_b Q_b^\top b\| && (32)\\
&= C_1\,\|V - Q_w Q_w^\top V\| + C_2\,\|b - Q_b Q_b^\top b\|. && (33)
\end{aligned}$$

The norm of x is bounded since we have a finite dataset.
We now consider the case when f is a non-linear function whose activation σ is G-equivariant and L-Lipschitz continuous. The equivariant activation σ satisfies

$$\rho_Y(g)\,\sigma(f(x)) = \sigma(\rho_Y(g) f(x)) \tag{34}$$

for any function f. Hence, the G-equivariance error satisfies

$$\|\rho_Y(g)\sigma(Wx + b) - \sigma(W\rho_X(g)x + b)\| = \|\sigma(\rho_Y(g)(Wx + b)) - \sigma(W\rho_X(g)x + b)\|. \tag{35}$$

Since σ is L-Lipschitz continuous,

$$\|\sigma(\rho_Y(g)(Wx + b)) - \sigma(W\rho_X(g)x + b)\| \le L\,\|\rho_Y(g)(Wx + b) - W\rho_X(g)x - b\|. \tag{36}$$

The right-hand side is the equivariance error of a linear function, which is bounded by Equation 33.

Lastly, we show the case when f is a two-layer MLP; MLPs with more than two layers can be handled in the same way. Writing x̄ = σ(W^{(1)}x + b^{(1)}), adding and subtracting W^{(2)}ρ^{(1)}(g)σ(W^{(1)}x + b^{(1)}) + b^{(2)} and applying the triangle inequality, the G-equivariance error is

$$\begin{aligned}
&\|\rho^{(2)}(g)\big(W^{(2)}\sigma(W^{(1)}x + b^{(1)}) + b^{(2)}\big) - W^{(2)}\sigma(W^{(1)}\rho^{(0)}(g)x + b^{(1)}) - b^{(2)}\| && (37)\\
&\quad\le \|\rho^{(2)}(g)\big(W^{(2)}\bar{x} + b^{(2)}\big) - W^{(2)}\rho^{(1)}(g)\bar{x} - b^{(2)}\| \\
&\qquad + \|W^{(2)}\|_{\mathrm{op}}\,\|\rho^{(1)}(g)\sigma(W^{(1)}x + b^{(1)}) - \sigma(W^{(1)}\rho^{(0)}(g)x + b^{(1)})\|. && (38)\text{--}(40)
\end{aligned}$$

The first term of Equation 40 is the equivariance error of a linear function whose input is the output of the first layer, and the second term involves the equivariance error of the non-linear first layer. Overall, the equivariance error of the two-layer MLP is bounded as

$$\begin{aligned}
\sup_{x,g}\;&\|\rho^{(2)}(g)\big(W^{(2)}\sigma(W^{(1)}x + b^{(1)}) + b^{(2)}\big) - W^{(2)}\sigma(W^{(1)}\rho^{(0)}(g)x + b^{(1)}) - b^{(2)}\| && (41)\\
&\le \Big(\sup_g \|\rho^{(2)}(g)\|_{\mathrm{op}} \sup_x \|\bar{x}\| + \sup_{x,g}\|\rho^{(1)}(g)\bar{x}\|\Big)\,\|V^{(2)} - Q_{w^{(2)}} Q_{w^{(2)}}^\top V^{(2)}\| && (42)\\
&\quad + \Big(\sup_g \|\rho^{(2)}(g)\|_{\mathrm{op}} + 1\Big)\,\|b^{(2)} - Q_{b^{(2)}} Q_{b^{(2)}}^\top b^{(2)}\| && (43)\\
&\quad + L\,\|W^{(2)}\|_{\mathrm{op}}\Big(\sup_g \|\rho^{(1)}(g)\|_{\mathrm{op}} \sup_x \|x\| + \sup_{x,g}\|\rho^{(0)}(g)x\|\Big)\,\|V^{(1)} - Q_{w^{(1)}} Q_{w^{(1)}}^\top V^{(1)}\| && (44)\\
&\quad + L\,\|W^{(2)}\|_{\mathrm{op}}\Big(\sup_g \|\rho^{(1)}(g)\|_{\mathrm{op}} + 1\Big)\,\|b^{(1)} - Q_{b^{(1)}} Q_{b^{(1)}}^\top b^{(1)}\| && (45)\\
&\le C^{(2)}_1\,\|V^{(2)} - Q_{w^{(2)}} Q_{w^{(2)}}^\top V^{(2)}\| + C^{(2)}_2\,\|b^{(2)} - Q_{b^{(2)}} Q_{b^{(2)}}^\top b^{(2)}\| \\
&\quad + C^{(1)}_1\,\|V^{(1)} - Q_{w^{(1)}} Q_{w^{(1)}}^\top V^{(1)}\| + C^{(1)}_2\,\|b^{(1)} - Q_{b^{(1)}} Q_{b^{(1)}}^\top b^{(1)}\|. && (46)\text{--}(47)
\end{aligned}$$

By mathematical induction, the bound on the equivariance error of MLPs with more than two layers is derived as follows:

$$\begin{aligned}
\sup_{x,g}\;&\|\rho^{(S)}(g)\big(W^{(S)}\sigma(f'(x)) + b^{(S)}\big) - W^{(S)}\sigma(f'(\rho^{(0)}(g)x)) - b^{(S)}\| && (48)\\
&\le \sup_{x,g}\|\rho^{(S)}(g)\big(W^{(S)}\sigma(f'(x)) + b^{(S)}\big) - W^{(S)}\rho^{(S-1)}(g)\sigma(f'(x)) - b^{(S)}\| \\
&\quad + L\,\|W^{(S)}\|_{\mathrm{op}} \sup_{x,g}\|\rho^{(S-1)}(g)f'(x) - f'(\rho^{(0)}(g)x)\|, && (49)\text{--}(50)
\end{aligned}$$

where S is the number of layers and f' is an (S − 1)-layer MLP. ∎

B. RPP vs. PER

Illustrated in Figure 7.

Figure 7: Comparison of the parameterization between RPP (left) and PER (right). W_1 and W_2 are the parameters of RPP. W_1 explains equivariance by projecting onto the equivariant space, and W_2, called a residual path, captures the difference between the approximate equivariance desired in the dataset and the strict equivariance of Q_w Q_w^⊤ W_1. On the other hand, W_PER does not require additional parameters because it is already kept close to the equivariant space by the regularizer.

C. Weight Initializations for PER

Since our method does not restrict the parameter space, we can freely choose a desirable weight initialization strategy according to prior knowledge about the given task.

C.1. Standard

Obviously, we can utilize well-known neural network initializations such as Glorot initialization (Glorot & Bengio, 2010) and He initialization (He et al., 2015).
C.2. RPP-Mimicking Initialization

This initialization mimics the initial weights of the RPP model. The structure of the RPP model consists of the sum of weights projected onto the equivariant space, Q Q^⊤ vec(W_1), and small weights vec(W_2) acting as a perturbation of the equivariant weights:

$$\mathrm{vec}(W_{\mathrm{RPP}}) = Q Q^\top \mathrm{vec}(W_1) + \mathrm{vec}(W_2), \qquad \mathrm{vec}(W_1) \sim N(0, \sigma^2 I), \qquad \mathrm{vec}(W_2) \sim N(0, \epsilon \sigma^2 I), \tag{51}$$

where 0 < ε ≪ 1 and σ is determined by the selected type of initialization, such as Glorot or He. Thus, our model can be initialized with the corresponding summed distribution as follows:

$$\mathrm{vec}(W_{\mathrm{PER}}) \sim N(0, \sigma^2 Q Q^\top + \epsilon \sigma^2 I). \tag{52}$$

C.3. Half Soft

The degree of approximate equivariance is determined by the perpendicular distance from the equivariant space, and this distance is determined by the amount of the weight lying in the complementary direction of the equivariant space; i.e., the approximate equivariance degree of a weight W is determined by Q_⊥ Q_⊥^⊤ vec(W), because vec(W) = Q Q^⊤ vec(W) + Q_⊥ Q_⊥^⊤ vec(W), where Q_⊥ is the complementary basis of Q. Therefore, we can control the equivariance of the initial weights with a scaling factor λ as follows:

$$\mathrm{vec}(W_{\mathrm{PER}}) \sim N\big(0, (1 - \lambda)\sigma^2 Q Q^\top + \lambda \sigma^2 I\big) = N\big(0, \sigma^2 Q Q^\top + \lambda \sigma^2 Q_\perp Q_\perp^\top\big). \tag{53}$$

The case λ = 0 corresponds to the initial weights of EMLP, and the case λ = 1 corresponds to the initial weights of an MLP. We chose λ = 0.5 to locate the model in the middle between EMLP and MLP.
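A sketch of how the RPP-mimicking initialization of Eq. (52) and the half-soft initialization of Eq. (53) can be sampled for one layer's vectorized weight; `Q` is the layer's equivariant basis, `sigma` comes from a standard scheme such as He initialization, and the function names are illustrative.

```python
import torch

def rpp_mimic_init(Q: torch.Tensor, sigma: float, eps: float = 1e-2) -> torch.Tensor:
    """Eq. (52): vec(W) ~ N(0, sigma^2 Q Q^T + eps * sigma^2 I)."""
    d, k = Q.shape
    equiv_part = Q @ (sigma * torch.randn(k))        # lies in the equivariant subspace
    perturb = (eps ** 0.5) * sigma * torch.randn(d)  # small isotropic perturbation
    return equiv_part + perturb

def half_soft_init(Q: torch.Tensor, sigma: float, lam: float = 0.5) -> torch.Tensor:
    """Eq. (53): vec(W) ~ N(0, sigma^2 Q Q^T + lam * sigma^2 Q_perp Q_perp^T)."""
    z = sigma * torch.randn(Q.shape[0])
    z_equiv = Q @ (Q.T @ z)                          # component inside the subspace
    z_perp = z - z_equiv                             # component orthogonal to it
    return z_equiv + (lam ** 0.5) * z_perp
```

With lam = 0 the samples lie entirely in the equivariant subspace (EMLP-style), and with lam = 1 they follow the standard isotropic initialization.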
D. Samples for Measuring Correlation

Experiments for measuring the Pearson correlation are listed in Table 3.

E. Comparison of Trajectories Between the Normalizations

Three example trajectories (red, green, and blue) for each normalization are shown in Figure 8.

F. Experimental Details

F.1. Dataset Description

Information of each dataset is summarized in Table 4.

Table 3: Samples for measuring the correlation with the data equivariance error. For each of Oz(2), Ox(2), and Oy(2), the columns show the data equivariance error, the regularization coefficient, the model equivariance error, and the value of the equivariance regularizer. ε1 = I x̂x̂^⊤, ε2 = I ŷŷ^⊤, ε3 = I ẑẑ^⊤, and ε4 = I x̂x̂^⊤ + I ŷŷ^⊤ − I ẑẑ^⊤.

| Noise | Data Err. Oz(2) | Data Err. Ox(2) | Data Err. Oy(2) | Coeff. Oz(2) | Coeff. Ox(2) | Coeff. Oy(2) | Model Err. Oz(2) | Model Err. Ox(2) | Model Err. Oy(2) | Regular. Oz(2) | Regular. Ox(2) | Regular. Oy(2) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.22E-08 | 7.23E-08 | 7.38E-08 | 1.00E+02 | 9.80E+01 | 9.34E+01 | 9.17E-04 | 7.72E-04 | 7.60E-04 | 1.46E-07 | 1.32E-07 | 1.31E-07 |
| 0.3ε1 | 5.89E-02 | 7.05E-08 | 5.85E-02 | 1.35E+01 | 1.00E+02 | 1.32E+01 | 5.30E-02 | 4.27E-03 | 5.42E-02 | 7.06E-05 | 9.95E-07 | 7.01E-05 |
| 0.6ε1 | 1.05E-01 | 7.11E-08 | 1.04E-01 | 1.22E+01 | 1.00E+02 | 1.29E+01 | 7.56E-02 | 8.09E-03 | 7.95E-02 | 1.19E-04 | 4.78E-06 | 1.18E-04 |
| 0.9ε1 | 1.41E-01 | 6.90E-08 | 1.40E-01 | 1.97E+01 | 1.00E+02 | 1.98E+01 | 1.36E-01 | 3.78E-03 | 1.34E-01 | 9.61E-05 | 2.20E-06 | 9.53E-05 |
| 0.3ε2 | 5.99E-02 | 5.34E-02 | 7.43E-08 | 7.54E+00 | 7.36E+00 | 1.00E+02 | 4.54E-02 | 5.06E-02 | 9.64E-03 | 1.85E-04 | 1.84E-04 | 1.57E-05 |
| 0.6ε2 | 1.08E-01 | 9.74E-02 | 7.22E-08 | 2.64E+01 | 2.62E+01 | 1.00E+02 | 8.13E-02 | 8.71E-02 | 1.73E-02 | 1.28E-04 | 1.27E-04 | 2.62E-05 |
| 0.9ε2 | 1.47E-01 | 1.34E-01 | 7.15E-08 | 9.61E+00 | 9.86E+00 | 1.00E+02 | 1.08E-01 | 1.10E-01 | 7.21E-03 | 2.44E-04 | 2.44E-04 | 1.96E-05 |
| 0.3ε3 | 7.04E-08 | 5.34E-02 | 5.96E-02 | 1.00E+02 | 3.36E+00 | 3.39E+00 | 1.67E-03 | 5.55E-02 | 5.56E-02 | 7.64E-07 | 1.41E-04 | 1.41E-04 |
| 0.6ε3 | 7.05E-08 | 9.75E-02 | 1.08E-01 | 1.00E+02 | 4.74E+00 | 4.96E+00 | 2.02E-03 | 9.61E-02 | 9.47E-02 | 2.87E-06 | 2.04E-04 | 2.05E-04 |
| 0.9ε3 | 6.88E-08 | 1.34E-01 | 1.47E-01 | 1.00E+02 | 4.19E+00 | 4.21E+00 | 1.87E-03 | 1.39E-01 | 1.40E-01 | 2.40E-06 | 2.65E-04 | 2.65E-04 |
| 0.3ε4 | 1.32E-01 | 5.91E-02 | 6.70E-02 | 1.79E+01 | 9.73E+01 | 1.00E+02 | 1.17E-01 | 5.13E-02 | 6.59E-02 | 9.04E-05 | 3.46E-05 | 3.39E-05 |
| 0.6ε4 | 2.50E-01 | 1.14E-01 | 1.31E-01 | 3.68E+01 | 9.62E+01 | 1.00E+02 | 1.96E-01 | 9.09E-02 | 1.06E-01 | 1.53E-04 | 5.79E-05 | 5.92E-05 |
| 0.9ε4 | 3.41E-01 | 1.58E-01 | 1.85E-01 | 3.29E+01 | 5.17E+01 | 1.00E+02 | 3.07E-01 | 1.39E-01 | 1.75E-01 | 1.86E-04 | 6.21E-05 | 5.86E-05 |

Figure 8: Example trajectories under each normalization (panels: original trajectories, scale-aware, symmetry-aware, symmetry-scale-aware). The scale-aware normalization strongly emphasizes the z-coordinates. The symmetry-aware normalization scales down all coordinates, but the scale of the z-coordinates remains close to zero. The symmetry-scale-aware normalization scales down all coordinates while retaining the scale of the z-coordinates.

F.2. Data Selection of WOMD

Trajectory Slicing. The WOMD dataset contains up to 91 points per trajectory, measured at 10 Hz. We sliced out the first 24 points and dropped every even-numbered point, so that the final trajectory contains only 12 points. The 6 past points and 6 future points are regarded as input and output, respectively.

Trajectory Selection. We used only a portion of the whole WOMD dataset. The training part of the WOMD motion forecasting dataset consists of 1,000 files in the TFRecord format. We used the first 28 files as the training set, the next 6 files as the validation set, and the last 6 files as the test set. Furthermore, to exclude all trajectories that do not move enough or move too far, we collected only trajectories satisfying the following conditions:

$$\big(y^{(t=6)}\cdot\hat{x} - y^{(t=1)}\cdot\hat{x}\big)^2 < 5, \tag{54}$$
$$\big(y^{(t=6)}\cdot\hat{y} - y^{(t=1)}\cdot\hat{y}\big)^2 < 5, \tag{55}$$
$$\big(y^{(t=6)}\cdot\hat{z} - y^{(t=1)}\cdot\hat{z}\big)^2 > 0.05, \tag{56}$$

where x̂, ŷ, ẑ ∈ R³ are the orthonormal basis vectors of the x-, y-, and z-axes.
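A sketch of the slicing and selection rules just described, with a raw trajectory given as a NumPy array of 10 Hz points of shape (num_points, 3); the thresholds follow Eqs. (54)–(56) and the helper names are illustrative.

```python
import numpy as np

def slice_trajectory(points: np.ndarray) -> np.ndarray:
    """Keep every other one of the first 24 recorded points, giving 12 points
    (6 past positions as input, 6 future positions as target)."""
    return points[:24][::2]

def keep_trajectory(y: np.ndarray) -> bool:
    """Selection rules of Eqs. (54)-(56) on the 6 future positions y (shape (6, 3)):
    discard trajectories that move too far in x or y, or barely move in z."""
    dx = (y[5, 0] - y[0, 0]) ** 2
    dy = (y[5, 1] - y[0, 1]) ** 2
    dz = (y[5, 2] - y[0, 2]) ** 2
    return bool(dx < 5 and dy < 5 and dz > 0.05)
```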
F.3. Details of Training

For all experiments, we used five different seeds to report performance results.

Architecture Description. All architectures of the neural networks, including EMLP, RPP, Mixed RPP, and our model, are fixed to 4 layers with different widths. The widths were adjusted so that the numbers of parameters are the same.

Table 4: Information of each dataset. S denotes a scalar and V denotes a vector in R³.

| | Inertia | CosSim | WOMD |
|---|---|---|---|
| Training Samples | 1,000 | 1,000 | 16,814 |
| Validation Samples | 1,000 | 1,000 | 3,339 |
| Testing Samples | 1,000 | 1,000 | 3,563 |
| Input Representation | 5S + 5V | 3V | 6V |
| Output Representation | V² | S | 6V |

Hyperparameters. See Table 5 for the hyperparameter settings of our model and the baseline methods for all experiments. These hyperparameters are applied identically to all models. Additional hyperparameters of our model for each task are listed in Table 6.

Table 5: Common hyperparameter settings for each task.

| Dataset | Mini-batch | Max Epochs | Learning Rate | Weight Decay | Width (RPP) |
|---|---|---|---|---|---|
| Inertia | 500 | 8,000 | 0.001 | 2.0 × 10⁻⁴ | 384 (270) |
| CosSim | 200 | 10,000 | 0.0002 | 2.0 × 10⁻⁵ | 128 (45) |
| WOMD (scale-aware) | 256 | 750 | 0.0002 | 0 | 384 (269) |
| WOMD (symmetry-aware) | 256 | 500 | 0.0002 | 0 | 384 (269) |
| WOMD (symmetry-scale-aware) | 256 | 500 | 0.0002 | 0 | 384 (269) |

Table 6: Hyperparameter settings of our model.

| Dataset | Task Type | Initial λ | γ | Adjustment Epoch | Initialization | Mini-batch | Max Epochs |
|---|---|---|---|---|---|---|---|
| Inertia | O(3) | 100 | 2 | 2,000 | Standard | 500 | 8,000 |
| Inertia | Ox(2) | 100 | 2 | 2,000 | Standard | 500 | 8,000 |
| Inertia | Oy(2) | 100 | 2 | 2,000 | Standard | 500 | 8,000 |
| Inertia | Oz(2) | 100 | 2 | 2,000 | Standard | 500 | 8,000 |
| Inertia | Only Soft | 100 | 2 | 2,000 | Standard | 500 | 8,000 |
| CosSim | SO(3), S(3) | 0.005 | 2 | 2,500 | Standard | 200 | 10,000 |
| CosSim | SO(3) | 0.1 | 2 | 2,500 | Standard | 200 | 10,000 |
| CosSim | S(3) | 0.01 | 2 | 2,500 | Standard | 200 | 10,000 |
| CosSim | Only Soft | 0.005 | 2 | 2,500 | Standard | 200 | 10,000 |
| WOMD | Scale-aware | 0.2 | 5 | 125 | Half Soft | 128 | 500 |
| WOMD | Symmetry-aware | 0.3 | 5 | 100 | Half Soft | 128 | 500 |
| WOMD | Symmetry-scale-aware | 5 | 5 | 100 | Half Soft | 128 | 500 |

Extra Details. We applied cosine decay of the learning rate (Loshchilov & Hutter, 2017) and early stopping with a patience of 50 for stable training. The optimizer used in every task is Adam (Kingma & Ba, 2015). All experiments were trained and evaluated on RTX 3090 devices.

G. Additional Experiments

G.1. Analysis of Adjustment of Hyperparameters

We share part of the robustness analysis across the different hyperparameters (initial coefficients λ, scaling factor γ, and the moment of the adjustment) required in the automatic tuning procedure described in Section 3.2. As the results in Table 7 show, we found that the performance of the model is not very sensitive to the choice of these hyperparameters. For instance, for the initial value of λ, we observed that the model achieves similar performances provided that the ratio of the initial loss over the initial λ·R^PER is at a certain level. The situation for the scaling factor γ is similar: the final performance was consistent for values arbitrarily chosen within the range [2, 5].

Table 7: Test MSE results across different hyperparameters required in the automatic PER-coefficient-tuning procedure described in Section 3.2. λ is the initial coefficient and γ is the scaling factor.

(a) Inertia Oz(2) task

| loss/(λ·R^PER) | Test MSE |
|---|---|
| 0.00009 | 1.75 ± 2.27 |
| 0.00037 | 0.32 ± 0.26 |
| 0.00147 | 0.35 ± 0.19 |
| Oz(2)-EMLP | 1.56 ± 0.17 |

(b) CosSim S(3) task

| loss/(λ·R^PER) | Test MSE |
|---|---|
| 0.0348 | 0.068 ± 0.007 |
| 0.1741 | 0.065 ± 0.009 |
| 0.8707 | 0.052 ± 0.013 |
| S(3)-EMLP | 0.21 ± 0.11 |

(c) Inertia Oz(2) task

| γ | Test MSE |
|---|---|
| 2 | 0.32 ± 0.26 |
| 3 | 0.40 ± 0.24 |
| 4 | 0.35 ± 0.15 |
| 5 | 0.43 ± 0.21 |
| Oz(2)-EMLP | 1.56 ± 0.17 |

(d) CosSim S(3) task

| γ | Test MSE |
|---|---|
| 2 | 0.065 ± 0.009 |
| 3 | 0.044 ± 0.004 |
| 4 | 0.044 ± 0.004 |
| 5 | 0.044 ± 0.004 |
| S(3)-EMLP | 0.214 ± 0.110 |

(e) Inertia Oz(2) task (training epochs 8,000)

| Adjusted Epoch | Test MSE |
|---|---|
| 1,000 | 0.26 ± 0.17 |
| 2,000 | 0.32 ± 0.26 |
| 3,000 | 0.38 ± 0.24 |
| Oz(2)-EMLP | 1.56 ± 0.17 |

(f) CosSim S(3) task (training epochs 2,000)

| Adjusted Epoch | Test MSE |
|---|---|
| 300 | 0.044 ± 0.003 |
| 500 | 0.065 ± 0.009 |
| 700 | 0.046 ± 0.004 |
| S(3)-EMLP | 0.214 ± 0.110 |
G.2. Comparison with Frame Averaging

Table 8: Test MSE comparison with FA on the Inertia O(3) task.

| Model | Test MSE |
|---|---|
| MLP | 4.25 ± 0.17 |
| EMLP | 1.13 ± 0.36 |
| RPP | 2.66 ± 1.43 |
| PER | 0.27 ± 0.23 |
| MLP w/ FA | 0.36 ± 0.05 |

Frame Averaging (FA) (Puny et al., 2022) is a framework that, in simple terms, trains a G-equivariant model f by averaging over some group elements of G, called frames. FA is a flexible approach since it does not restrict the internal structure of the model f, unlike EMLP. We ran additional experiments with FA on the fully equivariant task, the same setting as the first row of Table 1. Table 8 shows the results. Although the EMLP in the table uses the gated nonlinearity (GNL) due to its architectural restriction, FA needs no such restriction, so the MLP w/ FA row in Table 8 applies frame averaging to the same setup as the MLP row (i.e., an MLP with the Swish activation). Our results confirm that FA is indeed a more powerful baseline than EMLP. Note, however, that our model (PER) still performs better than MLP w/ FA here.

G.3. Simple Experiment Assuming Symmetries Are Unknown in the WOMD Task

Table 9: (a) Captured equivariance errors across the prescribed regularizers with different groups. (b) Change of test MSE due to the additional regularizers (SLz(2), SLy(2), GLx(2), and GLy(2)).

(a) Model equivariance errors

| Regularized Groups | Model Equiv. Err. |
|---|---|
| Oz(2) | 0.0007 |
| Ox(2) | 0.0006 |
| SLz(2) | 0.2638 |
| SLy(2) | 0.2302 |
| GLx(2) | 0.2398 |
| GLy(2) | 0.2089 |

(b) Test MSE

| Model | Test MSE (× 10⁻²) |
|---|---|
| O(2) PER | 3.07 ± 0.01 |
| O(2), SL(2), GL(2) PER | 3.09 ± 0.01 |

We describe an additional experiment in which we mimic the situation of unknown symmetries by including various, and sometimes wrong, matrix groups as candidate groups and checking whether our method picks the correct ones. Table 9 shows the model equivariance errors captured by the model when using all O(2), SL(2), and GL(2) PERs to train on the motion forecasting task with the symmetry-aware normalization (this task has symmetries with respect to Oz(2) and Ox(2)). As the tables show, our method appropriately captured the equivariance with respect to Oz(2) and Ox(2).