# Noether Networks: Meta-Learning Useful Conserved Quantities

Ferran Alet*1, Dylan Doblar*1, Allan Zhou2, Joshua Tenenbaum1, Kenji Kawaguchi3, Chelsea Finn2
1MIT, 2Stanford University, 3National University of Singapore
{alet,ddoblar}@mit.edu

Progress in machine learning (ML) stems from a combination of data availability, computational resources, and an appropriate encoding of inductive biases. Useful biases often exploit symmetries in the prediction problem, such as convolutional networks relying on translation equivariance. Automatically discovering these useful symmetries holds the potential to greatly improve the performance of ML systems, but still remains a challenge. In this work, we focus on sequential prediction problems and take inspiration from Noether's theorem to reduce the problem of finding inductive biases to meta-learning useful conserved quantities. We propose Noether Networks: a new type of architecture where a meta-learned conservation loss is optimized inside the prediction function. We show, theoretically and experimentally, that Noether Networks improve prediction quality, providing a general framework for discovering inductive biases in sequential problems.

## 1 Introduction

The clever use of inductive biases to exploit symmetries has been at the heart of many landmark achievements in machine learning, such as translation invariance in CNN image classification [35], permutation invariance in Graph Neural Networks [52] for drug design [57], and roto-translational equivariance in SE(3)-Transformers [25] for protein structure prediction [33]. However, for data distributions of interest, there may be exploitable symmetries that are either unknown or difficult to describe in code. Progress has been made in automatically discovering symmetries for finite groups [74], but meta-learning and exploiting general continuous symmetries has remained a major challenge.
In part, this is because symmetries describe the effect of counterfactuals about perturbations to the data, which are not directly observable. In this work, we propose to exploit symmetries in sequential prediction problems indirectly. We take inspiration from Noether's theorem [44], which loosely states the following: for every continuous symmetry property of a dynamical system, there is a corresponding quantity whose value is conserved in time. For example, consider a system of planets interacting via gravity: this system is translation invariant in all three cardinal directions (i.e., translating the entire system along the x, y, or z axis conserves the laws of motion). Noether's theorem asserts there must be a conserved quantity for each of these symmetries; in this case, linear momentum. Similarly, the system has a time invariance (i.e., the laws of motion are the same today as they will be tomorrow); in this case, the corresponding conserved quantity is the total energy of the system.

*Equal contribution. Our code is publicly available at https://lis.csail.mit.edu/noether.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Figure 1: Noether Networks enforce conservation laws, meta-learned by $g_\phi$, in sequential predictions made by $f_\theta$, which is tailored to the input $x_0$ to produce final predictions $\hat{x}_{1:T}$. Imposing these meta-learned inductive biases improves video prediction quality with objects sliding down a ramp.

Inspired by this equivalence, we propose that approximate conservation laws are a powerful paradigm for meta-learning useful inductive biases in sequential prediction problems. Whereas symmetries are difficult to discover because they are global properties linked to counterfactuals about unobserved perturbations of the data, conserved quantities can be directly observed in the true data. This provides an immediate signal for machine learning algorithms to exploit.
Our approach involves meta-learning a parametric conservation loss function which is useful for the prediction task. We leverage the tailoring framework [4], which proposes to encode inductive biases by fine-tuning neural networks with hand-designed unsupervised losses inside the prediction function. Whereas traditional auxiliary losses are added to the task loss during training, tailoring losses are optimized inside the prediction function both during training and testing, customizing the model to each individual query. In doing so, we ensure there is no generalization gap for the conservation loss. We propose to meta-learn the self-supervised loss function and parameterize it in the form of a conservation loss:

$$\mathcal{L}(x_0, \hat{x}_{1:T}; g_\phi) = \sum_{t=1}^{T} \left| g_\phi(x_0) - g_\phi(\hat{x}_t) \right|^2.$$

This conservation form encodes a meta-inductive bias (an inductive bias over inductive biases) which narrows the search space exponentially (in $T$) and simplifies the parameterization. Figure 1 illustrates the Noether Network pipeline for a video prediction task.

The main contribution of this paper is the Noether Network, an architecture class and algorithm that can automatically learn its own inductive biases in the form of meta-learned conservation loss functions. We theoretically characterize the advantages of such conservation laws as effective regularizers that constrain learning to lower-dimensional output manifolds, lowering the generalization gap. Empirically, we first find that, when the meta-learned conservation loss takes the form of a synthesized program, Noether Networks recover known conservation laws from raw physical data. Second, we find that Noether Networks can learn conservation losses on raw videos, modestly improving the generalization of a video prediction model, especially on longer-term predictions.
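To make the conservation loss above concrete, here is a minimal sketch in plain Python. The embedding `g` (a toy spring energy $q^2 + p^2$), the hand-written rollouts, and all names are illustrative assumptions, not the paper's models:

```python
import math

def noether_loss(g, x0, x_preds):
    """Sum of squared embedding drifts between the input and each prediction."""
    g0 = g(x0)
    return sum((g0 - g(xt)) ** 2 for xt in x_preds)

# Toy example: g embeds a 2-D spring state (q, p) as the energy q^2 + p^2.
g = lambda x: x[0] ** 2 + x[1] ** 2

x0 = (1.0, 0.0)
conserving = [(math.cos(t), -math.sin(t)) for t in (0.1, 0.2, 0.3)]  # energy constant
drifting = [(1.0 + 0.1 * t, 0.0) for t in (1, 2, 3)]                 # energy grows

assert noether_loss(g, x0, conserving) < 1e-12  # conservation satisfied
assert noether_loss(g, x0, drifting) > 0.1      # conservation violated
```

A rollout that stays on the constant-energy circle incurs (numerically) zero Noether loss, while one that drifts off the data manifold is penalized; this is the signal the meta-learner exploits.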
## 2 Theoretical Advantages of Enforcing Conservation Laws

In this section, we demonstrate principled advantages of enforcing conservation laws of the form $g_\phi(f_\theta(x)) = g_\phi(x)$ by considering a special case where preimages under $g_\phi$ form affine subspaces. Let input $x$ and target $y$ satisfy $x, y \in \mathbb{R}^d$, and let the Noether embedding be $g_\phi : \mathbb{R}^d \to P$ where $P = \{g_\phi(x) : x \in \mathbb{R}^d\}$. We consider a restricted class of models, parameterized by $\theta \in \Theta$, of the form $f_\theta(x) = x + v_\theta$ for $v_\theta \in \mathbb{R}^d$, such that for all $x$ the preimage of $g_\phi$ is $g_\phi^{-1}[\{g_\phi(x)\}] = \{x + Az : z \in \mathbb{R}^m\}$ with $A \in \mathbb{R}^{d \times m}$. Here, $m \le d$ is the dimensionality of the preimages of $g_\phi$. We denote by $C$ the smallest upper bound on the loss value, i.e., $\mathcal{L}(f_\theta(x), y) \le C$ for all $x$, $y$, and $\theta$. Define $\psi(v) = \mathbb{E}_{x,y}[\mathcal{L}(f(x, v), y)] - \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(f(x_i, v), y_i)$, with Lipschitz constant $\zeta$. That is, $\zeta \ge 0$ is the smallest real number such that, for all $v$ and $v'$ in $V$, $|\psi(v) - \psi(v')| \le \zeta \|v - v'\|_2$, where $V = \{v_\theta \in \mathbb{R}^d : \theta \in \Theta\}$. Finally, we define $R = \sup_{\theta \in \Theta} \|v_\theta\|_2$, and the generalization gap $G(\theta) = \mathbb{E}_{x,y}[\mathcal{L}(f_\theta(x), y)] - \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(f_\theta(x_i), y_i)$.

Theorem 1 shows that enforcing conservation laws of the form $g_\phi(f_\theta(x)) = g_\phi(x)$ is advantageous when the dimension of the preimages under $g_\phi$ is less than the dimension of $x$; that is, when $m < d$.

**Theorem 1.** Let $\rho \in \mathbb{N}^+$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over an iid draw of $n$ examples $((x_i, y_i))_{i=1}^{n}$, the following holds for all $\theta \in \Theta$:

$$G(\theta) \;\le\; C\sqrt{\frac{\xi \ln(\max(\xi, 1)) + \xi \ln\!\big(2R\,\zeta^{1-1/\rho}\sqrt{n}\big) + \ln(1/\delta)}{n}},$$

where $\xi = m$ if $g_\phi(f_\theta(x)) = g_\phi(x)$ for all $x \in \mathbb{R}^d$ and $\theta \in \Theta$, and $\xi = d$ otherwise.

The proof is presented in Appendix A. In Theorem 1, when we enforce the conservation laws, $d$ is replaced by $m$, the dimension of the preimage. We now discuss various cases for the value of $m$:

(Case $m = d$) Consider an extreme scenario where the function $g_\phi$ maps all $x \in \mathbb{R}^d$ to one single point. In this scenario, the dimensionality of preimages under $g_\phi$ is maximized as $m = d$. Accordingly, the bounds with and without enforcing the conservation laws become the same.
Indeed, in this scenario, the conservation laws $g_\phi(f_\theta(x)) = g_\phi(x)$ give us no information, because they always hold for all $x \in \mathbb{R}^d$, even without imposing them.

(Case $m = 0$) Consider another extreme scenario where the function $g_\phi$ is invertible. In this scenario, the dimensionality of preimages under $g_\phi$ is zero, $m = 0$. Thus, imposing the condition $g_\phi(f_\theta(x)) = g_\phi(x)$ makes the bound in Theorem 1 very small. Indeed, in this scenario, the condition $g_\phi(f_\theta(x)) = g_\phi(x)$ implies that $f_\theta(x) = x$: i.e., $x$ is not moving, and thus it is easy to generalize.

(Case $0 < m < d$) From the above two cases, we can see that the benefit of enforcing conservation laws comes from the more practical cases in between, with $0 < m < d$.

In Theorem 1, the function $g_\phi$ can differ from the true function $g^*_\phi$ underlying the system. This is because we analyze a standard generalization gap: i.e., the difference between the expected loss and the training loss. The cost of not using the true $g^*_\phi$ is captured in the training loss; i.e., the training loss can be large when $g_\phi$ differs significantly from the true $g^*_\phi$. Even in this case, the generalization gap can be small. For example, in the case of $m = 0$, the generalization bound is small, whereas the training loss will be large unless $x_{t+1} = x_t$. Therefore, our analysis gives insight into the trade-off between the training loss and the dimensionality of preimages under $g_\phi$.

## 3 Noether Networks

**Leveraging tailoring to encode inductive biases.** We perform a prediction-time optimization to encourage outputs to follow conservation laws using the tailoring framework [4]. Tailoring encodes inductive biases in the form of unsupervised losses optimized inside the prediction function. In doing so, tailoring fine-tunes the model to each query to ensure that it minimizes the unsupervised loss for that query. For example, we may optimize for energy conservation in a physics prediction problem.
In meta-tailoring, we train the model to do well on the task loss after the tailoring step has fine-tuned its parameters. In contrast to auxiliary losses, which would optimize the conservation loss only for the training points, tailoring allows us to ensure conservation at test time. Since we aim to build the conservation into the architecture, we want to ensure it is also satisfied for unseen test samples. Another advantage of tailoring losses is that they are easier to meta-learn. Auxiliary losses are pooled over all examples and training epochs, and their effect is only known at validation/test time; we would need implicit gradients [39, 47] to know their eventual effect on the final weights at the end of training. With tailoring, we can directly measure the effect of the meta-learned update on the same sample. A limitation of the tailoring framework is that the tailoring loss must be user-specified. This is acceptable in domains where the desired inductive bias is both known and easily encoded, but problematic in general; we address this issue with Noether Networks. Our approach can be seen as a generalization of tailoring where the unsupervised loss is meta-learned and takes the form of a conservation loss.

**Noether Networks: discovering useful conservation laws.** We build on the paradigm of conservation laws, i.e., quantities that are conserved over time. This comes with two challenges:

1. Real data often has noise, breaking any exact conservation law. Moreover, many conservation laws are only approximately satisfied given only partial information about the system. For example, conservation of energy is not fully satisfied in real dissipative systems, and estimating conservation of mass from pixels is bound to be inexact in the case of occlusions.
2. Optimizing for conservation alone can lead to trivial quantities, such as predicting a constant value $g_\phi(x) = C$ independent of $x$.
We search for useful conservation laws, whose (approximate) enforcement brings us closer to the true data manifold for the current sequence. Note that a useful conserved quantity doesn't need to be exactly conserved in the training data to improve prediction performance; we only need its conservation to be informative of the data manifold. This allows us to discover conservation laws even in imperfect, noisy environments. Conversely, tailoring with a trivial conserved quantity cannot improve the final predictions (see App. B for a detailed explanation). Finally, viewing conserved quantities as useful inductive biases aligns well with their use, since we often care about biases only insofar as they help improve task performance.

The search for such useful conserved quantities can take many forms. In this work, we focus on two. First, at the beginning of Section 4.1, we use a combination of program synthesis and gradient descent to generate a large, but finite, number of candidate parametric formulas for physical system prediction. We then try meta-tailoring with each formula on the training set and pick the conservation formula with the best loss. Formulas can be a useful, interpretable description in scientific domains given the appropriate state space. However, for general descriptions from raw data, we would like to describe the loss with a neural network, as we describe in the next section.

**Meta-learning a neural loss function.** We propose Noether Networks, an architecture class for sequential prediction that consists of a base predictor $f_\theta : x_t \mapsto x_{t+1}$ and a meta-learned tailoring loss, parameterized as the conservation of a neural embedding $g_\phi : x_t \mapsto \mathbb{R}^k$. This embedding takes raw predictions as input (such as images in the case of video prediction). The conservation form induces a meta-inductive bias over potential learned tailoring losses.
The Noether loss is formulated as

$$\mathcal{L}_{\text{Noether}}(x_0, \tilde{x}_{1:T}; g_\phi) = \sum_{t=1}^{T} \left| g_\phi(x_0) - g_\phi(\tilde{x}_t) \right|^2 \qquad \text{(2a)}$$

or

$$\mathcal{L}_{\text{Noether}}(x_0, \tilde{x}_{1:T}; g_\phi) = \sum_{t=1}^{T} \left| g_\phi(\tilde{x}_{t-1}) - g_\phi(\tilde{x}_t) \right|^2, \qquad \text{(2b)}$$

where $x_0$ is the ground-truth input, the $\tilde{x}_t = f_\theta(\tilde{x}_{t-1})$ are the model's predictions, and $\tilde{x}_0 \equiv x_0$. Expressions (2a) and (2b) are equivalent if we fully enforce the conservation law, but they differ if conservation is not fully enforced. When not fully conserving, (2a) propagates information from the ground truth more directly to the prediction, but (2b) may be a useful approximation which better handles imperfectly conserved quantities, where the quantity should be Lipschitz but not exactly constant. In both cases, if we tailor $\theta$ with a single gradient step, the update takes the form

$$\theta(x_0; \phi) = \theta - \lambda_{\text{in}} \nabla_\theta \mathcal{L}_{\text{Noether}}(x_0, \tilde{x}_{1:T}(\theta); g_\phi). \qquad (3)$$

We compute final predictions as $\hat{x}_t = f_{\theta(x_0;\phi)}(\hat{x}_{t-1})$. We can now backpropagate from $\mathcal{L}_{\text{task}}(x_{1:T}, \hat{x}_{1:T})$ back to $\phi$, which will be optimized so that the unsupervised adaptation $\theta \to \theta(x_0; \phi)$ helps lower $\mathcal{L}_{\text{task}}$. The optimization requires second-order gradients to express how $\phi$ affects $\mathcal{L}_{\text{task}}$ through $\theta(x_0; \phi)$. This is similar to MAML [23], as well as to works on meta-learning loss functions for few-shot learning [5] and group distribution shift [71]. Algorithm 1 provides pseudo-code. For deep learning frameworks that allow per-example weights, such as JAX [9], the loop over sequences in Alg. 1 can be efficiently parallelized. To parallelize it in other frameworks, we use the CNGRAD algorithm [4], which adapts only the Conditional Normalization (CN) layers in the inner loop. Similar to BN layers, CN only performs element-wise affine transformations, $y = x \odot \gamma + \beta$, which can be efficiently parallelized in most deep learning frameworks even with per-example $\gamma, \beta$. Finally, even though we use the two-level optimization typical of meta-learning, we are still in the classical single-task, single-distribution setting.
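To illustrate the inner update of Equation (3), here is a minimal sketch with a one-parameter toy predictor. The embedding `g`, the finite-difference gradient (standing in for backpropagation), and the repeated tailoring steps are simplifying assumptions for exposition, not the paper's implementation, which takes a single backpropagated step on network weights:

```python
def rollout(theta, x0, T):
    """Toy autoregressive predictor f_theta(x) = theta * x."""
    xs, x = [], x0
    for _ in range(T):
        x = theta * x
        xs.append(x)
    return xs

def noether_loss(theta, x0, T, g):
    # Form (2a): squared embedding drift between input and each prediction.
    return sum((g(x0) - g(xt)) ** 2 for xt in rollout(theta, x0, T))

def tailor(theta, x0, T, g, lr=0.01, eps=1e-6):
    """One inner SGD step on the Noether loss, gradient via finite differences."""
    grad = (noether_loss(theta + eps, x0, T, g)
            - noether_loss(theta - eps, x0, T, g)) / (2 * eps)
    return theta - lr * grad

g = lambda x: x * x         # hypothetical conserved embedding (a toy "energy")
theta, x0, T = 1.2, 1.0, 3  # theta = 1.2 violates conservation: |x_t| grows

for _ in range(100):        # repeated tailoring, for illustration only
    theta = tailor(theta, x0, T, g)

# Tailoring drives the dynamics toward conserving g: |theta| -> 1.
assert abs(abs(theta) - 1.0) < 0.01
```

The inner step moves the predictor's parameters so that the rollout better conserves the embedding; in the full method, the outer loop then shapes $g_\phi$ so that this adaptation also lowers the task loss.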
Noether Networks learn to impose their own inductive biases via a learned loss and to leverage it via an adaptation of their parameters $\theta$. This is useful as we often do not have access to meta-partitions that distinguish data between distributions or tasks.

```
Algorithm 1  Prediction and training procedures for Noether Networks with neural conservation loss
Given: predictive model class f; embedding model class g; prediction horizon T;
       training dist. D_train; batch size N; learning rates λ_in, λ_out, λ_emb;
       task loss L_task; Noether loss L_Noether

 1: procedure PredictSequence(x0; θ, φ)
 2:   x̃0, x̂0 ← x0, x0
 3:   x̃t ← f_θ(x̃_{t−1})  ∀t ∈ {1, …, T}                     ▷ Initial predictions
 4:   θ(x0; φ) ← θ − λ_in ∇_θ L_Noether(x0, x̃_{1:T}; g_φ)    ▷ Inner step with Noether loss
 5:   x̂t ← f_{θ(x0;φ)}(x̂_{t−1})  ∀t ∈ {1, …, T}             ▷ Final prediction with tailored weights
 6:   return x̂_{1:T}

 7: procedure Train
 8:   φ ← randomly initialized weights                        ▷ Initialize weights for Noether embedding g
 9:   θ ← randomly initialized weights                        ▷ Initialize weights for predictive model f
10:   while not done do
11:     Sample batch x^(0)_{0:T}, …, x^(N)_{0:T} ∼ D_train
12:     for 0 ≤ n ≤ N do
13:       x̂^(n)_{1:T} ← PredictSequence(x^(n)_0; θ, φ)
14:     φ ← φ − λ_emb ∇_φ Σ_{n=0}^{N} L_task(x̂^(n)_{1:T}, x^(n)_{1:T})   ▷ Outer step with task loss for embedding
15:     θ ← θ − λ_out ∇_θ Σ_{n=0}^{N} L_task(x̂^(n)_{1:T}, x^(n)_{1:T})   ▷ Outer step for predictive model
16:   return φ, θ
```

## 4 Experiments

Our experiments are designed to answer the following questions:

1. Can Noether Networks recover known conservation laws in scientific data?
2. Are Noether Networks useful in settings with controlled dynamics?
3. Can Noether Networks parameterize useful conserved quantities from raw pixel information?
4. How does the degree of conservation affect performance on the prediction task?

### 4.1 Experimental domains

To answer the above questions, we consider a number of different experimental domains, which we overview in this subsection before moving on to the evaluation and results.

**Spring and pendulum from state coordinates.** Greydanus et al.
[28] propose the setting of an ideal spring and an ideal pendulum, which allows us to understand the behavior of Noether Networks on scientific data where we know a useful conserved quantity: the energy. They also provide data from a real pendulum from [53]. In contrast to the ideal setups, here the data is noisy and the Hamiltonian is only approximately conserved, as the system is dissipative. For the pendulum, the input $x = (p, q) \in \mathbb{R}^2$ contains its angle $q$ and momentum $p$. Given the parameters in [28], the energy is $H = 3(1 - \cos q) + p^2$, with $p^2 - 3\cos q$ being a simpler equivalent. For the spring, the input $x = (p, q) \in \mathbb{R}^2$ contains the displacement $q$ and momentum $p$, and the energy is $H = \frac{1}{2}(q^2 + p^2)$. Thus, $q^2 + 1.0\,p^2$ is a conserved quantity, where the coefficient $1.0$ carries appropriate units. We build on their vanilla MLP baseline and discover conservation laws that, when used for meta-tailoring, improve predictions. Since the baseline predicts $\frac{d}{dt}(x_t)$ rather than $x_{t+1}$, we apply the loss to a finite-difference approximation, i.e., $\mathcal{L}_{\text{Noether}}\big(x_t,\, x_t + \tfrac{dx}{dt}\Delta t\big) = \mathcal{L}_{\text{Noether}}\big(x_t,\, x_t + f_\theta(x_t)\Delta t\big)$.

**Domain-specific language for scientific formulas.** To search over Hamiltonians, we program a simple domain-specific language (DSL) for physical formulas. Since formulas in the DSL have physical meaning, each sub-formula carries its own associated physical units and is checked for validity. This allows us to significantly prune the exponential space, as done in AI Feynman [63]. The vocabulary of the DSL is the following:

- Input(i): returns the i-th input,
- Operation: one of {+, −, ×, /, sin, cos, x²},
- Parameter(u): trainable scalar with units [u].

Table 1: Recovering H for the ideal pendulum.

| Method | Description | RMSE |
|---|---|---|
| Vanilla MLP | N/A | 0.0563 |
| Noether Nets | $p^2 - 2.99\cos(q)$ | 0.0423 |
| True H [oracle] | $p^2 - 3.00\cos(q)$ | 0.0422 |

Table 2: Recovering H for the ideal spring.
| Method | Description | RMSE |
|---|---|---|
| Vanilla MLP | N/A | .0174 |
| Noether Nets | $q^2 + 1.002\,p^2$ | .0165 |
| True H [oracle] | $q^2 + 1.000\,p^2$ | .0166 |

Figure 2: Noether Networks can recover the energy of a real pendulum, even though it is not fully conserved. This is because they only look for quantities whose conservation helps improve predictions. Moreover, by only softly encouraging conservation, they better encode imperfect conservations.

**Pixel pendulum with controls.** In settings such as robotics, we may be interested in action-conditioned video prediction [45]: predicting future frames given previous frames and an agent's actions. We recorded videos of episodes in OpenAI Gym's pendulum swing-up environment, using a scripted policy. Each model receives four history frames and a sequence of 26 policy actions starting from the current timestep, and must predict 26 future frames. As this is a visually simple environment, we limit the training dataset to 5 episodes of 200 frames each, holding out 195 episodes for testing.

**Video prediction with real-world data.** To characterize the benefits provided by Noether Networks in a real-world video prediction setting, we evaluate the effect of adding a Noether conservation loss to the stochastic video generation (SVG) model [18] on the ramp scenario of the Physics 101 dataset [69]. The dataset contains videos of various real-world objects sliding down an incline and colliding with a second object. Each model is conditioned on the first two frames and must predict the subsequent 20 frames. To increase difficulty, we restrict our data to the 20-degree incline.

### 4.2 Evaluation

**Can Noether Networks recover known conservation laws in scientific data?** We first generate all valid formulas up to depth 7, removing operations between incompatible physical units and formulas equivalent up to commutativity. We obtain 41,460 and 72,372 candidate conserved formulas for the pendulum and spring, respectively, which need not have units of energy.
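A hypothetical miniature of such a unit-checked enumeration can be sketched as follows. The representation (expression string paired with a unit tag), the unit names, and the pruning rules are illustrative simplifications, not the paper's actual grammar:

```python
from itertools import product

# Leaves of the toy DSL: pendulum inputs with their physical units.
LEAVES = [("q", "angle"), ("p", "momentum")]

def grow(exprs):
    """One round of expression growth with unit checking, as a simple sketch."""
    out = list(exprs)
    for (e1, u1), (e2, u2) in product(exprs, exprs):
        if u1 == u2:                       # '+' only between matching units
            out.append((f"({e1} + {e2})", u1))
        out.append((f"({e1} * {e2})", f"{u1}*{u2}"))
    for e, u in exprs:
        if u == "angle":                   # trig only applies to angles
            out.append((f"cos({e})", "dimensionless"))
    return out

depth2 = grow(LEAVES)

# Unit checking prunes ill-typed expressions such as "(q + p)".
assert all(expr != "(q + p)" for expr, _ in depth2)
assert ("cos(q)", "dimensionless") in depth2
assert ("(q + q)", "angle") in depth2
```

Repeatedly applying `grow` (plus deduplication up to commutativity, omitted here) yields the kind of pruned candidate set described above, with unit checks cutting the exponential space before any numerical fitting happens.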
These formulas are purely symbolic, with some trainable parameters still to be determined. We thus optimize their parameters via gradient descent to have low variance within sequences of true data. We then do the same for random sequences; if the variance for the true data is two orders of magnitude smaller than for random sequences, we accept the formula as an approximate conservation. This measure is similar to others used before [38], but is not sufficient when the data is noisy, the minimization is sub-optimal, or there are numerical issues. As a result, we obtain 210 and 219 candidate approximately conserved quantities for the pendulum and spring, respectively. Finally, we try each potentially conserved quantity as a meta-tailoring loss: starting from the pre-trained vanilla MLP, we fine-tune it for 100 epochs using meta-tailoring, with one inner step and a range of inner learning rates $10^k$ for $k \in \{-3, -2.5, \ldots, -1\}$. We then evaluate the fine-tuned model on long-term predictions, keeping the expression with the best MSE loss on the training data. We observe that this process correctly singles out equations of the same form as the true Hamiltonian $H$, with almost the exact parameters. Using these as losses reaches performance equivalent to that of the oracle, which uses the true formula (see Tables 1 and 2). Finally, we run the same process for a real pendulum [53], where energy is not conserved. We use largely the same pipeline as for the ideal pendulum; the differences are explained in Appendix C. Noether Networks discover $p^2 - 2.39\cos(q)$, close to the (potentially sub-optimal) $H = p^2 - 2.4\cos(q)$ described in [28]. This Noether Network improves the baseline by more than one order of magnitude, matches the performance of hand-coding the conservation loss, and improves over Hamiltonian Neural Networks, which fully impose conservation of energy. Results can be seen in Figure 2.
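The variance-based acceptance test can be sketched as follows, using toy spring trajectories that conserve $q^2 + p^2$. The helper names and data are illustrative assumptions; only the two-orders-of-magnitude criterion mirrors the procedure above:

```python
import math, random

def accepts(candidate, true_seqs, random_seqs, ratio=100.0):
    """Accept a candidate as approximately conserved if its variance within
    true sequences is at least `ratio` times smaller than within random ones."""
    def mean_var(seqs):
        per_seq = []
        for seq in seqs:
            vals = [candidate(x) for x in seq]
            m = sum(vals) / len(vals)
            per_seq.append(sum((v - m) ** 2 for v in vals) / len(vals))
        return sum(per_seq) / len(per_seq)
    return mean_var(true_seqs) * ratio <= mean_var(random_seqs)

# Toy data: spring trajectories (q, p) = (cos, sin) conserve energy; random states do not.
rng = random.Random(0)
true_seqs = [[(math.cos(t / 10 + c), math.sin(t / 10 + c)) for t in range(50)]
             for c in range(5)]
rand_seqs = [[(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(50)]
             for _ in range(5)]

energy = lambda x: x[0] ** 2 + x[1] ** 2
assert accepts(energy, true_seqs, rand_seqs)               # conserved: accepted
assert not accepts(lambda x: x[1], true_seqs, rand_seqs)   # momentum alone: rejected
```

The filter is deliberately loose: it only needs to discard formulas whose "conservation" is no better on real trajectories than on noise, leaving the final selection to meta-tailoring performance.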
Figure 3: In the controlled pendulum environment, the Noether Network has lower mean squared error (left) and slightly better structural similarity (right). Metrics are computed and plotted by prediction timestep, where timestep 0 is a given history frame. Note that simply training SVG for more steps does not increase performance.

Figure 4: The Noether Network outperforms the baseline by a small margin in all four metrics in real-world video prediction, showing it can meta-learn useful conserved quantities from raw pixel data.

**Are Noether Networks useful in settings with controlled dynamics?** For the controlled pendulum experiments, we begin with an SVG model but modify it to (1) remove the prior sampling component, since the environment is deterministic, and (2) concatenate each action to the corresponding frame-encoder output and feed the result to the LSTM. After training the SVG model for 50 epochs and fixing its weights, we run Algorithm 1 for 20 epochs to learn an embedding network for conserving useful quantities. We found that Noether Networks perform better in this setting by directly adapting the latent LSTM activations against the conservation loss, rather than adapting any network parameters. Since Noether Networks take additional training steps starting from a pre-trained SVG model, we also continued training the SVG model for 20 further epochs, which we call SVG (more steps). Figure 3 shows the results of each method on held-out sequences. Noether Networks improve the overall mean squared error (MSE) and structural similarity (SSIM) over the base SVG.

**Can Noether Networks parameterize useful conserved quantities from raw pixel information?** We train on only 311 training sequences, which makes generalization difficult. In this setting, the base SVG model struggles with overfitting: novel test objects often morph into objects seen at train time, and the object in motion often morphs with the target object (as shown in Figure 1).
Our Noether Network uses the inner-loss formulation of Equation (2b), where $g_\phi$ is a two-layer CNN receiving two consecutive frames, followed by a fully-connected projection layer producing 64-dimensional embeddings. We meta-tailor the embedding and the base model for 400 epochs. As seen in Figure 4, taking a single inner tailoring step improves performance slightly over the baseline SVG model with respect to all four considered metrics: learned perceptual image patch similarity (LPIPS) [72], mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity (SSIM). As the prediction horizon increases, the Noether Network performs comparatively better, likely because conservations encourage the model to retain aspects of the ground-truth input. These results provide some evidence that Noether Networks are capable of learning useful inductive biases from raw video data. To investigate whether the Noether embeddings learn relevant features, we use Gradient-weighted Class Activation Mapping (Grad-CAM) [54] to compute importance maps for sequences in the test set. Since the Noether embedding has 64 dimensions, we perform principal component analysis (PCA) before Grad-CAM to reduce the dimensionality of the embedding and to sort dimensions by the percentage of variance they explain in the test-set frames. Interestingly, we find that the first PCA dimension captures 83.6% of the variance, and the first four dimensions capture 99.9% of the variance. Figure 5 shows Grad-CAM localization maps for two example test-set sequences, where (a) depicts an orange torus and (b) shows a blue brick; warmer colors (red) indicate high importance and cooler colors (blue) indicate low importance. Our interpretations of the Grad-CAM localization maps are consistent across examples; see Appendix F for additional ones. The first dimension, which explains the vast majority of the variance, primarily focuses on the sliding object in both examples.
It also attends to the object on the table, and to the edge of the ramp. The attention to the objects suggests that the Noether Network learns to conserve quantities related to the objects' pixels and their motion (since it takes two consecutive frames).

Figure 5: Grad-CAM localization maps show that the Noether embeddings attend to relevant frame regions for two example sequences: (a) an orange object and (b) a blue object. Heatmaps for the first six dimensions of the embedding (in the PCA basis, sorted by decreasing amount of variance explained) are shown for three time steps.

Figure 6: Despite training with only one inner step, as we lower the Noether loss to zero (top), the outer loss remains low (bottom).

We hypothesize that the attention to the table edge encodes information about the orientation of the sequence, since half of the training sequences are randomly flipped horizontally. While some of the dimensions focus nearly uniformly on the entire frame, the fourth dimension focuses on the orange sliding object in (a), and on the human hand and table in (b). In all the test-set examples, whenever the human hand is in the frame, it is attended to by this dimension. We hypothesize that it learns to conserve the hand in sequences where it is present, and possibly picks up on similarly-colored pixels as well. Finally, the sixth dimension seems to track blue sliding objects.

**How does the degree of conservation affect performance?** All of the results presented in this work were achieved by Noether Networks trained and evaluated with a single inner step. To characterize how the degree to which the conservations are imposed affects video prediction performance, we optimize the inner (Noether) loss for many inner steps and measure the outer (task) loss, as shown in Figure 6.
Here, the inner loss is optimized by Adam for 500 steps, as opposed to the single step of SGD during training (both settings use a learning rate of $10^{-4}$). During the optimization, the outer loss improves for roughly 150 inner steps, which suggests that the approximate conservations learned by the Noether Network generalize to more exact conservation settings.

## 5 Related Work

**Unsupervised adaptation and meta-learning loss functions.** Hand-designed unsupervised losses have been used to adapt to distribution shifts with rotation prediction [59] or entropy minimization [65], as well as to encode inductive biases [4]. Unsupervised losses have also been meta-learned for learning to encode human demonstrations [70], for few-shot learning exploiting unsupervised information [41, 5], and for learning to adapt to group distribution shifts [71]. In contrast, our unsupervised loss only takes the single query we care about, thus imposing no additional assumptions on top of standard prediction problems, and takes the form of a conservation law for predicted sequences.

**Encoding of symmetries and physics-based inductive biases in neural networks.** We propose to encode symmetries in dynamical systems as meta-learned conserved quantities. This falls under the umbrella of geometric deep learning, which aims to encode symmetries in neural networks; Bronstein et al. [12] provide an excellent review of the field. Most relevant to the types of problems we focus on are roto-translational equivariance [62, 66, 25, 51], applications of GNNs in physical settings [13, 6, 34, 3, 2, 49, 46, 50, 17], and the encoding of time-invariance in RNNs [22, 32, 30, 14, 60]. Recent works have encoded Hamiltonian and Lagrangian mechanics into neural models [28, 40, 15, 24], with gains in data-efficiency in physical and robotics systems, including some modeling controlled or dissipative systems [73, 19].
In contrast to these works, we propose to encode biases via prediction-time fine-tuning following the tailoring framework [4], instead of via architectural constraints. This allows us to leverage the generality of loss functions to encode a class of inductive biases extending beyond Lagrangian mechanics, and to handle raw video data. In this regard, Noether Networks are closer to energy-based models (EBMs) [1, 42, 29, 37, 20], which also optimize a loss function to make predictions. However, EBMs directly optimize the prediction, instead of tailoring the weights of a forward model, which can ease the learning problem. In particular, [42] show how EBMs can be a good model for enforcing constraints, demonstrating it by predicting two bouncing balls from state information. Outside mechanics, Suh and Tedrake [58] highlight the difficulty of learning physical inductive biases with deep models from raw data and instead propose to encode mass conservation from pixels with a constrained linear model. Finally, Noether's theorem has been used [26, 61] to theoretically understand the optimization dynamics during learning, which is unrelated to our goal of discovering inductive biases for sequential prediction.

**Discovery of symmetries and conserved quantities.** Noether Networks are related to slow feature analysis [68, 43, 7, 27, 31], which aims to extract the smoothest features from time series in an unsupervised way, for use in dimensionality reduction, regression, or classification. In contrast to this subfield, Noether Networks impose these approximate conservations at prediction time to improve predictions. There are other previous works which aim to learn conserved quantities in dynamical systems from data. The approach of Schmidt and Lipson [53] focuses on candidate equations that correctly predict dynamic relationships between parts of the system, by measuring the agreement between the ratios of the predicted and observed partial (time) derivatives of various components of a system.
A set of analytical expressions is maintained and updated as new candidates achieve sufficient accuracy. More recently, Liu and Tegmark [38] discover conservation laws of Hamiltonian systems by performing local Monte Carlo sampling followed by linear dimensionality estimation, predicting the number of conserved quantities from an explained-ratio diagram. Wetzel et al. [67] train a Siamese network to differentiate between trajectories. In contrast to both, Noether Networks do not need data segmented into trajectories with different conserved quantities, and can deal with raw pixel inputs and outputs. Other approaches to learning symmetries from data include neuroevolution for learning connectivity patterns [56], regularizers for learning data augmentation [8], and meta-learning equivariance-inducing weight sharing [74]. However, these approaches do not aim to learn conserved quantities in dynamical systems. Finally, the two-step pipeline of symbolic search with unit constraints followed by gradient descent to estimate parameters has many commonalities with approaches to inductive process modeling [36, 11, 10] for scientific discovery. However, Noether Networks then leverage the discovered physical equations to improve the predictions of a neural network.

Neural networks to discover physical laws. There has been growing interest in leveraging the functional search power of neural networks for better scientific models. Multiple works train a neural network and then analyze it to extract information: inferring missing objects via back-propagation [3], inspecting its gradients to guide a program synthesis approach [63], learning a blueprint with a GNN [16], or leveraging overparametrized-but-regularized autoencoders [64]. Others, such as DreamCoder [21], take a more explicit approach, performing neural-guided synthesis over the space of formulas.
Unlike these works, our method can readily be applied both to symbolic conservation laws and to neural-network conservation losses on raw videos.

6 Discussion

Limitations. A disadvantage of our proposed approach over architectural constraints (in the cases when the latter can represent the same bias) is computation time. Noether Networks optimize the conservation loss inside the prediction function, which adds overhead to each prediction. This can be reduced by fine-tuning only the higher-level layers, avoiding back-propagation all the way to the input. Another way to reduce training time could be to warm-start the Noether embeddings with pre-trained contrastive embeddings, which have an aligned goal and have shown great results for video and audio classification [48]. It is also sometimes the case that, by encoding the inductive bias during training, the trained but un-tailored Noether Network performs better than a comparable architecture trained without the bias. It is also worth noting that our theoretical guarantees apply to fully enforced conservation constraints, while in practice we only fine-tune our model to better satisfy them. However, we believe the theoretical result still conveys the potential benefits of Noether Networks.

Broader impact. We propose a new framework for meta-learning inductive biases in sequential settings, including from raw data. On the positive side, our framework can be used to discover conserved quantities in scientific applications, even when these quantities are not fully conserved. By improving video prediction, we are also empowering predictive models for entities such as robots and self-driving cars, with potential benefits for elder care and mobility. At the same time, the framework could be used to better predict and track the movement of people from CCTV cameras, affecting privacy. It could also be used to improve the quality of deepfake video generation, which can be used for malicious purposes.

Overview and future work.
We propose Noether Networks: a new framework for meta-learning inductive biases through conserved quantities imposed at prediction time. Originally motivated by applications in physics, our results indicate that Noether Networks can recover known conservation laws in scientific data and also provide modest gains when applied to video prediction from raw pixel data. The generality of optimizing arbitrary unsupervised losses at prediction time provides an initial step towards broader applications. Finally, this work points to the usefulness of designing meta-learned inductive biases by putting priors on the biases instead of hard-coding the biases directly.

Acknowledgements

We want to thank Leslie Pack Kaelbling, Tomás Lozano-Pérez, and Maria Bauza for their useful and detailed feedback, as well as Yannic Kilcher, Max Welling, and Joel Zylberberg for insightful comments. We gratefully acknowledge support from GoodAI; from NSF grant 1723381; from AFOSR grant FA9550-17-1-0165; from ONR grants N00014-18-1-2847 and N00014-21-1-2685; from the Honda Research Institute; from the MIT-IBM Watson Lab; from SUTD Temasek Laboratories; and from Google. Chelsea Finn is a fellow in the CIFAR Learning in Machines and Brains program. We also acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing HPC resources that have contributed to the reported research results. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.

References

[1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
[2] F. Alet, A. K. Jeewajee, M. Bauza, A. Rodriguez, T. Lozano-Perez, and L. P. Kaelbling. Graph element networks: adaptive, structured computation and memory. In Proceedings of the 36th International Conference on Machine Learning, 2019.
[3] F. Alet, E. Weng, T. Lozano-Perez, and L.
Kaelbling. Neural relational inference with fast modular meta-learning. In Advances in Neural Information Processing Systems (NeurIPS) 32, 2019.
[4] F. Alet, M. Bauza, K. Kawaguchi, N. G. Kuru, T. Lozano-Perez, and L. P. Kaelbling. Tailoring: encoding inductive biases by optimizing unsupervised objectives at prediction time. arXiv preprint arXiv:2009.10623, 2020.
[5] A. Antoniou and A. J. Storkey. Learning to learn by self-critique. In Advances in Neural Information Processing Systems, pages 9940–9950, 2019.
[6] P. W. Battaglia, R. Pascanu, M. Lai, D. Rezende, and K. Kavukcuoglu. Interaction networks for learning about objects, relations and physics. arXiv preprint arXiv:1612.00222, 2016.
[7] Y. Bengio and J. Bergstra. Slow, decorrelated features for pretraining complex cell-like networks. Advances in Neural Information Processing Systems, 22:99–107, 2009.
[8] G. Benton, M. Finzi, P. Izmailov, and A. G. Wilson. Learning invariances in neural networks. arXiv preprint arXiv:2010.11882, 2020.
[9] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
[10] W. Bridewell and P. Langley. Two kinds of knowledge in scientific discovery. Topics in Cognitive Science, 2(1):36–52, 2010.
[11] W. Bridewell, P. Langley, L. Todorovski, and S. Džeroski. Inductive process modeling. Machine Learning, 71(1):1–32, 2008.
[12] M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.
[13] M. B. Chang, T. Ullman, A. Torralba, and J. B. Tenenbaum. A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341, 2016.
[14] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[15] M. Cranmer, S. Greydanus, S. Hoyer, P. Battaglia, D. Spergel, and S. Ho. Lagrangian neural networks. arXiv preprint arXiv:2003.04630, 2020.
[16] M. Cranmer, A. Sanchez-Gonzalez, P. Battaglia, R. Xu, K. Cranmer, D. Spergel, and S. Ho. Discovering symbolic models from deep learning with inductive biases. arXiv preprint arXiv:2006.11287, 2020.
[17] F. de Avila Belbute-Peres, T. D. Economon, and J. Zico Kolter. Combining differentiable PDE solvers and graph neural networks for fluid flow prediction. arXiv e-prints, 2020.
[18] E. Denton and R. Fergus. Stochastic video generation with a learned prior. In International Conference on Machine Learning, pages 1174–1183. PMLR, 2018.
[19] S. A. Desai, M. Mattheakis, D. Sondak, P. Protopapas, and S. J. Roberts. Port-Hamiltonian neural networks for learning explicit time-dependent dynamical systems. Physical Review E, 104(3):034312, 2021.
[20] Y. Du and I. Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.
[21] K. Ellis, C. Wong, M. Nye, M. Sable-Meyer, L. Cary, L. Morales, L. Hewitt, A. Solar-Lezama, and J. B. Tenenbaum. DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning. arXiv preprint arXiv:2006.08381, 2020.
[22] J. L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
[23] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
[24] M. Finzi, K. A. Wang, and A. G. Wilson. Simplifying Hamiltonian and Lagrangian neural networks via explicit constraints. arXiv preprint arXiv:2010.13581, 2020.
[25] F. B. Fuchs, D. E. Worrall, V. Fischer, and M. Welling. SE(3)-transformers: 3D roto-translation equivariant attention networks. arXiv preprint arXiv:2006.10503, 2020.
[26] G. Głuch and R. Urbanke. Noether: The more things change, the more stay the same. arXiv preprint arXiv:2104.05508, 2021.
[27] R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised learning of spatiotemporally coherent metrics. In Proceedings of the IEEE International Conference on Computer Vision, pages 4086–4093, 2015.
[28] S. Greydanus, M. Dzamba, and J. Yosinski. Hamiltonian neural networks. In Advances in Neural Information Processing Systems, pages 15353–15363, 2019.
[29] G. Hinton, S. Osindero, M. Welling, and Y.-W. Teh. Unsupervised discovery of nonlinear structure using contrastive backpropagation. Cognitive Science, 30(4):725–731, 2006.
[30] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[31] D. Jayaraman and K. Grauman. Slow and steady feature analysis: higher order temporal coherence in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3852–3861, 2016.
[32] M. I. Jordan. Serial order: A parallel distributed processing approach. In Advances in Psychology, volume 121, pages 471–495. Elsevier, 1997.
[33] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
[34] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel. Neural relational inference for interacting systems. arXiv preprint arXiv:1802.04687, 2018.
[35] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105, 2012.
[36] P. Langley, J. N. Sanchez, L. Todorovski, and S. Dzeroski. Inducing process models from continuous data. 2002.
[37] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.
[38] Z. Liu and M. Tegmark. AI Poincaré: Machine learning conservation laws from trajectories. arXiv preprint arXiv:2011.04698, 2020.
[39] J. Lorraine, P. Vicol, and D. Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In International Conference on Artificial Intelligence and Statistics, pages 1540–1552. PMLR, 2020.
[40] M. Lutter, C. Ritter, and J. Peters. Deep Lagrangian networks: Using physics as model prior for deep learning. arXiv preprint arXiv:1907.04490, 2019.
[41] L. Metz, N. Maheswaranathan, B. Cheung, and J. Sohl-Dickstein. Meta-learning update rules for unsupervised representation learning. arXiv preprint arXiv:1804.00222, 2018.
[42] A. Mnih and G. Hinton. Learning nonlinear constraints with contrastive backpropagation. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, volume 2, pages 1302–1307. IEEE, 2005.
[43] H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in video. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 737–744, 2009.
[44] E. Noether. Invariante Variationsprobleme. Nachrichten von der Gesellschaft der Wissenschaften zu Göttingen, Mathematisch-Physikalische Klasse, 1918:235–257, 1918.
[45] J. Oh, X. Guo, H. Lee, R. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. arXiv preprint arXiv:1507.08750, 2015.
[46] T. Pfaff, M. Fortunato, A. Sanchez-Gonzalez, and P. W. Battaglia. Learning mesh-based simulation with graph networks. arXiv preprint arXiv:2010.03409, 2020.
[47] A. Raghu, M. Raghu, S. Kornblith, D. Duvenaud, and G. Hinton. Teaching with commentaries. arXiv preprint arXiv:2011.03037, 2020.
[48] A. Recasens, P. Luc, J.-B. Alayrac, L. Wang, F. Strub, C. Tallec, M. Malinowski, V. Patraucean, F. Altché, M. Valko, et al. Broaden your views for self-supervised video learning. arXiv preprint arXiv:2103.16559, 2021.
[49] E. Rossi, B. Chamberlain, F. Frasca, D. Eynard, F. Monti, and M. Bronstein. Temporal graph networks for deep learning on dynamic graphs. arXiv preprint arXiv:2006.10637, 2020.
[50] A. Sanchez-Gonzalez, J. Godwin, T. Pfaff, R. Ying, J. Leskovec, and P. Battaglia. Learning to simulate complex physics with graph networks. In International Conference on Machine Learning, pages 8459–8468. PMLR, 2020.
[51] V. G. Satorras, E. Hoogeboom, and M. Welling. E(n) equivariant graph neural networks. arXiv preprint arXiv:2102.09844, 2021.
[52] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
[53] M. Schmidt and H. Lipson. Distilling free-form natural laws from experimental data. Science, 324(5923):81–85, 2009.
[54] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
[55] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[56] K. O. Stanley, D. B. D'Ambrosio, and J. Gauci. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2):185–212, 2009.
[57] J. M. Stokes, K. Yang, K. Swanson, W. Jin, A. Cubillos-Ruiz, N. M. Donghia, C. R. MacNair, S. French, L. A. Carfrae, Z. Bloom-Ackermann, et al. A deep learning approach to antibiotic discovery. Cell, 180(4):688–702, 2020.
[58] H. Suh and R. Tedrake. The surprising effectiveness of linear models for visual foresight in object pile manipulation. arXiv preprint arXiv:2002.09093, 2020.
[59] Y. Sun, X. Wang, Z. Liu, J. Miller, A. A. Efros, and M. Hardt. Test-time training for out-of-distribution generalization. arXiv preprint arXiv:1909.13231, 2019.
[60] C. Tallec and Y. Ollivier. Can recurrent neural networks warp time? arXiv preprint arXiv:1804.11188, 2018.
[61] H. Tanaka and D. Kunin. Noether's learning dynamics: The role of kinetic symmetry breaking in deep learning. arXiv preprint arXiv:2105.02716, 2021.
[62] N. Thomas, T. Smidt, S. Kearnes, L. Yang, L. Li, K. Kohlhoff, and P. Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds. arXiv preprint arXiv:1802.08219, 2018.
[63] S.-M. Udrescu and M. Tegmark. AI Feynman: A physics-inspired method for symbolic regression. Science Advances, 6(16):eaay2631, 2020.
[64] S.-M. Udrescu and M. Tegmark. Symbolic pregression: Discovering physical laws from distorted video. Physical Review E, 103(4):043307, 2021.
[65] D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell. Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726, 2020.
[66] M. Weiler, M. Geiger, M. Welling, W. Boomsma, and T. Cohen. 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. arXiv preprint arXiv:1807.02547, 2018.
[67] S. J. Wetzel, R. G. Melko, J. Scott, M. Panju, and V. Ganesh. Discovering symmetry invariants and conserved quantities by interpreting siamese neural networks. Physical Review Research, 2(3):033499, 2020.
[68] L. Wiskott and T. J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
[69] J. Wu, J. J. Lim, H. Zhang, J. B. Tenenbaum, and W. T. Freeman. Physics 101: Learning physical object properties from unlabeled videos. In BMVC, volume 2, page 7, 2016.
[70] T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557, 2018.
[71] M. Zhang, H. Marklund, N. Dhawan, A. Gupta, S. Levine, and C. Finn. Adaptive risk minimization: A meta-learning approach for tackling group distribution shift. arXiv preprint arXiv:2007.02931, 2020.
[72] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[73] Y. D. Zhong, B. Dey, and A. Chakraborty. Symplectic ODE-net: Learning Hamiltonian dynamics with control. arXiv preprint arXiv:1909.12077, 2019.
[74] A. Zhou, T. Knowles, and C. Finn. Meta-learning symmetries by reparameterization. arXiv preprint arXiv:2007.02933, 2020.