# efficiently_parameterized_neural_metriplectic_systems__0b3f2291.pdf Published as a conference paper at ICLR 2025 EFFICIENTLY PARAMETERIZED NEURAL METRIPLECTIC SYSTEMS Anthony Gruber Center for Computing Research Sandia National Laboratories Albuquerque, NM. USA adgrube@sandia.gov Kookjin Lee School of Computing and Augmented Intelligence Arizona State University Tempe, AZ. USA kookjin.lee@asu.edu Haksoo Lim Big Data Analytics Laboratory Yonsei University Seoul, Korea limhaksoo96@naver.com Noseong Park Big Data Analytics Laboratory Korea Advanced Institute of Science and Technology Daejeon, Korea noseong.kaist.cs@gmail.com Nathaniel Trask School of Engineering and Applied Science University of Pennsylvania Philadelphia, PA. USA ntrask@seas.upenn.edu Metriplectic systems are learned from data in a way that scales quadratically in both the size of the state and the rank of the metriplectic operators. In addition to being provably energy-conserving and entropy-stable, the proposed neural metriplectic systems (NMS) approach includes approximation results that demonstrate its ability to accurately learn metriplectic dynamics from data, along with an error estimate that indicates its potential for generalization to unseen timescales when the approximation error is low. Examples are provided to illustrate performance both with full state information available and when entropic variables are unknown, confirming that the NMS approach exhibits superior accuracy and scalability without compromising on model expressivity. 1 INTRODUCTION The theory of metriplectic, also called GENERIC, systems Morrison (1986); Grmela and Öttinger (1997) provides a principled formalism for encoding dissipative dynamics in terms of complete thermodynamical systems that conserve energy and produce entropy. By formally expressing the reversible and irreversible parts of state evolution with separate algebraic brackets, the metriplectic formalism provides a general mechanism for maintaining essential conservation laws while simultaneously respecting dissipative effects. Thermodynamic completeness implies that any dissipation is caught within a metriplectic system through the generation of entropy, allowing for a holistic treatment which has already found use in modeling fluids Morrison (1984; 2009), plasmas Kaufman and Morrison (1982); Materassi (2012), and kinetic theories Kaufman (1984); Holm et al. (2008). From a machine learning point of view, the formal separation of conservative and dissipative effects makes metriplectic systems highly appealing for the development of phenomenological models. Given data which is physics-constrained or exhibits some believed properties, a metriplectic system can be learned to exhibit the same properties with clearly identifiable conservative and dissipative parts. This allows for a more nuanced understanding of the governing dynamics via an evolution equation which reduces to an idealized Hamiltonian system as the dissipation is taken to zero. Moreover, elements corresponding author Published as a conference paper at ICLR 2025 in the kernel of the learned conservative part are immediately understood as Casimir invariants, which are special conservation laws inherent to the phase space of solutions, and are often useful for understanding and exerting control on low-dimensional structure in the system. On the other hand, the same benefit of metriplectic structure as a direct sum of reversible and irreversible parts makes it challenging to parameterize in an efficient way, since delicate degeneracy conditions must be enforced in the system for all time. In fact, there are no methods at present for learning general metriplectic systems which scale optimally with both dimension and the rank of metriplectic data an issue which this work directly addresses. Precisely, metriplectic dynamics on the finite or infinite dimensional phase space P are generated by a free energy function(al) F : P R, F = E+S defined in terms of a pair E, S : P R representing energy and entropy, respectively, along with two algebraic brackets { , }, [ , ] : C (P) C (P) C (P) which are bilinear derivations on C (P) with prescribed symmetries and degeneracies {S, } = [E, ] = 0. Here { , } is an antisymmetric Poisson bracket, which is a Lie algebra realization on functions, and [ , ] is a degenerate metric bracket which is symmetric and positive semi-definite. When P Rn for some n > 0, these brackets can be identified with symmetric matrix fields L : P Skewn(R), M : P Symn(R) satisfying {F, G} = F L G and [F, G] = F M G for all functions F, G C (P) and all states x P. Using the degeneracy conditions along with x = I and abusing notation slightly then leads the standard equations for metriplectic dynamics, x = {x, F} + [x, F] = {x, E} + [x, S] = L E + M S, which are provably energy conserving and entropy producing. To see why this is the case, recall that L = L. It follows that the infinitesimal change in energy satisfies E = x E = L E E + M S E = L E E + S M E = 0, and therefore energy is conserved along the trajectory of x. Similarly, the fact that M = M is positive semi-definite implies that S = x S = L E S + M S S = E L S + M S S = | S|2 M 0, so that entropy is nondecreasing along x as well. Geometrically, this means that the motion of a trajectory x is everywhere tangent to the level sets of energy and transverse to those of entropy, reflecting the fact that metriplectic dynamics are a combination of noncanonical Hamiltonian (M = 0) and generalized gradient (L = 0) motions. Note that these considerations also imply the Lyapunov stability of metriplectic trajectories, as can be seen by taking E as a Lyapunov function. Importantly, this also implies that metriplectic trajectories which start in the (often compact) set K = {x | E(x) E(x0)} remain there for all time. In phenomenological modeling, the entropy S is typically chosen from Casimirs of the Poisson bracket generated by L, i.e. those quantities C C (P) such that L C = 0. On the other hand, the method which will be presented here, termed neural metriplectic systems (NMS), allows for all of the metriplectic data L, M, E, S to be approximated simultaneously, removing the need for Casimir invariants to be known or assumed ahead of time. The only restriction inherent to NMS is that the metriplectic system being approximated is nondegenerate (c.f. Definition 3.1), a mild condition meaning that the gradients of energy and entropy cannot vanish at any point x P in the phase space. It will be shown that NMS alleviates known issues with previous methods for metriplectic learning, leading to easier training, superior parametric efficiency, and better generalization performance. Contributions. The proposed NMS method for learning metriplectic models offers the following advantages over previous state-of-the-art: (1) It approximates arbitrary nondegenerate metriplectic dynamics with optimal quadratic scaling in both the problem dimension n and the rank r of the irreversible dynamics. (2) It produces realistic, thermodynamically consistent entropic dynamics from unobserved entropy data. (3) It admits universal approximation and error accumulation results given in Proposition 3.7 and Theorem 3.9. (4) It yields exact energy conservation and entropy stability by construction, allowing for effective generalization to unseen timescales. 2 PREVIOUS AND RELATED WORK Previous attempts to learn metriplectic systems from data separate into hard and soft constrained methods. Hard constrained methods enforce metriplectic structure by construction, so that the Published as a conference paper at ICLR 2025 defining properties of metriplecticity cannot be violated. Conversely, methods with soft constraints relax some aspects of metriplectic structure in order to produce a wider model class which is easier to parameterize. While hard constraints are the only way to truly guarantee appropriate generalization in the learned surrogate, the hope of soft constrained methods is that the resulting model is close enough to metriplectic that it will exhibit some of the favorable characteristics of metriplectic systems, such as energy and entropy stability. Some properties of the methods compared in this work are summarized in Table 1. Soft constrained methods. Attempts to learn metriplectic systems using soft constraints rely on relaxing the degeneracy conditions L S = M E = 0. This is the approach taken in Hernández et al. (2021b), termed SPNN, which learns an almost-metriplectic model parameterized with generic neural networks through a simple L2 penalty term in the training loss, Lpen = |L E|2 + |M S|2. This widens the space of allowable network parameterizations for the approximate operators L, M. While the resulting model violates the first and second laws of thermodynamics, the authors show that reasonable trajectories are still obtained, at least when applied within the range of timescales used for training. A similar approach is taken in Hernández et al. (2021a), which targets larger problems and develops an almost-metriplectic model reduction strategy based on the same core idea. Hard constrained methods. Perhaps the first example of learning metriplectic systems from data was given in González et al. (2019) in the context of system identification. Here, training data is assumed to come from a finite element simulation, so that the discrete gradients of energy and entropy can be approximated as E(x) = Ax, S(x) = Bx. Assuming a fixed form for L produces a constrained learning problem for the constant matrices M, A, B which is solved to yield a provably metriplectic surrogate model. Similarly, the work Ruiz et al. (2021) learns M, E given L, S by considering a fixed block-wise decoupled form which trivializes the degeneracy conditions, i.e. L = [ 0; 0 0] and M = [0 0; 0 ]. This line of thought is continued in Xu et al. (2022) and Xu et al. (2023), both of which learn metriplectic systems with neural network parameterizations by assuming this decoupled block structure along with specialized forms for the metriplectic data. A somewhat broader class of metriplectic systems are considered in Gruber et al. (2023b) using tools from exterior calculus, with the goal of learning metriplectic dynamics on graph data. This leads to a structure-preserving network surrogate which scales linearly in the size of the graph domain, but also cannot express arbitrary metriplectic dynamics due to the specific modeling choices for L, M. A particularly inspirational method for learning general metriplectic systems was given in Lee et al. (2021) and termed GNODE, building on parameterizations of metriplectic operators developed in Öttinger (2014). GNODE parameterizes learnable reversible and irreversible bracket generating matrices L, M in terms of state-independent tensors ξ (Rn) 3 and ζ (Rn) 4: for 1 α, β, γ, µ, ν n, the authors choose Lαβ(x) = P γ ξαβγ γS and Mαβ(x) = P µ,ν ζαµ,βν µE νE, where αF = F/ xα, ξ is totally antisymmetric, and ζ is symmetric between the pairs (α, µ) and (β, ν) but antisymmetric within each of these pairs. The key idea here is to exchange the problem of enforcing degeneracy conditions L E = M S = 0 in matrix fields L, M with the problem of enforcing symmetry conditions in tensor fields ξ, ζ, which is comparatively easier but comes at the expense of underdetermining the problem. In GNODE, enforcement of these symmetries is handled by a generic learnable 3-tensor ξ (Rn) 3 along with learnable matrices D Symr(R) and Λs Skewn(R) for 1 s r n, leading to the final parameterizations ξαβγ = 1 3! ξαβγ ξαγβ + ξβγα ξβαγ + ξγαβ ξγβα and ζαµ,βν = P s,t Λs αµDstΛt βν. Along with learnable energy and entropy functions E, S parameterized by multi-layer perceptrons (MLPs), the data L, M learned by GNODE guarantees metriplectic structure in the surrogate model and leads to successful learning of metriplectic systems in some simple cases of interest. However, note that this is a highly redundant parameterization requiring n 3 + r n 2 + r+1 2 + 2 learnable scalar functions, which exhibits O(n3 + rn2) scaling in the problem size because of the necessity to compute and store n 3 entries of ξ and r n 2 entries of Λ. Additionally, the assumption of state-independence in the bracket generating tensors ξ, ζ is somewhat restrictive, limiting the class of problems to which GNODE can be applied. A related approach to learning metriplectic dynamics with hard constraints was seen in Zhang et al. (2022), which proposed a series of GFINN architectures depending on how much of the information L, M, E, S is assumed to be known. In the case that L, M are known, the GFINN energy and entropy are parameterized with scalar-valued functions f Pker A where f : Rn R Published as a conference paper at ICLR 2025 (E or S) is learnable and Pker A : Rn Rn is orthogonal projection onto the kernel of the (known) operator A (L or M). It follows that the gradient (f Pker A) = Pker A f(Pker A) lies in the kernel of A, so that the degeneracy conditions are guaranteed at the expense of constraining the model class of potential energies/entropies. Alternatively, in the case that all of L, M, E, S are unknown, GFINNs use learnable scalar functions f for E, S parameterized by MLPs as well as two matrix fields QE, QS Rr n with rows given by qf s = Sf s f for learnable skew-symmetric matrices Sf s , 1 s r, f = E, S. Along with two triangular (r r) matrix fields TL, TM, this yields the parameterizations L(x) = QS(x) (TL(x) TL(x))QS(x) and M(x) = QE(x) (TM(x) TM(x))QE(x), which necessarily satisfy the symmetries and degeneracy conditions required for metriplectic structure. GFINNs are shown to both increase expressivity over the GNODE method as well as decrease redundancy, since the need for an explicit order-3 tensor field is removed and the reversible and irreversible brackets are allowed to depend explicitly on the state x. However, GFINNs still exhibit cubic scaling through the requirement of rn(n 1) + r2 + 2 = O rn2 learnable functions, which is well above the theoretical minimum required to express a general metriplectic system and leads to difficulties in training the resulting models. Model reduction. Finally, it is worth mentioning the closely related line of work involving model reduction for metriplectic systems, which began with Öttinger (2015). As remarked there, preserving metriplecticity in reduced-order models (ROMs) exhibits many challenges due to its delicate requirements on the kernels of the involved operators. There are also hard and soft constrained approaches: the already mentioned Hernández et al. (2021a) aims to learn a close-to-metriplectic data-driven ROM by enforcing degeneracies by penalty, while Gruber et al. (2023a) directly enforces metriplectic structure in projection-based ROMs using exterior algebraic factorizations. The parameterizations of metriplectic data presented here are related to those presented in Gruber et al. (2023a), although NMS does not require access to nonzero components of the gradients E, S. It is also worth mentioning the influential monograph Dorst et al. (2007) which contains many examples of exterior algebraic methods for computer scientific applications. 3 FORMULATION AND ANALYSIS The proposed formulation of the metriplectic bracket-generating operators L, M used by NMS is based on the idea of exploiting structure in the tensor fields ξ, ζ to reduce the necessary number of degrees of freedom. In particular, it will be shown that the degeneracy conditions L S = M E = 0 imply more than just symmetry constraints on these fields, and that taking these additional constraints into account allows for a more compact representation of metriplectic data. Following this, results are presented which show that the proposed formulation is universally approximating on nondegenerate systems (c.f. Definition 3.1) and admits a generalization error bound in time. 3.1 EXTERIOR ALGEBRA Developing these metriplectic expressions will require some basic facts from exterior algebra, of which more details can be found in, e.g., Tu (2017, Chapter 19). The basic objects in the exterior algebra VV over the vector space V are multivectors, which are formal linear combinations of totally antisymmetric tensors on V . More precisely, if I(V ) denotes the two-sided ideal of the free tensor algebra T(V ) generated by elements of the form v v (v V ), then the exterior algebra is the quotient space VV T(V )/I(V ) equipped with the antisymmetric wedge product operation . This graded algebra is equipped with natural projection operators P k : VV Vk V which map between the full exterior algebra and the kth exterior power of V , denoted Vk V , whose elements are homogeneous k-vectors. More generally, given an n-manifold M with tangent bundle TM, the exterior algebra V(TM) is the algebra of multivector fields whose fiber over x M is given by VTx M. For the present purposes, it will be useful to develop a correspondence between bivectors B V2(Rn) and skew-symmetric matrices B Skewn(R), which follows directly from Riesz representation in terms of the usual Euclidean dot product. More precisely, supposing that e1, ..., en are the standard basis vectors for Rn, any bivector B V2TRn can be represented as B = P i 0, there exist two-layer neural network functions E, S : K R, L : K Skewn(R) and M : K Symn(R) such that E, S = 0 on K, M is positive semi-definite, L S = M E = 0 for all x K, and each approximate function is ε-close to its given counterpart on K. Moreover, if L, M have k 0 continuous derivatives on K then so do L, M. Remark 3.8. The assumption x K of the state remaining in a compact set V is not restrictive when at least one of E, S : Rn R, say E, has bounded sublevel sets. In this case, letting x0 = x(0) it follows from E 0 that E(x(t)) = E(x0) + R t 0 E(x(τ)) dτ E(x0), so that the entire trajectory x(t) lies in the (closed and bounded) compact set K = {x | E(x) E(x0)} Rn. Leaning on Proposition 3.7 and classical universal approximation results in Li (1996), it is further possible to establish the following error estimate also proven in Appendix A which gives an idea of the error accumulation rate that can be expected from this method. Theorem 3.9. Suppose L, M, E, S are nondegenerate metriplectic data such that L, M have at least one continuous derivative, E, S have Lipschitz continuous gradients, and at least one of E, S have bounded sublevel sets. For any ε > 0, there exist nondegenerate metriplectic data L, M, E, S defined by two-layer neural networks such that, for all T > 0, Z T 0 |x x|2 dt e2a T 2ea T + T + 1, where a, b R are constants depending on both sets of metriplectic data and x = L( x) E( x) + M( x) S( x). Remark 3.10. Theorem 3.9 provides a bound on state error over time under the assumption that the approximation error in the metriplectic networks can be controlled. On the other hand, notice that Theorem 3.9 can also be understood as a generic error bound on any trained metriplectic networks L, M, E, S provided universal approximation results are not invoked in the estimation leading to εb. This result confirms that the error in the state x for a fixed final time T tends to zero with the approximation error in the networks L, M, E, S, as one would hope based on the approximation capabilities of neural networks. More importantly, Theorem 3.9 also bounds the generalization error of any trained metriplectic network for an appropriate (and possibly large) ε equal to the maximum approximation error on K, where the learned metriplectic trajectories are confined for all time. With this theoretical guidance, the remaining goal of this work is to demonstrate that NMS is also practically effective at learning metriplectic systems from data and exhibits reasonable generalization to unseen timescales. 4 ALGORITHM Similar to previous approaches in Lee et al. (2021) and Zhang et al. (2022), the learnable parameters in NMS are calibrated using data along solution trajectories to a given dynamical system. This brings up an important question regarding how much information about the system in question is realistically present in the training data. While many systems have a known metriplectic form, it is not always the case that one will know metriplectic governing equations for a given set of training data. Therefore, two approaches are considered in the experiments below corresponding to whether full or partial state information is assumed available during NMS training. More precisely, the state x = (xo, xu) will be partitioned into observable and unobservable variables, where xu may be empty in the case that full state information is available. In a partially observable system xo typically contains positions and momenta while xu contains entropy or other configuration variables which are more Published as a conference paper at ICLR 2025 Algorithm 1 Training neural metriplectic systems 1: Input: snapshot data X Rn ns, each column xs = x(ts, µs), target rank r 1, batch size nb 1. 2: Initialize networks Atri, B, Kchol, E, S, and loss L = 0 3: for step in Nsteps do 4: Randomly draw batch P = {(tsk, xsk)}nb k=1 5: for (t, x) in P do 6: Evaluate Atri(x), B(x), Kchol(x), E(x), S(x) 7: Automatically differentiate E, S to obtain E(x), S(x) 8: Form A(x) = Atri(x) Atri(x) and D(x) = Kchol(x)Kchol(x) 9: Build L(x), M(x) according to Lemma 3.2 10: Evaluate x = L(x) E(x) + M(x) S(x) 11: Randomly draw n1, ..., nl and form tj = t + nj t for all j 12: x1, ..., xl = ODEsolve( x, t1, ..., tl) 13: L += l 1 P j Loss(xj, xj) 14: end for 15: Rescale L = |P| 1L 16: Update Atri, B, Kchol, E, S through gradient descent on L. 17: end for difficult to observe during physical experiments. In both cases, NMS will learn a metriplectic system in x according to the procedure described in Algorithm 1. Note that the batch-wise training strategy in Algorithm 1 requires initial conditions for xu in the partially observed case. There are several options for this, and two specific strategies will be considered here. Suppose the data are drawn from the training interval [0, T] with initial state x0 and final state x T . The first strategy sets xu 0 = 0, xu T = 1 (where 1 is the all ones vector), and xu s = 1/s, 0 < s < T, so that the unobserved states are initially assumed to lie on a straight line. In the absence of other information, this ensures that the entropy will be satisfy the monotonicity constraint necessary for metriplectic structure. The second strategy is more sophisticated, and involves training a diffusion model to predict the distribution of xu given xo. Specific details of this procedure are given in Appendix H. Note that other strategies are also possible, such as allowing the unobserved initial states to be trainable as in Chen et al. (2020). The goal of the following experiments is to show that NMS is effective even when entropic information cannot be observed during training, yielding superior performance when compared to previous methods including GNODE, GFINN, and SPNN discussed in Section 2. The metrics considered for this purpose will be mean absolute error (MAE) and mean squared error (MSE) defined in the usual way as the average ℓ1 (resp. squared ℓ2) error between the discrete states x, x Rn ns. For brevity, many implementation details have been omitted here and can be found in Appendix B. An additional experiment showing the effectiveness of NMS in the presence of both full and partial state information can be found in Appendix D. Remark 5.1. To facilitate a more equal parameter count between the compared metriplectic methods, the results of the experiments below were generated using the alternative parameterization D = KK where K : K Rr r is full and r r. Of course, this change does not affect metriplecticity since D is still positive semi-definite for each x K. 5.1 TWO GAS CONTAINERS The first test of NMS involves two ideal gas containers separated by movable wall which is removed at time t0, allowing for the exchange of heat and volume. In this example, the motion of the state Published as a conference paper at ICLR 2025 x = (q p S1 S2) is governed by the metriplectic equations: S1 = 9N 2k2 Bα 4E1(x) 1 E1(x) 1 E2(x) , S2 = 9N 2k2 Bα 4E1(x) 1 E1(x) 1 E2(x) where (q, p) are the position and momentum of the separating wall, S1, S2 are the entropies of the two subsystems, and the internal energies E1, E2 are determined from the Sackur-Tetrode equation for ideal gases, Si/Nk B = ln ˆc Vi E3/2 i , 1 i 2. Here, m denotes the mass of the wall, 2L is the total length of the system, and Vi is the volume of the ith container. As in Lee et al. (2021; 2022) Nk B = 1 and α = 0.5 fix the characteristic macroscopic unit of entropy while ˆc = 102.25 ensures the argument of the logarithm defining Ei is dimensionless. This leads to the total entropy S(x) = S1 + S2 and the total energy E(x) = (1/2m)p2 + E1(x) + E2(x), which are guaranteed to be nondecreasing and constant, respectively. The primary goal here is to verify that NMS can accurately and stably predict gas container dynamics without the need to observe the entropic variables S1, S2. To that end, NMS has been compared to GNODE, SPNN, and GFINN on the task of predicting the trajectories of this metriplectic system over time, with results displayed in Table 2. More precisely, given an intial condition x0 and an interval 0 < ttrain < tvalid < ttest, each method is trained on partial state information (in the case of NMS) or full state information (in the case of the others) from the interval [0, ttrain] and validated on (ttrain, tvalid] before state errors in q, p only are calculated on the whole interval [0, ttest]. As can be seen from Table 2 and Figure 2, NMS is remarkably accurate over unseen timescales even in this unfair comparison, avoiding the unphysical behavior which often hinders soft-constrained methods like SPNN. The energy and instantaneous entropy plots in Figure 2 further confirm that the strong enforcement of metriplectic structure guaranteed by NMS leads to correct energetic and entropic dynamics for all time. (a) Hamiltonian H = p2 (b) Position q (c) Momentum p (d) Energy E = H + P (e) Instantaneous Entropy S Figure 2: The ground-truth and predicted position, momentum, instantaneous entropy, and energies for the two gas containers example in the training (white), validation (yellow), and testing (red) regimes. 5.2 THERMOELASTIC DOUBLE PENDULUM Next, consider the thermoelastic double pendulum from Romero (2009) with 10-dimensional state variable x = (q1 q2 p1 p2 S1 S2) , which represents a highly challenging benchmark for Published as a conference paper at ICLR 2025 Table 2: Prediction errors for xo measured in MSE and MAE on the interval [0, ttest] in the two gas containers example (left) and on the test set in the thermoelastic double pendulum example (right). NODE SPNN GNODE GFINN NMS MSE .12 .04 .13 .10 .16 .10 .07 .03 .01 .02 MAE .25 .10 .26 .14 .25 .13 .13 .03 .08 .06 NODE SPNN GNODE GFINN NMS MSE .41 .01 .42 .01 .42 .01 .40 .03 .38 .03 MAE .48 .04 .47 .03 .46 .04 .43 .07 .42 .07 metriplectic methods. The equations of motion in this case are given for 1 i 2 as mi , pi = qi(E1(x) + E2(x)), S1 = κ T 1 1 T2 1 , S2 = κ T1T 1 2 1 , where κ > 0 is a thermal conductivity constant (set to 1), mi is the mass of the ith spring (also set to 1) and Ti = Si Ei is its absolute temperature. In this case, qi, pi R2 represent the position and momentum of the ith mass, while Si represents the entropy of the ith pendulum. As before, the total entropy S(x) = S1 + S2 is the sum of the entropies of the two springs, while defining the internal energies Ei(x) = (1/2)(ln λi)2 + ln λi + e Si ln λi 1, λ1 = |qi|, λ2 = |q2 q1|, leads to the total energy E(x) = (1/2m1)|p1|2 + (1/2m2)|p2|2 + E1(x) + E2(x). The task in this case is prediction across initial conditions. As in Zhang et al. (2022), 100 trajectories are drawn from the ranges in Appendix B and integrated over the interval [0, 40] with t = 0.1, with an 80/10/10 split for training/validation/testing. Here all compared models are trained using full state information. As seen in Table 2, NMS is again the most performant, although all models struggle to approximate the dynamics over the entire training interval. It is also notable that the training time of NMS is greatly decreased relative to GNODE and GFINN due to its improved quadratic scaling; a representative study to this effect is given in Appendix G. 6 CONCLUSION Neural metriplectic systems (NMS) have been considered for learning finite-dimensional metriplectic dynamics from data. Making use of novel non-redundant parameterizations for metriplectic operators, NMS provably approximates arbitrary nondegenerate metriplectic systems with generalization error bounded in terms of the operator approximation quality. Benchmark examples have shown that NMS is both more scalable and more accurate than previous methods, including when only partial state information is observed. Future work will consider extensions of NMS to infinite-dimensional metriplectic systems with the aim of addressing its main limitation: the difficulty of scaling NMS (among all present methods for metriplectic learning) to realistic, 3-D problems of the size that would be considered in practice. A promising direction is to consider the use of NMS in model reduction, where sparse, large-scale systems are converted to small, dense systems through a clever choice of encoding/decoding. ACKNOWLEDGEMENTS Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy s National Nuclear Security Administration under contract DE-NA0003525. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. This article has been co-authored by an employee of National Technology & Engineering Solutions of Sandia, LLC under Contract No. DE-NA0003525 with the U.S. Department of Energy (DOE). The employee owns all right, title and interest in and to the article and is solely responsible for its contents. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this article or allow others to do so, for United States Government purposes. The DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan https://www.energy.gov/downloads/doe-public-access-plan. The work of A. Gruber and N. Trask is supported by the John von Neumann fellowship at Sandia, the U.S. Published as a conference paper at ICLR 2025 Department of Energy, Office of Advanced Computing Research under the Scalable and Efficient Algorithms - Causal Reasoning, Operators, Graphs and Spikes" (SEA-CROGS) project, and the Do E Early Career Research Program. K. Lee acknowledges support from the U.S. National Science Foundation under grant IIS 2338909. N. Park and H. Lim acknowledge support from the Korea Advanced Institute of Science and Technology (KAIST) grant funded by the Korea government (MSIT) (No. G04240001, Physics-inspired Deep Learning), as well as the Institute for Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. 2020-0-01361, Artificial Intelligence Graduate School Program (Yonsei University)). The authors thank Jonas Actor for careful proofreading and suggestions regarding the presentation of Appendix A. Z. Chen, J. Zhang, M. Arjovsky, and L. Bottou. Symplectic recurrent neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id= Bkg YPREt Pr. J. R. Dormand and P. J. Prince. A family of embedded runge-kutta formulae. Journal of computational and applied mathematics, 6(1):19 26, 1980. L. Dorst, D. Fontijne, and S. Mann. Geometric Algebra for Computer Science: An Object-oriented Approach to Geometry. Morgan Kaufmann, Amsterdam, 2007. D. González, F. Chinesta, and E. Cueto. Thermodynamically consistent data-driven computational mechanics. Continuum Mechanics and Thermodynamics, 31(1):239 253, 2019. M. Grmela and H. C. Öttinger. Dynamics and thermodynamics of complex fluids. i. development of a general formalism. Phys. Rev. E, 56:6620 6632, Dec 1997. doi: 10.1103/Phys Rev E.56.6620. URL https: //link.aps.org/doi/10.1103/Phys Rev E.56.6620. A. Gruber, M. Gunzburger, L. Ju, and Z. Wang. Energetically consistent model reduction for metriplectic systems. Computer Methods in Applied Mechanics and Engineering, 404:115709, 2023a. ISSN 0045-7825. doi: https://doi.org/10.1016/j.cma.2022.115709. URL https://www.sciencedirect.com/science/ article/pii/S0045782522006648. A. Gruber, K. Lee, and N. Trask. Reversible and irreversible bracket-based dynamics for deep graph neural networks. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b. Q. Hernández, A. Badías, D. González, F. Chinesta, and E. Cueto. Deep learning of thermodynamics-aware reduced-order models from data. Computer Methods in Applied Mechanics and Engineering, 379:113763, 2021a. Q. Hernández, A. Badías, D. González, F. Chinesta, and E. Cueto. Structure-preserving neural networks. Journal of Computational Physics, 426:109950, 2021b. D. D. Holm, V. Putkaradze, and C. Tronci. Kinetic models of oriented self-assembly. Journal of Physics A: Mathematical and Theoretical, 41(34):344010, aug 2008. doi: 10.1088/1751-8113/41/34/344010. URL https://dx.doi.org/10.1088/1751-8113/41/34/344010. A. N. Kaufman. Dissipative hamiltonian systems: A unifying principle. Physics Letters A, 100(8):419 422, 1984. ISSN 0375-9601. doi: https://doi.org/10.1016/0375-9601(84)90634-0. URL https://www. sciencedirect.com/science/article/pii/0375960184906340. A. N. Kaufman and P. J. Morrison. Algebraic structure of the plasma quasilinear equations. Physics Letters A, 88(8):405 406, 1982. ISSN 0375-9601. doi: https://doi.org/10.1016/0375-9601(82)90664-8. URL https://www.sciencedirect.com/science/article/pii/0375960182906648. D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. K. Lee, N. Trask, and P. Stinis. Machine learning structure preserving brackets for forecasting irreversible processes. Advances in Neural Information Processing Systems, 34:5696 5707, 2021. K. Lee, N. Trask, and P. Stinis. Structure-preserving sparse identification of nonlinear dynamics for data-driven modeling. In Mathematical and Scientific Machine Learning, pages 65 80. PMLR, 2022. X. Li. Simultaneous approximations of multivariate functions and their derivatives by neural networks with one hidden layer. Neurocomputing, 12(4):327 343, 1996. Published as a conference paper at ICLR 2025 H. Lim, M. Kim, S. Park, and N. Park. Regular time-series generation using sgm. ar Xiv preprint ar Xiv:2301.08518, 2023. E. Materassi, M.; Tassi. Metriplectic framework for dissipative magneto-hydrodynamics. Physica D: Nonlinear Phenomena, 2012. P. Morrison. Thoughts on brackets and dissipation: old and new. In Journal of Physics: Conference Series, volume 169, page 012006. IOP Publishing, 2009. P. J. Morrison. Some observations regarding brackets and dissipation. Technical Report PAM-228, Center for Pure and Applied Mathematics, University of California, Berkeley, 1984. P. J. Morrison. A paradigm for joined hamiltonian and dissipative systems. Physica D: Nonlinear Phenomena, 18(1):410 419, 1986. ISSN 0167-2789. doi: https://doi.org/10.1016/0167-2789(86)90209-5. URL https: //www.sciencedirect.com/science/article/pii/0167278986902095. H. C. Öttinger. Irreversible dynamics, onsager-casimir symmetry, and an application to turbulence. Phys. Rev. E, 90:042121, Oct 2014. H. C. Öttinger. Preservation of thermodynamic structure in model reduction. Phys. Rev. E, 91:032147, Mar 2015. I. Romero. Thermodynamically consistent time-stepping algorithms for non-linear thermomechanical systems. International Journal for Numerical Methods in Engineering, 79(6):706 732, 2023/05/14 2009. D. Ruiz, D. Portillo, and I. Romero. A data-driven method for dissipative thermomechanics. IFAC-Papers On Line, 54(19):315 320, 2021. X. Shang and H. C. Öttinger. Structure-preserving integrators for dissipative systems based on reversible irreversible splitting. Proceedings of the Royal Society A, 476(2234):20190446, 2020. Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. Co RR, abs/2011.13456, 2020. URL https://arxiv.org/ abs/2011.13456. S. Särkkä and A. Solin, editors. Applied stochastic differential equations, volume 10. Cambridge University Press, 2019. L. W. Tu. Differential Geometry: Connections, Curvature, and Characteristic Classes. Springer International Publishing, 2017. P. Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7): 1661 1674, 2011. B. Xu, Y. Chen, T. Matsubara, and T. Yaguchi. Learning generic systems using neural symplectic forms. In International Symposium on Nonlinear Theory and Its Applications, number A2L-D-03 in IEICE Proceeding Series, pages 29 32. The Institute of Electronics, Information, and Communication Engineers (IEICE), 2022. B. Xu, Y. Chen, T. Matsubara, and T. Yaguchi. Equivalence class learning for GENERIC systems. In ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, 2023. Z. Zhang, Y. Shin, and G. Em Karniadakis. Gfinns: Generic formalism informed neural networks for deterministic and stochastic dynamical systems. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 380(2229):20210207, 2022. G. Zhong and J. E. Marsden. Lie-poisson hamilton-jacobi theory and lie-poisson integrators. Physics Letters A, 133(3):134 139, 1988. A PROOF OF THEORETICAL RESULTS This Appendix provides proof of the analytical results in Section 3 of the body. First, the parameterizations of L, M in terms of exterior algebra are established. Published as a conference paper at ICLR 2025 Proof of Lemma 3.2. First, it is necessary to check that the operators L, M parameterized this way satisfy the symmetries and degeneracy conditions claimed in the statement. To that end, recall that a b ab ba , meaning that (ab ba ) b a = a b. It follows that A A = A where A denotes the reversion of A, i.e., A = P i 0 there exists a two-layer neural network A : K Rn n such that supx K A A < ε and supx K k A k A < ε for 1 k m where k is the (total) derivative operator of order k. Proof. This will be a direct consequence of Corollary 2.2 in Li (1996) provided we show that |A| c|A| for some c > 0. To that end, if σ1 ... σr > 0 (r n) denote the nonzero singular values of A A, it follows that for each x K, A A = σ1 q σ2 1 + ... + σ2r = Aij Aij 2 = A A F . On the other hand, it also follows that A A F = Aij Aij 2 s X i,j max i,j Aij Aij = n r Aij Aij = n A A , and therefore the desired inequality holds with c = n. Now, for any ε > 0 it follows from Li (1996) that there exists a two layer network A with m continuous derivatives such that supx K A A < ε/n and supx K k A k A < ε/n < ε for all 1 k m. Therefore, it follows that A A n sup x K completing the argument. Next, we bound the deviation in the orthogonal projectors P E , P S . Lemma A.2. Let f : Rn R be such that f = 0 on the compact set K Rn. For any ε > 0, there exists a two-layer neural network f : K R such that f = 0 on K, supx K f f < ε, supx K f f < ε, and supx K P f P f Proof. Denote f = v and consider any v : K R. Since |v| | v| + |v v|, it follows for all x K that whenever |v v| < (1/2) infx K|v|, | v| |v| |v v| > |v| 1 2 inf x K|v| > 0, so that v = 0 in K, and since the square function is monotonic, inf x K| v|2 inf x K 2 inf x K|v| 2 = 1 4 inf x K|v|2. On the other hand, we also have | v| |v| + | v v| < |v| + (1/2) infx K|v|, so that, adding and subtracting vv and applying Cauchy-Schwarz, it follows that for all x K, |vv v v | |v v||v| + | v||v v| 2 max{|v|, | v|}|v v| < 2|v| + inf x K|v| |v v|. Now, by Corollary 2.2 in Li (1996), for any ε > 0 there exists a two-layer neural network f : K R such that ( 1 2 inf x K|v|, infx K|v|2 2 supx K|v| + infx K|v| ε 4, ε Published as a conference paper at ICLR 2025 and also supx K f f < ε. Letting v = f, it follows that for all x K, min n |v|2, | v|2o 2|v| + infx K|v| min n |v|2, | v|2o |v v|, and therefore, taking the supremum of both sides and applying the previous work yields the desired estimate, 42 supx K|v| + infx K|v| infx K|v|2 sup x K |v v| < ε. With these intermediate results established, the proof of the approximation result Proposition 3.7 proceeds as follows. Proof of Proposition 3.7. Recall from Theorem 3.4 that we can write L = P S (Atri A tri)P S and similarly for L. Notice that, by adding and subtracting P S Atri P S and P S Atri P S , it follows that for all x K, P S Atri P S P S Atri P S = P S P S Atri P S + P S Atri Atri P S + P S Atri P S P S P S P S |Atri| + Atri Atri + Atri P S P S 2 max n |Atri|, Atri o P S P S + Atri Atri where we have used that P S , P S have unit spectral norm. By Lemma A.1, for any ε > 0 there exists a two layer neural network Atri such that supx K Atri Atri < ε 4, and by Lemma A.2 there exists a two-layer network S with S = 0 on K such that P S P S < min ε, max sup x K |Atri|, sup x K It follows that S, S are ε-close to S, S on K and 2 max n |Atri|, Atri o P S P S < ε Therefore, the estimate L L 2 sup x K P S Atri P S P S Atri P S implies that L is ε-close to L on K as well. Moving to the case of M, we see that for all x K, by writing M = UΛU = Kchol K chol for Kchol = UΛ1/2 and repeating the first calculation with Kchol in place of Atri and P E in place of P S , P E Kchol K chol P E P E Kchol K chol P E 2 max n |Kchol|, Kchol o P E P E + Kchol K chol Kchol K chol . Moreover, if Kchol Kchol < (1/2) infx K|Kchol| for all x K then similar arguments as used in the proof of Lemma A.2 yield the following estimate for all x K, Kchol K chol Kchol K chol 2 max n |Kchol|, Kchol o Kchol Kchol 2|Kchol| + inf x K|Kchol| Kchol Kchol . Published as a conference paper at ICLR 2025 As before, we now invoke Lemma A.1 to construct a two-layer lower-triangular network Kchol such that Kchol Kchol < min 1 2 inf x K|Kchol|, 2 sup x K |Kchol| + inf x K|Kchol| 1 ε as well as (using Lemma A.2) a network E satisfying E = 0 on K and P E P E < min ε, max sup x K |Kchol|, sup x K Again, it follows that E, E are ε-close to E, E on K, and by the work above we conclude M M = sup x K P E Kchol K chol P E P E Kchol K chol P E as desired. It is now possible to give a proof of the error bound in Theorem 3.9. Recall the L2([0, T]) error metric x and Lipschitz constant Lf, defined for all x, y Rn and Lipschitz continuous functions f as 0 |x|2 dt, |f(x) f(y)| Lf|x y|. Proof of Theorem 3.9. First, note that the assumption that one of E, S (without loss of generality, say E) has bounded sublevel sets implies bounded trajectories for the state x as in Remark 3.8, so we may assume x K for some compact K Rn. Moreover, for any ε > 0 it follows from Proposition 3.7 that there are approximate networks E, S which are ε-close to E, S on K. Additionally, it follows that E, S have nonzero gradients E, S which are also ε-close to the true gradients E, S on K. This implies that for each x K, E = E + (E E) E + ε, so it follows that the sublevel sets {x | E(x) m} {x | E(x) m + ε} are also bounded. Therefore, we may assume (by potentially enlarging K) that both x, x K lie in the compact set K for all time. Now, let y = x x. The next goal is to bound the following quantity: | y| = L(x) E(x) + M(x) S(x) L( x) E( x) M( x) S( x) = L(x) E(x) L( x) E( x) + M(x) S(x) M( x) S( x) =: | y E + y S|. To that end, notice that by adding and subtracting L(x) E( x), L(x) E( x), L( x) E( x), it follows that y E = L(x)( E(x) E( x)) + L(x) L(x) E( x) + L(x) L( x) E( x) + L( x) E( x) E( x) . By Proposition 3.7 there exists a two-layer neural network L with one continuous derivative such that supx K L L < ε, which implies that L is Lipschitz continuous with (uniformly wellapproximated) Lipschitz constant. Using this fact along with the assumed Lipschitz continuity of E and the approximation properties of the network E already constructed then yields | y E| L E sup x K |L| + L L sup x K | E| |y| + ε sup x K L + sup x K | E| =: a E|y| + ε b E. Similarly, by adding and subtracting M(x) S( x), M(x) S( x), M( x) S( x), it follows that y S = M(x)( S(x) S( x)) + M(x) M(x) S( x) + M(x) M( x) S( x) + M( x) S( x) S( x) . Published as a conference paper at ICLR 2025 By Proposition 3.7, there exists a two-layer network M with one continuous derivative such that supx K M M < ε, with M Lipschitz continuous for the same reason as before. It follows from this and supx K S S < ε that | y S| L S sup x K |M| + L M sup x K | S| |y| + ε sup x K M + sup x K | S| =: a S|y| + ε b S. Now, recall that t|y| = |y| 1( y y) | y| by Cauchy-Schwarz, and therefore the time derivative of |y| is bounded by t|y| | y E| + | y S| = (a E + a S)|y| + ε(b E + b S) =: a|y| + b. This implies that t|y| a|y| b, so multiplying by the integrating factor e at and integrating in time yields |y(t)| εb Z t 0 ea(t τ) dτ = ε b where we used that y(0) = 0 since the initial condition of the trajectories is shared. Therefore, the L2 error in time can be approximated by 0 |y|2 dt ε2 b2 a2 e2a T 2ea T + T + 1 , establishing the conclusion. B EXPERIMENTAL AND IMPLEMENTATION DETAILS This Appendix records additional details related to the numerical experiments in Section 5. For each benchmark problem, a set of trajectories is manufactured given initial conditions by simulating ODEs with known metriplectic structure. The metriplectic forms of these systems are given in the subsections below. For the experiments in Table 2, only the observable variables are used to construct datasets, since entropic information is assumed to be unknown. Algorithm 2 summarizes the training of the dynamics models used for comparison with NMS. Algorithm 2 Training dynamics models 1: Input: snapshot data X Rn ns, each column xs = x(ts, µs), target rank r 1 2: Initialize loss L = 0 and networks with parameters Θ 3: for step in Nsteps do 4: Randomly draw an initial condition (t0k, x0k) where k ns 5: x1, ..., xl = ODEsolve(x0k, x, t1, ..., tl) 6: Compute the loss L((xo 1, . . . , xo l), ( xo 0, . . . , xo l)) 7: Update the model parameters Θ via SGD 8: end for For each compared method, integrating the ODEs is done via the Dormand Prince method (dopri5) Dormand and Prince (1980) with relative tolerance 10 7 and absolute tolerance 10 9. The loss is evaluated by measuring the discrepancy between the ground truth observable states xo and the approximate observable states xo in the mean absolute error (MAE) and mean squared error (MSE) metrics, which are defined in the usual way via the ℓ1, ℓ2 norms respectively for true data x(ti) and approximate data x(ti): MAE( x, x) = 1 i=1 | x(ti) x(ti)|1, MSE( x, x) = 1 i=1 | x(ti) x(ti)|2. The model parameters Θ (i.e., the weights and biases) are updated by using Adamax Kingma and Ba (2014) with an initial learning rate of 0.01. The number of training steps is set as 30,000, and the model parameters resulting in the best performance for the validation set are chosen for testing. Specific information related to the experiments in Section 5 is given in the subsections below. Published as a conference paper at ICLR 2025 For generating the results reported in Table 2, we implemented the proposed algorithm in Python 3.9.12 and Py Torch 2.0.0. Other required information is provided with the accompanying code. All experiments are conducted on Apple M2 Max chips with 96 GB memory. To provide the mean and the standard deviation, experiments are repeated three times with varying random seeds for all considered methods. B.1 TWO GAS CONTAINERS The metriplectic equations of motion in the two gas containers example are given in terms of the state x = (q p S1 S2) . With the energy E(x) = (1/2m)p2 + E1(x) + E2(x) and entropy S(x) = S1 + S2 defined in Section 5, it follows that x = L E(x) + M(x) S where S = (0 0 1 1) and 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 T 2 1 (T1T2) 1 0 0 (T1T2) 1 T 2 2 3 + e2S2 (2 q)5 1 3 2p T1(x) T2(x) , Ti(x) = Si Ei(x). As mentioned in the body, the two gas container (TGC) problem tests models predictive capability (i.e., extrapolation in time). To this end, one simulated trajectory is obtained by solving an IVP with a known TGC system and an initial condition, and the trajectory of the observable variables is split into three subsequences, [0, ttrain], (ttrain, tval], and (tval, ttest] for training, validation, and test with 0 < ttrain < tval < ttest. In the experiment, a sequence of 100,000 timesteps is generated using the Runge Kutta 4thorder (RK4) time integrator with a step size 0.001. The initial condition is given as x = (1, 2, 103.2874, 103.2874) following Shang and Öttinger (2020). The training/validation/test split is defined by ttrain = 20, tval = 30, and ttest = 100. For a fair comparison, all considered models are set to have a similar number of model parameters, 2,000. The specifications of the network architectures are: NMS: The total number of model parameters is 1959. The functions Atri, B, Kchol, E, S are parameterized as MLPs with the Tanh nonlinear activation function. The MLPs parameterizing Atri, B, Kchol, E are specified as 1 hidden layer with 10 neurons, and the on parameterizing S is specified as 3 hidden layers with 25 neurons. NODE: The total number of model parameters is 2179. The black-box NODE is parameterized as an MLP with the Tanh nonlinear activation function, 4 hidden layers and 25 neurons. SPNN: The total number of model parameters is 1954. The functions E and S are parameterized as MLPs with the Tanh nonlinear activation function; each MLP is specified as 3 hidden layers and 20 neurons. The two 2-tensors defining L and M are defined as learnable 3 3 matrices. GNODE: The total number of model parameters is 2343. The functions E and S are parameterized as MLPs with the Tanh nonlinear activaton function; each MLP is specified as 2 hidden layers and 30 neurons. The matrices and 3-tensors required to learn L and M are defined as learnable 3 3 matrices and 3 3 3 tensor. GFINN: The total number of model parameters is 2065. The functions E and S are parameterized as MLPs with Tanh nonlinear activation function; each MLP is specified as 2 hidden layers and 20 neurons. The matrices to required to learn L and M are defined as K learnable 3 3 matrices, where K is set to 2. Published as a conference paper at ICLR 2025 B.2 THERMOELASTIC DOUBLE PENDULUM The equations of motion in this case are given for 1 i 2 as mi , pi = qi(E1(x) + E2(x)), S1 = κ T 1 1 T2 1 , S2 = κ T1T 1 2 1 , where κ > 0 is a thermal conductivity constant (set to 1), mi is the mass of the ith spring (also set to 1) and Ti = Si Ei is its absolute temperature. In this case, qi, pi R2 represent the position and momentum of the ith mass, while Si represents the entropy of the ith pendulum. As before, the total entropy S(x) = S1 + S2 is the sum of the entropies of the two springs, while defining the internal energies 2(ln λi)2 + ln λi + e Si ln λi 1, λ1 = |qi|, λ2 = |q2 q1|, leads to the total energy E(x) = (1/2m1)|p1|2 + (1/2m2)|p2|2 + E1(x) + E2(x). This system is metriplectic with matrix fields L, M, given by 0 0 I 0 0 0 0 0 I 0 I 0 0 0 0 0 I 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 T 1 1 T2(x) 1 0 0 0 0 1 T1T 1 2 (x) where Ti = Si Ei. The thermoelastic double pendulum experiment tests model prediction across initial conditions. In this case, 100 trajectories are generated by varying initial conditions that are randomly sampled from [0.1,1.1] [-0.1,0.1] [2.1, 2.3] [-0.1,0.1] [-1.9,2.1] [0.9,1.1] [-0.1, 0.1] [0.9,1.1] [0.1,0.3] R10. Each trajectory is obtained from the numerical integration of the ODEs using an RK4 time integrator with step size 0.02 and the final time T = 40, resulting in the trajectories of length 2,000. The resulting 100 trajectories are split into 80/10/10 for training/validation/test sets. For a fair comparison, all considered models are again set to have similar number of model parameters, 2,000. The specifications of the network architectures are: NMS: The total number of model parameters is 2201. The functions A, B, K, E, S are parameterized as MLPs with the Tanh nonlinear activation function. The MLPs parameterizing are specified as 1 hidden layer with 15 neurons. NODE: The total number of model parameters is 2005. The black-box NODE is parameterized as an MLP with the Tanh nonlinear activation function, 2 hidden layers and 35 neurons. SPNN: The total number of model parameters is 2362. The functions E and S are parameterized as MLPs with the Tanh nonlinear activation function; each MLP is specified as 3 hidden layers and 20 neurons. The two 2-tensors defining L and M are defined as learnable 3 3 matrices. GNODE: The total number of model parameters is 2151. The functions E and S are parameterized as MLPs with the Tanh nonlinear activaton function; each MLP is specified as 2 hidden layers and 15 neurons. The matrices and 3-tensors required to learn L and M are defined as learnable 3 3 matrices and 3 3 3 tensor. GFINN: The total number of model parameters is 2180. The functions E and S are parameterized as MLPs with Tanh nonlinear activation function; each MLP is specified as 2 hidden layers and 15 neurons. The matrices to required to learn L and M are defined as K learnable 3 3 matrices, where K is set to 2. C ADDITIONAL FIGURES: TWO GAS CONTAINERS AND THERMOELASTIC DOUBLE PENDULUM Figure 3 presents the ground-truth and predicted trajectories of q and p for the two gas containers example, providing confirmation that the learned metriplectic vector field x closely approximates the Published as a conference paper at ICLR 2025 true metriplectic vector field. The values pictured are obtained by evaluating the different dynamics functions (either the ground truth or those learned by the different parameterization) along the trajectories of q and p. Figure 3 shows that the proposed NMS method produces the most accurate learned velocities q, p. The MAE between the ground-truth metriplectic vector field and the predicted ones computes to [0.2652, 0.1355, 0.2285, 0.0546] for [SPNN, GNODE, GFINN, NMS], confirming that NMS yields the lowest error in this quantity. Note that only the observed state variables are included in this computation as no access to the unobserved variables is assumed during training. Figure 3: The ground-truth and predicted trajectories of q and p obtained via evaluating the (learned) dynamics functions. Figure 4 depicts trajectories of the ground-truth and the predicted position (q), momentum (p), entropy (S), and energy (E) for the thermoelastic double pendulum example. Figures show that the trajectories produced by NMS follow the ground-truth trajectories closer than those produced by the baseline methods. Figures 4(g) and 4(h) demonstrate that the trajectories of entropy S and energy E produced by GNODE, GFINN, and NMS are thermodynamically consistent (i.e., non-decreasing entropy and conserved energy). Figure 4: The ground-truth and predicted position, momentum, entropy, and energy for the thermoelastic double pendulum. Figures illustrate that the trajectories generated by NMS follow the ground-truth ones closer than those of other methods do. Published as a conference paper at ICLR 2025 D ADDITIONAL EXPERIMENT: DAMPED NONLINEAR OSCILLATOR Consider a damped nonlinear oscillator of variable dimension with state x = (q p S) , whose motion is governed by the metriplectic system m, p = k sin q γp, S = γ|q|2 Here q, p Rn denote the position and momentum of the oscillator, S is the entropy of a surrounding thermal bath, and the constant parameters m, γ, T are the mass, damping rate, and (constant) temperature. This leads to the total energy E(x) = (1/2m)|p|2 k cos q + TS, which is readily seen to be constant along solutions x(t). This system has a metriplectic expression in terms of E, S and the matrix fields 0 I 0 I 0 0 0 0 0 0 0 0 0 0 0 0 0 γ|q|2 It is now verified that NMS can accurately and stably predict the dynamics of a nonlinear oscillator x = (q p S) in the case that n = 1, 2, both when the entropy S is observable as well as when it is not. As before, the task considered is prediction in time, although all compared methods NODE, GNODE, and NMSknown are now trained on full state information from the training interval, and test errors are computed over the full state x on the extrapolation interval (tvalid, ttest], which is 150% longer than the training interval. In addition, another NMS model, NMSdiff, was trained using only the partial state information xo = (q, p) and tested under the same conditions, with the initial guess for xu generated as in Appendix H. As can be seen in Table 3, NMS is more accurate than GNODE or NODE in both the 1-D and 2-D nonlinear oscillator experiments, improving on previous results by up to two orders of magnitude. Remarkably, NMS produces more accurate entropic dynamics even in the case where the entropic variable S is unobserved during NMS training and observed during the training of other methods. This illustrates another advantage of the NMS approach: because of the reasonable initial data for S produced by the diffusion model, the learned metriplectic system produced by NMS remains performant even when metriplectic governing equations are unknown and only partial state information is observed. Table 3: Experimental results for the benchmark problems with respect to MSE and MAE. The best scores are in boldface. 1-D D.N.O. T.G.C. 2-D D.N.O. MSE MAE MSE MAE MSE MAE NMSdiff .0170 .1132 .0045 .0548 .0275 .1456 NMSknown .0239 .1011 .0012 .0276 .0018 .0357 NODE .0631 .2236 .0860 .2551 .0661 .2096 GNODE .0607 .1976 .0071 .0732 .2272 .4267 To describe the experimental setup precisely, data is collected from a single trajectory with initial condition as x = (2, 0, 0) following Lee et al. (2021). The path is calculated at 180,000 steps with a time interval of 0.001, and is then split into training/validation/test sets as before using ttrain = 60, tval = 90 and ttest = 180. Specifications of the networks used for the experiments in Table 3 are: NMS: The total number of parameters is 154. The number of layers for Atri, B, Kchol, E, S is selected from {1,2,3} and the number of neurons per layer from {5,10,15}. The best hyperparameters are 1 hidden layer with 5 neurons for each network function. GNODE: The total number of model parameters is 203. The number of layers and number of neurons for each network is chosen from the same ranges as for NMS. The best hyperparameters are 1 layer with 10 neurons for each network function. NODE: The total number of model paramters is 3003. The NODE architecture is formed by stacking MLPs with Tanh activation functions. The number of blocks is chosen from {3,4,5} Published as a conference paper at ICLR 2025 and the number of neurons of each MLP from {30,40,50}. The best hyperparameters are 4 and 30 for the number of blocks and number of neurons, respectively. E ADDITIONAL EXPERIMENT: HAMILTONIAN NEURAL NETWORK FOR TWO GAS CONTAINERS As briefly mentioned in the main body, from the definition of metriplectic dynamics, restricting M = 0 leads to Hamiltonian systems. In this section, we present the results of training Hamiltonian systems only on the observable variables (q, p) from the two gas container example problems. We use the same dataset used in Section 5.1 and the same experimental setup described in Section B.1. For parameterization of the Hamiltonian dynamics, we explicitly consider the canonical Poisson matrix L = J = 0 1 1 0 as is standard in Hamiltonian neural networks (HNNs), along with an MLP to parameterize the Hamiltonian function H = HΘ(q, p), resulting in the dynamics Figure 5 presents the trajectories of position and momentum predicted by the trained HNN. It essentially shows that imposing an inductive bias (i.e., energy conservation), which does not match with the underlying physical system (i.e., dissipation), results in catastrophic failure in prediction. Figure 5(c) confirms that the learned Hamiltonian is kept constant over the numerical simulation. Figure 5: [Hamiltonian neural networks] The ground-truth and predicted trajectories of position and momentum for the two gas containers example, along with the learned Hamiltonian. Notice that the incorrect inductive bias of conservation leads to catastrophic failure in network prediction. F ADDITIONAL EXPERIMENT: A DAMPED THERMOELASTIC ROD As an additional benchmark problem, we consider a damped thermoelastic rod, which discretizes an infinite-dimensional metriplectic system. Consider a 1-dimensional elastic rod, with coordinate s [0, l], which evolves following the dynamics defined as: p m V (q) γ p = L E(q, p, e) + M(q, p, e) S. (2) Here, V denotes a given potential function, γ denotes a constant controlling the rate of dissipation, m = m(s) denotes the mass density, and e = e(s) denotes the internal energy representing the conversion of mechanical energy into heat. The energy function is given by E(q, p, e) = H(q, p) + S(e) = Z ℓ 2m(s) + V (q(s)) + Z ℓ 0 e(s), (3) where H(q, p) denotes the Hamiltonian. Published as a conference paper at ICLR 2025 (a) q1 (first component) (b) q50 (last component) (c) p1 (first component) (d) p50 (last component) Figure 6: The ground-truth and predicted trajectories of positions and momenta for the thermoelastic rod are depicted with N = 50. Figure 7: MSEs (in the log10 scale) of predictions obtained by NMS and GFINN for varying N. In numerical computation, this continuous system is discretized, leading to s being represented by a vector s RN, where N is the number of discretization points. This discretization results in a vectorized representation of q, p, denoted as q, p RN, and a discretized version of Eq. (2), where the state variables are [q, p, S]. In particular, the resulting metriplectic system is given in terms of S = (0 0 1) and 0 I 0 I 0 0 0 0 0 0 0 0 0 I p In the experiments, we largely follow the setup used in the two gas container experiments, i.e., testing models predictive capability through temporal extrapolation. One simulated trajectory is obtained by solving an IVP with a known thermoelastic rod system, and the trajectory of observed variables is then split into three subsequences, [0, ttrain], (ttrain, tval], and (tval, ttest] for training, validation, and testing with 0 < ttrain < tval < ttest. In the experiment, we set ttrain = 30, tval = 40, and ttest = 50. We train the proposed NMS on spatial discretizations of varying size N = {50, 75, 100}, which results in systems of dimension {101, 151, 201}, respectively. The performance of NMS is then compared to the performance of GFINNs, which are the current state-of-the-art in metriplectic learning. Figure 6 shows the ground truth and predicted trajectories of the first and the last components of q and p for the case of N = 50, where it can be seen that the trajectories predicted by NMS are significantly better aligned with the ground truth trajectories compared to those of GFINN, particularly in the extrapolation regime (40 < t < 50). Figure 7 shows the results of training GFINN and NMS on thermoelastic rod systems with varying N, where GFINN and NMS instances with comparable model sizes are considered. The results suggest that the proposed NMS is parameter efficient, supporting the theoretical conclusions in Section 3. G SCALING STUDY To compare the scalability of the proposed NMS architecture design with existing architectures, different realizations of GNODE, GFINN, and NMS are generated by varying the dimension of the state variables, n = {1, 5, 10, 15, 20, 30, 50}. The specifications of these models (i.e., hyperparameters) are set so that the number of model parameters is kept similar between each method for smaller values of n. For example, for n = 1, 5 the number of model parameters is 20,000 for each architecture. The results in Figure 8(a) confirm that GNODE scales cubically in n while both GFINN and NMS scale quadratically. Note that only a constant scaling advantage of NMS over GFINN can be seen from this plot, since r is fixed during this study. It is also worthwhile to investigate the computational timings of these three models. Considering the same realizations of the models listed above, i.e., the model instances for varying n = Published as a conference paper at ICLR 2025 {1, 5, 10, 15, 20, 30, 50}, 1,000 random samples of states {x(i)}1,000 i=1 are generated. These samples are then fed to the dynamics function L(x(i)) E(x(i)) + M(x(i)) S(x(i)) for i = 1, . . . , 1000, and the computational wall time of the function evaluation via Py Torch s profiler API is measured. The results of this procedure are displayed in Figure 8(b). Again, it is seen that the proposed NMSs require less computational resources than GNODEs and GFINNs. (a) State dimension n versus number of model parameters (b) Model parameters versus wall time in microseconds Figure 8: A study of the scaling behavior of GNODE, GFINN, and NMS. H DIFFUSION MODEL FOR UNOBSERVED VARIABLES Recent work in Lim et al. (2023) suggests the benefits of performing time-series generation using a diffusion model. This Appendix describes how this technology is used to generate initial conditions for the unobserved NMS variables in the experiments corresponding to Table 3. More precisely, we describe how to train a conditional diffusion model which generates values for unobserved variables xu given values for the observed variables xo. Training and sampling: Recall that diffusion models add noise with the following stochastic differential equation (SDE): dx(t) = f(t, x(t))dt + g(t)dw, t [0, 1], where w Rdim(x) is a multi-dimensional Brownian motion, f(t, ) : Rdim(x) Rdim(x) is a vector-valued drift term, and g : [0, 1] R is a scalar-valued diffusion function. For the forward SDE, there exists a corresponding reverse SDE: dx(t) = [f(t, x(t)) g2(t) x(t)log p(x(t))]dt + g(t)d w, which produces samples from the initial distribution at t = 0. This formula suggests that if the score function, x(t)log p(x(t)), is known, then real samples from the prior distribution p(x) N(µ, σ2) can be recovered, where µ, σ vary depending on the forward SDE type. In order for a model Mθ to learn the score function, it has to optimize the following loss: L(θ) = Et{λ(t)Ex(t)[ Mθ(t, x(t)) x(t) log p(x(t)) 2 2]}, where t is uniformly sampled over [0, 1] with an appropriate weight function λ(t) : [0, 1] R. However, using the above formula is computationally prohibitive. Thanks to Vincent (2011), this loss can be substituted with the following denoising score matching loss: L (θ) = Et{λ(t)Ex(0)Ex(t)|x(0)[ Mθ(t, x(t)) x(t) log p(x(t)|x(0)) 2 2]}. Since score-based generative models use an affine drift term, the transition kernel p(x(t)|x(0)) follows a certain Gaussian distribution Särkkä and Solin (2019), and therefore the gradient term x(t) log p(x(t)|x(0)) can be analytically calculated. Published as a conference paper at ICLR 2025 Experimental details On the other hand, the present goal is to generate unobserved variables xu given values for the observed variables xo = (q, p), i.e., conditional generation. Therefore, our model has to learn the conditional score function, xu(t) log p(xu(t)|xo). For example, in the damped nonlinear oscillator case, S(t) is initialized as a perturbed t [0, 1], from which the model takes the concatenation of q, p, S(t) as inputs and learns conditional the score function S(t) log(S(t)|q, p). For the experiments in Table 3, diffusion models are trained to generate xu variables on three benchmark problems: the damped nonlinear oscillator, two gas containers, and thermolastic double pendulum. On each problem, representative parameters such as mass or thermal conductivity are varied, with the total number of cases denoted by N. Full trajectory data of length T is then generated using a standard numerical integrator (e.g., dopri5), before it is evenly cut into T/L pieces of length L. Let V, U denote the total number of variables and the number of unobserved variables, respectively. It follows that the goal is to generate U unobserved variables given V U observed ones, i.e., the objective is to generate data of shape (NT/L, L, U) conditioned on data of shape (NT/L, L, V U). After the diffusion model has been trained for this task, the output data is reshaped into size (N, T, U), which is used to initialize the NMS model. Note that the NODE and GNODE methods compared to NMS in Table 3 use full state information for their training, i.e., xu = in these cases, making it comparatively easier for these methods to learn system dynamics. As in other diffusion models e.g. Song et al. (2020), a U-net architecture is used, modifying 2-D convolutions to 1-D ones and following the detailed hyperparameters described in Song et al. (2020). Note the following probability flow ODE seen in Song et al. (2020): dx(t) = f(t, x(t)) 1 2g2(t) x(t)log p(x(t)) dt, Although models trained to mimic the probability flow ODE do not match the perofrmance of the forward SDE s result in the image domain, the authors of Lim et al. (2023) observe that the probability flow ODE outperforms the forward SDE in the time-series domain. Therefore, the probability flow ODE is used with the default hyperparameters of Song et al. (2020).