# learning_equivariant_nonlocal_electron_density_functionals__eac3fe46.pdf Published as a conference paper at ICLR 2025 LEARNING EQUIVARIANT NON-LOCAL ELECTRON DENSITY FUNCTIONALS Nicholas Gao , Eike Eberhard , Stephan Günnemann {n.gao,e.eberhard,s.guennemann}@tum.de Department of Computer Science & Munich Data Science Institute Technical University of Munich The accuracy of density functional theory hinges on the approximation of nonlocal contributions to the exchange-correlation (XC) functional. To date, machinelearned and human-designed approximations suffer from insufficient accuracy, limited scalability, or dependence on costly reference data. To address these issues, we introduce Equivariant Graph Exchange Correlation (EG-XC), a novel non-local XC functional based on equivariant graph neural networks (GNNs). Where previous works relied on semi-local functionals or fixed-size descriptors of the density, we compress the electron density into an SO(3)-equivariant nuclei-centered point cloud for efficient non-local atomic-range interactions. By applying an equivariant GNN on this point cloud, we capture molecular-range interactions in a scalable and accurate manner. To train EG-XC, we differentiate through a self-consistent field solver requiring only energy targets. In our empirical evaluation, we find EG-XC to accurately reconstruct gold-standard CCSD(T) energies on MD17. On out-of-distribution conformations of 3BPA, EG-XC reduces the relative MAE by 35 % to 50 %. Remarkably, EG-XC excels in data efficiency and molecular size extrapolation on QM9, matching force fields trained on 5 times more and larger molecules. On identical training sets, EG-XC yields on average 51 % lower MAEs. 1 INTRODUCTION Kohn-Sham Density Functional Theory (KS-DFT) is the backbone of computational material and drug discovery (Jones, 2015). It is a quantum mechanical method to approximate the ground state energy of an Nel-electron system by finding the electron density ρ DNel = ρ : R3 R+| R R3 ρ(r)dr = Nel that minimizes the energy functional E : DNel R. This functional maps electron densities to energies and is composed of E[ρ] = T[ρ] + Vext[ρ] + VH[ρ] + EXC[ρ] (1) T : DNel R is the kinetic energy functional, Vext : DNel R the external potential due to positively charged nuclei, VH : DNel R+ the Coulomb energy between electrons, and, finally, EXC : DNel R the exchange-correlation (XC) functional (Cramer, 2004). While T, Vext and VH permit analytical computations, the exact form of EXC remains unknown, and its approximation frequently dominates DFT s error (Kim et al., 2013). Machine learning has emerged as a promising data-driven approach to approximate EXC (Kulik et al., 2022; Zhang et al., 2023). Many such ML functionals adopt the classical approach and define EXC as the integral of a learnable XC energy density ϵXC : Rdm GGA R (Perdew, 2001): R3 ρ(r)ϵXC (g(r)) dr (2) where g : R3 Rd are properties of the electron density ρ (Dick & Fernandez-Serra, 2021; Nagai et al., 2022). If g only depends on density quantities from its infinitesimal neighborhood, e.g., ρ(r), ρ(r), . . ., the resulting functionals are called semi-local. While this integrates well into existing quantum chemistry code, non-local interactions like Van der Waals forces exceed the functional Equal contribution; corresponding authors. Published as a conference paper at ICLR 2025 class (Kaplan et al., 2023). In contrast, non-local functionals can capture such interactions by depending on multiple points in space simultaneously, e.g., EXC[ρ] = RR R3 ρ(r)ρ(r )ϵXC(g(r), g(r ))drdr . Non-locality can be further separated into atomic-range interactions, which depend on the electron density around the nucleus, and molecular-range interactions, which depend on the electron density within typical molecular length scales. But, existing non-local functionals either scale poorly computationally (Zhou et al., 2019) or require costly reference data (Margraf & Reuter, 2021; Bystrom & Kozinsky, 2022). The critical challenge in XC functionals lies in efficiently captureing non-local molecular-range interactions at scale. To this end, we propose to leverage equivariant graph neural networks (GNNs) to learn non-local XC functionals through our Equivariant Graph Exchange Correlation (EG-XC)1. To enable computationally efficient non-local interactions, we propose two key innovations: (1) We compress the electron density to an SO(3)-equivariant atomic-range point cloud representation by convolving the electron density with an SO(3)-equivariant kernel at the nuclear positions. (2) We use SO(3)-equivariant GNNs on this point cloud representation to efficiently capture molecular-range information. Finally, we use the embeddings to define a non-local feature density g NL : R3 Rd for the XC energy density ϵXC from Equation 2. Compared to previous approaches, our finite point cloud embeddings are neither dependent on the nuclear charge nor the basis set but purely derived from the density ρ. To train on energies alone, we differentiate through the minimization of Equation 1 (Li et al., 2021). In our experimental evaluation, EG-XC improves upon the learnable semi-local XC-functional on molecular dynamic trajectories, extrapolation to both out-of-distribution conformations and increasingly larger molecules. In particular, we find EG-XC to reduce errors of the semi-local functional by a factor of 2 to 3, on par with accurate ML force fields combined with DFT calculations ( -ML). In extrapolation to unseen structures, we find EG-XC to accurately reproduce out-ofdistribution potential energy surfaces unlike force fields (incl. -ML) while reducing the MAE by 35 % to 50 % compared to the next-best tested method. Finally, we find EG-XC to have an excellent data efficiency, achieving similar accuracies to the best force fields with 5 times less data. To summarize, we demonstrate that GNN-driven functionals push the frontier of non-local XC functionals and provide a promising path toward accurate and scalable DFT calculations. 2 BACKGROUND Notation. To denote functionals, i.e., functions mapping from functions to scalars, we write F[f] where F is the functional and f the function. We use superscripts in brackets(t) to indicate sequences and regular superscripts of bold vectors, e.g., xl, to indicate the l-th irreducible representation of the SO(3) group. We generally use r R3 to denote points in the 3D Euclidean space. A notable exception is the finite set of nuclei positions {R1, . . . , RNnuc} for which we will use capital letters. The norm of a vector is given by r = r 2. For directions, i.e., unit length vectors, we use br = r r . Kohn-Sham density functional theory is the foundation of our work. Here, we provide a very brief introduction. For more details, we refer the reader to Appendix A or Lehtola et al. (2020). In KS-DFT, the electron density ρ is represented by a set of orthogonal orbitals ϕi : R3 R, ϕi(x) = CT i χ(x), that are defined as linear combinations of a basis set of atomic orbitals χµ : R3 R: ρ(r) = ϕ(r)T ϕ(r) = χ(r)T CCT χ(r) = χ(r)T Pχ(r) (3) where P = CCT RNbas Nbas is the so-called density matrix. Given the representation in a finite basis set, one can compute the kinetic energy T, external potential Vext, and electron-electron repulsion energies VH analytically. Unfortunately, such analytical expressions are generally unavailable for EXC[ρ] as in Equation 2. Thus, one relies on numerical integration (Lehtola et al., 2020). To minimize Equation 1, one typically uses the self-consistent field (SCF) method that we outline in greater detail in Appendix B. The SCF method is an iterative two-step optimization where one first computes the so-called Fock matrix F = E P based on the current coefficients C. Then, the optimal coefficients C RNbas Nel are obtained by solving the generalized eigenvalue problem FC = SCE (4) 1We provide the source code on https://github.com/eseberhard/eg-ex Published as a conference paper at ICLR 2025 with Sµν = R R3 χµ(r)χν(r)dr being the overlap matrix of the atomic orbitals and E being a diagonal matrix. The procedure is repeated until convergence of P. Equivariance allows defining symmetries of functions. A function f : X Y is equivariant to a group G iff g G, x X, f GX g x = GY g f(x) where GX g , GY g are the representations of g in the domain X and codomain Y, respectively. Invariance is a special case of equivariance where GY g = 1, i.e., f GX g x = f(x), for all g G, x X. The real spherical harmonics Y l : R3 R2l+1, l N+, are a well-known example of SO(3)- equivariant functions as they transform under rotation according to the Wigner D matrices Dl R R(2l+1) (2l+1), R SO(3), with D0 R = 1 being the identity and D1 R being 3D-rotation matrices: Y l(D1 Rx) = Dl RY l(x). (5) Such 2l + 1-dimensional equivariant functions hi : R3 R2l+1, e.g., hi = Y l, can be recombined using the tensor product (h1 h2)lo mo = X m1,m2 Clo,l1,l2 mo,m1,m2hl1 m1hl2 m2, (6) where C R(2lo+1) (2li+1) (2lj+1) are the Clebsch-Gordan coefficients. As the Wigner D-matrices are orthogonal, it follows that the inner product of two l-equivariant functions yields an invariant one hl i D1 Rr T hl j D1 Rr = hl i(r)T Dl T R Dl Rhl j(r) = hl i(r)T hl j(r). (7) While, we are interested in E(3)-invariant functions, using such equivariant intermediate representations improves accuracy, expands the function class, and improves data efficiency (Schütt et al., 2017; Gasteiger et al., 2021; Batatia et al., 2022). 3 RELATED WORK Learnable density functional approximations (DFAs) of the XC functional have a long history. In their seminal paper, Kohn & Sham (1965) proposed an ϵXC fitted to a reference calculation. In this first parameterization ϵXC only depends on g(r) = ρ(r). To improve accuracy, the density features g have been successively refined by including gradients and physical quantities such as the kinetic energy density τ(r) = 1 2 PNel i | ϕi(r)|2. The resulting semi-local functionals are also known as meta-GGAs with g(r) = gm GGA(r) := [ρ(r), | ρ(r)|, τ(r)] (Perdew, 2001). Many ML functionals use this parameterization with ϵXC frequently being a product of an MLP with a classical functional (Nagai et al., 2022; Kulik et al., 2022; Zhang et al., 2023). An appealing aspect of learning semi-local energy densities rather than EXC[ρ] directly is the known implementation of physical constraints (Sun et al., 2015; Kaplan et al., 2023), which inspired works by Dick & Fernandez Serra (2021); Nagai et al. (2022). While augmenting gm GGA with hand-crafted non-local features improves the accuracy, this comes at the cost of exact constraints (Nagai et al., 2020; Kirkpatrick et al., 2021). Similarly, such physical biasses are absent in kernel methods Margraf & Reuter (2021); Bystrom & Kozinsky (2022). Additionally, these methods require reference densities that are rarely available (Szabo & Ostlund, 2012). Alternative grid-based CNNs converge slowly in the number of grid points compared to specialized spherical integration grids (Zhou et al., 2019; Treutler & Ahlrichs, 1995). Lastly, machine-learned functionals have been used in other DFT contexts as well, e.g., for energy corrections on SCF-converged energies (Chen & Yang, 2024) or to model the kinetic energy functional in orbital-free DFT (Snyder et al., 2012; Mi et al., 2023; Zhang et al., 2024; Remme et al., 2023). Here, we propose to extend semi-local models with expressive fully learnable non-local features obtained from standard integration grids. This allows us to leverage the physical biases of meta-GGAs while allowing efficient non-local interactions with fast integration. Although this work focuses on the XC functional, the methods we present can be transferred to these applications as well. Machine learning force fields have long functioned as a cheap but inaccurate alternative to quantum mechanical calculations by directly approximating the potential energy surface from data without solving the electronic structure problem (Unke et al., 2021). Contemporary force fields generally rely on graph neural networks (GNNs) (Schütt et al., 2017) representing molecules in terms of graphs with atoms as nodes. The advent of SO(3)-equivariant models led to significant improvements in accuracy Published as a conference paper at ICLR 2025 (1) Nuclei-centered equivariant (2) Equivariant message passing (4) Graph readout (3) Non-local reweighted meta-GGA Figure 1: Illustration of Equivariant Graph Exchange Correlation (EG-XC) s four components: (1) We obtain a finite atomic-range point cloud representation H(0) by convolving the electron density ρ with radial filters Γk : R+ R and spherical harmonics Y l m : R3 R at the nuclear position. (2) The embeddings H(0) are updated using equivariant message passing to obtain molecular-range effects in H(T ). (3) We define a non-local feature density g NL : R3 Rd from which we derive the exchange correlation density ϵXC : R3 R. (4) We add a graph readout of H(T ) to learn additional corrections. To obtain EXC[ρ], we integrate ϵXC and add it to the global graph readout. and data efficiency, closing the gap to full-fidelity quantum mechanical calcualtions (Batzner et al., 2022; Batatia et al., 2022). While they effectively aim to accomplish the same goal as DFT at a fraction of the cost, their out-of-distribution accuracy is a common problem (Stocker et al., 2022). In EG-XC, we transfer the success of equivariant GNNs to DFT, by combining their approximation power with the physical nature of DFT, resulting in an accurate and data-efficient method. 4 EQUIVARIANT GRAPH EXCHANGE CORRELATION With Equivariant Graph Exchange Correlation (EG-XC), we propose an efficient non-local approximation to the unknown exchange-correlation functional EXC that maps electron densities (positive integrable functions in R3) to scalar energies. To this end, EG-XC consists of four components as illustrated in Figure 1: (1) Nuclei-centered equivariant embeddings H(0) compress the electron density ρ into a finite point cloud representation by equivariantly integrating the density around the nuclei enabling atomic-range interactions. (2) Equivariant message passing on this point cloud includes molecular-range information in H(T ). Based on the point cloud embeddings, we define EXC as the sum of two readouts: (3) A non-local reweighted meta-GGA based on our non-local feature density g NL : R3 Rd and (4) a graph readout of the invariant nuclei-centered features H(T )0: R3 ρ(r) γNL (g NL(r), gm GGA(r)) | {z } non-local weights ϵm GGA (gm GGA(r)) | {z } meta-GGA | {z } non-local reweighted meta-GGA i=1 ϵNL H(T )0 i | {z } graph readout (1) Nuclei-centered equivariant embeddings H(0) R1 . . . R2lmax+1 Nnuc d := FNnuc d lmax reduce the computational scaling by mapping the continuous density to a finite per-nucleus representation. Further, these enable using equivariant GNNs to learn molecular-range interactions efficiently. The positions of the nuclei Ri R3 naturally lend themselves as centroids for such embeddings as they represent peaks of the electron density (Kato, 1957). Similar to Remme et al. (2023), we perform this reduction by convoluting the electron density ρ with equivariant filters Sl k : R3 R2l+1 at the nuclear positions Ri: H(0)l ik = ρi Sl k (Ri) (9) Published as a conference paper at ICLR 2025 where ρi : R3 R+ is the soft-partitioned electron density associated with the i-th embedding ρi(r) = ρ(r) αi(r) PNnuc j=1 αj(r) , (10) αi(r) = exp r Ri 2 with λ R+ being a free parameter. Such partitioning prevents overcounting if nuclei are close, similar to quadrature methods (Becke, 1988). Like equivariant GNNs (Batzner et al., 2022), we define the filters Sl k as a product of spherical harmonics Y l : R3 R2l+1 and radial filters Γk : R+ R: Sl k(r) = Γk( r )Y l (br) (12) where 0 l lmax indexes the real spherical harmonics. The spherical harmonics allow us to capture angular changes in the electron density that would otherwise be lost in a purely radial representation. Put together, we compute the nuclei-centered equivariant embeddings H(0)l ik = Z R3 ρi(r)Γl k( r Ri )Y l \ r Ri dr j=1 wjρ(rj)αi(rj)Γk( rj Ri )Y l \ rj Ri (13) with a set of Nquad standard integration points and weights {(rj, wj)}Nquad j=1 R3 R Nquad (Treutler & Ahlrichs, 1995). Importantly, while these embeddings are centered at the nuclei, they do not embed the nuclear charges, as in ML force fields (Schütt et al., 2017), but the electron density around them. This is important as the derivative with respect to the electron density affects the SCF procedure; see Appendix B. A nuclear charge embedding s derivative would not exist and, thus, not alter the DFT calculation, effectively yielding a force field. We follow Schütt et al. (2021) and parametrize the radial filters Γk as a combination of sin, cos with a fixed polynomial envelope function u : R+ R from Gasteiger et al. (2022) for a cutoff c R+: c if k even, cos rπ(k+1) c if k odd. (14) We found such frequency filters to yield more stable results than Bessel functions as the latter ones are very sensitive close to the centroids (Gasteiger et al., 2022). This sensitivity might be beneficial in force fields but is implicitly given in DFT due to the higher density close to the nuclei. (2) Equivariant message passing allows for the propagation and updating of the equivariant electron density features H(0) to capture molecular-range dependencies within the electron density like Vander-Waals forces (Frank et al., 2022). The equivariance of these features is important as it allows us to define a non-radial-symmetric feature density g NL in the next step. We use Batzner et al. (2022) s Nequ IP to perform message passing with the SO(3)-equivariant convolution from Thomas et al. (2018) to iteratively update the embeddings H(t) over T steps: H(t+1) i = Equi MLP Equi Lin H(t) i + Equi Conv H(t), R + Equi Lin H(t) i , (15) where Equi Lin : Fd lmax Fd lmax is an equivariant dense layer that mixes features of the same order l Equi Lin(X)k = Concat k =1 W l k k Xl k with each W l Rd d being a learnable weight matrix. Equi MLP : Fd lmax Fd lmax is the equivariant equivalent of a standard MLP, where we use the invariant embeddings l = 0 and an activation function σ : R R to gate the equivariant parts l > 0: Equi MLP(X) = Equi Lin(Act Equi Lin(X)), (17) Act Equi Lin(X) = Concat σ(W 0X0), Equi Lin(X)l σ W l X0 lmax l=1 Published as a conference paper at ICLR 2025 Lastly, we define the convolution Equi Conv : FNnuc d lmax RNnuc 3 FNnuc d lmax via the tensor product: Equi Conv(X, R)lo t = li,lf =0 MLP (Γ( Rs Rt ))lo,li,lf Equi MLP(xs)li Y lf \ Rs Rt l0 . To accelerate the computation of the tensor product, we use Passaro & Zitnick (2023) s efficient equivariant convolutions. For more details, we refer the reader to Batzner et al. (2022). (3) Non-local reweighted meta-GGA. As the majority of the XC energy can be captured by semilocal functionals (Goerigk et al., 2017), starting with a machine-learned meta-GGA ϵm GGA : Rdm GGA R accounts for most of the XC energy. To correct the meta-GGA for non-local effects, we first define a non-local feature density g NL : R3 Rd based on the non-local embeddings n H(t)o T l=0 Y l \ r Ri T H(t)l ik | {z } angular Γ( r Ri )T w(t)l k H(t)0 i | {z } radial where w(t)l k : Rd Rd are MLPs mapping to radial weights. While the inner product between the spherical harmonics Y l and our equivariant embeddings H(t)l expresses angular changes, the inner product of radial basis functions Γ and w(t)l k allows for radial changes. From this feature density g NL and standard meta-GGA inputs gm GGA : R3 Rdm GGA, we derive the non-local correction γNL : Rd+dm GGA R to the exchange energy density ϵm GGA: ϵXC(r) = γNL (g NL(r), gm GGA(r)) ϵm GGA (gm GGA(r)) . (21) In practice, we implement γNL as an MLP. To obtain the final readout, we integrate the exchangecorrelation density over the electron density R R3 ϵXC(r)ρ(r)dr. Like Equation 13, we evaluate the integral with standard integration grids. (4) Graph readout. In addition to the meta-GGA-based readout, we add global graph readout of the embeddings n H(t)o T t=0 to capture the remaining non-local effects. We use an MLP ϵNL : Rd R on top of the invariant embeddings (l = 0) to obtain the final exchange-correlation energy: R3 ϵXC(r)ρ(r)dr + i=1 ϵNL H(t)0 i . (22) Limitations. While EG-XC demonstrates significant improvements over a semi-local ML functional, it is not free of limitations. First, as we rely on the nuclear positions to represent the electronic density, it is not truly universal, i.e., independent of the external potential Vext (Kohn & Sham, 1965). Second, the non-local nature of our functional permits no known way to enforce most physical constraints (Kaplan et al., 2023). Though, the importance of these constraints is still debated (Kirkpatrick et al., 2021), see Appendix O. These missing constraints enable the correction of basis set errors through the XC functional. While this may lead to unphysical matches between densities and energies, our experiments suggest that this does not lead to overfitting energies. Third, systems without nuclei, e.g., the homogenous electron gas, cannot be modeled with our approach. In such cases, one may want to replace the real-space point cloud with a frequency representation (Kosmala et al., 2023). Fourth, to handle open-shell systems, e.g., in chemical reactions, one would need to extend the equivariant embeddings to include spin information. Lastly, while more data efficient and better in extrapolation, running KS-DFT is more expensive than a surrogate force field. 5 EXPERIMENTS In the following, we compare EG-XC to various alternative methods of learning potential energy surfaces across several settings. In particular, we focus on the following tasks: interpolating accurate Published as a conference paper at ICLR 2025 Table 1: Test set MAE on the CCSD(T) MD17 dataset in m Eh. (best, second) Force field -ML KS-DFT Molecule Sch Net Pai NN Nequ IP Sch Net Pai NN Nequ IP Dick EG-XC Aspirin 7.01 2.82 5.52 2.02 1.20 1.04 1.94 0.69 Benzene 0.40 0.16 0.09 0.13 0.11 0.02 0.39 0.10 Ethanol 1.41 0.89 0.95 0.93 0.42 0.25 0.85 0.21 Malonaldehyde 2.10 1.00 2.32 0.61 0.44 0.29 0.73 0.27 Toluene 1.80 1.10 1.87 0.44 0.31 0.13 0.38 0.20 energy surfaces, extrapolation to unseen conformations, and extrapolation to larger molecules. Additionally, we present an ablation study on EG-XC s components. For a runtime complexity analysis, we refer the reader to Appendix M and for runtime measurements to Appendix N. Methods. To accurately position EG-XC, we compare to methods of varying computational costs: force fields, -ML, i.e., combining force fields with KS-DFT calculations, and a learnable XCfunctional (Dick & Fernandez-Serra, 2021). While force fields are orders of magnitude cheaper by bypassing quantum mechanical calculations, they lack prior physical knowledge. For the -ML methods, we shift all energies by DFT energies with LDA in the STO-6G basis. This reduces the learning problem to the difference between the KS-DFT and the target energy (Wengert et al., 2021). As force fields, we test increasingly expressive models based on their use of SO(3) irreducible representations: Sch Net (l = 0) (Schütt et al., 2017), Pai NN (l = 1) (Schütt et al., 2021), and Nequ IP (l = 2) (Batzner et al., 2022). Finally, we compare with EG-XC s learnable semi-local XC-functional (Dick & Fernandez-Serra, 2021). As -ML methods and learnable XC-functionals require a DFT computation for each structure, they are at the same computational cost as EG-XC. In Appendix L, we provide additional -ML calculations with learnable XC functionals as base method. Setup. To train the XC functionals, we follow Li et al. (2021) and implement the SCF method differentiably. This allows us to match the converged SCF energies directly to the target energies without needing ground truth electron densities. We provide implementation details in Appendix C. All methods are trained on energy labels only. Force fields are trained with hyperparameters from their respective works with modifications to learning rate, batch size, and initialization to improve performance; we outline the changes in Appendix G. We list EG-XC s hyperparameters in Appendix F. Loss. We follow Dick & Fernandez-Serra (2021) and minimize the mean squared error between the converged SCF energy and the target energy over the last Iloss N+ steps Etarget,n Ei SCF(Rn, Zn) 2 (23) with Etarget R beign the target energy and Ei SCF : (R3 N+)M R the SCF energy after the i-th iteration. The index n runs over the dataset. The parameter gradients are obtained through the total derivative of the loss function with respect to the parameters of the XC functional d L dθ by backpropagating through the SCF iterations. Importantly, for the parameters updates to exist, the XC functional must be at least twice differentiable as we compute the partial derivative EXC[ρ] P i to construct the ith Fock matrix (see Appendix B) and the total derivative for the updates d L dθ . We obtain a C -smooth XC functionals by using the Si LU activation function (Hendrycks & Gimpel, 2023). Reproducing gold-standard accuracies. The objective behind learning XC-functionals or ML force fields is to facilitate access to accurate potential energy surfaces by distilling highly accurate reference data into a faster method. Hence, we compare these methods on the revised MD17 dataset, which contains precise gold-standard CCSD(T) (CCSD for aspirin) reference energies for conformations of five molecules along the trajectory of a molecular dynamic (MD) simulation (Chmiela et al., 2018). Each molecule has a training set of 1000 structures, which we split into 950 training and 50 validation structures. Each test set contains an additional 500 structures (1000 for ethanol). Following Schütt et al. (2017), a separate model is fitted per molecule. Since CCSD(T) calculations inherently account for non-local effects, these datasets are well suited for investigating the ability to accurately interpolate multi-dimensional energy surfaces from a limited number of reference structures. In Appendix I, we provide additional -ML data with a more accurate DFT functional and basis sets. Published as a conference paper at ICLR 2025 Table 2: Structural extrapolation on the 3BPA dataset. All methods were trained on the 300K training set. All numbers are relative MAE in m Eh. The last three rows refer to the potential energy surfaces. Force field -ML KS-DFT Test set Sch Net Pai NN Nequ IP Sch Net Pai NN Nequ IP Dick EG-XC 300K 5.15 2.91 3.81 2.38 1.14 0.81 0.96 0.42 600K 9.06 5.81 7.55 3.96 2.13 1.56 1.36 0.73 1200K 18.33 14.14 17.30 6.84 5.97 3.30 2.27 1.39 β = 120 3.84 1.78 2.25 2.53 1.25 1.09 0.75 0.35 β = 150 4.64 1.89 2.64 2.03 0.84 0.88 0.61 0.23 β = 180 4.97 1.92 3.03 1.79 1.06 0.73 0.56 0.20 Figure 2: Two-dimensional slice of the potential energy surface of 3BPA with the dihedral β = 120 . Pure force fields like Nequ IP struggle to recover the shape of this out-of-distribution energy surface. When paired with DFT calculations, one can see that the energy surface moves closer to the target shape but introduces additional extrema. Learnable XC functionals like Dick & Fernandez-Serra (2021) and EG-XC demonstrate significantly better reproduction of the target energy surface. Table 1 lists the mean absolute error (MAE) for all methods on the MD17 test sets. We find that force fields generally struggle to reconstruct the energy surface of the MD17 dataset accurately when trained solely on energies. In contrast, -ML methods and learnable XC-functionals generally achieve chemical accuracy at 1 kcal mol 1 1.6 m Eh. The DFT reference calculations systematically reduce the to-be-learned energy surface delta from 6.2 m Eh to 4.9 m Eh. Compared to the EG-XC s base semi-local Dick & Fernandez-Serra (2021), EG-XC reduces errors by a factor of 2 to 4. Among all methods, EG-XC yields the lowest error on 3 of the 5 molecules. In Appendix H, we test the performance on a reduced training set of just 50 samples. There we find that gap between KS-DFT methods and force fields ones to widen significantly supporting the data efficiency of learnable KS-DFT methods. We hypothesize the good performance of force fields on MD17 being due to the homogeneous nature of the dataset where the gap between the training and test set is small. In such settings, no physical inductive bias may be required to perform well on the test set. Conversely, we expect the performance of force fields to degrade in extrapolation settings where the training and test set differ significantly. We investigate this hypothesis in the following experiments. Extrapolating structures. In MD simulations, one often encounters structures far outside the training set, e.g., due to higher temperatures or environmental changes (Stocker et al., 2022). To investigate the extrapolation to unseen structures, we use the 3BPA dataset (Kovács et al., 2021). 3BPA contains various geometric configurations of the molecule 3-(benzyloxy)pyridin-2-amine. The training set consists of 500 structures sampled from an MD simulation at room temperature (300K). The test sets consist of MD trajectories at 300K, 600K, and 1200K to test in and out-of-distribution (OOD) performance. Additionally, the dataset contains three 2-dimensional potential energy surface slices where two dihedral angles are varied while keeping one fixed. The labels have been computed with the hybrid ωB97X XC functional and the 6-31G(d) basis set and, thus, include non-local interactions. As constant offsets of the potential energy surface do not affect the system dynamics, we evaluate the relative MAE, i.e., E [|Epred Etarget median(Epred Etarget)|]. We list the relative MAE for all methods and test sets in Table 2; for the absolute MAE, we refer the reader to Appendix J. Across all test sets, EG-XC results in 35 % to 51 % lower errors than the next best-tested alternative. On the Published as a conference paper at ICLR 2025 Maximum number of heavy atoms in train (training set size) 5 6 7 8 9 6 7 8 9 7 8 9 8 9 4 (48) 5 (174) 6 (776) 7 (3884) 15.3 19.4 21.7 24.4 27.1 17.2 20.7 24.3 27.7 5.1 8.4 11.5 3.1 5.0 12.2 15.1 17.5 20.1 22.3 9.1 13.0 18.2 23.5 2.4 4.3 6.5 1.8 3.0 11.5 15.7 19.0 22.3 25.3 5.2 7.0 9.2 11.2 2.6 4.3 5.8 1.4 2.4 11.1 14.0 16.8 19.8 23.7 4.8 7.5 10.9 14.7 3.2 5.4 8.2 1.7 2.8 10.1 12.8 15.9 19.6 23.5 3.1 4.1 5.4 7.0 1.3 2.1 2.7 1.0 1.8 13.6 15.6 17.8 20.0 22.2 2.4 3.4 5.1 6.9 1.3 2.1 2.9 0.9 1.5 3.2 4.1 4.4 4.9 5.3 2.7 2.7 2.9 3.1 2.1 2.3 2.6 2.0 2.1 4.3 5.7 7.1 8.7 10.4 1.1 1.6 2.2 3.1 0.7 1.0 1.3 0.8 1.0 Heavy atoms in test Figure 3: MAE in m Eh on QM9 size extrapolation. Each row represents a different method. The four groups indicate the maximum number of heavy atoms in the training set and its size. Each column represents the test subset with the number of heavy atoms listed above. far OOD 1200K samples, EG-XC is the only method achieving chemical accuracy 1.6 m Eh at an relative MAE of 1.40 m Eh. To illustrate the qualitative improvement of EG-XC s energy surfaces, we plot the most OOD potential energy slice at β = 120 in Figure 2 for Nequ IP, -Nequ IP, Dick & Fernandez-Serra (2021), EG-XC and the target. The remaining methods and corresponding energy surfaces are plotted in Appendix K. It is evident that force fields fail to reproduce the target energy surface with no resemblance to the target surface. While the reference DFT calculations move -Nequ IP closer to the target surface, the energy shape includes additional extrema not present in the target surface. In contrast, both XC functionals accurately reproduce the energy surface. In line with the results on MD17, we find support to the hypothesis that learnable XC functionals are better suited for extrapolation to unseen structures than force fields. Extrapolation to larger molecules. Gathering reference data for large compounds is costly and expensive with accurate quantum chemical calculations like CCSD(T) scaling O(Nel 7) in the number of electrons Nel. Thus, extrapolation from small or medium-sized molecules to larger ones is critical. Here, we simulate this setting by splitting the QM9 dataset (Ramakrishnan et al., 2014) into subsets of increasing size based on the number of heavy atoms, i.e., QM9(S) are all QM9 molecules with at most S heavy atoms. For each S {4, 5, 6, 7}, we train a separate model and test on the remaining structures, i.e., QM9\QM9(S). For each training set, we split the structures 90%/10% into training and validation sets. The QM9 dataset comprises 134k stable organic molecules with up to 9 heavy atoms. The energies have been computed with the hybrid B3LYP XC functional with the 6-31G(2df,p) basis set and, thus, contain non-local interactions through the exact Hartree exchange. As the dataset contains only few molecules with fluorine, force fields (including -ML) could not yield accurate energies if none or few are in the training set. Thus, we omitted all molecules that include fluorine. We visualize the MAE for each combination of the number of heavy atoms in the test set and the maximum number of heavy atoms in the training set in Figure 3. Across all combinations of training and test sets, we find learnable XC functionals yielding the lowest errors with a preference for Dick & Fernandez-Serra (2021) on the smallest QM9(4). We hypothesize that this is due to the physical constraints enforced by the learnable XC functional, which are not present in the other methods, including EG-XC. Starting with QM9(5), EG-XC s errors are consistently the lowest. Notably, EG-XC trained on QM9(6) yields lower MAE on the largest structures than the best alternative on QM9(7) with 5 times more samples and one-atom larger molecules. On QM9(7), EG-XC is at least 33 % more accurate than the competing methods on test molecules with 9 heavy atoms. Compared to Dick & Fernandez-Serra (2021), EG-XC reduces the MAE by at least 2 on QM9(6) and QM9(7). Published as a conference paper at ICLR 2025 Table 3: Ablation study on the 3BPA dataset. 300K 600K 1200K β = 120 β = 120 β = 120 no m GGA 7.00 12.89 25.85 10.99 11.16 10.82 no graph readout 0.97 1.38 2.30 0.77 0.63 0.57 no GNN 0.60 0.87 1.59 0.57 0.54 0.53 EG-XC 0.42 0.73 1.39 0.35 0.23 0.20 Overall, the QM9 results support the hypothesis that learnable XC functionals are better suited for extrapolation to larger molecules than force fields. Ablations. To highlight the importance EG-XC s components, we perform an ablation study on the 3BPA dataset. We compare the full EG-XC model to variants without the m GGA, the graph readout, and the equivariant GNN. For the model without GNN, we use an equivariant MLP defined in Equation 17 that we apply node-wise. We report the relative MAE on all test sets in Table 3. One sees that the m GGA is a crucial component for the performance of EG-XC. The model without graph readout approximately doubles EG-XC s error close to the performance of Dick & Fernandez-Serra (2021). Without the equivariant message passing, we observe that EG-XC can still reduce errors significantly, indicating that our equivariant convolution is well-suited to encode the electron density. Otherwise, we would not observe an improvement here as a charge embedding could only correct a constant error, which is accounted for in the relative MAE metric. 6 DISCUSSION We tackled the problem of learning efficient non-local exchange-correlation functionals to enhance the accuracy of density functional theory. To this end, we have presented Equivariant Graph Exchange Correlation (EG-XC), an equivariant graph neural network approach. We have introduced a basisindependent reduction of the electron density into a finite point cloud representation, enabling equivariant graph neural networks to capture molecular-range information. Based on this point cloud embedding, we have parameterized a non-local feature density used for the reweighing of a semi-local exchange-correlation energy density. Unlike other learnable non-local functionals, EG-XC can be trained by differentiating through the SCF solver such that the training requires only energy targets. In our empirical evaluation, EG-XC improves upon previous machine learning-based methods of modeling potential energy surfaces. On MD17, EG-XC accurately reconstructs gold-standard potential energy surfaces, namely CCSD(T), within the DFT framework. On the 3BPA, EG-XC reduces errors by 35 % to 50 % compared to the best-performing baseline. On QM9, EG-XC demonstrates remarkable data efficiency, achieving similar accuracies with 5 times less data or up to 33 % lower errors at the same amount of data compared to the best baseline. Finally, EG-XC extrapolates well to unseen conformations and larger molecules on 3BPA and QM9, respectively. Overall, these results strongly underline the data efficiency of learning exchange-correlation functionals compared to machine-learned force fields. We hypothesize that a significant portion of EG-XC s generalization is thanks to including the physically constrained semi-local model that captures most of the exchange-correlation energy and biases EG-XC to approximately fulfill the same constraints. On top of these, our non-local contributions transfer the accuracies of graph neural network-based force fields to functionals while maintaining a similar data efficiency. Future work. The simple implementation of EG-XC through standard deep learning libraries and the integration with the self-consistent field method open the door to a range of future research. Thanks to the basis-set independence, EG-XC can transfer to non-atom-centered basis sets like plane waves, as they are common in periodic systems, or approximate the kinetic energy functional in orbital-free DFT. Further, while we trained EG-XC solely on energies, multimodal training with other DFT-computable observables like electron densities or atomic forces could further improve accuracy and generalization. Another path to improving generalization may be integrating known physical constraints to the non-local part of EG-XC. Finally, accurate functionals like EG-XC may bridge the gap between sparse, accurate quantum mechanical calculations (Cheng et al., 2024; Gao & Günnemann, 2024a) and fast force field (Batzner et al., 2022). Published as a conference paper at ICLR 2025 Acknowledgements. We are grateful to Leo Schwinn and Jan Schuchardt for their invaluable manuscript feedback and to Arthur Kosmala and Johannes Margraf for insightful discussions on density functional theory. Funded by the Federal Ministry of Education and Research (BMBF) and the Free State of Bavaria under the Excellence Strategy of the Federal Government and the Länder. Ilyes Batatia, David P. Kovacs, Gregor Simm, Christoph Ortner, and Gabor Csanyi. MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields. Advances in Neural Information Processing Systems, 35:11423 11436, December 2022. Simon Batzner, Albert Musaelian, Lixin Sun, Mario Geiger, Jonathan P. Mailoa, Mordechai Kornbluth, Nicola Molinari, Tess E. Smidt, and Boris Kozinsky. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nature Communications, 13(1):2453, May 2022. ISSN 2041-1723. doi: 10.1038/s41467-022-29939-5. A. D. Becke. A multicenter numerical integration scheme for polyatomic molecules. The Journal of Chemical Physics, 88(4):2547 2553, February 1988. ISSN 0021-9606, 1089-7690. doi: 10.1063/1.454033. James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander Plas, Skye Wanderman-Milne, and Qiao Zhang. JAX: Composable transformations of Python+Num Py programs, 2018. Kyle Bystrom and Boris Kozinsky. CIDER: An Expressive, Nonlocal Feature Set for Machine Learning Density Functionals with Exact Constraints. Journal of Chemical Theory and Computation, 18 (4):2180 2192, April 2022. ISSN 1549-9618. doi: 10.1021/acs.jctc.1c00904. Zehua Chen and Weitao Yang. Development of a machine learning finite-range nonlocal density functional. The Journal of Chemical Physics, 160(1):014105, January 2024. ISSN 0021-9606. doi: 10.1063/5.0179149. Lixue Cheng, P. Bernát Szabó, Zeno Schätzle, Derk Kooi, Jonas Köhler, Klaas J. H. Giesbertz, Frank Noé, Jan Hermann, Paola Gori-Giorgi, and Adam Foster. Highly Accurate Real-space Electron Densities with Neural Networks, September 2024. Stefan Chmiela, Huziel E. Sauceda, Klaus-Robert Müller, and Alexandre Tkatchenko. Towards exact molecular dynamics simulations with machine-learned force fields. Nature Communications, 9(1): 3887, September 2018. ISSN 2041-1723. doi: 10.1038/s41467-018-06169-2. Christopher J. Cramer. Essentials of Computational Chemistry: Theories and Models. Wiley, Chichester, West Sussex, England ; Hoboken, NJ, 2nd ed edition, 2004. ISBN 978-0-470-09182-1 978-0-470-09181-4. Sebastian Dick and Marivi Fernandez-Serra. Highly accurate and constrained density functional obtained with differentiable programming. Physical Review B, 104(16):L161109, October 2021. doi: 10.1103/Phys Rev B.104.L161109. Thorben Frank, Oliver Unke, and Klaus-Robert Müller. So3krates: Equivariant attention for interactions on arbitrary length-scales in molecular systems. Advances in Neural Information Processing Systems, 35:29400 29413, 2022. Nicholas Gao and Stephan Günnemann. Ab-Initio Potential Energy Surfaces by Pairing GNNs with Neural Wave Functions. In International Conference on Learning Representations, April 2022. Nicholas Gao and Stephan Günnemann. Generalizing Neural Wave Functions. In International Conference on Machine Learning, February 2023a. Nicholas Gao and Stephan Günnemann. Sampling-free Inference for Ab-Initio Potential Energy Surface Networks. In The Eleventh International Conference on Learning Representations, February 2023b. Published as a conference paper at ICLR 2025 Nicholas Gao and Stephan Günnemann. Neural Pfaffians: Solving Many Many-Electron Schrödinger Equations. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, September 2024a. Nicholas Gao and Stephan Günnemann. On Representing Electronic Wave Functions with Sign Equivariant Neural Networks. In ICLR 2024 Workshop on AI4Differential Equations In Science, March 2024b. Johannes Gasteiger, Florian Becker, and Stephan Günnemann. Gem Net: Universal Directional Graph Neural Networks for Molecules. In Advances in Neural Information Processing Systems, May 2021. Johannes Gasteiger, Chandan Yeshwanth, and Stephan Günnemann. Directional Message Passing on Molecular Graphs via Synthetic Coordinates, April 2022. Lars Goerigk, Andreas Hansen, Christoph Bauer, Stephan Ehrlich, Asim Najibi, and Stefan Grimme. A look at the density functional theory zoo with the advanced GMTKN55 database for general main group thermochemistry, kinetics and noncovalent interactions. Physical Chemistry Chemical Physics, 19(48):32184 32215, 2017. ISSN 1463-9076, 1463-9084. doi: 10.1039/C7CP04913G. Dan Hendrycks and Kevin Gimpel. Gaussian Error Linear Units (GELUs), June 2023. R. O. Jones. Density functional theory: Its origins, rise to prominence, and future. Reviews of Modern Physics, 87(3):897 923, August 2015. ISSN 0034-6861, 1539-0756. doi: 10.1103/Rev Mod Phys. 87.897. Aaron D. Kaplan, Mel Levy, and John P. Perdew. The Predictive Power of Exact Constraints and Appropriate Norms in Density Functional Theory. Annual Review of Physical Chemistry, 74(1): 193 218, 2023. doi: 10.1146/annurev-physchem-062422-013259. Tosio Kato. On the eigenfunctions of many-particle systems in quantum mechanics. Communications on Pure and Applied Mathematics, 10(2):151 177, 1957. ISSN 1097-0312. doi: 10.1002/cpa. 3160100201. Min-Cheol Kim, Eunji Sim, and Kieron Burke. Understanding and Reducing Errors in Density Functional Calculations. Physical Review Letters, 111(7):073003, August 2013. ISSN 0031-9007, 1079-7114. doi: 10.1103/Phys Rev Lett.111.073003. James Kirkpatrick, Brendan Mc Morrow, David H. P. Turban, Alexander L. Gaunt, James S. Spencer, Alexander G. D. G. Matthews, Annette Obika, Louis Thiry, Meire Fortunato, David Pfau, Lara Román Castellanos, Stig Petersen, Alexander W. R. Nelson, Pushmeet Kohli, Paula Mori Sánchez, Demis Hassabis, and Aron J. Cohen. Pushing the frontiers of density functionals by solving the fractional electron problem. Science, 374(6573):1385 1389, December 2021. doi: 10.1126/science.abj6511. W. Kohn and L. J. Sham. Self-Consistent Equations Including Exchange and Correlation Effects. Physical Review, 140(4A):A1133 A1138, November 1965. ISSN 0031-899X. doi: 10.1103/ Phys Rev.140.A1133. Arthur Kosmala, Johannes Gasteiger, Nicholas Gao, and Stephan Günnemann. Ewald-based Long Range Message Passing for Molecular Graphs. In Proceedings of the 40th International Conference on Machine Learning, pp. 17544 17563. PMLR, July 2023. Dávid Péter Kovács, Cas van der Oord, Jiri Kucera, Alice E. A. Allen, Daniel J. Cole, Christoph Ortner, and Gábor Csányi. Linear Atomic Cluster Expansion Force Fields for Organic Molecules: Beyond RMSE. Journal of Chemical Theory and Computation, 17(12):7696 7711, December 2021. ISSN 1549-9618. doi: 10.1021/acs.jctc.1c00647. David B. Krisiloff, Caroline M. Krauter, Francis J. Ricci, and Emily A. Carter. Density Fitting and Cholesky Decomposition of the Two-Electron Integrals in Local Multireference Configuration Interaction Theory. Journal of Chemical Theory and Computation, 11(11):5242 5251, November 2015. ISSN 1549-9618. doi: 10.1021/acs.jctc.5b00762. Published as a conference paper at ICLR 2025 H. J. Kulik, T. Hammerschmidt, J. Schmidt, S. Botti, M. A. L. Marques, M. Boley, M. Scheffler, M. Todorovi c, P. Rinke, C. Oses, A. Smolyanyuk, S. Curtarolo, A. Tkatchenko, A. P. Bartók, S. Manzhos, M. Ihara, T. Carrington, J. Behler, O. Isayev, M. Veit, A. Grisafi, J. Nigam, M. Ceriotti, K. T. Schütt, J. Westermayr, M. Gastegger, R. J. Maurer, B. Kalita, K. Burke, R. Nagai, R. Akashi, O. Sugino, J. Hermann, F. Noé, S. Pilati, C. Draxl, M. Kuban, S. Rigamonti, M. Scheidgen, M. Esters, D. Hicks, C. Toher, P. V. Balachandran, I. Tamblyn, S. Whitelam, C. Bellinger, and L. M. Ghiringhelli. Roadmap on Machine learning in electronic structure. Electronic Structure, 4 (2):023004, August 2022. ISSN 2516-1075. doi: 10.1088/2516-1075/ac572f. Susi Lehtola, Frank Blockhuys, and Christian Van Alsenoy. An Overview of Self-Consistent Field Calculations Within Finite Basis Sets. Molecules, 25(5):1218, January 2020. ISSN 1420-3049. doi: 10.3390/molecules25051218. Li Li, Stephan Hoyer, Ryan Pederson, Ruoxi Sun, Ekin D. Cubuk, Patrick Riley, and Kieron Burke. Kohn-Sham Equations as Regularizer: Building Prior Knowledge into Machine-Learned Physics. Physical Review Letters, 126(3):036401, January 2021. ISSN 0031-9007, 1079-7114. doi: 10.1103/Phys Rev Lett.126.036401. Johannes T. Margraf and Karsten Reuter. Pure non-local machine-learned density functional theory for electron correlation. Nature Communications, 12(1):344, January 2021. ISSN 2041-1723. doi: 10.1038/s41467-020-20471-y. Wenhui Mi, Kai Luo, S. B. Trickey, and Michele Pavanello. Orbital-Free Density Functional Theory: An Attractive Electronic Structure Method for Large-Scale First-Principles Simulations. Chemical Reviews, 123(21):12039 12104, November 2023. ISSN 0009-2665. doi: 10.1021/acs.chemrev. 2c00758. Ryo Nagai, Ryosuke Akashi, and Osamu Sugino. Completing density functional theory by machine learning hidden messages from molecules. npj Computational Materials, 6(1):1 8, May 2020. ISSN 2057-3960. doi: 10.1038/s41524-020-0310-0. Ryo Nagai, Ryosuke Akashi, and Osamu Sugino. Machine-learning-based exchange correlation functional with physical asymptotic constraints. Physical Review Research, 4(1):013106, February 2022. doi: 10.1103/Phys Rev Research.4.013106. Saro Passaro and C. Lawrence Zitnick. Reducing SO(3) convolutions to SO(2) for efficient equivariant GNNs. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of ICML 23, pp. 27420 27438, Honolulu, Hawaii, USA, July 2023. JMLR.org. John P. Perdew. Jacob s ladder of density functional approximations for the exchange-correlation energy. In AIP Conference Proceedings, volume 577, pp. 1 20, Antwerp (Belgium), 2001. AIP. doi: 10.1063/1.1390175. P. Pulay. Improved SCF convergence acceleration. Journal of Computational Chemistry, 3(4): 556 560, 1982. ISSN 1096-987X. doi: 10.1002/jcc.540030413. Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, and O. Anatole von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1(1):140022, December 2014. ISSN 2052-4463. doi: 10.1038/sdata.2014.22. Roman Remme, Tobias Kaczun, Maximilian Scheurer, Andreas Dreuw, and Fred A. Hamprecht. Kinetic Net: Deep learning a transferable kinetic energy functional for orbital-free density functional theory. The Journal of Chemical Physics, 159(14), 2023. Kristof Schütt, Oliver Unke, and Michael Gastegger. Equivariant message passing for the prediction of tensorial properties and molecular spectra. In Proceedings of the 38th International Conference on Machine Learning, pp. 9377 9388. PMLR, July 2021. Kristof T. Schütt, Pieter-Jan Kindermans, Huziel E. Sauceda, Stefan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. Sch Net: A continuous-filter convolutional neural network for modeling quantum interactions. ar Xiv:1706.08566 [physics, stat], December 2017. Published as a conference paper at ICLR 2025 John C. Snyder, Matthias Rupp, Katja Hansen, Klaus-Robert Müller, and Kieron Burke. Finding Density Functionals with Machine Learning. Physical Review Letters, 108(25):253002, June 2012. ISSN 0031-9007, 1079-7114. doi: 10.1103/Phys Rev Lett.108.253002. Sina Stocker, Johannes Gasteiger, Florian Becker, Stephan Günnemann, and Johannes T. Margraf. How Robust are Modern Graph Neural Network Potentials in Long and Hot Molecular Dynamics Simulations? Preprint, Chemistry, April 2022. Jianwei Sun, Adrienn Ruzsinszky, and John P. Perdew. Strongly Constrained and Appropriately Normed Semilocal Density Functional. Physical Review Letters, 115(3):036402, July 2015. ISSN 0031-9007, 1079-7114. doi: 10.1103/Phys Rev Lett.115.036402. Qiming Sun, Timothy C. Berkelbach, Nick S. Blunt, George H. Booth, Sheng Guo, Zhendong Li, Junzi Liu, James D. Mc Clain, Elvira R. Sayfutyarova, Sandeep Sharma, Sebastian Wouters, and Garnet Kin-Lic Chan. Py SCF: The Python-based simulations of chemistry framework. WIREs Computational Molecular Science, 8(1):e1340, 2018. ISSN 1759-0884. doi: 10.1002/wcms.1340. Attila Szabo and Neil S. Ostlund. Modern Quantum Chemistry: Introduction to Advanced Electronic Structure Theory. Courier Corporation, 2012. Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotationand translation-equivariant neural networks for 3D point clouds. ar Xiv:1802.08219 [cs], May 2018. Oliver Treutler and Reinhart Ahlrichs. Efficient molecular numerical integration schemes. The Journal of Chemical Physics, 102(1):346 354, January 1995. ISSN 0021-9606. doi: 10.1063/1.469408. Oliver T. Unke, Stefan Chmiela, Huziel E. Sauceda, Michael Gastegger, Igor Poltavsky, Kristof T. Schütt, Alexandre Tkatchenko, and Klaus-Robert Müller. Machine Learning Force Fields. Chemical Reviews, 121(16):10142 10186, August 2021. ISSN 0009-2665. doi: 10.1021/acs.chemrev. 0c01111. O. Vahtras, J. Almlöf, and M.W. Feyereisen. Integral approximations for LCAO-SCF calculations. Chemical Physics Letters, 213(5-6):514 518, October 1993. ISSN 00092614. doi: 10.1016/ 0009-2614(93)89151-7. Simon Wengert, Gábor Csányi, Karsten Reuter, and Johannes T. Margraf. Data-efficient machine learning for molecular crystal structure prediction. Chemical Science, pp. 10.1039.D0SC05765G, 2021. ISSN 2041-6520, 2041-6539. doi: 10.1039/D0SC05765G. Max Wilson, Nicholas Gao, Filip Wudarski, Eleanor Rieffel, and Norm M. Tubman. Simulations of state-of-the-art fermionic neural network wave functions with diffusion Monte Carlo, March 2021. Max Wilson, Saverio Moroni, Markus Holzmann, Nicholas Gao, Filip Wudarski, Tejs Vegge, and Arghya Bhowmik. Neural network ansatz for periodic wave functions and the homogeneous electron gas. Physical Review B, 107(23):235139, June 2023. doi: 10.1103/Phys Rev B.107.235139. He Zhang, Siyuan Liu, Jiacheng You, Chang Liu, Shuxin Zheng, Ziheng Lu, Tong Wang, Nanning Zheng, and Bin Shao. Overcoming the barrier of orbital-free density functional theory for molecular systems using deep learning. Nature Computational Science, 4(3):210 223, March 2024. ISSN 2662-8457. doi: 10.1038/s43588-024-00605-8. Xuan Zhang, Limei Wang, Jacob Helwig, Youzhi Luo, Cong Fu, Yaochen Xie, Meng Liu, Yuchao Lin, Zhao Xu, Keqiang Yan, Keir Adams, Maurice Weiler, Xiner Li, Tianfan Fu, Yucheng Wang, Haiyang Yu, Yu Qing Xie, Xiang Fu, Alex Strasser, Shenglong Xu, Yi Liu, Yuanqi Du, Alexandra Saxton, Hongyi Ling, Hannah Lawrence, Hannes Stärk, Shurui Gui, Carl Edwards, Nicholas Gao, Adriana Ladera, Tailin Wu, Elyssa F. Hofgard, Aria Mansouri Tehrani, Rui Wang, Ameya Daigavane, Montgomery Bohde, Jerry Kurtin, Qian Huang, Tuong Phung, Minkai Xu, Chaitanya K. Joshi, Simon V. Mathis, Kamyar Azizzadenesheli, Ada Fang, Alán Aspuru-Guzik, Erik Bekkers, Michael Bronstein, Marinka Zitnik, Anima Anandkumar, Stefano Ermon, Pietro Liò, Rose Yu, Stephan Günnemann, Jure Leskovec, Heng Ji, Jimeng Sun, Regina Barzilay, Tommi Jaakkola, Connor W. Coley, Xiaoning Qian, Xiaofeng Qian, Tess Smidt, and Shuiwang Ji. Artificial Intelligence for Science in Quantum, Atomistic, and Continuum Systems, November 2023. Published as a conference paper at ICLR 2025 Yi Zhou, Jiang Wu, Shuguang Chen, and Guan Hua Chen. Toward the Exact Exchange Correlation Potential: A Three-Dimensional Convolutional Neural Network Construct. The Journal of Physical Chemistry Letters, 10(22):7264 7269, November 2019. ISSN 1948-7185, 1948-7185. doi: 10.1021/acs.jpclett.9b02838. Published as a conference paper at ICLR 2025 A KOHN-SHAM DENSITY FUNCTIONAL THEORY In Kohn-Sham density functional theory (KS-DFT) (Kohn & Sham, 1965), one attempts to approximate the ground state energy of a molecular system by finding the electron density that minimizes the energy functional Equation 1. Compared to ab-initio methods like Gao & Günnemann (2022; 2023b;a; 2024b); Wilson et al. (2021; 2023), KS-DFT requires the empirical approximation of the XC functional EXC but scales more favorably with system size. Here, we largely compress Lehtola et al. (2020) to provide a brief introduction to the topic. We will neglect electron spin for simplicity and focus on a discretized DFT scheme based on atom-centered basis sets. Operators, i.e., functions mapping from functions to functions, are denoted by acting upon the ket, e.g., |f is the derivative of f. Further, we use the braket notation to denote the inner product of two functions f and g as f|g = R g(x)f(x)dx. We will use Einstein s summation convention, where indices that only appear on one side of an equation are summed over. The electron density ρ of an Nel system is associated with an antisymmetric wave function of a fictitious non-interacting system consisting of Nel orthonormal orbitals ϕi : R3 R. These orbitals are constructed as linear combinations of Nbas a finite basis set χµ : R3 R such that ϕi(r) = Cµiχµ(r). For systems without periodic boundary conditions, these basis sets typically consist of nuclei-centered functions called atomic orbitals. The electron density and pseudo-wavefunction are given by N! det [ϕi(rj)]N i,j=1 , (24) ρ(r) = Z Ψ(r, r2, ..., r N) Ψ(r, r2, ..., r N)dr2...dr N = ϕ(r)T ϕ(r) = χ(r)T CT Cχ(r) = χ(r)T Pχ(r), (25) where C RNbas Nel are the orbital coefficients and P = CT C RNbas Nbas is the so-called density matrix. This construction ensures that the electron density belongs to the class of densities expressible by antisymmetric wave functions and enables the exact computation of the kinetic energy (Kohn & Sham, 1965) 2 ϕi| 2|ϕi = 1 2Cµi Cνi χµ| 2|χν = Cµi Cνi Tµν = PµνTµν, (26) making KS-DFT significantly more accurate than contemporary orbital-free DFT. The external potential in molecular systems is defined by the nuclei positions R and charges Z Vext[ρ] = Z ρ(r)Zn |r Rn|dr = Cµi Cνi Z χµ(r)Znχν(r) |r Rn| dr = Cµi Cνi Vµν = PµνVµν. (27) The Hartree energy is a mean-field approximation to the electron-electron interaction ZZ ρ(r)ρ(r ) |r r | drdr = 1 2Cµi Cνi Cλj Cκj ZZ χµ(r)χλ(r ) 1 |r r |χν(r)χκ(r )drdr = Cµi Cνi Cλj Cκj Jµνλκ = PµνPλκJµνλκ. (28) Finally, the exchange-correlation energy is a functional of the electron density, which is unknown and has to be approximated. This is the main problem we are addressing in this work. To find the ground state, the parameters C are optimized to minimize Equation 1. This is typically achieved iteratively in a self-consistent fashion; see Appendix B. The construction via Gaussian-type basis functions (GTOs) greatly simplifies the integrals appearing in the calculation allowing us to analytically precompute the so-called core Hamiltonian Hcoreµν = Tµν + Vµν and the electronelectron repulsion tensor Jµνλκ once and reuse them throughout the optimization. B SELF-CONSISTENT FIELD METHOD In this work, we use the Self-Consistent Field (SCF) method (Kohn & Sham, 1965; Lehtola et al., 2020) to find the coefficients ˆC that minimize the energy functional E[ρ(r; C)] in Eq. 1 ˆC = arg min C E = arg min C PµνHcoreµν + PµνPλκJµνλκ + EXC (29) s.t. ij ϕi|ϕj = δij. (30) Published as a conference paper at ICLR 2025 We can express the orthogonality constraint above via the overlap matrix S RNbas Nbas ij ϕi|ϕj = δij CT SC = I, (31) Sµν := χµ|χν . (32) This constraint is enforced by introducing the Lagrange multipliers Eij, and we define the loss L = E + (CT SC I)ij Eij. (33) Since the matrix E is symmetric and the rotation of the basis does not change the energy, we can choose E to be diagonal. Using the Fock matrix F := E P , we can write the gradient of the loss w.r.t. C as L C = 2FC 2SCE != 0 (34) = FC = SCE (35) where the Fock matrix is given by Fµν = Hcoreµν + 2PλκJµνλκ + EXC[ρ] To evaluate the EXC and its derivative we discretize R3 with spherical integration grids around the nuclei (Becke, 1988). Specifically, for a set of quadrature points {ri}Nquad i=1 RNquad 3 and weights {wi}Nquad i=1 RNquad + , we first compute the density and m GGA features gm GGA(ri) = [ρ(ri), ρ(ri) , τ(ri)] on these points: ρ(ri) = χ(ri)T Pχ(ri), (37) ρ(ri) = 2 χ(ri)T P χ(ri) 2, (38) 2 χ(ri)T P χ(ri). (39) From this discretized input, we compute the scalar output EXC, e.g., EXC[ρ] = PNquad i=1 wiρ(ri)ϵm GGA(gm GGA(ri)) for an m GGA, and rely on automatic differentiation via backpropagation to compute the derivative w.r.t. P. The generalized eigenvalue problem in Equation 35 is solved to find the coefficients C. We solve this by diagonalizing S = V ΛV T , substituting C = X C with X = V Λ 1/2V T , and multiplying both sides from the left with XT to obtain the ordinary eigenvalue problem XT FX | {z } F C = XT SX | {z } I = F = CE. (40) After solving this for C, we recover the original coefficients C = XT C via the eigenvectors associated with the Nel lowest eigenvalues E. Since F depends on P = CCT , this problem is linearized by alternating optimization steps of F(C), C, s.t. we need to iterate until convergence commonly referred to as self-consistence in this context. C SELF-CONSISTENT FIELD IMPLEMENTATION We implement the SCF method from Appendix B fully differentiably in JAX (Bradbury et al., 2018) by following Lehtola et al. (2020). In particular, we precompute χ(r) on the integration grid points (Treutler & Ahlrichs, 1995), and the core-Hamiltonian Hµν. Since the electron-repulsion integrals Jµνλκ scale as O(Nbas 4) in compute and memory, we use the density-fitting approximation (Vahtras et al., 1993) to reduce the memory and compute scaling to O(Nbas 2Naux) where Naux Nbas 2 is the size of the auxiliary basis set. We explain the procedure in more detail in Appendix D. Additionally, we improve the convergence by implementing the direct inversion of the iterative subspace (DIIS) method, which we briefly summarize in Appendix E (Pulay, 1982). For precomputing the integrals and obtaining grid points, we use the Py SCF (Sun et al., 2018). For the evaluation of the atomic orbitals on the integration grid points, we use our own JAX implementation. Published as a conference paper at ICLR 2025 Like Dick & Fernandez-Serra (2021), we perform several SCF iterations with a pre-defined XCfunctional to obtain a good initial guess for the density. During training, we randomly interpolate between the precycled initial guess and a standard initial guess via 2 Pprecycle + 1 1 + α Pstandard (41) where α U(0, 1). At inference time, we fix α = 1. This interpolation leads to varying initial densities and functions as a regularizer (Dick & Fernandez-Serra, 2021). D DENSITY FITTING Here, we largely follow Krisiloff et al. (2015) and give a brief introduction to the density-fitting approximation. For more details, we refer the reader to the original work. We want to approximate the two-electron integral tensor Jµνλκ = Z χµ(r)χν(r ) 1 |r r |χλ(r)χκ(r )drdr . (42) To reduce the memory and compute scaling from O(Nbas 4) to O(Nbas 2Naux), we introduce an auxiliary basis set {χaux µ }Naux µ=1 and approximate the two-electron integrals as Jµνλκ ˆJ(µν)I J 1 IJ ˆJ(λκ)J (43) where ˆJ(µν)I = Z χµ(r)χν(r) 1 |r r |χaux I (r)drdr , (44) JIJ = Z χaux I (r) 1 |r r |χaux J (r )drdr , (45) and the indices I, J run over the auxiliary basis set {χaux I }Naux I=1. Note that choosing {χaux I }Naux I=1 = {χµ χν}Nbas µ,ν=1 yields the exact two-electron integrals. One can further simplify by computing B(µν)I = ˆJ(µν)J J 1 2 = J 1 J 1 2 with J 1 2 being the Choleksy decomposition of J. This reduces the contraction of the two-electron integrals with the density matrix to VH[ρ] = PµνJµνλκPλκ PµνB(µν)IB(λκ)IPλκ. (47) E DIRECT INVERSION OF ITERATIVE SUBSPACE The direct inversion of the iterative subspace (DIIS) method is a common technique to accelerate the convergence of the SCF method. Here, we briefly summarize the method; for a derivation and more details, we refer the reader to Pulay (1982). The DIIS method aims at finding a linear combination of previous Fock matrices that minimizes the norm of the error matrix. This accelerates the convergence of the SCF method by extrapolating a new Fock matrix from previous ones. Let Fk be the Fock matrix in the k-th iteration and Pk the density matrix. Then the error matrix is defined as 2 T (Fk Pk S SPk Fk)S 1 where S is the overlap matrix. We aim to find coefficients ci that minimize the norm of a linear combination of error matrices subject to the constraint Pm i=1 ci = 1. This minimization problem can be solved using Lagrange multipliers, leading to the linear system B 1 1T 0 where Bij = Tr(e T i ej), 1 is a vector of ones, and λ is a Lagrange multiplier. Solving this system gives us the optimal coefficients ci which we use to obtain the extrapolated Fock matrix that is solved in the next SCF iteration. Published as a conference paper at ICLR 2025 Table 4: Hyperparameters for EG-XC. Hyperparameter Value d number of features per irrep 32 lmax number of irreps 2 T number of layers 3 Radial filters 32 ϵm GGA Base semilocal functional Dick & Fernandez-Serra (2021) Batch size 1 Iloss Number of steps to compute loss 3 Parameter EMA 0.995 Optimizer Adam β1 0.9 β2 0.999 Basis set 6-31G(d) Density fitting basis set weigend I SCF iterations 15 Precycle XC functional LDA Precycle iterations 15 Learning rate MD17 0.01 1+ 1 1000 3BPA 0.01 1+ 1 1000 QM9 0.001 1+ 1 1000 F HYPERPARAMETERS We list the hyperparameters for EG-XC in Table 4. G BASELINE CHANGES For reference methods, we used the model hyperparameters from their respective works. For Nequ IP, we used the default l = 2. We extended patience schedules to ensure full convergence for all baselines. On the QM9(S) datasets, we first fitted a linear model on the training set with the number of atoms and shifted all labels by this prediction. Further, we initialized all models with zero in the last layer. These steps improved generalization between 10 and 100 times on the smaller datasets. H MD17 WITH 50 SAMPLES We imitate the setting of Batatia et al. (2022) and train all models on only 50 samples for each of the MD17 trajectories. We reduce the validation set to 10 samples as well. As we found KS-DFT methods to be more learning rate sensitive here, we ablated the initial learning rate with 0.01 and 0.001 and report the lower ones for Dick & Fernandez-Serra (2021) and EG-XC. The resulting test set MAEs are listed in Table 5. KS-DFT methods and -ML are more successful learning from such few samples. Especially on the challenging aspirin structures, EG-XC is the only method yielding accuracies within chemical accuracy. I DELTA ML WITH SCAN (6-31G) DFT In Table 6, we present additional MD17 data with the SCAN functional (Sun et al., 2015) and the 6-31G(d) basis set. The results show that improving the reference DFT functional significantly improves -ML approaches. It should be noted that the pure SCAN is already more accurate than the Dick & Fernandez-Serra (2021) specifically fitted functional on Aspirin. We argue that this setting is Published as a conference paper at ICLR 2025 Table 5: MD17 with only 50 samples. Force field -ML KS-DFT Sch Net Pai NN Nequ IP Sch Net Pai NN Nequ IP Dick EG-XC Aspirin 8.94 9.46 8.75 9.55 3.27 3.19 1.98 1.23 Benzene 2.11 2.28 2.84 0.25 0.26 0.21 0.53 0.59 Ethanol 4.15 3.30 4.44 2.99 1.60 1.32 0.99 0.35 Malonaldehyde 5.47 3.99 5.05 2.81 1.77 1.36 0.78 0.54 Toluene 6.65 4.40 5.15 1.15 0.79 0.80 0.70 0.99 Table 6: MAE for -ML with SCAN and the 6-31G(d) basis set on MD17. For SCAN, we report the relative MAE instead to account for mean shifts. Molecule Sch Net Pai NN Nequ IP SCAN Aspirin 0.57 0.46 0.34 1.84 Benzene 0.08 0.07 0.02 1.27 Ethanol 0.26 0.17 0.08 1.04 Malonaldehyde 0.32 0.25 0.13 1.26 Toluene 0.21 0.17 0.08 1.92 an ideal case of -ML as SCAN is well-suited for such close-to-equilibrium structures. However, it should be mentioned that such improvements are orthogonal as one may also apply -learning on top of EG-XC. J 3BPA MEAN ABSOLUTE ERROR In Table 7, we present the absolute MAE on the 3BPA dataset for all methods. It is apparent that while EG-XC performs well, -Nequ IP achieves the lowest MAE. Thus, demonstrating that EG-XC s error is largly a constant offset that would not affect actual MD simulations. K 3BPA POTENTIAL ENERGY SURFACES We plot the potential energy surfaces for 120 , 150 , and 180 in Figure 4, Figure 5, and Figure 6, respectively. It is apparent that force fields struggle on all energy surfaces while XC-functionals yield a similar surface structure to the target. Table 7: Absolute MAE in m Eh for structural extrapolation on the 3BPA dataset. All methods are trained on the 300K training set. Force field -ML KS-DFT Test set Sch Net Pai NN Nequ IP Sch Net Pai NN Nequ IP Dick EG-XC 300K 5.15 2.91 3.85 2.38 1.14 0.81 0.96 0.42 600K 28.14 13.85 24.56 4.09 2.13 1.56 2.19 1.43 1200K 93.50 49.14 81.00 7.05 6.01 3.30 6.32 4.39 β = 120 19.98 4.16 13.77 2.54 1.26 1.09 1.58 1.04 β = 150 20.13 6.64 15.53 2.59 0.89 0.88 1.60 1.02 β = 180 20.25 9.01 17.35 1.93 1.15 0.77 1.70 1.04 Published as a conference paper at ICLR 2025 Figure 4: Energy surfaces of the 3BPA dataset at β = 120 . Table 8: Relative MAE in m Eh on the 3BPA dataset with -ML on top of learnable XC functionals. Dick EG-XC Test set Sch Net Pai NN Nequ IP Sch Net Pai NN Nequ IP 300K 0.62 0.30 0.34 0.42 0.29 0.30 600K 1.01 0.56 0.62 0.71 0.52 0.56 1200K 1.85 1.24 1.38 1.38 1.15 1.26 β = 120 0.39 0.28 0.32 0.31 0.19 0.22 β = 150 0.41 0.35 0.27 0.24 0.19 0.22 β = 180 0.45 0.37 0.25 0.20 0.15 0.21 Published as a conference paper at ICLR 2025 Figure 5: Energy surfaces of the 3BPA dataset at β = 150 . Published as a conference paper at ICLR 2025 Figure 6: Energy surfaces of the 3BPA dataset at β = 180 . Published as a conference paper at ICLR 2025 L -LEARNING ON LEARNABLE XC FUNCTIONALS As one could see in the experiments of the main body, -ML is a powerful technique to boost the accuracy of force fields by integrating quantum mechanical calculations. One may improve this procedure further by relying on learnable XC functionals as baseline method instead of fixed functionals. Here, we investigate this approach by training -ML on top of Dick & Fernandez-Serra (2021) and EG-XC on the 3BPA dataset. The results are presented in Table 8. One can see that -ML on top of such accuracy XC functionals improves accuracies significantly beyond the learning with LDA functional in the main body. Across all force field models, we find EG-XC to yield lower errors on all test sets than the base m GGA functional from Dick & Fernandez-Serra (2021). This supports the hypothesis that the improvements learned by EG-XC are not identical with those of -ML. M RUNTIME COMPLEXITY In the following, we ignore the cost of precomputing the necessary integral tensors, the core Hamiltonian Hcore, the density fitting tensor B (see. Appendix D), the evaluation of the atomic orbitals on the quadrature grid, and the initial guess P0 for the density as these do not contribute significantly to the overall runtime. A single EG-XC forward pass consists of the following steps: 1. O(Nquad Nbas 2): Compute the electron density features gm GGA from the density matrix P. 2. O(Nquaddlmax): Compute the nuclear-centered point cloud embeddings H. 3. O(Nnuc 2dlmax 3T) + O(Nnucd2lmax T): Equivariant message passing on the point cloud embeddings. 4. O(Nquaddlmax): Compute the non-local feature density g NL on the nuclear grid points. 5. O(Nquadd2L): Compute the XC energy density ϵXC from the non-local feature density. 6. O(Nnucd2): Graph readout to obtain the XC energy. Note that due to the exponential decay in the partitioned density ρi, we truncate the summations in Equation 13 and Equation 20 to avoid a O(Nquad Nnuc) scaling. Additionally, a single SCF requires the following steps: 1. O(Nbas 2Naux): Compute the electron repulsion integrals. 2. O(Nbas 2): Compute the Fock matrix. 3. O(Nbas 3) + O(I3): DIIS extrapolation. 4. O(Nbas 3): Solve the generalized eigenvalue problem. 5. O(Nbas 3): Compute the density matrix. For hybrid functionals like B3LYP or ωB97X, one has to add the cost of the exact exchange term with density fitting O(Nbas 3Naux). For a complete SCF calculation, we iterate these steps until convergence which typically requires I 15 iterations. In practice, computing the electron density features dominates the runtime for semi-local functionals. For hybrid functionals, the cost of the exact exchange term dominates the runtime. N RUNTIME COMPARISON Here, we compare the runtime of a DFT calculation with different XC functionals, namely LDA: a local XC functional, SCAN: a m GGA XC functional, B3LYP: a hybrid XC functional, Published as a conference paper at ICLR 2025 2 6 10 14 18 22 26 30 34 36 CNH4 Runtime (s) ωB97X B3LYP EG-XC Dick2021 SCAN LDA Figure 7: Runtime scaling of KS-DFT methods with different XC functionals. Table 9: Comparison between Dick & Fernandez-Serra (2021) and Nagai et al. (2020) on 3BPA. Test set Dick Nagai 300K 0.96 0.91 600K 1.36 1.24 1200K 2.27 2.09 β = 120 0.75 0.85 β = 150 0.61 0.68 β = 180 0.56 0.66 ωB97X: a range-separated hybrid XC functional, Dick & Fernandez-Serra (2021): a machine-learned m GGA XC functional, As test system, we use increasing cumulene chains CNH4 with N {2, 4, 6, . . . , 38}. We use the 6-31G(d) basis set. We run 15 SCF cycles for each DFT calculation and repeat the calculations 10 times from which we report the minimum to account for other processes running on the same machine. All calculations were performed on a single NVIDIA A100 GPU with our JAX implementation. For the hybrid functionals, we use density fitting to fit the exact exchange term into GPU memory. The runtime scaling for each method can be found in Figure 7. While DFT calculations are approximately 2 times slower than Dick & Fernandez-Serra (2021) on small structures, at the largest tested system size of C38H4, 232 electrons, EG-XC is only 7 % slower. Compared to hybrid functionals, we find B3LYP and ωB97X being 43 % and 96 % slower than EG-XC, respectively. This highlights the efficiency of EG-XC s non-local interactions in large systems. O PHYSICAL CONSTRAINTS IN MGGA FUNCTIONALS To investigate the importance of physical constraints on the m GGA, we train Dick & Fernandez Serra (2021) and Nagai et al. (2020) on the 3BPA dataset. While Dick & Fernandez-Serra (2021) implements many physical constraints, none of these are present in Nagai et al. (2020). The relative MAE across all test sets is shown in Table 9. While Nagai et al. (2020) yields lower errors on the MD trajectories at higher temperature, Dick & Fernandez-Serra (2021) offers lower errors on the potential energy surfaces. These inconclusive results suggest that physical constraints do not necessarily improve the accuracy of XC functionals, even in extrapolation tasks.