# Neural Pfaffians: Solving Many Many-Electron Schrödinger Equations

Nicholas Gao, Stephan Günnemann
{n.gao,s.guennemann}@tum.de
Department of Computer Science & Munich Data Science Institute, Technical University of Munich
38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Neural wave functions have accomplished unprecedented accuracies in approximating the ground state of many-electron systems, though at a high computational cost. Recent works proposed amortizing the cost by learning generalized wave functions across different structures and compounds instead of solving each problem independently. Enforcing the permutation antisymmetry of electrons in such generalized neural wave functions remained challenging, as existing methods require discrete orbital selection via non-learnable, hand-crafted algorithms. This work tackles the problem by defining overparametrized, fully learnable neural wave functions suitable for generalization across molecules. We achieve this by relying on Pfaffians rather than Slater determinants. The Pfaffian allows us to enforce the antisymmetry on arbitrary electronic systems without any constraint on electronic spin configurations or molecular structure. Our empirical evaluation finds that a single neural Pfaffian calculates the ground state and ionization energies with chemical accuracy across various systems. On the TinyMol dataset, we outperform the gold-standard CCSD(T) CBS reference energies by 1.9 mEh and reduce energy errors compared to previous generalized neural wave functions by up to an order of magnitude.

1 Introduction

Solving the electronic Schrödinger equation is at the heart of computational chemistry and drug discovery. Its solution provides a molecule's or material's electronic structure and energy (Zhang et al., 2023). While the exact solution is infeasible, neural networks have recently shown unprecedentedly accurate approximations (Hermann et al., 2023). These neural networks approximate the system's ground-state wave function $\Psi : \mathbb{R}^{N_e \times 3} \to \mathbb{R}$, the lowest energy state, by minimizing the energy $\langle\Psi|\hat{H}|\Psi\rangle$, where $\hat{H}$ is the Hamiltonian operator, a mathematical description of the system. While such neural wave functions are highly accurate, training has proven computationally intensive. Gao & Günnemann (2022) have shown that training a generalized neural wave function on a large class of systems amortizes the cost. However, their approach is limited to different geometric arrangements of the same molecule. Subsequent works eliminated this limitation by introducing hand-crafted algorithms (Gao & Günnemann, 2023a) or heavily relying on classical Hartree-Fock calculations (Scherbela et al., 2023). Both impose strict, non-learnable mathematical constraints and prior assumptions that may not always hold, limiting their generalization and accuracy. Hand-crafted algorithms only work for a limited set of molecules, in particular organic molecules near equilibrium, while the reliance on Hartree-Fock empirically results in degraded accuracies. In this work, we propose the Neural Pfaffian (NeurPf) to overcome these limitations. As suggested by its name, NeurPf uses Pfaffians to define a superset of the previously used Slater determinants to enforce the fermionic antisymmetry. The Pfaffian lifts the constraint on the number of molecular orbitals from Slater determinants (Szabo & Ostlund, 2012), enabling overparametrized wave functions
with simpler and more accurate generalization. Compared to Globe (Gao & Günnemann, 2023a), the absence of hand-crafted algorithms enables the modeling of non-equilibrium, ionized, or excited systems. By being fully learnable without fixed Hartree-Fock calculations like TAO (Scherbela et al., 2024), NeurPf achieves significantly lower variational energies. Our empirical results show that NeurPf can learn the ground-state energies, ionization potentials, and electron affinities of all second-row elements with a single wave function. Further, we demonstrate that NeurPf's accuracy surpasses Globe on the challenging nitrogen dimer with seven times fewer parameters while not suffering from performance degradations when adding structures to the training set. On the TinyMol dataset, NeurPf surpasses the highly accurate reference CCSD(T) CBS energies on the small structures by 1.9 mEh and reduces errors compared to TAO by factors of 10 and 6 on the small and large structures, respectively.

2 Quantum chemistry

Quantum chemistry aims to solve the time-independent Schrödinger equation (Foulkes et al., 2001)
$$\hat{H}\,|\Psi\rangle = E\,|\Psi\rangle\tag{1}$$
where $\Psi : \mathbb{R}^{N^\uparrow\times3}\times\mathbb{R}^{N^\downarrow\times3}\to\mathbb{R}$ is the electronic wave function for $N^\uparrow$ spin-up and $N^\downarrow$ spin-down electrons, $\hat{H}$ is the Hamiltonian operator, and $E$ is the system's energy. To ease notation, if not necessary, we omit spins in $\Psi$ and treat it as $\Psi : \mathbb{R}^{N_e\times3}\to\mathbb{R}$ where $N_e = N^\uparrow + N^\downarrow$. The Hamiltonian $\hat{H}$ for molecular systems, which we are concerned with in this work, is given by
$$\hat{H} = -\frac{1}{2}\sum_{i=1}^{N_e}\nabla_{\vec r_i}^2 + \sum_{i>j}\frac{1}{\lVert\vec r_i-\vec r_j\rVert} - \sum_{i,m}\frac{Z_m}{\lVert\vec r_i-\vec R_m\rVert} + \sum_{m>n}\frac{Z_m Z_n}{\lVert\vec R_m-\vec R_n\rVert}\tag{2}$$
with $\vec r_i\in\mathbb{R}^3$ being the $i$th electron's position, and $\vec R_m\in\mathbb{R}^3$, $Z_m\in\mathbb{N}^+$ being the $m$th nucleus' position and charge. The wave function $\Psi$ describes the behavior of electrons in the system defined by the Hamiltonian $\hat{H}$. As the square of the wave function $\Psi^2$ is proportional to the probability density $p(\vec r)\propto\Psi^2(\vec r)$ of finding the electrons at positions $\vec r\in\mathbb{R}^{N_e\times3}$, its integral must be finite:
$$\int\Psi(\vec r)^2\,d\vec r < \infty.\tag{3}$$
Further, as electrons are indistinguishable half-spin fermionic particles, the wave function must be antisymmetric under any same-spin electron permutation $\tau$:
$$\Psi\big(\tau^\uparrow(\vec r^\uparrow), \tau^\downarrow(\vec r^\downarrow)\big) = \operatorname{sgn}(\tau^\uparrow)\operatorname{sgn}(\tau^\downarrow)\,\Psi(\vec r).\tag{4}$$
To enforce this constraint, the wave function is typically defined as a so-called Slater determinant of $N^\uparrow + N^\downarrow$ integrable so-called orbital functions $\phi_i:\mathbb{R}^3\to\mathbb{R}$:
$$\Psi_{\text{Slater}}(\vec r) = \det\!\big[\phi^\uparrow_j(\vec r^\uparrow_i)\big]\,\det\!\big[\phi^\downarrow_j(\vec r^\downarrow_i)\big] = \det\Phi^\uparrow(\vec r^\uparrow)\,\det\Phi^\downarrow(\vec r^\downarrow).\tag{5}$$
Note that for the determinant to exist, one needs exactly $N^\uparrow$ spin-up and $N^\downarrow$ spin-down orbitals $\phi^\uparrow_j$ and $\phi^\downarrow_j$. In linear algebra, Eq. (1) is an eigenvalue problem, where we look for the eigenfunction $\Psi_0$ with the lowest eigenvalue $E_0$. In Variational Monte Carlo (VMC), this is solved by applying the variational principle, which states that the energy of any trial wave function $\Psi$ upper bounds $E_0$:
$$E_0 \leq \frac{\langle\Psi|\hat H|\Psi\rangle}{\langle\Psi|\Psi\rangle} = \frac{\int\Psi(\vec r)\hat H\Psi(\vec r)\,d\vec r}{\int\Psi^2(\vec r)\,d\vec r}.\tag{6}$$
By plugging in the probability distribution from Eq. (3), we can rewrite Eq. (6) as
$$E_0 \leq \mathbb{E}_{p(\vec r)}\!\left[\Psi^{-1}(\vec r)\hat H\Psi(\vec r)\right] = \mathbb{E}_{p(\vec r)}\!\left[E_L(\vec r)\right],\tag{7}$$
with $E_L(\vec r) = \Psi(\vec r)^{-1}\hat H\Psi(\vec r)$ being the so-called local energy. The right-hand side of Eq. (7) is known as the variational energy. As Eq. (7) does not require $\Psi$ to be an analytic function, we can approximate the energy of any valid wave function $\Psi$ with samples drawn from $p(\vec r)$. If we pick a parametrized family of wave functions $\Psi_\theta$, we can optimize the parameters $\theta$ to minimize the VMC energy by following the gradient of the variational energy
$$\nabla_\theta E = \mathbb{E}_{p(\vec r)}\!\left[\big(E_L(\vec r) - \mathbb{E}_{p(\vec r)}[E_L(\vec r)]\big)\,\nabla_\theta\log|\Psi_\theta(\vec r)|\right],\tag{8}$$
where we approximate all expectations by Monte Carlo sampling (Ceperley et al., 1977).
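To illustrate how the estimator in Eq. (8) is typically realized in practice, the following is a minimal JAX sketch. The function and argument names (`log_psi`, `electrons`, `local_energies`) are placeholders for illustration, not the paper's actual API.

```python
import jax
import jax.numpy as jnp

def vmc_gradient(log_psi, params, electrons, local_energies):
    """Monte Carlo estimate of the energy gradient in Eq. (8).

    log_psi(params, r) returns log|Psi_theta(r)| for a single electron
    configuration r of shape (N_e, 3); `electrons` holds samples drawn from
    p(r) ~ Psi^2, and `local_energies` their local energies E_L(r).
    """
    weights = local_energies - jnp.mean(local_energies)  # E_L - E_p[E_L]

    def surrogate(p):
        # Differentiating this mean w.r.t. the parameters reproduces Eq. (8),
        # since the centered local energies act as constant per-sample weights.
        log_amps = jax.vmap(lambda r: log_psi(p, r))(electrons)
        return jnp.mean(weights * log_amps)

    return jax.grad(surrogate)(params)
```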
Neural wave functions typically keep the functional form of Eq. (5) but replace the orbitals $\phi_i$ with learned many-electron orbitals $\phi^{\text{NN}}_i : \mathbb{R}^3\times\mathbb{R}^{N_e\times3}\to\mathbb{R}$ (Hermann et al., 2023). These many-electron orbitals $\phi^{\text{NN}}_i$ are implemented as different readouts of the same permutation-equivariant neural network. Multiplying each orbital by an envelope function $\chi_i:\mathbb{R}^3\to\mathbb{R}$ that decays exponentially to zero at large distances enforces the finite integral requirement in Eq. (3).

Generalized wave functions solve the more general problem where the nucleus positions $\vec R$ and charges $\vec Z$ are not fixed. Since the Hamiltonian $\hat H_{\vec R,\vec Z}$ depends on the molecular structure $(\vec R,\vec Z)$, so does the corresponding ground state wave function $\Psi_{\vec R,\vec Z}$. Note that we still work in the Born-Oppenheimer approximation, i.e., we treat the nuclei as classical point charges (Zhang et al., 2023). Given a dataset of molecular structures $\mathcal{D}=\{(\vec R_1,\vec Z_1),\dots\}$, the total energy
$$\sum_{(\vec R,\vec Z)\in\mathcal{D}}\frac{\langle\Psi_{\vec R,\vec Z}|\hat H_{\vec R,\vec Z}|\Psi_{\vec R,\vec Z}\rangle}{\langle\Psi_{\vec R,\vec Z}|\Psi_{\vec R,\vec Z}\rangle}$$
is minimized to approximate the ground state for each structure. Typically, the dependence on $\vec R,\vec Z$ is implemented by using a meta network that takes $\vec R,\vec Z$ as inputs and outputs the parameters of the electronic wave function (Gao & Günnemann, 2022).

3 Related work

While attempts to enforce the fermionic antisymmetry in neural wave functions in fewer than $O(N_e^3)$ operations promise faster runtime than Slater determinants, the accuracy of these methods is limited (Han et al., 2019; Acevedo et al., 2020; Richter-Powell et al., 2023). Pfau et al. (2020) and Hermann et al. (2020) established Slater determinants for neural wave functions by demonstrating chemical accuracy on small molecules. Note that Eq. (5) may also be written via a block-diagonal matrix, i.e., $\Psi(\vec r) = \det\big(\operatorname{diag}(\Phi^\uparrow, \Phi^\downarrow)\big)$. Spencer et al. (2020)'s implementation further increased accuracies by parametrizing the off-diagonal blocks that were implicitly set to 0 before, with additional orbitals $\bar\Phi$:
$$\Psi_{\text{Slater}}(\vec r) = \det\big(\hat\Phi(\vec r)\big) = \det\begin{pmatrix}\Phi^\uparrow(\vec r^\uparrow) & \bar\Phi^\uparrow(\vec r^\uparrow)\\ \bar\Phi^\downarrow(\vec r^\downarrow) & \Phi^\downarrow(\vec r^\downarrow)\end{pmatrix}.\tag{9}$$
Several works confirmed the improved empirical accuracy of this approach (Gerard et al., 2022; Lin et al., 2021; Ren et al., 2023; Gao & Günnemann, 2023b, 2024). While later works refined the architecture to increase accuracy (von Glehn et al., 2023; Wilson et al., 2021, 2023), the use of Slater determinants mostly remained a constant, with two notable exceptions. Firstly, Lou et al. (2023) use AGP wave functions (Casula & Sorella, 2003; Casula et al., 2004) to formulate the wave function as $\Psi(\vec r) = \det\big(\Phi^\uparrow(\vec r^\uparrow)\,\Phi^\downarrow(\vec r^\downarrow)^T\big)$. This avoids picking exactly $N^\uparrow/N^\downarrow$ orbitals, as $\Phi^\uparrow$ and $\Phi^\downarrow$ may be non-square, but fails to generalize Eq. (9); we empirically verify the impact of this limitation in App. I. Secondly, Kim et al. (2023) introduced the combination of neural networks and Pfaffians and demonstrated its performance on the ultra-cold Fermi gas. Though universal in theory, their parametrization admits no trivial adaptation to molecular systems. In classical quantum chemistry, Bajdich et al. (2006, 2008) reported promising early results with Pfaffians in single-structure calculations for small molecules. In this work, we generalize Eq. (9) to Pfaffian wave functions that permit pretraining with Hartree-Fock calculations and generalization across molecules.
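For reference, the block structure of the dense Slater determinant in Eq. (9) can be assembled as in the following sketch. The block names and shapes are illustrative assumptions, not the notation of any specific implementation.

```python
import jax.numpy as jnp

def full_slater_determinant(phi_uu, phi_ud, phi_du, phi_dd):
    """Dense Slater determinant of Eq. (9) from four orbital blocks.

    phi_uu: (N_up, N_up) and phi_ud: (N_up, N_dn) orbitals evaluated at the
    spin-up electrons; phi_du: (N_dn, N_up) and phi_dd: (N_dn, N_dn) at the
    spin-down electrons.
    """
    matrix = jnp.block([[phi_uu, phi_ud],
                        [phi_du, phi_dd]])  # (N_e, N_e) orbital matrix
    return jnp.linalg.det(matrix)
```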
Generalized wave functions. Scherbela et al. (2022) started this research with a weight-sharing scheme between wave functions. These still had to be reoptimized for each structure. Later, Gao & Günnemann (2022, 2023b) proposed PESNet, a generalized wave function for energy surfaces allowing joint training without reoptimization. Subsequent works extended PESNet to different compounds, where the main challenge is parametrizing exactly $N^\uparrow + N^\downarrow$ orbitals such that the orbital matrix in Eq. (9) stays square. The problem of finding these orbitals was formulated as a discrete orbital selection problem. Gao & Günnemann (2023a)'s hand-crafted algorithm accomplishes this by selecting orbitals via a greedy nearest neighbor search. In contrast, Scherbela et al. (2024, 2023) use the lowest eigenvalues of the Fock matrix as selection criteria. Both introduce non-learnable constraints, limiting generalization or sacrificing accuracy. NeurPf avoids the selection problem by introducing an overparametrization when enforcing the exchange antisymmetry.

4 Neural Pfaffian

Previous generalized wave functions build on Slater wave functions and attempt to adjust the orbitals $\phi_i$ to the molecule. Slater determinants were chosen due to their previously demonstrated high accuracy. However, they require exactly $N^\uparrow + N^\downarrow$ orbitals. While the nuclei allow inferring the total number of electrons $N_e$ of any stable, singlet state system, the spin distribution into $N^\uparrow$ and $N^\downarrow$ orbitals per atom is not readily available. Previous works implement this via a discrete selection of orbitals via non-learnable prior assumptions and constraints on the wave function; see Sec. 3. Here, we present the Neural Pfaffian (NeurPf), a superset of Slater wave functions that preserves accuracy while relaxing the orbital number constraint. By not enforcing an exact number of orbitals, NeurPf is overparametrized with $N_o \geq \max\{N^\uparrow, N^\downarrow\}$ orbitals, avoiding discrete selections and making it a natural choice for generalized wave functions. Importantly, NeurPf can be pretrained with Hartree-Fock, which accounts for > 99% of the total energy (Szabo & Ostlund, 2012). We introduce NeurPf in four steps: (1) We introduce the Pfaffian and use it to define a superset of Slater wave functions. (2) We present memory-efficient envelopes that additionally accelerate convergence. (3) We introduce a new pretraining scheme for matching Pfaffian and Slater wave functions. (4) We discuss combining our developments to build a generalized wave function.

4.1 Pfaffian wave function

The Pfaffian of a skew-symmetric $2n\times 2n$ matrix $A$, i.e., $A = -A^T$, is defined as
$$\operatorname{Pf}(A) = \frac{1}{2^n n!}\sum_{\tau\in S_{2n}}\operatorname{sgn}(\tau)\prod_{i=1}^{n}A_{\tau(2i-1),\tau(2i)}\tag{10}$$
where $S_{2n}$ is the symmetric group of $2n$ elements. One may consider it a square root of the determinant of $A$ since $\operatorname{Pf}(A)^2 = \det(A)$. An important property of the Pfaffian is $\operatorname{Pf}(BAB^T) = \det(B)\operatorname{Pf}(A)$ for any invertible matrix $B$ and skew-symmetric matrix $A$. In the context of neural wave functions, this means that if $A$ is a function of the electron positions $\vec r$ that is permutation equivariant along both dimensions, $A(\tau(\vec r)) = P_\tau A(\vec r)P_\tau^T$, the Pfaffian of $A$ is a valid wave function that fulfills the antisymmetry requirement from Eq. (4):
$$\Psi(\tau(\vec r)) = \operatorname{Pf}\big(A(\tau(\vec r))\big) = \operatorname{Pf}\big(P_\tau A(\vec r)P_\tau^T\big) = \det(P_\tau)\operatorname{Pf}\big(A(\vec r)\big) = \operatorname{sgn}(\tau)\Psi(\vec r).\tag{11}$$
To compute the Pfaffian without evaluating the $(2n)!$ terms in Eq. (10), we implement the Pfaffian via a tridiagonalization with the Householder transformation as in Wimmer (2012). There are various ways to construct $A$ (Bajdich et al., 2006, 2008; Kim et al., 2023). Here, we introduce a superset of Slater wave functions, enabling high accuracy on molecular systems.
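As a reference point for Eq. (10), the sketch below computes the Pfaffian by expansion along the first row. It has factorial cost and is only meant for small matrices and sanity checks; the paper uses the Householder tridiagonalization of Wimmer (2012) instead. Names are illustrative.

```python
import jax.numpy as jnp

def pfaffian(A):
    """Pfaffian of a skew-symmetric matrix via expansion along the first row.

    Didactic reference implementation; satisfies pfaffian(A) ** 2 == det(A)
    and Pf(B A B^T) == det(B) Pf(A) for small test matrices.
    """
    n = A.shape[0]
    if n % 2 == 1:
        return jnp.zeros(())  # odd-dimensional Pfaffians vanish
    if n == 2:
        return A[0, 1]
    total = jnp.zeros(())
    for j in range(1, n):
        # Minor with rows/columns 0 and j removed, recursed upon.
        keep = jnp.array([k for k in range(n) if k not in (0, j)], dtype=int)
        minor = A[keep][:, keep]
        total = total + (-1.0) ** (j - 1) * A[0, j] * pfaffian(minor)
    return total
```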
If $A$ is a skew-symmetric matrix, so is $BAB^T$ for any arbitrary matrix $B$. Thus, we can construct $\Psi_{\text{Pfaffian}}$ as
$$\Psi_{\text{Pfaffian}}(\vec r) = \frac{1}{\operatorname{Pf}(A_{\text{Pf}})}\operatorname{Pf}\Big(\hat\Phi_{\text{Pf}}(\vec r)\,A_{\text{Pf}}\,\hat\Phi_{\text{Pf}}(\vec r)^T\Big)\tag{12}$$
where $A_{\text{Pf}}\in\mathbb{R}^{N_o\times N_o}$ is a learnable skew-symmetric matrix and $\hat\Phi_{\text{Pf}}:\mathbb{R}^{N_e\times3}\to\mathbb{R}^{N_e\times N_o}$ is a permutation equivariant function like in Eq. (9). This construction alleviates the need for having exactly $N^\uparrow/N^\downarrow$ orbitals as in Slater determinants. We may now overparametrize the wave function with $N_o \geq \max\{N^\uparrow, N^\downarrow\}$ orbitals, allowing for a more flexible and simpler implementation without needing discrete orbital selection. By choosing $\hat\Phi_{\text{Pf}} = \hat\Phi$, it is straightforward to see that Eq. (12) is a superset of the Slater determinant wave function in Eq. (9). Note that, like in Eq. (9), we parametrize two sets of orbital functions $\Phi_{\text{Pf}}$ and $\bar\Phi_{\text{Pf}}$ and change their order for spin-down electrons to not enforce the exchange antisymmetry between different-spin electrons. As the normalizer $\operatorname{Pf}(A_{\text{Pf}})$ is constant, we drop it going forward. As is common in quantum chemistry (Szabo & Ostlund, 2012; Hermann et al., 2020), we use linear combinations of wave functions to increase expressiveness:
$$\Psi_{\text{Pfaffian}}(\vec r) = \sum_{k=1}^{N_k}c_k\,\Psi_{\text{Pfaffian},k}(\vec r).\tag{13}$$
We visually compare the schematic of the Slater determinant and Pfaffian wave functions in Fig. 1. In App. A, we discuss how to handle odd numbers of electrons such that $\hat\Phi_{\text{Pf}}A_{\text{Pf}}\hat\Phi_{\text{Pf}}^T$ has even dimensions.

Fig. 1: Schematic of the Slater determinant (1a) and our NeurPf (1b). Where the Slater formulation requires exactly $N_e$ orbital functions, the Pfaffian formulation works for any number $N_o \geq \max\{N^\uparrow, N^\downarrow\}$ of orbital functions, indicated by the rectangular orbital blocks.

Like previous work (Pfau et al., 2020), we parametrize the orbital functions $\phi_i$ as a product of a permutation equivariant neural network $h:\mathbb{R}^3\times\mathbb{R}^{N_e\times3}\to\mathbb{R}^{N_f}$ and an envelope function $\chi:\mathbb{R}^3\to\mathbb{R}$:
$$\phi_{ki}(\vec r_j\mid\vec r) = \chi_{ki}(\vec r_j)\,h(\vec r_j\mid\vec r)^T w_{ki}\,\eta^{N^\uparrow-N^\downarrow}_{ki}\tag{14}$$
with $w_{ki}\in\mathbb{R}^{N_f}$ being a learnable weight vector, and $\eta^{N^\uparrow-N^\downarrow}_{ki}\in\mathbb{R}$ being a scalar depending on the spin state of the system, i.e., the difference between the number of up and down electrons. The envelope function $\chi$ ensures that the integral of the squared wave function is finite. For $h$, we use Moon from Gao & Günnemann (2023a) thanks to its size consistency.
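A minimal sketch of assembling the wave function of Eq. (12) from the orbital matrix is given below, assuming some Pfaffian routine such as the reference implementation above. Function and argument names, as well as the explicit antisymmetrization of the raw matrix, are illustrative choices rather than the paper's exact code.

```python
import jax.numpy as jnp

def pfaffian_wavefunction(phi_hat, A_raw, pfaffian_fn):
    """Single-Pfaffian wave function of Eq. (12), up to the constant normalizer.

    phi_hat: (N_e, N_o) equivariant orbital matrix with N_o >= max(N_up, N_dn),
    A_raw:   (N_o, N_o) unconstrained learnable matrix,
    pfaffian_fn: any Pfaffian routine.
    """
    A_pf = 0.5 * (A_raw - A_raw.T)      # enforce skew-symmetry of A_Pf
    pair = phi_hat @ A_pf @ phi_hat.T   # (N_e, N_e) electron-pair matrix
    # For odd N_e, the pair matrix is augmented as in App. A before this call.
    return pfaffian_fn(pair)
```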
4.2 Memory-efficient envelopes

To satisfy the finite integral requirement on the square of $\Psi$ in Eq. (3), the orbitals $\phi$ are multiplied by an envelope function $\chi:\mathbb{R}^3\to\mathbb{R}$ that exponentially decays to zero at large distances. We do not split spins here and work with $N_e = N^\uparrow + N^\downarrow$ to simplify the discussion, but, in practice, we would split the envelopes into two sets, one for $\Phi_{\text{Pf}}$ and one for $\bar\Phi_{\text{Pf}}$. The envelope function is typically a sum of exponentials centered on the nuclei (Spencer et al., 2020). In Einstein's summation notation, the envelope function can be written as
$$\chi_{ki}(\vec r_{bj}) = \underbrace{\pi_{kmi}}_{N_k\times N_n\times N_o}\,\underbrace{\exp\big(-\sigma_{kmi}\lVert\vec r_{bj}-\vec R_m\rVert\big)}_{N_b\times N_k\times N_n\times N_e\times N_o}\tag{15}$$
where $N_b$ denotes the batch size. Empirically, we found the tensor on the right side to contain many redundant entries. Further, due to the nonlinearity of the exponential function, one cannot implement the envelope as a simple matrix contraction but has to materialize the full five-dimensional tensor. NeurPf amplifies this problem as $N_o$ may exceed $N_e$, whereas Slater determinants constrain $N_o = N_e$. We use a single set of exponentials per nucleus instead of having one for each combination of orbital and nucleus. This reduces the number of envelopes per electron from $N_k\cdot N_n\cdot N_o$ to $N_k\cdot N_{\text{env}}$, where $N_{\text{env}} = N_n\cdot N_{\text{env/nuc}}$ is the number of envelope functions. In general, we pick $N_{\text{env/nuc}}$ such that $N_{\text{env}} \approx N_o$. These atomic envelopes are linearly recombined into molecular envelopes, effectively enlarging $\pi$ to a $N_k\times N_o\times N_{\text{env}}$ tensor. Thanks to these rearrangements, we avoid constructing a five-dimensional tensor. Instead, we define the envelopes as
$$\chi_{ki}(\vec r_{bj}) = \underbrace{\pi_{kni}}_{N_k\times N_{\text{env}}\times N_o}\,\underbrace{\exp\big(-\sigma_{kn}\lVert\vec r_{bj}-\vec R_n\rVert\big)}_{N_b\times N_k\times N_{\text{env}}\times N_e}\tag{16}$$
Concurrently, Pfau et al. (2024) presented similar bottleneck envelopes. However, we found ours to converge faster and not yield numerical instabilities. We discuss this further in App. B and I.

4.3 Pretraining Pfaffian wave functions

Pretraining is essential in training neural wave functions and has frequently been observed to critically affect final energies (Gao & Günnemann, 2023a; von Glehn et al., 2023; Gerard et al., 2022). The pretraining aims to find orbital functions close to the ground state to stabilize the optimization. Traditionally, this is done by matching the orbitals of the neural wave function to the orbitals of a baseline wave function, typically a Hartree-Fock wave function $\Psi_{\text{HF}} = \det(\Phi_{\text{HF}})$, by solving
$$\min_\theta\,\lVert\Phi_\theta - \Phi_{\text{HF}}\rVert_2^2\tag{17}$$
for the neural network parameters $\theta$ (Pfau et al., 2020). Since our Pfaffian has $N_o$ orbitals while Hartree-Fock has $N_e$, we cannot directly apply this to our Pfaffian wave function. Further, as we predict orbitals per nucleus, our arbitrary orbital order may not align with Hartree-Fock. We propose two alternative pretraining schemes for neural Pfaffian wave functions: one based on matching single-electron orbitals and one based on matching geminals, effectively two-electron orbitals. To match the single-electron orbitals directly, we need to expand the Hartree-Fock orbitals $\Phi_{\text{HF}}$ to $N_o$ orbitals. We construct $\tilde\Phi_{\text{HF}}$ by padding the extra $N_o - N_e$ orbitals with zeros. It can easily be verified that the wave function $\Psi_{\text{HF-Pf}} = \frac{1}{\operatorname{Pf}(A_{\text{HF}})}\operatorname{Pf}\big(\Phi_{\text{HF}}A_{\text{HF}}\Phi_{\text{HF}}^T\big)$ is equivalent to the original Hartree-Fock wave function, i.e., $\Psi_{\text{HF-Pf}} = \Psi_{\text{HF}} = \det(\Phi_{\text{HF}})$, for any invertible skew-symmetric $A_{\text{HF}}$. Further, note that the multiplication of $\tilde\Phi_{\text{HF}}$ with any matrix $T\in SO(N_o)$ from the special orthogonal group does not change $\Psi_{\text{HF-Pf}}$. Thus, it suffices to match the single-electron orbitals of $\hat\Phi_{\text{Pf}}$ and $\tilde\Phi_{\text{HF}}$ up to a rotation $T\in SO(N_o)$, yielding the following optimization problem:
$$\min_\theta\,\min_{T\in SO(N_o)}\,\lVert\hat\Phi_{\text{Pf}} - \tilde\Phi_{\text{HF}}T\rVert_2^2.\tag{18}$$
We solve this alternatingly for $T$ and $\theta$. To match the geminals $\hat\Phi_{\text{Pf}}A_{\text{Pf}}\hat\Phi_{\text{Pf}}^T$ and $\Phi_{\text{HF}}A_{\text{HF}}\Phi_{\text{HF}}^T$, we have to account for the fact that the choice of $A_{\text{HF}}$ is arbitrary as long as it is skew-symmetric and invertible. Again, we solve this optimization problem alternatingly by solving for $A_{\text{HF}}\in\mathcal{S} = \{A\in SO(N_e) : A = -A^T\}$ and $\theta$:
$$\min_\theta\,\min_{A_{\text{HF}}\in\mathcal{S}}\,\lVert\hat\Phi_{\text{Pf}}A_{\text{Pf}}\hat\Phi_{\text{Pf}}^T - \Phi_{\text{HF}}A_{\text{HF}}\Phi_{\text{HF}}^T\rVert_2^2.\tag{19}$$
While both formulations share the same minimizer, combining both yields the most stable results. We hypothesize that this is because the single-electron orbitals are more stable than the geminals and thus provide a better starting point for the optimization. In contrast, the latter provides a closer formulation of the neural network orbitals. Thus, we pretrain our neural Pfaffian wave functions by solving the optimization problem
$$\min_\theta\left[\alpha\min_{T\in SO(N_o)}\lVert\hat\Phi_{\text{Pf}} - \tilde\Phi_{\text{HF}}T\rVert_2^2 + \beta\min_{A_{\text{HF}}\in\mathcal{S}}\lVert\hat\Phi_{\text{Pf}}A_{\text{Pf}}\hat\Phi_{\text{Pf}}^T - \Phi_{\text{HF}}A_{\text{HF}}\Phi_{\text{HF}}^T\rVert_2^2\right]\tag{20}$$
with weights $\alpha,\beta\in[0,1]$. To optimize over the special orthogonal group $SO(N_o)$, we use the Cayley transform (Gallier, 2013). App. C further details the procedure.
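A minimal sketch of the combined pretraining objective in Eq. (20), for a single structure and with $T$ and $A_{\text{HF}}$ treated as given, could look as follows. Argument names and shapes are assumptions for illustration; the defaults for `alpha` and `beta` follow Tab. 2.

```python
import jax.numpy as jnp

def pretraining_loss(phi_pf, A_pf, phi_hf, T, A_hf, alpha=1.0, beta=1e-4):
    """Combined orbital and geminal matching objective of Eq. (20), sketched.

    phi_pf: (N_e, N_o) NeurPf orbitals,      A_pf: (N_o, N_o) antisymmetrizer,
    phi_hf: (N_e, N_e) Hartree-Fock orbitals,
    T:      (N_o, N_o) rotation in SO(N_o),  A_hf: (N_e, N_e) matrix in S.
    """
    n_e, n_o = phi_pf.shape
    phi_hf_pad = jnp.pad(phi_hf, ((0, 0), (0, n_o - n_e)))  # zero-pad to N_o orbitals
    orbital_term = jnp.sum((phi_pf - phi_hf_pad @ T) ** 2)  # Eq. (18)
    geminal_term = jnp.sum(
        (phi_pf @ A_pf @ phi_pf.T - phi_hf @ A_hf @ phi_hf.T) ** 2
    )                                                        # Eq. (19)
    return alpha * orbital_term + beta * geminal_term
```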
4.4 Generalizing over systems

We now focus on generalizing the construction of our Pfaffian wave function to different systems. We accomplish the generalization similarly to PESNet (Gao & Günnemann, 2022) by introducing a second neural network, the MetaGNN $\mathcal{M} : (\mathbb{R}^3\times\mathbb{N}^+)^{N_n}\to\Theta$, that acts upon the molecular structure, i.e., nuclei positions and charges, and parametrizes the electronic wave function $\Psi_{\text{Pfaffian}} : \mathbb{R}^{N_e\times3}\times\Theta\to\mathbb{R}$ for the system of interest. As architectures for the wave function and MetaGNN, we use the same architectures as in Gao & Günnemann (2023a), with the exception that we replace the Slater determinant with the Pfaffian as described in Sec. 4 and apply minor tweaks highlighted in App. D.4.

Fig. 2: Orbital parametrization per nucleus; markers indicate electrons and nuclei, respectively.

Pfaffian. To represent wave functions of different systems within a single NeurPf, we need to adapt the orbitals $\hat\Phi_{\text{Pf}}$ and the antisymmetrizer $A_{\text{Pf}}$ from Eq. (12) to the molecule. In doing so, we must ensure $N_o\geq\max\{N^\uparrow, N^\downarrow\}$. Otherwise, $\hat\Phi_{\text{Pf}}A_{\text{Pf}}\hat\Phi_{\text{Pf}}^T$ is singular, and the wave function is zero. One may solve this by picking $N_o$ large enough that $N_o\geq\max\{N^\uparrow, N^\downarrow\}$ for all molecules in the dataset. However, this is computationally expensive, does not reuse known orbitals in the problem, and simply moves the problem to even larger systems. Instead, we grow the number of orbitals $N_o$ with the system size by defining $N_{\text{orb/nuc}}$ orbitals per nucleus, as depicted in Fig. 2. This allows us to transfer orbitals from smaller systems to larger systems. We only need to ensure that $N_{\text{orb/nuc}}$ is larger than half the maximum number of electrons in a period, e.g., $N_{\text{orb/nuc}}\geq1$ for the first period and $N_{\text{orb/nuc}}\geq5$ for the second period. The projection $W$ from Eq. (14) and the envelope decays $\sigma$ are parametrized by node embeddings, while the envelope weights $\pi$ and the antisymmetrizer $A_{\text{Pf}}$ are derived from edge embeddings. We predict a $N_{\text{orb/nuc}}\times N_f$ matrix per nucleus for $W$ and a $N_{\text{env/nuc}}$ vector per nucleus for $\sigma$. For the edge parameters $\pi$ and $A_{\text{Pf}}$, we predict a $N_{\text{env/nuc}}\times N_{\text{orb/nuc}}$ and a $N_{\text{orb/nuc}}\times N_{\text{orb/nuc}}$ matrix per edge, respectively. These are concatenated into the $N_{\text{env}}\times N_o$ and $N_o\times N_o$ matrices $\pi$ and $\hat A_{\text{Pf}}$. The latter is antisymmetrized to get $A_{\text{Pf}} = \frac{1}{2}(\hat A_{\text{Pf}} - \hat A_{\text{Pf}}^T)$. We parametrize the spin-dependent scalars $\eta$ as node outputs for a fixed number of spin configurations $N_s$. Because the change in spin configuration does not grow with system size, $N_s$ is fixed. We generate two sets of these parameters, one for $\Phi_{\text{Pf}}$ and one for $\bar\Phi_{\text{Pf}}$. App. D provides definitions for the wave function, the MetaGNN, and the parametrization.

Pretraining. Previous work like Gao & Günnemann (2023a) needed to canonicalize the Hartree-Fock solutions for different systems before pretraining to ensure that the orbitals fit the neural network. Alternatively, Scherbela et al. (2023) relied on traditional quantum chemistry methods like Foster & Boys (1960)'s localization to canonicalize their orbitals in conjunction with sign-equivariant neural networks. In contrast, we ensure that the transformed Hartree-Fock orbitals are similar across structures as we optimize $T\in SO(N_o)$ and $A_{\text{HF}}\in\mathcal{S}$ for each structure separately, which simultaneously also accounts for arbitrary rotations in the orbitals produced by Hartree-Fock.
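To make the per-nucleus parametrization of Sec. 4.4 concrete, the following sketch assembles the molecular projection and antisymmetrizer from per-nucleus and per-edge MetaGNN outputs. The shapes, names, and exact stitching order are assumptions for illustration.

```python
import jax.numpy as jnp

def assemble_molecular_params(W_per_nuc, A_per_edge):
    """Growing the orbital parameters with the molecule, as described in Sec. 4.4.

    W_per_nuc:  (N_n, N_orb_per_nuc, N_f) per-nucleus projection weights,
    A_per_edge: (N_n, N_n, N_orb_per_nuc, N_orb_per_nuc) per-edge blocks.
    """
    n_n, n_orb, n_f = W_per_nuc.shape
    W = W_per_nuc.reshape(n_n * n_orb, n_f)  # (N_o, N_f) with N_o = N_n * N_orb_per_nuc

    # Stitch per-edge blocks into one (N_o, N_o) matrix and antisymmetrize it.
    A_hat = A_per_edge.transpose(0, 2, 1, 3).reshape(n_n * n_orb, n_n * n_orb)
    A_pf = 0.5 * (A_hat - A_hat.T)
    return W, A_pf
```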
Limitations. While our Pfaffian-based generalized wave function significantly improves accuracy on organic chemistry, we leave the transfer to periodic systems for future work (Kosmala et al., 2023). Further, due to the lack of low-level hardware/software support for the Pfaffian and the increased number of orbitals $N_o\geq\max\{N^\uparrow,N^\downarrow\}$, our Pfaffian is slower than a comparably sized Slater determinant. While we solve the issue of enforcing the fermionic antisymmetry, our neural wave functions are still unaware of any symmetries of the wave function itself. These are challenging to describe and largely unknown, but their integration may improve generalization performance (Schütt et al., 2018). Finally, in classical single-structure calculations, NeurPf may not improve accuracies. App. P discusses the broader impact of our work.

5 Experiments

In the following, we evaluate NeurPf on several atomic and molecular systems by comparing it to Globe (Gao & Günnemann, 2023a) and TAO (Scherbela et al., 2024). Concretely, we investigate the following: (1) Second-row elements and their ionization potentials and electron affinities; Globe cannot compute these due to its restriction to singlet state systems. (2) The challenging nitrogen potential energy surface, where Globe significantly degraded performance when enlarging the training set with additional molecules. (3) The TinyMol dataset (Scherbela et al., 2024) to evaluate NeurPf's generalization capabilities across biochemical molecules. In interpreting the following results, one should mind the variational principle, i.e., lower energies are better for neural wave functions. Further, 1 kcal mol$^{-1}$ $\approx$ 1.6 mEh is the typical threshold for chemical accuracy. Like previous work, we optimize the neural wave function using the VMC framework from Sec. 2. We precondition the gradient with the Spring optimizer (Goldshlager et al., 2024). App. E details the setup further. App. F, I, and J show an experiment on extensivity and additional ablations.

Atomic systems and spin configurations. We evaluate NeurPf on second-row elements and their ionization potentials and electron affinities. These systems are particularly interesting as they represent a wide range of spin configurations. We cannot use Globe on such systems because they deviate from the singlet state assumption. Instead, we compare our results to the single-structure calculations from Pfau et al. (2020)'s FermiNet and the exact results from Chakravorty et al. (1993); Klopper et al. (2010). In App. G, we repeat this experiment for metals.

Fig. 3: Ground state, electron affinity, and ionization potential errors of second-row elements during training. A single NeurPf has been trained on all systems jointly, while the references (Pfau et al., 2020) were calculated separately for each system. Energies are averaged over the last 10% of steps.

Fig. 3 displays the ground state energy, electron affinity, and ionization potential errors of NeurPf during training compared to the reference energies from Pfau et al. (2020); Chakravorty et al. (1993); Klopper et al. (2010). It is apparent that NeurPf reaches chemical accuracy relative to the exact results while only training a single neural network for all systems. While separately optimized FermiNets may achieve lower errors, Pfau et al. (2020) trained 21 neural networks for 200k steps each, compared to a single NeurPf trained for 200k steps, i.e., 21 times fewer steps and samples. Whereas Gao & Günnemann (2023a) and Scherbela et al.
(2023) focus on singlet state systems or stable biochemical molecules, NeurPf demonstrates that a generalized wave function need not be restricted to such simple systems and can even generalize to a wide range of electronic configurations.

Fig. 4: Potential energy surface of nitrogen. Energies are relative to Le Roy et al. (2006).

Effect of uncorrelated data. Next, we evaluate NeurPf on the nitrogen potential energy surface, a traditionally challenging system due to its high electron correlation effects (Lyakh et al., 2012). This is particularly interesting as Gao & Günnemann (2023a) observed a significant accuracy degradation when reformulating their wave function to generalize over different systems. In particular, they found that training only on the nitrogen dimer leads to significantly lower errors than training with an ethene-augmented dataset, indicating an accuracy penalty in generalization. We replicate their setup and compare the performance of NeurPf trained on the nitrogen energy surface with and without additional ethene structures. Like Gao & Günnemann (2023a), the nitrogen structures are taken from Pfau et al. (2020) and the ethene structures from Scherbela et al. (2022). As additional references, we plot Gao & Günnemann (2022)'s PESNet and Fu et al. (2023)'s FermiNet results. Fig. 4 shows the error of the potential energy surface relative to the experimental results from Le Roy et al. (2006). NeurPf reduces the average error on the energy surface from Globe's 2.7 mEh to 2 mEh when training solely on nitrogen structures. When adding the ethene structures, Globe's error increases to 5.3 mEh while NeurPf's error stays constant at 2 mEh, a lower error than Globe achieves without the augmented dataset. These results indicate NeurPf's strong capabilities in approximating ground states while allowing for generalization across different systems without a significant loss in accuracy.

Fig. 5: Convergence of the mean energy difference to the CCSD(T) CBS reference on the TinyMol dataset from Scherbela et al. (2024) for the small and large molecules. The y-axis is linear below 1 and logarithmic above 1. Due to the variational principle, NeurPf is better than the reference CCSD(T) on the small molecules.

TinyMol dataset. Finally, we look at learning a generalized wave function over different molecules and structures. We use the TinyMol dataset (Scherbela et al., 2024), consisting of a small and a large dataset. The dataset includes gold-standard CCSD(T) CBS energies. The small set consists of 3 molecules with 2 heavy atoms, while the large set covers 4 molecules with 3 heavy atoms. For each molecule, 10 structures are provided. Here, we compare NeurPf against both Globe (+Moon) and TAO. All models are directly trained on the small and large test sets. Fig. 5 shows the mean energy difference to CCSD(T) at different stages of the training. We refer to App. K for a per-molecule error attribution. It is apparent that NeurPf yields lower errors than TAO and Globe after at least 500 steps. On the small structures, NeurPf even matches the CCSD(T) baseline after 16k steps and achieves 1.9 mEh lower energies after 32k steps. Since VMC methods are variational, i.e., lower energies are always better, NeurPf is more accurate than the CCSD(T) CBS reference.
Compared to TAO and Globe, NeurPf reports 5.9 mEh and 11.3 mEh lower energies, respectively. On the large structures, we observe a similar pattern: NeurPf has a 25 times smaller error than TAO during the early stages of training and reaches 21.1 mEh lower energies after 32k steps, a 6 times lower error relative to the CCSD(T) baseline. Note that since the CCSD(T) (CBS) energies are neither exact nor variational, the true error to the ground state is unknown. Still, we provide additional numbers for a NeurPf trained for 128k steps in App. K. There, we find NeurPf yielding 4.4 mEh lower energies on the large structures. These results show that a generalized wave function that relies on neither hand-crafted algorithms nor Hartree-Fock calculations can achieve high accuracy on various molecular structures without pretraining. For additional experiments, we refer the reader to App. L, where we first pretrain TAO and NeurPf on a separate training set and then finetune on the small and large test sets, and to App. M for a comparison of joint and separate optimization.

6 Conclusion

In this work, we established a new way of parametrizing neural network wave functions for generalization across molecules via overparametrization with Pfaffians. Compared to previous work, our Neural Pfaffian is more accurate, simpler to implement, fully learnable, and applicable to any molecular system. The wave function changes smoothly with the structure, avoiding the discrete orbital selection problem previously solved via hand-crafted algorithms or Hartree-Fock. Additionally, we introduced a memory-efficient implementation of the exponential envelopes, reducing memory requirements while accelerating convergence. Further, we presented a pretraining scheme for Pfaffians enabling initialization with Hartree-Fock, a crucial step for molecular systems. Our experimental evaluation demonstrated that our Neural Pfaffian can generalize across different ionizations of various systems, stay accurate when enlarging datasets, and set a new state of the art by outperforming previous neural wave functions and the reference CCSD(T) CBS on the TinyMol dataset. These developments open the door for new applications of neural wave functions, e.g., generating reference data for machine-learning force fields or density functional theory (Cheng et al., 2024; Gao et al., 2024).

Acknowledgments. We greatly thank Simon Geisler for valuable discussions. Further, we thank Valerie Engelmayer, Leo Schwinn, and Aman Saxena for their invaluable feedback on the manuscript. Funded by the Federal Ministry of Education and Research (BMBF) and the Free State of Bavaria under the Excellence Strategy of the Federal Government and the Länder.

References

Acevedo, A., Curry, M., Joshi, S. H., Leroux, B., and Malaya, N. Vandermonde Wave Function Ansatz for Improved Variational Monte Carlo. In 2020 IEEE/ACM Fourth Workshop on Deep Learning on Supercomputers (DLS), pp. 40-47, November 2020. doi: 10.1109/DLS51937.2020.00010.

Bajdich, M., Mitas, L., Drobný, G., Wagner, L. K., and Schmidt, K. E. Pfaffian Pairing Wave Functions in Electronic-Structure Quantum Monte Carlo Simulations. Physical Review Letters, 96(13):130201, April 2006. ISSN 0031-9007, 1079-7114. doi: 10.1103/PhysRevLett.96.130201.

Bajdich, M., Mitas, L., Wagner, L. K., and Schmidt, K. E. Pfaffian pairing and backflow wavefunctions for electronic structure quantum Monte Carlo methods. Physical Review B, 77(11):115112, March 2008. doi: 10.1103/PhysRevB.77.115112.
Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., Vander Plas, J., Wanderman-Milne, S., and Zhang, Q. JAX: Composable transformations of Python+Num Py programs, 2018. Casula, M. and Sorella, S. Geminal wavefunctions with Jastrow correlation: A first application to atoms. The Journal of Chemical Physics, 119(13):6500 6511, October 2003. ISSN 0021-9606, 1089-7690. doi: 10.1063/1.1604379. Casula, M., Attaccalite, C., and Sorella, S. Correlated geminal wave function for molecules: An efficient resonating valence bond approach. The Journal of Chemical Physics, 121(15):7110 7126, October 2004. ISSN 0021-9606. doi: 10.1063/1.1794632. Ceperley, D., Chester, G. V., and Kalos, M. H. Monte Carlo simulation of a many-fermion study. Physical Review B, 16(7):3081 3099, October 1977. doi: 10.1103/Phys Rev B.16.3081. Chakravorty, S. J., Gwaltney, S. R., Davidson, E. R., Parpia, F. A., and p Fischer, C. F. Ground-state correlation energies for atomic ions with 3 to 18 electrons. Physical Review A, 47(5):3649 3670, May 1993. doi: 10.1103/Phys Rev A.47.3649. Cheng, L., Szabó, P. B., Schätzle, Z., Kooi, D., Köhler, J., Giesbertz, K. J. H., Noé, F., Hermann, J., Gori-Giorgi, P., and Foster, A. Highly Accurate Real-space Electron Densities with Neural Networks, September 2024. Foster, J. M. and Boys, S. F. Canonical Configurational Interaction Procedure. Reviews of Modern Physics, 32(2):300 302, April 1960. ISSN 0034-6861. doi: 10.1103/Rev Mod Phys.32.300. Foulkes, W. M. C., Mitas, L., Needs, R. J., and Rajagopal, G. Quantum Monte Carlo simulations of solids. Reviews of Modern Physics, 73(1):33 83, January 2001. doi: 10.1103/Rev Mod Phys.73.33. Fu, W., Ren, W., and Chen, J. Variance extrapolation method for neural-network variational Monte Carlo, August 2023. Gallier, J. Remarks on the Cayley Representation of Orthogonal Matrices and on Perturbing the Diagonal of a Matrix to Make it Invertible, November 2013. Gao, N. and Günnemann, S. Ab-Initio Potential Energy Surfaces by Pairing GNNs with Neural Wave Functions. In International Conference on Learning Representations, April 2022. Gao, N. and Günnemann, S. Generalizing Neural Wave Functions. In International Conference on Machine Learning, February 2023a. doi: 10.48550/ar Xiv.2302.04168. Gao, N. and Günnemann, S. Sampling-free Inference for Ab-Initio Potential Energy Surface Networks. In The Eleventh International Conference on Learning Representations, February 2023b. Gao, N. and Günnemann, S. On Representing Electronic Wave Functions with Sign Equivariant Neural Networks. In ICLR 2024 Workshop on AI4Differential Equations In Science, March 2024. Gao, N., Köhler, J., and Foster, A. Folx - Forward Laplacian for JAX, 2023. Gao, N., Eberhard, E., and Günnemann, S. Learning Equivariant Non-Local Electron Density Functionals, October 2024. Gasteiger, J., Groß, J., and Günnemann, S. Directional Message Passing for Molecular Graphs. In International Conference on Learning Representations, September 2019. Gerard, L., Scherbela, M., Marquetand, P., and Grohs, P. Gold-standard solutions to the Schrödinger equation using deep learning: How much physics do we need? Advances in Neural Information Processing Systems, May 2022. Goldshlager, G., Abrahamsen, N., and Lin, L. A Kaczmarz-inspired approach to accelerate the optimization of neural network wavefunctions, January 2024. Han, J., Zhang, L., and E, W. Solving many-electron Schrödinger equation using deep neural networks. 
Journal of Computational Physics, 399:108929, December 2019. ISSN 0021-9991. doi: 10.1016/j.jcp.2019.108929. Hermann, J., Schätzle, Z., and Noé, F. Deep-neural-network solution of the electronic Schrödinger equation. Nature Chemistry, 12(10):891 897, October 2020. ISSN 1755-4330, 1755-4349. doi: 10.1038/s41557-020-0544-y. Hermann, J., Spencer, J., Choo, K., Mezzacapo, A., Foulkes, W. M. C., Pfau, D., Carleo, G., and Noé, F. Ab initio quantum chemistry with neural-network wavefunctions. Nature Reviews Chemistry, 7 (10):692 709, October 2023. ISSN 2397-3358. doi: 10.1038/s41570-023-00516-8. Kim, J., Pescia, G., Fore, B., Nys, J., Carleo, G., Gandolfi, S., Hjorth-Jensen, M., and Lovato, A. Neural-network quantum states for ultra-cold Fermi gases, May 2023. Klopper, W., Bachorz, R. A., Tew, D. P., and Hättig, C. Sub-me V accuracy in first-principles computations of the ionization potentials and electron affinities of the atoms H to Ne. Physical Review A, 81(2):022503, February 2010. ISSN 1050-2947, 1094-1622. doi: 10.1103/Phys Rev A. 81.022503. Kosmala, A., Gasteiger, J., Gao, N., and Günnemann, S. Ewald-based Long-Range Message Passing for Molecular Graphs. In Proceedings of the 40th International Conference on Machine Learning, pp. 17544 17563. PMLR, July 2023. Le Roy, R. J., Huang, Y., and Jary, C. An accurate analytic potential function for ground-state N2 from a direct-potential-fit analysis of spectroscopic data. The Journal of Chemical Physics, 125 (16):164310, October 2006. ISSN 0021-9606, 1089-7690. doi: 10.1063/1.2354502. Li, R., Ye, H., Jiang, D., Wen, X., Wang, C., Li, Z., Li, X., He, D., Chen, J., Ren, W., and Wang, L. A computational framework for neural network-based variational Monte Carlo with Forward Laplacian. Nature Machine Intelligence, 6(2):209 219, February 2024. ISSN 2522-5839. doi: 10.1038/s42256-024-00794-x. Lin, J., Goldshlager, G., and Lin, L. Explicitly antisymmetrized neural network layers for variational Monte Carlo simulation. ar Xiv:2112.03491 [physics], December 2021. Lou, W. T., Sutterud, H., Cassella, G., Foulkes, W. M. C., Knolle, J., Pfau, D., and Spencer, J. S. Neural Wave Functions for Superfluids, July 2023. Lyakh, D. I., Musiał, M., Lotrich, V. F., and Bartlett, R. J. Multireference Nature of Chemistry: The Coupled-Cluster View. Chemical Reviews, 112(1):182 243, January 2012. ISSN 0009-2665, 1520-6890. doi: 10.1021/cr2001417. Martin, W. C. and Musgrove, A. Ground levels and ionization energies for the neutral atoms. 1998. Mishchenko, K. and Defazio, A. Prodigy: An Expeditiously Adaptive Parameter-Free Learner, October 2023. Motta, M., Ceperley, D. M., Chan, G. K.-L., Gomez, J. A., Gull, E., Guo, S., Jiménez-Hoyos, C. A., Lan, T. N., Li, J., Ma, F., Millis, A. J., Prokof ev, N. V., Ray, U., Scuseria, G. E., Sorella, S., Stoudenmire, E. M., Sun, Q., Tupitsyn, I. S., White, S. R., Zgid, D., Zhang, S., and Simons Collaboration on the Many-Electron Problem. Towards the Solution of the Many-Electron Problem in Real Materials: Equation of State of the Hydrogen Chain with State-of-the-Art Many-Body Methods. Physical Review X, 7(3):031059, September 2017. ISSN 2160-3308. doi: 10.1103/ Phys Rev X.7.031059. Pfau, D., Spencer, J. S., Matthews, A. G. D. G., and Foulkes, W. M. C. Ab initio solution of the many-electron Schrödinger equation with deep neural networks. Physical Review Research, 2(3): 033429, September 2020. doi: 10.1103/Phys Rev Research.2.033429. Pfau, D., Axelrod, S., Sutterud, H., von Glehn, I., and Spencer, J. S. 
Accurate computation of quantum excited states with neural networks. Science, 385(6711):eadn0137, August 2024. doi: 10.1126/science.adn0137. Ren, W., Fu, W., Wu, X., and Chen, J. Towards the ground state of molecules via diffusion Monte Carlo on neural networks. Nature Communications, 14(1):1860, April 2023. ISSN 2041-1723. doi: 10.1038/s41467-023-37609-3. Richter-Powell, J., Thiede, L., Asparu-Guzik, A., and Duvenaud, D. Sorting Out Quantum Monte Carlo, November 2023. Scherbela, M., Reisenhofer, R., Gerard, L., Marquetand, P., and Grohs, P. Solving the electronic Schrödinger equation for multiple nuclear geometries with weight-sharing deep neural networks. Nature Computational Science, 2(5):331 341, May 2022. ISSN 2662-8457. doi: 10.1038/ s43588-022-00228-x. Scherbela, M., Gerard, L., and Grohs, P. Variational Monte Carlo on a Budget Fine-tuning pretrained Neural Wavefunctions. In Thirty-Seventh Conference on Neural Information Processing Systems, November 2023. Scherbela, M., Gerard, L., and Grohs, P. Towards a transferable fermionic neural wavefunction for molecules. Nature Communications, 15(1):120, January 2024. ISSN 2041-1723. doi: 10.1038/ s41467-023-44216-9. Schütt, K. T., Sauceda, H. E., Kindermans, P.-J., Tkatchenko, A., and Müller, K.-R. Sch Net A deep learning architecture for molecules and materials. The Journal of Chemical Physics, 148(24): 241722, June 2018. ISSN 0021-9606, 1089-7690. doi: 10.1063/1.5019779. Shazeer, N. GLU Variants Improve Transformer, February 2020. Spencer, J. S., Pfau, D., Botev, A., and Foulkes, W. M. C. Better, Faster Fermionic Neural Networks. 3rd Neur IPS Workshop on Machine Learning and Physical Science, November 2020. Szabo, A. and Ostlund, N. S. Modern Quantum Chemistry: Introduction to Advanced Electronic Structure Theory. Courier Corporation, 2012. von Glehn, I., Spencer, J. S., and Pfau, D. A Self-Attention Ansatz for Ab-initio Quantum Chemistry. In The Eleventh International Conference on Learning Representations, February 2023. Wilson, M., Gao, N., Wudarski, F., Rieffel, E., and Tubman, N. M. Simulations of state-of-the-art fermionic neural network wave functions with diffusion Monte Carlo, March 2021. Wilson, M., Moroni, S., Holzmann, M., Gao, N., Wudarski, F., Vegge, T., and Bhowmik, A. Neural network ansatz for periodic wave functions and the homogeneous electron gas. Physical Review B, 107(23):235139, June 2023. doi: 10.1103/Phys Rev B.107.235139. Wimmer, M. Algorithm 923: Efficient Numerical Computation of the Pfaffian for Dense and Banded Skew-Symmetric Matrices. ACM Transactions on Mathematical Software, 38(4):30:1 30:17, August 2012. ISSN 0098-3500. doi: 10.1145/2331130.2331138. You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. In Eighth International Conference on Learning Representations, April 2020. Zhang, X., Wang, L., Helwig, J., Luo, Y., Fu, C., Xie, Y., Liu, M., Lin, Y., Xu, Z., Yan, K., Adams, K., Weiler, M., Li, X., Fu, T., Wang, Y., Yu, H., Xie, Y., Fu, X., Strasser, A., Xu, S., Liu, Y., Du, Y., Saxton, A., Ling, H., Lawrence, H., Stärk, H., Gui, S., Edwards, C., Gao, N., Ladera, A., Wu, T., Hofgard, E. F., Tehrani, A. M., Wang, R., Daigavane, A., Bohde, M., Kurtin, J., Huang, Q., Phung, T., Xu, M., Joshi, C. K., Mathis, S. 
V., Azizzadenesheli, K., Fang, A., Aspuru-Guzik, A., Bekkers, E., Bronstein, M., Zitnik, M., Anandkumar, A., Ermon, S., Liò, P., Yu, R., Günnemann, S., Leskovec, J., Ji, H., Sun, J., Barzilay, R., Jaakkola, T., Coley, C. W., Qian, X., Qian, X., Smidt, T., and Ji, S. Artificial Intelligence for Science in Quantum, Atomistic, and Continuum Systems, November 2023.

A Odd numbers of electrons

To handle odd numbers of electrons, we extend the electron pair matrix $\hat\Phi_{\text{Pf}}A_{\text{Pf}}\hat\Phi_{\text{Pf}}^T$ to even dimensions. We accomplish this by augmenting $\hat\Phi_{\text{Pf}}A_{\text{Pf}}\hat\Phi_{\text{Pf}}^T$ with a learnable single-electron orbital $\phi_{\text{odd}}$ to
$$\widehat{\hat\Phi_{\text{Pf}}A_{\text{Pf}}\hat\Phi_{\text{Pf}}^T} = \begin{pmatrix}\hat\Phi_{\text{Pf}}A_{\text{Pf}}\hat\Phi_{\text{Pf}}^T & \phi_{\text{odd}}\\ -\phi_{\text{odd}}^T & 0\end{pmatrix}.\tag{21}$$
To obtain a single additional orbital for the whole molecule, we parametrize one orbital $\phi_{\text{odd},m}$ for each nucleus as in Eq. (14) and sum them up to obtain $\phi_{\text{odd}} = \sum_{m=1}^{N_n}\phi_{\text{odd},m}$.

B Difference to bottleneck envelopes

Similar to the bottleneck envelopes from Pfau et al. (2024), our efficient envelopes aim at reducing memory requirements. The bottleneck envelopes are defined as
$$\chi^k_{\text{bottleneck}}(\vec r_i)_j = \sum_{l=1}^{L} w_{kjl}\sum_{m=1}^{N_n}\pi_{lm}\exp\big(-\sigma_{lm}\lVert\vec r_i-\vec R_m\rVert\big)\tag{22}$$
where the weights $w$ linearly recombine the $L$ many-nuclei envelopes. While both methods share the idea of reducing the number of parameters, they differ in their implementation. Whereas the bottleneck envelopes construct a full set of $L$ many-nuclei envelopes and then linearly recombine these to the final envelopes for each of $K\cdot N_o$ orbitals, our efficient envelopes construct the final envelopes directly from a set of single-nucleus exponentials. Further, we use a different set of basis functions for each of the $K$ determinants. In terms of computational complexity, the bottleneck envelopes require $O(N_e N_n L) + O(K L N_e N_o)$ operations to compute the envelopes, while our efficient envelopes require $O(K N_{\text{env}} N_e N_o)$ operations. In practice, we found our efficient envelopes to be faster and to converge better on all systems we tested. An ablation study is presented in App. I. Further, we observed no numerical instabilities in our envelopes as reported by Pfau et al. (2024). Compared to the full envelopes, we find our memory-efficient ones to be slower but to yield better performance. This is likely due to the increased number of wave function parameters. The number of parameters for the full envelopes and our memory-efficient envelopes is shown in Tab. 1 for an example with $N_e = N_o = 20$, $N_n = 5$, $N_d = 16$, $N_{\text{env/nuc}} = 8$.

Table 1: Number of envelope parameters for the full envelopes and our memory-efficient envelopes for an explanatory system.
| Envelopes | $\sigma$ | $\pi$ | Total |
|---|---|---|---|
| full | 1600 | 1600 | 3200 |
| ours | 640 | 12800 | 13440 |

The full envelopes' $\sigma$ and $\pi$ both scale as $O(N_d N_n N_o)$, while our memory-efficient envelopes' $\sigma$ scales as $O(N_d N_n N_{\text{env/nuc}})$ and $\pi$ scales as $O(N_d N_n N_{\text{env/nuc}} N_o)$. In runtime, the full envelopes require $O(N_d N_n N_e N_o)$ operations, while our memory-efficient envelopes require $O(N_d N_n N_{\text{env/nuc}} N_e N_o)$ operations. In memory complexity, the full envelopes require $O(N_d N_n N_e^2)$, while our memory-efficient envelopes require $O(N_d N_n N_{\text{env/nuc}} N_e)$.
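For reference, the per-nucleus contraction of the efficient envelopes in Eq. (16) can be sketched as follows for a single Pfaffian $k$, omitting the batch dimension. Names and shapes are illustrative assumptions.

```python
import jax.numpy as jnp

def efficient_envelopes(r_elec, R_env, sigma, pi):
    """Per-nucleus envelopes of Eq. (16) for a single Pfaffian k, sketched.

    r_elec: (N_e, 3) electron positions,
    R_env:  (N_env, 3) position of the nucleus each envelope is centered on,
    sigma:  (N_env,) positive decay rates,
    pi:     (N_env, N_o) linear recombination into molecular envelopes.
    """
    dist = jnp.linalg.norm(r_elec[:, None, :] - R_env[None, :, :], axis=-1)  # (N_e, N_env)
    atomic = jnp.exp(-sigma[None, :] * dist)                                 # (N_e, N_env)
    # A single contraction yields all orbital envelopes without materializing
    # an (N_e, N_n, N_o) tensor.
    return atomic @ pi  # (N_e, N_o)
```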
C Pretraining

To pretrain NeurPf, we solve the optimization problem from Eq. (20). The nested optimization problems are solved iteratively, where we first solve for $T\in SO(N)$ and $A_{\text{HF}}\in\mathcal{S}$ and then for the parameters of the wave function $\theta$. We describe how we parametrize the special orthogonal group $SO(N)$ and the antisymmetric special orthogonal group $\mathcal{S}$, and then how we solve the optimization problems. To optimize over the special orthogonal group $SO(N)$, we parametrize $T$ via some arbitrary matrix $\tilde T\in\mathbb{R}^{N\times N}$. Next, we obtain an antisymmetrized version of $\tilde T$ via
$$\hat T = \tilde T - \tilde T^T.\tag{23}$$
We may now use $\hat T$ with the Cayley transform to obtain a special orthogonal matrix
$$\bar T = (\hat T - I)^{-1}(\hat T + I)\tag{24}$$
where $I$ is the identity matrix. $\bar T$ is now a special orthogonal matrix with no eigenvalue equal to $-1$. To also parametrize matrices with an even number of $-1$ eigenvalues, we simply multiply $\bar T$ with itself:
$$T = \bar T\bar T,\tag{25}$$
which gives us our final parametrization of the special orthogonal group $SO(N)$ (Gallier, 2013). We follow Gallier (2013) to parametrize antisymmetric special orthogonal matrices $\mathcal{S}$. In particular, we parametrize some $T$ using the procedure outlined above. To parametrize $A_{\text{HF}}$, it remains to antisymmetrize $T$ while preserving the special orthogonal property. We accomplish this by defining
$$A_{\text{HF}} = T\tilde I T^T\tag{26}$$
where
$$\tilde I = \operatorname{diag}\!\left(\begin{pmatrix}0&1\\-1&0\end{pmatrix},\dots,\begin{pmatrix}0&1\\-1&0\end{pmatrix}\right)\tag{27}$$
is the antisymmetric identity matrix. Since the product of special orthogonal matrices is special orthogonal and $BAB^T$ yields an antisymmetric matrix for any antisymmetric $A$ and matrix $B$, we have that $A_{\text{HF}}\in\mathcal{S}$ is an antisymmetric special orthogonal matrix. Now that we can parametrize both groups with real matrices, we can simplify the optimization problem by performing gradient optimization for $\tilde T$, $A_{\text{HF}}$, and $\theta$. We solve this problem alternatingly, where we first solve for $T$ and $A_{\text{HF}}$ by doing $N_{\text{pre}}$ steps of gradient optimization with the Prodigy optimizer (Mishchenko & Defazio, 2023) and then perform a single outer step on $\theta$ with the Lamb optimizer (You et al., 2020) like previous works (Gao & Günnemann, 2022; von Glehn et al., 2023).

D Model architectures

We largely reuse the same architecture for the MetaGNN $\mathcal{M} : (\mathbb{R}^3\times\mathbb{N}^+)^{N_n}\to\Theta$ and wave function $\Psi_{\text{Pfaffian}} : \mathbb{R}^{N_e\times3}\times\Theta\to\mathbb{R}$ as Gao & Günnemann (2023a). We canonicalize all molecular structures using the equivariant coordinate frame from Gao & Günnemann (2022).

D.1 Wave function

Similar to Gao & Günnemann (2023a), we use bars above functions and parameters to indicate that the MetaGNN $\mathcal{M}$ parametrizes these and that they vary by structure. We define our wave function as a Jastrow-Pfaffian wave function like Kim et al. (2023):
$$\Psi(\vec r) = \exp\big(J(\vec r)\big)\sum_{k=1}^{N_k}c_k\operatorname{Pf}\Big(\hat\Phi^k_{\text{Pf}}(\vec r)A^k_{\text{Pf}}\hat\Phi^k_{\text{Pf}}(\vec r)^T\Big).\tag{28}$$
As Jastrow factor $J:\mathbb{R}^{N_e\times3}\to\mathbb{R}$, we use a linear combination of a learnable MLP of electron embeddings and the fixed electronic cusp Jastrow from von Glehn et al. (2023):
$$J(\vec r) = \sum_{i=1}^{N_e}\operatorname{MLP}\big(h(\vec r_i\mid\vec r)\big) - \beta_{\text{par}}\sum_{i<j;\,\alpha_i=\alpha_j}\frac{1}{4}\frac{\alpha_{\text{par}}^2}{\alpha_{\text{par}}+\lVert\vec r_i-\vec r_j\rVert} - \beta_{\text{anti}}\sum_{i<j;\,\alpha_i\neq\alpha_j}\frac{1}{2}\frac{\alpha_{\text{anti}}^2}{\alpha_{\text{anti}}+\lVert\vec r_i-\vec r_j\rVert}\tag{29}$$
where $h:\mathbb{R}^3\times\mathbb{R}^{N_e\times3}\to\mathbb{R}^{N_f}$ is the $i$th output of the permutation equivariant neural network, implemented via the Molecular Orbital Network (Moon) (Gao & Günnemann, 2023a), $\beta_{\text{par}}, \beta_{\text{anti}}, \alpha_{\text{par}}, \alpha_{\text{anti}}\in\mathbb{R}$ are learnable scalars, and $\alpha_i$ is the spin of the $i$th electron. The orbitals $\hat\Phi_{\text{Pf}}$ are defined as in Eq. (14) with Moon performing the following steps: We start with constructing electron embeddings based on electron-electron distances and then proceed to aggregate these embeddings to the orbitals. The nuclei are updated through MLPs and finally diffused to the electrons, yielding the final electron embeddings. The initial embedding $h^{(0)}_i$ of the $i$th electron is constructed as
$$h^{(0)}_i = \frac{1}{\mu(\vec r_i)}\sum_{j=1}^{N_e}\sigma\Big(g^{\text{e-e}}_{ij}W^{\delta_{\alpha_j\alpha_i}}\Big)\odot\Gamma^{\delta_{\alpha_j\alpha_i}}\big(\lVert\vec r_i-\vec r_j\rVert\big)\tag{30}$$
where $\odot$ denotes the Hadamard product and the Kronecker delta $\delta_{\alpha_j\alpha_i}$ as superscript indicates different parameters depending on the identity between spins $\alpha_i$ and $\alpha_j$. $\Gamma:\mathbb{R}^I\to\mathbb{R}^D$ is a learnable radial filter function, and $\sigma$ is the activation function. $g^{\text{e-e}}_{ij}\in\mathbb{R}^4$ are the rescaled electron-electron distances (von Glehn et al., 2023):
$$g_{ij} = \frac{\log\big(1+\lVert\vec r_i-\vec r_j\rVert\big)}{\lVert\vec r_i-\vec r_j\rVert}\big[\vec r_i-\vec r_j,\ \lVert\vec r_i-\vec r_j\rVert\big]$$
(31) µ is a normalization factor: µ( r) = 1 + We use the initial electron embeddings with nuclei embeddings and electron-nuclei distances to construct pairwise nuclei-electron embeddings representing edges in a fully connected graph: he-n im = σ h(0) i + zm + ge-n im Wm . (33) where zm is the mth nucleus embedding, ge-n im R4 are the rescaled electron-nuclei distances like in Eq. (31). These embeddings are then aggregated with spatial filters twice: once towards the nuclei and once towards the electrons: hnα(1) m = 1 i Aα he-n i,m Γ n m( ri Rm), (34) m(1) i = 1 µ( ri) m=1 he-n i,m Γ e m( ri Rm), (35) h(1) i = σ(m(1) i W + b). (36) We update the nuclei embeddings with L update layers: hnα(l+1) m = hnα(l) m + σ([hnα(l) m , hnˆα(l) m ]W (l) + b(l)), (37) where ˆα denotes the opposite spin of α, to obtain the final nuclei embeddings hnα(L) m . The final electron embeddings he(L) i are constructed by combining the message from the nuclei and the previous electron embedding: he(L) i = σ σ h(1) i W + m(L) i + b1 W + b2 + h(1) i (38) where mi is the message from the nuclei to the ith electron: m(L) i = 1 µ( ri) m=1 σ h hnαi(L) m , hnˆαi(L) m i W + b Γ diff m ( ri Rm). (39) The spatial filters Γ are defined as: Γ (l) m (x) = βm(x)W (l), (40) i=1 W env σ x W (1) m + b(1) m W (2) + b(2) . (41) Note that β is shared across all instances of Γ. Γ is defined analogously to Γ but with fixed learnable parameters instead of Meta GNN parametrized ones. D.2 Meta GNN The Meta GNN M : (R3 N+)Nn Θ takes the nucleus position R and charges Z as input and outputs parameters of the electronic wave function to adapt the solution to the system of interest. We follow Gao & Günnemann (2022, 2023a) and implement it as a graph neural network (GNN) where nuclei are represented as nodes and edges are constructed based on inter-particle distances. The charge of the nucleus determines the initial node embeddings: k(0) i = EZi (42) where E is an embedding matrix and Zi is the charge of the ith nucleus. These embeddings are iteratively updated via message passing in the following way: k(l+1) i = f (l)(k(l) i , t(l) i ), (43) j=1 g(l)(k(l) i , k(l) j ) Γ (l)( Ri Rj), (44) νN x = 1 + X y N exp x y 2 where Eq. (43) describes the update function, Eq. (44) the message construction, and Eq. (45) a learnable normalization coefficient. We implement the functions f and g via Gated Linear Units (GLU) (Shazeer, 2020). As spatial filters, we use the same as in the wave function but additionally multiply the filters with radial Bessel functions from Gasteiger et al. (2019): Γ (l)(x) =β(x)W (l), (46) W env σ x W (1) + b(1) W (2) + b(2) where fi are learnable frequencies, and c is a smooth cutoff for the Bessel functions. After L layers, we take the final node embeddings, pass them through another GLU, and then use a different GLU as head for each distinct parameter tensor of the wave function we want to predict. For edge-dependent parameters, like π or A, we first construct edge embeddings by concatenating all combinations of node embeddings. We pass these through a GLU and then proceed like for node embeddings. For all outputs, we add a default charge-dependent parameter tensor such that the Meta GNN only learns a delta to an initial guess depending on the charge of the nucleus. D.3 Orbital parametrization Our Pfaffian wave function enables us to simply parametrize a No max{N , N } orbitals rather than parametrizing exactly N /N . As discussed in Sec. 4.4, we accomplish this by associating a fixed number of orbitals with each nucleus. 
Here, we provide detailed construction for all parameters of the orbital construction. For simplicity, we do not explicitly show the dependence on the kth Pfaffian. Note that we simply extend the readout by an Nk sized dimension for each of the Nk Pfaffians from Eq. (13). Further, we predict two sets of parameters, one for ΦPf and one for ΦPf in Eq. (9). To parametrize the orbitals, we predict Norb/nuc orbital parameters for each of the Nn nuclei. Concretely, the linear projection to Wk from Eq. (14) are constructed as ω1(k1) ... ωNorb/nuc(k1) ω1(k2) ... ωNorb/nuc(k Nn) RNo Nf (48) where ωi : RD RNf learnable readouts of our Meta GNN. Similarly, we parametrize the envelope coefficients σk from Eq. (16): ς1(k1) ... ςNenv/nuc(k1) ς1(k2) ... ςNenv/nuc(k Nn) RNenv + (49) where ςi : RD R+ are learnable readouts of our Meta GNN. The linear orbital weights π connect each nuclei-centered envelope to the non-atom-centered orbitals. For this, we need to find a mapping from each of the Nenv envelopes to each of the No orbitals. Since Nenv = Nenv/nuc Nn and No = Norb/nuc Nn are predicted per nuclei, a natural connection is established via a pair-wise atom function: ϖ1,1(k1, k1) . . . ϖ1,Norb/nuc(k1, k1) ϖ1,1(k2, k1) . . . ... ... ... ϖNenv/nuc,1(k1, k1) . . . ϖNenv/nuc,Norb/nuc(k1, k1) ϖNenv/nuc,1(k2, k1) . . . ϖ1,1(k1, k2) ϖ1,Norb/nuc(k1, k2) ϖ1,1(k2, k2) . . . ... ... ... ... where ϖi,j : RD RD R are learnable readouts of our Meta GNN. Similarly, we establish the orbital correlations A from Eq. (14) by connecting each of the No orbitals to each other: α1,1(k1, k1) . . . α1,Norb/nuc(k1, k2) α1,1(k2, k1) . . . ... ... ... αNorb/nuc,1(k1, k1) . . . αNorb/nuc,Norb/nuc(k1, k1) αNorb/nuc,1(k2, k1) . . . α1,1(k1, k2) α1,Norb/nuc(k1, k2) α1,1(k2, k2) . . . ... ... ... ... 2( ˆAPf ˆAT Pf) (52) where αi,j : RD RD R are learnable readouts of our Meta GNN and Eq. (52) enforcing the antisymmetry requirements on A. D.4 Changes to the Meta GNN We performed several optimizations on the Meta GNN from Gao & Günnemann (2023a) that primarily reduce the number of parameters while keeping accuracy. In particular, we changed the following: We replace all MLPs with gated linear units (GLU) (Shazeer, 2020). We reduced the hidden dimension from 128 to 64. Table 2: Hyperparameters used for the experiments. Hyperparameter Value Structure batch size full batch Total electron samples 4096 Pretraining Epochs 10000 Learning rate 10 3 (1 + t 10 4) 1 Optimizer Lamb MCMC steps 5 Basis STO-6G Subproblem steps 50 Subproblem optimizer Prodigy Subproblem α 1.0 Subproblem β 10 4 Optimization Steps 60000 Learning rate 0.02 (1 + t 10 4) 1 Optimizer Spring MCMC steps 20 Norm constraint 10 3 Damping 0.001 Momentum 0.99 Energy clipping 5 times mean deviation from median Hidden dim 256 E-E int dim 32 Layers 4 Activation Si LU Determinants/Pfaffians 16 Jastrow layers 3 Filter hidden dims [16, 8] Norb/nuc (H, He) 2 Norb/nuc (Li, Be) 6 Norb/nuc (B, C) 7 Norb/nuc (N, O) 8 Norb/nuc (F, Ne) 10 Nenv/nuc 8 Embedding dim 64 Message dim 32 Layers 3 Activation Si LU Filter hidden dims [32, 16] We reduced the message dimension from 64 to 32. We use bessel basis functions (Gasteiger et al., 2019) on the radius for edge filters. We remove the hand-crafted orbital locations and the associated network. We added a Layer Norm before every GLU. Together, these changes reduce the number of parameters from 13M to 1M for the Meta GNN while outperforming Gao & Günnemann (2023a) as demonstrated in Sec. 5. 
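Since App. D.4 replaces all MLPs of the MetaGNN with gated linear units (Shazeer, 2020) and Tab. 2 lists SiLU as the activation, a GLU block of that kind could look as in the sketch below. The specific GLU variant (SiLU-gated) and the weight names are assumptions for illustration, not the paper's exact module.

```python
import jax
import jax.numpy as jnp

def glu_block(x, W_val, b_val, W_gate, b_gate, W_out, b_out):
    """One gated linear unit (Shazeer, 2020) with a SiLU gate, sketched."""
    gated = (x @ W_val + b_val) * jax.nn.silu(x @ W_gate + b_gate)  # value * gate
    return gated @ W_out + b_out
```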
D.4 Changes to the MetaGNN

We performed several optimizations on the MetaGNN from Gao & Günnemann (2023a) that primarily reduce the number of parameters while maintaining accuracy. In particular, we changed the following:
- We replace all MLPs with gated linear units (GLU) (Shazeer, 2020).
- We reduce the hidden dimension from 128 to 64.
- We reduce the message dimension from 64 to 32.
- We use Bessel basis functions (Gasteiger et al., 2019) on the radius for edge filters.
- We remove the hand-crafted orbital locations and the associated network.
- We add a LayerNorm before every GLU.
Together, these changes reduce the number of parameters of the MetaGNN from 13M to 1M while outperforming Gao & Günnemann (2023a), as demonstrated in Sec. 5.

Table 2: Hyperparameters used for the experiments.

Hyperparameter                  Value
Structure batch size            full batch
Total electron samples          4096
Pretraining
  Epochs                        10000
  Learning rate                 10^-3 · (1 + t · 10^-4)^-1
  Optimizer                     Lamb
  MCMC steps                    5
  Basis                         STO-6G
  Subproblem steps              50
  Subproblem optimizer          Prodigy
  Subproblem α                  1.0
  Subproblem β                  10^-4
Optimization
  Steps                         60000
  Learning rate                 0.02 · (1 + t · 10^-4)^-1
  Optimizer                     Spring
  MCMC steps                    20
  Norm constraint               10^-3
  Damping                       0.001
  Momentum                      0.99
  Energy clipping               5 times mean deviation from median
Wave function
  Hidden dim                    256
  E-E int dim                   32
  Layers                        4
  Activation                    SiLU
  Determinants/Pfaffians        16
  Jastrow layers                3
  Filter hidden dims            [16, 8]
  N_orb/nuc (H, He)             2
  N_orb/nuc (Li, Be)            6
  N_orb/nuc (B, C)              7
  N_orb/nuc (N, O)              8
  N_orb/nuc (F, Ne)             10
  N_env/nuc                     8
MetaGNN
  Embedding dim                 64
  Message dim                   32
  Layers                        3
  Activation                    SiLU
  Filter hidden dims            [32, 16]

E Experimental setup

Table 3: Compute time per experiment measured in Nvidia A100 GPU hours.

Experiment                Time (GPU hours)
Ionization & affinity     224
N2                        116
N2 + Ethene               124
TinyMol small             78
TinyMol large             96

E.1 Hyperparameters

We list the default parameters used for the experiments in Tab. 2. Most of them are taken directly from Gao & Günnemann (2023a). Where an experiment in Sec. 5 deviates from these defaults, we state so explicitly. We implement everything in JAX (Bradbury et al., 2018). To compute the Laplacian ∇²Ψ, we use the forward Laplacian algorithm (Li et al., 2024) implemented in the folx library (Gao et al., 2023).

E.2 Source code

We provide the source code publicly on GitHub¹.

¹https://github.com/n-gao/neural-pfaffian

E.3 Compute time

Tab. 3 lists the compute times required to conduct our experiments, measured in Nvidia A100 GPU hours. Depending on the experiment, we use between 1 and 4 GPUs via data parallelism. We typically allocated 32GB of system memory and 16 CPU cores per experiment. In terms of the number of parameters, the Moon wave function is as large as in Gao & Günnemann (2023a) at 1M parameters, while the MetaGNN shrank from 13M parameters to just 1M parameters.

E.4 Preconditioning

The Spring optimizer (Goldshlager et al., 2024) is a natural gradient descent optimizer for electronic wave functions Ψ with the following update rule:

\theta_t = \theta_{t-1} - \eta\, \delta_t, \quad (53)
\delta_t = \big(\bar O^T \bar O + \lambda I\big)^{-1}\big(\bar O^T \epsilon + \lambda \mu\, \delta_{t-1}\big), \quad (54)

where λ is the damping factor, µ is the momentum, η is the learning rate, ε is the vector of centered local energies from Eq. (8), and \bar O is the zero-centered Jacobian:

\bar O = O - \frac{1}{N} \sum_{i=1}^{N} O_i, \quad (55)
O = \big(\nabla_\theta \log \psi(\vec x_1), \dots, \nabla_\theta \log \psi(\vec x_N)\big)^T. \quad (56)

Since \bar O \in \mathbb{R}^{N \times P}, where N is the batch size and P the number of parameters, the update in Eq. (54) can be computed efficiently using the Woodbury matrix identity, which after some simplifications yields

\delta_t = \bar O^T \big(\bar O \bar O^T + \lambda I\big)^{-1}\big(\epsilon - \mu\, \bar O\, \delta_{t-1}\big) + \mu\, \delta_{t-1}. \quad (57)

Our early experiments found it necessary to center the Jacobian O per molecule rather than once across all samples. In single-structure VMC, the centering eliminates the gradient component along the direction in which the wave function's amplitude increases for all inputs; this direction does not affect energies. Thus, instead of restricting the update from uniformly increasing the wave function's magnitude across all samples, we restrict it from doing so for each molecule separately. Note that the latter implies the former but not vice versa. For multi-structure VMC, we therefore compute \bar O by subtracting the per-molecule means \frac{1}{N_1}\sum_{i=1}^{N_1} O_i, \dots, \frac{1}{N_M}\sum_{i=N-N_M+1}^{N} O_i from the rows of the corresponding molecules, where N_1, \dots, N_M are the numbers of samples of the individual molecular structures. To stabilize computations, we perform the preconditioning in float64.
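The following JAX sketch illustrates the update of Eq. (57) together with the per-molecule centering described above. The batch layout, damping, and momentum values are illustrative assumptions, and the variable names are hypothetical; the actual implementation runs this preconditioning in float64 inside the training loop.

```python
import jax
import jax.numpy as jnp

def center_per_molecule(O, sizes):
    """Subtract the mean Jacobian row within each molecule's block of O (N, P)."""
    blocks, start = [], 0
    for n in sizes:
        block = O[start:start + n]
        blocks.append(block - block.mean(axis=0, keepdims=True))
        start += n
    return jnp.concatenate(blocks, axis=0)

def spring_update(O_bar, eps, delta_prev, lam=1e-3, mu=0.99):
    """Eq. (57): delta = O^T (O O^T + lam I)^-1 (eps - mu O delta_prev) + mu delta_prev."""
    n = O_bar.shape[0]
    rhs = eps - mu * (O_bar @ delta_prev)
    x = jnp.linalg.solve(O_bar @ O_bar.T + lam * jnp.eye(n), rhs)   # N x N system, not P x P
    return O_bar.T @ x + mu * delta_prev

# Toy usage: two molecules with 3 and 5 samples, P = 10 parameters.
key1, key2 = jax.random.split(jax.random.PRNGKey(0))
O = jax.random.normal(key1, (8, 10))            # per-sample gradients of log|psi|
eps = jax.random.normal(key2, (8,))             # centered local energies
delta = spring_update(center_per_molecule(O, sizes=[3, 5]), eps, jnp.zeros(10))
```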
F Extensivity on hydrogen chains

Gao & Günnemann (2023a) and Scherbela et al. (2024) analyzed the behavior of their wave functions on hydrogen chains to investigate their extensivity. They did so by training the generalized wave functions on a set of hydrogen chains with 6 and 10 atoms and then evaluating the energy per atom on hydrogen chains of different lengths. We replicate their experiment: we train a single NeurPf on the hydrogen chains with 6 and 10 atoms and evaluate the energy per atom on hydrogen chains of increasing length.

Fig. 6: Energy per atom of hydrogen chains with different lengths. The energy is computed with a single NeurPf trained on the hydrogen chains with 6 and 10 atoms. (Energy per atom in E_h over the number of hydrogen atoms; shown are our, TAO, Globe+Moon, Globe+FermiNet, Hartree-Fock, and the AFQMC limit, with the training data marked.)

Fig. 6 shows the energy per atom of hydrogen chains of different lengths for various methods: Globe+Moon and Globe+FermiNet from Gao & Günnemann (2023a), TAO from Scherbela et al. (2024), Hartree-Fock (CBS), the AFQMC limit for an infinitely long chain (Motta et al., 2017), and NeurPf. NeurPf clearly outperforms Globe+Moon and Globe+FermiNet, achieving significantly lower energies outside of the training regime. Compared to Scherbela et al. (2024), NeurPf generally performs better on longer chains, achieving errors below the Hartree-Fock baseline; however, we observe noticeably higher errors for the shortest chains. These results indicate that NeurPf generalizes better to longer chains than previous works despite not relying on additional Hartree-Fock calculations like Scherbela et al. (2024).

G Metal ionization energies

In addition to the results in Sec. 5, where we train on all second-row elements and their ionization and affinity potentials, we here train a single NeurPf on a set of metals and their ionization energies. This demonstrates that Neural Pfaffians also scale to heavier third- and fourth-row elements. Fig. 7 shows the ionization energies during training. It is apparent that NeurPf can learn a solution for all states simultaneously.

Fig. 7: Ionization energies of metal atoms. The ionization energies are computed with a single NeurPf trained on the neutral and ionized atoms. Reference energies are taken from Martin & Musgrove (1998).

H TinyMol convergence in time

In Fig. 8, we show the runtime effect of choosing different embeddings and antisymmetrizers. We test our default model, our (FermiNet), our (PsiFormer), and Globe+Moon on both TinyMol datasets. For any time budget, all variants of NeurPf converge to lower energies than Globe.

Fig. 8: Energy convergence as a function of time. (Energy in E_h over wall-clock time for the small and large TinyMol test sets; shown are our, our (FermiNet), our (PsiFormer), and Globe.)

I Convergence ablation studies

Here, we provide additional ablation studies to further investigate the performance of NeurPf and our efficient envelopes. In particular, we train four different models on the small TinyMol dataset: NeurPf, NeurPf with the envelopes from Spencer et al. (2020), NeurPf with the envelopes from Pfau et al. (2024), and an AGP-based generalized wave function. The total energy during training is shown in Fig. 9; the left plot shows convergence in terms of the number of steps, the right plot in terms of time. We observe that NeurPf converges consistently faster than the other methods, both in the number of steps and in time. One further sees the importance of generalizing Eq. (9) via the Pfaffian, as the AGP-based wave function does not converge to the same accuracy as NeurPf. The bottleneck envelopes from Pfau et al. (2024) not only converge to worse energies but are also slower per step than our efficient envelopes from Sec. 4.2.

Fig. 9: Ablation study on the small TinyMol dataset. The y-axis shows the sum of all energies in the dataset. The left plot shows the convergence in terms of the number of steps, the right plot in terms of time. "our + Full Env." denotes a NeurPf with the envelopes from Spencer et al. (2020), and "our + Bottleneck Env." uses the bottleneck envelopes from Pfau et al. (2024).
J Model ablation studies

Fig. 10: TinyMol ablation with fixed and learnable antisymmetrizer. (Energy difference to CCSD(T) CBS in mE_h over training steps for the small and large molecules; the fixed variant corresponds to Pf(Φ_1Φ_2^T − Φ_2Φ_1^T).)

Fig. 11: Ablation study on the small TinyMol dataset with different embedding networks. (Energy difference to CCSD(T) CBS in mE_h over training steps; our (FermiNet) and our (PsiFormer) on the small and large molecules.)

Table 4: TinyMol energies compared to CCSD(T) in mE_h.

                      Small                      Large
Method (Steps)   CNH    C2H4   COH2     C3H4   CN2H2   CNOH   CO2
Globe (32k)      5.2    12.3   10.7     62.3   45.8    40.4   42.7
TAO (32k)        1.1    4.5    6.6      18.7   21.0    41.9   19.6
our (32k)       -3.7    0.1   -2.1      12.7   5.5     3.1    5.0
our (128k)      -4.2   -1.5   -3.7      1.4   -3.8    -6.9   -8.2

Fig. 12: Boxplot of the energy per molecule on both the TinyMol small and large datasets for NeurPf, TAO, and the pretrained TAO from Scherbela et al. (2024). Each boxplot contains results from 10 structures of the given molecule. The line indicates the mean, the box the interquartile range, and the whiskers 1.5 times the interquartile range.

J.1 Learnable antisymmetrizer

We picked Pf(ΦAΦ^T) as parametrization because it generalizes Slater determinants and many alternative parametrizations. For instance, choosing

A = \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix} \quad \text{and} \quad \Phi = (\Phi_1\ \Phi_2)

yields Pf(ΦAΦ^T) = Pf(Φ_1Φ_2^T − Φ_2Φ_1^T) (see the short numerical check following App. K below). We investigate the impact of a fixed versus learnable A in Fig. 10. The results suggest that a learnable A is a significant factor in our Neural Pfaffian's accuracy.

J.2 Embedding network

Since NeurPf is not limited to Moon, we performed additional ablations with FermiNet (Pfau et al., 2020) and PsiFormer (von Glehn et al., 2023) as the embedding. The results in Fig. 11 show Neural Pfaffians outperforming Globe and TAO with any of the three equivariant embedding models. Consistent with Gao & Günnemann (2023a), Moon is the best choice for generalized wave functions.

K TinyMol results

Here, we provide additional data analysis and error metrics for the TinyMol dataset. First, Table 4 shows the energy per molecule on the small and large TinyMol datasets for NeurPf, Globe, and TAO. To estimate the remaining error, we also train another NeurPf for 128k steps. The results show that NeurPf consistently outperforms TAO and Globe on all molecules in both datasets. Second, Fig. 12 shows the error per molecule for both the small and large TinyMol datasets; we plot all models after 32k steps of training. NeurPf consistently reaches lower, i.e., better, energies than TAO on all molecules in both datasets. Even the pretrained TAO is outperformed by NeurPf on all but four structures of C3H4 in the large TinyMol dataset.

Fig. 13: TinyMol results with pretraining on the training set. (Energy difference to CCSD(T) CBS over training steps; our, TAO, and CCSD(T) on the small and large molecules.)

Fig. 14: Comparison of the energy per molecule on the TinyMol dataset for training jointly on all structures vs. training a model per structure. (Energy difference to CCSD(T) over the total number of steps; joint training, separate training, and the CCSD(T) CBS reference.)
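Referring back to the antisymmetrizer discussion in App. J.1, the following JAX sketch numerically verifies the stated identity for the fixed block choice of A and checks that the resulting matrix is antisymmetric, so its Pfaffian is well defined. The sizes and the log-domain Pfaffian-from-determinant shortcut at the end are for illustration only.

```python
import jax
import jax.numpy as jnp

ne, k = 6, 4                                  # 6 electrons, 2k = 8 orbitals (assumed)
key = jax.random.PRNGKey(0)
phi1, phi2 = jax.random.normal(key, (2, ne, k))
phi = jnp.concatenate([phi1, phi2], axis=-1)  # Phi = (Phi1 Phi2), shape (ne, 2k)

eye = jnp.eye(k)
A = jnp.block([[jnp.zeros((k, k)), eye], [-eye, jnp.zeros((k, k))]])

lhs = phi @ A @ phi.T
rhs = phi1 @ phi2.T - phi2 @ phi1.T
assert jnp.allclose(lhs, rhs, atol=1e-5)       # Pf(Phi A Phi^T) = Pf(Phi1 Phi2^T - Phi2 Phi1^T)
assert jnp.allclose(lhs, -lhs.T, atol=1e-5)    # antisymmetric, so the Pfaffian is defined
# |Pf(M)| can be read off det(M), since Pf(M)^2 = det(M) for antisymmetric M.
sign, logdet = jnp.linalg.slogdet(lhs)
log_abs_pf = 0.5 * logdet
```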
L Pretraining on the TinyMol dataset

The TinyMol dataset provides an additional pretraining set of 360 structures (18 molecules, 20 structures each). Like Scherbela et al. (2024), we pretrain our model on the training set of the TinyMol dataset and then finetune on the two test sets. Interestingly, we find the Spring optimizer to be unstable when swapping molecules from step to step and thus use CG preconditioning like Gao & Günnemann (2023a) during pretraining. While this yields a small benefit on the small molecules, we find no notable difference to the Hartree-Fock-pretrained model on the large molecules, as shown in Fig. 13. On the small structures, the unpretrained NeurPf's energies are 5.7 mE_h lower. NeurPf also surpasses the pretrained TAO after just 8k steps. Compared to the pretrained TAO on the large structures, NeurPf surpasses TAO after 16k steps and achieves 5.4 mE_h lower energies after 32k steps.

M Joint vs separate training

To estimate the benefit of training a generalized wave function compared to training a model per molecule, we compare the convergence of the total energy on the TinyMol dataset for both approaches depending on the total number of training steps. As training a separate model for each of the 70 TinyMol test structures is computationally beyond the scope of this work, we select one structure per molecule and train a model for each of the 7 molecules. We use the same NeurPf with MetaGNN for both approaches. The results are shown in Fig. 14. We observe that for lower step numbers, it is quite beneficial to train a generalized model. However, this benefit vanishes for higher step numbers, and training a model per molecule yields lower energies. We attribute this to the fact that the generalized model has to learn a more complex representation that is not necessary for each molecule individually. Further, the per-molecule energy estimates are quite unstable due to the small shared batch size. Developments like Scherbela et al. (2023) may improve NeurPf training as well.

Fig. 15: Time per training step depending on the number of electrons in the two molecules. (Values as annotated in the figure, for N_e in molecule 1 / molecule 2 ∈ {2, 4, 8, 16, 32}:)

         2     4     8     16    32
  2     2.8   3.0   3.1   3.4   5.0
  4     3.0   2.9   3.1   3.4   5.1
  8     3.1   3.1   3.0   3.5   5.3
  16    3.4   3.4   3.5   3.7   5.6
  32    5.0   5.1   5.3   5.6   6.9

Fig. 16: Time for the forward pass, gradient, and Laplacian computation of the determinant vs. our Pfaffian implementation. (Time over the number of electrons N_e from roughly 20 to 100.)

N Training time by batch composition

Here, we benchmark the total time per step for a two-molecule batch. We test all combinations of two molecules with N_e^1, N_e^2 ∈ {2, 4, 8, 16, 32}. While Fig. 15 shows a small runtime increase when processing small molecules jointly, for larger systems the runtime per step converges to the geometric mean of the individual runtimes.

O Pfaffian runtime

In Fig. 16, we benchmark our implementation of Pf(ΦAΦ^T) (including the matrix multiplications) against the standard det Φ operation for 10 to 100 electrons. We implement the Pfaffian in JAX, while highly optimized CUDA kernels are available for the determinant. In summary, both share the same O(N³) complexity, but the Pfaffian is approximately 5 times slower.
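Since App. O only reports relative runtimes, the sketch below shows, for reference, one standard O(N³) way to evaluate a Pfaffian: a Parlett–Reid-style skew-symmetric elimination, written in plain NumPy for readability. It is an illustrative reference implementation, not the JAX code used in the paper; the final sanity check uses the identity Pf(M)² = det(M).

```python
import numpy as np

def pfaffian(A_in):
    """Pfaffian of a real antisymmetric matrix via Parlett-Reid elimination, O(n^3)."""
    A = np.array(A_in, dtype=float, copy=True)
    n = A.shape[0]
    if n % 2:                      # odd-dimensional antisymmetric matrices have Pf = 0
        return 0.0
    pf = 1.0
    for k in range(0, n - 1, 2):
        # Pivot: bring the largest entry of column k below the diagonal to row k+1.
        kp = k + 1 + int(np.argmax(np.abs(A[k + 1:, k])))
        if kp != k + 1:
            A[[k + 1, kp], :] = A[[kp, k + 1], :]
            A[:, [k + 1, kp]] = A[:, [kp, k + 1]]
            pf = -pf               # a simultaneous row/column swap flips the sign
        pivot = A[k, k + 1]
        if pivot == 0.0:           # whole column is zero => Pfaffian vanishes
            return 0.0
        pf *= pivot
        if k + 2 < n:
            tau = A[k, k + 2:] / pivot
            col = A[k + 2:, k + 1]
            # Skew-symmetric Schur-complement update of the trailing block.
            A[k + 2:, k + 2:] += np.outer(tau, col) - np.outer(col, tau)
    return pf

# Sanity check against Pf(M)^2 = det(M) on a random antisymmetric matrix.
rng = np.random.default_rng(0)
B = rng.normal(size=(8, 8))
M = B - B.T
assert np.isclose(pfaffian(M) ** 2, np.linalg.det(M))
```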
P Broader impact

Highly accurate quantum chemical calculations are essential for understanding chemical reactions and materials properties. Our work contributes to this development by providing accurate neural-network quantum Monte Carlo calculations at broader scales thanks to generalized wave functions. While this may be used to distill more accurate force fields or exchange-correlation functionals for DFT, the societal impact of our work is primarily in the scientific domain due to the high computational cost of neural-network VMC. To the best of our knowledge, our work does not promote any negative societal impact more than general theoretical chemistry research does.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We claim the following in our abstract and introduction: (1) Neural Pfaffians are applicable to any molecular system. As outlined in Section 4.4, we ensure this by parametrizing the number of orbitals to always be larger than the number of electrons. (2) Neural Pfaffians can learn all second-row element systems' ground-state, ionization, and electron affinity energies. We demonstrate this in our first experiment in Section 5. (3) Neural Pfaffians outperform Globe on the nitrogen dimer; see the second experiment in Section 5. (4) We outperform CCSD(T) CBS on the small structures in TinyMol and reduce errors compared to TAO by factors of 10 and 6 on the small and large structures, respectively; see the third experiment in Section 5.
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: At the end of Section 4.4, we list the limitations of our work.
Guidelines: The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [NA]
Justification: Our paper does not contain theoretical results.
Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: Section 4 gives the mathematical definition of our new contribution. Appendix D details the exact model definitions. Appendix E lists all hyperparameters, and as we explain in Appendix E.2, we provide the source code to reviewers and publish it publicly upon publication.
Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We provide the source code via OpenReview to the reviewers as mentioned in Appendix E.2. The code will be made publicly available upon publication.
Guidelines: The answer NA means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: In the experimental setups (Section 5), we list the original references for structures and energies. Hyperparameters and additional details are listed in Appendix E.
Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [No]
Justification: As common in the deep-learning-based quantum Monte Carlo literature, we do not repeat experiments for different seeds due to their computational cost (see Appendix E.3) and the generally low deviations across runs. We omit error bars due to numerical integration as these are typically below the readability threshold of 0.1 mE_h.
Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates). If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We list compute resources and time in Appendix E.3.
Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: Our work respects the NeurIPS Code of Ethics in every aspect.
Guidelines: The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: We discuss this in App. P.
Guidelines: The answer NA means that there is no societal impact of the work performed.
If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: We strongly believe that there is no higher danger of misuse for our work than for traditional methods in computational chemistry, especially not at the scale where neural wave functions are applicable.
Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We list all sources for our molecular structures and reference energies at the appropriate places in Section 5. Further, we cite the code our implementation builds on in Appendices E.2 and E.
Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL.
The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: Upon publication, we will publish our source code as a new asset publicly under the MIT license. Before that, we provide an early version to the reviewers via OpenReview; see Appendix E.2.
Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: Our research does not involve human subjects.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: Our research does not involve human subjects.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.