Published in Transactions on Machine Learning Research (12/2025)

Training Dynamics of Learning 3D-Rotational Equivariance

Max W. Shen shenm19@gene.com
Genentech Computational Sciences
Ewa M. Nowara
Genentech Computational Sciences
Michael Maser
Genentech Computational Sciences
Kyunghyun Cho
Genentech Computational Sciences & New York University

Reviewed on OpenReview: https://openreview.net/forum?id=DLOIAW18W3

While data augmentation is widely used to train symmetry-agnostic models, it remains unclear how quickly and effectively such models learn to respect symmetries. We investigate this by deriving a principled measure of equivariance error that, for convex losses, calculates the percent of total loss attributable to imperfections in learned symmetry. We focus our empirical investigation on 3D-rotational equivariance in high-dimensional molecular tasks (flow matching, force field prediction, denoising voxels) and find that models quickly reduce equivariance error to under 2% of held-out loss within 1k-10k training steps, a result robust to model and dataset size. This happens because learning 3D-rotational equivariance is an easier learning task, with a smoother and better-conditioned loss landscape, than the main prediction task. For 3D rotations, the loss penalty for non-equivariant models is small throughout training, so they may achieve lower test loss than equivariant models per GPU-hour unless the equivariant efficiency gap is narrowed. We also experimentally and theoretically investigate the relationships between relative equivariance error, learning gradients, and model parameters.

1 Introduction

Machine learning modeling of molecules (generative modeling, property prediction, simulating dynamics, etc.) holds great potential for advancing scientific discovery and human health via therapeutics.
Molecules are three-dimensional physical entities whose biochemical properties are invariant or equivariant to 3D rotations.¹ To model these symmetries, two approaches are common: 1) use symmetry-respecting neural architectures, or 2) train symmetry-agnostic models with data augmentation, wherein training samples are randomly transformed by the symmetry group. This choice is made at the start of any molecular modeling project and can have a significant impact on engineering, training, and model performance, yet there has been a lack of clarity on when to prefer which approach. 3D-rotational equivariant architectures use sophisticated tensor operations to maintain equivariance (Luo et al., 2024), achieve loss scaling curves similar to non-equivariant models (Brehmer et al., 2025; Mahan et al., 2024), and are more parameter-efficient than non-equivariant models on spherical image tasks (Gerken et al., 2022). Yet they can be much slower (10x-100x) than non-equivariant models² (Gerken et al., 2022; Elhag et al., 2025; Brehmer et al., 2025), and they can be harder to optimize, based on findings that breaking exact equivariance improves learning (Qu & Krishnapriyan, 2024; Pertigkiozoglou et al., 2024; Canez et al., 2024). We call this the efficiency gap, arising both from optimization speed (training steps per second) and ease (loss reduction per training step). Meanwhile, recent work achieves strong performance on molecular machine learning tasks using non-equivariant architectures with data augmentation (Wang et al., 2024; Abramson et al., 2024; Qu & Krishnapriyan, 2024; Geffner et al., 2025).

Co-first author.
¹Molecules also have symmetries to translation, which are commonly handled by centering molecule positions.
²Slowness is also partially from less optimized code and GPU kernels.
Figure 1: Overview of the paper. (a) Schematic of twisting and twirling, which underpin a principled measure of equivariance error. (b) Loss decomposition by Taylor expansion around the twirled prediction. (c) Loss landscapes for each loss component at early model checkpoints (step=500). (d) Architectures of the three non-equivariant models studied here. (e) For MSE loss, the loss decomposition holds exactly, enabling computing the percent validation loss from equivariance error, which is plotted by training step in three settings.

To answer "are symmetry-respecting architectures worth it?", one powerful principle is: use the model that achieves better held-out loss.
In a fixed amount of GPU-hours, non-equivariant models could incur unnecessary equivariance error, leading to higher loss, but equivariant models may achieve worse test loss due to the efficiency gap. In fact, we suggest the loss penalty vs. efficiency gap tradeoff is a general explanatory framework. This work focuses on equivariance, because on rotation-invariant tasks like property prediction, symmetry-respecting architectures are relatively uncontroversial (Shoghi et al., 2024; Gasteiger et al., 2022; Nowara et al., 2025): they have a minimal efficiency gap to symmetry-agnostic architectures, as rotation-invariant features are informative and fast to compute, and standard deep learning operations easily preserve rotation invariance. In contrast, consider set permutation invariance, where the symmetry-respecting architecture is the norm. This can be explained by observing that set transformers have a minimal efficiency gap to symmetry-agnostic transformers, as set transformers simply ignore positional embeddings. While it is possible to directly compare efficiency gaps to loss penalties from imperfect symmetry, this is easily confounded by implementation details. To provide a more fundamental insight, we instead isolate and quantify a key source of potential underperformance in symmetry-agnostic models. We develop tools to investigate: what is the percent of a symmetry-agnostic model's loss that comes only from its failure to be perfectly equivariant? (§2, Fig. 1A-B) In an idealized setting (ignoring efficiency differences), this characterizes the counterfactual error reduction if we had trained a symmetry-respecting model instead. In light of efficiency gaps for 3D-rotational equivariance, this metric quantifies how small the efficiency gap must become for equivariant models to outperform non-equivariant models.
In this work, we focus our empirical investigations on three high-dimensional (R^{3N} → R^{3N}) molecular learning tasks satisfying 3D-rotational equivariance: flow matching, molecular dynamics force field prediction, and denoising voxelized atomic densities (§3, Fig. 1D). For any non-negative convex loss, we decompose the total loss with data augmentation as L(θ) = Lmean(θ) + Lequiv(θ), where Lequiv captures all information about deviation from exact equivariance. In particular, exactly equivariant models have L(θ) = Lmean(θ), i.e., Lequiv = 0. We find that equivariance error shrinks rapidly within 1k-10k training steps (minutes) to under 2% of the total loss (Fig. 1E). This occurs because Lequiv is a significantly easier learning task than Lmean: the loss landscape for Lequiv is significantly smoother and better conditioned (Fig. 1C). Strikingly, this is robust to model size, training set size, batch size, and optimizer: we find it with standard batch sizes as well as batch size 1, on training sets from 1M molecules down to as few as 500 molecules, and on model sizes from 1M to 400M. Lastly, in §4, we conduct theoretical and experimental investigations to better understand the relationship between relative equivariance error, learning gradients, and parameters. We prove that, under certain conditions, smaller equivariance error can increase the similarity between ∇L(θ) and ∇Lmean(θ), and we confirm this experimentally. We also prove a quadratic relationship between Lequiv and the parameter deviation from the subspace of exactly equivariant functions for the modern graph transformer EScAIP.

2 Measuring Equivariance & Loss Decompositions

Let f : R^D → R^D be a learnable function and let G be a compact group, for instance of 3D rotations. We consider T as the matrix representation of the action of G on R^D. A function f is G-equivariant if it commutes with all transformations T ∈ G, such that for any input x ∈ R^D, we have f(T(x)) = T(f(x)), also written (f ∘ T)(x) = (T ∘ f)(x).
Rearranging, we observe that a perfectly equivariant function satisfies, for all x, T:

(T⁻¹ ∘ f ∘ T)(x) = f(x)    (1)

We call (T⁻¹ ∘ f ∘ T)(x) the twisted prediction for x, from the twisted function T⁻¹ ∘ f ∘ T. To produce a twisted prediction³ on molecules, we sample a random rotation, use it to rotate the input molecule, pass this through the function, and un-rotate the output. The un-rotation step re-aligns the output to the original frame of the input molecule, which provides a canonical frame to compare the impact of different transformations on the output. In contrast to a perfectly equivariant function, a non-equivariant function must have some distinct transformations T₁, T₂ where the twisted predictions differ: (T₁⁻¹ ∘ f ∘ T₁)(x) ≠ (T₂⁻¹ ∘ f ∘ T₂)(x). This property motivates analyzing the distribution of twisted predictions over a uniform distribution on the group, which is the usual choice for data augmentation. For a given x:

Z_x(T) ≜ (T⁻¹ ∘ f ∘ T)(x),  T ~ Uniform(G)    (2)

Its first moment µ(x) is the group-averaged, or twirled, prediction:

µ(x) ≜ E_T[(T⁻¹ ∘ f ∘ T)(x)]    (3)

By the twirling formula, µ(x) is perfectly G-equivariant (Fulton & Harris, 1999). The second central moment of the twisted random variable is the covariance: Cov_T(Z_x(T)) = E_T[(Z_x(T) − µ(x))(Z_x(T) − µ(x))ᵀ]. The total variance, the trace of the covariance matrix, is a natural measure of equivariance error:

(1/D) E_{x,T} ‖(T⁻¹ ∘ f ∘ T)(x) − µ(x)‖²    (4)

This quantity measures the variance of the twisted predictions around their equivariant mean. An important property is that it is zero if and only if the function is perfectly equivariant. Higher moments of Z_x(T) likewise capture multivariate generalizations of skewness and kurtosis of the equivariance error.

³The name reflects a physical intuition of introducing a twist in the middle of a rope with fixed endpoints: approaching the middle, the rope twists, and after the middle, it untwists.
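The twisting machinery above can be sketched in a few lines. The following minimal Python illustration (a toy function f and approximately uniform rotations via QR; not the paper's implementation) computes twisted predictions, the twirled mean, and the equivariance error of equation 4:

```python
import numpy as np

def random_rotation(rng):
    """Random 3D rotation from QR decomposition of a Gaussian matrix
    (approximately Haar-uniform; good enough for a sketch)."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))     # fix the sign convention of the factorization
    if np.linalg.det(q) < 0:        # restrict O(3) to SO(3)
        q[:, 0] = -q[:, 0]
    return q

def twisted_prediction(f, x, T):
    """(T^-1 o f o T)(x): rotate the input, predict, un-rotate the output.
    x has shape [N, 3] (rows are atom coordinates); T is a 3x3 rotation."""
    return f(x @ T.T) @ T           # right-multiplying by T applies T^-1 = T^T to rows

def equivariance_error(f, x, n_samples=10, seed=0):
    """Monte Carlo estimate of eq. (4): per-dimension variance of the
    twisted predictions around their twirled (group-averaged) mean."""
    rng = np.random.default_rng(seed)
    Z = np.stack([twisted_prediction(f, x, random_rotation(rng))
                  for _ in range(n_samples)])
    mu = Z.mean(axis=0)             # twirled prediction, perfectly equivariant
    return np.sum((Z - mu) ** 2, axis=(1, 2)).mean() / x.size
```

An exactly equivariant map such as f(x) = 2x yields zero error, while f(x) = x + c for a fixed offset c does not, since the un-rotated offset T⁻¹c varies with T.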
2.1 Loss decomposition

Twisting and twirling provide machinery to understand a function's behavior around group actions. We can extend this machinery to analyze losses used to train models under random data augmentation, where each training point is randomly rotated. Let the data distribution p(x, y) and loss function l : R^D × R^D → R be invariant to G. That is, the joint data distribution p(x, y) satisfies, for any transformation T ∈ G, p(x, y) = p(T(x), T(y)), and for any predictions z and targets y, and for all T ∈ G, l(T(z), T(y)) = l(z, y). These conditions imply that the loss-optimal model is equivariant, and that l((f ∘ T)(x), T(y)) = l((T⁻¹ ∘ f ∘ T)(x), y). The total loss over all data and transformations is:

L(f) ≜ E_{x,y,T}[l((T⁻¹ ∘ f ∘ T)(x), y)]    (5)

We perform a Taylor expansion of the total loss around the twirled prediction µ(x) (averaged over T), and obtain terms involving central moments of the twisted random variable:

L(f) ≈ E_{x,y}[l(µ(x), y)]  [twirled prediction error]  +  (1/2) E_{x,y}[tr(H_l(µ(x), y) Cov_T[(T⁻¹ ∘ f ∘ T)(x)])]  [equivariance error]  +  higher-order terms in δ,

where δ = (T⁻¹ ∘ f ∘ T)(x) − µ(x), H_l(µ, y) is the D × D Hessian matrix of the loss with respect to its first argument, and Cov_T is a D × D covariance matrix over the distribution of transformations T.

Proposition 1. If l(z, y) = (1/D)‖z − y‖² is mean-squared error, then the total loss decomposes as: L(f) = E_{x,y}[l(µ(x), y)] + (1/D) E_{x,T}‖(T⁻¹ ∘ f ∘ T)(x) − µ(x)‖².

For MSE loss, our Taylor expansion reduces to a version of the bias-variance decomposition. The equivariance error is identical to equation 4 because MSE loss places equal weight on all dimensions. These two terms are central objects of study, so we name them:

Lmean ≜ E_{x,y}[l(µ(x), y)]    (6)

Lequiv ≜ (1/D) E_{x,T}‖(T⁻¹ ∘ f ∘ T)(x) − µ(x)‖²    (7)

Percent of loss from equivariance error. Denoting model parameters as θ, under MSE loss, we can express the total loss exactly as L(θ) = Lmean(θ) + Lequiv(θ).
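Proposition 1's exact decomposition can be checked numerically; the identity below holds for any finite cloud of twisted predictions standing in for the distribution of Z_x(T) (arbitrary vectors here, not real model outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 12
y = rng.standard_normal(D)            # target
Z = rng.standard_normal((8, D))       # 8 stand-in twisted predictions Z_x(T_i)
mu = Z.mean(axis=0)                   # twirled prediction

L_total = np.mean(np.sum((Z - y) ** 2, axis=1)) / D   # E_T[(1/D)||Z - y||^2]
L_mean = np.sum((mu - y) ** 2) / D                    # twirled prediction error
L_equiv = np.mean(np.sum((Z - mu) ** 2, axis=1)) / D  # equivariance error

# Bias-variance identity for MSE: total loss = Lmean + Lequiv, exactly.
assert np.isclose(L_total, L_mean + L_equiv)
pct_from_equiv = L_equiv / L_total    # percent of loss from equivariance error
```

The identity is exact for MSE regardless of the shape of the twisted-prediction distribution, which is what makes the percent-of-loss metric well-defined.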
As all three terms are strictly non-negative, this implies:

% MSE loss from equivariance error = Lequiv(θ) / L(θ)    (8)

Generalization to convex losses. We can further define a generalized measure of the percent of loss from equivariance error for any convex loss function with non-negative outputs, such as KL divergence or cross-entropy. By Jensen's inequality, we have Lmean(θ) ≤ L(θ), and both terms are non-negative. Furthermore, the two terms are equal if and only if the model is exactly equivariant. For convex losses, this motivates defining:

Lequiv(θ) ≜ L(θ) − Lmean(θ)  (for convex losses)    (9)

as the non-negative difference, which is compatible with: % loss from equivariance error = Lequiv(θ)/L(θ).

Finite sample estimators. Denoting twisting as Ẑ_i(x) = (T_i⁻¹ ∘ f ∘ T_i)(x), the naive Monte Carlo estimates with N group samples are µ̂(x) = (1/N) Σᵢ Ẑ_i(x), and for MSE loss, L̂mean(x) = (1/D)‖µ̂(x) − y‖² and L̂equiv(x) = (1/(ND)) Σ_{i=1}^N ‖Ẑ_i(x) − µ̂(x)‖². However, L̂equiv is a variance term, so it is biased relative to the true population statistic unless adjusted by the Bessel correction N/(N − 1). In §D.6, we derive unbiased finite-sample estimators for Lmean and Lequiv, and a bias-corrected finite-sample estimator for the percent loss from equivariance error. On 3D molecular tasks, we find that neural networks are typically smooth enough that only five to ten rotation samples are necessary to achieve a stable estimate of the twirled prediction and each loss component (§B.1). Our derivations provide a principled framework for measuring and understanding degrees of learned equivariance. An important property is that for exactly equivariant architectures, Lequiv(θ) = 0, so that L(θ) = Lmean(θ). We remark that sometimes, non-equivariant models may be trained without data augmentation, so this decomposition may not apply to the training loss.
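The bias corrections above can be sketched as follows. This is a minimal version consistent with standard unbiased-variance arguments, not the paper's exact §D.6 estimators:

```python
import numpy as np

def corrected_loss_components(Z, y):
    """Bias-corrected finite-sample estimates of (Lmean, Lequiv) for MSE loss.

    Z: [N, D] array of N twisted predictions for one input; y: [D] target.
    The naive variance is scaled by Bessel's factor N/(N-1); the naive
    Lmean estimate is then reduced by Lequiv/N, since E||mu_hat - y||^2
    exceeds ||mu - y||^2 by the variance of the N-sample mean.
    """
    N, D = Z.shape
    mu_hat = Z.mean(axis=0)
    L_equiv = (N / (N - 1)) * np.mean(np.sum((Z - mu_hat) ** 2, axis=1)) / D
    L_mean = np.sum((mu_hat - y) ** 2) / D - L_equiv / N
    return L_mean, L_equiv
```

Averaged over many simulated draws, both corrected estimates match the population values even for very small N, which matters because only five to ten rotation samples are used in practice.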
We stress that as long as we aim for these models to behave in an equivariant manner, the loss decomposition is valid for held-out loss.

3 Experiments

To gain insight into the empirical learning behavior of non-equivariant models, we apply our loss decomposition framework to three high-dimensional learning problems on 3D molecules, each with a distinct task and a modern non-equivariant model architecture. For each task, we follow the standard training procedure described in its original publication. The tasks span predictive regression, autoencoding, and generative modeling. Notably, all tasks use a mean-squared error loss, so our framework provides an exact decomposition of L(f) into Lmean and Lequiv. We report both of these metrics, as well as the percentage of the total loss attributable to the model's lack of equivariance, on a held-out dataset during training. We provide complete details on methods in §D.

Neural Network Interatomic Potential (NNIP): We consider force prediction with EScAIP (Qu & Krishnapriyan, 2024), a graph transformer architecture. The model predicts a 3D force vector for each atom based on density functional theory, mapping an input molecule with N atoms to an output in R^{3N}. This task is physically equivariant to the special orthogonal group SO(3) acting on atom coordinates in R³.

Probabilistic Flow Matching: We study a generative modeling task with Proteína (Geffner et al., 2025), a transformer-based architecture with similarities to AlphaFold3. The model learns to approximate the velocity field of a probability flow that transforms random noise into structured protein backbones. For a molecule with N alpha carbon atoms, the network maps noised atom coordinates and a time t ∈ [0, 1] to a velocity vector in R^{3N}. The learning task is made rotationally equivariant through data augmentation, aligning it with SO(3) acting on atom coordinates in R³.
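For concreteness, the rotation-augmented flow matching setup can be sketched as follows. This is a minimal linear-interpolant version with a hypothetical helper name (`cfm_training_pair`); Proteína's actual interpolant, conditioning, and schedule differ:

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """One rotation-augmented conditional flow matching example (a sketch).

    x1: [N, 3] data coordinates (e.g., C-alpha positions), assumed centered.
    Returns (x_t, t, v_target): noised input, time, and target velocity.
    """
    # Data augmentation: apply a random rotation to the data sample.
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    x1 = x1 @ q.T

    x0 = rng.standard_normal(x1.shape)      # isotropic Gaussian noise (t = 0)
    t = rng.uniform()
    x_t = (1.0 - t) * x0 + t * x1           # linear interpolant between noise and data
    v_target = x1 - x0                      # velocity target for the MSE loss
    return x_t, t, v_target
```

Because the noise distribution is isotropic and the data are randomly rotated, the loss-optimal velocity field is rotation-equivariant, which is what the loss decomposition measures during training.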
Denoising Voxelized Atomic Densities: We analyze a denoising autoencoder task with VoxMol (Pinheiro et al., 2023; Nowara et al., 2025), a non-equivariant 3D convolutional neural network. Molecules are represented as densities in a cubic voxel grid. For a grid of side length g and a atom types, the input and output are tensors of shape [g, g, g, a]. This learning task is made rotationally equivariant through data augmentation using 16 axis-preserving 90-degree rotations of a cube, which do not introduce discretization artifacts due to aliasing. These rotations are a subset of the full octahedral group O.

3.1 Force field prediction with EScAIP

We trained EScAIP 6M on a subset of SPICE with 950k training examples used by Qu & Krishnapriyan (2024) for 30 epochs with batch size 64. SPICE is a dataset of small-molecule 3D conformers with energies and forces computed by quantum-mechanical density functional theory (Eastman et al., 2024). We varied model size among 1M, 4M, and 6M, varied training set size among 950k, 50k, 5k, and 500 (with batch size 1), and varied the optimizer and learning rate. In this task, an equivariant model would output the same, yet rotated, force prediction when the input molecule rotates. We observe the following:

Equivariance is learned early and quickly, in a manner robust to training set size, model size, optimizer, and learning rate. The percent validation loss from equivariance error rapidly plummets in the first stage of training to 0.1% within 1k-10k training steps (Fig. 2A-B). Notably, this speed is independent of epoch or training set size: with a 950k training set, this occurs 25% through the first epoch; training with 500 datapoints with batch size 1, this occurs at the fourth epoch. The dip is least affected by changing model size (Fig. 2E), and most affected by the optimizer and learning rate (Fig. 2F).
Equivariance is learned quickly because it is an easier learning task than the main prediction task. The loss landscape for the equivariance error (Fig. 1C) is much smoother and better conditioned, with a 1,000x lower condition number, than the loss landscape for the twirled prediction error.

After a near-universal dip, percent loss from equivariance error can increase mildly. In the default setting, the percent increases from 0.1% to 0.3%. This is explained by a plateau in the equivariance error while the twirled prediction error continues to decrease (Fig. 2C).

Typical models converge to being nearly equivariant, with percent validation loss from equivariance error under 0.1%. The exception is training on only 500 or 5k examples: there, equivariance error continues to increase as training progresses, whereas equivariance error decreases in the long term for larger training set sizes (Fig. 2D, Supp. Fig. 6).

3.2 Flow matching with Proteína

The conditional flow matching objective at each time optimizes a mean-squared error loss, which enables us to apply our loss decomposition to study the percent loss from equivariance error. In this task, an equivariant model would output the same, yet rotated, velocity when the input noised molecule rotates. Such equivariance is a commonly desired property for molecular generative models (Geffner et al., 2025; Abramson et al., 2024). While many users may care more about final generative quality metrics, the MSE loss plays a critical role in training, monitoring, and in defining the target loss-optimal velocity field. In particular, any non-equivariance in the model's learned generative distribution is caused by a non-equivariant velocity field. This motivates tracking and understanding the percent loss from equivariance error of flow matching models at various t, which may enable refining training strategies to improve equivariance.
We trained Proteína at 60M without triangular attention and at 400M with triangular attention on the full Protein Data Bank (PDB) dataset with 225k training examples. We also trained models on 1% of the PDB with 2k examples and 0.1% with 200 examples. Flow matching trains a model jointly over the flow matching time t, ranging from t = 0 for noise to t = 1 for data. We measure metrics at t = 0, 0.2, 0.4, 0.6, 0.8, 0.9, 0.95, and 0.99, and in Figure 3 use red colors for high t close to the data and blue-purple colors for low t near noise.

We observe that the equivariance learning dip occurs early for all t, in a manner robust to training set size and model size. Following the dip at 1k-10k training steps (Fig. 3A), low t (closer to noise) are more equivariant, while high t (closer to data) are less equivariant, spiking after the dip. This holds for the 400M model (Fig. 3E) and for the 60M model trained on 1% and 0.1% of the PDB (Fig. 3F-H). Interestingly, the dip occurs at the same number of training steps despite different training set sizes, so it occurs 4% through one epoch when trained on the full PDB, but around epoch 53 when trained on 0.1% of the PDB.

Figure 2: Training dynamics of learning equivariance in EScAIP (force field prediction). (a-c) Validation losses and percent validation loss from equivariance error during training, early in training (a), with log-log axes (b), and decomposed into separate terms (c). (d-f) Impact of varying training set size (d), model size (e), and optimizer or learning rate (f).

After 1M training steps, the model's equivariance error is low across t, with a maximum value around 6% at t = 0.90, though it is lower at the extremes: 3% at t = 0.99 and 0.04% at t = 0 (Fig. 3B).
This indicates that the time-conditional velocity field learned by the model is approximately equivariant. Our finding that, at 1M training steps, t = 0.90 is the least equivariant timestep is relevant for designing training augmentation or test-time strategies for improving equivariance. Notably, this finding highlights the advantage of our loss decomposition framework. Vonessen et al. (2025) measure equivariance error in Proteína by the variance of the normalized twisted prediction, but this is not interpretable as a percent of loss, and thus conflates task difficulty, which gets easier as t → 1, with equivariance error. We correct for this issue, and find that t = 0.9 is the most problematic time for non-equivariance, whereas they find t = 0.5.

3.3 Denoising voxelized atomic densities with VoxMol

We trained VoxMol 111M on GEOM-drugs, a dataset of 3D structures of drug-like molecules with 1.1M training examples. We also trained models on 1% (11k), 10% (110k), 25% (275k), and 50% (550k) of the examples, and models of varying size: full (111M parameters), small (28M), and tiny (7M). In this autoencoding task, an equivariant model would output a rotated predicted reconstruction when the input molecule rotates; this is a commonly desired property when using the decoder as a generative model (Pinheiro et al., 2023).

We observe that the equivariance learning dip occurs early, in a manner robust to training set and model size. Across the training set sizes, all models rapidly reduce their percent validation loss from equivariance error from an initial 80% to 2% within 1k-10k training steps (Fig. 4A-B). At 50k training steps, models have around 5-10% validation loss from equivariance error. Beyond 50k training steps, the twirled prediction error continues to decrease while the equivariance error plateaus, or decreases more slowly, below 1e-5 (Fig. 4C).
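The axis-preserving 90-degree cube rotations used for VoxMol's augmentation can be enumerated as signed permutation matrices. A sketch generating the 24-element proper rotation group of the cube, of which the paper uses a 16-element subset:

```python
import itertools
import numpy as np

def cube_rotation_group():
    """All rotations of a cube: signed 3x3 permutation matrices with det +1.

    Enumerates the 48 signed permutations (the full octahedral group including
    reflections) and keeps the 24 proper rotations, i.e., the group O."""
    mats = []
    for perm in itertools.permutations(range(3)):
        for signs in itertools.product([1, -1], repeat=3):
            M = np.zeros((3, 3), dtype=int)
            for row, (col, s) in enumerate(zip(perm, signs)):
                M[row, col] = s
            if round(np.linalg.det(M)) == 1:   # proper rotations only
                mats.append(M)
    return mats

rotations = cube_rotation_group()   # 24 matrices
```

Applying one of these matrices to voxel indices corresponds to a composition of numpy.rot90 calls on the grid axes, so augmentation needs no interpolation and introduces no aliasing artifacts.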
Figure 3: Training dynamics of learning equivariance in Proteína (flow matching). Colors indicate flow matching time, with noise at t = 0 and data at t = 1. (a) Percent validation loss from equivariance error during training. (b) Bar plot of the percent validation loss from equivariance error, by flow matching time, at a final checkpoint after 1M training steps. (c-d) Validation losses by training step. (e-h) Impact of varying model size (e) and training set size (f-h).

Figure 4: Training dynamics of learning equivariance in VoxMol (denoising voxelized atomic densities). (a) Percent validation loss from equivariance error during training. (b-c) Validation losses by training step.

3.4 Loss Landscape Analysis

To better understand the initial dip, we studied loss landscapes for Lmean and Lequiv at early checkpoints (500 steps). We computed the Hessian of each loss on a training batch for a subset of 33k parameters including non-linear layers for EScAIP, 1.5k parameters in Proteína's linear head, and 6.9k parameters in a final layer of VoxMol. For EScAIP, we measured condition numbers around 1e9 for Lmean and 1e6 for Lequiv (~1,000x smaller). For Proteína, we measured 2e10 for Lmean and 1e8 for Lequiv (~100x smaller). For VoxMol, we measured 5e9 and 6e8, respectively (~10x smaller).
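The condition numbers reported here are ratios of the largest to smallest strictly positive Hessian eigenvalue. A minimal sketch of that computation on explicit toy Hessians (diagonal matrices standing in for the models' Hessians):

```python
import numpy as np

def condition_number(H):
    """Ratio of largest to smallest strictly positive eigenvalue of a
    (symmetrized) Hessian, as used to compare loss-landscape conditioning."""
    eig = np.linalg.eigvalsh((H + H.T) / 2)   # symmetrize for numerical stability
    pos = eig[eig > 0]
    return pos.max() / pos.min()

# Toy illustration: an ill-conditioned quadratic (like Lmean's landscape)
# vs. a well-conditioned one (like Lequiv's landscape).
H_mean_like = np.diag([1e9, 5.0, 1.0])
H_equiv_like = np.diag([1e6, 5.0, 1.0])
ratio = condition_number(H_mean_like) / condition_number(H_equiv_like)  # ~1,000x
```

A larger condition number means gradient descent must use a step size set by the steepest direction while progress along flat directions is slow, which is the mechanistic sense in which Lequiv is the easier optimization problem.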
We calculate condition numbers for Lmean and Lequiv using the largest positive and smallest positive eigenvalues for each loss, and plot them over 20 minibatches in Figures 12 and 13. For loss landscape plotting, we chose two axes using the eigenvectors of the largest positive and smallest positive eigenvalues of the total loss, and used the same step size and grid for Lmean and Lequiv. In both models, we find that Lequiv has a substantially smoother loss landscape than Lmean (Fig. 1C).

3.5 Do Latent Representations Learn to Respect Equivariance?

The three model architectures studied here have substantial differences (Fig. 1D), yet they display some similarities in their training dynamics of learning equivariance. A natural question is whether their latent representations learn to respect equivariance during training. We provide an analysis here, with further details in §D.

EScAIP's latent representation is rotation-invariant. EScAIP uses rotation-invariant features, and all intermediate layers maintain rotation invariance. Thus, the final representation is exactly rotation-invariant by design (Fig. 1D). Note that the whole architecture is not invariant or equivariant, due to the final prediction head (see eq. 17).

Proteína's latent representation is approximately equivariant. The architecture acts directly on 3D C-alpha coordinates, and does not use rotation-invariant or equivariant features. Its final latent sequence representation is mapped by a linear head into the model output (Fig. 1D). Thus, when the model is empirically approximately equivariant, the final latent is also approximately equivariant.

VoxMol's latent representations are neither equivariant nor invariant. Unlike EScAIP and Proteína, where first-principles reasoning suffices, we had to study VoxMol empirically.
To evaluate if latents were equivariant, for an input molecule and a given rotation, we measured the cosine similarity between the rotated latent and the latent of the rotated molecule. We found a median of 0.6, comparable to the cosine similarity between latents of different molecules, indicating a lack of equivariance (Fig. 11). To evaluate if latents were invariant, we measured the cosine similarity between the latents of different rotations of the same molecule as 0.64, which is statistically significantly higher, though with a small effect size, than the cosine similarity between latents of different molecules at 0.58.

4 Impact of Lequiv ≪ Lmean on Gradients and Parameters

Our empirical results showed that early in training, equivariance error rapidly diminishes, such that Lequiv is small relative to Lmean. In this section, we investigate empirically and theoretically the connections between equivariance error, gradients, and parameters. For non-negative convex losses, the loss decomposition (eq. 9) implies an analogous decomposition of the gradients:

∇L(θ) = ∇Lmean(θ) + ∇Lequiv(θ)    (10)

∇L(θ) = ∇Lmean(θ)  (for exactly equivariant models)    (11)

When ∇Lequiv(θ) vanishes, the model is said to follow equivariant learning dynamics. While gradient norms in general have a complex relationship with loss norms, in §4.1 we show that in certain regions of parameter space, as Lequiv shrinks, ∇L(θ) can become more similar to ∇Lmean(θ). Experimentally, we find moderate-to-strong correlations between the percent loss from equivariance error and the similarity of ∇L(θ) to ∇Lmean(θ) throughout training.

In §4.2, we consider a parameter decomposition framework from Nordenfors et al. (2025), which shows that when certain assumptions hold, model parameters can be orthogonally decomposed:

θ = θ_E + θ_{E⊥}    (12)

where θ_{E⊥} is the model's parameter deviation from the subspace of perfectly equivariant functions E.
These assumptions do not always hold, but they do hold for EScAIP's force prediction head, enabling us to prove a quadratic relationship between Lequiv and parameter deviation from the subspace of exactly equivariant functions for this modern graph transformer. Experimentally, we find the two are closely linked over training, with Spearman correlation 0.99. Finally, in general settings where the parameter decomposition does hold, we use our loss decomposition to derive additional relationships between θ_{E⊥} and Lequiv for MSE loss.

4.1 Equivariance error and gradients

We consider the relative loss ratio from equivariance error for non-negative convex losses:

ϵ(θ) ≜ Lequiv(θ) / Lmean(θ)    (13)

In Proposition 2, we make a mild assumption, commonly satisfied in practice, that the deep neural network is analytic, i.e., it is composed from analytic activation functions, and that the losses are analytic. By standard real analysis results, analyticity implies smoothness on compact subsets of parameter space (such as parameters explored in training; Lemma 7). A function f is M_f(U)-smooth in a compact subset U if ‖∇f(x) − ∇f(x′)‖ ≤ M_f(U)‖x − x′‖ for all x, x′ ∈ U. Smoothness in turn can be used to derive a bound on the similarity of ∇L(θ) and ∇Lmean(θ) in terms of ϵ(θ). In practice, the bound can be weak when M_f(U) is large, and it becomes vacuous near saddle points where ∇Lmean(θ) → 0. Nevertheless, this result sheds light on the structure of the relationship between ϵ(θ) and the gradients.

Proposition 2. Let the model f_θ be analytic (i.e., a deep neural network constructed from analytic activation functions), and suppose the analytic loss satisfies L(θ) = Lmean(θ) + Lequiv(θ) (i.e., for convex losses via eq. 9).
Then, for any compact subset U of the parameter space of fθ on which ϵ(θ) is well-defined (the denominator is non-zero) and which includes an optimum θ* with ϵ(θ*) = 0, there exists a finite constant Mϵ(U) = sup_{θ∈U} ‖∇²ϵ(θ)‖ such that the approximation ∇L(θ) ≈ ∇Lmean(θ) holds for all θ ∈ U with relative error bounded by:

‖∇L(θ) − ∇Lmean(θ)‖ / ‖∇Lmean(θ)‖ ≤ ϵ(θ) + (Lmean(θ) / ‖∇Lmean(θ)‖) √(2 Mϵ(U) ϵ(θ))   (14)

Proof. Provided in C.3.

Near the global optimum, we can derive a different bound in Proposition 3 under the same assumptions, but employing the Kurdyka-Łojasiewicz (KŁ) inequality (Kurdyka, 1998; Dereich & Kassing, 2021). The KŁ inequality is a generalization of the Polyak-Łojasiewicz condition, itself a generalization of convexity, which has been used to study convergence rates of stochastic gradient descent under conditions more realistic for deep neural networks (Scaman et al., 2022). The KŁ inequality holds locally under mild conditions, requiring only analyticity, and states that there exists a compact local neighborhood U around any critical point θ* and constants c > 0, α ∈ [1, 2) such that, for all θ ∈ U:

‖∇f(θ)‖² ≥ c |f(θ) − f(θ*)|^α   (15)

This is mathematically an inequality in the opposite direction from smoothness. Whereas smoothness ensures the function does not change too quickly, the KŁ inequality says the function does not change too slowly, which is important for gradient descent convergence rates. For our purposes, smoothness gives an upper bound on the numerator, while the KŁ inequality provides a lower bound on the denominator; combining the two gives the ratio bound.

Proposition 3. Let the model fθ be analytic (i.e., a deep neural network constructed from analytic activation functions), and suppose L(θ) = Lmean(θ) + Lequiv(θ) (i.e., for convex losses via 9) is analytic and attains a minimum value of 0.
Then, there exists a compact neighborhood U around the global minimum θ* and finite constants c > 0, M_{Lequiv}(U), α ∈ [1, 2) such that for all θ ∈ U, the approximation ∇L(θ) ≈ ∇Lmean(θ) holds with relative error bounded by:

‖∇L(θ) − ∇Lmean(θ)‖ / ‖∇Lmean(θ)‖ ≤ √( 2 M_{Lequiv}(U) Lequiv(θ) / (c Lmean(θ)^α) )   (16)

Proof. Provided in C.4.

Experimental investigation. Propositions 2 and 3 show that in certain conditions, a function of the loss ratio upper-bounds the gradient norm ratio. We investigated this empirically by measuring the loss ratio and the gradient norm ratio during training, which we plot in B.3. We find strong log-log correlations between the loss ratio and the gradient norm ratio: Pearson R = 0.75 over training in EScAIP, and statistically significant R = 0.23 to 0.95 for Proteína across all flow matching times. Interestingly, we observe stronger correlations at smaller lag times, suggestive of stable coefficients within basins, with a sparse number of transitory windows where the models seem to move between "basins". Collectively, these experimental results provide complementary insight into the relationship between relative equivariance error and model gradients.

4.2 Equivariance error and deviation from equivariant parameter subspace

In the following analysis, we adopt the mathematical framework of Nordenfors et al. (2025) for analyzing neural network parameters in terms of equivariant and non-equivariant parameter subspaces; under certain conditions, model parameters can be decomposed into orthogonal components: θ = θE + θE⊥. This framework relies on three assumptions: (i) the symmetry group is compact and acts on finite-dimensional hidden spaces; (ii) the neural net non-linearities are equivariant; and (iii) the loss is invariant.
Under these conditions, the total parameter space is an inner product space, and the subspace of perfectly equivariant functions E is a linear subspace, which together enable the orthogonal decomposition θ = θE + θE⊥. This framework applies to a broad class of modern neural network operations and architectures, including fully connected layers with non-linearities, convolutions, residual connections, and attention layers. It also covers a broad class of symmetry groups, including SO(3) and all groups studied in this work. We provide more detail in A.1 and refer the interested reader to Nordenfors et al. (2025).

Our scope and contributions here are as follows. Assumptions (i-iii) and the parameter decomposition θ = θE + θE⊥ do not always hold, but they do hold for EScAIP's force prediction head, enabling us to prove a quadratic relationship between Lequiv and parameter deviation from the subspace of exactly equivariant functions for this modern graph transformer. Furthermore, in general settings where the parameter decomposition does hold, Nordenfors et al. (2025) study its implications, such as models following approximately equivariant learning dynamics when ‖θE⊥‖ is small. In this general setting, we apply our loss decomposition to derive some relationships between θE⊥ and Lequiv.

What is the relationship between Lequiv and θE⊥ when θ = θE + θE⊥? In general for neural networks, Lequiv is a complex, highly non-linear function of θ. However, we know that Lequiv is non-negative, continuous, and equal to zero iff θE⊥ = 0. By these properties, if θE⊥ is small, then Lequiv is small. More formally, for any ϵ > 0, there exists a δ > 0 such that if the parameter deviation is small (‖θE⊥‖ < δ), then the equivariance error is also small (Lequiv < ϵ).

The EScAIP architecture is a modern graph transformer that achieved strong results on NNIP energy and force prediction tasks, and it satisfies assumptions (i-iii).
The EScAIP architecture uses rotation-invariant features derived from an input molecular graph. Its hidden representations for atoms and edges, denoted h, are rotation-invariant throughout the network. Force prediction outputs a 3D force vector at each atom in a molecule. For a single atom with a set of 3D edge vectors E (the vectors pointing from one atom to another) in a molecule x, EScAIP predicts the force vector componentwise as:

F_k(x) = Σ_{e∈E} e_k (w_kᵀ h(e, x)),   k ∈ {x, y, z}   (17)

where e ∈ R³ is a 3D edge vector with components (e_x, e_y, e_z), h(e, x) ∈ R^h is the last hidden representation of the edge e in molecule x, and W = [wx, wy, wz], where each w ∈ R^h, holds the parameters of a linear head with no bias. The 3D edge vectors e are rotation-equivariant with respect to the input molecule, while the hidden representation h(e, x) is rotation-invariant to the input molecule, but composing these to form the output prediction generally breaks both invariance and equivariance.

In particular, force predictions are equivariant if and only if the scalar projections of the hidden features are independent of the coordinate axis, i.e., wxᵀ h(e, x) = wyᵀ h(e, x) = wzᵀ h(e, x) for all inputs. Under the mild assumption of a non-degenerate learned embedding function h(e, x), such that the set of all possible hidden vectors spans the feature space, this condition holds if and only if the parameter vectors themselves are identical: wx = wy = wz. This condition defines the subspace E for the EScAIP architecture. Using this, we decompose W = WE + WE⊥ with an equivariant part WE = [w̄, w̄, w̄] ∈ E, where w̄ = (1/3)(wx + wy + wz), and a non-equivariant part WE⊥ = [dx, dy, dz] ∈ E⊥, where dx = wx − w̄, and similarly for y, z. With this setup, we can now establish that the equivariance error of the EScAIP architecture has a quadratic relationship with the magnitude of the parameter deviation from E, the space of perfectly equivariant functions.

Theorem 4.
For the EScAIP architecture trained with mean-squared error loss on a non-degenerate dataset, for any fixed set of upstream parameters θ \ W, there exist positive constants 0 < λmin ≤ λmax (which depend on the model architecture, data distribution, and the other parameters θ \ W) such that:

λmin ‖WE⊥‖²_F ≤ Lequiv(θ) ≤ λmax ‖WE⊥‖²_F   (18)

Proof. Provided in C.5.

Experimental investigation. In B.4, we empirically plot the force prediction head's deviation from its mean against equivariance error, and find a Pearson correlation of 0.94 and Spearman correlation of 0.99 over training, which supports our conclusion.

The preceding analysis can be generalized to a broader class of neural networks. Applying a Taylor expansion to Lequiv(θ) for the neural net f on an input x, we have:

f(x; θE + θE⊥) = f(x; θE) + J_{θE⊥} f(x; θE) θE⊥ + O(‖θE⊥‖²)

where J_{θE⊥} f(x; θE) is the Jacobian of the network output with respect to the parameter components θE⊥, evaluated at θE. The key structure, analogous to the EScAIP argument, is the decomposition of the neural net output into a purely equivariant term, a term linear in θE⊥, and a remainder term. With this setup, for a broad class of neural network architectures, we can show that, locally near E, Lequiv is quadratic in ‖θE⊥‖ (Thm. 5), and its gradient norm is linear in ‖θE⊥‖ (Thm. 6).

Theorem 5. For any neural network whose parameters can be expressed as θ = θE + θE⊥ with θE ∈ E and θE⊥ ∈ E⊥, and for equivariance error Lequiv defined by the variance of the output with respect to transformations, there exist positive constants 0 < λmin ≤ λmax such that for a non-degenerate dataset, using ‖·‖ to denote the L2-norm:

λmin ‖θE⊥‖² + O(‖θE⊥‖³) ≤ Lequiv(θ) ≤ λmax ‖θE⊥‖² + O(‖θE⊥‖³)   (19)

Proof. Provided in C.6.

Theorem 6. Under the same conditions as Thm. 5, the norm of the gradient of the equivariance loss with respect to the non-equivariant parameters is bounded by the deviation itself. Specifically, there exists a constant C such that:

‖∇_{θE⊥} Lequiv(θ)‖ ≤ C ‖θE⊥‖

Proof. Provided in C.7.
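The head decomposition W = WE + WE⊥ and the quadratic scaling in Theorem 4 can be checked in a self-contained numerical sketch. The toy dimensions, random invariant edge features, and helper names below are our own assumptions, not EScAIP's actual code.

```python
import numpy as np

rng = np.random.default_rng(1)
H, n_edges = 16, 5                     # hypothetical hidden width / edge count

def random_rotation(rng):
    # Random proper rotation via QR with a sign fix (det = +1).
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1
    return q

# Linear force head W = [wx, wy, wz] as columns, no bias.
W = rng.normal(size=(H, 3))
w_bar = W.mean(axis=1, keepdims=True)  # w_bar = (wx + wy + wz) / 3
W_E = np.repeat(w_bar, 3, axis=1)      # equivariant part [w_bar, w_bar, w_bar]
W_perp = W - W_E                       # deviations [dx, dy, dz]
# The two parts are orthogonal under the Frobenius inner product.
assert abs((W_E * W_perp).sum()) < 1e-9

E_vecs = rng.normal(size=(n_edges, 3))   # 3D edge vectors (rotate with input)
feats = rng.normal(size=(n_edges, H))    # hidden features (rotation-invariant)

def force(E_rot, W):
    # F_k = sum_e e_k * (w_k . h_e), one component per axis k.
    return (E_rot * (feats @ W)).sum(axis=0)

def equiv_error(W, rots):
    # Variance over rotations of the twisted prediction; the invariant
    # features are unchanged, only the edge vectors rotate.
    tw = np.stack([force(E_vecs @ R.T, W) @ R for R in rots])
    return tw.var(axis=0).sum()

rots = [random_rotation(rng) for _ in range(256)]
assert equiv_error(W_E, rots) < 1e-12          # wx = wy = wz: equivariant
e1 = equiv_error(W_E + 1.0 * W_perp, rots)
e2 = equiv_error(W_E + 2.0 * W_perp, rots)
assert np.isclose(e2 / e1, 4.0)                # doubling ||W_perp|| quadruples L_equiv
```

The last assertion is the content of Theorem 4 in miniature: the twisted prediction is affine in the scale of W_perp, so its variance over rotations scales exactly quadratically.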
5 Related Work

Prior work has measured learned equivariance with a wide variety of approaches (Kvinge et al., 2022; Karras et al., 2021; Geffner et al., 2025; Qu & Krishnapriyan, 2024; Gruver et al., 2023; Nowara et al., 2025; Fuchs et al., 2020), but to our knowledge, this work is the first to derive a measure of equivariance error that is interpretable as a percent of loss. Notably, many prior measures effectively estimate equivariance error as a pairwise deviation using only two samples per datapoint, whereas we estimate variance around a mean using as many samples of the twisted prediction as necessary to obtain stable estimates. Vonessen et al. (2025) use the variance of the normalized twisted prediction, but this is not interpretable as a percent of loss. They study flow matching, but their metric conflates task difficulty, which gets easier as t → 1, with equivariance error. We correct for this issue, and find that t = 0.9 is the most problematic time for non-equivariance, whereas they find t = 0.5 instead. Canez et al. (2024) find that relaxing architectures from exact equivariance improves loss landscape conditioning and achieves better loss than perfectly equivariant architectures on image super-resolution and fluid dynamics modeling.

Twirling serves as a simple yet powerful postprocessing operation to transform any learned function into an equivariant one at test time. Ideas like this have been explored in Pozdnyakov & Ceriotti (2023); Nordenfors et al. (2025). Canonicalizing inputs to a reference frame is another simple yet powerful way to convert any function into an equivariant one (Mondal et al., 2023; Gandikota et al., 2021).

6 Discussion

In this work, we found that 3D-rotational equivariance is learned easily and quickly. We described a two-phase learning dynamic: initially, models rapidly learn equivariance.
This occurs because learning equivariance is an easier task, with a smoother and better-conditioned loss landscape, than the main prediction task. After training, the final percent loss from equivariance error is small for all models, but it is notably smaller for EScAIP at 0.006% than for Proteína and VoxMol (< 5%). While all of these loss penalties are small, and easily remedied by test-time postprocessing techniques like twirling or input frame canonicalization, this observation may also motivate research on architecture design to narrow this gap.

Intriguingly, equivariance is learned rapidly despite significant differences in model architectures. EScAIP is "nearly equivariant", as it becomes exactly equivariant with only a small change to its final linear head, yet its initial dip occurs just as quickly as for Proteína and VoxMol, which are far from being architecturally equivariant. It is also interesting that each model's latents learn (or fail to learn) to respect symmetries in different ways.

Our work establishes a principled and unified framework for quantifying equivariance error for non-negative convex losses. We focused our empirical study on 3D rotations, as this is a physically important symmetry group for biomolecules, but other symmetry groups may be easier or harder to learn. Looking forward, our framework could be used to study the learning dynamics of equivariance on other symmetry groups.

Acknowledgements

We thank Pan Kessel and Saeed Saremi for helpful discussions.

Code Availability

We provide code at https://github.com/genentech/equivariance_learning. Our code simply adds callbacks to compute equivariance metrics during training on top of the original EScAIP, Proteína, and VoxMol codebases.
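The twirling operation discussed above can be sketched as a group average f_tw(x) = E_T[T⁻¹ f(T x)]. As a runnable toy (our own construction, not the paper's code), we twirl over the finite 24-element rotation group of the cube, which makes the averaged function exactly equivariant to that subgroup; twirling over all of SO(3) would instead use Monte Carlo rotation samples.

```python
import numpy as np

def cube_rotations():
    # Generate the 24 rotation matrices of the cube by closing the set
    # {90-degree rotations about x and z} under multiplication.
    Rx = np.array([[1, 0, 0], [0, 0, -1], [0, 1, 0]])
    Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]])
    seen = {tuple(np.eye(3, dtype=int).flatten())}
    frontier = [np.eye(3, dtype=int)]
    while frontier:
        M = frontier.pop()
        for G in (Rx, Rz):
            N = G @ M
            key = tuple(N.flatten())
            if key not in seen:
                seen.add(key)
                frontier.append(N)
    return [np.array(k).reshape(3, 3) for k in seen]

def twirl(f, group):
    # f_tw(x) = (1/|G|) sum_T T^-1 f(T x); rows of x are 3D points.
    def f_tw(x):
        return np.mean([f(x @ R.T) @ R for R in group], axis=0)
    return f_tw

group = cube_rotations()
assert len(group) == 24

# A non-equivariant toy function becomes exactly equivariant to the group.
f = lambda x: np.tanh(x @ np.array([[1.0, 0.2, 0.0],
                                    [0.0, 0.8, 0.1],
                                    [0.3, 0.0, 1.1]]))
f_tw = twirl(f, group)
x = np.random.default_rng(2).normal(size=(4, 3))
for R in group:
    assert np.allclose(f_tw(x @ R.T), f_tw(x) @ R.T)
```

The exactness follows from group closure: rotating the input only permutes the terms of the average.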
References

Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, Sebastian W Bodenstein, David A Evans, Chia Chun Hung, Michael O'Neill, David Reiman, Kathryn Tunyasuvunakool, Zachary Wu, Akvilė Žemgulytė, Eirini Arvaniti, Charles Beattie, Ottavia Bertolli, Alex Bridgland, Alexey Cherepanov, Miles Congreve, Alexander I Cowen-Rivers, Andrew Cowie, Michael Figurnov, Fabian B Fuchs, Hannah Gladman, Rishub Jain, Yousuf A Khan, Caroline M R Low, Kuba Perlin, Anna Potapenko, Pascal Savy, Sukhdeep Singh, Adrian Stecula, Ashok Thillaisundaram, Catherine Tong, Sergei Yakneen, Ellen D Zhong, Michal Zielinski, Augustin Žídek, Victor Bapst, Pushmeet Kohli, Max Jaderberg, Demis Hassabis, and John M Jumper. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630(8016):493–500, June 2024.

Johann Brehmer, Sönke Behrends, Pim De Haan, and Taco Cohen. Does equivariance matter at scale?, 2025. URL https://openreview.net/forum?id=iIWeyfGTof.

Diego Canez, Nesta Midavaine, Thijs Stessen, Jiapeng Fan, Sebastian Arias, and Alejandro Garcia. Effect of equivariance on training dynamics, July 2024.

Steffen Dereich and Sebastian Kassing. Convergence of stochastic gradient descent schemes for Łojasiewicz-landscapes. CoRR, abs/2102.09385, 2021. URL https://arxiv.org/abs/2102.09385.

Peter Eastman, Benjamin P. Pritchard, John D. Chodera, and Thomas E. Markland. Nutmeg and SPICE: Models and data for biomolecular machine learning. Journal of Chemical Theory and Computation, 20(19):8583–8593, 2024. doi: 10.1021/acs.jctc.4c00794. URL https://doi.org/10.1021/acs.jctc.4c00794. PMID: 39318326.

Ahmed A. Elhag, T. Konstantin Rusch, Francesco Di Giovanni, and Michael Bronstein. Relaxed equivariance via multitask learning, 2025. URL https://arxiv.org/abs/2410.17878.
Fabian Fuchs, Daniel Worrall, Volker Fischer, and Max Welling. SE(3)-Transformers: 3D roto-translation equivariant attention networks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1970–1981. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/15231a7ce4ba789d13b722cc5c955834-Paper.pdf.

William Fulton and Joe Harris. Representation Theory. Graduate Texts in Mathematics. Springer, New York, NY, 1st edition, July 1999.

Kanchana Vaishnavi Gandikota, Jonas Geiping, Zorah Lähner, Adam Czapliński, and Michael Moeller. Training or architecture? How to incorporate invariance in neural networks, 2021. URL https://arxiv.org/abs/2106.10044.

Johannes Gasteiger, Muhammed Shuaibi, Anuroop Sriram, Stephan Günnemann, Zachary Ward Ulissi, C. Lawrence Zitnick, and Abhishek Das. GemNet-OC: Developing graph neural networks for large and diverse molecular simulation datasets. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. URL https://openreview.net/forum?id=u8tvSxm4Bs.

Tomas Geffner, Kieran Didi, Zuobai Zhang, Danny Reidenbach, Zhonglin Cao, Jason Yim, Mario Geiger, Christian Dallago, Emine Kucukbenli, Arash Vahdat, and Karsten Kreis. Proteina: Scaling flow-based protein structure generative models. In International Conference on Learning Representations (ICLR), 2025.

Jan Gerken, Oscar Carlsson, Hampus Linander, Fredrik Ohlsson, Christoffer Petersson, and Daniel Persson. Equivariance versus augmentation for spherical images. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 7404–7421. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/gerken22a.html.

Nate Gruver, Marc Anton Finzi, Micah Goldblum, and Andrew Gordon Wilson.
The Lie derivative for measuring learned equivariance. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=JL7Va5Vy15J.

Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 852–863. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/076ccd93ad68be51f23707988e934906-Paper.pdf.

Krzysztof Kurdyka. On gradients of functions definable in o-minimal structures. Annales de l'institut Fourier, 48(3):769–783, 1998. URL http://eudml.org/doc/75302.

Henry Kvinge, Tegan Emerson, Grayson Jorgenson, Scott Vasquez, Timothy Doster, and Jesse Lew. In what ways are deep neural networks invariant and how should we measure this? In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=SCD0hn3kMHw.

Shengjie Luo, Tianlang Chen, and Aditi S. Krishnapriyan. Enabling efficient equivariant operations in the Fourier basis via Gaunt tensor products. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mhyQXJ6JsK.

Scott Mahan, Davis Brown, Timothy Doster, and Henry Kvinge. What makes a machine learning task a good candidate for an equivariant network? In ICML 2024 Workshop on Geometry-grounded Representation Learning and Generative Modeling, 2024. URL https://openreview.net/forum?id=46vfUIfIo1.

Arnab Kumar Mondal, Siba Smarak Panigrahi, Sékou-Oumar Kaba, Sai Rajeswar, and Siamak Ravanbakhsh. Equivariant adaptation of large pretrained models.
In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=m6dRQJw280.

Oskar Nordenfors, Fredrik Ohlsson, and Axel Flinth. Optimization dynamics of equivariant and augmented neural networks. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=PTTa3U29NR.

Ewa Nowara, Pedro O Pinheiro, Sai Pooja Mahajan, Omar Mahmood, Andrew Martin Watkins, Saeed Saremi, and Michael Maser. Nebula: Neural empirical Bayes under latent representations for efficient and controllable design of molecular libraries. In ICML 2024 AI for Science Workshop, 2024.

Ewa M. Nowara, Joshua Rackers, Patricia Suriana, Pan Kessel, Max Shen, Andrew Martin Watkins, and Michael Maser. Do we need equivariant models for molecule generation?, 2025. URL https://arxiv.org/abs/2507.09753.

Stefanos Pertigkiozoglou, Evangelos Chatzipantazis, Shubhendu Trivedi, and Kostas Daniilidis. Improving equivariant model training via constraint relaxation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=tWkL7k1u5v.

Pedro O. Pinheiro, Joshua Rackers, Joseph Kleinhenz, Michael Maser, Omar Mahmood, Andrew Martin Watkins, Stephen Ra, Vishnu Sresht, and Saeed Saremi. 3D molecule generation by denoising voxel grids. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Zyzluw0hC4.

Sergey Pozdnyakov and Michele Ceriotti. Smooth, exact rotational symmetrization for deep learning on point clouds. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=CdSRFn1fVe.

Eric Qu and Aditi S. Krishnapriyan. The importance of being scalable: Improving the speed and accuracy of neural network interatomic potentials across chemical domains. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
URL https://openreview.net/forum?id=Y4mBaZu4vy.

Saeed Saremi and Aapo Hyvärinen. Neural empirical Bayes. Journal of Machine Learning Research, 20, 2019. ISSN 1532-4435.

Kevin Scaman, Cedric Malherbe, and Ludovic Dos Santos. Convergence rates of non-convex stochastic gradient descent under a generic Łojasiewicz condition and local smoothness. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 19310–19327. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/scaman22a.html.

Nima Shoghi, Adeesh Kolluru, John R. Kitchin, Zachary Ward Ulissi, C. Lawrence Zitnick, and Brandon M Wood. From molecules to materials: Pre-training large generalizable models for atomic property prediction. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=PfPnugdxup.

Carlos Vonessen, Charles Harris, Miruna Cretu, and Pietro Liò. Tabasco: A fast, simplified model for molecular generation with improved physical quality, 2025. URL https://arxiv.org/abs/2507.00899.

Yuyang Wang, Ahmed A. Elhag, Navdeep Jaitly, Joshua M. Susskind, and Miguel Ángel Bautista. Swallowing the bitter pill: Simplified scalable conformer generation. In Forty-first International Conference on Machine Learning, 2024.

A.1 Parameter space decomposition

Here, we describe in greater detail the mathematical framework of Nordenfors et al. (2025) for analyzing the geometry of neural network parameters in terms of equivariant and non-equivariant parameter subspaces.
This framework relies on three assumptions: (i) the symmetry group is compact and acts on finite-dimensional hidden spaces; (ii) the neural net non-linearities are equivariant; and (iii) the loss is invariant. In practice, conditions (i) and (iii) are satisfied in many settings. Notably, the group SO(3) of 3D rotations is compact, while SE(3), which includes translations, is not; this is readily handled by restricting modeling to positionally centered data. Finite-dimensionality of hidden spaces is easily satisfied by neural networks. The belief that a task is equivariant or invariant implies that the correct loss to use must be invariant. Condition (ii) is more setting-specific: it is satisfied in the VoxMol setting of voxelized atomic densities under rotations, with non-linearities that act pixel-wise, but it is not generally satisfied by element-wise non-linearities applied to linear transformations of 3D coordinates.

In the scope of this manuscript, we use Nordenfors et al. (2025)'s framework to study the linear head of the EScAIP architecture, which does satisfy the conditions. We also use the framework to extend Proposition 3.14 in Nordenfors et al. (2025), which describes how models follow approximately equivariant learning dynamics when ‖θE⊥‖ is small. In our Theorems 5 and 6, we extend this with our loss decomposition framework and derive a relationship between θE⊥ and Lequiv.

The foundation of this framework is the representation of a network's parameters in all of its linear layers as a point in a high-dimensional vector space, denoted H. This captures the dominant set of learnable parameters when non-linearities are fixed. The space is formally constructed as the direct sum of the parameter spaces for each individual layer: H = ⊕ᵢ Hom(Xi, Xi+1). Specific network architectures are assumed to have parameters in an affine subspace L ⊆ H, referred to as the space of "admissible layers".
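As a concrete illustration of this parameter-space geometry (a hypothetical sketch, not Nordenfors et al.'s code, with the 24 cube rotations standing in for SO(3); over SO(3) the average would be a Haar integral), group-averaging a single 3×3 linear layer, A ↦ (1/|G|) Σ_g ρ(g)⁻¹ A ρ(g), orthogonally projects it onto the subspace of equivariant linear maps:

```python
import numpy as np

def cube_rotations():
    # The 24 rotations of the cube, generated from 90-degree turns about x and z.
    Rx = np.array([[1, 0, 0], [0, 0, -1], [0, 1, 0]])
    Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]])
    seen = {tuple(np.eye(3, dtype=int).flatten())}
    frontier = [np.eye(3, dtype=int)]
    while frontier:
        M = frontier.pop()
        for G in (Rx, Rz):
            N = G @ M
            key = tuple(N.flatten())
            if key not in seen:
                seen.add(key)
                frontier.append(N)
    return [np.array(k).reshape(3, 3) for k in seen]

group = cube_rotations()

def project_equivariant(A):
    # Reynolds operator: average the conjugates g^-1 A g over the group.
    return np.mean([R.T @ A @ R for R in group], axis=0)

A = np.random.default_rng(3).normal(size=(3, 3))
P = project_equivariant(A)

for R in group:
    assert np.allclose(R @ P, P @ R)            # P is equivariant: R P = P R
assert np.allclose(project_equivariant(P), P)   # idempotent: a projection
# Orthogonality: the residual A - P is Frobenius-orthogonal to P.
assert abs(np.sum((A - P) * P)) < 1e-9
```

The three assertions witness exactly the properties the framework uses: the averaged map commutes with every group element, the averaging operator is idempotent, and the deviation A − P lies in the orthogonal complement of the equivariant subspace.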
This setup is shown by construction to be expressive and capable of describing many modern neural network architectures and operations, including fully connected layers, convolutions, residual connections, and attention layers. To define equivariance for a multi-layer network, the framework supposes that the symmetry group G acts on all input, hidden, and output spaces (X0, X1, ..., XL) through a series of representations ρi. With this setup, the set of all parameter configurations in which each linear layer is individually equivariant forms a linear subspace of H, denoted HG. This set is a linear subspace because each group action ρi(g) is a linear operator, which means any linear combination of equivariant linear maps remains equivariant. For instance, in the setting of rotations on 3D molecules, consider a linear layer with matrix A and a rotation matrix R; if the layer is equivariant, we have ARx = RAx. If A and B are both equivariant to R, then C = c1A + c2B is also equivariant to R: RCx = c1RAx + c2RBx = c1ARx + c2BRx = CRx. HG is thus closed under addition and scalar multiplication; this linearity follows from the group actions being linear transformations.

The parameters that are both architecturally admissible and perfectly equivariant then lie in the intersection of these spaces, E = L ∩ HG. It further follows that if the non-linearities are equivariant (condition ii), then the entire neural network function is equivariant when its parameters are in E. This geometric structure guarantees that any admissible parameters θ in L can be uniquely decomposed via orthogonal projection into two components: θ = θE + θE⊥. This is possible because H being an inner product space allows a unique orthogonal projection onto the subspace E.
The component θE is the projection of the parameters onto the subspace of equivariant functions E, while θE⊥ is the component in the orthogonal complement of this subspace, representing deviation from perfect equivariance.

B Supplementary Analyses & Figures

B.1 Sensitivity of percent loss from equivariance error to number of rotations used for estimation

Here, we plot bootstrapped estimates of the standard error of percent loss from equivariance error, for a model with 4 percent loss from equivariance error, with a varying number of rotations used to estimate the statistic. The model is a 60M Proteína model early in training at 500 steps, evaluated at t = 0.5. We compute loss metrics across 100 randomly sampled 3D rotations, and use this set of metrics to derive subsampled bootstrap statistics for lower numbers of rotations. We find that the Proteína model is smooth enough over the group of 3D rotations, even early in training, for a small number of rotations to suffice. Our practical choice of estimating percent loss from equivariance error with 10 rotations has a low standard error of around 0.0015, compared to the measured percent loss from equivariance error of 0.04.

Figure 5: Bootstrapped estimate of standard error of percent loss from equivariance error, for a model with 4 percent loss from equivariance error.

B.2 Validation loss curves for EScAIP, by training set size

Here, we plot validation loss curves for EScAIP, split into total MSE loss, group-averaged loss, and equivariance error, for models trained on different training set sizes. These results highlight that, across a variety of training set sizes, group-averaged loss dominates the total loss. At smaller training set sizes, equivariance error rises towards the end of training, suggesting some type of overfitting effect, while at larger training set sizes, equivariance error continues to shrink.
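The subsampled bootstrap in B.1 can be sketched as follows. This is a hypothetical stand-in: the per-rotation losses below are synthetic, whereas the real values come from a trained Proteína model.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic per-rotation total losses for 100 sampled rotations (stand-ins),
# and the loss of the group-averaged prediction.
rotation_losses = 1.0 + 0.1 * rng.normal(size=100)
loss_of_mean_prediction = 0.96  # stand-in for L_mean

def pct_loss_from_equiv_error(sample):
    # (L - L_mean) / L: the fraction of total loss attributable to
    # imperfect equivariance, with L estimated from the sampled rotations.
    L = sample.mean()
    return (L - loss_of_mean_prediction) / L

def bootstrap_se(n_rotations, n_boot=2000):
    # Subsampled bootstrap: resample n_rotations of the 100 measured
    # rotations with replacement and report the spread of the statistic.
    stats = [pct_loss_from_equiv_error(rng.choice(rotation_losses, n_rotations))
             for _ in range(n_boot)]
    return float(np.std(stats))

# Standard error shrinks roughly as 1/sqrt(n_rotations).
assert bootstrap_se(40) < bootstrap_se(10)
```

This mirrors the procedure behind Figure 5: fix a large pool of measured rotations, then ask how noisy the statistic would be if fewer rotations had been used.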
Figure 6: EScAIP: Validation loss curves over training, varied by training set size (up to the full dataset of 950k training examples).

B.3 Gradient norm ratios vs. loss ratios

Here, we plot the loss ratio (percent loss from equivariance error) against the gradient norm ratio: the ratio of the gradient norm of the equivariance error to the gradient norm of the group-averaged loss, over training. In general, gradient norms have a complex relationship with loss values. In Propositions 2 and 3, we show that under certain local smoothness assumptions, the gradient norm ratio can be controlled by the loss ratio; however, this smoothness assumption may not apply in practice. In these empirical plots, we find a generally strong correlation between loss ratio and gradient norm ratio.

Figure 7: EScAIP: Percent validation loss from equivariance error vs. grad norm ratio, over training. Colored line indicates smoothed exponential moving average, colored by training step.

Figure 8: Proteína: Percent validation loss from equivariance error vs. grad norm ratio, over training, by flow matching time. Colored line indicates smoothed exponential moving average, colored by training step.

B.4 Variance of EScAIP's linear force head controls equivariance error

Here, we plot the variance, or deviation from the mean, of EScAIP's linear force head over training time, compared to percent loss from equivariance error. Theorem 4 states that equivariance error is controlled quadratically by the linear head's deviation from its mean. Empirically, we find strong agreement between the two metrics, with Pearson correlation = 0.94 and Spearman correlation = 0.99 over training.
Figure 9: EScAIP: Variance of linear force head weights, by training step (0 to 400k).

Figure 10: EScAIP: Variance of linear force head weights, vs. percent validation loss from equivariance error, over training.

C.1 Proof of Proposition 1

Proposition. If l(z, y) = (1/D)‖z − y‖² is mean-squared error, then the total loss decomposes as:

L(f) = E_{x,y}[ l(µ(x), y) ]  (prediction error)  +  (1/D) E_x[ Σ_{i=1}^{D} Var_T[(T⁻¹ ∘ f ∘ T)(x)_i] ]  (equivariance error)

Proof. For mean-squared error, the Hessian is constant: H_l(z, y) = (2/D) I, where I is the D × D identity matrix. Furthermore, higher-order derivatives are zero, so the decomposition has no additional terms. The equivariance error term simplifies as:

(1/2) E_{x,y}[ tr( (2/D) I · Cov_T[(T⁻¹ ∘ f ∘ T)(x)] ) ] = (1/D) E_{x,y}[ tr( Cov_T[(T⁻¹ ∘ f ∘ T)(x)] ) ]   (21)

C.2 Lemma: Analytic functions are smooth on compact sets

Lemma 7. Let f : X → R be a real-analytic function. Then, on any compact subset X′ ⊆ X, there exists a finite constant M > 0 such that f is M-smooth on X′; that is,

(1/2) ‖∇f(x)‖² ≤ M |f(x) − f(x*)|   (22)

for a minimum x* ∈ X′.

Proof. Since f is analytic, all derivatives of f exist and are continuous. In particular, its Hessian ∇²f(x) is continuous on X′. By the extreme value theorem, any continuous function on a compact set attains a finite supremum. Therefore,

M ≜ sup_{x ∈ X′} ‖∇²f(x)‖₂ < ∞.   (23)

This boundedness of the Hessian implies that ∇f is Lipschitz continuous with constant M on X′, which is the M-smoothness condition. The inequality (1/2)‖∇f(x)‖² ≤ M|f(x) − f(x*)| follows from integrating the gradient-Lipschitz property along the line segment between x and x*, or by standard smoothness results.

C.3 Proof of Proposition 2

Proposition. Let the model fθ be analytic (i.e., a deep neural network constructed from analytic activation functions), and suppose the analytic loss satisfies L(θ) = Lmean(θ) + Lequiv(θ) (i.e., for convex losses via 9).
Then, for any compact subset $U$ in the parameter space of $f_\theta$ where $\epsilon(\theta)$ is well-defined (the denominator is non-zero) and that includes an optimum $\theta^*$ with $\epsilon(\theta^*) = 0$, there exists a finite constant $M_\epsilon(U) = \sup_{\theta \in U} \|\nabla^2 \epsilon(\theta)\|$ such that the approximation $\nabla L(\theta) \approx \nabla L_{\text{mean}}(\theta)$ holds for all $\theta \in U$ with relative error bounded by:

$$\frac{\|\nabla L(\theta) - \nabla L_{\text{mean}}(\theta)\|}{\|\nabla L_{\text{mean}}(\theta)\|} \le \epsilon(\theta) + \frac{L_{\text{mean}}(\theta)}{\|\nabla L_{\text{mean}}(\theta)\|}\sqrt{2 M_\epsilon(U)\,\epsilon(\theta)} \quad (24)$$

Proof. Using $L(\theta) = L_{\text{mean}}(\theta) + L_{\text{equiv}}(\theta)$, the total loss is $L(\theta) = (1 + \epsilon(\theta))\,L_{\text{mean}}(\theta)$. Differentiate with the product rule:

$$\nabla L(\theta) = \nabla\left[(1 + \epsilon(\theta))\,L_{\text{mean}}(\theta)\right] \quad (25)$$
$$= \nabla\epsilon(\theta)\,L_{\text{mean}}(\theta) + (1 + \epsilon(\theta))\,\nabla L_{\text{mean}}(\theta) \quad (26)$$
$$\nabla L(\theta) - \nabla L_{\text{mean}}(\theta) = \epsilon(\theta)\,\nabla L_{\text{mean}}(\theta) + L_{\text{mean}}(\theta)\,\nabla\epsilon(\theta) \quad (27)$$

Now, we bound the norm of this difference using the triangle inequality:

$$\|\nabla L(\theta) - \nabla L_{\text{mean}}(\theta)\| \le \epsilon(\theta)\,\|\nabla L_{\text{mean}}(\theta)\| + L_{\text{mean}}(\theta)\,\|\nabla\epsilon(\theta)\| \quad (28)$$

The model and losses are analytic, so by Lemma 7, there exists a finite $M_\epsilon(U)$ such that $\epsilon(\theta)$ is $M_\epsilon(U)$-smooth over the compact parameter region $U$. Specifically, $M_\epsilon(U) = \sup_{\theta \in U} \|\nabla^2 \epsilon(\theta)\|$. Applying the smoothness property $\|\nabla\epsilon(\theta)\| \le \sqrt{2 M_\epsilon(U)\,\epsilon(\theta)}$ relative to the optimum $\theta^*$ with $\epsilon(\theta^*) = 0$, we obtain the final result:

$$\frac{\|\nabla L(\theta) - \nabla L_{\text{mean}}(\theta)\|}{\|\nabla L_{\text{mean}}(\theta)\|} \le \epsilon(\theta) + \frac{L_{\text{mean}}(\theta)}{\|\nabla L_{\text{mean}}(\theta)\|}\sqrt{2 M_\epsilon(U)\,\epsilon(\theta)} \quad (29)$$

C.4 Proof of Proposition 3

Proposition. Let the model $f_\theta$ be analytic (i.e., a deep neural network constructed from analytic activation functions), and suppose $L(\theta) = L_{\text{mean}}(\theta) + L_{\text{equiv}}(\theta)$ (i.e., for convex losses via 9) is analytic and has a minimum at 0. Then, there exist a compact neighborhood $U$ around the global minimum $\theta^*$ and finite constants $c > 0$, $M_{L_{\text{equiv}}}(U)$, $\alpha \in [1, 2)$ such that for all $\theta \in U$, the approximation $\nabla L(\theta) \approx \nabla L_{\text{mean}}(\theta)$ holds with relative error bounded by:

$$\frac{\|\nabla L(\theta) - \nabla L_{\text{mean}}(\theta)\|}{\|\nabla L_{\text{mean}}(\theta)\|} \le \sqrt{\frac{2\,M_{L_{\text{equiv}}}(U)\,L_{\text{equiv}}(\theta)}{c\,L_{\text{mean}}(\theta)^\alpha}} \quad (30)$$

Proof.
By Kurdyka (1998); Dereich & Kassing (2021), any real-analytic function $f$ satisfies the Kurdyka-Łojasiewicz inequality, which states that there exist a compact local neighborhood $U$ around any critical point $\theta^*$ and constants $c > 0$, $\alpha \in [1, 2)$ such that, for all $\theta \in U$:

$$\|\nabla f(\theta)\|^2 \ge c\,|f(\theta) - f(\theta^*)|^\alpha \quad (31)$$

When applied to the real-analytic function $L_{\text{mean}}$ at the global minimum $\theta^*$ with $L_{\text{mean}}(\theta^*) = 0$, the Kurdyka-Łojasiewicz inequality states that there exist a neighborhood $U$ around $\theta^*$ and constants $c > 0$ and $\alpha \in [1, 2)$ such that:

$$\|\nabla L_{\text{mean}}(\theta)\|^2 \ge c\,L_{\text{mean}}(\theta)^\alpha \quad (32)$$

Finally, by analyticity of the network and loss functions and applying Lemma 7 relative to the optimum $\theta^*$ with $L_{\text{equiv}}(\theta^*) = 0$, there exists a finite $M_{L_{\text{equiv}}}(U)$ such that $L_{\text{equiv}}$ is $M_{L_{\text{equiv}}}(U)$-smooth on the compact neighborhood $U$:

$$\|\nabla L_{\text{equiv}}(\theta)\|^2 \le 2\,M_{L_{\text{equiv}}}(U)\,L_{\text{equiv}}(\theta) \quad (33)$$

Noting that $\nabla L(\theta) - \nabla L_{\text{mean}}(\theta) = \nabla L_{\text{equiv}}(\theta)$, the final result follows algebraically by combining inequalities 32 and 33.

C.5 Proof of Proposition 4

Theorem. For the EScAIP architecture trained with mean-squared error loss on a non-degenerate dataset, for any fixed set of upstream parameters $\theta \setminus W$, there exist positive constants $0 < \lambda_{\min} \le \lambda_{\max}$ such that:

$$\lambda_{\min}\,\|W_{E^\perp}\|_F^2 \le L_{\text{equiv}}(\theta) \le \lambda_{\max}\,\|W_{E^\perp}\|_F^2 \quad (34)$$

Remarks. The constants $\lambda_{\min}$ and $\lambda_{\max}$ depend on the model architecture, the data distribution, and the other parameters $\theta \setminus W$.

Proof. For a molecule $x$, the $k$-th component of the predicted force vector decomposes into a sum of contributions from $W_E$ and $W_{E^\perp}$:

$$o_k(x; W) = \underbrace{\sum_{e \in E} \hat{e}_k\,(\bar{w}^\top h(e))}_{o_{\text{eq},k}(x;\,W_E)} + \underbrace{\sum_{e \in E} \hat{e}_k\,(d_k^\top h(e))}_{\tilde{o}_k(x;\,W_{E^\perp})} \quad (35)$$

where the final hidden representation $h$ depends on $\theta \setminus W$, the set of upstream parameters. Recall the equivariance error from Proposition 1, and observe that the variance of $o_k = o_{\text{eq},k} + \tilde{o}_k$ depends only on $\tilde{o}_k$, as $o_{\text{eq},k}$ is equivariant by construction.
Thus, the equivariance error of the entire model, for a fixed set of upstream parameters and expressed as a function of the force prediction head parameters, is:

$$L_{\text{equiv}}(\theta) = \mathbb{E}_{x,T}\left\|T^{-1}\,\tilde{o}(Tx;\,W_{E^\perp}) - \mathbb{E}_{T'}\left[T'^{-1}\,\tilde{o}(T'x;\,W_{E^\perp})\right]\right\|^2$$

Now, let us denote $g(T, x, W_{E^\perp}) = T^{-1}\,\tilde{o}(Tx;\,W_{E^\perp})$. Observe that this function $g$ is linear in our deviation parameters $W_{E^\perp}$. By vectorizing the $h \times 3$ parameter matrix $W_{E^\perp}$ into a $3h \times 1$ column vector $p = \mathrm{vec}(W_{E^\perp})$, we can express this linear relationship as a matrix-vector product, for some matrix $M_{T,x}$ of shape $3 \times 3h$: $g(T, x, W_{E^\perp}) = M_{T,x}\,p$. Similarly, the rotation-averaged prediction $\bar{g}(x; W_{E^\perp}) = \mathbb{E}_T[g(T, x, W_{E^\perp})]$ is also a linear function, so we associate it with the matrix $\bar{M}_x$. The equivariance error term with these linear matrix forms is:

$$\mathbb{E}_{x,T}\left[\|g(T, x, W_{E^\perp}) - \bar{g}(x, W_{E^\perp})\|^2\right] = p^\top \bar{Q}\,p \quad (36)$$

where the matrix $\bar{Q} = \mathbb{E}_{x,T}[(M_{T,x} - \bar{M}_x)^\top (M_{T,x} - \bar{M}_x)]$. Finally, observe that $\bar{Q}$ is positive definite, as the equivariance error is strictly positive on a non-degenerate dataset whenever $W_{E^\perp} \ne 0$. By the properties of a positive definite matrix, the quadratic form $p^\top \bar{Q}\,p$ is lower-bounded by the smallest eigenvalue of $\bar{Q}$, denoted $\lambda_{\min}(\bar{Q})$, which is positive, times $\|p\|^2$; it is likewise upper-bounded by the largest eigenvalue $\lambda_{\max}(\bar{Q})$ times $\|p\|^2$. This establishes the quadratic relationship on the equivariance loss as stated in the theorem.

C.6 Proof of Proposition 5

Theorem. For any neural network whose parameters can be expressed as $\theta = \theta_E + \theta_{E^\perp}$ with $\theta_E \in E$ and $\theta_{E^\perp} \in E^\perp$, and for equivariance error $L_{\text{equiv}}$ defined by the variance of the output with respect to transformations, there exist positive constants $0 < \lambda_{\min} \le \lambda_{\max}$ such that for a non-degenerate dataset, using $\|\cdot\|$ to denote the L2 norm:

$$\lambda_{\min}\|\theta_{E^\perp}\|^2 + O(\|\theta_{E^\perp}\|^3) \le L_{\text{equiv}}(\theta) \le \lambda_{\max}\|\theta_{E^\perp}\|^2 + O(\|\theta_{E^\perp}\|^3) \quad (37)$$

Proof. Applying a Taylor expansion to the neural net $f$ on an input $x$ around the equivariant parameters $\theta_E$, we have:

$$f(x;\,\theta_E + \theta_{E^\perp}) = f(x;\,\theta_E) + J_{\theta_{E^\perp}} f(x;\,\theta_E)\,\theta_{E^\perp} + O(\|\theta_{E^\perp}\|^2) \quad (38)$$

where $J_{\theta_{E^\perp}} f(x;\,\theta_E)$ is the Jacobian of the network output with respect to the parameter components $\theta_{E^\perp}$, evaluated at $\theta_E$.
As before, the term $f(x;\,\theta_E)$ is equivariant by construction, and thus drops out of the equivariance error term. The term $J_{\theta_{E^\perp}} f(x;\,\theta_E)\,\theta_{E^\perp}$ is linear in $\theta_{E^\perp}$, which creates a quadratic dependence on $\theta_{E^\perp}$ in the variance term in $L_{\text{equiv}}$.

The deviation from the twirled mean is the difference between the canonicalized prediction and its average over transformations. Let us expand this difference:

$$(T^{-1} \circ f \circ T)(x;\,\theta) - \mu(x;\,\theta) = (T^{-1} \circ f \circ T)(x;\,\theta) - \mathbb{E}_{T'}\left[(T'^{-1} \circ f \circ T')(x;\,\theta)\right] \quad (39)$$

Substituting the Taylor series and using the equivariance of $f(x;\,\theta_E)$:

$$= f(x;\,\theta_E) + \left[T^{-1} J_{\theta_{E^\perp}} f(T(x);\,\theta_E)\right]\theta_{E^\perp} + O(\|\theta_{E^\perp}\|^2) - \mathbb{E}_{T'}\left[f(x;\,\theta_E) + \left[T'^{-1} J_{\theta_{E^\perp}} f(T'(x);\,\theta_E)\right]\theta_{E^\perp} + O(\|\theta_{E^\perp}\|^2)\right] \quad (40)$$

$$= \left(T^{-1} J_{\theta_{E^\perp}} f(T(x);\,\theta_E) - \mathbb{E}_{T'}\left[T'^{-1} J_{\theta_{E^\perp}} f(T'(x);\,\theta_E)\right]\right)\theta_{E^\perp} + O(\|\theta_{E^\perp}\|^2) \quad (41)$$

Let $\Delta J_{x,T} \triangleq T^{-1} J_{\theta_{E^\perp}} f(T(x);\,\theta_E) - \mathbb{E}_{T'}[T'^{-1} J_{\theta_{E^\perp}} f(T'(x);\,\theta_E)]$. The expression becomes $\Delta J_{x,T}\,\theta_{E^\perp} + O(\|\theta_{E^\perp}\|^2)$.

$$L_{\text{equiv}}(\theta) = \frac{1}{D}\,\mathbb{E}_{x,T}\left[\left\|\Delta J_{x,T}\,\theta_{E^\perp} + O(\|\theta_{E^\perp}\|^2)\right\|^2\right] \quad (42)$$

$$= \frac{1}{D}\,\mathbb{E}_{x,T}\left[\|\Delta J_{x,T}\,\theta_{E^\perp}\|^2 + 2(\Delta J_{x,T}\,\theta_{E^\perp})^\top O(\|\theta_{E^\perp}\|^2) + \|O(\|\theta_{E^\perp}\|^2)\|^2\right] \quad (43)$$

The orders of the terms are: $\|\Delta J_{x,T}\,\theta_{E^\perp}\|^2$ is $O(\|\theta_{E^\perp}\|^2)$; the cross-term is $O(\|\theta_{E^\perp}\|) \cdot O(\|\theta_{E^\perp}\|^2) = O(\|\theta_{E^\perp}\|^3)$; the final term is $(O(\|\theta_{E^\perp}\|^2))^2 = O(\|\theta_{E^\perp}\|^4)$. We will study the leading term, which is quadratic in $\theta_{E^\perp}$, and subsume the remainder into $O(\|\theta_{E^\perp}\|^3)$.

As $\Delta J_{x,T}$ is a linear function, we can define a matrix $\bar{Q}$ that represents the averaged outer product of the Jacobian deviations: $\bar{Q} \triangleq \frac{1}{D}\,\mathbb{E}_{x,T}[(\Delta J_{x,T})^\top (\Delta J_{x,T})]$. The equivariance error can now be expressed concisely:

$$L_{\text{equiv}}(\theta_E + \theta_{E^\perp}) \approx \theta_{E^\perp}^\top \bar{Q}\,\theta_{E^\perp} \quad (44)$$

The matrix $\bar{Q}$ is positive definite for a non-degenerate dataset when $\theta_{E^\perp} \ne 0$. Using the Rayleigh-Ritz theorem, this quadratic form is thus bounded by the smallest and largest eigenvalues:

$$\lambda_{\min}\|\theta_{E^\perp}\|_2^2 \le \theta_{E^\perp}^\top \bar{Q}\,\theta_{E^\perp} \le \lambda_{\max}\|\theta_{E^\perp}\|_2^2$$

Reincorporating the remainder term from our Taylor expansion, we arrive at:

$$\lambda_{\min}\|\theta_{E^\perp}\|_2^2 + O(\|\theta_{E^\perp}\|_2^3) \le L_{\text{equiv}}(\theta) \le \lambda_{\max}\|\theta_{E^\perp}\|_2^2 + O(\|\theta_{E^\perp}\|_2^3) \quad (45)$$

C.7 Proof of Proposition 6

Theorem.
Under the same conditions as the Taylor expansion theorem above, the norm of the gradient of the equivariance loss with respect to the non-equivariant parameters is bounded by the deviation itself. Specifically, there exists a constant $C$ such that:

$$\|\nabla_{\theta_{E^\perp}} L_{\text{equiv}}(\theta)\| \le C\,\|\theta_{E^\perp}\|$$

Proof. From our proof in C.6, we know $L_{\text{equiv}}(\theta) \approx p^\top \bar{Q}\,p$, where $p = \mathrm{vec}(\theta_{E^\perp})$. The gradient of a quadratic form is linear: $\nabla_p L_{\text{equiv}} = 2\bar{Q}\,p$. Taking norms, we get $\|\nabla_p L_{\text{equiv}}\| = \|2\bar{Q}\,p\| \le 2\|\bar{Q}\|_2\,\|p\|$. Setting $C = 2\lambda_{\max}(\bar{Q}) = 2\|\bar{Q}\|_2$ gives the result.

D Methods & Experimental Details

D.1 EScAIP

We trained EScAIP-6M on a subset of SPICE with 950k training examples used by Qu & Krishnapriyan (2024), for 30 epochs with batch size 64. SPICE is a dataset of small molecule 3D conformers with energies and forces computed by quantum-mechanical density functional theory (Eastman et al., 2024). We varied model size across 1M, 4M, and 6M parameters; varied training set size across 950k, 50k, 5k, and 500 examples (with batch size 1); and varied the optimizer and learning rate. The model predicts the density-functional-theory force vector for each atom, mapping an input molecule with $N$ atoms to an output in $\mathbb{R}^{3N}$. This task is physically equivariant to the special orthogonal group SO(3) acting on atom coordinates in $\mathbb{R}^3$. We follow the same training recipe as the original repository, which does not use data augmentation. We suspect that data augmentation is less important for EScAIP because it operates on rotation-invariant features. For further details and configuration files, please refer to our code repository.

D.2 Proteína

We trained Proteína at 60M parameters without triangular attention and at 400M parameters with triangular attention on the full Protein Data Bank (PDB) dataset with 225k training examples. We also trained models on 1% of the PDB with 2k examples and on 0.1% with 200 examples. Flow matching trains a model jointly over flow matching time t, ranging from t = 0 (noise) to t = 1 (data).
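To make the role of t concrete, here is a minimal linear flow matching sketch. This is a generic illustration under a linear interpolation path between noise and data, not Proteína's exact parameterization:

```python
import numpy as np

def linear_flow_matching_pair(x0, x1, t):
    """Sample point and velocity target under linear-interpolation flow matching.

    x0: noise sample, x1: data sample, t in [0, 1]. Generic sketch: the
    model would regress the velocity target from (xt, t).
    """
    xt = (1.0 - t) * x0 + t * x1   # point on the probability flow path
    v_target = x1 - x0             # constant velocity of the linear path
    return xt, v_target

rng = np.random.default_rng(0)
x0, x1 = rng.normal(size=3), rng.normal(size=3)
xt, v = linear_flow_matching_pair(x0, x1, 0.0)
assert np.allclose(xt, x0)  # t = 0 is pure noise
xt, v = linear_flow_matching_pair(x0, x1, 1.0)
assert np.allclose(xt, x1)  # t = 1 is data
```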
We measure metrics at t = 0, 0.2, 0.4, 0.6, 0.8, 0.9, 0.95, and 0.99, using red colors for high t (close to the data) and blue-purple colors for low t (near noise) in Figure 3. The model learns to approximate the velocity field of a probability flow that transforms random noise into structured protein backbones. For a molecule with $N$ alpha-carbon atoms, the network maps noised atom coordinates and a time $t \in [0, 1]$ to a velocity vector in $\mathbb{R}^{3N}$. The learning task is made rotationally equivariant through data augmentation, aligning it with SO(3) acting on atom coordinates in $\mathbb{R}^3$. For further details and configuration files, please refer to our code repository.

D.3 VoxMol

Following Pinheiro et al. (2023), we represent each molecule using a 3D voxel grid by placing a continuous Gaussian density at each atom's position. Each atom type is assigned a distinct input channel, producing a 4D tensor of shape $[c \times l \times l \times l]$, where $c$ denotes the number of atom types and $l$ is the edge length of the voxel grid. The voxel values are normalized between 0 and 1. The denoising task arises from the use of walk-jump sampling (WJS) for generating molecules (Saremi & Hyvärinen, 2019). This is a two-step score-based sampling method. The walk phase runs $k$ steps of Langevin Markov chain Monte Carlo on a randomly initialized noisy voxel grid, simulating a stochastic trajectory along a manifold. The jump phase applies a denoising autoencoder (DAE) to clean up the noisy sample at step $k$ using a forward pass of the trained model. The DAE is trained on voxelized molecules corrupted with isotropic Gaussian noise, with a mean squared error (MSE) loss between prediction and ground truth. WJS provides a fast alternative to diffusion models by requiring only a single noise and denoise step (Pinheiro et al., 2023; Nowara et al., 2024).
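The voxelization step can be sketched as follows. The grid size, resolution, and Gaussian width here are illustrative assumptions rather than VoxMol's exact settings:

```python
import numpy as np

def voxelize(coords, atom_types, n_types, l=32, resolution=0.5, sigma=1.0):
    """Render atoms as Gaussian densities on a [c, l, l, l] voxel grid.

    coords: (N, 3) atom positions (assumed centered); atom_types: (N,)
    channel indices. `resolution` (grid spacing) and `sigma` (Gaussian
    width) are illustrative choices for this sketch.
    """
    grid = np.zeros((n_types, l, l, l), dtype=np.float64)
    # Voxel center coordinates along each axis, centered at the origin.
    axis = (np.arange(l) - (l - 1) / 2.0) * resolution
    zz, yy, xx = np.meshgrid(axis, axis, axis, indexing="ij")
    for (x, y, z), t in zip(coords, atom_types):
        d2 = (xx - x) ** 2 + (yy - y) ** 2 + (zz - z) ** 2
        grid[t] += np.exp(-d2 / (2.0 * sigma ** 2))
    # Normalize voxel values to [0, 1], as described in the text.
    m = grid.max()
    return grid / m if m > 0 else grid
```

Each atom type accumulates density in its own channel, so a molecule with c atom types yields a c-channel volumetric "image" suitable for a 3D U-Net.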
Architecture. The VoxMol architecture is based on a 3D U-Net with convolutional layers spanning four resolution scales, and includes self-attention modules at the two coarsest levels (Pinheiro et al., 2023). During training, data augmentation is performed by applying random rotations and translations to each sample. For further architectural and training details, refer to Pinheiro et al. (2023).

Measuring whether latent representations learn to respect equivariance. To evaluate whether VoxMol learns equivariant latent features, we analyze the cosine similarity between latent embeddings under two scenarios.

First, we examine representations of the same molecule under rotation. Let $x$ be a molecule and $R_k$ a discrete rotation operator (e.g., 90° around an axis). Using the encoder $\phi(\cdot) \in \mathbb{R}^{C \times D \times H \times W}$, with $C = 512$ and spatial dimensions $8 \times 8 \times 8$, we define the spatially pooled latent vector:

$$\bar{\phi}(x) = \frac{1}{DHW} \sum_{d,h,w} \phi(x)[:, d, h, w]$$

We then compute:

$$\text{sim}_{\text{same}} = \cos\left(\bar{\phi}(R_k(x)),\; \overline{R_k(\phi(x))}\right)$$

where $R_k$ acts on the spatial dimensions of the latent grid. This measures whether encoding a rotated molecule is equivalent to rotating the latent representation of the original input, a key signature of learned equivariance.

Second, to obtain a baseline, we compute cosine similarities between embeddings of randomly selected different molecules:

$$\text{sim}_{\text{diff}} = \cos\left(\bar{\phi}(x_i),\; \bar{\phi}(x_j)\right), \quad x_i \ne x_j$$

We compute these metrics across 1000 molecules for various rotation angles along all three axes. Cosine similarities are calculated over the 512-dimensional latent vectors and visualized using violin plots to capture the distributional differences in Figure 11.

Findings. Cosine similarity between rotated versions of the same molecule tends to decrease as the rotation angle increases, reflecting imperfect latent equivariance. While same-molecule embeddings remain more similar to each other than to embeddings of different molecules, the overlap between their distributions grows with rotation.
This suggests that although the encoder partially preserves geometric structure, the latent space does not fully achieve rotation equivariance, indicating potential for improved regularization or architectural design.

Figure 11: VoxMol: Cosine similarity of molecule latent representations under different rotations. x, y, z indicate rotation axes, and the numbers 0, 0.5, 1, 1.5, 2 correspond to 0, 90, 180, 270, and 360 degrees of rotation. The last column depicts cosine similarity between different molecules.

D.4 Metrics

To compute the equivariance error, the twirled prediction and its error, the percent MSE loss from equivariance error, and gradient norms, 10 rotations per sample were used in EScAIP and Proteína, and 4 rotations per sample were used for VoxMol. These numbers were found to be sufficient to provide a stable signal for the metrics, robust to randomness and resampling (B.1). Each measurement point uses a different set of random rotations, so each metric's stability over time also reflects the stability of our measurements. For EScAIP, these metrics were computed on the first four (fixed) validation batches with a batch size of 16, for a total of 64 samples. For Proteína, these metrics were computed on the first eight (fixed) validation batches with a batch size of 3, for a total of 24 samples. The total MSE loss on these subsets was indicative of the total validation MSE loss, indicating these sample sizes were sufficient to provide a stable and representative signal for these metrics.

D.5 Hessian Analysis and Condition Numbers

To plot the loss landscape, we selected a subset of parameters in each architecture. For EScAIP, we used the final FFN (with a non-linearity) and the final linear head, for a combined total of 33k parameters. For Proteína, we used the final linear head with 1.5k parameters.
We computed the Hessian of this parameter subset for the total MSE loss using one fixed training batch with ten rotations. We then performed an eigendecomposition of the total MSE loss Hessian to find the eigenvectors for the largest positive eigenvalue and the minimum positive eigenvalue, which formed the two axes for plotting the loss landscape. We selected a step size approximately 2-3x the training step size at that checkpoint, estimated by multiplying the training learning rate by the total parameter gradient norm at that checkpoint. We then create a 2D grid of perturbations to the parameter subset, and compute $L_{\text{mean}}$ and $L_{\text{equiv}}$ at each point on the grid. Importantly, the axes and the step size are the same for both $L_{\text{mean}}$ and $L_{\text{equiv}}$.

To compute the condition numbers, we computed the Hessians of the same parameter subsets for $L_{\text{mean}}$ and $L_{\text{equiv}}$ separately, and performed eigendecompositions on them separately. We reported the condition number as the ratio between the largest positive eigenvalue and the minimum positive eigenvalue. Here, we provide figures of the condition numbers across 20 minibatches.

Figure 12: EScAIP: Condition numbers across 20 minibatches.

Figure 13: Proteína: Condition numbers across 20 minibatches.

Figure 14: VoxMol: Condition numbers across 20 minibatches.

D.6 Bias-corrected finite sample estimators

In this section, we discuss unbiased finite sample estimators for $L_{\text{mean}}$ and $L_{\text{equiv}}$, which can be used to build a bias-corrected finite sample estimator for the percent loss from equivariance error. Suppose we have $N > 1$ finite group samples. Denote the exact group-averaged mean as $\mu$ (i.e., the exact expectation, which could be calculated with infinite group samples), and our finite sample estimate of the mean as $\hat{\mu}$:

$$\mu(x) = \mathbb{E}_T\left[(T^{-1} \circ f \circ T)(x)\right] \quad (46)$$

$$\hat{\mu}(x) = \frac{1}{N}\sum_{i=1}^{N} (T_i^{-1} \circ f \circ T_i)(x) \quad (47)$$

Similarly, denote $\sigma^2$ as the exact variance over the group.
Denote $\hat{\sigma}^2$ as our biased finite sample estimate of the variance (biased because it divides by $N$, not $N - 1$):

$$\hat{\sigma}^2(x) = \frac{1}{N}\sum_{i=1}^{N}\left\|(T_i^{-1} \circ f \circ T_i)(x) - \hat{\mu}(x)\right\|^2, \qquad \mathbb{E}_T[\hat{\sigma}^2] = \frac{N-1}{N}\,\sigma^2 \quad (48)$$

The MSE loss of our finite sample estimate $\hat{\mu}$ is biased relative to $\mu$. For exposition, we denote $\mu = \mu(x)$:

$$\mathbb{E}_T[(\hat{\mu} - y)^2] = \underbrace{(\mu - y)^2}_{\text{true squared bias}} + \underbrace{\frac{\sigma^2}{N}}_{\text{estimator variance}}$$

This can be derived as follows:

$$(\hat{\mu} - y)^2 = (\hat{\mu} - \mu + \mu - y)^2 \quad (51)$$
$$= (\hat{\mu} - \mu)^2 + 2(\mu - y)(\hat{\mu} - \mu) + (\mu - y)^2 \quad (52)$$

Now, take an expectation over group samples $T$:

$$\mathbb{E}_T[(\hat{\mu} - y)^2] = \mathbb{E}_T\left[(\hat{\mu} - \mu)^2 + 2(\mu - y)(\hat{\mu} - \mu) + (\mu - y)^2\right] \quad (53)$$
$$= \mathbb{E}_T[(\hat{\mu} - \mu)^2] + 2(\mu - y)\underbrace{\mathbb{E}_T[\hat{\mu} - \mu]}_{=\,0} + (\mu - y)^2 \quad (54)$$
$$= \underbrace{\mathbb{E}_T[(\hat{\mu} - \mu)^2]}_{=\,\sigma^2/N} + (\mu - y)^2 \quad (55)$$

Thus, unbiased finite sample estimators are:

$$\mathbb{E}_T\left[(\hat{\mu} - y)^2 - \frac{1}{N-1}\hat{\sigma}^2\right] = (\mu - y)^2 \quad (56)$$

$$\mathbb{E}_T\left[\frac{N}{N-1}\hat{\sigma}^2\right] = \sigma^2 \quad (57)$$

Proposition 8. The finite sample estimator for the percent loss from equivariance error,

$$\hat{\epsilon} = \frac{\frac{N}{N-1}\hat{\sigma}^2}{(\hat{\mu} - y)^2 + \hat{\sigma}^2} \quad (58)$$

has a numerator that is unbiased for $\sigma^2$, and a denominator that is unbiased for $(\mu - y)^2 + \sigma^2$.

Proof. For the numerator, see equation 48. For the denominator:

$$\mathbb{E}_T[(\hat{\mu} - y)^2 + \hat{\sigma}^2] = \mathbb{E}_T[(\hat{\mu} - y)^2] + \mathbb{E}_T[\hat{\sigma}^2] \quad (59)$$
$$= (\mu - y)^2 + \frac{\sigma^2}{N} + \frac{N-1}{N}\sigma^2 = (\mu - y)^2 + \sigma^2 \quad (60)$$

A similar argument can be used to motivate the $N/(N-1)$ correction for general convex losses, where the correction is second-order accurate.
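The bias correction above can be checked numerically; a minimal sketch with scalar outputs, using generic random draws as stand-ins for the N group samples:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, y, N = 2.0, 0.5, 1.0, 4  # illustrative values

# Monte Carlo over many repeated draws of N "group" samples.
trials = 200_000
samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
mu_hat = samples.mean(axis=1)
sig2_hat = samples.var(axis=1)  # biased variance: divides by N

# E[sig2_hat] = (N-1)/N * sigma2, so N/(N-1) * sig2_hat is unbiased.
assert abs(sig2_hat.mean() * N / (N - 1) - sigma2) < 0.01

# E[(mu_hat - y)^2 + sig2_hat] = (mu - y)^2 + sigma2: the sigma2/N excess
# in the squared-error term cancels the (N-1)/N deficit in the variance term.
denom = ((mu_hat - y) ** 2 + sig2_hat).mean()
assert abs(denom - ((mu - y) ** 2 + sigma2)) < 0.01
```

This is the same cancellation that makes the denominator of the estimator in Proposition 8 unbiased without any explicit correction factor.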