
Journal of Machine Learning Research 24 (2023) 1-32 Submitted 6/22; Revised 12/22; Published 2/23

Dimensionless machine learning: Imposing exact units equivariance

Soledad Villar soledad.villar@jhu.edu Department of Applied Mathematics and Statistics, Johns Hopkins University Mathematical Institute for Data Science, Johns Hopkins University

Weichi Yao weichi.yao@nyu.edu Department of Technology, Operations and Statistics, New York University

David W. Hogg david.hogg@nyu.edu Center for Cosmology and Particle Physics, Department of Physics, New York University Max-Planck-Institut für Astronomie Flatiron Institute, a Division of the Simons Foundation

Ben Blum-Smith bblumsm1@jhu.edu Department of Applied Mathematics and Statistics, Johns Hopkins University

Bianca Dumitrascu bmd39@cam.ac.uk Department of Computer Science and Technology, Cambridge University

Editor: Jean-Philippe Vert

Units equivariance (or units covariance) is the exact symmetry that follows from the requirement that relationships among measured quantities of physics relevance must obey self-consistent dimensional scalings. Here, we express this symmetry in terms of a (noncompact) group action, and we employ dimensional analysis and ideas from equivariant machine learning to provide a methodology for exactly units-equivariant machine learning: For any given learning task, we first construct a dimensionless version of its inputs using classic results from dimensional analysis and then perform inference in the dimensionless space. Our approach can be used to impose units equivariance across a broad range of machine learning methods that are equivariant to rotations and other groups. We discuss the in-sample and out-of-sample prediction accuracy gains one can obtain in contexts like symbolic regression and emulation, where symmetry is important. We illustrate our approach with simple numerical examples involving dynamical systems in physics and ecology.

1. Introduction

In recent years, there has been enormous progress in developing machine learning methods that incorporate exact symmetries. Indeed, the revolutionary performance of convolutional neural networks (Le Cun et al., 1989) was enabled by the imposition of a local translational symmetry at the bottom layer of the networks. When a learning problem obeys an exact symmetry, we have the strong intuition that imposing symmetry on the method must improve learning and generalization. Much of the work on exact symmetries has been situated in physics and natural-science domains (Yu et al., 2021; Kashinath et al., 2021), because physical laws exactly obey a panoply of symmetries.

© 2023 Soledad Villar, Weichi Yao, David W. Hogg, Ben Blum-Smith, Bianca Dumitrascu.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-0680.html.

Villar, Yao, Hogg, Blum-Smith, Dumitrascu

One of the symmetries of physics, and indeed of all the sciences, is the symmetry underlying dimensional analysis. Quantities that are added or subtracted must have identical units; if the units system of the inputs to a function is changed, the units of the output must change accordingly. This symmetry, which we call units equivariance (in physics it might be more natural to call it units covariance), is a passive symmetry that applies to all problems (see Section 4.1 of Rovelli and Gaul (2000)). It has many implications. One is that it is possible to derive scalings and dependencies of outputs on inputs from the units directly. For example, if a problem involves only a length $L$, an acceleration $g$, and a mass $m$, and the problem is to learn or predict a time $t$, it is possible from units alone to see that $t \propto \sqrt{L/g}$. Another implication is that inhomogeneous functions, such as transcendental functions and most non-linear functions, can only be applied to dimensionless (unitless) quantities: if a quantity $x$ is dimensional, an inhomogeneous polynomial expression in $x$ is inconsistent with the principle that quantities to be added or subtracted must have identical units. These dimensional symmetries are strict and exact, and apply to essentially every problem in the sciences. In chemistry, ecology, and economics, for example, the inputs and the outputs of functional relationships have non-trivial units, and the results must be equivariant to the choice of units system. Machine learning methods for physical sciences are often designed in a way that implies units equivariance, even if they don't explicitly say so (see for instance Cambioni et al. (2019, 2021)).
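The scaling $t \propto \sqrt{L/g}$ can be recovered mechanically from the units alone. The following is a minimal sketch, not the authors' code: the units encodings over the base units (kg, m, s) and the helper name `find_exponents` are assumptions made for illustration, and the search is a brute-force scan over half-integer exponents.

```python
from fractions import Fraction
from itertools import product

# Units vectors over the base units (kg, m, s); these encodings are assumptions.
UNITS = {
    "L": (0, 1, 0),    # length: m
    "g": (0, 1, -2),   # acceleration: m s^-2
    "m": (1, 0, 0),    # mass: kg
}
TARGET = (0, 0, 1)     # time: s

def find_exponents(units, target, max_abs=2):
    """Scan half-integer exponents in [-max_abs, max_abs] whose units match target."""
    names = list(units)
    grid = [Fraction(n, 2) for n in range(-2 * max_abs, 2 * max_abs + 1)]
    solutions = []
    for exps in product(grid, repeat=len(names)):
        # Units of the product of powers: exponents add linearly.
        combined = tuple(
            sum(e * units[name][j] for e, name in zip(exps, names))
            for j in range(len(target))
        )
        if combined == tuple(target):
            solutions.append(dict(zip(names, exps)))
    return solutions

solutions = find_exponents(UNITS, TARGET)
print(solutions)  # the only match has exponents L: 1/2, g: -1/2, m: 0
```

The mass drops out because no other input carries a kg exponent that could cancel it.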

In this work, we implement a particular case of group-equivariant machine learning, for the group corresponding to changes in units. We make use of dimensionless quantities, which are the invariants of this group. We thereby build on our previous work in which group invariants are used to build group-equivariant functions (Villar et al., 2021; Yao et al., 2021; Blum-Smith and Villar, 2022). Our procedure here builds dimensionless features out of the problem inputs and then ensures that the resulting outputs are scaled back to their correct dimensions or units. The guiding philosophy of the work is to transform the inputs into invariant features before they are used to train a machine learning method, and then to un-transform label predictions at the output or at test time. These approaches to symmetry-respecting machine learning are simple to implement and perform well (Yao et al., 2021; Chen and Villar, 2022).

In what follows, we will make the strong assumption that the dimensions and units of all regression inputs are known, complete, and self-consistent. However, a different direction of research is to look at how dimensional relationships or other symmetries are discovered. This is the setting for Constantine et al. (2017) and Evangelou et al. (2021). This idea (the discovery of dimensional structure) connects to prior work as a particular case of the more general problem of learning symmetries from data (Benton et al., 2020; Cahill et al., 2020; Portilheiro, 2022).

Related work: Dimensional analysis is a classical theory with applications in engineering and science (Barenblatt, 1996). These ideas have been connected to machine learning previously (Rudolph et al., 1998; Frisone and Misiti, 2019; Bakarji et al., 2022). The key theoretical result in dimensional analysis is the Buckingham Pi Theorem (Buckingham, 1914), which says that a function is units equivariant if and only if it is a function of a set of dimensionless quantities. These quantities can (usually) be obtained as products of integer

Units equivariant machine learning

powers of the input features. Integer linear-algebra algorithms permit the discovery of a generating set of dimensionless features (Hubert and Labahn, 2012). Incorporating group invariances and equivariances in the design of neural networks has led to advances in many applications from molecular dynamics, to turbulence, to climate and traffic prediction (Batzner et al., 2021; Wang et al., 2020b; Bakarji et al., 2022; Kashinath et al., 2021; Jin et al., 2020). In many applications, symmetries are implemented approximately via data augmentation (Baird, 1992; Van Dyk and Meng, 2001; Wong et al., 2016; Cubuk et al., 2018; Dao et al., 2019; Chen et al., 2020a; Shen et al., 2022). In other applications symmetries are implemented exactly. In graph neural networks, the learned functions are equivariant with respect to actions by permutations of the order in which nodes appear in the adjacency matrix (Gilmer et al., 2017; Duvenaud et al., 2015; Chen et al., 2019a; Gama et al., 2020). Parametrizing such functions efficiently and universally is a difficult task because of its connection with the graph isomorphism problem. Many methods and theoretical results have been proposed to address this challenge (Xu et al., 2018; Morris et al., 2019; Chen et al., 2019b, 2020b; Huang and Villar, 2021). More generally, in equivariant machine learning, neural networks are restricted to only represent functions that are invariant or equivariant with respect to group actions (Cohen and Welling, 2016; Maron et al., 2018; Kondor, 2018). Some of these methods involve group convolutions (Cohen and Welling, 2016, 2017; Wang et al., 2020a), irreducible representations of groups (Fuchs et al., 2020; Kondor, 2018; Thomas et al., 2018; Weiler et al., 2018; Cohen et al., 2018; Weiler and Cesa, 2019), or constraints on optimization (Finzi et al., 2021). 
Others involve the construction of explicitly invariant features (Gripaios et al., 2021; Haddadin, 2021; Villar et al., 2021; Blum-Smith and Villar, 2022). We are proponents of the approach of constructing explicit invariants. In regression problems, a recent line of theoretical work shows that imposing symmetries and group equivariance can reduce generalization error in linear (Elesedy and Zaidi, 2021) and kernel (Elesedy, 2021) settings, as well as sample complexity in (finite) group-invariant kernel settings (Bietti et al., 2021; Mei et al., 2021). Most importantly, without imposing certain symmetries many learning algorithms are not able to learn correctly (Brugiapaglia et al., 2021).

Our contributions:

We provide a definition for units equivariance as an equivariance with respect to a group action, and incorporate it into machine learning models, aided by ideas from classical dimensional analysis.

We show that exact units equivariance is easy to impose on many kinds of learning tasks, by constructing a dimensionless version of the learning task, performing the dimensionless task, and then scaling back in the proper dimensions (and units) at the end (perhaps prior to evaluating the cost function). Dimensionless quantities are invariants with respect to changes in units and can be computed using discrete linear algebra algorithms. In this sense, the approach we advocate here is related to approaches based on group invariants to impose exact group equivariances (Villar et al., 2021).


We discuss extensions of theoretical results on generalization bounds for regression problems under symmetries generated by compact groups (Elesedy and Zaidi, 2021), to the group of scalings (which is not compact but reductive).

We demonstrate with a few simple numerical regression problems that the reduction of model capacity (at fixed complexity) delivered by the units equivariance leads to improvements in generalization (in-distribution and out-of-distribution). In this context, we discuss symbolic regression and emulator-related tasks. We also discuss the limitations of our approach in the context of unknown dimensional constants.

2. Units, dimensions, and units equivariance

Almost every physical quantity (any position, velocity, or energy, say) has units (inches, kilometers per hour, or BTUs, say). In the physical sciences, we are advised to use SI units (Thompson and Taylor, 2008), which include meters (m), meters per second ($\mathrm{m\,s^{-1}}$), and Joules (J), for example. Any energy can be converted to J, any velocity can be converted to $\mathrm{m\,s^{-1}}$, and so on, according to known conversion factors. There are dimensionless quantities in physics too, such as Reynolds numbers, or concentrations, but these can also be thought of as having units of unity (or percent or parts per million or so on). Abstracting slightly, all the (say) SI units are built on base units of kilograms (kg), meters (m), seconds (s), kelvin (K), amperes (A), and a few others (such as moles). That is, the units of any physical quantity can be converted to powers of the SI base units. For example, a pascal (Pa) is a $\mathrm{kg\,m^{-1}\,s^{-2}}$, and a volt (V) is a $\mathrm{kg\,m^{2}\,s^{-3}\,A^{-1}}$.

Abstracting even further, there is a concept of dimensions, which is the physical entity being measured by units, or the thing that is unchanged when you change the units of something. Two energies, even if measured in different unit systems, both have the dimensions of energy. In this sense, there are not just the base units of kg, m, s, K, A, and so on; there are also the base dimensions of mass, length, time, temperature, current, and so on. It is always possible to create, from any dimensioned quantity, a dimensionless quantity by multiplying and dividing by powers of quantities with the base dimensions. Similarly, it is possible to convert any dimensional relationship into a dimensionless relationship. This operation is formalized in the Buckingham Pi Theorem (Buckingham, 1914), which motivates this work.

Not only is it the case that physical quantities that are added must have the same dimensions; in detail, the numerical addition of the quantities must be performed only after they have been converted to the same units as well. And products (or ratios) of quantities with units will have the products (or ratios) of the input-quantity units. Connected to this is a concept of consistent base units or coherence: For example, imagine having a mass $M$ measured in grams, a length $L$ measured in inches, a force $F$ measured in newtons, and a speed $V$ measured in kilometers per hour. These quantities have inconsistent base units, in the sense that the time unit inside the force unit is not the same as the time unit inside the speed unit. Naive multiplication of different combinations of these quantities will produce outputs with incommensurate units. They have to be converted to consistent units prior to any arithmetic manipulations. In what follows, we will assume that the inputs and


outputs of any model or problem under consideration have been converted into coherent units, for example, the explicitly coherent SI system. Coherence is a hard requirement, but still leaves a lot of room to maneuver. For example, in one of the examples in Section 5, we measure horizontal distances in meters and volumes of water in liters. These are technically incoherent, since a volume can be expressed as a length cubed. However, since we express the problem such that horizontal distances and volumes never inter-convert, we can coherently express the problem with this choice.

We consider spaces $Z = \prod_{i=1}^{d} X_{u_i}$, which consist of coherent (meaning described in a consistent units system) dimensional elements. Each factor $X_{u_i}$ is a space of values for a given feature, which is measured in units specified by a parameter $u_i$. Thus $x_1 \in X_{u_1}$ might be a mass, $x_2 \in X_{u_2}$ a temperature, $x_3 \in X_{u_3}$ a velocity, etc. As a set, each $X_{u_i}$ is just $\mathbb{R}$, but the specification of its dimensions via $u_i$ endows it with a specific action by a group $G$ of rescalings. A precise development of this setup follows.

We fix a list of $k$ base units in terms of which all the desired features can be described. For example, if the features consist of energies, temperatures, velocities, forces, masses, and accelerations, the base units could be (kg, m, s, K) (and $k = 4$). The choice of base units determines a rescaling group $G := \mathbb{R}_{>0}^{k}$. An element $(g_1, \ldots, g_k) \in G$ rescales the $i$th base unit by a factor of $g_i$ for each $i$. Say there are $d$ features. Then for each $i = 1, \ldots, d$, we express the units of the $i$th feature in terms of the base units, and record this expression in an integer vector $u_i \in \mathbb{Z}^{k}$, whose $j$th component $u_{ij}$ is the exponent to which the $j$th base unit occurs in this expression. Continuing the example, if the first feature is an energy measured in Joules, then $u_1 = [1, 2, -2, 0]$, because $1\,\mathrm{J} = 1\,\mathrm{kg\,m^2/s^2}$.

We then define $X_{u_i}$ to be the real line $\mathbb{R}$ equipped with the action by $G$ induced by its action on the base units. Explicitly, if $x_i \in X_{u_i}$ and $g = (g_1, \ldots, g_k) \in G$, then the action is given by the formula

$$g \cdot x_i = \Big(\prod_{j=1}^{k} g_j^{-u_{ij}}\Big)\, x_i. \tag{1}$$

In our running example, if we replace kg with g and m with cm, leaving s and K untouched, then the group element that accomplishes this rescaling is $g = (0.001, 0.01, 1, 1)$, and if $x_1 = 2.9$, representing a value of 2.9 J, then

$$g \cdot x_1 = (0.001)^{-1}\,(0.01)^{-2}\,(1)^{2}\,(1)^{0}\,(2.9) = 2.9 \times 10^{7},$$

reflecting the fact that $2.9\,\mathrm{J} = 2.9 \times 10^{7}\,\mathrm{g\,cm^2/s^2}$. We call the space of features $Z$ a units-typed space. It is the cartesian product

$$Z = \prod_{i=1}^{d} X_{u_i}. \tag{2}$$

It is a real vector space under coordinatewise addition and multiplication by (dimensionless) scalars. Furthermore, the elements of $X_u$ can be multiplied by elements of $X_{u'}$. In summary


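the algebraic rules for $x \in X_u$ and $x' \in X_{u'}$, $\alpha \in \mathbb{R}$ and $\gamma \in \mathbb{Z}$ are the following:

$$\alpha\,(x, u) = (\alpha\,x,\ u), \quad \text{where } \alpha \text{ is a dimensionless scalar} \tag{3}$$

$$(x, u) + (x', u') = \begin{cases} (x + x',\ u) & \text{if } u = u' \\ \text{does not exist} & \text{otherwise} \end{cases} \tag{4}$$

$$(x, u)\,(x', u') = (x\,x',\ u + u') \tag{5}$$

$$(x, u)^{\gamma} = (x^{\gamma},\ \gamma\,u), \quad \text{where } \gamma \text{ is a (dimensionless) integer.} \tag{6}$$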
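Rules (3) to (6) translate directly into a small value type that carries its units vector along with its numerical value. The sketch below is illustrative only; the class name `Quantity` and the choice of base units (kg, m, s) are assumptions, not part of the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quantity:
    """A pair (x, u): a real value x and an integer units vector u."""
    value: float
    units: tuple  # integer exponents over the base units, here (kg, m, s)

    def scale(self, alpha):
        # Rule (3): a dimensionless scalar leaves the units unchanged.
        return Quantity(alpha * self.value, self.units)

    def __add__(self, other):
        # Rule (4): addition exists only for identical units.
        if self.units != other.units:
            raise ValueError("cannot add quantities with different units")
        return Quantity(self.value + other.value, self.units)

    def __mul__(self, other):
        # Rule (5): values multiply, units vectors add.
        units = tuple(a + b for a, b in zip(self.units, other.units))
        return Quantity(self.value * other.value, units)

    def __pow__(self, gamma):
        # Rule (6): only integer powers are allowed.
        if not isinstance(gamma, int):
            raise TypeError("exponent must be an integer")
        return Quantity(self.value ** gamma, tuple(gamma * a for a in self.units))

mass = Quantity(2.0, (1, 0, 0))        # 2 kg
speed = Quantity(3.0, (0, 1, -1))      # 3 m/s
kinetic = (mass * speed ** 2).scale(0.5)
print(kinetic)                         # value 9.0 with units (1, 2, -2), i.e. joules
```

Attempting `mass + speed` raises a `ValueError`, which is the code-level counterpart of "does not exist otherwise" in rule (4).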

Thus $X_u$ can be seen as a homogeneous component of degree $u$ in a $\mathbb{Z}^k$-graded algebra. We do not make explicit use of this, however. Because each factor $X_u$ carries an action (1) by the rescaling group $G$, the space $Z$ does as well. Note that the action is completely specified by the ordered list $(u_1, \ldots, u_d) \in (\mathbb{Z}^k)^d$ of units vectors. Sometimes we will denote $x \in Z$ as $x = (x_i, u_i)_{i=1,\ldots,d}$ to indicate the units of each of its features. We call any function between units-typed spaces a units-typed function. Definition 1 defines units-equivariant functions.

Definition 1 A units-typed function is any function $f : Z_X \to Z_Y$. If, in addition, it has the property that

$$f(g \cdot x) = g \cdot f(x) \tag{7}$$

for every $g$ in the rescaling group $G$, then we say that $f$ is a units-equivariant function.

For example, imagine that $f$ is a units-equivariant function that takes as input a mass, a length, and an acceleration, and returns a time and an energy. If the mass, length, and acceleration inputs are multiplied by 1, 1, and 1/25 respectively, then the time and energy outputs should end up multiplied by 5 and 1/25 respectively. That is, the units equivariance is an expression of the natural dimensional and units-conversion scalings. In order for a non-constant function $f$ to be units-equivariant, it is necessary that each of the vectors of units of $Z_Y$ lies in the span of the set of input vectors of units of $Z_X$. In other work (Villar et al., 2021) we are concerned with geometric (coordinate-free) properties of scalar, vector, and tensor inputs to machine learning problems. Each component of a vector or a tensor with units (such as a velocity or a stress) can be represented as an element of $X_u$. In physics, all elements of a vector or tensor must have the same units vector $u$. By combining the formulation from Villar et al. (2021) with our current formulation, we can make units-equivariant and coordinate-free models (see Section 5).
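The action (1) and the equivariance condition (7) can be checked numerically. The sketch below is a hedged illustration: the helper names `act` and `is_equivariant`, and the specific function `f`, are assumptions chosen to mirror the two examples in the text (the 2.9 J conversion, and the mass/length/acceleration example with a time unit rescaled by 1/5).

```python
import math

def act(g, x, u):
    """Action (1): multiply x by the product of g_j ** (-u_j)."""
    factor = math.prod(gj ** (-uj) for gj, uj in zip(g, u))
    return factor * x

# 2.9 J over base units (kg, m, s, K), re-expressed after kg -> g and m -> cm.
erg_value = act((0.001, 0.01, 1.0, 1.0), 2.9, (1, 2, -2, 0))
print(erg_value)  # approximately 2.9e7

# Hypothetical equivariant function: (mass, length, acceleration) -> (time, energy).
def f(mass, length, accel):
    return math.sqrt(length / accel), mass * accel * length

IN_UNITS = [(1, 0, 0), (0, 1, 0), (0, 1, -2)]   # over (kg, m, s)
OUT_UNITS = [(0, 0, 1), (1, 2, -2)]

def is_equivariant(func, x, g, in_units, out_units):
    """Check f(g . x) == g . f(x) up to floating-point tolerance."""
    lhs = func(*[act(g, xi, ui) for xi, ui in zip(x, in_units)])
    rhs = [act(g, yi, vi) for yi, vi in zip(func(*x), out_units)]
    return all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(lhs, rhs))

g = (1.0, 1.0, 0.2)      # rescales only the time unit; acceleration picks up 1/25
x = (2.0, 0.5, 9.8)
print(is_equivariant(f, x, g, IN_UNITS, OUT_UNITS))

# Adding a bare number to a time breaks the symmetry.
def f_bad(mass, length, accel):
    return math.sqrt(length / accel) + 1.0, mass * accel * length

print(is_equivariant(f_bad, x, g, IN_UNITS, OUT_UNITS))
```

With `g = (1, 1, 0.2)` the acceleration is multiplied by 1/25, the time output by 5, and the energy output by 1/25, matching the example above.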

3. Units-equivariant regressions

Given training data $(x_t, y_t)_{t=1,\ldots,N}$, where $x_t \in Z_X$ and $y_t \in Z_Y$ (both units-typed spaces), a units-equivariant regression is a regression restricted to a space of units-equivariant functions. There are multiple approaches for imposing exact symmetries on machine learning methods. Here we take an invariant-features approach (Villar et al., 2021). We begin by algorithmically constructing a featurizer $\phi(\cdot)$ that constructs dimensionless features $\xi$ from the dimensioned input data $x$, and a decoder $g_{x,v}(\cdot)$ that converts dimensionless label predictions


[Figure 1: Overview of the general approach. Dimensional inputs $Z$ → featurizer $\phi$ → dimensionless features $\mathbb{R}^s$ → learned map $h$ → dimensionless label $\mathbb{R}$ → decoder $g_{x,v}$ → dimensional output $X_v$.]

$\hat{\eta}$ into dimensioned training-label predictions $\hat{y}$ with dimensions of $v$. See Figure 1 for a visualization of the setup. Regression (or classification, or any other task) proceeds as usual, but in the space of purely dimensionless features and labels, with the dimensioned training labels appearing only in the cost function. The concept underlying this approach is that, for the units equivariance to be exact, it is necessary that any method that implements nonlinear functions of the input $x$ (such as a multilayer perceptron, or a kernel regression) act instead only on dimensionless features $\xi$, because otherwise the nonlinearities effectively (internally) add and subtract quantities with different dimensions, which violates the equivariance.

This approach borrows ideas from the Buckingham Pi Theorem (Buckingham, 1914), and uses technology from linear algebra over the integers (Stanley, 2016) to construct the featurizer $\phi(\cdot)$ algorithmically. Specifically, the input space $Z$ of (2) includes $d$ dimensional input features, such as mass, temperature, etc. Some of the inputs will be fundamental constants, such as Newton's constant, the speed of light, etc. The featurizer $\phi : Z \to \mathbb{R}^s$ delivers $s$ (dimensionless) products of integer powers of the numerical elements of $Z$. If $x = (x_i, u_i)_{i=1,\ldots,d}$ then $\phi(x) = (\phi_1(x), \ldots, \phi_s(x))$, where for all $j = 1, \ldots, s$ we have

$$\xi_j = \phi_j(x) = \prod_{i=1}^{d} x_i^{\alpha_{ji}} \quad \text{where} \quad \sum_{i=1}^{d} \alpha_{ji}\,u_i = 0 \in \mathbb{Z}^k, \tag{8}$$

where the constraint guarantees that $\xi_j$ is dimensionless. The exponents $\alpha_{ji} \in \mathbb{Z}$ can be found by solving the system of linear diophantine equations in (8). The solutions form a lattice, the dimension of which can be computed as

$$\#\,\text{dimensionless features} = \#\,\text{input variables} - \#\,\text{independent units}, \tag{9}$$

where the number of independent units coincides with the number of linearly independent vectors in $\{u_i\}_{i=1}^{d}$. For example, consider three velocities $v_1, v_2, v_3$ with units $\mathrm{m\,s^{-1}}$: the number of units is two (m and s), but the number of independent units is one, making the dimensionless scalars a two-dimensional lattice. We could select our featurizer to produce a basis of the lattice (we use the Smith Normal Form to this end (Stanley, 2016), a similar approach to Hubert and Labahn (2012), which uses the Hermite Normal Form), or we could select our featurizer to produce all lattice points within a bounded region.

If the dimensioned output is $y = (y, v) \in X_v$, we find an integer solution $\alpha_{yi}$ of $\sum_{i=1}^{d} \alpha_{yi}\,u_i = v$, and the decoder $g_{x,v} : \mathbb{R} \to X_v$ is $g_{x,v}(\hat{\eta}) = \hat{\eta} \prod_{i=1}^{d} x_i^{\alpha_{yi}}$. In words, the decoder finds from $x$ a product of integer powers of elements of $x$ that has the same dimensions as the output label $y$, and multiplies the dimensionless label prediction $\hat{\eta}$ by that product. This is possible because any non-constant units-equivariant function must have the


vector $v$ lie in the span of the vectors $\{u_i\}_{i=1}^{d}$. In practice, there is not a unique choice for $\alpha_{yi}$. One practical solution is to learn different models for different solutions $\alpha_{yi}$ and use an ensemble method (see Section 5).

The training of a regression model usually involves the optimization of a loss function. This is often a norm of a difference between the training labels $y$ and their predictions $\hat{y}$. When the problem is made dimensionless, this loss can be made dimensionless as well, or else it can be left dimensional. In many contexts, the loss is a chi-squared objective or a log-likelihood, or can be interpreted as such. In these cases, the loss is naturally dimensionless. The units-equivariance approach we recommend is agnostic to whether the loss is dimensionless or dimensional; adoption of this approach does not require the adoption of any particular loss.

This approach does, however, place several significant burdens on the user: All elements of the input $x$ and output $y$ of the method must have well-defined and known units, and the units information must be encoded in the form of a $k$-vector of integer powers of a well-defined list of $k$ base units. Also, sometimes quantities external to the natural training data must be included with the inputs $x$, such as parameters or fundamental constants (such as Newton's constant and the speed of light), that are relevant to unit conversions or natural relationships among quantities with different units.

The key idea is that the dimensionless featurizer $\phi(\cdot)$, which produces $s$ dimensionless features, can be compared with an arbitrary featurizer, which produces $p$ features that are not necessarily dimensionless. For example, we can consider the space spanned by rational monomials of the inputs up to a certain degree.
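The featurizer (8) and the decoder can be sketched in a few lines of standard-library Python. This is a simplified stand-in, not the paper's implementation: it finds integer exponent vectors by rational Gaussian elimination followed by clearing denominators, which spans the null space over the rationals but, unlike the Smith Normal Form used in the paper, is not guaranteed to return a basis of the full integer lattice. All function names and the example inputs (mass, spring constant, length, momentum magnitude over base units kg, m, s) are assumptions for illustration.

```python
from fractions import Fraction
from itertools import product
from math import lcm, prod

def integer_nullspace(rows):
    """Integer vectors a with (rows @ a) == 0, spanning the rational null space."""
    A = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(A), len(A[0])
    pivots, r = [], 0
    for c in range(n_cols):                      # Gauss-Jordan elimination
        pivot_row = next((i for i in range(r, n_rows) if A[i][c] != 0), None)
        if pivot_row is None:
            continue
        A[r], A[pivot_row] = A[pivot_row], A[r]
        A[r] = [v / A[r][c] for v in A[r]]
        for i in range(n_rows):
            if i != r and A[i][c] != 0:
                A[i] = [vi - A[i][c] * vr for vi, vr in zip(A[i], A[r])]
        pivots.append(c)
        r += 1
        if r == n_rows:
            break
    basis = []
    for free in (c for c in range(n_cols) if c not in pivots):
        vec = [Fraction(0)] * n_cols
        vec[free] = Fraction(1)
        for row_idx, p in enumerate(pivots):
            vec[p] = -A[row_idx][free]
        scale = lcm(*(v.denominator for v in vec))   # clear denominators
        basis.append([int(v * scale) for v in vec])
    return basis

def combine_units(exponents, units):
    """Units vector of the product of x_i ** exponents[i]."""
    k = len(units[0])
    return tuple(sum(a * u[j] for a, u in zip(exponents, units)) for j in range(k))

def featurize(x, alphas):
    """Dimensionless features xi_j = prod_i x_i ** alpha_ji, as in (8)."""
    return [prod(xi ** a for xi, a in zip(x, alpha)) for alpha in alphas]

def decoder_exponents(units, v, max_abs=2):
    """Brute-force one small integer solution of sum_i alpha_i u_i = v."""
    candidates = product(range(-max_abs, max_abs + 1), repeat=len(units))
    matches = [a for a in candidates if combine_units(a, units) == tuple(v)]
    return min(matches, key=lambda a: sum(abs(e) for e in a))  # ties: search order

def decode(eta_hat, x, alpha_y):
    """Decoder g_{x,v}: rescale a dimensionless prediction to the units of v."""
    return eta_hat * prod(xi ** a for xi, a in zip(x, alpha_y))

# Hypothetical inputs over (kg, m, s): mass, spring constant, length, |momentum|.
units = [(1, 0, 0), (1, 0, -2), (0, 1, 0), (1, 1, -1)]
alphas = integer_nullspace([list(col) for col in zip(*units)])
x = [2.0, 3.0, 0.5, 4.0]
xi = featurize(x, alphas)
alpha_y = decoder_exponents(units, (1, 2, -2))   # target units: energy
y_hat = decode(1.0, x, alpha_y)
print(alphas, xi, alpha_y, y_hat)
```

Here $d = 4$ and there are three independent units, so (9) predicts a single dimensionless feature. The decoder exponents are not unique (both $k_s L^2$ and $|p|^2/m$ have units of energy), which is the non-uniqueness noted above.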

Definition 2 (Rational monomial) A function $P : \mathbb{R}^d \to \mathbb{R}$ is a rational monomial (also known as a Laurent monomial) if

$$P(x_1, \ldots, x_d) = \prod_{i=1}^{d} a_i\,x_i^{\alpha_i}, \tag{10}$$

where the coefficients $a_i \in \mathbb{R}$ and the exponents $\alpha_i \in \mathbb{Z}$. The degree of $P$ is defined as $\sum_{i=1}^{d} |\alpha_i|$.

The units-equivariance approach (8) will produce features consisting of dimensionless rational monomials (example: mass $\times$ spring constant $\times$ (length)$^{2}$ $\times$ (momentum)$^{-2}$ is a dimensionless rational monomial), whereas a non-equivariant approach can produce arbitrary features of the same type (rational monomials of bounded degrees). We claim that imposing the units equivariance provides a good inductive bias for the learning problem. In Section 4 we briefly discuss the generalization gains of imposing this exact symmetry. This approach plays well with many machine learning approaches, including linear regression, kernel regression, and deep learning: It involves only a swap-in replacement for the featurizing or feature normalization that is usually done prior to training a model, and a tweak to the output layer of the method. Thus it can be adapted to almost any machine learning method in use at the present day.
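To make the comparison between arbitrary and dimensionless rational monomials concrete, one can simply enumerate exponent vectors up to a given degree (in the sense of Definition 2) and count how many are dimensionless. The sketch below is illustrative; the four inputs and their units over (kg, m, s) are assumptions, and coefficients are ignored so that each exponent vector counts as one monomial.

```python
from itertools import product

def units_of(exponents, units):
    """Units vector of the monomial prod_i x_i ** exponents[i]."""
    k = len(units[0])
    return tuple(sum(a * u[j] for a, u in zip(exponents, units)) for j in range(k))

def count_monomials(units, max_degree):
    """Count exponent vectors with sum |alpha_i| <= max_degree, and the dimensionless ones.

    Both counts include the trivial all-zero exponent vector (the constant monomial).
    """
    d = len(units)
    n_all = n_dimensionless = 0
    for exps in product(range(-max_degree, max_degree + 1), repeat=d):
        if sum(abs(e) for e in exps) > max_degree:
            continue
        n_all += 1
        if all(c == 0 for c in units_of(exps, units)):
            n_dimensionless += 1
    return n_all, n_dimensionless

# Hypothetical inputs over (kg, m, s): mass, spring constant, length, |momentum|.
units = [(1, 0, 0), (1, 0, -2), (0, 1, 0), (1, 1, -1)]
n_all, n_dimensionless = count_monomials(units, max_degree=6)
print(n_all, n_dimensionless)
```

For these inputs the only dimensionless monomials of degree at most 6 are the constant, mass $\times$ spring constant $\times$ (length)$^2$ $\times$ (momentum)$^{-2}$, and its inverse, so the dimensionless feature set is a small fraction of the unconstrained one.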


4. Generalization improvements of units equivariance

Imposing the exact units-equivariance symmetry in regression tasks improves the prediction performance even in cases where the held-out data has out-of-distribution properties. This empirical improvement can be partly attributed to the fact that the symmetry acts like a physics-informed prior which constrains the regression to satisfy correct dimensional scaling relationships. Another reason could be dimensionality reduction: when the task is made dimensionless, the resulting number of independent, dimensionless inputs is always strictly less than the number of dimensional inputs. One way to make this argument precise is by assuming that the units-equivariant functions are a subset of the baseline hypothesis class. In this case, at fixed model complexity (e.g. rational monomials of a certain degree), the number of free parameters or the model capacity goes down when the problem is made dimensionless.

In Appendix A we formally discuss the generalization improvements of certain equivariant models and how these results translate to units equivariance. We base our analysis on recent results that show how to compute explicitly the gains, in terms of generalization gaps and sample complexity, of imposing group invariances and equivariances in machine learning. The results from Mei et al. (2021) and Bietti et al. (2021) hold for finite groups, whereas the results from Elesedy and Zaidi (2021) and Elesedy (2021) hold for general compact groups. Specifically, Elesedy and Zaidi (2021) show that if we are aiming to learn a target invariant function $f^*$ from samples from an invariant distribution $\mu$, then for any estimator $\hat{f}$, the (Reynolds) projection of $\hat{f}$ onto the space of invariant functions has a smaller expected error. We explain the details in Appendix A.

The problem we consider here, units equivariance, uses the group of scalings, which is unfortunately non-compact, so the results from Elesedy and Zaidi (2021) don't directly apply. In particular, given a function, it is unclear what the correct notion of its projection onto the space of invariant functions is. This is particularly relevant because our approach does not compute a regression to later project its estimator onto the space of invariant functions. It directly solves a regression in a space of invariant functions. The question of how much one gains by using this approach assumes the existence of a certain non-invariant baseline. Given a space of functions $\mathcal{F}$ and the subspace $\bar{\mathcal{F}}$ of invariant functions, a projection onto the space of invariant functions is just an operator $P : \mathcal{F} \to \bar{\mathcal{F}}$ that fixes $\bar{\mathcal{F}}$ pointwise. Therefore the way to project a function onto the space of invariant functions is far from unique. There are two notions of projection we consider: (1) the so-called Reynolds projection, which generalizes the notion of averaging the function along the group orbit, and (2) the orthogonal projection with respect to the measure $\mu$ used to generate the data. In the case of compact groups and data sampled from invariant measures both notions coincide, giving a very simple expression for the generalization gap of a function in comparison with its projection onto the space of invariant functions. However, we show that in the non-compact case, the notions do not coincide for any real measure $\mu$. In Appendix A:

We summarize the ideas from Elesedy and Zaidi (2021) that allow them to compute a generalization gap for compact groups. In a nutshell: We assume we are learning an unknown $G$-invariant function $f^* : \mathbb{R}^d \to \mathbb{R}$ from samples. We assume that $G$ is a compact group and the data is sampled from a $G$-invariant measure in $\mathbb{R}^d$ ($x \sim \mu$ in $\mathbb{R}^d$,


$\mu(U) = \mu(g\,U)$ for all $g \in G$ and measurable $U \subseteq \mathbb{R}^d$). The risk of a function $f : \mathbb{R}^d \to \mathbb{R}$ is defined as the expected test error:

$$R(f) := \mathbb{E}_{x \sim \mu}\,\big|f(x) - f^*(x)\big|^2. \tag{11}$$

Elesedy and Zaidi show that by projecting $f$ onto the space of invariant functions, the generalization gap always improves. In particular:

$$\Delta(f, \bar{f}) := R(f) - R(\bar{f}) = \|f^{\perp}\|_{\mu}^{2} \geq 0, \tag{12}$$

where $\bar{f}$ is the (orthogonal or Reynolds) projection of $f$ onto the space of invariant functions (see Appendix A for the precise definition), and $f^{\perp}$ is its orthogonal complement in $L^2(\mathbb{R}^d, \mu)$.

We explain how to extend these ideas to non-compact groups using Weyl's trick.

We show that if we replace $\mathbb{R}^d$ with a complex domain in $\mathbb{C}^d$, and the measure is invariant with respect to complex scalings of modulus 1, then the results from Elesedy and Zaidi (2021) apply and we can prove a non-negative generalization gap for equivariant functions with respect to complex scalings.

We show that if the domain is real, the two notions of projection (orthogonal and Reynolds) don't match for any choice of measure.
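The compact-group identity (12) can be illustrated numerically in the simplest possible setting. The sketch below is an assumption-laden toy, not the units-equivariance case: the group is the two-element sign-flip group $x \mapsto -x$ acting on $\mathbb{R}$ (finite, hence compact), the measure is the empirical measure on a symmetric grid (hence invariant), and the functions `f_star` and `f_hat` are made up for illustration.

```python
import math

def risk(f, f_star, xs):
    """Empirical version of (11): mean squared error against the target."""
    return sum((f(x) - f_star(x)) ** 2 for x in xs) / len(xs)

def reynolds(f):
    """Reynolds projection for the sign-flip group: average f over the orbit {x, -x}."""
    return lambda x: 0.5 * (f(x) + f(-x))

f_star = lambda x: x ** 2                    # invariant target
f_hat = lambda x: 0.9 * x ** 2 + 0.3 * x     # non-invariant estimator
f_bar = reynolds(f_hat)                      # invariant part
f_perp = lambda x: f_hat(x) - f_bar(x)       # remainder, odd in x

xs = [k / 10 for k in range(-20, 21)]        # symmetric grid: invariant under x -> -x
gap = risk(f_hat, f_star, xs) - risk(f_bar, f_star, xs)
norm_sq = sum(f_perp(x) ** 2 for x in xs) / len(xs)
print(gap, norm_sq)                          # equal up to rounding, and non-negative
```

The equality holds here because the odd part of `f_hat` is orthogonal to every even function under a symmetric measure; for the non-compact scaling group no such invariant probability measure is available, which is the difficulty discussed above.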

Out-of-distribution generalization The generalization improvements from Elesedy and Zaidi (2021); Elesedy (2021); Bietti et al. (2021) assume that the data is sampled from group invariant measures. However, since the group of scalings is not compact, scaling invariant probability measures do not exist. The exact scaling symmetries enforced by the units equivariance extend to arbitrary dilations of the base dimensions. Thus they connect function outputs for very different inputs. In the context of regression, this means that the predictions of trained units-equivariant models ought to enable accurate predictions far outside the domain of the training set. One way to state this is to compare the domains of the dimensionless features to the domains of the raw dimensional features. Mathematically, this can be stated in terms of out-of-distribution generalization (Arjovsky, 2020; Geisa et al., 2021; Dey et al., 2022). In particular, consider a generic problem with d inputs with dimensions made from k base units, and (therefore) a basis of s = d k dimensionless quantities. Consider a training set sampled from a distribution µ supported in a compact set D Rd. This induces a distribution ϕ(µ) in the space of dimensionless features Rs. Many distributions in Rd (possibly even supported in disjoint sets) map to ϕ(µ). For a trivial example, consider d = 2 mass inputs, such that k = 1; if the training set has masses in kg drawn from Unif(0, 1) and the test set has masses in g drawn from Unif(0, 104), they will nonetheless both have the same distribution in the one (d k = 1) available independent dimensionless quantity (the ratio of masses). This is one reason why this method allows us to generalize to settings that can be considered out-of-distribution in the original input space. We conjecture that the out-of-distribution generalization improvement spans beyond this trivial case. We show numerical evidence of this claim in Section 5. 
This observation is corroborated by related literature in which different types of equivariance are shown to be related to data augmentation procedures (Chen et al., 2020a), which, in turn, lead to empirically favorable performance in out-of-distribution settings (Liang and Zou, 2022). We believe that a rigorous out-of-distribution result could be formalized using techniques from domain adaptation and meta-learning (Ben-David et al., 2010; Mansour et al., 2009; Hanneke and Kpotufe, 2019; Li et al., 2018; Kang and Feng, 2018); we leave that for future work.
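The trivial mass-ratio example can be checked numerically. The following is a minimal sketch, not part of the paper's code: the sample size, the seed, and the choice of quartiles as a summary are our own assumptions. It draws two masses per sample in each unit system and compares the quartiles of the ratio, which for two independent uniforms on a common interval are 0.5, 1, and 2 regardless of the interval's scale.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed (assumption)
n = 200_000                      # arbitrary sample size (assumption)

# Training set: two masses per sample, in kg, drawn from Unif(0, 1).
train_kg = rng.uniform(0.0, 1.0, size=(n, 2))
# Test set: two masses per sample, in g, drawn from Unif(0, 10^4).
test_g = rng.uniform(0.0, 1.0e4, size=(n, 2))

# The single dimensionless feature (d - k = 1): the ratio of the two masses.
ratio_train = train_kg[:, 0] / train_kg[:, 1]
ratio_test = test_g[:, 0] / test_g[:, 1]

# Compare the two ratio distributions through their quartiles.
quartiles = [0.25, 0.5, 0.75]
q_train = np.quantile(ratio_train, quartiles)
q_test = np.quantile(ratio_test, quartiles)
print("raw ranges:", train_kg.max(), "kg vs", test_g.max(), "g")
print("ratio quartiles (train):", q_train)
print("ratio quartiles (test): ", q_test)
```

The raw inputs occupy very different ranges, while the dimensionless feature has the same distribution in both sets up to sampling noise.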

5. Experimental demonstrations

Symbolic regression: Simple springy pendulum In this example, we consider a pendulum bob of mass $m$ (units of kg) at the end of a linear spring, swinging under the influence of gravity. The total mechanical energy or hamiltonian $H$ (units of kg m$^2$ s$^{-2}$) of this system consists of a kinetic energy and two potential-energy contributions:

$$H = \underbrace{\frac{1}{2}\,\frac{|p|^2}{m}}_{\text{kinetic energy}} + \underbrace{\frac{1}{2}\,k_s\,(|q| - L)^2}_{\text{spring potential energy}} - \underbrace{m\,g^\top q}_{\text{gravitational potential energy}}$$

where $p$ is the 3-vector momentum of the bob (units of kg m s$^{-1}$), $|p|^2 = p^\top p$, $k_s$ is the spring constant (units of N m$^{-1}$ = kg s$^{-2}$), $q$ is the 3-vector position of the bob relative to the pivot (units of m), $|q| = \sqrt{q^\top q}$, $L$ is the natural length of the spring (units of m), and $g$ is the 3-vector acceleration due to gravity (units of m s$^{-2}$). The natural base units here are the SI base units (kg, m, s), but they could just as easily be (stone, furlong, fortnight).

This is almost the simplest possible physics problem. We honor its simplicity by constructing an extremely simplified symbolic regression: Given samples of the parameters $m, k_s, L, g$, the initial conditions $p, q$, and the corresponding values of the hamiltonian $H$, we show that (as expected) we can infer the exact functional form of the hamiltonian, and that imposing units equivariance significantly reduces the complexity and prediction error of the problem.

First, we observe that, in Newtonian mechanics, the hamiltonian or total mechanical energy is a scalar. Thus the hamiltonian can be a function only of scalars and scalar products of the vector and scalar inputs (Villar et al., 2021). We construct all rational scalar monomials of the inputs up to a well-defined degree, including, for example, $m\,k_s\,|q|\,(g^\top p)^{-2}$, where the vectors $g, p, q$ are implicitly column vectors, $|q|$ is the magnitude of $q$, and $g^\top p$ is the scalar (inner) product of $g$ and $p$. For our purposes, the degree of a rational monomial is the maximum absolute value of the exponents appearing in the expression, so this example has degree 2. Then we construct all dimensionless rational scalar monomials of the inputs up to the same well-defined degree, including, for example, $m\,k_s\,L^2\,|p|^{-2}$, which is also of degree 2 but dimensionless. It turns out that there are far fewer dimensionless rational scalar monomials than rational scalar monomials to any degree. In detail, the dimensional scalar inputs to our monomial lists are the 9 scalars $m$, $k_s$, $L$, $|g|$, $(g^\top p)$, $(g^\top q)$, $|p|$, $(p^\top q)$, $|q|$.
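The filtering of monomials by dimension can be sketched in a few lines. The example below is a simplified illustration under our own assumptions, not the paper's implementation: it uses only six of the nine scalars (it omits the dot products and the special rules applied to them), so its counts are not the counts reported in the text. A monomial is dimensionless exactly when the units matrix times its exponent vector is zero.

```python
import itertools
import numpy as np

# Units of six of the scalar inputs, as exponents of the base units (kg, m, s).
UNITS = {
    "m":   (1, 0, 0),
    "ks":  (1, 0, -2),
    "L":   (0, 1, 0),
    "|g|": (0, 1, -2),
    "|p|": (1, 1, -1),
    "|q|": (0, 1, 0),
}
names = list(UNITS)
U = np.array([UNITS[k] for k in names]).T   # shape (3 base units, 6 scalars)

max_degree = 2
exponent_range = range(-max_degree, max_degree + 1)

# Every rational monomial is a tuple of integer exponents, one per scalar.
all_monomials = list(itertools.product(exponent_range, repeat=len(names)))

# Keep the monomials whose net exponent of every base unit vanishes.
dimensionless = [e for e in all_monomials if not np.any(U @ np.array(e))]

def show(exponents):
    """Human-readable form of a monomial, e.g. 'm^1 ks^1 L^2 |p|^-2'."""
    return " ".join(f"{n}^{e}" for n, e in zip(names, exponents) if e != 0) or "1"

print(len(all_monomials), "monomials,", len(dimensionless), "dimensionless")
print("example:", show((1, 1, 2, 0, -2, 0)))
```

Even in this reduced setting the dimensionless monomials are a small fraction of all monomials, which is the effect the symbolic regression exploits.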
We produce all monomials up to a maximum degree of 2, subject to two additional rules: While we count the scalars $|g|$, $|q|$, and $|p|$ as having degree 1, we count the scalars $g^\top p$, $p^\top q$, and so on as having degree 2 (so at maximum degree 2, say, they cannot appear squared). We also do not permit the dot products $g^\top p$, $p^\top q$, and so on to appear with negative powers (because these inverses can produce unbounded singular values in the design matrix). With these inputs and these rules, there are 286 dimensionless monomials to (only!) degree 2, and 187 500 total monomials (irrespective of dimensions) to degree 2. Given the enormous difference between 286 and 187 500, it is obvious that units equivariance is incredibly informative.

We demonstrate the value experimentally by performing symbolic regressions for the hamiltonian $H$ with two different objectives: one an $L_2$ objective and the other a LASSO objective. In each case, the objective is the norm of the difference between the predicted and true hamiltonian value, made dimensionless by dividing by the quantity $k_s L^2$, which has units of energy. In the $L_2$ case, 8192 training-set objects are used, and in the LASSO case, 128. Training-set objects are drawn from distributions in $m, k_s, L, g, p, q$, in which the scalars $m, k_s, L$ are drawn from uniform distributions and the vectors $g, p, q$ are isotropically oriented unit vectors times magnitudes drawn from uniform distributions. In both cases, the regression applied to the dimensionless-feature design matrix delivers machine-precision-level errors on held-out data and linear-fit coefficients that represent the correct expression for the hamiltonian. The large number of baseline monomials (187 500) makes it computationally difficult to perform equivalent baseline comparisons. This alone demonstrates the value of the units-equivariant approach for symbolic-regression-like problems. However, in order to test this, we augment the 286 dimensionless monomials with 500 randomly chosen dimensional monomials; at these training-set sizes the symbolic regressions fail: They deliver an order of magnitude worse mean-squared error on held-out test data and they do not find the correct coefficients for the hamiltonian expression. The code is publicly available at Google Colab.¹
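To make the dimensionless regression concrete, the sketch below fits the scaled label $H/(k_s L^2)$ by ordinary least squares. It is an illustration under our own assumptions rather than the paper's Colab code: the sampling ranges and the seed are arbitrary, and instead of all 286 dimensionless monomials it uses only a hand-picked set of five that contains the terms of the true hamiltonian. Expanding the hamiltonian gives $H/(k_s L^2) = \tfrac{1}{2} - |q|/L + \tfrac{1}{2}|q|^2/L^2 + \tfrac{1}{2}|p|^2/(m k_s L^2) - m\,g^\top q/(k_s L^2)$, so the fit should return the coefficients $(\tfrac12, -1, \tfrac12, \tfrac12, -1)$.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed (assumption)
n = 128                          # training-set size of the LASSO case

def random_vectors(n, lo, hi):
    """Isotropically oriented unit vectors times uniform magnitudes."""
    v = rng.normal(size=(n, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return v * rng.uniform(lo, hi, size=(n, 1))

# Sampling ranges are illustrative assumptions, not the paper's values.
m, ks, L = (rng.uniform(1.0, 2.0, size=n) for _ in range(3))
g, p, q = (random_vectors(n, 1.0, 2.0) for _ in range(3))

norm_q = np.linalg.norm(q, axis=1)
H = (0.5 * np.sum(p * p, axis=1) / m        # kinetic energy
     + 0.5 * ks * (norm_q - L) ** 2         # spring potential energy
     - m * np.sum(g * q, axis=1))           # gravitational potential energy

E = ks * L**2                               # energy scale used to scale the label
features = np.column_stack([
    np.ones(n),                             # 1
    norm_q / L,                             # |q| / L
    (norm_q / L) ** 2,                      # |q|^2 / L^2
    np.sum(p * p, axis=1) / (m * E),        # |p|^2 / (m ks L^2)
    m * np.sum(g * q, axis=1) / E,          # m g.q / (ks L^2)
])

coef, *_ = np.linalg.lstsq(features, H / E, rcond=None)
print(np.round(coef, 6))
```

Because the label is an exact linear combination of these dimensionless features, the least-squares coefficients reproduce the hamiltonian to numerical precision.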

Emulator: Springy double pendulum Next, we consider the task of learning the dynamics of the springy double pendulum, which is a pair of single springy pendula connected with a free pivot² (Figure 2). The goal here is to predict its trajectory at later times from different initial states. For this task, for each of the $N$ training realizations, the parameters $m_1, m_2, k_{s1}, k_{s2}, L_1, L_2$ are randomly generated from Unif(1, 2), as is the norm of the gravitational acceleration vector $g$. The pendulum positions and momenta at $t_0$ are initialized as in Finzi et al. (2021) and Yao et al. (2021). The training labels are the positions and momenta at a set of $T$ later times $t \in \{t_1, \ldots, t_T\}$:

$$z(t) = (q_1(t), q_2(t), p_1(t), p_2(t)), \qquad t \in \{t_0, \ldots, t_T\}. \tag{14}$$

Figure 2: The springy double pendulum.

1. See https://dwh.gg/springy.
2. The source code is published at https://github.com/weichiyao/ScalarEMLP/tree/dimensionless.

In our experiments, the training set consists of positions and momenta of the pendula at a sequence of T = 10 equispaced consecutive times sampled from a sequence of T = 60 equispaced times obtained by integrating the dynamical system according to its ground-truth parameters. In the testing stage, the trajectories at t = 1, . . . , T from different initial states are predicted given the initializations at t = 0. We consider three different testing setups to compare the dimensionless scalar-based implementation with the dimensional baseline considered in Yao et al. (2021). This baseline is currently state-of-the-art on this problem. It embodies Hamiltonian and geometric symmetries and performs very well (Yao et al., 2021). The test data used in Experiment 1 are generated from the same distribution as the training data set. The test data used in Experiment 2 are obtained by applying a transformation to the test data of Experiment 1, in which each of the input parameters that include a power of
kg in its units ($m_1$, $m_2$, $k_{s1}$, $k_{s2}$, $p_1(0)$ and $p_2(0)$) is scaled by a factor randomly generated from Unif(3, 7). The test data used in Experiment 3 have the input parameters $m_1$, $m_2$, $k_{s1}$, $k_{s2}$, $L_1$ and $L_2$ generated from Unif(1, 5). We use the same training set of N = 30000 data points for all three experiments, and each test set consists of 500 data points. That is, Experiments 2 and 3 have out-of-distribution test data, relative to their training data.

We implement Hamiltonian neural networks (HNNs; Greydanus et al. 2019; Sanchez-Gonzalez et al. 2019) with scalar-based MLPs for this learning task. In particular, we have a set of scalar inputs $S = \{m_1, m_2, k_{s1}, k_{s2}, L_1, L_2\}$ and a set of vector inputs $V = \{g, p_1(0), p_2(0), q_1(0), q_2(0) - q_1(0)\}$. We construct the dimensional scalars (baseline) and dimensionless scalars based on these two sets of inputs. The dimensional scalar inputs to the baseline MLPs include 32 scalars: (i) the scalar inputs $S$, as well as their inverses $\{1/a : a \in S\}$; (ii) the inner products of the vector inputs $\{u^\top v : u, v \in V\}$, as well as their magnitudes $\{|u| : u \in V\}$. The dimensionless scalar inputs are the following 32 scalars: (i) $m_1/m_2$, $k_{s1}/k_{s2}$, $L_1/L_2$ and their inverses; (ii) we divide each vector input by its magnitude before we compute the inner products, which gives a set of dimensionless scalars $\{(u^\top v)/(|u||v|) : u, v \in V\}$; (iii) we also consider the dimensionless rational scalar monomials $(m_i |g|)/(k_{si} L_i)$, $(k_{si} L_i)/(m_i |g|)$, $|q_i(0)|/L_i$, $|q_i(0)|^2/L_i^2$, $|p_i(0)|/(\sqrt{m_i k_{si}}\, L_i)$, $|p_i(0)|^2/(m_i k_{si} L_i^2)$, $i = 1, 2$.

We use dimensionless scalars as inputs to the MLPs, which makes the outputs of the MLPs also dimensionless. The decoder then scales the outputs to restore the hamiltonian $H$ units of kg m$^2$ s$^{-2}$. At this stage we employ the following 26 scaling factors: $k_{sr} L_i L_j$, $m_i L_j |g|$, $m_i\, g^\top q_j(0)$, $(p_i(0)^\top p_j(0))/m_r$, $k_{sr}\, q_i(0)^\top q_j(0)$, $i, j, r \in \{1, 2\}$, all of which have units of kg m$^2$ s$^{-2}$.
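The effect of the Experiment 2 transformation on the dimensionless inputs can be illustrated directly. The sketch below is our own reading of items (i) to (iii), not the released ScalarEMLP code: the function name `dimensionless_features`, the array layout, and the random inputs are assumptions, and the resulting list is not claimed to match the 32 scalars used in the experiments. Every input carrying a power of kg is multiplied by one common factor, and the features are compared before and after.

```python
import numpy as np

def dimensionless_features(m, ks, L, g, p0, q0):
    """m, ks, L: arrays of shape (2,); g: shape (3,); p0, q0: shape (2, 3)."""
    feats = []
    # (i) ratios of like quantities and their inverses
    for a in (m, ks, L):
        feats += [a[0] / a[1], a[1] / a[0]]
    # (ii) inner products of the vector inputs after normalizing each vector
    vectors = [g, p0[0], p0[1], q0[0], q0[1] - q0[0]]
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            u, v = vectors[i], vectors[j]
            feats.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    # (iii) dimensionless rational monomials for each pendulum i = 1, 2
    gn = np.linalg.norm(g)
    for i in range(2):
        pn, qn = np.linalg.norm(p0[i]), np.linalg.norm(q0[i])
        r = m[i] * gn / (ks[i] * L[i])
        feats += [r, 1.0 / r, qn / L[i], (qn / L[i]) ** 2,
                  pn / (np.sqrt(m[i] * ks[i]) * L[i]),
                  pn**2 / (m[i] * ks[i] * L[i] ** 2)]
    return np.array(feats)

rng = np.random.default_rng(0)                      # arbitrary seed (assumption)
m, ks, L = (rng.uniform(1.0, 2.0, size=2) for _ in range(3))
g = np.array([0.0, 0.0, -1.5])                      # illustrative gravity vector
p0, q0 = rng.normal(size=(2, 3)), rng.normal(size=(2, 3))

base = dimensionless_features(m, ks, L, g, p0, q0)

# Experiment 2: scale every input with a power of kg by one common factor.
a = rng.uniform(3.0, 7.0)
scaled = dimensionless_features(a * m, a * ks, L, g, a * p0, q0)

print("max feature change under kg rescaling:", np.abs(base - scaled).max())
```

The dimensional inputs change by a factor between 3 and 7, while the dimensionless features are unchanged up to rounding, which is why the rescaled test set is in-distribution for the dimensionless model.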
The dimensional scalars-based and the dimensionless scalars-based MLPs both have equal numbers of model parameters, and are trained with the same set of hyper-parameters (number of training epochs, learning rate, etc.). The prediction error (or state relative error) at time t is defined as

$$\mathrm{State.RelErr}(t) = \frac{\sqrt{(\hat z(t) - z(t))^\top (\hat z(t) - z(t))}}{\sqrt{\hat z(t)^\top \hat z(t)} + \sqrt{z(t)^\top z(t)}}. \tag{15}$$
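A minimal sketch of the state relative error follows, assuming (as is conventional for this metric) that Euclidean norms appear in both the numerator and the denominator of (15) and that the state $z(t)$ is flattened into one vector per time step. The function name `state_rel_err` and the toy trajectory are our own.

```python
import numpy as np

def state_rel_err(z_hat, z):
    """Relative error between predicted and true states.

    z_hat, z: arrays whose last axis holds the flattened state
    (q1, q2, p1, p2); any leading axes (for example time) are preserved.
    """
    num = np.sqrt(np.sum((z_hat - z) ** 2, axis=-1))
    den = np.sqrt(np.sum(z_hat**2, axis=-1)) + np.sqrt(np.sum(z**2, axis=-1))
    return num / den

# Toy trajectory: 60 time steps of a 12-dimensional state (illustrative only).
rng = np.random.default_rng(0)
z_true = rng.normal(size=(60, 12))
z_pred = z_true + 0.01 * rng.normal(size=(60, 12))

errors = state_rel_err(z_pred, z_true)   # one value per time step
print("mean error over the trajectory:", errors.mean())
```

The error is bounded between 0 (perfect prediction) and 1 (for example, a prediction with the opposite sign). Because positions and momenta with different units are summed inside the norms, the value depends on the unit system.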

| Scalar-based MLPs | Experiment 1 | Experiment 2 | Experiment 3 |
|---|---|---|---|
| Baseline | .0055 ± .0030 | .3669 ± .0050 | .1885 ± .0031 |
| Dimensionless | .0061 ± .0024 | .0089 ± .0034 | .0435 ± .0047 |

Table 1: Geometric mean (± standard deviation computed over 10 trials) of state relative errors of the springy pendulum over T = 60. Results are shown for the dimensional vs dimensionless scalar-based Hamiltonian neural networks (implemented as MLPs) on three different test sets. Test data used in Experiment 1 are generated from the same distribution as the training data set; test data used in Experiment 2 are the test data of Experiment 1, but with each input that has units of kg scaled by a factor randomly generated from Unif(3, 7); test data used in Experiment 3 have mass $m$, scalar spring constant $k_s$ and natural spring length $L$ generated from a different distribution.

[Figure 3 panels: axes $e_x^\top q_i$, $e_y^\top q_i$, $e_z^\top q_i$ for $i = 1, 2$; curves: ground truth, baseline, dimensionless.]

Figure 3: Ground truth and predictions of mass 1 (top) and 2 (bottom) in the phase space w.r.t. each dimension. Top 6 panels: Results from Experiment 1, where the test data are generated from the same distribution as those used for training. Here the dimensional scalar-based MLPs exhibit slightly more accurate predictions for longer time scales. Bottom 6 panels: Results from Experiment 2, where we use the test data of Experiment 1 but with each input that has units of kg scaled by a factor randomly generated from Unif(3, 7). Here the dimensionless scalar-based MLP is able to provide comparable performance to Experiment 1, while using the dimensional scalars gives much worse predictions.

Table 1 reports the average errors over $\{t_1, \ldots, t_{60}\}$. When the test data are generated from the same distribution as the training data, the dimensional scalar-based MLP exhibits slightly more accurate predictions for longer time scales using the same training hyperparameters. When we have out-of-distribution test data as in Experiments 2 and 3, the performance of both methods deteriorates as expected, but the dimensionless scalar-based MLP exhibits significantly better generalization performance. In particular, if we rescale the units as in Experiment 2, where all the quantities that have units of kg are scaled by the same randomly generated factor, the dimensionless scalar-based MLP is able to provide performance comparable to that in Experiment 1. In fact, this could be considered an in-distribution test set in the space of dimensionless scalars (see Section 4), and thus the only reason why the error is different is that the state relative error (15) is not dimensionless. In other words, our experiments show that imposing units equivariance increases the generalization performance significantly, especially in out-of-distribution settings.

Figure 3 provides an illustration of the predicted orbits by the dimensional and the dimensionless methods in Experiment 1 and Experiment 2.

Emulation: Arid vegetation model We further explore units-equivariance-informed learning with a non-linear problem in ecology.³ In semi-arid environments, banded vegetation is a characteristic feature of plant self-organization, which is modulated by the quantity of water available (Dagbovie and Sherratt, 2014). Inverting emergent vegetation patterns as a function of environmental changes is a central problem in ecology. Towards this end, a popular approach is the Rietkerk model, a set of differential equations relating surface water $u$, water absorbed into the soil $w$, and vegetation density $v$ (Rietkerk et al., 2002).

3. See https://dwh.gg/Rietkerk.

| parameter | description | default | units |
|---|---|---|---|
| $R$ | rainfall | 0.375 | ℓ d$^{-1}$ m$^{-2}$ |
| $\alpha$ | infiltration rate | 0.2 | d$^{-1}$ |
| $k_2$ | saturation const. | 5 | g m$^{-2}$ |
| $W_0$ | water infiltration const. | 0.1 | |
| $D_u$ | surface water diffusion | 100 | d$^{-1}$ m$^2$ |
| $g_m$ | water uptake | 0.05 | ℓ g$^{-1}$ d$^{-1}$ |
| $k_1$ | water uptake constant | 5 | ℓ m$^{-2}$ |
| $\delta_w$ | soil water loss | 0.2 | d$^{-1}$ |
| $D_w$ | soil water diffusion | 0.1 | d$^{-1}$ m$^2$ |
| $c$ | water to biomass | 20 | ℓ$^{-1}$ g |
| $\delta_v$ | vegetation loss | 0.25 | d$^{-1}$ |
| $D_v$ | vegetation diffusion | 0.1 | d$^{-1}$ m$^2$ |
| $T$ | total integration time | 200 | d |
| $\delta_t$ | integration time step | 0.005 | d |
| $L$ | integration patch length | 200 | m |
| $\delta_l$ | spatial step size | 2 | m |

Dimensionless features: $c\,\alpha^{-1} g_m$, $R^{-1}\alpha\,k_1$, $R^{-1}c^{-1}\alpha\,k_2$, $\alpha^{-1}\delta_w$, $\alpha^{-1}\delta_v$, $W_0$, $\alpha^{-1}D_v L^{-2}$, $\alpha\,T$, $\alpha\,\delta_t$, $\alpha^{-1}D_w L^{-2}$

Table 2: (Top) Parameters in the Rietkerk model and their units (Rietkerk et al., 2002). The bottom four parameters are parameters of the integration. (Bottom) Basis of dimensionless features found by our method.

These differential equations are

$$\begin{aligned}
\frac{du}{dt} &= R - \alpha\,\frac{v + k_2 W_0}{v + k_2}\,u + D_u \nabla^2 u \\
\frac{dw}{dt} &= \alpha\,\frac{v + k_2 W_0}{v + k_2}\,u - g_m\,\frac{v\,w}{k_1 + w} - \delta_w\,w + D_w \nabla^2 w \\
\frac{dv}{dt} &= c\,g_m\,\frac{v\,w}{k_1 + w} - \delta_v\,v + D_v \nabla^2 v,
\end{aligned} \tag{16}$$

where $u, w, v$ are all functions of both two-dimensional spatial coordinates and time $t$, and the $\nabla^2$ operator is the scalar second derivative operator (Laplacian) with respect to position. In detail, $u$ denotes the surface water density (units of mm = ℓ m$^{-2}$), $w$ is the soil water content (same units as $u$), and $v$ is the vegetation density (units of g m$^{-2}$). Further, the time derivative operator has units of d$^{-1}$, the Laplacian operator has units of m$^{-2}$, and the units of the other quantities ($R$, $W_0$, $g_m$, and so on) can be inferred from the equations in (16). Here the natural base units are (ℓ, g, d, m). Note that as there is a conversion 1000 ℓ = 1 m$^3$, we could, in principle, reduce the base units by one. However, there is no direct communication between water volume and distance across the surface, so these units can be kept separate. In general, units equivariance is more powerful when there are more independent base units, which leads to substantial design decisions for the investigator.

The Rietkerk model is determined by a set of dimensional parameters described in Table 2, and the initial conditions $u_0, v_0, w_0$. We consider random initial conditions and a random choice of parameters, uniformly sampled between 0.5 and 1.5 times the default values. For each choice of parameters, we use finite differences to estimate the derivatives and the Laplacian, and integrate the Rietkerk model using Euler's method with time step 0.005 d, on a 200 m × 200 m grid with 2 m pixel spacing.
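The integration scheme just described can be sketched as follows. This is a hedged illustration, not the authors' code: the periodic boundary conditions, the five-point Laplacian stencil, the range of the random initial fields, the seed, and the short demonstration run are all our own assumptions. The parameter values are the defaults from Table 2.

```python
import numpy as np

# Default parameter values from Table 2 (dw, dv stand for delta_w, delta_v).
P = dict(R=0.375, alpha=0.2, k2=5.0, W0=0.1, Du=100.0, gm=0.05, k1=5.0,
         dw=0.2, Dw=0.1, c=20.0, dv=0.25, Dv=0.1)

def laplacian(f, dl):
    """Five-point finite-difference Laplacian; periodic boundaries (assumed)."""
    return (np.roll(f, 1, 0) + np.roll(f, -1, 0)
            + np.roll(f, 1, 1) + np.roll(f, -1, 1) - 4.0 * f) / dl**2

def euler_step(u, w, v, p, dt=0.005, dl=2.0):
    """One explicit Euler step of the three equations in (16)."""
    infiltration = p["alpha"] * (v + p["k2"] * p["W0"]) / (v + p["k2"]) * u
    uptake = p["gm"] * v * w / (p["k1"] + w)
    du = p["R"] - infiltration + p["Du"] * laplacian(u, dl)
    dw = infiltration - uptake - p["dw"] * w + p["Dw"] * laplacian(w, dl)
    dv = p["c"] * uptake - p["dv"] * v + p["Dv"] * laplacian(v, dl)
    return u + dt * du, w + dt * dw, v + dt * dv

rng = np.random.default_rng(0)        # arbitrary seed (assumption)
shape = (100, 100)                    # 200 m patch at 2 m pixel spacing
u, w, v = (rng.uniform(0.5, 1.5, size=shape) for _ in range(3))

for _ in range(200):                  # 200 steps of 0.005 d, i.e. 1 d of model time
    u, w, v = euler_step(u, w, v, P)

print("spatial mean vegetation density after 1 d:", v.mean())
```

With the default values, the fastest diffusion term satisfies $D_u\,\delta_t/\delta_l^2 = 0.125$, which is below the usual stability limit of 0.25 for an explicit scheme on a two-dimensional grid. A full run to T = 200 d would need 40 000 such steps.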

We consider the task of predicting, from initial conditions, the average vegetation density after 200 days (the empirical steady-state solution of (16) at default parameters). We produce a training set of 1000 initial configurations and a test set of 100 configurations. A significant portion of these simulations ended with total vegetation death at a finite time. We did not consider these examples for the regression task; extending our results to classification is a future direction. We perform two forms of linear regression: a baseline regression and a dimensionless regression. The baseline regression uses 33 features: the dimensional parameters, their inverses, and the dimensionless constant 1, which allows affine linear functions. The dimensionless linear regression uses the method described in Section 3. It uses the Smith normal form to construct a basis of 12 dimensionless features, and it uses them, their inverses, and the constant 1, obtaining 25 regression features. The results in Figure 4 show that the dimensionless regression performs significantly better: its test MSE is 12.6 g$^2$ m$^{-4}$, compared with 26.3 g$^2$ m$^{-4}$ for the baseline.
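The feature counts quoted above follow from the units of the parameters in Table 2. The sketch below is an assumption-labeled check and not the paper's method: it uses the rank of the units matrix (the Buckingham Pi count) in place of the Smith normal form, and the parameter names are our own spellings. It also verifies that the features listed in Table 2 are dimensionless and independent.

```python
import numpy as np

# Exponents of the base units (l, g, d, m) for each parameter in Table 2.
UNITS = {
    "R": (1, 0, -1, -2), "alpha": (0, 0, -1, 0), "k2": (0, 1, 0, -2),
    "W0": (0, 0, 0, 0), "Du": (0, 0, -1, 2), "gm": (1, -1, -1, 0),
    "k1": (1, 0, 0, -2), "delta_w": (0, 0, -1, 0), "Dw": (0, 0, -1, 2),
    "c": (-1, 1, 0, 0), "delta_v": (0, 0, -1, 0), "Dv": (0, 0, -1, 2),
    "T": (0, 0, 1, 0), "delta_t": (0, 0, 1, 0), "L": (0, 0, 0, 1),
    "delta_l": (0, 0, 0, 1),
}
names = list(UNITS)
A = np.array([UNITS[n] for n in names]).T     # shape (4 base units, 16 parameters)

# Number of independent dimensionless combinations: parameters minus rank.
n_dimensionless = len(names) - np.linalg.matrix_rank(A)

# Dimensionless features listed in Table 2, as {parameter: exponent} maps.
FEATURES = [
    {"c": 1, "alpha": -1, "gm": 1},
    {"R": -1, "alpha": 1, "k1": 1},
    {"R": -1, "c": -1, "alpha": 1, "k2": 1},
    {"alpha": -1, "delta_w": 1},
    {"alpha": -1, "delta_v": 1},
    {"W0": 1},
    {"alpha": -1, "Dv": 1, "L": -2},
    {"alpha": 1, "T": 1},
    {"alpha": 1, "delta_t": 1},
    {"alpha": -1, "Dw": 1, "L": -2},
]
E = np.array([[f.get(n, 0) for n in names] for f in FEATURES])

all_dimensionless = not np.any(A @ E.T)       # every net unit exponent is zero
independent = np.linalg.matrix_rank(E) == len(FEATURES)

print("independent dimensionless features:", n_dimensionless)
print("baseline regression features:", 2 * len(names) + 1)
print("dimensionless regression features:", 2 * n_dimensionless + 1)
print("listed features dimensionless and independent:",
      all_dimensionless, independent)
```

Sixteen parameters and four independent base units leave twelve dimensionless combinations, which with their inverses and the constant 1 give the 25 features of the dimensionless regression; the baseline count of 33 arises the same way from the sixteen raw parameters.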

Our toy model explores the impact of selecting dimensionally correct features on predicting average vegetation outcomes when data are generated from a well-characterized ecological model. However, other interesting symbolic regression problems remain open. For example, is the Rietkerk model the most appropriate model for banded vegetation patterns in general? A recent approach aimed to address the related inverse problem of determining the underlying structure of a nonlinear dynamical system from data (Brunton et al., 2016). There, sparse regression and compressed sensing tools informed the selection of a small number of informative, non-linear terms hypothesized to explain observed dynamics. These kinds of problems thus present an intriguing avenue for future exploration of units equivariance as a principled way to impose additional sparsity in a non-linear feature space, which could further aid methods like that of Brunton et al. (2016) by restricting feature selection to units-equivariant, physically informative terms.

Symbolic regression: The black-body radiation law One of the most important moments in modern physics was the introduction of the quantum-mechanical constant $h$ by Planck around 1900 (Planck, 1901). In our language, this discovery can be seen as a symbolic regression, in which Planck discovered a simple symbolic expression that accurately summarized a host of data sets on radiating bodies at different temperatures. The dimensional constant $h$ was introduced to explain the short-wavelength part of the radiation law, but it ended up being the governing constant for all quantum phenomena; it led to a simple prediction of the spectrum of the hydrogen atom (Bohr, 1913) and is the core of the Schrödinger equation (Schrödinger, 1926); it was extremely important in the history of physics.

The black-body radiation $B_\lambda(\lambda; T)$ from a perfectly radiating and absorbing (black) thermal source at temperature $T$ is properly measured in intensity units, which are (or can be) energy per time per wavelength per area per solid angle. Because solid angles are dimensionless, this translates to SI units of J m$^{-3}$ s$^{-1}$. The problem Planck faced was a set of measurements (labels) $B_\lambda(\lambda; T)$ at many wavelengths $\lambda$ for bodies at multiple temperatures $T$. The input features $\lambda, T, c, k_B$ and output labels $B_\lambda(\lambda; T)$ of the problem are summarized in Table 3, along with their units in the SI base unit system of kg, m, s, K.

[Figure 4 plot. Horizontal axis: value predicted by the Rietkerk model. Vertical axis: trained regression on model parameters. Quantity: spatial mean of vegetation density $v$ (g m$^{-2}$) after T = 200 d. Series: naive, units-equivariant.]

Figure 4: (Top) The evolution of vegetation density from random initialization according to a Rietkerk model. The model's parameters are given in Table 2. (Bottom) We consider 1000 random initializations and evolutions according to random parameters sampled from a uniform distribution supported in [0.5x, 1.5x], where x are the baseline Rietkerk parameters from Table 2 (the integration parameters remain fixed). Our regression task is to predict the spatial mean vegetation after 200 days as a function of the Rietkerk parameters. The light blue dots show the (naive) regression vs the Rietkerk model for linear regression on the model parameters, their inverses, and the constant 1 (33 features in total) on a held-out test set. The dark red dots correspond to a linear regression using a basis of dimensionless features obtained with our (units-equivariant) method, their inverses, and the constant 1 (25 features in total) on the same test set. The naive regression has a test MSE of 26.3 g$^2$ m$^{-4}$, whereas the units-equivariant regression has an MSE of 12.6 g$^2$ m$^{-4}$. The Pearson correlation of the prediction and the target value is 0.94 for the naive regression and 0.97 for the units-equivariant regression.

| | description | units | comment |
|---|---|---|---|
| $B_\lambda(\lambda; T)$ | intensity | kg m$^{-1}$ s$^{-3}$ | regression label |
| $\lambda$ | wavelength | m | variable feature |
| $T$ | temperature | K | variable feature |
| $c$ | speed of light | m s$^{-1}$ | fundamental constant |
| $k_B$ | Boltzmann's constant | kg m$^2$ s$^{-2}$ K$^{-1}$ | fundamental constant |

Table 3: Labels and features in Planck's black-body radiation problem and their units.

In terms of the language illustrated in Figure 1, the decoder gx,v involves multiplying the dimensionless output of a dimensionless regression by a dimensional quantity with the same units as the labels $B_\lambda(\lambda; T)$. The only possible dimensional quantity that can be made out of the features and that matches the dimensions of the labels is

$$\frac{c}{\lambda^4}\,k_B\,T, \tag{17}$$

which has units of intensity. The featurizer ϕ makes all possible dimensionless quantities out of the inputs. But wait, there are no (non-trivial) dimensionless features possible! In a dimensionless regression, the only available input feature is the dimensionless constant unity. That is, in our approach, the only possible outcome of the regression, in this case, is

$$B_\lambda(\lambda; T) = C\,\frac{c}{\lambda^4}\,k_B\,T, \tag{18}$$

where $C$ is a universal constant. This is a classical dimensional analysis result. There are two comments to make here. The first is that this form (18) is not a good fit for the data! Thus the method we are proposing here fails. The explanation for this failure is that there is a dimensional constant, $h$ (now known as Planck's constant), that is missing from our formulation in Table 3. The second is that this form (18) is a perfect fit to the data at long wavelengths. That is, at long wavelengths, where quantum occupation numbers are high, the problem behaves classically, and the data are extremely well explained by (18), with $C = 2$. This result (with $C = 2$) is called the Rayleigh–Jeans law. If the reader is interested in the history of physics, the Rayleigh–Jeans law is the lynchpin of the ultraviolet catastrophe, which is a paradox of classical statistical mechanics, resolved by quantization. What Planck discovered or realized is that the data could only be explained with the introduction of a new dimensional universal constant. He had choices for the dimensions of this constant, but he set it to have dimensions of energy times time. Planck's symbolic regression led to the complete expression

$$B_\lambda(\lambda; T) = \frac{2\,h\,c^2}{\lambda^5}\,\frac{1}{\exp\!\left(\frac{h\,c}{\lambda\,k_B\,T}\right) - 1}, \tag{19}$$

which reduces to (18) with $C = 2$ in the limit $\lambda \to \infty$. This result required the introduction of the dimensional constant $h$, resolved the ultraviolet catastrophe, and seeded the discovery of quantum mechanics. In the approach advocated in this work, we have no way to learn or discover missing dimensional constants. That is a limitation of our approach and motivates future work.
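Both statements in this example, that (17) is the only combination of the features with units of intensity and that (19) approaches (18) with $C = 2$ at long wavelengths, can be checked numerically. The sketch below is our own illustration: the matrix layout, the function names, and the chosen temperature and wavelengths are assumptions; the constants are the defined SI values of $h$, $c$, and $k_B$.

```python
import numpy as np

# Rows: base units (kg, m, s, K); columns: features (lambda, T, c, k_B).
A = np.array([
    [0, 0, 0, 1],     # kg
    [1, 0, 1, 2],     # m
    [0, 0, -1, -2],   # s
    [0, 1, 0, -1],    # K
], dtype=float)
label_units = np.array([1, -1, -3, 0], dtype=float)   # intensity: kg m^-1 s^-3

# A is square and full rank, so the exponents are unique and there is no
# non-trivial dimensionless combination of the four features.
exponents = np.linalg.solve(A, label_units)
print("exponents of (lambda, T, c, k_B):", exponents)

h, c, kB = 6.62607015e-34, 299792458.0, 1.380649e-23   # SI values

def planck(lam, T):
    """Equation (19)."""
    return 2.0 * h * c**2 / lam**5 / np.expm1(h * c / (lam * kB * T))

def rayleigh_jeans(lam, T):
    """Equation (18) with C = 2."""
    return 2.0 * c * kB * T / lam**4

T = 5000.0                                             # illustrative temperature in K
ratio_long = planck(1.0, T) / rayleigh_jeans(1.0, T)       # wavelength 1 m
ratio_short = planck(5e-7, T) / rayleigh_jeans(5e-7, T)    # wavelength 500 nm
print("Planck / Rayleigh-Jeans at 1 m:   ", ratio_long)
print("Planck / Rayleigh-Jeans at 500 nm:", ratio_short)
```

The ratio tends to 1 at long wavelengths because $\exp(x) - 1 \approx x$ for small $x = hc/(\lambda k_B T)$, while at short wavelengths the classical form overestimates the intensity by a large factor.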


6. Discussion

In the above, we defined units equivariance for machine learning, with a focus on regression and complex functions. A function obeying this equivariance obeys the exact scalings that are required by the rules of dimensional analysis. These scalings must be obeyed by any theory or function in use in the natural sciences. We developed a simple framework for implementing units equivariance in regression problems. This framework puts burdens on the investigator (burdens of having consistent units for all inputs, and also a comprehensive list of dimensional constants) but is otherwise lightweight in terms of modifying existing regression methods. We did not consider the important problems of learning dimensions, or the discovery of missing dimensionless inputs, but these are worthy extensions of what we looked at here. We argued that imposing units equivariance must improve the bias and variance of regression methods, both because it incorporates correct information, and also because it reduces model capacity at fixed complexity, often by an enormous factor. The equivariance also enables out-of-sample generalization, because a test set that does not overlap a training set in dimensional inputs will often significantly overlap in dimensionless combinations of those inputs. We illustrated these effects empirically with a few simple experiments.

Units equivariance applies to all functions in the natural sciences, but it will not be useful everywhere. In particular, it is most useful when there are many independent units at play, and the full panoply of physical constants is known. This is not true, say, for standard image-recognition tasks, for which all the inputs have the same units (intensity in image pixels) and the physical quantities (involved in the identification of pandas and kittens, say) are not known. It is also not true in natural-science problems where there might be unknown physical constants or physical laws at play. The discovery of physical laws is often the discovery of dimensional physical constants, as our black-body radiation law example problem (Section 5) illustrates. However, we are very optimistic about the usefulness of units equivariance in problems of emulation and symbolic regression. In these settings, all symmetries are exact, and often all inputs (including all fundamental constants) are known (and have known units). In particular, some of the cleanest physics problems might be in the area of the growth of structure in the Universe, where there are very few dimensioned quantities and the physics is dominated by one force (gravity). These problems are of great interest at the present day and have attracted very promising work with machine learning methods (for example, He et al. 2019; Berger and Stein 2019; Kodi Ramanah et al. 2020; Tröster et al. 2019).

Acknowledgments: It is a pleasure to thank Timothy Carson (Google), Miles Cranmer (Princeton), Samory Kpotufe (Columbia), Sanjoy Mahajan (Olin College), Bernhard Schölkopf (MPI-IS), Kate Storey-Fisher (NYU), and Wenda Zhou (NYU and Flatiron Institute) for valuable discussions. We also thank the action editor Jean-Philippe Vert and the anonymous reviewers for constructive feedback that helped us improve the manuscript significantly. SV was partially supported by ONR N00014-22-1-2126, the NSF Simons Research Collaboration on the Mathematical and Scientific Foundations of Deep Learning (MoDL) (NSF DMS 2031985), NSF CISE 2212457, an AI2AI Amazon research award, and the TRIPODS Institute for the Foundations of Graph and Deep Learning at Johns Hopkins University.

References

Martin Arjovsky. Out of distribution generalization in machine learning. PhD thesis, New York University, 2020.

Henry S Baird. Document image defect models. In Structured Document Image Analysis, pages 546–556. Springer, 1992.

Joseph Bakarji, Jared Callaham, Steven L Brunton, and J Nathan Kutz. Dimensionally consistent learning with Buckingham Pi. arXiv:2202.04643, 2022.

Grigory Isaakovich Barenblatt. Scaling and transformation groups. Renormalization group, pages 161–180. Cambridge Texts in Applied Mathematics. Cambridge University Press, 1996. doi: 10.1017/CBO9781107050242.009.

Simon Batzner, Albert Musaelian, Lixin Sun, Mario Geiger, Jonathan P Mailoa, Mordechai Kornbluth, Nicola Molinari, Tess E Smidt, and Boris Kozinsky. SE(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. arXiv:2101.03164, 2021.

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010.

Gregory Benton, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. Learning invariances in neural networks. arXiv:2010.11882, 2020.

Philippe Berger and George Stein. A volumetric deep convolutional neural network for simulation of mock dark matter halo catalogues. Monthly Notices of the Royal Astronomical Society, 482(3):2861–2871, 2019.

Alberto Bietti, Luca Venturi, and Joan Bruna. On the sample complexity of learning under geometric stability. Advances in Neural Information Processing Systems, 34, 2021.

Ben Blum-Smith and Soledad Villar. Equivariant maps from invariant functions. arXiv preprint arXiv:2209.14991, 2022.

Niels Bohr. I. On the constitution of atoms and molecules. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 26(151):1–25, 1913.

Simone Brugiapaglia, M Liu, and Paul Tupper. Invariance, encodings, and generalization: learning identity effects with neural networks. arXiv preprint arXiv:2101.08386, 2021.

Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932–3937, 2016.

Edgar Buckingham. On physically similar systems; illustrations of the use of dimensional equations. Physical Review, 4(4):345–376, 1914.

Jameson Cahill, Dustin G Mixon, and Hans Parshall. Lie PCA: Density estimation for symmetric manifolds. ar Xiv:2008.04278, 2020.

Saverio Cambioni, Erik Asphaug, Alexandre Emsenhuber, Travis S. J. Gabriel, Roberto Furfaro, and Stephen R. Schwartz. Realistic on-the-fly outcomes of planetary collisions: Machine learning applied to simulations of giant impacts. Astrophysical Journal, 875(1): 40, April 2019.

Saverio Cambioni, Seth A. Jacobson, Alexandre Emsenhuber, Erik Asphaug, David C. Rubie, Travis S. J. Gabriel, Stephen R. Schwartz, and Roberto Furfaro. The effect of inefficient accretion on planetary differentiation. Planetary Science Journal, 2(3):93, June 2021.

Nan Chen and Soledad Villar. SE(3)-equivariant self-attention via invariant features. Machine Learning for Physics Neur IPS Workshop, 2022.

Shuxiao Chen, Edgar Dobriban, and Jane Lee. A group-theoretic framework for data augmentation. Advances in Neural Information Processing Systems, 33:21321 21333, 2020a.

Zhengdao Chen, Lisha Li, and Joan Bruna. Supervised community detection with line graph neural networks. Internation Conference on Learning Representations, 2019a.

Zhengdao Chen, Soledad Villar, Lei Chen, and Joan Bruna. On the equivalence between graph isomorphism testing and function approximation with gnns. In Advances in Neural Information Processing Systems, pages 15894 15902, 2019b.

Zhengdao Chen, Lei Chen, Soledad Villar, and Bruna Joan. Can graph neural networks count substructures? Advances in neural information processing systems, 2020b.

Taco S. Cohen and Max Welling. Group equivariant convolutional networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, volume 48, page 2990 2999, 2016.

Taco S Cohen and Max Welling. Steerable cnns. In International Conference on Learning Representations (ICLR), 2017.

Taco S. Cohen, Mario Geiger, Jonas Koehler, and Max Welling. Spherical cnns, 2018.

Paul G Constantine, Zachary del Rosario, and Gianluca Iaccarino. Data-driven dimensional analysis: Algorithms for unique and relevant dimensionless groups. arXiv:1708.04303, 2017.

Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv:1805.09501, 2018.

Ayawoa S Dagbovie and Jonathan A Sherratt. Pattern selection and hysteresis in the rietkerk model for banded vegetation in semi-arid environments. Journal of The Royal Society Interface, 11(99):20140465, 2014.


Tri Dao, Albert Gu, Alexander Ratner, Virginia Smith, Chris De Sa, and Christopher Ré. A kernel theory of modern data augmentation. In International Conference on Machine Learning, pages 1528-1537. PMLR, 2019.

Jayanta Dey, Ashwin De Silva, Will LeVine, Jong Shin, Haoyin Xu, Ali Geisa, Tiffany Chu, Leyla Isik, and Joshua T Vogelstein. Out-of-distribution detection using kernel density polytopes. arXiv:2201.13001, 2022.

David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224-2232, 2015.

Bryn Elesedy. Provably strict generalisation benefit for invariance in kernel methods. arXiv:2106.02346, 2021.

Bryn Elesedy and Sheheryar Zaidi. Provably strict generalisation benefit for equivariant models. arXiv:2102.10333, 2021.

Nikolaos Evangelou, Noah J Wichrowski, George A Kevrekidis, Felix Dietrich, Mahdi Kooshkbaghi, Sarah McFann, and Ioannis G Kevrekidis. On the parameter combinations that matter and on those that do not. arXiv:2110.06717, 2021.

Marc Finzi, Max Welling, and Andrew Gordon Wilson. A practical method for constructing equivariant multilayer perceptrons for arbitrary matrix groups. arXiv:2104.09459, 2021.

Federico Frisone and Andrea Misiti. Buckingham theorem application to machine learning algorithms: methodology and practical examples. Master's thesis, Politecnico di Milano, 2019.

Fabian Fuchs, Daniel Worrall, Volker Fischer, and Max Welling. SE(3)-transformers: 3d roto-translation equivariant attention networks. Advances in Neural Information Processing Systems, 33, 2020.

William Fulton and Joe Harris. Representation theory: a first course, volume 129. Springer Science & Business Media, 2013.

Fernando Gama, Elvin Isufi, Geert Leus, and Alejandro Ribeiro. Graphs, convolutions, and neural networks: From graph filters to graph neural networks. IEEE Signal Processing Magazine, 37(6):128-138, 2020.

Ali Geisa, Ronak Mehta, Hayden S Helm, Jayanta Dey, Eric Eaton, Jeffery Dick, Carey E Priebe, and Joshua T Vogelstein. Towards a theory of out-of-distribution learning. arXiv:2109.14501, 2021.

Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1263-1272. JMLR.org, 2017.


Samuel Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, 2019.

Ben Gripaios, Ward Haddadin, and Christopher G Lester. Lorentz- and permutation-invariants of particles. Journal of Physics A: Mathematical and Theoretical, 54(15):155201, 2021.

Ward Haddadin. Invariant polynomials and machine learning. arXiv:2104.12733, 2021.

Steve Hanneke and Samory Kpotufe. On the value of target data in transfer learning. Advances in Neural Information Processing Systems, 32, 2019.

Siyu He, Yin Li, Yu Feng, Shirley Ho, Siamak Ravanbakhsh, Wei Chen, and Barnabás Póczos. Learning to predict the cosmological structure formation. Proceedings of the National Academy of Sciences, 116(28):13825-13832, 2019.

Ningyuan Teresa Huang and Soledad Villar. A short tutorial on the weisfeiler-lehman test and its variants. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8533-8537. IEEE, 2021.

Evelyne Hubert and George Labahn. Rational invariants of scalings from Hermite normal forms. In Proceedings of the 37th International Symposium on Symbolic and Algebraic Computation, pages 219-226, 2012.

Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Composing molecules with multiple property constraints. arXiv:2002.03244, 2020.

Bingyi Kang and Jiashi Feng. Transferable meta learning across domains. In UAI, pages 177-187, 2018.

K Kashinath, M Mustafa, A Albert, JL Wu, C Jiang, S Esmaeilzadeh, K Azizzadenesheli, R Wang, A Chattopadhyay, A Singh, et al. Physics-informed machine learning: Case studies for weather and climate modelling. Philosophical Transactions of the Royal Society A, 379(2194):20200093, 2021.

Doogesh Kodi Ramanah, Tom Charnock, Francisco Villaescusa-Navarro, and Benjamin D Wandelt. Super-resolution emulator of cosmological simulations using deep physical models. Monthly Notices of the Royal Astronomical Society, 495(4):4227-4236, 2020.

Risi Kondor. N-body networks: a covariant hierarchical neural network architecture for learning atomic potentials. arXiv:1803.01588, 2018.

Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989.

Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Learning to generalize: Meta-learning for domain generalization. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.


Weixin Liang and James Zou. Metashift: A dataset of datasets for evaluating contextual distribution shifts and training conflicts. arXiv:2202.06523, 2022.

Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv:0902.3430, 2009.

Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. Invariant and equivariant graph networks. In International Conference on Learning Representations, 2018.

Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Learning with invariances in random features and kernel models, 2021.

Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and leman go neural: Higher-order graph neural networks. Association for the Advancement of Artificial Intelligence, 2019.

Max Planck. On the law of the energy distribution in the normal spectrum. Ann. Phys, 4(553):1-11, 1901.

Vasco Portilheiro. A tradeoff between universality of equivariant models and learnability of symmetries. arXiv preprint arXiv:2210.09444, 2022.

Max Rietkerk, Maarten C Boerlijst, Frank van Langevelde, Reinier Hille Ris Lambers, Johan van de Koppel, Lalit Kumar, Herbert HT Prins, and André M de Roos. Self-organization of vegetation in arid ecosystems. The American Naturalist, 160(4):524-530, 2002.

Carlo Rovelli and Marcus Gaul. Loop quantum gravity and the meaning of diffeomorphism invariance. In Towards quantum gravity, pages 277-324. Springer, 2000.

Stephan Rudolph et al. On the context of dimensional analysis in artificial intelligence. In International Workshop on Similarity Methods. Citeseer, 1998.

Alvaro Sanchez-Gonzalez, Victor Bapst, Kyle Cranmer, and Peter Battaglia. Hamiltonian graph networks with ODE integrators, 2019.

Erwin Schrödinger. An undulatory theory of the mechanics of atoms and molecules. Physical Review, 28(6):1049, 1926.

Igor R Shafarevich. Basic Algebraic Geometry 1. Springer-Verlag Berlin/Heidelberg, second edition, 1994.

Ruoqi Shen, Sébastien Bubeck, and Suriya Gunasekar. Data augmentation as feature manipulation: A story of desert cows and grass cows. arXiv:2203.01572, 2022.

Richard P. Stanley. Smith normal form in combinatorics. Journal of Combinatorial Theory, Series A, 144:476-495, 2016.

Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3d point clouds. arXiv:1802.08219, 2018.


Ambler Thompson and Barry N. Taylor. Guide for the Use of the International System of Units (SI); Natl. Inst. Stand. Technol. Spec. Publ. 811, 2008 ed. National Institute of Standards and Technology, 2008.

Tilman Tröster, Cameron Ferguson, Joachim Harnois-Déraps, and Ian G McCarthy. Painting with baryons: Augmenting N-body simulations with gas using deep generative models. Monthly Notices of the Royal Astronomical Society: Letters, 487(1):L24-L29, 2019.

David A Van Dyk and Xiao-Li Meng. The art of data augmentation. Journal of Computational and Graphical Statistics, 10(1):1-50, 2001.

Soledad Villar, David W Hogg, Kate Storey-Fisher, Weichi Yao, and Ben Blum-Smith. Scalars are universal: Equivariant machine learning, structured like classical physics. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.

Binghui Wang, Jinyuan Jia, Xiaoyu Cao, and Neil Zhenqiang Gong. Certified robustness of graph neural networks against adversarial structural perturbation. arXiv:2008.10715, 2020a.

Rui Wang, Karthik Kashinath, Mustafa Mustafa, Adrian Albert, and Rose Yu. Towards physics-informed deep learning for turbulent flow prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1457-1466, 2020b.

Maurice Weiler and Gabriele Cesa. General e(2)-equivariant steerable cnns. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, 2019.

Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco Cohen. 3d steerable cnns: Learning rotationally equivariant features in volumetric data, 2018.

Sebastien C Wong, Adam Gatt, Victor Stamatescu, and Mark D McDonnell. Understanding data augmentation for classification: when to warp? In 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pages 1-6. IEEE, 2016.

Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv:1810.00826, 2018.

Weichi Yao, Kate Storey-Fisher, David W Hogg, and Soledad Villar. A simple equivariant machine learning method for dynamics based on scalars. arXiv:2110.03761, 2021.

Rose Yu, Paris Perdikaris, and Anuj Karpatne. Physics-guided ai for large-scale spatiotemporal data. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD '21, pages 4088-4089, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383325. doi: 10.1145/3447548.3470793. URL https://doi.org/10.1145/3447548.3470793.


Appendix A. Generalization improvements

We first explain ideas from Elesedy and Zaidi (2021), developed in the context of invariant and equivariant functions with respect to actions by compact groups. Given $X \subseteq \mathbb{R}^d$ and $G$ a compact group acting on $X$, we fix a $G$-invariant measure $\mu$. We consider the Hilbert space of functions $H = L^2(X, \mu)$. The action of $G$ on $X$ induces a natural action on $H$ via the formula

$$[\lambda \cdot f](x) := f(\lambda^{-1} \cdot x)$$

for $x \in X$ and $\lambda \in G$. We split $H$ into two orthogonal components, the closed ground-truth $G$-invariant subspace $\bar{H}$ consisting of the functions $f \in H$ satisfying $\lambda \cdot f = f$ for all $\lambda \in G$, and its orthogonal complement $H^{\perp}$, so that $H = \bar{H} \oplus H^{\perp}$. Using that $\mu$ is $G$-invariant, it is noted (Elesedy and Zaidi, 2021) that the orthogonal projection onto $\bar{H}$ coincides with averaging along the $G$-orbit, also known as the Reynolds operator:

$$O f(x) := \int_G f(\lambda \cdot x)\, d\lambda, \tag{20}$$

where $d\lambda$ is the normalized Haar measure of the group. The proof from Elesedy and Zaidi (2021) that orthogonal projection to $\bar{H}$ coincides with $O$ is reproduced below in (23)-(26), and an alternative proof follows as well. The Reynolds operator has good algebraic properties. It can be alternatively characterized as the unique $G$-invariant projection $H \to \bar{H}$, i.e., the unique linear map $O : H \to \bar{H}$ that restricts to the identity on $\bar{H}$ and satisfies

$$O(\lambda \cdot f) = O f \tag{21}$$

for $\lambda \in G$ and $f \in H$. Yet another characterization is that it restricts to the identity on any subspace consisting of invariants, and to zero on any $G$-stable closed subspace not containing nontrivial invariants. The Hilbert space $H$ contains the algebra of compactly-supported continuous functions $C_c(X)$ as a dense subspace, and the Reynolds operator has the property that its restriction to this algebra commutes with multiplication by invariant such functions, i.e., it satisfies

$$O(fh) = f\, Oh \tag{22}$$

for all $f \in C_c(X)^G$ (the subalgebra of invariant compactly-supported continuous functions) and $h \in C_c(X)$. In other words, $O$ restricts to a $C_c(X)^G$-module projection $C_c(X) \to C_c(X)^G$.

Now we verify the observation from Elesedy and Zaidi (2021) that $O$ is the orthogonal projection in $H$ onto the space of invariants; equivalently, given $f \in H$, we have $\arg\min_{h \in \bar{H}} \|h - f\|_{\mu}^2 = O(f)$. In order to show this, it suffices to observe that $O$ is self-adjoint with respect to the inner product in $L^2(X, \mu)$:

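$$\langle Of, h \rangle_{\mu} = \int_X \Big\langle \int_G f(\lambda \cdot x)\, d\lambda,\; h(x) \Big\rangle\, d\mu(x) \tag{23}$$

$$= \int_X \int_G \langle f(\lambda \cdot x),\; h(x) \rangle\, d\lambda\, d\mu(x) \tag{24}$$

$$= \int_X \int_G \langle f(x),\; h(\lambda^{-1} \cdot x) \rangle\, d\lambda\, d\mu(x) \tag{25}$$

$$= \langle f, Oh \rangle_{\mu}. \tag{26}$$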
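The self-adjointness identity (23)-(26) can be checked numerically in a finite-group analogue. The sketch below is an illustration only, under assumptions that are not from the paper: the group is the cyclic rotation group $C_4$ acting on $\mathbb{R}^2$, the measure is the uniform empirical measure on a point cloud closed under that group (hence $G$-invariant), and the test functions `f` and `h` are arbitrary hypothetical choices.

```python
import numpy as np

def rotation(theta):
    """2x2 rotation matrix by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Assumed setup: G = C_4 (rotations by multiples of 90 degrees) acting on R^2.
group = [rotation(2 * np.pi * m / 4) for m in range(4)]

# G-invariant empirical measure: random base points together with their full orbits.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
points = np.concatenate([base @ g.T for g in group])  # shape (200, 2)

def reynolds(f):
    """Return O f: the average of f over the group (finite-group analogue of (20))."""
    return lambda x: np.mean([f(x @ g.T) for g in group], axis=0)

def inner(f, h):
    """Inner product <f, h>_mu under the uniform measure on `points`."""
    return np.mean(f(points) * h(points))

# Hypothetical real-valued test functions on R^2.
f = lambda x: x[:, 0] ** 2 + x[:, 0] * x[:, 1] + x[:, 0]
h = lambda x: np.exp(-x[:, 1]) + x[:, 0] ** 2

Of, Oh = reynolds(f), reynolds(h)
lhs, rhs = inner(Of, h), inner(f, Oh)
print(lhs, rhs)  # the two values agree up to floating-point error
```

The agreement relies on the point cloud being closed under the group; with a sample that is not $G$-invariant the two inner products would generally differ, which mirrors the role of the invariance of $\mu$ in step (25).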


Note that (25) holds due to $\mu$ being $G$-invariant. An alternative, more conceptual way to see that $O$ is an orthogonal projection to $\bar{H}$ under the assumption that $\mu$ is $G$-invariant is as follows. Because $\mu$ is $G$-invariant, the inner product on $H = L^2(X, \mu)$ is also $G$-invariant. Thus $G$ acts by unitary transformations on $H$. This implies that every $G$-invariant element of $H$ is orthogonal to every $G$-stable closed subspace not containing any nontrivial $G$-invariants. This is because if $f \in H$ is invariant and $h \in V \subseteq H$, where $V$ is a closed, $G$-stable subspace containing no nontrivial invariants, then $\langle f, h \rangle = \int_G \langle \lambda f, \lambda h \rangle\, d\lambda = \langle f, Oh \rangle = \langle f, 0 \rangle = 0$. The integral is with respect to the normalized Haar measure on $G$. The first equality is the fact that $\lambda$ is unitary, the second is because $f$ is invariant so the integral can be pushed into the inner product, and the third is because $Oh$ is both invariant and in $V$, therefore trivial. As discussed above, $O$ acts as the identity on invariants and annihilates $G$-stable closed subspaces containing no nontrivial invariants (and this also follows from the calculation just performed). Since the former is orthogonal to the latter, this means $O$ is the orthogonal projection onto the subspace of invariants in $H$.

Now, given $f^* : X \to \mathbb{R}^k$, $f^* \in \bar{H}$, Elesedy and Zaidi (2021) consider data $y = f^*(x) + \xi$, where $\xi$ is sampled from a zero-mean, finite-variance distribution in $\mathbb{R}^k$. The risk of a function $f$ is the expected value of the prediction error

$$R(f) = \mathbb{E}_{x \sim \mu}\, \|y - f(x)\|_2^2 \tag{27}$$

and, given two functions $f$ and $f'$, the generalization gap is

$$\Delta(f, f') = R(f) - R(f'). \tag{28}$$

In Elesedy and Zaidi (2021) a regression problem is considered in which the regression is performed on a subspace $U \subseteq H$ that is closed under (20). Then $U$ can be decomposed into closed subspaces $U = \bar{U} \oplus U^{\perp}$, where $\bar{U} \subseteq \bar{H}$ and $U^{\perp} \subseteq H^{\perp}$. Given a function $f \in U$, let $\bar{f} := \Pi_{\bar{U}} f$ and $f^{\perp} := \Pi_{U^{\perp}} f$ be the respective orthogonal projections of $f$. Note that under the present hypothesis $\bar{f} = Of$ and $f^{\perp} = f - Of$, since $O$ restricted to $U$ is the orthogonal projection to $\bar{U}$. The goal is to compute $\Delta(f, \bar{f})$, namely the excess risk of doing the regression on the function space $U$ instead of restricting to the invariant subspace $\bar{U}$, which corresponds to the ground truth. A simple computation shows that if $x$ is sampled from $\mu$, then

$$\Delta(f, \bar{f}) = \|f^{\perp}\|_{\mu}^2; \tag{29}$$

this in particular shows that there is always a benefit to restricting to the ground truth subspace. For instance, if the target is a group invariant function, and we do a regression in a space of polynomials, we might as well restrict the regression to group invariant polynomials, since the generalization error will be strictly smaller whenever $f^{\perp} \neq 0$. The rest of the analysis in Elesedy and Zaidi (2021) focuses on computing $\|f^{\perp}\|_{\mu}^2$ for data models, for instance, assuming $U$ is the space of linear functions (they discuss both the over- and under-parameterized regimes), and the input is Gaussian with mean zero and identity covariance (assuming the spherical Gaussian distribution is $G$-invariant). A slightly more general formulation allows them to express the same ideas for equivariance. This work has been extended to kernel regressions (Elesedy, 2021).
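The "simple computation" behind (29) can be spelled out as follows. This is a sketch rather than a reproduction of the proof in Elesedy and Zaidi (2021). It writes $f^*$ for the invariant ground-truth function, $\bar{f} = Of$ and $f^{\perp} = f - Of$ for the two components of $f$, and it assumes that the noise $\xi$ has zero mean and is independent of $x$, so that the cross term in the first line vanishes.

```latex
\begin{align*}
R(f) &= \mathbb{E}\,\|f^*(x) + \xi - f(x)\|_2^2
      = \|f^* - f\|_\mu^2 + \mathbb{E}\,\|\xi\|_2^2, \\
\|f^* - f\|_\mu^2 &= \|(f^* - \bar f) - f^\perp\|_\mu^2
      = \|f^* - \bar f\|_\mu^2 + \|f^\perp\|_\mu^2, \\
\Delta(f, \bar f) &= R(f) - R(\bar f)
      = \|f^\perp\|_\mu^2 .
\end{align*}
```

The second line uses that $f^* - \bar{f}$ is invariant while $f^{\perp}$ lies in the orthogonal complement of the invariants, so the two are orthogonal in $L^2(X, \mu)$; the noise term cancels in the difference of risks.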


Reynolds projection onto scaling invariant functions via Weyl's unitarian trick. Let $X = (\mathbb{R}, \mathbb{Z}^k)^d$, where each of the $d$ features $x_i$ has base units exponents $u_i = (u_{i1}, \ldots, u_{ik}) \in \mathbb{Z}^k$ expressed in terms of $k$ base units. The Buckingham Pi theorem discussed in Section 3 shows that units-equivariant functions are in a one-to-one correspondence with functions of (finitely many) dimensionless features (constructed as products of powers of input features). This characterization also implies that the dimensionless functions are exactly the functions that are invariant with respect to unit rescalings. For each base unit, represented here by the index $j \in \{1, \ldots, k\}$, we consider $u_{ij}$, its exponent in each feature $x_i$. Then $f$ is scaling invariant (i.e., dimensionless$^4$) if and only if for all $g \in \mathbb{R}_{>0}$ and all $j \in \{1, \ldots, k\}$ we have

$$f(g^{u_{1j}} x_1, \ldots, g^{u_{dj}} x_d) = f(x_1, \ldots, x_d); \tag{30}$$

this property is also known as self-similarity (Barenblatt, 1996). The rescalings can be seen as an action by the appropriate renormalization group, in this case $G = (\mathbb{R}_{>0})^k$, defined as

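$$(g_1, \ldots, g_k) \cdot (x_1, \ldots, x_d) = \Big( \big(\textstyle\prod_{j=1}^{k} g_j^{u_{1j}}\big)\, x_1, \; \ldots, \; \big(\textstyle\prod_{j=1}^{k} g_j^{u_{dj}}\big)\, x_d \Big). \tag{31}$$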
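As a concrete illustration of this rescaling action and of the invariance condition (30), the following sketch uses a hypothetical pendulum example that is not taken from the paper: three features (length, gravitational acceleration, period) and two base units (metre, second). The exponent matrix `U`, the helper names, and the numerical values are all assumptions made for the example.

```python
import numpy as np

# Assumed features: length L [m], gravitational acceleration g [m s^-2], period T [s].
# Rows index features, columns index base units (metre, second): U[i, j] = u_ij.
U = np.array([[1, 0],
              [1, -2],
              [0, 1]])

def rescale(gvec, x, U):
    """Apply (g_1..g_k) . (x_1..x_d): feature i is multiplied by prod_j g_j^{u_ij}."""
    gvec = np.asarray(gvec, dtype=float)
    factors = np.prod(gvec[None, :] ** U, axis=1)
    return factors * x

def pi_group(x):
    """Dimensionless combination g T^2 / L of the three features."""
    L, g, T = x
    return g * T**2 / L

x = np.array([2.0, 9.81, 2.84])
# Rescale the numerical values as if going from metres to centimetres
# and from seconds to milliseconds.
x_new = rescale([100.0, 1e3], x, U)

print(pi_group(x), pi_group(x_new))  # equal up to rounding: the combination is scaling invariant
print(x[2] / x[0], x_new[2] / x_new[0])  # T / L carries units, so its value changes
```

The dimensionless combination is unchanged by the rescaling, while a dimensioned combination such as $T / L$ is not, which is exactly the distinction that (30) formalizes.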

Since the group is not compact, the results of Elesedy and Zaidi (2021) are not directly applicable. However, it is closely related to the reductive real algebraic group $(\mathbb{R}^{\times})^k$, so some of the concepts can be generalized by using Weyl's unitarian trick. (This will require us to work in complex space.) However, the key property from Elesedy and Zaidi (2021), that the Reynolds projection coincides with the orthogonal projection with respect to the measure, will not hold in real space.

We consider $U$ a space of real analytic functions (for instance, rational monomials). We can define a Reynolds projection to the $G$-invariant functions $\bar{U}$, analogous to (20), by using Weyl's unitarian trick. If $f : \mathbb{R}^d \to \mathbb{R}$ is a real analytic function it has an analytic continuation, $f^{\mathbb{C}} : \mathbb{C}^d \to \mathbb{C}$, such that $f^{\mathbb{C}}|_{\mathbb{R}^d} = f$. Weyl's unitarian trick is to replace the group $G = (\mathbb{R}_{>0})^k$ with the torus $T^k = \{z \in \mathbb{C}^k : |z_i| = 1\}$. Both groups are Zariski-dense in the group $(\mathbb{C}^{\times})^k$; in other words, any polynomial that vanishes identically on either $G$ or $T^k$ actually vanishes identically on all of $(\mathbb{C}^{\times})^k$. (For background on Weyl's trick, see (Fulton and Harris, 2013, pp. 129-131); for more on the Zariski topology and Zariski-denseness, see (Shafarevich, 1994, pp. 22-24).) It follows, because the group action (31) is described by rational functions, that all three groups have the same invariants. On the other hand, the torus $T^k$ is compact, so we can average over it to obtain a projection

$$Qf(x_1, \ldots, x_d) = \int_{T^k} f^{\mathbb{C}}\big( (z_1, \ldots, z_k) \cdot (x_1, \ldots, x_d) \big)\, d\lambda_{\mathbb{C}}(z), \tag{32}$$

where the Haar measure $\lambda_{\mathbb{C}}$ coincides with the (normalized) Lebesgue measure on the torus, namely, the Lebesgue measure scaled by $1/(2\pi)^k$. In order to explain how the projection $Q$ behaves, we first note that the rational monomials define characters for the group action. Namely,

$$x^a := \prod_{i=1}^{d} x_i^{a_i}, \qquad (z_1, \ldots, z_k) \cdot x^a = \Big( \prod_{j=1}^{k} z_j^{\sum_{s=1}^{d} u_{sj} a_s} \Big)\, x^a, \tag{33}$$

4. We assume the output is dimensionless for simplicity; without loss of generality, a dimensioned output function $F : Z \to X_v$ can be obtained as $F(x) = g_{x,u}(f(x))$, where $g_{x,u}$ is fixed.


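and the map

$$\chi_{x^a} : (\mathbb{C}^{\times})^k \to \mathbb{C}^{\times} \tag{34}$$

$$(z_1, \ldots, z_k) \mapsto \prod_{j=1}^{k} z_j^{\sum_{s=1}^{d} u_{sj} a_s} \tag{35}$$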

is a continuous group homomorphism (same for the real characters). Therefore, the rational monomials are eigenvectors of Q:

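$$Q(x^a) = \int_{T^k} z \cdot x^a\, d\lambda_{\mathbb{C}}(z) = \int_{T^k} \chi_{x^a}(z)\, x^a\, d\lambda_{\mathbb{C}}(z) = x^a \int_{T^k} \chi_{x^a}(z)\, d\lambda_{\mathbb{C}}(z). \tag{36}$$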

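A standard computation shows that if $\chi$ is a character of a compact group $G$ with Haar measure $\lambda$, then $\int_G \chi(g)\, d\lambda = \lambda(G)$ if $\chi$ is trivial, and $0$ otherwise. In particular, this shows

$$Q(x^a) = \begin{cases} x^a & \text{if } \sum_{s=1}^{d} u_{sj} a_s = 0 \text{ for all } j = 1, \ldots, k, \\ 0 & \text{otherwise.} \end{cases} \tag{37}$$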

In other words, comparing (37) with (8), a rational monomial is either invariant under the group action (i.e. dimensionless), or it is in the kernel of Q. Thus Q is a Reynolds operator for the group of scalings.
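The dichotomy in (37) can be illustrated numerically by averaging the character of a monomial over a uniform grid on the torus. The sketch below reuses a hypothetical pendulum exponent matrix (length, gravitational acceleration, period in metres and seconds), which is an assumption for the example and not from the paper; the function names are likewise illustrative.

```python
import numpy as np
from itertools import product

# Assumed exponent matrix: rows are features (L, g, T), columns are base units (m, s).
U = np.array([[1, 0],
              [1, -2],
              [0, 1]])

def character_weights(a, U):
    """Exponent of each z_j in the character of x^a: w_j = sum_s u_sj a_s."""
    return U.T @ np.asarray(a)

def torus_average(a, U, n=16):
    """Average the character of x^a over an n^k grid on the torus T^k.

    The uniform grid reproduces the Haar integral exactly (up to rounding)
    whenever n exceeds every |w_j|.
    """
    w = character_weights(a, U)
    angles = 2 * np.pi * np.arange(n) / n
    total = 0.0 + 0.0j
    for theta in product(angles, repeat=U.shape[1]):
        total += np.exp(1j * np.dot(w, theta))
    return total / n ** U.shape[1]

# g T^2 / L has zero weights, so Q keeps it; L alone and g T do not, so Q sends them to 0.
kept = torus_average((-1, 1, 2), U)
dropped_1 = torus_average((1, 0, 0), U)
dropped_2 = torus_average((0, 1, 1), U)
print(abs(kept), abs(dropped_1), abs(dropped_2))
```

In this sketch the eigenvalue of $Q$ on a monomial is the returned average, so checking `U.T @ a == 0` is an equivalent and cheaper test for whether a monomial survives the projection.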

Generalization gap for (complex) units equivariant regressions. In order to use the results from Elesedy and Zaidi (2021) it is not enough to have a projection $Q$. We need $Q$ to be an orthogonal projection in an $L^2$ space. To this end, we consider a space of functions of the original input features, where the units equivariant functions are a linear subspace. The characterization from the dimensional analysis discussed above suggests focusing on rational monomials. Unfortunately, the rational monomials may have poles when features are zero, so in order to define the inner product we will restrict the measure $\mu$ to have bounded support which does not include zero. In particular we can consider $X^d = ([-b, -a] \cup [a, b])^d$ with $0 < a < b$, and $\mu$ the standard (Lebesgue) measure in $\mathbb{R}^d$ restricted to $X^d$ and $0$ outside $X^d$. We let $H = L^2(\mathbb{R}^d, \mu)$. We note that $\mu$ is not scaling invariant, and indeed, no compactly-supported measure is scaling-invariant. Relatedly, $Q$ is not an orthogonal projection in $L^2(\mathbb{R}^d, \mu)$ for any real measure $\mu$. But we will salvage what we can.

Since we extended the class of functions to a complex domain to use Weyl's trick, we also need to extend the measure $\mu$ to $\mu_{\mathbb{C}^d}$. We consider the Lebesgue measure in $\mathbb{C}^d$ with support $(X_{\mathbb{C}})^d$, where $X_{\mathbb{C}} = \{z \in \mathbb{C} : a < |z| < b\}$. Now $H_{\mathbb{C}} = L^2(\mathbb{C}^d, \mu_{\mathbb{C}^d})$. We will see that even though $\mu_{\mathbb{C}^d}$ is a complex analog of $\mu$, the resulting Hilbert spaces are not necessarily comparable. In particular, $Q$ is an orthogonal projection in $L^2(\mathbb{C}^d, \mu_{\mathbb{C}^d})$, due to the fact that $\mu_{\mathbb{C}^d}$ is rotationally symmetric (i.e., invariant under the action by $T^d$).

Proposition 3. If $U \subseteq H_{\mathbb{C}}$ is a linear subspace closed under scalings, then $Q$ defined in (32) is the orthogonal projection onto the space of scaling invariant functions. In particular $U = \bar{U} \oplus U^{\perp}$, where $\bar{U} = \mathrm{Image}(Q)$ and $U^{\perp} = \mathrm{kernel}(Q)$.


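Proof. It suffices to show: (a) $Q(U) \subseteq U$. (b) $f$ is scaling invariant if and only if $Q(f) = f$. (c) $Q$ is self-adjoint for $\langle \cdot, \cdot \rangle_{\mu_{\mathbb{C}^d}}$. This can be shown with a similar argument to (23)-(26):

$$\langle Qf, h \rangle_{\mu_{\mathbb{C}^d}} = \int_{\mathbb{C}^d} \Big\langle \int_{T^k} f(z \cdot x)\, d\lambda_{\mathbb{C}}(z),\; h^*(x) \Big\rangle\, d\mu_{\mathbb{C}^d}(x) \tag{38}$$

$$= \int_{\mathbb{C}^d} \int_{T^k} \langle f(z \cdot x),\; h^*(x) \rangle\, d\lambda_{\mathbb{C}}(z)\, d\mu_{\mathbb{C}^d}(x) \tag{39}$$

$$= \int_{\mathbb{C}^d} \int_{T^k} \langle f(x),\; h^*(z^{-1} \cdot x) \rangle\, d\lambda_{\mathbb{C}}(z)\, d\mu_{\mathbb{C}^d}(x) \tag{40}$$

$$= \langle f, Qh \rangle_{\mu_{\mathbb{C}^d}}, \tag{41}$$

where $h^*$ is the complex conjugate of $h$. Note that (40) holds because the measure $\mu_{\mathbb{C}^d}$ is invariant with respect to scalings in $T^k$ (namely, $\mu_{\mathbb{C}^d}$ is rotationally symmetric).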

Note this argument is not possible for $\mu$. We will see at the end of this section that $Q$ is not self-adjoint with respect to any real inner product in general. Our argument in the previous section shows that when $U$ is a space generated by rational monomials, the projection onto $\bar{U}$ is easy to characterize. In particular, a simple computation (Proposition 4) shows that the complex rational monomials are orthogonal in $L^2(\mathbb{C}^d, \mu_{\mathbb{C}^d})$.

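Proposition 4. Let $U$ be the space of rational monomials, spanned by

$$B := \Big\{ x^a := \prod_{i=1}^{d} x_i^{a_i} \; : \; a = (a_1, \ldots, a_d) \in \mathbb{Z}^d \Big\}. \tag{42}$$

Then for all $x^a, x^{a'} \in B$ with $a \neq a'$ we have $\langle x^a, x^{a'} \rangle_{\mu_{\mathbb{C}^d}} = 0$.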

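Proof. Let $a \neq a'$; then

$$\langle x^a, x^{a'} \rangle_{\mu_{\mathbb{C}^d}} = \int_{(X_{\mathbb{C}})^d} \prod_{i=1}^{d} x_i^{a_i}\, \overline{x_i^{a'_i}}\; d\mu_{\mathbb{C}^d} \tag{43}$$

$$= \int_{(X_{\mathbb{C}})^d} \prod_{i=1}^{d} r_i^{a_i} e^{j a_i \theta_i}\, r_i^{a'_i} e^{-j a'_i \theta_i}\; d\mu_{\mathbb{C}^d}, \tag{44}$$

where $x_i = r_i e^{j \theta_i}$ and $j = \sqrt{-1}$. Now since $a \neq a'$ we can choose $s$ such that $a_s \neq a'_s$. Using the Fubini-Tonelli theorem we can write:

$$\langle x^a, x^{a'} \rangle_{\mu_{\mathbb{C}^d}} = \int_{(X_{\mathbb{C}})^{d-1}} \prod_{i \neq s} r_i^{a_i + a'_i} e^{j (a_i - a'_i) \theta_i}\; d\mu_{\mathbb{C}^{d-1}} \int_{X_{\mathbb{C}}} r_s^{a_s + a'_s} e^{j (a_s - a'_s) \theta_s}\; d\mu_{\mathbb{C}^1}, \tag{45}$$

where the last factor is zero because $a_s \neq a'_s$ and the measure $\mu_{\mathbb{C}^1}$ is rotationally symmetric.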
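The orthogonality of distinct complex monomials can be checked numerically in the one-dimensional case $d = 1$. The sketch below is an illustration under assumed values: the annulus radii `r_in` and `r_out`, the quadrature resolution, and the function name are hypothetical choices, and the integral is approximated by midpoint quadrature in the radius with a uniform grid in the angle.

```python
import numpy as np

def annulus_inner(a, a_prime, r_in=0.5, r_out=2.0, n_r=400, n_theta=64):
    """Approximate the integral of x^a * conj(x^a') over {r_in < |x| < r_out} in C.

    Midpoint quadrature in r, uniform grid in theta, area element r dr dtheta.
    """
    dr = (r_out - r_in) / n_r
    r = r_in + dr * (np.arange(n_r) + 0.5)
    theta = 2 * np.pi * np.arange(n_theta) / n_theta
    R, T = np.meshgrid(r, theta, indexing="ij")
    x = R * np.exp(1j * T)
    integrand = x ** a * np.conj(x ** a_prime) * R
    return integrand.sum() * dr * (2 * np.pi / n_theta)

off_diagonal = annulus_inner(2, -1)   # distinct exponents: angular integral vanishes
diagonal = annulus_inner(1, 1)        # equal exponents: squared norm of x on the annulus
print(abs(off_diagonal), diagonal.real)
```

The off-diagonal value vanishes because the angular factor $e^{j (a - a') \theta}$ integrates to zero over a full rotation, which is the same mechanism used in (45); the radial factor plays no role in the cancellation.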

This setting allows generalizing the results from Elesedy and Zaidi (2021) to complex scalings of complex functions.


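Proposition 5. Let $X \sim \mu_{\mathbb{C}^d}$, where $\mu_{\mathbb{C}^d}$ is a rotation invariant distribution in $A \subseteq \mathbb{C}^d$. Let $Y = f^*(X) + \xi \in \mathbb{C}$, where $\xi$ is a random element of $\mathbb{C}$ that is independent of $X$ with zero mean and finite variance, and $f^* : A \to \mathbb{C}$ is scaling invariant. Then, for any $f$, the generalization gap satisfies

$$\Delta(f, Qf) = \|f^{\perp}\|_{\mu_{\mathbb{C}^d}}^2, \tag{46}$$

where $f^{\perp} = f - Qf$.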

Discussion of real units-equivariant functions. Even though some dimensional quantities can be complex, for example, electromagnetic field amplitudes in Fourier space, most dimensional quantities are real-valued, and the dimensional scalings are always real; therefore the above theory is not directly applicable. Unfortunately, the analysis above will not hold for the real case. Note that the Reynolds operator defined in (32) is well-defined for real-analytic functions and delivers the same projection in the rational monomials case (i.e., it drops the non-dimensionless rational monomials). However, this projection does not correspond to an orthogonal projection with respect to any nontrivial measure $\mu$. One way to see this is by observing that the monomials cannot be orthogonal for any non-trivial real measure $\mu$. For example, suppose we have two input features $x, y$, and the dimensionless space is generated by $xy$. Then the Reynolds projection $Q$ will satisfy $Q(xy^{-1}) = 0$ (see Eq. (37)). However, no non-trivial real measure will satisfy $\langle xy, xy^{-1} \rangle_{\mu} = 0$, because $\langle xy, xy^{-1} \rangle_{\mu} = \int (xy)(xy^{-1})\, d\mu = \int x^2\, d\mu = \langle x, x \rangle_{\mu} > 0$.

The underlying question here is, what is the right notion of projection of a real function onto the space of units-equivariant functions? The Reynolds projection is the most natural projection from an algebraic point of view, per the discussion following (20). The orthogonal projection with respect to the $L^2$-norm of the measure of the data is the one corresponding to the estimator risk (27). The fact that these two projections diverge in the present case stems from the fact that the measure from which the data is drawn cannot itself be scaling-invariant. Furthermore, there is no reason to expect that the space spanned by rational monomials is closed under the orthogonal projection in $L^2(\mathbb{R}^d, \mu)$. Note that our algorithm does not perform a projection; it directly optimizes in a space of invariant functions. The generalization gap one may want to investigate is $\Delta(\hat{f}, \tilde{f})$, where $\hat{f}$ is the output of a baseline regression and $\tilde{f}$ is our units-equivariant regression. We show specific examples in Section 5.
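The real-valued counterexample with the monomials $xy$ and $xy^{-1}$ can be made concrete with a small quadrature sketch. Everything specific below is an assumption for illustration: the measure is taken to be Lebesgue measure on $([-b, -a] \cup [a, b])^2$ with $a = 0.5$ and $b = 2$, and the helper names are hypothetical.

```python
import numpy as np

def real_inner(f, h, a=0.5, b=2.0, n=400):
    """Approximate <f, h>_mu for Lebesgue measure on ([-b,-a] U [a,b])^2 (midpoint rule)."""
    step = (b - a) / n
    pos = a + step * (np.arange(n) + 0.5)
    pts = np.concatenate([-pos[::-1], pos])  # 1-D nodes on [-b,-a] U [a,b]
    X, Y = np.meshgrid(pts, pts, indexing="ij")
    return np.sum(f(X, Y) * h(X, Y)) * step**2

dimensionless = lambda x, y: x * y   # assumed dimensionless monomial x y
dimensioned = lambda x, y: x / y     # the monomial x y^{-1}, which Q sends to zero

value = real_inner(dimensionless, dimensioned)
# Closed form for comparison: (integral of x^2 dx) times (integral of dy) on the same domain.
exact = (2 * (2.0**3 - 0.5**3) / 3) * (2 * (2.0 - 0.5))
print(value, exact)
```

The inner product is strictly positive, so the two monomials are not orthogonal under this real measure even though $Q$ annihilates $xy^{-1}$ and fixes $xy$; this is the gap between the Reynolds projection and the $L^2$ orthogonal projection described above.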