# Approximation-Generalization Trade-offs under (Approximate) Group Equivariance

Mircea Petrache (UC Chile, Fac. de Matemáticas & Inst. de Ingeniería Matematica y Computacional, Av. Vicuña Mackenna 4860, Santiago, 6904441, Chile; mpetrache@mat.uc.cl) and Shubhendu Trivedi (shubhendu@csail.mit.edu)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract. The explicit incorporation of task-specific inductive biases through symmetry has emerged as a general design precept in the development of high-performance machine learning models. For example, group equivariant neural networks have demonstrated impressive performance across various domains and applications such as protein and drug design. A prevalent intuition about such models is that the integration of relevant symmetry results in enhanced generalization. Moreover, it is posited that when the data and/or the model may only exhibit approximate or partial symmetry, the optimal or best-performing model is one where the model symmetry aligns with the data symmetry. In this paper, we conduct a formal unified investigation of these intuitions. To begin, we present general quantitative bounds that demonstrate how models capturing task-specific symmetries lead to improved generalization. In fact, our results do not require the transformations to be finite or even form a group, and can work with partial or approximate equivariance. Utilizing this quantification, we examine the more general question of model mis-specification, i.e. when the model symmetries don't align with the data symmetries. We establish, for a given symmetry group, a quantitative comparison between the approximate/partial equivariance of the model and that of the data distribution, precisely connecting model equivariance error and data equivariance error. Our result delineates conditions under which the model equivariance error is optimal, thereby yielding the best-performing model for the given task and data. Our results are the most general results of their type in the literature.

1 Introduction

It is a common intuition that machine learning models that explicitly incorporate task-specific symmetries tend to exhibit superior generalization. The rationale is simple: if aspects of the task remain unchanged or change predictably when subject to certain transformations, then a model without knowledge of them would require additional training samples to learn to disregard them. While the general idea is by no means new in machine learning, it has experienced a revival in interest due to the remarkable successes of group equivariant convolutional neural networks (GCNNs) [14, 38, 29, 16, 13, 60, 8, 34, 24, 66]. Such networks hard-code equivariance to the action of an algebraic group on the inputs and have shown promise in a variety of domains [55, 5, 42, 2, 28, 62, 25, 52, 18, 61, 68], in particular in those where data is scarce but exhibits clear symmetries. Further, the benefits of encoding symmetries in this manner extend beyond merely enhancing generalization. Indeed, in numerous applications in the physical sciences, neglecting to do so could lead to models producing unphysical outputs. However, the primary focus of most of the work in the area has been on estimating functions that are strictly equivariant under a group action. It has often been noted that this may impose an overly stringent constraint, as real-world data seldom exhibits perfect symmetry.
The focus on strict equivariance is partly due to convenience: the well-developed machinery of group representation theory and noncommutative harmonic analysis can easily be marshalled to design models that are exactly equivariant. A general prescriptive theory for such models also exists [29, 16]. In a certain sense, the basic architectural considerations of such models are fully characterized. Recent research has begun to explore models that enforce only partial or approximate equivariance [40, 59, 23, 56]. Some of these works suggest interpolating between models that are exactly equivariant and those that are fully flexible, depending on the task and data. The motivation here is analogous to what we stated at the onset: if the data has a certain symmetry, whether exact, approximate, or partial, and the model's symmetry does not match it, its performance will suffer. We would expect that a model will perform best when its symmetry is correctly specified, that is, when it aligns with the data symmetry. Despite increased research interest in the area, the theoretical understanding of these intuitions remains unsatisfactory. Some recent work has started to study improved generalization for exactly equivariant/invariant models (see Section 2). However, for the more general case when the data and/or model symmetries are only approximate or partial, a general treatment is completely missing. In this paper, we take a step towards addressing this gap. First, we show that a model enforcing task-specific invariance/equivariance exhibits better generalization. Our theoretical results subsume many existing results of a similar nature. Then we consider the question of model mis-specification under partial/approximate symmetries of the model and the data, and prove a general result that formalizes and validates the intuition we articulated above. We summarize below the main contributions of the paper:

- We present quantitative bounds, the most general to date, showing that machine learning models enforcing task-pertinent symmetries in an equivariant manner afford better generalization. In fact, our results do not require the set of transformations to be finite or even be a group. While our investigation was initially motivated by GCNNs, our results are not specific to them.

- Using the above quantification, we examine the more general setting when the data and/or model symmetries are approximate or partial and the model might be misspecified relative to the data symmetry. We rigorously formulate the relationship between model equivariance error and data equivariance error, teasing out the precise conditions under which the model equivariance error is optimal, that is, provides the best generalization for given data. To the best of our knowledge, this is the most general result of its type in the literature.

2 Related Work

As noted earlier, the use of symmetry to encode inductive biases is not a new idea in machine learning and is not particular to neural networks. One of the earliest explorations of this idea in the context of neural networks appears to be [46]. Interestingly, [46] originated in response to the group invariance theorems in Minsky & Papert's Perceptrons [36], and was the first paper in what later became a research program carried out by Shawe-Taylor & Wood, building more general symmetries into neural networks [63, 48, 49, 65, 64], which also included PAC-style analysis [50].
While Shawe-Taylor & Wood came from a different perspective and already covered discrete groups [14, 38], modern GCNNs go beyond their work [29, 16, 13, 60, 8, 34, 24]. Besides, GCNNs [14] were originally proposed as generalizing the idea of the classical convolutional network [32], and seem to have been inspired by symmetry considerations from physics [15]. Outside neural networks, invariant/equivariant methods have also been proposed in the context of support vector machines [43, 57], general kernel methods [39, 26], and polynomial-based feature spaces [44].

On the theoretical side, the assertion that invariances enhance generalization can be found in many works, going back to [47]. It was argued by [1] that restricting a classifier to be invariant cannot increase its VC dimension. Other examples include [3, 4, 51, 41]. The argument in these, particularly [51, 41], first characterizes the input space and then proceeds to demonstrate that certain data transformations shrink the input space for invariant models, simplifying the input and improving generalization. Some of our results generalize their results to the equivariant case and to more general transformations. Taking a somewhat different tack, [22] showed a strict generalization benefit for equivariant linear models, showing that the generalization gap between a least squares model and its equivariant counterpart depends on the dimension of the space of anti-symmetric linear maps. This result was adapted to the kernel setting in [19]. [9] studied sample complexity benefits of invariance in a kernel ridge regression setting under a geometric stability assumption, and briefly discussed going beyond group equivariance. [53] expands upon [9] by characterizing minimax optimal rates for the convergence of population risk in a similar setting when the data resides on certain manifolds. [35] characterizes the benefits of invariance in overparameterized linear models, working with invariant random features projected onto high-dimensional polynomials. Benefits of related notions like data augmentation and invariant averaging are formally shown in [33, 12]. Using a standard method from the PAC-Bayesian literature [37] and working on the intertwiner space, [7] provide a PAC-Bayesian bound for equivariant neural networks. Some improved generalization bounds for transformation-invariant functions are proposed in [67], using the notion of an induced covering number. A work somewhat closer to some of our contributions in this paper is [20], which gives PAC bounds for (exact) group equivariant models. However, we note that the proof of the main theorem in [20] has an error (see the Appendix for a discussion), although the error seems to be corrected in the author's dissertation [21]. An analysis that somewhat overlaps [20], but goes beyond simply providing worst-case bounds, was carried out by [45]. Finally, recent empirical work on approximate and partial equivariance includes [40, 59, 23, 56]. These make the case that enforcing strict equivariance, if misspecified, can harm performance, arguing for more flexible models that can learn the suitable degree of equivariance from data. However, there is no theoretical work that formalizes the underlying intuition.

3 Preliminaries

In this section, we introduce basic notation and formalize some key notions that will be crucial to ground the rest of our exposition.

3.1 Learning with equivariant models

Learning problem.
To begin, we first describe the general learning problem, which we then adapt to the equivariant case. Let's say we have an input space $\mathcal X$ and an output space $\mathcal Y$, and assume that the pairs $Z = (X, Y) \in \mathcal X \times \mathcal Y$ are random variables having distribution $\mathcal D$. Suppose we observe a sequence of $n$ i.i.d. pairs $Z_i = (X_i, Y_i) \sim \mathcal D$, and want to learn a function $\widetilde f : \mathcal X \to \mathcal Y$ that predicts $Y$ given some $X$. We will consider that $\widetilde f$ belongs to a class of functions $\widetilde{\mathcal F} \subseteq \{\widetilde f : \mathcal X \to \mathcal Y\}$ and that we work with a loss function $\ell : \mathcal Y \times \mathcal Y \to [0, \infty)$. To fully specify the learning problem we also need a loss class, comprising the functions $f(x, y) := \ell(\widetilde f(x), y) : \mathcal X \times \mathcal Y \to [0, \infty)$. Using these notions, we can define:
$$\mathcal F := \{f(x, y) := \ell(\widetilde f(x), y) : \widetilde f \in \widetilde{\mathcal F}\}, \qquad Pf := \mathbb E[f(X, Y)], \qquad P_n f := \frac1n \sum_{i=1}^n f(X_i, Y_i).$$
The quantities $Pf$ and $P_n f$ are known as the risk and the empirical risk associated with $\widetilde f$, respectively. In practice we have access to $P_n f$, and we estimate $Pf$ by it. One of the chief quantities of interest in such a setup is the worst-case error of this approximation on a sample $\{Z_i\} = \{Z_i\}_{i=1}^n$, that is, the generalization error:
$$\mathrm{GenErr}(\mathcal F, \{Z_i\}, \mathcal D) := \sup_{f \in \mathcal F}\,\big(Pf - P_n f\big).$$

Group actions. We are interested in the setting where a group $G$ acts by measurable bijective transformations over $\mathcal X$ and $\mathcal Y$, which can also be seen as a $G$-action over $\mathcal Z := \mathcal X \times \mathcal Y$, denoted by $g \cdot Z = (g \cdot X, g \cdot Y)$ for $g \in G$. Such actions can also be applied to points $(x, \widetilde f(x))$ in the graph of a function $\widetilde f$, transforming any $\widetilde f \in \widetilde{\mathcal F}$ into $g \cdot \widetilde f : x \mapsto g^{-1} \cdot \widetilde f(g \cdot x)$. The set $G \cdot \widetilde f := \{g \cdot \widetilde f : g \in G\}$ forms the orbit of $\widetilde f$ under the action of $G$. Note that although we treat $G$ as an abstract group throughout, we will sometimes abuse notation and use the same letter to also refer to the representations of $G$ as transformations over the spaces in which $X, Y, Z, \widetilde f, f$ live.

Equivariant models. We are now in a position to specify the learning setup with equivariant models. We say that $\widetilde f \in \widetilde{\mathcal F}$ is equivariant if $G \cdot \widetilde f = \{\widetilde f\}$. Further, we say that the hypothesis class is $G$-equivariant if all $\widetilde f \in \widetilde{\mathcal F}$ are equivariant. Note that if the $G$-action over $\mathcal Y$ is trivial, that is, if we have $g \cdot y = y$ for all $g \in G$ and $y \in \mathcal Y$, then $\widetilde f$ being equivariant reduces to requiring $\widetilde f(g \cdot x) = \widetilde f(x)$ for all $g, x$. In such a case the model $\widetilde f$ is said to be invariant. Also note that $\widetilde f$ being equivariant corresponds to $f(x, y) = \ell(\widetilde f(x), y)$ being invariant under a suitable product $G$-action over $\mathcal X \times \mathcal Y$.

Averaging over a set of transformations. Let $G$ be a set of transformations of $\mathcal Z$. The $G$-averaged generalization error for arbitrary $\mathcal F$, obtained by taking averages $\mathbb E_g$ with respect to the natural probability measure $\mathbb P_G$ on $G$ (we take $\mathbb P_G$ to be the uniform measure over $G$ if $G$ is finite, and the normalization of the Haar measure of $\overline G$ to $G$ if $G \subseteq \overline G$ is a positive-measure subset of an ambient locally compact group $\overline G$), is as follows:
$$\mathrm{GenErr}^{(G)}(\mathcal F, \{Z_i\}, \mathcal D) := \sup_{f \in \mathcal F}\Big( \mathbb E_g \mathbb E_Z[f(g \cdot Z)] - \frac1n \sum_{i=1}^n \mathbb E_g f(g \cdot Z_i) \Big).$$
Note that the above equals $\mathrm{GenErr}(\mathcal F, \{Z_i\}, \mathcal D)$ if $\mathcal F$ is composed of invariant functions.

Data augmentation/Orbit averaging. Following our discussion above, it might be instructive to consider the case of data augmentation. Note that a data distribution which is invariant under some $G$-action can be created from any $\mathcal D$ by averaging over $G$-orbits: we take $g$ to be a random variable uniformly distributed over a compact set of transformations $G$, and let $\mathcal D^{(G)} := \frac{1}{|G|}\int_G g \cdot \mathcal D\, dg$. This is the distribution of the dataset obtained from $\mathcal D$ under $G$-augmentation. Then we have by change of variables:
$$\mathrm{GenErr}^{(G)}(\mathcal F, \{Z_i\}, \mathcal D) = \mathrm{GenErr}(\mathcal F, \{\mathbb E_g[g \cdot Z_i]\}, \mathcal D^{(G)}). \qquad (3.1)$$
Note that the notions of orbit and averaging used above do not rely on the property that $G$ is a group in order to be well-defined.
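To make the objects above concrete, here is a minimal self-contained Python sketch (our own illustration, not code from the paper; all function and variable names are hypothetical). It instantiates the empirical risk $P_n f$, a simple train/test plug-in estimate of $Pf - P_n f$ for a single hypothesis, and the $G$-averaged risk for the toy finite group $G = \mathbb Z/4\mathbb Z$ acting on inputs by cyclic shifts, with the trivial action on labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: X = R^4, Y = R, and G = Z/4Z acting by cyclic shifts of the
# coordinates of x (trivial action on y), so "equivariant" here means invariant.
def shift(x, g):
    return np.roll(x, g, axis=-1)

def loss(y_pred, y):                      # squared loss l(y', y) = (y' - y)^2
    return (y_pred - y) ** 2

def risk(f_tilde, X, Y):                  # empirical risk P_n f
    return np.mean(loss(f_tilde(X), Y))

def g_averaged_risk(f_tilde, X, Y, G=range(4)):
    # E_g[ l(f~(g.x), y) ], the group-averaged empirical risk
    return np.mean([risk(lambda x: f_tilde(shift(x, g)), X, Y) for g in G])

# Invariant ground truth: y* depends on x only through its shift-invariant mean.
def y_star(X):
    return np.sin(X.mean(axis=-1))

X_train, X_test = rng.normal(size=(50, 4)), rng.normal(size=(5000, 4))
Y_train, Y_test = y_star(X_train), y_star(X_test)

# Two hypotheses: one invariant (pools the input), one not (uses one coordinate).
f_inv = lambda X: np.sin(X.mean(axis=-1))
f_noninv = lambda X: np.sin(X[..., 0])

for name, f in [("invariant", f_inv), ("non-invariant", f_noninv)]:
    gap = risk(f, X_test, Y_test) - risk(f, X_train, Y_train)  # proxy for Pf - P_n f
    print(name,
          "train:", round(risk(f, X_train, Y_train), 4),
          "G-averaged:", round(g_averaged_risk(f, X_train, Y_train), 4),
          "gap:", round(gap, 4))
```

For the invariant hypothesis the plain and $G$-averaged risks coincide, matching the remark that $\mathrm{GenErr}^{(G)}$ reduces to $\mathrm{GenErr}$ on classes of invariant functions.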
3.2 Partial and Approximate Equivariance

Since our analysis treats the general case where the model and/or data symmetries are not strict, we now state some basic attendant notions.

Error to equivariance function. We first need a measure to quantify how far off we are from perfect equivariance. To do so, we define a function in terms of $f(x, y)$ as:
$$\mathrm{ee} : \mathcal F \times G \to [0, +\infty), \qquad \mathrm{ee}(f, g) := \|g \cdot f - f\|_\infty.$$
The crucial property here is that $\mathrm{ee}(f, g) = 0$ iff $g \cdot f = f$. Note that one could use other distances or discrepancies (e.g. the $\ell^2$ norm) directly in terms of $\widetilde f$, using which we can measure the error between $g \cdot \widetilde f$ and $\widetilde f$.

More general sets of transformations. We can use $\mathrm{ee}(\cdot)$ to precisely specify the notions of partial and approximate equivariance that we work with. We consider the general case where the underlying transformations on $(X, Y)$ still have a common label set $G$ and are assumed to be bijective over $\mathcal X$ and $\mathcal Y$, but don't necessarily form a group. For $\epsilon > 0$ and $f \in \mathcal F$ we consider the $\epsilon$-stabilizer
$$\mathrm{Stab}_\epsilon(f) := \{g \in G : \mathrm{ee}(f, g) \le \epsilon\}.$$
It is fairly easy to see that this subset is automatically a group for $\epsilon = 0$, but most often will not be when $\epsilon > 0$. The latter corresponds to partial symmetry and thus to the setting of partial equivariance. On the other hand, if we consider specific hypotheses such that for some $\epsilon > 0$ we have $\mathrm{Stab}_\epsilon(f) = G$ for all $f \in \mathcal F$, then we are working in the setting of approximate equivariance [59, 56].

Conditions on G. We conclude this section with a remark on the nature of $G$ we consider in our paper. In our bounds, we first restrict to finite $G$. However, this setting can be extended to compact groups which are doubling with respect to a metric that respects the composition operation, that is, a distance $d_G$ over $G$ such that $d_G(gh, g'h) = d_G(g, g')$. The main examples we have in mind are compact Lie groups such as $S^1$, $SO(n)$, $O(n)$, and their finite subgroups like $\mathbb Z/N\mathbb Z$, as well as their quotients and products.

4 PAC bounds for G-equivariant hypotheses

In this section, we study the generalization properties of equivariant models. More specifically, we derive PAC bounds for both exact and approximately $G$-equivariant hypotheses. But first, we present some preliminaries related to PAC bounds in what follows.

PAC learning bounds. We follow the standard approach (such as that outlined in [10]). We start with the following concentration bound, where $\mathcal F$ is composed of functions with range $[-M/2, M/2]$:
$$\mathbb P\Big( \sup_{f \in \mathcal F} |(P - P_n)f| \ge R(\mathcal F_Z) + \epsilon \Big) \le 2 \exp\Big( -\frac{\epsilon^2 n}{4\,\|\mathcal F\|_\infty} \Big), \qquad (4.1)$$
where $R(\mathcal F_Z)$ denotes the Rademacher complexity, given by
$$R(\mathcal F_Z) := \mathbb E_\sigma \sup_{f \in \mathcal F} \frac1n \sum_{i=1}^n \sigma_i f(Z_i),$$
and where $Z_1, \dots, Z_n \sim \mathcal D$ are $n$ i.i.d. samples and $\sigma = (\sigma_1, \dots, \sigma_n)$ denotes the so-called Rademacher variable, a uniformly distributed random variable over $\{-1, 1\}^n$, independent of the $Z_i$'s. $R(\mathcal F_Z)$ can be bounded using a classical technique for bounding the expected suprema of random processes indexed by the elements of a metric space, variously called the Dudley entropy integral or the chaining bound. The result below was proven in [45, Lemma 3] (also see slightly stronger results in [58, Thm. 17] or [6, Thm. 1.1]):
$$R(\mathcal F_Z) \le \inf_{\alpha > 0}\Big( 4\alpha + \frac{12}{\sqrt n} \int_\alpha^{\infty} \sqrt{\ln N(\mathcal F, t, \|\cdot\|_\infty)}\, dt \Big). \qquad (4.2)$$
Recall that for a metric space $(X, d_X)$, the covering number $N(X, \epsilon, d_X)$ is the smallest number of balls of radius $\epsilon$ required to cover $X$. In (4.2) we use the covering numbers of $\mathcal F$ in the supremum distance, i.e. the distance between two functions $f, g \in \mathcal F$ given by $\|f - g\|_\infty := \sup_{z \in \mathcal Z} |f(z) - g(z)|$.
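Covering numbers and the dimension-driven scaling used below can be sanity-checked numerically. The sketch that follows is our own illustration (the helper `greedy_cover` is hypothetical, and a greedy net only approximates the optimal covering number): it covers a 2-dimensional set embedded in $\mathbb R^5$ at several scales and reads the effective dimension off the slope of $\log N$ versus $\log(1/\epsilon)$, mirroring the $N(\mathcal Z, \epsilon) \approx (2\,\mathrm{diam}(\mathcal Z)/\epsilon)^{\mathrm{ddim}(\mathcal Z)}$ behaviour invoked in (4.5) and (4.6).

```python
import numpy as np

rng = np.random.default_rng(2)

def greedy_cover(points, eps):
    """Greedy eps-net: returns centers of a cover by eps-balls.
    The number of centers is a constant-factor approximation of N(X, eps)."""
    centers = []
    remaining = points.copy()
    while len(remaining) > 0:
        c = remaining[0]
        centers.append(c)
        remaining = remaining[np.linalg.norm(remaining - c, axis=1) > eps]
    return np.array(centers)

# Sample Z: a 2-dimensional disk embedded in R^5, so ddim(Z) should be ~2.
theta = rng.uniform(0, 2 * np.pi, 4000)
r = np.sqrt(rng.uniform(0, 1, 4000))
disk = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
Z = np.concatenate([disk, np.zeros((4000, 3))], axis=1)

# log N(Z, eps) vs log(1/eps) should have slope close to ddim(Z) = 2.
eps_grid = [0.4, 0.2, 0.1, 0.05]
counts = [len(greedy_cover(Z, e)) for e in eps_grid]
slope = np.polyfit(np.log(1 / np.array(eps_grid)), np.log(counts), 1)[0]
print("covering counts:", counts, " estimated ddim ~", round(slope, 2))
```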
It is known that, by rescaling the fundamental estimate due to Kolmogorov-Tikhomirov [54, eq. 238, with s = 1, and eq. 1], and under the mild assumption that $\mathcal F$ is composed of 1-Lipschitz functions on $\mathcal Z$ (all of our formulas generalize to classes of $\ell$-Lipschitz functions for general $\ell > 0$: the change amounts to applying suitable rescalings, see the Appendix for more details) with values in an interval $[-M/2, M/2]$, for a centralizable metric space $\mathcal Z$ (this mild condition signifies that for any open set $U$ of diameter at most $2r$ there exists a point $x_0$ so that $U$ is contained in $B(x_0, r)$), the following holds:
$$N(\mathcal Z, 2\epsilon) \le \log_2 N(\mathcal F, \epsilon, \|\cdot\|_\infty) \le \log_2\Big(\Big\lceil \frac{M}{\epsilon}\Big\rceil + 1\Big) + N(\mathcal Z, \epsilon/2). \qquad (4.3)$$
Before we start stating our results, we need to define a few more notions.

Doubling and Hausdorff dimensions. If in a metric space $(X, d_X)$ every ball of radius $R$ can be covered by at most $\lambda$ balls of radius $R/2$, the space has doubling dimension $\mathrm{ddim}(X) = \log_2 \lambda$. The doubling dimension coincides with the usual notion of (Hausdorff) dimension $\dim X$, i.e. $\mathrm{ddim}\, X = \dim X$, in case $X$ is a compact manifold with bounded injectivity radius; in particular it equals $d$ if $X \subseteq \mathbb R^D$ is a regular $d$-dimensional submanifold, or if $D = d$ and $X$ is a regular open subset such as a ball or a cube.

Discrete metric spaces. A metric space $(X, d_X)$ is discrete if it has strictly positive minimum separation distance $\delta_X := \min_{x \ne x' \in X} d_X(x, x') > 0$. We then have
$$N(X, \epsilon) = N(X, \min\{\epsilon, \delta_X\}), \qquad (4.4)$$
which is especially relevant for finite groups $G$ endowed with the word distance (i.e. $d_G(g, g')$ is the minimum number of generators required to express $g^{-1}g'$), for which $\delta_G = 1$.

Now, note that as a straightforward consequence of the definition, there exists a universal $c > 0$ such that [30, 31]:
$$N(\mathcal Z, \epsilon) \le c \left(\frac{2\, \mathrm{diam}(\mathcal Z)}{\epsilon}\right)^{\mathrm{ddim}(\mathcal Z)}, \qquad (4.5)$$
and by (4.3) we get the simplified bound (where implicit constants are universal):
$$\ln N(\mathcal F, \epsilon, \|\cdot\|_\infty) \lesssim \left(\frac{4\, \mathrm{diam}(\mathcal Z)}{\epsilon}\right)^{\mathrm{ddim}(\mathcal Z)}. \qquad (4.6)$$
By all the above considerations, we can bound the generalization error as follows:

Proposition 4.1. Assume that $d = \mathrm{ddim}(\mathcal Z) > 2$ and $0 < \delta < 1/2$, and let $D := \mathrm{diam}(\mathcal Z)$. Then for any probability distribution $\mathcal D$ of data over $\mathcal Z$, with notation as in the beginning of Section 4, the following holds with probability at least $1 - \delta$:
$$\mathrm{GenErr}(\mathcal F, \{Z_i\}, \mathcal D) \lesssim \frac{d}{d-2}\left(\frac{4^d D^d}{n}\right)^{1/d} + n^{-1/2}\sqrt{\|\mathcal F\|_\infty \log(2/\delta)},$$
where the implicit constant is independent of $\delta, n$, and depends on $\mathcal Z$ only through the constants in (4.5).

Due to space constraints, all our proofs are relegated to the Appendix. With the above background, we now first show generalization bounds under the more general notions of equivariance we have discussed.

4.1 Generalization bounds improvement under partial or approximate equivariance

In this section, we prove generalization error bounds with the notations and definitions from Section 3.2. We consider the following sets of transformations:
$$\mathrm{Stab}_\epsilon = \mathrm{Stab}_\epsilon(\mathcal F) := \bigcap_{f \in \mathcal F} \mathrm{Stab}_\epsilon(f).$$
We note that the strict equivariance case is recovered if, for $\epsilon = 0$, we have $\mathrm{Stab}_0 = G$.

Proposition 4.2. Let $\mathrm{Stab}_\epsilon$ be as above, and assume that $|\mathrm{Stab}_\epsilon| > 0$. Let $\mathcal Z^0_\epsilon \subseteq \mathcal Z$ be a measurable choice of $\mathrm{Stab}_\epsilon$-orbit representatives for points in $\mathcal Z$, and let $\iota^0_\epsilon : \mathcal Z \to \mathcal Z^0_\epsilon$ be the measurable map that associates to each $z \in \mathcal Z$ its representative in $\mathcal Z^0_\epsilon$. Let $\mathcal F^0_\epsilon := \{f|_{\mathcal Z^0_\epsilon} : f \in \mathcal F\}$ and let $\iota^0_\epsilon(\mathcal D)$ denote the image measure of $\mathcal D$. Then for each $n \in \mathbb N$, if $\{Z_i\}_{i=1}^n$ are i.i.d. samples with $Z_i \sim \mathcal D$ and $Z^0_{i,\epsilon} := \iota^0_\epsilon \cdot Z_i$, then we have
$$\mathrm{GenErr}(\mathcal F, \{Z_i\}, \mathcal D) \le 2\epsilon + \mathrm{GenErr}^{(\mathrm{Stab}_\epsilon)}(\mathcal F, \{Z_i\}, \mathcal D) = 2\epsilon + \mathrm{GenErr}(\mathcal F^0_\epsilon, \{Z^0_{i,\epsilon}\}, \iota^0_\epsilon(\mathcal D)).$$
The above result says that the generalization error in our setting of interest can roughly be obtained by working with a reduced set consisting only of the orbit representatives of points in $\mathcal Z$. This is in line with prior work such as [51, 20, 41]. However, note that our result is already more general and does not require that the set of transformations form a group. With this, we now need an understanding of the covering of the spaces of representatives $\mathcal Z^0_\epsilon$ and of the effect of $\mathrm{Stab}_\epsilon$ on them. The answer is evident in case $\mathrm{Stab}_\epsilon = G$, where $G$ and $\mathcal Z^0$ are manifolds and $\mathcal Z = \mathcal Z^0 \times G$. Since ddim coincides with the topological dimension, we immediately have $d_0 = d - \dim(G)$. Intuitively, the dimensionality of $G$ can be understood as eliminating degrees of freedom from $\mathcal Z$, and it is this effect that improves generalization from $n^{-1/d}$ to $n^{-1/(d - \dim(G))}$. We now state a simplified form of Theorem A.3, which is sufficient for proceeding further. Our results generalize [51, Thm. 3]: we allow for non-product structure, possibly infinite sets of transformations that may not form a group, and we relax the distance deformation hypotheses for the action on $\mathcal Z$.

Corollary 4.3 (of Thm. A.3). With the same notation as above, assume that the transformations corresponding to the action of elements of $\mathrm{Stab}_\epsilon$ satisfy the following for some $L > 0$, and for a choice of a set $\mathcal Z^0_\epsilon \subseteq \mathcal Z$ of representatives of $\mathrm{Stab}_\epsilon$-orbits:

1. For all $z_0, z_0' \in \mathcal Z^0_\epsilon$ and all $g \in \mathrm{Stab}_\epsilon$ it holds that $d(z_0, z_0') \le L\, d(g \cdot z_0, g \cdot z_0')$.

2. For all $g, g' \in \mathrm{Stab}_\epsilon$ it holds that $d_G(g, g') \le L\, \mathrm{dist}(g \cdot \mathcal Z^0_\epsilon, g' \cdot \mathcal Z^0_\epsilon)$ (here $\mathrm{dist}(A, B) := \min\{d(a, b) : a \in A, b \in B\}$ denotes the distance between sets, induced by $d$).

Then for each $\delta > 0$ there holds $N(\mathcal Z^0_\epsilon, \delta) \le N(\mathcal Z, \delta/2L)\,/\,N(\mathrm{Stab}_\epsilon, \delta)$.

To be able to quantify the precise generalization benefit we need a bound on the quantity $N(\mathrm{Stab}_\epsilon, \delta)$. For doing so, we assume that $\mathrm{Stab}_\epsilon$ is a finite subset of a discrete group $G$, or is a positive-measure subset of a compact group $G$. As before, let $|G|$ and $d_G$ denote $\#G$ and $\mathrm{ddim}(G)$ respectively for finite $G$. Also, denote the Hausdorff measure and dimension by $\mathrm{Vol}(G)$ and $\dim(G)$ for compact metric groups. Note that the minimum separation $\delta_G$ is zero for $\dim(G) > 0$. Then our covering bound can be expressed in the following umbrella statement:
$$\text{Assumption:}\qquad N(\mathrm{Stab}_\epsilon, \delta) \ge \max\left\{1,\ \frac{|\mathrm{Stab}_\epsilon|}{(\max\{\delta, \delta_G\})^{d_G}}\right\}. \qquad (4.7)$$
In order to compare the above to the precise equivariance case, we later use the factor
$$\mathrm{Dens}(\epsilon) := \frac{|\mathrm{Stab}_\epsilon|}{|G|} \in (0, 1],$$
which measures the richness of $\mathrm{Stab}_\epsilon$ compared to the whole ambient group $G$, in terms of the error $\epsilon$. The fact that $\mathrm{Dens}(\epsilon) > 0$ follows from our assumption that $|\mathrm{Stab}_\epsilon| > 0$. Furthermore, $\mathrm{Dens}(\epsilon)$ is always an increasing function of $\epsilon$, as follows from the definition of $\mathrm{Stab}_\epsilon$. With these considerations on coverings, we can state a general result quantifying the generalization benefit, where if we set $\epsilon = 0$ we recover the case of exact group equivariance.

Theorem 4.4. Let $\epsilon > 0$ be fixed. Assume that for a given ambient group $G$ the almost stabilizers $\mathrm{Stab}_\epsilon$ satisfy assumption (4.7) and that the assumptions of Corollary 4.3 hold for some finite $L = L_\epsilon > 0$. Then, with the notations of Proposition 4.1 and with $d_0 := d - d_G$ as in Theorem A.4, we have with probability at least $1 - \delta$
$$\mathrm{GenErr}(\mathcal F, \{Z_i\}, \mathcal D) \lesssim n^{-1/2}\sqrt{\|\mathcal F\|_\infty \log(2/\delta)} + 2\epsilon + (E_\epsilon),$$
$$(E_\epsilon) := \begin{cases} \dfrac{d_0}{d_0 - 2}\, \delta_G^{-d_0/2 + 1} \left(\dfrac{(2L_\epsilon)^d D^d}{|\mathrm{Stab}_\epsilon|\, n}\right)^{1/2} & \text{if } G \text{ is finite and } (2L_\epsilon)^d D^d < |\mathrm{Stab}_\epsilon|\, n\, \delta_G^{d_0}, \\[2mm] \dfrac{d_0}{d_0 - 2} \left(\dfrac{(2L_\epsilon)^d D^d}{|\mathrm{Stab}_\epsilon|\, n}\right)^{1/d_0} & \text{else.} \end{cases}$$
The interpretation of the above terms is direct: the term $2\epsilon$ is the effect of approximate equivariance, while the term $(E_\epsilon)$ includes the dependence on $|\mathrm{Stab}_\epsilon|$ and is thus relevant to the study of partial equivariance.
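The quantities $\mathrm{ee}(f, g)$, $\mathrm{Stab}_\epsilon$ and $\mathrm{Dens}(\epsilon)$ entering Theorem 4.4 are easy to estimate empirically for a finite group. The following minimal sketch is our own illustration (hypothetical names throughout; the supremum over $\mathcal Z$ in $\mathrm{ee}$ is replaced by a maximum over a finite sample, which only lower-bounds it):

```python
import numpy as np

rng = np.random.default_rng(1)

# G = Z/8Z acting on x in R^8 by cyclic shifts; trivial action on y.
G = list(range(8))
def act(g, x):
    return np.roll(x, g, axis=-1)

# A partially symmetric model: its dominant term depends only on x[0] + x[4]
# (invariant under a shift by 4), plus a small symmetry-breaking term.
def f_tilde(x):
    return np.cos(x[..., 0] + x[..., 4]) + 0.05 * x[..., 0]

def loss(y_pred, y):
    return (y_pred - y) ** 2

# Monte-Carlo estimate of ee(f, g) = sup_z |f(g.z) - f(z)| for f = l(f~(x), y).
def ee(g, X, Y):
    return np.max(np.abs(loss(f_tilde(act(g, X)), Y) - loss(f_tilde(X), Y)))

X = rng.normal(size=(20000, 8))
Y = np.cos(X[..., 0] + X[..., 4])       # labels from a shift-by-4-invariant target

errors = {g: ee(g, X, Y) for g in G}
for eps in [0.0, 0.5, 1.5, 10.0]:
    stab = [g for g in G if errors[g] <= eps]
    print(f"eps={eps}: Stab_eps={stab}, Dens(eps)={len(stab)/len(G):.2f}")
```

As $\epsilon$ grows the estimated $\epsilon$-stabilizer (and hence $\mathrm{Dens}(\epsilon)$) grows monotonically, which is the behaviour assumed in the discussion above.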
In general, the Lipschitz bound $L_\epsilon$ is increasing in $\epsilon$ as well, since $\mathrm{Stab}_\epsilon$ includes more elements of $G$ as $\epsilon$ increases, and we can expect that actions of elements generating higher equivariance error in general have higher oscillations.

5 Approximation error bounds under approximate data equivariance

In the introduction, we stated that we aim to validate the intuition that that model is best whose symmetry aligns with that of the data. So far we have only given bounds on the generalization error. Note that our bounds hold for any data distribution: in particular, the bounds are not affected by whether the data distribution is $G$-equivariant or not. However, the fact that the data may have fewer symmetries than the ones introduced in the model will have a controlled effect on worsening the approximation error, as described in this section. Controlling the approximation error has an additional actor in the fray to complicate matters: the distinguishing power of the loss function. Indeed, a degenerate loss function with a large minimum set will not distinguish whether the model is fit to the data distribution or not. Thus, lowering the discrimination power of the loss has the same effect as increasing the function space for which we are testing the approximation error.

Assuming at first that $\widetilde{\mathcal F}$ is composed of $G$-equivariant functions, we compare its approximation error to that of the set $\mathcal M$ of all measurable functions $m : \mathcal X \to \mathcal Y$. Recall that $f(Z) = \ell(\widetilde f(X), Y)$ is a positive function for all $\widetilde f \in \widetilde{\mathcal F}$; thus, denoting $\mathcal M^* := \{F(x, y) = \ell(m(x), y) : m \in \mathcal M\}$, the approximation error gap can be defined and bounded as follows:
$$\begin{aligned} \mathrm{AppGap}(\mathcal F, \mathcal D) &:= \mathrm{AppErr}(\mathcal F, \mathcal D) - \mathrm{AppErr}(\mathcal M^*, \mathcal D) := \min_{f \in \mathcal F} \mathbb E[f(Z)] - \min_{F \in \mathcal M^*} \mathbb E[F(Z)] \\ &\le \min_{F \in \mathcal M^*} \mathbb E\big[\mathbb E_g[F(g \cdot Z)]\big] - \min_{F \in \mathcal M^*} \mathbb E[F(Z)] \le \mathbb E\big[\mathbb E_g[F^*(g \cdot Z)] - F^*(Z)\big], \end{aligned}$$
where $F^* \in \mathcal M^*$ is a function realizing the minimum of $\mathbb E[F(Z)]$ over $\mathcal M^*$.

With applications to neural networks in mind, we will work with the following simplifying assumptions:

1. The loss function quantitatively detects whether the label is correct, i.e. we have $\ell(y, y') = 0$ if $y = y'$, and $\ell(y, y') \ge \varphi(d_{\mathcal Y}(y, y'))$ for a strictly convex $\varphi$ with $\varphi(0) = 0$, where $d_{\mathcal Y}$ is the distance on $\mathcal Y$. This assumption seems reasonable for use in applications, where objective functions with strict convexity properties are commonly used, as they help the convergence of optimization algorithms.

2. The minimization across all measurable functions $\widetilde{\mathcal F} = \mathcal M$ produces a zero-risk solution, i.e. $\min_{F \in \mathcal M^*} \mathbb E[F(Z)] = 0$ is achieved by a measurable function $y^* : \mathcal X \to \mathcal Y$. By Assumption 1, this implies that $\mathcal D$ is concentrated on the graph of $y^*$. This assumption corresponds to saying that the learning task is in principle deterministically solvable. Under this assumption, the task becomes to approximate to high accuracy the (unknown) solution $y^*$.

With the above hypotheses we have
$$\mathrm{AppGap}(\mathcal F, \mathcal D) = \mathrm{AppErr}(\mathcal F, \mathcal D) \le \min_{F \in \mathcal M^*} \mathbb E_Z\big[\mathbb E_g[F(g \cdot Z)]\big].$$
As a slight simplification of Assumption 1 for ease of presentation, we will simply take $\mathcal Y = \mathbb R^d$ and $\ell(y, y') = d_{\mathcal Y}(y, y')^2$ below. Note that Assumption 1 directly reduces to this case, as a strictly convex $\varphi(t)$ with $\varphi(0) = 0$ as in Assumption 1 is automatically bounded below by a multiple of $t^2$.
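Before introducing the model classes used next, a toy computation makes the above effect tangible. The sketch below is our own illustration under Assumption 2 and squared loss (all names hypothetical): it fits the best strictly rotation-invariant predictor to a target that is only approximately rotation-invariant, and shows that the resulting approximation error grows with the data's deviation from exact symmetry, which is the effect the bounds of this section control.

```python
import numpy as np

rng = np.random.default_rng(3)

# Target that is only *approximately* SO(2)-invariant:
# y*(x) = |x| + beta * x_1 / |x|  (the beta-term breaks rotation invariance).
def y_star(X, beta):
    r = np.linalg.norm(X, axis=-1)
    return r + beta * X[..., 0] / r

X = rng.normal(size=(200000, 2))

# The best strictly invariant predictor under squared loss is E[y* | |x|],
# which here equals |x| (the symmetry-breaking term averages to zero on orbits).
def f_invariant(X):
    return np.linalg.norm(X, axis=-1)

for beta in [0.0, 0.1, 0.3, 1.0]:
    Y = y_star(X, beta)
    app_err = np.mean((f_invariant(X) - Y) ** 2)  # AppErr of the invariant class
    print(f"data equivariance error ~ {beta}: AppErr(invariant model) ~ {app_err:.4f}")
# AppErr grows with the data's equivariance error: enforcing strict invariance
# on only approximately symmetric data incurs an approximation penalty.
```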
Now, with the aim of capturing the mixed effects of approximate and partial equivariance, we introduce the following set of model classes. For $\epsilon \ge 0$ and $\lambda \in (0, 1]$:
$$\mathcal C_{\epsilon, \lambda} := \left\{ \mathcal F \subseteq \mathcal M^* : \ \mathrm{Dens}(\epsilon) = \frac{|\mathrm{Stab}_\epsilon(\mathcal F)|}{|G|} \ge \lambda \right\}.$$
In order to establish some intuition, note that $\mathrm{Stab}_\epsilon(\mathcal F)$ is increasing in $\epsilon$ and, as a consequence, $\mathcal C_{\epsilon, \lambda}$ is also increasing in $\epsilon$. Directly from the definition one finds that $\mathrm{Stab}_0(\mathcal F)$ is necessarily a subgroup of $G$, and thus we allow $\epsilon > 0$, which allows more general sets of symmetries than just subgroups of $G$. The most interesting parameter in the above definition of $\mathcal C_{\epsilon, \lambda}$ is $\lambda$, which bounds from below the "amount of prescribed symmetries" for our models. In the fully equivariant case $\epsilon = 0$, $\lambda = 1$, we have the simple description $\mathcal C_{0,1} = \{\mathcal F : \mathcal F \subseteq (\mathcal M^{(G)})^*\}$, whose maximal element is $(\mathcal M^{(G)})^*$ (here $\mathcal M^{(G)}$ denotes the $G$-equivariant measurable functions and $(\mathcal M^{(G)})^*$ the associated loss class). However, in general $\mathcal C_{\epsilon, \lambda}$ will not have a unique maximal element for $\lambda < 1$: this is due to the fact that two different elements $\mathcal F_1 \ne \mathcal F_2 \in \mathcal C_{\epsilon, \lambda}$ may have incomparable stabilizers, so that $|\mathrm{Stab}_\epsilon(\mathcal F_1) \setminus \mathrm{Stab}_\epsilon(\mathcal F_2)| > 0$ and $|\mathrm{Stab}_\epsilon(\mathcal F_2) \setminus \mathrm{Stab}_\epsilon(\mathcal F_1)| > 0$, and thus $\mathcal F_1 \cup \mathcal F_2 \notin \mathcal C_{\epsilon, \lambda}$. Our main result towards our approximation error bound is the following.

Proposition 5.1. Assume that $(\mathcal Y, d_{\mathcal Y})$ is a compact metric space and consider $\ell(y, y') = d_{\mathcal Y}(y, y')^2$. Fix $\epsilon \ge 0$. Then for each $\lambda \in (0, 1]$ there exists an element $\mathcal F \in \mathcal C_{\epsilon, \lambda}$ and a measurable selection $\iota^0_\epsilon : \mathcal X \to \mathcal X^0_\epsilon$ of representatives of orbits of $\mathcal X$ under the action of $\mathrm{Stab}_\epsilon(\mathcal F)$, such that if $X^0_\epsilon, g$ are random variables obtained from $X \sim \mathcal D$ by imposing $X^0_\epsilon := \iota^0_\epsilon \cdot X$ and $X = g \cdot X^0_\epsilon$, then we have
$$\mathrm{AppErr}(\mathcal F, \mathcal D) \le \mathbb E_{X^0_\epsilon}\left[ \min_{y \in \mathcal Y} \mathbb E_{g | X^0_\epsilon}\Big[ \max\big\{ d_{\mathcal Y}\big(g \cdot y,\ y^*(g \cdot X^0_\epsilon)\big) - \epsilon,\ 0 \big\}^2 \Big] \right].$$

In the above, as $\epsilon$ increases we see a decrease in the approximation gap, since in this case the model classes $\mathcal F$ allow more freedom for approximating $y^*$ more accurately. On the other hand, the effect of the parameter $\lambda$ is somewhat more subtle, and it has to do with the fact that as $\lambda$ decreases, $\mathrm{Stab}_\epsilon(\mathcal F)$ for $\mathcal F \in \mathcal C_{\epsilon, \lambda}$ can become smaller and the allowed oscillations of $y^*(g \cdot X^0_\epsilon)$ are more and more controlled.

Now, to bound the approximation error above using the quantities that we have been working with, we will use the following version of an isodiametric inequality on metric groups $G$. As before, we assume that $G$ either has Hausdorff dimension $\dim(G) > 0$ which coincides with the doubling dimension, or is discrete. By $|X|$ we denote the Hausdorff measure of $X \subseteq G$ if $\dim(G) > 0$, or the cardinality of $X$ if $G$ is finite. Then there exists an isodiametric constant $C_G$ depending only on $G$ such that for all $X \subseteq G$ it holds that
$$\frac{|X|}{|G|} \ge \lambda \quad \Longrightarrow \quad \mathrm{diam}(X) \ge C_G\, \lambda^{1/\mathrm{ddim}(G)}. \qquad (5.1)$$
The above bound can be obtained directly from the doubling property of $G$, applied to a cover of $X$ by a single ball of radius $\mathrm{diam}(X)$, which can be refined to obtain better and better approximations of $|X|$. We can now control the bound from Proposition 5.1 as follows:

Theorem 5.2. Assume that $(\mathcal Y, d_{\mathcal Y})$ is a compact metric space and $\ell(y, y') = d_{\mathcal Y}(y, y')^2$. Let $G$ be a compact Lie group or a finite group, with distance $d_G$, and let $\mathrm{ddim}(G)$ be the doubling dimension of $G$. Assume that for all $g, g' \in G$ such that the action over $\mathcal X$ is defined, it holds that $d(g \cdot x, g' \cdot x) \le L\, d_G(g, g')$. Then there exists a constant $C_G$ depending only on $G$ such that for all $\lambda \in (0, 1]$ and $\epsilon > 0$ there exists an element $\mathcal F \in \mathcal C_{\epsilon, \lambda}$ such that, for any data distribution $\mathcal D$ for which $y^*$ can be chosen Lipschitz with constant $\mathrm{Lip}(y^*)$, we have
$$\mathrm{AppErr}(\mathcal F, \mathcal D) \le \max\Big\{ C_G\, L\, (1 + \mathrm{Lip}(y^*))\, \lambda^{1/\mathrm{ddim}(G)} - \epsilon,\ 0 \Big\}^2.$$

We note that, given our proof strategy for Proposition 5.1, we conjecture that the above bound is in fact sharp up to a constant factor: $\min_{\mathcal F \in \mathcal C_{\epsilon, \lambda}} \mathrm{AppErr}(\mathcal F, \mathcal D)$ may have a lower bound comparable to the above upper bound.
The goal of the above result is to formalize the property that the approximation error guarantees of $\mathcal F$ will tend to grow as the amount of imposed symmetries grows, as measured by $\lambda$.

6 Discussion: Imposing optimal equivariance

We have so far studied bounds on the generalization and the approximation errors, each of which takes a very natural form in terms of properties of the space of orbit representatives (and thus of the underlying $G$). One of them depends on the data distribution, while the other doesn't. Using them, we can thus state a result that quantifies the performance error, or gives the generalization-approximation trade-off, of a model based on the data distribution (and their respective symmetries). Define the performance error of a model class $\mathcal F \in \mathcal C_{\epsilon, \lambda}$ over an i.i.d. sample $\{Z_i\}_{i=1}^n$, $Z_i \sim \mathcal D$, as
$$\mathrm{PerfErr}(\mathcal F, \{Z_i\}, \mathcal D) := \mathrm{GenErr}(\mathcal F, \{Z_i\}, \mathcal D) + \mathrm{AppErr}(\mathcal F, \mathcal D).$$
Combining Theorems 4.4 and 5.2, we get the following:

Theorem 6.1. Under the assumptions of Theorems 4.4 and 5.2, and assuming that all the function classes are restricted to functions with values in $[-M, M]$, there exist values $C_1, C_2, C_3 > 0$ depending only on $d_0$, $(2LD)^d/|G|$, $\delta_G$ and $C := C_G L (1 + \mathrm{Lip}(y^*))$ such that, with probability at least $1 - \delta$,
$$\mathrm{PerfErr}(\mathcal F, \{Z_i\}, \mathcal D) \lesssim n^{-1/2}\sqrt{M \log(2/\delta)} + 2\epsilon + \big(C \lambda^{1/d_G} - \epsilon\big)_+^2 + \begin{cases} C_1\, (n\lambda)^{-1/2} & \text{if } n\lambda \ge C_3, \\ C_2\, (n\lambda)^{-1/d_0} & \text{if } n\lambda < C_3, \end{cases}$$
where $(t)_+ := \max\{t, 0\}$.

Note that the above bounds are not informative as $\lambda \to 0$ for the continuous-group case, which corresponds to the second option in Theorem 6.1. This can be traced back to the form of Assumption (4.7), in which the covering number on the left-hand side is always at least 1, whereas in the continuous-group case the term $|\mathrm{Stab}_\epsilon|/\delta^{d_G}$ on the right-hand side may become arbitrarily small. Thus the result is only relevant for $\lambda$ away from 0. If we fix $\epsilon = 0$ and optimize over $\lambda \in (0, 1]$ in the above bound, we find a unique minimum $\lambda^*$ for the error bound, specifying the optimum size of $\mathrm{Stab}_\epsilon$ for the corresponding case (see the Appendix for more details). This validates the intuition stated at the onset that for the best model its symmetries must align with those of the data. As an illustration, for $C = C_1 = C_2 = 0.04$ and $C_3 = 0.01$ with $n = 10^6$, the error trade-off is shown in the figure below.

[Figure: Error versus $\lambda$ for large $n$, for fixed values of $C, C_1, C_2, C_3$.]

7 Concluding Remarks

In this paper, we have provided general quantitative bounds showing that machine learning models with task-pertinent symmetries improve generalization. Our results do not require the symmetries to form a group and can work with partial/approximate equivariance. We also presented a general result which confirms the prevalent intuition that if the symmetry of a model is mis-specified w.r.t. the data symmetry, its performance will suffer. We now indicate some important future research directions, which correspond to limitations of our work:

(Model-specific bounds) While in Theorem 6.1 we obtain the existence of an optimal amount of symmetry $\lambda^*$ without hope of being sharp in such generality, we may expect that if we restrict to specific tasks the bounds can become much tighter; if so, the optimal amount of symmetries $\lambda^*$ can become interpretable and give insights into the structure of the learning tasks.

(Tighter bounds beyond classical PAC theory) For the sake of making the basic principles at work more transparent, we based our main result on long-known classical results. However, tighter data-dependent or probabilistic bounds would be very useful for getting a sharper value of the optimal amount of symmetries $\lambda^*$ via a stronger version of Theorem 6.1.
(Beyond controlled doubling dimension) Our focus here is on groups G of controlled doubling dimension. This includes compact and nilpotent Lie groups, and discrete groups of polynomial growth, but the doubling dimension is not controlled for notable important cases, such as for the permutation group Sn (cf. section 3 of [17]) or for the group (Z/2Z)n. In order to cover these cases, it will be interesting to build analogues of our main result based on other group complexity bounds, beyond the doubling dimension case. Acknowledgments and Disclosure of Funding The authors thank Aryeh Kontorovich for readily answering questions (and supplying references) about Kolmogorov-Tikhomirov, and Bryn Elesedy for graciously answering queries and providing a copy of his dissertation before it appeared online. MP thanks the Centro Nacional de Inteligencia Artificial in Chile for support, as well as Fondecyt Regular grant number 1210462 entitled Rigidity, stability and uniformity for large point configurations . [1] Yaser S. Abu-Mostafa. Hints and the VC Dimension. Neural Computation, 5(2):278 288, 03 1993. [2] Brandon Anderson, Truong Son Hy, and Risi Kondor. Cormorant: Covariant molecular neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. [3] Fabio Anselmi, Joel Z. Leibo, Lorenzo Rosasco, Jim Mutch, Andrea Tacchetti, and Tomaso A. Poggio. Unsupervised learning of invariant representations in hierarchical architectures. Co RR, abs/1311.4158, 2013. [4] Fabio Anselmi, Lorenzo Rosasco, and Tomaso Poggio. On invariance and selectivity in representation learning. Information and Inference: A Journal of the IMA, 5(2):134 158, 05 2016. [5] Minkyung Baek, Frank Dimaio, Ivan V. Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N. Kinch, R. Dustin Schaeffer, Claudia Millán, Hahnbeom Park, Carson Adams, Caleb R. Glassman, Andy M. De Giovanni, Jose H. Pereira, Andria V. Rodrigues, Alberdina Aike van Dijk, Ana C Ebrecht, Diederik Johannes Opperman, Theo Sagmeister, Christoph Buhlheller, Tea Pavkov-Keller, Manoj K. Rathinaswamy, Udit Dalwadi, Calvin K. Yip, John E. Burke, K. Christopher Garcia, Nick V. Grishin, Paul D. Adams, Randy J. Read, and David Baker. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373:871 876, 2021. [6] Peter Bartlett. Covering numbers, chaining & dudley s entropy integral (lecture 24). class notes on statistical learning theory, 2006. [7] Arash Behboodi, Gabriele Cesa, and Taco S Cohen. A pac-bayesian generalization bound for equivariant networks. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 5654 5668. Curran Associates, Inc., 2022. [8] Erik J. Bekkers. B-spline cnns on lie groups. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. [9] Alberto Bietti, Luca Venturi, and Joan Bruna. On the sample complexity of learning under geometric stability. Advances in neural information processing systems, 34:18673 18684, 2021. [10] Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. 
Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra, Australia, February 2-14, 2003, Tübingen, Germany, August 4-16, 2003, Revised Lectures, pages 169 207, 2004. [11] Anadi Chaman and Ivan Dokmanic. Truly shift-invariant convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3773 3783, 2021. [12] Shuxiao Chen, Edgar Dobriban, and Jane H Lee. A group-theoretic framework for data augmentation. The Journal of Machine Learning Research, 21(1):9885 9955, 2020. [13] Taco Cohen, Maurice Weiler, Berkay Kicanaoglu, and Max Welling. Gauge equivariant convolutional networks and the icosahedral cnn. In ICML, 2019. [14] Taco Cohen and Max Welling. Group equivariant convolutional networks. In ICML, 2016. [15] Taco S. Cohen. Learning transformation groups and their invariants. Master s thesis, University of Amsterdam, 2013. [16] Taco S Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariant CNNs on homogeneous spaces. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. [17] Michel Deza and Tayuan Huang. Metrics on permutations, a survey. Journal of Combinatorics, Information & System Sciences, 23(1-4):173 185, 1998. [18] Stephan Eismann, Raphael J. L. Townshend, Nathaniel Thomas, Milind Jagota, Bowen Jing, and Ron O. Dror. Hierarchical, rotation-equivariant neural networks to select structural models of protein complexes. Proteins: Structure, 89:493 501, 2020. [19] Bryn Elesedy. Provably strict generalisation benefit for invariance in kernel methods. Advances in Neural Information Processing Systems, 34:17273 17283, 2021. [20] Bryn Elesedy. Group symmetry in PAC learning. In ICLR 2022 Workshop on Geometrical and Topological Representation Learning, 2022. [21] Bryn Elesedy. Symmetry and Generalisation in Machine Learning. Ph D thesis, University of Oxford, 2023. [22] Bryn Elesedy and Sheheryar Zaidi. Provably strict generalisation benefit for equivariant models. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 2959 2969. PMLR, 2021. [23] Marc Finzi, Gregory Benton, and Andrew G Wilson. Residual pathway priors for soft equivariance constraints. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 30037 30049. Curran Associates, Inc., 2021. [24] Marc Finzi, Max Welling, and Andrew Gordon Wilson. A practical method for constructing equivariant multilayer perceptrons for arbitrary matrix groups. Ar Xiv, abs/2104.09459, 2021. [25] Jonathan Gordon, David Lopez-Paz, Marco Baroni, and Diane Bouchacourt. Permutation equivariant models for compositional generalization in language. In ICLR, 2020. [26] B. Haasdonk, A. Vossen, and H. Burkhardt. Invariance in kernel methods by haarintegration kernels. In Heikki Kalviainen, Jussi Parkkinen, and Arto Kaarna, editors, Image Analysis, pages 841 851, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg. [27] Osman Semih Kayhan and Jan C van Gemert. On translation invariance in cnns: Convolutional layers can exploit absolute spatial location. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14274 14285, 2020. 
[28] Johannes Klicpera, Janek Groß, and Stephan Günnemann. Directional message passing for molecular graphs. Ar Xiv, abs/2003.03123, 2020. [29] Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In ICML, 2018. [30] Aryeh Kontorovich and Roi Weiss. Maximum margin multiclass nearest neighbors. In International conference on machine learning, pages 892 900. PMLR, 2014. [31] Robert Krauthgamer and James R Lee. Navigating nets: Simple algorithms for proximity search. In Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, pages 798 807. Citeseer, 2004. [32] Yann André Le Cun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541 551, 1989. [33] Clare Lyle, Mark van der Wilk, Marta Z. Kwiatkowska, Yarin Gal, and Benjamin Bloem-Reddy. On the benefits of invariance in neural networks. Ar Xiv, abs/2005.00178, 2020. [34] Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. Invariant and equivariant graph networks. Ar Xiv, abs/1812.09902, 2019. [35] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Learning with invariances in random features and kernel models. In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 3351 3418. PMLR, 15 19 Aug 2021. [36] Marvin Minsky and Seymour Papert. Perceptrons, Expanded Edition: An Introduction to Computational Geometry. The MIT Press, Cambridge, MA, 1987. [37] Behnam Neyshabur, Srinadh Bhojanapalli, David A. Mc Allester, and Nathan Srebro. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. Ar Xiv, abs/1707.09564, 2017. [38] Siamak Ravanbakhsh, Jeff G. Schneider, and Barnabás Póczos. Equivariance through parameter-sharing. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 2892 2901. PMLR, 2017. [39] Marco Reisert and Hans Burkhardt. Learning equivariant functions with matrix valued kernels. Journal of Machine Learning Research, 8(15):385 408, 2007. [40] David W Romero and Suhas Lohit. Learning partial equivariances from data. Advances in Neural Information Processing Systems, 35:36466 36478, 2022. [41] Akiyoshi Sannai, Masaaki Imaizumi, and Makoto Kawano. Improved generalization bounds of group invariant / equivariant deep networks via quotient feature spaces. In Cassio P. de Campos, Marloes H. Maathuis, and Erik Quaeghebeur, editors, Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, UAI 2021, Virtual Event, 27-30 July 2021, volume 161 of Proceedings of Machine Learning Research, pages 771 780. AUAI Press, 2021. [42] Victor Garcia Satorras, E. Hoogeboom, F. Fuchs, I. Posner, and M. Welling. E(n) equivariant normalizing flows for molecule generation in 3d. Ar Xiv, abs/2105.09016, 2021. [43] Bernhard Schölkopf, Chris Burges, and Vladimir Vapnik. Incorporating invariances in support vector learning machines. In Christoph von der Malsburg, Werner von Seelen, Jan C. Vorbrüggen, and Bernhard Sendhoff, editors, Artificial Neural Networks ICANN 96, pages 47 52, Berlin, Heidelberg, 1996. Springer Berlin Heidelberg. [44] H. Schulz-Mirbach. 
On the existence of complete invariant feature spaces in pattern recognition. In Proceedings., 11th IAPR International Conference on Pattern Recognition. Vol.II. Conference B: Pattern Recognition Methodology and Systems, pages 178 182, 1992. [45] Han Shao, Omar Montasser, and Avrim Blum. A theory of PAC learnability under transformation invariances. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. [46] John. Shawe-Taylor. Building symmetries into feedforward networks. In 1989 First IEE International Conference on Artificial Neural Networks, (Conf. Publ. No. 313), pages 158 162, 1989. [47] John Shawe-Taylor. Threshold network learning in the presence of equivalences. In NIPS, 1991. [48] John Shawe-Taylor. Symmetries and discriminability in feedforward network architectures. IEEE Transactions on Neural Networks, 4(5):816 826, 1993. [49] John Shawe-Taylor. Introducing invariance: a principled approach to weight sharing. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN 94), volume 1, pages 345 349 vol.1, 1994. [50] John Shawe-Taylor. Sample sizes for threshold networks with equivalences. Information and Computation, 118(1):65 72, 1995. [51] Jure Sokolic, Raja Giryes, Guillermo Sapiro, and Miguel Rodrigues. Generalization error of invariant classifiers. In Artificial Intelligence and Statistics, pages 1094 1103. PMLR, 2017. [52] I. Sosnovik, A. Moskalev, and A. Smeulders. Scale equivariance improves siamese tracking. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 2764 2773, 2021. [53] Behrooz Tahmasebi and Stefanie Jegelka. The exact sample complexity gain from invariances for kernel regression on manifolds, 2023. [54] Vladimir M. Tikhomirov. ε-entropy and ε-capacity of sets in functional spaces. Selected Works of AN Kolmogorov: Volume III: Information Theory and the Theory of Algorithms, pages 86 170, 1993. [55] Raphael J. L. Townshend, Stephan Eismann, Andrew M. Watkins, Ramya Rangan, Maria Karelina, Rhiju Das, and Ron O. Dror. Geometric deep learning of rna structure. Science, 373:1047 1051, 2021. [56] Tycho F. A. van der Ouderaa, David W. Romero, and Mark van der Wilk. Relaxing equivariance constraints with non-stationary continuous filters. In Advances in Neural Information Processing Systems (Neur IPS), volume 35, Dec 2022. [57] Andrea Vedaldi, Matthew Blaschko, and Andrew Zisserman. Learning equivariant structured output svm regressors. In 2011 International Conference on Computer Vision, pages 959 966, 2011. [58] Ulrike von Luxburg and Olivier Bousquet. Distance-based classification with lipschitz functions. J. Mach. Learn. Res., 5(Jun):669 695, 2004. [59] Rui Wang, Robin Walters, and Rose Yu. Approximately equivariant networks for imperfectly symmetric dynamics. In International Conference on Machine Learning, pages 23078 23091. PMLR, 2022. [60] Maurice Weiler, Patrick Forré, Erik P. Verlinde, and Max Welling. Coordinate independent convolutional networks - isometry and gauge equivariant convolutions on riemannian manifolds. Ar Xiv, abs/2106.06020, 2021. [61] Jennifer C. White and Ryan Cotterell. Equivariant transduction through invariant alignment. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4651 4663, Gyeongju, Republic of Korea, Oct. 2022. International Committee on Computational Linguistics. [62] Marysia Winkels and Taco Cohen. Pulmonary nodule detection in ct scans with equivariant cnns. 
Medical image analysis, 55:15 26, 2019. [63] Jeffrey Wood and John Shawe-Taylor. Theory of symmetry network structure. Technical report, 1993. [64] Jeffrey Wood and John Shawe-Taylor. Representation theory and invariant neural networks. Discrete Applied Mathematics, 69(1):33 60, 1996. [65] Jeffrey Wood and John Shawe-Taylor. A unifying framework for invariant pattern recognition. Pattern Recognition Letters, 17(14):1415 1422, 1996. [66] Yinshuang Xu, Jiahui Lei, Edgar Dobriban, and Kostas Daniilidis. Unified fourier-based kernel and nonlinearity design for equivariant networks on homogeneous spaces. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 24596 24614. PMLR, 2022. [67] Sicheng Zhu, Bang An, and Furong Huang. Understanding the generalization benefit of model invariance from a data perspective. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 4328 4341. Curran Associates, Inc., 2021. [68] Xupeng Zhu, Dian Wang, Ondrej Biza, Guanang Su, Robin Walters, and Robert Platt. Sample efficient grasp learning using equivariant models. Co RR, abs/2202.09468, 2022. A.1 Proofs of results from Section 4 A.1.1 Proof of Proposition 4.1 Proof. Reformulating (4.1) we find that with probability at least 1 δ it holds that: sup f F |(P Pn)f| R(FZ) + 2 F log(2/δ) n . (A.1) Next, we compound the bounds for R(FZ). The optimum α > 0 in the first line of (4.2) is the one for which n 9 = ln N(F, α, ). Keeping in mind (4.6), we choose α = diam(Z)n 1 ddim Z in (4.2), which gives (abbreviating d = ddim Z, D = diam Z) R(FZ) D/n1/d + n 1/2 ˆ Dn 1/d(D/t)d/2dt = D/n1/d + (D/n)1/2 ˆ n 1/d τ d/2dτ Now the last equation and equation (A.1) yield the claim. A.1.2 Proof of Proposition 4.2 We start with a preliminary result under hypotheses of strict equivariance. In this case, we are able to use a change of variables to reduce the generalization error formula to an equivalent one depending only on a measurable choice of G-orbit representatives of elements from Z: Proposition A.1. Let F be a set of G-invariant functions, and let Z0 Z be a choice of G-orbit representatives for points in Z, such that ι0 : Z Z0 associating to each z Z its orbit representative z0, is Borel measurable. Let F0 := {f|Z0 : f F} and denote by ι0(D) the image measure of D. Then for each n N if {Zi}n i=1 are i.i.d. samples with Zi D and Z0 i := ι0 Zi, we have Gen Err(F, {Zi}, D) = Gen Err(G)(F, {Zi}, D) = Gen Err(F0, {Z0 i }, ι0(D)). Proof. For the first equality, we use the definition of Gen Err and the change of variable formula (3.1) and the fact that G-invariant functions f satisfy f(Z) = Eg[f(g Z)]. For the second equality, note that by hypothesis, for each f F we have f(z) = f(g z) for all g G, z Z, in particular f(z) = f(ι0(z)) and we conclude by a change of variable by the map ι0 in the expectations from the definition of Gen Err(G). Now the proof Proposition 4.2 combines the above idea with a simple extra step: Proof of Proposition 4.2: The proof uses the triangular inequality. For f F and g Stabϵ, we have: E[g f(Z)] 1 i=1 g f(Zi) + 2 g f f . By averaging over g Stabϵ, we obtain the inequality in the statement of the proposition. 
The equality follows by a change of variable via map ι0 ϵ, exactly as in the strategy of Proposition A.1. A.1.3 Proof of Corollary 4.3 and of the more general result of Theorem A.3 In order to make the treatment better digestible, we first consider the intuitively simpler case of strict equivariance, and then describe how to extend it to the more general case of approximate and partial equivariance. In this case, if we restrict our equivariant functions to only the space of orbit representatives Z0, the dimension counts from classical generalization bounds of Proposition 4.1 improve as follows: Corollary A.2. Assume that F is composed of G-invariant functions and that d0 := ddim(Z0) > 2. Also, denote D0 := diam(Z0). With the same notation as in Proposition A.1 and with the hypotheses of Proposition 4.1, for any probability distribution D over Z, the following holds with probability at least 1 δ: Gen Err(F, {Zi}, D) d0 d0 2 !1/d0 + n 1/2p F log(2/δ). Proof. Due to Proposition A.1, we only need to bound Gen Err(F0, {Z0 i }, ι0(D)). Thus it suffices to apply Proposition 4.1 for the above function. We note that F0 F to conclude. The drawback of the above Corollary, is that it leaves open the question of how to actually bound the diameter and dimension of Z0, on which we do not have direct control. The next steps we take consist precisely in translating the information from properties of G to relevant properties of Z0. A first, simpler, approach could be the following. Under the reasonable assumption that Z, Z0 have diameter greater than 1, the leading term on the left in Corollary A.2 is n 1/d0.Thus the optimal choices of Z0 are those which minimize the doubling dimension d0 = ddim(Z0) amongst sets of representatives of G-orbits. This is a weak regularity assumption, implying that we want Z0 to not oscillate wildly. The effect of G on coverings is evident in case G, Z0 are manifolds, and Z = Z0 G (see (A.3) for the strictly equivariant case, and the more general (A.5) for the general case). Since ddim coincides with topological dimension, we immediately have d0 = d dim(G). Intuitively, the dimensionality of G can be understood as eliminating degrees of freedom from Z, and it is this effect that improves generalization by n 1/(d dim(G)) n 1/d. In order to include more general situations, we now describe a second, more in-depth approach. We take a step back and rather than addressing direct diameter and dimension bounds for Z0, we go "back to the source" of Proposition 4.1. We update the bounds on covering numbers of Z0, directly in terms of the G-action and of Z. The ensuing framework is robust enough to later include, after a few adjustments, also the cases of partial and approximate equivariance. Here is our fundamental bound, which generalizes and extends [51, Thm.3]. Theorem A.3. Assume that Z is a metric space with distance d and S G is a subset of a metric group G consisting of transformations g : Z Z (with action denoted g z := g(z)) for which there exists a choice of S-orbit representatives Z0 Z and a distance function d on S satisfying the following for L, L (0, + ] (with the conventions 1/ + := 0 and N(X, 0) := + , N(X, + ) := 0): 1. For all z0, z 0 Z0 and all g S it holds that 1 Ld(z0, z 0) d(g z0, g z 0) L d(z0, z 0). 2. For all g, g S and z0, z 0 Z0 it holds that 1 Ld G(g, g ) d(g z0, g z 0) L d G(g, g ). Then for each δ > 0 the following holds N(Z0, 2δ L)N(S, 2δ L) N(Z, δ ) N(Z0, δ /2L )N(S, δ /2L ). 
(A.2) Before the proof, we observe how Corollary 4.3 can be recovered using the choice L > 0, L = + in Theorem A.3: Proof of Corollary 4.3: We apply Theorem A.3 to S = Stabϵ and Z0 = Z0 ϵ as in the corollary statement. Then we take δ = 2Lδ in the conclusion (A.2), and consider only the lower bound inequality, which directly gives the claim of the corollary. Proof of Theorem A.3: In this proof, we will denote a minimal α-ball cover of a metric space X by Xα. Note that we are not assuming G to be a group, but due to lower bound in property 1. it follows that z0 7 g z0 is injective (for z0 = z 0 we have d(z0, z 0) > 0 and thus g z0 = g z 0), and when below we write "g 1" this has to be interpreted as the inverse of the g-action, restricted to its image. Further, note that the case when one of L, L is + , corresponds to removing the part of the assumptions (and of the conclusions) involving that value, thus we only consider the case of finite L and finite L. On fixing an arbitrary point z Z, we can write z = g z0 for a suitable g G, z0 Z0. Let η := δ /2L . For fixed covers Z0 η, Gη, there exists a point z 0 Z0 η with d(z 0, z0) < η and g Gη with d G(g , g) < η. Thus by property 1. we have d(g z 0, z) < L η = ϵ/2 and by property 2. we have d(g z 0, g z 0) < L η = δ /2. By the triangle inequality, d(g z 0, z) < δ and thus Gη Z0 η is an δ -cover of Z. Thus we have N(Z, δ ) #Gδ /2L #Z0 δ /2L , optimizing over the cardinalities on the right hand side yields the second inequality in (A.2). Now consider an δ -cover Zδ of Z, and for η = 2δL consider an η-cover Gη of G. We find that for each z Zδ , there exists at most one g Gη such that dist(z, g Z0) < η/2L = δ . Notice that if this were false, we could use the triangle inequality and contradict property 2. in the statement. For each g Gη denote Zg the set of such points z Zδ such that there exists x g Z0, and assign exactly one such x = x(z) to each z, forming a set Xg of all such x(z). Any other point x g Z0 such that d(x , z) < δ then satisfies d(x , x) < 2δ by triangle inequality, and thus Xg is a 2δ -cover of g Z0. If for g z0 g X0 the point x g X0 satisfies d(g z0, x) < 2δ , then by property 1. in the statement we have d(z0, g 1 x) < 2δ L, and thus g 1 Xg is a 2δ L-cover of Z0, having the same cardinality as Zg. We then compute as follows, proving the first inequality in (A.2): g Gη #Zg = X g Gη #(g 1 Xg) N(G, 2δ L)N(Z0, 2δ L). A.1.4 Proof of Theorem 4.4 As before, we focus again first on the exact equivariance case, where Theorem A.4 is the direct analogue to (or special case of) Theorem 4.4. Under the hypotheses of Theorem A.3 on the G-action, we directly obtain the following, for the strictly equivariant case: N(Z0, δ) N(Z, δ/2L) N(G, δ) . (A.3) We next impose that for δ diam(G) the group G satisfies the natural "volume growth" assumption, where for compact groups Vol(G) is its dim(G)-dimensional Hausdorff measure and dim(G) is the usual Hausdorff dimension, and for finite groups we use minimum separation notation δG > 0 as defined in (4.4): Assumption: N(G, δ) ( #G/(max{δ, δG})ddim(G), if G finite, Vol(G)/δdim G, if dim G > 0. (A.4) Similarly to Proposition 4.1, we then get the following, in which the leading term in the bound has exponent figuring d0 = ddim(Z) dim(G). Recall that dim(G) = 0 for finite groups, thus the distinction can be made directly in terms of the dimension of G. Theorem A.4. Let δ > 0 be fixed. 
Similarly to Proposition 4.1, we then get the following result, in which the leading term of the bound has an exponent involving $d_0 = \mathrm{ddim}(Z) - \dim(G)$. Recall that $\dim(G) = 0$ for finite groups, thus the distinction between the cases can be made directly in terms of the dimension of $G$.

Theorem A.4. Let $\delta > 0$ be fixed. Assume that the group $G$ satisfies assumption (A.4) and that its action satisfies the assumptions 1. and 2. of Theorem A.3 for some finite $L > 0$ and $L' = +\infty$. We denote $d_G = \mathrm{ddim}(G)$ if $G$ is a discrete group and $d_G = \dim(G)$ if $G$ is compact and non-discrete, and $d = \mathrm{ddim}(Z)$. Further assume $d_0 := d - d_G > 2$. Furthermore, set $|G| := \mathrm{Vol}(G)$ if $\dim G > 0$ and $|G| := \#G$ for finite $G$. Then, with the notations of Proposition 4.1, we have with probability $1-\delta$:
$$\mathrm{GenErr}(\mathcal F, \{Z_i\}, \mathcal D) \;\lesssim\; n^{-1/2}\sqrt{\|\mathcal F\|_\infty\,\log(2/\delta)} + (E),$$
$$(E) = \begin{cases} \dfrac{d_0}{d_0-2}\;\delta_G^{-d_0/2+1}\left(\dfrac{(2L)^d D^d}{|G|\, n}\right)^{1/2}, & \text{if } G \text{ is finite and } (2L)^d D^d < |G|\, n\, \delta_G^{d_0},\\[3mm] \dfrac{d_0}{d_0-2}\left(\dfrac{(2L)^d D^d}{|G|\, n}\right)^{1/d_0}, & \text{otherwise}.\end{cases}$$

Proof. We follow the same computation as in Proposition 4.1, but use Proposition A.1 in order to reduce to restrictions of functions to $Z_0$. In this case, using (A.3) and assumption (A.4), and with the notation of our statement, we have
$$N(Z_0, t) \;\le\; \frac{(2L)^d D^d}{|G|\;\max\{\delta_G, t\}^{d_0}},$$
where we set $\delta_G = 0$ whenever $\dim G > 0$. We write $C := (2L)^d D^d / |G|$ for simplicity of notation.

In case $n\,\delta_G^{d_0} < C$ (which includes the case $\dim G > 0$, since then $\delta_G = 0$), we take $\alpha = (C/n)^{1/d_0}$ in the Dudley integral (4.2) and find
$$R(\mathcal F_{Z_0}) \;\lesssim\; \alpha + n^{-1/2}\int_\alpha^{D}\sqrt{N(Z_0, t)}\;dt,$$
from which the proof goes exactly as in Proposition 4.1, with $C$ replacing $D^d$, and we get the second option for the value of $(E)$ as given in our statement. In case $n\,\delta_G^{d_0} \ge C$ instead, we take $\alpha = 0$, and plugging our above bound for $N(Z_0, t)$ into (4.2) (recalling the notation for $C$) gives
$$\int_0^{D}\sqrt{C}\;\max\{\delta_G, t\}^{-d_0/2}\,dt \;=\; \sqrt{C}\left(\delta_G^{-d_0/2+1} + \int_{\delta_G}^{D} t^{-d_0/2}\,dt\right) \;\le\; \sqrt{C}\;\frac{d_0}{d_0-2}\;\delta_G^{-d_0/2+1},$$
from which the first case of the value of $(E)$ follows by direct computation.

Now the proof of Theorem 4.4 proceeds in exactly the same manner as the above. Below we explain the required adaptations.

Proof of Theorem 4.4: The following updates are the principal adaptations required to the above proof of Theorem A.4. The role of $G$ should be taken by $\mathrm{Stab}_\epsilon$, except that the parameters $\delta_G, d_G$ remain unchanged (i.e. we use their values corresponding to the "ambient" group $G$ rather than those of $\mathrm{Stab}_\epsilon$). The set $Z_0$ of $G$-orbit representatives should then be replaced by a set $Z^0_\epsilon$ of representatives of the orbits of $\mathrm{Stab}_\epsilon$. With these changes, assumption (A.4) implies its more general version, assumption (4.7): indeed, $|G|$ equals $\#G$ for finite $G$ and $\mathrm{Vol}(G)$ for compact $G$, and $\delta_G > 0$ only in the first case; furthermore, we have $\delta_{\mathrm{Stab}_\epsilon} \ge \delta_G$ as a direct consequence of $\mathrm{Stab}_\epsilon \subseteq G$. We observe that Corollary 4.3 (which is also obtained from Theorem A.3 with the above two main substitutions) directly gives the version of Theorem A.3 required to get the correct replacement of (A.3) under these substitutions. We get:
$$N(Z^0_\epsilon, \delta) \;\le\; \frac{N(Z, \delta/2L)}{N(\mathrm{Stab}_\epsilon, \delta)}. \tag{A.5}$$
With the above changes, the proof follows by exactly the same steps as in the above proof of Theorem A.4.

Remark A.5. Note that, as might be evident from the last proof, we could have introduced new, more precise parameters to keep track of the dimensionality and minimum separation of $\mathrm{Stab}_\epsilon$, rather than formulating assumption (4.7) in terms of $d_G, \delta_G$. The present formulation is sufficient for the aims of this work: all the main situations of interest to us are those in which $\mathrm{Stab}_\epsilon$ is a "large" subset of $G$, i.e. it has dimension $d_G$, and in all our examples with finite groups and $\delta_G > 0$, the minimum separation of $\mathrm{Stab}_\epsilon$ is within a small factor of $\delta_G$ itself.
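To fix ideas about the size of the gain encoded by the exponent $d_0 = d - d_G$ in Theorem A.4, here is a purely illustrative numerical comparison (the values of $d$ and $\dim(G)$ are arbitrary choices of ours, and constants are ignored):
$$d = 12,\quad \dim(G) = 3,\quad d_0 = d - \dim(G) = 9:\qquad n^{-1/12}\le 10^{-1}\iff n\ge 10^{12},\qquad n^{-1/9}\le 10^{-1}\iff n\ge 10^{9},$$
so that, in this illustration, the symmetry-aware leading term reaches the same accuracy with a factor $10^{3}$ fewer samples.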
A.2 Proof of Proposition 5.1

Proof. Recall that $Z = (X, Y)$ follows a data distribution for which, by assumption 2. preceding the proposition, almost surely $Y = y^*(X)$ for a deterministic function $y^*$. With this notation, we may write
$$\mathbb{E}_Z[f(Z)] \;=\; \mathbb{E}_{X^0_\epsilon}\,\mathbb{E}_{g|X^0_\epsilon}\Big[f\big(g\cdot X^0_\epsilon,\; y^*(g\cdot X^0_\epsilon)\big)\Big].$$
Recalling that we restrict to functions of the form $f(x, y) = \ell(\tilde f(x), y) = d_{\mathcal Y}(\tilde f(x), y)^2$, we first consider the case of precise equivariance, $\epsilon = 0$. In this case, for $g \in \mathrm{Stab}_\epsilon(\mathcal F)$ we also find $\tilde f(g\cdot X^0_\epsilon) = g\cdot\tilde f(X^0_\epsilon)$, and thus, when optimizing over $\tilde f$, we have to determine the optimal value of $y = \tilde f(X^0_\epsilon)$ to be associated with each $X^0_\epsilon$. As a consequence of the above, if $\tilde{\mathcal F}$ were the class of all precisely $\mathrm{Stab}_\epsilon$-equivariant measurable functions, we would get the following rewriting:
$$\mathrm{AppGap}(\mathcal F, \mathcal D) \;=\; \min_{\tilde f\in\tilde{\mathcal F}}\;\mathbb{E}_Z\big[d_{\mathcal Y}(\tilde f(X), y^*(X))^2\big] \;=\; \mathbb{E}_{X^0_\epsilon}\Big[\min_{y\in\mathcal Y}\;\mathbb{E}_{g|X^0_\epsilon}\,d_{\mathcal Y}\big(g\cdot y,\; y^*(g\cdot X^0_\epsilon)\big)^2\Big]. \tag{A.6}$$
For $\epsilon > 0$, for each fixed $X^0_\epsilon = x^0_\epsilon$ we may further perturb the associated $y = \tilde f(x^0_\epsilon)$ by at most $\epsilon$ in the direction of $y^*(X^0_\epsilon)$, while still obtaining a measurable function with approximation error bounded by $\epsilon$ in $\ell^\infty$-norm; thus the above is improved to
$$\mathrm{AppErr}(\mathcal F, \mathcal D) \;\le\; \mathbb{E}_{X^0_\epsilon}\Big[\min_{y}\;\mathbb{E}_{g|X^0_\epsilon}\Big[\Big(d_{\mathcal Y}\big(g\cdot y,\; y^*(g\cdot X^0_\epsilon)\big) - \epsilon\Big)_+^2\Big]\Big], \tag{A.7}$$
as desired. In case $\tilde{\mathcal F}$ contains a strict subset of the measurable invariant functions with error $\epsilon$, we would only get an inequality instead of the first equality in (A.6), but we still obtain the same bound as in (A.7), and thus the proof is complete.

A.3 Proof of Theorem 5.2

Proof. We use the isodiametric inequality (5.1) in $G$, applying it to $\mathrm{Stab}_\epsilon(\mathcal F)$ for $\mathcal F \in \mathcal C_{\epsilon,\lambda}$. By taking $\mathrm{Stab}_\epsilon(\mathcal F)$ to be a set which is optimal for inequality (5.1), we can saturate the two bounds (modulo discretization errors for discrete $G$), and we get
$$|\mathrm{Stab}_\epsilon(\mathcal F)| = |G|\,\lambda, \qquad \mathrm{diam}(\mathrm{Stab}_\epsilon(\mathcal F)) \;\le\; C_G\,\lambda^{1/\mathrm{ddim}(G)}.$$
We next use the Lipschitz deformation bounds and find that for all $x \in \mathcal X$ we have
$$\mathrm{diam}\,\{y^*(g\cdot x) : g\in\mathrm{Stab}_\epsilon(\mathcal F)\} \;\le\; \mathrm{diam}(\mathrm{Stab}_\epsilon(\mathcal F))\,\mathrm{Lip}(y^*)\,L' \;\le\; C_G\,\lambda^{1/\mathrm{ddim}(G)}\,\mathrm{Lip}(y^*)\,L'.$$
Then we use Proposition 5.1 for $\mathcal F$ and observe that, when $g, X^0_\epsilon$ are random variables as in the proposition (in particular $g \in \mathrm{Stab}_\epsilon(\mathcal F)$), for each $X^0_\epsilon = x^0_\epsilon$ the following estimate holds uniformly over $y \in \{y^*(g\cdot X^0_\epsilon) : g\in\mathrm{Stab}_\epsilon(\mathcal F)\}$:
$$d_{\mathcal Y}\big(y,\; y^*(g\cdot x^0_\epsilon)\big) \;\le\; C_G\,\lambda^{1/\mathrm{ddim}(G)}\,\mathrm{Lip}(y^*)\,L'.$$
In a similar way, we also find $d_{\mathcal Y}(y, g\cdot y) \le C_G\,\lambda^{1/\mathrm{ddim}(G)}\,L'$. By the triangle inequality, and using the assumption that $\mathrm{Lip}(y^*)\ge 1$, it follows that
$$d_{\mathcal Y}\big(y,\; y^*(g\cdot x^0_\epsilon)\big) \;\le\; C_G\,\lambda^{1/\mathrm{ddim}(G)}\,(1 + \mathrm{Lip}(y^*))\,L' \;\le\; 2\,C_G\,\lambda^{1/\mathrm{ddim}(G)}\,\mathrm{Lip}(y^*)\,L'.$$
Then we may perturb each $y$ by $\epsilon$ in order to possibly diminish this value without violating the condition defining $\mathrm{Stab}_\epsilon(\mathcal F)$, and with these choices we obtain the claim.

A.4 Finding the optimal $\lambda = \lambda^*$ for the bound of Theorem 6.1

We note that the $\lambda$ minimizing $C\lambda^{\alpha} + C'\lambda^{-\beta}$ for $\alpha, \beta > 0$ is given by $\lambda^* = \big(\tfrac{\beta C'}{\alpha C}\big)^{1/(\alpha+\beta)}$. Recall that in our case we have the following choices, for case 1 and case 2 in the theorem's statement:
$$\alpha_1 = \alpha_2 = 1/d_G,\qquad \beta_1 = 1/2,\qquad \beta_2 = 1/d_0,\qquad C = C_G\,\mathrm{Lip}(y^*)\,L',$$
$$C'_1 = \frac{(2LD)^{d/2}}{|G|^{1/2}\,\delta_G^{(d_0-2)/2}},\qquad C'_2 = \frac{(2LD)^{d/d_0}}{|G|^{1/d_0}},$$
and thus the optimal choice of $\lambda$ is, in case $n\lambda \le C_3$,
$$\lambda^* = \left(\frac{(2LD)^{d/2}}{|G|^{1/2}\,\delta_G^{(d_0+2)/2}}\right)^{2d_G/(d_G+2)}\, n^{-d_G/(d_G+2)},$$
and, in case $n\lambda > C_3$,
$$\lambda^* = \left(\frac{d_0}{d_G}\;\frac{(2LD)^{d/d_0}}{|G|^{1/d_0}}\right)^{d_0 d_G/(d_0+d_G)}\, n^{-d_G/(d_0+d_G)}.$$
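As a quick sanity check of the first-order condition used above, the following script compares the closed-form minimizer of $C\lambda^{\alpha} + C'\lambda^{-\beta}$ with a brute-force grid search; the numerical constants are arbitrary toy values of ours and are not meant to match those of Theorem 6.1.

```python
import math

# For h(lam) = C*lam**alpha + Cp*lam**(-beta) with alpha, beta > 0, the minimizer is
#   lam* = (beta*Cp/(alpha*C))**(1/(alpha+beta)).
alpha, beta = 1 / 3, 1 / 2      # e.g. alpha = 1/d_G with d_G = 3, beta = 1/2 as in case 1
C, Cp = 2.0, 5.0                # arbitrary positive constants

def h(lam):
    return C * lam**alpha + Cp * lam**(-beta)

lam_star = (beta * Cp / (alpha * C)) ** (1 / (alpha + beta))

# brute-force check on a fine logarithmic grid
grid = [10 ** (-3 + 6 * i / 10**5) for i in range(10**5 + 1)]
lam_grid = min(grid, key=h)
print(lam_star, lam_grid, h(lam_star), h(lam_grid))
assert abs(math.log(lam_grid / lam_star)) < 1e-2
```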
B Examples of partial and approximate equivariance

We describe some concrete examples of partial and approximate equivariance using the language of section 3.2, sourcing them from the existing literature. But first, we expand a little on our equivariance error notation.

B.1 Equivariance error notation

Recall that the action of elements of an ambient group $G$ on the product space $\mathcal Z = \mathcal X\times\mathcal Y$ may be written as follows. For coordinates $z = (x, y)$ we write $g\cdot z = (g\cdot x, g\cdot y)$, and thus, for $\tilde f: \mathcal X\to\mathcal Y$ and $f(x, y) = \ell(\tilde f(x), y)$, we have the action $(g\cdot f)(z) := f(g\cdot z) = \ell(\tilde f(g\cdot x), g\cdot y)$.

For the equivariance error of $g, f$, interpreted as "the error of $f$'s approximate equivariance under the action of $g$", we obtain the following, which is valid in the common situations in which $\ell(y, y') \ge 0$ in general and $\ell(y, y) = 0$ for all $y\in\mathcal Y$:
$$ee(f, g) \;:=\; \|g\cdot f - f\|_\infty \;=\; \sup_{x, y}\,\big|\ell(\tilde f(g\cdot x), g\cdot y) - \ell(\tilde f(x), y)\big| \;\ge\; \sup_{x}\,\ell\big(\tilde f(g\cdot x),\, g\cdot\tilde f(x)\big), \tag{B.1}$$
where the last inequality follows by restricting the supremum from $\mathcal X\times\mathcal Y$ to the graph of $\tilde f$, namely by imposing $y = \tilde f(x)$. In several recent works, the equivariance error is defined simply by comparing $g\cdot\tilde f(x)$ and $\tilde f(g\cdot x)$, as in the rightmost term of (B.1); it is thus smaller than the one considered here. We provide a justification for our definition of the equivariance error: the loss $\ell$ is an integral part of the model, thus a definition of equivariance error which does not include it only detects partial information concerning the influence of symmetries; moreover, the notion of equivariance error defined via $\ell$ simplifies the comparison between $\tilde f$ and data distributions $\mathcal D$.

B.2 Examples

B.2.1 Imperfect translation equivariance in classical CNNs

We consider here the most common examples of group equivariant convolutional networks (GCNNs), namely the usual Convolutional Neural Networks (CNNs) for computer vision tasks. We follow observations from [27] and [11], and connect the underlying ideas to Theorem 6.1.

Setting of the problem. We consider a usual CNN layer, keeping in mind a segmentation task, where both $\mathcal X = \mathcal Y$ represent spaces of images. More precisely, we think of images as pixel values at integer-coordinate positions taken from the index set $\mathbb Z\times\mathbb Z$. We also assume that the relevant information of each image comes only from a square of $n\times n$ pixels, outside of which the pixel values are taken to be $0$. We consider the application of a single convolution kernel/filter of $k\times k$ pixels (with $k$ a small odd number). One typically applies padding by a layer of $0$'s of width $(k-1)/2$ along the perimeter of the $n\times n$ square, after which convolution with the kernel is computed on the $n\times n$ central pixels of the resulting $(n+k-1)\times(n+k-1)$ padded input image. The relevant output information is again restricted to an $n\times n$ square, outside of which pixel values are set to $0$ via padding.

Metric on $\mathcal X$. As a natural choice of distance over $\mathcal X$ we may consider the $L^2$-difference between pixel-value functions, or interpret pixel values as probability densities and use the Wasserstein distance, or consider other ad hoc image metrics.

Group action: translations. The group acting on our pixel-value functions is the group of translations with elements from $\mathbb Z\times\mathbb Z$. We expect the following equivariance property of the segmentation function $f: \mathcal X\to\mathcal X$:
$$f(v\cdot x) = v\cdot f(x),$$
where $x\in\mathcal X$ represents an image with pixel values assigned to integer coordinates, $v\in\mathbb Z^2$ is a translation vector, and (in two alternative notations) $v\cdot x = \tau_v(x)$ is the result of translating all pixel values of $x$ by $v$.

Deformation properties of the action. If we take the previously mentioned distance functions on $\mathcal X$ and the usual distance induced from $\mathbb R^2$ on translation vectors $v$, it is easy to verify that the assumptions of Theorem A.3 about the action of translations hold, and that the Lipschitz constants with respect to the metric on $\mathcal X$ only depend on the mismatch near the boundary, due to zero pixels moving in and to interior pixels moving out of our $n\times n$ square and being truncated. The ensuing bounds only depend on the precise distance that we use on $\mathcal X$.
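The boundary-induced failure of exact equivariance described above can be observed directly. The following minimal numpy sketch is our own illustration (random image and kernel; the `translate` helper fills vacated pixels with zeros, anticipating the truncated action discussed next); it measures how far a single zero-padded convolution is from commuting with translations.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_zero_pad(x, k):
    """'Same'-size cross-correlation with zero padding, mimicking one CNN layer."""
    n, r = x.shape[0], k.shape[0] // 2
    xp = np.pad(x, r)                              # zero padding of width (k-1)/2
    out = np.zeros_like(x)
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def translate(x, v):
    """Translate by v = (dy, dx) >= 0, filling vacated pixels with 0 (truncated action)."""
    out = np.zeros_like(x)
    dy, dx = v
    out[dy:, dx:] = x[:x.shape[0] - dy, :x.shape[1] - dx]
    return out

n, k = 16, 3
kernel = rng.normal(size=(k, k))
x = rng.normal(size=(n, n))

# Empirical counterpart of the right-most term of (B.1), sup_x l(f(v.x), v.f(x)), for a few
# translations v: the error is nonzero for every v != 0, and the mismatch is supported near
# the image boundary (interior pixels agree exactly).
for v in [(0, 1), (0, 4), (3, 3), (0, 8)]:
    diff = conv2d_zero_pad(translate(x, v), kernel) - translate(conv2d_zero_pad(x, kernel), v)
    print(v, "max |diff| =", round(float(np.max(np.abs(diff))), 3),
          " mismatched pixels:", int(np.sum(np.abs(diff) > 1e-12)))
```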
More realistic actions. An alternative, more realistic definition of the $\mathbb Z\times\mathbb Z$-action consists of defining $v\cdot x$ as the truncation of $\tau_v(x)$, where pixels falling outside our relevant $n\times n$ square are set to $0$ after the translation.

Problems near the boundary. Nevertheless, this updated translation action will move pixel values of $0$, coming from pixels outside the $n\times n$ square, into the image $v\cdot x$, creating artificial zero pixel values inside it, different from the values that would be present in real data.

Imperfect equivariance of data. Moreover, even with the latter, more realistic alternative, the above translation equivariance is not respected by segmentation input-output pairs coming from finite $n\times n$ images, since, independently of $n$, the boundary pixel positions translated by $v$ fall outside the original image.

Approximate stabilizer. In any case, we have to restrict the choices of $v$ to integer-coordinate elements of a (subset of an) $n\times n$ square, containing only the translations that relate real segmentations appearing within our $n\times n$ relevant window. It is thus natural to restrict $\mathrm{Stab}_\epsilon$ to only include vectors in a smaller subset of $\mathbb Z\times\mathbb Z$, of cardinality $|\mathrm{Stab}_\epsilon| \le n^2$. The value of the error $\epsilon$ may quantify the allowed error (or noisiness) of our model sets.

Reducing to finite $G$ and computing $\lambda$. For the sake of computing $\lambda$ in our Theorem 6.1, we observe that input and output values outside the padding perimeter given by an $(n+k-1)\times(n+k-1)$ square are irrelevant, and thus we may actually periodize the images and consider them as functions on a padded torus $(\mathbb Z/(n+k-1)\mathbb Z)^2$, which we consider as acting on itself by translations. In this case $G \simeq (\mathbb Z/(n+k-1)\mathbb Z)^2$, so that $|G| = (n+k-1)^2$ and thus
$$\lambda = \frac{|\mathrm{Stab}_\epsilon|}{(n+k-1)^2}.$$

Further extensions. In [27] it is argued that convolutional layers in classical CNNs are not fully translation-equivariant and can encode positional information, due to the manner in which boundary padding is implemented. Possible solutions are increasing the padding to size $k$ (so that the padded image is a square of size $(n+2k)\times(n+2k)$) or extending images by periodicity, via so-called "circular padding" (which transforms each image into a space equivalent to the torus $(\mathbb Z/n\mathbb Z)^2$). In either case, applying translations by vectors that are too large compared to the image size $n\times n$ will increase the mismatch between model equivariance and data equivariance.

Stride and downsampling. In [11], a different equivariance error of classical CNNs is studied, related to the use of stride $>1$ in order to lower the dimensions of CNN layer outputs. If, for example, we use stride $2$ when defining the layer operation $f: \mathcal X\to\mathcal Y$, then $\mathcal Y$ has relevant pixel values only in an $n/2\times n/2$ square, and we apply the $k\times k$ convolution kernel only at positions with coordinates in $(2\mathbb Z)\times(2\mathbb Z)$ of the image $x$. In this case we require that translations by group elements $v\in(2\mathbb Z)\times(2\mathbb Z)$ on $x$ have the effect of a translation by $v/2\in\mathbb Z\times\mathbb Z$ on the output. However, for shifts of the input by vectors $v$ that do not have two even coordinates, we may not have an explicit corresponding action on the output, and in [11] a solution via adaptive polyphase sampling is proposed. A possibility for studying the best polyphase sampling strategy via almost-equivariance would be to include a bound for the equivariance error $\epsilon > 0$ and to consider the optimization problem of finding the polyphase approximation that minimizes theoretical or empirical quantifications of $\epsilon$. As a benchmark (modelled on the case of infinite images without boundary effects), one could compare the above to the action via $\mathrm{Stab}_0 = (2\mathbb Z)\times(2\mathbb Z)$, which has $\lambda = 1/4$ within the ambient group $G = \mathbb Z\times\mathbb Z$. Our Theorem 6.1 can then be used to compare the effects of increasing or decreasing $\epsilon, \lambda$ in terms of data symmetry.
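A minimal numerical check of the stride-2 mismatch discussed above (our own illustration, using circular shifts of a random image so that boundary effects play no role) confirms that even input shifts are matched exactly by some output shift, while odd shifts are not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
x = rng.normal(size=(n, n))

def downsample(img):
    """Stride-2 subsampling, a stand-in for a strided CNN layer (no learned kernel)."""
    return img[::2, ::2]

def equiv_error(v):
    """Best achievable match between 'shift then downsample' and 'downsample then shift',
    minimized over all circular output shifts; an empirical proxy for the error epsilon."""
    lhs = downsample(np.roll(x, v, axis=(0, 1)))
    best = min(
        np.max(np.abs(lhs - np.roll(downsample(x), (a, b), axis=(0, 1))))
        for a in range(n // 2) for b in range(n // 2)
    )
    return float(best)

for v in [(0, 2), (2, 4), (0, 1), (1, 3)]:
    print(v, round(equiv_error(v), 6))
# Even shifts give error 0 exactly; odd shifts cannot be reproduced by any output shift,
# which is the mismatch that adaptive polyphase sampling in [11] is designed to address.
```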
B.2.2 Partial equivariance in GCNNs

In this section, we connect the main results of [40] to our setup. In [40], one of the main motivating examples was to consider rotations applied to a handwritten digit and to revert them. The underlying group action is via $SO(2)$, and in one case only rotations by angles in $[-60^\circ, 60^\circ]$ were permitted, which makes it possible not to confound rotated digits such as $6$ and $9$.

The above task can be formulated on a space of functions $f: \mathcal X\to\mathcal Y$ in which $\mathcal X$ represents the space of possible images and $\mathcal Y$ the labels. Elements $(Y_d, Y_\theta)\in\mathcal Y$ include a digit classification label $Y_d$ and a rotation angle value $Y_\theta$. We consider actions by the group $G = SO(2) = \{R_\phi : \phi\in\mathbb R/360\mathbb Z\}$, where $R_\phi$ is the rotation matrix by angle $\phi$, and the group operation corresponds to summing the angles modulo $360^\circ$ (or, in radians, modulo $2\pi$). The action of $R_\phi$ on $\mathcal X$ is by rotation as usual ($R_\phi$ sends the image $x\in\mathcal X$ to $R_\phi\cdot x$, now rotated by $\phi$), and on $\mathcal Y$ we consider the action $R_\phi\cdot(Y_d, Y_\theta) = (Y_d, Y_\theta + \phi)$, i.e. the restriction of the action to the digit label leaves it invariant, while the restriction of the action to the angle label is non-trivial, giving a shift of the label.

The optimal labelling assigns to $x$ a label $y^*(x)$ enjoying precise equivariance under the above definitions of the actions, and thus we may permit equivariance error $\epsilon = 0$. However, as mentioned above, applying rotations outside the range $\theta\in[-60^\circ, 60^\circ]$ to the data would surely bring us outside the labelled data distribution, thus we are led to take $\epsilon = 0$ and $\mathrm{Stab}_0 = \{R_\theta : \theta\in[-60^\circ, 60^\circ]\}$. We then have $|\mathrm{Stab}_0|/|SO(2)| = 1/3$ with respect to the natural Haar measure on rotations.

It is natural to think of the set $X^0_0$ of $\mathrm{Stab}_0$-action representatives of images as given by the "unrotated" images. If we take a digit image that is rotated, say by $20^\circ$, from its base version, and we apply a rotation of $50^\circ$ to it (i.e. an element of $\mathrm{Stab}_0$), then we reach a version of the image now excessively rotated, by $70^\circ$. This means that, without further modification, considering model symmetries with $\mathrm{Stab}_0$ taken to be independent of the point will automatically generate some error when tested on the data. While decreasing the threshold angle in the definition of $\mathrm{Stab}_0$ below $60^\circ$ would limit this effect, it would also reduce the generalization gain coming from the model symmetry. The study of point-dependent invariance sets $\mathrm{Stab}_\epsilon$ is interesting in view of this example application, but it is outside the scope of the current approximation/generalization bounds and is left for future work.
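The point-dependence issue can be quantified on a toy model: drawing a base rotation offset already present in the data point (the uniform distribution on $[-20^\circ, 20^\circ]$ below is our own arbitrary choice) and an element of $\mathrm{Stab}_0$, the script estimates how often the composition leaves the labelled range $[-60^\circ, 60^\circ]$.

```python
import random

random.seed(0)

# Fraction of (data offset, model transformation) pairs that fall outside the labelled range.
N = 100_000
outside = 0
for _ in range(N):
    theta0 = random.uniform(-20, 20)   # base rotation offset of the data point, in degrees
    g = random.uniform(-60, 60)        # model-side transformation from Stab_0
    if abs(theta0 + g) > 60:
        outside += 1
print(f"fraction pushed outside the data range: {outside / N:.3f}")
# For reference, lambda in this example is |Stab_0| / |SO(2)| = 120/360 = 1/3, as computed above.
```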
B.2.3 Possible applications to partial equivariance in Reinforcement Learning

The use of approximate invariances in RL applications was considered in [23, Sec. 6] via soft equivariance constraints, allowing better generalization for an agent moving in a geometric environment. While imposing approximate equivariance for a memoryless $G$-action, for groups such as $G = SO(2)$ or $\mathbb Z_2$, has produced positive results, it may be interesting, in analogy to the previous section, to include memory, and thus to restrict the choices of group actions across time steps. Note that for a temporal evolution over $T$ steps, the group $G$ acting independently at each step produces a $T$-interval action via the product group $G^T$, and allowing for a partial action via $\mathrm{Stab}_\epsilon$ may give increased fitness to evolving data. More precise time-dependence prescriptions and their consequences within Q-learning are left to future work.

C Discussion of [20]

In section 2 we mentioned [20], which considers PAC-style bounds under model symmetry. [20] works with compact groups and argues that the learning problem in such a setup reduces to working only with a set of reduced orbit representatives, which leads to a generalization gain. This message of [20] is similar to ours, although we work with a more general setup. However, we noted that the main theorem in [20] has an error; here we briefly sketch the issue with the proof. One of the main quantities of interest in the main theorem of [20] is $D_\tau(X, H)$ (notation from their paper), which comes directly from the following bound and is claimed to have a linear dependence on $\mathrm{Cov}(X, \rho, \delta)$. Again, for the sake of easier verification, we follow their notation; crucially, note that [20] uses the notation $\mathrm{Cov}$ as analogous to our $N$:
$$\mathrm{Cov}(H, L_\infty, 2C\delta + \kappa) \;\le\; \mathrm{Cov}(X, \rho, \delta)\cdot\sup_{x\in X}\mathrm{Cov}(H(x), \|\cdot\|, \kappa).$$
However, the correct application of the Kolmogorov-Tikhomirov estimate shows that the reasoning in the proof should yield a dependence which is exponential in $\mathrm{Cov}(X, \rho, \delta)$, not linear. To see this, set $s = 2$ (sufficient for our purposes) in equation 238 of [54] (page 186). In other words, it is not possible to cover Lipschitz functions in the infinity norm by using only constant functions.
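For the reader's convenience, we recall the shape of the Kolmogorov-Tikhomirov estimate invoked above, in its simplest instance and with unspecified absolute constants $c, C > 0$: for the class $\mathrm{Lip}_1$ of $1$-Lipschitz functions from $[0,1]$ to $[0,1]$ one has, for small $\epsilon > 0$,
$$c\,\frac{1}{\epsilon} \;\le\; \log_2 N\big(\mathrm{Lip}_1,\ \|\cdot\|_\infty,\ \epsilon\big) \;\le\; C\,\frac{1}{\epsilon},$$
so the covering number of the Lipschitz class is exponential, rather than linear, in the covering number $N([0,1], |\cdot|, \epsilon) \asymp 1/\epsilon$ of the underlying domain.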