Approximately Equivariant Graph Networks

Ningyuan (Teresa) Huang, Johns Hopkins University, nhuang19@jhu.edu
Ron Levie, Technion – Israel Institute of Technology, levieron@technion.ac.il
Soledad Villar, Johns Hopkins University, svillar3@jhu.edu

Abstract. Graph neural networks (GNNs) are commonly described as being permutation equivariant with respect to node relabeling in the graph. This symmetry of GNNs is often compared to the translation equivariance of Euclidean convolutional neural networks (CNNs). However, these two symmetries are fundamentally different: the translation equivariance of CNNs corresponds to symmetries of the fixed domain acting on the image signals (sometimes known as active symmetries), whereas in GNNs any permutation acts on both the graph signals and the graph domain (sometimes described as passive symmetries). In this work, we focus on the active symmetries of GNNs, by considering a learning setting where signals are supported on a fixed graph. In this case, the natural symmetries of GNNs are the automorphisms of the graph. Since real-world graphs tend to be asymmetric, we relax the notion of symmetries by formalizing approximate symmetries via graph coarsening. We present a bias-variance formula that quantifies the tradeoff between the loss in expressivity and the gain in the regularity of the learned estimator, depending on the chosen symmetry group. To illustrate our approach, we conduct extensive experiments on image inpainting, traffic flow prediction, and human pose estimation with different choices of symmetries. We show theoretically and empirically that the best generalization performance can be achieved by choosing a suitably larger group than the graph automorphism group, but smaller than the permutation group.

1 Introduction

Graph Neural Networks (GNNs) are popular tools to learn functions on graphs. They are commonly designed to be permutation equivariant since the node ordering can be arbitrary (in the matrix representation of a graph). Permutation equivariance serves as a strong geometric prior and allows GNNs to generalize well [1–3]. Yet in many applications, the node ordering across different graphs is matched or fixed a priori, such as a time series of social networks where the nodes identify the same users, or a set of skeleton graphs where the nodes represent the same joints. In such settings, the natural symmetries arise from graph automorphisms, which effectively only act on the graph signals; this is inherently different from the standard equivariance in GNNs, which concerns all possible permutations acting on both the signals and the graph domain. A permutation of both the graph and the graph signal can be seen as a change of coordinates, since it does not change the object it represents, just the way to express it. This parallels the passive symmetries in physics, where physical observables are independent of the coordinate system one uses to express them [4, 5]. In contrast, a permutation of the graph signal on a fixed graph potentially transforms the object itself, not only its representation. This parallels the active symmetries in physics, where the coordinate system (in this case, the graph or domain) is fixed but a group transformation on the signal results in a predictable transformation of the outcome (e.g., permuting left and right joints in the human skeleton changes a left-handed person to a right-handed one).
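To make the active/passive distinction concrete, here is a minimal numpy sketch (ours, not taken from the paper or its code release) using a toy one-layer message-passing map f(A, X) = AXW on a 4-node path graph; the layer, the graph, and all variable names are illustrative assumptions. Permuting both A and X (passive) always commutes with f, while permuting only X on the fixed graph (active) commutes only for automorphisms of A.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed domain: path graph on 4 nodes (adjacency A), signal X on the nodes.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 2))
W = rng.normal(size=(2, 2))

def f(A, X):
    # A toy message-passing layer f(A, X) = A X W.
    return A @ X @ W

def perm_matrix(p):
    return np.eye(len(p))[p]

# Passive symmetry: relabeling the nodes of BOTH A and X is a change of
# coordinates, and f commutes with it for every permutation.
P = perm_matrix([2, 0, 3, 1])
print(np.allclose(f(P @ A @ P.T, P @ X), P @ f(A, X)))   # True

# Active symmetry: the graph is fixed and only the signal is permuted.
# A generic permutation does NOT commute with f on the fixed graph ...
print(np.allclose(f(A, P @ X), P @ f(A, X)))             # False in general

# ... but an automorphism of A does (for the path graph: reverse the nodes).
Q = perm_matrix([3, 2, 1, 0])   # Q A Q^T = A
print(np.allclose(Q @ A @ Q.T, A), np.allclose(f(A, Q @ X), Q @ f(A, X)))  # True True
```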
The active symmetries on a fixed graph are similar to the translation equivariance symmetries in Euclidean convolutional neural networks (CNNs), where the domain is a fixed-size grid and the signals are images. In this work, we focus on active symmetries in GNNs. Specifically, we consider a fixed graph domain G with N nodes, an adjacency matrix A ∈ R^{N×N}, and input graph signals X ∈ R^{N×d}. We are interested in learning equivariant functions f that satisfy (approximate) active symmetries

f(ΠX) ≈ Π f(X)   for permutations Π ∈ G ≤ S_N,    (1)

where G is a subgroup of the permutation group S_N that depends on G. For example, G can be the graph automorphism group A_G = {Π : ΠA = AΠ}. Each choice of G induces a hypothesis class H_G for f: the smaller the group G, the larger the class H_G. We aim to select G so that the learned function f ∈ H_G generalizes well, also known as the model selection problem [6, Chp. 4]. In contrast, standard graph learning methods (GNNs, spectral methods) are defined to satisfy passive symmetries, by treating A as an input and requiring f(ΠAΠ^T, ΠX) = Π f(A, X) for all permutations Π ∈ S_N and all N. But for the fixed graph setting, we argue that active symmetries are more relevant. Thus we use A to define the hypothesis class rather than treating it as an input. By switching from passive symmetries to active symmetries, we will show how to design GNNs for signals supported on a fixed graph with different levels of expressivity and generalization properties.

While enforcing symmetries has been shown to improve generalization when the symmetry group is known a priori [7–13], the problem of symmetry model selection is not completely solved, particularly when the data lacks exact symmetries (see for instance [14–16] and references therein). Motivated by this, we study the symmetry model selection problem for learning on a fixed graph domain. This setting is particularly interesting since (1) the graph automorphism group serves as the natural oracle symmetry; and (2) real-world graphs tend to be asymmetric, but admit cluster structure or local symmetries. Therefore, we define approximate symmetries of graphs using the cut distance between graphs from graphon analysis. An approximate symmetry of a graph G is a symmetry of any other graph G′ that approximates G in the cut distance. In practice, we take G′ to be a coarse-graining (or clustering) of G, as these are typically guaranteed to be close in cut distance to G. We show how to induce approximate symmetries for G via the automorphisms of G′. Our main contributions include:

1. We formalize the notion of active symmetries and approximate symmetries of GNNs for signals supported on a fixed graph domain, which allows us to study the symmetry group model selection problem. (See Sections 2, 4)
2. We theoretically characterize the statistical risk depending on the hypothesis class induced from the symmetry group, and show a bias-variance tradeoff between the reduction in expressivity and the gain in regularity of the model. (See Sections 3, 4)
3. We illustrate our approach empirically for image inpainting, traffic flow prediction, and human pose estimation. (See Section 5 for an overview, and Appendix D for details on how to implement equivariant graph networks with respect to different symmetry groups.)

1.1 Related Work

Graph Neural Networks and Equivariant Networks. Graph Neural Networks (GNNs) [17–19] are typically permutation-equivariant (with respect to node relabeling).
These include message-passing neural networks (MPNNs) [19 21], spectral GNNs [18, 22 24], and subgraph-based GNNs [25 30]. Permutation equivariance in GNNs can extend to edge types [31] and higher-order tensors representing the graph structure [32, 33]. Equivariant networks generalize symmetries on graphs to other objects, such as sets [34], images [35], shapes [36, 37], point clouds [38 42], manifolds [1, 43, 44], and physical systems [45, 46] among many others. Notably, many equivariant machine learning problems concern the setting where the domain is fixed, e.g., learning functions on a fixed sphere [47], and thus focus on active symmetries rather than passive symmetries such as the node ordering in standard GNNs. Yet learning on a fixed graph domain arises naturally in many applications such as molecular dynamics modeling [48]. This motivates us to consider active symmetries of GNNs for learning functions on a fixed graph. Our work is closely related to Natural Graph Networks (NGNs) in [49], which use global and local graph isomorphisms to design maximally expressive GNNs for distinguishing different graphs. In contrast, we focus on generalization and thus consider symmetry model selection on a fixed graph. Generalization of Graph Neural Networks and Equivariant Networks. Most existing works focus on the graph-level tasks, where in-distribution generalization bounds of GNNs have been derived using statistical learning-theoretic measures such as Rademacher complexity [50] and Vapnik Chervonenkis (VC) dimension [51, 52], or uniform generalization analysis based on random graph models [53]; Out-of-distribution generalization properties have been investigated using different notions including transferability [54, 55] (also known as size generalization [56]), and extrapolation [57]. In general, imposing symmetry constraints improves the sample complexity and the generalization error [7 13]. Recently, Petrache and Trivedi [58] investigated approximation-generalization tradeoffs using approximate symmetries for general groups. Their results are based on uniform convergence generalization bounds, which measure the worst-case performance of all functions in a hypothesis class and differ from our non-uniform analysis. Approximate Symmetries. For physical dynamical problems, Wang et al. [45] formalized approximate symmetries by relaxing exact equivariance to allow for a small equivariance error. For reinforcement learning applications, Finzi et al. [59] proposed Residual Pathway Priors to expand network layers into a sum of equivariant layers and non-equivariant layers, and thus relax strict equivariance into approximate equivariance priors. In the context of self-supervised learning, Suau et al. [60], Gupta et al. [61] proposed to learn structural latent representations that satisfy approximate equivariance. Inspired by randomized algorithms, Cotta et al. [62] formalizes probabilistic notions of invariances and universal approximation. In [63], scattering transforms (a specific realization of a CNN) are shown to be approximately invariant to small deformations in the image domain, which can be seen as a form of approximate symmetry. Scattering transforms on graphs are discussed in [64 66]. Additional discussions on approximate symmetries can be found in [67, Section 6]. 2 Problem Setup: Learning Equivariant Maps on a Fixed Graph Notations. We let R, R+ denote the reals and the nonegative reals, I denote the identity matrix and 1 denote the all-ones matrix. 
For a matrix Y ∈ R^{N×k}, we write the Frobenius norm as ∥Y∥_F. We denote by X ⊙ Y the element-wise multiplication of matrices X and Y. We write [N] = {1, . . . , N} and S_N for the permutation group on the set [N]. Groups are typically denoted by calligraphic letters. Given groups H, K, we denote a semidirect product by H ⋊ K. For a group G with representation ϕ, we denote by χ_{ϕ|G} the corresponding character (see Definition 6 in Appendix A.1). Writing H ≤ G means that H is a subgroup of G.

Equivariance. We consider a compact group G with Haar measure λ (the unique G-invariant probability measure on G). Let G act on spaces X and Y by representations ϕ and ψ, respectively. We say that a map f : X → Y is G-equivariant if for all g ∈ G and x ∈ X, ψ(g^{-1}) f(ϕ(g) x) = f(x). Given any map f : X → Y, a projection of f onto the space of G-equivariant maps can be computed by averaging over orbits with respect to λ:

(Q_G f)(x) = ∫_G ψ(g^{-1}) f(ϕ(g) x) dλ(g).    (2)

For u, v ∈ Y, let ⟨u, v⟩ be a G-invariant inner product, i.e., ⟨ψ(g)u, ψ(g)v⟩ = ⟨u, v⟩ for all g ∈ G and all u, v ∈ Y. Given two maps f_1, f_2 : X → Y, we define their inner product as ⟨f_1, f_2⟩_µ = ∫_X ⟨f_1(x), f_2(x)⟩ dµ(x), where µ is a G-invariant measure on X. Let V be the space of all (measurable) maps f : X → Y such that ∥f∥_µ = √(⟨f, f⟩_µ) < ∞.

Graphs. We consider edge-node weighted graphs G = ([N], A, b), where [N] is a finite set of nodes, A ∈ [0, 1]^{N×N} is the adjacency matrix describing the edge weights, and b = {b_1, . . . , b_N} ⊂ R are the node weights. An edge weighted graph is a special case of G where all node weights are 1. A simple graph is a special case of an edge weighted graph where all edge weights are binary. Let A_G be the automorphism group of a graph, defined as A_G := {Π ∈ S_N : Π^T A Π = A, Π^T b = b}, which characterizes the symmetries of G. Hereinafter, G is assumed to be a subgroup of S_N.

Graph Signals and Learning Task. We consider graph signals supported on the nodes of a fixed graph G and maps between graph signals. Let X = R^{N×d} and Y = R^{N×k} be the input and output graph signal spaces. We denote by f a map between graph signals, f : X → Y. Even though the functions f depend on G, we do not explicitly write G as part of the notation of f because G is fixed. Our goal is to learn a target map between graph signals f_* : X → Y. To this end, we assume access to a training set {(X_i, Y_i)} where the X_i are i.i.d. sampled from an S_N-invariant distribution µ on X (where S_N acts on X by permuting the rows), and Y_i = f_*(X_i) + ξ_i for some noise ξ_i. A natural assumption is that f_* is approximately equivariant with respect to A_G in some sense. Our symmetry model selection problem concerns a sequence of hypothesis classes {H_G} indexed by G, where we choose the best class H_{G*} such that an estimator f̂ ∈ H_{G*} gives the best generalization performance.

Equivariant Graph Networks. We propose to learn the target function f_* on the fixed graph using G-equivariant graph networks (G-Net), which are equivariant to a chosen symmetry group G depending on the graph domain. Using standard techniques from representation theory, such as Schur's lemma and projections to isotypic components [68], we can parameterize G-Net by interleaving G-equivariant linear layers f_G : R^{N×d} → R^{N×k} with pointwise nonlinearities, where the weights in the linear map f_G are constrained in patterns depending on the group structure (also known as parameter sharing or weight tying [69]; see Appendix A.1 for technical details). A small sketch of this construction for a toy automorphism group is given below.
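As an illustration of the weight-tying pattern, the following minimal numpy sketch (ours, not the paper's implementation) builds a basis of the G-equivariant linear maps R^N → R^N for a small permutation group by group-averaging the elementary matrices; `perm_matrix`, `equivariant_basis`, and the choice of the 4-node path graph are illustrative assumptions.

```python
import numpy as np

def perm_matrix(p):
    """Permutation matrix P with (P x)[i] = x[p[i]]."""
    return np.eye(len(p))[list(p)]

def equivariant_basis(group):
    """Basis of {W in R^{N x N} : P W P^T = W for all P in group}, obtained by
    group-averaging each elementary matrix E_ij (the 'parameter sharing' pattern:
    entries of W are tied along the orbits of the group acting on index pairs)."""
    N = group[0].shape[0]
    seen, basis = set(), []
    for i in range(N):
        for j in range(N):
            E = np.zeros((N, N)); E[i, j] = 1.0
            B = sum(P @ E @ P.T for P in group) / len(group)   # intertwiner average
            key = tuple(np.round(B, 8).ravel())
            if key not in seen:
                seen.add(key); basis.append(B)
    return basis

# Automorphism group of the 4-node path graph: the identity and the reversal.
group = [perm_matrix([0, 1, 2, 3]), perm_matrix([3, 2, 1, 0])]
basis = equivariant_basis(group)
print(len(basis))                                   # 8 free parameters for this group
W = sum(np.random.randn() * B for B in basis)       # a random equivariant weight matrix
P = group[1]
print(np.allclose(P @ W @ P.T, W))                  # True: W commutes with the reversal
```

The number of basis elements equals the number of orbits of index pairs under the group action, i.e., the number of free parameters per equivariant layer.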
In practice, we can make G-Net more flexible by systematically breaking the symmetry, such as incorporating graph convolutions (i.e., A f_G) and locality constraints (i.e., A ⊙ f_G). Compared to standard GNNs (described in Appendix D.1), G-Net uses a more expressive linear map to gather global information (i.e., weights are not shared among all nodes). Importantly, G-Net yields a suite of models that allows us to flexibly choose the hypothesis class (and estimator) reflecting the active symmetries in the data, and it subsumes standard GNNs, which are permutation-equivariant with respect to passive symmetries (but not necessarily equivariant with respect to active symmetries).

Graphons. A graphon is a symmetric measurable function W : [0, 1]² → [0, 1]. Graphons represent (dense) graph limits where the number of nodes goes to infinity; they can also be viewed as random graph models where the value W(x, y) represents the probability of having an edge between the nodes x and y. Let W denote the space of all graphons and η denote the Lebesgue measure. Lovász and Szegedy [70] introduced the cut norm on W as

∥W∥_□ := sup_{S,T ⊆ [0,1], S,T measurable} | ∫_{S×T} W(u, v) dη(u) dη(v) |.    (3)

Based on the cut norm, Borgs et al. [71] defined the cut distance between two graphons W, U ∈ W as

δ_□(W, U) := inf_{f ∈ S_{[0,1]}} ∥W − U^f∥_□,    (4)

where S_{[0,1]} is the set of all measure-preserving bijective measurable maps between [0, 1] and itself, and U^f(x, y) = U(f(x), f(y)). Note that the cut distance is permutation-invariant, where measure-preserving bijections are seen as the continuous counterparts of permutations.

3 Generalization with Exact Symmetry

In this section, we assume the target map f_* is A_G-equivariant and study the symmetry model selection problem by comparing the (statistical) risk of different models. Concretely, the risk quantifies how a given model performs on average on any potential input: the smaller the risk, the better the model performs. Thus, we use the risk gap of two functions f, f′, defined as

Δ(f, f′) := E∥Y − f(X)∥²_F − E∥Y − f′(X)∥²_F,    (5)

as our model selection metric. Following ideas from Elesedy and Zaidi [8], we analyze the risk gap between any function f ∈ V and its G-equivariant version, for G ≤ S_N.

Lemma 1 (Risk Gap). Let X = R^{N×d}, Y = R^{N×k} be the input and output graph signal spaces on a fixed graph G. Let X ∼ µ, where µ is an S_N-invariant distribution on X. Let Y = f_*(X) + ξ, where ξ ∈ R^{N×k} is random, independent of X, with zero mean and finite variance, and f_* : X → Y is A_G-equivariant. Then, for any f ∈ V and for any compact group G ≤ S_N, we can decompose f as f = f_G + f_G^⊥, where f_G = Q_G f and f_G^⊥ = f − f_G. Moreover, the risk gap satisfies

Δ(f, f_G) = E∥Y − f(X)∥²_F − E∥Y − f_G(X)∥²_F = −2⟨f_*^⊥, f_G^⊥⟩_µ  [mismatch]  + ∥f_G^⊥∥²_µ  [constraint],

where f_*^⊥ = f_* − Q_G f_*.

Lemma 1, proven in Appendix B.1, adapts ideas from [8] to equate the risk gap to the symmetry mismatch between the chosen group G and the target group A_G, plus the symmetry constraint captured by the norm of the anti-symmetric part of f with respect to G. Our symmetry model selection problem aims to find the group G ≤ S_N such that the risk gap is maximized. Note that for G ≤ A_G, the mismatch term vanishes since f_* is A_G-equivariant, but the constraint term decreases with G; when G = A_G, we recover Lemma 6 in [8]. On the other hand, for G ⊋ A_G, the mismatch term can be positive, negative, or zero (depending on f_*), whereas the constraint term increases with G. A numerical illustration of this decomposition is sketched below.
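The decomposition in Lemma 1 can be checked numerically. The following sketch (ours; the 3-node setup, the stand-in automorphism group, and all names are illustrative assumptions) compares a Monte Carlo estimate of the risk gap Δ(f, f_G) with the mismatch-plus-constraint formula for linear maps and G = S_3.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(1)
N = 3                                        # 3 nodes, one-dimensional signals

perm_mats = lambda perms: [np.eye(N)[list(p)] for p in perms]
S3 = perm_mats(permutations(range(N)))
AG = perm_mats([(0, 1, 2), (1, 0, 2)])       # stand-in automorphism group: swap nodes 0,1

def project(Theta, group):                   # Q_G for linear maps f(x) = Theta x
    return sum(P.T @ Theta @ P for P in group) / len(group)

Theta_star = project(rng.normal(size=(N, N)), AG)      # A_G-equivariant target f_*
Theta = rng.normal(size=(N, N))                        # an arbitrary candidate f

G = S3                                       # chosen group, strictly larger than A_G
Theta_G = project(Theta, G)                            # f_G
Theta_perp = Theta - Theta_G                           # f_G^perp
Theta_star_perp = Theta_star - project(Theta_star, G)  # f_*^perp

# Monte Carlo estimate of the risk gap with X ~ N(0, I) (an S_N-invariant mu)
# and Y = f_*(X) + xi.
X = rng.normal(size=(500_000, N))
Y = X @ Theta_star.T + 0.1 * rng.normal(size=(500_000, N))
risk = lambda Th: np.mean(np.sum((Y - X @ Th.T) ** 2, axis=1))
lhs = risk(Theta) - risk(Theta_G)

# Lemma 1: Delta(f, f_G) = -2 <f_*^perp, f_G^perp>_mu + ||f_G^perp||_mu^2.
# For linear maps and X ~ N(0, I), <f_1, f_2>_mu = trace(Theta_1 Theta_2^T).
rhs = -2 * np.trace(Theta_star_perp @ Theta_perp.T) + np.trace(Theta_perp @ Theta_perp.T)
print(lhs, rhs)                              # the two values agree up to Monte Carlo error
```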
3.1 Linear Regression

In this section, we focus on the linear regression setting and analyze the risk gap of using the equivariant estimator versus the vanilla estimator. We consider a linear estimator Θ̂ : R^{N×d} → R^{N×k} that predicts Ŷ = f_Θ̂(X) = Θ̂^T X. Given a compact group G and any linear map Θ, we obtain its G-equivariant version via projection to the G-equivariant space (also known as the intertwiner average),

Ψ_G(Θ) := ∫_G ϕ(g) Θ ψ(g^{-1}) dλ(g).    (6)

We denote by Ψ_G^⊥(Θ) = Θ − Ψ_G(Θ) the projection of Θ onto the orthogonal complement of the G-equivariant space. Here Ψ_G instantiates the orbit-average operator Q_G (eqn. (2)) for linear functions.

Theorem 2 (Bias-Variance Tradeoff). Let X = R^{N×d}, Y = R^{N×k} be the graph signal spaces on a fixed graph G. Let S_N act on X and Y by permuting the rows, with representations ϕ and ψ. Let G be a subgroup of S_N acting with restricted representations ϕ|_G on X and ψ|_G on Y. Let X_{[i,j]} ∼ i.i.d. N(0, σ²_X) and Y = f_*(X) + ξ, where f_*(x) = Θ_*^T x is A_G-equivariant and Θ_* ∈ R^{Nd×Nk}. Assume ξ_{[i,j]} is random, independent of X, with mean 0 and E[ξξ^T] = σ²_ξ I < ∞. Let Θ̂ be the least-squares estimate of Θ_* from n i.i.d. examples {(X_i, Y_i) : i = 1, . . . , n}, and let Ψ_G(Θ̂) be its equivariant version with respect to G. Let ⟨χ_{ψ|G} | χ_{ϕ|G}⟩ = ∫_G χ_{ψ|G}(g) χ_{ϕ|G}(g) dλ(g) denote the inner product of the characters. If n > Nd + 1, the risk gap is

E[Δ(f_Θ̂, f_{Ψ_G(Θ̂)})] = −σ²_X ∥Ψ_G^⊥(Θ_*)∥²_F  [bias]  + σ²_ξ (N²dk − ⟨χ_{ψ|G} | χ_{ϕ|G}⟩) / (n − Nd − 1)  [variance].

The bias term depends on the anti-symmetric part of Θ_* with respect to G, whereas the variance term captures the difference between the dimension of the space of linear maps R^{Nd} → R^{Nk} (which is N²dk) and the dimension of the space of G-equivariant linear maps (which is ⟨χ_{ψ|G} | χ_{ϕ|G}⟩; a proof can be found in [72, Section 2.2]). The bias term reduces to zero for G ≤ A_G, and we recover Theorem 1 in [8] when G = A_G. Notably, by using a larger group G, the dimension of the equivariant space ⟨χ_{ψ|G} | χ_{ϕ|G}⟩ is smaller, and thus the variance term increases; meanwhile, the bias term decreases due to the extra (symmetry) constraints on the estimator. We remark that ⟨χ_{ψ|G} | χ_{ϕ|G}⟩ depends on d and k as well: for d = k = 1, ϕ(g) = ψ(g) ∈ R^{N×N} are standard permutation matrices; for general d > 1, k > 1, ϕ(g) ∈ R^{Nd×Nd} and ψ(g) ∈ R^{Nk×Nk} are block-diagonal matrices, where each block is a permutation matrix.

To understand the significance of the risk gap, we note that (see Appendix B.1) when n > Nd + 1, the risk of the least-squares estimator is

E[∥Y − Θ̂^T X∥²_F] = σ²_ξ Nd / (n − Nd − 1) + σ²_ξ.    (7)

Thus, when n is small enough that the risk gap in Theorem 2 is dominated by the variance term, enforcing equivariance gives a substantial generalization gain, of order N²dk − ⟨χ_{ψ|G} | χ_{ϕ|G}⟩.

[Figure 1: Overview of definitions for approximate symmetries. (a) A graph G and its coarsened graph G′; (b) the induced graphons W_G, W_{G′} with small cut distance δ_□, and their symmetries A_{G′} = S_2, A_G = {e}; the induced symmetry group G_{G→G′} yields more symmetries for G; (c) an approximately equivariant mapping f_* that is close to f_*^{G_{G→G′}}.]

Example 3.1. In Appendix C.1 we construct an example with X = R³, Y = R³, and x ∼ N(0, σ²_X I). We consider a target linear function f_* that is S_2-equivariant and approximately S_3-equivariant, and compare the estimators f_{Ψ_{S_2}(Θ̂)} versus f_{Ψ_{S_3}(Θ̂)}. When the number of training samples n is small, using the S_3-equivariant least-squares estimator yields better test error than the S_2-equivariant one, as shown in the figure inset; a small simulation in the same spirit is sketched below.
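The following is a small simulation sketch in the spirit of Example 3.1 (our own toy version, not the construction of Appendix C.1): the target is an S_2-equivariant linear map on R³ and we compare least-squares estimators projected onto the S_2- and S_3-equivariant spaces. The crossover sample size depends on the particular Θ_* and noise level, so it need not coincide with the threshold quoted next for the paper's instance.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)
N, sigma_x, sigma_xi = 3, 1.0, 1.0

perm_mats = lambda perms: [np.eye(N)[list(p)] for p in perms]
S3 = perm_mats(permutations(range(N)))
S2 = perm_mats([(0, 1, 2), (1, 0, 2)])            # swap the first two coordinates

def project(Theta, group):                        # intertwiner average, cf. Eq. (6)
    return sum(P.T @ Theta @ P for P in group) / len(group)

Theta_star = project(rng.normal(size=(N, N)), S2) # S2- but not S3-equivariant target

def risk(Theta_hat):
    # E||Y - Theta_hat^T X||_F^2 for X ~ N(0, sigma_x^2 I), Y = Theta_star^T X + xi.
    return sigma_x**2 * np.sum((Theta_star - Theta_hat)**2) + N * sigma_xi**2

for n in [6, 10, 20, 50, 200]:                    # training-set sizes
    gaps = []
    for _ in range(500):                          # average over training draws
        X = sigma_x * rng.normal(size=(n, N))
        Y = X @ Theta_star.T + sigma_xi * rng.normal(size=(n, N))
        Theta_ols = np.linalg.lstsq(X, Y, rcond=None)[0].T
        gaps.append(risk(project(Theta_ols, S2)) - risk(project(Theta_ols, S3)))
    print(n, np.mean(gaps))     # positive: the S3 projection generalizes better
```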
The dashed vertical line denotes the theoretical threshold n ≈ 35, before which using S_3 yields better generalization than S_2. This is an example where enforcing more symmetries than the target (symmetry) results in better generalization properties.

4 Generalization with Approximate Symmetries

Large graphs are typically asymmetric. Therefore, the assumption of f_* being A_G-equivariant in Section 3 becomes trivial. Yet, graphon theory asserts that graphs can be seen as living in a continuous metric space, with the cut distance (Definition 1). Since graphs can be seen as continuous entities in this sense, the combinatorial notion of exact symmetry of graphs is not appropriate. We hence relax the notion of exact symmetry to a property that is continuous in the cut distance. The regularity lemma [73, Theorem 5.1] asserts that any large graph can be approximated by a coarser graph in the cut distance. Since smaller graphs tend to exhibit more symmetries than larger ones, we are motivated to consider the symmetries of coarse-grained versions of graphs as their approximate symmetries (Definitions 2, 3). We then present the notion of approximately equivariant mappings (Definition 4), which allows us to precisely characterize the bias-variance tradeoff (Corollary 3) in the approximate symmetry setting based on Lemma 1. Effectively, we reduce selecting the symmetry group to choosing the coarsened graph G′. Moreover, our symmetry model selection perspective allows for exploring different coarsening procedures and choosing the one that works best for the problem.

Definition 1 (Induced graphon). Let G be a (possibly edge weighted) graph with node set [N] and adjacency matrix A = {a_{i,j} ∈ [0, 1]}_{i,j=1}^N. Let P_N = {I_n^N = [(n−1)/N, n/N)}_{n=1}^N be the equipartition of [0, 1] into N intervals (where formally the last interval is the closed interval [(N−1)/N, 1]). We define the graphon W_G induced by G as

W_G(x, y) = Σ_{i,j=1}^N a_{i,j} 1_{I_i^N}(x) 1_{I_j^N}(y),

where 1_{I_i^N} is the indicator function of the set I_i^N ⊂ [0, 1], i = 1, . . . , N.

We induce a graphon from an edge-node weighted graph G = ([M], A, b), with rational node weights b_i, as follows. Let N ∈ N be a value such that the node weights can be written in the form b = {b_m = q_m/N}_{m=1}^M, for some q_m ∈ N_{>0}, m = 1, . . . , M. We blow up G to an edge weighted graph G_N of Σ_m q_m nodes by splitting each node m of G into q_m nodes {n_m^j}_{j=1}^{q_m} of weight 1, and defining the adjacency matrix A_N with entries a_{n_m^j, n_{m′}^{j′}} = a_{m,m′}. Note that for any two choices N_1, N_2 ∈ N in the above construction, δ_□(W_{G_{N_1}}, W_{G_{N_2}}) = 0, where the infimum in the definition of δ_□ is realized on some permutation of intervals in [0, 1], as explained below. We hence define the induced graphon of the edge-node weighted graph G by W_G := W_{G_N} for some N, where, in fact, each choice of N gives a representative of an equivalence class of piecewise constant graphons with zero δ_□ distance.

Definition 2. Let G′ be an edge-node weighted graph with M nodes and node weights b = {b_m = q_m/N}_{m=1}^M, satisfying {q_m}_m ⊂ N_{>0} and Σ_m q_m = N, and let G be an edge weighted graph with N nodes. We say that G′ coarsens G up to error ε if δ_□(W_G, W_{G′}) < ε.

Suppose that G′ coarsens G. This implies the existence of an assignment of the nodes of G to the nodes of G′. Namely, the infimum underlying the definition of δ_□ over the measure-preserving bijections φ : [0, 1] → [0, 1] is realized by a permutation Π : [N] → [N] applied on the intervals {I_i^N}_{i=1}^N. With this permutation, the assignment C_{G→G′} defined by [N] ∋ Π(n_m^j) ↦ m ∈ [M], for every m = 1, . . . , M and j = 1, . . . , q_m, is called a cluster assignment of G to G′. A toy numerical illustration of Definitions 1–2 is sketched below.
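To illustrate Definitions 1–2, here is a small numpy sketch (ours; the two-community graph, the block-averaging coarsening, and the L1 error proxy are illustrative assumptions, and we only compute an upper bound on the cut norm rather than the cut norm itself, which is hard to evaluate exactly): a graph is coarsened to a two-node edge-node weighted graph, the coarse graph is mapped back to its induced step-function graphon on the same grid, and the difference is measured in L1, which upper-bounds ∥·∥_□.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 40, 2
clusters = np.repeat(np.arange(M), N // M)          # cluster assignment C_{G -> G'}

# A graph G with two noisy communities (within-block density .8, across .1).
P = np.where(clusters[:, None] == clusters[None, :], 0.8, 0.1)
A = (rng.random((N, N)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                                         # simple graph

# Coarsened edge-node weighted graph G': M nodes, node weights q_m / N,
# edge weight a'_{m,m'} = average of A over the corresponding block.
C = np.eye(M)[clusters]                             # N x M membership matrix
q = C.sum(axis=0)                                   # cluster sizes q_1, ..., q_M
A_coarse = (C.T @ A @ C) / np.outer(q, q)

# Blow G' back up (Definition 1): the induced graphon of G' is the step function
# with value a'_{m,m'} on each block, evaluated here on the same N x N grid.
W_G = A
W_Gp = A_coarse[clusters][:, clusters]

# The cut norm is bounded above by the (normalized) L1 norm, so this is a crude
# certificate that G' coarsens G up to a small error epsilon (Definition 2).
eps_proxy = np.abs(W_G - W_Gp).mean()
print(q, eps_proxy)                                 # e.g. roughly 0.2-0.3 here
```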
Let G′, with M nodes, be an edge-node weighted graph that coarsens the simple graph G of N nodes. Let C_{G→G′} be a cluster assignment of G to G′. Let A_{G′} be the automorphism group of G′. Note that two nodes can belong to the same orbit of A_{G′} only if they have the same node weight. Hence, for every two nodes m, m′ in the same orbit of A_{G′} (i.e., equivalent clusters), there exists a bijection ψ_{m,m′} : C_{G→G′}^{-1}{m} → C_{G→G′}^{-1}{m′}. We choose a set of such bijections such that for every m, m′ on the same orbit of A_{G′}, ψ_{m,m′} = ψ_{m′,m}^{-1}. We moreover choose ψ_{m,m} to be the identity mapping for each m = 1, . . . , M. We identify each element g of A_{G′} with an element ĝ of a permutation group Â_{G′} of [N], called the blown-up symmetry group, as follows. Given g ∈ A_{G′}, for every m ∈ [M] and every n ∈ C_{G→G′}^{-1}{m}, define ĝ·n := ψ_{m, g·m}(n).

Definition 3. Let G′, with M nodes, be an edge-node weighted graph that coarsens the (simple or weighted) graph G of N nodes. Let C_{G→G′} be a cluster assignment of G to G′ with cluster sizes c_1, . . . , c_M. Let A_{G′} be the automorphism group of G′, and Â_{G′} the blown-up symmetry group of [N]. For every m = 1, . . . , M, let S_{c_m} be the symmetry group of C_{G→G′}^{-1}{m}. We call the group of permutations

G_{G→G′} := (S_{c_1} × S_{c_2} × ⋯ × S_{c_M}) ⋊ Â_{G′} ≤ S_N

the symmetry group of G induced by the coarsening G′. We call any element of G_{G→G′} an approximate symmetry of G. Specifically, every element s ∈ G_{G→G′} can be written in coordinates as s = s_1 s_2 ⋯ s_M a ≡ (s_1, . . . , s_M, a), for a unique choice of s_j ∈ S_{c_j}, j = 1, . . . , M, and a ∈ Â_{G′} (see Appendix B.2 for details).

In words, we consider the symmetry of the graph G of N nodes not via its own automorphism group, but via the symmetry of its coarsened graph G′. The nodes in the same orbit in G are either in the same cluster of G′ (i.e., they come from the same coarsened node), or in equivalent clusters of G′ (i.e., they belong to the same orbit of the coarsened graph and share the same cluster size). See Figure 1 (and Figure 9 in Appendix A.2) for examples.

Definition 4. Let G be a graph with N nodes, and X = R^{N×d} the space of graph signals. Let X ∼ µ, where µ is an S_N-invariant measure on R^{N×d}. We call f : R^{N×d} → R^{N×k} an approximately equivariant mapping if there exists a function κ : R_+ → R_+ satisfying lim_{ε→0} κ(ε) = 0 (called the equivariance rate), such that for every ε > 0 and every edge-node weighted graph G′ that coarsens G up to error ε,

∥f − f^{G_{G→G′}}∥_µ ≤ κ(ε),

where f^{G_{G→G′}} = Q_{G_{G→G′}}(f) is the intertwining projection of f with respect to G_{G→G′}.

Example 4.1. Here we give an example of an approximately equivariant mapping in a natural setting. Suppose that G is a random geometric graph, namely, a graph that was sampled from a metric-probability space M by randomly and independently sampling nodes {x_n}_{n=1}^N ⊂ M. Hence, G can be viewed as a discretization of M, and the graphs that coarsen G can be seen as coarser discretizations of M. Therefore, the approximate symmetries are local deformations of G, namely, permutations that swap nodes that are close in the metric of M. This gives an interpretation of approximately equivariant mappings of geometric graphs: they are mappings that are stable to local deformations. In Appendix C.2 we develop this example in detail.

We are ready to present the bias-variance tradeoff in the setting with approximate symmetries. The proofs are deferred to Appendix B.2.

Corollary 3 (Risk Gap via Graph Coarsening). Let X = R^{N×d}, Y = R^{N×k} be the input and output graph signal spaces on a fixed graph G.
Let X ∼ µ, where µ is an S_N-invariant distribution on X. Let Y = f_*(X) + ξ, where ξ ∈ R^{N×k} is random, independent of X, with zero mean and finite variance, and let f_* : R^{N×d} → R^{N×k} be an approximately equivariant mapping with equivariance rate κ. Then, for any G′ that coarsens G up to error ε and for any f ∈ V, we have

Δ(f, f_{G_{G→G′}}) = −2⟨f_*^⊥, f_{G_{G→G′}}^⊥⟩_µ  [mismatch]  + ∥f_{G_{G→G′}}^⊥∥²_µ  [constraint]  ≥ −2κ(ε) ∥f_{G_{G→G′}}^⊥∥_µ + ∥f_{G_{G→G′}}^⊥∥²_µ.

Notably, Corollary 3 illustrates the tradeoff explicitly in the form of choosing the coarsened graph G′: if we choose G′ close to G, so that the coarsening error ε and ∥f_{G_{G→G′}}^⊥∥_µ are small, then the mismatch term is close to zero; meanwhile the constraint term is also small, since A_{G′} is typically trivial. On the other hand, if we choose G′ far from G, yielding a large coarsening error ε, then the constraint term ∥f_{G_{G→G′}}^⊥∥²_µ also increases.

Corollary 4 (Bias-Variance Tradeoff via Graph Coarsening). Consider the same linear regression setting as in Theorem 2, except that now f_* is an approximately equivariant mapping with equivariance rate κ, and G = G_{G→G′} is controlled by a graph G′ that coarsens G up to error ε. Denote the canonical permutation representations of G_{G→G′} on X, Y by ϕ′, ψ′, respectively. Let ⟨χ_{ψ′} | χ_{ϕ′}⟩ = ∫_{G_{G→G′}} χ_{ψ′}(g) χ_{ϕ′}(g) dλ(g) denote the inner product of the characters. If n > Nd + 1, the risk gap is bounded by

E[Δ(f_Θ̂, f_{Ψ_{G_{G→G′}}(Θ̂)})] ≥ −2κ(ε) √( σ²_ξ (N²dk − ⟨χ_{ψ′} | χ_{ϕ′}⟩) / (n − Nd − 1) ) + σ²_ξ (N²dk − ⟨χ_{ψ′} | χ_{ϕ′}⟩) / (n − Nd − 1).

5 Experiments

We illustrate our theory in three real-world tasks for learning on a fixed graph: image inpainting, traffic flow prediction, and human pose estimation (the source code is available at https://github.com/nhuang37/Approx_Equivariant_Graph_Nets). For image inpainting, we demonstrate the bias-variance tradeoff via graph coarsening using G-Net; for the other two applications, we show that the tradeoff not only emerges from G-Net, which enforces strict equivariance, but also from G-Net augmented with symmetry-breaking modules, allowing us to recover standard GNN architectures as a special case. Concretely, we consider the following variants (more details in Appendix D.4.2):

1. G-Net with strict equivariance, using the equivariant linear map f_G.
2. G-Net augmented with graph convolution, A f_G(x), denoted G-Net(gc).
3. G-Net augmented with graph convolution and learnable edge weights: G-Net(gc+ew).
4. G-Net augmented with graph locality constraints, (A ⊙ f_G)(x), and learnable edge weights, denoted G-Net(pt+ew).

G-Net serves as a baseline to validate our generalization analysis for equivariant estimators (see Section 3). Yet in practice, we observe that augmenting G-Net with symmetry breaking can further improve performance, thereby justifying the analysis for approximate symmetries (see Section 4). In particular, G-Net(gc) and G-Net(gc+ew) are motivated by the graph convolutions used in standard GNNs; G-Net(pt+ew) is inspired by the concept of locality in CNNs, where A ⊙ f_G effectively restricts the receptive field to the 1-hop neighborhood induced by the graph A.

5.1 Application: Image Inpainting

We consider a 28 × 28 grid graph as the fixed domain, with grey-scale images as the graph signals. The learning task is to reconstruct the original images given masked images as inputs (i.e., image inpainting). We use subsets of MNIST [74] and Fashion-MNIST [75], each comprising 100 training samples and 1000 test samples. The input and output graph signals are (m_i ⊙ x_i, x_i), where x_i ∈ R^{28×28} ≅ R^{784} denotes the image signal and m_i denotes a random mask (of size 14 × 14 for MNIST and 20 × 20 for Fashion-MNIST). A small sketch of the patch coarsening used in this experiment is given below.
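Before describing the model selection experiment, here is a minimal sketch (ours, not the released code) of the patch coarsening used next: the 28 × 28 grid is partitioned into d × d patches, which fixes the cluster assignment and hence the size of the induced group S_{c_1} × ⋯ × S_{c_M} when A_{G′} is trivial. The helper `patch_clusters` is our own name.

```python
import numpy as np

def patch_clusters(side=28, d=7):
    """Return an array c of length side*side with c[v] = cluster id of pixel v,
    where the clusters are the d x d patches of the grid (side divisible by d)."""
    assert side % d == 0
    rows, cols = np.divmod(np.arange(side * side), side)    # pixel coordinates
    return (rows // d) * (side // d) + (cols // d)           # patch index

for d in [28, 14, 7, 4, 2, 1]:
    c = patch_clusters(28, d)
    M = c.max() + 1                       # number of clusters (coarsened nodes)
    sizes = np.bincount(c)                # cluster sizes c_1, ..., c_M
    # Size of the induced group S_{c_1} x ... x S_{c_M} (trivial A_{G'} assumed):
    # d = 28 gives the full S_784, d = 1 gives the trivial group.
    log_group_size = np.sum([np.sum(np.log(np.arange(1, s + 1))) for s in sizes])
    print(f"patch {d:2d}x{d:<2d}: M = {M:3d} clusters, log|G| = {log_group_size:9.1f}")
```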
We investigate the symmetry model selection problem by clustering the grid into M patches of size d × d, where d ∈ {28, 14, 7, 4, 2, 1}. Here d = 28 means one cluster (with S_N symmetry); d = 1 gives 784 singleton clusters with no symmetry (the trivial group). We consider G_{G→G′}-equivariant networks G-Net with ReLU nonlinearity. We parameterize the equivariant linear layer f : R^N → R^N with respect to G_{G→G′} = (S_{c_1} × ⋯ × S_{c_M}) ⋊ A_{G′} using the following block-matrix form (assuming the nodes are ordered by their cluster assignment), with f_{kl} denoting block matrices and a_k, b_k, e_{kl} scalars:

f = [ f_{11} ⋯ f_{1M} ; ⋮ ⋱ ⋮ ; f_{M1} ⋯ f_{MM} ],   f_{kk} = a_k I + b_k 11^T,   f_{kl} = e_{kl} 11^T for k ≠ l.    (8)

The coarsened-graph symmetry A_{G′} induces constraints on a_k, b_k, e_{kl}. If A_{G′} is trivial, these scalars are unconstrained. In the experiment, we consider a reflection symmetry on the coarsened grid graph, i.e., A_{G′} = S_2, which acts by reflecting the left (coarsened) patches to the right (coarsened) patches. Suppose the reflected patch pairs are ordered consecutively; then a_k = a_{k+1}, b_k = b_{k+1} for k ∈ {1, 3, . . . , M−1}, and e_{kl} = e_{k+1, l−1} for k ∈ {1, 3, . . . , M−1}, l ∈ {2, 4, . . . , M} (see the figure inset for an illustration). In practice, we extend the parameterization to f : R^{N×d} → R^{N×k}. More details can be found in Appendix D.2.

Figure 2 shows that the empirical risk first decreases and then increases as the group decreases, illustrating the bias-variance tradeoff from our theory. Figure 2 (left) compares a 2-layer G-Net with a 1-layer linear G-Net, demonstrating that the tradeoff occurs in both linear and nonlinear models. Figure 2 (right) shows that using the reflection symmetry of the coarsened graph outperforms the trivial-symmetry baseline, highlighting the utility of modeling coarsened-graph symmetries.

[Figure 2: Bias-variance tradeoff via graph coarsening. Left: 2-layer G-Net (blue) and 1-layer linear G-equivariant functions (orange), assuming the coarsened graph is asymmetric. Right: 2-layer G-Net with both trivial and non-trivial coarsened-graph symmetry. See Table 7 for more numerical details.]

5.2 Application: Traffic Flow Prediction

The traffic flow prediction problem can be formulated as follows. Let X^{(t)} ∈ R^{N×d} represent the traffic graph signal (e.g., speed or volume) observed at time t. The goal is to learn a function h(·) that maps T′ historical traffic signals to T future graph signals, assuming the fixed graph domain G:

[X^{(t−T′+1)}, . . . , X^{(t)}; G]  --h(·)-->  [X^{(t+1)}, . . . , X^{(t+T)}].

The traffic graph is typically large and asymmetric. Therefore, we leverage our approximate symmetries to study the symmetry model selection problem (Section 4). We use the METR-LA dataset, which represents traffic volume on highways in Los Angeles (see the figure inset). We use both the original graph G from [76] and its sparsified version G_s, which is more faithful to the road geometry: in G_s, the (i, j)-edge exists if and only if nodes i and j lie on the same highway. We first validate our Definition 4 on this dataset (see Appendix D.3.1). In terms of models, we consider a simplified version of the DCRNN model proposed in [76], where the spatial module is a standard GNN and the temporal module is an encoder-decoder architecture for sequence-to-sequence modelling. We keep the same temporal module and extend the GNN module using G-Net(gc), which recovers DCRNN when choosing G = S_N. A sketch of such a layer is given below.
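The following PyTorch sketch (ours, not the paper's released implementation; all class and argument names are illustrative) shows one way to realize the layer of Eq. (8) with trivial A_{G′}, and the symmetry-breaking G-Net(gc) variant A f_G(x); the extra channel-mixing nn.Linear is our own simplification.

```python
import torch
import torch.nn as nn

class ClusterEquivariantLinear(nn.Module):
    """Equivariant linear map of Eq. (8) for G = S_{c_1} x ... x S_{c_M},
    i.e. a trivial coarsened-graph symmetry A_{G'}. `clusters` is a length-N
    integer tensor with the cluster id of each node. (Sketch; names are ours.)"""
    def __init__(self, clusters: torch.Tensor):
        super().__init__()
        self.register_buffer("C", nn.functional.one_hot(clusters).float())  # N x M
        M = self.C.shape[1]
        self.a = nn.Parameter(torch.randn(M))        # diagonal weight per cluster
        self.b = nn.Parameter(torch.randn(M, M))     # b_k on the diagonal, e_kl off it

    def weight(self):
        # f = diag(a_{c(1)}, ..., a_{c(N)}) + C B C^T with B[k,k] = b_k, B[k,l] = e_kl.
        return torch.diag(self.C @ self.a) + self.C @ self.b @ self.C.t()

    def forward(self, x):                            # x: N x d (single graph signal)
        return self.weight() @ x

class GNetGCLayer(nn.Module):
    """One G-Net(gc) layer: graph convolution composed with the equivariant map,
    x -> relu(A f_G(x) W); the channel mixing W is our own addition."""
    def __init__(self, clusters, A, d_in, d_out):
        super().__init__()
        self.f_G = ClusterEquivariantLinear(clusters)
        self.register_buffer("A", A)
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, x):
        return torch.relu(self.lin(self.A @ self.f_G(x)))

# Minimal usage on a toy 4-node graph with two clusters {0,1} and {2,3}:
clusters = torch.tensor([0, 0, 1, 1])
A = torch.tensor([[0., 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
layer = GNetGCLayer(clusters, A, d_in=3, d_out=3)
out = layer(torch.randn(4, 3))                       # shape (4, 3)
```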
We then consider different choices of the coarsened graph G′ (to induce approximate symmetries). As shown in Table 3, using 2 clusters as approximate symmetries yields a better generalization error than using 9 clusters or the full permutation group S_N.

Table 3: Traffic forecasting with different choices of graph clustering. The table shows Mean Absolute Error (MAE) across 3 runs.
G-Net(gc)  | S_N           | S_{c_1} × S_{c_2} | S_{c_1} × ⋯ × S_{c_9}
Graph G_s  | 3.173 ± 0.013 | 3.150 ± 0.008     | 3.204 ± 0.006
Graph G    | 3.106 ± 0.013 | 3.092 ± 0.008     | 3.174 ± 0.013

5.3 Application: Human Pose Estimation

We consider the simple (loopy) graph G with adjacency matrix A ∈ {0, 1}^{16×16} representing 16 human joints, illustrated on the right. The input graph signals X ∈ R^{16×2} describe the 2D spatial locations of the joints. The goal is to learn a map that reconstructs the 3D pose Y ∈ R^{16×3}. We use the standard benchmark dataset Human3.6M [77] and follow the evaluation protocol in [78]. Generalization performance is measured by the mean per-joint position error (MPJPE) and the MPJPE after alignment (P-MPJPE). We implement a 4-layer G-Net(gc+ew), which recovers the same model architecture as SemGCN [78] when choosing G = S_16 (further details are provided in Appendix D.4.1). We consider the following symmetry groups: the permutation group S_16, the graph automorphism group A_G, and the trivial symmetry. Note that the human skeleton graph has A_G = (S_2)² (corresponding to the arm flip and the leg flip). Additionally, we investigate an approximate symmetry called Relax-S16: this relaxes the global weight sharing in S_16 (which learns only 2 scalars per equivariant linear layer f_{S_16} : R^N → R^N, with f_{S_16}[i, i] = a and f_{S_16}[i, j] = b for i ≠ j) to local weight sharing (which learns 2 × 16 = 32 scalars, with f_{Relax-S16}[i, i] = a_i and f_{Relax-S16}[i, j] = b_i for i ≠ j).

Table 4: Human pose estimation with different choices of symmetries. The table shows the mean per-joint position error (MPJPE) and the MPJPE after alignment (P-MPJPE) across 3 runs.
G-Net(gc+ew) | S_16         | Relax-S16    | A_G = (S_2)² | Trivial
MPJPE        | 42.55 ± 0.88 | 39.87 ± 0.46 | 42.18 ± 0.49 | 41.60 ± 0.32
P-MPJPE      | 34.48 ± 0.44 | 31.38 ± 0.14 | 32.08 ± 0.20 | 31.69 ± 0.17

Table 4 shows that the best performance is achieved by the hypothesis class induced from Relax-S16, which is smaller than S_16 but differs from A_G. Furthermore, using G-Net(gc+ew) with Relax-S16 gives a substantial improvement over S_16 (which represents SemGCN [78]). This demonstrates the utility of enforcing active symmetries in GNNs, which results in more expressive models. More details and additional ablation studies can be found in Appendix D.4.

6 Discussion

In this paper, we focus on learning tasks where the graph is fixed and the dataset consists of different signals on the graph. We developed an approach for designing GNNs based on active symmetries and approximate symmetries induced by the symmetries of the graph and its coarse-grained versions. A layer of an approximately equivariant graph network uses a linear map that is equivariant to the chosen symmetry group; the graph is not used as an input but rather induces a hypothesis class. In practice, we further break the symmetry by incorporating the graph in the model computation, thus combining symmetric and asymmetric components. We theoretically show a bias-variance tradeoff between the loss of expressivity due to imposing symmetries and the gain in regularity, for settings where the target map is assumed to be (approximately) equivariant.
For simplicity, the theoretical analysis focuses on the equivariant models without symmetry breaking; Theoretically analyzing the combination of symmetric and asymmetric components in machine learning models is an interesting open problem. The bias-variance formula is computed only for a simple linear regression model with white noise and in the underparamterized setting; Extending it to more realistic models and overparameterized settings is a promising direction. As a proof of concept, our approximately equivariant graph networks only consider symmetries of the fixed graph. An interesting future direction is to extend our approach to also account for (approximate) symmetries in node features and labels, using suitably generalized cut distance (e.g., graphon-signal cut distance in [79]; see Appendix C.2 for an overview). Our network architecture consists only of linear layers and pointwise nonlinearity. Another promising direction is to incorporate pooling layers (e.g., [22, 80]) and investigate them through the lens of approximate symmetries. Finally, extending our analysis to general groups is a natural next step. Our techniques are based on compact groups with orthogonal representation. While we believe that the orthogonality requirement can be lifted straightforwardly, relaxing the compactness requirement appears to be more challenging (see related discussion in [46]). Acknowledgments This research project was started at the BIRS 2022 Workshop Deep Exploration of non-Euclidean Data with Geometric and Topological Representation Learning held at the UBC Okanagan campus in Kelowna, B.C. The authors thank Ben Blum-Smith (JHU), David Hogg (NYU), Erik Thiede (Cornell), Wenda Zhou (Open AI), Carey Priebe (JHU), and Rene Vidal (UPenn) for valuable discussions, and Hong Ye Tan (Cambridge) for pointing out typos in an earlier version of the manuscript. The authors also thank the anonymous Neur IPS reviewers and area chair for helpful feedback. Ningyuan Huang is partially supported by the MINDS Data Science Fellowship from Johns Hopkins University. Ron Levie is partially funded by ISF (Israel Science Foundation) grant #1937/23. Soledad Villar is partially funded by the NSF Simons Research Collaboration on the Mathematical and Scientific Foundations of Deep Learning (Mo DL) (NSF DMS 2031985), NSF CISE 2212457, ONR N00014-22-1-2126 and an Amazon AI2AI Faculty Research Award. [1] M. M. Bronstein, J. Bruna, Y. Le Cun, A. Szlam, and P. Vandergheynst. Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18 42, 2017. [2] Fernando Gama, Joan Bruna, and Alejandro Ribeiro. Stability properties of graph neural networks. IEEE Transactions on Signal Processing, 68:5680 5695, 2020. [3] Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veliˇckovi c. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges, 2021. [4] Carlo Rovelli and Marcus Gaul. Loop quantum gravity and the meaning of diffeomorphism invariance. In Towards Quantum Gravity: Proceeding of the XXXV International Winter School on Theoretical Physics Held in Polanica, Poland, 2 11 February 1999, pages 277 324. Springer, 2000. [5] Soledad Villar, David W Hogg, Weichi Yao, George A Kevrekidis, and Bernhard Schölkopf. The passive symmetries of machine learning. ar Xiv preprint ar Xiv:2301.13724, 2023. [6] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018. [7] Alberto Bietti, Luca Venturi, and Joan Bruna. 
On the sample complexity of learning under invariance and geometric stability, 2021. [8] Bryn Elesedy and Sheheryar Zaidi. Provably strict generalisation benefit for equivariant models. In International Conference on Machine Learning, pages 2959 2969. PMLR, 2021. [9] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Learning with invariances in random features and kernel models. ar Xiv preprint ar Xiv:2102.13219, 2021. [10] Clare Lyle, Mark van der Wilk, Marta Kwiatkowska, Yarin Gal, and Benjamin Bloem-Reddy. On the benefits of invariance in neural networks. ar Xiv preprint ar Xiv:2005.00178, 2020. [11] Bryn Elesedy. Group symmetry in pac learning. In ICLR 2022 Workshop on Geometrical and Topological Representation Learning, 2022. [12] Bryn Elesedy. Provably strict generalisation benefit for invariance in kernel methods. Advances in Neural Information Processing Systems, 34:17273 17283, 2021. [13] Arash Behboodi, Gabriele Cesa, and Taco Cohen. A pac-bayesian generalization bound for equivariant networks. ar Xiv preprint ar Xiv:2210.13150, 2022. [14] Gregory Benton, Marc Finzi, Pavel Izmailov, and Andrew G Wilson. Learning invariances in neural networks from training data. Advances in neural information processing systems, 33: 17605 17616, 2020. [15] Jameson Cahill, Dustin G Mixon, and Hans Parshall. Lie pca: Density estimation for symmetric manifolds. Applied and Computational Harmonic Analysis, 65:279 295, 2023. [16] Vasco Portilheiro. A tradeoff between universality of equivariant models and learnability of symmetries. ar Xiv preprint ar Xiv:2210.09444, 2022. [17] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61 80, 2009. [18] J. Bruna, W. Zaremba, A. Szlam, and Y. Le Cun. Spectral networks and deep locally connected networks on graphs. In International Conference on Learning Representation, 2014. [19] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pages 1263 1272, 2017. [20] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017. [21] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. [22] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3837 3845, 2016. [23] F. Gama, A. G. Marques, G. Leus, and A. Ribeiro. Convolutional neural network architectures for signals supported on graphs. IEEE Transactions on Signal Processing, 67(4):1034 1049, 2019. [24] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing, 67(1): 97 109, 2019. [25] Zhengdao Chen, Lei Chen, Soledad Villar, and Joan Bruna. Can graph neural networks count substructures? Advances in neural information processing systems, 33:10383 10395, 2020. [26] Erik Henning Thiede, Wenda Zhou, and Risi Kondor. Autobahn: Automorphism-based graph neural nets, 2021. URL https://arxiv.org/abs/2103.01710. 
[27] Beatrice Bevilacqua, Fabrizio Frasca, Derek Lim, Balasubramaniam Srinivasan, Chen Cai, Gopinath Balamurugan, Michael M Bronstein, and Haggai Maron. Equivariant subgraph aggregation networks. ar Xiv preprint ar Xiv:2110.02910, 2021. [28] Giorgos Bouritsas, Fabrizio Frasca, Stefanos P Zafeiriou, and Michael Bronstein. Improving graph neural network expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. [29] Fabrizio Frasca, Beatrice Bevilacqua, Michael M Bronstein, and Haggai Maron. Understanding and extending subgraph gnns by rethinking their symmetries. ar Xiv preprint ar Xiv:2206.11140, 2022. [30] Bohang Zhang, Guhao Feng, Yiheng Du, Di He, and Liwei Wang. A complete expressiveness hierarchy for subgraph gnns via subgraph weisfeiler-lehman tests. Co RR, abs/2302.07090, 2023. [31] Jianfei Gao, Yangze Zhou, and Bruno Ribeiro. Double permutation equivariance for knowledge graph completion. ar Xiv preprint ar Xiv:2302.01313, 2023. [32] Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. Invariant and equivariant graph networks. ar Xiv preprint ar Xiv:1812.09902, 2018. [33] Christopher Morris, Martin Ritzert, Matthias Fey, William L. Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. Weisfeiler and leman go neural: Higher-order graph neural networks. In AAAI Conference on Artificial Intelligence, pages 4602 4609, 2019. [34] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. Advances in neural information processing systems, 30, 2017. [35] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International conference on machine learning, pages 2990 2999. PMLR, 2016. [36] Yashil Sukurdeep et al. Elastic shape analysis of geometric objects with complex structures and partial correspondences. Ph D thesis, Johns Hopkins University, 2023. [37] Jameson Cahill, Joseph W Iverson, Dustin G Mixon, and Daniel Packer. Group-invariant max filtering. ar Xiv preprint ar Xiv:2205.14039, 2022. [38] Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation-and translation-equivariant neural networks for 3d point clouds. ar Xiv:1802.08219, 2018. [39] Fabian Fuchs, Daniel Worrall, Volker Fischer, and Max Welling. Se (3)-transformers: 3d roto-translation equivariant attention networks. Neural Information Processing Systems, 33, 2020. [40] Mario Geiger and Tess Smidt. e3nn: Euclidean neural networks. ar Xiv preprint ar Xiv:2207.09453, 2022. [41] Soledad Villar, David W Hogg, Kate Storey-Fisher, Weichi Yao, and Ben Blum-Smith. Scalars are universal: Equivariant machine learning, structured like classical physics. Advances in Neural Information Processing Systems, 34:28848 28863, 2021. [42] Ben Finkelshtein, Chaim Baskin, Haggai Maron, and Nadav Dym. A simple and universal rotation equivariant point-cloud network. In Topological, Algebraic and Geometric Learning Workshops 2022, pages 107 115. PMLR, 2022. [43] Risi Kondor, Zhen Lin, and Shubhendu Trivedi. Clebsch gordan nets: a fully fourier space spherical convolutional neural network. Advances in Neural Information Processing Systems, 31:10117 10126, 2018. [44] Maurice Weiler, Patrick Forré, Erik Verlinde, and Max Welling. Coordinate independent convolutional networks isometry and gauge equivariant convolutions on riemannian manifolds. ar Xiv preprint ar Xiv:2106.06020, 2021. [45] Rui Wang, Robin Walters, and Rose Yu. 
Approximately equivariant networks for imperfectly symmetric dynamics. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 23078 23091. PMLR, 17 23 Jul 2022. URL https://proceedings.mlr.press/v162/wang22aa. html. [46] Soledad Villar, Weichi Yao, David W Hogg, Ben Blum-Smith, and Bianca Dumitrascu. Dimensionless machine learning: Imposing exact units equivariance. Journal of Machine Learning Research, 24(109):1 32, 2023. [47] Taco S. Cohen, Mario Geiger, Jonas Koehler, and Max Welling. Spherical cnns, 2018. [48] Jiaqi Han, Wenbing Huang, Tingyang Xu, and Yu Rong. Equivariant graph hierarchy-based neural networks. Advances in Neural Information Processing Systems, 35:9176 9187, 2022. [49] Pim de Haan, Taco S Cohen, and Max Welling. Natural graph networks. Advances in Neural Information Processing Systems, 33:3636 3646, 2020. [50] Vikas Garg, Stefanie Jegelka, and Tommi Jaakkola. Generalization and representational limits of graph neural networks. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3419 3430. PMLR, 13 18 Jul 2020. URL https://proceedings. mlr.press/v119/garg20c.html. [51] Pascal Esser, Leena Chennuru Vankadara, and Debarghya Ghoshdastidar. Learning theory can (sometimes) explain generalisation in graph neural networks. Advances in Neural Information Processing Systems, 34:27043 27056, 2021. [52] Christopher Morris, Floris Geerts, Jan Tönshoff, and Martin Grohe. Wl meet vc, 2023. [53] Sohir Maskey, Ron Levie, Yunseok Lee, and Gitta Kutyniok. Generalization analysis of message passing neural networks on large random graphs. In Advances in Neural Information Processing Systems, 2022. [54] Ron Levie, Wei Huang, Lorenzo Bucci, Michael Bronstein, and Gitta Kutyniok. Transferability of spectral graph convolutional neural networks. Journal of Machine Learning Research, 22 (272):1 59, 2021. [55] Luana Ruiz, Luiz Chamon, and Alejandro Ribeiro. Graphon neural networks and the transferability of graph neural networks. Advances in Neural Information Processing Systems, 33: 1702 1712, 2020. [56] Beatrice Bevilacqua, Yangze Zhou, and Bruno Ribeiro. Size-invariant graph representations for graph classification extrapolations. In International Conference on Machine Learning, pages 837 851. PMLR, 2021. [57] Keyulu Xu, Mozhi Zhang, Jingling Li, Simon S Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. How neural networks extrapolate: From feedforward to graph neural networks. ar Xiv preprint ar Xiv:2009.11848, 2020. [58] Mircea Petrache and Shubhendu Trivedi. Approximation-generalization trade-offs under (approximate) group equivariance, 2023. [59] Marc Finzi, Gregory Benton, and Andrew G Wilson. Residual pathway priors for soft equivariance constraints. Advances in Neural Information Processing Systems, 34:30037 30049, 2021. [60] Xavier Suau, Federico Danieli, T. Anderson Keller, Arno Blaas, Chen Huang, Jason Ramapuram, Dan Busbridge, and Luca Zappella. Duet: 2d structured and approximately equivariant representations. In International Conference on Machine Learning. PMLR, 2023. [61] Sharut Gupta, Joshua Robinson, Derek Lim, Soledad Villar, and Stefanie Jegelka. Structuring representation geometry with rotationally equivariant contrastive learning. ar Xiv preprint ar Xiv:2306.13924, 2023. 
[62] Leonardo Cotta, Gal Yehuda, Assaf Schuster, and Chris J Maddison. Probabilistic invariant learning with randomized linear classifiers. ar Xiv preprint ar Xiv:2308.04412, 2023. [63] Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331 1398, 2012. doi: https://doi.org/10.1002/cpa.21413. [64] Fernando Gama, Alejandro Ribeiro, and Joan Bruna. Diffusion scattering transforms on graphs. ar Xiv preprint ar Xiv:1806.08829, 2018. [65] Michael Perlmutter, Feng Gao, Guy Wolf, and Matthew Hirn. Understanding graph neural networks with asymmetric geometric scattering transforms. ar Xiv preprint ar Xiv:1911.06253, 2019. [66] Dongmian Zou and Gilad Lerman. Graph convolutional neural networks via scattering. Applied and Computational Harmonic Analysis, 49(3):1046 1074, 2020. [67] Alexander Bogatskiy, Sanmay Ganguly, Thomas Kipf, Risi Kondor, David W Miller, Daniel Murnane, Jan T Offermann, Mariel Pettee, Phiala Shanahan, Chase Shimmin, et al. Symmetry group equivariant architectures for physics. ar Xiv preprint ar Xiv:2203.06153, 2022. [68] William Fulton and Joe Harris. Representation theory: a first course, volume 129. Springer Science & Business Media, 2013. [69] Siamak Ravanbakhsh, Jeff Schneider, and Barnabas Poczos. Equivariance through parametersharing. In International conference on machine learning, pages 2892 2901. PMLR, 2017. [70] László Lovász and Balázs Szegedy. Limits of dense graph sequences. Journal of Combinatorial Theory, Series B, 96(6):933 957, 2006. [71] Christian Borgs, Jennifer T Chayes, László Lovász, Vera T Sós, and Katalin Vesztergombi. Convergent sequences of dense graphs i: Subgraph frequencies, metric properties and testing. Advances in Mathematics, 219(6):1801 1851, 2008. [72] Mark Reeder. Notes on representations of finite groups, 2014. [73] László Lovász and Balázs Szegedy. Szemerédi s lemma for the analyst. GAFA Geometric And Functional Analysis, 17(1):252 270, 2007. [74] Yann Le Cun. The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/, 1998. [75] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. [76] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. ar Xiv preprint ar Xiv:1707.01926, 2017. [77] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3. 6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence, 36(7):1325 1339, 2013. [78] Long Zhao, Xi Peng, Yu Tian, Mubbasir Kapadia, and Dimitris N Metaxas. Semantic graph convolutional networks for 3d human pose regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3425 3435, 2019. [79] Ron Levie. A graphon-signal analysis of graph neural networks, 2023. [80] Alejandro Parada-Mayorga, Zhiyang Wang, and Alejandro Ribeiro. Graphon pooling for reducing dimensionality of signals and convolutional operators on graphs, 2022. [81] Daniel A. Spielman. Testing isomorphism of graphs with distinct eigenvalues, 2018. [82] Bruce Sagan. The symmetric group: representations, combinatorial algorithms, and symmetric functions, volume 203. Springer Science & Business Media, 2001. [83] Jean-Pierre Serre et al. Linear representations of finite groups, volume 42. Springer, 1977. 
[84] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in highdimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949 986, 2022. [85] Ningyuan Huang, David W. Hogg, and Soledad Villar. Dimensionality reduction, regularization, and generalization in overparameterized regressions. SIAM Journal on Mathematics of Data Science, 4(1):126 152, feb 2022. doi: 10.1137/20m1387821. URL https://doi.org/10. 1137%2F20m1387821. [86] Somesh Das Gupta. Some aspects of discrimination function coefficients. Sankhy a: The Indian Journal of Statistics, Series A, pages 387 400, 1968. [87] Hosagrahar V Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M Patel, Raghu Ramakrishnan, and Cyrus Shahabi. Big data and its technical challenges. Communications of the ACM, 57(7):86 94, 2014. [88] Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7753 7762, 2019. A Parameterization of equivariant linear Maps We consider a group G acting on spaces X and Y via representations ϕ and ψ, respectively. Our goal is to find the equivariant linear maps f : X Y such that f(ϕ(g)x) = ψ(g)f(x) for all g G and x X. The standard way to do this, used extensively in the equivariant machine learning literature (e.g. [40, 43]), is to decompose ϕ and ψ in irreducibles and use Schur s lemma. In a nutshell, a group representation φ is an homomorphism G GL(V ), where GL(V ) denotes the General Linear group of the vector space V (sometimes mathematicians say that V is a representation of G, but we need to know the homomorphism φ too). One way to interpret the group homomorphism (i.e. φ(gh) = φ(g) φ(h)) is that the group multiplication corresponds to the composition of linear invertible maps (i.e. matrix multiplication). A linear subspace W of V is said to be a subrepresentation of φ if φ(G)(W) W. An irreducible representation is one that only has itself and the trivial subspace as subrepresentations. Schur s lemma states that if V , W are vector spaces over C and φV , φW are irreducible representations, then either (1) φV and φW are not isomorphic as representations (and the only equivariant linear map between V , W is the zero map), or (2) φV and φW are isomorphic and the only non-trivial equivariant maps are of the form λ I where λ C and I is the identity (See Chapter 1 of [68]). Given G acting on spaces X and Y via representations ϕ and ψ, respectively, one can decompose ϕ and ψ in irreducibles over C ϕ = ℓ k=1ak Tk ψ = ℓ k=1bk Tk (this notation assumes the same irreducibles appear in both decompositions, which can be done if we allow some of the ak and bk to be zero). Then one can parameterize the equivariant linear maps by having one complex parameter per irreducible that appears in both decompositions. These ideas can be applied to real spaces too. By Maschke s theorem we can decompose the representation in irreducibles over R. Then we can check further how to decompose these irreducibles over C, and apply Schur s lemma. We have 3 cases for the decomposition: 1. The irreducible over R is also irreducible over C. In this case we directly apply Schur s lemma. 2. The irreducible over R decomposes in two different irreducibles over C. In this case we can send each C-irreducible to their isomorphic counterpart. 3. 
The irreducible over ℝ decomposes into two copies of the same irreducible over ℂ. In this case we can send each irreducible to any isomorphic copy independently.

Therefore, finding the equivariant linear maps reduces to decomposing the corresponding representations into irreducibles. In the next sections we explain in detail how to do this for the specific problems described in this paper. The appendix is organized as follows: we first show how to parameterize equivariant linear layers for an abelian group using the isotypic decomposition (Section A.1), and then discuss the symmetry group induced by graph coarsening, which further requires parameterizing permutation-equivariant linear maps to model the within-cluster symmetries (Section A.2).

A.1 Equivariant Linear Maps via Isotypic Decomposition

In this section, we assume that the graph adjacency matrix A has distinct eigenvalues λ_1 > λ_2 > ... > λ_N. Then A_G is an abelian group [81, Lemma 3.8.1]. Under this assumption, we present the construction of A_G-equivariant linear maps using the isotypic decomposition (i.e., the decomposition into isomorphism classes of irreducible representations) and group characters. We remark that the construction extends to non-abelian groups and refer the interested reader to [82]; we omit it here for ease of exposition. We consider the simplest setting where f : ℝ^N → ℝ^N is a linear function mapping graph signals to graph signals. Let x ∈ ℝ^N be the node features; then A_G-equivariance requires
f(g · x) = g · f(x) for all g ∈ A_G. (9)
To construct equivariant linear functions f, our roadmap is as follows: 1. Decompose the vector space X = ℝ^N into a sum of components such that different components cannot be mapped to each other equivariantly (the isotypic decomposition); 2. Given the isotypic decomposition X = ⊕_i X_i, parameterize f by a linear map on each X_i such that f(X_i) ⊆ X_i for all i. To this end, we need the following definitions.
Definition 5 (G-module, [82, Defn 1.3.1]). Let X be a vector space and G be a group. We say that X is a G-module, or that X carries a representation of G, if there is a group homomorphism ρ : G → GL(X), where GL denotes the general linear group. Equivalently, if the following holds for all g, h ∈ G, v, w ∈ X, and scalars c, d ∈ ℂ (where e ∈ G denotes the identity element): 1. gv ∈ X; 2. g(cv + dw) = c(gv) + d(gw); 3. (gh)v = g(hv); 4. ev = v.
In what follows, we consider X = ℝ^N carrying a representation of G.
Definition 6 (Group characters). Given a group G and a representation ϕ : G → GL(V), the group character is the function χ : G → ℝ (or ℂ) defined as χ(g) := tr ϕ(g).
Definition 7 (Group orbits). Let X be a vector space and G be a group. The group orbit of an element x ∈ X is O(x) := {gx : g ∈ G}.
Let g_1, ..., g_s be the generators of A_G ≅ (S_2)^n, or more generally A_G ≅ (S_2)^k for some k ≤ n. Since A_G is abelian, any irreducible representation is 1-dimensional [68, p.8]. In other words, the irreducible representations of an abelian group are homomorphisms
ρ : A_G → ℂ. (10)
Since all elements of the group A_G = (S_2)^k have order 1 or 2, these homomorphisms satisfy ρ : A_G → {±1} ⊂ ℝ. By Definition 6, the irreducible characters (i.e., the characters of the irreducible matrix representations) are also homomorphisms χ : A_G → {±1}. In other words, χ(g) ∈ {±1} for all g ∈ A_G. Then we can write down the 2^k × 2^k character table, where the rows are the characters χ and the columns are the group elements g ∈ A_G (see Table 1 for an example).
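To make Steps 1–2 of this roadmap concrete, the following is a minimal sketch (assumed function names; not the paper's code) that enumerates an abelian automorphism group A_G ≅ (S_2)^k from its commuting order-2 generators, builds the 2^k × 2^k character table, and forms the isotypic projectors anticipated by the projection formula (11) given below.

```python
# A minimal sketch (not the authors' code): character table of an abelian group
# A_G ≅ (S2)^k given k commuting order-2 generators, and the isotypic projectors
# P_chi = (1/|A_G|) * sum_g chi(g) * g anticipated by eq. (11) below.
import itertools
import numpy as np

def perm_matrix(perm):
    """Permutation matrix for the permutation i -> perm[i]."""
    n = len(perm)
    P = np.zeros((n, n))
    P[perm, np.arange(n)] = 1.0
    return P

def abelian_characters_and_projectors(generators):
    """generators: list of k commuting order-2 N x N permutation matrices.
    Returns the 2^k x 2^k character table (rows: characters, columns: group elements)
    and the list of isotypic projectors."""
    k = len(generators)
    N = generators[0].shape[0]
    # Enumerate all 2^k group elements as products of subsets of the generators.
    elements = []
    for signs in itertools.product([0, 1], repeat=k):
        g = np.eye(N)
        for s, gen in zip(signs, generators):
            if s:
                g = g @ gen
        elements.append((signs, g))
    # Each character sends generator i to +1 or -1, independently for each i.
    chars, projectors = [], []
    for eps in itertools.product([1, -1], repeat=k):
        chi = [int(np.prod([e**s for e, s in zip(eps, signs)])) for signs, _ in elements]
        P = sum(c * g for c, (_, g) in zip(chi, elements)) / len(elements)
        chars.append(chi)
        projectors.append(P)
    return np.array(chars), projectors

# Example: path graph P4 with A_G = {e, (14)(23)} (0-indexed: 0<->3, 1<->2), so k = 1.
gen = perm_matrix([3, 2, 1, 0])
table, projs = abelian_characters_and_projectors([gen])
print(table)                 # 2 x 2 character table with entries in {+1, -1}
print(projs[0] @ projs[1])   # projectors of distinct characters are orthogonal (zero matrix)
```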
Now, define the projection onto the isotypic component of the representation X as
$P_\chi := \frac{\deg \chi}{|A_G|}\sum_{g \in A_G}\chi(g)\, g \;=\; \frac{1}{|A_G|}\sum_{g \in A_G}\chi(g)\, g,$ (11)
where the second equality uses the fact that A_G is abelian (so every irreducible character has degree 1). Intuitively, applying P_χ to X = span({e_1, ..., e_N}) picks out all v ∈ X that stay in the subspace defined by the group character χ. (Note that in the (S_2)^k case, χ(g)^{-1} = χ(g) since χ(g) ∈ {±1}.) Algorithm 1 presents the construction of an equivariant linear map f with respect to an abelian group. This construction parameterizes all A_G-equivariant linear maps when A_G is abelian, as shown in Lemma 5.

Algorithm 1 Parameterizing equivariant linear functions f : ℝ^N → ℝ^N for an abelian group
Require: abelian group A_G ≅ (S_2)^k.
1. Construct the character table of the irreducible characters of A_G, i.e., χ_i : A_G → {±1}, i = 1, ..., ℓ.
2. For each character χ_i in the character table, compute the projection matrix
$P_{\chi_i}(X) = [P_{\chi_i}(e_1); \ldots; P_{\chi_i}(e_N)] \in \mathbb{R}^{N\times N},$ (12)
then compute a basis of its column space and call it $X_{\chi_i} = [b^{(1)}_{\chi_i}, \ldots, b^{(K_i)}_{\chi_i}]$.
3. X = ⊕_{i=1}^{ℓ} X_{χ_i}, where the X_{χ_i} are the isotypic components. Then f is any linear function satisfying f(X_{χ_i}) ⊆ X_{χ_i} for all i = 1, ..., ℓ. In particular, in the basis $[b^{(s)}_{\chi_i}]_{1\le i\le \ell,\, 1\le s\le K_i}$, f can be written as a block-diagonal matrix in ℝ^{N×N}, with each block M_{χ_i} a linear map from X_{χ_i} to X_{χ_i}:
$f = \mathrm{diag}(M_{\chi_1}, M_{\chi_2}, \ldots, M_{\chi_\ell}).$ (13)

Table 1: Character table for aut(P_4) ≅ ℤ_2 (columns indexed by the group elements e and (14)(23)):
χ_e: 1, 1
χ_2: 1, −1

Lemma 5. f is linear and equivariant with respect to the abelian group A_G if and only if f can be written as in (13) in Algorithm 1.

Proof. By the construction in Algorithm 1, f is linear and equivariant. To show the converse, we first recall some useful facts and notation. Since A_G is abelian with all irreducible representations being one-dimensional, for isotypic components X_{χ_1} ≠ X_{χ_2} we have
g · v_1 = λ_1(g) v_1 for all g ∈ G, v_1 ∈ X_{χ_1}, (14)
g · v_2 = λ_2(g) v_2 for all g ∈ G, v_2 ∈ X_{χ_2}, (15)
where there exists some g ∈ G such that λ_1(g) ≠ λ_2(g). To show that f being linear and equivariant implies f(v) ∈ X_χ for all v ∈ X_χ, we argue by contradiction. Without loss of generality, suppose
f(v_{χ_1}) = α_1 v_{χ_1} + α_2 v_{χ_2}, (16)
for some scalars α_1, α_2 with α_2 ≠ 0 and some v_{χ_1} ∈ X_{χ_1}, v_{χ_2} ∈ X_{χ_2}. Then by (14), for all g ∈ G,
f(g · v_{χ_1}) = f(λ_1(g) v_{χ_1}) = λ_1(g) f(v_{χ_1}) = λ_1(g) α_1 v_{χ_1} + λ_1(g) α_2 v_{χ_2}. (17)
Now, since f is equivariant, for all g ∈ G,
f(g · v_{χ_1}) = g f(v_{χ_1}) = g(α_1 v_{χ_1} + α_2 v_{χ_2}) = λ_1(g) α_1 v_{χ_1} + λ_2(g) α_2 v_{χ_2}. (18)
But there exists some g′ ∈ G with λ_1(g′) ≠ λ_2(g′), so (17) and (18) disagree on the coefficient of v_{χ_2}, a contradiction. One can easily extend this proof strategy to the general case f(v_{χ_1}) = Σ_{i=1}^{ℓ} α_i v_{χ_i}.

Example A.1. Consider the path graph on 4 nodes (i.e., P_4). We have aut(P_4) = {e, (14)(23)} ≅ ℤ_2. Step 1: Note that ℤ_2 is abelian and thus all irreducible characters satisfy χ(g) ∈ {±1} for all g ∈ ℤ_2. The character table is shown in Table 1. Step 2: Using (11) we have
$P_{\chi_e}[e_1; e_2; e_3; e_4] = \frac{1}{2}\begin{pmatrix} 1 & 0 & 0 & 1\\ 0 & 1 & 1 & 0\\ 0 & 1 & 1 & 0\\ 1 & 0 & 0 & 1\end{pmatrix},$
which yields the basis B(P_{χ_e}) = [e_1 + e_4; e_2 + e_3], and
$P_{\chi_2}[e_1; e_2; e_3; e_4] = \frac{1}{2}\begin{pmatrix} 1 & 0 & 0 & -1\\ 0 & 1 & -1 & 0\\ 0 & -1 & 1 & 0\\ -1 & 0 & 0 & 1\end{pmatrix},$
which yields the basis B(P_{χ_2}) = [e_1 − e_4; e_2 − e_3]. Step 3: Parameterize f : ℝ^4 → ℝ^4 by f : span B(P_{χ_e}) → span B(P_{χ_e}) and f : span B(P_{χ_2}) → span B(P_{χ_2}). For all v ∈ ℝ^4, write v = c_1(e_1 + e_4) + c_2(e_2 + e_3) + c_3(e_1 − e_4) + c_4(e_2 − e_3); then
f(v) = (α_1 c_1 + α_2 c_2)(e_1 + e_4) + (α_3 c_1 + α_4 c_2)(e_2 + e_3) + (α_5 c_3 + α_6 c_4)(e_1 − e_4) + (α_7 c_3 + α_8 c_4)(e_2 − e_3),
where α_1, ..., α_8 are (learnable) real scalars. Now f is linear and equivariant by construction.

A.2 Equivariant Linear Maps for Symmetries Induced by Graph Coarsening
In this section, we provide additional details on constructing equivariant linear maps with respect to the symmetry group induced by graph coarsening (Definition 3).
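Before detailing the construction, here is a minimal sketch (assumed helper names; not the paper's code) of the block-structured map described in eqn. (20) below, for the case where the coarsened automorphism group A_{G'} is trivial, together with a numerical equivariance check.

```python
# A minimal sketch (assumed names, not the authors' implementation) of a linear map
# equivariant to S_{c_1} x ... x S_{c_M} (within-cluster permutations), i.e. the block
# form of eqn. (20) below with trivial coarsened automorphism group A_{G'}.
import numpy as np

def build_coarsening_equivariant_map(cluster_sizes, a, b, e):
    """cluster_sizes: [c_1, ..., c_M] with nodes ordered by cluster.
    a, b: length-M arrays; e: M x M array (its diagonal is unused).
    Diagonal block k:   a[k] * I + b[k] * ones(c_k, c_k)
    Off-diagonal (k,l): e[k, l] * ones(c_k, c_l)."""
    blocks = []
    for k, ck in enumerate(cluster_sizes):
        row = []
        for l, cl in enumerate(cluster_sizes):
            if k == l:
                row.append(a[k] * np.eye(ck) + b[k] * np.ones((ck, ck)))
            else:
                row.append(e[k, l] * np.ones((ck, cl)))
        blocks.append(row)
    return np.block(blocks)

# Quick equivariance check with a random within-cluster permutation.
rng = np.random.default_rng(0)
sizes = [3, 2, 4]
F = build_coarsening_equivariant_map(sizes, rng.normal(size=3), rng.normal(size=3),
                                     rng.normal(size=(3, 3)))
# Permute nodes inside each cluster only.
offsets = np.cumsum([0] + sizes[:-1])
perm = np.concatenate([off + rng.permutation(c) for off, c in zip(offsets, sizes)])
P = np.eye(sum(sizes))[perm]
x = rng.normal(size=sum(sizes))
assert np.allclose(F @ (P @ x), P @ (F @ x))   # f(Pi x) = Pi f(x)
```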
Recall that the symmetry group with M clusters of G (with the associated coarsened graph G′) is given by G_{G′G} := (S_{c_1} × S_{c_2} × ... × S_{c_M}) ⋊ A_{G′} ≤ S_N. We first assume that A_{G′} is trivial and show how to parameterize equivariant functions with respect to products of permutations. Then we discuss more general cases, for instance when A_{G′} is abelian. For ease of exposition, we consider X = ℝ^N, Y = ℝ^N. Suppose A_{G′} is trivial. We claim that a linear function f : ℝ^N → ℝ^N is equivariant with respect to G_{G′G} if and only if it admits the following block-matrix form:
$f = \begin{pmatrix} f_{11} & f_{12} & \cdots & f_{1M}\\ f_{21} & f_{22} & \cdots & f_{2M}\\ \vdots & \vdots & \ddots & \vdots\\ f_{M1} & f_{M2} & \cdots & f_{MM}\end{pmatrix}, \qquad f_{kk} = a_k I_{c_k} + b_k \mathbf{1}_{c_k}\mathbf{1}_{c_k}^\top, \qquad f_{kl} = e_{kl}\,\mathbf{1}_{c_k}\mathbf{1}_{c_l}^\top \ \text{ for } k \neq l,$ (20)
where the f_{kl} are block matrices, I_{c_k} is the c_k × c_k identity matrix, 1_{c_k} is the length-c_k all-ones vector, and a_k, b_k, e_{kl} are scalars. The subscript c_k denotes the size of the k-th cluster. Figure 5 illustrates the block structure of f. We provide a proof sketch of our claim. Since A_{G′} is trivial by assumption, it remains to show that any function equivariant to the product of permutations S_{c_1} × S_{c_2} × ... × S_{c_M} admits the form in eqn. (20). We justify this as follows: 1. (within-cluster) f_{kk} is a linear permutation-equivariant function if and only if its diagonal elements are all equal and its off-diagonal elements are all equal ([34, Lemma 3]); 2. (across-cluster) f_{kl} for k ≠ l is a constant matrix since nodes within a cluster are indistinguishable; moreover, the e_{kl} are unconstrained since the coarsened symmetry A_{G′} is trivial.

Figure 5: The block structure of an equivariant linear function f : ℝ^N → ℝ^N with respect to G_{G′G} (where G, G′ are asymmetric): each diagonal block f_{kk} is constant on the diagonal and constant off the diagonal; each off-diagonal block f_{kl} is a constant matrix.

As an example, we illustrate the equivariant linear layer for two-cluster graph coarsening. Without loss of generality, assume that the node signals X are ordered according to the cluster assignment (e.g., X[1:|V_1|] are the node features of the first cluster, etc.). Let X_{(1)}, X_{(2)} denote the node features of the first and second cluster, W^s_{(1)}, W^s_{(2)} the weights on the block diagonal for self-feature transformation, W^n_{(1)}, W^n_{(2)} the weights on the block diagonal for within-cluster neighbors, and W^n_{(12)}, W^n_{(21)} the weights off the block diagonal for across-cluster neighbors. Let I denote the identity matrix, 1_{(1)}, 1_{(2)} the all-ones matrices with the same size as the corresponding cluster, and 1_{(12)}, 1_{(21)} the all-ones matrices mapping across clusters. Recall ⊙ denotes the element-wise multiplication of two matrices. Then the equivariant linear layer is parameterized as
$\begin{pmatrix} I & 0\\ 0 & I\end{pmatrix}\begin{pmatrix} X_{(1)}W^s_{(1)}\\ X_{(2)}W^s_{(2)}\end{pmatrix} + \left(\begin{pmatrix} 1_{(1)} & 0\\ 0 & 1_{(2)}\end{pmatrix} - I\right)\begin{pmatrix} X_{(1)}W^n_{(1)}\\ X_{(2)}W^n_{(2)}\end{pmatrix} + \begin{pmatrix} 0 & 1_{(12)}\\ 1_{(21)} & 0\end{pmatrix}\begin{pmatrix} X_{(1)}W^n_{(12)}\\ X_{(2)}W^n_{(21)}\end{pmatrix}.$
For cases where A_{G′} is nontrivial: if A_{G′} is abelian, we can use a construction by Serre ([83], Section 8.2). Observe that the symmetry of A_{G′} effectively constrains the patterns of the e_{kl} in equation (20). We provide an example where A_{G′} = S_2 in our image inpainting experiment (see Section 5.1), and defer the investigation of general cases to future work.

B Proofs of Our Theoretical Results
B.1 Proofs of Generalization with Exact Symmetries
Lemma 1 (Risk Gap). Let X = ℝ^{N×d}, Y = ℝ^{N×k} be the input and output graph signal spaces on a fixed graph G. Let X ∼ µ, where µ is an S_N-invariant distribution on X. Let Y = f*(X) + ξ, where ξ ∈ ℝ^{N×k} is random, independent of X with zero mean and finite variance, and f* : X → Y is A_G-equivariant.
Then, for any f ∈ V and for any compact group G ≤ S_N, we can decompose f as f = f_G + f_{G^⊥}, where f_G = Q_G f and f_{G^⊥} = f − f_G. Moreover, the risk gap satisfies
$\Delta(f, f_G) = \mathbb{E}\|Y - f(X)\|_F^2 - \mathbb{E}\|Y - f_G(X)\|_F^2 = \underbrace{-2\langle f^*, f_{G^\perp}\rangle_\mu}_{\text{mismatch}} + \underbrace{\|f_{G^\perp}\|_\mu^2}_{\text{constraint}}.$
Lemma 1 is a straightforward extension of Lemma 6 in [8], which makes use of Lemma 1 in [8].

Lemma 1 in [8]. Let U be any subspace of V that is closed under Q. Define the subspaces S and A of, respectively, the G-symmetric and G-anti-symmetric functions in U: S = {f ∈ U : f is G-equivariant} and A = {f ∈ U : Qf = 0}. Then U admits an orthogonal decomposition U = S ⊕ A into symmetric and anti-symmetric parts.

Proof of Lemma 1. The first part of Lemma 1, f = f_G + f_{G^⊥}, follows from Lemma 1 in [8]. For the second part, by the assumption that the noise ξ is independent of X with zero mean and finite variance, we can simplify the risk gap as
$\Delta(f, f_G) := \mathbb{E}\|Y - f(X)\|_F^2 - \mathbb{E}\|Y - f_G(X)\|_F^2 = \mathbb{E}\|f^*(X) - f(X)\|_F^2 - \mathbb{E}\|f^*(X) - f_G(X)\|_F^2.$ (22)
Substituting f = f_G + f_{G^⊥} yields
$\mathbb{E}\|f^*(X) - f_G(X) - f_{G^\perp}(X)\|_F^2 - \mathbb{E}\|f^*(X) - f_G(X)\|_F^2 = -2\langle f^*(X) - f_G(X), f_{G^\perp}(X)\rangle_\mu + \mathbb{E}\|f_{G^\perp}(X)\|_F^2 = -2\langle f^*, f_{G^\perp}\rangle_\mu + \|f_{G^\perp}\|_\mu^2,$ (23)
where the last equality uses ⟨f_G, f_{G^⊥}⟩_µ = 0. We remark that Lemma 6 in [8] assumes that f* is G-equivariant, so the first term in (23) vanishes. We are motivated by the symmetry model selection problem, and therefore relax this assumption: the chosen symmetry group G may differ from the target symmetry group A_G.

Theorem 2 (Bias-Variance Tradeoff). Let X = ℝ^{N×d}, Y = ℝ^{N×k} be the graph signal spaces on a fixed graph G. Let S_N act on X and Y by permuting the rows, with representations ϕ and ψ. Let G be a subgroup of S_N acting with restricted representations ϕ|_G on X and ψ|_G on Y. Let X_{[i,j]} ∼ i.i.d. N(0, σ²_X) and Y = f*(X) + ξ, where f*(x) = Θ*⊤x is A_G-equivariant and Θ* ∈ ℝ^{Nd×Nk}. Assume ξ_{[i,j]} is random, independent of X, with mean 0 and E[ξξ⊤] = σ²_ξ I < ∞. Let Θ̂ be the least-squares estimate of Θ* from n i.i.d. examples {(X_i, Y_i) : i = 1, ..., n}, and let Ψ_G(Θ̂) be its equivariant version with respect to G. Let $(\chi_{\psi|_G} \mid \chi_{\phi|_G}) = \int_G \chi_{\psi|_G}(g)\chi_{\phi|_G}(g)\,d\lambda(g)$ denote the inner product of the characters. If n > Nd + 1, the risk gap is
$\mathbb{E}\big[\Delta(f_{\hat\Theta}, f_{\Psi_G(\hat\Theta)})\big] = \underbrace{-\,\sigma_X^2\,\|\Psi_G^{\perp}(\Theta^*)\|_F^2}_{\text{bias}} + \underbrace{\frac{\sigma_\xi^2\,\big(N^2 dk - dk\,(\chi_{\psi|_G} \mid \chi_{\phi|_G})\big)}{n - Nd - 1}}_{\text{variance}}.$

Theorem 2 presents the risk gap in expectation; it follows from Lemma 1 by taking f to be the least-squares estimator and using the assumptions of the linear regression setting. To this end, we denote by X ∈ ℝ^{n×Nd}, Y ∈ ℝ^{n×Nk}, ξ ∈ ℝ^{n×Nk} the training data arranged in matrix form, where Y = f*(X) + ξ. Recall that the least-squares estimator of Θ* in the classical regime (n > Nd) is given by
$\hat\Theta := (X^\top X)^{\dagger}X^\top Y \overset{a.e.}{=} \Theta^* + (X^\top X)^{-1}X^\top\xi,$ (24)
while its equivariant projection is
$\Psi_G(\hat\Theta) = \int_G \phi(g)\,\hat\Theta\,\psi(g^{-1})\,d\lambda(g).$ (25)
Our proof makes use of the following results from [8], which we restate here in adapted form for our setting.

Proposition 11 in [8]. Let V = {f_W : f_W(x) = W⊤x, W ∈ ℝ^{d×k}, x ∈ ℝ^d} denote the space of linear functions. Let X ∼ µ with E[XX⊤] = Σ. For any linear functions f_{W_1}, f_{W_2} ∈ V, the inner product on V satisfies
$\langle f_{W_1}, f_{W_2}\rangle_\mu = \mathrm{Tr}(W_1^\top \Sigma W_2).$ (26)

Theorem 13 in [8] (Simplified, Adapted). Consider the same setting as Theorem 2. For n > Nd + 1,
$\sigma_X^2\,\mathbb{E}\,\big\|\Psi_G^\perp\big((X^\top X)^{\dagger}X^\top\xi\big)\big\|_F^2 = \frac{\sigma_\xi^2\,\big(N^2 dk - dk\,(\chi_{\psi|_G} \mid \chi_{\phi|_G})\big)}{n - Nd - 1}.$

Proof of Theorem 2. We first plug the least-squares expressions Θ̂ and Ψ_G(Θ̂) into Lemma 1 and treat the mismatch term and the constraint term separately; we complete the proof by collecting common terms. For the mismatch term, our goal is to compute
$-2\,\mathbb{E}\big[\langle \Theta^*, \hat\Theta - \Psi_G(\hat\Theta)\rangle_\mu\big],$ (27)
where the expectation is taken over the test point X and the training data X, ξ. To that end, we write
$(\hat\Theta - \Psi_G(\hat\Theta))^\top x \overset{a.e.}{=} \Theta^{*\top}x + \xi^\top X(X^\top X)^{-1}x - \int_G \psi(g^{-1})\big[\Theta^{*\top} + \xi^\top X(X^\top X)^{-1}\big]\phi(g)\,x\,d\lambda(g).$ (28)
Taking expectations yields
$\mathbb{E}_{X,\mathbf{X},\xi}\big[\langle \Theta^*, \hat\Theta - \Psi_G(\hat\Theta)\rangle_\mu\big] = \|\Theta^*\|_\mu^2 + \mathbb{E}\big\langle \Theta^{*\top}x,\ \xi^\top X(X^\top X)^{-1}x\big\rangle - \mathbb{E}\Big\langle \Theta^{*\top}x,\ \int_G \psi(g^{-1})\big[\Theta^{*\top} + \xi^\top X(X^\top X)^{-1}\big]\phi(g)x\,d\lambda(g)\Big\rangle.$ (29)
Note that ξ is independent of X with mean 0, so the second term in (29) vanishes. Similarly, the part $\mathbb{E}\langle \Theta^{*\top}x, \int_G \psi(g^{-1})\,\xi^\top X(X^\top X)^{-1}\phi(g)x\,d\lambda(g)\rangle$ also vanishes (by first taking the conditional expectation of ξ given X). Thus, we arrive at
$\mathbb{E}\big[\langle \Theta^*, \hat\Theta - \Psi_G(\hat\Theta)\rangle_\mu\big] = \|\Theta^*\|_\mu^2 - \mathbb{E}_x\Big\langle \Theta^{*\top}x, \int_G \psi(g^{-1})\Theta^{*\top}\phi(g)x\,d\lambda(g)\Big\rangle = \|\Theta^*\|_\mu^2 - \langle \Theta^*, \Psi_G(\Theta^*)\rangle_\mu = \|\Psi_G^\perp(\Theta^*)\|_\mu^2 = \sigma_X^2\,\|\Psi_G^\perp(\Theta^*)\|_F^2,$ (30)
where the last equality follows from Proposition 11 in [8] with Σ = σ²_X I; hence the mismatch term (27) equals −2σ²_X‖Ψ_G^⊥(Θ*)‖²_F. This finishes the computation of the mismatch term. Now for the constraint term, we have
$\|f_{G^\perp}\|_\mu^2 = \|\Psi_G^\perp(\hat\Theta)\|_\mu^2$ (31)
$= \sigma_X^2\,\mathbb{E}_{\mathbf{X},\xi}\big\|\Psi_G^\perp\big(\Theta^* + (X^\top X)^{-1}X^\top\xi\big)\big\|_F^2$ (32)
$= \sigma_X^2\,\|\Psi_G^\perp(\Theta^*)\|_F^2 + \sigma_X^2\,\mathbb{E}_{\mathbf{X},\xi}\big\|\Psi_G^\perp\big((X^\top X)^{-1}X^\top\xi\big)\big\|_F^2,$ (33)
where the last equality follows from linearity of expectation, E[ξ] = 0, and ξ being independent of X. Combining the mismatch term in (30) with the constraint term in (33), the risk gap becomes
$\mathbb{E}\big[\Delta(f_{\hat\Theta}, f_{\Psi_G(\hat\Theta)})\big] = -\sigma_X^2\,\|\Psi_G^\perp(\Theta^*)\|_F^2 + \sigma_X^2\,\mathbb{E}_{\mathbf{X},\xi}\big\|\Psi_G^\perp\big((X^\top X)^{-1}X^\top\xi\big)\big\|_F^2.$ (34)
Applying Theorem 13 in [8], the second term in (34) reduces to
$\sigma_X^2\,\mathbb{E}_{\mathbf{X},\xi}\big\|\Psi_G^\perp\big((X^\top X)^{-1}X^\top\xi\big)\big\|_F^2 = \frac{\sigma_\xi^2\,\big(N^2 dk - dk\,(\chi_{\psi|_G} \mid \chi_{\phi|_G})\big)}{n - Nd - 1},$ (35)
from which the theorem follows immediately.

Finally, we state a well-known result on the risk of the (ordinary) least-squares estimator (see [84, 85] and references therein).
Lemma 6 (Risk of the Least-Squares Estimator). Consider the same setup as Theorem 2. For n > Nd + 1,
$\mathbb{E}\big[\|Y - \hat\Theta^\top X\|_F^2\big] = \frac{\sigma_\xi^2\,Nd}{n - Nd - 1} + \sigma_\xi^2.$

Proof of Lemma 6. Recall that X, Y denote the test sample. We denote the risk of the least-squares estimator conditional on the training data X ∈ ℝ^{n×Nd} by R(Θ̂ | X), which has the following bias-variance decomposition:
$R(\hat\Theta \mid \mathbf{X}) = \mathbb{E}\big[\|Y - \hat\Theta^\top X\|_F^2 \mid \mathbf{X}\big]$ (36)
$= \mathbb{E}\big[\|\Theta^{*\top}X + \xi - \hat\Theta^\top X\|_F^2 \mid \mathbf{X}\big]$ (37)
$= \mathbb{E}\big[\|(\Theta^* - \hat\Theta)^\top X\|_F^2 \mid \mathbf{X}\big] + \sigma_\xi^2,$ (38)
where the last equality follows from ξ being zero-mean and independent of X. The second term σ²_ξ is also known as the irreducible error. We decompose the first term into
$\mathbb{E}\big[\|(\Theta^* - \hat\Theta)^\top X\|_F^2 \mid \mathbf{X}\big] = \mathbb{E}\big[\|(\Theta^* - \mathbb{E}[\hat\Theta])^\top X\|_F^2 + \|(\mathbb{E}[\hat\Theta] - \hat\Theta)^\top X\|_F^2 \mid \mathbf{X}\big].$ (39)
Recall that Θ̂ \overset{a.e.}{=} (X⊤X)^{-1}X⊤Y = (X⊤X)^{-1}X⊤(XΘ* + ξ) = Θ* + (X⊤X)^{-1}X⊤ξ. Thus E[Θ̂] = Θ* and (39) simplifies to $\mathbb{E}[\|(\mathbb{E}[\hat\Theta] - \hat\Theta)^\top X\|_F^2 \mid \mathbf{X}]$. We finish computing the risk by taking the expectation over X and using E[Θ̂] − Θ̂ = −(X⊤X)^{-1}X⊤ξ:
$\mathbb{E}\big[\|Y - \hat\Theta^\top X\|_F^2\big] = \mathbb{E}\big[R(\hat\Theta \mid \mathbf{X})\big]$ (40)
$= \mathbb{E}_X\big[\mathbb{E}_{\mathbf{X},\xi}\big[\|(\mathbb{E}[\hat\Theta] - \hat\Theta)^\top X\|_F^2 \mid \mathbf{X}\big]\big] + \sigma_\xi^2$ (41)
$= \mathbb{E}\big[\|\big((X^\top X)^{-1}X^\top\xi\big)^\top X\|_F^2\big] + \sigma_\xi^2$ (42)
$= \sigma_\xi^2\,\mathrm{tr}\big(\mathbb{E}[(X^\top X)^{-1}]\,\sigma_X^2 I\big) + \sigma_\xi^2.$ (43)
By [86, Lemma 2.3], for n > Nd + 1, $\mathbb{E}[(X^\top X)^{-1}] = \frac{1}{\sigma_X^2(n - Nd - 1)}I$, so that $\mathrm{tr}\big(\mathbb{E}[(X^\top X)^{-1}]\,\sigma_X^2 I\big) = \frac{Nd}{n - Nd - 1}$. Putting this into (43) completes the proof.

B.2 Proofs of Generalization with Approximate Symmetries
In Definition 3, we construct the symmetry group of G induced by the coarsening G′ via a semidirect product, G_{G′G} = (S_{c_1} × S_{c_2} × ... × S_{c_M}) ⋊ A_{G′}. We explain the construction in more detail here. We first recall the definition of the semidirect product. Given two groups G_1, G_2 and a group homomorphism φ : G_2 → aut(G_1), we can construct a new group G_1 ⋊_φ G_2, called the semidirect product of G_1 and G_2 with respect to φ, as follows: 1. the underlying set is the Cartesian product G_1 × G_2; 2. the group operation · : (G_1 ⋊_φ G_2) × (G_1 ⋊_φ G_2) → (G_1 ⋊_φ G_2) is determined by the homomorphism φ via
(g_1, g_2) · (g_1′, g_2′) = (g_1 φ_{g_2}(g_1′), g_2 g_2′), for g_1, g_1′ ∈ G_1 and g_2, g_2′ ∈ G_2.
Take G_1 = S_{c_1} × S_{c_2} × ... × S_{c_M} and G_2 = A_{G′}. Note that G_1 is a normal subgroup of G_{G′G}; namely, for all s ∈ G_{G′G} and g_1 ∈ G_1, we have s g_1 s^{-1} ∈ G_1. Thus, the map g_1 ↦ s g_1 s^{-1} is an automorphism of G_1.
In particular, the homomorphism φ_{g_2}(g_1) = g_2 g_1 g_2^{-1} for g_2 ∈ G_2 describes the action of G_2 on G_1 by conjugation². In the context of a graph G with N nodes and its coarsening G′ with M clusters, the homomorphism φ describes how across-cluster permutations in G′ act on the original graph G. Note that in the special case where A_{G′} acts transitively on the coarsened nodes, we recover the wreath product, a special kind of semidirect product. Our construction of G_{G′G} can thus be seen as a natural generalization of the wreath product.

² In our context, it is sensible to write φ_{g_2}(g_1) = g_2 g_1 g_2^{-1}, given that G_1, G_2 are originally inside a common group S_N. Yet the semidirect product applies to two arbitrary groups not necessarily inside a common group, where φ_{g_2}(g_1) is an abstraction of g_2 g_1 g_2^{-1}.

Corollary 3 (Risk Gap via Graph Coarsening). Let X = ℝ^{N×d}, Y = ℝ^{N×k} be the input and output graph signal spaces on a fixed graph G. Let X ∼ µ, where µ is an S_N-invariant distribution on X. Let Y = f*(X) + ξ, where ξ ∈ ℝ^{N×k} is random, independent of X with zero mean and finite variance, and f* : ℝ^{N×d} → ℝ^{N×k} is an approximately equivariant mapping with equivariance rate κ. Then, for any G′ that coarsens G up to error ϵ and for any f ∈ V, we have
$\Delta(f, f_{G_{G'G}}) = \underbrace{-2\langle f^*, f_{G_{G'G}^\perp}\rangle_\mu}_{\text{mismatch}} + \underbrace{\|f_{G_{G'G}^\perp}\|_\mu^2}_{\text{constraint}} \;\geq\; -2\kappa(\epsilon)\,\|f_{G_{G'G}^\perp}\|_\mu + \|f_{G_{G'G}^\perp}\|_\mu^2.$

Proof of Corollary 3. We start by simplifying the mismatch term in Lemma 1:
$-2\,\mathbb{E}\big[\langle f^*(x), f_{G_{G'G}^\perp}(x)\rangle\big] = -2\,\mathbb{E}\big[\langle f^*(x) - f^*_{G_{G'G}}(x) + f^*_{G_{G'G}}(x),\ f_{G_{G'G}^\perp}(x)\rangle\big] = -2\,\mathbb{E}\big[\langle \underbrace{f^*(x) - f^*_{G_{G'G}}(x)}_{\text{anti-symmetric part of } f^*},\ f_{G_{G'G}^\perp}(x)\rangle\big]$
$\geq -2\,\|f^* - f^*_{G_{G'G}}\|_\mu\,\|f_{G_{G'G}^\perp}\|_\mu$ (by the Cauchy-Schwarz inequality)
$\geq -2\,\kappa(\epsilon)\,\|f_{G_{G'G}^\perp}\|_\mu$ (by Definition 4 of an approximately equivariant map).
Putting this together with the constraint term completes the proof.

Corollary 4 (Bias-Variance Tradeoff via Graph Coarsening). Consider the same linear regression setting as in Theorem 2, except that now f* is an approximately equivariant mapping with equivariance rate κ, and G = G_{G′G} is controlled by a coarsening G′ of G with error ϵ. Denote the canonical permutation representations of G_{G′G} on X, Y by ϕ′, ψ′, respectively. Let $(\chi_{\psi'} \mid \chi_{\phi'}) = \int_{G_{G'G}} \chi_{\psi'}(g)\chi_{\phi'}(g)\,d\lambda(g)$ denote the inner product of the characters. If n > Nd + 1, the risk gap is bounded by
$\mathbb{E}\big[\Delta(f_{\hat\Theta}, f_{\Psi_{G_{G'G}}(\hat\Theta)})\big] \;\geq\; -2\kappa(\epsilon)\sqrt{\frac{\sigma_\xi^2\big(N^2dk - dk\,(\chi_{\psi'} \mid \chi_{\phi'})\big)}{n - Nd - 1}} + \frac{\sigma_\xi^2\big(N^2dk - dk\,(\chi_{\psi'} \mid \chi_{\phi'})\big)}{n - Nd - 1}.$

Proof of Corollary 4. It follows immediately from applying Theorem 13 in [8] to Corollary 3 with G = G_{G′G}.

C.1 Example: Gaussian data model with approximate symmetries
Example 3.1 considers G = S_3 versus G = S_2, with X = ℝ³, Y = ℝ³ and x ∼ N(0, σ²_X I). The target function is linear, i.e., f*(x) = Θ*⊤x for some Θ* ∈ ℝ^{3×3}. In other words, we are learning linear functions on a fixed graph domain with 3 nodes. Suppose the target function is S_2-equivariant, so that it has the form
$\Theta^* = \begin{pmatrix} a & b & c\\ b & a & c\\ d & d & e\end{pmatrix}, \qquad a, b, c, d, e \in \mathbb{R}.$ (44)
Now we project Θ* in (44) onto the S_3-equivariant space using the intertwining average (6) with the canonical permutation representation of S_3. A direct calculation yields
$\Psi_{S_3}(\Theta^*) = \begin{pmatrix} \tfrac{1}{3}(2a+e) & \tfrac{1}{3}(b+c+d) & \tfrac{1}{3}(b+c+d)\\ \tfrac{1}{3}(b+c+d) & \tfrac{1}{3}(2a+e) & \tfrac{1}{3}(b+c+d)\\ \tfrac{1}{3}(b+c+d) & \tfrac{1}{3}(b+c+d) & \tfrac{1}{3}(2a+e)\end{pmatrix},$
$\Psi^\perp_{S_3}(\Theta^*) = \Theta^* - \Psi_{S_3}(\Theta^*) = \begin{pmatrix} \tfrac{1}{3}(a-e) & \tfrac{1}{3}(2b-c-d) & \tfrac{1}{3}(-b+2c-d)\\ \tfrac{1}{3}(2b-c-d) & \tfrac{1}{3}(a-e) & \tfrac{1}{3}(-b+2c-d)\\ \tfrac{1}{3}(-b-c+2d) & \tfrac{1}{3}(-b-c+2d) & \tfrac{1}{3}(-2a+2e)\end{pmatrix}.$
Therefore, the bias term evaluates to
$\sigma_X^2\,\|\Psi^\perp_{S_3}(\Theta^*)\|_F^2 = \sigma_X^2\Big(\frac{2(a-e)^2}{3} + \frac{2(-2b+c+d)^2}{9} + \frac{2(b-2c+d)^2}{9} + \frac{2(b+c-2d)^2}{9}\Big).$ (47)
For the variance term, recall that χ_{ψ_{S_3}}, χ_{ϕ_{S_3}} are both the character of the canonical permutation representation of S_3; we have
$(\chi_{\psi_{S_3}} \mid \chi_{\phi_{S_3}}) = \frac{1}{6}(3^2 + 1^2 + 1^2 + 1^2 + 0^2 + 0^2) = 2.$ (48)
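As a sanity check of these quantities, the following short script (not from the paper) recomputes the intertwining average Ψ_{S3}(Θ*), the bias expression (47), and the character inner products for S_3 and S_2 numerically; the function and variable names are our own.

```python
# A small numerical check (not the paper's code) of the quantities in Example 3.1:
# the intertwining average Psi_G(Theta) = (1/|G|) sum_g P_g^T Theta P_g for the
# permutation representation, and the character inner products (chi | chi).
import itertools
import numpy as np

def perm_mat(p):
    n = len(p)
    M = np.zeros((n, n))
    M[np.arange(n), p] = 1.0
    return M

def project(theta, group):
    """Group-averaging (intertwining) projection onto G-equivariant linear maps."""
    return sum(P.T @ theta @ P for P in group) / len(group)

S3 = [perm_mat(p) for p in itertools.permutations(range(3))]
S2 = [perm_mat(p) for p in [(0, 1, 2), (1, 0, 2)]]   # swap the first two nodes

a, b, c, d, e = 1.0, 2.0, 3.0, 4.0, 5.0
Theta = np.array([[a, b, c], [b, a, c], [d, d, e]])  # S2-equivariant target, eq. (44)

# Projection onto S3-equivariant maps: constant diagonal (2a+e)/3, constant off-diagonal (b+c+d)/3.
print(project(Theta, S3))
# Bias term ||Psi_perp(Theta)||_F^2, compared against eq. (47) (with sigma_X^2 = 1).
print(np.linalg.norm(Theta - project(Theta, S3))**2)
print(2*(a - e)**2/3 + 2*(-2*b + c + d)**2/9 + 2*(b - 2*c + d)**2/9 + 2*(b + c - 2*d)**2/9)

# Character inner products of the permutation representation restricted to S3 and S2.
chi = lambda group: np.array([np.trace(P) for P in group])
print(np.mean(chi(S3)**2))   # (1/6)(3^2+1^2+1^2+1^2+0^2+0^2) = 2
print(np.mean(chi(S2)**2))   # (1/2)(3^2+1^2) = 5
```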
Figure 6: Choosing the symmetry group corresponding to the target function usually yields the best generalization ((a), (b), (d)), but not always: when the number of training samples n is small and the target function f* is approximately equivariant with respect to a larger group, choosing the larger symmetry group can yield a further generalization gain, as shown empirically in (c). The dashed gray vertical line highlights the theoretical threshold n ≈ 35, before which using S_3 yields better generalization than S_2, validating our theoretical analysis. We set σ²_X = 1, σ²_ξ = 1/64, conduct 10 random runs, and compute the generalization error on 300 test points. We obtain the estimators via stochastic gradient descent and enforce the symmetry by tying weights. The title of each subplot indicates the symmetry of the target function and displays the target function values.

Therefore, the variance term evaluates to
$\frac{\sigma_\xi^2\big(N^2 - (\chi_{\psi|_{S_3}} \mid \chi_{\phi|_{S_3}})\big)}{n - N - 1} = \frac{7\,\sigma_\xi^2}{n - 4}.$ (49)
Putting (47) and (49) together yields the generalization gap of the least-squares estimator f_Θ̂ compared to its S_3-equivariant version f_{Ψ_{S_3}(Θ̂)}. As a comparison, when choosing the symmetry group of the target function, G = S_2, the bias vanishes, and since $(\chi_{\psi_{S_2}} \mid \chi_{\phi_{S_2}}) = \frac{1}{2}(3^2 + 1^2) = 5$, the generalization gap is
$\mathbb{E}\big[\Delta(f_{\hat\Theta}, f_{\Psi_{S_2}(\hat\Theta)})\big] = \frac{4\,\sigma_\xi^2}{n - 4}.$ (50)
We see that choosing G = S_3 is better if a ≈ e, b ≈ c ≈ d (i.e., f* is approximately S_3-equivariant) and the training sample size n is small, whereas S_2 is better in the opposite regime. This analysis illustrates the advantage of choosing a (suitably) larger symmetry group: it induces a smaller hypothesis class when learning with limited data, and it introduces a useful inductive bias when the target function is approximately symmetric with respect to the larger group. We further illustrate our theoretical analysis via simulations, with details and results shown in Figure 6.

C.2 Example: Approximately Equivariant Mapping on a Geometric Graph
In this section, we illustrate a construction of an approximately equivariant mapping. We focus on a version of Definition 3 that does not take into account the symmetries of G′. Namely, we consider a definition of the approximate symmetries as G_{G′G} := S_{c_1} × S_{c_2} × ... × S_{c_M} ≤ S_N. Equivalently, we restrict the analysis to coarsening graphs G′ that are asymmetric.
Background from graphon-signal analysis. To support our construction, we cite some definitions and results from [79].
Definition 8. Let r > 0. The graphon-signal space with signals bounded by r is WL_r := W × L^∞_r[0,1], where L^∞_r[0,1] is the ball of radius r in L^∞[0,1]. The distance in WL_r is defined for (W, s), (V, g) ∈ WL_r by
$d_\square\big((W, s), (V, g)\big) := \big\|(W, s) - (V, g)\big\|_\square := \|W - V\|_\square + \|s - g\|_1.$
Moreover,
$\delta_\square\big((W, s), (V, g)\big) = \inf_\phi d_\square\big((W, s), (V^\phi, g^\phi)\big),$
where g^ϕ(x) = g(ϕ(x)) and ϕ ranges over measure-preserving bijections.
Any graph-signal induces a graphon-signal in the natural way, as in Definition 1. The cut norm and cut distance between two graph-signals are defined as the cut norm and cut distance between the two induced graphon-signals, respectively. Similarly, the L1 distance between a signal q on a graph and a signal s on [0,1] is defined as the L1 distance between the signal induced by q and s. The supremum in the definition of the cut distance between two induced graphon-signals is realized by some measure-preserving bijection.
Sampling graphon-signals. The following construction is from [79, Section 3.4]. Let Λ = (λ_1, ..., λ_N) ∈ [0,1]^N be N independent uniform random samples from [0,1], and let (W, s) ∈ WL_r.
We define the random weighted graph W(Λ) as the weighted graph with N nodes and edge weight wi,j = W(λi, λj) between node i and node j. We similarly define the random sampled signal s(Λ) with value si = s(λi) at each node i. Note that W(Λ) and s(Λ) share the sample points Λ. We then define a random simple graph as follows. We treat each wi,j = W(λi, λj) as the parameter of a Bernoulli variable ei,j, where P(ei,j = 1) = wi,j and P(ei,j = 0) = 1 wi,j. We define the random simple graph G(W, Λ) as the simple graph with an edge between each node i and node j if and only if ei,j = 1. The following theorem is [79, Theorem 3.6] Theorem 3.6 from [79] (Sampling lemma for graphon-signals). Let r > 1. There exists a constant N0 > 0 that depends on r, such that for every N N0, every (W, s) WLr, and for Λ = (λ1, . . . λN) [0, 1]N independent uniform random samples from [0, 1], we have E δ W, s , G(W, Λ), s(Λ) < 15 p log(N) . (51) By Markov s inequality and (51), for any 0 < p < 1, there is an event of probability 1 p (regarding the choice of Λ) in which δ W, s , G(W, Λ), s(Λ) < 15 log(N) . (52) Stability to deformations of mappings on geometric graphs. Let M be a metric space with an atomless standard probability measure defined over the Borel sets (up to completion of the measure). Such a probability space is equivalent to the standard probabiltiy space [0, 1] with Lebesgue measure. Namely, there are co-null sets A M and B [0, 1], and a measure preserving bijection ϕ : A B. Hence, graphon analysis applied as-is when replacing the domain [0, 1] with M. Suppose that we are interested in a target function f M : L1(M) L1(M) that is stable to deformations in the following sense. Definition 9. Let ϵ > 0. A measurable bijection ν : M M is called a deformation up to ϵ, if there exists an event Bϵ M with probability greater than 1 ϵ such that for every x Bϵ d M ν(x), x < ϵ. The mapping f M : L1(M) L1(M) is called stable to deformations with stability constant C, if for any deformation ν up to ϵ, and every s L1(M), we have f M(s) f M(s ν) ν 1 1 < Cϵ. Suppose that we observe a discretized version of the domain M, defined as follows. There is a graphon W : M2 [0, 1] defined as W(x, y) = r d(x, y) , (53) where r : R+ [0, 1] is a decreasing function with support [0, ρ]. Instead of observing W, we observe a graph G = G(W, Λ) with node set [N], sampled from W on the random independent points Λ = {λn}N n=1 M as above. Suppose moreover that any graph signal is sampled from a signal in L1(M), on the same random points Λ, as above. Suppose that the target f M on the continuous domain is well approximated by some mapping f : L1[N] L1[N] on the discrete domain in the following sense. For every s L1(M), let s G be the graph signal sampled on the random samples {λn}n. Then there is an event of high probability such that f s G { f M(s) (xn)}n 1 < e for some small e. We hence consider f as the target mapping of the learning problem. One example of such a scenario is when there exists some Lipschitz continuous mapping Θ : WLr WLr with Lipschitz constant L, such that f M = Θ(W, ) and f = Θ(G, ). Indeed, by (52), for some p as small as we like, there is an event of probability 1 p in which, up to a measure preserving bijection, f Ms f s G 1 δ W, f Ms , G, f s G Lδ W, s , G, s G < 15L log(N) = e. (54) A concrete example is when Θ is a message passing neural network (MPNN) with Lipschitz continuous message and update functions, and normalized sum aggregation [79, Theorem 4.1]. Let G be a graph that coarsens G up to error ϵ. 
In the same event as above, by (52), up to a measure preserving bijection, δ (WG , W) δ (WG , WG) + δ (WG, W) ϵ + e = u. (55) We next show an approximation property that we state here informally: Since W(x, y) 0 for x away from y, we must have WG (x, y) 0 as well for a set of high measure. Otherwise, δ (WG , W) cannot be small. By this, any approximate symmetry of G is a small deformation, and, hence, f is an approximately equivariant mapping. Equivariant mappings on geometric graphs. In the following, we construct a scenario in which f can be shown to be approximately equivariant in a restricted sense. For simplicity, we assume f (s G) L2[0, 1], and restrict to the case r = 1[0,ρ] in the geometric graphon W of (53). Denote the induced graphon WG = T. Given h > 0, define the h-diagonal dh = {(x, y) M2 | d M(x, y) h}. In the following, all distances are assumed to be up to the best measure preserving bijection. If there is a domain S T M2 outside the ρ-diagonal in which T(x, y) > c for some c > 0, by reverse triangle inequality, we must have T T(x, y)dydx = cµ(S )µ(T ). Hence, since by (55), W T < u, for every S T that does not intersect dρ, we must have Z T T(x, y)dydx u. In other words, for any two sets S, T with distance more than ρ (infs S,t T dµ(s, t) > ρ), we have Z T T(x, y)dydx u. This formalizes the statement WG (x, y) 0 for x away from y from above. Next, we develop the analysis for the special case M = [0, 1] with the standard metric and Lebesgue probability measure. We note that the analysis can be extended to M = [0, 1]D for a general dimension D N. For every z [0, 1], we have Z 2] T(x, y)dydx u, 2,1] T(x, y)dydx u. Let ν > 0. We take a grid {xj} [0, 1] of spacing 2ν. The sets [ 2, 1] [0, xj ρ/ j [0, xj ρ/ 2] [xj + ρ/ cover dc ν (where dc ν is the complement of dν). Hence, dcν T(x, y)dydx 2] T(x, y)dydx 2,1] T(x, y)dydx 2ν u = t, for u t 1, namely, ν = t . For example, we may take t = ν = u2/3, assuming that ρ < u1/3. Hence, we have ZZ dc u2/3 T(x, y) To conclude, the probability of having an edge between nodes λi and λj in G N which are further away than u2/3, namely, d M(λi, λj) > u2/3, is less than Suppose that G is asymmetric. This means that symmetries of GG G can only permute between nodes that have an edge between them in the blown-up graph G N. The probability of having an edge between nodes further away than u2/3 is less than 2u1/3. Hence, a symmetry in GG G can be seen as a small deformation, where for each node λi and a random uniform g GG G , the probability that λi it is mapped by g to a node of distance less than u2/3 is more than 1 Any symmetry g in GG G induces a measure preserving bijection ν in M = [0, 1], by permuting the intervals of the partition PN of Definition 1. As a result, the set of points that are mapped further away than u2/3 under ν has probability upper bounded by 2u1/3, and symmetries in GG G can be seen as a small deformation ν according to Definition 9 (in high probability). This means that, for any g GG G , f M(s) f M(s g) g 1 1 < C so by the triangle inequality, combining with equation 54, we have f (s G) g 1f (gs G) 1 < 2e + C 2u1/3 = ϵ . (56) Equation (56) leads to f (s G) QGG G (f )(s G) 1 = f (s G) 1 |GG G | g GG G g 1f (gs G) 1 (57) g GG G f (s G) g 1f (gs G) 1 < ϵ . (58) Since for any q L2[0, 1] L [0, 1] we have q 2 2 q q 1, we can bound f (s G) QGG G (f )(s G) 2 < q f (s G) QGG G (f )(s G) f (s G) + QGG G (f )(s G) where the last inequality follows from QGG G (f )(s G) f (s G) . 
Denote $\bar f_\infty := \int \|f(s_G)\|_\infty\, d\mu(s_G)$, and suppose that $\bar f_\infty$ is finite. Hence, if µ is a probability measure, we have
$\big\|f - Q_{G_{G'G}}(f)\big\|_\mu < \sqrt{2\epsilon' \bar f_\infty}.$
This gives an example of an approximately equivariant mapping based on a random geometric graph, where the approximation rate is also a function of the size N of the graph and goes to zero as N → ∞ and ϵ → 0. In future work, we will extend this example to more general metric spaces M and to non-trivial symmetry groups A_{G′}. Intuitively, most random geometric graphs are close to asymmetric. This means that for most G′, the symmetries of A_{G′} can only permute nodes connected by an edge, and the same holds for the symmetries of G_{G′G}. For this, we need to extend Definition 9 by treating G′ probabilistically.

D Experiment Details
In this section, we provide additional details of our experiments. We first give a brief introduction to standard graph neural networks (Section D.1), followed by in-depth explanations of our applications to image inpainting (Section D.2), traffic flow prediction (Section D.3), and human pose estimation (Section D.4). All experiments were conducted on a server with 256 GB RAM and 4 NVIDIA RTX A5000 GPU cards.

D.1 Graph Neural Networks (GNNs)
We consider standard message-passing graph neural networks (MPNNs) [19–21], defined as follows. An L-layer MPNN maps an input X ∈ ℝ^{N×d} to an output Y ∈ ℝ^{N×k} following an iterative scheme: at initialization, h^{(0)} = X; at each iteration l, the embedding of node i is updated to
$h^{(l)}_i = \phi\Big(h^{(l-1)}_i,\ \sum_{j \in N(i)} \psi\big(h^{(l-1)}_i, h^{(l-1)}_j, A_{[i,j]}\big)\Big),$
where ϕ, ψ are the update and message functions, N(i) denotes the neighbors of node i, and A_{[i,j]} is the (i, j)-edge weight. MPNNs typically have two key design features: (1) ϕ, ψ are shared across all nodes in the graph and are typically chosen to be a linear transformation or a multi-layer perceptron (MLP), known as global weight sharing; (2) the graph A is used for (spatial) convolution.

D.2 Application: Image Inpainting
We provide additional details of the data, model, and optimization procedure used in the experiment.
Data. We consider the datasets MNIST [74] and Fashion-MNIST [75]. For each dataset, we take 100 training samples and 1000 test samples via stratified random sampling. The input and output graph signals are (m_i ⊙ x_i, x_i), where ⊙ is entrywise multiplication. Here x_i ∈ ℝ^{28×28} ≅ ℝ^{784} denotes the image signal and m_i denotes a random mask (of size 14×14 for MNIST and 20×20 for Fashion-MNIST). For experiments with reflection symmetry on the coarsened graph, we further transform each image in the Fashion-MNIST subset by a horizontal flip with probability 0.5 (Fashion-MNIST+hflip). We remark that it is possible to model the dihedral group D_4 of the square as the coarsened graph symmetry, yet we only consider the reflection symmetry S_2 to simplify the parameterization (since it is abelian while D_4 is not) while capturing the most relevant symmetries.
Model. We consider a 2-layer G_{G′G}-equivariant network, G-Net, a composition f_out ∘ ReLU ∘ f_in, where f_in, f_out denote the input and output equivariant linear layers. The input and output feature dimension is 1 (since the signals are grayscale images), and the hidden dimension is set to 28. For comparison, we also consider a simple 1-layer G_{G′G}-equivariant G-Net. We train the models with ADAM (learning rate 0.01, no weight decay, at most 1000 epochs). We report the best test accuracy at the model checkpoint selected by the best validation accuracy (with an 80/20 training-validation split).
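For concreteness, here is a minimal sketch of the 2-layer G-Net described above, written in PyTorch (the framework is our assumption, as are all class and variable names): each equivariant linear layer ties weights according to the block structure of Section A.2, with per-cluster self and within-cluster weights and per-pair across-cluster weights.

```python
# A minimal PyTorch sketch (assumed names; not the released code) of a G-Net layer that
# generalizes the block structure of eqn. (20) to feature channels, and the 2-layer
# composition f_out o ReLU o f_in.
import torch
import torch.nn as nn

class ClusterEquivariantLinear(nn.Module):
    """Linear layer equivariant to within-cluster permutations (trivial A_{G'})."""
    def __init__(self, cluster_sizes, d_in, d_out):
        super().__init__()
        M = len(cluster_sizes)
        self.sizes = list(cluster_sizes)
        self.W_self = nn.Parameter(0.1 * torch.randn(M, d_in, d_out))
        self.W_within = nn.Parameter(0.1 * torch.randn(M, d_in, d_out))
        self.W_across = nn.Parameter(0.1 * torch.randn(M, M, d_in, d_out))

    def forward(self, x):                     # x: (N, d_in), nodes ordered by cluster
        chunks = torch.split(x, self.sizes, dim=0)
        sums = [c.sum(dim=0, keepdim=True) for c in chunks]   # (1, d_in) per cluster
        out = []
        for k, ck in enumerate(chunks):
            y = ck @ self.W_self[k]                           # a_k I block
            y = y + sums[k] @ self.W_within[k]                # b_k 1 1^T block
            for l in range(len(chunks)):
                if l != k:
                    y = y + sums[l] @ self.W_across[k, l]     # e_kl 1 1^T block
            out.append(y)
        return torch.cat(out, dim=0)

class GNet(nn.Module):
    """2-layer G-Net: f_out o ReLU o f_in."""
    def __init__(self, cluster_sizes, d_in=1, d_hidden=28, d_out=1):
        super().__init__()
        self.f_in = ClusterEquivariantLinear(cluster_sizes, d_in, d_hidden)
        self.f_out = ClusterEquivariantLinear(cluster_sizes, d_hidden, d_out)

    def forward(self, x):
        return self.f_out(torch.relu(self.f_in(x)))

# Example: 784 pixel nodes coarsened into 4 clusters of 196 pixels each.
model = GNet([196, 196, 196, 196])
out = model(torch.randn(784, 1))   # masked image in, reconstructed image out
```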
We supplement Figure 2 with Table 7 for further numerical details.

Table 7: Image inpainting using G-Net with different levels of coarsening. The table shows the mean squared error (MSE, ×1e−2, mean ± std over 3 runs) on the test set, supplementing Figure 2 (left, blue curves).
MSE (×1e−2) | S_{28²} = S_n | (S_{14²})^4 = (S_{n/4})^4 | (S_{7²})^{16} = (S_{n/16})^{16} | (S_{4²})^{49} = (S_{n/49})^{49} | (S_{2²})^{196} = (S_{n/196})^{196} | Trivial = (S_{n/784})^{784}
MNIST | 41.56 ± 0.16 | 40.53 ± 0.26 | 36.06 ± 0.24 | 34.68 ± 0.50 | 33.67 ± 0.07 | 33.92 ± 0.04
Fashion | 23.48 ± 0.14 | 22.26 ± 0.02 | 16.94 ± 0.08 | 15.16 ± 0.10 | 14.47 ± 0.11 | 14.75 ± 0.11

D.3 Application: Traffic Flow Prediction
Data. The METR-LA traffic dataset [76] contains traffic information collected from 207 sensors on the highways of Los Angeles County from March 1st, 2012 to June 30th, 2012 [87]. We use the same traffic data normalization and 70/10/20 train/validation/test split as [76]. We consider two different traffic graphs constructed from the pairwise road network distance matrix: (1) the sensor graph G introduced in [76], obtained by applying a thresholded Gaussian kernel (degree distribution in Figure 8e); (2) the sparser graph G_s obtained by applying a binary mask whose (i, j) entry is nonzero if and only if nodes i and j lie on the same highway (degree distribution in Figure 8d). We construct the second variant to more faithfully model the geometry of the highway, as illustrated in Figure 8a.
Graph coarsening. We choose 2 clusters based on highway intersection and flow direction, indicated by colors (Figure 8b), and 9 clusters based on highway labels (Figure 8c).

Figure 8: METR-LA traffic graph: visualization, clustering, and degree distribution. (a) Our faithful traffic graph; (b) graph clustering (2 clusters); (c) graph clustering (9 clusters); (d) degree distribution of our faithful graph; (e) degree distribution of the original sensor graph.

Model. We use a standard baseline, DCRNN, proposed in [76]. DCRNN is built on a core recurrent module, the DCGRU cell, which iterates as follows. Let x_{i,t}, h_{i,t} denote the i-th node's feature and hidden-state vectors at time t, and let X_t, R_t, H_{t−1} be the matrices obtained by stacking the vectors x_{i,t}, r_{i,t}, h_{i,t−1} as rows:
z_{i,t} = σ_g(W_z x_{i,t} + U_z h_{i,t−1} + b_z), (61)
r_{i,t} = σ_g(W_r x_{i,t} + U_r h_{i,t−1} + b_r), (62)
ĥ_{i,t} = ϕ_h([A X_t W_h]_{[i,:]} + [A (R_t ⊙ H_{t−1}) U_h]_{[i,:]} + b_h), (63)
h_{i,t} = z_{i,t} ⊙ h_{i,t−1} + (1 − z_{i,t}) ⊙ ĥ_{i,t}, (64)
where W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h are learnable weights and biases, σ_g is the sigmoid function, ϕ_h is the hyperbolic tangent, and h_{i,0} = 0 for all i at initialization. The crucial difference from a vanilla GRU lies in eqn. (63), where graph convolution replaces matrix multiplication. We then modify the graph convolution in (63) from global weight sharing to tying weights among clusters of nodes, similar to the implementation of Relax-S16 in Appendix D.4. For example, in the case of two clusters (orbits), we change X W_h to
swap(concat[X_{c_1} W_{h,c_1}; X_{c_2} W_{h,c_2}]), (65)
where X_{c_i} denotes the submatrix of X containing only the rows of nodes from cluster i, and W_{h,c_1}, W_{h,c_2} are two learnable matrices. In words, we perform a cluster-specific linear transformation, combine the transformed features, and reorder the rows (i.e., swap) to ensure compatibility with the graph convolution.
Experiment Set-up. For our experiments, we use the DCRNN model with 1 RNN layer and 1 diffusion step. We choose T = 3 (i.e., 3 historical graph signals) and T′ = 3 (i.e., predict the graph signals of the next 3 periods). We train all variants for 30 epochs using the ADAM optimizer with learning rate 0.01. We report the test set performance selected by the best validation set performance.
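The following is a minimal sketch (assumed names, not the DCRNN codebase) of the cluster-wise weight tying in eqn. (65) above: each cluster gets its own weight matrix, and the rows are kept in the original node order, so the result remains a drop-in replacement inside the graph convolution. Here masked assignment plays the role of the concat-then-swap in (65).

```python
# A minimal sketch (not the DCRNN codebase) of the modification in eqn. (65): replace the
# globally shared product X @ W_h by cluster-specific weights, keeping the original node
# order so the result stays compatible with the graph convolution A @ (...).
import torch

def clusterwise_linear(X, cluster_ids, weights):
    """X: (N, d) node features; cluster_ids: (N,) integer cluster assignment;
    weights: list of (d, d_out) tensors, one per cluster."""
    N, d_out = X.shape[0], weights[0].shape[1]
    out = X.new_zeros(N, d_out)
    for c, W in enumerate(weights):
        mask = cluster_ids == c
        out[mask] = X[mask] @ W      # cluster-specific transformation, rows kept in place
    return out

# Example with two clusters on a toy graph.
N, d, d_out = 6, 4, 5
X = torch.randn(N, d)
cluster_ids = torch.tensor([0, 0, 0, 1, 1, 1])
W = [torch.randn(d, d_out), torch.randn(d, d_out)]
A = torch.rand(N, N)
h = A @ clusterwise_linear(X, cluster_ids, W)   # replaces A @ (X @ W_h)
```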
D.3.1 Assumption Validation: Approximately Equivariant Map
Before applying our construction of approximate symmetries, we validate the assumption that the target function f* is an approximately equivariant mapping, using a trained DCRNN model as a proxy. We proceed as follows.
Data. We use the validation set of METR-LA (traffic graph signals in LA), which has 207 nodes and consists of 14,040 input and output signals. Each input X ∈ ℝ^{207×2} represents past traffic volume and speed at the 207 stations, and each output Y ∈ ℝ^{207} represents future traffic volume.
Model. We use a DCRNN model trained on our faithful graph, with input being 3 historical signals X = (X_{T−3}, X_{T−2}, X_{T−1}) ∈ ℝ^{3×207×2}, to predict the future signals Y = (X_T, X_{T+1}, X_{T+2}) ∈ ℝ^{3×207}. We denote this model by f. It gives reasonable performance, with a mean absolute error of approximately 3, and serves as a good proxy for the (unknown) target function f*.
Neighbors. We take our faithful traffic graph, which originally has 397 non-loop edges, and only consider a subset of 260 edges obtained by thresholding the distance values to eliminate geometrically far-away nodes. This defines our 260 neighboring node pairs.
Equivariance error. For each node pair (i, j), we swap their input signals by interchanging the (i, j)-th slices along the node dimension of the tensor X, denoted X^{(i,j)}, and check whether the output on the swapped input, Ŷ^{(i,j)} = f(X^{(i,j)}), is close to the original output Ŷ = f(X) with its (i, j)-th slices swapped. We measure closeness via the relative equivariance error at the node pair. Concretely, let X[i, j] denote the tensor slices at the (i, j) node pair, and X[j, i] the swapped version obtained by interchanging the (i, j)-th slices. The relative difference is computed as |Ŷ^{(i,j)}[j, i] − Ŷ[i, j]| / |Ŷ[i, j]|, where / denotes element-wise division. We then compute the mean relative equivariance error over all instances in the validation set, which equals 5.17%. This gives concrete justification for enforcing approximate equivariance in the traffic flow prediction problem.

D.4 Application: Human Pose Estimation
D.4.1 Equivariant Layer for the Human Skeleton Graph
We apply the constructions in Section A.1 to our human skeleton graph. We first show how to parameterize all linear A_G-equivariant functions. Observe that A_G = (S_2)² = {e, a, l, al}, where the nontrivial actions correspond to the arm flip with respect to the spine, the leg flip with respect to the spine, and their composition. To fix ideas, we first treat both input and output graph signals as vectors, and construct A_G-equivariant linear maps f : ℝ^16 → ℝ^16.
Step 1: Obtain the character table for (S_2)² (Table 2).

Table 2: Character table for (S_2)², with columns indexed by the group elements e, a, l, al:
χ_e: 1, 1, 1, 1
χ_2: 1, 1, −1, −1
χ_3: 1, −1, 1, −1
χ_4: 1, −1, −1, 1

Step 2: Construct the basis for the isotypic decomposition. Here we index the leg joint pairs as (1, 4), (2, 5), (3, 6), the arm joint pairs as (10, 13), (11, 14), (12, 15), and the spine joints as 0, 7, 8, 9. Let B = [B(P_{χ_e}); B(P_{χ_2}); B(P_{χ_3}); B(P_{χ_4})], where
B(P_{χ_e}) = [(e_1 + e_4)/√2; ...; (e_12 + e_15)/√2; e_0; e_7; e_8; e_9] ∈ ℝ^{16×10},
B(P_{χ_2}) = [(e_1 − e_4)/√2; (e_2 − e_5)/√2; (e_3 − e_6)/√2] ∈ ℝ^{16×3},
B(P_{χ_3}) = [(e_10 − e_13)/√2; (e_11 − e_14)/√2; (e_12 − e_15)/√2] ∈ ℝ^{16×3},
B(P_{χ_4}) = ∅. (66)
Step 3: Parameterize f : ℝ^16 → ℝ^16 by f : B(P_{χ_e}) → B(P_{χ_e}), f : B(P_{χ_2}) → B(P_{χ_2}), and f : B(P_{χ_3}) → B(P_{χ_3}); i.e., for all v ∈ ℝ^16, let v = B(P_{χ_e}) c_e + B(P_{χ_2}) c_2 + B(P_{χ_3}) c_3; then
f(v) = W_e c_e + W_2 c_2 + W_3 c_3, (67)
where W_e ∈ ℝ^{10×10}, W_2 ∈ ℝ^{3×3}, W_3 ∈ ℝ^{3×3} are (learnable) weight matrices. Now f expresses all linear, equivariant maps w.r.t. (S_2)².
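As an illustration (not the paper's implementation; joint indices follow the convention stated above), the following sketch builds the basis B of (66), applies block-diagonal weights W_e, W_2, W_3 as in (67), and checks equivariance with respect to the leg flip.

```python
# A minimal sketch (not the paper's code) of the (S2)^2-equivariant linear map of
# eqs. (66)-(67): project to the isotypic basis, apply block-diagonal weights, project back.
import numpy as np

legs = [(1, 4), (2, 5), (3, 6)]
arms = [(10, 13), (11, 14), (12, 15)]
spine = [0, 7, 8, 9]

def basis():
    cols = []
    e = np.eye(16)
    cols += [(e[i] + e[j]) / np.sqrt(2) for i, j in legs + arms]  # chi_e: pair sums
    cols += [e[i] for i in spine]                                  # chi_e: spine joints
    cols += [(e[i] - e[j]) / np.sqrt(2) for i, j in legs]          # chi_2: leg differences
    cols += [(e[i] - e[j]) / np.sqrt(2) for i, j in arms]          # chi_3: arm differences
    return np.stack(cols, axis=1)                                  # 16 x 16 orthogonal basis B

B = basis()
rng = np.random.default_rng(0)
W = np.zeros((16, 16))
W[:10, :10] = rng.normal(size=(10, 10))    # W_e on the 10-dim chi_e component
W[10:13, 10:13] = rng.normal(size=(3, 3))  # W_2 on the leg-difference component
W[13:, 13:] = rng.normal(size=(3, 3))      # W_3 on the arm-difference component
F = B @ W @ B.T                             # the equivariant map in the standard basis

# Equivariance check against the leg flip l in A_G = (S2)^2.
P = np.eye(16)
for i, j in legs:
    P[[i, j]] = P[[j, i]]                   # permutation swapping left/right leg joints
x = rng.normal(size=16)
assert np.allclose(F @ (P @ x), P @ (F @ x))
```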
The following calculation, based on f : ℝ^16 → ℝ^16, shows how many degrees of freedom (measured by learnable parameters) are gained by relaxing the symmetry from the global group S_16, to the exact automorphism group A_G = (S_2)², to the trivial group (i.e., no symmetry):
f_{S_16} = w I_16 + w′ (1_{16}1_{16}^⊤ − I_16), (2 parameters); (68)
f_{A_G} = W_e ⊕ W_2 ⊕ W_3, (118 parameters on the isotypic components); (69)
f_{triv.} = W, (256 parameters). (70)
To parameterize an equivariant linear function f : ℝ^{16×d} → ℝ^{16×d′}, we proceed by decoupling the input space into ℝ^{10×d}, ℝ^{3×d}, ℝ^{3×d} and the output space into ℝ^{10×d′}, ℝ^{3×d′}, ℝ^{3×d′}. The learnable weight matrices for multidimensional input/output become W_e ∈ ℝ^{10d×10d′}, W_2 ∈ ℝ^{3d×3d′}, W_3 ∈ ℝ^{3d×3d′}. The construction is summarized in Algorithm 2.

Algorithm 2 Equivariant layer f_{A_G} : ℝ^{16×d} → ℝ^{16×d′} for A_G = (S_2)²
Require: the basis B ∈ ℝ^{16×16} in (66) for the isotypic decomposition of A_G = (S_2)², input h^{(l)} ∈ ℝ^{16×d}.
Initialize: the learnable weights W^{(l)}_e ∈ ℝ^{10d×10d′}; W^{(l)}_2, W^{(l)}_3 ∈ ℝ^{3d×3d′}; M^{(l)} ∈ ℝ^{16×16}.
1. Project h^{(l)} onto the isotypic components: z^{(l)} = B^⊤ h^{(l)};
2. Perform block-wise linear transformations:
z_e = W^{(l)}_e flatten(z^{(l)}_{[:10,:]}), z_2 = W^{(l)}_2 flatten(z^{(l)}_{[10:13,:]}), z_3 = W^{(l)}_3 flatten(z^{(l)}_{[13:,:]}),
z^{(l+1)} = concat[z_e; z_2; z_3] ∈ ℝ^{16×d′} (after reshaping the blocks back to 10×d′, 3×d′, 3×d′);
3. Project back to the standard basis: h̃^{(l+1)} = B z^{(l+1)};
4. Apply a pointwise nonlinearity: h^{(l+1)} = σ(h̃^{(l+1)}).
return h^{(l+1)}

Figure 9: The human skeleton graph G, its coarsened graph G′ (clustering the leg joints), and its blow-up.

D.4.2 Experiment Details
Data. We use the standard benchmark dataset Human3.6M [77], with the same protocol as in [78]: we train the models on 1.56M poses (from human subjects S1, S5, S6, S7, S8) and evaluate them on 0.54M poses (from human subjects S9, S11). We use the method described in [88] to normalize the inputs (2D joint poses) to [−1, 1] and align the targets (3D joint poses) with the root joint.
Model. We give a detailed description of G-Net and its variants used in the experiments. The inset figure illustrates the architecture of G-Net: a stack of equivariant linear layers f_G interleaved with pointwise nonlinearities. For the human skeleton graph with N = 16, we have f_G : ℝ^{16×d} → ℝ^{16×k}, where d, k represent the input and output dimensions (for each layer). Let f_G[i, j] : ℝ^16 → ℝ^16 denote its (i, j)-th slice.
1. G-Net with strict equivariance using the equivariant linear map f_G (see Table 3):
S16: f_{S16}[i, j] ∈ ℝ^{16×16} is constant on the diagonal (one learnable scalar a) and constant off the diagonal (another learnable scalar b).
Relax-S16: we relax f_{S16}[i, j] by having 16 different pairs of scalars (a_i, b_i), i ∈ [16], so that each node i can map to itself and communicate with its neighbors in its own way (controlled by (a_i, b_i)), while still treating all neighbors equally (by using the same b_i for all nodes j ≠ i).
A_G = (S_2)²: we use Algorithm 2.
Trivial: we allow f[i, j] ∈ ℝ^{16×16} to be arbitrary, i.e., it has 16 × 16 learnable scalars.
We remark that S16 and Relax-S16 are implemented by tying weights, while A_G is implemented by projecting onto the isotypic components as shown in Algorithm 2.
2. G-Net augmented with graph convolution, A f_G(x), denoted G-Net(gc) (see Table 3): we apply the equivariant linear map f_G from item 1 to obtain f_G(x) ∈ ℝ^{16×k}, and then apply graph convolution by multiplication from the left, i.e., A f_G(x) ∈ ℝ^{16×k}.
3. G-Net augmented with graph convolution and learnable edge weights, denoted G-Net(gc+ew) (see Table 4): we further learn the edge weights of the adjacency matrix A via softmax(M ⊙ A), where M ∈ ℝ^{16×16} represents the learnable edge weights, and M_{i,j} is nonzero where A_{i,j} ≠ 0 and 0 elsewhere.
This is inspired by SemGCN [78]. Besides the groups discussed above, we also implement Relax-(S_6)², which corresponds to tying weights among the orbits of the coarsened graph, consisting of 4 spine nodes (singleton orbits) and 2 orbits for the left/right arm and leg nodes.
4. G-Net augmented with graph locality constraints, (A ⊙ f_G)(x), and learnable edge weights, denoted G-Net(pt+ew) (see Table 3): we perform pointwise multiplication A ⊙ f_G[i, j] at each (i, j)-th slice of f_G. In practice, we also allow learnable edge weights as in item 3.
Experimental Set-up. We design G-Net to have 4 layers (with batch normalization and residual connections between the hidden layers), 128 hidden units, and ReLU nonlinearity. This allows G-Net(gc+ew) to recover SemGCN [78] when choosing G = S_16. We train our models for at most 30 epochs with early stopping. For comparison purposes, we use the same optimization routines as SemGCN [78] and perform a hyper-parameter search over learning rates {0.001, 0.002}.
Evaluation. Table 3 shows the results of G-Net and its variants when varying the choice of G. We observe that using the automorphism group A_G does not give the best performance, while imposing no symmetry (Trivial) or a relaxed version of S_16 yields better results. Here, enforcing no symmetry achieves better performance because the human skeleton graph is very small, with only 16 nodes. As shown in the other experiments with larger graphs (e.g., image inpainting), enforcing symmetries does yield better performance.

Table 3: 3D human pose prediction using G-Net and its variants. Error (± std) measured by Mean Per-Joint Position Error (MPJPE) and MPJPE after rigid alignment (P-MPJPE) across 3 runs. All methods use the same hidden dimension d = 128. Bold type indicates the top-2 performance among each variant. NA indicates that the loss fails to converge.
MPJPE | S16 | Relax-S16 | A_G = (S_2)² | Trivial
G-Net | NA | 47.97 ± 0.47 | 48.30 ± 0.69 | 42.86 ± 0.64
G-Net(gc) | NA | 54.50 ± 4.33 | 49.40 ± 1.37 | 43.24 ± 0.82
G-Net(pt+ew) | 41.54 ± 0.47 | 40.44 ± 0.61 | 40.63 ± 0.26 | 38.41 ± 0.31
P-MPJPE | S16 | Relax-S16 | A_G = (S_2)² | Trivial
G-Net | NA | 36.45 ± 0.56 | 37.17 ± 0.59 | 32.59 ± 0.62
G-Net(gc) | NA | 40.61 ± 0.99 | 37.62 ± 1.32 | 33.05 ± 0.81
G-Net(pt+ew) | 32.31 ± 0.03 | 31.11 ± 0.68 | 31.35 ± 0.14 | 29.68 ± 0.22

Additional Evaluation. Table 4 shows the experiments where we keep the number of parameters roughly the same across different choices of G.

Table 4: 3D human pose prediction using G-Net(gc+ew), where the models induced from each choice of G are set to have roughly the same number of parameters. d denotes the number of hidden units.
G-Net | Number of Parameters | Number of Epochs | MPJPE | P-MPJPE
S16 | 0.27M (d = 128) | 50 | 43.48 | 34.96
Relax-S16 | 0.27M (d = 32) | 20 | 40.08 | 32.08
A_G = (S_2)² | 0.22M (d = 16) | 30 | 44.10 | 34.12
Trivial | 0.22M (d = 10) | 30 | 45.05 | 34.79