# equivariant_polynomial_functional_networks__b2f25992.pdf Equivariant Polynomial Functional Networks Thieu N. Vo * 1 Viet-Hoang Tran * 1 Tho Tran Huu * 1 An Nguyen The 2 Thanh Tran 3 Minh-Khoi Nguyen-Nhat 2 Duy-Tung Pham 2 Tan M. Nguyen 1 Abstract A neural functional network (NFN) is a specialized type of neural network designed to process and learn from entire neural networks as input data. Recent NFNs have been proposed with permutation and scaling equivariance based on either graph-based message-passing mechanisms or parameter-sharing mechanisms. However, the challenge of designing a permutation and scaling equivariant NFN that maintains low memory consumption and running time while preserving expressivity remains unresolved. In this paper, we propose a novel solution with the development of MAGEP-NFN (Monomial m Atrix Group Equivariant Polynomial NFN). Our approach follows the parameter-sharing mechanism but differs from previous works by constructing a nonlinear equivariant layer represented as a polynomial in the input weights. This polynomial formulation enables us to incorporate additional relationships between weights from different input hidden layers, enhancing the model s expressivity while keeping memory consumption and running time low, thereby addressing the aforementioned challenge. We provide empirical evidence demonstrating that MAGEP-NFN achieves competitive performance and efficiency compared to existing baselines. 1. Introduction Neural functional networks (NFNs) serve as specialized architectures that operate on fundamental components of deep neural networks, including weights, gradients, and sparsity masks, by treating them as input data (Zhou et al., 2024b). These networks have been utilized across various domains of machine learning, contributing to applications *Equal contribution 1National University of Singapore 2FPT Software AI Center, Vietnam 3Vin University, Vietnam. Correspondence to: Viet-Hoang Tran . Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). such as enhancing training efficiency through learnable optimizers (Bengio et al., 2013; Runarsson & Jonsson, 2000; Andrychowicz et al., 2016; Metz et al., 2022), capturing features from implicit data representations (Stanley, 2007; Mildenhall et al., 2021; Runarsson & Jonsson, 2000), modifying network parameters for targeted adjustments (Sinitsin et al., 2020; Cao et al., 2021; Mitchell et al., 2022), assessing policies in reinforcement learning (Harb et al., 2020), and facilitating Bayesian inference by interpreting neural networks as sources of evidence (Sokota et al., 2021). A fundamental aspect of NFNs is their ability to respect the inherent symmetries present in the weight space of the input neural networks. In the case of a multilayer perceptron (MLP), the weight space exhibits two primary forms of symmetry: permutation symmetry and scaling symmetry. Permutation symmetries arise from the structure of the network itself, since neurons within a hidden layer have no intrinsic ordering. On the other hand, scaling symmetries are induced by the activation functions. For networks with the Re LU activation function, multiplying a neuron s bias and all its incoming weights by the same positive scalar scales its output proportionally, leading to a scaling-type symmetry (Bui Thi Mai & Lampert, 2020; Neyshabur et al., 2015; Badrinarayanan et al., 2015). Similarly, for sine and tanh activations, flipping the sign of both the bias and all incoming weights of a neuron inverts the sign of its output, introducing an alternative form of scaling symmetry (Chen et al., 1993; Fefferman & Markel, 1993; Kurkova & Kainen, 1994). These structural symmetries define equivalence classes in the weight space, and incorporating them into the design of NFNs ensures that learned representations remain consistent and invariant to these transformations. Recent methods have focused on creating permutation equivariant NFNs, such as (Navon et al., 2023; Zhou et al., 2024b; Kofinas et al., 2024; Zhou et al., 2024c). These methods leverage permutation equivariance to respect symmetries arising from neuron reordering within hidden layers. NFNs that are equivariant to both permutations and scaling or signflipping have been introduced in (Kalogeropoulos et al., 2024) using a graph-based message-passing mechanism and in (Tran et al., 2024a) with a parameter sharing mechanism. However, similar to other graph-based neural functional networks, treating the entire input neural network as a graph Equivariant Polynomial Functional Networks and utilizing graph neural networks causes the graph-based equivariant NFNs in (Kalogeropoulos et al., 2024) to have very high memory consumption and running time. In contrast, the NFNs built upon equivariant linear layers using the parameter sharing mechanism in (Tran et al., 2024a) exhibit much lower memory consumption and running time. Nevertheless, the equivariant linear layers introduced in (Tran et al., 2024a) possess weak expressive properties, as the weights of the input hidden layers are updated solely by the corresponding weights of the same input hidden layers. The challenge of designing an equivariant layer based on the parameter-sharing mechanism that maintains both lower memory consumption and running time while preserving expressivity remains unresolved. 1.1. Contribution This paper aims to develop a novel NFN that is equivariant to both permutations and scaling/sign-flipping symmetries, called MAGEP-NFN (Monomial m Atrix Group Equivariant Polynomial NFN). We follow the parameter-sharing mechanism as described in (Tran et al., 2024a); however, unlike (Tran et al., 2024a), we construct a nonlinear equivariant layer, which is represented as a polynomial in the input weights. This polynomial formulation enables us to incorporate additional relationships between weights from different input hidden layers, thereby addressing the challenges posed in (Tran et al., 2024a) and enhancing the expressivity of MAGEP-NFN. However, determining equivariant and invariant layers among generic polynomials in the input weights is challenging due to two main factors: the difficulty in identifying polynomial orbits under group actions and the high computational cost of working with generic polynomials. To overcome these issues, we introduce a specialized class of polynomials that remain stable under permutations and scaling. Restricting equivariant and invariant layers to linear combinations of these terms ensures computational efficiency and reduced memory consumption. In particular, our contribution is as follows: 1. We introduce a class of polynomials in the input weights, referred to as stable polynomial terms, which remain stable under the group action of the weight space. In addition, we conduct a comprehensive study of the linear independence of stable polynomial terms. 2. We characterize all equivariant and invariant layers among the linear combination of the stable polynomial terms. These layers are polynomials of degree at most L + 1 where L is the number of layers of the input neural networks. 3. Build on top of the equivariant and invariant poly- nomial layers, we design MAGEP-NFN, a family of monomial matrix equivariant NFNs based on the parameter-sharing mechanism that maintains both lower memory consumption and running time while preserving expressivity. We evaluate MAGEP-NFNs on three tasks: predicting CNN generalization from weights using Small CNN Zoo (Unterthiner et al., 2020), weight space style editing, and classifying INRs using INRs data (Zhou et al., 2024b). Experimental results show that our model achieves competitive performance and efficiency compared to existing baselines. 1.2. Notations Let n be a positive integer. We denote Pn as the set of all permutation matrices, and Dn as the set of diagonal matrices in GLn(R). We also denote Mn as the set of monomial matrices in GLn(R), where a monomial matrix is a product of a diagonal matrix and a permutation matrix. These sets are subgroups of GLn(R). In addition, we use M>0 n and M n to denote the set of monomial matrices whose nonzero entries are positive numbers and 1, respectively. For any permutation matrix P Pn, there exists a unique permutation π Sn such that P is obtained by permuting the columns of the identity matrix In according to π. We denote this as P := Pπ and refer to it as the permutation matrix associated with π, where Sn is the symmetric group of all permutations on {1, 2, . . . , n}. Organization. We reformulate the definitions of weight spaces for MLPs and CNNs, as well as the action of monomial matrix groups on these weight spaces. In Section 3, we construct polynomial equivariant and invariant layers, which serve as the main building blocks for our MAGEP-NFNs. Several experiments are conducted in Section 4 to verify the applicability and efficiency of our models in comparison with previous ones in the literature. Some related works will be recalled in Section 5. The paper concludes with a summary in Section 6. 2. Background: Weight Spaces and Their Symmetries Let U, V be two sets and assume that a group G acts on them. A function f : U ! V is called G-equivariant if f(g x) = g f(x) for all x U and g G. In case G acts trivially on Y, the function f is called G-invariant. In the context of this paper, U and V are weight spaces of a fixed neural network architecture, while G is a direct product of the groups of monomial matrices. From now on, we will fix the activation σ to be the rectified linear unit σ = Re LU. The case when σ is another typical Equivariant Polynomial Functional Networks activation, such as semilinear (e.g., Leaky Re LU) or odd (e.g., sin, tanh), can be derived similarly. Following (Tran et al., 2024a), we write the weight space of an MLP or CNN with L layers and ni channels at i-th layer in the general form U = W B, where: W = Rw L n L n L 1 . . . Rw2 n2 n1 Rw1 n1 n0, B = Rb L n L 1 . . . Rb2 n2 1 Rb1 n1 1. (1) Here, ni is the number of channels at the i-th layer, in particular, n0 and n L are the number of channels of input and output; wi is the dimension of weights and bi is the dimension of the biases in each channel at the i-th layer. Each element U of U is written as U = ([W], [b]), with the weights [W] = [W](L), . . . , [W](1) W, (2) [b] = [b](L), . . . , [b](1) B. (3) The square brackets will be convenient in the next section when we determine polynomials in the entries of U. The weight space U exhibits two primary forms of symmetry: permutation symmetry and scaling symmetry. Permutation symmetries arise from the structure of the network itself, since neurons within a hidden layer have no intrinsic ordering. On the other hand, for networks with the Re LU activation function, multiplying a neuron s bias and all its incoming weights by the same positive scalar scales its output proportionally, leading to a scaling-type symmetry. Based on the above observation, we define the group G of the form G := {In L} M>0 n L 1 . . . M>0 n1 {In0}, (4) where In is the identity matrix of size n n, and M>0 n is the set of monomial matrices whose nonzero entries are positive numbers. Each element g G has the form g = g(L), . . . , g(0) , where g(i) = D(i) Pπi for some diagonal matrix D(i) = diag(d(i) 1 , . . . , d(i) ni ) in Dni and permutation πi Sni. The action of G on U is defined formally as (g, U) 7! g U = ([g W], [gb]), [g W](i) := g(i) [W](i) g(i 1) 1 and [gb](i) := g(i) [b](i), (5) or equivalently, [g W](i) jk := d(i) j d(i 1) k [W](i) π 1 i (j)π 1 i 1(k) and [gb](i) j := d(i) j [b](i) π 1 i (j). (6) With notation as above, it is well-known that the function f = f( ; U, σ) be an MLP or CNN given in Equation (1) with the weight space U U and an activation σ = Re LU will be G-invariant under the action of G, i.e. f(x ; U, σ) = f(x ; g U, σ) (7) for all g G, U U and x Rn0. 3. Equivariant and Invariant Polynomial Functional Networks In this section, we construct our MAGEP-NFNs, a new class of NFNs that exhibit equivariance to the group G of monomial matrices with positive nonzero entries, as described in the previous section. The core components of MAGEP-NFNs are invariant and equivariant polynomial layers, which will be detailed in Subsection 3.2. At the heart of these layers are the stable polynomial terms, which play a crucial role in ensuring low memory consumption and computational efficiency of MAGEP-NFNs. These terms will be formally introduced in Subsection 3.1. 3.1. Stable polynomial terms We follow the parameter-sharing mechanism used in (Tran et al., 2024a) for constructing equivariant and invariant layers from U to ensure low memory consumption and computational efficiency in our model. Unlike the linear layers utilized in (Tran et al., 2024a), we employ polynomial layers. This choice allows us to capture additional relationships between weights from different hidden layers of the input network, thereby enhancing the expressivity of our model. However, determining equivariant and invariant layers among generic polynomials in the input weights presents a significant challenge for two key reasons. First, identifying the orbits of a generic polynomial under the group action is difficult, as such polynomials lack an inherent structured compatibility with the symmetry group of the weight space. Second, computations involving equivariant and invariant layers constructed from generic polynomials are highly inefficient in both memory usage and computational cost. To address these challenges, we introduce a specialized class of polynomials, referred to as stable polynomial terms. Intuitively, a stable polynomial term is a polynomial in the Equivariant Polynomial Functional Networks entries of U U that remains stable under the action of G (see Definition 3.1). By restricting equivariant and invariant polynomial layers to linear combinations of these stable polynomial terms, we ensure both computational efficiency and reduced memory consumption. Consider the case where the weight spaces have the same number of dimensions across all channels, which means wi = bi = d for all i. Definition 3.1 (Stable polynomial terms). Assume that U = ([W], [b]) is an element of the weight space U with weights [W] = [W](L), . . . , [W](1) and biases [b] = [b](L), . . . , [b](1) . For each L s > t 0, we define the following matrices: [W](s,t) := [W](s) [W](s 1) . . . [W](t+1), [Wb](s,t)(t) := [W](s,t) [b](t) . (8) In addition, for each indices s and t with L s, t 0, and matrices Ψ(s)(L,t) R1 n L and Ψ(s,0)(L,t) Rn0 n L, we also define the following matrices: [b W](s)(L,t) := [b](s) Ψ(s)(L,t) [W](L,t), [WW](s,0)(L,t) := [W](s,0) Ψ(s,0)(L,t) [W](L,t). (9) The entries of the matrices [W](s,t), [Wb](s,t)(t), [b W](s)(L,t) and [WW](s,0)(L,t) defined above are called stable polynomial terms of U under the action of G. Note that we omit the Ψ( ) s in [b W], [WW] for simplicity, they are parameters, and are different for each s, t. The denotations are straight-forward and reasonable. In the above definition, we use the notation [W] and [WW] to denote products of the weight matrices [W](i) with the appropriate index i. The notation [Wb] indicates that this is a product of several weight matrices [W](i) and a bias vector [b](j), with appropriate indices i and j. For the indices, we use the notation (s, t) to signify that the considered product contains weight matrices with indices ranging from s down to t + 1. When the index has two components, for example [Wb](s,t)(t), the first component (s, t) specifies the range of indices for [W], while the second component (t) indicates the index of the bias vector [b]. Specifically, the last two terms [b W](s)(L,t) and [WW](s,0)(L,t) contain the matrices Ψ to multiply two matrices of different sizes from the left and the right. Proposition 3.2 (Stable polynomial terms as generalization of weights and biases). With notation as above, then for all L s > t > r 0, we have [W](s,s 1) = [W](s) Rd ns ns 1, [W](s,t) [W](t,r) = [W](s,r) Rd ns nr. [b W](s)(s,t) [W](t,r) = [b W](s,r) Rd ns nr, [W](s,t) [Wb](t,r) = [Wb](s,r) Rd ns nr. The above proposition shows that the stable polynomial terms can be viewed as a generalization of the entries of the weight matrices [W](i) and bias vectors [b](i). The stable polynomial terms defined above are actually stable under the action of G in the sense presented in the following theorem. Theorem 3.3 (Stable polynomial terms are stable ). With notation as above, let g U = ([g W], [gb]) be the element of U obtained by acting g = (g(L), . . . , g(0)) G on the element U = ([W], [b]). Then we have [g W](s,t) = g(s) [W](s,t) g(t) 1 , [gb](s) = g(s) [b](s), [g Wgb](s,t)(t) = g(s) [Wb](s,t)(t), [gbg W](s)(L,t) = g(s) [b W](s)(L,t) g(t) 1 , [g Wg W](s,0)(L,t) = g(s) [WW](s,0)(L,t) g(t) 1 . Intuitively, the theorem above asserts that stable polynomials exhibit compatibility with the action of the group G. In particular, it provides explicit formulas for efficiently determining the transformation of stable polynomials under group actions. This property plays a crucial role in enabling the efficient computation of equivariant and invariant layers, especially when utilizing the weight-sharing mechanism. Inherited from Proposition 3.2 and Theorem 3.3, we define the polynomial map I : U ! Rd with maps each element U U to the vector I(U) Rd of the following form: q=1 Φ(s,t):pq [W](s,t) pq p=1 Φ(s):p [b](s) p p=1 Φ(s,t)(t):p [Wb](s,t)(t) p q=1 Φ(s)(L,t):pq [b W](s)(L,t) pq q=1 Φ(s,0)(L,t):pq [WW](s,0)(L,t) pq Intuitively speaking, I(U) is a linear combination of the entries from the input weights [W](s) (from the first sum) and biases [b](s) (the second sum), as well as all entries from the stable polynomial terms [W](s,t) (the first sum), [Wb](s,t)(t) (the third sum), [b W](s)(s,t) (the fourth sum), and [WW](s,L)(0,t) (the last fifth sum) for all appropriate indices s and t, with a bias. Here, the coefficients Equivariant Polynomial Functional Networks Φ and the connection matrix Ψ (inside [b W](s)(s,t) and [WW](s,L)(0,t)) are learnable parameters. It is important to note that I(U) is a polynomial of degree at most L + 1 in the input weights, encompassing all linear layers as special cases. To identify invariant layers among polynomials of the form given in I, an important step is to determine all relations between elements U, U U such that I(U) = I(U ). When the parameters Ψ are fixed, I becomes a linear function in the parameters Φ , and the difference satisfies I(U) I(U ) = I(U U ). Thus, the problem of identifying invariant layers within polynomials of the form I reduces to determining the linear dependencies among stable polynomials. The following theorem characterizes these dependencies in I(U), with a formal statement provided in Theorem B.6 in the appendix. Theorem 3.4 (Linear dependence of stable polynomials). For a given pair of coefficients matrix Φ and Ψ , if I(U) given in Equation (10) is equal to zero for all input weights U U, then we have q=1 Φ(L,0):pq [W](L,0) pq q=1 Φ(s,0)(L,s):pq [WW](s,0)(L,s) pq = 0, p=1 Φ(L,t)(t):p [Wb](L,t)(t) p q=1 Φ(t)(L,t):pq [b W](t)(L,t) pq = 0, L > t > 0, and all entries of Φ and Ψ , except those appear in the above two equations, are equal to zero. Intuitively speaking, almost stable polynomial terms are linearly independent over the reals R, except those in Equations (11) and (12). This linear dependence property of the stable polynomials is essential in the computation of equivariant and invariant polynomial layers using weight-sharing mechanism. The proofs of Proposition 3.2, Theorem 3.3, and Theorem 3.4 can be found in Appendix B. 3.2. Polynomial Invariant and Equivariant Layers We now proceed to construct G-invariant polynomial layers. The construction of G-equivariant polynomial layers is similar and will be derived in detail in Appendix C. These polynomial layers serve as the fundamental building blocks for our MAGEP-NFNs. We define a polynomial map I : U ! Rd with maps each element U U to the vector I(U) Rd of the form given in Equation (10). To make I to be G-invariant, the learnable parameters Φ and Ψ must satisfy a system of constraints (usually called parameter sharing), which are induced from the condition I(g U) = I(U) for all g G and U U. We show in details what are these constraints and how to derive the concrete formula of I in Appendix C. The formula of I is then determined by q=1 Φ(L,0)(L,0):pq [WW](L,0)(L,0) pq q=1 Φ(L,0):pq [W](L,0) pq p=1 Φ(s,0)(L,s): [WW](s,0)(L,s) pp q=1 Φ(L)(L,0):pq [b W](L)(L,0) pq p=1 Φ(L,t)(t):p [Wb](L,t)(t) p p=1 Φ(t)(L,t): [b W](t)(L,t) pp p=1 Φ(L):p [b](L) p + Φ1. (13) In the above formula, the bullet symbol denotes that the corresponding coefficient is independent of the index at that position. The summation terms involving expressions of the form [W] (respectively, [b], [Wb], [b W], [WW]) correspond to those present in the summation in Equation (10). The omitted terms are those eliminated during solving the parameter-sharing process to ensure that the resulting formula becomes invariant. To conclude, we obtain the following: Theorem 3.5. With the notation given above, the polynomial map I : U ! Rd defined by Equation (13) is Ginvariant. Moreover, if a map takes the form of Equation (10) and is G-invariant, then it has the form given in Equation (13). Remark (Comparison to the invariant/equivariant linear layers in (Tran et al., 2024a)). Equation (13) describes the invariant polynomial layer derived from the parameter-sharing mechanism of our MAGEP-NFNs. In contrast, the invariant equivariant layer proposed in (Tran et al., 2024a) is an ad hoc Equivariant Polynomial Functional Networks Table 1: Classification train and test accuracies (%) for implicit neural representations of MNIST, Fashion MNIST, and CIFAR-10. Uncertainties indicate standard error over 5 runs, baseline results are from (Tran et al., 2024a). MNIST CIFAR-10 Fashion MNIST MLP 10.62 0.54 10.48 0.74 9.95 0.36 NP (Zhou et al., 2024b) 69.82 0.42 33.74 0.26 58.21 0.31 HNP (Zhou et al., 2024b) 66.02 0.51 31.61 0.22 57.43 0.46 Monomial-NFN (Tran et al., 2024a) 68.43 0.51 34.23 0.33 61.15 0.55 MAGEP-NFNs (ours) 77.55 0.68 37.18 0.30 62.83 0.57 formulation and does not result from a parameter-sharing mechanism. Consequently, there is no direct relationship between our invariant layer and the invariant layer in (Tran et al., 2024a). However, the equivariant polynomial layer in our MAGEPNFNs and the equivariant linear layer from (Tran et al., 2024a) are related. Specifically, the equivariant layer in (Tran et al., 2024a) is exactly the linear component of our equivariant polynomial layer. Due to the lengthy formulation and construction process, we have provided the details of the derived equivariant polynomial layers in Appendix D.4. 4. Experimental Results In this session, we assess the performance of our Monomial Matrix Group Polynomial Equivariant Neural Functionals (MAGEP-NFNs) across a variety of equivariant and invariant tasks. For invariant tasks, we implement our model for classifying Implicit Neural Representations of images and predicting CNN generalization based on weights. The equivariant task focuses on weight space style editing. Our experiments are designed to illustrate that MAGEP-NFNs either outperform or match the performance of other baseline models with a similar number of parameters. We perform five independent runs for each experiment and report the average results. Detailed information on hyperparameter settings, training protocols, memory and runtime analysis can be found in Appendix E. Additionally, we present a supplementary experiment comparing our approach with a GNN-based functional network, which is included in Appendix F. 4.1. Classifying Implicit Neural Representations of images Experiment setup. In this experiment, we aim to determine which class each pretrained Implicit Neural Representation (INR) weight was trained on. Following (Tran et al., 2024a), we employ three distinct INR weight datasets (Zhou et al., 2024b), each was trained on a different image dataset: CIFAR-10 (Krizhevsky & Hinton, 2009), Fashion- MNIST (Xiao et al., 2017), and MNIST (Le Cun & Cortes, 2005). Each INR weight is trained to encode a single image from its respective class, capturing the image structure by mapping pixel coordinates (x, y) to the corresponding pixel color values represented as 3-channel RGB values for CIFAR-10 and 1-channel grayscale values for MNIST and Fashion MNIST. The varying complexity and diversity of the datasets provide a robust test for evaluating MAGEPNFN s performance, demonstrating its effectiveness and benchmarking it against existing models. Results. We present the performance of our model alongside several baseline models, including MLP, NP (Zhou et al., 2024b), HNP (Zhou et al., 2024b), and Monomial NFN (Tran et al., 2024a). As shown in Table 1, our model achieves the highest test accuracies across all INR datasets. Notably, it outperforms the second-best model on the MNIST dataset, by a significant margin of 7.73%. For the CIFAR-10 and Fashion MNIST datasets, our model also demonstrates substantial improvements, with accuracy gains of 2.95% and 1.68%, respectively, over the existing baselines. These results indicate that our model leverages the embedded information from the pretrained INRs more effectively than any of the compared baselines. This consistent superior performance across various INR datasets highlights the effectiveness of MAGEP-NFN. It also suggests that our model generalizes well to INR weights embedded with different image structures and complexities. 4.2. Predicting CNN generalization from weights Experiment setup. For this experiment, we focus on predicting the generalization performance of pretrained CNNs based solely on their weights, without evaluating them on test data. We utilize the Small CNN Zoo dataset (Unterthiner et al., 2020), which contains various pretrained CNN models trained with different combinations of hyperparameters and activation functions. For our study, we split the Small CNN Zoo into two subsets: one comprising networks using Re LU activations and the other using Tanh activations. These two types of CNNs follow different group actions: M>0 n for Relu networks (see Equation (4)) and M 1 ni for Tanh networks. Equivariant Polynomial Functional Networks Table 2: Performance prediction of CNNs on the Re LU subset of Small CNN Zoo with varying scale augmentations. We use Kendall s τ as the evaluation metric. The uncertainty bars indicate the standard deviation across 5 runs. Augment settings No augment U[1, 101] U[1, 102] U[1, 103] U[1, 104] STATNet (Unterthiner et al., 2020) 0.915 0.002 0.894 0.0001 0.853 0.007 0.523 0.02 0.516 0.001 NP (Zhou et al., 2024b) 0.920 0.003 0.900 0.002 0.898 0.003 0.884 0.002 0.884 0.002 HNP (Zhou et al., 2024b) 0.926 0.003 0.913 0.001 0.903 0.003 0.891 0.003 0.601 0.02 Monomial-NFN (Tran et al., 2024a) 0.922 0.001 0.920 0.001 0.919 0.001 0.920 0.002 0.920 0.001 MAGEP-NFNs (ours) 0.933 0.001 0.933 0.001 0.933 0.001 0.932 0.001 0.932 0.001 Table 3: Performance prediction of CNNs on the Tanh subset of Small CNN Zoo. We use Kendall s τ as the evaluation metric. The uncertainty bars indicate the standard deviation across 5 runs. Model Kendall s τ STATNet (Unterthiner et al., 2020) 0.913 0.0012 NP (Zhou et al., 2024b) 0.925 0.0013 HNP (Zhou et al., 2024b) 0.933 0.0019 Monomial-NFN (Tran et al., 2024a) 0.939 0.0004 MAGEP-NFNs (ours) 0.940 0.001 To evaluate the robustness of our model to input transformations under group actions, we augment the Re LU dataset by applying randomly sampled group actions M>0 n . Specifically, we randomly sampling the diagonal elements D>0 n,ii of the matrix D>0 n , with each element drawn from uniform distributions over different ranges, defined as U[1, 10i] for i = 1, 2, 3, 4. To further diversify the transformations, we also randomly sample the permutation matrix Pn. Results. Table 2 illustrates the performance of all models trained on the Re LU subset, where our MAGEP-NFNs model clearly outperforms all other baselines. Notably, it demonstrates robustness to scale and permutation symmetry, similar to Monomial-NFN, while consistently surpassing its performance across both the original and all augmented dataset settings. This suggests that incorporating polynomial layers allows our model to capture more information from the weights across different hidden layers, compared to Monomial-NFN, thereby enhancing expressivity. On the original dataset, our model achieves a Kendall s τ performance gap of 0.007 over other baselines, and maintaining at least a 0.012 advantage in all other augmented settings. Similarly, Table 3 reveals that MAGEP-NFNs achieves the highest Kendall s τ with Tanh activation, further reinforcing its superior accuracy across different network configurations. 4.3. Weight space style editing Experiment setup. In this experiment, we focus on modifying the weights of SIREN (Sitzmann et al., 2020) to modify the image encoded within each model. We utilize the pretrained models from paper (Zhou et al., 2024b), which encode images from the CIFAR-10 and MNIST datasets. Specifically, we address two tasks aimed at modifying the embedded information: enhancing the contrast of CIFAR-10 images and dilating MNIST images encoded in the SIREN models. We report the MSE loss between the images encoded in the modified SIREN network and the ground truth contrast-enhanced CIFAR-10 images or dilated MNIST images. Results. Table 4 demonstrates that our model achieves performance comparable to other baselines. Specifically, MAGEP-NFNs matches the performance of NP and Monomial-NFN in the contrast-enhancing task on the CIFAR-10 dataset. Additionally, our model outperforms Monomial-NFN in the dilation task on the MNIST dataset, while achieving similar results to NP. Interestingly, NP remains a strong candidate in the weight editing tasks, and our model consistently performs on par with NP across both experiments. 4.4. Ablation study on the role of higher-order terms To evaluate the impact of the newly introduced Inter-Layer terms ([W], [WW], [b W], [Wb]), we conduct an ablation study focusing on the invariant task of predicting CNN generalization for the Re LU subset, following the same setting outlined in Subsection 4.2. The results presented in Table 5 clearly show that the inclusion of Inter-Layer terms enhances the network s performance. Notably, the performance improves from a Kendall s τ of 0.929 with only Non Inter-Layer terms to 0.933 when both terms are combined, highlighting a significant boost in overall performance attributed to the incorporation of Inter-Layer terms. Equivariant Polynomial Functional Networks Table 4: Test mean squared error (lower is better) between weight-space editing methods and ground-truth image-space transformations. Uncertainties indicate standard error over 5 runs Contrast (CIFAR-10) Dilate (MNIST) MLP 0.031 0.001 0.306 0.001 NP (Zhou et al., 2024b) 0.020 0.002 0.068 0.002 HNP (Zhou et al., 2024b) 0.021 0.002 0.071 0.001 Monomial-NFN (Tran et al., 2024a) 0.020 0.001 0.069 0.002 MAGEP-NFNs (ours) 0.020 0.001 0.068 0.002 Table 5: Ablation study assessing the importance of higherorder terms on the task of predicting CNN generalization on the Re LU subset. Components Kendall s τ Only Non Inter-Layer terms 0.929 Only Inter-Layer terms 0.932 Non Inter-Layer terms + [W] 0.930 Non Inter-Layer terms + [WW] 0.930 Non Inter-Layer terms + [Wb] 0.931 Non Inter-Layer terms + [b W] 0.931 Non Inter-Layer terms + Inter-Layer terms 0.933 5. Related Work Functional Equivalence of Neural Networks. Hecht Nielsen was provided a foundational perspective on the relationship between weight symmetries and network functionality in (Hecht-Nielsen, 1990). This work has been extended to different network architectures, such as Re LU networks (Bui Thi Mai & Lampert, 2020; Neyshabur et al., 2015; Badrinarayanan et al., 2015; Albertini & Sontag, 1993), and sin or tanh networks (Chen et al., 1993; Fefferman & Markel, 1993; Kurkova & Kainen, 1994). These studies build on earlier insights into convergence, gradient dynamics, and structural properties of neural networks, as explored in (Allen-Zhu et al., 2019; Du et al., 2019; Frankle & Carbin, 2019; Belkin et al., 2019; Novak et al., 2018). Neural Functional Networks. Early NFNs have been proposed to evaluate their generalization capabilities and uncover insights into neural network dynamics (Baker et al., 2018; Eilertsen et al., 2020; Unterthiner et al., 2020; Sch urholt et al., 2021; 2022a;b). These approaches typically involve either flattening the network parameters or deriving parameter statistics for further processing using standard multi-layer perceptrons (MLPs) (Unterthiner et al., 2020; Dupont et al., 2022; Luigi et al., 2023). To involve the symmetric structure of the input neural networks, Sch urholt et al. (2021) introduced neuron permutation augmentations to better align model representations with their functional equivalence. Other studies have expanded on these ideas by focusing on encoding and decoding neural network parameters, primarily for reconstruction and generative modeling (Peebles et al., 2022; Ashkenazi et al., 2023; Knyazev et al., 2021; Erkoc et al., 2023). Equivariant Neural Functional Networks. To achieve NFNs that are equivariant with respect to permutation symmetries of neural networks weight space, several approaches have been used, such as: weight-sharing mechanisms (Navon et al., 2023; Zhou et al., 2024b; Kofinas et al., 2024; Zhou et al., 2024c), set-based (Andreis et al., 2023) or graph-based structures (Lim et al., 2024; Kofinas et al., 2024; Zhou et al., 2024a). Despite these developments, current approaches often overlook additional symmetries present in neural networks, for instance, weight scaling symmetries in Re LU networks and weight sign-flipping symmetries in sin and tanh networks. Recently, NFNs that are equivariant to both permutations and scaling have been introduced in (Kalogeropoulos et al., 2024; Tran et al., 2024a). These networks leverage advanced techniques such as graph-based message-passing mechanisms (Kalogeropoulos et al., 2024) and parametersharing frameworks (Tran et al., 2024a) to extend the scope of equivariant modeling and enhance the expressivity of NFNs. However, the graph-based equivariant NFNs proposed in (Kalogeropoulos et al., 2024) suffer from high memory consumption and significant runtime overhead. While, the Monomial-NFNs constructed using equivariant linear layers and a parameter-sharing mechanism in (Tran et al., 2024a) exhibit limited expressive power. In contrast, our MAGEP-NFNs are built upon equivariant polynomial layers, leveraging a parameter-sharing mechanism that achieves both lower memory consumption and reduced runtime while preserving strong expressivity. 6. Conclusion We have developed MAGEP-NFN, a novel NFN that is equivariant to both permutations and scaling symmetries. Our approach follows a parameter-sharing mechanism; however, unlike previous works, we construct an equivariant polynomial layer that incorporates stable polynomial terms. This polynomial formulation enables us to capture relationships between weights from different input hidden layers, thereby enhancing the expressivity of MAGEP-NFN while maintaining low memory consumption and efficient running time. Experimental results demonstrate that our model achieves competitive performance and efficiency compared to existing baselines. One limitation of the equivariant polynomial layers proposed in this paper is that they are applied to a specific architecture. However, since our method is based on a parametersharing mechanism, it is applicable to other architectures with additional operators (such as layer normalization, softmax, pooling) and other activation functions, provided that the symmetric group of the weight network is known. Equivariant Polynomial Functional Networks Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Albertini, F. and Sontag, E. D. Identifiability of discretetime neural networks. In Proc. European Control Conference, pp. 460 465. Springer Berlin, 1993. Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International conference on machine learning, pp. 242 252. PMLR, 2019. Andreis, B., Bedionita, S., and Hwang, S. J. Set-based neural network encoding. ar Xiv preprint ar Xiv:2305.16625, 2023. Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and De Freitas, N. Learning to learn by gradient descent by gradient descent. Advances in neural information processing systems, 29, 2016. Ashkenazi, M., Rimon, Z., Vainshtein, R., Levi, S., Richardson, E., Mintz, P., and Treister, E. Nern: Learning neural representations for neural networks. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. Open Review.net, 2023. Badrinarayanan, V., Mishra, B., and Cipolla, R. Understanding symmetries in deep networks. ar Xiv preprint ar Xiv:1511.01029, 2015. Baker, B., Gupta, O., Raskar, R., and Naik, N. Accelerating neural architecture search using performance prediction. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings. Open Review.net, 2018. URL https://openreview.net/ forum?id=HJqk3N1v G. Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849 15854, 2019. Bengio, S., Bengio, Y., Cloutier, J., and Gecsei, J. On the optimization of a synaptic learning rule. In Optimality in Biological and Artificial Networks?, pp. 265 287. Routledge, 2013. Bui Thi Mai, P. and Lampert, C. Functional vs. parametric equivalence of relu networks. In 8th International Conference on Learning Representations, 2020. Cao, N. D., Aziz, W., and Titov, I. Editing factual knowledge in language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, pp. 6491 6506. Association for Computational Linguistics, 2021. Chen, A. M., Lu, H.-m., and Hecht-Nielsen, R. On the geometry of feedforward neural network error surfaces. Neural computation, 5(6):910 927, 1993. Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International conference on machine learning, pp. 1675 1685. PMLR, 2019. Dupont, E., Kim, H., Eslami, S. M. A., Rezende, D. J., and Rosenbaum, D. From data to functa: Your data point is a function and you can treat it like one. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 5694 5725. PMLR, 2022. URL https://proceedings. mlr.press/v162/dupont22a.html. Eilertsen, G., J onsson, D., Ropinski, T., Unger, J., and Ynnerman, A. Classifying the classifier: Dissecting the weight space of neural networks. In ECAI 2020 - 24th European Conference on Artificial Intelligence, volume 325 of Frontiers in Artificial Intelligence and Applications, pp. 1119 1126. IOS Press, 2020. URL https://doi.org/10.3233/FAIA200209. Erkoc , Z., Ma, F., Shan, Q., Nießner, M., and Dai, A. Hyperdiffusion: Generating implicit neural fields with weightspace diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14300 14310, 2023. Fefferman, C. and Markel, S. Recovering a feed-forward net from its output. In Advances in Neural Information Processing Systems 6, [7th NIPS Conference, Denver, Colorado, USA, 1993], pp. 335 342. Morgan Kaufmann, 1993. Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019. URL https://openreview.net/forum? id=r Jl-b3Rc F7. Harb, J., Schaul, T., Precup, D., and Bacon, P.-L. Policy evaluation networks. ar Xiv preprint ar Xiv:2002.11833, 2020. Equivariant Polynomial Functional Networks Hecht-Nielsen, R. On the algebraic structure of feedforward network weight spaces. In Advanced Neural Computers, pp. 129 135. Elsevier, 1990. Kalogeropoulos, I., Bouritsas, G., and Panagakis, Y. Scale equivariant graph metanetworks. In Advances in Neural Information Processing Systems 38, 2024. Knyazev, B., Drozdzal, M., Taylor, G. W., and Romero Soriano, A. Parameter prediction for unseen deep architectures. Advances in Neural Information Processing Systems, 34:29433 29448, 2021. Kofinas, M., Knyazev, B., Zhang, Y., Chen, Y., Burghouts, G. J., Gavves, E., Snoek, C. G. M., and Zhang, D. W. Graph neural networks for learning equivariant representations of neural networks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum? id=o O6Fs My DBt. Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical Report 0, University of Toronto, Toronto, Ontario, 2009. URL https://www.cs.toronto.edu/ kriz/learning-features-2009-TR.pdf. Kurkova, V. and Kainen, P. C. Functionally equivalent feedforward neural networks. Neural Computation, 6(3): 543 558, 1994. Le Cun, Y. and Cortes, C. The mnist database of handwritten digits. 2005. URL https://api. semanticscholar.org/Corpus ID:60282629. Lim, D., Maron, H., Law, M. T., Lorraine, J., and Lucas, J. Graph metanetworks for processing diverse neural architectures. In The Twelfth International Conference on Learning Representations, 2024. URL https:// openreview.net/forum?id=ij K5hyxs0n. Luigi, L. D., Cardace, A., Spezialetti, R., Ramirez, P. Z., Salti, S., and Stefano, L. D. Deep learning on implicit neural representations of shapes. In The Eleventh International Conference on Learning Representations, ICLR 2023, 2023. URL https://openreview.net/ forum?id=Oo OIW-3uadi. Metz, L., Harrison, J., Freeman, C. D., Merchant, A., Beyer, L., Bradbury, J., Agrawal, N., Poole, B., Mordatch, I., Roberts, A., et al. Velo: Training versatile learned optimizers by scaling up. ar Xiv preprint ar Xiv:2211.09760, 2022. Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99 106, 2021. Mitchell, E., Lin, C., Bosselut, A., Finn, C., and Manning, C. D. Fast model editing at scale. In The Tenth International Conference on Learning Representations, 2022. URL https://openreview.net/forum? id=0Dc Zxe Wf OPt. Navon, A., Shamsian, A., Achituve, I., Fetaya, E., Chechik, G., and Maron, H. Equivariant architectures for learning in deep weight spaces. In International Conference on Machine Learning, pp. 25790 25816. PMLR, 2023. Neyshabur, B., Salakhutdinov, R. R., and Srebro, N. Pathsgd: Path-normalized optimization in deep neural networks. Advances in neural information processing systems, 28, 2015. Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Sensitivity and generalization in neural networks: an empirical study. ar Xiv preprint ar Xiv:1802.08760, 2018. Peebles, W., Radosavovic, I., Brooks, T., Efros, A. A., and Malik, J. Learning to learn with generative models of neural network checkpoints. ar Xiv preprint ar Xiv:2209.12892, 2022. Ruhe, D., Brandstetter, J., and Forr e, P. Clifford group equivariant neural networks. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum? id=n84bz Mr GUD. Runarsson, T. P. and Jonsson, M. T. Evolution and design of distributed learning rules. In 2000 IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks. Proceedings of the First IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks (Cat. No. 00, pp. 59 63. IEEE, 2000. Sch urholt, K., Kostadinov, D., and Borth, D. Self-supervised representation learning on neural network weights for model characteristic prediction. Advances in Neural Information Processing Systems, 34:16481 16493, 2021. Sch urholt, K., Knyazev, B., Gir o-i Nieto, X., and Borth, D. Hyper-representations as generative models: Sampling unseen neural network weights. Advances in Neural Information Processing Systems, 35:27906 27920, 2022a. Sch urholt, K., Taskiran, D., Knyazev, B., Gir o-i Nieto, X., and Borth, D. Model zoos: A dataset of diverse populations of neural network models. Advances in Neural Information Processing Systems, 35:38134 38148, 2022b. Sinitsin, A., Plokhotnyuk, V., Pyrkin, D. V., Popov, S., and Babenko, A. Editable neural networks. In 8th International Conference on Learning Representations, Equivariant Polynomial Functional Networks 2020. URL https://openreview.net/forum? id=HJed Xa Etv S. Sitzmann, V., Martel, J. N. P., Bergman, A. W., Lindell, D. B., and Wetzstein, G. Implicit neural representations with periodic activation functions. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. Sokota, S., Hu, H., Wu, D. J., Kolter, J. Z., Foerster, J. N., and Brown, N. A fine-tuning approach to belief state modeling. In International Conference on Learning Representations, 2021. Stanley, K. O. Compositional pattern producing networks: A novel abstraction of development. Genetic programming and evolvable machines, 8:131 162, 2007. Tran, H., Vo, T., Huu, T., Nguyen The, A., and Nguyen, T. Monomial matrix group equivariant neural functional networks. In Advances in Neural Information Processing Systems, volume 37, pp. 48628 48665. Curran Associates, Inc., 2024a. Tran, H. V., Nguyen, K. N., Pham, T., Chu, T. T., Le, T., and Nguyen, T. M. Distance-based tree-sliced wasserstein distance. The Thirteenth International Conference on Learning Representations, 2025a. Tran, T., Tran, V.-H., Chu, T., Pham, T., Ghaoui, L. E., Le, T., and Nguyen, T. M. Tree-sliced wasserstein distance with nonlinear projection. ar Xiv preprint ar Xiv:2505.00968, 2025b. Tran, V.-H., Pham, T., Tran, T., Le, T., and Nguyen, T. M. Tree-sliced wasserstein distance on a system of lines. ar Xiv preprint ar Xiv:2406.13725, 2024b. Tran, V.-H., Vo, T. N., Huu, T. T., and Nguyen, T. M. A clifford algebraic approach to e (n)-equivariant high-order graph neural networks. ar Xiv preprint ar Xiv:2410.04692, 2024c. Tran, V.-H., Vo, T. N., The, A. N., Huu, T. T., Nguyen Nhat, M.-K., Tran, T., Pham, D.-T., and Nguyen, T. M. Equivariant neural functional networks for transformers. ar Xiv preprint ar Xiv:2410.04209, 2024d. Tran, V.-H., Chu, T. T., Nguyen, K. N., Pham, T., Le, T., and Nguyen, T. M. Spherical tree-sliced wasserstein distance. ar Xiv preprint ar Xiv:2503.11249, 2025c. Unterthiner, T., Keysers, D., Gelly, S., Bousquet, O., and Tolstikhin, I. Predicting neural network accuracy from weights. ar Xiv preprint ar Xiv:2002.11448, 2020. Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. Co RR, abs/1708.07747, 2017. URL http: //arxiv.org/abs/1708.07747. Zhou, A., Finn, C., and Harrison, J. Universal neural functionals. Advances in neural information processing systems, 2024a. Zhou, A., Yang, K., Burns, K., Cardace, A., Jiang, Y., Sokota, S., Kolter, J. Z., and Finn, C. Permutation equivariant neural functionals. Advances in Neural Information Processing Systems, 36, 2024b. Zhou, A., Yang, K., Jiang, Y., Burns, K., Xu, W., Sokota, S., Kolter, J. Z., and Finn, C. Neural functional transformers. Advances in Neural Information Processing Systems, 36, 2024c. Equivariant Polynomial Functional Networks Supplement to Equivariant Polynomial Functional Networks A. Preliminaries This section contains notations and basic results on matrices and polynomials that will be used throughout the paper. We will mainly focus on matrices with real entries (real matrices) and polynomials with real coefficients (real polynomials). We will omit almost all of the proofs in this section as they are well-known. These results will be use in proofs in the rest of the paper. A.1. Entries of matrices A real matrix A with m rows and n columns is an element of Rm n: A = (aij)1 i m,1 j n = a11 a12 . . . a1n a21 a22 . . . a2n ... ... ... ... am1 am2 . . . amn The entry in the ith-row and the jth-column of A, or the (i, j) entry of A, is denoted by Aij = aij. The ith-row of A and jth-column of A are, respectively, denoted by: Ai = (ai1 , ai2 , . . . , ain) R1 n, A j = (a1j , a2j , . . . , amj) Rm 1. Remark. Sometimes, a comma is added between two subscript indices to make sure there will be no confusion, i.e. Ai,j, ai,j, Ai, , A ,j. Let A(L), . . . , A(2), A(1) be L matrices such that the matrix product: A(L) . . . A(2) A(1), is well-defined. Proposition A.1. The (i, j) entry of A(L) . . . A(2) A(1) is equal to: A(L) . . . A(2) A(1) ij = A(L) i, A(L 1) . . . A(2) A(1) ,j k L 1,...,k2,k1 a(L) i,k L 1 a(L) k L 1,k L 2 . . . a(2) k2,k1 a(1) k1,j. In the case where L = 1, the above equation is simply A(1) ij = a(1) ij . We set a denotation for matrices that have only one nonzero entry with value 1. The matrix with the 1 in the ith-row and the jth-column, and the rest are 0, is denoted by Eij. Matrix Eij can have any shape, but its shape are usually defined by context, and will be omitted without confusion. The product of matrices of this type is presented as below. Proposition A.2. Let Ei1,j1, Ei2,j2, . . . , Ei L,j L be L matrix units such that the product: Ei1,j1 Ei2,j2 . . . Ei L,j L, is well-defined. Then: Ei1,j1 Ei2,j2 . . . Ei L,j L = δj1,i2 δj2,i3 . . . δj L 1,i L Ei1,j L, where δij is the Kronecker delta: ( 0 if i = j, 1 if i = j. We have a direct corollary for E1,1 s. Corollary A.3. We have: E1,1 E1,1 . . . E1,1 = E1,1. Equivariant Polynomial Functional Networks A.2. Evaluation of polynomials Denote R[x1, . . . , xn] be the ring of all polynomials with real coefficients in n indeterminates x1, . . . , xn. Definition A.4. A monomial of R[x1, . . . , xn] is a polynomial of R[x1, . . . , xn] that has one term. Remark. In some contexts, a monomial is defined as a polynomial that has one term with coefficient 1. We will use both of these definitions simultaneously. Proposition A.5. R[x1, . . . , xn] is naturally a vector space over R. It is an infinite-dimensional vector space; moreover, the set of all monomials with coefficient 1 in R[x1, . . . , xn] is a basis for the vector space. Remark. For f R[x1, . . . , xn], by saying monomials in f, we refer to all monomials that appeared in the expression of f. Polynomial evaluation is computing of the value of a polynomial when the indeterminates are substituted for some values. We have the well-known result. Proposition A.6. Let f, g be two polynomials of R[x1, . . . , xn]. If f, g are equal at every evaluations, i.e. f(x1, . . . , xn) = g(x1, . . . , xn) , (x1, . . . , xn) Rn, (14) then f = g. In other words, the only polynomial of R[x1, . . . , xn], that has Rn as its zero set, is the polynomial 0 R[x1, . . . , xn]. Remark. The result still holds if R is replaced by an arbitrary infinite field, but does not hold if R is replaced by a finite field. We have a direct corollary. Corollary A.7. Let f be a nonzero polynomial of R[x1, . . . , xn]. Then there exists (x1, . . . , xn) Rn such that f(x1, . . . , xn) = 0. A.3. Entries of tensors Proposition A.8. Let a = (ai)1 i n and b = (bi)1 i n be two vectors in Rn. If: ai bj + aj bi = 0, (15) for all 1 i, j n, then a = 0 or b = 0. Proof. Assume that both of a and b are not equal to 0, then there exists i, j such that ai and bj are non-zero. From Equation (15), we have: ai bi + ai bi = 0, (16) so ai bi = 0. Since ai is non-zero, then bi = 0. It implies that: ai bj + aj bi = ai bj + 0 = ai bj = 0, (17) which contradicts to Equation (15). So at least one of a and b is equal to 0. Proposition A.9. Let A = (aij)1 i m,1 j n and B = (bij)1 i m,1 j n be two matrices in Rm n. If: aij bkl + akj bil + ail bkj + akl bij = 0, (18) for all 1 i, k m and 1 j, l n, then A = 0 or B = 0. Proof. Consider Equation (18) when 1 j = l n, we have: 0 = aij bkj + akj bij + aij bkj + akj bij (19) = 2 (aij bkj + akj bij) , (20) which means: aij bkj + akj bij = 0. (21) Equivariant Polynomial Functional Networks This holds for all 1 i, k m. Apply Proposition A.8, we have aij = 0 for all 1 i m, or bij = 0 for all 1 i m, which means A ,j = 0 or B ,j = 0. This holds for all 1 j n. Similarly, we have Ai, = 0 or Bi, = 0 for 1 i m. Now, assume that, both of A and B are not equal to 0, then there exists i, j and k, l such that aij and bkl are non-zero. By previous observation, we have Bi, = B ,j = Ak, = A ,l = 0. It implies that: aij bkl + akj bil + ail bkj + akl bij = aij bkl + 0 + 0 + 0 = aij bkl = 0, (22) which contradicts to Equation (18). So at least one of A and B is equal to 0. Proposition A.8 and Proposition A.9 are, respectively, one-dimensional and two-dimensional cases. By using the same arguments, we will obtain the d-dimensional version belows. Proposition A.10. Let d be a positive integer and n1, n2, . . . , nd be d positive integers. Let: A = (Ai1,i2,...,id)1 i1 n1,1 i2 n2,...,1 id nd Rn1 n2 ... nd, B = (Bi1,i2,...,id)1 i1 n1,1 i2 n2,...,1 id nd Rn1 n2 ... nd. If for all 1 i0 1, i1 1 n1, 1 i0 2, i1 2 n2, . . . , 1 i0 d, i1 d nd, we have: X (α1,...,αd) {0,1}d Aiα1 1 ,iα2 2 ,...,i αd d Bi1 α1 1 ,i1 α2 2 ,...,i 1 αd d then A = 0 or B = 0. B. Stable Polynomial Terms Intuitively, a stable polynomial term is a polynomial in the entries of U U that is stable under the action of G (see Definition B.1 below). The equivariant polynomial layers we aim to construct are linear combinations of these stable polynomial terms. In Subsection B.1, we provide a formal definition for stable polynomial terms as well as their properties. We will study the linear dependence of stable polynomial terms in the language of polynomial rings with real coefficients in Subsections B.2 and B.3. These properties play a central role in determining the parameter-sharing computation of equivariant polynomial layers in the next section. B.1. Definitions and basic properties Recall the weight space U given by: U = W B, where: W = Rw L n L n L 1 . . . Rw2 n2 n1 Rw1 n1 n0, B = Rb L n L 1 . . . Rb2 n2 1 Rb1 n1 1. Let us consider the case where the weight spaces have the same number of dimensions across all channels, which means wi = bi = d for all i. Definition B.1 (Stable polynomial terms). Let U = ([W], [b]) be an element of U with weights [W] = [W](L), . . . , [W](1) and biases [b] = [b](L), . . . , [b](1) . For each L s > t 0, we define: [W](s,t) := [W](s) [W](s 1) . . . [W](t+1) Rd ns nt, [Wb](s,t)(t) := [W](s,t) [b](t) Rd ns 1. (24) In addition, for each L s, t 0, and matrices Ψ(s)(L,t) R1 n L and Ψ(s,0)(L,t) Rn0 n L, we also define [b W](s)(L,t) := [b](s) Ψ(s)(L,t) [W](L,t) Rd ns nt, [WW](s,0)(L,t) := [W](s,0) Ψ(s,0)(L,t) [W](L,t) Rd ns nt. (25) Equivariant Polynomial Functional Networks The entries of the matrices [W](s,t), [Wb](s,t)(t), [b W](s)(L,t) and [WW](s,0)(L,t) defined above are called stable polynomial terms of U under the action of G. The following observations are direct implications from the definition. For all L s > t > r 0: [W](s,s 1) = [W](s) Rd ns ns 1, (26) [W](s,t) [W](t,r) = [W](s,r) Rd ns nr, (27) by definition. For g = g(L), . . . , g(0) GU: [g W](s,t) = g(s) [W](s,t) g(t) 1 Rd ns nt. (28) If g G, then: [g W](L,t) = [W](L,t) g(t) 1 Rd n L nt (29) [g W](s,0) = g(s) [W](s,0) Rd ns n0. (30) For all L s > t > 0, we have [g W](s,t) [gb](t) = g(s) [W](s,t) [b](t) Rd ns 1 (31) For all L s > 0, L > t 0 and Ψ(s,0)(L,t) Rd n0 n L, we have: [g W](s,0) Ψ(s,0)(L,t) [g W](L,t) = g(s) [W](s,0) Ψ(s,0)(L,t) [W](L,t) g(t) 1 Rd ns nt. (32) In particular, if t = s 1, we have: [g W](s,0) Ψ(s,0)(L,s 1) [g W](L,s 1) = g(s) [W](s,0) Ψ(s,0)(L,s 1) [W](L,s 1) g(s 1) 1 Rd ns ns 1. (33) For all L s > 0 and Ψ(s)(L,t) Rd 1 n L, we have: [gb](s) Ψ(s)(L,t) [g W](L,t) = g(s) [b](s) Ψ(s)(L,t) [W](L,t) g(t) 1 Rd ns nt. (34) Based on the above observations, we can determine the image of the stable polynomial terms under the action of an element g GU as follows: [g W](s,t) = g(s) [W](s,t) g(t) 1 , [gb](s) = g(s) [b](s), [g Wgb](s,t)(t) = g(s) [Wb](s,t)(t), [gbg W](s)(L,t) = g(s) [b W](s)(L,t) g(t) 1 , [g Wg W](s,0)(L,t) = g(s) [WW](s,0)(L,t) g(t) 1 . Equivariant Polynomial Functional Networks In concrete, we have: [g W](s,t) pq = d(s) p d(t) q [W](s,t) π 1 s (p),π 1 t (q), [gb](s) p = d(s) p [b](s) π 1 s (p), [g Wgb](s,t)(t) p = d(s) p [Wb](s,t)(t) π 1 s (p), [gbg W](s)(L,t) pq = d(s) p d(t) q [b W](s)(L,t) π 1 s (p),π 1 t (q), [g Wg W](s,0)(L,t) = d(s) p d(t) q [WW](s,0)(L,t) π 1 s (p),π 1 t (q). B.2. Input weights as indeterminates To simplify the technical difficulties, we consider the weight space U in the case where d = 1, i.e., U = W B, where: W = Rn L n L 1 . . . Rn2 n1 Rn1 n0, B = Rn L 1 . . . Rn2 1 Rn1 1. We introduce the set I consists of indeterminates defined by: I := {x(i) jk : 1 i L, 1 j ni, 1 k ni 1} {y(i) j : 1 i L, 1 j ni}. We have |I| = dim U. Denote R = R[I], which is the ring of all polynomials with indeterminates are all elements of I. For 1 i L, we define: [W](i) := x(i) jk 1 j ni,1 k ni 1 Rni ni 1, [b](i) := y(i) j 1 j ni Rni 1, and [W](s,t) := [W](s) [W](s 1) . . . [W](t+1) Rns nt, [Wb](s,t)(t) := [W](s,t) [b](t) Rns 1, [b W](s)(L,t) := [b](s) Ψ(s)(L,t) [W](L,t) Rns nt, [WW](s,0)(L,t) := [W](s,0) Ψ(s,0)(L,t) [W](L,t) Rns nt. with feasible indices (s, t). The coefficients Ψ( ) s are fixed real matrices and they are omitted from the notations. Note that the entries of these matrices are stable polynomial terms in which the entries of U are now viewed as indeterminates of the polynomial ring R. B.3. Linear dependence of stable polynomial terms In this subsection, we derive a necessary condition for the coefficients Φ and Ψ such that the following linear combination of stable polynomial terms are identically zero: α(Φ, Ψ) := X q=1 Φ(s,t):pq [W](s,t) pq + X p=1 Φ(s):p [b](s) p p=1 Φ(s,t)(t):p [Wb](s,t)(t) p Equivariant Polynomial Functional Networks q=1 Φ(s)(L,t):pq [b W](s)(L,t) pq q=1 Φ(s,0)(L,t):pq [WW](s,0)(L,t) pq + Φ1 1. (35) Here, α is parameterized by Φ and Ψ, where Φ is a collection of real scalars Φ s appeared in the linear combination and Ψ is a collection of real matrices Ψ s that be used to define [b W]( ) s and [WW]( ) s. The index of each scalar Φ naturally presents its corresponding polynomial in α(Φ, Ψ). This necessary and sufficient condition enables us to determine the equivariant polynomial map via parameter sharing later. We first take a look at entries of [W]( ) s, [b]( ) s, [Wb]( ) s, [b W]( ) s, [WW]( ) s. It is clear that for one of these matrices, its entries are homogeneous polynomials with the same degree. For example: [W]( ): The polynomial [W](s,t) pq has degree s t. All of its monomial terms consist of one x(i) for each s i > t. [b]( ): The polynomial [b](s) p has degree 1. All of its monomial terms consist of one ys . [Wb]( ): The polynomial [Wb](s,t)(t) p has degree s t + 1. All of its monomial terms consist of one x(i) for each s i > t and one y(t) . [b W]( ): The polynomial [b W](s)(L,t) p has degree L t + 1. All of its monomial terms consist of one y(s) and one x(i) for each L i > t. [WW]( ): The polynomial [WW](s,0)(L,t) pq has degree L + s t. All of its monomial terms consist of one x(i) for each s i > 0 and one x(i) for each L i > t. 1: The polynomial 1 R. By these above observations, we have: [W]( ), [WW]( ) : Each of the polynomials [W]( ) s and [WW]( ) s is 0 or a non-constant element in R, and it is a real polynomial of at least one indeterminate from x( ) s. [b]( ): Each of the polynomials [b]( ) s is a non-constant element in R, and it is a real polynomial of one indeterminate from y( ) s. [Wb]( ), [b W]( ): Each of the polynomials [Wb]( ) s and [b W]( ) s is 0 or a non-constant element in R, and it is a real polynomial of at least one indeterminate from x( ) s and one indeterminate from y( ) s. Therefore, if α(Φ, Ψ) = 0, we must have q=1 Φ(s,t):pq [W](s,t) pq + X q=1 Φ(s,0)(L,t):pq [WW](s,0)(L,t) pq , (36) p=1 Φ(s):p [b](s) p , (37) p=1 Φ(s,t)(t):p [Wb](s,t)(t) p + X q=1 Φ(s)(L,t):pq [b W](s)(L,t) pq , (38) 0 = Φ1 1. (39) Equivariant Polynomial Functional Networks We induce the constraints on Φ and Ψ in these above equations by using the fact that a set of distinct monomials of R is a linear independent set (see Proposition A.5). Equation (36): Observe that If L s > t 0 and (s, t) = (L, 0), then the monomials x(s) . . . x(t+1) s only appear in the polynomials [W](s,t) s. They do not appear in the polynomials [W](s ,t ) s for all pairs (s , t ) = (s, t), and do not appear in the polynomials [WW](s ,0)(L,t ) s for all pairs (s , t ). If L s > 0, L > t 0, and s = t, then the monomials x(s) . . . x(1) x(L) . . . x(t+1) s only appear in the polynomials [WW](s,0)(L,t) s. They do not appear in the polynomials [W](s ,t ) s for all pairs (s , t ), and do not appear in the polynomials [WW](s ,0)(L,t ) s for all pairs (s , t ) = (s, t). So from Equation (36), it implies that q=1 Φ(s,t):pq [W](s,t) pq , (40) for all (s, t) = (L, 0), and q=1 Φ(s,0)(L,t):pq [WW](s,0)(L,t) pq , (41) for all (s, t) that s = t. The rest of the terms in Equation (36) is q=1 Φ(L,0):pq [W](L,0) pq + X q=1 Φ(s,0)(L,s):pq [WW](s,0)(L,s) pq . (42) Equation (37): Since the set of all [b]( ) s, which is the set of all monomials y( ) s, is a linear independent set in R, it implies that 0 = Φ(s):p, (43) for all L s > 0 and 1 p ns. Equation (38): Observe that If L > s > t > 0, then the monomials x(s) . . . x(t+1) y(t) s only appear in the polynomials [Wb](s,t)(t) s. They do not appear in the polynomials [Wb](s ,t )(t ) s for all L s > t > 0 that (s , t ) = (s, t), and do not appear in polynomials [b W](s )(L,t ) s for all L s > 0 and L > t 0. If L s > 0, L > t 0, and s = t, then the monomials y(s) x(L) . . . x(t+1) s only appear in the polynomials [b W](s)(L,t) s. They do not appear in the polynomials [Wb](L,t )(t ) s for all L > t > 0, and do not appear in the polynomials [b W](s )(L,t ) s for all pairs (s , t ) = (s, t). If L > t > 0, then the monomials y(t) x(L) . . . x(t+1) s only appear in the polynomials [Wb](L,t)(t) s and appear in the polynomials [b W](t)(L,t) s. They do not appear in the polynomials [Wb](L,t )(t ) s for all L > t > 0 that t = t, and do not appear in polynomials [b W](s )(L,t ) s for all pair (s , t ) that t = t. So from Equation (38), it implies that p=1 Φ(s,t)(t):p [Wb](s,t)(t) p , (44) Equivariant Polynomial Functional Networks for all L > s > t > 0, and q=1 Φ(s)(L,t):pq [b W](s)(L,t) pq (45) for all (s, t) that s = t, and p=1 Φ(L,t)(t):p [Wb](L,t)(t) p + q=1 Φ(t)(L,t):pq [b W](t)(L,t) pq (46) for all L > t > 0. Equation (39): Clearly, it implies that 0 = Φ1. (47) There are 8 equations, Equations (40)-(47), that are derived. In Equations (43) and (47), the corresponding Φ s are directly characterized, and in Equations (40), (41), (42), (44), (45), (46), the corresponding Φ s are not. We will characterize the Φ s and Ψ s in Equations (40), (41), (44) and (45) below by Lemma B.2, Lemma B.3 and Lemma B.5, respectively. These lemma are stated as below (their proofs will be postponed to Section B.4). Lemma B.2. For a pair (s, t) such that L s > t 0 and (s, t) = (L, 0), if q=1 Φ(s,t):pq [W](s,t) pq , (48) then Φ(s,t):pq = 0 for all p, q. Lemma B.3. For a pair (s, t) such that L s > 0, L > t 0, if q=1 Φ(s,0)(L,t):pq [WW](s,0)(L,t) pq , (49) then Φ(s,0)(L,t):pq = 0 for all p, q or Ψ(s,0)(L,t) = 0. Lemma B.4. For a pair (s, t) such that L s > t > 0, if p=1 Φ(s,t)(t):p [Wb](s,t)(t) p , (50) then Φ(s,t)(t):p = 0 for all p. Lemma B.5. For a pair (s, t) such that L s > 0, L > t 0, if q=1 Φ(s)(L,t):pq [b W](s)(L,t) pq , (51) then Φ(s)(L,t):pq = 0 for all p, q or Ψ(s)(L,t) = 0. Remark. The reason that we skip the characterizations of Φ s and Ψ s in Equations (42) and (46) is they can be concretely characterized. For instance, consider the case when n L = . . . = n2 = n1 = n0 = 1. From Equation (42), we have q=1 Φ(L,0):pq [W](L,0) pq + X q=1 Φ(s,0)(L,s):pq [WW](s,0)(L,s) pq = Φ(L,0):1,1 [W](L,0) 1,1 + X L>s>0 Φ(s,0)(L,s):1,1 [WW](s,0)(L,s) 1,1 Equivariant Polynomial Functional Networks = Φ(L,0):1,1 x(L) 1,1 . . . x(2) 1,1 x(1) 1,1 L>s>0 Φ(s,0)(L,s):1,1 x(s) 1,1 . . . x(2) 1,1 x(1) 1,1 Ψ(s,0)(L,s) x(L) 1,1 . . . x(s+2) 1,1 x(s+1) 1,1 = Φ(L,0):1,1 x(L) 1,1 . . . x(2) 1,1 x(1) 1,1 L>s>0 Φ(s,0)(L,s):1,1 Ψ(s,0)(L,s) x(s) 1,1 . . . x(2) 1,1 x(1) 1,1 x(L) 1,1 . . . x(s+2) 1,1 x(s+1) 1,1 = Φ(L,0):1,1 x(L) 1,1 . . . x(2) 1,1 x(1) 1,1 L>s>0 Φ(s,0)(L,s):1,1 Ψ(s,0)(L,s) x(L) 1,1 . . . x(2) 1,1 x(1) 1,1 Φ(L,0):1,1 + X L>s>0 Φ(s,0)(L,s):1,1 Ψ(s,0)(L,s) ! x(L) 1,1 . . . x(2) 1,1 x(1) 1,1 . It implies that 0 = Φ(L,0):1,1 + X L>s>0 Φ(s,0)(L,s):1,1 Ψ(s,0)(L,s). (52) From Equation (52), we can not derive a more concrete relation on the Φ s and Ψ s. Similarly, from Equation (46), we have p=1 Φ(L,t)(t):p [Wb](L,t)(t) p + q=1 Φ(t)(L,t):pq [b W](t)(L,t) pq = Φ(L,t)(t):1 [Wb](L,t)(t) 1 + Φ(t)(L,t):1,1 [b W](t)(L,t) 1,1 = Φ(L,t)(t):1 x(L) 1,1 . . . x(t+2) 1,1 x(t+1) 1,1 y(t+1) 1,1 + Φ(t)(L,t):1,1 y(t+1) 1,1 Ψ(t)(L,t) x(L) 1,1 . . . x(t+2) 1,1 x(t+1) 1,1 = Φ(L,t)(t):1 x(L) 1,1 . . . x(t+2) 1,1 x(t+1) 1,1 y(t+1) 1,1 + Φ(t)(L,t):1,1 Ψ(t)(L,t) y(t+1) 1,1 x(L) 1,1 . . . x(t+2) 1,1 x(t+1) 1,1 = Φ(L,t)(t):1 x(L) 1,1 . . . x(t+2) 1,1 x(t+1) 1,1 y(t+1) 1,1 + Φ(t)(L,t):1,1 Ψ(t)(L,t) x(L) 1,1 . . . x(t+2) 1,1 x(t+1) 1,1 y(t+1) 1,1 = Φ(L,t)(t):1 + Φ(t)(L,t):1,1 Ψ(t)(L,t) x(L) 1,1 . . . x(t+2) 1,1 x(t+1) 1,1 y(t+1) 1,1 It implies that 0 = Φ(L,t)(t):1 + Φ(t)(L,t):1,1 Ψ(t)(L,t). (53) From Equation (53), we can not derive a more concrete relation on the Φ s and Ψ s. Combining the discussions above, we obtain the following necessary for the coefficients Φ and Ψ such that α(Φ, Ψ) = 0. Theorem B.6. Let α(Φ, Ψ) be a polynomial given in equation 35. If α(Φ, Ψ) = 0, then the following condition holds: 1. For all L s > t 0 with (s, t) = (L, 0), and for all p, q, we have Φ(s,t):pq = 0. (54) Equivariant Polynomial Functional Networks 2. For all L s > 0, L > t 0 with s = t, we have Φ(s,0)(L,t):pq = 0, (55) for all p, q, or Ψ(s,0)(L,t) = 0. (56) q=1 Φ(L,0):pq [W](L,0) pq + X q=1 Φ(s,0)(L,s):pq [WW](s,0)(L,s) pq . (57) 4. For all L > s > t > 0, and for all p, we have Φ(s,t)(t):p = 0. (58) 5. For all L s > 0, L > t 0 with s = t, we have Φ(s)(L,t):pq = 0, (59) for all p, q, or Ψ(s)(L,t) = 0. (60) 6. For all L > t > 0, we have p=1 Φ(L,t)(t):p [Wb](L,t)(t) p + q=1 Φ(t)(L,t):pq [b W](t)(L,t) pq . (61) 7. For all L s > 0 and for all p, we have Φ(s):p = 0. (62) Φ1 = 0. (63) Proof. By the previous observations, α(Φ, Ψ) = 0 implies four Equations (36)-(39). These equations imply Equations (40)- (47). The proof of all parts in Theorem B.6 are as follows. 1. It comes from Equation (40) and Lemma B.2. 2. It comes from Equation (41) and Lemma B.3. 3. It comes from Equation (42). 4. It comes from Equation (44) and Lemma B.4 5. It comes from Equation (45) and Lemma B.5. 6. It comes from Equation (46). 7. It comes from Equation (43). 8. It comes from Equation (47). Equivariant Polynomial Functional Networks The proof is finsihed. The following corollary is a direct consequence of Theorem B.6. Corollary B.7. If for some Φ, Φ , and Ψ, we have α(Φ, Ψ) = α(Φ , Ψ), then: 1. For all L s > t 0 with (s, t) = (L, 0), and for all p, q, we have Φ(s,t):pq = Φ (s,t):pq. (64) 2. For all L s > 0, L > t 0 with s = t, we have Φ(s,0)(L,t):pq = Φ (s,0)(L,t):pq, (65) for all p, q, or Ψ(s,0)(L,t) = 0. (66) q=1 Φ(L,0):pq [W](L,0) pq + X q=1 Φ(s,0)(L,s):pq [WW](s,0)(L,s) pq (67) q=1 Φ (L,0):pq [W](L,0) pq + X q=1 Φ (s,0)(L,s):pq [WW](s,0)(L,s) pq . (68) 4. For all L > s > t > 0, and for all p, we have Φ(s,t)(t):p = Φ (s,t)(t):p. (69) 5. For all L s > 0, L > t 0 with s = t, we have Φ(s)(L,t):pq = Φ (s)(L,t):pq, (70) for all p, q, or Ψ(s)(L,t) = 0. (71) 6. For all L > t > 0, we have p=1 Φ(L,t)(t):p [Wb](L,t)(t) p + q=1 Φ(t)(L,t):pq [b W](t)(L,t) pq (72) p=1 Φ (L,t)(t):p [Wb](L,t)(t) p + q=1 Φ (t)(L,t):pq [b W](t)(L,t) pq . (73) 7. For all L s > 0 and for all p, we have Φ(s):p = Φ (s):p. (74) Φ1 = Φ 1. (75) Proof. Since 0 = α(Φ, Ψ) α(Φ , Ψ) = α(Φ Φ , Ψ), so the results come directly from Theorem B.6. Equivariant Polynomial Functional Networks B.4. Proofs of Lemmas B.2-B.5 The proofs of Lemmas B.2 through B.5 are directly from the coefficient comparison of two equal polynomials and the following lemma. We omit the proofs of Lemmas B.2 through B.5 and show only the proof for Lemma B.8. Lemma B.8. For a feasible tuple (s, t, p, q), if [WW](s,0)(L,t) pq = [W](s,0) Ψ(s,0)(L,t) [W](L,t) pq = 0 R. (76) Then Ψ(s,0)(L,t) = 0 Rn0 n L. Proof. We first consider the case where s = L and t = 0, then the case s t, and finally the case s > t. Note that, the proof for the last case will be by combining the arguments of the first two cases. Case s = L and t = 0. From Equation (76), we have [WW](L,0)(L,0) pq = [W](L,0) Ψ(L,0)(L,0) [W](L,0) pq = 0, (77) which means [W](L) . . . [W](1) Ψ(L,0)(L,0) [W](L) . . . [W](1) pq = 0. (78) Here, we consider three cases, where L = 1, L = 2 and L 3. Case L = 1. Equation (78) becomes [W](1) Ψ(1,0)(1,0) [W](1) pq = 0. (79) By Proposition A.1, this is equivalent to [W](1) p, Ψ(1,0)(1,0) [W](1) ,q = 0, (80) j=0 Ψ(1,0)(1,0) ij x(1) p,i x(1) j,q = 0. (81) Since the LHS of Equation (81) is a linear combination between distinct monomials x(1) p,i x(1) j,q, (82) for 1 i n0 and 1 j n1, so it implies that Ψ(1,0)(1,0) ij = 0, (83) for all 1 i n0 and 1 j n1, which means Ψ(1,0)(1,0) = 0. (84) Case L = 2. Equation (78) becomes [W](2) [W](1) Ψ(2,0)(2,0) [W](2) [W](1) pq = 0. (85) By Proposition A.1, this is equivalent to [W](2) p, [W](1) Ψ(2,0)(2,0) [W](2) [W](1) ,q = 0, (86) Equivariant Polynomial Functional Networks which is [W](2) p, [W](1) ,1 , [W](2) p, [W](1) ,2 , . . . , [W](2) p, [W](1) ,n0 Ψ(2,0)(2,0) (87) [W](2) 1, [W](1) ,q , [W](2) 2, [W](1) ,q , . . . , [W](2) n2, [W](1) ,q = 0, (88) j=0 Ψ(2,0)(2,0) ij [W](2) p, [W](1) ,i [W](2) j, [W](1) ,q = 0. (89) Since the LHS of Equation (89) is a linear combination between elements of a linear independent collection of R, which are: [W](2) p, [W](1) ,i [W](2) j, [W](1) ,q , (90) for 1 i n0 and 1 j n2, so it implies that Ψ(2,0)(2,0) ij = 0, (91) for all 1 i n0 and 1 j n2, which means Ψ(2,0)(2,0) = 0. (92) Case L 3. Equation (78) becomes [W](L) . . . [W](1) Ψ(L,0)(L,0) [W](L) . . . [W](1) pq = 0. (93) It is noted that the matrices [W](L 1), . . . , [W](2) can be substituted for some matrix units at 1st-row and 1st-column such that the product [W](L 1) . . . [W](2) (94) becomes a matrix unit at 1st-row and 1st-column. Substitute this in Equation (93), we have [W](L) E11 [W](1) Ψ(L,0)(L,0) [W](L) E11 [W](1) pq = 0. (95) By Proposition A.1, this is equivalent to [W](L) p, E11 [W](1) Ψ(L,0)(L,0) [W](L) E11 [W](1) ,q = 0. (96) which can be rewritten as [W](L) p,1 [W](1) 1, Ψ(L,0)(L,0) [W](L) ,1 [W](1) 1,q = 0, (97) x(L) p,1 [W](1) 1, Ψ(L,0)(L,0) [W](L) ,1 x(1) 1,q = 0. (98) Since the LHS of Equation (98) is a polynomial in R, we have [W](1) 1, Ψ(L,0)(L,0) [W](L) ,1 = 0, (99) j=0 Ψ(L,0)(L,0) ij x(1) 1,i x(L) j,1 = 0. (100) Equivariant Polynomial Functional Networks Since the LHS of Equation (100) is a linear combination between distinct monomials x(1) 1,i x(L) j,1 , (101) for 1 i n0 and 1 j n L, so it implies that Ψ(L,0)(L,0) ij = 0, (102) for all 1 i n0 and 1 j n L, which means Ψ(L,0)(L,0) = 0. (103) In conclusion, for all cases of L, we have Ψ(L,0)(L,0) = 0. (104) We finish the proof of the case where s = L and t = 0. Case s t. From Equation (76), we have [WW](s,0)(L,t) pq = [W](s,0) Ψ(s,0)(L,t) [W](L,t) pq = 0, (105) which means [W](s) . . . [W](1) Ψ(s,0)(L,t) [W](L) . . . [W](t+1) pq = 0. (106) Since s t, the [W] s that are in the product [W](s) . . . [W](1), and the [W] s that are in the product [W](L) . . . [W](t+1), are distinct. By directly multiplying [W](s) . . . [W](1) and [W](L) . . . [W](t+1), we can write these two products in the forms [W](s) . . . [W](1) = f (s) ij 1 i ns,1 j n0 Rns n0, [W](L) . . . [W](t+1) = g(t) ij 1 i n L,1 j nt Rn L nt, (107) where all f (s) s are nonzero and all g(t) s are nonzero. Moreover, f (s) s are real polynomials of indeterminates x(1) s, x(2) s, . . ., x(s) s. Similarly, g(t) s are real polynomials of indeterminates x(L) s, x(L 1) s, . . ., x(t+1) s. Now, Equation (106) is equal to f (s) ij 1 i ns,1 j n0 Ψ(s,0)(L,t) g(t) ij 1 i n L,1 j nt pq = 0. (108) By Proposition A.1, this is equivalent to f (s) p,1 , f (s) p,2 , . . . , f (s) p,n0 Ψ(s,0)(L,t) g(t) 1,q , g(t) 2,q , . . . , f (t) n L,q = 0, (109) j=0 Ψ(s,0)(L,t) ij f (s) p,i g(t) j,q = 0. (110) Since the LHS of Equation (110) is a linear combination between elements of a linear independent collection of R, which are f (s) p,i g(t) j,q, (111) Equivariant Polynomial Functional Networks for 1 i n0 and 1 j n L. The linear dependency comes from the distinction between indeterminates of f ( ) and g( ) . It implies that Ψ(s,0)(L,t) ij = 0, (112) for all 1 i n0 and 1 j n L, which means Ψ(s,0)(L,t) = 0. (113) We finish the proof of the case where s t. Case s > t. From Equation (76), we have: [WW](s,0)(L,t) pq = [W](s,0) Ψ(s,0)(L,t) [W](L,t) pq = 0, (114) which means [W](s) . . . [W](1) Ψ(s,0)(L,t) [W](L) . . . [W](t+1) pq = 0. (115) Assume that Ψ(s,0)(L,t) = 0 Rn0 n L. Observe that [W](s) . . . [W](1) Ψ(s,0)(L,t) [W](L) . . . [W](t+1) (116) = [W](s) . . . [W](t+1) [W](t) . . . [W](1) (117) Ψ(s,0)(L,t) [W](L) . . . [W](s+1) [W](s) . . . [W](t+1) = [W](s) . . . [W](t+1) (118) [W](t) . . . [W](1) Ψ(s,0)(L,t) [W](L) . . . [W](s+1) [W](s) . . . [W](t+1) . In the second term of Equation (118), [W](t) . . . [W](1) Ψ(s,0)(L,t) [W](L) . . . [W](s+1). (119) Since s > t, all [W]( ) s in Equation (119) are distinct. And since Ψ(s,0)(L,t) = 0 Rn0 n L, with the same argument as in Case s t, we have [W](t) . . . [W](1) Ψ(s,0)(L,t) [W](L) . . . [W](s+1) = 0 Rnt ns. (120) Moreover, it can be written in the form [W](t) . . . [W](1) Ψ(s,0)(L,t) [W](L) . . . [W](s+1) (121) 1 i nt,1 j ns Rnt ns, (122) where all c(st) s are real polynomials of indeterminates x(1) s, x(2) s, . . ., x(t) s and x(L) s, x(L 1) s, . . ., x(s+1) s, and at least one of them is a nonzero element in R. By Corollary A.7, indeterminates x(1) s, x(2) s, . . ., x(t) s and x(L) s, x(L 1) s, . . ., x(s+1) s can be substituted for some real values to make [W](t) . . . [W](1) Ψ(s,0)(L,t) [W](L) . . . [W](s+1), (123) Equivariant Polynomial Functional Networks become a nonzero matrix of Rnt ns. We denote this nonzero matrix by Ψ (s,0)(L,t) Rnt ns. Note that, since s > t, the substitution only applies for indeterminates in [W](1), [W](2), . . . , [W](t) and [W](L), [W](L 1), . . . , [W](s+1). In other words, it does not apply for [W](s), . . . , [W](t+1). So, with the above substitution, Equation (116) becomes [W](s) . . . [W](1) Ψ(s,0)(L,t) [W](L) . . . [W](t+1) = [W](s) . . . [W](t+1) Ψ (s,0)(L,t) [W](s) . . . [W](t+1). (124) Note that, since s > t, there is at least one [W]( ) in the product [W](s) . . . [W](t+1). Combine with Ψ (s,0)(L,t) = 0, by applying the argument of Case s = L and t = 0, we have [W](s) . . . [W](1) Ψ(s,0)(L,t) [W](L) . . . [W](t+1) = [W](s) . . . [W](t+1) Ψ (s,0)(L,t) [W](s) . . . [W](t+1) pq = 0, (125) which contradicts to Equation (115). In conclusion Ψ(s,0)(L,t) = 0. (126) We finish the proof of the case where s > t. In summary, we did consider all possible cases. The proof is finished. C. Equivariant Polynomial Layers We now proceed to construct a G-equivariant polynomial layer, denoted as E. These layers serve as the fundamental building blocks for our MAGEP-NFNs. Our strategy is as follows: we first express E as a polynomial layer that is a linear combination of stable polynomial terms (Subsection C.1). We then find the equivariant maps among these polynomial layers using the parameter sharing mechanism. Equivariance in Machine Learning is used in various context, such as Euclidean equivariance (Tran et al., 2024c; Ruhe et al., 2023), equivariant metanetwork (Tran et al., 2024d; Zhou et al., 2024c;b), Optimal Transport (Tran et al., 2024b; 2025a;c;b), etc. C.1. Equivariant layer as a linear combination of stable polynomial terms For two weight spaces U and U with the same number of layers L as well as the same number of channels at ith layer ni, U = W B, where: W = Rw L n L n L 1 . . . Rw2 n2 n1 Rw1 n1 n0, B = Rb L n L 1 . . . Rb2 n2 1 Rb1 n1 1, U = W B , where: W = Rw L n L n L 1 . . . Rw 2 n2 n1 Rw 1 n1 n0, B = Rb L n L 1 . . . Rb 2 n2 1 Rb 1 n1 1. We want to build a map E : U ! U such that E is G-equivariant, where: G = {id Gn L} G>0 n L 1 . . . G>0 n1 {id Gn0}. (127) Let us consider a polynomial map E : U ! U such that, for input U = ([W], [b]), each entry of the output E(U) = ([E(W)], [E(b)]) is a linear combinations of stable polynomial terms, i.e the entries of [W](s,t), [b](s), [Wb](s,t)(t), [b W](s)(L,t), [WW](s,0)(L,t), together with a bias. In concrete: [E(W)](i) jk := X q=1 Φ(i):jk (s,t):pq [W](s,t) pq + X p=1 Φ(i):jk (s):p [b](s) p Equivariant Polynomial Functional Networks p=1 Φ(i):jk (s,t)(t):p [Wb](s,t)(t) p q=1 Φ(i):jk (s)(L,t):pq [b W](s)(L,t) pq q=1 Φ(i):jk (s,0)(L,t):pq [WW](s,0)(L,t) pq + Φ(i):jk 1 , [E(b)](i) j := X q=1 Φ(i):j (s,t):pq [W](s,t) pq + X p=1 Φ(i):j (s):p [b](s) p p=1 Φ(i):j (s,t)(t):p [Wb](s,t)(t) p q=1 Φ(i):j (s)(L,t):pq [b W](s)(L,t) pq q=1 Φ(i):j (s,0)(L,t):pq [WW](s,0)(L,t) pq + Φ(i):j 1 . All Φ s are in Rd d, except the biases Φ 1 s are in Rd 1. In summary, E is parameterized by Φ s and Ψ s. In order to be G-equivariant, the polynomial map E must satisfy the condition E(g U) = g E(U) for all g GU and U U. In the following subsections, we derive the computations of E(g U) and g E(U), and compare them in order to obtain all possible G-equivariant polynomial maps among those considered. C.2. Compute E(g U) [E(g W)](i) jk = X q=1 Φ(i):jk (s,t):pq [g W](s,t) pq p=1 Φ(i):jk (s):p [gb](s) p p=1 Φ(i):jk (s,t)(t):p [g Wgb](s,t)(t) p q=1 Φ(i):jk (s)(L,t):pq [gbg W](s)(L,t) pq q=1 Φ(i):jk (s,0)(L,t):pq [g Wg W](s,0)(L,t) pq + Φ(i):jk 1 q=1 Φ(i):jk (s,t):pq d(s) p d(t) q [W](s,t) π 1 s (p),π 1 t (q) p=1 Φ(i):jk (s):p d(s) p [b](s) π 1 s (p) Equivariant Polynomial Functional Networks p=1 Φ(i):jk (s,t)(t):p d(s) p [Wb](s,t)(t) π 1 s (p) q=1 Φ(i):jk (s)(L,t):pq d(s) p d(t) q [b W](s)(L,t) π 1 s (p),π 1 t (q) q=1 Φ(i):jk (s,0)(L,t):pq d(s) p d(t) q [WW](s,0)(L,t) π 1 s (p),π 1 t (q) + Φ(i):jk 1 q=1 Φ(i):jk (s,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) [W](s,t) pq p=1 Φ(i):jk (s):πs(p) d(s) πs(p) [b](s) p p=1 Φ(i):jk (s,t)(t):πs(p) d(s) πs(p) [Wb](s,t)(t) p q=1 Φ(i):jk (s)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) [b W](s)(L,t) pq q=1 Φ(i):jk (s,0)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) [WW](s,0)(L,t) pq + Φ(i):jk 1 [E(gb)](i) j = X q=1 Φ(i):j (s,t):pq [g W](s,t) pq p=1 Φ(i):j (s):p [gb](s) p p=1 Φ(i):j (s,t)(t):p [g Wgb](s,t)(t) p q=1 Φ(i):j (s)(L,t):pq [gbg W](s)(L,t) pq q=1 Φ(i):j (s,0)(L,t):pq [g Wg W](s,0)(L,t) pq q=1 Φ(i):j (s,t):pq d(s) p d(t) q [W](s,t) π 1 s (p),π 1 t (q) p=1 Φ(i):j (s):p d(s) p [b](s) π 1 s (p) p=1 Φ(i):j (s,t)(t):p d(s) p [Wb](s,t)(t) π 1 s (p) Equivariant Polynomial Functional Networks q=1 Φ(i):j (s)(L,t):pq d(s) p d(t) q [b W](s)(L,t) π 1 s (p),π 1 t (q) q=1 Φ(i):j (s,0)(L,t):pq d(s) p d(t) q [WW](s,0)(L,t) π 1 s (p),π 1 t (q) q=1 Φ(i):j (s,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) [W](s,t) pq p=1 Φ(i):j (s):πs(p) d(s) πs(p) [b](s) p p=1 Φ(i):j (s,t)(t):πs(p) d(s) πs(p) [Wb](s,t)(t) p q=1 Φ(i):j (s)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) [b W](s)(L,t) pq q=1 Φ(i):j (s,0)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) [WW](s,0)(L,t) pq + Φ(i):j 1 . Note that, we can move around the π s in above equations since the group G satisfy that: G Pi is trivial (for i = 0 or i = L) or the whole Pi (for 0 < i < L). C.3. Compute g E(U) [g(E(W))](i) jk = d(i) j d(i 1) k [W ](i) π 1 i (j)π 1 i 1(k) d(i) j d(i 1) k Φ (i):π 1 i (j)π 1 i 1(k) (s,t):pq [W](s,t) pq d(i) j d(i 1) k Φ (i):π 1 i (j)π 1 i 1(k) (s):p [b](s) p d(i) j d(i 1) k Φ (i):π 1 i (j)π 1 i 1(k) (s,t)(t):p [Wb](s,t)(t) p d(i) j d(i 1) k Φ (i):π 1 i (j)π 1 i 1(k) (s)(L,t):pq [b W](s)(L,t) pq d(i) j d(i 1) k Φ (i):π 1 i (j)π 1 i 1(k) (s,0)(L,t):pq [WW](s,0)(L,t) pq + d(i) j d(i 1) k Φ (i):π 1 i (j)π 1 i 1(k) 1 Equivariant Polynomial Functional Networks [g(E(b))](i) j = d(i) j [b ](i) π 1 i (j) q=1 d(i) j Φ (i):π 1 i (j) (s,t):pq [W](s,t) pq p=1 d(i) j Φ (i):π 1 i (j) (s):p [b](s) p p=1 d(i) j Φ (i):π 1 i (j) (s,t)(t):p [Wb](s,t)(t) p q=1 d(i) j Φ (i):π 1 i (j) (s)(L,t):pq [b W](s)(L,t) pq q=1 d(i) j Φ (i):π 1 i (j) (s,0)(L,t):pq [WW](s,0)(L,t) pq + d(i) j Φ (i):π 1 i (j) 1 C.4. Compare E(g U) and g E(U) Since E(g U) = g E(U), from Corollary B.7, the parameters Φ s have to satisfy these following conditions: 1. For all L s > t 0 with (s, t) = (L, 0), and for all p, q, we have Φ(i):jk (s,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = d(i) j d(i 1) k Φ (i):π 1 i (j)π 1 i 1(k) (s,t):pq Φ(i):j (s,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = d(i) j Φ (i):π 1 i (j) (s,t):pq 2. For all L s > 0, L > t 0 with s = t, we have Φ(i):jk (s,0)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = d(i) j d(i 1) k Φ (i):π 1 i (j)π 1 i 1(k) (s,0)(L,t):pq Φ(i):j (s,0)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = d(i) j Φ (i):π 1 i (j) (s,0)(L,t):pq q=1 Φ(i):jk (L,0):πL(p)π0(q) d(L) πL(p) d(0) π0(q) [W](L,0) pq q=1 Φ(i):jk (s,0)(L,s):πs(p)πs(q) d(s) πs(p) d(s) πs(q) [WW](s,0)(L,s) pq d(i) j d(i 1) k Φ (i):π 1 i (j)π 1 i 1(k) (L,0):pq [W](L,0) pq d(i) j d(i 1) k Φ (i):π 1 i (j)π 1 i 1(k) (s,0)(L,s):pq [WW](s,0)(L,s) pq Equivariant Polynomial Functional Networks q=1 Φ(i):j (L,0):πL(p)π0(q) d(L) πL(p) d(0) π0(q) [W](L,0) pq q=1 Φ(i):j (s,0)(L,s):πs(p)πs(q) d(s) πs(p) d(s) πs(q) [WW](s,0)(L,s) pq q=1 d(i) j Φ (i):π 1 i (j) (L,0):pq [W](L,0) pq q=1 d(i) j Φ (i):π 1 i (j) (s,0)(L,s):pq [WW](s,0)(L,s) pq 4. For all L > s > t > 0, and for all p, we have Φ(i):jk (s,t)(t):πs(p) d(s) πs(p) = d(i) j d(i 1) k Φ (i):π 1 i (j)π 1 i 1(k) (s,t)(t):p Φ(i):j (s,t)(t):πs(p) d(s) πs(p) = d(i) j Φ (i):π 1 i (j) (s,t)(t):p . 5. For all L s > 0, L > t 0 with s = t, we have Φ(i):jk (s)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = d(i) j d(i 1) k Φ (i):π 1 i (j)π 1 i 1(k) (s)(L,t):pq Φ(i):j (s)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = d(i) j Φ (i):π 1 i (j) (s)(L,t):pq 6. For all L > t > 0, we have p=1 Φ(i):jk (L,t)(t):πL(p) d(L) πL(p) [Wb](L,t)(t) p q=1 Φ(i):jk (t)(L,t):πt(p)πt(q) d(t) πt(p) d(t) πt(q) [b W](t)(L,t) pq d(i) j d(i 1) k Φ (i):π 1 i (j)π 1 i 1(k) (L,t)(t):p [Wb](L,t)(t) p d(i) j d(i 1) k Φ (i):π 1 i (j)π 1 i 1(k) (t)(L,t):pq [b W](t)(L,t) pq p=1 Φ(i):j (L,t)(t):πL(p) d(L) πL(p) [Wb](L,t)(t) p q=1 Φ(i):j (t)(L,t):πt(p)πt(q) d(t) πt(p) d(t) πt(q) [b W](t)(L,t) pq p=1 d(i) j Φ (i):π 1 i (j) (L,t)(t):p [Wb](L,t)(t) p q=1 d(i) j Φ (i):π 1 i (j) (t)(L,t):pq [b W](t)(L,t) pq Equivariant Polynomial Functional Networks 7. For all L s > 0 and for all p, we have Φ(i):jk (s):πs(p) d(s) πs(p) = d(i) j d(i 1) k Φ (i):π 1 i (j)π 1 i 1(k) (s):p Φ(i):j (s):πs(p) d(s) πs(p) = d(i) j Φ (i):π 1 i (j) (s):p . Φ(i):jk 1 = d(i) j d(i 1) k Φ (i):π 1 i (j)π 1 i 1(k) 1 Φ(i):j 1 = d(i) j Φ (i):π 1 i (j) 1 Also, since the group G satisfy that: G Pi is trivial (for i = 0 or i = L) or the whole Pi (for 0 < i < L), so we can simplify the above conditions by moving some of the permutation π s to the left hand sides. We have 1. For all L s > t 0 with (s, t) = (L, 0), and for all p, q, we have Φ(i):πi(j)πi 1(k) (s,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk (s,t):pq Φ(i):πi(j) (s,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = d(i) πi(j) Φ(i):j (s,t):pq 2. For all L s > 0, L > t 0 with s = t, we have Φ(i):πi(j)πi 1(k) (s,0)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk (s,0)(L,t):pq Φ(i):πi(j) (s,0)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = d(i) πi(j) Φ(i):j (s,0)(L,t):pq q=1 Φ(i):πi(j)πi 1(k) (L,0):πL(p)π0(q) d(L) πL(p) d(0) π0(q) [W](L,0) pq q=1 Φ(i):πi(j)πi 1(k) (s,0)(L,s):πs(p)πs(q) d(s) πs(p) d(s) πs(q) [WW](s,0)(L,s) pq d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk (L,0):pq [W](L,0) pq d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk (s,0)(L,s):pq [WW](s,0)(L,s) pq q=1 Φ(i):πi(j) (L,0):πL(p)π0(q) d(L) πL(p) d(0) π0(q) [W](L,0) pq q=1 Φ(i):πi(j) (s,0)(L,s):πs(p)πs(q) d(s) πs(p) d(s) πs(q) [WW](s,0)(L,s) pq Equivariant Polynomial Functional Networks q=1 d(i) πi(j) Φ(i):j (L,0):pq [W](L,0) pq q=1 d(i) πi(j) Φ(i):j (s,0)(L,s):pq [WW](s,0)(L,s) pq 4. For all L > s > t > 0, and for all p, we have Φ(i):πi(j)πi 1(k) (s,t)(t):πs(p) d(s) πs(p) = d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk (s,t)(t):p Φ(i):πi(j) (s,t)(t):πs(p) d(s) πs(p) = d(i) πi(j) Φ(i):j (s,t)(t):p. 5. For all L s > 0, L > t 0 with s = t, we have Φ(i):πi(j)πi 1(k) (s)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk (s)(L,t):pq Φ(i):πi(j) (s)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = d(i) πi(j) Φ(i):j (s)(L,t):pq 6. For all L > t > 0, we have p=1 Φ(i):πi(j)πi 1(k) (L,t)(t):πL(p) d(L) πL(p) [Wb](L,t)(t) p q=1 Φ(i):πi(j)πi 1(k) (t)(L,t):πt(p)πt(q) d(t) πt(p) d(t) πt(q) [b W](t)(L,t) pq d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk (L,t)(t):p [Wb](L,t)(t) p d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk (t)(L,t):pq [b W](t)(L,t) pq p=1 Φ(i):πi(j) (L,t)(t):πL(p) d(L) πL(p) [Wb](L,t)(t) p q=1 Φ(i):πi(j) (t)(L,t):πt(p)πt(q) d(t) πt(p) d(t) πt(q) [b W](t)(L,t) pq p=1 d(i) πi(j) Φ(i):j (L,t)(t):p [Wb](L,t)(t) p q=1 d(i) πi(j) Φ(i):j (t)(L,t):pq [b W](t)(L,t) pq 7. For all L s > 0 and for all p, we have Φ(i):πi(j)πi 1(k) (s):πs(p) d(s) πs(p) = d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk (s):p Φ(i):πi(j) (s):πs(p) d(s) πs(p) = d(i) πi(j) Φ(i):j (s):p. Equivariant Polynomial Functional Networks Φ(i):πi(j)πi 1(k) 1 = d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk 1 Φ(i):πi(j) 1 = d(i) πi(j) Φ(i):j 1 From the above equalities, we can determine all constraints for the coefficients according to the entries of [E(W)] as follows: 1. If (s, t) = (i, i 1) / {(L, L 1), (1, 0)}, p = j, q = k, Φ(i):π(j)π (k) (i,i 1):π(j)π (k) = Φ(i):jk (i,i 1):jk If (s, t) = (i, i 1) = (L, L 1), q = k, Φ(L):jπL 1(k) (L,L 1):pπL 1(k) = Φ(L):jk (L,L 1):pk If (s, t) = (i, i 1) = (1, 0), p = j, Φ(1):π1(j)k (1,0):π1(j)q = Φ(1):jk (1,0):jq 2. If (s, t) = (i, i 1) / {(L, L 1), (1, 0)}, p = j, q = k, Φ(i):π(j)π (k) (i,0)(L,i 1):π(j)π (k) = Φ(i):jk (i,0)(L,i 1):jk If (s, t) = (i, i 1) = (L, L 1), q = k, Φ(L):jπL 1(k) (L,0)(L,L 1):pπL 1(k) = Φ(L):jk (L,0)(L,L 1):pk If (s, t) = (i, i 1) = (1, 0), p = j, Φ(1):π1(j)k (1,0)(L,0):π1(j)q = Φ(1):jk (1,0)(L,0):jq q=1 Φ(i):πi(j)πi 1(k) (L,0):pq [W](L,0) pq q=1 Φ(i):πi(j)πi 1(k) (s,0)(L,s):πs(p)πs(q) d(s) πs(p) d(s) πs(q) [WW](s,0)(L,s) pq d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk (L,0):pq [W](L,0) pq d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk (s,0)(L,s):pq [WW](s,0)(L,s) pq For each L > l > 0, by scaling all the d(l) s by the same scalar, we have q=1 Φ(i):πi(j)πi 1(k) (L,0):pq [W](L,0) pq Equivariant Polynomial Functional Networks q=1 Φ(i):πi(j)πi 1(k) (s,0)(L,s):πs(p)πs(q) d(s) πs(p) d(s) πs(q) [WW](s,0)(L,s) pq q=1 Φ(i):jk (L,0):pq [W](L,0) pq q=1 Φ(i):jk (s,0)(L,s):pq [WW](s,0)(L,s) pq 4. For all L > s > t > 0, and for all p, we have Φ(i):πi(j)πi 1(k) (s,t)(t):πs(p) d(s) πs(p) = d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk (s,t)(t):p. Φ(i):jk (s,t)(t):p = 0. 5. For all L s > 0, L > t 0 with s = t, we have Φ(i):πi(j)πi 1(k) (s)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk (s)(L,t):pq. If (s, t) = (i, i 1) / {(L, L 1), (1, 0)}, p = j, q = k, Φ(i):π(j)π (k) (i)(L,i 1):π(j)π (k) = Φ(i):jk (i)(L,i 1):jk. If i = L, (s, t) = (L, L 1), q = k, Φ(L):jπ(k) (L)(L,L 1):pπ(k) = Φ(L):jk (L)(L,L 1):pk If i = 1, (s, t) = (1, 0), p = j, Φ(1):π(j)k (1)(L,0):π(j)q = Φ(1):jk (1)(L,0):jq. 6. For all L > t > 0, we have p=1 Φ(i):πi(j)πi 1(k) (L,t)(t):p [Wb](L,t)(t) p q=1 Φ(i):πi(j)πi 1(k) (t)(L,t):πt(p)πt(q) d(t) πt(p) d(t) πt(q) [b W](t)(L,t) pq d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk (L,t)(t):p [Wb](L,t)(t) p d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk (t)(L,t):pq [b W](t)(L,t) pq For each L > l > 0, by scaling all the d(l) s by the same scalar, we have p=1 Φ(i):πi(j)πi 1(k) (L,t)(t):p [Wb](L,t)(t) p Equivariant Polynomial Functional Networks q=1 Φ(i):πi(j)πi 1(k) (t)(L,t):πt(p)πt(q) d(t) πt(p) d(t) πt(q) [b W](t)(L,t) pq d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk (L,t)(t):p [Wb](L,t)(t) p d(i) πi(j) d(i 1) πi 1(k) Φ(i):jk (t)(L,t):pq [b W](t)(L,t) pq 7. If i = s = 1, p = j, Φ(1):π(j)k (1):π(j) = Φ(1):jk (1):j . Φ(i):jk 1 = 0. Similarly, we can also determine all constraints for the coefficients according to the entries of [E(b)] as follows: 1. For all L s > t 0 with (s, t) = (L, 0), and for all p, q, we have Φ(i):πi(j) (s,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = d(i) πi(j) Φ(i):j (s,t):pq (s, t) = (i, 0), p = j, Φ(i):π(j) (i,0):π(j)q = Φ(i):j (i,0):jq 2. For all L s > 0, L > t 0 with s = t, we have Φ(i):πi(j) (s,0)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = d(i) πi(j) Φ(i):j (s,0)(L,t):pq t = 0, s = i < L, p = j, Φ(i):π(j) (i,0)(L,0):π(j)q = Φ(i):j (i,0)(L,0):jq t = 0, s = i = L, Φ(L):j (L,0)(L,0):pq = Φ(L):j (L,0)(L,0):pq q=1 Φ(i):πi(j) (L,0):pq [W](L,0) pq q=1 Φ(i):πi(j) (s,0)(L,s):πs(p)πs(q) d(s) πs(p) d(s) πs(q) [WW](s,0)(L,s) pq q=1 d(i) πi(j) Φ(i):j (L,0):pq [W](L,0) pq q=1 d(i) πi(j) Φ(i):j (s,0)(L,s):pq [WW](s,0)(L,s) pq Equivariant Polynomial Functional Networks If L > i > 0. For each L > l > 0, by scaling all the d(l) s by the same scalar, we have q=1 Φ(i):πi(j) (L,0):pq [W](L,0) pq q=1 Φ(i):πi(j) (s,0)(L,s):πs(p)πs(q) d(s) πs(p) d(s) πs(q) [WW](s,0)(L,s) pq q=1 d(i) πi(j) Φ(i):j (L,0):pq [W](L,0) pq q=1 d(i) πi(j) Φ(i):j (s,0)(L,s):pq [WW](s,0)(L,s) pq If i = L. We have q=1 Φ(L):j (L,0):pq [W](L,0) pq q=1 Φ(L):j (s,0)(L,s):πs(p)πs(q) d(s) πs(p) d(s) πs(q) [WW](s,0)(L,s) pq q=1 Φ(L):j (L,0):pq [W](L,0) pq q=1 Φ(L):j (s,0)(L,s):pq [WW](s,0)(L,s) pq which means Φ(L):j (L,0):pq can be arbitrary. The rest is q=1 Φ(L):j (s,0)(L,s):πs(p)πs(q) d(s) πs(p) d(s) πs(q) [WW](s,0)(L,s) pq q=1 Φ(L):j (s,0)(L,s):pq [WW](s,0)(L,s) pq For an L > r > 0, by letting πr be the identity, and d(r) p be 1 for all p, we have q=1 Φ(L):j (s,0)(L,s):πs(p)πs(q) d(s) πs(p) d(s) πs(q) [WW](s,0)(L,s) pq q=1 Φ(L):j (s,0)(L,s):pq [WW](s,0)(L,s) pq , q=1 Φ(L):j (r,0)(L,r):πr(p)πr(q) d(r) πr(p) d(r) πr(q) [WW](r,0)(L,r) pq q=1 Φ(L):j (r,0)(L,r):pq [WW](r,0)(L,r) pq Equivariant Polynomial Functional Networks By Lemma B.3 and Corollary A.7, we have Φ(L):j (r,0)(L,r):πr(p)πr(q) d(r) πr(p) d(r) πr(q) = Φ(L):j (r,0)(L,r):pq. Φ(L):j (r,0)(L,r):pq = 0 for p = q, and Φ(L):j (r,0)(L,r):πr(p)πr(p) = Φ(L):j (r,0)(L,r):pp. In conclusion, we have Φ(L):j (L,0):pq is arbitrary, and for L > s > 0, Φ(L):j (s,0)(L,s):π(p)π(p) = Φ(L):j (s,0)(L,s):pp. 4. For all L > s > t > 0, and for all p, we have Φ(i):πi(j) (s,t)(t):πs(p) d(s) πs(p) = d(i) πi(j) Φ(i):j (s,t)(t):p. If i = s, p = j, Φ(i):π(j) (i,t)(t):π(j) = Φ(i):j (i,t)(t):j. 5. For all L s > 0, L > t 0 with s = t, we have Φ(i):πi(j) (s)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = d(i) πi(j) Φ(i):j (s)(L,t):pq s = i < L, t = 0, p = j, Φ(i):π(j) (i)(L,0):π(j)q = Φ(i):j (i)(L,0):jq s = i = L, t = 0 Φ(L):j (L)(L,0):pq = Φ(L):j (L)(L,0):pq 6. For all L > t > 0, we have p=1 Φ(i):πi(j) (L,t)(t):p) [Wb](L,t)(t) p q=1 Φ(i):πi(j) (t)(L,t):πt(p)πt(q) d(t) πt(p) d(t) πt(q) [b W](t)(L,t) pq p=1 d(i) πi(j) Φ(i):j (L,t)(t):p [Wb](L,t)(t) p q=1 d(i) πi(j) Φ(i):j (t)(L,t):pq [b W](t)(L,t) pq If L > i > 0. For each L > l > 0, by scaling all the d(l) s by the same scalar, we have Equivariant Polynomial Functional Networks p=1 Φ(i):πi(j) (L,t)(t):πL(p) d(L) πL(p) [Wb](L,t)(t) p q=1 Φ(i):πi(j) (t)(L,t):πt(p)πt(q) d(t) πt(p) d(t) πt(q) [b W](t)(L,t) pq p=1 d(i) πi(j) Φ(i):j (L,t)(t):p [Wb](L,t)(t) p q=1 d(i) πi(j) Φ(i):j (t)(L,t):pq [b W](t)(L,t) pq If i = L. We have p=1 Φ(L):j (L,t)(t):p [Wb](L,t)(t) p q=1 Φ(L):j (t)(L,t):πt(p)πt(q) d(t) πt(p) d(t) πt(q) [b W](t)(L,t) pq p=1 Φ(L):j (L,t)(t):p [Wb](L,t)(t) p q=1 Φ(L):j (t)(L,t):pq [b W](t)(L,t) pq which means Φ(L):j (L,t)(t):p) can be arbitrary. The rest is q=1 Φ(L):j (t)(L,t):πt(p)πt(q) d(t) πt(p) d(t) πt(q) [b W](t)(L,t) pq q=1 Φ(L):j (t)(L,t):pq [b W](t)(L,t) pq By Lemma B.5 and Corollary A.7, we have Φ(L):j (t)(L,t):πt(p)πt(q) d(t) πt(p) d(t) πt(q) = Φ(L):j (t)(L,t):pq Φ(L):j (t)(L,t):pq = 0 for p = q, and Φ(L):j (t)(L,t):πt(p)πt(p) = Φ(L):j (t)(L,t):pp In conclusion, for all L > t > 0, we have Φ(L):j (L,t)(t):p) is arbitrary, and Φ(L):j (t)(L,t):π(p)π(p) = Φ(L):j (t)(L,t):pp. 7. For all L s > 0 and for all p, we have Φ(i):πi(j) (s):πs(p) d(s) πs(p) = d(i) πi(j) Φ(i):j (s):p. Equivariant Polynomial Functional Networks If i = s < L, p = j, Φ(i):π(j) (i):π(j) = Φ(i):j (i):j. If i = s = L, Φ(L):j (L):p = Φ(L):j (L):p. Φ(i):πi(j) 1 = d(i) πi(j) Φ(i):j 1 If i = L, we have Φ(L):j 1 = Φ(L):j 1 . C.5. G-Equivariant polynomial layers Based on the above discussions, we conclude that every G-equivariant polynomial layer , which is defined as a linear combination of stable polynomial terms, is given as E(U) = ([E(W)], [E(b)]), where the entries of [E(W)] and [E(b)] are given case by case as folows: For i = L, we have [E(W)](L) jk = p=1 Φ(L):j (L,L 1):p [W](L,L 1) pk + p=1 Φ(L):j (L,0)(L,L 1):p [WW](L,0)(L,L 1) pk p=1 Φ(L):j (L)(L,L 1):p [b W](L)(L,L 1) pk [E(b)](L) j = q=1 Φ(L):j (L,0)(L,0):pq [WW](L,0)(L,0) pq + q=1 Φ(L):j (L,0):pq [W](L,0) pq p=1 Φ(L):j (s,0)(L,s): [WW](s,0)(L,s) pp + q=1 Φ(L):j (L)(L,0):pq [b W](L)(L,0) pq p=1 Φ(L):j (L,t)(t):p [Wb](L,t)(t) p + X p=1 Φ(L):j (t)(L,t): [b W](t)(L,t) pp p=1 Φ(L):j (L):p [b](L) p + Φ(L):j 1 [E(b)](L) j = q=1 Φ(L):j (L,0)(L,0):pq [WW](L,0)(L,0) pq + q=1 Φ(L):j (L,0):pq [W](L,0) pq L>s>0 Φ(L):j (s,0)(L,s): p=1 [WW](s,0)(L,s) pp + q=1 Φ(L):j (L)(L,0):pq [b W](L)(L,0) pq p=1 Φ(L):j (L,t)(t):p [Wb](L,t)(t) p + X L>t>0 Φ(L):j (t)(L,t): p=1 [b W](t)(L,t) pp p=1 Φ(L):j (L):p [b](L) p + Φ(L):j 1 For i = 1, we have [E(W)](1) jk = q=1 Φ(1): k (1,0): q [W](1,0) jq + q=1 Φ(1): k (1,0)(L,0): q [WW](1,0)(L,0) jq Equivariant Polynomial Functional Networks q=1 Φ(1): k (1)(L,0): q [b W](1)(L,0) jq + Φ(1): k (1): [b](1) j [E(b)](1) j = q=1 Φ(1): (1,0): q [W](1,0) jq + q=1 Φ(1): (1,0)(L,0): q [WW](1,0)(L,0) jq q=1 Φ(1): (1)(L,0): q [b W](1)(L,0) jq + Φ(1): (1): [b](1) j For L > i > 1, we have [E(W)](i) jk =Φ(i): (i,i 1): [W](i,i 1) jk + Φ(i): (i,0)(L,i 1): [WW](i,0)(L,i 1) jk + Φ(i): (i)(L,i 1): [b W](i)(L,i 1) jk [E(b)](i) j = q=1 Φ(i): (i,0): q [W](i,0) jq + q=1 Φ(i): (i,0)(L,0): q [WW](t,0)(L,0) jq p=1 Φ(i): (i,t)(t): [Wb](i,t)(t) j + q=1 Φ(i): (i)(L,0): q [b W](i)(L,0) jq + Φ(i): (i): [b](i) j In the above formulas, the bullet indicates that the corresponding coefficient is independent of the indices at the bullet. D. Invariant Polynomial Layers In this section, we construct a polynomial map I : U ! Rd that is G-invariant, i.e., I(g U) = I(U) for all g GU and U U. D.1. Invariant layer as a linear combination of stable polynomial terms Similar to the equivariant maps, we seek the invariant map among polynomial maps that is a linear combination of stable polynomial terms, specifically the entries of [W](s,t), [b](s), [Wb](s,t)(t), [b W](s)(L,t), and [WW](s,0)(L,t), along with a bias. In concrete: q=1 Φ(s,t):pq [W](s,t) pq + X p=1 Φ(s):p [b](s) p p=1 Φ(s,t)(t):p [Wb](s,t)(t) p q=1 Φ(s)(L,t):pq [b W](s)(L,t) pq q=1 Φ(s,0)(L,t):pq [WW](s,0)(L,t) pq + Φ1. All Φ s are in Rd d, except the bias Φ1 is in Rd 1. In summary, I is parameterized by Φ s and Ψ s. We need I to be G-invariant, which means I(g U) = I(U). D.2. Compute I(g U) From the definition of I(U), we have: Equivariant Polynomial Functional Networks q=1 Φ(s,t):pq [g W](s,t) pq p=1 Φ(s):p [gb](s) p p=1 Φ(s,t)(t):p [g Wgb](s,t)(t) p q=1 Φ(s)(L,t):pq [gbg W](s)(L,t) pq q=1 Φ(s,0)(L,t):pq [g Wg W](s,0)(L,t) pq q=1 Φ(s,t):pq d(s) p d(t) q [W](s,t) π 1 s (p),π 1 t (q) p=1 Φ(s):p d(s) p [b](s) π 1 s (p) p=1 Φ(s,t)(t):p d(s) p [Wb](s,t)(t) π 1 s (p) q=1 Φ(s)(L,t):pq d(s) p d(t) q [b W](s)(L,t) π 1 s (p),π 1 t (q) q=1 Φ(s,0)(L,t):pq d(s) p d(t) q [WW](s,0)(L,t) π 1 s (p),π 1 t (q) q=1 Φ(s,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) [W](s,t) pq p=1 Φ(s):πs(p) d(s) πs(p) [b](s) p p=1 Φ(s,t)(t):πs(p) d(s) πs(p) [Wb](s,t)(t) p q=1 Φ(s)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) [b W](s)(L,t) pq q=1 Φ(s,0)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) [WW](s,0)(L,t) pq D.3. Compare I(g U) and I(U) Since I(g U) = I(U), from Corollary B.7, the parameters Φ s have to satisfy these following conditions: Equivariant Polynomial Functional Networks 1. For all L s > t 0 with (s, t) = (L, 0), and for all p, q, we have Φ(s,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = Φ(s,t):pq 2. For all L s > 0, L > t 0 with s = t, we have Φ(s,0)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = Φ(s,0)(L,t):pq q=1 Φ(L,0):πL(p)π0(q) d(L) πL(p) d(0) π0(q) [W](L,0) pq q=1 Φ(s,0)(L,s):πs(p)πs(q) d(s) πs(p) d(s) πs(q) [WW](s,0)(L,s) pq q=1 Φ(L,0):pq [W](L,0) pq q=1 Φ(s,0)(L,s):pq [WW](s,0)(L,s) pq 4. For all L > s > t > 0, and for all p, we have Φ(s,t)(t):πs(p) d(s) πs(p) = Φ(s,t)(t):p. 5. For all L s > 0, L > t 0 with s = t, we have Φ(s)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = Φ(s)(L,t):pq 6. For all L > t > 0, we have p=1 Φ(L,t)(t):πL(p) d(L) πL(p) [Wb](L,t)(t) p q=1 Φ(t)(L,t):πt(p)πt(q) d(t) πt(p) d(t) πt(q) [b W](t)(L,t) pq p=1 Φ(L,t)(t):p [Wb](L,t)(t) p q=1 Φ(t)(L,t):pq [b W](t)(L,t) pq 7. For all L s > 0 and for all p, we have Φ(s):πs(p) d(s) πs(p) = Φ(s):p. Equivariant Polynomial Functional Networks Solve these equations, we have 1. For all L s > t 0 with (s, t) = (L, 0), and for all p, q, we have Φ(s,t):pq = 0 2. For all L s > 0, L > t 0 with s = t, we have Φ(s,0)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = Φ(s,0)(L,t):pq If (s, t) = (L, 0, we have Φ(L,0)(L,0):pq = Φ(L,0)(L,0):pq. q=1 Φ(L,0):pq [W](L,0) pq q=1 Φ(s,0)(L,s):πs(p)πs(q) d(s) πs(p) d(s) πs(q) [WW](s,0)(L,s) pq q=1 Φ(L,0):pq [W](L,0) pq q=1 Φ(s,0)(L,s):pq [WW](s,0)(L,s) pq which means Φ(L,0):pq can be arbitrary. The rest is q=1 Φ(s,0)(L,s):πs(p)πs(q) d(s) πs(p) d(s) πs(q) [WW](s,0)(L,s) pq q=1 Φ(s,0)(L,s):pq [WW](s,0)(L,s) pq For an L > r > 0, by letting πr be the identity, and d(r) p be 1 for all p, we have q=1 Φ(s,0)(L,s):πs(p)πs(q) d(s) πs(p) d(s) πs(q) [WW](s,0)(L,s) pq q=1 Φ(s,0)(L,s):pq [WW](s,0)(L,s) pq , q=1 Φ(r,0)(L,r):πr(p)πr(q) d(r) πr(p) d(r) πr(q) [WW](r,0)(L,r) pq Equivariant Polynomial Functional Networks q=1 Φ(r,0)(L,r):pq [WW](r,0)(L,r) pq By Lemma B.3 and Corollary A.7, we have Φ(r,0)(L,r):πr(p)πr(q) d(r) πr(p) d(r) πr(q) = Φ(r,0)(L,r):pq. Φ(r,0)(L,r):pq = 0, for p = q, and Φ(r,0)(L,r):πr(p)πr(p) = Φ(r,0)(L,r):pp. In conclusion, we have Φ(L,0):pq is arbitrary, and for L > s > 0, Φ(s,0)(L,s):π(p)π(p) = Φ(s,0)(L,s):pp. 4. For all L > s > t > 0, and for all p, we have Φ(s,t)(t):p = 0. 5. For all L s > 0, L > t 0 with s = t, we have Φ(s)(L,t):πs(p)πt(q) d(s) πs(p) d(t) πt(q) = Φ(s)(L,t):pq If (s, t) = (L, 0), we have Φ(L)(L,0):pq = Φ(L)(L,0):pq. 6. For all L > t > 0, we have p=1 Φ(L,t)(t):p [Wb](L,t)(t) p + q=1 Φ(t)(L,t):πt(p)πt(q) d(t) πt(p) d(t) πt(q) [b W](t)(L,t) pq p=1 Φ(L,t)(t):p [Wb](L,t)(t) p + q=1 Φ(t)(L,t):pq [b W](t)(L,t) pq which means Φ(L,t)(t):p can be arbitrary. The rest is q=1 Φ(t)(L,t):πt(p)πt(q) d(t) πt(p) d(t) πt(q) [b W](t)(L,t) pq = q=1 Φ(t)(L,t):pq [b W](t)(L,t) pq By Lemma B.5 and Lemma A.7, we have Φ(t)(L,t):πt(p)πt(q) d(t) πt(p) d(t) πt(q) = Φ(t)(L,t):pq. Φ(t)(L,t):pq = 0 Equivariant Polynomial Functional Networks for p = q, and Φ(t)(L,t):πt(p)πt(p) = Φ(t)(L,t):pp. In conclusion, for all L > t > 0, we have Φ(L,t)(t):p is arbitrary, and Φ(t)(L,t):π(p)π(p) = Φ(t)(L,t):pp 7. For all L s > 0 and for all p, we have Φ(s):πs(p) d(s) πs(p) = Φ(s):p. If s = L, we have Φ(L):p = Φ(L):p. D.4. G-Invariant polynomial layers Based on the above discussions, we conclude that every G-invariant polynomial layer, which is defined as a linear combination of stable polynomial terms, is given as: q=1 Φ(L,0)(L,0):pq [WW](L,0)(L,0) pq + q=1 Φ(L,0):pq [W](L,0) pq p=1 Φ(s,0)(L,s): [WW](s,0)(L,s) pp + q=1 Φ(L)(L,0):pq [b W](L)(L,0) pq p=1 Φ(L,t)(t):p [Wb](L,t)(t) p + X p=1 Φ(t)(L,t): [b W](t)(L,t) pp p=1 Φ(L):p [b](L) p + Φ1 In the above formula, the bullet indicates that the corresponding coefficient is independent of the index at the bullet. E. Additional Experimental Details E.1. Predicting generalization from weights Dataset. The Tanh subset from the CNN Zoo dataset has 5,949 training instances and 1,488 testing instances, while the original Re LU subset consists of 6,050 training instances and 1,513 testing instances. We do the augmentation for Re LU subset, with an augmentation factor of 2, effectively doubling the size of the dataset by adding one augmented version of each original instance. The overall dataset sizes, including both the original and augmented data, are summarized in Table 6. Baselines In this experiment, we compare our model with five other baselines: STATNN (Unterthiner et al., 2020): utilizes statistical features of the weights and biases Graph-NN (Kofinas et al., 2024): represents input network parameters as graphs and processes using Graph Neural Networks. NP and HNP (Zhou et al., 2024b): incoporates the permutation symmetries of neurons into neural functional networks. Monomial-NFN (Tran et al., 2024a): extends the group action on weights from group of permutation matrices to the group of monomial matrices by incoporates scaling/sign-flipping symmetries. Equivariant Polynomial Functional Networks Table 6: Datasets information for predicting generalization task. Dataset Train size Val size Original Re LU 6050 1513 Augment Re LU 12100 3026 Tanh 5949 1488 Table 7: Number of parameters of all models for prediciting generalization task. Model Re LU dataset Tanh dataset STATNN 1.06M 1.06M NP 2.03M 2.03M HNP 2.81M 2.81M Monomial-NFN 0.25M 1.41M MAGEP-NFN (ours) 0.99M 0.99M Table 8: Hyperparameters for MAGEP-NFN on prediciting generalization task. MLP hidden Loss Optimizer Learning rate Batch size Epoch 500 Binary cross-entropy Adam 0.001 8 50 Table 9: Dataset size for Classifying INRs task. Train Validation Test CIFAR-10 45000 5000 10000 MNIST size 45000 5000 10000 Fashion-MNIST 45000 5000 20000 Implementation Details. Our architecture begins with a Monomial-NFN layer featuring 20 channels, aligning the dimensions of the weights and biases. This is followed by two equivariant MAGEP-NFN layers, each with 20 channels, using either a Re LU activation for the Re LU dataset or a Tanh activation for the Tanh dataset. Next, the output is processed by a MAGEP-NFN Invariant layer. The final output from this layer is flattened and mapped to R500. This vector is further processed by a fully connected MLP with two hidden layers, both activated by Re LU. The final output undergoes a linear projection to a scalar, followed by a sigmoid function. The model is trained using Binary Cross Entropy (BCE) loss over 50 epochs, with early stopping determined by a threshold τ on the validation set. The entire training process on an A100 GPU takes 30 minutes. A summary of hyperparameters can be found in Table 8. For the baseline models, we adhere to the original implementations as outlined in (Zhou et al., 2024b), utilizing the official code (available at: https://github.com/Allan Yang Zhou/nfn), and (Tran et al., 2024a). In the HNP, NP, and Monomial-NFN models, we employ three equivariant layers with channel configurations of 16, 16, and 5, respectively. The extracted features are then passed through an average pooling layer, followed by three MLP layers with hidden dimensions of 200 (Monomial-NFN Re LU case) and 1000 neurons (Monomial-NFN Tanh case and other models). The hyperparameters for our model, along with the parameter counts for all models involved in this task, are detailed in Table 7. E.2. Classifying implicit neural representations of images Dataset. We use the original INRs dataset, which contained three dataset: CIFAR-10, MNIST and Fashion-MNIST, obtained by applying a single SIREN model to every image. The detailed information about dataset is described in (Zhou et al., 2024b). We use the datasett without any data augmentation as in the settings of (Tran et al., 2024a). The breakdown of training, validation, and test sample sizes for each dataset is detailed in Table 9. Equivariant Polynomial Functional Networks Table 10: Hyperparameters of MAGEP-NFN for each dataset in Classify INRs task. MNIST Fashion-MNIST CIFAR-10 MAGEP-NFN hidden dimension 128x3 64 64 x 3 Base model MAGEP-Inv NP MAGEP-Inv Base model hidden dimension 128 128x3 64 MLP hidden neurons 1000 500 1000 Dropout 0.1 0.1 0.1 Learning rate 0.0001 0.0001 0.0001 Batch size 32 32 32 Step 200000 200000 200000 Loss Binary cross-entropy Binary cross-entropy Binary cross-entropy Table 11: Number of parameters of all models for classifying INRs task. CIFAR-10 MNIST Fashion-MNIST MLP 2M 2M 2M NP 16M 15M 15M HNP 42M 22M 22M Monomial-NFN 16M 22M 20M MAGEP-NFN (ours) 3.4M 4.1M 4.9M Table 12: Number of parameters of all models for Weight space style editing task. Model Number of parameters MLP 4.5M NP 4.1M HNP 12.8M Monomial-NFN 4.1M MAGEP-NFN (ours) 4.1M Table 13: Hyperparameters for MAGEP-NFN on weight space style editing task. MAGEP-NFN hidden dimension 16 NP dimension 128 Optimizer Adam Learning rate 0.001 Batch size 32 Steps 50000 Implementation Details. In these experiments, we use two different architectures. For both the MNIST and CIFAR datasets, the architecture begins with a Monomial-NFN layer to adjust the weight dimensions, followed by three MAGEPNFN layers, each utilizing sine activation. The resulting weight features are then passed through a MAGEP Invariant layer. Finally, the output is flattened and processed by an MLP with two hidden layers, each containing 1,000 units and using Re LU activations. For the Fashion-MNIST dataset, we begin with a Monomial-NFN layer with sine activation, followed by a MAGEP-NFN Equivariant Polynomial Functional Networks layer also utilizing sine activation, and then a Monomial-NFN layer with absolute activation. The architecture then aligns with the design of the NP model from (Zhou et al., 2024b). Specifically, a Gaussian Fourier Transformation is applied to encode the input into sine and cosine components, mapping from 1 dimension to 256 dimensions. The encoded features are processed through IOSinusoidal Encoding, a positional encoding tailored for the NP layer, which uses a maximum frequency of 10 and 6 frequency bands. Following this, the features pass through three NP layers with Re LU activations. An average pooling is applied, after which the output is flattened, and the resulting vector is processed by an MLP with two hidden layers, each containing 1,000 units and using Re LU activations. Finally, the output is linearly projected to a scalar. We employ the Binary Cross Entropy (BCE) loss function and train the model for 200,000 steps, taking approximately 2 hours on an A100 GPU. The parameter counts for all models are presented in Table 11 E.3. Weight space style editing Dataset. We utilize the same INRs dataset as employed in the classification task, with the sizes of the training, validation, and test sets provided in Table 9. Implementation Details. In these experiments, our architecture begins with two MAGEP-NFN layers, each with 16 hidden dimensions. The rest of the design follows the NP model outlined in (Zhou et al., 2024b). Specifically, we apply a Gaussian Fourier Transformation with a mapping size of 256, followed by IOSinusoidal Encoding. The features are then processed through three NP layers, each with 128 hidden dimensions and Re LU activation. The final output is passed through an NP layer for scalar projection and a Learned Scale layer as described in the Appendix of (Zhou et al., 2024b). We use the Binary Cross Entropy (BCE) loss function and train the model for 50,000 steps, which takes approximately 35 minutes on an A100 GPU. For the baseline models, we maintain the same settings as the official implementation. Specifically, the HNP or NP model consists of three layers, each with 128 hidden dimensions, followed by Re LU activations. An NFN of the same type is applied to map the output to one dimension, after which it is processed by a Learned Scale layer. The number of parameters for all models is detailed in Table 12, and the hyperparameters for our model are presented in Table 13. F. Performance comparison with graph-based NFNs Experiment Setup: Following the same experiment setup in Appendix E.1, we compare the predictive performance of our model and two graph-based baselines: GNN (Kofinas et al., 2024) and Scale GMN (Kalogeropoulos et al., 2024), using HNP (Zhou et al., 2024b) as a reference. Results: The results are presented in Table 14. The GNN model exhibits a noticeable performance decline when tested on separate activation subsets. Although Scale GMN significantly improves performance on the Tanh subset, its enhancements on the Re LU subset are comparatively modest. In contrast, our model demonstrates substantial overall improvements across both datasets, highlighting its effectiveness. Table 14: Performance comparison with Graph-based NFNs on Small CNN Zoo task. Re LU subset Tanh subset HNP (Zhou et al., 2024b) 0.897 0.934 GNN (Kofinas et al., 2024) 0.897 0.893 Scale GMN (Kalogeropoulos et al., 2024) 0.928 0.942 MAGEP-NFNs (ours) 0.933 0.940 G. Memory and Runtime. Table 15 and 16 provide runtime and memory consumption data for our model and other baselines in predicting CNN generalization task. For graph-based architectures, we compare with two recent works: GNN (Kofinas et al., 2024) and Scale GMN (Kalogeropoulos et al., 2024). Our model runs significantly faster and uses much less memory than these graph-based networks and NP/HNP (Zhou et al., 2024b). Introducing additional polynomial terms slightly increases our model s runtime and memory usage compared to Monomial-NFN (Tran et al., 2024a). However, this trade-off results in considerably enhanced expressivity, which is evident across many tasks like Predict CNN Generalization or INRs Equivariant Polynomial Functional Networks Table 15: Runtime of models on Small CNN Zoo task. Re LU subset Tanh subset NP (Zhou et al., 2024b) 36m40s 35m34s HNP (Zhou et al., 2024b) 30m06s 29m37s GNN (Kofinas et al., 2024) 4h27m29s 4h25m17s Scaled GMN (Kalogeropoulos et al., 2024) 1h20m 1h20m Monomial-NFN (Tran et al., 2024a) 23m47s 18m23s MAGEP-NFNs (ours) 28m43s 28m12s Table 16: Memory consumption of models on Small CNN Zoo task. Re LU subset Tanh subset NP (Zhou et al., 2024b) 838MB 838MB HNP (Zhou et al., 2024b) 856MB 856MB GNN (Kofinas et al., 2024) 6390MB 6390MB Scaled GMN (Kalogeropoulos et al., 2024) 2918MB 2918MB Monomial-NFN (Tran et al., 2024a) 560MB 582MB MAGEP-NFNs (ours) 584MB 584MB Classification. H. Implementation of Equivariant and Invariant Layers We provide the multi-channel implementations of the GU-equivariant map E : Ud ! Ue and the GU-invariant map I : Ud ! Re d . For uniformity in implementing Equivariant and Invariant layers from Appendix C.5 and Appendix D.4, we employ einops-style pseudocode as a consistent framework. We summarize the key dimensions in Table 17 and outline the shapes of the input terms in Table 18. H.1. Equivariant Layers Pseudocode H.1.1. PSEUDOCODE FOR CASE i = L From the formula for [E(W)](L) jk : [E(W)](L) jk = p=1 Φ(L):j (L,L 1):p [W](L,L 1) pk + p=1 Φ(L):j (L,0)(L,L 1):p [WW](L,0)(L,L 1) pk Table 17: Summary of key dimensions involved in the implementation Symbol Description d Number of input channels for the equivariant and invariant layer e Number of output channels for the equivariant and invariant layer b Batch size ni Number of channels at the ith layer d Embedding dimension of the invariant layer s output Equivariant Polynomial Functional Networks Table 18: Shapes of input terms used in the implementation [W](s,t) [b, d, ns, nt] [Wb](s,t)(t) [b, d, ns] [b W](s)(L,t) [b, d, ns, nt] [WW](s,0)(L,t) [b, d, ns, nt] p=1 Φ(L):j (L)(L,L 1):p [b W](L)(L,L 1) pk We define the pseudocode for each term: For Φ(L):j (L,L 1):p [W](L,L 1) pk , with [W](L,L 1) pk of shape [b, d, n L, n L 1] and Φ(L):j (L,L 1):p of shape [e, d, n L, n L], Corresponding pseudocode: einsum(edpj, bdpk ! bejk) For Φ(L):j (L,0)(L,L 1):p [WW](L,0)(L,L 1) pk , with [WW](L,0)(L,L 1) pk of shape [b, d, n L, n L 1] and Φ(L):j (L,0)(L,L 1):p of shape [e, d, n L, n L], Corresponding pseudocode: einsum(edpj, bdpk ! bejk) For Φ(L):j (L)(L,L 1):p [b W](L)(L,L 1) pk , with [b W](L)(L,L 1) pk of shape [b, d, n L, n L 1] and Φ(L):j (L)(L,L 1):p of shape [e, d, n L, n L], Corresponding pseudocode: einsum(edpj, bdpk ! bejk) From the formula for [E(b)](L) j : [E(b)](L) j = q=1 Φ(L):j (L,0)(L,0):pq [WW](L,0)(L,0) pq + q=1 Φ(L):j (L,0):pq [W](L,0) pq p=1 Φ(L):j (s,0)(L,s): [WW](s,0)(L,s) pp + q=1 Φ(L):j (L)(L,0):pq [b W](L)(L,0) pq p=1 Φ(L):j (L,t)(t):p [Wb](L,t)(t) p + X p=1 Φ(L):j (L)(L,t): [b W](t)(L,t) pp p=1 Φ(L):j (L):p [b](L) p + Φ(L):j 1 We define the pseudocode for each term: For Φ(L):j (L,0)(L,0):pq [WW](L,0)(L,0) pq , with [WW](L,0)(L,0) pq of shape [b, d, n L, n0] and Φ(L):j (L,0)(L,0):pq of shape [e, d, n L, n0, n L], Corresponding pseudocode: einsum(edpqj, bdpq ! bej) For Φ(L):j (L,0):pq [W](L,0) pq , with [W](L,0) pq of shape [b, d, n L, n0] and Φ(L):j (L,0):pq of shape [e, d, n L, n0, n L], Corresponding pseudocode: einsum(edpqj, bdpq ! bej) Equivariant Polynomial Functional Networks For Φ(L):j (s,0)(L,s): [WW](s,0)(L,s) pp , with [WW](s,0)(L,s) pp of shape [b, d, ns, ns] and Φ(L):j (s,0)(L,s): of shape [e, d, ns, ns, n L], Corresponding pseudocode: einsum(edppj, bdpp ! bej) For Φ(L):j (L)(L,0):pq [b W](L)(L,0) pq , with [b W](L)(L,0) pq of shape [b, d, n L, n0] and Φ(L):j (L)(L,0):pq of shape [e, d, n L, n0, n L], Corresponding pseudocode: einsum(edpqj, bdpq ! bej) For Φ(L):j (L,t)(t):p [Wb](L,t)(t) p , with [Wb](L,t)(t) p of shape [b, d, n L] and Φ(L):j (L,t)(t):p of shape [e, d, n L, n L], Corresponding pseudocode: einsum(edpj, bdp ! bej) For Φ(L):j (t)(L,t): [b W](t)(L,t) pp , with [b W](t)(L,t) pp of shape [b, d, nt, nt] and Φ(L):j (t)(L,t): of shape [e, d, nt, nt, n L], Corresponding pseudocode: einsum(edppj, bdpp ! bej) For Φ(L):j (L):p [b](L) p , with [b](L) p of shape [b, d, n L] and Φ(L):j (L):p of shape [e, d, n L, n L], Corresponding pseudocode: einsum(edpj, bdp ! bej) For Φ(L):j 1 of shape [e, n L], Corresponding pseudocode: einsum(ej, ! ej).unsqueeze(0) H.1.2. PSEUDOCODE FOR CASE i = 1 From the formula for [E(W)](1) jk : [E(W)](1) jk = q=1 Φ(1): k (1,0): q [W](1,0) jq + q=1 Φ(1): k (1,0)(L,0): q [WW](1,0)(L,0) jq q=1 Φ(1): k (1)(L,0): q [b W](1)(L,0) jq + Φ(1): k (1): [b](1) j We define the pseudocode for each term: For Φ(1): k (1,0): q [W](1,0) jq , with [W](1,0) jq of shape [b, d, n1, n0] and Φ(1): k (1,0): q of shape [d, e, n0, n0], Corresponding pseudocode: einsum(bdjq, deqk ! bejk) For Φ(1): k (1,0)(L,0): q [WW](1,0)(L,0) jq , with [WW](1,0)(L,0) jq of shape [b, d, n1, n0] and Φ(1): k (1,0)(L,0): q of shape [d, e, n0, n0], Corresponding pseudocode: einsum(bdjq, deqk ! bejk) For Φ(1): k (1)(L,0): q [b W](1)(L,0) jq , with [b W](1)(L,0) jq of shape [b, d, n1, n0] and Φ(1): k (1)(L,0): q of shape [d, e, n0, n0], Corresponding pseudocode: einsum(bdjq, deqk ! bejk) For Φ(1): k (1): [b](1) j , with [b](1) j of shape [b, d, n1] and Φ(1): k (1): of shape [d, e, n0], Corresponding pseudocode: einsum(bdj, dek ! bejk) From the formula for [E(b)](1) j : [E(b)](1) j = q=1 Φ(1): (1,0): q [W](1,0) jq + q=1 Φ(1): (1,0)(L,0): q [WW](1,0)(L,0) jq Equivariant Polynomial Functional Networks q=1 Φ(1): (1)(L,0): q [b W](1)(L,0) jq + Φ(1): (1): [b](1) j We define the pseudocode for each term: For Φ(1): (1,0): q [W](1,0) jq , with [W](1,0) jq of shape [b, d, n1, n0] and Φ(1): (1,0): q of shape [d, e, n0], Corresponding pseudocode: einsum(bdjq, deq ! bej) For Φ(1): (1,0)(L,0): q [WW](1,0)(L,0) jq , with [WW](1,0)(L,0) jq of shape [b, d, n1, n0] and Φ(1): (1,0)(L,0): q of shape [d, e, n0], Corresponding pseudocode: einsum(bdjq, deq ! bej) For Φ(1): (1)(L,0): q [b W](1)(L,0) jq , with [b W](1)(L,0) jq of shape [b, d, n1, n0] and Φ(1): (1)(L,0): q of shape [d, e, n0], Corresponding pseudocode: einsum(bdjq, deq ! bej) For Φ(1): (1): [b](1) j , with [b](1) j of shape [b, d, n1] and Φ(1): (1): of shape [d, e], Corresponding pseudocode: einsum(bdj, de ! bej) H.1.3. PSEUDOCODE FOR CASE 1 < i < L From the formula for [E(W)](i) jk : [E(W)](i) jk = Φ(i): (i,i 1): [W](i,i 1) jk + Φ(i): (i,0)(L,i 1): [WW](i,0)(L,i 1) jk + Φ(i): (i)(L,i 1): [b W](i)(L,i 1) jk We define the pseudocode for each term: For Φ(i): (i,i 1): [W](i,i 1) jk , with [W](i,i 1) jk of shape [b, d, ni, ni 1] and Φ(i): (i,i 1): of shape [d, e], Corresponding pseudocode: einsum(bdjk, de ! bejk) For Φ(i): (i,0)(L,i 1): [WW](i,0)(L,i 1) jk , with [WW](i,0)(L,i 1) jk of shape [b, d, ni, ni 1] and Φ(i): (i,0)(L,i 1): of shape [d, e], Corresponding pseudocode: einsum(bdjk, de ! bejk) For Φ(i): (i)(L,i 1): [b W](i)(L,i 1) jk , with [b W](i)(L,i 1) jk of shape [b, d, ni, ni 1] and Φ(i): (i)(L,i 1): of shape [d, e], Corresponding pseudocode: einsum(bdjk, de ! bejk) From the formula for [E(b)](i) j : [E(b)](i) j = Φ(i): (i,0): q [W](i,0) jq + Φ(i): (i,0)(L,0): q [WW](i,0)(L,0) jq Φ(i): (i,t)(t): [Wb](i,t)(t) j + Φ(i): (i)(L,0): q [b W](i)(L,0) jq + Φ(i): (i): [b](i) j We define the pseudocode for each term: For Φ(i): (i,0): q [W](i,0) jq , with [W](i,0) jq of shape [b, d, ni, n0] and Φ(i): (i,0): q of shape [d, e, n0], Equivariant Polynomial Functional Networks Corresponding pseudocode: einsum(bdjq, deq ! bej) For Φ(i): (i,0)(L,0): q [WW](i,0)(L,0) jq , with [WW](i,0)(L,0) jq of shape [b, d, ni, n0] and Φ(i): (i,0)(L,0): q of shape [d, e, n0], Corresponding pseudocode: einsum(bdjq, deq ! bej) For Φ(i): (i,t)(t): [Wb](i,t)(t) j , with [Wb](i,t)(t) j of shape [b, d, ni] and Φ(i): (i,t)(t): of shape [d, e], Corresponding pseudocode: einsum(bdj, de ! bej) For Φ(i): (i)(L,0): q [b W](i)(L,0) jq , with [b W](i)(L,0) jq of shape [b, d, ni, n0] and Φ(i): (i)(L,0): q of shape [d, e, n0], Corresponding pseudocode: einsum(bdjq, deq ! bej) For Φ(i): (i): [b](i) j , with [b](i) j of shape [b, d, ni] and Φ(i): (i): of shape [d, e], Corresponding pseudocode: einsum(bdj, de ! bej) H.2. Invariant Layers Pseudocode From the formula for the Invariant layer I(U): q=1 Φ(L,0)(L,0):pq [WW](L,0)(L,0) pq + q=1 Φ(L,0):pq [W](L,0) pq p=1 Φ(s,0)(L,s): [WW](s,0)(L,s) pp + q=1 Φ(L)(L,0):pq [b W](L)(L,0) pq p=1 Φ(L,t)(t):p [Wb](L,t)(t) p + X p=1 Φ(t)(L,t): [b W](t)(L,t) pp p=1 Φ(L):p [b](L) p + Φ1 We define the pseudocode for each term: For Φ(L,0)(L,0):pq [WW](L,0)(L,0) pq , with [WW](L,0)(L,0) pq of shape [b, d, n L, n0] and Φ(L,0)(L,0):pq of shape [d, e, n L, n0, d ], Corresponding pseudocode: einsum(bdpq, depqk ! bek) For Φ(L,0):pq [W](L,0) pq , with [W](L,0) pq of shape [b, d, n L, n0] and Φ(L,0):pq of shape [d, e, n L, n0, d ], Corresponding pseudocode: einsum(bdpqk, depqk ! bek) For Φ(s,0)(L,s): [WW](s,0)(L,s) pp , with [WW](s,0)(L,s) pp of shape [b, d, ns] and Φ(s,0)(L,s): of shape [d, e, d ], Corresponding pseudocode: einsum(bdpk, dek ! bek) For Φ(L)(L,0):pq [b W](L)(L,0) pq , with [b W](L)(L,0) pq of shape [b, d, n L, n0] and Φ(L)(L,0):pq of shape [d, e, n L, n0, d ], Corresponding pseudocode: einsum(bdpqk, depqk ! bek) For Φ(L,t)(t):p [Wb](L,t)(t) p , with [Wb](L,t)(t) p of shape [b, d, n L] and Φ(L,t)(t):p of shape [d, e, n L, d ], Corresponding pseudocode: einsum(bdpk, deijk ! bek) For Φ(t)(L,t): [b W](t)(L,t) pp , with [b W](t)(L,t) pp of shape [b, d, nt] and Φ(t)(L,t): of shape [d, e, d ], Corresponding pseudocode: einsum(bdpk, dek ! bek) Equivariant Polynomial Functional Networks For Φ(L):p [b](L) p , with [b](L) p of shape [b, d, n L] and Φ(L):p of shape [d, e, n L, d ], Corresponding pseudocode: einsum(bdpk, depk ! bek) For Φ1of shape [e, d ], Corresponding pseudocode: einsum(ek ! ek).unsqueeze(0)