# Equivariant Hypergraph Diffusion Neural Operators

Published as a conference paper at ICLR 2023

Peihao Wang^1, Shenghao Yang^3, Yunyu Liu^4, Zhangyang Wang^1, Pan Li^{2,4}
^1 University of Texas at Austin, ^2 Georgia Tech, ^3 University of Waterloo, ^4 Purdue University
{peihaowang,atlaswang}@utexas.edu, shenghao.yang@uwaterloo.ca, liu3154@purdue.edu, panli@gatech.edu
Correspondence to: Peihao Wang and Pan Li.

ABSTRACT

Hypergraph neural networks (HNNs), which use neural networks to encode hypergraphs, provide a promising way to model higher-order relations in data and further solve relevant prediction tasks built upon such higher-order relations. However, higher-order relations in practice contain complex patterns and are often highly irregular, so it is often challenging to design an HNN that is expressive enough for those relations while keeping computational efficiency. Inspired by hypergraph diffusion algorithms, this work proposes a new HNN architecture named ED-HNN, which provably approximates any continuous equivariant hypergraph diffusion operator; such operators can model a wide range of higher-order relations. ED-HNN can be implemented efficiently by combining star expansions of hypergraphs with standard message passing neural networks. ED-HNN further shows great superiority in processing heterophilic hypergraphs and constructing deep models. We evaluate ED-HNN for node classification on nine real-world hypergraph datasets. ED-HNN uniformly outperforms the best baselines over these nine datasets and achieves more than 2% improvement in prediction accuracy on four datasets therein. Our code is available at: https://github.com/Graph-COM/ED-HNN.

1 INTRODUCTION

Machine learning on graphs has recently attracted great attention in the community due to the ubiquitous graph-structured data and the associated inference and prediction problems (Zhu, 2005; Hamilton, 2020; Nickel et al., 2015). Current works primarily focus on graphs, which can model only pairwise relations in data. Emerging research has shown that higher-order relations that involve more than two entities often reveal more significant information in many applications (Benson et al., 2021; Schaub et al., 2021; Battiston et al., 2020; Lambiotte et al., 2019; Lee et al., 2021). For example, higher-order network motifs build the fundamental blocks of many real-world networks (Mangan & Alon, 2003; Benson et al., 2016; Tsourakakis et al., 2017; Li et al., 2017; Li & Milenkovic, 2017). Session-based (multi-step) behaviors often indicate the preferences of web users in more precise ways (Xia et al., 2021; Wang et al., 2020; 2021; 2022). To capture these higher-order relations, hypergraphs provide a dedicated mathematical abstraction (Berge, 1984). However, learning algorithms on hypergraphs are still far less developed than those on graphs. Recently, inspired by the success of graph neural networks (GNNs), researchers have started investigating hypergraph neural network models (HNNs) (Feng et al., 2019; Yadati et al., 2019; Dong et al., 2020; Huang & Yang, 2021; Bai et al., 2021; Arya et al., 2020). Compared with GNNs, designing HNNs is more challenging. First, as mentioned above, higher-order relations modeled by hyperedges can contain complex information. Second, hyperedges in real-world hypergraphs are often of large and irregular sizes. Therefore, how to effectively represent higher-order relations while efficiently processing those irregular hyperedges is the key challenge when designing HNNs.
In this work, inspired by the recently developed hypergraph diffusion algorithms (Li et al., 2020a; Liu et al., 2021b; Fountoulakis et al., 2021; Takai et al., 2020; Tudisco et al., 2021a), we design a novel HNN architecture that is provably expressive enough to approximate a large class of hypergraph diffusions while keeping computational efficiency. Hypergraph diffusion is appealing due to its transparency and has been widely applied to semi-supervised learning (Hein et al., 2013; Zhang et al., 2017a; Tudisco et al., 2021a), ranking aggregation (Li & Milenkovic, 2017; Chitra & Raphael, 2019), network analysis (Liu et al., 2021b; Fountoulakis et al., 2021; Takai et al., 2020) and signal processing (Zhang et al., 2019; Schaub et al., 2021), among others. However, traditional hypergraph diffusion needs to first handcraft potential functions to model higher-order relations and then use their gradients, or some variants of their gradients, as the diffusion operators that characterize the exchange of diffused quantities among the nodes within one hyperedge. The design of those potential functions often requires significant insights into the applications, which may not be available in practice.

We observe that the most commonly used hyperedge potential functions are permutation invariant, which covers the applications where none of the nodes in a higher-order relation is treated as inherently special. For such potential functions, we further show that their induced diffusion operators must be permutation equivariant. Inspired by this observation, we propose an NN-parameterized architecture that provably represents any continuous permutation-equivariant hyperedge diffusion operator, whose NN parameters can be learned in a data-driven way. We also introduce an efficient implementation based on current GNN platforms (Fey & Lenssen, 2019; Wang et al., 2019): we just need to combine a bipartite representation (or star expansion (Agarwal et al., 2006; Zien et al., 1999), equivalently) of hypergraphs with the standard message passing neural network (MPNN) (Gilmer et al., 2017). By stacking this architecture across layers with shared parameters, we finally obtain our model named Equivariant Diffusion-based HNN (ED-HNN). Fig. 1 shows an illustration of hypergraph diffusion and the key architecture in ED-HNN. To the best of our knowledge, we are the first to establish the connection between the general class of hypergraph diffusion algorithms and the design of HNNs.

Figure 1: Hypergraph diffusion often uses permutation-invariant hyperedge potentials to model higher-order relations. The gradients of those potentials determine the diffusion process and are termed diffusion operators. Our ED-HNN can universally represent such operators by feeding node representations into the message from hyperedges to nodes. Such small changes make big differences in model performance.
Previous HNNs were either less expressive for representing equivariant diffusion operators (Feng et al., 2019; Yadati et al., 2019; Dong et al., 2020; Huang & Yang, 2021; Chien et al., 2022; Bai et al., 2021) or needed to learn the representations by adding significantly many auxiliary nodes (Arya et al., 2020; Yadati, 2020; Yang et al., 2020). We provide a detailed discussion of them in Sec. 3.4. We also show that, due to the capability of representing equivariant diffusion operators, ED-HNN is by design good at predicting node labels over heterophilic hypergraphs where hyperedges mix nodes from different classes. Moreover, ED-HNN can go very deep without much performance decay. As an extra theoretical contribution, our proof of expressiveness avoids using equivariant polynomials as a bridge, which allows precise representations of continuous equivariant set functions by composing a continuous function with the sum of another continuous function applied to each set entry, while previous works (Zaheer et al., 2017; Segol & Lipman, 2020) have only achieved an approximation result. This result may be of independent interest for the community.

We evaluate ED-HNN by performing node classification over 9 real-world datasets that cover both heterophilic and homophilic hypergraphs. ED-HNN uniformly outperforms all baseline methods across these datasets and achieves significant improvement (>2%) over 4 datasets therein. ED-HNN also shows strong robustness when going deep. We also carefully design synthetic experiments to verify the expressiveness of ED-HNN in approximating pre-defined equivariant diffusion operators.

2 PRELIMINARIES: HYPERGRAPHS AND HYPERGRAPH DIFFUSION

Here, we formulate the hypergraph diffusion problem and, along the way, introduce the notation.

Definition 1 (Hypergraph). Let $G = (V, E, X)$ be an attributed hypergraph where $V, E$ are the node set and the hyperedge set, respectively. Each hyperedge $e = \{v_1^{(e)}, \dots, v_{|e|}^{(e)}\}$ is a subset of $V$. Unlike graphs, a hyperedge may contain more than two nodes. $X = [\dots, x_v, \dots]^\top \in \mathbb{R}^N$ denotes the node attributes and $x_v$ denotes the attribute of node $v$. Define $d_v = |\{e \in E : v \in e\}|$ as the degree of node $v$. Let $D$, $D_e$ denote the diagonal degree matrix for $v \in V$ and its sub-matrix for $v \in e$, respectively. Here, we use 1-dim attributes for convenient discussion while our experiments often have multi-dim attributes.

Learning algorithms will combine attributes and hypergraph structures into (latent) features defined as follows, which can be further used to make predictions for downstream tasks.

Definition 2 (Latent features). Let $h_v \in \mathbb{R}$ denote the (latent) features of node $v \in V$. The feature vector $H = [\dots, h_v, \dots]_{v \in V} \in \mathbb{R}^N$ includes node features as entries. Further, collect the features into a hyperedge feature vector $H_e = [h_{e,1}, \dots, h_{e,|e|}]^\top \in \mathbb{R}^{|e|}$, where $h_{e,i}$ corresponds to the feature $h_{v_i^{(e)}}$ for some node $v_i^{(e)} \in e$. For any $v \in e$, there is one corresponding index $i \in \{1, \dots, |e|\}$. Later, we may use subscripts $(e, i)$ and $(e, v)$ interchangeably if they cause no confusion.

A widely-used heuristic to generate features $H$ is via hypergraph diffusion algorithms.

Definition 3 (Hypergraph Diffusion). Define node potential functions $f(\cdot; x_v): \mathbb{R} \to \mathbb{R}$ for $v \in V$ and hyperedge potential functions $g_e(\cdot): \mathbb{R}^{|e|} \to \mathbb{R}$ for each $e \in E$. Hypergraph diffusion combines the node attributes and the hypergraph structure and asks to solve

$$\min_{H \in \mathbb{R}^N} \; \sum_{v \in V} f(h_v; x_v) + \sum_{e \in E} g_e(H_e). \tag{1}$$

In practice, $g_e$ is often shared across hyperedges of the same size. Later, we ignore the subscript $e$.
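To make Definitions 1–3 concrete, the following minimal NumPy sketch stores a toy hypergraph as a list of node-index sets and evaluates the objective in Eq. (1); the toy data, the quadratic node potential, and the clique-expansion hyperedge potential are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Toy attributed hypergraph (Definition 1): 5 nodes with 1-dim attributes, 3 hyperedges.
x = np.array([0.1, 0.9, 0.8, 0.2, 0.5])           # node attributes x_v
hyperedges = [[0, 1, 2], [2, 3], [1, 3, 4]]        # each hyperedge is a subset of V
deg = np.zeros(len(x))                             # node degrees d_v = |{e : v in e}|
for e in hyperedges:
    deg[e] += 1

def f(h_v, x_v):                                   # node potential, e.g. (h_v - x_v)^2 / 2
    return 0.5 * (h_v - x_v) ** 2

def g_ce(H_e):                                     # clique-expansion hyperedge potential
    return 0.5 * np.sum((H_e[:, None] - H_e[None, :]) ** 2)

def objective(h):                                  # Eq. (1): sum_v f(h_v; x_v) + sum_e g(H_e)
    return sum(f(h[v], x[v]) for v in range(len(x))) + sum(g_ce(h[e]) for e in hyperedges)

h = x.copy()                                       # initialize latent features with attributes
print(objective(h), deg)
```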
The two potential functions are often designed via heuristics in the traditional hypergraph diffusion literature. Node potentials often correspond to some negative-log kernels of the latent features and the attributes. For example, $f(h_v; x_v)$ could be $(h_v - x_v)^2/2$ when computing hypergraph PageRank diffusion (Li et al., 2020a; Takai et al., 2020). Hyperedge potentials are more significant and complex, as they need to model higher-order relations among more than two objects, which makes hypergraph diffusion very different from graph diffusion. Here we list a few examples.

Example 1 (Hyperedge potentials). Some practical $g(H_e)$ may be chosen as follows.
- Clique Expansion (CE, hyperedges reduced to cliques) plus pairwise potentials (Zhou et al., 2007): $\sum_{u,v \in e} (h_v - h_u)^2/2$, or with degree normalization $\sum_{u,v \in e} (h_v/\sqrt{d_v} - h_u/\sqrt{d_u})^2/2$ (i.e., $g(D_e^{-1/2} H_e)$).
- Divergence to the mean (Tudisco et al., 2021a;b): $\sum_{v \in e} (h_v - |e|^{-1}\|H_e\|_p)^2/2$, where $\|\cdot\|_p$ computes the $\ell_p$-norm.
- Total Variation (TV) (Hein et al., 2013; Zhang et al., 2017a): $\max_{u,v \in e} |h_v - h_u|^p$, $p \in \{1, 2\}$.
- Lovász extension (Lovász, 1983) for cardinality-based set functions (LEC) (Jegelka et al., 2013; Li et al., 2020a; Liu et al., 2021b): $\langle y, \varsigma(H_e)\rangle^p$, $p \in \{1, 2\}$, where $y = [\dots, y_j, \dots]^\top \in \mathbb{R}^{|e|}$ is a constant vector and $\varsigma(H_e)$ sorts the values of $H_e$ in decreasing order. One may reproduce TV by using LEC and setting $y_1 = -y_{|e|} = 1$.

To reveal more properties of these hyperedge potentials $g(\cdot)$ later, we give the following definitions.

Definition 4 (Permutation Invariance & Equivariance). A function $\psi: \mathbb{R}^K \to \mathbb{R}$ is permutation-invariant if for any $K$-dim permutation matrix $P \in \Pi[K]$, $\psi(PZ) = \psi(Z)$ for all $Z \in \mathbb{R}^K$. A function $\psi: \mathbb{R}^K \to \mathbb{R}^K$ is permutation-equivariant if for any $K$-dim permutation matrix $P \in \Pi[K]$, $\psi(PZ) = P\psi(Z)$ for all $Z \in \mathbb{R}^K$.

We may easily verify the permutation invariance of the hyperedge potentials in Example 1. The underlying physical meaning is that the prediction goal of an application is independent of node identities in a hyperedge, so practical $g$'s are often invariant w.r.t. the node ordering (Veldt et al., 2021).

3 NEURAL EQUIVARIANT DIFFUSION OPERATORS AND ED-HNN

The previous design of hyperedge potential functions is tricky. Early works adopted clique or star expansion by reducing hyperedges into edges (Zhou et al., 2007; Agarwal et al., 2006; Li & Milenkovic, 2017) and further used traditional graph methods. Later, researchers proved that those hyperedge reduction techniques cannot well represent higher-order relations (Li & Milenkovic, 2017; Chien et al., 2019). Therefore, Lovász extensions of set-based cut-cost functions on hyperedges have been proposed recently and used as the potential functions (Hein et al., 2013; Li & Milenkovic, 2018; Li et al., 2020a; Takai et al., 2020; Fountoulakis et al., 2021; Yoshida, 2019). However, designing those set-based cut costs is hard in practice and needs a lot of trial and error. Other types of handcrafted hyperedge potentials that model information propagation can also be found in (Neuhäuser et al., 2022; 2021), which again are handcrafted and heavily based on heuristics and evaluation performance. Our idea uses data-driven approaches to model such potentials, which naturally brings us to HNNs. On one hand, we expect to leverage the extreme expressive power of NNs to learn the desired hypergraph diffusion automatically from the data. On the other hand, we are interested in having novel hypergraph NN (HNN) architectures inspired by traditional hypergraph diffusion solvers.
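As a quick numerical check of Definition 4 against Example 1 (a sketch only; the specific feature values and the LEC weight vector $y$, chosen so that LEC reproduces TV, are our own illustrative assumptions), each potential below returns the same value for any reordering of the nodes inside a hyperedge:

```python
import numpy as np

def g_ce(H):                        # clique expansion: sum_{u,v in e} (h_v - h_u)^2 / 2
    return 0.5 * np.sum((H[:, None] - H[None, :]) ** 2)

def g_tv(H, p=2):                   # total variation: max_{u,v in e} |h_v - h_u|^p
    return (H.max() - H.min()) ** p

def g_lec(H, y, p=2):               # LEC: <y, sort(H, decreasing)>^p
    return np.dot(y, np.sort(H)[::-1]) ** p

H = np.array([0.7, 0.5, 0.3, 0.9])
y = np.array([1.0, 0.0, 0.0, -1.0])                # y_1 = -y_{|e|} = 1 reproduces TV
for _ in range(5):
    P = np.random.permutation(len(H))              # a random reordering of the nodes
    assert np.isclose(g_ce(H), g_ce(H[P]))         # invariance: g(PZ) = g(Z)
    assert np.isclose(g_tv(H), g_tv(H[P]))
    assert np.isclose(g_lec(H, y), g_lec(H[P], y))
```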
To achieve these goals, we next show that, by the gradient descent algorithm (GD) or the alternating direction method of multipliers (ADMM) (Boyd et al., 2011), solving objective Eq. 1 amounts to iteratively applying some hyperedge diffusion operators. Parameterizing such operators using NNs at each step unfolds hypergraph diffusion into an HNN. The roadmap is as follows. In Sec. 3.1, we make the key observation that the diffusion operators are inherently permutation-equivariant. To universally represent them, in Sec. 3.2, we propose an equivariant NN-based operator and use it to build ED-HNN via an efficient implementation. In Sec. 3.3, we discuss the benefits of learning equivariant diffusion operators. In Sec. 3.4, we review previous HNNs and discuss why none of them allows efficient modeling of such equivariant diffusion operators via their architectures.

3.1 EMERGING EQUIVARIANCE IN HYPERGRAPH DIFFUSION

We start by discussing the traditional solvers for Eq. 1. If $f$ and $g$ are both differentiable, one straightforward optimization approach is to adopt gradient descent. The node-wise update of each iteration can be formulated as below:

$$h_v^{(t+1)} \leftarrow h_v^{(t)} - \eta\Big(\nabla f(h_v^{(t)}; x_v) + \sum_{e: v \in e} [\nabla g(H_e^{(t)})]_v\Big), \quad \text{for } v \in V, \tag{2}$$

where $[\nabla g(H_e)]_v$ denotes the gradient w.r.t. $h_v$ for $v \in e$. We use the superscript $t$ to denote the index of the current iteration, $h_v^{(0)} = x_v$ gives the initial features, and $\eta$ is known as the step size. For general $f$ and $g$, we may adopt ADMM: for each $e \in E$, we introduce an auxiliary variable $Q_e = [q_{e,1}, \dots, q_{e,|e|}]^\top \in \mathbb{R}^{|e|}$. We initialize $h_v^{(0)} = x_v$ and $Q_e^{(0)} = H_e^{(0)}$, and then iterate

$$Q_e^{(t+1)} \leftarrow \mathrm{prox}_{\eta g}\big(2H_e^{(t)} - Q_e^{(t)}\big) - H_e^{(t)} + Q_e^{(t)}, \quad \text{for } e \in E, \tag{3}$$

$$h_v^{(t+1)} \leftarrow \mathrm{prox}_{\eta f(\cdot; x_v)/d_v}\Big(\sum_{e: v \in e} q_{e,v}^{(t+1)}/d_v\Big), \quad \text{for } v \in V, \tag{4}$$

where $\mathrm{prox}_{\psi}(h) \triangleq \arg\min_z \psi(z) + \frac{1}{2}\|z - h\|_2^2$ is the proximal operator. The detailed derivation can be found in Appendix A. The iterations have a convergence guarantee under closed convexity assumptions on $f, g$ (Boyd et al., 2011). However, our model does not rely on the convergence, as it just runs the iterations for a given number of steps (aka the number of layers in ED-HNN).

The operator $\mathrm{prox}_{\psi}(\cdot)$ has nice properties as reviewed in Proposition 1, which enables the possibility of NN-based approximation even when $f$ and $g$ are not differentiable. One noted non-differentiable example is the LEC case where $g(H_e) = \langle y, \varsigma(H_e)\rangle^p$ in Example 1.

Proposition 1 (Parikh & Boyd (2014); Polson et al. (2015)). If $\psi(\cdot): \mathbb{R}^K \to \mathbb{R}$ is a lower semi-continuous convex function, then $\mathrm{prox}_{\psi}(\cdot)$ is 1-Lipschitz continuous.

The node-side operations, the gradient $\nabla f(\cdot; x_v)$ and the proximal gradient $\mathrm{prox}_{\eta f(\cdot; x_v)}(\cdot)$, are relatively easy to model, while the operations on hyperedges are more complicated. We name the gradient $\nabla g(\cdot): \mathbb{R}^{|e|} \to \mathbb{R}^{|e|}$ and the proximal gradient $\mathrm{prox}_{\eta g}(\cdot): \mathbb{R}^{|e|} \to \mathbb{R}^{|e|}$ hyperedge diffusion operators, since they summarize the collection of node features inside a hyperedge and dispatch the aggregated information to interior nodes individually. Next, we reveal one crucial property of those hyperedge diffusion operators in the following proposition (see the proof in Appendix B.1):

Proposition 2. Given any permutation-invariant hyperedge potential function $g(\cdot)$, the hyperedge diffusion operators $\mathrm{prox}_{\eta g}(\cdot)$ and $\nabla g(\cdot)$ are permutation equivariant.

It states that a permutation-invariant hyperedge potential leads to an operator that should process different nodes in a permutation-equivariant way.
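A minimal PyTorch sketch of the gradient-descent update in Eq. (2), with the hyperedge operators $[\nabla g(H_e)]_v$ obtained by automatic differentiation; the clique-expansion potential, node potential $f(h_v; x_v) = (h_v - x_v)^2/2$, and step size are illustrative assumptions. It also spot-checks the equivariance property stated in Proposition 2.

```python
import torch

def g_ce(H_e):                                    # an invariant hyperedge potential (CE)
    return 0.5 * ((H_e[:, None] - H_e[None, :]) ** 2).sum()

def diffusion_step(h, x, hyperedges, g, eta=0.1):
    """One step of Eq. (2): h_v <- h_v - eta * (grad f(h_v; x_v) + sum_e [grad g(H_e)]_v),
    with f(h_v; x_v) = (h_v - x_v)^2 / 2, so grad f = h - x."""
    h = h.detach().requires_grad_(True)
    (grad_g,) = torch.autograd.grad(sum(g(h[e]) for e in hyperedges), h)
    return (h - eta * ((h - x) + grad_g)).detach()

# Spot-check Proposition 2: since g is permutation-invariant, g(H_e[perm]) and g(H_e)
# are the same function of H_e, so their gradients coincide; this is equivalent to the
# equivariance grad g(P Z) = P grad g(Z).
H_e = torch.rand(4, requires_grad=True)
(grad,) = torch.autograd.grad(g_ce(H_e), H_e)
perm = torch.randperm(4)
(grad_perm,) = torch.autograd.grad(g_ce(H_e[perm]), H_e)
assert torch.allclose(grad, grad_perm)

x = torch.tensor([0.1, 0.9, 0.8, 0.2, 0.5])
h = diffusion_step(x.clone(), x, [[0, 1, 2], [2, 3], [1, 3, 4]], g_ce)
```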
3.2 BUILDING EQUIVARIANT HYPEREDGE DIFFUSION OPERATORS

Our design of permutation-equivariant diffusion operators is built upon the following Theorem 1. We leave the proof to Appendix B.2.

Theorem 1. $\psi(\cdot): [0,1]^K \to \mathbb{R}^K$ is a continuous permutation-equivariant function if and only if it can be represented as $[\psi(Z)]_i = \rho\big(z_i, \sum_{j=1}^{K} \phi(z_j)\big)$, $i \in [K]$, for any $Z = [\dots, z_i, \dots]^\top \in [0,1]^K$, where $\rho: \mathbb{R}^{K'} \to \mathbb{R}$ and $\phi: \mathbb{R} \to \mathbb{R}^{K'-1}$ are two continuous functions, and $K' \geq K$.

Remark on an extra theoretical contribution: Theorem 1 indicates that any continuous permutation-equivariant operator whose input entries have a 1-dim feature channel can be precisely written as a composition of a continuous function $\rho$ and the sum of another continuous function $\phi$ applied to each input entry. This result generalizes the representation of permutation-invariant functions as $\rho(\sum_{i=1}^{K} \phi(z_i))$ in (Zaheer et al., 2017) to the equivariant case. An architecture in a similar spirit was proposed in (Segol & Lipman, 2020). However, their proof only allows approximation, i.e., a small $\|\psi(Z) - \hat\psi(Z)\|$, instead of the precise representation in Theorem 1. Also, Zaheer et al. (2017); Segol & Lipman (2020); Sannai et al. (2019) focus on representations of a single set instead of a hypergraph with coupled sets.

The above theoretical observation inspires the design of equivariant hyperedge diffusion operators. Specifically, for an operator $\hat\psi(\cdot): \mathbb{R}^{|e|} \to \mathbb{R}^{|e|}$ that may denote either the gradient $\nabla g(\cdot)$ or the proximal gradient $\mathrm{prox}_{\eta g}(\cdot)$ for each hyperedge $e$, we may parameterize it as

$$[\hat\psi(H_e)]_v = \hat\rho\Big(h_v, \sum_{u \in e} \hat\phi(h_u)\Big), \quad \text{for } v \in e, \quad \text{where } \hat\rho, \hat\phi \text{ are multi-layer perceptrons (MLPs).} \tag{5}$$

Intuitively, the inner sum collects the $\hat\phi$-encoded node features within a hyperedge, and $\hat\rho$ then combines this collection with the features of each node to perform a separate, node-specific operation.

The implementation of the above $\hat\psi$ is not trivial. A naive implementation is to generate an auxiliary node to represent each $(v, e)$-pair for $v \in V$ and $e \in E$ and learn its representation, as adopted in (Yadati, 2020; Arya et al., 2020; Yang et al., 2020). However, this may substantially increase the model complexity. Our implementation is built upon the bipartite representation (or star expansion (Zien et al., 1999; Agarwal et al., 2006), equivalently) of hypergraphs paired with the standard message passing NN (MPNN) (Gilmer et al., 2017), which can be efficiently implemented via GNN platforms (Fey & Lenssen, 2019; Wang et al., 2019) or sparse-matrix multiplication. Specifically, we build a bipartite graph $\tilde G = (\tilde V, \tilde E)$. The node set $\tilde V$ contains two parts $V \cup V_E$, where $V$ is the original node set while $V_E$ contains nodes that correspond to the original hyperedges $e \in E$. Then, we add an edge between $v \in V$ and $e \in V_E$ if $v \in e$. With this bipartite graph representation, the model ED-HNN is implemented by following Algorithm 1.

Algorithm 1: ED-HNN
Initialization: $H^{(0)} = X$ and three MLPs $\hat\phi, \hat\rho, \hat\varphi$ (shared across $L$ layers). For $t = 0, 1, 2, \dots, L-1$, do:
1. Design the messages from $V$ to $E$: $m_{u \to e}^{(t)} = \hat\phi(h_u^{(t)})$, for all $u \in V$.
2. Sum the $V \to E$ messages over hyperedges: $m_e^{(t)} = \sum_{u \in e} m_{u \to e}^{(t)}$, for all $e \in E$.
3. Broadcast $m_e^{(t)}$ and design the messages from $E$ to $V$: $m_{e \to v}^{(t)} = \hat\rho(h_v^{(t)}, m_e^{(t)})$, for all $v \in e$.
4. Update $h_v^{(t+1)} = \hat\varphi\big(h_v^{(t)}, \sum_{e: v \in e} m_{e \to v}^{(t)}, x_v, d_v\big)$, for all $v \in V$.

The equivariant diffusion operator $\hat\psi$ can be constructed via steps 1–3. The last step updates the node features to accomplish the first two terms in Eq. 2 or the ADMM update Eq. 4. We leave a more detailed discussion of how Algorithm 1 aligns with the GD and ADMM updates to Appendix C. The initial attributes $x_v$ and the node degrees are included to match the diffusion algorithm by design. As the diffusion operators are shared across iterations, ED-HNN shares parameters across layers.
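A compact PyTorch sketch of one layer of Algorithm 1 on the star expansion, written with plain scatter-style `index_add_` instead of a GNN library; the MLP widths and the assumption that the attributes $x$ are already projected to the hidden dimension are our own choices, and this is not the released ED-HNN implementation.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out, d_hid=64):
    return nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(), nn.Linear(d_hid, d_out))

class EDHNNLayer(nn.Module):
    """One layer of Algorithm 1. node_idx[k], edge_idx[k] enumerate the incidences of the
    star expansion, i.e., node node_idx[k] belongs to hyperedge edge_idx[k]."""
    def __init__(self, dim):
        super().__init__()
        self.phi = mlp(dim, dim)              # \hat{phi}: per-node encoder (step 1)
        self.rho = mlp(2 * dim, dim)          # \hat{rho}: equivariant E -> V message (step 3)
        self.upd = mlp(3 * dim + 1, dim)      # \hat{varphi}: node update (step 4)

    def forward(self, h, x, node_idx, edge_idx, num_edges, deg):
        m_ve = self.phi(h)[node_idx]                                              # step 1: m_{u->e}
        m_e = torch.zeros(num_edges, h.size(1), device=h.device).index_add_(0, edge_idx, m_ve)  # step 2
        m_ev = self.rho(torch.cat([h[node_idx], m_e[edge_idx]], dim=-1))          # step 3: rho(h_v, m_e)
        agg = torch.zeros_like(h).index_add_(0, node_idx, m_ev)                   # sum_{e: v in e} m_{e->v}
        return self.upd(torch.cat([h, agg, x, deg.unsqueeze(-1)], dim=-1))        # step 4

# Toy usage: 5 nodes, hyperedges {0,1,2}, {2,3}, {1,3,4}; parameters shared across layers.
node_idx = torch.tensor([0, 1, 2, 2, 3, 1, 3, 4])
edge_idx = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
deg = torch.zeros(5).index_add_(0, node_idx, torch.ones(8))
x = torch.randn(5, 16); h = x.clone()
layer = EDHNNLayer(16)
for _ in range(3):                                # L diffusion steps with one shared layer
    h = layer(h, x, node_idx, edge_idx, num_edges=3, deg=deg)
```

Reusing the same `layer` object across steps mirrors the parameter sharing across layers discussed above.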
Now, we summarize the joint contribution of our theory and the efficient implementation as follows.

Proposition 3. MPNNs on bipartite representations (or star expansions) of hypergraphs are expressive enough to learn any continuous diffusion operators induced by invariant hyperedge potentials.

Simple while Significant Architecture Difference. Some previous works (Dong et al., 2020; Huang & Yang, 2021; Chien et al., 2022) also apply GNNs over bipartite representations of hypergraphs to build their HNN models, which look similar to ED-HNN. However, ED-HNN has a simple but significant architecture difference: step 3 in Algorithm 1 adopts $h_v^{(t)}$ as one input to compute the message $m_{e \to v}^{(t)}$. This is a crucial step to guarantee that the hyperedge operation formed by combining steps 1–3 is an equivariant operator from $\{h_v^{(t)}\}_{v \in e}$ to $\{m_{e \to v}^{(t)}\}_{v \in e}$. In contrast, previous models often by default adopt $m_{e \to v}^{(t)} = m_e^{(t)}$, which leads to an invariant operator. Such a simple change is significant as it guarantees universal approximation of equivariant operators, which leads to many benefits to be discussed in Sec. 3.3. Our experiments will verify these benefits.

Extension. ED-HNN can be naturally extended to build equivariant node operators because of the duality between hyperedges and nodes, though this is not necessary in the traditional hypergraph diffusion problem Eq. 1. Specifically, we call this model ED-HNNII, which simply revises step 1 in ED-HNN to $m_{u \to e}^{(t)} = \hat\phi(h_u^{(t)}, m_e^{(t-1)})$ for any $e$ such that $u \in e$. Due to the page limit, we put some experimental results on ED-HNNII in Appendix F.1.

3.3 ADVANTAGES OF ED-HNN IN HETEROPHILIC SETTINGS AND DEEP MODELS

Here, we discuss several advantages of ED-HNN due to the design of equivariant diffusion operators. Heterophily describes the network phenomenon where nodes with the same labels and attributes are less likely to connect to each other directly (Rogers, 2010). Predicting node labels in heterophilic networks is known to be more challenging than in homophilic networks, and thus has recently become an important research direction (Pei et al., 2020; Zhu et al., 2020; Chien et al., 2021; Lim et al., 2021). Heterophily has been proved to be a more common phenomenon in hypergraphs than in graphs, since it is hard to expect all nodes in a giant hyperedge to share a common label (Veldt et al., 2022). Moreover, predicting node labels in heterophilic hypergraphs is more challenging than in graphs, as a hyperedge may consist of nodes from multiple categories.

Learnable equivariant diffusion operators are expected to be superior in predicting heterophilic node labels. For example, suppose a hyperedge $e$ of size 3 is known to cover two nodes $v, u$ from class $C_1$ and one node $w$ from class $C_0$. We may use the LEC potential $\langle y, \varsigma(H_e)\rangle^2$ by setting the parameter $y = [1, -1, 0]^\top$. Suppose the three nodes' features are $H_e = [h_v, h_u, h_w]^\top = [0.7, 0.5, 0.3]^\top$, where $h_v, h_u$ are close as they are from the same class. One may check that the hyperedge diffusion operator gives $\nabla g(H_e) = [0.4, -0.4, 0]^\top$. One-step gradient descent $h_v \leftarrow h_v - \eta[\nabla g(H_e)]_v$ with a proper step size ($\eta < 0.5$) drags $h_v$ and $h_u$ closer while keeping $h_w$ unchanged. However, invariant diffusion, which forces the operation on hyperedge $e$ to satisfy $[\nabla g(H_e)]_v = [\nabla g(H_e)]_u = [\nabla g(H_e)]_w$ (invariant messages from $e$ to the nodes in $e$), allocates every node the same change and cannot deal with the heterogeneity of node labels in $e$. Moreover, a learnable operator is important. To see this, suppose we have different ground-truth labels, say $w, u$ from the same class and $v$ from the other. Then, using the above parameter $y$ may increase the error. However, learnable operators can address the problem, e.g., by obtaining a more suitable parameter $y = [0, 1, -1]^\top$ via training.
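A small numerical sketch of the worked example above (the step size and the choice of a single shared value for the invariant case are our illustrative assumptions; subgradient issues at ties are ignored): it reproduces $\nabla g(H_e) = [0.4, -0.4, 0]^\top$ and contrasts the equivariant update with an invariant one that sends the same message to every node.

```python
import numpy as np

H_e = np.array([0.7, 0.5, 0.3])                  # h_v, h_u (class C1), h_w (class C0)
y = np.array([1.0, -1.0, 0.0])                   # LEC parameter from the example

order = np.argsort(-H_e)                         # indices sorting H_e in decreasing order
inner = np.dot(y, H_e[order])                    # <y, sigma(H_e)> = 0.7 - 0.5 = 0.2
grad = np.zeros(3); grad[order] = 2 * inner * y  # gradient of <y, sigma(H_e)>^2 (no ties here)
print(grad)                                      # [0.4, -0.4, 0.0]

eta = 0.2                                        # any eta < 0.5 shrinks the gap; 0.2 keeps the order
print(H_e - eta * grad)                          # equivariant: v and u move closer, w unchanged

shared = 2 * inner                               # an invariant operator sends one shared value,
print(H_e - eta * shared * np.ones(3))           # so all pairwise gaps are preserved and the
                                                 # heterophily inside e cannot be resolved
```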
Moreover, equivariant hyperedge diffusion operators are also good at building deep models. GNNs tend to degrade in performance when going deep, which is often attributed to their oversmoothing (Li et al., 2018; Oono & Suzuki, 2019) and overfitting problems (Cong et al., 2021). Equivariant operators that allocate different messages across nodes help with overcoming the oversmoothing issue. Moreover, diffusion with parameters shared across layers may reduce the risk of overfitting.

3.4 RELATED WORKS: PREVIOUS HNNS FOR REPRESENTING HYPERGRAPH DIFFUSION

ED-HNN is the first HNN inspired by the general class of hypergraph diffusion and can provably achieve universal approximation of hyperedge diffusion operators. So, how do previous HNNs fare at representing hypergraph diffusion? We temporarily ignore the big difference in whether parameters are shared across layers and give some analysis as follows. HGNN (Feng et al., 2019) runs graph convolution (Kipf & Welling, 2017) on clique expansions of hypergraphs, which directly assumes the hyperedge potentials follow CE plus pairwise potentials and cannot learn other operations on hyperedges. HyperGCN (Yadati et al., 2019) essentially leverages the total variation potential by adding mediator nodes to each hyperedge, adopting a transformation technique of the total variation potential from (Chan et al., 2018; Chan & Liang, 2020). GHSC (Zhang et al., 2022) is designed to represent a specific edge-dependent-node-weight hypergraph diffusion proposed in (Chitra & Raphael, 2019). A few works view each hyperedge as a multi-set of nodes and each node as a multi-set of hyperedges (Dong et al., 2020; Huang & Yang, 2021; Chien et al., 2022; Bai et al., 2021; Arya et al., 2020; Yadati, 2020; Yang et al., 2020; Jo et al., 2021). Among them, HNHN (Dong et al., 2020), UniGNNs (Huang & Yang, 2021), EHGNN (Jo et al., 2021) and AllSet (Chien et al., 2022) build two invariant set-pooling functions on both the hyperedge and node sides, which cannot represent equivariant functions. HCHA (Bai et al., 2021) and UniGAT (Yang et al., 2020) compute attention weights by combining node features and hyperedge messages, which looks like our $\hat\rho$ operation in step 3. However, a scalar attention weight is too limited to represent the potentially complicated $\hat\rho$.
HyperSAGE (Arya et al., 2020), MPNN-R (Yadati, 2020) and LEGCN (Yang et al., 2020) may learn hyperedge equivariant operators in principle by adding auxiliary nodes to represent node-hyperedge pairs, as mentioned above. However, because of the extra complexity, these models are either too slow or need to reduce their parameter sizes to fit in memory, which constrains their actual performance in practice. Of course, none of these works makes any theoretical argument on universal representation of equivariant hypergraph diffusion operators as ours does. One subtle point is that Uni{GIN, GraphSAGE, GCNII} (Huang & Yang, 2021) adopt a jump link in each layer to pass node features from the former layer directly to the next and may hope to use a complicated node-side invariant function $\hat\psi'(h_v^{(t)}, \sum_{e: v \in e} m_e^{(t)})$ to directly approximate our $\hat\varphi(h_v^{(t)}, \sum_{e: v \in e} m_{e \to v}^{(t)}) = \hat\varphi(h_v^{(t)}, \sum_{e: v \in e} \hat\rho(h_v^{(t)}, m_e^{(t)}))$ in step 4 of Algorithm 1. This may be doable in theory. However, the jump-link solution may require a higher dimension of $m_e^{(t)}$ (dim $= |\cup_{e: v \in e} e|$) so that the sum pooling $\sum_{e: v \in e} m_e^{(t)}$ does not lose anything from the neighboring nodes of $v$ before interacting with $h_v^{(t)}$, while in ED-HNN, $m_e^{(t)}$ only needs dim $= |e|$ according to Theorem 1. Our empirical experiments also verify the greater expressiveness of ED-HNN.

4 EXPERIMENTS

4.1 RESULTS ON BENCHMARKING DATASETS

Experiment Setting. In this subsection, we evaluate ED-HNN on nine real-world benchmarking hypergraphs. We focus on the semi-supervised node classification task. The nine datasets include co-citation networks (Cora, Citeseer, Pubmed) and co-authorship networks (Cora-CA, DBLP-CA) (Yadati et al., 2019), Walmart (Amburg et al., 2020), House (Chodrow et al., 2021), and Congress and Senate (Fowler, 2006b;a). More details of these datasets can be found in Appendix F.2. Since the last four hypergraphs do not contain node attributes, we follow the method of Chien et al. (2022) to generate node features from a label-dependent Gaussian distribution (Deshpande et al., 2018). As we show in Table 1, these datasets already cover sufficiently diverse hypergraphs in terms of scale, structure, and homo-/heterophily.

Table 1: Dataset statistics. CE homophily is the homophily score (Pei et al., 2020) based on CE of hypergraphs. Here, four hypergraphs (Congress, Senate, Walmart, House) are heterophilic.

| | Cora | Citeseer | Pubmed | Cora-CA | DBLP-CA | Congress | Senate | Walmart | House |
|---|---|---|---|---|---|---|---|---|---|
| # nodes | 2708 | 3312 | 19717 | 2708 | 41302 | 1718 | 282 | 88860 | 1290 |
| # edges | 1579 | 1079 | 7963 | 1072 | 22363 | 83105 | 315 | 69906 | 340 |
| # classes | 7 | 6 | 3 | 7 | 6 | 2 | 2 | 11 | 2 |
| avg. \|e\| | 3.03 | 3.200 | 4.349 | 4.277 | 4.452 | 8.656 | 17.168 | 6.589 | 34.730 |
| CE homophily | 0.897 | 0.893 | 0.952 | 0.803 | 0.869 | 0.555 | 0.498 | 0.530 | 0.509 |

We compare our method with top-performing models on these benchmarks, including HGNN (Feng et al., 2019), HCHA (Bai et al., 2021), HNHN (Dong et al., 2020), HyperGCN (Yadati et al., 2019), UniGCNII (Huang & Yang, 2021), AllDeepSets (Chien et al., 2022), AllSetTransformer (Chien et al., 2022), and a recent diffusion method, HyperND (Prokopchik et al., 2022). All the hyperparameters for baselines follow (Chien et al., 2022), and we fix the learning rate, weight decay and other training recipes to be the same as the baselines. Other model-specific hyperparameters are obtained via grid search (see Appendix F.3). We randomly split the data into training/validation/test samples using 50%/25%/25% splitting percentages, following Chien et al. (2022). We choose prediction accuracy as the evaluation metric. We run each model ten times with different training/validation splits to obtain the standard deviation. In Appendix F.7, we also provide an analysis of the model sensitivity to different hidden dimensions.
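A hedged sketch of the feature construction and splitting protocol described above; the class means, feature dimension, and standard deviation are placeholders, and the actual benchmarks follow the recipes of Chien et al. (2022) and Deshpande et al. (2018) rather than this exact code.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_features(labels, dim=16, sigma=1.0):
    """Label-dependent Gaussian features for attribute-free hypergraphs:
    each class c gets a random mean mu_c, and x_v ~ N(mu_{y_v}, sigma^2 I)."""
    means = {c: rng.normal(size=dim) for c in np.unique(labels)}
    return np.stack([means[y] + sigma * rng.normal(size=dim) for y in labels])

def random_split(num_nodes, train=0.5, val=0.25):
    """50%/25%/25% train/validation/test split over node indices."""
    idx = rng.permutation(num_nodes)
    n_tr, n_va = int(train * num_nodes), int(val * num_nodes)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

labels = rng.integers(0, 2, size=282)            # e.g., a two-class hypergraph like Senate
X = gaussian_features(labels)
train_idx, val_idx, test_idx = random_split(len(labels))
```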
Performance Analysis. Table 2 shows the results. ED-HNN uniformly outperforms all the compared models on all the datasets. We observe that the top-performing baseline models are AllSetTransformer, AllDeepSets and UniGCNII. As analyzed in Sec. 3.4, they model invariant set functions on both the node and hyperedge sides. UniGCNII also adds initial and jump links, which accidentally resonates with our design principle (step 4 in Algorithm 1). However, their performance varies substantially across datasets. For example, UniGCNII attains promising performance on citation networks but subpar results on the Walmart dataset. In contrast, our model achieves stably superior results, surpassing the AllSet models by 12.9% on Senate and UniGCNII by 12.5% on Walmart. We attribute this empirical significance to the theoretical design of exact equivariant function representation.

Table 2: Prediction accuracy (%). Bold font highlights when ED-HNN significantly (difference in means > 0.5 std) outperforms all baselines; the best baselines are underlined. The training and testing times are measured on Walmart using the same server with one NVIDIA RTX A6000 GPU.

Homophilic hypergraphs:

| | Cora | Citeseer | Pubmed | Cora-CA | DBLP-CA | Training Time (10^1 s) |
|---|---|---|---|---|---|---|
| HGNN | 79.39 ± 1.36 | 72.45 ± 1.16 | 86.44 ± 0.44 | 82.64 ± 1.65 | 91.03 ± 0.20 | 0.24 ± 0.51 |
| HCHA | 79.14 ± 1.02 | 72.42 ± 1.42 | 86.41 ± 0.36 | 82.55 ± 0.97 | 90.92 ± 0.22 | 0.24 ± 0.01 |
| HNHN | 76.36 ± 1.92 | 72.64 ± 1.57 | 86.90 ± 0.30 | 77.19 ± 1.49 | 86.78 ± 0.29 | 0.30 ± 0.56 |
| HyperGCN | 78.45 ± 1.26 | 71.28 ± 0.82 | 82.84 ± 8.67 | 79.48 ± 2.08 | 89.38 ± 0.25 | 0.42 ± 1.51 |
| UniGCNII | 78.81 ± 1.05 | 73.05 ± 2.21 | 88.25 ± 0.40 | 83.60 ± 1.14 | 91.69 ± 0.19 | 4.36 ± 1.18 |
| HyperND | 79.20 ± 1.14 | 72.62 ± 1.49 | 86.68 ± 0.43 | 80.62 ± 1.32 | 90.35 ± 0.26 | 0.15 ± 1.32 |
| AllDeepSets | 76.88 ± 1.80 | 70.83 ± 1.63 | 88.75 ± 0.33 | 81.97 ± 1.50 | 91.27 ± 0.27 | 1.23 ± 1.09 |
| AllSetTransformer | 78.58 ± 1.47 | 73.08 ± 1.20 | 88.72 ± 0.37 | 83.63 ± 1.47 | 91.53 ± 0.23 | 1.64 ± 1.63 |
| ED-HNN (ours) | 80.31 ± 1.35 | 73.70 ± 1.38 | 89.03 ± 0.53 | 83.97 ± 1.55 | 91.90 ± 0.19 | 1.71 ± 1.13 |

Heterophilic hypergraphs:

| | Congress | Senate | Walmart | House | Avg. Rank | Inference Time (10^-2 s) |
|---|---|---|---|---|---|---|
| HGNN | 91.26 ± 1.15 | 48.59 ± 4.52 | 62.00 ± 0.24 | 61.39 ± 2.96 | 5.22 | 1.01 ± 0.04 |
| HCHA | 90.43 ± 1.20 | 48.62 ± 4.41 | 62.35 ± 0.26 | 61.36 ± 2.53 | 5.89 | 1.54 ± 0.18 |
| HNHN | 53.35 ± 1.45 | 50.93 ± 6.33 | 47.18 ± 0.35 | 67.80 ± 2.59 | 6.67 | 6.11 ± 0.05 |
| HyperGCN | 55.12 ± 1.96 | 42.45 ± 3.67 | 44.74 ± 2.81 | 48.32 ± 2.93 | 8.22 | 0.87 ± 0.06 |
| UniGCNII | 94.81 ± 0.81 | 49.30 ± 4.25 | 54.45 ± 0.37 | 67.25 ± 2.57 | 3.89 | 21.22 ± 0.13 |
| HyperND | 74.63 ± 3.62 | 52.82 ± 3.20 | 38.10 ± 3.86 | 51.70 ± 3.37 | 6.00 | 0.02 ± 0.01 |
| AllDeepSets | 91.80 ± 1.53 | 48.17 ± 5.67 | 64.55 ± 0.33 | 67.82 ± 2.40 | 5.22 | 5.35 ± 0.33 |
| AllSetTransformer | 92.16 ± 1.05 | 51.83 ± 5.22 | 65.46 ± 0.25 | 69.33 ± 2.20 | 2.88 | 6.06 ± 0.67 |
| ED-HNN (ours) | 95.00 ± 0.99 | 64.79 ± 5.14 | 66.91 ± 0.41 | 72.45 ± 2.28 | 1.00 | 5.87 ± 0.36 |
Compared with HyperND, the state-of-the-art hypergraph diffusion algorithm, ED-HNN achieves superior performance on every dataset. HyperND provably converges for the divergence-to-the-mean hyperedge potential (see Example 1), which gives it good performance on homophilic networks but only subpar performance on heterophilic datasets. Regarding computational efficiency, we report the wall-clock times for training (100 epochs) and testing of all models over the largest hypergraph, Walmart, where all models use the same hyperparameters that achieve the reported performance. Our model achieves efficiency comparable to the AllSet models (Chien et al., 2022) and is much faster than UniGCNII. So, the implementation of equivariant computation in ED-HNN is still efficient.

4.2 RESULTS ON SYNTHETIC HETEROPHILIC HYPERGRAPH DATASET

Experiment Setting. As discussed, ED-HNN is expected to perform well on heterophilic hypergraphs. We evaluate this point by using synthetic datasets with controlled heterophily. We generate data using a contextual hypergraph stochastic block model (Deshpande et al., 2018; Ghoshdastidar & Dukkipati, 2014; Lin & Wang, 2018). Specifically, we draw two classes of 2,500 nodes each and then randomly sample 1,000 hyperedges. Each hyperedge consists of 15 nodes, among which $\alpha_i$ are sampled from class $i$. We use $\alpha = \min\{\alpha_1, \alpha_2\}$ to denote the heterophily level. Afterwards, we generate label-dependent Gaussian node features with standard deviation 1.0. We test both homophilic ($\alpha = 1, 2$, i.e., CE homophily > 0.7) and heterophilic ($\alpha = 4$–$7$, i.e., CE homophily < 0.7) cases. We compare ED-HNN with HGNN, the AllSet models and their variants with jump links. We follow the previous 50%/25%/25% data splitting method and repeat the experiment 10 times.
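A minimal sketch of the synthetic generator described above (2 classes of 2,500 nodes each, 1,000 hyperedges of 15 nodes, $\alpha$ nodes drawn from one class per hyperedge, and Gaussian features with standard deviation 1.0); sampling without replacement within a hyperedge and the feature dimension are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_hypergraph(alpha, n_per_class=2500, n_edges=1000, edge_size=15, dim=16):
    """Contextual hypergraph stochastic block model with controlled heterophily
    alpha = min{alpha_1, alpha_2}: each hyperedge takes `alpha` nodes from one class
    and `edge_size - alpha` nodes from the other."""
    labels = np.repeat([0, 1], n_per_class)
    class_idx = [np.where(labels == c)[0] for c in (0, 1)]
    hyperedges = []
    for _ in range(n_edges):
        minority = rng.integers(0, 2)                      # which class contributes alpha nodes
        part_a = rng.choice(class_idx[minority], alpha, replace=False)
        part_b = rng.choice(class_idx[1 - minority], edge_size - alpha, replace=False)
        hyperedges.append(np.concatenate([part_a, part_b]))
    means = rng.normal(size=(2, dim))
    X = means[labels] + rng.normal(scale=1.0, size=(labels.size, dim))   # label-dependent features
    return X, labels, hyperedges

X, y, E = synthetic_hypergraph(alpha=4)                    # a heterophilic setting
```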
Results. Table 3 shows the results. On the homophilic datasets, all the models achieve good results, while ED-HNN remains slightly better than the others. Once $\alpha$ surpasses 3, i.e., entering the heterophilic regime, the superiority of ED-HNN becomes more obvious. The jump-link trick indeed also helps, while building equivariance directly, as ED-HNN does, provides a more significant improvement.

Table 3: Prediction accuracy (%) over synthetic hypergraphs with controlled heterophily $\alpha$ ($\alpha = 1, 2$: homophily; larger $\alpha$: heterophily).

| | α = 1 | α = 2 | α = 3 | α = 4 | α = 6 | α = 7 |
|---|---|---|---|---|---|---|
| HGNN | 99.86 ± 0.05 | 90.42 ± 1.14 | 66.60 ± 1.60 | 57.90 ± 1.43 | 50.77 ± 1.94 | 48.68 ± 1.21 |
| HGNN + Jump Link | 99.29 ± 0.19 | 92.61 ± 0.97 | 79.59 ± 2.32 | 64.39 ± 1.55 | 55.39 ± 2.01 | 51.14 ± 2.05 |
| AllDeepSets | 99.86 ± 0.07 | 99.13 ± 0.31 | 93.52 ± 1.31 | 83.45 ± 1.42 | 65.38 ± 1.27 | 70.94 ± 0.14 |
| AllDeepSets + Jump Link | 99.95 ± 0.07 | 99.08 ± 0.24 | 96.61 ± 1.45 | 86.45 ± 1.30 | 74.08 ± 1.02 | 71.52 ± 0.94 |
| AllSetTransformer | 99.99 ± 0.01 | 99.81 ± 0.02 | 98.38 ± 0.81 | 96.83 ± 0.03 | 78.25 ± 0.16 | 74.68 ± 0.09 |
| ED-HNN (ours) | 99.96 ± 0.01 | 99.87 ± 0.02 | 98.85 ± 0.25 | 98.45 ± 0.23 | 80.48 ± 0.70 | 79.71 ± 0.27 |

4.3 BENEFITS IN DEEPENING HYPERGRAPH NEURAL NETWORKS

We also demonstrate that, by using diffusion models and parameter tying, ED-HNN can benefit from deeper architectures, while other HNNs cannot. Fig. 2 illustrates the performance of different models versus the number of network layers. We compare with HGNN, the AllSet models, and UniGCNII. UniGCNII inherits from (Chen et al., 2020), which is known to be effective in counteracting oversmoothing. The results reveal that the AllSet models suffer from going deep. HGNN, which works through a more lightweight mechanism, has better tolerance to depth. However, none of them can benefit from deepening. On the contrary, ED-HNN successfully leverages deeper architectures to achieve higher accuracy. For example, adding more layers boosts ED-HNN by 1% in accuracy on Pubmed and House, while elevating ED-HNN from 58.01% to 64.79% on the Senate dataset.

Figure 2: Comparing deep architectures across different models: test accuracy vs. number of layers (1 to 32) for ED-HNN, AllDeepSets, AllSetTransformer, UniGCNII, and HGNN.

4.4 EXPRESSIVENESS JUSTIFICATION ON THE SYNTHETIC DIFFUSION DATASET

We now evaluate the ability of ED-HNN to express a given hypergraph diffusion. We generate semi-synthetic diffusion data using the Senate hypergraph (Chodrow et al., 2021) and synthetic node features. The data consist of 1,000 pairs $(H^{(0)}, H^{(1)})$. The initial node features $H^{(0)}$ are sampled from 1-dim Gaussian distributions. To obtain $H^{(1)}$, we apply the gradient step in Eq. 2. For non-differentiable cases, we adopt subgradients for convenient computation. We fix the node potentials as $f(h_v; x_v) = (h_v - x_v)^2$ and consider 3 different hyperedge potentials in Example 1 with varying complexities: a) CE, b) TV ($p = 2$), and c) LEC. The goal is to let one-layer models ($V \to E \to V$) recover $H^{(1)}$. We compare ED-HNN with our implemented baseline (Invariant), which uses parameterized invariant set functions on both the node and hyperedge sides, and with the AllSet models (Chien et al., 2022) that also adopt invariant set functions. We keep the scale of all models almost the same for a fair comparison; more details are given in Appendix F.5.

The results are reported in Fig. 3. The CE case gives an invariant diffusion, so all models can learn it well. The TV case most clearly shows the benefit of the equivariant architecture, where the error of ED-HNN barely increases even when the latent dimension decreases to 32. The LEC case is challenging for all models, though ED-HNN is still the best. The reason, we think, is that learning the sorting operation in LEC via the sum pooling in Eq. 5 is empirically challenging, albeit theoretically doable. A similar phenomenon has been observed in previous literature (Murphy et al., 2018; Wagstaff et al., 2019).

Figure 3: Comparing the power to represent known diffusions: MAE vs. latent dimension (256 down to 32) for the Clique Expansion, Total Variation ($p = 2$), and Lovász Extension (cardinality-based) potentials.

5 CONCLUSION

This work introduces a new hypergraph neural network, ED-HNN, that can model hypergraph diffusion processes. We show that any hypergraph diffusion with permutation-invariant potential functions can be represented by iterating equivariant diffusion operators. ED-HNN provides an efficient way to model such operators based on commonly-used GNN platforms. ED-HNN shows superiority in processing heterophilic hypergraphs and constructing deep models. For future work, ED-HNN can be applied to other tasks such as regression or to develop counterpart implicit models.

ACKNOWLEDGMENTS

We would like to express our deepest appreciation to Dr. David Gleich and Dr. Kimon Fountoulakis for the insightful discussion on hypergraph computation, and Dr. Eli Chien for the constructive advice on the experiments.

REFERENCES

Sameer Agarwal, Kristin Branson, and Serge Belongie. Higher order learning with graphs. In International Conference on Machine Learning (ICML), 2006. I. Amburg, N. Veldt, and A. R. Benson. Clustering in graphs and hypergraphs with categorical edge labels. In Proceedings of The Web Conference, 2020. Devanshu Arya, Deepak K Gupta, Stevan Rudinac, and Marcel Worring. Hypersage: Generalizing inductive representation learning on hypergraphs. arXiv preprint arXiv:2010.04558, 2020. Song Bai, Feihu Zhang, and Philip HS Torr. Hypergraph convolution and hypergraph attention. Pattern Recognition, 110:107637, 2021. Federico Battiston, Giulia Cencetti, Iacopo Iacopini, Vito Latora, Maxime Lucas, Alice Patania, Jean-Gabriel Young, and Giovanni Petri. Networks beyond pairwise interactions: structure and dynamics. Physics Reports, 2020.
Austin R Benson, David F Gleich, and Jure Leskovec. Higher-order organization of complex networks. Science, 2016. Austin R Benson, David F Gleich, and Desmond J Higham. Higher-order network analysis takes off, fueled by classical ideas and new data. ar Xiv preprint ar Xiv:2103.05031, 2021. Claude Berge. Hypergraphs: combinatorics of finite sets, volume 45. Elsevier, 1984. Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine learning, 2011. Ben Chamberlain, James Rowbottom, Maria I Gorinova, Michael Bronstein, Stefan Webb, and Emanuele Rossi. Grand: Graph neural diffusion. In International Conference on Machine Learning, pp. 1407 1418. PMLR, 2021. S. H. Chan, X. Wang, and O. A. Elgendy. Plug-and-play admm for image restoration: Fixed-point convergence and applications. IEEE Transactions on Computational Imaging, 3(1):84 98, 2016. T-H Hubert Chan and Zhibin Liang. Generalizing the hypergraph laplacian via a diffusion process with mediators. Theoretical Computer Science, 2020. T-H Hubert Chan, Anand Louis, Zhihao Gavin Tang, and Chenzi Zhang. Spectral properties of hypergraph laplacian and approximation algorithms. Journal of the ACM (JACM), 2018. J. H. R. Chang, C.-L. Li, B. Poczos, B. V. Kumar, and A. C. Sankaranarayanan. One network to solve them all solving linear inverse problems using deep projection models. In International Conference on Computer Vision (ICCV), 2017. Siheng Chen, Yonina C Eldar, and Lingxiao Zhao. Graph unrolling networks: Interpretable neural networks for graph signal denoising. IEEE Transactions on Signal Processing, 69:3699 3713, 2021a. Tianlong Chen, Xiaohan Chen, Wuyang Chen, Howard Heaton, Jialin Liu, Zhangyang Wang, and Wotao Yin. Learning to optimize: A primer and a benchmark. ar Xiv preprint ar Xiv:2103.12828, 2021b. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), 2020. Published as a conference paper at ICLR 2023 X. Chen, J. Liu, Z. Wang, and W. Yin. Theoretical linear convergence of unfolded ista and its practical weights and thresholds. 2018. Yunjin Chen and Thomas Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE transactions on pattern analysis and machine intelligence, 39(6):1256 1272, 2016. Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations (ICLR), 2021. Eli Chien, Chao Pan, Jianhao Peng, and Olgica Milenkovic. You are allset: A multiset function framework for hypergraph neural networks. In International Conference on Learning Representations, 2022. I Eli Chien, Huozhi Zhou, and Pan Li. hs2: Active learning over hypergraphs with pointwise and pairwise queries. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019. Uthsav Chitra and Benjamin Raphael. Random walks on hypergraphs with edge-dependent vertex weights. In International Conference on Machine Learning (ICML), 2019. Philip S Chodrow, Nate Veldt, and Austin R Benson. Hypergraph clustering: from blockmodels to modularity. Science Advances, 2021. Weilin Cong, Morteza Ramezani, and Mehrdad Mahdavi. On provable benefits of depth in training graph convolutional networks. 
Advances in Neural Information Processing Systems (Neur IPS), 2021. H. Dai, Z. Kozareva, B. Dai, A. Smola, and L. Song. Learning steady-states of iterative algorithms over graphs. In International Conference on Machine Learning (ICML), 2018. Y. Deshpande, S. Sen, A. Montanari, and E. Mossel. Contextual stochastic block models. In Advances in Neural Information Processing Systems (Neur IPS), 2018. Yihe Dong, Will Sawin, and Yoshua Bengio. Hnhn: Hypergraph networks with hyperedge neurons. ar Xiv preprint ar Xiv:2006.12278, 2020. Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao. Hypergraph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3558 3565, 2019. Matthias Fey and Jan E. Lenssen. Fast graph representation learning with Py Torch Geometric. In International Conference on Learning Representations (ICLR) Workshop on Representation Learning on Graphs and Manifolds, 2019. Kimon Fountoulakis, Pan Li, and Shenghao Yang. Local hyper-flow diffusion. Advances in Neural Information Processing Systems, 34, 2021. J. H. Fowler. Connecting the congress: A study of cosponsorship networks. Political Analysis, 14 (4):456 487, 2006a. J. H. Fowler. Legislative cosponsorship networks in the us house and senate. Social networks, 28 (4):454 465, 2006b. D. Ghoshdastidar and A. Dukkipati. Consistency of spectral partitioning of uniform hypergraphs under planted partition model. Advances in Neural Information Processing Systems (Neur IPS), 2014. Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning (ICML), 2017. K. Gregor and Y. Le Cun. Learning fast approximations of sparse coding. In International Conference on Machine Learning (ICML), 2010. Published as a conference paper at ICLR 2023 F. Gu, H. Chang, W. Zhu, S. Sojoudi, and L El Ghaoui. Implicit graph neural networks. Advances in Neural Information Processing Systems (Neur IPS), 2020. William L Hamilton. Graph representation learning. 2020. Matthias Hein, Simon Setzer, Leonardo Jost, and Syama Sundar Rangapuram. The total variation on hypergraphs-learning on hypergraphs revisited. In Advances in Neural Information Processing Systems, pp. 2427 2435, 2013. J. R. Hershey, J. L. Roux, and F. Weninger. Deep unfolding: Model-based inspiration of novel deep architectures. ar Xiv preprint ar Xiv:1409.2574, 2014. Jing Huang and Jie Yang. Unignn: a unified framework for graph and hypergraph neural networks. ar Xiv preprint ar Xiv:2105.00956, 2021. Stefanie Jegelka, Francis Bach, and Suvrit Sra. Reflection methods for user-friendly submodular optimization. In Advances in Neural Information Processing Systems (Neur IPS), 2013. Jaehyeong Jo, Jinheon Baek, Seul Lee, Dongki Kim, Minki Kang, and Sung Ju Hwang. Edge representation learning with hypergraphs. In Advances in Neural Information Processing Systems (Neur IPS), 2021. Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017. Johannes Klicpera, Aleksandar Bojchevski, and Stephan G unnemann. Predict then propagate: Graph neural networks meet personalized pagerank. In International Conference on Learning Representations, 2019. Renaud Lambiotte, Martin Rosvall, and Ingo Scholtes. From networks to optimal higher-order models of complex systems. Nature physics, 2019. Geon Lee, Minyoung Choe, and Kijung Shin. 
How do hyperedges overlap in real-world hypergraphs?-patterns, measures, and generators. In Proceedings of the Web Conference 2021, pp. 3396 3407, 2021. Pan Li and Olgica Milenkovic. Inhomogeneous hypergraph clustering with applications. In Advances in Neural Information Processing Systems, pp. 2305 2315, 2017. Pan Li and Olgica Milenkovic. Submodular hypergraphs: p-laplacians, cheeger inequalities and spectral clustering. In International Conference on Machine Learning (ICML), 2018. Pan Li, Hoang Dau, Gregory Puleo, and Olgica Milenkovic. Motif clustering and overlapping clustering for social network analysis. In INFOCOM 2017-IEEE Conference on Computer Communications, IEEE, pp. 1 9. IEEE, 2017. Pan Li, Niao He, and Olgica Milenkovic. Quadratic decomposable submodular function minimization: Theory and practice. Journal of Machine Learning Research (JMLR), 2020a. Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI conference on artificial intelligence (AAAI), 2018. Yuelong Li, Mohammad Tofighi, Junyi Geng, Vishal Monga, and Yonina C Eldar. Efficient and interpretable deep blind image deblurring via algorithm unrolling. IEEE Transactions on Computational Imaging, 6:666 681, 2020b. Derek Lim, Felix Hohne, Xiuyu Li, Sijia Linda Huang, Vaishnavi Gupta, Omkar Bhalerao, and Ser Nam Lim. Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods. In Advances in Neural Information Processing Systems (Neur IPS), 2021. I. Chien C.-Y. Lin and I.-H. Wang. Community detection in hypergraphs: Optimal statistical limit and efficient algorithms. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2018. Published as a conference paper at ICLR 2023 J. Liu, K. Kawaguchi, B. Hooi, Y. Wang, and X. Xiao. Eignn: Efficient infinite-depth graph neural networks. Advances in Neural Information Processing Systems (Neur IPS), 2021a. Jialin Liu and Xiaohan Chen. Alista: Analytic weights are as good as learned weights in lista. In International Conference on Learning Representations (ICLR), 2019. Meng Liu, Nate Veldt, Haoyu Song, Pan Li, and David F Gleich. Strongly local hypergraph diffusions for clustering and semi-supervised learning. In Proceedings of the Web Conference 2021, pp. 2092 2103, 2021b. X. Liu, W. Jin, Y. Ma, Y. Li, H. Liu, Y. Wang, M. Yan, and J. Tang. Elastic graph neural networks. In International Conference on Machine Learning (ICML), 2021c. Ziwei Liu, Xiaoxiao Li, Ping Luo, Chen Change Loy, and Xiaoou Tang. Deep learning markov random field for semantic segmentation. IEEE transactions on pattern analysis and machine intelligence, 40(8):1814 1828, 2017. L aszl o Lov asz. Submodular functions and convexity. In Mathematical Programming The State of the Art. 1983. Shmoolik Mangan and Uri Alon. Structure and function of the feed-forward loop network motif. National Academy of Sciences, 2003. Tim Meinhardt, Michael Moller, Caner Hazirbas, and Daniel Cremers. Learning proximal operators: Using denoising networks for regularizing inverse imaging problems. In International Conference on Computer Vision (ICCV), 2017. Vishal Monga, Yuelong Li, and Yonina C Eldar. Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing. IEEE Signal Processing Magazine, 2021. R. L. Murphy, B. Srinivasan, V. Rao, and B. Ribeiro. Janossy pooling: Learning deep permutationinvariant functions for variable-size inputs. 
In International Conference on Learning Representations (ICLR), 2018. Leonie Neuh auser, Michael T Schaub, Andrew Mellor, and Renaud Lambiotte. Opinion dynamics with multi-body interactions. In International Conference on Network Games, Control and Optimization, 2021. Leonie Neuh auser, Renaud Lambiotte, and Michael T Schaub. Consensus dynamics and opinion formation on hypergraphs. In Higher-Order Systems. 2022. Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 2015. Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations (ICLR), 2019. Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in optimization, 2014. Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-gcn: Geometric graph convolutional networks. In International Conference on Learning Representations (ICLR), 2020. Nicholas G Polson, James G Scott, and Brandon T Willard. Proximal algorithms in statistics and machine learning. Statistical Science, 2015. Konstantin Prokopchik, Austin R Benson, and Francesco Tudisco. Nonlinear feature diffusion on hypergraphs. In International Conference on Machine Learning, pp. 17945 17958. PMLR, 2022. Everett M Rogers. Diffusion of innovations. 2010. E. Ryu, J. Liu, S. Wang, X. Chen, Z. Wang, and W. Yin. Plug-and-play methods provably converge with properly trained denoisers. In International Conference on Machine Learning (ICML), 2019. Published as a conference paper at ICLR 2023 Akiyoshi Sannai, Yuuki Takai, and Matthieu Cordonnier. Universal approximations of permutation invariant/equivariant functions by deep neural networks. ar Xiv preprint ar Xiv:1903.01939, 2019. Michael T Schaub, Yu Zhu, Jean-Baptiste Seby, T Mitchell Roddenberry, and Santiago Segarra. Signal processing on higher-order networks: Livin on the edge... and beyond. Signal Processing, 2021. Christian J Schuler, Michael Hirsch, Stefan Harmeling, and Bernhard Sch olkopf. Learning to deblur. IEEE transactions on pattern analysis and machine intelligence, 38(7):1439 1451, 2015. Nimrod Segol and Yaron Lipman. On universal equivariant set networks. In International Conference on Learning Representations (ICLR), 2020. Yuuki Takai, Atsushi Miyauchi, Masahiro Ikeda, and Yuichi Yoshida. Hypergraph clustering based on pagerank. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1970 1978, 2020. A. M. Teodoro, J. M. Bioucas-Dias, and M. A. T. Figueiredo. Scene-adapted plug-and-play algorithm with convergence guarantees. In International Workshop on Machine Learning for Signal Processing (MLSP), 2017. Matthew Thorpe, Tan Minh Nguyen, Hedi Xia, Thomas Strohmer, Andrea Bertozzi, Stanley Osher, and Bao Wang. Grand++: Graph neural diffusion with a source term. In International Conference on Learning Representations, 2022. Charalampos E Tsourakakis, Jakub Pachocki, and Michael Mitzenmacher. Scalable motif-aware graph clustering. In Proceedings of the Web Conference, pp. 1451 1460, 2017. Francesco Tudisco, Austin R Benson, and Konstantin Prokopchik. Nonlinear higher-order label spreading. In Proceedings of the Web Conference, 2021a. Francesco Tudisco, Konstantin Prokopchik, and Austin R Benson. A nonlinear diffusion method for semi-supervised learning on hypergraphs. ar Xiv preprint ar Xiv:2103.14867, 2021b. 
Nate Veldt, Austin R Benson, and Jon Kleinberg. Hypergraph cuts with general splitting functions. SIAM Review, 2021. Nate Veldt, Austin R Benson, and Jon Kleinberg. Higher-order homophily is combinatorially impossible. SIAM Review (SIREV), 2022. Edward Wagstaff, Fabian Fuchs, Martin Engelcke, Ingmar Posner, and Michael A Osborne. On the limitations of representing functions on sets. In International Conference on Machine Learning (ICML), 2019. Jianling Wang, Kaize Ding, Liangjie Hong, Huan Liu, and James Caverlee. Next-item recommendation with sequential hypergraphs. In Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval, pp. 1101 1110, 2020. Jianling Wang, Kaize Ding, Ziwei Zhu, and James Caverlee. Session-based recommendation with hypergraph attention networks. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), pp. 82 90. SIAM, 2021. Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. Deep graph library: A graph-centric, highly-performant package for graph neural networks. ar Xiv preprint ar Xiv:1909.01315, 2019. Nan Wang, Shoujin Wang, Yan Wang, Quan Z Sheng, and Mehmet A Orgun. Exploiting intra-and inter-session dependencies for session-based recommendations. World Wide Web, 25(1):425 443, 2022. Zhong-Qiu Wang, Jonathan Le Roux, De Liang Wang, and John R Hershey. End-to-end speech separation with unfolded iterative phase reconstruction. In Conference of the International Speech Communication Association (INTERSPEECH), 2018. Published as a conference paper at ICLR 2023 Xin Xia, Hongzhi Yin, Junliang Yu, Qinyong Wang, Lizhen Cui, and Xiangliang Zhang. Selfsupervised hypergraph convolutional networks for session-based recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 4503 4511, 2021. Bo Xin, Yizhou Wang, Wen Gao, David Wipf, and Baoyuan Wang. Maximal sparsity with deep networks? Advances in Neural Information Processing Systems, 29, 2016. Naganand Yadati. Neural message passing for multi-relational ordered and recursive hypergraphs. Advances in Neural Information Processing Systems, 33:3275 3289, 2020. Naganand Yadati, Madhav Nimishakavi, Prateek Yadav, Vikram Nitin, Anand Louis, and Partha Talukdar. Hypergcn: A new method for training graph convolutional networks on hypergraphs. Advances in neural information processing systems, 32, 2019. Chaoqi Yang, Ruijie Wang, Shuochao Yao, and Tarek Abdelzaher. Hypergraph learning with line expansion. ar Xiv preprint ar Xiv:2005.04843, 2020. Y. Yang, T. Liu, Y. Wang, J. Zhou, Q. Gan, Z. Wei, Z. Zhang, Z. Huang, and D. Wipf. Graph neural networks inspired by classical iterative algorithms. In International Conference on Machine Learning (ICML), 2021a. Yongyi Yang, Yangkun Wang, Zengfeng Huang, and David Wipf. Implicit vs unfolded graph neural networks. ar Xiv preprint ar Xiv:2111.06592, 2021b. Yuichi Yoshida. Cheeger inequalities for submodular transformations. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2582 2601. SIAM, 2019. Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems (Neur IPS), 2017. Chenzi Zhang, Shuguang Hu, Zhihao Gavin Tang, and TH Hubert Chan. 
Chenzi Zhang, Shuguang Hu, Zhihao Gavin Tang, and TH Hubert Chan. Re-revisiting learning on hypergraphs: Confidence interval and subgradient method. In International Conference on Machine Learning, pp. 4026–4034, 2017a.
Jiying Zhang, Fuyang Li, Xi Xiao, Tingyang Xu, Yu Rong, Junzhou Huang, and Yatao Bian. Hypergraph convolutional networks via equivalency between hypergraphs and undirected graphs. arXiv preprint arXiv:2203.16939, 2022.
K. Zhang, W. Zuo, S. Gu, and L. Zhang. Learning deep CNN denoiser prior for image restoration. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017b.
Songyang Zhang, Zhi Ding, and Shuguang Cui. Introducing hypergraph signal processing: Theoretical foundation and practical applications. IEEE Internet of Things Journal, 2019.
Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision (ICCV), 2015.
Denny Zhou, Jiayuan Huang, and Bernhard Schölkopf. Learning with hypergraphs: Clustering, classification, and embedding. In Advances in Neural Information Processing Systems, pp. 1601–1608, 2007.
Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. Beyond homophily in graph neural networks: Current limitations and effective designs. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Xiaojin Zhu. Semi-supervised learning with graphs. Carnegie Mellon University, 2005.
Jason Y Zien, Martine DF Schlag, and Pak K Chan. Multilevel spectral hypergraph partitioning with arbitrary vertex sizes. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 1999.

A DERIVATION OF EQ. 3 AND 4

We derive the iterative updates in Eq. 3 and 4 as follows. Recall that our problem is defined as:
$$\min_{H} \sum_{v \in V} f(h_v; x_v) + \sum_{e \in E} g_e(H_e),$$
where H ∈ R^N is the node feature matrix and H_e = [h_{e,1}, …, h_{e,|e|}]^⊤ ∈ R^{|e|} collects the associated node features inside edge e, with h_{e,i} the feature of the i-th node in edge e. We use x_v to denote the (initial) attribute of node v. Here, neither f(·; x_v) nor g_e(·) is necessarily continuous. To solve such a problem, we borrow the idea of ADMM (Boyd et al., 2011). We introduce an auxiliary variable R_e = H_e for every e ∈ E and reformulate the problem as a constrained optimization problem:
$$\min_{H, \{R_e\}_{e \in E}} \sum_{v \in V} f(h_v; x_v) + \sum_{e \in E} g(R_e) \quad \text{subject to} \quad R_e - H_e = 0, \ \forall e \in E.$$
By the Augmented Lagrangian Method (ALM), we assign a Lagrangian multiplier S_e (scaled by 1/λ) to each edge, and the objective function becomes:
$$\max_{\{S_e\}_{e \in E}} \min_{H, \{R_e\}_{e \in E}} \sum_{v \in V} f(h_v; x_v) + \sum_{e \in E} g(R_e) + \frac{\lambda}{2} \sum_{e \in E} \|R_e - H_e + S_e\|_F^2 - \frac{\lambda}{2} \sum_{e \in E} \|S_e\|_F^2. \quad (6)$$
We can iterate the following primal-dual steps to optimize Eq. 6. The primal step can be computed in a block-wise sense:
$$R_e^{(t+1)} \leftarrow \arg\min_{R_e} g(R_e) + \frac{\lambda}{2}\big\|R_e - H_e^{(t)} + S_e^{(t)}\big\|_F^2 = \mathrm{prox}_{g/\lambda}\big(H_e^{(t)} - S_e^{(t)}\big), \quad \forall e \in E, \quad (7)$$
$$h_v^{(t+1)} \leftarrow \arg\min_{h_v} f(h_v; x_v) + \frac{\lambda}{2}\sum_{e: v \in e}\big\|h_v - S_{e,v}^{(t)} - R_{e,v}^{(t+1)}\big\|_F^2 = \mathrm{prox}_{f(\cdot; x_v)/(\lambda d_v)}\Big(\frac{\sum_{e: v \in e}\big(S_{e,v}^{(t)} + R_{e,v}^{(t+1)}\big)}{d_v}\Big), \quad \forall v \in V. \quad (8)$$
The dual step can be computed as:
$$S_e^{(t+1)} \leftarrow S_e^{(t)} + R_e^{(t+1)} - H_e^{(t+1)}, \quad \forall e \in E. \quad (9)$$
Denote Q_e^{(t+1)} := S_e^{(t)} + R_e^{(t+1)} and η := 1/λ; then the iterative updates become:
$$Q_e^{(t+1)} = \mathrm{prox}_{\eta g}\big(2H_e^{(t)} - Q_e^{(t)}\big) + Q_e^{(t)} - H_e^{(t)}, \quad \text{for } e \in E, \quad (10)$$
$$h_v^{(t+1)} = \mathrm{prox}_{\eta f(\cdot; x_v)/d_v}\Big(\frac{\sum_{e: v \in e} Q_{e,v}^{(t+1)}}{d_v}\Big), \quad \text{for } v \in V. \quad (11)$$
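To make the updates concrete, the following minimal NumPy sketch (our illustration, not the released ED-HNN code) runs one pass of Eq. 10–11. The incidence-list format and the `prox_g` / `prox_f` callables are assumptions; any concrete hyperedge and node potentials would supply their own proximal maps.

```python
# A minimal sketch of one ADMM-style diffusion iteration (Eq. 10-11), assuming
# scalar node features, user-supplied proximal operators, and that every node
# belongs to at least one hyperedge.
import numpy as np

def diffusion_step(h, Q, edges, prox_g, prox_f, x, eta):
    """h:      (N,) current node features H^(t)
       Q:      dict e -> (|e|,) auxiliary variables Q_e^(t)
       edges:  dict e -> integer array of node indices in hyperedge e
       prox_g(z, eta):        proximal map of the hyperedge potential g
       prox_f(z, x_v, step):  proximal map of the node potential f(.; x_v)"""
    N = h.shape[0]
    deg = np.zeros(N)          # node degrees d_v
    agg = np.zeros(N)          # sum_{e: v in e} Q_{e,v}^(t+1)
    Q_new = {}
    for e, nodes in edges.items():
        H_e = h[nodes]                                        # gather H_e
        Q_new[e] = prox_g(2.0 * H_e - Q[e], eta) + Q[e] - H_e # Eq. 10
        agg[nodes] += Q_new[e]
        deg[nodes] += 1.0
    # Eq. 11: prox of f(.; x_v) with step eta / d_v, applied to the degree-normalized aggregate.
    h_new = np.array([prox_f(agg[v] / deg[v], x[v], eta / deg[v]) for v in range(N)])
    return h_new, Q_new
```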
B DEFERRED PROOFS

B.1 PROOF OF PROPOSITION 2

Proof. Let π: [K] → [K] be the index mapping associated with the permutation matrix P ∈ Π(K), such that PZ = [z_{π(1)}, …, z_{π(K)}]^⊤. To prove that ∇g(·) is permutation equivariant, we use the definition of partial derivatives. For any Z = [z_1, …, z_K]^⊤ and any permutation π, we have:
$$[\nabla g(PZ)]_i = \lim_{\delta \to 0} \frac{g(z_{\pi(1)}, \ldots, z_{\pi(i)}+\delta, \ldots, z_{\pi(K)}) - g(z_{\pi(1)}, \ldots, z_{\pi(i)}, \ldots, z_{\pi(K)})}{\delta} = \lim_{\delta \to 0} \frac{g(z_1, \ldots, z_{\pi(i)}+\delta, \ldots, z_K) - g(z_1, \ldots, z_{\pi(i)}, \ldots, z_K)}{\delta} = [\nabla g(Z)]_{\pi(i)}, \quad (12)$$
where the second equality in Eq. 12 is due to the permutation invariance of g(·).

To prove Proposition 2 for the proximal operator, we first define H* = prox_g(Z) = argmin_H g(H) + (1/2)‖H − Z‖²_F for some Z. For an arbitrary permutation matrix P ∈ Π(K), we have
$$\mathrm{prox}_g(PZ) = \arg\min_H g(H) + \tfrac{1}{2}\|H - PZ\|_F^2 = \arg\min_H g(H) + \tfrac{1}{2}\|P(P^\top H - Z)\|_F^2 = \arg\min_H g(H) + \tfrac{1}{2}\|P^\top H - Z\|_F^2 = \arg\min_H g(P^\top H) + \tfrac{1}{2}\|P^\top H - Z\|_F^2, \quad (13)$$
where the last equality in Eq. 13 is due to the permutation invariance of g(·). Substituting H̃ = P^⊤H, the last problem becomes min_{H̃} g(H̃) + (1/2)‖H̃ − Z‖²_F, whose minimizer is prox_g(Z). Hence prox_g(PZ) = P·prox_g(Z), i.e., prox_g(·) is permutation equivariant.

B.2 PROOF OF THEOREM 1

Proof. To prove Theorem 1, we first summarize one of our key results in the following Lemma 2.

Lemma 2. ψ(·): [0,1]^K → R^K is a permutation-equivariant function if and only if there is a function ρ(·): [0,1]^K → R that is permutation invariant to its last K−1 entries, such that [ψ(Z)]_i = ρ(z_i, z_{i+1}, …, z_K, z_1, …, z_{i−1}) for any i, where the last K−1 arguments are listed in cyclic order starting from z_{i+1}.

Proof. (Sufficiency) Let π: [K] → [K] be the index mapping associated with the permutation matrix P ∈ Π(K) such that PZ = [z_{π(1)}, …, z_{π(K)}]^⊤. Then [ψ(z_{π(1)}, …, z_{π(K)})]_i = ρ(z_{π(i)}, z_{π(i+1)}, …, z_{π(i−1)}). Since ρ(·) is invariant to its last K−1 entries, we may reorder them into cyclic order and obtain [ψ(PZ)]_i = ρ(z_{π(i)}, z_{π(i)+1}, …, z_K, …, z_{π(i)−1}) = [ψ(Z)]_{π(i)}.

(Necessity) Given a permutation-equivariant function ψ: [0,1]^K → R^K, we first expand it in the form [ψ(Z)]_i = ρ_i(z_1, …, z_K). Permutation equivariance means ρ_{π(i)}(z_1, …, z_K) = ρ_i(z_{π(1)}, …, z_{π(K)}). Given an index i, consider any permutation π: [K] → [K] with π(i) = i. Then we have ρ_i(z_1, …, z_i, …, z_K) = ρ_{π(i)}(z_1, …, z_i, …, z_K) = ρ_i(z_{π(1)}, …, z_i, …, z_{π(K)}), which implies that ρ_i: R^K → R must be invariant to the K−1 elements other than the i-th element. Now, consider a permutation π with π(1) = i. Then ρ_i(z_1, z_2, …, z_K) = ρ_{π(1)}(z_1, z_2, …, z_K) = ρ_1(z_{π(1)}, z_{π(2)}, …, z_{π(K)}) = ρ_1(z_i, z_{i+1}, …, z_K, …, z_{i−1}), where the last equality is due to our previous argument. This implies two results. First, for all i, ρ_i(z_1, …, z_i, …, z_K) can be written in terms of ρ_1(z_i, z_{i+1}, …, z_K, …, z_{i−1}). Moreover, ρ_1 is permutation invariant to its last K−1 entries. Therefore, we just need to set ρ = ρ_1 and broadcast it accordingly to all entries. We conclude the proof.

To proceed with the proof, we bring in the following mathematical tools (Zaheer et al., 2017):

Definition 5. Given a vector z = [z_1, …, z_K]^⊤ ∈ R^K, we define the power mapping φ_M: R → R^M as φ_M(z) = [z, z², …, z^M]^⊤ and the sum-of-power mapping Φ_M: R^K → R^M as Φ_M(z) = Σ_{i=1}^K φ_M(z_i), where M is the largest degree.

Lemma 3. Let X = {[z_1, …, z_K]^⊤ ∈ [0,1]^K : z_1 < z_2 < … < z_K}. Define the mapping φ: R → R^{K+1} as φ(z) = [z⁰, z¹, z², …, z^K]^⊤ and the mapping Φ: R^K → R^{K+1} as Φ(z) = Σ_{i=1}^K φ(z_i). Then Φ restricted on X, i.e., Φ: X → R^{K+1}, is a homeomorphism.

Proof. Proved in Lemma 6 of (Zaheer et al., 2017).

We note that Definition 5 is slightly different from the mappings defined in Lemma 3 (Zaheer et al., 2017), as it removes the constant (zero-order) term. Combining Lemma 3 (Zaheer et al., 2017) with the results in (Wagstaff et al., 2019), we have the following result:

Lemma 4. Let X = {[z_1, …, z_K]^⊤ ∈ [0,1]^K : z_1 < z_2 < … < z_K}. Then, if M ≥ K, there exists a homeomorphism Φ_M: X → R^M such that Φ_M(z) = Σ_{i=1}^K φ_M(z_i), where φ_M: R → R^M.

Proof.
For M = K, we choose φ_K and Φ_K to be the power mapping and the sum-of-power mapping with largest degree K defined in Definition 5. We note that Φ(z) = [K, Φ_K(z)^⊤]^⊤. Since K is a constant, there exists a homeomorphism between the images of Φ(z) and Φ_K(z). By Lemma 3, Φ: X → R^{K+1} is a homeomorphism, which implies that Φ_K: X → R^K is also a homeomorphism.

For M > K, we first pad every input z ∈ X with a constant k > 1 to obtain an M-dimensional ẑ ∈ R^M. Note that padding is homeomorphic since k is a constant. All such ẑ form a subset X′ ⊆ {[z_1, …, z_M]^⊤ ∈ [0,k]^M : z_1 < z_2 < … < z_M}. We choose φ̂_M: R → R^M to be the power mapping and Φ̂_M: X′ → R^M to be the sum-of-power mapping restricted on X′, respectively. Following (Wagstaff et al., 2019), we construct Φ_M(z) as below:
$$\hat{\Phi}_M(\hat{z}) = \sum_{i=1}^{K}\hat{\phi}_M(z_i) + \sum_{i=K+1}^{M}\hat{\phi}_M(k) = \sum_{i=1}^{K}\hat{\phi}_M(z_i) + \sum_{i=1}^{M-K}\hat{\phi}_M(k) \quad (14)$$
$$= \sum_{i=1}^{K}\big(\hat{\phi}_M(z_i) - \hat{\phi}_M(k)\big) + \sum_{i=1}^{M}\hat{\phi}_M(k) = \sum_{i=1}^{K}\big(\hat{\phi}_M(z_i) - \hat{\phi}_M(k)\big) + M\hat{\phi}_M(k), \quad (15)$$
which induces Σ_{i=1}^K (φ̂_M(z_i) − φ̂_M(k)) = Φ̂_M(ẑ) − M φ̂_M(k). Let φ_M(z) = φ̂_M(z) − φ̂_M(k) and Φ_M(z) = Σ_{i=1}^K φ_M(z_i). Since [0,k] is naturally homeomorphic to [0,1], by our argument for M = K, Φ̂_M: X′ → R^M is a homeomorphism. This implies that Φ_M: X → R^M is also a homeomorphism.

It is straightforward to show the sufficiency of Theorem 1 by verifying that, for an arbitrary permutation π: [K] → [K], [ψ(z_{π(1)}, …, z_{π(K)})]_i = ρ(z_{π(i)}, Σ_{j=1}^K φ(z_j)) = [ψ(z_1, …, z_K)]_{π(i)}.

With Lemmas 2 and 4, we can conclude the necessity of Theorem 1 by the following construction:
1. By Lemma 2, any permutation-equivariant function ψ(z_1, …, z_K) can be written as [ψ(·)]_i = τ(z_i, z_{i+1}, …, z_K, …, z_{i−1}) such that τ(·) is invariant to its last K−1 elements.
2. By Lemma 4, we know that there exists a homeomorphism Φ_K which is continuous, invertible, and has a continuous inverse. For arbitrary z ∈ [0,1]^K, the difference between z and Φ_K^{−1} ∘ Φ_K(z) is up to a permutation.
3. Since τ(·) is permutation invariant to its last K−1 elements, we can construct the function τ(z_i, z_{i+1}, …, z_K, …, z_{i−1}) = τ(z_i, Φ_{K−1}^{−1} ∘ Φ_{K−1}(z_{i+1}, …, z_K, …, z_{i−1})) = ρ′(z_i, Σ_{j≠i} φ_{K−1}(z_j)) = ρ(z_i, Σ_{j=1}^K φ_{K−1}(z_j)), where ρ′(x, y) = τ(x, Φ_{K−1}^{−1}(y)) and ρ(x, y) = ρ′(x, y − φ_{K−1}(x)). Since τ, Φ_{K−1}^{−1}, and φ_{K−1} are all continuous functions, their composition ρ is also continuous.

Remark 1. Sannai et al. (2019, Corollary 3.2) showed results similar to our Theorem 1. However, their exact form of an equivariant set function is written as ρ(z_i, Σ_{j≠i} φ(z_j)). We note the difference: the summation in Theorem 1 pools over all the nodes inside a hyperedge, while the summation in (Sannai et al., 2019, Corollary 3.2) pools over the elements other than the central node. To implement the latter form on hypergraphs, which have coupled sets, one has to maintain a message for each node-hyperedge pair, which is computationally prohibitive in practice. An easier proof can be obtained by combining the results in (Sannai et al., 2019) with our proof technique introduced in Step 3.
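To make the construction tangible, the following small NumPy sketch (our illustration, not part of the paper's code; the particular choices of ρ and φ are toy stand-ins) numerically checks the sufficiency direction of Theorem 1: any operator of the form [ψ(Z)]_i = ρ(z_i, Σ_j φ(z_j)) is permutation equivariant.

```python
# Numerical sanity check: psi(P z) == P psi(z) for an operator built as in Theorem 1.
import numpy as np

rng = np.random.default_rng(0)
K, M = 5, 5

def phi(z):                       # power mapping phi_M(z) = [z, z^2, ..., z^M]
    return np.array([z ** m for m in range(1, M + 1)])

def rho(z_i, pooled):             # an arbitrary continuous function of (z_i, pooled sum)
    return np.tanh(z_i + 0.1 * pooled.sum()) + z_i ** 2

def psi(z):                       # [psi(z)]_i = rho(z_i, sum_j phi(z_j))
    pooled = sum(phi(z_j) for z_j in z)
    return np.array([rho(z_i, pooled) for z_i in z])

z = rng.random(K)
perm = rng.permutation(K)
assert np.allclose(psi(z[perm]), psi(z)[perm])   # permutation equivariance holds
```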
[Figure 4 (panels: Hypergraph, Invariant Diffusion, Equivariant Diffusion; nodes a–e): Comparing invariant diffusion and equivariant diffusion. The direction of the arrows indicates the information flow. The same color of arrows denotes the same passed message, while different colors of arrows denote different passed messages.]

C ALIGNING ALGORITHM 1 WITH GD/ADMM UPDATES

ED-HNN can be regarded as an unfolding algorithm to optimize the objective in Eq. 1, where f and g are the node and edge potentials, respectively. Essentially, each iteration of ED-HNN simulates one step of equivariant hypergraph diffusion. Each iteration of ED-HNN has four steps. In the first step, ED-HNN point-wisely transforms the node features via an MLP ϕ. Next, ED-HNN sum-pools the node features onto the associated hyperedges as m_e = Σ_{v∈e} ϕ(h_v). The third step is crucial to achieve equivariance: each node sum-pools the hyperedge features after interacting its own feature with every connected hyperedge, Σ_{e: v∈e} ρ(h_v, m_e). The last step updates the feature information via another node-wise transformation φ. According to our Theorem 1, ρ(h_v, Σ_{u∈e} ϕ(h_u)) can represent any equivariant function. Therefore, the messages aggregated from hyperedges, ρ(h_v, m_e), can be learned to approximate the gradient ∇g in Eq. 2.

ED-HNN can well approximate the GD algorithm (Eq. 2), while it does not perfectly match ADMM unless we assume Q_e^{(t)} = H_e^{(t)} in Eq. 3, which we adopt for practical considerations. This assumption may reduce the performance of the hypergraph diffusion algorithm, while our model ED-HNN has already achieved good enough empirical performance. Tracking Q_e^{(t)} means recording the messages from E to V at every iteration, which is not supported by the current GNN platforms (Fey & Lenssen, 2019; Wang et al., 2019) and may consume more memory. We leave the study of algorithms that can track the update of Q_e^{(t)} as future work. A co-design of the algorithm and the system may be needed to guarantee the scalability of the algorithm.

D INVARIANT DIFFUSION VERSUS EQUIVARIANT DIFFUSION

Although our Proposition 2 states that the hypergraph diffusion operator should be inherently equivariant, many existing frameworks design it to be invariant. In invariant diffusion, the hypergraph diffusion operator ∇g(·) or prox_{ηg}(·) corresponds to a permutation-invariant function rather than a permutation-equivariant function (see Definition 4). Mathematically, the operator takes the form 1_{|e|} ĝ(H_e)^⊤ for some invariant function ĝ: R^{|e|×F} → R^F. In contrast to equivariant diffusion, the invariant diffusion operator summarizes the node information into one feature vector and passes this identical message to all nodes uniformly. We provide Fig. 4 to illustrate the key difference between invariant diffusion and equivariant diffusion. When implemented with Deep Sets (Zaheer et al., 2017), an invariant diffusion has the following message passing:

Example: Invariant Diffusion by Deep Sets
Initialization: H^(0) = X. For t = 0, 1, 2, ..., L−1, do:
1. Design the messages from V to E: m^(t)_{u→e} = ϕ̂(h^(t)_u), for all u ∈ V.
2. Sum the V→E messages over hyperedges: m^(t)_e = Σ_{u∈e} m^(t)_{u→e}, for all e ∈ E.
3. Broadcast m^(t)_e and design the messages from E to V: m^(t)_{e→v} = ρ̂(m^(t)_e), for all v ∈ e.
4. Update h^(t+1)_v = φ̂(h^(t)_v, Σ_{e: v∈e} m^(t)_{e→v}, x_v, d_v), for all v ∈ V.

The difference in the third step is worth noting: ρ̂ is independent of the target node features, so each hyperedge passes an identical message to every node it contains. HyperGCN (Yadati et al., 2019), HGNN (Feng et al., 2019), and other GCNs running on clique expansions can all be reduced to invariant diffusion.
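The following PyTorch sketch (our illustration, not the released ED-HNN code; the incidence-list format with parallel node_idx/edge_idx index tensors, the MLP widths, and the omission of x_v and d_v in the update are assumptions) contrasts the two E→V messages: the invariant ρ̂(m_e) above versus the equivariant ρ̂(h_v, m_e) used by ED-HNN.

```python
# Minimal contrast between invariant and equivariant hyperedge-to-node messages
# on the star expansion (bipartite) representation of a hypergraph.
import torch
import torch.nn as nn

class ToyDiffusionLayer(nn.Module):
    def __init__(self, dim, equivariant=True):
        super().__init__()
        self.equivariant = equivariant
        self.phi = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        in_rho = 2 * dim if equivariant else dim
        self.rho = nn.Sequential(nn.Linear(in_rho, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.upd = nn.Linear(2 * dim, dim)   # x_v and d_v omitted for brevity

    def forward(self, h, node_idx, edge_idx, num_edges):
        # node_idx, edge_idx: LongTensors listing each incidence pair (v, e).
        # Steps 1-2: V -> E messages, then sum-pool within each hyperedge.
        m_ve = self.phi(h)[node_idx]
        m_e = h.new_zeros(num_edges, h.size(1)).index_add_(0, edge_idx, m_ve)
        # Step 3: E -> V messages.
        if self.equivariant:    # ED-HNN: rho(h_v, m_e), a different message per target node
            m_ev = self.rho(torch.cat([h[node_idx], m_e[edge_idx]], dim=-1))
        else:                   # invariant Deep Sets: rho(m_e), identical for all v in e
            m_ev = self.rho(m_e)[edge_idx]
        agg = torch.zeros_like(h).index_add_(0, node_idx, m_ev)
        # Step 4: node-wise update.
        return self.upd(torch.cat([h, agg], dim=-1))
```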
E FURTHER DISCUSSION ON OTHER RELATED WORKS

Our ED-HNN is inspired by a line of works on optimization-inspired NNs. Optimization-inspired NNs are often praised for their interpretability (Monga et al., 2021; Chen et al., 2021b) and certified convergence under certain conditions on the learned operators (Ryu et al., 2019; Teodoro et al., 2017; Chan et al., 2016). Optimization-inspired NNs mainly focus on applications such as compressive sensing (Gregor & LeCun, 2010; Xin et al., 2016; Liu & Chen, 2019; Chen et al., 2018), speech processing (Hershey et al., 2014; Wang et al., 2018), image denoising (Zhang et al., 2017b; Meinhardt et al., 2017; Chang et al., 2017; Chen & Pock, 2016), partitioning (Zheng et al., 2015; Liu et al., 2017), deblurring (Schuler et al., 2015; Li et al., 2020b), and so on. Only a few recent works have applied optimization-inspired NNs to graphs, as discussed in Sec. 3.4, while our work is the first to study optimization-inspired NNs on hypergraphs.

Parallel to our hypergraph-diffusion-inspired HNNs, there are also models that represent diffusion on graphs. However, as graphs contain only pairwise relations, the model design is often much simpler than for hypergraphs. In particular, constructing an expressive operator for pairwise relations is trivial (corresponding to the case K = 2 in Theorem 1). These models unroll either learnable (Yang et al., 2021a; Chen et al., 2021a), fixed ℓ1-norm (Liu et al., 2021c), or fixed ℓ2-norm pairwise potentials (Klicpera et al., 2019) to formulate GNNs. Some works also view graph diffusion as an ODE (Chamberlain et al., 2021; Thorpe et al., 2022), which essentially corresponds to the gradient descent formulation in Eq. 2. Implicit GNNs (Dai et al., 2018; Liu et al., 2021a; Gu et al., 2020; Yang et al., 2021b) directly parameterize the optimum of an optimization problem; implicit methods are still missing for hypergraphs, which is a promising future direction.

F ADDITIONAL EXPERIMENTS AND IMPLEMENTATION DETAILS

F.1 EXPERIMENTS ON ED-HNNII

As described in Sec. 3.2, an extension of our ED-HNN is to treat the V→E message passing and the E→V message passing as two equivariant set functions. The detailed algorithm is illustrated in Algorithm 2, where Step 1 (highlighted by a red box in the original figure) marks the major difference from ED-HNN; a minimal sketch of this step is given at the end of this subsection.

Algorithm 2: ED-HNNII
Initialization: H^(0) = X and three MLPs ϕ̂, ρ̂, φ̂ (shared across L layers). For t = 0, 1, 2, ..., L−1, do:
1. Design the messages from V to E: m^(t)_{u→e} = ϕ̂(m^(t−1)_{u→e}, h^(t)_u), for all u ∈ V.
2. Sum the V→E messages over hyperedges: m^(t)_e = Σ_{u∈e} m^(t)_{u→e}, for all e ∈ E.
3. Broadcast m^(t)_e and design the messages from E to V: m^(t)_{e→v} = ρ̂(h^(t)_v, m^(t)_e), for all v ∈ e.
4. Update h^(t+1)_v = φ̂(h^(t)_v, Σ_{e: v∈e} m^(t)_{e→v}, x_v, d_v), for all v ∈ V.

In our tested datasets, hyperedges do not have initial attributes. In our implementation, we assign a common learnable vector to every hyperedge as the first-layer features m^(0)_{u→e}. The performance of ED-HNNII and its comparison with other baselines are presented in Table 4. Our finding is that ED-HNNII can outperform ED-HNN on datasets of relatively larger scale or with more heterophily, and the average accuracy improvement of ED-HNNII is around 0.5%. We argue that ED-HNNII inherently has a more complex computational mechanism and thus tends to overfit small datasets (e.g., Cora, Citeseer, etc.). Moreover, its superior performance on heterophilic datasets implies that injecting more equivariance benefits the handling of heterophilic data. From Table 4, the measured computational efficiency of ED-HNNII is also comparable to ED-HNN and the other baselines.
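Below is a minimal PyTorch sketch (our illustration; the tensor layout and the shared-initialization trick are stated assumptions, not the released code) of Step 1 in Algorithm 2: unlike ED-HNN, the V→E message of ED-HNNII also conditions on the previous-layer incidence message m^(t−1)_{u→e}, so one tensor of per-incidence messages is carried across layers.

```python
# Per-incidence V -> E message of ED-HNNII (Step 1 of Algorithm 2), plus the
# sum-pooling of Step 2. node_idx/edge_idx list each incidence pair (u, e).
import torch
import torch.nn as nn

class EDHNNIIMessageVE(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h, m_ve_prev, node_idx, edge_idx, num_edges):
        # Step 1: message conditioned on both the previous incidence message and h_u.
        m_ve = self.phi(torch.cat([m_ve_prev, h[node_idx]], dim=-1))
        # Step 2: sum-pool incidence messages into hyperedge features m_e.
        m_e = h.new_zeros(num_edges, h.size(1)).index_add_(0, edge_idx, m_ve)
        return m_ve, m_e   # m_ve is carried to the next layer as m^(t-1)

# At t = 0, m_ve_prev can be a single shared learnable vector broadcast to all
# incidences, matching the initialization described above.
```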
Table 4: Additional experiments and updated leaderboard with ED-HNNII. Prediction accuracy (%). Bold font in the original table highlights where ED-HNNII outperforms the original ED-HNN. Other details are kept consistent with Table 2.

Method | Cora | Citeseer | Pubmed | Cora-CA | DBLP-CA | Training Time (10^-1 s)
HGNN (Huang & Yang, 2021) | 79.39 ± 1.36 | 72.45 ± 1.16 | 86.44 ± 0.44 | 82.64 ± 1.65 | 91.03 ± 0.20 | 0.24 ± 0.51
HCHA (Bai et al., 2021) | 79.14 ± 1.02 | 72.42 ± 1.42 | 86.41 ± 0.36 | 82.55 ± 0.97 | 90.92 ± 0.22 | 0.24 ± 0.01
HNHN (Dong et al., 2020) | 76.36 ± 1.92 | 72.64 ± 1.57 | 86.90 ± 0.30 | 77.19 ± 1.49 | 86.78 ± 0.29 | 0.30 ± 0.56
HyperGCN (Yadati et al., 2019) | 78.45 ± 1.26 | 71.28 ± 0.82 | 82.84 ± 8.67 | 79.48 ± 2.08 | 89.38 ± 0.25 | 0.42 ± 1.51
UniGCNII (Huang & Yang, 2021) | 78.81 ± 1.05 | 73.05 ± 2.21 | 88.25 ± 0.40 | 83.60 ± 1.14 | 91.69 ± 0.19 | 4.36 ± 1.18
HyperND (Tudisco et al., 2021a) | 79.20 ± 1.14 | 72.62 ± 1.49 | 86.68 ± 0.43 | 80.62 ± 1.32 | 90.35 ± 0.26 | 0.15 ± 1.32
AllDeepSets (Chien et al., 2022) | 76.88 ± 1.80 | 70.83 ± 1.63 | 88.75 ± 0.33 | 81.97 ± 1.50 | 91.27 ± 0.27 | 1.23 ± 1.09
AllSetTransformer (Chien et al., 2022) | 78.58 ± 1.47 | 73.08 ± 1.20 | 88.72 ± 0.37 | 83.63 ± 1.47 | 91.53 ± 0.23 | 1.64 ± 1.63
ED-HNN | 80.31 ± 1.35 | 73.70 ± 1.38 | 89.03 ± 0.53 | 83.97 ± 1.55 | 91.90 ± 0.19 | 1.71 ± 1.13
ED-HNNII | 78.47 ± 1.62 | 72.65 ± 1.56 | 89.56 ± 0.62 | 82.17 ± 1.68 | 91.93 ± 0.29 | 1.85 ± 1.09

Method | Congress | Senate | Walmart | House | Avg. Rank | Inference Time (10^-2 s)
HGNN (Huang & Yang, 2021) | 91.26 ± 1.15 | 48.59 ± 4.52 | 62.00 ± 0.24 | 61.39 ± 2.96 | 6.00 | 1.01 ± 0.04
HCHA (Bai et al., 2021) | 90.43 ± 1.20 | 48.62 ± 4.41 | 62.35 ± 0.26 | 61.36 ± 2.53 | 6.67 | 1.54 ± 0.18
HNHN (Dong et al., 2020) | 53.35 ± 1.45 | 50.93 ± 6.33 | 47.18 ± 0.35 | 67.80 ± 2.59 | 7.67 | 8.11 ± 0.05
HyperGCN (Yadati et al., 2019) | 55.12 ± 1.96 | 42.45 ± 3.67 | 44.74 ± 2.81 | 48.32 ± 2.93 | 9.22 | 0.87 ± 0.06
UniGCNII (Huang & Yang, 2021) | 94.81 ± 0.81 | 49.30 ± 4.25 | 54.45 ± 0.37 | 67.25 ± 2.57 | 4.56 | 21.22 ± 0.13
HyperND (Tudisco et al., 2021a) | 74.63 ± 3.62 | 52.82 ± 3.20 | 38.10 ± 3.86 | 51.70 ± 3.37 | 6.89 | 0.02 ± 0.01
AllDeepSets (Chien et al., 2022) | 91.80 ± 1.53 | 48.17 ± 5.67 | 64.55 ± 0.33 | 67.82 ± 2.40 | 6.22 | 5.35 ± 0.33
AllSetTransformer (Chien et al., 2022) | 92.16 ± 1.05 | 51.83 ± 5.22 | 65.46 ± 0.25 | 69.33 ± 2.20 | 3.56 | 6.06 ± 0.67
ED-HNN | 95.00 ± 0.99 | 64.79 ± 5.14 | 66.91 ± 0.41 | 72.45 ± 2.28 | 1.56 | 5.87 ± 0.36
ED-HNNII | 95.19 ± 1.34 | 63.81 ± 6.17 | 67.24 ± 0.45 | 73.95 ± 1.97 | 2.67 | 6.07 ± 0.40

Table 5: More dataset statistics. CE homophily is the homophily score (Pei et al., 2020) based on the clique expansion (CE) of the hypergraphs.

Statistic | Cora | Citeseer | Pubmed | Cora-CA | DBLP-CA | Congress | Senate | Walmart | House
# nodes | 2708 | 3312 | 19717 | 2708 | 41302 | 1718 | 282 | 88860 | 1290
# hyperedges | 1579 | 1079 | 7963 | 1072 | 22363 | 83105 | 315 | 69906 | 340
# features | 1433 | 3703 | 500 | 1433 | 1425 | 100 | 100 | 100 | 100
# classes | 7 | 6 | 3 | 7 | 6 | 2 | 2 | 11 | 2
avg. d_v | 1.767 | 1.044 | 1.756 | 1.693 | 2.411 | 427.237 | 19.177 | 5.184 | 9.181
avg. |e| | 3.030 | 3.200 | 4.349 | 4.277 | 4.452 | 8.656 | 17.168 | 6.589 | 34.730
CE homophily | 0.897 | 0.893 | 0.952 | 0.803 | 0.869 | 0.555 | 0.498 | 0.530 | 0.509

F.2 DETAILS OF BENCHMARKING DATASETS

Our benchmark consists of seven existing datasets (Cora, Citeseer, Pubmed, Cora-CA, DBLP-CA, Walmart, and House) from Chien et al. (2022) and two newly introduced datasets (Congress (Fowler, 2006a) and Senate (Fowler, 2006b)). For the existing datasets, we downloaded the processed versions from Chien et al. (2022). For the co-citation networks (Cora, Citeseer, Pubmed), all documents cited by a document are connected by a hyperedge (Yadati et al., 2019). For the co-authorship networks (Cora-CA, DBLP-CA), all documents co-authored by an author are in one hyperedge (Yadati et al., 2019). The node features in the citation networks are the bag-of-words representations of the corresponding documents, and the node labels are the paper classes. In the House dataset, each node is a member of the US House of Representatives, and hyperedges group members of the same committee. Node labels indicate the political party of the representatives (Chien et al., 2022).
In Walmart, nodes represent products purchased at Walmart, hyperedges represent sets of products purchased together, and the node labels are the product categories (Chien et al., 2022). For Congress (Fowler, 2006a) and Senate (Fowler, 2006b), we used the same setting as (Veldt et al., 2022). In the Congress dataset, nodes are US Congresspersons and hyperedges comprise the sponsor and co-sponsors of legislative bills put forth in both the House of Representatives and the Senate. In the Senate dataset, nodes are US Congresspersons and hyperedges comprise the sponsor and co-sponsors of bills put forth in the Senate. Each node in both datasets is labeled with its political party affiliation. Both datasets were built from James Fowler's data (Fowler, 2006b;a). We list more detailed statistics of the tested datasets in Table 5.

F.3 HYPERPARAMETERS FOR BENCHMARKING DATASETS

For a fair comparison, we use the same training recipe for all the models. For the baseline models, we precisely follow the hyperparameter settings from (Chien et al., 2022). For ED-HNN, we adopt the Adam optimizer with a fixed learning rate of 0.001 and weight decay of 0.0, and train for 500 epochs on all datasets. The standard deviation is reported by repeating experiments on ten different data splits. We fix the input dropout rate to 0.2 and the dropout rate to 0.3. For the internal MLPs, we add a LayerNorm to each layer, similar to (Chien et al., 2022). Other parameters regarding model sizes are obtained by grid search and are enumerated in Table 6. The search range of the layer number is {1, 2, 4, 6, 8} and that of the hidden dimension is {96, 128, 256, 512}. We find that the model size is proportional to the dataset scale, and in general heterophilic data need deeper architectures. For ED-HNNII, due to its inherent model complexity, we need to prune the model depth and width to fit each dataset.

Algorithm 3: Contextual Hypergraph Stochastic Block Model
Initialization: Empty hyperedge set E = ∅. Draw a vertex set V1 of 2,500 nodes with class 1. Draw a vertex set V2 of 2,500 nodes with class 2. For i = 1, 2, ..., 1,000, do:
1. Sample a subset e1 with α1 nodes from V1.
2. Sample a subset e2 with α2 nodes from V2.
3. Construct the hyperedge: E ← E ∪ {e1 ∪ e2}.

Table 6: Choice of hyperparameters for ED-HNN. "#" is short for "number of", "hd." stands for hidden dimension, and "cls." means the classifier. When the number of MLP layers equals 0, the MLP reduces to an identity mapping.

ED-HNN | Cora | Citeseer | Pubmed | Cora-CA | DBLP-CA | Congress | Senate | Walmart | House
# iterations | 1 | 1 | 8 | 1 | 1 | 4 | 8 | 6 | 8
# layers of ϕ̂ | 0 | 0 | 2 | 1 | 1 | 2 | 2 | 2 | 2
# layers of ρ̂ | 1 | 1 | 2 | 1 | 1 | 2 | 2 | 2 | 2
# layers of φ̂ | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2
# layers cls. | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2
MLP hd. | 256 | 256 | 512 | 128 | 128 | 156 | 512 | 512 | 256
cls. hd. | 256 | 256 | 256 | 96 | 96 | 128 | 256 | 256 | 128

ED-HNNII | Cora | Citeseer | Pubmed | Cora-CA | DBLP-CA | Congress | Senate | Walmart | House
# iterations | 1 | 1 | 4 | 1 | 1 | 4 | 4 | 4 | 4
# layers of ϕ̂ | 1 | 1 | 2 | 1 | 1 | 2 | 2 | 2 | 2
# layers of ρ̂ | 1 | 1 | 2 | 1 | 1 | 2 | 2 | 2 | 2
# layers of φ̂ | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
# layers cls. | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2
MLP hd. | 128 | 128 | 512 | 128 | 128 | 156 | 512 | 512 | 256
cls. hd. | 96 | 96 | 256 | 96 | 96 | 128 | 256 | 256 | 128

F.4 SYNTHETIC HETEROPHILIC DATASETS

We use the contextual hypergraph stochastic block model (Deshpande et al., 2018; Ghoshdastidar & Dukkipati, 2014; Lin & Wang, 2018) to synthesize data with controlled heterophily. The generated hypergraph contains 5,000 nodes and two classes in total, with 2,500 nodes per class. We construct hyperedges by randomly sampling α1 nodes from class 1 and α2 nodes from class 2 without replacement. Each hyperedge has a fixed cardinality |e| = α1 + α2 = 15, and we draw 1,000 hyperedges in total. The detailed data synthesis pipeline is summarized in Algorithm 3; a minimal sketch follows.
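The sketch below (our illustration, not the paper's data-generation script; function and variable names are ours) mirrors Algorithm 3: each hyperedge mixes α1 nodes of class 1 with α2 nodes of class 2, so α = min{α1, α2} controls the heterophily level.

```python
# Contextual hypergraph stochastic block model: generate heterophily-controlled hyperedges.
import numpy as np

def chsbm_hyperedges(alpha1, alpha2, n_per_class=2500, n_edges=1000, seed=0):
    rng = np.random.default_rng(seed)
    v1 = np.arange(n_per_class)                    # node ids of class 1
    v2 = np.arange(n_per_class, 2 * n_per_class)   # node ids of class 2
    edges = []
    for _ in range(n_edges):
        e1 = rng.choice(v1, size=alpha1, replace=False)   # alpha1 nodes from class 1
        e2 = rng.choice(v2, size=alpha2, replace=False)   # alpha2 nodes from class 2
        edges.append(np.concatenate([e1, e2]))            # hyperedge e1 U e2, |e| = alpha1 + alpha2
    return edges

# Example: alpha = 3 with |e| = 15 corresponds to alpha1 = 3, alpha2 = 12.
edges = chsbm_hyperedges(alpha1=3, alpha2=12)
```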
We use α = min{α1, α2} to characterize the heterophily level of the hypergraph. For a more intuitive illustration, we list the CE homophily corresponding to different α in Table 7.

Table 7: Correspondence between heterophily α and CE homophily.
α | 1 | 2 | 3 | 4 | 6 | 7
CE homophily | 0.875 | 0.765 | 0.672 | 0.596 | 0.495 | 0.474

Experiments on the synthetic heterophilic datasets fix the training hyperparameters and the hidden dimension to 256 to guarantee a fair parameter budget (≈1M). Baseline HNNs all use a one-layer architecture since they do not scale with depth, as shown in Sec. 4.3. Since our ED-HNN adopts a parameter-sharing scheme, we can easily repeat the diffusion layer twice to achieve better results without parameter overheads.

F.5 SYNTHETIC DIFFUSION DATASETS AND ADDITIONAL EXPERIMENTS

In order to evaluate the ability of ED-HNN to express a given hypergraph diffusion, we generate semi-synthetic diffusion data using the Senate hypergraph (Chodrow et al., 2021) and synthetic node features.

[Figure 5: Comparing the powers to represent known diffusion (using ADMM with the proximal operators in Eq. 3 and 4): MAE (×10^-2) vs. latent dimensions {32, 64, 128, 256}. Panels: Total Variation (p = 2) and Lovász Extension – Cardinality; curves: Equivariant, Invariant, AllDeepSets, AllSetTransformer.]

The data consists of 1,000 pairs (H^(0), H^(1)). The initial node features H^(0) are sampled from 1-dimensional Gaussian distributions with mean 0 and variance randomly drawn between 1 and 100. That is, to generate a single instance of H^(0), we first pick σ uniformly from [1, 10] and then sample the coordinate entries as h^(0)_v ~ N(0, σ²). We then apply the gradient step in Eq. 2 to obtain the corresponding H^(1) (see the sketch at the end of this subsection). For non-differentiable node or edge potentials, we adopt subgradients for convenient computation. We fix the node potential as f(h_v; x_v) = (h_v − x_v)², where x_v = h^(0)_v. We consider three different edge potentials from Example 1 with varying complexities: a) CE, b) TV (p = 2), and c) LEC (p = 2). For LEC, we set y as follows: if |e| is even, then y_i = 2/|e| if i ≤ |e|/2 and −2/|e| otherwise; if |e| is odd, then y_i = 2/(|e|−1) if i ≤ (|e|−1)/2, y_i = 0 if i = (|e|+1)/2, and y_i = −2/(|e|−1) otherwise. In order to apply the gradient step in Eq. 2, we need to specify the learning rate η. We choose η such that Var(H^(1))/Var(H^(0)) does not vary too much among the three different edge potentials. Specifically, we set η = 0.5 for CE, η = 0.02 for TV, and η = 0.1 for LEC.

Beyond the semi-synthetic diffusion data generated from the gradient step in Eq. 2, we also considered synthetic diffusion data obtained from the proximal operators in Eq. 3 and 4. We generated a random uniform hypergraph with 1,000 nodes and 1,000 hyperedges of constant hyperedge size 20. The diffusion data on this hypergraph consists of 1,000 pairs (H^(0), H^(1)). The initial node features H^(0) are sampled in the same way as before. We apply the updates given by Eq. 3 and 4 to obtain H^(1). We consider the same node potential and two edge potentials, TV (p = 2) and LEC (p = 2). We set η = 1/2 for both cases. We show the results in Figure 5. The additional results resonate with our previous results in Figure 3. Again, our ED-HNN outperforms the other baseline HNNs by a significant margin when the hidden dimension is limited.
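The sketch below (our illustration) shows how one (H^(0), H^(1)) pair is generated with the gradient step: sample 1-D Gaussian node features, then take one (sub)gradient step of the node plus edge potentials. The concrete CE/TV/LEC potentials follow Example 1 in the main text; here `grad_g_edge` is a placeholder, demonstrated with a simple quadratic stand-in that is not one of the paper's potentials.

```python
# Generate a single semi-synthetic diffusion pair (H^(0), H^(1)) via one gradient step.
import numpy as np

def make_pair(edges, eta, grad_g_edge, n_nodes, seed=0):
    rng = np.random.default_rng(seed)
    sigma = rng.uniform(1.0, 10.0)                 # sigma ~ U[1, 10], so the variance lies in [1, 100]
    h0 = rng.normal(0.0, sigma, size=n_nodes)      # H^(0); also used as the node attributes x_v
    grad = np.zeros(n_nodes)                       # grad f(h; x) = 2(h - x) vanishes since x_v = h_v^(0)
    for nodes in edges:                            # accumulate [grad g(H_e)]_v over all e containing v
        grad[nodes] += grad_g_edge(h0[nodes])
    h1 = h0 - eta * grad                           # one (sub)gradient step as in Eq. 2
    return h0, h1

def grad_quadratic(h_e):
    # Gradient of the stand-in potential g(H_e) = 0.5 * sum_u (h_u - mean(H_e))^2.
    return h_e - h_e.mean()
```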
F.6 MORE COMPLEXITY ANALYSIS

In this section, we provide a further efficiency comparison with HyperSAGE (Arya et al., 2020) and LEGCN (Yang et al., 2020). As discussed in Sec. 3.4, HyperSAGE and LEGCN may be able to learn a hyperedge-equivariant operator. However, both HyperSAGE and LEGCN need to build node-hyperedge pairs, which is memory-consuming and unfriendly to efficient message-passing implementations. To demonstrate this point, we compare our model with HyperSAGE and LEGCN in terms of training and inference efficiency. We reuse the official code provided on OpenReview to benchmark HyperSAGE. Since LEGCN has not released code, we reimplement it using the PyTorch Geometric library. We can only test the speed of these models on the Cora dataset, because HyperSAGE and LEGCN cannot scale up: there is no efficient implementation yet for their computationally prohibitive preprocessing procedures. The results are presented in Table 8. We note that HyperSAGE cannot easily employ message passing between V and E, since its neighbor aggregation step needs to rule out the central node. Its official implementation adopts a naive for-loop for each forward pass, which cannot fully utilize GPU parallelism and significantly deteriorates the speed. LEGCN can be implemented via message passing; however, the graphs expanded over node-edge pairs are much denser and thus cannot scale up to larger datasets.

Table 8: Performance and efficiency comparison with HyperSAGE and LEGCN. The prediction accuracy of HyperSAGE is copied from the original manuscript (Arya et al., 2020).

Metric | HyperSAGE (Arya et al., 2020) | LEGCN (Yang et al., 2020) | ED-HNN | ED-HNNII
Training Time (10^-1 s) | 43.93 ± 2.15 | 0.56 ± 0.71 | 0.15 ± 0.68 | 0.25 ± 0.48
Inference Time (10^-2 s) | 297.57 ± 30.57 | 0.42 ± 0.06 | 0.15 ± 0.03 | 0.20 ± 0.08
Prediction Accuracy (%) | 69.30 ± 2.70 | 73.34 ± 1.06 | 80.31 ± 1.35 | 78.47 ± 1.62

F.7 SENSITIVITY TO HIDDEN DIMENSIONS

We conduct expressivity vs. hidden-size experiments in Table 9, where we control the hidden dimension of the hypergraph models and test their performance on the Pubmed dataset. We choose the top performers AllDeepSets and AllSetTransformer as the compared baselines. On real-world data, our 64-width ED-HNN can even achieve on-par results with the 512-width AllSet models. This implies that ED-HNN has a better tolerance of low hidden dimensions, which we attribute to the higher expressive power obtained by achieving equivariance.

Table 9: Sensitivity test on the hidden dimension on the Pubmed dataset.

Model | 512 | 256 | 128 | 64
AllDeepSets (Chien et al., 2022) | 88.75 ± 0.33 | 88.41 ± 0.37 | 87.50 ± 0.42 | 86.78 ± 0.40
AllSetTransformer (Chien et al., 2022) | 88.72 ± 0.37 | 88.16 ± 0.24 | 87.36 ± 0.23 | 86.21 ± 0.25
ED-HNN (ours) | 89.03 ± 0.53 | 88.74 ± 0.38 | 88.84 ± 0.38 | 88.76 ± 0.24