# Nonlinear Feature Diffusion on Hypergraphs

Konstantin Prokopchik (Gran Sasso Science Institute, L'Aquila, Italy), Austin R. Benson (Cornell University, New York, USA), Francesco Tudisco (Gran Sasso Science Institute, L'Aquila, Italy)

*Proceedings of the 39th International Conference on Machine Learning (ICML), Baltimore, Maryland, USA, PMLR 162, 2022.*

**Abstract.** Hypergraphs are a common model for multiway relationships in data, and hypergraph semi-supervised learning is the problem of assigning labels to all nodes in a hypergraph, given labels on just a few nodes. Diffusions and label spreading are classical techniques for semi-supervised learning in the graph setting, and there are some standard ways to extend them to hypergraphs. However, these methods are linear models and do not offer an obvious way of incorporating node features for making predictions. Here, we develop a nonlinear diffusion process on hypergraphs that spreads both features and labels following the hypergraph structure. Even though the process is nonlinear, we show global convergence to a unique limiting point for a broad class of nonlinearities, and we show that this limit is the global minimum of a new regularized semi-supervised learning loss function that reduces a generalized form of variance of the node features across the hyperedges. The limiting point serves as a node embedding from which we make predictions with a linear model. Our approach is competitive with popular graph and hypergraph neural network baselines, and also takes less time to train.

## 1. Introduction

In graph-based semi-supervised learning (SSL), one has labels on a small number of nodes, and the goal is to predict labels for the remaining nodes. Diffusions, label spreading, and label propagation are classical techniques for this problem, where known labels are diffused, spread, or propagated over the edges of a graph (Zhou et al., 2004; Zhu et al., 2003). These methods were originally developed for graphs where the set of nodes corresponds to a point cloud and the edges encode similarity, such as ε-nearest neighbors; however, they can also be used with relational data such as social networks or co-purchasing (Chin et al., 2019; Gleich & Mahoney, 2015; Juan et al., 2020; Kyng et al., 2015). In the latter case, diffusions are particularly well suited because they directly capture the idea of homophily (McPherson et al., 2001) or assortativity (Newman, 2002), where labels are smooth over the graph.

While graphs are widely used models for relational data, many complex systems and datasets are better described by higher-order relations that go beyond pairwise interactions (Battiston et al., 2020; Benson et al., 2018; Torres et al., 2021). For instance, co-authorship often involves more than two authors, people in social networks gather in small groups and not just pairs, and emails can have several recipients. A hypergraph is a standard representation for such data, where a hyperedge can connect any number of nodes. Directly modeling higher-order interactions has led to improvements in several machine learning settings (Zhou et al., 2007; Benson et al., 2016; Li & Milenkovic, 2017; 2018; Yadati et al., 2019; Srinivasan et al., 2021).
Along this line, there are a number of diffusion or Laplacian-like regularization techniques for SSL on hypergraphs (Zhou et al., 2007; Hein et al., 2013; Zhang et al., 2017; Li et al., 2020; Liu et al., 2021; Veldt et al., 2020; Tudisco et al., 2021), which are also built on principles of similarity or assortativity. These methods optimize a quadratic Laplacian-based regularized loss that enforces local and global consistency across the hyperedges. This optimization formulation makes the learning strategy easy to interpret and amenable to theoretical guarantees. However, unlike the graph case, non-quadratic hypergraph consistency losses are typically required to model real-world higher-order data interactions (Chan & Liang, 2020; Neuhäuser et al., 2022; Tudisco & Higham, 2021). Thus, the corresponding diffusion algorithms, based on e.g. gradient flow integration, are not linear and are in general no longer easy to interpret or to analyze theoretically. Moreover, these methods are typically designed for cases where only labels are available, and do not take advantage of rich features or metadata associated with hypergraphs that are potentially useful for making accurate predictions. For instance, co-authorship or email data could have rich textual information.

Graph and hypergraph neural networks (GNNs) are a popular approach that uses both features and network structure for SSL (Yadati et al., 2019; Feng et al., 2019; Dong et al., 2020; Chien et al., 2022; Huang & Yang, 2021; Zhang et al., 2020a). Hidden layers of GNNs aggregate the features of neighboring nodes via neural networks and learn the model parameters by fitting to the labeled nodes. While combining features according to the hypergraph structure is a key idea, the loss function of typical GNNs does not directly account for the fact that connected nodes likely share similar labels. Moreover, GNNs can be expensive to train. In contrast, diffusion methods work precisely because of homophily and are typically fast. In the simple case of graphs, combining these two ideas has led to several recent advances (Klicpera et al., 2018; Huang et al., 2020; Jia & Benson, 2022).

Here, we combine the ideas of GNNs and diffusions for SSL on hypergraphs with a method that simultaneously diffuses both labels and features according to the hypergraph structure. In addition to incorporating features, our new diffusion can incorporate a broad class of nonlinearities to increase the modeling capability; such nonlinearities are critical to the architectures of both graph and hypergraph neural networks. The limiting point of the process provides an embedding at each node, which can then be combined with a simpler model such as multinomial logistic regression to make predictions at each node. This results in a method that is much faster than typical GNNs, as the training phase and the embedding computation are decoupled.

Remarkably, even though our model is nonlinear, we can still prove a number of theoretical properties about the diffusion process. In particular, we show that the limiting point of the process is unique and provide a simple, globally convergent iterative algorithm for computing it. Furthermore, we show that this limiting point is the global optimum of an interpretable hypergraph SSL loss function defined as the combination of a data fitting term and a Laplacian-like regularizer that reduces a form of generalized variance on each hyperedge.
Empirically, we find that using the limiting point of our nonlinear hypergraph diffusion as features for a linear model is competitive with a variety of graph and hypergraph neural network baselines and other diffusion algorithms on several real-world datasets. We also study the effect of the input feature embedding on the classification performance by either removing or modifying the node features. In particular, including the final embedding of hypergraph GNNs as additional features in the diffusion model does not improve accuracy, which provides evidence that our model is sufficient for empirical data.

## 2. Problem Set-up

We consider the multi-class semi-supervised classification problem on a hypergraph, in which we are given nodes with features and hyperedges connecting them. A small number of node labels are available, and the goal is to assign labels to the remaining nodes.

Let $H = (V, E)$ be a hypergraph, where $V = \{1, \dots, n\}$ is the set of nodes and $E = \{e_1, \dots, e_m\}$ is the set of hyperedges. Each hyperedge $e \in E$ has an associated positive weight $w(e) > 0$. In our setting every node can belong to an arbitrary number of hyperedges. Let $\delta_i$ denote the (hyper)degree of node $i \in V$, i.e., the weighted number of hyperedges node $i$ participates in, $\delta_i = \sum_{e : i \in e} w(e)$, and let $D \in \mathbb{R}^{n \times n}$ be the diagonal matrix of the node degrees, i.e., $D = \mathrm{Diag}(\delta_1, \dots, \delta_n)$. Throughout we assume that the hypergraph has no isolated nodes, i.e., $\delta_i > 0$ for all $i$. This is a standard assumption, as one can always add self-loops or remove isolated vertices. Let $K$ denote the $n \times m$ incidence matrix of $H$, whose rows correspond to nodes and columns to hyperedges:
$$K_{ie} = \begin{cases} 1 & i \in e \\ 0 & \text{otherwise.} \end{cases}$$
To include possible weights on the hyperedges, we use a diagonal matrix $W$ defined by $W = \mathrm{Diag}(w(e_1), \dots, w(e_m))$.

We represent $d$-dimensional features on the nodes of $H$ by a matrix $X \in \mathbb{R}^{n \times d}$, where row $x_i = X_{i,:} \in \mathbb{R}^d$ is the feature vector of $i \in V$. Suppose each node belongs to one of $c$ classes, denoted $\{1, \dots, c\}$, and that we know the labels of a (small) training subset $T$ of the nodes $V$. We denote by $Y \in \mathbb{R}^{n \times c}$ the input-label matrix of the nodes, with rows $y_i$ defined entrywise by $Y_{ij} = (y_i)_j = 1$ if node $i$ belongs to class $j$, and $(y_i)_j = 0$ otherwise. Since we only know the labels for the nodes in $T$, $y_i$ is the all-zero vector for $i \notin T$, while for $i \in T$, $y_i$ has exactly one nonzero entry (one-hot encoding).
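To make this setup concrete, the sketch below builds the incidence matrix $K$, the degree matrix $D$, the hyperedge-weight matrix $W$, and the partially observed label matrix $Y$ for a small toy hypergraph. It is an illustrative construction only, not the authors' code; the variable names and the toy data are our own.

```python
import numpy as np

# Toy hypergraph: 5 nodes, 3 hyperedges (node indices are 0-based here).
n, m, c = 5, 3, 2
hyperedges = [[0, 1, 2], [1, 2, 3], [3, 4]]
weights = np.array([1.0, 1.0, 2.0])           # w(e) > 0 for every hyperedge

# Incidence matrix K (n x m): K[i, e] = 1 iff node i belongs to hyperedge e.
K = np.zeros((n, m))
for e, nodes in enumerate(hyperedges):
    K[nodes, e] = 1.0

W = np.diag(weights)                           # diagonal hyperedge-weight matrix
delta = K @ weights                            # node degrees: delta_i = sum_{e: i in e} w(e)
assert np.all(delta > 0), "no isolated nodes allowed"
D = np.diag(delta)

# Labels: only the training subset T is observed (one-hot rows), the rest is zero.
T = [0, 4]
labels = {0: 0, 4: 1}
Y = np.zeros((n, c))
for i in T:
    Y[i, labels[i]] = 1.0
```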
## 3. Background and Related Work

We review basic ideas in hypergraph neural networks (HNNs) for SSL and hypergraph label spreading (HLS), which will contextualize the method we develop next.

### 3.1. Neural Network Approaches

Graph neural networks are broadly adopted methods for semi-supervised learning on graphs. Several generalizations to hypergraphs have been proposed, and we summarize the most fundamental ideas here.

When $|e| = 2$ for all $e \in E$, the hypergraph is a standard graph $G$. The basic formulation of a graph convolutional network (GCN) (Kipf & Welling, 2017) is based on a first-order approximation of the convolution operator on graph signals (Mallat, 1999). This approximation boils down to a mapping given by the normalized adjacency matrix of the graph $\mathcal{A} = I - L$, where $L$ is the (possibly rescaled) normalized Laplacian $L = I - D^{-1/2} A D^{-1/2}$ and $A$ is the adjacency matrix. The forward model for a two-layer GCN is then
$$Z = \mathrm{softmax}(F) = \mathrm{softmax}\big(\mathcal{A}\,\sigma(\mathcal{A} X \Theta^{(1)})\,\Theta^{(2)}\big),$$
where $X \in \mathbb{R}^{n \times d}$ is the matrix of node features, $\Theta^{(1)}, \Theta^{(2)}$ are the input-to-hidden and hidden-to-output weight matrices of the network, and $\sigma$ is a nonlinear activation function. Here, the approximated graph convolutional filter $\mathcal{A}$ combines features across nodes that are well connected in the graph. For multi-class semi-supervised learning problems, the weights $\Theta^{(i)}$ are trained by minimizing the cross-entropy loss over the training set of known labels $T$.

Several hypergraph variations of this neural network model have been proposed for the more general case $|e| \geq 2$. A common strategy is to consider a hypergraph Laplacian $L$ and define an analogous convolutional filter. One relatively simple approach is to define $L$ as the Laplacian of the clique-expanded graph of $H$ (Agarwal et al., 2006; Zhou et al., 2007), where the hypergraph is mapped to a graph on the same set of nodes by adding a clique among the nodes of each hyperedge. This is the approach used in HGNN (Feng et al., 2019). Other variants use mediators or general permutation-invariant aggregating functions instead of cliques to perform the graph reduction (Chan & Liang, 2020; Huang et al., 2020), or learn the hypergraph convolutional filter via a suitable attention-based multiset function architecture (Chien et al., 2022).

HyperGCN (Yadati et al., 2019) is based on the nonlinear hypergraph Laplacian proposed in (Chan et al., 2018; Louis, 2015). This model uses a GCN on a reduced graph $G_X = (V, E_X)$ that depends on the features, where, for each hyperedge $e$ of the original hypergraph, $E_X$ contains the pair $(u, v) = \arg\max_{i,j \in e} \|x_i - x_j\|$. An approximate graph convolutional filter $\mathcal{A}_X$ is then defined in terms of the normalized Laplacian of $G_X$ as before. Thus, the two-layer forward model for HyperGCN is
$$Z = \mathrm{softmax}(\mathcal{A}_X F^{(1)} \Theta^{(2)}), \qquad F^{(1)} = \sigma(\mathcal{A}_X X \Theta^{(1)}).$$

### 3.2. Laplacian Regularization and Label Spreading

Semi-supervised learning methods based on Laplacian-like regularization strategies were developed in (Zhou et al., 2004) for graphs and then in (Zhou et al., 2007) for hypergraphs. The main idea of these approaches is to obtain a classifier $F$ by minimizing the regularized square loss function
$$\min_F \; \ell_\Omega(F) := \|F - Y\|^2 + \lambda\, \Omega(F), \tag{1}$$
where $\Omega$ is a regularization term that accounts for the hypergraph structure. (Note that only labels, and not features, are used here.) In particular, if $f_i = F_{i,:}$ denotes the $i$-th row of $F$, the clique expansion approach of (Zhou et al., 2007) defines $\Omega = \Omega_{L2}$, with
$$\Omega_{L2}(F) = \sum_{e \in E} \sum_{i,j \in e} \frac{w(e)}{|e|} \Big\| \frac{f_i}{\sqrt{\delta_i}} - \frac{f_j}{\sqrt{\delta_j}} \Big\|^2,$$
while the total variation on hypergraphs regularizer proposed in (Hein et al., 2013) is $\Omega = \Omega_{TV}$, with
$$\Omega_{TV}(F) = \sum_{e \in E} w(e) \max_{i,j \in e} \| f_i - f_j \|^2.$$
The function $\Omega_{L2}$ is quadratic, so its gradient is a rescaled version of the Laplacian of the clique-expanded graph of $H$. Thus, HGNN is implicitly applying this regularization. Similarly, the graph construction in HyperGCN is implicitly applying a regularization based on the hypergraph total variation $\Omega_{TV}$.

These two choices of regularizing term call for different solution strategies. As $\Omega_{L2}$ is quadratic, one can solve (1) via gradient descent with learning rate $\alpha = \lambda/(1 + \lambda)$ to obtain the simple method
$$F^{(k+1)} = \alpha A_H F^{(k)} + (1 - \alpha) Y, \tag{2}$$
where $A_H$ is the normalized adjacency matrix of the clique-expanded graph of $H$. The sequence (2) converges to the global solution $F^\star$ of (1) for any starting point, and the limit $F^\star$ is entrywise nonnegative. This method is usually referred to as Hypergraph Label Spreading (HLS), as the iteration (2) takes the initial labels $Y$ and spreads, or diffuses, them throughout the vertices of the hypergraph $H$, following the edge structure of its clique-expanded graph.
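As a reference point for the nonlinear model introduced later, the following sketch implements the linear HLS iteration (2). The normalized clique-expansion adjacency is computed as $D^{-1/2}(K W K^\top - D)D^{-1/2}$, following the identity $K W K^\top = A_H + D$ recalled in Section 4.2; this is a minimal illustration under our own naming, not the authors' implementation, and convergence for a given α is assumed rather than checked.

```python
import numpy as np

def normalized_clique_adjacency(K, W, D):
    """Normalized adjacency of the clique-expanded graph, D^{-1/2}(K W K^T - D) D^{-1/2}."""
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
    A_clique = K @ W @ K.T - D                    # unnormalized clique-expansion adjacency
    return D_inv_sqrt @ A_clique @ D_inv_sqrt

def hypergraph_label_spreading(K, W, D, Y, alpha=0.9, tol=1e-8, max_iter=1000):
    """Linear HLS iteration (2): F <- alpha * A_H F + (1 - alpha) * Y."""
    A_H = normalized_clique_adjacency(K, W, D)
    F = Y.copy()
    for _ in range(max_iter):
        F_new = alpha * (A_H @ F) + (1 - alpha) * Y
        if np.linalg.norm(F_new - F) < tol * np.linalg.norm(F_new):
            break
        F = F_new
    return F_new

# Predicted class of node i: argmax over the columns of the limit point, e.g.
# preds = hypergraph_label_spreading(K, W, D, Y).argmax(axis=1)
```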
The total-variation-inspired regularizer $\Omega_{TV}$ is related to the 1-Laplacian energy (Bühler & Hein, 2009; Tudisco & Hein, 2018) and has advantages for hyperedge-cut interpretations. However, although $\Omega_{TV}$ is convex, it is not differentiable, and solving (1) requires more complex and computationally demanding methods (Hein et al., 2013; Zhang et al., 2017). Unlike HLS in (2), this is not easily interpreted as a label diffusion.

## 4. Hyperedge Variance Regularization and Nonlinear Diffusion

In this section, we propose a new hypergraph regularization term $\Omega_\mu$ which, rather than minimizing the distance between each node pair on a hyperedge, aims at reducing the variance (or a generalized variance) across the hyperedge nodes. Precisely, consider the regularization term
$$\Omega_\mu(F) = \sum_{e \in E} \sum_{i \in e} w(e) \Big\| \frac{f_i}{\sqrt{\delta_i}} - \mu\Big(\Big\{ \frac{f_j}{\sqrt{\delta_j}} : j \in e \Big\}\Big) \Big\|^2, \tag{3}$$
where $\mu(\cdot)$ is a function that combines node embeddings on each hyperedge. When $\mu$ is the mean, $\mu(\{z_j : j \in e\}) = \frac{1}{|e|}\sum_{j \in e} z_j$, the right-hand side of (3) coincides exactly with the variance of $f_i/\sqrt{\delta_i}$ over the hyperedge $e$. Note that, if $F$ is the matrix with rows $F_{i,:} = f_i$, the mean of $f_i/\sqrt{\delta_i}$ over the hyperedge $e$ can be written as the $e$-th row of the matrix $D_E^{-1} K^\top D^{-1/2} F$, where $D_E$ denotes the $m \times m$ diagonal matrix with diagonal entries $(D_E)_{e,e} = |e|$.

Here we use this observation to define a family of functions $\mu$ that generalizes the mean. Precisely, let
$$\mu_{\sigma,\varrho}\Big(\Big\{ \frac{f_j}{\sqrt{\delta_j}} : j \in e \Big\}\Big) = \sigma\big(K^\top \varrho(D^{-1/2} F)\big)_{e,:} \tag{4}$$
where $\sigma$ and $\varrho$ are (in general, nonlinear) operators that describe the way an embedding $F$ is transformed and combined across the hyperedges. For example, when $\sigma$ and $\varrho$ are diagonal operators (similar to activation functions), i.e., such that $\sigma(F)_{ij} = \sigma_i(F_{ij})$ for some functions $\sigma_1, \sigma_2, \dots : \mathbb{R} \to \mathbb{R}$ (and analogously for $\varrho$), we have that
$$\mu_{\sigma,\varrho}\Big(\Big\{ \frac{f_j}{\sqrt{\delta_j}} : j \in e \Big\}\Big) = \sigma_e\Big( \sum_{j \in e} \varrho_j\Big(\frac{f_j}{\sqrt{\delta_j}}\Big) \Big).$$
An important example of $\sigma$ and $\varrho$ of this form, which we use in all our experiments, is
$$\varrho(Z_1) := Z_1^p, \qquad \sigma(Z_2) := (D_E^{-1} Z_2)^{1/p}, \tag{5}$$
where the powers are taken entrywise and $Z_1, Z_2$ are matrices of appropriate size. With these choices, for every $e \in E$ we have that $\mu_{\sigma,\varrho}(\{f_i/\sqrt{\delta_i} : i \in e\}) = \mathrm{mean}_p\{f_i/\sqrt{\delta_i} : i \in e\}$ is the $p$-power mean of the normalized feature vectors $f_i/\sqrt{\delta_i}$ of the nodes $i$ in the hyperedge $e$, where
$$\mathrm{mean}_p\{z_i : i \in e\} = \Big( \frac{1}{|e|} \sum_{i \in e} z_i^p \Big)^{1/p}.$$
With this choice of $\sigma$ and $\varrho$, the regularization term (3) reads
$$\Omega_{\mu_{\sigma,\varrho}}(F) = \sum_{e \in E} \sum_{i \in e} w(e) \Big\| \frac{f_i}{\sqrt{\delta_i}} - \mathrm{mean}_p\Big\{ \frac{f_j}{\sqrt{\delta_j}} : j \in e \Big\} \Big\|^2, \tag{6}$$
that is, the embedding $F$ that minimizes $\Omega_{\mu_{\sigma,\varrho}}$ minimizes the deviation of each node embedding $f_i$ from the $p$-power mean of the embeddings of the nodes in each hyperedge that $i$ participates in. In particular, when $p = 2$, the regularization term (6) computes the variance of the node embeddings within each hyperedge. Note that, for nonnegative embeddings, other special cases of $\mathrm{mean}_p$ include the geometric mean (for $p \to 0$), the harmonic mean (for $p = -1$), as well as the minimum and maximum functions (for $p \to -\infty$ and $p \to +\infty$, respectively).
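To illustrate the definitions above, the sketch below evaluates the $p$-power-mean maps $\sigma, \varrho$ from (5) and the resulting regularizer $\Omega_{\mu_{\sigma,\varrho}}$ in (6) for a nonnegative embedding. It is a direct transcription of the formulas under our own naming and assumes the matrices `K`, `W`, `D` from the earlier setup sketch.

```python
import numpy as np

def power_mean_edge_embedding(F, K, D, p=2.0):
    """e-th row = p-power mean of the normalized node embeddings f_i / sqrt(delta_i), i in e."""
    F_norm = F / np.sqrt(np.diag(D))[:, None]          # rows f_i / sqrt(delta_i)
    edge_sizes = K.sum(axis=0)                          # |e| for each hyperedge
    rho = F_norm ** p                                   # entrywise power, rho in (5)
    return (K.T @ rho / edge_sizes[:, None]) ** (1.0 / p)   # sigma in (5)

def variance_regularizer(F, K, W, D, p=2.0):
    """Omega in (6): weighted squared deviation of each node from the p-power mean on its hyperedges."""
    F_norm = F / np.sqrt(np.diag(D))[:, None]
    M = power_mean_edge_embedding(F, K, D, p)           # one generalized mean per hyperedge
    total = 0.0
    for e in range(K.shape[1]):
        nodes = np.nonzero(K[:, e])[0]
        total += W[e, e] * np.sum((F_norm[nodes] - M[e]) ** 2)
    return total
```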
### 4.1. Nonlinear Diffusion Method

Consider the regularized loss function $\ell_\Omega$ in (1) with $\Omega = \Omega_{\mu_{\sigma,\varrho}}$. Unlike the clique-expansion case $\Omega = \Omega_{L2}$, the loss $\ell_{\Omega_{\mu_{\sigma,\varrho}}}$ is non-quadratic and non-convex in general. Despite this fact, we show below that the global solution to $\min_F \ell_{\Omega_{\mu_{\sigma,\varrho}}}(F)$ can be computed via a simple hypergraph diffusion algorithm, similar to a nonlinear version of HLS (2), provided we restrict to the set of embeddings with nonnegative entries. We introduce the diffusion model next.

Recall that in our setting each node $i \in V$ has a label-encoding vector $y_i$ ($y_i$ is the all-zero vector for the initially unlabeled points $i \notin T$) and a feature vector $x_i$. Thus, each node in the hypergraph has an initial $(c + d)$-dimensional embedding, which forms an input matrix $U = [Y\ X]$. The proposed hypergraph semi-supervised classifier uses the normalized limit point of the nonlinear diffusion process
$$F^{(k+1)} = \alpha\, \mathcal{L}(F^{(k)}) + (1 - \alpha)\, U, \tag{7}$$
where the nonlinear diffusion map $\mathcal{L}$ is a form of nonlinear Laplacian operator defined as
$$\mathcal{L}(F) = D^{-1/2} K W \sigma\big(K^\top \varrho(D^{-1/2} F)\big). \tag{8}$$
The limit point of the diffusion process (7) yields a new embedding $F^\star = [Y^\star\ X^\star] \in \mathbb{R}^{n \times (c+d)}$. We show in Section 4.3 that this limit exists, is unique, and minimizes a normalized version of the SSL regularized loss $\ell_{\Omega_{\mu_{\sigma,\varrho}}}$. We then use $F^\star$ to train a multi-class logistic classifier based on the known labels $i \in T$ and their new embedding $F^\star$. Thus, unlike GNNs, the training phase and the computation of the embedding $F^\star$ are decoupled, which makes the method much faster. The overall SSL algorithm is detailed in Algorithm 1. Note that the convergence of (7) is not trivial, due to the nonlinearity of $\mathcal{L}$. We further comment on this issue in the appendix.

**Algorithm 1** HyperND: Nonlinear Hypergraph Diffusion
1. **Input:** incidence matrix $K$ and weight matrix $W$; mixing mappings $\sigma, \varrho$ as in (5); normalization mapping $\phi$ as in (10); label matrix $Y \in \{0,1\}^{n \times c}$ and feature matrix $X \in \mathbb{R}^{n \times d}$; regularization coefficient $\alpha \in (0, 1)$ and stopping tolerance $\mathrm{tol}$.
2. Shift and scale $U = [Y\ X]$ to obtain $U > 0$, e.g., via (12) when $X \geq 0$.
3. $U \leftarrow U/\phi(U)$; $F^{(0)} \leftarrow U$.
4. **repeat** $G \leftarrow \alpha \mathcal{L}(F^{(k)}) + (1 - \alpha)U$; $F^{(k+1)} \leftarrow G/\phi(G)$ **until** $\|F^{(k+1)} - F^{(k)}\| / \|F^{(k+1)}\| < \mathrm{tol}$.
5. New node embedding: $F^\star = [Y^\star\ X^\star] \leftarrow F^{(k+1)}$.
6. Optimize $\Theta$ for $Z^\star = \mathrm{softmax}(F^\star \Theta)$ to minimize cross-entropy on the training nodes.
7. **Output:** prediction $\arg\max_c (Z^\star)_{i,c}$ for the class of each unlabeled node $i$.
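A minimal NumPy sketch of Algorithm 1 is given below, using the $p$-power-mean choices (5) for $\sigma$ and $\varrho$. The normalization is our reading of $\phi$ in (10), namely the square root of $\sum_e w(e)\,\|\mu_{\sigma,\varrho}(\cdot)\|^2$; the matrices `K`, `W`, `D` are from the earlier sketches. This is only meant to make the iteration concrete, not to reproduce the authors' reference implementation (linked in Section 5).

```python
import numpy as np

def hyper_nd(K, W, D, Y, X, p=2.0, alpha=0.9, eps=1e-6, tol=1e-6, max_iter=500):
    """Sketch of Algorithm 1 (HyperND) with the p-power-mean maps (5)."""
    w = np.diag(W)                                    # hyperedge weights
    d_isqrt = 1.0 / np.sqrt(np.diag(D))               # D^{-1/2} as a vector
    edge_sizes = K.sum(axis=0)                        # |e|, diagonal of D_E

    def edge_means(F):
        # sigma(K^T rho(D^{-1/2} F)) with rho(Z) = Z^p and sigma(Z) = (D_E^{-1} Z)^{1/p}
        return (K.T @ ((d_isqrt[:, None] * F) ** p) / edge_sizes[:, None]) ** (1.0 / p)

    def L(F):                                         # nonlinear diffusion map (8)
        return d_isqrt[:, None] * (K @ (w[:, None] * edge_means(F)))

    def phi(F):                                       # our reading of the normalization (10)
        return np.sqrt(np.sum(w[:, None] * edge_means(F) ** 2))

    U = (1 - eps) * np.hstack([Y, X]) + eps           # label smoothing (12), requires X >= 0
    U = U / phi(U)
    F = U.copy()
    for _ in range(max_iter):
        G = alpha * L(F) + (1 - alpha) * U
        F_new = G / phi(G)
        if np.linalg.norm(F_new - F) < tol * np.linalg.norm(F_new):
            break
        F = F_new
    return F_new                                      # F* = [Y* X*], fed to a linear classifier
```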
### 4.2. Related Nonlinear Diffusion Models

Our nonlinear diffusion process (7) propagates both the input node labels and the feature embeddings through the hypergraph in a manner similar to (2), but allows for nonlinear activations, which increases modeling power. First, $\mathcal{L}$ generalizes the normalized adjacency matrix $A_H$ of the clique expansion of the hypergraph used in (2). Indeed, for a standard graph, i.e., a hypergraph where all edges have exactly two nodes, $K W K^\top = A + D$, where $A$ is the adjacency matrix of the graph and $D$ is the diagonal matrix of the weighted node degrees. Similarly, for a general hypergraph $H$, we have the identity $K W K^\top = A_H + D$, where $A_H$ is the adjacency matrix of the clique-expanded graph associated with $H$. Then, when $\sigma = \varrho = \mathrm{id}$, $\mathcal{L}$ coincides with
$$D^{-1/2} K W K^\top D^{-1/2} = D^{-1/2} A_H D^{-1/2} + I, \tag{9}$$
i.e., up to the identity shift, the normalized adjacency matrix of the clique expansion of the hypergraph (Zhou et al., 2007; Feng et al., 2019) used in (2). When $\sigma$ and $\varrho$ are not linear, $\mathcal{L}$ can represent a broad family of nonlinear diffusion mappings, with special cases used in different contexts. For example, in the graph case, if $\varrho = \mathrm{id}$ and $\sigma(x) = |x|^{p-1}\,\mathrm{sign}(x)$, then $\mathcal{L}$ boils down to the graph $p$-Laplacian operator (Saito et al., 2018; Elmoataz et al., 2008; Bühler & Hein, 2009). Exponential and logarithmic choices of $\sigma$ and $\varrho$ give rise to nonlinear Laplacians used to model chemical reactions (Rao et al., 2013; Van der Schaft et al., 2016) as well as consensus dynamics and opinion formation on hypergraphs (Neuhäuser et al., 2022). Trigonometric functions such as $\sigma(x) = \sin(x)$ are used to model graph and hypergraph oscillators (Schaub et al., 2016; Battiston et al., 2021; Millán et al., 2020).

In the context of semi-supervised learning, nonlinear diffusion mappings based on entrywise powers and generalized means are used, for example, in (Ibrahim & Gleich, 2019; Tudisco et al., 2021). Similarly to our proposed operator, the diffusion map of (Tudisco et al., 2021) generalizes classical linear label spreading to 3-uniform hypergraphs by considering an adjacency-tensor-based model that combines labels on hyperedges via $p$-power means. Our proposed diffusion operator (8) extends the one proposed by Tudisco et al. (2021): when restricted to uniform hypergraphs and for $\varrho = \mathrm{id}$, the two mappings coincide, modulo relatively minor adjustments. However, extending the model proposed there to hyperedges of arbitrary size would require splitting the hyperedges into batches of equal size and computing the corresponding adjacency tensors. Compared to our incidence-matrix model, this is computationally significantly more demanding, both because it requires the computation of several high-order tensors and because multiplication operations with tensors are more expensive than those with matrices (which use fast BLAS routines from, e.g., NumPy), especially when used to propagate high-dimensional features rather than only labels.

### 4.3. Convergence

The convergence of (7) is discussed in Theorem 4.1 below, where we show that, under mild assumptions on $\sigma$ and $\varrho$, (7) always converges, provided the embedding is suitably normalized at each step. Moreover, the result shows that the nonlinear hypergraph filter $\mathcal{L}$ eventually generates an embedding that minimizes a regularized loss function of the form (1), with regularization term $\Omega = \Omega_{\mu_{\sigma,\varrho}}$. The proof of Theorem 4.1 is based on Thm. 3.1 of Tudisco et al. (2021) and is deferred to the appendix.

In the following, we write $F \geq 0$ (resp. $F > 0$) to indicate that $F$ has nonnegative (resp. positive) entries. Moreover, we write $\sigma \in \hom^+(a)$ to denote that $\sigma$ is positive and homogeneous of degree $a$, i.e., that the following holds for $\sigma$: (1) $\sigma(F) > 0$ for all $F > 0$; and (2) $\sigma(\lambda F) = \lambda^a \sigma(F)$ for all $\lambda > 0$ and all $F > 0$. The class of operators $\hom^+(a)$ is quite general. It includes, for example, different forms of leaky ReLU functions $\sigma(F) = \max\{0, F^a\} - \varepsilon\,\max\{0, -F^a\}$, as well as the family of homogeneous nonnegative generalized polynomial (polynomials with real powers) operators, defined entrywise as
$$\sigma(F)_i = \sum_j B^{(i,j)} f_1^{\alpha^{(i,j)}_1} \cdots f_n^{\alpha^{(i,j)}_n}$$
for any nonnegative coefficients $\alpha^{(i,j)}_k$ and nonnegative $B^{(i,j)}$, as long as $\alpha^{(i,j)}_1 + \cdots + \alpha^{(i,j)}_n = a$ for all $i, j$, i.e., the sum of the powers is constant, so that $\sigma(\lambda F) = \lambda^a \sigma(F)$ for all $\lambda > 0$.
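As a quick sanity check of the homogeneity assumption, the snippet below verifies numerically that the $p$-power-mean choices (5) give $\varrho \in \hom^+(p)$ and $\sigma \in \hom^+(1/p)$, and hence a one-homogeneous composition ($ab = 1$). The check itself is ours, not part of the paper; `D_E_inv` is a placeholder positive diagonal standing in for $D_E^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, lam = 3.0, 2.5
Z = rng.random((6, 4)) + 0.1                      # strictly positive test matrix
D_E_inv = np.diag(1.0 / rng.integers(2, 5, 6))    # placeholder D_E^{-1} (any positive diagonal)

rho = lambda Z: Z ** p                             # rho in hom+(p)
sigma = lambda Z: (D_E_inv @ Z) ** (1.0 / p)       # sigma in hom+(1/p)

assert np.allclose(rho(lam * Z), lam ** p * rho(Z))
assert np.allclose(sigma(lam * Z), lam ** (1.0 / p) * sigma(Z))
# The composition sigma(K^T rho(.)) is homogeneous of degree p * (1/p) = 1,
# which is the condition ab = 1 required by Theorem 4.1 below.
assert np.allclose(sigma(rho(lam * Z)), lam * sigma(rho(Z)))
```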
**Theorem 4.1.** Let $\mathcal{L}$ be defined as in (8) and let $\mu_{\sigma,\varrho}$ be defined as in (4). Let $f_i = F_{i,:}$ denote the $i$-th row of $F$ and define the real-valued function
$$\phi(F) = \Big( \sum_{e \in E} w(e) \Big\| \mu_{\sigma,\varrho}\Big\{ \frac{f_j}{\sqrt{\delta_j}} : j \in e \Big\} \Big\|^2 \Big)^{1/2}. \tag{10}$$
Assume that $\sigma \in \hom^+(a)$ and $\varrho \in \hom^+(b)$ for some $a, b \in \mathbb{R}$. Let $U$ be an entrywise positive input embedding and let $\alpha \in [0, 1]$. If $ab = 1$ and if $\mathcal{L}$ is differentiable and such that $\mathcal{L}(F) \geq \mathcal{L}(\widetilde F)$ for all $F \geq \widetilde F > 0$, then for any starting point $F^{(0)} \geq 0$, the sequence
$$\widetilde F^{(k)} = \alpha\, \mathcal{L}(F^{(k)}) + (1 - \alpha)\, U, \qquad F^{(k+1)} = \widetilde F^{(k)} / \phi(\widetilde F^{(k)}) \tag{11}$$
converges to a unique fixed point $F^\star$ of (7), such that $\phi(F^\star) = 1$ and $F^\star > 0$. Moreover, $F^\star$ is the solution of
$$\min_F \; \Big\| F - \frac{U}{\phi(U)} \Big\|^2 + \lambda\, \Omega_{\mu_{\sigma,\varrho}}(F) \quad \text{subject to } F \geq 0,\ \phi(F) = 1,$$
where $\lambda = \alpha/(1 - \alpha)$.

Note that the choices of $\sigma$ and $\varrho$ in (5), which lead to the $p$-power-mean regularization term (6), and the corresponding diffusion operator $\mathcal{L}$ satisfy all the assumptions of Theorem 4.1.

### 4.4. Algorithm Details and Limitations

Once the new node embedding $F^\star$ is computed via (11), we use it to infer the labels of the unlabeled data points via a simple softmax output layer that minimizes cross-entropy (see Algorithm 1). Similar to HLS, the parameter $\alpha$ in Algorithm 1 yields a convex combination of the diffusion mapping $\mathcal{L}$ and the bias $U$, allowing us to tune the contribution of the homophily along the hyperedges against the one provided by the input features and labels. In other words, in view of Theorem 4.1, $\alpha$ quantifies the strength of the regularization parameter $\lambda$, which allows us to tune the contribution of the regularization term $\Omega_{\mu_{\sigma,\varrho}}$ over the data-fitting term $\|F - U/\phi(U)\|^2$.

A requirement for our main theoretical results is entrywise positivity of the input embedding $U$. While this is a limitation of the theory and the method, it turns out not to be that stringent in practice. If $X \geq 0$, i.e., we have nonnegative node features, we can easily obtain a positive embedding by performing an initial label smoothing (Müller et al., 2019; Szegedy et al., 2016): we choose a small $\varepsilon > 0$ and set
$$U_\varepsilon = (1 - \varepsilon)\,[Y\ X] + \varepsilon\, \mathbb{1}\mathbb{1}^\top, \tag{12}$$
where $\mathbb{1}\mathbb{1}^\top$ is the all-one matrix of the appropriate size. Nonnegative input features $X \geq 0$ are not uncommon: bag-of-words, one-hot encodings, and binary features in general are all nonnegative. In fact, for all of the real-world datasets we consider in our experiments, the features are nonnegative. Similarly, if some of the input features have negative values (e.g., features coming from a word embedding), one could perform other preprocessing (e.g., a shift of all embeddings) to obtain the required $[Y\ X] > 0$.

Another implicit requirement of the proposed method is the assumption that the hypergraph datasets at hand are homophilic. This is a limitation shared by many graph and hypergraph learning algorithms, although relevant heterophilic settings exist in the real world; see, e.g., (Zhu et al., 2020).
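As an illustration of the preprocessing in (12) and of the fallback shift mentioned above for negative features, the snippet below is one possible way to build a strictly positive input embedding $U$; the specific shift used in the negative-feature case is our own choice, the paper only requires that the result be entrywise positive.

```python
import numpy as np

def positive_input_embedding(Y, X, eps=1e-6):
    """Build U > 0 from labels Y and features X, as required by Theorem 4.1."""
    if X.size and X.min() < 0:
        # Negative features (e.g., word embeddings): shift them to be nonnegative first.
        # This particular shift is an illustrative choice, not prescribed by the paper.
        X = X - X.min()
    U = np.hstack([Y, X])
    # Label smoothing (12): U_eps = (1 - eps) [Y X] + eps * all-ones.
    return (1.0 - eps) * U + eps
```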
## 5. Experiments

We now evaluate our method on several real-world hypergraph datasets (Table 1). We use five co-citation and co-authorship hypergraphs: Cora co-authorship, Cora co-citation, Citeseer, Pubmed (Sen et al., 2008) and DBLP (Rossi & Ahmed, 2015). All nodes in these datasets are documents, features are given by the content of the abstract, and hyperedge connections are based on either co-citation or co-authorship. The task for each dataset is to predict the topic to which a document belongs. We also consider a foodweb hypergraph, where the nodes are organisms and hyperedges represent directed carbon exchange in the Florida bay (foo). Here we predict the role of the nodes in the food chain. This hypergraph has no features, so only labels are used for HyperND, while we use a one-hot encoding for the baselines. The code implementing the experiments is available at https://github.com/compile-gssi-lab/HyperND.

Table 1: Details for the real-world hypergraph datasets used in the experiments.

| | DBLP | Pubmed | Cora | Cora | Citeseer | Foodweb |
|---|---|---|---|---|---|---|
| relation | co-authorship | co-citation | co-authorship | co-citation | co-citation | carbon-exchange |
| \|V\| (#nodes) | 43413 | 19717 | 2708 | 2708 | 3312 | 122 |
| \|E\| (#hyperedges) | 22535 | 7963 | 1072 | 1579 | 1079 | 141233 |
| d (#features) | 1425 | 500 | 1433 | 1433 | 3703 | 0 |
| c (#labels) | 6 | 3 | 7 | 7 | 6 | 3 |

We compare our method to six baselines. Two are hypergraph convolutional networks, designed specifically for hypergraphs:

- **HGNN** (Feng et al., 2019). A hypergraph neural network model that uses the clique-expansion Laplacian (Zhou et al., 2007; Agarwal et al., 2006) for the hypergraph convolutional filter.
- **HyperGCN** (Yadati et al., 2019). A hypergraph convolutional network model with regularization similar to the total variation (see also Section 3). There are three architectural variants (1-HyperGCN, FastHyperGCN, HyperGCN), and we report whichever gives the best performance.

One is a hypergraph Laplacian regularization method:

- **HTV** (Zhang et al., 2017). A confidence-interval subgradient-based method that minimizes the 1-Laplacian-inspired loss (1) with $\Omega = \Omega_{TV}$. In our experiments, this method outperformed other label spreading and Laplacian regularization techniques such as the original PDHG strategy of Hein et al. (2013), the HLS method (Zhou et al., 2007), and the p-Laplacian approach of Saito et al. (2018).

The remaining three baselines are graph convolutional networks designed for graph data, which we apply to the clique-expanded version of the considered hypergraphs (a construction sketched after this list):

- **APPNP** (Klicpera et al., 2018). A graph convolutional network model combined with PageRank. The authors introduce a personalized propagation of neural predictions and its approximation, based on the relationship between GCNs and PageRank.
- **SGC** (Wu et al., 2019). A graph convolutional network model without nonlinearities.
- **SCE** (Zhang et al., 2020b). A graph convolutional network model inspired by a sparsest-cut problem, where an unsupervised network embedding is learned using only negative samples for training.
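For completeness, this is one way to form the clique-expanded graph on which the graph baselines (APPNP, SGC, SCE) are run, using the identity $K W K^\top = A_H + D$ from Section 4.2. It is our own illustrative preprocessing, not the baselines' official pipelines.

```python
import numpy as np
import scipy.sparse as sp

def clique_expansion(K, W):
    """Weighted adjacency of the clique-expanded graph: A_H = K W K^T - D."""
    K = sp.csr_matrix(K)
    A = K @ sp.diags(np.asarray(W.diagonal()).ravel()) @ K.T
    A = A - sp.diags(A.diagonal())        # remove the diagonal (hyperdegree) part
    return A                               # feed this graph to APPNP / SGC / SCE
```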
Table 2 reports the size of the training set for each dataset and compares the accuracy (mean ± standard deviation) of HyperND against the different baselines. Precision and recall are reported in Tables 3 and 4 in the appendix. For each dataset, we use five trials with different samples of the training nodes $T$.

Table 2: Accuracy (mean ± standard deviation) over five random samples of the training nodes $T$. We compare HyperND and the six baseline methods (APPNP, HGNN, HyperGCN, SGC, SCE, HTV). Overall, HyperND is more accurate than the baselines.

| Data | % labeled | HyperND | APPNP | HGNN | HyperGCN | SGC | SCE | HTV |
|---|---|---|---|---|---|---|---|---|
| Citeseer | 4.2% | 72.13 ± 1.00 | 63.51 ± 1.39 | 61.78 ± 3.46 | 50.94 ± 8.27 | 52.66 ± 2.18 | 61.28 ± 1.61 | 29.63 ± 0.3 |
| Cora-author | 5.2% | 77.33 ± 1.51 | 71.34 ± 1.60 | 63.11 ± 2.73 | 61.27 ± 1.06 | 30.46 ± 0.22 | 71.96 ± 2.18 | 44.55 ± 0.6 |
| Cora-cit | 5.2% | 83.13 ± 1.11 | 82.08 ± 1.61 | 62.88 ± 2.26 | 62.78 ± 2.73 | 29.08 ± 0.25 | 79.85 ± 1.91 | 35.60 ± 0.8 |
| DBLP | 4.0% | 89.63 ± 0.12 | 88.94 ± 0.07 | 73.82 ± 0.71 | 70.02 ± 0.10 | 43.61 ± 0.17 | 87.50 ± 0.19 | 45.19 ± 0.9 |
| Foodweb | 5.0% | 64.09 ± 5.94 | 69.12 ± 3.30 | 57.09 ± 2.33 | 56.14 ± 3.85 | 57.45 ± 0.47 | 63.50 ± 4.78 | 57.23 ± 0.9 |
| Pubmed | 0.8% | 82.81 ± 2.16 | 81.50 ± 1.18 | 72.57 ± 1.03 | 78.11 ± 0.99 | 54.30 ± 1.11 | 77.57 ± 2.34 | 47.04 ± 0.8 |

All of the algorithms that we use have hyperparameters. For the baselines we use the reported tuned hyperparameters. For all of the neural-network-based models, we use two layers and 200 training epochs, following (Yadati et al., 2019) and (Feng et al., 2019). For our method and HTV, which have no training phase, we run 5-fold cross-validation with label-balanced 50/50 splits to choose $\alpha$ from $\{0.1, 0.2, \dots, 0.9\}$ and $p$ from $\{1, 2, 3, 5, 10\}$. Precisely, we split the data into labeled and unlabeled points. We split the labeled points into training and validation sets of equal size (label-balanced 50/50 splits) and choose the parameters based on the average performance on the validation set over 5 random repeats. Then, we assess the performance on the held-out test set, which comprises all the initially unlabeled points. We repeat this process 5 times and choose the values of $\alpha$ and $p$ that give the best mean accuracy. These values differ across the different random samplings of the training set. Thus, in order to highlight how the performance is affected by different values of $\alpha$ and $p$, Figure 1 shows the mean accuracy over 10 runs of HyperND for all of the considered values of $\alpha$ and $p$.

Figure 1: Performance of the proposed HyperND for varying p and α parameters.

All the datasets we use here have a nonnegative input embedding $[Y\ X]$, which we preprocess via label smoothing as in (12), with $\varepsilon = 10^{-6}$. Our experiments have shown that different choices of $\varepsilon$ do not influence the classification performance of the algorithm. Due to its simple regularization interpretation, we choose $\sigma$ and $\varrho$ to be the $p$-power mean maps in (5), for various $p$. When varying $p$, we change the nonlinear activation functions that define the final embedding $F^\star$.

Our proposed nonlinear diffusion method performs very well overall. Interestingly, the best competitors are not hypergraph-oriented methods but rather graph methods applied directly to the clique-expanded graph. In particular, APPNP is the strongest baseline, and this method also strongly relies on diffusions. Moreover, the poorer performance observed for the foodweb dataset (which has no features) highlights the ability of HyperND to create meaningful feature embeddings, in addition to propagating labels.

We also point out that, since HyperND is a forward model, it can be implemented efficiently. The cost of each iteration of (7) is dominated by the cost of the two matrix-vector products with the matrices $K$ and $K^\top$, both of which only require a single pass over the input data and can be parallelized with standard techniques. Therefore, HyperND scales linearly with the number and size of the hyperedges, i.e., with the size of the data. Thus, it is typically cheap to compute (similar to standard hypergraph label spreading) and is overall faster to train than a two-layer GCN. Training times are reported in Figure 3, where we compare the mean execution time over ten runs for all the methods on DBLP. The execution times are similar on the other datasets. For HyperND, we show the mean execution time over the five random choices of $p$. HyperND is up to two orders of magnitude faster than the baselines.

Figure 3: Execution time on the largest dataset, DBLP (for one hyper-parameter setting in each case). All methods are comparable on small datasets.
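The final step of Algorithm 1, and of the embedding comparison that follows, trains a simple linear classifier on the diffused embedding using only the labeled nodes. A minimal version with scikit-learn's multinomial logistic regression might look as follows; the paper specifies only a softmax/cross-entropy output layer, so the particular solver and its settings here are our own choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_and_predict(F_star, labels, train_idx):
    """Train a multinomial logistic classifier on the diffused embedding F* (rows = nodes)."""
    clf = LogisticRegression(max_iter=1000)           # softmax + cross-entropy, as in Algorithm 1
    clf.fit(F_star[train_idx], labels[train_idx])     # only the labeled training nodes T
    return clf.predict(F_star)                        # predicted class for every node

# Example usage: preds = fit_and_predict(F_star, y_all, T); accuracy is measured on nodes outside T.
```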
To further highlight the learning capabilities of the diffusion map we propose, we present one more test. The diffusion map (7) generates a new embedding $F^\star = [Y^\star\ X^\star]$ from the fixed point of a purely forward model. This model yields a new feature-based representation $X^\star$, similar to the last-layer embedding of a neural network approach. A natural question is whether or not $X^\star$ is actually a better embedding. To this end, in the next experiment we consider four node embeddings $F$ and train a classifier via cross-entropy minimization of $Z = \mathrm{softmax}(F\Theta)$, optimizing $\Theta$. Specifically, we consider the following:

- (E1) $F = Y^\star$. We run a nonlinear, purely label-spreading iteration by setting $U = Y$ in (7). By Theorem 4.1, this embedding is a Laplacian regularization method.
- (E2) $F = [Y\ X_{\mathrm{hgcn}}]$, where $X_{\mathrm{hgcn}}$ is the embedding generated by HyperGCN before the softmax.
- (E3) $F = F^\star = [Y^\star\ X^\star]$, the limit point of (7) computed by our HyperND. This is the embedding used for the results in Table 2 and Figure 1.
- (E4) $F = [Y^\star\ X^\star\ X_{\mathrm{hgcn}}]$. This combines the representations of our HyperND and HyperGCN.

Figure 2 shows the accuracy for these embeddings with various values of $p$ for the $p$-power mean in HyperND. The best performance is obtained by the two embeddings that contain our learned features $X^\star$: (E3) and (E4). In particular, while (E4) includes the final-layer embedding of HyperGCN, it does not improve accuracy over (E3).

Figure 2: Accuracy (mean and standard deviation) of a multinomial logistic regression classifier, using different combinations of features obtained from embeddings (E1)-(E4), on dblp, pubmed, cora-author, cora-cit and citeseer.

## 6. Conclusions

Graph neural networks and hypergraph label spreading are two distinct techniques with different advantages for semi-supervised learning with higher-order relational data. The proposed diffusion approach (HyperND) combines advantages from both: feature-based learning, modeling flexibility, label-based regularization, and computational speed. Importantly, we can prove that the diffusion converges to an embedding that is the global minimizer of an interpretable regularized loss function which enforces small variance across the hyperedges, and we have an algorithm that computes this optimal embedding. Overall, HyperND outperforms a variety of baseline neural network and label-spreading methods on several datasets.

## References

Florida bay trophic exchange matrix. http://vlado.fmf.uni-lj.si/pub/networks/data/bio/foodweb/Florida.paj.

Agarwal, S., Branson, K., and Belongie, S. Higher order learning with graphs. In Proceedings of the 23rd International Conference on Machine Learning, pp. 17-24, 2006.

Battiston, F., Cencetti, G., Iacopini, I., Latora, V., Lucas, M., Patania, A., Young, J.-G., and Petri, G. Networks beyond pairwise interactions: Structure and dynamics. Physics Reports, 874:1-92, 2020.

Battiston, F., Amico, E., Barrat, A., Bianconi, G., Ferraz de Arruda, G., Franceschiello, B., Iacopini, I., Kéfi, S., Latora, V., Moreno, Y., et al. The physics of higher-order interactions in complex systems. Nature Physics, 17(10):1093-1098, 2021.

Benson, A. R., Gleich, D. F., and Leskovec, J.
Higher-order organization of complex networks. Science, 353(6295): 163 166, 2016. Benson, A. R., Abebe, R., Schaub, M. T., Jadbabaie, A., and Kleinberg, J. Simplicial closure and higher-order link prediction. Proceedings of the National Academy of Sciences, 115(48):E11221 E11230, 2018. B uhler, T. and Hein, M. Spectral clustering based on the graph p-Laplacian. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 81 88, 2009. Chan, T.-H. H. and Liang, Z. Generalizing the hypergraph laplacian via a diffusion process with mediators. Theoretical Computer Science, 806:416 428, 2020. Chan, T.-H. H., Louis, A., Tang, Z. G., and Zhang, C. Spectral properties of hypergraph laplacian and approximation algorithms. Journal of the ACM, 65(3):1 48, 2018. Chien, E., Pan, C., Peng, J., and Milenkovic, O. You are All Set: A Multiset Function Framework for Hypergraph Neural Networks. International Conference on Learning Representations (ICLR), 2022. Chin, A., Chen, Y., M. Altenburger, K., and Ugander, J. Decoupled smoothing on graphs. In The World Wide Web Conference, pp. 263 272, 2019. Dong, Y., Sawin, W., and Bengio, Y. Hnhn: Hypergraph networks with hyperedge neurons. ICML Graph Representation Learning and Beyond Workshop, 2020. URL https://arxiv.org/abs/2006.12278. Elmoataz, A., Lezoray, O., and Bougleux, S. Nonlocal discrete regularization on weighted graphs: a framework for image and manifold processing. IEEE transactions on Image Processing, 17(7):1047 1060, 2008. Feng, Y., You, H., Zhang, Z., Ji, R., and Gao, Y. Hypergraph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3558 3565, 2019. Gleich, D. F. and Mahoney, M. W. Using local spectral methods to robustify graph-based learning algorithms. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 359 368, 2015. Hein, M., Setzer, S., Jost, L., and Rangapuram, S. S. The total variation on hypergraphs - learning on hypergraphs revisited. In Advances in Neural Information Processing Systems, pp. 2427 2435, 2013. Huang, J. and Yang, J. Uni GNN: A unified framework for graph and hypergraph neural networks. International Joint Conference on Artificial Intelligence (IJCAI), 2021. Nonlinear Feature Diffusion on Hypergraphs Huang, Q., He, H., Singh, A., Lim, S.-N., and Benson, A. R. Combining label propagation and simple models out-performs graph neural networks. International Conference on Learning Representations (ICLR), 2020. Ibrahim, R. and Gleich, D. Nonlinear diffusion for community detection and semi-supervised learning. In The World Wide Web Conference, pp. 739 750, 2019. Jia, J. and Benson, A. R. A unifying generative model for graph learning algorithms: Label propagation, graph convolutions, and combinations. SIAM Journal on Mathematics of Data Science, 4(1):100 125, 2022. Juan, D.-C., Lu, C.-T., Li, Z., Peng, F., Timofeev, A., Chen, Y.-T., Gao, Y., Duerig, T., Tomkins, A., and Ravi, S. Ultra fine-grained image semantic embedding. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 277 285, 2020. Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017. Klicpera, J., Bojchevski, A., and G unnemann, S. Predict then propagate: Graph neural networks meet personalized Page Rank. In International Conference on Learning Representations (ICLR), 2018. 
Kyng, R., Rao, A., Sachdeva, S., and Spielman, D. A. Algorithms for lipschitz learning on graphs. In Conference on Learning Theory, pp. 1190 1223, 2015. Li, P. and Milenkovic, O. Inhomogeneous hypergraph clustering with applications. Advances in Neural Information Processing Systems, 2017:2309 2319, 2017. Li, P. and Milenkovic, O. Submodular hypergraphs: plaplacians, cheeger inequalities and spectral clustering. In International Conference on Machine Learning, pp. 3014 3023. PMLR, 2018. Li, P., He, N., and Milenkovic, O. Quadratic decomposable submodular function minimization: Theory and practice. Journal of Machine Learning Research, 21(106):1 49, 2020. Liu, M., Veldt, N., Song, H., Li, P., and Gleich, D. F. Strongly local hypergraph diffusions for clustering and semi-supervised learning. In Proceedings of the Web Conference 2021, pp. 2092 2103, 2021. Louis, A. Hypergraph markov operators, eigenvalues and approximation algorithms. In Proceedings of the fortyseventh annual ACM symposium on Theory of computing, pp. 713 722, 2015. Mallat, S. A wavelet tour of signal processing. Elsevier, 1999. Mc Pherson, M., Smith-Lovin, L., and Cook, J. M. Birds of a feather: Homophily in social networks. Annual review of sociology, 27(1):415 444, 2001. Mill an, A. P., Torres, J. J., and Bianconi, G. Explosive higher-order Kuramoto dynamics on simplicial complexes. Physical Review Letters, 124(21):218301, 2020. M uller, R., Kornblith, S., and Hinton, G. E. When does label smoothing help? In Advances in Neural Information Processing Systems, pp. 4696 4705, 2019. Neuh auser, L., Lambiotte, R., and Schaub, M. T. Consensus dynamics and opinion formation on hypergraphs. In Higher-Order Systems, pp. 347 376. Springer, 2022. Newman, M. E. Assortative mixing in networks. Physical review letters, 89(20):208701, 2002. Rao, S., van der Schaft, A., and Jayawardhana, B. A graphtheoretical approach for the analysis and model reduction of complex-balanced chemical reaction networks. Journal of Mathematical Chemistry, 51(9):2401 2422, 2013. Rossi, R. and Ahmed, N. The network data repository with interactive graph analytics and visualization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015. Saito, S., Mandic, D. P., and Suzuki, H. Hypergraph p-Laplacian: A differential geometry view. In Thirty Second AAAI Conference on Artificial Intelligence, 2018. Schaub, M. T., O Clery, N., Billeh, Y. N., Delvenne, J.-C., Lambiotte, R., and Barahona, M. Graph partitions and cluster synchronization in networks of oscillators. Chaos: An Interdisciplinary Journal of Nonlinear Science, 26(9): 094821, 2016. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI magazine, 29(3):93 93, 2008. Srinivasan, B., Zheng, D., and Karypis, G. Learning over families of sets-hypergraph representation learning for higher order tasks. In SIAM International Conference on Data Mining (SDM), pp. 756 764. SIAM, 2021. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818 2826, 2016. Torres, L., Blevins, A. S., Bassett, D., and Eliassi-Rad, T. The why, how, and when of representations for complex systems. SIAM Review, 63(3):435 485, 2021. Nonlinear Feature Diffusion on Hypergraphs Tudisco, F. and Hein, M. A nodal domain theorem and a higher-order Cheeger inequality for the graph p Laplacian. 
Journal of Spectral Theory, 8(3):883 909, 2018. Tudisco, F. and Higham, D. J. Node and edge nonlinear eigenvector centrality for hypergraphs. Communications Physics, 4(1):1 10, 2021. Tudisco, F., Benson, A. R., and Prokopchik, K. Nonlinear higher-order label spreading. In Proceedings of the Web Conference, 2021. Van der Schaft, A., Rao, S., and Jayawardhana, B. A network dynamics approach to chemical reaction networks. International Journal of Control, 89(4):731 745, 2016. Veldt, N., Benson, A. R., and Kleinberg, J. Minimizing localized ratio cut objectives in hypergraphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1708 1718, 2020. Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., and Weinberger, K. Simplifying graph convolutional networks. In International Conference on Machine Learning (ICML), pp. 6861 6871. PMLR, 2019. Yadati, N., Nimishakavi, M., Yadav, P., Nitin, V., Louis, A., and Talukdar, P. Hypergcn: A new method for training graph convolutional networks on hypergraphs. Advances in Neural Information Processing Systems, 32: 1511 1522, 2019. Zhang, C., Hu, S., Tang, Z. G., and Chan, T. H. Re-revisiting learning on hypergraphs: confidence interval and subgradient method. In International Conference on Machine Learning, pp. 4026 4034. PMLR, 2017. Zhang, R., Zou, Y., and Ma, J. Hyper-sagnn: a self-attention based graph neural network for hypergraphs. In International Conference on Learning Representations (ICLR), 2020a. Zhang, S., Huang, Z., Zhou, H., and Zhou, Z. Sce: Scalable network embedding from sparsest cut. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Jul 2020b. doi: 10.1145/3394486.3403068. URL http: //dx.doi.org/10.1145/3394486.3403068. Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Sch olkopf, B. Learning with local and global consistency. In Advances in neural information processing systems, pp. 321 328, 2004. Zhou, D., Huang, J., and Sch olkopf, B. Learning with hypergraphs: Clustering, classification, and embedding. In Advances in neural information processing systems, pp. 1601 1608, 2007. Zhu, J., Yan, Y., Zhao, L., Heimann, M., Akoglu, L., and Koutra, D. Beyond homophily in graph neural networks: Current limitations and effective designs. In Advances in Neural Information Processing Systems, volume 33, pp. 7793 7804, 2020. Zhu, X., Ghahramani, Z., and Lafferty, J. D. Semisupervised learning using gaussian fields and harmonic functions. In International Conference on Machine Learning (ICML), pp. 912 919, 2003. Nonlinear Feature Diffusion on Hypergraphs A. On the convergence of the nonlinear diffusion. Our classification method is based on the nonlinear diffusion process on the hypergraph described in (7) and (8). Unlike the linear case, where the long term behaviour of the discrete dynamical systems can be easily studied, when linear mappings are combined with nonlinearities, neither the existence nor the uniqueness of a limit point is obvious for the nonlinear discrete diffusion model (7). This makes the convergence result in Theorem 4.1 particularly remarkable. In fact, already in the vector case, it is easy to find non convergent normalized iterates of the type exk+1 = L(xk) + y, xk+1 = exk+1/ exk+1 , or iterates with multiple fixed points. 
Consider, for example the two iterations zk+1 = A(Azk)1.5 + y , and ( exk+1 = A(Axk)1.5 + y xk+1 = exk+1/ exk+1 (13) where the power is taken component-wise and where A and y are 3-dimensional and chosen as follows: 0 1 0 0 0 1 1 0 0 0.1 0.2 0.3 None of the two sequences in (13) converge for most starting points x0. Figure 4 illustrates this issue by showing the behaviour of the three coordinates of the iteration xk as in (13), for a random sampled starting point x0, sampled from a uniform distribution in [0, 1]3. We observed the same behaviour for the sequence zk and for any starting point sampled this way. 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 Figure 4: Example of a nonconvergent nonlinear diffusion. Each panel shows the behaviour of one of the three coordinates of xk in (13) as a function of k, for k = 1, . . . , 100. B. Proof of Theorem 4.1. Consider the iteration in (11). As σ hom+(a) and ϱ hom+(b) we have L(λF) = D 1/2KWσ(K ϱ(λD 1/2F)) = D 1/2KWσ(λb K ϱ(D 1/2F)) = λab D 1/2KWσ(K ϱ(D 1/2F)) = λL(F) i.e. L is one-homogeneous. Moreover, as K and D are nonnegative matrices, L hom+(1). Similarly, as every node appears in at least one hyperedge, we see that under the assumptions on σ and ϱ it holds ϕ(F) > 0 for all F > 0. Thus, a similar computation as the one above shows that ϕ(F) hom+(1). As a consequence, for any F with ϕ(F) = 1, every Nonlinear Feature Diffusion on Hypergraphs component of L(F) is bounded and positive, i.e. there exist constants Mi,j > 0 such that max F :ϕ(F )=1 L(F)ij = max F L(F)ij Hence, defining M = maxij Mij > 0 we have that L(F) M entrywise, for all F such that ϕ(F) = 1. As a consequence, since U is entrywise positive, there exists a constant f M > 0 such that L(F) f MU entrywise, for all F such that ϕ(F) = 1. Using Theorem 3.1 in (Tudisco et al., 2021) we deduce that F (k) F as k with F unique fixed point F = αL(F ) + (1 α)U such that ϕ(F ) = 1 and F > 0. Now we show that F is also the only point where the gradient of eℓ(F) := F U ϕ(U) 2 + λΩµσ,ϱ(F) vanishes. To this end, denoted by S(F) is the m (c + d) matrix of the hyperedge embedding S(F) = σ(K ϱ(F)) and let B(F) = KWS(F) = KWσ(K ϱ(F)). Notice that, with this notation, as observed in (4), we can write µσ,ϱ({fi : i e}) = S(F)e,: . As a consequence we get Ωµσ,ϱ(F) = X e:i e w(e) (D 1/2F)i,: 1 2S(D 1/2F)e,: 2 , where, as it will be more convenient in the computation below, we are multiplying the µe term in the loss by 1/2. Of course we can always do this by rescaling one of the two mappings σ or ϱ by a factor two, without losing any of their relevant properties nor generality in the proof. Thus, we have Ωµσ,ϱ(D1/2F) = X e:i e w(e) X e:i e w(e) X j (F 2 ij Fij S(F)ej) + 1 j w(e)S(F)2 ej j F 2 ijδi Fij B(F)ij + ϕ(D1/2F)2 = F, DF B(F) + ϕ(D1/2F)2 , where , denotes the matrix Frobenius scalar product. Therefore, it holds Ωµσ,ϱ(F) ϕ(F)2 = F, F D 1/2B(D 1/2F) = F, F L(F) . As L is 1-homogeneous and differentiable, by the Euler theorem for homogeneous functions we have that d d F {Ωµσ,ϱ(F) ϕ(F)2} = d d F F, F L(F) = 2(F L(F)) . d d F {eℓ(F) λϕ(F)2} = 2(F U/ϕ(U) + λ(F L(F)) = 2((1 + λ)F λL(F) U/ϕ(U)) which shows that the gradient of eℓ(F) λϕ(F)2 vanishes on a point F if and only if F is such that F = λ 1 + λL(F ) + 1 1 + λ U ϕ(U) i.e. F is a fixed point of (11) for λ = α/(1 α) and U = U/ϕ(U). Finally, as the two loss functions eℓ(F) and eℓ(F) λϕ(F) have the same minimizers on the slice {F : ϕ(F) = 1}, we conclude. Nonlinear Feature Diffusion on Hypergraphs C. Additional results. 
We present here additional performance results on the real-world datasets and the baseline methods considered in 5. In particular, we report mean precision and mean recall with their deviation of all the methods considered. Table 3: Precision (mean and standard deviation) over five random samples of the training nodes T . We compare Hyper ND and the five baseline methods (APPNP, HGNN, Hyper GCN, SGC, SCE). Overall, Hyper ND performs better than the others. Method HOLS APPNP HGNN Hyper GCN SGC SCE HTV Data % labeled Citeseer 4.2% 65.02 0.98 56.76 1.92 56.05 2.29 67.94 9.56 54.75 8.74 55.66 1.45 41.58 1.5 Cora-author 5.2% 74.81 1.66 68.15 3.12 58.62 1.89 67.72 1.39 60.7 21.98 70.72 1.02 66.72 2.42 Cora-cit 5.2% 81.44 2.18 80.74 2.2 58.17 1.64 74.19 3.44 37.88 7.31 79.25 1.3 43.38 1.2 DBLP 4.0% 87.88 0.28 88.3 0.14 71.45 1.75 85.68 0.18 61.34 0.29 86.84 0.3 60.57 2.05 Foodweb 5.0% 65.69 19.92 58.73 6.85 60.04 21.68 54.61 5.58 57.84 4.49 47.54 8.6 59.93 3.57 Pubmed 0.8% 81.52 2.73 81.63 0.74 71.78 1.68 76.73 1.06 67.62 1.4 77.67 2.04 64.58 7.59 Table 4: Recall (mean and standard deviation) over five random samples of the training nodes T . We compare Hyper ND and the five baseline methods (APPNP, HGNN, Hyper GCN, SGC, SCE). Overall, Hyper ND performs better than the others. Method Hyper ND APPNP HGNN Hyper GCN SGC SCE HTV Data % labeled Citeseer 4.2% 66.52 2.29 58.64 1.22 60.38 2.04 48.22 11.8 66.46 5.95 55.96 1.83 51.23 1.73 Cora-author 5.2% 76.76 2.81 70.13 1.65 62.92 1.46 63.57 1.39 55.08 14.43 69.92 2.62 59.58 1.15 Cora-cit 5.2% 82.56 0.74 81.47 1.85 61.43 3.12 63.91 2.64 71.23 6.86 78.98 2.05 64.07 2.21 DBLP 4.0% 88.52 0.11 88.41 0.23 74.0 0.75 70.99 0.22 77.42 0.03 87.32 0.3 73.85 1.27 Foodweb 5.0% 62.94 10.37 73.85 9.54 51.05 17.92 44.44 7.4 57.45 0.47 57.77 17.33 49.48 1.03 Pubmed 0.8% 81.21 1.8 80.49 1.43 72.9 0.75 78.88 1.43 57.51 5.94 77.52 1.74 58.93 1.91