# Contrastive Laplacian Eigenmaps

Hao Zhu, Ke Sun, Piotr Koniusz*
Data61/CSIRO, Australian National University
allenhaozhu@gmail.com, sunk@ieee.org, piotr.koniusz@data61.csiro.au

*The corresponding author. Code: https://github.com/allenhaozhu/COLES.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract

Graph contrastive learning attracts/disperses node representations for similar/dissimilar node pairs under some notion of similarity. It may be combined with a low-dimensional embedding of nodes to preserve intrinsic and structural properties of a graph. In this paper, we extend the celebrated Laplacian Eigenmaps with contrastive learning, and call them COntrastive Laplacian EigenmapS (COLES). Starting from a GAN-inspired contrastive formulation, we show that the Jensen-Shannon divergence underlying many contrastive graph embedding models fails under disjoint positive and negative distributions, which may naturally emerge during sampling in the contrastive setting. In contrast, we demonstrate analytically that COLES essentially minimizes a surrogate of the Wasserstein distance, which is known to cope well under disjoint distributions. Moreover, we show that the loss of COLES belongs to the family of so-called block-contrastive losses, previously shown to be superior to the pair-wise losses typically used by contrastive methods. We show on popular benchmarks/backbones that COLES offers favourable accuracy/scalability compared to DeepWalk, GCN, Graph2Gauss, DGI and GRACE baselines.

1 Introduction

Celebrated graph embedding methods, including Laplacian Eigenmaps [5] and IsoMap [42], reduce the dimensionality of the data by assuming that it lies on a low-dimensional manifold. The objective functions used in studies [5, 42] model the pairwise node similarity [7] by encouraging the embeddings of nodes to lie close in the embedding space if the nodes are closely related. In other words, such penalties do not guarantee that unrelated graph nodes are separated from each other in the embedding space. For instance, Elastic Embedding [8] uses data-driven affinities for the so-called local distance term and a data-independent repulsion term.

In contrast, modern graph embedding models, often unified under the Sampled Noise Contrastive Estimation (Sampled NCE) framework [33, 28] and extended to graph learning [41, 15, 50], enjoy contrastive objectives. Deep Graph Infomax (DGI) [44] is a contrastive method that maximizes the mutual information between patch representations and high-level summaries of the graph. GraphSAGE [15] minimizes/maximizes distances between so-called positive/negative pairs, respectively. It relies on the inner product passed through the sigmoid non-linearity, which we argue below is suboptimal.

Thus, we propose a new COntrastive Laplacian EigenmapS (COLES) framework for unsupervised network embedding. COLES, derived from the Sampled NCE framework [33, 28], realizes the negative sampling strategy for Laplacian Eigenmaps. Our general objective is given as:

Θ* = arg max_Θ Tr(f_Θ(X)^⊤ W f_Θ(X)) + β Ω(f_Θ(X)).   (1)

X ∈ ℝ^{n×d} in Eq. (1) is the node feature matrix with d feature dimensions given n nodes, f_Θ(X) ∈ ℝ^{n×d'} is an output of a chosen Graph Neural Network backbone (embeddings to optimize) with feature dimension d', and Θ denotes the network parameters, whereas W ∈ S^n_+ is the difference between the degree-normalized positive and negative adjacency matrices which represent the data graph and some negative graph capturing negative links for contrastive learning. Moreover, β ≥ 0 controls the regularization term Ω(·) whose role is to constrain the ℓ2 norm of network outputs or encourage the so-called incoherence [36] between column vectors. Section 3.1 presents COLES for the Linear Graph Network (LGN) family, in which we take special interest due to their simplicity and agility.
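For concreteness, the following PyTorch sketch evaluates the objective of Eq. (1) for a given backbone output. The tensor names and the particular choice of Ω (a soft orthogonality penalty, one of the options mentioned above) are ours and not taken from the released COLES code.

```python
import torch

def coles_objective(Z: torch.Tensor, W_diff: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Eq. (1): Tr(f(X)^T W f(X)) + beta * Omega(f(X)), to be maximized over Theta.

    Z      : (n, d') backbone output f_Theta(X) for any GNN backbone.
    W_diff : (n, n) difference of degree-normalized positive/negative adjacencies (the W of Eq. (1)).
    """
    trace_term = torch.trace(Z.t() @ W_diff @ Z)
    # One possible regularizer Omega: a soft orthogonality penalty on the outputs;
    # we give it a negative sign here so that maximizing the objective keeps Z^T Z close to I.
    eye = torch.eye(Z.shape[1], device=Z.device, dtype=Z.dtype)
    omega = -torch.linalg.norm(Z.t() @ Z - eye) ** 2
    return trace_term + beta * omega

# Usage with a hypothetical backbone: loss = -coles_objective(backbone(X, A), W_diff); loss.backward()
```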
Figure 1: Densities of dot-product scores ⟨v, u⟩ and ⟨v, u'⟩ (red and blue curves) between the anchor/positive embedding and the anchor/negative embedding (GCN contrastive setting). Left/right figures use two distinct minibatches sampled on Cora. With the small overlap of distributions, many contrastive methods relying on the JS divergence may underperform (see Section 4.1 for details).

By building upon previous studies [28, 2, 48], we show that COLES can be derived by reformulating Sampled NCE into Wasserstein GAN using a GAN-inspired contrastive formulation. This result has a profound impact on the performance of COLES, as the standard contrastive approaches based on the Sampled NCE strategy (i.e., GraphSAGE [15]) turn out to utilize the Jensen-Shannon divergence, which yields a log 2 constant and vanishing gradients for disjoint distributions of positive and negative sampled pairs used for contrastive learning. Figure 1 shows two examples of such nearly disjoint distributions. In contrast, COLES by design avoids the sigmoid in favour of the Radial Basis Function (RBF) non-linearity. We show that such a choice coincides with a surrogate of the Wasserstein distance, which is known for its robustness under poor overlap of distributions, leading to the good performance of COLES. Moreover, we also show that the loss of COLES belongs to the family of so-called block-contrastive losses, which were shown to be superior compared to pair-wise losses [3].

In summary, our contributions are threefold:

i. We derive COLES, a reformulation of the Laplacian Eigenmaps into a contrastive setting, based on the Sampled NCE framework [33, 28].

ii. By using a formulation inspired by GAN, we show that COLES essentially minimizes a surrogate of the Wasserstein distance, as opposed to the Jensen-Shannon (JS) divergence emerging in traditional contrastive learning. Specifically, by showing the Lipschitz continuous nature of our formulation, we prove that our formulation enjoys the Kantorovich-Rubinstein duality for the Wasserstein distance.

iii. We show COLES enjoys a block-contrastive loss known to outperform pair-wise losses [3].

Novelty. We propose a simple way to obtain contrastive parametric graph embeddings which works with numerous backbones. For instance, we obtain spectral graph embeddings by combining COLES with SGC [49] and S2GC [61], which is solved by the SVD decomposition.

2 Preliminaries

Notations. Let G = (V, E) be a simple, connected and undirected graph with n = |V| nodes and m = |E| edges. Let i ∈ {1, ..., n} be the node index of G, and d_j be the degree of node j of G. Let W be the adjacency matrix, and D be the diagonal matrix containing degrees of nodes. Moreover, let X ∈ ℝ^{n×d} denote the node feature matrix where each node v is associated with a feature vector x_v ∈ ℝ^d. Let the normalized graph Laplacian matrix be defined as L = I − D^{−1/2} W D^{−1/2} ∈ S^n_+, a symmetric positive semi-definite matrix. Finally, scalars and vectors are denoted by lowercase regular and bold fonts, respectively. Matrices are denoted by uppercase bold fonts.
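As a concrete reference for this notation, a short NumPy/SciPy helper (our own sketch, not part of the COLES codebase) that forms the degree-normalized adjacency and the normalized Laplacian L = I − D^{−1/2} W D^{−1/2} used throughout the paper:

```python
import numpy as np
import scipy.sparse as sp

def normalized_adjacency_and_laplacian(W: sp.spmatrix):
    """Return (D^{-1/2} W D^{-1/2}, I - D^{-1/2} W D^{-1/2}) for a symmetric adjacency W."""
    W = sp.csr_matrix(W)
    deg = np.asarray(W.sum(axis=1)).ravel()
    with np.errstate(divide="ignore"):
        d_inv_sqrt = 1.0 / np.sqrt(deg)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0   # guard against isolated nodes
    D_inv_sqrt = sp.diags(d_inv_sqrt)
    W_norm = D_inv_sqrt @ W @ D_inv_sqrt     # degree-normalized adjacency
    L = sp.eye(W.shape[0]) - W_norm          # normalized graph Laplacian
    return W_norm, L
```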
2.1 Negative Sampling

Sampled NCE [14, 33, 28], a contrastive learning framework, is used by numerous works [41, 15, 50]. Let p_d(u|v) and p_n(u'|v) be the so-called data and negative distributions given the so-called anchor node v, where u and u' denote the node for a positive and a negative sample, respectively. Let p_d(v) be the anchor distribution. Given some loss components s_Θ(v, u) and s_Θ(v, u') whose role is to evaluate the similarity for pairs (v, u) and (v, u'), the contrastive loss is typically given as:

J(Θ) = E_{v∼p_d(v)} [ E_{u∼p_d(u|v)} s_Θ(v, u) + η E_{u'∼p_n(u'|v)} s_Θ(v, u') ],   (2)

where η > 0 controls the impact of negative sampling. Let u ∈ ℝ^{d'} be the embedding of the node u obtained with an encoder f_Θ(x_u) given parameters Θ, where x_u ∈ ℝ^d is the initial node feature vector. Let u' ∈ ℝ^{d'} and v ∈ ℝ^{d'} be embeddings of nodes u' and v, accordingly. Let s_Θ(u, v) = log σ(u^⊤ v) and s_Θ(u', v) = log(1 − σ(u'^⊤ v)), where σ(·) is the sigmoid function. Subsequently, one obtains the contrastive objective (to be maximized), employed by LINE [41], REFINE [60], GraphSAGE [15] and many other methods according to Yang et al. [50]:

J(Θ) = E_{v∼p_d(v)} [ E_{u∼p_d(u|v)} log σ(u^⊤ v) + η E_{u'∼p_n(u'|v)} log σ(−u'^⊤ v) ].   (3)

In what follows, we argue that the choice of sigmoid for σ(·) leads to negative consequences. Thus, we derive COLES under a different choice of s_Θ(v, u) and s_Θ(v, u').

3 Methodology

In what follows, we depart from the above setting of (typical) contrastive sampling, which results in a derivation of our COntrastive Laplacian EigenmapS (COLES).

3.1 Contrastive Laplacian Eigenmaps

Instead of the log-sigmoid used in s_Θ(v, u) and s_Θ(v, u') of Eq. (3), let us substitute s_Θ(v, u) = log exp(u^⊤ v) = u^⊤ v and s_Θ(v, u') = log exp(−u'^⊤ v) = −u'^⊤ v into Eq. (2), which yields:

J(Θ) = E_{v∼p_d(v)} [ E_{u∼p_d(u|v)} (u^⊤ v) + η E_{u'∼p_n(u'|v)} (−u'^⊤ v) ].   (4)

We assume that variables of the above objective (to maximize) can be constrained (e.g., by the ℓ2 norms to prevent ill-posed solutions) and represented by degree-normalized adjacency matrices. Next, we cast Eq. (4) into the objective of COLES (refer to our Suppl. Material for derivations):

Y* = arg min_{Y: Y^⊤Y=I} Tr(Y^⊤ L Y) − η Σ_{k=1}^{κ} Tr(Y^⊤ L_k^{(−)} Y) = arg max_{Y: Y^⊤Y=I} Tr(Y^⊤ W Y),   (5)

where W = W^{(+)} − η Σ_{k=1}^{κ} W_k^{(−)}, the rows of matrix Y ∈ ℝ^{n×d'} contain the embedding vectors, L_k^{(−)} for k = 1, ..., κ are randomly generated degree-normalized Laplacian matrices capturing the negative sampling, L_k^{(−)} = I − W_k^{(−)} and L = I − W^{(+)}. The scalar 0 ≤ η ≤ 1 ensures that L − η Σ_{k=1}^{κ} L_k^{(−)} ∈ S^n_+ (one could truncate the negative spectrum instead) and controls the impact of W_k^{(−)}. We note that COLES minimizes over the standard Laplacian Eigenmap while maximizing over the randomized Laplacian Eigenmap, which alleviates the lack of negative sampling in the original Laplacian Eigenmaps. However, unlike Laplacian Eigenmaps, we do not optimize over free variables Y but over the network parameters, as in Eq. (1) and (6). Clearly, if η = 0 and Y are free variables, Eq. (5) reduces to standard Laplacian Eigenmaps [5]: Y* = arg min_{Y: Y^⊤Y=I} Tr(Y^⊤ L Y).
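To make Eq. (5) concrete, here is a sketch (with our own helper names; the released implementation may differ) that assembles W = W^{(+)} − η Σ_k W_k^{(−)} from the data graph and κ randomly sampled negative graphs, and recovers the free-variable embeddings Y from the leading eigenvectors of W:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def degree_normalize(A: sp.spmatrix) -> sp.csr_matrix:
    """D^{-1/2} A D^{-1/2} for a symmetric adjacency A."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    with np.errstate(divide="ignore"):
        d = 1.0 / np.sqrt(deg)
    d[np.isinf(d)] = 0.0
    return (sp.diags(d) @ A @ sp.diags(d)).tocsr()

def random_negative_adjacency(n: int, p: float, seed: int) -> sp.csr_matrix:
    """Random negative graph: roughly p*n^2/2 undirected edges sampled uniformly at random."""
    upper = sp.triu(sp.random(n, n, density=p, random_state=seed, format="csr"), k=1)
    upper.data[:] = 1.0
    return (upper + upper.T).tocsr()

def coles_contrastive_matrix(W_pos: sp.spmatrix, eta: float, kappa: int,
                             p_neg: float, seed: int = 0) -> sp.csr_matrix:
    """W = W^(+) - eta * sum_k W_k^(-) of Eq. (5), all degree-normalized."""
    W = degree_normalize(W_pos)
    for k in range(kappa):
        W_neg = degree_normalize(random_negative_adjacency(W_pos.shape[0], p_neg, seed + k))
        W = W - eta * W_neg
    return sp.csr_matrix(W)

def coles_embeddings_free(W: sp.spmatrix, dim: int) -> np.ndarray:
    """Free-variable case of Eq. (5): Y maximizing Tr(Y^T W Y) s.t. Y^T Y = I is given by the
    top-`dim` eigenvectors of W (equivalently, the smallest of L - eta * sum_k L_k^(-))."""
    _, Y = eigsh(W, k=dim, which="LA")
    return Y
```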
COLES for Linear Graph Networks. In what follows, we are especially interested in the lightweight family of LGNs such as SGC [49] and S2GC [61] (APPNP [23] with the linear activation could be another choice) whose COLES-based objective can be reformulated as:

P* = arg max_{P: PP^⊤=I} Tr(P X^⊤ F^⊤ W F X P^⊤).   (6)

F ∈ ℝ^{n×n} in Eq. (6) is the so-called spectral filter operating on the (degree-normalized) graph adjacency matrix, and P ∈ ℝ^{d'×d} is a unitary projection matrix such that 0 < d' < d. The solution to Eq. (6) can be readily obtained by solving the generalized eigenvalue problem X^⊤ F^⊤ W F X p = λ p (an SVD of a small d×d matrix X^⊤ F^⊤ W F X ∈ S^d_+). This step results in a matrix of embeddings f_P(X) = F X P^⊤ ∈ ℝ^{n×d'} for supervised training. Given a degree-normalized graph adjacency matrix W ∈ ℝ^{n×n}, the spectral filters for SGC and S2GC are given as W^K and αI + ((1−α)/K) Σ_{k=1}^{K} W^k, respectively. Here, the integer K ≥ 1 and the scalar α ≥ 0 are the number of layers and the importance of the self-loop. Note that Eq. (6) is related to Locality Preserving Projections [17] if η = 0. Note also that enforcing the orthogonality constraints in Eq. (6) coincides with the SVD-based solution described above. In contrast, the more general form of COLES in Eq. (1) requires the regularization or constraints (depending on the backbone) imposed on minibatches i ∈ B, e.g., we used the soft penalty Ω(f_Θ(X_i)) = ‖f_Θ(X_i)^⊤ f_Θ(X_i) − I‖²_F.

COLES (Stiefel). Inspired by the Locality Preserving Projections [17] and Eq. (6), we also investigate:

(P*, Θ*) = arg max_{P,Θ: PP^⊤=I} Tr(P f_Θ(X)^⊤ W f_Θ(X) P^⊤),   (7)

solved on the Stiefel manifold by GeoTorch [29]. The embedding is: f_P(X) = f_Θ(X) P^⊤ ∈ ℝ^{n×d'}.

4 Theoretical Analysis

4.1 COLES is Wasserstein-based Contrastive Learning

By casting the positive and negative distributions of Sampled NCE as the real and generated data distributions of a GAN, the key idea of this analysis is to (i) cast the traditional contrastive loss in Eq. (3) (used by LINE [41], GraphSAGE [15] and other methods [50]) as a GAN framework, and show that this corresponds to the use of the JS divergence, and (ii) cast the objective of COLES in Eq. (4) as a GAN framework, and show that it corresponds to the use of a surrogate of the Wasserstein distance. The latter outcome is preferable under the vanishing overlap of two distributions, as the JS divergence yields a log(2) constant and vanishing gradients. The Wasserstein distance suffers less from this issue.

For simplicity, consider that the embedding v of the anchor node is given. An embedding vector u is sampled from the real distribution p_r(u) = p_d(u | v), and u' is sampled from the generator distribution p_g(u') = p_n(u' | v). Following Arjovsky et al. [2] and Weng [48], one arrives at a GAN-inspired formulation which depends on the choice of the discriminator D(u):

∫ p_r(u) log(D(u)) + p_g(u) log(1 − D(u)) du ≤ 2 JS(p_r ‖ p_g) − 2 log 2,   (8)

where JS(p_r ‖ p_g) denotes the Jensen-Shannon (JS) divergence. If D(u) is completely free, then the optimal D*(u) which maximizes the left-hand side (LHS) of Eq. (8) is D*(u) = p_r(u)/(p_r(u) + p_g(u)). Plugging D* back into the LHS, we get the right-hand side (RHS) of the inequality. In our setting, the case p_g ≈ p_r means that negative sampling yields hard negatives, that is, negative and positive samples are very similar. Hence, this family of embedding techniques tries to optimally discriminate p_r and p_g in the embedding space. The above analysis shows that traditional contrastive losses are bounded by the JS divergence.
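The contrast between the two divergences can be checked numerically. The sketch below (our own illustration, not an experiment from the paper) draws two 1-D score distributions with a growing gap and shows that the JS divergence saturates near log 2 while the Wasserstein-1 distance keeps growing with the separation:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(0)
bins = np.linspace(-25, 25, 200)
for gap in [0.5, 2.0, 5.0, 10.0]:                      # separation of score distributions
    pos = rng.normal(loc=gap, scale=1.0, size=10000)   # stand-in for anchor-positive scores u^T v
    neg = rng.normal(loc=-gap, scale=1.0, size=10000)  # stand-in for anchor-negative scores u'^T v
    p_hist, _ = np.histogram(pos, bins=bins, density=True)
    q_hist, _ = np.histogram(neg, bins=bins, density=True)
    print(f"gap={gap:4.1f}  JS={js_divergence(p_hist, q_hist):.3f} (log 2 = {np.log(2):.3f})  "
          f"W1={wasserstein_distance(pos, neg):.2f}")
```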
Regardless of the choice of D(u), if the support of the density p_r and the support of p_g are disjoint (e.g., positive and negative samples in the minibatch of the SGD optimization), the JS divergence saturates at log 2, so the objective in Eq. (8) yields zero and vanishing gradients. If the discriminator is set to D(u) = σ(u^⊤ v), the objective in Eq. (8) becomes exactly Eq. (3). By noting that ∂ log σ(u^⊤ v)/∂u = σ(−u^⊤ v) v, the gradient is likely to vanish due to the scalar σ(−u^⊤ v) and does not contribute to the learning of network parameters. Figure 1 shows densities of x = u^⊤ v and x = u'^⊤ v for p_r and p_g estimated by the Parzen window on two sampled minibatches of a contrastive GCN. Clearly, these distributions are approximately disjoint.

Compared with the JS divergence, the Wasserstein distance considers the metric structure of the embedding space:

inf_{γ ∈ Π(p_r, p_g)} E_{(u,u')∼γ} ‖u − u'‖_1,   (9)

where Π(p_r, p_g) is the set of joint distributions with marginals p_r(u) and p_g(u'). By the Kantorovich-Rubinstein duality [45], the optimal transport problem for COLES can be equivalently expressed as:

sup_{g: K(g)≤1} E_{u∼p_r}[g(u)] − E_{u'∼p_g}[g(u')] ≥ max_Θ E_{u∼p_d(u|v)}(u^⊤ v) + E_{u'∼p_n(u'|v)}(−u'^⊤ v),   (10)

under a drawn anchor v ∼ p_d(v), where K(g) means the Lipschitz constant, and the supremum is taken over all 1-Lipschitz functions (or, equivalently, all K-Lipschitz functions). This is because g(u) is chosen to be of the specific form g_v(u) = u^⊤ v, where v is parameterized by a graph neural network with parameters Θ. Optimizing over the neural network parameters Θ can enumerate a subset of functions which satisfies the Lipschitz constant K.

Lipschitz continuity of COLES. In order to assure the Lipschitz continuity of COLES, let individual embeddings be stacked row-wise into a matrix and ℓ2-norm normalized along rows, or along columns. Given v (the reference node), the following holds: |u^⊤ v − u'^⊤ v| ≤ ‖v‖_max ‖u − u'‖_1, where K = max_v ‖v‖_max (≤ 1 in the case of either the sphere embedding or the constraint Y^⊤Y = I of the COLES formula in Eq. (5)). Thus, the function g(u) = u^⊤ v is Lipschitz with constant K.

4.2 COLES enjoys the Block-contrastive Loss

We notice that COLES leverages access to blocks of similar data, rather than just individual pairs in the loss function. To this end, we resort to Prop. 6.2 of Arora et al. [3], which shows that for a family of functions F whose ‖f(·)‖ ≤ R for some R > 0, a block-contrastive loss L^block_un is always bounded by a pairwise-contrastive loss L_un, that is, L^block_un(f) ≤ L_un(f). To that end, Arora et al. [3] also show that as block-contrastive losses achieve lower minima than their pairwise-contrastive counterparts, they also enjoy better generalization. We show that COLES is a block-contrastive loss, which explains its good performance. Following Eq. (4), for a given embedding v = f_Θ(x_v), and b embeddings u_i = f_Θ(x_{u_i}) and u'_i = f_Θ(x_{u'_i}) drawn according to p_d(u | v) and p_n(u' | v), we have (note the minus preceding the equation as here we minimize):

−(1/b) Σ_{i=1}^{b} u_i^⊤ v + (1/b) Σ_{i=1}^{b} u'_i^⊤ v = −v^⊤(μ_+ − μ_−),   (11)

where μ_+ and μ_− are the positive and negative block summaries of the sampled nodes. Looking at Eq. (5), it is straightforward to simply expand Σ_{(i,j)∈E} ‖y_i − y_j‖²_2 W_ij to see that each index i will act as a selector of anchors, whereas index j will loop over positive and negative samples taking into account their connectivity to i captured by W_ij. We provide this expansion in the Suppl. Material.
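A minimal sketch of the block-contrastive form in Eq. (11), assuming the positive and negative samples of an anchor are supplied as row-stacked tensors (the function and argument names are ours):

```python
import torch

def coles_block_loss(v: torch.Tensor, pos: torch.Tensor, neg: torch.Tensor) -> torch.Tensor:
    """Eq. (11): -v^T (mu_+ - mu_-), i.e. the anchor is contrasted against block summaries
    (means) of its positive and negative samples rather than against single pairs.

    v   : (d',)   anchor embedding f_Theta(x_v)
    pos : (b, d') embeddings of b positive samples drawn from p_d(u | v)
    neg : (b, d') embeddings of b negative samples drawn from p_n(u' | v)
    """
    mu_pos = pos.mean(dim=0)
    mu_neg = neg.mean(dim=0)
    return -torch.dot(v, mu_pos - mu_neg)
```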
4.3 Geometric Interpretation

Below, we analyze COLES through the lens of Alignment and Uniformity on the Hypersphere of Wang and Isola [47]. To this end, we decompose our objective into the so-called alignment and uniformity losses. Firstly, Mikolov et al. [33] have shown that Sampled NCE with the sigmoid non-linearity is a practical approximation of the SoftMax contrastive loss, the latter suffering poor scalability w.r.t. the count of negative samples. For this reason, many contrastive approaches (DeepWalk, GraphSAGE, DGI, Graph2Gauss, etc.) adopt the Sampled NCE rather than the SoftMax (GRACE) framework. Wang and Isola [47] have decomposed the SoftMax contrastive loss into L_align and L_uniform [47]:

L(u, v, N) = L_align(u, v) + L_uniform(u', v, N) = −log [ e^{u^⊤ v} / Σ_{u'∈N∪{u}} e^{u'^⊤ v} ],   (12)

where N is a sampled subset of negative samples, and u and v are the node indexes of the so-called positive sample and the anchor, with embeddings u and v. Let ⟨u, u⟩ = ⟨u', u'⟩ = ⟨v, v⟩ = τ² (τ acts as the so-called temperature). Moreover, L_align = −⟨u, v⟩ and L_uniform = log Σ_{u'∈N∪{u}} e^{u'^⊤ v}, that is, L_uniform is a logarithm of an arithmetic mean of RBF responses over the subset N ∪ {u}. Of course, computing the total loss L requires drawing u and v from the graph and summing over multiple L_align(u, v) and L_uniform(u, v, N) but we skip this step and the argument variables of loss functions for brevity. COLES can be decomposed into L_align and L'_uniform [47] as follows:

L_align + L'_uniform = −log e^{u^⊤ v} + (1/|N|) Σ_{u'∈N} log e^{u'^⊤ v} = −log [ e^{u^⊤ v} / (Π_{u'∈N} e^{u'^⊤ v})^{1/|N|} ],   (13)

where L_align remains the same as for SoftMax but L'_uniform = log (Π_{u'∈N} e^{u'^⊤ v})^{1/|N|} is in fact a logarithm of the geometric mean of RBF responses over the subset N. Thus, our loss can be seen as the ratio of geometric means over RBF functions. Several authors (e.g., Gonzalez [12]) noted that the geometric mean helps smooth out the Gaussian noise under the i.i.d. uniform sampling while losing less information than the arithmetic mean. The geometric mean enjoys better confidence intervals than the arithmetic mean given a small number of samples. As we sample few negative nodes for efficiency, we expect the geometric mean to be more reliable. Eq. (12) and (13) are just two specific cases of a generalized loss:

L_align + L'_uniform = −log [ e^{u^⊤ v} / M_p(e^{u'_1^⊤ v}, ..., e^{u'_{|N|}^⊤ v}) ],   (14)

where M_p(·) in L'_uniform = log M_p(e^{u'_1^⊤ v}, ..., e^{u'_{|N|}^⊤ v}) is the so-called generalized mean. We introduce M_p(·) into the denominator of Eq. (14) but it can also be introduced in the numerator. We investigate the geometric (p=0), arithmetic (p=1), harmonic (p=−1) and quadratic (p=2) means.
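To make the role of M_p concrete, the sketch below implements the loss of Eq. (14) with a generalized-mean uniformity term; p = 0 is handled as the geometric-mean limit, and the function names are our own:

```python
import torch

def generalized_mean(x: torch.Tensor, p: float) -> torch.Tensor:
    """M_p of positive values x; p=0 is the geometric mean (limit case),
    p=1 arithmetic, p=-1 harmonic, p=2 quadratic."""
    if p == 0:
        return torch.exp(torch.log(x).mean())
    return (x.pow(p).mean()).pow(1.0 / p)

def align_uniform_loss(u: torch.Tensor, v: torch.Tensor, negs: torch.Tensor, p: float = 0.0):
    """Eq. (14): -log( exp(u^T v) / M_p(exp(u_1'^T v), ..., exp(u_|N|'^T v)) ).
    For p=0 this reduces to the COLES decomposition of Eq. (13); for p=1 it behaves
    like the SoftMax-style uniformity term of Eq. (12) up to a constant.

    u, v : (d',)      positive and anchor embeddings
    negs : (|N|, d')  negative-sample embeddings
    """
    align = -torch.dot(u, v)                                        # L_align
    uniform = torch.log(generalized_mean(torch.exp(negs @ v), p))   # L'_uniform
    return align + uniform
```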
5 Related Works

Graph Embeddings. Graph embedding methods such as Laplacian Eigenmaps [5] and IsoMap [42] reduce the dimensionality of representations by assuming the data lies on a low-dimensional manifold. With these methods, for a set of high-dimensional data features, a similarity graph is built based on the pairwise feature similarity, and each node is embedded into a low-dimensional space. The graph is constructed from non-relational high-dimensional data features, and Laplacian Eigenmaps ignore relations between dissimilar node pairs, that is, embeddings of dissimilar nodes are not penalized. To alleviate the above shortcomings, DeepWalk [35] uses truncated random walks to explore the network structure and utilizes the skip-gram model [32] for word embedding to derive the embedding vectors of nodes. LINE [41] explores a similar idea with an explicit objective function by setting the walk length to one and applying negative sampling [32]. Node2Vec [13] interpolates between breadth- and depth-first sampling strategies to aggregate different types of neighborhoods.

Representation Learning for Graph Neural Networks. Supervised and (semi-)supervised GNNs [22] require labeled datasets that may not be readily available. Yet, unsupervised GNNs have received little attention. GCN [22] employs the minimization of the reconstruction error as the objective function to train the encoder. GraphSAGE [15] incorporates objectives inspired by DeepWalk, e.g., a contrastive loss encouraging nearby nodes to have similar representations while preserving dissimilarity between representations of disparate nodes. DGI [44], inspired by Deep InfoMax (DIM) [18], proposes an objective with a global-local sampling strategy, which maximizes the Mutual Information (MI) between global and local graph embeddings. In contrast, Augmented Multiscale Deep InfoMax (AMDIM) [4] maximizes MI between multiple views of data. MVRLG [16] contrasts encodings from first-order neighbors and a graph diffusion. MVRLG uses GCNs to learn node embeddings for different views. Fisher-Bures Adversary GCN [40] assumes that the graph is generated w.r.t. some observation noise. Graph-adaptive ReLU [58] uses an adaptive non-linearity in GCN. Multi-view augmentation-based methods, not studied by us, are complementary to COLES. Moreover, linear networks, e.g., SGC [49] and S2GC [61], capture the neighborhood and increasingly larger neighborhoods of each node, respectively. SGC and S2GC have no projection layer, which results in embeddings of size equal to the input dimension. DGI [44] uses the block-contrastive strategy [3] by treating negative samples as a difference of instances and a summary of node embeddings for positive samples. Finally, COLES can be extended to other domains/problems, e.g., time series/change point detection [9] or few-shot learning [39, 54, 55].

(Negative) Sampling. Sampling node pairs relies on random walks [35] or second-order proximity [41], etc. In contrast, COLES samples an undirected graph based on the random graph sampling theory [11], where each edge is independently chosen with a prescribed probability p > 0.

6 Experiments

We evaluate COLES on transductive and inductive node classification tasks. Node clustering is also evaluated. COLES is compared to state-of-the-art unsupervised, contrastive and (semi-)supervised methods. Unsupervised methods do not use label information except for the classifier. Contrastive methods use the contrastive setting to learn similarity/dissimilarity. (Semi-)supervised methods use labels to train their projection layer and classifier. By semi-supervised, we mean that only a few of the nodes used for training are labeled. (Semi-)supervised models use a SoftMax classifier, whereas unsupervised and contrastive methods use a logistic regression classifier.

Datasets. COLES is evaluated on four citation networks: Cora, Citeseer, Pubmed and Cora Full [22, 6] for the transductive setting. We also employ the large-scale Ogbn-arxiv from OGB [19]. Finally, the Reddit [53] dataset is used in the inductive setting. Table 1 provides details of all datasets.

Table 1: The statistics of datasets.

| Dataset | Task | Nodes | Edges | Features | Classes |
|---|---|---|---|---|---|
| Cora | Transductive | 2,708 | 5,429 | 1,433 | 7 |
| Citeseer | Transductive | 3,327 | 4,732 | 3,703 | 6 |
| Pubmed | Transductive | 19,717 | 44,338 | 500 | 3 |
| Cora Full | Transductive | 19,793 | 65,311 | 8,710 | 70 |
| Ogbn-arxiv | Transductive | 169,343 | 1,166,243 | 128 | 40 |
| Reddit | Inductive | 232,965 | 11,606,919 | 602 | 41 |

Metrics. Fixed data splits [51] for transductive tasks are often used in evaluations between different models. However, such an experimental setup may benefit easily overfitting models [38]. Thus, instead of fixed data splits, results are averaged over 50 random splits for each dataset and standard deviations are reported for the empirical evaluation on transductive tasks. Moreover, we also test the performance under a different number of samples per class, i.e., 5 and 20 samples per class.
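A sketch of the transductive evaluation protocol described above, assuming frozen embeddings Y and integer class labels; scikit-learn's logistic regression stands in for the classifier, and the helper name is ours:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def evaluate_embeddings(Y, labels, samples_per_class=20, n_splits=50, seed=0):
    """Mean/std accuracy of logistic regression trained on `samples_per_class` labeled
    nodes per class and evaluated on the remaining nodes, over random splits."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    accs = []
    for _ in range(n_splits):
        train_idx = np.concatenate([
            rng.choice(np.flatnonzero(labels == c), samples_per_class, replace=False)
            for c in classes])
        test_idx = np.setdiff1d(np.arange(len(labels)), train_idx)
        clf = LogisticRegression(max_iter=1000).fit(Y[train_idx], labels[train_idx])
        accs.append(clf.score(Y[test_idx], labels[test_idx]))
    return float(np.mean(accs)), float(np.std(accs))
```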
Typically, the performance for the inductive task is tested on relatively larger graphs. Thus, we choose fixed data splits as in previous papers [15, 53], and we report the Micro-F1 scores averaged over 10 runs.

Baseline models. We group baseline models into unsupervised, contrastive and (semi-)supervised methods, and implement them in the same framework/testbed. Contrastive methods include DeepWalk [35], GCN+Sampled NCE developed as an alternative to GraphSAGE+Sampled NCE [15], Graph2Gauss [6], SCE [56], DGI [44], GRACE [62], GCA [63] and GraphCL [52], which are our main competitors. Note that GRACE, GCA and GraphCL are based on multi-view and data augmentation, and GraphCL is mainly intended for graph classification. We do not study graph classification as it requires advanced node pooling [24] with mixed- or high-order statistics [26, 25, 27]. We compare results with representative (semi-)supervised GCN [22], GAT [43] and MixHop [1] models. SGC and S2GC are unsupervised spectral filter networks. They do not have any learnable parameters that depend on labels, with the exception of a classifier. To reduce the resulting dimensionality, we also add PCA-S2GC and RP-S2GC, which use PCA and random projections to obtain the projection layer for these methods. We extend our COLES framework with different GNNs: GCN, SGC and/or S2GC, and we name them COLES-GCN, COLES-SGC and COLES-S2GC. As COLES-GCN is a multi-layer non-linear encoder, the optimization of COLES-GCN is non-convex. The optimization of COLES-SGC and COLES-S2GC is convex if L − η Σ_{k=1}^{κ} L_k^{(−)} ∈ S^n_+, and COLES-GCN (Stiefel) is convex w.r.t. P. We set hyperparameters based on the settings described in their papers.

General model setup. For all (semi-)supervised models, we use early stopping on each random split and we capture the corresponding classification result. For all unsupervised models, we choose the embedding dimension to be 512 on Cora, Citeseer and Cora Full, and 256 on Pubmed. After the embeddings of nodes are learnt, a classifier is trained by applying the logistic regression in the embedding space. For inductive learning, methods based on COLES use 512-dimensional embeddings. Other hyperparameters for the baseline models are the same as in the original papers.

Hyperparameters of our models. In the transductive experiments, the detailed hyperparameter settings for Cora, Citeseer, Pubmed, and Cora Full are listed below. For COLES, we use the Adam optimizer with learning rates of [0.001, 0.0001, 0.02, 0.02] and the weight decay of [5e−4, 1e−3, 5e−4, 2e−4]. The numbers of training epochs are [20, 20, 100, 30], respectively. We sample 10 randomized adjacency matrices, and 5 negative samples for each node in each matrix on each dataset before training. For S2GC and COLES-S2GC, the number of propagation steps (layers) is 8 for all datasets except Cora Full (2 steps). For SGC and COLES-SGC, we use 2 steps for all datasets.

6.1 Transductive Learning

In this section, we consider transductive learning where all nodes are available in the training process.

Contrastive Embedding Baselines vs. COLES. Table 2 shows that the performance of COLES-GCN and the linear variant, COLES-S2GC, is better than that of other unsupervised models. In particular, COLES-GCN outperforms GCN+Sampled NCE on all four datasets, which shows that COLES has an advantage over the Sampled NCE framework.
In addition, COLES-S2GC typically outperforms the best contrastive baseline DGI by up to 3.4%. In Cora Full, we notice that S2GC underperforms when training with 5 samples. However, COLES-S2GC is able to significantly boost its performance by 9%. On Citeseer with 5 training samples, COLES-S2GC outperforms S2GC by 6.8%. We also note that COLES-GCN (Stiefel) outperforms COLES-GCN (based on the soft-orthogonality constraint) by up to 2.7%, but its performance is below that of COLES-S2GC. Noteworthy is that for augmentation-based methods, COLES-GCN with augmentations, denoted as COLES-GCN (+Aug), outperforms COLES-GCN without augmentations. COLES-GCN (+Aug) also outperforms GRACE, GCA and GraphCL in most experiments. Nonetheless, COLES-S2GC without any augmentations outperformed all augmentation-based methods. Finally, Table 7 shows that COLES-S2GC outperforms all other methods on the challenging Ogbn-arxiv, while using a very small number of trainable parameters.

Semi-supervised GNNs vs. COLES. Table 2 shows that the contrastive GCN baselines perform worse than semi-supervised variants, especially when 20 labeled samples per class are available. In contrast, COLES-GCN outperformed the semi-supervised GCN on Cora by 6.3% and 1.4% given 5 and 20 labeled samples per class. COLES-GCN also outperforms GCN on Citeseer and Cora Full by 8.3% and 6.3% given 5 labeled samples per class. When the number of labels per class is 5, COLES-S2GC outperforms GCN by a margin of 8.1% on Cora and 9.4% on Citeseer. These results show the superiority of COLES on four datasets when the number of samples per class is 5. Even for 20 labeled samples per class, COLES-S2GC outperforms the best semi-supervised baselines on all four datasets, e.g., by 1.7% on Citeseer. Semi-supervised models are affected by the low number of labeled samples, which is consistent with [31], e.g., for GAT and MixHop. The accuracy of COLES-GCN and COLES-S2GC is not affected as significantly due to the contrastive setting.

Unsupervised GNNs vs. COLES. SGC and S2GC are unsupervised LGNs as they are spectral filters which do not use labels (except for the classifier). Table 2 shows that COLES-S2GC outperforms RP-S2GC and PCA-S2GC under the same size of projections. In most cases, COLES-S2GC also outperforms the unsupervised S2GC baseline (high-dimensional representation).

Table 2: Mean classification accuracy (%) and the standard dev. over 50 random splits. Numbers of labeled samples per class are in parentheses. The best accuracy per column is in bold. Models are organized into semi-supervised, contrastive and unsupervised groups. OOM means out of memory.
| Method | Cora (5) | Cora (20) | Citeseer (5) | Citeseer (20) | Pubmed (5) | Pubmed (20) | Cora Full (5) | Cora Full (20) |
|---|---|---|---|---|---|---|---|---|
| Semi-supervised | | | | | | | | |
| GCN | 67.5 ± 4.8 | 79.4 ± 1.6 | 57.7 ± 4.7 | 69.4 ± 1.4 | 65.4 ± 5.2 | 77.2 ± 2.1 | 49.3 ± 1.8 | 61.5 ± 0.5 |
| GAT | 71.2 ± 3.5 | 79.6 ± 1.5 | 54.9 ± 5.0 | 69.1 ± 1.5 | 65.5 ± 4.6 | 75.4 ± 2.3 | 43.9 ± 1.5 | 56.9 ± 0.6 |
| MixHop | 67.9 ± 5.7 | 80.0 ± 1.4 | 54.5 ± 4.3 | 67.1 ± 2.0 | 64.4 ± 5.6 | 75.7 ± 2.7 | 47.5 ± 1.5 | 61.0 ± 0.7 |
| Contrastive | | | | | | | | |
| DeepWalk | 60.3 ± 4.0 | 70.5 ± 1.9 | 38.3 ± 2.9 | 45.6 ± 2.0 | 60.3 ± 5.6 | 70.8 ± 2.6 | 38.9 ± 1.4 | 51.1 ± 0.7 |
| GCN+Sampled NCE | 61.3 ± 4.3 | 74.3 ± 1.6 | 42.3 ± 3.4 | 56.8 ± 1.9 | 60.9 ± 5.7 | 70.3 ± 2.5 | 32.7 ± 1.9 | 45.2 ± 0.9 |
| SAGE+Sampled NCE | 65.0 ± 3.5 | 73.8 ± 1.5 | 48.0 ± 3.5 | 56.5 ± 1.6 | 64.1 ± 6.1 | 74.6 ± 1.9 | 35.0 ± 1.4 | 43.6 ± 0.6 |
| Graph2Gauss | 72.7 ± 2.0 | 76.2 ± 1.1 | 60.7 ± 3.5 | 65.7 ± 1.5 | 67.6 ± 3.9 | 74.1 ± 2.1 | 38.9 ± 1.3 | 49.3 ± 0.5 |
| SCE | 74.3 ± 2.7 | 80.2 ± 1.1 | 65.4 ± 2.9 | 70.7 ± 1.2 | 65.7 ± 6.0 | 75.8 ± 2.2 | 50.7 ± 1.5 | 60.6 ± 0.6 |
| DGI | 72.9 ± 4.0 | 78.1 ± 1.8 | 65.7 ± 3.6 | 71.1 ± 1.1 | 65.3 ± 5.7 | 73.9 ± 2.3 | 50.5 ± 1.4 | 58.4 ± 0.6 |
| COLES-GCN | 73.8 ± 3.4 | 80.8 ± 1.3 | 66.0 ± 2.6 | 69.0 ± 1.3 | 62.7 ± 4.6 | 72.7 ± 2.1 | 47.3 ± 1.5 | 58.9 ± 0.5 |
| COLES-GCN (Stiefel) | 75.0 ± 3.4 | 81.0 ± 1.3 | 67.9 ± 2.3 | 71.7 ± 0.9 | 62.6 ± 5.0 | 73.2 ± 2.6 | 47.6 ± 1.2 | 59.2 ± 0.5 |
| COLES-S2GC | 76.5 ± 2.6 | 81.5 ± 1.2 | 67.5 ± 2.2 | 71.3 ± 1.0 | 66.0 ± 5.2 | 77.4 ± 1.9 | 50.8 ± 1.4 | 61.8 ± 0.5 |
| Contrastive + Augmentation | | | | | | | | |
| GraphCL | 72.6 ± 4.2 | 78.3 ± 1.7 | 65.6 ± 3.0 | 71.1 ± 0.8 | OOM | OOM | OOM | OOM |
| GRACE | 64.9 ± 4.2 | 73.9 ± 1.6 | 61.8 ± 3.9 | 68.4 ± 1.6 | OOM | OOM | OOM | OOM |
| GCA | 61.5 ± 4.9 | 75.8 ± 1.9 | 43.2 ± 3.6 | 55.7 ± 1.9 | OOM | OOM | OOM | OOM |
| COLES-GCN (+Aug) | 75.3 ± 3.3 | 81.0 ± 1.3 | 66.7 ± 2.3 | 69.8 ± 1.3 | 63.9 ± 5.0 | 73.4 ± 2.5 | 48.0 ± 1.2 | 59.4 ± 0.5 |
| Unsupervised | | | | | | | | |
| SGC | 63.9 ± 5.4 | 78.3 ± 1.9 | 59.5 ± 3.4 | 69.8 ± 1.4 | 65.8 ± 4.4 | 76.3 ± 2.3 | 46.0 ± 2.2 | 57.7 ± 1.2 |
| S2GC | 71.4 ± 4.4 | 81.3 ± 1.2 | 60.3 ± 4.0 | 69.5 ± 1.2 | 67.6 ± 4.2 | 73.3 ± 2.0 | 41.8 ± 1.7 | 60.0 ± 0.5 |
| PCA-S2GC | 72.1 ± 3.8 | 81.2 ± 1.3 | 61.0 ± 3.5 | 68.8 ± 1.3 | 67.5 ± 4.3 | 73.2 ± 2.0 | 42.3 ± 1.7 | 59.3 ± 0.6 |
| RP-S2GC | 65.9 ± 4.6 | 78.1 ± 1.2 | 51.4 ± 3.2 | 61.7 ± 1.6 | 66.1 ± 5.0 | 72.5 ± 1.9 | 31.5 ± 1.4 | 48.7 ± 0.6 |

Table 3: The influence of the number (κ) of negative Laplacian graphs on COLES-S2GC. Parentheses show the no. of labeled samples per class.

| | κ = 0 | κ = 1 | κ = 5 | κ = 10 |
|---|---|---|---|---|
| Cora (20) | 79.88 | 81.43 | 81.18 | 81.17 |
| Cora (5) | 70.12 | 76.24 | 75.89 | 75.79 |
| Citeseer (20) | 69.42 | 70.71 | 70.61 | 70.61 |
| Citeseer (5) | 58.17 | 67.03 | 66.96 | 67.04 |

Table 4: The influence of the number (κ) of negative Laplacian graphs on COLES-GCN. Parentheses show the no. of labeled samples per class.

| | κ = 0 | κ = 1 | κ = 5 | κ = 10 |
|---|---|---|---|---|
| Cora (20) | 75.70 | 80.90 | 80.87 | 80.90 |
| Cora (5) | 60.97 | 74.14 | 74.11 | 74.07 |
| Citeseer (20) | 60.61 | 69.04 | 69.21 | 69.08 |
| Citeseer (5) | 45.31 | 65.85 | 66.08 | 66.01 |

Table 5: The performance given various choices of the generalized mean Mp for the uniformity loss.

| Method | Cora (5) | Cora (20) | Citeseer (5) | Citeseer (20) | Pubmed (5) | Pubmed (20) | Cora Full (5) | Cora Full (20) |
|---|---|---|---|---|---|---|---|---|
| Geometric (M0) (COLES-S2GC) | 76.5 ± 2.6 | 81.5 ± 1.2 | 67.5 ± 2.2 | 71.3 ± 1.0 | 66.0 ± 5.2 | 77.4 ± 1.9 | 50.8 ± 1.4 | 61.8 ± 0.5 |
| Arithmetic (M1) (SoftMax-Contr.) | 71.8 ± 3.0 | 77.6 ± 1.3 | 63.2 ± 3.1 | 69.3 ± 0.8 | 65.9 ± 4.3 | 77.1 ± 1.5 | 49.2 ± 1.4 | 60.6 ± 0.6 |
| Harmonic (M−1) | 75.2 ± 3.5 | 80.7 ± 1.2 | 64.7 ± 2.4 | 70.9 ± 0.9 | 65.9 ± 5.5 | 73.9 ± 2.4 | 48.0 ± 1.6 | 59.7 ± 1.6 |
| Quadratic (M2) | 72.3 ± 2.5 | 77.2 ± 1.3 | 65.4 ± 2.2 | 70.7 ± 0.8 | 65.6 ± 4.5 | 77.3 ± 1.5 | 49.2 ± 1.5 | 60.6 ± 1.6 |

Negative Laplacian Eigenmaps. Below, we analyze how κ in Eq. (5) influences the performance. We set κ ∈ {0, 1, 5, 10} on COLES-S2GC and COLES-GCN given Cora and Citeseer with 5 and 20 labeled samples per class. The case of κ = 0 means no negative Laplacian Eigenmaps are used, thus the solution simplifies to regular Laplacian Eigenmaps parametrized by GCN embeddings. Table 3 shows that without the negative Laplacian Eigenmaps, the performance of COLES-S2GC drops significantly, i.e., between 6% and 9% for 5 labeled samples per class. That means the negative Laplacian Eigenmaps play an important role, which highlights the benefits of COLES.
Although negative Laplacian Eigenmaps improve results, using κ > 1 negative matrices improves the performance only marginally. Table 4 shows that COLES-GCN relies on negative Laplacian Eigenmaps. Without negative Laplacian Eigenmaps, the performance of COLES-GCN drops by 20% on Citeseer with 5 samples per class. Even when 20 samples per class are used, if κ = 0, the performance of COLES-GCN drops by 8.4%.

6.2 Uniformity Loss as the Generalized Mean (Mp)

Following the analysis presented in Section 4.3, Table 5 demonstrates the impact of the choices of the uniformity loss on the performance of COLES. To this end, we select the geometric (M0), arithmetic (M1), harmonic (M−1) and quadratic (M2) means as examples of Mp realizing the uniformity loss. On all the investigated datasets, the geometric mean outperforms the other variants.

6.3 Inductive Learning

In inductive learning, models have no access to the test set, thus they need to generalize well to unseen samples. Table 6 shows that COLES enjoys a significant performance gain (1%-5% in Micro-F1 scores), performing close to supervised methods with a low memory footprint. In contrast, DGI on Reddit triggers out-of-memory errors on an Nvidia GTX 1080 GPU (the 94.0 Micro-F1 is taken from [44]).

Table 6: Test Micro-F1 score (%) averaged over 10 runs on Reddit. Results of other models are from the original papers.

| Setting | Model | Test F1 |
|---|---|---|
| Supervised | SAGE-mean | 95.0 |
| | SAGE-LSTM | 95.4 |
| | SAGE-GCN | 93.0 |
| Contrastive | SAGE-mean | 89.7 |
| | SAGE-LSTM | 90.7 |
| | SAGE-GCN | 90.8 |
| | FastGCN | 93.7 |
| | DGI | 94.0 |
| | COLES-GCN | 94.0 |
| | COLES-SGC | 94.8 |
| | COLES-S2GC | 95.4 |

Table 7: Mean classification accuracy (%) and the standard dev. over 10 runs on Ogbn-arxiv. Results of other models are from the original papers.

| Method | Test Acc. | #Params |
|---|---|---|
| MLP | 55.50 ± 0.23 | 110,120 |
| Node2Vec [13] | 70.07 ± 0.13 | 21,818,792 |
| GraphZoom [10] | 71.18 ± 0.18 | 8,963,624 |
| C&S [20] | 71.26 ± 0.01 | 5,160 |
| SAGE-mean [15] | 71.49 ± 0.27 | 218,664 |
| GCN [22] | 71.74 ± 0.29 | 142,888 |
| DeeperGCN [30] | 71.92 ± 0.17 | 491,176 |
| SIGN [37] | 71.95 ± 0.11 | 3,566,128 |
| Framelet [59] | 71.97 ± 0.12 | 1,633,183 |
| S2GC [61] | 72.01 ± 0.25 | 110,120 |
| COLES-S2GC | 72.48 ± 0.25 | 110,120 |

Table 8: The clustering performance on Cora, Citeseer and Pubmed.

| Method | Input | Cora Acc% | Cora NMI% | Cora F1% | Citeseer Acc% | Citeseer NMI% | Citeseer F1% | Pubmed Acc% | Pubmed NMI% | Pubmed F1% |
|---|---|---|---|---|---|---|---|---|---|---|
| k-means | Feature | 34.65 | 16.73 | 25.42 | 38.49 | 17.02 | 30.47 | 57.32 | 29.12 | 57.35 |
| Spectral-f | Feature | 36.26 | 15.09 | 25.64 | 46.23 | 21.19 | 33.70 | 59.91 | 32.55 | 58.61 |
| Spectral-g | Graph | 34.19 | 19.49 | 30.17 | 25.91 | 11.84 | 29.48 | 39.74 | 3.46 | 51.97 |
| DeepWalk | Graph | 46.74 | 31.75 | 38.06 | 36.15 | 9.66 | 26.70 | 61.86 | 16.71 | 47.06 |
| GAE | Both | 53.25 | 40.69 | 41.97 | 41.26 | 18.34 | 29.13 | 64.08 | 22.97 | 49.26 |
| VGAE | Both | 55.95 | 38.45 | 41.50 | 44.38 | 22.71 | 31.88 | 65.48 | 25.09 | 50.95 |
| ARGE | Both | 64.00 | 44.90 | 61.90 | 57.30 | 35.00 | 54.60 | 59.12 | 23.17 | 58.41 |
| ARVGE | Both | 62.66 | 45.28 | 62.15 | 54.40 | 26.10 | 52.90 | 58.22 | 20.62 | 23.04 |
| GCN | Both | 59.05 | 43.06 | 59.38 | 45.97 | 20.08 | 45.57 | 61.88 | 25.48 | 60.70 |
| SGC | Both | 62.87 | 50.05 | 58.60 | 52.77 | 32.90 | 63.90 | 69.09 | 31.64 | 68.45 |
| S2GC | Both | 68.96 | 54.22 | 65.43 | 69.11 | 42.87 | 64.65 | 68.18 | 31.82 | 67.81 |
| COLES-GCN | Both | 60.74 | 45.49 | 59.33 | 63.28 | 37.54 | 59.17 | 63.46 | 25.73 | 63.42 |
| COLES-GCN (Stiefel) | Both | 62.46 | 47.01 | 59.38 | 65.17 | 38.90 | 60.85 | 63.56 | 25.81 | 63.58 |
| COLES-SGC | Both | 65.62 | 52.32 | 56.95 | 68.24 | 43.09 | 63.85 | 69.47 | 32.31 | 68.57 |
| COLES-S2GC | Both | 69.70 | 55.35 | 63.06 | 69.20 | 44.41 | 64.70 | 68.76 | 33.42 | 68.12 |

6.4 Node Clustering

We compare COLES-GCN and COLES-S2GC with three types of clustering methods listed below:

i. Methods that use only node features, e.g., k-means and spectral clustering (spectral-f), which construct a similarity matrix from the node features with a linear kernel.

ii. Structural clustering methods that only use the graph structure: spectral clustering (spectral-g), which takes the graph adjacency matrix as the similarity matrix, and DeepWalk [35].

iii. Attributed graph clustering methods that use both node features and the graph: Graph Autoencoder (GAE), Variational Graph Autoencoder (VGAE) [22], Adversarially Regularized Graph Autoencoder (ARGE), Adversarially Regularized Variational Graph Autoencoder (ARVGE) [34], SGC [49] and S2GC [61].

We measure the performance by the clustering Accuracy (Acc), Normalized Mutual Information (NMI) and macro F1-score (F1). We run each method 10 times on Cora, Citeseer and Pubmed.
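A minimal sketch of this clustering evaluation (our own helper, not the authors' script): k-means on the learned embeddings, scored with NMI and a Hungarian-matched clustering accuracy:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def cluster_and_score(Y, labels, n_clusters, seed=0):
    """k-means on embeddings Y; returns (accuracy, NMI) against ground-truth labels
    (labels assumed to be integers in 0..n_clusters-1)."""
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Y)
    nmi = normalized_mutual_info_score(labels, pred)
    # Match predicted clusters to classes with the Hungarian algorithm for accuracy.
    cost = np.zeros((n_clusters, n_clusters), dtype=np.int64)
    for p, t in zip(pred, labels):
        cost[p, t] += 1
    row, col = linear_sum_assignment(-cost)
    acc = cost[row, col].sum() / len(labels)
    return acc, nmi
```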
We report the clustering results in Table 8. We set the number of propagation steps to 8 for SGC, S2GC, COLES-SGC and COLES-S2GC, following the setting of [57]. We note that COLES-S2GC outperforms S2GC in most cases, whereas COLES-GCN outperforms the contrastive GCN on all datasets.

Scalability. GraphSAGE and DGI require neighbor sampling, which results in redundant forward/backward propagation steps (long runtime). In contrast, COLES-S2GC enjoys a straightforward implementation which reduces the memory usage and runtime significantly. For graphs with more than 100 thousand nodes and 10 million edges (Reddit), our model runs smoothly on an NVIDIA 1080 GPU. Even on larger graph datasets, the closed-form solution is attractive, as for COLES-S2GC the cost of the eigendecomposition depends on d (a few seconds on Reddit). The runtime of COLES-S2GC is also favourable in comparison to the multi-view augmentation-based GraphCL. Specifically, COLES-S2GC took 0.3s, 1.4s, 7.3s and 16.4s on Cora, Citeseer, Pubmed and Cora Full, respectively. GraphCL took 110.19s, 101.0s, 8h and 8h, respectively.

7 Conclusions

We have proposed a new network embedding, COntrastive Laplacian EigenmapS (COLES), which recognizes the importance of negative sample pairs in Laplacian Eigenmaps. Our COLES works well with many backbones, e.g., COLES with GCN, SGC and S2GC backbones outperforms many unsupervised, contrastive and (semi-)supervised methods. By applying the GAN-inspired analysis, we have shown that Sampled NCE with the sigmoid non-linearity yields the JS divergence. However, COLES uses the RBF non-linearity, which results in the Kantorovich-Rubinstein duality; COLES essentially minimizes a surrogate of the Wasserstein distance, which offers a reasonable transportation plan and helps avoid pitfalls of the JS divergence. Moreover, COLES takes advantage of the so-called block-contrastive loss whose family is known to perform better than their pair-wise contrastive counterparts. Cast as the alignment and uniformity losses, COLES enjoys the more robust geometric mean rather than the arithmetic mean (used by SoftMax-Contrastive) as the uniformity loss.

Acknowledgments and Disclosure of Funding

We would like to thank the reviewers for stimulating questions that helped us improve several aspects of our analysis. Hao Zhu is supported by an Australian Government Research Training Program (RTP) Scholarship. Ke Sun and Piotr Koniusz are supported by CSIRO's Machine Learning and Artificial Intelligence Future Science Platform (MLAI FSP).

References

[1] Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In International Conference on Machine Learning, pages 21–29, 2019.
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.

[3] Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. In International Conference on Machine Learning, pages 5628–5637, 2019.

[4] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910, 2019.

[5] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

[6] Aleksandar Bojchevski and Stephan Günnemann. Deep Gaussian embedding of graphs: Unsupervised inductive learning via ranking. arXiv preprint arXiv:1707.03815, 2017.

[7] Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering, 30(9):1616–1637, 2018.

[8] Miguel Á. Carreira-Perpiñán. The elastic embedding algorithm for dimensionality reduction. In International Conference on Machine Learning, pages 167–174, 2010.

[9] Shohreh Deldari, Daniel V Smith, Hao Xue, and Flora D Salim. Time series change point detection with self-supervised contrastive predictive coding. In Proceedings of the Web Conference, pages 3124–3135, 2021.

[10] C Deng, Z Zhao, Y Wang, Z Zhang, and Z Feng. GraphZoom: A multi-level spectral approach for accurate and scalable graph embedding. In International Conference on Learning Representations, 2020.

[11] Paul Erdős and Alfréd Rényi. On random graphs. Publicationes Mathematicae, 6:290–297, 1959.

[12] Rafael C Gonzalez, Richard E Woods, et al. Digital Image Processing, 2002.

[13] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864, 2016.

[14] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304. JMLR Workshop and Conference Proceedings, 2010.

[15] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.

[16] Kaveh Hassani and Amir Hosein Khasahmadi. Contrastive multi-view representation learning on graphs. In International Conference on Machine Learning, pages 4116–4126, 2020.

[17] Xiaofei He and Partha Niyogi. Locality preserving projections. Advances in Neural Information Processing Systems, 16(16):153–160, 2004.

[18] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.

[19] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.

[20] Qian Huang, Horace He, Abhay Singh, Ser-Nam Lim, and Austin Benson. Combining label propagation and simple models out-performs graph neural networks. In International Conference on Learning Representations, 2021.
[21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[22] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[23] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized PageRank. In International Conference on Learning Representations, 2019.

[24] Piotr Koniusz and Hongguang Zhang. Power normalizations in fine-grained image, few-shot image and graph classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. doi: 10.1109/TPAMI.2021.3107164.

[25] Piotr Koniusz, Fei Yan, Philippe-Henri Gosselin, and Krystian Mikolajczyk. Higher-order occurrence pooling on mid- and low-level features: Visual concept detection. Technical report, INRIA, September 2013. URL https://hal.inria.fr/hal-00922524.

[26] Piotr Koniusz, Fei Yan, and Krystian Mikolajczyk. Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection. Computer Vision and Image Understanding, 117(5):479–492, 2013. ISSN 1077-3142.

[27] Piotr Koniusz, Fei Yan, Philippe-Henri Gosselin, and Krystian Mikolajczyk. Higher-order occurrence pooling for bags-of-words: Visual concept detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.

[28] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. Advances in Neural Information Processing Systems, 27:2177–2185, 2014.

[29] Mario Lezcano Casado. Trivializations for gradient-based optimization on manifolds. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, 2019.

[30] Guohao Li, Chenxin Xiong, Ali Thabet, and Bernard Ghanem. DeeperGCN: All you need to train deeper GCNs. arXiv preprint arXiv:2006.07739, 2020.

[31] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI Conference on Artificial Intelligence, pages 3538–3545, 2018.

[32] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[33] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013.

[34] S Pan, R Hu, G Long, J Jiang, L Yao, and C Zhang. Adversarially regularized graph autoencoder for graph embedding. In International Joint Conference on Artificial Intelligence, 2018.

[35] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710, 2014.

[36] Ignacio Ramirez, Pablo Sprechmann, and Guillermo Sapiro. Classification and clustering via dictionary learning with structured incoherence and shared features. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3501–3508, 2010. doi: 10.1109/CVPR.2010.5539964.

[37] Emanuele Rossi, Fabrizio Frasca, Ben Chamberlain, Davide Eynard, Michael Bronstein, and Federico Monti. SIGN: Scalable inception graph neural networks. arXiv preprint arXiv:2004.11198, 2020.
[38] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.

[39] Christian Simon, Piotr Koniusz, Richard Nock, and Mehrtash Harandi. Adaptive subspaces for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.

[40] Ke Sun, Piotr Koniusz, and Zhen Wang. Fisher-Bures adversary graph convolutional networks. Conference on Uncertainty in Artificial Intelligence, 115:465–475, 2019.

[41] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In International Conference on World Wide Web, pages 1067–1077, 2015.

[42] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[43] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

[44] Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. In International Conference on Learning Representations, 2019.

[45] Cédric Villani. Optimal Transport, Old and New. Springer-Verlag Berlin Heidelberg, 2009.

[46] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.

[47] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, volume 119, pages 9929–9939, 2020.

[48] Lilian Weng. From GAN to WGAN. arXiv preprint arXiv:1904.08994, 2019.

[49] Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr, Christopher Fifty, Tao Yu, and Kilian Q Weinberger. Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153, 2019.

[50] Zhen Yang, Ming Ding, Chang Zhou, Hongxia Yang, Jingren Zhou, and Jie Tang. Understanding negative sampling in graph representation learning. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1666–1676, 2020.

[51] Zhilin Yang, William Cohen, and Ruslan Salakhudinov. Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning, pages 40–48, 2016.

[52] Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems, 33:5812–5823, 2020.

[53] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. GraphSAINT: Graph sampling based inductive learning method. arXiv preprint arXiv:1907.04931, 2019.

[54] Hongguang Zhang, Piotr Koniusz, Songlei Jian, Hongdong Li, and Philip H. S. Torr. Rethinking class relations: Absolute-relative supervised and unsupervised few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 9432–9441, 2021.

[55] Shan Zhang, Dawei Luo, Lei Wang, and Piotr Koniusz. Few-shot object detection by second-order pooling. In Asian Conference on Computer Vision, 2020.

[56] Shengzhong Zhang, Zengfeng Huang, Haicang Zhou, and Ziang Zhou. SCE: Scalable network embedding from sparsest cut. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 257–265, 2020.

[57] Xiaotong Zhang, Han Liu, Qimai Li, and Xiao-Ming Wu. Attributed graph clustering via adaptive graph convolution. arXiv preprint arXiv:1906.01210, 2019.
[58] Yifei Zhang, Hao Zhu, Ziqiao Meng, Piotr Koniusz, and Irwin King. Graph-adaptive rectified linear unit for graph neural networks. In Proceedings of the Web Conference, 2022.

[59] Xuebin Zheng, Bingxin Zhou, Junbin Gao, Yuguang Wang, Pietro Liò, Ming Li, and Guido Montufar. How framelets enhance graph neural networks. In Marina Meila and Tong Zhang, editors, International Conference on Machine Learning, volume 139, pages 12761–12771, 2021.

[60] Hao Zhu and Piotr Koniusz. REFINE: Random range finder for network embedding. In ACM Conference on Information and Knowledge Management, 2021.

[61] Hao Zhu and Piotr Koniusz. Simple spectral graph convolution. In International Conference on Learning Representations, 2021.

[62] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131, 2020.

[63] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference, pages 2069–2080, 2021.