# Learning Graph Representation via Graph Entropy Maximization

Ziheng Sun 1 2, Xudong Wang 1, Chris Ding 1, Jicong Fan 1 2

1 School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China. 2 Shenzhen International Center for Industrial and Applied Mathematics, Shenzhen Research Institute of Big Data, Shenzhen, China. Correspondence to: Jicong Fan.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Graph representation learning aims to represent graphs as vectors that can be utilized in downstream tasks such as graph classification. In this work, we focus on learning diverse representations that capture as much of the graph information as possible. We propose quantifying graph information using graph entropy, where we define a probability distribution of a graph based on its node representations and a global graph representation. However, the computation of graph entropy is NP-hard due to the complex vertex-packing polytope involved in its definition. To address this challenge, we provide an approximation method leveraging orthonormal representations for graph entropy maximization. The proposed method is implemented via graph neural networks, resulting in informative node-level and graph-level representations. Experimental results demonstrate the effectiveness of our method in comparison to many baselines in unsupervised and semi-supervised learning tasks. The code of our method is available at https://github.com/MathAdventurer/GeMax.

1. Introduction

Graphs, representing entities and their relationships, are crucial in diverse fields like chemistry (Debnath et al., 1991; Kriege & Mutzel, 2012), biology (Borgwardt et al., 2005), and social sciences (Yanardag & Vishwanathan, 2015). Graph representation learning, transforming graphs into vectors for tasks such as classification, is a challenging problem due to the non-Euclidean nature of graph data. Numerous studies have focused on unsupervised graph-level representation learning using graph neural networks (GNNs), a notable approach in this domain. Key methodologies include InfoGraph (Sun et al., 2019), which maximizes mutual information between graph-level and node-level representations, and graph contrastive learning techniques like GraphCL (You et al., 2020), AD-GCL (Suresh et al., 2021), and JOAO (You et al., 2021), which enhance graph representations through various augmentation strategies. AutoGCL (Yin et al., 2022) innovates with learnable graph view generators, while GraphACL (Luo et al., 2023a) introduces a novel self-supervised approach. InfoGCL (Xu et al., 2021) and SFA (Zhang et al., 2023) focus on information transfer and feature augmentation in contrastive learning, respectively. GCS (Wei et al., 2023), NCLA (Shen et al., 2023), S3-CL (Ding et al., 2023), and ImGCL (Zeng et al., 2023) further refine graph augmentation and learning techniques. GRADATE (Duan et al., 2023) integrates subgraph contrast into multi-scale learning networks. These methods are commonly rooted in the InfoMax principle (Linsker, 1988), which will be detailed in Section 4.1.
Other types of methods for graph representation learning include VGAE (Kipf & Welling, 2016; Hamilton et al., 2017; Cui et al., 2020), graph embedding (Wu et al., 2020; Yu et al., 2021; Bai et al., 2019; Verma & Zhang, 2019), self-supervised learning (Liu et al., 2022b; Hou et al., 2022; Lee et al., 2022; Xie et al., 2022; Wu et al., 2021; Rong et al., 2020; Zhang et al., 2021b;a; Xiao et al., 2022), and various contrastive learning methods (Le-Khac et al., 2020; Qiu et al., 2020; Ding et al., 2022; Xia et al., 2022; Fang et al., 2022; Trivedi et al., 2022; Han et al., 2022; Mo et al., 2022; Yin et al., 2022; Xu et al., 2021; Zhao et al., 2021; Zeng & Xie, 2021; Li et al., 2022a;b; Wei et al., 2022). More recently, Sun et al. (2023) presented a Lovász principle for graph representation learning, which is based on the graph Lovász number (Lovász, 1979) and uses the handle vector learned by a GNN as the graph representation.

For unsupervised graph representation learning, it is crucial to ensure that the representations contain sufficient information useful for downstream tasks. One may recall that the representations given by an autoencoder are often very useful for downstream tasks. The reason is that the representations preserve the major information of the input data: they can reconstruct the input data well. For graph data, it is impossible to use the vector representation of each graph to reconstruct the graph itself, but it is possible to make the vector preserve sufficient information from the graph. To quantify information, we can use entropy.

Entropy is a fundamental concept in information theory and is very useful for many machine learning problems such as graph learning. Various entropy concepts have evolved to quantify the complexity and information of graphs (Dehmer & Mowshowitz, 2011; Dehmer, 2008). The most representative ones are the structural entropy (Mowshowitz & Dehmer, 2012) for node-level analysis and the edge entropy (Jiang et al., 2020; Wang et al., 2021; Grebenkina et al., 2018; Aziz et al., 2020; Luo et al., 2023b) for evaluating edge connectivity. Additionally, von Neumann entropy (Liu et al., 2021; 2022a; Passerini & Severini, 2008; Minello et al., 2019; Ye et al., 2014; Dong et al., 2019) and Rényi entropy (Pál et al., 2010; Oggier & Datta, 2021) address spectral complexity and graph clustering, respectively, while M-ILBO (Ma et al., 2023) focuses on dataset entropy. It is important to distinguish these from János Körner's original graph entropy concept (Körner, 1973), grounded in information theory and combinatorics. This concept, dating back to Shannon's work (Shannon, 1948) and further developed by Körner (1973) and Lovász (1979), involves complex computational challenges such as the vertex-packing polytope, leading to NP-hard computation. Although graph entropy is theoretically important and useful in the domains of combinatorics and information theory, it remains unexplored in the field of graph learning. Crucially, our findings indicate that graph entropy is superior in leveraging the inherent structure of graphs compared to other entropy measures such as Shannon entropy and Rényi entropy. This work introduces a novel approach called Graph Entropy Maximization (GeMax) to graph representation learning, marking the first application of graph entropy in this context.
Our approach establishes a probability distribution for a graph by incorporating its node representations and a global graph representation, learned through two graph neural networks respectively. The computation of graph entropy, however, presents a significant challenge, as it is NP-hard due to the complexity associated with the vertex-packing polytope in its definition. To tackle this challenge, we introduce a method that maximizes an approximation of graph entropy by utilizing Lovász's orthonormal representations. Our contributions are as follows.

- We introduce GeMax, a novel method for graph representation learning, marking the inaugural exploration of Körner's graph entropy within the graph learning community.
- Recognizing the NP-hard computation of graph entropy, we propose a tractable approximation for GeMax by leveraging orthonormal representations and present an alternating updating method for its optimization.
- We conduct extensive experiments to evaluate the performance of our GeMax method in comparison to the InfoMax principle (Linsker, 1988; Sun et al., 2019), the Lovász principle (Sun et al., 2023), as well as Shannon entropy and Rényi entropy maximization, in unsupervised and semi-supervised graph representation learning tasks.

2. Preliminary

In this section, we present the definition of graph entropy (Körner, 1973) and Lovász's orthonormal representations (Lovász, 1979). Graph entropy, a crucial concept in probabilistic graph theory, was first introduced by János Körner (Körner, 1973). We focus on its combinatorial definition, which revolves around the vertex-packing polytope.

An independent set in a graph $G$ is a group of vertices no two of which are adjacent. Let $B = [b_1, \ldots, b_{N_b}] \in \{0,1\}^{|V| \times N_b}$ be the indicator matrix of the independent sets of $G$, with $N_b$ being the number of such sets and each column $b_i$ indicating a specific independent set. For instance, in the pentagon graph shown in Figure 1, where $N_b = 10$, the vector $b_6 = [1, 0, 1, 0, 0]^\top$ highlighted by the blue rectangle indicates the independent set $\{v_1, v_3\}$.

Figure 1: Indicator matrix of independent sets of a pentagon.

The vertex-packing polytope $\mathrm{VP}(G)$ of graph $G$ is defined as follows.

Definition 2.1 (vertex-packing polytope). For a graph $G$ with vertex set $V$, the vertex-packing polytope $\mathrm{VP}(G)$ of $G$ is defined as the convex corner of the indicator vectors of its independent sets. Specifically, with $B \in \{0,1\}^{|V| \times N_b}$ the indicator matrix and $\lambda \in \mathbb{R}^{N_b}_+$, we define
$$\mathrm{VP}(G) := \Big\{ a \in \mathbb{R}^{|V|} : a = B\lambda,\ \lambda \geq 0,\ \textstyle\sum_i \lambda_i = 1 \Big\}.$$

In a probabilistic graph $(G, P)$, where $P$ represents the probability distribution across its vertices, defined as $P = \{P_1, P_2, \ldots, P_n\}$ with $P_i$ being the probability density of vertex $i$, graph entropy is denoted as $H_k(G, P)$. Based on $\mathrm{VP}(G)$, the definition of $H_k(G, P)$ is as follows.

Definition 2.2 (Graph Entropy (Körner, 1973)). For a probabilistic graph $(G, P)$, its entropy with respect to $P$ is
$$H_k(G, P) := \min_{a \in \mathrm{VP}(G)} \ \sum_{i=1}^{n} P_i \log \frac{1}{a_i}.$$

László Lovász established the concept of orthonormal representations for graphs, which can be formally defined as follows.

Definition 2.3 (Set of Lovász's Orthonormal Representations (Lovász, 1979)). Consider a graph $G = (V, E)$. Each vertex $i$ in $G$ is represented by a unit vector $z_i \in \mathbb{R}^d$, indicating its $d$-dimensional representation. The set of orthonormal representations for $G$, denoted as $\mathcal{T}(G)$, is
$$\mathcal{T}(G) := \big\{ Z \in \mathbb{R}^{n \times d} : \|z_i\|_2 = 1 \ \text{for} \ i = 1, 2, \ldots, n;\ z_i^\top z_j = 0 \ \forall (i, j) \notin E \big\}.$$

A small computational sketch of these objects on a toy graph is given below.
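The following is a minimal Python sketch (not the authors' implementation) that enumerates the independent sets of the pentagon $C_5$, builds the indicator matrix $B$ from Definition 2.1, and evaluates Definition 2.2 by numerically minimizing the cross-entropy over $\mathrm{VP}(G)$ parameterized as $a = B\lambda$. It assumes networkx and scipy are available; the brute-force enumeration is only feasible for very small graphs, which is precisely the computational bottleneck discussed later.

```python
import itertools
import networkx as nx
import numpy as np
from scipy.optimize import minimize

def independent_set_indicator_matrix(G):
    """Enumerate all non-empty independent sets of G and stack their
    indicator vectors as columns of B in {0,1}^{|V| x Nb}."""
    nodes = list(G.nodes())
    cols = []
    for r in range(1, len(nodes) + 1):
        for subset in itertools.combinations(nodes, r):
            if not any(G.has_edge(u, v) for u, v in itertools.combinations(subset, 2)):
                b = np.zeros(len(nodes))
                b[[nodes.index(v) for v in subset]] = 1.0
                cols.append(b)
    return np.stack(cols, axis=1)  # shape |V| x Nb

def graph_entropy(G, P, B):
    """H_k(G, P) = min_{a in VP(G)} sum_i P_i log(1 / a_i), with
    a = B @ lam, lam >= 0, sum(lam) = 1 (Definitions 2.1 and 2.2).
    Solved here with a simplex-constrained nonlinear program; only
    practical for small graphs."""
    Nb = B.shape[1]

    def objective(lam):
        a = B @ lam
        return float(np.sum(P * np.log(1.0 / np.clip(a, 1e-12, None))))

    constraints = [{"type": "eq", "fun": lambda lam: np.sum(lam) - 1.0}]
    bounds = [(0.0, 1.0)] * Nb
    lam0 = np.full(Nb, 1.0 / Nb)
    result = minimize(objective, lam0, bounds=bounds, constraints=constraints)
    return result.fun

if __name__ == "__main__":
    G = nx.cycle_graph(5)                         # the pentagon C5
    B = independent_set_indicator_matrix(G)
    print("number of non-empty independent sets:", B.shape[1])   # 10 for C5
    P = np.full(5, 0.2)                           # uniform vertex distribution
    print("H_k(C5, uniform) ~", graph_entropy(G, P, B))
```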
3. Proposed Methods

3.1. Graph Entropy Maximization

Given a set of $N$ graphs $\mathcal{G} = \{G_1, G_2, \ldots, G_N\}$ drawn from some unknown distribution $\mathcal{D}$ on $\mathbb{G}$, we want to learn a model $F : \mathbb{G} \to \mathbb{R}^d \times \mathbb{R}^{n \times d}$ that represents each graph as a vector and represents its vertices as vectors, i.e., $(g_j, Z_j) = F(G_j)$, where $g_j \in \mathbb{R}^d$ and $Z_j \in \mathbb{R}^{n_j \times d}$ denote the graph-level and node-level representations of $G_j$ respectively, and $n_j$ is the number of nodes of $G_j$. $F$ should capture the important information of the underlying distribution $\mathcal{D}$, and $g_1, g_2, \ldots, g_N$ should be useful in downstream tasks such as graph classification. A fundamental question is: how do we quantify the goodness of $(g_j, Z_j)$?

We propose to use graph entropy $H_k$, defined in Definition 2.2, as a measure of the goodness of graph representations. We have the following observation. Graph entropy serves as an indicator of the inherent uncertainty present in a probabilistic graph $(G, P)$, according to the minimal possible code rate definition in graph theory (Körner, 1973). By Definition 2.2, we can regard $H_k(G, P)$ as the minimum cross-entropy between $P$ and all possible $a$ in $\mathrm{VP}(G)$, though $\sum_i a_i$ may not be 1. Thus, $H_k(G, P)$ measures the minimum inconsistency between $P$ and the combinations of independent sets of $G$. Graph entropy, if computationally feasible, is particularly valuable for graph representation learning, as it directly ties the entropy measure to the graph's structure through $\mathrm{VP}(G)$. In contrast, other commonly used entropy measures such as Shannon entropy are defined primarily on a distribution, without direct consideration of the graph's structural aspects. Consequently, these traditional entropy measures may not be effective in producing graph representations with rich structural information.

In Definition 2.2, $H_k$ is based on the graph structure $G$ and the vertex distribution $P$, where the former is given and fixed but the latter is unknown and should be determined if possible. We define $P$ using $(g, Z)$, i.e., $P$ is a function of $F(G)$, formulated as $P_F(G)$, connecting graph entropy and graph representations. Then we propose to find an $F$ that yields the representations with maximum graph entropy:
$$\max_{F \in \mathcal{F}} \ \mathbb{E}_{G \sim \mathcal{D}} \, H_k\big(G, P_F(G)\big), \qquad (1)$$
where $\mathcal{F}$ denotes some hypothesis space. Based on the previous discussion, maximizing graph entropy implies that $(g, Z)$ makes $P$ inconsistent with the combinations of independent sets of $G$, thereby preserving the connection information of $G$. A common model $F(G)$ applied to all graphs may make the representations of different graphs similar, but maximizing graph entropy preserves discriminative information of the graphs that is useful for downstream tasks such as graph classification.

In this work, we let $F$ be a GNN. Specifically, denote $\mathbb{A}$ as the space of the adjacency matrix $A$ and $\mathbb{X}$ as the space of the node feature matrix $X$. Denote $F = (F_g, F_Z)$ and let $F_g(\cdot, \cdot; \theta) : \mathbb{A} \times \mathbb{X} \to \mathbb{R}^d$ be a GNN with parameters $\theta$ for graph-level representation learning and $F_Z(\cdot, \cdot; \phi) : \mathbb{A} \times \mathbb{X} \to \mathbb{R}^{n \times d}$ be another GNN with parameters $\phi$ for node-level representation learning. For $G \in \mathbb{G}$ with adjacency matrix $A$ and feature matrix $X$, we obtain
$$g^\theta = F_g(A, X; \theta) \quad \text{and} \quad Z^\phi = F_Z(A, X; \phi). \qquad (2)$$
Then we denote the probability distribution over the nodes of $G$ as $P_F(G) = [P_1(g^\theta, Z^\phi), \ldots, P_n(g^\theta, Z^\phi)] \equiv P(A, X; \theta, \phi)$, where $0 \leq P_i(g^\theta, Z^\phi) \leq 1$, $\sum_i P_i(g^\theta, Z^\phi) = 1$, and $P_i(A, X; \theta, \phi) \equiv P_i(g^\theta, Z^\phi)$. Here we can define $P_F(G)$ as a Boltzmann distribution:
$$P_i(g^\theta, Z^\phi) := \frac{\exp\big(-\|z_i^\phi - g^\theta\|_2^2\big)}{\sum_{l \in V} \exp\big(-\|z_l^\phi - g^\theta\|_2^2\big)}, \quad i \in V. \qquad (3)$$
Note that instead of the Boltzmann distribution, one may use other distributions such as
$$P_i(g^\theta, Z^\phi) = \frac{\big(1 + \|z_i^\phi - g^\theta\|^2\big)^{-1}}{\sum_{l \in V} \big(1 + \|z_l^\phi - g^\theta\|^2\big)^{-1}}.$$
A minimal sketch of these two choices is given below.
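As a concrete illustration of Eq. (3) and the alternative heavy-tailed variant, here is a small PyTorch sketch. It is a toy example under assumed tensor shapes: the two encoders $F_g$ and $F_Z$ are not implemented here, and the inputs `g` and `Z` stand in for their outputs.

```python
import torch

def node_distribution_boltzmann(g, Z):
    """Eq. (3): P_i = exp(-||z_i - g||^2) / sum_l exp(-||z_l - g||^2).
    g: (d,) graph-level representation; Z: (n, d) node-level representations."""
    sq_dist = ((Z - g.unsqueeze(0)) ** 2).sum(dim=1)   # (n,)
    return torch.softmax(-sq_dist, dim=0)              # (n,), sums to 1

def node_distribution_heavy_tailed(g, Z):
    """Alternative choice mentioned in the text:
    P_i = (1 + ||z_i - g||^2)^{-1} / sum_l (1 + ||z_l - g||^2)^{-1}."""
    sq_dist = ((Z - g.unsqueeze(0)) ** 2).sum(dim=1)
    w = 1.0 / (1.0 + sq_dist)
    return w / w.sum()

if __name__ == "__main__":
    torch.manual_seed(0)
    n, d = 6, 8                                        # toy sizes (assumptions)
    Z = torch.nn.functional.normalize(torch.randn(n, d), dim=1)  # unit-norm node reps
    g = torch.randn(d)
    P = node_distribution_boltzmann(g, Z)
    print(P, P.sum())                                  # a valid distribution over the n nodes
```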
We empirically find that our method is not sensitive to the definition of $P_F(G)$, possibly due to the high expressive ability of neural networks. Now, substituting the definitions of $F$ and $P$ into (1), we solve
$$\max_{\theta, \phi} \ \sum_{j=1}^{N} H_k\big(G_j, P(A_j, X_j; \theta, \phi)\big) \qquad (4)$$
or equivalently
$$\max_{\theta, \phi} \ \sum_{j=1}^{N} \min_{a \in \mathrm{VP}(G_j)} \ \sum_{i=1}^{n_j} P_i(A_j, X_j; \theta, \phi) \log \frac{1}{a_i}, \qquad (5)$$
which is our Graph Entropy Maximization (GeMax) method for graph representation learning. The computation of graph entropy and the procedure of the proposed GeMax method are illustrated in Figure 2.

Figure 2: The computation of graph entropy on toy graphs and an illustration of the proposed GeMax method.

However, directly solving GeMax is computationally challenging for large graphs because computing the vertex-packing polytope $\mathrm{VP}(G)$ involves computing the indicator matrix $B$ of independent sets, which is equivalent to solving the NP-hard graph coloring problem (Jensen & Toft, 2011). Previous studies (Rezaei, 2013) typically utilized graph entropy $H_k$ for theoretical analysis and provided its upper and lower bounds instead of computing it exactly. The best-known lower bound of $H_k$ (Boreland, 2018) is
$$H(P) - \log \alpha(G) \ \leq \ H_k(G, P), \qquad (6)$$
where $\alpha(G)$ is the independence number and $H(P)$ is the Shannon entropy. Since $\alpha(G)$ remains constant for a given $G$, maximizing this lower bound is equivalent to maximizing the Shannon entropy over the vertex set. This approach does not directly capture any topological information of the graph's structure. The experiments in Section 5.2 will show the unsatisfactory performance of this approach. To overcome this limitation, we introduce an approximation method for graph entropy maximization in the following two sections.

3.2. Approximation Method for GeMax

To leverage the information from the vertex-packing polytope while avoiding NP-hard calculations, we optimize objective (4) over a subset of $\mathrm{VP}(G)$, denoted as $\mathrm{VP}_{\mathrm{Sub}}(G)$.

Definition 3.1 (Subset of VP(G)). Let $\mathbb{1}(\cdot)$ be an element-wise indicator function such that $[\mathbb{1}(a)]_i = 1$ if $a_i > 0$ and $[\mathbb{1}(a)]_i = 0$ if $a_i = 0$. We define a subset of $\mathrm{VP}(G)$ as
$$\mathrm{VP}_{\mathrm{Sub}}(G) := \big\{ a \in \mathbb{R}^{|V|} : \mathbb{1}(a) \in \mathrm{VP}(G),\ 0 \leq a_i \leq 1 \big\}.$$

The following proposition indicates that $\mathrm{VP}_{\mathrm{Sub}}(G)$ maintains two important properties of $\mathrm{VP}(G)$.

Proposition 3.2. 1) $\mathrm{VP}_{\mathrm{Sub}}(G)$ is a convex corner; 2) all the indicator vectors of independent sets of $G$ are contained in $\mathrm{VP}_{\mathrm{Sub}}(G)$, i.e., $b_i \in \mathrm{VP}_{\mathrm{Sub}}(G)$, $i = 1, 2, \ldots, N_b$.

Thus, solving GeMax, namely (5), over $\mathrm{VP}_{\mathrm{Sub}}(G)$ instead of $\mathrm{VP}(G)$ can also leverage the information of the independent sets of $G$. Nevertheless, this subset still necessitates calculating the independent-set indicator matrix $B$, which remains NP-hard. To address this challenge, we define $P_F(G)$ on the set of orthonormal representations presented in Definition 2.3, rather than on the entire real space. The following theorem provides the connection between orthonormal representations and our $\mathrm{VP}_{\mathrm{Sub}}(G)$.

Theorem 3.3. Let $D_a = \mathrm{diag}(a) = \mathrm{diag}(a_1, a_2, \ldots, a_n)$ with $0 \leq a_i \leq 1$ for all $i \in [n]$ and $Z = [z_1, z_2, \ldots, z_n]^\top \in \mathbb{R}^{n \times d}$. If $Z \in \mathcal{T}(G)$ and $z_i^\top z_j \neq 0$ for all $(i, j) \in E$, then $D_a (Z Z^\top) D_a = D_a^2$ if and only if $a \in \mathrm{VP}_{\mathrm{Sub}}(G)$.

According to Theorem 3.3, solving GeMax (5) over $\mathrm{VP}_{\mathrm{Sub}}(G)$ is equivalent to solving the following problem:
$$\max_{\theta, \phi} \ \sum_{j=1}^{N} \min_{a_j} \ \sum_{i=1}^{n_j} P_i(g_j^\theta, Z_j^\phi) \log \frac{1}{a_j(i)},$$
$$\text{s.t.} \quad Z_j^\phi \in \mathcal{T}(G_j), \quad D_{a_j}\big(Z_j^\phi (Z_j^\phi)^\top\big) D_{a_j} = D_{a_j}^2, \quad 0 \leq a_{ij} \leq 1, \ i \in [n_j], \ j \in [N]. \qquad (7)$$
A small numerical illustration of the condition in Theorem 3.3 is given below.
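The following toy numerical check (not taken from the paper) illustrates the condition in Theorem 3.3 on the path graph with vertices 0-1-2, using a hand-crafted orthonormal representation in which the non-adjacent pair is orthogonal and adjacent pairs are deliberately non-orthogonal.

```python
import numpy as np

# A toy orthonormal representation (Definition 2.3) of the path graph 0-1-2:
# edges (0,1) and (1,2); (0,2) is NOT an edge, so z_0 and z_2 must be orthogonal,
# while adjacent pairs are non-orthogonal, matching the premise of Theorem 3.3.
Z = np.array([
    [1.0, 0.0],                              # z_0
    [1.0 / np.sqrt(2), 1.0 / np.sqrt(2)],    # z_1
    [0.0, 1.0],                              # z_2
])

def condition_holds(a, Z, tol=1e-9):
    """Check the Theorem 3.3 condition D_a (Z Z^T) D_a == D_a^2."""
    Da = np.diag(a)
    lhs = Da @ (Z @ Z.T) @ Da
    rhs = Da @ Da
    return np.allclose(lhs, rhs, atol=tol)

# {0, 2} is an independent set, so its indicator lies in VPSub(G): condition holds.
print(condition_holds(np.array([1.0, 0.0, 1.0]), Z))   # True
# {0, 1} is NOT independent (0 and 1 are adjacent): condition fails.
print(condition_holds(np.array([1.0, 1.0, 0.0]), Z))   # False
# Vectors in [0,1]^n with the same support as an independent set also satisfy it.
print(condition_holds(np.array([0.5, 0.0, 0.8]), Z))   # True
```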
Note that here we do not need to enforce the condition $z_i^\top z_j \neq 0$ for $(i, j) \in E$ explicitly, because in a GNN the connected nodes always share some information, which means $z_i^\top z_j \neq 0$ holds in practice. Objective (7) serves as an effective approximation to objective (5), surpassing the approach of merely maximizing the lower bound (6) of graph entropy in graph representation learning. The employment of orthonormal representations preserves non-adjacency characteristics and hence pursues diverse representations, consistent with the information maximization goal of GeMax. In fact, (7) is graph entropy maximization under the condition that $P_F(G)$ is defined over orthonormal representations.

We can use the Lagrange multiplier method or an exact penalty method to solve the constrained optimization problem (7). In this work, we propose to relax (7) to a regularized optimization problem, for which the optimization is much easier than the constrained optimization. Specifically, for the orthonormality constraint $Z_j^\phi \in \mathcal{T}(G_j)$, we define the following regularization:
$$\mathcal{L}_{\mathrm{orth}}(\mathcal{G}; \phi) := \sum_{j=1}^{N} \big\| M_j \odot \big(Z_j^\phi (Z_j^\phi)^\top\big) - I_{n_j} \big\|_F^2, \qquad (8)$$
where $M_j = 1_{n_j \times n_j} - A_j$ is a binary mask matrix, $I_{n_j}$ is an identity matrix of size $n_j \times n_j$, and the operator $\odot$ denotes the Hadamard product. For the constraints $D_{a_j}(Z_j^\phi (Z_j^\phi)^\top) D_{a_j} = D_{a_j}^2$ and $0 \leq a_{ij} \leq 1$, we define the following regularization, termed the sub-vertex-packing polytope loss:
$$\mathcal{L}_{\text{s-vp}}(\mathcal{G}; \theta, \phi, \mathcal{A}) := \sum_{j=1}^{N} \big\| D_{a_j} \big(Z_j^\phi (Z_j^\phi)^\top\big) D_{a_j} - D_{a_j}^2 \big\|_F^2, \quad \text{s.t.} \ 0 \leq a_{ij} \leq 1, \ i \in [n_j], \ a_j \in \mathcal{A}, \qquad (9)$$
where $D_{a_j} = \mathrm{diag}(a_j)$ and $\mathcal{A} = \{a_1, \ldots, a_N\}$. Note that the constraints $0 \leq a_{ij} \leq 1$ can be easily handled through a projection method. For convenience, we denote the graph entropy objective in (7) as
$$\mathcal{L}_{H_k}(\mathcal{G}; \theta, \phi, \mathcal{A}) = \sum_{j=1}^{N} \sum_{i=1}^{n_j} P_i(g_j^\theta, Z_j^\phi) \log \frac{1}{a_j(i)}. \qquad (10)$$
Then the objective for $\theta$ and $\phi$ is
$$J_1(\mathcal{G}; \theta, \phi, \mathcal{A}) := \mathcal{L}_{H_k}(\mathcal{G}; \theta, \phi, \mathcal{A}) - \mu \, \mathcal{L}_{\mathrm{orth}}(\mathcal{G}; \phi) - \gamma \, \mathcal{L}_{\text{s-vp}}(\mathcal{G}; \theta, \phi, \mathcal{A}), \qquad (11)$$
while the objective for $\mathcal{A}$ is
$$J_2(\mathcal{G}; \theta, \phi, \mathcal{A}) := \mathcal{L}_{H_k}(\mathcal{G}; \theta, \phi, \mathcal{A}) + \gamma \, \mathcal{L}_{\text{s-vp}}(\mathcal{G}; \theta, \phi, \mathcal{A}), \quad \text{s.t.} \ 0 \leq a_{ij} \leq 1, \ i \in [n_j], \ a_j \in \mathcal{A}, \qquad (12)$$
where $\mu > 0$ and $\gamma > 0$ are regularization hyperparameters.

3.3. Optimization Algorithm

We propose to use alternating updating to solve (11) and (12), namely, to alternately optimize $\mathcal{A}$ and $(\theta, \phi)$. Suppose at iteration $t$ we have $\mathcal{A}^{(t)} = \{a_1^{(t)}, \ldots, a_N^{(t)}\}$ where $a_j^{(t)} \in \mathrm{VP}_{\mathrm{Sub}}(G_j)$ for all $j \in [N]$. At iteration $t+1$, we fix $\mathcal{A}^{(t)}$ and solve the following sub-problem:
$$\theta^{(t+1)}, \phi^{(t+1)} = \arg\max_{\theta, \phi} \ J_1(\mathcal{G}; \theta, \phi, \mathcal{A}^{(t)}). \qquad (13)$$
Then, fixing $\theta^{(t+1)}$ and $\phi^{(t+1)}$, we solve
$$\mathcal{A}^{(t+1)} = \arg\min_{\mathcal{A} \in \mathcal{C}} \ J_2(\mathcal{G}; \theta^{(t+1)}, \phi^{(t+1)}, \mathcal{A}), \qquad (14)$$
where $\mathcal{C}$ denotes the constraints $0 \leq a_{ij} \leq 1$ for each vector in $\mathcal{A}$. This can be solved by projected gradient descent. Specifically, at iteration $s+1$ of the inner loop of this sub-problem, for each vector in $\mathcal{A}$, let
$$a^{(s+1)} = \mathrm{Proj}_{[0,1]}\big( a^{(s)} - \epsilon \, \delta^{(s+1)} \big), \qquad (15)$$
where $\delta^{(s+1)}$ is the gradient of $J_2$ with respect to $a$, $\epsilon$ denotes the step size, and the element-wise projection operator $\mathrm{Proj}_{[0,1]}$ is
$$\mathrm{Proj}_{[0,1]}(\tilde a) = \begin{cases} 0, & \text{if } \tilde a \leq 0, \\ 1, & \text{if } \tilde a \geq 1, \\ \tilde a, & \text{otherwise.} \end{cases} \qquad (16)$$

Algorithm 1 Optimization for GeMax (11) and (12)
Input: $\mathcal{G}$, $\mu$, $\gamma$; set $t = 0$.
1: Randomly initialize the parameters $\theta^{(0)}$, $\phi^{(0)}$.
2: Let $\mathcal{A}^{(0)} = \{a_j^{(0)}\}_{j=1}^{N} = \{[1/n_j, \ldots, 1/n_j]\}_{j=1}^{N}$.
3: repeat
4: $\quad \theta^{(t+1)}, \phi^{(t+1)} = \arg\max_{\theta, \phi} J_1(\mathcal{G}; \theta, \phi, \mathcal{A}^{(t)})$.
5: $\quad \mathcal{A}^{(t+1)} = \arg\min_{\mathcal{A} \in \mathcal{C}} J_2(\mathcal{G}; \theta^{(t+1)}, \phi^{(t+1)}, \mathcal{A})$.
6: $\quad t \leftarrow t + 1$.
7: until convergence conditions are met
Output: $\theta^{(t+1)}, \phi^{(t+1)}$.

A PyTorch-style sketch of the loss terms (8)-(12) and the projection (16) is given below.
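The sketch below is a minimal reading of the formulas above for a single graph (summing over a mini-batch of graphs in practice); it is not the released implementation, and the toy inputs in the demo are assumptions.

```python
import torch

def orthonormal_loss(Z, A):
    """L_orth, Eq. (8), per graph: M = 1 - A masks out adjacent pairs;
    the masked Gram matrix of the node representations should equal I."""
    n = Z.shape[0]
    M = torch.ones(n, n, device=Z.device) - A
    return ((M * (Z @ Z.T) - torch.eye(n, device=Z.device)) ** 2).sum()

def sub_vertex_packing_loss(Z, a):
    """L_s-vp, Eq. (9), per graph: || D_a (Z Z^T) D_a - D_a^2 ||_F^2, 0 <= a <= 1."""
    Da = torch.diag(a)
    return ((Da @ (Z @ Z.T) @ Da - Da @ Da) ** 2).sum()

def graph_entropy_term(P, a, eps=1e-8):
    """L_Hk, Eq. (10), per graph: sum_i P_i log(1 / a_i)."""
    return -(P * torch.log(a.clamp_min(eps))).sum()

def project_unit_interval(a):
    """Proj_[0,1] in Eq. (16), applied element-wise."""
    return a.clamp(0.0, 1.0)

def J1(P, Z, A, a, mu=0.5, gamma=0.5):
    """Objective (11), maximized over the GNN parameters (theta, phi)."""
    return graph_entropy_term(P, a) - mu * orthonormal_loss(Z, A) \
           - gamma * sub_vertex_packing_loss(Z, a)

def J2(P, Z, a, gamma=0.5):
    """Objective (12), minimized over a subject to 0 <= a <= 1."""
    return graph_entropy_term(P, a) + gamma * sub_vertex_packing_loss(Z, a)

if __name__ == "__main__":
    torch.manual_seed(0)
    n, d = 5, 8                                           # toy sizes (assumptions)
    A = torch.zeros(n, n)
    A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = 1.0           # toy adjacency
    Z = torch.nn.functional.normalize(torch.randn(n, d), dim=1)
    g = torch.randn(d)
    P = torch.softmax(-((Z - g) ** 2).sum(dim=1), dim=0)  # Eq. (3)
    a = torch.full((n,), 0.5)
    print(J1(P, Z, A, a).item(), J2(P, Z, a).item())
```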
Algorithm 1 summarizes the whole optimization procedure. In our experiments, we set $\mu = \gamma = 0.5$. Note that the solution of the regularized optimization cannot exactly satisfy the constraints in (7). To ensure the constraints, one may consider using an inexact penalty method to solve the problem, which simply increases the penalty coefficients gradually during the iterative optimization. We have found that the representations given by the regularized optimization are as good as those given by the constrained optimization. The comparison experiments between regularized and constrained optimization are in Appendix D.4.

The time and space complexities of Algorithm 1 are given by the following proposition.

Proposition 3.4. Without loss of generality, suppose $n_1 = n_2 = \cdots = n$, $|E_1| = |E_2| = \cdots = m$, both $F_g$ and $F_Z$ have $L - 1$ hidden layers, the widths of the hidden layers and the input feature dimension of the graphs are all $d$, the numbers of iterations of the sub-problems for $\{\theta, \phi\}$ and $\mathcal{A}$ are $\tau_1$ and $\tau_2$ respectively, and the number of outer-loop iterations of Algorithm 1 is $T$. Then the space complexity of the algorithm is $O(N(m + Ldn + n^2) + Ld^2)$ and the time complexity of the algorithm is $O(T(\tau_1 N(L(dm + d^2 n) + dn^2) + \tau_2 N n^2))$.

Note that for mini-batch optimization, the $N$ in the proposition should be replaced by the batch size $B$. In general, the space and time complexities of Algorithm 1 are linear in the number of graphs. Therefore, our method GeMax is scalable to large datasets provided that $n$ is not too large. A toy end-to-end sketch of the alternating scheme in Algorithm 1, including a gradually increasing penalty schedule, is given below.
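The following self-contained sketch puts the pieces together for a single toy graph: an alternating loop in the spirit of Algorithm 1, with stand-in linear encoders instead of the 5-layer GIN used in the experiments, plus an optional gradually increasing penalty schedule illustrating the inexact-penalty variant mentioned above. All sizes, learning rates, and iteration counts are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
n, d = 8, 16
A = (torch.rand(n, n) < 0.3).float().triu(1)      # toy random adjacency (assumption)
A = A + A.T
X = torch.randn(n, d)                             # toy node features

# Stand-ins for the two encoders F_g and F_Z; a real implementation would use
# message-passing GNNs. One linear layer each keeps the sketch short.
F_Z = torch.nn.Linear(d, d)
F_g = torch.nn.Linear(d, d)

a = torch.full((n,), 1.0 / n, requires_grad=True)  # A^(0) in Algorithm 1
opt_net = torch.optim.Adam(list(F_Z.parameters()) + list(F_g.parameters()), lr=1e-3)
opt_a = torch.optim.SGD([a], lr=1e-2)
mu, gamma = 0.5, 0.5

def losses():
    Z = torch.nn.functional.normalize(F_Z(X), dim=1)          # unit-norm node reps
    g = F_g(X.mean(dim=0))                                     # mean-pooled graph rep
    P = torch.softmax(-((Z - g) ** 2).sum(dim=1), dim=0)       # Eq. (3)
    l_hk = -(P * torch.log(a.clamp_min(1e-8))).sum()           # Eq. (10)
    M = 1.0 - A
    l_orth = ((M * (Z @ Z.T) - torch.eye(n)) ** 2).sum()       # Eq. (8)
    Da = torch.diag(a)
    l_svp = ((Da @ (Z @ Z.T) @ Da - Da @ Da) ** 2).sum()       # Eq. (9)
    return l_hk, l_orth, l_svp

for t in range(50):                                            # outer loop of Algorithm 1
    for _ in range(5):                                         # update (theta, phi): ascend J1
        l_hk, l_orth, l_svp = losses()
        opt_net.zero_grad()
        (-(l_hk - mu * l_orth - gamma * l_svp)).backward()
        opt_net.step()
    for _ in range(5):                                         # update a: descend J2
        l_hk, _, l_svp = losses()
        opt_a.zero_grad()
        (l_hk + gamma * l_svp).backward()
        opt_a.step()
        with torch.no_grad():
            a.clamp_(0.0, 1.0)                                 # projection, Eq. (16)
    mu *= 1.05                                                 # optional: gradually increase the
    gamma *= 1.05                                              # penalties (inexact-penalty variant)
```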
4. Related Work

4.1. Previous Graph Representation Learning Principles

InfoMax Principle. InfoMax (Linsker, 1988) is highly influential in current unsupervised graph representation learning and serves as a foundation for numerous methods, including InfoGraph (Sun et al., 2019), GraphCL (You et al., 2020), AD-GCL (Suresh et al., 2021), JOAO (You et al., 2021), AutoGCL (Yin et al., 2022), GraphACL (Luo et al., 2023a), InfoGCL (Xu et al., 2021), GCS (Wei et al., 2023), and ImGCL (Zeng et al., 2023). Taking InfoGraph (Sun et al., 2019) as an example, it maximizes the mutual information between graph-level and node-level representations:
$$\phi^*, \theta^*, \varphi^* = \arg\max_{\phi, \theta, \varphi} \ \sum_{j=1}^{N} \sum_{i \in V_j} I_{\varphi}\big(g_j^\theta, z_{ij}^\phi\big), \qquad (17)$$
where $I_\varphi$ is the Jensen-Shannon MI estimator. More details of InfoMax can be found in Appendix D.1.

Lovász Principle. The Lovász principle (Sun et al., 2023), based on the Lovász number $\vartheta(G)$ (Lovász, 1979) of a graph $G$, uses the handle vector $c$ as the graph representation. The formulation is
$$\phi^*, \theta^* = \arg\min_{\phi, \theta} \ \sum_{j=1}^{N} \max_{i \in V_j} \frac{1}{\big((z_i^\phi)^\top g_j^\theta\big)^2} + \eta \, \ell_{\mathrm{orth}}(\mathcal{G}; \theta, \phi), \qquad (18)$$
where $\ell_{\mathrm{orth}}(\mathcal{G}; \theta, \phi)$ is the orthonormal regularization. It was shown by Sun et al. (2023) that the Lovász principle often outperforms the InfoMax principle in graph representation learning. Both the Lovász principle and the InfoMax principle will be compared with our GeMax in Section 5.

Information Bottleneck Principle. Graph Information Bottleneck (GIB) (Wu et al., 2020) and Subgraph Information Bottleneck (SIB) (Yu et al., 2021) aim to obtain minimal yet sufficient representations for downstream tasks. However, their effectiveness diminishes in the absence of predefined downstream tasks during the learning stage, limiting their suitability for unsupervised and semi-supervised graph learning. Therefore, we will not include GIB and SIB in the comparison experiments.

4.2. Entropy-Based Graph Learning Methods

Structural Entropy. Structural entropy (Mowshowitz & Dehmer, 2012) leverages Shannon entropy and node structural components like node degree, and is widely used for topological analysis (Luo et al., 2021; Yang et al., 2023; Wang et al., 2023; Wu et al., 2022; Zou et al., 2023; Fang et al., 2021). Another key metric, edge entropy (Jiang et al., 2020; Wang et al., 2021; Grebenkina et al., 2018; Aziz et al., 2020; Luo et al., 2023b), focuses on edge interconnectivity to assess graph structures. For instance, the degree entropy of a graph $G$ with vertex set $V$ and node degrees $d_i$ is
$$H_{\mathrm{deg}}(P) = -\sum_{i \in V} P_i \log P_i, \quad \text{where } P_i = \frac{d_i}{\sum_{j \in V} d_j}. \qquad (19)$$
The degree entropy of a graph $G$, inherently fixed by its degree structure, presents reformulation challenges in graph representation learning. It is notably complex to adapt this entropy into a format suitable for optimization via node and graph representations. Similar challenges extend to other structural entropy forms. Consequently, these entropies are excluded from our comparative experiments due to these inherent limitations.

Non-Structural Entropy. Non-structural entropy methods, such as Shannon entropy (Shannon, 1948), Rényi entropy (Pál et al., 2010; Oggier & Datta, 2021) used in graph clustering, and von Neumann entropy (Liu et al., 2021; 2022a; Passerini & Severini, 2008; Minello et al., 2019; Ye et al., 2014; Dong et al., 2019) for spectral complexity, differ significantly from structural approaches. M-ILBO (Ma et al., 2023) calculates graph dataset entropy by optimizing an Information Lower Bound (ILBO). These entropies are defined on probability distributions, such as the one in Eq. (3) for a graph $G$. Thus, they can be reformulated as objective functions for graph representation learning. For instance, the Shannon entropy representation objective is
$$J_{\mathrm{Sh}}(\mathcal{G}; \theta, \phi) = -\sum_{j=1}^{N} \sum_{i \in V_j} P_i(g^\theta, Z^\phi) \log\big(P_i(g^\theta, Z^\phi)\big), \qquad (20)$$
and the Rényi entropy representation objective can be expressed as
$$J_{\text{Rényi}}(\mathcal{G}; \theta, \phi) = \sum_{j=1}^{N} \frac{1}{1 - \alpha} \log \sum_{i \in V_j} \big(P_i(g^\theta, Z^\phi)\big)^{\alpha}, \qquad (21)$$
where $\alpha > 0$, $\alpha \neq 1$, is the order of the Rényi entropy (as $\alpha \to 1$, it reduces to the Shannon entropy). However, some types, like von Neumann entropy, are less suited for graph representation learning due to their reliance on eigenvalues. Both Shannon and Rényi entropies will be compared with GeMax in Section 5. A minimal sketch of these two baseline objectives is given below.
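For reference, the two non-structural baseline objectives (20) and (21) can be computed directly from the node distribution $P$. The sketch below is a minimal, per-graph version in which the Rényi order $\alpha$ is treated as a free hyperparameter (here $\alpha = 2$ purely for illustration, an assumption rather than the paper's setting).

```python
import torch

def shannon_objective(P, eps=1e-12):
    """Eq. (20), per graph: Shannon entropy of the node distribution, -sum_i P_i log P_i."""
    return -(P * torch.log(P.clamp_min(eps))).sum()

def renyi_objective(P, alpha=2.0, eps=1e-12):
    """Eq. (21), per graph: Renyi entropy of order alpha (alpha > 0, alpha != 1),
    (1 / (1 - alpha)) * log sum_i P_i^alpha."""
    return torch.log((P.clamp_min(eps) ** alpha).sum()) / (1.0 - alpha)

if __name__ == "__main__":
    torch.manual_seed(0)
    P = torch.softmax(torch.randn(10), dim=0)     # any node distribution, e.g. Eq. (3)
    print(shannon_objective(P).item(), renyi_objective(P, alpha=2.0).item())
```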
5. Experiments

In this section, we evaluate the effectiveness of our GeMax method in graph learning tasks, including unsupervised and semi-supervised representation learning, on TUDataset (Morris et al., 2020). The statistics of the considered graph datasets are in Table 1.

Table 1: Statistics of TUDataset (Morris et al., 2020)

| Name | # of graphs | # of classes | Avg. # of nodes | Node labels | Node attributes |
|---|---|---|---|---|---|
| MUTAG | 188 | 2 | 17.9 | yes | no |
| PROTEINS | 1113 | 2 | 39.1 | yes | yes |
| DD | 1178 | 2 | 284.32 | yes | no |
| NCI1 | 4110 | 2 | 29.9 | yes | no |
| COLLAB | 5000 | 3 | 74.49 | no | no |
| IMDB-B | 1000 | 2 | 19.8 | no | no |
| REDDIT-B | 2000 | 2 | 429.63 | no | no |
| REDDIT-M5K | 4999 | 5 | 508.52 | no | no |

5.1. Comparison with InfoMax and Lovász Principles

Our objective is to evaluate three unsupervised graph representation learning principles: InfoMax (17), the Lovász principle (18), and our GeMax (5). To facilitate fair comparisons, we maintain the neural network architectures used in the InfoMax methods while substituting their objective with either the Lovász principle (18) or our GeMax (5), keeping all other structures and parameters unchanged. It is important to note that information bottleneck principles, being task-specific, are not included in this comparison.

Unsupervised Learning. Following (Sun et al., 2019; Luo et al., 2023a; Wei et al., 2023; Sun et al., 2023), we train a graph representation model on unlabeled data to obtain graph representations and use these representations and the graph labels to train an SVM classifier. Our experimental setup follows GraphACL (Luo et al., 2023a). In fact, all six InfoMax methods, including GraphCL (You et al., 2020), AD-GCL (Suresh et al., 2021), JOAOv2 (You et al., 2021), AutoGCL (Yin et al., 2022), GraphACL (Luo et al., 2023a), and GCS (Wei et al., 2023), are based on the architecture of InfoGraph (Sun et al., 2019). Specifically, they use a 5-layer GIN (Xu et al., 2018) with hidden size 128 as the representation model. The model is trained with a batch size of 128 and a learning rate of 0.001. The contrastive learning methods (e.g., JOAOv2 and AutoGCL) use 30 epochs of contrastive pre-training under the naive strategy. We repeated the experiments 10 times with different random seeds. Each time, we performed 10-fold cross-validation on each dataset. In each fold, we use 90% of the total data as unlabeled data for contrastive pre-training and 10% as labeled testing data. More details are in Appendix D.1, and the sensitivity analysis for hyperparameters is in Appendix D.5. The average classification accuracy (ACC) with standard deviation is reported in Table 2, while the average results across all methods are shown in Figure 3a. Our GeMax outperformed the InfoMax and Lovász principles in almost all cases.

Semi-supervised Learning. Following (Sun et al., 2019; Yin et al., 2022; Sun et al., 2023), we compare the three unsupervised graph representation learning principles in semi-supervised learning tasks. The loss function of the semi-supervised InfoMax methods is shown by (33) in Appendix D.1. To ensure fair comparisons, we follow (33) and replace the InfoMax objective (17) with the Lovász principle (18) or our GeMax (5). Following the settings of AutoGCL (Yin et al., 2022), we employ 10-fold cross-validation on each dataset. For each fold, we use 80% of the total data as unlabeled data, 10% as labeled training data, and 10% as labeled testing data. The classifier for the labeled data is a ResGCN (Chen et al., 2019) with 5 layers and a hidden size of 128. More details about the semi-supervised learning are in Appendix D.1. We repeat each experiment 10 times and report the average classification accuracy in Table 3. The average results across all methods are shown in Figure 3b. Consistent with the unsupervised learning tasks, our GeMax still outperformed the InfoMax and Lovász principles in semi-supervised graph representation learning.

5.2. Comparison of Different Entropy Measures

We evaluate the efficacy of graph entropy against other measures in graph representation learning. Despite various entropies being explored in GNNs, a standardized approach for graph representation learning remains elusive. As outlined in the related work, our experiments adapt established supervised and unsupervised frameworks, substituting the GeMax objective (Eq. (5)) with the Shannon (Eq. (20)) or Rényi entropy objectives (Eq. (21)). After conducting the experiments 10 times, average accuracies are reported in Tables 4 and 6. Our visual analyses in Figures 4a and 4b indicate graph entropy's superiority in capturing topological information, outperforming the Shannon and Rényi entropies, which primarily focus on vertex-set entropy and may miss crucial structural aspects of graphs.

5.3. More Experiment Results

We provide additional experiment results and analyses in the appendix. We compare the performance of graph entropy against other entropy measures in graph representation learning tasks (Appendix D.2) and present a comparison between the exact computation of graph entropy and other graph learning principles for small graphs (Appendix D.3).
Furthermore, We explore an alternative optimization approach using the inexact penalty method (Appendix D.4), Learning Graph Representation via Graph Entropy Maximization Table 2: Classification accuracy (%) of unsupervised learning using different principles. The baseline results are from previous works. The number in bold denote the best principle among the three ones. The numbers with denote he best method for a dataset. This notation is also applied in Table 3, Table 4, and Table 6. Method Principle MUTAG PROTEINS DD NCI1 COLLAB IMDB-B REDDIT-B REDDIT-M5K Info Graph Info Max 89.01 1.13 74.44 0.31 72.85 1.78 76.20 1.06 70.65 1.13 73.03 0.87 82.50 1.42 53.46 1.03 Lov asz 89.67 1.54 75.26 1.43 74.13 1.49 78.21 1.35 71.46 1.21 73.87 1.32 84.76 1.86 54.57 1.38 Ge Max 92.44 1.23 76.87 1.99 75.43 1.10 80.21 1.83 73.25 1.17 73.31 1.53 86.45 1.43 56.16 2.40 Graph CL Info Max 86.80 1.34 74.39 0.45 78.62 0.40 77.87 0.41 71.36 1.15 71.14 0.44 89.53 0.84 55.99 0.28 Lov asz 87.24 1.96 75.87 1.17 79.14 1.67 79.13 1.27 72.52 1.37 72.44 1.46 89.87 2.13 56.12 1.73 Ge Max 88.83 1.10 77.60 1.18 78.24 1.84 80.89 1.03 74.21 1.55 74.31 1.74 90.76 1.47 57.87 1.91 AD-GCL Info Max 87.13 1.56 73.59 0.65 74.49 0.52 69.67 0.51 73.32 0.61 71.57 1.01 85.52 0.79 53.00 0.82 Lov asz 87.44 2.13 74.29 2.80 76.25 1.48 75.12 1.13 73.85 1.05 73.02 1.35 87.11 1.95 54.61 2.35 Ge Max 88.37 1.05 75.96 2.30 77.07 1.02 77.15 1.83 72.02 1.54 73.62 1.56 89.88 1.85 56.61 1.38 JOAOv2 Info Max 86.91 1.01 71.25 0.85 66.91 1.75 72.99 0.75 70.40 2.21 71.60 0.86 78.35 1.38 55.57 2.86 Lov asz 87.19 1.92 73.15 1.46 73.15 2.17 74.15 1.67 72.62 1.43 72.18 1.72 84.19 1.67 53.74 1.70 Ge Max 86.47 1.19 74.91 1.90 74.68 1.91 76.15 1.01 73.17 1.88 71.62 1.18 87.19 1.89 55.24 1.28 Auto GCL Info Max 88.64 1.08 75.80 0.36 77.57 0.60 82.00 0.29 70.12 0.68 73.30 0.40 88.58 1.49 56.75 0.18 Lov asz 89.02 1.47 76.23 1.25 78.95 1.39 82.63 2.12 71.31 1.72 73.95 1.36 89.41 1.81 57.28 1.62 Ge Max 90.63 1.15 77.12 1.75 80.17 0.87 81.11 1.85 72.08 1.35 74.03 1.91 91.84 2.06 57.86 1.72 Graph ACL Info Max 90.21 0.94 75.47 0.38 79.34 0.42 81.22 1.31 74.72 0.55 74.29 0.67 88.58 1.49 56.11 1.33 Lov asz 90.55 0.82 75.88 1.36 80.22 1.01 80.57 1.76 74.48 0.78 74.33 1.88 89.30 1.60 56.17 2.07 Ge Max 91.86 0.70 77.29 1.14 81.08 1.09 81.19 1.86 75.94 0.90 75.03 1.72 91.45 1.25 57.48 2.06 GCS Info Max 90.45 0.81 75.02 0.39 77.22 0.30 77.37 0.30 75.56 0.41 73.43 0.38 90.98 0.28 57.04 0.49 Lov asz 90.78 1.28 76.58 0.59 78.69 0.53 78.35 1.04 76.62 0.72 74.58 1.01 91.23 1.44 56.97 1.51 Ge Max 91.37 0.95 76.12 0.76 78.82 0.76 80.08 1.50 76.15 1.46 75.19 1.76 92.51 1.05 57.45 1.79 (a) Average ACC of Table 2 (Unsupervised) (b) Average ACC of Table 3 (Semi-supervised) Figure 3: Average results of graph representation learning using different principles (a) Average ACC of Table 4 (Unsupervised) (b) Average ACC of Table 6 (Semi-supervised) Figure 4: Average results of graph representation learning via maximizing different entropy objectives with results suggesting that regularized optimization yields representations of comparable quality to constrained optimization. Additional investigations include parameter sensitivity analysis (Appendix D.5), ablation studies to elucidate the significance of each component in the Ge Max objectives (Appendix D.6), and comparisons of different probability distributions for defining PF (G) (Appendix D.7). 
We also illustrate the convergence behavior of Ge Max (Appendix Learning Graph Representation via Graph Entropy Maximization Table 3: Classification accuracy (%) of semi-supervised learning using different graph learning principles. Method Principle NCI1 PROTEINS DD COLLAB REDDIT-B REDDIT-M5K Info Graph Info Max 73.79 0.44 75.13 0.74 76.99 1.29 73.79 1.25 86.50 1.37 53.66 1.85 Lov asz 72.58 1.45 74.56 1.01 77.05 1.17 74.82 0.22 88.14 1.71 51.62 0.96 Ge Max 74.89 1.21 75.87 1.94 75.17 1.95 76.67 1.39 90.67 1.99 53.08 1.54 Graph CL Info Max 74.63 0.25 74.17 0.34 76.17 1.37 74.23 0.21 89.11 0.19 52.55 0.45 Lov asz 75.46 1.53 75.12 1.87 77.46 1.52 76.12 1.15 89.87 1.68 53.69 1.68 Ge Max 76.59 1.55 76.06 1.83 78.25 1.23 77.73 1.27 91.84 1.64 54.10 3.22 AD-GCL Info Max 75.18 0.31 73.96 0.47 77.91 0.73 75.82 0.26 90.10 0.15 53.49 0.28 Lov asz 76.62 1.83 74.21 1.71 78.27 1.39 76.27 1.74 90.36 1.56 54.06 1.32 Ge Max 76.47 1.82 76.76 1.94 79.20 0.65 77.79 1.66 91.18 0.82 56.64 1.23 JOAOv2 Info Max 74.86 0.39 73.31 0.48 75.81 0.72 75.53 0.18 88.79 0.65 52.71 0.28 Lov asz 76.13 1.76 73.73 1.86 76.27 1.48 77.35 1.27 89.31 1.85 53.17 1.76 Ge Max 77.36 1.48 75.18 0.80 77.53 1.27 78.10 1.75 90.38 1.87 53.27 2.18 Auto GCL Info Max 73.75 2.25 75.65 1.40 77.50 4.41 77.16 1.48 79.80 1.47 49.91 2.70 Lov asz 75.77 1.48 76.36 1.57 78.16 1.61 77.63 1.78 84.64 1.53 51.31 1.81 Ge Max 76.86 1.98 76.23 1.47 79.61 1.21 78.21 1.27 88.43 1.57 51.63 1.97 Graph ACL Info Max 74.35 1.17 73.20 1.57 75.71 0.82 75.32 0.13 88.49 0.94 51.70 1.05 Lov asz 75.32 1.38 74.99 1.35 76.33 2.17 77.24 1.77 89.67 1.84 54.54 1.67 Ge Max 76.56 1.67 76.89 1.41 77.28 0.46 79.02 1.20 91.38 1.50 52.98 3.02 GCS Info Max 74.79 1.36 75.31 1.21 76.18 1.76 75.06 1.25 87.82 1.74 49.16 4.50 Lov asz 75.37 1.26 76.16 1.28 78.21 1.05 76.86 1.14 84.69 1.56 51.52 0.58 Ge Max 75.69 1.47 76.24 1.04 79.13 1.55 78.72 1.75 92.05 1.45 54.49 2.22 Table 4: Classification accuracy (%) of unsupervised learning via maximizing different entropy objectives Method Principle MUTAG PROTEINS DD NCI1 COLLAB IMDB-B REDDIT-B REDDIT-M5K Info Graph Shannon 84.89 1.84 69.27 1.43 71.77 1.64 72.75 1.44 68.95 1.49 69.62 1.28 81.55 1.49 49.25 1.27 R enyi 86.50 0.92 71.37 1.93 72.38 1.31 73.17 1.71 67.16 1.78 71.13 1.49 82.92 1.65 52.28 1.46 Ge Max 92.44 1.23 76.87 1.99 75.43 1.10 80.21 1.83 73.95 1.17 74.31 1.53 86.45 1.43 56.16 2.40 Graph CL Shannon 85.25 1.47 73.13 1.52 75.43 1.32 76.13 1.28 67.93 1.41 68.95 1.34 84.27 1.34 52.90 1.57 R enyi 83.19 0.76 74.24 1.75 74.14 1.44 73.11 1.96 71.73 1.29 70.64 1.26 83.91 1.47 53.24 1.21 Ge Max 88.83 1.10 77.60 1.18 78.24 1.84 80.89 1.03 74.21 1.55 74.31 1.74 90.76 1.47 57.87 1.91 AD-GCL Shannon 85.13 0.25 71.31 1.26 73.21 0.74 69.82 1.58 66.14 1.94 69.13 1.21 82.29 1.48 54.86 1.27 R enyi 82.25 1.32 72.01 0.33 71.12 1.53 72.39 1.89 69.26 1.56 68.86 1.95 81.59 1.57 54.28 1.38 Ge Max 88.37 1.05 75.96 2.30 77.07 1.02 77.15 1.83 73.02 1.54 73.62 1.56 89.88 1.85 56.61 1.38 JOAOv2 Shannon 84.01 1.69 74.15 1.05 70.81 1.35 70.99 1.28 67.11 1.81 68.22 1.87 80.30 1.82 50.37 1.26 R enyi 83.39 1.42 76.37 1.57 73.17 1.48 72.18 1.90 68.21 1.52 67.19 1.59 83.29 1.48 51.46 1.73 Ge Max 89.07 1.19 74.91 1.90 74.68 1.91 76.15 1.01 73.17 1.88 72.62 1.18 87.19 1.89 55.24 1.28 Auto GCL Shannon 81.64 1.68 73.57 1.38 72.48 1.62 74.10 1.96 67.12 1.38 69.92 1.47 82.34 1.92 51.52 1.28 R enyi 85.02 1.27 72.84 1.59 73.59 1.49 75.43 1.82 65.29 1.27 68.54 1.63 83.17 1.89 53.28 1.27 Ge Max 90.63 1.15 77.12 1.75 80.17 0.87 81.11 1.85 72.08 1.35 74.03 1.91 91.84 2.06 57.86 
1.72 Graph ACL Shannon 85.21 1.94 69.59 1.27 74.40 1.62 74.85 1.81 69.16 1.55 70.24 1.68 85.62 1.19 51.14 1.38 R enyi 86.55 1.33 73.16 1.46 72.82 1.19 73.22 1.26 67.28 1.84 72.13 1.87 83.31 1.69 53.78 2.01 Ge Max 91.86 0.70 77.29 1.14 81.08 1.09 81.19 1.86 75.94 0.90 75.03 1.72 91.45 1.25 57.48 2.06 GCS Shannon 86.12 1.64 74.17 1.43 71.89 1.32 76.13 1.35 69.81 1.52 71.16 2.28 86.80 1.18 54.14 1.29 R enyi 84.45 1.29 73.23 1.07 73.81 1.14 75.35 1.44 72.16 1.01 70.18 1.52 85.36 1.47 52.63 1.17 Ge Max 91.37 0.95 76.12 0.76 78.82 0.76 80.08 1.50 76.15 1.46 75.19 1.76 92.51 1.05 57.45 1.79 D.8) and empirically evaluate its time complexity (Appendix D.9). Lastly, we provide the code implementation of key components in Ge Max (Appendix D.10). 6. Conclusions We introduced Ge Max, a novel graph representation learning approach based on K orner s graph entropy. Ge Max uses local node and global graph representations produced by graph neural networks to efficiently approximate the NP-hard graph entropy computation. Our experiments confirmed Ge Max s superiority over its competitors, demonstrated its ability to advance graph representation learning, and showcased the practical use of theoretical concepts of graph theory in machine learning. Acknowledgements This work was supported by the National Natural Science Foundation of China under Grant No.62376236, the Guangdong Provincial Key Laboratory of Mathematical Foundations for Artificial Intelligence (2023B1212010001), Hetao Shenzhen-Hong Kong Science and Technology Innovation Cooperation Zone Project (No.HZQSWS-KCCYB2024016), and the funding UDF01001770 of The Chinese University of Hong Kong, Shenzhen. The authors appreciate the comments of the reviewers and ACs. Learning Graph Representation via Graph Entropy Maximization Impact Statement This work introduces an innovative method for graph representation learning, leveraging graph entropy to generate comprehensive graph-level and node-level representations. By approximating graph entropy through orthonormal representations, our approach enhances the capability of graph neural networks in unsupervised and semi-supervised learning tasks. This advancement possibly has broader impacts for scientific and societal benefits, such as enhancing recommendation systems and advancing biomedical research. Apart from offering possible scientific and societal benefits, to the best of our knowledge, our research works do not involve ethical constraints or religious and political restrictions. Aziz, F., Hancock, E. R., and Wilson, R. C. Network entropy using edge-based information functionals. Journal of Complex Networks, 8(3):cnaa015, 2020. Bai, Y., Ding, H., Qiao, Y., Marinovic, A., Gu, K., Chen, T., Sun, Y., and Wang, W. Unsupervised inductive graphlevel representation learning via graph-graph proximity. ar Xiv preprint ar Xiv:1904.01098, 2019. Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In International conference on machine learning, pp. 531 540. PMLR, 2018. Boreland, G. A lower bound on graph entropy. In Mathematical Proceedings of the Royal Irish Academy, volume 118, pp. 9 20. Royal Irish Academy, 2018. Borgwardt, K. M., Ong, C. S., Sch onauer, S., Vishwanathan, S., Smola, A. J., and Kriegel, H.-P. Protein function prediction via graph kernels. Bioinformatics, 21(suppl 1): i47 i56, 2005. Bouchon, B., Saitta, L., and Yager, R. R. 
Uncertainty and Intelligent Systems: 2nd International Conference on Information Processing and Management of Uncertainty in Knowledge Based Systems IPMU 88. Urbino, Italy, July 4-7, 1988. Proceedings, volume 313. Springer Science & Business Media, 1988. Chen, T., Bian, S., and Sun, Y. Are powerful graph neural nets necessary? a dissection on graph classification. ar Xiv preprint ar Xiv:1905.04579, 2019. Csisz ar, I., K orner, J., Lov asz, L., Marton, K., and Simonyi, G. Entropy splitting for antiblocking corners and perfect graphs. Combinatorica, 10:27 40, 1990. Cui, G., Zhou, J., Yang, C., and Liu, Z. Adaptive graph encoder for attributed graph embedding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 976 985, 2020. Debnath, A. K., Lopez de Compadre, R. L., Debnath, G., Shusterman, A. J., and Hansch, C. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity. Journal of medicinal chemistry, 34 (2):786 797, 1991. Dehmer, M. Information processing in complex networks: Graph entropy and information functionals. Applied Mathematics and Computation, 201(1-2):82 94, 2008. Dehmer, M. and Mowshowitz, A. A history of graph entropy measures. Information Sciences, 181(1):57 78, 2011. Ding, K., Xu, Z., Tong, H., and Liu, H. Data augmentation for deep graph learning: A survey. ACM SIGKDD Explorations Newsletter, 24(2):61 77, 2022. Ding, K., Wang, Y., Yang, Y., and Liu, H. Eliciting structural and semantic global knowledge in unsupervised graph contrastive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 7378 7386, 2023. Dong, M., Chen, H., Wang, Y., and Xu, C. Crafting efficient neural graph of large entropy. In IJCAI, pp. 2244 2250, 2019. Duan, J., Wang, S., Zhang, P., Zhu, E., Hu, J., Jin, H., Liu, Y., and Dong, Z. Graph anomaly detection via multi-scale contrastive learning networks with augmented view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 7459 7467, 2023. Fang, P., Wang, F., Shi, Z., Jiang, H., Feng, D., and Yang, L. Huge: An entropy-driven approach to efficient and scalable graph embeddings. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pp. 2045 2050. IEEE, 2021. Fang, Y., Zhang, Q., Yang, H., Zhuang, X., Deng, S., Zhang, W., Qin, M., Chen, Z., Fan, X., and Chen, H. Molecular contrastive learning with chemical element knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 3968 3976, 2022. Fuglede, B. and Topsoe, F. Jensen-shannon divergence and hilbert space embedding. In International symposium on Information theory, 2004. ISIT 2004. Proceedings., pp. 31. IEEE, 2004. Learning Graph Representation via Graph Entropy Maximization Grebenkina, M., Brachmann, A., Bertamini, M., Kaduhm, A., and Redies, C. Edge-orientation entropy predicts preference for diverse types of man-made images. Frontiers in Neuroscience, 12:678, 2018. Hamilton, W. L., Ying, R., and Leskovec, J. Representation learning on graphs: Methods and applications. ar Xiv preprint ar Xiv:1709.05584, 2017. Han, X., Jiang, Z., Liu, N., and Hu, X. G-mixup: Graph data augmentation for graph classification. In International Conference on Machine Learning, pp. 8230 8248. PMLR, 2022. Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. 
Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019. Hou, Z., Liu, X., Cen, Y., Dong, Y., Yang, H., Wang, C., and Tang, J. Graphmae: Self-supervised masked graph autoencoders. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 594 604, 2022. Jensen, T. R. and Toft, B. Graph coloring problems. John Wiley & Sons, 2011. Jiang, L. Y., Shi, J., Cheung, M., Wright, O., and Moura, J. M. Edge entropy as an indicator of the effectiveness of gnns over cnns for node classification. In 2020 54th Asilomar Conference on Signals, Systems, and Computers, pp. 746 750. IEEE, 2020. Kipf, T. N. and Welling, M. Variational graph auto-encoders. ar Xiv preprint ar Xiv:1611.07308, 2016. K orner, J. Coding of an information source having ambiguous alphabet and the entropy of graphs. In 6th Prague conference on information theory, pp. 411 425, 1973. Kriege, N. and Mutzel, P. Subgraph matching kernels for attributed graphs. ar Xiv preprint ar Xiv:1206.6483, 2012. Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207 1216, Stanford, CA, 2000. Morgan Kaufmann. Le-Khac, P. H., Healy, G., and Smeaton, A. F. Contrastive representation learning: A framework and review. Ieee Access, 8:193907 193934, 2020. Lee, N., Lee, J., and Park, C. Augmentation-free selfsupervised learning on graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 7372 7380, 2022. Li, S., Zhou, J., Xu, T., Dou, D., and Xiong, H. Geomgcl: Geometric graph contrastive learning for molecular property prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 4541 4549, 2022a. Li, Y., Chang, H., Ma, B., Shan, S., and Chen, X. Optimal positive generation via latent transformation for contrastive learning. Advances in Neural Information Processing Systems, 35:18327 18342, 2022b. Linsker, R. Self-organization in a perceptual network. Computer, 21(3):105 117, 1988. Liu, X., Fu, L., and Wang, X. Bridging the gap between von neumann graph entropy and structural information: Theory and applications. In Proceedings of the Web Conference 2021, pp. 3699 3710, 2021. Liu, X., Fu, L., Wang, X., and Zhou, C. On the similarity between von neumann graph entropy and structural information: Interpretation, computation, and applications. IEEE Transactions on Information Theory, 68(4): 2182 2202, 2022a. Liu, Y., Jin, M., Pan, S., Zhou, C., Zheng, Y., Xia, F., and Philip, S. Y. Graph self-supervised learning: A survey. IEEE Transactions on Knowledge and Data Engineering, 35(6):5879 5900, 2022b. Lov asz, L. On the shannon capacity of a graph. IEEE Transactions on Information theory, 25(1):1 7, 1979. Luo, G., Li, J., Su, J., Peng, H., Yang, C., Sun, L., Yu, P. S., and He, L. Graph entropy guided node embedding dimension selection for graph neural networks. ar Xiv preprint ar Xiv:2105.03178, 2021. Luo, X., Ju, W., Gu, Y., Mao, Z., Liu, L., Yuan, Y., and Zhang, M. Self-supervised graph-level representation learning with adversarial contrastive learning. ACM Transactions on Knowledge Discovery from Data, 2023a. Luo, Y., Luo, G., Qin, K., and Chen, A. Graph entropy minimization for semi-supervised node classification, 2023b. Ma, Y., Zhang, X., Zhang, P., and Zhan, K. Entropy neural estimation for graph contrastive learning. 
In Proceedings of the 31st ACM International Conference on Multimedia, pp. 435 443, 2023. Minello, G., Rossi, L., and Torsello, A. On the von neumann entropy of graphs. Journal of Complex Networks, 7(4): 491 514, 2019. Mo, Y., Peng, L., Xu, J., Shi, X., and Zhu, X. Simple unsupervised graph representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 7797 7805, 2022. Learning Graph Representation via Graph Entropy Maximization Morris, C., Kriege, N. M., Bause, F., Kersting, K., Mutzel, P., and Neumann, M. Tudataset: A collection of benchmark datasets for learning with graphs. In ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020), 2020. URL www.graphlearning. io. Mowshowitz, A. and Dehmer, M. Entropy and the complexity of graphs revisited. Entropy, 14(3):559 570, 2012. Nowozin, S., Cseke, B., and Tomioka, R. f-gan: Training generative neural samplers using variational divergence minimization. Advances in neural information processing systems, 29, 2016. Oggier, F. and Datta, A. Renyi entropy driven hierarchical graph clustering. Peer J Computer Science, 7:e366, 2021. P al, D., P oczos, B., and Szepesv ari, C. Estimation of r enyi entropy and mutual information based on generalized nearest-neighbor graphs. Advances in Neural Information Processing Systems, 23, 2010. Passerini, F. and Severini, S. The von neumann entropy of networks. ar Xiv preprint ar Xiv:0812.2597, 2008. Qiu, J., Chen, Q., Dong, Y., Zhang, J., Yang, H., Ding, M., Wang, K., and Tang, J. Gcc: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 1150 1160, 2020. Rezaei, S. S. C. Entropy and graphs. ar Xiv preprint ar Xiv:1311.5632, 2013. Rong, Y., Bian, Y., Xu, T., Xie, W., Wei, Y., Huang, W., and Huang, J. Self-supervised graph transformer on largescale molecular data. Advances in Neural Information Processing Systems, 33:12559 12571, 2020. Shannon, C. E. A mathematical theory of communication. The Bell system technical journal, 27(3):379 423, 1948. Shen, X., Sun, D., Pan, S., Zhou, X., and Yang, L. T. Neighbor contrastive learning on learnable graph augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 9782 9791, 2023. Sun, F.-Y., Hoffmann, J., Verma, V., and Tang, J. Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. ar Xiv preprint ar Xiv:1908.01000, 2019. Sun, Z., Ding, C., and Fan, J. Lov asz principle for unsupervised graph representation learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. Suresh, S., Li, P., Hao, C., and Neville, J. Adversarial graph augmentation to improve graph contrastive learning. Advances in Neural Information Processing Systems, 34: 15920 15933, 2021. Trivedi, P., Lubana, E. S., Yan, Y., Yang, Y., and Koutra, D. Augmentations in graph contrastive learning: Current methodological flaws & towards better practices. In Proceedings of the ACM Web Conference 2022, pp. 1538 1549, 2022. Verma, S. and Zhang, Z.-L. Learning universal graph neural network embeddings with aid of transfer learning. ar Xiv preprint ar Xiv:1909.10086, 2019. Wang, J., Wilson, R. C., and Hancock, E. R. Network edge entropy decomposition with spin statistics. Pattern Recognition, 118:108040, 2021. Wang, Y., Wang, Y., Zhang, Z., Yang, S., Zhao, K., and Liu, J. 
User: Unsupervised structural entropy-based robust graph neural network. ar Xiv preprint ar Xiv:2302.05889, 2023. Wei, C., Liang, J., Liu, D., and Wang, F. Contrastive graph structure learning via information bottleneck for recommendation. Advances in Neural Information Processing Systems, 35:20407 20420, 2022. Wei, C., Wang, Y., Bai, B., Ni, K., Brady, D., and Fang, L. Boosting graph contrastive learning via graph contrastive saliency. In International conference on machine learning, pp. 36839 36855. PMLR, 2023. Wu, J., Wang, X., Feng, F., He, X., Chen, L., Lian, J., and Xie, X. Self-supervised graph learning for recommendation. In Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval, pp. 726 735, 2021. Wu, J., Chen, X., Xu, K., and Li, S. Structural entropy guided graph hierarchical pooling. In International conference on machine learning, pp. 24017 24030. PMLR, 2022. Wu, T., Ren, H., Li, P., and Leskovec, J. Graph information bottleneck. Advances in Neural Information Processing Systems, 33:20437 20448, 2020. Xia, J., Wu, L., Chen, J., Hu, B., and Li, S. Z. Simgrace: A simple framework for graph contrastive learning without data augmentation. In Proceedings of the ACM Web Conference 2022, pp. 1070 1079, 2022. Xiao, T., Chen, Z., Guo, Z., Zhuang, Z., and Wang, S. Decoupled self-supervised learning for graphs. Advances in Neural Information Processing Systems, 35:620 634, 2022. Learning Graph Representation via Graph Entropy Maximization Xie, Y., Xu, Z., Zhang, J., Wang, Z., and Ji, S. Selfsupervised learning of graph neural networks: A unified review. IEEE transactions on pattern analysis and machine intelligence, 2022. Xu, D., Cheng, W., Luo, D., Chen, H., and Zhang, X. Infogcl: Information-aware graph contrastive learning. Advances in Neural Information Processing Systems, 34: 30414 30425, 2021. Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? ar Xiv preprint ar Xiv:1810.00826, 2018. Yanardag, P. and Vishwanathan, S. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1365 1374, 2015. Yang, Z., Zhang, G., Wu, J., Yang, J., Sheng, Q. Z., Peng, H., Li, A., Xue, S., and Su, J. Minimum entropy principle guided graph neural networks. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pp. 114 122, 2023. Ye, C., Wilson, R. C., Comin, C. H., Costa, L. d. F., and Hancock, E. R. Approximate von neumann entropy for directed graphs. Physical Review E, 89(5):052804, 2014. Yin, Y., Wang, Q., Huang, S., Xiong, H., and Zhang, X. Autogcl: Automated graph contrastive learning via learnable view generators. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 8892 8900, 2022. You, Y., Chen, T., Sui, Y., Chen, T., Wang, Z., and Shen, Y. Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems, 33: 5812 5823, 2020. You, Y., Chen, T., Shen, Y., and Wang, Z. Graph contrastive learning automated. In International Conference on Machine Learning, pp. 12121 12132. PMLR, 2021. Yu, J., Xu, T., Rong, Y., Bian, Y., Huang, J., and He, R. Recognizing predictive substructures with subgraph information bottleneck. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. Zeng, J. and Xie, P. Contrastive self-supervised learning for graph classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 
10824 10832, 2021. Zeng, L., Li, L., Gao, Z., Zhao, P., and Li, J. Imgcl: Revisiting graph contrastive learning on imbalanced node classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 11138 11146, 2023. Zhang, H., Wu, Q., Yan, J., Wipf, D., and Yu, P. S. From canonical correlation analysis to self-supervised graph neural networks. Advances in Neural Information Processing Systems, 34:76 89, 2021a. Zhang, Y., Zhu, H., Song, Z., Koniusz, P., and King, I. Spectral feature augmentation for graph contrastive learning and beyond. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 11289 11297, 2023. Zhang, Z., Liu, Q., Wang, H., Lu, C., and Lee, C.-K. Motifbased graph self-supervised learning for molecular property prediction. Advances in Neural Information Processing Systems, 34:15870 15882, 2021b. Zhao, H., Yang, X., Wang, Z., Yang, E., and Deng, C. Graph debiased contrastive learning with joint representation clustering. In IJCAI, pp. 3434 3440, 2021. Zou, D., Peng, H., Huang, X., Yang, R., Li, J., Wu, J., Liu, C., and Yu, P. S. Se-gsl: A general and effective graph structure learning framework through structural entropy optimization. ar Xiv preprint ar Xiv:2303.09778, 2023. Learning Graph Representation via Graph Entropy Maximization A. Notations The major notations used in this paper are shown in Table 5. Table 5: Notations Symbol Description Symbol Description G a graph V vertex set of graph G g graph-level representation of G n the number of vertices of G Z node representations matrix of G zi representation of node i (G, P) a probabilistic graph P(g, Z) probability distribution on V VP(G) vertex-packing polytope of G VPSub(G) a subset of VP(G) Pi(g, Z) probability density of vertex i Hk(G, P) graph entropy H(P) Shannon entropy α(G) the independence number of G 0 a 1 0 ai 0 i B. Proof for and Propositions and Theorems B.1. Proof for Proposition 3.2 Definition B.1 (convex corner (Csisz ar et al., 1990; Rezaei, 2013)). A subset A Rn + is called convex corner if it is compact and convex, has non-empty interior, and for every a A, a Rn + with a a, then we have a A. Proof. (1) Based on the Definition B.1 of convex corner, if VPSub(G) is a convex corner, then given a VPSub(G), a Rn + with a a, we need to imply a VPSub(G). Since 0 a 1 and a a, we have a 1. Since a Rn +, we have 0 a 1. Given that 0 a 1 and 0 a 1, then a a implies 1(a ) 1(a). Suppose a VPSub(G). The definition of VPSub(G) implies that 1(a) VP(G). As VP(G) is a convex corner and 1(a ) 1(a), we have 1(a ) VP(G). Together with the fact that 0 a 1, we conclude that a is in VPSub(G), meaning that VPSub(G) is a convex corner. (2) Based on the property of the element-wise indicator function 1( ), we have bi = 1(bi). The Definition 2.1 indicates bi VP(G), i [Nb]. Then we have 1(bi) VP(G), i [Nb]. That is, bi VPSub(G), i [Nb]. B.2. Proof for Theorem 3.3 Proof. For convenience, let V = [v1, . . . , vn] = Da Z, where vi = aizi, i [n]. Then Da(ZZ )Da = V V = aiajz i zj a1a1z 1 z1 a1a2z 1 z2 a1a3z 1 z3 a1anz 1 zn a2a1z 2 z2 a2a2z 2 z2 a2a3z 2 z3 a2anz 2 zn a3a1z 3 z3 a3a2z 3 z2 a3a3z 3 z3 a3anz 3 zn ... ... ... ... ... ana1z n zn ana2z n z2 ana3z n z3 ... ananz n zn (1) First, we prove that Da(ZZ )Da = D2 a = a VPSub(G). Da(ZZ )Da = D2 a indicates that all the off-diagonal elements are zeros. Thus, we have aiajz i zj = 0, (i, j) E (22) Under the condition z i zj = 0 for every (i, j) E, we conclude from (22) that ai = 0 or aj = 0, (i, j) E. 
Let $c = [c_1, \dots, c_n] = \mathbf{1}(a) \in \{0, 1\}^n$, where $\mathbf{1}(\cdot)$ was defined as before. Using (23), we obtain

$c_i = 0 \ \text{or} \ c_j = 0, \quad \forall (i, j) \in E. \quad (24)$

This means that there are no adjacent vertex pairs in the subset induced by $c$. In other words, $c$ is an indicator vector of an independent set of $G$. Based on the second property of VPSub(G) in Proposition 3.2, all the indicator vectors of the independent sets are contained in VPSub(G), meaning that

$\mathbf{1}(a) = c \in \text{VPSub}(G). \quad (25)$

According to the fact $a \le c$ and the convex corner property of VPSub(G) in Proposition 3.2, we conclude that $a \in \text{VPSub}(G)$.

(2) Now we prove $a \in \text{VPSub}(G) \Rightarrow D_a (Z Z^\top) D_a = D_a^2$. Definition 3.1 of VPSub(G) indicates that if $a \in \text{VPSub}(G)$, then $c = \mathbf{1}(a) \in \text{VP}(G)$. Following Definition 2.1 of VP(G), we use $B = [b_1, \dots, b_{N_b}] \in \{0, 1\}^{|V| \times N_b}$ to denote the indicator matrix of the independent sets, where $b_i = [b_i(1), \dots, b_i(n)]$ is the indicator vector of the $i$-th independent set. Then we can find a vector $\lambda$ with $\lambda \ge 0$ and $\sum_i \lambda_i = 1$ such that $c$ is a convex combination of indicator vectors of independent sets, i.e.,

$c = \sum_i \lambda_i b_i. \quad (26)$

For convenience, we define a positive index set as

$Q := \{i : \lambda_i > 0, \ i \in [N_b]\}. \quad (27)$

Let $c_j$ be the $j$-th element of $c$; we have $c_j = \sum_{i \in Q} \lambda_i b_i(j)$. Since every $\lambda_i$ with $i \in Q$ is positive, if $c_j = 0$, then the $j$-th element of all the indicator vectors must be zero, namely, $b_i(j) = 0$ for every $i \in Q$. Since each element of $b_i$ is zero or one, we have

$c_j = \sum_{i \in Q} \lambda_i b_i(j) \le \sum_{i \in Q} \lambda_i = 1. \quad (28)$

The equality, i.e., $c_j = 1$, can be attained only when $b_1(j) = b_2(j) = \cdots = b_{|Q|}(j) = 1$. The analysis above indicates that if $c_j = 0$, the $j$-th element of every $b_i$ ($i \in Q$) must be zero, and if $c_j = 1$, the $j$-th element of every $b_i$ ($i \in Q$) must be one. These indicate that

$c = b_i, \quad \forall i \in Q, \quad (29)$

which means $|Q| = 1$, i.e., $Q$ contains a single element. Therefore, $c$ is actually one of the indicator vectors of the independent sets. We then have $c_i = 0$ or $c_j = 0$ for every $(i, j) \in E$, which further means

$a_i = 0 \ \text{or} \ a_j = 0, \quad \forall (i, j) \in E. \quad (30)$

Now we can analyze the elements of $D_a (Z Z^\top) D_a = [a_i a_j z_i^\top z_j]_{n \times n}$. Equation (30) implies that $a_i a_j z_i^\top z_j = 0$ holds for every $(i, j) \in E$. For every $(i, j) \notin E$, we have $z_i^\top z_j = 0$ under the condition that $Z$ are orthonormal representations, and thus $a_i a_j z_i^\top z_j = 0$. Combining these two cases, we have $a_i a_j z_i^\top z_j = 0$ for every off-diagonal element $(i, j)$ of $D_a (Z Z^\top) D_a$. Each diagonal element of $D_a (Z Z^\top) D_a$ is $a_i^2 z_i^\top z_i$. Since $Z$ are orthonormal representations, we have $z_i^\top z_i = 1$. Thus, the $i$-th element in the diagonal of $D_a (Z Z^\top) D_a$ is $a_i^2$. The above analysis means $D_a (Z Z^\top) D_a = D_a^2$. This finishes the proof.
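As a purely illustrative companion to Theorem 3.3 (our own addition, not part of the original appendix), the following small Python sketch checks the equivalence numerically on an assumed toy path graph: it builds an orthonormal representation in which non-adjacent vertices receive orthogonal unit vectors while adjacent ones are non-orthogonal, enumerates the independent sets by brute force, and verifies that $D_a(ZZ^\top)D_a = D_a^2$ holds exactly when the indicator of $a$ is the indicator of an independent set (the membership test for VPSub(G) relies on the argument in part (2) of the proof).

import itertools
import numpy as np

# Toy path graph 1-2-3 (0-indexed: edges (0,1) and (1,2)); an assumed example.
n = 3
edges = {(0, 1), (1, 2)}

# Orthonormal representation: unit vectors, orthogonal for non-adjacent pairs,
# non-orthogonal for adjacent pairs (the condition used in the proof).
Z = np.array([
    [1.0, 0.0, 0.0],   # z_1
    [1.0, 1.0, 1.0],   # z_2 (normalized below); z_1.z_2 != 0, z_2.z_3 != 0
    [0.0, 1.0, 0.0],   # z_3, orthogonal to z_1 since (1,3) is not an edge
])
Z[1] /= np.linalg.norm(Z[1])

def is_independent(subset):
    # A subset is independent if it contains no edge of the graph.
    return all((i, j) not in edges and (j, i) not in edges
               for i, j in itertools.combinations(subset, 2))

indicators = []
for r in range(n + 1):
    for subset in itertools.combinations(range(n), r):
        if is_independent(subset):
            v = np.zeros(n)
            v[list(subset)] = 1.0
            indicators.append(v)

def in_vp_sub(a):
    # By part (2) of the proof, a 0-1 vector lies in VP(G) iff it is itself
    # the indicator of an independent set; a is in VPSub(G) iff 1(a) is.
    c = (a > 0).astype(float)
    return any(np.array_equal(c, v) for v in indicators)

def is_diagonal_identity(a):
    Da = np.diag(a)
    return np.allclose(Da @ Z @ Z.T @ Da, Da @ Da)

rng = np.random.default_rng(0)
for _ in range(2000):
    mask = rng.uniform(size=n) > 0.4
    a = rng.uniform(0.2, 1.0, size=n) * mask   # random sparse a in [0,1]^n
    assert is_diagonal_identity(a) == in_vp_sub(a)
print("Theorem 3.3 equivalence holds on the toy graph.")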
B.3. Proof for Proposition 3.4

Proof. In the algorithm, we need to store $N$ graphs, where each graph has a sparse adjacency matrix with $n$ nodes and $m$ edges and a feature matrix of size $n \times d$. The number of parameters of $F_g$ and $F_Z$ is $O(Ld^2)$ because there are $2L$ matrices of size $d \times d$. When applying $F_g$ or $F_Z$ to each graph, the neighbor aggregation and feature transformation operations yield $2L$ matrices of size $n \times d$. We also need to store $P_i$, a vector of size $n$, for each graph. For the regularization Lorth, we need to store $M_j$ and $Z_j^\phi (Z_j^\phi)^\top$, requiring $O(Nn^2)$, and the storage for Ls-vp is similar. Putting these together, the total space complexity of the algorithm is $O(N(m + Ldn + n^2) + Ld^2)$.

Regarding the time complexity, in each iteration of the subproblem for $\{\theta, \phi\}$ (i.e., line 4 in the algorithm), the neighbor aggregation and feature transformation operations in $F_g$ or $F_Z$ cost $O(L(dm + d^2 n))$ on each graph, and computing Lorth and Ls-vp requires $O(Ndn^2)$. The complexity of the backward propagation is similar to that of the forward propagation. The subproblem for $A$ (i.e., line 5 in the algorithm) requires $O(\tau_2 N n^2)$ plus $O(Ndn^2)$ because the $D_{a_j}$ are diagonal matrices and the $Z_j^\phi (Z_j^\phi)^\top$ are precomputed in line 4. Putting these together, the total time complexity of the algorithm is $O\big(T(\tau_1 N (L(dm + d^2 n) + dn^2) + \tau_2 N n^2)\big)$.

C. Theoretical Foundations of Graph Entropy

Körner's graph entropy is a fundamental concept in the fields of information theory, graph theory, and combinatorics.

Information Theory. Körner's graph entropy, rooted in information theory, extends traditional entropy to graph structures and was originally devised to assess communication channel capacity. Established by Claude E. Shannon in 1948 (Shannon, 1948) and further developed by János Körner in 1973 (Körner, 1973), graph entropy measures information transmission over noisy channels, highlighting the shared information within graph elements (Bouchon et al., 1988). László Lovász's introduction of orthonormal representations in 1979 (Lovász, 1979) further advanced this field, focusing on a graph's Shannon capacity and underscoring its structural properties.

Graph Theory. In graph theory, graph entropy pertains to probabilistic graphs, denoted as (G, P), where P represents the probability distribution over the vertex set (Rezaei, 2013). Graph entropy measures the uncertainty or randomness inherent in a probabilistic graph. The interpretation of the probability distribution P varies depending on the context. For instance, P might represent the centrality of a vertex, or, in the case where G symbolizes a communication channel, P could denote the probability of various communication symbols.

Combinatorics. In combinatorics, graph entropy plays a significant role in analyzing and quantifying the complexity and informational content of graph structures (Bouchon et al., 1988). This concept focuses on assessing the combinatorial properties of graphs, such as the arrangement and interrelation of vertices and edges, as well as various subgraph configurations like cliques and independent sets.

Figure 5: Architecture of InfoGraph with m layers.

Figure 6: Applying GeMax to the InfoGraph network by replacing the InfoMax loss.

D. More about the Experiments

D.1. Experiment Baseline: InfoMax Methods

For unsupervised and semi-supervised graph-level learning, InfoMax-based methods are among the most recent and influential approaches, each highly cited on Google Scholar. In our experiments, we replace the InfoMax objective with our GeMax objective while keeping all other settings unchanged, as shown in Figure 6. In this work, we compare seven InfoMax-based methods: InfoGraph (Sun et al., 2019), GraphCL (You et al., 2020), AD-GCL (Suresh et al., 2021), JOAOv2 (You et al., 2021), AutoGCL (Yin et al., 2022), GraphACL (Luo et al., 2023a), and GCS (Wei et al., 2023). The other six methods share the same graph representation learning architecture as InfoGraph (Sun et al., 2019), as shown in Figure 5.
Unsupervised InfoGraph. Following (Nowozin et al., 2016; Sun et al., 2019; Belghazi et al., 2018), suppose the node-level representation $z_p(x)$ and the graph-level representation $g(x)$ depend on the input $x$, and $T_\varphi$ is a discriminator parameterized by a neural network with parameters $\varphi$. The Jensen-Shannon mutual information (MI) estimator (Fuglede & Topsoe, 2004; Nowozin et al., 2016; Hjelm et al., 2019; Sun et al., 2019) $I_\varphi$ between $z_p$ and $g$ is defined as

$I_\varphi(z_p, g) = \mathbb{E}_{\mathbb{P}}\big[-\mathrm{sp}\big(-T_\varphi(z_p(x), g(x))\big)\big] - \mathbb{E}_{\mathbb{P} \times \tilde{\mathbb{P}}}\big[\mathrm{sp}\big(T_\varphi(z_p(x'), g(x))\big)\big], \quad (31)$

where $x$ is an input sample from distribution $\mathbb{P}$, $x'$ is a negative sample from distribution $\tilde{\mathbb{P}}$, and $\mathrm{sp}(a) = \log(1 + e^a)$ denotes the softplus function. $\mathbb{P}$ is the empirical probability distribution of the input space and $\tilde{\mathbb{P}}$ is the empirical probability distribution of the negative input space. Many recent graph-level representation learning methods (Sun et al., 2019; You et al., 2020; Yin et al., 2022) are based on the InfoMax principle, i.e., maximizing (31). For example, InfoGraph (Sun et al., 2019) obtains graph-level representations by maximizing the mutual information between the graph-level representation and the node-level representations as follows:

$\phi^*, \theta^*, \varphi^* = \arg\max_{\phi, \theta, \varphi} \sum_{i} \sum_{p \in V_i} I_\varphi\big(z_p^\theta, g_i^\phi\big), \quad (32)$

where $I_\varphi$ is the Jensen-Shannon MI estimator defined by (31).

Semi-supervised InfoGraph. For semi-supervised learning, the dataset $\mathcal{G}$ is split into a labeled dataset $\mathcal{G}^L$ and an unlabeled dataset $\mathcal{G}^U$. InfoGraph deploys another supervised encoder with parameters $\psi$ and then generates the supervised node-level representations $Z_i^\psi$, graph-level representations $g_i^\psi$, and predictions $\hat{y}_i^\psi$. The loss function of InfoGraph for semi-supervised learning is defined as follows:

$\mathcal{L}_{\text{info-semi}} = \sum_{l} L_{\text{supervised}}\big(\hat{y}_l^\psi, y_l\big) + \sum_{i} L_{\text{unsupervised}}\big(Z_i^\theta, g_i^\phi\big) - \lambda \sum_{i} \frac{1}{|V_i|} I_\varphi\big(g_i^\phi, g_i^\psi\big), \quad (33)$

where $L_{\text{unsupervised}}$ is derived from (17). The last term encourages the representations learned by the two encoders to have high mutual information.
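For readers who prefer code, the following is a minimal PyTorch sketch of the Jensen-Shannon MI estimator in Eq. (31) that the objectives above rely on; the bilinear critic and the construction of negative pairs by shuffling graph representations within a batch are illustrative assumptions, not the exact InfoGraph implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Bilinear critic T_phi(z, g) scoring node/graph representation pairs
    (an assumed architecture for illustration)."""
    def __init__(self, dim):
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, 1)

    def forward(self, z, g):
        return self.bilinear(z, g).squeeze(-1)

def js_mi_estimate(T, z, g_pos, g_neg):
    """Jensen-Shannon MI lower bound of Eq. (31):
    E_P[-softplus(-T(z, g))] - E_{P x P~}[softplus(T(z, g'))]."""
    pos = -F.softplus(-T(z, g_pos)).mean()
    neg = F.softplus(T(z, g_neg)).mean()
    return pos - neg

# Usage sketch: z holds node representations, g_pos the representation of the
# graph each node belongs to, and g_neg representations of other graphs
# (negatives, here approximated by shuffling within the batch).
dim = 32
T = Discriminator(dim)
z = torch.randn(64, dim)
g_pos = torch.randn(64, dim)
g_neg = g_pos[torch.randperm(64)]
loss = -js_mi_estimate(T, z, g_pos, g_neg)  # minimizing the negative estimate implements InfoMax
loss.backward()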
D.2. Experiment: Entropy Comparison

Table 6 shows the performance of semi-supervised learning via maximizing different entropy objectives: Shannon, Rényi, and our proposed GeMax. Figures 7 and 8 depict the t-SNE visualizations of representations learned via these different entropy objectives on the PROTEINS and MUTAG datasets. We see that the representations given by our GeMax are more discriminative than those given by Shannon entropy and Rényi entropy.

Table 6: Performance (ACC) of semi-supervised learning via maximizing different entropy objectives.

Method | Entropy | NCI1 | PROTEINS | DD | COLLAB | REDDIT-B | REDDIT-M5K
InfoGraph | Shannon | 71.10±1.37 | 71.49±1.20 | 73.85±1.10 | 74.02±1.20 | 83.75±2.18 | 51.39±1.80
InfoGraph | Rényi | 70.23±1.92 | 72.32±1.51 | 72.97±1.78 | 72.71±1.66 | 82.83±1.20 | 52.02±1.02
InfoGraph | GeMax | 74.89±1.21 | 75.87±1.94 | 75.17±1.95 | 76.67±1.39 | 90.67±1.99 | 53.08±1.54
GraphCL | Shannon | 72.58±0.57 | 71.95±1.97 | 70.88±1.72 | 71.35±1.45 | 80.41±1.48 | 50.68±1.06
GraphCL | Rényi | 71.53±1.34 | 73.28±1.32 | 71.40±1.03 | 73.29±1.60 | 81.32±1.74 | 52.63±1.38
GraphCL | GeMax | 76.59±1.55 | 76.06±1.83 | 78.25±1.23 | 77.73±1.27 | 91.84±1.64 | 54.10±3.22
AD-GCL | Shannon | 72.53±0.61 | 71.83±1.19 | 73.07±1.98 | 75.10±1.14 | 85.94±0.72 | 52.35±1.29
AD-GCL | Rényi | 70.12±1.61 | 72.37±1.37 | 71.29±1.27 | 70.08±1.86 | 84.53±1.10 | 51.37±1.24
AD-GCL | GeMax | 76.47±1.82 | 76.76±1.94 | 79.20±0.65 | 77.79±1.66 | 91.18±0.82 | 56.64±1.23
JOAOv2 | Shannon | 73.19±1.36 | 71.72±1.28 | 71.18±1.79 | 74.72±1.80 | 84.97±2.09 | 52.29±1.66
JOAOv2 | Rényi | 72.39±1.77 | 70.23±1.62 | 75.84±1.12 | 75.42±1.71 | 83.20±1.48 | 51.34±1.51
JOAOv2 | GeMax | 77.36±1.48 | 75.18±0.80 | 77.53±1.27 | 78.10±1.75 | 90.38±1.87 | 53.27±2.18
AutoGCL | Shannon | 73.44±1.44 | 70.09±1.09 | 73.37±1.90 | 73.08±2.07 | 85.31±1.28 | 50.53±1.73
AutoGCL | Rényi | 70.65±1.89 | 72.74±1.10 | 72.51±1.50 | 72.51±1.04 | 82.76±1.08 | 51.14±1.65
AutoGCL | GeMax | 76.86±1.98 | 76.43±1.47 | 79.61±1.21 | 78.21±1.27 | 88.43±1.57 | 51.63±1.97
GraphACL | Shannon | 71.27±1.29 | 77.07±1.63 | 73.42±1.33 | 73.80±1.21 | 86.82±1.44 | 50.91±1.21
GraphACL | Rényi | 73.95±1.76 | 75.09±0.95 | 72.12±1.92 | 71.83±1.14 | 83.61±1.70 | 53.23±1.66
GraphACL | GeMax | 76.56±1.67 | 76.89±1.41 | 77.28±0.46 | 79.02±1.20 | 91.38±1.50 | 52.98±3.02
GCS | Shannon | 73.10±1.74 | 70.80±1.25 | 74.92±1.59 | 73.73±1.31 | 85.66±1.01 | 50.89±1.21
GCS | Rényi | 70.20±1.58 | 72.51±1.95 | 72.18±1.88 | 75.75±1.30 | 82.76±1.49 | 51.60±1.70
GCS | GeMax | 75.69±1.47 | 76.24±1.04 | 79.13±1.55 | 78.72±1.75 | 92.05±1.45 | 54.49±2.22

Figure 7: t-SNE visualization of PROTEINS dataset representations: GeMax (left), Shannon entropy (middle), Rényi entropy (right). Each panel shows Class 0 (663 graphs) and Class 1 (450 graphs).

Figure 8: t-SNE visualization of MUTAG dataset representations: GeMax (left), Shannon entropy (middle), Rényi entropy (right). Each panel shows Class 0 (63 graphs) and Class 1 (125 graphs).

D.3. Experiment: Performance Comparison of Exact Computation of Graph Entropy on Small Graphs

For small graphs, we can precisely identify all independent sets via exhaustive search, allowing us to accurately calculate the vertex-packing polytope VP(G) and solve the GeMax problem in Eq. (5) exactly, circumventing the need for approximation. We select the graphs with 20 or fewer vertices from the MUTAG and IMDB-B datasets as our toy graph datasets. For the MUTAG dataset, there are 128 graphs (58 for Class 0 and 70 for Class 1) with $n \le 20$. For the IMDB-B dataset, there are a total of 696 selected graphs, with 359 and 337 graphs ($n \le 20$) for Class 0 and Class 1, respectively. We conduct unsupervised representation learning, followed by classification performance evaluation. Table 7 gives the performance results, which illustrate that the exact computation of GeMax not only surpasses benchmarks such as the InfoMax and Lovász principles but also significantly outperforms the approximate GeMax solution.
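For concreteness, the following short sketch (our own illustration, not the released GeMax code) shows the brute-force enumeration of independent sets underlying the exhaustive search just described; the resulting indicator vectors are exactly the points whose convex hull is VP(G) (Definition 2.1), and enumerating them is only feasible for small graphs such as the n ≤ 20 subsets used here.

import itertools
import numpy as np

def independent_set_indicators(n, edges):
    """Brute-force enumeration of all independent sets of an n-vertex graph.

    Returns a (num_sets, n) 0/1 matrix whose rows are the indicator vectors;
    VP(G) is the convex hull of these rows. Exponential in n, so only suitable
    for small graphs.
    """
    edge_set = {frozenset(e) for e in edges}
    indicators = []
    for r in range(n + 1):
        for subset in itertools.combinations(range(n), r):
            if all(frozenset(p) not in edge_set
                   for p in itertools.combinations(subset, 2)):
                v = np.zeros(n)
                v[list(subset)] = 1.0
                indicators.append(v)
    return np.stack(indicators)

# Example: a 4-cycle has 7 independent sets (the empty set, 4 singletons, 2 diagonal pairs).
B = independent_set_indicators(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
print(B.shape)  # (7, 4)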
Table 7: Performance (ACC%) of unsupervised learning via exact computation of graph entropy and the other principles.

Model | Principle | MUTAG (n ≤ 20) | IMDB-B (n ≤ 20)
 | InfoMax | 85.12±1.46 | 71.26±1.75
 | Lovász | 86.23±1.17 | 72.57±1.24
 | Approx. GeMax | 90.04±1.56 | 73.42±1.60
 | Exact GeMax | 94.36±1.12 | 75.35±1.34
 | InfoMax | 85.22±1.35 | 71.09±1.18
 | Lovász | 87.75±1.10 | 72.24±1.56
 | Approx. GeMax | 89.68±1.58 | 73.28±1.71
 | Exact GeMax | 93.25±1.63 | 74.89±1.63
 | InfoMax | 86.29±1.91 | 72.52±1.65
 | Lovász | 88.03±1.04 | 72.83±1.28
 | Approx. GeMax | 90.37±1.64 | 73.11±1.85
 | Exact GeMax | 92.52±1.49 | 75.69±1.41

D.4. Experiment: Inexact Penalty Method

Algorithm 2: Inexact Penalty Method for GeMax (11) and (12)
Input: G, µ(0), γ(0), t = 0.
1: Randomly initialize the parameters θ(0), ϕ(0).
2: Let A(0) = {a_j^(0)}_{j=1}^N = {[1/n_j, ..., 1/n_j]}_{j=1}^N.
3: Compute J(0) := J(G; θ(0), ϕ(0), A(0)).
4: repeat
5:   θ(t+1), ϕ(t+1) = argmax_{θ,ϕ} J1(G; θ, ϕ, A(t)).
6:   A(t+1) = argmin_A J2(G; θ(t+1), ϕ(t+1), A).
7:   a_j^(t+1) ← Proj_[0,1](a_j^(t+1)), ∀ a_j^(t+1) ∈ A(t+1).
8:   Compute J(t+1).
9:   µ(t+1) = µ(t) × 1.01, γ(t+1) = γ(t) × 1.01.
10: until convergence
Output: θ(t+1), ϕ(t+1).

We introduce an inexact penalty algorithm (Algorithm 2) for solving the GeMax problem. By incrementally increasing the values of µ and γ, the constraints are progressively satisfied. Utilizing Algorithm 2, we replicate the unsupervised experiments and present the findings in Table 8. The outcomes indicate that the regularized optimization yields representations comparable in quality to those obtained through constrained optimization.
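To make the penalty schedule of Algorithm 2 concrete, here is a schematic PyTorch sketch; `objective_J1` and `objective_J2` correspond to the functions listed in Appendix D.10, while the two optimizers, the inner-loop structure, and the treatment of θ, ϕ as plain tensors (with the node representations assumed to be already stored in the batched graph) are simplifying assumptions for illustration rather than the released implementation.

import torch

def inexact_penalty_train(batched_graph, A_set, theta, phi, num_outer=100,
                          inner_steps=10, mu=1.0, gamma=1.0, lr=1e-3):
    """Schematic version of Algorithm 2 under simplifying assumptions.

    A_set is assumed to be a list of per-graph tensors a_j with requires_grad=True;
    theta and phi are assumed to be tensors with requires_grad=True.
    """
    opt_model = torch.optim.Adam([theta, phi], lr=lr)   # parameters of F_g and F_Z
    opt_a = torch.optim.Adam(A_set, lr=lr)              # one a_j per graph
    for t in range(num_outer):
        # Line 5: maximize J1 over {theta, phi} with A fixed.
        for _ in range(inner_steps):
            opt_model.zero_grad()
            j1 = objective_J1(batched_graph, theta, phi, A_set, mu, gamma)
            (-j1).backward()          # gradient ascent on J1
            opt_model.step()
        # Line 6: minimize J2 over A with {theta, phi} fixed.
        for _ in range(inner_steps):
            opt_a.zero_grad()
            j2 = objective_J2(batched_graph, theta, phi, A_set, gamma)
            j2.backward()
            opt_a.step()
        # Line 7: project each a_j back onto the box [0, 1]^{n_j}.
        with torch.no_grad():
            for a_j in A_set:
                a_j.clamp_(0.0, 1.0)
        # Line 9: gradually tighten the penalties.
        mu *= 1.01
        gamma *= 1.01
    return theta, phi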
D.5. Experiment: Sensitivity Analysis of Hyperparameters

In the alternative algorithm, as described in Algorithm 1, two critical hyperparameters require tuning: the orthonormal representation regularization parameter µ and the subset-of-VP(G) objective regularization parameter γ. In this section, we analyze the sensitivity of these parameters using the average performance across methods such as InfoGraph, GraphCL, AD-GCL, JOAOv2, AutoGCL, GraphACL, and GCS with different hyperparameter settings. We repeat each experiment ten times and plot the average accuracy with its variance on different datasets.

Table 8: Performance (ACC) of unsupervised learning. "regularized opt." denotes the regularized Algorithm 1 and "constrained opt." denotes the exact Algorithm 2. The bold numbers denote the better performance of the same method.

Method | Setting | MUTAG | PROTEINS | DD | NCI1 | COLLAB | IMDB-B | REDDIT-B | REDDIT-M5K
InfoGraph | regularized opt. | 92.44±1.23 | 76.87±1.99 | 75.43±1.10 | 80.21±1.83 | 73.95±1.17 | 74.31±1.53 | 86.45±1.43 | 56.16±2.40
InfoGraph | constrained opt. | 89.86±1.64 | 77.60±1.05 | 74.63±1.79 | 81.62±1.89 | 73.79±1.68 | 73.84±0.97 | 87.29±1.20 | 57.93±1.44
GraphCL | regularized opt. | 88.83±1.10 | 77.60±1.18 | 78.24±1.84 | 80.89±1.03 | 74.21±1.55 | 74.31±1.74 | 90.76±1.47 | 57.87±1.91
GraphCL | constrained opt. | 89.96±0.13 | 75.30±0.52 | 79.13±1.81 | 82.87±2.28 | 72.13±1.79 | 73.86±0.80 | 89.45±1.93 | 56.42±1.07
AD-GCL | regularized opt. | 88.37±1.05 | 75.96±2.30 | 77.07±1.02 | 77.15±1.83 | 73.02±1.54 | 73.62±1.56 | 89.88±1.85 | 56.61±1.38
AD-GCL | constrained opt. | 90.19±1.94 | 77.18±1.21 | 75.76±1.30 | 78.77±1.17 | 71.09±0.74 | 73.72±1.93 | 87.97±0.25 | 56.64±1.30
JOAOv2 | regularized opt. | 89.07±1.19 | 74.91±1.90 | 74.68±1.91 | 76.15±1.01 | 73.17±1.88 | 72.62±1.18 | 87.19±1.89 | 55.24±1.28
JOAOv2 | constrained opt. | 90.80±1.28 | 74.29±1.09 | 75.87±1.52 | 74.87±1.75 | 71.70±1.16 | 73.94±1.60 | 87.69±1.93 | 56.15±1.09
AutoGCL | regularized opt. | 90.63±1.15 | 77.12±1.75 | 80.17±0.87 | 81.11±1.85 | 72.08±1.35 | 74.03±1.91 | 91.84±2.06 | 57.86±1.72
AutoGCL | constrained opt. | 91.78±1.35 | 78.95±2.30 | 77.43±1.04 | 82.46±1.82 | 70.13±1.28 | 72.65±1.32 | 91.64±1.19 | 58.09±0.93
GraphACL | regularized opt. | 91.86±0.70 | 77.29±1.14 | 81.08±1.09 | 81.19±1.86 | 75.94±0.90 | 75.03±1.72 | 91.45±1.25 | 57.48±2.06
GraphACL | constrained opt. | 89.63±1.66 | 76.81±2.07 | 80.78±4.75 | 79.20±1.92 | 76.32±1.22 | 74.79±1.66 | 88.01±2.01 | 56.17±2.15
GCS | regularized opt. | 91.37±0.95 | 76.12±0.76 | 78.82±0.76 | 80.08±1.50 | 76.15±1.46 | 75.19±1.76 | 92.51±1.05 | 57.45±1.79
GCS | constrained opt. | 89.88±2.10 | 77.91±1.98 | 80.67±1.06 | 80.34±2.59 | 74.47±1.08 | 77.72±2.03 | 89.51±1.24 | 55.82±1.29

D.5.1. TUNING THE ORTHONORMAL REPRESENTATION REGULARIZATION PARAMETER µ

Figure 9: The average ACC of different µ values on various datasets.

With γ fixed at 0.5, we tune µ as the hyperparameter for orthonormal representation regularization. In Figure 9, we fix the other hyperparameters and vary µ across {10⁻³, 10⁻², 0.1, 1, 10, 10², 10³, 10⁴}. The results demonstrate that the performance is not sensitive to µ within the range 0.1 ≤ µ ≤ 10. A too small µ leads to a decrease in performance due to inadequacies in achieving orthonormal representations, as postulated in Theorem 3.3. This inadequacy negatively impacts the model's overall performance. Conversely, an excessively large µ can be detrimental, as the orthonormal representation regularization starts to dominate the learning process, overshadowing other important aspects of the model.

D.5.2. TUNING THE SUBSET OF VP(G) OBJECTIVE REGULARIZATION PARAMETER γ

Fixing µ at 0.5, we tune γ as the hyperparameter for the subset-of-VP(G) objective regularization. In Figure 10, the other hyperparameters are kept constant while γ is varied across {10⁻³, 10⁻², 0.1, 1, 10, 10², 10³, 10⁴}. The findings indicate that γ is not particularly sensitive within the range 0.1 ≤ γ ≤ 10. A very small γ value disrupts the definition of graph entropy, as the subset of VP(G) loses significance. This loss adversely affects the overall performance of the model. On the other hand, a very large γ can be detrimental, as it causes the regularization term to dominate the representation learning, thereby impairing the model's performance.

Figure 10: The average ACC of different γ values on various datasets.

D.6. Experiment: Ablation Study

In the ablation study, we analyze the importance of each part of the GeMax objectives J1(G; θ, ϕ, A) and J2(G; θ, ϕ, A).

D.6.1. REMOVE THE GRAPH ENTROPY LOSS LHk(G)

We eliminate the graph entropy loss LHk(G) from the GeMax objectives J1(G; θ, ϕ, A) and J2(G; θ, ϕ, A). As shown in Table 9, omitting graph entropy maximization adversely impacts the training of the probability distribution for node and graph representations. Consequently, this results in a significant drop in performance.

Table 9: Performance (ACC) of unsupervised learning for the ablation study. The bold numbers denote the better performance of the same method.
Method | Remove LHk(G) | MUTAG | PROTEINS | DD | NCI1 | COLLAB | IMDB-B | REDDIT-B | REDDIT-M5K
InfoGraph | no | 92.44±1.23 | 76.87±1.99 | 75.43±1.10 | 80.21±1.83 | 73.95±1.17 | 74.31±1.53 | 86.45±1.43 | 56.16±2.40
InfoGraph | yes | 63.42±3.63 | 54.24±1.97 | 53.82±3.01 | 61.71±4.42 | 52.25±4.83 | 59.80±3.90 | 57.54±5.66 | 44.20±4.30
GraphCL | no | 88.83±1.10 | 77.60±1.18 | 78.24±1.84 | 80.89±1.03 | 74.21±1.55 | 74.31±1.74 | 90.76±1.47 | 57.87±1.91
GraphCL | yes | 63.15±2.90 | 54.09±6.60 | 59.86±5.31 | 60.40±5.09 | 54.30±3.34 | 53.50±3.87 | 62.81±3.43 | 41.75±2.74
AD-GCL | no | 88.37±1.05 | 75.96±2.30 | 77.07±1.02 | 77.15±1.83 | 73.02±1.54 | 73.62±1.56 | 89.88±1.85 | 56.61±1.38
AD-GCL | yes | 62.67±2.73 | 56.61±6.50 | 59.25±3.47 | 63.99±4.07 | 61.90±4.55 | 58.28±4.63 | 61.44±2.09 | 45.59±4.97
JOAOv2 | no | 89.07±1.19 | 74.91±1.90 | 74.68±1.91 | 76.15±1.01 | 73.17±1.88 | 72.62±1.18 | 87.19±1.89 | 55.24±1.28
JOAOv2 | yes | 66.37±3.55 | 65.59±5.21 | 63.01±1.14 | 60.28±2.73 | 62.17±1.26 | 63.14±2.24 | 58.42±0.75 | 47.42±3.20
AutoGCL | no | 90.63±1.15 | 77.12±1.75 | 80.17±0.87 | 81.11±1.85 | 72.08±1.35 | 74.03±1.91 | 91.84±2.06 | 57.86±1.72
AutoGCL | yes | 69.45±4.04 | 53.93±2.08 | 66.66±2.05 | 62.01±4.33 | 56.98±6.59 | 53.52±7.33 | 58.23±2.09 | 48.81±3.15
GraphACL | no | 91.86±0.70 | 77.29±1.14 | 81.08±1.09 | 81.19±1.86 | 75.94±0.90 | 75.03±1.72 | 91.45±1.25 | 57.48±2.06
GraphACL | yes | 63.71±7.08 | 66.05±2.37 | 65.96±7.96 | 53.45±1.87 | 67.45±2.61 | 65.67±1.92 | 63.63±3.08 | 42.68±4.32
GCS | no | 91.37±0.95 | 76.12±0.76 | 78.82±0.76 | 80.08±1.50 | 76.15±1.46 | 75.19±1.76 | 92.51±1.05 | 57.45±1.79
GCS | yes | 61.69±0.89 | 60.23±7.14 | 58.84±1.85 | 53.14±5.26 | 59.43±3.70 | 50.97±6.96 | 59.50±4.31 | 41.93±5.41

D.6.2. REMOVE THE ORTHONORMAL REPRESENTATION OBJECTIVE LORTH(G; ϕ)

We exclude the orthonormal representation objective Lorth(G; ϕ) from the GeMax objective J1(G; θ, ϕ, A). According to Table 10, this exclusion compromises the effectiveness of the subset-of-VP(G) regularization, because Theorem 3.3 is predicated on the existence of an orthonormal representation space. Consequently, the overall performance of the model is negatively impacted.

Table 10: Performance (ACC) of unsupervised learning for the ablation study. The bold numbers denote the better performance of the same method.

Method | Remove Lorth(G; ϕ) | MUTAG | PROTEINS | DD | NCI1 | COLLAB | IMDB-B | REDDIT-B | REDDIT-M5K
InfoGraph | no | 92.44±1.23 | 76.87±1.99 | 75.43±1.10 | 80.21±1.83 | 73.95±1.17 | 74.31±1.53 | 86.45±1.43 | 56.16±2.40
InfoGraph | yes | 75.45±1.10 | 61.24±1.18 | 57.21±1.84 | 69.44±1.03 | 58.83±1.55 | 56.33±1.74 | 70.07±1.47 | 47.26±1.91
GraphCL | no | 88.83±1.10 | 77.60±1.18 | 78.24±1.84 | 80.89±1.03 | 74.21±1.55 | 74.31±1.74 | 90.76±1.47 | 57.87±1.91
GraphCL | yes | 71.04±1.19 | 59.48±1.90 | 61.31±1.91 | 64.12±1.01 | 60.26±1.28 | 61.74±1.18 | 72.07±1.89 | 44.93±1.28
AD-GCL | no | 88.37±1.05 | 75.96±2.30 | 77.07±1.02 | 77.15±1.83 | 73.02±1.54 | 73.62±1.56 | 89.88±1.85 | 56.61±1.38
AD-GCL | yes | 67.98±1.70 | 63.50±1.14 | 60.93±1.09 | 62.57±1.86 | 64.05±1.90 | 63.97±1.72 | 74.96±1.25 | 43.75±2.06
JOAOv2 | no | 89.07±1.19 | 74.91±1.90 | 74.68±1.91 | 76.15±1.01 | 73.17±1.88 | 72.62±1.18 | 87.19±1.89 | 55.24±1.28
JOAOv2 | yes | 70.80±1.28 | 59.29±1.09 | 61.87±1.52 | 60.87±1.75 | 61.70±1.16 | 67.94±1.60 | 66.69±1.93 | 42.15±1.09
AutoGCL | no | 90.63±1.15 | 77.12±1.75 | 80.17±0.87 | 81.11±1.85 | 72.08±1.35 | 74.03±1.91 | 91.84±2.06 | 57.86±1.72
AutoGCL | yes | 73.07±2.10 | 57.50±1.98 | 65.33±1.06 | 65.42±2.59 | 64.01±1.08 | 66.50±2.03 | 76.44±1.24 | 48.24±1.29
GraphACL | no | 91.86±0.70 | 77.29±1.14 | 81.08±1.09 | 81.19±1.86 | 75.94±0.90 | 75.03±1.72 | 91.45±1.25 | 57.48±2.06
GraphACL | yes | 72.34±1.15 | 59.51±1.75 | 62.65±0.87 | 64.96±1.85 | 64.45±1.35 | 66.24±1.91 | 76.76±2.06 | 50.38±1.72
GCS | no | 91.37±0.95 | 76.12±0.76 | 78.82±0.76 | 80.08±1.50 | 76.15±1.46 | 75.19±1.76 | 92.51±1.05 | 57.45±1.79
GCS | yes | 71.82±1.15 | 62.04±1.75 | 57.68±1.87 | 61.78±1.85 | 69.94±1.35 | 70.43±1.91 | 73.33±2.06 | 48.40±1.72
D.6.3. REMOVE THE SUBSET OF VP(G) OBJECTIVE LS-VP(G; θ, ϕ, A)

We omit the subset-of-VP(G) objective Ls-vp(G; θ, ϕ, A) from the GeMax objectives J1(G; θ, ϕ, A) and J2(G; θ, ϕ, A). As indicated in Table 11, the absence of this objective leaves the vectors a_j unconstrained. The results show its significant role in the model's effectiveness.

Table 11: Performance (ACC) of unsupervised learning for the ablation study. The bold numbers denote the better performance of the same method.

Method | Remove Ls-vp(G; θ, ϕ, A) | MUTAG | PROTEINS | DD | NCI1 | COLLAB | IMDB-B | REDDIT-B | REDDIT-M5K
InfoGraph | no | 92.44±1.23 | 76.87±1.99 | 75.43±1.10 | 80.21±1.83 | 73.95±1.17 | 74.31±1.53 | 86.45±1.43 | 56.16±2.40
InfoGraph | yes | 63.55±2.23 | 61.83±1.30 | 62.06±3.73 | 58.98±4.01 | 60.07±1.47 | 57.21±3.29 | 64.52±2.16 | 45.93±1.46
GraphCL | no | 88.83±1.10 | 77.60±1.18 | 78.24±1.84 | 80.89±1.03 | 74.21±1.55 | 74.31±1.74 | 90.76±1.47 | 57.87±1.91
GraphCL | yes | 74.73±0.65 | 65.06±3.72 | 64.37±2.30 | 59.63±2.24 | 62.55±3.01 | 68.38±1.06 | 65.30±3.27 | 49.71±2.17
AD-GCL | no | 88.37±1.05 | 75.96±2.30 | 77.07±1.02 | 77.15±1.83 | 73.02±1.54 | 73.62±1.56 | 89.88±1.85 | 56.61±1.38
AD-GCL | yes | 73.19±1.94 | 61.18±1.21 | 65.76±1.30 | 60.77±1.17 | 61.09±2.74 | 65.32±1.93 | 74.25±1.25 | 45.64±1.09
JOAOv2 | no | 89.07±1.19 | 74.91±1.90 | 74.68±1.91 | 76.15±1.01 | 73.17±1.88 | 72.62±1.18 | 87.19±1.89 | 55.24±1.28
JOAOv2 | yes | 65.77±2.92 | 68.94±1.28 | 60.27±3.10 | 61.31±1.26 | 59.73±2.20 | 60.27±1.50 | 62.91±3.39 | 42.55±2.19
AutoGCL | no | 90.63±1.15 | 77.12±1.75 | 80.17±0.87 | 81.11±1.85 | 72.08±1.35 | 74.03±1.91 | 91.84±2.06 | 57.86±1.72
AutoGCL | yes | 73.04±1.47 | 60.92±1.36 | 63.64±2.79 | 67.62±1.79 | 57.56±3.28 | 61.41±1.48 | 67.21±0.90 | 41.97±2.75
GraphACL | no | 91.86±0.70 | 77.29±1.14 | 81.08±1.09 | 81.19±1.86 | 75.94±0.90 | 75.03±1.72 | 91.45±1.25 | 57.48±2.06
GraphACL | yes | 74.23±1.46 | 60.74±4.55 | 64.97±3.61 | 64.31±2.76 | 64.64±3.09 | 65.08±2.29 | 74.11±2.05 | 47.52±1.09
GCS | no | 91.37±0.95 | 76.12±0.76 | 78.82±0.76 | 80.08±1.50 | 76.15±1.46 | 75.19±1.76 | 92.51±1.05 | 57.45±1.79
GCS | yes | 74.96±2.21 | 68.20±3.58 | 62.97±2.13 | 66.78±3.20 | 59.68±0.61 | 67.86±1.92 | 74.25±3.08 | 43.97±2.32

D.7. Experiment: Distribution Comparison

In Eq. (3), we propose a Boltzmann probability distribution derived from graph representations. Additionally, we suggest the use of a normalized exponential kernel to assign a distribution over the vertex set. In Table 12, we conduct a comparative analysis of these two distributions when applied to graph representation learning. The results indicate that both distributions perform well, with their performance metrics being closely aligned.

Table 12: Performance (ACC) of unsupervised learning using different distributions. The bold numbers denote the better performance of the same method.

Method | Distribution | MUTAG | PROTEINS | DD | NCI1 | COLLAB | IMDB-B | REDDIT-B | REDDIT-M5K
InfoGraph | Boltzmann | 92.44±1.23 | 76.87±1.99 | 75.43±1.10 | 80.21±1.83 | 73.95±1.17 | 74.31±1.53 | 86.45±1.43 | 56.16±2.40
InfoGraph | exponential | 89.45±4.23 | 77.24±2.01 | 75.21±1.73 | 82.44±2.01 | 74.83±1.47 | 74.33±1.29 | 87.07±2.16 | 53.26±1.46
GraphCL | Boltzmann | 88.83±1.10 | 77.60±1.18 | 78.24±1.84 | 80.89±1.03 | 74.21±1.55 | 74.31±1.74 | 90.76±1.47 | 57.87±1.91
GraphCL | exponential | 86.03±0.47 | 78.50±1.66 | 78.94±2.79 | 82.57±1.87 | 71.05±1.28 | 73.97±1.48 | 89.96±2.30 | 58.75±2.75
AD-GCL | Boltzmann | 88.37±1.05 | 75.96±2.30 | 77.07±1.02 | 77.15±1.83 | 73.02±1.54 | 73.62±1.56 | 89.88±1.85 | 56.61±1.38
AD-GCL | exponential | 90.19±1.94 | 77.18±1.21 | 75.76±1.30 | 78.77±1.17 | 71.09±0.74 | 73.72±1.93 | 87.97±0.25 | 56.64±1.30
JOAOv2 | Boltzmann | 89.07±1.19 | 74.91±1.90 | 74.68±1.91 | 76.15±1.01 | 73.17±1.88 | 72.62±1.18 | 87.19±1.89 | 55.24±1.28
JOAOv2 | exponential | 92.03±2.13 | 75.20±2.58 | 73.97±1.13 | 73.78±2.20 | 71.68±2.61 | 72.86±1.92 | 84.25±1.08 | 53.97±1.32
AutoGCL | Boltzmann | 90.63±1.15 | 77.12±1.75 | 80.17±0.87 | 81.11±1.85 | 72.08±1.35 | 74.03±1.91 | 91.84±2.06 | 57.86±1.72
AutoGCL | exponential | 91.16±1.92 | 75.91±2.28 | 78.28±1.10 | 82.50±1.24 | 68.36±2.33 | 73.19±1.50 | 90.03±1.39 | 58.25±1.19
GraphACL | Boltzmann | 91.86±0.70 | 77.29±1.14 | 81.08±1.09 | 81.19±1.86 | 75.94±0.90 | 75.03±1.72 | 91.45±1.25 | 57.48±2.06
GraphACL | exponential | 90.04±2.92 | 74.48±1.28 | 79.31±2.10 | 82.12±2.10 | 78.26±2.05 | 77.24±4.51 | 90.07±2.49 | 56.13±1.13
GCS | Boltzmann | 91.37±0.95 | 76.12±0.76 | 78.82±0.76 | 80.08±1.50 | 76.15±1.46 | 75.19±1.76 | 92.51±1.05 | 57.45±1.79
GCS | exponential | 92.61±1.46 | 73.27±2.55 | 79.10±1.61 | 77.31±1.76 | 74.13±1.85 | 69.21±1.39 | 89.11±2.35 | 56.23±1.12
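For intuition only, the sketch below illustrates the two kinds of vertex distributions compared in Table 12. Since Eq. (3) is stated in the main text, the exact functional forms used here (a softmax over inner products for the Boltzmann distribution and a normalized Gaussian-type kernel for the exponential kernel) are illustrative assumptions, as are the function names and the temperature/bandwidth parameters.

import torch

def boltzmann_distribution(Z_j, g_j, tau=1.0):
    # Assumed Boltzmann form: P_i proportional to exp(<z_i, g> / tau).
    scores = Z_j @ g_j / tau
    return torch.softmax(scores, dim=0)

def exponential_kernel_distribution(Z_j, g_j, sigma=1.0):
    # Assumed normalized exponential kernel: P_i proportional to exp(-||z_i - g||^2 / (2 sigma^2)).
    sq_dist = ((Z_j - g_j) ** 2).sum(dim=1)
    weights = torch.exp(-sq_dist / (2 * sigma ** 2))
    return weights / weights.sum()

# Both map node representations Z_j (n_j x d) and a graph representation g_j (d,)
# to a probability vector over the n_j vertices, as required by the probabilistic graph (G, P).
Z_j = torch.randn(5, 16)
g_j = torch.randn(16)
print(boltzmann_distribution(Z_j, g_j).sum(), exponential_kernel_distribution(Z_j, g_j).sum())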
D.8. Experiment: Convergence Analysis

Figure 11 shows the convergence curves of InfoGraph on different datasets.

Figure 11: Convergence curves of InfoGraph on different datasets (panels include PROTEINS, REDDIT-B, and REDDIT-M5K).

In cases where a large dataset requires an excessive amount of time to meet the convergence conditions, but where significant changes in the objective function primarily occur within the first 50 epochs, it may be practical to limit the plotting of the convergence curve to these initial 50 epochs. This approach can provide valuable insights into the majority of the convergence behavior without the need for extensive computation beyond the point of diminishing returns.

D.9. Experiment: Time Costs

We run the experiments on a server with an Intel i7 CPU and RTX 3090 GPUs. We repeat each experiment five times and report the average time in Table 13. Since we only change the objective function while keeping the other parts of the models unchanged, the time costs of the different methods are close.

Table 13: Time cost (in hours).
Unsupervised learning:
Method | Principle | MUTAG | PROTEINS | DD | NCI1 | COLLAB | IMDB-B | REDDIT-B | REDDIT-M5K
InfoGraph | InfoMax | 0.10 | 0.47 | 2.46 | 1.20 | 2.66 | 0.28 | 6.06 | 13.34
InfoGraph | Lovász | 0.04 | 0.41 | 2.43 | 1.26 | 2.03 | 0.30 | 6.05 | 13.41
InfoGraph | GeMax | 0.08 | 0.42 | 2.89 | 1.79 | 2.34 | 0.32 | 6.25 | 13.18
GraphCL | InfoMax | 0.09 | 0.43 | 2.60 | 1.28 | 2.12 | 0.33 | 6.11 | 13.39
GraphCL | Lovász | 0.08 | 0.35 | 2.40 | 1.30 | 2.08 | 0.34 | 6.08 | 13.46
GraphCL | GeMax | 0.11 | 0.47 | 2.50 | 1.14 | 2.75 | 0.36 | 6.15 | 13.12
AD-GCL | InfoMax | 0.13 | 0.52 | 3.02 | 1.37 | 2.51 | 0.36 | 5.85 | 14.14
AD-GCL | Lovász | 0.10 | 0.51 | 3.06 | 1.39 | 2.68 | 0.34 | 5.86 | 14.07
AD-GCL | GeMax | 0.14 | 0.48 | 3.13 | 1.87 | 2.97 | 0.36 | 6.20 | 14.67
JOAOv2 | InfoMax | 0.10 | 0.62 | 2.09 | 1.48 | 2.27 | 0.32 | 6.36 | 14.79
JOAOv2 | Lovász | 0.12 | 0.61 | 2.04 | 1.45 | 2.24 | 0.31 | 6.37 | 14.74
JOAOv2 | GeMax | 0.11 | 0.63 | 2.10 | 1.49 | 2.26 | 0.33 | 6.35 | 14.77
AutoGCL | InfoMax | 0.11 | 0.50 | 3.15 | 1.48 | 2.12 | 0.34 | 6.09 | 14.42
AutoGCL | Lovász | 0.14 | 0.49 | 3.10 | 1.46 | 2.09 | 0.33 | 6.07 | 14.40
AutoGCL | GeMax | 0.13 | 0.53 | 3.42 | 1.65 | 2.91 | 0.38 | 6.85 | 14.39
GraphACL | InfoMax | 0.18 | 0.63 | 2.11 | 1.45 | 2.24 | 0.32 | 6.41 | 14.78
GraphACL | Lovász | 0.16 | 0.62 | 2.10 | 1.48 | 2.23 | 0.31 | 6.39 | 14.77
GraphACL | GeMax | 0.17 | 0.66 | 2.30 | 1.54 | 2.69 | 0.33 | 6.19 | 14.32
GCS | InfoMax | 0.09 | 0.59 | 2.05 | 1.48 | 2.24 | 0.39 | 6.41 | 14.78
GCS | Lovász | 0.08 | 0.61 | 2.08 | 1.46 | 2.26 | 0.30 | 6.39 | 14.76
GCS | GeMax | 0.10 | 0.77 | 2.32 | 1.53 | 2.71 | 0.32 | 6.20 | 14.31

Semi-supervised learning:
Method | Principle | MUTAG | PROTEINS | DD | NCI1 | COLLAB | IMDB-B | REDDIT-B | REDDIT-M5K
InfoGraph | InfoMax | - | 0.24 | 2.58 | 1.52 | 2.53 | - | 6.85 | 14.46
InfoGraph | Lovász | - | 0.36 | 2.84 | 1.27 | 2.14 | - | 6.16 | 14.35
InfoGraph | GeMax | - | 0.33 | 2.79 | 1.11 | 2.09 | - | 6.22 | 14.28
GraphCL | InfoMax | - | 0.49 | 2.71 | 1.05 | 2.45 | - | 6.92 | 14.53
GraphCL | Lovász | - | 0.63 | 2.18 | 1.39 | 2.26 | - | 6.47 | 14.08
GraphCL | GeMax | - | 0.56 | 2.15 | 1.42 | 2.29 | - | 6.41 | 14.14
AD-GCL | InfoMax | - | 0.56 | 2.94 | 1.10 | 2.76 | - | 6.65 | 14.72
AD-GCL | Lovász | - | 0.61 | 3.23 | 1.79 | 2.08 | - | 6.63 | 14.49
AD-GCL | GeMax | - | 0.59 | 3.27 | 1.88 | 2.02 | - | 6.75 | 14.58
JOAOv2 | InfoMax | - | 0.61 | 3.16 | 1.72 | 2.40 | - | 6.35 | 14.06
JOAOv2 | Lovász | - | 0.91 | 3.77 | 1.89 | 2.70 | - | 6.57 | 14.30
JOAOv2 | GeMax | - | 0.89 | 3.79 | 1.80 | 2.72 | - | 6.61 | 14.20
AutoGCL | InfoMax | - | 0.59 | 3.15 | 1.12 | 2.91 | - | 6.90 | 14.89
AutoGCL | Lovász | - | 0.51 | 3.38 | 1.87 | 2.50 | - | 6.23 | 14.75
AutoGCL | GeMax | - | 0.52 | 3.41 | 1.91 | 2.45 | - | 6.20 | 14.81
GraphACL | InfoMax | - | 0.61 | 3.14 | 1.75 | 2.30 | - | 6.01 | 14.11
GraphACL | Lovász | - | 0.92 | 3.84 | 1.85 | 2.75 | - | 6.75 | 14.27
GraphACL | GeMax | - | 0.91 | 3.80 | 1.88 | 2.78 | - | 6.71 | 14.24
GCS | InfoMax | - | 0.77 | 3.13 | 1.05 | 2.29 | - | 6.21 | 14.48
GCS | Lovász | - | 0.73 | 3.77 | 1.03 | 2.32 | - | 6.49 | 14.83
GCS | GeMax | - | 0.79 | 3.88 | 1.50 | 2.14 | - | 6.94 | 14.92

D.10. Experiment: Code of Key Implementation

Here, we provide the key implementation code for the loss and objective functions of our paper. The complete code of our method is available at https://github.com/MathAdventurer/GeMax.

(1) The orthonormal representation learning loss is:

$\mathcal{L}_{\text{orth}}(G; \phi) := \sum_{j=1}^{N} \big\| M_j \odot \big( Z_j^\phi (Z_j^\phi)^\top - I_{n_j} \big) \big\|_F^2.$

The code of Lorth(G; ϕ) is as follows:

import torch

def loss_orthogonal(Z, phi):
    # Z is a batched DGL graph whose node feature 'h' stores the node
    # representations produced by the encoder with parameters phi.
    batch_size = Z.batch_size
    loss_orth = torch.tensor(0.0, dtype=torch.float32, requires_grad=True)
    adj = Z.adjacency_matrix().to_dense()
    start_idx = 0
    for j in range(batch_size):
        num_nodes = int(Z.batch_num_nodes()[j])
        Z_j = Z.ndata['h'][start_idx:start_idx + num_nodes]
        n_j = Z_j.size(0)
        # Mask M_j built from the j-th diagonal block of the batched adjacency matrix.
        A_j = adj[start_idx:start_idx + num_nodes, start_idx:start_idx + num_nodes]
        M_j = torch.eye(n_j, device=Z.device) - A_j
        # ||M_j * (Z_j Z_j^T - I_{n_j})||_F^2
        term = torch.matmul(Z_j, Z_j.t()) - torch.eye(n_j, device=Z.device)
        loss_orth = loss_orth + torch.norm(M_j * term, 'fro') ** 2
        start_idx += num_nodes
    return loss_orth

(2) The subset-of-VP(G) loss is:

$\mathcal{L}_{\text{s-vp}}(G; \theta, \phi, A) := \sum_{j=1}^{N} \big\| D_{a_j} Z_j^\phi (Z_j^\phi)^\top D_{a_j} - D_{a_j}^2 \big\|_F^2.$

The code of Ls-vp(G; θ, ϕ, A) is as follows:

def loss_sub_vertex_packing(Z, A_set, theta, phi):
    # A_set is the collection {a_j}; theta and phi are kept in the signature
    # for consistency with the other losses but are not used directly here.
    batch_size = Z.batch_size
    loss_svp = torch.tensor(0.0, dtype=torch.float32, requires_grad=True)
    start_idx = 0
    for j in range(batch_size):
        num_nodes = int(Z.batch_num_nodes()[j])
        Z_j = Z.ndata['h'][start_idx:start_idx + num_nodes]
        start_idx += num_nodes
        a_j = A_set[j][:num_nodes]
        D_a_j = torch.diag(a_j)
        # ||D_{a_j} Z_j Z_j^T D_{a_j} - D_{a_j}^2||_F^2
        term = D_a_j @ Z_j @ Z_j.t() @ D_a_j - D_a_j @ D_a_j
        loss_svp = loss_svp + torch.norm(term, 'fro') ** 2
    return loss_svp
(3) The graph entropy loss is:

$\mathcal{L}_{H_k}(G; \theta, \phi, A) = -\sum_{j=1}^{N} \sum_{i=1}^{n_j} P_i\big(g_j^\theta, Z_j^\phi\big) \log\big(a_j(i)\big). \quad (36)$

The code of LHk(G; θ, ϕ, A) is as follows:

def loss_entropy(Z, theta, phi, A_set):
    batch_size = Z.batch_size
    loss_entropy = torch.tensor(0.0, dtype=torch.float32, requires_grad=True)
    start_idx = 0
    for j in range(batch_size):
        num_nodes = int(Z.batch_num_nodes()[j])
        Z_j = Z.ndata['h'][start_idx:start_idx + num_nodes]
        start_idx += num_nodes
        a_j = A_set[j][:num_nodes]
        # Per-vertex probabilities P_i(g, Z) parameterized by theta.
        P_i = torch.sigmoid(torch.matmul(Z_j, theta.t()))
        a_j_expanded = a_j.unsqueeze(1).expand_as(P_i)
        # Accumulate -sum_i P_i * log(a_j(i)) over the graphs in the batch.
        loss_entropy = loss_entropy - torch.sum(P_i * torch.log(a_j_expanded))
    return loss_entropy

(4) The J1 objective is:

$\mathcal{J}_1(G; \theta, \phi, A) := \mathcal{L}_{H_k}(G; \theta, \phi, A) - \mu\, \mathcal{L}_{\text{orth}}(G; \phi) - \gamma\, \mathcal{L}_{\text{s-vp}}(G; \theta, \phi, A). \quad (37)$

The code of the J1 objective is:

def objective_J1(Z, theta, phi, A_set, mu, gamma):
    loss_Hk = loss_entropy(Z, theta, phi, A_set)
    loss_orth = loss_orthogonal(Z, phi)
    loss_svp = loss_sub_vertex_packing(Z, A_set, theta, phi)
    J1 = loss_Hk - mu * loss_orth - gamma * loss_svp
    return J1

(5) The J2 objective is:

$\mathcal{J}_2(G; \theta, \phi, A) := \mathcal{L}_{H_k}(G; \theta, \phi, A) + \gamma\, \mathcal{L}_{\text{s-vp}}(G; \theta, \phi, A), \quad \text{s.t. } 0 \le a_j(i) \le 1, \ \forall i \in [n_j], \ \forall a_j \in A. \quad (38)$

The code of the J2 objective is:

def objective_J2(Z, theta, phi, A_set, gamma):
    loss_Hk = loss_entropy(Z, theta, phi, A_set)
    loss_svp = loss_sub_vertex_packing(Z, A_set, theta, phi)
    J2 = loss_Hk + gamma * loss_svp
    return J2
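To show how the listed pieces fit together, the following is a minimal end-to-end usage sketch. The one-layer GraphConv encoder, the random features, the shapes of theta and phi, and the single gradient step on each objective are illustrative assumptions and are not part of the listing above; in the actual pipeline the encoders and the alternating updates follow Algorithm 1 or Algorithm 2.

import dgl
import torch
import torch.nn as nn
from dgl.nn import GraphConv

# Two tiny undirected path graphs with random 8-dimensional features (assumed toy data).
g1 = dgl.graph(([0, 1, 1, 2], [1, 0, 2, 1]))
g2 = dgl.graph(([0, 1, 1, 2, 2, 3], [1, 0, 2, 1, 3, 2]))
for g in (g1, g2):
    g.ndata['feat'] = torch.randn(g.num_nodes(), 8)

batched = dgl.batch([g1, g2])
encoder = GraphConv(8, 16)                   # stands in for the node encoder F_Z
theta = nn.Parameter(torch.randn(16, 16))    # parameter used inside loss_entropy
phi = nn.Parameter(torch.randn(1))           # placeholder; unused by the listings above
batched.ndata['h'] = encoder(batched, batched.ndata['feat'])

# Initialize a_j = [1/n_j, ..., 1/n_j] for each graph, as in Algorithm 2.
A_set = [torch.full((g.num_nodes(),), 1.0 / g.num_nodes(), requires_grad=True)
         for g in (g1, g2)]
mu, gamma = 1.0, 1.0

# One step of each subproblem of the alternating scheme.
j1 = objective_J1(batched, theta, phi, A_set, mu, gamma)
(-j1).backward(retain_graph=True)   # ascend J1 w.r.t. the encoder parameters and theta
j2 = objective_J2(batched, theta, phi, A_set, gamma)
j2.backward()                       # descend J2 w.r.t. the a_j vectors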