# Disentangled Multiplex Graph Representation Learning

Yujie Mo 1, Yajie Lei 1, Jialie Shen 2, Xiaoshuang Shi 1, Heng Tao Shen 1, Xiaofeng Zhu 1 3

1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China. 2 Department of Computer Science, City, University of London, London, United Kingdom. 3 Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, China. Correspondence to: Xiaofeng Zhu. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Unsupervised multiplex graph representation learning (UMGRL) has received increasing interest, but few works have simultaneously focused on extracting both common and private information. In this paper, we argue that extracting complete and clean common information, as well as private information with more complementarity and less noise, is essential for conducting effective and robust UMGRL. To achieve this, we first investigate disentangled representation learning for the multiplex graph to capture complete and clean common information, and we design a contrastive constraint to preserve the complementarity and remove the noise in the private information. Moreover, we theoretically show that the common and private representations learned by our method are provably disentangled and contain more task-relevant and less task-irrelevant information, which benefits downstream tasks. Extensive experiments verify the superiority of the proposed method on different downstream tasks.

1. Introduction

Recently, multiplex graph representation learning (MGRL) has emerged as a powerful tool to model complex relationships among nodes (Chu et al., 2019; Zhang & Kou, 2022). In particular, unsupervised multiplex graph representation learning (UMGRL) methods have attracted increasing attention because they do not require labels during training, and they have shown great potential in a wide range of applications, such as anomaly detection and recommendation systems (Chen et al., 2022; Xie et al., 2022).

Existing UMGRL methods can be roughly divided into two categories, i.e., traditional unsupervised methods and self-supervised methods. Traditional UMGRL methods aim to extract hidden information without the help of labels by random walk strategies (Dong et al., 2017) or proximity preservation (Shi et al., 2018a). However, they often overemphasize proximity information and ignore node features. Inspired by the prosperity of self-supervised learning (He et al., 2020; Chen & He, 2021), self-supervised MGRL methods (including intra-graph and inter-graph contrastive learning methods) have been developed and have achieved great success in recent years. Intra-graph contrastive learning methods aim to capture global properties, but they ignore intrinsic associations among different graphs (Park et al., 2020) and may lead to suboptimal representations. To solve this issue, inter-graph contrastive learning methods have been proposed as an alternative by modelling the common information among different graphs (Zhou et al., 2022). The common information contains the consistency among all graphs and has been verified to be the key component for sample identification (Zhu et al., 2022).
Although existing UMGRL methods, especially inter-graph contrastive learning methods, have achieved promising performance on many tasks, some limitations remain to be addressed. On the one hand, previous UMGRL methods are designed to capture the common information implicitly, so the process is performed in a black box (Zhu et al., 2022; Li et al., 2022). As a result, previous UMGRL methods cannot verify whether all common information has been obtained (i.e., whether it is complete) and whether it contains other confusing content (i.e., whether it is clean). On the other hand, apart from the common information, the remaining content specific to each graph can be referred to as private information. Some private information is noise, but the other part contains complementary information, which differs from that in other graphs and has been demonstrated to facilitate downstream tasks in many fields (Xie et al., 2020; Wang et al., 2022a). However, previous UMGRL methods do not consider private information and thus ignore its complementarity (Zhu et al., 2022). Moreover, they generally fuse all representations from different graphs and thus introduce noise into the fusion process (Jing et al., 2021a).

Figure 1. The flowchart of the proposed DMG. Specifically, given the node feature matrix and graph structures, DMG first employs a graph convolutional layer and Multi-Layer Perceptrons (MLPs) to generate common and private representations (i.e., $\mathbf{C}^{(r)}$ and $\mathbf{P}^{(r)}$) for every graph. After that, DMG investigates the matching loss $\mathcal{L}_{mat}$ and the correlation loss $\mathcal{L}_{cor}^{(r)}$, respectively, to obtain complete and clean common information, and investigates the reconstruction loss $\mathcal{L}_{rec}^{(r)}$ to promote the invertibility of the encoders and address the issue of trivial solutions. Meanwhile, DMG investigates the contrastive loss $\mathcal{L}_{con}^{(r)}$ to preserve the complementarity and remove the noise in the private information. Finally, the private representations of different graphs are fused by average pooling and then concatenated with the common variable $\mathbf{S}$ to obtain the final representations $\mathbf{Z}$.

Based on the above observations, a possible solution for improving the effectiveness and robustness of UMGRL is to explicitly capture complete and clean common information, as well as to preserve the complementarity and remove the noise in the private information. To achieve this, two crucial challenges need to be solved: (i) it is difficult to decouple the common information from the private information, as they are generally mixed together, and (ii) it is necessary to distinguish, and further preserve, the complementarity from the noise in the private information.

In this paper, to address the above issues, and differently from previous traditional UMGRL and self-supervised MGRL, we investigate a new unsupervised framework, i.e., Disentangled Multiplex Graph representation learning (DMG for brevity), to conduct effective and robust UMGRL, as shown in Figure 1. To do this, we first decouple the common and private representations by designing a new disentangled representation learning scheme for the multiplex graph to extract complete and clean common information, thereby tackling Challenge (i).
Moreover, we further preserve the complementarity and remove the noise in the private information by designing a contrastive constraint on private representations, to tackle Challenge (ii). As a result, the common and private representations learned by our method can be provably disentangled and contain more task-relevant and less task-irrelevant information to benefit downstream tasks. Compared to previous UMGRL methods, the main contributions of our method can be summarized as follows: To the best of our knowledge, we make the first attempt to disentangle the common and private representations for the multiplex graph to extract complete and clean common information. We further propose a contrastive constraint to preserve the complementarity and remove the noise in the private information. We theoretically prove that the representations learned by our method can extract complete and clean common information. We further prove that the common and private representations learned by our method contain more task-relevant information and less task-irrelevant information. We experimentally demonstrate the effectiveness and robustness of the proposed method in terms of node classification and node clustering tasks on multiplex graph datasets and single-view graph datasets, compared to numerous comparison methods. 2. Related Work This section briefly reviews the topics related to this work, including unsupervised multiplex graph representation learning and disentangled representation learning. 2.1. Unsupervised Multiplex Graph Representation Learning Unsupervised multiplex graph representation learning (UMGRL) has emerged as a popular method and has drawn considerable attention in recent years as it eliminates the reliance on label information, and the capability to model the complex relationships between nodes (Chu et al., 2019). Existing UMGRL methods can be classified into two cat- Disentangled Multiplex Graph Representation Learning egories, i.e., traditional unsupervised methods and selfsupervised methods. Traditional unsupervised methods tend to generate representations by random walk strategies (Dong et al., 2017) or proximity preservation (Shi et al., 2018a) based on the proximity among nodes. For example, MNE (Zhang et al., 2018) performs meta-path based random walks to extract information of multi-type relations into a unified representations space. Similarly, HERec (Shi et al., 2019) first designs a type constraint strategy to filter the node sequence and then learns representations by random walk based strategies. As for proximity preservation, Asp Em (Shi et al., 2018a) alleviates the incompatibility among different graphs by preserving a set of representative proximity information in the multiplex graph. HEER (Shi et al., 2018b) further improves the effectiveness by coupling the edge representations with inferred type-specific metrics. Limited by the overemphasis on proximity information and node feature ignorance of traditional unsupervised methods, self-supervised MGRL methods have shown remarkable performance. Existing self-supervised MGRL methods can be divided into two subgroups, i.e., intra-graph contrastive learning methods and inter-graph contrastive learning methods. For example, DMGI (Park et al., 2020) and HDMI (Jing et al., 2021a) conduct contrastive learning by maximizing the mutual information between node representations and the graph summary within each graph, but overlook the intrinsic correlation among different graphs. 
Different to the intra-graph contrastive learning, STENCIL (Zhu et al., 2022) conducts inter-graph contrastive learning by contrasting node representations from each graph and an aggregation graph. CKD (Zhou et al., 2022) adopts contrastive learning between node representations and high-level representations of different graphs under the collaborative knowledge distillation framework. Despite their success, existing methods fail to obtain complete and clean common information, as well as more-complementarity and less-noise private information, which is significant for downstream tasks. 2.2. Disentangled Representation Learning Disentangled representation learning aims to learn representations that identify and disentangle the underlying explanatory factors hidden in the given data. Existing disentangled representation learning has already made promising developments with great potential based on the generative models (Higgins et al., 2016; Xu et al., 2021a; Xiao et al., 2022). For example, Beta-VAE (Higgins et al., 2016) introduces an adjustable parameter to balance latent channel capacity and independence constraints with reconstruction accuracy and thus improving the variational auto-encoder (VAE) framework. DDPAE (Jiang et al., 2020) proposes a decompositional disentangled predictive auto-encoder framework to learn both the latent decomposition and disentanglement without explicit supervision. PSGAN (Jiang et al., 2020) proposes to disentangle the content information and style information of images to generate the style transferred images based on the generative adversarial network. S3VAE (Zhu et al., 2020a) proposes a self-supervised sequential VAE model which disentangles the time-varying variables and time-invariant variables of video and audio sequences. Inspired by the prosperity in other fields, disentangled representation learning has recently raised a surge of interest in graph-structured data (Ma et al., 2019; Yang et al., 2020; Mercatali et al., 2022). For example, Graph VAE (Simonovsky & Komodakis, 2018) transfers the generative models for images and text to the domain of graphs and is available to output a probabilistic fully-connected graph directly based on VAE framework. Graph Lo G (Xu et al., 2021b) proposes to conduct self-supervised graph-level representation learning by disentangling the local similarities and global semantic clusters. DGCL (Li et al., 2021) proposes to learn disentangled graph-level representations with self-supervision by forcing the factorized representations to independently reflect the expressive information from different latent factors. DSSL (Xiao et al., 2022) decouples different underlying semantics between different neighborhoods into the self-supervised learning process based on the generative model. Although the above methods achieve excellent results on different tasks, they are all designed for the single-view graph and thus fail to take into account the complex relationships between nodes. Moreover, these methods do not consider the complementarity and noise within each graph structure leading to suboptimal performance. Notations. Let G = {G(1), G(2), . . . , G(R)} to denote the multiplex graph, where G(r) = {V, E(r)} is the r-th graph in the multiplex graph, R is the number of graphs. V = {v1, v2, , v N} and E(r) represent the node set of all graphs and the edge set of the r-th graph, respectively. 
We denote the node features of each graph as $\mathbf{X}^{(r)} = \mathcal{T}(\mathbf{X}) \in \mathbb{R}^{N \times F}$, where $\mathcal{T}$ denotes the random dropout operation, $\mathbf{X}$ denotes the original node features, and $N$ and $F$ denote the number of nodes and the dimension of node features, respectively. $\mathbf{A}^{(r)} \in \mathbb{R}^{N \times N}$ denotes the graph structure of the $r$-th graph, where $\mathbf{A}^{(r)}_{ij} = 1$ iff $e^{(r)}_{ij} = (v_i, v_j) \in \mathcal{E}^{(r)}$. The proposed DMG first learns the disentangled common representations $\mathbf{C}^{(r)} \in \mathbb{R}^{N \times D}$ and private representations $\mathbf{P}^{(r)} \in \mathbb{R}^{N \times d}$ through a common variable $\mathbf{S} \in \mathbb{R}^{N \times D}$, and then obtains the fused representations $\mathbf{Z} \in \mathbb{R}^{N \times (D + d)}$, where $D$ and $d$ are the dimensions of the common and private representation spaces, respectively.

3.1. Motivation

Previous UMGRL methods aim to implicitly extract the common information among different graphs, which is effective and robust in revealing the identity of samples (Zhou et al., 2022; Zhu et al., 2022). However, they generally ignore the complementarity in the private information of each graph and may lose significant properties among nodes. For example, consider a multiplex graph where papers are nodes and edges represent either co-subject or co-author relations in two different graphs. If a private edge (e.g., a co-subject relation) exists only within a certain graph and interconnects two papers from the same class, it helps to reduce the intra-class gap by providing complementary information for identifying papers. Therefore, it is necessary to consider both the common information and the private information to achieve effective and robust UMGRL.

Since the common information helps to identify the samples, it is intuitive to capture all the common information among different graphs (i.e., to make it complete). Moreover, such complete common information should contain common information only (i.e., be clean). In contrast, if the common information contains other confusing content, its quality can be compromised. Therefore, the first question arises: how can we obtain complete and clean common information? On the other hand, private information is a mix of complementarity and noise. Considering the same example of citation networks, if a private edge interconnects two papers from different classes, it can interfere with the message passing and should be removed as noise. Therefore, the second question arises: how can we preserve the complementarity and remove the noise in the private information?

However, few previous UMGRL methods have explored the above questions. Recently, disentangled representation learning methods have been developed to obtain common and private representations (Ma et al., 2019; Li et al., 2021; Lyu et al., 2022; Wang et al., 2022c;b; Xiao et al., 2022), but it is challenging to apply them to solve the above issues in UMGRL due to the complex relationships among nodes in the multiplex graph, as well as the complementarity and noise in the graph structure. To this end, we propose a new disentangled multiplex graph representation learning framework to answer the above two questions, i.e., Section 3.2 for the first question and Section 3.3 for the second question.

3.2. Common Information Extraction

Previous UMGRL methods (e.g., inter-graph contrastive learning methods) generally capture the common patterns among different graphs implicitly by maximizing the mutual information between two graphs.
For instance, to extract the common information, STENCIL (Zhu et al., 2022) maximizes the mutual information between each graph and an aggregated graph, while CKD (Zhou et al., 2022) maximizes the mutual information between regional representations and global representations of different graphs. However, these efforts cannot provably capture complete and clean common information, as they fail to decouple the common information from the private information. To solve this issue, in this paper, we investigate disentangled representation learning to obtain complete and clean common information.

Specifically, we first employ the graph convolutional layer $g^{(r)}$ to generate node representations $\mathbf{H}^{(r)}$ based on the node features and the graph structure of each graph, i.e.,

$$\mathbf{H}^{(r)} = \sigma\left(\hat{\mathbf{D}}_r^{-\frac{1}{2}} \hat{\mathbf{A}}^{(r)} \hat{\mathbf{D}}_r^{-\frac{1}{2}} \mathbf{X}^{(r)} \boldsymbol{\Theta}^{(r)}\right), \tag{1}$$

where $\hat{\mathbf{A}}^{(r)} = \mathbf{A}^{(r)} + w \mathbf{I}_N$, and $w$ indicates the weight of self-connections. $\hat{\mathbf{D}}_r$ is the degree matrix of $\hat{\mathbf{A}}^{(r)}$, $\sigma$ is the activation function, and $\boldsymbol{\Theta}^{(r)}$ is the weight matrix of $g^{(r)}$. To facilitate the decoupling of the common and private information within each graph, we then employ MLPs with unshared parameters (i.e., $f_c^{(r)}$ and $f_p^{(r)}$) to map the node representations $\mathbf{H}^{(r)}$ of each graph into common and private representations (i.e., $\mathbf{C}^{(r)}$ and $\mathbf{P}^{(r)}$).

Given $\mathbf{C}^{(1)}, \ldots, \mathbf{C}^{(R)}$, the simplest way to align the common representations from different graphs is to directly set $\mathbf{C}^{(1)} = \cdots = \mathbf{C}^{(R)}$. However, this may affect the quality of the common representations by directly aligning suboptimal common representations in the initial training process (Benton et al., 2019; Lyu & Fu, 2020; Lyu et al., 2022). In this paper, we instead introduce a common variable $\mathbf{S}$ with orthogonality and zero mean, obtained via a singular value decomposition of the common representations. After that, we impose a matching loss between the common representations $\mathbf{C}^{(r)}$ and the common variable $\mathbf{S}$, aiming to gradually align the common representations from different graphs and thereby capture the complete common information among them. The matching loss is formulated as:

$$\mathcal{L}_{mat} = \sum_{r=1}^{R}\sum_{n=1}^{N} \left(\mathbf{c}_n^{(r)} - \mathbf{s}_n\right)^2, \quad \text{s.t.} \quad \frac{1}{N}\sum_{n=1}^{N} \mathbf{s}_n \mathbf{s}_n^{\top} = \mathbf{I}, \quad \frac{1}{N}\sum_{n=1}^{N} \mathbf{s}_n = \mathbf{0}. \tag{2}$$

Intuitively, $\mathbf{S}$ in Eq. (2) communicates the common representations from different graphs and drives them to consistency, i.e., $\mathbf{C}^{(1)} = \cdots = \mathbf{S} = \cdots = \mathbf{C}^{(R)}$. Therefore, the consistency among the common representations of all graphs guarantees that the common information among different graphs can be obtained completely.

After that, to decouple the common and private representations, we have to enforce statistical independence between them. It is noteworthy that if the common and private representations are statistically independent, then $\mathbb{E}[\phi^{(r)}(\mathbf{C}^{(r)})\,\psi^{(r)}(\mathbf{P}^{(r)})] = \mathbb{E}[\phi^{(r)}(\mathbf{C}^{(r)})]\,\mathbb{E}[\psi^{(r)}(\mathbf{P}^{(r)})]$, and vice versa, where $\phi^{(r)}$ and $\psi^{(r)}$ are measurable functions (Gretton et al., 2005). Hence, minimizing the correlation between $\phi^{(r)}(\mathbf{C}^{(r)})$ and $\psi^{(r)}(\mathbf{P}^{(r)})$ can be used to achieve independence between the common and private representations. In particular, the correlation loss is obtained by calculating the Pearson correlation coefficient between $\phi^{(r)}(\mathbf{C}^{(r)})$ and $\psi^{(r)}(\mathbf{P}^{(r)})$, i.e.,

$$\mathcal{L}_{cor}^{(r)} = \frac{\left|\mathrm{Cov}\left(\phi^{(r)}(\mathbf{C}^{(r)}),\, \psi^{(r)}(\mathbf{P}^{(r)})\right)\right|}{\sqrt{\mathrm{Var}\left(\phi^{(r)}(\mathbf{C}^{(r)})\right)}\,\sqrt{\mathrm{Var}\left(\psi^{(r)}(\mathbf{P}^{(r)})\right)}}, \tag{3}$$

where $\mathrm{Cov}(\cdot,\cdot)$ and $\mathrm{Var}(\cdot)$ indicate the covariance and variance operations, respectively. In Eq. (3), the correlation coefficient between $\phi^{(r)}(\mathbf{C}^{(r)})$ and $\psi^{(r)}(\mathbf{P}^{(r)})$ is encouraged to converge to 0. As the correlation loss converges, the common representations $\mathbf{C}^{(r)}$ and the private representations $\mathbf{P}^{(r)}$ become statistically independent.
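To make Eqs. (2) and (3) concrete, the snippet below gives a minimal PyTorch sketch of the common-variable construction and the two losses. It is an illustration rather than the authors' implementation: the averaging before the SVD, the assumption that $\phi^{(r)}$ and $\psi^{(r)}$ project to a shared dimension, and the per-dimension Pearson correlation are simplifications introduced here.

```python
import torch

def common_variable(C_list):
    # Build a common variable S with zero mean and (1/N) S^T S = I, as required by
    # the constraints in Eq. (2), from the averaged common representations via SVD.
    C = torch.stack(C_list, dim=0).mean(dim=0)          # (N, D), average of all C^(r)
    C = C - C.mean(dim=0, keepdim=True)                 # column-wise zero mean
    U, _, _ = torch.linalg.svd(C, full_matrices=False)  # left singular vectors, (N, D)
    return U * (C.shape[0] ** 0.5)                      # scale so that (1/N) S^T S = I

def matching_loss(C_list, S):
    # Eq. (2): align the common representation of every graph with S.
    return sum(((C - S) ** 2).sum() for C in C_list)

def correlation_loss(phi_C, psi_P, eps=1e-8):
    # Eq. (3): absolute Pearson correlation between the projected common and private
    # representations (assumed to share one output dimension), averaged over dimensions.
    phi_C = phi_C - phi_C.mean(dim=0, keepdim=True)
    psi_P = psi_P - psi_P.mean(dim=0, keepdim=True)
    n = phi_C.shape[0]
    cov = (phi_C * psi_P).sum(dim=0) / (n - 1)
    std = phi_C.std(dim=0) * psi_P.std(dim=0) + eps
    return (cov.abs() / std).mean()
```

As outlined in Algorithm 1 (Appendix A), $\mathbf{S}$ is obtained via SVD in an outer loop, while the encoders are optimized against it in an inner loop.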
Based on Eq. (3), the common representations $\mathbf{C}^{(r)}$ are expected to contain clean common information. Therefore, the matching loss (which targets complete common information) and the correlation loss (which targets clean common information) almost answer the first question. However, the learned common and private representations may collapse to trivial solutions under the unsupervised framework (Jing et al., 2021b; Xu et al., 2022a;b). Popular remedies include contrastive learning methods and auto-encoder methods. Contrastive learning methods introduce a large number of negative samples to avoid trivial solutions, but they may induce large memory overheads (Zhang et al., 2021; Liu et al., 2023b;a). Auto-encoder methods employ a reconstruction loss to promote the invertibility of the encoders and thereby prevent trivial solutions (Kipf & Welling, 2016; Liu et al., 2022). However, existing graph auto-encoders are designed to reconstruct direct edges while ignoring the topological structure, and they incur expensive computation costs (Donnat et al., 2018; Mrabah et al., 2022).

To address the above issues, we investigate a new reconstruction loss that simultaneously reconstructs the node features and the topological structure. Specifically, we first concatenate the common and private representations and then obtain the reconstructed node representations $\tilde{\mathbf{X}}^{(r)}$ with the reconstruction network $p^{(r)}$. We further employ a feature reconstruction term and a topology reconstruction term to reconstruct the node features and the local topological structure, respectively. As a result, the reconstruction loss is formulated as:

$$\mathcal{L}_{rec}^{(r)} = \sum_{n=1}^{N} \left( \left(\tilde{\mathbf{x}}_n^{(r)} - \mathbf{x}_n^{(r)}\right)^2 + \left(\tilde{\mathbf{x}}_n^{(r)} - \mathbf{x}_{n,nei}^{(r)}\right)^2 \right), \tag{4}$$

where $\mathbf{x}_{n,nei}^{(r)} = \frac{1}{m}\sum_{j=1}^{m} \{\mathbf{x}_j^{(r)} \mid v_j \in \mathcal{N}_n^{(r)}\}$, $m$ is the number of sampled neighbors, and $\mathcal{N}_n^{(r)}$ represents the 1-hop neighborhood set of node $v_n$. In Eq. (4), the first term encourages $\tilde{\mathbf{X}}^{(r)}$ to reconstruct the original node features, and the second term encourages $\tilde{\mathbf{X}}^{(r)}$ to reconstruct the topological structure. As a result, Eq. (4) enforces that the reconstructed representations and the original input (i.e., the node features and the graph topology) can be recovered from each other, thus promoting the invertibility of the encoders and avoiding trivial solutions.
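The reconstruction loss in Eq. (4) can be sketched as follows. This is a minimal sketch, assuming uniform sampling (with replacement) of m neighbors per node; the reconstruction network $p^{(r)}$ that produces X_rec from the concatenated representations is not shown, and the sampling scheme is an assumption rather than the authors' exact procedure.

```python
import torch

def reconstruction_loss(X, neighbors, X_rec, m=5):
    # Eq. (4): feature reconstruction plus local-topology reconstruction.
    # X         : (N, F) original node features of graph r
    # neighbors : list of length N; neighbors[n] lists the 1-hop neighbors of node n
    # X_rec     : (N, F) reconstructed features, i.e., p_r([C_r || P_r])
    # m         : number of sampled neighbors per node (illustrative choice)
    nei_mean = torch.zeros_like(X)
    for n in range(X.shape[0]):
        nei = neighbors[n]
        if len(nei) == 0:                        # isolated node: fall back to itself
            nei_mean[n] = X[n]
            continue
        idx = torch.randint(len(nei), (m,))      # sample m neighbors with replacement
        nei_mean[n] = X[torch.as_tensor(nei)[idx]].mean(dim=0)  # x_{n,nei} in Eq. (4)
    feature_term = ((X_rec - X) ** 2).sum()      # reconstruct node features
    topology_term = ((X_rec - nei_mean) ** 2).sum()  # reconstruct local topology
    return feature_term + topology_term
```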
Denoting the optimal common representations, which contain complete and clean common information, as $\mathbf{C}^*$, we have Theorem 3.1 on the common information extraction. Proofs of all theorems are given in Appendix B.

Theorem 3.1. Assume that a solution satisfying the constraints in Eq. (2), Eq. (3), and Eq. (4) has been found. Then we have $\mathbf{C}^{(r)} = f_c^{(r)} \circ g^{(r)}(\mathbf{X}^{(r)}, \mathbf{A}^{(r)}) = \varphi(\mathbf{C}^*)$ for $r \in [1, R]$, where $\varphi$ is an invertible function.

Theorem 3.1 indicates that if the solution satisfies the constraints in Eq. (2), Eq. (3), and Eq. (4), the common representations learned by our method and the optimal common representations can be transformed into each other due to the invertibility of the function $\varphi$. Therefore, the common representations learned by our method (i.e., $\mathbf{C}^{(r)}$) carry all the information of the optimal common representations (i.e., $\mathbf{C}^*$) and thus provably extract complete and clean common information. As a result, based on Eq. (2), Eq. (3), and Eq. (4), we disentangle the common and private representations to obtain complete and clean common information, thereby answering the first question in Section 3.1.

3.3. Private Information Constraint

As discussed in Section 3.1, the private information is a mix of complementarity and noise. Therefore, given the learned private representations, we hope to further answer the second question in Section 3.1, i.e., to preserve the complementarity and remove the noise in the private information. Moreover, the private information of the multiplex graph mainly lies in the graph structure of each graph, since the node features of different graphs are generated from the shared feature matrix $\mathbf{X}$. Therefore, we investigate preserving the complementary edges and removing the noisy edges in each graph structure. To do this, we first give the following definition of complementarity and noise in graph structures:

Definition 3.2. For any private edge in the $r$-th graph $G^{(r)}$, i.e., $e_{ij}^{(r)} \in \mathcal{E}^{(r)}$ and $e_{ij}^{(r)} \notin \bigcup_{r' \in [1,R],\, r' \neq r} \mathcal{E}^{(r')}$, if the node pair $(v_i, v_j)$ belongs to the same class, then $e_{ij}^{(r)}$ is a complementary edge in the graph $G^{(r)}$. Otherwise, $e_{ij}^{(r)}$ is a noisy edge in the graph $G^{(r)}$.

Definition 3.2 divides the private information in each graph into two parts, i.e., complementary edges and noisy edges, according to the classes of the node pairs. However, node labels are unavailable in the unsupervised setting. To solve this issue, in this work, we approximate the label agreement of the node pair $(v_i, v_j)$ by the cosine similarity $\epsilon_{ij}^{(r)}$ between the common variables $\mathbf{s}_i$ and $\mathbf{s}_j$, i.e.,

$$\epsilon_{ij}^{(r)} = \frac{\mathbf{s}_i^{\top} \mathbf{s}_j}{\|\mathbf{s}_i\| \, \|\mathbf{s}_j\|}. \tag{5}$$

Given the cosine similarities of all node pairs in the edge set $\mathcal{E}^{(r)}$, we further assume that node pairs with top similarity belong to the same class and node pairs with low similarity belong to different classes. As a result, the edges whose connected nodes have high similarity are complementary edges, denoted as $\mathcal{E}_c^{(r)}$, while the edges whose connected nodes have low similarity are noisy edges, denoted as $\mathcal{E}_n^{(r)}$. Intuitively, the complementary edges should be preserved while the noisy edges should be removed. To achieve this, we design a simple contrastive module and impose a contrastive loss between the private representations of node pairs in $\mathcal{E}_c^{(r)}$ and $\mathcal{E}_n^{(r)}$, i.e.,

$$\mathcal{L}_{con}^{(r)} = -\log \frac{\sum_{e_{ij}^{(r)} \in \mathcal{E}_c^{(r)}} e^{\theta\left(\mathbf{p}_i^{(r)}, \mathbf{p}_j^{(r)}\right)/\tau}}{\sum_{e_{ij}^{(r)} \in \mathcal{E}_c^{(r)}} e^{\theta\left(\mathbf{p}_i^{(r)}, \mathbf{p}_j^{(r)}\right)/\tau} + \sum_{e_{kl}^{(r)} \in \mathcal{E}_n^{(r)}} e^{\theta\left(\mathbf{p}_k^{(r)}, \mathbf{p}_l^{(r)}\right)/\tau}}, \tag{6}$$

where $\mathbf{p}_i^{(r)}$, $\mathbf{p}_j^{(r)}$, $\mathbf{p}_k^{(r)}$, and $\mathbf{p}_l^{(r)}$ indicate the private representations of nodes $v_i$, $v_j$, $v_k$, and $v_l$, respectively, with $(v_i, v_j) = e_{ij}^{(r)} \in \mathcal{E}_c^{(r)}$ and $(v_k, v_l) = e_{kl}^{(r)} \in \mathcal{E}_n^{(r)}$. Moreover, $\theta$ is the cosine similarity operation and $\tau$ is a temperature parameter. Eq. (6) increases the cosine similarity of nodes connected by complementary edges and, meanwhile, decreases the cosine similarity of nodes connected by noisy edges. Thus, Eq. (6) preserves the complementarity and removes the noise in the private information, answering the second question in Section 3.1. Note that this differs from the widely used contrastive objective, i.e., the InfoNCE loss (Oord et al., 2018): InfoNCE regards two augmented views of the same sample as a positive pair, whereas our proposed contrastive loss treats two nodes connected by a complementary edge as a positive pair.
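The edge split based on Eq. (5) and the contrastive loss of Eq. (6) could be sketched as below. This is a hedged illustration: the fraction of edges treated as complementary (keep_ratio) and the aggregation over all edges of a graph are assumptions introduced here, not settings reported in the paper.

```python
import torch
import torch.nn.functional as F

def split_private_edges(S, private_edges, keep_ratio=0.5):
    # Eq. (5): cosine similarity of the common variable S over each private edge,
    # then split edges into complementary (high similarity) and noisy (low similarity).
    src, dst = private_edges                        # LongTensors of shape (E,)
    sim = F.cosine_similarity(S[src], S[dst], dim=1)
    k = int(keep_ratio * sim.numel())
    order = sim.argsort(descending=True)
    comp, noisy = order[:k], order[k:]
    return (src[comp], dst[comp]), (src[noisy], dst[noisy])

def private_contrastive_loss(P, comp_edges, noisy_edges, tau=0.5):
    # Eq. (6): pull together private representations of nodes joined by complementary
    # edges and push apart those joined by noisy edges.
    ci, cj = comp_edges
    ni, nj = noisy_edges
    pos = torch.exp(F.cosine_similarity(P[ci], P[cj], dim=1) / tau).sum()
    neg = torch.exp(F.cosine_similarity(P[ni], P[nj], dim=1) / tau).sum()
    return -torch.log(pos / (pos + neg))
```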
Besides preserving the complementarity and removing the noise, the private information constrained by Eq. (6) benefits the downstream tasks as well. Denote by $\hat{\mathbf{H}}^{(r)}$ the concatenation of the common and private representations learned by our method, and by $\tilde{\mathbf{H}}^{(r)}$ the node representations learned by previous inter-graph contrastive learning methods, which maximize the mutual information among different graphs. Then we have:

Theorem 3.3. For any downstream task $T$, the node representations $\hat{\mathbf{H}}^{(r)}$ contain more task-relevant information and less task-irrelevant information than $\tilde{\mathbf{H}}^{(r)}$, i.e.,

$$I(\hat{\mathbf{H}}^{(r)}, T) \geq I(\tilde{\mathbf{H}}^{(r)}, T), \qquad H(\hat{\mathbf{H}}^{(r)} \mid T) \leq H(\tilde{\mathbf{H}}^{(r)} \mid T), \tag{7}$$

where $I(\hat{\mathbf{H}}^{(r)}, T)$ indicates the mutual information between $\hat{\mathbf{H}}^{(r)}$ and $T$, and $H(\hat{\mathbf{H}}^{(r)} \mid T)$ indicates the entropy of $\hat{\mathbf{H}}^{(r)}$ conditioned on $T$.

Based on Theorem 3.3, the common and private representations learned by our method are shown to contain more task-relevant information and less task-irrelevant information than the node representations learned by previous contrastive learning methods. Note that we do not constrain the downstream task $T$ to be classification, regression, or clustering. As a result, the concatenated common and private representations learned by our method are expected to perform better on different downstream tasks.

3.4. Objective Function

Integrating the matching loss in Eq. (2), the correlation loss in Eq. (3), and the reconstruction loss in Eq. (4) with the contrastive loss in Eq. (6), the objective function of the proposed DMG is formulated as:

$$\mathcal{J} = \mathcal{L}_{mat} + \alpha \mathcal{L}_{cor} + \beta \mathcal{L}_{rec} + \lambda \mathcal{L}_{con}, \tag{8}$$

where $\alpha$, $\beta$, and $\lambda$ are non-negative parameters. After optimization, the proposed DMG is expected to obtain complete and clean common representations, as well as private representations with more complementarity and less noise, and thus to achieve effective and robust UMGRL (verified in Section 4). We then apply average pooling (LeCun et al., 1989) to fuse the private representations of all graphs into the overall private representations $\mathbf{P}$, i.e.,

$$\mathbf{P} = \frac{1}{R}\sum_{r=1}^{R} \mathbf{P}^{(r)}. \tag{9}$$

Finally, we concatenate the overall private representations $\mathbf{P}$ with the common variable $\mathbf{S}$ to obtain the final representations $\mathbf{Z}$. The pseudo-code of the proposed method is listed in Appendix A.

4. Experiments

In this section, we conduct experiments on six public datasets to evaluate the proposed method on different tasks. Details of the experiments are given in Appendix C, and additional results are given in Appendix D.

4.1. Experimental Setup

4.1.1. DATASETS

The used datasets include four multiplex graph datasets and two single-view graph datasets. The multiplex graph datasets include two citation datasets (i.e., ACM (Wang et al., 2019) and DBLP (Wang et al., 2019)) and two movie datasets (i.e., IMDB (Wang et al., 2019) and Freebase (Wang et al., 2021)). The single-view graph datasets include two Amazon sales datasets, i.e., Photo and Computers (Shchur et al., 2018).

Table 1. Classification performance (i.e., Macro-F1 and Micro-F1) of all methods on all multiplex graph datasets.
Method ACM IMDB DBLP Freebase Macro-F1 Micro-F1 Macro-F1 Micro-F1 Macro-F1 Micro-F1 Macro-F1 Micro-F1 Deep Walk 73.9 0.3 74.1 0.1 42.5 0.2 43.3 0.4 88.1 0.2 89.5 0.3 49.3 0.3 52.1 0.5 GCN 86.9 0.2 87.0 0.3 45.7 0.4 49.8 0.2 90.2 0.2 90.9 0.5 50.5 0.2 53.3 0.2 GAT 85.0 0.4 84.9 0.3 49.4 0.2 53.6 0.4 91.0 0.4 92.1 0.2 55.1 0.3 59.7 0.4 DGI 89.1 0.4 88.2 0.4 45.1 0.2 46.7 0.2 90.3 0.1 91.1 0.4 54.9 0.1 58.2 0.4 MNE 79.2 0.4 79.7 0.3 44.7 0.5 45.6 0.3 89.3 0.2 90.6 0.4 52.1 0.3 54.3 0.2 HAN 89.4 0.2 89.2 0.2 49.8 0.5 54.2 0.3 91.2 0.4 92.0 0.5 53.2 0.1 57.2 0.4 DMGI 89.8 0.1 89.8 0.1 52.2 0.2 53.7 0.3 92.1 0.2 92.9 0.3 54.9 0.1 57.6 0.3 DMGIattn 88.7 0.3 88.7 0.5 52.6 0.2 53.6 0.4 90.9 0.2 91.8 0.3 55.8 0.4 58.3 0.5 HDMI 90.1 0.3 90.1 0.3 55.6 0.3 57.3 0.3 91.3 0.2 92.2 0.5 56.1 0.2 59.2 0.2 He Co 88.3 0.3 88.2 0.2 50.8 0.3 51.7 0.3 91.0 0.3 91.6 0.2 59.2 0.3 61.7 0.4 MCGC 90.2 0.4 90.0 0.3 56.3 0.5 57.5 0.6 91.9 0.3 92.1 0.4 56.6 0.1 59.4 0.3 CKD 90.4 0.3 90.5 0.2 54.8 0.2 57.7 0.3 92.0 0.2 92.3 0.5 60.4 0.4 62.9 0.5 DMG 91.0 0.3 90.9 0.4 57.6 0.2 58.9 0.4 93.3 0.2 94.0 0.3 62.4 0.7 65.9 0.8 4.1.2. COMPARISON METHODS The comparison methods include twelve single-view graph methods and eight multiplex graph methods. Single-view graph methods include two semi-supervised methods (GCN (Kipf & Welling, 2017) and GAT (Velickovic et al., 2018)), two traditional unsupervised methods (i.e., Deep Walk (Perozzi et al., 2014) and VGAE (Kipf & Welling, 2016)), and eight self-supervised methods, (i.e., DGI (Velickovic et al., 2019), GMI (Peng et al., 2020), MVGRL (Hassani & Khasahmadi, 2020), GRACE (Zhu et al., 2020b), GCA (Zhu et al., 2021)), GIC (Mavromatis & Karypis, 2021), COSTA (Zhang et al., 2022), and DSSL (Xiao et al., 2022)). Multiplex graph methods include one semi-supervised method (i.e., HAN (Wang et al., 2019)), one traditional unsupervised method (i.e., MNE (Zhang et al., 2018)), and six self-supervised methods, i.e., DMGI (Park et al., 2020), DMGIattn (Park et al., 2020), HDMI (Jing et al., 2021a), He Co (Wang et al., 2021), MCGC (Pan & Kang, 2021), and CKD (Zhou et al., 2022)). For a fair comparison, we use single-view graph methods on multiplex graph datasets by separately learning the representations of each graph and further concatenating them for downstream tasks. Moreover, we apply random augmentations on every single-view graph dataset to generate two graph views for multiplex graph methods. 4.1.3. EVALUATION PROTOCOL We follow previous works (Jing et al., 2021a; Zhou et al., 2022) to conduct node classification and node clustering as semi-supervised and unsupervised downstream tasks, respectively. Moreover, we employ Macro-F1 and Micro-F1 to evaluate the performance of node classification, and Accuracy and Normalized Mutual Information (NMI) to evaluate the performance of node clustering. Furthermore, we use noisy edges (i.e., random edges) to randomly replace a certain ratio of edges in each graph, for evaluating the robustness of our method and comparison methods. The code is released at https://github.com/Yujie Mo/DMG. 4.2. Effectiveness Analysis 4.2.1. EFFECTIVENESS ON THE MULTIPLEX GRAPH We first evaluate the effectiveness of the proposed method on the multiplex graph datasets by reporting the results of node classification (i.e., Macro-F1 and Micro-F1) and node clustering (i.e., Accuracy and NMI) in Tables 1 and 2. Obviously, our method achieves the best effectiveness on both node classification task and node clustering task. 
First, compared with single-view graph methods (i.e., Deep Walk, GCN, GAT, and DGI), the proposed DMG always outperforms them by large margins. For example, the proposed DMG on average improves by 20.4%, compared to the best single-view graph method (i.e., DGI), in terms of classification and clustering tasks, on all multiplex graph datasets. This demonstrates the superiority of the multiplex graph methods, which may explore correlations among different graphs and thus better mine the hidden information to learn discriminative node representations. Second, compared to multiplex graph methods, the proposed DMG achieves the best results, followed by MCGC, CKD, HDMI, DMGI, He Co, DMGIattn, HAN, and MNE. For example, our method on average improves by 1.6%, compared to the best comparison method MCGC, in terms of classification and clustering tasks, on all multiplex graph datasets. This can be attributed to the fact that the proposed DMG can explicitly capture complete and clean common information, as well as more-complementarity and less-noise private information. As a result, this introduces more task-relevant information and less task-irrelevant in learned representations, leading to better downstream task performance. Disentangled Multiplex Graph Representation Learning Table 2. Clustering performance (i.e., Accuracy and NMI) of all methods on all multiplex graph datasets. Method ACM IMDB DBLP Freebase Accuracy NMI Accuracy NMI Accuracy NMI Accuracy NMI Deep Walk 64.5 0.7 41.6 0.5 42.1 0.4 1.5 0.1 89.5 0.4 69.0 0.2 44.5 0.6 12.8 0.4 DGI 81.1 0.6 64.0 0.4 48.9 0.2 8.3 0.3 85.4 0.3 65.6 0.4 52.9 0.2 17.8 0.2 MNE 69.1 0.2 54.5 0.3 46.5 0.3 4.6 0.2 86.3 0.3 68.4 0.2 45.1 0.5 13.3 0.7 DMGI 88.4 0.3 68.7 0.5 52.5 0.7 13.1 0.3 91.8 0.5 76.4 0.6 53.1 0.4 17.3 0.4 DMGIattn 90.9 0.4 70.2 0.3 52.6 0.3 9.2 0.2 91.3 0.4 75.2 0.4 52.3 0.5 17.1 0.3 HDMI 90.8 0.4 69.5 0.5 57.6 0.4 14.5 0.4 90.1 0.4 73.1 0.3 58.3 0.3 20.3 0.4 He Co 88.4 0.6 67.8 0.8 50.9 0.5 10.1 0.6 89.2 0.3 71.0 0.7 58.4 0.6 20.4 0.5 MCGC 90.4 0.5 69.0 0.5 56.5 0.3 14.9 0.4 91.9 0.2 76.5 0.4 58.1 0.4 47.2 0.3 CKD 90.6 0.4 69.3 0.3 53.9 0.3 13.8 0.4 91.4 0.4 75.9 0.4 58.5 0.6 20.6 0.4 DMG 92.9 0.3 74.5 0.4 60.3 0.5 17.0 0.3 94.1 0.4 80.0 0.2 63.6 0.6 21.8 0.4 Table 3. Classification and clustering performance (i.e., Macro-F1, Micro-F1, Accuracy and NMI) on all single-view graph datasets. Method Photo Computers Macro-F1 Micro-F1 Accuracy NMI Macro-F1 Micro-F1 Accuracy NMI Deep Walk 87.4 0.5 89.7 0.3 46.2 0.2 35.4 0.3 84.0 0.3 85.6 0.4 32.5 0.3 29.8 0.2 VGAE 89.9 0.2 91.6 0.4 54.8 0.5 37.4 0.3 82.6 0.3 85.3 0.4 37.2 0.3 32.4 0.5 GCN 90.5 0.3 92.5 0.2 N/A N/A 84.0 0.4 86.4 0.3 N/A N/A GAT 90.2 0.5 91.8 0.4 N/A N/A 83.2 0.2 85.7 0.4 N/A N/A DGI 89.3 0.2 91.6 0.3 59.1 0.4 43.2 0.3 79.3 0.3 83.9 0.5 40.7 0.3 33.4 0.2 GMI 89.3 0.4 90.6 0.2 64.6 0.2 47.2 0.3 80.1 0.4 82.2 0.4 41.5 0.2 34.5 0.3 MVGRL 90.1 0.3 91.7 0.4 48.3 0.5 34.4 0.4 84.6 0.6 86.9 0.5 47.8 0.5 47.1 0.3 GRACE 90.3 0.5 91.9 0.3 74.1 0.4 63.4 0.2 84.2 0.3 86.8 0.5 49.6 0.2 47.9 0.4 GCA 91.1 0.4 92.4 0.4 73.6 0.3 61.4 0.2 85.9 0.5 87.7 0.3 51.3 0.5 42.6 0.4 GIC 90.0 0.3 91.6 0.2 69.5 0.2 61.5 0.1 82.6 0.4 84.9 0.3 52.5 0.2 46.4 0.2 COSTA 91.3 0.4 92.5 0.3 73.1 0.3 62.1 0.5 86.4 0.3 88.3 0.4 53.2 0.2 48.1 0.3 DSSL 90.6 0.2 92.1 0.3 74.5 0.4 63.9 0.5 85.6 0.3 87.3 0.4 53.5 0.2 48.3 0.4 DMG 91.8 0.2 92.9 0.3 77.6 0.3 66.5 0.4 86.6 0.3 88.3 0.2 55.6 0.4 49.3 0.5 N/A indicates that we did not evaluate semi-supervised methods (i.e., GCN and GAT) on unsupervised tasks (i.e., clustering). 
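The evaluation protocol of Section 4.1.3 (a linear probe for Macro-/Micro-F1 and k-means for clustering Accuracy/NMI on the learned representations) can be approximated with the sketch below. The logistic-regression probe, the number of k-means restarts, and the Hungarian matching used to compute clustering accuracy are assumptions for illustration; the exact settings follow the cited works.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, normalized_mutual_info_score
from scipy.optimize import linear_sum_assignment

def evaluate(Z, y, train_idx, test_idx, n_classes):
    # Linear-probe node classification on the frozen representations Z.
    clf = LogisticRegression(max_iter=1000).fit(Z[train_idx], y[train_idx])
    pred = clf.predict(Z[test_idx])
    macro = f1_score(y[test_idx], pred, average="macro")
    micro = f1_score(y[test_idx], pred, average="micro")

    # Node clustering: k-means on Z, scored by accuracy (optimal matching) and NMI.
    km = KMeans(n_clusters=n_classes, n_init=10).fit(Z)
    nmi = normalized_mutual_info_score(y, km.labels_)
    cost = np.zeros((n_classes, n_classes))
    for c, t in zip(km.labels_, y):
        cost[c, t] += 1
    row, col = linear_sum_assignment(-cost)   # map clusters to classes
    acc = cost[row, col].sum() / len(y)
    return macro, micro, acc, nmi
```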
4.2.2. EFFECTIVENESS ON THE SINGLE-VIEW GRAPH To further verify the effectiveness of the proposed method on single-view graph datasets after random data augmentation, we report the results of node classification and node clustering on single-view graph datasets in Table 3. We can observe that our method achieves competitive results on both the node classification task and node clustering task. First, compared to the semi-supervised baselines (i.e., GCN and GAT), the proposed DMG obtains promising improvements. For example, the proposed DMG on average improves by 1.8%, compared to the best semi-supervised method (i.e., GCN), in terms of the classification task, on all single-view graph datasets. Second, compared to all selfsupervised methods (i.e., DGI, GMI, MVGRL, GRACE, GCA, GIC, COSTA, and DSSL), the proposed DMG also achieves superior performance. For example, the proposed DMG on average outperforms the best self-supervised method (i.e., DSSL) by 2.3%, in terms of classification and clustering tasks, on all single-view graph datasets. This indicates that on single-view graph datasets, the proposed DMG is still able to extract invariant common information between two augmented views, as well as preserve complementarity and remove noise in augmented graph structures. Therefore, the effectiveness of the proposed method is further validated on single-view graph datasets. 4.3. Robustness Analysis We further evaluate the robustness of the proposed method on the multiplex dataset by reporting results of node classification and node clustering under different noisy edge ratios η in Figure 2. From Figure 2, we have the observations as follows. First, compared to all self-supervised MGRL methods, the proposed DMG consistently achieves the best performance under different noise ratios on the DBLP dataset, demonstrating the superiority of the proposed method again. Second, with the increase of the noise ratios, the performance degradation of all self-supervised methods is much more drastically than the proposed method. For example, DMG and CKD achieve the Macro-F1 of 93.3 and 92.0 under Disentangled Multiplex Graph Representation Learning Table 4. Classification performance (i.e., Macro-F1 and Micro-F1) of each component in the proposed method on all datasets. Lmat Lcor Lrec Lcon ACM IMDB DBLP Freebase Macro-F1 Micro-F1 Macro-F1 Micro-F1 Macro-F1 Micro-F1 Macro-F1 Micro-F1 76.1 0.5 75.8 0.3 43.8 0.4 45.5 0.2 92.2 0.4 93.0 0.3 47.1 0.6 47.5 0.8 86.0 0.3 86.0 0.5 51.6 0.5 54.0 0.4 92.0 0.2 92.9 0.3 36.3 0.6 40.2 0.5 86.5 0.4 86.1 0.4 53.0 0.3 54.7 0.5 92.7 0.3 93.5 0.4 53.1 0.8 55.1 0.9 74.3 0.4 73.7 0.6 46.0 0.4 49.2 0.5 91.7 0.5 92.7 0.3 35.7 0.4 38.8 0.7 87.9 0.6 88.0 0.4 53.4 0.3 56.1 0.1 92.3 0.4 93.0 0.6 50.3 0.7 52.9 0.6 87.8 0.5 87.6 0.3 54.7 0.3 56.2 0.5 92.5 0.4 93.3 0.6 55.1 0.5 58.2 0.7 91.0 0.3 90.9 0.4 57.6 0.2 58.9 0.4 93.3 0.2 94.0 0.3 62.4 0.7 65.9 0.8 0.1 0.3 0.5 0.7 0.9 Noise ratio DMGI DMGIattn HDMI He Co MCGC CKD DMG 0.1 0.3 0.5 0.7 0.9 Noise ratio DMGI DMGIattn HDMI He Co MCGC CKD DMG Figure 2. Classification and clustering performance of our method and all self-supervised MGRL methods under different noisy edges ratios η on the DBLP dataset. η = 0, respectively, while with the increase of the noise rate, DMG is remarkably superior to CKD. The reasons can be summarized as follows. On the one hand, our method extracts complete and clean common information through disentangled representation learning. 
As a result, the complete and clean common information is supposed to be free of noise. On the other hand, the contrastive constraint further preserves the complementarity and removes the noise in the private information. Therefore, the common and private representations learned by our method are expected to be robust to the noise in each graph. 4.4. Ablation Study The proposed DMG investigates the matching loss, the correlation loss, and the reconstruction loss (i.e., Lmat, Lcor, and Lrec) to obtain disentangled common and private representations. Moreover, DMG further investigates the contrastive loss (i.e., Lcon) to preserve the complementarity and remove the noise in private representations. To verify the effectiveness of each component of the objective function in the proposed method, we investigate the performance of all variants (except Lmat as we cannot obtain final representations without Lmat) on the node classification task by reporting the results in Table 4. According to Figure 4, we can draw the following con- clusions. First, our method with the complete objective function achieves the best performance. For example, our method on average improves by 5.7%, compared to the best variant (i.e., without Lcon), indicating that all the losses are necessary for our method. This is consistent with our above argument. That is, it is essential for UMGRL to consider complete and clean common information, as well as more-complementarity and less-noise private information. Second, the variant without Lcor performs significantly inferior to the other two variants (without Lrec and without Lcon, respectively). This makes sense as the correlation loss is needed to guarantee the independence between common and private representations, which is generally essential for disentangled representation learning. 5. Conclusion In this paper, we proposed a disentangled representation learning framework for the multiplex graph. To do this, we first disentangled the common and private representations to capture complete and clean common information. We further designed a contrastive constraint to preserve the complementarity and remove the noise in the private information. Theoretical analysis indicates that the common and private representations learned by our method can be provably disentangled and contain more task-relevant information and less task-irrelevant information to benefit downstream tasks. Comprehensive experimental results demonstrate that the proposed method consistently outperforms state-of-the-art methods in terms of both effectiveness and robustness on different downstream tasks. We discuss potential limitations and future directions in Appendix E. 6. Acknowledgement This work was supported in part by the National Key Research and Development Program of China under Grant No. 2022YFA1004100, and the Medico-Engineering Cooperation Funds from the University of Electronic Science and Technology of China under Grants No. ZYGX2022YGRH009 and ZYGX2022YGRH014. Disentangled Multiplex Graph Representation Learning Benton, A., Khayrallah, H., Gujral, B., Reisinger, D. A., Zhang, S., and Arora, R. Deep generalized canonical correlation analysis. In Augenstein, I., Gella, S., Ruder, S., Kann, K., Can, B., Welbl, J., Conneau, A., Ren, X., and Rei, M. (eds.), Proceedings of the 4th Workshop on Representation Learning for NLP, Rep L4NLP@ACL 2019, Florence, Italy, August 2, 2019, pp. 1 6, 2019. Chen, B., Zhang, J., Zhang, X., Dong, Y., Song, J., Zhang, P., Xu, K., Kharlamov, E., and Tang, J. 
Gccad: Graph contrastive coding for anomaly detection. IEEE Transactions on Knowledge and Data Engineering, 2022. Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In ICML, pp. 1597 1607, 2020. Chen, X. and He, K. Exploring simple siamese representation learning. In CVPR, pp. 15750 15758, 2021. Chu, X., Fan, X., Yao, D., Zhu, Z., Huang, J., and Bi, J. Cross-network embedding for multi-network alignment. In WWW, pp. 273 284, 2019. Dong, Y., Chawla, N. V., and Swami, A. metapath2vec: Scalable representation learning for heterogeneous networks. In KDD, pp. 135 144, 2017. Donnat, C., Zitnik, M., Hallac, D., and Leskovec, J. Learning structural node embeddings via diffusion wavelets. In KDD, pp. 1320 1329, 2018. Federici, M., Dutta, A., Forr e, P., Kushman, N., and Akata, Z. Learning robust representations via multi-view information bottleneck. In ICLR, 2020. Gretton, A., Smola, A., Bousquet, O., Herbrich, R., Belitski, A., Augath, M., Murayama, Y., Pauls, J., Sch olkopf, B., and Logothetis, N. Kernel constrained covariance for dependence measurement. In AISTATES, pp. 112 119, 2005. Hassani, K. and Khasahmadi, A. H. Contrastive multi-view representation learning on graphs. In ICML, pp. 4116 4126, 2020. He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In CVPR, pp. 9729 9738, 2020. Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. betavae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2016. Jiang, W., Liu, S., Gao, C., Cao, J., He, R., Feng, J., and Yan, S. PSGAN: pose and expression robust spatial-aware GAN for customizable makeup transfer. In CVPR, pp. 5193 5201, 2020. Jing, B., Park, C., and Tong, H. Hdmi: High-order deep multiplex infomax. In WWW, pp. 2414 2424, 2021a. Jing, L., Vincent, P., Le Cun, Y., and Tian, Y. Understanding dimensional collapse in contrastive self-supervised learning. In ICLR, 2021b. Kingma, P. D. and Ba, L. J. Adam: A method for stochastic optimization. In ICLR, 2015. Kipf, N. T. and Welling, M. Semi-supervised classification with graph convolutional networks. In ICLR, pp. 1 14, 2017. Kipf, T. N. and Welling, M. Variational graph auto-encoders. ar Xiv preprint ar Xiv:1611.07308, 2016. Le Cun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., and Jackel, L. Handwritten digit recognition with a back-propagation network. In Neur IPS, volume 2, 1989. Li, B., Jing, B., and Tong, H. Graph communal contrastive learning. In WWW, pp. 1203 1213, 2022. Li, H., Wang, X., Zhang, Z., Yuan, Z., Li, H., and Zhu, W. Disentangled contrastive learning on graphs. In Neur IPS, volume 34, pp. 21872 21884, 2021. Liu, Y., Tu, W., Zhou, S., Liu, X., Song, L., Yang, X., and Zhu, E. Deep graph clustering via dual correlation reduction. In AAAI, pp. 7603 7611, 2022. Liu, Y., Yang, X., Zhou, S., Liu, X., Wang, S., Liang, K., Tu, W., and Li, L. Simple contrastive graph clustering. IEEE Transactions on Neural Networks and Learning Systems, 2023a. Liu, Y., Yang, X., Zhou, S., Liu, X., Wang, Z., Liang, K., Tu, W., Li, L., Duan, J., and Chen, C. Hard sample aware network for contrastive deep graph clustering. In AAAI, 2023b. Lyu, Q. and Fu, X. Nonlinear multiview analysis: Identifiability and neural network-assisted implementation. IEEE Transactions on Signal Processing, 68:2697 2712, 2020. Lyu, Q., Fu, X., Wang, W., and Lu, S. 
Understanding latent correlation-based multiview learning and selfsupervision: An identifiability perspective. In ICLR, 2022. Ma, J., Cui, P., Kuang, K., Wang, X., and Zhu, W. Disentangled graph convolutional networks. In ICML, pp. 4212 4221, 2019. Disentangled Multiplex Graph Representation Learning Mavromatis, C. and Karypis, G. Graph infoclust: Maximizing coarse-grain mutual information in graphs. In PAKDD, pp. 541 553, 2021. Mercatali, G., Freitas, A., and Garg, V. Symmetry-induced disentanglement on graphs. In Neur IPS, volume 35, pp. 31497 31511, 2022. Mrabah, N., Bouguessa, M., Touati, M. F., and Ksantini, R. Rethinking graph auto-encoder models for attributed graph clustering. IEEE Transactions on Knowledge and Data Engineering, 2022. Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In ICML, pp. 807 814, 2010. Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. ar Xiv preprint ar Xiv:1807.03748, 2018. Pan, E. and Kang, Z. Multi-view contrastive graph clustering. In Neur IPS, volume 34, pp. 2148 2159, 2021. Park, C., Kim, D., Han, J., and Yu, H. Unsupervised attributed multiplex network embedding. In AAAI, pp. 5371 5378, 2020. Peng, Z., Huang, W., Luo, M., Zheng, Q., Rong, Y., Xu, T., and Huang, J. Graph representation learning via graphical mutual information maximization. In WWW, pp. 259 270, 2020. Perozzi, B., Al-Rfou, R., and Skiena, S. Deepwalk: Online learning of social representations. In KDD, pp. 701 710, 2014. Shchur, O., Mumme, M., Bojchevski, A., and G unnemann, S. Pitfalls of graph neural network evaluation. ar Xiv preprint ar Xiv:1811.05868, 2018. Shi, C., Hu, B., Zhao, W. X., and Yu, P. S. Heterogeneous information network embedding for recommendation. IEEE Transactions on Knowledge and Data Engineering, 31(2):357 370, 2019. Shi, Y., Gui, H., Zhu, Q., Lance, K., and Han, J. Aspem: Embedding learning by aspects in heterogeneous information networks. In SDM, pp. 144 152, 2018a. Shi, Y., Zhu, Q., Guo, F., Zhang, C., and Han, J. Easing embedding learning by comprehensive transcription of heterogeneous information networks. In KDD, pp. 2190 2199, 2018b. Simonovsky, M. and Komodakis, N. Graphvae: Towards generation of small graphs using variational autoencoders. In ICANN, volume 11139, pp. 412 422, 2018. Van der Maaten, L. and Hinton, G. Visualizing data using t-sne. Journal of Machine Learning Research, 9(11), 2008. Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Li o, P., and Bengio, Y. Graph attention networks. In ICLR, pp. 1 12, 2018. Velickovic, P., Fedus, W., Hamilton, W. L., Li o, P., Bengio, Y., and Hjelm, R. D. Deep graph infomax. In ICLR, pp. 1 17, 2019. Wang, H., Guo, X., Deng, Z.-H., and Lu, Y. Rethinking minimal sufficient representation in contrastive learning. In CVPR, pp. 16041 16050, 2022a. Wang, X., Ji, H., Shi, C., Wang, B., Ye, Y., Cui, P., and Yu, P. S. Heterogeneous graph attention network. In WWW, pp. 2022 2032, 2019. Wang, X., Liu, N., Han, H., and Shi, C. Self-supervised heterogeneous graph neural network with co-contrastive learning. In KDD, pp. 1726 1736, 2021. Wang, X., Chen, H., Tang, S., Wu, Z., and Zhu, W. Disentangled representation learning, 2022b. Wang, X., Chen, H., Zhou, Y., Ma, J., and Zhu, W. Disentangled representation learning for recommendation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):408 424, 2022c. Xiao, T., Chen, Z., Guo, Z., Zhuang, Z., and Wang, S. Decoupled self-supervised learning for graphs. 
In Neur IPS, 2022. Xie, D., Deng, C., Li, C., Liu, X., and Tao, D. Multitask consistency-preserving adversarial hashing for crossmodal retrieval. IEEE Transactions on Image Processing, 29:3626 3637, 2020. Xie, Y., Xu, Z., Zhang, J., Wang, Z., and Ji, S. Selfsupervised learning of graph neural networks: A unified review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. Xu, J., Ren, Y., Tang, H., Pu, X., Zhu, X., Zeng, M., and He, L. Multi-VAE: Learning disentangled view-common and view-peculiar visual representations for multi-view clustering. In ICCV, pp. 9234 9243, 2021a. Xu, J., Ren, Y., Tang, H., Yang, Z., Pan, L., Yang, Y., Pu, X., Yu, P. S., and He, L. Self-supervised discriminative feature learning for deep multi-view clustering. IEEE Transactions on Knowledge and Data Engineering, pp. 1 12, 2022a. Disentangled Multiplex Graph Representation Learning Xu, J., Tang, H., Ren, Y., Peng, L., Zhu, X., and He, L. Multi-level feature learning for contrastive multi-view clustering. In CVPR, pp. 16051 16060, 2022b. Xu, M., Wang, H., Ni, B., Guo, H., and Tang, J. Selfsupervised graph-level representation learning with local and global structure. In ICML, volume 139, pp. 11548 11558, 2021b. Yang, Y., Feng, Z., Song, M., and Wang, X. Factorizable graph convolutional networks. In Neur IPS, volume 33, pp. 20286 20296, 2020. Zhang, C., Zhang, K., Zhang, C., Pham, T. X., Yoo, C. D., and Kweon, I. S. How does simsiam avoid collapse without negative samples? a unified understanding with self-supervised contrastive learning. In ICLR, 2021. Zhang, H. and Kou, G. Role-based multiplex network embedding. In ICML, pp. 26265 26280, 2022. Zhang, H., Qiu, L., Yi, L., and Song, Y. Scalable multiplex network embedding. In IJCAI, pp. 3082 3088, 2018. Zhang, Y., Zhu, H., Song, Z., Koniusz, P., and King, I. Costa: Covariance-preserving feature augmentation for graph contrastive learning. In KDD, pp. 2524 2534, 2022. Zhou, S., Yu, K., Chen, D., Li, B., Feng, Y., and Chen, C. Collaborative knowledge distillation for heterogeneous information network embedding. In WWW, pp. 1631 1639, 2022. Zhu, Y., Min, M. R., Kadav, A., and Graf, H. P. S3VAE: self-supervised sequential VAE for representation disentanglement and data generation. In CVPR, pp. 6537 6546, 2020a. Zhu, Y., Xu, Y., Yu, F., Liu, Q., Wu, S., and Wang, L. Deep graph contrastive representation learning. ar Xiv preprint ar Xiv:2006.04131, 2020b. Zhu, Y., Xu, Y., Yu, F., Liu, Q., Wu, S., and Wang, L. Graph contrastive learning with adaptive augmentation. In WWW, pp. 2069 2080, 2021. Zhu, Y., Xu, Y., Cui, H., Yang, C., Liu, Q., and Wu, S. Structure-enhanced heterogeneous graph contrastive learning. In SDM, pp. 82 90, 2022. Disentangled Multiplex Graph Representation Learning A. Algorithm This section provides the pseudo-code of the proposed method. Algorithm 1 The pseudo-code of the proposed DMG. Input: Node features X(r) and graph structure A(r) of graph G(r) for r [1, R], non-negative parameters α, β and λ, maximum training steps E; Output: Encoders f (r) c g(r), f (r) p g(r), decoder p(r), and measurable functions ϕ(r) and ψ(r); 1: Initialize parameters; 2: while not reaching E do 3: Obtain common variable S with orthogonality and zero mean via singular value decomposition; 4: while not reaching E do 5: Obtain common and private representations (i.e., C(r) and P(r)) with encoders f (r) c g(r), f (r) p g(r); 6: Conduct the matching loss between C(r) and S by Eq. 
(2); 7: Conduct the correlation loss between C(r) and P(r) under measurable functions ϕ(r) and ψ(r) by Eq. (3); 8: Conduct the reconstruction loss between reconstructed node representations and original input by Eq. (4); 9: Calculate the cosine similarity ϵ(r) ij between si and sj of the node pair (vi, vj) E(r) by Eq. (5); 10: Conduct the contrastive loss on P(r) based on complementary and noisy edges set E(r) c and E(r) p by Eq. (6); 11: Compute the objective function J by Eq. (8); 12: Back-propagate J to update model weights; 13: end while 14: end while B. Proofs in Section 3 This section provides detailed proofs of Theorems in Section 3. B.1. Proof of Theorem 3.1 Proof. Consider the matching loss in Eq. (2) achieves its minimization, which indicates that the common representations C(r) from different graphs are perfectly aligned, i.e., C(r) = C(r ) (r = r ). Assuming that the solution f (r) c satisfying the constraint in Eq. (2) has been found, then we have: f (r) c g(r)(X(r), A(r)) = f (r ) c g(r )(X(r ), A(r )). (10) As the solution satisfies the correlation loss in Eq. (3) and the reconstruction loss in Eq. (4), the common and private representations (i.e., C(r) and P(r)) are expected to be statistical independent, and f (r) c g(r) are expected to be invertible. We simply use q(r) c to denote the inverted function g(r)( 1) f (r)( 1) c . Denote the optimal common and private representations as C and P(r) , which contain complete and clean common and private information, respectively. Note that the optimal common and private representations are also statistically independent. According to the invertibility of q(r) c and the independence between C and P(r) , we can transform Eq. (10) to: where q(r) c (ϑ(r)) = g(r)( 1) f (r)( 1) c (ϑ(r)), and ϑ(r) = [C , (P(r) ) ] . Therefore, to demonstrate the functions f (r) c can extract complete and clean common information, we only have to demonstrate that q(r) c is the function of only C but not the function of P(r) . To do this, we then calculate the Jacobian of q(r) to analyze the first-order partial derivatives of q(r) c and q(r) p w.r.t. C and P(r) . The Jacobian of q(r) can be formulated as: " J(r) 11 J(r) 12 J(r) 21 J(r) 22 Disentangled Multiplex Graph Representation Learning where J(r) 11 RD D, J(r) 12 RD d, J(r) 21 Rd D and J(r) 22 Rd d are Jacobian matrices, and elements of them can be formulated as: h J(r) 11 i i,j = h q(r) c (ϑ(r)) i i c(r) j , h J(r) 12 i i,k = h q(r) c (ϑ(r)) i h J(r) 21 i k,i = h q(r) p (ϑ(r)) i k c(r) i , h J(r) 22 i k,l = h q(r) p (ϑ(r)) i where i, j [1, D], k, l [1, d]. Then we only have to demonstrate that J(r) 12 is an all-zero matrix while the determinant of J(r) 11 is non-zero to show that the matrix consisting of all the partial derivatives of q(r) c w.r.t. C is full rank while any partial derivatives of q(r) c w.r.t. P(r) is zero. Note that Eq. (11) holds over the whole latent space. Therefore, with any fixed C and P(r ) , for all P(r) , we have: Then we take the partial derivatives of Eq. (14) w.r.t. p(r) j for j [1, d], and we have: J(r) 12 | C,P(r) = J(r ) 12 | C, P(r ). According to the chain rules and taking derivatives of constants, we further have: J(r ) 12 | C, P(r ) = Jq(r ) c = 0D d, (15) where Jq(r ) c RD (D+d) is the Jacobian of q(r ) c . Note that the equation above holds for any fixed C and P(r) , and thus the same derivation holds for all C and P(r) . Therefore, J(r) 12 is an all-zero matrix and the learned q(r) c (ϑ(r)) is not a function of P(r) . Based on the above proof, we can rewrite Eq. 
Based on the above proof, we can rewrite Eq. (12) as

J^{(r)} = \begin{bmatrix} J^{(r)}_{11} & 0_{D \times d} \\ J^{(r)}_{21} & J^{(r)}_{22} \end{bmatrix}. \qquad (16)

According to the determinant property of block triangular matrices and the invertibility of q^{(r)}, we have

\det(J^{(r)}) = \det(J^{(r)}_{11}) \cdot \det(J^{(r)}_{22}) \neq 0. \qquad (17)

This indicates that det(J^(r)_{11}) ≠ 0 and det(J^(r)_{22}) ≠ 0. Therefore, J^(r)_{11} is a full-rank matrix and q_c^(r) is a function of only C* but not of P(r)*, i.e., for r ∈ [1, R], we have C(r) = φ(C*), where φ is an invertible function since det(J^(r)_{11}) ≠ 0. Therefore, we complete the proof.

B.2. Proof of Theorem 3.3

In the following proofs, for random variables A, B, and C, we use I(A, B) to denote the mutual information between A and B, I(A, B | C) to denote the conditional mutual information between A and B given C, I(A, B, C) to denote the (three-way) interaction mutual information among A, B, and C, H(A) to denote the entropy, and H(A | B) to denote the conditional entropy. We first list some properties of mutual information and entropy that will be used in the proofs.

Property 1. Relationship between mutual information and entropy:

I(A, B) = H(A) - H(A \mid B). \qquad (18)

Property 2. Relationship between conditional mutual information and entropy:

I(A, B \mid C) = H(A \mid C) - H(A \mid B, C). \qquad (19)

Property 3. Chain rule of mutual information:

I(A, B \mid C) = I(A, B) - I(A, B, C). \qquad (20)

Property 4. Relationship between conditional entropy and entropy:

H(A \mid B) = H(A, B) - H(B). \qquad (21)

To aid in the proof of Theorem 3.3, we first introduce the following lemma.

Lemma B.1. Given a downstream task T, the node representations Ĥ(r) learned by our method, and the node representations H̃(r) learned by previous contrastive learning methods that maximize the mutual information among different graphs (i.e., I(H̃(r), H̃(r'))), we have

I(\hat{H}^{(r)}, G^{(r')}, T) = I(\tilde{H}^{(r)}, G^{(r')}, T) = I(G^{(r)}, G^{(r')}, T), \qquad (22)

H(\hat{H}^{(r)}) - H(\tilde{H}^{(r)}) = H(\hat{H}^{(r)} \mid G^{(r')}) - H(\tilde{H}^{(r)} \mid G^{(r')}). \qquad (23)

Proof. According to the proof of Theorem 3.1, our method is able to obtain the complete and clean common information among different graphs, so we have

I(\hat{H}^{(r)}, G^{(r')}) = I(G^{(r)}, G^{(r')}). \qquad (24)

Assume previous contrastive learning methods also obtain the complete common information among different graphs via mutual information maximization; then we have

I(\tilde{H}^{(r)}, G^{(r')}) = I(G^{(r)}, G^{(r')}). \qquad (25)

We then introduce an assumption that is widely used in previous works (Federici et al., 2020; Wang et al., 2022a): if the random variable C is observed, then the random variable A is conditionally independent of any other variable B, i.e., I(A, B | C) = 0, ∀B. Based on Eq. (24), Eq. (25), Properties 2-3, and the assumption above, we have

\begin{aligned}
& I(G^{(r)}, G^{(r')}, T) - I(\hat{H}^{(r)}, G^{(r')}, T) \\
&= [I(G^{(r)}, G^{(r')}) - I(G^{(r)}, G^{(r')} \mid T)] - [I(\hat{H}^{(r)}, G^{(r')}) - I(\hat{H}^{(r)}, G^{(r')} \mid T)] \\
&= I(\hat{H}^{(r)}, G^{(r')} \mid T) - I(G^{(r)}, G^{(r')} \mid T) \\
&= [H(G^{(r')} \mid T) - H(G^{(r')} \mid \hat{H}^{(r)}, T)] - [H(G^{(r')} \mid T) - H(G^{(r')} \mid G^{(r)}, T)] \\
&= H(G^{(r')} \mid G^{(r)}, T) - H(G^{(r')} \mid \hat{H}^{(r)}, T) \\
&= [I(\hat{H}^{(r)}, G^{(r')} \mid G^{(r)}, T) + H(G^{(r')} \mid G^{(r)}, \hat{H}^{(r)}, T)] - [I(G^{(r)}, G^{(r')} \mid \hat{H}^{(r)}, T) + H(G^{(r')} \mid G^{(r)}, \hat{H}^{(r)}, T)] \\
&= I(\hat{H}^{(r)}, G^{(r')} \mid G^{(r)}, T) - I(G^{(r)}, G^{(r')} \mid \hat{H}^{(r)}, T) \\
&= I(\hat{H}^{(r)}, G^{(r')} \mid G^{(r)}, T) = 0, \qquad (26)
\end{aligned}

where the second equality uses Eq. (24), the second-to-last equality uses the assumption above (so that I(G^{(r)}, G^{(r')} | Ĥ^{(r)}, T) = 0), and the last equality holds because Ĥ^{(r)} is a deterministic function of G^{(r)}, so that I(Ĥ^{(r)}, G^{(r')} | G^{(r)}, T) = 0. Similarly, we can obtain I(G^{(r)}, G^{(r')}, T) − I(H̃^{(r)}, G^{(r')}, T) = 0. Therefore, we have I(Ĥ^{(r)}, G^{(r')}, T) = I(G^{(r)}, G^{(r')}, T) = I(H̃^{(r)}, G^{(r')}, T).

In addition, based on Eq. (24), Eq.
(25), and Properties 1 and 4, we further have

\begin{aligned}
& H(\hat{H}^{(r)}) - H(\tilde{H}^{(r)}) - H(\hat{H}^{(r)} \mid G^{(r')}) + H(\tilde{H}^{(r)} \mid G^{(r')}) \\
&= H(\hat{H}^{(r)}) - H(\tilde{H}^{(r)}) - H(\hat{H}^{(r)}, G^{(r')}) + H(G^{(r')}) + H(\tilde{H}^{(r)}, G^{(r')}) - H(G^{(r')}) \\
&= H(\hat{H}^{(r)}) - H(\tilde{H}^{(r)}) - H(\hat{H}^{(r)}, G^{(r')}) + H(\tilde{H}^{(r)}, G^{(r')}) \\
&= \big[H(\hat{H}^{(r)}) - H(\hat{H}^{(r)}, G^{(r')}) + H(G^{(r')})\big] - \big[H(\tilde{H}^{(r)}) - H(\tilde{H}^{(r)}, G^{(r')}) + H(G^{(r')})\big] \\
&= I(\hat{H}^{(r)}, G^{(r')}) - I(\tilde{H}^{(r)}, G^{(r')}) \\
&= I(G^{(r)}, G^{(r')}) - I(G^{(r)}, G^{(r')}) = 0, \qquad (27)
\end{aligned}

where the second equality uses Property 4, the second-to-last equality uses Properties 1 and 4 (i.e., I(A, B) = H(A) + H(B) − H(A, B)), and the last equality uses Eqs. (24) and (25). Therefore, we have H(Ĥ^{(r)}) − H(H̃^{(r)}) = H(Ĥ^{(r)} | G^{(r')}) − H(H̃^{(r)} | G^{(r')}) and we complete the proof.

Now we can prove Theorem 3.3.

Proof of Theorem 3.3. We divide the proof into two parts, i.e., 1) I(Ĥ^{(r)}, T) ≥ I(H̃^{(r)}, T) and 2) H(Ĥ^{(r)} | T) ≤ H(H̃^{(r)} | T).

We first prove that I(Ĥ^{(r)}, T) ≥ I(H̃^{(r)}, T) holds. Denote the complementary information of the representations learned by our method and by previous contrastive learning methods as I(Ĥ^{(r)}, T | G^{(r')}) and I(H̃^{(r)}, T | G^{(r')}), respectively. With the contrastive loss in Eq. (6) achieving its minimum and thus preserving the complementarity in each graph, we have I(Ĥ^{(r)}, T | G^{(r')}) ≥ I(H̃^{(r)}, T | G^{(r')}). Note that with Property 3, we have

I(\hat{H}^{(r)}, T) = I(\hat{H}^{(r)}, T, G^{(r')}) + I(\hat{H}^{(r)}, T \mid G^{(r')}). \qquad (28)

According to Eq. (22) in Lemma B.1, i.e., I(Ĥ^{(r)}, G^{(r')}, T) = I(H̃^{(r)}, G^{(r')}, T) = I(G^{(r)}, G^{(r')}, T), we then have

\begin{aligned}
I(\hat{H}^{(r)}, T) &= I(\tilde{H}^{(r)}, T, G^{(r')}) + I(\hat{H}^{(r)}, T \mid G^{(r')}) \\
&= I(\tilde{H}^{(r)}, T) - I(\tilde{H}^{(r)}, T \mid G^{(r')}) + I(\hat{H}^{(r)}, T \mid G^{(r')}). \qquad (29)
\end{aligned}

Based on I(Ĥ^{(r)}, T | G^{(r')}) ≥ I(H̃^{(r)}, T | G^{(r')}), we have I(Ĥ^{(r)}, T) ≥ I(H̃^{(r)}, T).

Similar to the above, we denote the noisy information of the representations learned by our method and by previous contrastive learning methods as H(Ĥ^{(r)} | G^{(r')}, T) and H(H̃^{(r)} | G^{(r')}, T), respectively. With the contrastive loss in Eq. (6) achieving its minimum and thus removing the noise in each graph, we have H(Ĥ^{(r)} | G^{(r')}, T) ≤ H(H̃^{(r)} | G^{(r')}, T). According to Properties 1-3 and Eq. (23) in Lemma B.1, i.e., H(Ĥ^{(r)}) − H(H̃^{(r)}) = H(Ĥ^{(r)} | G^{(r')}) − H(H̃^{(r)} | G^{(r')}), we then have

\begin{aligned}
H(\hat{H}^{(r)} \mid T) &= H(\hat{H}^{(r)}) - I(\hat{H}^{(r)}, T) \\
&= H(\hat{H}^{(r)}) - \big[I(\hat{H}^{(r)}, T, G^{(r')}) + I(\hat{H}^{(r)}, T \mid G^{(r')})\big] \\
&= H(\hat{H}^{(r)}) - I(\tilde{H}^{(r)}, T, G^{(r')}) - I(\hat{H}^{(r)}, T \mid G^{(r')}) \\
&= H(\hat{H}^{(r)}) - I(\tilde{H}^{(r)}, T) + I(\tilde{H}^{(r)}, T \mid G^{(r')}) - I(\hat{H}^{(r)}, T \mid G^{(r')}) \\
&= H(\hat{H}^{(r)}) - \big[H(\tilde{H}^{(r)}) - H(\tilde{H}^{(r)} \mid T)\big] + I(\tilde{H}^{(r)}, T \mid G^{(r')}) - I(\hat{H}^{(r)}, T \mid G^{(r')}) \\
&= H(\tilde{H}^{(r)} \mid T) + H(\hat{H}^{(r)}) - H(\tilde{H}^{(r)}) + I(\tilde{H}^{(r)}, T \mid G^{(r')}) - I(\hat{H}^{(r)}, T \mid G^{(r')}) \\
&= H(\tilde{H}^{(r)} \mid T) + H(\hat{H}^{(r)}) - H(\tilde{H}^{(r)}) + H(\tilde{H}^{(r)} \mid G^{(r')}) - H(\tilde{H}^{(r)} \mid G^{(r')}, T) - H(\hat{H}^{(r)} \mid G^{(r')}) + H(\hat{H}^{(r)} \mid G^{(r')}, T) \\
&= H(\tilde{H}^{(r)} \mid T) - H(\tilde{H}^{(r)} \mid G^{(r')}, T) + H(\hat{H}^{(r)} \mid G^{(r')}, T), \qquad (30)
\end{aligned}

where the last equality uses Eq. (23). Based on H(Ĥ^{(r)} | G^{(r')}, T) ≤ H(H̃^{(r)} | G^{(r')}, T), we have H(Ĥ^{(r)} | T) ≤ H(H̃^{(r)} | T). Therefore, we complete the proof.
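The information-theoretic identities above (Properties 1-4, together with the interaction-information form of Property 3 used in the derivations) can be checked numerically on any discrete joint distribution. The following is a minimal NumPy sketch; the toy joint distribution and helper names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))                 # hypothetical joint pmf p(a, b, c) over three binary variables
p /= p.sum()

def H(*keep):
    """Entropy (in nats) of the marginal of p over the kept axes (0 = A, 1 = B, 2 = C)."""
    drop = tuple(ax for ax in range(3) if ax not in keep)
    m = p.sum(axis=drop) if drop else p
    m = m[m > 0]
    return float(-(m * np.log(m)).sum())

HA, HB, HC = H(0), H(1), H(2)
HAB, HAC, HBC, HABC = H(0, 1), H(0, 2), H(1, 2), H(0, 1, 2)

I_AB = HA + HB - HAB                                   # I(A, B)
I_AB_given_C = HAC + HBC - HABC - HC                   # I(A, B | C)
I_ABC = HA + HB + HC - HAB - HAC - HBC + HABC          # interaction information I(A, B, C)

assert np.isclose(I_AB, HA - (HAB - HB))               # Properties 1 and 4
assert np.isclose(I_AB_given_C, (HAC - HC) - (HABC - HBC))   # Property 2
assert np.isclose(I_AB_given_C, I_AB - I_ABC)          # Property 3
print("Properties 1-4 hold on this toy distribution.")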
C. Experimental Settings

This section provides the detailed experimental settings of Section 4, including the description of all datasets in Section C.1, a summary of all comparison methods in Section C.2, the model architectures and settings in Section C.3, and the evaluation protocol in Section C.4.

Table 5. Statistics of all datasets.

Datasets | Nodes | Meta-paths | Edges | Features | Labeled Nodes | Classes
ACM | 3,025 | Paper-Subject-Paper (PSP) | 2,210,761 | 1,830 (Paper Abstract) | 600 | 3
ACM | | Paper-Author-Paper (PAP) | 29,281 | | |
IMDB | 4,780 | Movie-Actor-Movie (MAM) | 98,010 | 1,232 (Movie Plot) | 300 | 3
IMDB | | Movie-Director-Movie (MDM) | 21,018 | | |
DBLP | 4,057 | Author-Paper-Author (APA) | 11,113 | 334 (Paper Abstract) | 800 | 4
DBLP | | Author-Paper-Conference-Paper-Author (APCPA) | 5,000,495 | | |
DBLP | | Author-Paper-Term-Paper-Author (APTPA) | 6,776,335 | | |
Freebase | 3,492 | Movie-Actor-Movie (MAM) | 254,702 | 3,492 (One-hot Encoding) | 60 | 3
Freebase | | Movie-Director-Movie (MDM) | 8,404 | | |
Freebase | | Movie-Writer-Movie (MWM) | 10,706 | | |
Photo | 7,487 | Product-Customer-Product (PCP) | 287,326 | 745 (Product Reviews) | 765 | 8
Computers | 13,752 | Product-Customer-Product (PCP) | 574,418 | 767 (Product Reviews) | 1,375 | 10

C.1. Datasets

We use four public multiplex graph datasets and two single-view graph datasets from various domains. The multiplex graph datasets include two citation multiplex graph datasets (i.e., ACM (Wang et al., 2019) and DBLP (Wang et al., 2019)) and two movie multiplex graph datasets (i.e., IMDB (Wang et al., 2019) and Freebase (Wang et al., 2021)). The single-view graph datasets include two Amazon co-purchase datasets (i.e., Photo and Computers (Shchur et al., 2018)). Table 5 summarizes the data statistics. We list the details of the datasets as follows.

ACM contains 3,025 papers with graphs generated by two meta-paths (i.e., paper-author-paper and paper-subject-paper). The feature of each paper is a 1,830-dimensional bag-of-words representation of its abstract. Papers are categorized into three classes, i.e., database, wireless communication, and data mining.

IMDB contains 4,780 movies with graphs generated by two meta-paths (i.e., movie-actor-movie and movie-director-movie). The feature of each movie is a 1,232-dimensional bag-of-words representation of its plot. Movies are categorized into three classes, i.e., action, comedy, and drama.

DBLP contains 4,057 authors with graphs generated by three meta-paths (i.e., author-paper-author, author-paper-conference-paper-author, and author-paper-term-paper-author). The feature of each author is a 334-dimensional bag-of-words representation of the associated paper abstracts. Authors are categorized into four classes, i.e., database, data mining, machine learning, and information retrieval.

Freebase contains 3,492 movies with graphs generated by three meta-paths (i.e., movie-actor-movie, movie-director-movie, and movie-writer-movie). We assign one-hot encodings to this dataset as no node features are provided. Movies are categorized into three classes, i.e., action, comedy, and drama.

Photo and Computers contain 7,487 and 13,752 products, respectively. Edges in each dataset indicate that two products are frequently bought together. The feature of each product is a bag-of-words encoding of product reviews. Products are categorized into several classes by product category.

Dataset sources: ACM: https://www.acm.org/; IMDB: https://www.imdb.org/; DBLP: https://aminer.org/AMinerNetwork/; Freebase: http://www.freebase.com/; Photo and Computers: https://docs.dgl.ai/en/0.6.x/api/python/dgl.data.html.
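To make the meta-path statistics in Table 5 concrete: the adjacency matrix of a meta-path graph such as PAP is typically obtained by composing the corresponding bipartite relations. The following SciPy sketch illustrates this for a hypothetical paper-author incidence matrix; the random incidence matrix and the number of authors are illustrative assumptions, and only the number of papers matches Table 5.

import numpy as np
import scipy.sparse as sp

# Hypothetical paper-author incidence matrix: entry (i, j) = 1 if paper i is written by author j.
num_papers, num_authors = 3025, 5912      # num_authors is an assumed, illustrative value
rng = np.random.default_rng(0)
rows = rng.integers(0, num_papers, size=10000)
cols = rng.integers(0, num_authors, size=10000)
paper_author = sp.csr_matrix((np.ones(len(rows), dtype=np.float32), (rows, cols)),
                             shape=(num_papers, num_authors))

# PAP meta-path: two papers are connected if they share at least one author.
pap = paper_author @ paper_author.T       # co-author counts between papers
pap.setdiag(0)                            # remove self-loops
pap.eliminate_zeros()
pap.data[:] = 1.0                         # binarize to an unweighted adjacency matrix
print(pap.shape, int(pap.nnz), "edges in this synthetic PAP graph")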
Table 6. The characteristics of all comparison methods. The methods are compared along the following attributes: Multiplex, Single-view, Semi-sup, Tra-unsup, Self-sup, Features, Com-info, Pri-info, and Comp-noise. Methods: GCN, GAT, DeepWalk, VGAE, DGI, GMI, MVGRL, GRACE, GCA, GIC, COSTA, DSSL, MNE, HAN, DMGI, DMGIattn, HDMI, HeCo, MCGC, CKD, and DMG (ours).

C.2. Comparison Methods

The comparison methods include twelve methods designed for the single-view graph and eight designed for the multiplex graph, i.e., GCN (Kipf & Welling, 2017), GAT (Velickovic et al., 2018), DeepWalk (Perozzi et al., 2014), VGAE (Kipf & Welling, 2016), DGI (Velickovic et al., 2019), GMI (Peng et al., 2020), MVGRL (Hassani & Khasahmadi, 2020), GRACE (Zhu et al., 2020b), GCA (Zhu et al., 2021), GIC (Mavromatis & Karypis, 2021), COSTA (Zhang et al., 2022), DSSL (Xiao et al., 2022), HAN (Wang et al., 2019), MNE (Zhang et al., 2018), DMGI (Park et al., 2020), DMGIattn (Park et al., 2020), HDMI (Jing et al., 2021a), HeCo (Wang et al., 2021), MCGC (Pan & Kang, 2021), and CKD (Zhou et al., 2022). The characteristics of all methods are listed in Table 6, where "Multiplex" and "Single-view" indicate that the method is designed for the multiplex graph and the single-view graph, respectively. "Semi-sup", "Tra-unsup", and "Self-sup" indicate that the method conducts semi-supervised learning, traditional unsupervised learning, and self-supervised learning, respectively. Note that our DMG is a new unsupervised framework that cannot simply be classified as traditional unsupervised learning or self-supervised learning. "Features" indicates that the method takes the node features into account. "Com-info", "Pri-info", and "Comp-noise" indicate that the method takes the common information, the private information, and the complementarity and noise into account, respectively.

C.3. Model Architectures and Settings

As described in Section 3, the proposed DMG employs a one-layer GCN and an MLP as the encoders (i.e., g(r) and f(r)) to obtain common representations C(r) ∈ R^{N×D} and private representations P(r) ∈ R^{N×d}. Note that we assign different weights ω to the self-connections of the multiplex graph datasets and the single-view graph datasets during graph convolution. The proposed DMG then investigates the correlation loss to enforce the independence between the common and private representations with the measurable functions (i.e., ϕ(r) and ψ(r)), which are implemented as two-layer MLPs. After that, the proposed DMG investigates the reconstruction loss to promote the invertibility of the encoders with the reconstruction network p(r), which is also implemented as an MLP. All parameters were optimized by the Adam optimizer (Kingma & Ba, 2015) with an initial learning rate of 1e-3 and a weight decay of 1e-4. We apply the ReLU function (Nair & Hinton, 2010) as the nonlinear activation function. In all experiments, we repeat the experiments five times for all methods and report the average results. Table 7 lists the detailed settings and architecture for most of our experimental setups with DMG.

Table 7. Settings for the proposed DMG.

Settings | ACM | IMDB | DBLP | Freebase | Photo | Computers
D | 8 | 8 | 8 | 8 | 16 | 40
d | 2 | 2 | 2 | 2 | 2 | 4
ω | 3 | 3 | 3 | 3 | 1 | 1
Hidden units of g(r) | 256 | 512 | 256 | 256 | 256 | 512
Hidden units of ϕ(r) | 256 | 256 | 256 | 256 | 256 | 256
Hidden units of ψ(r) | 256 | 256 | 256 | 256 | 256 | 256
Layers of p(r) | 3 | 2 | 2 | 2 | 2 | 2
Learning rate | 1e-3 | 1e-3 | 1e-3 | 1e-3 | 1e-3 | 1e-3
Weight decay | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4
Dropout | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1
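For concreteness, the encoder layout described above (a one-layer graph convolution g(r), heads emitting C(r) ∈ R^{N×D} and P(r) ∈ R^{N×d}, two-layer MLPs for the measurable functions, and an MLP reconstruction network) can be sketched as follows. This is a minimal PyTorch sketch under our own reading of the text: the module names, the dense normalized adjacency, and the way ω enters as a self-loop weight are illustrative assumptions, not the authors' released code; default dimensions follow the ACM column of Table 7.

import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One-layer graph convolution: H = ReLU(A_hat X W), where A_hat is a dense,
    symmetrically normalized adjacency with omega-weighted self-loops (assumed form)."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, hid_dim)

    def forward(self, x, adj, omega=3.0):
        a = adj + omega * torch.eye(adj.size(0), device=adj.device)
        deg_inv_sqrt = a.sum(1).clamp(min=1e-12).pow(-0.5)
        a_hat = deg_inv_sqrt.unsqueeze(1) * a * deg_inv_sqrt.unsqueeze(0)
        return torch.relu(a_hat @ self.lin(x))

class DMGEncoder(nn.Module):
    """Per-graph encoder: g(r) followed by a common head and a private head."""
    def __init__(self, in_dim, hid_dim=256, common_dim=8, private_dim=2):
        super().__init__()
        self.gcn = GraphConvLayer(in_dim, hid_dim)
        self.common_head = nn.Linear(hid_dim, common_dim)    # C(r) in R^{N x D}
        self.private_head = nn.Linear(hid_dim, private_dim)  # P(r) in R^{N x d}
        # measurable functions phi(r), psi(r): two-layer MLPs used by the correlation loss
        self.phi = nn.Sequential(nn.Linear(common_dim, 256), nn.ReLU(), nn.Linear(256, 1))
        self.psi = nn.Sequential(nn.Linear(private_dim, 256), nn.ReLU(), nn.Linear(256, 1))
        # reconstruction network p(r) mapping [C(r), P(r)] back to the input features
        self.recon = nn.Sequential(nn.Linear(common_dim + private_dim, hid_dim),
                                   nn.ReLU(), nn.Linear(hid_dim, in_dim))

    def forward(self, x, adj):
        h = self.gcn(x, adj)
        c, p = self.common_head(h), self.private_head(h)
        x_rec = self.recon(torch.cat([c, p], dim=1))
        return c, p, x_rec

# Illustrative usage on one graph with random data.
x = torch.randn(100, 1830)                       # node features
adj = (torch.rand(100, 100) < 0.05).float()
adj = ((adj + adj.t()) > 0).float()              # symmetrize
model = DMGEncoder(in_dim=1830)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
c, p, x_rec = model(x, adj)
print(c.shape, p.shape, x_rec.shape)             # (100, 8), (100, 2), (100, 1830)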
C.4. Evaluation Protocol

We follow the evaluation protocol of previous works (Jing et al., 2021a; Pan & Kang, 2021; Zhou et al., 2022), where the model is first trained in an unsupervised manner and the learned representations are then evaluated on several downstream tasks (i.e., node classification and node clustering). For the node classification task, we evaluate the effectiveness of all methods with the Micro-F1 and Macro-F1 scores. For the node clustering task, we evaluate the effectiveness of all methods with the Accuracy score and the Normalized Mutual Information (NMI) score, where

\mathrm{NMI} = \frac{2\, I(\hat{Y}, Y)}{H(\hat{Y}) + H(Y)},

and Ŷ and Y refer to the predicted cluster indexes and the class labels, respectively.

D. Additional Experiments

This section provides additional experimental results to further support the proposed method, including the parameter analysis in Section D.1, an additional ablation study in Section D.2, the visualization of disentangled representation learning in Section D.3, comparison experiments with the hard matching loss and the InfoNCE loss in Section D.4, and additional results on robustness on other multiplex datasets in Figures 6 and 7.

Figure 3. Classification results of our method under different parameter settings (i.e., α, β, and λ), each varied over [10^-3, 10^3], on all multiplex graph datasets (ACM, IMDB, DBLP, and Freebase).

D.1. Parameter Analysis

In the proposed method DMG, we employ the non-negative parameters (i.e., α, β, and λ) to achieve a trade-off among the terms of the objective function. To investigate the impact of α, β, and λ in Eq. (8), we conduct node classification on all multiplex graph datasets by varying the value of each parameter in the range [10^-3, 10^3] and report the results in Figure 3. As shown in Figure 3, the proposed DMG consistently achieves strong performance when the parameter values are set appropriately (e.g., within [10^-2, 10^0]). Moreover, if the parameter values are too large (e.g., > 10^1) or too small (e.g., < 10^-2), the proposed DMG obtains inferior performance, due to the failure either to obtain disentangled common and private representations or to preserve the complementarity and remove the noise in the private representations. This again validates that it is necessary to disentangle the common and private representations, as well as to preserve the complementarity and remove the noise in the private representations, and it further validates the effectiveness of our method.

Figure 4. Visualization plotted by t-SNE (with the corresponding silhouette scores (SIL) of the node representations) for our method on the ACM dataset, where different colors represent different classes of nodes: (a) DMG without Lcor (SIL = 0.145); (b) DMG without Lrec (SIL = 0.381); (c) DMG without Lcon (SIL = 0.406); (d) DMG with the complete objective function J (SIL = 0.452).
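For reference, the silhouette scores (SIL) reported in Figure 4, the t-SNE projection, and the clustering metrics of Section C.4 can all be computed with standard scikit-learn utilities. The following is a minimal sketch on hypothetical representations Z and labels y; the t-SNE, KMeans, and classifier-free F1 settings are illustrative choices rather than the exact ones used for the figures and tables.

import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, normalized_mutual_info_score, f1_score

rng = np.random.default_rng(0)
Z = rng.normal(size=(3025, 10))                   # hypothetical node representations
y = rng.integers(0, 3, size=3025)                 # hypothetical class labels (3 classes, as on ACM)

sil = silhouette_score(Z, y)                      # silhouette score w.r.t. class labels, as in Figure 4
Z_2d = TSNE(n_components=2, init="pca", random_state=0).fit_transform(Z)   # 2-D embedding to plot

pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)      # clustering evaluation
nmi = normalized_mutual_info_score(y, pred)       # NMI as defined in Section C.4

y_hat = rng.integers(0, 3, size=3025)             # placeholder classifier predictions
macro = f1_score(y, y_hat, average="macro")       # Macro-F1
micro = f1_score(y, y_hat, average="micro")       # Micro-F1
print(f"SIL={sil:.3f}, NMI={nmi:.3f}, Macro-F1={macro:.3f}, Micro-F1={micro:.3f}")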
D.2. Additional Ablation Study

D.2.1. EFFECTIVENESS OF EACH COMPONENT

In the main paper, we investigate the effectiveness of each component of the objective function on the semi-supervised task (i.e., node classification). To further verify and visualize the effectiveness of each component on an unsupervised task, we examine the t-SNE visualization (Van der Maaten & Hinton, 2008) of the representations learned with different components of the objective function removed, and report the results in Figure 4. Similar to the ablation study on the classification task, we have the following observations. First, our method with the complete objective function outperforms every variant with one loss removed, indicating that each component of the objective function is necessary for our method. Second, the variant without Lcor obtains the worst results compared to the other two variants (without Lrec and without Lcon, respectively), indicating that the correlation loss may play an important role in our framework. The t-SNE visualization results are consistent with the ablation study of each component on node classification in the main paper. Thus, the effectiveness of each component is further verified.

Table 8. Classification and clustering performance under different combinations of private representations of graphs.

Datasets | Graphs | Classification Macro-F1 | Classification Micro-F1 | Clustering Accuracy | Clustering NMI
ACM | PSP | 90.3±0.2 | 90.3±0.4 | 92.1±0.3 | 73.5±0.4
ACM | PAP | 90.7±0.4 | 90.6±0.5 | 92.6±0.3 | 74.1±0.2
ACM | PSP+PAP | 91.0±0.3 | 90.9±0.4 | 92.9±0.3 | 74.5±0.4
IMDB | MAM | 55.6±0.4 | 57.5±0.5 | 59.9±0.3 | 16.3±0.2
IMDB | MDM | 57.1±0.3 | 58.3±0.3 | 60.2±0.4 | 16.9±0.2
IMDB | MAM+MDM | 57.6±0.2 | 58.9±0.4 | 60.3±0.5 | 17.0±0.3

D.2.2. EFFECTIVENESS OF PRIVATE REPRESENTATIONS IN EACH GRAPH

With the complementarity and noise constraint in our framework, the private representations learned by our method are expected to contain the complementary contents of each graph. Therefore, we investigate the classification and clustering performance with the private representations belonging to different graphs, and report the results in Table 8. From Table 8, we have the following observations. First, private representations belonging to different graphs tend to have different importance. For example, our method using only the private representations of the MDM graph outperforms our method using only those of the MAM graph on the IMDB dataset on all downstream tasks. This makes sense, as the complementary contents in a certain graph may be richer than those in others. Second, our method with two graphs outperforms our method with only one graph on all downstream tasks. This verifies again that our method is able to preserve the complementarity in different graphs to benefit downstream tasks.

Table 9. Classification performance (i.e., Macro-F1 and Micro-F1) on all multiplex graph datasets.

Method | ACM Macro-F1 | ACM Micro-F1 | IMDB Macro-F1 | IMDB Micro-F1 | DBLP Macro-F1 | DBLP Micro-F1 | Freebase Macro-F1 | Freebase Micro-F1
DMG-HR | 76.6±0.5 | 76.7±0.4 | 43.1±0.3 | 44.6±0.4 | 88.8±0.4 | 90.2±0.5 | 48.1±0.4 | 49.5±0.5
DMG-NCE | 88.2±0.3 | 88.1±0.4 | 55.2±0.5 | 56.9±0.3 | 92.7±0.3 | 93.6±0.2 | 59.4±0.4 | 64.0±0.3
DMG | 91.0±0.3 | 90.9±0.4 | 57.6±0.2 | 58.9±0.4 | 93.3±0.2 | 94.0±0.3 | 62.4±0.7 | 65.9±0.8

Figure 5. The correlation map of the common and private representations learned by the proposed DMG on the ACM dataset (dimensions indexed 0-15; the first 8 are common and the last 8 are private).

D.3. Visualization of Disentangled Representation Learning

To verify the effectiveness of disentangled representation learning, i.e., that the common representations are clean with respect to the private representations, we follow the literature (Ma et al., 2019; Li et al., 2021) and calculate the correlation between the common representations and the private representations. Specifically, the smaller the correlation between the common representations and the private representations, the cleaner the common information learned by the representation learning method. To do this, we run our method on the ACM dataset and visualize the correlation of the learned representations (the concatenation of the common and private representations, whose dimensions are both set to 8) in Figure 5; a minimal sketch of this computation is given below.
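The sketch assumes the learned common and private representations are available as NumPy arrays C and P (here random placeholders); whether Figure 5 plots signed or absolute correlations is not specified, so plain correlations are shown, and the matplotlib settings are illustrative.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
C = rng.normal(size=(3025, 8))              # hypothetical common representations on ACM
P = rng.normal(size=(3025, 8))              # hypothetical (fused) private representations

Z = np.concatenate([C, P], axis=1)          # concatenated representations, 16 dimensions
corr = np.corrcoef(Z, rowvar=False)         # 16 x 16 correlation map, as in Figure 5

plt.imshow(corr, cmap="viridis", vmin=-1.0, vmax=1.0)
plt.colorbar()
plt.title("Correlation of common (0-7) and private (8-15) dimensions")
plt.savefig("correlation_map.png", dpi=150)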
More specifically, the correlation map includes four blocks, where the upper-right and lower-left blocks indicate the correlation between the common representations and the private representations. As shown in Figure 5, these correlations are small, so the common information learned by our method is clean, and the effectiveness of disentangled representation learning is verified.

D.4. Comparison with Hard Matching Loss and InfoNCE Loss

In the proposed DMG, we investigate the matching loss to align the common representations from different graphs with a common variable S. The matching loss in our framework can be regarded as a soft regularization of the common representations from different graphs through a bridge (i.e., the common variable S). Therefore, we consider replacing the soft matching with a hard regularization (Chen et al., 2020) by removing the common variable, formulated as

\sum_{n=1}^{N} \big(c^{(r)}_n - c^{(r')}_n\big)^2, \quad \mathrm{s.t.} \; \sum_{n=1}^{N} c_n c_n^{\top} = I, \qquad (31)

where r ≠ r'. Moreover, in the proposed DMG, we investigate the contrastive loss to preserve the complementarity and remove the noise in the private representations. We then consider replacing it with another widely used contrastive loss, i.e., the InfoNCE loss (Oord et al., 2018), formulated as

-\sum_{i=1}^{N} \log \frac{e^{\theta(u^{(r)}_i, v^{(r)}_i)}}{\frac{1}{N} \sum_{j=1}^{N} e^{\theta(u^{(r)}_i, v^{(r)}_j)}}, \qquad (32)

where (u^{(r)}_i, v^{(r)}_i) and (u^{(r)}_i, v^{(r)}_j) indicate positive and negative pairs, respectively. In our implementation, we replace the positive and negative pairs with the node pairs in E^{(r)}_c and E^{(r)}_n. We denote the modified models as DMG-HR and DMG-NCE, respectively, and report the classification performance of the modified models under identical settings with the original model DMG in Table 9. From Table 9, we have the following observations. First, compared to the variant model DMG-HR, our method improves by 21.9% on average on all multiplex graph datasets. The reason may be that a hard regularization aligns the common representations directly at the initial stage of training, when the common representations are not yet optimal. In addition, it is easier to enforce the orthogonality and zero-mean constraints on the common variable than on the common representations. Second, compared to the variant model DMG-NCE, our method also achieves superior performance on all multiplex graph datasets. The results empirically demonstrate that, although InfoNCE is a strict lower-bound estimator of the mutual information, the contrastive loss in our framework is more effective and yields better downstream performance.
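To make the two variants concrete, the following PyTorch sketch implements the InfoNCE term of Eq. (32) with a cosine-similarity critic θ and a temperature (both assumed design choices), together with the hard alignment term of Eq. (31) without its orthogonality constraint; the inputs u, v are hypothetical paired node representations, and this is not the authors' released implementation.

import torch
import torch.nn.functional as F

def info_nce(u, v, temperature=0.5):
    """InfoNCE as in Eq. (32): (u_i, v_i) are positives, (u_i, v_j) with j != i are negatives.
    Matches Eq. (32) up to the additive constant from the 1/N factor in the denominator."""
    u = F.normalize(u, dim=1)
    v = F.normalize(v, dim=1)
    logits = u @ v.t() / temperature                 # theta(u_i, v_j) for all pairs
    targets = torch.arange(u.size(0), device=u.device)
    return F.cross_entropy(logits, targets, reduction="sum")

def hard_matching(c_r, c_rp):
    """Hard regularization of Eq. (31): directly align the common representations of two graphs
    (the orthogonality / zero-mean constraints are omitted in this sketch)."""
    return ((c_r - c_rp) ** 2).sum()

# Hypothetical usage with random representations of the same nodes from two graphs.
u, v = torch.randn(256, 8), torch.randn(256, 8)
print(info_nce(u, v).item(), hard_matching(u, v).item())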
Figure 6. Classification performance of our method and all self-supervised MGRL methods (DMGI, DMGIattn, HDMI, HeCo, MCGC, CKD, and DMG) under different noisy-edge ratios η (from 0.1 to 0.9) on the ACM, IMDB, and Freebase datasets.

Figure 7. Clustering performance of our method and all self-supervised MGRL methods (DMGI, DMGIattn, HDMI, HeCo, MCGC, CKD, and DMG) under different noisy-edge ratios η (from 0.1 to 0.9) on the ACM, IMDB, and Freebase datasets.

E. Discussion on Potential Limitations and Future Directions

This section discusses the limitations of the proposed method and the related future directions.

E.1. Feature Multiplicity

Currently, our method mainly considers the private information in different graph structures, since the node features of different graphs are generated from a shared feature matrix. However, if the node features in different graphs are also multiplexed, i.e., the node features in each graph reveal different self-properties of the nodes, our method can still extract the complete and clean common information of different graphs, but it cannot further explicitly preserve the complementarity and remove the noise in the node features of different graphs. Therefore, future research may consider the multiplicity of node features, so as to capture the private information in node features and to further preserve the complementarity and remove the noise in the node features of different graphs.

E.2. Unattributed Graphs

Currently, our method mainly considers graph datasets with node features. We note that there exist cases where the nodes are unattributed and all information is contained in the graph topology, especially for some graph-level datasets. In fact, in our experiments, we have chosen an unattributed graph dataset (i.e., Freebase) to further verify the effectiveness and robustness of our method by simply assigning one-hot vectors as node features. We observe that our method obtains more significant improvements on the Freebase dataset than on the other multiplex graph datasets with node features. This may indicate that removing noise and preserving complementarity in graph structures is even more important for unattributed graph datasets. Therefore, future research may consider designing methods specifically for unattributed graph datasets to further improve their effectiveness and robustness.
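For reference, the one-hot initialization used above for Freebase amounts to an identity feature matrix; a minimal sketch, kept sparse to avoid materializing a dense N x N matrix:

import scipy.sparse as sp

num_nodes = 3492                                             # Freebase, as in Table 5
features = sp.eye(num_nodes, format="csr", dtype="float32")  # one one-hot feature vector per node
print(features.shape)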