# Scalable Multiplex Network Embedding

Hongming Zhang¹, Liwei Qiu², Lingling Yi², Yangqiu Song¹
¹Department of CSE, The Hong Kong University of Science and Technology
²Tencent Technology (SZ) Co., Ltd., China
hzhangal@cse.ust.hk, deolcaqiu@tencent.com, chrisyi@tencent.com, yqsong@cse.ust.hk

## Abstract

Network embedding has been proven to be helpful for many real-world problems. In this paper, we present a scalable multiplex network embedding model that represents information of multi-type relations in a unified embedding space. To combine the information of different types of relations while maintaining their distinctive properties, for each node we propose one high-dimensional common embedding and a lower-dimensional additional embedding for each type of relation. Multiple relations can then be learned jointly based on a unified network embedding model. We conduct experiments on two tasks, link prediction and node classification, using six different multiplex networks. On both tasks, our model achieves performance better than or comparable to current state-of-the-art models while using less memory.

## 1 Introduction

Network or graph embedding, which uses dense vectors to represent nodes, has been well studied for many years [Archdeacon, 1996; Chung, 1997; Goyal and Ferrara, 2017]. Recently, inspired by recurrent neural networks [Goodfellow et al., 2016], network embedding has been re-investigated and developed based on random walks over graphs and stochastic gradient descent optimization, which can be applied to very large graphs such as real social networks [Perozzi et al., 2014; Tang et al., 2015; Grover and Leskovec, 2016]. The embedding vectors of nodes encode some of the topological structure of the network and can then be used as features for downstream models. Network embedding has been proven helpful for network analysis tasks including link prediction, node classification, and community detection.

Multiplexity is a common property of networks, especially social networks. In social network analysis, multiplexity refers to multifaceted relationships between two people [Verbrugge, 1979]. Generalizing this idea to all kinds of networks, by multiplex network we mean a network that contains multiple kinds of relations, where each kind of relation creates one layer of the network. Take the social network as an example. In a social network such as Facebook, users often have different kinds of interactions with each other, such as friendship, forwarding articles to each other, conversation, and money transfer. Each of these interactions creates a layer of the network among all users; considered as a united whole, they form a huge multiplex network. Since each layer can only represent one kind of interaction among users, to better understand the whole multiplex network it is desirable to integrate the different types of information from these networks without sacrificing their distinctive properties.

Considering that a network can be huge, in this paper we present a scalable multiplex network embedding model to efficiently store and learn information of multi-type relations in a unified embedding space. For each node, we propose one high-dimensional common embedding vector, which is shared across all layers of the multiplex network and serves as a bridge among the different layers.
For each node, to learn its distinct properties on different layers with a small memory footprint, we also propose a low-dimensional additional vector for each type of relation. To align these embeddings of different dimensions, we introduce a global transformation matrix for each layer of the network. Following DeepWalk [Perozzi et al., 2014], we use stochastic gradient descent to optimize our model. Since the global transformation matrix is updated many more times than the node vectors, we add an additional regularization term to constrain the norm of the matrix.

Our contributions can be summarized as follows.

- We formally define the problem of multiplex network embedding to handle the multiplexity property of networks.
- We propose a scalable multiplex network embedding model to represent information from multi-type relations in a unified embedding space.
- We evaluate our embedding algorithm on two network analysis tasks: link prediction and node classification. On both tasks, our model achieves better or comparable performance compared to current state-of-the-art models with less memory use.

The source code of our model is available at https://github.com/HKUST-KnowComp/MNE.

## 2 Related Work

### 2.1 Multiplex Network Analysis

Traditionally in social science, multiplexity has been used to characterize the multiple facets of social exchange relationships among users [Verbrugge, 1979]. The idea of multiplexity can be generalized to all kinds of networks. In the data mining community, the term multi-relational network is sometimes used to describe the multi-type relations in social networks, and it has been verified that considering such relations can help data mining tasks such as community mining or link prediction [Cai et al., 2005; Chen et al., 2016]. In addition, in the bioinformatics community, it has been shown that by integrating multiple networks, node representations can be improved for gene functional analysis [Cho et al., 2016].

Existing graph mining methods on multiplex networks mainly target specific tasks. For link prediction, traditional methods [Newman, 2001; Adamic and Adar, 2003] can be applied to a multiplex network without considering the information of the typed edges. For community detection, a cross-layer centrality measurement [Bródka et al., 2012] has been proposed to capture the centrality of nodes in a multiplex network. Besides that, the multilayered local clustering coefficient (MLCC) and the cross-layer clustering coefficient (CLCC) [Kazienko et al., 2010; Lytras et al., 2010] have been proposed to describe the clustering coefficient of a node in a multiplex network. In this paper, we propose a general solution for multiplex network analysis from the perspective of network embedding.

### 2.2 Network Embedding

Network embedding (sometimes called graph embedding, historically also graph drawing) has been well studied for many years [Archdeacon, 1996; Chung, 1997; Goyal and Ferrara, 2017]. It is also related to manifold learning [Huo et al., 2007], although manifold learning is usually applied to dimensionality reduction for high-dimensional data. Network embedding focuses on generating vector representations of nodes in real networks or graphs to facilitate further network analysis.
Traditional approaches to network embedding usually involve time-consuming computational components, such as eigenvalue decomposition to analyze the spectral properties of graphs [Spielman, 2011]. However, on modern very large-scale online social networks, such approaches may be inefficient and sometimes infeasible for generating node representations. Recently, inspired by the development of recurrent neural networks [Goodfellow et al., 2016], and particularly by the efficient word embedding method word2vec [Mikolov et al., 2013b; Mikolov et al., 2013a], many network embedding approaches have been proposed to handle large-scale social networks. For example, DeepWalk [Perozzi et al., 2014] performs random walks on the network to generate sequences of nodes and then runs the skip-gram algorithm (a technique used in word2vec) over those sequences to obtain the embeddings. On top of DeepWalk, Node2Vec [Grover and Leskovec, 2016] adds two parameters to control the random walk process, making it a biased random walk. Some other embedding models focus on specific kinds of structures in the network. For example, LINE [Tang et al., 2015] uses the embedding to approximate the first-order and second-order proximities of the network. All the models above have been proven useful for single-network analysis, but they do not consider multiplexity. Recently, an embedding approach targeting the multiplexity property was proposed [Liu et al., 2017]; it provides three methods to learn one overall embedding from the multiplex network. Unlike their approach, we propose one high-dimensional common embedding and several lower-dimensional additional embeddings, one for each type of relation. A recent work [Li et al., 2018] calls itself multi-layer network embedding; essentially, it addresses a heterogeneous information network, since by the definition in their paper, different layers have different types of nodes.

## 3 Multiplex Network Embedding

### 3.1 Problem Definition

We first introduce our problem and notation. Given a multiplex network, our goal is to learn a representation for each node. Suppose $N$ is the set of nodes and there are $M$ different relation types, whose networks are denoted $G_1, \ldots, G_M$. For each relation-based network $G_i = (N_i, E_i)$, we have $E_i \subseteq N_i \times N_i$ and $N_i \subseteq N$. Each node $n \in N$ has one common embedding vector, shared across all relation types and denoted $b_n \in \mathbb{R}^d$. We use this common embedding as a bridge to transfer information across different relation types. To capture the distinct property of each sub-network, for each node we also propose a series of low-dimensional vectors $u_n^i \in \mathbb{R}^s$, one for each relation type $i$. Here, $d$ is the dimension of the common embedding and $s$ is the dimension of the typed relation vector. To avoid our model becoming too large, we set $s \ll d \ll |N|$, where $|N|$ is the number of all nodes. To align $b_n$ and $u_n^i$, we introduce a transformation matrix $X^i \in \mathbb{R}^{s \times d}$. We then use a new vector $v_n^i$ to represent the embedding of node $n$ for relation type $i$:

$$v_n^i = b_n + w^i \cdot {X^i}^T u_n^i, \qquad (1)$$

where the weight $w^i$ denotes the overall importance of relation type $i$.
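To make Eq. (1) concrete, here is a minimal sketch (not the authors' released implementation) of how the relation-specific embedding is composed from the three components. The function and variable names are illustrative; the dimensions $d = 200$ and $s = 10$ match the settings used later in the experiments.

```python
import numpy as np

def combine_embedding(b_n, u_n_i, X_i, w_i=1.0):
    """Eq. (1): v_n^i = b_n + w^i * (X^i)^T u_n^i.

    b_n   : (d,)   common embedding shared by all relation types
    u_n_i : (s,)   low-dimensional additional embedding for relation type i
    X_i   : (s, d) transformation matrix aligning the two spaces
    w_i   : scalar weight of relation type i
    """
    return b_n + w_i * (X_i.T @ u_n_i)

# Toy example with d = 200 and s = 10.
d, s = 200, 10
rng = np.random.default_rng(0)
v = combine_embedding(rng.normal(size=d), rng.normal(size=s), rng.normal(size=(s, d)))
print(v.shape)  # (200,)
```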
### 3.2 Optimization

For each relation type $i$ of the multiplex network, we conduct random walks to generate sequences of nodes and then run the skip-gram algorithm used in word2vec [Mikolov et al., 2013a] over the sequences to learn the embeddings. As shown in Eq. (1), a node's final embedding is composed of three parts: the common embedding $b_n$, the relation-based embedding $u_n^i$, and the transformation matrix $X^i$. We need to learn these parameters at the same time. For any node $n$ in a random walk path of relation type $i$, assuming it appears at position $j$, we take $n_{j-c}, \ldots, n_{j+c}$ as its neighbors, where $c$ is half of the window size. Thus, given a sequence of nodes, our objective is to minimize the following negative log-likelihood:

$$-\log P_{\theta^i}(n_{j-c}, \ldots, n_{j-1}, n_{j+1}, \ldots, n_{j+c} \mid n_j), \qquad (2)$$

which can be further factorized as $-\sum_{k=j-c}^{j-1} \log P_{\theta^i}(n_k \mid n_j) - \sum_{k=j+1}^{j+c} \log P_{\theta^i}(n_k \mid n_j)$ given the conditional independence assumption. Here we use $P_{\theta^i}$ to denote the parameterization for relation type $i$. For each $P_{\theta^i}(n_k \mid n_j)$, a softmax function is used to define the probability:

$$P_{\theta^i}(n_k \mid n_j) = \frac{\exp({v'_{n_k}}^T v_{n_j}^i)}{\sum_{n} \exp({v'_{n}}^T v_{n_j}^i)}, \qquad (3)$$

where $v_{n_j}^i$ is the input embedding of node $n_j$, and $v'_{n_k}$ and $v'_{n}$ are context vectors shared by all relation types, which are only used during the training process. To speed up training, following word2vec [Mikolov et al., 2013a], we use negative sampling to approximate the objective function as:

$$E = -\log \sigma({v'_{n_k}}^T v_{n_j}^i) - \sum_{n \in N_{n_j}^i} \log \sigma(-{v'_{n}}^T v_{n_j}^i), \qquad (4)$$

where $\sigma(x) = 1/(1 + \exp(-x))$ is the logistic function and $N_{n_j}^i$ is the randomly sampled negative context node set for $n_j$. Assuming the sequence is generated in relation type $i$, according to Eq. (1) we can replace $v_n^i$ with $b_n + {X^i}^T u_n^i$ in the new objective function. We then employ stochastic gradient descent (SGD) to minimize the objective function. We update the context parameter vectors, node embedding vectors, and transformation matrix as follows:

$$b_{n_j} := b_{n_j} - \eta \frac{\partial E}{\partial b_{n_j}} = b_{n_j} - \eta \, e(n_j, n_k) \, v'_{n_k}, \qquad (5)$$

$$v'_{n_k} := v'_{n_k} - \eta \frac{\partial E}{\partial v'_{n_k}} = v'_{n_k} - \eta \, e(n_j, n_k) \, v_{n_j}^i, \qquad (6)$$

$$u_{n_j}^i := u_{n_j}^i - \eta \frac{\partial E}{\partial u_{n_j}^i} = u_{n_j}^i - \eta \, e(n_j, n_k) \, X^i v'_{n_k}, \qquad (7)$$

$$X^i := X^i - \eta \frac{\partial E}{\partial X^i} = X^i - \eta \, e(n_j, n_k) \, u_{n_j}^i {v'_{n_k}}^T, \qquad (8)$$

where $e(n_j, n_k) = \sigma({v'_{n_k}}^T v_{n_j}^i) - t_k$, with $t_k = 1$ if $n_k$ is a neighbor of $n_j$ and $t_k = 0$ if $n_k$ is in the negative sampling set. We initialize all the $v^i$ with zero vectors and all the $X^i$ with zero matrices; the context parameter vectors $v'_n$ of all nodes are initialized randomly. Since each $X^i$ can be updated many times when optimizing the objective function, to avoid its norm becoming too large we add the following constraint:

$$\|X^i\| \le r, \qquad (9)$$

where $r$ is a controlling parameter for the Frobenius norm $\|X^i\|$ of the matrix. In our experiments, we also involve an extra combination layer; since this layer has no actual meaning, we only update the common embedding during training on that layer. We summarize our algorithm in Algorithm 1. As our model is a random walk based solution, assuming that we have $M$ relation types of sub-networks and $|N|$ nodes, the time complexity of our model is $O(M|N|)$ and the memory complexity is $O((d + s \cdot M)|N|)$.

**Algorithm 1: Multiplex Network Embedding**

INPUT: Multiplex network composed of sub-networks $G_1, \ldots, G_M$; common embedding dimension $d$; additional embedding dimension $s$; learning rate $\eta$; size of the context window $c$; number of negative samples $k$; transformation matrix regularization parameter $r$.

1: Initialize $b_n$, $v_n^i$, and $X^i$ for all nodes $n$ and relation types $i$.
2: for each relation network $G_i$ do
3: &nbsp;&nbsp;for each path $P$ generated by random walk do
4: &nbsp;&nbsp;&nbsp;&nbsp;for $n_j \in P$ do
5: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for $n_c \in P[j-c, j+c]$ do
6: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$N_{n_j}$ = NegativeSampling($N$, $k$)
7: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$v_{n_j}^i = b_{n_j} + {X^i}^T u_{n_j}^i$
8: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$N_{sample} = N_{n_j} \cup \{n_c\}$
9: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for $n_k \in N_{sample}$ do
10: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$e(n_j, n_k) = \sigma({v'_{n_k}}^T v_{n_j}^i) - t_k$
11: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Update $b_{n_j}$, $v'_{n_k}$, $u_{n_j}^i$, and $X^i$ based on Eqs. (5), (6), (7), and (8), respectively.
12: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;end for
13: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if $\|X^i\| > r$ then
14: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$X^i := r \, X^i / \|X^i\|$
15: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;end if
16: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;end for
17: &nbsp;&nbsp;&nbsp;&nbsp;end for
18: &nbsp;&nbsp;end for
19: end for

OUTPUT: Common embeddings $b_n$, additional embeddings $u_n^i$, and transformation matrices $X^i$ for all nodes $n$ and relation types $i$.
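The following is a minimal sketch, under our own naming, of one negative-sampling update corresponding to Eqs. (5)-(8) together with the norm projection of Eq. (9) (lines 10-15 of Algorithm 1). It assumes plain NumPy arrays; the default learning rate is only a placeholder, and this is not the authors' released implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(b_nj, u_nj_i, X_i, ctx_nk, t_k, eta=0.025, r=10.0):
    """One update of node n_j against a context or negative node n_k.

    b_nj   : (d,)   common embedding of n_j
    u_nj_i : (s,)   additional embedding of n_j for relation type i
    X_i    : (s, d) transformation matrix of relation type i
    ctx_nk : (d,)   context vector v'_{n_k} (shared across relation types)
    t_k    : 1 if n_k is a true neighbor, 0 if it is a negative sample
    """
    v_nj = b_nj + X_i.T @ u_nj_i                     # Eq. (1) with w^i = 1
    e = sigmoid(ctx_nk @ v_nj) - t_k                 # error term e(n_j, n_k)

    b_nj_new   = b_nj   - eta * e * ctx_nk           # Eq. (5)
    ctx_nk_new = ctx_nk - eta * e * v_nj             # Eq. (6)
    u_nj_i_new = u_nj_i - eta * e * (X_i @ ctx_nk)   # Eq. (7)
    X_i_new    = X_i    - eta * e * np.outer(u_nj_i, ctx_nk)  # Eq. (8)

    norm = np.linalg.norm(X_i_new)                   # Frobenius norm
    if norm > r:                                     # Eq. (9): project back onto ||X^i|| <= r
        X_i_new *= r / norm
    return b_nj_new, ctx_nk_new, u_nj_i_new, X_i_new
```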
## 4 Experiments

### 4.1 Datasets

We use open multiplex networks from the Manlio De Domenico project¹ as our experimental datasets. We select four multiplex networks of different network types. The selection criteria are that they cannot be too small and they cannot be too sparse, meaning that they should have at least one edge per node per relation type. The details of these datasets are as follows:

- Vickers [Vickers and Chan, 1981]: Data collected from 29 seventh-grade students in a school in Victoria, Australia, who were each asked three questions; each question creates one relation type of the network.
- CKM [Coleman et al., 1957]: Data collected from physicians in four Illinois towns, including Bloomington, Quincy, and Galesburg. The physicians were asked three questions, and each question creates one relation type of the network.
- LAZEGA [Lazega, 2001]: This multiplex social network consists of three kinds of relationships (co-work, friendship, and advice) between partners and associates of a corporate partnership.
- C.ELEGANS [Chen et al., 2006]: This multiplex network consists of layers corresponding to synaptic junctions among a group of neurons: electric ("ElectrJ"), chemical monadic ("MonoSyn"), and polyadic ("PolySyn").

¹ http://deim.urv.cat/~manlio.dedomenico/data.php

Since the sizes of the above multiplex networks are relatively small, to test the scalability and performance of our model we also conduct experiments on two real large-scale online social media networks: Twitter and Private. The details of these large-scale social networks are as follows:

- Twitter: The Twitter data used in our experiments is the higgs-twitter dataset². This dataset was used to analyze the spreading of news [De Domenico et al., 2013]; it crawled all tweets about the Higgs boson discovery over one week. We select the largest two sub-networks (following and retweet) to build the multiplex network.
- Private: Our Private data is based on an online social network where users can have friendship relations with each other and can send articles to their friends. From all of our users, we randomly chose 40,000 users who annotated themselves as affiliated with a university, and we recorded all article-forwarding activities within that group over one month. After applying topic modeling to the content of the articles, we grouped all articles by topic. Thus, we obtained 16 topics, each of which creates one relation type of the multiplex network. Adding the base friendship relation, we have 17 relation types in total.

² https://snap.stanford.edu/data/higgs-twitter.html

The statistics of these datasets are listed in Table 1.

| Dataset (Network Type) | #layers | #nodes | #edges |
| --- | --- | --- | --- |
| Vickers (Social) | 3 | 29 | 740 |
| CKM (Social) | 3 | 246 | 1,551 |
| LAZEGA (Social) | 3 | 71 | 2,223 |
| C.ELEGANS (Neuronal) | 3 | 279 | 5,863 |
| Twitter (Online Social) | 2 | 456,626 | 15,183,974 |
| Private (Online Social) | 17 | 40,000 | 1,380,470 |

Table 1: Statistics of datasets.
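For illustration, here is a small sketch of how Table-1-style statistics and the sparsity criterion above ("at least one edge per node per relation type") could be checked for a multiplex network given as per-layer edge lists; the input format, the reading of the criterion, and the function name are our own assumptions.

```python
def multiplex_stats(layers):
    """layers: dict mapping relation-type name -> list of (u, v) edges."""
    nodes = {n for edges in layers.values() for edge in edges for n in edge}
    n_edges = sum(len(edges) for edges in layers.values())
    print(f"#layers={len(layers)}  #nodes={len(nodes)}  #edges={n_edges}")
    for name, edges in layers.items():
        layer_nodes = {n for edge in edges for n in edge}
        # One reading of the selection criterion: each layer should have
        # at least as many edges as it has nodes.
        if len(edges) < len(layer_nodes):
            print(f"layer '{name}' may be too sparse")

multiplex_stats({"friendship": [(1, 2), (2, 3), (1, 3)], "retweet": [(1, 3)]})
```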
### 4.2 Baseline Methods

We first compare our model with the following state-of-the-art embedding-based baseline methods.

- DeepWalk: DeepWalk [Perozzi et al., 2014] first applies random walks on the network, treats each path as a sentence, and then uses the skip-gram algorithm to train the embeddings.
- LINE: LINE [Tang et al., 2015] adds a direct link fitting term to the DeepWalk cost function, and adds second-hop friends into one-hop friends to incorporate higher-order relations.
- Node2Vec: Node2Vec [Grover and Leskovec, 2016] adds a pair of parameters to control the random walk process, making it better for certain types of nodes, such as hubs or tail users.
- Principled Multilayer Network Embedding: PMNE [Liu et al., 2017] proposed three different models to merge a multiplex network together and generate one overall embedding for each node; we compare our model with all three. We denote their network aggregation, results aggregation, and co-analysis models as PMNE (n), PMNE (r), and PMNE (c), respectively.

Besides the above embedding methods, we also compare our embedding model with state-of-the-art network structure-based methods for the link prediction task.

- Common Neighbor (CN): The CN metric is one of the most widespread measurements used in link prediction tasks due to its simplicity [Newman, 2001]. For each pair of nodes, the more common neighbors they have, the more likely an edge exists between them.
- Jaccard Coefficient (JC): For a pair of nodes, JC normalizes the number of common neighbors by the number of their total neighbors.
- Adamic/Adar (AA): AA [Adamic and Adar, 2003] is similar to JC, but unlike JC, AA gives more weight to the nodes with fewer neighbors; compared with other structure-based methods, it achieved state-of-the-art performance on a series of networks [Liben-Nowell and Kleinberg, 2007].

### 4.3 Experimental Settings

**Evaluation Metrics.** For link prediction, following the commonly used evaluation criteria in similar tasks, we use ROC-AUC [Hanley and McNeil, 1982] as the evaluation criterion in our experiments. A higher value represents better performance, and an ideal model that ranks all positive samples above negative samples achieves an AUC value of 1. For node classification, we train ℓ2-regularized logistic regression classifiers on the learned embeddings and evaluate all embedding models by prediction accuracy.

**Model Parameters.** To be fair, following their original papers, for all the embedding-based methods we set the embedding dimension, and the dimension of the common embedding in our model, to 200. For all random walk based methods, we set the window width to ten and select five negative samples. As LINE has two kinds of embeddings (the so-called first-order and second-order proximities), we set the dimension of both embeddings to 100 and concatenate them. For Node2Vec, we empirically use the best hyper-parameters for training, which are p = 2 and q = 0.5. For the three PMNE models, we use the hyper-parameters given in their original paper. For our algorithm, multiplex network embedding (MNE), we simply set the dimension of the additional vectors to 10.

### 4.4 Link Prediction

As shown in Table 2, we evaluate the AUC values of different models on all datasets in a five-fold cross-validation setting. Following other embedding methods, for each pair of nodes we calculate the cosine similarity of their embeddings: the larger the similarity, the more likely a link exists between them.
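As a minimal sketch of this scoring and evaluation procedure, the snippet below assumes the learned embeddings for one relation type are stored in a dict and that scikit-learn is available; the helper names are illustrative, not from the released code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def link_prediction_auc(emb, pos_pairs, neg_pairs):
    """emb: dict node -> embedding vector for one relation type.
    pos_pairs / neg_pairs: lists of (u, v) held-out edges / sampled non-edges."""
    scores = [cosine(emb[u], emb[v]) for u, v in pos_pairs + neg_pairs]
    labels = [1] * len(pos_pairs) + [0] * len(neg_pairs)
    return roc_auc_score(labels, scores)
```

The per-relation-type AUC values are then averaged, as described next.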
As the multiplex network has more than one relation type, we compute the AUC for each relation type first and take the average over all relation types as the final result. For the models designed for single-layer networks, we train a separate embedding for each relation type of the network and use it to predict links on the corresponding relation type, which means they have no information from the other relation types of the network. One thing worth mentioning is that, in the two large online social networks, the following/friendship network is the base network for the other relation types; for example, in our Private data, one can only send articles to one's friends. To make the evaluation more similar to our real task, for the AUC evaluation on these two datasets we randomly select, as negative examples, the same number of edges from the base network that do not appear in the given relation type, and we do not evaluate the base network itself.

| Model | Vickers | CKM | LAZEGA | C.ELEGANS | Twitter | Private |
| --- | --- | --- | --- | --- | --- | --- |
| DeepWalk | 0.821 (0.030) | 0.781 (0.008) | 0.780 (0.007) | 0.821 (0.006) | 0.502 (0.002) | 0.621 (0.007) |
| LINE | 0.676 (0.011) | 0.637 (0.012) | 0.695 (0.006) | 0.732 (0.006) | 0.519 (0.003) | 0.512 (0.010) |
| Node2Vec | 0.821 (0.030) | 0.781 (0.008) | 0.780 (0.007) | 0.820 (0.006) | 0.504 (0.003) | 0.644 (0.010) |
| PMNE (n) | 0.810 (0.032) | 0.917 (0.008) | 0.792 (0.009) | 0.843 (0.003) | 0.446 (0.004) | 0.629 (0.006) |
| PMNE (r) | 0.844 (0.025) | 0.904 (0.008) | 0.813 (0.007) | 0.835 (0.007) | 0.446 (0.001) | 0.659 (0.005) |
| PMNE (c) | 0.837 (0.029) | 0.847 (0.016) | 0.797 (0.011) | 0.824 (0.009) | 0.449 (0.002) | 0.506 (0.004) |
| Common Neighbor (CN) | 0.799 (0.011) | 0.877 (0.006) | 0.809 (0.007) | 0.869 (0.002) | 0.592 (0.002) | 0.691 (0.002) |
| Jaccard Coefficient (JC) | 0.778 (0.007) | 0.873 (0.006) | 0.826 (0.007) | 0.833 (0.001) | 0.520 (0.002) | 0.573 (0.004) |
| Adamic/Adar (AA) | 0.803 (0.019) | 0.875 (0.013) | 0.814 (0.008) | 0.881 (0.001) | 0.592 (0.002) | 0.691 (0.003) |
| MNE | 0.871 (0.014) | 0.900 (0.010) | 0.839 (0.013) | 0.910 (0.006) | 0.622 (0.003) | 0.723 (0.002) |

Table 2: Link prediction based on similarities between two nodes. All numbers are averaged AUC scores over five-fold cross-validation; standard deviations are reported in parentheses.

Based on the experimental results, we have the following observations: (1) For nearly all the networks, jointly considering the different relation types of the network is helpful. The models that take the whole network into consideration outperform all the models designed for a single network (DeepWalk, LINE, Node2Vec). This is consistent with our assumption that the information from any single network is not enough and that information in different relation types can complement each other. (2) The performance of the baseline methods varies a lot depending on the topological structure of different networks. For example, if the network is dense, like LAZEGA or C.ELEGANS, the performance of traditional simple methods is good enough, sometimes even better than the embedding-based approaches. However, when the network is relatively sparse, like Vickers or CKM, embedding-based methods perform better. (3) The main difference between our model and PMNE is that PMNE generates one overall embedding for a node, while our model generates a set of embeddings to capture the distinct information about each relation type.
The experimental results show that such information can be important, especially on our Private dataset, where the difference among relation types of the network is more obvious. To summarize, on six multiplex networks our model stably outperforms or achieves comparable performance with all the baseline methods. Our understanding is that the common embedding in our model merges information from different relation types, while the additional vectors keep the distinct property of each relation type of the network.

### 4.5 Parameter Sensitivity

In our algorithm, there are three parameters: the dimension of the low-dimensional additional vectors $s$, the weight of the additional vectors $w^i$, and the transformation matrix norm limitation $r$. To clearly show the influence of these parameters, we repeat our link prediction experiment with different parameter settings. For the small multiplex datasets, we take the average over them to see the trend.

As shown in Figure 1(a), the performance of our model improves when we increase the dimension of the additional vector. When the dimension reaches 10, the performance peaks, which shows that with the help of the common embedding we can use a very low-dimensional vector (about one-tenth of the base dimension) to represent the property of each relation type of the multiplex network. As shown in Figure 1(b), when we increase the weight of the additional vectors, the performance of our model rises; after the weight reaches 1, the performance starts to drop slightly. This is because in the embedding training phase we fixed the weights at one, and tuning these weights with cross-validation does not seem to further improve link prediction. As shown in Figure 1(c), if we set the regularization parameter of the transformation matrix too small, we may restrict the learning process of our model. When the regularization parameter is larger than 10, the performance becomes stable and the influence of the restriction is not significant.

Figure 1: Performance of our model under different experimental settings: (a) additional vector dimension $s$, (b) weight of additional vectors $w^i$, (c) matrix norm limitation $r$. For the experiment on $s$, we set $w^i = 1$ and $r = 1000$; for the experiment on $w^i$, we set $s = 10$ and $r = 1000$; for the experiment on $r$, we set $s = 10$ and $w^i = 1$.

### 4.6 Node Classification

In this section, we conduct an experiment on the node classification task to evaluate the quality of the embeddings produced by our model. As the CKM social network is the only multiplex network that provides ground-truth node labels, we use it as the experimental dataset. Considering the small size of the dataset, we conduct a 2-fold cross-validation experiment on this task. The dataset labels all the researchers with their original companies; we use this feature as labels to train the classifier and evaluate our embeddings. For DeepWalk, LINE, and Node2Vec, as they are designed for a single network, we first train a separate embedding for each relation type and average them as the final embedding for each node. For our model, we set the weight $w^i$ of all relation types to 1. The result is shown in Figure 2: our model achieves the best performance among all the embedding models, which shows that through a simple averaging operation our model can produce a high-quality overall representation of nodes.

Figure 2: Node classification result.
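A minimal sketch of the node-classification protocol just described, assuming a feature matrix built by averaging the per-relation-type embeddings and using scikit-learn's ℓ2-regularized logistic regression with 2-fold cross-validation; the function name and defaults are our own, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def node_classification_accuracy(layer_embeddings, y):
    """layer_embeddings: list of (n_nodes, dim) arrays, one per relation type.
    y: (n_nodes,) ground-truth node labels."""
    X = np.mean(np.stack(layer_embeddings), axis=0)        # simple average over relation types
    clf = LogisticRegression(penalty="l2", max_iter=1000)  # l2-regularized logistic regression
    scores = cross_val_score(clf, X, y, cv=2, scoring="accuracy")  # 2-fold cross-validation
    return scores.mean()
```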
### 4.7 Scalability of Our Model

In this section, we investigate the scalability of our model on our large-scale online social network. In real life, a social network can be huge: for example, there are about 1 billion active users on our private social network, and more than 500 million relational edges are created every day. If we learn embeddings for all of the sub-networks, the training and storage of these models can be a big challenge. Hence we propose the structure of a small additional vector plus a transformation matrix, aiming to reduce the overall model size without sacrificing overall performance. As shown in Figure 3, the memory used by our model is almost linear in the network size. Compared with Node2Vec and LINE, our model occupies about one-tenth of the memory space without sacrificing performance. The reason is straightforward: as shown in Section 3, the memory complexity of our model is $O((d + s \cdot M)|N|)$, while the memory complexity of DeepWalk and LINE is $O(d \cdot M \cdot |N|)$, where $s$ is designed to be much smaller than $d$. For PMNE (n) and PMNE (c), as they merge the multiplex network into one single network, their models are the smallest, but they lose the distinct information about relational edges, which was shown to be important in the previous experiments.

Figure 3: Scalability of memory use.

## 5 Conclusion and Future Work

In this paper, we present a new scalable embedding model for multiplex networks. We tested our model on two tasks: link prediction and node classification. The experimental results show that our model can effectively capture the overall representation of nodes without sacrificing the distinct property of each single relation type. We also conducted scalability experiments on our social network to show that when the network is large, we only need one-tenth of the previous model size to achieve better performance. In the future, we will explore the idea of multiplex networks for tasks in other areas.

## Acknowledgments

This paper was supported by the HKUST-WeChat WHAT Lab, the China 973 Fundamental R&D Program (No. 2014CB340304), and the Early Career Scheme (ECS, No. 26206717) from the Research Grants Council in Hong Kong.

## References

[Adamic and Adar, 2003] Lada A. Adamic and Eytan Adar. Friends and neighbors on the web. Social Networks, 25(3):211–230, 2003.

[Archdeacon, 1996] Dan Archdeacon. Topological graph theory: a survey. Congressus Numerantium, 115:5–54, 1996.

[Bródka et al., 2012] Piotr Bródka, Krzysztof Skibicki, Przemyslaw Kazienko, and Katarzyna Musial. A degree centrality in multi-layered social network. CoRR, abs/1210.5184, 2012.

[Cai et al., 2005] Deng Cai, Zheng Shao, Xiaofei He, Xifeng Yan, and Jiawei Han. Community mining from multi-relational networks. In PKDD, pages 445–452, 2005.

[Chen et al., 2006] Beth L. Chen, David H. Hall, and Dmitri B. Chklovskii. Wiring optimization can relate neuronal structure and function. Proceedings of the National Academy of Sciences of the United States of America, 103(12):4723–4728, 2006.

[Chen et al., 2016] Chen Chen, Hanghang Tong, Lei Xie, Lei Ying, and Qing He. FASCINATE: Fast cross-layer dependency inference on multi-layered networks. In KDD, pages 765–774, 2016.

[Cho et al., 2016] Hyunghoon Cho, Bonnie Berger, and Jian Peng. Compact integration of multi-network topology for functional analysis of genes. Cell Systems, 3(6):540–548.e5, December 2016.

[Chung, 1997] Fan R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
[Coleman et al., 1957] James Coleman, Elihu Katz, and Herbert Menzel. The diffusion of an innovation among physicians. Sociometry, 20(4):253–270, 1957.

[De Domenico et al., 2013] Manlio De Domenico, Antonio Lima, Paul Mougel, and Mirco Musolesi. The anatomy of a scientific rumor. Scientific Reports, 3:2980, 2013.

[Goodfellow et al., 2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[Goyal and Ferrara, 2017] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. CoRR, abs/1705.02801, 2017.

[Grover and Leskovec, 2016] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD, pages 855–864, 2016.

[Hanley and McNeil, 1982] James A. Hanley and Barbara J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982.

[Huo et al., 2007] Xiaoming Huo, Xuelei (Sherry) Ni, and Andrew K. Smith. A survey of manifold-based learning methods. In T. W. Liao and E. Triantaphyllou, editors, Recent Advances in Data Mining of Enterprise Data. World Scientific, 2007.

[Kazienko et al., 2010] Przemyslaw Kazienko, Piotr Bródka, and Katarzyna Musial. Individual neighbourhood exploration in complex multi-layered social network. In IEEE/WIC/ACM Proceedings, pages 5–8, 2010.

[Lazega, 2001] Emmanuel Lazega. The Collegial Phenomenon: The Social Mechanisms of Cooperation Among Peers in a Corporate Law Partnership. Oxford University Press on Demand, 2001.

[Li et al., 2018] Jundong Li, Chen Chen, Hanghang Tong, and Huan Liu. Multi-layered network embedding. In SDM, 2018.

[Liben-Nowell and Kleinberg, 2007] David Liben-Nowell and Jon M. Kleinberg. The link-prediction problem for social networks. JASIST, 58(7):1019–1031, 2007.

[Liu et al., 2017] Weiyi Liu, Pin-Yu Chen, Sailung Yeung, Toyotaro Suzumura, and Lingli Chen. Principled multilayer network embedding. In ICDM Workshops, pages 134–141, 2017.

[Lytras et al., 2010] Miltiadis D. Lytras, Patricia Ordóñez de Pablos, Adrian Ziderman, Alan Roulstone, Hermann A. Maurer, and Jonathan B. Imber, editors. WSKS Proceedings, volume 111 of Communications in Computer and Information Science. Springer, 2010.

[Mikolov et al., 2013a] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.

[Mikolov et al., 2013b] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751, 2013.

[Newman, 2001] Mark E. J. Newman. Clustering and preferential attachment in growing networks. Physical Review E, 64(2):025102, 2001.

[Perozzi et al., 2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In KDD, pages 701–710, 2014.

[Spielman, 2011] Daniel Spielman. Spectral graph theory. In Combinatorial Scientific Computing, chapter 16. Chapman and Hall/CRC Press, 2011.

[Tang et al., 2015] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In WWW, pages 1067–1077, 2015.

[Verbrugge, 1979] Lois M. Verbrugge. Multiplexity in adult friendships. Social Forces, 57(4):1286–1309, 1979.

[Vickers and Chan, 1981] M. Vickers and S. Chan. Representing Classroom Social Structure. Victoria Institute of Secondary Education, Melbourne, 1981.