# Large-Scale Heterogeneous Feature Embedding

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Xiao Huang, Qingquan Song, Fan Yang, Xia Hu
Department of Computer Science and Engineering, Texas A&M University
{xhuang, song_3134, nacoyang, xiahu}@tamu.edu

## Abstract

Feature embedding aims to learn a low-dimensional vector representation for each instance to preserve the information in its features. These representations can benefit various off-the-shelf learning algorithms. While embedding models for a single type of features have been well studied, real-world instances often contain multiple types of correlated features or even information within a different modality such as networks. Existing studies such as multiview learning show that it is promising to learn unified vector representations from all sources. However, the high computational costs of incorporating heterogeneous information limit the applications of existing algorithms, since the number of instances and the dimensions of features in practice are often large. To bridge the gap, we propose a scalable framework, FeatWalk, which can model and incorporate instance similarities in terms of different types of features into a unified embedding representation. To enable scalability, FeatWalk does not directly calculate any similarity measure; instead, it provides an alternative way to simulate the similarity-based random walks among instances, extracting the local instance proximity and preserving it in a set of instance index sequences. These sequences are homogeneous with each other, and a scalable word embedding algorithm is applied to them to learn a joint embedding representation of instances. Experiments on four real-world datasets demonstrate the efficiency and effectiveness of FeatWalk.

## Introduction

Feature embedding (Zhang et al. 2015b), also known as representation learning (Bengio, Courville, and Vincent 2013) or dimensionality reduction (Hinton and Salakhutdinov 2006), aims to learn low-dimensional vectors for all instances, such that instances with similar original features tend to have similar vector representations. By reducing the redundancy and extracting meaningful information, the learned representations can be efficiently utilized to enhance the performance of various real-world applications such as syntactic analysis (Chen, Zhang, and Zhang 2014), human action recognition (Guo et al. 2017), and person re-identification (Wu et al. 2018).

Beyond a single data source, real-world systems are often imbued with multiple types of correlated instance features (Guo 2013) or even data of a distinct modality such as networks (Yang et al. 2015) and images (Srivastava and Salakhutdinov 2012). Learning a representation jointly from all types of features is potentially helpful to make it more informative, since the sources often complement each other. For example, Facebook users have multiple types of features such as attributes in their introductions, words in posts, contents in photos, and rich friend relationships (Tang et al. 2015). These features are all highly correlated with each other (Smith, Fischer, and Yongjian 2012): the contents posted by users reflect their status described in the introductions; in turn, the social status has a significant impact on users' words (Hogg and Terry 2000); and friends tend to share posts with similar topics (McPherson, Smith-Lovin, and Cook 2001).
In addition, recent studies on multiview learning (Ding and Fu 2014; Gong et al. 2014) and attributed network embedding (Huang, Li, and Hu 2017a; Liu, Huang, and Hu 2017) have demonstrated the benefits of joint learning. Therefore, it is promising to perform feature embedding based on instance features collected from multiple aspects.

However, as it becomes increasingly easier and cheaper to collect data, existing heterogeneous feature embedding algorithms (Zhang et al. 2015c) face many challenges when applied to real-world systems. Three major ones are summarized as follows. First, the ever-growing data volume along with the complex data properties puts demands on the scalability of algorithms. For example, there were 2.23 billion monthly active Facebook users in the second quarter of 2018 (www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/), and each user could have thousands of posts and friends. The capacities of existing algorithms such as coupled matrix factorization (Yang et al. 2015) and canonical correlation analysis (Foster, Kakade, and Zhang 2008; Yuan, Sun, and Ge 2014) are highly restricted under large-scale settings. Second, real-world instance features often come from heterogeneous sources or even lie in a different modality such as networks (Yang et al. 2015). The high computational costs of fusing heterogeneous information limit the applications of existing algorithms. A widely used approach is to calculate the instance proximity, i.e., the similarities between instances, based on each source, and to conduct joint learning based on these homogeneous instance proximities (Huang, Li, and Hu 2017b). But the computation of instance proximities leads to high time and space complexity, not to mention the subsequent operations. Another effective way is to employ deep neural networks (Ngiam et al. 2011), which are not scalable either. Third, the data sparsity problem is significant (Zhang et al. 2015b). For example, in social media, a large proportion of words or attributes often only contribute to a small number of users, providing insufficient information to accomplish effective embedding.

To bridge the gap, we study the problem of large-scale heterogeneous feature embedding. We focus on the two most common types of data in practice, i.e., feature matrices and the topological structure (Yang et al. 2015). We aim to jointly embed multiple feature matrices and an instance relation network. This task can be summarized as two research questions. (1) How to effectively utilize heterogeneous instance features and a relation network to learn a unified embedding representation? (2) How to cope with the large-scale issue while maintaining the effectiveness of the joint embedding framework? Guided by these questions, we propose a scalable heterogeneous feature embedding framework named FeatWalk. Our major contributions are listed as follows.

- We formally define the problem of large-scale heterogeneous feature embedding.
- We propose an effective framework, FeatWalk, to incorporate multiple types of high-dimensional instance features into a joint embedding representation.
- We design an efficient algorithm that avoids computing any similarity measure, and instead provides an alternative way to simulate the similarity-based random walks among instances to model the local instance proximity.
- We empirically validate the efficiency and effectiveness of FeatWalk on four datasets from real-world systems.
## Problem Statement

**Notations:** We use boldface lowercase letters to denote vectors (e.g., $\mathbf{h}$) and boldface uppercase letters to denote matrices (e.g., $\mathbf{H}$). For a matrix $\mathbf{H}$, its transpose is represented as $\mathbf{H}^\top$ and its $i$th row is denoted as $\mathbf{h}_i$. We use $\{\mathbf{X}^{(i)}\}$ and $\{x_i\}$ to represent a sequence of matrices $\mathbf{X}^{(i)}$ and scalars $x_i$. The main symbols are listed in Table 1.

Table 1: Main symbols and their definitions in the paper.

| Notation | Definition |
|---|---|
| $N$ | total number of instances |
| $\mathbf{X}^{(i)} \in \mathbb{R}_{+}^{N \times M^{(i)}}$ | the $i$th instance feature matrix |
| $M^{(i)}$ | number of feature categories in $\mathbf{X}^{(i)}$, $i = 1, \ldots, I$ |
| $\mathbf{G} \in \mathbb{R}_{+}^{N \times N}$ | a weighted adjacency matrix, $\mathbf{G} = \mathbf{X}^{(I)}$ |
| $\mathbf{S}^{(i)} \in \mathbb{R}^{N \times N}$ | instance similarity matrix based on $\mathbf{X}^{(i)}$ |
| $M$, $\mathbf{X}$, $\mathbf{S}$ | $M = M^{(1)}$, $\mathbf{X} = \mathbf{X}^{(1)}$, $\mathbf{S} = \mathbf{S}^{(1)}$ |
| $d$ | dimension of the embedding representations |
| $\mathbf{H} \in \mathbb{R}^{N \times d}$ | final low-dimensional representation |
| $\mathcal{Q}^{(i)}$ | sequences of indices learned from $\mathbf{X}^{(i)}$ |
| $W^{(i)}$ | total number of sequences in $\mathcal{Q}^{(i)}$ |

Let $\{\mathbf{X}^{(i)}\}$, for $i = 1, \ldots, I$, be a set of correlated feature matrices of $N$ instances from $I$ different views. Let the last one, $\mathbf{X}^{(I)} = \mathbf{G}$, be a weighted adjacency matrix that describes the relations among the $N$ instances. To have physical meanings, we assume that the elements of all matrices in $\{\mathbf{X}^{(i)}\}$ are non-negative. A specific example would be the products on Amazon. They have descriptions from multiple sources such as product information and customer reviews, which complement each other and could be used to construct $\{\mathbf{X}^{(i)}\}$. Customer purchase records (Linden, Smith, and Zada 2005) are also available and could be used to build $\mathbf{G}$. Given these assumptions, we formally define the large-scale heterogeneous feature embedding problem as follows. Given a large number of instances, associated with a set of instance feature matrices $\{\mathbf{X}^{(i)}\}$ and an instance relation network $\mathbf{G}$, we aim to learn a low-dimensional representation $\mathbf{h}_i$ for each instance $i$, such that all the meaningful information in $\{\mathbf{X}^{(i)}\}$ and $\mathbf{G}$ is well preserved in $\mathbf{H}$. The amount of meaningful information learned in $\mathbf{H}$ can be evaluated based on the performance of $\mathbf{H}$ in real-world applications such as classification and clustering.

Heterogeneous feature embedding can be roughly separated into two categories, i.e., multiview and multimodal feature embedding. The former aims to learn a unified representation of instances from multiple feature matrices observed from different aspects (Li et al. 2015). The latter focuses on multiple sources with distinct modalities such as networks, images, and audio (Ngiam et al. 2011). Different from these previous studies, in addition to the multiple feature matrices, our work takes the instance relation network into consideration, which is a common and crucial type of information in real-world systems; the topological structure has a different modality than feature matrices. We also aim to develop a scalable framework to make the approach practical. Our setting is distinct from attributed network embedding (Huang et al. 2018), which focuses on embedding a single network and a single feature matrix.
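To make the problem inputs concrete, the following is a minimal sketch of how a set of non-negative feature matrices and a relation network might be assembled with scipy/scikit-learn. The toy data, variable names, and the use of `CountVectorizer` are illustrative assumptions, not part of the original formulation.

```python
# Minimal sketch of the problem inputs {X^(i)} and G (illustrative data only).
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# N instances described by two correlated text sources (e.g., product
# information and customer reviews) and a relation network (e.g., co-purchases).
descriptions = ["fast ssd drive", "wireless mouse", "mechanical keyboard", "usb ssd enclosure"]
reviews      = ["very fast drive", "mouse works well", "clicky keyboard", "enclosure fits the ssd"]
edges        = [(0, 3), (1, 2)]                       # hypothetical purchase-record links
N = len(descriptions)

X1 = CountVectorizer().fit_transform(descriptions)    # X^(1): N x M^(1), non-negative
X2 = CountVectorizer().fit_transform(reviews)         # X^(2): N x M^(2), non-negative

# G = X^(I): symmetric weighted adjacency matrix of the relation network.
rows, cols = zip(*edges)
G = sp.coo_matrix((np.ones(len(edges)), (rows, cols)), shape=(N, N))
G = (G + G.T).tocsr()

print(X1.shape, X2.shape, G.shape)                    # the heterogeneous inputs
```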
## Large-scale Feature Embedding

To jointly embed the heterogeneous information in $\{\mathbf{X}^{(i)}\}$, we propose an efficient framework named FeatWalk. It achieves scalability by avoiding the computation of instance similarities, and instead provides an alternative way to simulate similarity-based random walks among instances. Figure 1 illustrates its main idea. Although $\{\mathbf{X}^{(i)}\}$ are heterogeneous, the proximities between instances defined by the rows of each $\mathbf{X}^{(i)}$ are homogeneous. FeatWalk first learns the instance proximities defined by all $\{\mathbf{X}^{(i)}\}$ via Feature Walks, and then jointly incorporates them into a unified embedding representation $\mathbf{H}$.

To model the instance proximity in $\mathbf{X}$, an intuitive solution is to construct a new graph $\mathbf{S}$ with instance similarities as edge weights, and then perform random walks on $\mathbf{S}$ to learn a set of sequences $\mathcal{Q}^{(1)}$. However, $\mathbf{S}$ is often dense because of the common feature categories. As $N$ keeps increasing, it would become too expensive to manipulate. We propose a distributed algorithm named Feature Walks, which obtains the same results as the intuitive solution but avoids the computation of $\mathbf{S}$. The learned $\mathcal{Q}^{(1)}$ consists of instance indices that record the walking trajectories, such as [1, 6, 4, 2, 5], and preserves the information in $\mathbf{X}$. If $\mathbf{X}^{(i)}$ is a network, e.g., $\mathbf{G}$, we conduct random walks on this network directly to learn the sequences $\mathcal{Q}^{(i)}$. Finally, all $\{\mathcal{Q}^{(i)}\}$ are homogeneous. By considering the instance indices as words and the sequences as sentences, a scalable word embedding technique is applied to $\{\mathcal{Q}^{(i)}\}$ to learn a joint representation $\mathbf{H}$.

Figure 1: To avoid the computation of similarity matrices, FeatWalk performs equivalent random walks through features.

### Tackling the Heterogeneity Among Features

$\{\mathbf{X}^{(i)}\}$ are collected from different aspects. They are not only mutually dependent on and complementary to each other (Smith, Fischer, and Yongjian 2012; Guo et al. 2017), but also heterogeneous with each other. To incorporate the heterogeneous information, a commonly used approach (Zhang et al. 2015b) is to calculate the instance proximity based on each source respectively, and learn $\mathbf{H}$ from all instance proximities jointly. It can effectively tackle the heterogeneity since instance proximities are homogeneous.

Definition 1 (Instance Proximity). Instance proximity refers to the similarities between instances defined by the features of instances, i.e., the rows of each $\mathbf{X}^{(i)}$.

However, the computation and manipulation of similarity matrices grow quadratically as $N$ increases. Thus, existing algorithms such as coupled spectral embedding (Huang, Li, and Hu 2017b) and similarity-based deep models (Huang, Loy, and Tang 2016; Wu et al. 2018) cannot be directly applied under large-scale settings. FeatWalk follows this effective approach and copes with the scalability issue by sampling local instance proximity in a distributed manner. We illustrate its basic idea with the first type of features $\mathbf{X}$ and the last one $\mathbf{G}$. It is straightforward to extend to scenarios with multiple $\{\mathbf{X}^{(i)}\}$.

**Intuitive Solution.** As shown in Figure 1, to model the instance proximity defined by $\mathbf{X}$, an intuitive solution is to compute its similarity matrix to construct a new graph $\mathbf{S}$, and perform truncated random walks on $\mathbf{S}$ to learn and preserve the instance proximity in a set of sequences $\mathcal{Q}^{(1)}$. Details are introduced as follows. The weight of the edge between instances $i$ and $j$ in $\mathbf{S}$ is defined as the similarity between their instance features. The probability of walking from instance $i$ to $j$ is determined by the edge weight, i.e.,

$$P(i \to j) = \frac{s_{ij}}{\sum_{n=1}^{N} s_{in}}. \qquad (1)$$

Each walk has the same length $L$. Each sequence in $\mathcal{Q}^{(1)}$ consists of instance indices that record the walking trajectories. It captures the local instance proximity because, as the number of learned sequences keeps increasing, the probability that index $j$ follows index $i$ in $\mathcal{Q}^{(1)}$ approaches $P(i \to j)$. Thus, learning an embedding representation based on the index co-occurrence probabilities is equivalent to learning one based on the instance linking probabilities.
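The following is a minimal sketch of this intuitive, non-scalable baseline, written for illustration. It assumes cosine similarity between feature rows as the similarity measure (the paper leaves the measure generic), and all function and variable names are hypothetical.

```python
# Minimal sketch of the intuitive solution: build the dense similarity graph S
# and run truncated random walks on it (O(N^2) memory, O(N) per walk step).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def similarity_walks(X, walk_length=5, walks_per_node=10, seed=0):
    rng = np.random.default_rng(seed)
    S = cosine_similarity(X)                              # N x N dense similarity graph
    np.fill_diagonal(S, 0.0)                              # no self-loops
    P = S / np.maximum(S.sum(axis=1, keepdims=True), 1e-12)  # Eq. (1): row-normalized transitions
    N = P.shape[0]
    sequences = []
    for i in range(N):
        for _ in range(walks_per_node):
            walk, cur = [i], i
            for _ in range(walk_length - 1):
                if P[cur].sum() == 0:                     # isolated instance, stop the walk
                    break
                cur = rng.choice(N, p=P[cur])             # O(N) sampling on a dense row
                walk.append(int(cur))
            sequences.append(walk)
    return sequences                                      # Q^(1): instance-index sequences

# Example: sequences = similarity_walks(X)  # X: an N x M non-negative feature matrix
```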
However, this intuitive solution has several problems. First, as $N$ increases, the size of $\mathbf{S}$ grows quadratically. The calculation, storage, and manipulation of $\mathbf{S}$ would become expensive. Second, $\mathbf{S}$ is often quite dense, which makes the random walks on $\mathbf{S}$ inefficient. For example, when creating a feature matrix for Twitter users based on their tweets, commonly used words such as "good" and "think" would make $\mathbf{S}$ close to a clique. Then the time complexity of sampling a neighbor from $\mathbf{s}_i$ would become $O(N)$ (Devroye 1986), which is expensive. Therefore, we propose a distributed algorithm named Feature Walks, which solves these problems by avoiding the computation of $\mathbf{S}$.

### Feature Walks

Since the computation of and operations on $\mathbf{S}$ are expensive, we design an alternative way to simulate the similarity-based random walks on $\mathbf{S}$, with details as follows.

I. We normalize the feature matrix $\mathbf{X}$. We use the $\ell_2$ norm to normalize each row of $\mathbf{X}$ and get $\tilde{\mathbf{X}}$. Since it is hard for random walks to simulate probabilities with small values, we remove the small elements in $\tilde{\mathbf{X}}$, i.e., elements smaller than $\beta\,\mathrm{Mean}(\tilde{\mathbf{X}})$, and get $\hat{\mathbf{X}}$, where $\mathrm{Mean}(\tilde{\mathbf{X}})$ denotes the mean value of all elements in $\tilde{\mathbf{X}}$ and $\beta$ is a threshold value. We then use the $\ell_1$ norm to normalize each row of $\hat{\mathbf{X}}$ and get a new matrix $\mathbf{Y}$, i.e.,

$$y_{im} = \frac{\hat{x}_{im}}{\sum_{p=1}^{M} \hat{x}_{ip}}. \qquad (2)$$

II. Given an initial instance $i$, we randomly select a feature category based on the normalized features of instance $i$, i.e., $\hat{\mathbf{x}}_i$. Let $a_m$ denote the $m$th feature category; then the probability of selecting $a_m$ is defined as

$$P(i \to a_m) = y_{im}. \qquad (3)$$

III. Given that $a_m$ is selected, we randomly select an instance based on the $m$th column of $\mathbf{Y}$. The probability of selecting instance $j$ is defined as

$$P(a_m \to j) = \frac{y_{jm}}{\sum_{n=1}^{N} y_{nm}}. \qquad (4)$$

In such a way, we accomplish the walk from instance $i$ to $j$.

IV. The length of each walk is set as $L$. To make sure the local proximities of all instances can be sampled, we select each instance as the initial index in turn. However, the number of random walks using instance $i$ as the initial index, i.e., $w_i$, is not fixed. We assign it based on the complexity of the corresponding instance, defined as

$$w_i = \frac{\mathrm{nnz}(\mathbf{x}_i)\, W^{(1)}}{\sum_{n=1}^{N} \mathrm{nnz}(\mathbf{x}_n)}, \qquad (5)$$

where $\mathrm{nnz}(\cdot)$ denotes the number of non-zero elements, and $W^{(1)}$ is the total number of walks assigned for modeling $\mathbf{X}$.

We design the function in (5) based on two assumptions. First, instances with more features tend to have more edges, and they require more random walks to simulate their linking probabilities. For example, if, in $\mathbf{S}$, instances 4 and 6 have four edges and one edge respectively, we need to walk from instance 4 at least four times and from instance 6 at least once to capture their relationships with all neighbors. Second, in a network, instances with more edges tend to be more important (Narayanan, Belkin, and Niyogi 2006). To make the local proximity of important instances well preserved, we sample more sequences for them. Thus, we assign the numbers of walks $\{w_i\}$ based on the complexity.
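The following is a minimal sketch of steps I-IV, under the assumption that X is a scipy CSR matrix. It uses plain categorical sampling for clarity (the complexity analysis below relies on alias sampling instead), and the variable and function names are illustrative rather than the authors' implementation.

```python
# Minimal sketch of Feature Walks (steps I-IV); X is assumed to be a scipy CSR matrix.
import numpy as np
from sklearn.preprocessing import normalize

def feature_walks(X, total_walks, walk_length=5, beta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Step I: l2-normalize rows, drop entries below beta * Mean, l1-normalize rows -> Y (Eq. 2).
    Xn = normalize(X, norm="l2", axis=1)
    Xn.data[Xn.data < beta * Xn.data.mean()] = 0.0
    Xn.eliminate_zeros()
    Y = normalize(Xn, norm="l1", axis=1).tocsr()
    Yc = Y.tocsc()                                   # column access for Eq. (4)
    N = Y.shape[0]
    # Step IV: walks per starting instance, proportional to nnz(x_i) (Eq. 5);
    # at least one walk per instance in this sketch.
    nnz = np.diff(X.tocsr().indptr)
    walks_per_node = np.maximum(1, np.round(nnz * total_walks / nnz.sum())).astype(int)

    sequences = []
    for i in range(N):
        for _ in range(walks_per_node[i]):
            walk, cur = [i], i
            for _ in range(walk_length - 1):
                row = Y.getrow(cur)
                if row.nnz == 0:
                    break
                # Step II: pick a feature category a_m with probability y_{cur,m} (Eq. 3).
                m = rng.choice(row.indices, p=row.data)
                # Step III: pick the next instance j with probability y_{jm} / sum_n y_{nm} (Eq. 4).
                col = Yc.getcol(m)
                cur = rng.choice(col.indices, p=col.data / col.data.sum())
                walk.append(int(cur))
            sequences.append([str(v) for v in walk])
    return sequences

# The returned index sequences are homogeneous with those from random walks on G
# and can be fed, as "sentences", to a scalable word-embedding model such as word2vec.
```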
### Theoretical Analysis of Feature Walks

We now prove that the output of Feature Walks is equivalent to the output of random walks on $\mathbf{S}$. Let $\mathbf{D}$ be a diagonal matrix with the reciprocal of the sum of each column of $\mathbf{Y}$ on the diagonal.

Theorem 1. The probability of walking from instance $i$ to $j$ via Feature Walks is equal to the one via random walks on the similarity graph $\mathbf{S}$, defined as

$$\mathbf{S} = \mathbf{Y}\mathbf{D}\mathbf{Y}^\top. \qquad (6)$$

Proof. In Feature Walks, the process of walking from instance $i$ to feature category $a_m$ is independent of the process of walking from $a_m$ to $j$. Thus, the probability of walking from $i$ to $j$ is

$$P(i \to j) = \sum_{m=1}^{M} P(i \to a_m) P(a_m \to j) = \sum_{m=1}^{M} \frac{y_{im} y_{jm}}{\sum_{n=1}^{N} y_{nm}}. \qquad (7)$$

It should be noted that $P(i \to j) = P(j \to i)$. On the other hand, for the random walks on $\mathbf{S}$, we have

$$s_{ij} = \big([y_{i1}, \ldots, y_{iM}] \odot [d_{11}, \ldots, d_{MM}]\big)\, \mathbf{y}_j^\top = \sum_{m=1}^{M} \frac{y_{im} y_{jm}}{\sum_{n=1}^{N} y_{nm}}, \qquad (8)$$

where $\odot$ denotes the element-wise multiplication and $d_{mm} = 1/\sum_{n=1}^{N} y_{nm}$. Thus, $P(i \to j)$ in (7) equals the probability of walking from $i$ to $j$ in the random walks on $\mathbf{S}$, since $\sum_{n=1}^{N} s_{in} = 1$.

### Instance Proximity in Network

Given the relation network $\mathbf{G}$, we model its instance proximity by conducting random walks on it directly, because instances with stronger relationships tend to be more similar. The homophily hypothesis (McPherson, Smith-Lovin, and Cook 2001) and social influence (Zhang et al. 2015a) have demonstrated that instances with similar features tend to have similar network structures, and the latter also has a significant impact on the former. Similar to Feature Walks, the length of each walk is set as $L$, and the total number of walks assigned to model $\mathbf{G}$ is set as $W^{(I)}$. The number of walks using $i$ as the initial index, i.e., $\hat{w}_i$, is defined as

$$\hat{w}_i = \frac{\mathrm{nnz}(\mathbf{g}_i)\, W^{(I)}}{\sum_{n=1}^{N} \mathrm{nnz}(\mathbf{g}_n)}. \qquad (9)$$

### Joint Learning with Multiple Feature Matrices

All the sequences in all the sets $\{\mathcal{Q}^{(i)}\}$ are homogeneous with each other. Thus, we can put them together to jointly perform the instance proximity learning. Let $W$ be the total number of sequences that we can sample. To balance the contributions of $\mathbf{X}$ and $\mathbf{G}$, $W^{(1)}$ and $W^{(I)}$ are defined as

$$W^{(1)} = \alpha W \quad \text{and} \quad W^{(I)} = (1 - \alpha) W. \qquad (10)$$

By applying a scalable word embedding method (Mikolov et al. 2013), we can learn a joint embedding representation $\mathbf{H}$ from $\{\mathcal{Q}^{(i)}\}$. Word embedding (Pennington, Socher, and Manning 2014) aims to map each word in a set of sentences into a low-dimensional vector, so that words with similar semantic meanings have similar vector representations. Since massive amounts of documents are available in practice, many scalable word embedding algorithms such as word2vec (Mikolov et al. 2013) and GloVe (Pennington, Socher, and Manning 2014) have been proposed. We take advantage of these efficient algorithms to learn joint low-dimensional representations of instances from $\{\mathcal{Q}^{(i)}\}$, by considering the instances in $\{\mathcal{Q}^{(i)}\}$ as words in sentences.

It is straightforward to extend FeatWalk to multiple instance feature matrices $\{\mathbf{X}^{(i)}\}$. We can perform Feature Walks on each $\mathbf{X}^{(i)}$ and learn multiple sets of homogeneous sentences. A joint $\mathbf{H}$ is then learned from these sentences.

### Complexity Analysis

Let $N_X^{(i)}$ be the number of nonzero entries in $\mathbf{X}^{(i)}$. Let $T$ denote the number of operations required to obtain $\mathbf{H}$ based on the $W$ learned sentences. For sampling from a discrete probability distribution, we use the alias method (Devroye 1986). The setup of the alias method takes $O(N_X^{(i)})$ time and each sampling takes $O(1)$ time. The time complexity of modeling $\mathbf{X}^{(i)}$ is therefore $O(N_X^{(i)} + W^{(i)} L)$. Thus, the time complexity of generating all sequences is linear in the number of nonzero entries in $\{\mathbf{X}^{(i)}\}$. The total time complexity of FeatWalk is $O(\sum_i N_X^{(i)} + WL + T)$. It should be noted that the processes of learning any two sequences are independent of each other, so we can run them in parallel to further accelerate FeatWalk.
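The alias method referenced in the complexity analysis allows $O(1)$ draws after a linear-time setup. Below is a compact sketch of the standard (Vose) construction, written for illustration rather than taken from the authors' implementation; in the Feature Walks sketch above it would replace the categorical `rng.choice` calls, one pre-built sampler per row and per column of $\mathbf{Y}$.

```python
# Compact sketch of the alias method (Vose's construction): O(K) setup, O(1) sampling.
import numpy as np

class AliasSampler:
    def __init__(self, probs):
        probs = np.asarray(probs, dtype=float)
        K = len(probs)
        scaled = probs * K / probs.sum()          # rescale so the average bucket height is 1
        self.prob = np.zeros(K)
        self.alias = np.zeros(K, dtype=int)
        small = [k for k in range(K) if scaled[k] < 1.0]
        large = [k for k in range(K) if scaled[k] >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            self.prob[s] = scaled[s]              # fill bucket s, top it up with mass from l
            self.alias[s] = l
            scaled[l] -= 1.0 - scaled[s]
            (small if scaled[l] < 1.0 else large).append(l)
        for k in large + small:                   # leftovers caused by floating-point rounding
            self.prob[k] = 1.0

    def sample(self, rng):
        k = rng.integers(len(self.prob))          # pick a bucket uniformly
        return k if rng.random() < self.prob[k] else self.alias[k]

# Example: sampler = AliasSampler(row_probs); j = sampler.sample(np.random.default_rng())
```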
## Experiments

We now empirically validate the efficiency and effectiveness of FeatWalk. There are three major questions we aim to answer. (1) How efficient is FeatWalk in performing heterogeneous feature embedding compared with the state-of-the-art embedding methods? (2) How effective are the representations learned by FeatWalk compared with those of other embedding methods in applications such as classification? (3) What are the impacts of the parameters $\alpha$, the window size, $L$, the number of sentences per instance $W/N$, and $d$ on FeatWalk?

Figure 2: Running time of FeatWalk and different heterogeneous feature embedding methods on (a) Flickr, (b) ACM, and (c) Yelp, and (d) relative running time as a function of the number of workers.

### Four Real-world Datasets

The four real-world datasets that we employed in the experiments are all publicly available. The first dataset contains two feature matrices. Each of the last three datasets contains one feature matrix and one network.

- **Reuters** (Amini, Usunier, and Goutte 2009): 18,758 documents from Reuters are used as instances. They are originally written in English. We employ their Italian translations as $\mathbf{X}$, with $M$ = 11,452, and the Spanish translations as $\mathbf{X}^{(2)}$, with $M^{(2)}$ = 9,243, via the bag-of-words model. To make the embedding task more challenging, the top 10% most frequent features in the original data have been removed. Each document is from one of the six most populous classes.
- **Flickr** (Huang, Li, and Hu 2017b): 7,564 Flickr users are employed as instances. They share photos online and attach many related tags to their photos. These tags reflect the interests of instances, and we use them as $\mathbf{X}$, with $M$ = 12,047. Instances also follow each other and naturally form a network $\mathbf{G}$, with 239,365 undirected edges in total. The nine groups that instances have joined are employed as their labels.
- **ACM** (Tang et al. 2008): 48,579 papers published in ACM venues are utilized as instances. We employ paper abstracts as $\mathbf{X}$, with $M$ = 10,000, and construct an undirected $\mathbf{G}$ based on the citation links, with 288,374 edges in total. All instances are from nine areas such as Artificial Intelligence (AAAI, IJCAI, etc.), Data Mining (KDD, ICDM, etc.), and Machine Learning (ICML, COLT, etc.), which serve as the labels.
- **Yelp** (Yelp 2017): 249,012 Yelp users are used as instances. They have written reviews for different businesses, and we use the reviews as $\mathbf{X}$, with $M$ = 20,000. Their friend relationships are employed to construct $\mathbf{G}$, with 1,779,803 edges in total. All businesses are categorized into eleven classes such as Nightlife and Services. The categories of the businesses that an instance has reviewed are set as his/her labels.

### Baseline Methods

To study the performance of FeatWalk, we compare it with three categories of baselines. First, to investigate the impact of each type of features, we include three single feature embedding methods, i.e., NMF, Spectral, and FeatWalkX. Second, to study the impact of the networks, we include two network embedding methods, i.e., DeepWalk and LINE. Third, to analyze the efficiency and effectiveness of FeatWalk, we include three state-of-the-art heterogeneous feature embedding methods, i.e., LCMF, MultiSpec, and AANE, and a variation of FeatWalk named w/o FW. No deep models are included since they are not scalable (Tu et al. 2018).
- **NMF** (Pedregosa et al. 2011): A scalable version of non-negative matrix factorization, optimized by hierarchical alternating least squares. It reduces the dimension of $\mathbf{X}$ (or $\mathbf{X}^{(2)}$) to learn $\mathbf{H}$.
- **Spectral** (von Luxburg 2007): It calculates the cosine similarities between the vectors $\mathbf{x}_i$ (or $\mathbf{x}_i^{(2)}$) to construct a new graph, and applies spectral embedding on it to learn $\mathbf{H}$.
- **FeatWalkX**: It learns the embedding representation $\mathbf{H}$ only from the feature matrix $\mathbf{X}$ (or $\mathbf{X}^{(2)}$) via FeatWalk.
- **DeepWalk** (Perozzi, Al-Rfou, and Skiena 2014): It performs random walks on $\mathbf{G}$ and learns sequences of instance indices; word2vec is applied to them to learn $\mathbf{H}$. It is a variant of FeatWalk that only uses $\mathbf{G}$.
- **LINE** (Tang et al. 2015): It learns $\mathbf{H}$ from $\mathbf{G}$ by jointly modeling its first- and second-order instance proximity.
- **LCMF** (Zhu et al. 2007): It conducts classical coupled matrix factorization between $\mathbf{X}$ and $\mathbf{G}$ to learn $\mathbf{H}$ jointly.
- **MultiSpec** (Kumar, Rai, and Daume 2011): It computes all instance similarity matrices and jointly models them via coupled spectral embedding.
- **AANE** (Huang, Li, and Hu 2017a): The state-of-the-art embedding method for networks with node attributes.
- **w/o FW**: FeatWalk without using Feature Walks. Instead, it calculates $\mathbf{S}$ directly and performs random walks on $\mathbf{S}$.

### Experimental Settings

To analyze the effectiveness of FeatWalk and the baselines, we apply their learned embedding representations $\mathbf{H}$ to perform classification, following the commonly used way of validating feature embedding methods (Xia et al. 2010; Zhang et al. 2015b). The classification task aims to classify a new instance into one or multiple categories, based on its embedding representation and the trained classifier. The performance is measured by two standard metrics, i.e., micro-average and macro-average (Huang, Li, and Hu 2017b).

Table 2: Classification performance of FeatWalk and all baselines in terms of micro-average, for different training set percentages on Flickr, ACM, Yelp-sub, and Yelp.

| Method | Flickr 25% | Flickr 50% | Flickr 100% | ACM 25% | ACM 50% | ACM 100% | Yelp-sub 25% | Yelp-sub 50% | Yelp-sub 100% | Yelp 10% | Yelp 25% | Yelp 50% | Yelp 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # Instances | 3,026 | 4,538 | 7,564 | 19,432 | 29,147 | 48,579 | 19,921 | 29,881 | 49,802 | 69,723 | 99,605 | 149,407 | 249,012 |
| NMF | 0.629 | 0.718 | 0.773 | 0.653 | 0.660 | 0.664 | 0.680 | 0.686 | 0.688 | 0.678 | 0.692 | 0.694 | 0.689 |
| Spectral | 0.771 | 0.813 | 0.846 | 0.688 | 0.700 | N.A. | 0.683 | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. |
| FeatWalkX | 0.803 | 0.841 | 0.868 | 0.676 | 0.675 | 0.667 | 0.701 | 0.710 | 0.714 | 0.706 | 0.691 | 0.703 | 0.699 |
| DeepWalk | 0.373 | 0.465 | 0.535 | 0.576 | 0.630 | 0.684 | 0.310 | 0.318 | 0.350 | 0.324 | 0.345 | 0.366 | 0.368 |
| LINE | 0.332 | 0.421 | 0.516 | 0.549 | 0.624 | 0.693 | 0.243 | 0.264 | 0.294 | 0.295 | 0.313 | 0.336 | 0.354 |
| LCMF | 0.676 | 0.725 | 0.749 | 0.690 | 0.706 | N.A. | 0.680 | 0.686 | N.A. | N.A. | N.A. | N.A. | N.A. |
| MultiSpec | 0.720 | 0.800 | 0.859 | 0.709 | 0.719 | N.A. | 0.667 | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. |
| AANE | 0.811 | 0.854 | 0.885 | 0.701 | 0.715 | 0.722 | 0.694 | 0.703 | 0.711 | 0.698 | 0.709 | 0.711 | 0.714 |
| FeatWalk | 0.831 | 0.865 | 0.893 | 0.722 | 0.738 | 0.751 | 0.700 | 0.710 | 0.717 | 0.708 | 0.691 | 0.704 | 0.701 |

We apply 5-fold cross-validation on all datasets, i.e., we randomly select 4/5 of all instances as the training group and the remaining as the test group. If the dataset has a network, the edges between the training and test groups are kept. We concatenate the feature matrices of the two groups to create new instance feature matrices. To evaluate an embedding method, we apply it to the new instance feature matrices to learn $\mathbf{H}$, which contains vector representations for instances in the training group, $\mathbf{H}_{\mathrm{train}}$, and instances in the test group, $\mathbf{H}_{\mathrm{test}}$. To perform classification, we build an SVM (Pedregosa et al. 2011) classifier based on $\mathbf{H}_{\mathrm{train}}$ and the corresponding labels. We then apply the learned classifier to predict the labels of instances in the test group based on $\mathbf{H}_{\mathrm{test}}$.
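As an illustration of this protocol, the sketch below evaluates a learned embedding with a linear SVM and micro-/macro-averaged F1, assuming the single-label case for brevity; `H_train`, `H_test`, and the label arrays are assumed to be produced by the embedding and splitting steps described above.

```python
# Minimal sketch of the evaluation step: SVM on training-group embeddings,
# micro-/macro-average on the test group.
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def evaluate_embedding(H_train, y_train, H_test, y_test):
    clf = LinearSVC().fit(H_train, y_train)          # train the SVM classifier
    y_pred = clf.predict(H_test)                     # predict labels of test-group instances
    return (f1_score(y_test, y_pred, average="micro"),
            f1_score(y_test, y_pred, average="macro"))

# In the 5-fold setting, this evaluation is repeated over the folds and test runs,
# and the arithmetic average of the scores is reported.
```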
We use the original papers' default settings to determine the parameters of the baselines. FeatWalkX and w/o FW use the same parameters as FeatWalk. Unless otherwise specified, $d$ is set as 100 and 100% of the instances in the training group are used. We performed ten test runs and used the arithmetic average as the final experimental result. We ran the experiments on a Dell OptiPlex 9030 i7-16GB desktop.

### Efficiency of FeatWalk

To investigate the first question proposed at the beginning of this section, we compare FeatWalk with the three state-of-the-art heterogeneous feature embedding methods and w/o FW. The running time of all methods as a function of the number of instances $N$ on Flickr, ACM, and Yelp is shown in Figure 2. The result on Reuters is similar, so we omit it.

From the results in Figure 2, we have three major observations. First, the running time of FeatWalk is almost linear in $N$, which demonstrates its scalability. For example, in Figure 2c, the slope of the green curve (FeatWalk) remains invariant as $N$ increases. As $N$ keeps increasing, LCMF, MultiSpec, and w/o FW run out of time since their running time grows much faster than linearly. Second, the distributed sampling algorithm Feature Walks has significantly accelerated FeatWalk and made it scalable. When $N$ is small, as shown in Figure 2a, w/o FW has almost the same running time as FeatWalk, since the manipulations of the similarity matrix $\mathbf{S}$ are cheap at this scale. However, as $N$ keeps increasing, as shown in Figure 2b, the running time of w/o FW grows rapidly until it runs out of time. FeatWalk has significantly less running time than w/o FW on both ACM and Yelp. Third, FeatWalk always has the least running time when $N$ is large. When $N$ is small, AANE might have less running time than FeatWalk: FeatWalk has almost the same running time as AANE on Flickr, and is always faster than AANE on ACM. On Yelp, FeatWalk needs more running time than AANE when $N < 82{,}000$, but it becomes faster than AANE when $N \geq 82{,}000$ since it is almost linear in $N$. It should be noted that AANE is designed for embedding a single feature matrix with a single network, while FeatWalk generalizes to multiple ones; e.g., FeatWalk can be applied to Reuters, while AANE cannot.

The running time of FeatWalk can be further reduced by using the multi-thread implementation. Figure 2d shows the relative running time of FeatWalk as a function of the number of workers $c$ on all datasets. From the results, we have three observations. First, FeatWalk becomes more efficient as $c$ increases. Second, as $c$ increases from 1 to 2, the running time decreases by less than 50%. This is because the multi-thread implementation only accelerates the learning of the sequences $\{\mathcal{Q}^{(i)}\}$, not the word embedding process. Third, the running time of FeatWalk on Flickr increases when $c \geq 5$. This is because $N$ = 7,564 is relatively small and it takes extra time for the communications between the coordinator and the workers. It should be noted that word2vec can be replaced by other, more efficient algorithms to make FeatWalk faster.

Table 3: Classification performance on Reuters in terms of micro-average, for different numbers of instances; LCMF and AANE cannot be applied to Reuters.

| Method | Features | 7,503 (25%) | 11,255 (50%) | 18,758 (100%) |
|---|---|---|---|---|
| NMF | X(1) | 0.658 | 0.684 | 0.695 |
| NMF | X(2) | 0.673 | 0.689 | 0.695 |
| Spectral | X(1) | 0.679 | 0.694 | 0.710 |
| Spectral | X(2) | 0.689 | 0.704 | 0.714 |
| FeatWalkX | X(1) | 0.698 | 0.727 | 0.744 |
| FeatWalkX | X(2) | 0.700 | 0.728 | 0.745 |
| MultiSpec | X(1) + X(2) | 0.707 | 0.727 | 0.740 |
| FeatWalk | X(1) + X(2) | 0.720 | 0.748 | 0.771 |
### Effectiveness of FeatWalk

To answer the second proposed question, we compare the classification performance of all methods on the four datasets, which is listed in Tables 2 and 3. Since Spectral, LCMF, and MultiSpec are not capable of embedding the dataset Yelp, we randomly sample 20% of it to construct a new dataset named Yelp-sub. DeepWalk, LINE, LCMF, and AANE cannot be applied to the dataset Reuters since it has no network.

From the results in Tables 2 and 3, we have three major findings. First, FeatWalkX performs better than the single feature embedding methods, i.e., NMF and Spectral. By taking advantage of the extra information ($\mathbf{G}$ or $\mathbf{X}^{(2)}$), FeatWalk further improves the performance. For example, on Reuters, FeatWalk achieves an improvement of 3.6% over FeatWalkX and 10.9% over NMF. Second, by incorporating the heterogeneous information, FeatWalk outperforms both network embedding methods, i.e., DeepWalk and LINE. For example, FeatWalk achieves an improvement of 9.8% over DeepWalk on ACM. Third, on all datasets except Yelp, FeatWalk achieves better performance than all heterogeneous feature embedding methods, i.e., LCMF, MultiSpec, and AANE. For example, on Flickr, FeatWalk achieves a gain of 0.9% over AANE. This demonstrates that the way FeatWalk incorporates the heterogeneous information is effective.

To study the performance of all methods w.r.t. different training set percentages, i.e., the percentage of instances in the training group (among the four folds in the cross-validation) that have been used, we vary it as {25%, 50%, 100%}. The results on the four datasets are shown in Tables 2 and 3. From the results, we observe that the aforementioned findings hold consistently as the training set percentage increases. FeatWalk consistently outperforms all baselines on all datasets except Yelp. For example, on Reuters, when the training set percentage is 50%, FeatWalk achieves improvements of 6.3% over Spectral and 2.9% over MultiSpec. On Yelp, the performance of FeatWalk decreases when the training set percentage increases from 10% to 25%. This is because the sequence sets reach the memory limit and an online version of word2vec is used to learn $\mathbf{H}$.

### Parameter Analysis

We now study the third proposed question. The performance of FeatWalk on Flickr as a function of the first instance feature matrix weight $\alpha$ and the window size, as a function of $L$ and $W/N$, and as a function of $d$ is shown in Figures 3a, 3b, and 3c. Results on the other datasets are similar, so we omit them.

Figure 3: The impacts of the parameters on FeatWalk: (a) $\alpha$ and window size, (b) walk length $L$ and number of sentences per instance $W/N$, (c) embedding representation dimension $d$.

First, we vary $\alpha$ from 0 to 1 and the window size from 1 to 10. When $\alpha = 0$, only $\mathbf{G}$ is used to learn $\mathbf{H}$; when $\alpha = 1$, only $\mathbf{X}$ is used. The window size denotes the maximum distance that is used to define context words in word2vec. From the results in Figure 3a, we observe that FeatWalk achieves the best performance when $\alpha = 0.62$, i.e., when the contributions of $\mathbf{X}$ and $\mathbf{G}$ are balanced. When $\alpha$ is fixed, the performance of FeatWalk remains stable as the window size increases from 1 to 10. Second, we vary the walk length $L$ from 2 to 60 and the number of sentences per instance $W/N$ from 2 to 20.
From the results in Figure 3b, we find that the performance of FeatWalk keeps increasing as the product of $L$ and $W/N$ increases, and remains stable once the product is sufficiently large. Third, we vary $d$ as {20, 60, 100, 140, 180}. From the results in Figure 3c, we observe that, as $d$ increases from 20 to 180, the performance of FeatWalk remains stable and is always better than that of all baselines.

## Conclusion and Future Work

We investigate the problem of heterogeneous feature embedding and propose a scalable framework named FeatWalk. It encodes multiple instance feature matrices and even network relations into unified instance vector representations. Without calculating any similarity measure among instances, we design an alternative way to simulate the similarity-based random walks among instances, which samples the local instance similarities and preserves them in walking trajectories. Along with the trajectories learned via random walks on the network relations, we apply a scalable word embedding algorithm to learn the joint representations of instances from these trajectories. Experiments on four real-world datasets validate the scalability and effectiveness of FeatWalk. Our future work is to explore semi-supervised frameworks to incorporate label information and dynamic algorithms to cope with streaming environments.

## Acknowledgments

This work is, in part, supported by DARPA (#N66001-17-2-4031) and NSF (#IIS-1750074 and #IIS-1718840). The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

## References

Amini, M.; Usunier, N.; and Goutte, C. 2009. Learning from multiple partially observed views - an application to multilingual text categorization. In NIPS, 28-36.

Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. TPAMI 35(8):1798-1828.

Chen, W.; Zhang, Y.; and Zhang, M. 2014. Feature embedding for dependency parsing. In COLING, 816-826.

Devroye, L. 1986. Sample-based non-uniform random variate generation. In Winter Simulation Conference, 260-265.

Ding, Z., and Fu, Y. 2014. Low-rank common subspace for multiview learning. In ICDM, 110-119.

Foster, D. P.; Kakade, S. M.; and Zhang, T. 2008. Multi-view dimensionality reduction via canonical correlation analysis. Toyota Technical Institute-Chicago.

Gong, Y.; Ke, Q.; Isard, M.; and Lazebnik, S. 2014. A multi-view embedding space for modeling internet images, tags, and their semantics. International Journal of Computer Vision 106(2):210-233.

Guo, Y.; Tao, D.; Liu, W.; and Cheng, J. 2017. Multiview Cauchy estimator feature embedding for depth and inertial sensor-based human action recognition. IEEE Trans. on Systems, Man, and Cybernetics 47(4):617-627.

Guo, Y. 2013. Convex subspace representation learning from multi-view data. In AAAI, 387-393.

Hinton, G. E., and Salakhutdinov, R. R. 2006. Reducing the dimensionality of data with neural networks. Science 313(5786):504-507.

Hogg, M. A., and Terry, D. I. 2000. Social identity and self-categorization processes in organizational contexts. Academy of Management Review 25(1):121-140.

Huang, X.; Song, Q.; Li, J.; and Hu, X. 2018. Exploring expert cognition for attributed network embedding. In WSDM, 270-278.

Huang, X.; Li, J.; and Hu, X. 2017a. Accelerated attributed network embedding. In SDM, 633-641.

Huang, X.; Li, J.; and Hu, X. 2017b. Label informed attributed network embedding. In WSDM, 731-739.
Huang, C.; Loy, C. C.; and Tang, X. 2016. Local similarity-aware deep feature embedding. In NIPS, 1262-1270.

Kumar, A.; Rai, P.; and Daume, H. 2011. Co-regularized multiview spectral clustering. In NIPS, 1413-1421.

Li, Y.; Nie, F.; Huang, H.; and Huang, J. 2015. Large-scale multiview spectral clustering via bipartite graph. In AAAI, 2750-2756.

Linden, G. D.; Smith, B. R.; and Zada, N. K. 2005. Use of product viewing histories of users to identify related products. Google Patents.

Liu, N.; Huang, X.; and Hu, X. 2017. Accelerated local anomaly detection via resolving attributed networks. In IJCAI, 2337-2343.

McPherson, M.; Smith-Lovin, L.; and Cook, J. M. 2001. Birds of a feather: Homophily in social networks. Annual Review of Sociology 27(1):415-444.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In NIPS, 3111-3119.

Narayanan, H.; Belkin, M.; and Niyogi, P. 2006. On the relation between low density separation, spectral clustering and graph cuts. In NIPS, 1025-1032.

Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; and Ng, A. Y. 2011. Multimodal deep learning. In ICML, 689-696.

Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; et al. 2011. Scikit-learn: Machine learning in Python. JMLR 12:2825-2830.

Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In EMNLP, 1532-1543.

Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. DeepWalk: Online learning of social representations. In KDD, 701-710.

Smith, A. N.; Fischer, E.; and Yongjian, C. 2012. How does brand-related user-generated content differ across YouTube, Facebook, and Twitter? Journal of Interactive Marketing 26(2):102-113.

Srivastava, N., and Salakhutdinov, R. R. 2012. Multimodal learning with deep Boltzmann machines. In NIPS, 2222-2230.

Tang, J.; Zhang, J.; Yao, L.; Li, J.; Zhang, L.; and Su, Z. 2008. ArnetMiner: Extraction and mining of academic social networks. In KDD, 990-998.

Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; and Mei, Q. 2015. LINE: Large-scale information network embedding. In WWW, 1067-1077.

Tu, K.; Cui, P.; Wang, X.; Wang, F.; and Zhu, W. 2018. Structural deep embedding for hyper-networks. In AAAI, 426-433.

von Luxburg, U. 2007. A tutorial on spectral clustering. Statistics and Computing 17(4):395-416.

Wu, L.; Wang, Y.; Gao, J.; and Li, X. 2018. Deep adaptive feature embedding with local sample distributions for person re-identification. Pattern Recognition 73:275-288.

Xia, T.; Tao, D.; Mei, T.; and Zhang, Y. 2010. Multiview spectral embedding. IEEE Trans. on Systems, Man, and Cybernetics 40(6):1438-1446.

Yang, C.; Liu, Z.; Zhao, D.; Sun, M.; and Chang, E. Y. 2015. Network representation learning with rich text information. In IJCAI, 2111-2117.

Yelp. 2017. Yelp Dataset Challenge. www.yelp.com/dataset/challenge.

Yuan, Y.-H.; Sun, Q.-S.; and Ge, H.-W. 2014. Fractional-order embedding canonical correlation analysis and its applications to multiview dimensionality reduction and recognition. Pattern Recognition 47(3):1411-1424.

Zhang, J.; Tang, J.; Li, J.; Liu, Y.; and Xing, C. 2015a. Who influenced you? Predicting retweet via social influence locality. TKDD 9(3).

Zhang, L.; Zhang, Q.; Zhang, L.; Tao, D.; Huang, X.; and Du, B. 2015b. Ensemble manifold regularized sparse low-rank approximation for multiview feature embedding. Pattern Recognition 48(10):3102-3112.
Zhang, Q.; Zhang, L.; Du, B.; Zheng, W.; Bian, W.; and Tao, D. 2015c. MMFE: Multitask multiview feature embedding. In ICDM.

Zhu, S.; Yu, K.; Chi, Y.; and Gong, Y. 2007. Combining content and link for classification using matrix factorization. In SIGIR, 487-494.