# High-dimensional Similarity Learning via Dual-sparse Random Projection

Dezhong Yao¹, Peilin Zhao², Tuan-Anh Nguyen Pham¹ and Gao Cong³
¹ Rolls-Royce@NTU Corporate Lab, Nanyang Technological University, Singapore
² South China University of Technology; Tencent AI Lab, China
³ School of Computer Science and Engineering, Nanyang Technological University, Singapore
dzyao@ntu.edu.sg, peilinzhao@hotmail.com, {gaocong, ntapham}@ntu.edu.sg

We investigate how to adopt dual random projection for high-dimensional similarity learning. For a high-dimensional similarity learning problem, projection is usually adopted to map high-dimensional features into a low-dimensional space in order to reduce the computational cost. However, such dimensionality reduction sometimes results in unstable performance, because the solution it yields is only suboptimal in the original space. In this paper, we propose a dual random projection framework for similarity learning that recovers the original optimal solution from the optimal solution of the subspace problem. Previous dual random projection methods usually make strong assumptions about the data, which must be low rank or have a large margin; these assumptions limit the application of dual random projection to similarity learning. We therefore adopt a dual-sparse regularized random projection method that introduces a sparse regularizer into the reduced dual problem. Since the original dual solution is sparse, applying a sparse regularizer in the reduced space relaxes the low-rank assumption. Experimental results show that our method enjoys higher effectiveness and efficiency than state-of-the-art solutions.

1 Introduction

Pairwise similarity learning has been widely used in classification, information retrieval, and recommendation systems [Chechik et al., 2010; Bellet et al., 2012; Cheng, 2013]. The continuous increase of data scale and dimensionality in many applications (e.g., multimedia, finance, bioinformatics, and healthcare) raises two main challenges for similarity learning. First, high dimensionality poses computational and memory challenges. Second, large-scale data may be sparse and noisy, making it difficult to find any structure in the data [Fern and Brodley, 2003]. As a solution to these problems, randomized reduction techniques have received much attention recently [Paul et al., 2013; Yang et al., 2015; Xu et al., 2017].

Random projection (RP) is a simple but powerful feature dimensionality reduction method that projects a high-dimensional sample ($d$ dimensions) into a low-dimensional space ($m$ dimensions) with a randomly generated matrix [Bingham and Mannila, 2001]. In contrast to other feature reduction methods, such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and undercomplete Independent Component Analysis (ICA), RP generates the projection matrix more efficiently, since it avoids computing the eigenvalues of the data (with time complexity $O(d^3)$), which is too heavy as a pre-processing step for large-scale datasets. However, RP does not consider the intrinsic structure of the data, so it may lead to relatively high distortion [Fern and Brodley, 2003]. Specifically, the optimal solution obtained in the subspace can be used to recover a suboptimal solution in the original space, but this recovered solution may differ significantly from the optimal solution of the original problem.
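To make the projection step concrete, the following is a minimal NumPy sketch (ours, not code from the paper) of a Gaussian random projection; the entries of the projection matrix follow $\mathcal{N}(0, 1/m)$, matching the construction used later in Section 3.1, and the function name is our own.

```python
import numpy as np

def gaussian_random_projection(X, m, seed=0):
    """Project the rows of X (n x d) into an m-dimensional subspace.

    The projection matrix R has i.i.d. N(0, 1/m) entries, so no
    eigendecomposition of the data is required.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.normal(loc=0.0, scale=np.sqrt(1.0 / m), size=(d, m))
    return X @ R  # n x m low-dimensional representation
```

Generating $R$ costs only $O(dm)$ time, which is what makes RP attractive compared with the $O(d^3)$ eigendecomposition mentioned above.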
Recently, the Dual Random Projection (DuRP) algorithm has been studied to recover the optimal solution in the original space [Zhang et al., 2013; 2014; Yang et al., 2015]. Specifically, by combining Fenchel's duality theorem with random projection, Zhang et al. [2013] proposed a stochastic DuRP solution that can effectively restore the optimal solution of the original optimization problem. Currently, only a few studies have examined the application of DuRP to similarity learning.

In this paper, we propose an efficient dual random projection framework for the high-dimensional similarity learning task. Our solution avoids the $O(d^3)$ computational cost. The framework solves the optimization problem in two steps using a recovery method. In step 1, RP is applied to the original data to reduce the dimensionality, so that we only need to solve a low-dimensional optimization problem. In step 2, the dual solution of the low-dimensional problem is constructed from its primal solution, and this dual solution is used to recover the optimal solution in the original space. The empirical results show that the proposed framework achieves better and more robust performance.

It is notable that previous dual random projection and recovery methods rely on strong assumptions about the data, namely low rank [Paul et al., 2013; Zhang et al., 2014] or a large separable margin [Balcan et al., 2006; Shi et al., 2012], which may limit their application scenarios for similarity learning. To address this issue, we adopt a dual-sparse regularized random projection approach for high-dimensional similarity learning. In particular, a dual-sparse regularizer is added to the reduced optimization problem. Experiments on a set of real datasets demonstrate that a suitable sparse regularizer can reduce the recovery error.

This paper is organized as follows. Related work is reviewed in the next section. Section 3 presents our dual-sparse regularized random projection framework. Section 4 reports experimental results, and Section 5 concludes our work.

2 Related Work

Similarity/distance metric learning has been intensively studied [Bian and Tao, 2011; Kulis, 2013; Qian et al., 2015]. Distance metric learning (DML) methods focus on learning a symmetric distance between two objects $x_1$ and $x_2$ of the form $(x_1 - x_2)^\top M (x_1 - x_2)$, where the parametric matrix $M$ must be positive semi-definite (PSD) [Shalev-Shwartz et al., 2004; Jin et al., 2009]. Another approach, relative similarity learning (RSL), learns a similarity score of two objects as $S_M(x_1, x_2) = x_1^\top M x_2$, where the similarity matrix $M$ in the bilinear model can be non-symmetric [Chechik et al., 2010; Crammer and Chechik, 2012]. Without the PSD constraint, the cost of optimization decreases from $O(d^3)$ (PSD projection) to $O(d^2)$. However, when solving large-scale optimization problems, both methods still have extremely high computational cost.

Random projection is a popular and efficient technique to project high-dimensional data into a low-dimensional space and learn a metric in the subspace. It has been successfully applied in many applications, such as classification [Zhang et al., 2014], regression [Maillard and Munos, 2012; Zhang et al., 2016], and information retrieval [Venna et al., 2010]. However, this dimensionality reduction method often leads to suboptimal performance.
To overcome this shortcoming, Davis and Dhillon [2008] and Weinberger and Saul [2009] place a strong assumption on the learned metric, requiring it to be a low-rank matrix. This assumption significantly limits the applications. Recently, a method for recovering the optimal solution of the original-space problem has been studied using dual random projection [Qian et al., 2015; Yang et al., 2015]. Like the plain random projection method, DuRP reduces the dimensionality of the data and solves a subspace optimization problem. Then, DuRP constructs the dual solution of the subspace problem and uses it to recover the optimal solution of the original-space problem. The recovered solution achieves a small error by using $\Omega(r \log r)$ projections, where $r$ is the rank of the data matrix [Zhang et al., 2013].

Our work is related to the DuRP work for DML (DuRPDML) [Qian et al., 2015]. However, DuRPDML still needs to solve the PSD problem in the original high-dimensional space. The time complexity of the PSD step is $O(d^3)$, which causes an expensive computational cost for high-dimensional datasets. The similarity learning method is more efficient and consumes less memory than DML in solving the high-dimensional optimization problem because it does not need the PSD step. Also, to recover the primal solution, DuRPDML assumes the data is low rank. In contrast, our algorithm makes use of the sparsity of the dual solution to relax this assumption.

3 Dual-sparse Random Projection for Similarity Learning

Let $X = (x_1, \ldots, x_n) \in \mathbb{R}^{d \times n}$ denote a set of training examples. Our goal is to learn a similarity function $S : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ based on a sequence of training triplets $\mathcal{D} = \{(x_t, x_t^+, x_t^-) \in \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^d \mid t \in [T]\}$ with $[T] = \{1, \ldots, T\}$, where the similarity score between $x_t$ and $x_t^+$ is greater than that between $x_t$ and $x_t^-$. Specifically, we would like to learn a similarity function $S(x, x')$ that assigns higher similarity scores to more relevant instances, i.e., $S(x_t, x_t^+) > S(x_t, x_t^-)$ for all $t \in [T]$. For the similarity function, we adopt a parametric similarity function with a bilinear form: $S_M(x, x') = x^\top M x'$, where $M \in \mathbb{R}^{d \times d}$.

In order to learn the optimal parameter $M$, we introduce a loss function $\ell(M; (x_t, x_t^+, x_t^-))$ that measures its performance on the $t$-th triplet. One popular loss function is the hinge loss:
$$\ell(M; (x_t, x_t^+, x_t^-)) = [1 - S_M(x_t, x_t^+) + S_M(x_t, x_t^-)]_+,$$
where $[z]_+ = \max(0, z)$. This loss measures how much the similarity function defined by $M$ violates the desired constraint $S(x_t, x_t^+) > S(x_t, x_t^-)$. A set of $T$ triplet constraints is derived from the training examples in $X$. During each learning step $t$, a triplet $(x_t, x_t^+, x_t^-)$ is presented to the algorithm, which can update the model from $M_t$ to $M_{t+1}$ based on the current triplet. We denote the model's estimate after the $(t-1)$-th round by $M_t$.

Following the empirical risk minimization framework, the optimal similarity function is learned by solving the following optimization problem:
$$\min_{M \in \mathbb{R}^{d \times d}} \frac{\lambda}{2}\|M\|_F^2 + \frac{1}{T}\sum_{t=1}^{T} \ell(M; (x_t, x_t^+, x_t^-)), \qquad (1)$$
where $\lambda > 0$ is the regularization parameter and $\ell(\cdot)$ is a convex loss function. We define the following hinge loss function for the triplet:
$$\ell(M; (x_t, x_t^+, x_t^-)) = \max\{0, 1 - S(x_t, x_t^+) + S(x_t, x_t^-)\} = \max\{0, 1 - \langle M, A_t \rangle\},$$
where $A_t = x_t (x_t^+ - x_t^-)^\top$ and $\langle M, A_t \rangle = \mathrm{trace}(M^\top A_t)$. Hence, the optimization problem can be described as solving
$$M^* = \arg\min_{M \in \mathbb{R}^{d \times d}} \frac{\lambda}{2}\|M\|_F^2 + \frac{1}{T}\sum_{t=1}^{T} \ell(\langle M, A_t \rangle) \qquad (2)$$
on the triplet constraints $\{A_t\}_{t=1}^{T}$.
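As a quick sanity check on the triplet representation $A_t$ (a sketch of ours, not from the paper), the following NumPy snippet verifies numerically that $\langle M, A_t \rangle = \mathrm{trace}(M^\top A_t)$ equals $S_M(x_t, x_t^+) - S_M(x_t, x_t^-)$, so the hinge loss in Eq. (2) is exactly the triplet hinge loss above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.normal(size=(d, d))
x, x_pos, x_neg = rng.normal(size=(3, d))

# Triplet representation: A_t = x_t (x_t^+ - x_t^-)^T, a rank-one d x d matrix.
A_t = np.outer(x, x_pos - x_neg)

# <M, A_t> = trace(M^T A_t) should equal S_M(x, x+) - S_M(x, x-).
inner_product = np.trace(M.T @ A_t)
similarity_gap = x @ M @ x_pos - x @ M @ x_neg
assert np.isclose(inner_product, similarity_gap)

# Hinge loss for this triplet, as used in Eq. (2).
loss = max(0.0, 1.0 - inner_product)
print(loss)
```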
By writing $\ell(\cdot)$ in its convex conjugate form $\ell^*(\cdot)$, we can turn the primal problem (2) into a dual problem:
$$\max_{\alpha_1, \ldots, \alpha_T} -\frac{1}{T}\sum_{t=1}^{T}\ell^*(-\alpha_t) - \frac{1}{2\lambda T^2}\Big\|\sum_{t=1}^{T}\alpha_t A_t\Big\|_F^2, \qquad (3)$$
which is equivalent to
$$\alpha^* = \arg\max_{\alpha \in [0,1]^T} -\frac{1}{T}\sum_{t=1}^{T}\ell^*(-\alpha_t) - \frac{1}{2\lambda T^2}\alpha^\top G \alpha, \qquad (4)$$
where $\alpha = (\alpha_1, \ldots, \alpha_T)^\top$ and $G = [G_{a,b}]_{T \times T}$ is a matrix in $\mathbb{R}^{T \times T}$ with $G_{a,b} = \langle A_a, A_b \rangle$. We denote by $M^* \in \mathbb{R}^{d \times d}$ the optimal primal solution to (2), and by $\alpha^* \in \mathbb{R}^{T}$ the optimal dual solution to (4). Using the first-order optimality condition, we have
$$M^* = \frac{1}{\lambda T}\sum_{t=1}^{T}\alpha_t^* A_t. \qquad (5)$$

3.1 Dual Random Projection for Similarity Learning

It is computationally challenging to solve either the primal problem (2) or the dual problem (4) when the dimensionality $d$ is high and the number of training triplets $T$ is large. We try to overcome this challenge by using the dual random projection technique. Let $R \in \mathbb{R}^{d \times m}$ be a Gaussian random matrix, where $m \ll d$ and $R_{i,j} \sim \mathcal{N}(0, 1/m)$. For each triplet constraint, we project its representation $A_t$ into the low-dimensional space using the random matrix: $\hat{A}_t = R^\top A_t R$. In the low-dimensional space, the primal goal is to solve the following optimization problem with the randomly projected triplets $\{\hat{A}_t\}_{t=1}^{T}$:
$$\hat{M}^* = \arg\min_{\hat{M} \in \mathbb{R}^{m \times m}} \frac{1}{T}\sum_{t=1}^{T}\ell(\langle \hat{M}, \hat{A}_t \rangle) + \frac{\lambda}{2}\|\hat{M}\|_F^2. \qquad (6)$$
The corresponding dual problem is written as
$$\hat{\alpha}^* = \arg\max_{\hat{\alpha} \in [0,1]^T} -\frac{1}{T}\sum_{t=1}^{T}\ell^*(-\hat{\alpha}_t) - \frac{1}{2\lambda T^2}\hat{\alpha}^\top \hat{G} \hat{\alpha}, \qquad (7)$$
where $\hat{G}_{a,b} = \langle R^\top A_a R, R^\top A_b R \rangle$. We denote by $\hat{\alpha}^* \in \mathbb{R}^{T}$ the optimal solution to (7), and the corresponding primal solution is
$$\hat{M}^* = \frac{1}{\lambda T}\sum_{t=1}^{T}\hat{\alpha}_t^* \hat{A}_t. \qquad (8)$$

Comparing the dual problem (4) in the original space with problem (7) in the subspace, the only difference is that the matrix $G_{a,b} = \langle A_a, A_b \rangle$ in (4) is replaced by $\hat{G}_{a,b} = \langle R^\top A_a R, R^\top A_b R \rangle$ in (7). Since $\mathbb{E}(\langle R^\top A_a R, R^\top A_b R \rangle)$ approaches $\langle A_a, A_b \rangle$ when the reduced dimension $m$ is sufficiently large, $\hat{G}$ will be close to $G$, and $\hat{\alpha}^*$ is also expected to be close to $\alpha^*$. As a result, we can use $\hat{\alpha}^*$ to approximate $\alpha^*$ in (4), so the original optimal model can be recovered as
$$M = \frac{1}{\lambda T}\sum_{t=1}^{T}\hat{\alpha}_t^* A_t. \qquad (9)$$
It is important to note that the metric $\hat{M}^*$ is the optimal solution in the projected subspace, while the recovered $M$ is computed directly in the original space using the approximate dual solution $\hat{\alpha}^*$. Compared to solving the optimization in the original space as in (3), the time complexity of the proposed method drops to $O(md)$ ($\ll O(d^2)$).

Solving the dual problem (7) is the key step in our framework. Here we choose the Stochastic Dual Coordinate Ascent (SDCA) method [Shalev-Shwartz and Zhang, 2013] to find the optimal $\hat{\alpha}^*$; it has been shown to be empirically faster than other stochastic methods, especially for large-scale optimization problems [Shalev-Shwartz and Zhang, 2013]. The whole framework to obtain $M$ is described in Algorithm 1, and the SDCA solution in the projected space is described in Algorithm 2.

Algorithm 1 Dual Random Projection Method for Relative Similarity Learning (DuRPRSL)
Input: the triplet constraints $\mathcal{D} = \{(x_t, x_t^+, x_t^-)\}_{t=1}^{T}$ and the number of random projections $m$
1: Initialize a Gaussian random matrix $R \in \mathbb{R}^{d \times m}$
2: Project each triplet as $\hat{A}_t = R^\top A_t R$
3: Solve the optimization problem (7) using SDCA and obtain the optimal solution $\hat{\alpha}^*$ (*)
4: Recover the original solution $M$ according to (9): $M = \frac{1}{\lambda T}\sum_{t=1}^{T}\hat{\alpha}_t^* A_t$
Output: the recovered solution $M$
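The following is a minimal NumPy sketch (our own illustration, not the authors' code) of steps 1, 2 and 4 of Algorithm 1: projecting the triplet matrices with a Gaussian matrix whose entries follow $\mathcal{N}(0, 1/m)$, and recovering the original-space model from the reduced dual variables via Eq. (9). The dual solve in step 3 is sketched after Algorithm 2 below; all function names are ours.

```python
import numpy as np

def project_triplets(A_list, m, seed=0):
    """Steps 1-2 of Algorithm 1: draw R with N(0, 1/m) entries and form A_hat = R^T A R."""
    d = A_list[0].shape[0]
    rng = np.random.default_rng(seed)
    R = rng.normal(0.0, np.sqrt(1.0 / m), size=(d, m))
    return [R.T @ A @ R for A in A_list], R

def recover_original_solution(alpha_hat, A_list, lam):
    """Step 4 of Algorithm 1 (Eq. (9)): M = (1 / (lambda * T)) * sum_t alpha_hat_t * A_t."""
    T = len(A_list)
    d = A_list[0].shape[0]
    M = np.zeros((d, d))
    for a, A in zip(alpha_hat, A_list):
        M += a * A
    return M / (lam * T)
```

Since each $A_t = x_t(x_t^+ - x_t^-)^\top$ is rank one, both the projection and the recovery can in practice be carried out without materializing the full $d \times d$ matrices; the dense version above is kept only for clarity.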
Algorithm 2 SDCA: Stochastic Dual Coordinate Ascent for Relative Similarity Learning
Input: $\lambda > 0$, the projected triplet constraints $\{\hat{A}_t\}_{t=1}^{T}$
1: Initialize $\hat{M}^{(0)} = \hat{M}(\hat{\alpha}^{(0)}) = 0 \in \mathbb{R}^{m \times m}$, $\hat{\alpha}^{(0)} = (\hat{\alpha}_1, \ldots, \hat{\alpha}_T) = 0$
2: for $t = 1, 2, \ldots, T$ do
3:   Uniformly pick a triplet $\hat{A}_i$
4:   Compute $\Delta\hat{\alpha}_i$ to increase the dual: $\Delta\hat{\alpha}_i = \max\big(0, \min\big(1, \frac{1 - \langle \hat{M}^{(t-1)}, \hat{A}_i \rangle}{\|\hat{A}_i\|_F^2/(\lambda T)} + \hat{\alpha}_i^{(t-1)}\big)\big) - \hat{\alpha}_i^{(t-1)}$
5:   $\hat{\alpha}^{(t)} \leftarrow \hat{\alpha}^{(t-1)} + \Delta\hat{\alpha}_i e_i$
6:   $\hat{M}^{(t)} \leftarrow \hat{M}^{(t-1)} + (\lambda T)^{-1}\Delta\hat{\alpha}_i \hat{A}_i$
7: end for
Output (Average option): let $\bar{\hat{\alpha}} = \frac{1}{T - T_0}\sum_{t=T_0+1}^{T}\hat{\alpha}^{(t-1)}$ and $\bar{\hat{M}} = \hat{M}(\bar{\hat{\alpha}}) = \frac{1}{T - T_0}\sum_{t=T_0+1}^{T}\hat{M}^{(t-1)}$; return $\bar{\hat{\alpha}}$
Output (Random option): let $\bar{\hat{\alpha}} = \hat{\alpha}^{(t)}$ and $\bar{\hat{M}} = \hat{M}^{(t)}$ for a uniformly random $t \in \{T_0 + 1, \ldots, T\}$; return $\bar{\hat{\alpha}}$

For the hinge loss, step 3(*) in SDCA has the closed-form solution
$$\Delta\alpha_i = \max\Big(0, \min\Big(1, \frac{1 - \langle M^{(t-1)}, A_i \rangle}{\|A_i\|_F^2/(\lambda T)} + \alpha_i^{(t-1)}\Big)\Big) - \alpha_i^{(t-1)}. \qquad (10)$$

The dual random projection and recovery approach has one deficiency: some non-support samples of the original optimization problem become support samples due to the feature reduction, which can increase the recovery error. As a result, dual recovery methods rely on strong assumptions about the data: low rank [Paul et al., 2013; Zhang et al., 2014] or a large separable margin [Balcan et al., 2006; Shi et al., 2012]. To address this limitation, we use dual-sparse randomized reduction to relax these assumptions.

3.2 Dual-sparse Regularized Random Projection (DuSRPRSL) for Similarity Learning

To reduce the number of training instances that are non-support samples in the original optimization problem but are turned into support samples by the reduction of the feature space, we add a dual-sparse regularizer to the reduced dual problem (7). The optimization problem becomes
$$\tilde{\alpha}^* = \arg\max_{\alpha \in [0,1]^T} -\frac{1}{T}\sum_{t=1}^{T}\ell^*(-\alpha_t) - \frac{1}{2\lambda T^2}\alpha^\top \hat{G} \alpha - \frac{\tau}{T}\|\alpha\|_1, \qquad (11)$$
where $\|\alpha\|_1$ is the regularizer and $\tau > 0$. To understand the effectiveness of the dual-sparse regularizer in random projection for similarity learning, we evaluate the performance on the non-smooth hinge loss $\ell(u) = \max(0, 1 - u)$ and also give a solution for the smooth squared hinge loss $\ell(u) = \max(0, 1 - u)^2$. In this paper, the hinge loss is chosen for illustration. Given $\ell^*(-\alpha_i) = -\alpha_i$ for $\alpha_i \in [0, 1]$, the new dual problem is written as
$$\tilde{\alpha}^* = \arg\max_{\alpha \in [0,1]^T} \frac{1}{T}\sum_{t=1}^{T}\alpha_t - \frac{1}{2\lambda T^2}\alpha^\top \hat{G} \alpha - \frac{\tau}{T}\|\alpha\|_1. \qquad (12)$$
Using the variable transformation $\alpha_t \to -\beta_t$, the above problem is written as
$$\max_{\beta \in [-1,0]^T} -\frac{1}{T}\sum_{t=1}^{T}\beta_t(1 - \tau) - \frac{1}{2\lambda T^2}\beta^\top \hat{G} \beta. \qquad (13)$$
Changing back into the primal form, we have
$$\min_{M \in \mathbb{R}^{m \times m}} \frac{1}{T}\sum_{t=1}^{T}\ell_{1-\tau}(\langle M, \hat{A}_t \rangle) + \frac{\lambda}{2}\|M\|_F^2, \qquad (14)$$
where $\ell_\gamma(z) = \max(0, \gamma - z)$ is a max-margin loss with margin $\gamma$. For the hinge loss, step 3(*) in SDCA then has the closed-form solution
$$\Delta\alpha_i = \max\Big(0, \min\Big(1, \frac{(1 - \tau) - \langle M^{(t-1)}, A_i \rangle}{\|A_i\|_F^2/(\lambda T)} + \alpha_i^{(t-1)}\Big)\Big) - \alpha_i^{(t-1)}. \qquad (15)$$
For the squared hinge loss $\ell(u) = \max(0, 1 - u)^2$, the closed-form solution is
$$\Delta\alpha_i = \frac{(1 - \tau) - 0.5\,\alpha_i^{(t-1)} - \langle M^{(t-1)}, A_i \rangle}{0.5 + \|A_i\|_F^2/(\lambda T)}. \qquad (16)$$

The proposed dual-sparse formulation provides a bound on the dual recovery error $\tilde{\alpha}^* - \alpha^*$ that overcomes the low-rank limitation, which is another contribution of this paper. As the margin becomes small in the reduced feature space, samples become difficult to separate after the dimensionality is reduced. A suitable sparse regularizer can help to improve the recovery accuracy, because adding the $\ell_1$ regularization in the reduced problem of similarity learning is equivalent to using a max-margin loss with a smaller margin.
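Below is a minimal NumPy sketch (ours, with our own function name) of the SDCA loop in Algorithm 2 for the hinge loss, including the dual-sparse variant: setting tau=0 gives the plain update of Eq. (10), while tau>0 gives the smaller-margin update of Eq. (15). For simplicity it returns the last iterate rather than the averaged or randomly chosen iterate in the Output options.

```python
import numpy as np

def sdca_rsl(A_hats, lam, tau=0.0, epochs=1, seed=0):
    """SDCA for relative similarity learning on projected triplets.

    A_hats : list of projected triplet matrices (m x m)
    lam    : regularization parameter lambda
    tau    : dual-sparse parameter; 0 recovers Eq. (10), >0 gives Eq. (15)
    Returns the dual variables alpha and the subspace model M_hat.
    """
    T = len(A_hats)
    m = A_hats[0].shape[0]
    alpha = np.zeros(T)
    M = np.zeros((m, m))
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for _ in range(T):
            i = int(rng.integers(T))                 # step 3: pick a triplet uniformly
            A_i = A_hats[i]
            numer = (1.0 - tau) - np.sum(M * A_i)    # (1 - tau) - <M, A_i>
            denom = np.sum(A_i * A_i) / (lam * T)    # ||A_i||_F^2 / (lambda * T)
            delta = np.clip(numer / denom + alpha[i], 0.0, 1.0) - alpha[i]  # step 4
            alpha[i] += delta                        # step 5
            M += delta * A_i / (lam * T)             # step 6
    return alpha, M
```

Feeding the projected triplets from project_triplets into sdca_rsl and passing the returned alpha to recover_original_solution mirrors steps 2-4 of Algorithm 1.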
The experimental results demonstrate that the recovery error can be reduced by a suitable sparse regularizer.

4 Experiments

In this section, we first present our study of DuRPRSL on ranking and classification tasks. We then give a case study in support of dual-sparse regularized DuRPRSL.

4.1 Experiment Datasets and Settings

To examine the effectiveness of the proposed method, we tested on six public datasets from LIBSVM, Caltech, and UCD: Protein, Gisette, RCV1, URL, Caltech256, and BBC, as shown in Table 1.

| Data Set | Source | #Class | #Feature | #Train | #Test |
|---|---|---|---|---|---|
| Protein | LIBSVM | 3 | 357 | 17,766 | 6,621 |
| Caltech30 | Caltech | 30 | 1,000 | 20,623 | 8,838 |
| Gisette | LIBSVM | 2 | 5,000 | 6,000 | 1,000 |
| BBC | UCD | 5 | 9,636 | 1,558 | 667 |
| RCV1 | LIBSVM | 2 | 47,236 | 20,242 | 677,399 |
| URL | LIBSVM | 2 | 3,231,961 | 1,677,291 | 718,839 |

Table 1: Statistics of standard datasets.

Caltech30 is a subset of the Caltech256 image dataset; we kept the 30 most popular categories. The BBC news article dataset was gathered from the BBC news website. Protein is a bioinformatics dataset used to predict the local conformation of the polypeptide chain. The Gisette dataset is for a handwritten digit recognition problem, released for the NIPS 2003 Feature Selection Challenge. RCV1 is a binary text classification dataset. The URL dataset is used for malicious URL detection and consists of 2.4 million URLs and 3.2 million features collected over 120 days.

For evaluation, we used the standard training and testing split given by the providers, except for Caltech30 and BBC; these two datasets were randomly split into a training set (70%) and a test set (30%). To generate a triplet $(x_t, x_t^+, x_t^-)$, $x_t$ is first selected at random from the whole training set; then $x_t^+$ is selected at random from the training examples with the same class as $x_t$; finally, $x_t^-$ is selected at random from the remaining training examples, whose classes differ from that of $x_t$.

To make a fair comparison, all methods adopted the same experimental setup. We randomly selected $T = 50{,}000$ triplets as training instances and set the number of epochs to 5 for all stochastic methods. The average results over five trials are reported. Cross-validation was used to select the hyperparameter values for all algorithms; specifically, the parameters set by cross-validation included the aggressiveness parameter $C$ for OASIS ($C \in \{1, 0.1, 0.08, \ldots, 0.01\}$) and $\lambda \in$ {5e-2, 5e-1, ..., 5e+6}. Moreover, the hinge loss was used in the implementation of the proposed algorithms.

To evaluate the quality of the learned metrics, we first used mean average precision (MAP) to evaluate retrieval accuracy. Second, we evaluated classification performance using a k-nearest neighbor classifier: we applied the proposed algorithms to learn a metric, and for each test instance $q$ we used the learned metric to find the top $k$ nearest training examples and predicted the class assignment by majority vote among the $k$ nearest neighbors. We also recorded the time cost of the proposed algorithms.

4.2 Comparison Algorithms

We compared eight approaches.

- Euclidean: The baseline method using the standard Euclidean distance in feature space.
- OASIS: A state-of-the-art algorithm that learns a bilinear similarity, based on the online passive-aggressive algorithm with triplet instances [Chechik et al., 2010].
- DuDML: This algorithm applies Stochastic Dual Coordinate Ascent (SDCA) [Shalev-Shwartz and Zhang, 2013] to learn the distance metric.
- DuRSL: This algorithm applies SDCA to solve the dual problem in Eq. (4) and recovers the similarity metric by Eq. (5).
- SRP: Applies random projection to reduce the dimensionality and then uses SDCA to learn the similarity metric in the subspace.
- SPCA: Applies PCA to project the original data into a lower-dimensional space and then uses SDCA to learn the similarity metric in the subspace.
- DuRPDML: Dual Random Projected Distance Metric Learning proposed by [Qian et al., 2015].
- DuRPRSL: The proposed algorithm (Algorithm 1).

OASIS, DuDML, and DuRSL solve the optimization problem in the original space. DuRPDML, DuRPRSL, SRP, and SPCA apply RP or PCA and solve the optimization problem in the subspace. Moreover, DuRPDML and DuRPRSL recover a solution in the original space from the subspace results.

4.3 Evaluation by Ranking

To demonstrate the effectiveness of the proposed method, we first compared the MAP performance of the different approaches, with the dimension of the subspace projection set to 100. The results are summarized in Table 2. Then, in order to observe the effect of different dimensions on the MAP performance, we varied the dimension of the subspace from 20 to 200 on each dataset. Fig. 1 shows how the MAP results are affected by the projection dimension. On average, DuRPRSL provides at least a 5% improvement over DuRPDML, and its recovered solution is also quite close to the primal optimal solution.

| | Euclidean | OASIS | DuDML | DuRSL | DuRPDML | DuRPRSL | SRP | SPCA |
|---|---|---|---|---|---|---|---|---|
| Protein | 38.78 | 45.18±0.12 | 47.05±0.31 | 49.30±0.24 | 43.35±0.16 | 45.47±0.08 | 41.38±0.29 | 43.98±0.15 |
| Caltech30 | 17.45 | 31.37±0.03 | 28.39±0.37 | 30.71±0.40 | 25.05±0.18 | 28.43±0.13 | 17.37±0.43 | 25.78±0.11 |
| Gisette | 62.05 | 82.62±0.22 | 88.92±0.53 | 95.10±0.61 | 84.26±0.67 | 91.01±0.66 | 69.10±2.35 | 87.95±0.09 |
| BBC | 31.75 | 79.36±0.42 | 83.39±0.35 | 94.02±0.70 | 81.54±0.31 | 90.45±0.78 | 50.08±1.52 | 88.58±0.08 |
| RCV1 | 60.50 | 92.75±0.03 | 81.44±0.78 | 92.41±0.53 | 80.05±0.82 | 90.62±0.54 | 60.41±2.33 | 85.57±0.17 |
| URL | 67.21 | 87.92±0.24 | 91.15±0.02 | 94.04±0.03 | 82.67±0.42 | 72.68±0.25 | 83.09±0.20 | 86.92±0.03 |

Table 2: Comparison of ranking results by MAP (%) (dim=100). Euclidean through DuRPRSL use a metric in the original space; SRP and SPCA use a metric in the subspace.

[Figure 1: Performance of ranking algorithms (OASIS, DuDML, DuRSL, DuRPDML, DuRPRSL, SRP, SPCA) with different numbers of projection dimensions m, in terms of MAP; panels (a)-(f) correspond to Protein, Caltech30, Gisette, BBC, RCV1, and URL.]

For the first experiment, Table 2 shows the MAP results of the different learning approaches. First, we observe that RSL-based methods perform better than DML-based methods, and DuRSL achieves the best performance among all the methods. Second, compared with DuRSL, DuRPRSL achieves similar performance on all datasets except URL, where the gap results from an extreme dimensionality reduction, from roughly 3 million features to 100.
When the dimension increases to 2000, we can see that DuRPRSL works best among all algorithms in Fig. 1(f), and MAP increases with the number of dimensions. In the subspace, SPCA performs significantly better than SRP, while SRP is slightly better than the baseline method; this is because SPCA keeps the maximal data variance when making projections. Compared with the methods in the original space, SPCA achieves performance similar to DuRPDML.

For the second experiment, we compared how the MAP performance changes with different projection dimensions in Fig. 1. We set the number of random projection dimensions from 20 to 200 for the datasets Protein, Caltech30, Gisette, and BBC. For RCV1 and URL, we varied the number of random projection dimensions from 50 to 4000, as their feature spaces are larger. Note that the performance of OASIS, DuDML, and DuRSL remains unchanged as the number of dimensions varies, because they do not apply any reduction. We can observe that DuRSL achieves almost the best performance on all the datasets.

[Figure 2: Performance of classification algorithms (OASIS, DuDML, DuRSL, DuRPDML, DuRPRSL, SRP, SPCA) with different numbers of projection dimensions m, in terms of k-NN classification accuracy (%); panels (a)-(f) correspond to Protein, Caltech30, Gisette, BBC, RCV1, and URL.]

Of the methods that use random projection, DuRPRSL performs almost the best across projection dimensions and datasets, except on the URL dataset. On URL, DuRPRSL becomes better than the other random projection methods once the number of random projections reaches 2000, while SRP performs better than DuRPRSL when the dimension is less than 2000. That is because SRP projects all the data into a low-dimensional space and solves the optimization problem directly in that space, whereas DuRPRSL needs to recover the solution of the original-space optimization problem. Although RP is more computationally efficient than PCA, it often yields worse performance than PCA unless the number of random projections is sufficiently large [Fradkin and Madigan, 2003]. In addition, when comparing SRP and SPCA, the performance of SRP improves with an increasing number of projections, while the performance of SPCA is not significantly affected by the projection dimension, because the principal components are already captured in the subspace.

4.4 Evaluation by Classification

In this experiment, we evaluate the learned metric by its classification accuracy with a k-NN (k=5) classifier. The k-NN accuracy for different projection dimensions is shown in Fig. 2. From the results, we can see that DuRSL has the best performance and DuRPRSL is similar; overall, we essentially have the same observations as for the ranking experiments reported in Section 4.3. DuRPRSL provides at least a 10% accuracy improvement over the RP methods on the high-dimensional datasets.
DuRPRSL is also quite close to the primal solution DuRSL, which indicates that the recovery error is small. Comparing DuRPRSL and DuRPDML, DuRPRSL also achieves a 5% accuracy improvement, except on the dataset Gisette. The dual recovery framework we propose is mainly concerned with time efficiency: on Gisette, DuRPRSL achieves performance similar to DuRPDML but has a smaller computational overhead, as shown in Table 3.

| | Method | Protein | Caltech30 | Gisette | BBC | RCV1 | URL |
|---|---|---|---|---|---|---|---|
| Metric in Original Space | OASIS | 163.18±12.62 | 844.74±57.54 | 11142.05±306.66 | 6952.13±560.81 | 22767.31±1303.77 | 16086.49±2253.54 |
| | DuDML | 134.18±26.24 | 739.08±71.54 | 33994.66±505.87 | 341260.11±810.73 | 383310.98±2851.23 | 560486.51±3053.87 |
| | DuRSL | 65.62±1.62 | 541.55±45.05 | 29443.66±155.07 | 246310.02±959.40 | 158217.06±821.51 | 537627.78±493.84 |
| | DuRPDML (learning) | 13.12±0.62 | 11.59±1.87 | 12.11±1.43 | 11.69±0.76 | 12.98±0.13 | 11.70±0.83 |
| | DuRPDML (recovering) | 58.76±0.46 | 166.15±24.33 | 2644.12±193.92 | 13106.39±434.72 | 2718.18±87.93 | 2185.98±116.92 |
| | DuRPRSL (learning) | 10.28±0.58 | 9.48±0.28 | 9.84±0.52 | 9.95±0.41 | 10.55±0.13 | 9.61±0.04 |
| | DuRPRSL (recovering) | 30.92±0.13 | 110.41±1.50 | 1123.22±37.69 | 5817.32±719.52 | 1093.59±141.88 | 1155.52±217.48 |
| Metric in Sub Space | SRP | 21.41±28.05 | 21.85±6.65 | 20.33±2.18 | 22.09±4.04 | 11.27±1.61 | 9.95±2.33 |
| | SPCA | 15.33±3.53 | 10.02±0.07 | 8.84±0.12 | 8.84±0.12 | 10.15±0.98 | 10.99±0.52 |

Table 3: Learning time (seconds) used by different approaches.

4.5 Learning Efficiency of DuRPRSL

In this section, we compare the computational cost of all the learning algorithms, recorded during the experiments of Section 4.3. The dimension of the subspace projection was set to 100, and we report the runs with the optimal performance for each algorithm. The learning time is measured as CPU time (in seconds) and summarized in Table 3. The learning time of DuRPDML and DuRPRSL is reported in two steps: the subspace learning step and the original-space recovering step. It is not surprising to observe that DuRPDML and DuRPRSL have subspace learning times similar to SRP and SPCA, which are significantly more efficient than the other methods. Combining the subspace learning time and the original-space recovering time, DuRPDML and DuRPRSL are 10 times faster than OASIS, and 100 times faster on the RCV1 and URL datasets. RSL-based methods are always faster than DML-based methods because they avoid the PSD constraint; avoiding this constraint makes the recovery step of DuRPRSL 2 times faster than that of DuRPDML. Comparing the methods that apply random projection (DuRPDML, DuRPRSL) with those that do not (OASIS, DuDML, DuRSL), we see from Table 3 that random projection significantly reduces the computational cost. DuRPRSL saves at least 80% of the computation time compared with the original-space optimization methods on these test datasets.

4.6 Dual-sparse Study

In this experiment, we present a case study in support of DuSRPRSL, using the RCV1 dataset in Table 1. We aim to answer two questions related to our motivation: (i) How is the recovery error affected by the value of $\tau$ in Eq. (11)? (ii) How does the number of random projection dimensions $m$ affect the recovery error? We varied $\tau \in \{0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\}$, $m \in \{100, 1000, 2000, 4000\}$, and $\lambda \in$ {1e-4, 1e-5, 1e-6}. $\tau = 0$ corresponds to the random reduction approach without the sparse regularizer.
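Purely as an illustration of how such a sweep might be organized (our sketch; the paper does not give code), the loop below reuses the hypothetical project_triplets and sdca_rsl helpers from the earlier sketches to solve the reduced dual for each (m, tau) pair and compare the recovered dual variables against a reference dual solution alpha_star computed in the original space.

```python
import numpy as np

def sweep_tau_and_m(A_list, alpha_star, lam, taus, ms):
    """Solve the reduced dual for each (m, tau) pair and record the relative dual error."""
    errors = {}
    for m in ms:
        A_hats, _ = project_triplets(A_list, m)        # reuse earlier sketch
        for tau in taus:
            alpha_tilde, _ = sdca_rsl(A_hats, lam, tau=tau, epochs=5)
            errors[(m, tau)] = (np.linalg.norm(alpha_tilde - alpha_star)
                                / np.linalg.norm(alpha_star))
    return errors

# Example grid matching the case study:
# taus = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]; ms = [100, 1000, 2000, 4000]
```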
Three evaluation metrics were used:
$$e_1 = \frac{\|[\tilde{\alpha}^*]_{S^c}\|_1}{\|[\tilde{\alpha}^*]_S - [\alpha^*]_S\|_1}, \qquad e_2 = \frac{\|\tilde{\alpha}^* - \alpha^*\|_2}{\|\alpha^*\|_2}, \qquad e_3 = \frac{\|\tilde{M} - M^*\|_2}{\|M^*\|_2},$$
where $\tilde{\alpha}^*$ is the optimal dual solution to (11), $\alpha^*$ is the optimal dual solution to (4) solved in the original space with support set $S$, $S^c$ is the complement of $S$, and $\tilde{M}$ is the primal solution recovered from $\tilde{\alpha}^*$. Here $e_1$ is the ratio of the error on non-support samples to the error on support samples, while $e_2$ and $e_3$ are the dual recovery error and the primal recovery error, respectively. The recovery performance is shown in Fig. 3.

[Figure 3: Recovery error for the non-smooth hinge loss: (a) $e_1$ vs. $\tau$ (non-support-error/support-error), (b) $e_2$ vs. $\tau$ (relative dual error, $\ell_2$ norm), (c) $e_3$ vs. $\tau$ (relative primal error, $\ell_2$ norm), with curves for dim ∈ {100, 1000, 2000, 4000} and rows for the different values of $\lambda$.]

In the left column, the ratio $e_1$ decreases as $\tau$ increases, which indicates that the magnitude of the dual variables of the original non-support samples decreases. The recovery errors of the dual solution and the primal solution are illustrated in the middle and right columns of Fig. 3. Generally, a larger $\tau$ leads to a larger dual recovery error. For the cases $\lambda$=1e-5 and $\lambda$=1e-6, however, the recovery error first decreases and then increases; this indicates that, when $\tau$ is below a threshold, the dual recovery error decreases as $\tau$ increases, which helps to relax the low-rank assumption. Comparing the cases $\lambda$=1e-4, $\lambda$=1e-5, and $\lambda$=1e-6, the difference is that a smaller $\lambda$ leads to a larger $\|M^*\|_2$, which makes the support samples sparser.

5 Conclusions

This paper presented a dual random projection method for similarity learning that is suitable for large-scale, high-dimensional datasets. The main idea is to first solve the dual problem in the subspace spanned by the random projection matrix, and then use the dual variables to recover the similarity function in the original space. This method can accurately and efficiently recover the original optimal solution with a small error. In addition, a dual-sparse regularized randomized reduction method is proposed to relax the low-rank assumption. The numerical experiments demonstrated the efficiency and effectiveness of the proposed reduction and recovery methods.

Acknowledgments

This work is conducted within the Rolls-Royce@NTU Corporate Lab with support from the National Research Foundation (NRF) Singapore under the Corp Lab@University Scheme.

References

[Balcan et al., 2006] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Kernels as features: On kernels, margins, and low-dimensional mappings. Machine Learning, 65(1):79-94, 2006.

[Bellet et al., 2012] Aurélien Bellet, Amaury Habrard, and Marc Sebban. Similarity learning for provably accurate sparse linear classification. In Proc. ICML, pages 1491-1498, Edinburgh, Scotland, UK, June 2012.

[Bian and Tao, 2011] Wei Bian and Dacheng Tao.
Learning a distance metric by empirical loss minimization. In Proc. IJCAI, pages 1186-1191, Barcelona, Spain, July 2011.

[Bingham and Mannila, 2001] Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: applications to image and text data. In Proc. SIGKDD, pages 245-250, San Francisco, CA, USA, August 2001.

[Chechik et al., 2010] Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, 11:1109-1135, 2010.

[Cheng, 2013] Li Cheng. Riemannian similarity learning. In Proc. ICML, pages 540-548, Atlanta, GA, USA, June 2013.

[Crammer and Chechik, 2012] Koby Crammer and Gal Chechik. Adaptive regularization for weight matrices. In Proc. ICML, pages 425-432, Edinburgh, Scotland, UK, June 2012.

[Davis and Dhillon, 2008] Jason V. Davis and Inderjit S. Dhillon. Structured metric learning for high dimensional problems. In Proc. SIGKDD, pages 195-203, Las Vegas, NV, USA, August 2008.

[Fern and Brodley, 2003] Xiaoli Z. Fern and Carla E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proc. ICML, pages 186-193, Washington, DC, USA, August 2003.

[Fradkin and Madigan, 2003] Dmitriy Fradkin and David Madigan. Experiments with random projections for machine learning. In Proc. SIGKDD, pages 517-522, Washington, DC, USA, August 2003.

[Jin et al., 2009] Rong Jin, Shijun Wang, and Yang Zhou. Regularized distance metric learning: Theory and algorithm. In Proc. NIPS, pages 862-870, Vancouver, British Columbia, Canada, December 2009.

[Kulis, 2013] Brian Kulis. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287-364, 2013.

[Maillard and Munos, 2012] Odalric-Ambrym Maillard and Rémi Munos. Linear regression with random projections. Journal of Machine Learning Research, 13:2735-2772, 2012.

[Paul et al., 2013] Saurabh Paul, Christos Boutsidis, Malik Magdon-Ismail, and Petros Drineas. Random projections for support vector machines. In Proc. AISTATS, pages 498-506, Scottsdale, AZ, USA, April 2013.

[Qian et al., 2015] Qi Qian, Rong Jin, Shenghuo Zhu, and Yuanqing Lin. Fine-grained visual categorization via multi-stage metric learning. In Proc. CVPR, pages 3716-3724, Boston, MA, USA, June 2015.

[Shalev-Shwartz and Zhang, 2013] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(1):567-599, 2013.

[Shalev-Shwartz et al., 2004] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In Proc. ICML, page 94, Banff, Alberta, Canada, July 2004.

[Shi et al., 2012] Qinfeng Shi, Chunhua Shen, Rhys Hill, and Anton van den Hengel. Is margin preserved after random projection? In Proc. ICML, pages 643-650, Edinburgh, Scotland, UK, June 2012.

[Venna et al., 2010] Jarkko Venna, Jaakko Peltonen, Kristian Nybo, Helena Aidos, and Samuel Kaski. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research, 11:451-490, 2010.

[Weinberger and Saul, 2009] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207-244, 2009.

[Xu et al., 2017] Yi Xu, Haiqin Yang, Lijun Zhang, and Tianbao Yang. Efficient non-oblivious randomized reduction for risk minimization with improved excess risk guarantee. In Proc. AAAI, pages 2796-2802, San Francisco, CA, USA, February 2017.

[Yang et al., 2015] Tianbao Yang, Lijun Zhang, Rong Jin, and Shenghuo Zhu. Theory of dual-sparse regularized randomized reduction. In Proc. ICML, pages 305-314, Lille, France, July 2015.

[Zhang et al., 2013] Lijun Zhang, Mehrdad Mahdavi, Rong Jin, Tianbao Yang, and Shenghuo Zhu. Recovering the optimal solution by dual random projection. In Proc. COLT, pages 135-157, Princeton University, NJ, USA, June 2013.

[Zhang et al., 2014] Lijun Zhang, Mehrdad Mahdavi, Rong Jin, Tianbao Yang, and Shenghuo Zhu. Random projections for classification: A recovery approach. IEEE Trans. Information Theory, 60(11):7300-7316, 2014.

[Zhang et al., 2016] Weizhong Zhang, Lijun Zhang, Rong Jin, Deng Cai, and Xiaofei He. Accelerated sparse linear regression via random projection. In Proc. AAAI, pages 2337-2343, Phoenix, AZ, USA, February 2016.