# Online Multitask Relative Similarity Learning

Shuji Hao¹, Peilin Zhao², Yong Liu³, Steven C. H. Hoi⁴, Chunyan Miao⁵

¹Institute of High Performance Computing, A*STAR, Singapore
²Artificial Intelligence Department, Ant Financial Services Group, China
³Institute for Infocomm Research, A*STAR, Singapore
⁴School of Information Systems, SMU, Singapore
⁵School of Computer Science and Engineering, NTU, Singapore

haosj@ihpc.a-star.edu.sg, peilin.zpl@antfin.com, liuyo@i2r.a-star.edu.sg, chhoi@smu.edu.sg, ascymiao@ntu.edu.sg

## Abstract

Relative similarity learning (RSL) aims to learn similarity functions from data with relative constraints. Most previous algorithms developed for RSL are batch-based learning approaches, which suffer from poor scalability when dealing with real-world data arriving sequentially. Moreover, these methods are often designed to learn a single similarity function for a specific task, so they may be sub-optimal for multi-task learning problems. To overcome these limitations, we propose a scalable RSL framework named OMTRSL (Online Multi-Task Relative Similarity Learning). Specifically, we first develop a simple yet effective online learning algorithm for multi-task relative similarity learning. We then propose an active learning variant to save labeling cost. The proposed algorithms not only enjoy theoretical guarantees, but also show high efficacy and efficiency in extensive experiments on real-world datasets.

## 1 Introduction

The objective of relative similarity learning (RSL) is to learn similarity functions from training data with relative constraints, instead of the explicit labels commonly used in conventional classification tasks. RSL has been extensively studied and widely used in many real-world applications, such as web search, image retrieval, and data mining [Schultz and Joachims, 2004; Yang and Jin, 2006]. However, existing RSL approaches usually have two main drawbacks.
On one hand, most of them are batch-based learning approaches, which may suffer from very expensive re-training costs in application scenarios (e.g., online social media platforms) where new examples arrive continuously. On the other hand, previous RSL methods are mainly designed for the single-task learning scenario, i.e., learning a single similarity function for only one given task. When applying these methods to multi-task relative similarity learning, a scenario common to many real-world applications (e.g., learning multiple similarity functions for different retrieval tasks), one has to either learn a local similarity metric for each task independently or learn a global similarity metric for all tasks by combining all training data. Nevertheless, these two solutions are often sub-optimal for solving multi-task relative similarity learning problems.

In the literature, multi-task learning has been actively studied for classification problems [Cohen and Crammer, 2014; Gonçalves et al., 2016]. However, very little research has explored multi-task relative similarity learning. The most relevant work is [Parameswaran and Weinberger, 2010], in which a multi-task metric learning algorithm, namely mtLMNN, was proposed by extending the popular LMNN algorithm [Weinberger and Saul, 2009] to multi-task learning scenarios. mtLMNN jointly learns a global metric for multiple tasks and several local metrics, each of which is designed for an individual task. There are several major limitations of mtLMNN. First of all, mtLMNN is based on batch learning; thus it is usually time-consuming and difficult to scale up for large-scale applications. Moreover, mtLMNN enforces learning a Positive Semi-Definite (PSD) distance matrix, which is computationally intensive for large-scale applications.
Furthermore, in [Parameswaran and Weinberger, 2010], the mtLMNN model is built on explicit labeling information, which is not as flexible as the relative constraints adopted in this work. To tackle these challenges, we propose a novel framework, namely Online Multi-Task Relative Similarity Learning (OMTRSL). Specifically, the proposed method simultaneously learns multiple similarity metrics from relative-constraint data via an online learning algorithm. In addition to the high efficiency of the adopted online learning scheme, the learned similarity matrices are not required to be PSD. These characteristics make OMTRSL much more efficient and scalable than existing RSL approaches. Moreover, we also propose an extension of OMTRSL to an online active learning setting, named OMTRSL-Active. This extension reduces the labeling cost and thus further improves the efficiency of the proposed model. In this paper, we theoretically analyze the mistake bounds of both proposed algorithms, and perform extensive experiments on real-world datasets to demonstrate their empirical performance.

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

The rest of this paper is organized as follows. Section 2 reviews the most relevant research work. Section 3 introduces the details of the proposed algorithms, as well as their theoretical analysis. In Section 4, we empirically validate the proposed algorithms in terms of both efficacy and efficiency. Finally, Section 5 concludes this study.

## 2 Related Work

This work is related to two main categories of research: online multi-task learning and similarity learning. Next, we briefly review the most relevant work in each category.

### 2.1 Online Multi-task Learning

Multi-task learning aims to boost the performance of all tasks via an information sharing mechanism across the tasks.
The implicit assumption is that these tasks are related and the labeled data for each task is limited [Hoi et al., 2014; Bakker and Heskes, 2003; Calandriello et al., 2014; Cohen and Crammer, 2014; Zhang et al., 2016]. This work is related to multi-task learning methods under online settings. In the literature, Cavallanti et al. proposed the first online multi-task learning algorithms [Cavallanti et al., 2010] based on the Perceptron algorithm [Block, 1962], where the learned relationship matrix among tasks was fixed. In [Saha et al., 2011], Saha et al. proposed to adaptively update the relationship matrix via an incremental method. However, the performance of these algorithms was usually limited by the underlying learning scheme (i.e., the Perceptron) [Ammar et al., 2014; Lugosi et al., 2009]. In this work, we adopt a superior first-order online learning scheme [Crammer et al., 2006]. By assuming all tasks share a sparse basis, Ruvolo and Eaton [Eaton and Ruvolo, 2013; Ruvolo and Eaton, 2014] proposed a series of Efficient Lifelong Learning (ELLA) algorithms. In [Ammar et al., 2014; Calandriello et al., 2014], these ELLA algorithms were used for reinforcement learning. Under this scenario, the algorithms can be interpreted as online learning algorithms that learn the tasks one by one. However, it is too restrictive to collect the entire data for one task before learning the others. Moreover, all these algorithms are designed for conventional classification problems. In this work, we study online multi-task learning for similarity learning problems.

### 2.2 Similarity Learning

Similarity learning has been extensively studied in the machine learning and data mining communities [Chechik et al., 2010; Crammer and Chechik, 2012; Hao et al., 2015; Schultz and Joachims, 2004; Shalev-Shwartz et al., 2004; Yang and Jin, 2006]. Here, we restrict our discussion to the most related multi-task metric learning progress in the literature.
To the best of our knowledge, the first multi-task metric learning approach was proposed by Parameswaran and Weinberger [Parameswaran and Weinberger, 2010], in which a global metric and an individual metric for each task were learned together based on the large margin nearest neighbor scheme [Weinberger and Saul, 2009]. Based on a similar idea, Yang et al. [Yang et al., 2013] proposed to learn multiple metrics by using the von Neumann divergence among metrics. Zhang and Yeung [Zhang and Yeung, 2010] proposed to learn a target metric from several related source tasks via a multi-task learning approach. Moreover, other works applied multi-task metric learning to network data [Fang and Rockmore, 2015] and person re-identification [Ma et al., 2014]. However, these algorithms were mostly designed for classification problems. In addition, all of them are offline methods, and the learned similarity matrices are required to be PSD. These factors make them inefficient and unsuitable for large-scale problems. Moreover, these algorithms also require explicit class labels, which are not as flexible as relative constraints.

## 3 Online Multi-Task Relative Similarity Learning (OMTRSL)

### 3.1 Problem Setting

In this work, we investigate the problem of online learning of $K$ similarity functions for $K$ related tasks simultaneously. More formally, we denote the similarity function for the $k$-th task by $S_k(x, x')$, $k \in [K]$, where $x, x' \in \mathbb{R}^d$ are two instances (aka samples) from the $k$-th task, and $[K] = \{1, \ldots, K\}$ denotes the indices of all $K$ tasks. We assume a bilinear form for the similarity function as follows:

$$S_k(x, x') = x^\top M_k x', \tag{1}$$

where $M_k \in \mathbb{R}^{d \times d}$ is the similarity matrix learned for the $k$-th task. Note that it is possible to learn a similarity matrix between two heterogeneous spaces, where $M_k \in \mathbb{R}^{d \times d'}$ and $d \neq d'$. For simplicity, in this paper we restrict the remaining discussion to the case $d = d'$.
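As a concrete illustration, the bilinear similarity of Eq. (1) is a single matrix-vector computation. The minimal sketch below (the `bilinear_similarity` helper is our own naming, not from the paper) shows that with $M_k = I_d$ it reduces to the plain inner product, and that an asymmetric, non-PSD $M_k$ is perfectly legal:

```python
import numpy as np

def bilinear_similarity(M_k, x, x_prime):
    """S_k(x, x') = x^T M_k x' (Eq. 1); M_k need not be symmetric or PSD."""
    return float(x @ M_k @ x_prime)

d = 4
x = np.array([1.0, 0.0, 2.0, 0.0])
xp = np.array([0.5, 1.0, 1.0, 0.0])

# With M_k = I_d, the similarity is just the inner product <x, x'>.
s_id = bilinear_similarity(np.eye(d), x, xp)   # 0.5 + 0 + 2.0 + 0 = 2.5

# An asymmetric M_k requires no PSD projection, unlike a distance metric.
M = np.eye(d)
M[0, 2] = 3.0          # couples dimension 0 of x with dimension 2 of x'
s_asym = bilinear_similarity(M, x, xp)         # 2.5 + 1*3*1 = 5.5
```

Avoiding the PSD constraint is precisely what lets the online updates below remain cheap: each update is a rank-one addition with no eigen-decomposition.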
For online multi-task relative similarity learning, we receive a sequence of triplets from multiple tasks. Each triplet contains three instances, the relative label, and the ID of the task it belongs to. More formally, at the $t$-th round, the received triplet is denoted as

$$(x_t, x_t^1, x_t^2;\, y_t;\, k_t) \in \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^d \times \{-1, +1\} \times [K],$$

where $x_t, x_t^1, x_t^2$ are the three instances received at the $t$-th iteration, $k_t \in [K]$ denotes the ID of the task to which the triplet belongs, and $y_t$ indicates the relative similarity relationship between the instance pairs $(x_t, x_t^1)$ and $(x_t, x_t^2)$. Specifically, $y_t = +1$ implies that $x_t$ is more similar to $x_t^1$ than to $x_t^2$, which can be formally expressed as $S_{k_t}(x_t, x_t^1) \geq S_{k_t}(x_t, x_t^2)$. On the contrary, $y_t = -1$ implies that $x_t$ is more similar to $x_t^2$ than to $x_t^1$, i.e., $S_{k_t}(x_t, x_t^1) < S_{k_t}(x_t, x_t^2)$. The objective of online multi-task relative similarity learning is to simultaneously learn, via an online learning approach, the set of $K$ similarity matrices $\{M_k\}$ that assign higher similarity scores to similar pairs and lower scores to dissimilar pairs.

### 3.2 Proposed Method

Formally, for any triplet $z_t = (x_t, x_t^1, x_t^2; y_t; k_t)$, an online learner is expected to optimize the set of similarity functions so as to ensure the following constraint:

$$y_t\left[S_{k_t}(x_t, x_t^1) - S_{k_t}(x_t, x_t^2)\right] \geq 0, \quad \forall t \in [T].$$

In practice, it is impossible to satisfy the above constraint for every triplet. For the violated cases, we introduce a loss function to measure the loss of the $k_t$-th similarity function on the $t$-th triplet. Specifically, we adopt the hinge loss and define the loss function as

$$\ell(M^{k_t}; z_t) = \left[1 - y_t\left[S_{k_t}(x_t, x_t^1) - S_{k_t}(x_t, x_t^2)\right]\right]_+,$$

where $[\cdot]_+ = \max(0, \cdot)$. Note that other loss functions, e.g., the logistic loss and the squared loss, can also be adopted. For simplicity, we vectorize the matrix representations by defining $w_t^k = \mathrm{vec}(M_t^k) = [M_{11}; \ldots; M_{d1}; \ldots; M_{dd}]$ and $\phi_t^{k_t} = \mathrm{vec}(X_t)$, where $X_t = x_t (x_t^1 - x_t^2)^\top$. Moreover, the whole triplet $z_t$ is represented as a $(K \cdot d \cdot d)$-dimensional compound vector $\phi_t = (0; \ldots; 0; \phi_t^{k_t}; 0; \ldots; 0) \in \mathbb{R}^{Kdd \times 1}$, whose only non-zero block sits at the $k_t$-th position. The target matrices of all $K$ tasks are represented as a compound vector $w_t = (w_t^1; \ldots; w_t^K) \in \mathbb{R}^{Kdd \times 1}$. Based on the compound vector representations, the loss function can be rewritten as $\ell(w_t, \phi_t) = [1 - y_t w_t^\top \phi_t]_+$.

Similar to [Cavallanti et al., 2010], we adopt a task relationship matrix $A \in \mathbb{R}^{K \times K}$, which defines the learning rate used in the updating rule for each matrix $M_k$:

$$A^{-1} = \frac{1}{(1+b)K}
\begin{pmatrix}
b+K & b & \cdots & b \\
b & b+K & \cdots & b \\
\vdots & \vdots & \ddots & \vdots \\
b & b & \cdots & b+K
\end{pmatrix},$$

where $b$ is a parameter that controls to what extent the tasks share instances with each other. In the following sections, we denote by $A_{i,j}$ the entry in the $i$-th row and $j$-th column of matrix $A$. For an incoming triplet $z_t$, we define the multi-task Passive-Aggressive [Crammer et al., 2006] objective function for online learning:

$$w_{t+1} = \arg\min_{w}\ \frac{1}{2}\|w - w_t\|^2_{A \otimes I_{dd}} + C\xi \quad \text{s.t.}\quad 1 - y_t w^\top \phi_t \leq \xi,\ \ \xi \geq 0, \tag{2}$$

where $C > 0$ is a parameter that trades off between minimizing the adjustment of the model and minimizing the loss of the new model on the current triplet. In Eq. (2), $A \otimes I_{dd} \in \mathbb{R}^{Kdd \times Kdd}$ denotes the Kronecker product of $A$ with the $d^2 \times d^2$ identity matrix $I_{dd}$. The objective function enjoys a closed-form updating rule:

$$w_{t+1} = w_t + y_t \tau_t \left(A^{-1} \otimes I_{dd}\right)\phi_t, \tag{3}$$

where $\tau_t = \min\left\{C,\ \ell(w_t, \phi_t)/\|\phi_t\|^2_{A^{-1} \otimes I_{dd}}\right\}$. For the $k$-th task, Eq. (3) is essentially equivalent to

$$M_{t+1}^k = M_t^k + y_t \tau_t A^{-1}_{k, k_t} X_t. \tag{4}$$

Algorithm 1 summarizes the proposed OMTRSL algorithm.

**Algorithm 1** OMTRSL: the proposed algorithm for Online Multi-Task Relative Similarity Learning.

Input: parameters $C > 0$ and $b > 0$.
Initialize: $M_0^k = I_d$ for all $k \in [K]$.
For $t = 0, 1, 2, \ldots, T$:
1. Receive $(x_t, x_t^1, x_t^2, k_t)$.
2. Compute $p_t = w_t^\top \phi_t$ and $\hat{y}_t = \mathrm{sign}(p_t)$.
3. Query $y_t$ and compute $\ell(w_t, \phi_t) = [1 - y_t p_t]_+$.
4. If $\ell(w_t, \phi_t) > 0$: set $M_{t+1}^k = M_t^k + y_t \tau_t A^{-1}_{k, k_t} X_t$ for all $k \in [K]$, where $\tau_t = \min\{C,\ \ell(w_t, \phi_t)/\|\phi_t\|^2_{A^{-1} \otimes I_{dd}}\}$; otherwise set $M_{t+1}^k = M_t^k$ for all $k \in [K]$.
Output: $M_{T+1}^k$, $k \in [K]$.

### 3.3 An Active Learning Extension

In Section 3.2, the proposed OMTRSL algorithm only requires the relative label $y_t$, which indicates which pair of instances is more similar. This greatly reduces the labeling effort compared to computing the exact similarity score between any two instances. However, effort is still required to collect the relative labels, and for many real-world applications this may result in a huge amount of human labeling cost, because the number of potential triplets is often much larger than the number of instances in a relative similarity learning task. To reduce the labeling cost, we propose an active learning algorithm, namely OMTRSL-Active, based on the fully-supervised OMTRSL algorithm.

**Algorithm 2** OMTRSL-Active: the proposed algorithm for Active Online Multi-Task Relative Similarity Learning.

Input: parameters $C > 0$, $b > 0$ and $\delta > 0$.
Initialize: $M_0^k = I_d$ for all $k \in [K]$.
For $t = 0, 1, 2, \ldots, T$:
1. Receive $(x_t, x_t^1, x_t^2, k_t)$.
2. Compute $p_t = w_t^\top \phi_t$ and $\hat{y}_t = \mathrm{sign}(p_t)$.
3. Sample $Z_t \in \{0, 1\}$ with $\Pr(Z_t = 1) = \delta/(\delta + |p_t|)$.
4. If $Z_t = 1$: query $y_t$, compute $\ell(w_t, \phi_t) = [1 - y_t p_t]_+$, and if $\ell(w_t, \phi_t) > 0$, set $M_{t+1}^k = M_t^k + y_t \tau_t A^{-1}_{k, k_t} X_t$ for all $k \in [K]$ with $\tau_t$ as in Algorithm 1; otherwise set $M_{t+1}^k = M_t^k$ for all $k \in [K]$.
Output: $M_{T+1}^k$, $k \in [K]$.
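To make the update of Eq. (4) concrete, the following NumPy sketch implements one OMTRSL round (function and variable names are ours, not the paper's). It builds $X_t = x_t(x_t^1 - x_t^2)^\top$, evaluates the hinge loss, and, on a violation, updates every task's matrix scaled by the corresponding entry of $A^{-1}$; the simplification of the compound norm to $A^{-1}_{k,k}\|X_t\|_F^2$ follows from $\phi_t$ having a single non-zero block:

```python
import numpy as np

def make_A_inv(K, b):
    """Task-relationship matrix A^{-1}: off-diagonal entries b,
    diagonal entries b + K, all scaled by 1 / ((1 + b) K)."""
    A_inv = np.full((K, K), float(b))
    np.fill_diagonal(A_inv, b + K)
    return A_inv / ((1.0 + b) * K)

def omtrsl_step(M, x, x1, x2, y, k, A_inv, C):
    """One OMTRSL round (Eq. 4) on triplet (x, x1, x2; y) from task k.
    M is a list of K (d x d) similarity matrices, updated in place."""
    X = np.outer(x, x1 - x2)          # X_t = x_t (x_t^1 - x_t^2)^T
    p = np.sum(M[k] * X)              # p_t = S_k(x, x1) - S_k(x, x2)
    loss = max(0.0, 1.0 - y * p)      # hinge loss [1 - y_t p_t]_+
    if loss > 0:
        # phi_t is non-zero only in task k's block, so the compound norm
        # ||phi_t||^2_{A^{-1} (x) I} collapses to A^{-1}_{k,k} * ||X_t||_F^2.
        tau = min(C, loss / (A_inv[k, k] * np.sum(X * X)))
        for j in range(len(M)):       # every task receives a share of the update
            M[j] += y * tau * A_inv[j, k] * X
    return loss

# Toy run: K = 2 tasks in R^3, identity initialization.
A_inv = make_A_inv(K=2, b=1.0)        # [[0.75, 0.25], [0.25, 0.75]]
M = [np.eye(3) for _ in range(2)]
x  = np.array([1.0, 0.0, 0.0])
x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([0.0, 1.0, 0.0])
loss = omtrsl_step(M, x, x1, x2, y=-1, k=0, A_inv=A_inv, C=1.0)
# y = -1 contradicts the current model (p_t = 1), so loss = 2 and both
# M[0] (own task) and M[1] (related task) move, M[0] by the larger step.
```

Note how $b$ controls sharing: with $b = 0$, $A^{-1}$ is diagonal and each task learns alone; as $b$ grows, the off-diagonal entries approach the diagonal ones and every triplet updates all tasks almost equally.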
Specifically, we adopt a simple yet effective margin-based query strategy [Cesa-Bianchi et al., 2006], which has been successfully used for active learning [Cesa-Bianchi et al., 2006; Hao et al., 2015; 2016]. A stochastic active sampling scheme decides whether it is necessary to query the true label $y_t$ of an incoming triplet. The strategy draws a Bernoulli trial on a random variable $Z_t \in \{0, 1\}$, where $Z_t = 1$ indicates that the true label $y_t$ should be queried at the $t$-th step and $Z_t = 0$ otherwise. In this paper, we define the sampling probability at the $t$-th step as follows:

$$\Pr(Z_t = 1) = \frac{\delta}{\delta + |p_t|}, \tag{5}$$

where $p_t = w_t^\top \phi_t$ is the difference between the similarity scores $S_{k_t}(x_t, x_t^1)$ and $S_{k_t}(x_t, x_t^2)$. A small value of $|p_t|$ usually means that the $t$-th triplet is more difficult to predict, and thus more informative for training the similarity function, than a triplet with a large $|p_t|$. Therefore, such a triplet has a high probability of being queried for its true label in order to update the similarity matrix $M_{k_t}$. In Eq. (5), $\delta > 0$ is a smoothing parameter that controls the number of queries. The details of OMTRSL-Active are shown in Algorithm 2.

### 3.4 Theoretical Analysis

**Theorem 1** Let $\{(x_t, x_t^1, x_t^2, y_t, k_t) \mid t \in [T]\}$ be a sequence of examples where $x_t, x_t^1, x_t^2 \in \mathbb{R}^d$, $y_t \in \{-1, +1\}$, $k_t \in [K]$, and $\|x_t\|, \|x_t^1\|, \|x_t^2\| \leq R$ for all $t$. Then for any matrices $M^1, \ldots, M^K \in \mathbb{R}^{d \times d}$, the number of mistakes made by OMTRSL is bounded by

$$\sum_{t=1}^{T} m_t \leq \max\left\{\frac{4(b+K)R^4}{(1+b)K},\ \frac{1}{C}\right\}\left[\sum_{i=1}^{K}\|M^i\|_F^2 + C\sum_{t=1}^{T}\ell(M^{k_t}; z_t)\right],$$

where $m_t = \mathbb{I}\left(y_t\left[S_{k_t}(x_t, x_t^1) - S_{k_t}(x_t, x_t^2)\right] < 0\right)$.

**Remark:** When $b = 0$, this bound corresponds to the case where all $K$ tasks are learnt separately, while when $b$ is large enough, it corresponds to the case where all tasks are treated as one.
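The query rule of Eq. (5) is a one-liner; the sketch below (our own illustrative code, with hypothetical helper names) shows how $\delta$ trades off query rate against margin: triplets predicted near the decision boundary ($|p_t| \approx 0$) are almost always queried, while confident predictions rarely trigger a query.

```python
import numpy as np

rng = np.random.default_rng(0)

def query_probability(p_t, delta):
    """Eq. (5): Pr(Z_t = 1) = delta / (delta + |p_t|)."""
    return delta / (delta + abs(p_t))

def should_query(p_t, delta):
    """Draw the Bernoulli variable Z_t for one incoming triplet."""
    return rng.random() < query_probability(p_t, delta)

# A zero-margin triplet is always queried; a confident one rarely is.
p_hard = query_probability(0.0, delta=1.0)   # 1.0
p_easy = query_probability(9.0, delta=1.0)   # 0.1
# Raising delta raises the overall query rate at the same margin.
p_more = query_probability(9.0, delta=3.0)   # 0.25
```

In a deployment, $\delta$ is the single knob an operator tunes against the labeling budget: the expected query ratio over a stream is simply the average of these per-triplet probabilities.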
In practice, we can tune the parameter $b$ to achieve a good trade-off.

**Theorem 2** Let $\{(x_t, x_t^1, x_t^2, y_t, k_t) \mid t \in [T]\}$ be a sequence of examples where $x_t, x_t^1, x_t^2 \in \mathbb{R}^d$, $y_t \in \{-1, +1\}$, $k_t \in [K]$, and $\|x_t\|, \|x_t^1\|, \|x_t^2\| \leq R$ for all $t$. Then for any matrices $M^1, \ldots, M^K \in \mathbb{R}^{d \times d}$, the expected number of mistakes made by OMTRSL-Active is bounded by

$$\mathbb{E}\left[\sum_{t=1}^{T} m_t\right] \leq \delta \max\left\{\frac{4(b+K)R^4}{(1+b)K},\ \frac{1}{C}\right\}\sum_{i=1}^{K}\|M^i\|_F^2 + \frac{\delta+1}{2\delta}\left[\sum_{i=1}^{K}\|M^i\|_F^2 + C\sum_{t=1}^{T}\ell(M^{k_t}; z_t)\right],$$

where $m_t = \mathbb{I}\left(y_t\left[S_{k_t}(x_t, x_t^1) - S_{k_t}(x_t, x_t^2)\right] < 0\right)$. We omit the detailed proofs due to space limitation.

## 4 Experiments

In this section, we present empirical studies on two real datasets to evaluate both the efficacy and the efficiency of the proposed algorithms.

### 4.1 Experimental Settings

**Datasets.** The performance of the proposed methods is evaluated on two real-world datasets: the Isolet spoken alphabet recognition dataset [Fanty and Cole, 1990] and the news20 dataset¹. Both datasets have been widely used in multi-task learning research. The Isolet dataset consists of 7,797 examples collected from 150 speakers who each uttered every character of the English alphabet twice; on average, each speaker contributed 52 training examples. The learning task on this dataset is to classify which letter has been uttered based on several acoustic features, including spectral coefficients and contour, sonorant, and post-sonorant features. The speakers are partitioned into 5 disjoint groups of 30 similar speakers each, giving rise to 5 groups of training examples called Isolet1-5. Each group has its own classification task with 26 labels; thus, there are 5 learning tasks on the Isolet dataset. More details can be found in [Fanty and Cole, 1990; Parameswaran and Weinberger, 2010]. The news20 dataset is for news document classification. It consists of 20 classes, with 15,935 examples for training and 3,993 examples for testing.
We adopt 4 major categories (i.e., comp, rec, sci, and talk) of this dataset to form the 4 learning tasks in our multi-task learning experiments.

**Setup and Metrics.** On both datasets, we used standard 5-fold cross-validation for evaluation, in which 80% of the data are used for training and the remaining 20% for testing; we report results averaged over the 5 test sets. In each fold, the triplet data streams used in the experiments were generated from the training set. Specifically, to generate the $t$-th triplet $(x_t, x_t^1, x_t^2; y_t)$, we first randomly choose two examples $x_a$ and $x_b$ belonging to the same class, and another example $x_c$ from a different class. We then flip a coin to decide the value of $y_t$. If $y_t = +1$, we assign the third example $x_c$ to $x_t^2$, and assign the two same-class examples to $x_t$ and $x_t^1$, respectively. If $y_t = -1$, $x_c$ is assigned to $x_t^1$, and the remaining two are assigned to $x_t$ and $x_t^2$, respectively. Note that under the online active learning setting, $y_t$ is only disclosed to the learner when OMTRSL-Active issues a query.

The performance of all algorithms is evaluated using precision at top $k$, a standard measure for ranking algorithms. For each query instance in the test set, all other test instances are ranked according to their similarities to the query instance, calculated using Eq. (1). Among the top $k$ instances, we compute the percentage of instances from the same class as the query instance, and average it over all test instances, obtaining the average precision at top $k$, denoted by AP@k. Moreover, we also compute the mean Average Precision at top $k$, denoted by mAP@k, a measure widely used in the information retrieval community. It considers the number of instances whose classes match the query instance's, as well as the ranking

¹ Available on the LIBSVM Machine Learning Repository.
order of the retrieved instances. For both the AP@k and mAP@k measures, $k \in [10]$ is considered.

### 4.2 Comparison Schemes

We compare the proposed algorithms with the following state-of-the-art multi-task learning algorithms:

- **mtLMNN** [Parameswaran and Weinberger, 2010]: the state-of-the-art multi-task similarity learning algorithm. Although originally proposed for classification, it implicitly outputs distance metrics for similarity learning.
- **OGRSL** [Chechik et al., 2010]: combines all the tasks into a single task to yield one global similarity matrix.
- **OSTRSL** [Chechik et al., 2010]: an online single-task relative similarity learning method, which treats each task independently.
- **OMTRSL-Random**: the random-query version of OMTRSL for active learning; it randomly decides when to query the true label of an incoming triplet.
- **OMTRSL**: the proposed online multi-task relative similarity learning method (Algorithm 1).
- **OMTRSL-Active**: the proposed active variant of the OMTRSL algorithm (Algorithm 2).

For each evaluated algorithm, the optimal parameters are chosen by cross-validation.

### 4.3 Evaluation on Efficacy

In this section, we evaluate the proposed OMTRSL algorithm on the test set in terms of AP@k and mAP@k, where $k$ varies from 1 to 10. Figure 1 and Figure 2 show the average performance over all tasks achieved by the different methods on the isolet and news20 datasets, respectively. Based on these results, we make the following observations.

The online global relative similarity learning algorithm OGRSL consistently performs much worse than the other algorithms. This indicates that it is necessary to learn a model for each task, rather than one global model for all tasks.
The proposed online multi-task learning algorithm OMTRSL consistently outperforms the single-task learning algorithm OSTRSL, which learns each similarity metric independently. This is consistent with the findings in [Caruana, 1997; Parameswaran and Weinberger, 2010]. Moreover, it confirms that the proposed OMTRSL algorithm can improve the overall performance of all tasks by sharing information between related tasks.

The performance of all algorithms decreases as $k$ increases. This is consistent with the intuition that the task becomes harder as the list of retrieved instances becomes longer. In addition, we notice that mtLMNN achieves higher performance when $k$ is small. One possible reason is that, in the learning process, mtLMNN tries to ensure that the top $k$ closest instances of each query instance are from the same class. This strategy makes mtLMNN outperform the other methods when $k \leq 2$. However, for larger $k$, the proposed OMTRSL algorithm consistently achieves better results than mtLMNN on both datasets.

Figure 1: Performance with varied $k$ on the isolet dataset (left: AP, right: mAP).

Figure 2: Performance with varied $k$ on the news20 dataset (left: AP, right: mAP).

### 4.4 Evaluation on Efficiency

We also study the efficiency of the evaluated algorithms. Table 1 shows the time costs of these algorithms on both the isolet and news20 datasets.

Table 1: Time (s) cost of training the algorithms.
| Data   | mtLMNN | OGRSL | OSTRSL | OMTRSL |
|--------|--------|-------|--------|--------|
| isolet | 241.3  | 12.5  | 12.6   | 14.7   |
| news20 | 5371.4 | 111.2 | 114.9  | 126.4  |

From Table 1, we observe that the offline algorithm mtLMNN takes more than 16 times the training time of the online algorithms on the isolet dataset, and more than 40 times on the news20 dataset due to its high dimensionality. It should also be noted that we implemented our algorithms in pure Matlab, while the mtLMNN implementation provided by the authors² combines Matlab and C. These observations indicate that the online algorithms are much more advantageous in terms of time cost. Moreover, the proposed OMTRSL algorithm takes slightly more time than the other two online algorithms (OGRSL, OSTRSL), because it updates several tasks simultaneously on each triplet. Given the overall efficiency of the online learning scheme, this extra time is negligible. These observations confirm the efficiency of the proposed OMTRSL algorithm, which makes it well suited to large-scale similarity learning problems.

² http://www.cs.cornell.edu/~kilian/code/code.html

Figure 3: Performance with different parameters on the validation set of the isolet dataset ((a) OMTRSL, (b) OSTRSL, (c) mtLMNN).

### 4.5 Parameter Sensitivity Analysis

In this section, we study the sensitivity of the model parameters. Figures 3(a), (b) and (c) show the parameter sensitivity of the OMTRSL, OSTRSL and mtLMNN algorithms, respectively. From Figure 3(a), we observe that the performance of the OMTRSL algorithm is not very sensitive to the setting of parameter $b$, although the performance decreases as $b$ increases. This is consistent with our theoretical analysis in Section 3.4.
Moreover, the OMTRSL algorithm is somewhat sensitive to the choice of parameter $C$; an empirically good setting is $C = 1$. In Figure 3(b), we can make similar observations as in Figure 3(a): a too small value of $C$ degrades the performance of the OSTRSL algorithm. In addition, Figure 3(c) indicates that a large value of either $\gamma$ or $\gamma_0$ degrades the performance of the mtLMNN algorithm, which is consistent with the findings in [Parameswaran and Weinberger, 2010].

### 4.6 Experiments on Active Learning

We also evaluate the performance of the proposed active learning algorithm OMTRSL-Active. Figure 4 shows the results on the isolet dataset. We can see that the proposed OMTRSL-Active method consistently outperforms the random-query version, OMTRSL-Random. The active learning approach achieves more than 5% better performance than the random approach with less than a 20% query ratio. More impressively, using around 40% of the training data, the active learning approach (i.e., OMTRSL-Active) achieves performance highly comparable to the OSTRSL and mtLMNN algorithms, which use all of the training data. These results demonstrate the effectiveness of the proposed active learning approach in reducing labeling cost. Similar observations can be made on the news20 dataset; we omit those results due to space limitation.

Figure 4: Performance with varied query ratios on the isolet dataset.

## 5 Conclusion

This paper investigates online multi-task learning techniques for relative similarity learning tasks. In particular, we propose a novel Online Multi-Task Relative Similarity Learning algorithm (OMTRSL), which overcomes the low efficacy and inefficiency of existing approaches. To further reduce the human labeling effort, we
develop an active variant of OMTRSL, namely OMTRSL-Active, which avoids labeling every incoming triplet. We theoretically analyze the mistake bounds of both OMTRSL and OMTRSL-Active, and empirically conduct extensive experiments on real datasets, with very encouraging results compared with state-of-the-art algorithms. For future work, we would like to explore sparse multi-task learning for similarity problems [Yao et al., 2015] and design adaptive relationship-matrix methods for online multi-task RSL.

## Acknowledgments

This research is supported by the National Research Foundation, Prime Minister's Office, Singapore under its IDM Futures Funding Initiative and the International Research Centres in Singapore Funding Initiative. This research is also partially supported by the NTU-PKU Joint Research Institute, a collaboration between Nanyang Technological University and Peking University that is sponsored by a donation from the Ng Teng Fong Charitable Foundation.

## References

[Ammar et al., 2014] Haitham Bou Ammar, Eric Eaton, Paul Ruvolo, and Matthew Taylor. Online multi-task learning for policy gradient methods. In ICML, pages 1206–1214, 2014.

[Bakker and Heskes, 2003] Bart Bakker and Tom Heskes. Task clustering and gating for Bayesian multitask learning. JMLR, 4:83–99, 2003.

[Block, 1962] H. D. Block. The perceptron: a model for brain functioning. I. Reviews of Modern Physics, 34(1), 1962.

[Calandriello et al., 2014] Daniele Calandriello, Alessandro Lazaric, and Marcello Restelli. Sparse multi-task reinforcement learning. In NIPS, pages 819–827, 2014.

[Caruana, 1997] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[Cavallanti et al., 2010] Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Linear algorithms for online multitask classification. JMLR, 11:2901–2934, 2010.
[Cesa-Bianchi et al., 2006] Nicolò Cesa-Bianchi, Claudio Gentile, and Luca Zaniboni. Worst-case analysis of selective sampling for linear classification. JMLR, 7:1205–1230, December 2006.

[Chechik et al., 2010] Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. Large scale online learning of image similarity through ranking. JMLR, 11:1109–1135, 2010.

[Cohen and Crammer, 2014] Haim Cohen and Koby Crammer. Learning multiple tasks in parallel with a shared annotator. In NIPS, pages 1170–1178, 2014.

[Crammer and Chechik, 2012] Koby Crammer and Gal Chechik. Adaptive regularization for similarity measures. In ICML, 2012.

[Crammer et al., 2006] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. JMLR, 7:551–585, 2006.

[Eaton and Ruvolo, 2013] Eric Eaton and Paul L. Ruvolo. ELLA: an efficient lifelong learning algorithm. In ICML, pages 507–515, 2013.

[Fang and Rockmore, 2015] Chen Fang and Daniel N. Rockmore. Multi-task metric learning on network data. In Advances in Knowledge Discovery and Data Mining, pages 317–329. Springer, 2015.

[Fanty and Cole, 1990] Mark Fanty and Ronald Cole. Spoken letter recognition. In NIPS, 1990.

[Gonçalves et al., 2016] André R. Gonçalves, Fernando J. Von Zuben, and Arindam Banerjee. Multi-task sparse structure learning with Gaussian copula models. JMLR, 17(33):1–30, 2016.

[Hao et al., 2015] Shuji Hao, Peilin Zhao, Steven C. H. Hoi, and Chunyan Miao. Learning relative similarity from data streams: active online learning approaches. In CIKM, pages 1181–1190. ACM, 2015.

[Hao et al., 2016] Shuji Hao, Peilin Zhao, Jing Lu, Steven C. H. Hoi, Chunyan Miao, and Chi Zhang. SOAL: second-order online active learning. In ICDM, pages 931–936, 2016.

[Hoi et al., 2014] Steven C. H. Hoi, Jialei Wang, and Peilin Zhao. LIBOL: a library for online learning algorithms.
JMLR, 15(1):495–499, 2014.

[Lugosi et al., 2009] Gábor Lugosi, Omiros Papaspiliopoulos, and Gilles Stoltz. Online multi-task learning with hard constraints. arXiv preprint arXiv:0902.3526, 2009.

[Ma et al., 2014] Lianyang Ma, Xiaokang Yang, and Dacheng Tao. Person re-identification over camera networks using multi-task distance metric learning. IEEE Transactions on Image Processing, 23(8):3656–3670, 2014.

[Parameswaran and Weinberger, 2010] Shibin Parameswaran and Kilian Q. Weinberger. Large margin multi-task metric learning. In NIPS, pages 1867–1875, 2010.

[Ruvolo and Eaton, 2014] Paul Ruvolo and Eric Eaton. Online multi-task learning via sparse dictionary optimization. In Carla E. Brodley and Peter Stone, editors, AAAI, pages 2062–2068. AAAI Press, 2014.

[Saha et al., 2011] Avishek Saha, Piyush Rai, Hal Daumé III, and Suresh Venkatasubramanian. Online learning of multiple tasks and their relationships. In AISTATS, volume 15 of JMLR Proceedings, pages 643–651. JMLR.org, 2011.

[Schultz and Joachims, 2004] Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In NIPS, page 41, 2004.

[Shalev-Shwartz et al., 2004] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In ICML, page 94. ACM, 2004.

[Weinberger and Saul, 2009] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009.

[Yang and Jin, 2006] Liu Yang and Rong Jin. Distance metric learning: a comprehensive survey. Michigan State University, 2, 2006.

[Yang et al., 2013] Peipei Yang, Kaizhu Huang, and Cheng-Lin Liu. Geometry preserving multi-task metric learning. Machine Learning, 92(1):133–175, 2013.

[Yao et al., 2015] Dezhong Yao, Peilin Zhao, Chen Yu, Hai Jin, and Bin Li. Sparse online relative similarity learning. In ICDM, pages 529–538, 2015.

[Zhang and Yeung, 2010] Yu Zhang and Dit-Yan Yeung.
Transfer metric learning by learning task relationships. In SIGKDD, pages 1199–1208. ACM, 2010.

[Zhang et al., 2016] Chi Zhang, Peilin Zhao, Shuji Hao, Yeng Chai Soh, and Bu-Sung Lee. ROM: a robust online multi-task learning approach. In ICDM, pages 1341–1346, 2016.