# Semi-Supervised Metric Learning: A Deep Resurrection

Ujjal Kr Dutta,1,2 Mehrtash Harandi,3 C Chandra Sekhar2
1 Data Sciences, Myntra, India
2 Dept. of Computer Science and Eng., Indian Institute of Technology Madras, India
3 Dept. of Electrical and Computer Systems Eng., Monash University, Australia
ukd@cse.iitm.ac.in, ujjal.dutta@myntra.com, mehrtash.harandi@monash.edu, chandra@cse.iitm.ac.in

## Abstract

Distance Metric Learning (DML) seeks to learn a discriminative embedding where similar examples are closer, and dissimilar examples are pushed apart. In this paper, we address the problem of Semi-Supervised DML (SSDML), which tries to learn a metric using a few labeled examples and abundantly available unlabeled examples. SSDML is important because it is infeasible to manually annotate all the examples present in a large dataset. Surprisingly, with the exception of a few classical approaches that learn a linear Mahalanobis metric, SSDML has not been studied in recent years, and lacks approaches in the deep SSDML scenario. In this paper, we address this challenging problem and revamp SSDML with respect to deep learning. In particular, we propose a stochastic, graph-based approach that first propagates the affinities between pairs of labeled examples to the unlabeled pairs. The propagated affinities are used to mine triplet-based constraints for metric learning. We impose an orthogonality constraint on the metric parameters, as it leads to better performance by avoiding a model collapse.

## Introduction and Motivation

Distance Metric Learning (DML) seeks to learn a discriminative embedding where similar examples are closer, and dissimilar examples are pushed apart. The importance of DML is evidenced by the plethora of recent approaches introduced at the premier conferences in Computer Vision (CV) (Yu and Tao 2019; Wang et al. 2019; Yuan et al. 2019; Sun et al. 2020; Cakir et al. 2019; Movshovitz-Attias et al. 2017; Deng et al. 2019; Qian et al. 2019), as well as Artificial Intelligence (AI) (Dutta, Harandi, and Sekhar 2020; Gu and Ko 2020; Chen et al. 2020; Li et al. 2020; Gong, Yuan, and Bao 2020) in general. In contrast to classification losses that learn a class logit vector, embedding losses in DML capture the relationships among examples. This makes embedding losses more general in nature, because of their flexibility to provide supervisory signals in the form of pairs (Chopra, Hadsell, and LeCun 2005), triplets (Schroff, Kalenichenko, and Philbin 2015; Weinberger, Blitzer, and Saul 2006), tuples (Sohn 2016), etc.

Despite the presence of many exemplary, state-of-the-art DML approaches as mentioned above, they are supervised in nature. In order to provide constraints (pairs/triplets/tuples) for metric learning, they require manual annotations (class labels). For large-scale datasets, it is infeasible to obtain class labels for all the examples. In practical scenarios where it is possible to have a few examples annotated, with an abundance of additional unlabeled data, semi-supervised learning (Oliver et al. 2018; Zhu 2005; Chapelle, Scholkopf, and Zien 2009; Stretcu et al. 2019; Miyato et al. 2018; Iscen et al. 2019) can be applied. Surprisingly, with the exception of a few classical approaches that learn a linear Mahalanobis metric (Liu et al. 2010; Hoi, Liu, and Chang 2010; Niu et al. 2014; Ying et al. 2017; Dutta and Sekhar 2018; Shen, Du, and Li 2016), the problem of Semi-Supervised DML (SSDML) has not been studied in recent years, not to mention the lack of SSDML approaches in the deep DML scenario.
In this paper, we address this challenging problem, and revamp SSDML with respect to deep learning. To the best of our knowledge, deep SSDML is studied here for the first time (this is despite the presence of deep semi-supervised approaches for classification losses). As with any DML technique, our approach covers the following two major aspects of DML:
1. Constraint Mining: appropriately mining constraints among examples (e.g., pairs or triplets), and
2. DML Loss: an appropriate loss formulation to learn a metric using the mined constraints.

In recent years, a huge number of supervised, state-of-the-art DML techniques have been proposed: Tuplet Margin (Yu and Tao 2019), Multi-Similarity (Wang et al. 2019), SNR (Yuan et al. 2019), Circle Loss (Sun et al. 2020), FastAP (Cakir et al. 2019), ArcFace (Deng et al. 2019), SoftTriple (Qian et al. 2019), etc. All of these approaches have made contributions either in terms of constraint mining, or a novel loss. However, (Musgrave, Belongie, and Lim 2020) and (Roth et al. 2020) recently observed that when run under the exact same experimental protocol (network architecture, embedding size, optimizers, etc.), classical metric learning losses involving pairs (Chopra, Hadsell, and LeCun 2005) or triplets (Schroff, Kalenichenko, and Philbin 2015; Weinberger, Blitzer, and Saul 2006) perform on par with (and at times better than) the state-of-the-art loss functions. For this reason, we make use of a classical triplet-based approach (also because the pairwise loss still has the theoretical downside of applying the same distance threshold to all pairs, irrespective of variances in the similarities or dissimilarities). In particular, we make use of an angular variant of a triplet-based metric learning loss, due to certain well-established theoretical benefits (Wang et al. 2017) (e.g., ease of tuning an angle compared to a distance hyperparameter, capture of third-order information by the angular variant, scale and rotation invariance, and so on).

Figure 1: (Best viewed in color) (a) Illustration of our method. The raw image belongs to the MNIST dataset (LeCun et al. 1998). (b) Triplet mining around an anchor $z_a$ (shown in black), within its $k$-neighborhood $\mathcal{N}_k(z_a)$, $k = 4$. Points in blue ($z_1^+, z_2^+$) are more semantically similar (by virtue of propagated affinities) to the anchor than the points in red ($z_3^-, z_4^-$). Hence, they should be pulled closer to the anchor in the learned space, compared to the red ones.

Following are the major contributions of the paper:

Though end-to-end training is widely studied in supervised DML, the same is not the case for SSDML (a few classical SSDML approaches have attempted to learn a linear Mahalanobis metric, but not in an end-to-end fashion). We revisit key linear SSDML baselines, and discuss their deep extensions. Additionally, to address the theoretical limitations of prior SSDML methods, we also propose a novel approach consisting of well-studied components: i) affinity propagation (Liu et al. 2010; Dutta and Sekhar 2018) to propagate information from the limited labeled examples to the unlabeled data using a graph, and ii) an adapted angular variant of a triplet-based metric learning loss (Wang et al. 2017). It should be noted that we formulate a simple approach with well-known components; the superiority and novelty of our approach lies in the design composition, which we establish both empirically and qualitatively.
Constraint mining and scalability: Having propagated the affinities, we propose a novel triplet mining strategy that considers the neighborhood of an example. Also, to make the approach scalable to large datasets, our approach is stochastic in nature (a property absent in earlier classical SSDML methods).

Addressing model collapse: A model collapse could occur in DML, projecting representations of close but distinct examples to a single point (degenerate embeddings). To address this, we impose orthogonality on the metric and establish that it can lead to better embeddings. Arguably, some supervised works (Xie et al. 2018) have studied orthogonality while trying to alleviate class imbalance, achieve compact bases, etc. But our context is different: we explicitly show that it alleviates a collapse. This is particularly important in the semi-supervised case, where we do not have much supervision and a model may easily produce degenerate embeddings. An illustration of our method is shown in Figure 1a.

## Extending Classical SSDML to Deep SSDML

We begin with the introduction of a few representative SSDML methods, while discussing their main principles. Let $X = X_L \cup X_U$ be a given dataset, consisting of a set of labeled examples $X_L = \{z_i \in \mathbb{R}^d\}_{i=1}^{N_L}$ with the associated label set $\{y_i\}_{i=1}^{N_L}$ ($y_i \in \{1, \ldots, C\}$; $C$ is the number of classes), and a set of unlabeled examples $X_U = \{z_j \in \mathbb{R}^d\}_{j=N_L+1}^{N}$. Existing SSDML approaches learn the parameter matrix $M \in \mathbb{R}^{d \times d}$, $M \succeq 0$, of the squared Mahalanobis distance metric $\delta^2_M(z_i, z_j) = (z_i - z_j)^\top M (z_i - z_j)$, for a pair of descriptors $z_i, z_j \in \mathbb{R}^d$ (classically obtained using hand-crafted features). The SSDML approaches can mainly be categorized under two major paradigms: i) entropy minimization (Grandvalet and Bengio 2005), and ii) graph-based.

The SERAPH (Niu et al. 2014; Shen, Du, and Li 2016) approach is the only SSDML representative of the first class, and is expressed as (Niu et al. 2014):

$$\min_{M \succeq 0}\; -\Bigg[ \sum_{\substack{(z_i,z_j):\, z_i,z_j \in X_L \\ y_{ij}=-1 \text{ if } y_i \neq y_j,\; y_{ij}=1 \text{ if } y_i = y_j}} \log p^{M}_{ij}(y_{ij}) \;+\; \mu \sum_{\substack{(z_i,z_j):\, z_i,z_j \in X_U}} \sum_{y \in \{-1,1\}} p^{M}_{ij}(y)\,\log p^{M}_{ij}(y) \Bigg] \;+\; \lambda\,\mathrm{Tr}(M). \tag{1}$$

The first term maximizes the entropy of a conditional probability distribution over the pairwise constraints obtained from labeled data, while the second term minimizes the entropy over the unlabeled data. Here, $p^{M}_{ij}(y_{ij})$ (and correspondingly $p^{M}_{ij}(y)$) denotes a probability parameterized by the distance metric $\delta^2_M(z_i, z_j)$, as follows: $p^{M}_{ij}(y_{ij}) = \frac{1}{1 + \exp\!\big(y_{ij}(\delta^2_M(z_i, z_j) - \eta)\big)}$. The regularizer $\mathrm{Tr}(M)$ ensures a projection sparsity (Niu et al. 2014). $\eta, \mu, \lambda > 0$ are hyperparameters in (1).
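To make the quantities in (1) concrete, the following NumPy sketch (our own illustration under the notation above, not code from SERAPH or from this paper; the function names and the brute-force pair loops are ours) computes the pairwise probability $p^M_{ij}(y)$ and the labeled log-likelihood and unlabeled entropy-style terms for small sets of descriptors.

```python
# Illustrative sketch (ours) of the quantities appearing in Eq. (1).
import numpy as np

def mahalanobis_sq(zi, zj, M):
    """Squared Mahalanobis distance delta^2_M(zi, zj) = (zi - zj)^T M (zi - zj)."""
    d = zi - zj
    return float(d @ M @ d)

def pair_prob(zi, zj, M, y, eta):
    """p^M_ij(y) = 1 / (1 + exp(y * (delta^2_M(zi, zj) - eta)))."""
    return 1.0 / (1.0 + np.exp(y * (mahalanobis_sq(zi, zj, M) - eta)))

def seraph_terms(Z_lab, y_lab, Z_unlab, M, eta):
    """Labeled log-likelihood term and unlabeled sum of p*log(p) (negative entropy)."""
    loglik = 0.0
    for i in range(len(Z_lab)):
        for j in range(i + 1, len(Z_lab)):
            y_ij = 1 if y_lab[i] == y_lab[j] else -1
            loglik += np.log(pair_prob(Z_lab[i], Z_lab[j], M, y_ij, eta) + 1e-12)
    neg_entropy = 0.0
    for i in range(len(Z_unlab)):
        for j in range(i + 1, len(Z_unlab)):
            for y in (-1, 1):
                p = pair_prob(Z_unlab[i], Z_unlab[j], M, y, eta)
                neg_entropy += p * np.log(p + 1e-12)
    return loglik, neg_entropy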
The other category of SSDML approaches, i.e., graph-based ones, has the majority of approaches at its disposal. Here we discuss the most prominent ones. The LRML (Hoi, Liu, and Chang 2010) method is expressed as:

$$\min_{M \succeq 0,\; \log|M| \geq 0}\;\; \gamma_S \sum_{\substack{(z_i,z_j):\, z_i,z_j \in X_L \\ y_i = y_j}} \delta^2_M(z_i, z_j) \;-\; \sum_{\substack{(z_i,z_j):\, z_i,z_j \in X_L \\ y_i \neq y_j}} \delta^2_M(z_i, z_j) \;+\; \mathrm{Tr}(M X L X^\top). \tag{2}$$

The first and second terms in (2) minimize and maximize the pairwise distances among similar and dissimilar pairs from labeled data, respectively. The third term is a Laplacian regularizer that preserves the pairwise distances among all the examples in $X$ (and hence, unlabeled data as well). Here, the columns of $X \in \mathbb{R}^{d \times N}$ represent the descriptors of the elements of $X$, and $L$ is the graph Laplacian matrix obtained using the affinity matrix of the underlying $k$-NN graph.

The recently proposed graph-based ISDML (Ying et al. 2017) approach is a state-of-the-art among SSDML methods. Similar to LRML, it seeks to optimize the pairwise distances between labeled examples. However, it also takes into account the density around an unlabeled example while computing the Laplacian regularizer. The APLLR and APIT methods have also been proposed under a common framework (Dutta and Sekhar 2018) that, instead of naively optimizing the distances among labeled pairs, first computes a prior metric using the labeled data. The APLLR method computes the prior metric by optimizing a log-likelihood ratio, while the APIT method computes the prior metric using an information-theoretic approach. The final metric is learned using affinity propagation (Liu et al. 2010) on unlabeled data while staying close to the prior one.

Extending the above methods for end-to-end learning: Although trivial, the methods discussed above have not been studied in the context of end-to-end deep learning. It is noteworthy that as the losses in these methods are differentiable, we can backpropagate their gradients while using mini-batch based stochastic optimization. Due to the graph-based nature of LRML, ISDML, APLLR and APIT, one could first sample a random partition of the unlabeled examples, and construct a sub-graph along with the limited number of available labeled examples. Then, the mini-batches can be drawn from this partition alone for a number of epochs (over this partition). This process could be iterated over a number of partitions covering the entire dataset. The same partition-based strategy could be used for SERAPH as well, with the only exception that it does not require graph construction. Formally, we can learn a deep embedding that induces a distance metric of the form $\delta^2_M(z_i, z_j)$, such that the descriptors $z_i, z_j \in \mathbb{R}^d$ are obtained using a deep neural network $z: X \rightarrow \mathbb{R}^d$ with parameters $\theta_z$, while simultaneously learning $(\theta_z, M)$ in an end-to-end manner. This general stochastic approach is illustrated in Algorithm 1.

Algorithm 1: Stochastic extension of the SSDML baselines
1: Initialize $\theta_z^0$, $M^0$.
2: for $t \leftarrow 1$ to $T$ do
3:   Fix $\theta_z^{t-1}$ and learn $M^t$ using the given SSDML method.
4:   Fix $M^t$ and learn $\theta_z^t$ by backpropagation using SGD.
5: end for
6: return $\theta_z^T$, $M^T$
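For illustration, a minimal PyTorch rendering of Algorithm 1 is sketched below. This is our own sketch, not the authors' MatConvNet implementation: `fit_metric_ssdml` is a placeholder for whichever classical solver (LRML, SERAPH, ISDML, APLLR, APIT) is being extended, labels below zero are assumed to mark unlabeled examples, and the mini-batch loss used to update $\theta_z$ is only a toy surrogate of the labeled pairwise terms.

```python
import torch

def deep_ssdml_alternating(net, loader, fit_metric_ssdml, T=10, lr=1e-4):
    """Sketch of Algorithm 1: alternate between the classical metric solver and SGD on theta_z."""
    M = None
    for t in range(T):
        # Step 1: fix theta_z^{t-1}; embed the data and learn M^t with the classical SSDML solver.
        with torch.no_grad():
            pairs = [(net(x), y) for x, y in loader]          # y < 0 marks an unlabeled example
            Z = torch.cat([p[0] for p in pairs])
            Y = torch.cat([p[1] for p in pairs])
        M = fit_metric_ssdml(Z, Y)                            # user-supplied: LRML / SERAPH / ...

        # Step 2: fix M^t; update theta_z^t by backpropagation with SGD on mini-batches.
        opt = torch.optim.SGD(net.parameters(), lr=lr)
        for x, y in loader:
            z, lab = net(x), (y >= 0)
            if lab.sum() < 2:
                continue
            zl, yl = z[lab], y[lab]
            diff = zl[:, None, :] - zl[None, :, :]            # (n, n, d) pairwise differences
            dist = torch.einsum('ijd,de,ije->ij', diff, M, diff)
            same = (yl[:, None] == yl[None, :]).float()
            # toy surrogate of the labeled terms: pull similar pairs, push dissimilar pairs
            loss = (same * dist).mean() - ((1.0 - same) * dist).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return net, M
```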
Limitations of existing SSDML methods: Having proposed stochastic extensions of the existing classical SSDML techniques LRML, SERAPH, ISDML, APLLR and APIT, we now discuss their limitations. A fundamental weakness of the LRML method is its use of a naive Laplacian regularizer: $\mathrm{Tr}(M X L X^\top) = \frac{1}{2}\sum_{i,j=1}^{N} \delta^2_M(z_i, z_j)\, W_{ij}$. Here, $W_{ij}$ refers to the affinity between the $i$-th and $j$-th examples, and is computed directly using the distances among the initial representations. Hence, the distance $\delta^2_M(z_i, z_j)$ in the learned space could be affected by a poor initial $W_{ij}$. The ISDML method scales the affinity terms by considering the densities around each individual example. However, both of these techniques are prone to being affected by a poor initial representation. Moreover, they fail to adapt their affinities using the additionally present labeled data information. The APLLR and APIT techniques make use of an affinity propagation principle (Liu et al. 2010) to propagate the information from the labeled pairs to the unlabeled ones. This leads to a more informative affinity matrix. However, their dependency on a pre-computed prior metric sometimes has adverse effects. Especially in scenarios with an extremely limited amount of labeled data, the prior parametric matrix is often close to singular, and leads to a poor metric. Furthermore, despite the affinity propagation, they do not mine informative constraints for metric learning (which is very crucial). This fails to fully leverage the benefits of affinity propagation. On the other hand, the SERAPH method is based on the entropy minimization principle. Although this is a good principle due to its capability of minimizing the class overlap, the SERAPH method takes a long time to converge, and often gets trapped in poor local minima.

To address these limitations of the existing SSDML methods, we propose a new method. However, instead of learning the metric with respect to the matrix $M$ ($O(d^2)$ parameters), we utilize the property that $M \succeq 0$ and factorize it as $M = L L^\top$, with $L \in \mathbb{R}^{d \times l}$, $l \leq d$, and learn the metric with respect to $L$ with fewer parameters ($O(dl)$). The pseudocode of our method is the same as Algorithm 1 (after replacing $M$ with $L$). We now discuss our proposed method.

## Proposed Approach

### Triplet Mining using Affinity Propagation

In the semi-supervised DML setting, the first task is to mine informative constraints, and then use an appropriate loss function. As the first stage, we propose a novel method for mining constraints using a graph-based technique that leverages affinity propagation (Liu et al. 2010). Let $X_U^{(p)}$ be a randomly selected partition of the unlabeled data. We construct a $k$-NN graph using $X_L \cup X_U^{(p)}$, such that the nodes represent the examples, and edge weights denote the affinities (or similarities) among the examples. The initial affinity matrix $W^0 \in \mathbb{R}^{(N_L+N_p) \times (N_L+N_p)}$ is defined as follows: i) $W^0_{ij} = +1$, if $i \neq j$ and $\exists\, y_i, y_j$ s.t. $y_i = y_j$, ii) $W^0_{ij} = -1$, if $i \neq j$ and $\exists\, y_i, y_j$ s.t. $y_i \neq y_j$, iii) $W^0_{ij} = +1$, if $i = j$, and iv) $W^0_{ij} = 0$, otherwise. Here, $N_L$ and $N_p$ are the respective cardinalities of $X_L$ and $X_U^{(p)}$. The neighborhood structure of the $k$-NN graph can be indicated using a matrix $Q \in \mathbb{R}^{(N_L+N_p) \times (N_L+N_p)}$ defined as: i) $Q_{ij} = 1/k$, if $z_j \in \mathcal{N}_k(z_i)$, and ii) $Q_{ij} = 0$, otherwise. Here, $\mathcal{N}_k(z_i)$ is the set of $k$ nearest neighbor examples of $z_i$. The flow of information from the non-zero entries of $W^0$ (representing the labeled pairs) to the zero entries (representing the unlabeled pairs) can be performed using a closed-form expression described by Liu et al. (Liu et al. 2010), as: $\bar{W} = (1-\gamma)(I_{N_L+N_p} - \gamma Q)^{-1} W^0$. This step is called affinity propagation. It essentially performs a random walk to propagate the affinities, where $0 < \gamma < 1$ is a hyperparameter weighting the affinities obtained at the current step against the initial ones. As $N_L + N_p \ll N$, we do not encounter any difficulty while scaling up to large datasets. To obtain a symmetric affinity matrix $W$, we perform a final symmetrization step as follows: $W_{ij} = (\bar{W}_{ij} + \bar{W}_{ji})/2$.
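The graph construction and the closed-form propagation step can be sketched in a few lines of NumPy (our illustration under the notation above; the dense linear solve over the $N_L + N_p$ nodes is acceptable precisely because $N_L + N_p \ll N$).

```python
# Illustrative sketch (ours): initial affinities W^0, k-NN structure Q,
# closed-form propagation W_bar = (1 - gamma) (I - gamma Q)^{-1} W^0, then symmetrization.
import numpy as np

def propagate_affinities(Z, labels, k=10, gamma=0.99):
    """Z: (n, d) descriptors of X_L union X_U^(p); labels[i] = class id, or -1 if unlabeled."""
    n = Z.shape[0]
    # Initial affinities W^0 from the labeled pairs (and the diagonal).
    W0 = np.zeros((n, n))
    np.fill_diagonal(W0, 1.0)
    lab = np.where(labels >= 0)[0]
    for i in lab:
        for j in lab:
            if i != j:
                W0[i, j] = 1.0 if labels[i] == labels[j] else -1.0
    # k-NN structure matrix Q: Q_ij = 1/k if z_j is among the k nearest neighbours of z_i.
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    Q = np.zeros((n, n))
    for i in range(n):
        Q[i, np.argsort(d2[i])[:k]] = 1.0 / k
    # Closed-form propagation (random-walk style), followed by symmetrization.
    W_bar = (1.0 - gamma) * np.linalg.solve(np.eye(n) - gamma * Q, W0)
    return 0.5 * (W_bar + W_bar.T)
```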
The finally obtained symmetric affinity matrix $W$ is used to mine triplets for metric learning. In doing so, we take into account the $k$-neighborhood $\mathcal{N}_k(z_a)$ of an example $z_a \in X_L \cup X_U^{(p)}$ that we consider as an anchor (Figure 1b). Let $\mathcal{N}_W(z_a) = \{z_1^+, z_2^+, \ldots, z_{k/2}^+, z_{k/2+1}^-, z_{k/2+2}^-, \ldots, z_k^-\}$ be the $k$ neighboring examples of $z_a$, sorted in descending order of their propagated affinities w.r.t. $z_a$, i.e., $W(z_a, z_1^+) > \cdots > W(z_a, z_{k/2}^+) > W(z_a, z_{k/2+1}^-) > \cdots > W(z_a, z_k^-)$. Essentially, $\mathcal{N}_W(z_a)$ is simply a sorted version of $\mathcal{N}_k(z_a)$. As the obtained affinities are an indication of the semantic similarities among examples, we take them as a guidance to form triplets. Given an anchor $z_a$, we can intuitively consider an example $z_i^+$ with a higher affinity towards $z_a$ as a positive, and another example $z_j^-$ with a lower affinity towards $z_a$ as a negative, and form a triplet $(z_a, z_i^+, z_j^-)$. By considering the first half of the examples in the sorted neighborhood $\mathcal{N}_W(z_a)$ as positives, and the remaining half as negatives, we can form the following triplets: $(z_a, z_1^+, z_{k/2+1}^-), (z_a, z_2^+, z_{k/2+2}^-), \ldots, (z_a, z_{k/2}^+, z_k^-)$. One may select the set of anchors from the entire $X_L \cup X_U^{(p)}$, or by seeking the modes of the graph, without loss of generality.

### Triplet-based Orthogonality-Promoting SSDML

Given $X_L \cup X_U^{(p)}$ and the corresponding $W$, assume that we have obtained a triplet set $T_p = \bigcup_b T_p^{(b)}$. Here, $T_p^{(b)}$ is a mini-batch of $T_b$ triplets, and $b \in [1, \ldots, \lfloor |T_p| / T_b \rfloor]$. Let $T_p^{(b)} = \{(z_i, z_i^+, z_i^-)\}_{i=1}^{T_b}$. Then, we propose a smooth objective to learn the parameters $(L, \theta_z)$ of our deep metric as follows:

$$\min_{L, \theta_z}\; J_{\mathrm{metric}} = \sum_{i=1}^{T_b} \log\big(1 + \exp(m_i)\big). \tag{3}$$

Here, $m_i = \delta^2_L(z_i, z_i^+) - 4\tan^2\alpha\; \delta^2_L(z_i^-, z^i_{\mathrm{avg}})$, with $z^i_{\mathrm{avg}} = (z_i + z_i^+)/2$, and $\delta^2_L(z_i, z_j) = (z_i - z_j)^\top L L^\top (z_i - z_j)$. $m_i$ tries to pull the anchor $z_i$ and the positive $z_i^+$ together, while moving the negative $z_i^-$ away from the mean $z^i_{\mathrm{avg}}$, with respect to an angle $\alpha > 0$ at the negative $z_i^-$ (Wang et al. 2017). Given that $J_{\mathrm{metric}}$ in (3) is fully differentiable, we can backpropagate the gradients to learn the parameters $\theta_z$ using SGD. This helps us integrate our method within an end-to-end deep framework using Algorithm 1 (except that we learn with respect to $L$ instead of $M$).

Additionally, we constrain $L \in \mathbb{R}^{d \times l}$, $l \leq d$, to be an orthogonal matrix, i.e., $L^\top L = I_l$. This is because, in contrast to other regularizers that only constrain the values of the elements of a parametric matrix (like $l_1$ or $l_2$ regularizers), orthogonality omits redundant correlations among the different dimensions, thereby removing redundancy in the metric that could hurt the generalization ability due to overfitting (Xie et al. 2018). We further show that this avoids a model collapse (degenerate embeddings of distinct, close examples collapsing to a singularity). Theoretically, this is by virtue of the Johnson-Lindenstrauss Lemma (Dasgupta and Gupta 2003), which ensures the following near-isometry property of the embedding: $(1-\epsilon)\,\lVert z_i - z_j \rVert_2^2 \;\leq\; \frac{d}{l}\,\lVert L^\top z_i - L^\top z_j \rVert_2^2 \;\leq\; (1+\epsilon)\,\lVert z_i - z_j \rVert_2^2$, for $0 < \epsilon < 1$.

Given a mini-batch of triplets $T_p^{(b)} = \{(z_i, z_i^+, z_i^-)\}_{i=1}^{T_b}$, we provide an efficient matrix-based implementation to learn $L$ (assuming a fixed $\theta_z$) in the supplementary. Here, we present the computational time complexity of the major steps involved: i) Cost function: computing the cost requires four matrix multiplications, resulting in a complexity of $O(l d T_b)$. Next, the transposes of the matrix products need to be computed, requiring $O(l T_b)$. Finally, the sum of losses across all triplets can be computed using a matrix trace operation, requiring $O(T_b)$. ii) Gradients: the gradient with respect to $L$ requires the following computations: transposes requiring $O(d T_b)$, and outer products requiring $O(d^2 T_b)$. The subsequent products require $O(d^2 l)$. Hence, the overall complexity is $O(d T_b + d^2 T_b + d^2 l)$. Our proposed algorithm is linear in the number of triplets in a mini-batch, i.e., $T_b$, which is usually low. The complexity of our algorithm is either linear or quadratic in the original dimensionality $d$ (which in practice is easily controllable within a neural network). It should be noted that, in contrast to existing SSDML approaches, our method also enjoys a lower theoretical computational complexity. For example, LRML has a complexity of $O(d^3)$, while APLLR and APIT are of the order $O(d^3 + N^3)$.
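As a concrete illustration of objective (3) and the orthogonality constraint, the following PyTorch sketch (ours, not the released implementation) computes the per-triplet terms $m_i$ under the factorization $M = LL^\top$ and takes one update on $L$. The paper reports optimizing $L$ on the manifold with Manopt/CGD, so the plain gradient step followed by a QR re-orthonormalization below is a simplification, and the learning rate is an arbitrary placeholder.

```python
import math
import torch
import torch.nn.functional as F

def delta_sq(u, v, L):
    # delta^2_L(u, v) = (u - v)^T L L^T (u - v), batched over rows of u and v
    p = (u - v) @ L
    return (p * p).sum(dim=1)

def angular_triplet_loss(anchor, pos, neg, L, alpha_deg=45.0):
    # m_i = delta^2_L(z_i, z_i^+) - 4 tan^2(alpha) * delta^2_L(z_i^-, z_avg^i)
    z_avg = 0.5 * (anchor + pos)
    tan2 = math.tan(math.radians(alpha_deg)) ** 2
    m = delta_sq(anchor, pos, L) - 4.0 * tan2 * delta_sq(neg, z_avg, L)
    return F.softplus(m).sum()                       # sum_i log(1 + exp(m_i))

def step_with_orthogonal_L(L, anchor, pos, neg, lr=1e-3, alpha_deg=45.0):
    L = L.clone().requires_grad_(True)
    loss = angular_triplet_loss(anchor, pos, neg, L, alpha_deg)
    loss.backward()
    with torch.no_grad():
        L -= lr * L.grad
        Q, _ = torch.linalg.qr(L)                    # re-orthonormalize: Q^T Q = I_l
    return Q.detach(), float(loss)
```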
## Experiments

In this section, we evaluate our proposed method in terms of its effectiveness in clustering and retrieval tasks on a number of benchmark datasets.

Table 1: Comparison against state-of-the-art SSDML approaches on MNIST, Fashion-MNIST and CIFAR-10.

MNIST:

| Method | NMI | R@1 | R@2 | R@4 | R@8 |
|---|---|---|---|---|---|
| Initial | 17.4 | 86.5 | 92.4 | 95.9 | 97.6 |
| Deep-LRML | 48.3 | 88.7 | 93.2 | 95.9 | 97.6 |
| Deep-SERAPH | 45.3 | 92.5 | 95.7 | 97.6 | 98.7 |
| Deep-ISDML | 44.1 | 92.1 | 95.7 | 97.5 | 98.6 |
| Deep-APLLR | 30.5 | 57.6 | 71.1 | 82.1 | 90.1 |
| Deep-APIT | 31.7 | 87.6 | 92.9 | 96.0 | 97.9 |
| Ours | 47.5 | 93.9 | 96.6 | 98.2 | 98.9 |

Fashion-MNIST:

| Method | NMI | R@1 | R@2 | R@4 | R@8 |
|---|---|---|---|---|---|
| Initial | 32.6 | 71.9 | 82.0 | 89.5 | 94.8 |
| Deep-LRML | 53.2 | 75.3 | 83.9 | 90.2 | 95.0 |
| Deep-SERAPH | 53.0 | 75.9 | 85.1 | 91.5 | 95.4 |
| Deep-ISDML | 51.3 | 74.4 | 84.2 | 90.7 | 95.1 |
| Deep-APLLR | 38.9 | 59.5 | 73.3 | 84.1 | 91.4 |
| Deep-APIT | 37.3 | 69.4 | 80.1 | 88.5 | 94.1 |
| Ours | 52.1 | 77.6 | 86.0 | 91.8 | 95.6 |

CIFAR-10:

| Method | NMI | R@1 | R@2 | R@4 | R@8 |
|---|---|---|---|---|---|
| Initial | 20.4 | 38.2 | 54.9 | 71.0 | 83.6 |
| Deep-LRML | 16.0 | 40.2 | 56.3 | 71.5 | 83.7 |
| Deep-SERAPH | 21.3 | 39.5 | 56.2 | 71.1 | 83.7 |
| Deep-ISDML | 20.0 | 36.4 | 52.4 | 68.5 | 82.1 |
| Deep-APLLR | 13.3 | 24.4 | 40.0 | 58.8 | 76.4 |
| Deep-APIT | 11.4 | 26.0 | 41.4 | 59.6 | 76.4 |
| Ours | 25.3 | 41.4 | 57.3 | 72.6 | 84.9 |

Table 2: Comparison against state-of-the-art SSDML approaches on fine-grained datasets.

CUB-200:

| Method | NMI | R@1 | R@2 | R@4 | R@8 |
|---|---|---|---|---|---|
| Initial | 34.3 | 31.7 | 42.2 | 55.4 | 67.9 |
| Deep-LRML | 49.9 | 34.9 | 45.0 | 55.4 | 65.7 |
| Deep-SERAPH | 50.5 | 39.7 | 52.2 | 64.4 | 76.5 |
| Deep-ISDML | 50.6 | 43.7 | 55.7 | 67.6 | 78.4 |
| Deep-APLLR | 38.9 | 25.7 | 36.6 | 48.7 | 61.9 |
| Deep-APIT | 52.1 | 42.2 | 54.5 | 66.7 | 78.2 |
| Ours | 54.0 | 44.8 | 56.9 | 69.1 | 79.9 |

Cars-196:

| Method | NMI | R@1 | R@2 | R@4 | R@8 |
|---|---|---|---|---|---|
| Initial | 23.7 | 24.2 | 32.8 | 43.4 | 54.9 |
| Deep-LRML | 40.5 | 33.2 | 41.2 | 49.4 | 58.3 |
| Deep-SERAPH | 37.2 | 34.8 | 46.7 | 59.4 | 71.7 |
| Deep-ISDML | 25.8 | 30.6 | 40.8 | 52.2 | 63.2 |
| Deep-APLLR | 33.2 | 33.0 | 43.6 | 55.6 | 67.9 |
| Deep-APIT | 37.3 | 38.5 | 50.7 | 62.7 | 74.9 |
| Ours | 40.7 | 45.7 | 58.1 | 69.5 | 80.4 |

Table 3: Comparison of our method against state-of-the-art deep unsupervised methods on fine-grained datasets.

CUB-200:

| Method | NMI | R@1 | R@2 | R@4 | R@8 |
|---|---|---|---|---|---|
| Exemplar | 45.0 | 38.2 | 50.3 | 62.8 | 75.0 |
| Rotation | 49.1 | 42.5 | 55.8 | 68.6 | 79.4 |
| NCE | 45.1 | 39.2 | 51.4 | 63.7 | 75.8 |
| DeepCluster | 53.0 | 42.9 | 54.1 | 65.6 | 76.2 |
| Synthetic | 53.4 | 43.5 | 56.2 | 68.3 | 79.1 |
| Ours | 54.0 | 44.8 | 56.9 | 69.1 | 79.9 |

Cars-196:

| Method | NMI | R@1 | R@2 | R@4 | R@8 |
|---|---|---|---|---|---|
| Exemplar | 35.4 | 36.5 | 48.1 | 59.2 | 71.0 |
| Rotation | 32.7 | 33.3 | 44.6 | 56.4 | 68.5 |
| NCE | 35.6 | 37.5 | 48.7 | 59.8 | 71.5 |
| DeepCluster | 38.5 | 32.6 | 43.8 | 57.0 | 69.5 |
| Synthetic | 37.6 | 42.0 | 54.3 | 66.0 | 77.2 |
| Ours | 40.7 | 45.7 | 58.1 | 69.5 | 80.4 |

Datasets: Following recent literature, the benchmark datasets that have been used are as follows.

MNIST (LeCun et al. 1998): A benchmark dataset consisting of 70000 gray-scale images of handwritten digits. Each image is of 28×28 pixels. There are 60000 training images and 10000 test images in the standard split.

Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017): A dataset similar to MNIST, but consisting of images from 10 categories of fashion products. There are 60000 training images and 10000 test images in the standard split.

CIFAR-10 (Krizhevsky, Hinton et al. 2009): This dataset consists of colour images of 32×32 pixels, containing animal or vehicle objects from 10 different categories. There are 50000 training images and 10000 test images.

CUB-200 (Welinder et al. 2010): This dataset consists of images of 200 species of birds, with the first 100 species used for training (5864 examples) and the remaining for testing (5924 examples).
Cars-196 (Krause et al. 2013): It consists of images of cars belonging to 196 models. The first 98 models, containing 8054 images, are used for training. The remaining 98 models, containing 8131 images, are used for testing.

The MNIST, Fashion-MNIST and CIFAR-10 datasets are widely used benchmarks with a sufficiently large number of images for a comparative evaluation of different approaches. The CUB-200 and Cars-196 datasets are well-known for their use in Fine-Grained Visual Categorization (FGVC), and have huge intra-class variances and inter-class similarities.

Implementation details: We adapted the network architectures in the MatConvNet tool (Vedaldi and Lenc 2015). For the MNIST and Fashion-MNIST datasets, the network for MNIST has been adapted as: Conv1(5×5, 20) → max-pool → Conv2(5×5, 50) → max-pool → Conv3(4×4, 500) → ReLU → FC(500×128) → $\ell_2$ → L(128×64). For CIFAR-10, we used the following adapted network: Conv1(5×5, 32) → max-pool → ReLU → Conv2(5×5, 32) → ReLU → avg-pool → Conv3(5×5, 64) → ReLU → avg-pool → Conv4(4×4, 64) → ReLU → $\ell_2$ → L(64×32). For our method, we set $\gamma = 0.99$ in the affinity propagation step, $k = 10$ in the $k$-NN graph, $\alpha = 40^\circ$ in (3), and an initial learning rate of $10^{-4}$. For MNIST and Fashion-MNIST, we choose 100 labeled examples (10 per class), while for CIFAR-10, we choose 1000 labeled examples (100 per class). We sample a random subset of 9k unlabeled examples and use it along with the labeled data to mine triplets. For each random subset, we run our method for 10 epochs (with a mini-batch size of 100 triplets). In total, we run for a maximum of 50 epochs and choose the best model from across all runs. For MNIST and Fashion-MNIST, we train upon randomly initialized networks. For CIFAR-10, we observed a better performance by pretraining with the labeled examples (by replacing the L layer with softmax) for 30 epochs, and then fine-tuning using our loss for 50 epochs. For all datasets, the graph has been constructed using the $\ell_2$-normalized representations obtained just before L.

For the FGVC task, the GoogLeNet (Szegedy et al. 2015) architecture pretrained on ImageNet (Russakovsky et al. 2015) has been used as the backbone CNN, using MatConvNet (Vedaldi and Lenc 2015). We used the Regional Maximum Activation of Convolutions (R-MAC) (Tolias, Sicre, and Jégou 2016) features right before the average-pool layer, aggregated over three input scales (512, $512/\sqrt{2}$, 256).

Figure 2: Left: Convergence behaviour of our method on the Cars-196 dataset (Krause et al. 2013) over the first 100 epochs. Middle: Ablation study showing the sensitivity of our method towards $\alpha$ on the Cars-196 dataset (Krause et al. 2013), summarized below. Right: t-SNE embeddings for the test examples of MNIST (top row) and CIFAR-10 (bottom row). The left column represents the embeddings obtained right after random initialization. The remaining columns show the embeddings obtained by our method: without the orthogonality constraint on L (middle column) and with the orthogonality constraint on L (rightmost column). Orthogonality leads to better embeddings (see Table 4).

| $\alpha$ | NMI | R@1 | R@2 | R@4 | R@8 |
|---|---|---|---|---|---|
| 35 | 40.5 | 45.0 | 57.2 | 69.0 | 79.8 |
| 40 | 40.3 | 45.5 | 57.8 | 69.1 | 80.0 |
| 45 | 40.7 | 45.7 | 58.1 | 69.5 | 80.4 |
| 55 | 40.4 | 45.7 | 57.9 | 69.4 | 80.1 |

Figure 3: Qualitative comparison of retrieval performance against state-of-the-art SSDML approaches on the Cars-196 dataset. For a retrieved image, a red box denotes an incorrect retrieval (different class from the query), and a green box denotes a correct retrieval (same class as the query).
We choose five labeled examples per class. For our method, we set $\gamma = 0.99$, $k = 50$, $\alpha = 45^\circ$ in (3), and an embedding size of 128. We take the entire dataset without partition-based sampling and run for a maximum of 200 epochs (mini-batch size of 100 triplets), and choose the best model. Specifically, using our triplet mining strategy, we mined 146600 triplets for the CUB dataset, and 201350 triplets for the Cars dataset. In all experiments, we fix a validation dataset by sampling 15% of the examples from each class of the training data. This validation dataset is used to tune the hyperparameters without taking any feedback from the test data. Note that to learn L with orthogonality constraints we made use of the Manopt (Boumal et al. 2014) tool with CGD (max iter = 10, with all other default settings).

Compared state-of-the-art baseline approaches: We compare our proposed SSDML method against the following baseline techniques.

Deep-LRML: The stochastic extension of the classical LRML method (Hoi, Liu, and Chang 2010) discussed earlier, obtained by making use of our proposed stochastic Algorithm 1. It follows the min-max principle for the labeled data (minimizing distances between similar pairs, while maximizing distances between dissimilar pairs). A Laplacian regularizer is used to capture information from the unlabeled data.

Deep-ISDML: Stochastic extension of the ISDML (Ying et al. 2017) method. It is similar to LRML, but makes use of densities around an example to adapt the Laplacian regularizer.

Deep-SERAPH: Stochastic extension of the SERAPH (Niu et al. 2014) method that makes use of the entropy minimization principle.

Deep-APLLR: Stochastic extension of the APLLR (Dutta and Sekhar 2018) method. It makes use of a Log-Likelihood Ratio (LLR) based prior metric. The affinity propagation principle is used to propagate information from the labeled pairs to the unlabeled ones, and to adapt the Laplacian (but with no triplet mining like ours).

Deep-APIT: Stochastic extension of the APIT (Dutta and Sekhar 2018) method. It makes use of an information-theoretic prior metric. The affinity propagation principle is used to adapt the Laplacian, as in APLLR.

Exemplar (TPAMI'16): This method (Dosovitskiy et al. 2016) attempts to learn the parameters associated with certain elementary transformation operations like translation, scaling, rotation, contrast, and colorization applied on random image patches.

RotationNet (ICLR'18): This method (Gidaris, Singh, and Komodakis 2018) aims to learn representations of data that can accurately capture the information present in an image despite any rotation of the subject.

NCE (CVPR'18): This method (Wu et al. 2018) aims at bringing augmentations of an example together while moving away augmentations of different examples.

DeepCluster (ECCV'18): This method (Caron et al. 2018) aims to jointly cluster metric representations while learning pseudo-labels for the data in an end-to-end manner.

Synthetic (AAAI'20): This method (Dutta, Harandi, and Sekhar 2020) learns a metric using synthetic constraints generated in an adversarial manner.

Triplet (CVPR'15): This method (Schroff, Kalenichenko, and Philbin 2015) learns a metric using a standard triplet loss.

Angular (ICCV'17): This method (Wang et al. 2017) learns a metric using an angular loss on a triplet of examples.

Observations: For comparing the approaches, we first learn a metric with the training data, and use it to obtain the test embeddings. The methods are compared based on their clustering (w.r.t. NMI) and retrieval (w.r.t. Recall@K, K = 1, 2, 4, 8) performances on the test embeddings. NMI is defined as the ratio of the mutual information to the average of the entropy of the clusters and the entropy of the ground-truth class labels. The Recall@K metric gives the percentage of test examples that have at least one of their K nearest neighbors from the same class. A higher value of these metrics indicates a better performance.
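For reference, a short sketch (ours) of how these two evaluation metrics are typically computed on the test embeddings. The clustering used for NMI is assumed here to be k-means with the number of ground-truth classes, which the paper does not spell out.

```python
# Illustrative sketch (ours) of Recall@K and NMI on test embeddings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def recall_at_k(emb, labels, ks=(1, 2, 4, 8)):
    """Fraction of queries with at least one same-class example among their K nearest neighbours."""
    # Dense pairwise distances; fine for moderate test-set sizes.
    d2 = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                  # exclude the query itself
    order = np.argsort(d2, axis=1)
    out = {}
    for k in ks:
        hits = (labels[order[:, :k]] == labels[:, None]).any(axis=1)
        out[k] = hits.mean()
    return out

def nmi_of_embeddings(emb, labels, n_classes):
    """NMI between a k-means clustering of the embeddings and the ground-truth labels."""
    preds = KMeans(n_clusters=n_classes, n_init=10).fit_predict(emb)
    return normalized_mutual_info_score(labels, preds)
```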
The methods Deep-LRML, Deep-ISDML, Deep-SERAPH, Deep-APLLR and Deep-APIT are semi-supervised in nature. They have been chosen as direct counterparts to our proposed SSDML method. These baselines, including ours, make use of both labeled (limited in number) and unlabeled examples to learn a metric. As seen in Tables 1-2, we outperform the baselines on all the datasets, which vary in size and complexity. In Figure 3, we also show some qualitative retrieval comparisons of our method against the baselines on the Cars-196 dataset (successful retrievals are shown in green and failures in red). Our method performs fairly better retrieval than the baselines.

The Exemplar, RotationNet, NCE, DeepCluster and Synthetic techniques are unsupervised in nature, i.e., they learn a metric using only unlabeled data. As shown in Table 3, the better performance of our method in contrast to these approaches demonstrates the benefit of the additionally available labeled examples while learning a metric. Lastly, the Triplet and Angular methods are fully-supervised baselines that learn only from labeled examples. As shown in Table 5, the better performance of our method in contrast to these approaches demonstrates the benefit of leveraging additional unlabeled examples via affinity propagation and our mining technique. On the other hand, the fully-supervised baselines overfit to the training data because only a limited number of labeled examples is available.

Ablation studies: We now perform a few ablation studies to understand the different components of our method. Table 4 reports the performance of our method for two cases: i) w/o orth: L is not orthogonal, and ii) w/ orth: L is orthogonal. We observed that with orthogonality, the performance is better. To observe this qualitatively, we plot the t-SNE embeddings of the test examples of MNIST and CIFAR-10 in Figure 2-(Right). We observe that with orthogonality the embeddings are relatively better separated. Especially, in contrast to the simpler MNIST, orthogonality seems to have a more profound effect on the complex CIFAR-10 dataset.

Table 4: Quantitative comparison of the performance of our method, without (w/o orth) and with orthogonality (w/ orth) on the metric parameters.

| Dataset | Method | NMI | R@1 | R@2 | R@4 | R@8 |
|---|---|---|---|---|---|---|
| MNIST | w/o orth | 42.9 | 91.5 | 95.3 | 97.3 | 98.6 |
| MNIST | w/ orth | 47.5 | 93.9 | 96.6 | 98.2 | 98.9 |
| Fashion-MNIST | w/o orth | 50.3 | 73.3 | 82.4 | 89.7 | 94.4 |
| Fashion-MNIST | w/ orth | 52.1 | 77.6 | 86.0 | 91.8 | 95.6 |
| CIFAR-10 | w/o orth | 19.0 | 37.4 | 54.2 | 69.7 | 83.4 |
| CIFAR-10 | w/ orth | 25.3 | 41.4 | 57.3 | 72.6 | 84.9 |

Table 5: Comparison of our method against supervised deep metric learning baselines on CUB.

| Method | R@1 | R@2 | R@4 | R@8 |
|---|---|---|---|---|
| Triplet | 42.6 | 55.2 | 66.9 | 77.6 |
| Angular | 41.4 | 54.5 | 67.1 | 78.4 |
| Ours | 44.8 | 56.9 | 69.1 | 79.9 |

As conjectured, without orthogonality one may learn degenerate embeddings due to a model collapse. Additionally, for the Cars-196 dataset, we show the convergence behaviour of our method for the first 100 epochs in Figure 2-(Left), and the sensitivity of our method towards $\alpha$ in Figure 2-(Middle). The performance is fairly stable in the range $35^\circ$-$55^\circ$.
Why does our method address the limitations of the existing SSDML methods, and perform better?
1. LRML / ISDML vs. Ours: Both the LRML and ISDML methods define their affinity matrices directly using the distances among the initial representations. If the initial affinities are poor, the learned metric would be poor as well. On the other hand, our affinity matrix is adapted by the affinity propagation principle, while also leveraging the labeled data information.
2. APLLR / APIT vs. Ours: Although APLLR and APIT make use of affinity propagation, they do not mine constraints using the enriched affinity matrix information, whereas we do. While their prior metric may be singular and hence poor, our method has no such dependency on a pre-computed metric.
3. SERAPH vs. Ours: In contrast to SERAPH, our method has a stronger regularizer by virtue of orthogonality.

## Conclusions

In this paper, we revisit and revamp the important problem of Semi-Supervised Distance Metric Learning (SSDML) in the end-to-end deep learning paradigm. We discuss a simple stochastic approach to extend classical SSDML methods to the deep SSDML setting. Owing to the theoretical limitations of the existing SSDML techniques, we propose a method to address these limitations. While the components of our method have been studied earlier, their composition has not been explored before. We show that by using this design composition, the overall approach can outperform existing approaches through the collective use of some of the best practices. In short, the following are our major contributions:
1. Extension of classical SSDML approaches for end-to-end stochastic deep learning, with the proposal of a new method,
2. A novel triplet mining strategy leveraging graph-based affinity propagation, and
3. Adaptation of an angular variant of a metric learning loss with orthogonality imposed on the metric parameters to avoid a model collapse.

## Acknowledgements

The work was initiated when UKD was at IIT Madras. We would like to thank the anonymous reviewers for their efforts in reviewing our work and providing valuable comments to polish our paper.

## References

Boumal, N.; Mishra, B.; Absil, P.-A.; and Sepulchre, R. 2014. Manopt, a Matlab toolbox for optimization on manifolds. The Journal of Machine Learning Research 15(1): 1455–1459.
Cakir, F.; He, K.; Xia, X.; Kulis, B.; and Sclaroff, S. 2019. Deep metric learning to rank. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1861–1870.
Caron, M.; Bojanowski, P.; Joulin, A.; and Douze, M. 2018. Deep clustering for unsupervised learning of visual features. In Proc. of European Conference on Computer Vision (ECCV), 132–149.
Chapelle, O.; Scholkopf, B.; and Zien, A. 2009. Semi-supervised learning. IEEE Transactions on Neural Networks 20(3): 542–542.
Chen, Z.; Gong, M.; Xu, Y.; Wang, C.; Zhang, K.; and Du, B. 2020. Compressed Self-Attention for Deep Metric Learning. In Proc. of Association for the Advancement of Artificial Intelligence (AAAI), 3561–3568.
Chopra, S.; Hadsell, R.; and LeCun, Y. 2005. Learning a similarity metric discriminatively, with application to face verification. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, 539–546. IEEE.
Dasgupta, S.; and Gupta, A. 2003. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms 22(1): 60–65.
Deng, J.; Guo, J.; Xue, N.; and Zafeiriou, S. 2019. ArcFace: Additive angular margin loss for deep face recognition. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4690–4699.
Dosovitskiy, A.; Fischer, P.; Springenberg, J.; Riedmiller, M.; and Brox, T. 2016. Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38(9): 1734–1747.
Dutta, U. K.; Harandi, M.; and Sekhar, C. C. 2020. Unsupervised Metric Learning with Synthetic Examples. In Proc. of Association for the Advancement of Artificial Intelligence (AAAI).
Dutta, U. K.; and Sekhar, C. C. 2018. Affinity Propagation Based Closed-Form Semi-supervised Metric Learning Framework. In Proc. of International Conference on Artificial Neural Networks (ICANN), 556–565. Springer.
Gidaris, S.; Singh, P.; and Komodakis, N. 2018. Unsupervised Representation Learning by Predicting Image Rotations. In Proc. of International Conference on Learning Representations (ICLR).
Gong, X.; Yuan, D.; and Bao, W. 2020. Online Metric Learning for Multi-Label Classification. In Proc. of Association for the Advancement of Artificial Intelligence (AAAI), 4012–4019.
Grandvalet, Y.; and Bengio, Y. 2005. Semi-supervised learning by entropy minimization. In Proc. of Neural Information Processing Systems (NeurIPS), 529–536.
Gu, G.; and Ko, B. 2020. Symmetrical Synthesis for Deep Metric Learning. In Proc. of Association for the Advancement of Artificial Intelligence (AAAI).
Hoi, S. C.; Liu, W.; and Chang, S.-F. 2010. Semi-supervised distance metric learning for collaborative image retrieval and clustering. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 6(3): 18.
Iscen, A.; Tolias, G.; Avrithis, Y.; and Chum, O. 2019. Label propagation for deep semi-supervised learning. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5070–5079.
Krause, J.; Stark, M.; Deng, J.; and Fei-Fei, L. 2013. 3d object representations for fine-grained categorization. In Proc. of IEEE International Conference on Computer Vision Workshops (ICCVW), 554–561.
Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.
LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11): 2278–2324.
Li, M.; Zhang, S.; Zhu, F.; Qian, W.; Zang, L.; Han, J.; and Hu, S. 2020. Symmetric Metric Learning with Adaptive Margin for Recommendation. In Proc. of Association for the Advancement of Artificial Intelligence (AAAI), 4634–4641.
Liu, W.; Ma, S.; Tao, D.; Liu, J.; and Liu, P. 2010. Semi-supervised sparse metric learning using alternating linearization optimization. In Proc. of ACM International Conference on Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), 1139–1148.
Miyato, T.; Maeda, S.-i.; Ishii, S.; and Koyama, M. 2018. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
Movshovitz-Attias, Y.; Toshev, A.; Leung, T. K.; Ioffe, S.; and Singh, S. 2017. No fuss distance metric learning using proxies. In Proc. of IEEE International Conference on Computer Vision (ICCV), 360–368.
Musgrave, K.; Belongie, S.; and Lim, S.-N. 2020. A metric learning reality check. In Proc. of European Conference on Computer Vision (ECCV).
Niu, G.; Dai, B.; Yamada, M.; and Sugiyama, M. 2014. Information-theoretic semi-supervised metric learning via entropy regularization. Neural Computation 26(8): 1717–1762.
Oliver, A.; Odena, A.; Raffel, C. A.; Cubuk, E. D.; and Goodfellow, I. 2018. Realistic evaluation of deep semi-supervised learning algorithms. In Proc. of Neural Information Processing Systems (NeurIPS), 3235–3246.
Qian, Q.; Shang, L.; Sun, B.; Hu, J.; Li, H.; and Jin, R. 2019. SoftTriple loss: Deep metric learning without triplet sampling. In Proc. of IEEE International Conference on Computer Vision (ICCV), 6450–6458.
Roth, K.; Milbich, T.; Sinha, S.; Gupta, P.; Ommer, B.; and Cohen, J. P. 2020. Revisiting training strategies and generalization performance in deep metric learning. In Proc. of International Conference on Machine Learning (ICML).
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115(3): 211–252.
Schroff, F.; Kalenichenko, D.; and Philbin, J. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 815–823.
Shen, P.; Du, X.; and Li, C. 2016. Distributed semi-supervised metric learning. IEEE Access 4: 8558–8571.
Sohn, K. 2016. Improved deep metric learning with multi-class n-pair loss objective. In Proc. of Neural Information Processing Systems (NeurIPS), 1857–1865.
Stretcu, O.; Viswanathan, K.; Movshovitz-Attias, D.; Platanios, E.; Ravi, S.; and Tomkins, A. 2019. Graph agreement models for semi-supervised learning. In Proc. of Neural Information Processing Systems (NeurIPS), 8713–8723.
Sun, Y.; Cheng, C.; Zhang, Y.; Zhang, C.; Zheng, L.; Wang, Z.; and Wei, Y. 2020. Circle loss: A unified perspective of pair similarity optimization. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6398–6407.
Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1–9.
Tolias, G.; Sicre, R.; and Jégou, H. 2016. Particular object retrieval with integral max-pooling of CNN activations. In Proc. of International Conference on Learning Representations (ICLR).
Vedaldi, A.; and Lenc, K. 2015. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the 23rd ACM International Conference on Multimedia, 689–692. ACM.
Wang, J.; Zhou, F.; Wen, S.; Liu, X.; and Lin, Y. 2017. Deep metric learning with angular loss. In Proc. of IEEE International Conference on Computer Vision (ICCV).
Wang, X.; Han, X.; Huang, W.; Dong, D.; and Scott, M. R. 2019. Multi-similarity loss with general pair weighting for deep metric learning. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5022–5030.
Weinberger, K. Q.; Blitzer, J.; and Saul, L. 2006. Distance metric learning for large margin nearest neighbor classification. In Proc. of Neural Information Processing Systems (NeurIPS), 1473–1480.
Welinder, P.; Branson, S.; Mita, T.; Wah, C.; Schroff, F.; Belongie, S.; and Perona, P. 2010. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology.
Wu, Z.; Xiong, Y.; Yu, S. X.; and Lin, D. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3733–3742.
Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
Xie, P.; Wu, W.; Zhu, Y.; and Xing, E. P. 2018. Orthogonality-promoting distance metric learning: convex relaxation and theoretical analysis. In Proc. of International Conference on Machine Learning (ICML).
Ying, S.; Wen, Z.; Shi, J.; Peng, Y.; Peng, J.; and Qiao, H. 2017. Manifold preserving: An intrinsic approach for semi-supervised distance metric learning. IEEE Transactions on Neural Networks and Learning Systems (TNNLS) 29(7): 2731–2742.
Yu, B.; and Tao, D. 2019. Deep metric learning with tuplet margin loss. In Proc. of IEEE International Conference on Computer Vision (ICCV), 6490–6499.
Yuan, T.; Deng, W.; Tang, J.; Tang, Y.; and Chen, B. 2019. Signal-to-noise ratio: A robust distance metric for deep metric learning. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4815–4824.
Zhu, X. J. 2005. Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.