# Near-Optimal Comparison Based Clustering

Michaël Perrot
Univ Lyon, UJM-Saint-Etienne, CNRS, IOGS, Lab HC UMR 5516, F-42023, SAINT-ETIENNE, France
michael.perrot@univ-st-etienne.fr

Pascal Mattia Esser, Debarghya Ghoshdastidar
Department of Informatics, Technical University of Munich
{esser,ghoshdas}@in.tum.de

Equal contribution. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

The goal of clustering is to group similar objects into meaningful partitions. This process is well understood when an explicit similarity measure between the objects is given. However, far less is known when this information is not readily available and, instead, one only observes ordinal comparisons such as "object $i$ is more similar to $j$ than to $k$". In this paper, we tackle this problem using a two-step procedure: we estimate a pairwise similarity matrix from the comparisons before using a clustering method based on semi-definite programming (SDP). We theoretically show that our approach can exactly recover a planted clustering using a near-optimal number of passive comparisons. We empirically validate our theoretical findings and demonstrate the good behaviour of our method on real data.

## 1 Introduction

In clustering, the objective is to group together objects that share the same semantic meaning, that is, that are similar to each other, into $k$ disjoint partitions. This problem has been extensively studied in the literature when a measure of similarity between the objects is readily available, for example when the examples have a Euclidean representation or a graph structure (Shi and Malik, 2000; Arthur and Vassilvitskii, 2007; von Luxburg, 2007). However, it has attracted less attention when the objects are difficult to represent in a standard way, for example cars or food. A recent trend to tackle this problem is to use comparison based learning (Ukkonen, 2017; Emamjomeh-Zadeh and Kempe, 2018) where, instead of similarities, one only observes comparisons between the examples:

Triplet comparison: object $x_i$ is more similar to object $x_j$ than to object $x_k$;
Quadruplet comparison: objects $x_i$ and $x_j$ are more similar to each other than objects $x_k$ and $x_l$.

There are two ways to obtain these comparisons. On the one hand, one can adaptively query them from an oracle, for example a crowd. This is the active setting. On the other hand, they can be directly given, with no way to make new queries. This is the passive setting. In this paper, we study comparison based learning for clustering using passively obtained triplets and quadruplets.

Comparison based learning mainly stems from the psychometric and crowdsourcing literature (Shepard, 1962; Young, 1987; Stewart et al., 2005) where the importance and robustness of collecting ordinal information from human subjects has been widely discussed. In recent years, this framework has attracted an increasing amount of attention in the machine learning community and three main learning paradigms have emerged. The first one consists in obtaining a Euclidean embedding of the data that respects the comparisons as much as possible and then applying standard learning techniques (Borg and Groenen, 2005; Agarwal et al., 2007; Jamieson and Nowak, 2011; Tamuz et al., 2011; van der Maaten and Weinberger, 2012; Terada and von Luxburg, 2014; Zhang et al., 2015; Amid and Ukkonen, 2015; Arias-Castro, 2017).
The second paradigm is to directly solve a specific task from the ordinal comparisons, such as data dimension or density estimation (Kleindessner and von Luxburg, 2015; Ukkonen et al., 2015), classification and regression (Haghiri et al., 2018), or clustering (Vikram and Dasgupta, 2016; Ukkonen, 2017; Ghoshdastidar et al., 2019). Finally, the third paradigm is an intermediate solution where the idea is to learn a similarity or distance function, as in embedding approaches, but, instead of satisfying the comparisons, the objective is to solve one or several standard problems such as classification or clustering (Kleindessner and von Luxburg, 2017). In this paper, we focus on this third paradigm and propose two new similarities based on triplet and quadruplet comparisons respectively. While these new similarities can be used to solve any machine learning problem, we show that they are provably good for clustering under a well-known planted partitioning framework (Abbe, 2017; Yan et al., 2018; Xu et al., 2020).

Motivation of this work. A key bottleneck in comparison based learning is the overall number of available comparisons: given $n$ examples, there exist $O(n^3)$ different triplets and $O(n^4)$ different quadruplets. In practice, it means that, in most applications, obtaining all the comparisons is not realistic. Instead, most approaches try to use as few comparisons as possible. This problem is relatively easy when the comparisons can be actively queried, and it is known that $\Omega(n \ln n)$ adaptively selected comparisons are sufficient for various learning problems (Haghiri et al., 2017; Emamjomeh-Zadeh and Kempe, 2018; Ghoshdastidar et al., 2019). On the other hand, this problem becomes harder when the comparisons are passively obtained. The general conclusion in most theoretical results on learning from passive ordinal comparisons is that, in the worst case, almost all the $O(n^3)$ or $O(n^4)$ comparisons should be observed (Jamieson and Nowak, 2011; Emamjomeh-Zadeh and Kempe, 2018). The focus of this work is to show that, by carefully handling the passively obtained comparisons, it is possible to design comparison based approaches that use almost as few comparisons as active approaches for planted clustering problems.

Near-optimal guarantees for clustering with passive comparisons. In hierarchical clustering, Emamjomeh-Zadeh and Kempe (2018) showed that constructing a hierarchy that satisfies all comparisons in a top-down fashion requires $\Omega(n^3)$ passively obtained triplets in the worst case. Similarly, Ghoshdastidar et al. (2019) considered a planted model and showed that $\Omega(n^{3.5} \ln n)$ passive quadruplets suffice to recover the true hierarchy in the data using a bottom-up approach. Since the main difficulty lies in recovering the small clusters at the bottom of the tree, we believe that this latter result also holds for standard clustering. In this paper, we consider a planted model for standard clustering and we show that, when the number of clusters $k$ is constant, $\Omega(n (\ln n)^2)$ passive triplets or quadruplets are sufficient for exact recovery.² This result is comparable to the sufficient number of active comparisons in most problems, that is $\Omega(n \ln n)$ (Haghiri et al., 2017; Emamjomeh-Zadeh and Kempe, 2018). Furthermore, it is near-optimal. Indeed, to cluster an example, it is necessary to observe it in a comparison at least once as, otherwise, it can only be assigned to a random cluster. Thus, to cluster $n$ objects, it is necessary to have access to at least $\Omega(n)$ comparisons.

²When we write that $\Omega(n (\ln n)^2)$ comparisons are sufficient, we express that any number of comparisons greater than $C n (\ln n)^2$, with $C$ a constant, is sufficient to solve the problem. In other words, it means that having exactly $C n (\ln n)^2$ comparisons is sufficient but also that having more comparisons is not detrimental. This notation is used in statistics and information theory (Fletcher et al., 2009).
Finally, to obtain these results, we study a semi-definite programming (SDP) based clustering method, and our analysis could be of significant interest beyond the comparison based framework.

General noise model for comparison based learning. In comparison based learning, there are two main sources of noise. First, the observed comparisons can be noisy, that is, the observed triplets and quadruplets are not in line with the underlying similarities. This noise stems, for example, from the randomness of the answers gathered from a crowd. It is typically modelled by assuming that each observed comparison is randomly (and independently) flipped (Jain et al., 2016; Emamjomeh-Zadeh and Kempe, 2018). This is mitigated in the active setting by repeatedly querying each comparison, but may have a significant impact in the passive setting where a single instance of each comparison is often observed. Apart from the aforementioned observation errors, the underlying similarities may also have intrinsic noise. For instance, the food data set by Wilber et al. (2014) contains triplet comparisons in terms of which items taste more similar, and it is possible that the taste of a dessert is closer to a main dish than to another dessert. This noise has been considered in Ghoshdastidar et al. (2019) by assuming that every pair of items possesses a latent random similarity, which affects the responses to comparisons. In this paper, we propose, to the best of our knowledge, the first analysis that considers and shows the impact of both types of noise on the number of passive comparisons.

Scalable comparison based similarity functions. Several similarity and kernel functions have been proposed in the literature (Kleindessner and von Luxburg, 2017; Ghoshdastidar et al., 2019). However, computing these similarities is usually expensive as they require up to $O(n)$ passes over the set of available comparisons. In this paper, we propose new similarity functions whose construction is much more efficient than previous kernels. Indeed, they can be obtained with a single pass over the set of available comparisons. It means that our similarity functions can be computed in an online fashion where the comparisons are obtained one at a time from a stream. The main drawback compared to existing approaches is that we lose the positive semi-definiteness of the similarity matrix, but our theoretical results show that this is not an issue in the context of clustering. We also demonstrate this empirically as our similarities obtain results that are comparable with state of the art methods.

## 2 Background and theoretical framework

In this section, we present the comparison based framework and our planted clustering model, under which we later show that a small number of passive comparisons suffices for learning. We consider the following setup. There are $n$ items, denoted by $[n] = \{1, 2, \ldots, n\}$, and we assume that, for every pair of distinct items $i, j \in [n]$, there is an implicit real-valued similarity $w_{ij}$ that we cannot directly observe.
Instead, we have access to

Triplets: $\mathcal{T} \subseteq \{(i, j, r) \in [n]^3 : w_{ij} > w_{ir},\ i, j, r \text{ distinct}\}$, or
Quadruplets: $\mathcal{Q} \subseteq \{(i, j, r, s) \in [n]^4 : w_{ij} > w_{rs},\ i \neq j,\ r \neq s,\ (i, j) \neq (r, s)\}$.   (1)

There are $O(n^4)$ possible quadruplets and $O(n^3)$ possible triplets, and it is expensive to collect such a large number of comparisons via crowdsourcing. In practice, $\mathcal{T}$ or $\mathcal{Q}$ only contain a small fraction of all possible comparisons. We note that if a triple $i, j, r \in [n]$ is observed with $i$ as reference item, then either $(i, j, r) \in \mathcal{T}$ or $(i, r, j) \in \mathcal{T}$, depending on whether $i$ is more similar to $j$ or to $r$. Similarly, when tuples $(i, j)$ and $(r, s)$ are compared, we have either $(i, j, r, s) \in \mathcal{Q}$ or $(r, s, i, j) \in \mathcal{Q}$.

Sampling and noise in comparisons. This paper focuses on passive observation of comparisons. To model this, we assume that the comparisons are obtained via uniform sampling, and every comparison is equally likely to be observed. Let $p \in (0, 1]$ denote a sampling rate that depends on $n$. We assume that every comparison (triplet or quadruplet) is independently observed with probability $p$. In expectation, $|\mathcal{Q}| = O(pn^4)$ and $|\mathcal{T}| = O(pn^3)$, and we can control the sampling rate $p$ to study the effect of the number of observations, $|\mathcal{Q}|$ or $|\mathcal{T}|$, on the performance of an algorithm. As noted in the introduction, the observed comparisons are typically noisy due to random flipping of answers by the crowd workers and inherent noise in the similarities. To model the external (crowd) noise we follow the work of Jain et al. (2016) and, given a parameter $\epsilon \in (0, 1]$, we assume that any observed comparison is correct with probability $\frac{1}{2}(1 + \epsilon)$ and flipped with probability $\frac{1}{2}(1 - \epsilon)$. To be precise, for an observed triple $i, j, r \in [n]$ such that $w_{ij} > w_{ir}$,

$P\big((i, j, r) \in \mathcal{T} \mid w_{ij} > w_{ir}\big) = \frac{1 + \epsilon}{2}$, whereas $P\big((i, r, j) \in \mathcal{T} \mid w_{ij} > w_{ir}\big) = \frac{1 - \epsilon}{2}$.   (2)

The probabilities for flipping quadruplets can be similarly expressed. We model the inherent noise by assuming $w_{ij}$ to be random, and present a model for the similarities under planted clustering.

Planted clustering model. We now present a theoretical model for the inherent noise in the similarities that reflects a clustered structure of the items. The following model is a variant of the popular stochastic block model, studied in the context of graph clustering (Abbe, 2017), and is related to the non-parametric weighted stochastic block model (Xu et al., 2020). We assume that the item set $[n]$ is partitioned into $k$ clusters $C_1, \ldots, C_k$ of sizes $n_1, \ldots, n_k$, respectively, but the number of clusters $k$ as well as the clusters $C_1, \ldots, C_k$ are unknown to the algorithm. Let $F_{in}$ and $F_{out}$ be two distributions defined on $\mathbb{R}$. We assume that the inherent (and unobserved) similarities $\{w_{ij} : i < j\}$ are random and mutually independent, with $w_{ij} \sim F_{in}$ if $i, j \in C_\ell$ for some $\ell$, and $w_{ij} \sim F_{out}$ otherwise. We further assume that $w_{ii}$ is undefined, $w_{ji} = w_{ij}$, and that, for $w, w'$ independent,

$P_{w, w' \sim F_{in}}(w > w') = P_{w, w' \sim F_{out}}(w > w') = 1/2$, and $P_{w \sim F_{in}, w' \sim F_{out}}(w > w') = (1 + \delta)/2$ for some $\delta \in (0, 1]$.   (3)

The first condition in (3) requires that $F_{in}$, $F_{out}$ do not have point masses, and is assumed for analytical convenience. The second condition ensures that within-cluster similarities are larger than inter-cluster similarities, a natural requirement. Ghoshdastidar et al. (2019) used a special case of the above model, where $F_{in}$, $F_{out}$ are assumed to be Gaussian with identical variances $\sigma^2$, and means satisfying $\mu_{in} > \mu_{out}$. In this case, $\delta = 2\Phi\big((\mu_{in} - \mu_{out})/(\sqrt{2}\sigma)\big) - 1$, where $\Phi$ is the cumulative distribution function of the standard normal distribution.
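To make the sampling and noise model concrete, the following sketch draws latent similarities and noisy passive triplets under the planted model. The Gaussian choice for $F_{in}$, $F_{out}$, the default parameters, and all function names are illustrative assumptions on our part, and the cubic loop is only meant for small $n$.

```python
import numpy as np

def planted_similarities(labels, mu_in=0.5, mu_out=0.0, sigma=0.1, rng=None):
    """Latent similarities w_ij under the planted model, with F_in and F_out Gaussian."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]          # True if i, j are in the same cluster
    W = rng.normal(np.where(same, mu_in, mu_out), sigma)
    W = np.triu(W, 1)
    return W + W.T                                     # symmetric; the (unused) diagonal stays 0

def sample_noisy_triplets(W, p, eps, rng=None):
    """Each triple (i, {j, r}) is observed independently with probability p;
    the answer is flipped with probability (1 - eps) / 2 (crowd noise)."""
    rng = np.random.default_rng() if rng is None else rng
    n = W.shape[0]
    triplets = []
    for i in range(n):
        for j in range(n):
            for r in range(j + 1, n):
                if i == j or i == r or rng.random() > p:
                    continue
                answer = W[i, j] > W[i, r]             # noiseless comparison
                if rng.random() > (1 + eps) / 2:       # flip with probability (1 - eps) / 2
                    answer = not answer
                triplets.append((i, j, r) if answer else (i, r, j))
    return triplets
```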
The goal of this paper is to obtain bounds on the number of passively obtained triplets/quadruplets that are sufficient to recover the aforementioned planted clusters with zero error. To this end, we propose two similarity functions respectively computed from triplet and quadruplet comparisons, and show that a similarity based clustering approach using semi-definite programming (SDP) can exactly recover clusters planted in the data using few passive comparisons.

## 3 A theoretical analysis of similarity based clustering

Before presenting our new comparison based similarity functions, we describe the SDP approach for clustering from similarity matrices that we use throughout the paper (Yan et al., 2018; Chen and Yang, 2020). In addition, we prove a generic theoretical guarantee for this approach that holds for any similarity matrix and, thus, could be of interest even beyond the comparison based setting. Similarity based clustering is widely used in machine learning, and there exist a range of popular approaches including spectral methods (von Luxburg, 2007), semi-definite relaxations (Yan and Sarkar, 2016), or linkage algorithms (Dasgupta, 2016), among others. We consider the following SDP for similarity based clustering. Let $S \in \mathbb{R}^{n \times n}$ be a symmetric similarity matrix among $n$ items, and $Z \in \{0, 1\}^{n \times k}$ be the cluster assignment matrix that we wish to estimate. For an unknown number of clusters $k$, it is difficult to directly determine $Z$ and, hence, we estimate the normalised clustering matrix $X \in \mathbb{R}^{n \times n}$ such that $X_{ij} = 1/|C|$ if $i, j$ co-occur in an estimated cluster $C$, and $X_{ij} = 0$ otherwise. Note that $\mathrm{trace}(X) = k$. The following SDP was proposed and analysed by Yan et al. (2018) under the stochastic block model for graphs, and can also be applied in the more general context of data clustering (Chen and Yang, 2020). This SDP is agnostic to the number of clusters, but penalises large values of $\mathrm{trace}(X)$ to restrict the number of estimated clusters:

$\max_X\ \mathrm{trace}(SX) - \lambda\, \mathrm{trace}(X)$  s.t.  $X \succeq 0$, $X \geq 0$, $X\mathbf{1} = \mathbf{1}$.   (SDP-λ)

Here, $\lambda$ is a tuning parameter and $\mathbf{1}$ denotes the vector of all ones. The constraints $X \succeq 0$ and $X \geq 0$ restrict the optimisation to non-negative, positive semi-definite matrices. We first present a general theoretical result for SDP-λ. Assume that the data has an implicit partition into $k$ clusters $C_1, \ldots, C_k$ of sizes $n_1, \ldots, n_k$ and with cluster assignment matrix $Z$, and suppose that the similarity $S$ is close to an ideal similarity matrix $\widetilde{S}$ that has a $k \times k$ block structure $\widetilde{S} = Z \Sigma Z^T$. The matrix $\Sigma \in \mathbb{R}^{k \times k}$ is such that $\Sigma_{\ell\ell'}$ represents the ideal pairwise similarity between items from clusters $C_\ell$ and $C_{\ell'}$. Typically, under a random planted model, $\widetilde{S}$ is the same as $E[S]$ up to possible differences in the diagonal terms. For $S = \widetilde{S}$ and certain values of $\lambda$, the unique optimal solution of SDP-λ is the block diagonal matrix $X^* = Z N^{-1} Z^T$, where $N \in \mathbb{R}^{k \times k}$ is diagonal with entries $n_1, \ldots, n_k$ (see Appendix B). Thus, in the ideal case, solving the SDP provides the desired normalised clustering matrix from which one can recover the partition $C_1, \ldots, C_k$. The following result shows that $X^*$ is also the unique optimal solution of SDP-λ if $S$ is sufficiently close to $\widetilde{S}$.

Proposition 1 (Recovery of planted clusters using SDP-λ). Let $Z \in \{0, 1\}^{n \times k}$ be the assignments for a planted $k$-way clustering, $\widetilde{S} = Z \Sigma Z^T$, and $X^* = Z N^{-1} Z^T$ as defined above. Define

$\Delta_1 = \min_{\ell \neq \ell'} \frac{\Sigma_{\ell\ell} + \Sigma_{\ell'\ell'}}{2} - \Sigma_{\ell\ell'}$,  and  $\Delta_2 = \max_{i \in [n]} \max_{\ell \in [k]} \Big| \frac{1}{n_\ell} \sum_{j \in C_\ell} \big( S_{ij} - \widetilde{S}_{ij} \big) \Big|$.
$X^*$ is the unique optimal solution of SDP-λ for any choice of $\lambda$ in the interval

$\|S - \widetilde{S}\|_2 + n \Delta_2 < \lambda < \min_\ell n_\ell \big( \Delta_1 / 2 - \Delta_2 \big)$.

The proof of Proposition 1, given in Appendix B, is adapted from Yan et al. (2018), although uniqueness was not proved in this previous work. The term $\Delta_1$ quantifies the separation between the ideal within- and inter-cluster similarities, and is similar in spirit to the weak assortativity criterion for stochastic block models (Yan et al., 2018). On the other hand, the matrix spectral norm $\|S - \widetilde{S}\|_2$ and the term $\Delta_2$ both quantify the deviation of the similarities $S$ from their ideal values $\widetilde{S}$. Note that the number of clusters can be computed as $k = \mathrm{trace}(X^*)$ and the cluster assignment $Z$ is obtained by clustering the rows of $X^*$, using k-means or spectral clustering for example. In the experiments (Section 5), we present a data-dependent approach to tune $\lambda$ and find $k$. We conclude this section by noting that most of the previous analyses of SDP clustering either assume sub-Gaussian data (Yan and Sarkar, 2016) or consider similarity matrices with independence assumptions (Chen and Xu, 2014; Yan et al., 2018) that might not hold in general, and do not hold for our AddS-3 and AddS-4 similarities described in the next section. In contrast, the deterministic criteria stated in Proposition 1 make the result applicable in more general settings.

## 4 Similarities from passive comparisons

We present two new similarity functions computed from passive comparisons (AddS-3 and AddS-4) and guarantees for recovering planted clusters using SDP-λ in conjunction with these similarities. Kleindessner and von Luxburg (2017) introduced pairwise similarities computed from triplets, and a quadruplet variant was proposed by Ghoshdastidar et al. (2019). These similarities, detailed in Appendix A, are positive-definite kernels and have multiplicative forms. In contrast, we compute the similarity between items $i, j$ by simply adding binary responses to comparisons involving $i$ and $j$.

Similarity from quadruplets. We construct the additive similarity for quadruplets, referred to as AddS-4, in the following way. Recall the definition of $\mathcal{Q}$ in Equation (1) and, for every $i \neq j$, define

$S_{ij} = \sum_{r < s} \Big( \mathbb{I}_{\{(i, j, r, s) \in \mathcal{Q}\}} - \mathbb{I}_{\{(r, s, i, j) \in \mathcal{Q}\}} \Big)$,   (AddS-4)

where $\mathbb{I}_{\{\cdot\}}$ is the indicator function. The intuition is that if $i, j$ are similar ($w_{ij}$ is large), then for every observed tuple $i, j, r, s$, the comparison $w_{ij} > w_{rs}$ is more likely to be observed. Thus, $(i, j, r, s)$ appears in $\mathcal{Q}$ more often than $(r, s, i, j)$, and $S_{ij}$ is a (possibly large) positive term. On the other hand, a smaller $w_{ij}$ leads to a negative value of $S_{ij}$. Under the aforementioned planted model with clusters of size $n_1, \ldots, n_k$, one can verify that $S_{ij}$ indeed reveals the planted clusters in expectation since, if $i, j$ belong to the same planted cluster, then $E[S_{ij}] = p \epsilon \delta \sum_{\ell < \ell'} n_\ell n_{\ell'}$, and $E[S_{ij}] = -p \epsilon \delta \sum_{\ell} \binom{n_\ell}{2}$ otherwise. Thus, in expectation, the within-cluster similarity exceeds the inter-cluster similarity by $p \epsilon \delta \binom{n}{2}$.

Similarity from triplets. The additive similarity based on passive triplets, AddS-3, is given by

$S_{ij} = \sum_{r \in [n]} \Big( \mathbb{I}_{\{(i, j, r) \in \mathcal{T}\}} - \mathbb{I}_{\{(i, r, j) \in \mathcal{T}\}} + \mathbb{I}_{\{(j, i, r) \in \mathcal{T}\}} - \mathbb{I}_{\{(j, r, i) \in \mathcal{T}\}} \Big)$   (AddS-3)

for every $i \neq j$. The AddS-3 similarity $S_{ij}$ aggregates all the comparisons that involve both $i$ and $j$, with either $i$ or $j$ as the reference item. Similar to the case of AddS-4, $S_{ij}$ tends to be positive when $w_{ij}$ is large, and negative for small $w_{ij}$. One can also verify that, under a planted model, the expected within-cluster AddS-3 similarity exceeds the inter-cluster similarity by $p \epsilon \delta (n - 2)$.
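For illustration, the single-pass construction of AddS-3 and AddS-4 can be sketched as follows; the function and variable names are ours, and the sketch assumes comparisons arrive as tuples of item indices (a stream would work identically).

```python
import numpy as np

def adds3(n, triplets):
    """AddS-3 from one pass over triplets (i, j, r): 'i is more similar to j than to r'."""
    S = np.zeros((n, n))
    for i, j, r in triplets:
        S[i, j] += 1; S[j, i] += 1   # (i, j, r) is evidence that i and j are close
        S[i, r] -= 1; S[r, i] -= 1   # and that i and r are not
    return S

def adds4(n, quadruplets):
    """AddS-4 from one pass over quadruplets (i, j, r, s): 'i, j are more similar than r, s'."""
    S = np.zeros((n, n))
    for i, j, r, s in quadruplets:
        S[i, j] += 1; S[j, i] += 1
        S[r, s] -= 1; S[s, r] -= 1
    return S
```

Each observed comparison touches exactly two entries of $S$ (and their symmetric counterparts), which is why a single pass over $\mathcal{T}$ or $\mathcal{Q}$ suffices.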
A significant advantage of AddS-3 and AddS-4 over existing similarities is in terms of the computational time for constructing $S$. Unlike existing kernels, both similarities can be computed from a single pass over $\mathcal{T}$ or $\mathcal{Q}$. In addition, the following result shows that the proposed similarities can exactly recover planted clusters using only a near-optimal number of passive comparisons.

Theorem 1 (Cluster recovery using AddS-3 and AddS-4). Let $X^*$ denote the normalised clustering matrix corresponding to the true partition, and $n_{min}$ be the size of the smallest planted cluster. In the triplet and the quadruplet settings, there exist absolute constants $c_1, c_2, c_3, c_4 > 0$ such that, with probability at least $1 - \frac{1}{n}$, $X^*$ is the unique optimal solution of SDP-λ if $\delta$ satisfies

$c_1 \frac{\sqrt{n \ln n}}{n_{min}} < \delta \leq 1$,

and one of the following two conditions holds:

(triplet setting) $S$ is given by AddS-3, and the number of triplets $|\mathcal{T}|$ and the parameter $\lambda$ satisfy

$|\mathcal{T}| > c_2 \frac{n^3 (\ln n)^2}{\epsilon^2 \delta^2 n_{min}^2}$  and  $c_3 \max\Big\{ \sqrt{\tfrac{|\mathcal{T}| \ln n}{n}},\ (\ln n)^2 \Big\} < \lambda < c_4 \frac{|\mathcal{T}| \epsilon \delta n_{min}}{n^2}$;

(quadruplet setting) $S$ is given by AddS-4, and the number of quadruplets $|\mathcal{Q}|$ and $\lambda$ satisfy

$|\mathcal{Q}| > c_2 \frac{n^3 (\ln n)^2}{\epsilon^2 \delta^2 n_{min}^2}$  and  $c_3 \max\Big\{ \sqrt{\tfrac{|\mathcal{Q}| \ln n}{n}},\ (\ln n)^2 \Big\} < \lambda < c_4 \frac{|\mathcal{Q}| \epsilon \delta n_{min}}{n^2}$.

The conditions on $\delta$ and the number of comparisons ensure that the interval for $\lambda$ is non-empty. Theorem 1 is proved in Appendix C. This result shows that, given a sufficient number of comparisons, one can exactly recover the planted clusters using SDP-λ with an appropriate choice of $\lambda$. In particular, if there are $k$ planted clusters of similar sizes and $\delta$ satisfies the stated condition, then recovery of the planted clusters with zero error is possible with only $\Omega\big(\frac{k^2}{\epsilon^2 \delta^2} n (\ln n)^2\big)$ passively obtained triplets or quadruplets. In this particular context, we make a few important remarks about the sufficient conditions.

Remark 1 (Comparison with existing results). For fixed $k$ and fixed $\epsilon, \delta \in (0, 1]$, Theorem 1 states that $\Omega(n (\ln n)^2)$ passive comparisons (triplets or quadruplets) suffice to exactly recover the clusters. This significantly improves over the result of Ghoshdastidar et al. (2019) stating that $\Omega(n^{3.5} \ln n)$ passive quadruplets are sufficient in a planted setting, and over the fact that $\Omega(n^3)$ triplets are necessary in the worst case (Emamjomeh-Zadeh and Kempe, 2018).

Remark 2 (Dependence of the number of comparisons on the noise levels $\epsilon, \delta$). When one can actively obtain comparisons, Emamjomeh-Zadeh and Kempe (2018) showed that it suffices to query $\Omega\big(n \ln \frac{n}{\epsilon}\big)$ triplets. Compared to the $\ln \frac{1}{\epsilon}$ dependence in the active setting, the sufficient number of passive comparisons in Theorem 1 has a stronger dependence of $\frac{1}{\epsilon^2}$ on the crowd noise level $\epsilon$. While we do not know whether this dependence is optimal, the stronger criterion is intuitive since, unlike the active setting, the passive setting does not provide repeated observations of the same comparisons that can easily nullify the crowd noise. The number of comparisons also depends as $\frac{1}{\delta^2}$ on the inherent noise level, which is similar to the conditions in Ghoshdastidar et al. (2019).

Theorem 1 states that exact recovery primarily depends on two sufficient conditions, one on $\delta$ and the other on the number of passive comparisons ($|\mathcal{T}|$ or $|\mathcal{Q}|$). The following two remarks show that both conditions are necessary, up to possible differences in logarithmic factors.

Remark 3 (Necessity of the condition on $\delta$). The condition on $\delta$ imposes the requirement $n_{min} = \Omega(\sqrt{n \ln n})$. This requirement on $n_{min}$ appears naturally in planted problems.
Indeed, assuming that all $k$ clusters are of similar sizes, the above condition is equivalent to the requirement $k = O\big(\sqrt{n / \ln n}\big)$, and it is believed that polynomial time algorithms cannot recover $k \gg \sqrt{n}$ planted clusters (Chen and Xu, 2014, Conjecture 1).

Remark 4 (Near-optimal number of comparisons). To cluster $n$ items, one needs to observe each example at least once. Hence, one trivially needs at least $\Omega(n)$ comparisons (active or passive). Similarly, existing works on actively obtained comparisons show that $\Omega(n \ln n)$ comparisons are sufficient for learning in supervised or unsupervised problems (Haghiri et al., 2017; Emamjomeh-Zadeh and Kempe, 2018; Ghoshdastidar et al., 2019). We observe that, in the setting of Remark 1, it suffices to have $\Omega(n (\ln n)^2)$ passive comparisons, which matches the necessary conditions up to logarithmic factors. However, the sufficient condition on the number of comparisons becomes $\Omega(k^2 n (\ln n)^2)$ if $k$ grows with $n$ while $\epsilon$ and $\delta$ are fixed. It means that the worst case of $k = O\big(\sqrt{n / \ln n}\big)$, stated in Remark 3, can only be tackled using at least $\Omega(n^2 \ln n)$ passive comparisons.

Remark 5 (No new information beyond $\Omega(n^2 / \epsilon^2)$ comparisons). Theorem 1 shows that, for large $n$ and $\Omega(n^2 / \epsilon^2)$ comparisons, the condition for exact recovery of the clusters is only governed by the condition on $\delta$, as the interval for $\lambda$ is always non-empty. It means that, beyond a quadratic number of comparisons, no new information is gained by observing more comparisons. This explains why significantly fewer passive comparisons suffice in practice than the known worst-case requirements of $\Omega(n^3)$ passive triplets or $\Omega(n^4)$ passive quadruplets.

We conclude our theoretical discussion with a remark about recovering planted clusters when the pairwise similarities $w_{ij}$ are observed. Our methods are near-optimal even in this setting.

Remark 6 (Recovering planted clusters for non-parametric $F_{in}$, $F_{out}$). Theoretical studies in the classic setting of clustering with observed pairwise similarities $\{w_{ij} : i < j\}$ typically assume that the distributions $F_{in}$ and $F_{out}$ for the pairwise similarities are Bernoulli (in unweighted graphs), take finitely many values (labelled graphs), or belong to exponential families (Chen and Xu, 2014; Aicher et al., 2015; Yun and Proutiere, 2016). Hence, the applicability of such results is restrictive. Recently, Xu et al. (2020) considered non-parametric distributions for $F_{in}$, $F_{out}$, and presented a near-optimal approach based on discretisation of the similarities into finitely many bins. Our work suggests an alternative approach: compute ordinal comparisons from the original similarities and use clustering with AddS-3 or AddS-4. Theorem 1 then guarantees, for any non-parametric and continuous $F_{in}$ and $F_{out}$, exact recovery of the planted clusters under a near-optimal condition on $\delta$.

## 5 Experiments

The goal of this section is three-fold: present a strategy to tune $\lambda$ in SDP-λ; empirically validate our theoretical findings; and demonstrate the performance of the proposed approaches on real datasets.

Choosing $\lambda$ and estimating the number of clusters based on Theorem 1. Given a similarity matrix $S$, the main difficulty involved in using SDP-λ is tuning the parameter $\lambda$. Yan et al. (2018) proposed the algorithm SPUR to select the best $\lambda$ as

$\lambda^* = \arg\max_{0 \leq \lambda \leq \lambda_{max}} \frac{\sum_{i \leq k_\lambda} \sigma_i(X_\lambda)}{\mathrm{trace}(X_\lambda)}$,

where $X_\lambda$ is the solution of SDP-λ, $k_\lambda$ is the closest integer to $\mathrm{trace}(X_\lambda)$ and an estimate of the number of clusters, $\sigma_i(X_\lambda)$ is the $i$-th largest eigenvalue of $X_\lambda$, and $\lambda_{max}$ is a theoretically well-founded upper bound on $\lambda$.
The maximum of the above objective is 1, achieved when $X_\lambda$ has the same structure as $X^*$ in Proposition 1. In our setting, Theorem 1 gives an upper bound on $\lambda$ that depends on $\epsilon$, $\delta$ and $n_{min}$, which are not known in practice. Furthermore, it is computationally beneficial to use the theoretical lower bound for $\lambda$ instead of using $\lambda \geq 0$ as suggested in SPUR. We propose to modify SPUR based on the fact that the estimated number of clusters $k$ monotonically decreases with $\lambda$ (details in Appendix D). Given Theorem 1, we choose $\lambda_{min} = \sqrt{c (\ln n) / n}$ and $\lambda_{max} = c / n$, where $c = |\mathcal{Q}|$ or $|\mathcal{T}|$. The trace of the SDP-λ solution then gives two estimates of the number of clusters, $k_{\lambda_{min}}$ and $k_{\lambda_{max}}$, and we search over $k \in [k_{\lambda_{max}}, k_{\lambda_{min}}]$ instead of searching over $\lambda$; in practice, it helps to search over the values $\max\{2, k_{\lambda_{max}}\} \leq k \leq k_{\lambda_{min}} + 2$. We select the $k$ that maximises the above SPUR objective, where $X$ is computed using a simpler SDP (Yan et al., 2018):

$\max_X\ \langle S, X \rangle$  s.t.  $X \succeq 0$, $X \geq 0$, $X\mathbf{1} = \mathbf{1}$, $\mathrm{trace}(X) = k$.   (SDP-k)

The overall approach is summarized in Algorithm 1.

Algorithm 1: Comparison-based SPUR
input: the number of examples $n$ and the comparisons $\mathcal{T}$ or $\mathcal{Q}$.
begin
  Define $c = |\mathcal{T}|$ or $|\mathcal{Q}|$. Let $S$ be obtained with AddS-3 or AddS-4.
  Define $\lambda_{min} = \sqrt{c (\ln n) / n}$ and $\lambda_{max} = c / n$.
  $X_{\lambda_{min}}, X_{\lambda_{max}} \leftarrow$ solve SDP-$\lambda_{min}$, SDP-$\lambda_{max}$ on $S$.
  $k_{\lambda_{min}}, k_{\lambda_{max}} \leftarrow \mathrm{trace}(X_{\lambda_{min}}), \mathrm{trace}(X_{\lambda_{max}})$.
  for $k = \max\{2, k_{\lambda_{max}}\}$ to $k_{\lambda_{min}} + 2$ do
    Solve SDP-k to obtain $X_k$.
  end
  Choose $\hat{k} = \arg\max_k \frac{\sum_{i \leq k} \sigma_i(X_k)}{\mathrm{trace}(X_k)}$, where $\sigma_i(X_k)$ denotes the $i$-th largest eigenvalue of $X_k$.
end
output: number of clusters $\hat{k}$, and $X_{\hat{k}}$.

Clustering with AddS-3 and AddS-4.³ For the proposed similarity matrices AddS-3 and AddS-4, the above strategy provides the optimal number of clusters $k$ and a corresponding solution $X_k$ of SDP-k. The partition is obtained by clustering the rows of $X_k$ using k-means. Alternative approaches, such as spectral clustering, lead to similar performances (see Appendix E).

³We provide a Python implementation at https://github.com/mperrot/AddS-Clustering

Evaluation function. We use the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985) between the ground truth and the predictions. The ARI takes values in $[-1, 1]$ and measures the agreement between two partitions: 1 implies identical partitions, whereas 0 implies that the predicted clustering is random. In all the experiments, we report the mean and standard deviation over 10 repetitions.

Simulated data with planted clusters. We generate data using the planted model from Section 2 and verify that the learned clusters are similar to the planted ones. As default parameters we use $n = 1000$, $k = 4$, $\epsilon = 0.75$, $|\mathcal{T}| = |\mathcal{Q}| = n(\ln n)^4$, and $F_{in} = \mathcal{N}(\mu_{in}, \sigma^2)$, $F_{out} = \mathcal{N}(0, \sigma^2)$ with $\sigma = 0.1$ and $\mu_{in}$ such that $\delta = 0.5$. In each experiment, we investigate the sensitivity of our method by varying one of the parameters while keeping the others fixed. We use SPUR to estimate the number of clusters.

[Figure 1: ARI of various methods on the planted model (higher is better). We vary: (1a) the number of comparisons $|\mathcal{T}|$ and $|\mathcal{Q}|$; (1b) the crowd noise level $\epsilon$; (1c) the distributions $F_{in}$ and $F_{out}$.]
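As a complement to Algorithm 1, the following sketch shows the SDP-k step and the SPUR-style selection over a range of candidate $k$. The use of cvxpy with the SCS solver and of scikit-learn's k-means is our illustrative implementation choice, not the released code linked above, and it only scales to moderate $n$; in Algorithm 1, the search range would be given by the two estimates $k_{\lambda_{max}}$ and $k_{\lambda_{min}}$ obtained from SDP-λ.

```python
import numpy as np
import cvxpy as cp
from sklearn.cluster import KMeans

def solve_sdp_k(S, k):
    """SDP-k: maximise <S, X> s.t. X PSD, X >= 0 entrywise, X 1 = 1, trace(X) = k."""
    n = S.shape[0]
    X = cp.Variable((n, n), symmetric=True)
    constraints = [X >> 0, X >= 0, cp.sum(X, axis=1) == 1, cp.trace(X) == k]
    cp.Problem(cp.Maximize(cp.sum(cp.multiply(S, X))), constraints).solve(solver=cp.SCS)
    return X.value

def spur_select_k(S, k_low, k_high):
    """Pick the k maximising (sum of the top-k eigenvalues of X_k) / trace(X_k)."""
    best_k, best_score, best_X = None, -np.inf, None
    for k in range(max(2, k_low), k_high + 1):
        Xk = solve_sdp_k(S, k)
        eigs = np.sort(np.linalg.eigvalsh(Xk))[::-1]   # eigenvalues, largest first
        score = eigs[:k].sum() / np.trace(Xk)
        if score > best_score:
            best_k, best_score, best_X = k, score, Xk
    return best_k, best_X

def recover_partition(Xk, k):
    """Read off the partition by clustering the rows of the SDP solution with k-means."""
    return KMeans(n_clusters=k, n_init=10).fit_predict(Xk)
```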
As baselines, we use SDP-k (with the number of clusters estimated by our approaches) followed by k-means with two comparison based multiplicative kernels: MulK-3 for triplets (Kleindessner and von Luxburg, 2017) and MulK-4 for quadruplets (Ghoshdastidar et al., 2019). We present a selection of significant results in Figure 1 and defer the others to Appendix E. In Figure 1a, we vary the number of sampled comparisons. Unsurprisingly, our approaches are able to exactly recover the planted clusters using as few as $n(\ln n)^3$ comparisons; the extra $\ln n$ factor compared to Theorem 1 accounts for $\epsilon$, $\delta$ and constants. MulK-3 and MulK-4 respectively need $n(\ln n)^{4.5}$ and $n(\ln n)^{5.5}$ comparisons (both values exceed $n^2$ for $n = 1000$). In all our experiments, AddS-3 and AddS-4 have comparable performance while MulK-3 is significantly better than MulK-4. Thus we focus on triplets in the subsequent experiments for the sake of readability. In Figure 1b, we vary the external noise level $\epsilon$. Given $n(\ln n)^4$ comparisons, AddS-3 exactly recovers the planted clusters for $\epsilon$ as small as 0.25 (high crowd noise) while, given the same number of comparisons, MulK-3 only recovers the planted clusters for $\epsilon > 0.9$. Figure 1c shows that AddS-3 outperforms MulK-3 even when different distributions for $F_{in}$ and $F_{out}$ are considered (Uniform + Beta or Uniform + Normal; details in Appendix E). It also shows that the distributions affect the performances, which is not evident from Theorem 1, indicating the possibility of a refined analysis under distributional assumptions.

MNIST clustering with comparisons. We consider two datasets which are subsets of the MNIST test dataset (LeCun and Cortes, 2010) that originally contains 10000 examples roughly equally distributed among the ten digits: (i) a subset of 2163 examples containing all the 1s and 7s (MNIST 1vs.7), two digits that are visually very similar, and (ii) a randomly selected subset of 2000 examples drawn without replacement and covering all 10 classes (MNIST 10). In both cases, to generate the comparisons, we use the Gaussian similarity (see Section F.3 in the supplementary) on a 2-dimensional embedding of the entire MNIST test data constructed with t-SNE (van der Maaten, 2014) and normalized so that each example lies in $[-1, 1]^2$. We focus on the triplet setting and we randomly and uniformly draw, without replacement, between $n(\ln n)^2$ and $n(\ln n)^4$ comparisons to be observed by the different approaches. We also consider two additional baselines. First, we use t-STE (van der Maaten and Weinberger, 2012), an ordinal embedding approach, to embed the examples in 2 dimensions, and then cluster them using k-means on the embedded data. Second, we directly use k-means on the normalized data obtained with t-SNE. The latter is a baseline with access to Euclidean data instead of triplet comparisons.

[Figure 2: Experiments on real datasets. (2a) ARI on MNIST 1vs.7, $n = 2163$; (2b) ARI on MNIST 10, $n = 2000$; (2c) ARI similarity matrix comparing the clusters obtained by the different methods on the Car dataset, $|\mathcal{T}| = 12112$ (darker means more agreement).]
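As an illustration of this comparison-generation protocol, the sketch below answers uniformly drawn triples with a Gaussian similarity on an embedding. The bandwidth value and the sampling with replacement are simplifying assumptions of ours; the exact protocol (including sampling without replacement) is described in Section F.3 of the supplementary.

```python
import numpy as np

def gaussian_similarities(X, bandwidth=1.0):
    """Gaussian similarity w_ij = exp(-||x_i - x_j||^2 / (2 * bandwidth^2)) on embedded points."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def triplets_from_embedding(X, m, bandwidth=1.0, rng=None):
    """Draw m random triples and answer each one with the Gaussian similarity."""
    rng = np.random.default_rng() if rng is None else rng
    W = gaussian_similarities(X, bandwidth)
    n = X.shape[0]
    triplets = []
    for _ in range(m):
        i, j, r = rng.choice(n, size=3, replace=False)
        triplets.append((i, j, r) if W[i, j] > W[i, r] else (i, r, j))
    return triplets
```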
For MNIST 1vs.7 (Figure 2a), $|\mathcal{T}| = n(\ln n)^2$ is sufficient for AddS-3 to reach the performance of k-means and t-STE, while MulK-3 requires $n(\ln n)^3$ triplets. Furthermore, note that AddS-3 with a known number of clusters performs similarly to AddS-3 using SPUR, indicating that SPUR estimates the number of clusters correctly. If we consider MNIST 10 (Figure 2b) and $|\mathcal{T}| = n(\ln n)^2$, AddS-3 with known $k$ outperforms AddS-3 using SPUR, suggesting that the number of comparisons here is not sufficient to estimate the number of clusters accurately. Moreover, AddS-3 with known $k$ outperforms MulK-3 while being close to the performance of t-STE. Finally, for $n(\ln n)^4$ triplets, all ordinal methods converge to the baseline of k-means with access to the original data. The ARI of AddS-3 SPUR improves when the number of comparisons increases due to better estimations of the number of clusters: the estimated $k$ increases from 3 for $|\mathcal{T}| = n(\ln n)^2$ up to 9 for $|\mathcal{T}| = n(\ln n)^4$.

Real comparison based data. First, we consider the Food dataset (Wilber et al., 2014) that contains 100 examples and 190376 triplet comparisons. Unfortunately, there is no ground truth and, thus, quantitatively assessing the quality of the obtained partitions is difficult. Hence, in Appendix F, we simply compare the different methods with respect to one another and present the partition obtained by AddS-3 for visual inspection. Second, we consider the Car dataset (Kleindessner and von Luxburg, 2016). It contains 60 examples grouped into 3 classes (SUVs, city cars, sport cars) with 4 outliers, and comes with 12112 triplet comparisons. For this dataset, AddS-3 SPUR estimates $k = 2$ instead of the correct 3 clusters. Figure 2c considers all ordinal methods with $k = 2$ and $k = 3$, and shows the pairwise agreement (ARI) between the different methods and also with the true labels. While MulK-3 with $k = 3$ agrees the most with the true labels, all the clustering methods agree well for $k = 2$ (top-left $3 \times 3$ block). Hence, the data may have another natural clustering with two clusters, suggesting possible discrepancies in how different people judge the similarities between cars (for instance, color or brand instead of the specified classes).

## 6 Conclusion

It is generally believed that a large number of passive comparisons is necessary in comparison based learning. Existing results on clustering require at least $\Omega(n^3)$ passive comparisons in the worst case or under a planted framework. We show that, in fact, $\Omega(n(\ln n)^2)$ passive comparisons suffice for accurately recovering planted clusters. This number of comparisons is near-optimal, and almost matches the number of active comparisons typically needed for learning. Our theoretical findings are based on two simple approaches for constructing pairwise similarity matrices from passive comparisons (AddS-3 and AddS-4). The present analysis is in a restricted framework, where all within (or inter) cluster similarities are assumed to be identically distributed. Based on existing work on the robustness of SDPs (Moitra et al., 2016), we believe that our theoretical result holds in a more general semi-random setting. Lastly, while we studied the merits of AddS-3 and AddS-4 in the context of clustering, they could be used for other problems such as semi-supervised learning, data embedding, or classification.

## Broader Impact

This work primarily has applications in the fields of psychophysics and crowdsourcing and, more generally, in learning from human responses.
Such data and learning problems could be affected by implicit biases in human responses. However, this latter issue is beyond the scope of this work and, thus, was not formally analysed.

## Acknowledgments and Disclosure of Funding

The work of DG is partly supported by the Baden-Württemberg Stiftung through the BW Eliteprogramm for postdocs. The work of MP has been supported by the ACADEMICS grant of the IDEXLYON, project of the Université de Lyon, PIA operated by ANR-16-IDEX-0005.

## References

E. Abbe. Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research, 18(1):6446-6531, 2017.

S. Agarwal, J. Wills, L. Cayton, G. Lanckriet, D. Kriegman, and S. Belongie. Generalized non-metric multidimensional scaling. In International Conference on Artificial Intelligence and Statistics, pages 11-18, 2007.

C. Aicher, A. Z. Jacobs, and A. Clauset. Learning latent block structure in weighted networks. Journal of Complex Networks, 3(2):221-248, 2015.

E. Amid and A. Ukkonen. Multiview triplet embedding: Learning attributes in multiple maps. In International Conference on Machine Learning, pages 1472-1480, 2015.

E. Arias-Castro. Some theory for ordinal embedding. Bernoulli, 23(3):1663-1693, 2017.

D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027-1035, 2007.

I. Borg and P. Groenen. Modern multidimensional scaling: Theory and applications. Springer, 2005.

X. Chen and Y. Yang. Diffusion k-means clustering on manifolds: provable exact recovery via semidefinite relaxations. Applied and Computational Harmonic Analysis, 2020.

Y. Chen and J. Xu. Statistical-computational phase transitions in planted models: The high-dimensional setting. In International Conference on Machine Learning, pages 244-252, 2014.

S. Dasgupta. A cost function for similarity-based hierarchical clustering. In Symposium on Theory of Computing, pages 118-127, 2016.

E. Emamjomeh-Zadeh and D. Kempe. Adaptive hierarchical clustering using ordinal queries. In Symposium on Discrete Algorithms, pages 415-429, 2018.

A. K. Fletcher, S. Rangan, and V. K. Goyal. Necessary and sufficient conditions for sparsity pattern recovery. IEEE Transactions on Information Theory, 55(12):5758-5772, 2009.

D. Ghoshdastidar, M. Perrot, and U. von Luxburg. Foundations of comparison-based hierarchical clustering. In Advances in Neural Information Processing Systems, pages 7454-7464, 2019.

S. Haghiri, D. Ghoshdastidar, and U. von Luxburg. Comparison-based nearest neighbor search. In International Conference on Artificial Intelligence and Statistics, pages 851-859, 2017.

S. Haghiri, D. Garreau, and U. von Luxburg. Comparison-based random forests. In International Conference on Machine Learning, pages 1866-1875, 2018.

L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193-218, 1985.

L. Jain, K. G. Jamieson, and R. Nowak. Finite sample prediction and recovery bounds for ordinal embedding. In Advances in Neural Information Processing Systems, pages 2711-2719, 2016.

K. G. Jamieson and R. D. Nowak. Low-dimensional embedding using adaptively selected ordinal data. In Annual Allerton Conference on Communication, Control, and Computing, pages 1077-1084, 2011.

S. Janson and A. Ruciński. The infamous upper tail. Random Structures & Algorithms, 20(3):317-342, 2002.

M. Kleindessner and U. von Luxburg. Dimensionality estimation without distances. In International Conference on Artificial Intelligence and Statistics, pages 471-479, 2015.
M. Kleindessner and U. von Luxburg. Lens depth function and k-relative neighborhood graph: versatile tools for ordinal data analysis, 2016.

M. Kleindessner and U. von Luxburg. Kernel functions based on triplet comparisons. In Advances in Neural Information Processing Systems, pages 6807-6817, 2017.

Y. LeCun and C. Cortes. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/, 2010.

B. Mason, L. Jain, and R. Nowak. Learning low-dimensional metrics. In Advances in Neural Information Processing Systems, pages 4139-4147, 2017.

A. Moitra, W. Perry, and A. S. Wein. How robust are reconstruction thresholds for community detection? In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 828-841, 2016.

R. N. Shepard. The analysis of proximities: Multidimensional scaling with an unknown distance function. I. Psychometrika, 27(2):125-140, 1962.

J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.

N. Stewart, G. D. A. Brown, and N. Chater. Absolute identification by relative judgment. Psychological Review, 112(4):881, 2005.

O. Tamuz, C. Liu, S. Belongie, O. Shamir, and A. T. Kalai. Adaptively learning the crowd kernel. In International Conference on Machine Learning, pages 673-680, 2011.

Y. Terada and U. von Luxburg. Local ordinal embedding. In International Conference on Machine Learning, pages 847-855, 2014.

J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389-434, 2012.

A. Ukkonen. Crowdsourced correlation clustering with relative distance comparisons. arXiv preprint arXiv:1709.08459, 2017.

A. Ukkonen, B. Derakhshan, and H. Heikinheimo. Crowdsourced nonparametric density estimation using relative distances. In AAAI Conference on Human Computation and Crowdsourcing, 2015.

L. van der Maaten and K. Weinberger. Stochastic triplet embedding. In IEEE International Workshop on Machine Learning for Signal Processing, pages 1-6, 2012.

L. van der Maaten. Accelerating t-SNE using tree-based algorithms. The Journal of Machine Learning Research, 15(1):3221-3245, 2014.

S. Vikram and S. Dasgupta. Interactive Bayesian hierarchical clustering. In International Conference on Machine Learning, pages 2081-2090, 2016.

U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.

M. Wilber, S. Kwak, and S. Belongie. Cost-effective HITs for relative similarity comparisons. In Human Computation and Crowdsourcing (HCOMP), Pittsburgh, 2014.

M. Xu, V. Jog, and P.-L. Loh. Optimal rates for community estimation in the weighted stochastic block model. The Annals of Statistics, 48(1):183-204, 2020.

B. Yan and P. Sarkar. On robustness of kernel clustering. In Advances in Neural Information Processing Systems, pages 3098-3106, 2016.

B. Yan, P. Sarkar, and X. Cheng. Provable estimation of the number of blocks in block models. In International Conference on Artificial Intelligence and Statistics, pages 1185-1194, 2018.

F. W. Young. Multidimensional scaling: History, theory, and applications. Lawrence Erlbaum Associates, 1987.

S.-Y. Yun and A. Proutiere. Optimal cluster recovery in the labeled stochastic block model. In Advances in Neural Information Processing Systems, pages 965-973, 2016.
L. Zhang, S. Maji, and R. Tomioka. Jointly learning multiple measures of similarities from triplet comparisons. arXiv preprint arXiv:1503.01521, 2015.