# Fuzzy Clustering with Similarity Queries

Wasim Huleihel, Department of Electrical Engineering, Tel Aviv University, Tel Aviv 6997801, Israel. wasimh@tauex.tau.ac.il
Arya Mazumdar, Halıcıoğlu Data Science Institute, University of California, San Diego, La Jolla, CA 92093. arya@ucsd.edu
Soumyabrata Pal, College of Information & Computer Sciences, University of Massachusetts Amherst, Amherst, MA 01003. soumyabratap@umass.edu

Abstract

The fuzzy or soft $k$-means objective is a popular generalization of the well-known $k$-means problem, extending the clustering capability of $k$-means to datasets that are uncertain, vague, and otherwise hard to cluster. In this paper, we propose a semi-supervised active clustering framework, where the learner is allowed to interact with an oracle (domain expert), asking for the similarity between a certain set of chosen items. We study the query and computational complexities of clustering in this framework. We prove that having a few such similarity queries enables one to get a polynomial-time approximation algorithm to an otherwise conjecturally NP-hard problem. In particular, we provide algorithms for fuzzy clustering in this setting that ask $O(\mathrm{poly}(k)\log n)$ similarity queries and run with polynomial time complexity, where $n$ is the number of items. The fuzzy $k$-means objective is nonconvex, with $k$-means as a special case, and is equivalent to some other generic nonconvex problems such as non-negative matrix factorization. The ubiquitous Lloyd-type algorithms (or alternating-minimization algorithms) can get stuck at a local minimum. Our results show that by making a few similarity queries, the problem becomes easier to solve. Finally, we test our algorithms over real-world datasets, showing their effectiveness in real-world applications.

1 Introduction

The $k$-means objective for clustering is perhaps the most ubiquitous of all unsupervised learning paradigms. It is extremely well studied, with Lloyd's algorithm being the most popular solution method, while the problem in general is NP-hard [11, 7]. A variety of approximate solutions for $k$-means exist (such as [20, 3]). In recent times, there have been efforts to bring a flavor of active learning into $k$-means clustering, by allowing the learner to make a limited number of carefully chosen label/membership queries [4, 2, 18]. In this setting, the main objective is to show that a polynomial-time solution exists provided that a small number of such queries to an oracle is permitted. Note that an alternative to the membership query (i.e., a query of the form "does the $i$th element belong to the $j$th cluster?") is the same-cluster/similarity query (i.e., "do elements $i$ and $j$ belong to the same cluster?"). Given that representatives from each cluster are available, one can simulate any membership query with at most $k$ same-cluster queries. Hence, whatever is achievable with membership queries can also be achieved with same-cluster queries by asking at most $k$ times as many queries [4].

How realistic is it to allow the learner to make such label queries, and how realistic is the oracle assumption? It turns out that many clustering tasks, primarily entity resolution, are being delegated to crowdsourcing models. One can easily assume that the crowd is playing the role of an oracle here.
It is natural in these models to assign crowd workers sub-tasks, which can be either answering the membership query or the same-cluster query [5, 49, 50, 38, 23].

Fuzzy $k$-means or soft $k$-means is a generalization of the $k$-means objective that covers a lot more ground, being applicable to datasets where data points show affinity to multiple labels, the clustering criteria are vague, and data features are unavailable [8]. In fuzzy clustering, instead of an element being part of one fixed cluster, it is assumed that it has a membership value between 0 and 1 to each of the clusters. The objective of the clustering is to recover all of the membership values for each element or data point. This relaxation of the membership values from $\{0,1\}$ to $[0,1]$ makes fuzzy $k$-means an attractive model for overlapping clusters [52], multilabel classification [30], and numerous other applications such as pattern recognition [8], computational biology [6], bioinformatics [47], image segmentation [1, 19], and so on. What is more interesting is that solving fuzzy clustering is equivalent to approximately solving nonnegative symmetric matrix factorization problems [22]. Given such applications, it is somewhat surprising that the theoretical results known about fuzzy clustering are quite limited. Even the computational tractability of fuzzy $k$-means is only conjecturally known. In fact, the algorithmic study of fuzzy $k$-means seems to have started only recently [9, 10, 31].

Membership vs. similarity queries. As mentioned above, in the hard-clustering case, one can simulate any membership query with at most $k$ same-cluster (or similarity) queries [4]. In fuzzy clustering, we are given a set of $n$ vectors (items) in the $d$-dimensional Euclidean space, and we are tasked with determining the $k$-dimensional membership vector of each item. The clustering algorithm has access to the set of vectors, and can also make queries about the underlying soft clustering. As in hard clustering, these queries can take at least two forms. Specifically, membership queries refer to the existence of a membership oracle which, whenever queried, replies with the membership value of the $i$th item to the $j$th cluster, i.e., the $j$th entry of the $i$th membership vector. One could argue that querying the membership values, and specifically the existence of such a membership oracle, is not realistic and often impractical in real-world settings, since it requires knowledge of the relevant clusters. Instead, the similarity query model, which takes two or three elements as input and asks "how similar are all these elements?", is much easier to implement in practice. Accordingly, in a similar vein to hard clustering, we show that membership queries can be simulated by similarity queries, measured by the inner product between pairs (and triplets) of membership vectors. Finally, as will be explained in more detail, the responses of the oracle, or the target solution, do not need to correspond to the optimum solution of the fuzzy $k$-means objective, but instead need to satisfy only a few mild conditions. Accordingly, the target solution can be a noisy/quantized version of the optimum solution of the fuzzy $k$-means objective function, making our setup highly practical in crowdsourcing settings. A natural question that one can ask about fuzzy clustering, following the lead of [4], is: how many membership/similarity queries does one need to ask so that the clustering can be efficiently solved with a good approximation of the target solution?
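To make these query models concrete, the toy sketch below (ours, not from the paper's implementation) computes pairwise and triplet similarity answers as inner products of membership vectors; the membership matrix `U` and all function names are hypothetical stand-ins for the oracle.

```python
import numpy as np

# Hypothetical ground-truth membership matrix: n items, k clusters, rows sum to 1.
rng = np.random.default_rng(0)
n, k = 6, 3
U = rng.dirichlet(alpha=np.ones(k), size=n)

def membership_query(i, j):
    """Membership query: the weight of item i in cluster j."""
    return U[i, j]

def pairwise_similarity_query(p, q):
    """Fuzzy pairwise similarity of items p and q: <U_p, U_q>."""
    return float(U[p] @ U[q])

def triplet_similarity_query(p, q, r):
    """Fuzzy triplet similarity: sum over clusters of U_pt * U_qt * U_rt."""
    return float(np.sum(U[p] * U[q] * U[r]))

print(pairwise_similarity_query(0, 1), triplet_similarity_query(0, 1, 2))
```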
In this paper, our main motivation is to answer this rather basic question. It has to be noted that revealing a fraction of labels during a fuzzy $k$-means clustering algorithm has been studied with some heuristics in [42], but not only was the model different from ours, a rigorous theoretical study was also lacking in the past literature.

Our contributions and techniques. We design several algorithms and analyze their query and computational complexities. Specifically, the first algorithm we propose works in two phases: in the first phase, we estimate the center of each cluster using non-adaptive queries, and in the second phase we recover the membership of each element to each cluster using the estimated centers obtained in the first phase. We show that with this algorithm it is possible to get an approximation of the membership vectors by making only $O(\mathrm{poly}(k/\beta)\log n)$ queries, where, informally speaking, $\beta$ is the ratio of the volume of the smallest cluster to the average volume. Then, we show that by allowing one to ask adaptive/sequential queries in the first phase, the polynomial dependence of the query complexity on $\beta$ can be further improved. Despite the strong dependency of the query complexity on $\beta$, for a large regime of parameters our query complexity is non-trivial and sublinear in $n$. Interestingly, for the special case of $k = 2$, we can completely get rid of the dependency of the query complexity on $\beta$, and it suffices to ask only $O(\log^2 n)$ queries. For our theoretical results to hold, we make an assumption on the dataset that quantifies the clusterability of the data. We would like to emphasize that our assumption is only needed for deriving the theoretical guarantees; our algorithms are demonstrated to perform well on real-world datasets irrespective of whether our assumption holds or not. The computational complexity of all the algorithms scales as $d$ times the number of queries.

It is to be noted that the analysis of the aforementioned algorithms is technically much more demanding than solving the analogous query complexity problem for hard $k$-means. For hard $k$-means, a general principle would be to sample a small set of items and estimate the center of the largest cluster from the samples (see [4]). Subsequently, all the items closest to the estimated center can be assigned to it, and the cluster can be removed from consideration. One can iterate over this procedure to find all the clusters. However, the size of a cluster is not well-defined in the fuzzy setting; moreover, it is not possible to assign items to a cluster and remove them from consideration, since they can very well have nonzero memberships to other clusters. To tackle these difficulties, we have to use more challenging technical tools and analysis methods than the hard $k$-means literature. In a nutshell, since at any step of the algorithm we cannot remove items from consideration, we propose sophisticated random sub-sampling algorithms that explore the space of items efficiently. Specifically, having reliable approximations for the centers and memberships of some processed clusters, we partition the space of items into bins that have the same approximated sum of membership weights to unprocessed clusters. Sampling equally from every bin ensures that we obtain enough samples from items which have high membership weight to those clusters which have not yet been approximated.
Also, contrary to hard $k$-means, where a simple (unbiased) empirical mean estimator is used to approximate the means, in the fuzzy setting we end up with a form of self-normalized importance sampling estimators. While these estimators are biased, we managed to bound their deviations in an elegant way using concentration inequalities. We tested our algorithms over both synthetic and real-world datasets, and illustrate the tradeoff and effect of different values of $\beta$ and $k$ on the estimation error and query complexity.

Related work. Clustering with limited label queries has mostly followed two different lines of work. In the first line of work, queries were used to make objective-based clustering, such as $k$-means, polynomial-time solvable [4, 2, 18, 12, 13, 14]. The problem considered here is most closely related to this line of work, albeit, instead of hard $k$-means, we focus on a soft objective. In the second line of work, instead of an objective-based clustering, it is assumed that there exists a ground-truth clustering, and by label-querying, one gets a (noisy) answer from an oracle about this ground truth. The aim of this line of work has been to give statistical-computational guarantees on recovery of the ground truth [36, 37, 35, 2, 15, 46, 43]. The work in this canon that is closest to our problem is [28], where an overlapping clustering ground truth was considered. At a high level, we also consider overlapping clusters; however, we focus on objective-based clustering instead. It is worth noting that clustering with same-cluster, or other label queries, has found connections with and applications to correlation clustering [15, 43], community detection [37, 17], heavy hitters [44], and possibly other areas that we might have missed. Our problem can be seen as approximately equivalent to non-negative symmetric matrix factorization [40] with limited access to the entries of the factors (via queries) [22].

Organization. The rest of the paper is organized as follows. In Section 2 we present our model and the learning problem. Then, in Section 3 we present our algorithms and their theoretical guarantees. The main theoretical results can be found in Theorems 1, 2, and 3. Due to space limitations, proofs, some pseudocodes, experimental results over both synthetic and real-world datasets, and conclusions are relegated to the supplementary material.

2 Model formulation and background

2.1 The optimization problem

We start by describing the hard $k$-means problem, and then move forward to the fuzzy $k$-means problem (soft clustering). Let $X \subset \mathbb{R}^d$ be a set of $d$-dimensional points, with $|X| = n$. We denote this set of vectors by $\{x_i\}_{i=1}^n$, and assume without loss of generality that $x_i \in \mathcal{B}(0, R)$, for any $i \in [n]$, where $\mathcal{B}(a, R)$ is the $L_2$-Euclidean ball of radius $R \in \mathbb{R}_+$ centered around $a \in \mathbb{R}^d$. Let $\mathcal{C}_X \triangleq \{C_1, C_2, \ldots, C_k\}$ be a clustering (or partition) of $X$, where $k \in \mathbb{N}$ is the number of clusters. We say that a clustering $\mathcal{C}_X$ is center-based if there exists a set of centers $\mu \triangleq \{\mu_1, \mu_2, \ldots, \mu_k\} \subset \mathbb{R}^d$ such that the clustering is induced by the Voronoi diagram over those center points. The objective function of hard $k$-means is
$$J_{\mathrm{km}}(X, \mu, U) = \sum_{i=1}^{n}\sum_{j=1}^{k} U_{ij}\,\|x_i - \mu_j\|^2, \tag{1}$$
where $U \in \{0, 1\}^{n \times k}$ is the partition matrix, i.e., the membership weight $U_{ij}$ is 1 if $x_i \in C_j$, and 0 otherwise. In hard clustering each data point is assigned to exactly one cluster, and every cluster must contain at least one data point. The goal of $k$-means is to minimize (1) w.r.t. $(\mu, U)$. In soft clustering there is no partitioning of $X$.
Instead, each cluster of a fuzzy clustering is described by a vector of the fractions of points assigned to it by the membership function. In particular, the membership weights take on real values in $[0, 1]$ rather than $\{0, 1\}$, as in hard clustering. The fuzzy $k$-means problem is defined as follows.

Definition 1 (Fuzzy $k$-means). Let $\nu > 1$ and $k \in \mathbb{N}$. The fuzzy $k$-means problem is to find a set of means $\mu = \{\mu_\ell\}_{\ell \in [k]} \subset \mathbb{R}^d$ and a membership matrix $U \in [0, 1]^{n \times k}$ minimizing
$$J_{\mathrm{fm}}(X, \mu, U) \triangleq \sum_{i=1}^{n}\sum_{j=1}^{k} U_{ij}^{\nu}\,\|x_i - \mu_j\|^2, \tag{2}$$
subject to $\sum_{j=1}^{k} U_{ij} = 1$ and $0 < \sum_{i=1}^{n} U_{ij} < n$, for all $i \in [n]$ and $j \in [k]$.

The parameter $\nu$ is called the fuzzifier, and is not subject to optimization. It can be shown that when $\nu \to 1$, the fuzzy $k$-means solution coincides with that of the hard $k$-means problem. Similarly to the classical $k$-means problem, it is easy to optimize the means or the memberships of fuzzy $k$-means, assuming the other part of the solution is fixed. Specifically, given a set of means $\mu$, it can be shown that $U$ in (3) below gives the optimal membership weights minimizing the cost in (2) (see, e.g., [8, Theorem 11.1]). On the other hand, given a set of membership weights $U$, the weighted centers in (4) below are the optimal minimizers:
$$U_{ij} = \left[\sum_{\ell=1}^{k}\left(\frac{\|x_i - \mu_j\|}{\|x_i - \mu_\ell\|}\right)^{2/(\nu-1)}\right]^{-1}, \tag{3}$$
$$\mu_j = \frac{\sum_{i=1}^{n} U_{ij}^{\nu}\,x_i}{\sum_{i=1}^{n} U_{ij}^{\nu}}, \tag{4}$$
for all $i \in [n]$, $j \in [k]$. Iterating these solutions is known as Lloyd's algorithm [32], which provably lands on a local optimum. It is well known, however, that finding the optimal solution (i.e., the global minimum) of the $k$-means problem is NP-hard in general [33, 48]. While the same result is unknown for the fuzzy formulation, it is strongly believed to be the case (e.g., [9, 16]). Finally, although several validity indexes for the performance of fuzzy $k$-means solutions have been proposed in the literature (see, e.g., [24, 25]), the Xie–Beni index [51] is the most widely used. This measure is defined as follows:¹
$$\mathrm{XB}(X, \mu, U) \triangleq \frac{J_{\mathrm{fm}}(X, \mu, U)}{nk\,\min_{i \neq j}\|\mu_i - \mu_j\|^2}. \tag{5}$$
For the rest of this paper, we denote a clustering by $P \triangleq (\mu, U)$, and we replace $\mathrm{XB}(X, \mu, U)$ by $\mathrm{XB}(X, P)$. Accordingly, the optimal solution to the fuzzy $k$-means problem (Definition 1) is denoted by $P^\star$. Finally, we say that a clustering $P$ is consistent center-based if: (i) given membership weights $U$, the centers $\mu$ are given by (4); and (ii) the membership weights $U$ are monotone, i.e., if $\|x_i - \mu_j\| \le \|x_\ell - \mu_j\|$ then $U_{ij} \ge U_{\ell j}$, for any $i, \ell \in [n]$ and $j \in [k]$. The first condition appears also in [4]. The second condition excludes inconsistent membership weights $U$. In particular, as is shown in [16], and in Lemma 5 in the appendix, both conditions are satisfied by the optimal solution $P^\star$, as well as by the solutions produced by (3) and (4). We would like to clarify that the above assumptions are only required for the theoretical analysis; our proposed algorithms can be implemented regardless of these assumptions. As is shown in Appendix F, in real-world applications, our algorithm works well even if these assumptions do not hold.

¹In (5) we divide by $nk$ and not just by $n$, as in [51], where $k$ was considered a fixed parameter. If $k$ is allowed to grow (e.g., with $n$), then our definition makes more sense, as in general the numerator in (5) can grow with $k$.
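For concreteness, here is a minimal sketch of the alternating updates (3)–(4) and the normalized Xie–Beni index (5), assuming the standard fuzzy c-means update formulas; it illustrates the objective in Definition 1 and is not one of the query-based algorithms analyzed in this paper. The function names (`fuzzy_kmeans`, `xie_beni`) are ours.

```python
import numpy as np

def fuzzy_kmeans(X, k, nu=2.0, iters=100, seed=0):
    """Lloyd-type alternating minimization of J_fm; lands on a local optimum."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]            # initialize centers
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2) + 1e-12
        # Membership update (3): U_ij = 1 / sum_l (||x_i-mu_j|| / ||x_i-mu_l||)^(2/(nu-1))
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (nu - 1.0))
        U = 1.0 / ratio.sum(axis=2)
        # Center update (4): mu_j = sum_i U_ij^nu x_i / sum_i U_ij^nu
        W = U ** nu
        mu = (W.T @ X) / W.sum(axis=0)[:, None]
    return mu, U

def xie_beni(X, mu, U, nu=2.0):
    """Normalized Xie-Beni index (5): J_fm / (n k min_{i != j} ||mu_i - mu_j||^2)."""
    dist2 = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2) ** 2
    J = np.sum((U ** nu) * dist2)
    sep = min(np.sum((mu[i] - mu[j]) ** 2)
              for i in range(len(mu)) for j in range(len(mu)) if i != j)
    return J / (X.shape[0] * len(mu) * sep)
```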
2.2 Domain expert/oracle

Our goal is to construct efficient algorithms with the aid of a domain expert who can give an approximate solution to the fuzzy $k$-means problem. Specifically, given a clustering $P$, a $P$-oracle is a function $\mathcal{O}_P$ that answers queries consistently with $P$. One can think of such an oracle as a user that has some idea about its desired solution, enough to answer the algorithm's queries. Importantly, note that as long as $P$ is a consistent center-based clustering, it need not be the optimal solution $P^\star$ minimizing (2). The algorithm then tries to recover (or approximate) $P$ by querying the oracle a small number of times. For hard clustering, the same-cluster query oracle model (i.e., "do elements $i$ and $j$ belong to the same cluster?") was considered in [4]. Using a certain number of such same-cluster queries and the data points $X$, the clustering $P$ can be found efficiently [4], while finding the optimal solution without the aid of an oracle is NP-hard. For the fuzzy formulation, one can think of analogous oracle models, and in this paper, we focus on the following.

Definition 2 (Membership-Oracle). A membership query asks for the membership weight of an instance $x_i$ to a cluster $j$, i.e., $\mathcal{O}_{\mathrm{fuzzy}}(i, j) = U_{ij}$.

Basically, we ask the oracle for the membership weight of $x_i$ to cluster $C_j$. Using a certain amount of such queries and the dataset $X$, we would like to approximate the solution of the fuzzy $k$-means problem. As discussed in the Introduction, one could argue that querying the membership values, and specifically the existence of such a membership oracle, is not realistic and often impractical in real-world settings, since it requires knowledge of the relevant clusters. Instead, the following similarity query models are easy to implement in practice.

Definition 3 (Similarity-Oracle). A fuzzy pairwise similarity query asks for the similarity of two distinct instances $x_p$ and $x_q$, i.e., $\mathcal{O}_{\mathrm{sim}}(p, q) = \langle U_p, U_q\rangle$. A fuzzy triplet similarity query asks for the similarity of three distinct instances $x_p, x_q, x_r$, i.e., $\mathcal{O}_{\mathrm{triplet}}(p, q, r) = \sum_{t \in [k]} U_{pt}\,U_{qt}\,U_{rt}$.

We show in Appendix H, however, that each membership query can be simulated with a few similarity queries. From a theoretical perspective, membership queries can be simulated with only fuzzy pairwise similarity queries if there exist many pure instances (instances that have a membership weight of 1 to a particular cluster) in the dataset; see Lemma 14 in the appendix. Unfortunately, this strong condition is necessary as well. But it turns out that by using a few triplet similarity queries at the beginning, we can resolve this issue. Specifically, we prove the following result, informally stated, which is interesting in its own right.

Lemma 1 (Membership to Similarity Queries Reduction, Informal). Suppose there exists an algorithm $\mathcal{A}_n$ with time complexity $T_n$ that uses $m$ membership queries to provide some guaranteed solution of the fuzzy $k$-means problem. In that case, under mild conditions on the dataset, the same guarantee can be obtained with $O(k^3)$ fuzzy triplet similarity queries and $O(km)$ fuzzy pairwise similarity queries, with a time complexity of $T_n + O(mk + k^3)$.

Therefore, we can express everything in terms of the more natural notion of similarity queries; hence, for the rest of the paper we use membership queries to keep the presentation of our algorithms simple.
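To illustrate the simplest case of the reduction behind Lemma 1 (the pure-instance case of Lemma 14): if $x_p$ is pure for cluster $j$, i.e., $U_p = e_j$, then a single pairwise similarity query already reveals any membership weight, since $\langle U_p, U_i\rangle = U_{ij}$. The toy sketch below assumes such pure instances are handed to us; the triplet queries in Lemma 1 are what remove that assumption. All names and data here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3
U = rng.dirichlet(np.ones(k), size=n)
U[:k] = np.eye(k)          # assume items 0..k-1 are pure instances, one per cluster

def pairwise_similarity_query(p, q):
    return float(U[p] @ U[q])

def membership_via_similarity(i, j, pure_index_of_cluster):
    """Simulate O_fuzzy(i, j) with one pairwise similarity query to a pure instance."""
    p = pure_index_of_cluster[j]
    return pairwise_similarity_query(p, i)   # <e_j, U_i> = U_ij

pure = {j: j for j in range(k)}
i, j = 5, 2
assert abs(membership_via_similarity(i, j, pure) - U[i, j]) < 1e-12
```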
Noisy/Inaccurate Oracle. Our algorithms work even with inaccuracies in the oracle answers, provided the deviation is small. Under the assumption that such inaccuracies are random, in particular, a repetition algorithm can remedy such errors. Specifically, consider the case where $\mathcal{O}_{\mathrm{fuzzy}}(i, j) = U_{ij} + \eta_{ij}$, where $\eta_{ij}$ is a zero-mean random variable, and assume the errors to be independent, i.e., the answers are collected from independent crowd-workers. Then, in Appendix G we show how one can convert any "successful" solver for the noiseless problem into a solver for the noisy problem; generally speaking, by repeating the noiseless solver for $T$ consecutive steps using the noisy responses, and then averaging the results to obtain the final estimate. Note that the technique used to handle the subsequent steps is also applicable to any bounded error. For simplicity of exposition and clarity, we opted to focus on the noiseless setting given in Definition 2.

2.3 Objective and learning problem

The goal of the algorithm is to find an approximate clustering that is consistent with the answers given to its queries. We have the following definition.

Definition 4 (Oracle-Based Solver). Let $(X, P)$ be a clustering instance. An algorithm $\mathcal{A}$ is called an $(\epsilon_1, \epsilon_2, Q)$-solver if it can approximate $P$ by having access to $X$ and making at most $Q$ membership queries to a $P$-oracle. Specifically, for any $\epsilon_1, \epsilon_2 > 0$, the algorithm outputs $\hat{P} = (\hat{\mu}, \hat{U})$ such that $\max_{j \in [k]}\|\mu_j - \hat{\mu}_j\|_2 \le \epsilon_1$ and $\max_{i \in [n], j \in [k]}|U_{ij} - \hat{U}_{ij}| \le \epsilon_2$. Such an algorithm is a polynomial $(\epsilon_1, \epsilon_2, Q)$-solver if its time complexity is polynomial in $(n, k, 1/\epsilon_1, 1/\epsilon_2)$.

We would like to emphasize that we seek an approximate solution rather than an exact one. It is important to note that $Q = O(nk)$ membership queries suffice to recover the $P$-oracle clustering. Also, with unbounded computational resources, one can find the optimal clustering $P^\star$ without any queries. Nonetheless, in this paper, we would like to have a sub-linear dependency of the query complexity on $n$ and a polynomial dependency of the time complexity on $n$ and $k$. Having an algorithm such as the one in Definition 4 implies the following guarantee, proved in Appendix A.

Lemma 2. Let $(X, P)$ be a clustering instance, and let $\mathcal{A}$ be an $(\epsilon_1, \epsilon_2, Q)$-solver. Then,
$$\left|\mathrm{XB}(X, P) - \mathrm{XB}(X, \hat{P})\right| \le \mathrm{XB}(X, P)\cdot O(\epsilon_1) + \frac{O(\epsilon_2)}{\min_{i \neq j}\|\mu_i - \mu_j\|^2}. \tag{6}$$

The lemma above shows that if $\epsilon_1$ and $\epsilon_2$ are small enough, then the XB measure associated with an $(\epsilon_1, \epsilon_2, Q)$-solver is "close" to the XB measure associated with $P$. Note, however, that the r.h.s. of (6) depends inversely on $\min_{i \neq j}\|\mu_i - \mu_j\|^2$. Accordingly, if for some clustering instance $(X, P)$ this quantity is too small compared to the estimation error, then the difference $|\mathrm{XB}(X, P) - \mathrm{XB}(X, \hat{P})|$ is uncontrolled. This is reasonable, since when clusters are "close", the estimation task gets harder.

Finally, we introduce certain notations that will be needed for the presentation of our main results. Given a set of points $X$ and $v \in \mathbb{R}^d$, let $\pi_v : [n] \to [n]$ be the permutation defined on $[n]$ such that $\|v - x_{\pi_v(i)}\| \le \|v - x_{\pi_v(j)}\|$ for all $i < j \in [n]$. To wit, $\pi_v$ is the ranking of the elements of $X$ when they are sorted in ascending order w.r.t. their distance from $v$. Then, we let $\gamma \in \mathbb{R}_+$ be the maximum positive number such that for all $j \in [k]$,
$$\pi_{\hat{\mu}} = \pi_{\mu_j}, \quad \forall \hat{\mu} \in \mathcal{B}(\mu_j, \gamma). \tag{8}$$
Intuitively, if one has reliable estimates for the centers, then the ordering of $X$ w.r.t. these estimated centers is the same as the ordering w.r.t. the target centers. Next, we define the cluster membership size associated with $P$ by $\sum_{i \in [n]} U_{ij}$, for all $j \in [k]$, and a parameter $\beta \in \mathbb{R}_+$ by $\beta \triangleq \min_{j \in [k]} k n^{-1}\sum_{i \in [n]} U_{ij}$. Accordingly, in terms of $\beta$, the minimum cluster membership size is at least $\beta n / k$. Notice that $\beta$ cannot exceed unity since $\sum_{i,j} U_{ij} = n$, and there must exist at least one cluster whose membership size is at most $n/k$. We mention that $\beta$ was introduced in [18] as well.
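For intuition about the parameter $\beta$, the toy computation below (on hypothetical data) evaluates the cluster membership sizes and $\beta = \min_j (k/n)\sum_i U_{ij}$; with this normalization the smallest cluster membership size equals exactly $\beta n / k$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 1000, 4
U = rng.dirichlet(alpha=[8.0, 4.0, 2.0, 1.0], size=n)   # imbalanced fuzzy clusters

sizes = U.sum(axis=0)                 # cluster membership sizes; they sum to n
beta = (k / n) * sizes.min()          # beta as defined above
print("membership sizes:", np.round(sizes, 1))
print("beta =", round(beta, 3), "-> smallest membership size =", round(beta * n / k, 1))
```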
3 Algorithms and theoretical guarantees

3.1 Two-Phase algorithm

We start by providing a non-formal description of our algorithm, whose pseudocode is given in Algorithms 1–3. Specifically, in the first three steps of Algorithm 1 we approximate the centers of all clusters. To that end, we start by sampling at random $m$ elements from $[n]$ with replacement, and denote the obtained samples by $S$. The value of $m$ is specified in Theorem 1 below. Subsequently, we query the membership weights $U_{ij}$, for all $i \in S$ and $j \in [k]$. Using these queries, we approximate the center of every cluster using the estimator defined in Step 3 of Algorithm 1. Note that this estimator is the most natural candidate when one knows the membership weights only for a set of sub-sampled elements. Although this algorithm is quite intuitive, it suffers from the natural drawback that if the size of a particular cluster is small, many samples are needed for a reliable estimate of its center.

Algorithm 1: Parallel algorithm for approximating $P$.
Input: $X$, $\mathcal{O}_{\mathrm{fuzzy}}$, $\nu$, $m$, and $\epsilon$. Output: An estimate $\hat{P}$.
1: Let $S$ be a set of $m$ elements sampled at random from $[n]$ with replacement.
2: Query $\mathcal{O}_{\mathrm{fuzzy}}(i, j)$, $\forall i \in S$, $j \in [k]$.
3: Compute $\hat{\mu}_j = \sum_{i \in S} U_{ij}^{\nu} x_i \big/ \sum_{i \in S} U_{ij}^{\nu}$, $\forall j \in [k]$.
4: for $j = 1, 2, \ldots, k$ do
5:   $\{\hat{U}_{ij}\}_{i=1}^{n} \leftarrow \mathrm{MEMBERSHIP}(X, \hat{\mu}_j, \nu, \epsilon)$.
6: end for
7: $\hat{U}_{ij} \leftarrow \hat{U}_{ij} + \big(1 - \sum_{j' \in [k]}\hat{U}_{ij'}\big)/k$, $\forall i \in [n]$, $j \in [k]$.

Algorithm 2: MEMBERSHIP$(X, \hat{\mu}_j, \nu, \epsilon)$
Input: $X$ and $\mathcal{O}_{\mathrm{fuzzy}}$. Output: An estimate $\{\hat{U}_{ij}\}_{i=1}^{n}$.
1: Find $\pi_{\hat{\mu}_j}$ w.r.t. $X$.
2: for $s = 0, 1, \ldots, 1/\epsilon$ do
3:   Find the largest index $\ell_s \in [n]$ such that $U_{\pi_{\hat{\mu}_j}(\ell_s)j} \ge s\epsilon$, using BINARYSEARCH$(X, \pi_{\hat{\mu}_j}, s\epsilon)$.
4: end for
5: for $s = 0, 1, 2, \ldots, 1/\epsilon$ do
6:   for $i = \ell_s, \ell_s - 1, \ldots, \ell_{s+1} + 1$ do
7:     Set $\hat{U}_{\pi_{\hat{\mu}_j}(i)j} = s\epsilon$.
8:   end for
9: end for

Algorithm 3: BINARYSEARCH$(X, \pi, x)$
Input: $\mathcal{O}_{\mathrm{fuzzy}}$.
1: Set low $= 1$ and high $= n$.
2: while low $\neq$ high do
3:   Set mid $= \lceil(\mathrm{low} + \mathrm{high})/2\rceil$.
4:   Query $\mathcal{O}_{\mathrm{fuzzy}}(\pi(\mathrm{mid}), j)$.
5:   if $U_{\pi(\mathrm{mid})j} \ge x$ then
6:     Set low $=$ mid.
7:   else
8:     Set high $=$ mid $- 1$.
9:   end if
10: end while

Next, using the approximated centers, we approximate the membership weights of every element up to a precision of $\epsilon$ using Algorithm 2. To that end, in the first step of this algorithm, we sort the elements of $X$ in ascending order according to their distance from the estimated center. Accordingly, if the estimation error of the centers is at most $\gamma$, then (8) implies that the ordering of elements obtained by sorting w.r.t. the estimated centers is the same as the true ordering obtained by sorting according to distance (in ascending order) from the exact centers. Furthermore, the fact that $P$ is a consistent center-based clustering implies that for any $j \in [k]$ the membership weights $U_{\pi_{\hat{\mu}_j}(i)j}$ are non-increasing in $i$. Therefore, for any $j \in [k]$, using the binary search in Algorithm 3, we find the maximum index $\ell_s$ such that $U_{\pi_{\hat{\mu}_j}(\ell_s)j}$ is at least $s\epsilon$, for $s \in \{0, 1, \ldots, 1/\epsilon\}$. This creates an $\epsilon$-spaced grid, and we approximate each membership weight in accordance with the cell it falls into. Specifically, we assign $\hat{U}_{\pi_{\hat{\mu}_j}(t)j} = s\epsilon$, for all $t \in \{\ell_s, \ell_s - 1, \ldots, \ell_{s+1} + 1\}$. Clearly, the estimation error is at most $\epsilon$. Overall, each binary search step takes $O(\log n)$ queries, and therefore the total query complexity of this step is $O(\log n/\epsilon)$. Finally, Step 7 of Algorithm 1 shifts $\hat{U}_{ij}$ to ensure that $\sum_{j=1}^{k}\hat{U}_{ij} = 1$, for every $i \in [n]$.
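The following is a simplified Python rendering of the two-phase procedure (Algorithms 1–3), under direct access to a membership oracle (e.g., `oracle = lambda i, j: U_true[i, j]` for a ground-truth membership matrix): sampled, self-normalized center estimates, followed by binary searches for the $\epsilon$-grid boundaries along the ordering induced by each estimated center. It is a sketch only: the final row-normalization step, the choice of $m$, and all failure-probability bookkeeping from Theorem 1 below are omitted, and the names are ours.

```python
import numpy as np

def two_phase(X, oracle, k, nu=2.0, m=200, eps=0.1, seed=0):
    """Sketch of Algorithms 1-3: sampled center estimates, then eps-grid
    membership recovery via binary searches along the induced ordering."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Phase 1 (Steps 1-3): estimate centers from a random sample S of queried memberships.
    S = rng.choice(n, size=m, replace=True)
    W = np.array([[oracle(i, j) for j in range(k)] for i in S]) ** nu
    mu_hat = (W.T @ X[S]) / np.maximum(W.sum(axis=0)[:, None], 1e-12)
    # Phase 2 (Algorithm 2): memberships are non-increasing along the ordering by
    # distance to mu_hat[j], so each eps-level boundary is found by binary search.
    U_hat = np.zeros((n, k))
    for j in range(k):
        order = np.argsort(np.linalg.norm(X - mu_hat[j], axis=1))   # pi_{mu_hat_j}
        def last_at_least(thr):
            """Largest sorted position with membership >= thr (-1 if none)."""
            if oracle(order[0], j) < thr:
                return -1
            lo, hi = 0, n - 1
            while lo != hi:
                mid = (lo + hi + 1) // 2
                if oracle(order[mid], j) >= thr:
                    lo = mid
                else:
                    hi = mid - 1
            return lo
        prev = n - 1                      # every membership is >= 0
        thr = 0.0
        for s in range(1, int(round(1.0 / eps)) + 1):
            thr = s * eps
            cut = last_at_least(thr)
            U_hat[order[cut + 1: prev + 1], j] = (s - 1) * eps   # floor to the grid
            prev = cut
            if prev < 0:
                break
        if prev >= 0:                     # items whose membership reaches the top level
            U_hat[order[: prev + 1], j] = thr
    # (The final row-normalization step of Algorithm 1 is omitted in this sketch.)
    return mu_hat, U_hat
```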
We have the following result, the detailed proof of which is delegated to Appendix C.

Theorem 1. Let $(X, P)$ be a consistent center-based clustering instance, and recall the definitions of $\gamma \in \mathbb{R}_+$ and $\beta \in (0, 1)$. Pick $\epsilon \le \gamma$, $\epsilon \in (0, 1)$, and $\delta \in (0, 1)$. Take as input to Algorithm 1 $m = c\,\frac{R^4 k^{4\nu}}{(\epsilon\beta)^4}\log\frac{k}{\delta}$, for some constant $c > 0$. Then, with probability at least $1 - \delta$, Algorithm 1 is an $(\epsilon, \epsilon, Q)$-solver using
$$Q = O\left(\frac{R^4 k^{4\nu+1}}{(\epsilon\beta)^4}\log\frac{k}{\delta} + \frac{k}{\epsilon}\log n\right)$$
membership queries. Furthermore, Algorithm 1's time complexity is of order
$$O\left(kn\log n + d\left[\frac{R^4 k^{4\nu+1}}{(\epsilon\beta)^4}\log\frac{k}{\delta} + \frac{k}{\epsilon}\log n\right]\right).$$

Proof Sketch of Theorem 1. To prove Theorem 1 we first show that, with high probability, Steps 1–3 of Algorithm 1 output an $\epsilon$-approximation of all centers. To that end, we first show in Lemma 6 that, for a given cluster $j \in [k]$, the estimator $\hat{\mu}_j$ deviates from $\mu_j$ by at most the prescribed error, with probability at least $1 - \delta$. This result follows by noticing that both the numerator and denominator of $\hat{\mu}_j$ are unbiased estimates of the corresponding numerator and denominator of $\mu_j$ in (4). Representing $\hat{\mu}_j$ as a sum of the true underlying center $\mu_j$ and a certain error term, the generalized Hoeffding inequality in Lemma 4 implies the above estimation error guarantee. In light of this result, using Hölder's inequality and the union bound, we show in Lemma 8 that $\|\hat{\mu}_j - \mu_j\|_2 \le \epsilon$ for all $j \in [k]$, if $m$ is as given in Theorem 1. Note that this center-estimation step requires a total of $km$ membership-oracle queries. Next, conditioned on the above proxy of any given center, we show in Lemma 7 that Algorithm 2 estimates the corresponding membership weights within a certain range of error with high probability. Specifically, for a given cluster $j$, we prove that Algorithm 2 outputs $\hat{U}_{ij}$, for $i \in [n]$, such that $0 \le U_{ij} - \hat{U}_{ij} \le \epsilon$ and $\hat{U}_{ij} \in \{0, \epsilon, 2\epsilon, \ldots, 1\}$, $\forall i \in [n]$, using $O(\log n/\epsilon)$ queries to the membership oracle. Roughly speaking, this result follows from the arguments presented in the paragraph preceding the statement of Theorem 1. Finally, it is evident from the above that the total number of membership-oracle queries is $O(km + k\log n/\epsilon)$.

Note that both the query and computational complexities in Theorem 1 exhibit an exponential dependency on the "fuzzifier" $\nu$. As mentioned in Subsection 2.1, this parameter is not subject to optimization and, usually, practitioners pick $\nu = 2$. Nonetheless, there has been a lot of work on this topic; one notable example is [27], which presents an initial theoretical approach towards the investigation of this question. It seems that the value of $\nu$ is influenced by the structure of the dataset, i.e., the number of clusters and the distribution of each cluster. Currently, a definite answer on how to pick $\nu$ is unknown. Indeed, since there is no baseline, one cannot decide which value of $\nu$ is preferable. Note also that, despite the fact that our query complexity naturally depends on $\nu$, it cannot be used to determine the "best" value of $\nu$. Assuming the existence of a ground truth, in Appendix F we compare the clustering accuracy for different values of $\nu$ on a real-world dataset.
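As a side illustration of the fuzzifier's role (a folklore property of the update (3), not a result of this paper): for fixed centers, small $\nu$ pushes each membership vector toward a hard assignment, while large $\nu$ pushes it toward the uniform vector $(1/k, \ldots, 1/k)$. A toy computation:

```python
import numpy as np

x = np.array([0.0, 0.0])
mu = np.array([[1.0, 0.0], [3.0, 0.0], [6.0, 0.0]])     # three fixed centers
d = np.linalg.norm(x - mu, axis=1)

for nu in [1.05, 2.0, 10.0]:
    ratio = (d[:, None] / d[None, :]) ** (2.0 / (nu - 1.0))
    u = 1.0 / ratio.sum(axis=1)                          # membership update (3)
    print(f"nu={nu:5.2f}  memberships={np.round(u, 3)}")
# nu close to 1 concentrates the mass on the nearest center (hard k-means limit);
# large nu spreads the memberships toward the uniform vector.
```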
3.2 Sequential algorithm

As mentioned in the previous subsection, the main drawback faced by Algorithm 1 is its undesirable strong dependency on $\beta$. The algorithm proposed in this section remedies this drawback to some extent. The algorithm works as follows. In the first four steps of Algorithm 4, we randomly draw $m$ samples from $[n]$, and query the membership weights that correspond to all these samples. Using these samples we estimate the center of the largest cluster $t_1$, i.e., the one having the maximal total membership weight $\sum_{i \in S} U_{ij}$ among the $k$ clusters. Corollary 4 in Section D specifies the number of such elements needed to have a reliable estimate. Then, using this approximated center, we estimate the membership of every element in cluster $t_1$ using Algorithm 2, such that the estimate is a multiple of $\epsilon_1$ and, further, the estimate has an additive error of at most $\epsilon_1$, just as we did in the two-phase algorithm. For the rest of the clusters, we follow the sequential procedure in Steps 5–13 of Algorithm 4, whose guarantees are characterized in the following result.

Theorem 2. Let $(X, P)$ be a consistent center-based clustering instance, and recall the definitions of $\gamma \in \mathbb{R}_+$ and $\beta \in (0, 1)$. Take $m = c\,(R/\epsilon)^4\log\frac{4}{\delta}$ and $r = c\,\frac{R^4 k^{4\nu}}{\epsilon^4\epsilon_1^4\beta^4}\log\frac{4k}{\epsilon_1\delta}$, for some constant $c > 0$, and any $\delta \in (0, 1)$. Then, with probability at least $1 - \delta$, Algorithm 4 is an $(\epsilon, \epsilon_2, Q)$-solver using
$$Q = O\left(\frac{R^4 k^{4\nu+1}}{\epsilon^4\epsilon_1^4\beta^4}\log\frac{4k}{\epsilon_1\delta} + \frac{k}{\epsilon_1}\log n + \frac{k}{\epsilon_2}\log n\right)$$
membership queries, for any chosen $\epsilon \le \gamma$ and $\epsilon_2 \le \epsilon_1 \le 1/k$. Furthermore, Algorithm 4's time complexity is of order
$$O\left(d\left[\frac{R^4 k^{4\nu+1}}{\epsilon^4\epsilon_1^4\beta^4}\log\frac{4k}{\epsilon_1\delta} + \frac{k}{\epsilon_1}\log n + \frac{k}{\epsilon_2}\log n\right] + kn\log n\right).$$

The proof of Theorem 2 follows from a technically complicated induction. Due to space limitations, the details have been delegated to Appendix D; below, we provide only a short sketch.

Proof Sketch of Theorem 2. Suppose that we have approximated the means of $\ell$ clusters, indexed by $T_\ell \triangleq \{t_1, t_2, \ldots, t_\ell\}$, with an estimation error of at most $\gamma$, and further approximated the membership weights of every element in $X$ for all clusters indexed by $T_\ell$, up to an additive error $\epsilon_1$. The leftover clusters, indexed by $[k]\setminus T_\ell$, are the unprocessed clusters. The above implies that we can also approximate the sum $\sum_{j \in [\ell]} U_{it_j}$, for each $i \in [n]$, up to an additive error $\epsilon_1$. Note that, since $\hat{U}_{it_j}$ is a multiple of $\epsilon_1$ for all $j \in [\ell]$, $\sum_{j \in [\ell]}\hat{U}_{it_j}$ must also be a multiple of $\epsilon_1$. Subsequently, we partition $X$ into bins, where the elements in a particular bin have the same approximated sum of membership weights. Formally, let $\mathcal{S} \triangleq \{0, 1, \ldots, 1/\epsilon_1\}$. For each $s \in \mathcal{S}$, we define $X_s \triangleq \{i \in [n] : \sum_{j \in [\ell]}\hat{U}_{it_j} = s\epsilon_1\}$ to be the $s$th bin. Then, in Steps 9–10 of Algorithm 4, we randomly sample $\min(r, |X_s|)$ elements from each bin with replacement, and denote the resulting sub-sampled set by $Y_s$. The value of $r$ is specified in Theorem 2. Sampling equally from every bin ensures that we obtain enough samples from elements which have high membership weight to the unprocessed clusters, i.e., for which $\sum_{j \in [k]\setminus[\ell]} U_{it_j}$ is large. Given these sub-sampled sets, we query the membership weights $U_{ij}$, for all $i \in \cup_{s \in \mathcal{S}} Y_s$ and $j \in [k]\setminus T_\ell$. Using these queries we would like to approximate the center of some unprocessed cluster. We choose the unprocessed cluster having the largest total membership size, weighted by the size of each bin. Mathematically, the index of the $(\ell+1)$th cluster and its approximated center are given as follows:
$$t_{\ell+1} \triangleq \arg\max_{j \in [k]\setminus T_\ell}\ \sum_{s \in \mathcal{S}}\frac{|X_s|}{|Y_s|}\sum_{i \in Y_s} U_{ij} \quad\text{and}\quad \hat{\mu}_{t_{\ell+1}} \triangleq \frac{\sum_{s \in \mathcal{S}}\frac{|X_s|}{|Y_s|}\sum_{i \in Y_s} U_{it_{\ell+1}}^{\nu}\,x_i}{\sum_{s \in \mathcal{S}}\frac{|X_s|}{|Y_s|}\sum_{i \in Y_s} U_{it_{\ell+1}}^{\nu}}. \tag{9}$$
We choose the number of samples $r$ large enough so that the estimation error is less than $\epsilon$. The above steps are repeated until all unprocessed clusters are exhausted. Finally, once we have approximated all the centers, we apply Algorithm 2 to approximate the membership weight of every element to every cluster up to an additive error of $\epsilon_2 \le \epsilon_1$.
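To make the induction step concrete, here is a minimal sketch of one iteration of Steps 6–12 of Algorithm 4 under the same hypothetical oracle access as before: bin items by their approximate membership mass to the processed clusters, sample $\min(r, |X_s|)$ items per bin, and form the bin-size-weighted sums of (9). The exact weighting and the choice of $r$ follow the paper's analysis; this code only illustrates the mechanics, and all names are ours.

```python
import numpy as np

def next_cluster_and_center(X, U_hat, processed, oracle, k, nu=2.0,
                            eps1=0.1, r=50, seed=0):
    """One induction step of Algorithm 4 (Steps 6-12), sketched: bin items by their
    approximate membership mass to the processed clusters, sample min(r, |bin|)
    items per bin, and form the bin-size-weighted sums used in (9)."""
    rng = np.random.default_rng(seed)
    unprocessed = [j for j in range(k) if j not in processed]
    # Bin X_s holds the items whose approximated processed mass equals s * eps1.
    mass = np.rint(U_hat[:, processed].sum(axis=1) / eps1).astype(int)
    num = {j: np.zeros(X.shape[1]) for j in unprocessed}   # numerators of (9)
    den = {j: 0.0 for j in unprocessed}                    # denominators of (9)
    score = {j: 0.0 for j in unprocessed}                  # weighted membership size
    for s in range(int(round(1.0 / eps1)) + 1):
        idx = np.flatnonzero(mass == s)
        if idx.size == 0:
            continue
        Y = rng.choice(idx, size=min(r, idx.size), replace=True)
        w = idx.size / Y.size                              # importance weight |X_s| / |Y_s|
        for j in unprocessed:
            u = np.array([oracle(i, j) for i in Y])
            score[j] += w * u.sum()
            num[j] += w * ((u ** nu) @ X[Y])
            den[j] += w * (u ** nu).sum()
    t_next = max(unprocessed, key=lambda j: score[j])
    mu_next = num[t_next] / max(den[t_next], 1e-12)
    return t_next, mu_next
```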
Algorithm 4: Sequential algorithm for approximating $P$.
Input: $X$, $\mathcal{O}_{\mathrm{fuzzy}}$, $\nu$, $m$, $r$, $\epsilon_1$, and $\epsilon_2$. Output: An estimate $\hat{P}$.
1: Let $S$ be a set of $m$ elements sampled at random from $[n]$ with replacement.
2: Query $\mathcal{O}_{\mathrm{fuzzy}}(i, j)$, $\forall i \in S$, $j \in [k]$.
3: $t_1 = \arg\max_{j \in [k]}\sum_{i \in S} U_{ij}$.
4: Compute $\hat{\mu}_{t_1} = \sum_{i \in S} U_{it_1}^{\nu} x_i \big/ \sum_{i \in S} U_{it_1}^{\nu}$.
5: for $\ell = 1, 2, \ldots, k-1$ do
6:   $\{\hat{U}_{it_\ell}\}_{i=1}^{n} \leftarrow \mathrm{MEMBERSHIP}(X, \hat{\mu}_{t_\ell}, \nu, \epsilon_1)$.
7:   Set $X_s \triangleq \{i \in [n] : \sum_{j \in [\ell]}\hat{U}_{it_j} = s\epsilon_1\}$ for $s \in \{0, 1, \ldots, 1/\epsilon_1\}$.
8:   for $s = 0, 1, \ldots, 1/\epsilon_1$ do
9:     Let $Y_s$ be a set of $\min(r, |X_s|)$ elements sampled at random from $X_s$ with replacement.
10:  end for
11:  Compute $t_{\ell+1}$ using (9).
12:  Compute $\hat{\mu}_{t_{\ell+1}}$ using (9).
13: end for
14: for $j = 1, 2, \ldots, k$ do
15:  $\{\hat{U}_{ij}\}_{i=1}^{n} \leftarrow \mathrm{MEMBERSHIP}(X, \hat{\mu}_j, \nu, \epsilon_2)$.
16: end for
17: $\hat{U}_{ij} \leftarrow \hat{U}_{ij} + \big(1 - \sum_{j' \in [k]}\hat{U}_{ij'}\big)/k$, $\forall i \in [n]$, $j \in [k]$.

Comparing Theorems 1 and 2, we see that the sequential algorithm has a more graceful dependency on $\beta$, the size of the smallest cluster. On the other hand, the query complexity of the two-phase algorithm is better in terms of $k$ (recall that $\epsilon_1 \le 1/k$ for the sequential algorithm). Note that the query complexity of the sequential algorithm still has some dependency on $\beta$. This stems from the fact that we estimate the membership weights up to a precision of $\epsilon_1$, which might be too large if there exists a cluster $j$ such that $\max_{i \in [n]} U_{ij} \le \epsilon_1$. Accordingly, in this case, all elements will fall into a single bin in the partition, leading to the same drawback the two-phase algorithm suffers from. Finally, we mention that for $k = 2$ we devise an algorithm whose query complexity is completely independent of $\beta$. Due to space limitations, the details are given in Appendix E, but we state here our main conclusion and provide a short sketch of the proof.

Theorem 3. Let $(X, P)$ be a consistent center-based clustering instance, recall the definition of $\gamma \in \mathbb{R}_+$ in (8), and let $k = 2$. Then, with probability at least $1 - \delta$, Algorithm 5 with $m = O\big((R/\epsilon)^4\log(4/\delta)\big)$ is an $(\epsilon, \epsilon, Q)$-solver using
$$Q = O\left(\frac{R^4\log n\,\log\frac{1}{\epsilon\delta}}{\epsilon^4} + \log^2 n + \frac{\log n}{\epsilon}\right)$$
membership queries, for any chosen $\delta \in (0, 1)$ and $\epsilon \in (0, \gamma]$. Furthermore, Algorithm 5's time complexity is of order
$$O\left(n\log n + d\left[\frac{R^4\log n\,\log\frac{1}{\epsilon\delta}}{\epsilon^4} + \log^2 n + \frac{\log n}{\epsilon}\right]\right).$$

We now explain the main ideas behind Algorithm 5. First, let $S$ be a set of $m$ elements sampled from $[n]$ with replacement. Denote by $t_1 \in \{1, 2\}$ the index of the larger cluster and by $t_2 \in \{1, 2\}$ the index of the smaller cluster, i.e., $\sum_{i} U_{it_1} \ge \sum_{i} U_{it_2}$. The algorithm works as follows. We start by computing an estimate $\hat{\mu}_{t_1}$ of the center of the largest cluster $t_1$. Corollary 4 shows that $O\big((R/\epsilon)^4\log(4/\delta)\big)$ membership queries are needed to guarantee that $\|\hat{\mu}_{t_1} - \mu_{t_1}\|_2 \le \epsilon$. Now, the main novelty in Algorithm 5 is a non-trivial querying scheme used to estimate the memberships to the second cluster, $U_{it_2}$, for all $i \in [n]$, using the center estimated for the first cluster $t_1$. Note that for the special case of two clusters, querying $U_{it_1}$ reveals $U_{it_2}$, since $U_{it_1} + U_{it_2} = 1$. While in the sequential algorithm we always approximated the membership weights using bins of fixed size $\epsilon_1$, here we will approximate $U_{it_2}$ with bins whose size is data-dependent. The reason for using variable bin sizes rather than fixed-size bins is that, with the latter, we might ask many redundant membership queries simply because many bins might be empty. Accordingly, adaptively choosing the bin size for approximating the membership weights of the smaller cluster $t_2$, while querying membership weights to the larger cluster $t_1$, allows us to get around this issue completely; this results in a theoretical guarantee that is completely independent of $\beta$.
Algorithm 5: Sequential algorithm for approximating $P$ with two clusters.
Input: $X$, $\mathcal{O}_{\mathrm{fuzzy}}$, $\nu$, $m$, $r$, and $\epsilon$.
1: Let $S$ be a set of $m$ elements sampled at random from $[n]$ with replacement.
2: Query $\mathcal{O}_{\mathrm{fuzzy}}(i, j)$ and store $U_{ij}$ for all $i \in S$ and $j \in \{1, 2\}$.
3: Choose $t_1 = \arg\max_{j \in \{1,2\}}\sum_{i \in S} U_{ij}$.
4: Compute the estimator $\hat{\mu}_{t_1} = \sum_{i \in S} U_{it_1}^{\nu} x_i \big/ \sum_{i \in S} U_{it_1}^{\nu}$.
5: Obtain $P_1, P_2, X_1, X_2, \ldots$ and $\hat{U}_{it_2}$ for all $i \in [n]$ using MEMBERSHIP2$(X, \hat{\mu}_{t_1}, \nu)$.
6: for $q = 1, 2, 3, \ldots, 3\log n$ do
7:   Let $Y_q$ be a set of $r$ elements sampled at random from $X_q$ with replacement.
8:   Query $\mathcal{O}_{\mathrm{fuzzy}}(i, t_2)$ to obtain $U_{it_2}$ for all $i \in Y_q$.
9: end for
10: Let $Q$ be a set of $r$ elements sampled at random from $P_1$ with replacement.
11: Query $\mathcal{O}_{\mathrm{fuzzy}}(i, t_2)$ to obtain $U_{it_2}$ for all $i \in Q$.
12: Compute $\hat{\mu}_{t_2}$ as in (111).
13: for $j = 1, 2$ do
14:  Run MEMBERSHIP$(X, \hat{\mu}_j, \nu, \epsilon, j)$ to update $\hat{U}_{ij}$ for all $i \in [n]$.
15: end for
16: Return $\hat{\mu}_1, \hat{\mu}_2$ and $\hat{U}_{ij}$, $\forall i \in [n]$, $j \in \{1, 2\}$.

We now describe Algorithm 7, used to estimate the membership weights of cluster $t_2$. We start by sorting $X$ in ascending order according to distance from $\hat{\mu}_{t_1}$, and initialize $\alpha_1 = U_{\pi_{\hat{\mu}_{t_1}}(n)t_2}$ and $p_1 \triangleq n$. To wit, $\alpha_1$ is the membership to the smaller cluster $t_2$ of the element farthest from $\hat{\mu}_{t_1}$. For each $q \in \{2, 3, \ldots, 3\log n\}$, we set $\alpha_q = \alpha_{q-1}/2$, and then perform a binary search to find $p'_q$, the smallest index $j \in [n]$ such that $U_{\pi_{\hat{\mu}_{t_1}}(j)t_2} \ge \alpha_q$. Here, we should think of $p'_q$ as the index for which $x_{\pi_{\hat{\mu}_{t_1}}(p'_q)}$ is the element closest to $\hat{\mu}_{t_1}$ whose membership to the smaller cluster $t_2$ is at least $\alpha_q$. Now, if $|p'_q - p_{q-1}| \ge \log n$, we keep $\alpha_q$ unchanged, set $p_q = p'_q$, and put all elements which are closer to $\hat{\mu}_{t_1}$ than $x_{\pi_{\hat{\mu}_{t_1}}(p_{q-1})}$, but farther than $x_{\pi_{\hat{\mu}_{t_1}}(p_q)}$, into a single bin. Notice that each such bin contains at least $\log n$ elements by definition. However, if $p'_q$ is the same as $p_{q-1}$, then the membership queries in the latter binary search to compute $p'_q$ are wasteful. In order to fix this issue, we check if there is a sufficient number (at least $\log n$) of elements between $p'_q$ and $p_{q-1}$. If there is not, then we set $p_q$ such that the above condition is satisfied. More formally, if $|p'_q - p_{q-1}| \le \log n$ we update $p_q = \max(0, p_{q-1} - \log n)$ and $\alpha_q = U_{\pi_{\hat{\mu}_{t_1}}(p_q)t_2}$. In other words, we set $p_q$ such that there are at least $\log n$ elements between $p_q$ and $p_{q-1}$, and accordingly, we set $\alpha_q$ to be the membership of $x_{\pi_{\hat{\mu}_{t_1}}(p_q)}$ to the cluster $t_2$. Subsequently, we query $U_{it_2}$ for every element that is closer to $\hat{\mu}_{t_1}$ than $x_{\pi_{\hat{\mu}_{t_1}}(p_{q-1})}$ but farther than $x_{\pi_{\hat{\mu}_{t_1}}(p_q)}$. We call these elements special elements, since we query every one of them. We will assume that all the special elements between $x_{\pi_{\hat{\mu}_{t_1}}(p_{q-1})}$ and $x_{\pi_{\hat{\mu}_{t_1}}(p_q)}$ are assigned to a single bin. To resolve the edge case, we put all those elements which are closer to $\hat{\mu}_{t_1}$ than $x_{\pi_{\hat{\mu}_{t_1}}(p_{3\log n})}$ into a separate bin. Let $g$ be the total number of bins formed by this algorithm, and say $X_j$ is the $j$th bin formed by this method that does not contain any special elements. Let $P$ be the set of all special elements. Notice that the number of bins $g$ is at most $1 + 3\log n$, and the number of special elements is at most $3\log^2 n$ (since each bin can contain at most $\log n$ special elements). Moreover, for any bin $X_j$ not containing any special elements, we can show that $\max_{i \in X_j} U_{it_2} \le 2\min_{i \in X_j} U_{it_2}$, which should be contrasted with the guarantee we have for the sequential algorithm, i.e., for a particular bin $X_j$, $|\max_{i \in X_j} U_{it_2} - \min_{i \in X_j} U_{it_2}| \le \epsilon_1$. Finally, we again sub-sample $r$ elements from each such bin $X_j$ with replacement, and denote the resulting subset by $Y_j$. We approximate the center of the second cluster by a self-normalized importance-sampling estimator analogous to (9), given in (111) in the appendix, and show that the required $r$ is independent of the cluster size.
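Finally, a minimal sketch of the adaptive binning idea behind MEMBERSHIP2 (Algorithm 7 in the appendix) for $k = 2$: halve the threshold on the $t_2$-membership, locate the new boundary by binary search, and query the few "special" elements directly when the boundary barely moves. The helper names and the simplifications (no edge-case bin, no estimator (111)) are ours.

```python
import math
import numpy as np

def adaptive_bins(order, query_t2, n):
    """Sketch of the adaptive binning for k = 2: items are sorted by distance to
    mu_hat_{t1} (array `order`); their t2-memberships are non-decreasing along
    that order. Thresholds are halved, boundaries found by binary search, and
    narrow gaps are queried exhaustively as 'special' elements."""
    log_n = max(1, int(math.log(n)))
    def first_at_least(thr):
        """Smallest sorted position whose t2-membership is >= thr (n if none)."""
        lo, hi = 0, n
        while lo < hi:
            mid = (lo + hi) // 2
            if query_t2(order[mid]) >= thr:
                hi = mid
            else:
                lo = mid + 1
        return lo
    bins, special = [], []
    alpha = query_t2(order[n - 1])        # alpha_1: t2-membership of the farthest item
    p_prev = n - 1
    for _ in range(3 * log_n):
        alpha = alpha / 2.0               # halve the threshold
        p = first_at_least(alpha)
        if p_prev - p >= log_n:           # wide enough gap: keep it as a single bin
            bins.append(order[p: p_prev])
        else:                             # narrow gap: widen it and query items directly
            p = max(0, p_prev - log_n)
            special.extend(order[p: p_prev])
            alpha = query_t2(order[p])
        p_prev = p
        if p_prev == 0:
            break
    return bins, special
```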
Acknowledgements: This work is supported in part by NSF awards 2133484, 2127929, and 1934846.

References

[1] M. N. Ahmed, S. M. Yamany, N. Mohamed, A. A. Farag, and T. Moriarty. A modified fuzzy c-means algorithm for bias field estimation and segmentation of MRI data. IEEE Transactions on Medical Imaging, 21(3):193–199, 2002.
[2] N. Ailon, A. Bhattacharya, R. Jaiswal, and A. Kumar. Approximate clustering with same-cluster queries. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2018.
[3] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In N. Bansal, K. Pruhs, and C. Stein, editors, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7–9, 2007, pages 1027–1035. SIAM, 2007.
[4] H. Ashtiani, S. Kushagra, and S. Ben-David. Clustering with same-cluster queries. In Advances in Neural Information Processing Systems 29, pages 3216–3224, 2016.
[5] M.-F. Balcan and A. Blum. Clustering with interactive feedback. In International Conference on Algorithmic Learning Theory, pages 316–328. Springer, 2008.
[6] A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene expression patterns. Journal of Computational Biology, 6(3-4):281–297, 1999.
[7] P. Berkhin. A survey of clustering data mining techniques. In Grouping Multidimensional Data, pages 25–71. Springer, 2006.
[8] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Springer Science & Business Media, 2013.
[9] J. Blömer, S. Brauer, and K. Bujna. A theoretical analysis of the fuzzy k-means problem. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 805–810. IEEE, 2016.
[10] J. Blömer, S. Brauer, and K. Bujna. Coresets for fuzzy k-means with applications. In 29th International Symposium on Algorithms and Computation (ISAAC 2018), volume 123 of Leibniz International Proceedings in Informatics (LIPIcs), pages 46:1–46:12, 2018.
[11] A. Blum, J. Hopcroft, and R. Kannan. Foundations of Data Science. Cambridge University Press, 2020.
[12] M. Bressan, N. Cesa-Bianchi, S. Lattanzi, and A. Paudice. Exact recovery of mangled clusters with same-cluster queries. arXiv preprint arXiv:2006.04675, 2020.
[13] M. Bressan, N. Cesa-Bianchi, S. Lattanzi, and A. Paudice. Exact recovery of clusters in finite metric spaces using oracle queries. arXiv preprint arXiv:2102.00504, 2021.
[14] M. Bressan, N. Cesa-Bianchi, S. Lattanzi, and A. Paudice. On margin-based cluster recovery with oracle queries. arXiv preprint arXiv:2106.04913, 2021.
[15] M. Bressan, N. Cesa-Bianchi, A. Paudice, and F. Vitale. Correlation clustering with adaptive similarity queries. In Advances in Neural Information Processing Systems, pages 12510–12519, 2019.
[16] K. Bujna. Soft Clustering Algorithms: Theoretical and Practical Improvements. PhD thesis, Paderborn University, 2017.
[17] I. Chien, C.-Y. Lin, and I.-H. Wang. Community detection in hypergraphs: Optimal statistical limit and efficient algorithms. In International Conference on Artificial Intelligence and Statistics, pages 871–879, 2018.
[18] I. Chien, C. Pan, and O. Milenkovic. Query k-means clustering and the double dixie cup problem. In Advances in Neural Information Processing Systems, pages 6649–6658, 2018.
[19] K.-S. Chuang, H.-L. Tzeng, S. Chen, J. Wu, and T.-J. Chen. Fuzzy c-means clustering with spatial information for image segmentation. Computerized Medical Imaging and Graphics, 30(1):9–15, 2006.
[20] W. F. De La Vega, M. Karpinski, C. Kenyon, and Y. Rabani. Approximation schemes for clustering problems. In Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing, pages 50–58, 2003.
[21] M. L. D. Dias. fuzzy-c-means: An implementation of the fuzzy c-means clustering algorithm, May.
[22] C. Ding, X. He, and H. D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of the 2005 SIAM International Conference on Data Mining, pages 606–610. SIAM, 2005.
[23] D. Firmani, S. Galhotra, B. Saha, and D. Srivastava. Robust entity resolution using a crowd oracle. IEEE Data Eng. Bull., 41(2):91–103, 2018.
[24] Y. Fukuyama. A new method of choosing the number of clusters for the fuzzy c-means method.
[25] I. Gath and A. B. Geva. Unsupervised optimal fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell., 11(7):773–780, July 1989.
[26] K. Huang, N. D. Sidiropoulos, and A. Swami. Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition. IEEE Transactions on Signal Processing, 62(1):211–224, 2013.
[27] M. Huang, Z. Xia, H. Wang, Q. Zeng, and Q. Wang. The range of the value for the fuzzifier of the fuzzy c-means algorithm. Pattern Recognition Letters, 33(16):2280–2284, 2012.
[28] W. Huleihel, A. Mazumdar, M. Médard, and S. Pal. Same-cluster querying for overlapping clusters. In Advances in Neural Information Processing Systems, pages 10485–10495, 2019.
[29] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455.
[30] X. Lin and X.-W. Chen. Mr. KNN: soft relevance for multi-label classification. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 349–358, 2010.
[31] Q. Liu, J. Liu, M. Li, and Y. Zhou. A novel initialization algorithm for fuzzy c-means problem. In J. Chen, Q. Feng, and J. Xu, editors, Theory and Applications of Models of Computation, pages 215–225. Springer International Publishing, 2020.
[32] S. P. Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theory, 28(2):129–136, 1982.
[33] M. Mahajan, P. Nimbhorkar, and K. Varadarajan. The planar k-means problem is NP-hard. Volume 442, pages 274–285, 2009.
[34] X. Mao, P. Sarkar, and D. Chakrabarti. On mixed memberships and symmetric nonnegative matrix factorizations. In International Conference on Machine Learning, pages 2324–2333. PMLR, 2017.
[35] A. Mazumdar and S. Pal. Semisupervised clustering, AND-queries and locally encodable source coding. In Advances in Neural Information Processing Systems, pages 6489–6499, 2017.
[36] A. Mazumdar and B. Saha. Clustering with noisy queries. In Advances in Neural Information Processing Systems, pages 5788–5799, 2017.
[37] A. Mazumdar and B. Saha. Query complexity of clustering with side information. In Advances in Neural Information Processing Systems, pages 4682–4693, 2017.
[38] A. Mazumdar and B. Saha. A theoretical analysis of first heuristics of crowdsourced entity resolution. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[39] A. Moitra. Algorithmic aspects of machine learning. Lecture notes, 2014.
[40] P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126, 1994.
[41] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[42] W. Pedrycz and J. Waletzky. Fuzzy clustering with partial supervision. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 27(5):787–795, 1997.
[43] B. Saha and S. Subramanian. Correlation clustering with same-cluster queries bounded by optimal cost. In 27th Annual European Symposium on Algorithms (ESA 2019). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2019.
[44] S. Sarmasarkar, K. S. Reddy, and N. Karamchandani. Query complexity of heavy hitter estimation. arXiv preprint arXiv:2005.14425, 2020.
[45] N. D. Sidiropoulos and R. Bro. On the uniqueness of multilinear decomposition of N-way arrays. Journal of Chemometrics: A Journal of the Chemometrics Society, 14(3):229–239, 2000.
[46] C. E. Tsourakakis, M. Mitzenmacher, K. G. Larsen, J. Błasiok, B. Lawson, P. Nakkiran, and V. Nakos. Predicting positive and negative links with noisy queries: Theory & practice. arXiv preprint arXiv:1709.07308, 2017.
[47] F. Valafar. Pattern recognition techniques in microarray data analysis: A survey. Annals of the New York Academy of Sciences, 980:41–64, 2002.
[48] A. Vattani. The hardness of k-means clustering in the plane.
[49] N. Vesdapunt, K. Bellare, and N. Dalvi. Crowdsourcing algorithms for entity resolution. Proceedings of the VLDB Endowment, 7(12):1071–1082, 2014.
[50] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. CrowdER: Crowdsourcing entity resolution. Proceedings of the VLDB Endowment, 5(11), 2012.
[51] X. L. Xie and G. Beni. A validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell., 13(8):841–847, Aug. 1991.
[52] S. Zhang, R.-S. Wang, and X.-S. Zhang. Identification of overlapping community structure in complex networks using fuzzy c-means clustering. Physica A: Statistical Mechanics and its Applications, 374(1):483–490, 2007.