# Clustering with Non-adaptive Subset Queries

Hadley Black (UC San Diego), Euiwoong Lee (University of Michigan), Arya Mazumdar (UC San Diego), Barna Saha (UC San Diego)

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

**Abstract.** Recovering the underlying clustering of a set $U$ of $n$ points by asking pair-wise same-cluster queries has garnered significant interest in the last decade. Given a query $S \subseteq U$ with $|S| = 2$, the oracle returns *yes* if the two points are in the same cluster and *no* otherwise. We study a natural generalization of this problem to subset queries with $|S| > 2$, where the oracle returns the number of clusters intersecting $S$. Our aim is to determine the minimum number of queries needed to exactly recover an arbitrary $k$-clustering. We focus on non-adaptive schemes, where all the queries are asked in one round, thus allowing the querying process to be parallelized, which is a highly desirable property. For adaptive algorithms with pair-wise queries, the complexity is known to be $\Theta(nk)$, where $k$ is the number of clusters. In contrast, non-adaptive pair-wise query algorithms are extremely limited: even for $k = 3$, such algorithms require $\Omega(n^2)$ queries, which matches the trivial $O(n^2)$ upper bound attained by querying every pair of points. Allowing subset queries of unbounded size, $O(n)$ queries suffice with an adaptive scheme. However, the realm of non-adaptive algorithms remains completely unknown. Is it possible to attain algorithms that are non-adaptive while still making a near-linear number of queries?

In this paper, we give the first non-adaptive algorithms for clustering with subset queries. We provide: (i) a non-adaptive algorithm making $O(n \log^2 n \log k)$ queries, which improves to $O(n \log k)$ when the cluster sizes are within any constant factor of each other; (ii) for constant $k$, a non-adaptive algorithm making $O(n \log\log n)$ queries. In addition to non-adaptivity, we take into account other practical considerations, such as enforcing a bound on query size. For constant $k$, we give an algorithm making $\widetilde{O}(n^2/s^2)$ queries on subsets of size at most $s \le \sqrt{n}$, which is optimal among all non-adaptive algorithms within a $\log n$ factor. For arbitrary $k$, the dependence varies as $O(n^2/s)$.

## 1 Introduction

Clustering is one of the most fundamental problems in unsupervised machine learning, and it permeates beyond the boundaries of statistics and computer science into the social sciences, economics, and beyond. The goal of clustering is to partition items so that similar items end up in the same group, and its applications are manifold. However, finding the underlying clusters is sometimes hard for an automated process, because the data is noisy or incomplete, even when the clusters are easily discernible by humans. Motivated by this scenario, and in order to improve the quality of clustering, early works studied so-called clustering under limited supervision (e.g., [1, 2]). Balcan and Blum initiated the study of clustering under active feedback [3], where, given the current clustering solution, users can provide feedback on whether a cluster needs to be merged or split. Perhaps a simpler query model is one where users only need to report the number of clusters, and only on a subset of points, without having to analyze the entire clustering. This scenario is common in unsupervised learning problems where a centralized algorithm aims to compute a clustering by crowdsourcing. The crowd-workers play the role of an oracle here, and are able to answer simple queries that involve a small subset of the universe.
Mazumdar and Saha [4, 5, 6], and in independent works Mitzenmacher and Tsourakakis [7] as well as Ashtiani, Kushagra and Ben-David [8], initiated a theoretical study of clustering with pair-wise (a.k.a. same-cluster) queries. Given any pair of points $u, v$, the oracle returns whether $u$ and $v$ belong to the same cluster or not. Such queries are easy to answer and lend themselves to simple implementations [9]. This model has subsequently been extremely well studied in the literature, e.g. [10, 11, 4, 12, 13]; in fact, triangle queries have also been studied, e.g. [14]. Moreover, clustering with pair-wise queries is intimately related to several well-studied problems such as correlation clustering [15, 16, 17, 10, 18], the edge-sign prediction problem [19, 7], the stochastic block model [20, 21], etc.

Depending on whether there is interaction between the learner/algorithm and the oracle, querying algorithms can be classified as adaptive or non-adaptive [5]. In adaptive querying, the learner can decide the next query based on the answers to previous queries. An algorithm is called non-adaptive if all of its queries can be specified in one round. Non-adaptive algorithms can parallelize the querying process since they decide the entire set of queries a priori. This may greatly speed up the algorithm in practice, significantly reducing the time needed to acquire answers [22]. Thus, in a crowdsourcing setting, being non-adaptive is a highly desirable property. On the flip side, this makes non-adaptive algorithms significantly harder to design. In fact, when adaptivity is allowed, $\Theta(nk)$ pair-wise queries are both necessary and sufficient to recover the entire clustering, where $n$ is the number of points in the ground set to be clustered and $k$ (unknown) is the number of clusters. However, as shown in [5] and our Theorem C.1, even for $k = 3$, even randomized non-adaptive algorithms can do no better than the trivial $O(n^2)$ upper bound attained by querying all pairs.

We study a generalization of pair-wise queries to subset queries, where given any subset of points, the oracle returns the number of clusters in it. We consider the problem of recovering an unknown $k$-clustering (a partition) of a universe $U$ of $n$ points via black-box access to a subset query oracle. More precisely, we assume that there exists a ground-truth partition $U = \bigsqcup_{i=1}^{k} C_i$, and upon querying with a subset $S \subseteq U$, the oracle returns $q(S) = |\{i : C_i \cap S \neq \emptyset\}|$, the number of clusters intersecting $S$. Considering the limitations of pair-wise queries for non-adaptive schemes, we ask whether it is possible to use subset queries to design significantly better non-adaptive algorithms.

In addition to being a natural model for interactive clustering, this problem also falls into the growing body of work known as combinatorial search [23, 24], where the goal is to reconstruct a hidden object by viewing it through the lens of some indirect query model (such as group testing [25, 26, 24, 27, 28]). The problem is also intimately connected to coin weighing: given a hidden vector $x \in \{0,1\}^n$, the goal is to reconstruct $x$ using queries of the form $q(S) := \sum_{i \in S} x_i$ for $S \subseteq [n]$. It is known that $\Theta(n/\log n)$ is the optimal number of queries [29, 30, 31], which can be achieved by a non-adaptive algorithm. There are improvements for the case when $\|x\|_1 = d$ for $d \ll n$ [32, 33, 34].
Moreover, there has been significant work on graph reconstruction, where the task is to reconstruct a hidden graph $G = (V, E)$ from queries of the form $q(S, T) := |\{(u, v) \in E : u \in S, v \in T\}|$ for subsets $S, T \subseteq V$ [35, 36, 37, 38]. There are also algorithms that perform certain tasks more efficiently than learning the whole graph (sometimes using different types of queries) [39, 40, 41, 42, 43, 44, 45, 46], and quantum algorithms that use fewer queries than classical algorithms [47].

It is not too difficult to show that an algorithm making $O(n \log k)$ queries is possible for $k$-clustering (Appendix H), while $\Omega(n)$ queries is an obvious information-theoretic lower bound, since each query returns at most $\log k$ bits of information and the number of possible $k$-clusterings is $k^n = 2^{n \log k}$. In fact, it is possible to have an algorithm with $O(n)$ query complexity (personal communication, Chakrabarty and Liao). However, both of these algorithms are adaptive, ruling them out for the non-adaptive setting. So far, the non-adaptive setting of this problem has remained unexplored.

### 1.1 Results

Our main results showcase the significant strength of subset queries in the non-adaptive setting. We give randomized algorithms that recover the exact clustering with probability $1 - \delta$, for any arbitrary constant $\delta > 0$, using only a near-linear number of subset queries.

**Theorem 1.1** (Theorem 2.5, simplified). *There is a randomized, non-adaptive $k$-clustering algorithm making $O(n \log^2 n \log k)$ subset queries.*

For constant $k$, the dependence on $n$ can be further improved.

**Theorem 1.2** (Theorem 2.2, simplified). *There is a randomized, non-adaptive $k$-clustering algorithm making $O(n \log\log n)$ subset queries when $k$ is any constant.*

Note that the algorithm of Theorem 1.2 works for any value of $k$, but its dependence on this parameter is inferior to that of Theorem 1.1 (see the formal version, Theorem 2.2, for the exact dependence on $k$). Thus, we state the theorem above for constant $k$ to emphasize the much improved dependence on $n$. Our algorithms also run in polynomial time, and they generalize to work with queries of bounded size.

**Bounding query size:** Another practical consideration is the query size $s$. Depending on the scenario and the capabilities of the oracle, it may be easier to handle queries on small subsets. An extreme case is pair-wise queries ($s = 2$), where $O(nk)$ pair queries are enough with adaptivity, but any non-adaptive algorithm has to use $\Omega(n^2)$ queries even for $k = 3$. Since a subset query on $S$ can be simulated by $\binom{|S|}{2}$ pair queries, we immediately get the following theorem.

**Theorem 1.3** (Corollary C.2, restated). *Any non-adaptive $k$-clustering algorithm that is only allowed to query subsets of size at most $s$ must make at least $\Omega(\min(\frac{n^2}{s^2}, n))$ queries.*

Theorems 1.1 and 1.2 above show that this can be bypassed by allowing larger subset queries. However, some of those queries have size $\Omega(n)$, and this raises the question: is there a near-linear non-adaptive algorithm which only queries subsets of size at most $O(\sqrt{n})$? We answer this in the affirmative, implying that our lower bound is tight in terms of $s$.

**Theorem 1.4** (Theorem A.1, informal). *There is a non-adaptive $k$-clustering algorithm making $O(n \log n \log\log n)$ subset queries of size at most $O(\sqrt{n})$ when $k$ is any constant. For all sufficiently small $s = o(\sqrt{n})$, the algorithm makes $O(\frac{n^2}{s^2} \log n)$ subset queries of size at most $s$. The result also extends to arbitrary $k$ with a slightly worse dependence on $s$ (Theorem 2.5).*
Our algorithm for bounded queries from Theorem 1.4 has the additional desirable property of being *sample-based*, meaning that each of its queries is a set formed by independent, uniform samples. That is, the algorithm specifies a query size $t \le s$, and then receives $(S, q(S))$ where $S$ is formed by $t$ i.i.d. uniform samples from $U$. Being sample-based enables the algorithm to leave the task of curating each query to the individual answering it: the algorithm needs only to specify the query sizes, and then recover the clustering once the queries have been curated and answered.

**The "roughly balanced" case:** Next, we consider the natural special case of recovering a $k$-clustering whose cluster sizes are within a constant factor of one another. Informally, let us call such a clustering "roughly balanced".

**Theorem 1.5** (Theorems B.1 and E.1, informal). *There are non-adaptive algorithms for recovering a roughly balanced $k$-clustering which make (a) $O(n \log k)$ subset queries when $k \le O(\sqrt{n}/\log^3 n)$, and (b) $O(n \log^2 k)$ subset queries for any $k \le n$.*

**Allowing two rounds of adaptivity:** Finally, we show that if we allow one extra round of adaptivity, then the logarithmic factors can be improved further. Specifically, we prove the following theorems.

**Theorem 1.6** (Theorems F.1 and F.3, informal). *There is a 2-round deterministic $k$-clustering algorithm making $O(n \log k)$ subset queries. There is a randomized 2-round algorithm for recovering a roughly balanced $k$-clustering making $O(n \log\log k)$ subset queries.*

**Organization:** The remainder of the paper is organized as follows. In Section 2, we give our main results, developing the non-adaptive algorithms with near-linear query complexity of Theorems 1.1 and 1.2. Our results for sample-based, bounded-query algorithms are given in Appendix A. Finally, we prove our results for the balanced setting in Appendix B, our lower bounds in Appendix C, and our results for two-round algorithms in Appendix F.

## 2 Algorithms with Nearly Linear Query Complexity

In this section we describe the algorithms behind our main results, Theorems 1.1 and 1.2, and give formal proofs of their correctness. In Section 2.1 we describe an algorithm making $O(n \log\log n)$ subset queries when the number of clusters $k$ is assumed to be a constant; in general, its dependence on the number of clusters is $O(k \log k)$. In Section 2.2, we give an alternative algorithm with $\widetilde{O}(n)$ query complexity for any $k \le n$.

### 2.1 An $O(n \log\log n)$ Algorithm for Constant $k$

**Warm up.** When there are only 2 clusters, there is a trivial non-adaptive algorithm making $O(n)$ pair queries: choose an arbitrary $x \in U$ and query $\{x, y\}$ for every $y \in U$. The set of points $y$ where $q(\{x, y\}) = 1$ forms one cluster, and the second cluster is its complement. If we allowed one more round of adaptivity, then for 3-clustering we could repeat this one more time and again get an $O(n)$ query algorithm. However, for non-adaptive 3-clustering it is impossible to do better than the trivial $O(n^2)$ algorithm (see Theorem C.1). Essentially, this is because in order to distinguish the clusterings $(\{x\}, \{y\}, U \setminus \{x, y\})$ and $(\{x, y\}, \emptyset, U \setminus \{x, y\})$ the algorithm must query $\{x, y\}$, and there are $\binom{n}{2}$ ways to hide this pair. Overcoming this barrier using subset queries requires significant new ideas. Our main ideas are best communicated by focusing on the case of 3-clustering. It suffices to correctly reconstruct the two largest clusters, since the third cluster is just the complement of their union.
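To make the warm-up concrete, here is a minimal Python sketch of the non-adaptive pair-query algorithm for $k = 2$; the oracle `q`, the toy instance, and the helper name are our own illustration, not notation from the paper.

```python
def recover_two_clusters(U, q):
    """Non-adaptive 2-clustering with O(n) pair queries.

    U is a list of points; q(S) returns the number of clusters
    intersecting the set S (the subset-query oracle). All queries
    are fixed up front, so the scheme is non-adaptive.
    """
    x = U[0]  # any fixed point works; no adaptivity is needed
    queries = [frozenset({x, y}) for y in U if y != x]
    answers = {S: q(S) for S in queries}  # the single query round
    cluster_x = {x} | {y for y in U
                       if y != x and answers[frozenset({x, y})] == 1}
    return [cluster_x, set(U) - cluster_x]

# toy oracle for a hidden 2-clustering of {0,...,4}
hidden = [{0, 1, 2}, {3, 4}]
q = lambda S: sum(1 for C in hidden if C & S)
print(recover_two_clusters(list(range(5)), q))  # [{0, 1, 2}, {3, 4}]
```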
Let $A$ and $B$ denote the largest and second largest clusters, respectively. Since $|A| \ge n/3$, it is easy to find: sample a random $x \in U$ and query $\{x, y\}$ for every $y \in U$. The cluster containing $x$ is precisely $\{y \in U : q(\{x, y\}) = 1\}$. With probability at least $1/3$ we have $x \in A$, and so repeating this a constant number of times recovers $A$ with high probability. On the other hand, $B$ may be arbitrarily small, and in this case the procedure clearly fails to recover it. The first observation is that once we know $A$, we can exploit larger subset queries to explore $U \setminus A$, since $q(S \setminus A) = q(S) - \mathbb{1}(S \cap A \neq \emptyset)$. Importantly, the algorithm is non-adaptive and so the choice of $S$ cannot depend on $A$, but we are still able to exploit this trick with the following two strategies. Let $\delta n = |B|$ denote the size of $B$ and note that this implies $|A| \ge (1 - 2\delta)n$, since the third cluster has size at most $|B|$.

**Strategy 1:** Suppose a query $S$ contains exactly one point outside of $A$, i.e. $S \setminus A = \{x\}$. Then, for $y \notin A$, $q(S \cup \{y\}) = q(S)$ iff $x, y$ belong to the same cluster. Thus, we can query $S \cup \{y\}$ for every $y \in U$ to learn the cluster containing $x$. If $S$ is a random set of size $t \approx 1/\delta$, then the probability that $|S \setminus A| = 1$ is at least $t \delta (1 - 2\delta)^{t-1} = \Omega(1)$. Of course, we do not know $\delta$, but we can try $t = 2^p$ for every $p \le \log n$, and one of these choices will be within a factor of 2 of $1/\delta$. This gives an $O(n \log n)$ query algorithm, since we make $n$ queries per iteration.

**Strategy 2:** Suppose $S$ intersects $A$ and contains exactly two points outside of $A$, i.e. $S \setminus A = \{x, y\}$. Then $q(\{x, y\}) = q(S) - 1$, which tells us whether or not $x, y$ belong to the same cluster. If $x, y$ belong to the same cluster, add the pair to a set $E$, and let $G(U \setminus A, E)$ denote the graph on the remaining points with this edge set. By transitivity, a connected component in this graph corresponds to a subset of one of the remaining two clusters. In particular, if the induced subgraph $G[B]$ is connected, then we recover $B$. Moreover, if $S$ is a random set of size $t \approx 1/\delta$, then the probability that two points land in $B$ and the rest land in $A$ is at least $\binom{t}{2} \delta^2 (1 - 2\delta)^{t-2} = \Omega(1)$. A basic fact from random graph theory says that after $|B| \ln |B| \le \delta n \ln n$ occurrences of this event, $G[B]$ becomes connected with high probability, and so querying $\Omega(\delta n \ln n)$ random sets $S$ of size $\approx 1/\delta$ suffices. Again, we try $t = 2^p$ for every $p \le \log n$, resulting in a total of $n \ln n \sum_p 2^{-p} = O(n \log n)$ queries. (A short code sketch of this strategy appears after Remark 2.1 below.)

Finally, we can combine strategies (1) and (2) as follows to obtain our $O(n \log\log n)$ query algorithm. The main observation is that the query complexity of strategy (2) improves greatly if $|B|$ is small enough. If we know that $\delta \le \frac{1}{\log n}$, then we only need to try $t = 2^p \ge \log n$, and so the query complexity becomes $n \ln n \sum_{p \ge \log\log n} 2^{-p} = O(n)$. On the other hand, if we assume that $\delta > \frac{1}{\log n}$, then in strategy (1) we only need to try $p \le \log\log n$, yielding a total of $O(n \log\log n)$ queries. Combining these yields the final algorithm.

**Remark 2.1** (On approximate clustering). We point out that these ideas can be used to obtain more efficient algorithms for the easier task of correctly clustering a $(1 - \alpha)$-fraction of the points. In this setting we can ignore the case of $\delta < \alpha/2$ (recall the definition of $\delta$ above), as this only results in an incorrect classification of an $\alpha$-fraction of the points. Thus, for example, one can employ "strategy 1" above but only iterate over $p \le \log(2/\alpha)$, leading to an $O(n \log \frac{1}{\alpha})$ query algorithm.
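Returning to the 3-clustering warm-up, the following Python sketch illustrates the decoding side of strategy (2): the random sets are chosen obliviously, and the recovered cluster $A$ is used only to interpret the answers. It is a minimal sketch assuming $A$ has already been found as described above; the function name and parameters are ours.

```python
import random
from collections import defaultdict

def strategy_two(U, A, q, t, num_sets):
    """Sketch of strategy (2) for 3-clustering: given the largest
    cluster A (a set) and the oracle q, recover the rest.

    Random sets S of size t are fixed in advance (non-adaptive).
    When S meets A and has exactly two points x, y outside A,
    q({x, y}) = q(S) - 1, revealing whether x, y share a cluster;
    same-cluster pairs become edges and connected components are
    subsets of the remaining clusters.
    """
    samples = [[random.choice(U) for _ in range(t)]
               for _ in range(num_sets)]
    answers = [q(set(S)) for S in samples]        # one query round
    # build the graph G(U \ A, E) with union-find
    parent = {x: x for x in U if x not in A}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]         # path halving
            x = parent[x]
        return x
    for S, a in zip(samples, answers):
        outside = set(S) - A
        if len(outside) == 2 and set(S) & A:
            x, y = outside
            if a - 1 == 1:                        # q({x, y}) == 1
                parent[find(x)] = find(y)
    components = defaultdict(set)
    for x in parent:
        components[find(x)].add(x)
    return list(components.values())
```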
However, in this paper we focus on the more challenging task of recovering the clustering exactly, and leave the possibility of more efficient approximate algorithms as a direction for future work.

**Algorithm.** A full description of the algorithm is given in the pseudocode of Alg. 1, which is split into two phases: a "query selection phase", which describes how queries are chosen by the algorithm, and a "reconstruction phase", which describes how the algorithm uses the query responses to determine the clustering. Both phases contain a for-loop iterating over all $p \in \{0, 1, \ldots, \log n\}$, where the goal of the algorithm during the $p$-th iteration is to learn all remaining clusters of size at least $\frac{n}{2k \cdot 2^p}$. This is accomplished by two different strategies depending on whether $p$ is small or large.

When $p \le \log\log n$, the algorithm samples $O(k \log k)$ random sets $T$ formed by $2^p$ samples from $U$ and makes a query on $T$ and on $T \cup \{x\}$ for every $x \in U$ (see lines 5-9 of Alg. 1). Let $R_p$ be the union of all clusters reconstructed before phase $p$ (i.e., clusters of size at least $\frac{n}{2k \cdot 2^{p-1}}$). If such a $T$ contains exactly one point $z \in T \setminus R_p$ belonging to an unrecovered cluster, then we can use these queries to learn the cluster containing $z$ (see lines 24-28 of Alg. 1), since for $x \in U \setminus R_p$, $q(T) = q(T \cup \{x\})$ if and only if $x, z$ belong to the same cluster. Moreover, we show that this occurs with probability $\Omega(1)$, and we repeat this $O(k \log k)$ times to ensure that every cluster $C$ with $|C| \in [\frac{n}{2k \cdot 2^p}, \frac{n}{2k \cdot 2^{p-1}})$ is learned with high probability. The total number of queries made during the iterations with $p \le \log\log n$ is $O(n \log\log n \cdot k \log k)$.

When $p > \log\log n$, the algorithm queries $O(\frac{nk \log n}{2^p})$ random sets $T$, again formed by $2^p$ samples from $U$ (see lines 11-14 of Alg. 1). Note that $\sum_{p > \log\log n} 2^{-p} = O(\frac{1}{\log n})$, and so the total number of queries made during these iterations is $O(nk)$. We now describe the reconstruction phase (see lines 32-37 of Alg. 1). If $T$ contains exactly two points $x, y \in T \setminus R_p$ belonging to unrecovered clusters, then we can use the fact that we already know the clustering on $R_p$ to tell whether or not $x, y$ belong to the same cluster, i.e. we can compute $q(\{x, y\}) \in \{1, 2\}$ from $q(T)$. We then consider the set of all such pairs where $q(\{x, y\}) = 1$ (this is $Q''_p$, defined in line 34) and consider the graph $G_p$ with this edge set and vertex set $U \setminus R_p$, the set of points whose cluster hasn't yet been determined. If two points belong to the same connected component of this graph, then they belong to the same cluster. Thus, the analysis for this iteration boils down to showing that, with high probability, the induced subgraph $G_p[C]$ is connected for every $C$ with $|C| \in [\frac{n}{2k \cdot 2^p}, \frac{n}{2k \cdot 2^{p-1}})$. This is accomplished by applying a basic fact from the theory of random graphs, namely Fact 2.4.

**Analysis.** We restate the main theorem for this section.

**Theorem 2.2.** *There is a non-adaptive algorithm for $k$-clustering that uses $O(n \log\log n \cdot k \log k)$ subset queries and succeeds with probability at least $1 - \delta$ for any constant $\delta > 0$.*¹

The following Lemma 2.3 establishes that after the first $p$ iterations of the algorithm's query selection and reconstruction phases, all clusters of size at least $\frac{n}{2k \cdot 2^p}$ have been learned with high probability. This is the main technical component of the proof. After stating the lemma, we show that it easily implies that Alg. 1 succeeds with probability at least $99/100$ by an appropriate union bound. The choice of $99/100$ is arbitrary and can be made $1 - \delta$ for any constant $\delta$.

**Lemma 2.3.** *For each $p = 0, 1, \ldots, \log n$,*
*let $E_p$ denote the event that all clusters of size at least $\frac{n}{2k \cdot 2^p}$ have been successfully recovered immediately following iteration $p$ of Alg. 1. Then $\Pr[\overline{E_0}] \le \frac{1}{100k}$, and $\Pr[\overline{E_p} \mid E_{p-1}] \le \frac{1}{100k}$ for all $p \in \{1, 2, \ldots, \log n\}$.*

**Proof of Theorem 2.2:** Before proving Lemma 2.3, we first observe that it immediately implies the correctness of Alg. 1 and thus proves Theorem 2.2. Let $I_0 = (\frac{n}{2k}, n]$ and, for $1 \le p \le \log n$, let $I_p = [\frac{n}{2k \cdot 2^p}, \frac{n}{2k \cdot 2^{p-1}})$. If there are no clusters $C$ for which $|C| \in I_p$, then trivially $\Pr[\overline{E_p} \mid E_{p-1}] = 0$; otherwise $\Pr[\overline{E_p} \mid E_{p-1}] \le \frac{1}{100k}$ by the lemma. Since there are $k$ clusters, clearly there are at most $k$ values of $p$ for which there exists a cluster with size in the interval $I_p$. Using this observation and a union bound, we have
$$\Pr[\overline{E_{\log n}}] \;\le\; \Pr[\overline{E_0}] + \sum_{p=1}^{\log n} \Pr[\overline{E_p} \mid E_{p-1}] \;\le\; \frac{1}{100},$$
which completes the proof of correctness, since the algorithm succeeds iff $E_{\log n}$ occurs.

**Query complexity:** During the iterations with $p \le \log\log n$, the algorithm makes at most $O(n \log\log n \cdot k \log k)$ queries. During the iterations with $p > \log\log n$, it makes at most $O(nk \log n) \cdot \sum_{p > \log\log n} 2^{-p} = O(nk)$ queries, since $k \le n$.

¹For simplicity of exposition, we use a constant $\delta$ in our proofs. The success probability can be boosted to $1 - \frac{1}{\mathrm{poly}(n)}$ by paying a $\log n$ factor in the query complexity in all algorithms.

**Time complexity:** We assume that obtaining a uniform random sample from a set of size $n$ can be done in $O(1)$ time. Thus, since the algorithm makes $O(n \log\log n \cdot k \log k)$ queries and each is on a set of size at most $n$, the total runtime of the query selection phase (lines 3-15) is bounded by $O(n^2 \log\log n \cdot k \log k)$. We now account for the runtime of the reconstruction phase. Lines 25-28 can clearly be performed in $O(n)$ time, and so the time spent in lines 24-28 is $O(|Q_p| \cdot n)$. Now, for $T \in Q_p$, checking whether $|T \setminus R_p| = 2$ can clearly be done in $O(n)$ time, and so lines 33-34 run in time $O(|Q_p| \cdot n)$. Line 36 amounts to finding every connected component of $G_p$, which can be done in time $O(|Q''_p| + n) = O(|Q_p| + n)$ by iteratively running BFS (costing time linear in the number of edges plus the number of vertices). Thus, the runtime of the $p$-th iteration of the for-loop is always dominated by $O(|Q_p| \cdot n)$. Since the total number of queries is $O(n \log\log n \cdot k \log k)$, the total runtime of the reconstruction phase is $O(n^2 \log\log n \cdot k \log k)$.

We now prove the main Lemma 2.3.

**Proof of Lemma 2.3.** Let $\mathcal{C}_p$ denote the set of clusters recovered before phase $p$ and let $R_p = \bigcup_{C \in \mathcal{C}_p} C$. When $p = 0$, both of these sets are empty. We consider three cases depending on the value of $p$.

**Case 1: $p = 0$.** Let $C$ denote some cluster of size $|C| \ge \frac{n}{2k}$. Note that in this iteration the sets $T$ sampled by the algorithm in line 7 are singletons. We need to argue that one of these singletons lands in $C$, so that $C$ is recovered in line 28, with probability at least $1 - \frac{1}{100k^2}$. Since there are at most $k$ clusters, applying a union bound completes the proof in this case. A uniform random element lands in $C$ with probability at least $\frac{1}{2k}$, and so this fails to occur for all $|Q_0| \ge 4k \ln 10k$ samples with probability at most $(1 - \frac{1}{2k})^{4k \ln 10k} \le \exp(-2 \ln 10k) = \frac{1}{100k^2}$, as claimed.

**Case 2: $1 \le p \le \log\log n$.** Let $C$ denote some cluster with size $|C| \in [\frac{n}{2k \cdot 2^p}, \frac{n}{2k \cdot 2^{p-1}})$. Note that we are conditioning on the event that every cluster of size at least $\frac{n}{2k \cdot 2^{p-1}}$ has already been successfully recovered after iteration $p - 1$. Thus, the number of elements belonging to unrecovered clusters is $|U \setminus R_p| \le k \cdot \frac{n}{2k \cdot 2^{p-1}} = \frac{n}{2^p}$.
We need to argue that the set $Q_p$ contains some $T$ sampled in line 7 such that $T \setminus R_p = \{z\}$ with $z \in C$, so that $C$ is successfully recovered in line 28, with probability at least $1 - \frac{1}{100k^2}$. Once this is established, the lemma again follows by a union bound. We have
$$\Pr_{T : |T| = 2^p}\big[|T \setminus R_p| = 1 \text{ and } T \setminus R_p \subseteq C\big] \;\ge\; |T| \cdot \frac{|C|}{n} \cdot \Big(1 - \frac{|U \setminus R_p|}{n}\Big)^{|T|-1} \;\ge\; \frac{1}{2k}\Big(1 - \frac{1}{2^p}\Big)^{2^p - 1} \;\ge\; \frac{1}{2ek},$$
and so the probability that this occurs for some $T \in Q_p$ is at least $1 - (1 - \frac{1}{2ek})^{4ek \ln 10k} \ge 1 - \frac{1}{100k^2}$, as claimed.

**Case 3: $p > \log\log n$.** Let $C$ denote some cluster with size $|C| \in [\frac{n}{2k \cdot 2^p}, \frac{n}{2k \cdot 2^{p-1}})$. Note that $|U \setminus R_p| \le k \cdot \frac{n}{2k \cdot 2^{p-1}} = \frac{n}{2^p}$. Recall from lines 33-35 the definitions of $Q'_p$ and $Q''_p$, and recall that $G_p$ is the graph with vertex set $U \setminus R_p$ and edge set $Q''_p$. We need to argue that the induced subgraph $G_p[C]$ is connected, so that $C$ is successfully recovered in lines 36-37, with probability at least $1 - \frac{1}{100k^2}$. Once this is established, the lemma again follows by a union bound. We rely on the following standard fact from the theory of random graphs; for completeness, we give a proof in Appendix D.2.

**Fact 2.4.** *Let $G(N, p)$ denote an Erdős–Rényi random graph; that is, the graph contains $N$ vertices and there is an edge between each pair of vertices independently with probability $p$. If $p \ge 1 - (\frac{\delta}{3N})^{2/N}$, then $G(N, p)$ is connected with probability at least $1 - \delta$.*

Consider any $x, y \in C$ and observe that
$$\Pr_{T : |T| = 2^p}\big[T \setminus R_p = \{x, y\}\big] \;\ge\; \binom{2^p}{2} \cdot \frac{2}{n^2} \cdot \Big(1 - \frac{1}{2^p}\Big)^{2^p - 2} \;\ge\; \frac{2^{2p}}{2en^2}.$$
Recall that the algorithm queries $|Q_p| = \frac{40nk \ln(300nk^2)}{2^p}$ random sets $T$ of size $2^p$. Thus,
$$\Pr_{Q_p}\big[(x, y) \in E(G_p[C])\big] = \Pr_{Q_p}\big[\{x, y\} \in Q''_p\big] = \Pr_{Q_p}\big[\exists\, T \in Q_p : T \setminus R_p = \{x, y\}\big] \;\ge\; 1 - \exp\Big(-\frac{2^p \cdot 4k \ln(300nk^2)}{n}\Big),$$
and using $|C| \ge \frac{n}{2k \cdot 2^p}$ (so that $2^p \ge \frac{n}{2k|C|}$) and $|C| \le n$, we obtain
$$\Pr_{Q_p}\big[(x, y) \in E(G_p[C])\big] \;\ge\; 1 - \exp\Big(-\frac{2 \ln(300nk^2)}{|C|}\Big) \;\ge\; 1 - \exp\Big(-\frac{2 \ln(300k^2|C|)}{|C|}\Big) = 1 - \Big(\frac{1}{300k^2|C|}\Big)^{2/|C|}.$$
Thus, $(x, y)$ is an edge of $G_p[C]$ with probability at least $1 - (\frac{1}{300k^2|C|})^{2/|C|}$, and so by Fact 2.4 (applied with $N = |C|$ and $\delta = \frac{1}{100k^2}$), $G_p[C]$ is connected with probability at least $1 - \frac{1}{100k^2}$, as claimed. $\square$

The full pseudocode of the algorithm is as follows.

```
Algorithm 1: Non-adaptive Algorithm for Constant k
 1  Input: Subset query access to a hidden partition C_1 ⊔ ... ⊔ C_k = U of |U| = n points
 2  (Query Selection Phase)
 3  for p = 0, 1, ..., log n do
 4      Initialize Q_p ← ∅
 5      if p ≤ log log n then
 6          Repeat 4ek·ln(10k) times:
 7              Sample T ⊆ U formed by 2^p independent uniform samples from U
 8              Query T and T ∪ {x} for all x ∈ U
 9              Add T to Q_p
11      if p > log log n then
12          Repeat (40nk·ln(300nk^2))/2^p times:
13              Sample T ⊆ U formed by 2^p independent uniform samples from U
14              Query T and add it to Q_p
17  (Reconstruction Phase)
18  Initialize learned cluster set C_0 ← ∅
19  for p = 0, 1, ..., log n do
20      Let C_p denote the collection of clusters reconstructed before iteration p
21      Let R_p = ∪_{C ∈ C_p} C denote the points belonging to these clusters
22      Initialize C_{p+1} ← C_p
23      if p ≤ log log n then
24          for T ∈ Q_p do
25              if |T \ R_p| = 1 then
26                  Let z denote the unique point in T \ R_p
27                  For x ∈ U \ R_p, q(T) = q(T ∪ {x}) iff x, z are in the same cluster
28                  Thus, add {x ∈ U \ R_p : q(T) = q(T ∪ {x})} to C_{p+1}
32      if p > log log n then
33          Let Q'_p = {T \ R_p : T ∈ Q_p and |T \ R_p| = 2}. Since each T ∈ Q_p is a
            uniform random set, the elements of Q'_p are uniform random pairs in U \ R_p
34          Let Q''_p = {{x, y} ∈ Q'_p : q({x, y}) = 1} denote the set of pairs in Q'_p whose
            points lie in the same cluster. This set can be computed since
            q(T \ R_p) = q(T) − q(T ∩ R_p), and q(T ∩ R_p) is known because at this point
            we have reconstructed the clustering on R_p
35          Let G_p denote the graph with vertex set U \ R_p and edge set Q''_p
36          Let C_1, ..., C_ℓ denote the connected components of G_p with size at least n/(2k·2^p)
37          Add C_1, ..., C_ℓ to C_{p+1}
40  Output clustering C_{log n + 1}
```
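As a concrete illustration of the decoding step in lines 24-28 of Alg. 1, the following Python sketch recovers the cluster of the unique unrecovered point of a sampled set $T$; the dictionary `answers` stands in for the single round of query responses, and the helper name is ours.

```python
def decode_singleton(T, R_p, U, answers):
    """Decoding step of Alg. 1, lines 24-28 (a sketch).

    `answers` maps each queried frozenset S to q(S); all queries
    were issued in the one non-adaptive round. If T has exactly
    one point z outside the already-recovered points R_p, then
    q(T) = q(T + {x}) iff x is in the same cluster as z, so one
    pass over U \ R_p recovers z's entire cluster.
    """
    outside = set(T) - R_p
    if len(outside) != 1:
        return None                # T is not usable in this phase
    base = answers[frozenset(T)]
    return {x for x in set(U) - R_p
            if answers[frozenset(set(T) | {x})] == base}
```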
**Bounded query size.** We can restrict the query size to $s \ll n$ and still achieve near-linear query complexity. We sketch the main ideas here for the case $k = 3$, in the spirit of the warm-up in Section 2.1; details are provided in Appendix A. Our Theorem 1.4 gives an $O(n \log n \log\log n)$-query non-adaptive sample-based algorithm using subset queries of size at most $O(\sqrt{n})$. The main idea is to employ "Strategy 2" from the warm-up of Section 2.1 with a slight alteration. Let $A$ and $B$ denote the largest and second largest clusters, respectively, where $|B| = \delta n$ and so $|A| \ge (1 - 2\delta)n$. Observe that if we take a random set $S$ of size $t \approx \sqrt{1/\delta}$, then the probability that two points land in $B$ and the rest land in $A$ is at least $\binom{t}{2} \delta^2 (1 - 2\delta)^{t-2} = \Omega(\delta)$. Recalling the definition of the graph $G$ and the discussion in Section 2.1, after querying $\Omega(n \ln n)$ such sets $S$, the induced subgraph $G[B]$ becomes connected with high probability, thus recovering the clustering. Similar ideas let us generalize to any $s$ and achieve an optimal dependence on $s$, as stated in Corollary C.2, for constant $k$.

### 2.2 An $O(n \log^2 n \log k)$ Algorithm for General $k$

We now consider the situation for general $k$, where our algorithm and analysis follow a completely different approach, using techniques from combinatorial group testing.

**Warm up.** The main subroutine in our algorithm is a procedure for recovering the support of a Boolean vector via OR queries. Given a vector $v \in \{0,1\}^n$, an OR query on a set $S \subseteq [n]$ returns $\mathrm{OR}_S(v) = \bigvee_{i \in S} v_i$, i.e. it returns 1 iff $v$ has a 1-valued coordinate in $S$. Recovering the support of $v$, $\mathrm{supp}(v) = \{i : v_i = 1\}$, via OR queries is a basic problem from the group testing and coin-weighing literature. The relevance of this problem to $k$-clustering with subset queries is as follows. Consider a hidden clustering $C_1 \sqcup \cdots \sqcup C_k = U$. Given $x \in U$, let $C(x)$ denote the cluster containing $x$. Write $U = \{x_1, \ldots, x_n\}$ (an arbitrary ordering of $U$), and let $v^{(x)} \in \{0,1\}^n$ denote the Boolean vector with $v^{(x)}_i = \mathbb{1}(x_i \in C(x))$. An OR query on a set $S$ to $v^{(x)}$ can be simulated by subset queries to the clustering on the sets $S$ and $S \cup \{x\}$, since
$$\mathrm{OR}_S(v^{(x)}) = \bigvee_{i \in S} v^{(x)}_i = \mathbb{1}(C(x) \cap S \neq \emptyset) = \mathbb{1}(q(S \cup \{x\}) = q(S)).$$
Thus, the problem of reconstructing $C(x)$ via subset queries is equivalent to that of recovering $v^{(x)}$ via OR queries, up to a factor of 2 in the query complexity. Then, to learn a cluster $C$ of size $\frac{n}{2^p} \le |C| \le \frac{n}{2^{p-1}}$, it suffices to sample $O(2^p)$ random points $x$ (one of which lands in $C$ with high probability) and then recover $C(x)$ using $O(\frac{n}{2^p} \log n)$ OR queries. Iterating over every $p \le \log n$ and boosting the number of samples to guarantee a high probability of success for all $k$ clusters yields our algorithm. This algorithm can also be restricted to only make subset queries of size at most $s$, with the query complexity scaling as $\frac{1}{s}$.

**Theorem 2.5.** *For every $s \in [2, n]$, there is a non-adaptive $k$-clustering algorithm making $O(n \log n \log k \cdot (\frac{n}{s} + \log s))$ subset queries of size at most $s$. In particular, for unbounded query size the algorithm makes $O(n \log^2 n \log k)$ queries.*

**Proof of Theorem 2.5.** We will use the following lemma for recovering $\mathrm{supp}(v) = \{i : v_i = 1\}$ via OR queries. We prove and discuss this lemma in Appendix D.1 (see Lemma D.5).

**Lemma 2.6.** *Let $v \in \{0,1\}^n$ and let $s, t \ge 1$ be positive integers with $s \le \frac{n}{t}$. There is a non-adaptive algorithm that makes $O(\frac{n}{s} \log \frac{t}{\delta})$ OR queries on subsets of size $s$ and, if $|\mathrm{supp}(v)| \le t$, returns $\mathrm{supp}(v)$ with probability $1 - \delta$, and otherwise certifies that $|\mathrm{supp}(v)| > t$. The algorithm runs in time $O(n \log n)$.*
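The reduction from OR queries to subset queries is mechanical; here is a minimal sketch (the helper name `or_query` is ours, not the paper's notation):

```python
def or_query(x, S, q):
    """Simulate an OR query on v^(x) with two subset queries.

    OR_S(v^(x)) = 1 iff some element of S shares x's cluster,
    which holds iff adding x to S does not increase the number
    of intersected clusters, i.e. q(S + {x}) == q(S).
    """
    S = set(S)
    return int(q(S | {x}) == q(S))
```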
Recall that $\mathrm{OR}_S(v^{(x)}) = \mathbb{1}(q(S \cup \{x\}) = q(S))$, i.e. an OR query on $S$ is simulated by subset queries on the sets $S$ and $S \cup \{x\}$. Thus, we immediately get the following corollary.

**Corollary 2.7.** *Let $x \in U$ and let $r \ge 2$, $t \ge 1$ be positive integers with $r \le \frac{n}{t}$. There is a non-adaptive algorithm that makes $O(\frac{n}{r} \log \frac{t}{\delta})$ subset queries on sets of size at most $r$ and, if $|C(x)| \le t$, returns $C(x)$ with probability $1 - \delta$, and otherwise certifies that $|C(x)| > t$. The algorithm runs in time $O(n \log n)$.*

**Algorithm.** The pseudocode for the algorithm is given in Alg. 2. The idea is to draw random points $x \in U$ (line 5) and then use the procedure from Corollary 2.7 as a subroutine to try to learn $C(x)$ (line 6). By the corollary, this succeeds with high probability in recovering $C(x)$ as long as $t$ is set to something larger than $|C(x)|$. Note that the query complexity of this subroutine depends on $t$. If a cluster $C$ is small, then $\Pr[x \in C]$ is small, but we can call the subroutine with small $t$; if $C(x)$ is large, then $\Pr[x \in C]$ is reasonably large, though we need to call the subroutine with larger $t$. Concretely, the algorithm iterates over every $p \in \{1, \ldots, \log n\}$ (line 3), and in iteration $p$ the goal is to learn every cluster $C$ with $|C| \in [\frac{n}{2^p}, \frac{n}{2^{p-1}}]$. To accomplish this, we sample $\Theta(2^p \log k)$ random points $x \in U$ (lines 4-5) and, for each one, call the subroutine with $t = \frac{n}{2^{p-1}}$ (line 6), which is an upper bound on the sizes of the clusters we are trying to learn. Note that we always invoke the corollary with query size $r = \min(s, 2^{p-1}) \le s$, enforcing the query size bound stated in Theorem 2.5.

```
Algorithm 2: Non-adaptive Algorithm for General k
 1  Input: Subset query access to a hidden partition C_1 ⊔ ... ⊔ C_k = U of |U| = n points
 2  Initialize hypothesis clustering C ← ∅
 3  for p = 1, ..., log n do
 4      Repeat 2^p·ln(200k) times:
 5          Sample x ∈ U uniformly at random
 6          Run the procedure from Corollary 2.7 on x with t = n/2^{p-1}, query size
            r = min(s, 2^{p-1}), and error probability δ = 1/(200k). This outputs C(x),
            the cluster containing x, with probability at least 1 − δ if |C(x)| ≤ t
 7          If the procedure returns a set C, then set C ← C ∪ {C}; otherwise, continue
 9  Output the clustering C
```

**Query complexity:** Each call in line 6 during the $p$-th iteration makes $O(\frac{n}{s} \log n)$ queries when $2^{p-1} \ge s$, and $O(\frac{n}{2^p} \log n)$ queries when $2^{p-1} < s$. Therefore, the total number of queries made is at most
$$\sum_{p \,:\, 2^{p-1} < s} 2^p \ln(200k) \cdot O\Big(\frac{n}{2^p} \log n\Big) \;+\; \sum_{p \,:\, 2^{p-1} \ge s} 2^p \ln(200k) \cdot O\Big(\frac{n}{s} \log n\Big) \;=\; O\Big(n \log n \log k \cdot \Big(\log s + \frac{n}{s}\Big)\Big).$$
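To summarize how the pieces fit together, here is a minimal Python sketch of the driver loop of Alg. 2. The subroutine `recover_cluster` stands in for the procedure of Corollary 2.7 (whose internals are in Appendix D.1); as in the listing, every query it makes can be fixed in advance, so the overall scheme stays non-adaptive.

```python
import math
import random

def nonadaptive_general_k(U, q, k, s, recover_cluster):
    """Driver loop of Alg. 2 (a sketch, under stated assumptions).

    recover_cluster(x, t, r, delta, q) stands in for Corollary 2.7:
    it returns C(x) with probability 1 - delta whenever |C(x)| <= t,
    using non-adaptive subset queries of size at most r, and
    otherwise reports failure by returning None.
    """
    n = len(U)
    clusters = []
    for p in range(1, int(math.log2(n)) + 1):
        t = n // 2 ** (p - 1)            # size threshold this round
        r = min(s, 2 ** (p - 1))         # query-size cap, r <= n/t
        repeats = int(2 ** p * math.log(200 * k)) + 1
        for _ in range(repeats):
            x = random.choice(U)
            C = recover_cluster(x, t, r, 1.0 / (200 * k), q)
            if C is not None and C not in clusters:
                clusters.append(C)
    return clusters
```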