# Sets Clustering

Ibrahim Jubran\*, Murad Tukan\*, Alaa Maalouf\*, Dan Feldman

\*Equal contribution. Robotics & Big Data Lab, Department of Computer Science, University of Haifa, Israel. Correspondence to: Ibrahim Jubran.

*Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).*

## Abstract

The input to the sets-$k$-means problem is an integer $k\ge 1$ and a set $\mathcal{P}=\{P_1,\ldots,P_n\}$ of fixed-sized sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P\in\mathcal{P}}\min_{p\in P,\,c\in C}\lVert p-c\rVert^2$ of squared distances to these sets. An $\varepsilon$-core-set for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to a $1\pm\varepsilon$ factor, for every set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2 n)$ sets always exists, and can be computed in $O(n\log n)$ time, for every input $\mathcal{P}$ and every fixed $d,k\ge 1$ and $\varepsilon\in(0,1)$. The result easily generalizes to any metric space, distances to the power of $z>0$, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm on this coreset allows us to obtain the first PTAS ($1+\varepsilon$ approximation) for the sets-$k$-means problem that takes time near linear in $n$. This is the first such result even for the sets-mean on the plane ($k=1$, $d=2$). Open source code and experimental results for document classification and facility locations are also provided.

## 1. Introduction

In machine learning it is common to represent the input as a set of $n$ points (database records) $P=\{p_1,\ldots,p_n\}$ in the Euclidean $d$-dimensional space $\mathbb{R}^d$, that is, an $n\times d$ real matrix whose rows correspond to the input points. Every point corresponds to, e.g., the GPS address of a person (Liao et al., 2006; Nguyen et al., 2011), a pixel/feature in an image (Tuytelaars et al., 2008), the bag of words of a document (Mladenic, 1999), or a sensor's sample (Dunia et al., 1996). Arguably, the most common statistic of such a set is its mean (center of mass), which is the center $c\in\mathbb{R}^d$ that minimizes the sum of squared distances $\sum_{p\in P}D(p,c)=\sum_{p\in P}\lVert p-c\rVert^2$ to the input points in $P$.

*Figure 1. Sets-$k$-means for pairs on the plane. The input is a set of $n=5$ pairs of points that correspond to home/work addresses. (Left) The distance from a gas station (in red) to a person is the smaller of its distances to the home and to the work address (line segments). (Right) For $k=2$ gas stations, each person chooses its closest gas station; see the real-world database in Section 6.*

Here, $D(p,c):=\lVert p-c\rVert^2$ is the squared distance between a point $p\in P$ and the center $c\in\mathbb{R}^d$. More generally, in unsupervised learning, for a given integer (number of clusters) $k\ge 1$, the $k$-means of the set $P$ is a set $C=\{c_1,\ldots,c_k\}$ of $k$ centers (points in $\mathbb{R}^d$) that minimizes the sum of squared distances
$$\sum_{p\in P}D(p,C)=\sum_{p\in P}\min_{c\in C}\lVert p-c\rVert^2,$$
where $D(p,C):=\min_{c\in C}D(p,c)$ denotes the squared distance from each point $p\in P$ to its nearest center in $C$. The $k$-means objective is probably the most common clustering objective function, both in academia and industry, as claimed in (Hartigan, 1975; Arthur & Vassilvitskii, 2006; Berkhin, 2002; Wu et al., 2008). However, in the real world, every database record actually links to another database table, a GPS location may correspond to multiple GPS locations (e.g.
home/work), every image consists of a set of pixels/features, every document contains a set of paragraphs, and a sensor's sample may actually be a distribution over some possible values (Li et al., 2010; 2008; Dunia et al., 1996; Xiao et al., 2007). This motivates the following title and subject of this paper.

**Sets Clustering.** Throughout this paper, the input is not a set of points, but rather a set $\mathcal{P}=\{P_1,\ldots,P_n\}$ of sets in $\mathbb{R}^d$ (or any other metric space; see Section 2), each of size $m$, denoted as $m$-sets. A natural generalization of the mean of a set $P$ is what we define as the *sets-mean* of our set $\mathcal{P}$ of sets. The sets-mean is the point $c\in\mathbb{R}^d$ that minimizes the sum of squared distances
$$\sum_{P\in\mathcal{P}}D(P,c)=\sum_{P\in\mathcal{P}}\min_{p\in P}\lVert p-c\rVert^2, \tag{1}$$
to the nearest point in each set. Here, $D(P,c):=\min_{p\in P}D(p,c)$. More generally, the *sets-$k$-means* $C$ of $\mathcal{P}$ is a set of $k$ points in $\mathbb{R}^d$ that minimizes the sum of squared distances
$$\sum_{P\in\mathcal{P}}D(P,C)=\sum_{P\in\mathcal{P}}\min_{p\in P,\,c\in C}\lVert p-c\rVert^2, \tag{2}$$
to the nearest point in each set. Here, $D(P,C):=\min_{p\in P,\,c\in C}D(p,c)$ is the closest distance between a pair in $P\times C$.

**Example.** Suppose that we want to place a gas station that will serve $n$ people whose home addresses are represented by $n$ GPS points (on the plane). The mean is a natural candidate, since it minimizes the sum of squared distances from the gas station to the people; see (Jubran et al., 2019). Now, suppose that the $i$th person, for every $i\in\{1,\ldots,n\}=[n]$, is represented by a pair $P_i=\{h,w\}$ of points on the plane: a home address $h$ and a work address $w$; see Fig. 1. It would be equally convenient for a resident if the gas station were built next to his work address rather than his home address. Hence, the sets-mean of the addresses $\mathcal{P}=\{P_1,\ldots,P_n\}$, as defined above, minimizes the sum of squared distances from the gas station to the nearest address of each person (either home or work). The sets-$k$-means is the set $C\subseteq\mathbb{R}^d$ of $k$ gas stations that minimizes the sum of squared Euclidean distances from each person to its nearest gas station, as in (2).

### 1.1. Applications

From a theoretical point of view, sets clustering is a natural generalization of points clustering. The distance $D(P,C)$ between sets generalizes the distance $D(p,C)=\min_{c\in C}D(p,c)$ between a point and a set, as used, e.g., in $k$-means clustering of points. It is therefore not surprising that many of the numerous applications of points clustering can be generalized to sets clustering. A few examples are given below.

**Clustering shapes (Srivastava et al., 2005).** The first sets-clustering-related result appeared only recently in (Marom & Feldman, 2019), for the special case where each of the $n$ input sets is a line (an infinite set) in $\mathbb{R}^d$. In this paper, however, every input set is a finite and arbitrary set in a general metric space.

**Facility location (Cohen-Addad et al., 2019; Blelloch & Tangwongsan, 2010; Ahmadian et al., 2013).** The above gas station example immediately implies applications for facility location problems.

**Natural language processing (Collobert et al., 2011).** A disadvantage of the common bag-of-words model is that the order of words in a document does not change its representation (Spanakis et al., 2012). Sets clustering can help partially overcome this issue by considering the document as the set of vectors corresponding to each of its paragraphs, as illustrated in Fig. 5.

**Hierarchical clustering (Abboud et al., 2019; Murtagh, 1983).** Here, the goal is to compute a tree of clusters.
The leaves of this tree are the input points, and the next level represents their clustering into $n$ sets. At the next level, the goal is to cluster these $n$ sets into $k$ sets.

**Probabilistic databases (Suciu et al., 2011).** Here, each data sample corresponds to a finite distribution over possible values, e.g., a sample that was obtained from a sensor with a known noise model. An algorithm for computing the minimum enclosing ball (1-center) of sets (distributions) was suggested in (Munteanu et al., 2014) using coresets, as defined in Section 1.4.

### 1.2. Why is it Hard?

Computing the $k$-means of points in $\mathbb{R}^d$ ($m=1$) is already NP-hard when $k$ is not fixed, even for $d=2$. It can be solved in $n^{O(dk)}$ time using exhaustive search, as explained in (Inaba et al., 1994). A multiplicative $(1+\varepsilon)$ approximation is also NP-hard for a constant $\varepsilon>0$ (Lee et al., 2017). For fixed $k$, a deterministic constant-factor approximation can be computed in time $O(ndk)$ by constructing coresets (see Section 1.4) of size $O(k/\varepsilon^3)$ (Braverman et al., 2016; Feldman & Langberg, 2011), on which the optimal exhaustive search is then applied. In practice, there are efficient approximation algorithms with provable guarantees, such as $k$-means++ (Arthur & Vassilvitskii, 2006), which yields an $O(\log k)$ approximation using $D^2$ sampling. The mean ($k=1$), $\sum_{p\in P}p/n$, of a set $P$ of $n$ points in $\mathbb{R}^d$ can be computed in linear $O(nd)$ time. However, we could not find in the literature an algorithm for computing even the sets-mean in (1) for $n$ pairs of points on the plane ($m=d=2$).

**Separability.** The clusters in the $k$-means problem are separable: the minimum enclosing ball of each cluster consists only of the points in this cluster. Fundamental results in computational geometry (Toth et al., 2017, Chapter 28) and PAC-learning theory (Shalev-Shwartz & Ben-David, 2014) prove that there are only $n^{O(1)}$ partitions of $P$ into $k$ such clusters that can be covered by balls. On the contrary, even in the case of the sets-mean ($k=1$), the $n$ representative points (one from each pair) that are served by the center are not separable by a ball from the other $n$ points (those that are not served by the center); see Fig. 2.

*Figure 2. Why is the sets clustering problem hard? 2(a): The space is non-metric. Two $m$-sets $P=\{p_1,p_2,p_3\}$ and $Z=\{z_1,z_2,z_3\}$ in $\mathbb{R}^d$ for $m=3$, and a point $c\in\mathbb{R}^d$, that do not satisfy the triangle inequality, since $D(P,Z)=0$ and $D(Z,c)=0$ but $D(P,c)>0$. 2(b): Separability. For $d=2$, a set of $n=6$ pairs (blue ellipses) and their optimal mean (red star). There is no ball that separates the $n$ points (one from each set) which are closest to the optimal mean (red circles) from the other $n$ points (solid blue circles).*

**Non-metric space.** The generalization of the $k$-means distance function to sets in (2) is not a metric, i.e., it does not satisfy the triangle inequality, even approximately. For example, two input sets might have zero distance between them while one is very far from, and the other very close to, a center point; see Fig. 2. Furthermore, one may suggest to replace every input set $P\subseteq\mathbb{R}^d$ of size $|P|=m$ with its $m$ individual points in $\mathbb{R}^d$, thus reducing the sets-$k$-means problem to an instance of the well-known $k$-means problem. However, Fig. 2(a) gives a simple example where the cost of the sets-$k$-means is $0$ while the $k$-means cost of flattening the sets may be arbitrarily large.

### 1.3. How Hard?

The previous section may raise the suspicion that sets-$k$-means is NP-hard, even for $k=1$ and $d=2$. However, this is not the case.
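To make the objective in (2) and the flattening pitfall concrete, the following minimal Python sketch (not the authors' implementation; all names are illustrative) computes the sets-$k$-means cost and reproduces a Fig. 2(a)-style example where the sets cost is zero while the $k$-means cost of the flattened points is arbitrarily large.

```python
# A minimal sketch of the sets-k-means cost in (2), and of the "flattening"
# pitfall discussed above. Illustrative code, not the paper's implementation.
import numpy as np

def sets_cost(P, C):
    """Sum over sets of the min squared distance over all (p, c) pairs, as in (2)."""
    C = np.asarray(C, dtype=float)
    total = 0.0
    for S in P:  # S is one m-set, given as an (m, d) array
        S = np.asarray(S, dtype=float)
        # pairwise squared distances between the m points of S and the k centers
        d2 = ((S[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        total += d2.min()
    return total

def kmeans_cost(points, C):
    """Ordinary k-means cost of a flat point set."""
    points = np.asarray(points, dtype=float)
    C = np.asarray(C, dtype=float)
    d2 = ((points[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

if __name__ == "__main__":
    M = 1e3  # make M larger and the gap grows without bound
    # Two pairs (m = 2) on the line (d = 1); each pair contains the origin.
    P = [np.array([[0.0], [M]]), np.array([[0.0], [-M]])]
    c = np.array([[0.0]])          # a single center (k = 1)
    print(sets_cost(P, c))         # 0.0  -- the sets-mean cost can be zero
    flat = np.vstack(P)            # "flattening" the sets into 4 points
    print(kmeans_cost(flat, c))    # 2*M**2 -- arbitrarily large
```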
In Section 4.4, we present a simple theorem for computing the exact (optimal) sets-$k$-means for any input set $\mathcal{P}$ of $n$ sets, each of size $m$. This takes time polynomial in $n$, i.e., $n^{O(1)}$, for every constant integers $k,d,m\ge 1$. The theorem is based on a generic reduction to the case of $k=m=1$. Unfortunately, the constants that are hidden in the $O(1)$ notation above make our algorithm impractical even for modest values of $k$. This motivates the construction of the first coreset for sets, which is the main technical result of this paper.

### 1.4. Sets Coresets

A coreset (or core-set) is a modern data summarization paradigm (Maalouf et al., 2019a; Bachem et al., 2017a; Phillips, 2016) that originated from computational geometry (Agarwal et al., 2005). Usually, the input for a coreset construction algorithm is an approximation error $\varepsilon\in(0,1)$, a set $\mathcal{P}$ of $n$ items (called points), and a loss $\sum_{P\in\mathcal{P}}D(P,\cdot)$ that we wish to minimize over a (usually infinite) set $\mathcal{C}$ of feasible queries (solutions). The output is a (sub)set $\mathcal{S}\subseteq\mathcal{P}$ and a weights function $v:\mathcal{S}\to[0,\infty)$, which is called an $\varepsilon$-coreset for the tuple $(\mathcal{P},\mathcal{C},D)$ if
$$(1-\varepsilon)\sum_{P\in\mathcal{P}}D(P,C)\;\le\;\sum_{S\in\mathcal{S}}v(S)\,D(S,C)\;\le\;(1+\varepsilon)\sum_{P\in\mathcal{P}}D(P,C),$$
for every query $C\in\mathcal{C}$. In particular, an optimal solution on the coreset is an approximate optimal solution to the original problem. If $|\mathcal{S}|\ll|\mathcal{P}|$, i.e., the size of the coreset $\mathcal{S}$ is smaller than that of $\mathcal{P}$ by orders of magnitude, then we can run a possibly inefficient algorithm on $\mathcal{S}$ to compute an approximate solution for $\mathcal{P}$. In this paper, unlike previous papers, $\mathcal{P}$ is a set of sets of size $m$ (rather than points) in $\mathbb{R}^d$, and $\mathcal{C}=\{C\subseteq\mathbb{R}^d \mid |C|=k\}$.

**Why coresets?** Applying the above optimal exhaustive search on such a coreset would reduce the running time from $n^{O(1)}$ to time near linear in $n$, conditioned upon: (i) every such input $\mathcal{P}$ has a coreset $\mathcal{S}$ of size, say, $|\mathcal{S}|\le(\log n)^{O(1)}$, and (ii) this coreset can be computed in near-linear time, say $O(n\log n)$. However, such a coreset construction for a problem has many other applications, including handling big streaming, dynamic, and distributed data in parallel. Here, streaming means maintaining the sets-$k$-means of a (possibly infinite) stream of sets, via one pass and using only logarithmic memory and update time per new set. Dynamic data also supports deletion of sets. Distributed data means that the input is partitioned among $M\ge 2$ machines, where the running time reduces by a factor of $M$ (Régin et al., 2013). Many surveys explain how to obtain those applications, given an efficient construction of a small coreset as suggested in our paper. Due to lack of space we do not repeat them here and refer the reader to, e.g., (Feldman, 2020).

The recent result above (Marom & Feldman, 2019) for $k$-means of lines (infinite sets) is obtained via coresets. We do not know of any coresets for finite sets except for singletons ($m=1$). This coreset, which is called a coreset for $k$-means (of points), is one of the fundamental and most researched coresets of this century: (Har-Peled & Mazumdar, 2004; Chen, 2006; Frahling & Sohler, 2008; Chen, 2009; Fichtenberger et al., 2013; Bachem et al., 2015; Barger & Feldman, 2016; Bachem et al., 2017b; Feldman et al., 2017; Bachem et al., 2018; Huang et al., 2018). Coresets for fair clustering of points, which preserve sets-related properties of the input points, were suggested in (Schmidt et al., 2019). A natural open question is: "Does a small coreset exist for the sets-$k$-means problem for any input?"
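The coreset guarantee above is easy to check empirically for a single query $C$: the weighted cost over $(\mathcal{S}, v)$ should be within a $1\pm\varepsilon$ factor of the full cost. The sketch below is an illustration of that check under the sets-$k$-means loss; the helper names are assumptions, not part of the paper's code.

```python
# A minimal sketch (illustrative names, not the paper's code) of checking the
# epsilon-coreset inequality of Section 1.4 for one query C.
import numpy as np

def sets_cost(P, C, weights=None):
    """Weighted sets-k-means cost: sum_i w_i * min_{p in P_i, c in C} ||p - c||^2."""
    C = np.asarray(C, dtype=float)
    total = 0.0
    for i, S in enumerate(P):
        S = np.asarray(S, dtype=float)
        d2 = ((S[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        total += (1.0 if weights is None else weights[i]) * d2.min()
    return total

def is_eps_coreset_for(P, coreset, v, C, eps):
    """True iff (coreset, v) approximates the cost of P for this particular query C."""
    full = sets_cost(P, C)
    approx = sets_cost(coreset, C, weights=v)
    return (1 - eps) * full <= approx <= (1 + eps) * full
```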
### 1.5. Main Contributions

In this paper we suggest the first $(1+\varepsilon)$ approximation for the sets-$k$-means problem, by suggesting the first coreset for sets. More precisely, we provide:

(i) A proof that an $\varepsilon$-coreset $\mathcal{S}$ of size $|\mathcal{S}|=O(\log^2 n)$ exists for every input set $\mathcal{P}$ of $n$ sets in $\mathbb{R}^d$, each of size $m$. This holds for all constants $d,k,m\ge 1$. The coreset $\mathcal{S}$ can be computed in time $O(n\log n)$; see exact details in Theorem 4.2. We also provide a lower bound for the size of such a coreset; see Section 4.3.

(ii) An algorithm that computes an optimal solution for the sets-$k$-means of such a $\mathcal{P}$ in $n^{O(1)}$ time; see Theorem 4.4.

(iii) Combining the above results yields the first PTAS ($(1+\varepsilon)$-approximation) for the sets-$k$-means of any such input set $\mathcal{P}$, which takes $O(n\log n)$ time; see Corollary 4.5.

(iv) Extensions of (i) from the Euclidean distance in $\mathbb{R}^d$ to any metric space $(\mathcal{X},\mathrm{dist})$, distances to the power of $z>0$, and M-estimators that are robust to outliers; see Section 2.

(v) Experimental results on synthetic and real-world datasets showing that our coreset performs well also in practice.

(vi) An open source implementation for reproducing our experiments and for future research (Jubran et al., 2020).

### 1.6. Novelty

Our coreset construction needs to characterize which of the input items are similar, and which are dissimilar, in some sense. To this end, we first suggest a similarity measure for sets and then present our novel non-uniform sampling scheme for sets, which we call *onion sampling*.

**Recursive similarity.** When $m=1$, items are similar if their mutual distance is small. When $m\ge 2$, we propose a recursive and abstract similarity measure, which requires all the $m$ items in the first set to be close to the $m$ items in the second set, for some ordering of the items inside each set; see Algorithm 1.

**Onion sampling.** Recall that $D^2$ sampling assigns each input point a probability that is proportional to its distance to the $k$-means of the input (or its approximation), which reflects its importance. When we try to generalize $D^2$ sampling to handle sets rather than points, it is not clear what to do when one point in an input $m$-set is close to the approximated center and another one is far, as in Fig. 2(a). In particular, if the optimal sum of squared distances is zero, the coreset in the $k$-means problem is trivial (the $k$ points). This is not the case for the sets-$k$-means (even for $k=1$). To this end, we suggest an iterative and non-trivial alternative sampling scheme called onion sampling. In each iteration we apply an algorithm which characterizes recursively similar input sets, as described above, which form an "onion layer". We assign those sets the same sampling probability, which is inversely proportional to the number of those items, and peel this layer off. We continue until we have peeled off the entire onion (input). Finally, we prove that a random sample according to this distribution yields a coreset for the sets clustering problem; see Algorithm 2.

## 2. Definitions

In (2) we defined the sets-$k$-means for points in $\mathbb{R}^d$. However, our coreset construction holds for any metric space, and for general (non-distance) loss functions, as in Table 1.

**Definition 2.1 (Loss function $D$).** Let $\mathrm{lip}:[0,\infty)\to[0,\infty)$ be a non-decreasing function that satisfies the following $r$-log-log Lipschitz condition: there is a constant $0<r<\infty$ such that for every $x,z>0$ we have $\mathrm{lip}(zx)\le z^r\,\mathrm{lip}(x)$. Let $(\mathcal{X},\mathrm{dist})$ be a metric space, and let $D$ be the function that maps every two subsets $P,C\subseteq\mathcal{X}$ to
$$D(P,C):=\min_{p\in P,\,c\in C}\mathrm{lip}\big(\mathrm{dist}(p,c)\big).$$
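As a concrete reading of Definition 2.1 (and of the examples collected in Table 1 below), the sketch here plugs a few lip functions into the induced set-to-set loss. The Huber threshold `delta` and all names are illustrative assumptions, not the authors' code.

```python
# A minimal sketch of Definition 2.1: a few lip functions and the induced loss
# D(P, C) = min over (p, c) of lip(dist(p, c)) under the Euclidean metric.
import numpy as np

def lip_median(x):            # sets-k-median: lip(x) = x        (r = 1)
    return x

def lip_means(x):             # sets-k-means:  lip(x) = x**2     (r = 2)
    return x ** 2

def lip_huber(x, delta=1.0):  # M-estimator, robust to outliers  (r = 2)
    return np.where(x <= delta, 0.5 * x ** 2, delta * (np.abs(x) - 0.5 * delta))

def set_loss(P, C, lip):
    """D(P, C) for two finite subsets of R^d."""
    P, C = np.asarray(P, float), np.asarray(C, float)
    dists = np.linalg.norm(P[:, None, :] - C[None, :, :], axis=2)
    return lip(dists).min()

# A quick numeric spot-check of the r-log-log Lipschitz condition
# lip(z * x) <= z**r * lip(x), with r = 2 for lip_means:
x, z = 3.0, 5.0
assert lip_means(z * x) <= z ** 2 * lip_means(x)
```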
For $p,b\in\mathcal{X}$, we denote $D(p,C):=D(\{p\},C)$ and $D(P,b):=D(P,\{b\})$ for short. For an integer $k\ge 1$ define $\mathcal{X}^k:=\{C\subseteq\mathcal{X} \mid |C|=k\}$. Although $(\mathcal{X},D)$ is not necessarily a metric space, the triangle inequality is approximated as follows.

**Lemma 2.2 (Lemma 2.1 (ii) in (Feldman & Schulman, 2012)).** Let $(\mathcal{X},D)$ and $r>0$ be as defined in Definition 2.1. Let $\rho=\max\{2^{r-1},1\}$. Then the function $D$ satisfies the weak triangle inequality for singletons, i.e., for every $p,q,c\in\mathcal{X}$,
$$D(p,q)\le\rho\big(D(p,c)+D(c,q)\big).$$

*Table 1. Example loss functions as in Definition 2.1. Let $\delta>0$ be a constant and let $(\mathcal{X},\mathrm{dist})$ be the metric space where $\mathcal{X}=\mathbb{R}^d$ and $\mathrm{dist}(p,c)=\lVert p-c\rVert$ for every $p,c\in\mathbb{R}^d$.*

| Optimization problem | $\mathrm{lip}(x)$ | $D(P,C)$ | $\rho$ |
| --- | --- | --- | --- |
| sets-$k$-median | $x$ | $\min_{p\in P,c\in C}\lVert p-c\rVert$ | $1$ |
| sets-$k$-means | $x^2$ | $\min_{p\in P,c\in C}\lVert p-c\rVert^2$ | $2$ |
| sets-$k$-means with M-estimators | $\frac{1}{2}x^2$ if $x\le\delta$, else $\delta(\lvert x\rvert-\frac{1}{2}\delta)$ | $\min_{p\in P,c\in C}\mathrm{lip}(\lVert p-c\rVert)$ with the Huber $\mathrm{lip}$ on the left | $2$ |
| $\ell_\psi$ norm | $x^\psi$ | $\min_{p\in P,c\in C}\lVert p-c\rVert^\psi$ | $\max\{2^{\psi-1},1\}$ |

**Notation.** For the rest of the paper we denote $[n]=\{1,\ldots,n\}$ for an integer $n\ge 1$. Unless otherwise stated, let $(\mathcal{X},D)$ be as in Definition 2.1. As discussed in Section 1, the input for the sets clustering problem is a set of finite and equal-sized sets, as follows.

**Definition 2.3 ($(n,m)$-set).** An $m$-set $P$ is a set of $m$ distinct points in $\mathcal{X}$, i.e., $P\subseteq\mathcal{X}$ and $|P|=m$. An $(n,m)$-set is a set $\mathcal{P}\subseteq\{P \mid P\subseteq\mathcal{X},\,|P|=m\}$ such that $|\mathcal{P}|=n$.

In what follows we define the notion of robust approximation. Informally, a robust median for the optimization problem at hand is an element $b$ that approximates the optimal value of this optimization problem, with some leeway on the number of input elements considered.

**Definition 2.4 (Robust approximation).** Let $\mathcal{P}$ be an $(n,m)$-set, $\gamma\in(0,\frac{1}{2}]$, $\tau\in(0,1/10)$, and $\alpha\ge 1$. Let $(\mathcal{X},D)$ be as in Definition 2.1. For every $C\in\mathcal{X}^k$, we define $\mathrm{closest}(\mathcal{P},C,\gamma)$ to be a set that is the union of the $\lceil\gamma|\mathcal{P}|\rceil$ sets $P\in\mathcal{P}$ with the smallest values of $D(P,C)$, i.e.,
$$\mathrm{closest}(\mathcal{P},C,\gamma)\in\arg\min_{\mathcal{Q}\subseteq\mathcal{P}:\,|\mathcal{Q}|=\lceil\gamma|\mathcal{P}|\rceil}\;\sum_{P\in\mathcal{Q}}D(P,C).$$
The singleton $\{b\}\in\mathcal{X}^1$ is a $(\gamma,\tau,\alpha)$-median for $\mathcal{P}$ if
$$\sum_{P\in\mathrm{closest}(\mathcal{P},\{b\},(1-\tau)\gamma)}D(P,b)\;\le\;\alpha\cdot\min_{b'\in\mathcal{X}}\;\sum_{P\in\mathrm{closest}(\mathcal{P},\{b'\},\gamma)}D(P,b').$$

Given an $m$-set $P$ and a set $B$ of $|B|=j\le m$ points, in what follows we define the projection of $P$ onto $B$ to be the set $P$ after replacing $j$ of its points, those which are closest to the points of $B$, by the points of $B$. We denote by $\mathrm{proj}(P,B)$ the remaining non-projected points of $P$.

**Definition 2.5 (Set projection).** Let $m\ge 1$ be an integer, $P$ be an $m$-set, $(\mathcal{X},D)$ be as in Definition 2.1, $j\in[m]$, and let $B=\{b_1,\ldots,b_j\}\in\mathcal{X}^j$. Let $p_1\in P$ denote the closest point to $b_1$, i.e., $p_1\in\arg\min_{p\in P}D(p,b_1)$. For every integer $i\in\{2,\ldots,j\}$, recursively define $p_i\in P$ to be the closest point to $b_i$, excluding the $i-1$ points that were already chosen, i.e.,
$$p_i\in\arg\min_{p\in P\setminus\{p_1,\ldots,p_{i-1}\}}D(p,b_i).$$
We denote (i) $\mathrm{closepairs}(P,B):=\{(p_1,b_1),\ldots,(p_j,b_j)\}$, (ii) the $m-j$ points of $P$ that are not among the closest points to $B$ by $\mathrm{proj}(P,B):=P\setminus\{p_1,\ldots,p_j\}$, and (iii) the projection of $P$ onto $B$ by $T(P,B):=\{b_1,\ldots,b_j\}\cup\big(P\setminus\{p_1,\ldots,p_j\}\big)$. For $B=\emptyset$, we define $\mathrm{proj}(P,\emptyset)=T(P,\emptyset)=P$.
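The following is a minimal sketch of Definition 2.5 under the Euclidean metric: greedily match $b_1,\ldots,b_j$ to distinct closest points of $P$, then read off closepairs, proj, and $T$. This is an illustrative reading of the definition, not the authors' code.

```python
# A minimal sketch of Definition 2.5 (set projection).
import numpy as np

def set_projection(P, B):
    P = [np.asarray(p, float) for p in P]
    B = [np.asarray(b, float) for b in B]
    remaining = list(range(len(P)))       # indices of not-yet-matched points of P
    pairs = []                            # closepairs(P, B)
    for b in B:
        i = min(remaining, key=lambda idx: np.linalg.norm(P[idx] - b))
        remaining.remove(i)
        pairs.append((P[i], b))
    proj = [P[i] for i in remaining]      # proj(P, B): the non-projected points
    T = list(B) + proj                    # T(P, B): B plus the non-projected points
    return pairs, proj, T

# Example: a 3-set in the plane projected onto j = 2 points.
P = [(0, 0), (1, 0), (5, 5)]
B = [(0.9, 0.1), (4, 4)]
pairs, proj, T = set_projection(P, B)
print(proj)   # [array([0., 0.])] -- the single non-projected point
```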
## 3. Sensitivity Based Coreset

A common technique to compute coresets is non-uniform sampling, also called sensitivity sampling (Langberg & Schulman, 2010; Braverman et al., 2016), which has been widely used lately to construct coresets for machine learning problems; see, e.g., (Huggins et al., 2016; Munteanu et al., 2018; Maalouf et al., 2019b; Bachem et al., 2017a). Intuitively, the sensitivity of an element $P\in\mathcal{P}$ represents the importance of $P$ with respect to the other elements and the specific optimization problem at hand; see the definition and details in Theorem 3.1. Suppose that we computed an upper bound $s(P)$ on the sensitivity of every element $P\in\mathcal{P}$. Then a coreset is simply a random (sub)sample of $\mathcal{P}$ according to the sensitivity distribution, followed by a smart reweighting of the sampled elements. Its size is proportional to the sum of sensitivities $t=\sum_{Q\in\mathcal{P}}s(Q)$ and to the combinatorial complexity $d'$ of the problem at hand; see Definition A.2. The following theorem, which is a restatement of Theorem 5.5 in (Braverman et al., 2016), provides the full details.

**Theorem 3.1.** Let $\mathcal{P}$ be an $(n,m)$-set, and let $(D,\mathcal{X}^k)$ be as in Definition 2.1. For every $P\in\mathcal{P}$ define the sensitivity of $P$ as
$$\sup_{C}\;\frac{D(P,C)}{\sum_{Q\in\mathcal{P}}D(Q,C)},$$
where the sup is over every $C\in\mathcal{X}^k$ such that the denominator is non-zero. Let $s:\mathcal{P}\to[0,1]$ be a function such that $s(P)$ is an upper bound on the sensitivity of $P$. Let $t=\sum_{P\in\mathcal{P}}s(P)$ and let $d'$ be the complexity measure of the sets clustering problem; see Definition A.2. Let $c\ge 1$ be a sufficiently large constant, $\varepsilon,\delta\in(0,1)$, and let $\mathcal{S}$ be a random sample of
$$|\mathcal{S}|\;\ge\;\frac{ct}{\varepsilon^2}\Big(d'\log t+\log\frac{1}{\delta}\Big)$$
sets from $\mathcal{P}$, such that $P$ is sampled with probability $s(P)/t$ for every $P\in\mathcal{P}$. Let $v(P)=\frac{t}{s(P)\,|\mathcal{S}|}$ for every $P\in\mathcal{S}$. Then, with probability at least $1-\delta$, $(\mathcal{S},v)$ is an $\varepsilon$-coreset for $(\mathcal{P},\mathcal{X}^k,D)$.
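The sampling and reweighting step of Theorem 3.1 is mechanical once sensitivity bounds are available. Below is a minimal, illustrative sketch of that step; the constant and the dimension $d'$ that determine the sample size are left to the caller and are not taken from the paper's code.

```python
# A minimal sketch of the sampling step in Theorem 3.1: draw an i.i.d. sample
# proportional to the sensitivity bounds s(P), then reweight each picked set
# by t / (s(P) * |S|).
import numpy as np

def sensitivity_sample(P, s, size, rng=np.random.default_rng(0)):
    """P: list of m-sets; s: array of sensitivity upper bounds; size: sample size."""
    s = np.asarray(s, dtype=float)
    t = s.sum()
    prob = s / t
    idx = rng.choice(len(P), size=size, replace=True, p=prob)  # i.i.d. sampling
    coreset = [P[i] for i in idx]
    weights = t / (s[idx] * size)   # v(P) = t / (s(P) * |S|)
    return coreset, weights
```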
## 4. Coreset for Sets Clustering

In this section we give our main algorithms, which compute a coreset for the sets clustering problem, along with intuition. Full theoretical proofs can be found in the appendix.

### 4.1. Algorithms

**Overview and intuition behind Algorithm 1.** Given a set $\mathcal{P}$ of $m$-sets and an integer $k\ge 1$, Algorithm 1 aims to compute a set $\mathcal{P}_m\subseteq\mathcal{P}$ of similar $m$-sets, which are all equally important for the problem at hand; see Lemma 4.1. At the $i$th iteration we wish to find a $\frac{1}{4k}$ fraction of the remaining $m$-sets $\mathcal{P}_{i-1}$ which are similar, in the sense that there is a dense ball of small radius that contains at least one point from each of those sets. To do so, we first compute at Line 5 the set $\hat{\mathcal{P}}_{i-1}$, which contains only the non-projected points of each $m$-set in $\mathcal{P}_{i-1}$. We then compute at Line 6 a median $b_i$ that satisfies at least a $\frac{1}{4k}$ fraction of $\hat{\mathcal{P}}_{i-1}$; $b_i$ is the center of the desired dense ball. At Line 7 we pick the sets that indeed have a candidate inside this dense ball and continue to the next iteration (where, again, we consider only the non-projected part of those sets); see Fig. 3. After $m$ such iterations, the surviving $m$-sets in $\mathcal{P}_m$ have been recursively similar throughout all the iterations.

**Overview and intuition behind Algorithm 2.** Given an $(n,m)$-set $\mathcal{P}$ and an integer $k\ge 1$, Algorithm 2 aims to compute an $\varepsilon$-coreset $(\mathcal{S},v)$ for $\mathcal{P}$; see Theorem 4.2. Algorithm 2 applies our onion sampling scheme; each while-loop iteration at Line 6 corresponds to one peeling iteration. At Lines 6-14, Algorithm 2 first calls Algorithm 1 with the $(n,m)$-set $\mathcal{P}_0=\mathcal{P}$ to obtain a set $\mathcal{P}_m\subseteq\mathcal{P}$ of dense and equally (un)important $m$-sets from the input. Second, it assigns all the sets in $\mathcal{P}_m$ the same sensitivity value, as justified by Lemma 4.1. It then peels those sets off, and repeats this process with $\mathcal{P}_0\setminus\mathcal{P}_m$. Those sensitivity values increase in every step, since the size of the dense set returned decreases, making every remaining set more important. This process is illustrated in Fig. 3. We then randomly sample a sufficiently large set $\mathcal{S}\subseteq\mathcal{P}$ at Line 17 according to the sensitivity values, and assign a new weight $v(P)$ to every set $P\in\mathcal{S}$ at Line 19.

*Figure 3. Recursive similarity. (Upper left): An input $(n,m)$-set $\mathcal{P}$ with $n=14$ and $m=5$. (Lower left): The set $B_i$ (red stars) for $i=3$ from the 3rd iteration of Algorithm 1, and the projection $T(\mathcal{P}_3,B_3)$ (blue snakes) of $\mathcal{P}_3$ onto $B_3$. Therefore, all the sets in $T(\mathcal{P}_3,B_3)$ have $i=3$ points in common, and $m-i=2$ other points. (Upper mid): The set $\hat{\mathcal{P}}_3$, which contains four 2-sets (blue points right of the dashed line). The median $b_4$ (bold red star) considers only a fraction of $\hat{\mathcal{P}}_3$. The red circle contains the points that are closest to $b_4$. $\mathcal{P}_4$ contains the sets with a representative inside the red circle. (Lower mid): The projection $T(\mathcal{P}_4,B_4)$ (blue snakes) of the sets $\mathcal{P}_4$ onto the new $B_4=B_3\cup\{b_4\}$. (Right): Onion sampling. A set $\mathcal{P}$ of pairs in the plane ($m=d=2$) along with the sensitivity values $s(P)$ computed in Algorithm 2 via our onion sampling. First, the densest subset of pairs is assigned a low sensitivity value (dark blue). The densest subset of the remaining pairs is then assigned a higher sensitivity value (light blue), and so on. The scattered pairs that remain at the end are assigned the highest sensitivity (dark red). The size of the subset found decreases in each step.*

**Algorithm 1** RECURSIVE-ROBUST-MEDIAN($\mathcal{P}, k$)

1. **Input:** An $(n,m)$-set $\mathcal{P}$ and a positive integer $k$.
2. **Output:** A pair $(\mathcal{P}_m,B_m)$ where $\mathcal{P}_m\subseteq\mathcal{P}$ and $B_m\in\mathcal{X}^m$; see Lemma 4.1.
3. $\mathcal{P}_0:=\mathcal{P}$ and $B_0:=\emptyset$
4. **for** $i:=1$ **to** $m$ **do**
5. $\hat{\mathcal{P}}_{i-1}:=\{\mathrm{proj}(P,B_{i-1}) \mid P\in\mathcal{P}_{i-1}\}$ {see Definition 2.5}
6. Compute a $\big(\frac{1}{2k},\tau,2\big)$-median $b_i\in\mathcal{X}^1$ for $\hat{\mathcal{P}}_{i-1}$, for some $\tau\in\big(0,\frac{1}{20}\big)$ {see Definition 2.4; Algorithm 3 provides a suggested implementation}
7. $\mathcal{P}_i:=\big\{P\in\mathcal{P}_{i-1} \mid \mathrm{proj}(P,B_{i-1})\in\mathrm{closest}\big(\hat{\mathcal{P}}_{i-1},\{b_i\},\frac{1-\tau}{4k}\big)\big\}$ {$\mathcal{P}_i$ contains every $m$-set $P$ such that $\mathrm{proj}(P,B_{i-1})$ is in the closest $(1-\tau)/(4k)$ fraction of sets in $\hat{\mathcal{P}}_{i-1}$ to the center $b_i$; see Fig. 3}
8. $B_i:=B_{i-1}\cup\{b_i\}$
9. **end for**
10. **Return** $(\mathcal{P}_m,B_m)$

**Algorithm 2** CORESET($\mathcal{P}, k, \varepsilon, \delta$)

1. **Input:** An $(n,m)$-set $\mathcal{P}$, a positive integer $k$, an error parameter $\varepsilon\in(0,1)$, and a probability of failure $\delta\in(0,1)$.
2. **Output:** An $\varepsilon$-coreset $(\mathcal{S},v)$; see Theorem 4.2.
3. $b:=$ a constant determined by the proof of Theorem 4.2
4. $d':=md^2k^2$ {the dimension of $(\mathcal{P},\mathcal{X}^k,D)$}
5. $\mathcal{P}_0:=\mathcal{P}$
6. **while** $|\mathcal{P}_0|>b$ **do**
7. $(\mathcal{P}_m,B_m):=$ RECURSIVE-ROBUST-MEDIAN($\mathcal{P}_0,k$)
8. **for** every $P\in\mathcal{P}_m$ **do**
9. $s(P):=\frac{b}{|\mathcal{P}_m|}$
10. **end for**
11. $\mathcal{P}_0:=\mathcal{P}_0\setminus\mathcal{P}_m$
12. **end while**
13. **for** every $P\in\mathcal{P}_0$ **do**
14. $s(P):=1$
15. **end for**
16. $t:=\sum_{P\in\mathcal{P}}s(P)$
17. Pick a sample $\mathcal{S}$ of $\frac{bt}{\varepsilon^2}\big(d'\log t+\log\frac{1}{\delta}\big)$ $m$-sets from $\mathcal{P}$ by repeatedly, i.i.d., selecting $P\in\mathcal{P}$ with probability $\frac{s(P)}{t}$
18. **for** each $P\in\mathcal{S}$ **do**
19. $v(P):=\frac{t}{|\mathcal{S}|\cdot s(P)}$
20. **end for**
21. **Return** $(\mathcal{S},v)$

### 4.2. Main Theorems

The following lemma lies at the heart of our work. It proves that Algorithm 1 helps compute an upper bound on the sensitivity of some of the input elements.

**Lemma 4.1.** Let $\mathcal{P}$ be an $(n,m)$-set, $k\ge 1$ be an integer, and $(\mathcal{X},D)$ be as in Definition 2.1. Let $(\mathcal{P}_m,B_m)$ be the output of a call to RECURSIVE-ROBUST-MEDIAN($\mathcal{P},k$); see Algorithm 1. Then, for every $P\in\mathcal{P}_m$ we have
$$\sup_{C\in\mathcal{X}^k}\;\frac{D(P,C)}{\sum_{Q\in\mathcal{P}}D(Q,C)}\;\le\;O(1)\cdot\frac{1}{|\mathcal{P}_m|}.$$

The following theorem is our main technical contribution. It proves that Algorithm 2 indeed computes an $\varepsilon$-coreset.

**Theorem 4.2.** Let $\mathcal{P}$ be an $(n,m)$-set, $k\ge 1$ be an integer, $(\mathcal{X},D)$ be as in Definition 2.1, and $\varepsilon,\delta\in(0,1)$. Let $(\mathcal{S},v)$ be the output of a call to CORESET($\mathcal{P},k,\varepsilon,\delta$). Then:
(i) $|\mathcal{S}|\in O\!\big(\frac{m\,d'\log n}{\varepsilon^2}\big(\log n+\log\frac{1}{\delta}\big)\big)$, which is $O(\log^2 n)$ for constant $d,k,m,\varepsilon$ and $\delta$.
(ii) With probability at least $1-\delta$, $(\mathcal{S},v)$ is an $\varepsilon$-coreset for $(\mathcal{P},\mathcal{X}^k,D)$; see Section 1.4.
(iii) $(\mathcal{S},v)$ can be computed in $O(n\log(n)\,k^m)$ time.
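For intuition only, here is a simplified Python skeleton of the peeling loop of Algorithm 2; `recursive_robust_median` is a stub standing in for Algorithm 1, and the constant `b` and the sample size are illustrative rather than the values dictated by Theorem 4.2.

```python
# A simplified, illustrative skeleton of the onion-sampling loop of Algorithm 2.
import numpy as np

def coreset(P, k, size, recursive_robust_median, b=10, rng=np.random.default_rng(0)):
    remaining = list(P)
    s = {}                                   # sensitivity upper bound per input set
    while len(remaining) > b:
        dense, _ = recursive_robust_median(remaining, k)   # Algorithm 1: one onion layer
        if not dense:
            break                            # guard for the stub
        for S in dense:
            s[id(S)] = b / len(dense)        # equally (un)important within the layer
        dense_ids = {id(Q) for Q in dense}
        remaining = [S for S in remaining if id(S) not in dense_ids]  # peel the layer off
    for S in remaining:                      # the few sets that survive all peelings
        s[id(S)] = 1.0
    sets = list(P)
    sens = np.array([s[id(S)] for S in sets])
    t = sens.sum()
    idx = rng.choice(len(sets), size=size, replace=True, p=sens / t)
    return [sets[i] for i in idx], t / (sens[idx] * size)   # sample and reweight
```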
### 4.3. Coreset Lower Bound

In this section we provide a lower bound on the coreset size for a sets clustering problem. This is done by constructing an $(n,m)$-set $\mathcal{P}$ in $\mathbb{R}$ for which every $P\in\mathcal{P}$ has a sensitivity of $1$; see Section 3 and Theorem 3.1.

**Theorem 4.3.** Let $k,m\ge 1$ be integers. Let $\mathcal{P}$ be the $(n,m)$-set in $\mathbb{R}$ containing all the possible $n=\binom{m+k}{m}$ subsets of $m$ points of $[m+k]=\{1,2,\ldots,m+k\}$, and let $(\mathcal{S},v)$ be an $\varepsilon$-coreset for $\mathcal{P}$. Then $|\mathcal{S}|\ge\big(\frac{k}{m}\big)^m$.

*Proof.* By its construction, every $P\in\mathcal{P}$ contains $|P|=m$ reals among the $m+k$ reals of $[m+k]$. Hence, every $P\in\mathcal{P}$ has a corresponding query $C=[m+k]\setminus P\subseteq\mathbb{R}$ (the remaining $k$ reals in $[m+k]$) that yields a non-zero distance $D(P,C)>0$ to this set $P$, and zero distance $D(Q,C)=0$ to every other set $Q\in\mathcal{P}$. Hence, any coreset must contain all the $\binom{m+k}{m}$ sets of $\mathcal{P}$; otherwise, the cost of the coreset $(\mathcal{S},v)$ might be $0$ while the cost of the original data is non-zero, which yields an infinite multiplicative error for the coreset. Therefore, the coreset size $|\mathcal{S}|$ for this sets clustering instance satisfies
$$|\mathcal{S}|=\binom{m+k}{m}\ge\Big(\frac{m+k}{m}\Big)^m\ge\Big(\frac{k}{m}\Big)^m,$$
where the middle inequality is the standard binomial bound. $\square$

### 4.4. Polynomial Time Approximation Scheme

In the following theorem we present a reduction from an $\alpha$-approximation for the sets clustering problem in $\mathbb{R}^d$ with $m,k\ge 1$ to an $\alpha$-approximation for the simplest case where $m=k=1$, for any $\alpha\ge 1$. We give a suggested implementation in Algorithm 4.

**Theorem 4.4.** Let $\mathcal{P}$ be an $(n,m)$-set in $\mathbb{R}^d$, $w:\mathcal{P}\to[0,\infty)$ be a weights function, $k\ge 1$ be an integer, $\alpha\ge 1$ and $\delta\in[0,1)$. Let $D$ be a loss function as in Definition 2.1 for $\mathcal{X}=\mathbb{R}^d$. Let ALG be an algorithm that solves the case $k=m=1$, i.e., it takes as input a set $Q\subseteq\mathcal{X}$, a weights function $u:Q\to[0,\infty)$ and a failure probability $\delta$, and in time $T(n)$ outputs $\hat{c}\in\mathcal{X}$ that with probability at least $1-\delta$ satisfies
$$\sum_{q\in Q}u(q)\,D(q,\hat{c})\;\le\;\alpha\min_{c\in\mathcal{X}}\sum_{q\in Q}u(q)\,D(q,c).$$
Then in $T(n)\cdot(nmk)^{O(dk)}$ time we can compute $\hat{C}\in\mathcal{X}^k$ such that with probability at least $1-k\delta$ we have
$$\sum_{P\in\mathcal{P}}w(P)\,D(P,\hat{C})\;\le\;\alpha\min_{C\in\mathcal{X}^k}\sum_{P\in\mathcal{P}}w(P)\,D(P,C).$$

The previous theorem implies a polynomial time (optimal) solution for the sets-$k$-means, since it is trivial to compute an optimal solution for the case of $m=k=1$.

**Corollary 4.5 (PTAS for sets-$k$-means).** Let $\mathcal{P}$ be an $(n,m)$-set, $k\ge 1$ be an integer, and put $\varepsilon\in\big(0,\frac{1}{2}\big)$ and $\delta\in(0,1)$. Let OPT be the cost of the sets-$k$-means. Then in
$$n\log(n)\,k^m+\Big(\frac{d\,m\,k^m\log n}{\varepsilon}\Big)^{O(dk)}$$
time we can compute $\hat{C}\in\mathcal{X}^k$ such that with probability at least $1-k\delta$,
$$\sum_{P\in\mathcal{P}}\min_{p\in P,\,c\in\hat{C}}\lVert p-c\rVert^2\;\le\;(1+4\varepsilon)\,\mathrm{OPT}.$$

## 5. Robust Median

In this section we provide an algorithm that computes a robust approximation; see Definition 2.4 and its preceding paragraph. An overview is provided in Section D.

**Algorithm 3** MEDIAN($\mathcal{P}, k, \delta$)

1. **Input:** An $(n,m)$-set $\mathcal{P}$, a positive integer $k\ge 1$, and a probability of failure $\delta\in(0,1)$.
2. **Output:** A point $q\in\mathcal{X}$ that satisfies Lemma 5.1.
3. $b:=$ a universal constant that can be determined from the proof of Lemma 5.1
4. Pick a random sample $S$ of $|S|=b\,k^2\log\frac{1}{\delta}$ sets from $\mathcal{P}$
5. $q:=$ a point that minimizes $\sum_{P\in\mathrm{closest}(S,\{\hat{q}\},(1-\tau)\gamma)}D(P,\hat{q})$ over every $\hat{q}\in\bigcup_{Q\in S}Q$
6. **Return** $q$

**Lemma 5.1 (based on Lemma 9.6 in (Feldman & Langberg, 2011)).** Let $\mathcal{P}$ be an $(n,m)$-set, $k\ge 1$, $\delta\in(0,1)$ and $(\mathcal{X},D)$ be as in Definition 2.1. Let $q\in\mathcal{X}$ be the output of a call to MEDIAN($\mathcal{P},k,\delta$); see Algorithm 3. Then with probability at least $1-\delta$, $q$ is a $(1/(2k),1/6,2)$-median for $\mathcal{P}$; see Definition 2.4. Furthermore, $q$ can be computed in $O\big(t\,b^2k^4\log^2\frac{1}{\delta}\big)$ time, where $t$ is the time it takes to compute $D(P,Q)$ for $P,Q\in\mathcal{P}$.
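Below is a minimal sketch in the spirit of Algorithm 3: sample a few input sets and return the candidate point, among all points of the sampled sets, that minimizes the cost over its closest fraction of the sample. The sample size and the parameters `gamma`, `tau` are illustrative defaults; Lemma 5.1 ties them to $k$ and $\delta$.

```python
# An illustrative sketch of a robust median computation over m-sets.
import numpy as np

def set_loss(P, q):
    """D(P, {q}) for the sets-k-means loss: min over p in P of ||p - q||^2."""
    P = np.asarray(P, float)
    return ((P - np.asarray(q, float)) ** 2).sum(axis=1).min()

def median(P, sample_size, gamma=0.25, tau=1/6, rng=np.random.default_rng(0)):
    idx = rng.choice(len(P), size=min(sample_size, len(P)), replace=False)
    sample = [P[i] for i in idx]
    keep = max(1, int((1 - tau) * gamma * len(sample)))   # closest(S, {q}, (1-tau)*gamma)
    best_q, best_cost = None, np.inf
    for S in sample:                      # candidates: every point of every sampled set
        for q in np.asarray(S, float):
            costs = np.sort([set_loss(T, q) for T in sample])
            cost = costs[:keep].sum()     # cost over the closest fraction of the sample
            if cost < best_cost:
                best_q, best_cost = q, cost
    return best_q
```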
## 6. Experimental Results

We implemented our coreset construction, as well as different sets-$k$-means solvers. In this section we evaluate their empirical performance. Open source code for future research can be downloaded from (Jubran et al., 2020).

*Figure 4. Experimental results. Exact details are provided in Section 6. Panels: (a)-(d) dataset (i) with $k=2,4,6,8$; (e)-(h) dataset (ii) with $k=4,6,8,10$; (i) optimal sets-mean; (j) $\mathcal{P}:=$ dataset (i); (k) $\mathcal{P}:=$ dataset (ii); (l) $r=10^6$.*

**Theory-implementation gaps.** While Corollary 4.5 suggests a polynomial time solution for sets-$k$-means in $\mathbb{R}^d$, it is impractical even for $k=1$, $d=2$ and $n=10$ sets. Moreover, its implementation seems extremely complicated and numerically unstable. Instead, we suggest a simple algorithm, exact-mean, for computing the sets-mean; see Fig. 6 in Section E. Its main components are the Voronoi diagram (Aurenhammer, 1991) and a hyperplane arrangement, which were implemented in Sage (The Sage Developers, 2020). For $k\ge 2$ we use an Expectation-Maximization (EM) heuristic, which is a generalization of the well-known Lloyd algorithm (Lloyd, 1982), as commonly used in $k$-means and its variants (Marom & Feldman, 2019; Lucic et al., 2017).

**Implementations.** Four algorithms were implemented:

(i) our-coreset($\mathcal{P},\sigma$): the coreset construction from Algorithm 2 for a given $(n,m)$-set $\mathcal{P}$, an arbitrary given loss function $D$ that satisfies Definition 2.1, and a sample size of $|\mathcal{S}|=\sigma$ at Line 17 of Algorithm 2. There is no epsilon or delta in the code. As is common in the coreset literature, our actual code gets a number $s$ of points and then samples $s$ input elements (sets) based on the suggested novel distribution, i.e., the sensitivities (which, unlike the sample size in the theoretical bound, are independent of $\varepsilon$ and $\delta$). We then measure the empirical error (epsilon) and variance (a kind of delta) over increasing values of $s$.

(ii) uniform-sampling($\mathcal{P},\sigma$): outputs a uniform random sample $\mathcal{S}\subseteq\mathcal{P}$ of size $|\mathcal{S}|=\sigma$.

(iii) exact-mean($\mathcal{P}$): returns the exact (optimal) sets-mean ($k=1$) of a given set $\mathcal{P}$ of sets in $\mathbb{R}^d$, as in the previous paragraph.

(iv) k-means($\mathcal{P},k$): a generalization of the Lloyd $k$-means heuristic (Lloyd, 1982) that aims to compute the sets-$k$-means of $\mathcal{P}$ via EM; see implementation details in Section E, and a simplified sketch after this list.
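A simplified sketch of an EM (Lloyd-style) heuristic in the spirit of Algorithm (iv): the E-step picks, for every input set, its closest (point, center) pair, and the M-step moves each center to the (weighted) mean of the points assigned to it. Initialization, stopping rule, and all names are simplistic assumptions, not the paper's implementation.

```python
# An illustrative EM / Lloyd-style heuristic for the sets-k-means objective.
import numpy as np

def sets_k_means_em(P, k, weights=None, iters=50, rng=np.random.default_rng(0)):
    pts = np.vstack([np.asarray(S, float) for S in P])
    C = pts[rng.choice(len(pts), k, replace=False)]          # naive initialization
    w = np.ones(len(P)) if weights is None else np.asarray(weights, float)
    for _ in range(iters):
        assign = []                                           # (chosen point, center index) per set
        for S in P:
            S = np.asarray(S, float)
            d2 = ((S[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
            i, j = np.unravel_index(d2.argmin(), d2.shape)    # E-step: closest (p, c) pair
            assign.append((S[i], j))
        for j in range(k):                                    # M-step: recompute each center
            members = [(p, w[t]) for t, (p, jj) in enumerate(assign) if jj == j]
            if members:
                pw = np.array([wt for _, wt in members])
                C[j] = np.average([p for p, _ in members], axis=0, weights=pw)
    return C
```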
*Table 2. Comparison between the sets-$k$-means and the regular $k$-means.*

| Method | $k$ | Data size | Cost |
| --- | --- | --- | --- |
| k-means | 2 | 100 | 303 |
| Sets-k-means | 2 | 100 | 5 |
| k-means | 4 | 100 | 274 |
| Sets-k-means | 4 | 100 | 4 |
| k-means | 6 | 100 | 237 |
| Sets-k-means | 6 | 100 | 3 |
| k-means | 2 | 1000 | 299 |
| Sets-k-means | 2 | 1000 | 40 |
| k-means | 4 | 1000 | 268 |
| Sets-k-means | 4 | 1000 | 33 |
| k-means | 6 | 1000 | 220 |
| Sets-k-means | 6 | 1000 | 26 |

**Software/Hardware.** The algorithms were implemented in Python 3.7.3 using Sage 9.0 (The Sage Developers, 2020), as explained above, on a Lenovo Z70 laptop with an Intel i7-5500U CPU @ 2.40GHz and 16GB RAM.

**Datasets.** (i): The LEHD Origin-Destination Employment Statistics (LODES) (lod). It contains information about people that live and work in the United States. We picked a sample of $n=10{,}000$ people and their $m=2$ home+work addresses, called Residence+Workplace Census Block Code. Each address is converted to a pair $(x,y)$ of $d=2$ doubles. As in Fig. 1, our goal was to compute the sets-$k$-means (facilities) of these $n$ pairs of addresses.

(ii): The Reuters-21578 benchmark corpus (Bird et al., 2009). It contains $n=10{,}788$ records that correspond to $n$ Reuters newspapers. Each newspaper is represented as a "bag of bags of words" of its $m\in[3,15]$ paragraphs in a high-dimensional space; see Fig. 5. Handling sets of different sizes is also supported; see details in Section E. We reduce the dimension of the union of these ($3n$ to $15n$) vectors to $d=15$ using LSA (Landauer et al., 2013). The goal was to cluster those $n$ documents (sets of paragraphs) into $k$ topics; see Fig. 5.

(iii): Synthetic dataset. We drew a circle of radius $1$, centered at the origin of $\mathbb{R}^2$, and then picked $n_1=9900$ points evenly (uniformly) distributed on this circle. For each of these $n_1$ points, we paired a point in the same direction but at distance $30$ from the origin. This resulted in $n_1$ pairs of points. We repeated this for another circle of radius $1$ centered at $(r,0)$, for multiple values of $r$, and constructed $n_2=100$ pairs similarly; see the top of Fig. 4(l).

**Experiment (i).** We ran $S_1(\sigma):=$ our-coreset($\mathcal{P},\sigma$) and $S_2(\sigma):=$ uniform-sampling($\mathcal{P},\sigma$) on each of the datasets for different values of the sample size $\sigma$. Next, we computed the corresponding sets-$k$-means $C_1(\sigma)$, $C_2(\sigma)$ and $C_3$ heuristically using Algorithm (iv). We denote the corresponding computation times in seconds by $t_1(\sigma)$, $t_2(\sigma)$ and $t_3$, respectively. The corresponding costs of $C_1(\sigma)$ and $C_2(\sigma)$ were evaluated by computing the approximation error, for $i\in\{1,2\}$, as
$$\varepsilon_i(\sigma)=\frac{\big|\sum_{P\in\mathcal{P}}D(P,C_3)-\sum_{P\in\mathcal{P}}D(P,C_i(\sigma))\big|}{\sum_{P\in\mathcal{P}}D(P,C_3)}.$$

**Results (i).** The approximation errors on the pair of real-world datasets are shown in Fig. 4(a)-4(h). Fig. 4(j)-4(k) show the relative time $t_1(\sigma)/t_3$ (y-axis) as a function of $\varepsilon:=\varepsilon_1(\sigma)$ (x-axis), for $\sigma=20,30,\ldots,140$. The approximation errors for the synthetic dataset are shown either for increasing $\sigma$ in Fig. 4(l) or for different $r$ values in Fig. 4(m).

**Experiment (ii).** We uniformly sampled $800$ rows from the LEHD Dataset (i). Let $\mathcal{P}(\sigma)$ denote the first $\sigma$ points in this sample, for $\sigma=20,40,60,\ldots,800$. For each such set $\mathcal{P}(\sigma)$ we computed two coresets of different sizes, $S_1(\sigma):=$ our-coreset($\mathcal{P}(\sigma),\sigma/10$) and $S_2(\sigma):=$ our-coreset($\mathcal{P}(\sigma),\sigma/5$). We then applied Algorithm (iii), which computes the optimal sets-mean, to obtain $C_1(\sigma)$, $C_2(\sigma)$ and $C_3(\sigma)$ on $S_1(\sigma)$, $S_2(\sigma)$ and the full data $\mathcal{P}(\sigma)$, respectively.

*Figure 5. Bag of bags of words: a set of $n=4$ documents, each with $m=2$ paragraphs. Each paragraph is represented by a bag of words, and each document is represented by the set of the $m$ vectors of its paragraphs. The sets-2-means are presented (blue/red stars).*

**Results (ii).** Fig. 4(i) shows the cost of $C_i(\sigma)$ (y-axis) as a function of $\sigma$ (x-axis), for $i\in\{1,2,3\}$.

**Experiment (iii).** This experiment aims to compare the sets-$k$-means to the standard $k$-means. The input is a subset $\mathcal{P}$ of Dataset (ii) (once of size $100$, and once of size $1000$). The goal is to compute a solution ($k$ centers) for the sets-$k$-means problem. We compare two methods: (i) computing the sets-$k$-means solution as explained in Section 6, and (ii) computing the $k$-means solution on the flattened set $P'=\bigcup_{P_i\in\mathcal{P}}P_i$. The cost is the cost of the sets-$k$-means problem on the computed $k$ centers and the original input data $\mathcal{P}$.

**Results (iii).** The results are presented in Table 2.
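For completeness, here is a sketch of the evaluation protocol of Experiment (i): solve sets-$k$-means on a coreset and on a uniform sample of the same size (with some user-supplied solver, e.g., the EM sketch above) and report $\varepsilon_i(\sigma)$ against the solution computed on the full data. The function signatures are assumptions for illustration only.

```python
# An illustrative sketch of computing the approximation errors eps_i(sigma).
import numpy as np

def sets_cost(P, C):
    C = np.asarray(C, float)
    return sum(((np.asarray(S, float)[:, None, :] - C[None, :, :]) ** 2)
               .sum(axis=2).min() for S in P)

def approx_errors(P, k, sigma, solver, build_coreset, rng=np.random.default_rng(0)):
    S1, w1 = build_coreset(P, k, sigma)                      # our-coreset(P, sigma)
    S2 = [P[i] for i in rng.choice(len(P), sigma, False)]    # uniform-sampling(P, sigma)
    C3 = solver(P, None, k)                                  # reference solution on the full data
    base = sets_cost(P, C3)
    errs = []
    for S, w in ((S1, w1), (S2, None)):
        C = solver(S, w, k)
        errs.append(abs(sets_cost(P, C) - base) / base)      # eps_i(sigma)
    return errs
```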
**Discussion.** As is common in the coreset literature, we see that the approximation errors are significantly smaller than the pessimistic worst-case bounds. In all the experiments the coreset yields a smaller error compared to uniform sampling. When running exact algorithms on the coreset, the error is close to zero while the running time is reduced from hours to seconds, as shown in Fig. 4(i). The running time is faster by a factor of tens to hundreds using the coresets, at the price of an error between 1/64 and 1/2, as shown in Fig. 4(j)-4(k). Furthermore, Table 2 demonstrates the difference between the sets-$k$-means and the $k$-means problems for the task of clustering input sets.

## 7. Conclusions and Open Problems

This paper suggests coresets and near-linear time solutions for clustering of input sets, such as the sets-$k$-means. Natural open problems include relaxation to convex optimization, handling other distance functions between sets (e.g., max distance), and handling infinite sets/shapes (triangles, circles, etc.) and continuous distributions (e.g., $n$ Gaussians). We hope that this paper is only the first step in a long line of research that includes solutions to the above problems.

## Acknowledgements

This work is an outcome of a joint work with Samsung Research Israel. We would like to thank Matan Weksler and Benjamin Lastmann for their support and contribution to this paper.

## References

- U.S. Census Bureau. Longitudinal employer-household dynamics. URL https://lehd.ces.census.gov/data/lodes/LODES7/.
- Abboud, A., Cohen-Addad, V., and Houdrougé, H. Subquadratic high-dimensional hierarchical clustering. In Advances in Neural Information Processing Systems, pp. 11576-11586, 2019.
- Agarwal, P. K., Har-Peled, S., and Varadarajan, K. R. Geometric approximation via coresets. Combinatorial and Computational Geometry, 52:1-30, 2005.
- Ahmadian, S., Friggstad, Z., and Swamy, C. Local-search based approximation algorithms for mobile facility location problems. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1607-1621. SIAM, 2013.
- Anthony, M. and Bartlett, P. L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
- Arthur, D. and Vassilvitskii, S. k-means++: The advantages of careful seeding. Technical report, Stanford, 2006.
- Aurenhammer, F. Voronoi diagrams: a survey of a fundamental geometric data structure. ACM Computing Surveys (CSUR), 23(3):345-405, 1991.
- Bachem, O., Lucic, M., and Krause, A. Coresets for nonparametric estimation: the case of DP-means. In ICML, pp. 209-217, 2015.
- Bachem, O., Lucic, M., and Krause, A. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017a.
- Bachem, O., Lucic, M., and Lattanzi, S. One-shot coresets: The case of k-clustering. arXiv preprint arXiv:1711.09649, 2017b.
- Bachem, O., Lucic, M., and Krause, A. Scalable k-means clustering via lightweight coresets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1119-1127, 2018.
- Barger, A. and Feldman, D. k-means for streaming and distributed big sparse data. In Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 342-350. SIAM, 2016.
- Berkhin, P. Survey of clustering data mining techniques: Technical report. Accrue Software, 2002.
- Bird, S., Loper, E., and Klein, E. Natural Language Processing with Python. O'Reilly Media Inc., 2009.
- Blelloch, G. E. and Tangwongsan, K. Parallel approximation algorithms for facility-location problems. In Proceedings of the Twenty-Second Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp. 315-324, 2010.
- Braverman, V., Feldman, D., and Lang, H. New frameworks for offline and streaming coreset constructions. arXiv preprint arXiv:1612.00889, 2016.
- Chazelle, B., Edelsbrunner, H., Guibas, L. J., and Sharir, M.
A singly exponential stratification scheme for real semi-algebraic varieties and its applications. Theoretical Computer Science, 84(1):77-105, 1991.
- Chen, K. On k-median clustering in high dimensions. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1177-1185, 2006.
- Chen, K. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923-947, 2009.
- Cohen-Addad, V., Hjuler, N. O. D., Parotsidis, N., Saulpic, D., and Schwiegelshohn, C. Fully dynamic consistent facility location. In Advances in Neural Information Processing Systems, pp. 3250-3260, 2019.
- Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493-2537, 2011.
- Dunia, R., Qin, S. J., Edgar, T. F., and McAvoy, T. J. Identification of faulty sensors using principal component analysis. AIChE Journal, 42(10):2797-2812, 1996.
- Feldman, D. Core-sets: Updated survey. In Sampling Techniques for Supervised or Unsupervised Tasks, pp. 23-44. Springer, 2020.
- Feldman, D. and Langberg, M. A unified framework for approximating and clustering data. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pp. 569-578. ACM, 2011.
- Feldman, D. and Schulman, L. J. Data reduction for weighted and outlier-resistant clustering. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1343-1354. SIAM, 2012.
- Feldman, D., Xiang, C., Zhu, R., and Rus, D. Coresets for differentially private k-means clustering and applications to privacy in mobile sensor networks. In 2017 16th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), pp. 3-16. IEEE, 2017.
- Fichtenberger, H., Gillé, M., Schmidt, M., Schwiegelshohn, C., and Sohler, C. BICO: BIRCH meets coresets for k-means clustering. In European Symposium on Algorithms, pp. 481-492. Springer, 2013.
- Frahling, G. and Sohler, C. A fast k-means implementation using coresets. International Journal of Computational Geometry & Applications, 18(06):605-625, 2008.
- Har-Peled, S. and Mazumdar, S. On coresets for k-means and k-median clustering. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pp. 291-300, 2004.
- Hartigan, J. A. Clustering Algorithms. John Wiley & Sons, Inc., 1975.
- Huang, L., Jiang, S., Li, J., and Wu, X. Epsilon-coresets for clustering (with outliers) in doubling metrics. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pp. 814-825. IEEE, 2018.
- Huggins, J. H., Campbell, T., and Broderick, T. Coresets for scalable Bayesian logistic regression. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pp. 4087-4095, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.
- Inaba, M., Katoh, N., and Imai, H. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In Proceedings of the Tenth Annual Symposium on Computational Geometry, pp. 332-339, 1994.
- Jubran, I., Maalouf, A., and Feldman, D. Introduction to coresets: Accurate coresets. arXiv preprint arXiv:1910.08707, 2019.
- Jubran, I., Tukan, M., Maalouf, A., and Feldman, D. Open source code for all the algorithms presented in this paper, 2020. Link for open-source code.
- Landauer, T. K., McNamara, D. S., Dennis, S., and Kintsch, W. Handbook of Latent Semantic Analysis.
Psychology Press, 2013.
- Langberg, M. and Schulman, L. J. Universal ε-approximators for integrals. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 598-607. SIAM, 2010.
- Lee, E., Schmidt, M., and Wright, J. Improved and simplified inapproximability for k-means. Information Processing Letters, 120:40-43, 2017.
- Li, J., Wu, W., Wang, T., and Zhang, Y. One step beyond histograms: Image representation using Markov stationary features. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8. IEEE, 2008.
- Li, L.-J., Su, H., Fei-Fei, L., and Xing, E. P. Object bank: A high-level image representation for scene classification & semantic feature sparsification. In Advances in Neural Information Processing Systems, pp. 1378-1386, 2010.
- Liao, L., Fox, D., and Kautz, H. Location-based activity recognition. In Advances in Neural Information Processing Systems, pp. 787-794, 2006.
- Lloyd, S. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982.
- Lucic, M., Faulkner, M., Krause, A., and Feldman, D. Training Gaussian mixture models at scale via coresets. The Journal of Machine Learning Research, 18(1):5885-5909, 2017.
- Maalouf, A., Jubran, I., and Feldman, D. Fast and accurate least-mean-squares solvers. In Advances in Neural Information Processing Systems, pp. 8305-8316, 2019a.
- Maalouf, A., Statman, A., and Feldman, D. Tight sensitivity bounds for smaller coresets. arXiv preprint arXiv:1907.01433, 2019b.
- Marom, Y. and Feldman, D. k-means clustering of lines for big data. In Advances in Neural Information Processing Systems, pp. 12797-12806, 2019.
- Mladenic, D. Text-learning and related intelligent agents: a survey. IEEE Intelligent Systems and their Applications, 14(4):44-54, 1999.
- Munteanu, A., Sohler, C., and Feldman, D. Smallest enclosing ball for probabilistic data. In Proceedings of the Thirtieth Annual Symposium on Computational Geometry, pp. 214-223, 2014.
- Munteanu, A., Schwiegelshohn, C., Sohler, C., and Woodruff, D. On coresets for logistic regression. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 6561-6570. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7891-on-coresets-for-logistic-regression.pdf.
- Murtagh, F. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4):354-359, 1983.
- Nguyen, N. P., Dinh, T. N., Xuan, Y., and Thai, M. T. Adaptive algorithms for detecting community structure in dynamic social networks. In 2011 Proceedings IEEE INFOCOM, pp. 2282-2290. IEEE, 2011.
- Phillips, J. M. Coresets and sketches. arXiv preprint arXiv:1601.00617, 2016.
- Régin, J.-C., Rezgui, M., and Malapert, A. Embarrassingly parallel search. In International Conference on Principles and Practice of Constraint Programming, pp. 596-610. Springer, 2013.
- Schmidt, M., Schwiegelshohn, C., and Sohler, C. Fair coresets and streaming algorithms for fair k-means. In International Workshop on Approximation and Online Algorithms, pp. 232-251. Springer, 2019.
- Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
- Spanakis, G., Siolas, G., and Stafylopatis, A. Exploiting Wikipedia knowledge for conceptual hierarchical clustering of documents. The Computer Journal, 55(3):299-312, 2012.
- Srivastava, A., Joshi, S. H., Mio, W., and Liu, X.
Statistical shape analysis: Clustering, learning, and testing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4):590-602, 2005.
- Suciu, D., Olteanu, D., Ré, C., and Koch, C. Probabilistic databases. Synthesis Lectures on Data Management, 3(2):1-180, 2011.
- The Sage Developers. SageMath, the Sage Mathematics Software System (Version 9.0), 2020. https://www.sagemath.org.
- Toth, C. D., O'Rourke, J., and Goodman, J. E. Handbook of Discrete and Computational Geometry. Chapman and Hall/CRC, 2017.
- Tuytelaars, T., Mikolajczyk, K., et al. Local invariant feature detectors: a survey. Foundations and Trends in Computer Graphics and Vision, 3(3):177-280, 2008.
- Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Philip, S. Y., et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1-37, 2008.
- Xiao, X.-Y., Peng, W.-C., Hung, C.-C., and Lee, W.-C. Using sensorranks for in-network detection of faulty readings in wireless sensor networks. In Proceedings of the 6th ACM International Workshop on Data Engineering for Wireless and Mobile Access, pp. 1-8, 2007.