# High Dimensional Clustering with r-nets

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Georgia Avarikioti, Alain Ryser, Yuyi Wang, Roger Wattenhofer
ETH Zurich, Switzerland
{zetavar,aryser,yuwang,wattenhofer}@ethz.ch

Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Abstract

Clustering, a fundamental task in data science and machine learning, groups a set of objects in such a way that objects in the same cluster are closer to each other than to those in other clusters. In this paper, we consider a well-known structure, so-called r-nets, which rigorously captures the properties of clustering. We devise algorithms that improve the runtime of approximating r-nets in high-dimensional spaces with $\ell_1$ and $\ell_2$ metrics from $\tilde{O}(dn^{2-\Theta(\sqrt{\epsilon})})$ to $\tilde{O}(dn + n^{2-\alpha})$, where $\alpha = \Omega(\epsilon^{1/3}/\log(1/\epsilon))$. These algorithms are also used to improve a framework that provides approximate solutions to other high dimensional distance problems. Using this framework, several important related problems can also be solved efficiently, e.g., $(1+\epsilon)$-approximate kth-nearest neighbor distance, $(4+\epsilon)$-approximate Min-Max clustering, and $(4+\epsilon)$-approximate k-center clustering. In addition, we build an algorithm that $(1+\epsilon)$-approximates greedy permutations in time $\tilde{O}((dn + n^{2-\alpha}) \log \Phi)$, where $\Phi$ is the spread of the input. This algorithm is used to $(2+\epsilon)$-approximate k-center with the same time complexity.

## Introduction

Clustering aims at grouping together similar objects, where each object is often represented as a point in a high dimensional space. Clustering is considered to be a cornerstone problem in data science and machine learning, and as a consequence there exist multiple clustering variants. For instance, while each cluster may just be represented as a set of points, it is often advantageous to select one point of the data set as a prototype for each cluster. A significant formal representation of such a prototype clustering is known as r-nets. Given a large set of n data points in d-dimensional space, an r-net is a subset (the prototypes) of these data points. This subset needs to fulfill two properties: First, balls of radius r around each of the prototypes need to contain every point of the whole data set (covering). Second, we must ensure that the prototypes are well separated, i.e., no ball contains the center of any other ball (packing). Approximate r-nets lift the covering constraint a tiny bit by allowing balls to have a slightly larger radius than r, while preserving the packing property, i.e., any two prototypes still need to have at least distance r.

Throughout this paper, we assume data sets to be large and high dimensional. We therefore assume the number of features d of each object to be non-constant. This leads to interesting and important problems, as this assumption forces us to think about algorithms whose runtime is sub-exponential (preferably linear) in the number of features d. In addition, we want our runtime to be sub-quadratic in the size n of our data. In this paper we lay theoretical groundwork by showing improved algorithms for the approximate r-net problem and applications thereof.

## Related Work

There is not a unique best clustering criterion, hence many methods (Estivill-Castro 2002) are proposed to solve the clustering problem for different applications (e.g., (Sibson 1973; Defays 1977; Lloyd 1982; Kriegel et al.
2011)), which makes it difficult to systematically analyze clustering algorithms.

In our paper we will make use of so-called polynomial threshold functions (PTFs), a powerful tool developed by (Alman, Chan, and Williams 2016). PTFs are distributions of polynomials that can efficiently evaluate certain types of Boolean formulas with some probability. They are mainly used to solve problems in circuit theory, but were also used to develop new algorithms for other problems such as approximate all nearest neighbors or approximate minimum spanning tree in Hamming, ℓ1 and ℓ2 spaces. In the following, we employ this method to develop an algorithm that computes approximate r-nets.

The algorithmic framework Net & Prune was developed by (Har-Peled and Raichel 2015). It is able to solve so-called nice distance problems when provided with a suitable data structure for the problem. These data structures are often constructed by exploiting r-nets. A major drawback of the framework is its restriction to a constant number of features. Consequently, this framework was later extended by (Avarikioti et al. 2017) to also solve higher dimensional cases. The algorithm constructed in this paper yields an immediate improvement on this framework, as the construction of the framework is based around approximate r-nets. We also present several of the previously mentioned data structures that we plug into the framework to solve high dimensional distance optimization problems.

Recent work by (Eppstein, Har-Peled, and Sidiropoulos 2015) suggests a way of constructing approximate greedy permutations with approximate r-nets. Greedy permutations imply an ordering of the data which provides a solution to 2-approximate k-center clustering, as shown by (Gonzalez 1985). We present a similar construction by applying approximate greedy permutations. An approach to hierarchical clustering was presented by (Dasgupta 2002), who constructs an algorithm based on furthest first traversal, which essentially builds a greedy permutation and then traverses the permutation in order. (Fern and Brodley 2003) show how random projections can be applied in practice to reduce the dimension of given data. We later employ a similar approach, namely random projections to lines, to reduce the approximate r-net problem with ℓ1 metrics to a low-dimensional subspace.

## Our Contribution

This paper presents new theoretical results on the construction of $(1+\epsilon)$-approximate r-nets, improving the previous upper bound of $\tilde{O}(dn^{2-\Theta(\sqrt{\epsilon})})$ by (Avarikioti et al. 2017). We denote by n the number of data points, by d the dimension of the data, and set $\alpha = \Omega(\epsilon^{1/3}/\log(\frac{1}{\epsilon}))$ for an arbitrary error parameter ϵ. The algorithm builds approximate r-nets in Hamming, ℓ1 and ℓ2 spaces, running in $\tilde{O}(n^{2-\alpha}+n^{1.7+\alpha}d)$ time in both Hamming and Euclidean spaces (the $\tilde{O}$ notation throughout the paper hides logarithmic factors in $n$ and polynomial terms in $1/\epsilon$). We also modify our algorithm to yield an improvement on the Net & Prune framework of (Avarikioti et al. 2017). Supplying the framework with certain data structures, which are created using approximate r-nets, we derive new algorithms with improved runtime for the $(1+\epsilon)$-approximate k-smallest nearest neighbor distance, $(4+\epsilon)$-approximate Min-Max clustering, introduced in (Har-Peled and Raichel 2015), and $(4+\epsilon)$-approximate k-center clustering. These algorithms run in $\tilde{O}(dn + n^{2-\alpha})$ time for data sets in ℓ1 or ℓ2 spaces. With the exception of approximate k-smallest nearest neighbor distance, this is, to our knowledge, the first time this framework is used to solve these problems in high dimensional spaces.
We later also design a new algorithm to $(2+\epsilon)$-approximate k-center clustering, by deriving an improved version of the algorithm for $(1+\epsilon)$-approximate greedy permutations in (Eppstein, Har-Peled, and Sidiropoulos 2015). Both of these algorithms have a runtime of $\tilde{O}((dn + n^{2-\alpha}) \log \Phi)$, where $\Phi$ denotes the spread of the data, i.e., the ratio of the diameter to the smallest pairwise distance of the point set. The omitted proofs can be found in the full version of the paper (https://arxiv.org/abs/1811.02288).

## Approximate r-nets

In this section, we present an algorithm that builds approximate r-nets in ℓ1 and ℓ2 spaces. To that end, we first derive an algorithm that constructs approximate r-nets in Hamming space. We later show how to reduce the problem from ℓ1 or ℓ2 to Hamming space.

### Approximate r-net in Hamming Space

Building approximate r-nets in Euclidean space is computationally expensive. Therefore, we initially restrict ourselves to data points on the vertices of a high dimensional hypercube. The distance between any two data points is then measured by the Hamming distance. In the following, we define the notion of approximate r-nets in this metric space, where the error is additive instead of multiplicative.

**Definition 1.** Given a point set $X \subseteq \{0,1\}^d$, a radius $r \in \mathbb{R}$, an approximation parameter $\epsilon > 0$ and the Hamming distance, denoted by $\|\cdot\|_1$, an approximate r-net of $X$ with additive error $\epsilon$ is a subset $C \subseteq X$ such that the following properties hold:
1. (packing) For every $p, q \in C$, $p \neq q$, it holds that $\|p - q\|_1 \geq r$.
2. (covering) For every $p \in X$, there exists a $q \in C$ such that $\|p - q\|_1 \leq r + \epsilon d$ (additive error).

To construct approximate r-nets we employ probabilistic polynomial threshold functions, a tool introduced in (Alman, Chan, and Williams 2016). To effectively apply this technique, we require a sparse dataset, meaning that we assume that most of the points are further than r from each other. To that end, we present a technique that sparsifies the data in advance without losing information for our problem.

**Sparsification.** To sparsify our data, we apply brute force to build part of the approximate r-net. Intuitively, we randomly pick a center point from our dataset and then remove every point that is closer than $r + \epsilon d$ to the center, by checking every point of the dataset. This technique was originally introduced in (Avarikioti et al. 2017); the proof of Theorem 1 closely follows this work.

**Theorem 1.** Given $X \subseteq \{0,1\}^d$, $|X| = n$, $\epsilon > 0$, the Hamming distance, denoted by $\|\cdot\|_1$, and some distance $r \in \mathbb{R}$, we can compute a set $X' \subseteq X$ with $\Pr[Y \leq n^{1.7}] \geq 1 - n^{-0.2}$ and a partial r-net $C$ of $X \setminus X'$, where $Y := |\{\{i,j\} \mid x_i, x_j \in X', \|x_i - x_j\|_1 \leq r + \epsilon d\}|$ is the number of pairs of points that are close to each other, in time $O(dn^{1.5})$.
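To make the sparsification step concrete, the following minimal Python sketch (ours, not from the paper; helper names such as `sparsify` are hypothetical) picks a random remaining point, makes it a center of the partial net, and brute-forces away everything within $r + \epsilon d$, repeating for about $\sqrt{n}$ rounds as in Theorem 1.

```python
import random

def hamming(p, q):
    """Hamming distance between two equal-length 0/1 tuples."""
    return sum(a != b for a, b in zip(p, q))

def sparsify(points, r, eps, rounds=None):
    """Brute-force sparsification in the spirit of Theorem 1 (sketch only).

    Repeatedly picks a random remaining point, declares it a center of the
    partial net, and deletes every remaining point within r + eps*d of it.
    Returns (partial_centers, remaining_points).
    """
    remaining = list(points)
    centers = []
    d = len(points[0])
    if rounds is None:
        rounds = int(len(points) ** 0.5) + 1  # ~sqrt(n) rounds, O(d * n^1.5) work
    for _ in range(rounds):
        if not remaining:
            break
        c = random.choice(remaining)
        centers.append(c)
        remaining = [p for p in remaining if hamming(p, c) > r + eps * d]
    return centers, remaining
```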
**Distance Matrix.** Next we introduce a tool, called the distance matrix, to approximate r-nets. To construct a distance matrix, we partition the dataset into disjoint sets of equal size. The rows of the matrix correspond to partitions and the columns to points of the dataset. Each entry holds a value which indicates whether any of the points in a partition (row) is at most $r + \epsilon d$ close to a data point (column). We use probabilistic polynomial threshold functions, formally defined below, to construct a matrix with such indicator values.

**Definition 2** ((Alman, Chan, and Williams 2016)). If $f$ is a Boolean function on $n$ variables, and $R$ is a ring, a probabilistic polynomial for $f$ with error $\frac{1}{s}$ and degree $d$ is a distribution $D$ of degree-$d$ polynomials over $R$ such that $\forall x \in \{0,1\}^n,\ \Pr_{p \sim D}[p(x) = f(x)] \geq 1 - \frac{1}{s}$.

The main building block for constructing the distance matrix is Theorem 2, which uses the fact that each entry of the distance matrix can be expressed as a Boolean formula.

**Theorem 2** ((Alman, Chan, and Williams 2016)). Given $d, s, t, \epsilon$, we can construct a probabilistic polynomial $P : \{0,1\}^{ns} \to R$ of degree at most $\Delta := O\big((\tfrac{1}{\epsilon})^{\frac{1}{3}} \log s\big)$ with at most $s \cdot n^{\Delta}$ monomials, such that:
1. If $\bigvee_{i=1}^{s}\big[\sum_{j=1}^{n} x_{ij} \geq t\big]$ is false, then $|P(x_{11}, \ldots, x_{1n}, \ldots, x_{s1}, \ldots, x_{sn})| \leq s$ with probability at least $\frac{2}{3}$;
2. If $\bigvee_{i=1}^{s}\big[\sum_{j=1}^{n} x_{ij} \geq t + \epsilon n\big]$ is true, then $P(x_{11}, \ldots, x_{1n}, \ldots, x_{s1}, \ldots, x_{sn}) > 2s$ with probability at least $\frac{2}{3}$.

Before we show how to construct the distance matrix for a given dataset, we cite the following lemma by (Coppersmith 1982) on rectangular matrix multiplication.

**Lemma 1** ((Coppersmith 1982)). For all sufficiently large $N$ and $\alpha \leq .172$, multiplication of an $N \times N^{\alpha}$ matrix with an $N^{\alpha} \times N$ matrix can be done in $N^2 \operatorname{poly}(\log N)$ arithmetic operations, over any field with $O(2^{\operatorname{poly}(\log N)})$ elements. (A proof can be found in the appendix of (Williams 2014).)

Next, we present how to build the distance matrix, combining fast matrix multiplication and probabilistic polynomial threshold functions.

**Theorem 3.** Let $X$ be a set of $n$ points in $\{0,1\}^d$, $r \in \mathbb{R}$ a radius, $\epsilon \geq \frac{\log^6(d \log n)}{\log^3 n}$, $\alpha = \Omega\big(\frac{\epsilon^{1/3}}{\log(\frac{d}{\epsilon}\log n)}\big)$, and let $\|\cdot\|_1$ denote the Hamming distance. There exists an algorithm that computes, with high probability, an $n^{1-\alpha} \times n$ matrix $W$ and a partition $S_1, \ldots, S_{n^{1-\alpha}}$ of $X$ that satisfy the following properties (by $[k]$ we denote the set $\{1, 2, \ldots, k\}$):
1. For all $i \in [n^{1-\alpha}]$ and $j \in [n]$, if $\min_{p \in S_i} \|x_j - p\|_1 \leq r$, then $W_{i,j} > 2|S_i|$.
2. For all $i \in [n^{1-\alpha}]$ and $j \in [n]$, if $\min_{p \in S_i} \|x_j - p\|_1 > r + \epsilon d$, then $|W_{i,j}| \leq |S_i|$.

The algorithm runs in $\tilde{O}(n^{2-\alpha})$ time.
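To make the interface of Theorem 3 concrete, here is a naive stand-in for the matrix $W$ (our illustration, not the paper's construction): it partitions the points into roughly $n^{1-\alpha}$ blocks and fills each entry with an exact indicator of whether the block contains a point within distance $r$ of the column point. The actual algorithm replaces this $O(n^2 d)$ loop with probabilistic polynomial threshold functions evaluated through fast rectangular matrix multiplication, which is what yields the $\tilde{O}(n^{2-\alpha})$ bound.

```python
def hamming(p, q):
    """Hamming distance between two equal-length 0/1 tuples (as above)."""
    return sum(a != b for a, b in zip(p, q))

def naive_distance_matrix(points, r, alpha):
    """Exact, quadratic-time stand-in for the matrix W of Theorem 3.

    blocks[i] plays the role of S_i; W[i][j] is True iff some point of
    blocks[i] is within Hamming distance r of points[j].
    """
    n = len(points)
    block_count = max(1, round(n ** (1 - alpha)))
    block_size = -(-n // block_count)  # ceiling division
    blocks = [points[i:i + block_size] for i in range(0, n, block_size)]
    W = [[any(hamming(p, q) <= r for p in block) for q in points]
         for block in blocks]
    return W, blocks
```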
**Building a Net.** Now we present how to build an approximate r-net for a data set, as in (Avarikioti et al. 2017): we first employ the sparsification technique and then build the distance matrix on the sparse dataset, where we can search efficiently. The running time of building an approximate r-net is dominated by the time complexity of the construction of the distance matrix.

**Theorem 4.** Given $X \subseteq \{0,1\}^d$ with $|X| = n$, some distance $r \in \mathbb{R}$ and some $\epsilon \geq \frac{\log^6(d \log n)}{\log^3 n}$, we can compute a set $C$ that contains the centers of an approximate r-net with additive error at most $\epsilon$ with high probability in time $\tilde{O}(n^{2-\alpha} + dn^{1.7+\alpha})$, where $\alpha = \Omega\big(\frac{\epsilon^{1/3}}{\log(\frac{d}{\epsilon}\log n)}\big)$.

**Proof.** We apply Theorem 1 to the set $X$ with radius $r$ and error $\epsilon$. This results in the remaining points $X'$, a partial approximate r-net $C$ for $X \setminus X'$ and $\Pr[Y \leq n^{1.7}] \geq 1 - n^{-0.2}$, where $Y := |\{\{i,j\} \mid x_i, x_j \in X', \|x_i - x_j\|_1 \leq r + \epsilon d\}|$, in time $O(n^{1.5}d)$. We then apply Theorem 3 to assemble the distance matrix $W$ and the partition $S_1, \ldots, S_{n^{1-\alpha}}$ on inputs $X'$, $\epsilon$ and $r$. If we encounter more than $n^{1.7}$ entries $W_{i,j}$ with $W_{i,j} > 2|S_i|$, we restart the algorithm. Since $\Pr[Y \leq n^{1.7}] \geq 1 - n^{-0.2}$, with high probability we pass this stage in a constant number of runs. Next, we iterate top down over every column of $W$. For every column $j$, we first check whether $x_j$ has already been deleted. If this is the case, we directly skip to the next column. Otherwise we set $C = C \cup \{x_j\}$ and delete $x_j$. For every entry in that column with $W_{i,j} > 2|S_i|$, we then delete every $x \in S_i$ with $\|x_j - x\|_1 \leq r + \epsilon d$. As we iterate over every column of $W$, which corresponds to every point of $X'$, the net fulfills covering. Since points that get added as centers were not within $r$ of a previously chosen center, $C$ also fulfills packing. Thus $C$ now contains the net points of an approximate r-net of $X$. By Theorem 3, building the distance matrix takes $\tilde{O}(n^{2-\alpha})$ time and iterating over every entry of $W$ takes $O(n^{2-\alpha})$ time as well. For at most $n^{1.7}$ of these entries, we check the distances between the points of a set $S_i$ and the point of the current column, which takes another $O(n^{1.7+\alpha}d)$. The runtime is thus as stated.

### Approximate r-nets in ℓ1 and ℓ2 Spaces

In this subsection, we reduce the problem of computing approximate r-nets from Euclidean to Hamming space. Then, we apply Theorem 4 to compute approximate r-nets in ℓ1 and ℓ2 spaces. We distinguish between the ℓ1 and ℓ2 metrics. Specifically, we first show how to map the dataset from $\mathbb{R}^d$ with the ℓ1 metric to Hamming space. Then, we present a reduction from ℓ2 to ℓ1 space. Note that the error in the original space is multiplicative, while in Hamming space it is additive. Although the proofs of both Theorem 5 and Theorem 6 use the same mappings as in (Alman, Chan, and Williams 2016), the problem of computing r-nets is different and more general than finding the closest pair, thus we cannot directly cite their results.

**ℓ1 case.** To reduce the problem of computing approximate r-nets from the ℓ1 metric space to Hamming space we use Locality Sensitive Hashing (LSH), formally defined below.

**Definition 3.** Let $r, c \in \mathbb{R}$ and $p_1, p_2 \in [0,1]$ where $p_1 > p_2$. A distribution $D$ of hash functions is called $(r, cr, p_1, p_2)$-sensitive if, for a metric space $V$ under norm $\|\cdot\|$, a hash function $h$ randomly drawn from $D$ satisfies the following conditions for any points $x, y \in V$:
1. $\|x - y\| \leq r \Rightarrow \Pr[h(x) = h(y)] \geq p_1$
2. $\|x - y\| \geq cr \Rightarrow \Pr[h(x) = h(y)] \leq p_2$

We call hashing methods that exhibit these properties locality sensitive hash functions (LSH). Now, we show how to compute approximate r-nets under the ℓ1 norm, by employing a specific instance of LSH functions.

**Theorem 5.** For a set of input points $X \subseteq \mathbb{R}^d$, some radius $r \in \mathbb{R}$ and some error $\epsilon \geq \frac{\log^6(\log n)}{\log^3 n}$, with high probability, we can construct a $(1+\epsilon)$-approximate r-net under the ℓ1 norm in time $\tilde{O}(dn + n^{2-\alpha})$, where $\alpha = \Omega(\epsilon^{1/3}/\log(\frac{1}{\epsilon}))$.

**Proof.** The following inequalities are the same as the ones derived in (Alman, Chan, and Williams 2016) for their algorithm that finds all nearest/furthest neighbors of a set. First, apply a variant of locality sensitive hashing to map points from ℓ1 to Hamming space. For each point $p \in X$ and $i \in \{1, \ldots, k\}$, $k = O(\epsilon^{-2} \log n)$, we define hash functions $h_i(p) = (h_{i1}(p), \ldots, h_{id}(p))$, where $h_{ij}(p) = \big\lfloor \frac{p_{a_{ij}} + b_{ij}}{2r} \big\rfloor$ with $a_{ij} \in \{1, \ldots, d\}$ and $b_{ij} \in [0, 2r)$ sampled independently and uniformly at random. For each value $h_i(p)$, define $f_i(p) = 0$ with probability $\frac{1}{2}$ or $f_i(p) = 1$ with probability $\frac{1}{2}$. We then define a new point in Hamming space as $f(p) = (f_1(p), \ldots, f_k(p))$. We have that for any $p, q \in X$,

$$\Pr[h_{ij}(p) \neq h_{ij}(q)] = \frac{1}{d}\sum_{a=1}^{d} \min\Big\{\frac{|p_a - q_a|}{2r},\, 1\Big\}$$

and

$$\begin{aligned}
\Pr[f_i(p) \neq f_i(q)] &= \Pr[f_i(p) \neq f_i(q) \mid h_i(p) = h_i(q)]\,\Pr[h_i(p) = h_i(q)]\\
&\quad + \Pr[f_i(p) \neq f_i(q) \mid h_i(p) \neq h_i(q)]\,\Pr[h_i(p) \neq h_i(q)]\\
&= 0 + \tfrac{1}{2}\Pr[h_i(p) \neq h_i(q)] = \tfrac{1}{2}\Pr\Big[\bigvee_{j=1}^{d} h_{ij}(p) \neq h_{ij}(q)\Big]\\
&= \tfrac{1}{2}\Big(1 - \prod_{j=1}^{d} \Pr[h_{ij}(p) = h_{ij}(q)]\Big) = \tfrac{1}{2}\Big(1 - \prod_{j=1}^{d} \big(1 - \Pr[h_{ij}(p) \neq h_{ij}(q)]\big)\Big).
\end{aligned}$$

Thus, the following properties hold for any $p, q \in X$:
1. If $\|p - q\|_1 \leq r$ then $\Pr[h_{ij}(p) \neq h_{ij}(q)] \leq \frac{1}{2d}$ and thus $\Pr[f_i(p) \neq f_i(q)] \leq \frac{1}{2}\big(1 - (1 - \frac{1}{2d})^d\big) := \alpha_0$
2. If $\|p - q\|_1 \geq (1+\epsilon)r$ then $\Pr[h_{ij}(p) \neq h_{ij}(q)] \geq \frac{1+\epsilon}{2d}$ and thus $\Pr[f_i(p) \neq f_i(q)] \geq \frac{1}{2}\big(1 - (1 - \frac{1+\epsilon}{2d})^d\big) := \alpha_1$

Then it follows that $\alpha_1 - \alpha_0 = \Omega(\epsilon)$. By applying a Chernoff bound, we derive the following:
1. If $\|p - q\|_1 \leq r$ then $\mathbb{E}[\|f(p) - f(q)\|_1] = \sum_{i=1}^{k} \Pr[f_i(p) \neq f_i(q)] \leq k\alpha_0$, and thus $\Pr[\|f(p) - f(q)\|_1 \leq \alpha_0 k + O(\sqrt{k \log n}) := A_0] \geq 1 - O(\frac{1}{n})$
2. If $\|p - q\|_1 \geq (1+\epsilon)r$ then $\mathbb{E}[\|f(p) - f(q)\|_1] = \sum_{i=1}^{k} \Pr[f_i(p) \neq f_i(q)] \geq k\alpha_1$, and thus $\Pr[\|f(p) - f(q)\|_1 \geq \alpha_1 k - O(\sqrt{k \log n}) := A_1] \geq 1 - O(\frac{1}{n})$

As we know that $\alpha_1 - \alpha_0 = \Omega(\epsilon)$, it is easy to see that $A_1 - A_0 = k(\alpha_1 - \alpha_0) - O(\sqrt{k \log n}) = \Omega(k\epsilon)$. For the new set of points $X' := f(X)$, we construct an approximate r-net with additive error $\Omega(\epsilon)$, which yields the center points of an approximate r-net of the original points with multiplicative error $(1+\epsilon)$. We hence apply Theorem 4 on inputs $X = X'$, $d = k$, $\epsilon = \Omega(\epsilon)$ and $r = A_0$. This gives us the centers $C$ of an approximate r-net for $X'$ in time $\tilde{O}(n^{2-\alpha} + n^{1.7+\alpha})$, where $\alpha = \Omega(\epsilon^{1/3}/\log(\frac{1}{\epsilon}))$. The points that get mapped to the net points in $C$ are then the centers of a $(1+\epsilon)$-approximate r-net of the points in $X$ under the ℓ1 metric with high probability. Applying this version of locality sensitive hashing to $X$ takes $O(dkn) = \tilde{O}(dn)$ time, which leads to the runtime as stated.

**ℓ2 case.** Firstly, we reduce the dimension of the dataset using the Fast Johnson-Lindenstrauss Transform (Johnson and Schechtman 1982; Johnson and Lindenstrauss 1984). Specifically, we use a variant that allows us to map ℓ2 points to ℓ1 points, while preserving a slightly perturbed all-pairs distance under the respective norm, as seen for example in (Matoušek 2008). Thus, we can construct approximate r-nets in general Euclidean space, as formally stated below.

**Theorem 6.** For a set of input points $X \subseteq \mathbb{R}^d$, some radius $r \in \mathbb{R}$ and some error $\epsilon \geq \frac{\log^6(\log n)}{\log^3 n}$, with high probability, we can construct a $(1+\epsilon)$-approximate r-net under the ℓ2 norm in time $\tilde{O}(dn^{1.7+\alpha} + n^{2-\alpha})$, where $\alpha = \Omega(\epsilon^{1/3}/\log(\frac{1}{\epsilon}))$.
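The reduction used in Theorems 5 and 6 can be pictured with the following sketch of the map $f$ from the proof of Theorem 5 (ours; the constant in $k$ is not tuned and the random bit per bucket pattern is simulated with a salted hash, so treat it as an illustration under those assumptions): each of the $k$ output bits applies $d$ independent sub-hashes, each projecting a randomly chosen coordinate onto a randomly shifted grid of width $2r$, and turns the resulting bucket pattern into a pseudo-random bit, so that points within distance $r$ rarely differ in a bit while points farther than $(1+\epsilon)r$ differ noticeably more often.

```python
import hashlib
import math
import random

def l1_to_hamming(points, r, eps, seed=0):
    """Sketch of the map f from the proof of Theorem 5.

    Sub-hash j of output bit i projects a randomly chosen coordinate a_ij onto
    a randomly shifted grid of cell width 2r.  The d bucket values are then
    turned into one pseudo-random bit via a salted hash, standing in for the
    random bit f_i assigned to each value h_i(p).
    """
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    k = max(1, round((1.0 / eps ** 2) * math.log(max(n, 2))))  # k = O(eps^-2 log n)
    coords = [[rng.randrange(d) for _ in range(d)] for _ in range(k)]
    shifts = [[rng.uniform(0.0, 2 * r) for _ in range(d)] for _ in range(k)]
    salts = [rng.getrandbits(64) for _ in range(k)]

    def bit(i, p):
        buckets = tuple(int((p[coords[i][j]] + shifts[i][j]) // (2 * r))
                        for j in range(d))
        digest = hashlib.blake2b(repr((salts[i], buckets)).encode(),
                                 digest_size=8).digest()
        return digest[0] & 1

    return [tuple(bit(i, p) for i in range(k)) for p in points]
```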
## Applications

In the following section, we present applications of the algorithms we presented in the previous section. To that end, we exhibit an improvement on a framework called Net & Prune. Net & Prune was invented by (Har-Peled and Raichel 2015) for low dimensional applications. An extended version of the framework, which is efficient on higher dimensional datasets, was later presented by (Avarikioti et al. 2017). In what follows, we apply the approximate r-net algorithm to immediately improve the high dimensional framework. We then present various applications that depend on approximate r-nets and the framework.

### Net & Prune Framework

Net & Prune mainly consists of two algorithms, ApprxNet and DelFar, which are alternately called by the framework, and a data structure that is specific to the problem we want to solve. When supplied with these, the framework returns an interval with constant spread, which is guaranteed to contain the optimal solution to the objective of the desired problem. To improve the framework, we first improve these two algorithms. ApprxNet computes an approximate r-net for a given point set and DelFar deletes the isolated points, i.e., the points that do not have any other point in a ball of radius r around them. As an improvement to ApprxNet, we refer to Theorem 5 and Theorem 6. We now present an algorithm that yields an improved version of DelFar.

**Theorem 7.** For a set of points $X$, some error $\epsilon \geq \frac{\log^6(\log n)}{\log^3 n}$, a radius $r \in \mathbb{R}$ and the norm $\|\cdot\|$ that denotes the ℓ1 or ℓ2 norm, we can construct an algorithm DelFar that outputs, with high probability, a set $F$ for which the following holds:
1. If for a point $p \in X$ it holds that $\forall q \in X, q \neq p: \|p - q\| > (1+\epsilon)r$, then $p \notin F$.
2. If for a point $p \in X$ it holds that $\exists q \in X, q \neq p: \|p - q\| \leq r$, then $p \in F$.

We do this in time $\tilde{O}(dn + n^{2-\alpha})$, where $\alpha = \Omega(\frac{\epsilon^{1/3}}{\log(1/\epsilon)})$.

**Proof.** We prove the theorem for the ℓ1 metric. For an ℓ2 instance we can simply apply the mapping of Theorem 6 and the proof holds. Initially, we map the points to Hamming space, applying the techniques described in Theorem 5. During the brute force part of the algorithm we do the following: we delete each point that is covered by a center and then add both the point and the center to the set $F$. We do not add centers to $F$ that do not cover any other points. Later, when traversing the distance matrix, we check each entry that indicates whether the partition contains a close point. We calculate the distances between the current point and all points of the partition. We add to the set $F$, and then delete, the points that are actually close. We ignore points where every entry in their column indicates no close points. In the end, we return the set $F$. The running time of the algorithm is the same as in Theorem 6, since deleting a point after adding it to the set $F$ takes constant time.

The Net & Prune framework allows us to solve various so-called nice distance problems. As presented by (Har-Peled and Raichel 2015), the problems solved in the subsections to come are all proven to be of this kind. One of the properties of such problems is that there needs to exist a so-called $(1+\epsilon)$-decider for the problem. In the following, we give a formal definition of such deciders.

**Definition 4** ((Avarikioti et al. 2017)). Given a function $f : X \to \mathbb{R}$, we call a decider procedure a $(1+\epsilon)$-decider for $f$ if, for any $x \in X$ and $r > 0$, $\mathrm{decider}_f(r, X)$ returns one of the following: (i) $f(x) \in [\beta, (1+\epsilon)\beta]$ for some real $\beta$, (ii) $f(x) < r$, or (iii) $f(x) > r$.

Even though (Har-Peled and Raichel 2015) presented a decider for each of the problems that follow, the extended framework by (Avarikioti et al. 2017) requires deciders to be efficient, as otherwise the framework's runtime does not hold. This is a good opportunity to apply Theorem 5 and Theorem 6 from the previous section. In the following sections, we employ the theorem below to find constant spread intervals that contain the solutions to nice distance problems. We then apply deciders to approximate the solution of the problem.

**Theorem 8** ((Avarikioti et al. 2017)). For $c \geq 64$, the Net & Prune algorithm computes in $O(dn^{1.999999})$ time a constant spread interval containing the optimal value $f(X)$, with probability $1 - o(1)$.

### kth-Smallest Nearest Neighbor Distance

Given a set of points in high dimensional space, we may be interested in finding the k-smallest nearest neighbor distance, i.e., the kth-smallest value among the distances of each point to its nearest neighbor. Computing this with a naive algorithm takes $O(dn^2)$, which is not suitable in high dimensional space. Alternatively, we are able to build a $(1+\epsilon)$-decider and then apply the Net & Prune framework to solve the problem. This has previously been done by (Avarikioti et al. 2017). Theorem 5 and Theorem 6 yield an immediate improvement on the runtime of the decider, as it is built with the help of DelFar. We thus omit the proof of the following theorem here; it can be found in the supplementary material.

**Theorem 9.** For a set of points $X$, $\epsilon \geq \frac{4\log^6(\log n)}{\log^3 n}$ and the norm $\|\cdot\|$ that denotes the ℓ1 or ℓ2 norm, with high probability, we can find the $(1+\epsilon)$-approximate k-smallest nearest neighbor distance of $X$ in time $\tilde{O}(dn + n^{2-\alpha})$, where $\alpha = \Omega(\frac{\epsilon^{1/3}}{\log(1/\epsilon)})$.
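For orientation, the quantity approximated by Theorem 9 has a trivial exact baseline, shown below (our own illustration); it is precisely the $O(dn^2)$ computation that the decider-based approach avoids.

```python
def kth_smallest_nn_distance(points, k, dist):
    """Exact O(d n^2) baseline: the k-th smallest value (1-indexed) among the
    nearest-neighbor distances of all points."""
    nn_dists = []
    for i, p in enumerate(points):
        nn_dists.append(min(dist(p, q) for j, q in enumerate(points) if j != i))
    return sorted(nn_dists)[k - 1]

# Example with the l1 metric:
# l1 = lambda p, q: sum(abs(a - b) for a, b in zip(p, q))
# kth_smallest_nn_distance([(0, 0), (1, 0), (5, 5)], k=2, dist=l1)
```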
### Min-Max Clustering

To understand the following problem, we first define upward closed set systems and sketchable families, as introduced in (Har-Peled and Raichel 2015).

**Definition 5** (Upward Closed Set System & Sketchable Family (Har-Peled and Raichel 2015)). Let $P$ be a finite ground set of elements, and let $F$ be a family of subsets of $P$. Then $(P, F)$ is an upward closed set system if for any $X \in F$ and any $Y \subseteq P$ such that $X \subseteq Y$, we have that $Y \in F$. Such a set system is a sketchable family if for any set $S \subseteq P$ there exists a constant size sketch $\mathrm{sk}(S)$ such that the following hold:
1. For any $S, T \subseteq P$ that are disjoint, $\mathrm{sk}(S \cup T)$ can be computed from $\mathrm{sk}(S)$ and $\mathrm{sk}(T)$ in $O(1)$ time. We assume the sketch of a singleton can be computed in $O(1)$ time, and as such the sketch of a set $S \subseteq P$ can be computed in $O(|S|)$ time.
2. There is a membership oracle for the set system based on the sketch. That is, there is a procedure $\mathrm{orac}$ such that, given the sketch of a subset $\mathrm{sk}(S)$, $\mathrm{orac}$ returns whether $S \in F$ or not in $O(1)$ time.

Min-Max Clustering is a method of clustering the ground set of a sketchable family, with clusters taken from the upward closed set system, under some cost function. The following is a formal definition of Min-Max Clustering, as provided by (Har-Peled and Raichel 2015).

**Definition 6** (Min-Max Clustering (Har-Peled and Raichel 2015)). We are given a sketchable family $(P, F)$ and a cost function $g : 2^P \to \mathbb{R}^+$. We are interested in finding disjoint sets $S_1, \ldots, S_m \in F$ such that (i) $\bigcup_{i=1}^{m} S_i = P$, and (ii) $\max_i g(S_i)$ is minimized. We will refer to the partition realizing the minimum as the optimal clustering of $P$.

We later resort to the following lemma when building a $(1+\epsilon)$-decider for a concrete instance of Min-Max Clustering.

**Lemma 2.** Given a set of $n$ points $X \subseteq \mathbb{R}^d$, a radius $r \in \mathbb{R}$, some error parameter $\epsilon \geq \frac{\log^6(\log n)}{\log^3 n}$, the norm $\|\cdot\|$ that denotes the ℓ1 or ℓ2 norm, and a set $C \subseteq X$ such that $\forall x, y \in C$, $\|x - y\| \geq 2r(1+\epsilon)$, with high probability we can return sets $P_i$ such that $\forall c_i \in C$, $\forall x \in P_i$, $\|c_i - x\| \leq (1+\epsilon)r$, and such that every $x \in X \cap B_r(c_i)$, where $B_r(c_i) = \{x \in \mathbb{R}^d : \|x - c_i\| \leq r\}$, satisfies $x \in P_i$. We do this in time $\tilde{O}(dn + n^{2-\alpha})$, where $\alpha = \Omega(\frac{\epsilon^{1/3}}{\log(1/\epsilon)})$.

**Proof.** We reduce the problem to Hamming space with error $\epsilon$ and radius $r$, applying the same techniques as in Theorem 5 for ℓ1 points or Theorem 6 for points in ℓ2. After this reduction, we get a set of new points $X'$ and a new radius $r'$. We apply brute force on $X'$ to get part of the solution. We randomly choose a point $c_i \in C$ and then iterate over every point $x \in X'$. We check whether $\|x - c_i\|_1 \leq r' + \epsilon k$, for $k = O(\epsilon^{-2}\log n)$ the dimension in Hamming space. For every point where this holds, we add the original point to the set $P_i$ and then delete $x$ from $X'$. We do this $\sqrt{n}$ times, which takes $O(n^{1.5})$ time in total. Without loss of generality, we now assume that $|C| > \sqrt{n}$, as otherwise we would be done at this point. With a similar argument as in Theorem 1, we argue that, after using brute force, with probability at least $1 - n^{-0.2}$, $|\{(c, x) \mid c \in C, x \in X', \|x - c\|_1 \leq r' + \epsilon k\}| \leq n^{1.7}$. As in Theorem 4, we then build the distance matrix $W$ of the remaining points in $X'$. We iterate over every column corresponding to a remaining center point $c_j$. For every entry $W_{i,j} > 2|S_i|$, we add the original version of every point $x \in S_i$ with $\|x - c_j\|_1 \leq r' + k\epsilon$ to $P_j$. This takes time $\tilde{O}(n^{2-\alpha} + n^{1.7})$. It then holds that $\forall c_i \in C$, $\forall x \in P_i$, $\|c_i - x\| \leq (1+\epsilon)r$, as we only added points to the $P_i$'s for which this property holds. It also holds that every $x \in X \cap B_r(c_i)$ lies in $P_i$, as every point is within the ball of at most one center because of the constraint on $C$. As we only use brute force and then build a distance matrix which we iterate through in a similar fashion as in Theorem 4, the runtime is as stated.
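The guarantee of Lemma 2 can be mimicked by a naive assignment routine (ours, ignoring the Hamming-space machinery that makes the lemma subquadratic): for each center, collect every point within $(1+\epsilon)r$; when the centers are pairwise more than $2r(1+\epsilon)$ apart, each ball $B_r(c_i)$ is fully captured and the resulting sets are disjoint.

```python
def assign_to_centers(points, centers, r, eps, dist):
    """Naive O(|C| * n * d) stand-in for Lemma 2.

    P[i] contains every point within (1+eps)*r of centers[i].  If the centers
    are pairwise more than 2r(1+eps) apart, the triangle inequality implies
    the P[i] are disjoint, and each P[i] contains X intersected with B_r(c_i).
    """
    return [[p for p in points if dist(p, c) <= (1 + eps) * r] for c in centers]
```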
The proof of the following theorem describes how to utilize the above lemma to build a decider. The framework then allows us to solve a concrete instance of Min-Max Clustering. A similar decider was built by (Har-Peled and Raichel 2015) to solve the same problem in low dimensional space.

**Theorem 10.** Let $P \subseteq \mathbb{R}^d$, let $(P, F)$ be a sketchable family and let $\|\cdot\|$ be the norm that denotes the ℓ1 or ℓ2 norm. For a set $W \in F$, let $r_{\min}(W)$ be the smallest radius such that a ball centered at a point of $W$ encloses the whole set. We can then, for $\epsilon \geq \frac{4\log^6(\log n)}{\log^3 n}$, $(4+\epsilon)$-approximate the Min-Max clustering of $P$ with $r_{\min}(W)$ as the cost function, with high probability, in time $\tilde{O}(dn + n^{2-\alpha})$, where $\alpha = \Omega(\frac{\epsilon^{1/3}}{\log(1/\epsilon)})$. Specifically, one can cover $P$ by a set of balls and assign each point of $P$ to a ball containing that point, such that the set of points assigned to each ball is in $F$ and the maximum radius of these balls is a $(4+\epsilon)$-approximation of the minimum, over all such covers, of the maximal radius used.

**Proof.** First notice that, when building a $(1+\epsilon)$-approximate $(4r_{opt}(1+\epsilon))$-net, where $r_{opt}$ is the radius of the optimal clustering $P_{opt}$ of $P$, the following properties hold. Let $W_i \in P_{opt}$ be the cluster that contains center $c_i$ of the net. It then holds that $\mathrm{diam}(W_i) \leq 2r_{opt}$. Also, any two center points of the net have distance at least $4r_{opt}(1+\epsilon)$ from each other; thus there are no $i \neq j$ such that $W_i = W_j$. Now define $C_i$ as the set of points that are contained in a ball of radius $2r_{opt}$ around center $c_i$, hence $C_i = P \cap B_{2r_{opt}}(c_i)$. It then holds that $W_i \subseteq C_i$ and, since $W_i \in P_{opt}$, we know that $W_i \in F$. Thus $C_i \in F$ by the definition of upward closed set systems.

This observation allows us to build a decider for $f(P, F)$, which is the function returning the optimal solution to the objective of the clustering. First, we build a $(1+\frac{\epsilon}{4})$-approximate $(4r(1+\frac{\epsilon}{4}))$-net. We then apply Lemma 2 on inputs $X = P$, $r = 2r$, $\epsilon = \frac{\epsilon}{4}$ and $C = C'$, where $C'$ is the set of center points of the net. It is easy to see that the needed property on $C'$ is met, as it contains the center points of a $(4r(1+\frac{\epsilon}{4}))$-net. The sets $P_i$ returned by Lemma 2 are then supersets of the $C_i$, if $r \geq r_{opt}$. From the definition of sketchable families we know that, for every $P_i$, we are able to decide whether $P_i \in F$ in $O(n)$ time. Assume now that there exists a $P_i$ which is not in $F$. This $P_i$ is thus not a superset of $C_i$, and we return $r < r_{opt}$. Otherwise, we know that $\forall i$, $C_i \subseteq P_i$ and thus $r \geq r_{opt}$.

Now it remains to decide whether $r_{opt}$ is within some constant spread interval. To that end, we repeat the process above, but for a slightly smaller net, say a $(1+\frac{\epsilon}{4})$-approximate $4r$-net. If all of the $P_i$'s for this new net are in $F$, we know that the original $r$ was too big and we return $f(P, F) < r$. Otherwise, because we applied Lemma 2 to radius $\frac{2r}{1+\epsilon/4}$ and found that balls of that radius centered at the $c_i$ are not in $F$, we know that the optimal value is at least $\frac{r}{1+\epsilon/4}$, due to our observation about the diameter of the clusters in the beginning. We also know that $r_{opt}$ can be as big as $4(1+\frac{\epsilon}{4})^2 r$, as we are able to cover the whole space with balls of such radius and subsets of these are in $F$. Therefore, we return the interval $[\frac{r}{1+\epsilon/4},\, 4(1+\frac{\epsilon}{4})^2 r]$. Plugging this into the framework thus provides us with a constant spread interval which contains the solution. By searching the interval the same way as in Theorem 9, we end up with an interval whose lower endpoint is $\frac{r}{1+\epsilon/4}$ for some radius $r$.
We return $\frac{r}{1+\epsilon/4}$, which is a $(4+\epsilon)$-approximate solution, since the real solution may be up to a $4(1+\frac{\epsilon}{4})^3$-factor off and $4(1+\frac{\epsilon}{4})^3 \leq 4+\epsilon$. In the worst case, the decider builds an approximate r-net twice and also calls Lemma 2 twice. Applying the framework with that decider and searching the returned interval thus results in $\tilde{O}(dn + n^{2-\alpha})$ time.

### k-Center Clustering

The k-center clustering problem is tightly coupled to the problem of building r-nets. For a set of high dimensional points, we want to find $k$ clusters that minimize the maximum diameter of any of these. For any $\epsilon > 0$, computing a $(2-\epsilon)$-approximate k-center clustering in polynomial time has been shown to be impossible unless P = NP (Hsu and Nemhauser 1979). We thus focus on computing $(2+\epsilon)$-approximations of the optimal solution. In the following we present two approaches to this. First, we build a decider such that we are able to employ the framework, which provides us with a $(4+\epsilon)$-approximate k-center clustering. We then exhibit a different approach to the problem. Instead of relying on the framework, we derive an algorithm that computes approximate greedy permutations. We then present a way of reducing the computation of a $(2+\epsilon)$-approximate k-center clustering to building approximate greedy permutations. The drawback of this approach is that the runtime has a logarithmic dependency on the spread of the data.

**$(4+\epsilon)$-approximate k-center.** As in previous subsections, we design a decider which then gets called by the framework. The construction is similar to (Har-Peled and Raichel 2015), where a decider is constructed to $(4+\epsilon)$-approximate k-center clustering in lower dimensions. We first prove the following lemma, which is going to be useful later.

**Lemma 3.** The following relations hold between a set $C$, which contains the net points of a $(1+\epsilon)$-approximate r-net on a set of points $X$, and the function $f(X, k)$, which returns the optimal clustering radius for the k-center problem on the set $X$:
1. If $|C| \leq k$ then $f(X, k) < (1+2\epsilon)r$.
2. If $|C| > k$ then $r \leq 2f(X, k)$.

**Theorem 11.** For a set of $n$ points $X \subseteq \mathbb{R}^d$, some integer $k$, $n \geq k > 0$, some error parameter $\epsilon \geq \frac{32\log^6(\log n)}{\log^3 n}$ and the norm $\|\cdot\|$ that denotes the ℓ1 or ℓ2 norm, with high probability, we can return a $(4+\epsilon)$-approximate k-center clustering in time $\tilde{O}(dn + n^{2-\alpha})$, where $\alpha = \Omega(\frac{\epsilon^{1/3}}{\log(1/\epsilon)})$.

**Proof.** In the following we present the decider that we plug into the framework. First we create a $(1+\frac{\epsilon}{32})$-approximate $\frac{r}{1+\epsilon/16}$-net and check whether we get $k$ or fewer center points. If we do, then due to Lemma 3 we can safely return $f(X, k) < r$. If we do not, we create a $(1+\frac{\epsilon}{32})$-approximate $(2(1+\frac{\epsilon}{16})r)$-net and check whether we have at most $k$ centers in this net. In that case, due to Lemma 3, we know that $\frac{r}{2(1+\epsilon/16)} \leq f(X, k) < 2(1+\frac{\epsilon}{16})^2 r$ and we return this interval. Otherwise, we have more than $k$ centers. Thus we know from Lemma 3 that $r(1+\frac{\epsilon}{16}) \leq f(X, k)$ and we return $f(X, k) > r$. We therefore satisfy the properties of a $(1+\epsilon)$-decider and apply it to the framework to compute a constant spread interval containing the exact solution. As in the previous subsections, we slice the interval and do binary search using the decider. If we find an interval $\frac{r}{2(1+\epsilon/16)} \leq f(X, k) < 2(1+\frac{\epsilon}{16})^2 r$, we return $2(1+\frac{\epsilon}{16})^2 r$, which is a $(4+\epsilon)$-approximation, as it might miss the real solution by a factor of at most $4(1+\frac{\epsilon}{16})^3 = 4 + \frac{3\epsilon}{4} + \frac{3\epsilon^2}{64} + \frac{\epsilon^3}{1024} \leq 4+\epsilon$. It is easy to see that the decider runs in time $\tilde{O}(dn + n^{2-\alpha})$ and, as in the previous subsection, the search takes $O(1/\epsilon^2)$ iterations. Therefore, the runtime is as stated.
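The decider inside the proof of Theorem 11 can be summarized by the following sketch (ours; `approximate_r_net(points, radius, err)` is a hypothetical stand-in for the net construction of Theorems 5 and 6 and is assumed to return the net centers):

```python
def k_center_decider(points, k, r, eps, approximate_r_net):
    """Sketch of the (1+eps)-decider from the proof of Theorem 11.

    Returns ('less', r) if f(X, k) < r, ('greater', r) if f(X, k) > r, or
    ('interval', lo, hi) with f(X, k) in [lo, hi), following Lemma 3.
    """
    small_net = approximate_r_net(points, r / (1 + eps / 16), eps / 32)
    if len(small_net) <= k:
        return ('less', r)                    # Lemma 3, case 1
    big_net = approximate_r_net(points, 2 * (1 + eps / 16) * r, eps / 32)
    if len(big_net) <= k:
        lo = r / (2 * (1 + eps / 16))         # Lemma 3, case 2 on the first net
        hi = 2 * (1 + eps / 16) ** 2 * r      # Lemma 3, case 1 on the second net
        return ('interval', lo, hi)
    return ('greater', r)                     # Lemma 3, case 2 on the second net
```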
**$(2+\epsilon)$-approximate k-center with dependency on the spread.** Another way to approach approximate k-center clustering is given by (Gonzalez 1985), who constructs a 2-approximate k-center clustering by using greedy permutations. A greedy permutation of a point set is an ordering such that the i-th point is the furthest from all previous points in the permutation. In (Eppstein, Har-Peled, and Sidiropoulos 2015), a way of constructing an approximate greedy permutation by building approximate r-nets is described. In the following, we present a way to improve this construction by applying the approximate r-net algorithm from the previous section. We then present how to exploit approximate greedy permutations to create $(2+\epsilon)$-approximate k-center clusterings. Unfortunately, building the greedy permutation has a logarithmic runtime dependency on the spread, which is the ratio of the biggest to the smallest distance within the point set. Therefore, the algorithm is only useful for data where the spread is in $\mathrm{poly}(n)$.

**Approximate greedy permutation.** A greedy permutation is an ordered set $\Pi$ of the input points such that the point $\pi_i$ is the furthest point in $V$ from the set $\{\pi_j\}_{j=1}^{i-1}$. The following is a formal definition of approximate greedy permutations, as described in (Eppstein, Har-Peled, and Sidiropoulos 2015).

**Definition 7.** A permutation $\Pi$ is a $(1+\epsilon)$-greedy permutation on $n$ points of a metric space $(V, d)$ if there exists a sequence $r_1 \geq r_2 \geq \ldots \geq r_n$ such that:
1. The maximum distance of a point in $V$ from $\{\pi_j\}_{j=1}^{i}$ is in the range $[r_i, (1+\epsilon)r_i]$.
2. The distance between any two points $u, v \in \{\pi_j\}_{j=1}^{i}$ is at least $r_i$.

We now prove the following lemma, which helps us build the approximate greedy permutation later.

**Lemma 4.** For a set of $n$ points $X \subseteq \mathbb{R}^d$, the norm $\|\cdot\|$ that denotes the ℓ1 or ℓ2 norm, a set of points $C \subseteq X$ such that $\forall x, y \in C$, $\|x - y\| \geq r$, some error $\epsilon \geq \frac{\log^6(\log n)}{\log^3 n}$ and a radius $r \in \mathbb{R}$, with high probability we can compute a set $F$ such that $\forall y \in F$, $\forall c \in C$, $\|c - y\| > r$. We do this in time $\tilde{O}(dn + n^{2-\alpha})$, where $\alpha = \Omega(\frac{\epsilon^{1/3}}{\log(1/\epsilon)})$.

**Proof.** We proceed similarly as when building the approximate r-net. We first reduce the problem to Hamming space with additive error $\epsilon$, as in Theorem 5 for points in ℓ1 or Theorem 6 for ℓ2 points. We then arrive at the mapped point set $X'$ and radius $r'$. Next, we apply a slightly modified version of Theorem 1. Instead of randomly choosing a point from $X'$, we randomly choose a point $c \in C$. We then iterate over every point $x \in X'$ and check whether $\|x - c\|_1 \leq r' + \epsilon k$, for $k = O(\epsilon^{-2}\log n)$ the dimension in Hamming space. For every point $x$ where this holds, we delete $x$ from $X'$ as well as from the original set $X$. We do this $\sqrt{n}$ times, which takes $O(n^{1.5})$ time in total. We now assume, without loss of generality, that $|C| > \sqrt{n}$, as otherwise we would be done at this point. By applying a similar argument as in Theorem 1, it holds that with probability at least $1 - n^{-0.2}$, $|\{(c, x) \mid c \in C, x \in X, \|x - c\|_1 \leq r' + \epsilon k\}| \leq n^{1.7}$. As in Theorem 4, we now build the distance matrix $W$ of the remaining points in $X'$. We then iterate over every column corresponding to the remaining center points $c_j$ and, for every entry $W_{i,j} > 2|S_i|$, delete the original version of every point $x \in S_i$ such that $\|x - c_j\|_1 \leq r' + k\epsilon$ from $X$. This takes time $\tilde{O}(n^{2-\alpha} + n^{1.7})$. The point set $X$ then, by construction, only contains points which are farther than $r$ from any point in $C$, and is returned as $F$.

The algorithm for building the greedy permutation is very similar to the algorithm presented in (Eppstein, Har-Peled, and Sidiropoulos 2015). We build sequences of approximate r-nets, starting with a radius that is an approximation of the maximum distance within the point set. Then, we consecutively build approximate r-nets for smaller and smaller $r$, while keeping the centers of previously computed approximate r-nets. Putting the center points into a list in the order they get computed results in an approximate greedy permutation. For a full proof outline, refer to the supplementary material.
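The construction just outlined, together with the reduction to k-center described in the next paragraph, can be sketched as follows (ours; `approximate_r_net` is again a hypothetical stand-in for the net construction, points are assumed to be distinct hashable tuples, the radius schedule is simplified to halving, and the diameter and minimum distance are computed naively):

```python
def approximate_greedy_permutation(points, eps, approximate_r_net, dist):
    """Sketch: build r-nets for geometrically shrinking radii, keeping earlier
    centers in front.  The number of rounds is logarithmic in the spread."""
    diameter = max(dist(p, q) for p in points for q in points)
    min_dist = min(dist(p, q) for i, p in enumerate(points) for q in points[i + 1:])
    perm, chosen = [], set()
    r = diameter
    while r >= min_dist / 2:
        for c in approximate_r_net(points, r, eps):
            if c not in chosen:          # keep previously found centers, append new ones
                chosen.add(c)
                perm.append(c)
        r /= 2
    for p in points:                     # append any point that never became a center
        if p not in chosen:
            perm.append(p)
    return perm

def k_center_from_permutation(perm, k, dist):
    """First k points of a greedy permutation serve as k-center centers; the
    radius is the distance of the (k+1)-th point to its closest chosen center."""
    centers = perm[:k]
    radius = min(dist(perm[k], c) for c in centers) if len(perm) > k else 0.0
    return centers, radius
```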
**Theorem 12.** For a point set $X \subseteq \mathbb{R}^d$ with $|X| = n$, $\epsilon \geq \frac{4\log^6(\log n)}{\log^3 n}$ and the norm $\|\cdot\|$ that denotes the ℓ1 or ℓ2 norm, with high probability, we can compute a $(1+\epsilon)$-approximate greedy permutation in time $\tilde{O}((dn + n^{2-\alpha}) \log \Phi)$, where $\alpha = \Omega(\frac{\epsilon^{1/3}}{\log(1/\epsilon)})$ and $\Phi = \frac{\max_{x,y \in X, x \neq y} \|x-y\|}{\min_{x,y \in X, x \neq y} \|x-y\|}$ is the spread of the data $X$.

**k-center with approximate greedy permutation.** In (Gonzalez 1985), Gonzalez proved that an exact greedy permutation leads to a 2-approximation of the solution for the k-center objective if we take the first $k$ elements of the permutation and declare them cluster centers. The maximum radius of a cluster is then the minimum distance of the $(k+1)$-th point in the permutation to one of the first $k$ points. With a $(1+\epsilon)$-approximate greedy permutation we can then derive a $(2+\epsilon)$-approximate solution for the k-center problem, since for every element $\pi_i$ and every $r_i$ as in the definition of the approximate greedy permutation we know that, in the metric space $(V, \|\cdot\|)$, it holds that $r_i \leq \max_{u \in V} \min_{j \in \{1,\ldots,i\}} \|\pi_j - u\| \leq (1+\epsilon)r_i$. Thus the radius used, if we take the first $k$ elements of the approximate greedy permutation as cluster centers, is at most a $(1+\epsilon)$ factor larger than the radius we would use by taking the first $k$ elements of the exact greedy permutation, which in turn is at most a factor $2$ larger than the exact k-center clustering radius.

## Conclusion & Future Work

Our work has led to interesting improvements on the construction time of approximate r-nets and applications thereof. We wish to highlight the following open problems. First, can we find a lower bound on the construction time of r-nets? This would also tell us more about the limits of the Net & Prune framework. Second, can we get rid of the spread dependency of the approximate greedy permutation algorithm, as this would make the algorithm suitable for much more general data sets? Our work seems to suggest that this is tightly coupled to finding all nearest neighbors.

## Acknowledgments

We thank Ioannis Psarros for the helpful discussions. Yuyi Wang is partially supported by X-Order Lab.

## References

Alman, J.; Chan, T. M.; and Williams, R. 2016. Polynomial representations of threshold functions and algorithmic applications. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), 467–476. IEEE.

Avarikioti, G.; Emiris, I. Z.; Kavouras, L.; and Psarros, I. 2017. High-dimensional approximate r-nets. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, 16–30. SIAM.

Coppersmith, D. 1982. Rapid multiplication of rectangular matrices. SIAM J. Comput. 11:467–471.

Dasgupta, S. 2002. Performance guarantees for hierarchical clustering. In Kivinen, J., and Sloan, R. H., eds., Computational Learning Theory, 351–363. Berlin, Heidelberg: Springer Berlin Heidelberg.

Defays, D. 1977. An efficient algorithm for a complete link method. The Computer Journal 20(4):364–366.

Eppstein, D.; Har-Peled, S.; and Sidiropoulos, A. 2015. Approximate greedy clustering and distance selection for graph metrics. CoRR abs/1507.01555.

Estivill-Castro, V. 2002.
Why so many clustering algorithms: a position paper. ACM SIGKDD Explorations Newsletter 4(1):65–75.

Fern, X. Z., and Brodley, C. E. 2003. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), 186–193.

Gonzalez, T. F. 1985. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science 38:293–306.

Har-Peled, S., and Raichel, B. 2015. Net and prune: A linear time algorithm for Euclidean distance problems. Journal of the ACM (JACM) 62(6):44.

Hsu, W.-L., and Nemhauser, G. L. 1979. Easy and hard bottleneck location problems. Discrete Applied Mathematics 1(3):209–215.

Johnson, W. B., and Lindenstrauss, J. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26(189-206):1.

Johnson, W. B., and Schechtman, G. 1982. Embedding $\ell_p^m$ into $\ell_1^n$. Acta Mathematica 149(1):71–85.

Kriegel, H.-P.; Kröger, P.; Sander, J.; and Zimek, A. 2011. Density-based clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1(3):231–240.

Lloyd, S. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2):129–137.

Matoušek, J. 2008. On variants of the Johnson-Lindenstrauss lemma. Random Struct. Algorithms 33(2):142–156.

Sibson, R. 1973. SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal 16(1):30–34.

Williams, R. 2014. New algorithms and lower bounds for circuits with linear threshold gates. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, 194–202. ACM.