# k-variates++: more pluses in the k-means++

Richard Nock (RICHARD.NOCK@DATA61.CSIRO.AU)
Raphaël Canyasse (RAPHAEL.CANYASSE@POLYTECHNIQUE.EDU)
Roksana Boreli (R.BORELI@UNSW.EDU.AU)
Frank Nielsen (NIELSEN@LIX.POLYTECHNIQUE.FR)

Data61, Sydney, Australia & {The Australian National University; The University of New South Wales}
Ecole Polytechnique, Palaiseau, France & {The Technion, Haifa, Israel; Sony CS Labs, Inc., Tokyo, Japan}

Abstract

k-means++ seeding has become a de facto standard for hard clustering algorithms. In this paper, our first contribution is a two-way generalisation of this seeding, k-variates++, that includes the sampling of general densities rather than just a discrete set of Dirac densities anchored at the point locations, and a generalisation of the well-known Arthur-Vassilvitskii (AV) approximation guarantee, in the form of a bias+variance approximation bound on the global optimum. This approximation exhibits a reduced dependency on the "noise" component with respect to the optimal potential, actually approaching the statistical lower bound. We show that k-variates++ reduces to efficient (biased seeding) clustering algorithms tailored to specific frameworks; these include distributed, streaming and on-line clustering, with direct approximation results for these algorithms. Finally, we present a novel application of k-variates++ to differential privacy. For either the specific frameworks considered here or the differential privacy setting, there is little to no prior work on the direct application of k-means++ and its approximation bounds; state-of-the-art contenders appear to be significantly more complex and/or display less favorable (approximation) properties. We stress that our algorithms can still be run in cases where there is no closed-form solution for the population minimizer. We demonstrate the applicability of our analysis via experimental evaluation on several domains and settings, displaying competitive performances vs the state of the art.

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

1. Introduction

Arthur-Vassilvitskii's (AV) k-means++ algorithm has been extensively used to address the hard membership clustering problem, due to its simplicity, experimental performance and guaranteed approximation of the global optimum; the goal is the k-partitioning of a dataset so as to minimize the sum of within-cluster squared distances to the cluster center (Arthur & Vassilvitskii, 2007), i.e., a centroid or a population minimizer (Nock et al., 2016b). The k-means++ non-uniform seeding approach has also been utilized in more complex settings, including tensor clustering, distributed, data stream, on-line and parallel clustering, clustering with non-metric distortions and even clustering with distortions not allowing population minimizers in closed form (Ailon et al., 2009; Balcan et al., 2013; Jegelka et al., 2009; Liberty et al., 2014; Nielsen & Nock, 2014; Nielsen et al., 2014; Nielsen & Nock, 2015; Nock et al., 2008). However, apart from the non-uniform seeding, all these algorithms are distinct and (seemingly) do not share many common properties. Finally, the application of k-means++ in some scenarios is still an open research topic due to the related constraints; e.g., there is limited prior work in the differentially private setting (Nissim et al., 2007; Wang et al., 2015).
Our contribution. In a nutshell, we describe a generalisation of the k-means++ seeding process, k-variates++, which still delivers an efficient approximation of the global optimum, and can be used to obtain and analyze efficient algorithms for a wide range of settings, including distributed, streamed, on-line and (differentially) private clustering. We proceed in two steps. First, we describe k-variates++ and analyze its approximation properties. We leverage two major components of k-means++: (i) data-dependent probes (specialized to the observed data in k-means++) are used to compute the weights for selecting centers, and (ii) the selection of centers is based on an arbitrary family of densities (specialized to Diracs in k-means++). Informally, the approximation properties (when only (ii) is considered) can be stated as:

$$\text{expected cost(k-variates++)} \;\leq\; (2 + \log k)\cdot\Phi, \quad \text{with } \Phi \doteq 6\cdot(\text{optimal noise-free cost}) + 2\cdot\text{noise (bias + variance)},$$

where "noise" refers to the family of densities (note that the constants are explicit in the bound). The dependence on these densities is arguably smaller than expected (factor 2 for the noise vs 6 for the global optimum). There is also not much room for improvement: we show that the guarantee approaches the Fréchet-Cramér-Rao-Darmois lower bound. Second, we use this general algorithm in two ways. We use it directly in a differential privacy setting, addressing a conjecture of (Nissim et al., 2007) with weaker assumptions. We also demonstrate the use of this algorithm for a reduction to other biased seeding algorithms for distributed, streamed or on-line clustering, and obtain approximation bounds for these algorithms. This simple reduction technique allows us to analyze lightweight algorithms that compare favorably to the state of the art in the related domains (Ailon et al., 2009; Balcan et al., 2013; Liberty et al., 2014), from the approximation, assumptions and/or complexity standpoints. Experiments against the state of the art for the distributed and differentially private settings display that solid performance improvements can be obtained.

The rest of this paper is organised as follows: Section 2 presents k-variates++. Section 3 presents approximation properties for distributed, streamed and on-line clustering algorithms obtained by reduction from k-variates++. Section 4 presents direct applications of k-variates++ to differential privacy. Section 5 presents experimental results. The last Section discusses extensions (to more distortion measures) and concludes. In order not to overload the paper's body, a Supplementary Information (SI) provides all proofs and extensive experiments not shown here (Nock et al., 2016a).

2. k-variates++

We consider the hard clustering problem (Banerjee et al., 2005; Nock et al., 2016b): given a set $A \subset \mathbb{R}^d$ and an integer $k > 0$, find centers $C \subset \mathbb{R}^d$ minimizing the $L_2^2$ potential to the centers (here, $c(a) \doteq \arg\min_{c \in C} \|a - c\|_2^2$):

$$\phi(A; C) \;\doteq\; \sum_{a \in A} \|a - c(a)\|_2^2. \qquad (2)$$

Figure 1. Graphical model for the k-means++ seeding process (black) and our generalisation (black + red, best viewed in color).

Algorithm 0 describes k-variates++; $u_m$ denotes the uniform distribution over $A$ ($|A| = m$). The parenthood with k-means++ seeding, which we name k-means++ for short [1] (Arthur & Vassilvitskii, 2007), can be best understood using Figure 1 (the red parts in Figure 1 are pinpointed in Algorithm 0). k-means++ is a random process that generates cluster centers from observed data $A$.
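To make the objective in eq. (2) concrete, here is a minimal NumPy sketch (ours, not from the paper) of the potential $\phi(A;C)$; the function name and the toy data are illustrative only.

```python
import numpy as np

def potential(A, C):
    """L2^2 potential of eq. (2): sum over a in A of ||a - c(a)||^2,
    where c(a) is the nearest center of a in C."""
    # Pairwise squared distances between the m points and the k centers.
    d2 = ((A[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # shape (m, k)
    return d2.min(axis=1).sum()

# Toy usage: 5 points in R^2, 2 centers.
A = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [6., 5.]])
C = np.array([[0., 0.], [5., 5.]])
print(potential(A, C))   # 3.0
```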
The k-means++ process can be modelled using a two-stage generative process for a mixture of Dirac distributions: the first stage involves a random variable $Q_t \sim \mathrm{Mult}(m, \pi_t)$ whose parameters $\pi_t \in \Delta_m$ (the $m$-dimensional probability simplex) are computed from the data and the previous centers; sampling $Q_t$ chooses the Dirac distribution, which is then sampled for one center (and the process iterates). The crux of the technique is the design of $\pi_t$, which, under no assumption on the data, yields in expectation a k-means potential for the chosen centers that is within a factor $8(2 + \log k)$ of the global optimum (Arthur & Vassilvitskii, 2007). k-variates++ generalizes the process in two ways. First, the update of $\pi_t$ depends on the data and the previous probes, using a sequence of probe functions $\tau_t : A \to \mathbb{R}^d$ ($\tau_t = \mathrm{Id}$, the identity function, $\forall t$, in k-means++). Second, the Diracs are replaced by arbitrary but fixed local distributions (sometimes called noisy), with parameters [2] $(\mu_a, \theta_a)$ depending on $A$.

Algorithm 0 k-variates++
Input: data $A \subset \mathbb{R}^d$ with $|A| = m$, $k \in \mathbb{N}_*$, densities $p_{(\mu_a,\theta_a)}$, $a \in A$, probe functions $\tau_t : A \to \mathbb{R}^d$ ($t \geq 1$);
Step 1: Initialise centers $C \leftarrow \emptyset$;
Step 2: for $t = 1, 2, ..., k$
  2.1: randomly sample $a \sim q_t$ over $A$, with $q_1 \doteq u_m$ and, for $t > 1$,
  $$q_t(a) \doteq \frac{D_t(a)}{\sum_{a' \in A} D_t(a')}, \quad \text{where } D_t(a) \doteq \min_{x \in C} \|\tau_t(a) - x\|_2^2; \qquad (1)$$
  2.2: randomly sample $x \sim p_{(\mu_a,\theta_a)}$;
  2.3: $C \leftarrow C \cup \{x\}$;
Output: $C$;

Let $C_{\mathrm{opt}} \subset \mathbb{R}^d$ denote the set of $k$ centers minimizing (2) on $A$. Let $c_{\mathrm{opt}}(a) \doteq \arg\min_{c \in C_{\mathrm{opt}}} \|a - c\|_2^2$ ($a \in A$), and

$$\phi_{\mathrm{opt}} \;\doteq\; \sum_{a \in A} \|a - c_{\mathrm{opt}}(a)\|_2^2, \qquad (3)$$
$$\phi_{\mathrm{bias}} \;\doteq\; \sum_{a \in A} \|\mu_a - c_{\mathrm{opt}}(a)\|_2^2, \qquad (4)$$
$$\phi_{\mathrm{var}} \;\doteq\; \sum_{a \in A} \mathrm{tr}(\Sigma_a). \qquad (5)$$

$\phi_{\mathrm{opt}}$ is the optimal noise-free potential, $\phi_{\mathrm{bias}}$ is the bias of the noise [3], and $\phi_{\mathrm{var}}$ its variance, with $\Sigma_a \doteq \mathbb{E}_{x \sim p_a}[(x - \mu_a)(x - \mu_a)^\top]$ the covariance matrix of $p_a$. Notice that when $\mu_a = a$, $\phi_{\mathrm{bias}} = \phi_{\mathrm{opt}}$. Otherwise, it may hold that $\phi_{\mathrm{bias}} < \phi_{\mathrm{opt}}$, and even $\phi_{\mathrm{bias}} = 0$ if the expectations coincide with $C_{\mathrm{opt}}$. Let $\mathbb{C}_{\mathrm{opt}}$ denote the partition of $A$ according to the centers in $C_{\mathrm{opt}}$.

[1] Both approaches can be completed with the same further local monotonous optimization steps, like Lloyd or Hartigan iterations; furthermore, it is the biased seeding which holds the approximation properties of k-means++.
[2] Because expectations are the major parameter for clustering, we split the parameters in the form of $\mu_a$ (expectation) and $\theta_a$ (other parameters, e.g. the covariance matrix).
[3] We term it bias by analogy with supervised classification, considering that the expectations of the densities could be used as models for the cluster centers (Kohavi & Wolpert, 1996).

We say that a probe function $\tau_t$ is $\eta$-stretching if, informally, replacing points by their probes does not significantly distort the observed potential of an optimal cluster, with respect to its actual optimal potential. The formal definition follows.

Definition 1 Probe functions $\tau_t$ are said $\eta$-stretching on $A$, for some $\eta \geq 0$, iff the following holds: for any cluster $A \in \mathbb{C}_{\mathrm{opt}}$ with $|A| > 1$, any $a_0 \in A$ such that $\phi(\tau_t(A); \{\tau_t(a_0)\}) \neq 0$, and any set of $k$ centers $C \subset \mathbb{R}^d$,

$$\frac{\phi(A; C)}{\phi(A; \{a_0\})} \;\leq\; (1 + \eta)\,\frac{\phi(\tau_t(A); C)}{\phi(\tau_t(A); \{\tau_t(a_0)\})}, \quad \forall t. \qquad (6)$$

Since $\phi(A; C_{\mathrm{opt}}) = \frac{1}{2|A|}\sum_{a_0 \in A} \phi(A; \{a_0\})$ for any $A \in \mathbb{C}_{\mathrm{opt}}$ (Arthur & Vassilvitskii, 2007, Lemma 3.2), Definition 1 roughly states that the potential of an optimal cluster with respect to a set of cluster centers, relative to its potential with respect to the optimal set of centers, does not blow up through the probe function $\tau_t$. The identity function is trivially 0-stretching, for any $A$.
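The seeding loop of Algorithm 0 is short enough to sketch directly. The following Python code is our illustrative rendering under simple assumptions (identity probes by default, and a user-supplied `sample_local` callback standing in for the densities $p_{(\mu_a,\theta_a)}$); with Dirac densities it collapses to plain k-means++ seeding.

```python
import numpy as np

def k_variates_pp(A, k, sample_local, probe=lambda A, t: A, rng=None):
    """Sketch of Algorithm 0. `sample_local(a, rng)` draws one center from the
    local density attached to point a; `probe(A, t)` returns the probes tau_t(A).
    With Dirac densities (sample_local = lambda a, rng: a) and identity probes,
    this is exactly k-means++ seeding."""
    rng = np.random.default_rng() if rng is None else rng
    m = len(A)
    C = [sample_local(A[rng.integers(m)], rng)]       # t = 1: uniform q_1
    for t in range(2, k + 1):
        P = probe(A, t)                                # tau_t(A)
        D = ((P[:, None, :] - np.array(C)[None, :, :]) ** 2).sum(axis=2).min(axis=1)  # eq. (1)
        a = A[rng.choice(m, p=D / D.sum())]            # step 2.1
        C.append(sample_local(a, rng))                 # step 2.2
    return np.array(C)

# Example with Gaussian local densities of small isotropic covariance.
gauss = lambda a, rng: a + 0.05 * rng.standard_normal(a.shape)
rng0 = np.random.default_rng(0)
A = np.vstack([rng0.normal(mu, 0.2, (50, 2)) for mu in ([0, 0], [4, 4], [0, 4])])
print(k_variates_pp(A, 3, gauss))
```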
Many local transformations would be eligible as $\eta$-stretching probe functions with $\eta$ small, including local translations, mappings to core-sets (Har-Peled & Mazumdar, 2004), mappings to Voronoi diagram cell centers (Boissonnat et al., 2010), etc. Notice that ineq. (6) has to hold only for optimal clusters, and not for any clustering of $A$. Let $\mathbb{E}[\phi(A; C)] \doteq \int \phi(A; C)\,\mathrm{d}p(C)$ denote the expected potential over the random sampling of $C$ in k-variates++.

Theorem 2 For any dataset $A$, any sequence of $\eta$-stretching probe functions $\tau_t$ and any densities $\{p_a, a \in A\}$, the expected potential of k-variates++ satisfies:

$$\mathbb{E}[\phi(A; C)] \;\leq\; (2 + \log k)\,\Phi, \qquad (7)$$

with $\Phi \doteq (6 + 4\eta)\phi_{\mathrm{opt}} + 2\phi_{\mathrm{bias}} + 2\phi_{\mathrm{var}}$. (Proof in SI, Subsection 2.1.)

Five remarks are in order. First, we retrieve AV's bound in their setting ($\eta = \phi_{\mathrm{var}} = 0$, $\phi_{\mathrm{bias}} = \phi_{\mathrm{opt}}$) (Arthur & Vassilvitskii, 2007). Second, we may beat AV's bound when $\phi_{\mathrm{bias}} < \phi_{\mathrm{opt}}$, which is essentially domain or setting dependent. Third, apart from being $\eta$-stretching, there is no constraint on the choice of the probe functions $\tau_t$: they can be randomized, iteration dependent, etc. Fourth, the algorithm can easily be generalized to the case where points are weighted. Last, as we show in the following Lemma, the dependence on the noise in ineq. (7) can hardly be improved in our framework.

Lemma 3 Suppose each point in $A$ is replaced (i.i.d.) by a point sampled from $p_a$ with $\Sigma_a = \Sigma$. Then any clustering algorithm suffers $\mathbb{E}[\phi(A; C)] = \Omega(|A|\,\mathrm{tr}(\Sigma))$. (Proof in SI, Subsection 2.2.)

We make use of k-variates++ in two different ways. First, we show that it can be used to prove approximation properties for algorithms operating in different clustering settings: distributed, streamed and on-line clustering. The proof involves a reduction (explained in SI, Section 2.4) from k-variates++ to each of these algorithms. By reduction, we mean that there exist distributions and probe functions (even non poly-time computable) for which k-variates++ yields the same result in expectation as the other algorithm, thus directly yielding an approximability ratio of the global optimum for this latter algorithm via Theorem 2. A key is the choice of the probe functions, which, if not the identity function, models the approximation of the seeding process over the whole set by a subset of it, or by an alternative set. Second, we show how k-variates++ can directly be specialized for settings for which no efficient application of k-means++ was known.
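As a worked illustration of Theorem 2, the sketch below (ours) evaluates $\phi_{\mathrm{opt}}$, $\phi_{\mathrm{bias}}$, $\phi_{\mathrm{var}}$ of eqs. (3)-(5) and the resulting $\Phi$ and $(2+\log k)\,\Phi$, assuming a set of reference centers is available; in the Dirac case ($\mu_a = a$, $\Sigma_a = 0$) it returns $\Phi = 8\phi_{\mathrm{opt}}$, i.e. AV's constant.

```python
import numpy as np

def theorem2_bound(A, C_opt, mu, Sigma, k, eta=0.0):
    """Evaluate Phi = (6 + 4*eta)*phi_opt + 2*phi_bias + 2*phi_var and the
    guarantee (2 + log k) * Phi of Theorem 2, using eqs. (3)-(5).
    mu[i], Sigma[i]: mean and covariance of the local density attached to A[i]."""
    d2 = ((A[:, None, :] - C_opt[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                      # index of c_opt(a) for each a
    phi_opt  = ((A  - C_opt[nearest]) ** 2).sum()    # eq. (3)
    phi_bias = ((mu - C_opt[nearest]) ** 2).sum()    # eq. (4)
    phi_var  = sum(np.trace(S) for S in Sigma)       # eq. (5)
    Phi = (6 + 4 * eta) * phi_opt + 2 * phi_bias + 2 * phi_var
    return Phi, (2 + np.log(k)) * Phi

# Dirac case (mu_a = a, Sigma_a = 0): Phi reduces to 8 * phi_opt, AV's constant.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 2))
C_ref = A[:3]                                        # reference centers, illustrative only
print(theorem2_bound(A, C_ref, mu=A, Sigma=np.zeros((100, 2, 2)), k=3))
```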
3. Reductions from k-variates++

Despite its tremendous advantages, k-means++ has a serious downside: it is difficult to parallelize, distribute or stream under relevant communication, space, privacy and/or time resource constraints (Bahmani et al., 2012). Although extending k-means clustering to these settings has been a major research area in recent years, there has been no obvious solution for tailoring k-means++ (Ackermann et al., 2010; Ailon et al., 2009; Bahmani et al., 2012; Balcan et al., 2013; Liberty et al., 2014; Shindler et al., 2011) (and others).

| Ref. | Property | Them | Us |
|---|---|---|---|
| (I) (Balcan et al., 2013) | Communication complexity | $\Omega((nkd/\varepsilon^4) + n^2 k \ln(nk))$ | $O(n^2 k)$ |
| (II) (Balcan et al., 2013) | Data points shared | $\Omega((kd/\varepsilon^4) + nk\ln(nk))$ | $k$ |
| (III) (Balcan et al., 2013) | Approximation bound | $(2+\log k)(1+\varepsilon)\cdot 8\phi_{\mathrm{opt}}$ | $(2+\log k)\,(10\phi_{\mathrm{opt}} + 6\phi^F_s)$ |
| (i) (Ailon et al., 2009) | Time complexity (outer loop) | identical | identical |
| (ii) (Ailon et al., 2009) | Approximation bound | $(2+\log k)(1+\eta)\cdot 32\phi_{\mathrm{opt}}$ | $(2+\log k)\,((8+4\eta)\phi_{\mathrm{opt}} + 2\phi^\psi_s)$ |
| (a) (Liberty et al., 2014) | Knowledge required | lower bound on $\phi_{\mathrm{opt}}$ | none |
| (b) (Liberty et al., 2014) | Approximation bound | $O(\log m \cdot \phi_{\mathrm{opt}})$ | $(2+\log k)\,(4 + 32/\varsigma^2)\,\phi_{\mathrm{opt}}$ |
| (A) (Nissim et al., 2007) | Knowledge required | $\lambda(\phi_{\mathrm{opt}})$ | none |
| (B) (Nissim et al., 2007) | Noise variance ($\sigma$) | $O(\lambda k R/\epsilon)$ | $O(R/(\epsilon + \log m))$ |
| (C) (Nissim et al., 2007) | Approximation bound | $O(\phi_{\mathrm{opt}} + m\lambda^2 k R^2/\epsilon^2)$ | $O(\log k\,(\phi_{\mathrm{opt}} + m R^2/(\epsilon+\log m)^2))$ |

Table 1. Comparison with some state-of-the-art approaches for distributed (I-III), streamed (i, ii) and on-line clustering (a, b), and differential privacy (A-C). We refer to the related sections for notations. The SI (Nock et al., 2016a) provides a more extensive table.

Distributed clustering. We consider data horizontally partitioned among peers. Distributed clustering performs a synchronous computation of k-means++, as shown in Algorithm 1.

Algorithm 1 Dk-means++ (// PDk-means++)
Input: Forgy nodes $(F_i, A_i)$, $i \in [n]$;
for $t = 1, 2, ..., k$
  Round 1: $N$ picks $i \sim q^D_t$ over $[n]$ and asks $F_i$ for a center;
  Round 2: $F_i$ picks $a \sim u_i$ over $A_i$ and sends $a$ to $F_{i'}$, $\forall i'$;
    // PDk-means++: $F_i$ sends $x \sim p_{(\mu_a,\theta_a)}$ to $F_{i'}$, $\forall i'$;
  Round 3: $\forall i$, $F_i$ updates $D_t(A_i)$ and sends it to $N$;
Output: $C$ = set of broadcast $a$'s (or $x$'s);

There are two readings of Algorithm 1. Without further assumptions about the distributed setting, node $N$ denotes one of the peers, which randomly selects one of the $n$ peers that is going to sample a centroid with Forgy clustering. We have $n$ such Forgy nodes, $(F_i, A_i)$, $i \in [n]$, where $A_i$ is the dataset held by $F_i$. Our result works regardless of the way $N$ is elicited: it can be a fixed peer, a peer that varies through time, or a special node (eventually not being a peer). The notion of having a special node in collaborative services is common in a number of distributed domains, e.g. in hybrid or server-assisted peer-to-peer networks (Yang & Garcia-Molina, 2001). One could also imagine that Forgy nodes are not computationally intensive and just able to perform uniform sampling in their data, so that $N$ is a different node that performs the non-uniform sampling. This setting complies with privacy constraints in which data sharing between nodes is limited. In particular, we can enforce that $N$ is not allowed to handle any data (points) from the Forgy nodes. We therefore split the location of the computational power from the location of the data. We can also prevent the Forgy nodes from exchanging any data between themselves, with the sole exception of cluster centers. We note that none of the algorithms of (Ailon et al., 2009; Balcan et al., 2013; Bahmani et al., 2012) would be applicable to this setting without non-trivial modifications affecting their properties.

Algorithm 1 includes two variants: a protected version, Dk-means++, where Forgy nodes directly share local centers, and a private version, PDk-means++, where the nodes share noisy centers, so as to ensure a differentially private release of centers (with relevant noise calibration).
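Before detailing the notations, here is a single-process simulation (ours; no real networking, and the function and variable names are illustrative) of Algorithm 1's three rounds; passing a noise callback turns it into the PDk-means++ variant.

```python
import numpy as np

def dk_means_pp(parts, k, noise=None, rng=None):
    """Simulate Algorithm 1 on `parts`, a list of per-node datasets A_i.
    Round 1: node N picks i with prob. D_t(A_i) / sum_j D_t(A_j) (uniform at t = 1).
    Round 2: node F_i samples a point uniformly and broadcasts it
             (PDk-means++: broadcasts a noisy version via `noise` instead).
    Round 3: every node refreshes its local D_t(A_i) for the next iteration."""
    rng = np.random.default_rng() if rng is None else rng
    n, C = len(parts), []
    for t in range(1, k + 1):
        if t == 1:
            q = np.full(n, 1.0 / n)
        else:
            Dt = np.array([((Ai[:, None, :] - np.array(C)[None, :, :]) ** 2)
                           .sum(axis=2).min(axis=1).sum() for Ai in parts])
            q = Dt / Dt.sum()
        i = rng.choice(n, p=q)                        # Round 1 (at node N)
        a = parts[i][rng.integers(len(parts[i]))]     # Round 2 (Forgy pick, u_i)
        C.append(a if noise is None else noise(a, rng))
    return np.array(C)

# Toy run with 4 Forgy nodes; `lap` makes it the private PDk-means++ variant.
rng = np.random.default_rng(0)
parts = [rng.normal(i, 1.0, (200, 5)) for i in range(4)]
lap = lambda a, r: a + r.laplace(0.0, 0.1, size=a.shape)
print(dk_means_pp(parts, k=3, noise=lap, rng=rng).shape)   # (3, 5)
```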
Notations used in Algorithm 1 are as follows. Let $D_t(A_i) \doteq \sum_{a \in A_i} D_t(a)$ and $q^D_{ti} \doteq D_t(A_i)\,\big(\sum_j D_t(A_j)\big)^{-1}$ if $t > 1$, and $q^D_{ti} \doteq 1/n$ otherwise. Also, $u_i$ is the uniform distribution over $A_i$.

Theorem 4 Let $\phi^F_s \doteq \sum_{i \in [n]} \sum_{a \in A_i} \|c(A_i) - a\|_2^2$ be the total spread of the Forgy nodes ($c(A_i) \doteq (1/m_i)\sum_{a \in A_i} a$). At iteration $k$, the expected potential on the total data $A \doteq \cup_i A_i$ satisfies ineq. (7) with

$$\Phi \;\doteq\; \begin{cases} 10\phi_{\mathrm{opt}} + 6\phi^F_s & \text{(Dk-means++)} \\ 10\phi_{\mathrm{opt}} + 4\phi^F_s + 2\phi_{\mathrm{var}} & \text{(PDk-means++)} \end{cases}. \qquad (8)$$

Here, $\phi_{\mathrm{opt}}$ is the optimal potential on the total data $A$. (Proof in SI, Subsection 2.4.)

We note that the optimal potential is defined on the total data. The dependence on $\phi^F_s$, which is just the peer-wise variance of the data, is thus rather intuitive. A positive point is that $\phi^F_s$ is weighted by a factor smaller than the factor that weights the optimal potential. Another positive point is that this parameter can be computed from the data, and among peers, without disclosing more data. Hence, it may be possible to estimate the loss against the centralized k-means++ setting, taking eq. (8) as reference. To gain insight into the leverage that Theorem 4 provides, Table 1 compares Dk-means++ to (Balcan et al., 2013) ($\varepsilon$ is the coreset approximation parameter), even though the latter approach would not be applicable to our restricted framework. To be fair, we assume that the algorithm used to cluster the coreset in (Balcan et al., 2013) is k-means++. We note that, considering the communication complexity and the number of data points shared, Algorithm 1 is a clear winner. In fact, Algorithm 1 can also win from the approximability standpoint. The dependence on $\varepsilon$ prevents fixing it too small in (Balcan et al., 2013). Comparing the bounds in row (III) shows that if $\varepsilon > 1/4$, then we can also be better from the approximability standpoint if the spread satisfies $\phi^F_s = O(\phi_{\mathrm{opt}})$. While this may not be feasible over arbitrary data, it becomes more realistic in several real-world scenarios, when Forgy nodes aggregate local data with respect to features, e.g., state-wise insurance data, city-wise financial data, etc. When $n$ increases, this also becomes more realistic.

Streaming clustering. We have access to a stream $S$, of assumed finite size: $S$ is a sequence of points $a_1, a_2, ..., a_m$. We authorise the computation / output of the clustering at the end of the stream, but the memory $n$ allowed for all operations satisfies $n < m$, such as $n = m^\alpha$ with $\alpha < 1$ in (Ailon et al., 2009). We assume for simplicity that each point can be stored in one memory unit. Algorithm 2 (Sk-means++) presents our approach. It relies on the standard trick of summarizing massive datasets via compact representations (synopses) before processing them (Indyk et al., 2014).

Algorithm 2 Sk-means++
Input: stream $S$;
Step 1: $\hat{S} \doteq \{(s_j, m_j), j \in [n]\} \leftarrow \mathrm{SYNOPSIS}(S, n)$;
Step 2: for $t = 1, 2, ..., k$
  2.1: if $t = 1$ then sample $s_j \sim u_n$ over $\hat{S}$, else sample $s_j \sim q^S_t$ over $\hat{S}$ with
  $$q^S_t(s_j) \doteq \frac{m_j\,D_t(s_j)}{\sum_{j' \in [n]} m_{j'}\,D_t(s_{j'})} \quad // \; D_t(s_j) \doteq \min_{c \in C} \|s_j - c\|_2^2;$$
  2.2: $C \leftarrow C \cup \{s_j\}$;
Output: cluster centers $C$;

The approximation properties of Sk-means++, proven using a reduction from k-variates++, hold regardless of the way the synopses are built. They show that two key parameters may guide this choice: the spread of the synopses, analogous to the spread of Forgy nodes for distributed clustering, and the stretching properties of the synopses used as centers.
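A sketch of Algorithm 2 follows (ours). The `synopsis` helper is only a placeholder SYNOPSIS routine (uniform representatives weighted by nearest-point counts); as stated above, any synopsis construction can be used instead.

```python
import numpy as np

def synopsis(S, n, rng):
    """Placeholder SYNOPSIS: n uniformly sampled representatives s_j, each
    weighted by the number m_j of stream points it is nearest to.  Coresets or
    any other synopsis construction could be plugged in here instead."""
    reps = S[rng.choice(len(S), size=n, replace=False)]
    near = ((S[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return reps, np.bincount(near, minlength=n).astype(float)

def sk_means_pp(S, k, n, rng=None):
    """Sketch of Algorithm 2: weighted D^2 seeding over the synopsis set."""
    rng = np.random.default_rng() if rng is None else rng
    reps, w = synopsis(S, n, rng)
    C = [reps[rng.integers(n)]]                   # t = 1: uniform over the synopses
    for t in range(2, k + 1):
        D = ((reps[:, None, :] - np.array(C)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        q = w * D                                  # m_j * D_t(s_j)
        C.append(reps[rng.choice(n, p=q / q.sum())])
    return np.array(C)

S = np.random.default_rng(0).normal(size=(5000, 3))
print(sk_means_pp(S, k=4, n=100).shape)           # (4, 3)
```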
Theorem 5 Let $\psi(a) \doteq \arg\min_{s \in \hat{S}} \|a - s\|_2^2$, $\forall a \in S$. Let $\phi^\psi_s \doteq \sum_{a \in S} \|\psi(a) - a\|_2^2$ be the spread of $\psi$ on the synopsis set $\hat{S}$. Let $\eta > 0$ be such that $\psi$ is $\eta$-stretching on $S$. Then the expected potential of Sk-means++ on stream $S$ satisfies ineq. (7) with $\Phi \doteq (8 + 4\eta)\phi_{\mathrm{opt}} + 2\phi^\psi_s$. Here, $\phi_{\mathrm{opt}}$ is the optimal potential on stream $S$. (Proof in SI, Subsection 2.4.)

It is not surprising to see that Sk-means++ looks like a generalization of (Ailon et al., 2009), and it almost matches it (up to the number of centers delivered) when the $k'$ $(> k)$ synopses are learned from k-means#. Yet, we rely on a different and more general analysis of its approximation properties. Table 1 compares the properties of Sk-means++ to (Ailon et al., 2009) ($\eta$ relates to the approximation of the k-means objective in their inner loop).

On-line clustering. This setting is probably the farthest from the original setting of the k-means++ algorithm. Here, points arrive in a sequence, finite, but of unknown size and too large to fit in memory (Liberty et al., 2014). We make no other assumptions: the sequence can be random, or chosen by an adversary. Therefore, the expected analysis we make is only with respect to the internal randomisation of the algorithm, i.e., for the fixed stream sequence as it is observed. We do not assume a feedback for learning (common for supervised learning); so, we do not assume that the algorithm has to predict a cluster for each point that arrives, yet it has to be easily modifiable to do so. Our approach is summarized in Algorithm 3 (OLk-means++), a variation of k-means++ which consists of splitting the stream $S$ into minibatches $S_j$, $j = 1, 2, ...$, each of which is used to sample one center. $u_1$ denotes the uniform distribution with support $S_1$. Let $R \doteq \max_{a,a' \in S} \|a - a'\|_2$ ($< \infty$) be the diameter of $S$.

Algorithm 3 OLk-means++
Input: minibatch $S_j$, current weighted centers $C$;
Step 1: if $j = 1$ then sample $s \sim u_1$ over $S_1$, else sample $s \sim q^O_j$ over $S_j$ with
  $$q^O_j(s) \doteq \frac{D_t(s)}{\sum_{s' \in S_j} D_t(s')} \quad // \; D_t(s) \doteq \min_{c \in C} \|s - c\|_2^2;$$
Step 2: $C \leftarrow C \cup \{s\}$;

Theorem 6 Let $\varsigma > 0$ be the largest real such that the following conditions are met (for any $A \in \mathbb{C}_{\mathrm{opt}}$, $j \geq 1$): for any set of at most $k$ centers $C$, $\sum_{a \in A} \|a - c(a)\|_2^2 \geq \varsigma\,\frac{|A|}{2}\,R^2$ and $\sum_{a \in A \cap S_j} \|a - c(a)\|_2^2 \geq \varsigma \sum_{a \in A} \|a - c(a)\|_2^2$ (with $c(a)$ defined in eq. (2)). Then the expected potential of OLk-means++ on stream $S$ satisfies ineq. (7) with $\Phi \doteq (4 + 32/\varsigma^2)\,\phi_{\mathrm{opt}}$, where $\phi_{\mathrm{opt}}$ is the optimal potential on stream $S$. (Proof in SI, Subsection 2.4.)

Notice that the loss function $\phi(S; C)$ in eq. (2) implies the finiteness of $S$, and hence the existence of $\varsigma > 0$; also, the second condition implies $\varsigma \leq 1$. In (Liberty et al., 2014), the clustering algorithm is required to have space and time at most polylog in the length of the stream. Hence, each minibatch can be reasonably large with respect to the stream; the larger the minibatches, the larger $\varsigma$. The knowledge of $\varsigma$ is not necessary to run OLk-means++; it is just a part of the approximation bound, which quantifies the loss in approximation due to the fact that centers are computed from a partial knowledge of the stream. Table 1 compares the properties of OLk-means++ to (Liberty et al., 2014) (we picked their fully on-line, non-heuristic algorithm). To compare the bounds, suppose that the batches have the same size $b$, so that $\log k = \log(m/b)$. If batches are at least of polylog size, then, up to what is hidden in the big-Oh notation, our approximation can be quite competitive when $\varsigma$ is large, e.g., if $d$ is large and optimal clusters are not too small.
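The on-line variant is equally compact; below is our sketch of one OLk-means++ step (Algorithm 3) together with a toy streaming loop, with minibatch sizes chosen arbitrarily for illustration.

```python
import numpy as np

def ol_k_means_pp_step(Sj, C, j, rng):
    """One OLk-means++ step (Algorithm 3): add one center from minibatch S_j.
    First minibatch: uniform pick; afterwards: D^2 weights w.r.t. current C."""
    if j == 1:
        s = Sj[rng.integers(len(Sj))]
    else:
        D = ((Sj[:, None, :] - np.array(C)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        s = Sj[rng.choice(len(Sj), p=D / D.sum())]
    return C + [s]

# Toy streaming usage: consume k = 3 minibatches, one center each.
rng = np.random.default_rng(0)
stream = (rng.normal(c, 0.3, (64, 2)) for c in ([0, 0], [5, 0], [0, 5]))
C = []
for j, Sj in enumerate(stream, start=1):
    C = ol_k_means_pp_step(Sj, C, j, rng)
print(np.array(C))
```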
4. Direct use of k-variates++

The most direct application domain of k-variates++ is differential privacy. Several algorithms have independently emphasised the idea that powerful mechanisms may be amended via a carefully designed noise mechanism to broaden their scope with new capabilities, without overly challenging their original properties. Examples abound (Hardt & Price, 2014; Kalai & Vempala, 2005; Chaudhuri et al., 2011; Chichignoud & Lousteau, 2014), etc. Few approaches are related to clustering, and yet the injected noise is large; the existence of a smaller, sufficient noise was conjectured in (Nissim et al., 2007), and the existing approaches rely on a variety of assumptions or knowledge about the optimum (see Table 1) (Nissim et al., 2007; Wang et al., 2015). To apply k-variates++, we consider that $\tau_t = \mathrm{Id}$, $\forall t$, and assume there exists $R > 0$ s.t. $\max_{a,a' \in A} \|a - a'\|_2 \leq R$ (a current assumption in the field (Dwork & Roth, 2014)).

A general likelihood ratio bound for k-variates++. We show that the likelihood ratio of the same clustering for two close instances is governed by two quantities that rely on the neighborhood function. Most importantly for differential privacy, when the densities $p_{(\mu_a,\theta_a)}$ are carefully chosen, this ratio always tends to 1 as a function of $m$, which is highly desirable for differential privacy. We let $\mathrm{NN}_N(a) \doteq \arg\min_{a' \in N} \|a - a'\|_2$ denote the nearest neighbour of $a$ in $N$, and let $c(A') \doteq (1/|A'|)\sum_{a \in A'} a$ for any $A' \subseteq A$.

Definition 7 We say that the neighborhood in $A$ is $\delta_w$-spread for some $\delta_w > 0$ iff for any $N \subset A$ with $|N| = k - 1$, and any $B \subset A$ with $|B| = |A| - 1$,

$$\sum_{a \in B} \|a - \mathrm{NN}_N(a)\|_2^2 \;\geq\; \frac{R^2}{\delta_w}.$$

Definition 8 We say that the neighborhood in $A$ is $\delta_s$-monotonic for some $\delta_s > 0$ iff the following holds: $\forall N \subset A$ with $|N| \in \{1, 2, ..., k-1\}$, for any $A' \subseteq A \setminus N$ which is $N$-packed, we have:

$$\sum_{a \in A'} \|a - \mathrm{NN}_N(a)\|_2^2 \;\leq\; (1 + \delta_s) \sum_{a \in A'} \|a - \mathrm{NN}_{N \cup \{c(A')\}}(a)\|_2^2. \qquad (12)$$

Set $A'$ is said $N$-packed iff there exists $x \in \mathbb{R}^d$ satisfying $x = \arg\min_{c \in N \cup \{x\}} \|a - c\|_2^2$, $\forall a \in A'$.

Figure 2. Checking that $\delta_s$ is small, for $N$ the set of crosses (+). Any set $A'$ of points close to each other, such as the black dots, would be $N$-packed (pick $x = c(A')$ in this case), but would fail to be $N$-packed if too spread (e.g., the red dot plus the black dots). Segments depict the Voronoi diagram of $N$. Best viewed in color.

It is worthwhile remarking that, as long as $k < |A|$, both $\delta_w > 0$ and $\delta_s > 0$ always exist. Informally, being $\delta_w$-spread requires that the sum of squared distances to any subset of $k - 1$ centers in $A$ must not be negligible against the diameter $R$. $\delta_s$ yields a statement that is a bit more technical, but it roughly reduces to stating that adding one center ($c(A')$) to any set of at most $k - 1$ centers should not significantly decrease the overall potential of a set of points that are already close to each other. Figure 2 provides a schematic view of the property, showing that the modifications of the potential can be very local, thus yielding a small $\delta_s$ in ineq. (12). The following Theorem uses the definition of neighbouring samples: samples $A$ and $A'$ are neighbours, written $A \approx A'$, iff they differ by one point. We also define $\mathbb{P}[C|A]$ to be the density of output $C$ given input data $A$.

Theorem 9 Fix $\tau_t = \mathrm{Id}$ ($\forall t$) and densities $p_{(\mu_.,\theta_.)}$ having the same support $\Omega$ in k-variates++. Suppose there exists $\varrho(R) > 0$ such that the densities $p_{(\mu_.,\theta_.)}$ satisfy the following pointwise likelihood ratio constraint:

$$\frac{p_{(\mu_{a'},\theta_{a'})}(x)}{p_{(\mu_a,\theta_a)}(x)} \;\leq\; \varrho(R), \quad \forall a, a' \in A,\ \forall x \in \Omega. \qquad (13)$$

Then, there exists a function $f(.)$ such that, for any $\delta_w, \delta_s > 0$ such that $A$ is $\delta_w$-spread and $\delta_s$-monotonic, for any $A' \approx A$, for any $k > 0$ and any $C \subset \Omega$ of size $k$ output by Algorithm k-variates++ on whichever of $A$ or $A'$, the likelihood ratio of $C$ given $A$ and $A'$ is upperbounded as:

$$\frac{\mathbb{P}[C\,|\,A']}{\mathbb{P}[C\,|\,A]} \;\leq\; (1 + \delta_w)^{k - 1 + \frac{f(k)}{\delta_w}}\,(1 + \delta_s)^{k-1}\,\varrho(R). \qquad (14)$$

(Proof in SI, Subsection 2.5.)
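To see how constraint (13) can be satisfied, the following sketch (ours) bounds the pointwise log-likelihood ratio for product Laplace densities, the choice later made in Theorem 12; the bound $\exp(\|\mu - \mu'\|_1/b)$ follows from the triangle inequality.

```python
import numpy as np

def laplace_log_ratio_bound(mu, mu_prime, b):
    """Pointwise log-likelihood-ratio bound for product Laplace densities with
    per-coordinate scale b: for every x,
      log p_{mu'}(x) - log p_{mu}(x) = sum_i (|x_i - mu_i| - |x_i - mu'_i|) / b
                                     <= ||mu - mu'||_1 / b   (triangle inequality),
    which plays the role of log(rho(R)) in constraint (13)."""
    return np.abs(np.asarray(mu) - np.asarray(mu_prime)).sum() / b

# Data in an L1 ball of radius R gives ||mu - mu'||_1 <= 2R, hence rho(R) <= exp(2R/b).
R, b = 1.0, 0.5
print(np.exp(laplace_log_ratio_bound([R, 0.0], [-R, 0.0], b)))   # exp(2R/b) ~ 54.6
```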
Notice that Theorem 9 makes just one assumption, (13), about the densities, so it can be applied in fairly general settings, such as for regular exponential families (Banerjee et al., 2005). These are a key choice because they extensively cover the domain of distortions for which the average is the population minimiser.

An (almost) distribution-free $1 + o(1)$ likelihood ratio. We now show that if $A$ is sampled i.i.d. from any distribution $D$ which satisfies the mild assumption of being locally bounded everywhere (or almost surely) in a ball, then with high probability the right-hand side of ineq. (14) is $1 + o(1)$, where the little-oh vanishes with $m$. The proof, of independent interest, involves an explicit bound on $\delta_w$ and $\delta_s$.

Theorem 10 Suppose $A$ with $|A| = m > 1$ is sampled i.i.d. from a distribution $D$ whose support contains an $L_2$ ball $B_2(0, R)$, with density inside bounded between $\epsilon_m > 0$ and $\epsilon_M \geq \epsilon_m$. Let $\rho_D \doteq \epsilon_M/\epsilon_m$ ($\geq 1$). For any $0 < \delta < 1/2$, if (i) $A \subset B_2(0, R)$ and (ii) the number of clusters $k$ meets

$$k \;\leq\; \frac{m}{4\rho_D}, \qquad (15)$$

then, with probability at least $1 - \delta$ over the sampling of $A$, k-variates++, instantiated as in Theorem 9, satisfies $\mathbb{P}[C|A']/\mathbb{P}[C|A] \leq 1 + \rho_D^k\,g(m, k, d, R)$, $\forall A' \approx A$, for an explicit function $g(m, k, d, R)$ that vanishes as $m$ grows. (Proof in SI, Subsection 2.6.)

The key informal statement of Theorem 10 is that one may obtain, with high probability, good datasets $A$, i.e., datasets for which $\delta_w, \delta_s$ are small, under very weak assumptions about the domain at hand. The key point is that if one has access to the sampling, then one can resample datasets $A$ until a good one comes.

Applications to differential privacy. Let $M$ be any algorithm which takes as input $A$ and $k$, and returns a set of $k$ centers $C$. Let $\mathbb{P}_M[C|A]$ denote the probability, over the internal randomisation of $M$, that $M$ returns $C$ given $A$ and $k$ ($k$, fixed, is omitted in the notations). The following is the definition of differential privacy (Dwork et al., 2006), tailored for conciseness to our clustering problem.

Definition 11 $M$ is $\epsilon$-differentially private (DP) for $k$ clusters iff for any neighbors $A \approx A'$ and any set $C$ of $k$ centers,

$$\mathbb{P}_M[C|A']/\mathbb{P}_M[C|A] \;\leq\; \exp(\epsilon). \qquad (17)$$

A relaxed version of $\epsilon$-DP is $(\epsilon, \delta)$-DP, in which we require $\mathbb{P}_M[C|A'] \leq \mathbb{P}_M[C|A]\exp(\epsilon) + \delta$; thus, $\epsilon$-DP = $(\epsilon, 0)$-DP (Dwork & Roth, 2014). We show that a low noise may be affordable to satisfy ineq. (17) using the Laplace distribution, $\mathrm{Lap}(\sigma/\sqrt{2})$. We refer to the Laplace mechanism as the popular mechanism which adds to the output of an algorithm a sufficiently large amount of Laplace noise to be $\epsilon$-DP. We refer to (Dwork et al., 2006) for details, and assume from now on that the data belong to an $L_1$ ball $B_1(0, R)$.

Theorem 12 Using the notations and setting of Theorem 9, let $\tilde{\epsilon}$ be such that

$$\exp(\epsilon) \;\geq\; (1 + \delta_w)^{k - 1 + \frac{f(k)}{\delta_w}}\,(1 + \delta_s)^{k-1}\,\exp(\tilde{\epsilon}).$$

Then, k-variates++ with $p_{(\mu_.,\theta_.)}$ a product of $\mathrm{Lap}(\sigma_1/\sqrt{2})$, for $\sigma_1 \doteq 2\sqrt{2}R/\tilde{\epsilon}$, both meets ineq. (17) and has an expected potential satisfying ineq. (7) with

$$\Phi = \Phi_1 \;\doteq\; 8\,\phi_{\mathrm{opt}} + O\!\left(\frac{m R^2}{\tilde{\epsilon}^{\,2}}\right). \qquad (19)$$

On the other hand, if we opt for $\sigma_2 \doteq 2\sqrt{2}kR/\epsilon$, then k-variates++ is an instance of the Laplace mechanism and its expected potential satisfies ineq. (7) with

$$\Phi = \Phi_2 \;\doteq\; 8\,\phi_{\mathrm{opt}} + O\!\left(\frac{m k^2 R^2}{\epsilon^2}\right).$$

(Proof in SI, Subsection 2.7.)
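The following is a sketch (ours) of the instantiation used in Theorem 12, under simple assumptions: identity probes, and a product Laplace local density centred at each selected point, with the per-coordinate scale $\sigma_1/\sqrt{2} = 2R/\tilde{\epsilon}$ prescribed by the theorem; the function names and the numerical value of $\tilde{\epsilon}$ are illustrative, and how $\tilde{\epsilon}$ relates to the privacy budget $\epsilon$ is governed by the theorem's condition on $\delta_w, \delta_s$.

```python
import numpy as np

def dp_k_variates_seed(A, k, R, eps_tilde, rng=None):
    """Theorem 12 instantiation sketch: D^2 seeding with identity probes, each
    released center drawn from a product Laplace density centred at the chosen
    point, per-coordinate scale sigma_1 / sqrt(2) with sigma_1 = 2*sqrt(2)*R/eps_tilde."""
    rng = np.random.default_rng() if rng is None else rng
    b = 2.0 * R / eps_tilde                      # = sigma_1 / sqrt(2)
    m, C = len(A), []
    for t in range(1, k + 1):
        if t == 1:
            a = A[rng.integers(m)]
        else:
            D = ((A[:, None, :] - np.array(C)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
            a = A[rng.choice(m, p=D / D.sum())]
        C.append(a + rng.laplace(0.0, b, size=a.shape))   # release a noisy center
    return np.array(C)

rng = np.random.default_rng(1)
A = rng.uniform(-0.5, 0.5, (200, 2))             # data inside the L1 ball B_1(0, 1)
print(dp_k_variates_seed(A, k=3, R=1.0, eps_tilde=2.0, rng=rng))
```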
A question is how $\sigma_1$ (resp. $\Phi_1$) and $\sigma_2$ (resp. $\Phi_2$) compare with each other, and how they compare to the state of the art (Nissim et al., 2007; Wang et al., 2015) (we only consider methods with provable approximation bounds of the global optimum). The key fact is that, if $m$ is sufficiently large, then we can fix $\delta_w = O(1/m)$ and $\delta_s = O(1)$. The proof of Theorem 10 in the SI formalizes this intuition (SI, Subsection 2.6), and the experiments (SI, Section 3) display that such regimes are indeed observed. In this case, it is not hard to show that $\tilde{\epsilon} = \Omega(\epsilon + \log m)$, granting $\sigma_1 = o(\sigma_2)$ since $\sigma_1 = O\!\left(\frac{R}{\epsilon + \log m}\right)$, i.e. the noise guaranteeing ineq. (17) vanishes at a $1/\log(m)$ rate. Consequently, in this regime, $\Phi_1$ in eq. (19) becomes

$$\Phi_1 \;\propto\; \phi_{\mathrm{opt}} + \frac{m R^2}{(\epsilon + \log m)^2}, \qquad (22)$$

ignoring all factors other than those noted. Thus, the noise dependence grows sublinearly in $m$. Since in this setting, unless all datapoints are the same, $\delta_w$ and $\delta_s$ for $A$ and any possible neighbor $A'$ are within $1 + o(1)$ of each other, it is also possible to overestimate $\delta_w$ and $\delta_s$ so as to still have $\delta_w = O(1/m)$ and $\delta_s = O(1)$, and grant $\epsilon$-DP for k-variates++. Otherwise, the setting of Theorem 10 can be used to grant $(\epsilon, \delta)$-DP without any tweak. Table 1 compares k-variates++ to (Nissim et al., 2007) in this large-sample regime, which is actually a prerequisite for (Nissim et al., 2007).

Figure 3. Plot of $\rho_\phi(H) = f(k, p)$ (points below the $z = 0$ isocontour correspond to superior performances for Dk-means++). Left: H = k-means++; right: H = k-means|| (best viewed in color).

5. Experiments

Due to the large number of experiments carried out, we provide only an overview here; the experiments appear in extenso in the SI (Section 3).

Dk-means++ vs k-means++ and k-means|| (Bahmani et al., 2012). To assess algorithms that can be reduced from k-variates++ (Section 3), we have tested Dk-means++ vs the state-of-the-art approach k-means||; to be fair with Dk-means++, we use k-means++ seeding as the reclustering algorithm in k-means||. Parameters are in line with (Bahmani et al., 2012). To control the spread of the Forgy nodes $\phi^F_s$ (Theorem 4), each peer's initial data consists of points uniformly sampled in a random hyperrectangle in a space of dimension $d = 50$ (expected number of points per peer $m_i = 500$, $\forall i$). We sample peers until a total of $m \approx 20000$ points is sampled. Then, each point moves with probability $p\%$ to a uniformly sampled peer. We checked that $\phi^F_s$ blows up with $p$, i.e., by a factor $> 20$ for $p = 50\%$ with respect to $p = 0$. A remarkable phenomenon was the fact that, even when the number of peers $n$ is quite large (dozens on average), Dk-means++ is able to beat both k-means++ and k-means||, even for large values of $p$, as measured by the ratio $\rho_\phi(H) \doteq 100 \cdot (\phi(\text{Dk-means++}) - \phi(H))/\phi(H)$ for $H \in \{\text{k-means++}, \text{k-means||}\}$ (Figure 3). Another positive point is that the amount of data used to compute a center in Dk-means++ is on average $n$ times smaller than in k-means||.

k-variates++ vs Forgy-DP and GUPT. To assess algorithms that can be obtained via a direct use of k-variates++ (Section 4), we have tested it in a differential privacy framework vs the state-of-the-art approach GUPT (Mohan et al., 2012). We let $\epsilon = 1$ in our experiments. We also compare it to Forgy DP (F-DP), which is just Forgy initialisation in the Laplace mechanism, with noise rate (standard deviation) $kR/\epsilon$. In comparison, the noise rate for GUPT is $kR/(\ell\epsilon)$ at the end of its aggregation process, where $\ell$ is the number of blocks. Table 2 gives results for the average (over the choices of $k$) parameters used, $\bar{k}$ and $\overline{\tilde{\epsilon}/\epsilon}$, and the ratio $\tilde{\rho}_\phi$, where $\tilde{\rho}_\phi(H) \doteq \phi(H)/\phi(\text{k-variates++})$; values above 1 indicate better results for k-variates++.

| Dataset | $m$ | $d$ | $\bar{k}$ | $\overline{\tilde{\epsilon}/\epsilon}$ | $\tilde{\rho}_\phi$(F-DP) | $\tilde{\rho}_\phi$(GUPT) |
|---|---|---|---|---|---|---|
| Life Sci | 26 733 | 10 | 3 | 4.5 | 163.0 | 0.7 |
| Image | 34 112 | 3 | 2.5 | 7.9 | 188.5 | 2.9 |
| Europe Diff | 169 308 | 2 | 5 | 13.0 | 2857.1 | 40.4 |

Table 2. k-variates++ vs F-DP and GUPT (see text).

We use $\tilde{\epsilon}$ as the equivalent $\epsilon$ for k-variates++, i.e. the value that guarantees ineq. (17). From Theorem 12, when $\tilde{\epsilon} > \epsilon$, this brings a smaller noise magnitude, which is desirable for clustering.
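For intuition on the noise magnitudes just compared, a tiny sketch (ours) plugs illustrative values into the per-coordinate noise rates quoted in the text; the choice $\tilde{\epsilon} \approx \epsilon + \log m$ reflects the large-$m$ regime of eq. (22) and is an assumption, not a measured quantity.

```python
import numpy as np

def noise_rates(k, R, eps, l, m):
    """Per-coordinate noise standard deviations quoted in the text (up to constants):
    F-DP's Laplace mechanism, GUPT with l blocks, and k-variates++ in the
    large-m regime where eps_tilde ~ eps + log(m)."""
    eps_tilde = eps + np.log(m)          # assumption: the regime of eq. (22)
    return {"F-DP": k * R / eps,
            "GUPT": k * R / (l * eps),
            "k-variates++": 2 * np.sqrt(2) * R / eps_tilde}

print(noise_rates(k=5, R=1.0, eps=1.0, l=10, m=20000))
```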
The obtained results show that k-variates++ becomes more of a contender as $m$ increases, but its relative performance tends to decrease with increasing $k$. This is in accordance with the good regime of Theorem 12. Results on synthetic domains display the same patterns, along with the fact that the relative performance of k-variates++ improves with $d$, making it a relevant choice for big domains. In fact, extensive experiments on synthetic data (Nock et al., 2016a) show that the intuitions regarding the sublinear noise regime in eq. (22) are experimentally observed, and furthermore that this regime may happen for quite small values of $m$.

6. Discussion and Conclusion

We first show in this paper that the k-means++ analysis of Arthur and Vassilvitskii can be carried out on a significantly more general scale, aggregating various clustering frameworks of interest for which no trivial adaptation of k-means++ was previously known. Our contributions stand at two levels: (i) we provide the meta-algorithm, k-variates++, and two key results, one on its approximation abilities with respect to the global optimum, and one on the likelihood ratio of the centers it delivers. We do expect further applications of these results, in particular to address several other key clustering problems: stability, generalisation and smoothed analysis (Arthur et al., 2011; von Luxburg, 2010); (ii) we provide two examples of application. The first is a reduction technique from k-variates++, which shows a way to obtain direct approximability results for other clustering algorithms, some being efficient proxies for the generalisation of existing approaches (Ailon et al., 2009). The second is a direct application of k-variates++ to differential privacy, exhibiting a noise component significantly better than that of existing approaches (Nissim et al., 2007; Wang et al., 2015). Due to the lack of space, we refer to the SI (Nock et al., 2016a) for the extension of our results to extreme cases of distortions (other than $L_2^2$) that do not even admit population minimizers in closed form (Nielsen & Nock, 2015); clustering being a huge practical problem, it is indeed reasonable to tailor the distortion to the application at hand.

Acknowledgments

Thanks are due to Stephen Hardy, Guillaume Smith, Wilko Henecka and Max Ott for stimulating discussions and feedback on the subject. Work carried out at NICTA, which was supported by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Center of Excellence Program.

References

Ackermann, M.-R., Lammersen, C., Märtens, M., Raupach, C., Sohler, C., and Swierkot, K. StreamKM++: a clustering algorithm for data streams. In 12th ALENEX, pp. 173-187, 2010.

Ailon, N., Jaiswal, R., and Monteleoni, C. Streaming k-means approximation. In NIPS*22, pp. 10-18, 2009.

Arthur, D. and Vassilvitskii, S. k-means++: the advantages of careful seeding. In 19th SODA, pp. 1027-1035, 2007.

Arthur, D., Manthey, B., and Röglin, H. Smoothed analysis of the k-means method. JACM, 58:19, 2011.

Bahmani, B., Moseley, B., Vattani, A., Kumar, R., and Vassilvitskii, S. Scalable k-means++. In 38th VLDB, pp. 622-633, 2012.

Balcan, M.-F., Ehrlich, S., and Liang, Y. Distributed k-means and k-median clustering on general communication topologies. In NIPS*26, pp. 1995-2003, 2013.
Banerjee, A., Merugu, S., Dhillon, I., and Ghosh, J. Clustering with Bregman divergences. JMLR, 6:1705-1749, 2005.

Boissonnat, J.-D., Nielsen, F., and Nock, R. Bregman Voronoi diagrams. DCG, 44(2):281-307, 2010.

Chaudhuri, K., Monteleoni, C., and Sarwate, A.-D. Differentially private empirical risk minimization. JMLR, 12:1069-1109, 2011.

Chichignoud, M. and Lousteau, S. Adaptive noisy clustering. IEEE Trans. IT, 60:7279-7292, 2014.

Dwork, C. and Roth, A. The algorithmic foundations of differential privacy. Found. & Trends in TCS, 9:211-407, 2014.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In 3rd TCC, pp. 265-284, 2006.

Har-Peled, S. and Mazumdar, S. On coresets for k-means and k-median clustering. In 37th ACM STOC, pp. 291-300, 2004.

Hardt, M. and Price, E. The noisy power method: a meta algorithm with applications. In NIPS*27, pp. 2861-2869, 2014.

Indyk, P., Mahabadi, S., Mahdian, M., and Mirrokni, V.-S. Composable core-sets for diversity and coverage maximization. In 33rd ACM PODS, pp. 100-108, 2014.

Jegelka, S., Sra, S., and Banerjee, A. Approximation algorithms for tensor clustering. In 20th ALT, pp. 368-383, 2009.

Kalai, A. and Vempala, S. Efficient algorithms for online decision problems. J. Comp. Syst. Sc., pp. 291-307, 2005.

Kohavi, R. and Wolpert, D. Bias plus variance decomposition for zero-one loss functions. In 13th ICML, pp. 275-283, 1996.

Liberty, E., Sriharsha, R., and Sviridenko, M. An algorithm for online k-means clustering. CoRR, abs/1412.5721, 2014.

Mohan, P., Thakurta, A., Shi, E., Song, D., and Culler, D.-E. GUPT: privacy preserving data analysis made easy. In 38th ACM SIGMOD, pp. 349-360, 2012.

Nielsen, F. and Nock, R. Optimal interval clustering: application to Bregman clustering and statistical mixture learning. IEEE Signal Processing Letters, 21:1289-1292, 2014.

Nielsen, F. and Nock, R. Total Jensen divergences: definition, properties and clustering. In 40th IEEE ICASSP, pp. 2016-2020, 2015.

Nielsen, F., Nock, R., and Amari, S.-I. On clustering histograms with k-means by using mixed α-divergences. Entropy, 16, 2014.

Nissim, K., Raskhodnikova, S., and Smith, A. Smooth sensitivity and sampling in private data analysis. In 40th ACM STOC, pp. 75-84, 2007.

Nock, R., Luosto, P., and Kivinen, J. Mixed Bregman clustering with approximation guarantees. In 19th ECML, pp. 154-169, 2008.

Nock, R., Canyasse, R., Boreli, R., and Nielsen, F. k-variates++: more pluses in the k-means++ (supplementary information to this paper). In 33rd ICML, 2016a.

Nock, R., Nielsen, F., and Amari, S.-I. On conformal divergences and their population minimizers. IEEE Trans. IT, 62:1-12, 2016b.

Shindler, M., Wong, A., and Meyerson, A. Fast and accurate k-means for large datasets. In NIPS*24, pp. 2375-2383, 2011.

von Luxburg, U. Clustering stability: an overview. Found. & Trends in ML, 2(3):235-274, 2010.

Wang, Y., Wang, Y.-X., and Singh, A. Differentially private subspace clustering. In NIPS*28, 2015.

Yang, B. and Garcia-Molina, H. Comparing hybrid peer-to-peer systems. In 27th VLDB, pp. 561-570, 2001.