# On Generalization Bounds for Projective Clustering

Maria Sofia Bucarelli¹, Matilde Fjeldsø Larsen², Chris Schwiegelshohn², Mads Bech Toftrup²
¹Department of Computer, Control and Management Engineering Antonio Ruberti, Sapienza University of Rome, Italy
²Department of Computer Science, Aarhus University, Denmark
mariasofia.bucarelli@uniroma1.it, 201805091@post.au.dk, {schwiegelshohn,toftrup}@cs.au.dk

Abstract

Given a set of points, clustering consists of finding a partition of a point set into k clusters such that the center to which a point is assigned is as close as possible. Most commonly, centers are points themselves, which leads to the famous k-median and k-means objectives. One may also choose centers to be j-dimensional subspaces, which gives rise to subspace clustering. In this paper, we consider learning bounds for these problems. That is, given a set of n samples P drawn independently from some unknown but fixed distribution D, how quickly does a solution computed on P converge to the optimal clustering of D? We give several near optimal results. In particular,

1. For center-based objectives, we show a convergence rate of $O(\sqrt{k/n})$. This matches the known optimal bounds of [Fefferman, Mitter, and Narayanan, Journal of the American Mathematical Society 2016] and [Bartlett, Linder, and Lugosi, IEEE Trans. Inf. Theory 1998] for k-means and extends it to other important objectives such as k-median.
2. For subspace clustering with j-dimensional subspaces, we show a convergence rate of $O(\sqrt{kj^2/n})$. These are the first provable bounds for most of these problems. For the specific case of projective clustering, which generalizes k-means, we show that a convergence rate of $\Omega(\sqrt{kj/n})$ is necessary, thereby proving that the bounds from [Fefferman, Mitter, and Narayanan, Journal of the American Mathematical Society 2016] are essentially optimal.

1 Introduction

Among the central questions in machine learning is, given a sample of n points P drawn from some unknown but fixed distribution D, how well does a classifier trained on P generalize to D? Probably the most popular way to formalize this question is, given a loss function L and optimal solutions $S_P$ and $S_D$ for sample P and distribution D, respectively, to ask how the empirical excess risk $L(D, S_P) - L(D, S_D)$ decreases as a function of n. This paper focuses on loss functions associated with the clustering problem. Popular examples include (k, z)-clustering, which asks for a set of k centers $S \subset \mathbb{R}^d$ minimizing the cost $\sum_{p \in P} \min_{s \in S} \|p - s\|_2^z$, and more generally (k, j, z)-subspace clustering, which asks for a set of k subspaces $\mathcal{U} := \{U_1, U_2, \ldots, U_k\}$ minimizing $\sum_{p \in P} \min_{U_i \in \mathcal{U}} \|(I - U_iU_i^T)p\|_2^z$. Special cases include (k, 1)-clustering, known as k-median, (k, 2)-clustering, known as k-means, and (k, j, 2)-clustering, known as projective clustering. Generally, there seems to be an interest in varying z, as letting z tend towards 1 tends to result in outlier-robust clusterings. The problem is less widely explored for z > 2, although in particular for subspace approximation there is some recent work [27, 34, 79, 77]. Higher powers give more emphasis to outliers. For example, centralized moments with respect to the three and four norms are skewness and kurtosis, respectively, and are extensively employed in statistics; see Cohen-Addad et al. [31] for previous work on clustering with these measures.
Fitting a mixture model with respect to skewness minimizes asymmetry around the target center. Studying the problems for $z \to \infty$ is very well motivated, as $(1, \infty)$-clustering is equivalent to the minimum enclosing ball problem. Unfortunately, one often requires additional assumptions, as the minimum enclosing ball problem suffers from the curse of dimensionality [2] and is very prone to outliers [25, 38].

Despite a huge interest and a substantial amount of research, so far optimal risk bounds of $O(\sqrt{k/n})$ have been established only for the k-means problem, see the seminal paper by Fefferman et al. [39] for the upper bound and Bartlett et al. [10] for nearly matching lower bounds. For general (k, z)-clustering problems, the best known results prove a risk bound of $O(\sqrt{kd/n})$ [10]. For (k, j, 2)-clustering, the best known bounds of $O(\sqrt{kj/n})$ are due to Fefferman et al. [39]. Thus, the following question naturally arises:

Is it possible to obtain optimal generalization bounds for all (k, j, z)-clustering objectives?

We answer this question in the affirmative whenever j and z are constant, which seems to be the most relevant case in practice [76]. Specifically, we show:

- The excess risk for (k, z)-clustering when given n independent samples from an unknown fixed distribution D is bounded by $O(\sqrt{k/n})$, matching the lower bound of [10].
- The excess risk for (k, j, z)-clustering when given n independent samples from an unknown fixed distribution D is bounded by $O(\sqrt{kj^2/n})$.
- There exists a distribution such that the excess risk for the (k, j, 2)-clustering problem is at least $\Omega(\sqrt{kj/n})$, matching the upper bound of Fefferman et al. [39] up to polylog factors.

We note that we assume z to be a constant, which is the case for projective clustering, k-means and k-median. For non-constant z, the dependency on z is exponential.

1.1 Related work

The most basic question one could answer is whether the empirical estimation performed on P is consistent, i.e., whether the excess risk tends to 0 as $n \to \infty$. This was shown in a series of works by Pollard [64, 66], see also Abaya and Wise [1]. Subsequent work then analyzed the convergence rate of the risk. The first works in this direction proved convergence rates of the order $O(1/\sqrt{n})$ without giving dependencies on other parameters [22, 65]. Linder et al. [55] gave an upper bound of $O(d^{3/2}\sqrt{k/n})$. Linder [54] improved the upper bound to $O(d\sqrt{k/n})$. Bartlett et al. [8] showed an upper bound of $O(\sqrt{kd/n})$ and gave a lower bound of $\Omega(\sqrt{k^{1-4/d}/n})$. Motivated by applications of clustering in high dimensional kernel spaces [7, 18, 20, 21, 37, 39, 56, 58, 67, 78, 80, 81], research subsequently turned its efforts towards minimizing the dependency on the dimension. Biau et al. [14] presented an upper bound of $O(k/\sqrt{n})$, see also the work by Clémençon [26]. Fefferman et al. [39] gave a matching upper bound of the order $O(\sqrt{k/n})$, which was later recovered using techniques from Foster and Rakhlin [45] and Liu [57]. Further improvements require additional assumptions on the distribution D, see Antos et al. [5], Levrard [52], Li and Liu [53]. For subspace clustering, results have only been published for the case z = 2 [39, 50, 72], for which the state of the art is an $O(\sqrt{kj/n})$ risk bound due to Fefferman et al. [39].

A highly related line of research originated with the study of coresets for compression. For Euclidean (k, z)-clustering, coresets with space bounds of $\tilde{O}(k/\varepsilon^{2+z})$ have been established [30, 32], which roughly corresponds to an error rate of $\tilde{O}((k/n)^{1/(2+z)})$ as a function of the size n of the compression. (Here $\tilde{O}$ hides logarithmic terms, i.e., we write $\tilde{O}(\sqrt{k/n}) = O(\sqrt{k/n} \cdot \operatorname{polylog}(k, n))$.)
For the specific case of k-median and k-means, coresets with space bounds of $\tilde{O}(k^{(2z+2)/(z+2)}/\varepsilon^2)$ are known [33], which corresponds to an error rate of $\tilde{O}(\sqrt{k^{(2z+2)/(z+2)}/n})$. Both results are optimal for certain ranges of ε and k [47], and while these bounds are worse than what we hope to achieve for generalization, many of the techniques, such as terminal embeddings, are relevant for both fields. For (k, j, z)-clustering, coresets are only known to exist under certain assumptions, where the provable size is of the order $\exp(k, j, \varepsilon^{-1})$ [40, 44].

2 Preliminaries

We use $\|x\|_p := \sqrt[p]{\sum_i |x_i|^p}$ to denote the $\ell_p$ norm of a vector x. For $p \to \infty$, we define the limiting norm $\|x\|_\infty = \max_i |x_i|$. Further, we refer to the d-dimensional Euclidean unit ball by $B_2^d$, i.e., $x \in B_2^d$ is a vector in $\mathbb{R}^d$ with $\|x\|_2 := \sqrt{\sum_{i=1}^d x_i^2} \le 1$. Let U be a $d \times j$ orthogonal matrix, i.e., a matrix with pairwise orthogonal columns of unit Euclidean norm. We say that $UU^T$ is the projection matrix associated with U. Let z be a positive integer. Given any set S of k points in $B_2^d$, we denote the (k, z)-clustering cost of a point set P with respect to solution S as
$$\operatorname{cost}(P, S) := \sum_{p \in P} \min_{s \in S} \|p - s\|_2^z.$$
Special cases include k-means (z = 2) and k-median (z = 1). Similarly, given a collection $\mathcal{U}$ of k orthogonal matrices of rank at most j, we denote the (k, j, z)-clustering cost of a point set P as
$$\operatorname{cost}(P, \mathcal{U}) := \sum_{p \in P} \min_{U \in \mathcal{U}} \|(I - UU^T)p\|_2^z.$$
The specific case (k, j, 2) is often known as projective clustering in the literature. The cost vector $v^{S,P} \in \mathbb{R}^{|P|}$, respectively $v^{\mathcal{U},P} \in \mathbb{R}^{|P|}$, has entries $v^S_p = \min_{s \in S}\|p - s\|_2^z$, respectively $v^{\mathcal{U}}_p = \min_{U \in \mathcal{U}}\|(I - UU^T)p\|_2^z$, for $p \in P$. We will omit P from $v^{S,P}$ and $v^{\mathcal{U},P}$ if P is clear from context. The overall cost is $\|v^S\|_1 = \sum_{p \in P}\min_{s \in S}\|p - s\|_2^z$ and $\|v^{\mathcal{U}}\|_1 = \sum_{p \in P}\min_{U \in \mathcal{U}}\|(I - UU^T)p\|_2^z$. The set of all cost vectors is denoted by V.

Let D be an unknown but fixed distribution on $B_2^d$ with probability density function $\mathbb{P}$. For any solution S, respectively $\mathcal{U}$, we define $\operatorname{cost}(D, S) := \int_{p \in B_2^d} \min_{s \in S}\|p - s\|_2^z\, \mathbb{P}[p]\, dp$ and $\mathrm{OPT} := \min_S \operatorname{cost}(D, S)$, and respectively $\operatorname{cost}(D, \mathcal{U}) := \int_{p \in B_2^d} \min_{U \in \mathcal{U}}\|(I - UU^T)p\|_2^z\, \mathbb{P}[p]\, dp$ and $\mathrm{OPT} := \min_{\mathcal{U}} \operatorname{cost}(D, \mathcal{U})$. Let P be a set of n points sampled independently from D. We denote the cost of the empirical risk minimizer on P by $\mathrm{OPT}_P := \frac{1}{n}\min_S \|v^S\|_1$, and respectively $\mathrm{OPT}_P := \frac{1}{n}\min_{\mathcal{U}} \|v^{\mathcal{U}}\|_1$. The excess risk of P with respect to a set of cost vectors is denoted by $E_{|P|}(V) := \mathbb{E}_P[\mathrm{OPT}_P] - \mathrm{OPT}$.

Finally, we use the notion of a net. Let (V, dist) be a metric space. $N(V, \mathrm{dist}, \varepsilon)$ is an ε-net of the set of vectors V if for all $v \in V$ there exists $v' \in N(V, \mathrm{dist}, \varepsilon)$ such that $\mathrm{dist}(v, v') \le \varepsilon$. We will particularly focus on nets for cost vectors induced by (k, z)-clustering and (k, j, z)-clustering, defined as follows; prior work has proposed similar nets for coresets and sublinear algorithms for (k, z)-clustering [31].

Definition 2.1 (Clustering Nets). A set $N_\varepsilon$ of |P|-dimensional vectors is an ε-clustering net if for every cost vector v obtained from a solution S or $\mathcal{U}$, there exists a vector $v' \in N_\varepsilon$ with $\|v - v'\|_\infty \le \varepsilon$.

A slightly weaker condition than the one required by these nets, namely only requiring $\|v - v'\|_2 \le \varepsilon\sqrt{n}$, would also be sufficient. Nevertheless, we are not able to show better bounds when relaxing the condition, and having a point-wise guarantee may be of independent interest.
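To make these definitions concrete, the following minimal Python sketch (our illustration, not code from the paper; the function names and the use of numpy are our own choices) computes the cost vectors $v^S$ and $v^{\mathcal{U}}$ for a sample P, a set of centers S, and a collection of subspaces.

```python
# A minimal sketch (not the authors' code) of the (k, z) and (k, j, z) cost
# vectors defined above, using numpy. Variable names are illustrative.
import numpy as np

def cost_vector_kz(P, S, z):
    """Entries v^S_p = min_{s in S} ||p - s||_2^z for each row p of P."""
    dists = np.linalg.norm(P[:, None, :] - S[None, :, :], axis=2)  # (n, k)
    return dists.min(axis=1) ** z

def cost_vector_kjz(P, Us, z):
    """Entries v^U_p = min_{U in Us} ||(I - U U^T) p||_2^z.

    Us is a list of d x j matrices with orthonormal columns."""
    n = P.shape[0]
    residuals = np.full((n, len(Us)), np.inf)
    for i, U in enumerate(Us):
        proj = P @ U @ U.T                      # projection of each point onto span(U)
        residuals[:, i] = np.linalg.norm(P - proj, axis=1)
    return residuals.min(axis=1) ** z

# Example usage on random data in the unit ball.
rng = np.random.default_rng(0)
P = rng.normal(size=(100, 5))
P /= np.maximum(1.0, np.linalg.norm(P, axis=1, keepdims=True))  # force P into B^d_2
S = P[rng.choice(100, size=3, replace=False)]                   # k = 3 centers
U = np.linalg.qr(rng.normal(size=(5, 2)))[0]                    # one j = 2 subspace
print(cost_vector_kz(P, S, z=2).sum(), cost_vector_kjz(P, [U], z=2).sum())
```

The overall costs $\|v^S\|_1$ and $\|v^{\mathcal{U}}\|_1$ are then just the sums printed in the last line.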
3 Outline and technical contribution

Due to space restrictions, the full proofs are provided in the supplementary material. In this section, we endeavour to present a complete and accessible overview of the key ideas behind the theorems.

Let P be a set of n points sampled independently from some unknown but fixed distribution D. To show that the excess risk with respect to clustering objectives is in O(f(n)) for some function f, it is sufficient to show two things. First, that for the optimal solution $\mathcal{U}_{\mathrm{OPT}}$, the clustering cost estimated using P is close to the true cost. Second, that any solution that is more expensive than $\mathcal{U}_{\mathrm{OPT}}$ does not become too cheap when evaluated on P. Both conditions are satisfied if for any solution $\mathcal{U}$
$$\left|\frac{1}{n}\operatorname{cost}(P, \mathcal{U}) - \operatorname{cost}(D, \mathcal{U})\right| \le O(f(n)).$$
Showing that $\left|\frac{1}{n}\operatorname{cost}(P, \mathcal{U}_{\mathrm{OPT}}) - \operatorname{cost}(D, \mathcal{U}_{\mathrm{OPT}})\right| \le O(\sqrt{1/n})$ with good probability is typically a straightforward application of concentration bounds such as Chernoff's bound. In fact, these concentration bounds show something even stronger. Given t solutions $\mathcal{U}_1, \ldots, \mathcal{U}_t$, we have, simultaneously for all i and with good probability,
$$\left|\frac{1}{n}\operatorname{cost}(P, \mathcal{U}_i) - \operatorname{cost}(D, \mathcal{U}_i)\right| \le O\left(\sqrt{\frac{\log t}{n}}\right). \tag{1}$$
What remains is to bound the number of solutions t.

Clustering nets and dimension reduction for center-based clustering: Unfortunately, the total number of expensive clusterings in Euclidean space is infinite, making a straightforward application of Equation 1 useless. Nets as per Definition 2.1 are now typically used to reduce the infinite number of solutions to a finite number. Specifically, one has to show that by preserving the costs of all solutions in the net, the cost of any other solution is also preserved. Using basic techniques from high dimensional computational geometry, it is readily possible to prove that an ε-net for (k, j, z)-clustering of size $\exp(k \cdot j \cdot d \cdot \log \varepsilon^{-1})$ exists, where d is the dimension of the ambient space. Plugging this into Equation 1 and setting $\varepsilon^{-1} = n^2$ then yields a generalization bound of the order $O(\sqrt{kjd\log n/n})$. Unfortunately, this leads to a dependency on d, which is suboptimal.

To improve the upper bounds, we take inspiration from coreset research. For (k, z)-clustering, a number of works have investigated dimension reduction techniques known as terminal embeddings, see [11, 46]. Given a set of points $P \subset \mathbb{R}^d$, a terminal embedding $f: \mathbb{R}^d \to \mathbb{R}^m$ guarantees $\|p - q\|_2 = (1 \pm \varepsilon)\|f(p) - f(q)\|_2$ for any $p \in P$ and $q \in \mathbb{R}^d$. Terminal embeddings are very closely related to the Johnson-Lindenstrauss lemma, see [16, 28, 60] for applications to clustering, but more powerful in a key regard: only one of the points is required to be in P. The added guarantee extended to arbitrary $q \in \mathbb{R}^d$ due to terminal embeddings allows us to capture all possible solutions. There are also even simpler proofs for k-means that avoid this machinery entirely, see [39, 45, 57]. Unfortunately, these arguments are heavily reliant on properties of inner products and are difficult to extend to other values of z. The terminal embedding technique may be readily adapted to (k, z)-clustering, though some care in the analysis must be taken to avoid the worse dependencies on the sample size necessitated by the coreset guarantee, described as follows.

Improving the union bound via chaining: To illustrate the chaining technique, consider the simple application of the union bound for a terminal embedding with target dimension $m = \Theta(\varepsilon^{-2}\log n)$, see the main result of Narayanan and Nelson [62]. Replacing the dependency on d with appropriately chosen parameters and plugging in the resulting net $N_\varepsilon$ of size $\exp(k\varepsilon^{-2}\log n\log\varepsilon^{-1})$ yields a generalization bound of $O\left(\sqrt[4]{k\log^2 n/n}\right)$ for (k, z)-clustering.
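Before turning to the chaining improvement, the following self-contained Python experiment (our own illustration, not part of the paper; the distribution, the family of solutions, and all constants are arbitrary) shows the behaviour behind Equation 1: for a fixed family of t solutions, the worst deviation between empirical and true cost shrinks as n grows, with $\sqrt{\log t/n}$ printed alongside for comparison.

```python
# Monte Carlo illustration (ours): sup over t fixed solutions of
# |(1/n) cost(P, S_i) - cost(D, S_i)| behaves roughly like sqrt(log t / n).
import numpy as np

rng = np.random.default_rng(1)
d, k, z = 5, 3, 2

def sample_ball(m):
    """m points roughly uniform in the unit ball B^d_2."""
    x = rng.normal(size=(m, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    return x * rng.uniform(size=(m, 1)) ** (1 / d)

def mean_cost(P, S):
    """(1/|P|) sum_p min_s ||p - s||^z."""
    dists = np.linalg.norm(P[:, None, :] - S[None, :, :], axis=2)
    return np.mean(dists.min(axis=1) ** z)

solutions = [sample_ball(k) for _ in range(200)]          # t = 200 fixed solutions
reference = sample_ball(100_000)                          # proxy for cost(D, .)
true_cost = np.array([mean_cost(reference, S) for S in solutions])

for n in [100, 400, 1600, 6400]:
    sups = []
    for _ in range(20):                                   # 20 independent samples P
        P = sample_ball(n)
        emp = np.array([mean_cost(P, S) for S in solutions])
        sups.append(np.abs(emp - true_cost).max())
    print(f"n={n:5d}  mean sup deviation = {np.mean(sups):.4f}  "
          f"sqrt(log t / n) = {np.sqrt(np.log(len(solutions)) / n):.4f}")
```

The chaining analysis described next improves on this single-scale union bound by applying it at every scale of a nested sequence of nets.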
We improve on this using a chaining analysis, see [30, 32] for its application to coresets for (k, z)-clustering and [39] for (k, j, 2)-clustering. Specifically, we use a nested sequence of nets $N_{1/2}, N_{1/4}, N_{1/8}, \ldots, N_{2^{-2\log n}}$. Note that for every solution S, we may now write cost(p, S) for any $p \in P$ as a telescoping sum
$$\operatorname{cost}(p, S) = \sum_{h=0}^{\infty}\left(\operatorname{cost}(p, S_{2^{-(h+1)}}) - \operatorname{cost}(p, S_{2^{-h}})\right)$$
with $S_{2^{-h}} \in N_{2^{-h}}$ and $\operatorname{cost}(p, S_1)$ being set to 0. We use this as follows. Suppose that for some solution S we have solutions $S_{2^{-h}} \in N_{2^{-h}}$ and $S_{2^{-(h+1)}} \in N_{2^{-(h+1)}}$. Then $|\operatorname{cost}(p, S_{2^{-h}}) - \operatorname{cost}(p, S_{2^{-(h+1)}})| \le O(2^{-h})$ for all $p \in P$, since both terms differ from $\operatorname{cost}(p, S)$ by at most $2^{-h}$. Instead of applying the union bound to a small set of solutions, we apply the union bound along every pair of solutions appearing in the telescoping sum. Using arguments similar to Equation 1, we then obtain
$$\mathbb{E}_P\left[\sup_{(S_{2^{-h}},\, S_{2^{-(h+1)}}) \in N_{2^{-h}} \times N_{2^{-(h+1)}}}\left|\frac{1}{n}\operatorname{cost}(P, S_{2^{-h}}) - \frac{1}{n}\operatorname{cost}(P, S_{2^{-(h+1)}})\right|\right] \le O\left(2^{-h}\sqrt{\frac{\log(|N_{2^{-h}}| \cdot |N_{2^{-(h+1)}}|)}{n}}\right) \le O\left(\sqrt{\frac{k \cdot \operatorname{polylog}(k \cdot 2^h)}{n}}\right),$$
where the second inequality uses $\log|N_{2^{-h}}| \le O(k \cdot 2^{2h} \cdot \operatorname{polylog}(k \cdot 2^h))$. This is the desired risk bound for (k, z)-clustering. To complete the argument in a rigorous fashion, we must now merely combine the decomposition of cost(P, S) into the telescoping sum with the learning rate that we just derived. Indeed, this already provides a simple way of obtaining a bound on the risk of the order $O(\sqrt{k/n})$, which turns out to be optimal.

In summary, to apply the chaining technique successfully, the following two properties are sufficient: (i) the dependency on ε in the net size can be at most $\exp(\tilde{O}(\varepsilon^{-2}))$, as the increase in net size is then met with a corresponding decrease between successive estimates along the chain, and (ii) the nets have to preserve the cost up to an additive ε for every sample point p. The second property is captured by Definition 2.1. Both properties impose restrictions on the dimension reductions that can be successfully integrated into the chaining analysis.

Dimension reduction for projective clustering: It turns out that extending this analysis to (k, j, z)-clustering is a major obstacle. While the chaining method itself uses no particular properties of (k, z)-clustering, the terminal embeddings needed to obtain nets cannot be applied to subspaces. Indeed, terminal embeddings, by the very nature of their guarantee, cannot be linear², and hence a linear structure such as a subspace will not be preserved. At this stage, there are a number of initially promising candidates that could provide alternative dimension reduction methods. For example, the classic Johnson-Lindenstrauss lemma can be realized via a random embedding matrix and, moreover, preserves subspaces, see for example [69, 23, 28]. Unfortunately, as remarked by [46], there is an inherent difficulty in applying Johnson-Lindenstrauss type embeddings even for (k, z)-clustering coresets, and the same arguments also apply to generalization bounds. An alternative dimension reduction method based on principal component analysis was initially proposed by [44] for (k, j, 2), see also [28] and most notably [74] for a different variant that applies to arbitrary (k, j, z) objectives. For (k, j, 2)-clustering, it states that a dimension reduction onto the first O(D/ε) principal components preserves the projective cost of all subspaces of dimension D. Since (k, j, 2)-clustering is a special case of a $k \cdot j$ dimensional projection, it implies that O(kj/ε) dimensions are sufficient. Given that these dimension reductions are based on PCA-type methods, they are linear and therefore seem promising initially. Unfortunately, this technique has serious drawbacks.
It does not satisfy the requirements of Definition 2.1, only preserving the cost on aggregate rather than per individual point, and thus cannot be combined with the chaining technique³. Without the chaining technique, the best bound one can hope for is of the order $O\left(\sqrt[3]{k^2j^2/n}\right)$, which falls short of what we are aiming for.

Another important technique used to quantify optimal solutions of (k, j, z)-clustering was initially proposed by [73], subsequently explored by [43, 35], and has frequently seen use in the coreset literature [40, 46]. Succinctly, it states that a (1 + ε)-approximate solution to the (1, j, z)-clustering problem on a point set P is contained in a subspace spanned by $O(j^2/\varepsilon)$ input points of P. While this result improves over PCA for large values of k, applying it only yields a learning rate of the order $O(\sqrt[3]{kj^3/n})$. It turns out that this technique has the exact same limitation as PCA, namely that costs per point are not preserved, and thus it only offers a different tradeoff in parameters.

Our new insight: Given the state of the art, designing a dimension reduction technique that would enable the application of the chaining technique might seem hopeless, and indeed, we were not able to prove such a result. The key insight that allows us to bypass these bottlenecks is to find a dimension reduction that applies not to all solutions U, but only to a certain subset of them. Indeed, we show that for any point set P contained in the unit ball and any subspace U of dimension j, there exists a subspace S spanned by $O(j/\varepsilon^2)$ points of P such that for every point p: $|\operatorname{cost}(p, U) - \operatorname{cost}(p_S, U_S)| \le \varepsilon$. This is similar to the guarantee provided by [73], but stronger in that it (i) applies to arbitrary subspaces, which is required for the chaining analysis, and (ii) applies to each point of P individually, rather than to the entire point set P on aggregate. We then augment the chaining analysis by applying a union bound over all $|P|^{O(j/\varepsilon^2)}$ possible dimension reductions, thereby capturing all solutions U. We are unaware of any previously successful attempts at integrating multiple dimension reductions within a chaining analysis and believe that the technique may be of independent interest.

²Consider an embedding matrix $S \in \mathbb{R}^{d \times m}$. Clearly, there exists some vector $x \in \mathbb{R}^d$ that is in the kernel of S whenever m < d, hence for any vector p, $\|p - (x + p)\|_2$ cannot be preserved.
³PCA, as well as the other potential alternative dimension reduction techniques, also does not satisfy the relaxed definition that would be sufficient for the analysis to go through.

4 Useful results from learning theory

Our goal is to bound the rate with which the empirical risk decreases for clustering problems. For a fixed set of n points P and a set of functions $F: P \to \mathbb{R}$, we define the Rademacher complexity ($\mathrm{Rad}_n$) and the Gaussian complexity ($G_n$) with respect to F respectively as
$$\mathrm{Rad}_n(F) = \frac{1}{n}\,\mathbb{E}_r\left[\sup_{f \in F}\sum_{p \in P} f(p)\, r_p\right], \qquad G_n(F) = \frac{1}{n}\,\mathbb{E}_g\left[\sup_{f \in F}\sum_{p \in P} f(p)\, g_p\right],$$
where the $r_p$ are independent random variables following the Rademacher distribution, whereas the $g_p$ are independent Gaussian random variables. In our case, we can think of f as being associated with a solution S (respectively a solution $\mathcal{U}$) and $f(p) = \operatorname{cost}(p, S) = \min_{s \in S}\|p - s\|_2^z$ (respectively $f(p) = \operatorname{cost}(p, \mathcal{U}) = \min_{U \in \mathcal{U}}\|(I - UU^T)p\|_2^z$). Since we associate every f with a cost vector $v^S$, we will use $\mathrm{Rad}_n(F)$ and $\mathrm{Rad}_n(V)$ as well as $G_n(F)$ and $G_n(V)$ interchangeably.
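As a concrete illustration (ours, not from the paper), both complexities of a finite family of candidate solutions can be estimated by Monte Carlo directly from the definitions; the family of solutions, the sample, and all parameters below are arbitrary choices.

```python
# Monte Carlo estimate (our illustration) of Rad_n(V) and G_n(V) for a finite
# family of (k, z)-clustering cost vectors, following the definitions above.
import numpy as np

rng = np.random.default_rng(2)
n, d, k, z, t = 500, 5, 3, 2, 100

P = rng.normal(size=(n, d))
P /= np.maximum(1.0, np.linalg.norm(P, axis=1, keepdims=True))      # P in B^d_2

def cost_vector(P, S):
    dists = np.linalg.norm(P[:, None, :] - S[None, :, :], axis=2)
    return dists.min(axis=1) ** z

# Cost vectors v^S for t randomly drawn candidate solutions S.
V = np.stack([cost_vector(P, P[rng.choice(n, k, replace=False)]) for _ in range(t)])

def complexity(V, noise):
    """(1/n) E[sup_S sum_p v^S_p * xi_p], estimated over random draws of xi."""
    sups = []
    for _ in range(200):
        xi = noise(V.shape[1])
        sups.append((V @ xi).max())
    return np.mean(sups) / V.shape[1]

rad = complexity(V, lambda m: rng.choice([-1.0, 1.0], size=m))       # Rademacher
gauss = complexity(V, lambda m: rng.normal(size=m))                  # Gaussian
print(f"Rad_n ~ {rad:.4f}   G_n ~ {gauss:.4f}")
```

In the analysis, such complexities only ever need to be controlled for nets of cost vectors, which is exactly what the lemma below exploits.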
The following theorem is due to Bartlett and Mendelson [9].

Theorem 4.1 (Simplified variant of Theorem 8 of Bartlett and Mendelson [9]). Consider a loss function $L: A \to [0, 1]$. Let F be a class of functions mapping from X to A and let $(X_i)_{i=1}^n$ be independent samples from D. Then, for any integer n and any δ > 0, with probability at least 1 − δ over samples of length n, denoting by $\hat{\mathbb{E}}_n$ the empirical risk, every $f \in F$ satisfies
$$\mathbb{E}[L(f(X))] \le \hat{\mathbb{E}}_n[L(f(X))] + \mathrm{Rad}_n(F) + \sqrt{\frac{8\ln(2/\delta)}{n}}.$$

Thus, in order to bound the excess risk, Theorem 4.1 shows that it is sufficient to bound the Rademacher complexity. It is well known (see, for example, B.3 of Rudra and Wootters [68]) that $\mathrm{Rad}_n(V) \le \sqrt{2\pi}\, G_n(V)$. Thus we can alternatively bound the Gaussian complexity, which is sometimes more convenient. Note that if V is the set of all cost vectors, clustering nets are merely $N(V, \|\cdot\|_\infty, \varepsilon)$. Using these nets, we can bound the Rademacher and Gaussian complexity. Indeed, the following lemma holds.

Lemma 4.2. Let D be a distribution over $B_2^d$ and let P be a set of n points sampled from D. Suppose that for a set of n-dimensional vectors V, we have absolute constants C, γ > 0 such that $\log|N(V, \|\cdot\|_\infty, \varepsilon)| \le O(\varepsilon^{-2}\log^\gamma(n\varepsilon^{-1}) \cdot C)$. Then
$$G_n(V) \le O\left(\sqrt{\frac{C\log^{\gamma+2} n}{n}}\right).$$

The specific types of nets used in our study and the size bounds for those nets will be the key to obtaining the desired upper bounds and will be detailed in the next section.

5 Generalization bounds for center-based clustering and subspace clustering

We start by giving our generalization bounds for center-based clustering and subspace clustering problems. For subspace clustering problems, we first state the result for general (k, j, z)-clustering. An improvement for the special case z = 2 will be given later.

Theorem 5.1. Let D be a distribution over $B_2^d$ and let P be a set of n points sampled from D. For any set of k points $S \subset B_2^d$, we denote by $v^S$ the n-dimensional cost vector of P in solution S with respect to the (k, z)-clustering objective. Moreover, we denote by $v^{\mathcal{U}}$ the n-dimensional cost vector of P in solution $\mathcal{U}$ with respect to the (k, j, z)-clustering objective. Let $V_z$ be the union of all cost vectors of P for center-based clustering and $V_{j,z}$ the union of all cost vectors for subspace clustering. Then with probability at least 1 − δ
$$E_n(V_z) \le O\left(\sqrt{\frac{k \cdot \operatorname{polylog}(n) + \log(1/\delta)}{n}}\right) \quad\text{and}\quad E_n(V_{j,z}) \le O\left(\sqrt{\frac{k\, j^2\log(jn)\log^3 n + \log(1/\delta)}{n}}\right),$$
where the constants hidden by the O-notation depend on z.

Following Theorem 4.1, it is sufficient to bound the Rademacher complexity in order to bound the excess risk. The Rademacher complexity is, up to lower order terms, equal to the Gaussian complexity, which, following Lemma 4.2, may be bounded by obtaining small nets with respect to the $\|\cdot\|_\infty$ norm. We believe that the bounds on the sizes of these nets may be of independent interest and state them in the following lemma.

Lemma 5.2. Let D be a distribution over $B_2^d$, let P be a set of n points sampled from D, and let $V_z$ and $V_{j,z}$ be defined as in Theorem 5.1. Then
$$|N(V_z, \|\cdot\|_\infty, \varepsilon)| \le \exp\left(O(1)\, z^3\, k\, \varepsilon^{-2}\log n\,(\log z + \log\varepsilon^{-1})\right) \tag{4}$$
$$|N(V_{j,z}, \|\cdot\|_\infty, \varepsilon)| \le \exp\left(O(1)\,(3z)^{z+2}\, k\, j\, \varepsilon^{-2}(\log n + j\log(j\varepsilon^{-1}))\log\varepsilon^{-1}\right). \tag{5}$$

Combining Lemma 5.2 with Lemma 4.2 now yields the immediate bound on the Rademacher and Gaussian complexity. Following the discussion from Section 3, we use terminal embeddings to prove the part of Lemma 5.2 pertaining to (k, z)-clustering, see Appendix B. Unfortunately, the terminal embedding technique is not admissible for obtaining nets for subspace clustering, as clarified in Section 3. Thus, we use an entirely different approach. We show the existence of a collection of dimension reducing maps with subspace preserving properties. Fortunately, the number of dimension reducing maps is small.
Our desired net sizes then follow by enumerating over all of these dimension reducing maps and, for the candidate solutions covered by each such dimension reducing map, finding an efficient net. First, we introduce a slightly different, but closely related, notion to (1, j, z)-nets.

Definition 5.3 (Projective Nets). Let $P \subset B_2^d$ be a set of points, and let z be a positive integer. For a $d \times j$ matrix S with columns that have at most unit norm and any point $p \in P$, define the projective cost as $\operatorname{cost}_{\mathrm{proj}}(p, S) = \|S^Tp\|_2^z$. Let V be the set of all projective cost vectors induced by such matrices S. We call $N(V, \|\cdot\|_\infty, \varepsilon)$ an (ε, j)-projective net of P.

On a high level, the proof largely relies on the following decomposition. Let U be a candidate subspace and let Π be a projection matrix used to approximate $\|(I - UU^T)p\|_2^z$. We have
$$\|(I - UU^T)p\|^2 = \underbrace{\|\Pi p\|^2}_{(1)} - \underbrace{\|U^T\Pi p\|^2}_{(2)} + \underbrace{\|(I - \Pi)p\|^2}_{(3)} - \underbrace{\|UU^T(I - \Pi)p\|^2}_{(4)} - \underbrace{2p^T\Pi UU^T(I - \Pi)p}_{(5)}.$$
Here, we wish to select Π such that $\|U^T(I - \Pi)p\|^2$ is small for all $p \in P$. Note that this implies that the terms $2p^T\Pi UU^T(I - \Pi)p$ and $\|UU^T(I - \Pi)p\|^2$ are small. For the term (2), we merely have to show that projective nets exist. If the number of candidate matrices Π is small, we can further construct good nets for the terms (1) and (3). We start by giving a bound for the projective nets. Our first lemma, Lemma 5.4, shows that if the points lie in a sufficiently low-dimensional space, such a net can be obtained by constructing a net $N(B_2^d, \|\cdot\|_2, \varepsilon')$ for a sufficiently small ε′.

Lemma 5.4. Let $P \subset B_2^d$ be a set of points, and let z be a positive integer. Then there exists an (ε, j)-projective net of size $|N(V, \|\cdot\|_\infty, \varepsilon)| \le \exp(O(1)\, d\, j\log(j\varepsilon^{-1}))$.

To reduce the dependency on the dimension, we now use the following lemma. Essentially, it shows that in order to retain the properties of U, we can find a projection matrix Π of rank at most $O(j\varepsilon^{-2})$.

Lemma 5.5. Let $P \subset B_2^d$. For any orthogonal matrix $U \in \mathbb{R}^{j \times d}$, there exists $M \subseteq P$, with $|M| \in O(j\,\varepsilon^{-2})$, such that for all $p \in P$, $\|U^T(I - \Pi_M)p\| \le \varepsilon\,\|(I - \Pi_M)p\|$.

We now use this lemma as follows. We can efficiently enumerate over all candidate Π, as Lemma 5.5 guarantees that we only have to consider $n^{j\varepsilon^{-2}} = \exp(j\,\varepsilon^{-2}\log n)$ many different sets M inducing projection matrices. This immediately gives us 0-nets for the terms (1) and (3). For each Π, we then apply Lemma 5.4, which gives us a net for term (2). Finally, by the choice of Π, we can show that the terms (4) and (5) are negligible.
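The decomposition above is a purely linear-algebraic identity: it follows from $\|(I - UU^T)p\|^2 = \|p\|^2 - \|U^Tp\|^2$ together with the orthogonal split $p = \Pi p + (I - \Pi)p$. The short Python check below (our own sanity check; the dimensions and the choice of Π are arbitrary) verifies it numerically.

```python
# Numerical check (ours) of the five-term decomposition of ||(I - U U^T) p||^2
# used above; dimensions and the choice of Pi are arbitrary.
import numpy as np

rng = np.random.default_rng(3)
d, j, r = 8, 2, 3                     # ambient dim, subspace dim, rank of Pi

p = rng.normal(size=d)
U = np.linalg.qr(rng.normal(size=(d, j)))[0]   # d x j, orthonormal columns
B = np.linalg.qr(rng.normal(size=(d, r)))[0]   # basis of the range of Pi
Pi = B @ B.T                                   # rank-r projection matrix
I = np.eye(d)

lhs = np.linalg.norm((I - U @ U.T) @ p) ** 2
t1 = np.linalg.norm(Pi @ p) ** 2
t2 = np.linalg.norm(U.T @ Pi @ p) ** 2
t3 = np.linalg.norm((I - Pi) @ p) ** 2
t4 = np.linalg.norm(U @ U.T @ (I - Pi) @ p) ** 2
t5 = 2 * p @ Pi @ U @ U.T @ (I - Pi) @ p
rhs = t1 - t2 + t3 - t4 - t5
print(abs(lhs - rhs))                 # ~1e-15, i.e. the identity holds
```

If $\|U^T(I - \Pi)p\|$ is small, the identity shows that the residual cost is determined by the net-controlled terms (1), (2) and (3) up to a small additive error, which is exactly how the lemmas above are combined.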
5.1 Tight generalization bounds for projective clustering

For the specific case of (k, j, 2)-clustering, also known as projective clustering, we obtain an even better dependency on j. A similar bound can likely also be derived using the seminal work of [39], though the dependencies on log n and log 1/δ are slightly weaker. The proof uses the main result of [45], itself heavily inspired by [39], and arguments related to bounding the Rademacher complexity of linear function classes. Crucially, it avoids the issue of obtaining an explicit dimension reduction entirely, but the approach cannot be extended to general (k, j, z)-clustering.

Theorem 5.6. Let D be a distribution over $B_2^d$ and let P be a set of n points sampled from D. For any set $\mathcal{U}$ of k orthogonal matrices of rank at most j, we denote by $v^{\mathcal{U}}$ the n-dimensional cost vector of P in solution $\mathcal{U}$ with respect to the (k, j, 2)-clustering objective, i.e. $v^{\mathcal{U}}_p = \min_{U \in \mathcal{U}}\|(I - UU^T)p\|^2$. Let $V_{j,2}$ be the union of all cost vectors of P. Then with probability at least 1 − δ, for any γ > 0,
$$E_n(V_{j,2}) \le O\left(\sqrt{\frac{kj}{n}}\right)\cdot\operatorname{polylog}(n, 1/\delta),$$
where the hidden polylogarithmic factor depends on γ.

Finally, we also show that the bounds from Theorem 5.6 and [39] are optimal up to polylogarithmic factors.

Theorem 5.7. There exists a distribution D supported on $B_2^d$ such that $E_n(V_{j,2}) \in \Omega\left(\sqrt{\frac{kj}{n}}\right)$.

The rough idea is to define a distribution D supported on the nodes of a 2kj-dimensional simplex, with some points having more probability mass and some points having less. Using the tightness of Chernoff bounds, we may then show that the probability of fitting a subspace clustering to a good fraction of the lower-mass points is always sufficiently large.

6 Experiments

Theoretical guarantees are often notoriously conservative compared to what is seen in practice. In this section, we present empirical findings detailing whether the risk bounds from the previous sections are also the risk bounds one can expect when dealing with real datasets. Indeed, for the related question of computing coresets, experimental work by [71] seems to indicate that the worst case bounds by [47] are not what one has to expect in practice for center-based clustering. Generally, two properties can determine the risk decrease. First, the clusters may be well separated [4, 29]. Indeed, making assumptions to this end, there is also some theoretical evidence that a rate of O(k/n) is possible [5, 52]. The other, somewhat related, explanation is that if the ground truth consists of k′ < k clusters [13, 63], the dependency on k will point more towards the smaller, true number of clusters.

We run the experiments both for center-based clustering and for subspace clustering. While the focus of the paper is arguably more on subspace clustering, the experiments are important in both cases. Although both problems are hard to optimize exactly, center-based clustering is significantly more tractable and thus may lend better insight into practical learning rates. For example, we have an abundance of approximation algorithms for (k, z)-clustering [6, 61] whereas, even in the case of (k, 1, z)-clustering in two dimensions [49], it is not possible to find any finite approximation in polynomial time. In the main body, we focus on (k, 1, z)-clustering, as there already exists a phase transition in terms of the computational complexity between the normal k-median and k-means problems and the (k, 1, 1) and (k, 1, 2) clustering objectives, while j = 1 still admits more positive results than other subspace clustering problems, see Agarwal et al. [3], Feldman et al. [41, 42].

Datasets: We use four publicly available real-world datasets: Mushroom [70], Skin-Nonskin [12], MNIST [51], and Covtype [15]. Below, we show the results on the Covtype dataset; the remaining experiments are deferred to the supplementary material. Each dataset was normalized by the diameter, ensuring that all points lie in $B_2^d$.

Figure 1: Excess risk for line clustering on Covtype. Shaded areas show max-min intervals.

Table 1: Datasets used for the experiments

| Dataset | Points | Dim | Labels |
|---|---|---|---|
| Mushrooms | 8,124 | 112 | 2 |
| MNIST | 60,000 | 784 | 10 |
| Skin_Nonskin | 245,057 | 3 | 2 |
| Covtype | 581,012 | 54 | 7 |

Problem parameters and algorithms: For both center-based clustering and subspace clustering, we focus on the powers z ∈ {1, 2, 3, 4}. z = 2 is arguably the most popular and also the most tractable variant. z = 1 is the objective with the least susceptibility to outliers. Finally, we consider the case z = 3, due to it minimizing asymmetry, and z = 4 as a tractable alternative to the coverage objective $z \to \infty$.
The excess risk is evaluated for k ∈ {10, 20, 30, 50} for both center-based and subspace clustering. Expectation maximization (EM) type algorithms are used for both center-based and subspace clustering, though this is a severe computational challenge for (1, j, z)-clustering if z ≠ 2, see [24, 36]. Given a solution S, we first assign every point to its closest center and subsequently recompute the center. For more details on initialization and concrete implementations, we refer to the supplementary material.

Experimental setup and results: To estimate the optimal cost OPT for the two objective functions, we run the corresponding algorithms mentioned above ten times on the entire dataset P and use the minimal objective value as an estimate for OPT. We obtain a sample $S_i$ of size n by sampling uniformly at random and estimate the optimal cost for that sample, $\mathrm{OPT}_i$. We repeat this 5 times. The empirical excess risk is calculated as
$$E_n = \frac{1}{|P|}\sum_{i=1}^{5}\frac{\operatorname{cost}(P, \mathrm{OPT}_i)}{5} - \mathrm{OPT}.$$
The excess risk for center-based clustering is evaluated on exponentially growing subset sizes $n \in \{2^6, 2^7, \ldots, 2^{12}\}$. We fit a line of the form $c \cdot \frac{k^{q_1}}{n^{q_2}}$, where $c, q_1, q_2$ are the optimizable parameters. Let $y_i$ be the excess risk in run i, let $k_i$ and $n_i$ be the values of k and n in run i, and let r be the total number of times the excess risk was evaluated for each combination of algorithm and dataset. We use gradient descent on the following loss to optimize the parameters:
$$L_{SE} = \sum_{i=1}^{r}\left(y_i - c \cdot \frac{k_i^{q_1}}{n_i^{q_2}}\right)^2.$$

The results in Figure 1 show that the excess risk for subspace clustering decreases more quickly for higher values of z, and we see a similar pattern for center-based clustering. The supplementary material contains more details on the empirical evaluations of center-based clustering. The best-fit lines shown in Tables 2 and 3 in the supplementary material indicate that the empirical excess risk values decrease slightly more quickly than predicted by theory. The expected values are $q_1 = q_2 = 0.5$, and we observe $q_1, q_2$ around 0.44 and 0.52, respectively. For k this indicates a slightly favorable dependency in practice. For $q_2$, we consider the difference to the theoretical bound of 0.5 negligible. The choice of z does not seem to have a significant impact on either finding. For subspace clustering, the dependency on k is a bit more pronounced and increases slightly towards the theoretical guarantees. Contrary to hopes that margin or stability conditions might occur in practical datasets, the results indicate that the theoretical guarantees on the learning rate are near-optimal even in practice. Moreover, the rates were not particularly affected by either the choice of z or by the dimension j when analyzing subspace clustering.
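The following Python sketch (ours; it uses scikit-learn's KMeans as a stand-in for the paper's EM-style solvers and synthetic data instead of the real datasets) mirrors the evaluation pipeline described above: estimate OPT on the full set, estimate the empirical optimum on random subsets of increasing size, and fit the power law $c\,k^{q_1}/n^{q_2}$. For simplicity the fit is done in log-space by ordinary least squares rather than by gradient descent on $L_{SE}$.

```python
# Sketch (ours) of the evaluation pipeline: empirical excess risk for k-means
# on random subsets, followed by a log-space least-squares fit of c * k^q1 / n^q2.
# KMeans from scikit-learn stands in for the EM-style solvers used in the paper.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
P = rng.normal(size=(4096, 10))
P /= np.linalg.norm(P, axis=1).max()                 # normalize into the unit ball

def opt_cost(X, k, restarts=3):
    """Mean k-means cost of the best of a few Lloyd's runs (proxy for OPT)."""
    best = min(KMeans(n_clusters=k, n_init=1, random_state=s).fit(X).inertia_
               for s in range(restarts))
    return best / len(X)

records = []
for k in [5, 10, 20]:
    opt_full = opt_cost(P, k)
    for n in [64, 128, 256, 512, 1024]:
        risks = []
        for _ in range(5):
            S = P[rng.choice(len(P), size=n, replace=False)]
            centers = KMeans(n_clusters=k, n_init=1).fit(S).cluster_centers_
            cost_on_P = np.min(
                np.linalg.norm(P[:, None, :] - centers[None], axis=2) ** 2, axis=1
            ).mean()
            risks.append(cost_on_P - opt_full)       # excess risk of the ERM on S
        records.append((k, n, max(np.mean(risks), 1e-12)))  # clamp for the log-fit

# Fit log y = log c + q1 * log k - q2 * log n by ordinary least squares.
ks, ns, ys = (np.array(x, dtype=float) for x in zip(*records))
A = np.column_stack([np.ones_like(ks), np.log(ks), -np.log(ns)])
coef, *_ = np.linalg.lstsq(A, np.log(ys), rcond=None)
print(f"c ~ {np.exp(coef[0]):.3f}, q1 ~ {coef[1]:.2f}, q2 ~ {coef[2]:.2f}")
```

In the paper the parameters of the power law are instead optimized by gradient descent on $L_{SE}$ directly; the log-space fit above is only a convenient simplification.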
7 Conclusion and open problems

In this paper, we presented several new generalization bounds for clustering objectives such as k-median and subspace clustering. When the centers are points or constant-dimensional subspaces, our upper bounds are optimal up to logarithmic terms. For projective clustering, we give a lower bound showing that the results obtained by [39] are nearly optimal. A key novel technique was using an ensemble of dimension reduction methods with very strong guarantees. An immediate open question is to which degree ensembles of dimension reductions can improve learning rates over a single dimension reduction. Is it possible to find natural problems where there is a separation between the embeddability and the learnability of a class of problems, or, given the ensemble, is it always possible to find a single dimension reduction with the guarantees of the ensemble? Another open question is motivated by the recent treatment of clustering through the lens of computational social choice [19]. Using current techniques from coresets [17] and learning theory [45], it seems difficult to improve over the learning rate of $O(\sqrt{k^2/n})$ for the fair clustering problem specifically. Is it possible to match the bounds for unconstrained clustering?

Acknowledgements and Disclosure of Funding

Maria Sofia Bucarelli was partially supported by projects FAIR (PE0000013) and SERICS (PE00000014) under the MUR National Recovery and Resilience Plan funded by the European Union - NextGenerationEU. Supported also by the ERC Advanced Grant 788893 AMDROMA, EC H2020 RIA project SoBigData++ (871042), and PNRR MUR project IR0000013-SoBigData.it. Chris Schwiegelshohn was supported by the Independent Research Fund Denmark (DFF) under a Sapere Aude Research Leader grant No 1051-00106B.

References

[1] E. F. Abaya and G. L. Wise. Convergence of vector quantizers with applications to optimal quantization. SIAM Journal on Applied Mathematics, 44(1):183-189, 1984. doi: 10.1137/0144013. URL https://doi.org/10.1137/0144013.
[2] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Approximating extent measures of points. J. ACM, 51(4):606-635, 2004.
[3] P. K. Agarwal, C. M. Procopiuc, and K. R. Varadarajan. Approximation algorithms for a k-line center. Algorithmica, 42(3-4):221-230, 2005. doi: 10.1007/s00453-005-1166-x. URL https://doi.org/10.1007/s00453-005-1166-x.
[4] H. Angelidakis, K. Makarychev, and Y. Makarychev. Algorithms for stable and perturbation-resilient problems. In H. Hatami, P. McKenzie, and V. King, editors, Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017, pages 438-451. ACM, 2017. doi: 10.1145/3055399.3055487. URL https://doi.org/10.1145/3055399.3055487.
[5] A. Antos, L. Gyorfi, and A. Gyorgy. Individual convergence rates in empirical vector quantizer design. IEEE Transactions on Information Theory, 51(11):4013-4022, 2005. doi: 10.1109/TIT.2005.856976.
[6] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, pages 1027-1035, 2007. URL http://dl.acm.org/citation.cfm?id=1283383.1283494.
[7] F. R. Bach and M. I. Jordan. Predictive low-rank decomposition for kernel methods. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 33-40, New York, NY, USA, 2005. Association for Computing Machinery. ISBN 1595931805. doi: 10.1145/1102351.1102356. URL https://doi.org/10.1145/1102351.1102356.
[8] P. Bartlett, T. Linder, and G. Lugosi. The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44(5):1802-1813, 1998. doi: 10.1109/18.705560.
[9] P. L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463-482, 2002. URL http://jmlr.org/papers/v3/bartlett02a.html.
[10] P. L. Bartlett, T. Linder, and G. Lugosi. The minimax distortion redundancy in empirical quantizer design. IEEE Trans. Inf. Theory, 44(5):1802-1813, 1998. doi: 10.1109/18.705560.
URL https://doi.org/10.1109/18.705560. [11] L. Becchetti, M. Bury, V. Cohen-Addad, F. Grandoni, and C. Schwiegelshohn. Oblivious dimension reduction for k-means: beyond subspaces and the johnson-lindenstrauss lemma. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 1039 1050, 2019. doi: 10.1145/3313276.3316318. URL https://doi.org/10.1145/3313276.3316318. [12] R. Bhatt and A. Dhall. Skin Segmentation. UCI Machine Learning Repository, 2012. [13] C. Bhattacharyya, R. Kannan, and A. Kumar. How many clusters? - an algorithmic answer. In J. S. Naor and N. Buchbinder, editors, Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, SODA 2022, Virtual Conference / Alexandria, VA, USA, January 9 - 12, 2022, pages 2607 2640. SIAM, 2022. doi: 10.1137/1.9781611977073.102. URL https://doi.org/10.1137/1.9781611977073.102. [14] G. Biau, L. Devroye, and G. Lugosi. On the performance of clustering in hilbert spaces. IEEE Transactions on Information Theory, 54(2):781 790, 2008. doi: 10.1109/TIT.2007.913516. [15] J. Blackard. Covertype. UCI Machine Learning Repository, 1998. [16] C. Boutsidis, A. Zouzias, and P. Drineas. Random projections for $k$-means clustering. In Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver, British Columbia, Canada., pages 298 306, 2010. [17] V. Braverman, V. Cohen-Addad, S. H. Jiang, R. Krauthgamer, C. Schwiegelshohn, M. B. Toftrup, and X. Wu. The power of uniform sampling for coresets. In 63rd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2022, Denver, CO, USA, October 31 - November 3, 2022, pages 462 473. IEEE, 2022. doi: 10.1109/FOCS54457.2022.00051. URL https://doi.org/10.1109/FOCS54457.2022.00051. [18] D. Calandriello and L. Rosasco. Statistical and computational trade-offs in kernel kmeans. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/ 18903e4430783a191b0cfab439daaef8-Paper.pdf. [19] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii. Fair clustering through fairlets. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5029 5037, 2017. [20] R. Chitta, R. Jin, T. C. Havens, and A. K. Jain. Approximate kernel k-means: Solution to large scale kernel clustering. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 11, page 895 903, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450308137. doi: 10.1145/2020408.2020558. URL https://doi.org/10.1145/2020408.2020558. [21] R. Chitta, R. Jin, and A. K. Jain. Efficient kernel clustering using random fourier features. In Proceedings of the 2012 IEEE 12th International Conference on Data Mining, ICDM 12, page 161 170, USA, 2012. IEEE Computer Society. ISBN 9780769549057. doi: 10.1109/ICDM. 2012.61. URL https://doi.org/10.1109/ICDM.2012.61. [22] P. Chou. The distortion of vector quantizers trained on n vectors decreases to the optimum as Op(1/n). 
In Proceedings of 1994 IEEE International Symposium on Information Theory, pages 457 , 1994. doi: 10.1109/ISIT.1994.395072. [23] K. L. Clarkson and D. P. Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC), pages 205 214, 2009. [24] K. L. Clarkson and D. P. Woodruff. Input sparsity and hardness for robust subspace approximation. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 310 329, 2015. doi: 10.1109/FOCS.2015.27. URL http://dx.doi.org/10.1109/FOCS.2015.27. [25] K. L. Clarkson, E. Hazan, and D. P. Woodruff. Sublinear optimization for machine learning. J. ACM, 59(5):23:1 23:49, 2012. doi: 10.1145/2371656.2371658. URL https://doi.org/10. 1145/2371656.2371658. [26] S. Clémençcon. On u-processes and clustering performance. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011. URL https://proceedings.neurips. cc/paper/2011/file/a1d0c6e83f027327d8461063f4ac58a6-Paper.pdf. [27] M. B. Cohen and R. Peng. Lp row sampling by lewis weights. In R. A. Servedio and R. Rubinfeld, editors, Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 183 192. ACM, 2015. doi: 10.1145/2746539.2746567. URL https://doi.org/10.1145/2746539.2746567. [28] M. B. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 163 172, 2015. [29] V. Cohen-Addad and C. Schwiegelshohn. On the local structure of stable clustering instances. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 49 60, 2017. doi: 10.1109/FOCS.2017.14. URL https://doi.org/10.1109/FOCS.2017.14. [30] V. Cohen-Addad, D. Saulpic, and C. Schwiegelshohn. A new coreset framework for clustering. In S. Khuller and V. V. Williams, editors, STOC 21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, Virtual Event, Italy, June 21-25, 2021. ACM, 2021. URL https: //doi.org/10.1145/3406325.3451022. [31] V. Cohen-Addad, D. Saulpic, and C. Schwiegelshohn. Improved coresets and sublinear algorithms for power means in euclidean spaces. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 7-10, 2021, Virtual Conference, 2021. [32] V. Cohen-Addad, K. G. Larsen, D. Saulpic, and C. Schwiegelshohn. Towards optimal lower bounds for k-median and k-means coresets. In S. Leonardi and A. Gupta, editors, STOC 22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 1038 1051. ACM, 2022. doi: 10.1145/3519935.3519946. URL https: //doi.org/10.1145/3519935.3519946. [33] V. Cohen-Addad, K. G. Larsen, D. Saulpic, C. Schwiegelshohn, and O. A. Sheikh-Omar. Improved coresets for euclidean k-means. In Neur IPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/ 120c9ab5c58ba0fa9dd3a22ace1de245-Abstract-Conference.html. [34] V. Cohen-Addad, D. Saulpic, and C. Schwiegelshohn. 
Deterministic clustering in high dimensional spaces: Sketches and approximation. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023 (to appear), 2023. [35] A. Deshpande and K. R. Varadarajan. Sampling-based dimension reduction for subspace approximation. In D. S. Johnson and U. Feige, editors, Proceedings of the 39th Annual ACM Symposium on Theory of Computing, San Diego, California, USA, June 11-13, 2007, pages 641 650. ACM, 2007. doi: 10.1145/1250790.1250884. URL https://doi.org/10.1145/ 1250790.1250884. [36] A. Deshpande, M. Tulsiani, and N. K. Vishnoi. Algorithms and hardness for subspace approximation. In D. Randall, editor, Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2011, San Francisco, California, USA, January 23-25, 2011, pages 482 496. SIAM, 2011. doi: 10.1137/1.9781611973082.39. URL https://doi.org/10.1137/1.9781611973082.39. [37] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 04, page 551 556, New York, NY, USA, 2004. Association for Computing Machinery. ISBN 1581138881. doi: 10.1145/1014052.1014118. URL https: //doi.org/10.1145/1014052.1014118. [38] H. Ding. A sub-linear time framework for geometric optimization with outliers in high dimensions. In F. Grandoni, G. Herman, and P. Sanders, editors, 28th Annual European Symposium on Algorithms, ESA 2020, September 7-9, 2020, Pisa, Italy (Virtual Conference), volume 173 of LIPIcs, pages 38:1 38:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. doi: 10.4230/LIPIcs.ESA.2020.38. URL https://doi.org/10.4230/LIPIcs.ESA.2020.38. [39] C. Fefferman, S. Mitter, and H. Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983 1049, 2016. [40] D. Feldman and M. Langberg. A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6-8 June 2011, pages 569 578, 2011. [41] D. Feldman, A. Fiat, and M. Sharir. Coresets forweighted facilities and their applications. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), 21-24 October 2006, Berkeley, California, USA, Proceedings, pages 315 324. IEEE Computer Society, 2006. doi: 10.1109/FOCS.2006.22. URL https://doi.org/10.1109/FOCS.2006.22. [42] D. Feldman, A. Fiat, M. Sharir, and D. Segev. Bi-criteria linear-time approximations for generalized k-mean/median/center. In J. Erickson, editor, Proceedings of the 23rd ACM Symposium on Computational Geometry, Gyeongju, South Korea, June 6-8, 2007, pages 19 26. ACM, 2007. doi: 10.1145/1247069.1247073. URL https://doi.org/10.1145/1247069. 1247073. [43] D. Feldman, M. Monemizadeh, C. Sohler, and D. P. Woodruff. Coresets and sketches for high dimensional subspace approximation problems. In M. Charikar, editor, Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 630 649. SIAM, 2010. doi: 10.1137/1.9781611973075.53. URL https://doi.org/10.1137/1.9781611973075.53. [44] D. Feldman, M. Schmidt, and C. Sohler. Turning big data into tiny data: Constant-size coresets for k-means, pca, and projective clustering. SIAM J. Comput., 49(3):601 657, 2020. doi: 10.1137/18M1209854. URL https://doi.org/10.1137/18M1209854. [45] D. J. Foster and A. Rakhlin. ℓ vector contraction for rademacher complexity. 
Co RR, abs/1911.06468, 2019. URL http://arxiv.org/abs/1911.06468. [46] L. Huang and N. K. Vishnoi. Coresets for clustering in euclidean spaces: importance sampling is nearly optimal. In K. Makarychev, Y. Makarychev, M. Tulsiani, G. Kamath, and J. Chuzhoy, editors, Proccedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, Chicago, IL, USA, June 22-26, 2020, pages 1416 1429. ACM, 2020. doi: 10.1145/ 3357713.3384296. URL https://doi.org/10.1145/3357713.3384296. [47] L. Huang, J. Li, and X. Wu. Towards optimal coreset construction for (k, z)-clustering: Breaking the quadratic dependency on k. Co RR, abs/2211.11923, 2022. doi: 10.48550/ar Xiv.2211.11923. URL https://doi.org/10.48550/ar Xiv.2211.11923. [48] P. N. Klein and N. E. Young. On the number of iterations for dantzig-wolfe optimization and packing-covering approximation algorithms. SIAM J. Comput., 44(4):1154 1172, 2015. doi: 10.1137/12087222X. URL https://doi.org/10.1137/12087222X. [49] V. S. A. Kumar, S. Arya, and H. Ramesh. Hardness of set cover with intersection 1. In U. Montanari, J. D. P. Rolim, and E. Welzl, editors, Automata, Languages and Programming, 27th International Colloquium, ICALP 2000, Geneva, Switzerland, July 9-15, 2000, Proceedings, volume 1853 of Lecture Notes in Computer Science, pages 624 635. Springer, 2000. doi: 10.1007/3-540-45022-X\_53. URL https://doi.org/10.1007/3-540-45022-X_53. [50] F. Lauer. Risk bounds for learning multiple components with permutation-invariant losses. In S. Chiappa and R. Calandra, editors, The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], volume 108 of Proceedings of Machine Learning Research, pages 1178 1187. PMLR, 2020. URL http://proceedings.mlr.press/v108/lauer20a.html. [51] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. doi: 10.1109/5.726791. [52] C. Levrard. Nonasymptotic bounds for vector quantization in hilbert spaces. The Annals of Statistics, 43(2), apr 2015. doi: 10.1214/14-aos1293. URL https://doi.org/10.1214% 2F14-aos1293. [53] S. Li and Y. Liu. Sharper generalization bounds for clustering. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 6392 6402. PMLR, 2021. URL http://proceedings.mlr.press/v139/li21k.html. [54] T. Linder. On the training distortion of vector quantizers. IEEE Transactions on Information Theory, 46(4):1617 1623, 2000. doi: 10.1109/18.850705. [55] T. Linder, G. Lugosi, and K. Zeger. Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding. IEEE Transactions on Information Theory, 40(6):1728 1740, 1994. doi: 10.1109/18.340451. [56] F. Liu, X. Huang, Y. Chen, and J. A. K. Suykens. Random features for kernel approximation: A survey on algorithms, theory, and beyond, 2020. URL https://arxiv.org/abs/2004. 11154. [57] Y. Liu. Refined learning bounds for kernel and approximate $k$-means. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pages 6142 6154, 2021. 
URL https://proceedings.neurips.cc/paper/2021/hash/ 30f8f6b940d1073d8b6a5eebc46dd6e5-Abstract.html. [58] Y. Liu, S. Liao, S. Jiang, L. Ding, H. Lin, and W. Wang. Fast cross-validation for kernel-based algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP:1 1, 01 2019. doi: 10.1109/TPAMI.2019.2892371. [59] S. P. Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theory, 28(2):129 136, 1982. doi: 10.1109/TIT.1982.1056489. URL https://doi.org/10.1109/TIT.1982.1056489. [60] K. Makarychev, Y. Makarychev, and I. P. Razenshteyn. Performance of johnson-lindenstrauss transform for k-means and k-medians clustering. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 1027 1038, 2019. doi: 10.1145/3313276.3316350. URL https://doi.org/10. 1145/3313276.3316350. [61] R. R. Mettu and C. G. Plaxton. Optimal time bounds for approximate clustering. Mach. Learn., 56(1-3):35 60, 2004. doi: 10.1023/B:MACH.0000033114.18632.e0. URL https: //doi.org/10.1023/B:MACH.0000033114.18632.e0. [62] S. Narayanan and J. Nelson. Optimal terminal dimensionality reduction in euclidean space. In M. Charikar and E. Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 1064 1069. ACM, 2019. doi: 10.1145/3313276.3316307. URL https://doi.org/10.1145/3313276. 3316307. [63] R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy. The effectiveness of lloyd-type methods for the k-means problem. J. ACM, 59(6):28:1 28:22, 2012. doi: 10.1145/2395116.2395117. URL https://doi.org/10.1145/2395116.2395117. [64] D. Pollard. Strong Consistency of K-Means Clustering. The Annals of Statistics, 9(1):135 140, 1981. doi: 10.1214/aos/1176345339. URL https://doi.org/10.1214/aos/1176345339. [65] D. Pollard. A Central Limit Theorem for k-Means Clustering. The Annals of Probability, 10 (4):919 926, 1982. doi: 10.1214/aop/1176993713. URL https://doi.org/10.1214/aop/ 1176993713. [66] D. Pollard. Quantization and the method of k-means. IEEE Trans. Inf. Theory, 28(2):199 204, 1982. doi: 10.1109/TIT.1982.1056481. URL https://doi.org/10.1109/TIT.1982. 1056481. [67] A. Rudi and L. Rosasco. Generalization properties of learning with random features, 2016. URL https://arxiv.org/abs/1602.04474. [68] A. Rudra and M. Wootters. Every list-decodable code for high noise has abundant nearoptimal rate puncturings. In D. B. Shmoys, editor, Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31 - June 03, 2014, pages 764 773. ACM, 2014. doi: 10.1145/2591796.2591797. URL https://doi.org/10.1145/2591796.2591797. [69] T. Sarlós. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 143 152, 2006. [70] J. Schlimmer. Mushroom. UCI Machine Learning Repository, 1987. [71] C. Schwiegelshohn and O. A. Sheikh-Omar. An empirical evaluation of k-means coresets. In S. Chechik, G. Navarro, E. Rotenberg, and G. Herman, editors, 30th Annual European Symposium on Algorithms, ESA 2022, September 5-9, 2022, Berlin/Potsdam, Germany, volume 244 of LIPIcs, pages 84:1 84:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022. doi: 10.4230/LIPIcs.ESA.2022.84. URL https://doi.org/10.4230/LIPIcs.ESA.2022.84. [72] J. Shawe-Taylor, C. K. I. Williams, N. Cristianini, and J. S. Kandola. 
On the eigenspectrum of the gram matrix and the generalization error of kernel-pca. IEEE Trans. Inf. Theory, 51(7): 2510 2522, 2005. doi: 10.1109/TIT.2005.850052. URL https://doi.org/10.1109/TIT. 2005.850052. [73] N. D. Shyamalkumar and K. R. Varadarajan. Efficient subspace approximation algorithms. In N. Bansal, K. Pruhs, and C. Stein, editors, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, pages 532 540. SIAM, 2007. URL http://dl.acm.org/citation.cfm?id= 1283383.1283440. [74] C. Sohler and D. P. Woodruff. Strong coresets for k-median and subspace approximation: Goodbye dimension. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2018, Paris, France, October 7-9, 2018, pages 802 813, 2018. doi: 10.1109/FOCS.2018. 00081. URL https://doi.org/10.1109/FOCS.2018.00081. [75] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. C. Eldar and G. Kutyniok, editors, Compressed Sensing, pages 210 268. Cambridge University Press, 2012. doi: 10.1017/cbo9780511794308.006. URL https://doi.org/10.1017/ cbo9780511794308.006. [76] R. Vidal. Subspace clustering. IEEE Signal Process. Mag., 28(2):52 68, 2011. doi: 10.1109/ MSP.2010.939739. URL https://doi.org/10.1109/MSP.2010.939739. [77] R. Wang and D. P. Woodruff. Tight bounds for lp oblivious subspace embeddings. In T. M. Chan, editor, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 1825 1843. SIAM, 2019. doi: 10.1137/1.9781611975482.110. URL https://doi.org/10.1137/1.9781611975482. 110. [78] S. Wang, A. Gittens, and M. W. Mahoney. Scalable kernel k-means clustering with nystrom approximation: Relative-error bounds. Journal of Machine Learning Research, 20(12):1 49, 2019. URL http://jmlr.org/papers/v20/17-517.html. [79] D. P. Woodruff and T. Yasuda. Sharper bounds for ℓp sensitivity sampling. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 37238 37272. PMLR, 2023. URL https://proceedings.mlr.press/v202/woodruff23a.html. [80] R. Yin, Y. Liu, L. Lu, W. Wang, and D. Meng. Divide-and-conquer learning with nyström: Optimal rate and algorithm. Proceedings of the AAAI Conference on Artificial Intelligence, 34 (04):6696 6703, Apr. 2020. doi: 10.1609/aaai.v34i04.6147. URL https://ojs.aaai.org/ index.php/AAAI/article/view/6147. [81] X. Zhang and S. Liao. Incremental randomized sketching for online kernel learning. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7394 7403. PMLR, 09 15 Jun 2019. URL https://proceedings.mlr.press/v97/zhang19h. html. Supplementary Materials A Proof of Lemma 4.2 In this section, we include the proof of Lemma 4.2 and some preliminary facts that will be useful for the proof. Let r be a Rademacher vector, i.e. every entry ri is sampled independently uniformly from { 1, 1}. Further, we say that g is a Gaussian vector if every entry gi is a standard Gaussian with mean 0 and variance 1. We have the following useful properties of Gaussians. Fact A.1 (Appendix B.1 by [68]). Let g1, . . . gn be Gaussians with means µi and variances σ2 i . If σ2 i σ2 for all i, then E[maxgi |gi|] 2σ 2 log n. 
2. If the Gaussians are independent, then $\sum_{i=1}^n a_i g_i$ is Gaussian distributed with mean $\sum_{i=1}^n a_i\mu_i$ and variance $\sum_{i=1}^n a_i^2\sigma_i^2$.
3. If the $g_i$ are independent standard Gaussians with mean 0 and variance 1, then $Y := \sum_{i=1}^n g_i^2$ is chi-squared distributed with mean $\mathbb{E}[Y] = n$; in particular, $\mathbb{E}[\|g\|_2] \le \sqrt{n}$.

Another result we need is the following.

Lemma A.2 (Lemma 5.2 of [75]). $|N(B_2^d, \|\cdot\|_2, \varepsilon)| \le (1 + 2/\varepsilon)^d$.

We are now ready to prove Lemma 4.2. The proof is similar to arguments used to prove Dudley's theorem. We also restate the lemma here for the sake of completeness.

Lemma (Lemma 4.2). Let $D$ be a distribution over $B_2^d$ and let $P$ be a set of $n$ points sampled from $D$. Suppose that for a set of $n$-dimensional vectors $V$, we have constants $C, \gamma > 0$ such that
$$\log |N(V, \|\cdot\|_\infty, \varepsilon)| \in O\bigl(\varepsilon^{-2}\log^{\gamma}(n\varepsilon^{-1})\cdot C\bigr). \qquad (7)$$
Then $G_n(V) \in O\bigl(\sqrt{C\log^{\gamma+2}(n)/n}\bigr)$.

Proof. For ease of notation, we use solutions $S$ induced by points, but the proof carries over without any modifications other than changing the notation to collections of subspaces $\mathcal U$. Consider an arbitrary cost vector $v^S$. We write $v^S$ as a telescoping sum $v^S = \sum_{h=0}^{\infty}\bigl(v^{h+1,S} - v^{h,S}\bigr)$, where $v^{0,S} = 0$ and $v^{h,S}$ is a vector from $N(V, \|\cdot\|_\infty, 2^{-h})$ approximating $v^S$. Observe that
$$\|v^{h+1,S} - v^{h,S}\|_\infty \le \|v^{h+1,S} - v^{S}\|_\infty + \|v^{S} - v^{h,S}\|_\infty \le 2\cdot 2^{-h} \qquad (8)$$
due to the triangle inequality. Thus we have
$$n\,G_n(V) = \mathbb{E}_{P,g}\sup_S\,(v^S)^T g = \mathbb{E}_{P,g}\sup_S\sum_{h=0}^{\infty}(v^{h+1,S}-v^{h,S})^T g$$
$$\le \sum_{h=0}^{\log n - 1}\mathbb{E}_{P,g}\sup_{(v^{h+1,S},\,v^{h,S})\in N(V,\|\cdot\|_\infty,2^{-(h+1)})\times N(V,\|\cdot\|_\infty,2^{-h})}(v^{h+1,S}-v^{h,S})^T g \qquad (9)$$
$$\quad+\;\mathbb{E}_{P,g}\sup_S\,(v^S - v^{\log n,S})^T g. \qquad (10)$$
We bound the terms (9) and (10) differently, starting with the latter. For every $S$,
$$(v^S - v^{\log n,S})^T g \le \|v^S - v^{\log n,S}\|_2\cdot\|g\|_2$$
due to the Cauchy-Schwarz inequality. Further,
$$\|v^S - v^{\log n,S}\|_2 \le \sqrt{n}\,\|v^S - v^{\log n,S}\|_\infty \le \sqrt{n}\cdot 2^{-\log n} = \frac{1}{\sqrt n},$$
which, combined with the third item of Fact A.1, yields
$$\mathbb{E}_{P,g}\sup_S\,(v^S - v^{\log n,S})^T g \in O\Bigl(\frac{1}{\sqrt n}\cdot\sqrt n\Bigr) = O(1). \qquad (11)$$
We now consider term (9). Due to the second item of Fact A.1, $(v^{h+1,S} - v^{h,S})^T g$ is Gaussian distributed with mean 0 and variance
$$\sum_{i=1}^n (v^{h+1,S} - v^{h,S})_i^2 \le 4n\cdot 2^{-2h}.$$
Thus, using the first item of Fact A.1, we have
$$\mathbb{E}_{g}\sup_{(v^{h+1,S},\,v^{h,S})}(v^{h+1,S}-v^{h,S})^T g \le \sqrt{32\,n\,2^{-2h}\,\log\bigl(|N(V,\|\cdot\|_\infty,2^{-(h+1)})|\cdot|N(V,\|\cdot\|_\infty,2^{-h})|\bigr)}. \qquad (12)$$
Now using Equation (7) we obtain
$$32\,n\,2^{-2h}\,\log\bigl(|N(V,\|\cdot\|_\infty,2^{-(h+1)})|\cdot|N(V,\|\cdot\|_\infty,2^{-h})|\bigr) \in O(C\,n\log^{\gamma} n),$$
so that, summing over the at most $\log n$ levels,
$$\sum_{h=0}^{\log n - 1}\sqrt{32\,n\,2^{-2h}\,\log\bigl(|N(V,\|\cdot\|_\infty,2^{-(h+1)})|\cdot|N(V,\|\cdot\|_\infty,2^{-h})|\bigr)} \in O\Bigl(\sqrt{C\,n\log^{\gamma+2} n}\Bigr). \qquad (13)$$
Adding the bounds (13) and (11) for terms (9) and (10), respectively, yields the claim.

Finally, we will frequently use the following triangle inequality extended to powers.

Lemma A.3 (Triangle Inequality for Powers, Lemma A.1 of [60]). Let $a, b, c$ be an arbitrary set of points in a metric space with distance function $d$ and let $z$ be a positive integer. Then for any $\varepsilon > 0$,
$$d(a,b)^z \le (1+\varepsilon)^{z-1}\,d(a,c)^z + \Bigl(\frac{1+\varepsilon}{\varepsilon}\Bigr)^{z-1} d(b,c)^z,$$
$$|d(a,b)^z - d(a,c)^z| \le \varepsilon\, d(a,c)^z + \Bigl(\frac{2z+\varepsilon}{\varepsilon}\Bigr)^{z-1} d(b,c)^z.$$
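The two inequalities of Lemma A.3 are used repeatedly below with $d$ the Euclidean distance. The following NumPy snippet is only a minimal numerical sanity check of both inequalities on random points in the unit ball; the sampling setup, the ranges for $z$ and $\varepsilon$, and the tolerance are illustrative choices and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_ball(d):
    # a point of the Euclidean unit ball (exact distribution is irrelevant for the check)
    x = rng.standard_normal(d)
    return x / np.linalg.norm(x) * rng.uniform() ** (1.0 / d)

for _ in range(10_000):
    dim = int(rng.integers(1, 6))
    a, b, c = (rand_ball(dim) for _ in range(3))
    z = int(rng.integers(1, 5))
    eps = rng.uniform(0.01, 1.0)
    dab, dac, dbc = np.linalg.norm(a - b), np.linalg.norm(a - c), np.linalg.norm(b - c)
    # first inequality of Lemma A.3
    assert dab**z <= (1 + eps)**(z - 1) * dac**z + ((1 + eps) / eps)**(z - 1) * dbc**z + 1e-9
    # second inequality of Lemma A.3
    assert abs(dab**z - dac**z) <= eps * dac**z + ((2 * z + eps) / eps)**(z - 1) * dbc**z + 1e-9
print("Lemma A.3 inequalities held on all sampled triples.")
```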
B Omitted Proofs for Center-Based Clustering

Lemma B.1. Let $P\subseteq B_2^d$ be a set of points. Let $V$ be the set of all cost vectors of $P$ for $(k,z)$-clustering. Then there exists an $\varepsilon$-clustering net of size $|N(V,\|\cdot\|_\infty,\varepsilon)| \le \exp\bigl(O(1)\cdot z\cdot k\cdot d\log(z\varepsilon^{-1})\bigr)$.

Proof. We start by proving the bound for $k = 1$. Suppose we are given a net $N(B_2^d, \|\cdot\|_2, \delta)$, for a $\delta$ to be determined later. Consider a candidate solution $\{s\}$ with cost vector $v^{\{s\}}\in V$. Let $s'$ be a point in $N(B_2^d, \|\cdot\|_2, \delta)$ such that $\|s-s'\|\le\delta$; if $s'$ is not unique, any one will be sufficient. Let $v^{S'}$ be the cost vector of $S' = \{s'\}$. The number of distinct solutions $S'$ is $|N(B_2^d,\|\cdot\|_2,\delta)| = \exp(O(1)\cdot d\log\delta^{-1})$ due to Lemma A.2. What is left to show is that all solutions constructed in this way satisfy the guarantee of $N(V,\|\cdot\|_\infty,\varepsilon)$ for an appropriately chosen $\delta$. For any $p\in P$ and any non-negative integer $z$, we have, due to Lemma A.3,
$$\bigl|\,\|p-s\|^z - \|p-s'\|^z\,\bigr| \le \alpha\,\|p-s\|^z + \Bigl(\frac{2z+\alpha}{\alpha}\Bigr)^{z-1}\|s-s'\|^z \le \alpha\,\|p-s\|^z + (3z)^z\,\delta.$$
We set $\alpha = \frac{1}{2}\,2^{-z}\,\varepsilon$ and $\delta = \alpha\cdot\frac{1}{2(3z)^z}\,\varepsilon = \frac{1}{4(6z)^z}\,\varepsilon^2$. Then the term above is upper bounded by $\varepsilon$, as $\|p-s\|\le 2$. Now, since $\bigl|\,\|p-s\|^z - \|p-s'\|^z\,\bigr|\le\varepsilon$ for all $s\in B_2^d$ also implies $\bigl|\min_{s\in S}\|p-s\|^z - \min_{s'\in S'}\|p-s'\|^z\bigr|\le\varepsilon$, we have proven our desired approximation. To conclude, observe that by our choice of $\delta$, the overall net $N$ has size at most $\exp\bigl(O(1)\cdot z\cdot d\log(z\varepsilon^{-1})\bigr)$. To extend this proof to $k$ centers, observe that any solution consisting of $k$ centers can be obtained by selecting $k$ points from $B_2^d$, rather than one. This raises the net size of the single-cluster case to the power $k$.

We now show that Lemma B.1 combined with terminal embeddings yields the desired net.

Lemma (Equation 4 in Lemma 5.2). Let $D$ be a distribution over $B_2^d$, let $P$ be a set of $n$ points sampled from $D$, and let $V$ be defined as in Theorem 2. Then
$$|N(V,\|\cdot\|_\infty,\varepsilon)| \le \exp\bigl(O(1)\,z^3\,k\,\varepsilon^{-2}\log n\,(\log z + \log\varepsilon^{-1})\bigr).$$

Proof. Let $f:\mathbb{R}^d\to\mathbb{R}^m$ be a terminal embedding, that is, $f$ is such that $m\in O(z^2\varepsilon^{-2}\log|P|)$ and for all $p\in P$ and $q\in\mathbb{R}^d$, $\|p-q\|^z = (1\pm\varepsilon)\,\|f(p)-f(q)\|^z$, as given by [62] (the dependency on $z$ is easily derived via a straightforward application of Lemma A.3). Therefore, for any candidate solution $S$, we also have $\mathrm{cost}(p,S) = (1\pm 2\varepsilon)\,\mathrm{cost}(f(p), f(S))$. In other words, the set of cost vectors in the image of $f$ is the desired $O(\varepsilon)$-net for the true set of cost vectors. Hence, an $\varepsilon$-net for the cost vectors induced by solutions in the image of $f$ is also an $O(\varepsilon)$-net for the set of cost vectors. We thus may apply Lemma B.1 to all cost vectors induced by solutions in the image of $f$. After rescaling $\varepsilon$ by constant factors, the overall net size is therefore
$$\exp\bigl(O(1)\,z^3\,k\,\varepsilon^{-2}\log n\,(\log z + \log\varepsilon^{-1})\bigr).$$

C Omitted Proofs for Subspace Clustering

In this section, we provide full proofs for the results of Section 5 on subspace clustering. We start with a few basic bounds that will be useful in the calculations later.

Lemma C.1. Let $a, b$ be numbers in $[0,2]$ and let $\varepsilon > 0$. Suppose $a^2 = b^2 \pm \varepsilon b$. Then $|a - b|\le\varepsilon$. Moreover, for any non-negative integer $z$, we have $|a^z - b^z| \le 2(3z)^z\,\varepsilon$.

Proof. For the first part of the lemma, we observe $|a^2 - b^2| = |a-b|\,(a+b) \le \varepsilon b$, which implies $|a-b|\le\varepsilon$. For the second part, Lemma A.3 implies
$$|a^z - b^z| \le \varepsilon\max(a,b)^z + \Bigl(\frac{2z+\varepsilon}{\varepsilon}\Bigr)^{z-1}|a-b|^z \le \varepsilon\,2^z + \Bigl(\frac{3z+\varepsilon}{\varepsilon}\Bigr)^{z-1}\varepsilon^z \le 2(3z)^z\,\varepsilon.$$

This lemma now immediately implies the following corollary by rescaling $\varepsilon$.

Corollary C.2. Let $a, b$ be numbers in $[0,2]$, let $z$ be a non-negative integer, and let $\varepsilon > 0$. Suppose $a^2 = b^2 \pm \frac{1}{4(3z)^z}\max(\varepsilon b, \varepsilon^2)$. Then $|a^z - b^z|\le\varepsilon$.

We now show that for any candidate subspace $U$ we can find a subspace representing it that is spanned by only a few points of $P$.

Lemma (Lemma 5.5). Let $P\subseteq B_2^d$. For any matrix $U\in\mathbb{R}^{d\times j}$ with orthonormal columns, there exists $M\subseteq P$ with $|M| = O(j\varepsilon^{-2})$ such that for all $p\in P$,
$$\|U^T(I-\Pi_M)p\| \le \varepsilon\,\|(I-\Pi_M)p\|. \qquad (14)$$

Proof. Initially, let $M = \emptyset$. We add points to $M$ in rounds and denote by $M_t$ the set after $t$ rounds. Furthermore, let $\Pi_t$ be the projection matrix onto the subspace spanned by $M_t$. If there is a $p\in P$ in round $t$ such that
$$\|U^T(I-\Pi_t)p\| > \varepsilon\,\|(I-\Pi_t)p\|, \qquad (15)$$
then we let $M_{t+1} = M_t\cup\{p\}$. Our goal is to show that after $T\in O(j\varepsilon^{-2})$ rounds, we have $\|U^T(I-\Pi_T)p\| \le \varepsilon\,\|(I-\Pi_T)p\|$ for all $p\in P$. We show this by proving inductively that $\|U^T\Pi_t\|_F^2 \ge \varepsilon^2\,t$. For the base case $t = 0$, this is trivially true.
Thus, suppose we add a point $p$ in iteration $t+1$. Reformulating Equation (15), we have $\frac{\|U^T(I-\Pi_t)p\|}{\|(I-\Pi_t)p\|} > \varepsilon$. By the Pythagorean theorem, we therefore have
$$\|U^T\Pi_{t+1}\|_F^2 = \|U^T\Pi_t\|_F^2 + \frac{\|U^T(I-\Pi_t)p\|^2}{\|(I-\Pi_t)p\|^2} \ge \varepsilon^2\,t + \varepsilon^2 = \varepsilon^2\,(t+1).$$
Now, since $\Pi_t$ is a projection and since $U$ has $j$ orthonormal columns, $j = \|U^T\|_F^2 \ge \|U^T\Pi_t\|_F^2$. If $T \ge \varepsilon^{-2}j$, we obtain $\|U^T\Pi_T\|_F^2 \ge j$. This implies that $U$ is contained in the space spanned by $M_T$. Conversely, $U$ must then be orthogonal to the kernel of $M_T$, that is, $U^T(I-\Pi_T) = 0$. Therefore, after at most $\varepsilon^{-2}j$ rounds, we have $\|U^T(I-\Pi_T)p\| \le \varepsilon\,\|(I-\Pi_T)p\|$ for all $p\in P$.
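The construction in the proof of Lemma 5.5 is a greedy procedure: repeatedly add any point whose residual outside the current span still correlates with $U$. The following NumPy sketch mirrors that procedure under the stated assumptions (points in the unit ball, $U$ with $j$ orthonormal columns); the function and variable names, parameter values, and the numerical tolerance are ours and purely illustrative.

```python
import numpy as np

def greedy_witness_set(P, U, eps):
    """Greedy selection of M, following the proof of Lemma 5.5.

    P: (n, d) array of points in the unit ball; U: (d, j) with orthonormal columns.
    Returns indices M with ||U^T (I - Pi_M) p|| <= eps * ||(I - Pi_M) p|| for all p.
    """
    n, d = P.shape
    M, basis = [], np.zeros((d, 0))        # orthonormal basis of span(M)
    while True:
        resid = P - (P @ basis) @ basis.T  # (I - Pi_M) p for every p
        lhs = np.linalg.norm(resid @ U, axis=1)
        rhs = eps * np.linalg.norm(resid, axis=1)
        bad = np.where(lhs > rhs + 1e-12)[0]
        if len(bad) == 0:
            return M
        i = bad[0]
        M.append(i)
        # extend the orthonormal basis by the normalized residual of the new point
        v = resid[i] / np.linalg.norm(resid[i])
        basis = np.column_stack([basis, v])

# small illustration: the loop stops after roughly j / eps^2 rounds
rng = np.random.default_rng(1)
n, d, j, eps = 200, 20, 3, 0.3
P = rng.standard_normal((n, d))
P /= np.maximum(1.0, np.linalg.norm(P, axis=1, keepdims=True))  # clip into the unit ball
U = np.linalg.qr(rng.standard_normal((d, j)))[0]                # d x j, orthonormal columns
M = greedy_witness_set(P, U, eps)
print(len(M), "<= about", int(j / eps**2))
```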
Lemma (Lemma 5.4). Let $P\subseteq B_2^d$ be a set of points and let $z$ be a positive integer. Then there exists an $(\varepsilon, j)$-projective net of size $|N(V,\|\cdot\|_\infty,\varepsilon)| \le \exp\bigl(O(1)\cdot d\cdot j\log(j\varepsilon^{-1})\bigr)$.

Proof. Let $N$ be an $\varepsilon/j$-net of the Euclidean unit ball, i.e. $N = N(B_2^d,\|\cdot\|_2,\varepsilon/j)$, as given by Lemma A.2. Let $\mathcal N = \times_{i=1}^{j} N$ be the set of $j$-tuples of elements of $N$. We claim that for every $S$, there exists an $S'\in\mathcal N$ such that $\|S^Tp\|_2 = \|S'^Tp\|_2 \pm \varepsilon$. Note that this implies the claim, as
$$|\mathcal N| \le \Bigl(1 + \frac{2j}{\varepsilon}\Bigr)^{d\cdot j} = \exp\bigl(O(1)\cdot d\cdot j\log(j\varepsilon^{-1})\bigr).$$
Define $S_i'^T$ to be the vector in $N$ closest to the $i$-th row of $S^T$, i.e. $\|S_i^T - S_i'^T\|_2 \le \varepsilon/j$. We have $\|S^T - S'^T\|_2 \le \sum_{i=1}^j\|S_i^T - S_i'^T\|_2 \le \varepsilon$. Therefore
$$\|S^Tp\|_2 = \|(S^T - S'^T)p + S'^Tp\|_2 \le \|(S^T - S'^T)p\|_2 + \|S'^Tp\|_2 \le \|S'^Tp\|_2 + \|S^T - S'^T\|_2\,\|p\|_2 \le \|S'^Tp\|_2 + \varepsilon.$$
The bound $\|S^Tp\|_2 \ge \|S'^Tp\|_2 - \varepsilon$ is proven analogously.

We can now conclude with the proof of Equation 5 in Lemma 5.2.

Lemma (Equation 5 in Lemma 5.2). Let $D$ be a distribution over $B_2^d$, let $P$ be a set of $n$ points sampled from $D$, and let $V_{j,z}$ be defined as in Theorem 5.1. Then
$$|N(V_{j,z},\|\cdot\|_\infty,\varepsilon)| \le \exp\bigl(O(1)\,(3z)^{z+2}\,k\,\varepsilon^{-2}\,(\log n + j\log(j\varepsilon^{-1}))\log\varepsilon^{-1}\bigr).$$

Proof. Let $\alpha, \beta > 0$ be sufficiently small parameters depending on $\varepsilon$ that will be determined later. We first describe a construction of nets for a single subspace of rank at most $j$, before composing to $k$ subspaces. We start by describing the composition of the nets. For every subset $M\subseteq P$ with $|M| \in O(j\alpha^{-2})$, we let $\Pi_M$ denote the orthogonal projection matrix onto the span of $M$. Note that this implies $\mathrm{rank}(\Pi_M) \in O(j\alpha^{-2})$. Further, let $N(\Pi_M) := N(B_2^{\mathrm{rank}(\Pi_M)},\|\cdot\|_2,\beta)$ be a $(\beta, j)$-projective net of the point set $\bigcup_{p\in M}\{\Pi_M p\}$ of size at most $\exp\bigl(O(1)\cdot\mathrm{rank}(\Pi_M)\cdot\log(j\beta^{-1})\bigr)$, given by Lemma 5.4. Finally, let $\mathcal N := \bigcup_M N(\Pi_M)$.

We consider an arbitrary matrix $U\in\mathbb{R}^{d\times j}$ with orthonormal columns. Denote by $M_U$ the subset of points and by $\Pi_U$ the projection matrix given by Lemma 5.5, using $\alpha$ as the precision parameter. We claim that for every $U$, there exists a $U'\in\mathcal N$ such that for all $p\in P$,
$$\Bigl|\bigl(\|\Pi_Up\|_2^2 - \|U'^T\Pi_Up\|_2^2 + \|(I-\Pi_U)p\|_2^2\bigr)^{z/2} - \|(I-UU^T)p\|^z\Bigr| \in O(\alpha+\beta).$$
In other words, by enumerating over all $(\beta, j)$-projective nets, we obtain an $O(\alpha+\beta)$-subspace clustering net for $(1, j, z)$-clustering. The desired error of $\varepsilon$ then follows by choosing $\alpha$ and $\beta$ accordingly.

For $U$, we construct $U'$ as follows. Let $D$ be an orthonormal basis of the subspace associated with $\Pi_U$, i.e. $DD^T = \Pi_U$. Further, let $V = U^TD$; notice that $V$ has at most $j$ rows, each of at most unit norm. Hence, there exists a $U'$ in the $(\beta, j)$-projective net $\mathcal N$ such that $\bigl|\,\|U^T\Pi_Up\|_2 - \|U'^T\Pi_Up\|_2\,\bigr| \le \beta$. We then obtain
$$\|\Pi_Up\|_2^2 - \|U'^T\Pi_Up\|_2^2 + \|(I-\Pi_U)p\|_2^2 = \|\Pi_Up\|_2^2 - \|U^T\Pi_Up\|_2^2 \pm \beta + \|(I-\Pi_U)p\|_2^2$$
$$= \|\Pi_Up\|_2^2 - \|UU^T\Pi_Up\|_2^2 \pm \beta + \|(I-\Pi_U)p\|_2^2$$
$$= \|(I-UU^T)\Pi_Up\|_2^2 + \|(I-\Pi_U)p\|_2^2 \pm \beta \qquad \text{(Eq. 6)}$$
$$= \|(I-UU^T)p\|_2^2 \pm \beta + \|U^T(I-\Pi_U)p\|^2 + 2\,p^T\Pi_UUU^T(I-\Pi_U)p$$
$$\overset{\text{(Lem. 5.5)}}{=} \|(I-UU^T)p\|_2^2 \pm \bigl(\alpha^2\,\|(I-UU^T)p\|^2 + 2\alpha\,\|(I-UU^T)p\| + \beta\bigr).$$
Setting $\alpha^2 = \beta = \frac{1}{64(3z)^z}\,\varepsilon^2$, we then have, due to Corollary C.2,
$$\Bigl|\bigl(\|\Pi_Up\|_2^2 - \|U'^T\Pi_Up\|_2^2 + \|(I-\Pi_U)p\|_2^2\bigr)^{z/2} - \|(I-UU^T)p\|^z\Bigr| \le \varepsilon. \qquad (16)$$

To extend this from a single $j$-dimensional subspace to a solution $\mathcal U$ consisting of $k$ $j$-dimensional subspaces, we define cost vectors obtained from $\mathcal N^k = \times_{i=1}^k\mathcal N$ as follows. For each $U\in\mathcal U$, let $U'$ be constructed as above, and let $\mathcal U'$ be the collection of the thus constructed $U'$. Then, with a slight abuse of notation, letting $\Pi_{U'}$ correspond to the subspace used to obtain $U'$, we define
$$v^{\mathcal U'}_p := \min_{U'\in\mathcal U'}\bigl(\|(I-\Pi_{U'})p\|^2 + \|\Pi_{U'}p\|^2 - \|U'^T\Pi_{U'}p\|^2\bigr)^{z/2}.$$
Let $U$ be the subspace of $\mathcal U$ to which $p$ is assigned and let $U'$ be the center in $\mathcal U'$ used to approximate $U$. Further, let
$$\tilde U' = \operatorname*{argmin}_{W'\in\mathcal U'}\bigl(\|(I-\Pi_{W'})p\|^2 + \|\Pi_{W'}p\|^2 - \|W'^T\Pi_{W'}p\|^2\bigr)^{z/2}$$
and let $\tilde U\in\mathcal U$ be the center approximated by $\tilde U'$. Then, applying Equation (16) twice, we have
$$\min_{W\in\mathcal U}\|(I-WW^T)p\|^z \le \|(I-\tilde U\tilde U^T)p\|^z \le \bigl(\|(I-\Pi_{\tilde U'})p\|^2 + \|\Pi_{\tilde U'}p\|^2 - \|\tilde U'^T\Pi_{\tilde U'}p\|^2\bigr)^{z/2} + \varepsilon = v^{\mathcal U'}_p + \varepsilon$$
and, conversely,
$$v^{\mathcal U'}_p \le \bigl(\|(I-\Pi_{U'})p\|^2 + \|\Pi_{U'}p\|^2 - \|U'^T\Pi_{U'}p\|^2\bigr)^{z/2} \le \|(I-UU^T)p\|^z + \varepsilon = \min_{W\in\mathcal U}\|(I-WW^T)p\|^z + \varepsilon.$$
Thus, the cost vectors obtained from $\mathcal N^k$ form a $(k, j, z)$-clustering net, i.e.
$$\Bigl|\,v^{\mathcal U'}_p - \min_{W\in\mathcal U}\|(I-WW^T)p\|^z\,\Bigr| \le \varepsilon.$$
What remains is to bound the size of the clustering net. Here, we first observe that the size of the clustering net is $|\mathcal N^k| = |\mathcal N|^k$. For $|\mathcal N|$, we have at most $\binom{|P|}{O(j\alpha^{-2}\log\alpha^{-1})} \le n^{O(j\alpha^{-2}\log\alpha^{-1})}$ many choices of $N(\Pi_M)$. In turn, the size of each $N(\Pi_M)$ is bounded by $(\beta/j)^{-O(j^2\alpha^{-2})}$ due to Lemma 5.4. Thus the overall size of $\mathcal N^k$ is
$$\exp\Bigl(k\cdot j\cdot O\bigl(\alpha^{-2}\log\alpha^{-1}(\log n + j\log(j\beta^{-1}))\bigr)\Bigr) = \exp\bigl(O(1)\,(3z)^{z+2}\,k\,j\,\varepsilon^{-2}\,(\log n + j\log(j\varepsilon^{-1}))\log\varepsilon^{-1}\bigr),$$
as desired.

C.1 Proofs of Theorem 5.6 (Section 5.1)

The proof of the theorem is a straightforward application of Theorem 4.1 combined with the following lemma.

Lemma C.3. Let $D$ be a distribution over $B_2^d$, let $P$ be a set of $n$ points sampled from $D$, and let $V$ be defined as in Theorem 5.6. Then for any $\gamma > 0$,
$$\mathrm{Rad}_n(V_{j,2}) \in O\Bigl(\sqrt{\tfrac{kj}{n}}\cdot\log^{3+\gamma} n\Bigr).$$

Proof. We use the following result due to Foster and Rakhlin [45].

Theorem C.4 ($\ell_\infty$ contraction inequality, Theorem 1 of [45]). Let $\mathcal F\subseteq\{f:\mathcal X\to\mathbb{R}^K\}$ and let $\phi:\mathbb{R}^K\to\mathbb{R}$ be $L$-Lipschitz with respect to the $\ell_\infty$ norm, i.e. $|\phi(X)-\phi(X')| \le L\,\|X-X'\|_\infty$ for all $X, X'\in\mathbb{R}^K$. For any $\gamma > 0$, there exists a constant $C > 0$ such that if $|\phi(f(x))| \vee \|f(x)\|_\infty \le \beta$, then
$$\mathrm{Rad}_n(\phi\circ\mathcal F) \le C\,L\,\sqrt K\,\max_i\mathrm{Rad}_n(\mathcal F|_i)\cdot\log^{3/2+\gamma}\Bigl(\frac{\beta\,n}{\max_i\mathrm{Rad}_n(\mathcal F|_i)}\Bigr).$$

We use this theorem as follows. Our functions are associated with candidate solutions $\mathcal U$; that is, $\phi(f(p)) = \min_{U\in\mathcal U}\|(I-UU^T)p\|_2^2$. In other words, $f$ maps a point $p$ to the $k$-dimensional vector with coordinates $f_i(p) = \|(I-U_iU_i^T)p\|_2^2$, and $\phi$ selects the minimum value among all $\|(I-U_iU_i^T)p\|_2^2$. Thus, we require three more steps. First, we have to bound the Lipschitz constant of the minimum operator. Second, we have to give a bound on $\beta$. Third and last, we have to give a bound on the Rademacher complexity
$$\mathrm{Rad}_n(V) = \frac1n\,\mathbb{E}_r\sup_{U}\sum_{p\in P}\|(I-UU^T)p\|_2^2\,r_p. \qquad (17)$$
The Lipschitz constant of the minimum operator with respect to the $\ell_\infty$ norm can readily be shown to be 1: for any two vectors $x, y$ with $\min_i y_i = y_j$,
$$\min_i x_i - \min_i y_i = \min_i x_i - y_j \le x_j - y_j \le |x_j - y_j| \le \|x - y\|_\infty.$$
Since each $U_i$ has orthonormal columns and $p\in B_2^d$, we have $\|(I-U_iU_i^T)p\|_2^2 \le 1$, and thus $\beta$ is bounded by 1. Hence, we only require a bound on Equation (17). For this, we use a result by [50]. Since the result is embedded in the proof of another result, we restate it here for the convenience of the reader.

Lemma C.5 (compare the proof of Theorem 3 of [50]). Let $P$ be a set of $n$ points in $B_2^d$ and let $\mathcal U$ be the set of all orthogonal matrices of rank at most $j$. For every $U\in\mathcal U$, define $f_U(p) = \|(I-UU^T)p\|_2^2$ and let $F$ be the set of all functions $f_U$. Then
$$\mathrm{Rad}_n(F) := \frac1n\,\mathbb{E}_r\sup_{U\in\mathcal U}\sum_{p\in P}\|(I-UU^T)p\|_2^2\,r_p \in O\Bigl(\sqrt{\tfrac{j}{n}}\Bigr).$$

Proof. We have
$$n\,\mathrm{Rad}_n(F) = \mathbb{E}_r\sup_U\sum_{p\in P}\|(I-UU^T)p\|_2^2\,r_p = \mathbb{E}_r\sum_{p\in P}\|p\|^2\,r_p + \mathbb{E}_r\sup_U\sum_{p\in P}\|U^Tp\|_2^2\,r_p,$$
where we used that $\|(I-UU^T)p\|_2^2 = \|p\|^2 - \|U^Tp\|_2^2$ and that $-r_p$ has the same distribution as $r_p$. We observe that the term $\mathbb{E}_r\sum_{p\in P}\|p\|^2\,r_p$ is 0. Thus, we focus on the second term.
We have
$$\mathbb{E}_r\sup_U\sum_{p\in P}\|U^Tp\|_2^2\,r_p = \mathbb{E}_r\sup_U\sum_{p\in P}p^TUU^Tp\,r_p = \mathbb{E}_r\sup_U\sum_{p\in P}\mathrm{trace}(p^TUU^Tp)\,r_p = \mathbb{E}_r\sup_U\sum_{p\in P}\mathrm{trace}(UU^Tpp^T)\,r_p$$
$$= \mathbb{E}_r\sup_U\,\mathrm{trace}\Bigl(UU^T\sum_{p\in P}r_p\,pp^T\Bigr) \le \mathbb{E}_r\sup_U\,\|U\|_F\,\Bigl\|\sum_{p\in P}r_p\,pp^T\Bigr\|_F.$$
We have $\|U\|_F \le \sqrt j$, so we focus on $\bigl\|\sum_{p\in P}r_p\,pp^T\bigr\|_F$. Here, we have
$$\Bigl\|\sum_{p\in P}r_p\,pp^T\Bigr\|_F = \sqrt{\Bigl\langle\sum_{p\in P}r_p\,pp^T,\ \sum_{q\in P}r_q\,qq^T\Bigr\rangle} = \sqrt{\sum_{p,q\in P}r_p r_q\,\mathrm{trace}(pp^Tqq^T)} = \sqrt{\sum_{p,q\in P}r_p r_q\,(p^Tq)^2}.$$
This implies
$$\mathbb{E}_r\sup_U\sum_{p\in P}\|U^Tp\|_2^2\,r_p \le \sqrt j\cdot\mathbb{E}_r\sqrt{\sum_{p,q\in P}r_pr_q\,(p^Tq)^2} \le \sqrt j\cdot\sqrt{\mathbb{E}_r\sum_{p,q\in P}r_pr_q\,(p^Tq)^2} \quad\text{(Jensen's inequality)}$$
$$= \sqrt j\cdot\sqrt{\sum_{p\in P}(p^Tp)^2} \le \sqrt{j}\cdot\sqrt n.$$
Solving the above for $\mathrm{Rad}_n(F)$ concludes the proof.

We can now conclude the proof of Lemma C.3. Combining the bounds on $L$ and $\beta$ with Lemma C.5 and Theorem C.4, we have
$$\mathrm{Rad}_n(V_{j,2}) \in O\Bigl(\sqrt{\tfrac{kj}{n}}\cdot\log^{3+\gamma}(n)\Bigr),$$
as desired.

C.2 Lower Bound

Finally, we also show that the bound given in Theorem 5.6 is optimal, up to polylog factors.

Theorem (5.7). There exists a distribution $D$ supported on $B_2^d$ such that $E(V_{j,2}) \in \Omega\bigl(\sqrt{\tfrac{kj}{n}}\bigr)$.

Proof. We first describe the hard instance distribution $D$. We assume that we are given $d = 2kj$ dimensions. Let $e_i$ be the standard unit vector along dimension $i$, with $i\in\{1,\ldots,d\}$. Let $p, \varepsilon\in[0,1]$ be parameters, where $\varepsilon$ is sufficiently small. We set the density of a point $q$ as follows:
$$\Pr[q] = \begin{cases} p & \text{if } q = e_i,\ i\in\{1,\ldots,kj\},\\ p - \varepsilon p & \text{if } q = e_i,\ i\in\{kj+1,\ldots,d\},\\ 0 & \text{otherwise.}\end{cases} \qquad (18)$$
We choose $p$ such that the integral over the densities is 1, i.e. $kj\cdot p + kj\cdot(p - \varepsilon p) = 1$. It is straightforward to verify that for $\varepsilon$ sufficiently small, $p\in\Theta\bigl(\tfrac{1}{kj}\bigr)$. We denote the points $\{e_1,\ldots,e_{kj}\}$ by $G$ for "good" and the points $\{e_{kj+1},\ldots,e_d\}$ by $B$ for "bad". We now characterize the properties of the optimal solution as well as of suboptimal solutions.

Lemma C.6. Let $D$ be the distribution described above in Equation (18). Then for any optimal solution $\mathcal U = \{U_1,\ldots,U_k\}$, we have $e_i\in U_t$ for every $i\in\{1,\ldots,kj\}$ and some $t$, and $\mathrm{OPT} = kj\cdot p\cdot(1-\varepsilon)$.

Proof. We transform the instance into a $d\times d$ diagonal matrix $D$ with $D_{i,i} = \sqrt{\Pr[e_i]}$, i.e. a diagonal matrix whose diagonal entries are equal to $\sqrt p$ for the first $kj$ elements and $\sqrt{p - \varepsilon p}$ for the elements from $kj+1$ to $d$. Now consider any partition of the points into clusters $C_t$ with corresponding subspaces $U_t$, for $t\in\{1,\ldots,k\}$. The optimal solution for $U_t$ is given by the top $j$ right singular vectors of the submatrix of $D$ corresponding to the points in $C_t$, which by the construction of $D$ correspond to the $j$ points of $C_t$ with the largest weight. This means that each cluster can remove the contribution of at most $j$ diagonal entries from the cost, so $k$ clusters can remove at most $kj$ entries. This implies that the cost of any clustering is lower bounded by $\sum_{i=1}^d D_{i,i}^2 - \sum_{i=1}^{kj}D_{i,i}^2 = \sum_{i=kj+1}^d D_{i,i}^2$. Conversely, the solution $\mathcal U$ described above has exactly this cost, which implies that it must be optimal.
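To make the subsequent counting argument concrete, the following NumPy sketch samples $n$ points from the distribution in Equation (18) and, using the characterization of Lemma C.6 applied to the empirical distribution, counts how many coordinates of $B$ end up among the $kj$ coordinates of largest empirical weight, a proxy for $|B_{\mathrm{ex}}|$. The function name, the parameter values, and the number of trials are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_excess_misclassified(k, j, eps, n, trials=200):
    """Monte Carlo estimate of E[|B_ex|] for the lower-bound instance of Eq. (18)."""
    # densities: p on the first kj coordinates (G), p - eps*p on the last kj (B)
    p = 1.0 / (k * j * (2.0 - eps))
    probs = np.concatenate([np.full(k * j, p), np.full(k * j, p * (1.0 - eps))])
    probs = probs / probs.sum()
    mis = []
    for _ in range(trials):
        counts = rng.multinomial(n, probs)
        # by (the empirical analogue of) Lemma C.6, the sample optimum keeps the
        # kj coordinates of largest empirical weight; B_ex are the kept B-coordinates
        kept = np.argsort(counts)[-k * j:]
        mis.append(np.sum(kept >= k * j))
    return np.mean(mis)

for n in [10**3, 10**4, 10**5]:
    print(n, sample_excess_misclassified(k=2, j=2, eps=0.05, n=n))
```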
Using Lemma C.6, we now have to control, given $n$ independent samples from $D$, the probability that the sample $P$ will (falsely) put a higher weight on some of the points in $B$ than on the points in $G$. Let $B_{\mathrm{ex}}$ denote the set of misclassified points in $B$ and let $P_{\mathrm{OPT}}$ denote the optimum computed on the sample $P$. We have
$$\mathbb{E}[\mathrm{cost}(D, P_{\mathrm{OPT}})] = kj\cdot p\cdot(1-\varepsilon) + p\cdot\varepsilon\cdot\mathbb{E}[|B_{\mathrm{ex}}|]$$
and hence an expected excess risk of
$$\mathbb{E}[\mathrm{cost}(D, P_{\mathrm{OPT}})] - \mathrm{OPT} = p\cdot\varepsilon\cdot\mathbb{E}[|B_{\mathrm{ex}}|].$$
By linearity of expectation, we have $\mathbb{E}[|B_{\mathrm{ex}}|] = kj\cdot\Pr[e_{kj+1}\in B_{\mathrm{ex}}]$. Thus,
$$\mathbb{E}[\mathrm{cost}(D, P_{\mathrm{OPT}})] - \mathrm{OPT} \in \Theta(1)\cdot\varepsilon\cdot\Pr[e_{kj+1}\in B_{\mathrm{ex}}].$$
Define $G_{\mathrm{low}}$ to be the set of points from $G$ that have an empirical density of at most $p$, and let $\widehat{e_{kj+1}}$ denote the empirical density of $e_{kj+1}$. We now claim that
$$\Pr[e_{kj+1}\in B_{\mathrm{ex}}] \ge \Pr\bigl[\widehat{e_{kj+1}} > p\ \wedge\ e_{kj+1}\in B_{\mathrm{ex}}\bigr] = \Pr\bigl[e_{kj+1}\in B_{\mathrm{ex}}\mid\widehat{e_{kj+1}} > p\bigr]\cdot\Pr\bigl[\widehat{e_{kj+1}} > p\bigr] \ge \tfrac12\,\Pr\bigl[\widehat{e_{kj+1}} > p\bigr].$$
The first inequality follows because we are considering a subset of the possible events; the second inequality follows because the number of points with an empirically estimated density greater than $p$ is negatively correlated with the empirical density $\widehat{e_{kj+1}}$ of the point $e_{kj+1}$. Specifically, conditioned on $\widehat{e_{kj+1}} > p$, the mean and median empirical density of any point $e_i\in G$ is at most $\frac1n\cdot p\cdot(n - p\,n) = p\,(1-p) < p$. Thus, the (marginal) mean and median empirical density of any other point is below $p$, and therefore the probability that $e_{kj+1}$ will be in $B_{\mathrm{ex}}$ is at least $1/2$. Hence, what remains to be shown is a bound on $\Pr[\widehat{e_{kj+1}} > p]$. Here, we use the tightness of the Chernoff bound (see Lemma 4 of [48]).

Lemma C.7 (Tightness of the Chernoff Bound). Let $X$ be the average of $n$ independent 0/1 random variables. For any $\varepsilon\in(0,1/2]$ and $\mu\in(0,1/2]$, assuming $\varepsilon^2\mu n\ge 3$, if each random variable is 1 with probability at least $\mu$, then
$$\Pr[X > (1+\varepsilon)\mu] > \exp(-9\varepsilon^2\mu n).$$

Thus, sampling $n$ elements, we have
$$\Pr\bigl[\widehat{e_{kj+1}} > p\bigr] = \Pr\Bigl[\widehat{e_{kj+1}} > \Bigl(1 + \frac{\varepsilon}{1-\varepsilon}\Bigr)(1-\varepsilon)\,p\Bigr] \ge \Omega(1)\cdot\exp\Bigl(-9\,\frac{\varepsilon^2}{(1-\varepsilon)^2}\,(1-\varepsilon)\,p\,n\Bigr).$$
If we require $\mathbb{E}[\mathrm{cost}(D, P_{\mathrm{OPT}})] - \mathrm{OPT} \le \varepsilon\cdot c$ for a sufficiently small absolute constant $c$, we also require $\Pr[\widehat{e_{kj+1}} > p] \le c$, and hence $\varepsilon \ge c'\cdot\sqrt{\tfrac{kj}{n}}$ for sufficiently small absolute constants $c$ and $c'$. Letting $\varepsilon\to 0$ then shows that the excess risk can asymptotically decrease no faster than $\sqrt{\tfrac{kj}{n}}$.

D Details for the Experiments (Section 6)

D.1 Description of datasets

Mushroom comprises 112 categorical features describing the appearance of mushrooms, with class labels corresponding to poisonous or edible. MNIST contains 28x28 pixel images of handwritten digits. Skin_Nonskin consists of RGB values given as 3 numerical features, used to predict whether a pixel shows skin or not. Lastly, Covtype consists of a mix of categorical and numerical features used to predict seven different forest cover types. In the main body, we focus on Covtype because of its large number of points.

D.2 Description of algorithms

Center-based clustering. For each experiment, we use an expectation-maximization (EM) type algorithm. Given a solution $S$, we first assign every point to its closest center and subsequently recompute each center. For the case $z = 2$, we do this analytically; in this case the EM algorithm is more commonly known as Lloyd's method [59]. For the cases $z\in\{1,3,4\}$, the new center is obtained via gradient descent. The initial centers are chosen via $D^z$ sampling, i.e. sampling centers proportionately to the $z$-th power of the distance between a point and its closest center (for $z = 2$ this is the k-means++ algorithm by [6]). We wrote all of the code using Python 3 and utilized the PyTorch library for the implementations based on gradient descent. Specifically, we employed the AdamW optimizer with a learning rate of 0.01 to compute the minimizing center. All experiments were conducted on a machine equipped with a single NVIDIA RTX 2080 GPU.
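A minimal NumPy sketch of the seeding and assignment steps described above is given here. It only covers $D^z$ seeding and the exact centroid update for $z = 2$; the gradient-descent update for the other values of $z$ and the PyTorch/AdamW details are not reproduced, and all function and variable names are ours.

```python
import numpy as np

def dz_seeding(P, k, z, rng):
    """D^z seeding: pick centers proportional to the z-th power of the distance
    to the closest center chosen so far (k-means++ for z = 2)."""
    centers = [P[rng.integers(len(P))]]
    for _ in range(k - 1):
        dists = np.min(np.linalg.norm(P[:, None, :] - np.array(centers)[None, :, :], axis=2), axis=1)
        probs = dists**z
        probs /= probs.sum()
        centers.append(P[rng.choice(len(P), p=probs)])
    return np.array(centers)

def lloyd_step(P, centers):
    """One EM step for z = 2: assign each point to its closest center, then recompute the means."""
    assign = np.argmin(np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2), axis=1)
    new_centers = np.array([P[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
                            for i in range(len(centers))])
    return assign, new_centers

rng = np.random.default_rng(0)
P = rng.standard_normal((500, 5))
P /= np.maximum(1.0, np.linalg.norm(P, axis=1, keepdims=True))  # keep the points in the unit ball
centers = dz_seeding(P, k=4, z=2, rng=rng)
for _ in range(10):
    assign, centers = lloyd_step(P, centers)
```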
Subspace clustering. For subspace clustering, we consider $j\in\{1,2,5\}$ to demonstrate the effect of the subspace dimension on the convergence rate, taking computational expense into consideration. Since there are no known tractable algorithms with guarantees for these problems, we initialize a solution $\mathcal U = \{U_1,\ldots,U_k\}$ by sampling $k$ orthogonal matrices of rank $j$, where the subspace for each matrix is determined via the volume sampling technique [35]. Subsequently, we run the EM algorithm. As before, the expectation step consists of finding the closest subspace for every point. For $z = 2$, the maximization step consists of finding the $j$ principal component vectors of the data matrix induced by each cluster. For the other values of $z$, it is NP-hard to even approximate the maximization step [24], so we use gradient descent to find a local optimum. Because Skin_Nonskin only has 3 features, we only evaluate the excess risk for $j\in\{1,2\}$. Due to a large computational dependency on the dimension, we do not evaluate subspaces on the MNIST dataset.

D.3 Experimental results

In this section, we provide plots of the excess risk and the parameters of the best-fit lines for each of the datasets.

Table 2: Best-fit lines for center-based clustering on Covtype (left) and Mushroom (right).

Covtype:
z   c         q1    q2
1   3·10^-2   0.44  0.54
2   4·10^-3   0.42  0.52
3   6·10^-4   0.44  0.51
4   1·10^-4   0.44  0.51

Mushroom:
z   c         q1    q2
1   1·10^-1   0.48  0.51
2   8·10^-2   0.48  0.51
3   4·10^-2   0.49  0.50
4   3·10^-2   0.49  0.50

Figure 2: Excess risk for center-based clustering on the Covtype dataset. The shaded areas indicate the maximal and minimal deviation for the respective sample sizes.

Figure 3: Excess risk for center-based clustering on the Mushroom dataset. The shaded areas indicate the maximal and minimal deviation for the respective sample sizes.

Table 3: Best-fit lines for center-based clustering on Skin_Nonskin (left) and MNIST (right).

Skin_Nonskin:
z   c         q1    q2
1   2·10^-2   0.49  0.50
2   3·10^-3   0.47  0.52
3   8·10^-4   0.46  0.53
4   2·10^-4   0.46  0.53

MNIST:
z   c         q1    q2
1   1·10^-1   0.49  0.51
2   8·10^-2   0.50  0.50
3   5·10^-2   0.50  0.50
4   3·10^-2   0.50  0.50

Figure 4: Excess risk for center-based clustering on the Skin_Nonskin dataset. The shaded areas indicate the maximal and minimal deviation for the respective sample sizes.

Figure 5: Excess risk for center-based clustering on the MNIST dataset. The shaded areas indicate the maximal and minimal deviation for the respective sample sizes.

Figure 6: Excess risk for subspace clustering on the Mushroom dataset. The shaded areas indicate min/max values.

Table 4: Best-fit lines for subspace clustering on Covtype (left) and Mushroom (right).

Covtype:
j   z   c         q1    q2
1   1   1·10^-1   0.45  0.54
1   2   2·10^-2   0.48  0.51
1   3   3·10^-4   0.46  0.53
1   4   4·10^-5   0.46  0.52
2   1   8·10^-2   0.48  0.51
2   2   2·10^-3   0.47  0.51
2   3   4·10^-5   0.46  0.53
2   4   2·10^-6   0.46  0.52
5   1   8·10^-3   0.48  0.51
5   2   5·10^-5   0.46  0.53
5   3   4·10^-7   0.47  0.52
5   4   3·10^-9   0.47  0.51

Mushroom:
z   j   c         q1    q2
1   1   1·10^-1   0.48  0.51
1   2   1·10^-1   0.48  0.51
1   5   1·10^-1   0.49  0.49
2   1   7·10^-2   0.48  0.51
2   2   6·10^-2   0.50  0.49
2   5   6·10^-2   0.49  0.48
3   1   4·10^-2   0.49  0.50
3   2   3·10^-2   0.49  0.50
3   5   3·10^-2   0.49  0.49
4   1   2·10^-2   0.49  0.50
4   2   2·10^-2   0.49  0.50
4   5   1·10^-2   0.48  0.50

Figure 7: Excess risk for subspace clustering on the Skin_Nonskin dataset. The shaded areas indicate min/max values.

Table 5: Best-fit lines for subspace clustering on Skin_Nonskin.

z   j   c         q1    q2
1   1   1·10^-2   0.48  0.50
1   2   3·10^-3   0.45  0.53
2   1   2·10^-3   0.46  0.53
2   2   2·10^-4   0.46  0.53
3   1   4·10^-4   0.46  0.53
3   2   2·10^-5   0.46  0.53
4   1   9·10^-5   0.46  0.53
4   2   3·10^-6   0.46  0.53
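The exponents reported in the tables are consistent with a power-law relationship between the excess risk and the sample size $n$. As a rough illustration only, the sketch below fits a model of the form $c\cdot n^{-q}$ by least squares in log-log space; the actual fitting procedure used for the tables and the precise roles of the two reported exponents q1 and q2 are assumptions here, and the data in the example is synthetic.

```python
import numpy as np

def fit_power_law(ns, risks):
    """Fit excess_risk ~ c * n^(-q) by linear least squares on log-log scale."""
    slope, intercept = np.polyfit(np.log(ns), np.log(risks), deg=1)
    return np.exp(intercept), -slope   # (c, q)

# synthetic example: risk decaying like 0.03 * n^(-0.5) with multiplicative noise
rng = np.random.default_rng(0)
ns = np.array([1_000, 2_000, 5_000, 10_000, 20_000, 50_000])
risks = 0.03 * ns**-0.5 * np.exp(rng.normal(0, 0.05, size=len(ns)))
c, q = fit_power_law(ns, risks)
print(f"c = {c:.3g}, q = {q:.3f}")   # q should come out close to 0.5, cf. the q-columns above
```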