# Statistical and Computational Trade-Offs in Kernel K-Means

Daniele Calandriello (LCSL, IIT & MIT, Genoa, Italy) · Lorenzo Rosasco (University of Genoa, LCSL, IIT & MIT)

**Abstract.** We investigate the efficiency of k-means in terms of both statistical and computational requirements. More precisely, we study a Nyström approach to kernel k-means. We analyze the statistical properties of the proposed method and show that it achieves the same accuracy as exact kernel k-means with only a fraction of the computations. Indeed, we prove under basic assumptions that sampling $\sqrt{n}$ Nyström landmarks suffices to greatly reduce computational costs without incurring any loss of accuracy. To the best of our knowledge this is the first result of this kind for unsupervised learning.

## 1 Introduction

Modern applications require machine learning algorithms to be accurate as well as computationally efficient, since datasets are increasing in size and dimension. Understanding the interplay and trade-offs between statistical and computational requirements is then crucial [31, 30]. In this paper, we consider this question in the context of clustering, focusing on a popular nonparametric approach, namely kernel k-means [33]. K-means is arguably one of the most common approaches to clustering and produces clusters with piece-wise linear boundaries. Its kernel version allows for nonlinear boundaries, greatly improving the flexibility of the approach. Its statistical properties have been studied [15, 24, 10], and from a computational point of view it requires manipulating an empirical kernel matrix. As for other kernel methods, this latter operation becomes unfeasible for large scale problems, and deriving approximate computations is the subject of a large body of recent work, see for example [34, 16, 29, 35, 25] and references therein. In this paper we are interested in quantifying the statistical effect of computational approximations. Arguably one could expect the latter to induce some loss of accuracy. In fact, we prove that, perhaps surprisingly, there are favorable regimes where it is possible to maintain optimal statistical accuracy while significantly reducing computational costs. While a similar phenomenon has recently been shown in supervised learning [31, 30, 12], we are not aware of similar results for other learning tasks.

Our approach is based on a Nyström method for kernel k-means, which samples a subset of training points (landmarks) used to approximate the kernel matrix [3, 34, 13, 14, 35, 25]. While there is a vast literature on the properties of Nyström methods for kernel approximation [25, 3], experience from supervised learning shows that better results can be derived by focusing on the task of interest, see the discussion in [7]. The properties of Nyström approximations for k-means have recently been studied in [35, 25]; these works focus only on the computational aspect of the problem, and provide fast methods that achieve an empirical cost only a multiplicative factor larger than the optimal one. Our analysis is aimed at combining both statistical and computational results. Towards this end we derive a novel additive bound on the empirical cost that can be used to bound the true object of interest: the expected cost.
This result can be combined with probabilistic results to show that optimal statistical accuracy can be obtained considering only $O(\sqrt{n})$ Nyström landmark points, where $n$ is the number of training points. Moreover, we show similar bounds not only for the optimal solution, which is hard to compute in general, but also for approximate solutions that can be computed efficiently using k-means++. From a computational point of view this leads to massive improvements, reducing the memory complexity from $O(n^2)$ to $O(n\sqrt{n})$. Experimental results on large scale datasets confirm and illustrate our findings.

The rest of the paper is organized as follows. We first overview kernel k-means, and introduce our approximate kernel k-means approach based on Nyström embeddings. We then prove our statistical and computational guarantees and empirically validate them. Finally, we present some limits of our analysis, and open questions.

## 2 Background

**Notation.** Given an input space $\mathcal{X}$, a sampling distribution $\mu$, and $n$ samples $\{x_i\}_{i=1}^n$ drawn i.i.d. from $\mu$, we denote with $\mu_n(A) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{x_i \in A\}$ the empirical distribution. Once the data has been sampled, we use the feature map $\varphi(\cdot): \mathcal{X} \to \mathcal{H}$ to map $\mathcal{X}$ into a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ [32], which we assume separable, such that for any $x \in \mathcal{X}$ we have $\phi = \varphi(x)$. Intuitively, in the rest of the paper the reader can assume that $\phi \in \mathbb{R}^D$ with $D \gg n$ or even infinite. Using the kernel trick [2] we also know that $\phi^{\top}\phi' = K(x, x')$, where $K$ is the kernel function associated with $\mathcal{H}$ and $\phi^{\top}\phi' = \langle \phi, \phi' \rangle_{\mathcal{H}}$ is a short-hand for the inner product in $\mathcal{H}$. With a slight abuse of notation we also define the norm $\|\phi\|^2 = \phi^{\top}\phi$, and assume that $\|\varphi(x)\|^2 \le \kappa^2$ for any $x \in \mathcal{X}$. Using $\phi_i = \varphi(x_i)$, we denote with $\mathcal{D} = \{\phi_i\}_{i=1}^n$ the input dataset. We also represent the dataset as the map $\Phi_n = [\phi_1, \dots, \phi_n]: \mathbb{R}^n \to \mathcal{H}$ with $\phi_i$ as its $i$-th column. We denote with $K_n = \Phi_n^{\top}\Phi_n$ the empirical kernel matrix with entries $[K_n]_{i,j} = k_{i,j}$. Finally, given $\Phi_n$ we denote with $\Pi_n = \Phi_n\Phi_n^{\top}(\Phi_n\Phi_n^{\top})^{+}$ the orthogonal projection matrix onto the span $\mathcal{H}_n = \mathrm{Im}(\Phi_n)$ of the dataset.

**K-means objective.** Given our dataset, we are interested in partitioning it into $k$ disjoint clusters, each characterized by its centroid $c_j$. The Voronoi cell associated with a centroid $c_j$ is defined as the set $\mathcal{C}_j := \{i : j = \arg\min_{s \in [k]} \|\phi_i - c_s\|^2\}$; in other words, a point $\phi_i \in \mathcal{D}$ belongs to the $j$-th cluster if $c_j$ is its closest centroid. Let $\mathcal{C} = [c_1, \dots, c_k]$ be a collection of $k$ centroids from $\mathcal{H}$. We can now formalize the criterion we use to measure clustering quality.

**Definition 1.** The empirical and expected squared norm criteria are defined as
$$W(\mathcal{C}, \mu_n) := \frac{1}{n}\sum_{i=1}^n \min_{j=1,\dots,k} \|\phi_i - c_j\|^2, \qquad W(\mathcal{C}, \mu) := \mathbb{E}_{\phi \sim \mu}\Big[\min_{j=1,\dots,k} \|\phi - c_j\|^2\Big].$$
The empirical risk minimizer (ERM) is defined as $\mathcal{C}_n := \arg\min_{\mathcal{C} \in \mathcal{H}^k} W(\mathcal{C}, \mu_n)$.

The sub-script $n$ in $\mathcal{C}_n$ indicates that it minimizes $W(\mathcal{C}, \mu_n)$ for the $n$ samples in $\mathcal{D}$. Biau et al. [10] give a bound on the excess risk of the empirical risk minimizer.

**Proposition 1 ([10]).** The excess risk $\mathcal{E}(\mathcal{C}_n)$ of the empirical risk minimizer $\mathcal{C}_n$ satisfies
$$\mathcal{E}(\mathcal{C}_n) := \mathbb{E}_{\mathcal{D} \sim \mu}\big[W(\mathcal{C}_n, \mu)\big] - W^*(\mu) \le O\big(k/\sqrt{n}\big),$$
where $W^*(\mu) := \inf_{\mathcal{C} \in \mathcal{H}^k} W(\mathcal{C}, \mu)$ is the optimal clustering risk.

From a theoretical perspective, this result is only a factor $\sqrt{k}$ larger than a corresponding $O(\sqrt{k/n})$ lower bound [18], and therefore shows that the ERM $\mathcal{C}_n$ achieves an excess risk that is optimal in $n$.
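For concreteness, the following minimal NumPy sketch (our illustration, not part of the original paper) evaluates the empirical criterion $W(\mathcal{C}, \mu_n)$ of Definition 1 under the simplifying assumption that an explicit, finite-dimensional feature map is available.

```python
import numpy as np

def empirical_cost(Phi, C):
    """Empirical squared-norm criterion W(C, mu_n).

    Phi : (n, D) array, rows are explicit feature vectors phi_i
          (illustration only; in a general RKHS these are not available).
    C   : (k, D) array, rows are candidate centroids.
    Returns (1/n) * sum_i min_j ||phi_i - c_j||^2.
    """
    # Squared distances between every point and every centroid: shape (n, k).
    sq_dist = ((Phi[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.min(axis=1).mean()

# Toy usage: a random dataset and k = 3 centroids picked among the points.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 5))
C = Phi[rng.choice(100, size=3, replace=False)]
print(empirical_cost(Phi, C))
```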
From a computational perspective, Definition 1 cannot be directly used to compute $\mathcal{C}_n$, since the points $\phi_i \in \mathcal{H}$ cannot be directly represented. Nonetheless, due to the properties of the squared norm criterion, each $c_j \in \mathcal{C}_n$ must be the mean of all $\phi_i$ associated with that center, i.e., $\mathcal{C}_n$ belongs to $\mathcal{H}_n$. Therefore, it can be explicitly represented as a sum of the $\phi_i$ points included in the $j$-th cluster, i.e., all the points in the $j$-th Voronoi cell $\mathcal{C}_j$. Let $\mathcal{V}$ be the space of all possible disjoint partitions $[\mathcal{C}_1, \dots, \mathcal{C}_k]$. We can use this fact, together with the kernel trick, to reformulate the objective $W(\cdot, \mu_n)$.

**Proposition 2 ([17]).** We can rewrite the objective as
$$\min_{\mathcal{C} \in \mathcal{H}^k} W(\mathcal{C}, \mu_n) = \frac{1}{n}\min_{\mathcal{V}} \sum_{j=1}^{k} \sum_{i \in \mathcal{C}_j} \Big\|\phi_i - \tfrac{1}{|\mathcal{C}_j|}\Phi_{\mathcal{C}_j}\mathbf{1}_{|\mathcal{C}_j|}\Big\|^2, \quad \text{with} \quad \Big\|\phi_i - \tfrac{1}{|\mathcal{C}_j|}\Phi_{\mathcal{C}_j}\mathbf{1}_{|\mathcal{C}_j|}\Big\|^2 = k_{i,i} - \tfrac{2}{|\mathcal{C}_j|}\sum_{s \in \mathcal{C}_j} k_{i,s} + \tfrac{1}{|\mathcal{C}_j|^2}\sum_{s,s' \in \mathcal{C}_j} k_{s,s'}.$$

While the combinatorial search over $\mathcal{V}$ can now be explicitly computed and optimized using the kernel matrix $K_n$, it remains highly inefficient to do so. In particular, simply constructing and storing $K_n$ takes $O(n^2)$ time and space and does not scale to large datasets.

## 3 Algorithm

A simple approach to reduce the computational cost is to use approximate embeddings, which replace the map $\varphi(\cdot)$ and the points $\phi_i = \varphi(x_i) \in \mathcal{H}$ with a finite-dimensional approximation $\widetilde{\phi}_i = \widetilde{\varphi}(x_i) \in \mathbb{R}^m$.

**Nyström kernel k-means.** Given a dataset $\mathcal{D}$, we denote with $\mathcal{I} = \{\phi_j\}_{j=1}^m$ a dictionary (i.e., subset) of $m$ points $\phi_j$ from $\mathcal{D}$, and with $\Phi_m: \mathbb{R}^m \to \mathcal{H}$ the map with these points as columns. These points act as landmarks [36], inducing a smaller space $\mathcal{H}_m = \mathrm{Im}(\Phi_m)$ spanned by the dictionary. As we will see in the next section, $\mathcal{I}$ should be chosen so that $\mathcal{H}_m$ is close to the whole span $\mathcal{H}_n = \mathrm{Im}(\Phi_n)$ of the dataset. Let $K_{m,m} \in \mathbb{R}^{m \times m}$ be the empirical kernel matrix between all points in $\mathcal{I}$, and denote with
$$\Pi_m = \Phi_m\Phi_m^{\top}(\Phi_m\Phi_m^{\top})^{+} = \Phi_m K_{m,m}^{+}\Phi_m^{\top} \tag{1}$$
the orthogonal projection onto $\mathcal{H}_m$. Then we can define an approximate ERM over $\mathcal{H}_m$ as
$$\mathcal{C}_{n,m} = \arg\min_{\mathcal{C} \in \mathcal{H}_m^k} \frac{1}{n}\sum_{i=1}^n \min_{j \in [k]} \|\phi_i - c_j\|^2 = \arg\min_{\mathcal{C} \in \mathcal{H}_m^k} \frac{1}{n}\sum_{i=1}^n \min_{j \in [k]} \|\Pi_m(\phi_i - c_j)\|^2, \tag{2}$$
since any component outside of $\mathcal{H}_m$ is just a constant in the minimization. Note that the centroids $\mathcal{C}_{n,m}$ are still points in $\mathcal{H}_m \subseteq \mathcal{H}$, and we cannot directly compute them. Instead, we can use the eigendecomposition $K_{m,m} = U\Lambda U^{\top}$ to rewrite $\Pi_m = \Phi_m U\Lambda^{-1/2}\Lambda^{-1/2}U^{\top}\Phi_m^{\top}$. Defining now $\widetilde{\varphi}(\cdot) = \Lambda^{-1/2}U^{\top}\Phi_m^{\top}\varphi(\cdot)$, we have a finite-rank embedding into $\mathbb{R}^m$. Substituting in Eq. (2),
$$\|\Pi_m(\phi_i - c_j)\|^2 = \|\Lambda^{-1/2}U^{\top}\Phi_m^{\top}(\phi_i - c_j)\|^2 = \|\widetilde{\phi}_i - \Lambda^{-1/2}U^{\top}\Phi_m^{\top}c_j\|^2,$$
where $\widetilde{\phi}_i := \Lambda^{-1/2}U^{\top}\Phi_m^{\top}\phi_i$ are the embedded points. Replacing $\widetilde{c}_j := \Lambda^{-1/2}U^{\top}\Phi_m^{\top}c_j$ and searching over $\widetilde{\mathcal{C}} \in \mathbb{R}^{m \times k}$ instead of searching over $\mathcal{C} \in \mathcal{H}_m^k$, we obtain (similarly to Proposition 2)
$$\widetilde{\mathcal{C}}_{n,m} = \arg\min_{\widetilde{\mathcal{C}} \in \mathbb{R}^{m \times k}} \frac{1}{n}\sum_{i=1}^n \min_{j \in [k]} \|\widetilde{\phi}_i - \widetilde{c}_j\|^2, \tag{3}$$
where we do not need to resort to kernel tricks, but can use the $m$-dimensional embeddings $\widetilde{\phi}_i$ to explicitly compute the centroid $\frac{1}{|\mathcal{C}_j|}\sum_{s \in \mathcal{C}_j}\widetilde{\phi}_s$ of each cluster. Eq. (3) can now be solved in multiple ways. The most straightforward is to run a parametric k-means algorithm to compute $\widetilde{\mathcal{C}}_{n,m}$, and then invert the relationship $\widetilde{c}_j = \Lambda^{-1/2}U^{\top}\Phi_m^{\top}c_j$ to bring the solution back to $\mathcal{H}_m$, i.e., $\mathcal{C}_{n,m} = \Phi_m U\Lambda^{-1/2}\widetilde{\mathcal{C}}_{n,m}$. This can be done in $O(nm)$ space and $O(nmkt + nm^2)$ time using $t$ steps of Lloyd's algorithm [22] for $k$ clusters. More in detail, computing the embeddings $\widetilde{\phi}_i$ is a one-off cost taking $nm^2$ time. Once the $m$-rank Nyström embeddings $\widetilde{\phi}_i$ are computed, they can be stored and manipulated in $nm$ time and space, with an $n/m$ improvement over the $n^2$ time and space required to construct $K_n$.

### 3.1 Uniform sampling for dictionary construction

Due to its derivation, the computational cost of Algorithm 1 depends on the size $m$ of the dictionary $\mathcal{I}$.
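To make this cost concrete, here is a minimal NumPy sketch (our illustration under a Gaussian kernel, not the authors' code) of the embedding step of Algorithm 1 for a fixed dictionary; its runtime is dominated by the $O(nm^2)$ products that produce the $n \times m$ embedding matrix.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); naive O(|X| |Y| d) construction."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def nystrom_embed(X, landmark_idx, sigma, tol=1e-10):
    """Nystrom embeddings phi_tilde_i = Lambda^{-1/2} U^T K_{m,i} for all points.

    X            : (n, d) raw inputs.
    landmark_idx : indices of the m dictionary points in I.
    Returns an (n, m') array, with m' <= m after discarding near-zero eigenvalues.
    """
    Xm = X[landmark_idx]
    K_mm = gaussian_kernel(Xm, Xm, sigma)      # kernel among landmarks
    lam, U = np.linalg.eigh(K_mm)              # K_mm = U diag(lam) U^T
    keep = lam > tol                           # pseudo-inverse of Lambda^{1/2}
    proj = U[:, keep] / np.sqrt(lam[keep])     # U Lambda^{-1/2}, shape (m, m')
    K_mn = gaussian_kernel(Xm, X, sigma)       # (m, n) cross-kernel
    return (proj.T @ K_mn).T                   # (n, m') embedded points

# Any parametric k-means (e.g. Lloyd with k-means++ seeding) can then be run
# directly on the rows of the returned matrix.
```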
Therefore, for computational reasons we would prefer to select as small a dictionary as possible. As a conflicting goal, we also wish to optimize $W(\cdot, \mu_n)$ well, which requires a $\widetilde{\varphi}(\cdot)$ and $\mathcal{I}$ rich enough to approximate $W(\cdot, \mu_n)$ well.

**Algorithm 1: Nyström kernel k-means.**
Input: dataset $\mathcal{D} = \{\phi_i\}_{i=1}^n$, dictionary $\mathcal{I} = \{\phi_j\}_{j=1}^m$ with points from $\mathcal{D}$, number of clusters $k$.
1. Compute the kernel matrix $K_{m,m} = \Phi_m^{\top}\Phi_m$ between all points in $\mathcal{I}$.
2. Compute the eigenvectors $U$ and eigenvalues $\Lambda$ of $K_{m,m}$.
3. For each point $\phi_i$, compute the embedding $\widetilde{\phi}_i = \Lambda^{-1/2}U^{\top}\Phi_m^{\top}\phi_i = \Lambda^{-1/2}U^{\top}K_{m,i} \in \mathbb{R}^m$.
4. Compute the optimal centroids $\widetilde{\mathcal{C}}_{n,m} \in \mathbb{R}^{m \times k}$ on the embedded dataset $\widetilde{\mathcal{D}} = \{\widetilde{\phi}_i\}_{i=1}^n$.
5. Compute the explicit representation of the centroids $\mathcal{C}_{n,m} = \Phi_m U\Lambda^{-1/2}\widetilde{\mathcal{C}}_{n,m}$.

Let $\Pi_m^{\perp}$ be the projection orthogonal to $\mathcal{H}_m$. Then, when $c_i \in \mathcal{H}_m$,
$$\|\phi_i - c_i\|^2 = \|(\Pi_m + \Pi_m^{\perp})(\phi_i - c_i)\|^2 = \|\Pi_m(\phi_i - c_i)\|^2 + \|\Pi_m^{\perp}\phi_i\|^2.$$
We will now introduce the concept of a $\gamma$-preserving dictionary $\mathcal{I}$ to control the quantity $\|\Pi_m^{\perp}\phi_i\|^2$.

**Definition 2.** We define the subspace $\mathcal{H}_m$ and dictionary $\mathcal{I}$ as $\gamma$-preserving w.r.t. the space $\mathcal{H}_n$ if
$$\Pi_m^{\perp} = \Pi_n - \Pi_m \preceq \frac{\gamma}{1-\varepsilon}\big(\Phi_n\Phi_n^{\top} + \gamma\Pi_n\big)^{-1}. \tag{4}$$

Notice that the inverse $(\Phi_n\Phi_n^{\top} + \gamma\Pi_n)^{-1}$ on the right-hand side of the inequality is crucial to control the error $\|\Pi_m^{\perp}\phi_i\|^2 \le \gamma\,\phi_i^{\top}(\Phi_n\Phi_n^{\top} + \gamma\Pi_n)^{-1}\phi_i$. In particular, since $\phi_i \in \mathrm{Im}(\Phi_n)$, we have that in the worst case the error is bounded as $\phi_i^{\top}(\Phi_n\Phi_n^{\top} + \gamma\Pi_n)^{-1}\phi_i \le \phi_i^{\top}(\phi_i\phi_i^{\top})^{+}\phi_i \le 1$. Conversely, since $\lambda_{\max}(\Phi_n\Phi_n^{\top}) \le \kappa^2 n$, we know that in the best case the error can be reduced down to $1/n \approx \phi_i^{\top}\phi_i/\lambda_{\max}(\Phi_n\Phi_n^{\top}) \le \phi_i^{\top}(\Phi_n\Phi_n^{\top} + \gamma\Pi_n)^{-1}\phi_i$. Note that the directions associated with the larger eigenvalues are the ones that occur most frequently in the data. As a consequence, Definition 2 guarantees that the overall error across the whole dataset remains small. In particular, we can control the residual $\Pi_m^{\perp}\Phi_n$ after the projection as $\|\Pi_m^{\perp}\Phi_n\|^2 \le \gamma\,\|\Phi_n^{\top}(\Phi_n\Phi_n^{\top} + \gamma\Pi_n)^{-1}\Phi_n\| \le \gamma$.

To construct $\gamma$-preserving dictionaries we focus on a uniform random sampling approach [7]. Uniform sampling is historically the first [36], and usually the simplest, approach used to construct $\mathcal{I}$. Leveraging results from the literature [7, 14, 25], we can show that uniformly sampling $\widetilde{O}(n/\gamma)$ landmarks generates a $\gamma$-preserving dictionary with high probability.

**Lemma 1.** For a given $\gamma$, construct $\mathcal{I}$ by uniformly sampling $m \ge 12\kappa^2 (n/\gamma)\log(n/\delta)/\varepsilon^2$ landmarks from $\mathcal{D}$. Then w.p. at least $1 - \delta$ the dictionary $\mathcal{I}$ is $\gamma$-preserving.

Musco and Musco [25] obtain a similar result, but instead of considering the operator $\Pi_n$ they focus on the finite-dimensional eigenvectors of $K_n$. Moreover, their $\Pi_n \preceq \Pi_m + \frac{\varepsilon\gamma}{1-\varepsilon}(\Phi_n\Phi_n^{\top})^{+}$ bound is weaker and would not be sufficient to satisfy our definition of a $\gamma$-preserving dictionary. A result equivalent to Lemma 1 was obtained by Alaoui and Mahoney [3], but they also only focus on the finite-dimensional eigenvectors of $K_n$, and did not investigate the implications for $\mathcal{H}$.

**Proof sketch of Lemma 1.** It is well known [7, 14] that uniformly sampling $O\big((n/\gamma)\varepsilon^{-2}\log(n/\delta)\big)$ points with replacement is sufficient to obtain, w.p. $1 - \delta$, the following guarantees on $\Phi_m$:
$$(1-\varepsilon)\Phi_n\Phi_n^{\top} - \varepsilon\gamma\Pi_n \preceq \frac{n}{m}\Phi_m\Phi_m^{\top} \preceq (1+\varepsilon)\Phi_n\Phi_n^{\top} + \varepsilon\gamma\Pi_n,$$
which implies
$$\Big(\frac{n}{m}\Phi_m\Phi_m^{\top} + \gamma\Pi_n\Big)^{-1} \preceq \big((1-\varepsilon)\Phi_n\Phi_n^{\top} - \varepsilon\gamma\Pi_n + \gamma\Pi_n\big)^{-1} = \frac{1}{1-\varepsilon}\big(\Phi_n\Phi_n^{\top} + \gamma\Pi_n\big)^{-1}.$$
We can now rewrite $\Pi_n$ as
$$\Pi_n = \Big(\frac{n}{m}\Phi_m\Phi_m^{\top} + \gamma\Pi_n\Big)\Big(\frac{n}{m}\Phi_m\Phi_m^{\top} + \gamma\Pi_n\Big)^{-1} = \frac{n}{m}\Phi_m\Phi_m^{\top}\Big(\frac{n}{m}\Phi_m\Phi_m^{\top} + \gamma\Pi_n\Big)^{-1} + \gamma\Big(\frac{n}{m}\Phi_m\Phi_m^{\top} + \gamma\Pi_n\Big)^{-1} \preceq \frac{n}{m}\Phi_m\Phi_m^{\top}\Big(\frac{n}{m}\Phi_m\Phi_m^{\top}\Big)^{+} + \frac{\gamma}{1-\varepsilon}\big(\Phi_n\Phi_n^{\top} + \gamma\Pi_n\big)^{-1} = \Pi_m + \frac{\gamma}{1-\varepsilon}\big(\Phi_n\Phi_n^{\top} + \gamma\Pi_n\big)^{-1},$$
which is exactly the condition of Definition 2.

In other words, using uniform sampling we can reduce the size of the search space $\mathcal{H}_m$ by a $1/\gamma$ factor (from $n$ to $m \approx n/\gamma$) in exchange for a $\gamma$ additive error, resulting in a computation/approximation trade-off that is linear in $\gamma$.
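As a concrete illustration of Lemma 1, the following snippet (a sketch using the constants from the statement, which are conservative; in practice a much smaller, directly tuned $m$ is typically used, cf. Section 5) computes the landmark budget for a target $\gamma$ and samples the dictionary uniformly with replacement.

```python
import numpy as np

def uniform_dictionary(n, gamma, kappa2=1.0, eps=0.5, delta=0.05, seed=0):
    """Uniformly sample a gamma-preserving dictionary, sized as in Lemma 1.

    Lemma 1 asks for m >= 12 * kappa^2 * (n / gamma) * log(n / delta) / eps^2
    landmarks sampled with replacement; the constants are loose and only
    indicative of the O~(n / gamma) scaling.
    """
    m = int(np.ceil(12 * kappa2 * (n / gamma) * np.log(n / delta) / eps ** 2))
    rng = np.random.default_rng(seed)
    return rng.choice(n, size=m, replace=True)  # indices of the landmark points

# For the sqrt(n)-preserving regime of Theorem 1: gamma = sqrt(n), so m = O~(sqrt(n)).
n = 60000
landmarks = uniform_dictionary(n, gamma=np.sqrt(n))
print(len(landmarks))
```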
## 4 Theoretical analysis

Exploiting the error bound for $\gamma$-preserving dictionaries, we are now ready for the main result of this paper: showing that we can improve the computational aspect of kernel k-means using Nyström embeddings, while maintaining optimal generalization guarantees.

**Theorem 1.** Given a $\gamma$-preserving dictionary,
$$\mathcal{E}(\mathcal{C}_{n,m}) = \mathbb{E}_{\mathcal{D}\sim\mu}\big[W(\mathcal{C}_{n,m}, \mu)\big] - W^*(\mu) \le O\Big(k\Big(\frac{1}{\sqrt{n}} + \frac{\gamma}{n}\Big)\Big).$$

From a statistical point of view, Theorem 1 shows that if $\mathcal{I}$ is $\gamma$-preserving, the ERM in $\mathcal{H}_m$ achieves the same excess risk as the exact ERM from $\mathcal{H}_n$ up to an additional $\gamma/n$ error. Therefore, choosing $\gamma = \sqrt{n}$, the solution $\mathcal{C}_{n,m}$ achieves the $O(k/\sqrt{n}) + O(k\sqrt{n}/n) \le O(k/\sqrt{n})$ generalization of the exact ERM [10]. From a computational point of view, Lemma 1 shows that we can construct a $\sqrt{n}$-preserving dictionary simply by sampling $\widetilde{O}(\sqrt{n})$ points uniformly¹, which greatly reduces the embedding size from $n$ to $\sqrt{n}$, and the total required space from $n^2$ to $\widetilde{O}(n\sqrt{n})$. Time-wise, the bottleneck becomes the construction of the embeddings $\widetilde{\phi}_i$, which takes $nm^2 \le \widetilde{O}(n^2)$ time, while each iteration of Lloyd's algorithm only requires $nm \le \widetilde{O}(n\sqrt{n})$ time. In the full generality of our setting this is practically optimal, since computing a $\sqrt{n}$-preserving dictionary is in general as hard as matrix multiplication [26, 9], which requires $\Omega(n^2)$ time. In other words, unlike the case of space complexity, there is no free lunch for time complexity, which in the worst case must scale as $n^2$, as in the exact case. Nonetheless, embedding the points is an embarrassingly parallel problem that can be easily distributed, while in practice it is usually the execution of Lloyd's algorithm that dominates the runtime.

Finally, when the dataset satisfies certain regularity conditions, the size of $\mathcal{I}$ can be improved, which reduces both embedding and clustering runtime. Denote with $d^n_{\mathrm{eff}}(\gamma) = \mathrm{Tr}\big(K_n(K_n + \gamma I_n)^{-1}\big)$ the so-called effective dimension [3] of $K_n$. Since $\mathrm{Tr}\big(K_n(K_n + \gamma I_n)^{-1}\big) \le \mathrm{Tr}\big(K_n K_n^{+}\big)$, we have that $d^n_{\mathrm{eff}}(\gamma) \le r := \mathrm{Rank}(K_n)$, and therefore $d^n_{\mathrm{eff}}(\gamma)$ can be seen as a soft version of the rank. When $d^n_{\mathrm{eff}}(\gamma) \ll n$ it is possible to construct a $\gamma$-preserving dictionary with only $d^n_{\mathrm{eff}}(\gamma)$ landmarks in $\widetilde{O}(n\,d^n_{\mathrm{eff}}(\gamma)^2)$ time using specialized algorithms [14] (see Section 6). In this case, the embedding step would require only $\widetilde{O}(n\,d^n_{\mathrm{eff}}(\gamma)^2) \le \widetilde{O}(n^2)$, improving both time and space complexity. Moreover, to the best of our knowledge, this is the first example of an unsupervised non-parametric problem where it is always (i.e., without assumptions on $\mu$) possible to preserve the optimal $O(1/\sqrt{n})$ risk rate while reducing the search from the whole space $\mathcal{H}$ to a smaller subspace $\mathcal{H}_m$.

**Proof sketch of Theorem 1.** We can separate the difference $W(\mathcal{C}_{n,m}, \mu) - W(\mathcal{C}_n, \mu)$ into a component that depends on how close $\mu$ is to $\mu_n$, bounded using Proposition 1, and a component $W(\mathcal{C}_{n,m}, \mu_n) - W(\mathcal{C}_n, \mu_n)$ that depends on the distance between $\mathcal{H}_n$ and $\mathcal{H}_m$.

**Lemma 2.** Given a $\gamma$-preserving dictionary,
$$W(\mathcal{C}_{n,m}, \mu_n) - W(\mathcal{C}_n, \mu_n) \le \min\big(k,\, d^n_{\mathrm{eff}}(\gamma)\big)\,\frac{1}{1-\varepsilon}\,\frac{\gamma}{n}.$$

To show this we can rewrite the objective as (see [17])
$$W(\mathcal{C}_{n,m}, \mu_n) = \frac{1}{n}\big\|\Phi_n - \Pi_m\Phi_n S_{n,m}\big\|_F^2 \le \frac{1}{n}\big\|\Phi_n - \Pi_m\Phi_n S_n\big\|_F^2 = \frac{1}{n}\mathrm{Tr}\big(\Phi_n^{\top}\Phi_n - S_n\Phi_n^{\top}\Pi_m\Phi_n S_n\big),$$
where $S_n \in \mathbb{R}^{n \times n}$ is the rank-$k$ projection matrix associated with the exact clustering $\mathcal{C}_n$, $S_{n,m}$ is its counterpart for $\mathcal{C}_{n,m}$, and the inequality follows from the optimality of $S_{n,m}$ for the projected problem. Then, using Definition 2, we have $\Pi_m \succeq \Pi_n - \frac{\gamma}{1-\varepsilon}(\Phi_n\Phi_n^{\top} + \gamma\Pi_n)^{-1}$ and we obtain an additive error bound
$$\frac{1}{n}\mathrm{Tr}\big(\Phi_n^{\top}\Phi_n - S_n\Phi_n^{\top}\Pi_m\Phi_n S_n\big) \le \frac{1}{n}\mathrm{Tr}\big(\Phi_n^{\top}\Phi_n - S_n\Phi_n^{\top}\Phi_n S_n\big) + \frac{\gamma}{(1-\varepsilon)n}\mathrm{Tr}\big(S_n\Phi_n^{\top}(\Phi_n\Phi_n^{\top} + \gamma\Pi_n)^{-1}\Phi_n S_n\big) = W(\mathcal{C}_n, \mu_n) + \frac{\gamma}{(1-\varepsilon)n}\mathrm{Tr}\big(S_n\Phi_n^{\top}(\Phi_n\Phi_n^{\top} + \gamma\Pi_n)^{-1}\Phi_n S_n\big).$$

¹ $\widetilde{O}$ hides logarithmic dependencies on $n$ and $m$.
Since $\|\Phi_n^{\top}(\Phi_n\Phi_n^{\top} + \gamma\Pi_n)^{-1}\Phi_n\| \le 1$, $S_n$ is a projection matrix, and $\mathrm{Tr}(S_n) = k$, we have
$$\frac{\gamma}{(1-\varepsilon)n}\mathrm{Tr}\big(S_n\Phi_n^{\top}(\Phi_n\Phi_n^{\top} + \gamma\Pi_n)^{-1}\Phi_n S_n\big) \le \frac{\gamma}{(1-\varepsilon)n}\mathrm{Tr}(S_n S_n) = \frac{\gamma k}{(1-\varepsilon)n}.$$
Conversely, if we focus on the matrix $\Phi_n^{\top}(\Phi_n\Phi_n^{\top} + \gamma\Pi_n)^{-1}\Phi_n \preceq \Pi_n$, we have
$$\frac{\gamma}{(1-\varepsilon)n}\mathrm{Tr}\big(S_n\Phi_n^{\top}(\Phi_n\Phi_n^{\top} + \gamma\Pi_n)^{-1}\Phi_n S_n\big) \le \frac{\gamma}{(1-\varepsilon)n}\mathrm{Tr}\big(\Phi_n^{\top}(\Phi_n\Phi_n^{\top} + \gamma\Pi_n)^{-1}\Phi_n\big) = \frac{\gamma\, d^n_{\mathrm{eff}}(\gamma)}{(1-\varepsilon)n}.$$
Since both bounds hold simultaneously, we can simply take the minimum to conclude the proof.

We now compare the theorem with previous work. Many approximate kernel k-means methods have been proposed over the years, and they can be roughly split into two groups. Low-rank-decomposition-based methods try to directly simplify the optimization problem of Proposition 2, replacing the kernel matrix $K_n$ with an approximation $\widetilde{K}_n$ that can be stored and manipulated more efficiently. Among these methods we can mention partial decompositions [8], Nyström approximations based on uniform [36], k-means++ [27], or ridge leverage score (RLS) sampling [35, 25, 14], and random-feature approximations [6]. None of these optimization-based methods focus on the underlying excess risk problem, and their analysis cannot be easily integrated into existing results, as the approximate minimum found has no clear interpretation as a statistical ERM. Other works take the same embedding approach that we do, and directly replace the exact $\varphi(\cdot)$ with an approximate $\widetilde{\varphi}(\cdot)$, such as Nyström embeddings [36], Gaussian projections [10], and again random-feature approximations [29]. Note that these approaches also result in an approximate $\widetilde{K}_n$ that can be manipulated efficiently, but they are simpler to analyze theoretically. Unfortunately, no existing embedding-based method can guarantee at the same time optimal excess risk rates and a reduction in the size of $\mathcal{H}_m$, and therefore a reduction in computational cost.

To the best of our knowledge, the only other result providing excess risk guarantees for approximate kernel k-means is Biau et al. [10], where the authors consider the excess risk of the ERM when the approximate $\mathcal{H}_m$ is obtained using Gaussian projections. Biau et al. [10] note that the feature map $\varphi(x)$ can be expressed through an expansion over $D$ basis functions $\psi_s(x)$, with $D$ very large or infinite. Given a matrix $P \in \mathbb{R}^{m \times D}$ where each entry is a standard Gaussian random variable, [10] propose the $m$-dimensional approximate feature map $\widetilde{\varphi}(x) = P[\psi_1(x), \dots, \psi_D(x)]^{\top} \in \mathbb{R}^m$. Using the Johnson-Lindenstrauss (JL) lemma [19], they show that if $m \ge \log(n)/\nu^2$ then a multiplicative error bound of the form $W(\mathcal{C}_{n,m}, \mu_n) \le (1+\nu)W(\mathcal{C}_n, \mu_n)$ holds. Reformulating their bound, we obtain that $W(\mathcal{C}_{n,m}, \mu_n) - W(\mathcal{C}_n, \mu_n) \le \nu W(\mathcal{C}_n, \mu_n) \le \nu\kappa^2$ and $\mathcal{E}(\mathcal{C}_{n,m}) \le O(k/\sqrt{n} + \nu)$. Note that to obtain a bound comparable to Theorem 1, and if we treat $k$ as a constant, we need to take $\nu = \gamma/n$, which results in $m \gtrsim (n/\gamma)^2$. This is always worse than our $\widetilde{O}(n/\gamma)$ result for uniform Nyström embeddings. In particular, in the $1/\sqrt{n}$ risk rate setting Gaussian projections would require $\nu = 1/\sqrt{n}$, resulting in $m \ge n\log(n)$ random features, which would not bring any improvement over computing $K_n$. Moreover, when $D$ is infinite, as is usually the case in the non-parametric setting, the JL projection is not explicitly computable in general, and Biau et al. [10] must assume the existence of a computational oracle capable of constructing $\widetilde{\varphi}(\cdot)$. Finally note that, under the hood, traditional embedding methods such as those based on the JL lemma usually provide only bounds of the form $\Pi_n - \Pi_m \preceq \gamma\Pi_n$, and an error $\|\Pi_m^{\perp}\phi_i\|^2 \le \gamma\|\phi_i\|^2$ (see the discussion of Definition 2).
Therefore the error can be larger along multiple directions, and the overall error $\|\Pi_m^{\perp}\Phi_n\|^2$ across the whole dataset can be as large as $n\gamma$ rather than $\gamma$. Recent work on RLS sampling has also focused on bounding the distance $W(\mathcal{C}_{n,m}, \mu_n) - W(\mathcal{C}_n, \mu_n)$ between empirical errors. Wang et al. [35] and Musco and Musco [25] provide multiplicative error bounds of the form $W(\mathcal{C}_{n,m}, \mu_n) \le (1+\nu)W(\mathcal{C}_n, \mu_n)$ for uniform and RLS sampling. Nonetheless, they only focus on the empirical risk and do not investigate the interaction between approximation and generalization, i.e., between statistics and computations. Moreover, as we already remarked for [10], to achieve the $1/\sqrt{n}$ excess risk rate using a multiplicative error bound we would require an unreasonably small $\nu$, resulting in a large $m$ that brings no computational improvement over the exact solution. Finally, note that when [31] showed that a favourable trade-off was possible for kernel ridge regression (KRR), they strongly leveraged the fact that KRR is a $\gamma$-regularized problem: all eigenvalues and eigenvectors of the covariance matrix $\Phi_n\Phi_n^{\top}$ smaller than the regularization $\gamma$ do not significantly influence the solution. Here we show the same for kernel k-means, a problem without regularization. This hints at a deeper geometric motivation which might be at the root of both problems, and potentially similar approaches could be leveraged in other domains.

### 4.1 Further results: beyond ERM

So far we have provided guarantees for $\mathcal{C}_{n,m}$, that is, the ERM in $\mathcal{H}_m$. Although $\mathcal{H}_m$ is much smaller than $\mathcal{H}_n$, solving the optimization problem to find the ERM is still NP-hard in general [4]. Nonetheless, Lloyd's algorithm [22], when coupled with a careful k-means++ seeding, can return a good approximate solution $\mathcal{C}^{++}_{n,m}$.

**Proposition 3 ([5]).** For any dataset,
$$\mathbb{E}_A\big[W(\mathcal{C}^{++}_{n,m}, \mu_n)\big] \le 8(\log(k) + 2)\,W(\mathcal{C}_{n,m}, \mu_n),$$
where $A$ is the randomness deriving from the k-means++ initialization.

Note that, similarly to [35, 25], this is a multiplicative error bound on the empirical risk, and as discussed we cannot leverage Lemma 2 to bound the excess risk $\mathcal{E}(\mathcal{C}^{++}_{n,m})$. Nonetheless, we can still leverage Lemma 2 to bound the expected risk $W(\mathcal{C}^{++}_{n,m}, \mu)$ alone, albeit with an extra error term that scales with the optimal clustering risk $W^*(\mu)$ (see Proposition 1).

**Theorem 2.** Given a $\gamma$-preserving dictionary,
$$\mathbb{E}_{\mathcal{D}\sim\mu}\Big[\mathbb{E}_A\big[W(\mathcal{C}^{++}_{n,m}, \mu)\big]\Big] \le O\Big(\log(k)\Big(\frac{k}{\sqrt{n}} + \frac{k\gamma}{n} + W^*(\mu)\Big)\Big).$$

From a statistical perspective, we can once again set $\gamma = \sqrt{n}$ to obtain an $O(k/\sqrt{n})$ rate for the first part of the bound. Conversely, the optimal clustering risk $W^*(\mu)$ is a $\mu$-dependent quantity that cannot in general be bounded in terms of $n$, and captures how well our model, i.e., the choice of $\mathcal{H}$ and of the criterion $W(\cdot, \mu)$, matches reality. From a computational perspective, we can now bound the computational cost of finding $\mathcal{C}^{++}_{n,m}$. In particular, each iteration of Lloyd's algorithm takes only $\widetilde{O}(n\sqrt{n}k)$ time. Moreover, when k-means++ initialization is used, the expected number of iterations required for Lloyd's algorithm to converge is only logarithmic [1]. Therefore, ignoring the time required to embed the points, we can find a solution in $\widetilde{O}(n\sqrt{n}k)$ time and space instead of the $\widetilde{O}(n^2k)$ cost required by the exact method, a strong $O(\sqrt{n})$ improvement. Finally, if the data distribution satisfies some regularity assumptions, the following result follows [15].

**Corollary 1.**
If we denote by $\mathcal{X}_\mu$ the support of the distribution $\mu$ and assume $\varphi(\mathcal{X}_\mu)$ to be a $d$-dimensional manifold, then $W^*(\mu) \le d\,k^{-2/d}$, and given a $\sqrt{n}$-preserving dictionary the expected cost satisfies
$$\mathbb{E}_{\mathcal{D}\sim\mu}\Big[\mathbb{E}_A\big[W(\mathcal{C}^{++}_{n,m}, \mu)\big]\Big] \le O\Big(\log(k)\Big(\frac{k}{\sqrt{n}} + d\,k^{-2/d}\Big)\Big).$$

## 5 Experiments

We now evaluate experimentally the claims of Theorem 1, namely that sampling $\widetilde{O}(n/\gamma)$ landmarks increases the excess risk by an extra $\gamma/n$ factor, and that $m = \sqrt{n}$ is sufficient to recover the optimal rate. We use the Nystroem and MiniBatchKMeans classes from the scikit-learn Python library [28] to implement kernel k-means with Nyström embeddings (Algorithm 1), and we compute the solution $\mathcal{C}^{++}_{n,m}$; a simplified sketch of this pipeline is given at the end of this section. For our experiments we follow the same approach as Wang et al. [35], and test our algorithm on two variants of the MNIST digit dataset. In particular, MNIST60K [20] is the original MNIST dataset, whose pictures each have $d = 784$ pixels. We divide each pixel by 255, bringing each feature into the $[0, 1]$ interval. We split the dataset in two parts: $n = 60000$ samples are used to compute the $\mathcal{C}^{++}_{n,m}$ centroids, and we hold out 10000 unseen samples to compute $W(\mathcal{C}^{++}_{n,m}, \mu_{\mathrm{test}})$, as a proxy for $W(\mathcal{C}^{++}_{n,m}, \mu)$. To test the scalability of our approach we also consider the MNIST8M dataset from the infinite MNIST project [23], constructed using non-trivial transformations and corruptions of the original MNIST60K images. Here we compute $\mathcal{C}^{++}_{n,m}$ using $n = 8{,}000{,}000$ images, and compute $W(\mathcal{C}^{++}_{n,m}, \mu_{\mathrm{test}})$ on 100,000 unseen images. As in Wang et al. [35] we use a Gaussian kernel with bandwidth $\sigma = \frac{1}{n^2}\sqrt{\sum_{i,j}\|x_i - x_j\|^2}$.

**MNIST60K.** These experiments are small enough to run in less than a minute on a single laptop with 4 cores and 8 GB of RAM. The results are reported in Fig. 1.

*(Figure 1: Results for MNIST60K. Figure 2: Results for MNIST8M.)*

On the left we report in blue $W(\mathcal{C}^{++}_{n,m}, \mu_{\mathrm{test}})$, where the shaded region is a 95% confidence interval for the mean over 10 runs. As predicted, the expected cost decreases as the size of $\mathcal{H}_m$ increases, and plateaus once we reach $1/m \le 1/\sqrt{n}$, in line with the statistical error. Note that the normalized mutual information (NMI) between the true 0-9 digit classes $y$ and the computed cluster assignments $y_{n,m}$ also plateaus around $m \approx \sqrt{n}$. While this is not predicted by the theory, it strengthens the intuition that beyond a certain capacity, expanding $\mathcal{H}_m$ is computationally wasteful.

**MNIST8M.** To test the scalability of our approach, we run the same experiment on millions of points. Note that we carry out our MNIST8M experiment on a single 36-core machine with 128 GB of RAM, much less than the setup of [35], where at minimum a cluster of 8 such nodes is used. The behaviour of $W(\mathcal{C}^{++}_{n,m}, \mu_{\mathrm{test}})$ and of the NMI is similar to MNIST60K, with the increase in dataset size allowing for stronger concentration and smaller confidence intervals. Finally, note that around $m = 400$ uniformly sampled landmarks are sufficient to achieve $\mathrm{NMI}(y_{n,m}, y) = 0.405$, matching the 0.406 NMI reported by [35] for a larger $m = 1600$, although smaller than the 0.423 NMI they report for $m = 1600$ when using a slower, PCA-based method to compute the embeddings and RLS sampling to select the landmarks. Nonetheless, computing $\mathcal{C}^{++}_{n,m}$ takes less than 6 minutes on a single machine, while their best solution required more than 1.5 hours on a cluster of 32 machines.
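For reference, the experimental pipeline can be sketched with the scikit-learn classes mentioned above (a simplified reconstruction, not the authors' exact script; the bandwidth $\sigma$ is left as an input, to be set e.g. with the formula reported above, and the function name nystrom_kmeans is ours).

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.cluster import MiniBatchKMeans

def nystrom_kmeans(X_train, X_test, m, k=10, sigma=5.0, seed=0):
    """Uniform-Nystroem embedding of size m followed by k-means++ / mini-batch Lloyd.

    sigma is the Gaussian bandwidth; Section 5 derives it from the pairwise
    distances of the training set, here it is simply an input.
    """
    emb = Nystroem(kernel="rbf", gamma=1.0 / (2 * sigma ** 2),
                   n_components=m, random_state=seed)
    Z_train = emb.fit_transform(X_train)   # n x m embeddings, m uniform landmarks
    Z_test = emb.transform(X_test)
    km = MiniBatchKMeans(n_clusters=k, init="k-means++", random_state=seed)
    labels = km.fit_predict(Z_train)       # cluster assignments y_{n,m}
    # Held-out cost W(C, mu_test): mean squared distance to the closest centroid.
    test_cost = np.min(km.transform(Z_test) ** 2, axis=1).mean()
    return labels, test_cost

# Example (X_tr, X_te scaled to [0, 1] as in Section 5):
# labels, cost = nystrom_kmeans(X_tr, X_te, m=int(np.sqrt(len(X_tr))), k=10)
```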
## 6 Open questions and conclusions

Combining Lemma 1 and Lemma 2, we know that using uniform sampling we can linearly trade off a $1/\gamma$ decrease in the subspace size $m$ with a $\gamma/n$ increase in excess risk. While this is sufficient to maintain the $O(1/\sqrt{n})$ rate, it is easy to see that the same would not hold for an $O(1/n)$ rate, since we would then need to uniformly sample $n$ landmarks, losing all computational improvements. To achieve a better trade-off we must go beyond uniform sampling and use different probabilities for each sample, to capture their uniqueness and contribution to the approximation error.

**Definition 3 ([3]).** The $\gamma$-ridge leverage score (RLS) of point $i \in [n]$ is defined as
$$\tau_i(\gamma) = \phi_i^{\top}(\Phi_n\Phi_n^{\top} + \gamma\Pi_n)^{-1}\phi_i = e_i^{\top} K_n(K_n + \gamma I_n)^{-1}e_i. \tag{5}$$
The sum of the RLSs, $d^n_{\mathrm{eff}}(\gamma) = \sum_{i=1}^n \tau_i(\gamma)$, is the empirical effective dimension of the dataset (a brute-force computation of these quantities is sketched below).

Ridge leverage scores are closely connected to the residual $\|\Pi_m^{\perp}\phi_i\|^2$ after the projection $\Pi_m$ discussed in Definition 2. In particular, using Definition 2 we have that the residual can be bounded as $\|\Pi_m^{\perp}\phi_i\|^2 \le \frac{\gamma}{1-\varepsilon}\,\phi_i^{\top}(\Phi_n\Phi_n^{\top} + \gamma\Pi_n)^{-1}\phi_i$. It is easy to see that, up to a factor $\frac{\gamma}{1-\varepsilon}$, high-RLS points are also high-residual points. Therefore it is not surprising that sampling according to RLSs quickly selects any high-residual point, covers $\mathcal{H}_n$, and generates a $\gamma$-preserving dictionary.

**Lemma 3 ([11]).** For a given $\gamma$, construct $\mathcal{I}$ by sampling $m \ge 12\kappa^2 d^n_{\mathrm{eff}}(\gamma)\log(n/\delta)/\varepsilon^2$ landmarks from $\mathcal{D}$ proportionally to their RLS. Then w.p. at least $1 - \delta$ the dictionary $\mathcal{I}$ is $\gamma$-preserving.

Note that there exist datasets where the RLSs are uniform, and therefore in the worst case the two sampling approaches coincide. Nonetheless, when the data is more structured, $m \approx d^n_{\mathrm{eff}}(\gamma)$ can be much smaller than the $n/\gamma$ dictionary size required by uniform sampling. Finally, note that computing RLSs exactly also requires constructing $K_n$, taking $O(n^2)$ time and space, but in recent years a number of fast approximate RLS sampling methods [14] have emerged that can construct $\gamma$-preserving dictionaries of size $\widetilde{O}(d^n_{\mathrm{eff}}(\gamma))$ in just $\widetilde{O}(n\,d^n_{\mathrm{eff}}(\gamma)^2)$ time. Using this result, it is straightforward to sharpen the computational aspects of Theorem 1 in special cases. In particular, we can generate a $\sqrt{n}$-preserving dictionary with only $d^n_{\mathrm{eff}}(\sqrt{n})$ elements instead of the $\sqrt{n}$ required by uniform sampling. Using concentration arguments [31] we also know that w.h.p. the empirical effective dimension is at most three times the expected effective dimension, $d^n_{\mathrm{eff}}(\gamma) \le 3d^\mu_{\mathrm{eff}}(\gamma)$, a $\mu$-dependent quantity that captures the interaction between $\mu$ and the RKHS $\mathcal{H}$.

**Definition 4.** Given the expected covariance operator $\Psi := \mathbb{E}_{x\sim\mu}\big[\varphi(x)\varphi(x)^{\top}\big]$, the expected effective dimension is defined as
$$d^\mu_{\mathrm{eff}}(\gamma) = \mathbb{E}_{x\sim\mu}\Big[\varphi(x)^{\top}(\Psi + \gamma\Pi)^{-1}\varphi(x)\Big].$$
Moreover, for some constant $c$ that depends only on $\varphi(\cdot)$ and $\mu$, we assume $d^\mu_{\mathrm{eff}}(\gamma) \le c\,(n/\gamma)^{\eta}$ with $0 < \eta \le 1$.

Note that $\eta = 1$ just gives us the $d^\mu_{\mathrm{eff}}(\gamma) \le O(n/\gamma)$ worst-case upper bound that we saw for $d^n_{\mathrm{eff}}(\gamma)$, and it is always satisfied when the kernel function is bounded. If instead we have a faster spectral decay, $\eta$ can be much smaller. For example, if the eigenvalues of $\Psi$ decay polynomially as $\lambda_i = i^{-1/\eta}$, then $d^\mu_{\mathrm{eff}}(\gamma) \le c\,(n/\gamma)^{\eta}$, and for our choice $\gamma = \sqrt{n}$ we have $d^\mu_{\mathrm{eff}}(\sqrt{n}) \le c\,n^{\eta/2}$. We can now better characterize the gap between statistics and computation: using RLS sampling we can improve the computational aspect of Theorem 1 from $\sqrt{n}$ to $d^\mu_{\mathrm{eff}}(\gamma)$, but the risk rate remains $O(k/\sqrt{n})$ due to the $O(k/\sqrt{n})$ component coming from Proposition 1. Assume for a moment that we could generalize, under additional assumptions, Proposition 1 to a faster $O(1/n)$ rate. Then applying Lemma 2 with $\gamma = 1$ we would obtain a risk $\mathcal{E}(\mathcal{C}_{n,m}) \le O(k/n) + O(k/n)$. Here we see how the regularity condition on $d^\mu_{\mathrm{eff}}(1)$ becomes crucial: if $\eta = 1$, then $d^\mu_{\mathrm{eff}}(1) \le c\,n$ and there is no gain, while if $\eta < 1$ we obtain $d^\mu_{\mathrm{eff}}(1) \le c\,n^{\eta}$.
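For intuition, when $n$ is small the ridge leverage scores of Definition 3 and the empirical effective dimension can be computed exactly by brute force (an $O(n^3)$ illustration only, not the paper's method; the fast approximate samplers cited above are designed precisely to avoid forming $K_n$).

```python
import numpy as np

def ridge_leverage_scores(K, gamma):
    """Exact gamma-ridge leverage scores tau_i(gamma) = [K (K + gamma I)^{-1}]_{ii}.

    Brute force: O(n^3) time and O(n^2) memory, so only viable for small n.
    """
    n = K.shape[0]
    return np.diag(np.linalg.solve(K + gamma * np.eye(n), K))

def effective_dimension(K, gamma):
    """Empirical effective dimension d_eff(gamma) = sum_i tau_i(gamma)."""
    return ridge_leverage_scores(K, gamma).sum()

# RLS sampling draws landmarks with probability proportional to tau_i(gamma):
# tau = ridge_leverage_scores(K, gamma); p = tau / tau.sum()
# idx = np.random.default_rng(0).choice(len(tau), size=m, p=p)
```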
Adaptive rates of this kind were shown to be possible in supervised learning [31], but they seem to still be out of reach for approximate kernel k-means. One possible approach to fill this gap is to look at fast $O(1/n)$ excess risk rates for kernel k-means.

**Proposition 4 ([21], informal).** Assume that $k \ge 2$, and that $\mu$ satisfies a margin condition with radius $r_0$. If $\mathcal{C}_n$ is an empirical risk minimizer, then, with probability larger than $1 - e^{-\delta}$,
$$\mathcal{E}(\mathcal{C}_n) \le \widetilde{O}\Big(\frac{1}{n}\big(k + \log(|\mathcal{M}|)\big)\log(1/\delta)\Big),$$
where $|\mathcal{M}|$ is the cardinality of the set of all optimal (up to relabeling) clusterings.

For more details on the margin assumption, we refer the reader to the original paper [21]. Intuitively, the margin condition asks that every labeling (Voronoi grouping) associated with an optimal clustering is reflected by a large separation in $\mu$. This margin condition also acts as a counterpart of the usual margin conditions for supervised learning, where $\mu$ must have low density around the neighborhood of the critical area $\{x \mid \mu(Y = 1 \mid X = x) = 1/2\}$. Unfortunately, it is not easy to integrate Proposition 4 into our analysis, as it is not clear how the margin condition translates from $\mathcal{H}_n$ to $\mathcal{H}_m$.

**Acknowledgments.** This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216, and the Italian Institute of Technology. We gratefully acknowledge the support of NVIDIA Corporation for the donation of the Titan Xp GPUs and the Tesla K40 GPU used for this research. L. R. acknowledges the support of the AFOSR projects FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS (DLV-777826). A. R. acknowledges the support of the European Research Council (grant SEQUOIA 724063).

## References

[1] Nir Ailon, Ragesh Jaiswal, and Claire Monteleoni. Streaming k-means approximation. In Advances in Neural Information Processing Systems, pages 10-18, 2009.
[2] M. A. Aizerman, E. A. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.
[3] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel methods with statistical guarantees. In Neural Information Processing Systems, 2015.
[4] Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245-248, 2009.
[5] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027-1035. Society for Industrial and Applied Mathematics, 2007.
[6] Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In Proceedings of the 34th International Conference on Machine Learning, 2017.
[7] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory, 2013.
[8] Francis R. Bach and Michael I. Jordan. Predictive low-rank decomposition for kernel methods. In Proceedings of the 22nd International Conference on Machine Learning, pages 33-40. ACM, 2005.
[9] Arturs Backurs, Piotr Indyk, and Ludwig Schmidt.
On the fine-grained complexity of empirical risk minimization: Kernel methods and neural networks. In Advances in Neural Information Processing Systems, 2017.
[10] Gérard Biau, Luc Devroye, and Gábor Lugosi. On the performance of clustering in Hilbert spaces. IEEE Transactions on Information Theory, 54(2):781-790, 2008.
[11] Daniele Calandriello. Efficient Sequential Learning in Structured and Constrained Environments. PhD thesis, 2017.
[12] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Efficient second-order online kernel learning with adaptive embedding. In Advances in Neural Information Processing Systems, pages 6140-6150, 2017.
[13] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Second-order kernel online convex optimization with adaptive sketching. In International Conference on Machine Learning, 2017.
[14] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Distributed adaptive sampling for kernel matrix approximation. In AISTATS, 2017.
[15] Guillermo Canas, Tomaso Poggio, and Lorenzo Rosasco. Learning manifolds with k-means and k-flats. In Advances in Neural Information Processing Systems, pages 2465-2473, 2012.
[16] Radha Chitta, Rong Jin, Timothy C. Havens, and Anil K. Jain. Approximate kernel k-means: Solution to large scale kernel clustering. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 895-903. ACM, 2011.
[17] Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. A unified view of kernel k-means, spectral clustering and graph cuts. Technical Report UTCS TR-04-25, University of Texas at Austin, 2004.
[18] Siegfried Graf and Harald Luschgy. Foundations of Quantization for Probability Distributions. Lecture Notes in Mathematics. Springer-Verlag, Berlin Heidelberg, 2000. ISBN 978-3-540-67394-1.
[19] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26, 1984.
[20] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
[21] Clément Levrard et al. Nonasymptotic bounds for vector quantization in Hilbert spaces. The Annals of Statistics, 43(2):592-619, 2015.
[22] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982.
[23] Gaëlle Loosli, Stéphane Canu, and Léon Bottou. Training invariant support vector machines using selective sampling. In Large Scale Kernel Machines, pages 301-320. MIT Press, Cambridge, MA, 2007.
[24] Andreas Maurer and Massimiliano Pontil. K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839-5846, 2010.
[25] Cameron Musco and Christopher Musco. Recursive sampling for the Nyström method. In NIPS, 2017.
[26] Cameron Musco and David Woodruff. Is input sparsity time possible for kernel low-rank approximation? In Advances in Neural Information Processing Systems 30, 2017.
[27] Dino Oglic and Thomas Gärtner. Nyström method with kernel k-means++ samples as landmarks. Journal of Machine Learning Research, 2017.
[28] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
[29] Ali Rahimi and Ben Recht. Random features for large-scale kernel machines.
In Neural Information Processing Systems, 2007.
[30] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages 3218-3228, 2017.
[31] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pages 1657-1665, 2015.
[32] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[33] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319, 1998.
[34] Joel A. Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Fixed-rank approximation of a positive-semidefinite matrix from streaming data. In Advances in Neural Information Processing Systems, pages 1225-1234, 2017.