Sharper Generalization Bounds for Clustering

Shaojie Li 1 2, Yong Liu 1 2

1 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China. 2 Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China. Correspondence to: Yong Liu.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Existing generalization analysis of clustering mainly focuses on specific instantiations, such as (kernel) k-means, and a unified framework for studying clustering performance is still lacking. Besides, the existing excess clustering risk bounds are mostly of order $O(K/\sqrt{n})$ provided that the underlying distribution has bounded support, where $n$ is the sample size and $K$ is the number of clusters, or of order $O(K^2/n)$ under strong assumptions on the underlying distribution, where these assumptions are hard to verify in general. In this paper, we propose a unified clustering learning framework and investigate its excess risk bounds, obtaining state-of-the-art upper bounds under mild assumptions. Specifically, we derive sharper bounds of order $O(K^2/n)$ under mild assumptions on the covering number of the hypothesis spaces, where these assumptions are easy to verify. Moreover, for the hard clustering scheme, such as (kernel) k-means, if we only assume the hypothesis functions to be bounded, we improve the upper bounds from the order $O(K/\sqrt{n})$ to $O(\sqrt{K}/\sqrt{n})$. Furthermore, state-of-the-art bounds of faster order $O(K/n)$ are obtained under the covering number assumptions.

1. Introduction

Clustering is one of the fundamental issues in unsupervised learning and has been used in various applications (Xu & Wunsch, 2005; Von Luxburg, 2007; Jain, 2010; Shaham et al., 2018; Liu et al., 2018). In a clustering scheme, datasets are divided into several subgroups, such that data points in the same subgroup are more similar to each other than to those in other subgroups. Although clustering has been studied for decades, by contrast to the thriving of clustering algorithm design and application, the statistical theory of clustering may appear to be insufficient. The existing excess risk bounds are mostly derived for different specific instantiations of clustering learning problems, such as k-means (Thorpe et al., 2015; Tang & Monteleoni, 2016; Levrard et al., 2013; Antos, 2005), kernel k-means (Biau et al., 2008; Antos et al., 2005; Levrard et al., 2015; Levrard, 2018) or spectral clustering (Terada & Yamamoto, 2019), and a unified framework to study generalization performance for clustering learning is still lacking. Moreover, the existing excess clustering risk bounds either have a slow convergence rate or require pretty strong assumptions on the underlying distribution to get a faster convergence rate, and these assumptions are hard to verify in general. Specifically, if the distribution has bounded support, the excess risk upper bounds are mostly of order $O(K/\sqrt{n})$ (Linder, 2000; Biau et al., 2008; Maurer & Pontil, 2010). If the distribution further satisfies a strong assumption called the margin condition, a faster convergence rate of order $O(K^2/n)$ can be obtained (Chou, 1994; Antos et al., 2005; Levrard et al., 2015).
Motivated by these problems, we first propose a clustering learning framework, which is suitable for (kernel) k-means, soft k-means, spectral clustering, the neural network clustering scheme, etc., and then investigate its excess risk bounds. We start our analysis with the proposed clustering Rademacher complexity and show that the upper bound obtained by it is only of order $O(K/\sqrt{n})$ under the bounded hypothesis functions assumption. Since the Rademacher complexity (Bartlett & Mendelson, 2002) considers the worst case, we further define the more reasonable local clustering Rademacher complexity (Bartlett et al., 2005) and use it to get a basic excess risk bound with the fixed point under the same assumptions. After that, by using some assumptions on the covering number of the hypothesis function classes, we obtain suitable fixed points and get sharper excess risk upper bounds, whose convergence rates are of order $O(K^2/n)$. Note that our covering number assumptions on the hypothesis classes are mild and easy to verify compared with the margin condition (Pollard et al., 1982; Chou, 1994; Antos et al., 2005; Levrard et al., 2015; Terada & Yamamoto, 2019). Furthermore, instead of bounding the clustering Rademacher complexity by the maximum Rademacher complexity of the restrictions of the function class along each coordinate multiplied by a factor of $O(K)$, we provide a new bound for the clustering Rademacher complexity, reducing the order from $K$ to $\sqrt{K}$. The excess clustering risk bounds are therefore improved to $O(L\sqrt{K}/\sqrt{n})$ and $O(L^2K/n)$, depending on the clustering Rademacher complexity and the local clustering Rademacher complexity respectively, where $L$ denotes a Lipschitz parameter. Particularly for the hard clustering scheme, $L = 1$; hence, our excess clustering risk bounds have convergence rates of order $O(\sqrt{K}/\sqrt{n})$ and $O(K/n)$, respectively, which are state-of-the-art.

2. Related Work

In this section, we introduce the related work on excess risk bound analysis of clustering learning and on the local Rademacher complexity.

2.1. Excess Risk Bounds

The existing excess risk bounds of clustering learning are mostly derived for (kernel) k-means.

VC-dimension Bounds. The VC-dimension technique is commonly used for the finite-dimensional problem of k-means. (Linder et al., 1994; Bartlett et al., 1998; Linder, 2000) show that the excess clustering risk bounds obtained by the VC-dimension technique are of order $O(\sqrt{K}/\sqrt{n})$ provided that the underlying distribution has bounded support. In addition, the problem of quantifying how good empirically designed minimizers are, compared to the truly optimal ones, has also been extensively studied in (Linder, 2002).

Rademacher Complexity Bounds. A problem in the finite-dimensional setting is that the upper bound usually depends on the dimension $d$, while the hypothesis space of kernel k-means is typically an infinite-dimensional Hilbert space. (Biau et al., 2008; Maurer & Pontil, 2010; Canas et al., 2012; Fefferman et al., 2016; Calandriello & Rosasco, 2018; Fischer, 2010) use the Rademacher complexity technique to extend the previous results (Linder et al., 1994; Bartlett et al., 1998; Linder, 2000) and provide dimension-independent bounds for kernel k-means, which are mostly of order $O(K/\sqrt{n})$.

U-Process Bounds. (Cao et al., 2016) study the excess risk of clustering learning from the perspective of similarity learning, not only focusing on k-means.
(Clémençon, 2011) studies the hard clustering scheme, where a sample is assigned to each subgroup with probability 0 or 1. Both works model clustering learning as a pairwise learning problem and use the tool of U-processes to analyze the excess clustering risk bound. They use the symmetry of U-statistics to control the supremum of a U-process in the generalization analysis by the supremum of a Rademacher process, which can be bounded by standard techniques in the i.i.d. context, providing upper bounds of order $O(K/\sqrt{n})$.

Margin Condition Bounds. Several results (Pollard et al., 1982; Chou, 1994; Antos et al., 2005) show that the convergence rate can be improved to $O(C_K/n)$ under different sets of assumptions on the distribution in the finite-dimensional problem of k-means, where $C_K$ denotes a parameter depending on $K$. Note that the relationship between $K$ and the upper bound in their results is not clear. (Levrard et al., 2013) show that these different sets of assumptions turn out to be equivalent, in the continuous density case, to a technical condition, and also provide an upper bound of order $O(C_K/n)$. In the context of a separable Hilbert space, (Levrard et al., 2015) extend the above results and propose an assumption on the underlying distribution, called the margin condition, to satisfy the technical condition. They establish a fast convergence rate of order $O(K^2/n)$. Under the framework of the margin condition, (Levrard, 2018) refines the results in (Levrard et al., 2015) and gives an excess clustering risk upper bound of order $O(K/n)$ under another pretty strong assumption. (Terada & Yamamoto, 2019) provide a convergence rate of order $O(K^2/n)$ for normalized cut (Von Luxburg, 2007) based on the margin condition assumption (Levrard et al., 2015), which is a related work on spectral clustering. We will present the margin condition and explain why it is hard to verify in general in Section 5.2.

2.2. Local Rademacher Complexity

Local Rademacher complexity is an important tool in statistical learning theory and has been used to obtain better generalization error bounds for many important supervised learning problems (Bartlett et al., 2005; Koltchinskii et al., 2006; Liu & Liao, 2015; Liu et al., 2017b;a; Xu et al., 2016; Yousefi et al., 2018; Li et al., 2018). In the generalization analysis of clustering learning, (Levrard et al., 2015; Levrard, 2018) also apply it to obtain sharper excess risk bounds; however, their results are built on the margin condition, which leaves challenges in clustering learning since their assumptions are hard to verify in general. In this paper, we use covering number assumptions instead of the margin condition to derive sharper excess risk bounds together with the local Rademacher complexity technique. Since the covering numbers of many popular function classes are known, our assumptions are milder and easier to check.

3. A Clustering Learning Framework

In this section, after a brief description of notations, we first introduce the clustering learning framework and then present some examples to explain it.

3.1. Notations

Assume that $\mu$ is an underlying probability distribution on the feature space $\mathcal{X}$, and $S = \{x_i\}_{i=1}^{n} \in \mathcal{X}^n$ is a set of samples drawn independent and identically distributed (i.i.d.) from $\mu$, where $n$ is the sample size. The empirical distribution of $S$ is defined as $\mu_n(S) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}\{x_i \in S\}$, where $\mathbb{I}$ denotes the indicator function. The samples are typically vectors in the Euclidean space $\mathbb{R}^m$ with $m \ge 1$.
The $\ell_p$ norm on $\mathbb{R}^m$ is defined by $\|x\|_p = (\sum_{i=1}^{m}|x_i|^p)^{1/p}$ when $1 \le p < \infty$ and by $\|x\|_\infty = \max_{1 \le i \le m}|x_i|$ in the case $p = \infty$. The expectation and the variance of a random variable $X$ are denoted as $E(X)$ and $V(X)$, respectively. $\tilde{O}$ hides logarithmic terms.

3.2. Framework

Clustering aims to divide the samples into several clusters such that samples lying in the same cluster are more similar to each other than those in other clusters. Assume that a clustering task aims to partition the samples into $K$ clusters, $W : \mathcal{X}^2 \to \mathbb{R}_+$ is a pairwise distance-based function used to measure the dissimilarity between pairs of observations, and $Z = [Z_1, \ldots, Z_K]$ is a collection of $K$ partition functions $Z_k : \mathcal{X}^2 \to \mathbb{R}_+$ for $k = 1, \ldots, K$, which, together with the function $W$, is used to divide the given dataset $S$ into $K$ disjoint clusters. The clustering framework can be cast as the problem of minimizing the following criterion:

$$\hat{L}_n(W, Z) = \frac{1}{n(n-1)} \sum_{i,j=1, i \neq j}^{n} \sum_{k=1}^{K} W(x_i, x_j) Z_k(x_i, x_j), \qquad (1)$$

over all possible functions $Z_k$ for $k = 1, \ldots, K$ and $W$. Assume that $W$ and $Z_k$ for $k = 1, \ldots, K$ are symmetric, that is, for all $(x, x') \in \mathcal{X}^2$, $W(x, x') = W(x', x)$ and $Z_k(x, x') = Z_k(x', x)$ for $k = 1, \ldots, K$. To make the notation concise, let $f_{W,Z_k}(X, X') = W(X, X') Z_k(X, X')$ and let $f_{W,Z} = (f_{W,Z_1}, \ldots, f_{W,Z_K})$ be the vector-valued function collecting $f_{W,Z_1}, \ldots, f_{W,Z_K}$; then $\hat{L}_n(W, Z)$ can be written as

$$\hat{L}_n(f_{W,Z}) = \frac{1}{n(n-1)} \sum_{i,j=1, i \neq j}^{n} \sum_{k=1}^{K} f_{W,Z_k}(x_i, x_j). \qquad (2)$$

The expectation of Eq. (2) is defined as

$$L(f_{W,Z}) = E\Big[\sum_{k=1}^{K} f_{W,Z_k}(X, X')\Big], \qquad (3)$$

where $(X, X')$ is a pair of i.i.d. random variables drawn from the distribution $\mu$. Let $\mathcal{F}$ be a family of vector-valued functions $f_{W,Z}$:

$$\mathcal{F} := \big\{ f_{W,Z} \;\big|\; f_{W,Z}(X, X') \in \mathbb{R}_+^K, \; X, X' \in \mathcal{X} \big\},$$

and let $\mathcal{F}_k$ be the function class of the output coordinate $k$ of $\mathcal{F}$. Usually, $\hat{L}_n(f_{W,Z})$ is called the empirical clustering risk, and $L(f_{W,Z})$ the (expected) clustering risk. Since the underlying distribution $\mu$ is unknown, a clustering algorithm minimizes the empirical clustering risk $\hat{L}_n$ to obtain the final partitions, formalized as

$$\hat{f}_{W,Z} = \arg\min_{f_{W,Z} \in \mathcal{F}} \hat{L}_n(f_{W,Z}),$$

over all possible vector-valued functions $f_{W,Z}$ in the function class $\mathcal{F}$. The optimal risk on the feature space $\mathcal{X}$ is the infimum of the clustering risk $L(f_{W,Z})$:

$$L^* = \inf_{f_{W,Z} \in \mathcal{F}} L(f_{W,Z}).$$

We now define the excess function class used in Section 4,

$$\mathcal{F}_{exc} := \big\{ f_{W,Z} - f^*_{W,Z} \;\big|\; f_{W,Z}, f^*_{W,Z} \in \mathcal{F} \big\},$$

where $f^*_{W,Z} = \arg\min_{f_{W,Z} \in \mathcal{F}} L(f_{W,Z})$. By definition, the empirical excess clustering risk is $L(\hat{f}_{W,Z}) - L^*$, and the (expected) excess clustering risk is $E[L(\hat{f}_{W,Z})] - L^*$; these are typical research objectives in learning theory (Li et al., 2019b;a; Yin et al., 2020; Kang et al., 2021) and are the main objectives this paper focuses on.

3.3. Examples

We now briefly explain our proposed clustering framework with the following examples.

Example 1: k-Means and Kernel k-Means. In k-means, we can set $W(x, x') = \|x - x'\|_2^2$ and $Z_k(x, x') = \mathbb{I}\{(x, x') \in C_k^2\}$, where $k = 1, \ldots, K$ and $C_1, \ldots, C_K$ are partitions of the feature space $\mathcal{X}$. In kernel k-means, we can set $W(x, x') = \phi(\|\psi(x) - \psi(x')\|_p)$, where $p \ge 1$, $\psi : \mathcal{X} \to \mathcal{H}$ is the feature map which maps $\mathcal{X}$ into a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, and $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ is a monotone function such that $\phi(0) = 0$ and $\phi(t) > 0$ for all $t > 0$; we then set $Z_k(x, x') = \mathbb{I}\{(\psi(x), \psi(x')) \in C_k^2\}$, where $k = 1, \ldots, K$ and $C_1, \ldots, C_K$ are partitions of $\mathcal{H}$. In particular, k-means and kernel k-means belong to the hard clustering scheme. In the hard clustering scheme, a pair of observations can correspond to at most one cluster, which means that $Z_k$ takes the value 0 or 1 for a pair of observations for $k = 1, \ldots, K$, and at most one of them takes the value 1. Thus the corresponding formula for the hard clustering scheme can easily be written down.
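As a concrete illustration of Example 1, the following minimal sketch evaluates the empirical clustering risk (1) for the k-means instantiation. The function name, the toy data and the fixed centers are illustrative assumptions only; the partition $C_1, \ldots, C_K$ is taken as the Voronoi cells of the given centers, which could come from any k-means solver.

```python
import numpy as np

def empirical_clustering_risk_kmeans(X, centers):
    """Empirical clustering risk (1) for the k-means instantiation of Example 1.

    W(x, x') = ||x - x'||_2^2 and Z_k(x, x') = 1{both x and x' lie in the
    Voronoi cell C_k of the k-th center}, so only within-cluster pairs contribute.
    """
    n = X.shape[0]
    # assign each sample to its nearest center (the Voronoi partition C_1, ..., C_K)
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)

    risk = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and labels[i] == labels[j]:     # Z_k(x_i, x_j) = 1 for exactly one k
                risk += ((X[i] - X[j]) ** 2).sum()    # W(x_i, x_j) = ||x_i - x_j||_2^2
    return risk / (n * (n - 1))

# toy usage: two well-separated blobs, K = 2 hypothetical centers
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
centers = np.array([[0.0, 0.0], [8.0, 8.0]])
print(empirical_clustering_risk_kmeans(X, centers))
```

Minimizing this quantity over the candidate partitions is the empirical risk minimization step $\hat{f}_{W,Z} = \arg\min_{f_{W,Z}} \hat{L}_n(f_{W,Z})$ restricted to that family.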
Example 2: Soft k-Means. In soft k-means, or in the broader soft clustering scheme, the cluster center, such as the mean, the median, or another statistic measuring the center of a cluster, is contributed to by all the samples, in contrast with the hard clustering scheme. In this case, the pairwise distance-based function $W$ can also be chosen as $W(x, x') = \phi(\|x - x'\|_p)$, where $p \ge 1$ and $\phi$ is the same as in the hard clustering scheme. For any $k = 1, \ldots, K$, if $Z_k(x, x')$ is restricted to the samples, it should satisfy the following constraints:

$$Z_k(x_i, x_j) \in [0, 1], \qquad \sum_{k=1}^{K} Z_k(x_i, x_j) = 1 \;\;\forall\, i \neq j, \qquad \sum_{i,j=1, i \neq j}^{n} Z_k(x_i, x_j) \le n.$$

Example 3: Spectral Clustering Scheme. Assume that the samples are organized as a fully connected graph; $W$ in this case can be chosen as the Gaussian kernel function $W(x, x') = \exp(-\|x - x'\|_2^2/(2\sigma^2))$, where $\sigma^2$ is the variance determining the connectivity length scale (Von Luxburg, 2007). Since eigenvectors can be seen as empirical versions of underlying eigenfunctions (Rosasco et al., 2010), we take $Z_k(x_i, x_j) = (f_k(x_i) - f_k(x_j))^2$ when $Z_k(\cdot, \cdot)$ is restricted to samples $x_i, x_j$, where $f_k(\cdot)$ is the $k$-th eigenfunction, and $f_k(x_i)$ and $f_k(x_j)$ are elements of the corresponding $k$-th eigenvector (a small sketch of this instantiation is given at the end of this subsection). This definition of $Z_k$ and $W$ is suitable for the NCut and RatioCut spectral clustering problems (Von Luxburg, 2007). Of course, if the samples are organized as another type of graph, the pairwise distance function $W(x, x')$ can be written as the corresponding formula.

Example 4: Neural Network Clustering Scheme. Assume that the feature map encoded by a neural network model after many non-linear layers is $\psi(x) : \mathcal{X} \to \mathcal{R}$, where $\mathcal{R}$ is a non-linear high-dimensional function space. In this case, $W(x, x') = \phi(\|\psi(x) - \psi(x')\|_p)$, where $p \ge 1$ and $\phi$ is the same as in the hard clustering scheme. If we use the hard clustering scheme, $Z_k(x, x') = \mathbb{I}\{(\psi(x), \psi(x')) \in C_k^2\}$, where $k = 1, \ldots, K$ and $C_1, \ldots, C_K$ are partitions of $\mathcal{R}$.

The above examples suggest that our proposed clustering framework generalizes well and is suitable for a wide range of clustering algorithms.
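As referenced in Example 3, the following sketch evaluates the empirical clustering risk for the spectral instantiation. The choice of eigenvectors (the $K$ smallest eigenvectors of the unnormalized graph Laplacian) and all function names are illustrative assumptions; the example only fixes $W$ as the Gaussian kernel and $Z_k$ through the eigenfunctions $f_k$.

```python
import numpy as np

def empirical_clustering_risk_spectral(X, K, sigma=1.0):
    """Empirical clustering risk for the spectral instantiation of Example 3.

    W(x, x') = exp(-||x - x'||_2^2 / (2 sigma^2)) on a fully connected graph and
    Z_k(x_i, x_j) = (f_k(x_i) - f_k(x_j))^2, with f_k taken here as the k-th
    eigenvector of the unnormalized graph Laplacian (one common convention).
    """
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                      # only pairs with i != j enter the criterion
    L = np.diag(W.sum(axis=1)) - W                # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)
    F = vecs[:, :K]                               # f_1, ..., f_K evaluated at the samples
    risk = 0.0
    for k in range(K):
        Z_k = (F[:, k][:, None] - F[:, k][None, :]) ** 2
        risk += (W * Z_k).sum()                   # diagonal of W is zero, so i = j drops out
    return risk / (n * (n - 1))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (40, 2)), rng.normal(5, 0.5, (40, 2))])
print(empirical_clustering_risk_spectral(X, K=2))
```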
4. Sharper Excess Clustering Risk Bounds

In this section, we first introduce the assumptions used. Then, we show that the clustering Rademacher complexity technique can only yield an excess risk bound with a slow convergence rate. Under the same condition of bounded hypothesis functions, we then derive a basic excess risk bound via the local clustering Rademacher complexity. This basic bound can further be combined with some mild covering number assumptions to derive sharper bounds.

4.1. Assumptions

Assumption 1. Assume that the hypothesis functions satisfy $f_{W,Z_k}(\cdot, \cdot) \in [0, M]$ for $k = 1, \ldots, K$, where $M > 0$ is a constant, and that $\sum_{k=1}^{K} f_{W,Z_k}(\cdot, \cdot) \in [0, E]$, where $E > 0$ is a constant.

Assumption 1 is a very mild assumption. In Examples 1, 2 and 4, it is easily fulfilled if the pairwise distance function $W$ is bounded or normalized, because the partition functions $Z_1, \ldots, Z_K$ are obviously bounded. In Example 3, if the function $W$ is bounded, the corresponding integral operator related to $W$ is a bounded operator and the eigenfunctions in the RKHS are thus continuous and bounded (Rosasco et al., 2010), which means $f_{W,Z_k}$ for $k = 1, \ldots, K$ are bounded. Obviously, $\sum_{k=1}^{K} f_{W,Z_k}$ is bounded if $f_{W,Z_1}, \ldots, f_{W,Z_K}$ are bounded. Besides, since $Z_1, \ldots, Z_K$ are usually indicator functions, in which case $E = M$, it is not reasonable to simply take $E = KM$.

Definition 1 (covering number (Zhou, 2002)). For any $\epsilon > 0$ and a function class $\mathcal{H}$, the $L_2$ covering number $N(\epsilon, \mathcal{H}, \ell_2(\mu_n(S)))$ is the supremum over samples $S = \{x_1, \ldots, x_n\}$ of the size of a minimal cover $C_\epsilon$ such that $\forall f \in \mathcal{H}, \exists f_\epsilon \in C_\epsilon$ s.t. $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(f(x_i) - f_\epsilon(x_i))^2} \le \epsilon$. Furthermore, the following covering number is introduced: $N(\epsilon, \mathcal{H}, 2) := \sup_n \sup_{\mu_n} N(\epsilon, \mathcal{H}, \ell_2(\mu_n))$.

Assumption 2 (logarithmic covering number). Assume that there exist three positive constants $\gamma$, $d$ and $p$ satisfying $\log N(\epsilon, \mathcal{F}_k, 2) \le d \log^p(\gamma/\epsilon)$ for any $0 < \epsilon \le \gamma$ and $k = 1, \ldots, K$.

Many popular function classes satisfy Assumption 2 when the function classes $\mathcal{F}_k$ for $k = 1, \ldots, K$ are bounded: 1.) any function space with finite VC-dimension (Vaart & Wellner, 1997), including linear functions and univariate polynomials of degree $e$ (for which $d = e + 1$ and $p = 1$) as special cases; in particular, the corresponding function class of k-means clustering is a VC major class with finite VC dimension, see Section 19.1 in (Devroye et al., 2013) for details; 2.) any unit Euclidean ball $B_2 \subset \mathbb{R}^d$ with fixed $\epsilon \in (0, 1)$ (Rigollet & Hütter, 2015); 3.) any RKHS based on a kernel with rank $d$ (Carl & Triebel, 1980).

Assumption 3 (polynomial covering number). Assume that there exist two constants $\gamma > 0$ and $p > 0$ satisfying $\log N(\epsilon, \mathcal{F}_k, 2) \le \gamma \epsilon^{-p}$ for any $k = 1, \ldots, K$.

Classes that fulfill Assumption 3 are known as satisfying the uniform entropy condition (Wellner et al., 2013). If the function classes $\mathcal{F}_k$ for $k = 1, \ldots, K$ are bounded, this type of covering number is satisfied by many Sobolev/Besov classes (Gu, 2013). For instance, if the kernel eigenvalues decay at a rate of $t^{-2/p}$, where $t$ denotes the index of the eigenvalue sequence, then the RKHS satisfies this covering number assumption (Carl & Triebel, 1980). The popular RKHSs of Gaussian, polynomial and finite-rank kernels satisfy this assumption.

Assumption 4. Assume that there exist two constants $\gamma > 0$ and $p > 0$ satisfying $\log N(\epsilon, \mathcal{F}_k, 2) \le \gamma \epsilon^{-p} \log^2(2/\epsilon)$ for any $k = 1, \ldots, K$.

Assumption 4 combines the logarithmic and the polynomial covering number.

4.2. A Basic Excess Clustering Risk Bound

Rademacher complexity is widely used to measure the complexity of a function class and has been used for classification and regression problems (Bartlett & Mendelson, 2002; Koltchinskii et al., 2002). However, the traditional Rademacher complexity definition is not suitable for clustering learning. The summation over all pairs of observations in Eq. (1) makes its study more difficult, rendering standard techniques for the i.i.d. case inapplicable in this context. Inspired by (Clémençon et al., 2008; 2005), using the permutations in the U-process, we convert the non-sum-of-i.i.d. pairwise objective into a sum-of-i.i.d. form. The empirical clustering Rademacher complexity is thus defined on the function space $\mathcal{F}$ as follows.

Definition 2. The empirical clustering Rademacher complexity of $\mathcal{F}$ is

$$R_n(\mathcal{F}) = E_\sigma\Bigg[\sup_{f_{W,Z} \in \mathcal{F}} \frac{1}{\lfloor n/2 \rfloor}\sum_{i=1}^{\lfloor n/2 \rfloor}\sum_{k=1}^{K} \sigma_i f_{W,Z_k}\big(x_i, x_{i+\lfloor n/2 \rfloor}\big)\Bigg],$$

where $\sigma_1, \ldots, \sigma_{\lfloor n/2 \rfloor}$ are i.i.d. Rademacher variables, taking values in $\{-1, 1\}$ with equal probability. These Rademacher variables are independent of the samples $S = \{x_1, \ldots, x_n\}$. The expected clustering Rademacher complexity of $\mathcal{F}$ is defined as $R(\mathcal{F}) = E_S R_n(\mathcal{F})$.
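To illustrate Definition 2, the sketch below forms the $\lfloor n/2 \rfloor$ i.i.d. pairs $(x_i, x_{i+\lfloor n/2 \rfloor})$, draws Rademacher signs, and estimates the empirical clustering Rademacher complexity by Monte Carlo. Since the supremum over all of $\mathcal{F}$ is not computable, it is approximated here over a finite family of candidate k-means partitions; the finite family and the Monte Carlo loop are illustrative assumptions, not part of the definition.

```python
import numpy as np

def clustering_rademacher_mc(X, candidate_centers, n_mc=200, seed=0):
    """Monte Carlo estimate of the empirical clustering Rademacher complexity
    of Definition 2, with the supremum over F approximated by a finite family
    of candidate k-means partitions (an illustration only, not the full class).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = n // 2
    left, right = X[:m], X[m:2 * m]               # the pairs (x_i, x_{i + floor(n/2)})

    # sum_k f_{W,Z_k}(x_i, x_{i+m}) for each candidate partition: a pair contributes
    # ||x - x'||_2^2 iff both points fall in the same Voronoi cell.
    pair_values = []
    for centers in candidate_centers:
        la = ((left[:, None, :] - centers[None, :, :]) ** 2).sum(2).argmin(1)
        lb = ((right[:, None, :] - centers[None, :, :]) ** 2).sum(2).argmin(1)
        w = ((left - right) ** 2).sum(1)
        pair_values.append(np.where(la == lb, w, 0.0))
    pair_values = np.stack(pair_values)           # shape: (num_candidates, m)

    sups = []
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=m)   # i.i.d. Rademacher variables
        sups.append((pair_values @ sigma / m).max())
    return float(np.mean(sups))

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (100, 2))
candidates = [rng.normal(0, 1, (2, 2)) for _ in range(20)]
print(clustering_rademacher_mc(X, candidates))
```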
By the proposed clustering Rademacher complexity, we have the following theorem.

Theorem 1. Under Assumption 1, with probability at least $1 - \delta$, we have

$$L(\hat{f}_{W,Z}) - L^* \le \frac{8KM}{\sqrt{n}} + E\sqrt{\frac{2\log(1/\delta)}{n}}.$$

Remark 1. Theorem 1 suggests that the (empirical) excess clustering risk upper bound obtained by the Rademacher complexity has a convergence rate of order $O(K/\sqrt{n})$ provided that the hypothesis functions have bounded support. It is worth noticing that the Rademacher complexity considers the worst case over the elements of the function space, neglecting that the algorithm will likely pick functions that have a small error. As a result, the best error rate that can be obtained via the Rademacher complexity is at least of order $O(1/\sqrt{n})$. Indeed, the type of algorithms we consider here is known in the statistical literature as M-estimators: they minimize the empirical loss in a fixed class of functions. (Bartlett et al., 2005; Liu et al., 2019) demonstrate that the local Rademacher complexity is a more reasonable complexity measure. Since the local Rademacher complexity cannot be defined on the non-sum-of-i.i.d. pairwise objective of Eq. (1), combined with Definition 2, we define the expected local Rademacher complexity for clustering learning as follows.

Definition 3. For any $r > 0$, the expected local Rademacher complexity of the function class $\mathcal{F}_{exc}$ for clustering learning is defined as

$$R(\mathcal{F}^r_{exc}) := R\big(\big\{\alpha(f_{W,Z} - f^*_{W,Z}) \;\big|\; \alpha \in [0, 1], \; f_{W,Z} - f^*_{W,Z} \in \mathcal{F}_{exc}, \; \eta \le r\big\}\big),$$

where $\eta := L\big[\big(\sum_{k=1}^{K}\alpha(f_{W,Z_k} - f^*_{W,Z_k})\big)^2\big]$, that is, $\eta = E\big[\big(\sum_{k=1}^{K}\alpha(f_{W,Z_k} - f^*_{W,Z_k})\big)^2\big]$.

If we pick a variance as small as possible while requiring that $f_{W,Z} - f^*_{W,Z}$ still lies in $\mathcal{F}^r_{exc}$, we can choose a much smaller class $\mathcal{F}^r_{exc} \subseteq \mathcal{F}_{exc}$ and obtain sharper excess clustering risk upper bounds. With the local clustering Rademacher complexity, we obtain a basic excess clustering risk upper bound.

Theorem 2. Under Assumption 1, let $r^*$ be the fixed point of $R(\mathcal{F}^r_{exc})$, that is, $r^*$ is the solution of $R(\mathcal{F}^r_{exc}) = r$ with respect to $r$. Then, for any $h > \max\{1, \frac{2}{4E}\}$, with probability at least $1 - \delta$:

$$L(\hat{f}_{W,Z}) - L^* \le c_{h,E}\, r^* + \frac{c_{h,\delta}}{n}, \qquad (4)$$

where $c_{h,E}$ and $c_{h,\delta}$ are constants dependent on $h, E$ and $h, \delta$, respectively.

A basic excess clustering risk upper bound is given by Theorem 2. We can choose suitable fixed points $r^*$ to derive sharper bounds, which will be introduced in Section 4.3.

Remark 2. [Proof Techniques] [1] The generalization error bounds in (Bartlett et al., 2005; Koltchinskii et al., 2006) are complex; we therefore provide an excess risk bound here, which requires a fine-grained generalization analysis (see the proof of Theorem 2 in the Appendix) and defines the local Rademacher complexity on the excess function class $\mathcal{F}_{exc}$. [2] Eq. (2) involves pairwise observations, which hinders the application of the standard i.i.d. local Rademacher complexity technique (Bartlett et al., 2005). To overcome this difficulty, we use the U-process (Clémençon et al., 2005; 2008) to convert Eq. (2) into a sum-of-i.i.d. form (see Lemma 4 in the Appendix). [3] To present this basic excess clustering risk bound, we must use the Bennett concentration inequality (Bousquet, 2002a) to bound the empirical uniform deviation and obtain a term of order $O(1/n)$, which can then be combined with the fixed point $r^*$ to derive result (4) (see Lemma 4 and Lemma 5 in the Appendix).
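To illustrate the U-process step in item [2] of Remark 2, the small check below verifies numerically, on a tiny sample, that the pairwise average of Eq. (2) coincides with the average over all permutations of the sum-of-i.i.d. block estimator built from the pairs $(x_{\pi(i)}, x_{\pi(i+\lfloor n/2 \rfloor)})$; this Hoeffding-type representation is what lets i.i.d. tools be applied. The helper names and the single-cluster choice (all pairs kept, so the summand reduces to $W$) are illustrative assumptions, not the paper's Lemma 4.

```python
import itertools
import numpy as np

def u_statistic(X, q):
    """Pairwise (U-statistic) form of the empirical clustering risk: the average
    of q over all ordered pairs i != j."""
    n = len(X)
    return sum(q(X[i], X[j]) for i in range(n) for j in range(n) if i != j) / (n * (n - 1))

def block_average(X, q, perm):
    """Sum-of-i.i.d. block estimator used in the symmetrization step: pair the
    first half of the permuted sample with the second half."""
    m = len(X) // 2
    return sum(q(X[perm[i]], X[perm[i + m]]) for i in range(m)) / m

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (6, 2))
q = lambda x, y: ((x - y) ** 2).sum()            # W(x, x') = ||x - x'||_2^2, all pairs kept

perms = list(itertools.permutations(range(6)))
avg_over_perms = np.mean([block_average(X, q, p) for p in perms])
print(u_statistic(X, q), avg_over_perms)         # the two numbers coincide
```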
4.3. Sharper Excess Clustering Risk Bounds

In this paper, we use the covering number to bound the expected local clustering Rademacher complexity $R(\mathcal{F}^r_{exc})$, obtain the fixed point $r^*$, and derive the sharper excess clustering risk bounds. The covering number (Kolmogorov & Tikhomirov, 1959; Zhou, 2002) is also a tool to measure the complexity of a function class; it reflects the richness of a function class and is widely used in statistical learning theory. By the covering number, we can derive the following sharper excess risk bounds for clustering learning.

Theorem 3. Under different assumptions, we have the following results:

1.) Under Assumptions 1 and 2, with probability at least $1 - \delta$, we have
$$L(\hat{f}_{W,Z}) - L^* \le c_1 K^2 \frac{\log^p(n)}{n} + c_2 n^{-1},$$
where $c_1$ and $c_2$ are constants dependent on $\gamma, d, p, h, M, E$ and $h, \delta$, respectively.

2.) Under Assumptions 1 and 3, with probability at least $1 - \delta$, we have
$$L(\hat{f}_{W,Z}) - L^* \le c_1 K^2 \Big(\frac{1}{n}\Big)^{\frac{2}{p+2}} + c_2 n^{-1},$$
where $c_1$ and $c_2$ are constants dependent on $\gamma, p, h, M, E$ and $h, \delta$, respectively.

3.) Under Assumptions 1 and 4, with probability at least $1 - \delta$, we have
$$L(\hat{f}_{W,Z}) - L^* \le c_1 K^2 \Big(\frac{\log n}{n}\Big)^{\frac{2}{p+2}} \log\Big(\frac{n}{(\log n)^{\frac{2}{p+2}}}\Big) + c_2 n^{-1},$$
where $c_1$ and $c_2$ are constants dependent on $\gamma, p, h, M, E$ and $h, \delta$, respectively.

For result (1) in Theorem 3, the empirical excess clustering risk $L(\hat{f}_{W,Z}) - L^*$ has a convergence rate of order $\tilde{O}(K^2/n)$, which is much faster than $O(K/\sqrt{n})$ because usually $K \le \sqrt{n}$. For results (2) and (3) in Theorem 3, the empirical excess clustering risk $L(\hat{f}_{W,Z}) - L^*$ has convergence rates of order $O(K^2/n^{\frac{2}{p+2}})$ and $\tilde{O}(K^2/n^{\frac{2}{p+2}})$, respectively, which are all faster than $O(K/\sqrt{n})$ when $0 < p < 2$. It is easy to verify that the upper bound of the expected excess clustering risk $E[L(\hat{f}_{W,Z})] - L^*$ has the same order as that of $L(\hat{f}_{W,Z}) - L^*$. According to (Hanneke, 2016; Zhivotovskiy & Hanneke, 2018; Ehrenfeucht et al., 1989), if a function class is a VC major class with finite VC dimension, which is a special case of Assumption 2, its excess risk lower bound is of order $O(d/n)$. Since the upper bound of the excess clustering risk in result (1) is also of order $\tilde{O}(d/n)$ (refer to Remark 4), the convergence rate obtained here is optimal in a minimax sense. Besides, (Rakhlin et al., 2017; Rakhlin & Sridharan, 2015) show that under a bounded loss function assumption and a polynomial covering number assumption on the hypothesis function space, which correspond to Assumption 1 and Assumption 3, the lower bound for supervised learning problems is of order $O(n^{-\frac{2}{p+2}})$. Since we obtain an upper bound of order $O(n^{-\frac{2}{p+2}})$ for the more complex pairwise clustering learning problem under the same assumptions, the convergence rate in result (2) may be the optimal rate as well.

Remark 3. [Proof Techniques] To present Theorem 3, we must establish the relationship between the expected local clustering Rademacher complexity and the covering number; however, the conventional Dudley covering number bounds (Dudley, 1978; Bousquet, 2002b; Bartlett et al., 2005; Srebro & Sridharan, 2010) are built on empirical constraints. In Definition 3, our local clustering Rademacher complexity has the constraint $L[(\sum_{k=1}^{K}(f_{W,Z_k} - f^*_{W,Z_k}))^2] \le r$, which is constructed on the expectation. To overcome this difficulty, we extend Theorem 2 in (Lei et al., 2016) to our pairwise clustering learning problem (see Lemma 6 in the Appendix). Subsequently, by using the commonly used mild covering number assumptions, suitable fixed points $r^*$ can be derived and substituted into the basic excess clustering risk bound of Theorem 2.

Remark 4. The factor $K^2$ appears because of the local Rademacher complexity technique. For example, in result (1) of Theorem 3, we need to solve the following equality:
$$r = cK\big(\sqrt{xr} + x\big), \qquad \text{where } x = \frac{d\log^p(2\gamma n^{1/2})}{n}$$
and $c$ is a constant.
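To see where the $K^2$ factor comes from, here is a short worked solution of the fixed-point equality in Remark 4. The bookkeeping of constants is only a sketch under the stated form of the equation, not the constants of the actual proof.

```latex
% Sketch: solving r = cK(\sqrt{xr} + x) from Remark 4, with
% x = d\log^{p}(2\gamma n^{1/2})/n; constants are illustrative only.
\begin{aligned}
r^{*} &= cK\sqrt{x r^{*}} + cKx, \\
cK\sqrt{x r^{*}} &\le \frac{r^{*}}{2} + \frac{c^{2}K^{2}x}{2}
  \quad\text{(by } ab \le \tfrac{a^{2}+b^{2}}{2}\text{)}, \\
\Rightarrow\; r^{*} &\le \frac{r^{*}}{2} + \frac{c^{2}K^{2}x}{2} + cKx
  \;\Rightarrow\; r^{*} \le (c^{2}+2c)\,K^{2}x
  = \tilde{O}\!\left(\frac{K^{2}\,d\log^{p}(n)}{n}\right).
\end{aligned}
```

Substituting this fixed point into the basic bound (4) of Theorem 2 then yields the $\tilde{O}(K^2 d \log^p(n)/n)$ rate of result (1).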
Since usually $K \le \sqrt{n}$, the upper bound in Theorem 3 is obviously faster than that in Theorem 1. For the extreme case of focusing on $K$, we further improve the order of $K$ in Section 5.

5. Improving the Order of K

The results in Section 4 and the existing studies on data-dependent excess risk bounds for clustering (Levrard et al., 2013; Biau et al., 2008; Levrard et al., 2015; Calandriello & Rosasco, 2018; Clémençon, 2011) usually build on the following result for the Rademacher complexity:
$$R(\mathcal{F}) \le K \max_k R(\mathcal{F}_k),$$
where $K \max_k R(\mathcal{F}_k)$ is the maximum Rademacher complexity of the restrictions of the function class along each coordinate, multiplied by a factor of $O(K)$. Therefore, the existing excess risk bounds are linearly dependent on $K$ when obtained by the clustering Rademacher complexity, or on $K^2$ when obtained by the local clustering Rademacher complexity. However, for fine-grained analysis in social networks or recommendation systems, the number of clusters $K$ may be very large. In this section, we reduce the order of $K$ to improve the results of Section 4. Moreover, we compare the results obtained in this paper with the related work.

5.1. Improved Excess Clustering Risk Bounds

We first show that the clustering Rademacher complexity can be bounded by the maximum Rademacher complexity of the restrictions of the function class along each coordinate, multiplied by a factor of $O(\sqrt{K})$, by refining Theorem 1 in (Foster & Rakhlin, 2019). Under the new clustering Rademacher complexity bound, we improve the convergence rates of the excess clustering risk of Section 4.

Assumption 5. Assume the function $\sum_{k=1}^{K} f_{W,Z_k}$ to be $L$-Lipschitz continuous with respect to the $L_\infty$ norm, that is, for all $f_{W,Z}, f'_{W,Z} \in \mathcal{F}$:
$$\Big|\sum_{k=1}^{K} f_{W,Z_k}(x, x') - \sum_{k=1}^{K} f'_{W,Z_k}(x, x')\Big| \le L\,\big\|f_{W,Z}(x, x') - f'_{W,Z}(x, x')\big\|_\infty.$$

Assumption 5 is a very mild assumption. In fact, the function $\sum_{k=1}^{K} f_{W,Z_k}$ in our proposed clustering framework is $K$-Lipschitz w.r.t. the $L_\infty$ norm in the worst case (see Lemma 7 in the supplementary material for details). In particular, for the hard clustering scheme, such as k-means or kernel k-means, the function $\sum_{k=1}^{K} f_{W,Z_k}$ is 1-Lipschitz with respect to the $L_\infty$ norm (see Lemma 7 in the supplementary material, and the short note before Theorem 5 below), which allows us to derive the state-of-the-art convergence rates for the hard clustering scheme.

Theorem 4. Under Assumptions 1 and 5, for any $\eta > 0$ and $S = \{x_i\}_{i=1}^{n} \in \mathcal{X}^n$, there exists a constant $C > 0$ such that
$$R_n(\mathcal{F}) \le C L \sqrt{K}\, \max_k \tilde{R}_n(\mathcal{F}_k)\, \log^{\frac{3}{2}+\eta}(\sqrt{n}),$$
where $\tilde{R}_n(\mathcal{F}_k) = \sup_{S \in \mathcal{X}^n} R_n(\mathcal{F}_k)$.

Remark 5. From Theorem 4, one can see that the clustering Rademacher complexity can be bounded by the maximum Rademacher complexity of the restrictions of the function class along each coordinate, multiplied by a factor of $O(\sqrt{K})$. This allows us to reduce the order of $K$ in the results of Section 4.

Remark 6. [Proof Techniques] Theorem 1 in (Foster & Rakhlin, 2019) has a term $\max_i \tilde{R}_n(\mathcal{F}_i)$ in the denominator, which hinders the construction of the relationship between the expected local clustering Rademacher complexity and the covering number. To present the following Theorems 5 and 6, we must prove a lower bound for the term $\max_i \tilde{R}_n(\mathcal{F}_i)$ to overcome this difficulty; the proof uses the Khintchine inequality (Haagerup, 1981) and the U-process technique (see the proof of Theorem 4 in the supplementary material for details).

Based on Theorem 4, we can derive new excess clustering risk upper bounds with a lower order of $K$.
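Before stating the improved bounds, the following short note sketches the intuition for why the hard clustering scheme satisfies Assumption 5 with $L = 1$; this is only the idea behind Lemma 7 in the supplementary material, not its proof.

```latex
% Hard clustering: for a fixed pair (x, x'), at most one coordinate of
% f_{W,Z}(x, x') is nonzero, say f_{W,Z_k}(x, x') = a >= 0, and similarly
% f'_{W,Z_{k'}}(x, x') = b >= 0 for a competing function. Then:
%   if k = k' :  |a - b| = \|f_{W,Z}(x,x') - f'_{W,Z}(x,x')\|_\infty,
%   if k \ne k':  |a - b| \le \max\{a, b\} = \|f_{W,Z}(x,x') - f'_{W,Z}(x,x')\|_\infty,
% so in both cases
\Bigl|\sum_{k=1}^{K} f_{W,Z_k}(x,x') - \sum_{k=1}^{K} f'_{W,Z_k}(x,x')\Bigr|
\;\le\; \bigl\|f_{W,Z}(x,x') - f'_{W,Z}(x,x')\bigr\|_{\infty},
% which is exactly Assumption 5 with L = 1.
```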
Theorem 5. Under Assumptions 1 and 5, for any $\eta > 0$, there exists a constant $C > 0$ such that, with probability at least $1 - \delta$,
$$L(\hat{f}_{W,Z}) - L^* \le \frac{8MCL\sqrt{K}\log^{\frac{3}{2}+\eta}(\sqrt{n})}{\sqrt{n}} + E\sqrt{\frac{2\log(1/\delta)}{n}}.$$

From Theorem 5, one can see that the upper bound of the (empirical) excess clustering risk has a convergence rate of order $\tilde{O}(L\sqrt{K}/\sqrt{n})$. For the hard clustering scheme, where $L = 1$, the convergence rate is of order $\tilde{O}(\sqrt{K}/\sqrt{n})$. In other words, if we only have Assumption 1, we improve the existing excess clustering risk upper bound to $\tilde{O}(\sqrt{K}/\sqrt{n})$ for the hard clustering scheme.

Theorem 6. Under different assumptions, we have the following results:

1.) Under Assumptions 1, 2 and 5, for any $\eta > 0$, with probability at least $1 - \delta$, we have
$$L(\hat{f}_{W,Z}) - L^* \le O\Big(\frac{L^2 K \log^{3+p+2\eta}(\sqrt{n})}{n}\Big).$$

2.) Under Assumptions 1, 3 and 5, for any $\eta > 0$, with probability at least $1 - \delta$, we have
$$L(\hat{f}_{W,Z}) - L^* \le O\Big(\frac{L^2 K \log^{3+2\eta}(\sqrt{n})}{n^{\frac{2}{p+2}}}\Big).$$

3.) Under Assumptions 1, 4 and 5, for any $\eta > 0$, with probability at least $1 - \delta$, we have
$$L(\hat{f}_{W,Z}) - L^* \le O\bigg(L^2 K \Big(\frac{(\log n)^r}{n}\Big)^{\frac{2}{p+2}} \log\Big(\frac{n}{(\log n)^{\frac{2}{p+2}}}\Big)\bigg), \qquad \text{where } r = \frac{2p}{p+2} + 3 + 2\eta.$$

Theorem 6 reduces the order of $K$ in the upper bounds of the (empirical) excess clustering risk, from $K^2$ in Theorem 3 to $K$. For the hard clustering scheme, where $L = 1$, the convergence rate of the excess clustering risk in result (1) of Theorem 6 is of order $\tilde{O}(K/n)$, and in results (2) and (3) of Theorem 6 it is of order $\tilde{O}(K/n^{\frac{2}{p+2}})$, which are all faster than the results in Theorem 3. From Theorem 6, one can see that the best convergence rate obtained for the (empirical) excess clustering risk is of order $\tilde{O}(K/n)$, which is the state-of-the-art convergence rate under mild assumptions.

5.2. Comparison with Related Work

Bounded Support. Suppose the distribution $\mu$ has bounded support. The excess clustering risk bounds are then mostly of order $O(\sqrt{K/n})$ for finite-dimensional problems (Linder et al., 1994; Bartlett et al., 1998; Linder, 2000; 2002), or of order $O(K/\sqrt{n})$ for infinite-dimensional problems (Biau et al., 2008; Maurer & Pontil, 2010; Canas et al., 2012; Fefferman et al., 2016; Calandriello & Rosasco, 2018; Fischer, 2010). For example, it is shown in (Bartlett et al., 1998) that if $\mu$ has bounded support, then $E[L(\hat{f}_{W,Z})] - L^*$ is bounded by a term of order $\sqrt{Km\log n/n}$, where $m$ is the dimension of the sample $x \in \mathbb{R}^m$. Or, it is shown in (Biau et al., 2008) that if $\mu$ has support bounded by $R$, then $E[L(\hat{f}_{W,Z})] - L^* \le 12KR^2/\sqrt{n}$. Supposing instead that the hypothesis function is bounded, (Clémençon, 2011; Cao et al., 2016) obtain a convergence rate of order $O(K/\sqrt{n})$. For example, it is shown in (Clémençon, 2011) that if the loss function is bounded by $B$, then with probability at least $1 - \delta$, $L(\hat{f}_{W,Z}) - L^* \le C_{\delta,B}K/\sqrt{n}$, where $C_{\delta,B}$ is a constant dependent on $B$ and $\delta$. From Theorem 5, one can see that our bound for the hard clustering scheme is of order $O(\sqrt{K}/\sqrt{n})$ provided only that the hypothesis functions are bounded. Note that this upper bound is suitable for infinite-dimensional problems; it is thus faster than the $O(K/\sqrt{n})$ bounds provided for infinite-dimensional problems in (Biau et al., 2008; Maurer & Pontil, 2010; Calandriello & Rosasco, 2018; Clémençon, 2011).

Margin Condition. As shown in Section 2, (Levrard et al., 2015) propose an assumption on $\mu$, called the margin condition, to satisfy the technical condition (Pollard et al., 1982; Chou, 1994; Antos et al., 2005; Levrard et al., 2013) and establish a fast convergence rate of order $O(K^2/n)$. Let $\mathcal{M}$ be the set of all $c^* = \{c^*_1, \ldots, c^*_K\}$, where $c^*$ is a set of optimal cluster centers constructed on the underlying distribution $\mu$ for the (kernel) k-means algorithm.
For $t \ge 0$, we define $p(t) := \sup_{c^* \in \mathcal{M}} P(F(c^*)^t)$, where, for any set $A \subseteq E$, the term $A^t$ stands for the $t$-neighborhood of $A$ in $E$, and where $F(c^*)$ denotes the frontier of the Voronoi diagram generated by $c^*$. Specifically, $F(c) := \bigcup_{i \neq j} U_i(c) \cap U_j(c)$, where $U_j(c) := \{x \in E : \forall i \in \{1, \ldots, K\}, \, d(x, c_j) \le d(x, c_i)\}$ and $d(\cdot, \cdot)$ is a distance metric defined on the metric space $E$. The margin condition consists of: (1) for any $x \in E$, $P(x : \|x\| \le R) = 1$ for some $R > 0$; (2) there exists $r_o > 0$ such that for all $0 < t \le r_o$,
$$p(t) \le \frac{B\,p_{\min}}{128R^2}\,t,$$
where $B := \inf_{c^* \in \mathcal{M}, i \neq j}\|c^*_i - c^*_j\|$ and $p_{\min} := \inf_{c^* \in \mathcal{M}, 1 \le j \le K} P(U_j(c^*))$. Since $P(F(c^*)^t)$ corresponds to the probability mass of the frontier of the Voronoi diagram associated with $c^*$ inflated by $t$, the margin condition requires that $p(t)$ does not increase too rapidly with $t$; under it, the faster rate $O(K^2/n)$ can be obtained. From this description of the margin condition, one can see that it is a strong assumption and difficult to verify in general.

Assume that the underlying distribution $\mu$ satisfies the margin condition with radius $r_o$, and let $\delta$ denote the quantity $\frac{p_{\min}B^2 r_o^2}{64R^2} - \epsilon$, where $\epsilon > 0$ is a constant derived from the margin condition (Levrard et al., 2015). (Levrard, 2018) refines the results in (Levrard et al., 2015), giving a bound on $E[L(\hat{f}_{W,Z})] - L^*$ whose main term is of order $(K + \log(|\mathcal{M}|))R^2/n$, together with a term that is exponentially small in $n(\delta - 12KR^2/\sqrt{n})^2/R^4$, where $\gamma = 12KR^2/\sqrt{n}$. They show that if $E[L(\hat{f}_{W,Z})] - L^* \le \delta$, a faster convergence rate of order $O(K/n)$ can be obtained. Their result is built on stronger assumptions than (Levrard et al., 2015) and is also difficult to verify in general. Moreover, (Terada & Yamamoto, 2019) give an excess risk bound of order $O(K^2/n)$ for NCut (Von Luxburg, 2007) under the framework of the margin condition.

Compared with these margin-condition-based bounds, our excess clustering risk bounds are, first, suitable for many clustering algorithms, not just k-means. Besides, apart from the bounded function assumption, we obtain the fast convergence rate of order $O(K^2/n)$ assuming only mild covering number conditions on the function classes $\mathcal{F}_k$ for $k = 1, \ldots, K$. Since the covering numbers of many popular function classes are already known, our assumption is quite mild and easy to check. In addition, we obtain the state-of-the-art convergence rate of order $O(K/n)$ for the hard clustering scheme under the same mild assumptions.

6. Conclusion

In this paper, we propose a clustering learning framework and analyze its excess risk. We obtain sharper excess clustering risk upper bounds under two sets of mild assumptions: bounded hypothesis functions, and covering number assumptions on the hypothesis function spaces. Besides, the state-of-the-art upper bound of order $O(K/n)$ is derived for the hard clustering scheme under mild assumptions. We believe our work will provide a new perspective for studying clustering learning. In future work, we will investigate the lower bound and the optimal rate for our proposed clustering framework.

7. Acknowledgment

This work was supported in part by the National Natural Science Foundation of China No. 62076234, the Beijing Outstanding Young Scientist Program No. BJJWZYJH012019100020098, the China Unicom Innovation Ecological Cooperation Plan, the Public Computing Cloud of Renmin University of China, and the Major Innovation & Planning Interdisciplinary Platform for the Double-First Class Initiative, Renmin University of China.

References
Antos, A. Improved minimax bounds on the test and training distortion of empirically designed vector quantizers. IEEE Transactions on Information Theory, 51(11):4022-4032, 2005.

Antos, A., Györfi, L., and György, A. Individual convergence rates in empirical vector quantizer design. IEEE Transactions on Information Theory, 51(11):4013-4022, 2005.

Bartlett, P. L. and Mendelson, S. Rademacher and gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.

Bartlett, P. L., Linder, T., and Lugosi, G. The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44(5):1802-1813, 1998.

Bartlett, P. L., Bousquet, O., Mendelson, S., et al. Local rademacher complexities. The Annals of Statistics, 33(4):1497-1537, 2005.

Biau, G., Devroye, L., and Lugosi, G. On the performance of clustering in hilbert spaces. IEEE Transactions on Information Theory, 54(2):781-790, 2008.

Bousquet, O. A bennett concentration inequality and its application to suprema of empirical processes. Comptes Rendus Mathematique, 334(6):495-500, 2002a.

Bousquet, O. Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. PhD thesis, Department of Applied Mathematics, Ecole Polytechnique, 2002b.

Calandriello, D. and Rosasco, L. Statistical and computational trade-offs in kernel k-means. Advances in Neural Information Processing Systems (NeurIPS 2018), 31:9357-9367, 2018.

Canas, G., Poggio, T., and Rosasco, L. Learning manifolds with k-means and k-flats. Advances in Neural Information Processing Systems (NIPS 2012), 25:2465-2473, 2012.

Cao, Q., Guo, Z.-C., and Ying, Y. Generalization bounds for metric and similarity learning. Machine Learning, 102(1):115-132, 2016.

Carl, B. and Triebel, H. Inequalities between eigenvalues, entropy numbers, and related quantities of compact operators in banach spaces. Mathematische Annalen, 251(2):129-133, 1980.

Chou, P. A. The distortion of vector quantizers trained on n vectors decreases to the optimum as O_p(1/n). In Proceedings of 1994 IEEE International Symposium on Information Theory (ISIT 1994), pp. 457. IEEE, 1994.

Clémençon, S. On U-processes and clustering performance. In Advances in Neural Information Processing Systems (NIPS 2011), pp. 37-45, 2011.

Clémençon, S., Lugosi, G., and Vayatis, N. Ranking and scoring using empirical risk minimization. In International Conference on Computational Learning Theory (COLT 2005), pp. 1-15. Springer, 2005.

Clémençon, S., Lugosi, G., Vayatis, N., et al. Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36(2):844-874, 2008.

Devroye, L., Györfi, L., and Lugosi, G. A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media, 2013.

Dudley, R. M. Central limit theorems for empirical measures. The Annals of Probability, pp. 899-929, 1978.

Ehrenfeucht, A., Haussler, D., Kearns, M., and Valiant, L. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3):247-261, 1989.

Fefferman, C., Mitter, S., and Narayanan, H. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983-1049, 2016.

Fischer, A. Quantization and clustering with bregman divergences. Journal of Multivariate Analysis, 101(9):2207-2221, 2010.

Foster, D. J. and Rakhlin, A. ℓ∞ vector contraction for rademacher complexity. arXiv preprint arXiv:1911.06468, 2019.

Gu, C. Smoothing spline ANOVA models, volume 297. Springer Science & Business Media, 2013.
Haagerup, U. The best constants in the khintchine inequality. Studia Mathematica, 70:231-283, 1981.

Hanneke, S. Refined error bounds for several learning algorithms. The Journal of Machine Learning Research, 17(1):4667-4721, 2016.

Jain, A. K. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651-666, 2010.

Kang, Y., Liu, Y., Niu, B., and Wang, W. Weighted distributed differential privacy ERM: Convex and non-convex. Computers & Security, 106:102275, 2021.

Kolmogorov, A. N. and Tikhomirov, V. M. ε-entropy and ε-capacity of sets in function spaces. Uspekhi Matematicheskikh Nauk, 14(2):3-86, 1959.

Koltchinskii, V., Panchenko, D., et al. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1-50, 2002.

Koltchinskii, V. et al. Local rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593-2656, 2006.

Lei, Y., Ding, L., and Bi, Y. Local rademacher complexity bounds based on covering numbers. Neurocomputing, 218:320-330, 2016.

Levrard, C. Quantization/clustering: when and why does k-means work? arXiv preprint arXiv:1801.03742, 2018.

Levrard, C. et al. Fast rates for empirical vector quantization. Electronic Journal of Statistics, 7:1716-1746, 2013.

Levrard, C. et al. Nonasymptotic bounds for vector quantization in hilbert spaces. The Annals of Statistics, 43(2):592-619, 2015.

Li, J., Liu, Y., Yin, R., Zhang, H., Ding, L., and Wang, W. Multi-class learning: From theory to algorithm. In Advances in Neural Information Processing Systems (NeurIPS 2018), pp. 1586-1595, 2018.

Li, J., Liu, Y., Yin, R., and Wang, W. Approximate manifold regularization: Scalable algorithm and generalization analysis. In International Joint Conference on Artificial Intelligence (IJCAI 2019), pp. 2887-2893, 2019a.

Li, J., Liu, Y., Yin, R., and Wang, W. Multi-class learning using unlabeled samples: Theory and algorithm. In International Joint Conference on Artificial Intelligence (IJCAI 2019), pp. 2880-2886, 2019b.

Linder, T. On the training distortion of vector quantizers. IEEE Transactions on Information Theory, 46(4):1617-1623, 2000.

Linder, T. Learning-theoretic methods in vector quantization. In Principles of Nonparametric Learning, pp. 163-210. Springer, 2002.

Linder, T., Lugosi, G., and Zeger, K. Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding. IEEE Transactions on Information Theory, 40(6):1728-1740, 1994.

Liu, F., Choi, D., Xie, L., and Roeder, K. Global spectral clustering in dynamic networks. Proceedings of the National Academy of Sciences, 115(5):927-932, 2018.

Liu, Y. and Liao, S. Eigenvalues ratio for kernel selection of kernel methods. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2015), number 1, pp. 2814-2820, 2015.

Liu, Y., Liao, S., Lin, H., Yue, Y., and Wang, W. Generalization analysis for ranking using integral operator. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2017), pp. 2273-2279, 2017a.

Liu, Y., Liao, S., Lin, H., Yue, Y., and Wang, W. Infinite kernel learning: generalization bounds and algorithms. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2017), pp. 2280-2286, 2017b.

Liu, Y., Liao, S., Jiang, S., Ding, L., Lin, H., and Wang, W. Fast cross-validation for kernel-based algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(5):1083-1096, 2019.
Maurer, A. and Pontil, M. K-dimensional coding schemes in hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839-5846, 2010.

Pollard, D. et al. A central limit theorem for k-means clustering. The Annals of Probability, 10(4):919-926, 1982.

Rakhlin, A. and Sridharan, K. Online nonparametric regression with general loss functions. arXiv preprint arXiv:1501.06598, 2015.

Rakhlin, A., Sridharan, K., Tsybakov, A. B., et al. Empirical entropy, minimax regret and minimax risk. Bernoulli, 23(2):789-824, 2017.

Rigollet, P. and Hütter, J.-C. High dimensional statistics. Lecture notes for course 18S997, 2015.

Rosasco, L., Belkin, M., and De Vito, E. On learning with integral operators. Journal of Machine Learning Research, 11(1):905-934, 2010.

Shaham, U., Stanton, K., Li, H., Nadler, B., Basri, R., and Kluger, Y. Spectralnet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587, 2018.

Srebro, N. and Sridharan, K. Note on refined dudley integral covering number bound. Unpublished results. http://ttic.uchicago.edu/karthik/dudley.pdf, 2010.

Tang, C. and Monteleoni, C. On lloyd's algorithm: New theoretical insights for clustering in practice. In Artificial Intelligence and Statistics (AISTATS 2016), pp. 1280-1289, 2016.

Terada, Y. and Yamamoto, M. Kernel normalized cut: a theoretical revisit. In International Conference on Machine Learning (ICML 2019), pp. 6206-6214, 2019.

Thorpe, M., Theil, F., Johansen, A. M., and Cade, N. Convergence of the k-means minimization problem using Γ-convergence. SIAM Journal on Applied Mathematics, 75(6):2444-2474, 2015.

Vaart, A. v. d. and Wellner, J. A. Weak convergence and empirical processes with applications to statistics. Journal of the Royal Statistical Society - Series A Statistics in Society, 160(3):596-608, 1997.

Von Luxburg, U. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.

Wellner, J. et al. Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media, 2013.

Xu, C., Liu, T., Tao, D., and Xu, C. Local rademacher complexity for multi-label learning. IEEE Transactions on Image Processing, 25(3):1495-1507, 2016.

Xu, R. and Wunsch, D. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645-678, 2005.

Yin, R., Liu, Y., Lu, L., Wang, W., and Meng, D. Divide-and-conquer learning with nyström: Optimal rate and algorithm. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2020), pp. 6696-6703, 2020.

Yousefi, N., Lei, Y., Kloft, M., Mollaghasemi, M., and Anagnostopoulos, G. C. Local rademacher complexity-based learning guarantees for multi-task learning. The Journal of Machine Learning Research, 19(1):1385-1431, 2018.

Zhivotovskiy, N. and Hanneke, S. Localization of vc classes: Beyond local rademacher complexities. Theoretical Computer Science, 742:27-49, 2018.

Zhou, D.-X. The covering number in learning theory. Journal of Complexity, 18(3):739-767, 2002.