Published as a conference paper at ICLR 2022

DISCRIMINATIVE SIMILARITY FOR DATA CLUSTERING

Yingzhen Yang
School of Computing and Augmented Intelligence
Arizona State University
Tempe, AZ 85281, USA
yingzhen.yang@asu.edu

Ping Li
Cognitive Computing Lab
Baidu Research
Bellevue, WA 98004, USA
liping11@baidu.com

(Yingzhen Yang's work was conducted as a consulting researcher at Baidu Research, Bellevue, WA, USA.)

ABSTRACT

Similarity-based clustering methods separate data into clusters according to the pairwise similarity between the data, and the pairwise similarity is crucial for their performance. In this paper, we propose Clustering by Discriminative Similarity (CDS), a novel method which learns discriminative similarity for data clustering. CDS learns an unsupervised similarity-based classifier from each data partition and searches for the optimal partition of the data by minimizing the generalization error of the learnt classifiers associated with the data partitions. By generalization analysis via Rademacher complexity, the generalization error bound for the unsupervised similarity-based classifier is expressed as the sum of discriminative similarity between the data from different classes. It is proved that the derived discriminative similarity can also be induced by the integrated squared error bound for kernel density classification. In order to evaluate the performance of the proposed discriminative similarity, we propose a new clustering method using a kernel as the similarity function, CDS via unsupervised kernel classification (CDSK), whose effectiveness is demonstrated by experimental results.

1 INTRODUCTION

Similarity-based clustering methods segment the data based on a similarity measure between the data points; examples include spectral clustering (Ng et al., 2001), the pairwise clustering method (Shental et al., 2003), K-means (Hartigan & Wong, 1979), and kernel K-means (Schölkopf et al., 1998). The success of similarity-based clustering highly depends on the underlying pairwise similarity over the data, which in most cases is constructed empirically, e.g., by a Gaussian kernel or the K-Nearest Neighbor (KNN) graph. In this paper, we model data clustering as a multiclass classification problem and search for the data partition whose associated classifier, trained on the cluster labels, has low generalization error. It is therefore natural to formulate the data clustering problem as one of training unsupervised classifiers: a classifier can be trained on each candidate partition of the data, and the quality of the data partition can be evaluated by the performance of the trained classifier. Such a classifier, trained on the hypothetical labeling associated with a data partition, is termed an unsupervised classifier. We present Clustering by Discriminative Similarity (CDS), wherein discriminative similarity is derived from the generalization error bound for an unsupervised similarity-based classifier. CDS is based on a novel framework of discriminative clustering by unsupervised classification, wherein an unsupervised classifier is learnt from unlabeled data and the preferred hypothetical labeling should minimize the generalization error bound for the learnt classifier. When the popular Support Vector Machine (SVM) is used in this framework, the unsupervised SVM (Xu et al., 2004) can be deduced. In this paper, a similarity-based classifier, motivated by similarity learning (Balcan et al., 2008; Cortes et al., 2013), is used as the unsupervised classifier.
By generalization analysis via Rademacher complexity, the generalization error bound for the unsupervised similarity-based classifier is expressed as the sum of pairwise similarity between the data from different classes. Such pairwise similarity, parameterized by the weights of the unsupervised similarity-based classifier, serves as the discriminative similarity. The term discriminative similarity emphasizes the fact that the similarity is learnt so as to improve the discriminative capability of a certain classifier, such as the aforementioned unsupervised similarity-based classifier.

1.1 CONTRIBUTIONS AND MAIN RESULTS

Firstly, we present Clustering by Discriminative Similarity (CDS), where discriminative similarity is induced by the generalization error bound for an unsupervised similarity-based classifier on unlabeled data. The generalization bound for such a similarity-based classifier is of independent interest, being among the few results on generalization bounds for classification using general similarity functions (Section B.1 of the appendix). When the general similarity function is set to a Positive Semi-Definite (PSD) kernel, the derived discriminative similarity between two data points $x_i, x_j$ is $S^{K}_{ij} = 2(\alpha_i + \alpha_j - \lambda\alpha_i\alpha_j)K(x_i - x_j)$, where $K$ can be an arbitrary PSD kernel and $\alpha_i$ is the kernel weight associated with $x_i$. With theoretical and empirical study, we argue that $S^{K}_{ij}$ should be used for data clustering instead of the conventional kernel similarity corresponding to uniform kernel weights. In the case of binary classification, we prove that the derived discriminative similarity $S^{K}_{ij}$ has the same form as the similarity induced by the integrated squared error bound for kernel density classification (Section A of the appendix). Such a connection suggests that there exists an information-theoretic measure which is implicitly equivalent to our CDS framework for unsupervised learning, and that our CDS framework is well grounded for learning similarity from unlabeled data.

Secondly, based on our CDS model, we develop a clustering algorithm termed Clustering by Discriminative Similarity via unsupervised Kernel classification (CDSK) in Section 5. CDSK uses a PSD kernel as the similarity function and outperforms competing clustering algorithms, including nonparametric discriminative-similarity-based clustering methods and similarity-graph-based clustering methods, demonstrating the effectiveness of CDSK. When the kernel weights $\{\alpha_i\}$ are uniform, CDSK is equivalent to kernel K-means (Schölkopf et al., 1998); CDSK is more flexible in that it learns adaptive kernel weights associated with different data points.
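To make the discriminative similarity above concrete, the following is a minimal sketch of computing $S^{K}_{ij} = 2(\alpha_i + \alpha_j - \lambda\alpha_i\alpha_j)K(x_i - x_j)$ with an isotropic Gaussian kernel. The function names, the toy data, and the particular choices of $\lambda$ and the bandwidth $\tau$ are ours for illustration; the learning of the weights $\alpha$ (Section 5) is not shown here.

```python
import numpy as np

def gaussian_kernel(X, tau):
    """Isotropic Gaussian kernel matrix: K(x_i - x_j) = exp(-||x_i - x_j||^2 / (2 * tau^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * tau ** 2))

def discriminative_similarity(X, alpha, lam, tau=1.0):
    """Discriminative similarity S^K_{ij} = 2 (alpha_i + alpha_j - lam * alpha_i * alpha_j) K(x_i - x_j)."""
    K = gaussian_kernel(X, tau)
    weights = alpha[:, None] + alpha[None, :] - lam * np.outer(alpha, alpha)
    return 2.0 * weights * K

# With uniform weights alpha_i = 1/n the matrix reduces (up to scaling) to the plain
# kernel similarity used by kernel K-means; non-uniform, learned weights make the
# similarity adaptive to individual data points.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
alpha_uniform = np.full(50, 1.0 / 50)
S_K = discriminative_similarity(X, alpha_uniform, lam=1.0, tau=1.0)
print(S_K.shape)  # (50, 50)
```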
1.2 CONNECTION TO RELATED WORKS

Our CDS model is related to a class of discriminative clustering methods which classify unlabeled data by various measures on discriminative unsupervised classifiers; such measures include the generalization error (Xu et al., 2004) and the entropy of the posterior distribution of the label (Gomes et al., 2010). Discriminative clustering methods (Xu et al., 2004) predict the labels of unlabeled data by minimizing the generalization error bound for the unsupervised classifier with respect to the hypothetical labeling. The unsupervised SVM proposed in Xu et al. (2004) learns a binary classifier that partitions unlabeled data with the maximum margin between different clusters. The theoretical properties of unsupervised SVM are further analyzed in Karnin et al. (2012). A kernel logistic regression classifier is employed in Gomes et al. (2010), which uses the entropy of the posterior distribution of the class label to measure the quality of the hypothetical labeling. Our CDS model performs discriminative clustering based on a novel unsupervised classification framework by considering similarity-based or kernel classifiers, which are important classification methods in the supervised learning literature. In contrast with kernel similarity with uniform weights, the induced discriminative similarity with learnable weights has an enhanced capability to represent complex interconnections between data. The generalization analysis for CDS is primarily based on distribution-free Rademacher complexity. While Yang et al. (2014a) propose a nonparametric discriminative similarity for clustering, that nonparametric similarity requires probability density estimation, which is difficult for high-dimensional data, and the fixed nonparametric similarity is not adaptive to complicated data distributions.

The paper is organized as follows. We introduce the problem setup of Clustering by Discriminative Similarity in Section 3. We then derive the generalization error bound for the unsupervised similarity-based classifier for CDS in Section 4, where the proposed discriminative similarity is induced by the error bound. The application of CDS to data clustering is shown in Section 5. Throughout this paper, the term kernel stands for a PSD kernel unless otherwise noted.

2 SIGNIFICANCE OF CDSK OVER EXISTING DISCRIMINATIVE AND SIMILARITY-BASED CLUSTERING METHODS

Effective data similarity highly depends on the underlying probability distribution and the geometric structure of the data, and these two characteristics lead to data-driven similarity, such as Zhu et al. (2014); Bicego et al. (2021); Ng et al. (2001); Shental et al. (2003); Hartigan & Wong (1979); Schölkopf et al. (1998), and similarity based on the geometric structure of the data, such as the subspace structure (Sparse Subspace Clustering, or SSC, in Elhamifar & Vidal (2013)). Note that the sparse graph method, ℓ1-Graph (Yan & Wang, 2009), has the same formulation as SSC. Most existing clustering methods based on data-driven or geometric-structure-driven similarity suffer from a common deficiency: the similarity is not explicitly optimized for the purpose of separating the underlying clusters. In particular, the Random Forest-based similarity (Zhu et al., 2014; Bicego et al., 2021) is extracted from features in decision trees. Previous works on subspace-based similarity (Yan & Wang, 2009; Elhamifar & Vidal, 2013) try to ensure that only data points lying on or close to the same subspace have nonzero similarity, so that data points from the same subspace can form a cluster. However, it is not guaranteed that the features in the decision trees are discriminative enough to separate clusters, because the candidate data partition (or candidate cluster labels) does not participate in the feature or similarity extraction process. Note that a synthetically generated negative class is suggested in Zhu et al. (2014); Bicego et al. (2021) to train an unsupervised random forest; however, the synthetic labels are not labels of the original data.
Moreover, it is well known that existing subspace learning methods only obtain reliable subspace-based similarity under restrictive geometric assumptions on the data and the underlying subspaces, such as a large principal angle between intersecting subspaces (Soltanolkotabi & Candes, 2012; Elhamifar & Vidal, 2013). Therefore, it is particularly important to derive a similarity for clustering which meets two requirements: (1) a discriminative measure, using information such as the cluster partition, is employed to derive the similarity so as to achieve compelling clustering performance; (2) it requires less restrictive assumptions on the geometric structure of the data than current geometric-structure-based similarity learning methods, such as subspace clustering (Yan & Wang, 2009; Elhamifar & Vidal, 2013).

Significance. The proposed discriminative similarity of this paper meets these two requirements. First, the discriminative similarity is derived from the generalization error bound associated with a candidate cluster labeling, and minimizing the objective function of our optimization problem for clustering renders a joint optimization of the discriminative similarity and the candidate cluster labeling in such a way that the similarity-based classifier has a small generalization error bound. Second, our framework only assumes the mild classification model in Definition 3.1, which only requires an unknown joint distribution over the data and its labels. In this way, restrictive geometric assumptions are avoided in our method.

Compared to existing discriminative clustering methods, such as MMC (Xu et al., 2004), BMMC (Chen et al., 2014), RIM (Gomes et al., 2010), and other discriminative clustering methods (Huang et al., 2015; Nguyen et al., 2017), the optimization problem of CDSK with its discriminative similarity-based formulation is much easier to solve, and it enjoys convexity and efficiency in each iteration of the coordinate descent described in Algorithm 1. In particular, as mentioned in Section D of the appendix, the first step (11) of each iteration can be solved by efficient SVD or other randomized large-scale SVD methods, and the second step (12) of each iteration can be solved by efficient SMO (Platt, 1998); a schematic sketch of this alternating scheme is given at the end of this section. Moreover, the optimization problems in these two steps are either convex or admit closed-form solutions. In contrast, MMC requires expensive semidefinite programming. RIM has to solve a nonconvex optimization problem, and its formulation does not guarantee that the trained multi-class kernelized logistic regression has low classification error on the candidate labeling, which explains its inferior performance compared to our method. The discriminative Extreme Learning Machine (Huang et al., 2015) trains an ELM using labels produced by a simple clustering method such as K-means, and potentially poor cluster labels from that simple method can easily result in unsatisfactory performance. The discriminative Bayesian nonparametric clustering of Nguyen et al. (2017) and BMMC (Chen et al., 2014) require extra effort in sampling hidden variables and tuning hyperparameters to generate the desired number of clusters (model selection), which can reduce the effect of the discriminative measures used in these Bayesian nonparametric methods.
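As a rough illustration of the alternating optimization just described, the sketch below alternates between (i) a spectral relaxation of the cluster labels obtained from an eigendecomposition of the current discriminative similarity matrix and (ii) an update of the kernel weights $\alpha$ on the simplex. This is only a schematic skeleton under our own assumptions: the helper names, the relaxed objective, and the use of a generic constrained solver in place of SMO are illustrative stand-ins, not the paper's exact steps (11) and (12) of Algorithm 1.

```python
import numpy as np
from scipy.optimize import minimize

def cdsk_alternating(build_similarity, n, c, n_iters=10):
    """Schematic two-block coordinate descent: (i) update a relaxed cluster indicator
    from the leading eigenvectors of the similarity matrix (SVD/eigen step), then
    (ii) update the kernel weights alpha on the probability simplex (convex step).
    Illustrative skeleton only, not the paper's Algorithm 1."""
    alpha = np.full(n, 1.0 / n)                     # start from uniform weights
    for _ in range(n_iters):
        S = build_similarity(alpha)                 # discriminative similarity for current alpha
        # Step (i): top-c eigenvectors of S act as a relaxed cluster indicator matrix.
        _, V = np.linalg.eigh(S)
        F = V[:, -c:]
        # Step (ii): choose alpha to favor within-cluster similarity under the
        # simplex constraint; a generic solver stands in for SMO here.
        def neg_within_cluster(a):
            return -np.sum(build_similarity(a) * (F @ F.T))
        res = minimize(neg_within_cluster, alpha,
                       bounds=[(0.0, 1.0)] * n,
                       constraints=({'type': 'eq', 'fun': lambda a: np.sum(a) - 1.0},))
        alpha = res.x
    labels = np.argmax(np.abs(F), axis=1)           # crude rounding of the relaxation
    return labels, alpha

# Usage with the discriminative_similarity sketch from Section 1.1:
# labels, alpha = cdsk_alternating(lambda a: discriminative_similarity(X, a, lam=1.0), n=len(X), c=3)
```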
3 PROBLEM SETUP

We introduce the problem setup of clustering by unsupervised classification. Given unlabeled data $\{x_l\}_{l=1}^{n} \subset \mathbb{R}^d$, clustering is equivalent to searching for the hypothetical labeling which is optimal in some sense. Each hypothetical labeling corresponds to a candidate data partition. Figure 1 illustrates four binary hypothetical labelings which correspond to four partitions of the data, where the data is divided into two clusters by each hypothetical labeling.

Figure 1: Illustration of binary hypothetical labelings.

The discriminative clustering literature (Xu et al., 2004; Gomes et al., 2010) has demonstrated the potential of multi-class classification for the clustering problem. Inspired by the natural connection between clustering and classification, we propose the framework of Clustering by Unsupervised Classification, which models the clustering problem as a multi-class classification problem. A classifier is learnt from the unlabeled data with a hypothetical labeling, which is associated with a candidate partition of the unlabeled data. The optimal hypothetical labeling is supposed to be the one such that its associated classifier has the minimum generalization error bound. To study the generalization bound for the classifier learnt from a hypothetical labeling, the concept of a classification model is needed. Given unlabeled data $\{x_l\}_{l=1}^{n}$, a classification model $\mathcal{M}_{\mathcal{Y}}$ is constructed for any hypothetical labeling $\mathcal{Y} = \{y_l\}_{l=1}^{n}$ as follows.

Definition 3.1. The classification model corresponding to the hypothetical labeling $\mathcal{Y} = \{y_l\}_{l=1}^{n}$ is defined as $\mathcal{M}_{\mathcal{Y}} = (\mathcal{S}, F)$. $\mathcal{S} = \{x_l, y_l\}_{l=1}^{n}$ are the data labeled by the hypothetical labeling $\mathcal{Y}$, and $\mathcal{S}$ is assumed to consist of i.i.d. samples drawn from some unknown joint distribution $P_{XY}$, where $(X, Y)$ is a random couple, $X \in \mathcal{X} \subset \mathbb{R}^d$ represents the data in some compact domain $\mathcal{X}$, and $Y \in \{1, 2, \ldots, c\}$ is the class label of $X$, with $c$ being the number of classes. $F$ is a classifier trained on $\mathcal{S}$. The generalization error of the classification model $\mathcal{M}_{\mathcal{Y}}$ is defined as the generalization error of the classifier $F$ in $\mathcal{M}_{\mathcal{Y}}$.

The basic assumption of CDS is that the optimal hypothetical labeling minimizes the generalization error bound for the classification model. With $F$ being different classifiers, different discriminative clustering models can be derived. When the SVM is used as the classifier $F$ in the above discriminative model, the unsupervised SVM (Xu et al., 2004) is obtained.

In Balcan et al. (2008), the authors propose a classification method using general similarity functions. The classification rule measures the similarity of the test data to each class, and then assigns the test data to the class for which the weighted average of the similarity between the test data and the training data belonging to that class is maximized over all the classes. Inspired by this classification method, we now consider using a general symmetric and continuous function $S \colon \mathcal{X} \times \mathcal{X} \to [0, 1]$ as the similarity function in our CDS model. We propose the following hypothesis:
$$h_S(x, y) = \sum_{i \colon y_i = y} \alpha_i\, S(x, x_i), \qquad (1)$$
and the induced classification rule is sketched at the end of this section. In the next section, we derive a generalization bound for the unsupervised similarity-based classifier based on the above hypothesis, and this generalization bound leads to discriminative similarities for data clustering. When $S$ is a PSD kernel, minimizing the generalization error bound amounts to minimizing a new form of kernel similarity between the data from different clusters, which lays the foundation of the new clustering algorithm presented in Section 5.
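For concreteness, here is a minimal sketch of the hypothesis (1) and the induced classification rule $f_S(x) = \operatorname{argmax}_y h_S(x, y)$. The Python function names, the Gaussian similarity, and the toy data are ours for illustration; in CDS the labels below play the role of a hypothetical labeling and the weights $\alpha$ are learned rather than fixed.

```python
import numpy as np

def h_S(x, X_train, y_train, alpha, similarity):
    """Hypothesis (1): h_S(x, y) = sum over {i: y_i = y} of alpha_i * S(x, x_i), for each class y."""
    sims = np.array([similarity(x, xi) for xi in X_train])
    return {y: float(np.sum(alpha[y_train == y] * sims[y_train == y]))
            for y in np.unique(y_train)}

def f_S(x, X_train, y_train, alpha, similarity):
    """Similarity-based classification rule: predict the class y maximizing h_S(x, y)."""
    scores = h_S(x, X_train, y_train, alpha, similarity)
    return max(scores, key=scores.get)

# Toy example with a Gaussian similarity S(x, t) = exp(-||x - t||^2 / 2) and uniform weights.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))
y_train = rng.integers(0, 2, size=20)          # a hypothetical binary labeling
alpha = np.full(20, 1.0 / 20)
sim = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)
print(f_S(rng.normal(size=3), X_train, y_train, alpha, sim))
```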
4 GENERALIZATION BOUND FOR SIMILARITY-BASED CLASSIFIER

In this section, the generalization error bound for the classification model in Definition 3.1 with the unsupervised similarity-based classifier is derived as a sum of discriminative similarity between the data from different classes.

4.1 GENERALIZATION BOUND

The following notations are introduced before our analysis. Let $\alpha = [\alpha_1, \ldots, \alpha_n]$ be the nonzero weights that sum up to 1, and let $\alpha^{(y)}$ be an $n \times 1$ column vector representing the weights belonging to class $y$, such that $\alpha^{(y)}_i$ is $\alpha_i$ if $y = y_i$, and $0$ otherwise. The margin of the labeled sample $(x, y)$ is defined as $m_{h_S}(x, y) = h_S(x, y) - \max_{y' \neq y} h_S(x, y')$, and the sample $(x, y)$ is classified correctly if $m_{h_S}(x, y) > 0$. The general similarity-based classifier $f_S$ predicts the label of the input $x$ by $f_S(x) = \operatorname{argmax}_{y \in \{1, \ldots, c\}} h_S(x, y)$. We then derive the generalization error bound for $f_S$ using the Rademacher complexity of the function class comprised of all possible margin functions $m_{h_S}$. The Rademacher complexity (Bartlett & Mendelson, 2003; Koltchinskii, 2001) of a function class is defined below.

Definition 4.1. Let $\{\sigma_i\}_{i=1}^{n}$ be $n$ i.i.d. random variables such that $\Pr[\sigma_i = 1] = \Pr[\sigma_i = -1] = \frac{1}{2}$. The Rademacher complexity of a function class $\mathcal{A}$ is defined as
$$\mathcal{R}(\mathcal{A}) = \mathbb{E}_{\{\sigma_i\},\{x_i\}}\Big[\sup_{h \in \mathcal{A}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(x_i)\Big].$$

In order to analyze the generalization property of the classification rule using the general similarity function, we first investigate the properties of the general similarity function and its relationship to PSD kernels in terms of the eigenvalues and eigenfunctions of the associated integral operator. The integral operator $(L_S f)(x) = \int_{\mathcal{X}} S(x, t) f(t)\,\mathrm{d}t$ is well defined. It can be verified that $L_S$ is a compact operator since $S$ is continuous. According to the spectral theorem in operator theory, there exists an orthonormal basis $\{\phi_1, \phi_2, \ldots\}$ of $L^2$ which is comprised of the eigenfunctions of $L_S$, where $L^2$ is the space of measurable functions which are defined over $\mathcal{X}$ and square Lebesgue integrable. $\phi_k$ is an eigenfunction of $L_S$ with eigenvalue $\lambda_k$ if $L_S \phi_k = \lambda_k \phi_k$. The following lemma shows that, under a certain assumption on the eigenvalues and eigenfunctions of $L_S$, a general symmetric and continuous similarity can be decomposed into two PSD kernels.

Lemma 4.1. Suppose $S \colon \mathcal{X} \times \mathcal{X} \to [0, 1]$ is a symmetric continuous function, and $\{\lambda_k\}$ and $\{\phi_k\}$ are the eigenvalues and eigenfunctions of $L_S$, respectively. Suppose $\sum_{k \ge 1} \lambda_k |\phi_k(x)|^2 < C$ for some constant $C > 0$. Then $S(x, t) = \sum_{k \ge 1} \lambda_k \phi_k(x) \phi_k(t)$ for any $x, t \in \mathcal{X}$, and $S$ can be decomposed as the difference between two positive semi-definite kernels, $S(x, t) = S_+(x, t) - S_-(x, t)$, with
$$S_+(x, t) = \sum_{k \colon \lambda_k \ge 0} \lambda_k \phi_k(x) \phi_k(t), \qquad S_-(x, t) = \sum_{k \colon \lambda_k < 0} |\lambda_k| \phi_k(x) \phi_k(t). \qquad (3)$$

We now use a regularization term to bound the Rademacher complexity for the classification rule using the general similarity function. Let $\Omega_+(\alpha) = \sum_{y=1}^{c} \alpha^{(y)\top} \mathbf{S}_+ \alpha^{(y)}$ and $\Omega_-(\alpha) = \sum_{y=1}^{c} \alpha^{(y)\top} \mathbf{S}_- \alpha^{(y)}$, with $[\mathbf{S}_+]_{ij} = S_+(x_i, x_j)$ and $[\mathbf{S}_-]_{ij} = S_-(x_i, x_j)$. The space of all hypotheses $h_S(\cdot, y)$ associated with label $y$ is defined as
$$\mathcal{H}_{S,y} = \Big\{(x, y) \mapsto \sum_{i \colon y_i = y} \alpha_i S(x, x_i) \colon \alpha \ge 0,\ \mathbf{1}^\top \alpha = 1,\ \Omega_+(\alpha) \le B_+^2,\ \Omega_-(\alpha) \le B_-^2\Big\}$$
for $1 \le y \le c$, with positive numbers $B_+$ and $B_-$ which bound $\Omega_+$ and $\Omega_-$, respectively. Let the hypothesis space comprising all possible margin functions be $\mathcal{H}_S = \{(x, y) \mapsto m_{h_S}(x, y) \colon h_S(\cdot, y) \in \mathcal{H}_{S,y}\}$. We then present the main result of this section on the generalization error of the unsupervised similarity-based classifier $f_S$.

Theorem 4.2.
Given the discriminative model MY = (S, f S), suppose Ω+(α) B+2, Ω (α) B 2, supx X |S+(x, x)| R2, supx X |S (x, x)| R2 for positive constants B+, B and R. Then with probability 1 δ over the labeled data S with respect to any distribution in PXY , under Published as a conference paper at ICLR 2022 the assumptions of Lemma 4.1, the generalization error of the general classifier f S satisfies R(f S) =Pr [Y = f S(X)] b Rn(f S) + 8R(2c 1)c(B+ + B ) γ n + 16c(2c 1)(B+ + B )R2 where b Rn(f S) = 1 i=1 Φ h S(xi,yi) argmaxy =yh S(xi,y ) γ is the empirical loss of f S on the labeled data, γ > 0 is a constant and Φ is defined as Φ(x) = min {1, max{0, 1 x}}. Moreover, if γ 1, the empirical loss b Rn(f S) satisfies b Rn(f S) 1 1 2 S(xi, xj) + 1 1 i 0 is the weighting parameter for the regularization term Ω+(α) + Ω (α). Note that we do not set λ to 16c(2c 1)R2+8 2γ exactly matching the RHS of (4), because λ controls the weight of the regularization term which bounds the unknown complexity of the function class HS. Note that (7) encourages the discriminative similarity Ssim ij between the data from different classes small. The optimization problem (7) forms the formulation of Clustering by Discriminative Similarity (CDS). By Remark 4.3, when S is a PSD kernel K, S 0, S = S+, Ssim ij reduces to the following discriminative similarity for PSD kernels: SK ij = 2(αi + αj λαiαj)K(xi xj), 1 i, j n, (8) Published as a conference paper at ICLR 2022 and SK ij is the similarity induced by the unsupervised kernel classifier by the kernel K. Without loss of generality, we set K = Kτ(x) = exp( x 2 2 2τ 2 ) which is the isotropic Gaussian kernel with kernel bandwidth τ > 0, and we omit the constant that makes integral of K unit. When setting the general similarity function to kernel Kτ, CDS aims to minimize the error bound for the corresponding unsupervised kernel classifier, which amounts to minimizing the following objective function min α Λ,Y={yi}n i=1 1 i 0 be a weighting parameter, then the cost function d ISE + λ1K(α), designed according to the empirical term ISE(br, r) and the regularization term K(α) in the ISE error bound (16), can be expressed as d ISE + λ1K(α) X 1 i 0, then with probability at least 1 δ over the data {xi}n i=1, the Rademacher complexity of the class HS satisfies R(HS) R(2c 1)c(B+ + B ) n + 2c(2c 1)(B+ + B )R2 δ 2n . (27) Proof of Lemma E.3 . According to Lemma 4.1, S is decomposed into two PSD kernels as S = S+ S . Therefore, the are two Reproducing Kernel Hilbert Spaces H+ S and H S that are associated with S+ and S respectively, and the canonical feature mappings in H+ S and H S are ϕ+ and ϕ , with S+(x, t) = ϕ+(x), ϕ+(t) H+ K and S (x, t) = ϕ (x), ϕ (t) H K. In the following text, we will omit the subscripts H+ K and H K without confusion. For any 1 y c, h S(x, y) = X i: yi=y αi S(x, xi) = w+, ϕ+(x) w , ϕ (x) with w+ 2 = α(y) S+α(y) B+2 and w 2 = α(y) S α(y) B 2. Therefore, HS,y HS,y = {(x, y) w+, ϕ+(x) w , ϕ (x) , w+ 2 B+2, w 2 B 2}, 1 y c, Published as a conference paper at ICLR 2022 and R(HS,y) R( HS,y). Since we are deriving upper bound for R(HS,y), we slightly abuse the notation and let HS,y represent HS,y in the remaining part of this proof. For x, t Rd and any h S HS,y, we have |h S(x) h S(t)| = | w+, ϕ+(x) w , ϕ (x) w+, ϕ+(t) + w , ϕ (t) | = | w+, ϕ+(x) ϕ+(t) + w , ϕ (t) ϕ (x) | B+ ϕ+(x) ϕ+(t) + B ϕ (x) ϕ (t) ) (B+ + B ) q S+(x, x) + S+(t, t) + 2 p S+(x, x)S+(t, t) 2R2(B+ + B ). 
We now approximate the Rademacher complexity of the function class HS,y with its empirical version b R(HS,y) using the sample {xi}. For each 1 y c, Define E(y) {xi} = b R(HS,y) = suph S( ,y) HS,y 1 i=1 σih S(xi, y) , then c P y=1 R(HS,y) = E{xi} h c P y=1 E(y) {xi} i , and sup x1,...,xn,x t E(y) x1,...,xt 1,xt,xt+1,...,xn E(y) x1,...,xt 1,x t,xt+1,...,xn = sup x1,...,xn,x t sup h S( ,y) HS,y i=1 σih S(xi, y) sup h S( ,y) HS,y i =t σih S(xi, y) + h S(x t, y) n sup x1,...,xn,x t E{σi} " sup h S( ,y) HS,y i=1 σih S(xi, y) sup h S( ,y) HS,y i =t σih S(xi, y) + h S(x t, y) n sup x1,...,xn,x t E{σi} sup h S( ,y) HS,y i=1 σih S(xi, y) 1 i =t σih S(xi, y) + h S(x t, y) n sup x1,...,xn,x t E{σi} sup h S( ,y) HS,y i=1 σih S(xi, y) 1 i =t σih S(xi, y) + h S(x t, y) n = sup xt,x t E{σi} sup h S( ,y) HS,y n h S(x t, y) n 2R2(B+ + B ) It follows that c P y=1 E(y) x1,...,xt 1,xt,xt+1,...,xn c P y=1 E(y) x1,...,xt 1,x t,xt+1,...,xn 2R2(B++B )c cording to the Mc Diarmid s Inequality, y=1 b R(HS,y) y=1 R(HS,y) ε i 2 exp nε2 2(B+ + B )2R4c2 . (28) Published as a conference paper at ICLR 2022 Now we derive the upper bound for the empirical Rademacher complexity: y=1 b R(HS,y) = sup h S HS,y j=1 σih S(xi) sup w+ B+, w B i=1 σi w+, ϕ+(xi) w , ϕ (xi) i=1 σiϕ+(xi) + B i=1 σiϕ (xi) i=1 σiϕ+(xi) i=1 σiϕ (xi) v u u t E{σi} i=1 σiϕ+(xi) 2 # v u u t E{σi} i=1 σiϕ (xi) 2 # i s+(xi, xi) + B c i s (xi, xi) Rc n(B+ + B ). By Lemma E.2, (28) and (29), with probability at least 1 δ, we have R(HS) (2c 1) y=1 R(HS,y) R(2c 1)c(B+ + B ) n + 2c(2c 1)(B+ + B )R2 Proof of Theorem 4.2 . According to Theorem 2 in Koltchinskii & Panchenko (2002), with probability 1 δ over the labeled data S with respect to any distribution in P, the generalization error of the kernel classifier f S satisfies R(f S) b Rn(f S) + 8 where b Rn(f S) = 1 i=1 Φ( mh S (xi,yi) γ ) is empirical error of the classifier for γ > 0. Due to the facts that mh S(x, y) = h S(xi, yi) argmaxy =yih S(xi, y ), α is a positive vector and S is nonnegative, we have mh S(x, y) h S(xi, yi) P y =yi h S(xi, y). Note that Φ( ) is a non-increasing function, it follows that Φ(mh S(xi, yi) γ ) Φ h S(xi, yi) P y =yi h S(xi, y) Applying Lemma E.3, (4) holds with probability 1 δ. When γ 1, it can be verified that h S(xi, yi) P y =yi h S(xi, y) c P y=1 h S(xi, y) n P i=1 αi = 1 γ for all (xi, yi), so that Φ h S(xi, yi) P y =yi h S(xi, y) h(xi, yi) P y =yi h(xi, y) Published as a conference paper at ICLR 2022 Note that when h S(xi, yi) P y =yi h S(xi, y) 0, then 1 h S(xi, yi) P y =yi h S(xi, y) that Φ h S(xi,yi) P y =yi h S(xi,y) y =yi h(xi,y) γ . If h S(xi, yi) P y =yi h S(xi, y) < 0, we have Φ h S(xi,yi) P y =yi h S(xi,y) y =yi h(xi,y) γ . Therefore, (33) always holds. By the definition of b Rn(f S), (32), and (33), (5) is obtained. Remark E.4. It can be verified that the image of the similarity function S in Lemma 4.1 and Theorem 4.2 can be generalized from [0, 1] to [0, a] for any a R, a > 0 with the condition γ 1 replaced by γ a. This is because LS is a compact operator for continuous similarity function S : X X [0, a], and h S(xi, yi) P y =yi h S(xi, y) c P y=1 h S(xi, y) a. Furthermore, given a symmetric and continuous function S : X X [c, d], c, d R, c < d, we can obtain a symmetric and continuous function S : X X [0, 1] by setting S = S a b a , and then apply all the theoretical results of this paper to CDS with S being the similarity function for the similarity-based classifier. Proof of Lemma E.2. 
Inspired by Koltchinskii & Panchenko (2002), we first prove that the Rademacher complexity of the function class formed by the maximum of several hypotheses is bounded by two times the sum of the Rademacher complexity of the function classes that these hypothesis belong to. That is, y=1 R(HS,y), (34) where Hmax = {max{h1, . . . , hk}: hy HS,y, 1 y k} for 1 k c 1. If no confusion arises, the notations ({σi}, {xi, yi}) are omitted in the subscript of the expectation operator in the following text, i.e., E{σi},{xi,yi} is abbreviated to E. According to Theorem 11 of Koltchinskii & Panchenko (2002), it can be verified that E{σi},{xi,yi} i=1 σih S(xi) y=1 E{σi},{xi,yi} i=1 σih S(xi) R(Hmax) = E{σi},{xi,yi} i=1 σih S(xi) E{σi},{xi,yi} " sup h Hmax i=1 σih S(xi) + # +E{σi},{xi,yi} " sup h Hmax 1 i=1 σih S(xi) + # = 2E{σi},{xi,yi} " sup h Hmax i=1 σih S(xi) + # y=1 E{σi},{xi,yi} " sup h HS,y i=1 σih S(xi) + # Published as a conference paper at ICLR 2022 y=1 E{σi},{xi,yi} i=1 σih S(xi) y=1 R(HS,y). (35) The equality in the third line of (35) is due to the fact that σi has the same distribution as σi. Using this fact again, (34), we have R(HS) = E{σi},{xi,yi} sup mh S HS i=1 σimh S(xi, yi) = E{σi},{xi,yi} sup mh S HS y=1 mh S(xi, y)1Iy=yi y=1 E{σi},{xi,yi} sup mh S HS i=1 σimh S(xi, y)1Iy=yi y=1 E{σi},{xi,yi} sup mh S HS i=1 σimh S(xi, y)(21Iy=yi 1) y=1 E{σi},{xi} sup mh S HS i=1 σimh S(xi, y) y=1 E{σi},{xi} sup mh S HS i=1 σimh S(xi, y) Also, for any given 1 y c, 1 n E{σi},{xi} sup mh S HS i=1 σimh S(xi, y) n E{σi},{xi} sup h S( ,y) HS,y,y=1...c i=1 σih S(xi, y) σiargmaxy =yh S(xi, y ) n E{σi},{xi} sup h S( ,y) HS,y i=1 σih S(xi, y) n E{σi},{xi} sup h S( ,y ) H S,y,y =y i=1 σiargmaxy =yh S(xi, y ) n E{σi},{xi} sup h S( ,y) HS,y i=1 σih S(xi, y) y =y E{σi},{xi} sup h S( ,y ) H S,y i=1 σih S(xi, y ) Combining (36) and (37), 1 n E{σi},{xi} sup h S( ,y) HS,y i=1 σih S(xi, y) y =y E{σi},{xi} sup h S( ,y ) H S,y i=1 σih S(xi, y ) y=1 E{σi},{xi} sup h S( ,y) HS,y i=1 σih S(xi, y) y=1 R(HS,y). (38) Published as a conference paper at ICLR 2022 E.3 PROOF OF THEOREM A.1 Proof. According to definition of ISE, ISE(br, r) = Z Rd (br r)2dx = Z Rd br(x, α)2dx 2 Z Rd br(x, α)r(x)dx + Z Rd r(x)2dx. (39) For a given distribution, R Rd r(x)2dx is a constant. By Gaussian convolution theorem, Rd br(x, α)2dx = τ1 y=1 α(y) (K 2h)α(y) τ1 X 1 i