Published as a conference paper at ICLR 2022

DISCRIMINATIVE SIMILARITY FOR DATA CLUSTERING

Yingzhen Yang
School of Computing and Augmented Intelligence
Arizona State University
Tempe, AZ 85281, USA
yingzhen.yang@asu.edu

Ping Li
Cognitive Computing Lab
Baidu Research
Bellevue, WA 98004, USA
liping11@baidu.com

(Yingzhen Yang's work was conducted as a consulting researcher at Baidu Research, Bellevue, WA, USA.)

ABSTRACT

Similarity-based clustering methods separate data into clusters according to the pairwise similarity between the data, and the pairwise similarity is crucial for their performance. In this paper, we propose Clustering by Discriminative Similarity (CDS), a novel method which learns discriminative similarity for data clustering. CDS learns an unsupervised similarity-based classifier from each data partition and searches for the optimal partition of the data by minimizing the generalization error of the learnt classifiers associated with the data partitions. By generalization analysis via Rademacher complexity, the generalization error bound for the unsupervised similarity-based classifier is expressed as the sum of discriminative similarity between the data from different classes. It is proved that the derived discriminative similarity can also be induced by the integrated squared error bound for kernel density classification. In order to evaluate the performance of the proposed discriminative similarity, we propose a new clustering method using a kernel as the similarity function, CDS via unsupervised kernel classification (CDSK), whose effectiveness is demonstrated by experimental results.

1 INTRODUCTION

Similarity-based clustering methods segment the data based on a similarity measure between the data points; examples include spectral clustering (Ng et al., 2001), the pairwise clustering method (Shental et al., 2003), K-means (Hartigan & Wong, 1979), and kernel K-means (Schölkopf et al., 1998). The success of similarity-based clustering highly depends on the underlying pairwise similarity over the data, which in most cases is constructed empirically, e.g., by a Gaussian kernel or the K-Nearest Neighbor (KNN) graph. In this paper, we model data clustering as a multiclass classification problem and search for the data partition whose associated classifier, trained on the cluster labels, has low generalization error. It is therefore natural to formulate the data clustering problem as one of training unsupervised classifiers: a classifier can be trained on each candidate partition of the data, and the quality of the data partition can be evaluated by the performance of the trained classifier. Such a classifier, trained on the hypothetical labeling associated with a data partition, is termed an unsupervised classifier. We present Clustering by Discriminative Similarity (CDS), wherein discriminative similarity is derived from the generalization error bound for an unsupervised similarity-based classifier. CDS is based on a novel framework of discriminative clustering by unsupervised classification, wherein an unsupervised classifier is learnt from unlabeled data and the preferred hypothetical labeling should minimize the generalization error bound for the learnt classifier. When the popular Support Vector Machine (SVM) is used in this framework, the unsupervised SVM (Xu et al., 2004) can be deduced. In this paper, a similarity-based classifier, motivated by similarity learning (Balcan et al., 2008; Cortes et al., 2013), is used as the unsupervised classifier.
By generalization analysis via Rademacher complexity, the generalization error bound for the unsupervised similarity-based classifier is expressed as the sum of pairwise similarity between the data from different classes. Such pairwise similarity, parameterized by the weights of the unsupervised similarity-based classifier, serves as the discriminative similarity. The term discriminative similarity emphasizes the fact that the similarity is learnt so as to improve the discriminative capability of a certain classifier, such as the aforementioned unsupervised similarity-based classifier.

1.1 CONTRIBUTIONS AND MAIN RESULTS

Firstly, we present Clustering by Discriminative Similarity (CDS), where discriminative similarity is induced by the generalization error bound for an unsupervised similarity-based classifier on unlabeled data. The generalization bound for such a similarity-based classifier is of independent interest, being among the few results on generalization bounds for classification using general similarity functions (Section B.1 of the appendix). When the general similarity function is set to a Positive Semi-Definite (PSD) kernel, the derived discriminative similarity between two data points $x_i, x_j$ is $S^{K}_{ij} = 2(\alpha_i + \alpha_j - \lambda\alpha_i\alpha_j)K(x_i - x_j)$, where $K$ can be an arbitrary PSD kernel and $\alpha_i$ is the kernel weight associated with $x_i$. With theoretical and empirical study, we argue that $S^{K}_{ij}$ should be used for data clustering instead of the conventional kernel similarity corresponding to uniform kernel weights. In the case of binary classification, we prove that the derived discriminative similarity $S^{K}_{ij}$ has the same form as the similarity induced by the integrated squared error bound for kernel density classification (Section A of the appendix). Such a connection suggests that there exists an information-theoretic measure which is implicitly equivalent to our CDS framework for unsupervised learning, and that our CDS framework is well grounded for learning similarity from unlabeled data.

Secondly, based on our CDS model, we develop a clustering algorithm termed Clustering by Discriminative Similarity via unsupervised Kernel classification (CDSK) in Section 5. CDSK uses a PSD kernel as the similarity function and outperforms competing clustering algorithms, including nonparametric discriminative-similarity-based clustering methods and similarity-graph-based clustering methods, demonstrating the effectiveness of CDSK. When the kernel weights $\{\alpha_i\}$ are uniform, CDSK is equivalent to kernel K-means (Schölkopf et al., 1998); CDSK is more flexible in that it learns adaptive kernel weights associated with different data points.
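To make the discriminative similarity above concrete, the following is a minimal sketch of computing $S^{K}_{ij} = 2(\alpha_i + \alpha_j - \lambda\alpha_i\alpha_j)K(x_i - x_j)$ with an isotropic Gaussian kernel. The function names, the toy data, and the particular choices of $\lambda$ and the bandwidth $\tau$ are ours for illustration; the learning of the weights $\alpha$ (Section 5) is not shown here.

```python
import numpy as np

def gaussian_kernel(X, tau):
    """Isotropic Gaussian kernel matrix: K(x_i - x_j) = exp(-||x_i - x_j||^2 / (2 * tau^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * tau ** 2))

def discriminative_similarity(X, alpha, lam, tau=1.0):
    """Discriminative similarity S^K_{ij} = 2 (alpha_i + alpha_j - lam * alpha_i * alpha_j) K(x_i - x_j)."""
    K = gaussian_kernel(X, tau)
    weights = alpha[:, None] + alpha[None, :] - lam * np.outer(alpha, alpha)
    return 2.0 * weights * K

# With uniform weights alpha_i = 1/n the matrix reduces (up to scaling) to the plain
# kernel similarity used by kernel K-means; non-uniform, learned weights make the
# similarity adaptive to individual data points.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
alpha_uniform = np.full(50, 1.0 / 50)
S_K = discriminative_similarity(X, alpha_uniform, lam=1.0, tau=1.0)
print(S_K.shape)  # (50, 50)
```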
1.2 CONNECTION TO RELATED WORKS

Our CDS model is related to a class of discriminative clustering methods which classify unlabeled data by various measures on discriminative unsupervised classifiers; such measures include the generalization error (Xu et al., 2004) and the entropy of the posterior distribution of the label (Gomes et al., 2010). Discriminative clustering methods (Xu et al., 2004) predict the labels of unlabeled data by minimizing the generalization error bound for the unsupervised classifier with respect to the hypothetical labeling. The unsupervised SVM proposed in Xu et al. (2004) learns a binary classifier that partitions unlabeled data with the maximum margin between different clusters. The theoretical properties of unsupervised SVM are further analyzed in Karnin et al. (2012). A kernel logistic regression classifier is employed in Gomes et al. (2010), which uses the entropy of the posterior distribution of the class label to measure the quality of the hypothetical labeling. Our CDS model performs discriminative clustering based on a novel unsupervised classification framework by considering similarity-based or kernel classifiers, which are important classification methods in the supervised learning literature. In contrast with kernel similarity with uniform weights, the induced discriminative similarity with learnable weights has an enhanced capability to represent complex interconnections between data. The generalization analysis for CDS is primarily based on distribution-free Rademacher complexity. While Yang et al. (2014a) propose a nonparametric discriminative similarity for clustering, that nonparametric similarity requires probability density estimation, which is difficult for high-dimensional data, and the fixed nonparametric similarity is not adaptive to complicated data distributions.

The paper is organized as follows. We introduce the problem setup of Clustering by Discriminative Similarity in Section 3. We then derive the generalization error bound for the unsupervised similarity-based classifier for CDS in Section 4, where the proposed discriminative similarity is induced by the error bound. The application of CDS to data clustering is shown in Section 5. Throughout this paper, the term kernel stands for a PSD kernel unless otherwise noted.

2 SIGNIFICANCE OF CDSK OVER EXISTING DISCRIMINATIVE AND SIMILARITY-BASED CLUSTERING METHODS

Effective data similarity highly depends on the underlying probability distribution and the geometric structure of the data, and these two characteristics lead to data-driven similarity, such as Zhu et al. (2014); Bicego et al. (2021); Ng et al. (2001); Shental et al. (2003); Hartigan & Wong (1979); Schölkopf et al. (1998), and similarity based on the geometric structure of the data, such as the subspace structure (Sparse Subspace Clustering, or SSC, in Elhamifar & Vidal (2013)). Note that the sparse graph method, ℓ1-Graph (Yan & Wang, 2009), has the same formulation as SSC. Most existing clustering methods based on data-driven or geometric-structure-driven similarity suffer from a common deficiency: the similarity is not explicitly optimized for the purpose of separating the underlying clusters. In particular, the Random Forest-based similarity (Zhu et al., 2014; Bicego et al., 2021) is extracted from features in decision trees. Previous works on subspace-based similarity (Yan & Wang, 2009; Elhamifar & Vidal, 2013) try to ensure that only data points lying on or close to the same subspace have nonzero similarity, so that data points from the same subspace can form a cluster. However, it is not guaranteed that the features in the decision trees are discriminative enough to separate clusters, because the candidate data partition (or candidate cluster labels) does not participate in the feature or similarity extraction process. Note that a synthetically generated negative class is suggested in Zhu et al. (2014); Bicego et al. (2021) to train an unsupervised random forest; however, the synthetic labels are not labels of the original data.
Moreover, it is well known that existing subspace learning methods only obtain reliable subspace-based similarity under restrictive geometric assumptions on the data and the underlying subspaces, such as a large principal angle between intersecting subspaces (Soltanolkotabi & Candes, 2012; Elhamifar & Vidal, 2013). Therefore, it is particularly important to derive a similarity for clustering which meets two requirements: (1) a discriminative measure, using information such as the cluster partition, is employed to derive the similarity so as to achieve compelling clustering performance; (2) it requires less restrictive assumptions on the geometric structure of the data than current geometric-structure-based similarity learning methods, such as subspace clustering (Yan & Wang, 2009; Elhamifar & Vidal, 2013).

Significance. The proposed discriminative similarity of this paper meets these two requirements. First, the discriminative similarity is derived from the generalization error bound associated with a candidate cluster labeling, and minimizing the objective function of our optimization problem for clustering renders a joint optimization of the discriminative similarity and the candidate cluster labeling in such a way that the similarity-based classifier has a small generalization error bound. Second, our framework only assumes the mild classification model in Definition 3.1, which only requires an unknown joint distribution over the data and its labels. In this way, restrictive geometric assumptions are avoided in our method.

Compared to existing discriminative clustering methods, such as MMC (Xu et al., 2004), BMMC (Chen et al., 2014), RIM (Gomes et al., 2010), and other discriminative clustering methods (Huang et al., 2015; Nguyen et al., 2017), the optimization problem of CDSK with its discriminative similarity-based formulation is much easier to solve, and it enjoys convexity and efficiency in each iteration of the coordinate descent described in Algorithm 1. In particular, as mentioned in Section D of the appendix, the first step (11) of each iteration can be solved by efficient SVD or other randomized large-scale SVD methods, and the second step (12) of each iteration can be solved by efficient SMO (Platt, 1998); a schematic sketch of this alternating scheme is given at the end of this section. Moreover, the optimization problems in these two steps are either convex or admit closed-form solutions. In contrast, MMC requires expensive semidefinite programming. RIM has to solve a nonconvex optimization problem, and its formulation does not guarantee that the trained multi-class kernelized logistic regression has low classification error on the candidate labeling, which explains its inferior performance compared to our method. The discriminative Extreme Learning Machine (Huang et al., 2015) trains an ELM using labels produced by a simple clustering method such as K-means, and potentially poor cluster labels from that simple method can easily result in unsatisfactory performance. The discriminative Bayesian nonparametric clustering of Nguyen et al. (2017) and BMMC (Chen et al., 2014) require extra effort in sampling hidden variables and tuning hyperparameters to generate the desired number of clusters (model selection), which can reduce the effect of the discriminative measures used in these Bayesian nonparametric methods.
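As a rough illustration of the alternating optimization just described, the sketch below alternates between (i) a spectral relaxation of the cluster labels obtained from an eigendecomposition of the current discriminative similarity matrix and (ii) an update of the kernel weights $\alpha$ on the simplex. This is only a schematic skeleton under our own assumptions: the helper names, the relaxed objective, and the use of a generic constrained solver in place of SMO are illustrative stand-ins, not the paper's exact steps (11) and (12) of Algorithm 1.

```python
import numpy as np
from scipy.optimize import minimize

def cdsk_alternating(build_similarity, n, c, n_iters=10):
    """Schematic two-block coordinate descent: (i) update a relaxed cluster indicator
    from the leading eigenvectors of the similarity matrix (SVD/eigen step), then
    (ii) update the kernel weights alpha on the probability simplex (convex step).
    Illustrative skeleton only, not the paper's Algorithm 1."""
    alpha = np.full(n, 1.0 / n)                     # start from uniform weights
    for _ in range(n_iters):
        S = build_similarity(alpha)                 # discriminative similarity for current alpha
        # Step (i): top-c eigenvectors of S act as a relaxed cluster indicator matrix.
        _, V = np.linalg.eigh(S)
        F = V[:, -c:]
        # Step (ii): choose alpha to favor within-cluster similarity under the
        # simplex constraint; a generic solver stands in for SMO here.
        def neg_within_cluster(a):
            return -np.sum(build_similarity(a) * (F @ F.T))
        res = minimize(neg_within_cluster, alpha,
                       bounds=[(0.0, 1.0)] * n,
                       constraints=({'type': 'eq', 'fun': lambda a: np.sum(a) - 1.0},))
        alpha = res.x
    labels = np.argmax(np.abs(F), axis=1)           # crude rounding of the relaxation
    return labels, alpha

# Usage with the discriminative_similarity sketch from Section 1.1:
# labels, alpha = cdsk_alternating(lambda a: discriminative_similarity(X, a, lam=1.0), n=len(X), c=3)
```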
3 PROBLEM SETUP

We introduce the problem setup of clustering by unsupervised classification. Given unlabeled data $\{x_l\}_{l=1}^{n} \subset \mathbb{R}^d$, clustering is equivalent to searching for the hypothetical labeling which is optimal in some sense. Each hypothetical labeling corresponds to a candidate data partition. Figure 1 illustrates four binary hypothetical labelings which correspond to four partitions of the data, where the data is divided into two clusters by each hypothetical labeling.

Figure 1: Illustration of binary hypothetical labelings.

The discriminative clustering literature (Xu et al., 2004; Gomes et al., 2010) has demonstrated the potential of multi-class classification for the clustering problem. Inspired by the natural connection between clustering and classification, we propose the framework of Clustering by Unsupervised Classification, which models the clustering problem as a multi-class classification problem. A classifier is learnt from the unlabeled data with a hypothetical labeling, which is associated with a candidate partition of the unlabeled data. The optimal hypothetical labeling is supposed to be the one such that its associated classifier has the minimum generalization error bound. To study the generalization bound for the classifier learnt from a hypothetical labeling, the concept of a classification model is needed. Given unlabeled data $\{x_l\}_{l=1}^{n}$, a classification model $\mathcal{M}_{\mathcal{Y}}$ is constructed for any hypothetical labeling $\mathcal{Y} = \{y_l\}_{l=1}^{n}$ as follows.

Definition 3.1. The classification model corresponding to the hypothetical labeling $\mathcal{Y} = \{y_l\}_{l=1}^{n}$ is defined as $\mathcal{M}_{\mathcal{Y}} = (\mathcal{S}, F)$. $\mathcal{S} = \{x_l, y_l\}_{l=1}^{n}$ are the data labeled by the hypothetical labeling $\mathcal{Y}$, and $\mathcal{S}$ is assumed to consist of i.i.d. samples drawn from some unknown joint distribution $P_{XY}$, where $(X, Y)$ is a random couple, $X \in \mathcal{X} \subset \mathbb{R}^d$ represents the data in some compact domain $\mathcal{X}$, and $Y \in \{1, 2, \ldots, c\}$ is the class label of $X$, with $c$ being the number of classes. $F$ is a classifier trained on $\mathcal{S}$. The generalization error of the classification model $\mathcal{M}_{\mathcal{Y}}$ is defined as the generalization error of the classifier $F$ in $\mathcal{M}_{\mathcal{Y}}$.

The basic assumption of CDS is that the optimal hypothetical labeling minimizes the generalization error bound for the classification model. With $F$ being different classifiers, different discriminative clustering models can be derived. When the SVM is used as the classifier $F$ in the above discriminative model, the unsupervised SVM (Xu et al., 2004) is obtained.

In Balcan et al. (2008), the authors propose a classification method using general similarity functions. The classification rule measures the similarity of the test data to each class, and then assigns the test data to the class for which the weighted average of the similarity between the test data and the training data belonging to that class is maximized over all the classes. Inspired by this classification method, we now consider using a general symmetric and continuous function $S \colon \mathcal{X} \times \mathcal{X} \to [0, 1]$ as the similarity function in our CDS model. We propose the following hypothesis:
$$h_S(x, y) = \sum_{i \colon y_i = y} \alpha_i\, S(x, x_i), \qquad (1)$$
and the induced classification rule is sketched at the end of this section. In the next section, we derive a generalization bound for the unsupervised similarity-based classifier based on the above hypothesis, and this generalization bound leads to discriminative similarities for data clustering. When $S$ is a PSD kernel, minimizing the generalization error bound amounts to minimizing a new form of kernel similarity between the data from different clusters, which lays the foundation of the new clustering algorithm presented in Section 5.
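For concreteness, here is a minimal sketch of the hypothesis (1) and the induced classification rule $f_S(x) = \operatorname{argmax}_y h_S(x, y)$. The Python function names, the Gaussian similarity, and the toy data are ours for illustration; in CDS the labels below play the role of a hypothetical labeling and the weights $\alpha$ are learned rather than fixed.

```python
import numpy as np

def h_S(x, X_train, y_train, alpha, similarity):
    """Hypothesis (1): h_S(x, y) = sum over {i: y_i = y} of alpha_i * S(x, x_i), for each class y."""
    sims = np.array([similarity(x, xi) for xi in X_train])
    return {y: float(np.sum(alpha[y_train == y] * sims[y_train == y]))
            for y in np.unique(y_train)}

def f_S(x, X_train, y_train, alpha, similarity):
    """Similarity-based classification rule: predict the class y maximizing h_S(x, y)."""
    scores = h_S(x, X_train, y_train, alpha, similarity)
    return max(scores, key=scores.get)

# Toy example with a Gaussian similarity S(x, t) = exp(-||x - t||^2 / 2) and uniform weights.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))
y_train = rng.integers(0, 2, size=20)          # a hypothetical binary labeling
alpha = np.full(20, 1.0 / 20)
sim = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)
print(f_S(rng.normal(size=3), X_train, y_train, alpha, sim))
```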
4 GENERALIZATION BOUND FOR SIMILARITY-BASED CLASSIFIER

In this section, the generalization error bound for the classification model in Definition 3.1 with the unsupervised similarity-based classifier is derived as a sum of discriminative similarity between the data from different classes.

4.1 GENERALIZATION BOUND

The following notations are introduced before our analysis. Let $\alpha = [\alpha_1, \ldots, \alpha_n]$ be the nonzero weights that sum up to 1, and let $\alpha^{(y)}$ be an $n \times 1$ column vector representing the weights belonging to class $y$, such that $\alpha^{(y)}_i$ is $\alpha_i$ if $y = y_i$, and $0$ otherwise. The margin of the labeled sample $(x, y)$ is defined as $m_{h_S}(x, y) = h_S(x, y) - \max_{y' \neq y} h_S(x, y')$, and the sample $(x, y)$ is classified correctly if $m_{h_S}(x, y) > 0$. The general similarity-based classifier $f_S$ predicts the label of the input $x$ by $f_S(x) = \operatorname{argmax}_{y \in \{1, \ldots, c\}} h_S(x, y)$. We then derive the generalization error bound for $f_S$ using the Rademacher complexity of the function class comprised of all possible margin functions $m_{h_S}$. The Rademacher complexity (Bartlett & Mendelson, 2003; Koltchinskii, 2001) of a function class is defined below.

Definition 4.1. Let $\{\sigma_i\}_{i=1}^{n}$ be $n$ i.i.d. random variables such that $\Pr[\sigma_i = 1] = \Pr[\sigma_i = -1] = \frac{1}{2}$. The Rademacher complexity of a function class $\mathcal{A}$ is defined as
$$\mathcal{R}(\mathcal{A}) = \mathbb{E}_{\{\sigma_i\},\{x_i\}}\Big[\sup_{h \in \mathcal{A}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(x_i)\Big].$$

In order to analyze the generalization property of the classification rule using the general similarity function, we first investigate the properties of the general similarity function and its relationship to PSD kernels in terms of the eigenvalues and eigenfunctions of the associated integral operator. The integral operator $(L_S f)(x) = \int_{\mathcal{X}} S(x, t) f(t)\,\mathrm{d}t$ is well defined. It can be verified that $L_S$ is a compact operator since $S$ is continuous. According to the spectral theorem in operator theory, there exists an orthonormal basis $\{\phi_1, \phi_2, \ldots\}$ of $L^2$ which is comprised of the eigenfunctions of $L_S$, where $L^2$ is the space of measurable functions which are defined over $\mathcal{X}$ and square Lebesgue integrable. $\phi_k$ is an eigenfunction of $L_S$ with eigenvalue $\lambda_k$ if $L_S \phi_k = \lambda_k \phi_k$. The following lemma shows that, under a certain assumption on the eigenvalues and eigenfunctions of $L_S$, a general symmetric and continuous similarity can be decomposed into two PSD kernels.

Lemma 4.1. Suppose $S \colon \mathcal{X} \times \mathcal{X} \to [0, 1]$ is a symmetric continuous function, and $\{\lambda_k\}$ and $\{\phi_k\}$ are the eigenvalues and eigenfunctions of $L_S$, respectively. Suppose $\sum_{k \ge 1} \lambda_k |\phi_k(x)|^2 < C$ for some constant $C > 0$. Then $S(x, t) = \sum_{k \ge 1} \lambda_k \phi_k(x) \phi_k(t)$ for any $x, t \in \mathcal{X}$, and $S$ can be decomposed as the difference between two positive semi-definite kernels, $S(x, t) = S_+(x, t) - S_-(x, t)$, with
$$S_+(x, t) = \sum_{k \colon \lambda_k \ge 0} \lambda_k \phi_k(x) \phi_k(t), \qquad S_-(x, t) = \sum_{k \colon \lambda_k < 0} |\lambda_k| \phi_k(x) \phi_k(t). \qquad (3)$$

We now use a regularization term to bound the Rademacher complexity for the classification rule using the general similarity function. Let $\Omega_+(\alpha) = \sum_{y=1}^{c} \alpha^{(y)\top} \mathbf{S}_+ \alpha^{(y)}$ and $\Omega_-(\alpha) = \sum_{y=1}^{c} \alpha^{(y)\top} \mathbf{S}_- \alpha^{(y)}$, with $[\mathbf{S}_+]_{ij} = S_+(x_i, x_j)$ and $[\mathbf{S}_-]_{ij} = S_-(x_i, x_j)$. The space of all hypotheses $h_S(\cdot, y)$ associated with label $y$ is defined as
$$\mathcal{H}_{S,y} = \Big\{(x, y) \mapsto \sum_{i \colon y_i = y} \alpha_i S(x, x_i) \colon \alpha \ge 0,\ \mathbf{1}^\top \alpha = 1,\ \Omega_+(\alpha) \le B_+^2,\ \Omega_-(\alpha) \le B_-^2\Big\}$$
for $1 \le y \le c$, with positive numbers $B_+$ and $B_-$ which bound $\Omega_+$ and $\Omega_-$, respectively. Let the hypothesis space comprising all possible margin functions be $\mathcal{H}_S = \{(x, y) \mapsto m_{h_S}(x, y) \colon h_S(\cdot, y) \in \mathcal{H}_{S,y}\}$. We then present the main result of this section on the generalization error of the unsupervised similarity-based classifier $f_S$.

Theorem 4.2.
Given the discriminative model MY = (S, f S), suppose Ω+(α) B+2, Ω (α) B 2, supx X |S+(x, x)| R2, supx X |S (x, x)| R2 for positive constants B+, B and R. Then with probability 1 δ over the labeled data S with respect to any distribution in PXY , under Published as a conference paper at ICLR 2022 the assumptions of Lemma 4.1, the generalization error of the general classifier f S satisfies R(f S) =Pr [Y = f S(X)] b Rn(f S) + 8R(2c 1)c(B+ + B ) γ n + 16c(2c 1)(B+ + B )R2 where b Rn(f S) = 1 i=1 Φ h S(xi,yi) argmaxy =yh S(xi,y ) γ is the empirical loss of f S on the labeled data, γ > 0 is a constant and Φ is defined as Φ(x) = min {1, max{0, 1 x}}. Moreover, if γ 1, the empirical loss b Rn(f S) satisfies b Rn(f S) 1 1 2 S(xi, xj) + 1 1 i 0 is the weighting parameter for the regularization term Ω+(α) + Ω (α). Note that we do not set λ to 16c(2c 1)R2+8 2γ exactly matching the RHS of (4), because λ controls the weight of the regularization term which bounds the unknown complexity of the function class HS. Note that (7) encourages the discriminative similarity Ssim ij between the data from different classes small. The optimization problem (7) forms the formulation of Clustering by Discriminative Similarity (CDS). By Remark 4.3, when S is a PSD kernel K, S 0, S = S+, Ssim ij reduces to the following discriminative similarity for PSD kernels: SK ij = 2(αi + αj λαiαj)K(xi xj), 1 i, j n, (8) Published as a conference paper at ICLR 2022 and SK ij is the similarity induced by the unsupervised kernel classifier by the kernel K. Without loss of generality, we set K = Kτ(x) = exp( x 2 2 2τ 2 ) which is the isotropic Gaussian kernel with kernel bandwidth τ > 0, and we omit the constant that makes integral of K unit. When setting the general similarity function to kernel Kτ, CDS aims to minimize the error bound for the corresponding unsupervised kernel classifier, which amounts to minimizing the following objective function min α Λ,Y={yi}n i=1 1 i 0 be a weighting parameter, then the cost function d ISE + λ1K(α), designed according to the empirical term ISE(br, r) and the regularization term K(α) in the ISE error bound (16), can be expressed as d ISE + λ1K(α) X 1 i 0, then with probability at least 1 δ over the data {xi}n i=1, the Rademacher complexity of the class HS satisfies R(HS) R(2c 1)c(B+ + B ) n + 2c(2c 1)(B+ + B )R2 δ 2n . (27) Proof of Lemma E.3 . According to Lemma 4.1, S is decomposed into two PSD kernels as S = S+ S . Therefore, the are two Reproducing Kernel Hilbert Spaces H+ S and H S that are associated with S+ and S respectively, and the canonical feature mappings in H+ S and H S are ϕ+ and ϕ , with S+(x, t) = ϕ+(x), ϕ+(t) H+ K and S (x, t) = ϕ (x), ϕ (t) H K. In the following text, we will omit the subscripts H+ K and H K without confusion. For any 1 y c, h S(x, y) = X i: yi=y αi S(x, xi) = w+, ϕ+(x) w , ϕ (x) with w+ 2 = α(y) S+α(y) B+2 and w 2 = α(y) S α(y) B 2. Therefore, HS,y HS,y = {(x, y) w+, ϕ+(x) w , ϕ (x) , w+ 2 B+2, w 2 B 2}, 1 y c, Published as a conference paper at ICLR 2022 and R(HS,y) R( HS,y). Since we are deriving upper bound for R(HS,y), we slightly abuse the notation and let HS,y represent HS,y in the remaining part of this proof. For x, t Rd and any h S HS,y, we have |h S(x) h S(t)| = | w+, ϕ+(x) w , ϕ (x) w+, ϕ+(t) + w , ϕ (t) | = | w+, ϕ+(x) ϕ+(t) + w , ϕ (t) ϕ (x) | B+ ϕ+(x) ϕ+(t) + B ϕ (x) ϕ (t) ) (B+ + B ) q S+(x, x) + S+(t, t) + 2 p S+(x, x)S+(t, t) 2R2(B+ + B ). 
We now approximate the Rademacher complexity of the function class HS,y with its empirical version b R(HS,y) using the sample {xi}. For each 1 y c, Define E(y) {xi} = b R(HS,y) = suph S( ,y) HS,y 1 i=1 σih S(xi, y) , then c P y=1 R(HS,y) = E{xi} h c P y=1 E(y) {xi} i , and sup x1,...,xn,x t E(y) x1,...,xt 1,xt,xt+1,...,xn E(y) x1,...,xt 1,x t,xt+1,...,xn = sup x1,...,xn,x t sup h S( ,y) HS,y i=1 σih S(xi, y) sup h S( ,y) HS,y i =t σih S(xi, y) + h S(x t, y) n sup x1,...,xn,x t E{σi} " sup h S( ,y) HS,y i=1 σih S(xi, y) sup h S( ,y) HS,y i =t σih S(xi, y) + h S(x t, y) n sup x1,...,xn,x t E{σi} sup h S( ,y) HS,y i=1 σih S(xi, y) 1 i =t σih S(xi, y) + h S(x t, y) n sup x1,...,xn,x t E{σi} sup h S( ,y) HS,y i=1 σih S(xi, y) 1 i =t σih S(xi, y) + h S(x t, y) n = sup xt,x t E{σi} sup h S( ,y) HS,y n h S(x t, y) n 2R2(B+ + B ) It follows that c P y=1 E(y) x1,...,xt 1,xt,xt+1,...,xn c P y=1 E(y) x1,...,xt 1,x t,xt+1,...,xn 2R2(B++B )c cording to the Mc Diarmid s Inequality, y=1 b R(HS,y) y=1 R(HS,y) ε i 2 exp nε2 2(B+ + B )2R4c2 . (28) Published as a conference paper at ICLR 2022 Now we derive the upper bound for the empirical Rademacher complexity: y=1 b R(HS,y) = sup h S HS,y j=1 σih S(xi) sup w+ B+, w B i=1 σi w+, ϕ+(xi) w , ϕ (xi) i=1 σiϕ+(xi) + B i=1 σiϕ (xi) i=1 σiϕ+(xi) i=1 σiϕ (xi) v u u t E{σi} i=1 σiϕ+(xi) 2 # v u u t E{σi} i=1 σiϕ (xi) 2 # i s+(xi, xi) + B c i s (xi, xi) Rc n(B+ + B ). By Lemma E.2, (28) and (29), with probability at least 1 δ, we have R(HS) (2c 1) y=1 R(HS,y) R(2c 1)c(B+ + B ) n + 2c(2c 1)(B+ + B )R2 Proof of Theorem 4.2 . According to Theorem 2 in Koltchinskii & Panchenko (2002), with probability 1 δ over the labeled data S with respect to any distribution in P, the generalization error of the kernel classifier f S satisfies R(f S) b Rn(f S) + 8 where b Rn(f S) = 1 i=1 Φ( mh S (xi,yi) γ ) is empirical error of the classifier for γ > 0. Due to the facts that mh S(x, y) = h S(xi, yi) argmaxy =yih S(xi, y ), α is a positive vector and S is nonnegative, we have mh S(x, y) h S(xi, yi) P y =yi h S(xi, y). Note that Φ( ) is a non-increasing function, it follows that Φ(mh S(xi, yi) γ ) Φ h S(xi, yi) P y =yi h S(xi, y) Applying Lemma E.3, (4) holds with probability 1 δ. When γ 1, it can be verified that h S(xi, yi) P y =yi h S(xi, y) c P y=1 h S(xi, y) n P i=1 αi = 1 γ for all (xi, yi), so that Φ h S(xi, yi) P y =yi h S(xi, y) h(xi, yi) P y =yi h(xi, y) Published as a conference paper at ICLR 2022 Note that when h S(xi, yi) P y =yi h S(xi, y) 0, then 1 h S(xi, yi) P y =yi h S(xi, y) that Φ h S(xi,yi) P y =yi h S(xi,y) y =yi h(xi,y) γ . If h S(xi, yi) P y =yi h S(xi, y) < 0, we have Φ h S(xi,yi) P y =yi h S(xi,y) y =yi h(xi,y) γ . Therefore, (33) always holds. By the definition of b Rn(f S), (32), and (33), (5) is obtained. Remark E.4. It can be verified that the image of the similarity function S in Lemma 4.1 and Theorem 4.2 can be generalized from [0, 1] to [0, a] for any a R, a > 0 with the condition γ 1 replaced by γ a. This is because LS is a compact operator for continuous similarity function S : X X [0, a], and h S(xi, yi) P y =yi h S(xi, y) c P y=1 h S(xi, y) a. Furthermore, given a symmetric and continuous function S : X X [c, d], c, d R, c < d, we can obtain a symmetric and continuous function S : X X [0, 1] by setting S = S a b a , and then apply all the theoretical results of this paper to CDS with S being the similarity function for the similarity-based classifier. Proof of Lemma E.2. 
Inspired by Koltchinskii & Panchenko (2002), we first prove that the Rademacher complexity of the function class formed by the maximum of several hypotheses is bounded by two times the sum of the Rademacher complexity of the function classes that these hypothesis belong to. That is, y=1 R(HS,y), (34) where Hmax = {max{h1, . . . , hk}: hy HS,y, 1 y k} for 1 k c 1. If no confusion arises, the notations ({σi}, {xi, yi}) are omitted in the subscript of the expectation operator in the following text, i.e., E{σi},{xi,yi} is abbreviated to E. According to Theorem 11 of Koltchinskii & Panchenko (2002), it can be verified that E{σi},{xi,yi} i=1 σih S(xi) y=1 E{σi},{xi,yi} i=1 σih S(xi) R(Hmax) = E{σi},{xi,yi} i=1 σih S(xi) E{σi},{xi,yi} " sup h Hmax i=1 σih S(xi) + # +E{σi},{xi,yi} " sup h Hmax 1 i=1 σih S(xi) + # = 2E{σi},{xi,yi} " sup h Hmax i=1 σih S(xi) + # y=1 E{σi},{xi,yi} " sup h HS,y i=1 σih S(xi) + # Published as a conference paper at ICLR 2022 y=1 E{σi},{xi,yi} i=1 σih S(xi) y=1 R(HS,y). (35) The equality in the third line of (35) is due to the fact that σi has the same distribution as σi. Using this fact again, (34), we have R(HS) = E{σi},{xi,yi} sup mh S HS i=1 σimh S(xi, yi) = E{σi},{xi,yi} sup mh S HS y=1 mh S(xi, y)1Iy=yi y=1 E{σi},{xi,yi} sup mh S HS i=1 σimh S(xi, y)1Iy=yi y=1 E{σi},{xi,yi} sup mh S HS i=1 σimh S(xi, y)(21Iy=yi 1) y=1 E{σi},{xi} sup mh S HS i=1 σimh S(xi, y) y=1 E{σi},{xi} sup mh S HS i=1 σimh S(xi, y) Also, for any given 1 y c, 1 n E{σi},{xi} sup mh S HS i=1 σimh S(xi, y) n E{σi},{xi} sup h S( ,y) HS,y,y=1...c i=1 σih S(xi, y) σiargmaxy =yh S(xi, y ) n E{σi},{xi} sup h S( ,y) HS,y i=1 σih S(xi, y) n E{σi},{xi} sup h S( ,y ) H S,y,y =y i=1 σiargmaxy =yh S(xi, y ) n E{σi},{xi} sup h S( ,y) HS,y i=1 σih S(xi, y) y =y E{σi},{xi} sup h S( ,y ) H S,y i=1 σih S(xi, y ) Combining (36) and (37), 1 n E{σi},{xi} sup h S( ,y) HS,y i=1 σih S(xi, y) y =y E{σi},{xi} sup h S( ,y ) H S,y i=1 σih S(xi, y ) y=1 E{σi},{xi} sup h S( ,y) HS,y i=1 σih S(xi, y) y=1 R(HS,y). (38) Published as a conference paper at ICLR 2022 E.3 PROOF OF THEOREM A.1 Proof. According to definition of ISE, ISE(br, r) = Z Rd (br r)2dx = Z Rd br(x, α)2dx 2 Z Rd br(x, α)r(x)dx + Z Rd r(x)2dx. (39) For a given distribution, R Rd r(x)2dx is a constant. By Gaussian convolution theorem, Rd br(x, α)2dx = τ1 y=1 α(y) (K 2h)α(y) τ1 X 1 i