Multi-View Randomized Kernel Classification via Nonconvex Optimization

Xiaojian Ding, Fan Yang
College of Information Engineering, Nanjing University of Finance and Economics, Nanjing 210023, China
{wjswsl,nufe yf}@163.com
Fan Yang is the corresponding author.

Abstract

Multi-kernel learning (MKL) is a representative supervised multi-view learning method widely applied in multi-modal and multi-view applications. MKL aims to classify data by integrating complementary information from predefined kernels. Although existing MKL methods achieve promising performance, they fail to consider the tradeoff between the diversity and the classification accuracy of kernels, which prevents further improvement of classification performance. In this paper, we tackle this problem by generating a number of high-quality base learning kernels and selecting a kernel subset with maximum pairwise diversity and minimum generalization error. We first formulate this idea as a nonconvex quadratic integer programming problem. We then transform this nonconvex problem into a convex optimization problem and show that it is equivalent to a semidefinite relaxation problem, which a semidefinite-based branch-and-bound algorithm can solve quickly. Experimental results on real-world datasets demonstrate the superiority of the proposed method. The results also show that our method works for the support vector machine (SVM) classifier as well as other state-of-the-art kernel classifiers.

Introduction

Numerous methods for learning from multi-view data by considering the diversity of different views have been developed recently. These views can come from multiple sources or modalities. Multiple kernel learning (MKL) is a representative supervised multi-view learning method widely applied in multi-modal and multi-view applications (Arabacı et al. 2021; Zhang et al. 2020; Peng et al. 2019; Wu et al. 2020; Jiang et al. 2022; Shah et al. 2021; Yang et al. 2020). MKL was initially proposed to regulate the capacity of the search space of potential kernel matrices in order to achieve better generalization (Gönen 2013). However, it has been extensively utilized in multi-view data scenarios because the kernels in MKL inherently correspond to various sources or views.

For MKL analysis, learning an optimal linear or nonlinear combination of a predefined set of kernels is the foundation for exploiting the complementary information within them and improving learning performance. Some approaches combine these kernels linearly, using fixed-rule, heuristic, or optimization methods. Fixed-rule methods learn the kernel as a weighted linear combination of the available kernels without training (Ben-Hur and Noble 2005). Heuristic methods find the combination parameters through heuristic measures such as conditional class probabilities calculated from the kernel matrices or the performance values obtained by training with each kernel separately (Aiolli and Donini 2015; Fan et al. 2017; Liu et al. 2020). Optimization methods usually learn the kernel combination parameters together with the parameters of the base learner (e.g., the parameter C in SVM) by solving an optimization problem (Han et al. 2018; Wang, Lu, and Zhang 2018; Chamakura and Saha 2022).
Some other approaches design a nonlinear combination method that uses nonlinear functions of the kernels, such as multiplication, power, and exponentiation (Gu et al. 2016; Song et al. 2018). Although the algorithms mentioned above achieve promising performance, they fail to consider the tradeoff between the diversity and the classification accuracy of kernels, which prevents further improvement of classification performance. Diversity and accuracy have been considered two essential characteristics of kernel combinations (Xia and Hoi 2013; Liu et al. 2016). Ko et al. (Ko, Sabourin, and de Souza Britto Jr 2009) studied the relationship between accuracy and diversity and established theoretically and empirically that the two are indeed correlated. These studies assert that, to obtain a good combination, the combined kernels should be as diverse as possible without sacrificing accuracy.

In previous studies on MKL, various approaches have been used to select base learning kernels. Some approaches (Han et al. 2018; Shen et al. 2021) constructed predefined kernels from commonly used kernel functions (e.g., linear, Gaussian, and polynomial) with different parameters. Chamakura and Saha (Chamakura and Saha 2022) used the linear kernel, the intersection kernel, and the chi-squared kernel as base learning kernels. Ding et al. (Ding, Tang, and Guo 2020) used the Gaussian Interaction Profile (GIP) kernel of the network as a base learning kernel. Yu et al. (Yu, Gong, and Jiang 2020) generated multiple base kernels from a specific kernel set, including the Hadamard, RBF, and linear kernels. To summarize, predefined kernels can be different kernels or the same kernel with different parameters. Since a kernel learning algorithm (e.g., SVM) is sensitive to kernel parameters, usually only a few parameter candidates perform well. Consequently, considering only the diversity between kernels cannot ensure good performance. In addition, quantifying the diversity of kernels is difficult, as there is no formal definition. For this reason, few MKL approaches have concentrated on the diversity and/or the accuracy of the combined kernels.

The recent work in (Ding et al. 2021a,b) proposed a randomized kernel and established theoretically and empirically that SVM with the randomized kernel is not sensitive to kernel parameters. The key idea of the randomized kernel is to assign a random parameter to each input dimension (or feature), so that each input dimension is allowed to have a different parameter value. Randomizing the kernel parameters many times yields a pool of randomized kernels while retaining their performance. The randomized kernel was proved to satisfy Mercer's theorem. Motivated by this work, this paper proposes a new MKL method named RMKL that optimizes kernel weights over combinations of randomized kernels. We first introduce a kernel quality measure that considers the kernels' diversity and generalization errors. By maximizing the pairwise diversity and minimizing the generalization errors of the kernels, we formulate the optimization problem of RMKL as a nonconvex quadratic integer programming problem. We transform this nonconvex problem into a convex optimization problem and show that it is equivalent to a semidefinite programming (SDP) relaxation problem. Further, we develop a branch-and-bound algorithm to solve the proposed quadratic integer programming problem.
Experimental results on real-world datasets demonstrate the superiority of the proposed method. To sum up, our main contributions are:
- We propose a new MKL method that optimizes the kernel weights by considering both the kernels' diversity and their performance, and we formulate it as a nonconvex quadratic integer programming problem.
- We transform this nonconvex problem into a convex optimization problem and develop an efficient solution.
- We validate the effectiveness of the proposed method on real-world multi-view datasets of different types.

Preliminaries

We consider the classification learning problem on data $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ belongs to some input space $\mathcal{X}$, and $y = (y_1, \ldots, y_N)^T \in \{1, 2, \ldots, c\}^N$ denotes the class labels of the samples $x_i$.

Multiple Kernel Learning

A classifier is a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ that labels each sample $x \in \mathcal{X}$ with some $y \in \mathcal{Y}$. We denote by $\{k_l(x, z): \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R},\ l = 1, \ldots, m\}$ the set of $m$ basis kernels to be combined. For each kernel function $k_l(\cdot, \cdot)$, denote by $K_l = [k_l(x_i, x_j)]_{N \times N}$ the associated kernel matrix. We denote by $\eta = (\eta_1, \ldots, \eta_m)^T \in \mathbb{R}^m_+$ the vector of kernel weights used to combine the basis kernels, and denote by $k(x, z; \eta) = \sum_{l=1}^{m} \eta_l k_l(x, z)$ and $K(\eta) = \sum_{l=1}^{m} \eta_l K_l$ the combined kernel function and kernel matrix, respectively. The MKL approach can be viewed as the search for a kernel $k(x, z; \eta)$ whose reproducing kernel Hilbert space (RKHS) $\mathcal{H}_\eta$ has good generalization properties. For a binary classification problem, we learn the optimal combination of kernels by solving the following optimization problem:

$$\min_{f \in \mathcal{H}_\eta,\, b,\, \xi,\, \eta \in \varrho} \ \frac{1}{2}\|f\|_{\mathcal{H}_\eta}^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.}\ \ y_i(f(x_i) + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ \forall i, \qquad (1)$$

where $\varrho$ is a domain for the kernel weight vector $\eta$, $b \in \mathbb{R}$ is the bias term, and $\xi_i$ are nonnegative slack variables. The main task of MKL is to learn the kernel weights $\eta$. Different strategies differ in the restrictions they place on $\eta$: the linear combination ($\eta \in \mathbb{R}^m$), the conic combination ($\eta \in \mathbb{R}^m_+$), or the convex combination ($\eta \in \mathbb{R}^m_+$ and $\sum_{l=1}^{m} \eta_l = 1$). A more general practice is to restrict $\eta$ to a probability distribution, leading to the following definition of the domain $\varrho$:

$$\varrho = \left\{ \eta \in \mathbb{R}^m_+ : \|\eta\|_1 = 1 \right\}. \qquad (2)$$

Using the domain $\varrho$ defined in (2), one can obtain a sparse solution for the kernel combination weights, eliminating irrelevant kernels.

Randomized Kernel

For an SVM classifier, the Gaussian kernel is the most commonly used kernel function:

$$k_{\mathrm{Gau}}(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right), \qquad (3)$$

where $\sigma$ is the kernel parameter. The standard Gaussian kernel has a single kernel width $\sigma$ that controls the spread of the kernel. This kernel width is invariant across regions of the input space, which means all features share the same width. However, different regions of the input space have different data characteristics, so more kernel parameters should be introduced to ensure that each feature is associated with its own kernel width. To this end, the work in (Ding et al. 2021b) proposed a randomized kernel, which can be expressed as

$$k_{\mathrm{RCG}}(x, z) = \exp\left(-\sum_{t=1}^{d} \frac{(x_t - z_t)^2}{2\sigma_t^2}\right), \qquad (4)$$

where $\sigma_1, \ldots, \sigma_d$ are kernel parameters. In the randomized kernel, all kernel parameters are randomly assigned from a uniform distribution. SVM with a randomized kernel is not sensitive to the kernel parameters, and repeated runs can achieve good performance. In addition, a randomized kernel with a $d$-dimensional parameter vector can capture the data characteristics in different regions of the input space. Therefore, different randomized kernels may provide complementary information for investigating the data distribution.
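As a concrete illustration, the following NumPy sketch evaluates a randomized Gaussian (RCG) kernel matrix of the per-dimension form in Eq. (4) and draws a pool of such kernels by sampling the widths uniformly at random. The function names and the sampling interval [0.1, 10] are illustrative assumptions rather than values prescribed by the paper.

```python
import numpy as np

def randomized_gaussian_kernel(X, Z, sigma):
    """Randomized Gaussian (RCG) kernel matrix in the spirit of Eq. (4).

    X: (n, d) array, Z: (p, d) array, sigma: (d,) array of per-dimension widths.
    Returns the (n, p) matrix with entries exp(-sum_t (x_t - z_t)^2 / (2 sigma_t^2)).
    """
    # Scale each dimension by its own width, then compute pairwise squared distances.
    Xs = X / (np.sqrt(2.0) * sigma)
    Zs = Z / (np.sqrt(2.0) * sigma)
    sq_dists = (
        np.sum(Xs**2, axis=1)[:, None]
        + np.sum(Zs**2, axis=1)[None, :]
        - 2.0 * Xs @ Zs.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0))

def sample_kernel_pool(X, M, low=0.1, high=10.0, seed=0):
    """Draw M randomized kernels by sampling per-dimension widths uniformly.

    The interval [low, high] is an illustrative choice, not one fixed by the paper.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    sigmas = [rng.uniform(low, high, size=d) for _ in range(M)]
    kernels = [randomized_gaussian_kernel(X, X, s) for s in sigmas]
    return sigmas, kernels
```

Repeated calls with different seeds correspond to the repeated random assignment of kernel parameters that produces the kernel pool used in the next section.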
The Proposed Method

Measuring the Quality of Kernels

By performing the random assignment of kernel parameters $M$ times, a pool of $M$ randomized kernels is obtained, denoted as

$$\mathcal{M} = \{k_1, \ldots, k_M\}. \qquad (5)$$

In ensemble classification, the accuracy-diversity tradeoff has been well studied (Aksela and Laaksonen 2006), and it is argued that accuracy should not be sacrificed for diversity. Motivated by this work, we select a subset of kernels from the set $\mathcal{M}$ while maximizing their diversity and minimizing their generalization error. Two sub-problems are involved: measuring the randomized kernels' diversity and measuring their performance.

The diversity of kernels is generally accepted as a necessary condition for combining them. However, quantifying the diversity of classifiers is difficult, as there is no general agreement on how to quantify diversity among a set of kernels. Giacinto and Roli (Giacinto and Roli 2001) proposed a double-fault measure of the diversity between two basis classifiers. The intuition behind this measure is that two diverse classifiers perform differently on the same dataset. Similarly, the double-fault measure can quantify the diversity between a pair of basis kernels. Given two basis kernels $k_i$ and $k_j$, let $N^{ab}$ be the number of samples on which the predictive outputs of the classifiers with kernels $k_i$ and $k_j$ are $a$ and $b$, respectively. The diversity between the two basis kernels is measured by

$$D(k_i, k_j) = \frac{N^{ab} + N^{ba}}{N^{aa} + N^{ba} + N^{ab} + N^{bb}}. \qquad (6)$$

To measure the performance of the basis kernels, we train a classifier with each basis kernel on the validation set and calculate its average k-fold cross-validation error rate. Specifically, we denote the average error rate of kernel $k_i$ by $E(k_i)$.

Optimizing the Kernel Weights

To obtain a good combination of the basis kernels in the set $\mathcal{M}$, the selected kernels should be as accurate and diverse as possible. We therefore propose an optimization objective that selects a subset of kernels from the pool of candidates by minimizing an objective function that incorporates accuracy and diversity within a single measure. Introducing binary kernel weights, the kernel subset selection problem can be formulated as

$$\min_{\eta}\ \sum_{i=1}^{M} \sum_{j \ne i} \frac{\eta_i \eta_j}{D(k_i, k_j)} + \sum_{i=1}^{M} \eta_i E(k_i) \quad \text{s.t.}\ \ \eta_i \in \{0, 1\},\ i = 1, \ldots, M, \qquad (7)$$

where $\eta_i = 1$ means that the kernel $k_i$ is selected in the kernel subset. Once we obtain the kernel weights $\eta$, we consider a convex linear combination of the basis kernels:

$$k(x_i, x_j) = \frac{\sum_{l=1}^{M} \eta_l k_l(x_i, x_j)}{\sum_{l=1}^{M} \eta_l}. \qquad (8)$$

A direct way to solve (7) is to traverse the whole set $\mathcal{M}$ to find the best combination of basis kernels, which leads to an impractical computation cost. To solve it efficiently, we employ a matrix $Q$ to store the pairwise diversity terms of the kernels in the set $\mathcal{M}$. Let the off-diagonal entry $Q_{ij}$ be the inverse of the pairwise diversity of kernels $k_i$ and $k_j$, that is, $Q_{ij} = \frac{1}{D(k_i, k_j)}$, and let all diagonal entries $Q_{ii}$ be 0. In addition, we use a vector $r$ to store the accuracy terms, where $r_i = E(k_i)$. By introducing a free parameter $m$, we transform problem (7) into the following quadratic-(0,1) integer programming problem:

$$\min_{\eta}\ p(\eta) = \eta^T Q \eta + r^T \eta \quad \text{s.t.}\ \ \sum_{i=1}^{M} \eta_i = m,\ \ \eta_i \in \{0, 1\},\ i = 1, \ldots, M, \qquad (9)$$

where $Q$ is a symmetric $M \times M$ matrix, $r$ is a real $M$-vector, and $m \in \mathbb{R}$. Here, the parameter $m$ denotes the total number of selected kernels. Semidefinite programming is well known to provide powerful relaxations for quadratic-(0,1) integer programming problems.
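Before turning to the relaxation, the following NumPy sketch shows how $Q$, $r$, and the objective of (9) can be assembled from validation-set predictions. It interprets the outcomes $a$ and $b$ in Eq. (6) as correct/incorrect labels (the usual disagreement-style reading, which is an assumption), and it uses brute-force subset enumeration only as a sanity check for small $M$; the paper instead solves (9) with the SDP-based branch-and-bound described next. All function names are illustrative.

```python
import itertools
import numpy as np

def pairwise_diversity(pred_i, pred_j, y):
    """Diversity of Eq. (6) between two base kernels, computed from the
    predictions of their classifiers on the validation set; a/b are read
    as correct/incorrect outcomes."""
    agree_i = pred_i == y
    agree_j = pred_j == y
    n_ab = np.sum(agree_i & ~agree_j)   # kernel i correct, kernel j wrong
    n_ba = np.sum(~agree_i & agree_j)   # kernel i wrong, kernel j correct
    return (n_ab + n_ba) / len(y)       # denominator = all four counts combined

def build_Q_r(preds, errors, y, eps=1e-8):
    """Assemble Q (inverse pairwise diversities, zero diagonal) and r
    (cross-validation error rates) for the quadratic 0-1 program (9).
    `eps` guards division by zero and is an implementation detail only."""
    M = len(preds)
    Q = np.zeros((M, M))
    for i in range(M):
        for j in range(i + 1, M):
            d = pairwise_diversity(preds[i], preds[j], y)
            Q[i, j] = Q[j, i] = 1.0 / max(d, eps)
    r = np.asarray(errors, dtype=float)
    return Q, r

def brute_force_select(Q, r, m):
    """Exhaustively minimize p(eta) = eta^T Q eta + r^T eta over all subsets
    of size m. Feasible only for small M; shown purely as a reference point."""
    M = len(r)
    best_val, best_eta = np.inf, None
    for subset in itertools.combinations(range(M), m):
        eta = np.zeros(M)
        eta[list(subset)] = 1.0
        val = eta @ Q @ eta + r @ eta
        if val < best_val:
            best_val, best_eta = val, eta
    return best_eta, best_val
```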
As a relaxation of the boolean quadric polytope of (9), we model the dyadic product $\eta \eta^T$ by a matrix variable $X$ and denote the diagonal of this matrix by $D$. We then define the SDP relaxation of problem (9):

$$\min_{X, D}\ \ r^T D + \sum_{i=1}^{M} \sum_{j=1}^{M} q_{ij} X_{ij} \quad \text{s.t.}\ \ X_{ii} = D_i,\ i = 1, \ldots, M, \quad \sum_{j=1}^{M} X_{ij} = m D_i,\ i = 1, \ldots, M, \quad \begin{pmatrix} 1 & D^T \\ D & X \end{pmatrix} \succeq 0, \qquad (10)$$

where $X \in \mathcal{S}^M$, $e$ denotes the $M$-vector of all ones, and $q_{ij}$ denotes the general term of the matrix $Q$.

Solving the Problem

It is known that the integer programming problem (9) is NP-complete (Hartmanis 1982). Note that the objective function in (9) is not convex due to the diagonal terms of $Q$. By introducing two parameter vectors $u \in \mathbb{R}^M$ and $v \in \mathbb{R}^M$, we can transform this nonconvex problem into the following convex optimization problem:

$$\min_{\eta}\ p'(\eta) \quad \text{s.t.}\ \ \sum_{i=1}^{M} \eta_i = m,\ \ \eta_i \in \{0, 1\},\ i = 1, \ldots, M, \qquad (11)$$

where

$$p'(\eta) = \eta^T Q \eta + r^T \eta + \sum_{i=1}^{M} u_i \eta_i (e^T \eta - m) + \sum_{i=1}^{M} v_i (\eta_i^2 - \eta_i) = \eta^T Q_1 \eta + r_1^T \eta + \sum_{i=1}^{M} v_i (\eta_i^2 - \eta_i) = \eta^T Q_2 \eta + r_2^T \eta,$$

and $Q_1 = Q + \frac{1}{2}\left(u e^T + e u^T\right)$, $Q_2 = Q_1 + \mathrm{Diag}(v)$, $r_1 = r - m u$, $r_2 = r_1 - v$. Here, $\mathrm{Diag}(v)$ denotes a diagonal $M \times M$ matrix with the elements of $v$ on the diagonal.

To solve (11), we can use a Branch-and-Bound (B&B) algorithm based on the relaxation of the binary constraints to $\eta \in [0, 1]^M$, and we denote by $S := \{\eta : e^T \eta = m,\ \eta \in [0, 1]^M\}$ the set of feasible solutions of the continuous relaxation of (9). For every feasible solution $\eta$ of (9), it is easy to see that the objective function $p'(\eta)$ equals $p(\eta)$. The main idea of the B&B algorithm is to divide the problem into smaller subproblems by branching on the possible values of the variables, to compute a lower bound of the objective function over a specific domain, and to find approximate solutions that converge to the optimal solution. For problem (11), the bound can be obtained by finding the optimal value of the continuous relaxation. To this end, we need to choose the parameters $u$ and $v$ in $p'(\eta)$ so as to obtain a tight bound. Specifically, we solve the following problem:

$$\max_{u, v}\ \min_{\eta \in S}\ p'(\eta) \quad \text{s.t.}\ \ Q_2 \succeq 0. \qquad (12)$$

Theorem 1. (Lemaréchal and Oustry 1999) The optimal value of problem (12) is equal to that of the SDP relaxation (10).

For problem (12), the optimal values $u_i^*$ are given by the optimal dual variables associated with the constraints $X_{ii} = D_i$, $i = 1, \ldots, M$, of (10), and the optimal values $v_i^*$ are given by the optimal dual variables associated with the constraints $\sum_{j=1}^{M} X_{ij} = m D_i$, $i = 1, \ldots, M$, of (10).

The Branch-and-Bound Method

We present an exact method for solving problem (9) by designing a B&B framework that uses the relaxation (10) discussed above to obtain lower bounds. Global optimization methods based on B&B techniques hinge on two pivotal procedures for calculating upper and lower bounds on the global optimum of a nonconvex problem. For a minimization problem, an upper bound can be derived from any point within the feasible set; typically, a local search method is employed to refine this upper bound. Conversely, a lower bound can be determined through a convex relaxation of the original nonconvex problem.

Algorithm 1: SDP-based Branch-and-Bound algorithm
Input: The minimization problem (9)
Initialization: Global upper/lower bounds GUB = +∞, GLB = −∞, priority queue L
1: while L ≠ ∅ do
2:   Perform subproblem selection: O = sel(L).
3:   Perform bounding: [UB, LB, η, X] = bound(O).
4:   Update global bounds:
5:   GLB = min(LB, min_{O' ∈ L} LB(O')).
6: if GUB