# Sample-Adaptive Multiple Kernel Learning

Xinwang Liu, School of Computer, National University of Defense Technology, Changsha, China, 410073
Lei Wang, School of Computer Science and Software Engineering, University of Wollongong, NSW, Australia, 2522
Jian Zhang, Faculty of Engineering and Information Technology, University of Technology, Sydney, NSW, Australia, 2007
Jianping Yin, School of Computer, National University of Defense Technology, Changsha, China, 410073

Abstract

Existing multiple kernel learning (MKL) algorithms indiscriminately apply the same set of kernel combination weights to all samples. However, the utility of base kernels can vary across samples, and a base kernel that is useful for one sample may become noisy for another. In this case, rigidly applying the same set of kernel combination weights could adversely affect the learning performance. To improve this situation, we propose a sample-adaptive MKL algorithm in which base kernels are allowed to be adaptively switched on/off with respect to each sample. We achieve this goal by assigning a latent binary variable to each base kernel when it is applied to a sample. The kernel combination weights and the latent variables are jointly optimized via the margin maximization principle. As demonstrated on five benchmark data sets, the proposed algorithm consistently outperforms the comparable ones in the literature.

Introduction

Kernel methods such as support vector machines (SVMs) have been an active research topic in the past decade (Schölkopf and Smola 2002; Shawe-Taylor and Cristianini 2004; Cortes, Mohri, and Rostamizadeh 2012). As is well known, effectively learning an optimal kernel is of great importance to the success of kernel methods. Along this line of research, many pioneering kernel learning algorithms have been proposed (Chapelle et al. 2002; Lanckriet et al. 2004; Bach, Lanckriet, and Jordan 2004; Liu et al. 2013a). In particular, multiple kernel learning (MKL) has attracted much attention (Sonnenburg et al. 2006; Rakotomamonjy et al. 2008; Xu et al. 2010; Gönen 2012; Gai, Chen, and Zhang 2010; Kloft et al. 2011; Cortes, Mohri, and Rostamizadeh 2012; Yan et al. 2012; Yang et al. 2012; Hinrichs et al. 2012; Kumar et al. 2012; Cortes, Mohri, and Rostamizadeh 2013; Liu et al. 2012; 2013b). It not only provides an efficient way to learn an optimal kernel, but also builds an elegant framework for integrating multiple heterogeneous data sources. The existing research on MKL has made significant contributions in two aspects: speeding up computation (Sonnenburg et al. 2006; Rakotomamonjy et al. 2008; Xu et al. 2010; Orabona and Luo 2011; Afkanpour et al. 2013) and improving classification performance (Gai, Chen, and Zhang 2010; Cortes, Mohri, and Rostamizadeh 2012; Kumar et al. 2012; Gönen 2012; Cortes, Mohri, and Rostamizadeh 2013). The first contribution makes MKL applicable to large-scale learning tasks, while the second helps MKL attain superior classification performance. By pre-specifying a group of base kernels, existing MKL algorithms learn the kernel combination weights based on a given training sample set. Usually, the same set of combination weights is indiscriminately applied to all samples.
Since the weight can be viewed as an indicator of the utility of a base kernel for classification, existing algorithms implicitly assume that this utility remains unchanged when classifying all the samples. Nevertheless, the utility of a base kernel could change with samples because: i) in MKL, each base kernel represents an evaluation of the pair-wise sample similarity, and this evaluation is not necessarily equally effective across all samples; ii) some input features of a sample could be contaminated by noise in practical applications, which can make a base kernel computed with these features unreliable for that specific sample; and iii) when the kernel value (sample similarity) of a base kernel is collected from human evaluation or laboratory tests, unexpected errors in this process could significantly corrupt the similarity of a sample to the others. All these cases can turn a base kernel that is useful for one sample into noise for another. Nevertheless, this issue has not been well addressed in the current MKL literature.

A straightforward solution to the above issue is to learn an individual set of kernel combination weights for every sample. However, this leads to an optimization problem with many (the number of base kernels times the number of samples) real-valued variables and results in an over-flexible kernel learning model. A sufficiently strong regularization has to be imposed on these combination weights, which could further complicate the optimization process (generally leading to a large-scale quadratically constrained quadratic program for SVM-based MKL). A possible remedy that reduces both the number of variables and the model flexibility is to use a parametric model to predict the kernel combination weights for each sample, as developed in the localized MKL (LMKL) algorithm (Gönen and Alpaydin 2008), where a gating model predicts the kernel combination weights locally. The framework of LMKL is elegant and is able to improve the performance of MKL. However, how to define an appropriate parametric model remains an issue. For example, the linear gating model in (Gönen and Alpaydin 2008) assumes that the kernel combination weights change smoothly in any local region. This assumption may not hold when base kernels are irregularly corrupted across different samples. In addition, when the input features of a sample are contaminated by noise, the kernel combination weights predicted by a parametric model will no longer be accurate.

To handle the variation of the utility of base kernels across samples, we would like a latent mechanism in MKL that can dynamically choose an appropriate subset of base kernels for each sample. Following this idea, we assign a latent binary vector to each individual sample to adaptively switch each base kernel on/off. More specifically, switching a base kernel off for a sample means that this kernel will not take part in evaluating the similarity of this sample to any other sample. This is equivalent to filling the corresponding row and column of that base kernel matrix with zeros. With this latent mechanism, we can still learn a single set of kernel combination weights shared by all samples, as before. This avoids an over-flexible learning model and the smoothness assumption required by a parametric model.
Also, as will be seen, allocating latent variables to switch the base kernels allows the optimization of the latent variables and the kernel combination weights to be well separated. As a result, the proposed algorithm still largely follows the existing MKL framework and can take advantage of existing efficient optimization algorithms. Although the number of latent variables is still as large as the number of base kernels multiplied by the number of samples, their optimization can be decomposed sample-wise. Each sub-problem in this decomposition is a 0/1 linear integer program which can be solved via off-the-shelf packages. In addition, the proposed algorithm works entirely at the kernel level and can therefore handle corruption and noise at both the base kernel level and the input sample level (since noise at the input sample level is ultimately reflected at the base kernel level). An experimental study is conducted on multiple MKL benchmark data sets. Compared with existing MKL algorithms, the proposed algorithm consistently achieves higher classification performance.

Background and notations

In the existing MKL literature, each sample is mapped onto a multiple-kernel-induced feature space and a linear classifier is learned in this space. The feature mapping used in MKL takes the form of $\phi(\cdot) = [\phi_1(\cdot)^\top, \phi_2(\cdot)^\top, \dots, \phi_m(\cdot)^\top]^\top$, whose components are induced by $m$ pre-defined base kernels $\{\kappa_p(\cdot,\cdot)\}_{p=1}^m$. MKL learns the classifier by maximizing the margin between classes via solving the following problem (Bach, Lanckriet, and Jordan 2004; Sonnenburg et al. 2006),

$$
\min_{\{\omega_p\}_{p=1}^m, b, \xi} \; \frac{1}{2}\Big(\sum_{p=1}^m \|\omega_p\|_{\mathcal{H}_p}\Big)^{2} + C \sum_{i=1}^n \xi_i
\quad \text{s.t.} \;\; y_i\Big(\sum_{p=1}^m \omega_p^\top \phi_p(x_i) + b\Big) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\forall i,
\tag{1}
$$

where $\{(x_i, y_i)\}_{i=1}^n$ denotes a collection of $n$ training samples and their labels, $\omega_p$ and $\mathcal{H}_p$ represent the normal and the feature space corresponding to the $p$-th base kernel, $\xi$ and $b$ are the slack variables and bias term, and $C$ is a regularization parameter. According to (Micchelli and Pontil 2005), the problem in Eq. (1) is proven to be equivalent to the one in Eq. (2),

$$
\min_{\{\omega_p\}_{p=1}^m, b, \xi, \gamma \in \Delta} \; \frac{1}{2}\sum_{p=1}^m \frac{\|\omega_p\|^2_{\mathcal{H}_p}}{\gamma_p} + C \sum_{i=1}^n \xi_i
\quad \text{s.t.} \;\; y_i\Big(\sum_{p=1}^m \omega_p^\top \phi_p(x_i) + b\Big) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\forall i,
\tag{2}
$$

where $\gamma_p$ is the combination weight of the $p$-th base kernel and $\Delta = \{\gamma : \sum_{p=1}^m \gamma_p = 1, \; \gamma_p \ge 0, \;\forall p\}$. Usually, $\{\omega_p\}_{p=1}^m$, $b$ and $\xi$ are obtained by solving the dual of the inner minimization problem. This leads to the min-max optimization problem in Eq. (3),

$$
\min_{\gamma \in \Delta} \max_{\alpha} \; -\frac{1}{2}(\alpha \circ y)^\top \Big(\sum_{p=1}^m \gamma_p K_p\Big)(\alpha \circ y) + \alpha^\top e
\quad \text{s.t.} \;\; \alpha^\top y = 0, \;\; 0 \le \alpha_i \le C, \;\forall i,
\tag{3}
$$

where $\alpha = [\alpha_1, \alpha_2, \dots, \alpha_n]^\top$ are the Lagrange multipliers, $y = [y_1, y_2, \dots, y_n]^\top$, $(\alpha \circ y)$ denotes the component-wise multiplication of $\alpha$ and $y$, $K_p$ is the $p$-th base kernel matrix and $e$ is a column vector with all elements equal to one. After obtaining the optimal $\alpha$, $b$ and $\gamma$, the decision score of a test sample $x$ is calculated via

$$
\sum_{i=1}^n \alpha_i y_i \sum_{p=1}^m \gamma_p K_p(x_i, x) + b.
\tag{4}
$$

As can be seen from Eq. (3) and Eq. (4), the same set of kernel combination weights $\gamma$ is applied to all samples in existing MKL algorithms. Following the above analysis, we improve this situation by allowing different samples to use different subsets of base kernels.
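To make the shared-weight formulation concrete, the following NumPy sketch combines precomputed base kernel matrices with a learned weight vector as in Eq. (3) and evaluates the decision scores of Eq. (4). It assumes the model (α, γ, b) has already been trained; the function names and array layouts are ours, chosen only for illustration.

```python
import numpy as np

def combine_kernels(base_kernels, gamma):
    """Weighted sum of base kernel matrices, sum_p gamma_p * K_p, as used in Eq. (3).

    base_kernels: array of shape (m, n, n); gamma: array of shape (m,).
    """
    return np.tensordot(gamma, base_kernels, axes=1)

def mkl_decision_scores(alpha, y, gamma, test_kernels, b):
    """Decision scores of Eq. (4): sum_i alpha_i y_i sum_p gamma_p K_p(x_i, x) + b.

    test_kernels: array of shape (m, n_train, n_test) holding K_p(x_i, x)
    for every training sample x_i and every test sample x.
    """
    combined = np.tensordot(gamma, test_kernels, axes=1)  # (n_train, n_test)
    return (alpha * y) @ combined + b
```

Note that the same gamma enters every column of the combined matrix, which is exactly the sample-independence that SAMKL relaxes.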
Sample-adaptive MKL (SAMKL)

Problem formulation

Noting that a base kernel induces a feature mapping, switching off the base kernel can be conceptually realized by switching off the mapping. Therefore, we define the mapping of a sample $x_i$ by introducing a latent variable for each base-kernel-induced feature mapping. Let $h_i = [h_{i1}, h_{i2}, \dots, h_{im}]^\top \in \{0, 1\}^m$ be the latent binary vector with respect to $x_i$. Specifically, $h_{ip} = 1$ ($p = 1, \dots, m$) means that the corresponding mapping is useful for the classification of $x_i$, while $h_{ip} = 0$ indicates that it is not. Let $1 \le r_1 < r_2 < \dots < r_l \le m$ be the indexes of the $l$ non-zero components of $h_i$. The mapping of a sample $x_i$ is defined as

$$
\phi(x_i; h_i) = [\phi_{r_1}(x_i)^\top, \phi_{r_2}(x_i)^\top, \dots, \phi_{r_l}(x_i)^\top]^\top.
\tag{5}
$$

With the above definition, different samples are mapped onto different sub-spaces of a common feature space, denoted by $[\phi_1(x)^\top, \phi_2(x)^\top, \dots, \phi_m(x)^\top]^\top$. This differs from existing MKL, where all samples are mapped onto the common feature space. We use the sample-based margin proposed in (Chechik et al. 2008) to measure the margin of each sample in its respective sub-space. Let $\omega = [\omega_1^\top, \dots, \omega_m^\top]^\top$ denote the normal in the common feature space, where $\omega_p$ is the component of $\omega$ corresponding to the mapping $\phi_p(\cdot)$ induced by the $p$-th base kernel. The sample-based margin of $x_i$ in the multi-kernel-induced feature space is defined as

$$
\rho(\phi(x_i; h_i)) \triangleq \frac{y_i \sum_{p=1}^m h_{ip}\, \omega_p^\top \phi_p(x_i)}{\sum_{p=1}^m h_{ip}\, \|\omega_p\|_{\mathcal{H}_p}},
\tag{6}
$$

where only the components of $\omega$ for which $h_{ip} = 1$ are involved in calculating the margin of $x_i$. When all base kernel mappings are useful, this definition reduces to the one used by existing MKL algorithms. Following the principle of margin maximization, we maximize the minimum of all sample-based margins to seek the optimal normals $\omega_1, \dots, \omega_m$ and all the latent variables $h_1, \dots, h_n$. Mathematically, the objective of our proposed MKL is expressed as in Eq. (7),

$$
\max_{\{\omega_p\}_{p=1}^m, \{h_i\}_{i=1}^n} \;\; \min_{i=1,\dots,n} \;\; \frac{y_i \sum_{p=1}^m h_{ip}\, \omega_p^\top \phi_p(x_i)}{\sum_{p=1}^m h_{ip}\, \|\omega_p\|_{\mathcal{H}_p}}.
\tag{7}
$$

The optimization problem in Eq. (7) is more difficult than the traditional margin-maximization problem in SVMs because: (1) both the numerator and denominator in the inner minimization vary across samples, while the denominator in the traditional SVM problem is shared by all samples; and (2) compared with the traditional problem, a group of additional latent binary variables has to be optimized in Eq. (7). In the following, we propose an efficient alternate optimization algorithm to solve this problem. It is worth pointing out that Eq. (7) is a prototype used to show the essential idea of our approach. We will further refine this model step by step by incorporating the slack variables, the bias term and the prior on the latent variables.

Optimization

We define an auxiliary variable $\tau_i = \frac{\sum_{p=1}^m h_{ip}\|\omega_p\|_{\mathcal{H}_p}}{\sum_{p=1}^m \|\omega_p\|_{\mathcal{H}_p}}$ ($1 \le i \le n$). By substituting $\tau = [\tau_1, \tau_2, \dots, \tau_n]^\top$ into Eq. (7), we obtain

$$
\max_{\{\omega_p\}_{p=1}^m, \{h_i\}_{i=1}^n} \;\; \min_{i=1,\dots,n} \;\; \frac{y_i}{\tau_i} \cdot \frac{\sum_{p=1}^m h_{ip}\, \omega_p^\top \phi_p(x_i)}{\sum_{p=1}^m \|\omega_p\|_{\mathcal{H}_p}}.
\tag{8}
$$

Following the derivation of the SVM objective in (Chechik et al. 2008), Eq. (8) can be further rewritten as the constrained optimization problem in Eq. (9),

$$
\max_{\{\omega_p\}_{p=1}^m, \{h_i\}_{i=1}^n} \;\; \frac{1}{\sum_{p=1}^m \|\omega_p\|_{\mathcal{H}_p}}
\quad \text{s.t.} \;\; \frac{y_i}{\tau_i} \sum_{p=1}^m h_{ip}\, \omega_p^\top \phi_p(x_i) \ge 1, \;\forall i.
\tag{9}
$$

Similar to the derivation in SVMs and according to (Micchelli and Pontil 2005), the problem in Eq. (9) is equivalent to the one in Eq. (10),

$$
\min_{\{\omega_p\}_{p=1}^m, \{h_i\}_{i=1}^n, \gamma \in \Delta} \;\; \frac{1}{2}\sum_{p=1}^m \frac{\|\omega_p\|^2_{\mathcal{H}_p}}{\gamma_p}
\quad \text{s.t.} \;\; \frac{y_i}{\tau_i} \sum_{p=1}^m h_{ip}\, \omega_p^\top \phi_p(x_i) \ge 1, \;\forall i.
\tag{10}
$$

After adding the slack variables $\xi = [\xi_1, \xi_2, \dots, \xi_n]^\top$ and the bias term $b$, we arrive at the following problem,

$$
\min_{\{\omega_p\}_{p=1}^m, \{h_i\}_{i=1}^n, \xi, b, \gamma \in \Delta} \;\; \frac{1}{2}\sum_{p=1}^m \frac{\|\omega_p\|^2_{\mathcal{H}_p}}{\gamma_p} + C\sum_{i=1}^n \xi_i
\quad \text{s.t.} \;\; \frac{y_i}{\tau_i}\Big(\sum_{p=1}^m h_{ip}\, \omega_p^\top \phi_p(x_i) + b\Big) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\forall i.
\tag{11}
$$

Besides maximizing the margin and minimizing the training errors, an $\ell_1$-norm regularization term should be imposed on $h_i$ ($1 \le i \le n$) so that the base kernels that are not helpful for the classification of $x_i$ are effectively eliminated. Moreover, in order to avoid over-fitting, a prior has to be imposed on $h_i$. In this work we enforce $h_i$ not to be far from a pre-specified $h_0$.
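As a quick illustration of the quantities just defined, the sketch below evaluates the sample-based margin of Eq. (6) and the auxiliary variable of Eq. (8) from per-kernel scores $\omega_p^\top \phi_p(x_i)$ and norms $\|\omega_p\|_{\mathcal{H}_p}$, which are assumed to be precomputed from a current model; the helper names are ours.

```python
import numpy as np

def sample_margin(y_i, h_i, per_kernel_scores_i, omega_norms):
    """Sample-based margin of Eq. (6) for one sample.

    per_kernel_scores_i[p] stands for w_p^T phi_p(x_i) and omega_norms[p]
    for ||w_p||_{H_p}; h_i is the binary switching vector of the sample.
    """
    numerator = y_i * np.dot(h_i, per_kernel_scores_i)
    denominator = np.dot(h_i, omega_norms)
    return numerator / denominator

def tau_i(h_i, omega_norms):
    """Auxiliary variable of Eq. (8): the fraction of the total norm kept by h_i."""
    return np.dot(h_i, omega_norms) / omega_norms.sum()
```

Setting h_i to the all-ones vector makes tau_i equal to one and recovers the margin used by standard MKL.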
With the $\ell_1$-norm regularization and this prior incorporated, we obtain the optimization problem of the proposed SAMKL as follows,

$$
\begin{aligned}
\min_{\{\omega_p\}, \{h_i\}, \xi, b, \gamma \in \Delta} \;\; & \frac{1}{2}\sum_{p=1}^m \frac{\|\omega_p\|^2_{\mathcal{H}_p}}{\gamma_p} + C\sum_{i=1}^n \xi_i + \tilde{C}\sum_{i=1}^n \|h_i\|_1 \\
\text{s.t.} \;\; & \frac{y_i}{\tau_i}\Big(\sum_{p=1}^m h_{ip}\, \omega_p^\top \phi_p(x_i) + b\Big) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\forall i, \\
& \|h_i - h_0\|_1 \le m_0, \;\; h_i \in \{0, 1\}^m, \;\forall i,
\end{aligned}
\tag{12}
$$

where $h_0$ is an initial estimate, which is also a binary vector. It can be either empirically set or obtained by solving an $\ell_1$-norm MKL (for example, Simple MKL (Rakotomamonjy et al. 2008)) and then setting $h_0$ according to the non-zero kernel weights. $m_0$ is a pre-defined parameter controlling the deviation of each $h_i$ from $h_0$, i.e., the number of bits in which they are allowed to differ.

As can be seen, jointly optimizing $\{h_i\}_{i=1}^n$ and $\{\omega_p\}_{p=1}^m, \xi, b, \gamma$ via Eq. (12) is difficult since it is a mixed-integer optimization problem. Instead, we propose an alternate optimization procedure to solve it. Specifically, at the $t$-th iteration, we first optimize $\{\omega_p\}_{p=1}^m, \xi, b, \gamma$ with fixed $\{h_i^{t-1}\}_{i=1}^n$ and $\tau^{t-1}$ by solving Eq. (11), whose dual problem is

$$
\begin{aligned}
\min_{\gamma \in \Delta} \max_{\alpha} \;\; & -\frac{1}{2}\sum_{i,j=1}^n \frac{\alpha_i \alpha_j y_i y_j}{\tau_i^{t-1}\tau_j^{t-1}} \sum_{p=1}^m \gamma_p\, h_{ip}^{t-1} h_{jp}^{t-1} K_p(x_i, x_j) + \sum_{i=1}^n \alpha_i \\
\text{s.t.} \;\; & \sum_{i=1}^n \frac{\alpha_i y_i}{\tau_i^{t-1}} = 0, \;\; 0 \le \alpha_i \le C, \;\forall i.
\end{aligned}
\tag{13}
$$

This optimization problem can be efficiently solved via existing MKL packages. Let $(\alpha^t, \gamma^t, b^t)$ denote the solution at the $t$-th iteration, obtained based on the values of $h$ and $\tau$ at the $(t-1)$-th iteration. After that, we update $h$ at the $t$-th iteration by solving the following integer linear program,

$$
\begin{aligned}
\min_{\{h_i\}_{i=1}^n} \;\; & \sum_{i=1}^n \|h_i\|_1 \\
\text{s.t.} \;\; & \frac{y_i}{\tau_i^{t-1}}\big(h_i^\top g_i^t + b^t\big) \ge 1 - \xi_i^t, \\
& h_i^\top (e - 2h_0) + h_0^\top e \le m_0, \;\; h_i \in \{0, 1\}^m, \;\forall i,
\end{aligned}
\tag{14}
$$

where $g_i^t = \big[\gamma_1^t \sum_{j=1}^n \frac{\alpha_j^t y_j}{\tau_j^{t-1}} h_{j1}^{t-1} K_1(x_j, x_i), \;\dots,\; \gamma_m^t \sum_{j=1}^n \frac{\alpha_j^t y_j}{\tau_j^{t-1}} h_{jm}^{t-1} K_m(x_j, x_i)\big]^\top$, $e$ is a column vector with all elements equal to one, and $\{h_i^{t-1}\}_{i=1}^n$ are the latent variables obtained at the $(t-1)$-th iteration. Note that the second linear constraint on $h$ follows from the identity $\|h_i - h_0\|_1 = h_i^\top (e - 2h_0) + h_0^\top e$ for binary variables.

Directly solving the optimization problem in Eq. (14) appears to be computationally intractable because its complexity is $O(2^{nm})$. However, since the constraints are defined separately on each $h_i$ and the objective is a sum over the $h_i$, the problem in Eq. (14) can be equivalently solved via $n$ independent sub-problems, as stated in Eq. (15),

$$
\min_{h_i \in \{0,1\}^m} \;\; \|h_i\|_1
\quad \text{s.t.} \;\; \frac{y_i}{\tau_i^{t-1}}\big(h_i^\top g_i^t + b^t\big) \ge 1 - \xi_i^t, \;\;
h_i^\top (e - 2h_0) + h_0^\top e \le m_0.
\tag{15}
$$

This reduces the total computational complexity to $O(n \cdot 2^m)$. The optimization problem in Eq. (15) is a linear integer program, which can be solved via off-the-shelf packages such as MOSEK (http://www.mosek.com/). After obtaining $\alpha^t$, $\gamma^t$ and $h^t$, the value of $\tau^t$ is computed as

$$
\tau_i^t = \frac{\sum_{p=1}^m h_{ip}^t \gamma_p^t \sqrt{\sum_{j,k=1}^n \frac{\alpha_j^t \alpha_k^t y_j y_k}{\tau_j^{t-1}\tau_k^{t-1}}\, h_{jp}^{t-1} h_{kp}^{t-1} K_p(x_j, x_k)}}{\sum_{p=1}^m \gamma_p^t \sqrt{\sum_{j,k=1}^n \frac{\alpha_j^t \alpha_k^t y_j y_k}{\tau_j^{t-1}\tau_k^{t-1}}\, h_{jp}^{t-1} h_{kp}^{t-1} K_p(x_j, x_k)}},
\tag{16}
$$

where $\tau_i^{t-1}$ ($1 \le i \le n$) was obtained in the previous iteration. Our algorithm for solving SAMKL is presented in Algorithm 1, where $\text{obj}^{t-1}$ and $\text{obj}^t$ denote the objective values at the $(t-1)$-th and $t$-th iterations, respectively.

Algorithm 1: Proposed Sample-adaptive MKL Algorithm
1: Input: $\{K_p\}_{p=1}^m$, $y$, $C$ and $m_0$.
2: Output: $\alpha$, $b$, $\gamma$ and $\{h_i\}_{i=1}^n$.
3: Initialize $h^0$ and set $t = 1$ and $\tau^0 = e$.
4: repeat
5:   Update $(\alpha^t, \gamma^t, b^t, \xi^t)$ by Eq. (13) with $(h^{t-1}, \tau^{t-1})$.
6:   for $i = 1$ to $n$ do
7:     Update $h_i^t$ with $\alpha^t, \gamma^t, b^t, \xi_i^t, \tau_i^{t-1}, h_i^{t-1}$ by Eq. (15).
8:   end for
9:   Update $\tau^t$ with $\alpha^t, \gamma^t, h^t, \tau^{t-1}, h^{t-1}$ by Eq. (16).
10:  $t = t + 1$.
11: until $|\text{obj}^{t-1} - \text{obj}^t| / \text{obj}^{t-1} \le 10^{-4}$
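For concreteness, the per-sample sub-problem of Eq. (15) can be handed to a generic mixed-integer solver. The sketch below uses SciPy's milp (available in SciPy 1.9 or later) instead of MOSEK, which the paper uses; the function and its fallback behaviour are our own illustrative choices, not part of the original algorithm.

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

def solve_switch_pattern(g_i, b, y_i, tau_i, xi_i, h0, m0):
    """Sketch of Eq. (15): minimize ||h||_1 over h in {0,1}^m, subject to the margin
    constraint (y_i / tau_i)(g_i^T h + b) >= 1 - xi_i and ||h - h0||_1 <= m0."""
    m = len(g_i)
    c = np.ones(m)                                    # objective: sum_p h_p

    # Margin constraint, written as a lower-bounded linear constraint on h.
    margin = LinearConstraint(((y_i / tau_i) * g_i)[np.newaxis, :],
                              1.0 - xi_i - (y_i / tau_i) * b, np.inf)

    # Hamming-distance constraint: h^T (e - 2 h0) + h0^T e <= m0.
    hamming = LinearConstraint((1.0 - 2.0 * h0)[np.newaxis, :],
                               -np.inf, m0 - h0.sum())

    res = milp(c, constraints=[margin, hamming],
               integrality=np.ones(m),                # all variables integer
               bounds=Bounds(0, 1))                   # binary via [0, 1] bounds
    if not res.success:                               # infeasible: keep the prior (a simplification)
        return h0.astype(int)
    return np.round(res.x).astype(int)
```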
Number of switching patterns h

We analyze the number of possible patterns of $h$ on a given training data set. It appears that this number could be as high as $\sum_{i=0}^{m_0} \binom{m}{i}$. However, considering that each training sample contributes at most one unique pattern, the number of possible patterns is actually capped by $n$, the number of training samples. Moreover, in practice this number is usually even smaller than $n$ because different samples often share the same pattern $h$, as can be observed from Figures 1(a), 1(b) and 1(c) in the experiments.

Inference

The classification procedure with the proposed SAMKL is slightly different from that of existing MKL algorithms, since the subset of base kernels that is useful for classification has to be inferred for a given test sample. Following the strategy of structural SVMs with latent variables (Yu and Joachims 2009), we enumerate all possible configurations (label and switching pattern) and select the pair maximizing the margin. Specifically, the class label of a sample in a binary classification case is inferred by

$$
(y^*, h^*) = \arg\max_{y \in \{\pm 1\},\, h_i \in \mathcal{H}} \; y\,\big(\omega^\top \phi(x; h_i) + b\big)
= \arg\max_{y \in \{\pm 1\},\, h_i \in \mathcal{H}} \; y\Big(\sum_{p=1}^m h_{ip}\, \gamma_p \sum_{j=1}^n \frac{\alpha_j y_j}{\tau_j} h_{jp} K_p(x_j, x) + b\Big),
\tag{17}
$$

where $\mathcal{H} = \bigcup_{i=1}^n \{h_i\}$ is the set of all the unique $h$ patterns learned in the training stage. In the case of multi-class classification, a set of normal vectors $\hat{\omega}_1, \dots, \hat{\omega}_L$ can be obtained when the one-versus-rest strategy is used, where $L$ is the number of classes. We can work out the decision score for the $l$-th class as

$$
f_l(x) = \max_{h_i \in \mathcal{H}} \; \hat{\omega}_l^\top \phi(x; h_i) + \hat{b}_l
= \max_{h_i \in \mathcal{H}} \; \sum_{p=1}^m h_{ip}\, \gamma_p \sum_{j=1}^n \frac{\alpha_j y_j}{\tau_j} h_{jp} K_p(x_j, x) + \hat{b}_l,
\tag{18}
$$

where the quantities on the right-hand side are those learned for the $l$-th one-versus-rest classifier. The class label is predicted as the $l$ corresponding to the largest $f_l(x)$.
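The inference rule of Eq. (17) amounts to a small enumeration over the learned switching patterns. The sketch below carries it out for a binary task from the learned α, γ, τ and training patterns; the argument names and array layouts are assumptions made for this illustration.

```python
import numpy as np

def predict_binary(alpha, y, tau, gamma, H_train, patterns, K_test_cols, b):
    """Enumerate label/switching-pattern pairs as in Eq. (17) and keep the best.

    H_train:     (n, m) latent vectors learned for the n training samples.
    patterns:    iterable over the unique switching patterns in H, each of length m.
    K_test_cols: (m, n) array with K_p(x_j, x) for the test sample x.
    """
    # beta[j, p] = (alpha_j y_j / tau_j) * h_jp, the per-kernel expansion coefficients.
    beta = (alpha * y / tau)[:, None] * H_train
    # per_kernel[p] = gamma_p * sum_j beta[j, p] * K_p(x_j, x)
    per_kernel = gamma * np.einsum('jp,pj->p', beta, K_test_cols)

    best_label, best_score = 1, -np.inf
    for h in patterns:
        decision = np.dot(h, per_kernel) + b          # w^T phi(x; h) + b
        for label in (+1, -1):
            if label * decision > best_score:
                best_score, best_label = label * decision, label
    return best_label
```

The multi-class rule of Eq. (18) simply repeats this enumeration for each one-versus-rest classifier and takes the class with the largest score.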
Computational efficiency

At each iteration, SAMKL alternately solves an MKL problem (line 5 in Algorithm 1) and n 0/1 programming problems (lines 6-8 in Algorithm 1). The size of each 0/1 programming problem is m, the number of base kernels. Note that m is usually not large in practical applications; for example, the largest m among all the MKL benchmark data sets used here is 69, which allows each 0/1 programming problem to be solved efficiently. In our experiments, Algorithm 1 converges in a small number of iterations (usually fewer than ten). Also, the computation in the test stage is proportional to the size of H, which is smaller than the number of training samples, as analyzed previously. We report the training and test time in the experiments.

Experimental results

Our experiments consist of two parts. The first part compares the proposed SAMKL with localized MKL (LMKL) (Gönen and Alpaydin 2008); both approaches assume that the base kernel weights vary across samples. Note that LMKL requires access to the original input data and focuses on binary classification tasks only. In the second part, the proposed SAMKL is compared with Simple MKL (Rakotomamonjy et al. 2008), the non-sparse MKL algorithm (Xu et al. 2010) and the unweighted MKL algorithm. All MKL algorithms are compared in terms of both classification performance and computational cost. The adaptivity of the learned latent variables with respect to each training sample is also shown.

Results on protein fold prediction data set

We compare the proposed SAMKL with LMKL (Gönen and Alpaydin 2008) on the protein fold prediction data set (http://mkl.ucsd.edu/dataset/). Simple MKL is also included as a reference. This data set has 27 classes and 12 base kernels, and its original input data are available. Besides, the training/test partition of this data set is pre-specified. The code of LMKL was downloaded from the authors' website (http://users.ics.aalto.fi/gonen/icml08.php/); it handles binary classification only. To construct binary classification tasks, we select the five classes (out of 27) with the largest numbers of training samples, namely the 1st, 7th, 9th, 12th and 16th classes. Every pair of these classes forms a binary classification task, giving ten binary classification tasks in total. For the proposed SAMKL, we empirically set h0 = (1, 1, ..., 1) and m0 = 3. To ensure a fair comparison, the regularization parameter C for all three MKL algorithms is chosen from {10^-1, 10^0, ..., 10^4} by five-fold cross-validation on the training data. Each base kernel matrix is normalized to have a unit trace.

Table 1: Classification accuracy (%) and training/test time (in seconds) of the proposed SAMKL, LMKL (Gönen and Alpaydin 2008) and Simple MKL (Rakotomamonjy et al. 2008) on the protein fold prediction binary tasks. Each cell shows accuracy with training/test time in parentheses; the highest accuracy per task is in bold.

| Task | SAMKL | LMKL | Simple MKL |
| --- | --- | --- | --- |
| C7 vs. C16 | **88.0** (32.7/0.04) | 80.4 (12.1/0.01) | 86.9 (2.9/0.01) |
| C7 vs. C9 | **91.2** (33.7/0.02) | 89.5 (4.4/0.02) | 87.7 (5.9/0.00) |
| C7 vs. C1 | **98.0** (21.1/0.02) | 96.0 (4.3/0.02) | **98.0** (2.7/0.00) |
| C7 vs. C12 | **85.7** (103.1/0.05) | 77.8 (4.9/0.01) | 79.4 (9.1/0.00) |
| C16 vs. C9 | **95.1** (89.5/0.02) | 83.6 (5.3/0.01) | **95.1** (13.9/0.00) |
| C16 vs. C1 | **98.2** (11.1/0.02) | 96.3 (9.7/0.01) | **98.2** (1.3/0.00) |
| C16 vs. C12 | **86.6** (39.5/0.02) | 80.6 (17.5/0.01) | **86.6** (3.1/0.00) |
| C9 vs. C1 | **94.7** (24.4/0.01) | 89.5 (12.9/0.01) | **94.7** (2.6/0.00) |
| C9 vs. C12 | **84.4** (25.3/0.01) | 68.8 (4.2/0.01) | **84.4** (5.8/0.00) |
| C1 vs. C12 | **92.0** (7.8/0.01) | 72.0 (3.9/0.01) | 88.0 (1.3/0.00) |

The classification accuracy on the ten tasks is reported in Table 1, where the highest values are shown in bold. As can be seen, the proposed SAMKL obtains better overall classification performance than LMKL and Simple MKL, and its improvement over LMKL is significant. We attribute the superiority of SAMKL to its latent mechanism, which is designed to adaptively switch off less useful base kernels for each individual sample during MKL. At the same time, the less satisfying performance of LMKL indicates that more appropriate parametric models for kernel weight prediction need to be sought and that the appropriateness of the smoothness assumption may need to be reviewed. The training and test time are reported in parentheses in Table 1. Compared with Simple MKL and LMKL, SAMKL incurs a mildly increased computational time due to the use of the proposed latent mechanism.

The learned latent variable h is shown in Figure 1(a), where one binary classification task (Class 7 vs. Class 9) is selected for demonstration. The h on each classification task is shown as an n x m matrix, where n and m are the numbers of training samples and base kernels, respectively. The red color indicates '1', while the blue and green colors indicate '0'. The difference between blue and green is as follows. Blue indicates latent variables which switch off base kernels whose weights are non-zero; we call them active latent variables. Green indicates latent variables which switch off base kernels whose weights are zero; we call them inactive latent variables. As shown, h switches the base kernels on/off differently across training samples. Looking into these matrices shows that each row has exactly three '0's, due to the constraint ||h_i - h_0||_1 <= m_0 with m_0 = 3. This experiment preliminarily demonstrates the effectiveness and the properties of the proposed SAMKL.
Figure 1: The latent variable h learned for each sample of a training group (red: '1'; blue: '0', active latent variables; green: '0', inactive latent variables) on different data sets. (a) Class 7 vs. Class 9 (protein fold prediction); (b) psort Pos; (c) Caltech-101. Each panel is an n x m matrix with training samples on one axis and base kernel matrices on the other.

Results on benchmark data sets

Four benchmark data sets are used: psort Pos, psort Neg, plant and Caltech-101. All of them can be downloaded from http://mkl.ucsd.edu/dataset/. The first three are for protein subcellular localization and have been widely used by MKL algorithms (Zien and Ong 2007). Their class numbers are four, five and four, respectively. The 69 base kernel matrices have been pre-computed and are provided on the above website. Caltech-101 contains 25 base kernel matrices based on a set of visual features extracted from the Caltech-101 object recognition data set. 15 training and 15 test examples are used for each class. The base kernel matrices of five random splits of training and test sets are pre-computed and provided.

We compare the proposed SAMKL with the state-of-the-art MKL algorithms, including Simple MKL (Rakotomamonjy et al. 2008) and ℓp-norm MKL (Xu et al. 2010) with p = 2, 4, 8, respectively. The uniformly weighted MKL (UMKL) is also included as a baseline. Note that LMKL (Gönen and Alpaydin 2008) is not included in the comparison because it needs access to the original input data, which are not available for these benchmark data sets; furthermore, it handles binary classification tasks only.

Table 2: Experimental comparison of the proposed SAMKL, Simple MKL (Rakotomamonjy et al. 2008), ℓp-MKL (Xu et al. 2010) and UMKL on the three protein and Caltech-101 data sets. Each cell shows mean accuracy ± standard deviation, with training/test time (in seconds) in parentheses; boldface marks the best result.

| Data set | SAMKL (proposed) | Simple MKL | ℓp-MKL (p = 2) | ℓp-MKL (p = 4) | ℓp-MKL (p = 8) | UMKL |
| --- | --- | --- | --- | --- | --- | --- |
| psort Pos | **90.6 ± 1.5** (343.0/5.8) | 87.7 ± 2.3 (157.0/0.8) | 86.4 ± 1.8 (9.2/0.1) | 84.6 ± 2.0 (5.1/0.1) | 83.8 ± 2.2 (3.5/0.1) | 84.3 ± 1.9 (0.6/0.1) |
| psort Neg | **92.5 ± 1.0** (4666.8/54.1) | 89.9 ± 1.1 (1184.3/2.0) | 88.9 ± 1.1 (121.1/0.8) | 87.7 ± 1.3 (60.7/0.8) | 86.9 ± 1.2 (43.0/0.7) | 84.0 ± 1.5 (5.5/0.1) |
| plant | **89.5 ± 1.7** (864.7/18.4) | 88.0 ± 1.7 (515.1/0.4) | 86.4 ± 1.5 (34.7/0.3) | 85.0 ± 1.9 (17.7/0.3) | 84.1 ± 2.1 (13.7/0.3) | 83.1 ± 2.2 (2.0/0.1) |
| Caltech-101 | **67.2 ± 1.0** (157480/1591.7) | 63.7 ± 1.3 (30079/15.1) | 65.3 ± 1.5 (2906.8/1.5) | 65.1 ± 1.5 (1505.1/1.5) | 65.1 ± 1.5 (1007.3/1.5) | 65.0 ± 1.8 (165.7/0.2) |

Following (Zien and Ong 2007), F1-score is used to measure classification performance on the psort Pos and psort Neg data sets, while the Matthews correlation coefficient (MCC) is used for the plant data set. For Caltech-101, classification accuracy is used as in the literature. For the psort Pos, psort Neg and plant data sets, we randomly split the data into 20 groups, with 50% : 50% for training and test. For Caltech-101, we use the five pre-defined training and test partitions. To conduct a rigorous comparison, the paired Student's t-test is performed. For the three protein data sets, C for each MKL algorithm is chosen from {10^-1, 10^0, ..., 10^4} by five-fold cross-validation. For Caltech-101, C is set to 10^4 experimentally. The parameter C̃ in Eq. (12) is not needed in our alternate optimization method. Each base kernel matrix is normalized to have a unit trace. For our proposed SAMKL, h0 is again set as (1, 1, ..., 1), and m0 is empirically set as 20 and 10 on the three protein data sets and Caltech-101, respectively.
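The unit-trace normalization applied to every base kernel matrix above is a one-line operation; the helper below is only a sketch of that preprocessing step.

```python
import numpy as np

def unit_trace(K):
    """Scale a base kernel matrix so that its trace equals one (preserves positive semi-definiteness)."""
    return K / np.trace(K)
```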
Results on three protein and Caltech-101 data sets

As seen in Table 2, SAMKL consistently achieves superior performance to Simple MKL (Rakotomamonjy et al. 2008), ℓp-norm MKL (Xu et al. 2010) and UMKL on all three protein data sets. In the literature, Simple MKL (Rakotomamonjy et al. 2008) is the algorithm that achieves the best performance on these data sets (Kloft et al. 2011). However, SAMKL further improves on its performance by 3.0%, 2.5% and 1.5%, respectively, and these improvements are found to be statistically significant. The above results indicate that introducing the latent variables allows each sample to effectively focus on the more useful base kernels and avoid being affected by the less useful ones, leading to better performance. The latent variable h learned on psort Pos is plotted in Figure 1(b). As seen, h dynamically switches off many base kernels, such as the 1st-5th, 18th and 20th, to name just a few. Cross-referencing the learned kernel combination weights, we find that many base kernels switched off by h have non-zero combination weights (shown in blue). Switching them off allows different samples to effectively utilize an appropriate subset of base kernels. This confirms the sample-based adaptivity of SAMKL.

The averaged result on Caltech-101 is reported in the last row of Table 2. Again, we observe that SAMKL achieves the highest classification accuracy. Specifically, SAMKL gains a 1.9% improvement over the second best one, i.e., ℓ2-norm MKL. Also, the latent variables learned on a training group are plotted in Figure 1(c). In this figure, it is worth noting that all latent variables are active, because the optimal combination weights of all base kernels are found to be non-zero. Base kernels are actively switched off by the latent variables across all samples. Together with the results on protein subcellular localization, the result on the Caltech-101 data set validates the effectiveness of the proposed SAMKL algorithm.

Note that the above training/test timing results are based on a relatively straightforward implementation of Algorithm 1, where no specific tuning is conducted to speed up the training or test process. However, the n 0/1 programming problems at each iteration can be solved in parallel. Also, the test of each sample against different switching patterns can be performed in parallel. These properties will be exploited in future work to improve the computational efficiency of our algorithm.

Conclusion

This work proposes SAMKL, a novel MKL algorithm which jointly performs MKL and infers the subset of base kernels that is useful for the classification of each sample. By allowing each sample to adaptively switch each base kernel on/off, SAMKL achieves clear improvement over the comparable MKL algorithms in the recent literature. Further improving the computational efficiency of the proposed SAMKL is another piece of our future work.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 60970034, 61170287, 61232016 and 61125201), the Australian Research Council Linkage Grant LP0991757 and UOW URC Near Miss Grant (DP120102145 and DP130100926). We thank Dr. Zenglin Xu for helpful discussions.

References

Afkanpour, A.; György, A.; Szepesvári, C.; and Bowling, M. 2013. A randomized mirror descent algorithm for large scale multiple kernel learning. In ICML, 374-382.
Bach, F. R.; Lanckriet, G. R. G.; and Jordan, M. I. 2004. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML.
Chapelle, O.; Vapnik, V.; Bousquet, O.; and Mukherjee, S. 2002. Choosing multiple parameters for support vector machines. Machine Learning 46(1-3):131-159.
Chechik, G.; Heitz, G.; Elidan, G.; Abbeel, P.; and Koller, D. 2008. Max-margin classification of data with absent features. Journal of Machine Learning Research 9:1-21.
Cortes, C.; Mohri, M.; and Rostamizadeh, A. 2012. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research 13:795-828.
Cortes, C.; Mohri, M.; and Rostamizadeh, A. 2013. Multi-class classification with maximum margin multiple kernel. In ICML, 46-54.
Gai, K.; Chen, G.; and Zhang, C. 2010. Learning kernels with radiuses of minimum enclosing balls. In NIPS, 649-657.
Gönen, M., and Alpaydin, E. 2008. Localized multiple kernel learning. In ICML, 352-359.
Gönen, M. 2012. Bayesian efficient multiple kernel learning. In ICML.
Hinrichs, C.; Singh, V.; Peng, J.; and Johnson, S. C. 2012. Q-MKL: Matrix-induced regularization in multi-kernel learning with applications to neuroimaging. In NIPS, 1430-1438.
Kloft, M.; Brefeld, U.; Sonnenburg, S.; and Zien, A. 2011. ℓp-norm multiple kernel learning. Journal of Machine Learning Research 12:953-997.
Kumar, A.; Niculescu-Mizil, A.; Kavukcuoglu, K.; and Daumé III, H. 2012. A binary classification framework for two-stage multiple kernel learning. In ICML, 1295-1302.
Lanckriet, G. R. G.; Cristianini, N.; Bartlett, P. L.; Ghaoui, L. E.; and Jordan, M. I. 2004. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research 5:27-72.
Liu, X.; Wang, L.; Yin, J.; and Liu, L. 2012. Incorporation of radius-info can be simple with SimpleMKL. Neurocomputing 89:30-38.
Liu, X.; Yin, J.; Wang, L.; Liu, L.; Liu, J.; Hou, C.; and Zhang, J. 2013a. An adaptive approach to learning optimal neighborhood kernels. IEEE Transactions on Cybernetics 43(1):371-384.
Liu, X.; Wang, L.; Yin, J.; Zhu, E.; and Zhang, J. 2013b. An efficient approach to integrating radius information into multiple kernel learning. IEEE Transactions on Cybernetics 43(2):557-569.
Micchelli, C. A., and Pontil, M. 2005. Learning the kernel function via regularization. Journal of Machine Learning Research 6:1099-1125.
Orabona, F., and Luo, J. 2011. Ultra-fast optimization algorithm for sparse multi kernel learning. In ICML, 249-256.
Rakotomamonjy, A.; Bach, F. R.; Canu, S.; and Grandvalet, Y. 2008. SimpleMKL. Journal of Machine Learning Research 9:2491-2521.
Schölkopf, B., and Smola, A. J. 2002. A short introduction to learning with kernels. In Machine Learning Summer School, 41-64.
Shawe-Taylor, J., and Cristianini, N. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.
Sonnenburg, S.; Rätsch, G.; Schäfer, C.; and Schölkopf, B. 2006. Large scale multiple kernel learning. Journal of Machine Learning Research 7:1531-1565.
Xu, Z.; Jin, R.; Yang, H.; King, I.; and Lyu, M. R. 2010. Simple and efficient multiple kernel learning by group lasso. In ICML, 1175-1182.
Yan, F.; Kittler, J.; Mikolajczyk, K.; and Tahir, M. A. 2012. Non-sparse multiple kernel Fisher discriminant analysis. Journal of Machine Learning Research 13:607-642.
Yang, W.; Wang, Y.; Vahdat, A.; and Mori, G. 2012. Kernel latent SVM for visual recognition. In NIPS, 818-826.
Yu, C.-N. J., and Joachims, T. 2009. Learning structural SVMs with latent variables. In ICML, 1169-1176.
Zien, A., and Ong, C. S. 2007. Multiclass multiple kernel learning. In ICML, 1191-1198.