# Online Multi-Kernel Learning with Graph-Structured Feedback

**Pouya M. Ghari¹  Yanning Shen¹**

¹University of California, Irvine, CA, USA. Correspondence to: Yanning Shen.

*Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).*

## Abstract

Multi-kernel learning (MKL) exhibits reliable performance in nonlinear function approximation tasks. Instead of using one kernel, it learns the optimal kernel from a pre-selected dictionary of kernels. The selection of the dictionary has a crucial impact on both the performance and the complexity of MKL. Specifically, the inclusion of a large number of irrelevant kernels may impair the accuracy and increase the complexity of MKL algorithms. To enhance the accuracy and alleviate the computational burden, the present paper develops a novel scheme which actively chooses relevant kernels. The proposed framework models the pruned kernel combination as feedback collected from a graph, which is refined on the fly. Leveraging the random feature approximation, we propose an online scalable multi-kernel learning approach with graph feedback, and prove that the proposed algorithm enjoys sublinear regret. Numerical tests on real datasets demonstrate the effectiveness of the novel approach.

## 1. Introduction

Function approximation emerges in various machine learning tasks such as classification, regression and clustering. In the present paper, we focus on supervised function learning tasks: given samples $\{(\mathbf{x}_t, y_t)\}_{t=1}^{T}$ such that $\mathbf{x}_t \in \mathbb{R}^d$ and $y_t \in \mathbb{R}$, the goal is to find a function $f(\cdot)$ such that the discrepancy between $f(\mathbf{x}_t)$ and $y_t$ is minimized. To this end, a cost function $\mathcal{C}(f(\mathbf{x}_t), y_t)$ can be employed to measure the difference between $f(\mathbf{x}_t)$ and $y_t$, and $\sum_{t=1}^{T}\mathcal{C}(f(\mathbf{x}_t), y_t)$ is minimized. Upon assuming that $f(\cdot)$ belongs to a reproducing kernel Hilbert space (RKHS), the problem becomes tractable (Scholkopf & Smola, 2001). Most kernel-based learning relies on a preselected kernel, while the data-driven multi-kernel learning (MKL) approach is more powerful, as it learns the optimal kernel from a dictionary of kernels, see e.g., (Cortes et al., 2010).

In a number of tasks, it is also required to carry out function learning in an online fashion, especially when data is collected sequentially. Moreover, the sheer volume of data can make it impossible to store in batch form. In these cases, it is imperative to perform learning tasks in an online fashion. However, kernel-based learning also suffers from the well-known curse of dimensionality (Shawe-Taylor & Cristianini, 2004; Wahba, 1990), which makes it not amenable to sequential processing. In order to address this issue, online single-kernel learning has attracted increasing attention, see e.g., (Bouboulis et al., 2018; Lu et al., 2016; Zhang & Liao, 2019), where (Zhang & Liao, 2019) developed a random-sketching-based framework. Kernels can be approximated using finite-dimensional feature maps, which can effectively make the learning task scalable, see e.g., (Rahimi & Recht, 2007; Shahrampour & Tarokh, 2018; Williams & Seeger, 2000). Employing finite feature mappings, (Bouboulis et al., 2018; Lu et al., 2016) developed online kernel learning approaches based on the random feature (RF) approximation (Rahimi & Recht, 2007).
Building upon the online RF-based single-kernel learning framework, multi-kernel variants have also been developed in (Lu et al., 2020; Sahoo et al., 2019; Shen et al., 2018; 2019). A largely overlooked issue for most MKL approaches is the proper selection of the kernel dictionary, which has a pivotal impact on both the accuracy and the complexity of MKL. Specifically, the inclusion of a large number of irrelevant kernels may impair the accuracy and increase the complexity of the MKL algorithm. Faced with these challenges, the present paper aims at developing an online MKL (OMKL) framework which can actively choose relevant kernels from the dictionary and refine the selection on the fly. The proposed framework can alleviate the computational burden of online MKL due to the reduced number of kernels used at each time instant. Specifically, the online kernel selection is modeled as graph-structured feedback (the resulting algorithm is henceforth termed OMKL-GF), where each kernel can be viewed as an expert lying in a graph. At each time slot, instead of collecting feedback from all kernel nodes, the graph is updated and refined so that only a subset of kernels is employed. Relative to existing OMKL approaches, our novelty can be summarized as follows:

**c1)** Different from existing works which employ all kernels in the dictionary and only learn the combination coefficients, OMKL-GF uses only a subset of kernels at each time slot, according to a time-varying graph;

**c2)** We develop an adaptive and disciplined framework to refine the graph structure according to the loss incurred by the kernel-based approximants, and prove that the resulting OMKL-GF approach achieves sublinear regret;

**c3)** Experiments on real datasets demonstrate the effectiveness of the novel algorithm compared with several state-of-the-art OMKL alternatives.

## 2. Preliminaries

Given samples $\{(\mathbf{x}_t, y_t)\}_{t=1}^{T}$, the goal is to find the nonlinear function $f(\cdot)$ such that $y_t = f(\mathbf{x}_t) + e_t$. In the context of kernel-based learning, it is assumed that the sought $f(\cdot)$ belongs to the reproducing kernel Hilbert space (RKHS) $\mathcal{H} := \{f \,|\, f(\mathbf{x}) = \sum_{t=1}^{\infty} \alpha_t \kappa(\mathbf{x}, \mathbf{x}_t)\}$, where $\kappa(\mathbf{x}, \mathbf{x}_t): \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ measures the similarity between $\mathbf{x}$ and $\mathbf{x}_t$, and is a symmetric positive semidefinite basis function called a kernel. A kernel is reproducing if it satisfies $\langle \kappa(\mathbf{x}, \mathbf{x}_t), \kappa(\mathbf{x}, \mathbf{x}_{t'}) \rangle_{\mathcal{H}} = \kappa(\mathbf{x}_t, \mathbf{x}_{t'})$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ denotes the inner product in the Hilbert space, and the RKHS norm is defined as $\|f\|_{\mathcal{H}}^2 := \sum_t \sum_{t'} \alpha_t \alpha_{t'} \kappa(\mathbf{x}_t, \mathbf{x}_{t'})$. The function approximation problem can henceforth be solved via

$$\min_{f \in \mathcal{H}}\; \frac{1}{T}\sum_{t=1}^{T} \mathcal{C}(f(\mathbf{x}_t), y_t) + \lambda\,\Omega\!\left(\|f\|_{\mathcal{H}}^2\right) \tag{1}$$

where $\mathcal{C}(\cdot,\cdot)$ is a cost function, which can be task-specific, e.g., least-squares for regression, and the logistic cost for classification. The regularizer $\Omega(\cdot)$ is a non-decreasing function introduced to control model complexity and prevent overfitting. Given finite data samples, the representer theorem states that the optimal solution of (1) can be written as (Wahba, 1990)

$$\hat{f}(\mathbf{x}) = \sum_{t=1}^{T} \alpha_t \kappa(\mathbf{x}, \mathbf{x}_t) := \boldsymbol{\alpha}^{\top}\boldsymbol{\kappa}(\mathbf{x}) \tag{2}$$

where $\boldsymbol{\alpha} := [\alpha_1, \ldots, \alpha_T]^{\top}$ collects the unknown coefficients to be estimated, and $\boldsymbol{\kappa}(\mathbf{x}) := [\kappa(\mathbf{x}, \mathbf{x}_1), \ldots, \kappa(\mathbf{x}, \mathbf{x}_T)]^{\top}$. Hence, the RKHS norm can be written as $\|f\|_{\mathcal{H}}^2 := \boldsymbol{\alpha}^{\top}\mathbf{K}\boldsymbol{\alpha}$, where $\mathbf{K}$ is the kernel matrix whose $(t, t')$-th entry is $\kappa(\mathbf{x}_t, \mathbf{x}_{t'})$. Note that the dimension of $\boldsymbol{\alpha}$ increases with the number of data samples $T$, which is also known as the curse of dimensionality (Wahba, 1990). This brings challenges for solving (1) in an online fashion.
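To illustrate why (2) becomes problematic in the online setting, the following minimal sketch (not the authors' code; the Gaussian kernel and all variable names are illustrative) evaluates the kernel expansion $\hat{f}(\mathbf{x}) = \boldsymbol{\alpha}^\top\boldsymbol{\kappa}(\mathbf{x})$ and makes explicit that one coefficient per stored sample is needed, so the parameter vector grows with $T$.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    """Shift-invariant Gaussian kernel kappa(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

def f_hat(x, X_train, alpha, sigma=1.0):
    """Kernel expansion (2): f_hat(x) = alpha^T kappa(x), kappa(x) = [k(x, x_1), ..., k(x, x_T)]."""
    kappa_x = np.array([gaussian_kernel(x, xt, sigma) for xt in X_train])
    return alpha @ kappa_x

rng = np.random.default_rng(0)
T, d = 100, 5
X_train = rng.normal(size=(T, d))      # all T samples must be stored
alpha = rng.normal(size=T)             # one coefficient per sample: dimension grows with T
print(f_hat(rng.normal(size=d), X_train, alpha))
```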
### 2.1. Random Feature Approximation

One way to cope with the increasing dimension of the unknowns is to resort to the random feature approximation of kernels (Rahimi & Recht, 2007). As in (Rahimi & Recht, 2007), we approximate $\kappa$ in (2) using shift-invariant kernels that satisfy $\kappa(\mathbf{x}_t, \mathbf{x}_{t'}) = \kappa(\mathbf{x}_t - \mathbf{x}_{t'})$. For $\kappa(\mathbf{x}_t - \mathbf{x}_{t'})$ absolutely integrable, its Fourier transform $\pi_\kappa(\mathbf{v})$ exists and represents the power spectral density, which, upon normalizing to ensure $\kappa(\mathbf{0}) = 1$, can also be viewed as a probability density function (pdf); hence,

$$\kappa(\mathbf{x}_t - \mathbf{x}_{t'}) = \int \pi_\kappa(\mathbf{v})\, e^{j\mathbf{v}^{\top}(\mathbf{x}_t - \mathbf{x}_{t'})}\, d\mathbf{v} := \mathbb{E}_{\mathbf{v}}\!\left[e^{j\mathbf{v}^{\top}(\mathbf{x}_t - \mathbf{x}_{t'})}\right]. \tag{3}$$

By drawing a sufficient number of independent and identically distributed (i.i.d.) samples $\{\mathbf{v}_i\}_{i=1}^{D}$ from $\pi_\kappa(\mathbf{v})$, $\kappa(\mathbf{x}_t - \mathbf{x}_{t'})$ can be approximated by the ensemble mean $\hat{\kappa}_c(\mathbf{x}_t - \mathbf{x}_{t'}) := \frac{1}{D}\sum_{i=1}^{D} e^{j\mathbf{v}_i^{\top}(\mathbf{x}_t - \mathbf{x}_{t'})}$, whose real part constitutes an unbiased estimator of $\kappa(\mathbf{x}_t - \mathbf{x}_{t'})$ (Rahimi & Recht, 2007)

$$\hat{\kappa}(\mathbf{x}_t - \mathbf{x}_{t'}) = \mathbf{z}_{\mathbf{v}}^{\top}(\mathbf{x}_t)\,\mathbf{z}_{\mathbf{v}}(\mathbf{x}_{t'}) \tag{4}$$

where the random feature (RF) vector is

$$\mathbf{z}_{\mathbf{v}}(\mathbf{x}) = \frac{1}{\sqrt{D}}\left[\sin\mathbf{v}_1^{\top}\mathbf{x}, \ldots, \sin\mathbf{v}_D^{\top}\mathbf{x}, \cos\mathbf{v}_1^{\top}\mathbf{x}, \ldots, \cos\mathbf{v}_D^{\top}\mathbf{x}\right]^{\top}. \tag{5}$$

Hence, $\hat{f}(\mathbf{x})$ in (2) can be approximated as

$$\hat{f}_{\text{RF}}(\mathbf{x}) = \sum_{t=1}^{T} \alpha_t\, \mathbf{z}_{\mathbf{v}}^{\top}(\mathbf{x}_t)\, \mathbf{z}_{\mathbf{v}}(\mathbf{x}) := \boldsymbol{\theta}^{\top}\mathbf{z}_{\mathbf{v}}(\mathbf{x}) \tag{6}$$

where $\boldsymbol{\theta}$ is a $2D \times 1$ vector whose dimension does not increase with the number of data samples. Thus, the nonlinear function that is optimal in the sense of (1) is approximated by a linear function in the random feature space. Hence, the RF approximation resolves the growing number of parameters and makes the problem amenable to scalable online implementation.

Note that kernel-based learning approaches rely on a preselected kernel, which often requires task-specific prior information that may not be available. To cope with this challenge, MKL has been proposed, which learns the kernel as a combination of a prescribed and sufficiently rich dictionary of kernels $\{\kappa_n\}_{n=1}^{N}$. The kernel combination belongs to the convex hull $\mathcal{K} := \{\kappa = \sum_{n=1}^{N}\beta_n\kappa_n,\ \beta_n \ge 0,\ \sum_{n=1}^{N}\beta_n = 1\}$, and is itself a kernel (Scholkopf & Smola, 2001). Hence, the function approximation can be performed by seeking functions of the form

$$f(\mathbf{x}_t) = \sum_{n=1}^{N} \bar{w}_n f_n(\mathbf{x}_t) \tag{7}$$

where $\sum_{n=1}^{N}\bar{w}_n = 1$ and $f_n(\mathbf{x}_t)$ belongs to $\mathcal{H}_n$, the RKHS induced by $\kappa_n$. Substituting (7) into (1) and resorting to Jensen's inequality, the function approximation problem can then be solved in an online fashion by minimizing an upper bound of the original objective function. Upon replacing $f_n(\cdot)$ with its RF approximation

$$\hat{f}_{\text{RF},n}(\mathbf{x}_t) = \boldsymbol{\theta}_n^{\top}\mathbf{z}_n(\mathbf{x}_t) \tag{8}$$

the function approximation problem can be written as (Shen et al., 2019)

$$\begin{aligned}
\min_{\{\bar{w}_{n,t},\,\boldsymbol{\theta}_n\}}\ & \sum_{t=1}^{T}\sum_{n=1}^{N} \bar{w}_{n,t}\left[\mathcal{C}\!\left(\boldsymbol{\theta}_n^{\top}\mathbf{z}_n(\mathbf{x}_t), y_t\right) + \lambda\,\Omega\!\left(\|\boldsymbol{\theta}_{n,t}\|^2\right)\right] \\
\text{s.t.}\ & \sum_{n=1}^{N}\bar{w}_{n,t} = 1,\quad \bar{w}_{n,t} \ge 0,\quad \forall t: 1 \le t \le T.
\end{aligned} \tag{9}$$

However, employing a large kernel dictionary (i.e., large $N$) increases the computational complexity, and may deteriorate the accuracy of the function approximation if too many irrelevant kernels are included. This motivates us to investigate data-driven approaches to choose and refine the subset of relevant kernels from a large dictionary.
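The RF map in (4)-(6) can be made concrete with the following minimal sketch (not the authors' code; the bandwidth and sizes are illustrative). For a Gaussian kernel with bandwidth $\sigma$, the normalized Fourier transform $\pi_\kappa$ is a Gaussian pdf with covariance $\sigma^{-2}\mathbf{I}$, so drawing $\{\mathbf{v}_i\}_{i=1}^{D}$ from it and forming $\mathbf{z}_{\mathbf{v}}(\mathbf{x})$ yields $\mathbf{z}_{\mathbf{v}}^{\top}(\mathbf{x})\mathbf{z}_{\mathbf{v}}(\mathbf{x}')\approx\kappa(\mathbf{x}-\mathbf{x}')$.

```python
import numpy as np

def rf_map(X, V):
    """Map the rows of X (shape T x d) to 2D-dimensional RF vectors z_v(x) as in (5)."""
    D = V.shape[0]                      # number of random features
    proj = X @ V.T                      # v_i^T x for every sample and every frequency
    return np.hstack([np.sin(proj), np.cos(proj)]) / np.sqrt(D)

rng = np.random.default_rng(0)
d, D, sigma = 5, 200, 1.0
V = rng.normal(scale=1.0 / sigma, size=(D, d))    # v_i ~ pi_kappa(v) for the Gaussian kernel

x, xp = rng.normal(size=d), rng.normal(size=d)
Z = rf_map(np.vstack([x, xp]), V)
approx = Z[0] @ Z[1]                               # z_v(x)^T z_v(x') as in (4)
exact = np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))
print(f"RF estimate {approx:.3f} vs. exact kernel {exact:.3f}")
```

Increasing D tightens the approximation, at the cost of a larger parameter vector of size 2D.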
## 3. Online MKL with Graph Feedback

Based on the RF formulation in Section 2, the present section introduces an online multi-kernel learning approach which can adaptively refine the kernel selection.

### 3.1. Time-Varying Kernel Selection

Instead of combining the entire dictionary of kernels, in the present paper we consider a subset of kernels $\{\kappa_n, n \in S_t\}$ at time instant $t$, where $S_t$ is the index set of the chosen subset of kernels at time instant $t$, and $|S_t| = N_t$. Hence, the original function approximation problem boils down to

$$\begin{aligned}
\min_{\{\bar{w}_{n,t},\,\boldsymbol{\theta}_n\}}\ & \sum_{t=1}^{T}\sum_{n\in S_t} \bar{w}_{n,t}\left[\mathcal{C}\!\left(\boldsymbol{\theta}_n^{\top}\mathbf{z}_n(\mathbf{x}_t), y_t\right) + \lambda\,\Omega\!\left(\|\boldsymbol{\theta}_{n,t}\|^2\right)\right] \\
\text{s.t.}\ & \sum_{n\in S_t}\bar{w}_{n,t} = 1,\quad \bar{w}_{n,t} \ge 0,\quad 1 \le t \le T.
\end{aligned} \tag{10}$$

At each time instant $t$, the loss of the $n$-th kernel is defined as

$$\mathcal{L}\!\left(\boldsymbol{\theta}_n^{\top}\mathbf{z}_n(\mathbf{x}_t), y_t\right) = \mathcal{C}\!\left(\boldsymbol{\theta}_n^{\top}\mathbf{z}_n(\mathbf{x}_t), y_t\right) + \lambda\,\Omega\!\left(\|\boldsymbol{\theta}_n\|^2\right).$$

Upon defining the normalized weights $\bar{w}_{n,t} = w_{n,t}/\sum_{m\in S_t} w_{m,t}$, (10) can be re-written as

$$\begin{aligned}
\min_{\{w_n\},\{\boldsymbol{\theta}_n\}}\ & \sum_{t=1}^{T}\sum_{n\in S_t} \frac{w_n}{\sum_{m\in S_t} w_m}\,\mathcal{L}\!\left(\boldsymbol{\theta}_n^{\top}\mathbf{z}_n(\mathbf{x}_t), y_t\right) \\
\text{s.t.}\ & w_n > 0,\quad 1 \le n \le N.
\end{aligned} \tag{11}$$

However, (10) assumes that $\{S_t\}_{t=1}^{T}$ are preselected sets. In the following section, we study a data-driven scheme which can adaptively select a subset of kernels on the fly upon receiving new data samples.

### 3.2. Data-Driven Graph-Structured Kernel Selection

In order to adaptively choose the subset of kernels, the present section models the pruned kernel combination as feedback collected from a graph that is refined in an online fashion. By doing this, the proposed approach trims irrelevant kernels in the dictionary, to both improve the function approximation accuracy and reduce the computational complexity of MKL.

Consider a time-varying bipartite graph (Asratian et al., 1998) $\mathcal{G}_t$ at time $t$, which consists of two sets of nodes: $N$ kernel nodes $\{v_1^{(k)}, \ldots, v_N^{(k)}\}$ and $J$ selective nodes $\{v_1^{(c)}, \ldots, v_J^{(c)}\}$, where $v_n^{(k)}$ and $v_j^{(c)}$ are the $n$-th kernel node and the $j$-th selective node, respectively. The edges of the graph represent the association between the kernel nodes and the selective nodes. Specifically, an edge between $v_n^{(k)}$ and $v_j^{(c)}$ exists at time $t$ if the $n$-th kernel is assigned to the $j$-th selective node. The construction and refinement of the graph will be discussed in Section 3.3.

At each time slot, one of the selective nodes $v_j^{(c)}$ is chosen, and the subset of kernel nodes connected with $v_j^{(c)}$ is used for the instantaneous function approximation at time $t$ [cf. (11)]. Then, the loss $\mathcal{L}(\hat{f}_{\text{RF},n}(\mathbf{x}_t), y_t)$ is observed for every kernel in the chosen subset, and $\boldsymbol{\theta}_{n,t}$ is updated as

$$\boldsymbol{\theta}_{n,t+1} = \boldsymbol{\theta}_{n,t} - \eta\,\nabla\mathcal{L}\!\left(\boldsymbol{\theta}_{n,t}^{\top}\mathbf{z}_n(\mathbf{x}_t), y_t\right). \tag{12}$$

Let $w_{n,t}$ denote the weighting coefficient $w_n$ at time instant $t$. We leverage a multiplicative update for the weights $w_{n,t}$:

$$w_{n,t+1} = w_{n,t}\exp\!\left(-\frac{\eta\,\tilde{\ell}_{n,t}}{2^b}\right) \tag{13}$$

where $b = \lfloor \log_2 J \rfloor$, and $\tilde{\ell}_{n,t}$ denotes the importance sampling loss estimate (Alon et al., 2017)

$$\tilde{\ell}_{n,t} = \frac{\mathcal{L}(\hat{f}_{\text{RF},n}(\mathbf{x}_t), y_t)}{q_{n,t}}\,\mathbb{I}\{n \in S_t\} \tag{14}$$

which is the observed loss $\mathcal{L}(\hat{f}_{\text{RF},n}(\mathbf{x}_t), y_t)$ divided by the probability $q_{n,t}$ that the loss of the associated kernel is observed. The value of $q_{n,t}$ depends on how the graph is generated, and $\mathbb{I}\{\cdot\}$ denotes the indicator function. Since we initialize the value of $w_n$ with 1, the multiplicative update in (13) satisfies the constraint in (11). Upon updating the parameters $\boldsymbol{\theta}_{n,t+1}$ and $w_{n,t+1}$, the function approximation can be obtained via (7) and (8) using the chosen subset of kernel nodes connected to one of the selective nodes.

Then, the selective nodes are assigned weights $\{u_{j,t+1}\}$ according to the kernel node weights $\{w_{n,t+1}\}$. Indeed, each selective node's weight $u_{j,t+1}$ is the total sum of the weights of the kernel nodes connected to that selective node. Specifically, the weight of $v_j^{(c)}$ is obtained via

$$u_{j,t+1} = \sum_{n:\, v_n^{(k)} \sim v_j^{(c)}} w_{n,t+1}. \tag{15}$$
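The per-kernel updates (12)-(14) admit a compact implementation. The following is a minimal sketch (not the authors' code) under the illustrative choices of a least-squares loss $\mathcal{C}(f, y) = (f - y)^2$ and no regularizer ($\lambda = 0$); the caller is assumed to maintain $\boldsymbol{\theta}_n$, $w_n$, the RF vector $\mathbf{z}_n(\mathbf{x}_t)$, and the observation probability $q_{n,t}$.

```python
import numpy as np

def update_kernel(theta_n, w_n, z_n, y_t, q_n, observed, eta, b):
    """One OMKL-GF step for kernel n: gradient update (12) and weight update (13)-(14)."""
    loss_est = 0.0                                  # importance-weighted loss estimate (14)
    if observed:                                    # kernel n belongs to the chosen subset S_t
        pred = theta_n @ z_n                        # f_hat_{RF,n}(x_t) as in (8)
        loss_est = (pred - y_t) ** 2 / q_n          # observed loss scaled by 1/q_{n,t}
        grad = 2.0 * (pred - y_t) * z_n             # gradient of the least-squares loss
        theta_n = theta_n - eta * grad              # gradient descent step (12)
    w_n = w_n * np.exp(-eta * loss_est / 2 ** b)    # multiplicative weight update (13)
    return theta_n, w_n
```

For kernels outside $S_t$ the estimate $\tilde{\ell}_{n,t}$ is zero, so their weights are left unchanged, exactly as dictated by the indicator in (14).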
Note that the weight of each selective node is thus determined by its adjacent kernel nodes, and indicates the reliability of the corresponding kernel-based function estimates. Hence, the probability according to which a selective node is chosen in the next time slot can be updated as

$$p_{j,t+1} = (1-\eta_e)\frac{u_{j,t+1}}{U_{t+1}} + \frac{\eta_e}{J} \tag{16}$$

where $U_{t+1} := \sum_{j=1}^{J} u_{j,t+1}$, and $0 < \eta_e \le 1$ is the exploration rate. The term $\eta_e/J$ is introduced to trade off exploitation against exploration.

To sum up, each kernel is viewed as an expert, and at each time instant a subset of the function approximations provided by these experts is combined. In this regard, the RF approximation $\hat{f}_{\text{RF},n}(\mathbf{x}_t)$ can be viewed as the feedback provided by the $n$-th kernel node, and the proposed framework models the pruned kernel combination as feedback collected from a graph, where the feedback is combined only if the corresponding kernel node is connected to the chosen selective node. By doing this, the proposed approach trims irrelevant kernels in the dictionary, to both improve the function approximation accuracy and reduce the computational complexity of MKL. The graph refinement approach is proposed in the ensuing subsection.

### 3.3. Online Graph Refinement

Note that generating and refining the time-varying graph is of utmost importance, as it affects both the function approximation accuracy and the computational complexity. In this regard, a graph is successful if it can provide a subset of kernels which incurs as small a loss as possible. Moreover, considering the computational complexity, the graph should provide a limited number of kernels which attain minimum loss. To this end, we propose a graph generation method. In order to encourage exploration, the graph is generated in a stochastic manner.

Increasing the number of kernel nodes connected to $v_j^{(c)}$ increases the computational complexity of performing the function approximation when $v_j^{(c)}$ is chosen. Therefore, the graph generation algorithm should be designed to limit the number of kernel nodes connected to each selective node. Let $M$ denote the maximum number of kernel nodes connected to each selective node. The procedure to generate the graph $\mathcal{G}_t$ is presented in Algorithm 1. Let $\mathbf{A}_t$ represent the $N \times J$ sub-adjacency matrix between the two disjoint subsets $\{v_1^{(c)}, \ldots, v_J^{(c)}\}$ and $\{v_1^{(k)}, \ldots, v_N^{(k)}\}$. Also, $\mathbf{A}_t(n, j)$ denotes the entry in the $n$-th row and $j$-th column of $\mathbf{A}_t$; it is equal to 1 if $v_n^{(k)}$ is connected to $v_j^{(c)}$, and 0 otherwise.

**Algorithm 1** Generating Graph $\mathcal{G}_t$

- **Input:** kernels $\kappa_n$, $n = 1, \ldots, N$; exploration coefficient $\eta_e > 0$; maximum number $M$ of kernel nodes connected to each selective node; weighting coefficients $w_{n,t}$ for the kernels.
- **Initialize:** sub-adjacency matrix $\mathbf{A}_t = \mathbf{0}_{N \times J}$.
- **for** $j = 1, \ldots, J$ **do**
  - **for** $n = 1, \ldots, N$ **do**
    - Set $p_{t,j}^{(\kappa_n)} = (1-\eta_e^j)\,\frac{w_{n,t}}{\sum_{n'=1}^{N} w_{n',t}} + \frac{\eta_e^j}{N}$.
  - **end for**
  - **for** $k = 1, \ldots, M$ **do**
    - Draw one kernel node $v_n^{(k)}$ according to the PMF $\mathbf{p}_{t,j}^{(\kappa)} = (p_{t,j}^{(\kappa_1)}, \ldots, p_{t,j}^{(\kappa_N)})$.
    - Set $\mathbf{A}_t(n, j) = 1$.
  - **end for**
- **end for**

Each selective node $v_j^{(c)}$ draws kernel nodes $v_n^{(k)}$ in $M$ independent trials, and in each trial the selective node draws only one kernel node. We put more weight on kernels which incur smaller loss. The probability that selective node $v_j^{(c)}$ draws kernel node $v_n^{(k)}$ in a trial at time $t$ is

$$p_{t,j}^{(\kappa_n)} = (1-\eta_e^j)\,\frac{w_{n,t}}{\sum_{n'=1}^{N} w_{n',t}} + \frac{\eta_e^j}{N}. \tag{17}$$

Note that the first term in (17) discriminates between kernels based on their weights, which are determined by their loss in function approximation [cf. (13)]. Furthermore, the second term allows exploration over all kernel nodes. In particular, the selective node $v_j^{(c)}$ draws kernel nodes according to the uniform distribution if $\eta_e = 1$. Note also that $\eta_e^j$ is a non-increasing function of $j$ for $0 < \eta_e \le 1$: the selective node $v_1^{(c)}$ puts more weight on exploration than the others, while $v_J^{(c)}$ exploits more than all the other selective nodes. Therefore, the selective nodes entail different levels of exploration and exploitation.
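A minimal sketch (not the authors' code) of Algorithm 1 is given below: each selective node $j$ draws $M$ kernel nodes in independent trials with the per-trial probabilities $p_{t,j}^{(\kappa_n)}$ of (17); all sizes in the usage example are illustrative.

```python
import numpy as np

def generate_graph(w, J, M, eta_e, rng):
    """Return the N x J sub-adjacency matrix A_t given the kernel weights w (length N)."""
    N = len(w)
    A = np.zeros((N, J), dtype=int)
    for j in range(1, J + 1):                        # selective nodes v_1^(c), ..., v_J^(c)
        eps = eta_e ** j                             # eta_e^j: less exploration for larger j
        p = (1 - eps) * w / w.sum() + eps / N        # per-trial sampling PMF (17)
        for _ in range(M):                           # M independent trials
            n = rng.choice(N, p=p)
            A[n, j - 1] = 1                          # connect kernel node n to selective node j
    return A

# Example: N = 17 kernels, J = 2 selective nodes, at most M = 7 kernels per selective node.
rng = np.random.default_rng(0)
A_t = generate_graph(np.ones(17), J=2, M=7, eta_e=0.5, rng=rng)
print(A_t.sum(axis=0))   # number of distinct kernels connected to each selective node
```

Because the same kernel may be drawn more than once within the M trials, the number of distinct kernels attached to a selective node is at most M, which matches the complexity argument above.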
Based on the definition of $p_{t,j}^{(\kappa_n)}$ in (17), the probability that the $n$-th kernel node is connected to $v_j^{(c)}$ is $1 - (1 - p_{t,j}^{(\kappa_n)})^M$. In fact, $(1 - p_{t,j}^{(\kappa_n)})^M$ is the probability that the $n$-th kernel node is chosen by $v_j^{(c)}$ in none of the $M$ trials. Therefore, the probability of observing the loss of the $n$-th kernel at time $t$ is

$$q_{n,t} = \sum_{j=1}^{J} p_{j,t}\left[1 - \left(1 - p_{t,j}^{(\kappa_n)}\right)^{M}\right] \tag{18}$$

for $1 \le n \le N$. The value of $q_{n,t}$ is computed and used for the importance sampling loss estimate in (14). The graph generation framework is summarized in Algorithm 1.

At each time slot $t$, a graph $\mathcal{G}_t$ is generated and used for choosing a selective node, and hence a subset of the kernels. Then, the weights of the selected kernels are updated according to the loss [cf. (12) and (13)]. The graph is then refined as presented in Algorithm 1, resulting in a better graph and thereby better function approximation. Given the graph, the function approximation is carried out by choosing one of the selective nodes, which in turn selects a subset of kernels. Our proposed online multi-kernel learning with graph-structured feedback (OMKL-GF) is summarized in Algorithm 2.

**Algorithm 2** OMKL with Graph Feedback (OMKL-GF)

- **Input:** kernels $\kappa_n$, $n = 1, \ldots, N$; step size $\eta > 0$; number of random features $D$; and $b$ such that $J \in [2^b, 2^{b+1} - 1]$.
- **Initialize:** $\boldsymbol{\theta}_{n,1} = \mathbf{0}$, $w_{n,1} = 1$, $n = 1, \ldots, N$.
- **for** $t = 1, \ldots, T$ **do**
  - Receive one datum $\mathbf{x}_t$.
  - Generate $\mathcal{G}_t$ using Algorithm 1.
  - Set $u_{j,t} = \sum_{n:\, v_n^{(k)} \sim v_j^{(c)}} w_{n,t}$.
  - Obtain $p_{j,t}$ via (16).
  - Choose one selective node $v_j^{(c)}$ according to the PMF $\mathbf{p}_t = (p_{1,t}, \ldots, p_{J,t})$.
  - Predict $\hat{f}_t(\mathbf{x}_t) = \sum_{n\in S_t} \frac{w_{n,t}}{\sum_{m\in S_t} w_{m,t}}\,\hat{f}_{\text{RF},n}(\mathbf{x}_t)$ with $\hat{f}_{\text{RF},n}(\mathbf{x}_t)$ in (8).
  - Obtain the loss $\mathcal{L}(\hat{f}_{\text{RF},n}(\mathbf{x}_t), y_t)$ for all $n \in S_t$.
  - Update $\boldsymbol{\theta}_{n,t+1}$ via (12) for all $n \in S_t$.
  - Update $w_{n,t+1}$ via (13).
- **end for**

**Memory Requirement.** At time instant $t$, OMKL-GF needs to store a real $2D \times 1$ random feature vector in addition to a weighting vector for each kernel, together with a weight for each selective node. As the number of kernels is in general larger than the number of selective nodes, the required memory is of order $\mathcal{O}(dDN)$.

**Computational Complexity.** The per-iteration complexity of OMKL-GF (e.g., calculating inner products) is $\mathcal{O}(dDM + JN)$. In comparison, the per-iteration complexity of the OMKR developed in (Sahoo et al., 2014) is $\mathcal{O}(tdN)$, while the more contemporary online RF-based OMKL approaches proposed in (Sahoo et al., 2019; Shen et al., 2019) both have per-iteration complexity $\mathcal{O}(dDN)$. Hence, OMKL-GF can significantly reduce the per-iteration complexity, especially when $J \le M \ll N$.
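Putting the pieces together, one round of Algorithm 2 can be sketched as follows (a minimal sketch, not the authors' code; it reuses the hypothetical helpers `rf_map`, `generate_graph` and `update_kernel` from the earlier sketches, and assumes a least-squares loss).

```python
import numpy as np

def omkl_gf_step(x_t, y_t, theta, w, V_list, J, M, eta, eta_e, b, rng):
    """One OMKL-GF iteration: build G_t, pick a selective node, predict, then update."""
    N = len(w)
    A = generate_graph(w, J, M, eta_e, rng)                    # Algorithm 1
    u = A.T @ w                                                # selective-node weights (15)
    p_sel = (1 - eta_e) * u / u.sum() + eta_e / J              # selection PMF (16)
    j = rng.choice(J, p=p_sel)                                 # chosen selective node
    S_t = np.flatnonzero(A[:, j])                              # kernels used at time t

    Z = [rf_map(x_t[None, :], V_list[n])[0] for n in range(N)] # RF vectors z_n(x_t)
    w_bar = w[S_t] / w[S_t].sum()                              # normalized weights
    y_hat = sum(wb * (theta[n] @ Z[n]) for wb, n in zip(w_bar, S_t))  # prediction, cf. (7)-(8)

    # Observation probabilities q_{n,t} as in (18), combining (16) and (17).
    P = np.array([(1 - eta_e ** (jj + 1)) * w / w.sum() + eta_e ** (jj + 1) / N
                  for jj in range(J)])                         # J x N matrix of (17)
    q = p_sel @ (1.0 - (1.0 - P) ** M)

    for n in range(N):                                         # per-kernel updates (12)-(14)
        theta[n], w[n] = update_kernel(theta[n], w[n], Z[n], y_t, q[n],
                                       observed=(n in S_t), eta=eta, b=b)
    return y_hat, theta, w
```

Here `theta` is a list of $N$ vectors of length $2D$, `w` an array of $N$ kernel weights, and `V_list` the per-kernel frequency draws. In practice the RF vectors of unselected kernels need not be formed, which is what yields the $\mathcal{O}(dDM + JN)$ per-iteration cost discussed above.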
**Comparison with Online Expert Learning.** Note that the updates resemble those in online learning with expert advice (Cesa-Bianchi et al., 1997; Littlestone & Warmuth, 1994; Vovk, 1998) and in online expert learning with graph-structured feedback (Alon et al., 2015; 2017; Cortes et al., 2019; Liu et al., 2018; Mannor & Shamir, 2011), where each kernel can be viewed as an expert. However, relative to the generic expert advice problem, there are three innovative differences in OMKL-GF: i) the kernel-based function estimator itself performs an efficient online learning scheme for self-improvement; ii) unlike conventional online learning with expert advice, which combines feedback from all experts, OMKL-GF only collects feedback from a subset of experts based on a graph; iii) we propose an adaptive scheme to actively refine the feedback graph based on the incurred online loss.

**Comparison with Raker (Shen et al., 2019).** Comparing OMKL-GF with Raker (Shen et al., 2019), it can be readily observed that both algorithms rely on the RF approximation to make online kernel-based learning more scalable. While Raker employs all kernels in the dictionary for the function approximation, our proposed OMKL-GF chooses a time-varying subset of kernels at each time instant by adaptively pruning irrelevant kernels. Experiments on real datasets are presented in Section 5 to show that OMKL-GF can attain lower MSE and execution time in comparison with Raker by actively choosing a subset of kernels.

## 4. Regret Analysis

In order to analyze the performance of OMKL-GF, we assume that the following conditions hold:

**(a1)** At each time instant $t$, the loss function $\mathcal{L}(\boldsymbol{\theta}_{n,t}^{\top}\mathbf{z}_n(\mathbf{x}_t), y_t)$ is convex with respect to $\boldsymbol{\theta}_{n,t}$.

**(a2)** For $\boldsymbol{\theta}$ in a bounded set $\Theta$ with $\|\boldsymbol{\theta}\| \le C_\theta$, the loss is bounded as $\mathcal{L}(\boldsymbol{\theta}_{n,t}^{\top}\mathbf{z}_n(\mathbf{x}_t), y_t) \in [0, 1]$; moreover, the loss has bounded gradient, i.e., $\|\nabla\mathcal{L}(\boldsymbol{\theta}_{n,t}^{\top}\mathbf{z}_n(\mathbf{x}_t), y_t)\| \le L$.

**(a3)** The kernels $\{\kappa_n\}_{n=1}^{N}$ are shift-invariant, standardized, and bounded, that is, $\kappa_n(\mathbf{x}_i, \mathbf{x}_j) \le 1$, $\forall \mathbf{x}_i, \mathbf{x}_j$. Also, each datum is bounded, i.e., $\|\mathbf{x}_t\| \le 1$.

Note that (a1) is satisfied by many convex loss functions, such as the least-squares loss. Furthermore, (a2) states that the losses are bounded and $L$-Lipschitz continuous. In addition, a number of kernels satisfy (a3), e.g., Gaussian, Laplacian and Cauchy kernels (Rahimi & Recht, 2007). Generally, the above conditions are standard in online convex optimization (Hazan, 2016) and kernel learning (Rahimi & Recht, 2007; Shen et al., 2019).

We consider the stochastic regret, which is a typical criterion for analyzing online convex optimization schemes (Hazan, 2016). Specifically, it measures the difference between the expected aggregate loss of the online algorithm and that of the best function approximant in hindsight. Given a sequence of approximations $\hat{f}_t(\cdot)$ obtained by an algorithm $\mathcal{A}$, we have

$$\mathbb{E}\!\left[\mathrm{R}_s^{\mathcal{A}}(T)\right] = \sum_{t=1}^{T}\mathbb{E}\!\left[\mathcal{L}(\hat{f}_t(\mathbf{x}_t), y_t)\right] - \sum_{t=1}^{T}\mathcal{L}(f^*(\mathbf{x}_t), y_t) \tag{19}$$

where $f^*(\cdot)$ denotes the best function approximant in hindsight, which can be obtained as

$$f^*(\cdot) \in \arg\min_{f_n^*,\; n\in\{1,\ldots,N\}} \sum_{t=1}^{T}\mathcal{L}(f_n^*(\mathbf{x}_t), y_t) \tag{20a}$$

$$f_n^* \in \arg\min_{f\in\mathcal{H}_n} \sum_{t=1}^{T}\mathcal{L}(f(\mathbf{x}_t), y_t) \tag{20b}$$

where $\mathcal{H}_n$ denotes the RKHS induced by $\kappa_n$. Thus, the stochastic static regret can be written as

$$\mathbb{E}\!\left[\mathrm{R}_s^{\mathcal{A}}(T)\right] = \sum_{t=1}^{T}\mathbb{E}\!\left[\mathcal{L}(\hat{f}_t(\mathbf{x}_t), y_t)\right] - \min_{n\in\{1,\ldots,N\}}\sum_{t=1}^{T}\mathcal{L}(f_n^*(\mathbf{x}_t), y_t). \tag{21}$$

Note that in this paper, $\mathbb{E}[\cdot]$ at time $t$ denotes the conditional expectation given $\{\mathcal{L}(\hat{f}_\tau(\mathbf{x}_\tau), y_\tau)\}_{\tau=1}^{t-1}$.

In order to analyze the regret of OMKL-GF, we first establish an intermediate result in the following lemma.

**Lemma 1.** Under (a1), (a2), and with $\mathcal{F}_n = \{\hat{f}_n \,|\, \hat{f}_n(\mathbf{x}) = \boldsymbol{\theta}^{\top}\mathbf{z}_n(\mathbf{x}),\ \boldsymbol{\theta}\in\mathbb{R}^{2D}\}$, the regret of the proposed algorithm satisfies

$$\sum_{t=1}^{T}\mathbb{E}\!\left[\mathcal{L}(\hat{f}_t(\mathbf{x}_t), y_t)\right] - \sum_{t=1}^{T}\mathcal{L}(\hat{f}_n^*(\mathbf{x}_t), y_t) < \frac{2^b\ln N}{\eta} + \frac{\|\boldsymbol{\theta}_n^*\|^2 NJ}{2\eta\,\eta_e^2} + \frac{\eta L^2 T}{2} + \eta_e JT + \frac{\eta N^2 JT}{2^{b+1}\eta_e^2} \tag{22}$$

where $\boldsymbol{\theta}_n^*$ is associated with the best RF function approximant $\hat{f}_n^*(\mathbf{x}) = \boldsymbol{\theta}_n^{*\top}\mathbf{z}_n(\mathbf{x})$ in $\mathcal{F}_n$.
The next theorem further characterizes the loss of OMKL-GF relative to the best functional estimator in the RKHS.

**Theorem 2.** Under (a1)-(a3) and with $f_n^*$ belonging to the RKHS $\mathcal{H}_n$ as in (20b), the following bound holds with probability at least $1 - 2^8\left(\frac{\sigma_n}{\epsilon}\right)^2\exp\!\left(-\frac{D\epsilon^2}{4d+8}\right)$ for any $\epsilon > 0$:

$$\sum_{t=1}^{T}\mathbb{E}\!\left[\mathcal{L}(\hat{f}_t(\mathbf{x}_t), y_t)\right] - \min_{n\in\{1,\ldots,N\}}\sum_{t=1}^{T}\mathcal{L}(f_n^*(\mathbf{x}_t), y_t) < \frac{2^b\ln N}{\eta} + \frac{NJ(1+\epsilon)C^2}{2\eta\,\eta_e^2} + \frac{\eta L^2 T}{2} + \eta_e JT + \frac{\eta N^2 JT}{2^{b+1}\eta_e^2} + \epsilon L T C \tag{23}$$

where $C$ is a constant, and $\sigma_n^2$ is the second-order moment of the RF vector norm, defined as $\sigma_n^2 := \mathbb{E}_{\pi_{\kappa_n}}[\|\mathbf{v}\|^2]$.

According to Theorem 2, by setting $\eta = \epsilon = \mathcal{O}(1/\sqrt{T})$ and $\eta_e = \mathcal{O}(T^{-1/6})$ in (23), the stochastic static regret in (19) satisfies $\mathbb{E}[\mathrm{R}_s^{\mathcal{A}}(T)] = \mathcal{O}(T^{5/6})$. Thus, by selecting appropriate parameters, our proposed OMKL-GF achieves sublinear regret in expectation with respect to the best static function approximant in (19).

Note that while the proper setting of $\epsilon$ and $\eta$ relies on knowledge of $T$, such information may not be necessary, via employing, e.g., the doubling trick (Cesa-Bianchi & Lugosi, 2006). Considering (23), the probability $1 - 2^8(\sigma_n/\epsilon)^2\exp(-D\epsilon^2/(4d+8))$ is an increasing function of $D$, so that for a fixed $\epsilon$ there always exist values of $D$ resulting in a positive probability. Furthermore, (23) shows that by setting $D = \mathcal{O}(T\log(T))$, a sublinear regret can be obtained with high probability. In addition, more recent results on the RF approximation, e.g., (Rudi & Rosasco, 2017), can also be readily applied, showing that a sublinear regret can be obtained with $\mathcal{O}(\sqrt{T}\log(T))$ random features.

## 5. Experiments

In this section, the proposed OMKL-GF framework is tested in terms of accuracy and computational complexity. We evaluate OMKL-GF in comparison with the following kernel learning baselines:

- **OMKR:** online MKL Hedge algorithm developed in (Sahoo et al., 2014).
- **OMKR-FA:** random feature-based online MKL via function approximation (Sahoo et al., 2019).
- **Raker:** random feature-based online MKL (Shen et al., 2019).

The mean-square error (MSE) is used to evaluate the accuracy of the MKL algorithms when performing online regression. For the random feature-based MKL algorithms, including the proposed OMKL-GF, Raker and OMKR-FA, we generate 50 different sets of random features, and the mean value of the MSE over all sets of random features is reported. Thus, in this section we compute the MSE at time instant $t$ as

$$\text{MSE}_t = \frac{1}{N_{\text{MSE}}}\sum_{i=1}^{N_{\text{MSE}}}\frac{1}{t}\sum_{\tau=1}^{t}(\hat{y}_\tau - y_\tau)^2 \tag{24}$$

where $\hat{y}_\tau$ is the approximation of $y_\tau$ provided by the MKL algorithm, and $N_{\text{MSE}}$ denotes the number of runs over which the MSE is computed. For the proposed OMKL-GF, Raker and OMKR-FA the value of $N_{\text{MSE}}$ is set to 50, whereas for OMKR it is set to 1. The number of random features $D$ is fixed to 50 for all random feature-based MKL algorithms. Also, the regularization coefficient $\lambda$ is set to $10^{-3}$. In addition, for the MKL algorithms we consider a dictionary of 17 radial basis function (RBF) kernels with different bandwidths; letting $\sigma_{i,\text{dic}}^2$ denote the variance of the $i$-th RBF kernel in the dictionary, it is set to $\sigma_{i,\text{dic}}^2 = 10^{i-9}$ for $1 \le i \le 17$. Moreover, for all MKL algorithms, the step size $\eta$ and the exploration rate $\eta_e$ are set to $1/\sqrt{t}$.
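The evaluation metric (24) is a running average over time, further averaged over the $N_{\text{MSE}}$ random feature draws. A minimal sketch (not the authors' evaluation code) is given below; `run_algorithm` is a hypothetical callable returning the per-step predictions of one RF-based run for a given seed.

```python
import numpy as np

def running_mse(y, run_algorithm, n_mse=50):
    """Compute MSE_t of (24) for t = 1, ..., T, averaged over n_mse random feature draws."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    acc = np.zeros(T)
    for seed in range(n_mse):                              # one run per set of random features
        y_hat = np.asarray(run_algorithm(seed))            # predictions y_hat_1, ..., y_hat_T
        sq_err = (y_hat - y) ** 2
        acc += np.cumsum(sq_err) / np.arange(1, T + 1)     # (1/t) sum_{tau <= t} (y_hat - y)^2
    return acc / n_mse
```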
Algorithm 2 requires observing the graph in order to choose a subset of kernels, and the graph $\mathcal{G}_t$ is generated by Algorithm 1 and then disclosed. However, generating the graph at every time instant increases the computational complexity of OMKL-GF while it does not improve the MSE considerably. To further decrease the computational complexity of OMKL-GF, we therefore consider the following scenario: the graph is regenerated only until a target loss level is reached. Specifically, the graph is no longer generated at time instants $\tau_{\text{ter}}+1, \ldots, T$ if the condition $(\hat{y}_{\tau_{\text{ter}}} - y_{\tau_{\text{ter}}})^2 < 10^{-4}$ is met at time instant $\tau_{\text{ter}}$.

The performance of the MKL algorithms is tested for online regression over the following real datasets downloaded from the UCI Machine Learning Repository.

**Air Quality.** The dataset contains 9358 instances of hourly averaged responses from an array of 5 sensors located in the field in a significantly polluted area. Data samples contain 13 features, which are averaged sensor responses. The goal is to predict $y_t$, the concentration of a polluting chemical in the air (Vito et al., 2008).

**Istanbul Stock Exchange.** The data is organized with regard to working days of the Istanbul Stock Exchange and contains 536 instances with 7 features, including stock market return indices. The goal is to predict the Istanbul Stock Exchange national 100 index (Akbilgic et al., 2014).

**Twitter.** This dataset contains 14000 samples from the micro-blogging platform Twitter with 77 features, including the length of the discussion on a given topic and the number of new interactive authors. Here, $y_t$ represents the average number of active discussions (popularity) on a certain topic (Kawala et al., 2013).

**Tom's Hardware.** The dataset contains 10000 samples from a worldwide new-technology forum with 96 features, including the number of discussions on a topic. Here, $y_t$ represents the average number of displays of a certain topic on Tom's Hardware (Kawala et al., 2013).

**Naval Propulsion Plants.** The data has been generated from a sophisticated simulator of gas turbines. The dataset contains 11934 samples with 15 features, including ship speed. Here, $y_t$ represents the lever position (Coraddu et al., 2016).

Tables 1, 2, 3 and 4 list the performance of the MKL algorithms in terms of both MSE and execution time for each dataset.

*Table 1. MSE (×10⁻³) and execution time of MKL algorithms for the Air Quality dataset when $\eta = 1/\sqrt{t}$.*

| Algorithm | M | J | MSE (×10⁻³) | Execution time (s) |
|---|---|---|---|---|
| OMKR | - | - | 3.3 | 2533.96 |
| OMKR-FA | - | - | 41.2 | 1.67 |
| Raker | - | - | 4.7 | 2.88 |
| OMKL-GF | 1 | 1 | 8.1 | 0.74 |
| OMKL-GF | 7 | 1 | 4.2 | 1.47 |
| OMKL-GF | 10 | 1 | 3.9 | 1.68 |
| OMKL-GF | 17 | 1 | 4.1 | 2.56 |
| OMKL-GF | 1 | 2 | 7.7 | 0.75 |
| OMKL-GF | 1 | 4 | 11.2 | 0.78 |
| OMKL-GF | 7 | 2 | 4.6 | 1.46 |
| OMKL-GF | 7 | 4 | 6.8 | 1.67 |

*Table 2. MSE (×10⁻³) and execution time of MKL algorithms for the Istanbul Stock Exchange dataset when $\eta = 1/\sqrt{t}$.*

| Algorithm | M | J | MSE (×10⁻³) | Execution time (s) |
|---|---|---|---|---|
| OMKR | - | - | 10.0 | 17.16 |
| OMKR-FA | - | - | 187.8 | 0.11 |
| Raker | - | - | 11.3 | 0.19 |
| OMKL-GF | 1 | 1 | 61.9 | 0.05 |
| OMKL-GF | 7 | 1 | 13.3 | 0.12 |
| OMKL-GF | 10 | 1 | 12.2 | 0.14 |
| OMKL-GF | 17 | 1 | 11.3 | 0.18 |
| OMKL-GF | 1 | 2 | 38.5 | 0.05 |
| OMKL-GF | 1 | 4 | 38.4 | 0.05 |
| OMKL-GF | 7 | 2 | 12.9 | 0.17 |
| OMKL-GF | 7 | 4 | 15.2 | 0.15 |

*Table 3. MSE (×10⁻³) and execution time of MKL algorithms for the Twitter dataset when $\eta = 1/\sqrt{t}$.*

| Algorithm | M | J | MSE (×10⁻³) | Execution time (s) |
|---|---|---|---|---|
| OMKR | - | - | 3.3 | 9099.95 |
| OMKR-FA | - | - | 29.6 | 4.06 |
| Raker | - | - | 5.3 | 6.06 |
| OMKL-GF | 1 | 1 | 16.5 | 1.46 |
| OMKL-GF | 7 | 1 | 5.3 | 2.80 |
| OMKL-GF | 10 | 1 | 4.7 | 4.03 |
| OMKL-GF | 17 | 1 | 4.6 | 5.17 |
| OMKL-GF | 1 | 2 | 12.2 | 1.50 |
| OMKL-GF | 1 | 4 | 14.4 | 1.56 |
| OMKL-GF | 7 | 2 | 5.1 | 3.94 |
| OMKL-GF | 7 | 4 | 8.7 | 3.49 |
*Table 4. MSE (×10⁻³) and execution time of MKL algorithms for the Tom's Hardware dataset when $\eta = 1/\sqrt{t}$.*

| Algorithm | M | J | MSE (×10⁻³) | Execution time (s) |
|---|---|---|---|---|
| OMKR | - | - | 2.2 | 4474.20 |
| OMKR-FA | - | - | 21.4 | 3.04 |
| Raker | - | - | 4.6 | 4.64 |
| OMKL-GF | 1 | 1 | 12.3 | 1.03 |
| OMKL-GF | 7 | 1 | 3.7 | 2.26 |
| OMKL-GF | 10 | 1 | 3.5 | 2.65 |
| OMKL-GF | 17 | 1 | 3.7 | 3.67 |
| OMKL-GF | 1 | 2 | 12.1 | 1.07 |
| OMKL-GF | 1 | 4 | 13.0 | 1.08 |
| OMKL-GF | 7 | 2 | 5.2 | 2.39 |
| OMKL-GF | 7 | 4 | 8.5 | 2.46 |

Note that while OMKR provides better accuracy than the RF-based alternatives, it is also much more computationally complex, since it does not resort to the RF approximation. It can be observed that the proposed OMKL-GF outperforms OMKR-FA in terms of accuracy. Also, the results show that even when the maximum number of kernels that can be chosen is one ($M = 1$), OMKL-GF achieves higher accuracy with lower execution time than OMKR-FA. Furthermore, Tables 1, 2, 3 and 4 show that the MSE obtained by OMKL-GF is comparable to that of Raker, while OMKL-GF attains it with lower execution time. It can also be seen that for $M = 7$, 10 and 17, the proposed OMKL-GF provides more accurate function approximations than Raker. In addition, OMKL-GF is much more scalable than OMKR, and the OMKL-GF framework reduces the execution time of the function approximation considerably while preserving comparatively accurate estimates.

The results in Tables 1, 2, 3 and 4 also demonstrate that, for the same value of $M$, OMKL-GF achieves lower MSE by decreasing $J$, the number of selective nodes, in virtually all cases. It can also be observed that increasing the value of $M$ reduces the MSE obtained by OMKL-GF in most cases. On the other hand, a larger $M$ (i.e., the maximum number of kernels chosen at each time) leads to a higher execution time as well. Likewise, increasing the number of selective nodes leads to an increase in the execution time.

*Figure 1. MSE of MKL algorithms over time for the Naval Propulsion Plants dataset.*

*Table 5. MSE (×10⁻³) and execution time of MKL algorithms for the Naval Propulsion Plants dataset when $\eta = 1/\sqrt{t}$.*

| Algorithm | M | J | MSE (×10⁻³) | Execution time (s) |
|---|---|---|---|---|
| OMKR | - | - | 1.0 | 9690.24 |
| OMKR-FA | - | - | 294.9 | 2.42 |
| Raker | - | - | 1.4 | 4.14 |
| OMKL-GF | 1 | 1 | 20.1 | 1.24 |
| OMKL-GF | 10 | 1 | 1.4 | 2.49 |
| OMKL-GF | 1 | 3 | 18.6 | 1.24 |
| OMKL-GF | 10 | 3 | 1.7 | 2.62 |

Furthermore, Fig. 1 illustrates the MSE of the competing algorithms over time for the Naval Propulsion Plants dataset. As can be observed from Fig. 1, the proposed OMKL-GF provides lower MSE than OMKR-FA. When $M = 10$ and $J = 1$, at almost all time indices the proposed OMKL-GF attains an MSE similar to that of Raker. Also, Fig. 1 shows that increasing the number of selective nodes from 1 to 3 leads to an increase in MSE for OMKL-GF. The execution time of all algorithms for the Naval Propulsion Plants dataset is reported in Table 5, which shows that OMKL-GF runs faster than Raker while still providing comparable MSE. This validates the effectiveness of the proposed graph-based kernel selection scheme.

## 6. Conclusion

The present paper developed an online MKL approach for nonlinear function approximation based on the random feature approximation. By generating graph feedback, the proposed algorithmic scheme chooses a subset of relevant kernels based on the losses observed at prior time instants. By choosing a subset of relevant kernels, the proposed approach trims irrelevant kernels to enhance the accuracy and reduce the computational complexity. We proved that the proposed approach achieves sublinear regret in expectation. Numerical tests on several real datasets reveal the merits of the proposed algorithmic framework in comparison with other online MKL benchmarks.
## References

Akbilgic, O., Bozdogan, H., and Balaban, M. E. A novel hybrid RBF neural networks model as a forecaster. Statistics and Computing, 24(3):365–375, May 2014.

Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T. Online learning with feedback graphs: Beyond bandits. In Proceedings of the Conference on Learning Theory, volume 40, pp. 23–35, Jul 2015.

Alon, N., Cesa-Bianchi, N., Gentile, C., Mannor, S., Mansour, Y., and Shamir, O. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785–1826, 2017.

Asratian, A. S., Denley, T. M. J., and Häggkvist, R. Bipartite Graphs and Their Applications. Cambridge University Press, 1998.

Bouboulis, P., Chouvardas, S., and Theodoridis, S. Online distributed learning over networks in RKH spaces using random Fourier features. IEEE Transactions on Signal Processing, 66(7):1920–1932, Apr 2018.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, USA, 2006.

Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., and Warmuth, M. K. How to use expert advice. Journal of the ACM, 44(3):427–485, May 1997.

Coraddu, A., Oneto, L., Ghio, A., Savio, S., Anguita, D., and Figari, M. Machine learning approaches for improving condition-based maintenance of naval propulsion plants. Proceedings of the Institution of Mechanical Engineers, Part M: Journal of Engineering for the Maritime Environment, 230(1):136–153, 2016.

Cortes, C., Mohri, M., and Talwalkar, A. On the impact of kernel approximation on learning accuracy. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 9, pp. 113–120, May 2010.

Cortes, C., Desalvo, G., Gentile, C., Mohri, M., and Yang, S. Online learning with sleeping experts and feedback graphs. In Proceedings of the International Conference on Machine Learning, pp. 1370–1378, Jun 2019.

Hazan, E. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

Kawala, F., Douzal-Chouakria, A., Gaussier, E., and Dimert, E. Prédictions d'activité dans les réseaux sociaux en ligne. In 4ième Conférence sur les Modèles et l'Analyse des Réseaux: Approches Mathématiques et Informatiques, pp. 16, France, October 2013.

Littlestone, N. and Warmuth, M. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

Liu, F., Buccapatnam, S., and Shroff, N. B. Information directed sampling for stochastic bandits with graph feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, Feb 2018.

Lu, J., Hoi, S. C., Wang, J., Zhao, P., and Liu, Z.-Y. Large scale online kernel learning. Journal of Machine Learning Research, 17(47):1–43, 2016.

Lu, Q., Karanikolas, G., Shen, Y., and Giannakis, G. B. Ensemble Gaussian processes with spectral features for online interactive learning with scalability. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 108, pp. 1910–1920, Aug 2020.

Mannor, S. and Shamir, O. From bandits to experts: On the value of side-observations. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 684–692, Dec 2011.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 1177–1184, Dec 2007.

Rudi, A. and Rosasco, L. Generalization properties of learning with random features. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 3218–3228, 2017.
Sahoo, D., Hoi, S. C., and Li, B. Online multiple kernel regression. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 293–302, 2014.

Sahoo, D., Hoi, S. C. H., and Li, B. Large scale online multiple kernel regression with application to time-series prediction. ACM Transactions on Knowledge Discovery from Data, 13(1), Jan 2019.

Scholkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.

Shahrampour, S. and Tarokh, V. Learning bounds for greedy approximation with explicit feature maps from multiple kernels. In Advances in Neural Information Processing Systems, pp. 4690–4701, Dec 2018.

Shawe-Taylor, J. and Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.

Shen, Y., Chen, T., and Giannakis, G. Online ensemble multi-kernel learning adaptive to non-stationary and adversarial environments. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 84, pp. 2037–2046, Apr 2018.

Shen, Y., Chen, T., and Giannakis, G. B. Random feature-based online multi-kernel learning in environments with unknown dynamics. Journal of Machine Learning Research, 20(1):773–808, Jan 2019.

Vito, S. D., Massera, E., Piga, M., Martinotto, L., and Francia, G. D. On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sensors and Actuators B: Chemical, 129(2):750–757, 2008.

Vovk, V. A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2):153–173, Apr 1998.

Wahba, G. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, 1990.

Williams, C. K. I. and Seeger, M. Using the Nyström method to speed up kernel machines. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 661–667, Jan 2000.

Zhang, X. and Liao, S. Incremental randomized sketching for online kernel learning. In Proceedings of the International Conference on Machine Learning, pp. 7394–7403, Jun 2019.