# graphaided_online_multikernel_learning__ddb77636.pdf Journal of Machine Learning Research 24 (2023) 1-44 Submitted 8/21; Published 1/23 Graph-Aided Online Multi-Kernel Learning Pouya M. Ghari pmollaeb@uci.edu Department of Electrical Engineering and Computer Science University of California Irvine, CA 92697, USA Yanning Shen yannings@uci.edu Department of Electrical Engineering and Computer Science University of California Irvine, CA 92697, USA Editor: Karsten Borgwardt Multi-kernel learning (MKL) has been widely used in learning problems involving function learning tasks. Compared with single kernel learning approach which relies on a pre-selected kernel, the advantage of MKL is its flexibility results from combining a dictionary of kernels. However, inclusion of irrelevant kernels in the dictionary may deteriorate the accuracy of MKL, and increase the computational complexity. Faced with this challenge, a novel graph-aided framework is developed to select a subset of kernels from the dictionary with the assistance of a graph. Different graph construction and refinement schemes are developed based on incurred losses or kernel similarities to assist the adaptive selection process. Moreover, to cope with the scenario where data may be collected in a sequential fashion, or cannot be stored in batch due to the massive scale, random feature approximation are adopted to enable online function learning. It is proved that our proposed algorithms enjoy sub-linear regret bounds. Experiments on a number of real datasets showcase the advantages of our novel graph-aided algorithms compared to state-of-the-art alternatives. 1 Keywords: Multi-Kernel Learning, Graphs, Random Features, Function Approximation, Online Learning 1. Introduction The need for function approximation arises in many machine learning studies including regression, classification, and reinforcement learning, see e.g., (Chung et al., 2019). This paper studies supervised function approximation where given data samples {(xt, yt)}T t=1, the goal is to find the function f( ), such that the difference between f(xt) and yt is minimized. In this context, kernel learning methods exhibit reliable performance. Specifically, the function approximation problem becomes tractable under the assumption that f( ) belongs to a reproducing kernel Hilbert space (Scholkopf and Smola, 2001). In some cases, it is imperative to perform function approximation task in an online fashion. For instance, when the volume of data is large and is collected in a sequential fashion, it is impossible to store 1. Preliminary results of this work were presented in part at the 2020 International Conference on Machine Learning (Ghari and Shen, 2020). The work in this paper is supported by NSF ECCS 2207457. Corresponding author: Yanning Shen. 2023 Pouya M. Ghari and Yanning Shen. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-0877.html. Ghari and Shen or process it in batch. Furthermore, suffering from the well-known problem of curse of dimensionality (Bengio et al., 2006; Shawe-Taylor and Cristianini, 2004), kernel learning methods are not suitable for sequential settings. This has motivated studies on online single kernel learning (Lu et al., 2016; Ding et al., 2017; Bouboulis et al., 2018; Zhang and Liao, 2019) to address the curse of dimensionality. Specifically, approximating kernels by finite-dimensional feature representations such as random Fourier feature by Rahimi and Recht (2007) and Nystr om method by Williams and Seeger (2000), function approximation task becomes scalable. Furthermore, kernel approximation with finite-dimensional features has been extensively studied by e.g. Sriperumbudur and Szab o (2015); Rudi and Rosasco (2017); Shahrampour and Tarokh (2018); Ding et al. (2020). Most of prior studies rely on a pre-selected kernel; however, such selection requires prior information which may not be available. By contrast, utilizing a dictionary of multiple kernel in lieu of a pre-selected kernel provides more flexible approach to obtain more accurate function approximations as it can learn combination of kernels (Sonnenburg et al., 2006; Kloft et al., 2011). Multiple kernel learning successfully has been employed in many learning methods as well as practical applications including cross domain learning (Duan et al., 2012) and computer vision applications (Bucak et al., 2014). To utilize the merits of multi-kernel learning several algorithms have been emerged (see e.g. (Rakotomamonjy et al., 2008; Cortes et al., 2009; G onen and Alpaydin, 2011)) which exhibit well-documented advantages compared to their single kernel learning counterparts. However, the mentioned algorithms are suitable to apply to batch kernel learning cases and are less efficient or even intractable when it comes to performing kernel learning in an online fashion. In order to make multiple kernel learning amenable for online function approximation, several algorithms have been proposed in the literature (see e.g. (Hoi et al., 2013; Sahoo et al., 2014)). However, the aforementioned algorithms suffer from the curse of dimensionality and are not scalable to deal with large volume of data. Enabled by the random feature approximation by Rahimi and Recht (2007), scalable online multi-kernel learning algorithms have been developed by Sahoo et al. (2019); Shen et al. (2019); Ghari and Shen (2020). In particular, the aforementioned algorithms perform function approximation by learning the linear combination of random feature kernel approximations. One of the most important challenges of MKL is the proper selection of kernels in the dictionary, which influences both computational complexity and accuracy of the function approximation significantly. However, selecting an appropriate kernel dictionary requires prior information. When such information is not available, one solution is to include a large number of kernels in the dictionary. In this case, employing all kernels in the dictionary may not be a feasible choice. Data-driven selection of subset of kernels in a given dictionary can alleviate the computational complexity. Furthermore, data-driven subset selection of kernels can enhance the accuracy of function approximation by pruning irrelevant kernels. The goal of the present paper is to select a subset of kernels in a given dictionary at each time instant in order to alleviate the computational complexity and improve function approximation accuracy. To this end, our proposed algorithms construct a graph whose vertices represent kernels such that a subset of kernels is selected based on the structure of the graph. In this case, function approximations given by the chosen subset of kernels can be viewed as feedback collected from a graph which is called feedback graph. Relative to existing online multi-kernel learning approaches, our novelty can be summarized as follows: Graph-Aided Online Multi-Kernel Learning c1) Different from existing works which employ all kernels in the dictionary, while only learning the combination coefficients, our proposed algorithms only use a subset of kernels at each time instant according to a feedback graph. c2) An adaptive and disciplined framework is developed to construct a feedback graph at each time instant according to the loss incurred by kernel-based approximants. A novel OMKL algorithm is proposed to select kernels according to the graph-structured feedback (OMKL-GF) which achieves sublinear regret. c3) Construction of the feedback graph at each time instant increases the computational burden of multi-kernel learning. To address this issue, a similarity feedback graph is constructed based on the similarity among kernels, which does not require observing the data samples beforehand. The resulting algorithm is called Online Multi-Kernel Learning with Similarity-based Feedback Graph (OMKL-SFG). It is proved that the proposed OMKL-SFG achieves a sub-linear regret. c4) A novel algorithm called OMKL-SFG-R is proposed to adaptively refine the structure of the similarity-based feedback graph on the fly. It is proved that the OMKL-SFG-R enjoys the sublinear regret tighter than OMKL-SFG and OMKL-GF. c5) Experiments on real datasets showcase the effectiveness of our proposed algorithms in comparison with other state-of-the-art OMKL baselines. The remainder of this paper is organized as follows. Section 2 discusses preliminaries of random-feature based multi-kernel learning. Section 3 presents the proposed OMKL-GF algorithm and its regret analysis. Furthermore, section 4 presents the OMKL-SFG and OMKL-SFG-R algorithms along with their theoretical performance in terms of regret. Experimental results are provided in section 5 to study performance of MKL algorithms on several real datasets. Finally, section 6 concludes the paper. 2. Preliminaries Given samples (x1, y1), , (x T , y T ), with xt Rd and yt R, the function approximation problem can be written as the following optimization problem min f H 1 T C(f(xt), yt) + λΩ( f 2 H) (1) where C( , ) denotes the cost function, which is specified according to the learning task. For example, in regression task C( , ) can be least square function. In (1), λ denotes the regularization coefficient and Ω( ) represents a non-decreasing function, which is used to prevent over-fitting and control model complexity. 2.1 Function Approximation with Reproducing Hilbert Kernel Space Let κ(x, xt) : Rd Rd R represent a symmetric positive definite kernel function which measures the similarity between x and xt. In the context of kernel based learning, it is assumed that the sought f( ) belongs to the reproducing Hilbert kernel space (RHKS) H := Ghari and Shen {f|f(x) = P t=1 αtκ(x, xt)}. A kernel is reproducing if it satisfies κ(x, xt), κ(x, xt ) H = κ(xt, xt ) where , H denotes vector inner product in Hilbert space, with the RKHS norm defined as f 2 H := P t P t αtαt κ(xt, xt ). The representer theorem states that the optimal solution of (1) can be expressed as follows given finite data samples (Wahba, 1990) t=1 αtκ(x, xt) := α κ(x) (2) where α := [α1, . . . , αT ] denotes the vector of unknown coefficients to be estimated, and κ(x) := [κ(x, x1), . . . , κ(x, x T )] . Furthermore, it can be observed that the dimension of α increases with the number of data samples T. This is known as curse of dimensionality (Wahba, 1990) and arises as a major challenge for solving (1) in an online fashion. 2.2 Random Fourier Feature Approximation One way to cope with the increasing number of variables to be estimated is to employ random feature (RF) approximation (Rahimi and Recht, 2007). As in Rahimi and Recht (2007), we will approximate κ( ) in (2) using shift-invariant kernels which satisfy κ(xt, xt ) = κ(xt xt ). However, relying on a pre-selected kernel often requires prior information that may not be available. To cope with this, multi-kernel learning can be exploited which learns the kernel as a combination of a sufficiently rich dictionary of kernels {κi}N i=1. The kernel combination is itself a kernel (Scholkopf and Smola, 2001). Let κi(xt xt ) be the i-th kernel in the dictionary of N absolutely integrable kernels. In this case, its Fourier transform πκi(ψ) exists and can be viewed as probability density function (PDF) if the kernel is normalized such that κi(0) = 1. Specifically, it can be written as κi(xt xt ) = Z πκi(ψ)ejψ (xt xt )dψ := Eπκi(ψ)[ejψ (xt xt )]. (3) Let {ψi,j}D j=1 be a set of vectors which are independently and identically distributed (i.i.d) samples from πκi(ψ). Hence, κi(xt xt ) can be approximated by the ensemble mean ˆκi,c(xt xt ) := 1 D PD j=1 ejψ i,j(xt xt ). Furthermore, the real part of ˆκi,c(xt xt ) also constitutes an unbiased estimator of κi(xt xt ) which can be written as ˆκi(xt xt ) = z i (xt)zi(xt ) (Rahimi and Recht, 2007), where zi(xt) := 1 D [ sin(ψ i,1xt), , sin(ψ i,Dxt), cos(ψ i,1xt), , cos(ψ i,Dxt)]. Replacing κi(x, xt) with ˆκi(x xt), ˆf(x) in (2) can be approximated as ˆf RF,i(x) = t=1 αtˆκi(x xt) = t=1 αtz i (x)zi(xt) = θ i zi(x) (4) where θi R2D is a vector whose dimension does not increase with the number of data samples. Therefore, utilizing RF approximation can make the function approximation problem amenable for online implementation. Furthermore, the loss of the i-th kernel can be calculated as L(θ i zi(xt), yt) = C(θ i zi(xt), yt) + λΩ( θi 2). (5) Graph-Aided Online Multi-Kernel Learning 2.3 Online MKL as Online Learning with Expert Advice Online learning with expert advice studies the problem where a learner performs the online learning task by interacting with a set of experts, see e.g. (Cesa-Bianchi and Lugosi, 2006)). At each time instant, the learner observes the advice given by experts, and then utilize the received advice to make a decision in real time (Auer et al., 2003; Hazan, 2016; Mannor and Shamir, 2011). In multi-kernel learning, each kernel can be viewed as an expert. Specifically, the approximation obtained by the i-th kernel, can be viewed as the advice given by κi( ). In particular, when multiple kernels are employed, function approximation can be performed by functions in the form f(x) = PN i=1 wifi(x) where PN i=1 wi = 1 (Scholkopf and Smola, 2001). Also, fi(x) Hi where Hi is an RKHS induced by the kernel κi. Replacing fi(x) with ˆf RF,i(x), the function f(x) can be approximated as i=1 wi ˆf RF,i(x), i=1 wi = 1 (6) where the approximation ˆf RF(x) is a linear combination of approximations (advice) given by kernels in the dictionary. When all kernels are involved in function approximation at every time instants, the multi-kernel learning problem with RF approximation can be formulated as min { wi,θi} i=1 wiθ i zi(xt), yt + λΩ( θi 2) i=1 wi = 1, wi 0, 1 t T. (7b) In online MKL where data samples come sequentially, the optimization problem cannot be solved in batch. Online convex optimization methods can be utilized to update { wi}N i=1, {θi}N i=1 upon receiving new datum xt at each time instant t (Sahoo et al., 2019; Shen et al., 2019). Let wi,t and θi,t denote the values of wi and θi at time t. Upon receiving new datum xt and computing the loss L(θ i,tzi(xt), yt), using the online gradient descent, θi,t can be updated as θi,t+1 = θi,t η L(θ i,tzi(xt), yt) (8) where η is the learning rate. Furthermore, using multiplicative update, the value of wi,t can be updated as wi,t+1 = wi,t exp ηL(θ i,tzi(xt), yt) (9a) wi,t+1 = wi,t+1 PN j=1 wj,t+1 . (9b) Employing the update rules in (8) and (9) wi,t and θi,t can be updated in an online fashion without storing data in batch. Ghari and Shen 2.4 Assumptions In order to analyze the proposed algorithms, we apply stochastic regret (Hazan, 2016) to measure the difference between expected cumulative loss of the proposed algorithms and the best function approximant in the hindsight. Let f ( ) denote the best function approximant in hindsight which can be obtained as f ( ) arg min f i ,i {1,...,N} t=1 L(f i (xt), yt) (10a) f i ( ) arg min f Hi t=1 L(f(xt), yt). (10b) Hence, the stochastic regret is defined as t=1 Et[L( ˆf RF(xt), yt)] t=1 L(f (xt), yt) (11) where Et[ ] denotes the expected value at time instant t given the loss observations in prior times. Furthermore, the performance of proposed algorithms is analyzed under the following assumptions: (as1) The loss function L(θ i,tzi(xt), yt) is convex with respect to θi,t at each time instant t. (as2) For θ in a bounded set T which satisfies θ Cθ the loss and its gradient are bounded as L(θ i,tzi(xt), yt) [0, 1] and L(θ i,tzi(xt), yt) L, respectively. (as3) Kernels {κi}N i=1 are shift-invariant (i.e. κi(x, x ) = κi(x x ), i: i = 1, . . . , N), standardized and bounded. Each datum xt 1. Note that (as1) can be satisfied by many convex loss functions such as least squares loss and logistic loss. Moreover, (as2) states that the losses are bounded and L-Lipschitz continuous. (as3) states that kernels are assumed to be shift-invariant, standardized and bounded, meaning that |κi(x)| 1, i, x. In what follows, we introduce a general graphaided OMKL framework, where only a subset of kernels in the dictionary are chosen at each time instant. 3. Online Multi-Kernel Learning with Bipartite Feedback Graph The present section introduces an OMKL approach which selects a subset of kernels using a bipartite graph constructed adaptively at each time instant based on the observed losses. 3.1 Data-driven Graph-based Kernel Selection Instead of combining the entire dictionary of the kernels, in the present paper, we will consider combining a subset of kernels {κi( ), i St} at time instant t instead, where St is Graph-Aided Online Multi-Kernel Learning the index set of the chosen subset of kernels at time instant t. Hence, the original function approximation problem boils down to min { wi,t,θi} i St wi,tθ i zi(xt), yt + λΩ( θi 2) i St wi,t = 1, wi,t 0, 1 t T. (12b) Upon defining the normalized weights wi,t = wi,t P j St wj,t , (12) can be re-written as min {wi,t},{θi} wi,t P j St wj,t θ i zi(xt), yt s.t.wi,t 0, 1 i N, 1 t T. (13b) However, (13) assumes that {St}T t=1 are preselected sets. In this section, we study data-driven scheme which can adaptively select a subset of kernels on the fly upon receiving new data samples. In order to adaptively choose the subset of kernels, the present section models the pruned kernel combination as feedback collected from a graph, that is constructed in an online fashion. By doing this, the proposed approach trims irrelevant kernels in the dictionary to both improve the function approximation accuracy and reduce the computational complexity of MKL. Consider a time varying bipartite graph (Asratian et al., 1998) Bt at time t, which consists of two sets of nodes: N kernel nodes {vk,1, ..., vk,N} and J selective nodes {vs,1, ..., vs,J} where vk,i and vs,j are the i-th kernel node and j-th selective node, respectively. And the edges of the graph represents the association between the kernel nodes and the selective nodes. Specifically, an edge between vk,i and vs,j exists at time t if the i-th kernel is assigned to j-th selective node. The construction of the graph will be discussed in Section 3.2 . At each time instant, one of the selective nodes vs,j is chosen, and the subset of kernel nodes connected to vs,j will be used for the instantaneous function approximation at time t, [c.f. (13)]. Then, the loss L( ˆf RF,i(xt), yt) is observed for every kernel in the chosen subset and θi,t is updated as θi,t+1 = θi,t η L(θ i,tzi(xt), yt) qi,t 1i St, (14) where qi,t is the probability that the loss of associated kernel is observed and 1i St denotes the indicator function such that 1i St = 1 if i St and 1i St = 0 otherwise. The value of qi,t depends on how the bipartite graph is generated. Upon receiving a new datum xt, the value of the weight wi is updated on the fly . Let wi,t denote the weighting coefficient wi at time instant t. We leverage multiplicative update for weights wi,t as wi,t+1 = wi,t exp( ηℓi,t) (15) where ℓi,t denotes the importance sampling loss estimates (Alon et al., 2017) associated with the i-th kernel as follows ℓi,t = L( ˆf RF,i(xt), yt) qi,t 1i St (16) Ghari and Shen Algorithm 1 Data Driven (Bipartite) Graph-based Kernel Selection Input: Shift-invariant kernels κi, i = 1, . . . , N, step size η > 0, weights {wi,t}N i=1, {uj,t}J j=1, bipartite graph Bt and datum xt. Set uj,t = P i:vk,i vs,j wi,t. Obtain pj,t via (19). Choose one selective node vs,j according to PMF pt = (p1,t, ..., p J,t). Predict ˆf RF(xt) = P i St wi,t P m St wm,t ˆf RF,i(xt) with ˆf RF,i(xt) in (4). Obtain loss L( ˆf RF,i(xt), yt) for all i St. Update θi,t+1 via (14) for all i St. Update wi,t+1 via (15). Output: ˆf RF(xt), {wi,t+1}N i=1, {θi,t+1}N i=1 which is the observed loss L( ˆf RF,i(xt), yt) divided by the probability qi,t. The function approximation can be obtained as ˆf RF(xt) = X wi,t P m St wm,t ˆf RF,i(xt) (17) where ˆf RF,i( ) is defined in (4). Then the selective nodes are assigned to weights {uj,t+1} according to the the kernel nodes weights {wi,t+1}. Indeed, each selective node s weight {uj,t+1} is the total summation of weights of kernel nodes which are connected to this selective node. Specifically the weight of vs,j is obtained via i:vk,i vs,j wi,t+1. (18) Note that the weights of the selective nodes are determined by its adjacent kernel nodes, which indicates the reliability of the corresponding kernel-based function estimate. Hence, the probability according to which a selective node is chosen in the next time instant can be updated as pj,t+1 = (1 ηe)uj,t+1 where Ut+1 := PJ j=1 uj,t+1, and 0 < ηe 1 is the exploration rate. The term ηe J is introduced to tradeoffbetween exploitation and exploration. The first term on the right hand side of (19) implies choosing a selective node with larger weight uj,t+1 with higher probability. And the term ηe J is used to to promote exploration. Algorithm 1 summarizes the data driven kernel selection scheme presented in this section. To sum up, each kernel is viewed as an expert and at each time instant a subset of function approximations provided by these experts is combined. In this regard, the RF approximation ˆf RF,i(xt) can be viewed as the feedback provided by i-th kernel node, and the proposed framework models the pruned kernel combination as feedback collected from a graph, where the feedback are combined only if the corresponding kernel node is connected Graph-Aided Online Multi-Kernel Learning Algorithm 2 Generating Bipartite Feedback Graph Input: Shift-invariant kernels κi, weighting coefficient wi,t, i = 1, . . . , N, exploration coefficient ηe and the maximum degree of each selective node M. Initialize: Sub-adjacency matrix At+1 = 0N J. for j = 1, ..., J do for i = 1, ..., N do Set πij,t+1 = (1 ηej) wi,t+1 PN i=1 wi,t+1 + ηej end for for k = 1, ..., M do Choose one of nodes vk,i drawn according to the probability mass function (PMF) πj,t+1 = (π1j,t+1, ..., πNj,t+1). Set At+1(i, j) = 1. end for end for Output: Bipartite feedback graph Bt+1 with adjacency matrix At+1 to the chosen selective node. By doing this, the proposed approach trims irrelevant kernels in the dictionary to both improve the function approximation accuracy and reduce the computational complexity of MKL. The graph construction approach will be proposed in the ensuing subsection. 3.2 Online Bipartite Feedback Graph Construction Construction of the time varying graph is of utmost importance, as it affects both function approximation accuracy and computational complexity. In this regard, a graph is successful if it can provide a subset of kernels which results in smallest possible loss. Indeed, considering computational complexity, the graph should provide a limited number of kernels which obtain minimum loss. To this end, we aim to propose a generating method for graph. Specifically, using the weights {wi,t+1}N i=1 obtained via (15), the structure of the graph is reconstructed in a stochastic manner stated below to be leveraged in the next time instant. Increasing the number of kernel nodes connected to vs,j, increases the computational complexity of performing function approximation by choosing vs,j. Therefore, the graph generation algorithm should be designed to limit the number of kernel nodes connected to each selective node. Let M denote the maximum number of kernel nodes connected to each selective node. The procedure to generate the graph Bt+1 is presented in Algorithm 2. Let At+1 represent the N J sub-adjacency matrix between two disjoint subsets {vs,1, ..., vs,J} and {vk,1, ..., vk,N}. Notation At+1(i, j) represents the element in i-th row and j-th column of the sub-adjacency matrix At+1 and it is equal to 1 if vk,i is connected to vs,j, and 0 otherwise. Each selective node vs,j draws kernel nodes vk,i in M independent trials and in each trial selective node draws only one kernel node. We put more weight on kernels which obtain less loss in a sense that the probability that selective node vs,j draws the kernel node vk,i in a Ghari and Shen (a) A bipartite feedback graph consists of J selective nodes and N kernel nodes. (b) The chosen selective node (vs,2 as an example) and edges associated with it are highlighted. Figure 1: A bipartite feedback graph generated by Algorithm 2. trial at time t + 1 is πij,t+1 = (1 ηej) wi,t+1 PN k=1 wk,t+1 + ηej Note that the first term in (20) discriminates between kernels based on their weights which is determined by their loss in function approximation [c.f. (15)]. Furthermore, the second term allows exploration over all kernel nodes. Specially, the selective node vs,j draws kernel nodes according to uniform distribution if ηe = 1. Furthermore, note that ηej is a non-increasing function of j for 0 < ηe 1. The selective node vs,1 puts more weight on exploration in comparison with others while vs,J considers more exploitation than all the other selective nodes. Therefore, the selective nodes entail different level of exploration and exploitation. Based on the definition of πij,t+1 in (20), the probability that the i-th kernel node is connected to vs,j is 1 (1 πij,t+1)M, where (1 πij,t+1)M is the probability that the i-th kernel node is chosen by vs,j in none of M trials. Therefore, the probability of observing the loss of the i-th kernel at time t + 1 is given by j=1 pj,t+1 1 (1 πij,t+1)M (21) for 1 i N. The value of qi,t+1 is computed and used for importance sampling loss estimate in (16). The graph generation framework is summarized in Algorithm 2. And Figure 1 illustrates an example of the constructed bipartite feedback graph. At each time instant t, the bipartite graph Bt is used for choosing a selective node, and henceforth subset of the kernels. Then the weights of the selected kernels are updated according to the loss [c.f.(14) and (15)]. And the graph can be constructed using Algorithm 2, based on which, the function approximation can be carried out by choosing one of the selective nodes which leads to selecting a subset of kernels. Our proposed online multi-kernel learning with graph-structured feedback (OMKL-GF) is summarized in Algorithm 3. Computational Complexity. At time instant t, OMKL-GF needs to store a real 2D random feature vector in addition to a weighting vector for each kernel in conjunction with Graph-Aided Online Multi-Kernel Learning Algorithm 3 OMKL with (Bipartite) Graph Feedback (OMKL-GF) Input: Shift-invariant kernels κi, i = 1, . . . , N, step size η > 0 and the number of random features D. Initialize: θi,1 = 0, wi,1 = 1, i = 1, ..., N, generate B1 using Algorithm 2 given wi,1, i for t = 1, ..., T do Receive one datum xt. Obtain ˆf RF(xt), {wi,t+1}N i=1, {θi,t+1}N i=1 using Algorithm 1 given Bt, {wi,t}N i=1, {θi,t}N i=1. Generate Bt+1 using Algorithm 2 with {wi,t+1}N i=1. end for a weighting vector for each selective node. As the number of kernels is in general larger than the number of selective nodes, the required memory is of order O(d DN). The per-iteration complexity of our OMKL-GF (e.g. calculating inner products) is O(d DM + JN). In comparison, the per-iteration complexity of OMKR developed by Sahoo et al. (2014) is O(td N), while more contemporary online RF-based OMKL approaches proposed by Shen et al. (2019); Sahoo et al. (2019) both have per-iteration complexity O(d DN). Hence, OMKLGF can significantly reduce the per iteration complexity especially when J M << N. 3.3 Regret Analysis This subsection presents the regret analysis of OMKL-GF. In order to analyze the regret for OMKL-GF, we first establish an intermediate result in the following lemma. Lemma 1 The regret of the proposed OMKL-GF under (as1) and (as2) with respect to a preselected kernel κi where Fi = { ˆfi| ˆfi(x) = θ zi(x), θ R2D} satisfies the following bound t=1 Et[L( ˆf RF(xt), yt)] t=1 L( ˆf i (xt), yt) 2η + ηe JT + ηNT 2(1 ηe) + ηL2NJT where θ i is associated with the best RF function approximant ˆf i (xt) = θ i zi(xt) and the expectation at time t is taken with respect to PMFs pt and πj,t in (19) and (20), respectively. Proof see Appendix B. The next theorem further characterizes the difference between the loss of OMKL-GF relative to the best functional estimator in the RKHS. Theorem 2 The following bound holds with probability at least 1 28(σi ϵ )2 exp( Dϵ2 4d+8) under (as1) (as3) for ϵ > 0 and with f i belonging to RKHS Hi as in (10b) t=1 Et[L( ˆf RF(xt), yt)] min i {1,...,N} t=1 L(f i (xt), yt) 2η + ηe JT + ϵLTC + ηNT 2(1 ηe) + ηL2NJT Ghari and Shen where C is a constant, and σ2 i is the second order moment of the RF vector norm which can be defined as σ2 i := Eπκi(ψ)[ ψ 2]. The expectation at time t is taken with respect to the randomness in pt and πj,t defined in (19) and (20), respectively. Proof see Appendix C. According to Theorem 2, by setting , ηe = O N 1 6 T 1 4 , J = O N 1 3 , ϵ = O T 1 in (23), the stochastic static regret in (11) satisfies O N ln NT 3 4 . Thus, by selecting appropriate parameters, our proposed OMKL-GF achieves sublinear regret in expectation with respect to the best static function approximant in (11). Note that while proper settings of ϵ and η relies on the knowledge of T, such information may not be necessary, via employing, e.g., doubling trick (Cesa-Bianchi and Lugosi, 2006). Considering (23), the probability 1 28(σi ϵ )2 exp( Dϵ2 4d+8) is an increasing function of D such that for a fixed ϵ, always there are some values for D which result in positive probability. Furthermore, (23) shows that by setting ϵ = O(T 1 4 ) and D = O( T ln T), the sublinear regret of O( N ln NT 3 4 ) can be obtained with high probability of 1 O( 1 The computational complexity of kernel learning algorithms play an important role in their applicability. Bipartite feedback graph construction at each time instant increases the computational complexity of OMKL-GF. To further alleviate the computational burden of multi-kernel learning the ensuing section investigates the problem of choosing a subset of kernels using a feedback graph while the graph is not constructed at every time instant. 4. Online Multi-Kernel Learning with Similarity-based Feedback Graph Note that OMKL-GF is a data-driven kernel selection scheme where a bipartite feedback graph is constructed at every time instant. However, online feedback graph construction increases the computational complexity of OMKL-GF. This section proposes a novel algorithmic framework to construct the feedback graph in an offline fashion such that the proposed algorithms do not need to construct the feedback graph at every time instant. Moreover, the bipartite feedback graph Bt constructed by Algorithm 2 do not exploit the relationship among kernels. Hence, in this section, the similarity among kernels is measured which will facilitate constructing the feedback graph in an offline fashion in such a way that at each time instant a subset of dissimilar kernels are chosen to avoid unnecessary computation. The present section first introduces a disciplined way to construct feedback graph based on kernel similarities in an offline fashion. Based on the constructed feedback graph, a novel online MKL algorithm (called OMKL-SFG) is developed to select a subset of kernels which is proved to obtain sub-linear regret. Furthermore, to obtain tighter regret bound a modification of OMKL-SFG algorithm (called OMKL-SFG-R) is proposed which refines the structure of the feedback graph to choose a subset of kernels. OMKL-SFG-R entails more computation than OMKL-SFG. Graph-Aided Online Multi-Kernel Learning 4.1 Offline Similarity-based Feedback Graph Construction The similarity between two shift invariant kernels κi and κj is measured through divergence between κi and κj. As κi and κj has smaller divergence, they are considered to be more similar. The present paper measures the divergence between a pair of kernels using the Bregman divergence. Let Ωbe a d-dimensional convex set. Bregman divergence defined for a strictly convex and differentiable function F( ) : Ω R as (see, e.g. (Bregman, 1967; Banerjee et al., 2005)) BF (ω1, ω2) = F(ω1) F(ω2) F(ω2) (ω1 ω2). (25) Based on Bregman divergence, the divergence between two shift invariant kernels κi and κj can be measured through the function (κi, κj) which is defined as (κi, κj) = Z BF (κi(ρ), κj(ρ)) dρ (26) where ρ Rd. As it can be inferred from (26), (κi, κj) measures the divergence between two kernels κi and κj using the aggregation of Bregman divergence on every point ρ in the input space. As (κi, κj) decreases, kernels κi and κj are considered to be more similar. Note that instead of defining the divergence as in (26), one can define the divergence function (κi, κj) as the expected Bregman divergence over points ρ. However, taking the expectation requires knowing the distribution of input data samples which may not be available priori. Furthermore, the distribution of input space may change over time and as a result calculating the expected value of Bregman divergence in an offline fashion over the input space may not be feasible. Moreover, squared Euclidean divergence BF (κi(ρ), κj(ρ)) = κi(ρ) κj(ρ) 2 is generated by the function F(ω) = ω 2. Let S( , ) denotes the function ( , ) when the Bregman divergence is obtained by F(ω) = ω 2. In this case, we have S(κi, κj) = Z |κi(ρ) κj(ρ)|2dρ. (27) The following Lemma states that the function S(κi, κj) exists for each pair of absolutely integrable kernels. Lemma 3 Under the assumption that kernels {κi}N i=1 are absolutely integrable, bounded and normalized such that κi(0) = 1, i : 1 i N, the function S(κi, κj) is bounded and exists for each pair of kernels κi( ) and κj( ). Proof Since kernels {κi(ρ)}N i=1 are assumed to be bounded as 0 κi(ρ) 1, ρ, 1 i N, it can be concluded that |κi(ρ) κj(ρ)|2 |κi(ρ) κj(ρ)|. Thus, it can be inferred that Z |κi(ρ) κj(ρ)|2dρ Z |κi(ρ) κj(ρ)|dρ. (28) Furthermore, based on the Triangular inequality, it can be written that Z |κi(ρ) κj(ρ)|dρ Z |κi(ρ)|dρ + Z |κj(ρ)|dρ. (29) Ghari and Shen Based on (28), (29) and the fact that kernels are absolutely integrable, we can conclude that the function S(κi, κj) is bounded and exists for each pair of kernels κi( ) and κj( ). Furthermore, the following lemma states that the average difference between function approximations given by each pair of kernels is bounded above in accordance with the function S( , ) defined in (27). Lemma 4 Let Cm := maxi PT t=1 |αi,t|2 where {αi,t}T t=1 are weights for (2) associated with the i-th kernel κi( ). Also, let x is bounded as x 1 and kernels are absolutely integrable. Then, the average difference between function approximations given by κi( ) and κj( ) is bounded above as Z | ˆfi(x) ˆfj(x)|2dx 2Cm t=1 ( S(κi, κj) + 2Ud) (30) where ˆfi(x) denotes the function approximation given by κi( ) as in (2) and Ud represents d-dimensional Euclidean unit norm ball volume. Proof See Appendix D. Let G := (V, E) be a directed graph with vertex vi V which represents the i-th kernel κi. In this case, there is a self-loop for each vi V which means (i, i) E. Furthermore, an edge from vi to vj is appended to E if 1 |Nout i | (κm(ρ), κj(ρ)) γi (31) where γi is a threshold for vi and Nout i denote the current out-neighborhood set of vi which means j Nout i if (i, j) E. Using the (31) to append edges to the graph G, a vertex vj associated with the kernel κj is added to the out-neighborhood set of vi if it is dissimilar to the current out-neighbors of vi. Therefore, the subset of vertices which are out-neighbors of vi, are jointly dissimilar. Since a subset of function approximations associated with kernels will be chosen using the graph G, the chosen subset of function approximations can be viewed as feedback collected from the graph G and as a result the graph G is called feedback graph. Specifically in order to restrict the number of out-neighbors for each node to M, the value of γi is obtained as γi = arg max γ {γ||Nout i | = M}. (32) Note that M is a preselected parameter in the algorithm and increasing the value of M increases the connectivity of the feedback graph. At each time instant, one of the nodes are drawn and the function approximation is carried out using the combination of a subset of kernels which are out-neighbors of the chosen node. Therefore, increase in M can increase the exploration in the approximation task while it increases the computational complexity. The feedback graph construction procedure is summarized in Algorithm 4. It can be observed Graph-Aided Online Multi-Kernel Learning Algorithm 4 Generating Similarity based Feedback Graph Input: Shift-invariant kernels κi, i = 1, ..., N. for i = 1, ..., N do Append (i, i) to E. Obtain γi via (32). Append (i, j) to E if 1 |Nout i | P m Nout i (κm(ρ), κj(ρ)) γi. end for Output: Similarity-based Feedback Graph G. (a) A similarity-based feedback graph with 5 kernel nodes. (b) As an example, the chosen node (v2) and its out-neighbors are highlighted. Figure 2: An example of similarity-based feedback graph generated by Algorithm 4. from (26) that the function (κi, κj) can be considered as a measure of divergence, and henceforth dissimilarity between kernels κi( ) and κj( ) without knowing data samples {xt}T t=1. This helps reduction of computational complexity of the function approximation since (dis)similarity among kernels in the dictionary can be measured offline before observing data samples and as a result the computation required to perform Algorithm 4 can be performed offline. At each time instant, one of the vertices of the feedback graph is drawn according to a PMF as it will be explained in the next section. Then, the function approximation is performed using the kernels associated with out-neighbors of the chosen vertex. Therefore, based on the feedback graph construction in Algorithm 4, at each time instant a subset of dissimilar kernels is chosen to avoid unnecessary computation since it is expected that similar kernels provide comparatively similar approximations. See also Figure 2 for an example of similarity based feedback graph where the number of kernels is 5 and vi represents a Gaussian kernel with bandwidth of 10i 3. For each node vi, γi is selected so that the number of out-neighbors of vi is 3. 4.2 Kernel Selection with Offline Feedback Graph The present section studies how to select a subset of kernels using the feedback graph G and prior observations of losses associated with kernels. Assume that each kernel is associated with a set of weights {wi,t}N i=1 where wi,t is the weight associated with the i-th kernel κi. The weight wi,t indicates the accuracy of the function approximation given by the κi at Ghari and Shen time t and its value can be updated when more and more information is being revealed. Furthermore, a set of weights {ui,t}N i=1 is assigned to V such that ui,t is the weight associated with vi V at time instant t, which indicates the accuracy of function approximation when the node vi is drawn. In order to choose a subset of kernels at time t, one of the vertices in V is drawn according to the PMF pt pi,t = (1 ξ)ui,t |D|1i D, i = 1, . . . , N (33) where ξ is the exploration rate and Ut := PN i=1 ui,t. D represents the dominating set of G, and |D| denotes the cardinality of D. Let St denote the subset of kernel indices chosen at time t, and It denote the index of the kernel drawn according to the PMF pt in (33). Therefore, i St if i Nout It , which means that the loss associated with the i-th kernel is calculated if the i-th kernel is an out-neighbor of the It-th node. In turn, the RF-based function approximation can be obtained as ˆf RF(xt) = X wi,t P j Nout It wj,t ˆf RF,i(xt). (34) Furthermore, the importance sampling loss estimate ℓi,t at time instant t is defined as ℓi,t = L(θ i,tzi(xt), yt) qi,t 1i St, i = 1, . . . , N (35) where qi,t is the probability that i St and it can be computed as where Nin i denote the in-neighborhood set of vi which means j Nin i if (j, i) E. In addition, the importance sampling function approximation estimate ˆℓi,t at time instant t associated with vi V is defined as ˆℓi,t = L( ˆf RF(xt), yt) pi,t 1It=i. (37) Using the importance sampling loss estimate in (36), θi,t can be updated as θi,t+1 = θi,t η ℓi,t = θi,t η L(θ i,tzi(xt), yt) qi,t 1i St, (38) where η is the learning rate. Moreover, the multiplicative update is employed to update wi,t and ui,t based on importance sampling loss estimates in (35) and (37) as follows wi,t+1 = wi,t exp( ηℓi,t), i = 1, . . . , N (39a) ui,t+1 = ui,t exp( ηˆℓi,t), i = 1, . . . , N. (39b) Graph-Aided Online Multi-Kernel Learning Algorithm 5 OMKL with Similarity-based Feedback Graph (OMKL-SFG) Input: Shift-invariant kernels κi, i = 1, . . . , N, learning rate η, exploration rate ξ, the number of random features D. Initialize: θi,1 = 0, wi,1 = 1, i = 1, . . . , N, Construct the feedback graph G via Algorithm 4. for t = 1, . . . , T do Receive one datum xt. Draw one of nodes vi V according to the PMF pt = (p1,t, ..., p N,t) in (33). Predict ˆf RF(xt) via (34). Calculate loss L( ˆf RF,i(xt), yt) for all i St. Update θi,t+1 via (38). Update wi,t+1 and ui,t+1 via (39). end for The procedure to choose a subset of kernels at each time instant for function approximation is summarized in Algorithm 5. This algorithm is called OMKL-SFG which stands for Online Multi Kernel Learning with Similarity based Feedback Graph. Figure 2(b) illustrates the case when v2 is drawn by the Algorithm 5 as an example. Then the function approximation is performed using kernels associated with out-neighbors of v2, which are v1, v2 and v3 highlighted in Figure 2(b). The following Theorem presents the upper bound for cumulative stochastic regret of OMKL-SFG. Theorem 5 Under (as1) and (as2), let j = arg min j:1 j N PT t=1 L(f j (xt), yt). Then for any i Nin j , the stochastic regret of OMKL-SFG satisfies t=1 Et[L( ˆf RF(xt), yt)] t=1 L(f (xt), yt) ln N|Nout i | η + (1 + ϵ)C2 2η + ϵLTC + (ξ + ηN qj ,t ) (40) with probability at least 1 28(σj ϵ )2 exp( Dϵ2 4d+8) under (as1)-(as3) for any ϵ > 0. Furthermore, 1 qi,t = P j Nout i wj,t qj,t Wi,t , C is a constant and σ2 j is the second moment of πκj (ψ). Proof The proof is deferred to Appendix E. The regret bound in (40) depends on 1 qi,t and 1 qj ,t . Since, there is a self-loop for all vk V, it can be written that qk,t pk,t. In addition, based on (33), we can conclude that pk,t > ξ |D|, vk V and as a result qk,t > ξ |D|, vk V. Therefore, in the worst case where qj ,t = O( ξ |D|), considering , ϵ = ξ = O T 1 3 , D = O T 2 3 ln T (41) Ghari and Shen OMKL-SFG can achieve regret bound of O( N ln NT 2 3 ) with high probability of 1 O(T 1 3 ). Moreover, comparing regret bound of OMKL-SFG with that of OMKL-GF, it can be observed that OMKL-SFG obtains tighter regret than OMKL-GF. The reason behind this is that the regret bound of both OMKL-SFG and OMKL-GF depend on 1/qj ,t (c.f. (40) and (90)) and OMKL-SFG performs more exploration than OMKL-GF in the sense that using OMKL-SFG the lower bound for the probability qi,t, i is larger than that of OMKL-GF. Specifically, using OMKL-GF, qi,t > η2 e/NJ (c.f. (91)). Setting ηe = O(N 1 6 T 1 4 ) and J = O(N 1 3 ) as it is specified in (24), it can be concluded that qi,t > O 1 N , i {1, . . . , N}, t {1, . . . , T} (42) when OMKL-GF is employed. Moreover, since using OMKL-SFG, qi,t > ξ |D|, choosing 3 ) as in (41) and considering the fact that |D| N, the lower bound of qi,t when OMKL-SFG is employed is obtained as , i {1, . . . , N}, t {1, . . . , T}. (43) Comparing (42) with (43), it can be concluded that OMKL-SFG can provide larger lower bound for qj ,t than that of OMKL-GF. This enables OMKL-SFG to obtain tighter regret upper bound than OMKL-GF. Since the regret bound of O( T) is more satisfactory than the regret bound of O(T 2 3 ), in what follows the structure of the feedback graph is refined at each time instant so that the regret of O( T) can be achieved. 4.3 OMKL with Similarity-based Feedback Graph Refinement This subsection further improves the OMKL-SFG by refining the offline feedback graph on the fly , so that the resulting algorithm achieves a tighter sub-linear regret of O( T). To this end, at each time instant the offline feedback graph G constructed by the Algorithm 4 is refined to a feedback graph G t based on the observed losses. To begin with, let s define set D t as D t := i ui,t Ut 1 1 ξ (β ξ where β is a pre-selected constant. According to (33), it can be inferred that pi,t β, i D t. Let G t = (V, E t) be a graph such that D t is a dominating set of G t. Suppose at each time instant t, G t is employed as the feedback graph instead of G. In this case, it is ensured that there is at least one edge from D t to each vi V \ D t, i.e., D t is a dominating set of G t. In this case, we have qi,t β, vi V. To this end, at each time instant t, G t can be constructed based on G by expanding E to E t such that D t would be a dominating set of G t. Specifically, at each time instant t, the edge (di,t, i) is appended to E t, if there is not any edge from D t to vi, where di,t = arg max j D t (κi, κj). (45) Graph-Aided Online Multi-Kernel Learning Algorithm 6 OMKL with Similarity Feedback Graph Refinement (OMKL-SFG-R) Input: Shift-invariant kernels κi, i = 1, . . . , N, learning rate η > 0, exploration rate ξ > 0, the number of random features D and the constant β > 0. Initialize: θi,1 = 0, wi,1 = 1, i = 1, . . . , N, Construct the feedback graph G via Algorithm 4. for t = 1, . . . , T do Receive one datum xt. Set E t = E and obtain di,t, i V \ D t by (45). For all i V \ D t, append (di,t, i) to E t if (di,t, i) / E. Draw one of nodes vi V according to the PMF pt = (p1,t, ..., p N,t) in (46). Predict ˆf RF(xt) via (47). Calculate loss L( ˆf RF,i(xt), yt) for all i St. Update θi,t+1 via (38). Update wi,t+1 and ui,t+1 via (39). end for Hence, there is at least one edge from D t to vi V \ D t, meaning D t is a dominating set for G t. Then one of the vertices in V is drawn according to the PMF pt, with pi,t = (1 ξ)ui,t Ut + ξ |D t|1i D t, i = 1, . . . , N. (46) Let Nout i,t and Nin i,t denote sets of out-neighbors and in-neighbors of vi in G t, respectively. According to (44) and (46), we have qi,t β, vi V where qi,t = P j Nin i,t pj,t. The RF-based function approximation can be written as ˆf RF(xt) = X wi,t P j Nout It wj,t ˆf RF,i(xt). (47) According to (47), θi,t, wi,t and ui,t can be updated using (38), (39a) and (39b), respectively. The procedure is summarized in Algorithm 6, which is called OMKL-SFG-R, and its performance in terms of regret analysis is presented in the following Theorem. Theorem 6 Under (as1) (as3), the stochastic regret of OMKL-SFG-R satisfies t=1 Et[L( ˆf RF(xt), yt)] t=1 L(f (xt), yt) η + (1 + ϵ)C2 2η + ϵLTC + (ξ + η 2 L2 + Nβ + 1 with probability at least 1 28(σj ϵ )2 exp( Dϵ2 4d+8) for any ϵ > 0 and any β 1 Proof The proof can be found in Appendix F. According to Theorem 6, by setting , ϵ = ξ = O , D = O(T ln T) (49) Ghari and Shen and β = O(1) such that β 1 N , OMKL-SFG-R can achieve regret bound of O( TN ln N) using the feedback graph G t with probability of 1 O(1). According to Theorem 6, larger D leads to larger probability that the regret bound in (48) holds true. Thus, in order to achieve regret of O( TN ln N) with high probability, OMKL-SFG-R should set sufficiently large value of order O(T ln T) for D. However, note that since some edges may be added to G to construct G t, using G t instead of G may cause increase in computational complexity of function approximation. Computational Complexity. Both OMKL-SFG and OMKL-SFG-R need to store a set of d-dimensional vectors {ψi,j}D j=1 per kernel in addition to two weighting coefficients {wi,t}N i=1 and {ui,t}N i=1. Furthermore, both OMKL-SFG and OMKL-SFG-R need to store adjacency matrix of G which is N N matrix. In order to perform required computation for (45), OMKL-SFG-R needs to store the divergence (κi, κj) between any pair of kernels in the dictionary. Therefore, the memory requirement for both algorithms are O(d DN + N2). Consider the case where the number of out-neighbors of each node vi in G satisfies M < N, the per-iteration complexity of OMKL-SFG including calculation of inner products is O(d DM). Therefore, it can be inferred that OMKL-SFG incurs less computational complexity than OMKL-GF since recall that the per-iteration complexity of OMKL-GF is O(d DM + JN). Therefore, it can be concluded that the offline feedback graph construction indeed can alleviate the computational complexity compared with OMKL-GF in section 3. Due to the graph refinement procedure, the complexity of OMKL-SFG-R is higher than OMKL-SFG. According to Algorithm 6, there is a possibility that all nodes in the graph are out-neighbors of one node in D t. Therefore, the worst-case per-iteration computational complexity of OMKL-SFG-R is O(d DN). Furthermore, the computational complexities of OMKL-SFG and OMKL-SFG-R are lower than state-or-art multi-kernel learning algorithms provided that the per-iteration computational complexity of OMKR is O(td N) Sahoo et al. (2014), and the per-iteration computational complexity of RF-based online multi-kernel learning algorithms proposed in Sahoo et al. (2019) and Shen et al. (2019) are O(d DN). Regret Bounds Comparison. The algorithm OMKR (Sahoo et al., 2014) achieves regret of O( T ln N). However, since per iteration computational complexity of OMKR is O(td N), OMKR requires much more computations than the proposed graph-aided OMKL algorithms. Furthermore, Raker (Shen et al., 2019) obtains regret of O( T ln N) with high probability when the number of random features D is set to O(T ln T). Therefore, employing all kernels in the dictionary, Raker obtains tighter regret bound than those of the proposed graph-aided OMKL algorithms. However, the proposed graph-aided OMKL algorithms require less computations than Raker. Comparison with Online Learning. In online learning with expert advice, there is a learner interacts with a set of experts such that at each round of learning the learner choose one of the experts and takes its advice for either decision making or prediction (Cesa-Bianchi and Lugosi, 2006; Auer et al., 2003). The learner may observe the loss associated with a subset of experts after decision making and in this regard this can be modeled by a graph which is called feedback graph (see e.g. (Mannor and Shamir, 2011; Cohen et al., 2016; Alon et al., 2017; Ghari and Shen, 2022)). In all algorithms proposed in this paper, each kernel can be viewed as an expert. However, there are two major innovative differences compared with online learning with feedback graph: i) the proposed algorithm constructs and refines the feedback graph to improve the performance of learning task while in online learning, the Graph-Aided Online Multi-Kernel Learning feedback graph is generated in an adversarial manner. ii) in this paper, each expert (kernel) is a learner itself and experts implement an online scheme for self-improvement. 5. Experiments This section presents experimental results over real datasets downloaded from UCI Machine Learning Repository (Dua and Graff, 2017). The accuracy of different approaches are evaluated using mean square error (MSE). Due to the randomness in the random features extracted for function approximation, we average the MSE over R = 20 different sets of random features. The MSE at time t is computed as τ=1 ( ˆf RF(xτ) yτ)2. (50) The number of random features is D = 50. The kernel dictionary contains 76 kernels including 51 RBF kernels and 25 Laplacian kernels. The bandwidth of the i-th (1 i 51) RBF kernel is 10σi with σi = 2i 52 25 . And the value of the i-th (1 i 25) Laplacian kernel s parameter is 10λi where λi = i 13 6 . For fairness of evaluation, parameters ξ, η and ηe are set to 0.1 t for all algorithms at time step t. More precise results can be obtained with more extensive parameter tuning. The performance of kernel learning algorithms is evaluated through several real datasets: Airfoil: This dataset comprises 1, 503 different size airfoils at various wind tunnel speeds and each data sample xt includes 5 features such as frequency and chord length. The output yt is scaled sound pressure level in decibels (Brooks et al., 1989). Bias: This dataset contains 7, 750 samples. Each sample has 21 features including maximum and minimum temperatures of present-day, and geographic auxiliary variables for the purpose of bias correction of next-day minimum air temperatures. The goal is to predict next-day minimum air temperature (Cho et al., 2020). Concrete: This dataset contains 1, 030 samples of 8 features, such as the amount of cement or water in a concrete. The goal is to predict concrete compressive strength (Yeh, 1998). Naval: contains 11, 934 samples of 15 features of a naval vessel characterized by a gas turbine propulsion plant including the ship speed and gas turbine shaft torque. The goal is to predict the lever position (Coraddu et al., 2016). Parameter λ is set to 10 3 for all proposed algorithms OMKL-GF, OMKL-SFG and OMKL-SFG-R. Generating the bipartite graph Bt at every time instant can increase the computational complexity of the proposed OMKL-GF while it cannot improve MSE considerably. To further decrease the computational complexity of our proposed OMKL-GF, we terminate generating Bt after 300 time instants, meaning that Bt = B300 if t > 300. The number of selective nodes for OMKL-GF is set to be 2. Furthermore, the feedback graph G in Algorithm 4 is generated with the divergence function ( , ) defined using Bregman divergence in (25) when F(ω) = ω 2. The greedy set cover algorithm by Chvatal (1979) is utilized to find the dominating set D of the feedback graph G. Moreover, in order to speed up OMKL-SFG and OMKL-SFG-R, after 300 time instants, OMKL-SFG-R chooses It = arg maxi ui,t Ut . All experiments were carried out using Intel(R) Core(TM) i7-10510U CPU Ghari and Shen @ 1.80 GHz 2.30 GHz processor with a 64-bit Windows operating system. Codes are available at https://github.com/pouyamghari/Graph-Aided-Online-Multi-Kernel-Learning. 5.1 Number of Selected Kernels The present subsection studies the effect of the maximum number of selected kernels M on the performance of the proposed OMKL-GF, OMKL-SFG and OMKL-SFG-R. Figure 3 illustrates the MSE and its standard deviation of the proposed OMKL-GF, OMKL-SFG and OMKL-SFG-R with the change in the number of selected kernels. The standard deviation and the MSE are obtained over R = 20 sets of i.i.d random features (c.f. (50)). Figure 3 shows the advantage of data-driven kernel selection by OMKL-GF since increasing M from M = 10 to M = 20 does not result in lower MSE for Airfoil, Bias and Naval datasets. Figure 3 indicates that for OMKL-SFG and OMKL-SFG-R choosing moderate value for M such as M = 10 obtains MSE comparatively close to the MSE associated with optimal M. Figure 3 illustrates that OMKL-GF can obtain lower MSE than OMKL-SFG although OMKL-SFG achieves tighter regret upper bound than that of OMKL-GF. Regret upper bounds deal with worst cases. According to Theorems 2 and 5, the worst cases for OMKL-GF and OMKL-SFG happen when the probability of observing the loss of the best kernel (i.e. qj ,t where j defined in Theorem 5) is close to its minimum value for almost all t. For both algorithms qj ,t is close to its minimum value for almost all t if the probability of choosing the best kernel for function approximation is small for all t, which is unlikely to happen although it is not impossible. In addition, both OMKL-GF and OMKL-SFG choose a subset of kernels using a trade-offbetween exploitation and exploration. OMKL-SFG performs more exploration in choosing a subset of kernels than OMKL-GF in the sense that using OMKL-SFG lower bound for qj ,t is larger than that of OMKL-GF (see (42) and (43)). OMKL-GF performs more exploitation than OMKL-SFG since using OMKL-GF the structure of the graph is changing every time instant to enable OMKL-GF to choose optimal subset of kernels while using OMKL-SFG the structure of the graph is fixed and independent of prior loss observations. Results in Figure 3 show that the proper selection of M along with data-driven kernel selection can enable OMKL-GF to choose a more powerful subset of kernels than that of OMKL-SFG which leads to better accuracy of OMKL-GF. Moreover, since using OMKL-SFG-R, qi,t β, choosing β = O(1), it can be concluded that qj ,t O(1) when OMKL-SFG-R is employed. Therefore, OMKL-SFG-R performs more exploration than OMKL-SFG which leads to tighter regret upper bound for OMKL-SFG-R. However, employing OMKL-SFG-R increases the probability that the kernels with comparatively poor prior performance being among the chosen subset of kernels. This is against exploitation and degrades the MSE performance of OMKL-SFG-R compared to OMKL-SFG. Furthermore, Figure 4 depicts that the run time of the proposed graph-aided OMKL algorithms with the change in the number of selected kernels. Figure 4 shows that a larger M leads to longer run time which is expected as larger M increases the complexity. Moreover, numerical results associated with Figures 3 and 4 are also presented in Tables 3, 4, 5 in Appendix A. 5.2 MSE and Run Time Performance Compared to Baselines The performance of the proposed algorithms OMKL-GF, OMKL-SFG and OMKL-SFG-R are compared with the following kernel learning benchmarks: Graph-Aided Online Multi-Kernel Learning (a) Airfoil dataset. (b) Bias dataset. (c) Concrete dataset. (d) Naval dataset. Figure 3: MSE and standard deviation performance of Graph-aided MKL algorithms on real datasets with the change in the number of selected kernels. (a) Airfoil dataset. (b) Bias dataset. (c) Concrete dataset. (d) Naval dataset. Figure 4: Run Time performance of Graph-aided MKL algorithms on real datasets with the change in the number of selected kernels. Ghari and Shen Table 1: MSE( 10 3) of MKL algorithms on real datasets. Algorithms Airfoil Bias Concrete Naval OMKR 32.68 15.54 41.72 9.22 RBF-1 33.17 15.96 41.73 11.99 POLY-2 355.67 23.36 50.06 90.06 RFOMKR 350.52 405.44 210.42 347.06 Raker 28.64 12.70 35.22 11.35 OMKL-GF 25.73 8.15 34.45 5.11 OMKL-SFG 32.49 12.06 37.75 5.35 OMKL-SFG-R 33.99 13.24 38.58 7.89 Table 2: Run time(s) of MKL algorithms on real datasets. Algorithms Airfoil Bias Concrete Naval OMKR 889.36 80010.03 547.93 163688.82 RBF-1 14.39 3914.67 6.60 1499.78 POLY-2 7.22 195.19 3.33 951.93 RFOMKR 2.17 27.72 1.53 18.37 Raker 3.81 46.28 2.71 31.24 OMKL-GF 2.56 17.01 2.06 10.63 OMKL-SFG 1.65 13.58 1.34 9.00 OMKL-SFG-R 2.81 17.86 2.24 11.45 OMKR: online multi-kernel regression approach proposed by Sahoo et al. (2014). RBF-1: online single kernel regression approach (Sahoo et al., 2014) using a radial basis function (RBF) with bandwidth of 1. POLY-2: online single kernel regression approach (Sahoo et al., 2014) using a polynomial kernel with degree of 2. RFOMKR: online multi-kernel learning approach utilizes RF approximation (Sahoo et al., 2019). Raker: RF-based online multi-kernel learning (Shen et al., 2019). The maximum number of kernels chosen by OMKL-GF at each time instant is 10. In addition, for OMKL-SFG, to determine the value of γi for each vertex vi V, the number of out-neighbors for each node is set to be 10. For OMKL-SFG-R, at each time, β is set to β = (1 ξ) u[10,t] + ξ N where u[10,t] denote the tenth greatest value in the sequence {ui,t Ut }N i=1. Tables 1 and Table 2 list MSE and run time performance of alternative algorithms on real datasets, respectively. It can be observed from Table 1 that the proposed OMKL-GF significantly outperforms all benchmark algorithms, which corroborate the effectiveness of Graph-Aided Online Multi-Kernel Learning (a) Airfoil dataset. (b) Bias dataset. (c) Concrete dataset. (d) Naval dataset. Figure 5: MSE performance of MKL algorithms on real datasets. data-driven feedback graph based kernel pruning. Furthermore, MSEs reported in Table 1 indicates that the accuracy of OMKL-SFG is comparable with that of Raker which employs all kernels in the dictionary while OMKL-SFG chooses a subset of kernels at each time instant. Table 2 shows that OMKL-GF and OMKL-SFG are more efficient than all other alternatives, since thanks to the graph-aided pruning only a subset of kernels instead of all kernels in the dictionary are employed at each time instant. In addition, OMKL-SFG is the fastest due to the offline graph construction. Although OMKL-SFG-R enjoys tighter sub-linear regret than those of OMKL-GF and OMKL-SFG by including a larger number kernels in the selected subset, employing OMKL-SFG-R requires more computation and increases the run time which can be inferred from the results in the Table 2. Figure 5 illustrates the MSE of OMKR, Raker and proposed algorithms over time. It can be seen that as time goes, performance gain of OMKL-GF becomes more remarkable. This confirms the effectiveness of the data-driven kernel selection in a sense that the proposed OMKL-GF learns the optimal subset of kernels in the dictionary on the fly . 5.3 Regret Performance The present section presents the regret performance of the proposed OMKL-GF, OMKL-SFG and OMKL-SFG-R. The maximum number of kernels chosen by OMKL-GF at each time instant is 10. In addition, using OMKL-SFG, at each time instant 10 kernels are chosen to perform the function approximation task. For OMKL-SFG-R, at each time, β is set to β = (1 ξ) u[10,t] + ξ N . Figure 6 illustrates the regret of the proposed algorithms over time. Ghari and Shen (a) Airfoil dataset. (b) Bias dataset. (c) Concrete dataset. (d) Naval dataset. Figure 6: Regret of proposed OMKL algorithms on real datasets. 6. Conclusion This paper develops online multi-kernel learning algorithms for non-linear function learning. By constructing a bipartite feedback graph at every time instant, OMKL-GF chooses a subset of kernels to both prune irrelevant kernels and decrease the computational complexity. It is proved that OMKL-GF can obtain regret of O(T 3 4 ). To further alleviate the computational burden of multi-kernel learning, a feedback graph is constructed in an offline fashion based on the similarities among kernels. Using the similarity-based feedback graph, a subset of kernels is chosen and the resulting algorithm is called OMKL-SFG. It is proved that OMKL-SFG can achieve sub-linear regret of O(T 2 3 ). Furthermore, refining the similarity-based feedback graph structure at each time instant, OMKL-SFG-R is proposed, which enjoys sub-linear regret of O( T). Moreover, experiments on real datasets demonstrate that by choosing a subset of kernels, OMKL-GF can obtain lower MSE in comparison with other online kernel learning algorithms including OMKR and Raker. Furthermore, experiments show that OMKL-GF and OMKL-SFG have considerably lower run time compared to online multi-kernel learning algorithms OMKR and Raker. Acknowledgement This work is supported by NSF ECCS 2207457. PI: Yanning Shen (yannings@uci.edu). Noga Alon, Nicol o Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Graph-Aided Online Multi-Kernel Learning Journal on Computing, 46(6):1785 1826, 2017. Armen S. Asratian, Tristan M. J. Denley, and Roland H aggkvist. Bipartite Graphs and Their Applications. Cambridge University Press, 1998. Peter Auer, Nicol o Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48 77, Jan 2003. Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with bregman divergences. Journal of Machine Learning Research, 6(58):1705 1749, 2005. Yoshua Bengio, Olivier Delalleau, and Nicolas L. Roux. The curse of highly variable functions for local kernel machines. In Advances in Neural Information Processing Systems, pages 107 114, May 2006. Pantelis Bouboulis, Symeon Chouvardas, and Sergios Theodoridis. Online distributed learning over networks in rkh spaces using random fourier features. IEEE Transactions on Signal Processing, 66(7):1920 1932, Apr 2018. L.M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200 217, 1967. Thomas F. Brooks, D. Stuart Pope, and Micheal A. Marcolini. Airfoil self-noise and prediction. Technical report, Jul 1989. Serhat S. Bucak, Rong Jin, and Anil K. Jain. Multiple kernel learning for visual object recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1354 1369, Jul 2014. Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, USA, 2006. Dongjin Cho, Cheolhee Yoo, Jungho Im, and Dong-Hyun Cha. Comparative assessment of various machine learning-based bias correction methods for numerical weather prediction model forecasts of extreme air temperatures in urban areas. Earth and Space Science, 7 (4), Mar 2020. Wesley Chung, Somjit Nath, Ajin Joseph, and Martha White. Two-timescale networks for nonlinear value function approximation. In International Conference on Learning Representations, May 2019. Vasek Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233 235, Aug 1979. Alon Cohen, Tamir Hazan, and Tomer Koren. Online learning with feedback graphs without the graphs. In Proceedings of International Conference on Machine Learning, page 811 819, Jun 2016. Ghari and Shen Andrea Coraddu, Luca Oneto, Aessandro Ghio, Stefano Savio, Davide Anguita, and Massimo Figari. Machine learning approaches for improving condition-based maintenance of naval propulsion plants. Proceedings of the Institution of Mechanical Engineers, Part M: Journal of Engineering for the Maritime Environment, 230(1):136 153, 2016. Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. L2 regularization for learning kernels. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, page 109 116, Jun 2009. Liang Ding, Rui Tuo, and Shahin Shahrampour. Generalization guarantees for sparse kernel approximation with entropic optimal features. In Proceedings of the International Conference on Machine Learning, volume 119, pages 2545 2555, Jul 2020. Yi Ding, Chenghao Liu, Peilin Zhao, and Steven C. H. Hoi. Large scale kernel methods for online AUC maximization. In IEEE International Conference on Data Mining, pages 91 100, Nov 2017. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. Lixin Duan, Ivor W. Tsang, and Dong Xu. Domain transfer multiple kernel learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):465 479, Mar 2012. Pouya M Ghari and Yanning Shen. Online multi-kernel learning with graph-structured feedback. In Proceedings of the International Conference on Machine Learning, volume 119, pages 3474 3483, Jul 2020. Pouya M. Ghari and Yanning Shen. Online learning with probabilistic feedback. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4183 4187, May 2022. Mehmet G onen and Ethem Alpaydin. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(64):2211 2268, 2011. Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157 325, 2016. Steven C. H. Hoi, Rong Jin, Peilin Zhao, and Tianbao Yang. Online multiple kernel classification. Machine Learning, 90:289 316, Feb 2013. Marius Kloft, Ulf Brefeld, S oren Sonnenburg, and Alexander Zien. Lp-norm multiple kernel learning. Journal of Machine Learning Research, 12:953 997, Jul 2011. Jing Lu, Steven C.H. Hoi, Jialei Wang, Peilin Zhao, and Zhi-Yong Liu. Large scale online kernel learning. Journal of Machine Learning Research, 17(47):1 43, 2016. Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. In Proceedings of International Conference on Neural Information Processing Systems, pages 684 692, 2011. Graph-Aided Online Multi-Kernel Learning Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Proceedings of International Conference on Neural Information Processing Systems, pages 1177 1184, Dec 2007. Alain Rakotomamonjy, Francis R. Bach, St ephane Canu, and Yves Grandvalet. Simple MKL. Journal of Machine Learning Research, 9(83):2491 2521, 2008. Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Proceedings of International Conference on Neural Information Processing Systems, page 3218 3228, 2017. Doyen Sahoo, Steven C.H. Hoi, and Bin Li. Online multiple kernel regression. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, page 293 302, 2014. Doyen Sahoo, Steven C. H. Hoi, and Bin Li. Large scale online multiple kernel regression with application to time-series prediction. ACM Transactions on Knowledge Discovery from Data, 13(1), Jan 2019. Bernhard Scholkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001. Shahin Shahrampour and Vahid Tarokh. Learning bounds for greedy approximation with explicit feature maps from multiple kernels. In Proceedings of the International Conference on Neural Information Processing Systems, page 4695 4706, Dec 2018. John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004. Yanning Shen, Tianyi Chen, and Georgios B. Giannakis. Random feature-based online multi-kernel learning in environments with unknown dynamics. Journal of Machine Learning Research, 20(1):773 808, Jan 2019. S oren Sonnenburg, Gunnar R atsch, Christin Sch afer, and Bernhard Sch olkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531 1565, Dec 2006. Bharath K. Sriperumbudur and Zolt an Szab o. Optimal rates for random fourier features. In Proceedings of the International Conference on Neural Information Processing Systems, page 1144 1152, Dec 2015. Grace Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, 1990. Christopher K. I. Williams and Matthias Seeger. Using the nystr om method to speed up kernel machines. In Proceedings of the International Conference on Neural Information Processing Systems, page 661 667, Jan 2000. I-Cheng Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797 1808, Dec 1998. Ghari and Shen Table 3: MSE( 10 3) and run time of OMKL-GF with different M on real datasets. MSE( 10 3) Run time (s) Datasets M = 1 M = 5 M = 10 M = 15 M = 20 M = 1 M = 5 M = 10 M = 15 M = 20 Airfoil 49.53 24.95 25.73 26.47 26.91 0.70 1.53 2.56 3.40 4.30 Bias 12.56 7.11 8.15 8.93 9.79 5.44 11.24 17.01 23.17 27.74 Concrete 79.27 34.58 34.45 34.04 33.84 0.55 1.27 2.06 2.82 3.65 Naval 14.60 4.78 5.11 5.69 6.06 3.58 7.61 10.63 14.16 18.11 Table 4: MSE( 10 3) and run time of OMKL-SFG with different M on real datasets. MSE( 10 3) Run time (s) Datasets M = 1 M = 5 M = 10 M = 15 M = 20 M = 1 M = 5 M = 10 M = 15 M = 20 Airfoil 51.03 33.49 32.49 31.82 30.89 1.17 1.36 1.65 1.98 2.27 Bias 12.35 11.78 12.06 12.11 12.13 6.58 9.94 13.57 17.49 21.26 Concrete 79.60 48.03 37.75 33.83 32.59 1.03 1.15 1.34 1.60 1.79 Naval 8.15 6.18 5.35 4.85 4.68 4.45 6.22 9.00 11.69 14.09 Xiao Zhang and Shizhong Liao. Incremental randomized sketching for online kernel learning. In Proceedings of International Conference on Machine Learning, pages 7394 7403, Jun 2019. Appendix A. Additional Experimental Results This section presents more detailed results on MSE and Run time of proposed algorithms with the change in the number of chosen kernels. Table 3 lists the MSE and run time of OMKL-GF with the change in the value of M which is the maximum number of kernels selected at each time instant by OMKL-GF. For OMKL-SFG and OMKL-SFG-R, the value of γi is chosen such that for each vertex vi V, the number of out-neighbors for each node is M. Tables 4 and 5 show both MSE and run time of OMKL-SFG and OMKL-SFG-R respectively with different values of M. Appendix B. Proof of Lemma 1 In order to prove Lemma 1, we first establish the following lemma as a step stone. Lemma 7 Let ˆf RF,i(.) denote the sequence of estimates generated with a preselected kernel κi where Fi = { ˆfi| ˆfi(x) = θ zi(x), θ R2D}. Then, under assumptions (as1) and (as2) Graph-Aided Online Multi-Kernel Learning Table 5: MSE( 10 3) and run time of OMKL-SFG-R with different M on real datasets. MSE( 10 3) Run time (s) Datasets M = 1 M = 5 M = 10 M = 15 M = 20 M = 1 M = 5 M = 10 M = 15 M = 20 Airfoil 106.66 38.03 33.99 32.42 31.21 2.28 2.32 2.81 2.97 3.18 Bias 48.81 14.77 13.24 12.78 12.57 10.87 13.98 17.86 21.13 25.13 Concrete 110.08 50.96 38.58 34.32 32.85 1.91 2.06 2.24 2.51 2.63 Naval 95.37 14.32 7.89 6.03 5.33 6.77 8.55 11.45 13.65 15.84 the following bound holds t=1 L( ˆf RF,i(xt), yt) t=1 L( ˆf i (xt), yt) θ i 2 where L is the Lipschitz constant in (as2) and θ i is the parameter vector associated with the best estimator ˆf i (x) = (θ i ) zi(x). Proof For θi,t+1 and any fixed θ, it can be written that θi,t+1 θ 2 = θi,t η ℓi,t θ 2 = θi,t θ 2 2η ℓi,t(θi,t θ) + η ℓi,t 2. (52) Furthermore, from the convexity of the loss function with respect to θ in (as1), we can conclude that L(θ i,tzi(xt), yt) L(θ zi(xt), yt) L(θ i,tzi(xt), yt)(θi,t θ). (53) Therefore, from (53), it can be inferred that L(θ i,tzi(xt), yt) qi,t L(θ zi(xt), yt) 1i St L(θ i,tzi(xt), yt) qi,t (θi,t θ)1i St. (54) Based on (35), (54) is equivalent to ℓi,t L(θ zi(xt), yt) qi,t 1i St ℓi,t(θi,t θ). (55) Combining (52) with (55), we get ℓi,t L(θ zi(xt), yt) qi,t 1i St θi,t θ 2 θi,t+1 θ 2 2 ℓi,t 2. (56) Taking the expectation of ℓi,t and ℓi,t 2 with respect to 1i St, we arrive at Et[ℓi,t] = X pj,t L(θ i,tzi(xt), yt) qi,t = L(θ i,tzi(xt), yt) (57a) Et[ ℓi,t 2] = X pj,t L(θ i,tzi(xt), yt) 2 q2 i,t = L(θ i,tzi(xt), yt) 2 qi,t . (57b) Ghari and Shen Therefore, taking the expectation with respect to 1i St from both sides of (56), we obtain L(θ i,tzi(xt), yt) L(θ zi(xt), yt) θi,t θ 2 θi,t+1 θ 2 2η + η L(θ i,tzi(xt), yt) 2 2qi,t . (58) Based on (as2), we have L(θ i,tzi(xt), yt) 2 L2. Therefore, summing (58) over time from t = 1 to t = T it can be concluded that t=1 L(θ i,tzi(xt), yt) t=1 L(θ zi(xt), yt) θ 2 θi,T+1 θ 2 2qi,t . (59) Putting θ = θ i in (59) and taking into account that θi,T+1 θ 2 0, we can write t=1 L(θ i,tzi(xt), yt) t=1 L(θ zi(xt), yt) θ i 2 which completes the proof of Lemma 7. Lemma 8 Under (as1) and (as2), the following holds t=1 Et[L( ˆf RF(xt), yt)] t=1 L( ˆf RF,i(xt), yt) ln N η + ηe JT + ηNT 2(1 ηe) (61) where η is the learning rate, ηe is the exploration rate, qi,t = PJ j=1 pj,t 1 (1 πij,t)M and N denotes the number of kernels. Proof Let Wt = PN n=1 wn,t. For any t we find j=1 pj,t Wt+1 Wt exp( ηℓi,t). (62) Based on (3), we have Wt = πij,t ηj e N 1 ηj e , j {1, ..., J}. (63) Combining (62) with (63) obtains πij,t ηj e N 1 ηj e exp( ηℓi,t). (64) Graph-Aided Online Multi-Kernel Learning Using the inequality e x 1 x + 1 2x2, x 0, (64) leads to πij,t ηj e N 1 ηj e 1 ηℓi,t + 1 2(ηℓi,t)2 . (65) Taking logarithm from both sides of inequality (65), and use the fact that 1 + x ex, we have πij,t ηj e N 1 ηj e 2(ηℓi,t)2 . (66) Summing (66) over t from 1 to T results in πij,t ηj e N 1 ηj e 2(ηℓi,t)2 . (67) Furthermore, recall the updating rule of wi,T+1 in (37), for any i we have W1 ln wi,T+1 t=1 ηℓi,t. (68) Combining (67) with (68) results in πij,t 1 ηj e (ηℓi,t) ηj e N 1 ηj e (ηℓi,t) + πij,t ηj e N 1 ηj e 2(ηℓn,t)2 . (69) Multiplying both sides by 1 ηJ e η , we arrive at i=1 πij,t 1 ηJ e 1 ηj e ℓi,t t=1 (1 ηJ e )ℓi,t 1 ηJ e η ln N + ηj e(1 ηJ e ) N(1 ηj e) ℓi,t (1 ηJ e )(πij,t ηj e N ) 2ℓ2 i,t). (70) Also, using the fact that 0 < ηe 1 we can conclude that 1 ηJ e < 1 and for all j 1, ηj e ηe, the RHS of (70) can be upper bounded by 1 ηJ e η ln N + ηj e(1 ηJ e ) N(1 ηj e) ℓi,t + (1 ηJ e )(πij,t ηj e N ) ηe(1 ηJ e ) N(1 ηe) ℓi,t + πij,t ηj e N 1 ηj e (η 2ℓ2 i,t). (71) Ghari and Shen Since 1 ηJ e = (1 ηe)(1 + . . . + ηJ 1 e ) and ηe 1, the following bound holds for the second term on the RHS of (71) ηe(1 ηJ e ) N(1 ηe) ℓi,t = ηe(1 + . . . + ηJ 1 e ) N ℓi,t N ℓi,t. (72) Meanwhile, as ηJ e ηj e for all j, 1 j J, the LHS of (70) can be bounded by i=1 πij,t 1 ηJ e 1 ηj e ℓi,t t=1 (1 ηJ e )ℓi,t i=1 πij,tℓi,t t=1 ℓi,t. (73) Combining (70), (71), (72) and (73), we can conclude that i=1 πij,tℓi,t πij,t ηj e N 1 ηj e (η 2ℓ2 i,t). (74) Recall the probability of observing the loss of the i-th kernel at time t given in (21), the expected first and second moments of ℓi,t in (16) given the losses incurred up to time instant t 1, i.e., {L( ˆf RF(xτ), yτ)}t 1 τ=1 can be written as j=1 pj,t 1 (1 πij,t)M L( ˆf RF,i(xt), yt) qi,t = L( ˆf RF,i(xt), yt) (75a) E[ℓ2 n,t] = j=1 pj,t 1 (1 πij,t)M L2( ˆf RF,i(xt), yt) q2 i,t = L2( ˆf RF,i(xt), yt) qi,t . (75b) Based on (75b), the third term in the right hand side of (74) can be bounded as follows πij,t ηj e N 1 ηj e (η πij,t ηj e N 1 ηj e ( η 2qi,t ). (76) Taking the expected value of (74) at each time t given{L( ˆf RF(xτ), yτ)}t 1 τ=1 and combining with (75a) and (76) we have i=1 πij,t L( ˆf RF,i(xt), yt) t=1 L( ˆf RF,i(xt), yt) N L( ˆf RF,i(xt), yt) + πij,t ηj e N 1 ηj e ( η 2qi,t ). (77) Graph-Aided Online Multi-Kernel Learning Since πij,t ηj e N (1 ηj e)qi,t πij,t (1 ηe)qi,t , replace πij,t ηj e N qi,t(1 ηj e) by πij,t (1 ηe)qi,t , the inequality in (77) still holds i=1 πij,t L( ˆf RF,i(xt), yt) t=1 L( ˆf RF,i(xt), yt) N L( ˆf RF,i(xt), yt) + η 2(1 ηe) qi,t . (78) Moreover, based on (21), the probability qi,t can be bounded from below as j=1 pj,t 1 (1 πij,t)M j=1 pj,tπij,t 1 + (1 πij,t) + . . . + (1 πij,t)M 1 > j=1 pj,tπij,t (79) Therefore, according to (79), for the third term in the right hand side of (78) we have qi,t < ηNT 2(1 ηe). (80) Furthermore, based on that L( ˆf RF,i(xt), yt) 1 in (as2), the following inequality holds N L( ˆf RF,i(xt), yt) N = ηe JT. (81) From (78), (80) and (81), we can conclude that n=1 πij,t L( ˆf RF,i(xt), yt) t=1 L( ˆf RF,i(xt), yt) ln N η + ηe JT + ηNT 2(1 ηe). (82) According to the procedure of generating the graph Bt which is presented in Algorithm 2, for each selective node vs,j a subset of kernels is chosen using PMF πij,t in M independent trials. In fact, a subset of kernels is assigned to each node vs,j in M independent trials and in each trial one kernel is assigned and its associated entry in the sub-adjacency matrix A becomes 1. Now, let bi represents the frequency that the i-th kernel is chosen in M independent trials. Thus, {bi}N i=1 can be viewed as the solution to the following linear equation b1 + . . . + b N = M, s.t. bi 0, bi N (83) where N denotes the set of natural numbers. There are N+M 1 N different solutions for (83). Let, {bi,k}N i=1 denotes the k-th set of solution for (83). Based on Jensen s inequality, for the Ghari and Shen expected value of L( ˆf RF(xt), yt) we have Et[L( ˆf RF(xt), yt)] = (N+M 1 N ) X i=1 (πij,t)bi,k ! i St wi,t ˆf RF,i(xt), yt) (N+M 1 N ) X i=1 (πij,t)bi,k ! X i St wi,t L( ˆf RF,i(xt), yt). (84) Also, considering (84) and the fact that wi,t 1, we can conclude that Et[L( ˆf RF(xt), yt)] (N+M 1 N ) X i=1 (πij,t)bi,k ! X i St wi,t L( ˆf RF,i(xt), yt) (N+M 1 N ) X i=1 (πij,t)bi,k ! X i St L( ˆf RF,i(xt), yt). (85) Note that the number of ways to solve (83) when the i-th kernel is chosen for at least one time equals to the number of ways to solve the following problem b1,i + . . . + b N,i = M 1, s.t. bm,i 0, bm,i N. (86) There are N+M 2 N different solutions for (86). Let { b(k) m,i}N i=1 denotes k-th set of solution for (86). Therefore, based on this, from (85) we can conclude the following equality (N+M 1 N ) X i=1 (πij,t)bi,k ! X i St L( ˆf RF,i(xt), yt) (N+M 2 N ) X m=1 (πmj,t) b(k) m,i ! L( ˆf RF,i(xt), yt) (87) where P(N+M 2 N ) k=1 QN m=1(πmj,t) b(k) m,i is the total probability of all N+M 2 N possible solu- tions of (86). Therefore, P(N+M 2 N ) k=1 QN m=1(πmj,t) b(k) m,n = 1. Substituting (87) into (84), we obtain Et[L( ˆf RF(xt), yt)] i=1 πij,t L( ˆf RF,i(xt), yt). (88) Combining (82) with (88) leads to t=1 Et[L( ˆf RF(xt), yt)] t=1 L( ˆf RF,i(xt), yt) ln N η + ηe JT + ηNT 2(1 ηe) (89) Graph-Aided Online Multi-Kernel Learning which concludes to proof of Lemma 8. From Lemma 7 and Lemma 8, we conclude that for any i : 1 i N we have t=1 Et[L( ˆf RF(xt), yt)] t=1 L( ˆf i (xt), yt) 2η + ηe JT + ηNT 2(1 ηe) + η qi,t . (90) From (79) and the facts that pj,t > ηe J and πij,t > ηj e N , the following inequality can be concluded j=1 pj,tπij,t > p1,tπi1,t > η2 e NJ . (91) Therefore, we find qi,t > η2 e NJ , i : 1 i N, t : 1 t T. Combining (90) and (91) we can conclude that t=1 Et[L( ˆf RF(xt), yt)] t=1 L( ˆf i (xt), yt) 2η + ηe JT + ηNT 2(1 ηe) + ηL2NJT 2η2e . (92) Hence, Lemma 1 is proved. Appendix C. Proof of Theorem 2 To prove Theorem 2, the following lemma is exploited. Lemma 9 For the optimal function estimator in Hi expressed as f i (x) := PT t=1 α i,tκi(x, xt) and its RF-based approximant ˆf i (x, xt) = PT t=1 α i,tz i (x)zi(xt), the following bound holds with probability at least 1 28(σi ϵ )2 exp( Dϵ2 t=1 L( ˆf i (xt), yt) t=1 L(f i (xt), yt) where the equality happens if we have C := maxi PT t=1 |α i,t|. Proof For a given shift invariant kernel κi, the maximum point-wise error of the random feature kernel approximant is uniformly bounded with probability at least 1 28(σi ϵ )2 exp( Dϵ2 4d+8), by Rahimi and Recht (2007) sup xj,xk X |z i (xj)zi(xk) κi(xj, xk)| < ϵ (94) Ghari and Shen where σ2 i is the second moment of the Fourier transform of κi. Therefore, under (a3) this implies that supxτ,xt X z i (xτ)zi(xt) 1 + ϵ holds with probability at least 1 28(σi ϵ )2 exp( Dϵ2 4d+8). Let C := maxn PT t=1 |α n,t|. Hence, θ j 2 can be bounded from above as t=1 α j,tzj(xt) 2 | τ=1 α j,tα j,τz j (xt)zj(xτ)| (1 + ϵ)C2 (95) with probability at least 1 28(σi ϵ )2 exp( Dϵ2 4d+8). Moreover, using the triangle inequality yields t=1 L( ˆf j (xt), yt) t=1 L(f j (xt), yt) L( ˆf j (xt), yt) L(f j (xt), yt) . (96) According to Lipschitz continuity of the loss function, it can be concluded that L( ˆf j (xt), yt) L(f j (xt), yt) τ=1 α j,τz j (xτ)zn(xt) τ=1 α j,τκj(xτ, xt) Using the Cauchy-Schwartz inequality, we can write τ=1 α j,τz j (xτ)zn(xt) τ=1 α j,τκj(xτ, xt) τ=1 |α j,τ| z j (xτ)zj(xt) κj(xτ, xt) . (98) Using (94) and (98), we can conclude that the inequality t=1 L( ˆf j (xt), yt) t=1 L(f j (xt), yt) holds with probability at least 1 28(σi ϵ )2 exp( Dϵ2 Combining Lemma 1 with Lemma 9 and (95), it can be concluded that the following bound holds with probability at least 1 28(σn ϵ )2 exp( Dϵ2 t=1 Et[L( ˆf RF(xt), yt)] t=1 L(f i (xt), yt) t=1 Et[L( ˆf RF(xt), yt)] t=1 L( ˆf i (xt), yt) + t=1 L( ˆf i (xt), yt) t=1 L(f i (xt), yt) 2η + ηe JT + ϵLTC + ηNT 2(1 ηe) + ηL2NJT which completes the proof of Theorem 2. Graph-Aided Online Multi-Kernel Learning Appendix D. Proof of Lemma 4 According to (2), we obtain Z | ˆfi(x) ˆfj(x)|2dx = 1 t=1 αi,tκi(x, xt) t=1 αj,tκj(x, xt)|2dx. (101) Applying Arithmetic Mean-Geometric Mean inequality on the right hand side of (101), we find 1 Ud Z | ˆfi(x) ˆfj(x)|2dx t=1 αi,t (κi(x, xt) κj(x, xt))|2 + | t=1 (αj,t αi,t)κj(x, xt)|2 ! Using Cauchy-Schwartz inequality, (102) can be further relaxed to Z | ˆfi(x) ˆfj(x)|2dx t=1 |αi,t|2)( t=1 |κi(x, xt) κj(x, xt)|2)dx t=1 |αj,t αi,t|2)( t=1 |κj(x, xt)|2)dx. (103) Considering the fact that Cm := maxi PT t=1 |αi,t|2, from (103) it can be written that Z | ˆfi(x) ˆfj(x)|2dx Z |κi(x, xt) κj(x, xt)|2dx + 4Cm Z |κj(x, xt)|2dx. (104) Furthermore, based on (104) and the fact that |κj(x, xt)|2 1, we can infer that Z | ˆfi(x) ˆfj(x)|2dx 2Cm t=1 ( S(κi, κj) + 2Ud) (105) which proves Lemma 4. Appendix E. Proof of Theorem 5 Furthermore, in order to prove Theorem 5, the following intermediate Lemma is also proved. Lemma 10 The following inequality holds under (as1) and (as2) t=1 Et[L( ˆf RF(xt), yt)] t=1 L( ˆf RF,i(xt), yt) ln N η + η(1 + N where Li( ˆf RF(xt), yt) denote the loss of function approximation when vi is drawn. Ghari and Shen Proof For any t, we can write Ut exp( ηˆℓi,t). (107) Based on (33), we have ui,t Ut = pi,t ξ |D|1i D 1 ξ and as a result (107) can be rewritten as pi,t ξ |D|1i D 1 ξ exp( ηˆℓi,t). (108) Using the inequality e x 1 x + 1 2x2, x 0 and (108), it can be concluded that pi,t ξ |D|1i D 1 ξ 1 ηˆℓi,t + 1 2(ηˆℓi,t)2 . (109) Taking logarithm from both sides of inequality (109), and use the fact that 1 + x ex, we arrive at pi,t ξ |D|1i D 1 ξ 2(ηˆℓi,t)2 . (110) Summing (110) over t leads to pi,t ξ |D|1i D 1 ξ 2(ηˆℓi,t)2 . (111) Furthermore, ln UT +1 U1 can be bounded from below as U1 ln ui,T+1 t=1 ηˆℓi,t ln N (112) for any i such that 1 i N. Combining (111) with (112), we obtain pi,tη 1 ξ ˆℓi,t ηξ1i D (1 ξ)|D| ˆℓi,t + pi,t ξ |D|1i D 1 ξ 2(ηˆℓi,t)2 . (113) Multiplying both sides by 1 ξ η it can be concluded that i=1 pi,tˆℓi,t |D| ˆℓi,t + η(pi,t ξ |D|1i D) 2 ˆℓ2 i,t. (114) Graph-Aided Online Multi-Kernel Learning In addition, taking the expectation of ˆℓi,t and ˆℓ2 i,t, we get Et[ˆℓi,t] = pi,t L( ˆf RF,i(xt), yt) pi,t = L( ˆf RF,i(xt), yt) (115a) Et[ˆℓ2 i,t] = pi,t L2( ˆf RF,i(xt), yt) p2 i,t = L2( ˆf RF,i(xt), yt) pi,t (115b) Thus, taking the expectation from both sides of (114) leads to i=1 pi,t L( ˆf RF(xt), yt) t=1 L( ˆf RF,i(xt), yt) |D| L( ˆf RF(xt), yt) + η(pi,t ξ |D|1i D) 2pi,t . (116) Taking into account that L( ˆf RF(xt), yt) 1 and based on (116) we can write i=1 pi,t L( ˆf RF(xt), yt) t=1 L( ˆf RF,i(xt), yt) η(pi,t ξ |D|1i D) 2pi,t . (117) Moreover, using (117) and the fact that pi,t 1, it can be concluded that i=1 pi,t L( ˆf RF(xt), yt) t=1 L( ˆf RF,i(xt), yt) ln N η + (ξ + ηN 2 )T. (118) Furthermore, the expected loss incurred by OMKL-SFG given observed losses in prior time instants can be expressed as Et[L( ˆf RF(xt), yt)] = i=1 pi,t L( X wj,t P k Nout i,t wk,t ˆf RF,j(xt), yt) i=1 pi,t L( ˆf RF(xt), yt). (119) Therefore, from (118) and (119), it can be inferred that t=1 Et[L( ˆf RF(xt), yt)] t=1 L( ˆf RF,i(xt), yt) ln N η + (ξ + ηN which establishes the Lemma 10. Furthermore, to prove Theorem 5, we prove the following Lemma. Ghari and Shen Lemma 11 For any vi V and any j Nout i , it can be written that t=1 Li( ˆf RF(xt), yt) t=1 L( ˆf RF,j(xt), yt) ln |Nout i | η + η 1 qi,t (121) where 1 qi,t = P j Nout i wj,t qj,t Wi,t . Proof Let Wi,t = P j Nout i wj,t. For vi V we find wj,t Wi,t exp( ηℓj,t). (122) The following inequality can be obtained using the inequality e x 1 x + 1 2x2, x 0 as follows 1 ηℓj,t + 1 2(ηℓj,t)2 . (123) Taking the logarithm from both sides of (123) and using the inequality 1 + x ex, we get 2(ηℓj,t)2 . (124) Summing (124) over time, we obtain 2(ηℓj,t)2 . (125) Moreover, for any j Nout i , ln Wi,T +1 Wi,1 can be bounded from below as Wi,1 ln wj,T+1 t=1 ηℓj,t ln |Nout i | (126) Combining (125) with (126), it can concluded that wj,t Wi,t ℓj,t t=1 ℓj,t ln |Nout i | η + η wj,t Wi,t ℓ2 j,t. (127) For the expected value of ℓi,t and ℓ2 i,t, we have Et[ℓj,t] = X pk,t L(θ j,tzj(xt), yt) qj,t = L(θ j,tzj(xt), yt) (128a) Et[ℓ2 j,t] = X pk,t L2(θ j,tzj(xt), yt) q2 j,t = L2(θ j,tzj(xt), yt) qj,t (128b) Graph-Aided Online Multi-Kernel Learning Taking the expectation from (127), we get wj,t Wi,t L(θ j,tzj(xt), yt) t=1 L(θ j,tzj(xt), yt) ln |Nout i | η + η wj,t qj,t Wi,t . (129) Let 1 qi,t = P j Nout i wj,t qj,t Wi,t which is the weighted sum of 1 qj,t such that j Nout i . Furthermore, according to (34), the loss Li( ˆf RF(xt), yt) can be written as Li( ˆf RF(xt), yt) = L( X wj,t Wi,t ˆf RF,j(xt), yt). (130) Based on the Jensen s inequality Li( ˆf RF(xt), yt) can be bounded from above as Li( ˆf RF(xt), yt) X wj,t Wi,t L( ˆf RF,j(xt), yt). (131) Using (129) and (131), we can conclude that t=1 Li( ˆf RF(xt), yt) t=1 L(θ j,tzj(xt), yt) ln |Nout i | η + η 1 qi,t (132) which proves the Lemma 11. Combining Lemma 10 with Lemma 11, for any vj V and any i Nin j we obtain t=1 Et[L( ˆf RF(xt), yt)] t=1 L( ˆf RF,j(xt), yt) ln N|Nout i | η + (ξ + ηN 1 qi,t . (133) In addition, combining Lemma 7 with (133), for any vj V and any i Nin j we can write t=1 Et[L( ˆf RF(xt), yt)] t=1 L( ˆf j (xt), yt) ln N|Nout i | η + θ j 2 2η + (ξ + ηN qj,t ) (134) We use the above inequality as a step-stone to prove Theorem 5. Ghari and Shen Therefore, combining (134) with (95) and (99), it can be inferred that the following inequality t=1 Et[L( ˆf RF(xt), yt)] t=1 L(f j (xt), yt) ln N|Nout i | η + (1 + ϵ)C2 2η + ϵLTC + (ξ + ηN qj,t ) (135) holds for any vj V and any i Nin j with probability at least 1 28(σi ϵ )2 exp( Dϵ2 4d+8). This completes the proof of Theorem 5. Appendix F. Proof of Theorem 6 According to Theorem 5, the following inequality t=1 Et[L( ˆf RF(xt), yt)] t=1 L(f (xt), yt) ln N|Nout i | η + (1 + ϵ)C2 2η + ϵLTC + (ξ + ηN qj ,t ) (136) holds with probability at least 1 28(σj ϵ )2 exp( Dϵ2 4d+8) for any ϵ > 0 and any i Nin j . When G t is generated by Algorithm 6 as the feedback graph, D t is a dominating set of G t. Furthermore, k D t if pk,t β. Based on (36), if i Nin k,t (i.e. in-neighborhood of vk in G t), qk,t pi,t. Also, considering the condition β 1 N , it is ensured that D t is not an empty set. Moreover, each node vk in G t is out-neighbor of at least one node in D t. Thus, we can conclude that qk,t β, vk V. Hence, it can be written that wj,t Wi,tβ + L2 β ) = (L2 + 1)T Furthermore, since |Nout i | N, we have N|Nout i | N2. Combining (136) with (137), it can be inferred that in this case the stochastic regret of OMKL-SFG-R satisfies t=1 Et[L( ˆf RF(xt), yt)] t=1 L(f (xt), yt) η + (1 + ϵ)C2 2η + ϵLTC + (ξ + η 2 L2 + Nβ + 1 with probability at least 1 28(σj ϵ )2 exp( Dϵ2 4d+8) under (as1)-(as3) for any ϵ > 0 and any β (1 ξ) maxk uk,t N . This completes the proof of Theorem 6.