Journal of Machine Learning Research 22 (2021) 1-35. Submitted 1/20; Published 5/21.

COKE: Communication-Censored Decentralized Kernel Learning

Ping Xu (pxu3@gmu.edu), Yue Wang (ywang56@gmu.edu), Xiang Chen (xchen26@gmu.edu), Zhi Tian (ztian1@gmu.edu)
Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA 22030, USA

Editor: Corinna Cortes

Abstract

This paper studies the decentralized optimization and learning problem where multiple interconnected agents aim to learn an optimal decision function defined over a reproducing kernel Hilbert space by jointly minimizing a global objective function, with each agent having access only to its locally observed dataset. As a non-parametric approach, kernel learning faces a major challenge in distributed implementation: the decision variables of the local objective functions are data-dependent and thus cannot be optimized under the decentralized consensus framework without raw data exchange among agents. To circumvent this major challenge, we leverage the random feature (RF) approximation approach to enable consensus on the function modeled in the RF space by data-independent parameters across different agents. We then design an iterative algorithm, termed DKLA, for fast-convergent implementation via ADMM. Based on DKLA, we further develop a communication-censored kernel learning (COKE) algorithm that reduces the communication load of DKLA by preventing an agent from transmitting at every iteration unless its local update is deemed informative. Theoretical results in terms of linear convergence guarantees and generalization performance analysis of DKLA and COKE are provided. Comprehensive tests on both synthetic and real datasets are conducted to verify the communication efficiency and learning effectiveness of COKE.¹

Keywords: decentralized nonparametric learning, reproducing kernel Hilbert space, random features, ADMM, communication censoring
1. Introduction

Decentralized learning has attracted extensive interest in recent years, largely due to the explosion of data generated every day from mobile sensors, social media services, and other networked multi-agent applications (Worden and Manson, 2006; Ilyas et al., 2013; Facchinei et al., 2015; Demarie and Sabia, 2019). In many of these applications, the observed data are usually kept private at local sites without being aggregated to a fusion center, either due to the prohibitively high cost of raw data transmission or due to privacy concerns. Meanwhile, each agent in the network communicates only with its one-hop neighbors within its local area to save transmission power. Such localized data processing and transmission obviate the implementation of any centralized learning technique. Under this circumstance, this article focuses on the decentralized learning problem where a network of distributed agents aims to collaboratively learn a functional model describing the global data, with each agent having access only to its own locally observed dataset. To learn the functional model, which is often nonlinear and complex, nonparametric kernel methods are widely appreciated thanks to the kernel trick that makes some well-behaved linear learning algorithms applicable in a high-dimensional implicit feature space, without an explicit mapping from the data to that feature space (Shawe-Taylor et al., 2004; Hofmann et al., 2008; Pérez-Cruz and Bousquet, 2004).

¹ Preliminary results in this paper were presented in part at the 2019 IEEE Data Science Workshop (Xu et al., 2019).

©2021 Ping Xu, Yue Wang, Xiang Chen, and Zhi Tian. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-070.html.
However, in the absence of any raw data sharing or aggregation, it is challenging to directly apply kernel methods in a decentralized multi-agent setting and solve them under the consensus optimization framework using algorithms such as the decentralized alternating direction method of multipliers (ADMM) (Shi et al., 2014). This is because decentralized learning relies on solving local optimization problems and then aggregating the updates of the local decision variables over the network through one-hop communications in an iterative manner (Nedić et al., 2016). Unfortunately, the decision variables of the local objective functions resulting from the kernel trick are data-dependent and thus cannot be optimized in the absence of raw data exchange under the decentralized consensus framework. Several works apply kernel methods to decentralized learning for various applications under different settings (Predd et al., 2006; Mitra and Bhatia, 2014; Gao et al., 2015; Chouvardas and Draief, 2016; Shin et al., 2016, 2018; Koppel et al., 2018). These works, however, either assume that agents have access to their neighbors' observed raw data (Predd et al., 2006) or require agents to transmit their raw data to their neighbors (Koppel et al., 2018) to ensure consensus through collaborative learning. These assumptions may not be valid in many practical applications that involve users' private data. Moreover, standard kernel learning for big data faces the curse of dimensionality as the number of training examples increases (Shawe-Taylor et al., 2004). For example, in (Mitra and Bhatia, 2014; Chouvardas and Draief, 2016), the nonlinear function learned at each node is represented as a weighted combination of kernel functions centered on its locally observed data. As a result, each agent needs to transmit both the weights of the kernel functions and its local data to its neighbors at every iterative step to guarantee consensus on the common prediction function.
Thus, both the computation and communication resources are demanding in the distributed implementation. To alleviate the curse of dimensionality, Gao et al. (2015) and Koppel et al. (2018) have developed compression techniques such as data selection and sparse subspace projection, respectively, but these techniques typically incur considerable extra computation and still involve raw data exchange, with no alleviation of the data privacy concern. Furthermore, when computation is more affordable than communication in the big data scenario, the communication cost of iterative learning algorithms becomes the bottleneck for efficient distributed learning (McMahan et al., 2016). Therefore, it is crucial to design communication-efficient distributed kernel learning algorithms with data privacy protection.

1.1 Related work

This work lies at the intersection of non-parametric kernel methods, decentralized learning with batch-form data, and communication-efficient iterative implementation. Work related to these three subjects is reviewed below.

Centralized kernel learning. Centralized kernel methods assume data are collected and processed by a single server and are known to suffer from the curse of dimensionality in large-scale learning tasks. To mitigate their computational complexity, various dimensionality reduction techniques have been developed for both batch-form and online streaming learning, including stochastic approximation (Bucak et al., 2010; Gu et al., 2018), restricting the number of function parameters (Gomes and Krause, 2010; Wang et al., 2012; Zhang et al., 2013; Le et al., 2016; Koppel et al., 2017), and approximating the kernel during training (Honeine, 2015; Engel et al., 2004; Richard et al., 2008; Drineas and Mahoney, 2005; Dai et al., 2014; Lu et al., 2016; Sheikholeslami et al., 2018; Rahimi and Recht, 2008; Bâzâvan et al., 2012; Nguyen et al., 2017).
Among them, random feature (RF) mapping methods have gained popularity thanks to their ability to map large-scale data into an RF space of much reduced dimension by approximating the kernel with a fixed (small) number of random features, which circumvents the curse of dimensionality (Rahimi and Recht, 2008; Dai et al., 2014; Bâzâvan et al., 2012; Nguyen et al., 2017). Enforcing orthogonality on the random features can greatly reduce the kernel approximation error (Yu et al., 2016; Shen et al., 2018), and the learning performance of RF-based methods is evaluated in (Bach, 2017; Rudi and Rosasco, 2017; Li et al., 2018).

Decentralized kernel learning. In the decentralized kernel learning works most relevant to ours (Mitra and Bhatia, 2014; Gao et al., 2015; Chouvardas and Draief, 2016; Koppel et al., 2018), gradient descent is conducted locally at each agent to update its learning model, followed by diffusion-based information exchange among agents. However, these methods either assume that agents have access to their neighbors' observed raw data or require agents to transmit their raw data to their neighbors to ensure convergence of the prediction function. For the problem studied in this article, where the observed data are only locally available, these methods are not applicable since there are no common decision parameters on which to reach consensus without any raw data exchange. Moreover, these methods operate in the kernel space parameterized by the training data, and hence still encounter the curse of dimensionality when the local dataset grows large. Though data selection (Gao et al., 2015) and subspace projection (Koppel et al., 2018) are adopted to alleviate the curse of dimensionality, they typically require significant extra computational resources. RF mapping (Rahimi and Recht, 2008) offers a viable approach to overcome these issues by having all agents map their datasets of various sizes onto the same RF space. For instance, Bouboulis et al.
(2018) proposes a diffusion-based combine-then-adapt (CTA) method that achieves consensus on the model parameters in the RF space for the online learning problem, without the exchange of raw data. Though a batch-form counterpart of online CTA can be developed for offline learning, the convergence speed of the diffusion-based method is relatively slow compared with higher-order methods such as ADMM (Liu et al., 2019).

Communication-efficient optimization. Communication-efficient algorithms for decentralized optimization and learning problems have attracted attention as data movement among computing nodes becomes a bottleneck due to the high latency and limited bandwidth of decentralized networks. One way to reduce the communication cost is to transmit compressed information via quantization (Zhu et al., 2016; Alistarh et al., 2017; Zhang et al., 2019) or sparsification (Stich et al., 2018; Alistarh et al., 2018; Wangni et al., 2018; Harrane et al., 2018). However, these methods only reduce the required bandwidth at each communication round, not the number of rounds or the number of transmissions. Alternatively, some works randomly select a number of nodes for broadcasting/communication and perform asynchronous updates to reduce the number of transmissions per iteration (Mota et al., 2013; Li et al., 2014; Jaggi et al., 2014; Arablouei et al., 2015; McMahan et al., 2016; Yin et al., 2018; Yu et al., 2019). In contrast to random node selection, a more intuitive way is to evaluate the importance of a message in order to avoid unnecessary transmissions (Chen et al., 2018; Liu et al., 2019; Li et al., 2019b). This is usually implemented by adopting a censoring scheme that adaptively decides whether a message is informative enough to be transmitted during the iterative optimization process.
Other efforts to improve communication efficiency accelerate the convergence speed of the iterative algorithm implementation (Shamir et al., 2014; Reddi et al., 2016; Li et al., 2019a).

1.2 Contributions

This paper develops communication-efficient decentralized kernel learning algorithms under the consensus optimization framework, without any central coordination or raw data exchange among agents, for built-in privacy protection. Relative to prior art, our contributions are summarized as follows.

We first formulate the decentralized multi-agent kernel learning problem as a decentralized consensus optimization problem in the RF space. Since most machine learning scenarios can afford plentiful computational capability but limited communication resources, we solve this problem with ADMM, which has shown fast convergence at the expense of relatively high computation cost per iteration (Shi et al., 2014). To the best of our knowledge, this is the first work to solve decentralized kernel learning in the RF space by ADMM without any raw data exchange. The key to our proposed Decentralized Kernel Learning via ADMM (DKLA) algorithm is to apply RF mapping, which not only reduces the computational complexity but also enables consensus on a set of model parameters of fixed size in the RF space. In addition, since no raw data are exchanged among agents and the mapping from the original data space to the RF space is not a one-to-one mapping, data privacy is protected to a certain level.

To increase the communication efficiency, we further develop a COmmunication-censored KErnel learning (COKE) algorithm, which achieves the desired learning performance given limited communication resources and energy supply. Specifically, we devise a simple yet powerful censoring strategy that allows each agent to autonomously skip unnecessary communications when its local update is not informative enough for transmission, without the aid of a central coordinator.
In this way, the communication efficiency can be boosted at almost no sacrifice of the learning performance. When the censoring strategy is absent, COKE degenerates to DKLA.

In addition, we conduct theoretical analysis in terms of both functional convergence and generalization performance to provide guidelines for practical implementations of our proposed algorithms. We show that the functional individually learned at each agent through DKLA or COKE converges to the optimal one at a linear rate under mild conditions. For the generalization performance, we show that $O(\sqrt{T} \log d_K^\lambda)$ features are sufficient to ensure $O(1/\sqrt{T})$ learning risk for the decentralized kernel ridge regression problem, where $d_K^\lambda$ is the number of effective degrees of freedom, to be defined in Section 4.2, and $T$ is the total number of samples.

Finally, we test the performance of our proposed DKLA and COKE algorithms on both synthetic and real datasets. The results corroborate that both DKLA and COKE exhibit attractive learning performance and that COKE is highly communication-efficient.

1.3 Organization and notation of the paper

Organization. Section 2 formulates the problem of non-parametric learning and highlights the challenges in applying traditional kernel methods in the decentralized setting. Section 3 develops the decentralized kernel learning algorithms, including both DKLA and COKE. Section 4 presents the theoretical results and Section 5 reports the numerical tests using both synthetic and real datasets. Concluding remarks are provided in Section 6.

Notation. $\mathbb{R}$ denotes the set of real numbers. $\|\cdot\|_2$ denotes the Euclidean norm of vectors and $\|\cdot\|_F$ denotes the Frobenius norm of matrices. $|\cdot|$ denotes the cardinality of a set. $\mathbf{A}$, $\mathbf{a}$, and $a$ denote a matrix, a vector, and a scalar, respectively.
2. Problem Statement

This section reviews basics of kernel-based learning and decentralized optimization, introduces notation, and provides background needed for our novel DKLA and COKE schemes. Consider a network of $N$ agents interconnected over a fixed topology $\mathcal{G} = (\mathcal{N}, \mathcal{C}, \mathbf{A})$, where $\mathcal{N} = \{1, 2, \ldots, N\}$, $\mathcal{C} \subseteq \mathcal{N} \times \mathcal{N}$, and $\mathbf{A} \in \mathbb{R}^{N \times N}$ denote the agent set, the edge set, and the adjacency matrix, respectively. The elements of $\mathbf{A}$ are $a_{in} = a_{ni} = 1$ when the unordered pair of distinct agents $(i, n) \in \mathcal{C}$, and $a_{in} = a_{ni} = 0$ otherwise. For agent $i$, its one-hop neighbors are in the set $\mathcal{N}_i = \{n \mid (n, i) \in \mathcal{C}\}$. The term agent used here can be a single computational system (e.g., a smart phone, a database) or a collection of co-located computational systems (e.g., data centers, computer clusters). Each agent only has access to its locally observed data composed of independently and identically distributed (i.i.d.) input-label pairs $\{x_{i,t}, y_{i,t}\}_{t=1}^{T_i}$ obeying an unknown probability distribution $p$ on $\mathcal{X} \times \mathcal{Y}$, with $x_{i,t} \in \mathbb{R}^d$ and $y_{i,t} \in \mathbb{R}$. The kernel learning task is to find a prediction function $f$ that best describes the ensemble of all data from all agents. Suppose that $f$ belongs to the reproducing kernel Hilbert space (RKHS) $\mathcal{H} := \{f \mid f(x) = \sum_{t} \alpha_t \kappa(x, x_t)\}$ induced by a positive semidefinite kernel $\kappa(x, x_t): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ that measures the similarity between $x$ and $x_t$, for all $x, x_t \in \mathcal{X}$. In a decentralized setting with privacy concerns, this means that each agent has to be able to learn the global function $f \in \mathcal{H}$ such that $y_{i,t} = f(x_{i,t}) + e_{i,t}$ for $\{\{x_{i,t}, y_{i,t}\}_{t=1}^{T_i}\}_{i=1}^N$, without exchange of any raw data and in the absence of a fusion center, where the error terms $e_{i,t}$ are minimized according to a certain optimality metric. To evaluate the learning performance, a nonnegative loss function $\ell(y, \hat{y})$ is utilized to measure the difference between the true label value $y$ and the predicted value $\hat{y} = f(x)$.
Some common loss functions include the quadratic loss $\ell(y, \hat{y}) = (y - \hat{y})^2$ for regression tasks, and the hinge loss $\ell(y, \hat{y}) = \max(0, 1 - y\hat{y})$ and the logistic loss $\ell(y, \hat{y}) = \log(1 + e^{-y\hat{y}})$ for binary classification tasks. The above loss functions are all convex with respect to $\hat{y}$. The learning problem is then to minimize the expected risk of the prediction function:

$$R(f) := \int_{\mathcal{X} \times \mathcal{Y}} \ell(f(x), y) \, dp(x, y), \qquad (1)$$

which indicates the generalization ability of $f$ to new data. However, the distribution $p$ is unknown in most learning tasks, so minimizing $R(f)$ directly is not possible. Instead, given the finite number of training examples, the problem turns to minimizing the empirical risk:

$$\min_{f \in \mathcal{H}} \hat{R}(f) := \sum_{i=1}^{N} \hat{R}_i(f), \qquad (2)$$

where $\hat{R}_i(f)$ is the local empirical risk for agent $i$ given by

$$\hat{R}_i(f) = \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(f(x_{i,t}), y_{i,t}) + \lambda_i \|f\|_{\mathcal{H}}^2, \qquad (3)$$

with $\|\cdot\|_{\mathcal{H}}$ being the norm associated with $\mathcal{H}$, and $\lambda_i > 0$ being a regularization parameter that controls over-fitting. The representer theorem states that the minimizer of a regularized empirical risk functional defined over an RKHS can be represented as a finite linear combination of kernel functions evaluated on the data pairs from the training dataset (Schölkopf et al., 2001). If $\{\{x_{i,t}, y_{i,t}\}_{t=1}^{T_i}\}_{i=1}^N$ are centrally available at a fusion center, the minimizer of (2) admits

$$f^\star(x) = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \alpha_{i,t} \kappa(x, x_{i,t}) := \boldsymbol{\alpha}^\top \boldsymbol{\kappa}(x), \qquad (4)$$

where $\boldsymbol{\alpha} = [\alpha_{1,1}, \ldots, \alpha_{N,T_N}]^\top \in \mathbb{R}^T$ is the coefficient vector to be learned, $T = \sum_{i=1}^{N} T_i$ is the total number of samples, and $\boldsymbol{\kappa}(x) = [\kappa(x, x_{1,1}), \ldots, \kappa(x, x_{N,T_N})]^\top \in \mathbb{R}^T$ is the kernel function parameterized by the global data $X_T := \{\{x_{i,t}\}_{t=1}^{T_i}\}_{i=1}^N$ from all agents, for any $x$. In an RKHS, since $\langle \kappa(x_t, \cdot), \kappa(x_\tau, \cdot) \rangle_{\mathcal{H}} = \kappa(x_t, x_\tau)$, it follows that $\|f\|_{\mathcal{H}}^2 = \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}$, where $\mathbf{K}$ is the $T \times T$ kernel matrix that measures the similarity between any two data points in $X_T$. In this way, the local empirical risk (3) can be reformulated as a function of $\boldsymbol{\alpha}$:

$$\hat{R}_i(\boldsymbol{\alpha}) := \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(f^\star(x_{i,t}), y_{i,t}) + \lambda_i \|f^\star\|_{\mathcal{H}}^2 = \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(\boldsymbol{\alpha}^\top \boldsymbol{\kappa}(x_{i,t}), y_{i,t}) + \lambda_i \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}. \qquad (5)$$
Accordingly, (2) becomes

$$\min_{\boldsymbol{\alpha} \in \mathbb{R}^T} \sum_{i=1}^{N} \hat{R}_i(\boldsymbol{\alpha}). \qquad (6)$$

Relating the decentralized kernel learning problem to the decentralized consensus optimization problem, solving (6) is equivalent to solving

$$\min_{\{\boldsymbol{\alpha}_i \in \mathbb{R}^T\}_{i=1}^N} \sum_{i=1}^{N} \hat{R}_i(\boldsymbol{\alpha}_i) \quad \text{s.t. } \boldsymbol{\alpha}_i = \boldsymbol{\alpha}_n, \; \forall i, \, n \in \mathcal{N}_i, \qquad (7)$$

where $\boldsymbol{\alpha}_i$ and $\boldsymbol{\alpha}_n$ are the local copies of the global decision variable $\boldsymbol{\alpha}$ at agents $i$ and $n$, respectively. The problem can then be solved by ADMM (Shi et al., 2014) or other primal-dual methods (Terelius et al., 2011). However, it is worth noting that (7) reveals a subtle yet profound difference from a general optimization problem for parametric learning. That is, each local function $\hat{R}_i$ depends not only on the global decision variable $\boldsymbol{\alpha}$, but also on the global data $X_T$ because of the kernel terms $\boldsymbol{\kappa}(x_{i,t})$ and $\mathbf{K}$. As a result, solving the local objective for agent $i$ requires raw data from all other agents to obtain $\boldsymbol{\kappa}(x_{i,t})$ and $\mathbf{K}$, which contradicts the situation that private raw data are only locally available. Moreover, notice that $\boldsymbol{\alpha}_i$ is of the same size $T$ as the ensemble dataset, which incurs the curse of dimensionality and insurmountable computational cost when $T$ becomes large, even when the obstacle of making all the data available to all agents is not of concern. To resolve this issue, an alternative formulation is to associate a local prediction model $f_i \in \mathcal{H}$ with each agent $i$, with $f_i^\star = \sum_{t=1}^{T_i} \tilde{\alpha}_{i,t} \kappa(x, x_{i,t}) = \tilde{\boldsymbol{\alpha}}_i^\top \boldsymbol{\kappa}_i(x)$ being the local optimal solution that only involves the local data $\{x_{i,t}\}_{t=1}^{T_i}$ (Ji et al., 2016). Specifically, $\tilde{\boldsymbol{\alpha}}_i = [\tilde{\alpha}_{i,1}, \ldots, \tilde{\alpha}_{i,T_i}]^\top \in \mathbb{R}^{T_i}$, and $\boldsymbol{\kappa}_i(x) = [\kappa(x, x_{i,1}), \ldots, \kappa(x, x_{i,T_i})]^\top \in \mathbb{R}^{T_i}$ is parameterized by the local data $\{x_{i,t}\}_{t=1}^{T_i}$ only. In this way, the local cost function becomes

$$\hat{R}_i(\tilde{\boldsymbol{\alpha}}_i) := \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(f_i^\star(x_{i,t}), y_{i,t}) + \lambda_i \|f_i^\star\|_{\mathcal{H}}^2 = \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(\tilde{\boldsymbol{\alpha}}_i^\top \boldsymbol{\kappa}_i(x_{i,t}), y_{i,t}) + \lambda_i \tilde{\boldsymbol{\alpha}}_i^\top \mathbf{K}_i \tilde{\boldsymbol{\alpha}}_i, \qquad (8)$$

where $\mathbf{K}_i$ is of size $T_i \times T_i$ and depends on local data only. With (8), the optimization problem (7) is then modified to

$$\min_{\{\tilde{\boldsymbol{\alpha}}_i \in \mathbb{R}^{T_i}\}_{i=1}^N} \sum_{i=1}^{N} \hat{R}_i(\tilde{\boldsymbol{\alpha}}_i) \quad \text{s.t.}$$
$$f_n(x_{i,t}) = f_i(x_{i,t}), \quad \forall i, \, n \in \mathcal{N}_i, \; t = 1, \ldots, T_i, \qquad (9)$$

and can be solved distributedly by ADMM. Note that the consensus constraint is on the learned prediction values $f_i(x)$, not on the parameters $\tilde{\boldsymbol{\alpha}}_i$. This is because the $\tilde{\boldsymbol{\alpha}}_i$ are data-dependent and may have different sizes at different agents (the dimension of $\tilde{\boldsymbol{\alpha}}_i$ equals the number of training samples at agent $i$), and hence cannot be directly optimized through consensus. Still, this method has four drawbacks. First, it is necessary to associate a local learning model $f_i$ with each agent $i$ for the decentralized implementation. However, the local learning model $f_i$ and the global optimal model $f$ in (2) may not be the same because different local training data are used. Therefore, the optimization problem (9) is only an approximation of (2). Even with the equality constraint to minimize the gap between the decentralized learning output and the optimal centralized one, the approximation performance is not guaranteed. Second, the functional consensus constraint still requires raw data exchange among agents in order for agent $n \in \mathcal{N}_i$ to be able to compute the values $f_n(x_{i,t})$ from agent $i$'s data $x_{i,t}$, for $i \neq n$. Apparently, this violates the privacy-protection requirement of practical applications. Third, when $T_i$ is large, both the storage and computational costs are high for each agent due to the curse of dimensionality at the local sites. Lastly, the frequent local communication is resource-consuming under communication constraints. To circumvent all these obstacles, the goal of this paper is to develop efficient decentralized algorithms that protect privacy and conserve communication resources.

3. Algorithm Development

In this section, we leverage the RF approximation and ADMM to develop our algorithms. We first introduce the RF mapping method. Then, we devise the DKLA algorithm that globally optimizes a shared learning model for the multi-agent system.
Finally, we take into consideration the limited communication resources in large-scale decentralized networks and develop the COKE algorithm. Both DKLA and COKE are computationally efficient and protect data privacy at the same time. Further, COKE is communication-efficient.

3.1 RF-based kernel learning

As stated in previous sections, standard kernel methods incur the curse of dimensionality when the data size grows large. To make kernel methods scalable to large datasets, RF mapping is adopted for approximation by exploiting the shift-invariance property of kernel functions (Rahimi and Recht, 2008). For a shift-invariant kernel that satisfies $\kappa(x_t, x_\tau) = \kappa(x_t - x_\tau), \, \forall t, \tau$, if $\kappa(x_t - x_\tau)$ is absolutely integrable, then its Fourier transform $p_\kappa(\omega)$ is guaranteed to be nonnegative ($p_\kappa(\omega) \geq 0$), and hence can be viewed as a probability density function (pdf) when $\kappa$ is scaled to satisfy $\kappa(0) = 1$ (Bochner, 2005). Therefore, we have

$$\kappa(x_t, x_\tau) = \int p_\kappa(\omega) e^{j\omega^\top (x_t - x_\tau)} d\omega := \mathbb{E}_\omega[e^{j\omega^\top (x_t - x_\tau)}] = \mathbb{E}_\omega[\phi(x_t, \omega)\phi^*(x_\tau, \omega)], \qquad (10)$$

where $\mathbb{E}$ denotes the expectation operator, $\phi(x, \omega) := e^{j\omega^\top x}$ with $\omega \in \mathbb{R}^d$, and $*$ is the complex conjugate operator. In (10), the first equality results from the Fourier inversion theorem, and the second equality arises by viewing $p_\kappa(\omega)$ as the pdf of $\omega$. In this paper, we adopt a Gaussian kernel $\kappa(x_t, x_\tau) = \exp(-\|x_t - x_\tau\|_2^2 / (2\sigma^2))$, whose associated pdf is the normal distribution $p_\kappa(\omega) = \mathcal{N}(\mathbf{0}, \sigma^{-2}\mathbf{I})$. The main idea of RF mapping is to randomly generate $\{\omega_l\}_{l=1}^L$ from the distribution $p_\kappa(\omega)$ and approximate the kernel function $\kappa(x_t, x_\tau)$ by the sample average

$$\hat{\kappa}_L(x_t, x_\tau) := \frac{1}{L} \sum_{l=1}^{L} \phi(x_t, \omega_l)\phi^*(x_\tau, \omega_l) := \boldsymbol{\phi}_L^{\mathsf{H}}(x_\tau)\boldsymbol{\phi}_L(x_t), \qquad (11)$$

where $\boldsymbol{\phi}_L(x) := \sqrt{\tfrac{1}{L}}\,[\phi(x, \omega_1), \ldots, \phi(x, \omega_L)]^\top$ and $\mathsf{H}$ is the conjugate transpose operator.
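As a numerical sanity check of (10)-(11), the following sketch draws random features for the Gaussian kernel (with the illustrative choice $\sigma = 1$; all variable names are ours, not the paper's) and compares the Monte Carlo estimate $\hat{\kappa}_L$ against the exact kernel value:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, sigma = 3, 10_000, 1.0

# For kappa(x_t, x_tau) = exp(-||x_t - x_tau||^2 / (2 sigma^2)),
# the pdf in (10) is p_kappa(omega) = N(0, sigma^{-2} I).
omega = rng.normal(scale=1.0 / sigma, size=(L, d))

def phi_L(x):
    # phi_L(x) = (1/sqrt(L)) [e^{j omega_1^T x}, ..., e^{j omega_L^T x}]^T
    return np.exp(1j * omega @ x) / np.sqrt(L)

x_t, x_tau = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x_t - x_tau) ** 2) / (2 * sigma ** 2))
# hat{kappa}_L(x_t, x_tau) = phi_L(x_tau)^H phi_L(x_t) as in (11);
# its real part concentrates around the exact kernel as L grows.
approx = np.vdot(phi_L(x_tau), phi_L(x_t)).real
```

With $L = 10{,}000$ features the approximation error is on the order of $1/\sqrt{L}$, i.e., about $0.01$ here.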
The following real-valued mappings can be adopted to approximate $\kappa(x_t, x_\tau)$, both satisfying the condition $\mathbb{E}_\omega[\phi_r(x_t, \omega)^\top \phi_r(x_\tau, \omega)] = \kappa(x_t, x_\tau)$ (Rahimi and Recht, 2008):

$$\phi_r(x, \omega) = [\cos(\omega^\top x), \sin(\omega^\top x)]^\top, \qquad (12)$$

$$\phi_r(x, \omega) = \sqrt{2}\cos(\omega^\top x + b), \qquad (13)$$

where $b$ is drawn uniformly from $[0, 2\pi]$. With the real-valued RF mapping, the minimizer of (2) then admits the following form:

$$\hat{f}(x) = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \alpha_{i,t} \boldsymbol{\phi}_L^\top(x_{i,t})\boldsymbol{\phi}_L(x) = \boldsymbol{\theta}^\top \boldsymbol{\phi}_L(x), \qquad (14)$$

where $\boldsymbol{\theta} := \sum_{i=1}^{N} \sum_{t=1}^{T_i} \alpha_{i,t} \boldsymbol{\phi}_L(x_{i,t})$ denotes the new decision vector to be learned in the RF space and $\boldsymbol{\phi}_L(x) = \sqrt{\tfrac{1}{L}}\,[\phi_r(x, \omega_1), \ldots, \phi_r(x, \omega_L)]^\top$. If (12) is adopted, then $\boldsymbol{\phi}_L(x)$ and $\boldsymbol{\theta}$ are of size $2L$; if (13) is adopted, they are of size $L$. In either case, the size of $\boldsymbol{\theta}$ is fixed and does not increase with the number of data samples.

3.2 DKLA: Decentralized kernel learning via ADMM

Consider the decentralized kernel learning problem described in Section 2 and adopt the RF mapping described in Section 3.1. Let all agents in the network share the same set of random features $\{\omega_l\}_{l=1}^L$. Plugging (14) into the local cost function $\hat{R}_i(f)$ in (3) gives

$$\hat{R}_i(\boldsymbol{\theta}) := \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(\hat{f}(x_{i,t}), y_{i,t}) + \lambda_i \|\hat{f}\|_{\mathcal{H}}^2 = \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(\boldsymbol{\theta}^\top \boldsymbol{\phi}_L(x_{i,t}), y_{i,t}) + \lambda_i \|\boldsymbol{\theta}\|_2^2. \qquad (15)$$

In (15), we have

$$\|\boldsymbol{\theta}\|_2^2 = \Big(\sum_{i=1}^{N}\sum_{t=1}^{T_i} \alpha_{i,t}\boldsymbol{\phi}_L^\top(x_{i,t})\Big)\Big(\sum_{n=1}^{N}\sum_{\tau=1}^{T_n} \alpha_{n,\tau}\boldsymbol{\phi}_L(x_{n,\tau})\Big) = \sum_{i,n=1}^{N}\sum_{t=1}^{T_i}\sum_{\tau=1}^{T_n} \alpha_{i,t}\alpha_{n,\tau}\,\kappa(x_{i,t}, x_{n,\tau}) := \|\hat{f}\|_{\mathcal{H}}^2.$$

Therefore, with RF mapping, the centralized benchmark (2) becomes

$$\min_{\boldsymbol{\theta} \in \mathbb{R}^L} \sum_{i=1}^{N} \hat{R}_i(\boldsymbol{\theta}). \qquad (16)$$

Here, for notational simplicity, we denote the size of $\boldsymbol{\theta}$ by $L \times 1$, which corresponds to adopting the real-valued mapping in (13). Adopting an alternative mapping such as (12) only changes the size of $\boldsymbol{\theta}$; the algorithm development is the same. RF mapping is essential because it results in a common optimization parameter $\boldsymbol{\theta}$ of fixed size for all agents. To solve (16) in a decentralized manner, we associate a model parameter $\boldsymbol{\theta}_i$ with each agent $i$ and enforce the consensus constraint on neighboring agents $i$ and $n$ using an auxiliary variable $\boldsymbol{\vartheta}_{in}$.
Specifically, the RF-based decentralized kernel learning problem is formulated to jointly minimize the following objective function:

$$\min_{\{\boldsymbol{\theta}_i \in \mathbb{R}^L\}, \{\boldsymbol{\vartheta}_{in} \in \mathbb{R}^L\}} \sum_{i=1}^{N} \hat{R}_i(\boldsymbol{\theta}_i) \quad \text{s.t. } \boldsymbol{\theta}_i = \boldsymbol{\vartheta}_{in}, \; \boldsymbol{\theta}_n = \boldsymbol{\vartheta}_{in}, \; \forall (i, n) \in \mathcal{C}. \qquad (17)$$

Note that the new decision variables $\boldsymbol{\theta}_i$ to be optimized are local copies of the global optimization parameter $\boldsymbol{\theta}$ and are of the same size for all agents. On the contrary, the decision variables $\tilde{\boldsymbol{\alpha}}_i$ in (9) are data-dependent and may have different sizes. In addition, the size of $\boldsymbol{\theta}$ is $L$, which can be much smaller than that of $\boldsymbol{\alpha}$ (whose size equals $T$) in (6). For big data scenarios where $L \ll T$, RF mapping greatly reduces the computational complexity. Moreover, as shown in the following, the updating of $\boldsymbol{\theta}$ does not involve any raw data exchange, and the RF mapping from $x$ to $\boldsymbol{\phi}_L(x)$ is not a one-to-one mapping, which provides raw data privacy protection. Further, it is easy to set the regularization parameters $\lambda_i$ to control over-fitting. Specifically, since the parameters $\boldsymbol{\theta}_i$ are of the same length among agents, we can set them to be $\lambda_i = \frac{1}{N}\lambda, \, \forall i$, where $\lambda$ is the corresponding over-fitting control parameter assuming all data are collected at a center. In contrast, the regularization parameters $\lambda_i$ in (5) depend on local data and need to satisfy $\lambda = \sum_{i=1}^{N} \lambda_i$, which is relatively difficult to tune in a large-scale network. In the constraints of (17), the $\boldsymbol{\theta}_i$ are separable when the $\boldsymbol{\vartheta}_{in}$ are fixed, and vice versa. Therefore, (17) can be solved by ADMM. Following Shi et al. (2014), we develop the DKLA algorithm where each agent updates its local primal variable $\boldsymbol{\theta}_i$ and local dual variable $\boldsymbol{\gamma}_i$ by

$$\boldsymbol{\theta}_i^k := \arg\min_{\boldsymbol{\theta}_i} \; \hat{R}_i(\boldsymbol{\theta}_i) + \rho|\mathcal{N}_i| \|\boldsymbol{\theta}_i\|_2^2 + \boldsymbol{\theta}_i^\top \Big(\boldsymbol{\gamma}_i^{k-1} - \rho \sum_{n \in \mathcal{N}_i} \big(\boldsymbol{\theta}_i^{k-1} + \boldsymbol{\theta}_n^{k-1}\big)\Big), \qquad (18a)$$

$$\boldsymbol{\gamma}_i^k = \boldsymbol{\gamma}_i^{k-1} + \rho \sum_{n \in \mathcal{N}_i} \big(\boldsymbol{\theta}_i^k - \boldsymbol{\theta}_n^k\big), \qquad (18b)$$

where $|\mathcal{N}_i|$ is the cardinality of $\mathcal{N}_i$. The auxiliary variables $\boldsymbol{\vartheta}_{in}$ can be written as functions of $\boldsymbol{\theta}_i$ and then canceled out; interested readers are referred to Shi et al. (2014) for the detailed derivation. The learning algorithm DKLA is outlined in Algorithm 1.
Algorithm 1 DKLA, run at agent $i$
Require: kernel $\kappa$, the number of random features $L$, and $\lambda$ to control over-fitting; initialize local variables to $\boldsymbol{\theta}_i^0 = \mathbf{0}$, $\boldsymbol{\gamma}_i^0 = \mathbf{0}$; set step size $\rho > 0$.
1: Draw $L$ i.i.d. samples $\{\omega_l\}_{l=1}^L$ from $p_\kappa(\omega)$ according to a common random seed.
2: Construct $\{\boldsymbol{\phi}_L(x_{i,t})\}_{t=1}^{T_i}$ using the random features $\{\omega_l\}_{l=1}^L$ via (12) or (13).
3: for iterations $k = 1, 2, \ldots$ do
4: Update the local primal variable $\boldsymbol{\theta}_i^k$ by (18a);
5: Transmit $\boldsymbol{\theta}_i^k$ to all neighbors $n \in \mathcal{N}_i$ and receive $\boldsymbol{\theta}_n^k$ from all neighbors;
6: Update the local dual variable $\boldsymbol{\gamma}_i^k$ by (18b).

Note that the random features need to be common to all agents; hence, in step 1, we restrict them to be drawn according to a common random seed. Algorithm 1 is fully decentralized since the updates of $\boldsymbol{\theta}_i$ and $\boldsymbol{\gamma}_i$ depend only on local and neighboring information.

3.3 COKE: Communication-censored decentralized kernel learning

From Sections 3.1 and 3.2, we can see that decentralized kernel learning in the RF space under the consensus optimization framework has much reduced computational complexity, thanks to the RF mapping technique that transforms the learning model into a smaller RF space. In this subsection, we consider the case where the communication resources are limited, and we aim to further reduce the communication cost of DKLA. To start, we notice that in Algorithm 1, each agent $i$ ($i \in \mathcal{N}$) maintains $2 + |\mathcal{N}_i|$ local variables at iteration $k$, i.e., its local primal variable $\boldsymbol{\theta}_i^k$, its local dual variable $\boldsymbol{\gamma}_i^k$, and the $|\mathcal{N}_i|$ state variables $\boldsymbol{\theta}_n^k$ received from its neighbors. While the dual variable $\boldsymbol{\gamma}_i^k$ is kept locally at agent $i$, the transmission of its updated local variable $\boldsymbol{\theta}_i^k$ to its one-hop neighbors happens at every iteration, which consumes a large amount of communication bandwidth and energy over the iterations in large-scale networks.
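For concreteness, the DKLA iteration admits a closed-form primal update for the quadratic loss. The sketch below is illustrative only (the network, data, and parameter choices are hypothetical, and the feature matrices stand in for the RF-mapped data $\{\boldsymbol{\phi}_L(x_{i,t})\}$), not the paper's reference implementation:

```python
import numpy as np

def dkla_quadratic(Phi, y, neighbors, rho=1.0, lam=1e-3, iters=300):
    """DKLA sketch for the quadratic loss. Agent i holds RF features
    Phi[i] (T_i x L) and labels y[i]; neighbors[i] is its one-hop set.
    Local risk: (1/T_i)||Phi_i theta - y_i||^2 + (lam/N)||theta||^2."""
    N, L = len(Phi), Phi[0].shape[1]
    lam_i = lam / N                       # lambda_i = lambda / N
    theta = np.zeros((N, L))
    gamma = np.zeros((N, L))
    for _ in range(iters):
        theta_prev = theta.copy()
        for i in range(N):
            Ti, deg = len(y[i]), len(neighbors[i])
            s = sum(theta_prev[i] + theta_prev[n] for n in neighbors[i])
            # Closed-form solution of the primal update (18a): setting the
            # gradient to zero gives a regularized least-squares system.
            A = (2.0 / Ti) * Phi[i].T @ Phi[i] \
                + 2.0 * (lam_i + rho * deg) * np.eye(L)
            rhs = (2.0 / Ti) * Phi[i].T @ y[i] - gamma[i] + rho * s
            theta[i] = np.linalg.solve(A, rhs)
        for i in range(N):
            # Dual update (18b).
            gamma[i] += rho * sum(theta[i] - theta[n] for n in neighbors[i])
    return theta
```

On a small fully connected network with data generated from a common parameter vector, the local iterates reach consensus near that vector, matching the linear convergence behavior analyzed in Section 4.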
In order to improve the communication efficiency, we develop the COKE algorithm by employing a censoring function at each agent to decide whether a local update is informative enough to be transmitted. To evaluate the importance of a local update at iteration $k$ for agent $i$ ($i \in \mathcal{N}$), we introduce a new state variable $\hat{\boldsymbol{\theta}}_i^{k-1}$ to record agent $i$'s latest broadcast primal variable up to time $k-1$. Then, at iteration $k$, we define the difference between agent $i$'s current state $\boldsymbol{\theta}_i^k$ and its previously transmitted state $\hat{\boldsymbol{\theta}}_i^{k-1}$ as

$$\boldsymbol{\xi}_i^k = \hat{\boldsymbol{\theta}}_i^{k-1} - \boldsymbol{\theta}_i^k, \qquad (19)$$

and choose a censoring function as

$$H_i(k, \boldsymbol{\xi}_i^k) = \|\boldsymbol{\xi}_i^k\|_2 - h_i(k), \qquad (20)$$

where $\{h_i(k)\}$ is a non-increasing non-negative sequence. A typical choice for the censoring function is $H_i(k, \boldsymbol{\xi}_i^k) = \|\boldsymbol{\xi}_i^k\|_2 - v\mu^k$, where $\mu \in (0, 1)$ and $v > 0$ are constants. When $H_i(k, \boldsymbol{\xi}_i^k) < 0$, $\boldsymbol{\theta}_i^k$ is deemed not informative enough; hence it is censored and not transmitted to the neighbors. When executing the COKE algorithm, each agent $i$ maintains $3 + |\mathcal{N}_i|$ local variables at each iteration $k$. Compared with the DKLA update in (18), the additional local variable is the state variable $\hat{\boldsymbol{\theta}}_i^k$ that records its latest broadcast primal variable up to time $k$. Moreover, the $|\mathcal{N}_i|$ state variables from its neighbors are $\hat{\boldsymbol{\theta}}_n^k$, which record the latest received primal variables from its neighbors, instead of the timely updated and broadcast variables $\boldsymbol{\theta}_n^k$ of its neighbors $n \in \mathcal{N}_i$. Though each agent still computes local updates at every step, its transmission to neighbors does not always occur, but is determined by the censoring criterion (20). To be specific, at each iteration $k$, if $H_i(k, \boldsymbol{\xi}_i^k) \geq 0$, then $\hat{\boldsymbol{\theta}}_i^k = \boldsymbol{\theta}_i^k$, and agent $i$ transmits its local primal variable $\boldsymbol{\theta}_i^k$ to its neighbors. Otherwise, $\hat{\boldsymbol{\theta}}_i^k = \hat{\boldsymbol{\theta}}_i^{k-1}$ and no information is transmitted. If agent $i$ receives $\boldsymbol{\theta}_n^k$ from any neighbor $n$, then that neighbor's state variable kept by agent $i$ becomes $\hat{\boldsymbol{\theta}}_n^k = \boldsymbol{\theta}_n^k$; otherwise, $\hat{\boldsymbol{\theta}}_n^k = \hat{\boldsymbol{\theta}}_n^{k-1}$.
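The censoring rule (19)-(20) amounts to a few lines per agent per iteration. The sketch below uses the typical threshold choice $h_i(k) = v\mu^k$ mentioned above (the function name and default constants are ours):

```python
import numpy as np

def censor_step(theta_k, theta_hat_prev, k, v=1.0, mu=0.95):
    """Apply the censoring rule (20) with h_i(k) = v * mu^k.
    Returns (transmit, theta_hat): transmit only when the innovation
    xi = hat{theta}^{k-1} - theta^k of (19) satisfies ||xi||_2 >= h_i(k)."""
    xi = theta_hat_prev - theta_k
    transmit = np.linalg.norm(xi) >= v * mu ** k
    # If transmitted, the broadcast state is refreshed; otherwise kept.
    theta_hat = theta_k if transmit else theta_hat_prev
    return transmit, theta_hat
```

Early on, the decaying threshold is loose relative to large iterate changes, so most updates are transmitted; as the iterates stabilize, small innovations fall below $v\mu^k$ far less often than the threshold decays, and transmissions become sparse.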
Algorithm 2 COKE Run at Agent $i$
Require: Kernel $\kappa$, the number of random features $L$, the censoring thresholds $\{h_i(k)\}$, and $\lambda$ to control over-fitting; initialize local variables to $\theta_i^0 = 0$, $\hat\theta_i^0 = 0$, $\gamma_i^0 = 0$; set step size $\rho > 0$.
1: Draw $L$ i.i.d. samples $\{\omega_l\}_{l=1}^L$ from $p_\kappa(\omega)$ according to a common random seed.
2: Construct $\{\phi_L(x_{i,t})\}_{t=1}^{T_i}$ using the random features $\{\omega_l\}_{l=1}^L$ via (12) or (13).
3: for iterations $k = 1, 2, \dots$ do
4:   Update local variable $\theta_i^k$ by (21a);
5:   Compute $\xi_i^k = \hat\theta_i^{k-1} - \theta_i^k$;
6:   If $H_i(k, \xi_i^k) = \|\xi_i^k\|_2^2 - h_i(k) \ge 0$, transmit $\theta_i^k$ to neighbors and let $\hat\theta_i^k = \theta_i^k$; else do not transmit and let $\hat\theta_i^k = \hat\theta_i^{k-1}$;
7:   If $\theta_n^k$ is received from neighbor $n$, let $\hat\theta_n^k = \theta_n^k$; else let $\hat\theta_n^k = \hat\theta_n^{k-1}$;
8:   Update local dual variable $\gamma_i^k$ by (21b).

Consequently, agent $i$'s local parameters are updated as follows:
$$\theta_i^k := \arg\min_{\theta_i}\ \hat R_i(\theta_i) + \rho|\mathcal{N}_i|\,\|\theta_i\|_2^2 + \theta_i^\top\Big(\gamma_i^{k-1} - \rho\sum_{n\in\mathcal{N}_i}\big(\hat\theta_i^{k-1} + \hat\theta_n^{k-1}\big)\Big), \qquad (21a)$$
$$\gamma_i^k = \gamma_i^{k-1} + \rho\sum_{n\in\mathcal{N}_i}\big(\hat\theta_i^k - \hat\theta_n^k\big), \qquad (21b)$$
with a censoring step conducted between (21a) and (21b). We outline the COKE algorithm in Algorithm 2. The key feature of COKE is that agent $i$'s local variables $\theta_i^k$ and $\gamma_i^k$ are updated at every iteration, but the transmission of $\theta_i^k$ occurs only when the censoring condition is met. By skipping unnecessary transmissions, the communication efficiency of COKE is improved. Evidently, a large $\{h_i(k)\}$ saves more communication but may lead to divergence from the optimal solution $\theta^*$ of (16), while a small $\{h_i(k)\}$ does not contribute much to communication saving. Notably, DKLA is the special case of COKE in which the communication censoring strategy is absent, obtained by setting $h_i(k) = 0, \forall i, k$.

4. Theoretical Guarantees

In this section, we perform theoretical analyses to address two questions about the convergence properties of the DKLA and COKE algorithms. First, do they converge to the globally optimal point, and if so, at what rate? Second, what generalization performance do they achieve in learning?
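For the least-squares loss, the primal update (21a) has a closed form. The sketch below is our own reconstruction under the usual decentralized-ADMM form, assuming $\hat R_i(\theta) = \frac{1}{T_i}\|y_i - (\Phi_L^i)^\top\theta\|_2^2 + \frac{\lambda}{N}\|\theta\|_2^2$; the function names and argument conventions are illustrative, not the paper's.

```python
import numpy as np

def coke_theta_update(Phi, y, gamma, theta_hat_self, theta_hat_nbrs,
                      rho=1e-2, lam=1e-3, N=10):
    """One local primal update, a sketch of (21a) for least squares.
    Phi: L x T_i local RF-mapped data; y: T_i targets; gamma: local dual;
    theta_hat_*: latest *broadcast* states of agent i and its neighbors.
    Setting the gradient of the local surrogate to zero gives a linear system."""
    L, Ti = Phi.shape
    deg = len(theta_hat_nbrs)
    A = (2.0 / Ti) * (Phi @ Phi.T) + (2.0 * lam / N + 2.0 * rho * deg) * np.eye(L)
    consensus = sum(theta_hat_self + th_n for th_n in theta_hat_nbrs)
    b = (2.0 / Ti) * (Phi @ y) - gamma + rho * consensus
    return np.linalg.solve(A, b)

def coke_dual_update(gamma, theta_hat_self, theta_hat_nbrs, rho=1e-2):
    """Dual ascent step (21b): gamma + rho * sum_n (theta_hat_i - theta_hat_n)."""
    return gamma + rho * sum(theta_hat_self - th_n for th_n in theta_hat_nbrs)
```

Note that both updates consume only the broadcast states $\hat\theta$, so a censored (skipped) transmission simply leaves the stale value in place.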
Since DKLA is a special case of COKE, the analytic results for COKE, especially the second one, extend to DKLA straightforwardly. For the theoretical analysis, we make the following assumptions.

Assumption 1 The network with topology $\mathcal{G} = (\mathcal{N}, \mathcal{C}, A)$ is undirected and connected.

Assumption 2 The local cost functions $\hat R_i$ are strongly convex with constants $m_{\hat R_i} > 0$; that is, for all $i \in \mathcal{N}$ and any $\theta_a, \theta_b \in \mathbb{R}^L$,
$$\langle \nabla \hat R_i(\theta_a) - \nabla \hat R_i(\theta_b),\ \theta_a - \theta_b \rangle \ge m_{\hat R_i}\,\|\theta_a - \theta_b\|_2^2.$$
The minimum strong convexity constant is $m_{\hat R} := \min_i m_{\hat R_i}$. The gradients of the local cost functions are Lipschitz continuous with constants $M_{\hat R_i} > 0$; that is, $\|\nabla \hat R_i(\theta_a) - \nabla \hat R_i(\theta_b)\|_2 \le M_{\hat R_i}\,\|\theta_a - \theta_b\|_2$ for any agent $i$ and any $\theta_a, \theta_b \in \mathbb{R}^L$. The maximum Lipschitz constant is $M_{\hat R} := \max_i M_{\hat R_i}$.

Assumption 3 The numbers of training samples of different agents are of the same order of magnitude; that is, $\frac{\max_i T_i - \min_i T_i}{\min_i T_i} < 10$.

Assumption 4 There exists $f_H \in \mathcal{H}$ such that $E(f_H) \le E(f)$ for all estimators $f \in \mathcal{H}$, where $E(f) := \mathbb{E}_p[\ell(f(x), y)]$ is the expected risk measuring the generalization ability of the estimator $f$.

Assumptions 1 and 2 are standard for decentralized optimization over networks (Shi et al., 2014), Assumption 4 is standard in the generalization performance analysis of kernel learning (Li et al., 2018), and Assumption 3 is enforced to exclude the case of extremely unbalanced data distributed over the network.

4.1 Linear convergence of DKLA and COKE

We first establish that DKLA enables the agents in the decentralized network to reach consensus on the prediction function at a linear rate. We then show that when the censoring function is properly chosen and the penalty parameter satisfies certain conditions, COKE also guarantees that the individually learned functional evaluated at the same sample converges linearly to the optimal solution.
Theorem 1 [Linear convergence of DKLA] Initialize the dual variables as $\gamma_i^0 = 0, \forall i$. Under Assumptions 1-3, the learned functional at each agent through DKLA is R-linearly convergent to the optimal functional $\hat f_{\theta^*}(x) := (\theta^*)^\top \phi_L(x)$ for any $x \in \mathcal{X}$, where $\theta^*$ denotes the optimal solution to (16) obtained in the centralized case. That is,
$$\lim_{k\to\infty} \hat f_{\theta_i^k}(x) = \hat f_{\theta^*}(x), \quad \forall i. \qquad (22)$$
Proof. See Appendix A.

Theorem 2 [Linear convergence of COKE] Initialize the dual variables as $\gamma_i^0 = 0, \forall i$, set the censoring thresholds to $h(k) = v\mu^k$ with $v > 0$ and $\mu \in (0, 1)$, and choose the penalty parameter $\rho$ such that
$$0 < \rho < \min\left\{\eta_1,\ \frac{(\nu-1)\,\sigma_{\min}^2(S_-)}{\nu\eta_3\,\sigma_{\max}^2(S_+)},\ \frac{m_{\hat R} - \eta_3\nu M_{\hat R}^2/\sigma_{\min}^2(S_-)}{\big(\frac{\eta_1}{4} + \eta_2\big)\,\sigma_{\max}^2(S_+)}\right\}, \qquad (23)$$
where $\eta_1 > 0$, $\eta_2 > 0$, $\eta_3 > 0$, and $\nu > 1$ are arbitrary constants, $m_{\hat R}$ and $M_{\hat R}$ are the minimum strong convexity constant of the local cost functions and the maximum Lipschitz constant of the local gradients, respectively, and $\sigma_{\max}(S_+)$ and $\sigma_{\min}(S_-)$ are the maximum singular value of the unsigned incidence matrix $S_+$ and the minimum non-zero singular value of the signed incidence matrix $S_-$ of the network, respectively. Then, under Assumptions 1-3, the learned functional at each agent through COKE is R-linearly convergent to the optimal one $\hat f_{\theta^*}(x) := (\theta^*)^\top \phi_L(x)$ for any $x \in \mathcal{X}$, where $\theta^*$ denotes the optimal solution to (16) obtained in the centralized case. That is,
$$\lim_{k\to\infty} \hat f_{\theta_i^k}(x) = \hat f_{\theta^*}(x), \quad \forall i. \qquad (24)$$
Proof. See Appendix A.

Remark 1. It should be noted that the kernel transformation with RF mapping is essential in enabling a convex consensus formulation with a convergence guarantee. For example, in a regular optimization problem with a local cost function $(y - f(x))^2$, even though the loss is quadratic, the nonlinear function $f(x)$ inside destroys the convexity. In contrast, with RF mapping, $f(x)$ of any form is expressed as a linear function of $\theta$, and hence the local cost function is guaranteed to be convex.
For decentralized kernel learning, many widely adopted loss functions result in (strongly) convex local objective functions in the RF space, such as the quadratic loss in regression and the logistic loss in classification.

Remark 2. For Theorem 2, notice that choosing larger $v$ and $\mu$ in the design of the censoring thresholds in COKE leads to less communication per iteration at the expense of possible performance degradation, whereas smaller $v$ and $\mu$ may not contribute much to communication saving. However, it is challenging to characterize an explicit tradeoff between communication cost and steady-state accuracy, since the designed censoring thresholds do not have an explicit relationship with the update of the model parameter.

The above theorems establish the exact convergence of the functional learned in the multi-agent system for the decentralized kernel regression problem via DKLA and COKE. Different from previous works (Koppel et al., 2018; Shin et al., 2018), our analytic results are obtained by converting the non-parametric, data-dependent learning model into a parametric, data-independent model in the RF space, which is then solved under the consensus optimization framework. In this way, we not only reduce the computational complexity of standard kernel methods and make RF-based kernel methods scalable to large datasets, but also protect data privacy, since no raw data are exchanged among agents and the RF mapping is not a one-to-one mapping. The RF mapping is crucial in our algorithms: with it, we are able to establish the linear convergence of the functional by showing the linear convergence of the iteratively updated decision variables in the RF space; see Appendix A for more details.

4.2 Generalization property of COKE

The ultimate goal of decentralized learning is to find a function that generalizes well for the ensemble of all data from all agents.
To evaluate the generalization property of the predictive function learned by COKE, we are interested in bounding the difference between the expected risk of the predictive function learned by COKE at the $k$-th iteration, defined as $E(\hat f^k) := \sum_{i=1}^N E_i(\hat f_{\theta_i^k}) := \sum_{i=1}^N \mathbb{E}_p[(y - (\theta_i^k)^\top \phi_L(x))^2]$, and the expected risk $E(f_H)$ in the RKHS. This is different from bounding the approximation error between the kernel $\kappa$ and the kernel $\hat\kappa_L$ approximated by $L$ random features, as in the literature (Rahimi and Recht, 2008; Sutherland and Schneider, 2015; Sriperumbudur and Szabo, 2015). As DKLA is a special case of COKE, the generalization performance of COKE extends to DKLA straightforwardly.

To illustrate our finding, we focus on the kernel regression problem whose loss function is least squares, i.e., $\ell(y, f(x)) = (y - f(x))^2$. With RF mapping, the objective function (16) of the regression problem can be formulated as
$$\min_\theta\ \sum_{i=1}^N \hat R_i(\theta) = \sum_{i=1}^N \frac{1}{T_i}\,\|y_i - (\Phi_L^i)^\top \theta\|_2^2 + \lambda\|\theta\|_2^2, \qquad (25)$$
where $y_i = [y_{i,1}, \dots, y_{i,T_i}]^\top \in \mathbb{R}^{T_i \times 1}$, $\Phi_L^i = [\phi_L(x_{i,1}), \dots, \phi_L(x_{i,T_i})] \in \mathbb{R}^{L \times T_i}$, and $\phi_L(x_{i,t})$ is the data mapped to the RF space. The optimal solution of (25) is given in closed form by
$$\theta^* = (\bar\Phi^\top \bar\Phi + \lambda I)^{-1} \bar\Phi^\top \bar y, \qquad (26)$$
where $\bar\Phi = [\bar\Phi_L^1, \dots, \bar\Phi_L^N]^\top \in \mathbb{R}^{T \times L}$ with $\bar\Phi_L^i = \frac{1}{\sqrt{T_i}}\Phi_L^i, \forall i \in \mathcal{N}$, and $\bar y = [\bar y_1; \dots; \bar y_N] \in \mathbb{R}^{T \times 1}$ with $\bar y_i = \frac{1}{\sqrt{T_i}} y_i, \forall i \in \mathcal{N}$. The optimal prediction model is then expressed by
$$\hat f_{\theta^*}(x) = (\theta^*)^\top \phi_L(x). \qquad (27)$$
In the following theorem, we give a general result on the generalization performance of the predictive function learned by COKE for the kernel regression problem, which is built on the linear convergence result given in Theorem 2 and takes into account the number of random features adopted.

Theorem 3 Let $\lambda_K$ be the largest eigenvalue of the kernel matrix $K$ constructed by all data $X_T$, and choose the regularization parameter $\lambda < \lambda_K/T$ so as to control overfitting.
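The centralized benchmark $\theta^*$ in (26) is a ridge solve over the stacked, per-agent-scaled feature blocks. A minimal sketch, assuming the $1/\sqrt{T_i}$ scaling convention reconstructed above and illustrative function and argument names:

```python
import numpy as np

def rf_ridge_closed_form(Phi_bars, y_bars, lam):
    """Centralized theta* of (26): stack the scaled blocks
    Phi_bar_i (T_i x L) and y_bar_i (T_i,), then solve
    (Phi^T Phi + lam I) theta = Phi^T y.  Blocks are passed already
    transposed to T_i x L for convenience (an implementation choice)."""
    Phi = np.vstack(Phi_bars)             # T x L stacked design matrix
    y = np.concatenate(y_bars)            # T stacked targets
    L = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(L), Phi.T @ y)
```

This is the quantity DKLA/COKE converge to without ever centralizing the raw data; here it serves only as a reference point for evaluation.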
Under Assumptions 1-4, with the censoring function and other parameters given in Theorem 2, for all $\delta_p \in (0, 1)$ and $\|f\|_{\mathcal{H}} \le 1$, if the number of random features $L$ satisfies
$$L \ge \frac{8}{3\epsilon^2}\, d_K^\lambda \log\frac{16\, d_K^\lambda}{\delta_p},$$
then with probability at least $1 - \delta_p$, the excess risk of $E(\hat f^k)$ obtained by Algorithm 2 converges to an upper bound, i.e.,
$$\lim_{k\to\infty} \big(E(\hat f^k) - E(f_H)\big) \le 3\lambda + O\big(1/\sqrt{T}\big), \qquad (28)$$
where $\epsilon \in (0, 1)$ and $d_K^\lambda := \mathrm{Tr}\big(K(K + \lambda T I)^{-1}\big)$ is the number of effective degrees of freedom, known to be an indicator of the number of independent parameters in a learning problem (Avron et al., 2017).
Proof. See Appendix B.

Theorem 3 states the tradeoff between computational efficiency and statistical efficiency through the regularization parameter $\lambda$, the effective dimension $d_K^\lambda$, and the number of random features adopted. We can see that to bound the excess risk with a higher probability, we need more random features, which results in a higher computational complexity. The regularization parameter is usually determined by the number of training data, and one common practice is to set $\lambda = O(1/\sqrt{T})$ for the regression problem (Caponnetto and De Vito, 2007). Therefore, with $O(\sqrt{T}\log d_K^\lambda)$ features, COKE achieves a learning risk of $O(1/\sqrt{T})$ at a linear rate. We also notice that different sampling strategies affect the number of random features required to achieve a given generalization error. For example, importance sampling is studied for centralized kernel learning in the RF space in (Li et al., 2018). Interested readers are referred to (Li et al., 2018) and the references therein.

5. Experiments

This section evaluates the performance of our COKE algorithm in regression tasks using both synthetic and real-world datasets. Since we consider the case where data are only locally available and cannot be shared among agents, the following RF-based methods are used to benchmark our COKE algorithm.

CTA.
This is a form of diffusion-based technique where all agents first construct their RF-mapped data $\{\phi_L(x_{i,t})\}_{t=1}^{T_i}$ for all $i$, using the same random features $\{\omega_l\}_{l=1}^L$ as DKLA and COKE. Then, at each iteration $k$, each agent $i$ first combines the information from its neighbors, i.e., $\theta_n, n \in \mathcal{N}_i$, with its own parameter $\theta_i$ by aggregation. It then updates its own parameter $\theta_i$ using the gradient descent method with the aggregated information (Sayed, 2014). The cost function for agent $i$ is given in (15). Note that this method has not been formally proposed in existing works for RF-based decentralized kernel learning with batch-form data; we introduce it here only for comparison purposes. An online version that deals with streaming data is available in (Bouboulis et al., 2018). The batch version of CTA introduced here is expected to converge faster than the online version.

DKLA. Algorithm 1 proposed in Section 3.2, where ADMM is applied and the communication among agents happens at every iteration without being censored.

The performance of all algorithms is evaluated using both synthetic and real-world datasets, where the entries of the data samples are normalized to lie in $[0, 1]$ and each agent uses 70% of its data for training and the rest for testing. The learning performance at each iteration is evaluated by the mean squared error (MSE),
$$\mathrm{MSE}(k) = \frac{1}{T}\sum_{i=1}^N\sum_{t=1}^{T_i}\big(y_{i,t} - (\theta_i^k)^\top \phi_L(x_{i,t})\big)^2.$$
The decision variable $\theta_i$ for CTA is initialized as $\theta_i^0 = 0, \forall i$, as in DKLA and COKE. For COKE, it should be noted that the design of the censoring function is crucial. For the censoring thresholds adopted in Theorem 2, choosing larger $v$ and $\mu$ leads to less communication per iteration but may result in performance degradation. For all simulations, the kernel bandwidth is fine-tuned for each dataset individually via cross-validation.
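The CTA baseline described above admits a compact combine-then-adapt sketch. This is our own illustration: the paper does not specify the combination weights, so uniform averaging is an assumption, as are the function name and defaults.

```python
import numpy as np

def cta_step(theta_i, theta_nbrs, Phi, y, eta=0.99, lam=1e-3, N=10):
    """One combine-then-adapt step for the CTA baseline: uniformly average
    own and neighbor parameters (combination weights are an assumption),
    then take a gradient step on the local regularized least-squares cost.
    Phi: L x T_i RF-mapped local data; y: T_i local targets."""
    combined = (theta_i + sum(theta_nbrs)) / (1 + len(theta_nbrs))  # combine
    Ti = y.shape[0]
    grad = (2.0 / Ti) * Phi @ (Phi.T @ combined - y) + (2.0 * lam / N) * combined
    return combined - eta * grad                                     # adapt
```

Unlike DKLA/COKE, this scheme carries no dual variable, which is one intuition for its slower convergence in the experiments below.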
The parameters of the censoring function are tuned to achieve the best communication saving at nearly no loss of learning performance.

5.1 Synthetic dataset

In this setup, the connected graph is randomly generated with $N = 20$ nodes and 95 edges. The probability of attachment per node equals 0.3; that is, any pair of nodes is connected with probability 0.3. Each agent has $T_i \in (4000, 6000)$ data pairs generated following the model $y_{i,t} = \sum_{m=1}^{50} b_m \kappa(c_m, x_{i,t}) + e_{i,t}$, where the $b_m$ are uniformly drawn from $[0, 1]$, $c_m \sim \mathcal{N}(0, I_5)$, $x_{i,t} \sim \mathcal{N}(0, I_5)$, and $e_{i,t} \sim \mathcal{N}(0, 0.1)$. The kernel $\kappa$ in the model is Gaussian with bandwidth $\sigma = 5$.

5.2 Real datasets

To further evaluate our algorithms, the following popular real-world datasets from the UCI machine learning repository are chosen (Asuncion and Newman, 2007).

Tom's hardware. This dataset contains $T = 11000$ samples with $x_t \in \mathbb{R}^{96}$, whose features include the number of created discussions and authors interacting on a topic, and $y_t \in \mathbb{R}$ representing the average number of displays to a visitor about that topic (Kawala et al., 2013).

Twitter. This dataset consists of $T = 13800$ samples with $x_t \in \mathbb{R}^{77}$ being a feature vector reflecting the number of new interactive authors, the length of discussions on a given topic, etc., and $y_t \in \mathbb{R}$ representing the average number of active discussions on a certain topic. The learning task is to predict the popularity of these topics. We also include a larger Twitter dataset for testing, which has $T = 98704$ samples (Kawala et al., 2013).

Energy. This dataset contains $T = 19735$ samples with $x_t \in \mathbb{R}^{28}$ describing the humidity and temperature in different areas of a house, as well as the pressure, wind speed, and visibility outside, while $y_t$ denotes the total energy consumption in the house (Candanedo et al., 2017).

Air quality.
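The synthetic model above can be reproduced with a short generator. A sketch under stated assumptions: the shared dictionary $(b_m, c_m)$ is drawn from a fixed seed common to all agents (our choice), and $e_{i,t} \sim \mathcal{N}(0, 0.1)$ is interpreted as variance 0.1; function and argument names are illustrative.

```python
import numpy as np

def make_synthetic_agent_data(Ti, seed, M=50, d=5, sigma=5.0, noise_var=0.1):
    """Generate one agent's data from y = sum_m b_m * kappa(c_m, x) + e with a
    Gaussian kernel of bandwidth sigma = 5.  The dictionary (b_m, c_m) uses a
    fixed seed so every agent samples the SAME underlying function; the
    per-agent inputs x and noise e use `seed`."""
    dict_rng = np.random.default_rng(0)                  # common across agents
    b = dict_rng.uniform(0.0, 1.0, size=M)               # b_m ~ U[0, 1]
    C = dict_rng.normal(size=(M, d))                     # c_m ~ N(0, I_5)
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(Ti, d))                         # x_{i,t} ~ N(0, I_5)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)  # squared distances
    y = np.exp(-d2 / (2.0 * sigma ** 2)) @ b \
        + rng.normal(scale=np.sqrt(noise_var), size=Ti)  # e_{i,t}
    return X, y
```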
This dataset collects $T = 9358$ samples measured by a gas multi-sensor device in an Italian city, where $x_t \in \mathbb{R}^{13}$ represents the hourly concentration of CO, NOx, NO2, etc., while $y_t$ denotes the concentration of polluting chemicals in the air (De Vito et al., 2008).

5.3 Parameter setting and performance analysis

For the synthetic data, we adopt a Gaussian kernel with bandwidth $\sigma = 1$ for training and use $L = 100$ random features for kernel approximation. Note that the chosen $\sigma$ differs from that of the actual data model. The censoring thresholds are $h(k) = 0.95^k$, and the regularization parameter $\lambda$ and stepsize $\rho$ of DKLA and COKE are set to $5 \times 10^{-5}$ and $10^{-2}$, respectively. The stepsize of CTA is set to $\eta = 0.99$, which is tuned to achieve the same level of learning performance as COKE and DKLA at its fastest speed.

To present the performance of all algorithms on the real datasets concisely and comprehensively, we show the experimental results on the Twitter dataset with $T = 13800$ samples in figures and record the experimental results on the remaining datasets in tables. For the Twitter dataset with $T = 13800$ samples, we randomly split it into 10 mini-batches, each with $T_i \in (1200, 1400)$ data pairs such that $\sum_{i=1}^{10} T_i = T$. The 10 mini-batches are distributed to 10 agents connected by a random network with 28 edges. We use 100 random features to approximate a Gaussian kernel with bandwidth $\sigma = 1$ during training. The parameters $\lambda$ and $\rho$ are set to $10^{-3}$ and $10^{-2}$, respectively. The censoring thresholds are $h(k) = 0.97^k$. The stepsize of CTA is set to $\eta = 0.99$ to balance the learning performance and the convergence speed.

In Fig. 1, we show that the individually learned functional at each agent via COKE reaches consensus at the optimal estimate for both synthetic and real datasets. In Fig. 2, we compare the MSE performance of COKE, DKLA, and CTA. Both figures show that COKE converges more slowly than DKLA due to the communications skipped by the censoring step.
However, the learning performance of COKE eventually matches that of DKLA. The diffusion-based CTA algorithm converges the slowest. In Fig. 3, we show the MSE performance versus the communication cost (in terms of the number of transmissions).

Figure 1: Functional convergence via COKE for synthetic data (a) and the Twitter dataset (b). The learned functionals of all distributed agents converge to the optimal estimate obtained when the data are assumed to be centrally available.

Table 1: MSE performance on the Twitter dataset (large); $\sigma = 1$, $L = 100$, $\lambda = 10^{-3}$, stepsize $\eta = 0.99$ for CTA, stepsize $\rho = 10^{-2}$ for DKLA and COKE, censoring thresholds $h(k) = 0.5 \times 0.98^k$. Training error in MSE ($10^{-3}$) / communication cost; test error in MSE ($10^{-3}$). DKLA and COKE achieve better MSE performance than CTA, while COKE requires less communication than DKLA.

| Iteration | CTA train/comm. | DKLA train/comm. | COKE train/comm. | CTA test | DKLA test | COKE test |
|---|---|---|---|---|---|---|
| k = 50 | 3.9 / 500 | 2.4 / 500 | 4.5 / 13 | 4.0 | 2.6 | 4.2 |
| k = 100 | 3.3 / 1000 | 2.4 / 1000 | 2.6 / 100 | 3.4 | 2.6 | 2.8 |
| k = 200 | 3.0 / 2000 | 2.3 / 2000 | 2.4 / 298 | 3.2 | 2.5 | 2.6 |
| k = 500 | 2.7 / 5000 | 2.3 / 5000 | 2.3 / 902 | 2.9 | 2.5 | 2.5 |
| k = 1000 | 2.5 / 10000 | 2.2 / 10000 | 2.2 / 4648 | 2.7 | 2.5 | 2.5 |
| k = 1500 | 2.5 / 15000 | 2.2 / 15000 | 2.2 / 9648 | 2.7 | 2.5 | 2.5 |
| k = 2000 | 2.4 / 20000 | 2.2 / 20000 | 2.2 / 14648 | 2.6 | 2.5 | 2.5 |

CTA converges the slowest and communicates at every iteration, so its communication cost is much higher than that of DKLA; we therefore do not include it in Fig. 3 but rather focus on the communication saving of COKE over DKLA. We can see that, to achieve the same level of learning performance, COKE requires a much lower communication cost than DKLA. Both the synthetic data and the real dataset show a communication saving of around 50% in Fig. 3 for a given learning accuracy, which corroborates the communication efficiency of COKE.
The performance of all three algorithms on the remaining four datasets is listed in Tables 1-6. All results show that COKE saves substantial communication (almost 50%) at a negligible learning gap from DKLA, and both DKLA and COKE require far fewer communication resources than CTA.

Figure 2: MSE performance for synthetic data (a) and the Twitter dataset (b). The ADMM-based algorithms (COKE and DKLA) converge faster than the diffusion-based algorithm (CTA) on both datasets. Furthermore, DKLA and COKE achieve better learning performance than CTA in terms of MSE on the real dataset.

Figure 3: MSE performance versus communication cost for synthetic data (a) and the Twitter dataset (b). Compared with DKLA, COKE achieves around 50% communication saving at the same level of MSE performance on both datasets.

Table 2: MSE performance on the Tom's hardware dataset; $\sigma = 1$, $L = 100$, $\lambda = 10^{-2}$, stepsize $\eta = 0.99$ for CTA, stepsize $\rho = 10^{-2}$ for DKLA and COKE, censoring thresholds $h(k) = 0.5 \times 0.95^k$. Training error in MSE ($10^{-4}$) / communication cost; test error in MSE ($10^{-4}$). DKLA and COKE achieve better MSE performance than CTA, while COKE requires less communication than DKLA.

| Iteration | CTA train/comm. | DKLA train/comm. | COKE train/comm. | CTA test | DKLA test | COKE test |
|---|---|---|---|---|---|---|
| k = 50 | 20.02 / 500 | 10.01 / 500 | 17.40 / 10 | 20.16 | 11.20 | 18.82 |
| k = 100 | 16.6 / 1000 | 9.91 / 1000 | 10.67 / 112 | 17.09 | 11.10 | 11.86 |
| k = 200 | 13.68 / 2000 | 9.90 / 2000 | 9.97 / 331 | 14.58 | 11.10 | 11.15 |
| k = 500 | 11.19 / 5000 | 9.90 / 5000 | 9.90 / 1114 | 12.35 | 11.10 | 11.10 |
| k = 1000 | 10.27 / 10000 | 9.90 / 10000 | 9.90 / 5600 | 11.47 | 11.10 | 11.10 |
| k = 1500 | 10.01 / 15000 | 9.90 / 15000 | 9.90 / 10600 | 11.22 | 11.10 | 11.10 |
| k = 2000 | 9.92 / 20000 | 9.90 / 20000 | 9.90 / 15600 | 11.13 | 11.10 | 11.10 |

Table 3: MSE performance (training error) versus communication cost on the Twitter dataset (large) and the Tom's hardware dataset. For both datasets, COKE saves around 50% of the communication resources of DKLA to achieve the same level of learning performance.

| Twitter (large) MSE ($10^{-3}$) | CTA | DKLA | COKE | Tom's hardware MSE ($10^{-4}$) | CTA | DKLA | COKE |
|---|---|---|---|---|---|---|---|
| 5 | 360 | 20 | 10 | 18 | 680 | 20 | 3 |
| 4 | 480 | 30 | 10 | 16 | 1020 | 30 | 22 |
| 3 | 1860 | 60 | 48 | 14 | 1610 | 60 | 28 |
| 2.8 | 3250 | 100 | 55 | 12 | 2880 | 110 | 51 |
| 2.6 | 6120 | 180 | 100 | 10 | 7950 | 250 | 128 |
| 2.3 | - | 1080 | 577 | 9.95 | 17620 | 640 | 361 |
| 2.2 | - | 5660 | 4428 | 9.90 | - | 1550 | 984 |

For example, the number of transmissions required by COKE to reach a training error of $2.3 \times 10^{-3}$ on the Twitter dataset is 577, which is only 53% of that required by DKLA to reach the same level of learning performance. For the Tom's hardware dataset, COKE requires 361 total transmissions to reach a learning error of $9.95 \times 10^{-4}$, which is 56.4% of the transmissions of DKLA and 0.02% of those of CTA. Note that much of the censoring occurs in the early update iterations.
At the later stage, in contrast, COKE transmits nearly all parameters at every iteration, since the censoring thresholds become smaller than the difference between two consecutive updates.

Table 4: MSE performance on the Energy dataset; $\sigma = 0.1$, $L = 100$, $\lambda = 10^{-3}$, stepsize $\eta = 0.99$ for CTA, $\rho = 10^{-2}$ for DKLA and COKE, censoring thresholds $h(k) = 0.5 \times 0.98^k$. Training error in MSE ($10^{-3}$) / communication cost; test error in MSE ($10^{-3}$). DKLA and COKE achieve better MSE performance than CTA, while COKE requires the least communication.

| Iteration | CTA train/comm. | DKLA train/comm. | COKE train/comm. | CTA test | DKLA test | COKE test |
|---|---|---|---|---|---|---|
| k = 50 | 25.65 / 500 | 22.52 / 500 | 25.22 / 0 | 26.45 | 22.97 | 26.02 |
| k = 100 | 24.88 / 1000 | 22.12 / 1000 | 23.65 / 57 | 25.57 | 22.50 | 24.2 |
| k = 200 | 24.17 / 2000 | 21.81 / 2000 | 22.57 / 254 | 24.77 | 22.15 | 23.02 |
| k = 500 | 23.40 / 5000 | 21.55 / 5000 | 21.88 / 987 | 23.92 | 21.86 | 22.22 |
| k = 1000 | 22.84 / 10000 | 21.48 / 10000 | 21.51 / 5752 | 23.31 | 21.79 | 21.82 |
| k = 1500 | 22.54 / 15000 | 21.47 / 15000 | 21.47 / 10752 | 22.97 | 21.78 | 21.78 |
| k = 2000 | 22.35 / 20000 | 21.47 / 20000 | 21.47 / 15752 | 22.75 | 21.78 | 21.78 |

Table 5: MSE performance on the Air quality dataset; $\sigma = 0.1$, $L = 200$, $\lambda = 10^{-5}$, stepsize $\eta = 0.99$ for CTA, $\rho = 10^{-2}$ for DKLA and COKE, censoring thresholds $h(k) = 0.9 \times 0.97^k$. Training error in MSE ($10^{-3}$) / communication cost; test error in MSE ($10^{-3}$). DKLA and COKE achieve better MSE performance than CTA, while COKE requires less communication than DKLA.

| Iteration | CTA train/comm. | DKLA train/comm. | COKE train/comm. | CTA test | DKLA test | COKE test |
|---|---|---|---|---|---|---|
| k = 50 | 6.4 / 500 | 1.8 / 500 | 3.7 / 72 | 6.7 | 2.1 | 4.0 |
| k = 100 | 4.5 / 1000 | 1.6 / 1000 | 2.2 / 172 | 4.8 | 1.8 | 2.5 |
| k = 200 | 3.2 / 2000 | 1.4 / 2000 | 1.7 / 384 | 3.5 | 1.7 | 2.0 |
| k = 500 | 2.2 / 5000 | 1.3 / 5000 | 1.3 / 2263 | 2.5 | 1.6 | 1.6 |
| k = 1000 | 1.7 / 10000 | 1.2 / 10000 | 1.2 / 7263 | 2.0 | 1.6 | 1.6 |
| k = 1500 | 1.6 / 15000 | 1.2 / 15000 | 1.2 / 12263 | 1.8 | 1.6 | 1.6 |
| k = 2000 | 1.5 / 20000 | 1.2 / 20000 | 1.2 / 17263 | 1.8 | 1.6 | 1.6 |

Table 6: MSE performance (training error) versus communication cost on the Energy dataset and the Air quality dataset. For both datasets, COKE saves around 45%-55% of the communication resources of DKLA to achieve the same level of learning performance.

| Energy MSE ($10^{-3}$) | CTA | DKLA | COKE | Air quality MSE ($10^{-3}$) | CTA | DKLA | COKE |
|---|---|---|---|---|---|---|---|
| 25 | 860 | 20 | 11 | 5.0 | 810 | 60 | 49 |
| 24 | 2290 | 70 | 48 | 3.0 | 2290 | 180 | 81 |
| 23.5 | 4160 | 140 | 76 | 2.0 | 6010 | 360 | 211 |
| 23 | 7690 | 250 | 134 | 1.8 | 8160 | 490 | 285 |
| 22.5 | 14750 | 480 | 258 | 1.6 | 12300 | 750 | 424 |
| 22 | - | 1160 | 652 | 1.5 | 16190 | 1010 | 586 |
| 21.5 | - | 4950 | 4062 | 1.2 | - | 5990 | 5383 |

6. Concluding Remarks

This paper studies the decentralized kernel learning problem under privacy concerns and communication constraints for multi-agent systems. Leveraging random feature mapping, we convert the non-parametric kernel learning problem into a parametric one in the RF space and solve it under the consensus optimization framework via the alternating direction method of multipliers. A censoring strategy is applied to conserve communication resources. Through both theoretical analysis and simulations, we establish that the proposed algorithms not only achieve a linear convergence rate but also exhibit effective generalization performance. Thanks to the fixed-size parametric learning model, the proposed algorithms circumvent the curse of dimensionality and do not involve raw data exchange among agents. Hence, they can be applied to distributed learning problems that involve big data and offer some level of data privacy protection. To cope with dynamic environments and enhance the learning performance, future work will be devoted to decentralized online kernel learning and multi-kernel learning.

Acknowledgments

We would like to acknowledge support for this project from the National Science Foundation (NSF grants #1741338 and #1939553).

Appendix A. Proof of Theorem 1 and Theorem 2

Proof.
As discussed in Section 3.2, solving the decentralized kernel learning problem in the RF space (17) is equivalent to solving problem (16). From (14), it is evident that the convergence of the optimal functional $f$ in (16) hinges on the convergence of the decision variables $\theta$ in the RF space. Since the decision variables in the RF space are data-independent, the convergence proof of DKLA boils down to proving the convergence of a convex optimization problem solved by ADMM. The convergence proof of COKE, however, is nontrivial because of the error caused by the outdated information introduced by the communication censoring strategy. Our proof of both theorems consists of two steps. The first step is to show the linear convergence of the decision parameters $\theta$, for DKLA via Theorem 4 and for COKE via Theorem 5 below, which are derived straightforwardly from (Shi et al., 2014) and (Liu et al., 2019), respectively. The second step is to show how the convergence of $\theta$ translates to the convergence of the learned functional, which is the same for both algorithms. Compared to (Shi et al., 2014) and (Liu et al., 2019), which deal with general optimization problems for parametric learning, this work focuses on the specific problem of decentralized kernel learning, which is more challenging in both solution development and theoretical analysis. By leveraging the RF mapping technique, we successfully develop the DKLA and COKE algorithms. Notably, a direct application of ADMM as in (Shi et al., 2014) to decentralized kernel learning is infeasible without raw data exchanges. Moreover, we analyze the convergence of the nonlinear functional to be learned and the generalization performance of kernel learning in the decentralized setting. The analysis builds on (Shi et al., 2014) and (Liu et al., 2019) but goes further, and it is only attainable because of the adoption of the RF mapping.
For both algorithms, the linear convergence of the decision variables in the RF space is based on a matrix reformulation of (17). Define $\Theta^* := [\theta^{*\top}; \dots; \theta^{*\top}] \in \mathbb{R}^{N \times L}$ and $\tilde\Theta^* := [\vartheta^{*\top}; \dots; \vartheta^{*\top}] \in \mathbb{R}^{N \times L}$ as the optimal primal variables, and $\beta^*$ as the optimal dual variable. Then, for DKLA, Theorem 4 states that $\{\Theta^k\}$ (with $\Theta^k := [\theta_1^{k\top}; \dots; \theta_N^{k\top}] \in \mathbb{R}^{N \times L}$) is R-linearly convergent to the optimal $\Theta^*$. For the detailed proof, see (Shi et al., 2014).

Theorem 4 [Linear convergence of decision variables in DKLA] For the optimization problem (16), initialize the dual variables as $\gamma_i^0 = 0, \forall i$. Under Assumptions 1-2, $\{\Theta^k\}$ is R-linearly convergent to the optimal $\Theta^*$ as $k$ goes to infinity, following from
$$\|\Theta^k - \Theta^*\|_F^2 \le \frac{1}{m_{\hat R}}\Big(\rho\,\|\tilde\Theta^{k-1} - \tilde\Theta^*\|_F^2 + \frac{1}{\rho}\,\|\beta^{k-1} - \beta^*\|_F^2\Big),$$
where $\{(\tilde\Theta^k, \beta^k)\}$ is Q-linearly convergent to its optimum $(\tilde\Theta^*, \beta^*)$:
$$\rho\,\|\tilde\Theta^k - \tilde\Theta^*\|_F^2 + \frac{1}{\rho}\,\|\beta^k - \beta^*\|_F^2 \le \frac{1}{1+\delta_d}\Big(\rho\,\|\tilde\Theta^{k-1} - \tilde\Theta^*\|_F^2 + \frac{1}{\rho}\,\|\beta^{k-1} - \beta^*\|_F^2\Big),$$
with
$$\delta_d = \min\left\{\frac{(\nu-1)\,\sigma_{\min}^2(S_-)}{\nu\,\sigma_{\max}^2(S_+)},\ \frac{m_{\hat R}}{\frac{\rho}{4}\sigma_{\max}^2(S_+) + \frac{\nu M_{\hat R}^2}{\rho\,\sigma_{\min}^2(S_-)}}\right\},$$
where $\nu > 1$ is an arbitrary constant, $\sigma_{\max}(S_+)$ is the maximum singular value of the unsigned incidence matrix $S_+$ of the network, $\sigma_{\min}(S_-)$ is the minimum non-zero singular value of the signed incidence matrix $S_-$ of the network, and $m_{\hat R}$ and $M_{\hat R}$ are the minimum strong convexity constant of the local cost functions and the maximum Lipschitz constant of the local gradients, respectively. The Q-linear convergence rate of $\{(\tilde\Theta^k, \beta^k)\}$ to $(\tilde\Theta^*, \beta^*)$ thus satisfies
$$\frac{1}{1+\delta_d}. \qquad (31)$$

To achieve linear convergence of the decision variables in COKE, choosing appropriate censoring functions is crucial. Moreover, the penalty parameter $\rho$ also needs to satisfy certain conditions; see Theorem 5 for details (Liu et al., 2019).
Theorem 5 [Linear convergence of decision variables in COKE] For the optimization problem (16) with strongly convex local cost functions whose gradients are Lipschitz continuous, initialize the dual variables as $\gamma_i^0 = 0, \forall i$, set the censoring thresholds to $h(k) = v\mu^k$ with $v > 0$ and $\mu \in (0, 1)$, and choose the penalty parameter $\rho$ such that
$$0 < \rho < \min\left\{\eta_1,\ \frac{(\nu-1)\,\sigma_{\min}^2(S_-)}{\nu\eta_3\,\sigma_{\max}^2(S_+)},\ \frac{m_{\hat R} - \eta_3\nu M_{\hat R}^2/\sigma_{\min}^2(S_-)}{\big(\frac{\eta_1}{4} + \eta_2\big)\,\sigma_{\max}^2(S_+)}\right\},$$
where $\eta_1 > 0$, $\eta_2 > 0$, $\eta_3 > 0$, and $\nu > 1$ are arbitrary constants, $m_{\hat R}$ and $M_{\hat R}$ are the minimum strong convexity constant of the local cost functions and the maximum Lipschitz constant of the local gradients, respectively, and $\sigma_{\max}(S_+)$ and $\sigma_{\min}(S_-)$ are the maximum singular value of the unsigned incidence matrix $S_+$ and the minimum non-zero singular value of the signed incidence matrix $S_-$ of the network, respectively. Then, $\{\Theta^k\}$ is R-linearly convergent to the optimal $\Theta^*$ as $k$ goes to infinity.

Remark 3. For the kernel ridge regression problem (25), the minimum strong convexity constant of the local cost functions and the maximum Lipschitz constant of the local gradients are $m_{\hat R} := \min_i \sigma_{\min}^2\big(\frac{1}{T_i}\Phi_L^i(\Phi_L^i)^\top + \frac{2\lambda}{N}I\big)$ and $M_{\hat R} := \max_i \sigma_{\max}^2\big(\frac{1}{T_i}\Phi_L^i(\Phi_L^i)^\top + \frac{2\lambda}{N}I\big)$, respectively.

With the convergence of the decision variables in the RF space given in Theorems 4 and 5, the second step is to prove the linear convergence of the learned functional $\hat f_{\theta_i^k}(x)$ to the optimal $\hat f_{\theta^*}(x)$, which is straightforward for both algorithms. Denote $\hat f_{\Theta^k}(x) = [\hat f_{\theta_1^k}(x), \dots, \hat f_{\theta_N^k}(x)]^\top = \Theta^k \phi_L(x)$ and $\hat f_{\Theta^*}(x) = [\hat f_{\theta^*}(x), \dots, \hat f_{\theta^*}(x)]^\top = \Theta^* \phi_L(x)$; then we have
$$\|\hat f_{\Theta^k}(x) - \hat f_{\Theta^*}(x)\|_2 = \|\Theta^k \phi_L(x) - \Theta^* \phi_L(x)\|_2 \le \|\Theta^k - \Theta^*\|_2\,\|\phi_L(x)\|_2 \le \|\Theta^k - \Theta^*\|_F,$$
where the second inequality comes from the fact that $\|\phi_L(x)\|_2 \le 1$ with the adopted RF mapping. For DKLA, we have
$$\|\hat f_{\Theta^k}(x) - \hat f_{\Theta^*}(x)\|_2 \le \|\Theta^k - \Theta^*\|_F \le \sqrt{\frac{1}{m_{\hat R}}\Big(\rho\,\|\tilde\Theta^{k-1} - \tilde\Theta^*\|_F^2 + \frac{1}{\rho}\,\|\beta^{k-1} - \beta^*\|_F^2\Big)}.$$
Therefore, the Q-linear convergence of $\{(\tilde\Theta^k, \beta^k)\}$ to the optimal $(\tilde\Theta^*, \beta^*)$ translates to the R-linear convergence of $\{\hat f_{\Theta^k}(x)\}$.
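The bound $\|\phi_L(x)\|_2 \le 1$ invoked above can be checked numerically. The sketch below assumes the paired sin/cos RF construction (one of the two forms the paper labels (12)/(13)), for which the norm equals 1 exactly; the function name and shapes are illustrative.

```python
import numpy as np

def phi_sincos(x, omega):
    """RF map in the sin/cos form: phi_L(x) = (1/sqrt(L)) [cos(w_l^T x),
    sin(w_l^T x)]_{l=1}^L.  Since cos^2 + sin^2 = 1 for each frequency,
    ||phi_L(x)||_2 = 1 for every x, which yields the inequality
    ||phi_L(x)||_2 <= 1 used in the functional-convergence step."""
    z = omega.T @ x                       # the L inner products w_l^T x
    L = z.shape[0]
    return np.concatenate([np.cos(z), np.sin(z)]) / np.sqrt(L)
```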
Similarly, the R-linear convergence of $\{\Theta^k\}$ to the optimal $\Theta^*$ in COKE can be translated from the Q-linear convergence of $\{(\tilde\Theta^k, \beta^k)\}$ to the optimal $(\tilde\Theta^*, \beta^*)$; see Liu et al. (2019) for the detailed proof. It is then straightforward to see that the individually learned functionals converge to the optimal one as $k$ goes to infinity; that is, for $i \in \mathcal{N}$,
$$\lim_{k\to\infty} |\hat f_{\theta_i^k}(x) - \hat f_{\theta^*}(x)| = \lim_{k\to\infty} |(\theta_i^k)^\top \phi_L(x) - (\theta^*)^\top \phi_L(x)| \le \lim_{k\to\infty} \|\theta_i^k - \theta^*\|_2\,\|\phi_L(x)\|_2 \le \lim_{k\to\infty} \|\theta_i^k - \theta^*\|_2 = 0.$$

Appendix B. Proof of Theorem 3

Proof. The empirical risk (6) to be minimized for the kernel regression problem in the RKHS is
$$\min_{\alpha \in \mathbb{R}^T}\ \hat R(\alpha) = \sum_{i=1}^N \hat R_i(\alpha) = \sum_{i=1}^N \frac{1}{T_i}\,\|y_i - K_i^\top \alpha\|_2^2 + \lambda_i\,\alpha^\top K \alpha, \qquad (36)$$
where $y_i = [y_{i,1}, \dots, y_{i,T_i}]^\top \in \mathbb{R}^{T_i \times 1}$, and the matrices $K_i \in \mathbb{R}^{T \times T_i}$ and $K \in \mathbb{R}^{T \times T}$ store the similarity between the total data and the data from agent $i$, and the similarity of all data, respectively, under the assumption that all data are available to all agents. The optimal solution is given in closed form by
$$\alpha^* = (\bar K^\top \bar K + \lambda K)^{-1} \bar K^\top \bar y, \qquad (37)$$
where $\bar K = [\bar K_1, \dots, \bar K_N]^\top \in \mathbb{R}^{T \times T}$ with $\bar K_i = \frac{1}{\sqrt{T_i}} K_i, \forall i \in \mathcal{N}$, $\bar y = [\bar y_1; \dots; \bar y_N] \in \mathbb{R}^{T \times 1}$ with $\bar y_i = \frac{1}{\sqrt{T_i}} y_i, \forall i \in \mathcal{N}$, and $\lambda = \sum_{i=1}^N \lambda_i$. Denote the predicted values on the training examples using $\alpha^*$ as $f_{\alpha^*}^i \in \mathbb{R}^{T_i}$ for node $i$ and the overall predictions as $f_{\alpha^*} = [f_{\alpha^*}^1; \dots; f_{\alpha^*}^N] \in \mathbb{R}^T$. In the corresponding RF space, we denote the predicted values obtained for node $i$ by $\theta^*$ in (26) as $f_{\theta^*}^i \in \mathbb{R}^{T_i}$ and the overall prediction as $f_{\theta^*} = [f_{\theta^*}^1; \dots; f_{\theta^*}^N] \in \mathbb{R}^T$. To prove Theorem 3, we start by customizing several lemmas and theorems from the literature, which facilitate proving our main results.

Definition 1 (Bartlett and Mendelson, 2002, Definition 2) Let $\{x_q\}_{q=1}^Q$ be i.i.d. samples drawn from the probability distribution $p_X$. Let $\mathcal{F}$ be a class of functions that map $\mathcal{X}$ to $\mathbb{R}$. Define the random variable
$$\hat R_Q(\mathcal{F}) := \mathbb{E}_\epsilon\Big[\sup_{f \in \mathcal{F}}\Big|\frac{2}{Q}\sum_{q=1}^Q \epsilon_q f(x_q)\Big|\ \Big|\ x_1, \dots, x_Q\Big], \qquad (38)$$
where $\{\epsilon_q\}_{q=1}^Q$ are i.i.d. $\{\pm 1\}$-valued random variables with $\mathbb{P}(\epsilon_q = 1) = \mathbb{P}(\epsilon_q = -1) = \frac{1}{2}$.
Then, the Rademacher complexity of $\mathcal{F}$ is defined as
$$\mathcal{R}_Q(\mathcal{F}) := \mathbb{E}\big[\hat{\mathcal{R}}_Q(\mathcal{F})\big]. \qquad (39)$$
Rademacher complexity is adopted in machine learning and the theory of computation to measure the richness of a class of real-valued functions with respect to a probability distribution. Here we adopt it to measure the richness of functions defined in the RKHS induced by the positive definite kernel $\kappa$ with respect to the sample distribution $p$.

Lemma 2 (Bartlett and Mendelson, 2002, Lemma 22) Let $\mathcal{H}$ be an RKHS associated with a positive definite kernel $\kappa$ that maps $\mathcal{X}$ to $\mathbb{R}$. Then, we have $\hat{\mathcal{R}}_Q(\mathcal{H}) \le \frac{2}{Q}\sqrt{\mathrm{Tr}(K)}$, where $K$ is the kernel matrix for kernel $\kappa$ over the i.i.d. sample set $\{x_q\}_{q=1}^Q$. Correspondingly, the Rademacher complexity satisfies $\mathcal{R}_Q(\mathcal{H}) \le \frac{2}{Q}\,\mathbb{E}\big[\sqrt{\mathrm{Tr}(K)}\big]$.

The next theorems state that the generalization performance of a particular estimator in $\mathcal{H}$ depends not only on the number of data points but also on the complexity of $\mathcal{H}$.

Theorem 6 (Bartlett and Mendelson, 2002, Theorem 8) Let $\{x_q, y_q\}_{q=1}^Q$ be i.i.d. samples drawn from the distribution $p$ defined on $\mathcal{X} \times \mathcal{Y}$. Assume the loss function $\ell: \mathcal{Y} \times \mathbb{R} \to [0, 1]$ is Lipschitz continuous with a Lipschitz constant $M_\ell$. Define the expected risk for all $f \in \mathcal{H}$ as $\mathcal{E}(f) = \mathbb{E}_p[\ell(f(x), y)]$ and its corresponding empirical risk as $\hat{\mathcal{E}}(f) = \frac{1}{Q}\sum_{q=1}^Q \ell(y_q, f(x_q))$. Then, for $\delta_p \in (0, 1)$, with probability at least $1 - \delta_p$, every $f \in \mathcal{H}$ satisfies
$$\mathcal{E}(f) \le \hat{\mathcal{E}}(f) + \mathcal{R}_Q(\ell \circ \mathcal{H}) + \sqrt{\frac{8\log(2/\delta_p)}{Q}},$$
where $\ell \circ \mathcal{H} = \{(x, y) \mapsto \ell(y, f(x)) - \ell(y, 0)\,|\,f \in \mathcal{H}\}$.

Theorem 7 (Bartlett and Mendelson, 2002, Theorem 12) If $\ell: \mathcal{Y} \times \mathbb{R} \to [0, 1]$ is Lipschitz with constant $M_\ell$ and satisfies $\ell(0) = 0$, then $\mathcal{R}_Q(\ell \circ \mathcal{H}) \le 2M_\ell \mathcal{R}_Q(\mathcal{H})$.

Lemma 3 (Li et al., 2018, Modified Proposition 1) For the RKHS induced by the kernel $\kappa$ with expression (10), define $\hat{\mathcal{H}}^k := \{\hat{f}^k : \hat{f}^k = (\theta^k)^\top \phi_L(x) = \sum_{l=1}^L \theta_l^k \phi(x, \omega_l)\}$. Then, for $\hat{f}^k \in \hat{\mathcal{H}}^k$, $\|\hat{f}^k\|_{\hat{\mathcal{H}}^k}^2 \le \|\theta^k\|_2^2$, where $\hat{\mathcal{H}}^k$ is the RKHS of the functions $\hat{f}^k$ at the $k$-th step. The kernel that induces $\hat{\mathcal{H}}^k$ is the approximated kernel $\hat{\kappa}_L$ defined in (11).
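For the unit ball of an RKHS, the supremum in Definition 1 has the closed form $\sup_{\|f\|_{\mathcal{H}} \le 1} \sum_q \epsilon_q f(x_q) = \sqrt{\epsilon^\top K \epsilon}$, which makes the Lemma 2 bound easy to check by Monte Carlo. The sketch below is purely illustrative (synthetic points, an assumed Gaussian kernel bandwidth, and an assumed number of Monte Carlo draws):

```python
import numpy as np

rng = np.random.default_rng(2)

# Empirical Rademacher complexity (Definition 1) of the unit RKHS ball,
# compared against the Lemma 2 bound (2/Q) * sqrt(Tr(K)).
Q = 50
X = rng.standard_normal((Q, 3))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.05 * d2)  # Gaussian kernel matrix; Tr(K) = Q here

# For f in the unit ball, sup_f sum_q eps_q f(x_q) = sqrt(eps^T K eps),
# so hat{R}_Q(H) = (2/Q) * E_eps[ sqrt(eps^T K eps) ].
n_mc = 2000
eps = rng.choice([-1.0, 1.0], size=(n_mc, Q))
hat_R = (2.0 / Q) * np.mean(np.sqrt(np.einsum('mq,qr,mr->m', eps, K, eps)))

bound = (2.0 / Q) * np.sqrt(np.trace(K))
print(hat_R, bound)  # hat_R falls below the bound, by Jensen's inequality
```

The gap between the estimate and the bound is exactly the Jensen gap, since $\mathbb{E}\sqrt{\epsilon^\top K \epsilon} \le \sqrt{\mathbb{E}[\epsilon^\top K \epsilon]} = \sqrt{\mathrm{Tr}(K)}$.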
Lemma 4 (Li et al., 2018, Lemma 6) For the decentralized kernel regression problem defined in Section 2, let $f_{\alpha^\star}$ and $f_{\theta^\star}$ be the predictions obtained by (37) and (26), respectively. Then we have
$$\langle y - f_{\alpha^\star},\ f_{\theta^\star} - f_{\alpha^\star}\rangle = 0. \qquad (41)$$

Theorem 8 (Li et al., 2018, Modified Theorem 5) For the decentralized kernel regression problem defined in Section 2, let $\lambda_K$ be the largest eigenvalue of the kernel matrix $K$, and choose the regularization parameter $\lambda < \lambda_K/T$ so as to control overfitting. Then, for all $\delta_p \in (0, 1)$ and $\|f\|_{\mathcal{H}} \le 1$, if the number of random features $L$ satisfies
$$L \ge \Big(\frac{1}{\epsilon^2} + \frac{2}{3\epsilon}\Big)\log\frac{16\,d_\lambda^K}{\delta_p},$$
then with probability at least $1 - \delta_p$, the following holds:
$$\sup_{\|f\|_{\mathcal{H}} \le 1}\ \inf_\theta\ \frac{1}{T}\|f_x - f_\theta\|_2^2 \le 2\lambda, \qquad (42)$$
where $f_x \in \mathbb{R}^T$ collects the predictions evaluated by $f \in \mathcal{H}$ on all samples and $\epsilon \in (0, 1)$.

With the above lemmas and theorems, we are ready to prove Theorem 3, which relies on the following decomposition:
$$\mathcal{E}(\hat{f}^k) - \mathcal{E}(f_{\mathcal{H}}) = \underbrace{\mathcal{E}(\hat{f}^k) - \hat{\mathcal{E}}(\hat{f}^k)}_{(1)\ \text{estimation error}} + \underbrace{\hat{\mathcal{E}}(\hat{f}^k) - \hat{\mathcal{E}}(\hat{f}_{\theta^\star})}_{(2)\ \text{convergence error}} + \underbrace{\hat{\mathcal{E}}(\hat{f}_{\theta^\star}) - \hat{\mathcal{E}}(\hat{f}_{\alpha^\star})}_{(3)\ \text{approximation error of RF mapping}} + \underbrace{\hat{\mathcal{E}}(\hat{f}_{\alpha^\star}) - \mathcal{E}(\hat{f}_{\alpha^\star})}_{(4)\ \text{estimation error}} + \underbrace{\mathcal{E}(\hat{f}_{\alpha^\star}) - \mathcal{E}(f_{\mathcal{H}})}_{(5)\ \text{approximation error of kernel representation}},$$
where $\mathcal{E}(\hat{f}^k)$, $\hat{\mathcal{E}}(\hat{f}^k)$, $\hat{\mathcal{E}}(\hat{f}_{\theta^\star})$, $\mathcal{E}(\hat{f}_{\alpha^\star})$, and $\hat{\mathcal{E}}(\hat{f}_{\alpha^\star})$ are defined as follows for the kernel regression problem:
$$\mathcal{E}(\hat{f}^k) := \sum_{i=1}^N \mathcal{E}_i(\hat{f}_{\theta_i^k}) = \sum_{i=1}^N \mathbb{E}_p[(y - (\theta_i^k)^\top \phi_L(x))^2] := \mathbb{E}_p[\|y_N - \Phi_N \Theta^k\|_2^2],$$
$$\hat{\mathcal{E}}(\hat{f}^k) := \sum_{i=1}^N \hat{\mathcal{E}}_i(\hat{f}_{\theta_i^k}) = \sum_{i=1}^N \frac{1}{T_i}\|y_i - (\Phi_L^i)^\top \theta_i^k\|_2^2 = \sum_{i=1}^N \|\bar{y}_i - (\bar{\Phi}_L^i)^\top \theta_i^k\|_2^2 = \|\bar{y} - \bar{\Phi}_B \Theta^k\|_2^2,$$
$$\hat{\mathcal{E}}(\hat{f}_{\theta^\star}) := \sum_{i=1}^N \hat{\mathcal{E}}_i(\hat{f}_{\theta^\star}) = \sum_{i=1}^N \frac{1}{T_i}\|y_i - (\Phi_L^i)^\top \theta^\star\|_2^2 = \sum_{i=1}^N \|\bar{y}_i - (\bar{\Phi}_L^i)^\top \theta^\star\|_2^2 = \|\bar{y} - \bar{\Phi}_B \Theta^\star\|_2^2,$$
$$\mathcal{E}(\hat{f}_{\alpha^\star}) := \sum_{i=1}^N \mathcal{E}_i(\hat{f}_{\alpha^\star}) = \sum_{i=1}^N \mathbb{E}_p[(y - (\alpha^\star)^\top \kappa(x))^2] := N\,\mathbb{E}_p[(y - (\alpha^\star)^\top \kappa(x))^2],$$
$$\hat{\mathcal{E}}(\hat{f}_{\alpha^\star}) := \sum_{i=1}^N \frac{1}{T_i}\|y_i - K_i^\top \alpha^\star\|_2^2 = \sum_{i=1}^N \|\bar{y}_i - \bar{K}_i^\top \alpha^\star\|_2^2,$$
where $y_N = y\mathbf{1}_N$, $\Phi_N = \mathrm{diag}\big(\phi_L(x)^\top, \ldots, \phi_L(x)^\top\big) \in \mathbb{R}^{N \times NL}$, $\bar{\Phi}_B = \mathrm{diag}\big((\bar{\Phi}_L^1)^\top, \ldots, (\bar{\Phi}_L^N)^\top\big) \in \mathbb{R}^{T \times NL}$, $\Theta^k = [\theta_1^k; \ldots; \theta_N^k] \in \mathbb{R}^{NL}$, and $\Theta^\star = [\theta^\star; \ldots; \theta^\star] \in \mathbb{R}^{NL}$. Then, we upper bound the excess risk $\mathcal{E}(\hat{f}^k)$ learned by COKE by upper bounding the five decomposed terms.
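The RF normalization $\|\phi_L(x)\|_2 \le 1$ that these bounds rely on can be realized concretely. The sketch below is a minimal random Fourier feature construction for the Gaussian kernel, assuming an illustrative dimension, feature count, and seed; the paired cos/sin form makes $\|\phi_L(x)\|_2 = 1$ exactly, which also gives $\mathrm{Tr}(\hat{K}) \le T$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Random Fourier features for the Gaussian kernel kappa(x,x') = exp(-||x-x'||^2/2).
d, L = 3, 2000
W = rng.standard_normal((L, d))  # omega_l ~ N(0, I), per Bochner's theorem

def phi_L(x):
    """RF map with ||phi_L(x)||_2 = 1 exactly, since cos^2 + sin^2 = 1."""
    z = W @ x
    return np.concatenate([np.cos(z), np.sin(z)]) / np.sqrt(L)

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
kappa_hat = phi_L(x1) @ phi_L(x2)                  # approximated kernel, cf. (11)
kappa = np.exp(-np.linalg.norm(x1 - x2) ** 2 / 2)  # exact Gaussian kernel
print(kappa_hat, kappa)  # close for large L
```

Theorem 8 quantifies how large $L$ must be for this pointwise closeness to translate into a uniform $2\lambda$ bound on the prediction error over the unit RKHS ball.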
For term (1), for $\delta_{p1} \in (0, 1)$, with probability at least $1 - \delta_{p1}$,
$$\mathcal{E}(\hat{f}^k) - \hat{\mathcal{E}}(\hat{f}^k) \le \mathcal{R}_T(\ell_1 \circ \hat{\mathcal{H}}^k) + \sqrt{\tfrac{8\log(2/\delta_{p1})}{T}} \le 2M_{\ell_1}\mathcal{R}_T(\hat{\mathcal{H}}^k) + \sqrt{\tfrac{8\log(2/\delta_{p1})}{T}} \le \tfrac{4M_{\ell_1}}{T}\,\mathbb{E}\big[\sqrt{\mathrm{Tr}(\hat{K})}\big] + \sqrt{\tfrac{8\log(2/\delta_{p1})}{T}} \le \tfrac{4M_{\ell_1}}{\sqrt{T}} + \sqrt{\tfrac{8\log(2/\delta_{p1})}{T}} = \tfrac{C_1}{\sqrt{T}},$$
where $C_1 := 4M_{\ell_1} + \sqrt{8\log(2/\delta_{p1})}$, and $M_{\ell_1}$ is the Lipschitz constant of the loss function $\ell_1(\hat{f}_{\theta_i^k}, y) = ((\theta_i^k)^\top \phi_L(x) - y)^2$. The first inequality comes from Theorem 6, the second from Theorem 7, and the third from Lemma 2. For the last inequality, each element of the Gram matrix $\hat{K} \in \mathbb{R}^{T \times T}$ is given by (11), so $\mathrm{Tr}(\hat{K}) \le T\|\phi_L(x)\|_2^2 \le T$ with the adopted RF mapping, for which $\|\phi_L(x)\|_2^2 \le 1$.

Similarly, for term (4), with probability at least $1 - \delta_{p2}$ for $\delta_{p2} \in (0, 1)$, the following holds:
$$\hat{\mathcal{E}}(\hat{f}_{\alpha^\star}) - \mathcal{E}(\hat{f}_{\alpha^\star}) \le \mathcal{R}_T(\ell_2 \circ \mathcal{H}) + \sqrt{\tfrac{8\log(2/\delta_{p2})}{T}} \le 2M_{\ell_2}\mathcal{R}_T(\mathcal{H}) + \sqrt{\tfrac{8\log(2/\delta_{p2})}{T}} \le \tfrac{4M_{\ell_2}}{T}\,\mathbb{E}\big[\sqrt{\mathrm{Tr}(K)}\big] + \sqrt{\tfrac{8\log(2/\delta_{p2})}{T}} \le \tfrac{C_2}{\sqrt{T}},$$
where $C_2 := 4M_{\ell_2} + \sqrt{8\log(2/\delta_{p2})}$, and $M_{\ell_2}$ is the Lipschitz constant of the loss function $\ell_2(\hat{f}_{\alpha^\star}, y) = ((\alpha^\star)^\top \kappa(x) - y)^2$.

For term (2), we have
$$\hat{\mathcal{E}}(\hat{f}^k) - \hat{\mathcal{E}}(\hat{f}_{\theta^\star}) = \|\bar{y} - \bar{\Phi}_B\Theta^k\|_2^2 - \|\bar{y} - \bar{\Phi}_B\Theta^\star\|_2^2 \le \Big(\|\bar{\Phi}_B^\top(\bar{\Phi}_B\Theta^\star - \bar{y})\|_2 + \tfrac{M_{\ell_3}}{2}\Big)\|\Theta^k - \Theta^\star\|_2 = C_3\|\Theta^k - \Theta^\star\|_2,$$
where $C_3 := \|\bar{\Phi}_B^\top(\bar{\Phi}_B\Theta^\star - \bar{y})\|_2 + \tfrac{M_{\ell_3}}{2}$, and $M_{\ell_3}$ is the Lipschitz constant of the loss function $\ell_3(\bar{y}, \Theta) = \|\bar{y} - \bar{\Phi}_B\Theta\|_2^2$. From Theorems 4 and 5, we conclude that $\{\Theta^k\}$ converges linearly to $\Theta^\star$.

Term (3) is the approximation error caused by the RF mapping, which is bounded by
$$\hat{\mathcal{E}}(\hat{f}_{\theta^\star}) - \hat{\mathcal{E}}(\hat{f}_{\alpha^\star}) = \sum_{i=1}^N \|\bar{y}_i - (\bar{\Phi}_L^i)^\top \theta^\star\|_2^2 - \sum_{i=1}^N \|\bar{y}_i - \bar{K}_i^\top \alpha^\star\|_2^2 = \|\bar{y} - f_{\theta^\star}\|_2^2 - \|\bar{y} - f_{\alpha^\star}\|_2^2$$
$$= \|(\bar{y} - f_{\alpha^\star}) + (f_{\alpha^\star} - f_{\theta^\star})\|_2^2 - \|\bar{y} - f_{\alpha^\star}\|_2^2 = \|f_{\alpha^\star} - f_{\theta^\star}\|_2^2 + 2\langle \bar{y} - f_{\alpha^\star},\ f_{\alpha^\star} - f_{\theta^\star}\rangle$$
$$= \inf_{f_\theta}\|f_{\alpha^\star} - f_\theta\|_2^2 + 2\inf_{f_\theta}\langle \bar{y} - f_{\alpha^\star},\ f_{\alpha^\star} - f_\theta\rangle = \inf_{f_\theta}\|f_{\alpha^\star} - f_\theta\|_2^2 \le \sup_{f_x}\inf_{f_\theta}\|f_x - f_\theta\|_2^2 \le 2\lambda,$$
where the cross term vanishes by Lemma 4, and the last inequality comes from Theorem 8 with $f_x := [\tfrac{1}{\sqrt{T_1}}f_1; \ldots; \tfrac{1}{\sqrt{T_N}}f_N] \in \mathbb{R}^T$ and $f_i = [f(x_{i,1}), \ldots, f(x_{i,T_i})]^\top \in \mathbb{R}^{T_i}$ for $f \in \mathcal{H}$. To bound term (5), the approximation error of the models in the RKHS $\mathcal{H}$, we refer to the following lemma.
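The five-term decomposition is a telescoping sum: each intermediate risk is added and subtracted once, so the terms collapse back to the total excess risk. A tiny illustrative check with placeholder risk values (not from any real model) makes the bookkeeping explicit:

```python
# Telescoping structure of the excess-risk decomposition; all values are
# arbitrary placeholders chosen only to exercise the identity.
E_fk, Ehat_fk = 0.90, 0.70            # expected / empirical risk of the iterate
Ehat_ftheta, Ehat_falpha = 0.60, 0.55 # empirical risks of the RF and RKHS optima
E_falpha, E_fH = 0.65, 0.50           # expected risks of alpha* and best-in-class

terms = [
    E_fk - Ehat_fk,                # (1) estimation error
    Ehat_fk - Ehat_ftheta,         # (2) convergence error
    Ehat_ftheta - Ehat_falpha,     # (3) RF approximation error
    Ehat_falpha - E_falpha,        # (4) estimation error
    E_falpha - E_fH,               # (5) kernel approximation error
]
assert abs(sum(terms) - (E_fk - E_fH)) < 1e-12  # terms telescope exactly
```

Bounding each term separately, as done above and below, therefore bounds the total excess risk by the sum of the five individual bounds.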
Lemma 5 (Rudi and Rosasco, 2017, Modified Lemma 5) For the kernel $\kappa$ that can be represented as (10) with a bounded RF mapping, that is, $|\phi(x, \omega)| \le 1$ for any $x \in \mathcal{X}$, under Assumption 4 the following holds for any regularization parameter $\lambda > 0$:
$$\mathcal{E}(\hat{f}_{\alpha^\star}) - \mathcal{E}(f_{\mathcal{H}}) = \|\hat{f}_{\alpha^\star} - Pf_p\|_{p_X}^2 \le (R\lambda^r)^2.$$
In Lemma 5, $f_p$ is the ideal minimizer given prior knowledge of the marginal distribution $p_X$ of $x$, and $P$ is a projection operator on $f_p$ so that $Pf_p$ is the optimal minimizer in the RKHS. The parameter $r \in [1/2, 1)$ is equivalent to assuming that $f_{\mathcal{H}}$ exists, and $R$ can take the value of either $1$ or $\|\hat{f}_{\alpha^\star}\|_{p_X}$. Setting $r = 1/2$ and $R = 1$, we have
$$\mathcal{E}(\hat{f}_{\alpha^\star}) - \mathcal{E}(f_{\mathcal{H}}) \le \lambda. \qquad (48)$$
Combining (44)-(48) gives
$$\lim_{k\to\infty} \mathcal{E}(\hat{f}^k) - \mathcal{E}(f_{\mathcal{H}}) \le \lim_{k\to\infty}\left(\frac{C_1 + C_2}{\sqrt{T}} + C_3\|\Theta^k - \Theta^\star\|_2 + 3\lambda\right) = 3\lambda + O\Big(\frac{1}{\sqrt{T}}\Big),$$
which completes the proof.

References

Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pages 5973–5983, 2018.
Reza Arablouei, Stefan Werner, Kutluyıl Doğançay, and Yih-Fang Huang. Analysis of a reduced-communication diffusion LMS algorithm. Signal Processing, 117:355–361, 2015.
Arthur Asuncion and David Newman. UCI machine learning repository, 2007.
Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 253–262. JMLR.org, 2017.
Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(1):714–751, 2017.
Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
Eduard Gabriel Băzăvan, Fuxin Li, and Cristian Sminchisescu. Fourier kernel learning. In European Conference on Computer Vision, pages 459–473. Springer, 2012.
Salomon Bochner. Harmonic Analysis and the Theory of Probability. Courier Corporation, 2005.
Pantelis Bouboulis, Symeon Chouvardas, and Sergios Theodoridis. Online distributed learning over networks in RKH spaces using random Fourier features. IEEE Transactions on Signal Processing, 66(7):1920–1932, 2018.
Serhat Bucak, Rong Jin, and Anil K. Jain. Multi-label multiple kernel learning by stochastic approximation: Application to visual object recognition. In Advances in Neural Information Processing Systems, pages 325–333, 2010.
Luis M. Candanedo, Véronique Feldheim, and Dominique Deramaix. Data driven prediction models of energy use of appliances in a low-energy house. Energy and Buildings, 140:81–97, 2017.
Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
Tianyi Chen, Georgios Giannakis, Tao Sun, and Wotao Yin. LAG: Lazily aggregated gradient for communication-efficient distributed learning. In Advances in Neural Information Processing Systems, pages 5050–5060, 2018.
Symeon Chouvardas and Moez Draief. A diffusion kernel LMS algorithm for nonlinear adaptive networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4164–4168. IEEE, 2016.
Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pages 3041–3049, 2014.
Saverio De Vito, Ettore Massera, Marco Piga, Luca Martinotto, and Girolamo Di Francia. On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sensors and Actuators B: Chemical, 129(2):750–757, 2008.
Giacomo Vincenzo Demarie and Donato Sabia. A machine learning approach for the automatic long-term structural health monitoring. Structural Health Monitoring, 18(3):819–837, 2019.
Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.
Yaakov Engel, Shie Mannor, and Ron Meir. The kernel recursive least-squares algorithm. IEEE Transactions on Signal Processing, 52(8):2275–2285, 2004.
Francisco Facchinei, Gesualdo Scutari, and Simone Sagratella. Parallel selective algorithms for nonconvex big data optimization. IEEE Transactions on Signal Processing, 63(7):1874–1889, 2015.
Wei Gao, Jie Chen, Cédric Richard, and Jianguo Huang. Diffusion adaptation over networks with kernel least-mean-square. In 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 217–220. IEEE, 2015.
Ryan Gomes and Andreas Krause. Budgeted nonparametric learning from data streams. In International Conference on Machine Learning (ICML), 2010.
Bin Gu, Miao Xin, Zhouyuan Huo, and Heng Huang. Asynchronous doubly stochastic sparse kernel learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Ibrahim El Khalil Harrane, Rémi Flamary, and Cédric Richard. On reducing the communication cost of the diffusion LMS algorithm. IEEE Transactions on Signal and Information Processing over Networks, 5(1):100–112, 2018.
Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220, 2008.
Paul Honeine. Analyzing sparse dictionaries for online learning with kernels.
IEEE Transactions on Signal Processing, 63(23):6343–6353, 2015.
Muhammad U. Ilyas, M. Zubair Shafiq, Alex X. Liu, and Hayder Radha. A distributed algorithm for identifying information hubs in social networks. IEEE Journal on Selected Areas in Communications, 31(9):629–640, 2013.
Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 3068–3076, 2014.
Xinrong Ji, Cuiqin Hou, Yibin Hou, Fang Gao, and Shulong Wang. A distributed learning method for ℓ1-regularized kernel machine over wireless sensor networks. Sensors, 16(7):1021, 2016.
François Kawala, Ahlame Douzal-Chouakria, Eric Gaussier, and Eustache Dimert. Prédictions d'activité dans les réseaux sociaux en ligne. 2013.
Alec Koppel, Garrett Warnell, Ethan Stump, and Alejandro Ribeiro. Parsimonious online learning with kernels via sparse projections in function space. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4671–4675. IEEE, 2017.
Alec Koppel, Santiago Paternain, Cédric Richard, and Alejandro Ribeiro. Decentralized online learning with kernels. IEEE Transactions on Signal Processing, 66(12):3240–3255, 2018.
Trung Le, Vu Nguyen, Tu Dinh Nguyen, and Dinh Phung. Nonparametric budgeted stochastic gradient descent. In Artificial Intelligence and Statistics, pages 654–572, 2016.
Boyue Li, Shicong Cen, Yuxin Chen, and Yuejie Chi. Communication-efficient distributed optimization in networks with gradient tracking. arXiv preprint arXiv:1909.05844, 2019a.
Mu Li, David G. Andersen, Alexander J. Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19–27, 2014.
Weiyu Li, Yaohua Liu, Zhi Tian, and Qing Ling.
COLA: Communication-censored linearized ADMM for decentralized consensus optimization. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5237–5241. IEEE, 2019b.
Zhu Li, Jean-Francois Ton, Dino Oglic, and Dino Sejdinovic. Towards a unified analysis of random Fourier features. arXiv preprint arXiv:1806.09178, 2018.
Yaohua Liu, Wei Xu, Gang Wu, Zhi Tian, and Qing Ling. Communication-censored ADMM for decentralized consensus optimization. IEEE Transactions on Signal Processing, 67(10):2565–2579, 2019.
Jing Lu, Steven C. H. Hoi, Jialei Wang, Peilin Zhao, and Zhi-Yong Liu. Large scale online kernel learning. Journal of Machine Learning Research, 17(1):1613–1655, 2016.
H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
Rangeet Mitra and Vimal Bhatia. The diffusion-KLMS algorithm. In 2014 International Conference on Information Technology, pages 256–259. IEEE, 2014.
Joao F. C. Mota, Joao M. F. Xavier, Pedro M. Q. Aguiar, and Markus Püschel. D-ADMM: A communication-efficient distributed algorithm for separable optimization. IEEE Transactions on Signal Processing, 61(10):2718–2723, 2013.
Angelia Nedić, Alex Olshevsky, and César A. Uribe. A tutorial on distributed (non-Bayesian) learning: Problem, algorithms and results. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 6795–6801. IEEE, 2016.
Tu Dinh Nguyen, Trung Le, Hung Bui, and Dinh Phung. Large-scale online kernel learning with random feature reparameterization. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), pages 2543–2549. AAAI Press, 2017.
Fernando Pérez-Cruz and Olivier Bousquet. Kernel methods and their potential use in signal processing. IEEE Signal Processing Magazine, 21(3):57–65, 2004.
Joel B. Predd, Sanjeev R. Kulkarni, and H. Vincent Poor. Distributed kernel regression: An algorithm for training collaboratively. In 2006 IEEE Information Theory Workshop (ITW), pages 332–336. IEEE, 2006.
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
Sashank J. Reddi, Jakub Konečný, Peter Richtárik, Barnabás Póczos, and Alex Smola. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.
Cédric Richard, José Carlos M. Bermudez, and Paul Honeine. Online prediction of time series data with kernels. IEEE Transactions on Signal Processing, 57(3):1058–1067, 2008.
Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages 3215–3225, 2017.
Ali H. Sayed. Adaptation, learning, and optimization over networks. Foundations and Trends in Machine Learning, 7:311–801, 2014.
Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416–426. Springer, 2001.
Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning (ICML), pages 1000–1008, 2014.
John Shawe-Taylor, Nello Cristianini, et al. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
Fatemeh Sheikholeslami, Dimitris Berberidis, and Georgios B. Giannakis. Large-scale kernel-based feature extraction via low-rank subspace tracking on a budget. IEEE Transactions on Signal Processing, 66(8):1967–1981, 2018.
Yanning Shen, Tianyi Chen, and Georgios B. Giannakis. Online multi-kernel learning with orthogonal random features.
In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6289–6293, 2018.
Wei Shi, Qing Ling, Kun Yuan, Gang Wu, and Wotao Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761, 2014.
Ban-Sok Shin, Henning Paul, and Armin Dekorsy. Distributed kernel least squares for nonlinear regression applied to sensor networks. In 2016 24th European Signal Processing Conference (EUSIPCO), pages 1588–1592. IEEE, 2016.
Ban-Sok Shin, Masahiro Yukawa, Renato Luis Garrido Cavalcante, and Armin Dekorsy. Distributed adaptive learning with multiple kernels in diffusion networks. IEEE Transactions on Signal Processing, 66(21):5505–5519, 2018.
Bharath Sriperumbudur and Zoltán Szabó. Optimal rates for random Fourier features. In Advances in Neural Information Processing Systems, pages 1144–1152, 2015.
Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pages 4447–4458, 2018.
Dougal J. Sutherland and Jeff Schneider. On the error of random Fourier features. arXiv preprint arXiv:1506.02785, 2015.
Håkan Terelius, Ufuk Topcu, and Richard M. Murray. Decentralized multi-agent optimization via dual decomposition. IFAC Proceedings Volumes, 44(1):11245–11251, 2011.
Zhuang Wang, Koby Crammer, and Slobodan Vucetic. Breaking the curse of kernelization: Budgeted stochastic gradient descent for large-scale SVM training. Journal of Machine Learning Research, 13:3103–3131, 2012.
Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pages 1299–1309, 2018.
Keith Worden and Graeme Manson. The application of machine learning to structural health monitoring.
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 365(1851):515–537, 2006.
Ping Xu, Zhi Tian, Zhe Zhang, and Yue Wang. COKE: Communication-censored kernel learning via random features. In 2019 IEEE Data Science Workshop (DSW), pages 32–36. IEEE, 2019.
Wotao Yin, Xianghui Mao, Kun Yuan, Yuantao Gu, and Ali H. Sayed. A communication-efficient random-walk algorithm for decentralized optimization. arXiv preprint arXiv:1804.06568, 2018.
Felix Xinnan X. Yu, Ananda Theertha Suresh, Krzysztof M. Choromanski, Daniel N. Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems, pages 1975–1983, 2016.
Yue Yu, Jiaxiang Wu, and Longbo Huang. Double quantization for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pages 4440–4451, 2019.
Lijun Zhang, Jinfeng Yi, Rong Jin, Ming Lin, and Xiaofei He. Online kernel learning with a near optimal sparsity bound. In International Conference on Machine Learning (ICML), pages 621–629, 2013.
Mingrui Zhang, Lin Chen, Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Quantized Frank-Wolfe: Communication-efficient distributed optimization. arXiv preprint arXiv:1902.06332, 2019.
Shengyu Zhu, Mingyi Hong, and Biao Chen. Quantized consensus ADMM for multi-agent distributed optimization. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4134–4138. IEEE, 2016.