Journal of Machine Learning Research 22 (2021) 1-35. Submitted 1/20; Published 5/21.

COKE: Communication-Censored Decentralized Kernel Learning

Ping Xu (pxu3@gmu.edu), Yue Wang (ywang56@gmu.edu), Xiang Chen (xchen26@gmu.edu), Zhi Tian (ztian1@gmu.edu)
Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA 22030, USA

Editor: Corinna Cortes

Abstract

This paper studies the decentralized optimization and learning problem where multiple interconnected agents aim to learn an optimal decision function defined over a reproducing kernel Hilbert space by jointly minimizing a global objective function, with each agent having access only to its locally observed dataset. As a non-parametric approach, kernel learning faces a major challenge in distributed implementation: the decision variables of the local objective functions are data-dependent and thus cannot be optimized under the decentralized consensus framework without raw data exchange among agents. To circumvent this major challenge, we leverage the random feature (RF) approximation approach to enable consensus on the function modeled in the RF space by data-independent parameters across different agents. We then design an iterative algorithm, termed DKLA, for fast-convergent implementation via ADMM. Based on DKLA, we further develop a communication-censored kernel learning (COKE) algorithm that reduces the communication load of DKLA by preventing an agent from transmitting at every iteration unless its local update is deemed informative. Theoretical results in terms of linear convergence guarantees and generalization performance analysis of DKLA and COKE are provided. Comprehensive tests on both synthetic and real datasets are conducted to verify the communication efficiency and learning effectiveness of COKE.¹

Keywords: decentralized nonparametric learning, reproducing kernel Hilbert space, random features, ADMM, communication censoring
1. Introduction

Decentralized learning has attracted extensive interest in recent years, largely due to the explosion of data generated every day from mobile sensors, social media services, and other networked multi-agent applications (Worden and Manson, 2006; Ilyas et al., 2013; Facchinei et al., 2015; Demarie and Sabia, 2019). In many of these applications, the observed data are usually kept private at local sites without being aggregated to a fusion center, either due to the prohibitively high cost of raw data transmission or due to privacy concerns. Meanwhile, each agent in the network communicates only with its one-hop neighbors within its local area to save transmission power. Such localized data processing and transmission obviate the implementation of any centralized learning technique. Under this circumstance, this article focuses on the decentralized learning problem where a network of distributed agents aims to collaboratively learn a functional model describing the global data, with each agent having access only to its own locally observed dataset. To learn the functional model, which is often nonlinear and complex, nonparametric kernel methods are widely appreciated thanks to the kernel trick that makes some well-behaved linear learning algorithms applicable in a high-dimensional implicit feature space, without an explicit mapping from the data to that feature space (Shawe-Taylor et al., 2004; Hofmann et al., 2008; Pérez-Cruz and Bousquet, 2004).

¹ Preliminary results in this paper were presented in part at the 2019 IEEE Data Science Workshop (Xu et al., 2019).

©2021 Ping Xu, Yue Wang, Xiang Chen, and Zhi Tian. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-070.html.
However, in the absence of any raw data sharing or aggregation, it is challenging to directly apply kernel methods in a decentralized multi-agent setting and solve them under the consensus optimization framework using algorithms such as the decentralized alternating direction method of multipliers (ADMM) (Shi et al., 2014). This is because decentralized learning relies on solving local optimization problems and then aggregating the updates of the local decision variables over the network through one-hop communications in an iterative manner (Nedić et al., 2016). Unfortunately, the decision variables of the local objective functions resulting from the kernel trick are data-dependent and thus cannot be optimized in the absence of raw data exchange under the decentralized consensus framework. Several works apply kernel methods to decentralized learning for various applications under different settings (Predd et al., 2006; Mitra and Bhatia, 2014; Gao et al., 2015; Chouvardas and Draief, 2016; Shin et al., 2016, 2018; Koppel et al., 2018). These works, however, either assume that agents have access to their neighbors' observed raw data (Predd et al., 2006) or require agents to transmit their raw data to their neighbors (Koppel et al., 2018) to ensure consensus through collaborative learning. These assumptions may not be valid in many practical applications that involve users' private data. Moreover, standard kernel learning for big data faces the curse of dimensionality as the number of training examples increases (Shawe-Taylor et al., 2004). For example, in (Mitra and Bhatia, 2014; Chouvardas and Draief, 2016), the nonlinear function learned at each node is represented as a weighted combination of kernel functions centered on its locally observed data. As a result, each agent needs to transmit both the weights of the kernel functions and its local data to its neighbors at every iterative step to guarantee consensus on the common prediction function.
Thus, both the computation and communication resources are demanding in the distributed implementation. To alleviate the curse of dimensionality, Gao et al. (2015) and Koppel et al. (2018) have developed compression techniques such as data selection and sparse subspace projection, respectively, but these techniques typically incur considerable extra computation and still involve raw data exchange, with no alleviation of the data privacy concern. Furthermore, when computation is more affordable than communication in the big data scenario, the communication cost of iterative learning algorithms becomes the bottleneck for efficient distributed learning (McMahan et al., 2016). Therefore, it is crucial to design communication-efficient distributed kernel learning algorithms with data privacy protection.

1.1 Related work

This work lies at the intersection of non-parametric kernel methods, decentralized learning with batch-form data, and communication-efficient iterative implementation. Work related to these three subjects is reviewed below.

Centralized kernel learning. Centralized kernel methods assume data are collected and processed by a single server and are known to suffer from the curse of dimensionality in large-scale learning tasks. To mitigate their computational complexity, various dimensionality reduction techniques have been developed for both batch-form and online streaming learning, including stochastic approximation (Bucak et al., 2010; Gu et al., 2018), restricting the number of function parameters (Gomes and Krause, 2010; Wang et al., 2012; Zhang et al., 2013; Le et al., 2016; Koppel et al., 2017), and approximating the kernel during training (Honeine, 2015; Engel et al., 2004; Richard et al., 2008; Drineas and Mahoney, 2005; Dai et al., 2014; Lu et al., 2016; Sheikholeslami et al., 2018; Rahimi and Recht, 2008; Bâzâvan et al., 2012; Nguyen et al., 2017).
Among them, random feature (RF) mapping methods have gained popularity thanks to their ability to map large-scale data into an RF space of much reduced dimension by approximating the kernel with a fixed (small) number of random features, which circumvents the curse of dimensionality (Rahimi and Recht, 2008; Dai et al., 2014; Bâzâvan et al., 2012; Nguyen et al., 2017). Enforcing orthogonality on the random features can greatly reduce the kernel approximation error (Yu et al., 2016; Shen et al., 2018), and the learning performance of RF-based methods is evaluated in (Bach, 2017; Rudi and Rosasco, 2017; Li et al., 2018).

Decentralized kernel learning. In the decentralized kernel learning works most relevant to ours (Mitra and Bhatia, 2014; Gao et al., 2015; Chouvardas and Draief, 2016; Koppel et al., 2018), gradient descent is conducted locally at each agent to update its learning model, followed by diffusion-based information exchange among agents. However, these methods either assume that agents have access to their neighbors' observed raw data or require agents to transmit their raw data to their neighbors to ensure convergence of the prediction function. For the problem studied in this article, where the observed data are only locally available, these methods are not applicable since there are no common decision parameters on which to reach consensus without any raw data exchange. Moreover, these methods operate in the kernel space parameterized by the training data, and hence still encounter the curse of dimensionality when the local dataset grows large. Though data selection (Gao et al., 2015) and subspace projection (Koppel et al., 2018) are adopted to alleviate the curse of dimensionality, they typically require significant extra computational resources. RF mapping (Rahimi and Recht, 2008) offers a viable approach to overcome these issues by having all agents map their datasets of various sizes onto the same RF space. For instance, Bouboulis et al.
(2018) proposes a diffusion-based combine-then-adapt (CTA) method that achieves consensus on the model parameters in the RF space for the online learning problem, without the exchange of raw data. Though a batch-form counterpart of online CTA can be developed for offline learning, the convergence speed of the diffusion-based method is relatively slow compared with higher-order methods such as ADMM (Liu et al., 2019).

Communication-efficient optimization. Communication-efficient algorithms for decentralized optimization and learning problems have attracted attention as data movement among computing nodes becomes a bottleneck due to the high latency and limited bandwidth of decentralized networks. One way to reduce the communication cost is to transmit compressed information via quantization (Zhu et al., 2016; Alistarh et al., 2017; Zhang et al., 2019) or sparsification (Stich et al., 2018; Alistarh et al., 2018; Wangni et al., 2018; Harrane et al., 2018). However, these methods only reduce the required bandwidth at each communication round, not the number of rounds or the number of transmissions. Alternatively, some works randomly select a number of nodes for broadcasting/communication and perform asynchronous updates to reduce the number of transmissions per iteration (Mota et al., 2013; Li et al., 2014; Jaggi et al., 2014; Arablouei et al., 2015; McMahan et al., 2016; Yin et al., 2018; Yu et al., 2019). In contrast to random node selection, a more intuitive way is to evaluate the importance of a message in order to avoid unnecessary transmissions (Chen et al., 2018; Liu et al., 2019; Li et al., 2019b). This is usually implemented by adopting a censoring scheme that adaptively decides whether a message is informative enough to be transmitted during the iterative optimization process.
Other efforts to improve communication efficiency accelerate the convergence speed of the iterative algorithm implementation (Shamir et al., 2014; Reddi et al., 2016; Li et al., 2019a).

1.2 Contributions

This paper develops communication-efficient decentralized kernel learning algorithms under the consensus optimization framework, without any central coordination or raw data exchange among agents, for built-in privacy protection. Relative to prior art, our contributions are summarized as follows.

We first formulate the decentralized multi-agent kernel learning problem as a decentralized consensus optimization problem in the RF space. Since most machine learning scenarios can afford plentiful computational capability but limited communication resources, we solve this problem with ADMM, which has shown fast convergence at the expense of relatively high computation cost per iteration (Shi et al., 2014). To the best of our knowledge, this is the first work to solve decentralized kernel learning in the RF space by ADMM without any raw data exchange. The key to our proposed Decentralized Kernel Learning via ADMM (DKLA) algorithm is to apply RF mapping, which not only reduces the computational complexity but also enables consensus on a set of model parameters of fixed size in the RF space. In addition, since no raw data are exchanged among agents and the mapping from the original data space to the RF space is not a one-to-one mapping, data privacy is protected to a certain level.

To increase the communication efficiency, we further develop a COmmunication-censored KErnel learning (COKE) algorithm, which achieves the desired learning performance given limited communication resources and energy supply. Specifically, we devise a simple yet powerful censoring strategy that allows each agent to autonomously skip unnecessary communications when its local update is not informative enough for transmission, without the aid of a central coordinator.
In this way, the communication efficiency can be boosted at almost no sacrifice of the learning performance. When the censoring strategy is absent, COKE degenerates to DKLA.

In addition, we conduct theoretical analysis in terms of both functional convergence and generalization performance to provide guidelines for practical implementations of our proposed algorithms. We show that the functional individually learned at each agent through DKLA or COKE converges to the optimal one at a linear rate under mild conditions. For the generalization performance, we show that $O(\sqrt{T} \log d_K^\lambda)$ features are sufficient to ensure $O(1/\sqrt{T})$ learning risk for the decentralized kernel ridge regression problem, where $d_K^\lambda$ is the number of effective degrees of freedom, to be defined in Section 4.2, and $T$ is the total number of samples.

Finally, we test the performance of our proposed DKLA and COKE algorithms on both synthetic and real datasets. The results corroborate that both DKLA and COKE exhibit attractive learning performance and that COKE is highly communication-efficient.

1.3 Organization and notation of the paper

Organization. Section 2 formulates the problem of non-parametric learning and highlights the challenges in applying traditional kernel methods in the decentralized setting. Section 3 develops the decentralized kernel learning algorithms, including both DKLA and COKE. Section 4 presents the theoretical results and Section 5 reports the numerical tests using both synthetic and real datasets. Concluding remarks are provided in Section 6.

Notation. $\mathbb{R}$ denotes the set of real numbers. $\|\cdot\|_2$ denotes the Euclidean norm of vectors and $\|\cdot\|_F$ denotes the Frobenius norm of matrices. $|\cdot|$ denotes the cardinality of a set. $\mathbf{A}$, $\mathbf{a}$, and $a$ denote a matrix, a vector, and a scalar, respectively.
2. Problem Statement

This section reviews basics of kernel-based learning and decentralized optimization, introduces notation, and provides background needed for our novel DKLA and COKE schemes. Consider a network of $N$ agents interconnected over a fixed topology $\mathcal{G} = (\mathcal{N}, \mathcal{C}, \mathbf{A})$, where $\mathcal{N} = \{1, 2, \ldots, N\}$, $\mathcal{C} \subseteq \mathcal{N} \times \mathcal{N}$, and $\mathbf{A} \in \mathbb{R}^{N \times N}$ denote the agent set, the edge set, and the adjacency matrix, respectively. The elements of $\mathbf{A}$ are $a_{in} = a_{ni} = 1$ when the unordered pair of distinct agents $(i, n) \in \mathcal{C}$, and $a_{in} = a_{ni} = 0$ otherwise. For agent $i$, its one-hop neighbors are in the set $\mathcal{N}_i = \{n \mid (n, i) \in \mathcal{C}\}$. The term agent used here can be a single computational system (e.g., a smart phone, a database) or a collection of co-located computational systems (e.g., data centers, computer clusters). Each agent only has access to its locally observed data composed of independently and identically distributed (i.i.d.) input-label pairs $\{x_{i,t}, y_{i,t}\}_{t=1}^{T_i}$ obeying an unknown probability distribution $p$ on $\mathcal{X} \times \mathcal{Y}$, with $x_{i,t} \in \mathbb{R}^d$ and $y_{i,t} \in \mathbb{R}$. The kernel learning task is to find a prediction function $f$ that best describes the ensemble of all data from all agents. Suppose that $f$ belongs to the reproducing kernel Hilbert space (RKHS) $\mathcal{H} := \{f \mid f(x) = \sum_{t} \alpha_t \kappa(x, x_t)\}$ induced by a positive semidefinite kernel $\kappa(x, x_t): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ that measures the similarity between $x$ and $x_t$, for all $x, x_t \in \mathcal{X}$. In a decentralized setting with privacy concerns, this means that each agent has to be able to learn the global function $f \in \mathcal{H}$ such that $y_{i,t} = f(x_{i,t}) + e_{i,t}$ for $\{\{x_{i,t}, y_{i,t}\}_{t=1}^{T_i}\}_{i=1}^N$, without exchange of any raw data and in the absence of a fusion center, where the error terms $e_{i,t}$ are minimized according to a certain optimality metric. To evaluate the learning performance, a nonnegative loss function $\ell(y, \hat{y})$ is utilized to measure the difference between the true label value $y$ and the predicted value $\hat{y} = f(x)$.
Some common loss functions include the quadratic loss $\ell(y, \hat{y}) = (y - \hat{y})^2$ for regression tasks, and the hinge loss $\ell(y, \hat{y}) = \max(0, 1 - y\hat{y})$ and the logistic loss $\ell(y, \hat{y}) = \log(1 + e^{-y\hat{y}})$ for binary classification tasks. The above loss functions are all convex with respect to $\hat{y}$. The learning problem is then to minimize the expected risk of the prediction function:

$$R(f) := \int_{\mathcal{X} \times \mathcal{Y}} \ell(f(x), y) \, dp(x, y), \qquad (1)$$

which indicates the generalization ability of $f$ to new data. However, the distribution $p$ is unknown in most learning tasks, so minimizing $R(f)$ directly is not possible. Instead, given the finite number of training examples, the problem turns to minimizing the empirical risk:

$$\min_{f \in \mathcal{H}} \hat{R}(f) := \sum_{i=1}^{N} \hat{R}_i(f), \qquad (2)$$

where $\hat{R}_i(f)$ is the local empirical risk for agent $i$ given by

$$\hat{R}_i(f) = \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(f(x_{i,t}), y_{i,t}) + \lambda_i \|f\|_{\mathcal{H}}^2, \qquad (3)$$

with $\|\cdot\|_{\mathcal{H}}$ being the norm associated with $\mathcal{H}$, and $\lambda_i > 0$ being a regularization parameter that controls over-fitting. The representer theorem states that the minimizer of a regularized empirical risk functional defined over an RKHS can be represented as a finite linear combination of kernel functions evaluated on the data pairs from the training dataset (Schölkopf et al., 2001). If $\{\{x_{i,t}, y_{i,t}\}_{t=1}^{T_i}\}_{i=1}^N$ are centrally available at a fusion center, the minimizer of (2) admits

$$f^\star(x) = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \alpha_{i,t} \kappa(x, x_{i,t}) := \boldsymbol{\alpha}^\top \boldsymbol{\kappa}(x), \qquad (4)$$

where $\boldsymbol{\alpha} = [\alpha_{1,1}, \ldots, \alpha_{N,T_N}]^\top \in \mathbb{R}^T$ is the coefficient vector to be learned, $T = \sum_{i=1}^{N} T_i$ is the total number of samples, and $\boldsymbol{\kappa}(x) = [\kappa(x, x_{1,1}), \ldots, \kappa(x, x_{N,T_N})]^\top \in \mathbb{R}^T$ is the kernel function parameterized by the global data $X_T := \{\{x_{i,t}\}_{t=1}^{T_i}\}_{i=1}^N$ from all agents, for any $x$. In an RKHS, since $\langle \kappa(x_t, \cdot), \kappa(x_\tau, \cdot) \rangle_{\mathcal{H}} = \kappa(x_t, x_\tau)$, it follows that $\|f\|_{\mathcal{H}}^2 = \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}$, where $\mathbf{K}$ is the $T \times T$ kernel matrix that measures the similarity between any two data points in $X_T$. In this way, the local empirical risk (3) can be reformulated as a function of $\boldsymbol{\alpha}$:

$$\hat{R}_i(\boldsymbol{\alpha}) := \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(f^\star(x_{i,t}), y_{i,t}) + \lambda_i \|f^\star\|_{\mathcal{H}}^2 = \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(\boldsymbol{\alpha}^\top \boldsymbol{\kappa}(x_{i,t}), y_{i,t}) + \lambda_i \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha}. \qquad (5)$$
Accordingly, (2) becomes

$$\min_{\boldsymbol{\alpha} \in \mathbb{R}^T} \sum_{i=1}^{N} \hat{R}_i(\boldsymbol{\alpha}). \qquad (6)$$

Relating the decentralized kernel learning problem to the decentralized consensus optimization problem, solving (6) is equivalent to solving

$$\min_{\{\boldsymbol{\alpha}_i \in \mathbb{R}^T\}_{i=1}^N} \sum_{i=1}^{N} \hat{R}_i(\boldsymbol{\alpha}_i) \quad \text{s.t. } \boldsymbol{\alpha}_i = \boldsymbol{\alpha}_n, \; \forall i, \, n \in \mathcal{N}_i, \qquad (7)$$

where $\boldsymbol{\alpha}_i$ and $\boldsymbol{\alpha}_n$ are the local copies of the global decision variable $\boldsymbol{\alpha}$ at agents $i$ and $n$, respectively. The problem can then be solved by ADMM (Shi et al., 2014) or other primal-dual methods (Terelius et al., 2011). However, it is worth noting that (7) reveals a subtle yet profound difference from a general optimization problem for parametric learning. That is, each local function $\hat{R}_i$ depends not only on the global decision variable $\boldsymbol{\alpha}$, but also on the global data $X_T$ because of the kernel terms $\boldsymbol{\kappa}(x_{i,t})$ and $\mathbf{K}$. As a result, solving the local objective for agent $i$ requires raw data from all other agents to obtain $\boldsymbol{\kappa}(x_{i,t})$ and $\mathbf{K}$, which contradicts the situation that private raw data are only locally available. Moreover, notice that $\boldsymbol{\alpha}_i$ is of the same size $T$ as the ensemble dataset, which incurs the curse of dimensionality and insurmountable computational cost when $T$ becomes large, even when the obstacle of making all the data available to all agents is not of concern. To resolve this issue, an alternative formulation is to associate a local prediction model $f_i \in \mathcal{H}$ with each agent $i$, with $f_i^\star = \sum_{t=1}^{T_i} \tilde{\alpha}_{i,t} \kappa(x, x_{i,t}) = \tilde{\boldsymbol{\alpha}}_i^\top \boldsymbol{\kappa}_i(x)$ being the local optimal solution that only involves the local data $\{x_{i,t}\}_{t=1}^{T_i}$ (Ji et al., 2016). Specifically, $\tilde{\boldsymbol{\alpha}}_i = [\tilde{\alpha}_{i,1}, \ldots, \tilde{\alpha}_{i,T_i}]^\top \in \mathbb{R}^{T_i}$, and $\boldsymbol{\kappa}_i(x) = [\kappa(x, x_{i,1}), \ldots, \kappa(x, x_{i,T_i})]^\top \in \mathbb{R}^{T_i}$ is parameterized by the local data $\{x_{i,t}\}_{t=1}^{T_i}$ only. In this way, the local cost function becomes

$$\hat{R}_i(\tilde{\boldsymbol{\alpha}}_i) := \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(f_i^\star(x_{i,t}), y_{i,t}) + \lambda_i \|f_i^\star\|_{\mathcal{H}}^2 = \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(\tilde{\boldsymbol{\alpha}}_i^\top \boldsymbol{\kappa}_i(x_{i,t}), y_{i,t}) + \lambda_i \tilde{\boldsymbol{\alpha}}_i^\top \mathbf{K}_i \tilde{\boldsymbol{\alpha}}_i, \qquad (8)$$

where $\mathbf{K}_i$ is of size $T_i \times T_i$ and depends on local data only. With (8), the optimization problem (7) is then modified to

$$\min_{\{\tilde{\boldsymbol{\alpha}}_i \in \mathbb{R}^{T_i}\}_{i=1}^N} \sum_{i=1}^{N} \hat{R}_i(\tilde{\boldsymbol{\alpha}}_i) \quad \text{s.t.}$$
$$f_n(x_{i,t}) = f_i(x_{i,t}), \quad \forall i, \, n \in \mathcal{N}_i, \; t = 1, \ldots, T_i, \qquad (9)$$

and can be solved distributedly by ADMM. Note that the consensus constraint is on the learned prediction values $f_i(x)$, not on the parameters $\tilde{\boldsymbol{\alpha}}_i$. This is because the $\tilde{\boldsymbol{\alpha}}_i$ are data-dependent and may have different sizes at different agents (the dimension of $\tilde{\boldsymbol{\alpha}}_i$ equals the number of training samples at agent $i$), and hence cannot be directly optimized through consensus. Still, this method has four drawbacks. First, it is necessary to associate a local learning model $f_i$ with each agent $i$ for the decentralized implementation. However, the local learning model $f_i$ and the global optimal model $f$ in (2) may not be the same because different local training data are used. Therefore, the optimization problem (9) is only an approximation of (2). Even with the equality constraint to minimize the gap between the decentralized learning output and the optimal centralized one, the approximation performance is not guaranteed. Second, the functional consensus constraint still requires raw data exchange among agents in order for agent $n \in \mathcal{N}_i$ to be able to compute the values $f_n(x_{i,t})$ from agent $i$'s data $x_{i,t}$, for $i \neq n$. Apparently, this violates the privacy-protection requirement of practical applications. Third, when $T_i$ is large, both the storage and computational costs are high for each agent due to the curse of dimensionality at the local sites. Lastly, the frequent local communication is resource-consuming under communication constraints. To circumvent all these obstacles, the goal of this paper is to develop efficient decentralized algorithms that protect privacy and conserve communication resources.

3. Algorithm Development

In this section, we leverage the RF approximation and ADMM to develop our algorithms. We first introduce the RF mapping method. Then, we devise the DKLA algorithm that globally optimizes a shared learning model for the multi-agent system.
Finally, we take into consideration the limited communication resources in large-scale decentralized networks and develop the COKE algorithm. Both DKLA and COKE are computationally efficient and protect data privacy at the same time. Further, COKE is communication-efficient.

3.1 RF-based kernel learning

As stated in previous sections, standard kernel methods incur the curse of dimensionality when the data size grows large. To make kernel methods scalable to large datasets, RF mapping is adopted for approximation by exploiting the shift-invariance property of kernel functions (Rahimi and Recht, 2008). For a shift-invariant kernel that satisfies $\kappa(x_t, x_\tau) = \kappa(x_t - x_\tau), \, \forall t, \tau$, if $\kappa(x_t - x_\tau)$ is absolutely integrable, then its Fourier transform $p_\kappa(\omega)$ is guaranteed to be nonnegative ($p_\kappa(\omega) \geq 0$), and hence can be viewed as a probability density function (pdf) when $\kappa$ is scaled to satisfy $\kappa(0) = 1$ (Bochner, 2005). Therefore, we have

$$\kappa(x_t, x_\tau) = \int p_\kappa(\omega) e^{j\omega^\top (x_t - x_\tau)} d\omega := \mathbb{E}_\omega[e^{j\omega^\top (x_t - x_\tau)}] = \mathbb{E}_\omega[\phi(x_t, \omega)\phi^*(x_\tau, \omega)], \qquad (10)$$

where $\mathbb{E}$ denotes the expectation operator, $\phi(x, \omega) := e^{j\omega^\top x}$ with $\omega \in \mathbb{R}^d$, and $*$ is the complex conjugate operator. In (10), the first equality results from the Fourier inversion theorem, and the second equality arises by viewing $p_\kappa(\omega)$ as the pdf of $\omega$. In this paper, we adopt a Gaussian kernel $\kappa(x_t, x_\tau) = \exp(-\|x_t - x_\tau\|_2^2 / (2\sigma^2))$, whose associated pdf is the normal distribution $p_\kappa(\omega) = \mathcal{N}(\mathbf{0}, \sigma^{-2}\mathbf{I})$. The main idea of RF mapping is to randomly generate $\{\omega_l\}_{l=1}^L$ from the distribution $p_\kappa(\omega)$ and approximate the kernel function $\kappa(x_t, x_\tau)$ by the sample average

$$\hat{\kappa}_L(x_t, x_\tau) := \frac{1}{L} \sum_{l=1}^{L} \phi(x_t, \omega_l)\phi^*(x_\tau, \omega_l) := \boldsymbol{\phi}_L^{\mathsf{H}}(x_\tau)\boldsymbol{\phi}_L(x_t), \qquad (11)$$

where $\boldsymbol{\phi}_L(x) := \sqrt{\tfrac{1}{L}}\,[\phi(x, \omega_1), \ldots, \phi(x, \omega_L)]^\top$ and $\mathsf{H}$ is the conjugate transpose operator.
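As a numerical sanity check of (10)-(11), the following sketch draws random features for the Gaussian kernel (with the illustrative choice $\sigma = 1$; all variable names are ours, not the paper's) and compares the Monte Carlo estimate $\hat{\kappa}_L$ against the exact kernel value:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, sigma = 3, 10_000, 1.0

# For kappa(x_t, x_tau) = exp(-||x_t - x_tau||^2 / (2 sigma^2)),
# the pdf in (10) is p_kappa(omega) = N(0, sigma^{-2} I).
omega = rng.normal(scale=1.0 / sigma, size=(L, d))

def phi_L(x):
    # phi_L(x) = (1/sqrt(L)) [e^{j omega_1^T x}, ..., e^{j omega_L^T x}]^T
    return np.exp(1j * omega @ x) / np.sqrt(L)

x_t, x_tau = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x_t - x_tau) ** 2) / (2 * sigma ** 2))
# hat{kappa}_L(x_t, x_tau) = phi_L(x_tau)^H phi_L(x_t) as in (11);
# its real part concentrates around the exact kernel as L grows.
approx = np.vdot(phi_L(x_tau), phi_L(x_t)).real
```

With $L = 10{,}000$ features the approximation error is on the order of $1/\sqrt{L}$, i.e., about $0.01$ here.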
The following real-valued mappings can be adopted to approximate $\kappa(x_t, x_\tau)$, both satisfying the condition $\mathbb{E}_\omega[\phi_r(x_t, \omega)^\top \phi_r(x_\tau, \omega)] = \kappa(x_t, x_\tau)$ (Rahimi and Recht, 2008):

$$\phi_r(x, \omega) = [\cos(\omega^\top x), \sin(\omega^\top x)]^\top, \qquad (12)$$

$$\phi_r(x, \omega) = \sqrt{2}\cos(\omega^\top x + b), \qquad (13)$$

where $b$ is drawn uniformly from $[0, 2\pi]$. With the real-valued RF mapping, the minimizer of (2) then admits the following form:

$$\hat{f}(x) = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \alpha_{i,t} \boldsymbol{\phi}_L^\top(x_{i,t})\boldsymbol{\phi}_L(x) = \boldsymbol{\theta}^\top \boldsymbol{\phi}_L(x), \qquad (14)$$

where $\boldsymbol{\theta} := \sum_{i=1}^{N} \sum_{t=1}^{T_i} \alpha_{i,t} \boldsymbol{\phi}_L(x_{i,t})$ denotes the new decision vector to be learned in the RF space and $\boldsymbol{\phi}_L(x) = \sqrt{\tfrac{1}{L}}\,[\phi_r(x, \omega_1), \ldots, \phi_r(x, \omega_L)]^\top$. If (12) is adopted, then $\boldsymbol{\phi}_L(x)$ and $\boldsymbol{\theta}$ are of size $2L$; if (13) is adopted, they are of size $L$. In either case, the size of $\boldsymbol{\theta}$ is fixed and does not increase with the number of data samples.

3.2 DKLA: Decentralized kernel learning via ADMM

Consider the decentralized kernel learning problem described in Section 2 and adopt the RF mapping described in Section 3.1. Let all agents in the network share the same set of random features $\{\omega_l\}_{l=1}^L$. Plugging (14) into the local cost function $\hat{R}_i(f)$ in (3) gives

$$\hat{R}_i(\boldsymbol{\theta}) := \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(\hat{f}(x_{i,t}), y_{i,t}) + \lambda_i \|\hat{f}\|_{\mathcal{H}}^2 = \frac{1}{T_i} \sum_{t=1}^{T_i} \ell(\boldsymbol{\theta}^\top \boldsymbol{\phi}_L(x_{i,t}), y_{i,t}) + \lambda_i \|\boldsymbol{\theta}\|_2^2. \qquad (15)$$

In (15), we have

$$\|\boldsymbol{\theta}\|_2^2 = \Big(\sum_{i=1}^{N}\sum_{t=1}^{T_i} \alpha_{i,t}\boldsymbol{\phi}_L^\top(x_{i,t})\Big)\Big(\sum_{n=1}^{N}\sum_{\tau=1}^{T_n} \alpha_{n,\tau}\boldsymbol{\phi}_L(x_{n,\tau})\Big) = \sum_{i,n=1}^{N}\sum_{t=1}^{T_i}\sum_{\tau=1}^{T_n} \alpha_{i,t}\alpha_{n,\tau}\,\kappa(x_{i,t}, x_{n,\tau}) := \|\hat{f}\|_{\mathcal{H}}^2.$$

Therefore, with RF mapping, the centralized benchmark (2) becomes

$$\min_{\boldsymbol{\theta} \in \mathbb{R}^L} \sum_{i=1}^{N} \hat{R}_i(\boldsymbol{\theta}). \qquad (16)$$

Here, for notational simplicity, we denote the size of $\boldsymbol{\theta}$ by $L \times 1$, which corresponds to adopting the real-valued mapping in (13). Adopting an alternative mapping such as (12) only changes the size of $\boldsymbol{\theta}$; the algorithm development is the same. RF mapping is essential because it results in a common optimization parameter $\boldsymbol{\theta}$ of fixed size for all agents. To solve (16) in a decentralized manner, we associate a model parameter $\boldsymbol{\theta}_i$ with each agent $i$ and enforce the consensus constraint on neighboring agents $i$ and $n$ using an auxiliary variable $\boldsymbol{\vartheta}_{in}$.
Specifically, the RF-based decentralized kernel learning problem is formulated to jointly minimize the following objective function:

$$\min_{\{\boldsymbol{\theta}_i \in \mathbb{R}^L\}, \{\boldsymbol{\vartheta}_{in} \in \mathbb{R}^L\}} \sum_{i=1}^{N} \hat{R}_i(\boldsymbol{\theta}_i) \quad \text{s.t. } \boldsymbol{\theta}_i = \boldsymbol{\vartheta}_{in}, \; \boldsymbol{\theta}_n = \boldsymbol{\vartheta}_{in}, \; \forall (i, n) \in \mathcal{C}. \qquad (17)$$

Note that the new decision variables $\boldsymbol{\theta}_i$ to be optimized are local copies of the global optimization parameter $\boldsymbol{\theta}$ and are of the same size for all agents. On the contrary, the decision variables $\tilde{\boldsymbol{\alpha}}_i$ in (9) are data-dependent and may have different sizes. In addition, the size of $\boldsymbol{\theta}$ is $L$, which can be much smaller than that of $\boldsymbol{\alpha}$ (whose size equals $T$) in (6). For big data scenarios where $L \ll T$, RF mapping greatly reduces the computational complexity. Moreover, as shown in the following, the updating of $\boldsymbol{\theta}$ does not involve any raw data exchange, and the RF mapping from $x$ to $\boldsymbol{\phi}_L(x)$ is not a one-to-one mapping, which provides raw data privacy protection. Further, it is easy to set the regularization parameters $\lambda_i$ to control over-fitting. Specifically, since the parameters $\boldsymbol{\theta}_i$ are of the same length among agents, we can set them to be $\lambda_i = \frac{1}{N}\lambda, \, \forall i$, where $\lambda$ is the corresponding over-fitting control parameter assuming all data are collected at a center. In contrast, the regularization parameters $\lambda_i$ in (5) depend on local data and need to satisfy $\lambda = \sum_{i=1}^{N} \lambda_i$, which is relatively difficult to tune in a large-scale network. In the constraints of (17), the $\boldsymbol{\theta}_i$ are separable when the $\boldsymbol{\vartheta}_{in}$ are fixed, and vice versa. Therefore, (17) can be solved by ADMM. Following Shi et al. (2014), we develop the DKLA algorithm where each agent updates its local primal variable $\boldsymbol{\theta}_i$ and local dual variable $\boldsymbol{\gamma}_i$ by

$$\boldsymbol{\theta}_i^k := \arg\min_{\boldsymbol{\theta}_i} \; \hat{R}_i(\boldsymbol{\theta}_i) + \rho|\mathcal{N}_i| \|\boldsymbol{\theta}_i\|_2^2 + \boldsymbol{\theta}_i^\top \Big(\boldsymbol{\gamma}_i^{k-1} - \rho \sum_{n \in \mathcal{N}_i} \big(\boldsymbol{\theta}_i^{k-1} + \boldsymbol{\theta}_n^{k-1}\big)\Big), \qquad (18a)$$

$$\boldsymbol{\gamma}_i^k = \boldsymbol{\gamma}_i^{k-1} + \rho \sum_{n \in \mathcal{N}_i} \big(\boldsymbol{\theta}_i^k - \boldsymbol{\theta}_n^k\big), \qquad (18b)$$

where $|\mathcal{N}_i|$ is the cardinality of $\mathcal{N}_i$. The auxiliary variables $\boldsymbol{\vartheta}_{in}$ can be written as functions of $\boldsymbol{\theta}_i$ and then canceled out; interested readers are referred to Shi et al. (2014) for the detailed derivation. The learning algorithm DKLA is outlined in Algorithm 1.
Algorithm 1 DKLA, run at agent $i$
Require: kernel $\kappa$, the number of random features $L$, and $\lambda$ to control over-fitting; initialize local variables to $\boldsymbol{\theta}_i^0 = \mathbf{0}$, $\boldsymbol{\gamma}_i^0 = \mathbf{0}$; set step size $\rho > 0$.
1: Draw $L$ i.i.d. samples $\{\omega_l\}_{l=1}^L$ from $p_\kappa(\omega)$ according to a common random seed.
2: Construct $\{\boldsymbol{\phi}_L(x_{i,t})\}_{t=1}^{T_i}$ using the random features $\{\omega_l\}_{l=1}^L$ via (12) or (13).
3: for iterations $k = 1, 2, \ldots$ do
4: Update the local primal variable $\boldsymbol{\theta}_i^k$ by (18a);
5: Transmit $\boldsymbol{\theta}_i^k$ to all neighbors $n \in \mathcal{N}_i$ and receive $\boldsymbol{\theta}_n^k$ from all neighbors;
6: Update the local dual variable $\boldsymbol{\gamma}_i^k$ by (18b).

Note that the random features need to be common to all agents; hence, in step 1, we restrict them to be drawn according to a common random seed. Algorithm 1 is fully decentralized since the updates of $\boldsymbol{\theta}_i$ and $\boldsymbol{\gamma}_i$ depend only on local and neighboring information.

3.3 COKE: Communication-censored decentralized kernel learning

From Sections 3.1 and 3.2, we can see that decentralized kernel learning in the RF space under the consensus optimization framework has much reduced computational complexity, thanks to the RF mapping technique that transforms the learning model into a smaller RF space. In this subsection, we consider the case where the communication resources are limited, and we aim to further reduce the communication cost of DKLA. To start, we notice that in Algorithm 1, each agent $i$ ($i \in \mathcal{N}$) maintains $2 + |\mathcal{N}_i|$ local variables at iteration $k$, i.e., its local primal variable $\boldsymbol{\theta}_i^k$, its local dual variable $\boldsymbol{\gamma}_i^k$, and the $|\mathcal{N}_i|$ state variables $\boldsymbol{\theta}_n^k$ received from its neighbors. While the dual variable $\boldsymbol{\gamma}_i^k$ is kept locally at agent $i$, the transmission of its updated local variable $\boldsymbol{\theta}_i^k$ to its one-hop neighbors happens at every iteration, which consumes a large amount of communication bandwidth and energy over the iterations in large-scale networks.
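For concreteness, the DKLA iteration admits a closed-form primal update for the quadratic loss. The sketch below is illustrative only (the network, data, and parameter choices are hypothetical, and the feature matrices stand in for the RF-mapped data $\{\boldsymbol{\phi}_L(x_{i,t})\}$), not the paper's reference implementation:

```python
import numpy as np

def dkla_quadratic(Phi, y, neighbors, rho=1.0, lam=1e-3, iters=300):
    """DKLA sketch for the quadratic loss. Agent i holds RF features
    Phi[i] (T_i x L) and labels y[i]; neighbors[i] is its one-hop set.
    Local risk: (1/T_i)||Phi_i theta - y_i||^2 + (lam/N)||theta||^2."""
    N, L = len(Phi), Phi[0].shape[1]
    lam_i = lam / N                       # lambda_i = lambda / N
    theta = np.zeros((N, L))
    gamma = np.zeros((N, L))
    for _ in range(iters):
        theta_prev = theta.copy()
        for i in range(N):
            Ti, deg = len(y[i]), len(neighbors[i])
            s = sum(theta_prev[i] + theta_prev[n] for n in neighbors[i])
            # Closed-form solution of the primal update (18a): setting the
            # gradient to zero gives a regularized least-squares system.
            A = (2.0 / Ti) * Phi[i].T @ Phi[i] \
                + 2.0 * (lam_i + rho * deg) * np.eye(L)
            rhs = (2.0 / Ti) * Phi[i].T @ y[i] - gamma[i] + rho * s
            theta[i] = np.linalg.solve(A, rhs)
        for i in range(N):
            # Dual update (18b).
            gamma[i] += rho * sum(theta[i] - theta[n] for n in neighbors[i])
    return theta
```

On a small fully connected network with data generated from a common parameter vector, the local iterates reach consensus near that vector, matching the linear convergence behavior analyzed in Section 4.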
In order to improve the communication efficiency, we develop the COKE algorithm by employing a censoring function at each agent to decide whether a local update is informative enough to be transmitted. To evaluate the importance of a local update at iteration $k$ for agent $i$ ($i \in \mathcal{N}$), we introduce a new state variable $\hat{\boldsymbol{\theta}}_i^{k-1}$ to record agent $i$'s latest broadcast primal variable up to time $k-1$. Then, at iteration $k$, we define the difference between agent $i$'s current state $\boldsymbol{\theta}_i^k$ and its previously transmitted state $\hat{\boldsymbol{\theta}}_i^{k-1}$ as

$$\boldsymbol{\xi}_i^k = \hat{\boldsymbol{\theta}}_i^{k-1} - \boldsymbol{\theta}_i^k, \qquad (19)$$

and choose a censoring function as

$$H_i(k, \boldsymbol{\xi}_i^k) = \|\boldsymbol{\xi}_i^k\|_2 - h_i(k), \qquad (20)$$

where $\{h_i(k)\}$ is a non-increasing non-negative sequence. A typical choice for the censoring function is $H_i(k, \boldsymbol{\xi}_i^k) = \|\boldsymbol{\xi}_i^k\|_2 - v\mu^k$, where $\mu \in (0, 1)$ and $v > 0$ are constants. When $H_i(k, \boldsymbol{\xi}_i^k) < 0$, $\boldsymbol{\theta}_i^k$ is deemed not informative enough; hence it is censored and not transmitted to the neighbors. When executing the COKE algorithm, each agent $i$ maintains $3 + |\mathcal{N}_i|$ local variables at each iteration $k$. Compared with the DKLA update in (18), the additional local variable is the state variable $\hat{\boldsymbol{\theta}}_i^k$ that records its latest broadcast primal variable up to time $k$. Moreover, the $|\mathcal{N}_i|$ state variables from its neighbors are $\hat{\boldsymbol{\theta}}_n^k$, which record the latest received primal variables from its neighbors, instead of the timely updated and broadcast variables $\boldsymbol{\theta}_n^k$ of its neighbors $n \in \mathcal{N}_i$. Though each agent still computes local updates at every step, its transmission to neighbors does not always occur, but is determined by the censoring criterion (20). To be specific, at each iteration $k$, if $H_i(k, \boldsymbol{\xi}_i^k) \geq 0$, then $\hat{\boldsymbol{\theta}}_i^k = \boldsymbol{\theta}_i^k$, and agent $i$ transmits its local primal variable $\boldsymbol{\theta}_i^k$ to its neighbors. Otherwise, $\hat{\boldsymbol{\theta}}_i^k = \hat{\boldsymbol{\theta}}_i^{k-1}$ and no information is transmitted. If agent $i$ receives $\boldsymbol{\theta}_n^k$ from any neighbor $n$, then that neighbor's state variable kept by agent $i$ becomes $\hat{\boldsymbol{\theta}}_n^k = \boldsymbol{\theta}_n^k$; otherwise, $\hat{\boldsymbol{\theta}}_n^k = \hat{\boldsymbol{\theta}}_n^{k-1}$.
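The censoring rule (19)-(20) amounts to a few lines per agent per iteration. The sketch below uses the typical threshold choice $h_i(k) = v\mu^k$ mentioned above (the function name and default constants are ours):

```python
import numpy as np

def censor_step(theta_k, theta_hat_prev, k, v=1.0, mu=0.95):
    """Apply the censoring rule (20) with h_i(k) = v * mu^k.
    Returns (transmit, theta_hat): transmit only when the innovation
    xi = hat{theta}^{k-1} - theta^k of (19) satisfies ||xi||_2 >= h_i(k)."""
    xi = theta_hat_prev - theta_k
    transmit = np.linalg.norm(xi) >= v * mu ** k
    # If transmitted, the broadcast state is refreshed; otherwise kept.
    theta_hat = theta_k if transmit else theta_hat_prev
    return transmit, theta_hat
```

Early on, the decaying threshold is loose relative to large iterate changes, so most updates are transmitted; as the iterates stabilize, small innovations fall below $v\mu^k$ far less often than the threshold decays, and transmissions become sparse.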
Algorithm 2 COKE Run at Agent $i$
Require: Kernel $\kappa$, the number of random features $L$, the censoring thresholds $\{h_i(k)\}$, and $\lambda$ to control over-fitting; initialize local variables to $\theta_i^0 = 0$, $\hat\theta_i^0 = 0$, $\gamma_i^0 = 0$; set step size $\rho > 0$.
1: Draw $L$ i.i.d. samples $\{\omega_l\}_{l=1}^L$ from $p_\kappa(\omega)$ according to a common random seed.
2: Construct $\{\phi_L(x_{i,t})\}_{t=1}^{T_i}$ using the random features $\{\omega_l\}_{l=1}^L$ via (12) or (13).
3: for iterations $k = 1, 2, \dots$ do
4:   Update local variable $\theta_i^k$ by (21a);
5:   Compute $\xi_i^k = \hat\theta_i^{k-1} - \theta_i^k$;
6:   If $H_i(k, \xi_i^k) = \|\xi_i^k\|_2^2 - h_i(k) \ge 0$, transmit $\theta_i^k$ to neighbors and let $\hat\theta_i^k = \theta_i^k$; else do not transmit and let $\hat\theta_i^k = \hat\theta_i^{k-1}$;
7:   If $\theta_n^k$ is received from neighbor $n$, let $\hat\theta_n^k = \theta_n^k$; else let $\hat\theta_n^k = \hat\theta_n^{k-1}$;
8:   Update local dual variable $\gamma_i^k$ by (21b).

Consequently, agent $i$'s local parameters are updated as follows:
$$\theta_i^k := \arg\min_{\theta_i}\ \hat R_i(\theta_i) + \rho|\mathcal{N}_i|\,\|\theta_i\|_2^2 + \theta_i^\top\Big(\gamma_i^{k-1} - \rho\sum_{n\in\mathcal{N}_i}\big(\hat\theta_i^{k-1} + \hat\theta_n^{k-1}\big)\Big), \qquad (21a)$$
$$\gamma_i^k = \gamma_i^{k-1} + \rho\sum_{n\in\mathcal{N}_i}\big(\hat\theta_i^k - \hat\theta_n^k\big), \qquad (21b)$$
with a censoring step conducted between (21a) and (21b). We outline the COKE algorithm in Algorithm 2. The key feature of COKE is that agent $i$'s local variables $\theta_i^k$ and $\gamma_i^k$ are updated at every iteration, but the transmission of $\theta_i^k$ occurs only when the censoring condition is met. By skipping unnecessary transmissions, the communication efficiency of COKE is improved. Evidently, a large $\{h_i(k)\}$ saves more communication but may lead to divergence from the optimal solution $\theta^*$ of (16), while a small $\{h_i(k)\}$ does not contribute much to communication saving. Notably, DKLA is the special case of COKE in which the communication censoring strategy is absent, obtained by setting $h_i(k) = 0, \forall i, k$.

4. Theoretical Guarantees

In this section, we perform theoretical analyses to address two questions about the convergence properties of the DKLA and COKE algorithms. First, do they converge to the globally optimal point, and if so, at what rate? Second, what generalization performance do they achieve in learning?
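For the least-squares loss, the primal update (21a) has a closed form. The sketch below is our own reconstruction under the usual decentralized-ADMM form, assuming $\hat R_i(\theta) = \frac{1}{T_i}\|y_i - (\Phi_L^i)^\top\theta\|_2^2 + \frac{\lambda}{N}\|\theta\|_2^2$; the function names and argument conventions are illustrative, not the paper's.

```python
import numpy as np

def coke_theta_update(Phi, y, gamma, theta_hat_self, theta_hat_nbrs,
                      rho=1e-2, lam=1e-3, N=10):
    """One local primal update, a sketch of (21a) for least squares.
    Phi: L x T_i local RF-mapped data; y: T_i targets; gamma: local dual;
    theta_hat_*: latest *broadcast* states of agent i and its neighbors.
    Setting the gradient of the local surrogate to zero gives a linear system."""
    L, Ti = Phi.shape
    deg = len(theta_hat_nbrs)
    A = (2.0 / Ti) * (Phi @ Phi.T) + (2.0 * lam / N + 2.0 * rho * deg) * np.eye(L)
    consensus = sum(theta_hat_self + th_n for th_n in theta_hat_nbrs)
    b = (2.0 / Ti) * (Phi @ y) - gamma + rho * consensus
    return np.linalg.solve(A, b)

def coke_dual_update(gamma, theta_hat_self, theta_hat_nbrs, rho=1e-2):
    """Dual ascent step (21b): gamma + rho * sum_n (theta_hat_i - theta_hat_n)."""
    return gamma + rho * sum(theta_hat_self - th_n for th_n in theta_hat_nbrs)
```

Note that both updates consume only the broadcast states $\hat\theta$, so a censored (skipped) transmission simply leaves the stale value in place.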
Since DKLA is a special case of COKE, the analytic results for COKE, especially the second one, extend to DKLA straightforwardly. For the theoretical analysis, we make the following assumptions.

Assumption 1 The network with topology $\mathcal{G} = (\mathcal{N}, \mathcal{C}, A)$ is undirected and connected.

Assumption 2 The local cost functions $\hat R_i$ are strongly convex with constants $m_{\hat R_i} > 0$; that is, for all $i \in \mathcal{N}$ and any $\theta_a, \theta_b \in \mathbb{R}^L$,
$$\langle \nabla \hat R_i(\theta_a) - \nabla \hat R_i(\theta_b),\ \theta_a - \theta_b \rangle \ge m_{\hat R_i}\,\|\theta_a - \theta_b\|_2^2.$$
The minimum strong convexity constant is $m_{\hat R} := \min_i m_{\hat R_i}$. The gradients of the local cost functions are Lipschitz continuous with constants $M_{\hat R_i} > 0$; that is, $\|\nabla \hat R_i(\theta_a) - \nabla \hat R_i(\theta_b)\|_2 \le M_{\hat R_i}\,\|\theta_a - \theta_b\|_2$ for any agent $i$ and any $\theta_a, \theta_b \in \mathbb{R}^L$. The maximum Lipschitz constant is $M_{\hat R} := \max_i M_{\hat R_i}$.

Assumption 3 The numbers of training samples of different agents are of the same order of magnitude; that is, $\frac{\max_i T_i - \min_i T_i}{\min_i T_i} < 10$.

Assumption 4 There exists $f_H \in \mathcal{H}$ such that $E(f_H) \le E(f)$ for all estimators $f \in \mathcal{H}$, where $E(f) := \mathbb{E}_p[\ell(f(x), y)]$ is the expected risk measuring the generalization ability of the estimator $f$.

Assumptions 1 and 2 are standard for decentralized optimization over networks (Shi et al., 2014), Assumption 4 is standard in the generalization performance analysis of kernel learning (Li et al., 2018), and Assumption 3 is enforced to exclude the case of extremely unbalanced data distributed over the network.

4.1 Linear convergence of DKLA and COKE

We first establish that DKLA enables the agents in the decentralized network to reach consensus on the prediction function at a linear rate. We then show that when the censoring function is properly chosen and the penalty parameter satisfies certain conditions, COKE also guarantees that the individually learned functional evaluated at the same sample converges linearly to the optimal solution.
Theorem 1 [Linear convergence of DKLA] Initialize the dual variables as $\gamma_i^0 = 0, \forall i$. Under Assumptions 1-3, the learned functional at each agent through DKLA is R-linearly convergent to the optimal functional $\hat f_{\theta^*}(x) := (\theta^*)^\top \phi_L(x)$ for any $x \in \mathcal{X}$, where $\theta^*$ denotes the optimal solution to (16) obtained in the centralized case. That is,
$$\lim_{k\to\infty} \hat f_{\theta_i^k}(x) = \hat f_{\theta^*}(x), \quad \forall i. \qquad (22)$$
Proof. See Appendix A.

Theorem 2 [Linear convergence of COKE] Initialize the dual variables as $\gamma_i^0 = 0, \forall i$, set the censoring thresholds to $h(k) = v\mu^k$ with $v > 0$ and $\mu \in (0, 1)$, and choose the penalty parameter $\rho$ such that
$$0 < \rho < \min\left\{\eta_1,\ \frac{(\nu-1)\,\sigma_{\min}^2(S_-)}{\nu\eta_3\,\sigma_{\max}^2(S_+)},\ \frac{m_{\hat R} - \eta_3\nu M_{\hat R}^2/\sigma_{\min}^2(S_-)}{\big(\frac{\eta_1}{4} + \eta_2\big)\,\sigma_{\max}^2(S_+)}\right\}, \qquad (23)$$
where $\eta_1 > 0$, $\eta_2 > 0$, $\eta_3 > 0$, and $\nu > 1$ are arbitrary constants, $m_{\hat R}$ and $M_{\hat R}$ are the minimum strong convexity constant of the local cost functions and the maximum Lipschitz constant of the local gradients, respectively, and $\sigma_{\max}(S_+)$ and $\sigma_{\min}(S_-)$ are the maximum singular value of the unsigned incidence matrix $S_+$ and the minimum non-zero singular value of the signed incidence matrix $S_-$ of the network, respectively. Then, under Assumptions 1-3, the learned functional at each agent through COKE is R-linearly convergent to the optimal one $\hat f_{\theta^*}(x) := (\theta^*)^\top \phi_L(x)$ for any $x \in \mathcal{X}$, where $\theta^*$ denotes the optimal solution to (16) obtained in the centralized case. That is,
$$\lim_{k\to\infty} \hat f_{\theta_i^k}(x) = \hat f_{\theta^*}(x), \quad \forall i. \qquad (24)$$
Proof. See Appendix A.

Remark 1. It should be noted that the kernel transformation with RF mapping is essential in enabling a convex consensus formulation with a convergence guarantee. For example, in a regular optimization problem with a local cost function $(y - f(x))^2$, even though the loss is quadratic, the nonlinear function $f(x)$ inside destroys the convexity. In contrast, with RF mapping, $f(x)$ of any form is expressed as a linear function of $\theta$, and hence the local cost function is guaranteed to be convex.
For decentralized kernel learning, many widely adopted loss functions result in (strongly) convex local objective functions in the RF space, such as the quadratic loss in regression and the logistic loss in classification.

Remark 2. For Theorem 2, notice that choosing larger $v$ and $\mu$ in the design of the censoring thresholds in COKE leads to less communication per iteration at the expense of possible performance degradation, whereas smaller $v$ and $\mu$ may not contribute much to communication saving. However, it is challenging to characterize an explicit tradeoff between communication cost and steady-state accuracy, since the designed censoring thresholds do not have an explicit relationship with the update of the model parameter.

The above theorems establish the exact convergence of the functional learned in the multi-agent system for the decentralized kernel regression problem via DKLA and COKE. Different from previous works (Koppel et al., 2018; Shin et al., 2018), our analytic results are obtained by converting the non-parametric, data-dependent learning model into a parametric, data-independent model in the RF space, which is then solved under the consensus optimization framework. In this way, we not only reduce the computational complexity of standard kernel methods and make RF-based kernel methods scalable to large datasets, but also protect data privacy, since no raw data are exchanged among agents and the RF mapping is not a one-to-one mapping. The RF mapping is crucial in our algorithms: with it, we are able to establish the linear convergence of the functional by showing the linear convergence of the iteratively updated decision variables in the RF space; see Appendix A for more details.

4.2 Generalization property of COKE

The ultimate goal of decentralized learning is to find a function that generalizes well for the ensemble of all data from all agents.
To evaluate the generalization property of the predictive function learned by COKE, we are interested in bounding the difference between the expected risk of the predictive function learned by COKE at the $k$-th iteration, defined as $E(\hat f^k) := \sum_{i=1}^N E_i(\hat f_{\theta_i^k}) := \sum_{i=1}^N \mathbb{E}_p[(y - (\theta_i^k)^\top \phi_L(x))^2]$, and the expected risk $E(f_H)$ in the RKHS. This is different from bounding the approximation error between the kernel $\kappa$ and the kernel $\hat\kappa_L$ approximated by $L$ random features, as in the literature (Rahimi and Recht, 2008; Sutherland and Schneider, 2015; Sriperumbudur and Szabo, 2015). As DKLA is a special case of COKE, the generalization performance of COKE extends to DKLA straightforwardly.

To illustrate our finding, we focus on the kernel regression problem whose loss function is least squares, i.e., $\ell(y, f(x)) = (y - f(x))^2$. With RF mapping, the objective function (16) of the regression problem can be formulated as
$$\min_\theta\ \sum_{i=1}^N \hat R_i(\theta) = \sum_{i=1}^N \frac{1}{T_i}\,\|y_i - (\Phi_L^i)^\top \theta\|_2^2 + \lambda\|\theta\|_2^2, \qquad (25)$$
where $y_i = [y_{i,1}, \dots, y_{i,T_i}]^\top \in \mathbb{R}^{T_i \times 1}$, $\Phi_L^i = [\phi_L(x_{i,1}), \dots, \phi_L(x_{i,T_i})] \in \mathbb{R}^{L \times T_i}$, and $\phi_L(x_{i,t})$ is the data mapped to the RF space. The optimal solution of (25) is given in closed form by
$$\theta^* = (\bar\Phi^\top \bar\Phi + \lambda I)^{-1} \bar\Phi^\top \bar y, \qquad (26)$$
where $\bar\Phi = [\bar\Phi_L^1, \dots, \bar\Phi_L^N]^\top \in \mathbb{R}^{T \times L}$ with $\bar\Phi_L^i = \frac{1}{\sqrt{T_i}}\Phi_L^i, \forall i \in \mathcal{N}$, and $\bar y = [\bar y_1; \dots; \bar y_N] \in \mathbb{R}^{T \times 1}$ with $\bar y_i = \frac{1}{\sqrt{T_i}} y_i, \forall i \in \mathcal{N}$. The optimal prediction model is then expressed by
$$\hat f_{\theta^*}(x) = (\theta^*)^\top \phi_L(x). \qquad (27)$$
In the following theorem, we give a general result on the generalization performance of the predictive function learned by COKE for the kernel regression problem, which is built on the linear convergence result given in Theorem 2 and takes into account the number of random features adopted.

Theorem 3 Let $\lambda_K$ be the largest eigenvalue of the kernel matrix $K$ constructed by all data $X_T$, and choose the regularization parameter $\lambda < \lambda_K/T$ so as to control overfitting.
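The centralized benchmark $\theta^*$ in (26) is a ridge solve over the stacked, per-agent-scaled feature blocks. A minimal sketch, assuming the $1/\sqrt{T_i}$ scaling convention reconstructed above and illustrative function and argument names:

```python
import numpy as np

def rf_ridge_closed_form(Phi_bars, y_bars, lam):
    """Centralized theta* of (26): stack the scaled blocks
    Phi_bar_i (T_i x L) and y_bar_i (T_i,), then solve
    (Phi^T Phi + lam I) theta = Phi^T y.  Blocks are passed already
    transposed to T_i x L for convenience (an implementation choice)."""
    Phi = np.vstack(Phi_bars)             # T x L stacked design matrix
    y = np.concatenate(y_bars)            # T stacked targets
    L = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(L), Phi.T @ y)
```

This is the quantity DKLA/COKE converge to without ever centralizing the raw data; here it serves only as a reference point for evaluation.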
Under Assumptions 1-4, with the censoring function and other parameters given in Theorem 2, for all $\delta_p \in (0, 1)$ and $\|f\|_{\mathcal{H}} \le 1$, if the number of random features $L$ satisfies
$$L \ge \frac{8}{3\epsilon^2}\, d_K^\lambda \log\frac{16\, d_K^\lambda}{\delta_p},$$
then with probability at least $1 - \delta_p$, the excess risk of $E(\hat f^k)$ obtained by Algorithm 2 converges to an upper bound, i.e.,
$$\lim_{k\to\infty} \big(E(\hat f^k) - E(f_H)\big) \le 3\lambda + O\big(1/\sqrt{T}\big), \qquad (28)$$
where $\epsilon \in (0, 1)$ and $d_K^\lambda := \mathrm{Tr}\big(K(K + \lambda T I)^{-1}\big)$ is the number of effective degrees of freedom, known to be an indicator of the number of independent parameters in a learning problem (Avron et al., 2017).
Proof. See Appendix B.

Theorem 3 states the tradeoff between computational efficiency and statistical efficiency through the regularization parameter $\lambda$, the effective dimension $d_K^\lambda$, and the number of random features adopted. We can see that to bound the excess risk with a higher probability, we need more random features, which results in a higher computational complexity. The regularization parameter is usually determined by the number of training data, and one common practice is to set $\lambda = O(1/\sqrt{T})$ for the regression problem (Caponnetto and De Vito, 2007). Therefore, with $O(\sqrt{T}\log d_K^\lambda)$ features, COKE achieves a learning risk of $O(1/\sqrt{T})$ at a linear rate. We also notice that different sampling strategies affect the number of random features required to achieve a given generalization error. For example, importance sampling is studied for centralized kernel learning in the RF space in (Li et al., 2018). Interested readers are referred to (Li et al., 2018) and the references therein.

5. Experiments

This section evaluates the performance of our COKE algorithm in regression tasks using both synthetic and real-world datasets. Since we consider the case where data are only locally available and cannot be shared among agents, the following RF-based methods are used to benchmark our COKE algorithm.

CTA.
This is a form of diffusion-based technique where all agents first construct their RF-mapped data $\{\phi_L(x_{i,t})\}_{t=1}^{T_i}$ for all $i$, using the same random features $\{\omega_l\}_{l=1}^L$ as DKLA and COKE. Then, at each iteration $k$, each agent $i$ first combines the information from its neighbors, i.e., $\theta_n, n \in \mathcal{N}_i$, with its own parameter $\theta_i$ by aggregation. It then updates its own parameter $\theta_i$ using the gradient descent method with the aggregated information (Sayed, 2014). The cost function for agent $i$ is given in (15). Note that this method has not been formally proposed in existing works for RF-based decentralized kernel learning with batch-form data; we introduce it here only for comparison purposes. An online version that deals with streaming data is available in (Bouboulis et al., 2018). The batch version of CTA introduced here is expected to converge faster than the online version.

DKLA. Algorithm 1 proposed in Section 3.2, where ADMM is applied and the communication among agents happens at every iteration without being censored.

The performance of all algorithms is evaluated using both synthetic and real-world datasets, where the entries of the data samples are normalized to lie in $[0, 1]$ and each agent uses 70% of its data for training and the rest for testing. The learning performance at each iteration is evaluated by the mean squared error (MSE),
$$\mathrm{MSE}(k) = \frac{1}{T}\sum_{i=1}^N\sum_{t=1}^{T_i}\big(y_{i,t} - (\theta_i^k)^\top \phi_L(x_{i,t})\big)^2.$$
The decision variable $\theta_i$ for CTA is initialized as $\theta_i^0 = 0, \forall i$, as in DKLA and COKE. For COKE, it should be noted that the design of the censoring function is crucial. For the censoring thresholds adopted in Theorem 2, choosing larger $v$ and $\mu$ leads to less communication per iteration but may result in performance degradation. For all simulations, the kernel bandwidth is fine-tuned for each dataset individually via cross-validation.
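The CTA baseline described above admits a compact combine-then-adapt sketch. This is our own illustration: the paper does not specify the combination weights, so uniform averaging is an assumption, as are the function name and defaults.

```python
import numpy as np

def cta_step(theta_i, theta_nbrs, Phi, y, eta=0.99, lam=1e-3, N=10):
    """One combine-then-adapt step for the CTA baseline: uniformly average
    own and neighbor parameters (combination weights are an assumption),
    then take a gradient step on the local regularized least-squares cost.
    Phi: L x T_i RF-mapped local data; y: T_i local targets."""
    combined = (theta_i + sum(theta_nbrs)) / (1 + len(theta_nbrs))  # combine
    Ti = y.shape[0]
    grad = (2.0 / Ti) * Phi @ (Phi.T @ combined - y) + (2.0 * lam / N) * combined
    return combined - eta * grad                                     # adapt
```

Unlike DKLA/COKE, this scheme carries no dual variable, which is one intuition for its slower convergence in the experiments below.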
The parameters of the censoring function are tuned to achieve the best communication saving at nearly no loss of learning performance.

5.1 Synthetic dataset

In this setup, the connected graph is randomly generated with $N = 20$ nodes and 95 edges. The probability of attachment per node equals 0.3; that is, any pair of nodes is connected with probability 0.3. Each agent has $T_i \in (4000, 6000)$ data pairs generated following the model $y_{i,t} = \sum_{m=1}^{50} b_m \kappa(c_m, x_{i,t}) + e_{i,t}$, where the $b_m$ are uniformly drawn from $[0, 1]$, $c_m \sim \mathcal{N}(0, I_5)$, $x_{i,t} \sim \mathcal{N}(0, I_5)$, and $e_{i,t} \sim \mathcal{N}(0, 0.1)$. The kernel $\kappa$ in the model is Gaussian with bandwidth $\sigma = 5$.

5.2 Real datasets

To further evaluate our algorithms, the following popular real-world datasets from the UCI machine learning repository are chosen (Asuncion and Newman, 2007).

Tom's hardware. This dataset contains $T = 11000$ samples with $x_t \in \mathbb{R}^{96}$, whose features include the number of created discussions and authors interacting on a topic, and $y_t \in \mathbb{R}$ representing the average number of displays to a visitor about that topic (Kawala et al., 2013).

Twitter. This dataset consists of $T = 13800$ samples with $x_t \in \mathbb{R}^{77}$ being a feature vector reflecting the number of new interactive authors, the length of discussions on a given topic, etc., and $y_t \in \mathbb{R}$ representing the average number of active discussions on a certain topic. The learning task is to predict the popularity of these topics. We also include a larger Twitter dataset for testing, which has $T = 98704$ samples (Kawala et al., 2013).

Energy. This dataset contains $T = 19735$ samples with $x_t \in \mathbb{R}^{28}$ describing the humidity and temperature in different areas of a house, as well as the pressure, wind speed, and visibility outside, while $y_t$ denotes the total energy consumption in the house (Candanedo et al., 2017).

Air quality.
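The synthetic model above can be reproduced with a short generator. A sketch under stated assumptions: the shared dictionary $(b_m, c_m)$ is drawn from a fixed seed common to all agents (our choice), and $e_{i,t} \sim \mathcal{N}(0, 0.1)$ is interpreted as variance 0.1; function and argument names are illustrative.

```python
import numpy as np

def make_synthetic_agent_data(Ti, seed, M=50, d=5, sigma=5.0, noise_var=0.1):
    """Generate one agent's data from y = sum_m b_m * kappa(c_m, x) + e with a
    Gaussian kernel of bandwidth sigma = 5.  The dictionary (b_m, c_m) uses a
    fixed seed so every agent samples the SAME underlying function; the
    per-agent inputs x and noise e use `seed`."""
    dict_rng = np.random.default_rng(0)                  # common across agents
    b = dict_rng.uniform(0.0, 1.0, size=M)               # b_m ~ U[0, 1]
    C = dict_rng.normal(size=(M, d))                     # c_m ~ N(0, I_5)
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(Ti, d))                         # x_{i,t} ~ N(0, I_5)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)  # squared distances
    y = np.exp(-d2 / (2.0 * sigma ** 2)) @ b \
        + rng.normal(scale=np.sqrt(noise_var), size=Ti)  # e_{i,t}
    return X, y
```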
This dataset collects $T = 9358$ samples measured by a gas multi-sensor device in an Italian city, where $x_t \in \mathbb{R}^{13}$ represents the hourly concentration of CO, NOx, NO2, etc., while $y_t$ denotes the concentration of polluting chemicals in the air (De Vito et al., 2008).

5.3 Parameter setting and performance analysis

For the synthetic data, we adopt a Gaussian kernel with bandwidth $\sigma = 1$ for training and use $L = 100$ random features for kernel approximation. Note that the chosen $\sigma$ differs from that of the actual data model. The censoring thresholds are $h(k) = 0.95^k$, and the regularization parameter $\lambda$ and stepsize $\rho$ of DKLA and COKE are set to $5 \times 10^{-5}$ and $10^{-2}$, respectively. The stepsize of CTA is set to $\eta = 0.99$, which is tuned to achieve the same level of learning performance as COKE and DKLA at its fastest speed.

To present the performance of all algorithms on the real datasets concisely and comprehensively, we show the experimental results on the Twitter dataset with $T = 13800$ samples in figures and record the experimental results on the remaining datasets in tables. For the Twitter dataset with $T = 13800$ samples, we randomly split it into 10 mini-batches, each with $T_i \in (1200, 1400)$ data pairs such that $\sum_{i=1}^{10} T_i = T$. The 10 mini-batches are distributed to 10 agents connected by a random network with 28 edges. We use 100 random features to approximate a Gaussian kernel with bandwidth $\sigma = 1$ during training. The parameters $\lambda$ and $\rho$ are set to $10^{-3}$ and $10^{-2}$, respectively. The censoring thresholds are $h(k) = 0.97^k$. The stepsize of CTA is set to $\eta = 0.99$ to balance the learning performance and the convergence speed.

In Fig. 1, we show that the individually learned functional at each agent via COKE reaches consensus at the optimal estimate for both synthetic and real datasets. In Fig. 2, we compare the MSE performance of COKE, DKLA, and CTA. Both figures show that COKE converges more slowly than DKLA due to the communications skipped by the censoring step.
However, the learning performance of COKE eventually matches that of DKLA. The diffusion-based CTA algorithm converges the slowest. In Fig. 3, we show the MSE performance versus the communication cost (in terms of the number of transmissions).

Figure 1: Functional convergence via COKE for synthetic data (a) and the Twitter dataset (b). The learned functionals of all distributed agents converge to the optimal estimate obtained when the data are assumed to be centrally available.

Table 1: MSE performance on the Twitter dataset (large); $\sigma = 1$, $L = 100$, $\lambda = 10^{-3}$, stepsize $\eta = 0.99$ for CTA, stepsize $\rho = 10^{-2}$ for DKLA and COKE, censoring thresholds $h(k) = 0.5 \times 0.98^k$. Training error in MSE ($10^{-3}$) / communication cost; test error in MSE ($10^{-3}$). DKLA and COKE achieve better MSE performance than CTA, while COKE requires less communication than DKLA.

| Iteration | CTA train/comm. | DKLA train/comm. | COKE train/comm. | CTA test | DKLA test | COKE test |
|---|---|---|---|---|---|---|
| k = 50 | 3.9 / 500 | 2.4 / 500 | 4.5 / 13 | 4.0 | 2.6 | 4.2 |
| k = 100 | 3.3 / 1000 | 2.4 / 1000 | 2.6 / 100 | 3.4 | 2.6 | 2.8 |
| k = 200 | 3.0 / 2000 | 2.3 / 2000 | 2.4 / 298 | 3.2 | 2.5 | 2.6 |
| k = 500 | 2.7 / 5000 | 2.3 / 5000 | 2.3 / 902 | 2.9 | 2.5 | 2.5 |
| k = 1000 | 2.5 / 10000 | 2.2 / 10000 | 2.2 / 4648 | 2.7 | 2.5 | 2.5 |
| k = 1500 | 2.5 / 15000 | 2.2 / 15000 | 2.2 / 9648 | 2.7 | 2.5 | 2.5 |
| k = 2000 | 2.4 / 20000 | 2.2 / 20000 | 2.2 / 14648 | 2.6 | 2.5 | 2.5 |

CTA converges the slowest and communicates at every iteration, so its communication cost is much higher than that of DKLA; we therefore do not include it in Fig. 3 but rather focus on the communication saving of COKE over DKLA. We can see that, to achieve the same level of learning performance, COKE requires a much lower communication cost than DKLA. Both the synthetic data and the real dataset show a communication saving of around 50% in Fig. 3 for a given learning accuracy, which corroborates the communication efficiency of COKE.
The performance of all three algorithms on the remaining four datasets is listed in Tables 1-6. All results show that COKE saves substantial communication (almost 50%) at a negligible learning gap from DKLA, and both DKLA and COKE require far fewer communication resources than CTA.

Figure 2: MSE performance for synthetic data (a) and the Twitter dataset (b). The ADMM-based algorithms (COKE and DKLA) converge faster than the diffusion-based algorithm (CTA) on both datasets. Furthermore, DKLA and COKE achieve better learning performance than CTA in terms of MSE on the real dataset.

Figure 3: MSE performance versus communication cost for synthetic data (a) and the Twitter dataset (b). Compared with DKLA, COKE achieves around 50% communication saving at the same level of MSE performance on both datasets.

Table 2: MSE performance on the Tom's hardware dataset; $\sigma = 1$, $L = 100$, $\lambda = 10^{-2}$, stepsize $\eta = 0.99$ for CTA, stepsize $\rho = 10^{-2}$ for DKLA and COKE, censoring thresholds $h(k) = 0.5 \times 0.95^k$. Training error in MSE ($10^{-4}$) / communication cost; test error in MSE ($10^{-4}$). DKLA and COKE achieve better MSE performance than CTA, while COKE requires less communication than DKLA.

| Iteration | CTA train/comm. | DKLA train/comm. | COKE train/comm. | CTA test | DKLA test | COKE test |
|---|---|---|---|---|---|---|
| k = 50 | 20.02 / 500 | 10.01 / 500 | 17.40 / 10 | 20.16 | 11.20 | 18.82 |
| k = 100 | 16.6 / 1000 | 9.91 / 1000 | 10.67 / 112 | 17.09 | 11.10 | 11.86 |
| k = 200 | 13.68 / 2000 | 9.90 / 2000 | 9.97 / 331 | 14.58 | 11.10 | 11.15 |
| k = 500 | 11.19 / 5000 | 9.90 / 5000 | 9.90 / 1114 | 12.35 | 11.10 | 11.10 |
| k = 1000 | 10.27 / 10000 | 9.90 / 10000 | 9.90 / 5600 | 11.47 | 11.10 | 11.10 |
| k = 1500 | 10.01 / 15000 | 9.90 / 15000 | 9.90 / 10600 | 11.22 | 11.10 | 11.10 |
| k = 2000 | 9.92 / 20000 | 9.90 / 20000 | 9.90 / 15600 | 11.13 | 11.10 | 11.10 |

Table 3: MSE performance (training error) versus communication cost on the Twitter dataset (large) and the Tom's hardware dataset. For both datasets, COKE saves around 50% of the communication resources of DKLA to achieve the same level of learning performance.

| Twitter (large) MSE ($10^{-3}$) | CTA | DKLA | COKE | Tom's hardware MSE ($10^{-4}$) | CTA | DKLA | COKE |
|---|---|---|---|---|---|---|---|
| 5 | 360 | 20 | 10 | 18 | 680 | 20 | 3 |
| 4 | 480 | 30 | 10 | 16 | 1020 | 30 | 22 |
| 3 | 1860 | 60 | 48 | 14 | 1610 | 60 | 28 |
| 2.8 | 3250 | 100 | 55 | 12 | 2880 | 110 | 51 |
| 2.6 | 6120 | 180 | 100 | 10 | 7950 | 250 | 128 |
| 2.3 | - | 1080 | 577 | 9.95 | 17620 | 640 | 361 |
| 2.2 | - | 5660 | 4428 | 9.90 | - | 1550 | 984 |

For example, the number of transmissions required by COKE to reach a training error of $2.3 \times 10^{-3}$ on the Twitter dataset is 577, which is only 53% of that required by DKLA to reach the same level of learning performance. For the Tom's hardware dataset, COKE requires 361 total transmissions to reach a learning error of $9.95 \times 10^{-4}$, which is 56.4% of the transmissions of DKLA and 0.02% of those of CTA. Note that much of the censoring occurs in the early update iterations.
At the later stage, in contrast, COKE transmits nearly all parameters at every iteration, since the censoring thresholds become smaller than the difference between two consecutive updates.

Table 4: MSE performance on the Energy dataset; $\sigma = 0.1$, $L = 100$, $\lambda = 10^{-3}$, stepsize $\eta = 0.99$ for CTA, $\rho = 10^{-2}$ for DKLA and COKE, censoring thresholds $h(k) = 0.5 \times 0.98^k$. Training error in MSE ($10^{-3}$) / communication cost; test error in MSE ($10^{-3}$). DKLA and COKE achieve better MSE performance than CTA, while COKE requires the least communication.

| Iteration | CTA train/comm. | DKLA train/comm. | COKE train/comm. | CTA test | DKLA test | COKE test |
|---|---|---|---|---|---|---|
| k = 50 | 25.65 / 500 | 22.52 / 500 | 25.22 / 0 | 26.45 | 22.97 | 26.02 |
| k = 100 | 24.88 / 1000 | 22.12 / 1000 | 23.65 / 57 | 25.57 | 22.50 | 24.2 |
| k = 200 | 24.17 / 2000 | 21.81 / 2000 | 22.57 / 254 | 24.77 | 22.15 | 23.02 |
| k = 500 | 23.40 / 5000 | 21.55 / 5000 | 21.88 / 987 | 23.92 | 21.86 | 22.22 |
| k = 1000 | 22.84 / 10000 | 21.48 / 10000 | 21.51 / 5752 | 23.31 | 21.79 | 21.82 |
| k = 1500 | 22.54 / 15000 | 21.47 / 15000 | 21.47 / 10752 | 22.97 | 21.78 | 21.78 |
| k = 2000 | 22.35 / 20000 | 21.47 / 20000 | 21.47 / 15752 | 22.75 | 21.78 | 21.78 |

Table 5: MSE performance on the Air quality dataset; $\sigma = 0.1$, $L = 200$, $\lambda = 10^{-5}$, stepsize $\eta = 0.99$ for CTA, $\rho = 10^{-2}$ for DKLA and COKE, censoring thresholds $h(k) = 0.9 \times 0.97^k$. Training error in MSE ($10^{-3}$) / communication cost; test error in MSE ($10^{-3}$). DKLA and COKE achieve better MSE performance than CTA, while COKE requires less communication than DKLA.

| Iteration | CTA train/comm. | DKLA train/comm. | COKE train/comm. | CTA test | DKLA test | COKE test |
|---|---|---|---|---|---|---|
| k = 50 | 6.4 / 500 | 1.8 / 500 | 3.7 / 72 | 6.7 | 2.1 | 4.0 |
| k = 100 | 4.5 / 1000 | 1.6 / 1000 | 2.2 / 172 | 4.8 | 1.8 | 2.5 |
| k = 200 | 3.2 / 2000 | 1.4 / 2000 | 1.7 / 384 | 3.5 | 1.7 | 2.0 |
| k = 500 | 2.2 / 5000 | 1.3 / 5000 | 1.3 / 2263 | 2.5 | 1.6 | 1.6 |
| k = 1000 | 1.7 / 10000 | 1.2 / 10000 | 1.2 / 7263 | 2.0 | 1.6 | 1.6 |
| k = 1500 | 1.6 / 15000 | 1.2 / 15000 | 1.2 / 12263 | 1.8 | 1.6 | 1.6 |
| k = 2000 | 1.5 / 20000 | 1.2 / 20000 | 1.2 / 17263 | 1.8 | 1.6 | 1.6 |

Table 6: MSE performance (training error) versus communication cost on the Energy dataset and the Air quality dataset. For both datasets, COKE saves around 45%-55% of the communication resources of DKLA to achieve the same level of learning performance.

| Energy MSE ($10^{-3}$) | CTA | DKLA | COKE | Air quality MSE ($10^{-3}$) | CTA | DKLA | COKE |
|---|---|---|---|---|---|---|---|
| 25 | 860 | 20 | 11 | 5.0 | 810 | 60 | 49 |
| 24 | 2290 | 70 | 48 | 3.0 | 2290 | 180 | 81 |
| 23.5 | 4160 | 140 | 76 | 2.0 | 6010 | 360 | 211 |
| 23 | 7690 | 250 | 134 | 1.8 | 8160 | 490 | 285 |
| 22.5 | 14750 | 480 | 258 | 1.6 | 12300 | 750 | 424 |
| 22 | - | 1160 | 652 | 1.5 | 16190 | 1010 | 586 |
| 21.5 | - | 4950 | 4062 | 1.2 | - | 5990 | 5383 |

6. Concluding Remarks

This paper studies the decentralized kernel learning problem under privacy concerns and communication constraints for multi-agent systems. Leveraging random feature mapping, we convert the non-parametric kernel learning problem into a parametric one in the RF space and solve it under the consensus optimization framework via the alternating direction method of multipliers. A censoring strategy is applied to conserve communication resources. Through both theoretical analysis and simulations, we establish that the proposed algorithms not only achieve a linear convergence rate but also exhibit effective generalization performance. Thanks to the fixed-size parametric learning model, the proposed algorithms circumvent the curse of dimensionality and do not involve raw data exchange among agents. Hence, they can be applied to distributed learning problems that involve big data and offer some level of data privacy protection. To cope with dynamic environments and enhance the learning performance, future work will be devoted to decentralized online kernel learning and multi-kernel learning.

Acknowledgments

We would like to acknowledge support for this project from the National Science Foundation (NSF grants #1741338 and #1939553).

Appendix A. Proof of Theorem 1 and Theorem 2

Proof.
As discussed in Section 3.2, solving the decentralized kernel learning problem in the RF space (17) is equivalent to solving problem (16). From (14), it is evident that the convergence of the optimal functional $f$ in (16) hinges on the convergence of the decision variables $\theta$ in the RF space. Since the decision variables in the RF space are data-independent, the convergence proof of DKLA boils down to proving the convergence of a convex optimization problem solved by ADMM. The convergence proof of COKE, however, is nontrivial because of the error caused by the outdated information introduced by the communication censoring strategy. Our proof of both theorems consists of two steps. The first step is to show the linear convergence of the decision parameters $\theta$, for DKLA via Theorem 4 and for COKE via Theorem 5 below, which are derived straightforwardly from (Shi et al., 2014) and (Liu et al., 2019), respectively. The second step is to show how the convergence of $\theta$ translates to the convergence of the learned functional, which is the same for both algorithms. Compared to (Shi et al., 2014) and (Liu et al., 2019), which deal with general optimization problems for parametric learning, this work focuses on the specific problem of decentralized kernel learning, which is more challenging in both solution development and theoretical analysis. By leveraging the RF mapping technique, we successfully develop the DKLA and COKE algorithms. Notably, a direct application of ADMM as in (Shi et al., 2014) to decentralized kernel learning is infeasible without raw data exchanges. Moreover, we analyze the convergence of the nonlinear functional to be learned and the generalization performance of kernel learning in the decentralized setting. The analysis builds on (Shi et al., 2014) and (Liu et al., 2019) but goes further, and it is only attainable because of the adoption of the RF mapping.
For both algorithms, the linear convergence of the decision variables in the RF space is based on a matrix reformulation of (17). Define $\Theta^* := [\theta^{*\top}; \dots; \theta^{*\top}] \in \mathbb{R}^{N \times L}$ and $\tilde\Theta^* := [\vartheta^{*\top}; \dots; \vartheta^{*\top}] \in \mathbb{R}^{N \times L}$ as the optimal primal variables, and $\beta^*$ as the optimal dual variable. Then, for DKLA, Theorem 4 states that $\{\Theta^k\}$ (with $\Theta^k := [\theta_1^{k\top}; \dots; \theta_N^{k\top}] \in \mathbb{R}^{N \times L}$) is R-linearly convergent to the optimal $\Theta^*$. For the detailed proof, see (Shi et al., 2014).

Theorem 4 [Linear convergence of decision variables in DKLA] For the optimization problem (16), initialize the dual variables as $\gamma_i^0 = 0, \forall i$. Under Assumptions 1-2, $\{\Theta^k\}$ is R-linearly convergent to the optimal $\Theta^*$ as $k$ goes to infinity, following from
$$\|\Theta^k - \Theta^*\|_F^2 \le \frac{1}{m_{\hat R}}\Big(\rho\,\|\tilde\Theta^{k-1} - \tilde\Theta^*\|_F^2 + \frac{1}{\rho}\,\|\beta^{k-1} - \beta^*\|_F^2\Big),$$
where $\{(\tilde\Theta^k, \beta^k)\}$ is Q-linearly convergent to its optimum $(\tilde\Theta^*, \beta^*)$:
$$\rho\,\|\tilde\Theta^k - \tilde\Theta^*\|_F^2 + \frac{1}{\rho}\,\|\beta^k - \beta^*\|_F^2 \le \frac{1}{1+\delta_d}\Big(\rho\,\|\tilde\Theta^{k-1} - \tilde\Theta^*\|_F^2 + \frac{1}{\rho}\,\|\beta^{k-1} - \beta^*\|_F^2\Big),$$
with
$$\delta_d = \min\left\{\frac{(\nu-1)\,\sigma_{\min}^2(S_-)}{\nu\,\sigma_{\max}^2(S_+)},\ \frac{m_{\hat R}}{\frac{\rho}{4}\sigma_{\max}^2(S_+) + \frac{\nu M_{\hat R}^2}{\rho\,\sigma_{\min}^2(S_-)}}\right\},$$
where $\nu > 1$ is an arbitrary constant, $\sigma_{\max}(S_+)$ is the maximum singular value of the unsigned incidence matrix $S_+$ of the network, $\sigma_{\min}(S_-)$ is the minimum non-zero singular value of the signed incidence matrix $S_-$ of the network, and $m_{\hat R}$ and $M_{\hat R}$ are the minimum strong convexity constant of the local cost functions and the maximum Lipschitz constant of the local gradients, respectively. The Q-linear convergence rate of $\{(\tilde\Theta^k, \beta^k)\}$ to $(\tilde\Theta^*, \beta^*)$ thus satisfies
$$\frac{1}{1+\delta_d}. \qquad (31)$$

To achieve linear convergence of the decision variables in COKE, choosing appropriate censoring functions is crucial. Moreover, the penalty parameter $\rho$ also needs to satisfy certain conditions; see Theorem 5 for details (Liu et al., 2019).
Theorem 5 [Linear convergence of decision variables in COKE] For the optimization problem (16) with strongly convex local cost functions whose gradients are Lipschitz continuous, initialize the dual variables as $\gamma_i^0 = 0, \forall i$, set the censoring thresholds to $h(k) = v\mu^k$ with $v > 0$ and $\mu \in (0, 1)$, and choose the penalty parameter $\rho$ such that
$$0 < \rho < \min\left\{\eta_1,\ \frac{(\nu-1)\,\sigma_{\min}^2(S_-)}{\nu\eta_3\,\sigma_{\max}^2(S_+)},\ \frac{m_{\hat R} - \eta_3\nu M_{\hat R}^2/\sigma_{\min}^2(S_-)}{\big(\frac{\eta_1}{4} + \eta_2\big)\,\sigma_{\max}^2(S_+)}\right\},$$
where $\eta_1 > 0$, $\eta_2 > 0$, $\eta_3 > 0$, and $\nu > 1$ are arbitrary constants, $m_{\hat R}$ and $M_{\hat R}$ are the minimum strong convexity constant of the local cost functions and the maximum Lipschitz constant of the local gradients, respectively, and $\sigma_{\max}(S_+)$ and $\sigma_{\min}(S_-)$ are the maximum singular value of the unsigned incidence matrix $S_+$ and the minimum non-zero singular value of the signed incidence matrix $S_-$ of the network, respectively. Then, $\{\Theta^k\}$ is R-linearly convergent to the optimal $\Theta^*$ as $k$ goes to infinity.

Remark 3. For the kernel ridge regression problem (25), the minimum strong convexity constant of the local cost functions and the maximum Lipschitz constant of the local gradients are $m_{\hat R} := \min_i \sigma_{\min}^2\big(\frac{1}{T_i}\Phi_L^i(\Phi_L^i)^\top + \frac{2\lambda}{N}I\big)$ and $M_{\hat R} := \max_i \sigma_{\max}^2\big(\frac{1}{T_i}\Phi_L^i(\Phi_L^i)^\top + \frac{2\lambda}{N}I\big)$, respectively.

With the convergence of the decision variables in the RF space given in Theorems 4 and 5, the second step is to prove the linear convergence of the learned functional $\hat f_{\theta_i^k}(x)$ to the optimal $\hat f_{\theta^*}(x)$, which is straightforward for both algorithms. Denote $\hat f_{\Theta^k}(x) = [\hat f_{\theta_1^k}(x), \dots, \hat f_{\theta_N^k}(x)]^\top = \Theta^k \phi_L(x)$ and $\hat f_{\Theta^*}(x) = [\hat f_{\theta^*}(x), \dots, \hat f_{\theta^*}(x)]^\top = \Theta^* \phi_L(x)$; then we have
$$\|\hat f_{\Theta^k}(x) - \hat f_{\Theta^*}(x)\|_2 = \|\Theta^k \phi_L(x) - \Theta^* \phi_L(x)\|_2 \le \|\Theta^k - \Theta^*\|_2\,\|\phi_L(x)\|_2 \le \|\Theta^k - \Theta^*\|_F,$$
where the second inequality comes from the fact that $\|\phi_L(x)\|_2 \le 1$ with the adopted RF mapping. For DKLA, we have
$$\|\hat f_{\Theta^k}(x) - \hat f_{\Theta^*}(x)\|_2 \le \|\Theta^k - \Theta^*\|_F \le \sqrt{\frac{1}{m_{\hat R}}\Big(\rho\,\|\tilde\Theta^{k-1} - \tilde\Theta^*\|_F^2 + \frac{1}{\rho}\,\|\beta^{k-1} - \beta^*\|_F^2\Big)}.$$
Therefore, the Q-linear convergence of $\{(\tilde\Theta^k, \beta^k)\}$ to the optimal $(\tilde\Theta^*, \beta^*)$ translates to the R-linear convergence of $\{\hat f_{\Theta^k}(x)\}$.
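The bound $\|\phi_L(x)\|_2 \le 1$ invoked above can be checked numerically. The sketch below assumes the paired sin/cos RF construction (one of the two forms the paper labels (12)/(13)), for which the norm equals 1 exactly; the function name and shapes are illustrative.

```python
import numpy as np

def phi_sincos(x, omega):
    """RF map in the sin/cos form: phi_L(x) = (1/sqrt(L)) [cos(w_l^T x),
    sin(w_l^T x)]_{l=1}^L.  Since cos^2 + sin^2 = 1 for each frequency,
    ||phi_L(x)||_2 = 1 for every x, which yields the inequality
    ||phi_L(x)||_2 <= 1 used in the functional-convergence step."""
    z = omega.T @ x                       # the L inner products w_l^T x
    L = z.shape[0]
    return np.concatenate([np.cos(z), np.sin(z)]) / np.sqrt(L)
```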
Similarly, the R-linear convergence of $\{\Theta^k\}$ to the optimal $\Theta^*$ in COKE can be translated from the Q-linear convergence of $\{(\tilde\Theta^k, \beta^k)\}$ to the optimal $(\tilde\Theta^*, \beta^*)$; see Liu et al. (2019) for the detailed proof. It is then straightforward to see that the individually learned functionals converge to the optimal one as $k$ goes to infinity; that is, for $i \in \mathcal{N}$,
$$\lim_{k\to\infty} |\hat f_{\theta_i^k}(x) - \hat f_{\theta^*}(x)| = \lim_{k\to\infty} |(\theta_i^k)^\top \phi_L(x) - (\theta^*)^\top \phi_L(x)| \le \lim_{k\to\infty} \|\theta_i^k - \theta^*\|_2\,\|\phi_L(x)\|_2 \le \lim_{k\to\infty} \|\theta_i^k - \theta^*\|_2 = 0.$$

Appendix B. Proof of Theorem 3

Proof. The empirical risk (6) to be minimized for the kernel regression problem in the RKHS is
$$\min_{\alpha \in \mathbb{R}^T}\ \hat R(\alpha) = \sum_{i=1}^N \hat R_i(\alpha) = \sum_{i=1}^N \frac{1}{T_i}\,\|y_i - K_i^\top \alpha\|_2^2 + \lambda_i\,\alpha^\top K \alpha, \qquad (36)$$
where $y_i = [y_{i,1}, \dots, y_{i,T_i}]^\top \in \mathbb{R}^{T_i \times 1}$, and the matrices $K_i \in \mathbb{R}^{T \times T_i}$ and $K \in \mathbb{R}^{T \times T}$ store the similarity between the total data and the data from agent $i$, and the similarity of all data, respectively, under the assumption that all data are available to all agents. The optimal solution is given in closed form by
$$\alpha^* = (\bar K^\top \bar K + \lambda K)^{-1} \bar K^\top \bar y, \qquad (37)$$
where $\bar K = [\bar K_1, \dots, \bar K_N]^\top \in \mathbb{R}^{T \times T}$ with $\bar K_i = \frac{1}{\sqrt{T_i}} K_i, \forall i \in \mathcal{N}$, $\bar y = [\bar y_1; \dots; \bar y_N] \in \mathbb{R}^{T \times 1}$ with $\bar y_i = \frac{1}{\sqrt{T_i}} y_i, \forall i \in \mathcal{N}$, and $\lambda = \sum_{i=1}^N \lambda_i$. Denote the predicted values on the training examples using $\alpha^*$ as $f_{\alpha^*}^i \in \mathbb{R}^{T_i}$ for node $i$ and the overall predictions as $f_{\alpha^*} = [f_{\alpha^*}^1; \dots; f_{\alpha^*}^N] \in \mathbb{R}^T$. In the corresponding RF space, we denote the predicted values obtained for node $i$ by $\theta^*$ in (26) as $f_{\theta^*}^i \in \mathbb{R}^{T_i}$ and the overall prediction as $f_{\theta^*} = [f_{\theta^*}^1; \dots; f_{\theta^*}^N] \in \mathbb{R}^T$. To prove Theorem 3, we start by customizing several lemmas and theorems from the literature, which facilitate proving our main results.

Definition 1 (Bartlett and Mendelson, 2002, Definition 2) Let $\{x_q\}_{q=1}^Q$ be i.i.d. samples drawn from the probability distribution $p_X$. Let $\mathcal{F}$ be a class of functions that map $\mathcal{X}$ to $\mathbb{R}$. Define the random variable
$$\hat R_Q(\mathcal{F}) := \mathbb{E}_\epsilon\Big[\sup_{f \in \mathcal{F}}\Big|\frac{2}{Q}\sum_{q=1}^Q \epsilon_q f(x_q)\Big|\ \Big|\ x_1, \dots, x_Q\Big], \qquad (38)$$
where $\{\epsilon_q\}_{q=1}^Q$ are i.i.d. $\{\pm 1\}$-valued random variables with $\mathbb{P}(\epsilon_q = 1) = \mathbb{P}(\epsilon_q = -1) = \frac{1}{2}$.
Then, the Rademacher complexity of $\mathcal{F}$ is defined as
$$\mathcal{R}_Q(\mathcal{F}) := \mathbb{E}\big[\hat{\mathcal{R}}_Q(\mathcal{F})\big]. \qquad (39)$$
Rademacher complexity is adopted in machine learning and the theory of computation to measure the richness of a class of real-valued functions with respect to a probability distribution. Here we adopt it to measure the richness of functions defined in the RKHS induced by the positive definite kernel $\kappa$ with respect to the sample distribution $p$.

Lemma 2 (Bartlett and Mendelson, 2002, Lemma 22) Let $\mathcal{H}$ be an RKHS associated with a positive definite kernel $\kappa$ that maps $\mathcal{X}$ to $\mathbb{R}$. Then, we have $\hat{\mathcal{R}}_Q(\mathcal{H}) \le \frac{2}{Q}\sqrt{\mathrm{Tr}(K)}$, where $K$ is the kernel matrix for kernel $\kappa$ over the i.i.d. sample set $\{x_q\}_{q=1}^Q$. Correspondingly, the Rademacher complexity satisfies $\mathcal{R}_Q(\mathcal{H}) \le \frac{2}{Q}\,\mathbb{E}\big[\sqrt{\mathrm{Tr}(K)}\big]$.

The next theorems state that the generalization performance of a particular estimator in $\mathcal{H}$ depends not only on the number of data points but also on the complexity of $\mathcal{H}$.

Theorem 6 (Bartlett and Mendelson, 2002, Theorem 8) Let $\{x_q, y_q\}_{q=1}^Q$ be i.i.d. samples drawn from the distribution $p$ defined on $\mathcal{X} \times \mathcal{Y}$. Assume the loss function $\ell: \mathcal{Y} \times \mathbb{R} \to [0, 1]$ is Lipschitz continuous with a Lipschitz constant $M_\ell$. Define the expected risk for all $f \in \mathcal{H}$ as $\mathcal{E}(f) = \mathbb{E}_p[\ell(f(x), y)]$ and its corresponding empirical risk as $\hat{\mathcal{E}}(f) = \frac{1}{Q}\sum_{q=1}^Q \ell(y_q, f(x_q))$. Then, for $\delta_p \in (0, 1)$, with probability at least $1 - \delta_p$, every $f \in \mathcal{H}$ satisfies
$$\mathcal{E}(f) \le \hat{\mathcal{E}}(f) + \mathcal{R}_Q(\ell \circ \mathcal{H}) + \sqrt{\frac{8\log(2/\delta_p)}{Q}},$$
where $\ell \circ \mathcal{H} = \{(x, y) \mapsto \ell(y, f(x)) - \ell(y, 0)\,|\,f \in \mathcal{H}\}$.

Theorem 7 (Bartlett and Mendelson, 2002, Theorem 12) If $\ell: \mathcal{Y} \times \mathbb{R} \to [0, 1]$ is Lipschitz with constant $M_\ell$ and satisfies $\ell(0) = 0$, then $\mathcal{R}_Q(\ell \circ \mathcal{H}) \le 2M_\ell \mathcal{R}_Q(\mathcal{H})$.

Lemma 3 (Li et al., 2018, Modified Proposition 1) For the RKHS induced by the kernel $\kappa$ with expression (10), define $\hat{\mathcal{H}}^k := \{\hat{f}^k : \hat{f}^k = (\theta^k)^\top \phi_L(x) = \sum_{l=1}^L \theta_l^k \phi(x, \omega_l)\}$. Then, for $\hat{f}^k \in \hat{\mathcal{H}}^k$, $\|\hat{f}^k\|_{\hat{\mathcal{H}}^k}^2 \le \|\theta^k\|_2^2$, where $\hat{\mathcal{H}}^k$ is the RKHS of the functions $\hat{f}^k$ at the $k$-th step. The kernel that induces $\hat{\mathcal{H}}^k$ is the approximated kernel $\hat{\kappa}_L$ defined in (11).
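For the unit ball of an RKHS, the supremum in Definition 1 has the closed form $\sup_{\|f\|_{\mathcal{H}} \le 1} \sum_q \epsilon_q f(x_q) = \sqrt{\epsilon^\top K \epsilon}$, which makes the Lemma 2 bound easy to check by Monte Carlo. The sketch below is purely illustrative (synthetic points, an assumed Gaussian kernel bandwidth, and an assumed number of Monte Carlo draws):

```python
import numpy as np

rng = np.random.default_rng(2)

# Empirical Rademacher complexity (Definition 1) of the unit RKHS ball,
# compared against the Lemma 2 bound (2/Q) * sqrt(Tr(K)).
Q = 50
X = rng.standard_normal((Q, 3))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.05 * d2)  # Gaussian kernel matrix; Tr(K) = Q here

# For f in the unit ball, sup_f sum_q eps_q f(x_q) = sqrt(eps^T K eps),
# so hat{R}_Q(H) = (2/Q) * E_eps[ sqrt(eps^T K eps) ].
n_mc = 2000
eps = rng.choice([-1.0, 1.0], size=(n_mc, Q))
hat_R = (2.0 / Q) * np.mean(np.sqrt(np.einsum('mq,qr,mr->m', eps, K, eps)))

bound = (2.0 / Q) * np.sqrt(np.trace(K))
print(hat_R, bound)  # hat_R falls below the bound, by Jensen's inequality
```

The gap between the estimate and the bound is exactly the Jensen gap, since $\mathbb{E}\sqrt{\epsilon^\top K \epsilon} \le \sqrt{\mathbb{E}[\epsilon^\top K \epsilon]} = \sqrt{\mathrm{Tr}(K)}$.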
Lemma 4 (Li et al., 2018, Lemma 6) For the decentralized kernel regression problem defined in Section 2, let $f_{\alpha^\star}$ and $f_{\theta^\star}$ be the predictions obtained by (37) and (26), respectively. Then we have
$$\langle y - f_{\alpha^\star},\ f_{\theta^\star} - f_{\alpha^\star}\rangle = 0. \qquad (41)$$

Theorem 8 (Li et al., 2018, Modified Theorem 5) For the decentralized kernel regression problem defined in Section 2, let $\lambda_K$ be the largest eigenvalue of the kernel matrix $K$, and choose the regularization parameter $\lambda < \lambda_K/T$ so as to control overfitting. Then, for all $\delta_p \in (0, 1)$ and $\|f\|_{\mathcal{H}} \le 1$, if the number of random features $L$ satisfies
$$L \ge \Big(\frac{1}{\epsilon^2} + \frac{2}{3\epsilon}\Big)\log\frac{16\,d_\lambda^K}{\delta_p},$$
then with probability at least $1 - \delta_p$, the following holds:
$$\sup_{\|f\|_{\mathcal{H}} \le 1}\ \inf_\theta\ \frac{1}{T}\|f_x - f_\theta\|_2^2 \le 2\lambda, \qquad (42)$$
where $f_x \in \mathbb{R}^T$ collects the predictions evaluated by $f \in \mathcal{H}$ on all samples and $\epsilon \in (0, 1)$.

With the above lemmas and theorems, we are ready to prove Theorem 3, which relies on the following decomposition:
$$\mathcal{E}(\hat{f}^k) - \mathcal{E}(f_{\mathcal{H}}) = \underbrace{\mathcal{E}(\hat{f}^k) - \hat{\mathcal{E}}(\hat{f}^k)}_{(1)\ \text{estimation error}} + \underbrace{\hat{\mathcal{E}}(\hat{f}^k) - \hat{\mathcal{E}}(\hat{f}_{\theta^\star})}_{(2)\ \text{convergence error}} + \underbrace{\hat{\mathcal{E}}(\hat{f}_{\theta^\star}) - \hat{\mathcal{E}}(\hat{f}_{\alpha^\star})}_{(3)\ \text{approximation error of RF mapping}} + \underbrace{\hat{\mathcal{E}}(\hat{f}_{\alpha^\star}) - \mathcal{E}(\hat{f}_{\alpha^\star})}_{(4)\ \text{estimation error}} + \underbrace{\mathcal{E}(\hat{f}_{\alpha^\star}) - \mathcal{E}(f_{\mathcal{H}})}_{(5)\ \text{approximation error of kernel representation}},$$
where $\mathcal{E}(\hat{f}^k)$, $\hat{\mathcal{E}}(\hat{f}^k)$, $\hat{\mathcal{E}}(\hat{f}_{\theta^\star})$, $\mathcal{E}(\hat{f}_{\alpha^\star})$, and $\hat{\mathcal{E}}(\hat{f}_{\alpha^\star})$ are defined as follows for the kernel regression problem:
$$\mathcal{E}(\hat{f}^k) := \sum_{i=1}^N \mathcal{E}_i(\hat{f}_{\theta_i^k}) = \sum_{i=1}^N \mathbb{E}_p[(y - (\theta_i^k)^\top \phi_L(x))^2] := \mathbb{E}_p[\|y_N - \Phi_N \Theta^k\|_2^2],$$
$$\hat{\mathcal{E}}(\hat{f}^k) := \sum_{i=1}^N \hat{\mathcal{E}}_i(\hat{f}_{\theta_i^k}) = \sum_{i=1}^N \frac{1}{T_i}\|y_i - (\Phi_L^i)^\top \theta_i^k\|_2^2 = \sum_{i=1}^N \|\bar{y}_i - (\bar{\Phi}_L^i)^\top \theta_i^k\|_2^2 = \|\bar{y} - \bar{\Phi}_B \Theta^k\|_2^2,$$
$$\hat{\mathcal{E}}(\hat{f}_{\theta^\star}) := \sum_{i=1}^N \hat{\mathcal{E}}_i(\hat{f}_{\theta^\star}) = \sum_{i=1}^N \frac{1}{T_i}\|y_i - (\Phi_L^i)^\top \theta^\star\|_2^2 = \sum_{i=1}^N \|\bar{y}_i - (\bar{\Phi}_L^i)^\top \theta^\star\|_2^2 = \|\bar{y} - \bar{\Phi}_B \Theta^\star\|_2^2,$$
$$\mathcal{E}(\hat{f}_{\alpha^\star}) := \sum_{i=1}^N \mathcal{E}_i(\hat{f}_{\alpha^\star}) = \sum_{i=1}^N \mathbb{E}_p[(y - (\alpha^\star)^\top \kappa(x))^2] := N\,\mathbb{E}_p[(y - (\alpha^\star)^\top \kappa(x))^2],$$
$$\hat{\mathcal{E}}(\hat{f}_{\alpha^\star}) := \sum_{i=1}^N \frac{1}{T_i}\|y_i - K_i^\top \alpha^\star\|_2^2 = \sum_{i=1}^N \|\bar{y}_i - \bar{K}_i^\top \alpha^\star\|_2^2,$$
where $y_N = y\mathbf{1}_N$, $\Phi_N = \mathrm{diag}\big(\phi_L(x)^\top, \ldots, \phi_L(x)^\top\big) \in \mathbb{R}^{N \times NL}$, $\bar{\Phi}_B = \mathrm{diag}\big((\bar{\Phi}_L^1)^\top, \ldots, (\bar{\Phi}_L^N)^\top\big) \in \mathbb{R}^{T \times NL}$, $\Theta^k = [\theta_1^k; \ldots; \theta_N^k] \in \mathbb{R}^{NL}$, and $\Theta^\star = [\theta^\star; \ldots; \theta^\star] \in \mathbb{R}^{NL}$. Then, we upper bound the excess risk $\mathcal{E}(\hat{f}^k)$ learned by COKE by upper bounding the five decomposed terms.
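The RF normalization $\|\phi_L(x)\|_2 \le 1$ that these bounds rely on can be realized concretely. The sketch below is a minimal random Fourier feature construction for the Gaussian kernel, assuming an illustrative dimension, feature count, and seed; the paired cos/sin form makes $\|\phi_L(x)\|_2 = 1$ exactly, which also gives $\mathrm{Tr}(\hat{K}) \le T$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Random Fourier features for the Gaussian kernel kappa(x,x') = exp(-||x-x'||^2/2).
d, L = 3, 2000
W = rng.standard_normal((L, d))  # omega_l ~ N(0, I), per Bochner's theorem

def phi_L(x):
    """RF map with ||phi_L(x)||_2 = 1 exactly, since cos^2 + sin^2 = 1."""
    z = W @ x
    return np.concatenate([np.cos(z), np.sin(z)]) / np.sqrt(L)

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
kappa_hat = phi_L(x1) @ phi_L(x2)                  # approximated kernel, cf. (11)
kappa = np.exp(-np.linalg.norm(x1 - x2) ** 2 / 2)  # exact Gaussian kernel
print(kappa_hat, kappa)  # close for large L
```

Theorem 8 quantifies how large $L$ must be for this pointwise closeness to translate into a uniform $2\lambda$ bound on the prediction error over the unit RKHS ball.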
For term (1), for $\delta_{p1} \in (0, 1)$, with probability at least $1 - \delta_{p1}$,
$$\mathcal{E}(\hat{f}^k) - \hat{\mathcal{E}}(\hat{f}^k) \le \mathcal{R}_T(\ell_1 \circ \hat{\mathcal{H}}^k) + \sqrt{\tfrac{8\log(2/\delta_{p1})}{T}} \le 2M_{\ell_1}\mathcal{R}_T(\hat{\mathcal{H}}^k) + \sqrt{\tfrac{8\log(2/\delta_{p1})}{T}} \le \tfrac{4M_{\ell_1}}{T}\,\mathbb{E}\big[\sqrt{\mathrm{Tr}(\hat{K})}\big] + \sqrt{\tfrac{8\log(2/\delta_{p1})}{T}} \le \tfrac{4M_{\ell_1}}{\sqrt{T}} + \sqrt{\tfrac{8\log(2/\delta_{p1})}{T}} = \tfrac{C_1}{\sqrt{T}},$$
where $C_1 := 4M_{\ell_1} + \sqrt{8\log(2/\delta_{p1})}$, and $M_{\ell_1}$ is the Lipschitz constant of the loss function $\ell_1(\hat{f}_{\theta_i^k}, y) = ((\theta_i^k)^\top \phi_L(x) - y)^2$. The first inequality comes from Theorem 6, the second from Theorem 7, and the third from Lemma 2. For the last inequality, each element of the Gram matrix $\hat{K} \in \mathbb{R}^{T \times T}$ is given by (11), so $\mathrm{Tr}(\hat{K}) \le T\|\phi_L(x)\|_2^2 \le T$ with the adopted RF mapping, for which $\|\phi_L(x)\|_2^2 \le 1$.

Similarly, for term (4), with probability at least $1 - \delta_{p2}$ for $\delta_{p2} \in (0, 1)$, the following holds:
$$\hat{\mathcal{E}}(\hat{f}_{\alpha^\star}) - \mathcal{E}(\hat{f}_{\alpha^\star}) \le \mathcal{R}_T(\ell_2 \circ \mathcal{H}) + \sqrt{\tfrac{8\log(2/\delta_{p2})}{T}} \le 2M_{\ell_2}\mathcal{R}_T(\mathcal{H}) + \sqrt{\tfrac{8\log(2/\delta_{p2})}{T}} \le \tfrac{4M_{\ell_2}}{T}\,\mathbb{E}\big[\sqrt{\mathrm{Tr}(K)}\big] + \sqrt{\tfrac{8\log(2/\delta_{p2})}{T}} \le \tfrac{C_2}{\sqrt{T}},$$
where $C_2 := 4M_{\ell_2} + \sqrt{8\log(2/\delta_{p2})}$, and $M_{\ell_2}$ is the Lipschitz constant of the loss function $\ell_2(\hat{f}_{\alpha^\star}, y) = ((\alpha^\star)^\top \kappa(x) - y)^2$.

For term (2), we have
$$\hat{\mathcal{E}}(\hat{f}^k) - \hat{\mathcal{E}}(\hat{f}_{\theta^\star}) = \|\bar{y} - \bar{\Phi}_B\Theta^k\|_2^2 - \|\bar{y} - \bar{\Phi}_B\Theta^\star\|_2^2 \le \Big(\|\bar{\Phi}_B^\top(\bar{\Phi}_B\Theta^\star - \bar{y})\|_2 + \tfrac{M_{\ell_3}}{2}\Big)\|\Theta^k - \Theta^\star\|_2 = C_3\|\Theta^k - \Theta^\star\|_2,$$
where $C_3 := \|\bar{\Phi}_B^\top(\bar{\Phi}_B\Theta^\star - \bar{y})\|_2 + \tfrac{M_{\ell_3}}{2}$, and $M_{\ell_3}$ is the Lipschitz constant of the loss function $\ell_3(\bar{y}, \Theta) = \|\bar{y} - \bar{\Phi}_B\Theta\|_2^2$. From Theorems 4 and 5, we conclude that $\{\Theta^k\}$ converges linearly to $\Theta^\star$.

Term (3) is the approximation error caused by the RF mapping, which is bounded by
$$\hat{\mathcal{E}}(\hat{f}_{\theta^\star}) - \hat{\mathcal{E}}(\hat{f}_{\alpha^\star}) = \sum_{i=1}^N \|\bar{y}_i - (\bar{\Phi}_L^i)^\top \theta^\star\|_2^2 - \sum_{i=1}^N \|\bar{y}_i - \bar{K}_i^\top \alpha^\star\|_2^2 = \|\bar{y} - f_{\theta^\star}\|_2^2 - \|\bar{y} - f_{\alpha^\star}\|_2^2$$
$$= \|(\bar{y} - f_{\alpha^\star}) + (f_{\alpha^\star} - f_{\theta^\star})\|_2^2 - \|\bar{y} - f_{\alpha^\star}\|_2^2 = \|f_{\alpha^\star} - f_{\theta^\star}\|_2^2 + 2\langle \bar{y} - f_{\alpha^\star},\ f_{\alpha^\star} - f_{\theta^\star}\rangle$$
$$= \inf_{f_\theta}\|f_{\alpha^\star} - f_\theta\|_2^2 + 2\inf_{f_\theta}\langle \bar{y} - f_{\alpha^\star},\ f_{\alpha^\star} - f_\theta\rangle = \inf_{f_\theta}\|f_{\alpha^\star} - f_\theta\|_2^2 \le \sup_{f_x}\inf_{f_\theta}\|f_x - f_\theta\|_2^2 \le 2\lambda,$$
where the cross term vanishes by Lemma 4, and the last inequality comes from Theorem 8 with $f_x := [\tfrac{1}{\sqrt{T_1}}f_1; \ldots; \tfrac{1}{\sqrt{T_N}}f_N] \in \mathbb{R}^T$ and $f_i = [f(x_{i,1}), \ldots, f(x_{i,T_i})]^\top \in \mathbb{R}^{T_i}$ for $f \in \mathcal{H}$. To bound term (5), the approximation error of the models in the RKHS $\mathcal{H}$, we refer to the following lemma.
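The five-term decomposition is a telescoping sum: each intermediate risk is added and subtracted once, so the terms collapse back to the total excess risk. A tiny illustrative check with placeholder risk values (not from any real model) makes the bookkeeping explicit:

```python
# Telescoping structure of the excess-risk decomposition; all values are
# arbitrary placeholders chosen only to exercise the identity.
E_fk, Ehat_fk = 0.90, 0.70            # expected / empirical risk of the iterate
Ehat_ftheta, Ehat_falpha = 0.60, 0.55 # empirical risks of the RF and RKHS optima
E_falpha, E_fH = 0.65, 0.50           # expected risks of alpha* and best-in-class

terms = [
    E_fk - Ehat_fk,                # (1) estimation error
    Ehat_fk - Ehat_ftheta,         # (2) convergence error
    Ehat_ftheta - Ehat_falpha,     # (3) RF approximation error
    Ehat_falpha - E_falpha,        # (4) estimation error
    E_falpha - E_fH,               # (5) kernel approximation error
]
assert abs(sum(terms) - (E_fk - E_fH)) < 1e-12  # terms telescope exactly
```

Bounding each term separately, as done above and below, therefore bounds the total excess risk by the sum of the five individual bounds.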
Lemma 5 (Rudi and Rosasco, 2017, Modified Lemma 5) For the kernel $\kappa$ that can be represented as (10) with a bounded RF mapping, that is, $|\phi(x, \omega)| \le 1$ for any $x \in \mathcal{X}$, under Assumption 4 the following holds for any regularization parameter $\lambda > 0$:
$$\mathcal{E}(\hat{f}_{\alpha^\star}) - \mathcal{E}(f_{\mathcal{H}}) = \|\hat{f}_{\alpha^\star} - Pf_p\|_{p_X}^2 \le (R\lambda^r)^2.$$
In Lemma 5, $f_p$ is the ideal minimizer given prior knowledge of the marginal distribution $p_X$ of $x$, and $P$ is a projection operator on $f_p$ so that $Pf_p$ is the optimal minimizer in the RKHS. The parameter $r \in [1/2, 1)$ is equivalent to assuming that $f_{\mathcal{H}}$ exists, and $R$ can take the value of either $1$ or $\|\hat{f}_{\alpha^\star}\|_{p_X}$. Setting $r = 1/2$ and $R = 1$, we have
$$\mathcal{E}(\hat{f}_{\alpha^\star}) - \mathcal{E}(f_{\mathcal{H}}) \le \lambda. \qquad (48)$$
Combining (44)-(48) gives
$$\lim_{k\to\infty} \mathcal{E}(\hat{f}^k) - \mathcal{E}(f_{\mathcal{H}}) \le \lim_{k\to\infty}\left(\frac{C_1 + C_2}{\sqrt{T}} + C_3\|\Theta^k - \Theta^\star\|_2 + 3\lambda\right) = 3\lambda + O\Big(\frac{1}{\sqrt{T}}\Big),$$
which completes the proof.

References

Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pages 5973–5983, 2018.
Reza Arablouei, Stefan Werner, Kutluyıl Doğançay, and Yih-Fang Huang. Analysis of a reduced-communication diffusion LMS algorithm. Signal Processing, 117:355–361, 2015.
Arthur Asuncion and David Newman. UCI machine learning repository, 2007.
Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 253–262. JMLR.org, 2017.
Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(1):714–751, 2017.
Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
Eduard Gabriel Băzăvan, Fuxin Li, and Cristian Sminchisescu. Fourier kernel learning. In European Conference on Computer Vision, pages 459–473. Springer, 2012.
Salomon Bochner. Harmonic Analysis and the Theory of Probability. Courier Corporation, 2005.
Pantelis Bouboulis, Symeon Chouvardas, and Sergios Theodoridis. Online distributed learning over networks in RKH spaces using random Fourier features. IEEE Transactions on Signal Processing, 66(7):1920–1932, 2018.
Serhat Bucak, Rong Jin, and Anil K. Jain. Multi-label multiple kernel learning by stochastic approximation: Application to visual object recognition. In Advances in Neural Information Processing Systems, pages 325–333, 2010.
Luis M. Candanedo, Véronique Feldheim, and Dominique Deramaix. Data driven prediction models of energy use of appliances in a low-energy house. Energy and Buildings, 140:81–97, 2017.
Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
Tianyi Chen, Georgios Giannakis, Tao Sun, and Wotao Yin. LAG: Lazily aggregated gradient for communication-efficient distributed learning. In Advances in Neural Information Processing Systems, pages 5050–5060, 2018.
Symeon Chouvardas and Moez Draief. A diffusion kernel LMS algorithm for nonlinear adaptive networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4164–4168. IEEE, 2016.
Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pages 3041–3049, 2014.
Saverio De Vito, Ettore Massera, Marco Piga, Luca Martinotto, and Girolamo Di Francia. On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sensors and Actuators B: Chemical, 129(2):750–757, 2008.
Giacomo Vincenzo Demarie and Donato Sabia. A machine learning approach for the automatic long-term structural health monitoring. Structural Health Monitoring, 18(3):819–837, 2019.
Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.
Yaakov Engel, Shie Mannor, and Ron Meir. The kernel recursive least-squares algorithm. IEEE Transactions on Signal Processing, 52(8):2275–2285, 2004.
Francisco Facchinei, Gesualdo Scutari, and Simone Sagratella. Parallel selective algorithms for nonconvex big data optimization. IEEE Transactions on Signal Processing, 63(7):1874–1889, 2015.
Wei Gao, Jie Chen, Cédric Richard, and Jianguo Huang. Diffusion adaptation over networks with kernel least-mean-square. In 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 217–220. IEEE, 2015.
Ryan Gomes and Andreas Krause. Budgeted nonparametric learning from data streams. In International Conference on Machine Learning (ICML), 2010.
Bin Gu, Miao Xin, Zhouyuan Huo, and Heng Huang. Asynchronous doubly stochastic sparse kernel learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Ibrahim El Khalil Harrane, Rémi Flamary, and Cédric Richard. On reducing the communication cost of the diffusion LMS algorithm. IEEE Transactions on Signal and Information Processing over Networks, 5(1):100–112, 2018.
Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220, 2008.
Paul Honeine. Analyzing sparse dictionaries for online learning with kernels.
IEEE Transactions on Signal Processing, 63(23):6343–6353, 2015.
Muhammad U. Ilyas, M. Zubair Shafiq, Alex X. Liu, and Hayder Radha. A distributed algorithm for identifying information hubs in social networks. IEEE Journal on Selected Areas in Communications, 31(9):629–640, 2013.
Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 3068–3076, 2014.
Xinrong Ji, Cuiqin Hou, Yibin Hou, Fang Gao, and Shulong Wang. A distributed learning method for ℓ1-regularized kernel machine over wireless sensor networks. Sensors, 16(7):1021, 2016.
François Kawala, Ahlame Douzal-Chouakria, Eric Gaussier, and Eustache Dimert. Prédictions d'activité dans les réseaux sociaux en ligne. 2013.
Alec Koppel, Garrett Warnell, Ethan Stump, and Alejandro Ribeiro. Parsimonious online learning with kernels via sparse projections in function space. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4671–4675. IEEE, 2017.
Alec Koppel, Santiago Paternain, Cédric Richard, and Alejandro Ribeiro. Decentralized online learning with kernels. IEEE Transactions on Signal Processing, 66(12):3240–3255, 2018.
Trung Le, Vu Nguyen, Tu Dinh Nguyen, and Dinh Phung. Nonparametric budgeted stochastic gradient descent. In Artificial Intelligence and Statistics, pages 654–572, 2016.
Boyue Li, Shicong Cen, Yuxin Chen, and Yuejie Chi. Communication-efficient distributed optimization in networks with gradient tracking. arXiv preprint arXiv:1909.05844, 2019a.
Mu Li, David G. Andersen, Alexander J. Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19–27, 2014.
Weiyu Li, Yaohua Liu, Zhi Tian, and Qing Ling.
COLA: Communication-censored linearized ADMM for decentralized consensus optimization. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5237–5241. IEEE, 2019b.
Zhu Li, Jean-Francois Ton, Dino Oglic, and Dino Sejdinovic. Towards a unified analysis of random Fourier features. arXiv preprint arXiv:1806.09178, 2018.
Yaohua Liu, Wei Xu, Gang Wu, Zhi Tian, and Qing Ling. Communication-censored ADMM for decentralized consensus optimization. IEEE Transactions on Signal Processing, 67(10):2565–2579, 2019.
Jing Lu, Steven C. H. Hoi, Jialei Wang, Peilin Zhao, and Zhi-Yong Liu. Large scale online kernel learning. Journal of Machine Learning Research, 17(1):1613–1655, 2016.
H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
Rangeet Mitra and Vimal Bhatia. The diffusion-KLMS algorithm. In 2014 International Conference on Information Technology, pages 256–259. IEEE, 2014.
Joao F. C. Mota, Joao M. F. Xavier, Pedro M. Q. Aguiar, and Markus Püschel. D-ADMM: A communication-efficient distributed algorithm for separable optimization. IEEE Transactions on Signal Processing, 61(10):2718–2723, 2013.
Angelia Nedić, Alex Olshevsky, and César A. Uribe. A tutorial on distributed (non-Bayesian) learning: Problem, algorithms and results. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 6795–6801. IEEE, 2016.
Tu Dinh Nguyen, Trung Le, Hung Bui, and Dinh Phung. Large-scale online kernel learning with random feature reparameterization. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), pages 2543–2549. AAAI Press, 2017.
Fernando Pérez-Cruz and Olivier Bousquet. Kernel methods and their potential use in signal processing. IEEE Signal Processing Magazine, 21(3):57–65, 2004.
Joel B. Predd, Sanjeev R. Kulkarni, and H. Vincent Poor. Distributed kernel regression: An algorithm for training collaboratively. In 2006 IEEE Information Theory Workshop (ITW), pages 332–336. IEEE, 2006.
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
Sashank J. Reddi, Jakub Konečný, Peter Richtárik, Barnabás Póczos, and Alex Smola. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.
Cédric Richard, José Carlos M. Bermudez, and Paul Honeine. Online prediction of time series data with kernels. IEEE Transactions on Signal Processing, 57(3):1058–1067, 2008.
Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages 3215–3225, 2017.
Ali H. Sayed. Adaptation, learning, and optimization over networks. Foundations and Trends in Machine Learning, 7:311–801, 2014.
Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416–426. Springer, 2001.
Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning (ICML), pages 1000–1008, 2014.
John Shawe-Taylor, Nello Cristianini, et al. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
Fatemeh Sheikholeslami, Dimitris Berberidis, and Georgios B. Giannakis. Large-scale kernel-based feature extraction via low-rank subspace tracking on a budget. IEEE Transactions on Signal Processing, 66(8):1967–1981, 2018.
Yanning Shen, Tianyi Chen, and Georgios B. Giannakis. Online multi-kernel learning with orthogonal random features.
In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6289–6293, 2018.
Wei Shi, Qing Ling, Kun Yuan, Gang Wu, and Wotao Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761, 2014.
Ban-Sok Shin, Henning Paul, and Armin Dekorsy. Distributed kernel least squares for nonlinear regression applied to sensor networks. In 2016 24th European Signal Processing Conference (EUSIPCO), pages 1588–1592. IEEE, 2016.
Ban-Sok Shin, Masahiro Yukawa, Renato Luis Garrido Cavalcante, and Armin Dekorsy. Distributed adaptive learning with multiple kernels in diffusion networks. IEEE Transactions on Signal Processing, 66(21):5505–5519, 2018.
Bharath Sriperumbudur and Zoltán Szabó. Optimal rates for random Fourier features. In Advances in Neural Information Processing Systems, pages 1144–1152, 2015.
Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pages 4447–4458, 2018.
Dougal J. Sutherland and Jeff Schneider. On the error of random Fourier features. arXiv preprint arXiv:1506.02785, 2015.
Håkan Terelius, Ufuk Topcu, and Richard M. Murray. Decentralized multi-agent optimization via dual decomposition. IFAC Proceedings Volumes, 44(1):11245–11251, 2011.
Zhuang Wang, Koby Crammer, and Slobodan Vucetic. Breaking the curse of kernelization: Budgeted stochastic gradient descent for large-scale SVM training. Journal of Machine Learning Research, 13:3103–3131, 2012.
Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pages 1299–1309, 2018.
Keith Worden and Graeme Manson. The application of machine learning to structural health monitoring.
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 365(1851):515–537, 2006.
Ping Xu, Zhi Tian, Zhe Zhang, and Yue Wang. COKE: Communication-censored kernel learning via random features. In 2019 IEEE Data Science Workshop (DSW), pages 32–36. IEEE, 2019.
Wotao Yin, Xianghui Mao, Kun Yuan, Yuantao Gu, and Ali H. Sayed. A communication-efficient random-walk algorithm for decentralized optimization. arXiv preprint arXiv:1804.06568, 2018.
Felix Xinnan X. Yu, Ananda Theertha Suresh, Krzysztof M. Choromanski, Daniel N. Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems, pages 1975–1983, 2016.
Yue Yu, Jiaxiang Wu, and Longbo Huang. Double quantization for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pages 4440–4451, 2019.
Lijun Zhang, Jinfeng Yi, Rong Jin, Ming Lin, and Xiaofei He. Online kernel learning with a near optimal sparsity bound. In International Conference on Machine Learning (ICML), pages 621–629, 2013.
Mingrui Zhang, Lin Chen, Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Quantized Frank-Wolfe: Communication-efficient distributed optimization. arXiv preprint arXiv:1902.06332, 2019.
Shengyu Zhu, Mingyi Hong, and Biao Chen. Quantized consensus ADMM for multi-agent distributed optimization. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4134–4138. IEEE, 2016.